Issue in writing data from 2 RDDs (one with unicode data and one with normal) into a csv file in PySpark?

I have two RDDs:

RDD-1: the data is in Unicode format

[[u'a',u'b',u'c'],[u'c',u'f',u'a'],[u'ab',u'cd',u'gh']...]

RDD-2:

[(10.1, 10.0), (23.0, 34.0), (45.0, 23.0),....]

Both RDDs have the same number of rows (but one has 2 columns/elements per record and the other has 3). What I want to do is take all the elements of each record from RDD-1 and the 2nd element of each record from RDD-2, and write them into a csv file on the local file system (not HDFS). So the csv output file for the above samples would be:

a,b,c,10.0
c,f,a,34.0
ab,cd,gh,23.0

How do I do that in PySpark?

UPDATE: This is my current code:

import csv
from itertools import izip  # Python 2; use zip on Python 3

# keep only the columns of interest from each row of rdd3
columns_num = [0, 1, 2, 4, 7]
rdd1 = rdd3.map(lambda row: [row[i] for i in columns_num])

# adjust the second value of each pair when the gap exceeds a third of the first
rdd2 = rd.map(lambda tup: (tup[0], tup[1] + (tup[0] / 3))
              if tup[0] - tup[1] >= tup[0] / 3 else (tup[0], tup[1]))

with open("output.csv", "w") as fw:
    writer = csv.writer(fw)
    for (r1, r2) in izip(rdd1.toLocalIterator(), rdd2.toLocalIterator()):
        writer.writerow(r1 + tuple(r2[1:2]))

I am getting a TypeError: can only concatenate list (not "tuple") to list error. If I instead do writer.writerow(tuple(r1) + r2[1:2]), the error becomes UnicodeEncodeError: 'ascii' codec can't encode character u'\x80' in position 16: ordinal not in range(128).
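For reference, both tracebacks point at the writerow line: the TypeError comes from concatenating a list (r1, built by the map above) with a tuple (r2[1:2]), and the UnicodeEncodeError comes from Python 2's csv module, which only writes byte strings and chokes on non-ASCII unicode. A minimal sketch of one way to address both inside the loop (assuming Python 2, where unicode is a built-in type):

# encode unicode fields as UTF-8 so Python 2's csv writer accepts them,
# then concatenate two lists instead of a list and a tuple
row = [f.encode("utf-8") if isinstance(f, unicode) else f for f in r1]
writer.writerow(row + list(r2[1:2]))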


If by local you mean the driver filesystem, then you can simply collect or convert toLocalIterator and write:

import csv
import sys

# on Python 2 use the lazy izip; on Python 3 the built-in zip is lazy
if sys.version_info.major == 2:
    from itertools import izip
else:
    izip = zip

rdd1 = sc.parallelize([(10.1, 10.0), (23.0, 34.0), (45.0, 23.0)])
rdd2 = sc.parallelize([("a", "b", "c"), ("c", "f", "a"), ("ab", "cd", "gh")])

with open("output.csv", "w") as fw:
    writer = csv.writer(fw)
    # iterate both RDDs on the driver without collecting them all at once
    for (r1, r2) in izip(rdd2.toLocalIterator(), rdd1.toLocalIterator()):
        writer.writerow(r1 + r2[1:2])  # tuple + tuple -> one csv row
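This keeps all the data flowing through the driver, which is fine for modest sizes. If you would rather write in a distributed fashion, a rough alternative (a sketch, not part of the approach above) is RDD.zip plus saveAsTextFile. Note that zip requires both RDDs to have the same number of partitions and the same number of elements per partition, and saveAsTextFile produces a directory of part files rather than a single csv ("output_csv" below is an illustrative path):

# pair each string triple with its float pair, then format a csv line
combined = rdd2.zip(rdd1)
lines = combined.map(lambda pair: u",".join(list(pair[0]) + [str(pair[1][1])]))
lines.saveAsTextFile("output_csv")  # writes part-* files under this directory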