Pickling Error with Spark-Submit “_pickle.PicklingError: args[0] from __newobj__ args has the wrong class”
When trying to run some code via spark-submit or Zeppelin, I get the following error: "_pickle.PicklingError: args[0] from __newobj__ args has the wrong class".

I've looked around at posts from people with the same problem, but they don't shed much light on the issue.

The traceback (included below) points to one of the UDFs I use:
```python
udf_stop_words = udf(stop_words, ArrayType(StringType()))

def stop_words(words):
    return list(word.lower() for word in words if word.lower() not in stopwords.words("english"))
```
Both the input and the output of the function are lists of strings. Here are three rows from the input:
```python
[Row(split_tokenized_activity_description=['A', 'delightful', '45', 'minute', 'Swedish', 'style', 'massage']),
 Row(split_tokenized_activity_description=['A', 'more', 'intense', '45', 'minute', 'version', 'of', 'a', 'Swedish', 'style', 'massage']),
 Row(split_tokenized_activity_description=['A', 'relaxing', '45', 'minute', 'Swedish', 'style', 'massage'])]
```
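For illustration, this is what the plain Python function produces when called on the first row above (a standalone check of my own, assuming the nltk stopwords corpus has been downloaded locally):

```python
from nltk.corpus import stopwords

def stop_words(words):
    return list(word.lower() for word in words if word.lower() not in stopwords.words("english"))

print(stop_words(['A', 'delightful', '45', 'minute', 'Swedish', 'style', 'massage']))
# ['delightful', '45', 'minute', 'swedish', 'style', 'massage']
```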
Here is a snippet of the code I'm using:
```python
def special_car(x):
    # Remove special characters and replace them with a space
    return [re.sub('[^A-Za-z0-9]+', ' ', x)]

# Create UDF from function
udf_special_car = udf(special_car, ArrayType(StringType()))

# Function to remove stop words
def stop_words(words):
    return list(word.lower() for word in words if word.lower() not in stopwords.words("english"))

udf_stop_words = udf(stop_words, ArrayType(StringType()))

# Load in data
df_tags = spark.sql("select * from database")

# Remove special characters
df1_tags = df_tags.withColumn('tokenized_name', udf_special_car(df_tags.name))
df2_tags = df1_tags.withColumn('tokenized_description', udf_special_car(df1_tags.description))

# Select only relevant columns
df3_tags = df2_tags.select(['tag_id', 'tokenized_name', 'tokenized_description'])

# Tokenize tag_name and tag_desc (separate on spaces, using pyspark.sql.functions.split)
df4_tags = df3_tags.withColumn('split_tokenized_name', split(df3_tags['tokenized_name'].getItem(0), ' '))
df5_tags = df4_tags.withColumn('split_tokenized_description', split(df4_tags['tokenized_description'].getItem(0), ' '))

# Select only relevant columns
df6_tags = df5_tags.select(['tag_id', 'split_tokenized_name', 'split_tokenized_description'])

# Remove stop words
df7_tags = df6_tags.withColumn('stop_words_tokenized_name', udf_stop_words(df6_tags.split_tokenized_name))
df8_tags = df7_tags.withColumn('stop_words_tokenized_description', udf_stop_words(df7_tags.split_tokenized_description))
```
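As a side note not in the original post: much of this pipeline can be expressed with Spark's built-in column functions plus pyspark.ml.feature.StopWordsRemover, which avoids serializing Python functions entirely. A rough sketch for the name column, assuming df_tags has a string column called name (and noting that StopWordsRemover's default stop-word list differs from nltk's):

```python
from pyspark.sql.functions import regexp_replace, split, lower
from pyspark.ml.feature import StopWordsRemover

# Strip special characters, lowercase, and tokenize on spaces without any Python UDF
clean = df_tags.withColumn(
    'tokens_name', split(lower(regexp_replace('name', '[^A-Za-z0-9]+', ' ')), ' '))

# Drop stop words from the array<string> column (uses Spark's built-in English list)
remover = StopWordsRemover(inputCol='tokens_name', outputCol='tokens_name_no_stop')
clean = remover.transform(clean)
```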
Oddly enough, the first two times I ran my code through Zeppelin I got the error, but on the third attempt it ran just fine and the output was what I expected. Zeppelin is only for testing, though; I need it to run via spark-submit.
```
Traceback (most recent call last):
  File "/tmp/testing_test.py", line 262, in <module>
    udf_stop_words = udf(stop_words, ArrayType(StringType()))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1872, in udf
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1830, in __init__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1835, in _create_judf
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1815, in _wrap_function
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2359, in _prepare_for_python_RDD
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 460, in dumps
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 703, in dumps
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 147, in dump
  File "/home/hadoop/anaconda/lib/python3.6/pickle.py", line 409, in dump
    self.save(obj)
  File "/home/hadoop/anaconda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/hadoop/anaconda/lib/python3.6/pickle.py", line 736, in save_tuple
    save(element)
  File "/home/hadoop/anaconda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 248, in save_function
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 296, in save_function_tuple
  File "/home/hadoop/anaconda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/hadoop/anaconda/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/home/hadoop/anaconda/lib/python3.6/pickle.py", line 852, in _batch_setitems
    save(v)
  File "/home/hadoop/anaconda/lib/python3.6/pickle.py", line 521, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 564, in save_reduce
_pickle.PicklingError: args[0] from __newobj__ args has the wrong class
```
I've tried several things to resolve this, none of them successful; they all return the same error.

I've tried changing the UDF to a one-line lambda function:
```python
udf(lambda words: list(word.lower() for word in words if word.lower() not in stopwords.words('english')), ArrayType(StringType()))
```
I've tried changing the UDF to return a string:
```python
udf_stop_words = udf(stop_words, StringType())
```
and changing the UDF slightly to match:
```python
def stop_words(words):
    return str(word.lower() for word in words if word.lower() not in stopwords.words('english'))
```
I've tried defining it as a StructType with:
1 | udf_stop_words = udf(stop_words, StructType([StructField("words", ArrayType(StringType()), False)])) |
and
1 | udf_stop_words = udf(stop_words, StructType([StructField("words", StringType(), False)])). |
I've also tried many combinations of the above.
The return type of the UDF should be ArrayType(StringType()).

I'm not too sure about this, but the problem may come from nltk not being installed on your nodes (or at least the stopwords corpus not having been downloaded there).
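If missing nltk data is indeed the culprit, a quick check you could run wherever the Python workers execute (a hypothetical snippet; nltk.download is nltk's standard downloader) is:

```python
import nltk

try:
    from nltk.corpus import stopwords
    stopwords.words("english")  # raises LookupError if the corpus is missing
except LookupError:
    nltk.download("stopwords")  # fetch the corpus into the local nltk_data directory
```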
Since stopwords.words("english") is called inside your UDF, it gets evaluated on the executors; you can compute the list once on the driver and broadcast it instead:
```python
from nltk.corpus import stopwords
from pyspark.sql.types import ArrayType, StringType
import pyspark.sql.functions as psf

english_stopwords = stopwords.words("english")
# Keep a handle on the broadcast and read it through .value on the executors
broadcast_stopwords = sc.broadcast(english_stopwords)

def stop_words(words):
    return list(word.lower() for word in words if word.lower() not in broadcast_stopwords.value)

udf_stop_words = psf.udf(stop_words, ArrayType(StringType()))
```
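To illustrate (this toy DataFrame and its column name are my own, not from the original post), applying the rebuilt UDF would look like:

```python
df = spark.createDataFrame(
    [(['A', 'delightful', '45', 'minute', 'Swedish', 'style', 'massage'],)],
    ['split_tokenized_activity_description'])

df.withColumn('stop_words_removed',
              udf_stop_words(df.split_tokenized_activity_description)).show(truncate=False)
```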
I had a similar issue. In my case, the exception was thrown because I had defined a class inside the Spark script itself. I solved it by creating a separate .py file containing the class definition and its methods, shipping that file with sc.addPyFile(path), and then importing the class in the main script.
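A minimal sketch of that workaround, with hypothetical file and class names:

```python
# text_helpers.py -- a separate module holding the class definition and methods
class StopWordFilter(object):
    def __init__(self, stop_list):
        self.stop_list = set(stop_list)

    def filter(self, words):
        return [w.lower() for w in words if w.lower() not in self.stop_list]
```

and in the main Spark script:

```python
sc.addPyFile("text_helpers.py")  # ships the module to every executor
from text_helpers import StopWordFilter
```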