
Pickling Error with Spark-Submit “_pickle.PicklingError: args[0] from __newobj__ args has the wrong class”

When trying to run some code through spark-submit or Zeppelin, I get the following error: "_pickle.PicklingError: args[0] from __newobj__ args has the wrong class"

I have looked around at posts about the same problem, but none of them shed much light on this issue.

The traceback (included below) points to one of the UDFs I am using:

def stop_words(words):
    return list(word.lower() for word in words if word.lower() not in stopwords.words("english"))

udf_stop_words = udf(stop_words, ArrayType(StringType()))

Both the input and the output of the function are lists of strings. Here are 3 rows of the input:

[Row(split_tokenized_activity_description=['A', 'delightful', '45',
'minute', 'Swedish', 'style', 'massage']),
Row(split_tokenized_activity_description=['A', 'more', 'intense',
'45', 'minute', 'version', 'of', 'a', 'Swedish', 'style', 'massage']),
Row(split_tokenized_activity_description=['A', 'relaxing', '45',
'minute', 'Swedish', 'style', 'massage'])
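To make the expected behaviour concrete, here is the filter run on the first row above as plain Python. The hardcoded ENGLISH_STOPWORDS set is a stand-in for nltk's stopwords.words("english"), used only to keep the sketch self-contained:

```python
# Stand-in for nltk.corpus.stopwords.words("english") -- a small
# hardcoded subset so the sketch runs without NLTK installed.
ENGLISH_STOPWORDS = {"a", "of", "more", "the", "in", "on"}

def stop_words(words):
    # Lowercase every token and drop the ones that are stop words
    return [word.lower() for word in words
            if word.lower() not in ENGLISH_STOPWORDS]

row = ['A', 'delightful', '45', 'minute', 'Swedish', 'style', 'massage']
print(stop_words(row))
# ['delightful', '45', 'minute', 'swedish', 'style', 'massage']
```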

Here is a small snippet of the code I am using:

def special_car(x):
    # Remove special characters and replace them with a space
    return [re.sub('[^A-Za-z0-9]+', ' ', x)]

# Create UDF from function
udf_special_car = udf(special_car, ArrayType(StringType()))

# Function to remove stop words
def stop_words(words):
    return list(word.lower() for word in words if word.lower() not in stopwords.words("english"))

udf_stop_words = udf(stop_words, ArrayType(StringType()))

# Load in data
df_tags = spark.sql("select * from database")

# Remove special characters
df1_tags = df_tags.withColumn('tokenized_name', udf_special_car(df_tags.name))
df2_tags = df1_tags.withColumn('tokenized_description', udf_special_car(df1_tags.description))

# Select only relevant columns
df3_tags = df2_tags.select(['tag_id', 'tokenized_name', 'tokenized_description'])

# Tokenize tag_name and tag_desc (separate on spaces, using pyspark.sql.functions.split)
df4_tags = df3_tags.withColumn('split_tokenized_name', split(df3_tags['tokenized_name'].getItem(0), ' '))
df5_tags = df4_tags.withColumn('split_tokenized_description', split(df4_tags['tokenized_description'].getItem(0), ' '))

# Select only relevant columns
df6_tags = df5_tags.select(['tag_id', 'split_tokenized_name', 'split_tokenized_description'])

# Remove stop words
df7_tags = df6_tags.withColumn('stop_words_tokenized_name', udf_stop_words(df6_tags.split_tokenized_name))
df8_tags = df7_tags.withColumn('stop_words_tokenized_description', udf_stop_words(df7_tags.split_tokenized_description))
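As a side note, the special_car helper can be exercised on its own. Because the regex rewrites every run of non-alphanumeric characters (including the separating spaces and any trailing punctuation) to a single space, the result can carry a trailing space, which the later split(..., ' ') step inherits:

```python
import re

def special_car(x):
    # Replace every run of non-alphanumeric characters with a single space
    return [re.sub('[^A-Za-z0-9]+', ' ', x)]

print(special_car("45-minute Swedish-style massage!"))
# ['45 minute Swedish style massage ']
```

Note the trailing space produced by the final "!" -- splitting that string on ' ' yields an empty token at the end.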

Oddly, the first two times I ran my code through Zeppelin I got the error, but after the third attempt it ran fine and the output looked the way I expected. Zeppelin is only for testing, though; I need to get this running through spark-submit.

Traceback (most recent call last):
  File "/tmp/testing_test.py", line 262, in <module>
    udf_stop_words = udf(stop_words, ArrayType(StringType()))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1872, in udf
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1830, in __init__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1835, in _create_judf
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1815, in _wrap_function
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2359, in _prepare_for_python_RDD
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 460, in dumps
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 703, in dumps
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 147, in dump
  File "/home/hadoop/anaconda/lib/python3.6/pickle.py", line 409, in dump
    self.save(obj)
  File "/home/hadoop/anaconda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/hadoop/anaconda/lib/python3.6/pickle.py", line 736, in save_tuple
    save(element)
  File "/home/hadoop/anaconda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 248, in save_function
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 296, in save_function_tuple
  File "/home/hadoop/anaconda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/hadoop/anaconda/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/home/hadoop/anaconda/lib/python3.6/pickle.py", line 852, in _batch_setitems
    save(v)
  File "/home/hadoop/anaconda/lib/python3.6/pickle.py", line 521, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 564, in save_reduce
_pickle.PicklingError: args[0] from __newobj__ args has the wrong class

I have tried several things to fix the issue, none with any success; they all return the same error.

I tried changing the UDF to a one-line lambda function:

udf(lambda words: list(word.lower() for word in words if word.lower() not in stopwords.words('english')), ArrayType(StringType()))

I tried changing the UDF to return a string:

udf_stop_words = udf(stop_words, StringType())

and changing the UDF slightly to match:

def stop_words(words):
    return str(word.lower() for word in words if word.lower() not in stopwords.words('english'))
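One thing worth noting about this variant: str() applied to a generator expression returns the generator's repr, not the filtered words, so even without the pickling error it would not have produced usable output. A quick illustration (hardcoded stops set is just for the sketch):

```python
words = ['A', 'relaxing', 'massage']
stops = {'a'}

# str() of a generator expression yields its repr, not the filtered words
result = str(word.lower() for word in words if word.lower() not in stops)
print(result)

# Joining explicitly (or keeping list()) is what was probably intended
joined = ' '.join(word.lower() for word in words if word.lower() not in stops)
print(joined)
# relaxing massage
```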

I tried defining it as a StructType:

udf_stop_words = udf(stop_words, StructType([StructField("words", ArrayType(StringType()), False)]))

udf_stop_words = udf(stop_words, StructType([StructField("words", StringType(), False)]))

I have also tried multiple combinations of the above.


The return type should be ArrayType(StringType()).

I am not too sure about this, but the problem may come from the fact that nltk is not installed on your nodes (or the stopwords corpus was never downloaded onto them). Since calling stopwords.words("english") inside the UDF means calling it on the nodes, it may fail because it cannot find the corpus there.

Since stopwords.words("english") is just a list, you should call it on the driver and then broadcast it to the nodes:

from nltk.corpus import stopwords
from pyspark.sql.types import ArrayType, StringType
import pyspark.sql.functions as psf

# Build the stop-word list once on the driver and broadcast it to the nodes
english_stopwords = stopwords.words("english")
broadcast_stopwords = sc.broadcast(english_stopwords)

def stop_words(words):
    return list(word.lower() for word in words if word.lower() not in broadcast_stopwords.value)

udf_stop_words = psf.udf(stop_words, ArrayType(StringType()))

I had a similar issue. In my case, the exception was thrown because I had defined a class in the Spark script itself. It was solved by creating a separate .py file containing the class definition and its methods, adding that file to the job with sc.addPyFile(path), and finally doing from FileName import *.
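The failure mode described here can be mimicked with plain pickle, independent of Spark: pickle serializes instances by recording their class's importable name, so a class the worker cannot import by name will not round-trip. A minimal sketch (using a function-local class, a stricter case than a class defined at the top of a script, to force the failure deterministically):

```python
import pickle

def make_instance():
    # A class defined inside a function has no importable module-level
    # name, so pickle cannot serialize its instances by reference.
    class Local:
        pass
    return Local()

failed = False
try:
    pickle.dumps(make_instance())
except Exception as exc:
    failed = True
    print(type(exc).__name__)

print("pickling failed:", failed)
```

Moving the class into its own module, shipping that module with sc.addPyFile, and importing it gives pickle a stable name it can resolve on both the driver and the workers.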