Python: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)

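I'm using NLTK to perform k-means clustering on my text file, in which each line is treated as a document. For example, my text file looks something like this: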

belong finger death punch hasty mike hasty walls jericho jägermeister rules rules bands follow performing jägermeister stage approach

The demo code I'm trying to run is: https://gist.github.com/xim/1279283

The error I receive is:

Traceback (most recent call last):
  File "cluster_example.py", line 40, in <module>
    words = get_words(job_titles)
  File "cluster_example.py", line 20, in get_words
    words.add(normalize_word(word))
  File "<string>", line 1, in <lambda>
  File "/usr/local/lib/python2.7/dist-packages/nltk/decorators.py", line 183, in memoize
    result = func(*args)
  File "cluster_example.py", line 14, in normalize_word
    return stemmer_func(word.lower())
  File "/usr/local/lib/python2.7/dist-packages/nltk/stem/snowball.py", line 694, in stem
    word = (word.replace(u"\u2019", u"\x27")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)

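What is happening here?

The file is being read as a bunch of strs, but it should be unicodes. Python tries to convert implicitly (using the ASCII codec) and fails. Change:

job_titles = [line.strip() for line in title_file.readlines()]

to explicitly decode the strs to unicode (here assuming UTF-8):

job_titles = [line.decode('utf-8').strip() for line in title_file.readlines()]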
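
To see why the implicit conversion fails, here is a minimal sketch. The sample text is made up for illustration; it contains U+2019, whose UTF-8 encoding begins with the byte 0xe2 named in the error message. Python 2 performs the ASCII decode implicitly when a str meets a unicode; the sketch does it explicitly so that it also runs on Python 3.

```python
# Hypothetical sample: bytes as they would come out of a UTF-8 text file.
raw = u"employee\u2019s handbook".encode("utf-8")

try:
    # Roughly what Python 2 attempts behind the scenes.
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError as exc:
    ascii_failed = True
    # exc.start is the offset of the first undecodable byte.
    bad_byte = raw[exc.start:exc.start + 1]

# Decoding with the codec the file was actually written in succeeds.
text = raw.decode("utf-8")
```

Once every line is decoded up front like this, the stemmer only ever sees unicode and no implicit conversion is needed.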

It can also be solved by importing the codecs module and using codecs.open rather than the built-in open.
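
A sketch of that approach, using io.open rather than codecs.open (my substitution: it accepts the same encoding argument and is the same function as the built-in open on Python 3). The helper name read_titles and the sample file are hypothetical.

```python
import io
import os
import tempfile


def read_titles(path, encoding="utf-8"):
    """Return the stripped lines of a text file as decoded (unicode) strings."""
    # io.open decodes while reading, so nothing downstream sees raw bytes.
    with io.open(path, "r", encoding=encoding) as title_file:
        return [line.strip() for line in title_file]


# Hypothetical sample file containing a non-ASCII character.
sample_path = os.path.join(tempfile.mkdtemp(), "titles.txt")
with io.open(sample_path, "w", encoding="utf-8") as f:
    f.write(u"j\u00e4germeister rules\nmike hasty\n")

job_titles = read_titles(sample_path)
```

The encoding is still an assumption about the file; if the file is not UTF-8, pass the right codec name instead.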

Related discussion

For me, the problem was the terminal encoding. Adding UTF-8 to .bashrc solved it:

export LC_CTYPE=en_US.UTF-8

Don't forget to reload .bashrc afterwards:

source ~/.bashrc
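
To check whether the change took effect, you can ask Python which encodings it picked up from the environment. This is only a sketch: describe_encodings is a hypothetical helper, and the values it reports depend on your system.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python derived from the current environment."""
    return {
        # Encoding implied by the locale settings (LC_CTYPE and friends).
        "preferred": locale.getpreferredencoding(False),
        # Encoding used when printing text to the terminal, if known.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
    }


encodings = describe_encodings()
for name, value in sorted(encodings.items()):
    print("%s: %s" % (name, value))
```

If "preferred" or "stdout" still reports an ASCII-only codec after reloading .bashrc, the locale setting has not reached the Python process.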

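To find all non-ASCII bytes that can cause Unicode-related errors, use the following command:

grep -r -P '[^\x00-\x7f]' /etc/apache2 /etc/letsencrypt /etc/nginx

It found mine in:

/etc/letsencrypt/options-ssl-nginx.conf: # The following CSP directives don't use default-src as

I used shed to find the offending sequence. It turned out to be an editor mistake.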
00008099:   C2 194 302 11000010
00008100:   A0 160 240 10100000
00008101: d 64 100 144 01100100
00008102: e 65 101 145 01100101
00008103: f 66 102 146 01100110
00008104: a 61 097 141 01100001
00008105: u 75 117 165 01110101
00008106: l 6C 108 154 01101100
00008107: t 74 116 164 01110100
00008108: - 2D 045 055 00101101
00008109: s 73 115 163 01110011
00008110: r 72 114 162 01110010
00008111: c 63 099 143 01100011
00008112:   C2 194 302 11000010
00008113:   A0 160 240 10100000
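
If shed is not available, a few lines of Python can produce a similar listing. This is a rough sketch, not a replacement for shed: find_non_ascii is a hypothetical helper, and the sample bytes are made up to mirror the dump above. The pair C2 A0 is the UTF-8 encoding of U+00A0 (no-break space), which looks like an ordinary space in most editors.

```python
def find_non_ascii(data):
    """Yield (offset, byte_value) for every byte outside the ASCII range."""
    for offset, value in enumerate(bytearray(data)):
        if value > 0x7F:
            yield offset, value


# Made-up sample: "default-src" surrounded by no-break spaces (C2 A0).
sample = b"#\xc2\xa0default-src\xc2\xa0as"
hits = list(find_non_ascii(sample))

# Print offset, hex, decimal, octal and binary, like the listing above.
for offset, value in hits:
    print("%08d: %02X %3d %03o %s" % (offset, value, value, value, format(value, "08b")))
```

For a real file, read it with open(path, "rb").read() and pass the result to find_non_ascii.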

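You can try this before using the job_titles string (unicode() takes a single str, so apply it to each line if job_titles is a list):

source = unicode(job_titles, 'utf-8')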

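Python 3.x or higher

Load the file as a byte stream:

def read_body():
    # Wrapped in a function (the name is illustrative) so that `return` is valid.
    body = ''
    for lines in open("website/index.html", "rb"):
        decodedline = lines.decode('utf-8')
        body = body + decodedline.strip()
    return body

Use a global setting:

import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')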
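
To see what that wrapper does without touching the real sys.stdout, here is a small sketch that wraps an in-memory buffer instead. The io.BytesIO object stands in for sys.stdout.buffer, and the sample string is made up.

```python
import io

# Stand-in for sys.stdout.buffer: collects the raw bytes that get written.
buffer = io.BytesIO()

# newline="\n" keeps line endings untranslated so the result is the same on every platform.
stream = io.TextIOWrapper(buffer, encoding="utf-8", newline="\n")

# Text goes in, UTF-8 encoded bytes come out on the underlying buffer.
stream.write(u"j\u00e4germeister\n")
stream.flush()

written = buffer.getvalue()
```

The wrapper encodes on write, so printing non-ASCII text no longer depends on the encoding the terminal advertised.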

Simply do the following:

Use open(fn, 'rb').read().decode('utf-8') instead of just open(fn).read().

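For Python 3, the default encoding is "utf-8". In case of any problem, the base documentation suggests the following steps (see the csv examples): https://docs.python.org/2/library/csv.html

Create a function: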

def utf_8_encoder(unicode_csv_data):
    # Encode each unicode line to UTF-8 bytes before csv.reader sees it.
    for line in unicode_csv_data:
        yield line.encode('utf-8')

Then use the function inside the reader, for example:

csv_reader = csv.reader(utf_8_encoder(unicode_csv_data))
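
The utf_8_encoder workaround targets Python 2, where the csv module works on bytes. On Python 3 the csv module reads text directly, so a sketch under the assumption that the file is UTF-8 is to let open() do the decoding. The file name and contents below are hypothetical.

```python
import csv
import os
import tempfile

# Hypothetical sample CSV file with a non-ASCII value.
path = os.path.join(tempfile.mkdtemp(), "titles.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write("band,drink\nwalls of jericho,j\u00e4germeister\n")

# newline="" is what the csv documentation recommends for file objects.
with open(path, "r", encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))
```

Every cell in rows is already a str, so no encoder generator is needed in this case.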