Python: UnicodeEncodeError: 'ascii' codec can't encode character

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

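I am having problems dealing with Unicode characters in text fetched from different web pages (on different sites). I am using BeautifulSoup.

The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes it fails by throwing a UnicodeEncodeError. I have tried just about everything I can think of, but I have not found anything that works consistently without throwing some kind of Unicode-related error.

One of the sections of code that is causing problems is shown below: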

agent_telno = agent.find('div', 'agent_contact_number')
agent_telno = '' if agent_telno is None else agent_telno.contents[0]
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()

Here is a stack trace produced on SOME strings when the snippet above is run:

Traceback (most recent call last):
  File "foobar.py", line 792, in <module>
    p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

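I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption, so there are no issues relating to internationalization or dealing with text written in anything other than English.

Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?

Related discussion

You need to read the Python Unicode HOWTO. This error is the very first example.

Basically, stop using str to convert from unicode to encoded text / bytes.

Instead, properly use .encode() to encode the string:

p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()

or work entirely in unicode.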
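
A minimal sketch of the same fix, written for Python 3 as an assumption (the question is Python 2, where the implicit ASCII encode happens inside str()). The sample values for agent_contact and agent_telno are made up for illustration; '\xa0' is the non-breaking space named in the traceback.

```python
# Hypothetical sample values; '\xa0' is the non-breaking space from the traceback.
agent_contact = "Jane Doe"
agent_telno = "020\xa07946 0000"

text = " ".join((agent_contact, agent_telno)).strip()

# Encoding with the ASCII codec reproduces the error from the question.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Naming a codec that can represent the character succeeds.
agent_info = text.encode("utf-8")

print(ascii_failed)
print(agent_info)
```

The point is only that the codec is chosen explicitly at the moment text becomes bytes, rather than being left to a default.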

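Related discussion

Agreed! A good rule of thumb I was taught is the "Unicode sandwich" idea. Your script accepts bytes from the outside world, but all processing should be done in Unicode. Only when you are ready to output your data should it be converted back into bytes!
In case anyone else is confused by this, I found a strange thing: my terminal uses UTF-8, and when I print my UTF-8 strings it works nicely. However, when I pipe my program's output to a file, it throws a UnicodeEncodeError. In fact, when output is redirected (to a file or a pipe), I find that sys.stdout.encoding is None! Tacking on .encode('utf-8') solves the problem.
@drevicko: use PYTHONIOENCODING=utf-8 instead, i.e. print Unicode strings and let the environment set the expected encoding.
But it is slow…
@Sebastian: do you think that is a valid approach in every case? Say you have a tool that exports a report which needs a specific encoding; I feel it is less straightforward if the user has to change an environment setting just for that export. I would rather have the program take the encoding as a parameter with a sensible default.
@Steinar: it is not valid in every case. In general, a user should not care that you used Python to implement your utility (the interface should not change if you decide to reimplement it in another language for whatever reason), so you should not expect the user to know about Python-specific envvars. Forcing the user to specify the character encoding is bad UI; embed the character encoding in the report format if necessary. Note: no hardcoded encoding can be a "sensible default" in the general case.
@J.F.Sebastian I agree that this varies by use case. My concern is that readers may adopt the pattern of setting the environment variable instead of configuring the encoding more explicitly, which leaves users having to worry about which language the utility is written in. I think that in some cases letting the user specify the character encoding makes a lot of sense and seems a better option than setting an environment variable. But I am thinking of tools for advanced users.
PYTHONIOENCODING seems to be needed if the locale is built incorrectly or the user is on a non-UTF-8 locale. Check locale before setting PYTHONIOENCODING.
This is bad and confusing advice. The reason people use str is because the object ISN'T already a string, so there is no .encode() method to call.
I'm not sure what you mean. If you are getting this error, you definitely do have a string object: a unicode string.
@agf, no, this error was raised by the code unicode(list(DjangoQueryObject)).
What is the reason for the .strip()? Is it recommended for every conversion that calls .encode("utf-8")?
No, it isn't. It was just copied from the original code, like the join and the variable names. It is specific to his use case.
@drevicko is right, that is exactly what is happening.
I read the document you mentioned carefully, but I still don't understand why both u'' and .encode('utf-8') are necessary. Does the "u" denote a Unicode string?
@Matt, I'm not sure I understand your question. Yes, the "u" denotes a Unicode string. encode takes a decoded unicode string and encodes it as a specific UTF-8 byte sequence. So after encoding to UTF-8, as far as Python is concerned it is just a sequence of bytes, no longer Unicode.
That one is a Unicode string and the other a byte sequence is what I hadn't appreciated. I still don't understand why, without the encode() call (per @drevicko's comment), piping in the shell breaks.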
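
The "Unicode sandwich" from the first comment can be sketched roughly as follows. This is an illustration in Python 3 (an assumption; the thread is about Python 2), and clean_contact is a hypothetical helper name, not something from the original code.

```python
def clean_contact(raw_bytes, encoding="utf-8"):
    """Bytes in, Unicode in the middle, bytes out."""
    # 1. Decode once, at the input boundary.
    text = raw_bytes.decode(encoding)
    # 2. Do all processing on Unicode text. str.split() with no arguments
    #    treats the non-breaking space as whitespace, so runs of whitespace
    #    collapse to single spaces here.
    text = " ".join(text.split())
    # 3. Encode once, at the output boundary.
    return text.encode(encoding)


raw = "  Tel:\xa0020 7946\n".encode("utf-8")
print(clean_contact(raw))
```

Keeping the decode and encode steps at the edges means the processing in the middle never has to guess which encoding a value is in.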

This is a classic Python Unicode pain point! Consider the following:

a = u'bats\u00E0'
print a
=> batsà

All good so far, but if we call str(a), let's see what happens:

str(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

Oh dip, that's not going to do anyone any good! To fix the error, encode the bytes explicitly with .encode and tell Python which codec to use:

a.encode('utf-8')
=> 'bats\xc3\xa0'
print a.encode('utf-8')
=> batsà

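Voil\u00E0!

The issue is that when you call str(), Python uses the default character encoding to try to encode the bytes you gave it, which in your case are sometimes representations of Unicode characters. To fix the problem, you have to tell Python how to deal with the string you give it by using .encode('whatever_unicode'). Most of the time, you should be fine using UTF-8.

For an excellent exposition on this topic, see Ned Batchelder's PyCon talk: http://nedbatchelder.com/text/unipain.html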
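
To make the "position 4: ordinal not in range(128)" part of the message concrete, here is a small Python 3 sketch (Python 3 is an assumption; the answer above is Python 2). It catches the exception and reads the attributes that the error message is built from.

```python
a = "bats\u00e0"

try:
    a.encode("ascii")
except UnicodeEncodeError as err:
    # encoding: the codec that failed; start: index of the first bad character.
    details = (err.encoding, err.start, err.object[err.start])

print(details)
print(a.encode("utf-8"))
```

The ASCII codec only covers code points 0 to 127, so the character at index 4 (U+00E0) is the first one it cannot represent, while UTF-8 encodes it as two bytes.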

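Related discussion

I found an elegant workaround for removing the symbols while still keeping the string as a string:

yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')

It is important to note that using the ignore option is dangerous, because it silently drops any Unicode (and internationalization) support from the code that uses it, as seen here (converting Unicode):

>>> u'City: Malmö'.encode('ascii', 'ignore').decode('ascii')
'City: Malm'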
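
For comparison, a short Python 3 sketch (an assumption; the answer is Python 2) of three standard error handlers accepted by str.encode. Only 'ignore' appears in the answer; 'replace' and 'backslashreplace' are shown as alternatives that at least leave a visible trace of what was lost.

```python
city = "City: Malm\u00f6"

# 'ignore' silently drops the character.
ignored = city.encode("ascii", "ignore")
# 'replace' substitutes a question mark.
replaced = city.encode("ascii", "replace")
# 'backslashreplace' keeps the code point as an escape sequence.
escaped = city.encode("ascii", "backslashreplace")

print(ignored)
print(replaced)
print(escaped)
```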

Related discussion

Well, I tried everything and nothing helped; after some googling I found the following, and it worked. Python 2.7 is in use.

# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')

Related discussion

A subtle problem that causes even print to fail is having your environment variables set wrong, e.g. here LC_ALL is set to "C". In Debian they discourage setting it: Debian wiki on Locale

$ echo $LANG
en_US.utf8
$ echo $LC_ALL
C
$ python -c "print (u'voil\u00e0')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
$ export LC_ALL='en_US.utf8'
$ python -c "print (u'voil\u00e0')"
voilà
$ unset LC_ALL
$ python -c "print (u'voil\u00e0')"
voilà
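
To see what Python itself derives from these variables, a small Python 3 sketch (an assumption; the session above is Python 2) can report the encodings in play. describe_encodings is a hypothetical helper name; the values it prints depend entirely on the environment it runs in.

```python
import locale
import sys


def describe_encodings():
    """Report the encodings Python picked up from the environment."""
    return {
        # Encoding implied by the locale settings (LANG, LC_ALL, ...).
        "locale_preferred": locale.getpreferredencoding(False),
        # Encoding used when printing; may be None if stdout was replaced.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for name, value in describe_encodings().items():
    print(name, "=", value)
```

Running it once with LC_ALL=C and once with a UTF-8 locale is a quick way to check whether the environment is the culprit.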

Related discussion

In fact, I found that in most of my cases, just stripping out those characters is much simpler:

s = mystring.decode('ascii', 'ignore')

Related discussion

What really worked for me was:

BeautifulSoup(html_text,from_encoding="utf-8")

Hope this helps someone.

You can try this to solve it:

# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')

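Related discussion

The problem is that you are trying to print a Unicode character, but your terminal does not support it.

You can try installing the language-pack-en package to fix that:

sudo apt-get install language-pack-en

It provides English translation data updates for all supported packages (including Python). Install a different language package if necessary (depending on which characters you are trying to print).

On some Linux distributions it is required in order to make sure that the default English locales are set up properly (so that Unicode characters can be handled by the shell/terminal). Sometimes it is easier to install it than to configure it manually.

Then, when writing the code, make sure you use the right encoding in your code.

For example:

open(foo, encoding='utf-8')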
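
A slightly fuller sketch of the open() call above, assuming Python 3 (where the built-in open accepts encoding). A temporary file stands in for foo, and the sample text is made up for illustration.

```python
import os
import tempfile

text = "voil\u00e0 \u2122"

# A temporary file stands in for the real path (foo in the answer above).
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Write and read back with an explicit encoding, independent of the locale.
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
    with open(path, encoding="utf-8") as fh:
        round_tripped = fh.read()
    # Reading in binary mode shows the UTF-8 bytes that were stored.
    with open(path, "rb") as fh:
        raw = fh.read()
finally:
    os.remove(path)

print(round_tripped == text)
print(raw)
```

Passing encoding explicitly means the result does not change when the script is moved to a machine with a different locale.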

If you still have a problem, double check your system configuration, such as:

Your locale file (/etc/default/locale), which should have e.g.

LANG="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

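or:

LC_ALL=C.UTF-8
LANG=C.UTF-8

Value of LANG/LC_CTYPE in the shell.

Check which locales your shell supports by:

locale -a | grep "UTF-8"

Demonstrating the problem and solution in a fresh VM.

Initialize and provision the VM (e.g. using vagrant):

vagrant init ubuntu/trusty64; vagrant up; vagrant ssh

See: available Ubuntu boxes.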

Printing Unicode characters (such as the trade mark sign, ™):

$ python -c 'print(u"\u2122");'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)

Now install language-pack-en:

$ sudo apt-get -y install language-pack-en
The following extra packages will be installed:
language-pack-en-base
Generating locales...
en_GB.UTF-8... /usr/sbin/locale-gen: done
Generation complete.

Now the problem should be solved:

$ python -c 'print(u"\u2122");'
™

Otherwise, try the following command:

$ LC_ALL=C.UTF-8 python -c 'print(u"\u2122");'
™

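Related discussion

Add the line below at the beginning of your script (or as the second line):

# -*- coding: utf-8 -*-

That is the definition of the Python source code encoding. More info in PEP 263.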

Related discussion

Here is a rehashing of some other so-called "cop out" answers. There are situations in which simply throwing away the troublesome characters/strings is a good solution, despite the protests voiced here.

def safeStr(obj):
    try: return str(obj)
    except UnicodeEncodeError:
        return obj.encode('ascii', 'ignore').decode('ascii')
    except: return ""

Testing it:

if __name__ == '__main__':
    print safeStr( 1 )
    print safeStr( "test" )
    print u'98\xb0'
    print safeStr( u'98\xb0' )

Results:

1
test
98°
98

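Suggestion: you might want to name this function toAscii instead? That is a matter of preference.

This was written for Python 2. For Python 3, I believe you will want to use bytes(obj,"ascii") rather than str(obj). I have not tested this yet, but I will at some point and revise the answer.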
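
A possible Python 3 counterpart, offered as a sketch rather than the author's tested version. It deliberately uses a different approach from the bytes(obj, "ascii") idea above: in Python 3, bytes("98\xb0", "ascii") encodes strictly and raises the same UnicodeEncodeError, so this version converts with str() first and then drops non-ASCII characters. The name to_ascii follows the naming suggestion above and is otherwise hypothetical.

```python
def to_ascii(obj):
    """Return str(obj) with every non-ASCII character dropped."""
    # In Python 3 str() never raises UnicodeEncodeError, so no try/except
    # is needed; the lossy step is the explicit 'ignore' encode.
    return str(obj).encode("ascii", "ignore").decode("ascii")


print(to_ascii(1))
print(to_ascii("test"))
print(to_ascii("98\xb0"))
```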

In the shell:

Find the supported UTF-8 locales with the following command:

locale -a | grep "UTF-8"

Export it before running the script, e.g.:

export LC_ALL=$(locale -a | grep UTF-8)

or manually, like:

export LC_ALL=C.UTF-8

Test it by printing a special character, e.g. ™:

python -c 'print(u"\u2122");'

Tested on Ubuntu.

I always put the two lines below at the top of my Python files:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

Simple helper functions found here.

def safe_unicode(obj, *args):
    """ return the unicode representation of obj """
    try:
        return unicode(obj, *args)
    except UnicodeDecodeError:
        # obj is byte string
        ascii_text = str(obj).encode('string_escape')
        return unicode(ascii_text)

def safe_str(obj):
    """ return the byte string representation of obj """
    try:
        return str(obj)
    except UnicodeEncodeError:
        # obj is unicode
        return unicode(obj).encode('unicode_escape')

Related discussion

Just add .encode('utf-8') to the variable:

agent_contact.encode('utf-8')

Related discussion

The solution below worked for me. I just added

u"String"

(representing the string as Unicode) before my string.

result_html = result.to_html(col_space=1, index=False, justify={'right'})

text = u"""
<html>
<body>


Hello all,

Here's weekly summary report. Let me know if you have any questions.

Data Summary

{0}



Thanks,


Data Team

</body></html>
""".format(result_html)

I just used the following:

import unicodedata
message = unicodedata.normalize("NFKD", message)

Check what the documentation says about it:

unicodedata.normalize(form, unistr) Return the normal form form for
the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’,
‘NFD’, and ‘NFKD’.

The Unicode standard defines various normalization forms of a Unicode
string, based on the definition of canonical equivalence and
compatibility equivalence. In Unicode, several characters can be
expressed in various way. For example, the character U+00C7 (LATIN
CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence
U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

For each character, there are two normal forms: normal form C and
normal form D. Normal form D (NFD) is also known as canonical
decomposition, and translates each character into its decomposed form.
Normal form C (NFC) first applies a canonical decomposition, then
composes pre-combined characters again.

In addition to these two forms, there are two additional normal forms
based on compatibility equivalence. In Unicode, certain characters are
supported which normally would be unified with other characters. For
example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049
(LATIN CAPITAL LETTER I). However, it is supported in Unicode for
compatibility with existing character sets (e.g. gb2312).

The normal form KD (NFKD) will apply the compatibility decomposition,
i.e. replace all compatibility characters with their equivalents. The
normal form KC (NFKC) first applies the compatibility decomposition,
followed by the canonical composition.

Even if two unicode strings are normalized and look the same to a
human reader, if one has combining characters and the other doesn’t,
they may not compare equal.

That solves it for me. Simple and easy.
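
A short Python 3 sketch (an assumption; the sample strings are made up) of what NFKD does to the two characters that appear in this thread. It suggests why normalization helps with the non-breaking space from the question but is not a general fix for accented letters.

```python
import unicodedata

# U+00A0 (no-break space) has a compatibility decomposition to a plain space.
spaced = unicodedata.normalize("NFKD", "020\u00a07946")

# U+00E0 decomposes to 'a' plus U+0300 (combining grave accent),
# which is still outside ASCII.
decomposed = unicodedata.normalize("NFKD", "bats\u00e0")

still_non_ascii = any(ord(ch) > 127 for ch in decomposed)

print(repr(spaced))
print(repr(decomposed))
print(still_non_ascii)
```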

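We struck this error when running manage.py migrate in Django with localized fixtures.

Our source contained the # -*- coding: utf-8 -*- declaration, MySQL was correctly configured for utf8, and Ubuntu had the appropriate language pack and values in /etc/default/locale.

The issue was simply that the Django container (we use Docker) was missing the LANG env var.

Setting LANG to en_US.UTF-8 and restarting the container before re-running the migrations fixed the problem.

I just ran into this problem and Google led me here, so, to add to the general solutions here, this is what worked for me: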

# 'value' contains the problematic data
unic = u''
unic += value
value = unic

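I got this idea after reading Ned's presentation.

I don't claim to fully understand why this works, though. So if anyone can edit this answer or post a comment to explain, I would appreciate it.

Related discussion

Open a terminal and run the command below:

export LC_ALL="en_US.UTF-8"

This works in Python 3, at least…

Python 3

Sometimes the error is in the environment variables and the encoding, so: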

import os
import locale
os.environ["PYTHONIOENCODING"] ="utf-8"
myLocale=locale.setlocale(category=locale.LC_ALL, locale="en_GB.UTF-8")
...
print(myText.encode('utf-8', errors='ignore'))

Here, errors are ignored during encoding.
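
One caveat worth illustrating: PYTHONIOENCODING is read when the interpreter starts, so assigning it through os.environ inside a running script does not change that script's own stdout. The Python 3 sketch below therefore sets it for a child interpreter instead; the child command is a made-up one-liner used only for demonstration.

```python
import os
import subprocess
import sys

# Copy the current environment and add the variable for the child only.
env = dict(os.environ, PYTHONIOENCODING="utf-8")

# The child prints the trade mark sign; its stdout is a pipe, not a terminal.
child_code = "print('\\u2122')"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=env,
    stdout=subprocess.PIPE,
    check=True,
)

# The bytes captured from the pipe are the UTF-8 encoding of U+2122.
output = result.stdout.strip()
print(output)
```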

If you have something like packet_data = "This is data", then do this on the next line, right after initializing packet_data:

unic = u''
unic += packet_data
packet_data = unic

Update for Python 3.0 and later. Try the following in the Python editor:

locale-gen en_US.UTF-8
export LANG=en_US.UTF-8 LANGUAGE=en_US.en
LC_ALL=en_US.UTF-8

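This sets the system's default locale encoding to the UTF-8 format.

More can be read at PEP 538: Coercing the legacy C locale to a UTF-8 based locale.

The many answers here (@agf and @Andbdrew, for example) have already addressed the most immediate aspects of the OP's question.

However, I think there is one subtle but important aspect that has been largely ignored and that matters dearly to everyone who, like me, ended up here while trying to make sense of encodings in Python: the way Python 2 and Python 3 manage character representation is wildly different. I feel that a big chunk of the confusion out there comes from people reading about encodings in Python without being aware of the version.

I suggest that anyone interested in understanding the root cause of the OP's problem begin by reading Spolsky's introduction to character representations and Unicode, and then move on to Batchelder on Unicode in Python 2 and Python 3.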
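
The version difference can be seen in a few lines. This sketch is Python 3 only: text (str) and encoded data (bytes) are separate types there, and mixing them raises a TypeError instead of triggering the implicit ASCII conversion that Python 2 attempts.

```python
text = "voil\u00e0"          # str: a sequence of Unicode code points
data = text.encode("utf-8")  # bytes: the UTF-8 encoding of that text

# Five characters, but six bytes, because U+00E0 takes two bytes in UTF-8.
lengths = (len(text), len(data))

# Python 3 refuses to mix the two types.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(lengths)
print(mixed_ok)
```

In Python 2 the equivalent concatenation would silently try to convert with the ASCII codec, which is where errors like the one in the question tend to surface.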

I had this issue when trying to output Unicode characters to stdout, but with sys.stdout.write rather than print (so that I could also support output to a different file).

From BeautifulSoup's own documentation, I solved this with the codecs library:

import sys
import codecs
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4; the original snippet omits this import

def main(fIn, fOut):
    soup = BeautifulSoup(fIn)
    # Do processing, with data including non-ASCII characters
    fOut.write(unicode(soup))

if __name__ == '__main__':
    with (sys.stdin) as fIn: # Don't think we need codecs.getreader here
        with codecs.getwriter('utf-8')(sys.stdout) as fOut:
            main(fIn, fOut)
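
For Python 3, a rough equivalent of the codecs.getwriter pattern is sketched below. It is an assumption about how one might port the idea, not part of the original answer: the writer has to wrap a binary stream, so an io.BytesIO stands in here for what would be sys.stdout.buffer in a real script, and write_utf8 is a hypothetical helper name.

```python
import codecs
import io


def write_utf8(binary_stream, text):
    """Write text to a binary stream, encoding it as UTF-8 on the way out."""
    writer = codecs.getwriter("utf-8")(binary_stream)
    writer.write(text)
    writer.flush()


# io.BytesIO stands in for sys.stdout.buffer or a file opened in 'wb' mode.
buffer = io.BytesIO()
write_utf8(buffer, "voil\u00e0 \u2122")
print(buffer.getvalue())
```

Because the function only needs a binary stream, the same call works for stdout and for a file, which was the motivation given in the answer above.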