UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
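I'm having problems dealing with Unicode characters in text fetched from different web pages (on different sites). I am using BeautifulSoup.
The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes it fails by throwing a UnicodeEncodeError.
One section of the code that causes problems is shown below:

    agent_telno = agent.find('div', 'agent_contact_number')
    agent_telno = '' if agent_telno is None else agent_telno.contents[0]
    p.agent_info = str(agent_contact + ' ' + agent_telno).strip()

Here is a stack trace produced on SOME strings when the snippet above is run:

    Traceback (most recent call last):
      File "foobar.py", line 792, in <module>
        p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption, so there are no issues relating to internationalization or to text written in anything other than English.
Does anyone have any ideas on how to solve this so that I can CONSISTENTLY fix this problem?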
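You need to read the Python Unicode HOWTO. This error is the very first example.
Basically, stop using str() to convert from unicode to encoded text / bytes.
Instead, properly use .encode() to encode the string:

    p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()

Or work entirely in Unicode.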
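A minimal sketch of "working entirely in Unicode" and encoding only at the output boundary. It is written so that it also runs on Python 3, where str is already Unicode text. The helper name build_agent_info and the sample values are hypothetical and not taken from the question's code.

```python
def build_agent_info(agent_contact, agent_telno):
    """Join two text values and strip surrounding whitespace, staying in Unicode."""
    return u' '.join((agent_contact, agent_telno)).strip()


# Sample values are made up; u'\xa0' is the non-breaking space from the traceback.
info = build_agent_info(u'Jane Doe', u'020\xa07946\xa00000')

# Encode only at the boundary (file, socket, terminal), naming the codec explicitly.
encoded = info.encode('utf-8')
print(repr(encoded))
```

Keeping the value as text until the last moment means there is a single, visible place where the codec is chosen.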
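This is a classic Python Unicode pain point! Consider the following:

    a = u'bats\u00E0'
    print a
     => batsà

All good so far, but if we call str(a), let's see what happens:

    str(a)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

Oh dip, that's not going to do anyone any good! To fix the error, encode the bytes explicitly with .encode and tell Python which codec to use:

    a.encode('utf-8')
     => 'bats\xc3\xa0'
    print a.encode('utf-8')
     => batsà

Voil\u00e0!
The issue is that when you call str(), Python 2 uses the default encoding (ASCII) to encode the unicode object you gave it, and in your case that object sometimes contains non-ASCII characters. To fix the problem, you have to tell Python how to deal with the string you give it by using .encode('whatever_unicode'). Most of the time, you should be fine using UTF-8.
For an excellent exposition on this topic, see Ned Batchelder's PyCon talk: http://nedbatchelder.com/text/unipain.html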
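The failure and the fix can be reproduced without str() by calling .encode directly, which behaves the same on Python 2 and Python 3. This is only an illustrative sketch; the variable names are arbitrary.

```python
text = u'bats\u00e0'

# Encoding with the ASCII codec fails for any code point above 127.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    print('ascii failed:', exc.reason)

# Naming a codec that covers the character succeeds and round-trips.
utf8_bytes = text.encode('utf-8')
print(repr(utf8_bytes))
print(utf8_bytes.decode('utf-8') == text)
```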
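I found an elegant workaround for my case: remove the symbols and keep the string a string, as follows:

    yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')

It's important to note that using the ignore option is dangerous, because it silently drops any Unicode (and internationalization) support from the code that uses it, as seen here:

    >>> u'City: Malmö'.encode('ascii', 'ignore').decode('ascii')
    'City: Malm'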
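For comparison, here is a small sketch of what the standard error handlers of .encode do with the same input. The handler names are part of Python's codecs machinery; the results dictionary is just for illustration.

```python
city = u'City: Malm\u00f6'

results = {}
for handler in ('ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace'):
    # Each handler decides what happens to characters ASCII cannot represent.
    results[handler] = city.encode('ascii', handler).decode('ascii')
    print(handler, '->', results[handler])
```

Only 'ignore' loses the character without leaving a trace; the other handlers leave a visible marker where data was changed.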
Well, I tried everything and nothing helped. After googling around, I found the following, and it worked. Python 2.7 is in use.

    # encoding=utf8
    import sys
    reload(sys)
    sys.setdefaultencoding('utf8')
A subtle problem that causes even print to fail is having your environment variables set wrong, e.g. here LC_ALL is set to "C". In Debian they discourage setting it: Debian wiki on Locale

    $ echo $LANG
    en_US.utf8
    $ echo $LC_ALL
    C
    $ python -c "print (u'voil\u00e0')"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
    $ export LC_ALL='en_US.utf8'
    $ python -c "print (u'voil\u00e0')"
    voilà
    $ unset LC_ALL
    $ python -c "print (u'voil\u00e0')"
    voilà
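To see which encodings the interpreter actually picked up from the environment, a short check like the following may help. It is a sketch using only the standard library; the function name report_encodings is hypothetical.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python will use for implicit conversions and printing."""
    return {
        'default': sys.getdefaultencoding(),
        'preferred': locale.getpreferredencoding(False),
        'stdout': getattr(sys.stdout, 'encoding', None),
    }


for name, value in sorted(report_encodings().items()):
    print('{0:<10} {1}'.format(name, value))
```

An ASCII-only value for stdout is consistent with the print failure shown in the session above.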
I've actually found that in most of my cases, just stripping out those characters is much simpler:

    s = mystring.decode('ascii', 'ignore')
For me, what worked was:

    BeautifulSoup(html_text, from_encoding="utf-8")

Hope this helps someone.
Try this, it might solve the problem:

    # encoding=utf8
    import sys
    reload(sys)
    sys.setdefaultencoding('utf8')
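The problem is that you're trying to print a Unicode character, but your terminal doesn't support it.
You can try installing the language-pack-en package to fix that:

    sudo apt-get install language-pack-en

It provides English translation data updates for all supported packages (including Python). Install a different language package if necessary (depending on which characters you're trying to print).
On some Linux distributions it's required in order to make sure that the default English locales are set up properly (so that Unicode characters can be handled by the shell/terminal). Sometimes it's easier to install it than to configure it manually.
Then, when writing the code, make sure you use the right encoding in your code.
For example:

    open(foo, encoding='utf-8')

If you still have a problem, double-check your system configuration, such as: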
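A round-trip sketch of opening files with an explicit encoding. It uses io.open, which accepts the encoding argument on both Python 2 and Python 3 (the built-in open in Python 2 does not). The file name sample.txt and the sample text are made up for illustration.

```python
import io
import os
import tempfile

text = u'voil\u00e0 \u2122'
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')  # hypothetical file name

# Write text; the file object encodes it to UTF-8 bytes on the way out.
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(text)

# Read it back; the file object decodes the UTF-8 bytes into text again.
with io.open(path, 'r', encoding='utf-8') as handle:
    round_tripped = handle.read()

print(round_tripped == text)
```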
- Your locale file (/etc/default/locale), which should have e.g.

      LANG="en_US.UTF-8"
      LC_ALL="en_US.UTF-8"

  or:

      LC_ALL=C.UTF-8
      LANG=C.UTF-8

- The values of LANG and LC_CTYPE in the shell.
- Which locales your shell supports, checked with:

      locale -a | grep "UTF-8"
Demonstrating the problem and the solution in a fresh VM.
Initialize and provision the VM (e.g. using vagrant):

    vagrant init ubuntu/trusty64; vagrant up; vagrant ssh

See: available Ubuntu boxes.
Printing Unicode characters (such as the trade mark sign, ™):

    $ python -c 'print(u"\u2122");'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)

Now installing language-pack-en:

    $ sudo apt-get -y install language-pack-en
    The following extra packages will be installed:
      language-pack-en-base
    Generating locales...
      en_GB.UTF-8... /usr/sbin/locale-gen: done
    Generation complete.

Now the problem should be solved:

    $ python -c 'print(u"\u2122");'
    ™

Otherwise, try the following command:

    $ LC_ALL=C.UTF-8 python -c 'print(u"\u2122");'
    ™
Add the line below at the beginning of your script (or as the second line):

    # -*- coding: utf-8 -*-

That's the definition of the Python source code encoding. More info in PEP 263.
Here's a rehashing of some other so-called "cop out" answers. There are situations in which simply throwing away the troublesome characters/strings is a good solution, despite the protests voiced here.

    def safeStr(obj):
        try:
            return str(obj)
        except UnicodeEncodeError:
            # obj is unicode with non-ASCII characters: drop them
            return obj.encode('ascii', 'ignore').decode('ascii')
        except Exception:
            # anything else that cannot be converted becomes an empty string
            return ""

Testing:

    if __name__ == '__main__':
        print safeStr( 1 )
        print safeStr( "test" )
        print u'98\xb0'
        print safeStr( u'98\xb0' )

Results:

    1
    test
    98°
    98

Suggestion: you might want to ...
This was written for Python 2. For Python 3, I believe you'll want to use ...
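In the shell:
Find the supported UTF-8 locales with the following command:

    locale -a | grep "UTF-8"

Export one of them before running the script, e.g.:

    # take the first match, since several UTF-8 locales may be listed
    export LC_ALL=$(locale -a | grep UTF-8 | head -n 1)

or set it manually, like:

    export LC_ALL=C.UTF-8

Test it by printing a special character, e.g. ™:

    python -c 'print(u"\u2122");'

The above was tested on Ubuntu.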
I always put the two lines below at the top of my Python files:

    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
Simple helper functions, found here.

    def safe_unicode(obj, *args):
        """ return the unicode representation of obj """
        try:
            return unicode(obj, *args)
        except UnicodeDecodeError:
            # obj is byte string
            ascii_text = str(obj).encode('string_escape')
            return unicode(ascii_text)

    def safe_str(obj):
        """ return the byte string representation of obj """
        try:
            return str(obj)
        except UnicodeEncodeError:
            # obj is unicode
            return unicode(obj).encode('unicode_escape')
Just call encode('utf-8') on the variable:

    agent_contact.encode('utf-8')
The solution below worked for me. I just added u"String" (the u prefix, which marks the string as Unicode) before my string.

    result_html = result.to_html(col_space=1, index=False, justify={'right'})

    text = u"""
    <html>
    <body>
    <p>
    Hello all,
    Here's weekly summary report. Let me know if you have any questions.
    Data Summary
    {0}
    </p>
    <p>
    Thanks,
    </p>
    <p>
    Data Team
    </p>
    </body></html>
    """.format(result_html)
I just used the following:

    import unicodedata
    message = unicodedata.normalize("NFKD", message)

Check what the documentation says about it:
unicodedata.normalize(form, unistr)
Return the normal form form for the Unicode string unistr. Valid values
for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.

The Unicode standard defines various normalization forms of a Unicode
string, based on the definition of canonical equivalence and
compatibility equivalence. In Unicode, several characters can be
expressed in various way. For example, the character U+00C7 (LATIN
CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence
U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

For each character, there are two normal forms: normal form C and
normal form D. Normal form D (NFD) is also known as canonical
decomposition, and translates each character into its decomposed form.
Normal form C (NFC) first applies a canonical decomposition, then
composes pre-combined characters again.

In addition to these two forms, there are two additional normal forms
based on compatibility equivalence. In Unicode, certain characters are
supported which normally would be unified with other characters. For
example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049
(LATIN CAPITAL LETTER I). However, it is supported in Unicode for
compatibility with existing character sets (e.g. gb2312).

The normal form KD (NFKD) will apply the compatibility decomposition,
i.e. replace all compatibility characters with their equivalents. The
normal form KC (NFKC) first applies the compatibility decomposition,
followed by the canonical composition.

Even if two unicode strings are normalized and look the same to a
human reader, if one has combining characters and the other doesn’t,
they may not compare equal.
That solves it for me. Simple and easy.
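A short sketch of what NFKD does to three characters mentioned in this thread, and why it can help with the u'\xa0' from the question. The sample message value is made up; everything else is the standard unicodedata module.

```python
import unicodedata

# Compatibility decomposition (NFKD) of three characters mentioned in this thread.
nbsp = unicodedata.normalize('NFKD', u'\u00a0')       # NO-BREAK SPACE
cedilla = unicodedata.normalize('NFKD', u'\u00c7')    # LATIN CAPITAL LETTER C WITH CEDILLA
roman_one = unicodedata.normalize('NFKD', u'\u2160')  # ROMAN NUMERAL ONE

print([hex(ord(c)) for c in nbsp])
print([hex(ord(c)) for c in cedilla])
print([hex(ord(c)) for c in roman_one])

# Normalizing first lets a later ASCII encode keep the space instead of dropping it.
message = u'020\u00a07946'  # made-up sample value
ascii_message = unicodedata.normalize('NFKD', message).encode('ascii', 'ignore').decode('ascii')
print(ascii_message)
```

Note that normalization alone does not make text ASCII: the decomposed form of U+00C7 still contains the combining cedilla U+0327, so encoding it to ASCII would still fail without an error handler.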
We struck this error when ...
Our source contained the ... declaration, the ...
The issue was simply that the Django container (we use Docker) was missing the ...
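I just ran into this problem, and Google led me here, so to add to the general solutions here, this is what worked for me:

    # 'value' contains the problematic data
    unic = u''
    unic += value
    value = unic

I had this idea after reading Ned's presentation.
I don't claim to fully understand why this works, though. So if anyone can edit this answer or put in a comment to explain, I'd appreciate it.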
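As a possible explanation: in Python 2, concatenating u'' with a byte string makes Python decode that byte string implicitly with the default (ASCII) codec, and concatenating it with a unicode object changes nothing, so afterwards value is always unicode. The same idea can be written explicitly. The sketch below is not the answerer's code; the helper name to_text and the UTF-8 default are assumptions.

```python
def to_text(value, encoding='utf-8'):
    """Return value as Unicode text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


print(repr(to_text(b'caf\xc3\xa9')))  # bytes are decoded
print(repr(to_text(u'caf\u00e9')))    # text passes through unchanged
```

Unlike the implicit version, this also handles byte strings that contain non-ASCII data, provided the assumed encoding is the right one.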
Please open a terminal and run the command below:

    export LC_ALL="en_US.UTF-8"

This works in Python 3 at least…
Python 3
Sometimes the error is in the environment variables and the encoding, so:

    import os
    import locale

    os.environ["PYTHONIOENCODING"] = "utf-8"
    myLocale = locale.setlocale(category=locale.LC_ALL, locale="en_GB.UTF-8")
    ...
    print(myText.encode('utf-8', errors='ignore'))

Here errors are ignored during encoding.
If you have something like ... then do this on the next line:

    unic = u''
    packet_data = unic
Update for Python 3.0 and later. Try the following shell commands:

    locale-gen en_US.UTF-8
    export LANG=en_US.UTF-8 LANGUAGE=en_US.en LC_ALL=en_US.UTF-8

This sets the system's default locale encoding to UTF-8.
More can be read in PEP 538, Coercing the legacy C locale to a UTF-8 based locale.
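Many answers here (@agf and @Andbdrew, for example) have already addressed the most immediate aspects of the OP's question.
However, I think there is one subtle but important aspect that has been largely ignored, and that matters dearly to everyone who, like me, ended up here while trying to make sense of encodings in Python: the way Python 2 and Python 3 manage character representation is wildly different. I feel that a big chunk of the confusion out there comes from people reading about encodings in Python without being aware of the version.
I suggest that anyone interested in understanding the root cause of the OP's problem begin by reading Spolsky's introduction to character representations and Unicode, and then move on to Batchelder on Unicode in Python 2 and Python 3.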
I had this issue ...
Following BeautifulSoup's own documentation, I solved this with the codecs library:

    import sys
    import codecs

    from bs4 import BeautifulSoup  # assumes BeautifulSoup 4; the original snippet omitted this import

    def main(fIn, fOut):
        soup = BeautifulSoup(fIn)
        # Do processing, with data including non-ASCII characters
        fOut.write(unicode(soup))

    if __name__ == '__main__':
        with (sys.stdin) as fIn:  # Don't think we need codecs.getreader here
            with codecs.getwriter('utf-8')(sys.stdout) as fOut:
                main(fIn, fOut)
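The effect of codecs.getwriter can be seen in isolation with an in-memory byte stream standing in for stdout. This is a sketch for illustration only; the io.BytesIO stand-in and the sample text are assumptions, not part of the answer above.

```python
import codecs
import io

raw = io.BytesIO()                       # stands in for a byte-oriented stdout
writer = codecs.getwriter('utf-8')(raw)  # wraps the stream; encodes text on write

writer.write(u'voil\u00e0 \u2122\n')
output = raw.getvalue()
print(repr(output))
```

The wrapper accepts Unicode text and hands UTF-8 bytes to the underlying stream, so the caller never relies on the default ASCII codec.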