Can't decode UTF-8 string in Python on OS X Terminal.app
I have Terminal.app set to accept UTF-8. In bash I can type Unicode characters and copy and paste them, but if I start the Python shell I can't, and if I try to decode a unicode object I get an error:
```python
>>> wtf = u'\xe4\xf6\xfc'.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>> wtf = u'\xe4\xf6\xfc'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
```
Does anyone know what I'm doing wrong?
I think there is encode/decode confusion all over the place. You start with a unicode object:
```python
u'\xe4\xf6\xfc'
```
This is a Unicode object; the three characters are the Unicode code points for "äöü". If you want to turn it into UTF-8, you have to encode it:
```python
>>> u'\xe4\xf6\xfc'.encode('utf-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'
```
The resulting six characters are the UTF-8 representation of "äöü".
If you call decode on a unicode object, Python 2 first tries to encode it to a byte string with the ASCII codec so that it has something to decode; that implicit step is what raises the UnicodeEncodeError above, even though you asked for decoding.
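The two directions can be sketched as follows (a minimal example that runs unchanged under Python 2 and 3; the codec names are the standard ones):

```python
# unicode -> bytes is encode; bytes -> unicode is decode.
text = u'\xe4\xf6\xfc'            # the code points for "äöü"

data = text.encode('utf-8')       # encode: unicode object -> UTF-8 bytes
assert data == b'\xc3\xa4\xc3\xb6\xc3\xbc'

back = data.decode('utf-8')       # decode: UTF-8 bytes -> unicode object
assert back == text
```

As long as each call goes in the right direction, the round trip is lossless.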
Further confusion may come from the fact that Latin-1 maps the byte values 0-255 one-to-one onto the Unicode code points 0-255, so decoding those byte values as Latin-1 gives back a unicode object with the same numbers:
```python
>>> '\xe4\xf6\xfc'.decode('latin1')
u'\xe4\xf6\xfc'
```
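That identity is easy to check, and the same bytes also show why the choice of codec matters: they happen not to be a valid UTF-8 sequence at all (a small sketch, runnable on Python 2 and 3):

```python
raw = b'\xe4\xf6\xfc'                      # Latin-1 bytes for "äöü"

# Latin-1: each byte value maps straight to the code point with the same number.
assert raw.decode('latin1') == u'\xe4\xf6\xfc'

# The same three bytes are not valid UTF-8 (0xe4 starts a three-byte
# sequence, but 0xf6 is not a continuation byte), so decoding fails.
try:
    raw.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
assert not valid_utf8
```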
I think you have encoding and decoding backwards: you encode Unicode into a byte stream, and you decode a byte stream into Unicode.
```python
Python 2.6.1 (r261:67515, Dec  6 2008, 16:42:21)
[GCC 4.0.1 (Apple Computer, Inc. build 5370)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> wtf = u'\xe4\xf6\xfc'
>>> wtf
u'\xe4\xf6\xfc'
>>> print wtf
äöü
>>> wtf.encode('UTF-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'
>>> print '\xc3\xa4\xc3\xb6\xc3\xbc'.decode('utf-8')
äöü
```
```python
>>> wtf = '\xe4\xf6\xfc'
>>> wtf
'\xe4\xf6\xfc'
>>> print wtf
???
>>> print wtf.decode("latin-1")
äöü
>>> wtf_unicode = unicode(wtf.decode("latin-1"))
>>> wtf_unicode
u'\xe4\xf6\xfc'
>>> print wtf_unicode
äöü
```
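The garbage printed by `print wtf` above is the terminal's reaction to Latin-1 bytes arriving on a UTF-8 display. The reverse mix-up, decoding UTF-8 bytes with the wrong codec, produces the classic mojibake and can be sketched like this (runs under Python 2 and 3):

```python
data = u'\xe4\xf6\xfc'.encode('utf-8')     # UTF-8 bytes for "äöü"

# Decoding UTF-8 bytes as Latin-1 turns each individual byte into a character:
mojibake = data.decode('latin-1')
assert mojibake == u'\xc3\xa4\xc3\xb6\xc3\xbc'   # displays as "Ã¤Ã¶Ã¼"
```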
The Unicode Strings section of the introductory tutorial explains this well:
To convert a Unicode string into an 8-bit string using a specific encoding, Unicode objects provide an encode() method that takes one argument, the name of the encoding. Lowercase names for encodings are preferred.
```python
>>> u"äöü".encode('utf-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'
```