Can't decode UTF-8 string in Python on OS X Terminal.app

I have Terminal.app set to accept UTF-8, and in bash I can type Unicode characters and copy and paste them. If I start a Python shell, though, I can't, and when I try to decode Unicode I get errors:

>>> wtf = u'\xe4\xf6\xfc'.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>> wtf = u'\xe4\xf6\xfc'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

Does anyone know what I'm doing wrong?

I think there is encode/decode confusion all over the place. You start with a unicode object:

u'\xe4\xf6\xfc'

This is a unicode object; its three characters are the Unicode code points for "äöü". If you want to turn them into UTF-8, you have to encode them:

>>> u'\xe4\xf6\xfc'.encode('utf-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'

The resulting six bytes are the UTF-8 representation of "äöü".

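If you call decode(...), you are asking Python to interpret the characters as data in some encoding that still needs to be converted to Unicode. Since the object already is Unicode, this doesn't work. Your first call attempts an ASCII-to-Unicode conversion, the second a UTF-8-to-Unicode conversion. Python 2 first has to turn the unicode object into a byte string, which it does implicitly with the default ASCII codec, and that is why both calls report a UnicodeEncodeError from the 'ascii' codec. Since u'\xe4\xf6\xfc' is neither valid ASCII nor valid UTF-8, these conversion attempts fail.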
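The implicit step behind the traceback can be reproduced by hand. The following is a minimal sketch, not the interpreter's actual code path: it performs the ASCII encode explicitly and then the UTF-8 encode that was presumably intended. It is written to run on Python 2.6+ as well as Python 3, and the names `text`, `error_name` and `utf8_bytes` are only illustrative.

```python
text = u'\xe4\xf6\xfc'  # unicode text, three non-ASCII code points

# What Python 2 does implicitly before decoding a unicode object:
# encode it to bytes with the ASCII codec, which cannot represent these characters.
try:
    text.encode('ascii')
    error_name = None
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__
    print(exc)

# The conversion that goes in the right direction: text -> UTF-8 bytes.
utf8_bytes = text.encode('utf-8')
print(repr(utf8_bytes))
```

The exception raised by the explicit encode is of the same type as the one in the question, which is a hint that the failure happens before any decoding takes place.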

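Further confusion might come from the fact that '\xe4\xf6\xfc' is also the Latin-1/ISO-8859-1 encoding of "äöü". If you write a normal Python string (without the leading "u" that marks it as unicode), you can convert it to a unicode object with decode('latin1'):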

>>> '\xe4\xf6\xfc'.decode('latin1')
u'\xe4\xf6\xfc'
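To make the distinction concrete, here is a small sketch showing that the same three bytes decode cleanly as Latin-1 but are rejected as UTF-8. It uses a `b'...'` literal so that it behaves the same on Python 2.6+ and Python 3; the variable names are illustrative only.

```python
raw = b'\xe4\xf6\xfc'  # three raw bytes with no encoding attached

# Latin-1 maps every byte to the code point with the same number.
as_latin1 = raw.decode('latin-1')

# The same bytes do not form a valid UTF-8 sequence.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(as_latin1 == u'\xe4\xf6\xfc')
print(utf8_ok)
```

Which codec is right depends entirely on how the bytes were produced, so the encoding has to be known rather than guessed from the bytes alone.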

I think you have encoding and decoding backwards. You encode Unicode into a byte stream, and decode the byte stream into Unicode.

Python 2.6.1 (r261:67515, Dec 6 2008, 16:42:21)
[GCC 4.0.1 (Apple Computer, Inc. build 5370)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> wtf = u'\xe4\xf6\xfc'
>>> wtf
u'\xe4\xf6\xfc'
>>> print wtf
äöü
>>> wtf.encode('UTF-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'
>>> print '\xc3\xa4\xc3\xb6\xc3\xbc'.decode('utf-8')
äöü
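The encode-then-decode direction shown in the session above can be wrapped in a small helper. This is only a sketch: `round_trip` is a hypothetical name, not a standard library function, and the code is meant to run on Python 2.6+ and Python 3 alike.

```python
def round_trip(text, encoding='utf-8'):
    """Encode unicode text to bytes, then decode those bytes back to text."""
    data = text.encode(encoding)         # unicode -> bytes
    return data, data.decode(encoding)   # bytes -> unicode

data, restored = round_trip(u'\xe4\xf6\xfc')
print(len(data))                     # number of bytes in the UTF-8 form
print(restored == u'\xe4\xf6\xfc')   # the text survives the round trip
```

As long as the same encoding is used in both directions, the decoded result equals the original text.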

>>> wtf = '\xe4\xf6\xfc'
>>> wtf
'\xe4\xf6\xfc'
>>> print wtf
???
>>> print wtf.decode("latin-1")
äöü
>>> wtf_unicode = unicode(wtf.decode("latin-1"))
>>> wtf_unicode
u'\xe4\xf6\xfc'
>>> print wtf_unicode
äöü

The Unicode Strings section of the introductory tutorial explains this well:

To convert a Unicode string into an 8-bit string using a specific encoding, Unicode objects provide an encode() method that takes one argument, the name of the encoding. Lowercase names for encodings are preferred.

>>> u"äöü".encode('utf-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'
