Python: Decoding if it's not unicode

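I want my function to accept an argument that may be either a unicode object or a UTF-8 encoded string. Inside the function, I want to convert the argument to unicode. I have something like this: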

def myfunction(text):
    if not isinstance(text, unicode):
        text = unicode(text, 'utf-8')

    ...

Is it possible to avoid using isinstance? I'm looking for something more duck-typing friendly.

While experimenting with decoding, I ran into some odd behavior in Python. For example:

>>> u'hello'.decode('utf-8')
u'hello'
>>> u'cer\xf3n'.decode('utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 3: ordinal not in range(128)

Or:

>>> u'hello'.decode('utf-8')
u'hello'
>>> unicode(u'hello', 'utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: decoding Unicode is not supported

By the way, I'm using Python 2.6.

Related discussion

You could try decoding it with the 'utf-8' codec and, if that doesn't work, return the object unchanged.

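def myfunction(text):
    try:
        # Succeeds for UTF-8 encoded byte strings.
        text = unicode(text, 'utf-8')
    except TypeError:
        # text is already a unicode object; return it unchanged.
        return text
    return text

print(myfunction(u'cer\xf3n'))
# cerón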

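When you take a unicode object and call its decode method with the 'utf-8' codec, Python 2 first tries to convert the unicode object to a string (str) object, and then calls that string object's decode('utf-8') method.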

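This implicit conversion from unicode object to string object fails whenever the text contains non-ASCII characters, because Python 2 uses the ASCII codec by default.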
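The two-step behavior described here can be imitated explicitly. The sketch below is only an illustration and not CPython's actual implementation: the helper name `py2_style_decode` is made up, and the code is written so that it also runs on Python 3, where `str` is the Unicode type.

```python
def py2_style_decode(text, encoding='utf-8'):
    """Imitate what Python 2 does for unicode_obj.decode(encoding).

    Hypothetical helper for illustration; not part of any library.
    """
    # Step 1: the implicit conversion to a byte string, using the
    # default ASCII codec. Raises UnicodeEncodeError for non-ASCII text.
    raw = text.encode('ascii')
    # Step 2: the decode that was actually requested.
    return raw.decode(encoding)


# Pure ASCII text survives the round trip.
print(py2_style_decode(u'hello'))

try:
    py2_style_decode(u'cer\xf3n')
except UnicodeEncodeError as exc:
    # Same exception class as in the traceback from the question.
    print('failed: %s' % exc)
```

This is why `u'hello'.decode('utf-8')` appears to work while `u'cer\xf3n'.decode('utf-8')` raises an *encode* error: the failure happens in the hidden first step, before any decoding takes place.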

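So, in general, never try to decode unicode objects. Or, if you must try, wrap the attempt in a try..except block. There may be a few codecs for which decoding unicode objects works in Python 2 (see below), but they have been removed in Python 3.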

For an interesting discussion of this issue, see this Python bug ticket, and also Guido van Rossum's blog:

"We are adopting a slightly different approach to codecs: while in Python 2, codecs can accept either Unicode or 8-bits as input and produce either as output, in Py3k, encoding is always a translation from a Unicode (text) string to an array of bytes, and decoding always goes the opposite direction. This means that we had to drop a few codecs that don't fit in this model, for example rot13, base64 and bz2 (those conversions are still supported, just not through the encode/decode API)."
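As a rough illustration of the last sentence of the quote, the sketch below shows one way those conversions can still be reached in Python 3 without `str.encode()` or `bytes.decode()`. It assumes Python 3.4 or later and uses only the standard `codecs` and `base64` modules; the variable names are arbitrary.

```python
import base64
import codecs

# rot13 is a text-to-text transform, so it no longer fits str.encode(),
# but it is still reachable through the codecs module.
rotated = codecs.encode('hello', 'rot_13')
print(rotated)

# base64 is a bytes-to-bytes transform; the dedicated module handles it.
packed = base64.b64encode(b'hello')
print(packed)

# Decoding restores the original bytes.
restored = base64.b64decode(packed)
print(restored)
```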

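I don't know of a good way to avoid the isinstance check in your function, but maybe someone else does. I can point out that the two oddities you mention arise because you are doing something that doesn't make sense: trying to decode into unicode something that has already been decoded into unicode.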

The first should look like this, which decodes the UTF-8 encoding of that string into its unicode version:

>>> 'cer\xc3\xb3n'.decode('utf-8')
u'cer\xf3n'

The second should look like this (without using the u'' unicode string literal):

>>> unicode('hello', 'utf-8')
u'hello'
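Putting these points together, one possible shape for the function from the question is sketched below. It still relies on isinstance, since no cleaner alternative came up in this thread, but it checks for the byte-string type rather than for unicode, which should let the same code run on Python 2.6+ and on Python 3. The name `to_unicode` and the `encoding` parameter are assumptions made for illustration.

```python
def to_unicode(text, encoding='utf-8'):
    """Return text as a Unicode string, decoding byte strings if needed.

    Hypothetical helper; the name and signature are illustrative only.
    """
    # `bytes` is an alias for `str` on Python 2.6+ and the distinct
    # byte-string type on Python 3, so this check covers both.
    if isinstance(text, bytes):
        return text.decode(encoding)
    # Already Unicode: never decode it a second time.
    return text


decoded = to_unicode(b'cer\xc3\xb3n')   # UTF-8 bytes are decoded
unchanged = to_unicode(u'cer\xf3n')     # Unicode passes through untouched
print(decoded == unchanged)
```

Only byte strings are ever decoded here, so the implicit ASCII step that caused the errors in the question is never triggered.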