Decoding if it's not unicode
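I want my function to accept an argument that may be either a unicode object or a UTF-8 encoded string. Inside the function, I want to convert the argument to unicode. I have something like this: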
    def myfunction(text):
        if not isinstance(text, unicode):
            text = unicode(text, 'utf-8')

        ...
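Is it possible to avoid the use of isinstance? I am looking for something more duck-typing friendly.
During my experiments with decoding, I ran into some strange Python behavior. For example: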
    >>> u'hello'.decode('utf-8')
    u'hello'
    >>> u'cer\xf3n'.decode('utf-8')
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
      File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 3: ordinal not in range(128)
Or
    >>> u'hello'.decode('utf-8')
    u'hello'
    >>> unicode(u'hello', 'utf-8')
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
    TypeError: decoding Unicode is not supported
By the way, I'm using Python 2.6.
You could just try decoding it with the 'utf-8' codec and, if that does not work, return the object unchanged.
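    def myfunction(text):
        try:
            # Decode a UTF-8 encoded byte string into a unicode object.
            text = unicode(text, 'utf-8')
        except TypeError:
            # text is already a unicode object, so leave it unchanged.
            pass
        return text

    print(myfunction(u'cer\xf3n'))
    # cerón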
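When you take a unicode object and call its decode method with the 'utf-8' codec, Python 2 first tries to convert the unicode object to a string object, and then calls that string object's decode('utf-8') method.
Sometimes the conversion from unicode object to string object fails, because Python 2 uses the ascii codec by default.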
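The two implicit steps can be spelled out by hand. This is only a sketch: py2_style_decode is a hypothetical helper name, not something Python provides, and it mimics rather than reproduces the interpreter's internals. It should behave the same way on Python 2.6+ and Python 3.

```python
def py2_style_decode(text, encoding='utf-8'):
    """Hypothetical helper: spell out what Python 2 does for unicode.decode()."""
    # Step 1: implicit conversion to a byte string with the default ascii codec.
    as_bytes = text.encode('ascii')
    # Step 2: the decode that was actually requested.
    return as_bytes.decode(encoding)


# Pure ASCII text survives step 1, so the round trip succeeds.
print(repr(py2_style_decode(u'hello')))

try:
    py2_style_decode(u'cer\xf3n')
except UnicodeEncodeError as exc:
    # Step 1 fails before any UTF-8 decoding is attempted.
    print('step 1 failed: %s' % exc)
```

This is why the traceback in the question reports an encode error from the 'ascii' codec even though the call was to decode with 'utf-8'.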
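So, in general, never try to decode unicode objects. Or, if you must try, trap the attempt in a try..except block. There may be a few codecs for which decoding unicode objects works in Python 2 (see below), but they have been removed in Python 3.
For an interesting discussion of this issue, see this Python bug ticket, and also Guido van Rossum's blog: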
"We are adopting a slightly different
approach to codecs: while in Python 2,
codecs can accept either Unicode or
8-bits as input and produce either as
output, in Py3k, encoding is always a
translation from a Unicode (text)
string to an array of bytes, and
decoding always goes the opposite
direction. This means that we had to
drop a few codecs that don't fit in
this model, for example rot13, base64
and bz2 (those conversions are still
supported, just not through the
encode/decode API)."
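As a rough illustration of the last sentence of that quote, the conversions it names remain reachable through module-level functions. This sketch assumes a reasonably recent interpreter (the 'rot_13' codec name and the codecs.encode call are standard library, but availability has varied between Python 3 releases); the variable names are only for illustration.

```python
import base64
import bz2
import codecs

# rot13 is a text-to-text transform, reached through codecs.encode
# rather than through a str.encode() method call.
rot13_text = codecs.encode('hello', 'rot_13')

# base64 and bz2 are bytes-to-bytes transforms with their own modules.
b64_bytes = base64.b64encode(b'hello')
roundtrip = bz2.decompress(bz2.compress(b'hello'))

print(rot13_text)
print(b64_bytes)
print(roundtrip)
```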
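I'm not aware of any good way to avoid the isinstance check in your function.
Your first example should instead look like this, which decodes the UTF-8 encoding of that string into the unicode version: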
    >>> 'cer\xc3\xb3n'.decode('utf-8')
    u'cer\xf3n'
And your second should look like this (not using a u'' unicode literal):
    >>> unicode('hello', 'utf-8')
    u'hello'
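Both calls above decode a byte string, never a unicode object. One way to build the conversion around that rule is sketched below. It does not avoid isinstance; it only inverts the check so that anything still encoded gets decoded. The name to_unicode is hypothetical, and the sketch assumes Python 2.6 or later, where bytes exists as an alias of str.

```python
def to_unicode(text, encoding='utf-8'):
    """Hypothetical helper: return text as a unicode string."""
    # bytes is an alias of str on Python 2.6+ and a distinct type on Python 3,
    # so this check means "still encoded" on both.
    if isinstance(text, bytes):
        return text.decode(encoding)
    # Already decoded: decoding again is what triggers the errors shown earlier.
    return text


print(repr(to_unicode(b'cer\xc3\xb3n')))  # UTF-8 bytes are decoded
print(repr(to_unicode(u'cer\xf3n')))      # unicode input is returned as is
```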