Python: Sanitize a string for unicode?

Possible Duplicate:
Python UnicodeDecodeError - Am I misunderstanding encode?

I have a string that I'm trying to make safe for the unicode() function:

>>> s = ' foo “bar bar ” weasel'
>>> s.encode('utf-8', 'ignore')

Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    s.encode('utf-8', 'ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 5: ordinal not in range(128)
>>> unicode(s)

Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    unicode(s)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 5: ordinal not in range(128)

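I'm mostly flailing around here. What do I need to do to remove the unsafe characters from the string?

This is somewhat related to the question linked above, although I couldn't solve my problem from it.

This also fails: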

>>> s
' foo \x93bar bar \x94 weasel'
>>> s.decode('utf-8')

Traceback (most recent call last):
  File "<pyshell#13>", line 1, in <module>
    s.decode('utf-8')
  File "C:\Python25\254\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 5: unexpected code byte

Related discussion

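Good question. Encoding issues are tricky. Let's start with "I have a string". Strings in Python 2 aren't really "strings"; they're byte arrays. So where did your string come from, and what encoding is it in? Your example shows curly quotes in the literal, and I'm not even sure how you did that. I tried pasting it into a Python interpreter, and typing it on OS X with Option-[, but it didn't come through.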

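Looking at your second example, though, you have a byte with hex value 93. That can't be UTF-8: in UTF-8 every byte above 127 belongs to a multibyte sequence, and 0x93 is a continuation byte that cannot appear on its own right after a plain ASCII space. So my guess is that it's meant to be Latin-1. The problem is that \x93 isn't a printable character in Latin-1. The range from \x80 to \x9f is reserved there for control codes and holds no printable characters. Microsoft saw that unused range and decided to put "curly quotes" in it. In doing so they created a similar encoding, "windows-1252", which is like Latin-1 but with printable characters in that range.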
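The byte-level claims above can be checked directly. The following is a minimal sketch that assumes Python 3 (the session in the question is Python 2.5), where the raw data is written as a bytes literal. The variable names are illustrative only.

```python
# Python 3 sketch: the same bytes interpreted under three encodings.
raw = b' foo \x93bar bar \x94 weasel'

# 0x93 is a UTF-8 continuation byte with no lead byte in front of it,
# so strict UTF-8 decoding rejects the data.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps every byte to the code point with the same number, so
# decoding never fails, but 0x93 becomes the control character U+0093.
as_latin1 = raw.decode('latin-1')

# Windows-1252 maps 0x93 and 0x94 to the curly quotes U+201C and U+201D.
as_cp1252 = raw.decode('windows-1252')

print(utf8_ok)
print(repr(as_latin1))
print(repr(as_cp1252))
```

Note that decoding as Latin-1 succeeds silently, which is why guessing Latin-1 for Windows-1252 data tends to produce invisible control characters rather than an error.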

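So, let's assume it's Windows-1252. Now what? decode converts bytes to unicode, so that's the one you want. Your second example was the right approach, but it failed because the string isn't UTF-8. Try: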

>>> uni = 'foo \x93bar bar\x94 weasel'.decode("windows-1252")
>>> uni
u'foo \u201cbar bar\u201d weasel'
>>> print uni
foo “bar bar” weasel
>>> type(uni)
<type 'unicode'>

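That's correct, because the opening curly quote is Unicode U+201C. Now that you have unicode, you can serialize it to bytes in any encoding you choose (if you need to send it over the wire), or just keep it as unicode if it stays inside Python. If you want to convert to UTF-8, use the opposite operation, the encode method of the unicode object.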

>>> uni.encode("utf-8")
'foo \xe2\x80\x9cbar bar\xe2\x80\x9d weasel'

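Curly quotes take 3 bytes each to encode in UTF-8. You could use UTF-16, where they take only two bytes. You can't encode them as ASCII or Latin-1, though, because those encodings have no curly quotes.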
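A short sketch of these size and encodability claims, again assuming Python 3. The helper can_encode is hypothetical and exists only for this illustration. The last two lines go beyond the answer above: if the goal really is to drop characters that a target encoding cannot represent, as the question asks, the standard 'ignore' and 'replace' error handlers are one possible way to do it, at the cost of losing information.

```python
# Python 3 sketch: byte cost of a curly quote, and lossy ASCII fallbacks.
text = 'foo \u201cbar bar\u201d weasel'

utf8_len = len('\u201c'.encode('utf-8'))
# 'utf-16-le' is used so that no byte-order mark is counted.
utf16_len = len('\u201c'.encode('utf-16-le'))

def can_encode(s, encoding):
    """Hypothetical helper: True if s encodes strictly in the given encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(utf8_len, utf16_len)
print(can_encode(text, 'ascii'), can_encode(text, 'latin-1'))

# Lossy options: 'ignore' drops unencodable characters,
# 'replace' substitutes a question mark for each of them.
dropped = text.encode('ascii', 'ignore')
replaced = text.encode('ascii', 'replace')
print(dropped)
print(replaced)
```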