关于python：检测unicode字符串中的非ascii字符

Detecting non-ascii characters in unicode string

本问题已经有最佳答案，请猛点这里访问。

对于文本文件(或Unicode字符串)，什么是检测不属于ASCII编码的字符的好方法？我可以很容易地迭代将每个字符传递给ord()，但是我想知道是否有一种更有效、更优雅或更惯用的方法来实现它。

这里的最终目标是编译数据中不能编码为ASCII的字符列表。

如果重要的话，我的语料库的大小大约是500MB/1200个文本文件。在Win7(64位)上运行(预编译的香草)python 3.3.1。

相关讨论

The ultimate goal here is to compile a list of characters in the data
that cannot encode to ascii.

我能想到的最有效的方法是使用re.sub()去掉任何有效的ASCII字符，这将给您留下一个包含所有非ASCII字符的字符串。

这只会去掉可打印字符…

1
2
3

>>> import re
>>> print re.sub('[ -~]', '', u'￡100 is worth more than €100')
￡€

…或者如果要包括不可打印的字符，请使用此…

1 2	>>> print re.sub('[\x00-\x7f]', '', u'￡100 is worth more than €100') ￡€

为了消除重复，只需创建返回字符串的set()…

1 2	>>> print set(re.sub('[\x00-\x7f]', '', u'￡€￡€')) set([u'\xa3', u'\u20ac'])