Python: efficient method to replace accents (é to e), remove [^a-zA-Z\d\s], and lower()

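Using Python 3.3. I want to do the following:

Replace special alphabetical characters such as e acute (é) and o circumflex (ô) with the base character (ô to o, for example)
Remove all characters except alphanumeric characters and the spaces between them
Convert to lowercase

This is what I have so far: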

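import re
# Replace each accented character by hand, then lowercase
mystring_modified = mystring.replace('\u00E9', 'e').replace('\u00F4', 'o').lower()
# Strip everything that is not an ASCII letter, a digit, or whitespace
alphnumspace = re.compile(r"[^a-zA-Z\d\s]")
mystring_modified = alphnumspace.sub('', mystring_modified)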

How can I improve this? Efficiency is a big concern, especially since I am currently performing these operations inside a loop:

# Pseudocode
for mystring in myfile:
    mystring_modified = ...  # operations described above
    mylist.append(mystring_modified)

Each file is around 200,000 characters.

Related discussion

>>> import unicodedata
>>> s = 'éô'
>>> ''.join((c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn'))
'eo'
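
Putting the pieces together, a minimal sketch of the whole clean-up might look like the following. The helper name `normalize_text` and the sample strings are illustrative and not from the question; the pattern is the asker's own, compiled once outside the loop so it is not rebuilt for every line.

```python
import re
import unicodedata

# Compile once, outside any loop; this is the pattern from the question.
NON_ALNUM_SPACE = re.compile(r"[^a-zA-Z\d\s]")


def normalize_text(text):
    """Strip accents, drop non-alphanumeric characters, and lowercase.

    `normalize_text` is a hypothetical helper name used for illustration.
    """
    # NFD splits 'é' into 'e' plus a combining acute accent (category Mn).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    # Remove everything that is not an ASCII letter, a digit, or whitespace.
    return NON_ALNUM_SPACE.sub("", stripped).lower()


lines = ["Café, Hôtel!", "Déjà vu: 42"]
cleaned = [normalize_text(line) for line in lines]
print(cleaned)  # ['cafe hotel', 'deja vu 42']
```

Because the pattern already deletes anything that is not an ASCII letter, a digit, or whitespace, the combining marks left behind by NFD would be removed by the regex anyway. In this particular pipeline the explicit category filter could therefore probably be skipped, which saves one pass over the string.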

Also check out Unidecode:

What Unidecode provides is a middle road: function unidecode() takes
Unicode data and tries to represent it in ASCII characters (i.e., the
universally displayable characters between 0x00 and 0x7F), where the
compromises taken when mapping between two character sets are chosen
to be near what a human with a US keyboard would choose.

The quality of resulting ASCII representation varies. For languages of
western origin it should be between perfect and good. On the other
hand transliteration (i.e., conveying, in Roman letters, the
pronunciation expressed by the text in some other writing system) of
languages like Chinese, Japanese or Korean is a very complex issue and
this library does not even attempt to address it. It draws the line at
context-free character-by-character mapping. So a good rule of thumb
is that the further the script you are transliterating is from Latin
alphabet, the worse the transliteration will be.

Note that this module generally produces better results than simply
stripping accents from characters (which can be done in Python with
built-in functions). It is based on hand-tuned character mappings that
for example also contain ASCII approximations for symbols and
non-Latin alphabets.
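
The limit of plain accent stripping can be shown with the standard library alone. The sketch below assumes the NFD approach from earlier in this answer; the helper name `strip_accents` and the sample characters are chosen for illustration. Letters that decompose into a base letter plus a combining mark are handled, but some Latin letters have no such decomposition.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


# 'é' decomposes into 'e' plus a combining accent, so the accent can be dropped.
print(strip_accents("é"))    # e
# 'ø', 'ß' and 'æ' have no canonical decomposition, so they pass through unchanged.
print(strip_accents("øßæ"))  # øßæ
```

Characters like these would then be deleted outright by the `[^a-zA-Z\d\s]` pattern rather than mapped to an ASCII approximation, which is the gap a hand-tuned mapping such as Unidecode's is meant to fill.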