Python: efficient method to replace accents (é to e), remove [^a-zA-Z\d\s], and lower()
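Using Python 3.3. I need to do the following:
- Replace special alphabetical characters such as e acute (é) and o circumflex (ô) with the base character (ô to o, for example)
- Remove all characters except alphanumerics and the spaces between alphanumeric characters
- Convert to lowercase
This is what I have so far: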
```python
import re

mystring_modified = mystring.replace('\u00E9', 'e').replace('\u00F4', 'o').lower()
alphnumspace = re.compile(r"[^a-zA-Z\d\s]")
mystring_modified = alphnumspace.sub('', mystring_modified)
```
How can I improve this? Efficiency is a big concern, especially since I am currently performing the operations inside a loop:
```python
# Pseudocode
for mystring in myfile:
    mystring_modified = ...  # operations described above
    mylist.append(mystring_modified)
```
The files in question are about 200,000 characters each.
```python
>>> import unicodedata
>>> s = 'éô'
>>> ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')
'eo'
```
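The snippet above only handles the accents. A minimal sketch of how it might be combined with the other two requirements from the question follows; the helper name `normalize_text` and the sample string are hypothetical, and the pattern is compiled once outside the loop on the assumption that this saves repeated work.

```python
import re
import unicodedata

# Compiled once so the pattern is not rebuilt on every call.
# re.ASCII limits \d and \s to ASCII digits and whitespace.
NON_ALNUM_SPACE = re.compile(r"[^a-z\d\s]", re.ASCII)

def normalize_text(text):
    """Hypothetical helper: strip accents, lowercase, drop everything else."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (category Mn), which leaves the base letters.
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return NON_ALNUM_SPACE.sub("", stripped.lower())

print(normalize_text("F\u00f4\u00e9 BAR, 123!"))  # foe bar 123
```

Lowercasing before the substitution lets the character class stay at `a-z` instead of `a-zA-Z`.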
Also check out Unidecode:
What Unidecode provides is a middle road: function unidecode() takes
Unicode data and tries to represent it in ASCII characters (i.e., the
universally displayable characters between 0x00 and 0x7F), where the
compromises taken when mapping between two character sets are chosen
to be near what a human with a US keyboard would choose.

The quality of resulting ASCII representation varies. For languages of
western origin it should be between perfect and good. On the other
hand transliteration (i.e., conveying, in Roman letters, the
pronunciation expressed by the text in some other writing system) of
languages like Chinese, Japanese or Korean is a very complex issue and
this library does not even attempt to address it. It draws the line at
context-free character-by-character mapping. So a good rule of thumb
is that the further the script you are transliterating is from Latin
alphabet, the worse the transliteration will be.

Note that this module generally produces better results than simply
stripping accents from characters (which can be done in Python with
built-in functions). It is based on hand-tuned character mappings that
for example also contain ASCII approximations for symbols and
non-Latin alphabets.
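As a rough illustration of the `unidecode()` function described in the quote, here is a minimal sketch. Unidecode is a third-party package, so the import is guarded; the wrapper name `to_ascii` and the fallback to plain accent stripping are assumptions made for this example, not part of the library.

```python
import unicodedata

try:
    # Third-party package, typically installed with: pip install Unidecode
    from unidecode import unidecode
except ImportError:
    unidecode = None

def to_ascii(text):
    """Hypothetical wrapper: use Unidecode if installed, else strip accents."""
    if unidecode is not None:
        return unidecode(text)
    # Fallback: decompose and drop combining marks (category Mn).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(to_ascii("Caf\u00e9 \u00d4tel"))  # Cafe Otel
```

The two paths agree on simple accented Latin letters like these, but the fallback leaves symbols and non-Latin characters untouched, whereas Unidecode attempts an ASCII approximation for them.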
You could use str.translate:
ZZU1
yields

```
123 foe bar
```
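A minimal sketch of what a `str.translate` table for this task could look like is given below. The table contents and the sample string are illustrative assumptions rather than the answer's original code; the sketch relies on the documented behavior that a code point mapped to `None` is deleted.

```python
import collections
import string

# Code points missing from the table map to None, which str.translate deletes.
table = collections.defaultdict(lambda: None)
table.update({
    ord("\u00e9"): "e",  # é
    ord("\u00f4"): "o",  # ô
    ord(" "): " ",
})
# Uppercase and lowercase ASCII letters both map to lowercase.
table.update(zip(map(ord, string.ascii_uppercase), string.ascii_lowercase))
table.update(zip(map(ord, string.ascii_lowercase), string.ascii_lowercase))
table.update(zip(map(ord, string.digits), string.digits))

print("123 f\u00f4\u00e9 BAR!".translate(table))  # 123 foe bar
```

Every accented character to be kept needs its own entry; anything not listed, including the trailing `!`, is dropped.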
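On the downside, you have to list all the special accented characters that you want to translate. @Gnibbler's method requires less coding.
On the upside, the str.translate method should be fairly fast, and it can handle all of your requirements (lowercasing, deleting, and removing accents) in one function call.
By the way, a file with 200,000 characters is not very large. So it would be more efficient to read the entire file into a single str and then translate it in one function call.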
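The whole-file suggestion could be sketched roughly as follows. The helper names `build_table` and `normalize_file`, the UTF-8 default, and the decision to keep newlines in the table are all assumptions made for illustration.

```python
import collections
import string

def build_table():
    """Hypothetical helper: translation table that deletes unlisted characters."""
    table = collections.defaultdict(lambda: None)
    table.update({
        ord("\u00e9"): "e",  # é
        ord("\u00f4"): "o",  # ô
        ord(" "): " ",
        ord("\n"): "\n",     # keep line breaks so the text can be split later
    })
    # ascii_letters is lowercase followed by uppercase; both map to lowercase.
    table.update(zip(map(ord, string.ascii_letters), string.ascii_letters.lower()))
    table.update(zip(map(ord, string.digits), string.digits))
    return table

def normalize_file(path, table, encoding="utf-8"):
    """Read the whole file as one str and translate it in a single call."""
    with open(path, encoding=encoding) as handle:
        return handle.read().translate(table)
```

With a placeholder path such as `myfile.txt`, `normalize_file("myfile.txt", build_table()).splitlines()` would produce a list like the `mylist` built by the loop in the question, without calling the conversion once per line.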