Regular expressions: detecting accents in words (Python)

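Here's the dealio: I wrote a program that finds all the anagram classes in a dictionary. However, I'm having trouble with accented characters. Currently my code reads them in and treats them as if they were invisible, yet it still ends up printing some kind of replacement code of the form "\xc3??". I would like to throw out every word that has an accent, but I don't know how to detect them.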

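Things I have tried:

Checking whether the type is unicode
Using a regex to look for words containing 'xc3'
Decoding/encoding (I don't fully understand Unicode, and nothing I tried worked).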

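Problem/question: I need to know how to detect accents. My program prints accented characters to the command line as weird "\xc3??" sequences, but that cannot be how the program stores them internally, because my search never finds a word containing 'xc3' even though that is what gets printed.

Example: sé -> s\xc3\xa9, and sé and s are treated as anagrams by my program.

Test dictionary: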

stop
tops
pots
hello
world
pit
tip
\xc3\xa9
sé
s
se

Code output:

Found
\xc3\xa9
['pit', 'tip']
['world']
['s\xc3\xa9', 's']
['\\xc3\\xa9']
['stop', 'tops', 'pots']
['se']
['hello']

The program itself:

# Python 2
import re

anadict = {}

for line in open('fakedic.txt'):  # or '/usr/share/dict/words'
    word = line.strip().lower().replace("'", "")
    # Sorted alphanumeric characters serve as the key of the anagram class
    line = ''.join(sorted(ch for ch in word if ch.isalnum()))
    if isinstance(word, unicode):
        print word
        print "UNICODE!"
    # Attempt to spot accented words by searching for the text 'xc3'
    pattern = re.compile(r'xc3')
    if pattern.findall(word):
        print 'Found'
        print word
    if anadict.has_key(line):
        if not (word in anadict[line]):
            anadict[line].append(word)
    else:
        anadict[line] = [word]

for key in anadict:
    if (len(anadict[key]) >= 1):
        print anadict[key]

Help?
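A small sketch of what is probably going on, written for Python 3 and assuming the dictionary file is UTF-8 encoded (the question does not say). Under that assumption, é is stored as the two bytes 0xC3 0xA9, and the `\xc3` in the printed list is only the escaped display of a single byte, not the literal characters x, c, 3. That would explain why a search for 'xc3' only hits the line that literally contains that text, and why sé and s end up with the same key.

```python
import re

# The bytes a Python 2 str would hold for a UTF-8 encoded "sé"
raw = 'sé'.encode('utf-8')
print(raw)        # b's\xc3\xa9'
print(len(raw))   # 3

# Searching for the literal text "xc3" finds nothing ...
print(re.findall(rb'xc3', raw))    # []
# ... while the byte 0xC3 itself is there.
print(re.findall(rb'\xc3', raw))   # [b'\xc3']

# Neither byte of "é" counts as alphanumeric, so the anagram key is just b's'.
key = bytes(sorted(b for b in raw if bytes([b]).isalnum()))
print(key)        # b's'
```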

Related discussion

In the end I used a regular expression (basically checking for anything that is not an alphabetic character):

if re.match('^[a-zA-Z_]+$', word):

This helped me weed out any words with accented characters, digits, or other funky symbols. Not a perfect solution, but it worked.
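A minimal sketch of that filter as a standalone function, written for Python 3 (the program in the question is Python 2). The helper name `keep_plain_words` and the sample list are made up for illustration. It uses the same character class as the one-liner above, with `re.fullmatch` in place of the `^...$` anchors.

```python
import re

# ASCII letters and underscore only: the same character class as above.
PLAIN_WORD = re.compile(r'[a-zA-Z_]+')

def keep_plain_words(words):
    """Return only the words made up of ASCII letters/underscores (hypothetical helper)."""
    return [w for w in words if PLAIN_WORD.fullmatch(w)]

sample = ['stop', 'tops', 'sé', 's', 'se', 'r2d2', "don't"]
print(keep_plain_words(sample))   # ['stop', 'tops', 's', 'se']
```

Note that this also drops words with apostrophes or digits, which may or may not be what you want for an anagram dictionary.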

So basically my answer... see here:

How to check if a string in Python is ASCII?

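The gist is that you can check each character to see whether its ord is less than 128, which tells you whether it is an accented (non-ASCII) character. Or you can use try/except and catch the Unicode errors that are raised when an accented character is encountered. (The latter seems to be the more efficient answer.)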
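Both approaches can be sketched roughly as follows. This is Python 3, where strings are already Unicode, and it assumes the words were read from a file opened with the right encoding (for example `encoding='utf-8'`). The function names are invented for this example.

```python
def is_ascii_by_ord(word):
    """True if every character has a code point below 128."""
    return all(ord(ch) < 128 for ch in word)

def is_ascii_by_encode(word):
    """True if the word can be encoded as ASCII without raising an error."""
    try:
        word.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True

for w in ['stop', 'sé', 's']:
    print(w, is_ascii_by_ord(w), is_ascii_by_encode(w))
```

On Python 3.7 and later, `str.isascii()` performs the same check directly.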

This was a learning experience for me too :) Sorry it took so long.