Python: Auto-correct

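I have two files, check.txt and orig.txt. I want to check every word in check.txt and see whether it matches any word in orig.txt. If it does match, the code should replace that word with its first match; otherwise it should leave the word as it is. But somehow it is not working as required. Please help.

check.txt looks like this: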

ukrain

troop

force

and orig.txt looks like this:

ukraine cnn should stop pretending & announce: we will not report news while it reflects bad on obama @bostonglobe @crowleycnn @hardball

rt @cbcnews: breaking: .@vice journalist @simonostrovsky, held in #ukraine now free and safe http://t.co/sgxbedktlu http://t.co/jduzlg6jou

russia 'outraged' at deadly shootout in east #ukraine - moscow:... http://t.co/nqim7uk7zg
#groundtroops #russianpresidentvladimirputin

http://pastebin.com/xjedhy3g

f = open('check.txt','r')
orig = open('orig.txt','r')
new = open('newfile.txt','w')

for word in f:
    for line in orig:
        for word2 in line.split(" "):
            word2 = word2.lower()
            if word in word2:
                word = word2
            else:
                print('not found')
    new.write(word)

Related discussion

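Your code has two problems:

When you loop over the words in f, each word still ends with a newline character, so your in check (word in word2) never succeeds.

You want to iterate over orig for each word from f, but files are iterators, and orig is exhausted after the first word from f.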
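
Both problems can be reproduced without touching the disk. The following is only a small sketch: io.StringIO objects stand in for the two files, and their contents are shortened, made-up samples rather than the real data.

```python
import io

# In-memory stand-ins for check.txt and orig.txt (sample contents are assumptions).
check = io.StringIO("ukrain\ntroop\n")
orig = io.StringIO("ukraine cnn\n#groundtroops\n")

# Problem 1: a line read from a file keeps its trailing newline.
first = next(check)
raw_hit = first in "ukraine"             # "ukrain\n" is not a substring of "ukraine"
clean_hit = first.strip() in "ukraine"   # stripping the newline makes the check work

# Problem 2: a file object is a one-shot iterator.
lines_first_pass = list(orig)    # reads every line
lines_second_pass = list(orig)   # already exhausted, so nothing is left

print(repr(first), raw_hit, clean_hit)
print(len(lines_first_pass), len(lines_second_pass))
```

Reading orig into a list once, as in lines_first_pass, gives something that can be looped over again for every word from check.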

You can fix these problems by doing word = word.strip() and orig = list(orig), or you can try something like this:

# get all stemmed words
stemmed = [line.strip() for line in f]
# set of lowercased original words
original = set(word.lower() for line in orig for word in line.split())
# map stemmed words to unstemmed words
unstemmed = {word: None for word in stemmed}
# find original words for word stems in map
for stem in unstemmed:
    for word in original:
        if stem in word:
            unstemmed[stem] = word
print(unstemmed)

Or shorter (without the final double loop), using difflib, as suggested in the comments:

import difflib
unstemmed = {word: difflib.get_close_matches(word, original, 1) for word in stemmed}
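
Note that difflib.get_close_matches returns a list (possibly empty) and ranks candidates by similarity ratio rather than by substring containment, so its results can differ from the loop version. A minimal sketch, assuming a small hand-picked candidate list; best_or_same is a hypothetical helper name, not part of difflib:

```python
import difflib

# A few words taken from the sample orig.txt lines.
original = ["ukraine", "cnn", "#ukraine", "#groundtroops"]

# "ukrain" is very similar to "ukraine", so it is returned as a one-element list.
close = difflib.get_close_matches("ukrain", original, n=1)   # ['ukraine']

# "troop" is a substring of "#groundtroops", but the similarity ratio falls
# below the default cutoff of 0.6, so the result is an empty list.
far = difflib.get_close_matches("troop", original, n=1)      # []

# Lowering the cutoff lets the weaker match through.
loose = difflib.get_close_matches("troop", original, n=1, cutoff=0.5)


def best_or_same(word, candidates):
    """Return the closest candidate, or the word itself if nothing is close enough."""
    matches = difflib.get_close_matches(word, candidates, n=1)
    return matches[0] if matches else word


print(close, far, loose, best_or_same("force", original))
```

Falling back to the word itself mirrors the requirement in the question that unmatched words stay unchanged.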

Also, remember to close your files, or use the with keyword to close them automatically.
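
Putting the pieces together, one possible end-to-end version might look like the sketch below. It assumes one word per line in check.txt, treats blank lines as ignorable, and writes one word per line to newfile.txt; the function names first_matches and autocorrect_files are made up for this example.

```python
def first_matches(stems, orig_lines):
    """For each non-blank stem, return the first word containing it, else the stem."""
    # Build the word list once, in reading order, so it can be reused for every stem.
    words = [w.lower() for line in orig_lines for w in line.split()]
    corrected = []
    for stem in stems:
        stem = stem.strip()      # drop the trailing newline
        if not stem:
            continue             # ignore blank lines
        # next() stops at the first hit; the default keeps the word unchanged.
        corrected.append(next((w for w in words if stem in w), stem))
    return corrected


def autocorrect_files(check_path="check.txt", orig_path="orig.txt",
                      out_path="newfile.txt"):
    # with closes all three files, even if an error occurs part-way through.
    with open(check_path) as check, open(orig_path) as orig, \
            open(out_path, "w") as out:
        for word in first_matches(check, orig):
            out.write(word + "\n")


sample = first_matches(
    ["ukrain\n", "\n", "troop\n", "force\n"],
    ["ukraine cnn should stop\n", "#groundtroops #russianpresidentvladimirputin\n"],
)
print(sample)
```

Unlike the set-based version, this keeps the words in a list, so "first match" means first in reading order of orig.txt.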