Python: Auto-correct
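I have two files, check.txt and orig.txt. I want to check every word in check.txt and see whether it matches any word in orig.txt. If it does match, the code should replace that word with its first match; otherwise it should leave the word as it is. But somehow it is not working as required. Please help.
check.txt looks like this: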
    ukrain
    troop
    force
And orig.txt looks like this:
    ukraine
    cnn should stop pretending & announce: we will not report news while it reflects bad on obama @bostonglobe @crowleycnn @hardball
    rt @cbcnews: breaking: .@vice journalist @simonostrovsky, held in #ukraine now free and safe http://t.co/sgxbedktlu http://t.co/jduzlg6jou
    russia 'outraged' at deadly shootout in east #ukraine - moscow:... http://t.co/nqim7uk7zg #groundtroops #russianpresidentvladimirputin
http://pastebin.com/xjedhy3g
    f = open('check.txt','r')
    orig = open('orig.txt','r')
    new = open('newfile.txt','w')

    for word in f:
        for line in orig:
            for word2 in line.split(" "):
                word2 = word2.lower()
                if word in word2:
                    word = word2
                else:
                    print('not found')
        new.write(word)
There are two problems with your code:
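Two issues that can be observed in the question's code, offered here as an assumption about which problems are meant: lines read by iterating over a file keep their trailing newline, so the `in` check fails, and a file object is an iterator that is exhausted after the first pass, so the inner loop sees nothing for the second word onward. A minimal sketch, using `io.StringIO` objects as stand-ins for the real files:

```python
import io

# io.StringIO stands in for the real files so the demo is self-contained.
check = io.StringIO("ukrain\ntroop\nforce\n")
orig = io.StringIO("ukraine\n#groundtroops\n")

# 1) Lines read from a file keep their trailing newline.
first_word = next(check)
has_newline = first_word.endswith("\n")
naive_match = first_word in "ukraine"             # "ukrain\n" is not a substring
stripped_match = first_word.strip() in "ukraine"  # "ukrain" is a substring

# 2) A file object is an iterator: a second pass yields nothing unless rewound.
first_pass = list(orig)
second_pass = list(orig)   # already exhausted
orig.seek(0)               # rewind to the start
third_pass = list(orig)

print(has_newline, naive_match, stripped_match)              # True False True
print(len(first_pass), len(second_pass), len(third_pass))    # 2 0 2
```

Stripping each word and either rewinding the file with `seek(0)` or reading it into memory once would address both points.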
You can fix these by doing:
    # get all stemmed words, skipping blank lines
    stemmed = [line.strip() for line in f if line.strip()]
    # set of lowercased original words
    original = set(word.lower() for line in orig for word in line.split())
    # map stemmed words to unstemmed words
    unstemmed = {word: None for word in stemmed}
    # find original words for word stems in map
    for stem in unstemmed:
        for word in original:
            if stem in word:
                unstemmed[stem] = word
    print(unstemmed)
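Because a set has no defined order, the loop above keeps whichever matching word it happens to visit last. If the first match in document order is wanted, as the question asks, one possible variant keeps the original words in a list. This is only a sketch: the helper name `first_matches` and the sample lines are made up for illustration, and unmatched words are mapped to themselves rather than to `None`.

```python
def first_matches(check_lines, orig_lines):
    """Map each stem to the first original word containing it, else to itself."""
    # Keep the original words in document order so "first match" is well defined.
    originals = [w.lower() for line in orig_lines for w in line.split()]
    result = {}
    for line in check_lines:
        stem = line.strip()
        if not stem:
            # Skip blank lines: "" is a substring of every word.
            continue
        # next() with a default returns the stem unchanged when nothing matches.
        result[stem] = next((w for w in originals if stem in w), stem)
    return result


# Hypothetical sample data; real code would pass the open file objects instead.
check_lines = ["ukrain\n", "\n", "troop\n", "force\n"]
orig_lines = ["ukraine\n", "held in #ukraine now free\n", "#groundtroops\n"]

corrected = first_matches(check_lines, orig_lines)
print(corrected)
# {'ukrain': 'ukraine', 'troop': '#groundtroops', 'force': 'force'}
```

Any iterable of lines works as input, so open file objects can be passed directly, and the values can then be written to newfile.txt one per line.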
Or, shorter (without the final double loop), using difflib.get_close_matches:
    import difflib
    unstemmed = {word: difflib.get_close_matches(word, original, 1) for word in stemmed}
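Note that `difflib.get_close_matches` returns a list, which is empty when no candidate reaches the similarity cutoff (0.6 by default). A small sketch of one way to turn that into "the closest word, or the original word if there is none"; the helper name `closest_or_same` and the candidate list are illustrative assumptions:

```python
import difflib


def closest_or_same(word, candidates, cutoff=0.6):
    """Return the closest candidate, or the word itself if none is close enough."""
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# Hypothetical candidates; in the answer's code this would be `original`.
candidates = ["ukraine", "cnn", "russia"]

print(closest_or_same("ukrain", candidates))  # ukraine
print(closest_or_same("xyz", candidates))     # xyz (no close match, unchanged)
```

Unlike the substring test, this compares overall similarity, so a short stem may fail to match a much longer word that merely contains it; lowering `cutoff` makes matching more permissive.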
Also, remember