Looking for dictionary words in a text file using a dictionary in Python

I read about how to check whether a word is a dictionary word, and that gave me the idea of checking my text file against a dictionary. I have read the PyEnchant instructions, and I figured that if I use get_tokenizer it will hand me back all the dictionary words in the text file.
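
For reference, this is roughly how I understand the two PyEnchant pieces fitting together; a minimal sketch (the sample string is made up):

import enchant
from enchant.tokenize import get_tokenizer

tokenizer = get_tokenizer("en_US")   # splits text into (word, position) tuples
checker = enchant.Dict("en_US")      # spell-check dictionary for single words

text = "Real words mixed with qwxzv junk"
words = [w for (w, pos) in tokenizer(text)]
good_words = [w for w in words if checker.check(w)]
print good_words   # only the tokens that are actual dictionary words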

So here is where I am stuck: I want my program to give me all the groups of dictionary words it finds, in the form of paragraphs. As soon as it hits any junk characters it should consider the paragraph broken and ignore everything until it finds X consecutive dictionary words again.
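
To make that rule concrete, here is a rough sketch of the behaviour I am after (min_words stands for the X above; this is not tested code):

import enchant

def dictionary_runs(words, min_words):
    # Yield groups of at least min_words consecutive dictionary words.
    # Any non-dictionary token breaks the current group, and groups that
    # come out too short get thrown away.
    checker = enchant.Dict("en_US")
    run = []
    for word in words:
        if checker.check(word):
            run.append(word)
        else:
            if len(run) >= min_words:
                yield ' '.join(run)
            run = []
    if len(run) >= min_words:
        yield ' '.join(run)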

I want it to read through the text files in order, filename_nnn.txt, parse each one, and write the result to parsed_filename_nnn.txt. I have not gotten around to any of the file handling yet.
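
For the file handling, what I have in mind is roughly this (parse_text is just a placeholder for whatever ends up doing the actual parsing):

import glob

def process_files(parse_text):
    # Walk the input files in name order and write a parsed_ copy of each.
    for path in sorted(glob.glob("filename_*.txt")):
        with open(path, "r") as infile:
            raw_text = infile.read()
        with open("parsed_" + path, "w") as outfile:
            outfile.write(parse_text(raw_text))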

What I have so far:

import enchant
from enchant.tokenize import get_tokenizer, HTMLChunker

dictCheck = enchant.Dict("en_US")        # dictionary used to validate single words
dictSentCheck = get_tokenizer("en_US")   # tokenizer that splits text into words
sentCheck = raw_input("Check Sentence: ")

def check_dictionary(wordCheck):
    outcome = dictCheck.check(wordCheck)             # True if wordCheck is in the dictionary
    test = [w[0] for w in dictSentCheck(sentCheck)]  # every word token in the input sentence
    return outcome, test

- - - Sample text - - -

English cricket cuts ties with Zimbabwe Wednesday, 25 June, 2008 text<void(0);><void(0);> <void(0);>email <void(0);>print EMAIL THIS ARTICLE your name: your email address: recipient's name: recipient's email address: <;>add another recipient your comment: Send Mail<void(0);> close this form <http://ad.au.doubleclick.net/jump/sbs.com.au/worldnews;sz=300x250;tile=2;ord=123456789?> The England and Wales Cricket Board (ECB) announced it was suspending all ties with Zimbabwe and was cancelling Zimbabwe's tour of England next year.

The script should return:

English cricket cuts ties with Zimbabwe Wednesday

The England and Wales Cricket Board (ECB) announced it was suspending all ties with Zimbabwe and was cancelling Zimbabwe's tour of England next year

I accepted abarnert's answer. Below is my final script. Note that it is very inefficient and could use some cleanup. Also, as a disclaimer, I have not coded in a very long time.

import enchant
from enchant.tokenize import get_tokenizer
import os

def clean_files():
    os.chdir("TARGET_DIRECTORY")
    for files in os.listdir("."):
           #get the number out of the file name
           file_number = files[files.rfind("_")+1:files.rfind(".")]

           #Print status to screen
           print"Working on file:", files

           #Read and process original file
           original_file = open("name_"+file_number+".txt","r+")
           read_original_file = original_file.read()

           #Start the parsing of the files
           token_words = tokenize_words(read_original_file)
           parse_result = '\n'.join(split_on_angle_brackets(token_words, file_number))
           original_file.close()

           #Commit changes to parsed file
           parsed_file = open("name_"+file_number+"_parse.txt","wb")
           parsed_file.write(parse_result)
           parsed_file.close()

def tokenize_words(file_words):
    tokenized_sentences = get_tokenizer("en_US")
    word_tokens = tokenized_sentences(file_words)
    token_result = [w[0] for w in word_tokens]
    return token_result

def check_dictionary(dict_word):
    check_word = enchant.Dict("en_US")
    validated_word = check_word.check(dict_word)
    return validated_word

def split_on_angle_brackets(token_words, file_number):
    para = []
    bracket_stack = 0
    ignored_words_per_file = open("name_"+file_number+"_ignored_words.txt","wb")
    for word in token_words:
        if bracket_stack:
            if word == 'gt':
                bracket_stack -= 1
            elif word == 'lt':
                bracket_stack += 1
        else:
            if word == 'lt':
                if len(para) >= 7:
                    yield ' '.join(para)
                para = []
                bracket_stack = 1
            elif word != 'amp':
                if check_dictionary(word) == True:
                    para.append(word)
                    #print"append", word
                else:
                       print"Ignored word:", word
                       ignored_words_per_file.write(word +"
"
)
    if para:
        yield ' '.join(para)

    #Close opened files
    ignored_words_per_file.close()

clean_files()
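
Since I said the script could use some cleanup: one obvious improvement (my own note, not part of the accepted answer) would be opening the files in with blocks so they get closed even if parsing blows up partway through. A rough sketch of just the I/O part of clean_files, reusing the tokenize_words and split_on_angle_brackets functions defined above; clean_one_file is a made-up name:

def clean_one_file(file_number):
    # Hypothetical reworking of the I/O in clean_files() using 'with' blocks.
    with open("name_" + file_number + ".txt", "r") as original_file:
        raw_text = original_file.read()
    paragraphs = split_on_angle_brackets(tokenize_words(raw_text), file_number)
    with open("name_" + file_number + "_parse.txt", "w") as parsed_file:
        parsed_file.write('\n'.join(paragraphs))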


I'm still not sure what your question is, or what your code is supposed to do.

But this line seems to be the key:

test = [w[0] for w in dictSentCheck(sentCheck)]

That gives you a list of all the words. It includes words like lt and gt, and you want to drop anything that falls between an lt/gt pair.

And, as you said in the comments, "I can set the number of required consecutive words to 7".

So, something like this:

def split_on_angle_brackets(words):
    para = []
    bracket_stack = 0
    for word in words:
        if bracket_stack:
            if word == 'gt':
                bracket_stack -= 1
            elif word == 'lt':
                bracket_stack += 1
        else:
            if word == 'lt':
                if len(para) >= 7:
                    yield ' '.join(para)
                para = []
                bracket_stack = 1
            else:
                para.append(word)
    if para:
        yield ' '.join(para)

If you use it with the sample data:

print('\n'.join(split_on_angle_brackets(test)))

You get this:

English cricket cuts ties with Zimbabwe Wednesday June text
print EMAIL THIS ARTICLE your name your email address recipient's name recipient's email address
add another recipient your comment Send Mail
The England and Wales Cricket Board ECB announced it was suspending all ties with Zimbabwe and was cancelling Zimbabwe's tour of England next year

This doesn't match your sample output, but I can't think of any rule that would produce your sample output, so I tried to implement the rule you actually described.