Looking for dictionary words in a text file using a dictionary in Python
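I have read up on how to check for dictionary words.
That gave me the idea of checking my text file against a dictionary. I have read the pyenchant instructions, and I think that if I use ...
So this is where I am stuck: I want my program to give me all the groups of dictionary words as paragraphs. As soon as it hits any junk characters, it should treat that as a paragraph break and ignore everything until it finds X consecutive words.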
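The rule described above (junk ends a paragraph, and only runs of at least X consecutive dictionary words are kept) can be sketched roughly as follows. This is only an illustration under stated assumptions: `dictionary_runs`, `is_known` and `KNOWN_WORDS` are hypothetical names, and the tiny word set stands in for a real lookup such as `enchant.Dict("en_US").check`.

```python
def dictionary_runs(tokens, is_word, min_run=3):
    """Yield space-joined runs of at least `min_run` consecutive accepted tokens."""
    run = []
    for token in tokens:
        if is_word(token):
            run.append(token)
            continue
        # A junk token ends the current run; keep the run only if it is long enough.
        if len(run) >= min_run:
            yield " ".join(run)
        run = []
    # Flush the last run, applying the same length rule.
    if len(run) >= min_run:
        yield " ".join(run)


# Hypothetical stand-in for a real dictionary (assumption, not part of the question).
KNOWN_WORDS = {"english", "cricket", "cuts", "ties", "email", "print", "the", "tour"}


def is_known(token):
    return token.lower() in KNOWN_WORDS


tokens = "English cricket cuts ties void0 email xq print the tour".split()
print(list(dictionary_runs(tokens, is_known, min_run=3)))
# ['English cricket cuts ties', 'print the tour']
```

Passing the lookup in as a function keeps the grouping logic independent of which spell-checking library is used.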
I want it to ...
What I have so far:
    import enchant
    from enchant.tokenize import get_tokenizer

    dictCheck = enchant.Dict("en_US")
    dictSentCheck = get_tokenizer("en_US")
    sentCheck = input("Check sentence: ")

    def check_dictionary(wordCheck):
        # True if the en_US dictionary accepts the word
        return dictCheck.check(wordCheck)

    # Each token is a (word, offset) tuple; keep only the word
    test = [w[0] for w in dictSentCheck(sentCheck)]
--- Sample text ---
English cricket cuts ties with Zimbabwe Wednesday, 25 June, 2008 text<void(0);><void(0);> <void(0);>email <void(0);>print EMAIL THIS ARTICLE your name: your email address: recipient's name: recipient's email address: <;>add another recipient your comment: Send Mail<void(0);> close this form <http://ad.au.doubleclick.net/jump/sbs.com.au/worldnews;sz=300x250;tile=2;ord=123456789?> The England and Wales Cricket Board (ECB) announced it was suspending all ties with Zimbabwe and was cancelling Zimbabwe's tour of England next year.
The script should return:
English cricket cuts ties with Zimbabwe Wednesday
The England and Wales Cricket Board (ECB) announced it was suspending all ties with Zimbabwe and was cancelling Zimbabwe's tour of England next year
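I accepted abarnert's answer. Below is my final script. Note that it is very inefficient and could use some cleaning up. Also, a disclaimer: I have not written code in a long time.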
    import os

    import enchant
    from enchant.tokenize import get_tokenizer

    # Build the dictionary and tokenizer once rather than once per word
    DICTIONARY = enchant.Dict("en_US")
    TOKENIZER = get_tokenizer("en_US")

    def clean_files():
        os.chdir("TARGET_DIRECTORY")
        for files in os.listdir("."):
            # Get the number out of the file name
            file_number = files[files.rfind("_") + 1:files.rfind(".")]
            # Print status to screen
            print("Working on file:", files)
            # Read and process original file
            with open("name_" + file_number + ".txt", "r") as original_file:
                read_original_file = original_file.read()
            # Start the parsing of the files
            token_words = tokenize_words(read_original_file)
            parse_result = "\n".join(split_on_angle_brackets(token_words, file_number))
            # Commit changes to parsed file
            with open("name_" + file_number + "_parse.txt", "w") as parsed_file:
                parsed_file.write(parse_result)

    def tokenize_words(file_words):
        # Each token is a (word, offset) tuple; keep only the word
        return [w[0] for w in TOKENIZER(file_words)]

    def check_dictionary(dict_word):
        return DICTIONARY.check(dict_word)

    def split_on_angle_brackets(token_words, file_number):
        para = []
        bracket_stack = 0
        with open("name_" + file_number + "_ignored_words.txt", "w") as ignored_words_per_file:
            for word in token_words:
                if bracket_stack:
                    # Inside brackets: only track nesting, drop everything else
                    if word == 'gt':
                        bracket_stack -= 1
                    elif word == 'lt':
                        bracket_stack += 1
                else:
                    if word == 'lt':
                        # An opening bracket ends the paragraph; keep it if long enough
                        if len(para) >= 7:
                            yield ' '.join(para)
                        para = []
                        bracket_stack = 1
                    elif word != 'amp':
                        if check_dictionary(word):
                            para.append(word)
                        else:
                            print("Ignored word:", word)
                            ignored_words_per_file.write(word + "\n")
            if para:
                yield ' '.join(para)

    if __name__ == "__main__":
        clean_files()
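One thing to watch for when a script like the one above is run a second time in the same directory: `os.listdir(".")` then also returns the generated `name_<n>_parse.txt` and `name_<n>_ignored_words.txt` files, and slicing between the last `_` and the last `.` yields `parse` or `words` instead of a number. A possible way to pick out only the input files is sketched below. It assumes the file numbers consist of digits only, and `INPUT_NAME` and `input_file_numbers` are hypothetical names that do not appear in the script.

```python
import re

# Assumption: input files are named name_<digits>.txt and nothing else.
INPUT_NAME = re.compile(r"name_(\d+)\.txt")


def input_file_numbers(filenames):
    """Return the numeric parts of names shaped like name_<digits>.txt."""
    numbers = []
    for filename in filenames:
        match = INPUT_NAME.fullmatch(filename)
        if match:
            numbers.append(match.group(1))
    return numbers


listing = [
    "name_1.txt",
    "name_1_parse.txt",
    "name_1_ignored_words.txt",
    "name_22.txt",
    "notes.md",
]
print(input_file_numbers(listing))
# ['1', '22']
```

In the script, the result of `os.listdir(".")` would take the place of the hard-coded `listing`.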
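I'm still not sure what your question is, or what your code is supposed to do.
But this line seems to be the key:

    test = [w[0] for w in dictSentCheck(sentCheck)]

This gives you a list of all the words. It includes words like lt and gt (presumably from &lt; and &gt; entities in the raw text). And you want to remove anything that falls between an lt and gt pair. Also, as you said in a comment, "I can set the number of consecutive words required to 7."
So, something like this: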

    def split_on_angle_brackets(words):
        para = []
        bracket_stack = 0
        for word in words:
            if bracket_stack:
                # Inside brackets: only track nesting, drop everything else
                if word == 'gt':
                    bracket_stack -= 1
                elif word == 'lt':
                    bracket_stack += 1
            else:
                if word == 'lt':
                    # An opening bracket ends the paragraph; keep it if long enough
                    if len(para) >= 7:
                        yield ' '.join(para)
                    para = []
                    bracket_stack = 1
                else:
                    para.append(word)
        if para:
            yield ' '.join(para)

If you use it with the sample data:

    print('\n'.join(split_on_angle_brackets(test)))

you get this:

    English cricket cuts ties with Zimbabwe Wednesday June text
    print EMAIL THIS ARTICLE your name your email address recipient's name recipient's email address
    add another recipient your comment Send Mail
    The England and Wales Cricket Board ECB announced it was suspending all ties with Zimbabwe and was cancelling Zimbabwe's tour of England next year

This doesn't match your sample output, but I can't think of any rule that would produce your sample output, so I have tried to implement the rule you described.
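To experiment with the threshold without installing pyenchant, the same bracket-skipping idea can be tried on a small string. The sketch below makes several assumptions: `tokenize` is a crude regex stand-in and not pyenchant's tokenizer, the raw text is assumed to carry `&lt;` and `&gt;` entities (which is where the `lt` and `gt` words would come from), and `paragraphs` and `min_words` are illustrative names.

```python
import re


def tokenize(text):
    # Stand-in tokenizer (assumption): runs of letters and apostrophes only.
    return re.findall(r"[A-Za-z']+", text)


def paragraphs(words, min_words=7):
    """Join words outside lt...gt pairs; keep a run cut off by 'lt' only if long enough."""
    para = []
    depth = 0
    for word in words:
        if depth:
            # Inside brackets: track nesting and drop the words.
            if word == "gt":
                depth -= 1
            elif word == "lt":
                depth += 1
        elif word == "lt":
            if len(para) >= min_words:
                yield " ".join(para)
            para = []
            depth = 1
        else:
            para.append(word)
    # As in the generator above, the trailing run is kept whatever its length.
    if para:
        yield " ".join(para)


sample = "one two three &lt;void(0);&gt; email &lt;x&gt; four five six seven"
print(list(paragraphs(tokenize(sample), min_words=3)))
# ['one two three', 'four five six seven']
```

With `min_words=4` the first run is dropped as well, which shows how strongly the result depends on the chosen threshold.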