How to get rid of punctuation using NLTK tokenizer?
I'm just starting out with NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation, but I need only the words.
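For context, a minimal sketch of the behavior the question describes (the sample sentence is illustrative): nltk.word_tokenize() returns punctuation marks as separate tokens alongside the words:

```python
import nltk
# nltk.download('punkt')  # may be required once before word_tokenize works

print(nltk.word_tokenize("Eighty-seven miles to go, yet. Onward!"))
# ['Eighty-seven', 'miles', 'to', 'go', ',', 'yet', '.', 'Onward', '!']
```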
Take a look at the other tokenization options NLTK provides here. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:
```python
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')
```
Output:
```
['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
```
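Note that this pattern also splits "Eighty-seven" and would break up contractions like "can't". If you want to keep contractions together, one possible variation (this pattern is my assumption, not part of the answer above) is:

```python
from nltk.tokenize import RegexpTokenizer

# alternative pattern: a word optionally followed by an apostrophe and more letters
tokenizer = RegexpTokenizer(r"\w+'\w+|\w+")
print(tokenizer.tokenize("Eighty-seven miles to go, yet. Can't stop now!"))
# ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', "Can't", 'stop', 'now']
```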
You don't need NLTK to remove punctuation. You can remove it with plain Python. For strings:
```python
import string

s = '... some string with punctuation ...'
# Python 2 only: str.translate(table, deletechars)
s = s.translate(None, string.punctuation)
```
Or for Unicode:
```python
import string

translate_table = dict((ord(char), None) for char in string.punctuation)
s.translate(translate_table)  # this mapping form is also what Python 3's str.translate expects
```
Then use this string in your tokenizer.

P.S. The string module has a few other sets of characters that can be removed in the same way (such as digits).
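A minimal sketch of both points in Python 3 (the sample sentence is illustrative): strip punctuation and digits with a single translation table, then hand the cleaned string to the tokenizer:

```python
import string
from nltk.tokenize import word_tokenize

s = "Hello, world! It is 2023..."
# one table that maps both punctuation and digit code points to None
table = {ord(ch): None for ch in string.punctuation + string.digits}
print(word_tokenize(s.translate(table)))
# ['Hello', 'world', 'It', 'is']
```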
The code below removes all punctuation as well as non-alphabetic tokens. Copied from the NLTK book:

http://www.nltk.org/book/ch01.html
```python
import nltk

s = "I can't do this now, because I'm so tired. Please give me some time. @ sd 4 232"
words = nltk.word_tokenize(s)
# keep only purely alphabetic tokens, lower-cased
words = [word.lower() for word in words if word.isalpha()]
print(words)
```
Output:
```
['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']
```
As noted in the comments, start with sent_tokenize(), because word_tokenize() works on a single sentence only. You can filter out punctuation with filter(). If you have a Unicode string, make sure it is a unicode object (not a "str" encoded with some encoding such as "utf-8").
```python
from nltk.tokenize import word_tokenize, sent_tokenize

text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
# Python 3 print; the original used the Python 2 print statement
print(list(filter(lambda word: word not in ',-', tokens)))
```
I just used the following code, which removed all the punctuation:
```python
import nltk

raw = "I can't do this now, because I'm so tired."  # sample text; 'raw' was left undefined in the original
tokens = nltk.wordpunct_tokenize(raw)
text = nltk.Text(tokens)
words = [w.lower() for w in text if w.isalpha()]
```
I think you need some kind of regular-expression matching (the code below is for Python 3):
```python
import string
import re
import nltk

s = "I can't do this now, because I'm so tired. Please give me some time."
l = nltk.word_tokenize(s)
# drop tokens made up entirely of punctuation characters
ll = [x for x in l if not re.fullmatch('[' + string.punctuation + ']+', x)]
print(l)
print(ll)
```
Output:
```
['I', 'ca', "n't", 'do', 'this', 'now', ',', 'because', 'I', "'m", 'so', 'tired', '.', 'Please', 'give', 'me', 'some', 'time', '.']
['I', 'ca', "n't", 'do', 'this', 'now', 'because', 'I', "'m", 'so', 'tired', 'Please', 'give', 'me', 'some', 'time']
```
This should work well in most cases, since it removes punctuation while preserving tokens such as "n't", which cannot be obtained from regex tokenizers such as wordpunct_tokenize().
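To see the difference, compare the two tokenizers on the same contraction (a small illustrative check):

```python
from nltk.tokenize import word_tokenize, wordpunct_tokenize

s = "I can't do this"
print(word_tokenize(s))       # ['I', 'ca', "n't", 'do', 'this']
print(wordpunct_tokenize(s))  # ['I', 'can', "'", 't', 'do', 'this']
```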
I use this code to remove punctuation:
```python
import nltk

def getTerms(sentences):
    tokens = nltk.word_tokenize(sentences)
    words = [w.lower() for w in tokens if w.isalnum()]
    print(tokens)
    print(words)

getTerms("hh, hh3h. wo shi 2 4 A . fdffdf. A&&B")
```
If you want to check whether a token is a valid English word, you may need PyEnchant.

Tutorial:
```python
import enchant

d = enchant.Dict("en_US")
d.check("Hello")   # True
d.check("Helo")    # False
d.suggest("Helo")  # spelling suggestions, e.g. ['Hello', 'Helot', ...]
```
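A hedged sketch of combining tokenization with the dictionary check (assuming pyenchant is installed and the en_US dictionary is available; the sample sentence is illustrative):

```python
import enchant
from nltk.tokenize import word_tokenize

d = enchant.Dict("en_US")
tokens = word_tokenize("Helo world , this is a tst .")
# keep only alphabetic tokens that the dictionary recognizes;
# the isalpha() guard also drops the punctuation tokens
words = [t for t in tokens if t.isalpha() and d.check(t)]
print(words)  # ['world', 'this', 'is', 'a']
```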
Removing punctuation (this removes "." as well, as part of the punctuation handling, using the code below):
```python
import sys, unicodedata
from nltk.tokenize import word_tokenize

# map every Unicode punctuation code point (category 'P*') to None
tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                    if unicodedata.category(chr(i)).startswith('P'))
text_string = text_string.translate(tbl)  # text_string no longer contains punctuation
w = word_tokenize(text_string)            # now tokenize the string
```
Sample input/output:
```
direct flat in oberoi esquire. 3 bhk 2195 saleable 1330 carpet. rate of 14500 final plus 1% floor rise. tax approx 9% only. flat cost with parking 3.89 cr plus taxes plus possession charger. middle floor. north door. arey and oberoi woods facing. 53% paymemt due. 1% transfer charge with buyer. total cost around 4.20 cr approx plus possession charges. rahul soni
```