How do I tokenize a string sentence in NLTK?

nlp nltk python tokenize

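I am using nltk, and I want to create my own custom texts like the default ones in nltk.books. So far I have only got as far as building a list by hand:

my_text = ['This', 'is', 'my', 'text']

I would like to find a way to supply my "text" as a plain string instead:

my_text = "This is my text, this is a nice way to input text."

Which method, from Python or from nltk, lets me do this? More importantly, how can I strip the punctuation?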


This is actually on the main page of nltk.org:

>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']


As @pavelanossov answered, the canonical answer is to use the word_tokenize function from nltk:

from nltk import word_tokenize

sent = "This is my text, this is a nice way to input text."
# word_tokenize needs the NLTK tokenizer data to be downloaded beforehand.
print(word_tokenize(sent))
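
Note that word_tokenize keeps punctuation marks as separate tokens, so on its own it does not cover the second half of the question. One possible way to drop them, sketched here on the assumption that a purely regex-based split is acceptable, is NLTK's RegexpTokenizer with the pattern `\w+`. Wrapping the result in nltk.Text gives the same kind of object as the texts in nltk.book. The variable names are illustrative only.

```python
from nltk.text import Text
from nltk.tokenize import RegexpTokenizer

# Assumption: a "word" is any run of letters, digits, or underscores.
# This pattern also splits contractions, e.g. "didn't" becomes "didn" and "t".
tokenizer = RegexpTokenizer(r"\w+")

sent = "This is my text, this is a nice way to input text."
tokens = tokenizer.tokenize(sent)
print(tokens)

# Wrap the token list in a Text object, like the predefined nltk.book texts.
my_text = Text(tokens)
print(len(my_text), my_text.count("text"))
```

RegexpTokenizer does not rely on the downloaded tokenizer models, which makes it convenient for quick experiments. If contractions matter, filtering the output of word_tokenize against string.punctuation may be the better fit.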

If your sentence is really simple:

Remove the punctuation using the string.punctuation set, then split on the space delimiter:

import string

x = "This is my text, this is a nice way to input text."
# Keep only the characters that are not punctuation, then split on spaces.
y = "".join([i for i in x if i not in string.punctuation]).split(" ")
print(y)
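
A variant of the same idea, offered as a sketch that uses only the standard library: str.translate deletes the punctuation characters in a single pass, and split() with no argument splits on any run of whitespace, so repeated spaces do not produce empty strings. The name `table` is illustrative.

```python
import string

x = "This is my text, this is a nice way to input text."

# Build a translation table that deletes every ASCII punctuation character.
table = str.maketrans("", "", string.punctuation)

# split() with no argument splits on runs of whitespace and drops empty strings.
y = x.translate(table).split()
print(y)
```

Like the join-based version, this only handles the ASCII characters in string.punctuation. Typographic quotes and similar characters would pass through unchanged.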
