Python: How do I tokenize a string sentence in NLTK?

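I am using NLTK and want to create my own custom texts, just like the default ones in nltk.book. So far, however, I have only gotten as far as building a token list by hand:
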
my_text = ['This', 'is', 'my', 'text']

I would like to find a way to input my "text" as a plain string instead:

my_text = "This is my text, this is a nice way to input text."

Which method, from Python itself or from NLTK, allows me to do this? And more importantly, how can I get rid of the punctuation symbols?


This is actually on the main page of nltk.org:

>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""

>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']


As @pavelanossov answered, the canonical answer is to use the word_tokenize function in NLTK:

from nltk import word_tokenize
sent = "This is my text, this is a nice way to input text."
word_tokenize(sent)
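
Note that word_tokenize keeps punctuation marks as separate tokens, so the second half of the question (dropping punctuation) still needs a filtering step. A minimal sketch of one way to do that follows. The helper name drop_punctuation is hypothetical, and the token list is written out by hand to mirror what word_tokenize would typically return for this sentence, so that the snippet runs without NLTK or its tokenizer data installed:

```python
import string

def drop_punctuation(tokens):
    """Keep only tokens that contain at least one non-punctuation character."""
    punct = set(string.punctuation)
    return [tok for tok in tokens if not all(ch in punct for ch in tok)]

# Hand-written stand-in for word_tokenize(sent); replace it with the real
# call when NLTK and its tokenizer data are available.
tokens = ['This', 'is', 'my', 'text', ',', 'this', 'is', 'a', 'nice',
          'way', 'to', 'input', 'text', '.']

words = drop_punctuation(tokens)
print(words)
```

Tokens such as "n't" or "o'clock" survive this filter because they contain letters; only tokens made up entirely of ASCII punctuation are removed.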

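If your sentence is really simple:

Use the string.punctuation set to remove the punctuation, then split on the space delimiter:

import string
x = "This is my text, this is a nice way to input text."
# Drop every punctuation character, then split the remaining text on single spaces
y = "".join([i for i in x if i not in string.punctuation]).split(" ")
print(y)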
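
The same idea can also be sketched with str.translate, which removes all punctuation characters in one pass, combined with split() called without arguments so that repeated spaces, tabs, or newlines do not produce empty tokens. The function name simple_tokenize is an assumption made for illustration, not part of NLTK or the standard library:

```python
import string

def simple_tokenize(text):
    """Strip ASCII punctuation, then split on runs of whitespace."""
    # The third argument of str.maketrans lists characters to delete.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).split()

tokens = simple_tokenize("This is my text, this is a nice way to input text.")
print(tokens)
```

Be aware that this also deletes apostrophes and hyphens inside words (for example, "didn't" becomes "didnt"), and string.punctuation only covers ASCII symbols, so it is best suited to simple sentences like the one above.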