How do I tokenize a string sentence in NLTK?
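I am using nltk, and I want to create my own custom texts, just like the default ones in nltk.book. However, so far I have only learned to build one by hand, like this:

    my_text = ['This', 'is', 'my', 'text']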
I'd like to find a way to input my "text" as a plain string instead:

    my_text = "This is my text, this is a nice way to input text."
Which method, from Python itself or from nltk, lets me do this? And more importantly, how can I get rid of the punctuation symbols?
This is actually on the main page of nltk.org:

    >>> import nltk
    >>> sentence = """At eight o'clock on Thursday morning
    ... Arthur didn't feel very good."""
    >>> tokens = nltk.word_tokenize(sentence)
    >>> tokens
    ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
    'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
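Note that the tokenizer keeps the final '.' as its own token, so the punctuation part of the question still needs a filtering step. A minimal sketch of one way to do that, using only the standard library: the token list is copied from the output above, and drop_punctuation is a hypothetical helper name, not part of nltk.

```python
import string

# Token list copied verbatim from the nltk.org example output above.
tokens = ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
          'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

PUNCT = set(string.punctuation)


def drop_punctuation(tokens):
    """Drop tokens made up entirely of punctuation characters (or empty)."""
    return [tok for tok in tokens if not all(ch in PUNCT for ch in tok)]


words = drop_punctuation(tokens)
print(words)
```

Tokens that merely contain punctuation, such as "o'clock" and "n't", are kept; only tokens consisting of nothing but punctuation are removed.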
As @pavelanossov answered, the canonical answer is to use the word_tokenize function in nltk:

    from nltk import word_tokenize

    sent = "This is my text, this is a nice way to input text."
    word_tokenize(sent)
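If your sentence is really simple:

Using the string.punctuation set, remove the punctuation and then split on the space delimiter:

    import string

    x = "This is my text, this is a nice way to input text."
    # Keep only characters that are not punctuation, then split on spaces
    y = "".join([i for i in x if i not in string.punctuation]).split(" ")
    print(y)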
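The same idea can also be written with str.translate, which removes the punctuation in a single pass. This is only a sketch for simple ASCII input; simple_tokenize is a hypothetical helper name, not an nltk function.

```python
import string


def simple_tokenize(text):
    """Strip ASCII punctuation, then split on runs of whitespace."""
    # Translation table that deletes every character in string.punctuation.
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table).split()


x = "This is my text, this is a nice way to input text."
print(simple_tokenize(x))
# ['This', 'is', 'my', 'text', 'this', 'is', 'a', 'nice', 'way', 'to', 'input', 'text']
```

Calling split() with no argument also collapses repeated spaces, tabs, and newlines. Because the apostrophe is part of string.punctuation, a word like "didn't" becomes "didnt" here, whereas word_tokenize splits it into "did" and "n't".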