NLTK tokenize text with dialog into sentences
I can tokenize non-dialog text into sentences, but when I add quotation marks to the sentences, the NLTK tokenizer no longer splits them correctly. For example, this works as expected:
```python
import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text1 = 'Is this one sentence? This is separate. This is a third he said.'
tokenizer.tokenize(text1)
```
This yields a list of three distinct sentences:
```python
['Is this one sentence?', 'This is separate.', 'This is a third he said.']
```
However, if I turn it into a dialog, the same process doesn't work.
```python
text2 = '"Is this one sentence?" "This is separate." "This is a third" he said.'
tokenizer.tokenize(text2)
```
This returns it as a single sentence:
```python
['"Is this one sentence?" "This is separate." "This is a third" he said.']
```
How can I get the NLTK tokenizer to work in this case?
It seems the tokenizer doesn't know what to do with directed (curly) quotes. Replace them with regular ASCII double quotes and the example works fine.
```python
>>> import re
>>> text3 = re.sub('[""]', '"', text2)
>>> nltk.sent_tokenize(text3)
['"Is this one sentence?"', '"This is separate."', '"This is a third" he said.']
```