NLTK tokenize text with dialog into sentences
I can tokenize non-dialog text into sentences, but when I add quotation marks to the sentences, the NLTK tokenizer no longer splits them correctly. For example, this works as expected:
```python
import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text1 = 'Is this one sentence? This is separate. This is a third he said.'
tokenizer.tokenize(text1)
```
This yields a list of three distinct sentences:
```python
['Is this one sentence?', 'This is separate.', 'This is a third he said.']
```
However, if I turn it into a dialog, the same process doesn't work.
```python
text2 = '"Is this one sentence?" "This is separate." "This is a third" he said.'
tokenizer.tokenize(text2)
```
This returns it as a single sentence:
```python
['"Is this one sentence?" "This is separate." "This is a third" he said.']
```
How can I get the NLTK tokenizer to work in this case?
It seems the tokenizer doesn't know what to do with directed (curly) quotes. Replace them with regular ASCII double quotes and the example works fine.
```python
>>> import re
>>> text3 = re.sub('[""]', '"', text2)
>>> nltk.sent_tokenize(text3)
['"Is this one sentence?"', '"This is separate."', '"This is a third" he said.']
```