NLTK Sentence Tokenizer Incorrect

I've noticed that the NLTK sent tokenizer makes mistakes on some dates. Is there any way to adjust it so that it correctly tokenizes the following:

valid any day after january 1. not valid on federal holidays, including february 14,
or with other in-house events, specials, or happy hour.

Currently, running sent_tokenize results in:

['valid any day after january 1. not valid on federal holidays, including february 14,
 or with other in-house events, specials, or happy hour.']

but I would like it to result in:

['valid any day after january 1.', 'not valid on federal holidays, including february 14,
  or with other in-house events, specials, or happy hour.']

since the period after "january 1." is a legitimate sentence-ending character.


First of all, the sent_tokenize function uses the punkt tokenizer, which was meant to tokenize well-formed English sentences. So your problem can be resolved by including proper capitalization:

>>> from nltk import sent_tokenize
>>> s = 'valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.'
>>> sent_tokenize(s)
['valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']
>>>
>>> s2 = 'Valid any day after january 1. Not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.'
>>> sent_tokenize(s2)
['Valid any day after january 1.', 'Not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']

Now, digging deeper, the punkt tokenizer is an implementation of the Kiss and Strunk (2005) algorithm; see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py for the implementation. From its docstring:

This tokenizer divides a text into a list of sentences, by using an
unsupervised algorithm to build a model for abbreviation words,
collocations, and words that start sentences. It must be trained on
a large collection of plaintext in the target language before it can
be used.
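
Since the algorithm is unsupervised, retraining it on raw in-domain plaintext is straightforward. Below is a minimal sketch using PunktTrainer; the inline training string is just a stand-in to keep the sketch runnable, since a usable model needs a large corpus:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Stand-in for a large collection of raw, in-domain plaintext.
train_text = """Valid any day after january 1. Not valid on federal holidays.
Offer expires march 3. Some restrictions apply."""

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True  # consider all collocations during training
trainer.train(train_text)

# Build a tokenizer from the learned parameters and try it out.
tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(tokenizer.tokenize('valid any day after january 1. not valid on federal holidays.'))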

So in the case of sent_tokenize, I'm quite sure it was trained on a well-formed English corpus, hence capitalization after a full stop is a strong indication of a sentence boundary. The full stop by itself may not be, since we have things like i.e. and e.g.
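
You can peek at what the pretrained English model has learned, such as its abbreviation list. A quick inspection sketch (it reads the private _params attribute and assumes the classic pickle path from older NLTK releases, so treat both as implementation details):

import nltk

# Load the pretrained English punkt model shipped with NLTK.
punkt = nltk.data.load('tokenizers/punkt/english.pickle')

# Abbreviation types are stored lowercased, without the final period.
print(sorted(punkt._params.abbrev_types)[:20])
print('i.e' in punkt._params.abbrev_types)  # was i.e. learned as an abbreviation?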

In some cases, the corpus might contain things like:

01. put pasta in pot
02. fill the pot with water

With sentences/documents like that in the training data, the algorithm is very likely to think that a full stop following a non-capitalized word is not a sentence boundary.

To resolve the problem, I suggest the following:

  • Manually segment 10-20% of your sentences and retrain a corpus-specific tokenizer, as in the training sketch above.
  • Convert your corpus into well-formed orthography before using sent_tokenize (a heuristic sketch follows this list).
  • See also: Training data format for NLTK Punkt
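
For the second suggestion, even a crude regex truecaser can help on text like this. A rough heuristic sketch, not a real truecaser (which would be a statistical model in its own right):

import re
from nltk import sent_tokenize

def naive_truecase(text):
    # Uppercase the first letter at the start of the text and after
    # sentence-final punctuation. Crude: it will also capitalize
    # after abbreviations such as i.e., so check the output.
    return re.sub(r'(^|[.!?]\s+)([a-z])',
                  lambda m: m.group(1) + m.group(2).upper(),
                  text)

s = ('valid any day after january 1. not valid on federal holidays, '
     'including february 14, or with other in-house events, specials, '
     'or happy hour.')
print(sent_tokenize(naive_truecase(s)))

With the capitalization restored, punkt splits after 'january 1.', as in the s2 example above.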