NLTK Sentence Tokenizer Incorrect

I've noticed that the NLTK sent tokenizer makes mistakes on some dates. Is there any way to adjust it so that it correctly tokenizes the following:

valid any day after january 1. not valid on federal holidays, including february 14,
or with other in-house events, specials, or happy hour.

Currently, running sent_tokenize results in:

['valid any day after january 1. not valid on federal holidays, including february 14,
 or with other in-house events, specials, or happy hour.']

but I would like it to result in:

['valid any day after january 1.', 'not valid on federal holidays, including february 14,
  or with other in-house events, specials, or happy hour.']

since the period after "january 1." is a legitimate sentence-ending character.


First of all, the sent_tokenize function uses the punkt tokenizer, which was meant to tokenize well-formed English sentences. So your problem can be resolved by including proper capitalization:

>>> from nltk import sent_tokenize
>>> s = 'valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.'
>>> sent_tokenize(s)
['valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']
>>>
>>> s2 = 'Valid any day after january 1. Not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.'
>>> sent_tokenize(s2)
['Valid any day after january 1.', 'Not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']

Now, digging deeper, the punkt tokenizer is an implementation of the Kiss and Strunk (2005) algorithm; see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py for the implementation. From its docstring:

This tokenizer divides a text into a list of sentences, by using an
unsupervised algorithm to build a model for abbreviation words,
collocations, and words that start sentences. It must be trained on
a large collection of plaintext in the target language before it can
be used.
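
Since the algorithm is unsupervised, retraining it on raw in-domain plaintext is straightforward. Below is a minimal sketch using PunktTrainer; the inline training string is just a stand-in to keep the sketch runnable, since a usable model needs a large corpus:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Stand-in for a large collection of raw, in-domain plaintext.
train_text = """Valid any day after january 1. Not valid on federal holidays.
Offer expires march 3. Some restrictions apply."""

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True  # consider all collocations during training
trainer.train(train_text)

# Build a tokenizer from the learned parameters and try it out.
tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(tokenizer.tokenize('valid any day after january 1. not valid on federal holidays.'))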

So in the case of sent_tokenize, I'm quite sure it was trained on a well-formed English corpus, hence capitalization after a full stop is a strong indication of a sentence boundary. The full stop by itself may not be, since we have things like i.e. and e.g.
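
You can peek at what the pretrained English model has learned, such as its abbreviation list. A quick inspection sketch (it reads the private _params attribute and assumes the classic pickle path from older NLTK releases, so treat both as implementation details):

import nltk

# Load the pretrained English punkt model shipped with NLTK.
punkt = nltk.data.load('tokenizers/punkt/english.pickle')

# Abbreviation types are stored lowercased, without the final period.
print(sorted(punkt._params.abbrev_types)[:20])
print('i.e' in punkt._params.abbrev_types)  # was i.e. learned as an abbreviation?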

In some cases, the corpus might contain things like:

01. put pasta in pot
02. fill the pot with water

With sentences/documents like that in the training data, the algorithm is very likely to think that a full stop following a non-capitalized word is not a sentence boundary.

To resolve the problem, I suggest the following:

  • Manually segment 10-20% of your sentences and retrain a corpus-specific tokenizer, as in the training sketch above.
  • Convert your corpus into well-formed orthography before using sent_tokenize (a heuristic sketch follows this list).
  • See also: Training data format for NLTK Punkt
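
For the second suggestion, even a crude regex truecaser can help on text like this. A rough heuristic sketch, not a real truecaser (which would be a statistical model in its own right):

import re
from nltk import sent_tokenize

def naive_truecase(text):
    # Uppercase the first letter at the start of the text and after
    # sentence-final punctuation. Crude: it will also capitalize
    # after abbreviations such as i.e., so check the output.
    return re.sub(r'(^|[.!?]\s+)([a-z])',
                  lambda m: m.group(1) + m.group(2).upper(),
                  text)

s = ('valid any day after january 1. not valid on federal holidays, '
     'including february 14, or with other in-house events, specials, '
     'or happy hour.')
print(sent_tokenize(naive_truecase(s)))

With the capitalization restored, punkt splits after 'january 1.', as in the s2 example above.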