NLTK Sentence Tokenizer Incorrect
I noticed that the NLTK sentence tokenizer makes mistakes with some dates. Is there any way to adjust it so that it correctly tokenizes the following:

    valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.

Running sent_tokenize currently returns:

    ['valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']

But it should return:

    ['valid any day after january 1.', 'not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']

because the period after "january 1" is a legitimate sentence-terminating character.
First, capitalization matters here. With proper sentence capitalization, the text is split as expected:

    >>> from nltk import sent_tokenize
    >>> s = 'valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.'
    >>> sent_tokenize(s)
    ['valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']
    >>>
    >>> s2 = 'Valid any day after january 1. Not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.'
    >>> sent_tokenize(s2)
    ['Valid any day after january 1.', 'Not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']
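Since capitalization is what changes the result in the session above, one possible pre-processing step is to restore sentence-initial capitals before calling sent_tokenize. The following is only a minimal sketch: the helper name capitalize_after_period is hypothetical, and the rule is naive (it would also capitalize the word after an abbreviation such as "e.g."), so it may not suit every input.

```python
import re


def capitalize_after_period(text):
    """Uppercase the first character and any lowercase letter that follows '. '.

    Hypothetical helper for illustration; it does not try to detect
    abbreviations, so it can over-capitalize.
    """
    # Capitalize the very first character of the text, if any.
    text = text[:1].upper() + text[1:]
    # Capitalize a lowercase ASCII letter that follows a period and whitespace.
    return re.sub(
        r"(\.\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


s = (
    "valid any day after january 1. not valid on federal holidays, "
    "including february 14, or with other in-house events, specials, "
    "or happy hour."
)
print(capitalize_after_period(s))
```

The returned string can then be passed to sent_tokenize. Note that the sentences come back with the modified capitalization, so if the original casing matters it would have to be restored afterwards.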
Now, let's dig a little deeper. The Punkt tokenizer implements an algorithm by Kiss and Strunk (2005); for the implementation, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py, which describes it as follows:
This tokenizer divides a text into a list of sentences, by using an
unsupervised algorithm to build a model for abbreviation words,
collocations, and words that start sentences. It must be trained on
a large collection of plaintext in the target language before it can
be used.
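Because Punkt is trained on raw text, a tokenizer can also be trained on text that resembles the input, for example lowercase text. The following is a rough sketch using PunktTrainer and PunktSentenceTokenizer from nltk.tokenize.punkt. The training text is made up for illustration and is far too small to give a reliable model, so the printed result should not be taken as representative.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Made-up lowercase training text in the same style as the input.
# A real model needs a much larger collection of plain text.
training_text = (
    "valid monday through friday. not valid on holidays. "
    "offer ends march 3. one coupon per table. "
    "valid any day after june 10. not valid with other offers. "
)

trainer = PunktTrainer()
# finalize=False allows train() to be called again with more text.
trainer.train(training_text, finalize=False)
trainer.finalize_training()

# Build a tokenizer from the learned parameters.
tokenizer = PunktSentenceTokenizer(trainer.get_params())

sample = "valid any day after january 1. not valid on federal holidays."
sentences = tokenizer.tokenize(sample)
print(sentences)
```

Whether the sample is split after "january 1." depends on what the model learns from the training text, so the output should be checked against a held-out set of sentences from the real data.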
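So, in the case of sent_tokenize, the pre-trained model appears to rely on a capital letter after the period as a strong cue for a sentence boundary; the period alone is not enough.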
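In some cases, the corpus might also contain lines such as:

    02. fill the pot with water

where the period after a number does not end a sentence.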
To work around this, I suggest the following:
See also: Training data format for NLTK Punkt