nltk word_tokenize: why do sentence tokenization before word tokenization?
As described in the source code,
```python
# Standard word tokenizer.
_treebank_word_tokenizer = TreebankWordTokenizer()

def word_tokenize(text, language='english', preserve_line=False):
    """
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into words
    :type text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preserve_line: An option to keep the sentence and not sentence tokenize it.
    :type preserve_line: bool
    """
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [token for sent in sentences
            for token in _treebank_word_tokenizer.tokenize(sent)]
```
What is the benefit of performing sentence tokenization before word tokenization?
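To make the question concrete, here is a minimal sketch of the flow in the source above, using toy stand-ins (`toy_sent_tokenize`, `toy_treebank_tokenize`, and `toy_word_tokenize` are hypothetical helpers for illustration, not NLTK's real code):

```python
import re

def toy_sent_tokenize(text):
    # Naive sentence splitter: break after . ! or ? followed by whitespace.
    return re.split(r'(?<=[.!?])\s+', text)

# Toy word tokenizer in the Treebank spirit: whitespace split, but first
# detach a period only when it ends the whole input string.
FINAL_PERIOD = re.compile(r'([^\.])(\.)\s*$')

def toy_treebank_tokenize(sent):
    return FINAL_PERIOD.sub(r'\1 \2', sent).split()

def toy_word_tokenize(text, preserve_line=False):
    # Same structure as NLTK's word_tokenize: sentence-split first (unless
    # preserve_line is set), then word-tokenize each sentence and flatten.
    sentences = [text] if preserve_line else toy_sent_tokenize(text)
    return [tok for sent in sentences for tok in toy_treebank_tokenize(sent)]

# With sentence tokenization, every sentence-final period is detached:
print(toy_word_tokenize("It works. It does."))
# ['It', 'works', '.', 'It', 'does', '.']

# Without it (preserve_line=True), only the very last period is detached:
print(toy_word_tokenize("It works. It does.", preserve_line=True))
# ['It', 'works.', 'It', 'does', '.']
```

The difference in the two outputs is exactly what the question is about: the word-level rule only detaches a period at the end of its input, so it relies on each sentence arriving separately.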
The default tokenizer used by NLTK's `word_tokenize()` is the `TreebankWordTokenizer`, originally from Michael Heilman's tokenizer.

We see that in the `tokenizer.sed` script, it states:

```
# Assume sentence tokenization has been done first, so split FINAL periods only.
```

That rule always splits the final period, and the assumption is that sentence tokenization has been performed beforehand.

Keeping to the Treebank tokenizer, NLTK's `TreebankWordTokenizer` performs the same regex operation and documents the behavior in the class docstring:

```python
class TreebankWordTokenizer(TokenizerI):
    """
    The Treebank tokenizer uses regular expressions to tokenize text as in
    Penn Treebank.  This is the method that is invoked by ``word_tokenize()``.
    It assumes that the text has already been segmented into sentences,
    e.g. using ``sent_tokenize()``.

    This tokenizer performs the following steps:

    - split standard contractions, e.g. ``don't`` -> ``do n't``
      and ``they'll`` -> ``they 'll``
    - treat most punctuation characters as separate tokens
    - split off commas and single quotes, when followed by whitespace
    - separate periods that appear at the end of line
    """
```

More specifically, "separate periods that appear at the end of line" refers to this particular regex:

```python
(re.compile(r'([^\.])(\.)([\]\)}>"\']*)\s*$'), r'\1 \2\3 '),
```

**Is it a common convention to do sentence tokenization before word tokenization?**

Maybe, maybe not; it depends on your task and how you're evaluating it. If we look at other word tokenizers, we see that they perform the same final-period split, e.g. in the Moses (SMT) `tokenizer.perl`, and similarly in the NLTK port of the Moses tokenizer. The same goes for `toktok.pl` and its NLTK port.

For users who don't want their input to be sentence-split, the `preserve_line` option has been available since the https://github.com/nltk/nltk/issues/1710 code merge =)

For more explanation of why and what, see https://github.com/nltk/nltk/issues/1699
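The final-period regex above can be exercised in isolation to see why single-sentence input is assumed (`split_final_period` is a hypothetical helper wrapping the pattern, not part of NLTK's API):

```python
import re

# The Treebank-style final-period rule: split off a period only when it is
# the last character of the string (optionally followed by closing
# brackets/quotes), as in TreebankWordTokenizer's punctuation rules.
FINAL_PERIOD = re.compile(r'([^\.])(\.)([\]\)}>"\']*)\s*$')

def split_final_period(sent):
    # Hypothetical helper for illustration only.
    return FINAL_PERIOD.sub(r'\1 \2\3 ', sent).strip()

# Abbreviation periods mid-string are left alone; the final one is split:
print(split_final_period("I saw Dr. Smith."))
# I saw Dr. Smith .

# Two sentences in one string: only the very last period is split, so the
# first sentence's period stays glued to its word.
print(split_final_period("It works. It really does."))
# It works. It really does .
```

This is the behavior all the tokenizers above share, and it is only correct when the input has already been split into sentences.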