nltk word_tokenize: why do sentence tokenization before word tokenization?

As stated in the source code, word_tokenize runs a sentence tokenizer (Punkt) before running the word tokenizer (Treebank):

# Imports added so that this excerpt runs standalone.
from nltk.tokenize import sent_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer

# Standard word tokenizer.
_treebank_word_tokenizer = TreebankWordTokenizer()

def word_tokenize(text, language='english', preserve_line=False):
    """
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into words
    :type text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preserve_line: An option to preserve the sentence and not sentence-tokenize it.
    :type preserve_line: bool
    """
    # Either treat the whole text as one "sentence" or split it with Punkt first.
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    # Word-tokenize each sentence and flatten the result into a single list.
    return [token for sent in sentences
            for token in _treebank_word_tokenizer.tokenize(sent)]

What is the benefit of doing sentence tokenization before word tokenization?


The default tokenizer used in NLTK (word_tokenize) is the TreebankWordTokenizer, which originally comes from Michael Heilman's tokenizer.

We see that the original tokenizer script states:

[code excerpt missing]

This regex will always split the final period, and the assumption is that sentence tokenization has been performed beforehand.

Keeping to the Treebank tokenizer, the TreebankWordTokenizer class performs the same regex operation and documents the behavior in the class docstring:

[code excerpt missing]

More specifically, "separate periods that appear at the end of line" refers to this particular regex:

[code excerpt missing]

Is it a common assumption that sentence tokenization is performed before word tokenization?

Maybe, maybe not; it depends on your task and how you're evaluating it. If we look at other word tokenizers, we see that they perform the same final-period split, e.g. in the Moses (SMT) tokenizer:

[code excerpt missing]

And similarly in the NLTK port of the Moses tokenizer:

[code excerpt missing]

The same holds for toktok.pl and its NLTK port.

For users who don't want their input to be sentence-split, the preserve_line option has been available since the code merge for https://github.com/nltk/nltk/issues/1710

For more explanation of the why and what, see https://github.com/nltk/nltk/issues/1699
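
The final-period behavior described above can be illustrated without NLTK installed. The following is a minimal sketch, not NLTK's actual code: FINAL_PERIOD is a simplified pattern in the spirit of a Treebank-style final-period rule, naive_sentence_split is a hypothetical stand-in for the Punkt sentence tokenizer, and tokenize only mirrors the shape of word_tokenize. It shows why a period is split off only at the end of the string, so a multi-sentence text that skips sentence splitting keeps its inner periods glued to the preceding word.

```python
import re

# Assumption: a simplified stand-in for a Treebank-style "final period" rule.
# It separates a period only when it ends the string, optionally followed by
# closing brackets or quotes. It is not a verbatim copy of NLTK's pattern.
FINAL_PERIOD = re.compile(r"([^.])(\.)([\])}>\"']*)\s*$")


def split_final_period(sentence):
    """Separate the final period, then split on whitespace."""
    return FINAL_PERIOD.sub(r"\1 \2\3 ", sentence).split()


def naive_sentence_split(text):
    """Hypothetical stand-in for Punkt: break after ., ! or ? plus whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def tokenize(text, preserve_line=False):
    """Mirror the shape of word_tokenize: optional sentence split, then words."""
    sentences = [text] if preserve_line else naive_sentence_split(text)
    return [tok for sent in sentences for tok in split_final_period(sent)]


text = "Hello world. This is a test."

# No sentence splitting: only the very last period becomes its own token.
print(tokenize(text, preserve_line=True))
# ['Hello', 'world.', 'This', 'is', 'a', 'test', '.']

# Sentence splitting first: every sentence-final period becomes its own token.
print(tokenize(text))
# ['Hello', 'world', '.', 'This', 'is', 'a', 'test', '.']
```

With preserve_line=True the token 'world.' keeps its period, which is exactly the case that running a sentence tokenizer first is meant to avoid.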