nltk word_tokenize: why do sentence tokenization before word tokenization?

As the source code shows, word_tokenize runs a sentence tokenizer (Punkt) before it runs the word tokenizer (Treebank):


# Standard word tokenizer.
_treebank_word_tokenizer = TreebankWordTokenizer()

def word_tokenize(text, language='english', preserve_line=False):
    """
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).
    :param text: text to split into words
    :param text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preserve_line: An option to keep the preserve the sentence and not sentence tokenize it.
    :type preserver_line: bool
    """
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [token for sent in sentences
            for token in _treebank_word_tokenizer.tokenize(sent)]

What is the benefit of doing sentence tokenization before word tokenization?

Related discussion

The default tokenizer used in NLTK (word_tokenize) is the TreebankWordTokenizer, which originally comes from Michael Heilman's tokenizer.

In the original tokenizer, it states:

[code snippet missing from source]

This regex always splits the final period, and the assumption is that sentence tokenization has been performed beforehand.

Keeping to the Treebank tokenizer, the TreebankWordTokenizer class performs the same regex operation and documents the behavior in its class docstring:

[code snippet missing from source]

More specifically, "separate periods that appear at the end of line" refers to this particular regex:

[code snippet missing from source]

Is it common to assume that sentence tokenization is performed before word tokenization?

Maybe, maybe not; it depends on your task and how you are evaluating it. If we look at other word tokenizers, we see that they perform the same final-period split, e.g. in the Moses (SMT) tokenizer:

[code snippet missing from source]

And similarly in the NLTK port of the Moses tokenizer:

[code snippet missing from source]

The same holds for toktok.pl and its NLTK port.

For users who don't want their text to be sentence-split, the preserve_line option has been available since the code merge for https://github.com/nltk/nltk/issues/1710.

For more explanation of the why and the what, see https://github.com/nltk/nltk/issues/1699
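To make the final-period point concrete, here is a minimal sketch that uses only Python's re module. Everything in it is an assumption for illustration: FINAL_PERIOD is a simplified pattern modeled on the behavior described above, not the regex from NLTK or Moses, and toy_word_tokenize and toy_sent_tokenize are hypothetical stand-ins for the Treebank and Punkt tokenizers.

```python
import re

# Assumption: a simplified final-period rule modeled on the behavior described
# above (detach a period only at the very end of the string). It is not copied
# from NLTK, Moses, or any other tokenizer.
FINAL_PERIOD = re.compile(r'([^.])(\.)([\])}>"\']*)\s*$')


def toy_word_tokenize(sentence):
    """Hypothetical stand-in for a Treebank-style word tokenizer."""
    # Detach the period only when it ends the string.
    sentence = FINAL_PERIOD.sub(r'\1 \2\3 ', sentence)
    # Detach a few other punctuation marks so they become separate tokens.
    sentence = re.sub(r'([,?!])', r' \1 ', sentence)
    return sentence.split()


def toy_sent_tokenize(text):
    """Hypothetical, naive sentence splitter; Punkt is far more careful."""
    return re.split(r'(?<=[.?!])\s+', text.strip())


text = "The cat sat. The dog barked."

# Without sentence splitting, only the last period is detached,
# so the first sentence keeps the token 'sat.'.
without_split = toy_word_tokenize(text)

# With sentence splitting first, every sentence-final period is detached.
with_split = [tok for sent in toy_sent_tokenize(text)
              for tok in toy_word_tokenize(sent)]

print(without_split)  # ['The', 'cat', 'sat.', 'The', 'dog', 'barked', '.']
print(with_split)     # ['The', 'cat', 'sat', '.', 'The', 'dog', 'barked', '.']
```

The first result is roughly what one would expect from preserve_line=True on multi-sentence input: the word tokenizer treats the whole text as a single sentence, so interior periods stay glued to the preceding word.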