Stanford CoreNLP tokenize.whitespace property not working on Chinese

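I am using Stanford CoreNLP to do POS tagging and NER on pre-tokenized Chinese text. I read the official documentation at https://stanfordnlp.github.io/CoreNLP/tokenize.html, which says of the tokenize.whitespace option: "If set to true, separates words only when whitespace is encountered." That is exactly what I want.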

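However, I am using Python, with pycorenlp to talk to the CoreNLP Server, and I know nothing about Java. I then read the answer to "How to NER and POS tag a pre-tokenized text with Stanford CoreNLP?" and thought that maybe all I had to do was add 'tokenize.whitespace': 'true' and one other property to the properties dictionary of my POST request, but that does not work at all. I run the server like this: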

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-chinese.properties -port 9000 -timeout 150000

In my Jupyter notebook:

from pycorenlp import StanfordCoreNLP

# Connect to the CoreNLP server started above
nlp = StanfordCoreNLP('http://localhost:9000')

output = nlp.annotate('公司作为物联网行业', properties={
    'annotators': 'pos,ner',
    'tokenize.whitespace': 'true',  # first property
    'ssplit.eolonly': 'true',  # second property
    'outputFormat': 'json'
})

# Print the tokens of each sentence, separated by spaces
for sentence in output['sentences']:
    print(' '.join([token['word'] for token in sentence['tokens']]))

This gives:

公司作为物联网行业

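CoreNLP is still tokenizing the token "物联网", just as if I had not added the two properties. I then tried creating a .properties file and passing it on the command line in place of StanfordCoreNLP-chinese.properties, but that does not work either. In my test.properties: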

tokenize.whitespace=true
ssplit.eolonly=true

Then I run the server like this:

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties 'test.properties' -port 9000 -timeout 150000

It still behaves as if I had changed nothing. Does anyone know what I am missing? Any help is appreciated :)

In the end, I solved my own problem.

It seems that tokenize.whitespace=true is hard to get working on Chinese text. Instead, add

'tokenize.language': 'Whitespace'

to your properties dictionary or, equivalently, add

tokenize.language: Whitespace

to your .properties file to get the job done.

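This property is documented on the same page, https://stanfordnlp.github.io/CoreNLP/tokenize.html#options, and I had not noticed it before. It is a little confusing that two properties exist for the same purpose.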
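
For reference, the fix can be wrapped in a small helper. This is only a sketch: the function names whitespace_properties and tag_pretokenized are made up for illustration, and it assumes the client object exposes annotate(text, properties=...) and returns parsed JSON the way pycorenlp does in the notebook code above. The 'pos' and 'ner' keys are read with .get() because they are only present when the corresponding annotators actually ran.

```python
def whitespace_properties(annotators="pos,ner"):
    """Build request properties that ask CoreNLP to split tokens on whitespace only.

    Uses 'tokenize.language': 'Whitespace' (the fix from the answer) instead of
    'tokenize.whitespace': 'true', which had no visible effect with the Chinese models.
    """
    return {
        "annotators": annotators,
        "tokenize.language": "Whitespace",
        "ssplit.eolonly": "true",
        "outputFormat": "json",
    }


def tag_pretokenized(client, tokens):
    """Send already-segmented tokens to a CoreNLP client (hypothetical helper).

    `client` is assumed to behave like pycorenlp's StanfordCoreNLP object.
    Returns a list of (word, pos, ner) tuples across all sentences.
    """
    # Join the tokens with single spaces so the whitespace tokenizer
    # recovers exactly the segmentation we supplied.
    text = " ".join(tokens)
    output = client.annotate(text, properties=whitespace_properties())
    return [
        (tok["word"], tok.get("pos"), tok.get("ner"))
        for sentence in output["sentences"]
        for tok in sentence["tokens"]
    ]
```

With the server from the question running, this would be called as tag_pretokenized(StanfordCoreNLP('http://localhost:9000'), words), where words is your own list of already-segmented Chinese words.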