Notes on using stanza, the Stanford NLP Group's Python deep learning NLP toolkit

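Stanford CoreNLP is written in Java. Its Python wrappers only forward requests to the Java interface, which is not very convenient.

You can try CoreNLP's output on the official demo page: http://corenlp.run/

stanza is the Stanford NLP Group's native Python NLP library. It is not a line-by-line port of CoreNLP, but its neural pipeline runs without a Java server, which makes it more convenient.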

1. Installation

pip install stanza

2. Download the models (stanza_resources)


import stanza
stanza.download('en') # download English model
stanza.download('zh') # download Chinese model

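Note: if the download fails inside Jupyter, run it from the Python interactive shell in a terminal instead. You can also copy the download link, fetch the archive with a download tool, and unpack it so that it follows the expected directory structure.

Directory structure: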

3. Usage


import stanza

# Options can be collected in a config dict, or passed to Pipeline one by one.
# lang selects the language.
config = {
    'dir': './stanza_resources/',  # path to the model directory; required if the models were not downloaded with stanza.download()
    # 'processors': 'tokenize,mwt,pos,ner',  # Comma-separated list of processors to use
    'lang': 'zh',  # or 'en'; language code for the language to build the Pipeline in
    # 'tokenize_model_path': './fr_gsd_models/fr_gsd_tokenizer.pt',  # Processor-specific arguments are set with keys "{processor_name}_{argument_name}"
    # 'mwt_model_path': './fr_gsd_models/fr_gsd_mwt_expander.pt',
    # 'pos_model_path': './fr_gsd_models/fr_gsd_tagger.pt',
    # 'pos_pretrain_path': './fr_gsd_models/fr_gsd.pretrain.pt',
    # 'tokenize_pretokenized': True  # Use pretokenized text as input and disable tokenization
}
nlp = stanza.Pipeline(**config)
# Output (note: the log below was captured from a run with lang='en'):
2020-04-15 16:58:35 INFO: Loading these models for language: en (English):
=========================
| Processor | Package |
-------------------------
| tokenize | ewt |
| pos | ewt |
| lemma | ewt |
| depparse | ewt |
| ner | ontonotes |
=========================

2020-04-15 16:58:35 INFO: Use device: gpu
2020-04-15 16:58:35 INFO: Loading: tokenize
2020-04-15 16:58:40 INFO: Loading: pos
2020-04-15 16:58:41 INFO: Loading: lemma
2020-04-15 16:58:41 INFO: Loading: depparse
2020-04-15 16:58:42 INFO: Loading: ner
2020-04-15 16:58:42 INFO: Done loading processors!
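
The config above mentions that processor-specific arguments use keys of the form "{processor_name}_{argument_name}". As a rough illustration of that naming rule only, here is a small sketch that flattens nested options into such keys. The helper name build_config and its parameters are hypothetical (they are not part of stanza), and the snippet does not import stanza or load any model.

```python
def build_config(lang, resources_dir, processor_options=None):
    """Return keyword arguments in the style of the config dict above.

    processor_options maps a processor name to {argument name: value};
    every entry is flattened into a '{processor_name}_{argument_name}' key.
    This helper is hypothetical and only illustrates the key naming rule.
    """
    config = {"lang": lang, "dir": resources_dir}
    for processor, arguments in (processor_options or {}).items():
        for argument, value in arguments.items():
            config[f"{processor}_{argument}"] = value
    return config


config = build_config(
    "zh",
    "./stanza_resources/",
    {"tokenize": {"pretokenized": True}},
)
print(config)
```

The resulting dict could then be passed as stanza.Pipeline(**config), exactly like the hand-written one.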

doc = nlp('快速的棕色狐狸跳过了懒惰的狗')

doc.sentences
# Output:
[[
{
"id": "1",
"text": "快速",
"lemma": "快速",
"upos": "ADJ",
"xpos": "JJ",
"head": 4,
"deprel": "amod",
"misc": "start_char=0|end_char=2"
},
{
"id": "2",
"text": "的",
"lemma": "的",
"upos": "PART",
"xpos": "DEC",
"head": 1,
"deprel": "mark:relcl",
"misc": "start_char=2|end_char=3"
},
{
"id": "3",
"text": "棕色",
"lemma": "棕色",
"upos": "NOUN",
"xpos": "NN",
"head": 4,
"deprel": "nmod",
"misc": "start_char=3|end_char=5"
},
{
"id": "4",
"text": "狐狸",
"lemma": "狐狸",
"upos": "NOUN",
"xpos": "NN",
"head": 5,
"deprel": "nsubj",
"misc": "start_char=5|end_char=7"
},
{
"id": "5",
"text": "跳过",
"lemma": "跳过",
"upos": "VERB",
"xpos": "VV",
"head": 0,
"deprel": "root",
"misc": "start_char=7|end_char=9"
},
{
"id": "6",
"text": "了",
"lemma": "了",
"upos": "PART",
"xpos": "AS",
"feats": "Aspect=Perf",
"head": 5,
"deprel": "case:aspect",
"misc": "start_char=9|end_char=10"
},
{
"id": "7",
"text": "懒惰",
"lemma": "懒惰",
"upos": "ADJ",
"xpos": "JJ",
"head": 9,
"deprel": "amod",
"misc": "start_char=10|end_char=12"
},
{
"id": "8",
"text": "的",
"lemma": "的",
"upos": "PART",
"xpos": "DEC",
"head": 7,
"deprel": "mark:relcl",
"misc": "start_char=12|end_char=13"
},
{
"id": "9",
"text": "狗",
"lemma": "狗",
"upos": "NOUN",
"xpos": "NN",
"head": 5,
"deprel": "obj",
"misc": "start_char=13|end_char=14"
}
]]
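
The "misc" field in the output above packs the character offsets into a "key=value|key=value" string. The sketch below shows one way to unpack it and slice the original text with the result. It works on three word dicts copied from the output (abbreviated to the fields used), so it runs without stanza; parse_misc is a hypothetical helper name.

```python
text = "快速的棕色狐狸跳过了懒惰的狗"

# Three of the word dicts shown above, abbreviated to the fields used here.
words = [
    {"id": "4", "text": "狐狸", "upos": "NOUN", "misc": "start_char=5|end_char=7"},
    {"id": "5", "text": "跳过", "upos": "VERB", "misc": "start_char=7|end_char=9"},
    {"id": "9", "text": "狗", "upos": "NOUN", "misc": "start_char=13|end_char=14"},
]


def parse_misc(misc):
    """Split 'key=value|key=value' into a dict, converting digit strings to int."""
    fields = dict(item.split("=", 1) for item in misc.split("|") if item)
    return {key: int(value) if value.isdigit() else value for key, value in fields.items()}


spans = []
for word in words:
    offsets = parse_misc(word["misc"])
    # Slice the original sentence with the recovered character offsets.
    span = text[offsets["start_char"]:offsets["end_char"]]
    spans.append(span)
    print(word["id"], word["upos"], span)
```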

doc.sentences[0].print_dependencies()
# Output:
('快速', '4', 'amod')
('的', '1', 'mark:relcl')
('棕色', '4', 'nmod')
('狐狸', '5', 'nsubj')
('跳过', '0', 'root')
('了', '5', 'case:aspect')
('懒惰', '9', 'amod')
('的', '7', 'mark:relcl')
('狗', '5', 'obj')
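
The triples above are (word, head index, relation). Assuming words are numbered from 1 in sentence order and head 0 marks the root (which matches the "id" fields shown earlier), the dependency tree can be rebuilt with a few lines of plain Python. This is only a sketch on the copied triples, not a stanza API.

```python
from collections import defaultdict

# (text, head, deprel) triples copied from the print_dependencies() output.
deps = [
    ("快速", "4", "amod"), ("的", "1", "mark:relcl"), ("棕色", "4", "nmod"),
    ("狐狸", "5", "nsubj"), ("跳过", "0", "root"), ("了", "5", "case:aspect"),
    ("懒惰", "9", "amod"), ("的", "7", "mark:relcl"), ("狗", "5", "obj"),
]

# Map each head index to the 1-based indices of its dependents.
children = defaultdict(list)
for index, (_text, head, _deprel) in enumerate(deps, start=1):
    children[int(head)].append(index)


def render(index, depth=0):
    """Return the subtree rooted at `index` as indented lines."""
    text, _head, deprel = deps[index - 1]
    lines = ["  " * depth + f"{text} ({deprel})"]
    for child in children[index]:
        lines.extend(render(child, depth + 1))
    return lines


root = children[0][0]  # the single word whose head is 0
tree = render(root)
print("\n".join(tree))
```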

doc.sentences[0].print_tokens()
# Output:
<Token id=1;words=[<Word id=1;text=快速;lemma=快速;upos=ADJ;xpos=JJ;head=4;deprel=amod>]>
<Token id=2;words=[<Word id=2;text=的;lemma=的;upos=PART;xpos=DEC;head=1;deprel=mark:relcl>]>
<Token id=3;words=[<Word id=3;text=棕色;lemma=棕色;upos=NOUN;xpos=NN;head=4;deprel=nmod>]>
<Token id=4;words=[<Word id=4;text=狐狸;lemma=狐狸;upos=NOUN;xpos=NN;head=5;deprel=nsubj>]>
<Token id=5;words=[<Word id=5;text=跳过;lemma=跳过;upos=VERB;xpos=VV;head=0;deprel=root>]>
<Token id=6;words=[<Word id=6;text=了;lemma=了;upos=PART;xpos=AS;feats=Aspect=Perf;head=5;deprel=case:aspect>]>
<Token id=7;words=[<Word id=7;text=懒惰;lemma=懒惰;upos=ADJ;xpos=JJ;head=9;deprel=amod>]>
<Token id=8;words=[<Word id=8;text=的;lemma=的;upos=PART;xpos=DEC;head=7;deprel=mark:relcl>]>
<Token id=9;words=[<Word id=9;text=狗;lemma=狗;upos=NOUN;xpos=NN;head=5;deprel=obj>]>

doc.sentences[0].print_words()
# Output:
<Word id=1;text=快速;lemma=快速;upos=ADJ;xpos=JJ;head=4;deprel=amod>
<Word id=2;text=的;lemma=的;upos=PART;xpos=DEC;head=1;deprel=mark:relcl>
<Word id=3;text=棕色;lemma=棕色;upos=NOUN;xpos=NN;head=4;deprel=nmod>
<Word id=4;text=狐狸;lemma=狐狸;upos=NOUN;xpos=NN;head=5;deprel=nsubj>
<Word id=5;text=跳过;lemma=跳过;upos=VERB;xpos=VV;head=0;deprel=root>
<Word id=6;text=了;lemma=了;upos=PART;xpos=AS;feats=Aspect=Perf;head=5;deprel=case:aspect>
<Word id=7;text=懒惰;lemma=懒惰;upos=ADJ;xpos=JJ;head=9;deprel=amod>
<Word id=8;text=的;lemma=的;upos=PART;xpos=DEC;head=7;deprel=mark:relcl>
<Word id=9;text=狗;lemma=狗;upos=NOUN;xpos=NN;head=5;deprel=obj>

doc = nlp('新冠病毒在美国情况恶劣。')

doc.ents  # doc.entities returns the same list
# Output:
[{
"text": "美国",
"type": "GPE",
"start_char": 5,
"end_char": 7
}]
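
The start_char and end_char values in the entity output are offsets into the input string. A minimal sketch of using them to mark entities inline follows; it works on the dict copied from the output above, and mark_entities is a hypothetical helper rather than part of stanza.

```python
text = "新冠病毒在美国情况恶劣。"

# Entity dict copied from the output above.
entities = [{"text": "美国", "type": "GPE", "start_char": 5, "end_char": 7}]


def mark_entities(text, entities):
    """Wrap each entity span as [text/TYPE].

    Spans are processed from right to left so that earlier offsets
    remain valid after each insertion.
    """
    for ent in sorted(entities, key=lambda e: e["start_char"], reverse=True):
        start, end = ent["start_char"], ent["end_char"]
        text = f"{text[:start]}[{text[start:end]}/{ent['type']}]{text[end:]}"
    return text


marked = mark_entities(text, entities)
print(marked)
```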

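The tag explanations below were collected from the web; they will be removed on request if they infringe any rights.

Part-of-speech and entity tag explanations

https://www.cnblogs.com/gaofighting/p/9768023.html

Syntactic relation tag explanations:

Source: https://blog.csdn.net/l919898756/article/details/81670228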

ROOT: the sentence being processed
IP: simple clause
NP: noun phrase
VP: verb phrase
PU: punctuation, usually a sentence-final mark such as a full stop, question mark, or exclamation mark
LCP: localizer phrase
PP: prepositional phrase
CP: phrase formed with "的" that expresses a modifying relation
DNP: phrase formed with "的" that expresses possession
ADVP: adverb phrase
ADJP: adjective phrase
DP: determiner phrase
QP: quantifier phrase
NN: common noun
NR: proper noun
NT: temporal noun
PN: pronoun
VV: verb
VC: the copula 是
CC: coordinating conjunction
VE: 有 as a main verb
VA: predicative adjective
AS: aspect marker (e.g. 了)
VRD: verb-resultative compound
CD: cardinal number
DT: determiner
EX: existential there
FW: foreign word
IN: preposition or subordinating conjunction
JJ: adjective or ordinal numeral
JJR: comparative adjective
JJS: superlative adjective
LS: list item marker
MD: modal auxiliary
PDT: pre-determiner
POS: genitive marker
PRP: personal pronoun
RB: adverb
RBR: comparative adverb
RBS: superlative adverb
RP: particle
SYM: symbol
TO: "to" as preposition or infinitive marker
WDT: WH-determiner
WP: WH-pronoun
WP$: possessive WH-pronoun
WRB: WH-adverb
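
A tag list like the one above is easy to turn into a lookup table for annotating pipeline output. The sketch below assumes English glosses for a handful of the tags and applies them to (text, xpos) pairs taken from the earlier example sentence. XPOS_GLOSS and gloss are hypothetical names, and tags missing from the list (such as DEC) fall back to a placeholder.

```python
# Glosses for a few tags from the list above (wording is this sketch's own).
XPOS_GLOSS = {
    "NN": "common noun",
    "VV": "verb",
    "AS": "aspect marker",
    "JJ": "adjective or ordinal numeral",
}

# (text, xpos) pairs taken from the example sentence earlier in the post.
words = [("狐狸", "NN"), ("跳过", "VV"), ("了", "AS"), ("的", "DEC")]


def gloss(tag):
    """Look up a tag, falling back to a placeholder for tags not in the table."""
    return XPOS_GLOSS.get(tag, "(not in the list above)")


annotated = [(text, tag, gloss(tag)) for text, tag in words]
for row in annotated:
    print(*row)
```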

Relations
abbrev: abbreviation modifier
acomp: adjectival complement
advcl: adverbial clause modifier
advmod: adverbial modifier
agent: agent; usually appears together with "by"
amod: adjectival modifier
appos: appositional modifier
attr: attributive
aux: auxiliary; non-main verbs and auxiliaries such as be, have, should/could
auxpass: passive auxiliary
cc: coordination; usually attached to the first conjunct
ccomp: clausal complement
complm: complementizer; the word that introduces a complement clause
conj: conjunct; links two coordinated words
cop: copula (e.g. be, seem, appear); the link between the subject and the predicate
csubj: clausal subject
csubjpass: clausal passive subject
dep: dependent; an unspecified dependency
det: determiner, e.g. an article
dobj: direct object
expl: expletive; mainly captures existential "there"
infmod: infinitival modifier
iobj: indirect object
mark: marker; the word introducing a subordinate clause, such as "that", "whether", "because", "when"
mwe: multi-word expression
neg: negation modifier
nn: noun compound modifier
npadvmod: noun phrase as adverbial modifier
nsubj: nominal subject
nsubjpass: passive nominal subject
num: numeric modifier
number: element of compound number
parataxis: parataxis; loosely juxtaposed clauses
partmod: participial modifier
pcomp: prepositional complement
pobj: object of a preposition
poss: possession modifier
possessive: possessive modifier; the relation between the possessor and its 's
preconj: preconjunct; typically appears with "either", "both", "neither"
predet: predeterminer; often a word meaning "all"
prep: prepositional modifier
prepc: prepositional clausal modifier
prt: phrasal verb particle
punct: punctuation; kept in the scheme, but by default not retained in Stanford Dependencies output
purpcl: purpose clause modifier
quantmod: quantifier phrase modifier
rcmod: relative clause modifier
ref: referent
rel: relative
root: root; the most important word, the root node the tree starts from
tmod: temporal modifier
xcomp: open clausal complement
xsubj: controlling subject
Head is a predicate
subj: subject
nsubj: nominal subject (同步, 建设)
top: topic (是, 建筑)
npsubj: nominal passive subject; specifically the subject of a passive sentence introduced by "被", usually the semantic patient of the predicate (称作, 镍)
csubj: clausal subject; does not occur in Chinese
xsubj: controlling subject; typically one subject governing several clauses (完善, 有些)
Head is a predicate or a preposition
obj: object
dobj: direct object (颁布, 文件)
iobj: indirect object; essentially does not occur
range: indirect object that is a quantifier phrase, also called the dative (成交, 元)
pobj: prepositional object (根据, 要求)
lobj: localizer object, here a temporal one (来, 近年)
Head is a predicate
comp: complement
ccomp: clausal complement; usually involves two verbs, where the head governs the clause (IP) containing the second verb (出现, 纳入)
xcomp: open clausal complement; does not occur
acomp: adjectival complement
tcomp: temporal complement (遇到, 以前)
lccomp: localizer complement (占, 以上)
(label missing): resultative complement
Head is a noun
mod: modifier
pass: passive modifier
tmod: temporal modifier
rcmod: relative clause modifier (问题, 遇到)
numod: numeric modifier (规定, 若干)
ornmod: ordinal number modifier
clf: classifier modifier (文件, 件)
nmod: noun compound modifier (浦东, 上海)
amod: adjectival modifier (情况, 新)
advmod: adverbial modifier (做到, 基本)
vmod: verb modifier, participle modifier
prnmod: parenthetical modifier
neg: negative modifier (遇到, 不)
det: determiner modifier (活动, 这些)
possm: possessive marker, NP
poss: possessive modifier, NP
dvpm: DVP marker, DVP (简单, 的)
dvpmod: DVP modifier, DVP (采取, 简单)
assm: associative marker, DNP (开发, 的)
assmod: associative modifier, NP|QP (教训, 特区)
prep: prepositional modifier, NP|VP|IP (采取, 对)
clmod: clause modifier (因为, 开始)
plmod: prepositional localizer modifier (在, 上)
asp: aspect marker (做到, 了)
partmod: participial modifier; does not occur
etc: "etc." relation (办法, 等)
Head is a content word
conj: conjunct
cop: copula (the author was unsure whether this also covers auxiliary verbs)
cc: coordination; links the head word and the conjunction (开发, 与)
Other
attr: attributive (是, 工程)
cordmod: coordinated verb compound (颁布, 实行)
mmod: modal verb (得到, 能)
ba: 把-construction relation
tclaus: temporal clause (以后, 积累)
(label missing): semantic dependent
cpm: complementizer; usually a CP introduced by "的" (振兴, 的)

References:

Stanford CoreNLP official site: https://stanfordnlp.github.io/CoreNLP/index.html#human-languages-supported

stanza official site: https://stanfordnlp.github.io/stanza/index.html

Other resources: http://www.52nlp.cn/tag/corenlp