Word sense disambiguation in NLTK Python

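I am new to NLTK Python and am looking for a sample application that can do word sense disambiguation. My searches turned up plenty of algorithms but no sample application. I just want to pass in a sentence and get the sense of each word by referring to the WordNet library.
Thanks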

I found a similar module in Perl: http://marimba.d.umn.edu/allwords/allwords.html
Is there such a module in NLTK Python?


Recently, part of the pywsd code has been ported into the wsd.py module in the latest version of NLTK. Try:

>>> from nltk.wsd import lesk
>>> sent = 'I went to the bank to deposit my money'
>>> ambiguous = 'bank'
>>> lesk(sent, ambiguous)
Synset('bank.v.04')
>>> lesk(sent, ambiguous).definition()
u'act as the banker in a game or in gambling'

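For better WSD performance, use the pywsd library instead of the NLTK module. In general, simple_lesk() from pywsd does better than lesk from NLTK. I will try to update the NLTK module when I have time.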

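In response to Chris Spencer's comment, please note the limitations of Lesk algorithms. I am simply giving an accurate implementation of the algorithms. It is not a silver bullet: http://en.wikipedia.org/wiki/Lesk_algorithm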
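
To make that limitation concrete, here is a minimal sketch of the gloss-overlap idea behind simplified Lesk: choose the sense whose gloss shares the most words with the context. The two-sense inventory `TOY_SENSES`, its glosses, the stopword list, and the helper `toy_lesk` are all hypothetical illustrations and not part of NLTK or pywsd.

```python
# Toy illustration of gloss-overlap (simplified Lesk) disambiguation.
# TOY_SENSES, STOPWORDS and toy_lesk are hypothetical; real code would
# take its senses and glosses from WordNet.

STOPWORDS = {"a", "an", "the", "to", "of", "that", "and", "into", "my", "i", "was"}

TOY_SENSES = {
    "bank": {
        "bank.financial": "a financial institution that accepts deposits "
                          "and channels the money into lending activities",
        "bank.river": "sloping land especially the slope beside a body "
                      "of water such as a river",
    }
}


def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def toy_lesk(sentence, word):
    """Return (sense_id, overlap) for the sense whose gloss overlaps the context most."""
    context = content_words(sentence) - {word}
    best_sense, best_overlap = None, -1
    for sense_id, gloss in TOY_SENSES[word].items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense_id, overlap
    return best_sense, best_overlap


print(toy_lesk("I went to the bank to deposit my money", "bank"))
print(toy_lesk("The river bank was full of dead fishes", "bank"))
```

Note that each decision here rests on a single shared word, and that "deposit" does not match "deposits" in the gloss because the comparison is on exact tokens. That fragility is the kind of limitation to keep in mind, and it is one reason stemmed variants of the algorithm exist.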

Also note that, although:

lesk("My cat likes to eat mice.","cat","n")

does not give you the right answer, you can use pywsd's implementation of max_similarity():

>>> from pywsd.similarity import max_similarity
>>> # definition() is a method in NLTK 3; older releases exposed it as an attribute
>>> max_similarity('my cat likes to eat mice', 'cat', 'wup', pos='n').definition()
'feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats'
>>> max_similarity('my cat likes to eat mice', 'cat', 'lin', pos='n').definition()
'feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats'
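
For readers wondering what a max-similarity approach does in principle, here is a rough sketch: score each candidate sense of the target word by how similar it is to the senses of the other context words, and keep the best one. It uses a Wu-Palmer style score over a tiny hand-made taxonomy. The `PARENT` tree, the `SENSES` table, and every function name below are hypothetical, and the scoring is an assumption for illustration rather than pywsd's actual code.

```python
# Toy sketch of max-similarity WSD with a Wu-Palmer style score.
# PARENT, SENSES and all helpers are hypothetical, not pywsd's API.

PARENT = {
    "animal": "entity", "artifact": "entity",
    "feline": "animal", "rodent": "animal", "vehicle": "artifact",
    "cat.feline": "feline", "cat.tractor": "vehicle", "mouse.rodent": "rodent",
}

SENSES = {"cat": ["cat.tractor", "cat.feline"], "mice": ["mouse.rodent"]}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Depth of a node, counting the root as depth 1."""
    return len(path_to_root(node))


def wup(a, b):
    """Wu-Palmer style similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # Walking up from a, the first node also above b is the deepest shared ancestor.
    lcs = next(n for n in path_to_root(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def toy_max_similarity(context_words, target):
    """Pick the target sense with the highest summed best-similarity to the context."""
    def score(sense):
        return sum(
            max(wup(sense, other) for other in SENSES[w])
            for w in context_words
            if w in SENSES and w != target
        )
    return max(SENSES[target], key=score)


print(toy_max_similarity("my cat likes to eat mice".split(), "cat"))
```

In this toy tree the feline sense shares the "animal" ancestor with the rodent sense of "mice", so it scores 0.5 against 0.25 for the tractor sense and wins.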

@Chris, if you want a python setup.py, just ask politely and I will write it...


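Yes. In fact, the NLTK team wrote a book with several chapters on classification, and it explicitly covers how to use WordNet. You can also buy a physical copy of the book from Safari.

FYI: NLTK was written by natural language programming academics for use in their introductory programming courses.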


See http://jaganadhg.freeflux.net/blog/archive/2010/10/16/wordnet-sense-similarity-with-nltk-some-basics.html


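As a practical answer to the OP's request, here is a Python implementation of several WSD methods that return senses as NLTK synsets: https://github.com/alvations/pywsd

It includes

  • Lesk algorithms (including original Lesk, adapted Lesk, and simple Lesk)
  • Baseline algorithms (random sense, first sense, most frequent sense)

It can be used like this: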

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Run from inside a checkout of the pywsd repository so that the
# lesk and baseline modules are importable.

bank_sents = ['I went to the bank to deposit my money',
              'The river bank was full of dead fishes']

plant_sents = ['The workers at the industrial plant were overworked',
               'The plant was no longer bearing flowers']

print("======== TESTING simple_lesk ===========\n")

from lesk import simple_lesk
print("#TESTING simple_lesk() ...")
print("Context:", bank_sents[0])
answer = simple_lesk(bank_sents[0], 'bank')
print("Sense:", answer)
# definition() is a method in NLTK 3; older releases exposed it as an attribute
print("Definition:", answer.definition())
print()

print("#TESTING simple_lesk() with POS ...")
print("Context:", bank_sents[1])
answer = simple_lesk(bank_sents[1], 'bank', 'n')
print("Sense:", answer)
print("Definition:", answer.definition())
print()

print("#TESTING simple_lesk() with POS and stems ...")
print("Context:", plant_sents[0])
answer = simple_lesk(plant_sents[0], 'plant', 'n', True)
print("Sense:", answer)
print("Definition:", answer.definition())
print()

print("======== TESTING baseline ===========\n")

from baseline import random_sense, first_sense
from baseline import max_lemma_count as most_frequent_sense

# The baselines take only the word; the context is printed for reference.
print("#TESTING random_sense() ...")
print("Context:", bank_sents[0])
answer = random_sense('bank')
print("Sense:", answer)
print("Definition:", answer.definition())
print()

print("#TESTING first_sense() ...")
print("Context:", bank_sents[0])
answer = first_sense('bank')
print("Sense:", answer)
print("Definition:", answer.definition())
print()

print("#TESTING most_frequent_sense() ...")
print("Context:", bank_sents[0])
answer = most_frequent_sense('bank')
print("Sense:", answer)
print("Definition:", answer.definition())
print()

[out]:

======== TESTING simple_lesk ===========

#TESTING simple_lesk() ...
Context: I went to the bank to deposit my money
Sense: Synset('depository_financial_institution.n.01')
Definition: a financial institution that accepts deposits and channels the money into lending activities

#TESTING simple_lesk() with POS ...
Context: The river bank was full of dead fishes
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)

#TESTING simple_lesk() with POS and stems ...
Context: The workers at the industrial plant were overworked
Sense: Synset('plant.n.01')
Definition: buildings for carrying on industrial labor

======== TESTING baseline ===========
#TESTING random_sense() ...
Context: I went to the bank to deposit my money
Sense: Synset('deposit.v.02')
Definition: put into a bank account

#TESTING first_sense() ...
Context: I went to the bank to deposit my money
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)

#TESTING most_frequent_sense() ...
Context: I went to the bank to deposit my money
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)
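
The three baselines never look at the sentence, which is why they are useful as a floor when judging a context-aware method. A minimal sketch of what each one does, assuming a hypothetical inventory `TOY_INVENTORY` that lists senses in dictionary order with made-up usage counts (none of this is pywsd's code):

```python
import random

# Hypothetical sense inventory: (sense_id, usage_count) pairs in dictionary order.
# The counts are invented for illustration.
TOY_INVENTORY = {
    "bank": [("bank.river", 25), ("bank.financial", 20), ("bank.row", 2)],
}


def random_sense(word, rng=random):
    """Pick any listed sense uniformly at random."""
    return rng.choice(TOY_INVENTORY[word])[0]


def first_sense(word):
    """Pick the sense listed first in the inventory."""
    return TOY_INVENTORY[word][0][0]


def most_frequent_sense(word):
    """Pick the sense with the highest usage count."""
    return max(TOY_INVENTORY[word], key=lambda pair: pair[1])[0]


print(first_sense("bank"))
print(most_frequent_sense("bank"))
print(random_sense("bank", random.Random(0)))
```

With these toy counts, first-sense and most-frequent-sense agree, just as they do in the output above; they would differ for any word whose first-listed sense is not its most common one.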

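NLTK has an API for accessing WordNet. WordNet groups words into synsets. This gives you some information about a word: its hypernyms, hyponyms, root words, and so on.

"Python Text Processing with NLTK 2.0 Cookbook" is a good book for getting started with the various features of NLTK. It is easy to read, understand, and implement.

You can also look at other papers (outside the NLTK world) that discuss using Wikipedia for word sense disambiguation.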


Yes, it is possible with the wordnet module in NLTK.
The similarity measures used in the tool you mentioned are also available in the NLTK wordnet module.