Word sense disambiguation algorithm in Python


I am working on a simple NLP project, and, given a text and a word, I am looking for the most likely sense of that word in the text.

Is there any implementation of WSD algorithms in Python? It is not quite clear whether there is something in NLTK that can help me. I would be happy even with a naive implementation like the Lesk algorithm.

I have read similar questions, such as "Word sense disambiguation in NLTK Python", but they give nothing but a reference to the NLTK book, which does not go very deep into the WSD problem.


In short: https://github.com/alvations/pywsd

At length: there is an endless range of techniques used for WSD, from mind-blowing machine-learning techniques that require lots of GPU power, down to methods that simply use the information in WordNet, or even just word frequencies; see http://dl.acm.org/citation.cfm?id=1459355.

Let's start with the simple Lesk algorithm, with optional stemming; see http://en.wikipedia.org/wiki/Lesk_algorithm:

from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer
from itertools import chain

bank_sents = ['I went to the bank to deposit my money',
              'The river bank was full of dead fishes']

plant_sents = ['The workers at the industrial plant were overworked',
               'The plant was no longer bearing flowers']

ps = PorterStemmer()

def lesk(context_sentence, ambiguous_word, pos=None, stem=True, hyperhypo=True):
    max_overlaps = 0
    lesk_sense = None
    context_sentence = context_sentence.split()
    if stem:  # Matching exact words causes sparsity, so let's match stems.
        context_sentence = [ps.stem(i) for i in context_sentence]
    for ss in wn.synsets(ambiguous_word):
        # If a POS is specified, skip synsets with a different POS.
        if pos and ss.pos() != pos:
            continue

        # The signature includes the definition and the lemma names.
        lesk_dictionary = ss.definition().split() + ss.lemma_names()

        # Optional: also include lemma names of hypernyms and hyponyms.
        if hyperhypo:
            lesk_dictionary += list(chain(*[i.lemma_names()
                                            for i in ss.hypernyms() + ss.hyponyms()]))

        if stem:
            lesk_dictionary = [ps.stem(i) for i in lesk_dictionary]

        overlaps = set(lesk_dictionary).intersection(context_sentence)

        if len(overlaps) > max_overlaps:
            lesk_sense = ss
            max_overlaps = len(overlaps)
    return lesk_sense

print("Context:", bank_sents[0])
answer = lesk(bank_sents[0], 'bank')
print("Sense:", answer)
print("Definition:", answer.definition())
print()

print("Context:", bank_sents[1])
answer = lesk(bank_sents[1], 'bank', 'n')
print("Sense:", answer)
print("Definition:", answer.definition())
print()

print("Context:", plant_sents[0])
answer = lesk(plant_sents[0], 'plant', 'n', True)
print("Sense:", answer)
print("Definition:", answer.definition())
print()

Other than Lesk-like algorithms, people have tried different similarity measures. Here is a nice but dated, yet still useful, survey: http://acl.ldc.upenn.edu/P/P97/P97-1008.pdf


You can try getting the first sense of each word using WordNet, which is included in NLTK, with this short piece of code:

from nltk.corpus import wordnet as wn

def get_first_sense(word, pos=None):
    # Synsets are ordered by frequency, so the first one
    # is the most common sense.
    if pos:
        synsets = wn.synsets(word, pos)
    else:
        synsets = wn.synsets(word)
    return synsets[0]

best_synset = get_first_sense('bank')
print('%s: %s' % (best_synset.name(), best_synset.definition()))
best_synset = get_first_sense('set', 'n')
print('%s: %s' % (best_synset.name(), best_synset.definition()))
best_synset = get_first_sense('set', 'v')
print('%s: %s' % (best_synset.name(), best_synset.definition()))

This will print:

bank.n.01: sloping land (especially the slope beside a body of water)
set.n.01: a group of things of the same kind that belong together and are so used
put.v.01: put into a certain place or abstract location

Surprisingly, this works quite well, since the first sense often dominates the others.


For WSD in Python you can try the WordNet bindings in the NLTK or Gensim libraries. The building blocks are there, but developing the complete algorithm is probably up to you.

For instance, using WordNet you could implement the simplified Lesk algorithm, as described in the Wikipedia entry.