Word sense disambiguation algorithm in Python
I'm working on a simple NLP project in which, given a text and a word, I have to find the most likely sense of that word as it occurs in the text.
Is there any implementation of WSD algorithms in Python? It's not quite clear to me whether there is anything in NLTK that can help. I'd be happy even with a naive implementation like the Lesk algorithm.
I've read similar questions, such as "Word sense disambiguation in NLTK Python", but they offer nothing beyond a reference to the NLTK book, which does not go deep into the WSD problem.
In short: https://github.com/alvations/pywsd
In long: there is an endless range of techniques used for WSD, from mind-blasting machine learning techniques that require lots of GPU power, to simply using the information in WordNet, or even just using word frequencies; see http://dl.acm.org/citation.cfm?id=1459355.
Let's start with a simple Lesk algorithm that allows optional stemming; see http://en.wikipedia.org/wiki/Lesk_algorithm:
```python
from itertools import chain

from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer

bank_sents = ['I went to the bank to deposit my money',
              'The river bank was full of dead fishes']

plant_sents = ['The workers at the industrial plant were overworked',
               'The plant was no longer bearing flowers']

ps = PorterStemmer()

def lesk(context_sentence, ambiguous_word, pos=None, stem=True, hyperhypo=True):
    max_overlaps = 0
    lesk_sense = None
    context_sentence = context_sentence.split()
    for ss in wn.synsets(ambiguous_word):
        # If a POS is specified, skip synsets with a different POS.
        if pos and ss.pos() != pos:
            continue
        lesk_dictionary = []
        # Includes the definition.
        lesk_dictionary += ss.definition().split()
        # Includes the lemma names.
        lesk_dictionary += ss.lemma_names()
        # Optional: includes lemma names of hypernyms and hyponyms.
        if hyperhypo:
            lesk_dictionary += list(chain(*[i.lemma_names() for i in ss.hypernyms() + ss.hyponyms()]))
        if stem:
            # Matching exact words causes sparsity, so let's match stems.
            lesk_dictionary = [ps.stem(i) for i in lesk_dictionary]
            context_sentence = [ps.stem(i) for i in context_sentence]
        overlaps = set(lesk_dictionary).intersection(context_sentence)
        if len(overlaps) > max_overlaps:
            lesk_sense = ss
            max_overlaps = len(overlaps)
    return lesk_sense

print("Context:", bank_sents[0])
answer = lesk(bank_sents[0], 'bank')
print("Sense:", answer)
print("Definition:", answer.definition())

print("Context:", bank_sents[1])
answer = lesk(bank_sents[1], 'bank', 'n')
print("Sense:", answer)
print("Definition:", answer.definition())

print("Context:", plant_sents[0])
answer = lesk(plant_sents[0], 'plant', 'n', True)
print("Sense:", answer)
print("Definition:", answer.definition())
```
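If you would rather not maintain your own implementation, the pywsd package linked above ships ready-made Lesk variants. A minimal sketch, assuming pywsd installs via pip and exposes simple_lesk as in its README:

```python
# Hedged sketch: assumes `pip install pywsd` and that pywsd.lesk exposes
# simple_lesk(context_sentence, ambiguous_word, pos=...) returning a
# WordNet Synset, as shown in the project's README.
from pywsd.lesk import simple_lesk

sent = 'I went to the bank to deposit my money'
answer = simple_lesk(sent, 'bank', pos='n')
print(answer, '-', answer.definition())
```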
Besides Lesk-like algorithms, people have also tried different similarity measures; here is a nice but dated, yet still useful, survey: http://acl.ldc.upenn.edu/P/P97/P97-1008.pdf
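To make the similarity idea concrete, here is a minimal sketch (not from the original answer) that picks the sense of a target word maximizing WordNet path similarity against the senses of some context words; disambiguate_by_similarity is a hypothetical helper name introduced only for illustration:

```python
from nltk.corpus import wordnet as wn

def disambiguate_by_similarity(context_words, ambiguous_word, pos=None):
    """Hypothetical helper: choose the synset of `ambiguous_word` whose
    path similarity to the context words' senses is highest."""
    best_sense, best_score = None, 0.0
    for ss in wn.synsets(ambiguous_word, pos=pos):
        score = 0.0
        for word in context_words:
            # Take the best similarity over all senses of the context word;
            # path_similarity returns None across POS, so treat that as 0.
            sims = [ss.path_similarity(cs) or 0.0 for cs in wn.synsets(word)]
            if sims:
                score += max(sims)
        if score > best_score:
            best_sense, best_score = ss, score
    return best_sense

sense = disambiguate_by_similarity(['river', 'water', 'fish'], 'bank', pos='n')
print(sense, '-', sense.definition())
```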
You can try getting the first sense of each word using WordNet, which is included in NLTK, with this short piece of code:
```python
from nltk.corpus import wordnet as wn

def get_first_sense(word, pos=None):
    # WordNet lists senses roughly by frequency, so the first is a strong baseline.
    if pos:
        synsets = wn.synsets(word, pos)
    else:
        synsets = wn.synsets(word)
    return synsets[0]

best_synset = get_first_sense('bank')
print('%s: %s' % (best_synset.name(), best_synset.definition()))
best_synset = get_first_sense('set', 'n')
print('%s: %s' % (best_synset.name(), best_synset.definition()))
best_synset = get_first_sense('set', 'v')
print('%s: %s' % (best_synset.name(), best_synset.definition()))
```
This will print:
```
bank.n.01: sloping land (especially the slope beside a body of water)
set.n.01: a group of things of the same kind that belong together and are so used
put.v.01: put into a certain place or abstract location
```
Surprisingly, this works quite well, since the first sense usually dominates all the others.
For WSD in Python, you can try using the WordNet bindings in NLTK or the Gensim library. The building blocks are there, but developing the complete algorithm is probably up to you.
For example, using WordNet you can implement the simplified Lesk algorithm, as described in the Wikipedia entry.
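A minimal sketch of that simplified Lesk, using only NLTK's WordNet interface: it scores each sense by the overlap between the context and the sense's gloss plus its usage examples, falling back to the first sense when nothing overlaps. The function name simplified_lesk is introduced here for illustration:

```python
from nltk.corpus import wordnet as wn

def simplified_lesk(context_sentence, ambiguous_word, pos=None):
    """Sketch of simplified Lesk (per the Wikipedia entry): compare the
    context only against each sense's gloss and example sentences."""
    context = set(context_sentence.lower().split())
    best_sense, best_overlap = None, 0
    for ss in wn.synsets(ambiguous_word, pos=pos):
        # The "signature" is the gloss plus the example sentences.
        signature = set(ss.definition().lower().split())
        for example in ss.examples():
            signature |= set(example.lower().split())
        overlap = len(context & signature)
        # `best_sense is None` keeps the first sense as the default on ties.
        if best_sense is None or overlap > best_overlap:
            best_sense, best_overlap = ss, overlap
    return best_sense

print(simplified_lesk('The plant was no longer bearing flowers', 'plant', 'n'))
```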