Python Maxent Classifier
I've been using the maxent classifier in Python and it fails, and I don't understand why.
I'm using the movie reviews corpus.
(Total noob.)
    import nltk.classify.util
    from nltk.classify import MaxentClassifier
    from nltk.corpus import movie_reviews

    def word_feats(words):
        return dict([(word, True) for word in words])

    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    negcutoff = len(negfeats)*3/4
    poscutoff = len(posfeats)*3/4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    classifier = MaxentClassifier.train(trainfeats)
Here is the error (I know I'm doing something wrong; please link to an explanation of how Maxent works):
    Warning (from warnings module):
      File "C:\Python27\lib\site-packages\nltk\classify\maxent.py", line 1334
        sum1 = numpy.sum(exp_nf_delta * A, axis=0)
    RuntimeWarning: invalid value encountered in multiply

    Warning (from warnings module):
      File "C:\Python27\lib\site-packages\nltk\classify\maxent.py", line 1335
        sum2 = numpy.sum(nf_exp_nf_delta * A, axis=0)
    RuntimeWarning: invalid value encountered in multiply

    Warning (from warnings module):
      File "C:\Python27\lib\site-packages\nltk\classify\maxent.py", line 1341
        deltas -= (ffreq_empirical - sum1) / -sum2
    RuntimeWarning: invalid value encountered in divide
I changed and updated the code a bit.
    import nltk, nltk.classify.util, nltk.metrics
    from nltk.classify import MaxentClassifier
    from nltk.collocations import BigramCollocationFinder
    from nltk.metrics import BigramAssocMeasures
    from nltk.probability import FreqDist, ConditionalFreqDist
    from sklearn import cross_validation

    from nltk.classify import MaxentClassifier
    from nltk.corpus import movie_reviews

    def word_feats(words):
        return dict([(word, True) for word in words])

    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    negcutoff = len(negfeats)*3/4
    poscutoff = len(posfeats)*3/4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    #classifier = nltk.MaxentClassifier.train(trainfeats)

    algorithm = nltk.classify.MaxentClassifier.ALGORITHMS[0]
    classifier = nltk.MaxentClassifier.train(trainfeats, algorithm,max_iter=3)

    classifier.show_most_informative_features(10)

    all_words = nltk.FreqDist(word for word in movie_reviews.words())
    top_words = set(all_words.keys()[:300])

    def word_feats(words):
        return {word:True for word in words if word in top_words}
To restrict the features, you can find the 300 most frequent words across all reviews like this:
    all_words = nltk.FreqDist(word for word in movie_reviews.words())
    top_words = set(all_words.keys()[:300])
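Note that the snippet above relies on Python 2 and an older NLTK, where `keys()` returned a list ordered by frequency. On Python 3, dictionary views cannot be sliced, and key order does not follow frequency. A minimal sketch of the same idea, assuming `collections.Counter` as a stand-in for `nltk.FreqDist` (which subclasses `Counter` in NLTK 3) and an invented toy word list in place of `movie_reviews.words()`:

```python
from collections import Counter

# Toy stand-in for movie_reviews.words(); these words are invented for illustration.
toy_words = ["great", "film", "bad", "film", "great", "film",
             "plot", "the", "the", "the", "the"]

def top_n_words(words, n):
    """Return the set of the n most frequent words (hypothetical helper)."""
    counts = Counter(words)
    # most_common(n) yields (word, count) pairs sorted by descending count
    return {word for word, _ in counts.most_common(n)}

top_words = top_n_words(toy_words, 3)
print(sorted(top_words))
```

With NLTK 3 the equivalent would presumably be `nltk.FreqDist(movie_reviews.words()).most_common(300)`, since `most_common` is inherited from `Counter`.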
Then all you need to do is cross-reference it in the feature extractor:
    def word_feats(words):
        return {word:True for word in words if word in top_words}
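To see what the filtered extractor produces, here is a small self-contained sketch. The `top_words` set and the sample reviews are invented for illustration, and `split_three_quarters` is a hypothetical helper. It uses `//` because `len(x)*3/4` gives a float on Python 3, which cannot be used as a slice index.

```python
# Hypothetical vocabulary; in practice this would be the most frequent corpus words.
top_words = {"great", "bad", "film"}

def word_feats(words):
    """Keep only words found in top_words, as boolean presence features."""
    return {word: True for word in words if word in top_words}

def split_three_quarters(items):
    """Split a list into its first 3/4 (train) and the remainder (test)."""
    cutoff = len(items) * 3 // 4  # integer division keeps the slice index an int
    return items[:cutoff], items[cutoff:]

# A single invented review: words outside top_words are dropped.
feats = word_feats(["a", "great", "film", "with", "a", "great", "cast"])
print(feats)

# Four invented labelled examples per class, split per class as in the question.
posfeats = [(word_feats(["great", "film"]), "pos")] * 4
negfeats = [(word_feats(["bad", "film"]), "neg")] * 4
pos_train, pos_test = split_three_quarters(posfeats)
neg_train, neg_test = split_three_quarters(negfeats)
trainfeats = neg_train + pos_train
print(len(trainfeats), len(pos_test) + len(neg_test))
```

Each feature dict now has at most as many keys as `top_words` has entries, so the classifier trains on a much smaller feature space than when every word in a review becomes a feature.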