nltk: Python Maxent Classifier

I have been using the maxent classifier in Python, and it is failing; I don't understand why.

I am using the movie reviews corpus.
(total novice)

import nltk.classify.util
from nltk.classify import MaxentClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # Bag-of-words features: every word in the review maps to True.
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# Use the first 3/4 of each class for training.
# Integer division (//) keeps the cutoffs usable as slice indices.
negcutoff = len(negfeats)*3//4
poscutoff = len(posfeats)*3//4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = MaxentClassifier.train(trainfeats)

Here is the error (I know I am doing something wrong; please link to an explanation of how Maxent works):

Warning (from warnings module):
  File "C:\Python27\lib\site-packages\nltk\classify\maxent.py", line 1334
    sum1 = numpy.sum(exp_nf_delta * A, axis=0)
RuntimeWarning: invalid value encountered in multiply

Warning (from warnings module):
  File "C:\Python27\lib\site-packages\nltk\classify\maxent.py", line 1335
    sum2 = numpy.sum(nf_exp_nf_delta * A, axis=0)
RuntimeWarning: invalid value encountered in multiply

Warning (from warnings module):
  File "C:\Python27\lib\site-packages\nltk\classify\maxent.py", line 1341
    deltas -= (ffreq_empirical - sum1) / -sum2
RuntimeWarning: invalid value encountered in divide

Related discussion

I changed and slightly updated the code.

import nltk, nltk.classify.util, nltk.metrics
from nltk.classify import MaxentClassifier
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

negcutoff = len(negfeats)*3//4
poscutoff = len(posfeats)*3//4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
#classifier = nltk.MaxentClassifier.train(trainfeats)

# Pick the first algorithm NLTK lists and cap training at 3 iterations.
algorithm = nltk.classify.MaxentClassifier.ALGORITHMS[0]
classifier = nltk.MaxentClassifier.train(trainfeats, algorithm, max_iter=3)

classifier.show_most_informative_features(10)

# Restrict features to the 300 most frequent words (most_common() is NLTK 3).
# These definitions must run before negfeats/posfeats are built to take effect.
all_words = nltk.FreqDist(word for word in movie_reviews.words())
top_words = set(word for word, _ in all_words.most_common(300))

def word_feats(words):
    return {word: True for word in words if word in top_words}
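
Both listings compute negcutoff and poscutoff but only use them for the training slice. As a minimal sketch of how the remaining quarter could be kept as a held-out set, here is the same split on toy data. The helper name split_three_quarters and the toy feature sets are made up for illustration and are not part of either poster's code.

```python
def split_three_quarters(items):
    """Return (train, test) with the first 3/4 of items in train."""
    cutoff = len(items) * 3 // 4  # integer division keeps the slice index an int
    return items[:cutoff], items[cutoff:]

# Toy labelled feature sets standing in for negfeats and posfeats.
negfeats = [({"bad": True}, "neg")] * 8
posfeats = [({"good": True}, "pos")] * 8

neg_train, neg_test = split_three_quarters(negfeats)
pos_train, pos_test = split_three_quarters(posfeats)

trainfeats = neg_train + pos_train
testfeats = neg_test + pos_test

print(len(trainfeats), len(testfeats))  # 12 4
```

With a trained classifier, nltk.classify.util.accuracy(classifier, testfeats) can then score it on the held-out reviews.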

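There is probably a fix for the numpy overflow problem, but since this is just a movie review classifier for learning NLTK / text classification (and you probably don't want training to take a long time anyway), I'll offer a simple workaround: you can restrict the words used in the feature sets.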
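
As a rough illustration of where "invalid value" warnings can come from: NumPy uses that message when an operation produces NaN. One way this can happen is an exponential overflowing to infinity and then being multiplied by zero. That this is what happens inside maxent.py for this data is an assumption, not something checked against the library source. The sketch uses plain Python floats, which follow the same IEEE 754 rules; the variable names are made up.

```python
import math

# Plain Python raises on overflow; NumPy would instead return inf
# and emit an "overflow encountered in exp" warning.
try:
    big = math.exp(1000.0)
except OverflowError:
    big = float("inf")

# inf * 0 has no defined value, so the result is NaN.
product = big * 0.0

# NaN then spreads through every later computation that touches it.
update = (1.0 - product) / -2.0

print(math.isnan(product))  # True
print(math.isnan(update))   # True
```

Fewer features keep the sums involved smaller, which is consistent with the workaround described here.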

You can find the 300 most commonly used words across all reviews (you can obviously go higher if you want),

all_words = nltk.FreqDist(word for word in movie_reviews.words())
# most_common() ranks words by frequency (NLTK 3)
top_words = set(word for word, _ in all_words.most_common(300))

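Then all you have to do is cross-reference top_words in the feature extractor for the reviews. Also, as a suggestion, it is more efficient to use a dict comprehension than to convert a list of tuples to a dict. So this would look like,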

def word_feats(words):
    return {word: True for word in words if word in top_words}
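
To see the restricted extractor in action without downloading the corpus, here is a small sketch on toy data. It assumes collections.Counter as a stand-in for nltk.FreqDist (which subclasses Counter in NLTK 3), and the docs list is invented for illustration.

```python
from collections import Counter

# Toy stand-in for movie_reviews: each document is a list of tokens.
docs = [
    ["great", "film", "great", "cast"],
    ["dull", "film", "weak", "plot"],
    ["great", "plot", "film"],
]

# Count every token across all documents, then keep the 2 most frequent.
counts = Counter(word for doc in docs for word in doc)
top_words = {word for word, _ in counts.most_common(2)}

def word_feats(words):
    """Bag-of-words features restricted to the most frequent words."""
    return {word: True for word in words if word in top_words}

print(sorted(top_words))    # ['film', 'great']
print(word_feats(docs[1]))  # {'film': True}
```

Every feature dict now has at most len(top_words) keys, no matter how long the review is.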