How to create a unigram and bigram count matrix for a text file along with a class variable into csv using Python?

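I want to use Python to create a unigram and bigram count matrix for a text file, together with a class variable, and write it to a CSV. The text file contains two columns and looks like this: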


Text Class
I love the movie Pos
I hate the movie Neg

I want unigram and bigram counts for the Text column, and the output should be written to a CSV file.

Unigrams:

I, hate, love, movie, the, class
1, 0, 1, 1, 1, Pos
1, 1, 0, 1, 1, Neg

Bigrams:

I love, love the, the movie, I hate, hate the, class
1, 1, 1, 0, 0, Pos
0, 0, 1, 1, 1, Neg

Can someone help me change the code below so that it produces the output format described above?

>>> import nltk
>>> from collections import Counter
>>> fo = open("text.txt")
>>> fo1 = fo.readlines()
>>> for line in fo1:
...     bigm = list(nltk.bigrams(line.split()))
...     bigmC = Counter(bigm)
...     for key, value in bigmC.items():
...         print(key, value)
...

('love', 'the') 1
('the', 'movie') 1
('I', 'love') 1
('I', 'hate') 1
('hate', 'the') 1
('the', 'movie') 1


I have made your input file a little more detailed so that you can be confident the solution works:


I love the movie movie
I hate the movie
The movie was rubbish
The movie was fantastic

The first line contains one word twice, because otherwise you could not tell whether the counter is actually counting correctly.

Solution:

import csv
from collections import Counter

import nltk

# Read the lines once and close the file; the lines are needed for two passes.
with open("text.txt") as fo:
    fo1 = fo.readlines()

# First pass: collect the whole 'population' of words and bigrams in your document.
counter_sum = Counter()
for line in fo1:
    tokens = nltk.word_tokenize(line)  # needs the NLTK tokenizer data to be installed
    bigrams = list(nltk.bigrams(line.split()))
    bigramsC = Counter(bigrams)
    tokensC = Counter(tokens)
    both_counters = bigramsC + tokensC
    counter_sum += both_counters

# Second pass: now that we have the population, we can write the csv.
with open('unigrams_and_bigrams.csv', 'w', newline='') as csvfile:
    # Sorting on the type name puts unigrams (str) before bigrams (tuple).
    header = sorted(counter_sum, key=lambda x: str(type(x)))
    writer = csv.DictWriter(csvfile, fieldnames=header)
    writer.writeheader()
    for line in fo1:
        tokens = nltk.word_tokenize(line)
        bigrams = list(nltk.bigrams(line.split()))
        both_counters = Counter(bigrams) + Counter(tokens)
        # A Counter returns 0 for a missing key, so every column gets a value.
        row = {element: both_counters[element] for element in header}
        writer.writerow(row)


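So, I used and built on your original approach. You didn't say whether you wanted the bigrams and unigrams in separate CSVs, so I assumed you wanted them together. If not, it should not be hard for you to change the program. Accumulating the population like this is probably better done with the tools already built into NLP libraries, but it is interesting to see that it can be done at a lower level. By the way, I am using Python 3; if you need this to work in Python 2 you may have to change a few things, such as the use of list.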

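One interesting reference is this one on summing counters, which was new to me. I also had to ask a question of my own to find out how to group your bigrams and unigrams at separate ends of the CSV.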
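To illustrate the counter summing mentioned above, here is a minimal sketch that uses only the standard library. The sample sentence is taken from the input file; `zip(tokens, tokens[1:])` is swapped in for `nltk.bigrams` purely so the snippet has no dependencies, which is an assumption of this sketch rather than part of the solution.

```python
from collections import Counter

tokens = "I love the movie movie".split()

# Unigram counts: one key per word.
unigram_counts = Counter(tokens)

# Bigram counts: zip pairs each token with its successor
# (stand-in for nltk.bigrams so the sketch needs no third-party package).
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Adding two Counters merges their keys and sums the counts of shared keys.
combined = unigram_counts + bigram_counts

print(combined["movie"])           # 2
print(combined[("the", "movie")])  # 1
print(combined["hate"])            # 0, a missing key counts as zero
```

Because string keys and tuple keys never collide, unigrams and bigrams can live in the same Counter without interfering with each other.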

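I know the code looks repetitive, but you have to go through all the lines once to build the CSV header before you can start writing any rows.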

This is the output in LibreOffice:

[image of csv output]

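Your CSV is going to get very wide, because it collects all the unigrams and bigrams. If you really want the bigrams in the header without the brackets and commas, you could write a function to do that. It is probably better to leave them as tuples, though, in case you need to parse them back into Python at some point, and they are just as readable.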
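If you do want plain labels, something along these lines might work. It is only a sketch: `header_label` is a hypothetical helper name, and the sample header is made up for illustration.

```python
def header_label(key):
    """Return a CSV column label: 'love' for a unigram, 'I love' for a bigram tuple."""
    if isinstance(key, tuple):
        return " ".join(key)
    return key


# Hypothetical header mixing unigrams (str) and bigrams (tuple).
header = ["I", "love", ("I", "love"), ("love", "the")]
labels = [header_label(key) for key in header]
print(labels)  # ['I', 'love', 'I love', 'love the']
```

Note that csv.DictWriter matches row keys against fieldnames, so if you relabel the header you would have to relabel the keys of each row in the same way.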

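You didn't include the code that generates the class column. Assuming you have it, you can create the column by appending the string 'Class' to the header before the header is written to the CSV, and fill it by adding

row['Class'] = sentiment

as the last line before the row is written.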
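Putting the pieces together, a self-contained sketch of the whole thing could look like this. It rests on several assumptions that are not in the original code: the class label is the last whitespace-separated token on each line (as in the sample file from the question), tokens are split on whitespace, and `zip` stands in for `nltk.bigrams` so the snippet runs without NLTK. The names `count_ngrams` and `write_count_matrix` are hypothetical.

```python
import csv
import io
from collections import Counter


def count_ngrams(text):
    """Count unigrams and bigrams in a whitespace-tokenized string."""
    tokens = text.split()
    return Counter(tokens) + Counter(zip(tokens, tokens[1:]))


def write_count_matrix(lines, out):
    """Write one CSV row of n-gram counts per line, plus a 'Class' column.

    Assumes the class label is the last whitespace-separated token of a line.
    """
    samples = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        text, label = line.rsplit(None, 1)
        samples.append((count_ngrams(text), label))

    # First pass: the header is every n-gram seen anywhere, unigrams first.
    population = Counter()
    for counts, _ in samples:
        population += counts
    header = sorted(population, key=lambda k: (isinstance(k, tuple), k))

    writer = csv.DictWriter(out, fieldnames=header + ["Class"])
    writer.writeheader()
    # Second pass: a Counter yields 0 for n-grams absent from this line.
    for counts, label in samples:
        row = {key: counts[key] for key in header}
        row["Class"] = label
        writer.writerow(row)


buffer = io.StringIO()
write_count_matrix(["I love the movie Pos\n", "I hate the movie Neg\n"], buffer)
print(buffer.getvalue())
```

To write to a real file instead of the in-memory buffer, pass a file opened with `open('unigrams_and_bigrams.csv', 'w', newline='')` as `out`.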