Python for NLP: Working with Facebook FastText Library

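This is the 20th article in my series of articles on Python for NLP. In the last few articles, we have been exploring deep learning techniques to perform a variety of machine learning tasks, and you should also be familiar with the concept of word embeddings. Word embeddings are a way to convert textual information into numeric form, which can in turn be used as input to statistical algorithms. In my article on word embeddings, I explained how to create your own word embeddings and how to use pre-trained word embeddings such as GloVe.

In this article, we are going to study FastText, another extremely useful module for word embeddings and text classification. FastText was developed by Facebook and has shown excellent results on many NLP problems, such as semantic similarity detection and text classification.

The article is divided into two sections. In the first section, we will see how the FastText library creates vector representations that can be used to find semantic similarities between words. In the second section, we will see the application of the FastText library to text classification.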

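FastText for Semantic Similarity

FastText supports both the Continuous Bag of Words and Skip-Gram models. In this article, we will implement the skip-gram model to learn vector representations of words from the Wikipedia articles on artificial intelligence, machine learning, deep learning, and neural networks. Since these topics are closely related, combining them gives us a substantial amount of data for the corpus. You can add more topics of a similar nature if you want.

As a first step, we need to install the required libraries. We will use the Wikipedia library for Python, which can be installed with the following command:

$ pip install wikipedia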

Importing Libraries

The following script imports the required libraries into our application:

from gensim.models.fasttext import FastText
import matplotlib.pyplot as plt
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize

import wikipedia

# Download the NLTK resources used below (sentence tokenizer, lemmatizer data, stop words)
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

%matplotlib inline

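You can see that we are using the FastText module from the gensim.models.fasttext library. For word representation and semantic similarity, we can use the Gensim implementation of FastText. This model can run on Windows; however, for text classification we will have to use a Linux platform, as we will see in the next section.

Scraping Wikipedia Articles

In this step, we will scrape the required Wikipedia articles. Look at the script below: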

artificial_intelligence = wikipedia.page("Artificial Intelligence").content
machine_learning = wikipedia.page("Machine Learning").content
deep_learning = wikipedia.page("Deep Learning").content
neural_network = wikipedia.page("Neural Network").content

artificial_intelligence = sent_tokenize(artificial_intelligence)
machine_learning = sent_tokenize(machine_learning)
deep_learning = sent_tokenize(deep_learning)
neural_network = sent_tokenize(neural_network)

artificial_intelligence.extend(machine_learning)
artificial_intelligence.extend(deep_learning)
artificial_intelligence.extend(neural_network)

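To scrape a Wikipedia page, we can use the page method from the wikipedia module. The name of the page that you want to scrape is passed as a parameter to the page method. The method returns a WikipediaPage object, which you can then use to retrieve the page contents via the content attribute, as shown in the above script.

The scraped content from the four Wikipedia pages is then tokenized into sentences using the sent_tokenize method, which returns a list of sentences. The sentences for the four pages are tokenized separately. Finally, the sentences from the four articles are joined together via the extend method.

Data Preprocessing

The next step is to clean our text data by removing punctuation and other special characters. We will also convert the data to lower case. The words in our data will be lemmatized to their root form. Furthermore, the stop words and the words with length less than 4 will be removed from the corpus.

The preprocess_text function, defined below, performs the preprocessing tasks.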

import re
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def preprocess_text(document):
    # Remove all the special characters
    document = re.sub(r'\W', ' ', str(document))

    # Remove all single characters surrounded by whitespace
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

    # Remove a single character from the start of the text
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)

    # Substitute multiple spaces with a single space
    document = re.sub(r'\s+', ' ', document)

    # Remove a prefixed 'b'
    document = re.sub(r'^b\s+', '', document)

    # Convert to lowercase
    document = document.lower()

    # Lemmatize, then drop stop words and words shorter than 4 characters
    tokens = document.split()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    tokens = [word for word in tokens if word not in en_stop]
    tokens = [word for word in tokens if len(word) > 3]

    preprocessed_text = ' '.join(tokens)

    return preprocessed_text

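Let's see if our function performs the desired task by preprocessing a dummy sentence. The same script then preprocesses the whole corpus and tokenizes it into words: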

sent = preprocess_text("Artificial intelligence, is the most advanced technology of the present era")
print(sent)

final_corpus = [preprocess_text(sentence) for sentence in artificial_intelligence if sentence.strip() !='']

word_punctuation_tokenizer = nltk.WordPunctTokenizer()
word_tokenized_corpus = [word_punctuation_tokenizer.tokenize(sent) for sent in final_corpus]

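The preprocessed sentence looks like this:

artificial intelligence advanced technology present

You can see that the punctuation and stop words have been removed, and the sentence has been lemmatized. Furthermore, words with length less than 4, such as "era", have also been removed. These thresholds were chosen arbitrarily for this test, so you may allow words with smaller or greater lengths in the corpus.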
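
The effect of the stop-word and minimum-length filters can be sketched in isolation. The snippet below is a simplified stand-in for preprocess_text, not the function itself: it skips lemmatization, uses a tiny hand-written stop-word set instead of NLTK's list, and takes a hypothetical min_length parameter so you can see how the cut-off changes the result.

```python
import re

# Tiny illustrative stop-word set (an assumption; the article uses NLTK's English list)
TOY_STOP_WORDS = {"is", "the", "most", "of"}

def filter_tokens(document, min_length=4, stop_words=TOY_STOP_WORDS):
    """Lowercase, strip non-word characters, drop stop words and short tokens."""
    cleaned = re.sub(r"\W", " ", document).lower()
    tokens = [t for t in cleaned.split() if t not in stop_words]
    return [t for t in tokens if len(t) >= min_length]

sentence = "Artificial intelligence, is the most advanced technology of the present era"
print(filter_tokens(sentence))                # "era" is dropped (3 characters)
print(filter_tokens(sentence, min_length=3))  # "era" is kept
```

Lowering the threshold keeps short but meaningful words, at the cost of letting more noise into the corpus.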

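Creating Word Representations

We have preprocessed our corpus. Now is the time to create word representations using FastText. Let's first define the hyper-parameters for our FastText model: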

embedding_size = 60
window_size = 40
min_word = 5
down_sampling = 1e-2

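Here embedding_size is the size of the embedding vector. In other words, each word in our corpus will be represented as a 60-dimensional vector. The window_size is the number of words before and after a word that are used to learn that word's representation. This might sound tricky; however, in the skip-gram model we input a word to the algorithm and the output is its context words. If the window size is 40, each input will have up to 80 outputs: the 40 words that occur before the input word and the 40 words that occur after it. The word embedding for the input word is learned using these 80 output words.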
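
To make the window idea concrete, here is a minimal sketch of how (input word, context word) pairs could be enumerated for a skip-gram model. The function name skipgram_pairs is hypothetical and this is not how Gensim generates its training data internally; it only illustrates which words fall inside the window.

```python
def skipgram_pairs(tokens, window):
    """Return (input_word, context_word) pairs for a symmetric context window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["deep", "learning", "uses", "neural", "networks"]
for pair in skipgram_pairs(tokens, window=1):
    print(pair)
```

Words near the start or end of a sentence have fewer neighbours, which is why a window of 40 yields at most 80 context words rather than exactly 80.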

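The next hyper-parameter is min_word, which specifies the minimum number of times a word must occur in the corpus for a representation to be generated for it. Finally, the most frequently occurring words will be down-sampled according to the threshold given by the down_sampling attribute.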
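
The two ideas can be sketched in plain Python. The vocabulary filter below mirrors what a minimum count does; the discard probability uses the subsampling rule described in the original word2vec paper, 1 - sqrt(t / f), where f is a word's relative frequency and t is the threshold. Library implementations differ in detail, so treat both helper functions (whose names are hypothetical) as an illustration rather than Gensim's actual code.

```python
import math
from collections import Counter

def build_vocab(tokenized_sentences, min_count):
    """Keep only words that occur at least min_count times in the corpus."""
    counts = Counter(word for sent in tokenized_sentences for word in sent)
    return {word: n for word, n in counts.items() if n >= min_count}

def discard_probability(relative_frequency, threshold):
    """Subsampling rule from the word2vec paper: 1 - sqrt(t / f), floored at 0."""
    return max(0.0, 1.0 - math.sqrt(threshold / relative_frequency))

corpus = [["neural", "network", "model"], ["neural", "network"], ["neural", "layer"]]
print(build_vocab(corpus, min_count=2))
print(discard_probability(0.04, threshold=1e-2))   # frequent word: often skipped
print(discard_probability(0.005, threshold=1e-2))  # rare word: never skipped
```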

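Let's now create our FastText model for word representations.

%%time
# Note: gensim 4.x names these parameters vector_size and epochs;
# gensim 3.x releases used size and iter instead.
ft_model = FastText(word_tokenized_corpus,
                    vector_size=embedding_size,
                    window=window_size,
                    min_count=min_word,
                    sample=down_sampling,
                    sg=1,
                    epochs=100)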

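All the parameters in the above script are self-explanatory, except sg. The sg parameter defines the type of model that we want to create. A value of 1 specifies that we want to create a skip-gram model, whereas zero specifies the continuous bag of words model, which is also the default.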
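
The difference between the two settings is the direction of prediction. With sg=0 (continuous bag of words), the surrounding words are the input and the center word is the target; with sg=1 (skip-gram), the center word is the input and the surrounding words are the targets. A minimal sketch of the bag-of-words side, using a hypothetical helper cbow_examples:

```python
def cbow_examples(tokens, window):
    """Return (context_words, target_word) examples: the context predicts the center word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples

tokens = ["machine", "learning", "is", "fun"]
for context, target in cbow_examples(tokens, window=1):
    print(context, "->", target)
```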

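Execute the above script. It may take some time to run. On my machine, the timing statistics for the above code are as follows:

CPU times: user 1min 45s, sys: 434 ms, total: 1min 45s
Wall time: 57.2 s

Let's now look at the representation of the word "artificial". To do so, you can index the wv attribute of the FastText object with the word:

print(ft_model.wv['artificial'])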

Here is the output:

[-3.7653010e-02 -4.5558015e-01 3.2035065e-01 -1.5289043e-01
4.0645871e-02 -1.8946664e-01 7.0426887e-01 2.8806925e-01
-1.8166199e-01 1.7566417e-01 1.1522485e-01 -3.6525184e-01
-6.4378887e-01 -1.6650060e-01 7.4625671e-01 -4.8166099e-01
2.0884991e-01 1.8067230e-01 -6.2647951e-01 2.7614883e-01
-3.6478557e-02 1.4782918e-02 -3.3124462e-01 1.9372456e-01
4.3028224e-02 -8.2326338e-02 1.0356739e-01 4.0792203e-01
-2.0596240e-02 -3.5974573e-02 9.9928051e-02 1.7191900e-01
-2.1196717e-01 6.4424530e-02 -4.4705093e-02 9.7391091e-02
-2.8846195e-01 8.8607501e-03 1.6520244e-01 -3.6626378e-01
-6.2017748e-04 -1.5083785e-01 -1.7499258e-01 7.1994811e-02
-1.9868813e-01 -3.1733567e-01 1.9832127e-01 1.2799081e-01
-7.6522082e-01 5.2335665e-02 -4.5766738e-01 -2.7947658e-01
3.7890410e-03 -3.8761377e-01 -9.3001537e-02 -1.7128626e-01
-1.2923178e-01 3.9627206e-01 -3.6673656e-01 2.2755004e-01]

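In the output above, you can see the 60-dimensional vector for the word "artificial".

Let's now find the top 5 most similar words for the words "artificial", "intelligence", "machine", "network", "recurrent", and "deep". You can choose any number of words. The following script prints each specified word along with its 5 most similar words.

semantically_similar_words = {words: [item[0] for item in ft_model.wv.most_similar([words], topn=5)]
                              for words in ['artificial', 'intelligence', 'machine', 'network', 'recurrent', 'deep']}

for k, v in semantically_similar_words.items():
    print(k + ":" + str(v))

The output is as follows: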

artificial:['intelligence', 'inspired', 'book', 'academic', 'biological']
intelligence:['artificial', 'human', 'people', 'intelligent', 'general']
machine:['ethic', 'learning', 'concerned', 'argument', 'intelligence']
network:['neural', 'forward', 'deep', 'backpropagation', 'hidden']
recurrent:['rnns', 'short', 'schmidhuber', 'shown', 'feedforward']
deep:['convolutional', 'speech', 'network', 'generative', 'neural']

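We can also find the cosine similarity between the vectors for any two words, as shown below:

print(ft_model.wv.similarity(w1='artificial', w2='intelligence'))

The output shows a value of "0.7481". Cosine similarity can range from -1 to 1; a higher value indicates higher similarity.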
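
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand on made-up two-dimensional vectors; it is an illustration of the formula, not Gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

Because only the angle matters, two vectors of very different lengths can still be maximally similar.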

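Visualizing Word Similarities

Though each word in our model is represented as a 60-dimensional vector, we can use the principal component analysis technique to find two principal components. The two principal components can then be used to plot the words in a two-dimensional space. However, first we need to create a list of all the words in the semantically_similar_words dictionary. The following script does that: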

from sklearn.decomposition import PCA

all_similar_words = sum([[k] + v for k, v in semantically_similar_words.items()], [])

print(all_similar_words)
print(type(all_similar_words))
print(len(all_similar_words))

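In the script above, we iterate through all the key-value pairs in the semantically_similar_words dictionary. Each key in the dictionary is a word, and the corresponding value is a list of its semantically similar words. Since we found the top 5 most similar words for each of the 6 words "artificial", "intelligence", "machine", "network", "recurrent", and "deep", and each key is placed in front of its own list, the all_similar_words list contains 6 × 6 = 36 entries.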
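
The flattening step is easy to check on a small dictionary. The sketch below uses made-up data; it shows the same sum(..., []) idiom next to itertools.chain.from_iterable, which gives the same list without repeatedly copying the accumulated result.

```python
from itertools import chain

# Made-up data: two keys, each with two similar words
toy_similar = {
    "deep": ["neural", "network"],
    "machine": ["learning", "model"],
}

# Same idiom as in the article: each key followed by its similar words
flat_with_sum = sum([[k] + v for k, v in toy_similar.items()], [])

# Equivalent result built lazily, one sub-list at a time
flat_with_chain = list(chain.from_iterable([k] + v for k, v in toy_similar.items()))

print(flat_with_sum)
print(len(flat_with_sum))  # number of keys * (1 + words per key)
```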

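Next, we have to find the word vectors for all these 36 words and then use PCA to reduce the dimensions of the word vectors from 60 to 2. We can then use plt, the conventional alias for matplotlib.pyplot, to plot the words on a two-dimensional plane.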
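
As a reminder of what the reduction does, here is the textbook formulation of PCA for n word vectors x_1, ..., x_n of dimension 60. The symbols are introduced here for illustration; scikit-learn computes the same projection through a singular value decomposition rather than an explicit covariance matrix.

```latex
% Center the data and form the sample covariance matrix
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
C = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\top} \in \mathbb{R}^{60 \times 60}

% Eigenvectors of C, ordered by decreasing eigenvalue (explained variance)
C\,w_k = \lambda_k w_k, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_{60} \ge 0

% Each word is projected onto the first two principal directions
z_i = \begin{pmatrix} w_1^{\top}(x_i - \bar{x}) \\ w_2^{\top}(x_i - \bar{x}) \end{pmatrix} \in \mathbb{R}^{2}
```

The two coordinates of each z_i are what end up on the x and y axes of the plot.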

Execute the following script to visualize the words:

word_vectors = ft_model.wv[all_similar_words]

pca = PCA(n_components=2)

p_comps = pca.fit_transform(word_vectors)
word_names = all_similar_words

plt.figure(figsize=(18, 10))
plt.scatter(p_comps[:, 0], p_comps[:, 1], c='red')

# Label each point with its word, offset slightly from the marker
for word_name, x, y in zip(word_names, p_comps[:, 0], p_comps[:, 1]):
    plt.annotate(word_name, xy=(x+0.06, y+0.03), xytext=(0, 0), textcoords='offset points')

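The output of the above script is a scatter plot of the words on a two-dimensional plane.

Now we know how to create word embeddings using FastText. In the next section, we will see how FastText can be used for text classification tasks.

FastText for Text Classification

Text classification refers to classifying textual data into predefined categories based on the contents of the text. Sentiment analysis, spam detection, and tag detection are some of the most common use cases for text classification.

The FastText text classification module can only be run on Linux or OSX. If you are a Windows user, you can use Google Colaboratory to run the FastText text classification module. All the scripts in this section have been run using Google Colaboratory.

The Dataset

The dataset for this article can be downloaded from Kaggle. The dataset contains multiple files, but we are only interested in the yelp_review.csv file. The file contains 5.2 million reviews about different businesses, including restaurants, bars, dentists, doctors, and beauty salons. However, due to memory constraints, we will only use the first 50,000 records to train our model. You can try more records if you want.

Let's import the required libraries and load the dataset: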

import pandas as pd
import numpy as np

yelp_reviews = pd.read_csv("/content/drive/My Drive/Colab Datasets/yelp_review_short.csv")

bins = [0,2,5]
review_names = ['negative', 'positive']

yelp_reviews['reviews_score'] = pd.cut(yelp_reviews['stars'], bins, labels=review_names)

yelp_reviews.head()

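In the script above, we load the yelp_review_short.csv file, which contains 50,000 reviews, with the pd.read_csv function.

We simplify our problem by converting the numerical review ratings into categorical ones. This is done by adding a new column, reviews_score, to the dataset. If the value in the stars column of a review is 1 or 2 (businesses are rated on a scale of 1 to 5), the reviews_score column will contain the string negative. If the rating in the stars column is between 3 and 5, the reviews_score column will contain the value positive. This makes our problem a binary classification problem.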
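
The binning rule can be spelled out in plain Python. The helper below, whose name score_to_label is hypothetical, mimics what pd.cut does with bins [0, 2, 5] under its default right-inclusive intervals, so ratings in (0, 2] become negative and ratings in (2, 5] become positive.

```python
def score_to_label(stars, bins=(0, 2, 5), labels=("negative", "positive")):
    """Map a star rating to a label using right-inclusive intervals (0, 2] and (2, 5]."""
    for lower, upper, label in zip(bins[:-1], bins[1:], labels):
        if lower < stars <= upper:
            return label
    return None  # rating falls outside every bin

for stars in [1, 2, 3, 4, 5]:
    print(stars, score_to_label(stars))
```

Moving the middle edge from 2 to 3 would count three-star reviews as negative instead, which is a modelling choice worth revisiting if the classes turn out to be heavily imbalanced.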

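Finally, the header of the dataframe, printed by yelp_reviews.head(), now includes the new reviews_score column.

Installing FastText

The next step is to download FastText. The following command uses wget to fetch the v0.1.0 release from the GitHub repository:

!wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip

Note: If you are executing the above command from a Linux terminal, you don't have to prefix it with !. In a Google Colaboratory notebook, any command after ! is executed as a shell command rather than in the Python interpreter. Hence all non-Python commands here are prefixed with !.

If you run the above script and see the following results, FastText has been successfully downloaded: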

--2019-08-16 15:05:05-- https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/facebookresearch/fastText/zip/v0.1.0 [following]
--2019-08-16 15:05:05-- https://codeload.github.com/facebookresearch/fastText/zip/v0.1.0
Resolving codeload.github.com (codeload.github.com)... 192.30.255.121
Connecting to codeload.github.com (codeload.github.com)|192.30.255.121|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: 'v0.1.0.zip'

v0.1.0.zip [ <=> ] 92.06K --.-KB/s in 0.03s

2019-08-16 15:05:05 (3.26 MB/s) - 'v0.1.0.zip' saved [94267]

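The next step is to unzip the FastText module. Simply type the following command:

!unzip v0.1.0.zip

Next, you have to navigate to the directory where you downloaded FastText and then execute the !make command to build the C++ binaries. Execute the following steps:

cd fastText-0.1.0
!make

If you see the following output, FastText has been successfully installed on your machine.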

c++ -pthread -std=c++0x -O3 -funroll-loops -c src/args.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/dictionary.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/productquantizer.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/matrix.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/qmatrix.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/vector.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/model.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/utils.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/fasttext.cc
c++ -pthread -std=c++0x -O3 -funroll-loops args.o dictionary.o productquantizer.o matrix.o qmatrix.o vector.o model.o utils.o fasttext.o src/main.cc -o fasttext

To verify the installation, execute the following command:

!./fasttext

You should see that the following commands are supported by FastText:

usage: fasttext <command> <args>

The commands supported by FastText are:

  supervised              train a supervised classifier
  quantize                quantize a model to reduce the memory usage
  test                    evaluate a supervised classifier
  predict                 predict most likely labels
  predict-prob            predict most likely labels with probabilities
  skipgram                train a skipgram model
  cbow                    train a cbow model
  print-word-vectors      print word vectors given a trained model
  print-sentence-vectors  print sentence vectors given a trained model
  nn                      query for nearest neighbors
  analogies               query for analogies

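Text Classification

Before we train FastText models to perform text classification, it is pertinent to mention that FastText accepts data in a special format, which is as follows:

__label__tag1 This is sentence 1.
__label__tag2 This is sentence 2.

If we look at our dataset, it is not in the desired format. A text with positive sentiment should look like this:

__label__positive burgers are very big portions here.

Similarly, a negative review should look like this:

__label__negative They do not use organic ingredients, but I thi...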
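
A single line in this format can be produced with a small helper. The function name to_fasttext_line is hypothetical, and the sketch assumes one label per review, as in this article.

```python
def to_fasttext_line(label, text, prefix="__label__"):
    """Build one training line: the prefixed label, a space, then the text on a single line."""
    single_line = " ".join(text.split())  # collapses newlines, tabs, and repeated spaces
    return f"{prefix}{label} {single_line}"

print(to_fasttext_line("positive", "burgers are very\nbig portions here."))
```

Keeping each review on one physical line matters because FastText treats every line of the input file as a separate example.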

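The following script keeps only the reviews_score and text columns of the dataset and then adds the __label__ prefix to all the values in the reviews_score column. Similarly, \n and \t are replaced by a space in the text column. Finally, the updated dataframe is written to disk as yelp_reviews_updated.txt.

import pandas as pd

col = ['reviews_score', 'text']

# Work on a copy so the column assignments below do not trigger SettingWithCopyWarning
yelp_reviews = yelp_reviews[col].copy()
yelp_reviews['reviews_score'] = ['__label__' + s for s in yelp_reviews['reviews_score']]
yelp_reviews['text'] = yelp_reviews['text'].replace('\n', ' ', regex=True).replace('\t', ' ', regex=True)

# Write one "<label> <text>" pair per line. Plain file writes are used here because
# recent Python versions reject the empty quotechar and escapechar that a to_csv call
# with csv.QUOTE_NONE would otherwise need for this format.
with open('/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt', 'w', encoding='utf-8') as f:
    for label, text in zip(yelp_reviews['reviews_score'], yelp_reviews['text']):
        f.write(f'{label} {text}\n')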

Let's now print the head of the updated yelp_reviews dataframe.

yelp_reviews.head()

You should see the following results:

       reviews_score  text
0  __label__positive  Super simple place but amazing nonetheless. It...
1  __label__positive  Small unassuming place that changes their menu...
2  __label__positive  Lester's is located in a beautiful neighborhoo...
3  __label__positive  Love coming here. Yes the place always needs t...
4  __label__positive  Had their chocolate almond croissant and it wa...

Similarly, the tail of the dataframe looks like this:

           reviews_score  text
49995  __label__positive  This is an awesome consignment store! They hav...
49996  __label__positive  Awesome laid back atmosphere with made-to-orde...
49997  __label__positive  Today was my first appointment and I can hones...
49998  __label__positive  I love this chic salon. They use the best prod...
49999  __label__positive  This place is delicious. All their meats and s...

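We have converted our dataset into the required shape. The next step is to divide our data into train and test sets. 80% of the data, i.e. the first 40,000 of the 50,000 records, will be used to train the model, while the remaining 20% (the last 10,000 records) will be used to evaluate the performance of the algorithm.

The following script divides the data into training and test sets:

!head -n 40000 "/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt"
!tail -n 10000 "/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt"

Once the above script is executed, the yelp_reviews_train.txt file will be generated, which contains the training data. Similarly, the newly generated yelp_reviews_test.txt file will contain the test data.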
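
The same positional split can be sketched in Python on a list of lines. The helper name split_head_tail is hypothetical; it simply mirrors what head -n and tail -n do.

```python
def split_head_tail(lines, n_train, n_test):
    """Mimic `head -n n_train` and `tail -n n_test` on a list of lines."""
    train = lines[:n_train]
    test = lines[max(0, len(lines) - n_test):]
    return train, test

lines = [f"__label__positive review {i}" for i in range(10)]
train, test = split_head_tail(lines, n_train=8, n_test=2)
print(len(train), len(test))
```

The two sets are disjoint only as long as n_train plus n_test does not exceed the number of lines, and the split follows file order rather than a random shuffle.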

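Now is the time to train our FastText text classification algorithm.

%%time
!./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt" -output model_yelp_reviews

To train the algorithm, we use the supervised command and pass it the input file. The model name is specified after the -output keyword. The above script results in a trained text classification model called model_yelp_reviews.bin. Here is the output of the script above: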

Read 4M words
Number of words: 177864
Number of labels: 2
Progress: 100.0% words/sec/thread: 2548017 lr: 0.000000 loss: 0.246120 eta: 0h0m
CPU times: user 212 ms, sys: 48.6 ms, total: 261 ms
Wall time: 15.6 s

You can take a look at the model via the !ls command, as shown below:

!ls

Here is the output:

args.o Makefile quantization-results.sh
classification-example.sh matrix.o README.md
classification-results.sh model.o src
CONTRIBUTING.md model_yelp_reviews.bin tutorials
dictionary.o model_yelp_reviews.vec utils.o
eval.py PATENTS vector.o
fasttext pretrained-vectors.md wikifil.pl
fasttext.o productquantizer.o word-vector-example.sh
get-wikimedia.sh qmatrix.o yelp_reviews_train.txt
LICENSE quantization-example.sh

You can see model_yelp_reviews.bin in the above list of files.

Finally, to test the model you can use the test command. You have to specify the model name and the test file after the test command, as shown below:

!./fasttext test model_yelp_reviews.bin "/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt"

The output of the above script looks like this:

N 10000
P@1 0.909
R@1 0.909
Number of examples: 10000

Here P@1 refers to precision and R@1 refers to recall. You can see that our model achieves a precision and recall of 0.909, which is pretty good.
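
It may help to see why the two numbers coincide. When every example has exactly one true label and the model returns exactly one prediction, precision at 1 and recall at 1 both reduce to the fraction of correct predictions. The sketch below works under that assumption and uses made-up labels; the function name is hypothetical.

```python
def precision_recall_at_1(true_labels, predicted_labels):
    """P@1 and R@1 when each example has one true label and one predicted label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    precision = correct / len(predicted_labels)  # correct predictions / predictions made
    recall = correct / len(true_labels)          # correct predictions / true labels
    return precision, recall

truth = ["positive", "negative", "positive", "positive"]
preds = ["positive", "positive", "positive", "negative"]
print(precision_recall_at_1(truth, preds))
```

The two metrics only diverge when examples carry several labels or when more than one prediction per example is evaluated.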

Let's now try to clean our text of punctuation and special characters and convert it to lower case to improve the uniformity of the text. The following script cleans the training set:

!cat "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt" | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt"

And the following script cleans the test set:

!cat "/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt" | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_test_clean.txt"
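
If you prefer to stay in Python, a rough equivalent of the sed and tr pipeline can be sketched with the re module. The helper name clean_line is hypothetical, and the character class is an approximation of the one in the shell command.

```python
import re

# Punctuation marks to pad with spaces: . ! ? , ' / ( )
PUNCTUATION = re.compile(r"([.!?,'/()])")

def clean_line(line):
    """Pad the listed punctuation marks with spaces, then lowercase the line."""
    return PUNCTUATION.sub(r" \1 ", line).lower()

print(clean_line("__label__positive Great food, really!"))
```

Padding punctuation with spaces turns each mark into its own token, so "food," and "food" are no longer treated as different words. The label prefix is unaffected because it contains none of the listed characters and is already lower case.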

Now, we will train the model on the cleaned training set:

%%time
!./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt" -output model_yelp_reviews

And finally, we will use the model trained on the cleaned training set to make predictions on the cleaned test set:

!./fasttext test model_yelp_reviews.bin "/content/drive/My Drive/Colab Datasets/yelp_reviews_test_clean.txt"

The output of the above script is as follows:

N 10000
P@1 0.915
R@1 0.915
Number of examples: 10000

You can see a slight increase in both precision and recall. To further improve the model, you can increase the number of epochs and the learning rate. The following script sets the number of epochs to 30 and the learning rate to 0.5.

%%time
!./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt" -output model_yelp_reviews -epoch 30 -lr 0.5

You can try different numbers and see if you can get better results. Don't forget to share your results in the comments!
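
One way to organize such experiments is to generate the command for each combination of settings. The sketch below only assembles and prints the commands; the helper name, the candidate values, and the output naming scheme are assumptions made for illustration, while the -input, -output, -epoch and -lr options are the ones used above.

```python
import shlex
from itertools import product

def build_train_command(input_path, output_name, epoch, lr):
    """Assemble the argument list for `./fasttext supervised` with the given settings."""
    return ["./fasttext", "supervised",
            "-input", input_path,
            "-output", output_name,
            "-epoch", str(epoch),
            "-lr", str(lr)]

train_file = "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt"
for epoch, lr in product([10, 30], [0.1, 0.5]):
    # Hypothetical naming scheme so each run writes its own model file
    output = f"model_yelp_reviews_e{epoch}_lr{lr}"
    print(shlex.join(build_train_command(train_file, output, epoch, lr)))
```

Each argument list could be passed to subprocess.run to execute the training, followed by the test command on the matching model file.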

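Conclusion

The FastText model has recently proved useful for word embeddings and text classification tasks on many datasets. It is very easy to use and lightning fast compared to other word embedding models.

In this article, we briefly explored how to find semantic similarities between different words by creating word embeddings with FastText. The second part of the article explained how to perform text classification with the FastText library.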