scikit-learn TfidfVectorizer meaning?
I'm reading about scikit-learn's TfidfVectorizer implementation, and I don't understand what the output of the method means. For example:
new_docs = ['He watches basketball and baseball',
            'Julie likes to play basketball',
            'Jane loves to play baseball']
new_term_freq_matrix = tfidf_vectorizer.transform(new_docs)
print tfidf_vectorizer.vocabulary_
print new_term_freq_matrix.todense()
Output:
{u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}
[[ 0.57735027  0.57735027  0.57735027  0.          0.          0.          0.          0.          0.          0.          0.        ]
 [ 0.          0.68091856  0.          0.          0.51785612  0.51785612  0.          0.          0.          0.          0.        ]
 [ 0.62276601  0.          0.          0.62276601  0.          0.          0.          0.4736296   0.          0.          0.        ]]
What is this (e.g. u'me': 8)?
{u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}
And is this a matrix or just a vector? I don't understand what this output is:
[[ 0.57735027  0.57735027  0.57735027  0.          0.          0.          0.          0.          0.          0.          0.        ]
 [ 0.          0.68091856  0.          0.          0.51785612  0.51785612  0.          0.          0.          0.          0.        ]
 [ 0.62276601  0.          0.          0.62276601  0.          0.          0.          0.4736296   0.          0.          0.        ]]
Can someone explain these outputs in detail?
Thanks!
TfidfVectorizer converts text into feature vectors that can be used as input to an estimator.
What is this (e.g. u'me': 8)?
It tells you that the token 'me' is represented as feature number 8 in the output matrix.
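To make this concrete, here is a minimal sketch (the toy corpus is an assumption, not the poster's original training data) showing that vocabulary_ maps each token to a column index, not to a score:

```python
# vocabulary_ maps token -> column index in the tf-idf matrix.
# Toy corpus for illustration only (an assumption).
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(['he plays basketball', 'she plays baseball'])

# Tokens are indexed alphabetically by default, so 'baseball' gets column 0.
print(vectorizer.vocabulary_)
print(X.shape)   # (2 documents, 5 features)
```

The values in vocabulary_ are only positions; the actual tf-idf weights live in the matrix returned by fit_transform / transform.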
Is this a matrix or just a vector?
Each sentence is a vector, and the sentences you entered form a matrix with 3 vectors.
In each vector, the numbers (weights) are the tf-idf scores of the features.
For example:
'julie': 4 tells you that in every sentence where 'julie' appears, there will be a non-zero tf-idf weight at position 4, as you can see in the 2nd vector:
[ 0.          0.68091856  0.          0.          0.51785612  0.51785612  0.          0.          0.          0.          0.        ]
The 5th element (index 4) has the score 0.51785612, which is the tf-idf score of 'julie'.
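The same lookup can be done programmatically: take one row of the matrix (one document) and index it with the column number from vocabulary_. A small sketch with an assumed two-document corpus:

```python
# Each row of the tf-idf matrix is one document's vector; nonzero entries
# sit at the vocabulary positions of the words that document contains.
# Corpus is illustrative only (an assumption).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['julie likes basketball', 'jane loves baseball']
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)   # sparse, shape (2, n_features)

row = matrix.todense()[0]                 # vector for the first document
col = vectorizer.vocabulary_['julie']     # column index of 'julie'
print(row[0, col])                        # nonzero tf-idf weight for 'julie'
```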
For more information about tf-idf scoring, read here: http://en.wikipedia.org/wiki/Tf%E2%80%93idf
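If you want to see where the numbers come from, scikit-learn's default formula (smooth_idf=True, l2 row normalization) can be reproduced by hand. A hedged sketch, assuming default TfidfVectorizer settings:

```python
# Reproduce scikit-learn's default tf-idf by hand and compare:
# idf = ln((1 + n_docs) / (1 + df)) + 1, then l2-normalize each row.
import math
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['basketball baseball', 'basketball basketball']
X = TfidfVectorizer().fit_transform(docs).todense()

idf_baseball = math.log((1 + 2) / (1 + 1)) + 1      # df('baseball') = 1
idf_basketball = math.log((1 + 2) / (1 + 2)) + 1    # df('basketball') = 2

# Doc 0 has tf = 1 for both terms; weight = tf * idf, then normalize.
v = [1 * idf_baseball, 1 * idf_basketball]
norm = math.sqrt(v[0] ** 2 + v[1] ** 2)
manual = [x / norm for x in v]
print(manual)    # should match the first row of X
```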
So, tf-idf creates a vocabulary of its own from the whole set of documents, as you can see in the first line of the output (I sorted it for better understanding):
{u'baseball': 0, u'basketball': 1, u'he': 2, u'jane': 3, u'julie': 4, u'likes': 5, u'linda': 6, u'loves': 7, u'me': 8, u'more': 9, u'than': 10}
And when a document is parsed to get its tf-idf, the document:
He watches basketball and baseball
produces the output
[ 0.57735027  0.57735027  0.57735027  0.          0.          0.          0.          0.          0.          0.          0.        ]
which corresponds, position by position, to
[baseball basketball he jane julie likes linda loves me more than]
Since only the words baseball, basketball and he from the vocabulary appear in our document, the document vector has tf-idf values only at those three vocabulary positions.
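Note that words missing from the fitted vocabulary (here 'watches' and 'and') are simply ignored by transform(). A short sketch with an assumed training corpus:

```python
# Words not seen during fit() get no column, so transform() silently
# drops them; only known tokens receive tf-idf weights.
# Training corpus is illustrative only (an assumption).
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit(['he likes basketball', 'jane loves baseball'])

row = vectorizer.transform(['he watches basketball and baseball']).todense()
known = vectorizer.vocabulary_
print(sorted(known))   # 'watches' and 'and' are absent from the vocabulary
print(row)             # nonzero only at the baseball/basketball/he columns
```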
tf-idf is used for classifying documents and for ranking in search engines. tf: term frequency (the number of times a word from the vocabulary appears in the document); idf: inverse document frequency (a measure of how important a word is across the set of documents).
The method addresses the fact that not all words should be weighted equally: the weights signal the words that are most unique to a document and therefore best characterize it.
new_docs = ['basketball baseball', 'basketball baseball', 'basketball baseball']
new_term_freq_matrix = vectorizer.fit_transform(new_docs)
print(vectorizer.vocabulary_)
print(new_term_freq_matrix.todense())

{'basketball': 1, 'baseball': 0}
[[ 0.70710678  0.70710678]
 [ 0.70710678  0.70710678]
 [ 0.70710678  0.70710678]]

new_docs = ['basketball baseball', 'basketball basketball', 'basketball basketball']
new_term_freq_matrix = vectorizer.fit_transform(new_docs)
print(vectorizer.vocabulary_)
print(new_term_freq_matrix.todense())

{'basketball': 1, 'baseball': 0}
[[ 0.861037    0.50854232]
 [ 0.          1.        ]
 [ 0.          1.        ]]

new_docs = ['basketball basketball baseball', 'basketball basketball', 'basketball basketball']
new_term_freq_matrix = vectorizer.fit_transform(new_docs)
print(vectorizer.vocabulary_)
print(new_term_freq_matrix.todense())

{'basketball': 1, 'baseball': 0}
[[ 0.64612892  0.76322829]
 [ 0.          1.        ]
 [ 0.          1.        ]]
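The 0.861037 in the second example above can be checked by hand, assuming the default smooth-idf formula: 'baseball' appears in 1 of 3 documents and 'basketball' in all 3, so 'baseball' gets the larger idf and thus the larger share of the normalized weight.

```python
# Hand-check of the second example: docs = ['basketball baseball',
# 'basketball basketball', 'basketball basketball'].
import math

idf_baseball = math.log((1 + 3) / (1 + 1)) + 1    # df = 1 of 3 docs
idf_basketball = math.log((1 + 3) / (1 + 3)) + 1  # df = 3 of 3 docs

# Doc 0: tf = 1 for both terms; l2-normalize the (tf * idf) pair.
norm = math.sqrt(idf_baseball ** 2 + idf_basketball ** 2)
print(round(idf_baseball / norm, 6))     # 0.861037
print(round(idf_basketball / norm, 6))   # 0.508542
```

This is exactly the behavior described above: the rarer word ('baseball') is the more distinctive one, so it dominates the document's vector.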