How Spark HashingTF works
I am new to Spark 2. I tried the Spark tf-idf example:
```python
from pyspark.ml.feature import HashingTF, Tokenizer

sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark")
], ["label", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=32)
featurizedData = hashingTF.transform(wordsData)

for each in featurizedData.collect():
    print(each)
```
It outputs:
```
Row(label=0.0, sentence=u'Hi I heard about Spark', words=[u'hi', u'i', u'heard', u'about', u'spark'], rawFeatures=SparseVector(32, {1: 3.0, 13: 1.0, 24: 1.0}))
```
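The rawFeatures value is a SparseVector of size numFeatures mapping feature indices to values. A minimal sketch of reading it back, with the vector literal copied from the output above:

```python
from pyspark.ml.linalg import SparseVector

# The vector printed above: size 32, non-zero values at indices 1, 13 and 24
v = SparseVector(32, {1: 3.0, 13: 1.0, 24: 1.0})
print(v[1], v[13], v[24], v[0])  # 3.0 1.0 1.0 0.0 (absent indices read as 0.0)
```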
I expected rawFeatures to contain term frequencies in the sense of

```
tf(w) = (Number of times the word appears in a document) / (Total number of words in the document)
```

In our case that is tf(w) = 1/5 = 0.2 for each word, since every word appears exactly once in a document of five words. If we assume the output indices correspond to the words, I would expect values of 0.2, not {1: 3.0, 13: 1.0, 24: 1.0}. This confuses me. What am I missing?
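For reference, a quick plain-Python sketch of the computation the formula above describes (my own illustration, not Spark code):

```python
words = [u'hi', u'i', u'heard', u'about', u'spark']

# tf(w) = count of w in the document / total number of words in the document
tf = {w: words.count(w) / float(len(words)) for w in sorted(set(words))}
print(tf)  # every word appears once out of 5, so each tf is 0.2
```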
TL;DR: This is just a simple hash collision.
In detail: by default, HashingTF hashes each term with MurmurHash3 (seed 42) applied to the term's UTF-8 bytes, then maps the hash into a bucket. You can reproduce the hashing step with Spark's internal utilities:
```scala
import org.apache.spark.unsafe.types.UTF8String
import org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes

Seq("hi", "i", "heard", "about", "spark")
  .map(UTF8String.fromString(_))
  .map(utf8 => hashUnsafeBytes(utf8.getBaseObject, utf8.getBaseOffset, utf8.numBytes, 42))
```
```scala
Seq[Int] = List(-537608040, -1265344671, 266149357, 146891777, 2101843105)
```
When you take a non-negative modulo of these values, you get:
```scala
// nonNegativeMod comes from org.apache.spark.util.Utils
List(-537608040, -1265344671, 266149357, 146891777, 2101843105)
  .map(nonNegativeMod(_, 32))
```
```scala
List[Int] = List(24, 1, 13, 1, 1)
```
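If you want to check this step outside of Spark: in Python the % operator already returns a non-negative result for a positive modulus, so the same bucket indices fall out directly (this illustrates only the modulo step, not the hashing):

```python
# Hash values copied from the Scala snippet above
hashes = [-537608040, -1265344671, 266149357, 146891777, 2101843105]

# Python's % is non-negative for a positive modulus, matching Spark's nonNegativeMod
print([h % 32 for h in hashes])  # [24, 1, 13, 1, 1]
```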
Three words from the list (i, about and spark) hash to the same bucket, index 1. Each of them occurs once, which is why that index holds 3.0, and that is exactly the result you got.
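If you can upgrade, you can also confirm each word's bucket from PySpark itself: since Spark 3.0, pyspark.ml.feature.HashingTF exposes an indexOf method (note this assumes Spark 3.0+; it is not available on the ml class in 2.x):

```python
from pyspark.ml.feature import HashingTF

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=32)

# indexOf is available on pyspark.ml.feature.HashingTF from Spark 3.0 onwards
for w in [u'hi', u'i', u'heard', u'about', u'spark']:
    print(w, hashingTF.indexOf(w))  # hi->24, i->1, heard->13, about->1, spark->1
```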
Related:
- What hashing function does Spark use for HashingTF and how do I duplicate it?
- How to get word details from TF Vector RDD in Spark ML Lib?