Python: Searching a normal query in an inverted index

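I have a complete inverted index in the form of a nested Python dictionary. Its structure is:

word : { document name : [list of positions] }

For example, if the dictionary is called index, the entry for the word "spam" would look like this:

spam : { doc1.txt : [102, 300, 399], doc5.txt : [200, 587] }

This way, the documents containing any word are given by index[word].keys(), and the frequency of the word in a given document is given by len(index[word][document]).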
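For illustration, the structure can be written as a small Python literal. This is only a sketch: the `eggs` entry and the position values are made up, and the two lookups mirror the expressions mentioned above.

```python
# Toy inverted index: word -> {document name -> list of positions}.
# The "eggs" entry and all position values are illustrative assumptions.
index = {
    "spam": {"doc1.txt": [102, 300, 399], "doc5.txt": [200, 587]},
    "eggs": {"doc1.txt": [7], "doc2.txt": [15, 40]},
}

# Documents that contain a word
docs_with_spam = sorted(index["spam"].keys())

# Frequency of a word within one document
spam_freq_doc1 = len(index["spam"]["doc1.txt"])

print(docs_with_spam)   # ['doc1.txt', 'doc5.txt']
print(spam_freq_doc1)   # 3
```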

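Now my question is how to implement a normal query search on this index. For example, given a query containing four words, find the documents that contain all four (ranked by total frequency of occurrence), then the documents that contain three of them, and so on.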

Added this code, using S. Lott's answer.
This is the code I have written. It works exactly as I want (only the output formatting needs some work), but I know it could be improved.

from collections import defaultdict
from operator import itemgetter

# Take input

query = input(" Enter the query :")

# Some preprocessing

query = query.lower()
query = query.strip()

# now real work

wordlist = query.split()
search_words = [ x for x in wordlist if x in index ]    # list of words that are present in index.

print("\nsearching for words ... :", search_words, "\n")


doc_has_word = [ (index[word].keys(),word) for word in search_words ]
doc_words = defaultdict(list)
for d, w in doc_has_word:
    for p in d:
        doc_words[p].append(w)

# create a dictionary identifying matches for each document    

result_set = {}

for i in doc_words.keys():
    count = 0
    matches = len(doc_words[i])     # number of matches
    for w in doc_words[i]:
        count += len(index[w][i])   # count total occurrences
    result_set[i] = (matches,count)

# Now print in sorted order

print("   Document \t\t Words matched \t\t Total Frequency")
print('-' * 40)
for doc, (matches, count) in sorted(result_set.items(), key=itemgetter(1), reverse=True):
    print(doc, "\t", doc_words[doc], "\t", count)

Please comment. Thanks.


Here's a start:

doc_has_word = [ (index[word].keys(),word) for word in wordlist ]

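This builds a list of (documents, word) pairs, where the first element holds every document that contains the word. You can't easily make a dictionary out of that, since each document occurs many times.

But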

from collections import defaultdict
doc_words = defaultdict(list)
for docs, w in doc_has_word:
    # docs is the collection of document names for word w
    for d in docs:
        doc_words[d].append(w)

might be helpful.
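To see what this grouping produces, here is a minimal sketch on a made-up index (the words, document names and positions are assumptions for illustration). It walks each word's documents and inverts the word-to-documents mapping into a document-to-words mapping:

```python
from collections import defaultdict

# Hypothetical toy index, for illustration only.
index = {
    "spam": {"doc1.txt": [102, 300, 399], "doc5.txt": [200, 587]},
    "eggs": {"doc1.txt": [7], "doc2.txt": [15, 40]},
}
wordlist = ["spam", "eggs"]

# One (documents, word) pair per query word
doc_has_word = [(index[word].keys(), word) for word in wordlist]

# Group the query words by the document they occur in
doc_words = defaultdict(list)
for docs, w in doc_has_word:
    for d in docs:
        doc_words[d].append(w)

print(dict(doc_words))
# {'doc1.txt': ['spam', 'eggs'], 'doc5.txt': ['spam'], 'doc2.txt': ['eggs']}
```

The number of matched query words for a document is then len(doc_words[doc]).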


Here is a solution for finding the similar documents (the hardest part):

from functools import reduce  # reduce is not a builtin in Python 3

wordList = ['spam','eggs','toast'] # our list of words to query for
wordMatches = [index.get(word, {}) for word in wordList]
similarDocs = reduce(set.intersection, [set(docMatch.keys()) for docMatch in wordMatches])

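wordMatches is a list in which each element is the dictionary of documents matching one of the query words.

similarDocs is the set of documents that contain all of the words being queried for. It is found by taking just the document names from each dictionary in the wordMatches list, representing those lists of document names as sets, and then intersecting the sets to find the document names they have in common.

Once you have found the similar documents, you should be able to use a defaultdict (as shown in S. Lott's answer) to append together all of the lists of matches for each word and each document.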
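One way to read that last step is sketched below. It is an interpretation rather than the only possible one: the toy index, the name `allMatches` and the final ranking by total occurrences are assumptions made for illustration.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical toy index, for illustration only.
index = {
    "spam": {"doc1.txt": [102, 300, 399], "doc5.txt": [200, 587]},
    "eggs": {"doc1.txt": [7], "doc5.txt": [15], "doc2.txt": [3]},
}
wordList = ["spam", "eggs"]

wordMatches = [index.get(word, {}) for word in wordList]
similarDocs = reduce(set.intersection, [set(m.keys()) for m in wordMatches])

# For each document containing every word, gather the positions of all words
allMatches = defaultdict(list)
for docMatch in wordMatches:
    for doc in similarDocs:
        allMatches[doc].extend(docMatch[doc])

# Rank by total number of occurrences, highest first
ranked = sorted(allMatches, key=lambda doc: len(allMatches[doc]), reverse=True)
print(ranked)  # ['doc1.txt', 'doc5.txt']
```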

Related links:

  • This answer demonstrates defaultdict(int). defaultdict(list) works in much the same way.
  • set.intersection example


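import itertools

index = {...}

def query(*args):
    result = []

    # (document, frequency) pairs for every query word found in the index
    doc_count = [(doc, len(index[word][doc])) for word in args for doc in index.get(word, {})]
    # groupby only merges adjacent items, so sort by document name first
    doc_count.sort(key=lambda doc: doc[0])
    doc_group = itertools.groupby(doc_count, key=lambda doc: doc[0])

    for doc, group in doc_group:
        result.append((doc, sum(elem[1] for elem in group)))

    # highest total frequency first
    return sorted(result, key=lambda x: x[1], reverse=True)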
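
The question asks for documents that match more query words to come first, with total frequency as the tie-breaker. One possible sketch of that ordering, not taken from any of the answers above, sorts on a (matches, frequency) pair. The function name `rank_documents` and the toy index are assumptions made for illustration.

```python
from collections import defaultdict

def rank_documents(index, words):
    """Return (doc, matches, frequency) tuples, best match first."""
    stats = defaultdict(lambda: [0, 0])  # doc -> [words matched, total frequency]
    for word in set(words):              # ignore repeated query words
        for doc, positions in index.get(word, {}).items():
            stats[doc][0] += 1
            stats[doc][1] += len(positions)
    return sorted(
        ((doc, m, f) for doc, (m, f) in stats.items()),
        key=lambda item: (item[1], item[2]),
        reverse=True,
    )

# Hypothetical toy index, for illustration only.
toy_index = {
    "spam": {"doc1.txt": [102, 300, 399], "doc5.txt": [200, 587]},
    "eggs": {"doc1.txt": [7], "doc2.txt": [15, 40, 61]},
}
results = rank_documents(toy_index, ["spam", "eggs", "toast"])
for doc, matches, freq in results:
    print(doc, matches, freq)
```

Words that are missing from the index (here "toast") simply contribute nothing, so no pre-filtering of the query is needed.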