Python: Searching a normal query in an inverted index

I have a complete inverted index in the form of a nested Python dictionary. Its structure is:

word: {doc_name: [location_list]}

For example, let the dictionary be called index. For the word "spam", the entry would look like this:

spam: {doc1.txt: [102, 300, 399], doc5.txt: [200, 587]}

So the documents containing any word are given by index[word].keys(), and the frequency of the word in a given document by len(index[word][document]).
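As a small illustration of the two lookups just described, here is a minimal sketch. The toy index contents (in particular the "eggs" entry) and the helper names documents_containing and frequency are made up for the example and are not part of my real index.

```python
# Toy index for illustration only; "eggs" and its positions are hypothetical.
index = {
    "spam": {"doc1.txt": [102, 300, 399], "doc5.txt": [200, 587]},
    "eggs": {"doc1.txt": [7]},
}


def documents_containing(index, word):
    """Return the sorted names of documents that contain `word`."""
    # index.get avoids a KeyError for words that are not indexed at all.
    return sorted(index.get(word, {}))


def frequency(index, word, document):
    """Return how many times `word` occurs in `document` (0 if absent)."""
    return len(index.get(word, {}).get(document, []))


print(documents_containing(index, "spam"))
print(frequency(index, "spam", "doc1.txt"))
```

Using .get instead of plain indexing is only a convenience here, so that unknown words or documents give an empty result rather than an exception.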

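Now my question is: how do I implement a normal query search over this index? For example, given a query containing 4 words, find the documents containing all four matches (ranked by total frequency of occurrence), then the documents containing 3 matches, and so on.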
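To make the desired ordering concrete, here is a rough sketch of one way it could be expressed as a function. The name rank_documents and the toy_index data are assumptions made for illustration; it also treats repeated query words as one word, which may or may not be what is wanted.

```python
from collections import defaultdict


def rank_documents(index, query):
    """Rank documents by number of matched query words, then total frequency.

    Returns a list of (document, matches, total_frequency) tuples.
    """
    # Keep only distinct query words that actually appear in the index.
    words = [w for w in set(query.lower().split()) if w in index]

    matches = defaultdict(int)  # document -> number of query words found
    totals = defaultdict(int)   # document -> total occurrences of those words
    for word in words:
        for doc, positions in index[word].items():
            matches[doc] += 1
            totals[doc] += len(positions)

    # Most matched words first, then highest total frequency, then name.
    return sorted(
        ((doc, matches[doc], totals[doc]) for doc in matches),
        key=lambda row: (-row[1], -row[2], row[0]),
    )


# Hypothetical data, only to exercise the function.
toy_index = {
    "spam": {"doc1.txt": [102, 300, 399], "doc5.txt": [200, 587]},
    "eggs": {"doc1.txt": [7], "doc2.txt": [1, 2, 3, 4]},
}

for row in rank_documents(toy_index, "spam eggs ham"):
    print(row)
```

With this toy data, doc1.txt comes first because it matches both indexed words, and the two single-match documents are then ordered by their total frequency.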

Added this code, using S. Lott's answer.
This is the code I have written. It's working exactly as I want (just some formatting of the output is needed), but I know it could be improved.

from collections import defaultdict
from operator import itemgetter

# Take input

query = input(" Enter the query :")

# Some preprocessing

query = query.lower()
query = query.strip()

# now real work

wordlist = query.split()
search_words = [x for x in wordlist if x in index]  # words that are present in index

print("\nsearching for words ... :", search_words, "\n")

# map each document to the list of query words it contains

doc_has_word = [(index[word].keys(), word) for word in search_words]
doc_words = defaultdict(list)
for d, w in doc_has_word:
    for p in d:
        doc_words[p].append(w)

# create a dictionary identifying matches for each document

result_set = {}

for i in doc_words.keys():
    count = 0
    matches = len(doc_words[i])  # number of matches
    for w in doc_words[i]:
        count += len(index[w][i])  # count total occurrences
    result_set[i] = (matches, count)

# Now print in sorted order (by matches, then by total frequency)

print(" Document \t\t Words matched \t\t Total Frequency")
print('-' * 40)
for doc, (matches, count) in sorted(result_set.items(), key=itemgetter(1), reverse=True):
    print(doc, "\t", doc_words[doc], "\t", count)

Please comment... Thanks.