Searching a normal query in an inverted index
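I have a complete inverted index in the form of a nested Python dictionary. Its structure is:
word: {filename: [list of positions]}
For example, let the dictionary be named index; then for the word "spam", the entry would look like this:
spam: {doc1.txt: [102, 300, 399], doc5.txt: [200, 587]}
This way, the documents containing any word are given by index[word].keys(), and the frequency of the word in a document is given by len(index[word][document]).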
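To make the structure concrete, here is a minimal sketch of such an index and the two lookups. All file names and positions in it are illustrative assumptions, not real data.

```python
# Toy inverted index: word -> {filename -> [positions]}.
# Every value here is made up for illustration.
index = {
    "spam": {"doc1.txt": [102, 300, 399], "doc5.txt": [200, 587]},
    "eggs": {"doc1.txt": [7], "doc2.txt": [15, 42]},
}

# Documents that contain a word
docs_with_spam = sorted(index["spam"].keys())

# Frequency of a word in one document = number of recorded positions
spam_freq_doc1 = len(index["spam"]["doc1.txt"])

print(docs_with_spam)   # ['doc1.txt', 'doc5.txt']
print(spam_freq_doc1)   # 3
```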
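Now my question is how to implement a normal query search on this index. For example, given a query of 4 words, find the documents that contain all four (ranked by total frequency of occurrence), then the documents that contain 3 of them, and so on.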
**Added this code, using S. Lott's answer.**
This is the code I have written. It works exactly as I want (only the output formatting still needs work), but I know it could be improved.
```python
from collections import defaultdict
from operator import itemgetter

# `index` is the nested dictionary described above.

# Take input
query = input(" Enter the query :")

# Some preprocessing
query = query.lower()
query = query.strip()

# now real work
wordlist = query.split()
search_words = [x for x in wordlist if x in index]  # list of words that are present in index.

print(" searching for words ... :", search_words, " ")

doc_has_word = [(index[word].keys(), word) for word in search_words]
doc_words = defaultdict(list)
for d, w in doc_has_word:
    for p in d:
        doc_words[p].append(w)

# create a dictionary identifying matches for each document
result_set = {}
for i in doc_words.keys():
    count = 0
    matches = len(doc_words[i])  # number of matches
    for w in doc_words[i]:
        count += len(index[w][i])  # count total occurrences
    result_set[i] = (matches, count)

# Now print in sorted order: most matched words first, then highest total frequency
print(" Document \t\t Words matched \t\t Total Frequency")
print('-' * 40)
for doc, (matches, count) in sorted(result_set.items(), key=itemgetter(1), reverse=True):
    print(doc, "\t", doc_words[doc], "\t", count)
```
Please comment... Thanks.
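Here's a start:

```python
doc_has_word = [(index[word].keys(), word) for word in wordlist]
```

This builds a list of (documents, word) pairs, one per query word, where documents is the collection of file names containing that word. You can't easily make a dictionary out of it directly, since each document can appear many times.

But

```python
from collections import defaultdict

doc_words = defaultdict(list)
for d, w in doc_has_word:
    # d holds the documents containing w, so key the mapping by each document
    for doc in d:
        doc_words[doc].append(w)
```

might be helpful.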
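As a rough worked example of where this leads, the sketch below runs the inversion on a toy index and then orders documents by how many query words they contain. The index contents and the name `ranked` are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical toy index: word -> {filename -> [positions]}
index = {
    "spam": {"doc1.txt": [1, 5], "doc2.txt": [3]},
    "eggs": {"doc1.txt": [2]},
}
wordlist = ["spam", "eggs"]

# One (documents, word) pair per query word
doc_has_word = [(index[word].keys(), word) for word in wordlist]

# Invert: document -> list of query words found in it
doc_words = defaultdict(list)
for docs, word in doc_has_word:
    for doc in docs:
        doc_words[doc].append(word)

# Documents with the most matched query words come first
ranked = sorted(doc_words, key=lambda d: len(doc_words[d]), reverse=True)

print(dict(doc_words))  # {'doc1.txt': ['spam', 'eggs'], 'doc2.txt': ['spam']}
print(ranked)           # ['doc1.txt', 'doc2.txt']
```

The length of each list is the number of matched words, which is the first thing the question wants to rank by.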
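Here's a solution for finding the similar documents (the hardest part):

```python
from functools import reduce  # reduce is no longer a builtin in Python 3

wordList = ['spam', 'eggs', 'toast']  # our list of words to query for
wordMatches = [index.get(word, {}) for word in wordList]
# keep only the documents that contain every word
similarDocs = reduce(set.intersection, [set(docMatch.keys()) for docMatch in wordMatches])
```

Once you have found the similar documents, you should be able to use a defaultdict (as shown in S. Lott's answer) to append together all the lists of matches for each word and each document.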
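A minimal sketch of that last step might look like the following. The toy index and the names `matches` and `totals` are assumptions for illustration; it is one possible way to combine the intersection with a defaultdict, not necessarily the only one.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical toy index: word -> {filename -> [positions]}
index = {
    "spam":  {"doc1.txt": [1, 5], "doc2.txt": [3]},
    "eggs":  {"doc1.txt": [2], "doc2.txt": [8, 9]},
    "toast": {"doc1.txt": [4]},
}
word_list = ["spam", "eggs", "toast"]

# Documents that contain every query word
word_matches = [index.get(word, {}) for word in word_list]
similar_docs = reduce(set.intersection, [set(m.keys()) for m in word_matches])

# For each such document, append the position lists of all query words
matches = defaultdict(list)
for doc in similar_docs:
    for word in word_list:
        matches[doc].extend(index[word][doc])

# Total frequency per document = number of collected positions
totals = {doc: len(positions) for doc, positions in matches.items()}

print(similar_docs)  # {'doc1.txt'}
print(totals)        # {'doc1.txt': 4}
```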
Related links:
- This answer demonstrates defaultdict(int). defaultdict(list) works pretty much the same way.
- set.intersection example
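For quick reference, here is a small sketch of the three tools mentioned in those links. The sample words and file names are made up.

```python
from collections import defaultdict

words = ["spam", "eggs", "spam"]

# defaultdict(int): missing keys start at 0, handy for counting
counts = defaultdict(int)
for w in words:
    counts[w] += 1

# defaultdict(list): missing keys start as [], handy for grouping
places = defaultdict(list)
for pos, w in enumerate(words):
    places[w].append(pos)

# set.intersection keeps only the elements common to all sets
common = set.intersection({"doc1.txt", "doc2.txt"}, {"doc1.txt", "doc5.txt"})

print(dict(counts))  # {'spam': 2, 'eggs': 1}
print(dict(places))  # {'spam': [0, 2], 'eggs': [1]}
print(common)        # {'doc1.txt'}
```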
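```python
import itertools

index = {...}  # the inverted index (contents elided)

def query(*args):
    result = []
    # (document, frequency) pairs for every query word
    doc_count = [(doc, len(index[word][doc])) for word in args for doc in index[word]]
    # groupby only merges adjacent items, so sort by document first
    doc_count.sort(key=lambda doc: doc[0])
    doc_group = itertools.groupby(doc_count, key=lambda doc: doc[0])

    for doc, group in doc_group:
        result.append((doc, sum([elem[1] for elem in group])))

    # highest total frequency first
    return sorted(result, key=lambda x: x[1])[::-1]
```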
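One thing worth checking with a groupby-based approach: `itertools.groupby` only merges consecutive items that share a key, so pairs produced word by word need to be sorted by document before grouping. The sketch below shows the difference on made-up data; `total_by_doc` is a hypothetical helper, not part of the answer above.

```python
import itertools

# (document, frequency) pairs in word-major order, as a comprehension over
# words and then documents would produce them. Values are made up.
doc_count = [("doc1.txt", 2), ("doc2.txt", 1), ("doc1.txt", 3)]

def total_by_doc(pairs):
    """Sum frequencies per document using groupby on the given order."""
    return [(doc, sum(freq for _, freq in group))
            for doc, group in itertools.groupby(pairs, key=lambda p: p[0])]

unsorted_totals = total_by_doc(doc_count)
sorted_totals = total_by_doc(sorted(doc_count))

print(unsorted_totals)  # [('doc1.txt', 2), ('doc2.txt', 1), ('doc1.txt', 3)]
print(sorted_totals)    # [('doc1.txt', 5), ('doc2.txt', 1)]
```

Without the sort, doc1.txt shows up twice with partial totals instead of once with the combined frequency.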