Python: merging dictionaries with adding values but conserving other fields
I have a text file in the following format:
```
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
```
... with 1 million items.
But some of the word_forms contain an apostrophe while others don't, and I would like to count them as the same word. That is, I would like to merge lines like these two:
```
cup'board cup blabla 12
cupboard cup blabla2 10
```
into this one (frequencies added):
```
cupboard cup blabla2 22
```
As I'm looking for a solution in Python 2.7, my first idea was to read the text file and store the lemmas in two different dictionaries, one for the words with an apostrophe and one for the words without. Then I would go over the dictionary of words with an apostrophe, test whether the stripped word is already in the other dictionary, add the frequencies if it is, and simply add the entry with the apostrophe removed if it is not. Here is my code:
```python
class Lemma:
    """A lemma with the word form, the root, the morphological
    analysis and the frequency in the corpus."""
    def __init__(self, lop):
        self.word_form = lop[0]
        self.root = lop[1]
        self.morph = lop[2]
        self.freq = int(lop[3])

def Reader(filename):
    """Yields the lines of a file one at a time, memory efficient."""
    with open(filename) as f:
        for line in f:
            yield line

def get_word_dict(filename):
    """Separates the word list into two dictionaries, one for words
    with an apostrophe and one for words without.
    Works in a reasonable time.
    This step could be done writing line by line, avoiding any
    storage in memory."""
    word_dict = {}
    word_dict_striped = {}
    # We store the lemmas in two dictionaries: word_dict for words
    # without an apostrophe, word_dict_striped for words with one.
    with open('word_dict.txt', 'wb') as f:
        with open('word_dict_striped.txt', 'wb') as g:
            reader = Reader(filename)
            for line in reader:
                items = line.split("\t")
                word_form = items[0]
                if "'" in word_form:
                    # We remove the apostrophe in the word form and the
                    # morphological analysis, and add the lemma to
                    # word_dict_striped.
                    items[0] = word_form.replace("'", "")
                    items[2] = items[2].replace("+Apos", "")
                    g.write("%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict_striped.update({items[0]: Lemma(items)})
                else:
                    # We just add the lemma to word_dict.
                    f.write("%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict.update({items[0]: Lemma(items)})
    return word_dict, word_dict_striped

def merge_word_dict(word_dict, word_dict_striped):
    """Takes two dictionaries and merges them, adding the
    frequencies when a key is present in both.
    Does not run in reasonable time on the whole list."""
    with open('word_compiled_dict.txt', 'wb') as f:
        for word in word_dict_striped.keys():
            if word in word_dict.keys():
                word_dict[word].freq += word_dict_striped[word].freq
                f.write("%s\t%s\t%s\t%s" % (word_dict[word].word_form,
                                            word_dict[word].root,
                                            word_dict[word].morph,
                                            word_dict[word].freq))
            else:
                word_dict.update({word: word_dict_striped[word]})
        print "Number of words:",
        print(len(word_dict))
        for x in word_dict:
            print x, word_dict[x].root, word_dict[x].morph, word_dict[x].freq
        return word_dict
```
This solution works in a reasonable time up to the storage of the two dictionaries, whether I write them line by line into the text files as I go or keep them as dictionary objects in the program. But the merging of the two dictionaries takes forever!
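A likely culprit for the slow merge: in Python 2.7, `word_dict.keys()` builds a fresh list on every call, so `if word in word_dict.keys()` does a linear scan for each word in `word_dict_striped`, making the loop O(n²) overall. Testing membership on the dictionary itself is a constant-time hash lookup. A minimal sketch of the two forms (illustrative data, not the poster's):

```python
# A dictionary standing in for word_dict.
d = {"word%d" % i: i for i in range(1000)}

def slow_lookup(d, key):
    # Builds a list of all keys on every call, then scans it: O(n).
    # (list() mimics what Python 2's d.keys() returns.)
    return key in list(d.keys())

def fast_lookup(d, key):
    # Hash lookup on the dict itself: O(1), in both Python 2 and 3.
    return key in d
```

Simply changing `if word in word_dict.keys():` to `if word in word_dict:` should make the merge loop linear instead of quadratic.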
For the merging function, the dictionary `update` method would rewrite the entries instead of adding the counts. I saw some solutions for merging dictionaries by adding the counts, with `Counter` among others: Python: Elegantly merge dictionaries with sum() of values, Merge and sum of two dictionaries, How to sum dict elements, How to merge two Python dictionaries in a single expression?, Is there any pythonic way to combine two dicts (adding values for keys that appear in both)? But they seem to work only when the dictionaries are of the form (word, count), whereas I want to carry the other fields along in the dictionary as well.
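A `Counter` can only carry the count, but nothing forces the merge into that shape: a short hand-rolled merge can add one field and keep the others. The sketch below uses a made-up `Entry` class in place of `Lemma` (names are illustrative, not from the question):

```python
class Entry(object):
    """Minimal stand-in for the Lemma class: one extra field plus a count."""
    def __init__(self, root, freq):
        self.root = root
        self.freq = freq

def merge_add(base, extra):
    """Merge `extra` into `base`: sum freq on common keys, copy the rest.
    Other fields (here `root`) are kept from the entry already in `base`."""
    for key, entry in extra.items():
        if key in base:              # O(1) hash lookup, no .keys()
            base[key].freq += entry.freq
        else:
            base[key] = entry
    return base

d1 = {"cupboard": Entry("cup", 10)}
d2 = {"cupboard": Entry("cup", 12), "board": Entry("board", 3)}
merged = merge_add(d1, d2)
```

After the merge, `merged["cupboard"].freq` is 22 and the `root` field is preserved.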
I'm open to any idea or reframing of the problem, since my goal is just to run this program once on the list of files to obtain this merged text file. Thanks in advance!
Here is something that more or less does what you need. Just change the file names at the top. It doesn't modify the original file.
```python
input_file_name = "input.txt"
output_file_name = "output.txt"

def custom_comp(s1, s2):
    word1 = s1.split()[0]
    word2 = s2.split()[0]
    stripped1 = word1.translate(None, "'")
    stripped2 = word2.translate(None, "'")
    if stripped1 > stripped2:
        return 1
    elif stripped1 < stripped2:
        return -1
    else:
        # Same word once the apostrophe is removed: put the
        # apostrophe variant first so the pair ends up adjacent.
        if "'" in word1:
            return -1
        else:
            return 1

def get_word(line):
    return line.split()[0].translate(None, "'")

def get_num(line):
    return int(line.split()[-1])

print "Reading file and sorting..."
lines = []
with open(input_file_name, 'r') as f:
    for line in sorted(f, cmp=custom_comp):
        lines.append(line)
print "File read and sorted"

combined_lines = []
print "Combining entries..."
i = 0
while i < len(lines) - 1:
    if get_word(lines[i]) == get_word(lines[i + 1]):
        total = get_num(lines[i]) + get_num(lines[i + 1])
        new_parts = lines[i + 1].split()
        new_parts[-1] = str(total)
        combined_lines.append("\t".join(new_parts))
        i += 2
    else:
        combined_lines.append(lines[i].strip())
        i += 1
if i == len(lines) - 1:
    # The last line was not merged with anything; keep it.
    combined_lines.append(lines[i].strip())
print "Entries combined"

print "Writing to file..."
with open(output_file_name, 'w+') as f:
    for line in combined_lines:
        f.write(line + "\n")
print "Finished"
```
It sorts the words, so the spacing gets a bit messed up. Let me know if that matters and it can be adjusted.
The other thing is that it sorts the whole file. For only a million lines it probably won't take too long, but again, let me know if that's an issue.
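For reference, the sort can be avoided altogether with a single pass that keys each line by its apostrophe-stripped word form, which is O(n) instead of O(n log n). This is only a sketch under the question's assumed field layout (word form first, frequency last; later occurrences keep their other fields, matching the desired `blabla2` output), written in syntax that also runs on Python 3:

```python
def merge_lines(lines):
    """One pass: key each line by its apostrophe-stripped word form,
    sum frequencies, keep the other fields from the last line seen."""
    merged = {}
    order = []  # preserve first-seen order of stripped words
    for line in lines:
        parts = line.split()
        key = parts[0].replace("'", "")
        parts[0] = key
        if key in merged:
            # Add the previously accumulated frequency to this line's.
            parts[-1] = str(int(merged[key][-1]) + int(parts[-1]))
        else:
            order.append(key)
        merged[key] = parts
    return ["\t".join(merged[k]) for k in order]
```

With the question's two example lines, `merge_lines(["cup'board cup blabla 12", "cupboard cup blabla2 10"])` yields a single `cupboard` line with frequency 22.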