Python hashable dicts
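As an exercise, and mostly for my own amusement, I'm implementing a backtracking packrat parser. The inspiration is that I'd like a better idea of how hygienic macros would work in an algol-like language (as opposed to the syntax-free lisp dialects you normally find them in). Because of this, different passes through the input might see different grammars, so cached parse results are invalid unless I also store the current version of the grammar along with them. (EDIT: a consequence of this use of key-value collections is that they should be immutable, but I don't intend to expose an interface that allows them to be changed, so either mutable or immutable collections are fine.)
The problem is that Python dicts cannot appear as keys to other dicts. Even using a tuple (as I'd be doing anyway) doesn't help.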
    >>> cache = {}
    >>> rule = {"foo": "bar"}
    >>> cache[(rule, "baz")] = "quux"
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unhashable type: 'dict'
    >>>
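I guess it has to be tuples all the way down. Now, the Python standard library provides approximately what I need in collections.namedtuple, which can be used as a key. Continuing from the session above: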
    >>> from collections import namedtuple
    >>> Rule = namedtuple("Rule", rule.keys())
    >>> cache[(Rule(**rule), "baz")] = "quux"
    >>> cache
    {(Rule(foo='bar'), 'baz'): 'quux'}
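OK. But I have to make a class for each possible combination of keys in the rules I want to use. That isn't so bad, because each parse rule knows exactly which parameters it uses, so the class can be defined at the same time as the function that parses the rule.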
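Edit: an additional problem with namedtuples is that they compare strictly by position. Two tuples that look like they should be different can compare equal, and two that look the same can compare unequal: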
    >>> you = namedtuple("foo", ["bar", "baz"])
    >>> me = namedtuple("foo", ["bar", "quux"])
    >>> you(bar=1, baz=2) == me(bar=1, quux=2)
    True
    >>> bob = namedtuple("foo", ["baz", "bar"])
    >>> you(bar=1, baz=2) == bob(bar=1, baz=2)
    False
tl;dr: How do I get dicts that can be used as keys to other dicts?
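Having hacked a bit on the answers, here's the more complete solution I'm using. Note that this does a bit of extra work to make the resulting dicts vaguely immutable for practical purposes. Of course, it is still easy to get around that by calling the base dict methods directly, as __add__ does with dict.update.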
    class hashdict(dict):
        """
        hashable dict implementation, suitable for use as a key into
        other dicts.

            >>> h1 = hashdict({"apples": 1, "bananas": 2})
            >>> h2 = hashdict({"bananas": 3, "mangoes": 5})
            >>> h1+h2
            hashdict(apples=1, bananas=3, mangoes=5)
            >>> d1 = {}
            >>> d1[h1] = "salad"
            >>> d1[h1]
            'salad'
            >>> d1[h2]
            Traceback (most recent call last):
            ...
            KeyError: hashdict(bananas=3, mangoes=5)

        based on answers from
           http://stackoverflow.com/questions/1151658/python-hashable-dicts

        """
        def __key(self):
            return tuple(sorted(self.items()))

        def __repr__(self):
            return "{0}({1})".format(self.__class__.__name__,
                ", ".join("{0}={1}".format(
                        str(i[0]), repr(i[1])) for i in self.__key()))

        def __hash__(self):
            return hash(self.__key())

        def __setitem__(self, key, value):
            raise TypeError("{0} does not support item assignment"
                             .format(self.__class__.__name__))

        def __delitem__(self, key):
            raise TypeError("{0} does not support item assignment"
                             .format(self.__class__.__name__))

        def clear(self):
            raise TypeError("{0} does not support item assignment"
                             .format(self.__class__.__name__))

        def pop(self, *args, **kwargs):
            raise TypeError("{0} does not support item assignment"
                             .format(self.__class__.__name__))

        def popitem(self, *args, **kwargs):
            raise TypeError("{0} does not support item assignment"
                             .format(self.__class__.__name__))

        def setdefault(self, *args, **kwargs):
            raise TypeError("{0} does not support item assignment"
                             .format(self.__class__.__name__))

        def update(self, *args, **kwargs):
            raise TypeError("{0} does not support item assignment"
                             .format(self.__class__.__name__))

        # update is not ok because it mutates the object
        # __add__ is ok because it creates a new object
        # while the new object is under construction, it's ok to mutate it
        def __add__(self, right):
            result = hashdict(self)
            dict.update(result, right)
            return result

    if __name__ == "__main__":
        import doctest
        doctest.testmod()
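Hashable objects should be immutable. If you don't enforce that but simply TRUST yourself not to mutate a dict after its first use as a key, the following approach works: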
    class hashabledict(dict):
        def __key(self):
            return tuple((k, self[k]) for k in sorted(self))
        def __hash__(self):
            return hash(self.__key())
        def __eq__(self, other):
            return self.__key() == other.__key()
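If you DO need to mutate your dicts and STILL want to use them as keys, complexity explodes a hundredfold. Not that it can't be done, but I'll wait for a VERY specific indication before I get into THAT incredible morass!-)
Here is the easy way to make a hashable dictionary. Just remember not to mutate them after embedding them in another dictionary, for obvious reasons.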
    class hashabledict(dict):
        def __hash__(self):
            return hash(tuple(sorted(self.items())))
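To make the "obvious reasons" concrete, here is a minimal sketch, assuming Python 3. The class is the one from this answer; the names rule, cache and lost are only for illustration. Mutating a dict after it has been used as a key changes its hash, so the stored entry can no longer be found:

```python
class hashabledict(dict):
    def __hash__(self):
        return hash(tuple(sorted(self.items())))

rule = hashabledict(foo="bar")
cache = {(rule, "baz"): "quux"}

# An equal, freshly built key finds the entry.
fresh_hit = cache[(hashabledict(foo="bar"), "baz")]

# Mutating the key afterwards changes its hash, so the stored entry
# is still in the cache but can no longer be reached through this key.
rule["extra"] = 1
lost = (rule, "baz") not in cache
print(fresh_hit, lost)
```

The entry is not deleted; it just sits in a slot chosen from the old hash, which is why the advice is to treat these dicts as frozen once they are used as keys.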
All that is needed to make dictionaries usable for your purpose is to add a __hash__ method:
    class Hashabledict(dict):
        def __hash__(self):
            return hash(frozenset(self))
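Note that the frozenset conversion works for all dictionaries (i.e. it doesn't require the keys to be sortable). Likewise, there is no restriction on the dictionary values.
If there are many dictionaries with identical keys but distinct values, the hash needs to take the values into account. The fastest way to do that is: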
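A small sketch of that point, assuming Python 3 (where int and str cannot be ordered against each other). The dict named mixed is made up for illustration; it has unsortable keys and unhashable values, so the sorted-tuple recipe fails while the keys-only frozenset hash still works:

```python
class Hashabledict(dict):
    def __hash__(self):
        return hash(frozenset(self))

mixed = {1: [1, 2], "a": {"nested": True}}

# On Python 3, comparing 1 with "a" raises TypeError, so sorting the items fails.
try:
    hash(tuple(sorted(mixed.items())))
    sorted_ok = True
except TypeError:
    sorted_ok = False

# Hashing the frozenset of keys never sorts and never looks at the values.
keys_hash = hash(Hashabledict(mixed))
print(sorted_ok, isinstance(keys_hash, int))
```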
    class Hashabledict(dict):
        def __hash__(self):
            # itervalues() is Python 2; on Python 3 use self.values()
            return hash((frozenset(self), frozenset(self.itervalues())))
This is quicker than hashing a sorted tuple of the items, because it avoids the sort.
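Here is a rough Python 3 comparison of the two variants from this answer. The class names KeysOnly and KeysAndValues and the sample data are hypothetical; the second class needs hashable values, unlike the first:

```python
class KeysOnly(dict):
    def __hash__(self):
        return hash(frozenset(self))

class KeysAndValues(dict):
    def __hash__(self):
        # Python 3 spelling of itervalues(); the values must be hashable here.
        return hash((frozenset(self), frozenset(self.values())))

a = {"name": "alice", "age": 30}
b = {"name": "bob", "age": 31}

# Same keys, so the keys-only hash is identical for both dicts.
same_keys_collide = hash(KeysOnly(a)) == hash(KeysOnly(b))

# Including the values separates them (barring an accidental collision).
values_separate = hash(KeysAndValues(a)) != hash(KeysAndValues(b))
print(same_keys_collide, values_separate)
```

Note that the second hash ignores which key maps to which value, so {"x": 1, "y": 2} and {"x": 2, "y": 1} still share a hash. That is valid, since equality is what finally decides a lookup, but it is worth knowing.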
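The given answers are okay, but they could be improved by using frozenset(...) instead of tuple(sorted(...)) to generate the hash: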
    >>> import timeit
    >>> timeit.timeit('hash(tuple(sorted(d.iteritems())))', "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')")
    4.7758948802947998
    >>> timeit.timeit('hash(frozenset(d.iteritems()))', "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')")
    1.8153600692749023
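The performance advantage depends on the content of the dictionary, but in most cases I've tested, hashing with frozenset was noticeably faster, since no sort is needed.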
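For anyone repeating the measurement on Python 3, a sketch of the same benchmark with items() in place of iteritems(). The reduced number argument is my own choice to keep the run short; absolute timings will differ from the figures quoted above and depend on machine and interpreter:

```python
import timeit

setup = "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')"

# number is lowered from timeit's default of 1,000,000 to keep the run short.
t_sorted = timeit.timeit("hash(tuple(sorted(d.items())))", setup, number=100000)
t_frozen = timeit.timeit("hash(frozenset(d.items()))", setup, number=100000)

print("sorted tuple: {:.3f}s".format(t_sorted))
print("frozenset:    {:.3f}s".format(t_frozen))
```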
A reasonably clean, straightforward implementation is
    import collections

    class FrozenDict(collections.Mapping):
        """Don't forget the docstrings!!"""

        def __init__(self, *args, **kwargs):
            self._d = dict(*args, **kwargs)

        def __iter__(self):
            return iter(self._d)

        def __len__(self):
            return len(self._d)

        def __getitem__(self, key):
            return self._d[key]

        def __hash__(self):
            return hash(tuple(sorted(self._d.iteritems())))
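The listing above is Python 2 (collections.Mapping, iteritems). A possible Python 3 port, as a sketch only: it imports Mapping from collections.abc, and I have swapped the sorted tuple for a frozenset of items so the keys need not be sortable. That swap is my assumption, not part of the answer. The cache usage at the end is illustrative:

```python
from collections.abc import Mapping

class FrozenDict(Mapping):
    """Read-only mapping that can be used as a dict key (illustrative port)."""

    def __init__(self, *args, **kwargs):
        self._d = dict(*args, **kwargs)

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)

    def __getitem__(self, key):
        return self._d[key]

    def __hash__(self):
        # frozenset avoids needing sortable keys; values must still be hashable.
        return hash(frozenset(self._d.items()))

cache = {FrozenDict(foo="bar"): "quux"}
hit = cache[FrozenDict({"foo": "bar"})]
print(hit)
```

Because Mapping supplies __eq__, keys(), items(), get() and friends but no __setitem__, item assignment on an instance fails with a TypeError without any extra code.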
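I keep coming back to this topic... Here's another variation. I'm uneasy with subclassing dict just to add a __hash__ method, because dicts are mutable and trusting that they won't change seems weak. So this version builds the mapping on an immutable built-in type instead.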
What if you stuff the key, value pairs into a frozenset? What would that require, and how would it work?
Part 1: you need a way of encoding the "items" so that a frozenset treats them mainly by their keys; I'll make a little subclass for that.
    import collections
    class pair(collections.namedtuple('pair_base', 'key value')):
        def __hash__(self):
            return hash((self.key, None))
        def __eq__(self, other):
            if type(self) != type(other):
                return NotImplemented
            return self.key == other.key
        def __repr__(self):
            return repr((self.key, self.value))
That alone puts you within spitting distance of an immutable mapping:
    >>> frozenset(pair(k, v) for k, v in enumerate('abcd'))
    frozenset([(0, 'a'), (2, 'c'), (1, 'b'), (3, 'd')])
    >>> pairs = frozenset(pair(k, v) for k, v in enumerate('abcd'))
    >>> pair(2, None) in pairs
    True
    >>> pair(5, None) in pairs
    False
    >>> goal = frozenset((pair(2, None),))
    >>> pairs & goal
    frozenset([(2, None)])
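D'oh! Unfortunately, when you use the set operators and the elements are equal but not the same object, which one ends up in the return value is undefined, so we'll have to go through some more gyrations.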
    >>> pairs - (pairs - goal)
    frozenset([(2, 'c')])
    >>> iter(pairs - (pairs - goal)).next().value
    'c'
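However, looking values up this way is cumbersome and, worse, creates lots of intermediate sets; that won't do! We'll create a "fake" key-value pair to get around it: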
    class Thief(object):
        def __init__(self, key):
            self.key = key
        def __hash__(self):
            return hash(pair(self.key, None))
        def __eq__(self, other):
            self.value = other.value
            return pair(self.key, None) == other
Which results in the less problematic:
    >>> thief = Thief(2)
    >>> thief in pairs
    True
    >>> thief.value
    'c'
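The trick can be tried in Python 3 with the self-contained sketch below. The classes are renamed Pair and Thief here and simplified, so treat them as an illustrative port rather than the answer's exact code. It relies on how CPython falls back to the reflected __eq__ when Pair.__eq__ returns NotImplemented, which is behaviour I am assuming rather than something the answer spells out:

```python
import collections

class Pair(collections.namedtuple("PairBase", "key value")):
    __slots__ = ()

    def __hash__(self):
        # Hash on the key only, so the value never affects set placement.
        return hash((self.key, None))

    def __eq__(self, other):
        if type(self) is not type(other):
            return NotImplemented  # lets Thief.__eq__ get its turn
        return self.key == other.key

class Thief:
    def __init__(self, key):
        self.key = key

    def __hash__(self):
        return hash((self.key, None))

    def __eq__(self, other):
        # Called with the stored Pair; steal its value as a side effect.
        self.value = other.value
        return self.key == other.key

pairs = frozenset(Pair(k, v) for k, v in enumerate("abcd"))
thief = Thief(2)
found = thief in pairs
missing = Thief(5) in pairs
print(found, thief.value, missing)
```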
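That's all the deep magic; the rest is wrapping it up into something with a dict-like interface. Since we're subclassing from frozenset, whose interface is very different, there are quite a few methods to override; collections.Mapping helps a little: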
    class FrozenDict(frozenset, collections.Mapping):
        def __new__(cls, seq=()):
            return frozenset.__new__(cls, (pair(k, v) for k, v in seq))
        def __getitem__(self, key):
            thief = Thief(key)
            if frozenset.__contains__(self, thief):
                return thief.value
            raise KeyError(key)
        def __eq__(self, other):
            if not isinstance(other, FrozenDict):
                return dict(self.iteritems()) == other
            if len(self) != len(other):
                return False
            for key, value in self.iteritems():
                try:
                    if value != other[key]:
                        return False
                except KeyError:
                    return False
            return True
        def __hash__(self):
            return hash(frozenset(self.iteritems()))
        def get(self, key, default=None):
            thief = Thief(key)
            if frozenset.__contains__(self, thief):
                return thief.value
            return default
        def __iter__(self):
            for item in frozenset.__iter__(self):
                yield item.key
        def iteritems(self):
            for item in frozenset.__iter__(self):
                yield (item.key, item.value)
        def iterkeys(self):
            for item in frozenset.__iter__(self):
                yield item.key
        def itervalues(self):
            for item in frozenset.__iter__(self):
                yield item.value
        def __contains__(self, key):
            return frozenset.__contains__(self, pair(key, None))
        has_key = __contains__
        def __repr__(self):
            return type(self).__name__ + (', '.join(repr(item) for item in self.iteritems())).join('()')
        @classmethod
        def fromkeys(cls, keys, value=None):
            return cls((key, value) for key in keys)
And, ultimately, it does answer my own question:
    >>> myDict = {}
    >>> myDict[FrozenDict(enumerate('ab'))] = 5
    >>> FrozenDict(enumerate('ab')) in myDict
    True
    >>> FrozenDict(enumerate('bc')) in myDict
    False
    >>> FrozenDict(enumerate('ab', 3)) in myDict
    False
    >>> myDict[FrozenDict(enumerate('ab'))]
    5
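The accepted answer by @unknown and the answer by @alexamartelli both work perfectly well, but only under the following constraints:
The much faster answer by @obensonne lifts constraints 2 and 3, but is still bound by constraint 1 (values must be hashable).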
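The even faster answer by @raymondhettinger lifts all 3 constraints, because it does not include the values in the hash calculation. However, it performs well only if a fourth condition holds: most of the non-equal dictionaries being hashed do not have identical keys.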
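If this condition isn't satisfied, the hash function is still valid, but it may cause too many collisions. For example, in the extreme case where all the dictionaries are generated from a website template (field names as keys, user input as values), the keys will always be the same and the hash function will return the same value for every input. As a result, retrieving an item from a hash table that relies on such a hash function degrades to a linear scan (O(N) instead of O(1)).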
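The degradation can be observed directly by counting equality checks. This is only a sketch, assuming Python 3 on CPython; the class name KeysOnlyHash, the eq_calls counter and the fake form data are all hypothetical:

```python
class KeysOnlyHash(dict):
    eq_calls = 0  # class-level counter, for illustration only

    def __hash__(self):
        return hash(frozenset(self))

    def __eq__(self, other):
        KeysOnlyHash.eq_calls += 1
        return dict.__eq__(self, other)

# 200 "form submissions": identical field names, different user input.
forms = [KeysOnlyHash(name="user%d" % i, email="u%d@example.com" % i)
         for i in range(200)]
table = {form: i for i, form in enumerate(forms)}

all_same_hash = len({hash(f) for f in forms}) == 1

KeysOnlyHash.eq_calls = 0
found = table[KeysOnlyHash(forms[-1])]  # look up an equal copy of the last form
calls_for_one_lookup = KeysOnlyHash.eq_calls
print(all_same_hash, found, calls_for_one_lookup)
```

Every stored key shares one hash value, so a single lookup has to fall back on __eq__ against many stored keys instead of roughly one. The exact count is an implementation detail, but it grows with the number of entries.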
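I think the following solution will work reasonably well even if all 4 constraints I listed above are violated. It has the additional advantage that it can hash not only dictionaries but any container, even one with nested mutable containers.
I'd much appreciate any feedback on this, since I've only tested it lightly so far.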
    # python 3.4
    import collections
    import collections.abc
    import operator
    import sys
    import itertools
    import reprlib

    # a wrapper to make an object hashable, while preserving equality
    class AutoHash:
        # for each known container type, we can optionally provide a tuple
        # specifying: type, transform, aggregator
        # even immutable types need to be included, since their items
        # may make them unhashable

        # transformation may be used to enforce the desired iteration
        # the result of a transformation must be an iterable
        # default: no change; for dictionaries, we use .items() to see values

        # usually transformation choice only affects efficiency, not correctness

        # aggregator is the function that combines all items into one object
        # default: frozenset; for ordered containers, we can use tuple

        # aggregator choice affects both efficiency and correctness
        # e.g., using a tuple aggregator for a set is incorrect,
        # since identical sets may end up with different hash values
        # frozenset is safe since at worst it just causes more collisions
        # unfortunately, no collections.abc class is available that helps
        # distinguish ordered from unordered containers
        # so we need to just list them out manually as needed

        type_info = collections.namedtuple(
            'type_info',
            'type transformation aggregator')

        ident = lambda x: x
        # order matters; first match is used to handle a datatype
        known_types = (
            # dict also handles defaultdict
            type_info(dict, lambda d: d.items(), frozenset),
            # no need to include set and frozenset, since they are fine with defaults
            type_info(collections.OrderedDict, ident, tuple),
            type_info(list, ident, tuple),
            type_info(tuple, ident, tuple),
            type_info(collections.deque, ident, tuple),
            # collections.abc.Iterable: the bare collections.Iterable alias
            # was removed in Python 3.10
            type_info(collections.abc.Iterable, ident, frozenset)  # other iterables
        )

        # hash_func can be set to replace the built-in hash function
        # cache can be turned on; if it is, cycles will be detected,
        # otherwise cycles in a data structure will cause failure
        def __init__(self, data, hash_func=hash, cache=False, verbose=False):
            self._data = data
            self.hash_func = hash_func
            self.verbose = verbose
            self.cache = cache
            # cache objects' hashes for performance and to deal with cycles
            if self.cache:
                self.seen = {}

        def hash_ex(self, o):
            # note: isinstance(o, Hashable) won't check inner types
            try:
                if self.verbose:
                    print(type(o),
                          reprlib.repr(o),
                          self.hash_func(o),
                          file=sys.stderr)
                return self.hash_func(o)
            except TypeError:
                pass

            # we let built-in hash decide if the hash value is worth caching
            # so we don't cache the built-in hash results
            if self.cache and id(o) in self.seen:
                return self.seen[id(o)][0]  # found in cache

            # check if o can be handled by decomposing it into components
            for typ, transformation, aggregator in AutoHash.known_types:
                if isinstance(o, typ):
                    # another option is:
                    # result = reduce(operator.xor, map(_hash_ex, handler(o)))
                    # but collisions are more likely with xor than with frozenset
                    # e.g. hash_ex([1,2,3,4])==0 with xor

                    try:
                        # try to frozenset the actual components, it's faster
                        h = self.hash_func(aggregator(transformation(o)))
                    except TypeError:
                        # components not hashable with built-in;
                        # apply our extended hash function to them
                        h = self.hash_func(aggregator(map(self.hash_ex, transformation(o))))
                    if self.cache:
                        # storing the object too, otherwise memory location will be reused
                        self.seen[id(o)] = (h, o)
                    if self.verbose:
                        print(type(o), reprlib.repr(o), h, file=sys.stderr)
                    return h

            raise TypeError('Object {} of type {} not hashable'.format(repr(o), type(o)))

        def __hash__(self):
            return self.hash_ex(self._data)

        def __eq__(self, other):
            # short circuit to save time
            if self is other:
                return True

            # 1) type(self) a proper subclass of type(other) => self.__eq__ will be called first
            # 2) any other situation => lhs.__eq__ will be called first

            # case 1. one side is a subclass of the other, and AutoHash.__eq__ is not overridden in either
            # => the subclass instance's __eq__ is called first, and we should compare self._data and other._data
            # case 2. neither side is a subclass of the other; self is lhs
            # => we can't compare to another type; we should let the other side decide what to do, return NotImplemented
            # case 3. neither side is a subclass of the other; self is rhs
            # => we can't compare to another type, and the other side already tried and failed;
            # we should return False, but NotImplemented will have the same effect
            # any other case: we won't reach the __eq__ code in this class, no need to worry about it

            if isinstance(self, type(other)):  # identifies case 1
                return self._data == other._data
            else:  # identifies cases 2 and 3
                return NotImplemented

    d1 = {'a': [1, 2], 2: {3: 4}}
    print(hash(AutoHash(d1, cache=True, verbose=True)))

    d = AutoHash(dict(a=1, b=2, c=3, d=[4, 5, 6, 7], e='a string of chars'), cache=True, verbose=True)
    print(hash(d))
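You might also want to add these two methods to get the v2 pickling protocol to work with hashdict instances. Otherwise cPickle will try to use hashdict.__setitem__, resulting in a TypeError. Interestingly, with the other two versions of the protocol your code works just fine.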
    def __setstate__(self, objstate):
        for k, v in objstate.items():
            dict.__setitem__(self, k, v)
    def __reduce__(self):
        return (hashdict, (), dict(self),)
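If you don't put numbers in the dictionary and you never lose the variables that hold your dictionaries, you can use id(rule) as the key, because id() is unique for every dictionary that is still alive.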
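A short sketch of the id() idea, reusing the cache and rule names from the question; the other names are only for illustration. It works only while the original dict object stays alive, and an equal copy does not hit the cache:

```python
cache = {}
rule = {"foo": "bar"}

# Key on the object's identity; `rule` must stay alive for the id to stay valid.
cache[id(rule)] = "quux"
same_object_hit = cache.get(id(rule))

# An equal but distinct dict has a different id, so it misses the cache.
equal_copy = dict(rule)
copy_hit = cache.get(id(equal_copy))
print(same_object_hit, copy_hit)
```

This is identity-based caching rather than value-based caching, which is why it only suits the case where the very same grammar object is passed around.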
EDIT:
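Oh sorry, yeah, in that case what the other guys said would be better. I think you could also serialize your dictionaries to a string and use that string as the key.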
If you need to recover your dictionaries from the keys, though, you'd have to do something uglier.
I guess the advantage of this is that you wouldn't have to write as much code.
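One way the string idea could look, as a sketch only. The answer does not say which serialization to use, so json.dumps with sort_keys=True is my assumption, and dict_key is a hypothetical helper. It only suits dicts whose contents are JSON-compatible:

```python
import json

def dict_key(d):
    # Hypothetical helper: sort_keys makes equal dicts serialize identically.
    return json.dumps(d, sort_keys=True)

cache = {}
rule = {"foo": "bar", "n": 1}
cache[(dict_key(rule), "baz")] = "quux"

# Key order in the source dict does not matter.
hit = cache[(dict_key({"n": 1, "foo": "bar"}), "baz")]

# For JSON-compatible contents the dict can be rebuilt from the key.
recovered = json.loads(dict_key(rule))
print(hit, recovered)
```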