Quickly convert numpy arrays with index to dict of numpy arrays keyed on that index
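I have a set of numpy arrays. One of them is a list of "keys", and I'd like to rearrange the arrays into dicts of arrays keyed on that key. My current code is: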
```python
for key, val1, val2 in itertools.izip(keys, vals1, vals2):
    dict1[key].append(val1)
    dict2[key].append(val2)
```
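This is pretty slow, since the arrays involved are millions of entries long and this happens many times. Is it possible to rewrite this in vectorized form? The set of possible keys is known ahead of time, and there are about 10 distinct keys.
Edit: If there are k distinct keys and the list is n long, the current answers are O(nk) (one pass per key) and O(n log n) (sort first). I'm still looking for an O(n) vectorized solution, though. Hopefully this is possible; after all, the simplest possible non-vectorized approach (i.e. what I already have) is O(n).
Some timings: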
```python
import numpy as np
import itertools

def john1024(keys, v1, v2):
    d1 = {}; d2 = {};
    for k in set(keys):
        d1[k] = v1[k==keys]
        d2[k] = v2[k==keys]
    return d1,d2

def birico(keys, v1, v2):
    order = keys.argsort()
    keys_sorted = keys[order]
    diff = np.ones(keys_sorted.shape, dtype=bool)
    diff[1:] = keys_sorted[1:] != keys_sorted[:-1]
    key_change = diff.nonzero()[0]
    uniq_keys = keys_sorted[key_change]
    v1_split = np.split(v1[order], key_change[1:])
    d1 = dict(zip(uniq_keys, v1_split))
    v2_split = np.split(v2[order], key_change[1:])
    d2 = dict(zip(uniq_keys, v2_split))
    return d1,d2

def knzhou(keys, v1, v2):
    d1 = {k:[] for k in np.unique(keys)}
    d2 = {k:[] for k in np.unique(keys)}
    for key, val1, val2 in itertools.izip(keys, v1, v2):
        d1[key].append(val1)
        d2[key].append(val2)
    return d1,d2
```
I used 10 keys and 20 million entries:
```python
import timeit

keys = np.random.randint(0, 10, size=20000000) #10 keys, 20M entries
vals1 = np.random.random(keys.shape)
vals2 = np.random.random(keys.shape)

timeit.timeit("john1024(keys, vals1, vals2)","from __main__ import john1024, keys, vals1, vals2", number=3)
11.121668815612793
timeit.timeit("birico(keys, vals1, vals2)","from __main__ import birico, keys, vals1, vals2", number=3)
8.107877969741821
timeit.timeit("knzhou(keys, vals1, vals2)","from __main__ import knzhou, keys, vals1, vals2", number=3)
51.76217794418335
```
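So the sorting technique is a bit faster than letting numpy find the indices corresponding to each key (about 8.1 s versus 11.1 s for three runs), and both are much faster than looping in Python (about 51.8 s). Vectorization is great!
This was on Python 2.7.12, numpy 1.9.2.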
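The vectorized way to do this probably requires you to sort your keys. The basic idea is to sort the keys and the vals in the same order. You can then split the vals array at every position where a new key starts in the sorted keys array. The vectorized code looks something like this: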
```python
import numpy as np

keys = np.random.randint(0, 10, size=20)
vals1 = np.random.random(keys.shape)
vals2 = np.random.random(keys.shape)

order = keys.argsort()
keys_sorted = keys[order]

# Find uniq keys and key changes
diff = np.ones(keys_sorted.shape, dtype=bool)
diff[1:] = keys_sorted[1:] != keys_sorted[:-1]
key_change = diff.nonzero()[0]
uniq_keys = keys_sorted[key_change]

vals1_split = np.split(vals1[order], key_change[1:])
dict1 = dict(zip(uniq_keys, vals1_split))

vals2_split = np.split(vals2[order], key_change[1:])
dict2 = dict(zip(uniq_keys, vals2_split))
```
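Because of the argsort step, the complexity of this method is O(n log n). In practice argsort is very fast unless n is very large. You are likely to run into memory issues with this method before argsort becomes noticeably slow.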
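The same sort-and-split idea can be wrapped in a helper that handles any number of value arrays. This is only a sketch: the function name `group_by_key` is made up for illustration, and it uses `np.unique(..., return_index=True)` on the already sorted keys to find where each key starts, instead of the explicit `diff` array above.

```python
import numpy as np

def group_by_key(keys, *value_arrays):
    """Hypothetical helper: one dict per value array, keyed on the unique keys."""
    # A stable sort keeps the original order of values within each key.
    order = np.argsort(keys, kind="stable")
    keys_sorted = keys[order]
    # On sorted input, the first-occurrence indices are the group boundaries.
    uniq_keys, first_idx = np.unique(keys_sorted, return_index=True)
    dicts = []
    for vals in value_arrays:
        pieces = np.split(vals[order], first_idx[1:])
        dicts.append(dict(zip(uniq_keys.tolist(), pieces)))
    return dicts

keys = np.array([2, 0, 1, 0, 2, 0])
dict1, dict2 = group_by_key(keys, np.arange(6), np.arange(10, 16))
```

The cost is still dominated by the argsort, so this does not change the O(n log n) bound; it only avoids repeating the bookkeeping for each value array.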
Let's import numpy and create some sample data:
```python
>>> import numpy as np
>>> keys = np.array(('key1', 'key2', 'key3', 'key1', 'key2', 'key1'))
>>> vals1 = np.arange(6)
>>> vals2 = np.arange(10, 16)
```
Now, let's create the dictionaries:
```python
>>> d1 = {}; d2 = {}
>>> for k in set(keys):
...     d1[k] = vals1[k==keys]
...     d2[k] = vals2[k==keys]
...
>>> d1
{'key3': array([2]), 'key2': array([1, 4]), 'key1': array([0, 3, 5])}
>>> d2
{'key3': array([12]), 'key2': array([11, 14]), 'key1': array([10, 13, 15])}
```
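Since the question says the possible keys are known in advance, a small variation is to loop over that known list instead of `set(keys)`, and to compute each boolean mask once and reuse it for both value arrays. The sketch below assumes a hypothetical `known_keys` list supplied by the caller; it is still one pass over the data per key, so the O(nk) cost is unchanged.

```python
import numpy as np

def split_by_mask(keys, vals1, vals2, known_keys):
    """Hypothetical helper: build both dicts using one boolean mask per key."""
    d1, d2 = {}, {}
    for k in known_keys:
        mask = keys == k       # computed once, reused for both arrays
        d1[k] = vals1[mask]
        d2[k] = vals2[mask]
    return d1, d2

keys = np.array(('key1', 'key2', 'key3', 'key1', 'key2', 'key1'))
vals1 = np.arange(6)
vals2 = np.arange(10, 16)
d1, d2 = split_by_mask(keys, vals1, vals2, ['key1', 'key2', 'key3', 'key4'])
```

A key that is listed in `known_keys` but never appears in `keys` (here `'key4'`) simply maps to an empty array.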
```python
In [19]: keys = np.random.choice(np.arange(10),100)
In [20]: vals=np.arange(100)
In [21]: from collections import defaultdict
In [22]: dd = defaultdict(list)
In [23]: for k,v in zip(keys, vals):
    ...:     dd[k].append(v)
    ...:
In [24]: dd
Out[24]:
defaultdict(list,
            {0: [4, 39, 47, 84, 87],
             1: [0, 25, 41, 46, 55, 58, 74, 77, 89, 92, 95],
             2: [3, 9, 15, 24, 44, 54, 63, 66, 71, 80, 81],
             3: [1, 13, 16, 37, 57, 76, 91, 93],
             ...
             8: [51, 52, 56, 60, 68, 82, 88, 97, 99],
             9: [21, 29, 30, 34, 35, 59, 73, 86]})
```
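However, with a small, known set of keys you don't need this specialized dictionary, since you can easily create the dictionary entries ahead of time: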
```python
dd = {k:[] for k in np.unique(keys)}
```
But since you are starting with arrays, using array operations to sort and collect like values may well be worth it.
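If the list-based loop is kept anyway, the collected lists can be converted back to arrays at the end. A minimal sketch, with small made-up sample data and with the arrays converted via `.tolist()` first so that the loop works on plain Python values (an assumption about what is convenient, not something measured here):

```python
import numpy as np

keys = np.array([1, 0, 1, 2, 0])
vals = np.array([10, 11, 12, 13, 14])

# Pre-create an empty list for every key that occurs.
dd = {k: [] for k in np.unique(keys).tolist()}
for k, v in zip(keys.tolist(), vals.tolist()):
    dd[k].append(v)

# Turn each list back into a numpy array.
arrays = {k: np.asarray(v) for k, v in dd.items()}
```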