Fast replacement of values in a numpy array
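I have a very large numpy array (containing up to a million elements) like the one below:

    [ 0 1 6 5 1 2 7 6 2 3 8 7 3 4 9 8 5 6 11 10 6 7 12 11 7
      8 13 12 8 9 14 13 10 11 16 15 11 12 17 16 12 13 18 17 13 14 19 18 15 16
      21 20 16 17 22 21 17 18 23 22 18 19 24 23]

and a small dictionary map for replacing some of the elements in the above array:

    {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}

I would like to replace some of the elements according to the map above. The numpy array is really large, and only a small subset of the elements (those occurring as keys in the dictionary) will be replaced with the corresponding values. What is the fastest way to do this?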
I believe there is an even more efficient method, but for now, try:

    from numpy import copy

    newArray = copy(theArray)
    # One vectorized comparison and assignment per dictionary entry
    for k, v in d.items():
        newArray[theArray == k] = v
Microbenchmark and correctness test (Python 2.7):

    #!/usr/bin/env python2.7

    from numpy import copy, random, arange

    random.seed(0)
    data = random.randint(30, size=10**5)

    d = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}
    dk = d.keys()
    dv = d.values()

    def f1(a, d):
        b = copy(a)
        for k, v in d.iteritems():
            b[a==k] = v
        return b

    def f2(a, d):
        for i in xrange(len(a)):
            a[i] = d.get(a[i], a[i])
        return a

    def f3(a, dk, dv):
        mp = arange(0, max(a)+1)
        mp[dk] = dv
        return mp[a]

    a = copy(data)
    res = f2(a, d)

    assert (f1(data, d) == res).all()
    assert (f3(data, dk, dv) == res).all()
Results:

    $ python2.7 -m timeit -s 'from w import f1,f3,data,d,dk,dv' 'f1(data,d)'
    100 loops, best of 3: 6.15 msec per loop

    $ python2.7 -m timeit -s 'from w import f1,f3,data,d,dk,dv' 'f3(data,dk,dv)'
    100 loops, best of 3: 19.6 msec per loop
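One plausible contributor to the f3 timing is the built-in max(a), which walks the array element by element in Python; this is an assumption about the benchmark, not something measured above. A minimal Python 3 sketch of the same lookup-table idea that uses the array's own max() method instead (the name f3_fast is hypothetical):

```python
import numpy as np

def f3_fast(a, d):
    """Lookup-table replacement; assumes `a` holds non-negative integers
    and `d` is a non-empty dict of non-negative integer keys."""
    keys = np.fromiter(d.keys(), dtype=a.dtype, count=len(d))
    vals = np.fromiter(d.values(), dtype=a.dtype, count=len(d))
    # Identity table (mp[i] == i), large enough for every value and every key.
    size = max(int(a.max()), int(keys.max())) + 1
    mp = np.arange(size, dtype=a.dtype)
    mp[keys] = vals   # overwrite only the entries that must change
    return mp[a]      # one fancy-indexing pass performs the replacement

rng = np.random.RandomState(0)
data = rng.randint(30, size=10**5)
d = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}
result = f3_fast(data, d)
```

Sizing the table from both the data and the keys avoids an IndexError when a key is larger than every value in the array. Timings will vary by machine and NumPy version, so it is worth re-running the benchmark before drawing conclusions.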
Assuming the values are between 0 and some maximum integer, you can use a numpy array as an integer-to-integer lookup table:

    import numpy

    mp = numpy.arange(0, data.max() + 1)               # identity table: mp[i] == i
    mp[list(replace.keys())] = list(replace.values())  # overwrite the mapped entries
    data = mp[data]

where, starting from

    data = [ 0 1 6 5 1 2 7 6 2 3 8 7 3 4 9 8 5 6 11 10 6 7 12 11 7
      8 13 12 8 9 14 13 10 11 16 15 11 12 17 16 12 13 18 17 13 14 19 18 15 16
      21 20 16 17 22 21 17 18 23 22 18 19 24 23]

and replacing with

    replace = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}

we obtain

    data = [ 0 1 6 5 1 2 7 6 2 3 8 7 3 0 5 8 5 6 11 10 6 7 12 11 7
      8 13 12 8 5 10 13 10 11 16 15 11 12 17 16 12 13 18 17 13 10 15 18 15 16
      1 0 16 17 2 1 17 18 3 2 18 15 0 3]
Another, more general way to achieve this is function vectorization:

    import numpy as np

    data = np.array([0, 1, 6, 5, 1, 2, 7, 6, 2, 3, 8, 7, 3, 4, 9, 8,
                     5, 6, 11, 10, 6, 7, 12, 11, 7, 8, 13, 12, 8, 9, 14, 13,
                     10, 11, 16, 15, 11, 12, 17, 16, 12, 13, 18, 17, 13, 14, 19, 18,
                     15, 16, 21, 20, 16, 17, 22, 21, 17, 18, 23, 22, 18, 19, 24, 23])
    mapper_dict = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}

    def mp(entry):
        # Return the mapped value if there is one, otherwise the entry itself
        return mapper_dict[entry] if entry in mapper_dict else entry
    mp = np.vectorize(mp)

    print(mp(data))
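Using a membership mask and numpy.searchsorted:

    import numpy

    # searchsorted needs ascending keys, so sort the dictionary items first
    keys, values = zip(*sorted(replace.items()))
    replace = numpy.array([keys, values])        # 2D replacement matrix
    mask = numpy.isin(data, replace[0, :])       # elements that need replacement (numpy.in1d in older releases)
    data[mask] = replace[1, numpy.searchsorted(replace[0, :], data[mask])]  # replace them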
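The numpy_indexed package (disclaimer: I am its author) provides an elegant and efficient vectorized solution to this type of problem:

    import numpy_indexed as npi
    # d is the replacement dictionary
    remapped_array = npi.remap(theArray, list(d.keys()), list(d.values()))

The method implemented is similar to the searchsorted-based approach mentioned by Jean Lescut, but more general. For instance, the items of the array do not need to be ints; they can be of any type, even nd-subarrays themselves. It should nevertheless achieve the same kind of performance.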
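To illustrate what "not restricted to ints" can look like using only plain NumPy, here is a minimal sketch built on numpy.unique with return_inverse. It is not the numpy_indexed implementation, and the name remap_any is hypothetical; it only assumes the array elements are sortable and hashable.

```python
import numpy as np

def remap_any(arr, mapping):
    """Replace values of `arr` found in `mapping`; leave the rest unchanged."""
    arr = np.asarray(arr)
    # uniq is sorted; inverse holds, for each element, its position in uniq.
    uniq, inverse = np.unique(arr, return_inverse=True)
    # The Python-level loop runs over the unique values only,
    # not over every element of the array.
    new_vals = np.array([mapping.get(u, u) for u in uniq.tolist()])
    return new_vals[inverse.reshape(arr.shape)]

words = np.array(["a", "b", "c", "a", "d", "b"])
out = remap_any(words, {"a": "x", "d": "y"})
```

The cost is dominated by the sort inside numpy.unique, so this is likely to pay off mainly when the number of distinct values is much smaller than the array.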
I benchmarked some of the solutions, and the result is unambiguous:

    import timeit
    import numpy as np

    array = 2 * np.round(np.random.uniform(0,10000,300000)).astype(int)
    from_values = np.unique(array) # even values from 0 to 20000
    to_values = np.arange(from_values.size) # all values from 0 to 10000
    d = dict(zip(from_values, to_values))

    def method_for_loop():
        out = array.copy()
        for from_value, to_value in zip(from_values, to_values) :
            out[out == from_value] = to_value
        print('Check method_for_loop :', np.all(out == array/2)) # Just checking
    print('Time method_for_loop :', timeit.timeit(method_for_loop, number = 1))

    def method_list_comprehension():
        out = [d[i] for i in array]
        print('Check method_list_comprehension :', np.all(out == array/2)) # Just checking
    print('Time method_list_comprehension :', timeit.timeit(method_list_comprehension, number = 1))

    def method_bruteforce():
        idx = np.nonzero(from_values == array[:,None])[1]
        out = to_values[idx]
        print('Check method_bruteforce :', np.all(out == array/2)) # Just checking
    print('Time method_bruteforce :', timeit.timeit(method_bruteforce, number = 1))

    def method_searchsort():
        sort_idx = np.argsort(from_values)
        idx = np.searchsorted(from_values,array,sorter = sort_idx)
        out = to_values[sort_idx][idx]
        print('Check method_searchsort :', np.all(out == array/2)) # Just checking
    print('Time method_searchsort :', timeit.timeit(method_searchsort, number = 1))
I got the following results (times in seconds):

    Check method_for_loop : True
    Time method_for_loop : 2.6411612760275602

    Check method_list_comprehension : True
    Time method_list_comprehension : 0.07994363596662879

    Check method_bruteforce : True
    Time method_bruteforce : 11.960559037979692

    Check method_searchsort : True
    Time method_searchsort : 0.03770717792212963

In this run, the "searchsort" method is about 70 times faster than the "for" loop and about 320 times faster than the numpy bruteforce method. The list comprehension method, roughly twice as slow as "searchsort", is also a very good trade-off between code simplicity and speed.
No solution had been posted yet that avoids a Python loop over the array (except Celil's, which however assumes the numbers are "small"), so here is an alternative:

    from numpy import array, digitize

    def replace(arr, rep_dict):
        """Assumes all elements of "arr" are keys of rep_dict"""

        # Removing the explicit "list" breaks python3
        rep_keys, rep_vals = array(list(zip(*sorted(rep_dict.items()))))

        idces = digitize(arr, rep_keys, right=True)
        # Notice rep_keys[digitize(arr, rep_keys, right=True)] == arr

        return rep_vals[idces]
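To see why the digitize call above recovers the right indices, here is a small worked example. The sample values are made up for illustration; the only assumptions are that the keys are sorted in ascending order and that every element of the array is one of the keys.

```python
import numpy as np

rep_keys = np.array([4, 9, 14, 19])   # must be sorted in ascending order
rep_vals = np.array([0, 5, 10, 15])
arr = np.array([9, 4, 19, 14, 9])

# With right=True, digitize returns i such that rep_keys[i-1] < x <= rep_keys[i],
# so a value equal to rep_keys[i] maps exactly to index i.
idces = np.digitize(arr, rep_keys, right=True)
replaced = rep_vals[idces]

print(idces)     # [1 0 3 2 1]
print(replaced)  # [ 5  0 15 10  5]
```

A value that is not among the keys breaks the assumption: it is silently mapped to the next larger key, or, if it exceeds every key, it produces an index equal to len(rep_keys) and the final lookup raises an IndexError.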
The way "idces" is created comes from here.
A pythonic way that does not require the data to be integers (it can even be strings):

    from scipy.stats import rankdata
    import numpy as np

    data = np.random.rand(100000)
    replace = {data[0]: 1, data[5]: 8, data[8]: 10}

    # Two-column (key, value) table, sorted by key so it lines up with "unique"
    arr = np.vstack((list(replace.keys()), list(replace.values()))).transpose()
    arr = arr[arr[:, 0].argsort()]

    unique = np.unique(data)
    mp = np.vstack((unique, unique)).transpose()
    mp[np.isin(mp[:, 0], arr[:, 0]), 1] = arr[:, 1]
    # Dense ranks are 1-based positions in "unique"; cast to int for indexing
    data = mp[rankdata(data, 'dense').astype(int) - 1][:, 1]
    for i in range(len(the_array)):
        the_array[i] = the_dict.get(the_array[i], the_array[i])
Well, you need to make one pass through theArray and replace each element that appears in the dictionary (called d here):

    for i in range(len(theArray)):
        if theArray[i] in d:
            theArray[i] = d[theArray[i]]