Get intersecting rows across two 2D numpy arrays
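I want to get the intersecting (common) rows of two 2D numpy arrays. For example, if the following arrays are passed as input:

```python
array([[1, 4],
       [2, 5],
       [3, 6]])

array([[1, 4],
       [3, 6],
       [7, 8]])
```

The output should be:

```python
array([[1, 4],
       [3, 6]])
```

I know how to do this with loops. I'm looking for a Pythonic / NumPy way to do it.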
For short arrays, using sets is probably the clearest and most readable way to do it.

Another way is to use `np.intersect1d`, viewing each array as a structured array so that every row is compared as a single value:

```python
import numpy as np

A = np.array([[1,4],[2,5],[3,6]])
B = np.array([[1,4],[3,6],[7,8]])

nrows, ncols = A.shape
dtype={'names':['f{}'.format(i) for i in range(ncols)],
       'formats':ncols * [A.dtype]}

C = np.intersect1d(A.view(dtype), B.view(dtype))

# This last bit is optional if you're okay with "C" being a structured array...
C = C.view(A.dtype).reshape(-1, ncols)
```

For large arrays, this should be much faster than using sets.
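If the structured-view approach is reused often, it may help to wrap it in a function. The sketch below is one possible version, not part of the original answer: the name `intersect_rows` is made up for illustration, and it assumes both inputs are 2D with the same number of columns. It first copies the inputs to a common dtype and to contiguous memory, since the `view` trick relies on both.

```python
import numpy as np

def intersect_rows(a, b):
    """Return the sorted, unique rows present in both 2D arrays (hypothetical helper)."""
    # Assumption: a and b are 2D and have the same number of columns.
    common = np.result_type(a, b)
    a = np.ascontiguousarray(a, dtype=common)
    b = np.ascontiguousarray(b, dtype=common)
    ncols = a.shape[1]
    # One structured element per row, so intersect1d compares whole rows.
    row_dtype = [('f{}'.format(i), common) for i in range(ncols)]
    av = a.view(row_dtype).ravel()
    bv = b.view(row_dtype).ravel()
    shared = np.intersect1d(av, bv)
    # Convert the structured result back to a plain 2D array.
    return shared.view(common).reshape(-1, ncols)

A = np.array([[1, 4], [2, 5], [3, 6]])
B = np.array([[1, 4], [3, 6], [7, 8]])
print(intersect_rows(A, B))
```

Because `np.intersect1d` sorts and deduplicates its result, the rows come back in sorted order rather than in the order they appear in `A`.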
You could use Python's sets:

```python
>>> import numpy as np
>>> A = np.array([[1,4],[2,5],[3,6]])
>>> B = np.array([[1,4],[3,6],[7,8]])
>>> aset = set([tuple(x) for x in A])
>>> bset = set([tuple(x) for x in B])
>>> np.array([x for x in aset & bset])
array([[1, 4],
       [3, 6]])
```

As Rob Cowie points out, this can be done more concisely as

```python
np.array([x for x in set(tuple(x) for x in A) & set(tuple(x) for x in B)])
```

There's probably a way to do this without all the going back and forth between arrays and tuples, but it's not coming to me right now.
I could not understand why no pure numpy way of doing this had been suggested, so I found one that uses numpy broadcasting. The basic idea is to turn one of the arrays into a 3D array by swapping axes. Let's construct 2 arrays:

```python
a=np.random.randint(10, size=(5, 3))
b=np.zeros_like(a)
b[:4,:]=a[np.random.randint(a.shape[0], size=4), :]
```

My run gave:

```python
a=array([[5, 6, 3],
       [8, 1, 0],
       [2, 1, 4],
       [8, 0, 6],
       [6, 7, 6]])
b=array([[2, 1, 4],
       [2, 1, 4],
       [6, 7, 6],
       [5, 6, 3],
       [0, 0, 0]])
```
The steps are (the arrays can be interchanged):

```python
# a is n x m and b is k x m
c = np.swapaxes(a[:,:,None],1,2)==b  # transform a to n x 1 x m
# c has n x k x m dimensions due to comparison broadcast
# c[j,i,:] holds the elementwise comparison between a[j,:] and b[i,:]
# Decrease dimension to n x k with product:
c = np.prod(c,axis=2)
# To get around duplicates://
# Calculate cumulative sum along axis 0 (down the rows of a)
c = c*np.cumsum(c,axis=0)
# compare with 1, so that only the first row of a matching each row of b stays True
c = c==1
#//
# sum along axis 1 (over the rows of b), so that a length-n vector is produced
c = np.sum(c,axis=1).astype(bool)
# The intersection between a and b is a[c]
result = a[c]
```
As a two-line function, to reduce memory use (correct me if I'm wrong):

```python
def array_row_intersection(a,b):
    tmp=np.prod(np.swapaxes(a[:,:,None],1,2)==b,axis=2)
    return a[np.sum(np.cumsum(tmp,axis=0)*tmp==1,axis=1).astype(bool)]
```

which gave this result for my example:

```python
result=array([[5, 6, 3],
       [2, 1, 4],
       [6, 7, 6]])
```
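This is faster than the set solutions, as it uses only simple numpy operations while constantly reducing dimensions, and it is well suited to two big matrices. I may have made mistakes in my comments, as I arrived at the answer by experimentation and instinct. The equivalent for column intersection can be found either by transposing the arrays or by changing the steps a little. Also, if duplicates are wanted, the steps between the "//" markers must be skipped. The function can be edited to return only the boolean array of indices, which came in handy for me when I needed to index several different arrays with the same vector.

Benchmark of the voted answer against mine (the number of elements in each dimension plays a role in which one to choose):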
Code:

```python
import timeit
import numpy as np

# array_row_intersection is the two-line function defined above

def voted_answer(A,B):
    nrows, ncols = A.shape
    dtype={'names':['f{}'.format(i) for i in range(ncols)],
           'formats':ncols * [A.dtype]}
    C = np.intersect1d(A.view(dtype), B.view(dtype))
    return C.view(A.dtype).reshape(-1, ncols)

a_small=np.random.randint(10, size=(10, 10))
b_small=np.zeros_like(a_small)
b_small=a_small[np.random.randint(a_small.shape[0],size=[a_small.shape[0]]),:]
a_big_row=np.random.randint(10, size=(10, 1000))
b_big_row=a_big_row[np.random.randint(a_big_row.shape[0],size=[a_big_row.shape[0]]),:]
a_big_col=np.random.randint(10, size=(1000, 10))
b_big_col=a_big_col[np.random.randint(a_big_col.shape[0],size=[a_big_col.shape[0]]),:]
a_big_all=np.random.randint(10, size=(100,100))
b_big_all=a_big_all[np.random.randint(a_big_all.shape[0],size=[a_big_all.shape[0]]),:]

print('Small arrays:')
print('\t Voted answer:',timeit.timeit(lambda:voted_answer(a_small,b_small),number=100)/100)
print('\t Proposed answer:',timeit.timeit(lambda:array_row_intersection(a_small,b_small),number=100)/100)
print('Big column arrays:')
print('\t Voted answer:',timeit.timeit(lambda:voted_answer(a_big_col,b_big_col),number=100)/100)
print('\t Proposed answer:',timeit.timeit(lambda:array_row_intersection(a_big_col,b_big_col),number=100)/100)
print('Big row arrays:')
print('\t Voted answer:',timeit.timeit(lambda:voted_answer(a_big_row,b_big_row),number=100)/100)
print('\t Proposed answer:',timeit.timeit(lambda:array_row_intersection(a_big_row,b_big_row),number=100)/100)
print('Big arrays:')
print('\t Voted answer:',timeit.timeit(lambda:voted_answer(a_big_all,b_big_all),number=100)/100)
print('\t Proposed answer:',timeit.timeit(lambda:array_row_intersection(a_big_all,b_big_all),number=100)/100)
```
Results (seconds per call):

```
Small arrays:
     Voted answer: 7.47108459473e-05
     Proposed answer: 2.47001647949e-05
Big column arrays:
     Voted answer: 0.00198730945587
     Proposed answer: 0.0560171294212
Big row arrays:
     Voted answer: 0.00500325918198
     Proposed answer: 0.000308241844177
Big arrays:
     Voted answer: 0.000864889621735
     Proposed answer: 0.00257176160812
```
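The verdict: if you have to compare two big 2D arrays of 2D points (many rows, few columns), use the voted answer. If you have big matrices in all dimensions, the voted answer is also the best choice by all means. In the timings above, the proposed answer was faster only for the small arrays and for the arrays with few rows and many columns. So it depends on what you are working with each time.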
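As mentioned earlier in this answer, the function can be changed to return only the boolean index array. A minimal sketch of that variant follows. The name `row_intersection_mask` is hypothetical, and the code assumes `a` and `b` are 2D with the same number of columns; it uses `all`/`any` in place of `prod`/`sum`, which should give the same mask.

```python
import numpy as np

def row_intersection_mask(a, b):
    """Boolean mask over the rows of `a` that also occur in `b` (hypothetical helper)."""
    # (n, 1, m) == (k, m) broadcasts to (n, k, m); all() over the last axis
    # marks the (i, j) pairs where a[i] equals b[j].
    equal = (a[:, None, :] == b).all(axis=2)
    # Keep only the first row of `a` that matches each row of `b`,
    # so a row duplicated in `a` is selected once.
    first_match = equal & (np.cumsum(equal, axis=0) == 1)
    # A row of `a` is kept if it is the first match for at least one row of `b`.
    return first_match.any(axis=1)

a = np.array([[5, 6, 3], [8, 1, 0], [2, 1, 4], [8, 0, 6], [6, 7, 6]])
b = np.array([[2, 1, 4], [2, 1, 4], [6, 7, 6], [5, 6, 3], [0, 0, 0]])
mask = row_intersection_mask(a, b)
print(a[mask])
```

The same mask can then index any other array that has one entry per row of `a`.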
Another way to achieve this using structured arrays:

```python
>>> a = np.array([[3, 1, 2], [5, 8, 9], [7, 4, 3]])
>>> b = np.array([[2, 3, 0], [3, 1, 2], [7, 4, 3]])
>>> av = a.view([('', a.dtype)] * a.shape[1]).ravel()
>>> bv = b.view([('', b.dtype)] * b.shape[1]).ravel()
>>> np.intersect1d(av, bv).view(a.dtype).reshape(-1, a.shape[1])
array([[3, 1, 2],
       [7, 4, 3]])
```

Just for clarity, the structured view looks like this:

```python
>>> a.view([('', a.dtype)] * a.shape[1])
array([[(3, 1, 2)],
       [(5, 8, 9)],
       [(7, 4, 3)]],
       dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')])
```
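```python
# Common rows: intersect the sets of row tuples, then convert back to an array
np.array(list(set(map(tuple, a)).intersection(map(tuple, b))))
```

This could also work.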
Without index

See https://gist.github.com/RashidLadj/971c7235ce796836853fcf55b4876f3c

```python
import numpy as np

def intersect2D(Array_A, Array_B):
    """
    Find row intersection between 2D numpy arrays, a and b.
    """

    # ''' Using Tuple ''' #
    intersectionList = list(set([tuple(x) for x in Array_A for y in Array_B if(tuple(x) == tuple(y))]))
    print ("intersectionList = \n",intersectionList)

    # ''' Using Numpy function "array_equal" ''' #
    # This method is valid for an ndarray
    intersectionList = list(set([tuple(x) for x in Array_A for y in Array_B if(np.array_equal(x, y))]))
    print ("intersectionList = \n",intersectionList)

    # ''' Using set and bitwise and '''
    intersectionList = [list(y) for y in (set([tuple(x) for x in Array_A]) & set([tuple(x) for x in Array_B]))]
    print ("intersectionList = \n",intersectionList)

    return intersectionList
```
With index

See https://gist.github.com/RashidLadj/bac71f3d3380064de2f9abe0ae43c19e

```python
import numpy as np

def intersect2D(Array_A, Array_B):
    """
    Find row intersection between 2D numpy arrays, a and b.
    Returns another numpy array with shared rows and index of items in A & B arrays
    """
    # [[IDX], [IDY], [value]] where Equal
    # dtype=object is needed because each entry mixes two ints with a row array
    # ''' Using Tuple ''' #
    IndexEqual = np.asarray([(i, j, x) for i,x in enumerate(Array_A) for j, y in enumerate (Array_B) if(tuple(x) == tuple(y))], dtype=object).T

    # ''' Using Numpy array_equal ''' #
    IndexEqual = np.asarray([(i, j, x) for i,x in enumerate(Array_A) for j, y in enumerate (Array_B) if(np.array_equal(x, y))], dtype=object).T

    idx, idy, intersectionList = (IndexEqual[0], IndexEqual[1], IndexEqual[2]) if len(IndexEqual) != 0 else ([], [], [])

    return intersectionList, idx, idy
```
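For large inputs, the nested comprehension above compares every pair of rows in Python. A possibly faster alternative, sketched here and not taken from the gist, combines the structured-view trick from the earlier answers with the `return_indices=True` option of `np.intersect1d` (available since NumPy 1.15). The function name is hypothetical, and the sketch assumes both arrays are 2D with the same dtype and column count. Unlike the version above, it reports only the first occurrence of each common row.

```python
import numpy as np

def intersect_rows_with_indices(a, b):
    """Common rows plus their first positions in a and b (hypothetical helper)."""
    # Assumption: a and b are 2D, with the same dtype and number of columns.
    if a.dtype != b.dtype or a.shape[1] != b.shape[1]:
        raise ValueError("a and b must share dtype and column count")
    ncols = a.shape[1]
    # View each row as a single structured element so rows compare as units.
    row_dtype = [('f{}'.format(i), a.dtype) for i in range(ncols)]
    av = np.ascontiguousarray(a).view(row_dtype).ravel()
    bv = np.ascontiguousarray(b).view(row_dtype).ravel()
    # idx_a / idx_b hold the index of the first occurrence of each common row.
    _, idx_a, idx_b = np.intersect1d(av, bv, return_indices=True)
    return a[idx_a], idx_a, idx_b

A = np.array([[1, 4], [2, 5], [3, 6]])
B = np.array([[1, 4], [3, 6], [7, 8]])
rows, idx_a, idx_b = intersect_rows_with_indices(A, B)
print(rows, idx_a, idx_b)
```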
```python
import numpy as np

A = np.array([[1,4],[2,5],[3,6]])
B = np.array([[1,4],[3,6],[7,8]])

def matching_rows(A,B):
    # indices of the rows of B that equal some row of A
    matches=[i for i in range(B.shape[0]) if np.any(np.all(A==B[i],axis=1))]
    if len(matches)==0:
        return B[matches]
    return np.unique(B[matches],axis=0)
```

```python
>>> matching_rows(A,B)
array([[1, 4],
       [3, 6]])
```

This, of course, assumes the rows are all the same length.
```python
import numpy as np

A=np.array([[1, 4],
       [2, 5],
       [3, 6]])

B=np.array([[1, 4],
       [3, 6],
       [7, 8]])

# True for each row of A that equals some row of B
intersectingRows=[(B==irow).all(axis=1).any() for irow in A]
print(A[intersectingRows])
```