Python: Get intersecting rows across two 2D numpy arrays

I want to get the intersecting (common) rows across two 2D numpy arrays. For example, if the following arrays are passed as inputs:

array([[1, 4],
       [2, 5],
       [3, 6]])

array([[1, 4],
       [3, 6],
       [7, 8]])

the output should be:

array([[1, 4],
       [3, 6]])

I know how to do this with loops. I'm looking for a Pythonic / NumPy way to do it.

For short arrays, using sets is probably the clearest and most readable way to do it.
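
The set-based approach is only mentioned here, not spelled out. A minimal sketch, assuming both arrays are 2D with the same number of columns, might look like the following. The helper name `intersect_rows_with_sets` is made up for illustration, and the result is sorted only to make the row order reproducible.

```python
import numpy as np

def intersect_rows_with_sets(A, B):
    """Return the rows common to A and B (hypothetical helper name)."""
    # Rows become hashable tuples so they can be stored in sets.
    common = set(map(tuple, A)) & set(map(tuple, B))
    # Sets are unordered, so sort for a reproducible result.
    rows = sorted(common)
    # reshape keeps the result 2D even when there are no common rows.
    return np.array(rows, dtype=A.dtype).reshape(-1, A.shape[1])

A = np.array([[1, 4], [2, 5], [3, 6]])
B = np.array([[1, 4], [3, 6], [7, 8]])
print(intersect_rows_with_sets(A, B))
```

Converting every row to a tuple happens in Python, which is why this reads well but is likely to slow down as the arrays grow.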

Another way is to use numpy.intersect1d. You'll have to trick it into treating the rows as single values, though... This makes things a bit less readable...

import numpy as np

A = np.array([[1,4],[2,5],[3,6]])
B = np.array([[1,4],[3,6],[7,8]])

nrows, ncols = A.shape
dtype={'names':['f{}'.format(i) for i in range(ncols)],
       'formats':ncols * [A.dtype]}

C = np.intersect1d(A.view(dtype), B.view(dtype))

# This last bit is optional if you're okay with "C" being a structured array...
C = C.view(A.dtype).reshape(-1, ncols)

For large arrays, this should be considerably faster than using sets.
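
The `view` trick relies on each row occupying one contiguous block of memory and on both arrays sharing a dtype. A hedged sketch of a wrapper that enforces both conditions follows; `intersect_rows` is a name chosen for illustration only, and promoting both inputs to a common dtype is an assumption of this sketch rather than part of the answer above.

```python
import numpy as np

def intersect_rows(A, B):
    """Row intersection via np.intersect1d on structured views (illustrative helper)."""
    if A.shape[1] != B.shape[1]:
        raise ValueError("A and B must have the same number of columns")
    ncols = A.shape[1]
    # Use one common dtype so both views have identical field layouts.
    common = np.result_type(A, B)
    # view() needs C-contiguous rows; slices or transposes may not be.
    A = np.ascontiguousarray(A, dtype=common)
    B = np.ascontiguousarray(B, dtype=common)
    row_dtype = {'names': ['f{}'.format(i) for i in range(ncols)],
                 'formats': ncols * [common]}
    C = np.intersect1d(A.view(row_dtype), B.view(row_dtype))
    return C.view(common).reshape(-1, ncols)

# A column slice of a wider array is not contiguous, yet the wrapper handles it.
wide = np.array([[1, 4, 0], [2, 5, 0], [3, 6, 0]])
B = np.array([[1, 4], [3, 6], [7, 8]])
print(intersect_rows(wide[:, :2], B))
```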

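I don't understand why no pure numpy way of doing this has been suggested, so I found one that uses numpy broadcasting. The basic idea is to turn one of the arrays into a 3D array by swapping axes. Let's construct 2 arrays: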

a=np.random.randint(10, size=(5, 3))
b=np.zeros_like(a)
b[:4,:]=a[np.random.randint(a.shape[0], size=4), :]

My run gave:

a=array([[5, 6, 3],
         [8, 1, 0],
         [2, 1, 4],
         [8, 0, 6],
         [6, 7, 6]])
b=array([[2, 1, 4],
         [2, 1, 4],
         [6, 7, 6],
         [5, 6, 3],
         [0, 0, 0]])

The steps are (the arrays can be interchanged):

# a is n x m and b is k x m
c = np.swapaxes(a[:,:,None],1,2)==b  # transform a to n x 1 x m
# c has n x k x m dimensions due to comparison broadcasting
# c[i,j,:] holds the element-wise comparison between a[i,:] and b[j,:]
# Reduce to n x k with a product: c[i,j] is 1 only if the whole rows match
c = np.prod(c,axis=2)
# To get around duplicates: //
# Calculate the cumulative sum along axis 0 (down the rows of a)
c = c*np.cumsum(c,axis=0)
# Compare with 1, so that only the first row of a matching each row of b stays True
c = c==1
# //
# Sum along axis 1 (over the rows of b), so that a length-n vector is produced
c = np.sum(c,axis=1).astype(bool)
# The intersection between a and b is a[c]
result = a[c]
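
To make the shapes in these steps concrete, the sketch below walks through them with the small arrays from the question. It assumes, as a simplification that is not part of the original answer, that `a[:, None, :]` can stand in for `np.swapaxes(a[:, :, None], 1, 2)`, and it uses `all`/`any` in place of the product and sum.

```python
import numpy as np

a = np.array([[1, 4], [2, 5], [3, 6]])   # n=3 rows, m=2 columns
b = np.array([[1, 4], [3, 6], [7, 8]])   # k=3 rows, m=2 columns

# Both expressions insert a length-1 axis between rows and columns.
a3 = np.swapaxes(a[:, :, None], 1, 2)
assert a3.shape == (3, 1, 2)
assert np.array_equal(a3, a[:, None, :])

# Broadcasting (n,1,m) against (k,m) gives an (n,k,m) boolean array.
eq = a3 == b
print(eq.shape)            # (3, 3, 2)

# match[i, j] is True when every element of a[i] equals b[j].
match = eq.all(axis=2)
print(match.astype(int))

# A row of a is in the intersection if it matches any row of b.
print(a[match.any(axis=1)])
```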

As a two-line function, to reduce memory usage (correct me if I'm wrong):

def array_row_intersection(a,b):
    tmp=np.prod(np.swapaxes(a[:,:,None],1,2)==b,axis=2)
    return a[np.sum(np.cumsum(tmp,axis=0)*tmp==1,axis=1).astype(bool)]

which, for my example, gave the result:

result=array([[5, 6, 3],
              [2, 1, 4],
              [6, 7, 6]])

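This is faster than the set solutions, because it uses only simple numpy operations while constantly reducing dimensions, and it is ideal for two big matrices. I may have made mistakes in my comments, as I arrived at the answer by experimentation and instinct. The equivalent for column intersection can be found either by transposing the arrays or by changing the steps a little. Also, if duplicates are wanted, the steps between the "//" markers have to be skipped. The function can be edited to return only the boolean array of indices, which came in handy for me when I needed to index different arrays with the same vector. Here is a benchmark of the voted answer against mine (the number of elements in each dimension plays a role in which one to choose):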
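
As a rough sketch of the two variations described here (returning only the boolean mask, and optionally skipping the duplicate-removal step), one might write something like the following. The function name `row_intersection_mask` and the `keep_duplicates` flag are invented for this illustration.

```python
import numpy as np

def row_intersection_mask(a, b, keep_duplicates=False):
    """Boolean mask over the rows of a that also occur in b (illustrative)."""
    # match[i, j] is True when a[i] equals b[j] element-wise.
    match = (a[:, None, :] == b).all(axis=2)
    if not keep_duplicates:
        # Keep only the first row of a that matches each row of b,
        # mirroring the cumulative-sum step between the "//" markers.
        match = match & (np.cumsum(match, axis=0) == 1)
    return match.any(axis=1)

a = np.array([[2, 1, 4], [5, 6, 3], [2, 1, 4]])
b = np.array([[2, 1, 4], [0, 0, 0]])
print(row_intersection_mask(a, b))                        # first match only
print(row_intersection_mask(a, b, keep_duplicates=True))  # every match
# The same mask can index any other array aligned with a's rows.
labels = np.array(['x', 'y', 'z'])
print(labels[row_intersection_mask(a, b)])
```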

Code:

import timeit
import numpy as np

def voted_answer(A,B):
    nrows, ncols = A.shape
    dtype={'names':['f{}'.format(i) for i in range(ncols)],
           'formats':ncols * [A.dtype]}
    C = np.intersect1d(A.view(dtype), B.view(dtype))
    return C.view(A.dtype).reshape(-1, ncols)

# Each b is built from randomly chosen rows of the matching a
a_small=np.random.randint(10, size=(10, 10))
b_small=a_small[np.random.randint(a_small.shape[0],size=[a_small.shape[0]]),:]
a_big_row=np.random.randint(10, size=(10, 1000))
b_big_row=a_big_row[np.random.randint(a_big_row.shape[0],size=[a_big_row.shape[0]]),:]
a_big_col=np.random.randint(10, size=(1000, 10))
b_big_col=a_big_col[np.random.randint(a_big_col.shape[0],size=[a_big_col.shape[0]]),:]
a_big_all=np.random.randint(10, size=(100,100))
b_big_all=a_big_all[np.random.randint(a_big_all.shape[0],size=[a_big_all.shape[0]]),:]

cases=[('Small arrays:', a_small, b_small),
       ('Big column arrays:', a_big_col, b_big_col),
       ('Big row arrays:', a_big_row, b_big_row),
       ('Big arrays:', a_big_all, b_big_all)]

# Report the mean time per call over 100 runs for each case
for title, a, b in cases:
    print(title)
    print('\t Voted answer:',timeit.timeit(lambda:voted_answer(a,b),number=100)/100)
    print('\t Proposed answer:',timeit.timeit(lambda:array_row_intersection(a,b),number=100)/100)

with results:

Small arrays:
Voted answer: 7.47108459473e-05
Proposed answer: 2.47001647949e-05
Big column arrays:
Voted answer: 0.00198730945587
Proposed answer: 0.0560171294212
Big row arrays:
Voted answer: 0.00500325918198
Proposed answer: 0.000308241844177
Big arrays:
Voted answer: 0.000864889621735
Proposed answer: 0.00257176160812

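The verdict: if you have to compare two big 2D arrays of 2D points (many rows, few columns), use the voted answer. If your matrices are big in all dimensions, the voted answer is also the best choice by all means. In the runs above, the proposed function was faster only for the small arrays and for the arrays with few rows and many columns. So it depends on what you have each time.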
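
One plausible reason for this pattern, offered as an assumption rather than something measured in the benchmark, is that the broadcasting approach materializes an n x k x m comparison array, so its cost grows with the product of both row counts. The helper `broadcast_elements` below is hypothetical and only counts elements for the benchmark shapes:

```python
def broadcast_elements(shape_a, shape_b):
    """Number of elements in the (n, k, m) comparison array (hypothetical helper)."""
    n, m = shape_a
    k, m_b = shape_b
    if m != m_b:
        raise ValueError("both arrays need the same number of columns")
    return n * k * m

# Shapes taken from the benchmark above.
for name, shape in [("small", (10, 10)),
                    ("big column", (1000, 10)),
                    ("big row", (10, 1000)),
                    ("big", (100, 100))]:
    print(name, broadcast_elements(shape, shape))
```

The two cases in which the voted answer won (1000 x 10 and 100 x 100) are also the two with the largest intermediate arrays under this count.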

Another way to achieve this using structured arrays:

>>> a = np.array([[3, 1, 2], [5, 8, 9], [7, 4, 3]])
>>> b = np.array([[2, 3, 0], [3, 1, 2], [7, 4, 3]])
>>> av = a.view([('', a.dtype)] * a.shape[1]).ravel()
>>> bv = b.view([('', b.dtype)] * b.shape[1]).ravel()
>>> np.intersect1d(av, bv).view(a.dtype).reshape(-1, a.shape[1])
array([[3, 1, 2],
       [7, 4, 3]])

Just for clarity, the structured view looks like this:

>>> a.view([('', a.dtype)] * a.shape[1])
array([[(3, 1, 2)],
       [(5, 8, 9)],
       [(7, 4, 3)]],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')])

np.array(list(set(map(tuple, a)).intersection(map(tuple, b))))

This should also work.

Without index
See https://gist.github.com/RashidLadj/971c7235ce796836853fcf55b4876f3c

import numpy as np

def intersect2D(Array_A, Array_B):
    """
    Find row intersection between 2D numpy arrays, a and b.
    """

    # ''' Using Tuple ''' #
    intersectionList = list(set([tuple(x) for x in Array_A for y in Array_B if(tuple(x) == tuple(y))]))
    print ("intersectionList = \n",intersectionList)

    # ''' Using Numpy function "array_equal" ''' #
    """ This method is valid for an ndarray """
    intersectionList = list(set([tuple(x) for x in Array_A for y in Array_B if(np.array_equal(x, y))]))
    print ("intersectionList = \n",intersectionList)

    # ''' Using set and bitwise and '''
    intersectionList = [list(y) for y in (set([tuple(x) for x in Array_A]) & set([tuple(x) for x in Array_B]))]
    print ("intersectionList = \n",intersectionList)

    return intersectionList

With index
See https://gist.github.com/RashidLadj/bac71f3d3380064de2f9abe0ae43c19e

import numpy as np

def intersect2D(Array_A, Array_B):
    """
    Find row intersection between 2D numpy arrays, a and b.
    Returns another numpy array with shared rows and index of items in A & B arrays
    """
    # [[IDX], [IDY], [value]] where Equal
    # dtype=object is needed because each tuple mixes two ints with a row array;
    # recent NumPy versions refuse to build such a ragged array implicitly.
    # ''' Using Tuple ''' #
    IndexEqual = np.asarray([(i, j, x) for i,x in enumerate(Array_A) for j, y in enumerate (Array_B) if(tuple(x) == tuple(y))], dtype=object).T

    # ''' Using Numpy array_equal ''' #
    IndexEqual = np.asarray([(i, j, x) for i,x in enumerate(Array_A) for j, y in enumerate (Array_B) if(np.array_equal(x, y))], dtype=object).T

    idx, idy, intersectionList = (IndexEqual[0], IndexEqual[1], IndexEqual[2]) if len(IndexEqual) != 0 else ([], [], [])

    return intersectionList, idx, idy
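
The nested comprehension above compares every pair of rows in Python. As a hedged alternative that is not taken from the gist, the same index pairs can be obtained with broadcasting and `np.nonzero`; `intersect2d_indices` is a name made up for this sketch.

```python
import numpy as np

def intersect2d_indices(array_a, array_b):
    """Return (rows, idx, idy) with array_a[idx[t]] == array_b[idy[t]] (illustrative)."""
    # match[i, j] is True when row i of array_a equals row j of array_b.
    match = (array_a[:, None, :] == array_b[None, :, :]).all(axis=2)
    # nonzero lists the (i, j) positions of all True entries, ordered by i then j.
    idx, idy = np.nonzero(match)
    return array_a[idx], idx, idy

A = np.array([[1, 4], [2, 5], [3, 6]])
B = np.array([[1, 4], [3, 6], [7, 8]])
rows, idx, idy = intersect2d_indices(A, B)
print(rows)
print(idx, idy)
```

Unlike the version above, this returns the shared rows as a regular 2D array, and a row that occurs several times in either input appears once per matching pair.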

import numpy as np

A = np.array([[1,4],[2,5],[3,6]])
B = np.array([[1,4],[3,6],[7,8]])

def matching_rows(A,B):
    # Indices of the rows of B that equal at least one row of A
    matches=[i for i in range(B.shape[0]) if np.any(np.all(A==B[i],axis=1))]
    if len(matches)==0:
        return B[matches]
    # Drop repeated rows from the result
    return np.unique(B[matches],axis=0)

>>> matching_rows(A,B)
array([[1, 4],
       [3, 6]])

Of course, this assumes that the rows are all the same length.

import numpy as np

A=np.array([[1, 4],
            [2, 5],
            [3, 6]])

B=np.array([[1, 4],
            [3, 6],
            [7, 8]])

# True for each row of A that equals at least one row of B
intersectingRows=[(B==irow).all(axis=1).any() for irow in A]
print(A[intersectingRows])