How to query MultiIndex index columns values in pandas

Code example:


# assumes: import numpy as np; import pandas as pd

In [171]: A = np.array([1.1, 1.1, 3.3, 3.3, 5.5, 6.6])

In [172]: B = np.array([111, 222, 222, 333, 333, 777])

In [173]: C = np.random.randint(10, 99, 6)

In [174]: df = pd.DataFrame(list(zip(A, B, C)), columns=['A', 'B', 'C'])

In [175]: df.set_index(['A', 'B'], inplace=True)

In [176]: df
Out[176]:
          C
A   B
1.1 111  20
    222  31
3.3 222  24
    333  65
5.5 333  22
6.6 777  74

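Now I want to retrieve A values:

Q1: in the range [3.3, 6.6]. Expected return value: [3.3, 5.5, 6.6] or [3.3, 3.3, 5.5, 6.6] if the last endpoint is inclusive, and [3.3, 5.5] or [3.3, 3.3, 5.5] if it is not.

Q2: in the range [2.0, 4.0]. Expected return value: [3.3] or [3.3, 3.3].

The same applies to any other MultiIndex dimension, for example B values:

Q3: in the range [111, 500], with repetitions (one value per data row in the range). Expected return value: [111, 222, 222, 333, 333].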

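More formally:

Assume T is a table with columns A, B and C. The table has n rows. The table cells are numbers, for example A is a double and B and C are integers. Create a DataFrame from table T and name it df. Set columns A and B as the index of df (without duplication, i.e. A and B do not also remain as separate data columns), so that A and B form a MultiIndex.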

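Questions:

1. How do I write a query on the index, for example to query index level A (or B) for labels in the interval [120.0, 540.0], when the labels 120.0 and 540.0 exist? To be clear, I am interested only in the list of index values as the response to the query.

2. How do I do the same when the labels 120.0 and 540.0 do not exist, but there are labels below 120, between 120 and 540, or above 540?

3. If the answer to questions 1 and 2 is the unique index values, how do I get the same values with repetitions, one per data row in the index range?

I know the answers to the questions above when the columns are not indexes. For indexes, after a long search on the web and experiments with pandas functionality, I did not succeed. The only method I see now (without additional programming) is to keep a duplicate of A and B as data columns in addition to the index.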

Related discussion

To query the df by the MultiIndex values, for example where (A > 1.7) and (B < 666):


In [536]: result_df = df.loc[(df.index.get_level_values('A') > 1.7) & (df.index.get_level_values('B') < 666)]

In [537]: result_df
Out[537]:
          C
A   B
3.3 222  43
    333  59
5.5 333  56

Hence, to get for example the A index values, if still required:

In [538]: result_df.index.get_level_values('A')
Out[538]: Index([3.3, 3.3, 5.5], dtype=object)
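The mask approach above can be wrapped in a small helper. This is only a sketch: the helper name `level_between` and the fixed C values are assumptions made for reproducibility, not part of the original answer.

```python
import pandas as pd

# Same A and B labels as in the question; the C values are fixed here
# (an assumption) so that the result is reproducible.
df = pd.DataFrame(
    {
        "A": [1.1, 1.1, 3.3, 3.3, 5.5, 6.6],
        "B": [111, 222, 222, 333, 333, 777],
        "C": [20, 31, 24, 65, 22, 74],
    }
).set_index(["A", "B"])


def level_between(frame, level, low, high, inclusive=True):
    """Hypothetical helper: boolean mask for rows whose index level is in the interval."""
    values = frame.index.get_level_values(level)
    if inclusive:
        return (values >= low) & (values <= high)
    return (values > low) & (values < high)


# Q2: A in [2.0, 4.0]; neither endpoint exists as a label.
mask = level_between(df, "A", 2.0, 4.0)
a_with_repeats = df.index[mask].get_level_values("A").tolist()
a_unique = df.index[mask].get_level_values("A").unique().tolist()

# Q3: B in [111, 500], one value per matching row.
b_with_repeats = (
    df.index[level_between(df, "B", 111, 500)].get_level_values("B").tolist()
)
```

Because the comparison is done on values rather than on labels, it does not matter whether the interval endpoints are present in the index.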

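The problem is that, in a large DataFrame, selection by index performs about 10% worse than regular row selection on columns. In repetitive work (looping), the delay accumulates. See example: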


In [558]: df = store.select(STORE_EXTENT_BURSTS_DF_KEY)

In [559]: len(df)
Out[559]: 12857

In [560]: df.sort_index(inplace=True)  # DataFrame.sort() no longer exists in current pandas

In [561]: df_without_index = df.reset_index()

In [562]: %timeit df.loc[(df.index.get_level_values('END_TIME') > 358200) & (df.index.get_level_values('START_TIME') < 361680)]
1000 loops, best of 3: 562 μs per loop

In [563]: %timeit df_without_index[(df_without_index.END_TIME > 358200) & (df_without_index.START_TIME < 361680)]
1000 loops, best of 3: 507 μs per loop
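The store used in that session is not available, so the comparison can only be approximated. The sketch below uses synthetic data (an assumption, as are the column contents and the helper names `by_index` and `by_columns`) with the same row count and the same two filters. The absolute timings will differ from the numbers quoted above.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the stored frame: 12857 rows, two time-like index levels.
rng = np.random.default_rng(0)
n = 12_857
start = np.sort(rng.integers(0, 720_000, n))
frame = (
    pd.DataFrame(
        {
            "START_TIME": start,
            "END_TIME": start + rng.integers(1, 3_600, n),
            "VALUE": rng.random(n),
        }
    )
    .set_index(["START_TIME", "END_TIME"])
    .sort_index()
)
flat = frame.reset_index()


def by_index():
    """Filter on the MultiIndex levels via get_level_values."""
    idx = frame.index
    return frame.loc[
        (idx.get_level_values("END_TIME") > 358200)
        & (idx.get_level_values("START_TIME") < 361680)
    ]


def by_columns():
    """Filter on ordinary columns after reset_index."""
    return flat[(flat.END_TIME > 358200) & (flat.START_TIME < 361680)]


for fn in (by_index, by_columns):
    seconds = timeit.timeit(fn, number=200) / 200
    print(f"{fn.__name__}: {seconds * 1e6:.0f} us per call")
```

Both functions select the same rows, so any difference is purely the cost of extracting the level values on each call.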

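For better readability, we can simply use the query() method and avoid the lengthy df.index.get_level_values() calls and the reset_index()/set_index() round trip.

Here is the target DataFrame: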


In [12]: df
Out[12]:
          C
A   B
1.1 111  68
    222  40
3.3 222  20
    333  11
5.5 333  80
6.6 777  51

Answer to Q1 (A in range [3.3, 6.6]):


In [13]: df.query('3.3 <= A <= 6.6') # for closed interval
Out[13]:
          C
A   B
3.3 222  20
    333  11
5.5 333  80
6.6 777  51

In [14]: df.query('3.3 < A < 6.6') # for open interval
Out[14]:
          C
A   B
5.5 333  80

And of course, one can play around with <, <=, >, >= for any sort of inclusion.

Similarly, answer to Q2 (A in range [2.0, 4.0]):
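Since the question asks only for the list of index values, the result of query() still needs one more step. A minimal sketch, assuming the same frame as above with the C values typed in by hand:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1.1, 1.1, 3.3, 3.3, 5.5, 6.6],
        "B": [111, 222, 222, 333, 333, 777],
        "C": [68, 40, 20, 11, 80, 51],
    }
).set_index(["A", "B"])

# Closed interval [3.3, 6.6]: both endpoints included.
closed = df.query("3.3 <= A <= 6.6")
# Half-open interval [3.3, 6.6): the last endpoint is excluded.
half_open = df.query("3.3 <= A < 6.6")

# Index values with repetitions (one per matching row) ...
closed_a = closed.index.get_level_values("A").tolist()
# ... and without repetitions.
half_open_a_unique = half_open.index.get_level_values("A").unique().tolist()
```

Here `closed_a` keeps one entry per row, while `.unique()` collapses the repeated 3.3 label.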


In [15]: df.query('2.0 <= A <= 4.0')
Out[15]:
          C
A   B
3.3 222  20
    333  11

Answer to Q3 (B in range [111, 500]):


In [16]: df.query('111 <= B <= 500')
Out[16]:
          C
A   B
1.1 111  68
    222  40
3.3 222  20
    333  11
5.5 333  80

Moreover, you can combine the queries on A and B:


In [17]: df.query('0 < A < 4 and 150 < B < 400')
Out[17]:
          C
A   B
1.1 222  40
3.3 222  20
    333  11
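When the bounds are not literals, query() can pick them up from local variables with the `@` prefix. The sketch below repeats the combined query that way. The variable names are illustrative and the C values are copied by hand.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1.1, 1.1, 3.3, 3.3, 5.5, 6.6],
        "B": [111, 222, 222, 333, 333, 777],
        "C": [68, 40, 20, 11, 80, 51],
    }
).set_index(["A", "B"])

# Bounds held in ordinary Python variables (illustrative names).
a_low, a_high = 0, 4
b_low, b_high = 150, 400

# "@name" refers to a Python variable; bare "A" and "B" refer to index levels.
subset = df.query("A > @a_low and A < @a_high and B > @b_low and B < @b_high")

# The matching index entries as (A, B) tuples.
pairs = subset.index.tolist()
```

This keeps the query string constant while the interval changes, which is convenient inside a loop.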

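With a float-like index you always want to use it as a column rather than as a direct indexing action. These will all work whether or not the endpoints exist.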


In [11]: df
Out[11]:
          C
A   B
1.1 111  81
    222  45
3.3 222  98
    333  13
5.5 333  89
6.6 777  98

In [12]: x = df.reset_index()



In [14]: x.loc[(x.A>=2.0)&(x.A<=4.0)]
Out[14]:
     A    B   C
2  3.3  222  98
3  3.3  333  13


In [15]: x.loc[(x.B>=111.0)&(x.B<=500.0)]
Out[15]:
     A    B   C
0  1.1  111  81
1  1.1  222  45
2  3.3  222  98
3  3.3  333  13
4  5.5  333  89

If you want the index back, just set it again. This is a cheap operation.


In [16]: x.loc[(x.B>=111.0)&(x.B<=500.0)].set_index(['A','B'])
Out[16]:
          C
A   B
1.1 111  81
    222  45
3.3 222  98
    333  13
5.5 333  89

If you really want the actual index values:


In [5]: x.loc[(x.B>=111.0)&(x.B<=500.0)].set_index(['A','B']).index
Out[5]:
MultiIndex
[(1.1, 111), (1.1, 222), (3.3, 222), (3.3, 333), (5.5, 333)]
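The reset_index() route can also return just the level values the question asks for, with or without repetitions. A minimal sketch: it uses `Series.between` (inclusive on both ends by default) instead of the explicit comparisons above, and the C values are typed in by hand.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1.1, 1.1, 3.3, 3.3, 5.5, 6.6],
        "B": [111, 222, 222, 333, 333, 777],
        "C": [81, 45, 98, 13, 89, 98],
    }
).set_index(["A", "B"])

# Turn the index levels into ordinary columns and filter on them.
x = df.reset_index()
selected = x[x["B"].between(111, 500)]

# Q3: B values with repetitions, one per matching row.
b_values = selected["B"].tolist()
# The same values without repetitions.
b_unique = selected["B"].unique().tolist()
# The full (A, B) index entries, if the MultiIndex itself is wanted.
index_values = selected.set_index(["A", "B"]).index.tolist()
```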