Apply different functions to different items in group object: Python pandas
Suppose I have a DataFrame like the following:
```
In [1]: test_dup_df
Out[1]:
                     exe_price  exe_vol flag
2008-03-13 14:41:07       84.5      200  yes
2008-03-13 14:41:37       85.0    10000  yes
2008-03-13 14:41:38       84.5    69700  yes
2008-03-13 14:41:39       84.5     1200  yes
2008-03-13 14:42:00       84.5     1000  yes
2008-03-13 14:42:08       84.5      300  yes
2008-03-13 14:42:10       84.5    88100  yes
2008-03-13 14:42:10       84.5    11900  yes
2008-03-13 14:42:15       84.5     5000  yes
2008-03-13 14:42:16       84.5     3200  yes
```
I want to group the rows that share a duplicated index with

```
In [2]: grouped = test_dup_df.groupby(level=0)
```

and then aggregate each column with my own function, for example a volume-weighted average of exe_price (weighted by exe_vol) and a sum of exe_vol. Is there a way to group the values in different columns and then apply a different (self-written) function to each of them?
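For reference, a minimal way to rebuild this example frame (assuming pandas is imported as pd) is:

```
import pandas as pd

# Rebuild the example frame; the index deliberately contains a
# duplicated timestamp at 14:42:10.
index = pd.to_datetime([
    '2008-03-13 14:41:07', '2008-03-13 14:41:37', '2008-03-13 14:41:38',
    '2008-03-13 14:41:39', '2008-03-13 14:42:00', '2008-03-13 14:42:08',
    '2008-03-13 14:42:10', '2008-03-13 14:42:10', '2008-03-13 14:42:15',
    '2008-03-13 14:42:16',
])
test_dup_df = pd.DataFrame({
    'exe_price': [84.5, 85.0, 84.5, 84.5, 84.5, 84.5, 84.5, 84.5, 84.5, 84.5],
    'exe_vol':   [200, 10000, 69700, 1200, 1000, 300, 88100, 11900, 5000, 3200],
    'flag':      ['yes'] * 10,
}, index=index)
```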
Apply your own function (note that in the output below the two trades at 14:42:10 collapse into a weighted price of 20.71, which matches the modified prices of 10 and 100 used in the benchmark data further down, not the two 84.5 prices shown in the question):
```
In [12]: def func(x):
    ...:     exe_price = (x['exe_price'] * x['exe_vol']).sum() / x['exe_vol'].sum()
    ...:     exe_vol = x['exe_vol'].sum()
    ...:     flag = True
    ...:     return Series([exe_price, exe_vol, flag],
    ...:                   index=['exe_price', 'exe_vol', 'flag'])

In [13]: test_dup_df.groupby(test_dup_df.index).apply(func)
Out[13]:
                     exe_price  exe_vol  flag
date_time
2008-03-13 14:41:07       84.5      200  True
2008-03-13 14:41:37         85    10000  True
2008-03-13 14:41:38       84.5    69700  True
2008-03-13 14:41:39       84.5     1200  True
2008-03-13 14:42:00       84.5     1000  True
2008-03-13 14:42:08       84.5      300  True
2008-03-13 14:42:10      20.71   100000  True
2008-03-13 14:42:15       84.5     5000  True
2008-03-13 14:42:16       84.5     3200  True
```
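If each column only needs a single, independent aggregation, a per-column mapping passed to agg is enough; a minimal sketch, assuming the test_dup_df frame from the question:

```
# One aggregation per column via a dict passed to agg().
grouped = test_dup_df.groupby(level=0)
simple = grouped.agg({'exe_price': 'mean', 'exe_vol': 'sum'})

# A volume-weighted price needs two columns at once, so it cannot be
# expressed as a single per-column aggregation; that is why func() above
# goes through apply() instead.
```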
I like @waitingkuo's answer since it is very clear and readable.

I'm keeping this one around anyway, since it does appear to be faster, at least with pandas version 0.10.0. That may (hopefully will) change in the future, so be sure to re-run the benchmark, especially if you are using a different version of pandas.
```
import pandas as pd
import io
import timeit

data = '''\
date time exe_price exe_vol flag
2008-03-13 14:41:07 84.5 200 yes
2008-03-13 14:41:37 85.0 10000 yes
2008-03-13 14:41:38 84.5 69700 yes
2008-03-13 14:41:39 84.5 1200 yes
2008-03-13 14:42:00 84.5 1000 yes
2008-03-13 14:42:08 84.5 300 yes
2008-03-13 14:42:10 10 88100 yes
2008-03-13 14:42:10 100 11900 yes
2008-03-13 14:42:15 84.5 5000 yes
2008-03-13 14:42:16 84.5 3200 yes'''

# Combine the first two columns into a single DatetimeIndex.
# (On Python 3, data is a str, so io.StringIO(data) would be needed here.)
df = pd.read_table(io.BytesIO(data), sep=r'\s+', parse_dates=[[0, 1]],
                   index_col=0)

def func(subf):
    exe_vol = subf['exe_vol'].sum()
    exe_price = (subf['exe_price'] * subf['exe_vol']).sum() / exe_vol
    flag = True
    return pd.Series([exe_price, exe_vol, flag],
                     index=['exe_price', 'exe_vol', 'flag'])
    # return exe_price

def using_apply():
    return df.groupby(df.index).apply(func)

def using_helper_column():
    # Sum a precomputed price*volume column, then divide by total volume.
    df['weight'] = df['exe_price'] * df['exe_vol']
    grouped = df.groupby(level=0, group_keys=True)
    result = grouped.agg({'weight': 'sum', 'exe_vol': 'sum'})
    result['exe_price'] = result['weight'] / result['exe_vol']
    result['flag'] = True
    result = result.drop(['weight'], axis=1)
    return result

result = using_apply()
print(result)
result = using_helper_column()
print(result)

time_apply = timeit.timeit('m.using_apply()',
                           'import __main__ as m', number=1000)
time_helper = timeit.timeit('m.using_helper_column()',
                            'import __main__ as m', number=1000)
print('using_apply: {t}'.format(t=time_apply))
print('using_helper_column: {t}'.format(t=time_helper))
```
yields
```
                     exe_vol  exe_price  flag
date_time
2008-03-13 14:41:07      200      84.50  True
2008-03-13 14:41:37    10000      85.00  True
2008-03-13 14:41:38    69700      84.50  True
2008-03-13 14:41:39     1200      84.50  True
2008-03-13 14:42:00     1000      84.50  True
2008-03-13 14:42:08      300      84.50  True
2008-03-13 14:42:10   100000      20.71  True
2008-03-13 14:42:15     5000      84.50  True
2008-03-13 14:42:16     3200      84.50  True
```
with the following timings:
```
using_apply: 3.0081038475
using_helper_column: 1.35300707817
```
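On newer pandas releases (0.25 and later) the same helper-column idea can also be written with named aggregation; a rough sketch, not benchmarked here:

```
def using_named_agg(df):
    # Precompute price*volume, aggregate both sums per index value,
    # then derive the volume-weighted price.
    tmp = df.assign(weight=df['exe_price'] * df['exe_vol'])
    out = tmp.groupby(level=0).agg(exe_vol=('exe_vol', 'sum'),
                                   weight=('weight', 'sum'))
    out['exe_price'] = out['weight'] / out['exe_vol']
    out['flag'] = True
    return out.drop(columns='weight')
```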
I'm not that familiar with pandas, but in plain numpy you could do something along these lines:
```
tot_vol = np.sum(grouped['exe_vol'])
avg_price = np.average(grouped['exe_price'], weights=grouped['exe_vol'])
```
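To feed np.average's weights argument one group at a time, it can be wrapped in groupby().apply(); a minimal sketch, again assuming the test_dup_df frame from the question:

```
import numpy as np
import pandas as pd

def vwap(g):
    # Volume-weighted average price plus total volume for one group.
    return pd.Series({'exe_price': np.average(g['exe_price'], weights=g['exe_vol']),
                      'exe_vol': g['exe_vol'].sum()})

result = test_dup_df.groupby(level=0).apply(vwap)
```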