Apply different functions to different items in group object: Python pandas
Suppose I have a DataFrame like the following:
```
In [1]: test_dup_df
Out[1]:
                     exe_price  exe_vol flag
2008-03-13 14:41:07       84.5      200  yes
2008-03-13 14:41:37       85.0    10000  yes
2008-03-13 14:41:38       84.5    69700  yes
2008-03-13 14:41:39       84.5     1200  yes
2008-03-13 14:42:00       84.5     1000  yes
2008-03-13 14:42:08       84.5      300  yes
2008-03-13 14:42:10       84.5    88100  yes
2008-03-13 14:42:10       84.5    11900  yes
2008-03-13 14:42:15       84.5     5000  yes
2008-03-13 14:42:16       84.5     3200  yes
```
I want to group the rows that share a duplicated index with

```
In [2]: grouped = test_dup_df.groupby(level=0)
```

and then aggregate each column with my own function, for example a volume-weighted average of exe_price (weighted by exe_vol) and a sum of exe_vol. Is there a way to group the values in different columns and then apply a different (self-written) function to each of them?
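For reference, a minimal way to rebuild this example frame (assuming pandas is imported as pd) is:

```
import pandas as pd

# Rebuild the example frame; the index deliberately contains a
# duplicated timestamp at 14:42:10.
index = pd.to_datetime([
    '2008-03-13 14:41:07', '2008-03-13 14:41:37', '2008-03-13 14:41:38',
    '2008-03-13 14:41:39', '2008-03-13 14:42:00', '2008-03-13 14:42:08',
    '2008-03-13 14:42:10', '2008-03-13 14:42:10', '2008-03-13 14:42:15',
    '2008-03-13 14:42:16',
])
test_dup_df = pd.DataFrame({
    'exe_price': [84.5, 85.0, 84.5, 84.5, 84.5, 84.5, 84.5, 84.5, 84.5, 84.5],
    'exe_vol':   [200, 10000, 69700, 1200, 1000, 300, 88100, 11900, 5000, 3200],
    'flag':      ['yes'] * 10,
}, index=index)
```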
Apply your own function (note that in the output below the two trades at 14:42:10 collapse into a weighted price of 20.71, which matches the modified prices of 10 and 100 used in the benchmark data further down, not the two 84.5 prices shown in the question):
```
In [12]: def func(x):
    ...:     exe_price = (x['exe_price'] * x['exe_vol']).sum() / x['exe_vol'].sum()
    ...:     exe_vol = x['exe_vol'].sum()
    ...:     flag = True
    ...:     return Series([exe_price, exe_vol, flag],
    ...:                   index=['exe_price', 'exe_vol', 'flag'])

In [13]: test_dup_df.groupby(test_dup_df.index).apply(func)
Out[13]:
                     exe_price  exe_vol  flag
date_time
2008-03-13 14:41:07       84.5      200  True
2008-03-13 14:41:37         85    10000  True
2008-03-13 14:41:38       84.5    69700  True
2008-03-13 14:41:39       84.5     1200  True
2008-03-13 14:42:00       84.5     1000  True
2008-03-13 14:42:08       84.5      300  True
2008-03-13 14:42:10      20.71   100000  True
2008-03-13 14:42:15       84.5     5000  True
2008-03-13 14:42:16       84.5     3200  True
```
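If each column only needs a single, independent aggregation, a per-column mapping passed to agg is enough; a minimal sketch, assuming the test_dup_df frame from the question:

```
# One aggregation per column via a dict passed to agg().
grouped = test_dup_df.groupby(level=0)
simple = grouped.agg({'exe_price': 'mean', 'exe_vol': 'sum'})

# A volume-weighted price needs two columns at once, so it cannot be
# expressed as a single per-column aggregation; that is why func() above
# goes through apply() instead.
```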
I like @waitingkuo's answer since it is very clear and readable.

I'm keeping this one around anyway, since it does appear to be faster, at least with pandas version 0.10.0. That may (hopefully will) change in the future, so be sure to re-run the benchmark, especially if you are using a different version of pandas.
```
import pandas as pd
import io
import timeit

data = '''\
date time exe_price exe_vol flag
2008-03-13 14:41:07 84.5 200 yes
2008-03-13 14:41:37 85.0 10000 yes
2008-03-13 14:41:38 84.5 69700 yes
2008-03-13 14:41:39 84.5 1200 yes
2008-03-13 14:42:00 84.5 1000 yes
2008-03-13 14:42:08 84.5 300 yes
2008-03-13 14:42:10 10 88100 yes
2008-03-13 14:42:10 100 11900 yes
2008-03-13 14:42:15 84.5 5000 yes
2008-03-13 14:42:16 84.5 3200 yes'''

# Combine the first two columns into a single DatetimeIndex.
# (On Python 3, data is a str, so io.StringIO(data) would be needed here.)
df = pd.read_table(io.BytesIO(data), sep=r'\s+', parse_dates=[[0, 1]],
                   index_col=0)

def func(subf):
    exe_vol = subf['exe_vol'].sum()
    exe_price = (subf['exe_price'] * subf['exe_vol']).sum() / exe_vol
    flag = True
    return pd.Series([exe_price, exe_vol, flag],
                     index=['exe_price', 'exe_vol', 'flag'])
    # return exe_price

def using_apply():
    return df.groupby(df.index).apply(func)

def using_helper_column():
    # Sum a precomputed price*volume column, then divide by total volume.
    df['weight'] = df['exe_price'] * df['exe_vol']
    grouped = df.groupby(level=0, group_keys=True)
    result = grouped.agg({'weight': 'sum', 'exe_vol': 'sum'})
    result['exe_price'] = result['weight'] / result['exe_vol']
    result['flag'] = True
    result = result.drop(['weight'], axis=1)
    return result

result = using_apply()
print(result)
result = using_helper_column()
print(result)

time_apply = timeit.timeit('m.using_apply()',
                           'import __main__ as m', number=1000)
time_helper = timeit.timeit('m.using_helper_column()',
                            'import __main__ as m', number=1000)
print('using_apply: {t}'.format(t=time_apply))
print('using_helper_column: {t}'.format(t=time_helper))
```
yields
```
                     exe_vol  exe_price  flag
date_time
2008-03-13 14:41:07      200      84.50  True
2008-03-13 14:41:37    10000      85.00  True
2008-03-13 14:41:38    69700      84.50  True
2008-03-13 14:41:39     1200      84.50  True
2008-03-13 14:42:00     1000      84.50  True
2008-03-13 14:42:08      300      84.50  True
2008-03-13 14:42:10   100000      20.71  True
2008-03-13 14:42:15     5000      84.50  True
2008-03-13 14:42:16     3200      84.50  True
```
with the following timings:
```
using_apply: 3.0081038475
using_helper_column: 1.35300707817
```
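On newer pandas releases (0.25 and later) the same helper-column idea can also be written with named aggregation; a rough sketch, not benchmarked here:

```
def using_named_agg(df):
    # Precompute price*volume, aggregate both sums per index value,
    # then derive the volume-weighted price.
    tmp = df.assign(weight=df['exe_price'] * df['exe_vol'])
    out = tmp.groupby(level=0).agg(exe_vol=('exe_vol', 'sum'),
                                   weight=('weight', 'sum'))
    out['exe_price'] = out['weight'] / out['exe_vol']
    out['flag'] = True
    return out.drop(columns='weight')
```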
I'm not that familiar with pandas, but in plain numpy you could do something along these lines:
```
tot_vol = np.sum(grouped['exe_vol'])
avg_price = np.average(grouped['exe_price'], weights=grouped['exe_vol'])
```
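To feed np.average's weights argument one group at a time, it can be wrapped in groupby().apply(); a minimal sketch, again assuming the test_dup_df frame from the question:

```
import numpy as np
import pandas as pd

def vwap(g):
    # Volume-weighted average price plus total volume for one group.
    return pd.Series({'exe_price': np.average(g['exe_price'], weights=g['exe_vol']),
                      'exe_vol': g['exe_vol'].sum()})

result = test_dup_df.groupby(level=0).apply(vwap)
```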