Find the top n clients for a year, then bucket those clients' volume across each month of the year
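Good morning all,
I want to report the top n clients for the year and then show how each of those top n clients performed over the course of the year. Sample df: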
    import pandas as pd

    dfTest = [
        ('Client', ['A', 'A', 'A', 'A',
                    'B', 'B', 'B', 'B',
                    'C', 'C', 'C', 'C',
                    'D', 'D', 'D', 'D']),
        ('Year_Month', ['2018-08', '2018-09', '2018-10', '2018-11',
                        '2018-08', '2018-09', '2018-10', '2018-11',
                        '2018-08', '2018-09', '2018-10', '2018-11',
                        '2018-08', '2018-09', '2018-10', '2018-11']),
        ('Volume', [100, 200, 300, 400,
                    1, 2, 3, 4,
                    10, 20, 30, 40,
                    1000, 2000, 3000, 4000]),
        ('state', ['Done', 'Tied Done', 'Tied Done', 'Done',
                   'Passed', 'Done', 'Passed', 'Done',
                   'Rejected', 'Done', 'Passed', 'Done',
                   'Done', 'Done', 'Done', 'Done']),
    ]

    # DataFrame.from_items was removed in pandas 1.0; a dict built from the
    # (name, values) pairs keeps the same column order
    df = pd.DataFrame(dict(dfTest))
    print(df)

       Client Year_Month  Volume      state
    0       A    2018-08     100       Done
    1       A    2018-09     200  Tied Done
    2       A    2018-10     300  Tied Done
    3       A    2018-11     400       Done
    4       B    2018-08       1     Passed
    5       B    2018-09       2       Done
    6       B    2018-10       3     Passed
    7       B    2018-11       4       Done
    8       C    2018-08      10   Rejected
    9       C    2018-09      20       Done
    10      C    2018-10      30     Passed
    11      C    2018-11      40       Done
    12      D    2018-08    1000       Done
    13      D    2018-09    2000       Done
    14      D    2018-10    3000       Done
    15      D    2018-11    4000       Done
Now determine the top, say, two (n) clients by volume of completed ("Done" or "Tied Done") trades:
    d = [('Done_Volume', 'sum')]

    # first filter on the state column, then aggregate the filtered df
    mask = ((df['state'] == 'Done') | (df['state'] == 'Tied Done'))
    df_Client_Done_Volume = df[mask].groupby(['Client'])['Volume'].agg(d)
    print(df_Client_Done_Volume)

    Client
    A     1000
    B        6
    C       60
    D    10000

    print(df_Client_Done_Volume.nlargest(2, 'Done_Volume'))

            Done_Volume
    Client
    D             10000
    A              1000
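So clients A and D are my top two (n) performers.
I now want to feed this list or df back into the original data to retrieve their performance over the year, with Year_Month across the top as columns and Client as rows: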
    Client  2018-08  2018-09  2018-10  2018-11
    A           100      200      300      400
    D          1000     2000     3000     4000
You need the pandas.pivot_table method.
Here is my suggestion:
    def get_top_n_performer(df, n):
        # keep only completed trades, then total the volume per client
        df_done = df[df['state'].isin(['Done', 'Tied Done'])]
        aggs = {'Volume': ['sum']}
        data = df_done.groupby('Client').agg(aggs)
        data = data.reset_index()
        data.columns = ['Client', 'Volume_sum']
        data = data.sort_values(by='Volume_sum', ascending=False)
        return data.head(n)

    ls = list(get_top_n_performer(df, 2).Client.values)

    # pivot the rows of the top clients: one row per client, one column per month
    data = pd.pivot_table(df[df['Client'].isin(ls)],
                          values='Volume', index=['Client'], columns=['Year_Month'])
    data = data.reset_index()
    print(data)
Output:
    Year_Month Client  2018-08  2018-09  2018-10  2018-11
    0               A      100      200      300      400
    1               D     1000     2000     3000     4000
I hope this helps!
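A possible variation on the approach above, offered as a sketch rather than a definitive version: pivot_table averages by default, so if a client could have several rows in the same month, passing aggfunc='sum' would total the volume instead. This version also pivots only the completed trades, which is an assumption about what the report should contain. The helper name top_n_by_month and its states parameter are made up for illustration.

```python
import pandas as pd

# Sample data from the question, rebuilt compactly
df = pd.DataFrame({
    "Client": [c for c in "ABCD" for _ in range(4)],
    "Year_Month": ["2018-08", "2018-09", "2018-10", "2018-11"] * 4,
    "Volume": [100, 200, 300, 400, 1, 2, 3, 4,
               10, 20, 30, 40, 1000, 2000, 3000, 4000],
    "state": ["Done", "Tied Done", "Tied Done", "Done",
              "Passed", "Done", "Passed", "Done",
              "Rejected", "Done", "Passed", "Done",
              "Done", "Done", "Done", "Done"],
})


def top_n_by_month(df, n, states=("Done", "Tied Done")):
    """Hypothetical helper: monthly volume for the n clients with the
    largest total volume among rows whose state is in `states`."""
    done = df[df["state"].isin(states)]
    # rank clients by total completed volume and keep the n largest
    top_clients = done.groupby("Client")["Volume"].sum().nlargest(n).index
    return pd.pivot_table(
        done[done["Client"].isin(top_clients)],
        values="Volume",
        index="Client",
        columns="Year_Month",
        aggfunc="sum",   # total per client and month instead of the default mean
        fill_value=0,    # months with no completed trades show 0
    )


result = top_n_by_month(df, 2)
print(result)
```

The rows come back sorted by client name, not by rank; reindexing with the ranked client list would restore the ranking order if that matters.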
IIUC
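    s = df.loc[df.state.isin(['Done', 'Tied Done'])].drop(columns='state')
    # the remaining columns are Client, Year_Month, Volume;
    # pandas 2.0+ requires them to be passed to pivot as keywords
    s = s.pivot(index='Client', columns='Year_Month', values='Volume')
    s.loc[s.sum(axis=1).nlargest(2).index]

    Year_Month  2018-08  2018-09  2018-10  2018-11
    Client
    D            1000.0   2000.0   3000.0   4000.0
    A             100.0    200.0    300.0    400.0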
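The float values in the output above most likely come from the pivot step: clients B and C have no completed trade in some months, so the wide table contains NaN and the values are upcast to float. A minimal sketch, assuming a pandas version with the nullable "Int64" dtype, that keeps the ranking order and converts the selected rows back to integers (the variable names wide, top and top_int are illustrative only):

```python
import pandas as pd

# Sample data from the question, rebuilt compactly
df = pd.DataFrame({
    "Client": [c for c in "ABCD" for _ in range(4)],
    "Year_Month": ["2018-08", "2018-09", "2018-10", "2018-11"] * 4,
    "Volume": [100, 200, 300, 400, 1, 2, 3, 4,
               10, 20, 30, 40, 1000, 2000, 3000, 4000],
    "state": ["Done", "Tied Done", "Tied Done", "Done",
              "Passed", "Done", "Passed", "Done",
              "Rejected", "Done", "Passed", "Done",
              "Done", "Done", "Done", "Done"],
})

done = df[df["state"].isin(["Done", "Tied Done"])]

# One row per client, one column per month; missing combinations become NaN
wide = done.pivot(index="Client", columns="Year_Month", values="Volume")

# Select the two clients with the largest row totals, in ranking order
top = wide.loc[wide.sum(axis=1).nlargest(2).index]

# Nullable integers: whole numbers stay integers, a missing month would show <NA>
top_int = top.astype("Int64")
print(top_int)
```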