Performance issues with pandas and filtering on datetime column
I have a pandas DataFrame with a datetime64 column:
       time                        volume  complete  closeBid  closeAsk  openBid  openAsk  highBid  highAsk   lowBid   lowAsk  closeMid
    0  2016-08-07 21:00:00+00:00        9      True   0.84734   0.84842  0.84706  0.84814  0.84734  0.84842  0.84706  0.84814  0.84788
    1  2016-08-07 21:05:00+00:00       10      True   0.84735   0.84841  0.84752  0.84832  0.84752  0.84846  0.84712  0.8482   0.84788
    2  2016-08-07 21:10:00+00:00       10      True   0.84742   0.84817  0.84739  0.84828  0.84757  0.84831  0.84735  0.84817  0.847795
    3  2016-08-07 21:15:00+00:00       18      True   0.84732   0.84811  0.84737  0.84813  0.84737  0.84813  0.84721  0.8479   0.847715
    4  2016-08-07 21:20:00+00:00        4      True   0.84755   0.84822  0.84739  0.84812  0.84755  0.84822  0.84739  0.84812  0.847885
    5  2016-08-07 21:25:00+00:00        4      True   0.84769   0.84843  0.84758  0.84827  0.84769  0.84843  0.84758  0.84827  0.84806
    6  2016-08-07 21:30:00+00:00        5      True   0.84764   0.84851  0.84768  0.84852  0.8478   0.84857  0.84764  0.84851  0.848075
    7  2016-08-07 21:35:00+00:00        4      True   0.84755   0.84825  0.84762  0.84844  0.84765  0.84844  0.84755  0.84824  0.8479
    8  2016-08-07 21:40:00+00:00        1      True   0.84759   0.84812  0.84759  0.84812  0.84759  0.84812  0.84759  0.84812  0.847855
    9  2016-08-07 21:45:00+00:00        3      True   0.84727   0.84817  0.84743  0.8482   0.84743  0.84822  0.84727  0.84817  0.84772
My application follows this (simplified) structure:
    from datetime import timedelta

    class Runner():
        def execute_tick(self, clock_tick, previous_tick):
            candles = self.broker.get_new_candles(clock_tick, previous_tick)
            if not candles.empty:
                run_calculations(candles)

    class Broker():
        def get_new_candles(self, clock_tick, previous_tick):
            start = previous_tick - timedelta(minutes=1)
            end = clock_tick - timedelta(minutes=3)
            return df[(df.time > start) & (df.time <= end)]
While profiling the application, I noticed that the call to get_new_candles, and in particular the boolean filter df[(df.time > start) & (df.time <= end)], is where most of the time is spent. How can I make this selection faster?
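For reference, this is roughly how the filter can be timed in isolation (an illustrative snippet, not my actual profiler output; the tick values are just examples):

    import timeit
    from datetime import timedelta

    # Illustrative micro-benchmark of the filter on its own (uses df from above).
    previous_tick = df.time.iloc[2]
    clock_tick = df.time.iloc[-1]
    start = previous_tick - timedelta(minutes=1)
    end = clock_tick - timedelta(minutes=3)

    print(timeit.timeit(lambda: df[(df.time > start) & (df.time <= end)], number=1000))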
Edit: I've added some more information about the use case below (the source code is also available at https://github.com/jmelett/pyfxtrader):
- The application takes a list of instruments (e.g. EUR/USD, USD/JPY, GBP/CHF) and pre-fetches ticks/candles for each instrument and each of its timeframes (e.g. 5 minutes, 30 minutes, 1 hour). The initialized data is essentially a dict of instruments, where each instrument holds another dict with the candle data for the M5, M30 and H1 timeframes (see the sketch after this list).
- Each "timeframe" is a pandas DataFrame like the one shown at the top.
- A clock simulator then queries individual candles for one instrument (e.g. EUR/USD) at a specific time ("it is 15:30:00, give me the last x 5-minute candles").
- That data is then used to "simulate" specific market conditions (e.g. the average price over the last hour increased by 10%, so open a long position).
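For illustration, the pre-fetched structure could look roughly like this (the instrument and timeframe keys below are examples, not the exact names used in the project):

    import pandas as pd

    # Illustrative shape only: instrument -> timeframe -> DataFrame of candles
    # like the one shown at the top of the question.
    candles = {
        'EUR_USD': {tf: pd.DataFrame() for tf in ('M5', 'M30', 'H1')},
        'USD_JPY': {tf: pd.DataFrame() for tf in ('M5', 'M30', 'H1')},
        'GBP_CHF': {tf: pd.DataFrame() for tf in ('M5', 'M30', 'H1')},
    }

    # e.g. the 5-minute candles for EUR/USD:
    eur_usd_m5 = candles['EUR_USD']['M5']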
If your goal is efficiency, I'd do everything in numpy.
I rewrote your get_new_candles as get_new_candles2, operating on the underlying numpy arrays:
    from datetime import timedelta

    def get_new_candles2(clock_tick, previous_tick):
        start = previous_tick - timedelta(minutes=1)
        end = clock_tick - timedelta(minutes=3)
        ge_start = df.time.values >= start.to_datetime64()
        le_end = df.time.values <= end.to_datetime64()
        mask = ge_start & le_end
        return pd.DataFrame(df.values[mask], df.index[mask], df.columns)
Data setup:

    from io import StringIO
    import pandas as pd

    text = """time,volume,complete,closeBid,closeAsk,openBid,openAsk,highBid,highAsk,lowBid,lowAsk,closeMid
    2016-08-07 21:00:00+00:00,9,True,0.84734,0.84842,0.84706,0.84814,0.84734,0.84842,0.84706,0.84814,0.84788
    2016-08-07 21:05:00+00:00,10,True,0.84735,0.84841,0.84752,0.84832,0.84752,0.84846,0.84712,0.8482,0.84788
    2016-08-07 21:10:00+00:00,10,True,0.84742,0.84817,0.84739,0.84828,0.84757,0.84831,0.84735,0.84817,0.847795
    2016-08-07 21:15:00+00:00,18,True,0.84732,0.84811,0.84737,0.84813,0.84737,0.84813,0.84721,0.8479,0.847715
    2016-08-07 21:20:00+00:00,4,True,0.84755,0.84822,0.84739,0.84812,0.84755,0.84822,0.84739,0.84812,0.847885
    2016-08-07 21:25:00+00:00,4,True,0.84769,0.84843,0.84758,0.84827,0.84769,0.84843,0.84758,0.84827,0.84806
    2016-08-07 21:30:00+00:00,5,True,0.84764,0.84851,0.84768,0.84852,0.8478,0.84857,0.84764,0.84851,0.848075
    2016-08-07 21:35:00+00:00,4,True,0.84755,0.84825,0.84762,0.84844,0.84765,0.84844,0.84755,0.84824,0.8479
    2016-08-07 21:40:00+00:00,1,True,0.84759,0.84812,0.84759,0.84812,0.84759,0.84812,0.84759,0.84812,0.847855
    2016-08-07 21:45:00+00:00,3,True,0.84727,0.84817,0.84743,0.8482,0.84743,0.84822,0.84727,0.84817,0.84772
    """

    df = pd.read_csv(StringIO(text), parse_dates=[0])
Test input variables:

    previous_tick = pd.to_datetime('2016-08-07 21:10:00')
    clock_tick = pd.to_datetime('2016-08-07 21:45:00')

    get_new_candles2(clock_tick, previous_tick)
Timing
I think you are already handling this in a fairly efficient way.
When working with time series it is usually recommended to use the timestamps as the index, but (as the tests below show) that was not the fastest option here.
The data I tested with looks like this:
                                    Price  Volume
    time
    2016-02-10 11:16:15.951403000  6197.0   200.0
    2016-02-10 11:16:16.241380000  6197.0   100.0
    2016-02-10 11:16:16.521871000  6197.0   900.0
    2016-02-10 11:16:16.541253000  6197.0   100.0
    2016-02-10 11:16:16.592049000  6196.0   200.0
Setup:

    start = df.index[len(df) // 4]
    end = df.index[len(df) // 4 * 3]
Test 1:

    %%time
    _ = df[start:end]  # Same for df.ix[start:end]

    CPU times: user 413 ms, sys: 20 ms, total: 433 ms
    Wall time: 430 ms
On the other hand, using your approach:

    df = df.reset_index()
    df.columns = ['time', 'Price', 'Volume']
Test 2:

    %%time
    u = (df['time'] > start) & (df['time'] <= end)

    CPU times: user 21.2 ms, sys: 368 μs, total: 21.6 ms
    Wall time: 20.4 ms
Test 3:

    %%time
    _ = df[u]

    CPU times: user 10.4 ms, sys: 27.6 ms, total: 38.1 ms
    Wall time: 36.8 ms
Test 4:

    %%time
    _ = df[(df['time'] > start) & (df['time'] <= end)]

    CPU times: user 21.6 ms, sys: 24.3 ms, total: 45.9 ms
    Wall time: 44.5 ms
Note: each block above corresponds to one Jupyter notebook cell together with its output; the timings come from the %%time cell magic.
I'm not entirely sure why this is the case (I expected slicing on the DatetimeIndex to be the faster option), but with this data the boolean mask on a plain column is clearly faster.
I've found that these datetime objects can become quite memory-hungry and computationally expensive, especially when they are used as the index (a DatetimeIndex).
I think your best option is to convert df.time to Unix timestamps (i.e. plain integers, no longer a datetime dtype) and do a simple integer comparison.
A Unix timestamp looks like 1471554233 (the time of this post). More info: https://en.wikipedia.org/wiki/Unix_time
There are a few things to watch out for when doing this, e.g. remember time zones when converting a datetime to a Unix timestamp and back to a Python datetime.
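A minimal sketch of that idea, assuming the df from the question and nanosecond-resolution integers (the helper name get_new_candles_int and the extra time_ns column are mine, purely for illustration):

    import pandas as pd

    # One-time conversion: tz-aware datetimes -> int64 nanoseconds since the Unix epoch (UTC).
    df['time_ns'] = df['time'].values.astype('int64')

    def get_new_candles_int(df, clock_tick, previous_tick):
        # Convert the query bounds the same way, then compare plain integers.
        start_ns = (previous_tick - pd.Timedelta(minutes=1)).value
        end_ns = (clock_tick - pd.Timedelta(minutes=3)).value
        return df[(df['time_ns'] > start_ns) & (df['time_ns'] <= end_ns)]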
In my opinion it is worth sorting by time first and then selecting with .ix (or .loc in newer pandas):
    df.sort_values(by='time', inplace=True)
    df.ix[(df.time > start) & (df.time <= end), :]
pandas' query() can use numexpr as its engine to speed up the evaluation:
    df.query('time > @start & time <= @end')
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html
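Applied to the get_new_candles from the question, that might look roughly like this (assuming numexpr is installed; query() falls back to the slower Python engine otherwise):

    from datetime import timedelta

    def get_new_candles_query(df, clock_tick, previous_tick):
        start = previous_tick - timedelta(minutes=1)
        end = clock_tick - timedelta(minutes=3)
        # @start and @end refer to the local variables defined above.
        return df.query('time > @start & time <= @end')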
This is a guess, since I can't test it here, but two ideas come to mind.
Use a lookup Series to find the start and end positions of the slice of the dataframe to return:
    import numpy as np

    s = pd.Series(np.arange(len(df)), index=df.time)
    start = s.asof(start)   # position of the last row with time <= start
    end = s.asof(end)       # position of the last row with time <= end
    ret = df.iloc[start + 1 : end + 1]
Or set time as the index and slice with .loc:
    df = df.set_index('time')
    ret = df.loc[start:end]
You may need to fiddle with the endpoints a little in both cases so that the boundaries match your original (time > start) & (time <= end) filter exactly.
In either case, the expensive step (building the lookup Series or setting the index) only needs to be done once per dataframe.
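For instance, a rough sketch of caching that one-time step inside the Broker class from the question (the constructor signature and attribute names here are just illustrative):

    from datetime import timedelta

    class Broker():
        def __init__(self, df):
            # One-time step: index by time and sort, then reuse on every tick.
            self.candles = df.set_index('time').sort_index()

        def get_new_candles(self, clock_tick, previous_tick):
            start = previous_tick - timedelta(minutes=1)
            end = clock_tick - timedelta(minutes=3)
            # Note: .loc slicing is inclusive at both ends, unlike the original
            # (time > start) & (time <= end) filter.
            return self.candles.loc[start:end]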