How to handle incoming real time data with python pandas

Which is the most recommended way to handle live incoming data with pandas?

Every few seconds I receive a data point in the format below:

{'time' :'2013-01-01 00:00:00', 'stock' : 'BLAH', 'high' : 4.0, 'low' : 3.0, 'open' : 2.0, 'close' : 1.0}

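I would like to append it to an existing DataFrame and then run some analysis on it.

The problem is that just appending rows with DataFrame.append can lead to performance issues, because every append copies the frame.

Things I've tried:

A few people suggested preallocating a big DataFrame and updating it as data comes in: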

In [1]: index = pd.date_range(start='2013-01-01 00:00:00', freq='s', periods=5)

In [2]: columns = ['high', 'low', 'open', 'close']

In [3]: df = pd.DataFrame(index=index, columns=columns)

In [4]: df
Out[4]:
                    high  low open close
2013-01-01 00:00:00  NaN  NaN  NaN   NaN
2013-01-01 00:00:01  NaN  NaN  NaN   NaN
2013-01-01 00:00:02  NaN  NaN  NaN   NaN
2013-01-01 00:00:03  NaN  NaN  NaN   NaN
2013-01-01 00:00:04  NaN  NaN  NaN   NaN

In [5]: data = {'time' :'2013-01-01 00:00:02', 'stock' : 'BLAH', 'high' : 4.0, 'low' : 3.0, 'open' : 2.0, 'close' : 1.0}

In [6]: data_ = pd.Series(data)

In [7]: df.loc[data['time']] = data_

In [8]: df
Out[8]:
                    high  low open close
2013-01-01 00:00:00  NaN  NaN  NaN   NaN
2013-01-01 00:00:01  NaN  NaN  NaN   NaN
2013-01-01 00:00:02    4    3    2     1
2013-01-01 00:00:03  NaN  NaN  NaN   NaN
2013-01-01 00:00:04  NaN  NaN  NaN   NaN
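
For illustration, here is a minimal sketch of the preallocation idea as reusable functions. The names make_frame and fill_row are hypothetical (they are not from the original post), and the sketch assumes every incoming timestamp already exists in the preallocated index; anything outside that range is rejected explicitly instead of silently enlarging the frame.

```python
import numpy as np
import pandas as pd

COLUMNS = ['high', 'low', 'open', 'close']

def make_frame(start, periods):
    """Preallocate a float frame with one all-NaN row per second."""
    index = pd.date_range(start=start, periods=periods, freq='s')
    return pd.DataFrame(np.nan, index=index, columns=COLUMNS)

def fill_row(df, data):
    """Write the numeric fields of one data point into its preallocated row."""
    ts = pd.Timestamp(data['time'])
    if ts not in df.index:
        # Assumption of this sketch: no growth beyond the preallocated range.
        raise KeyError(f"{ts} is outside the preallocated range")
    df.loc[ts, COLUMNS] = [data[c] for c in COLUMNS]

df = make_frame('2013-01-01 00:00:00', periods=5)
fill_row(df, {'time': '2013-01-01 00:00:02', 'stock': 'BLAH',
              'high': 4.0, 'low': 3.0, 'open': 2.0, 'close': 1.0})
```

Allocating the frame as float up front keeps the columns numeric, whereas the empty frame in the session above starts out with object columns.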

The other alternative is building a list of dicts: simply append the incoming data to a list and slice it into smaller DataFrames to do the work.

In [9]: ls = []

In [10]: for n in range(5):
   ....:     # Naive stuff ahead =)
   ....:     time = '2013-01-01 00:00:0' + str(n)
   ....:     d = {'time' : time, 'stock' : 'BLAH', 'high' : np.random.rand()*10, 'low' : np.random.rand()*10, 'open' : np.random.rand()*10, 'close' : np.random.rand()*10}
   ....:     ls.append(d)

In [11]: df = pd.DataFrame(ls[1:3]).set_index('time')

In [12]: df
Out[12]:
                        close      high       low      open  stock
time
2013-01-01 00:00:01  3.270078  1.008289  7.486118  2.180683   BLAH
2013-01-01 00:00:02  3.883586  2.215645  0.051799  2.310823   BLAH

or something like that, maybe processing the input a bit more.
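
A rough sketch of what "processing the input a bit more" could look like is given below. The helper names on_data and rows_to_frame are made up for this example, and it assumes at least one row has been buffered and that last, when given, is a positive integer. It parses the time strings into a DatetimeIndex and builds a frame from only the most recent rows.

```python
import pandas as pd

rows = []  # plain Python list that receives each incoming dict

def on_data(d):
    """Called for every incoming data point; a list append is cheap."""
    rows.append(d)

def rows_to_frame(rows, last=None):
    """Build a DataFrame from the buffered dicts, optionally only the last `last` rows."""
    chunk = rows if last is None else rows[-last:]
    df = pd.DataFrame(chunk)
    df['time'] = pd.to_datetime(df['time'])  # strings -> timestamps
    return df.set_index('time')

# Simulated input, deterministic values instead of random ones.
for n in range(5):
    on_data({'time': '2013-01-01 00:00:0' + str(n), 'stock': 'BLAH',
             'high': 4.0 + n, 'low': 3.0, 'open': 2.0, 'close': 1.0 + n})

recent = rows_to_frame(rows, last=3)   # analysis window: 3 newest rows
mean_close = recent['close'].mean()
```

The frame is rebuilt only when an analysis is needed, so the per-row cost stays that of a list append.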

Related discussion

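I would use HDF5/pytables as follows:

1. Keep the data as a Python list "as long as possible".
2. Append your results to that list.
3. When it gets "big":
   - push to the HDF5 store using pandas IO (and an appendable table);
   - clear the list.
4. Repeat.

In fact, the function I define uses a list for each "key", so that you can store multiple DataFrames to the HDF5 store in the same process.

We define a function, which you call with each row d: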

import pandas as pd

CACHE = {}
STORE = 'store.h5'   # Note: another option is to keep the actual file open

def process_row(d, key, max_len=5000, _cache=CACHE):
    """
    Append row d to the store 'key'.

    When the number of items in the key's cache reaches max_len,
    append the list of rows to the HDF5 store and clear the list.

    """
    # keep the rows for each key separate.
    lst = _cache.setdefault(key, [])
    if len(lst) >= max_len:
        store_and_clear(lst, key)
    lst.append(d)

def store_and_clear(lst, key):
    """
    Convert key's cache list to a DataFrame and append that to HDF5.
    """
    df = pd.DataFrame(lst)
    with pd.HDFStore(STORE) as store:
        store.append(key, df)
    lst.clear()

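Note: we use the with statement to automatically close the store after each write. Keeping it open may be faster, but if you do so it is recommended that you flush regularly (closing flushes). Also note that a collections.deque might have been more readable than a list, but the performance of a list will be slightly better here.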

To use this, you call it as:

process_row({'time' :'2013-01-01 00:00:00', 'stock' : 'BLAH', 'high' : 4.0, 'low' : 3.0, 'open' : 2.0, 'close' : 1.0}, key="df")

Note: "df" is the key used in the pytables store.

Once the job has finished, make sure that you store_and_clear the remaining cache:

for k, lst in CACHE.items():  # you can instead use .iteritems() in python 2
    store_and_clear(lst, k)

Now your complete DataFrame is available via:

with pd.HDFStore(STORE) as store:
    df = store["df"]  # other keys will be store[key]

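Some comments:

- 5000 can be adjusted; try some smaller/larger numbers to suit your needs.
- List append is O(1), DataFrame append is O(len(df)).
- Until you're doing stats or data munging you don't need pandas; use what's fastest.
- This code works with multiple keys (data points) coming in.
- This is very little code, and we're staying in vanilla Python lists and then pandas DataFrames...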
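
To make the O(1) versus O(len(df)) remark concrete, the sketch below builds the same frame in two ways and times both. Since DataFrame.append was removed in pandas 2.0, pd.concat stands in for the row-wise append here. The function names and the row count of 500 are arbitrary choices for this example, and the measured times depend on the machine, so no expected output is shown.

```python
import time
import pandas as pd

def make_row(n):
    return {'time': n, 'high': 4.0, 'low': 3.0, 'open': 2.0, 'close': 1.0}

def via_list(n_rows):
    """Buffer rows in a list and build the DataFrame once at the end."""
    lst = []
    for n in range(n_rows):
        lst.append(make_row(n))  # amortized O(1) per row
    return pd.DataFrame(lst)

def via_concat(n_rows):
    """Grow the DataFrame one row at a time; each step copies the whole frame."""
    df = pd.DataFrame([make_row(0)])
    for n in range(1, n_rows):
        df = pd.concat([df, pd.DataFrame([make_row(n)])], ignore_index=True)
    return df

def timed(fn, n_rows):
    start = time.perf_counter()
    result = fn(n_rows)
    return result, time.perf_counter() - start

df_list, t_list = timed(via_list, 500)
df_concat, t_concat = timed(via_concat, 500)
print(f"list then DataFrame: {t_list:.4f}s, row-wise concat: {t_concat:.4f}s")
```

Both functions return identical frames; only the cost of getting there differs, and the gap widens as the row count grows.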

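Additionally, to get up-to-date reads you could define a get method which stores and clears before reading. That way you always get the most recent data: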

def get_latest(key, _cache=CACHE):
    store_and_clear(_cache[key], key)
    with pd.HDFStore(STORE) as store:
        return store[key]

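Now when you access it with:

df = get_latest("df")

you'll get the latest "df" available.

Another option is slightly more involved: define a custom table in vanilla pytables, see the tutorial.

Note: you need to know the field names to create the column descriptor.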
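
As a rough idea of what such a column descriptor might look like, here is a minimal sketch using PyTables directly. The class name StockRow, the table name ticks, the string widths and the helper functions are all assumptions made for this example; they do not come from the original answer or from the tutorial.

```python
import os
import tempfile
import tables

class StockRow(tables.IsDescription):
    # Column descriptor: field names and types must be known up front.
    time = tables.StringCol(19)   # e.g. '2013-01-01 00:00:00'
    stock = tables.StringCol(8)   # assumed maximum ticker length
    high = tables.Float64Col()
    low = tables.Float64Col()
    open = tables.Float64Col()
    close = tables.Float64Col()

def write_rows(path, rows):
    """Append dict rows to the table /ticks, creating file and table if needed."""
    with tables.open_file(path, mode='a') as h5:
        if '/ticks' in h5:
            table = h5.get_node('/ticks')
        else:
            table = h5.create_table('/', 'ticks', StockRow)
        r = table.row
        for d in rows:
            for name, value in d.items():
                r[name] = value
            r.append()
        table.flush()  # push buffered rows to disk

def read_rows(path):
    """Return the whole table as a NumPy structured array."""
    with tables.open_file(path, mode='r') as h5:
        return h5.get_node('/ticks').read()

path = os.path.join(tempfile.mkdtemp(), 'ticks.h5')
write_rows(path, [
    {'time': '2013-01-01 00:00:00', 'stock': 'BLAH',
     'high': 4.0, 'low': 3.0, 'open': 2.0, 'close': 1.0},
    {'time': '2013-01-01 00:00:01', 'stock': 'BLAH',
     'high': 5.0, 'low': 3.5, 'open': 2.5, 'close': 1.5},
])
arr = read_rows(path)
```

String columns come back as bytes in the structured array, so they need decoding before being used as text.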