关于python：HDF5 – 并发，压缩和I / O性能

HDF5 - concurrency, compression & I/O performance

我有关于HDF5性能和并发性的以下问题：

HDF5是否支持并发写访问？

除了并发性考虑外，HDF5在I / O性能方面的表现如何(压缩率是否会影响性能)？

由于我在Python中使用HDF5，它的性能与Sqlite相比如何？

参考文献：

http://www.sqlite.org/faq.html#q5
可以在NFS文件系统上锁定sqlite文件吗？
http://pandas.pydata.org/

相关讨论

更新为使用pandas 0.13.1

1)编号http://pandas.pydata.org/pandas-docs/dev/io.html#notes-caveats。有多种方法可以做到这一点，例如：让你的不同线程/进程写出计算结果，然后将一个进程组合??起来。

2)根据您存储的数据类型，操作方式以及检索方式，HDF5可以提供更好的性能。作为单个数组存储在HDFStore中，浮动数据，压缩(换句话说，不以允许查询的格式存储)，将被快速存储/读取。即使以表格格式存储(这会降低写入性能)，也会提供相当好的写入性能。您可以查看一下这些详细的比较(这是HDFStore在引擎盖下使用的内容)。 http://www.pytables.org/，这是一张很好的照片： >


(并且由于PyTables 2.3现在将查询编入索引)，因此perf实际上比这更好 
所以要回答你的问题，如果你想要任何一种表现，HDF5就是你要走的路。


写作：

<div class=

1
2
3
4
5
6
7
8
9
10
11

In [14]: %timeit test_sql_write(df)
1 loops, best of 3: 6.24 s per loop

In [15]: %timeit test_hdf_fixed_write(df)
1 loops, best of 3: 237 ms per loop

In [16]: %timeit test_hdf_table_write(df)
1 loops, best of 3: 901 ms per loop

In [17]: %timeit test_csv_write(df)
1 loops, best of 3: 3.44 s per loop

读

1
2
3
4
5
6
7
8
9
10
11

In [18]: %timeit test_sql_read()
1 loops, best of 3: 766 ms per loop

In [19]: %timeit test_hdf_fixed_read()
10 loops, best of 3: 19.1 ms per loop

In [20]: %timeit test_hdf_table_read()
10 loops, best of 3: 39 ms per loop

In [22]: %timeit test_csv_read()
1 loops, best of 3: 620 ms per loop

这是代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41

import sqlite3
import os
from pandas.io import sql

In [3]: df = DataFrame(randn(1000000,2),columns=list('AB'))
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
A 1000000 non-null values
B 1000000 non-null values
dtypes: float64(2)

def test_sql_write(df):
if os.path.exists('test.sql'):
os.remove('test.sql')
sql_db = sqlite3.connect('test.sql')
sql.write_frame(df, name='test_table', con=sql_db)
sql_db.close()

def test_sql_read():
sql_db = sqlite3.connect('test.sql')
sql.read_frame("select * from test_table", sql_db)
sql_db.close()

def test_hdf_fixed_write(df):
df.to_hdf('test_fixed.hdf','test',mode='w')

def test_csv_read():
pd.read_csv('test.csv',index_col=0)

def test_csv_write(df):
df.to_csv('test.csv',mode='w')

def test_hdf_fixed_read():
pd.read_hdf('test_fixed.hdf','test')

def test_hdf_table_write(df):
df.to_hdf('test_table.hdf','test',format='table',mode='w')

def test_hdf_table_read():
pd.read_hdf('test_table.hdf','test')

当然是YMMV。