python: how to make the pandas HDFStore 'put' operation faster

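I'm trying to build an ETL toolkit with pandas and HDF5.

My plan was:

1. extract a table from MySQL into a DataFrame;

2. put this DataFrame into an HDFStore.

But when I did step 2, I found that putting the DataFrame into a *.h5 file took too much time.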

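Size of the table in the source MySQL server: 498 MB
- 52 columns
- 924,624 records
Size of the *.h5 file after putting the DataFrame into it: 513 MB
- the 'put' operation took 849.345677137 seconds

My question is: is this time cost normal? Is there any way to make it faster?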
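
As a rough sanity check on these numbers (a sketch that uses only the figures quoted above), the implied write throughput can be computed directly:

```python
# Figures quoted in the question.
h5_size_mb = 513.0            # size of the resulting *.h5 file, in MB
n_records = 924624            # rows in the table
put_seconds = 849.345677137   # reported duration of the put

mb_per_second = h5_size_mb / put_seconds
rows_per_second = n_records / put_seconds

print(f"{mb_per_second:.2f} MB/s, {rows_per_second:.0f} rows/s")
```

That is roughly 0.6 MB/s, which hints that the time is being spent on something other than raw disk I/O.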

Update 1

Thanks, Jeff.

My code is pretty simple:
extract_store = HDFStore('extract_store.h5')
extract_store['df_staff'] = df_staff
When I tried 'ptdump -av file.h5' I got an error, but I can still load the DataFrame object from this h5 file:

tables.exceptions.HDF5ExtError: HDF5 error back trace

File "../../../src/H5F.c", line 1512, in H5Fopen
unable to open file
File "../../../src/H5F.c", line 1307, in H5F_open
unable to read superblock
File "../../../src/H5Fsuper.c", line 305, in H5F_super_read
unable to find file signature
File "../../../src/H5Fsuper.c", line 153, in H5F_locate_signature
unable to find a valid file signature

End of HDF5 error back trace

Unable to open/create file 'extract_store.h5'
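
The last lines of the trace say that no valid file signature was found. The HDF5 file format defines an 8-byte signature that may sit at offset 0, 512, 1024, 2048 and so on. A minimal sketch of checking for it without PyTables (the helper name `has_hdf5_signature` and the 1 MiB search limit are made up for illustration):

```python
import os

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # 8-byte magic from the HDF5 file format


def has_hdf5_signature(path, max_offset=1 << 20):
    """Return True if the signature is found at offset 0, 512, 1024, 2048, ..."""
    size = os.path.getsize(path)
    limit = min(size, max_offset)
    with open(path, "rb") as handle:
        offset = 0
        while offset + len(HDF5_SIGNATURE) <= limit:
            handle.seek(offset)
            if handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE:
                return True
            # Candidate offsets are 0, then 512 doubling each time.
            offset = 512 if offset == 0 else offset * 2
    return False
```

Calling `has_hdf5_signature('extract_store.h5')` would then tell whether the file on disk starts like an HDF5 file at the moment it is inspected.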

Some other information:
- pandas version: '0.10.0'
- OS: Ubuntu Server 10.04 x86_64
- CPU: 8 * Intel(R) Xeon(R) CPU [email protected]
- Total memory: 51634016 kB

I will update pandas to 0.10.1-dev and try again.

Update 2

I updated pandas to '0.10.1.dev-6e2b6ea',
but the time cost did not decrease; this time it was 884.15 seconds.
The output of 'ptdump -av file.h5' is:

/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.0',
TITLE := '',
VERSION := '1.0']
/df_bugs (Group) ''
/df_bugs._v_attrs (AttributeSet), 12 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
axis0_variety := 'regular',
axis1_variety := 'regular',
block0_items_variety := 'regular',
block1_items_variety := 'regular',
block2_items_variety := 'regular',
nblocks := 3,
ndim := 2,
pandas_type := 'frame',
pandas_version := '0.10.1']
/df_bugs/axis0 (Array(52,)) ''
atom := StringAtom(itemsize=19, shape=(), dflt='')
maindim := 0
flavor := 'numpy'
byteorder := 'irrelevant'
chunkshape := None
/df_bugs/axis0._v_attrs (AttributeSet), 7 attributes:
[CLASS := 'ARRAY',
FLAVOR := 'numpy',
TITLE := '',
VERSION := '2.3',
kind := 'string',
name := None,
transposed := True]
/df_bugs/axis1 (Array(924624,)) ''
atom := Int64Atom(shape=(), dflt=0)
maindim := 0
flavor := 'numpy'
byteorder := 'little'
chunkshape := None
/df_bugs/axis1._v_attrs (AttributeSet), 7 attributes:
[CLASS := 'ARRAY',
FLAVOR := 'numpy',
TITLE := '',
VERSION := '2.3',
kind := 'integer',
name := None,
transposed := True]
/df_bugs/block0_items (Array(5,)) ''
atom := StringAtom(itemsize=12, shape=(), dflt='')
maindim := 0
flavor := 'numpy'
byteorder := 'irrelevant'
chunkshape := None
/df_bugs/block0_items._v_attrs (AttributeSet), 7 attributes:
[CLASS := 'ARRAY',
FLAVOR := 'numpy',
TITLE := '',
VERSION := '2.3',
kind := 'string',
name := None,
transposed := True]
/df_bugs/block0_values (Array(924624, 5)) ''
atom := Float64Atom(shape=(), dflt=0.0)
maindim := 0
flavor := 'numpy'
byteorder := 'little'
chunkshape := None
/df_bugs/block0_values._v_attrs (AttributeSet), 5 attributes:
[CLASS := 'ARRAY',
FLAVOR := 'numpy',
TITLE := '',
VERSION := '2.3',
transposed := True]
/df_bugs/block1_items (Array(19,)) ''
atom := StringAtom(itemsize=19, shape=(), dflt='')
maindim := 0
flavor := 'numpy'
byteorder := 'irrelevant'
chunkshape := None
/df_bugs/block1_items._v_attrs (AttributeSet), 7 attributes:
[CLASS := 'ARRAY',
FLAVOR := 'numpy',
TITLE := '',
VERSION := '2.3',
kind := 'string',
name := None,
transposed := True]
/df_bugs/block1_values (Array(924624, 19)) ''
atom := Int64Atom(shape=(), dflt=0)
maindim := 0
flavor := 'numpy'
byteorder := 'little'
chunkshape := None
/df_bugs/block1_values._v_attrs (AttributeSet), 5 attributes:
[CLASS := 'ARRAY',
FLAVOR := 'numpy',
TITLE := '',
VERSION := '2.3',
transposed := True]
/df_bugs/block2_items (Array(28,)) ''
atom := StringAtom(itemsize=18, shape=(), dflt='')
maindim := 0
flavor := 'numpy'
byteorder := 'irrelevant'
chunkshape := None
/df_bugs/block2_items._v_attrs (AttributeSet), 7 attributes:
[CLASS := 'ARRAY',
FLAVOR := 'numpy',
TITLE := '',
VERSION := '2.3',
kind := 'string',
name := None,
transposed := True]
/df_bugs/block2_values (VLArray(1,)) ''
atom = ObjectAtom()
byteorder = 'irrelevant'
nrows = 1
flavor = 'numpy'
/df_bugs/block2_values._v_attrs (AttributeSet), 5 attributes:
[CLASS := 'VLARRAY',
PSEUDOATOM := 'object',
TITLE := '',
VERSION := '1.3',
transposed := True]
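
In the dump above, block0_values and block1_values are plain numeric Arrays, while block2_values, which belongs to the 28 object columns, is a VLArray(1,) with an ObjectAtom. PyTables documents ObjectAtom as storing arbitrary Python objects through pickle, so the whole object block is presumably serialized as a single pickled row. A minimal sketch of that idea, using made-up sample values rather than the real data:

```python
import datetime
import pickle
from decimal import Decimal

# A tiny stand-in for an object block: rows of mixed Python objects.
object_block = [
    ["critical", "closed", None, Decimal("0.00")],
    ["normal", "new", datetime.date(2012, 5, 9), Decimal("1.50")],
]

# Everything goes through the generic pickler, object by object.
blob = pickle.dumps(object_block, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)

print(type(blob).__name__, len(blob), restored == object_block)
```

Pickling every cell of 28 columns by 924,624 rows is a plausible place for the time to go, though that is an inference from the dump rather than something measured here.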

I tried putting the DataFrame into the HDFStore with the argument 'table' set to True, but got an error; it seems that Python's datetime type is not supported:

Exception: cannot find the correct atom type -> [dtype->object] object
of type 'datetime.datetime' has no len()
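
The second half of that message is the ordinary Python TypeError raised by `len()`. A plausible reading (an assumption about the internals, not something confirmed here) is that the table writer treats object columns as strings and measures their length, which fails for any non-string object. The message itself is easy to reproduce:

```python
import datetime

value = datetime.datetime(1998, 5, 6, 21, 27)

try:
    len(value)  # strings have a length; datetime objects do not
except TypeError as error:
    message = str(error)

print(message)
```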

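Update 3

Thanks, Jeff. Sorry for the delay.

tables.version: '2.4.0'.
Yes, the 884 seconds is the cost of the put operation alone; it does not include pulling the data from MySQL.
One row of the DataFrame (df.ix[0]):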

bug_id 1
assigned_to 185
bug_file_loc None
bug_severity critical
bug_status closed
creation_ts 1998-05-06 21:27:00
delta_ts 2012-05-09 14:41:41
short_desc Two cursors.
host_op_sys Unknown
guest_op_sys Unknown
priority P3
rep_platform IA32
reporter 56
product_id 7
category_id 983
component_id 12925
resolution fixed
target_milestone ws1
qa_contact 412
status_whiteboard
votes 0
keywords SR
lastdiffed 2012-05-09 14:41:41
everconfirmed 1
reporter_accessible 1
cclist_accessible 1
estimated_time 0.00
remaining_time 0.00
deadline None
alias None
found_in_product_id 0
found_in_version_id 0
found_in_phase_id 0
cf_type Defect
cf_reported_by Development
cf_attempted NaN
cf_failed NaN
cf_public_summary
cf_doc_impact 0
cf_security 0
cf_build NaN
cf_branch
cf_change NaN
cf_test_id NaN
cf_regression Unknown
cf_reviewer 0
cf_on_hold 0
cf_public_severity ---
cf_i18n_impact 0
cf_eta None
cf_bug_source ---
cf_viss None
Name: 0, Length: 52

Summary of the DataFrame (the output of simply typing 'df' in an IPython notebook):

Int64Index: 924624 entries, 0 to 924623
Data columns:
bug_id 924624 non-null values
assigned_to 924624 non-null values
bug_file_loc 427318 non-null values
bug_severity 924624 non-null values
bug_status 924624 non-null values
creation_ts 924624 non-null values
delta_ts 924624 non-null values
short_desc 924624 non-null values
host_op_sys 924624 non-null values
guest_op_sys 924624 non-null values
priority 924624 non-null values
rep_platform 924624 non-null values
reporter 924624 non-null values
product_id 924624 non-null values
category_id 924624 non-null values
component_id 924624 non-null values
resolution 924624 non-null values
target_milestone 924624 non-null values
qa_contact 924624 non-null values
status_whiteboard 924624 non-null values
votes 924624 non-null values
keywords 924624 non-null values
lastdiffed 924509 non-null values
everconfirmed 924624 non-null values
reporter_accessible 924624 non-null values
cclist_accessible 924624 non-null values
estimated_time 924624 non-null values
remaining_time 924624 non-null values
deadline 0 non-null values
alias 0 non-null values
found_in_product_id 924624 non-null values
found_in_version_id 924624 non-null values
found_in_phase_id 924624 non-null values
cf_type 924624 non-null values
cf_reported_by 924624 non-null values
cf_attempted 89622 non-null values
cf_failed 89587 non-null values
cf_public_summary 510799 non-null values
cf_doc_impact 924624 non-null values
cf_security 924624 non-null values
cf_build 327460 non-null values
cf_branch 614929 non-null values
cf_change 300612 non-null values
cf_test_id 12610 non-null values
cf_regression 924624 non-null values
cf_reviewer 924624 non-null values
cf_on_hold 924624 non-null values
cf_public_severity 924624 non-null values
cf_i18n_impact 924624 non-null values
cf_eta 3910 non-null values
cf_bug_source 924624 non-null values
cf_viss 725 non-null values
dtypes: float64(5), int64(19), object(28)
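
The summary reports 28 object columns but not what Python types they hold. One way to find out, sketched here against a current pandas release with a made-up three-column sample (the helper name `object_column_types` is hypothetical):

```python
import datetime
from decimal import Decimal

import pandas as pd


def object_column_types(df):
    """Map each object-dtype column to the sorted Python type names it holds."""
    report = {}
    for name in df.columns:
        col = df[name]
        if col.dtype != object:
            continue  # numeric and datetime64 columns are not of interest here
        report[name] = sorted({type(v).__name__ for v in col if v is not None})
    return report


# Made-up sample; dtype=object keeps the raw Python objects in place.
sample = pd.DataFrame({
    "bug_id": [1, 2],
    "estimated_time": pd.Series([Decimal("0.00"), Decimal("1.50")], dtype=object),
    "deadline": pd.Series([datetime.date(2012, 5, 9), None], dtype=object),
})

types_found = object_column_types(sample)
print(types_found)
```

Columns that report anything other than `str` are the likely candidates for explicit conversion before the put.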

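After 'convert_objects()':

dtypes: datetime64[ns](2), float64(5), int64(19), object(26)

Putting the converted DataFrame into the HDFStore took 749.50 s :)
- It seems that reducing the number of 'object' dtype columns is the key to reducing the time cost.
Putting the converted DataFrame into the HDFStore with the argument 'table' set to True still raises that error: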

/usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
2203 raise
2204 except (Exception), detail:
-> 2205 raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
2206 j += 1
2207
Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.datetime' has no len()

I'm now trying to put the DataFrame without the datetime columns.

Update 4

There are 4 columns in MySQL whose type is datetime:
- creation_ts
- delta_ts
- lastdiffed
- deadline

After calling convert_objects():

creation_ts:

Timestamp: 1998-05-06 21:27:00

delta_ts:

Timestamp: 2012-05-09 14:41:41

lastdiffed:

datetime.datetime(2012, 5, 9, 14, 41, 41)

deadline is always None, both before and after calling 'convert_objects':

None
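
The lastdiffed column stays as datetime.datetime objects, possibly because it contains missing values (924509 of 924624 rows are non-null). A minimal sketch of converting such a column explicitly with `pd.to_datetime`, written against a current pandas release and not checked on 0.10.x; the sample values are made up:

```python
import datetime

import pandas as pd

# Made-up sample mimicking lastdiffed: datetime objects plus a missing value.
lastdiffed = pd.Series(
    [datetime.datetime(2012, 5, 9, 14, 41, 41), None],
    dtype=object,
)

# errors="coerce" turns anything unparseable into NaT instead of raising.
converted = pd.to_datetime(lastdiffed, errors="coerce")

print(lastdiffed.dtype, "->", converted.dtype)
print(converted.isna().tolist())
```

The missing entry becomes NaT and the column gets a datetime64 dtype, so it no longer counts as an object column.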

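Putting the DataFrame without the column 'lastdiffed' took 691.75 seconds.
When I put the DataFrame without the column 'lastdiffed' and set the argument 'table' to True, I got a new error: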

/usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
2203 raise
2204 except (Exception), detail:
-> 2205 raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
2206 j += 1
2207

Exception: cannot find the correct atom type -> [dtype->object] object of type 'Decimal' has no len()

The columns 'estimated_time', 'remaining_time' and 'cf_viss' have the type 'decimal' in MySQL.

Update 5

I converted these 'decimal' columns to 'float' with the code below:

no_diffed_converted_df_bugs.estimated_time = no_diffed_converted_df_bugs.estimated_time.map(float)
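
One caveat with a bare `map(float)`: `float(None)` raises a TypeError, and cf_viss is None in the sample row shown earlier. A small sketch of a None-tolerant variant (the helper name `to_float` and the sample values are made up for illustration):

```python
from decimal import Decimal


def to_float(value):
    """Convert a Decimal (or any number) to float; map None to NaN."""
    if value is None:
        return float("nan")
    return float(value)


decimal_values = [Decimal("0.00"), Decimal("1.50"), None]  # made-up sample
as_floats = [to_float(v) for v in decimal_values]
print(as_floats)
```

With a DataFrame, the same helper could be passed to `Series.map` for each of estimated_time, remaining_time and cf_viss.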

Now the time cost is 372.84 seconds,
but the 'table' version of put still raises an error:

/usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
2203 raise
2204 except (Exception), detail:
-> 2205 raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
2206 j += 1
2207

Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.date' has no len()
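
This last error points at one or more columns holding datetime.date objects (the question does not say which ones). A minimal sketch, not tested against pandas 0.10.1, of promoting such values to full datetimes so that they can later be converted like the other timestamp columns; the helper name `date_to_datetime` and the sample values are made up:

```python
import datetime


def date_to_datetime(value):
    """Promote a datetime.date to a datetime at midnight; leave other values alone."""
    # datetime.datetime is a subclass of datetime.date, so check for it first.
    if isinstance(value, datetime.datetime):
        return value
    if isinstance(value, datetime.date):
        return datetime.datetime.combine(value, datetime.time.min)
    return value


sample = [datetime.date(2012, 5, 9), None]  # made-up values
promoted = [date_to_datetime(v) for v in sample]
print(promoted)
```

Missing values pass through unchanged, so they still need to be handled when the column is converted to a datetime64 dtype.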