how to make pandas HDFStore 'put' operation faster
I'm trying to build an ETL toolkit with pandas and HDF5.
My plan is to extract tables from MySQL into pandas DataFrames (step 1) and then put those DataFrames into *.h5 files (step 2).
But when I execute step 2, I find that putting the DataFrame into the *.h5 file takes far too much time.
- Size of the table in the source MySQL server: 498 MB
- 52 columns
- 924,624 records
- Size of the *.h5 file after putting the DataFrame: 513 MB
- The put operation took 849.345677137 seconds
My questions are: is this time cost normal? Is there any way to make it faster?
Update 1

Thanks Jeff. My code is very simple:
    extract_store = HDFStore('extract_store.h5')
    extract_store['df_staff'] = df_staff
- When I try "ptdump -av file.h5" I get an error, but I can still load the DataFrame object from this h5 file:
    tables.exceptions.HDF5ExtError: HDF5 error back trace

      File "../../../src/H5F.c", line 1512, in H5Fopen
        unable to open file
      File "../../../src/H5F.c", line 1307, in H5F_open
        unable to read superblock
      File "../../../src/H5Fsuper.c", line 305, in H5F_super_read
        unable to find file signature
      File "../../../src/H5Fsuper.c", line 153, in H5F_locate_signature
        unable to find a valid file signature

    End of HDF5 error back trace

    Unable to open/create file 'extract_store.h5'
- Some other information:
  - pandas version: '0.10.0'
  - OS: Ubuntu Server 10.04 x86_64
  - CPU: 8 * Intel(R) Xeon(R) CPU [email protected]
  - total memory: 51634016 kB
I will update pandas to 0.10.1-dev and try again.
Update 2

- I updated pandas to '0.10.1.dev-6e2b6ea'
- But the time cost did not go down; this time it was 884.15 s
- The output of "ptdump -av file.h5" is:
    / (RootGroup) ''
      /._v_attrs (AttributeSet), 4 attributes:
       [CLASS := 'GROUP',
        PYTABLES_FORMAT_VERSION := '2.0',
        TITLE := '',
        VERSION := '1.0']
    /df_bugs (Group) ''
      /df_bugs._v_attrs (AttributeSet), 12 attributes:
       [CLASS := 'GROUP',
        TITLE := '',
        VERSION := '1.0',
        axis0_variety := 'regular',
        axis1_variety := 'regular',
        block0_items_variety := 'regular',
        block1_items_variety := 'regular',
        block2_items_variety := 'regular',
        nblocks := 3,
        ndim := 2,
        pandas_type := 'frame',
        pandas_version := '0.10.1']
    /df_bugs/axis0 (Array(52,)) ''
      atom := StringAtom(itemsize=19, shape=(), dflt='')
      maindim := 0
      flavor := 'numpy'
      byteorder := 'irrelevant'
      chunkshape := None
      /df_bugs/axis0._v_attrs (AttributeSet), 7 attributes:
       [CLASS := 'ARRAY',
        FLAVOR := 'numpy',
        TITLE := '',
        VERSION := '2.3',
        kind := 'string',
        name := None,
        transposed := True]
    /df_bugs/axis1 (Array(924624,)) ''
      atom := Int64Atom(shape=(), dflt=0)
      maindim := 0
      flavor := 'numpy'
      byteorder := 'little'
      chunkshape := None
      /df_bugs/axis1._v_attrs (AttributeSet), 7 attributes:
       [CLASS := 'ARRAY',
        FLAVOR := 'numpy',
        TITLE := '',
        VERSION := '2.3',
        kind := 'integer',
        name := None,
        transposed := True]
    /df_bugs/block0_items (Array(5,)) ''
      atom := StringAtom(itemsize=12, shape=(), dflt='')
      maindim := 0
      flavor := 'numpy'
      byteorder := 'irrelevant'
      chunkshape := None
      /df_bugs/block0_items._v_attrs (AttributeSet), 7 attributes:
       [CLASS := 'ARRAY',
        FLAVOR := 'numpy',
        TITLE := '',
        VERSION := '2.3',
        kind := 'string',
        name := None,
        transposed := True]
    /df_bugs/block0_values (Array(924624, 5)) ''
      atom := Float64Atom(shape=(), dflt=0.0)
      maindim := 0
      flavor := 'numpy'
      byteorder := 'little'
      chunkshape := None
      /df_bugs/block0_values._v_attrs (AttributeSet), 5 attributes:
       [CLASS := 'ARRAY',
        FLAVOR := 'numpy',
        TITLE := '',
        VERSION := '2.3',
        transposed := True]
    /df_bugs/block1_items (Array(19,)) ''
      atom := StringAtom(itemsize=19, shape=(), dflt='')
      maindim := 0
      flavor := 'numpy'
      byteorder := 'irrelevant'
      chunkshape := None
      /df_bugs/block1_items._v_attrs (AttributeSet), 7 attributes:
       [CLASS := 'ARRAY',
        FLAVOR := 'numpy',
        TITLE := '',
        VERSION := '2.3',
        kind := 'string',
        name := None,
        transposed := True]
    /df_bugs/block1_values (Array(924624, 19)) ''
      atom := Int64Atom(shape=(), dflt=0)
      maindim := 0
      flavor := 'numpy'
      byteorder := 'little'
      chunkshape := None
      /df_bugs/block1_values._v_attrs (AttributeSet), 5 attributes:
       [CLASS := 'ARRAY',
        FLAVOR := 'numpy',
        TITLE := '',
        VERSION := '2.3',
        transposed := True]
    /df_bugs/block2_items (Array(28,)) ''
      atom := StringAtom(itemsize=18, shape=(), dflt='')
      maindim := 0
      flavor := 'numpy'
      byteorder := 'irrelevant'
      chunkshape := None
      /df_bugs/block2_items._v_attrs (AttributeSet), 7 attributes:
       [CLASS := 'ARRAY',
        FLAVOR := 'numpy',
        TITLE := '',
        VERSION := '2.3',
        kind := 'string',
        name := None,
        transposed := True]
    /df_bugs/block2_values (VLArray(1,)) ''
      atom = ObjectAtom()
      byteorder = 'irrelevant'
      nrows = 1
      flavor = 'numpy'
      /df_bugs/block2_values._v_attrs (AttributeSet), 5 attributes:
       [CLASS := 'VLARRAY',
        PSEUDOATOM := 'object',
        TITLE := '',
        VERSION := '1.3',
        transposed := True]
- I tried putting the DataFrame into the HDFStore with the argument "table" set to True, but got an error; it seems Python's datetime type is not supported:
Exception: cannot find the correct atom type -> [dtype->object] object
of type 'datetime.datetime' has no len()
Update 3

Thanks Jeff. Sorry for the delay.

- tables.__version__: '2.4.0'
- Yes, the 884 s was the cost of the put operation only; it does not include the pull from MySQL
- One row of the DataFrame (df.ix[0]):
    bug_id                                    1
    assigned_to                             185
    bug_file_loc                           None
    bug_severity                       critical
    bug_status                           closed
    creation_ts             1998-05-06 21:27:00
    delta_ts                2012-05-09 14:41:41
    short_desc                     Two cursors.
    host_op_sys                         Unknown
    guest_op_sys                        Unknown
    priority                                 P3
    rep_platform                           IA32
    reporter                                 56
    product_id                                7
    category_id                             983
    component_id                          12925
    resolution                            fixed
    target_milestone                        ws1
    qa_contact                              412
    status_whiteboard
    votes                                     0
    keywords                                 SR
    lastdiffed              2012-05-09 14:41:41
    everconfirmed                             1
    reporter_accessible                       1
    cclist_accessible                         1
    estimated_time                         0.00
    remaining_time                         0.00
    deadline                               None
    alias                                  None
    found_in_product_id                       0
    found_in_version_id                       0
    found_in_phase_id                         0
    cf_type                              Defect
    cf_reported_by                  Development
    cf_attempted                            NaN
    cf_failed                               NaN
    cf_public_summary
    cf_doc_impact                             0
    cf_security                               0
    cf_build                                NaN
    cf_branch
    cf_change                               NaN
    cf_test_id                              NaN
    cf_regression                       Unknown
    cf_reviewer                               0
    cf_on_hold                                0
    cf_public_severity                      ---
    cf_i18n_impact                            0
    cf_eta                                 None
    cf_bug_source                           ---
    cf_viss                                None
    Name: 0, Length: 52
- The summary of the DataFrame (just typing "df" in an IPython notebook):
    Int64Index: 924624 entries, 0 to 924623
    Data columns:
    bug_id                 924624  non-null values
    assigned_to            924624  non-null values
    bug_file_loc           427318  non-null values
    bug_severity           924624  non-null values
    bug_status             924624  non-null values
    creation_ts            924624  non-null values
    delta_ts               924624  non-null values
    short_desc             924624  non-null values
    host_op_sys            924624  non-null values
    guest_op_sys           924624  non-null values
    priority               924624  non-null values
    rep_platform           924624  non-null values
    reporter               924624  non-null values
    product_id             924624  non-null values
    category_id            924624  non-null values
    component_id           924624  non-null values
    resolution             924624  non-null values
    target_milestone       924624  non-null values
    qa_contact             924624  non-null values
    status_whiteboard      924624  non-null values
    votes                  924624  non-null values
    keywords               924624  non-null values
    lastdiffed             924509  non-null values
    everconfirmed          924624  non-null values
    reporter_accessible    924624  non-null values
    cclist_accessible      924624  non-null values
    estimated_time         924624  non-null values
    remaining_time         924624  non-null values
    deadline               0       non-null values
    alias                  0       non-null values
    found_in_product_id    924624  non-null values
    found_in_version_id    924624  non-null values
    found_in_phase_id      924624  non-null values
    cf_type                924624  non-null values
    cf_reported_by         924624  non-null values
    cf_attempted           89622   non-null values
    cf_failed              89587   non-null values
    cf_public_summary      510799  non-null values
    cf_doc_impact          924624  non-null values
    cf_security            924624  non-null values
    cf_build               327460  non-null values
    cf_branch              614929  non-null values
    cf_change              300612  non-null values
    cf_test_id             12610   non-null values
    cf_regression          924624  non-null values
    cf_reviewer            924624  non-null values
    cf_on_hold             924624  non-null values
    cf_public_severity     924624  non-null values
    cf_i18n_impact         924624  non-null values
    cf_eta                 3910    non-null values
    cf_bug_source          924624  non-null values
    cf_viss                725     non-null values
    dtypes: float64(5), int64(19), object(28)
- After "convert_objects()":

    dtypes: datetime64[ns](2), float64(5), int64(19), object(26)
- Putting the converted DataFrame into the HDFStore cost 749.50 s :)
- It seems that reducing the number of "object" dtype columns is the key to reducing the time cost
- But putting the converted DataFrame into the HDFStore with the argument "table" set to True still returned this error:
    /usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
       2203                     raise
       2204                 except (Exception), detail:
    -> 2205                     raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
       2206                 j += 1
       2207

    Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.datetime' has no len()
- I will try putting the DataFrame without the datetime columns
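For what it's worth, here is a minimal sketch of the conversion in question. convert_objects() has since been removed from pandas, so this uses pd.to_datetime as the assumed modern equivalent; the values are made up:

```python
import datetime
import pandas as pd

# datetime.datetime objects coming out of a MySQL driver leave the
# column with dtype 'object', which is what makes the put slow
raw = pd.Series([datetime.datetime(1998, 5, 6, 21, 27, 0),
                 datetime.datetime(2012, 5, 9, 14, 41, 41)], dtype=object)

# pd.to_datetime (standing in for convert_objects) yields datetime64[ns],
# one fewer object column for HDFStore to pickle
converted = pd.to_datetime(raw)
print(converted.dtype)  # datetime64[ns]
```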
Update 4

- There are 4 columns in MySQL whose type is datetime:
  - creation_ts
  - delta_ts
  - lastdiffed
  - deadline
After calling convert_objects():

- creation_ts:

    Timestamp: 1998-05-06 21:27:00

- delta_ts:

    Timestamp: 2012-05-09 14:41:41

- lastdiffed:

    datetime.datetime(2012, 5, 9, 14, 41, 41)

- deadline is always None, both before and after calling "convert_objects":

    None
- Putting the DataFrame without the column "lastdiffed" cost 691.75 s
- When putting the DataFrame without the column "lastdiffed" and with the argument "table" set to True, I got a new error:
    /usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
       2203                     raise
       2204                 except (Exception), detail:
    -> 2205                     raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
       2206                 j += 1
       2207

    Exception: cannot find the correct atom type -> [dtype->object] object of type 'Decimal' has no len()
- The columns "estimated_time", "remaining_time", and "cf_viss" have type "decimal" in MySQL

Update 5

- I converted those "decimal" columns to "float" with code like this:
    no_diffed_converted_df_bugs.estimated_time = no_diffed_converted_df_bugs.estimated_time.map(float)
- Now the time cost is 372.84 s
- But the "table" version of put still raised an error:
    /usr/local/lib/python2.6/dist-packages/pandas-0.10.1.dev_6e2b6ea-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
       2203                     raise
       2204                 except (Exception), detail:
    -> 2205                     raise Exception("cannot find the correct atom type -> [dtype->%s] %s" % (b.dtype.name, str(detail)))
       2206                 j += 1
       2207

    Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.date' has no len()
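Both of these failures come from object columns that PyTables cannot map to a native atom. A small sketch of the two conversions that clear them, with column names borrowed from the table above and made-up values:

```python
import datetime
from decimal import Decimal
import pandas as pd

df = pd.DataFrame({
    'estimated_time': pd.Series([Decimal('0.00'), Decimal('1.50')], dtype=object),
    'deadline':       pd.Series([datetime.date(2012, 5, 9), None], dtype=object),
})

# Decimal -> float64, the same fix as the map(float) call above
df['estimated_time'] = df['estimated_time'].map(float)

# datetime.date -> datetime64[ns] (missing values become NaT),
# so a table-format put can map the column
df['deadline'] = pd.to_datetime(df['deadline'])
```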
I'm pretty sure your issue relates to the type mapping between the actual types in the DataFrame and how PyTables stores them.

- Simple types (floats/ints/bools) that have a fixed representation map to fixed C types
- Datelikes are handled if they can be converted correctly (i.e. they have a dtype of 'datetime64[ns]'); notably, datetime.datetime objects are not handled, nor are datetime.date objects (NaN is a different case and, depending on usage, can cause the whole column type to be handled incorrectly)
- Strings are mapped (Storer objects map them to the object type; Tables map them to string types)
- Unicode is not handled
- All other types are handled as objects in Storers, or raise an exception for Tables
What this means is that if you do a put to a Storer (a fixed representation), all non-mappable types become objects, and PyTables pickles these columns. See the ObjectAtom reference below:
http://pytables.github.com/usersguide/libref/declarative_classes.html (Atom classes and their descendants)
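A quick way to see which of those buckets each column falls into before a put is to look at df.dtypes: anything still 'object' will be pickled by a Storer (or rejected by a Table). A sketch with made-up data:

```python
import datetime
from decimal import Decimal
import pandas as pd

df = pd.DataFrame({
    'ints':     pd.Series([1, 2]),                             # fixed C type: stored fast
    'floats':   pd.Series([0.0, 1.5]),                         # fixed C type: stored fast
    'stamps':   pd.to_datetime(['1998-05-06', '2012-05-09']),  # datetime64[ns]: handled
    'dates':    pd.Series([datetime.date(2012, 5, 9)] * 2, dtype=object),
    'decimals': pd.Series([Decimal('0.00')] * 2, dtype=object),
})

# object columns: a Storer pickles these via ObjectAtom, a Table raises
object_cols = sorted(df.dtypes[df.dtypes == object].index)
print(object_cols)  # ['dates', 'decimals']
```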
A Table will raise on an invalid type (I should provide a better error message here). I think I will also provide a warning if you try to store a type that maps to an ObjectAtom (for performance reasons).

To force some types, try something like this:
    import pandas as pd

    # convert None to nan (it's currently object)
    # converts to float64 (or the type of the other objs)
    x = pd.Series([None])
    x = x.where(pd.notnull(x)).convert_objects()

    # convert datetime-like with embedded nans to datetime64[ns]
    df['foo'] = pd.Series(df['foo'].values, dtype='M8[ns]')
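The same two coercions can also be written with pd.to_numeric and pd.to_datetime, which replaced convert_objects in later pandas versions:

```python
import pandas as pd

# None -> NaN: the column becomes float64 instead of object
x = pd.Series([None], dtype=object)
x = pd.to_numeric(x, errors='coerce')

# datetime-like strings with embedded missing values -> datetime64[ns]
foo = pd.Series(['1998-05-06 21:27:00', None], dtype=object)
foo = pd.to_datetime(foo)
```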
Here is an example on 64-bit Linux (the file is 1M rows, about 1 GB in size on disk):
    In [1]: import numpy as np

    In [2]: import pandas as pd

    In [3]: pd.__version__
    Out[3]: '0.10.1.dev'

    In [3]: import tables

    In [4]: tables.__version__
    Out[4]: '2.3.1'

    In [4]: df = pd.DataFrame(np.random.randn(1000 * 1000, 100), index=range(int(
       ...:     1000 * 1000)), columns=['E%03d' % i for i in xrange(100)])

    In [5]: for x in range(20):
       ...:     df['String%03d' % x] = 'string%03d' % x

    In [6]: df
    Out[6]:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1000000 entries, 0 to 999999
    Columns: 120 entries, E000 to String019
    dtypes: float64(100), object(20)

    # storer put (cannot query)
    In [9]: def test_put():
       ...:     store = pd.HDFStore('test_put.h5','w')
       ...:     store['df'] = df
       ...:     store.close()

    In [10]: %timeit test_put()
    1 loops, best of 3: 7.65 s per loop

    # table put (can query)
    In [7]: def test_put():
       ....:     store = pd.HDFStore('test_put.h5','w')
       ....:     store.put('df',df,table=True)
       ....:     store.close()

    In [8]: %timeit test_put()
    1 loops, best of 3: 21.4 s per loop
How to make it faster?

    store.put('key', df, table=True)
After doing all this, the performance of the put operation on the same dataset improved a lot:

    CPU times: user 42.07 s, sys: 28.17 s, total: 70.24 s
    Wall time: 98.97 s

The profile log of the second test:
    95984 function calls (95958 primitive calls) in 68.688 CPU seconds

       Ordered by: internal time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
          445   16.757    0.038   16.757    0.038 {numpy.core.multiarray.array}
           19   16.250    0.855   16.250    0.855 {method '_append_records' of 'tables.tableExtension.Table' objects}
           16    7.958    0.497    7.958    0.497 {method 'astype' of 'numpy.ndarray' objects}
           19    6.533    0.344    6.533    0.344 {pandas.lib.create_hdf_rows_2d}
            4    6.284    1.571    6.388    1.597 {method '_fillCol' of 'tables.tableExtension.Row' objects}
           20    2.640    0.132    2.641    0.132 {pandas.lib.maybe_convert_objects}
            1    1.785    1.785    1.785    1.785 {pandas.lib.isnullobj}
            7    1.619    0.231    1.619    0.231 {method 'flatten' of 'numpy.ndarray' objects}
           11    1.059    0.096    1.059    0.096 {pandas.lib.infer_dtype}
            1    0.997    0.997   41.952   41.952 pytables.py:2468(write_data)
           19    0.985    0.052   40.590    2.136 pytables.py:2504(write_data_chunk)
            1    0.827    0.827   60.617   60.617 pytables.py:2433(write)
         1504    0.592    0.000    0.592    0.000 {method '_g_readSlice' of 'tables.hdf5Extension.Array' objects}
            4    0.534    0.133   13.676    3.419 pytables.py:1038(set_atom)
            1    0.528    0.528    0.528    0.528 {pandas.lib.max_len_string_array}
            4    0.441    0.110    0.571    0.143 internals.py:1409(_stack_arrays)
           35    0.358    0.010    0.358    0.010 {method 'copy' of 'numpy.ndarray' objects}
            1    0.276    0.276    3.135    3.135 internals.py:208(fillna)
            5    0.263    0.053    2.054    0.411 common.py:128(_isnull_ndarraylike)
           48    0.253    0.005    0.253    0.005 {method '_append' of 'tables.hdf5Extension.Array' objects}
            4    0.240    0.060    1.500    0.375 internals.py:1400(_simple_blockify)
            1    0.234    0.234   12.145   12.145 pytables.py:1066(set_atom_string)
           28    0.225    0.008    0.225    0.008 {method '_createCArray' of 'tables.hdf5Extension.Array' objects}
           36    0.218    0.006    0.218    0.006 {method '_g_writeSlice' of 'tables.hdf5Extension.Array' objects}
         6110    0.155    0.000    0.155    0.000 {numpy.core.multiarray.empty}
            4    0.097    0.024    0.097    0.024 {method 'all' of 'numpy.ndarray' objects}
            6    0.084    0.014    0.084    0.014 {tables.indexesExtension.keysort}
           18    0.084    0.005    0.084    0.005 {method '_g_close' of 'tables.hdf5Extension.Leaf' objects}
        11816    0.064    0.000    0.108    0.000 file.py:1036(_getNode)
           19    0.053    0.003    0.053    0.003 {method '_g_flush' of 'tables.hdf5Extension.Leaf' objects}
         1528    0.045    0.000    0.098    0.000 array.py:342(_interpret_indexing)
        11709    0.040    0.000    0.042    0.000 file.py:248(__getitem__)
            2    0.027    0.013    0.383    0.192 index.py:1099(get_neworder)
            1    0.018    0.018    0.018    0.018 {numpy.core.multiarray.putmask}
            4    0.013    0.003    0.017    0.004 index.py:607(final_idx32)