ptrepack sortby needs 'full' index
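I am trying to ptrepack an HDF file that was created with the pandas HDFStore PyTables interface.
The main index of the DataFrame is time, but I also added several more columns.
Now I would like to sort the HDF file by one of those columns (selection is too slow for me; the file is 84 GB), using ptrepack with the --sortby option:

    ()[maye@luna4 .../nominal]$ ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc --sortby=clat C9.h5 C9_sorted.h5

I get this error message: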

    ()[maye@luna4 .../nominal]$ Problems doing the copy from 'C9.h5:/' to 'C9_sorted.h5:/'
    The error was --> : Field clat must have associated a 'full' index in table /df/table (Table(390557601,)) ''.
    The destination file looks like:
    C9_sorted.h5 (File) ''
    Last modif.: 'Fri Jul 26 18:17:56 2013'
    Object Tree:
    / (RootGroup) ''
    /df (Group) ''
    /df/table (Table(0,), shuffle, blosc(9)) ''

    Traceback (most recent call last):
      File "/usr/local/epd/bin/ptrepack", line 10, in <module>
        sys.exit(main())
      File "/usr/local/epd/lib/python2.7/site-packages/tables/scripts/ptrepack.py", line 480, in main
        upgradeflavors=upgradeflavors)
      File "/usr/local/epd/lib/python2.7/site-packages/tables/scripts/ptrepack.py", line 225, in copyChildren
        raise RuntimeError("Please check that the node names are not"
    RuntimeError: Please check that the node names are not duplicated in destination, and if so, add the --overwrite-nodes flag if desired. In particular, pay attention that rootUEP is not fooling you.

Does this mean that I cannot sort the HDF file by one of the indexed columns because their indexes are not 'full' indexes?
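One way to see which kind of index a column currently carries is to open the file with PyTables directly and look at the column's index object. The following is only a sketch under assumptions: it builds a small stand-in file (the file name `demo.h5`, the 1000 random rows and the `value` column are made up for illustration and are not the 84 GB file from the question), gives `clat` a 'medium' index, and then swaps it for a 'full' one. The helper `index_kind` is hypothetical, not part of PyTables.

```python
import os
import tempfile

import numpy as np
import tables


def index_kind(path, column, node="/df/table"):
    """Return the index kind of `column` ('medium', 'full', ...) or None if unindexed."""
    with tables.open_file(path, mode="r") as h5:
        col = h5.get_node(node).cols._f_col(column)
        return col.index.kind if col.is_indexed else None


# Build a tiny stand-in file (hypothetical data, not the original C9.h5).
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with tables.open_file(path, mode="w") as h5:
    rows = np.zeros(1000, dtype=[("clat", "f8"), ("value", "f8")])
    rows["clat"] = np.random.uniform(-90, 90, size=1000)
    table = h5.create_table("/df", "table", obj=rows, createparents=True)
    # A 'medium' index is the kind that makes --sortby fail in the question.
    table.cols.clat.create_index(kind="medium")

kind_before = index_kind(path, "clat")

# Replace the 'medium' index with a 'full' one.
with tables.open_file(path, mode="a") as h5:
    col = h5.get_node("/df/table").cols.clat
    col.remove_index()
    col.create_index(kind="full")

kind_after = index_kind(path, "clat")
print(kind_before, "->", kind_after)
```

If the reported kind is anything other than 'full', the error message quoted above is the expected outcome of `--sortby`.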
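I have tested several of the options that Jeff mentioned in the discussion above.
Please have a look at this notebook; I hope it helps you make decisions about your own data storage: http://nbviewer.ipython.org/810bd0720bb1732067ff
The gist for the notebook is here: https://gist.github.com/michaelaye/810bd0720bb1732067ff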
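My main conclusions are:
- Using index=False has several impressive effects: (1) it reduces the size of the resulting HDF file; (2) it makes creating the HDF file much faster; (3) even though the ptdump and storer().group.table printouts show no index, the store display still lists indexers and data columns (this may be my ignorance of the PyTables machinery).
- Creating an index via store.create_table_index() does nothing for data selection by one of the data columns.
- This index has to be a 'full' index so that the subsequent ptrepack with --sortby does not bail out. It does not have to be index level 9, though. The default level 6 is fine and does not seem to affect data selection speed significantly. Maybe it would with many columns?
- Using --propindexes almost doubles the ptrepack time, while data selection speed improves only slightly.
- Using compression together with --propindexes is only slightly slower than using --propindexes alone, while the data size (at least in this example) does not go down significantly.
- Data selection speed does not seem to differ much when compression is used.
- For this example of 1 million rows of random data in 2 columns, the speed-up is roughly a factor of 5 after sorting on the selection column, using only --sortby without --propindexes.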
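One plausible explanation for the speed-up after sorting, which the notebook itself does not demonstrate, is that the rows matching a range condition end up next to each other, so far fewer chunks have to be read. The sketch below only illustrates that idea with NumPy; it does not touch an HDF file, and the chunk length of 10000 rows and the threshold 2.0 are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 1000000
chunk_rows = 10000  # hypothetical chunk length, not taken from a real file

b_unsorted = rng.standard_normal(n_rows)
b_sorted = np.sort(b_unsorted)


def chunks_touched(values, threshold):
    """Count the chunks that hold at least one row with value > threshold."""
    hit_rows = np.nonzero(values > threshold)[0]
    return np.unique(hit_rows // chunk_rows).size


touched_unsorted = chunks_touched(b_unsorted, 2.0)
touched_sorted = chunks_touched(b_sorted, 2.0)
print("chunks touched, unsorted:", touched_unsorted)
print("chunks touched, sorted:  ", touched_sorted)
```

With unsorted data the matching rows are scattered over essentially every chunk, whereas after sorting they sit in a handful of chunks at the end. How much of this translates into wall-clock time on a real file depends on compression, caching and the index, so the factor of 5 reported above should be read as specific to that test.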
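For completeness, a very short summary of the commands:

    import numpy as np
    import pandas as pd

    # 1e6 rows of random data; both columns become data columns.
    df = pd.DataFrame(np.random.randn(1000000, 2), columns=list('AB'))
    # index=False: do not build any index while writing the table.
    df.to_hdf('test.h5', key='df', data_columns=list('AB'), mode='w',
              format='table', index=False)
    # Create the 'full' index on B that ptrepack --sortby requires.
    store = pd.HDFStore('test.h5')
    store.create_table_index('df', columns=['B'], kind='full')
    store.close()

In the shell:

    ptrepack --chunkshape=auto --sortby=B test.h5 test_sorted.h5
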
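If calling the shell is inconvenient, PyTables also accepts a `sortby` argument in `Table.copy`, which gives a sorted copy from within Python. This is a sketch under assumptions, not a replacement for the ptrepack command above: the file names, the node name `data` and the 1000 random rows are made up, and it was written against the PyTables 3.x API. It uses `create_csindex()` because, as far as the PyTables documentation states, a fully sorted result is only guaranteed when the 'full' index is also completely sorted (CSI).

```python
import os
import tempfile

import numpy as np
import tables

workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "test.h5")
dst_path = os.path.join(workdir, "test_sorted.h5")

# Write a small unsorted table (hypothetical stand-in data).
with tables.open_file(src_path, mode="w") as src:
    rows = np.zeros(1000, dtype=[("A", "f8"), ("B", "f8")])
    rows["A"] = np.random.randn(1000)
    rows["B"] = np.random.randn(1000)
    table = src.create_table("/", "data", obj=rows)
    # Completely sorted 'full' index on the column we want to sort by.
    table.cols.B.create_csindex()

# Copy the table into a new file, ordered by column B.
with tables.open_file(src_path, mode="r") as src, \
        tables.open_file(dst_path, mode="w") as dst:
    src.root.data.copy(dst.root, "data", sortby="B", checkCSI=True)

# Read column B back from the sorted copy.
with tables.open_file(dst_path, mode="r") as dst:
    sorted_b = dst.root.data.col("B")

print("sorted:", bool(np.all(np.diff(sorted_b) >= 0)))
```

Passing `checkCSI=True` makes the copy fail early if the index is 'full' but not completely sorted, instead of silently producing a partially ordered table.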
Here is a complete example.
Create the frame with a data_column. Reset the index to a 'full' index. Use ptrepack to sort it.

    In [16]: df = DataFrame(randn(10,2),columns=list('AB')).to_hdf('test.h5','df',data_columns=['B'],mode='w',table=True)

    In [17]: store = pd.HDFStore('test.h5')

    In [18]: store
    Out[18]:
    <class 'pandas.io.pytables.HDFStore'>
    File path: test.h5
    /df            frame_table  (typ->appendable,nrows->10,ncols->2,indexers->[index],dc->[B])

    In [19]: store.get_storer('df').group.table
    Out[19]:
    /df/table (Table(10,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
      "B": Float64Col(shape=(), dflt=0.0, pos=2)}
      byteorder := 'little'
      chunkshape := (2730,)
      autoIndex := True
      colindexes := {
        "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
        "B": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

    In [20]: store.create_table_index('df',columns=['B'],optlevel=9,kind='full')

    In [21]: store.get_storer('df').group.table
    Out[21]:
    /df/table (Table(10,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
      "B": Float64Col(shape=(), dflt=0.0, pos=2)}
      byteorder := 'little'
      chunkshape := (2730,)
      autoIndex := True
      colindexes := {
        "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
        "B": Index(9, full, shuffle, zlib(1)).is_CSI=True}

    In [22]: store.close()

    In [25]: !ptdump -avd test.h5
    / (RootGroup) ''
      /._v_attrs (AttributeSet), 4 attributes:
       [CLASS := 'GROUP',
        PYTABLES_FORMAT_VERSION := '2.0',
        TITLE := '',
        VERSION := '1.0']
    /df (Group) ''
      /df._v_attrs (AttributeSet), 14 attributes:
       [CLASS := 'GROUP',
        TITLE := '',
        VERSION := '1.0',
        data_columns := ['B'],
        encoding := None,
        index_cols := [(0, 'index')],
        info := {'index': {}},
        levels := 1,
        nan_rep := b'nan',
        non_index_axes := [(1, ['A', 'B'])],
        pandas_type := b'frame_table',
        pandas_version := b'0.10.1',
        table_type := b'appendable_frame',
        values_cols := ['values_block_0', 'B']]
    /df/table (Table(10,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
      "B": Float64Col(shape=(), dflt=0.0, pos=2)}
      byteorder := 'little'
      chunkshape := (2730,)
      autoindex := True
      colindexes := {
        "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
        "B": Index(9, full, shuffle, zlib(1)).is_csi=True}
      /df/table._v_attrs (AttributeSet), 15 attributes:
       [B_dtype := b'float64',
        B_kind := ['B'],
        CLASS := 'TABLE',
        FIELD_0_FILL := 0,
        FIELD_0_NAME := 'index',
        FIELD_1_FILL := 0.0,
        FIELD_1_NAME := 'values_block_0',
        FIELD_2_FILL := 0.0,
        FIELD_2_NAME := 'B',
        NROWS := 10,
        TITLE := '',
        VERSION := '2.6',
        index_kind := b'integer',
        values_block_0_dtype := b'float64',
        values_block_0_kind := ['A']]
      Data dump:
    [0] (0, [1.10989047288066], 0.396613633081911)
    [1] (1, [0.0981650001268093], -0.9209780702446433)
    [2] (2, [-0.2429293157073629], -1.779366453624283)
    [3] (3, [0.7305529521507728], 1.243565083939927)
    [4] (4, [-0.1480724789512519], 0.5260130757651649)
    [5] (5, [1.2560020435792643], 0.5455842491255144)
    [6] (6, [1.20129355706986], 0.47930635538027244)
    [7] (7, [0.9973598999689721], 0.8602929579025727)
    [8] (8, [-0.40070941088441786], 0.7622228032635253)
    [9] (9, [0.35865804118145655], 0.29939126149826045)

Here is another way to create a completely sorted index (as opposed to writing it that way):

    In [23]: !ptrepack --sortby=B test.h5 test_sorted.h5

    In [26]: !ptdump -avd test_sorted.h5
    / (RootGroup) ''
      /._v_attrs (AttributeSet), 4 attributes:
       [CLASS := 'GROUP',
        PYTABLES_FORMAT_VERSION := '2.1',
        TITLE := '',
        VERSION := '1.0']
    /df (Group) ''
      /df._v_attrs (AttributeSet), 14 attributes:
       [CLASS := 'GROUP',
        TITLE := '',
        VERSION := '1.0',
        data_columns := ['B'],
        encoding := None,
        index_cols := [(0, 'index')],
        info := {'index': {}},
        levels := 1,
        nan_rep := b'nan',
        non_index_axes := [(1, ['A', 'B'])],
        pandas_type := b'frame_table',
        pandas_version := b'0.10.1',
        table_type := b'appendable_frame',
        values_cols := ['values_block_0', 'B']]
    /df/table (Table(10,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
      "B": Float64Col(shape=(), dflt=0.0, pos=2)}
      byteorder := 'little'
      chunkshape := (2730,)
      /df/table._v_attrs (AttributeSet), 15 attributes:
       [B_dtype := b'float64',
        B_kind := ['B'],
        CLASS := 'TABLE',
        FIELD_0_FILL := 0,
        FIELD_0_NAME := 'index',
        FIELD_1_FILL := 0.0,
        FIELD_1_NAME := 'values_block_0',
        FIELD_2_FILL := 0.0,
        FIELD_2_NAME := 'B',
        NROWS := 10,
        TITLE := '',
        VERSION := '2.6',
        index_kind := b'integer',
        values_block_0_dtype := b'float64',
        values_block_0_kind := ['A']]
      Data dump:
    [0] (2, [-0.2429293157073629], -1.779366453624283)
    [1] (1, [0.0981650001268093], -0.9209780702446433)
    [2] (9, [0.35865804118145655], 0.29939126149826045)
    [3] (0, [1.10989047288066], 0.396613633081911)
    [4] (6, [1.20129355706986], 0.47930635538027244)
    [5] (4, [-0.1480724789512519], 0.5260130757651649)
    [6] (5, [1.2560020435792643], 0.5455842491255144)
    [7] (8, [-0.40070941088441786], 0.7622228032635253)
    [8] (7, [0.9973598999689721], 0.8602929579025727)
    [9] (3, [0.7305529521507728], 1.243565083939927)
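As a quick sanity check on the two data dumps, the row order in test_sorted.h5 should be exactly the order obtained by sorting the rows of test.h5 by B. The sketch below copies the (index, B) pairs from the first dump and the index order from the second dump by hand; nothing is read from the files themselves, and the variable names are made up for illustration.

```python
# (index, B) pairs copied from the data dump of test.h5.
original = [
    (0, 0.396613633081911),
    (1, -0.9209780702446433),
    (2, -1.779366453624283),
    (3, 1.243565083939927),
    (4, 0.5260130757651649),
    (5, 0.5455842491255144),
    (6, 0.47930635538027244),
    (7, 0.8602929579025727),
    (8, 0.7622228032635253),
    (9, 0.29939126149826045),
]

# Order of the "index" column in the data dump of test_sorted.h5.
sorted_dump_order = [2, 1, 9, 0, 6, 4, 5, 8, 7, 3]

# Sort the original rows by B and keep only the index values.
expected_order = [idx for idx, b in sorted(original, key=lambda row: row[1])]

print(expected_order == sorted_dump_order)
```

Note also that the dump of the sorted file lists no colindexes: the command was run without --propindexes, so the indexes were not carried over to the copy.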