关于python：ptrepack sortby需要’完整’索引

ptrepack sortby needs 'full' index

我试图ptrepack一个用pandas HDFStore pytables接口创建的HDF文件。
数据帧的主要索引是时间，但我做了更多列data_columns，以便我可以通过这些data_columns过滤磁盘上的数据。

现在我想通过其中一个列对HDF文件进行排序(因为选择对我来说太慢了，84 GB文件)，使用带有sortby选项的ptrepack如下：

1	()[maye@luna4 .../nominal]$ ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc --sortby=clat C9.h5 C9_sorted.h5

我收到错误信息：

()[maye@luna4 .../nominal]$ Problems doing the copy from 'C9.h5:/' to
'C9_sorted.h5:/' The error was --> :
Field clat must have associated a 'full' index in table /df/table
(Table(390557601,)) ''. The destination file looks like: C9_sorted.h5
(File) '' Last modif.: 'Fri Jul 26 18:17:56 2013' Object Tree: /
(RootGroup) '' /df (Group) '' /df/table (Table(0,), shuffle, blosc(9))
''

Traceback (most recent call last): File
"/usr/local/epd/bin/ptrepack", line 10, in
sys.exit(main()) File"/usr/local/epd/lib/python2.7/site-packages/tables/scripts/ptrepack.py",
line 480, in main
upgradeflavors=upgradeflavors) File"/usr/local/epd/lib/python2.7/site-packages/tables/scripts/ptrepack.py",
line 225, in copyChildren
raise RuntimeError("Please check that the node names are not" RuntimeError: Please check that the node names are not duplicated in
destination, and if so, add the --overwrite-nodes flag if desired. In
particular, pay attention that rootUEP is not fooling you.

这是否意味着，我无法通过索引列对HDF文件进行排序，因为它们不是"完整"索引？

相关讨论

这是一个完整的例子。

使用data_column创建框架。将索引重置为完整索引。使用ptrepack来
排序它。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103

In [16]: df = DataFrame(randn(10,2),columns=list('AB')).to_hdf('test.h5','df',data_columns=['B'],mode='w',table=True)

In [17]: store = pd.HDFStore('test.h5')

In [18]: store
Out[18]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df frame_table (typ->appendable,nrows->10,ncols->2,indexers->[index],dc->[B])

In [19]: store.get_storer('df').group.table
Out[19]:
/df/table (Table(10,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (2730,)
autoIndex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"B": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

In [20]: store.create_table_index('df',columns=['B'],optlevel=9,kind='full')

In [21]: store.get_storer('df').group.table
Out[21]:
/df/table (Table(10,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (2730,)
autoIndex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"B": Index(9, full, shuffle, zlib(1)).is_CSI=True}

In [22]: store.close()

In [25]: !ptdump -avd test.h5
/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.0',
TITLE := '',
VERSION := '1.0']
/df (Group) ''
/df._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := ['B'],
encoding := None,
index_cols := [(0, 'index')],
info := {'index': {}},
levels := 1,
nan_rep := b'nan',
non_index_axes := [(1, ['A', 'B'])],
pandas_type := b'frame_table',
pandas_version := b'0.10.1',
table_type := b'appendable_frame',
values_cols := ['values_block_0', 'B']]
/df/table (Table(10,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (2730,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"B": Index(9, full, shuffle, zlib(1)).is_csi=True}
/df/table._v_attrs (AttributeSet), 15 attributes:
[B_dtype := b'float64',
B_kind := ['B'],
CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := 0.0,
FIELD_1_NAME := 'values_block_0',
FIELD_2_FILL := 0.0,
FIELD_2_NAME := 'B',
NROWS := 10,
TITLE := '',
VERSION := '2.6',
index_kind := b'integer',
values_block_0_dtype := b'float64',
values_block_0_kind := ['A']]
Data dump:
[0] (0, [1.10989047288066], 0.396613633081911)
[1] (1, [0.0981650001268093], -0.9209780702446433)
[2] (2, [-0.2429293157073629], -1.779366453624283)
[3] (3, [0.7305529521507728], 1.243565083939927)
[4] (4, [-0.1480724789512519], 0.5260130757651649)
[5] (5, [1.2560020435792643], 0.5455842491255144)
[6] (6, [1.20129355706986], 0.47930635538027244)
[7] (7, [0.9973598999689721], 0.8602929579025727)
[8] (8, [-0.40070941088441786], 0.7622228032635253)
[9] (9, [0.35865804118145655], 0.29939126149826045)

这是另一种创建完全排序索引的方法(与以这种方式编写索引相反)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59

In [23]: !ptrepack --sortby=B test.h5 test_sorted.h5

In [26]: !ptdump -avd test_sorted.h5
/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.1',
TITLE := '',
VERSION := '1.0']
/df (Group) ''
/df._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := ['B'],
encoding := None,
index_cols := [(0, 'index')],
info := {'index': {}},
levels := 1,
nan_rep := b'nan',
non_index_axes := [(1, ['A', 'B'])],
pandas_type := b'frame_table',
pandas_version := b'0.10.1',
table_type := b'appendable_frame',
values_cols := ['values_block_0', 'B']]
/df/table (Table(10,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (2730,)
/df/table._v_attrs (AttributeSet), 15 attributes:
[B_dtype := b'float64',
B_kind := ['B'],
CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := 0.0,
FIELD_1_NAME := 'values_block_0',
FIELD_2_FILL := 0.0,
FIELD_2_NAME := 'B',
NROWS := 10,
TITLE := '',
VERSION := '2.6',
index_kind := b'integer',
values_block_0_dtype := b'float64',
values_block_0_kind := ['A']]
Data dump:
[0] (2, [-0.2429293157073629], -1.779366453624283)
[1] (1, [0.0981650001268093], -0.9209780702446433)
[2] (9, [0.35865804118145655], 0.29939126149826045)
[3] (0, [1.10989047288066], 0.396613633081911)
[4] (6, [1.20129355706986], 0.47930635538027244)
[5] (4, [-0.1480724789512519], 0.5260130757651649)
[6] (5, [1.2560020435792643], 0.5455842491255144)
[7] (8, [-0.40070941088441786], 0.7622228032635253)
[8] (7, [0.9973598999689721], 0.8602929579025727)
[9] (3, [0.7305529521507728], 1.243565083939927)