Python: How to troubleshoot HDFStore Exception: cannot find the correct atom type

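I am looking for general guidance on what kinds of data scenarios can cause this exception. I have tried massaging my data in various ways, to no avail.

I have searched for this exception for several days, gone through several Google Groups discussions, and found no solution for debugging HDFStore Exception: cannot find the correct atom type. I am reading a simple csv file with mixed data types: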

Int64Index: 401125 entries, 0 to 401124
Data columns:
SalesID 401125 non-null values
SalePrice 401125 non-null values
MachineID 401125 non-null values
ModelID 401125 non-null values
datasource 401125 non-null values
auctioneerID 380989 non-null values
YearMade 401125 non-null values
MachineHoursCurrentMeter 142765 non-null values
UsageBand 401125 non-null values
saledate 401125 non-null values
fiModelDesc 401125 non-null values
Enclosure_Type 401125 non-null values
...................................................
Stick_Length 401125 non-null values
Thumb 401125 non-null values
Pattern_Changer 401125 non-null values
Grouser_Type 401125 non-null values
Backhoe_Mounting 401125 non-null values
Blade_Type 401125 non-null values
Travel_Controls 401125 non-null values
Differential_Type 401125 non-null values
Steering_Controls 401125 non-null values
dtypes: float64(2), int64(6), object(45)

Code to store the dataframe:

In [30]: store = pd.HDFStore('test0.h5','w')
In [31]: for chunk in pd.read_csv('Train.csv', chunksize=10000):
   ....:     store.append('df', chunk, index=False)

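Note that if I use store.put on a dataframe imported in one go, I can store it successfully, albeit slowly (I believe this is due to the pickling of the object dtypes, even though the objects are just string data).

Are there NaN value considerations that could be throwing this exception?

The exception: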

Exception: cannot find the correct atom type -> [dtype->object,items->Index([Usa
geBand, saledate, fiModelDesc, fiBaseModel, fiSecondaryDesc, fiModelSeries, fiMo
delDescriptor, ProductSize, fiProductClassDesc, state, ProductGroup, ProductGrou
pDesc, Drive_System, Enclosure, Forks, Pad_Type, Ride_Control, Stick, Transmissi
on, Turbocharged, Blade_Extension, Blade_Width, Enclosure_Type, Engine_Horsepowe
r, Hydraulics, Pushblock, Ripper, Scarifier, Tip_Control, Tire_Size, Coupler, Co
upler_System, Grouser_Tracks, Hydraulics_Flow, Track_Type, Undercarriage_Pad_Wid
th, Stick_Length, Thumb, Pattern_Changer, Grouser_Type, Backhoe_Mounting, Blade_
Type, Travel_Controls, Differential_Type, Steering_Controls], dtype=object)] lis
t index out of range

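Update 1

Jeff's hint about lists stored in the dataframe led me to investigate embedded commas. pandas.read_csv parses the file correctly, and there are indeed some embedded commas inside double quotes. So these fields are not Python lists per se, but they do contain commas in the text. Here are some examples: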

3 Hydraulic Excavator, Track - 12.0 to 14.0 Metric Tons
6 Hydraulic Excavator, Track - 21.0 to 24.0 Metric Tons
8 Hydraulic Excavator, Track - 3.0 to 4.0 Metric Tons
11 Track Type Tractor, Dozer - 20.0 to 75.0 Horsepower
12 Hydraulic Excavator, Track - 19.0 to 21.0 Metric Tons
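
To illustrate why a quoted comma stays inside a single field, here is a minimal sketch using only the standard library csv module. The raw line is reconstructed from the first value shown above; the exact quoting in Train.csv is an assumption.

```python
import csv
import io

# Assumed raw form of one record: an id, then a quoted description with a comma.
raw = '3,"Hydraulic Excavator, Track - 12.0 to 14.0 Metric Tons"\n'

# A naive split breaks the description at the embedded comma.
naive = raw.strip().split(',')
print(len(naive))  # 3

# A CSV parser honours the double quotes and keeps the description whole.
rows = list(csv.reader(io.StringIO(raw)))
print(len(rows[0]))  # 2
print(rows[0][1])
```

A field parsed this way is an ordinary string, so on its own the embedded comma would not be expected to turn the column into a list type.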

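However, when I drop this column from the pd.read_csv chunks and append to my HDFStore, I still get the same exception. When I try to append each column individually, I get the following new exception: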

In [6]: for chunk in pd.read_csv('Train.csv', header=0, chunksize=50000):
   ...:     for col in chunk.columns:
   ...:         store.append(col, chunk[col], data_columns=True)

Exception: cannot properly create the storer for: [_TABLE_MAP] [group->/SalesID
(Group) '',value-><class 'pandas.core.series.Series'>,table->True,append->True,k
wargs->{'data_columns': True}]

I'll continue troubleshooting. Here is a link to a few hundred records:

https://docs.google.com/spreadsheet/ccc?key=0AutqBaUiJLbPdHFvaWNEMk5hZ1NTNlVyUVduYTZTeEE&usp=sharing

Update 2

OK, I tried the following on my work computer and got the following result:

In [4]: store = pd.HDFStore('test0.h5','w')

In [5]: for chunk in pd.read_csv('Train.csv', chunksize=10000):
   ...:     store.append('df', chunk, index=False, data_columns=True)
   ...:

Exception: cannot find the correct atom type -> [dtype->object,items->Index([fiB
aseModel], dtype=object)] [fiBaseModel] column has a min_itemsize of [13] but it
emsize [9] is required!

I think I know what's going on here. If I take the maximum length of the field fiBaseModel for the first chunk, I get:

In [16]: lens = df.fiBaseModel.apply(lambda x: len(x))

In [17]: max(lens[:10000])
Out[17]: 9

For the second chunk:

In [18]: max(lens[10001:20000])
Out[18]: 13

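So the storage table for this column is created with a width of 9 bytes, because that is the maximum in the first chunk. When a longer text value turns up in a subsequent chunk, the exception is thrown.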
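
The effect can be modelled without pandas at all. The sketch below is not HDFStore code; it is a plain-Python illustration, with made-up values and hypothetical helper names, of how a width taken from the first chunk is exceeded by a later chunk.

```python
def chunk_widths(values, chunksize):
    """Return the longest string length found in each consecutive chunk."""
    widths = []
    for start in range(0, len(values), chunksize):
        chunk = values[start:start + chunksize]
        widths.append(max((len(v) for v in chunk), default=0))
    return widths


def overflowing_chunks(values, chunksize):
    """List (chunk_number, width) for chunks wider than the first chunk."""
    widths = chunk_widths(values, chunksize)
    if not widths:
        return []
    first = widths[0]
    return [(i, w) for i, w in enumerate(widths) if w > first]


# Made-up values: the first chunk tops out at 9 characters, the second at 13.
sample = ["310", "D6", "PC200LC-8", "EX120", "WA320-5H-LONG", "416"]
print(chunk_widths(sample, 3))        # [9, 13]
print(overflowing_chunks(sample, 3))  # [(1, 13)]
```

Running a check like this over the real column before appending would show which chunks need a wider column than the first one provides.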

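My suggestion would be either to truncate the data in subsequent chunks (with a warning), or to let the user specify a maximum storage size for the column and truncate anything that exceeds it. Maybe pandas can already do this; I haven't had time to really dive into HDFStore yet.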
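
As a rough illustration of the truncate-with-a-warning idea, something like the following could be applied to a chunk's string values before appending. The helper name is hypothetical and this is not an existing pandas feature; it also counts characters, which matches the stored byte width only for ASCII text.

```python
import warnings


def truncate_to_width(values, width):
    """Clip each string to `width` characters, warning once if any were clipped."""
    clipped = [v[:width] for v in values]
    n_changed = sum(1 for old, new in zip(values, clipped) if old != new)
    if n_changed:
        warnings.warn(f"{n_changed} value(s) truncated to {width} characters")
    return clipped


# Made-up values: only the second one is longer than 9 characters.
print(truncate_to_width(["PC200LC-8", "WA320-5H-LONG"], 9))
```

Truncation loses data silently apart from the warning, so choosing a large enough width up front is usually the safer of the two options.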

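Update 3

I tried importing the csv dataset with pd.read_csv, passing a dictionary that maps every column to object as the dtype parameter. I then iterate over the file and store each chunk into the HDFStore, passing a large value for min_itemsize. I get the following exception: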

AttributeError: 'NoneType' object has no attribute 'itemsize'

My simple code:

store = pd.HDFStore('test0.h5','w')
objects = dict((col,'object') for col in header)

for chunk in pd.read_csv('Train.csv', header=0, dtype=objects,
                         chunksize=10000, na_filter=False):
    store.append('df', chunk, min_itemsize=200)

I tried to debug and inspect the items in the stack trace. This is what the table looks like at the point of the exception:

ipdb> self.table
/df/table (Table(10000,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": StringCol(itemsize=200, shape=(53,), dflt='', pos=1)}
byteorder := 'little'
chunkshape := (24,)
autoIndex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

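Update 4

Now I'm trying to iteratively determine the length of the longest string in each object column of the dataframe. This is how I do it: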

def f(x):
    if x.dtype != 'object':
        return
    else:
        return len(max(x.fillna(''), key=lambda x: len(str(x))))

lengths = pd.DataFrame([chunk.apply(f) for chunk in pd.read_csv('Train.csv', chunksize=50000)])
lens = lengths.max().dropna().to_dict()

In [255]: lens
Out[255]:
{'Backhoe_Mounting': 19.0,
'Blade_Extension': 19.0,
'Blade_Type': 19.0,
'Blade_Width': 19.0,
'Coupler': 19.0,
'Coupler_System': 19.0,
'Differential_Type': 12.0
... etc... }
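
For comparison, the same measurement can be made in a single streaming pass with the standard library csv module, which yields integer lengths rather than floats. This is only a sketch: the helper name and the sample rows are made up, and it measures every column as text, so picking out the string columns is left to the caller.

```python
import csv
import io


def max_field_lengths(lines):
    """Return {column: longest field length} for CSV text with a header row."""
    longest = {}
    for row in csv.DictReader(lines):
        for col, value in row.items():
            if col is None:
                # Extra fields beyond the header are ignored in this sketch.
                continue
            longest[col] = max(longest.get(col, 0), len(value or ""))
    return longest


# Made-up sample standing in for Train.csv.
sample = io.StringIO(
    'SalesID,fiBaseModel,fiProductClassDesc\n'
    '1,PC200,"Hydraulic Excavator, Track"\n'
    '2,WA320-5H-LONG,Dozer\n'
)
lengths = max_field_lengths(sample)
print(lengths)

# On the real file this might be used as:
# with open('Train.csv', newline='') as fh:
#     lengths = max_field_lengths(fh)
```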

Once I have the dictionary of maximum string-column lengths, I try to pass it to append via the min_itemsize parameter:

In [262]: for chunk in pd.read_csv('Train.csv', chunksize=50000, dtype=types):
   .....:     store.append('df', chunk, min_itemsize=lens)

Exception: cannot find the correct atom type -> [dtype->object,items->Index([Usa
geBand, saledate, fiModelDesc, fiBaseModel, fiSecondaryDesc, fiModelSeries, fiMo
delDescriptor, ProductSize, fiProductClassDesc, state, ProductGroup, ProductGrou
pDesc, Drive_System, Enclosure, Forks, Pad_Type, Ride_Control, Stick, Transmissi
on, Turbocharged, Blade_Extension, Blade_Width, Enclosure_Type, Engine_Horsepowe
r, Hydraulics, Pushblock, Ripper, Scarifier, Tip_Control, Tire_Size, Coupler, Co
upler_System, Grouser_Tracks, Hydraulics_Flow, Track_Type, Undercarriage_Pad_Wid
th, Stick_Length, Thumb, Pattern_Changer, Grouser_Type, Backhoe_Mounting, Blade_
Type, Travel_Controls, Differential_Type, Steering_Controls], dtype=object)] [va
lues_block_2] column has a min_itemsize of [64] but itemsize [58] is required!

The offending column was passed a min_itemsize of 64, yet the exception says an itemsize of 58 is required. Could this be a bug?

In [266]: pd.__version__
Out[266]: '0.11.0.dev-eb07c5a'