How to troubleshoot HDFStore Exception: cannot find the correct atom type
I'm looking for some general guidance on what kinds of data scenarios can cause this exception. I've tried massaging my data in various ways to no avail.
I've been searching on this exception for days, gone through several Google group discussions, and come up dry on how to debug it. The DataFrame, read from a CSV file of mixed data types, looks like this:
    Int64Index: 401125 entries, 0 to 401124
    Data columns:
    SalesID                     401125  non-null values
    SalePrice                   401125  non-null values
    MachineID                   401125  non-null values
    ModelID                     401125  non-null values
    datasource                  401125  non-null values
    auctioneerID                380989  non-null values
    YearMade                    401125  non-null values
    MachineHoursCurrentMeter    142765  non-null values
    UsageBand                   401125  non-null values
    saledate                    401125  non-null values
    fiModelDesc                 401125  non-null values
    Enclosure_Type              401125  non-null values
    ...................................................
    Stick_Length                401125  non-null values
    Thumb                       401125  non-null values
    Pattern_Changer             401125  non-null values
    Grouser_Type                401125  non-null values
    Backhoe_Mounting            401125  non-null values
    Blade_Type                  401125  non-null values
    Travel_Controls             401125  non-null values
    Differential_Type           401125  non-null values
    Steering_Controls           401125  non-null values
    dtypes: float64(2), int64(6), object(45)
The code that stores the DataFrame:
    In [30]: store = pd.HDFStore('test0.h5','w')

    In [31]: for chunk in pd.read_csv('Train.csv', chunksize=10000):
       ....:     store.append('df', chunk, index=False)
Note that if I use store.put on a DataFrame imported in one shot, it stores successfully.
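That is, something like the following works (a sketch; 'df_all' is an arbitrary key I chose, and put() writes the non-appendable fixed format rather than the table format):

    import pandas as pd

    store = pd.HDFStore('test0.h5', 'w')
    df = pd.read_csv('Train.csv')   # whole file imported in one shot
    store.put('df_all', df)         # fixed-format write succeeds
    store.close()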
Are there considerations around NaN values that could be raising this exception?
The exception:
    Exception: cannot find the correct atom type -> [dtype->object,items->Index([UsageBand,
    saledate, fiModelDesc, fiBaseModel, fiSecondaryDesc, fiModelSeries, fiModelDescriptor,
    ProductSize, fiProductClassDesc, state, ProductGroup, ProductGroupDesc, Drive_System,
    Enclosure, Forks, Pad_Type, Ride_Control, Stick, Transmission, Turbocharged,
    Blade_Extension, Blade_Width, Enclosure_Type, Engine_Horsepower, Hydraulics, Pushblock,
    Ripper, Scarifier, Tip_Control, Tire_Size, Coupler, Coupler_System, Grouser_Tracks,
    Hydraulics_Flow, Track_Type, Undercarriage_Pad_Width, Stick_Length, Thumb,
    Pattern_Changer, Grouser_Type, Backhoe_Mounting, Blade_Type, Travel_Controls,
    Differential_Type, Steering_Controls], dtype=object)]
    list index out of range
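One thing I can check is whether any object column contains cells that aren't plain strings (lists, or a str/float mix left over from NaN handling), since those have no fixed-width string atom. A diagnostic sketch; non_string_object_cols is just a helper I wrote for this, not a pandas API:

    import pandas as pd

    # Helper (mine, not a pandas API): report object columns whose
    # non-null cells are not all plain strings.
    def non_string_object_cols(df):
        bad = {}
        for col in df.columns[df.dtypes == object]:
            kinds = df[col].dropna().map(type).value_counts()
            if set(kinds.index) - {str}:
                bad[col] = dict(kinds)
        return bad

    for chunk in pd.read_csv('Train.csv', chunksize=10000):
        print(non_string_object_cols(chunk))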
Update 1
Jeff's tip about lists being stored in the DataFrame led me to investigate embedded commas. fiProductClassDesc does contain embedded commas:
    3     Hydraulic Excavator, Track - 12.0 to 14.0 Metric Tons
    6     Hydraulic Excavator, Track - 21.0 to 24.0 Metric Tons
    8     Hydraulic Excavator, Track - 3.0 to 4.0 Metric Tons
    11    Track Type Tractor, Dozer - 20.0 to 75.0 Horsepower
    12    Hydraulic Excavator, Track - 19.0 to 21.0 Metric Tons
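A quick sanity check to confirm the commas are inside properly quoted fields (i.e., parsed into a single column rather than splitting rows); the 100-row chunksize is arbitrary:

    # Grab one small chunk and count values with embedded commas:
    reader = pd.read_csv('Train.csv', chunksize=100)
    chunk = reader.get_chunk()
    print(chunk['fiProductClassDesc'].str.contains(',').sum())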
However, when I drop this column from the pd.read_csv chunks and append to my HDFStore, I still get the same exception. When I try to append each column individually, I get the following new exception:
    In [6]: for chunk in pd.read_csv('Train.csv', header=0, chunksize=50000):
       ...:     for col in chunk.columns:
       ...:         store.append(col, chunk[col], data_columns=True)

    Exception: cannot properly create the storer for: [_TABLE_MAP] [group->/SalesID
    (Group) '',value-><class 'pandas.core.series.Series'>,table->True,append->True,
    kwargs->{'data_columns': True}]
I'll continue troubleshooting. Here's a link to several hundred records:
https://docs.google.com/spreadsheet/ccc?key=0AutqBaUiJLbPdHFvaWNEMk5hZ1NTNlVyUVduYTZTeEE&usp=sharing
Update 2
OK, I tried the following on my work machine and got this result:
    In [4]: store = pd.HDFStore('test0.h5','w')

    In [5]: for chunk in pd.read_csv('Train.csv', chunksize=10000):
       ...:     store.append('df', chunk, index=False, data_columns=True)
       ...:

    Exception: cannot find the correct atom type -> [dtype->object,items->Index([fiBaseModel],
    dtype=object)]
    [fiBaseModel] column has a min_itemsize of [13] but itemsize [9] is required!
I think I know what's going on here. If I take the fiBaseModel field and look at the longest string in the first chunk:
    In [16]: lens = df.fiBaseModel.apply(lambda x: len(x))

    In [17]: max(lens[:10000])
    Out[17]: 9
And in the second chunk:
    In [18]: max(lens[10001:20000])
    Out[18]: 13
So the store table is created with 9 bytes for this column, because that's the maximum in the first chunk. When it then encounters a longer text field in a subsequent chunk, it throws the exception.
My suggestion would be to either truncate the data in subsequent chunks (with a warning), or let the user specify a maximum storage size for the column and truncate anything beyond it. Maybe pandas can do this already; I haven't had time to dive deeply into pytables yet.
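It turns out append does accept a min_itemsize argument (an int, or a dict mapping column names to byte widths) that pre-sizes string columns when the table is first created. A sketch of how that would look here; the width of 50 is my assumed upper bound, not something measured from the data:

    store = pd.HDFStore('test0.h5', 'w')

    for chunk in pd.read_csv('Train.csv', chunksize=10000):
        # Pre-size fiBaseModel so longer strings in later chunks still fit
        # (50 is an assumed upper bound, not measured from Train.csv).
        store.append('df', chunk, index=False, data_columns=True,
                     min_itemsize={'fiBaseModel': 50})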
Update 3
I tried importing the csv dataset with pd.read_csv, passing a dict mapping every column to 'object' as the dtype argument. I then iterate over the file and store each chunk into the HDFStore, passing a large value for min_itemsize. I get the following exception:
    AttributeError: 'NoneType' object has no attribute 'itemsize'
My simple code:
    store = pd.HDFStore('test0.h5','w')
    objects = dict((col, 'object') for col in header)

    for chunk in pd.read_csv('Train.csv', header=0, dtype=objects,
                             chunksize=10000, na_filter=False):
        store.append('df', chunk, min_itemsize=200)
I tried to debug and inspect the items in the stack trace. This is what the table looks like at the exception:
    ipdb> self.table
    /df/table (Table(10000,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "values_block_0": StringCol(itemsize=200, shape=(53,), dflt='', pos=1)}
      byteorder := 'little'
      chunkshape := (24,)
      autoIndex := True
      colindexes := {
        "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False}
Update 4
Now I'm trying to iteratively determine the length of the longest string in each object column of the DataFrame. This is how I do it:
    def f(x):
        if x.dtype != 'object':
            return
        else:
            return len(max(x.fillna(''), key=lambda x: len(str(x))))

    lengths = pd.DataFrame([chunk.apply(f) for chunk in
                            pd.read_csv('Train.csv', chunksize=50000)])

    lens = lengths.max().dropna().to_dict()

    In [255]: lens
    Out[255]:
    {'Backhoe_Mounting': 19.0,
     'Blade_Extension': 19.0,
     'Blade_Type': 19.0,
     'Blade_Width': 19.0,
     'Coupler': 19.0,
     'Coupler_System': 19.0,
     'Differential_Type': 12.0,
     ... etc ...
    }
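For comparison, a sketch of the same scan using pandas string methods (it assumes the object columns hold only strings or NaN; .str.len() is a real pandas API, the surrounding arrangement is mine):

    from collections import defaultdict

    import pandas as pd

    max_lens = defaultdict(int)
    for chunk in pd.read_csv('Train.csv', chunksize=50000):
        for col in chunk.columns[chunk.dtypes == object]:
            # .str.len() yields NaN for missing values, so guard the max
            m = chunk[col].str.len().max()
            if pd.notnull(m):
                max_lens[col] = max(max_lens[col], int(m))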
Once I had the dict of maximum string-column lengths, I tried to pass it to append via the min_itemsize argument:
    In [262]: for chunk in pd.read_csv('Train.csv', chunksize=50000, dtype=types):
       .....:     store.append('df', chunk, min_itemsize=lens)

    Exception: cannot find the correct atom type -> [dtype->object,items->Index([UsageBand,
    saledate, fiModelDesc, fiBaseModel, fiSecondaryDesc, fiModelSeries, fiModelDescriptor,
    ProductSize, fiProductClassDesc, state, ProductGroup, ProductGroupDesc, Drive_System,
    Enclosure, Forks, Pad_Type, Ride_Control, Stick, Transmission, Turbocharged,
    Blade_Extension, Blade_Width, Enclosure_Type, Engine_Horsepower, Hydraulics, Pushblock,
    Ripper, Scarifier, Tip_Control, Tire_Size, Coupler, Coupler_System, Grouser_Tracks,
    Hydraulics_Flow, Track_Type, Undercarriage_Pad_Width, Stick_Length, Thumb,
    Pattern_Changer, Grouser_Type, Backhoe_Mounting, Blade_Type, Travel_Controls,
    Differential_Type, Steering_Controls], dtype=object)]
    [values_block_2] column has a min_itemsize of [64] but itemsize [58] is required!
The offending column was passed a min_itemsize of 64, yet the exception states an itemsize of 58 is required. This may be a bug?
    In [266]: pd.__version__
    Out[266]: '0.11.0.dev-eb07c5a'
Answer

The link you provided stored the frame just fine. Column by column just means specifying data_columns=True. It will process the columns individually and raise on the offending one.
To diagnose:
    store = pd.HDFStore('test0.h5','w')

    In [31]: for chunk in pd.read_csv('Train.csv', chunksize=10000):
       ....:     store.append('df', chunk, index=False, data_columns=True)
In production you probably want to restrict data_columns to the columns you want to query on (data_columns can also be None, in which case you can query only on the index/columns).
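For instance, a sketch of the restricted form (I picked 'state' and 'YearMade' from the frame above purely for illustration, and the where-string syntax shown is from later pandas releases):

    store = pd.HDFStore('test0.h5', 'w')

    # Only 'state' and 'YearMade' become individually queryable;
    # the remaining columns are stored in shared value blocks.
    for chunk in pd.read_csv('Train.csv', chunksize=10000):
        store.append('df', chunk, index=False,
                     data_columns=['state', 'YearMade'])

    # Query on a data column without reading the whole table:
    recent = store.select('df', where='YearMade >= 2000')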
Update:
You might run into another issue: read_csv converts dtypes based on what it sees in each chunk. With a chunksize of 10,000, an append operation can fail because chunks 1 and 2 have integer-looking data in some columns while chunk 3 has some NaNs, making those columns float. Either specify the dtypes up front, use a larger chunksize, or run your operations twice to guarantee your dtypes between chunks. A sketch of the first option follows.
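Judging from the frame info above, the two float64 columns are presumably the ones with missing values (auctioneerID and MachineHoursCurrentMeter), so pinning them to float keeps every chunk's dtypes identical:

    import pandas as pd

    # Presumed from the frame info: the two columns with missing values
    # end up float64, so force them to float in every chunk.
    dtypes = {'auctioneerID': 'float64',
              'MachineHoursCurrentMeter': 'float64'}

    store = pd.HDFStore('test0.h5', 'w')
    for chunk in pd.read_csv('Train.csv', chunksize=10000, dtype=dtypes):
        store.append('df', chunk, index=False)
    store.close()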
In this case, I've updated pytables.py to raise a more helpful exception (and to tell you whether a column has incompatible data).
Thanks for the report!