HDFStore with string columns gives issues
I have a pandas DataFrame, myDF, which I store in an HDFStore like this:

    d = pandas.HDFStore("C:\\PF\\Temp.h5")
    d['test'] = myDF
I get this result:

    C:\PF\WinPython-64bit-3.3.3.3\python-3.3.3.amd64\lib\site-packages\pandas\io\pytables.py:2446: PerformanceWarning:
    your performance may suffer as PyTables will pickle object types that it cannot
    map directly to c-types [inferred_type->mixed,key->block2_values] [items->[0, 1, 3, 4, 5, 6, 9, 10, 292, 411, 412, 477, 478, 479, 495, 572, 581, 590, 599, 608, 617, 626, 635]]

      warnings.warn(ws, PerformanceWarning)
It looks like the problem is with every column that holds strings. For example, if I try

    myDF[0].dtype

I get

    Out[38]: dtype('O')
How can I fix this, i.e. change the dtype of the string columns?
*EDIT*
More information, as requested:

    >>> pandas.__version__
    Out[49]: '0.13.1'

    >>> tables.__version__
    Out[53]: '3.1.0'
The pandas DataFrame is built as follows:

    pandas.read_csv(fName,sep="|",header=None,low_memory=False)
When I try

    myDF.info()

I get

    Int64Index: 153895 entries, 0 to 153894
    Data columns (total 644 columns):
    0      object
    1      object
    2      int64
    3      object
    4      object
    5      object
    6      object
    7      int64
    8      float64
    9      object
    10     object
    11     float64
    12     float64
    13     float64
    14     float64
    ...
    ...
    642    float64
    643    float64
    dtypes: float64(619), int64(2), object(23)
All of the string columns have been read in as object.
This warning is raised only when a column contains mixed types: not just strings, but strings AND numbers.
    In [2]: DataFrame({ 'A' : [1.0,'foo'] }).to_hdf('test.h5','df',mode='w')
    pandas/io/pytables.py:2439: PerformanceWarning:
    your performance may suffer as PyTables will pickle object types that it cannot
    map directly to c-types [inferred_type->mixed,key->block0_values] [items->['A']]

      warnings.warn(ws, PerformanceWarning)

    In [3]: df = DataFrame({ 'A' : [1.0,'foo'] })

    In [4]: df
    Out[4]:
         A
    0    1
    1  foo

    [2 rows x 1 columns]

    In [5]: df.dtypes
    Out[5]:
    A    object
    dtype: object

    In [6]: df['A']
    Out[6]:
    0      1
    1    foo
    Name: A, dtype: object

    In [7]: df['A'].values
    Out[7]: array([1.0, 'foo'], dtype=object)
So you need to make sure that you don't mix types within a column.
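With 644 columns it can help to locate the offending ones programmatically. The following is only a sketch, not something from the session above: it assumes a pandas version that exposes `pandas.api.types.infer_dtype`, and `find_mixed_columns` and the `example` frame are hypothetical names made up for illustration. It lists the object-dtype columns whose inferred type is anything other than pure strings.

```python
import pandas as pd
from pandas.api.types import infer_dtype


def find_mixed_columns(df):
    """Return labels of object-dtype columns that are not purely strings.

    Hypothetical helper for illustration; missing values are ignored
    when inferring the type (skipna=True).
    """
    suspects = []
    for label in df.columns:
        column = df[label]
        if column.dtype == object and infer_dtype(column, skipna=True) != "string":
            suspects.append(label)
    return suspects


# Small stand-in frame: 'A' mixes a float with a string, 'B' holds only
# strings, 'C' holds only floats.
example = pd.DataFrame({
    "A": [1.0, "foo"],
    "B": ["x", "y"],
    "C": [1.5, 2.5],
})

mixed_columns = find_mixed_columns(example)
print(mixed_columns)
```

Running `find_mixed_columns(myDF)` on the real frame should narrow the 23 object columns down to the ones that actually need converting.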
If you have columns that need converting, you can do the following:

    In [9]: columns = ['A']

    In [10]: df.loc[:,columns] = df[columns].applymap(str)

    In [11]: df
    Out[11]:
         A
    0  1.0
    1  foo

    [2 rows x 1 columns]

    In [12]: df['A'].values
    Out[12]: array(['1.0', 'foo'], dtype=object)
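If the text columns are known up front, another option is to ask `read_csv` to parse them as strings in the first place. This is a sketch under assumptions: the in-memory data and the `string_columns` list are hypothetical stand-ins for the real file and column positions, and it is only a guess that empty fields (read as NaN) are what mixes floats into the string columns of the original data.

```python
import io

import pandas as pd

# Hypothetical stand-in for the pipe-separated file; the second row has
# an empty field in column 0.
raw = io.StringIO("abc|1|2.5\n|2|3.5\nfoo|3|4.5\n")

string_columns = [0]  # hypothetical: positions of the text columns

frame = pd.read_csv(
    raw,
    sep="|",
    header=None,
    dtype={col: str for col in string_columns},  # parse these as text
)

# Empty fields are still read as missing values, so replace them with
# empty strings to keep the columns purely textual.
frame[string_columns] = frame[string_columns].fillna("")

column_values = frame[0].tolist()
print(column_values)
```

Whether an empty string is an acceptable placeholder for a missing value depends on the data; the point is only that the column ends up holding one kind of value.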