Load file into pandas dataframe using pre-specified dtypes and replacing 'DIV0' strings with nan

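I'm trying to load a large (>10 GB) file into a pandas DataFrame. This currently takes several minutes, probably because pandas has to infer the data types. To speed this up and ideally reduce the memory footprint, I'd like to specify the dtype of every column in the file up front. I tried to do this by loading the file once and recording the dtypes pandas assigned, but the file contains some 'DIV0' values that need to be replaced: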

import pandas as pd

# First pass: let pandas infer the dtypes, then record them per column
df = pd.read_csv(data_path + data_file_name, index_col = None)
dtype_df = pd.DataFrame(df.dtypes)
dtype_dict = dtype_df.to_dict()[0]

dtype_dict

> {'CEO_Comp': dtype('float64'), 'aq_accounts_payable':
> dtype('float64'), 'aq_accounts_payable_ranked':
> dtype('float64'), 'aq_accounts_receivable': dtype('float64'),
> 'aq_accounts_receivable_ranked': dtype('float64'), ...

df2 = pd.read_csv(data_path + data_file_name, index_col = None, dtype = dtype_dict)

...
ValueError: could not convert string to float: 'DIV0'

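Unfortunately, some fields apparently still contain strings such as 'DIV0'. How can I handle these on load? Can they be treated as NaN while the file is being read, or do I have to preprocess the file?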

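Secondly, can I replace all float64 and int64 dtypes with float32 and int32? I don't need 64-bit precision, and I assume this would noticeably reduce the memory and performance overhead?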

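In addition to the answer from Milouga below, in case anyone else has a similar problem: I went on to use the code below to change the dtypes from 64-bit to 32-bit, save the dtype dict as a pickle, and then reload it, so that the CSV is loaded as 32-bit every time from then on: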

import pickle
dtype_df = pd.DataFrame(df.dtypes)
dtype_df.replace(['float64', 'int64'], ['float32', 'int32'], inplace = True)
dtype_dict = dtype_df.to_dict()[0]

# Pickle dict
with open(data_path + 'monthlies/' + 'dtype_dict.pkl', 'wb') as handle:
    pickle.dump(dtype_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Load dict (from the same file that was written above)
with open(data_path + 'monthlies/' + 'dtype_dict.pkl', 'rb') as handle:
    dtype_dict = pickle.load(handle)

Then reload with:

df = pd.read_csv(data_file, index_col = None, na_values = 'DIV0', dtype = dtype_dict, encoding='iso-8859-1')

You can also pass usecols = ['date', 'column_a', 'column_b' ...] etc. to read_csv to load only the columns you need.
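As a minimal sketch of how these options combine, the snippet below reads a small in-memory CSV instead of the real file. The sample data, the column names and the dtype mapping are made up for illustration; only usecols, na_values and dtype are the read_csv parameters discussed here.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the real (much larger) file
csv_text = """date,column_a,column_b,column_c
2020-01-31,1.5,DIV0,7
2020-02-29,DIV0,2.25,8
"""

# Assumed dtype mapping; in practice this would be the unpickled dtype_dict
dtype_dict = {"column_a": "float32", "column_b": "float32"}

df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["date", "column_a", "column_b"],  # column_c is not loaded
    na_values="DIV0",                          # 'DIV0' is read as NaN
    dtype=dtype_dict,
)

print(df.dtypes)
print(df)
```

Because the 'DIV0' cells become NaN during parsing, the float32 conversion no longer hits a string it cannot convert.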

Use the na_values parameter of read_csv. From the documentation:

na_values : scalar, str, list-like, or dict, default None

Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘nan’.
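The dict form mentioned in the documentation can be useful when 'DIV0' should only count as missing in some columns. A small sketch, assuming a made-up two-column file in which 'DIV0' is an error marker in ratio but a legitimate string in label:

```python
import io

import pandas as pd

# Hypothetical data: 'DIV0' appears in both columns
csv_text = """ratio,label
0.5,DIV0
DIV0,ok
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    na_values={"ratio": ["DIV0"]},  # only applies to the 'ratio' column
    dtype={"ratio": "float64"},
)

print(df)
```

Here ratio parses as float with one NaN, while label keeps its 'DIV0' string.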

As for the second question, you can replace those dtypes with float32 and int32 in the dtype dict you created.
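One possible way to build such a dict is a plain dict comprehension over df.dtypes, sketched below on a tiny made-up DataFrame (the column names and the narrow mapping are illustrative, not taken from the question). The memory comparison at the end is only there to show the effect.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the one loaded with inferred dtypes
df = pd.DataFrame({
    "price": np.array([1.5, 2.5, np.nan], dtype="float64"),
    "count": np.array([1, 2, 3], dtype="int64"),
    "flag": np.array([True, False, True], dtype="bool"),
})

# Map 64-bit dtypes to 32-bit ones; any other dtype is passed through unchanged
narrow = {"float64": "float32", "int64": "int32"}
dtype_dict = {col: narrow.get(str(dt), dt) for col, dt in df.dtypes.items()}

small = df.astype(dtype_dict)

numeric = ["price", "count"]
before = df[numeric].memory_usage(index=False).sum()
after = small[numeric].memory_usage(index=False).sum()
print(dtype_dict)
print(before, after)
```

A few caveats: float32 keeps only about 7 significant decimal digits, int32 overflows beyond roughly ±2.1 billion, and a column that contains NaN cannot be read as a NumPy integer dtype, so such columns need to stay float.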