How to stream in and manipulate a large data file in Python

Tags: dataframe, itertools, pandas, python

I have a relatively large (1 GB) text file that I want to reduce in size by summing across categories:

Geography AgeGroup Gender Race Count
County1 1 M 1 12
County1 2 M 1 3
County1 2 M 2 0

to:

Geography Count
County1 15
County2 23

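This would be simple if the whole file fit in memory, but pandas.read_csv() raises a MemoryError. So I have been looking into other methods, and there seem to be many options. HDF5? itertools (which looks complicated; generators?)? Or just using the standard file methods to read in the first geography (70 lines), sum the Count column, and write the result out before loading the next 70 lines?

Does anyone have suggestions on the best way to do this? I especially like the idea of streaming the data in, because I can think of many other places where that would be useful. I am most interested in that method, or in one that similarly uses the most basic functionality possible.

Edit: In this small case I only want the sums of Count by geography. However, it would be ideal if I could read in a chunk, specify any function (say, add two columns together, or take the max of a column by geography), apply the function, and write the output before reading in a new chunk.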

Related discussion

You can use dask.dataframe, which is syntactically similar to pandas but performs its operations out-of-core, so memory should not be an issue:

import dask.dataframe as dd

df = dd.read_csv('my_file.csv')
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')

Alternatively, if pandas is a requirement, you can use chunked reads, as mentioned by @chrisaycock. You may want to experiment with the chunksize parameter.

import pandas as pd

# Operate on chunks.
data = []
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
    # Reduce each chunk to one row per Geography before keeping it.
    chunk = chunk.groupby('Geography', as_index=False)['Count'].sum()
    data.append(chunk)

# Combine the chunked data.
df = pd.concat(data, ignore_index=True)
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')
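
Since the question asks for the most basic functionality possible, the same running-sum idea can also be sketched with only the standard library. This is a minimal sketch, not part of the answer above: it assumes a comma-delimited file with a header row, and the function name sum_counts_by_geography and the in-memory sample are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def sum_counts_by_geography(lines):
    """Stream rows one at a time and keep only one running total per Geography."""
    totals = defaultdict(int)
    for row in csv.DictReader(lines):
        totals[row["Geography"]] += int(row["Count"])
    return dict(totals)


# Small in-memory sample standing in for the real file (hypothetical data).
# With a real file, pass the object returned by open(path, newline="") instead.
sample = io.StringIO(
    "Geography,AgeGroup,Gender,Race,Count\n"
    "County1,1,M,1,12\n"
    "County1,2,M,1,3\n"
    "County1,2,M,2,0\n"
    "County2,1,M,1,23\n"
)
totals = sum_counts_by_geography(sample)
print(totals)
```

Only one row and one integer per distinct geography are held in memory at any time, so the size of the input file should not matter much.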

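I like @root's solution, but I would go a bit further in optimizing memory usage: keep only the aggregated DF in memory and read only the columns you really need: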

import pandas as pd

filename = 'my_file.csv'  # placeholder: path to the input CSV
cols = ['Geography', 'Count']
df = pd.DataFrame()

chunksize = 2  # adjust it! for example --> 10**5
for chunk in pd.read_csv(filename, usecols=cols, chunksize=chunksize):
    # merge previously aggregated DF with a new portion of data and aggregate it again
    df = (pd.concat([df,
                     chunk.groupby('Geography')['Count'].sum().to_frame()])
            .groupby(level=0)['Count']
            .sum()
            .to_frame()
         )

df.reset_index().to_csv('c:/temp/result.csv', index=False)
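
The question's edit asks for a way to plug in other functions, such as the max by geography. One possible generalization of the loop above is sketched here; the helper aggregate_in_chunks and its parameters are hypothetical names, and the sketch assumes the aggregation can be re-applied to its own partial results (true for "sum", "max" and "min", not for "mean").

```python
import io

import pandas as pd


def aggregate_in_chunks(source, key, value, how, chunksize):
    """Aggregate `value` by `key` chunk by chunk, keeping only the running result."""
    running = None
    reader = pd.read_csv(source, usecols=[key, value], chunksize=chunksize)
    for chunk in reader:
        partial = chunk.groupby(key)[value].agg(how)
        if running is None:
            running = partial
        else:
            # Re-aggregate the running result together with the new partial result.
            running = pd.concat([running, partial]).groupby(level=0).agg(how)
    return running


# Hypothetical sample data; a file path works in place of the StringIO objects.
sample = (
    "Geography,AgeGroup,Gender,Race,Count\n"
    "County1,1,M,1,12\n"
    "County2,2,M,1,3\n"
    "County1,1,M,1,12\n"
    "County2,2,M,1,33\n"
    "County3,2,M,2,11\n"
)
sums = aggregate_in_chunks(io.StringIO(sample), "Geography", "Count", "sum", chunksize=2)
maxes = aggregate_in_chunks(io.StringIO(sample), "Geography", "Count", "max", chunksize=2)
print(sums)
print(maxes)
```

For a mean, one option would be to track the sum and the row count separately with this kind of loop and divide at the end.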

Test data:

Geography,AgeGroup,Gender,Race,Count
County1,1,M,1,12
County2,2,M,1,3
County3,2,M,2,0
County1,1,M,1,12
County2,2,M,1,33
County3,2,M,2,11
County1,1,M,1,12
County2,2,M,1,111
County3,2,M,2,1111
County5,1,M,1,12
County6,2,M,1,33
County7,2,M,2,11
County5,1,M,1,12
County8,2,M,1,111
County9,2,M,2,1111

Output (result.csv):

Geography,Count
County1,36
County2,147
County3,1122
County5,24
County6,33
County7,11
County8,111
County9,1111

PS: With this approach you can process huge files.

PPS: The chunking approach should work unless you need to sort your data. In that case I would use classic UNIX tools such as awk and sort to sort the data first.
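
Once the rows are sorted by Geography (the sample in the question already looks grouped that way), each group is contiguous and can be summed and written out before the next one is read. Here is a minimal sketch with itertools.groupby, which the question mentions; it assumes comma-delimited input with a header row, and stream_group_sums and the sample data are hypothetical.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter


def stream_group_sums(lines):
    """Yield (geography, total) pairs; assumes rows are already sorted by Geography."""
    reader = csv.DictReader(lines)
    for geography, rows in groupby(reader, key=itemgetter("Geography")):
        yield geography, sum(int(row["Count"]) for row in rows)


# Hypothetical sorted sample; use open(path, newline="") for a real file.
sorted_sample = io.StringIO(
    "Geography,AgeGroup,Gender,Race,Count\n"
    "County1,1,M,1,12\n"
    "County1,2,M,1,3\n"
    "County1,2,M,2,0\n"
    "County2,1,M,1,23\n"
)

out = io.StringIO()  # stands in for the output file
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["Geography", "Count"])
for geography, total in stream_group_sums(sorted_sample):
    # Each total is written as soon as its group ends.
    writer.writerow([geography, total])
result = out.getvalue()
print(result)
```

If the input is not sorted, groupby will emit the same geography more than once, so this only works after the sort step described above.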

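I would also recommend using PyTables (HDF5 storage) instead of CSV files. It lets you read data conditionally (using the where parameter), which is very handy and saves a lot of resources, and it is usually much faster than CSV.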