Python: Opening a 25GB text file for processing

performance python

I have a 25GB file that I need to process. Here is what I'm currently doing, but it takes an extremely long time to open:

import os

collection_pricing = os.path.join(pricing_directory, 'collection_price')
with open(collection_pricing, 'r') as f:
    # readlines() loads the entire 25GB file into memory at once
    collection_contents = f.readlines()

length_of_file = len(collection_contents)

for num, line in enumerate(collection_contents):
    # prints progress for every single line
    print('%s / %s' % (num+1, length_of_file))
    cursor.execute(...)

How can I improve this?

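Related discussion

Unless the lines in the file are really, really big, do not print the progress on every line. Printing to the terminal is very slow. Print progress every 100 or every 1000 lines, for example.

Use the available operating system facilities to get the size of the file: os.path.getsize(). See "Getting file size in Python?"

Get rid of readlines() to avoid reading 25GB into memory. Instead, read and process the file line by line. See "How to read large file, line by line in python".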

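Related discussion

Make two passes over the file: one to count the lines and one to do the printing. Don't call readlines on a file this large; you will end up swapping everything to disk. (In fact, don't call readlines in general. It's silly.)

(By the way, I'm assuming you are actually doing something with the lines rather than just counting them. The code you posted doesn't use anything from the file other than the number of newlines in it.)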
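The two-pass idea can be sketched roughly as follows. This is only an illustration under some assumptions: `count_lines`, `process_with_line_progress`, and the `handle_line` callback are hypothetical names, with `handle_line` standing in for whatever per-line work (such as the `cursor.execute(...)` call) the real code does. The first pass reads fixed-size chunks in binary mode and counts newline bytes, so memory use stays small regardless of file size.

```python
def count_lines(path, chunk_size=1024 * 1024):
    """First pass: count newline bytes without holding the file in memory.

    Note: a final line that lacks a trailing newline is not counted.
    """
    count = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b'\n')
    return count


def process_with_line_progress(path, handle_line, report_every=10000):
    """Second pass: stream the lines and report progress by line number.

    handle_line is a hypothetical callback for the per-line work.
    """
    total = count_lines(path)
    with open(path, 'r') as f:
        for num, line in enumerate(f, start=1):
            handle_line(line)
            if num % report_every == 0:
                print('%d / %d' % (num, total))
    return total
```

The cost of this approach is that the file is read from disk twice, which on a 25GB file is not negligible.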

Combining the answers above, here is how I modified it.

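import os

# total size in bytes, obtained without reading the file
size_of_file = os.path.getsize(collection_pricing)
progress = 0
line_count = 0

with open(collection_pricing, 'r') as f:
    # iterating over the file object reads one line at a time
    for line in f:
        line_count += 1
        progress += len(line)
        # report only once every 10000 lines
        if line_count % 10000 == 0:
            print('%s / %s' % (progress, size_of_file))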

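This has the following improvements:

It does not use readlines(), so it does not store everything in memory
It prints only once every 10000 lines
It uses the file size rather than the line count to measure progress, so it does not have to iterate over the file twice.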
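One caveat with measuring progress against the file size: under Python 3, a file opened in text mode yields decoded strings, so len(line) counts characters rather than bytes, and the running total may never reach the value from os.path.getsize() if the file contains multi-byte characters or translated line endings. A possible variant, sketched below, opens the file in binary mode so both numbers are in bytes. The function name `process_with_byte_progress`, the `handle_line` callback, and the UTF-8 encoding are assumptions made for illustration, not part of the original answer.

```python
import os


def process_with_byte_progress(path, handle_line, report_every=10000):
    """Stream a file line by line, reporting progress in bytes.

    handle_line is a hypothetical callback for the per-line work.
    The file is assumed to be UTF-8 encoded.
    """
    size_of_file = os.path.getsize(path)
    bytes_read = 0
    with open(path, 'rb') as f:
        for line_count, raw_line in enumerate(f, start=1):
            # raw_line is bytes, so its length is comparable to the file size
            bytes_read += len(raw_line)
            handle_line(raw_line.decode('utf-8'))
            if line_count % report_every == 0:
                percent = 100.0 * bytes_read / size_of_file
                print('%d / %d (%.1f%%)' % (bytes_read, size_of_file, percent))
    return bytes_read, size_of_file
```

When the loop finishes, the two returned numbers should be equal, which makes the percentage meaningful all the way to 100.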