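How to get line count cheaply in Python?

I need to get the line count of a large file (hundreds of thousands of lines) in Python. What is the most efficient way, both memory-wise and time-wise?

At the moment I do this: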

    def file_len(fname):
        with open(fname) as f:
            for i, l in enumerate(f):
                pass
        # note: for an empty file i is never assigned, so this raises UnboundLocalError
        return i + 1

Is it possible to do any better?

One line, probably pretty fast:

    num_lines = sum(1 for line in open('myfile.txt'))

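Related discussion

Nice, and it also works for empty files.
How does it work?
It is similar to sum(sequence of 1s): every line counts as 1. >>> [1 for line in range(10)] gives [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] and >>> sum(1 for line in range(10)) gives 10.
num_lines = sum(1 for line in open('myfile.txt') if line.rstrip()) filters out empty lines.
When we open a file, is it closed automatically once we have iterated over all its elements? Is "close()" required? I think we cannot use 'with open()' in this short statement, right?
An explanation of why it works would be very helpful for those who grab this answer to solve their problem quickly.
@Mannaggia you are correct, it is better to use "with open(filename)" to make sure the file is closed when done, and better still to do this inside a try-except block, since an IOError exception is raised if the file cannot be opened.
Another thing to note: on a 300-thousand-line text file this is about 0.04-0.05 seconds slower than the code in the original question.
@Andrew, are you sure you tested that... scientifically?
If you use enumerate, you don't need the sum. Unless you use a list comprehension, the count is still available after the for loop: for num_lines, _ in enumerate(open("file.txt")): pass
Could you explain what the 1 does in this line? num_lines = sum(1 for line in open("myfile.txt")) ... (still a beginner here). Could you explain how this line of code counts the lines in the file? I don't know what the "1" is or what it is for. Thanks
@stryker 1 for line in open(..) basically produces a list of 1s, one per line (though not really a list, because it is a generator). So if the text file contains three lines, [1 for line in open(...)] would be [1, 1, 1]: for each line, a 1 is added to the list. That is then passed to sum(), which adds up all the values in the iterable. So sum([1,2,3]) would be 6. In the earlier example the text has three lines, so we get [1,1,1]; summed, that gives 3, which is of course the number of lines. It may seem redundant, but it is cheap on memory.
"Probably pretty fast". Less code does not mean more efficient code.
How about using len() instead of sum(), e.g. len([l for l in open('myfile.txt')])?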
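
The comments above ask whether the one-liner closes the file and whether it can be combined with with open(). A minimal sketch, assuming a hypothetical helper name count_lines that does not appear in the thread, keeps the generator expression but wraps it in a with block so the file is closed even if iteration fails:

```python
def count_lines(path):
    """Count lines lazily; the with block closes the file when done."""
    with open(path) as f:
        # The generator yields one 1 per line, so only one line is held in memory at a time.
        return sum(1 for _ in f)
```

Unlike the len([...]) variant suggested in the last comment, this never builds a list of all lines.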

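You can't get any better than that.

After all, any solution has to read the entire file, figure out how many \n it contains, and return that result.

Do you have a better way of doing that without reading the entire file? Not sure... The best solution will always be I/O-bound; the best you can do is make sure you don't use unnecessary memory, and it looks like you have that covered.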

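Related discussion

I believe a memory-mapped file will be the fastest solution. I tried four functions: the function posted by the OP (opcount); a simple iteration over the lines in the file (simplecount); readline on a memory-mapped file (mmap) (mapcount); and the buffer read solution offered by Mykola Kharechko (bufcount).

I ran each function five times and calculated the average run time for a 1.2-million-line text file.

Windows XP, Python 2.5, 2 GB RAM, 2 GHz AMD processor

Here are my results (average seconds per run):

    mapcount    : 0.465599966049
    simplecount : 0.756399965286
    bufcount    : 0.546800041199
    opcount     : 0.718600034714

Edit: numbers for Python 2.6:

    mapcount    : 0.471799945831
    simplecount : 0.634400033951
    bufcount    : 0.468800067902
    opcount     : 0.602999973297

So the buffer read strategy seems to be the fastest for Windows/Python 2.6.

Here is the code: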

    from __future__ import with_statement
    import time
    import mmap
    from collections import defaultdict

    def mapcount(filename):
        # readline() over a memory-mapped view of the file
        f = open(filename, "r+")
        buf = mmap.mmap(f.fileno(), 0)
        lines = 0
        readline = buf.readline
        while readline():
            lines += 1
        return lines

    def simplecount(filename):
        # plain iteration over the file object
        lines = 0
        for line in open(filename):
            lines += 1
        return lines

    def bufcount(filename):
        # read fixed-size chunks and count the newline characters in each
        f = open(filename)
        lines = 0
        buf_size = 1024 * 1024
        read_f = f.read  # loop optimization

        buf = read_f(buf_size)
        while buf:
            lines += buf.count('\n')
            buf = read_f(buf_size)

        return lines

    def opcount(fname):
        # the function from the question
        with open(fname) as f:
            for i, l in enumerate(f):
                pass
        return i + 1

    counts = defaultdict(list)

    for i in range(5):
        for func in [mapcount, simplecount, bufcount, opcount]:
            start_time = time.time()
            assert func("big_file.txt") == 1209138
            counts[func].append(time.time() - start_time)

    for key, vals in counts.items():
        print("%s : %s" % (key.__name__, sum(vals) / float(len(vals))))

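Related discussion

I had to post this on a similar question until my reputation score jumped a bit (thanks to whoever bumped me!).

All of these solutions ignore one way to make this run considerably faster, namely using the unbuffered (raw) interface, using bytearrays, and doing your own buffering. (This only applies in Python 3. In Python 2, the raw interface may or may not be used by default, but in Python 3 you default to Unicode.)

Using a modified version of the timing tool, I believe the following code is faster (and marginally more Pythonic) than any of the solutions offered: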

    def rawcount(filename):
        f = open(filename, 'rb')
        lines = 0
        buf_size = 1024 * 1024
        read_f = f.raw.read  # read through the raw, unbuffered interface

        buf = read_f(buf_size)
        while buf:
            lines += buf.count(b'\n')
            buf = read_f(buf_size)

        return lines

Using a separate generator function, this runs a smidge faster:

    def _make_gen(reader):
        # yield 1 MiB chunks until the reader returns an empty result
        b = reader(1024 * 1024)
        while b:
            yield b
            b = reader(1024 * 1024)

    def rawgencount(filename):
        f = open(filename, 'rb')
        f_gen = _make_gen(f.raw.read)
        return sum(buf.count(b'\n') for buf in f_gen)

This can be done entirely with inline generator expressions using itertools, but it gets pretty weird looking:

    from itertools import takewhile, repeat

    def rawincount(filename):
        f = open(filename, 'rb')
        # keep reading 1 MiB chunks until read() returns an empty bytes object
        bufgen = takewhile(lambda x: x, (f.raw.read(1024 * 1024) for _ in repeat(None)))
        return sum(buf.count(b'\n') for buf in bufgen)

Here are my timings:

    function      average, s  min, s  ratio
    rawincount    0.0043      0.0041  1.00
    rawgencount   0.0044      0.0042  1.01
    rawcount      0.0048      0.0045  1.09
    bufcount      0.008       0.0068  1.64
    wccount       0.01        0.0097  2.35
    itercount     0.014       0.014   3.41
    opcount       0.02        0.02    4.83
    kylecount     0.021       0.021   5.05
    simplecount   0.022       0.022   5.25
    mapcount      0.037       0.031   7.46
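
Timings like these depend heavily on the machine, the file, and the state of the OS file cache. A rough sketch of how such a comparison could be reproduced with the standard timeit module follows. The helper name compare and its parameters are assumptions for illustration; this is not the timing tool that produced the table above.

```python
import timeit

def compare(funcs, filename, repeat=5):
    """Return {function name: (average seconds, minimum seconds)} for each counter."""
    results = {}
    for func in funcs:
        # Each measurement is one full pass of func over the file.
        times = timeit.repeat(lambda: func(filename), number=1, repeat=repeat)
        results[func.__name__] = (sum(times) / len(times), min(times))
    return results
```

Called as compare([rawcount, bufcount], "big_file.txt"), it would return the average and best time for each function; the minimum is usually the more stable figure because it is least affected by other activity on the machine.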

Related discussion

You can execute a subprocess and run wc -l filename.

    import subprocess

    def file_len(fname):
        p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE,
                                                  stderr=subprocess.PIPE)
        result, err = p.communicate()
        if p.returncode != 0:
            raise IOError(err)
        # wc prints "<count> <filename>"; keep only the count
        return int(result.strip().split()[0])

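Related discussion

Here is a Python program that uses the multiprocessing library to distribute the line counting across machines/cores. In my test, on an 8-core Windows 64 server, counting a 20-million-line file went from 26 seconds down to 7 seconds. Note: not using memory mapping makes things much slower.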

    import multiprocessing, sys, time, os, mmap
    import logging, logging.handlers

    def init_logger(pid):
        console_format = 'P{0} %(levelname)s %(message)s'.format(pid)
        logger = logging.getLogger()  # New logger at root level
        logger.setLevel( logging.INFO )
        logger.handlers.append( logging.StreamHandler() )
        logger.handlers[0].setFormatter( logging.Formatter( console_format, '%d/%m/%y %H:%M:%S' ) )

    def getFileLineCount( queues, pid, processes, file1 ):
        init_logger(pid)
        logging.info( 'start' )

        physical_file = open(file1, "r")
        # mmap.mmap(fileno, length[, tagname[, access[, offset]]]

        m1 = mmap.mmap( physical_file.fileno(), 0, access=mmap.ACCESS_READ )

        #work out file size to divide up line counting

        fSize = os.stat(file1).st_size
        chunk = (fSize / processes) + 1

        lines = 0

        #get where I start and stop
        _seedStart = chunk * (pid)
        _seekEnd = chunk * (pid+1)
        seekStart = int(_seedStart)
        seekEnd = int(_seekEnd)

        if seekEnd < int(_seekEnd + 1):
            seekEnd += 1

        if _seedStart < int(seekStart + 1):
            seekStart += 1

        if seekEnd > fSize:
            seekEnd = fSize

        #find where to start
        if pid > 0:
            m1.seek( seekStart )
            #read next line
            l1 = m1.readline()  # need to use readline with memory mapped files
            seekStart = m1.tell()

        #tell previous rank my seek start to make their seek end

        if pid > 0:
            queues[pid-1].put( seekStart )
        if pid < processes-1:
            seekEnd = queues[pid].get()

        m1.seek( seekStart )
        l1 = m1.readline()

        while len(l1) > 0:
            lines += 1
            l1 = m1.readline()
            if m1.tell() > seekEnd or len(l1) == 0:
                break

        logging.info( 'done' )
        # add up the results
        if pid == 0:
            for p in range(1,processes):
                lines += queues[0].get()
            queues[0].put(lines)  # the total lines counted
        else:
            queues[0].put(lines)

        m1.close()
        physical_file.close()

    if __name__ == '__main__':
        init_logger( 'main' )
        if len(sys.argv) > 1:
            file_name = sys.argv[1]
        else:
            logging.fatal( 'parameters required: file-name [processes]' )
            exit()

        t = time.time()
        processes = multiprocessing.cpu_count()
        if len(sys.argv) > 2:
            processes = int(sys.argv[2])
        queues=[]  # a queue for each process
        for pid in range(processes):
            queues.append( multiprocessing.Queue() )
        jobs=[]
        prev_pipe = 0
        for pid in range(processes):
            p = multiprocessing.Process( target = getFileLineCount, args=(queues, pid, processes, file_name,) )
            p.start()
            jobs.append(p)

        jobs[0].join()  #wait for counting to finish
        lines = queues[0].get()

        logging.info( 'finished {} Lines:{}'.format( time.time() - t, lines ) )

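Related discussion

I would use Python's file object method readlines, as follows:

    with open(input_file) as foo:
        lines = len(foo.readlines())

This opens the file, creates a list of the lines in the file, counts the length of the list, saves that to a variable, and closes the file again.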

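Related discussion

Here is what I use; it seems pretty clean:

    import subprocess

    def count_file_lines(file_path):
        """
        Counts the number of lines in a file using wc utility.
        :param file_path: path to file
        :return: int, no of lines
        """
        num = subprocess.check_output(['wc', '-l', file_path])
        # split on any whitespace so this works on both str and bytes output
        num = num.split()
        return int(num[0])

Update: this is marginally faster than using pure Python, but at the cost of memory usage. While it executes your command, subprocess forks a new process with the same memory footprint as the parent process.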

Related discussion

    def file_len(full_path):
        """ Count number of lines in a file."""
        f = open(full_path)
        nr_of_lines = sum(1 for line in f)
        f.close()
        return nr_of_lines

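Related discussion

I got a small (4-8%) improvement with this version, which re-uses a constant buffer and so should avoid any memory or GC overhead:

    lines = 0
    buffer = bytearray(2048)
    # readinto() is only available on binary files; it returns the number of bytes read
    with open(filename, 'rb') as f:
        n = f.readinto(buffer)
        while n > 0:
            # count only within the n bytes just read, not stale data left in the buffer
            lines += buffer.count(b'\n', 0, n)
            n = f.readinto(buffer)

You can play around with the buffer size and maybe see a little improvement.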

Related discussion

Kyle's answer

    num_lines = sum(1 for line in open('my_file.txt'))

is probably the best choice; an alternative is

    num_lines = len(open('my_file.txt').read().splitlines())

Here is a performance comparison of the two:

    In [20]: timeit sum(1 for line in open('Charts.ipynb'))
    100000 loops, best of 3: 9.79 μs per loop

    In [21]: timeit len(open('Charts.ipynb').read().splitlines())
    100000 loops, best of 3: 12 μs per loop

A one-line bash solution similar to this answer, using the modern subprocess.check_output function:

    import subprocess

    def line_count(file):
        return int(subprocess.check_output('wc -l {}'.format(file), shell=True).split()[0])

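Related discussion

One-line solution:

    import os
    os.system("wc -l filename")

My snippet:

    os.system('wc -l *.txt')

which prints:

    0 bar.txt
    1000 command.txt
    3 test_file.txt
    1003 total

Related discussion

Good idea, but unfortunately this does not work on Windows.

If you want to be a Python surfer, say goodbye to Windows. Believe me, you will thank me one day.

I just thought it worth noting that this limitation applies only to Windows. I prefer working on a Linux/Unix stack myself, but IMHO when writing software one should consider the side effects a program may have when run under different operating systems. Since the OP did not mention his platform, I wanted to add the note in case someone lands on this solution via Google and copies it without knowing the limitations a Windows system might have.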
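
One way to reconcile the wc -l answers with the portability concern raised above is to use wc only when it is actually available. The sketch below is an illustration under assumptions, not code from the thread: the name portable_line_count is hypothetical, and it relies on shutil.which (Python 3.3+) to detect the executable, falling back to counting newline bytes in pure Python otherwise.

```python
import shutil
import subprocess

def portable_line_count(path):
    """Use wc -l when it is on PATH, otherwise count newlines in pure Python."""
    wc = shutil.which("wc")  # None when no wc executable is found (typical on Windows)
    if wc is not None:
        out = subprocess.check_output([wc, "-l", path])
        return int(out.split()[0])  # wc prints "<count> <path>"
    count = 0
    with open(path, "rb") as f:
        # Read 1 MiB chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            count += chunk.count(b"\n")
    return count
```

Both branches count newline characters, so they agree with each other, including for a file whose last line has no trailing newline.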

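This is the fastest thing I have found using pure Python. You can use whatever amount of memory you want by setting buffer, though 2**16 appears to be a sweet spot on my computer.

    from functools import partial

    buffer = 2**16
    with open(myfile) as f:
        # iter() keeps calling f.read(buffer) until it returns the sentinel ''
        print(sum(x.count('\n') for x in iter(partial(f.read, buffer), '')))

I found the answer here: "Why is reading lines from stdin much slower in C++ than Python?" and tweaked it just a tiny bit. It is a very good read for understanding how to count lines quickly, though wc -l is still about 75% faster than anything else.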

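This code is short and clear. It is probably the best way:

    num_lines = open('yourfile.ext').read().count('\n')

Related discussion

You should also close the file.

This simple script works for small files.

It will load the whole file into memory.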

To complete the methods above, I tried a variant using the fileinput module:

    import fileinput as fi

    def filecount(fname):
        for line in fi.input(fname):
            pass
        return fi.lineno()

And passed a 60-million-line file to all of the methods above:

    mapcount    : 6.1331050396
    simplecount : 4.588793993
    opcount     : 4.42918205261
    filecount   : 43.2780818939
    bufcount    : 0.170812129974

It surprised me a little that fileinput is that bad and scales far worse than all the other methods...

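Simple method:

    num_lines = len(list(open('myfile.txt')))

Related discussion

In this example the file is not closed.

Maybe for small files...

The OP wanted something memory efficient. This is definitely not it.

    print(open('file.txt', 'r').read().count("\n") + 1)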

For me this variant will be the fastest:

    #!/usr/bin/env python

    def main():
        f = open('filename')
        lines = 0
        buf_size = 1024 * 1024
        read_f = f.read  # loop optimization

        buf = read_f(buf_size)
        while buf:
            lines += buf.count('\n')
            buf = read_f(buf_size)

        print(lines)

    if __name__ == '__main__':
        main()

Reason: buffered reads are faster than reading line by line, and string.count is also very fast.

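Related discussion

Is it? At least on OSX/Python 2.5, the OP's version is still about 10% faster according to timeit.py.

Maybe, I did not test it.

What if the last line does not end in '\n'?

I don't know how you tested it, dF, but on my machine it is about 2.5 times slower than any other option.

You state that it will be the fastest and then state that you have not tested it. Not very scientific, eh? :)

See the solution and statistics provided in Ryan Ginstrom's answer below. Also check out J.F. Sebastian's comment and link on the same answer.

It shows that mapcount() and wccount() are both faster than bufcount, although bufcount does seem to be faster than opcount and simplecount.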

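My modification of the buffer case is as follows:

    def CountLines(filename):
        f = open(filename)
        try:
            # start at 1 so a final line without a trailing '\n' is counted;
            # a file that does end with '\n' is therefore reported one higher than wc -l
            lines = 1
            buf_size = 1024 * 1024
            read_f = f.read  # loop optimization
            buf = read_f(buf_size)

            # Empty file
            if not buf:
                return 0

            while buf:
                lines += buf.count('\n')
                buf = read_f(buf_size)

            return lines
        finally:
            f.close()

Now empty files and a last line without '\n' are counted as well.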
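
One way to count a final line that lacks a trailing newline without over-counting files that do end in one is to remember the last byte read. A minimal sketch, assuming binary mode and a hypothetical function name count_lines_exact (neither is taken from the answer above):

```python
def count_lines_exact(filename, buf_size=1024 * 1024):
    """Count newline characters, plus one if the file ends with an unterminated line."""
    lines = 0
    last = b"\n"  # treat an empty file as already terminated
    with open(filename, "rb") as f:
        for buf in iter(lambda: f.read(buf_size), b""):
            lines += buf.count(b"\n")
            last = buf[-1:]  # slicing keeps this a bytes object
    if last != b"\n":
        lines += 1  # the final line has no trailing newline
    return lines
```

With this approach an empty file gives 0, and a two-line file gives 2 whether or not its second line ends in a newline.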

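Related discussion

Maybe also explain (or add a comment in the code) what you changed and why ;). It might make your code easier for people to understand (rather than having to "parse" the code in their heads).

I think the loop optimization lets Python do a local variable lookup for read_f, see python.org/doc/essays/list2str.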

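The result of opening a file is an iterator, which can be converted to a sequence, which has a length:

    def file_len(filename):
        with open(filename) as f:
            return len(list(f))

This is more concise than your explicit loop, and it avoids the enumerate.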

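Related discussion

This means that a 100 MB file will need to be read into memory.

Yes, good point, although I wonder about the difference in speed (as opposed to memory). It is probably possible to create an iterator that does this, but I think it would be equivalent to your solution.

In terms of memory, this is bad...

-1, it is not just the memory, but also having to construct the list in memory.

    count = max(enumerate(open(filename)))[0]

Related discussion

This gives the true count minus 1.

The optional second argument of enumerate() is the start count, according to docs.python.org/2/library/functions.html#enumerate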

How about this?

    import itertools

    def file_len(fname):
        counts = itertools.count()
        with open(fname) as f:
            for _ in f:
                next(counts)  # advance the counter once per line
        return next(counts)

If you want to get the line count cheaply in Python on Linux, I recommend this method:

    import os
    print(os.popen("wc -l file_path").readline().split()[0])

file_path can be either an absolute file path or a relative path. Hope this helps.

How about this?

    import fileinput
    import sys

    counter = 0
    for line in fileinput.input([sys.argv[1]]):
        counter += 1

    fileinput.close()
    print(counter)

How about this one-liner:

    file_length = len(open('myfile.txt','r').read().split('\n'))

Timed with the function below, this method takes 0.003 seconds on a 3900-line file:

    def c():
        import time
        s = time.time()
        file_length = len(open('myfile.txt','r').read().split('\n'))
        print(time.time() - s)

    def line_count(path):
        count = 0
        with open(path) as lines:
            for count, l in enumerate(lines, start=1):
                pass
        return count

You can use the os.path module in the following way:

    import os
    import subprocess

    proc = subprocess.Popen('wc -l {0}'.format(Filename), shell=True, stdout=subprocess.PIPE)
    Number_lines = int(proc.stdout.readlines()[0].split()[0])

where Filename is the absolute path of the file.

Related discussion

What does this answer have to do with os.path?

Another possibility:

    import subprocess

    def num_lines_in_file(fpath):
        return int(subprocess.check_output('wc -l %s' % fpath, shell=True).strip().split()[0])

Related discussion

Not multi-platform =/

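    def count_text_file_lines(path):
        with open(path, 'rt') as file:
            line_count = sum(1 for _line in file)
        return line_count

Related discussion

If you think it is wrong, could you explain what is wrong with it? It worked for me. Thanks!

I would also like to know why this answer was downvoted. It iterates over the file line by line and sums them up. I like it: it is short and to the point. What is wrong with it?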

If the file can fit into memory, then

    # binary mode, so that read() returns bytes and can be split on b'\n'
    with open(fname, 'rb') as f:
        count = len(f.read().split(b'\n')) - 1

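Create an executable script file named count.py:

    #!/usr/bin/python

    import sys
    count = 0
    for line in sys.stdin:
        count += 1
    print(count)

Then pipe the file's content into the Python script: cat huge.txt | ./count.py. Pipes also work in PowerShell, so you will end up counting the number of lines.
For me, on Linux, it was faster than:

    count = 0
    with open('huge.txt') as f:
        for line in f:
            count += 1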

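If all the lines in your file are the same length (and contain only ASCII characters)*, you can do the following very cheaply:

    import os

    fileSize = os.path.getsize( pathToFile )  # file size in bytes
    bytesPerLine = someInteger  # don't forget to account for the newline character
    numLines = fileSize // bytesPerLine

*I suspect more effort would be needed to determine the number of bytes in a line if non-ASCII Unicode characters are used.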
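
As a worked illustration of that arithmetic, the sketch below writes a made-up file of 1000 records, each 9 ASCII digits plus a newline (10 bytes per line), and recovers the count from the file size alone. The file name and record layout are assumptions invented for the example.

```python
import os
import tempfile

bytes_per_line = 10  # 9 ASCII characters + 1 newline byte
path = os.path.join(tempfile.mkdtemp(), "fixed_width.txt")

with open(path, "wb") as f:  # binary mode so "\n" is never translated to "\r\n"
    for i in range(1000):
        f.write(("%09d\n" % i).encode("ascii"))

# 10000 bytes in total, 10 bytes per line
num_lines = os.path.getsize(path) // bytes_per_line
print(num_lines)
```

No file content is read at all here, which is why this is so cheap; the result is only correct when every line really has the same byte length.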

Why wouldn't the following work?

    import sys

    # input comes from STDIN
    file = sys.stdin
    data = file.readlines()

    # get total number of lines in file
    lines = len(data)

    print(lines)

In this case, the len function uses the input lines as the means of determining the length.

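Related discussion

The question is not how to count lines; I have shown in the question itself what I am doing. The question is how to do it efficiently. In your solution the whole file is read into memory, which is at least inefficient for large files and at worst impossible for huge ones.

Actually, it is probably very efficient, except when it is impossible. :-)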

How about this?

    import sys
    sys.stdin = open('fname', 'r')
    data = sys.stdin.readlines()
    # Python 2 print statement; under Python 3 use print("counted", len(data), "lines")
    print "counted", len(data), "lines"

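Related discussion

I don't think it solves the problem of this big file being read into memory.

print "counted", len(data), "lines" ^ SyntaxError: invalid syntax

Why not read the first 100 and the last 100 lines and estimate the average line length, then divide the total file size by that number? If you don't need an exact value, this could work.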
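
A simplified sketch of this estimation idea follows. It is an illustration under assumptions rather than the commenter's code: the name estimate_line_count is hypothetical, and for brevity it samples only the first lines of the file, not the last 100 as well.

```python
import os
from itertools import islice

def estimate_line_count(path, sample=100):
    """Estimate the line count from the average length of the first `sample` lines."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = list(islice(f, sample))  # reads at most `sample` lines
    if not head:
        return 0  # empty file
    avg_len = sum(len(line) for line in head) / float(len(head))
    return int(round(size / avg_len))
```

The estimate is exact when all lines have the same byte length and drifts as line lengths vary, so it only suits cases where an approximate figure is acceptable.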

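Related discussion

I need an exact value, and the problem is that in the general case line lengths can vary considerably. I'm afraid your approach won't be the most efficient one.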

Similarly:

    lines = 0
    with open(path) as f:
        for line in f:
            lines += 1