How to only read lines in a text file after a certain string using Python?

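Using Python, I'd like to read all lines that come after a particular string in a text file into a dictionary. I'd like to do this over thousands of text files.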

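I'm able to identify and print out the line containing the particular string ('Abstract') using the following code (taken from a Stack Overflow answer):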

for files in filepath:
    with open(files, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                print(line)

But how do I tell Python to start reading only the lines that come after the string?

Just start another loop when you reach the line you want to start from:

for files in filepath:
    with open(files, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                for line in f:  # now you are at the lines you want
                    pass  # do work

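A file object is its own iterator, so when we reach the line with 'Abstract' in it, the inner loop continues the iteration from that point until the iterator has been consumed.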

A simple example:

gen = (n for n in range(8))

for x in gen:
    if x == 3:
        print("starting second loop")
        for x in gen:
            print("In second loop", x)
    else:
        print("In first loop", x)

In first loop 0
In first loop 1
In first loop 2
starting second loop
In second loop 4
In second loop 5
In second loop 6
In second loop 7

You can also use itertools.dropwhile to consume the lines up to the point you want.

from itertools import dropwhile

for files in filepath:
    with open(files, 'r') as f:
        dropped = dropwhile(lambda _line: "Abstract" not in _line, f)
        next(dropped, "")  # skip the line that contains "Abstract" itself
        for line in dropped:
            print(line)

Use a boolean to ignore lines up to that point:

for files in filepath:
    with open(files, 'r') as f:
        found_abstract = False  # reset the flag for every file
        for line in f:
            if 'Abstract' in line:
                found_abstract = True
            if found_abstract:
                pass  # do whatever you want (this includes the 'Abstract' line itself)

You can use itertools.dropwhile and itertools.islice here. A pseudo-example:

from itertools import dropwhile, islice

for fname in filepaths:
    with open(fname) as fin:
        start_at = dropwhile(lambda L: 'Abstract' not in L.split(), fin)
        for line in islice(start_at, 1, None):  # ignore the line still with Abstract in
            print(line)
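
One detail worth pointing out: this snippet tests membership in `L.split()`, so 'Abstract' only matches as a whole whitespace-delimited word, whereas the other answers use a plain substring test. A small sketch of the difference, using made-up sample lines (the `samples` list is only an illustration, not data from the question):

```python
# Hypothetical sample lines, only for illustration.
samples = [
    "Abstract\n",
    "Abstract: we show that\n",
    "An Abstract follows\n",
    "Abstractions matter\n",
]

# For each line: (substring test, whole-word test)
results = [("Abstract" in s, "Abstract" in s.split()) for s in samples]

for s, (substring_hit, word_hit) in zip(samples, results):
    print(repr(s), "substring:", substring_hit, "word:", word_hit)
```

Which test is appropriate depends on how the marker line looks in your files; a heading such as "Abstract:" would be missed by the whole-word test.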

To me, the following code is easier to understand.

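with open(file_name, 'r') as f:
    # Skip lines up to and including the one that contains 'Abstract'.
    # Note: next(f) raises StopIteration if no line contains 'Abstract'.
    while 'Abstract' not in next(f):
        pass
    for line in f:
        # line is now the next line after the one that contains 'Abstract'
        pass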
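
One caveat with the `next(f)` loop: if a file has no line containing 'Abstract', `next(f)` raises StopIteration when the file runs out. With thousands of files that seems likely to happen at some point, so here is a minimal sketch that passes a default to `next` instead. The function name `lines_after_marker` and the `io.StringIO` stand-ins for real files are my own assumptions for illustration.

```python
import io

def lines_after_marker(f, marker="Abstract"):
    """Return the lines after the first line containing marker.

    Returns an empty list if the marker never appears.
    (Hypothetical helper, not part of the original answer.)
    """
    line = next(f, None)  # None signals that the iterator is exhausted
    while line is not None and marker not in line:
        line = next(f, None)
    return list(f)  # whatever is left after the marker line

# io.StringIO objects stand in for opened text files here.
found = lines_after_marker(io.StringIO("Title\nAbstract\nfirst\nsecond\n"))
missing = lines_after_marker(io.StringIO("Title\nno marker\n"))
print(found)
print(missing)
```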

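Just to clarify, your code already "reads" all the lines. To start "paying attention" to lines after a certain point, you can just set a boolean flag that indicates whether lines should be ignored, and check it at each line.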

pay_attention = False
for line in f:
    if pay_attention:
        print(line)
    else:  # We haven't found our trigger yet; see if it's in this line
        if 'Abstract' in line:
            pay_attention = True

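If you don't mind rearranging your code a little more, you can also use two partial loops instead: one loop that terminates once you've found your trigger phrase ('Abstract'), and another that reads all the following lines. This approach is a little cleaner (and a tiny bit faster).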

for skippable_line in f:  # First skim over all lines until we find 'Abstract'.
    if 'Abstract' in skippable_line:
        break
for line in f:  # The file's iterator starts up again right where we left it.
    print(line)

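The reason this works is that the file object returned by open behaves like a generator rather than, say, a list: it only produces values as they are requested. So when the first loop stops, the file is left with its internal position at the beginning of the first "unread" line. This means that when you enter the second loop, the first line you see is the line right after the one that triggered the break.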
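
To make the generator-versus-list contrast concrete, here is a small sketch that runs the same two-loop pattern over a file-like object and over a list of the same lines. The helper name `after_break` and the sample text are assumptions for illustration; `io.StringIO` stands in for a real opened file.

```python
import io

text = "Title\nAbstract\nfirst\nsecond\n"

def after_break(iterable, marker="Abstract"):
    """Run the two-loop pattern and return what the second loop sees."""
    for item in iterable:
        if marker in item:
            break
    return [item for item in iterable]

# A file-like object resumes where the first loop stopped...
from_file = after_break(io.StringIO(text))
# ...while a list starts over from its first element.
from_list = after_break(text.splitlines(keepends=True))

print(from_file)
print(from_list)
```

So the two-loop trick relies on iterating the file object directly; if you had first called `f.readlines()`, the second loop would start from the top again.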

Making a guess as to how the dictionary is involved, I'd write it this way:

lines = dict()
for filename in filepath:
    with open(filename, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                break
        lines[filename] = tuple(f)

So for each file, your dictionary contains a tuple of the lines that follow the 'Abstract' line.
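
Since the question mentions thousands of files, the same idea might be wrapped in a function that takes a glob pattern. This is only a sketch under assumptions: the function name `collect_lines_after`, the `*.txt` pattern, and the temporary demo files are all made up for illustration. Note that a file without the marker ends up with an empty tuple, because the first loop consumes the whole file.

```python
import glob
import os
import tempfile

def collect_lines_after(pattern, marker="Abstract"):
    """Map each file matching pattern to a tuple of the lines after marker.

    Hypothetical helper; files that never contain marker map to ().
    """
    result = {}
    for filename in glob.glob(pattern):
        with open(filename, "r") as f:
            for line in f:
                if marker in line:
                    break
            result[filename] = tuple(f)  # the rest of the file
    return result

# Demo with two throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.txt"), "w") as out:
        out.write("Title\nAbstract\nline 1\nline 2\n")
    with open(os.path.join(tmp, "b.txt"), "w") as out:
        out.write("no marker here\n")
    demo = collect_lines_after(os.path.join(tmp, "*.txt"))
    demo_by_name = {os.path.basename(k): v for k, v in demo.items()}

print(demo_by_name)
```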