How to only read lines in a text file after a certain string using python?
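Using Python, I'd like to read all lines that come after a particular string in a text file into a dictionary. I'd like to do this over thousands of text files.
I'm able to identify and print out the particular string ("Abstract") using the following code (taken from a Stack Overflow answer):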
```python
for files in filepath:
    with open(files, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                print(line)
```
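But how do I tell Python to start reading only the lines that come after the string?
Just start another loop when you reach the line you want to start from: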
```python
for files in filepath:
    with open(files, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                for line in f:  # now you are at the lines you want
                    pass  # do work
```
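A file object is its own iterator, so when we reach the line containing "Abstract", the inner loop continues iterating from that point until the iterator is exhausted.
A simple example: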
```python
gen = (n for n in range(8))

for x in gen:
    if x == 3:
        print("starting second loop")
        for x in gen:
            print("In second loop", x)
    else:
        print("In first loop", x)
```

Output:

```
In first loop 0
In first loop 1
In first loop 2
starting second loop
In second loop 4
In second loop 5
In second loop 6
In second loop 7
```
You can also use itertools.dropwhile to consume the lines up to the point you want.
```python
from itertools import dropwhile

for files in filepath:
    with open(files, 'r') as f:
        # Drop lines until one contains "Abstract"
        dropped = dropwhile(lambda _line: "Abstract" not in _line, f)
        # Discard the "Abstract" line itself
        next(dropped, "")
        for line in dropped:
            print(line)
```
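The dropwhile approach can be tried without any files on disk, since an io.StringIO object yields lines the same way an open text file does. This is only a sketch: the helper name `lines_after_marker` and the sample text are made up for illustration.

```python
import io
from itertools import dropwhile


def lines_after_marker(lines, marker="Abstract"):
    """Return the lines that follow the first line containing `marker`.

    `lines` can be any iterable of strings, such as an open text file.
    """
    # Skip lines for as long as the marker has not appeared.
    remaining = dropwhile(lambda line: marker not in line, lines)
    # `remaining` now starts at the marker line itself, so discard it.
    # The default argument avoids StopIteration when the marker is absent.
    next(remaining, None)
    return list(remaining)


sample = io.StringIO("Title\nAbstract\nfirst line\nsecond line\n")
print(lines_after_marker(sample))
```

If the marker never appears, dropwhile drops everything and the helper returns an empty list.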
Use a boolean to ignore lines up to that point:
```python
for files in filepath:
    found_abstract = False  # reset the flag for every file
    with open(files, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                found_abstract = True
            if found_abstract:
                pass  # do whatever you want (this also sees the 'Abstract' line itself)
```
You can use itertools.dropwhile and itertools.islice here:
```python
from itertools import dropwhile, islice

for fname in filepaths:
    with open(fname) as fin:
        # L.split() means 'Abstract' must appear as a whitespace-separated word
        start_at = dropwhile(lambda L: 'Abstract' not in L.split(), fin)
        for line in islice(start_at, 1, None):  # skip the line that still has Abstract in it
            print(line)
```
To me, the following code is easier to understand.
```python
with open(file_name, 'r') as f:
    while 'Abstract' not in next(f):
        pass
    for line in f:
        # line is now the next line after the one that contains 'Abstract'
        print(line)
```
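One thing to watch for with the next(f) loop: if no line contains the marker, next(f) raises StopIteration once the file runs out. A possible way to handle that, sketched here with a hypothetical helper name `lines_after` and io.StringIO standing in for a real file:

```python
import io


def lines_after(f, marker="Abstract"):
    """Return the lines after the first line containing `marker`.

    `f` must be an iterator of lines, such as an open text file.
    Returns an empty list if the marker never appears.
    """
    try:
        # Consume lines until one contains the marker.
        while marker not in next(f):
            pass
    except StopIteration:
        # The iterator ran out before the marker was found.
        return []
    # Whatever is left in the iterator comes after the marker line.
    return list(f)


print(lines_after(io.StringIO("Intro\nAbstract\nbody text\n")))
print(lines_after(io.StringIO("Intro\nno marker in this file\n")))
```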
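Just to clarify, your code already "reads" all the lines. To start "paying attention" to lines after a certain point, you can set a boolean flag that indicates whether lines should be ignored, and check it on each line.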
```python
pay_attention = False
for line in f:  # f is an open file object
    if pay_attention:
        print(line)
    else:  # We haven't found our trigger yet; see if it's in this line
        if 'Abstract' in line:
            pay_attention = True
```
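If you don't mind rearranging your code a little more, you can also use two partial loops instead: one loop that terminates once you've found your trigger phrase (here, 'Abstract'), and a second one that reads all the lines after it.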
```python
for skippable_line in f:  # First skim over all lines until we find 'Abstract'.
    if 'Abstract' in skippable_line:
        break
for line in f:  # The file's iterator starts up again right where we left it.
    print(line)
```
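The reason this works is that a file object is its own iterator, so the second loop picks up exactly where the first one stopped.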
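One caveat with the two-loop pattern: it depends on both loops sharing a single iterator. A list of lines is not its own iterator, so looping over it twice starts from the top each time. The sketch below uses made-up sample lines to show the difference; wrapping the list in iter() makes it behave like a file object.

```python
lines = ["Title\n", "Abstract\n", "body\n"]

# Each `for` loop over a list creates a fresh iterator,
# so the second pass starts again from the first element.
for line in lines:
    if "Abstract" in line:
        break
from_list = [line for line in lines]

# iter() returns one iterator that both loops share, like a file object,
# so the second pass resumes after the line where the first loop stopped.
it = iter(lines)
for line in it:
    if "Abstract" in line:
        break
from_iterator = [line for line in it]

print(from_list)
print(from_iterator)
```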
Making a guess as to how the dictionary is involved, I'd write it this way:
```python
lines = dict()
for filename in filepath:
    with open(filename, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                break
        lines[filename] = tuple(f)
```
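So for each file, your dictionary contains a tuple of lines.
This works because the loop reads up to and including the line you identify, leaving the remaining lines in the file ready to be read from f.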
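For thousands of files, the same idea can be wrapped in a function. The following is a sketch under assumptions: the function name `read_after_marker` and its `marker` parameter are hypothetical, and files are opened with the platform's default text encoding.

```python
def read_after_marker(paths, marker="Abstract"):
    """Map each path to a tuple of the lines after its first `marker` line."""
    result = {}
    for path in paths:
        with open(path, "r") as f:
            # Consume lines up to and including the first one with the marker.
            for line in f:
                if marker in line:
                    break
            # If the marker never appeared, the loop above has exhausted f,
            # so this stores an empty tuple for that file.
            result[str(path)] = tuple(f)
    return result
```

Assuming the text files live in a directory called `papers` (a made-up name), it could be called as `read_after_marker(pathlib.Path("papers").glob("*.txt"))`. Each file is read lazily, one line at a time, until the marker is found, and only the lines after it are kept in memory.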