Python: execute cat subprocess in parallel

I am running several cat | zgrep commands on a remote server and gathering their output individually for further processing:

import multiprocessing as mp
import subprocess

class MainProcessor(mp.Process):
    def __init__(self, peaks_array):
        super(MainProcessor, self).__init__()
        self.peaks_array = peaks_array

    def run(self):
        for peak_arr in self.peaks_array:
            peak_processor = PeakProcessor(peak_arr)
            peak_processor.start()

class PeakProcessor(mp.Process):
    def __init__(self, peak_arr):
        super(PeakProcessor, self).__init__()
        self.peak_arr = peak_arr

    def run(self):
        command = 'ssh remote_host cat files_to_process | zgrep --mmap "regex" '
        log_lines = (subprocess.check_output(command, shell=True)).split('\n')
        process_data(log_lines)

However, this results in sequential execution of the subprocess ('ssh…cat…') commands. The second peak waits for the first to finish, and so on.

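How can I modify this code so that the subprocess calls run in parallel, while still being able to collect the output of each call individually?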

Related discussion

You need neither multiprocessing nor threading to run subprocesses in parallel, e.g.:

#!/usr/bin/env python
from subprocess import Popen

# run commands in parallel
processes = [Popen("echo {i:d}; sleep 2; echo {i:d}".format(i=i), shell=True)
             for i in range(5)]
# collect statuses
exitcodes = [p.wait() for p in processes]

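It runs 5 shell commands simultaneously. Note: neither threads nor the multiprocessing module is used here. There is no point in adding an ampersand & to the shell commands: Popen does not wait for the command to complete. You need to call .wait() explicitly.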
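To see that Popen returns before the child finishes, the launch and the wait can be timed separately. The following is only a sketch: the helper name `launch_all` and the use of a sleeping Python child (instead of a shell command) are assumptions made for illustration, not part of the answer above.

```python
import subprocess
import sys
import time

def launch_all(n, seconds):
    """Start n children that each sleep; return (launch time, total time, exit codes)."""
    start = time.monotonic()
    # Each child is a Python interpreter that only sleeps for `seconds`.
    cmd = [sys.executable, "-c", "import time; time.sleep({})".format(seconds)]
    # Popen returns as soon as the child is started, so this loop does not block.
    processes = [subprocess.Popen(cmd) for _ in range(n)]
    launched = time.monotonic() - start
    # wait() blocks until the given child exits; the children sleep concurrently.
    exitcodes = [p.wait() for p in processes]
    total = time.monotonic() - start
    return launched, total, exitcodes

if __name__ == "__main__":
    launched, total, exitcodes = launch_all(3, 1.0)
    print("launching took {:.2f}s, all children done after {:.2f}s".format(launched, total))
    print("exit codes:", exitcodes)
```

If the children ran one after another, the total would be about `n * seconds`; with concurrent children it should stay close to `seconds` plus process start-up time.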

It is convenient, but not necessary, to use threads to collect the output from subprocesses:

#!/usr/bin/env python
from multiprocessing.dummy import Pool # thread pool
from subprocess import Popen, PIPE, STDOUT

# run commands in parallel
processes = [Popen("echo {i:d}; sleep 2; echo {i:d}".format(i=i), shell=True,
                   stdin=PIPE, stdout=PIPE, stderr=STDOUT, close_fds=True)
             for i in range(5)]

# collect output in parallel
def get_lines(process):
    return process.communicate()[0].splitlines()

outputs = Pool(len(processes)).map(get_lines, processes)
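
Applied to the question, the same idea might look roughly like the sketch below. It assumes Python 3.5+ and swaps in `concurrent.futures.ThreadPoolExecutor` with `subprocess.run` in place of `multiprocessing.dummy.Pool` and `Popen.communicate()`. The names `run_command` and `run_all` and the placeholder commands are hypothetical; in the question, each command would be one of the `ssh remote_host ...` pipelines.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_command(command):
    """Run one shell command and return its combined output as a list of text lines."""
    completed = subprocess.run(command, shell=True, stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT, universal_newlines=True)
    return completed.stdout.splitlines()

def run_all(commands):
    """Run all commands concurrently; results come back in the order of `commands`."""
    # One worker thread per command; each thread only waits on its own child process.
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        return list(pool.map(run_command, commands))

if __name__ == "__main__":
    # Placeholder commands standing in for the remote pipelines from the question.
    commands = ['"{}" -c "print({})"'.format(sys.executable, i) for i in range(3)]
    for command, lines in zip(commands, run_all(commands)):
        print(command, "->", lines)
```

Because `pool.map` preserves input order, the output of each command stays paired with the command that produced it, which is what the question needs for per-peak processing.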

Related: Python threading multiple bash subprocesses?

Here is a code example that gets the output from several subprocesses concurrently in the same thread:

#!/usr/bin/env python3
import asyncio
import sys
from asyncio.subprocess import PIPE, STDOUT

async def get_lines(shell_command):
    p = await asyncio.create_subprocess_shell(shell_command,
            stdin=PIPE, stdout=PIPE, stderr=STDOUT)
    return (await p.communicate())[0].splitlines()

async def main():
    # get commands output in parallel
    coros = [get_lines('"{e}" -c "print({i:d}); import time; time.sleep({i:d})"'
                       .format(i=i, e=sys.executable)) for i in range(5)]
    print(await asyncio.gather(*coros))

# asyncio.run() creates and closes the event loop; on Windows the default
# loop (proactor, Python 3.8+) supports subprocess pipes
asyncio.run(main())
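
For the question's use case it may also help to keep each command's exit code next to its output. The variant below is a sketch in async/await syntax, assuming Python 3.7+; the names `run_command` and `run_all` and the placeholder commands are hypothetical and stand in for the remote pipelines.

```python
import asyncio
import sys
from asyncio.subprocess import PIPE, STDOUT

async def run_command(shell_command):
    """Run one shell command; return (exit code, output lines as bytes)."""
    p = await asyncio.create_subprocess_shell(shell_command,
                                              stdout=PIPE, stderr=STDOUT)
    output, _ = await p.communicate()  # stderr is merged into stdout
    return p.returncode, output.splitlines()

async def run_all(commands):
    """Run all commands concurrently; results keep the order of `commands`."""
    return await asyncio.gather(*(run_command(c) for c in commands))

if __name__ == "__main__":
    # Placeholder commands standing in for the remote pipelines from the question.
    commands = ['"{}" -c "print({})"'.format(sys.executable, i) for i in range(3)]
    for command, (code, lines) in zip(commands, asyncio.run(run_all(commands))):
        print(code, lines, command)
```

A non-zero exit code (for example from a failed ssh connection) can then be handled per command instead of aborting the whole batch.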