
Multiprocessing a file in Python, then writing the result to disk

I would like to do the following:

  • read data from a csv file
  • process each line of said csv (let's say this is a long network operation)
  • write the result to another file

I tried gluing this and this answer together, but with little success. The code for the second queue never gets called, so nothing is written to disk. How do I make the processes aware of the second queue?

Note that I don't have to use multiprocessing; if async/await works better, I'm all for it.

My code so far:

import multiprocessing
import os
import time

in_queue = multiprocessing.Queue()
out_queue = multiprocessing.Queue()

def worker_main(in_queue, out_queue):
    print (os.getpid(),"working")
    while True:
        item = in_queue.get(True)
        print (os.getpid(),"got", item)
        time.sleep(1) #long network processing
        print (os.getpid(),"done", item)
        # put the processed items to be written to disk
        out_queue.put("processed:" + str(item))


pool = multiprocessing.Pool(3, worker_main,(in_queue,out_queue))

for i in range(5): # let's assume this is the file reading part
    in_queue.put(i)

with open('out.txt', 'w') as file:

    while not out_queue.empty():
        try:
            value = out_queue.get(timeout = 1)
            file.write(value + '\n')
        except Exception as qe:
            print ("Empty Queue or dead process")

The first problem I ran into when trying to execute the code was:

An attempt has been made to start a new process before the current process has finished
its bootstrapping phase. This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom in the main module

I had to wrap any module scope instructions in the if __name__ == '__main__': idiom. Read more here.
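
For reference, a minimal sketch of that idiom (worker and the range(5) input here are placeholders, not code from the question):

import multiprocessing

def worker(item):
    return item * 2  # placeholder for the real per-item work

if __name__ == '__main__':
    # anything that starts child processes must live under this guard;
    # with the spawn start method each child re-imports the main module,
    # and the guard keeps the children from spawning children of their own
    with multiprocessing.Pool(3) as pool:
        print(pool.map(worker, range(5)))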

Since your goal is to iterate over the lines of a file, Pool.imap() seems like a good match. The imap() documentation refers to the map() documentation, the difference being that imap() lazily pulls the next items from the iterable (which in your case would be the csv file), which is beneficial if the csv file is large. So, from the map() documentation:

This method chops the iterable into a number of chunks which it
submits to the process pool as separate tasks.

imap() returns an iterator, so you can iterate over the results produced by the process workers and do whatever you have to do with them (in your example, write them to a file).
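
As an aside, imap() also takes an optional chunksize argument (mentioned in the map()/imap() docs) that ships several items to a worker per task, which can help when each item is cheap to process. A small sketch, with process_row standing in for your actual work:

import multiprocessing

def process_row(row):
    return "processed:" + str(row)  # stand-in for the per-row work

if __name__ == '__main__':
    with multiprocessing.Pool(3) as pool:
        # chunksize=16 is an arbitrary example value: bigger chunks mean
        # less inter-process overhead but coarser load balancing
        for result in pool.imap(process_row, range(100), chunksize=16):
            print(result)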

Here is a working example:

import multiprocessing
import os
import time


def worker_main(item):
    print(os.getpid(),"got", item)
    time.sleep(1) #long network processing
    print(os.getpid(),"done", item)
    # return the processed item to be written to disk
    return "processed:" + str(item)


if __name__ == '__main__':
    with multiprocessing.Pool(3) as pool:
        with open('out.txt', 'w') as file:
            # range(5) simulating a 5 row csv file.
            for proc_row in pool.imap(worker_main, range(5)):
                file.write(proc_row + '\n')

# printed output:
# 1368 got 0
# 9228 got 1
# 12632 got 2
# 1368 done 0
# 1368 got 3
# 9228 done 1
# 9228 got 4
# 12632 done 2
# 1368 done 3
# 9228 done 4

And out.txt looks like this:

processed:0
processed:1
processed:2
processed:3
processed:4

Note that I didn't need to use queues either.
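
To tie this back to the csv file from the question, here is a sketch of the same pattern reading a real file; input.csv and process_row are assumed names, not something from the example above:

import csv
import multiprocessing
import time

def process_row(row):
    time.sleep(1)  # stand-in for the long network operation
    return "processed:" + ",".join(row)

if __name__ == '__main__':
    with multiprocessing.Pool(3) as pool:
        with open('input.csv', newline='') as src, open('out.txt', 'w') as dst:
            # csv.reader is itself lazy, so imap() pulls rows on demand
            # instead of reading the whole file into memory first
            for result in pool.imap(process_row, csv.reader(src)):
                dst.write(result + '\n')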