How to reduce the time taken to load a pickle file in Python
I created a dictionary in Python and dumped it into a pickle file. Its size is about 300MB. Now, I want to load the same pickle:
```python
output = open('myfile.pkl', 'rb')
mydict = pickle.load(output)
```
Loading this pickle takes about 15 seconds. How can I reduce this time?
Hardware specification: Ubuntu 14.04, 4GB RAM.
The code below shows the time taken to dump and load the file using json, pickle, and cPickle.
After dumping, the file size is around 300MB.
```python
import json, pickle, cPickle
import os, timeit

mydict = {all values to be added}  # placeholder for the actual 300MB dictionary

def dump_json():
    output = open('myfile1.json', 'wb')
    json.dump(mydict, output)
    output.close()

def dump_pickle():
    output = open('myfile2.pkl', 'wb')
    pickle.dump(mydict, output, protocol=cPickle.HIGHEST_PROTOCOL)
    output.close()

def dump_cpickle():
    output = open('myfile3.pkl', 'wb')
    cPickle.dump(mydict, output, protocol=cPickle.HIGHEST_PROTOCOL)
    output.close()

def load_json():
    output = open('myfile1.json', 'rb')
    mydict = json.load(output)
    output.close()

def load_pickle():
    output = open('myfile2.pkl', 'rb')
    mydict = pickle.load(output)
    output.close()

def load_cpickle():
    output = open('myfile3.pkl', 'rb')
    # note: the original mistakenly called pickle.load here, which is why
    # the "cPickle load" time below is identical to the "pickle load" time
    mydict = cPickle.load(output)
    output.close()

if __name__ == '__main__':
    print "Json dump:"
    t = timeit.Timer(stmt="pickle_wr.dump_json()", setup="import pickle_wr")
    print t.timeit(1), ' '

    print "Pickle dump:"
    t = timeit.Timer(stmt="pickle_wr.dump_pickle()", setup="import pickle_wr")
    print t.timeit(1), ' '

    print "cPickle dump:"
    t = timeit.Timer(stmt="pickle_wr.dump_cpickle()", setup="import pickle_wr")
    print t.timeit(1), ' '

    print "Json load:"
    t = timeit.Timer(stmt="pickle_wr.load_json()", setup="import pickle_wr")
    print t.timeit(1), ' '

    print "pickle load:"
    t = timeit.Timer(stmt="pickle_wr.load_pickle()", setup="import pickle_wr")
    print t.timeit(1), ' '

    print "cPickle load:"
    t = timeit.Timer(stmt="pickle_wr.load_cpickle()", setup="import pickle_wr")
    print t.timeit(1), ' '
```
Output:
```
Json dump: 42.5809804916
Pickle dump: 52.87407804489
cPickle dump: 1.1903790187836
Json load: 12.240660209656
pickle load: 24.48748306274
cPickle load: 24.4888298893
```
I can see that cPickle takes less time to dump and load, but loading the file still takes a long time.
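One low-effort tweak worth trying before switching formats: read the whole file into memory with a single call and unpickle from the resulting bytes, instead of letting the unpickler issue many small reads against the file object. A minimal sketch (the file name is taken from the question; the small dictionary here is only a stand-in for the real 300MB one):

```python
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle             # Python 3

# Stand-in for the real 300MB dictionary so the snippet is self-contained.
mydict = {i: str(i) for i in range(1000)}
with open('myfile.pkl', 'wb') as f:
    pickle.dump(mydict, f, protocol=pickle.HIGHEST_PROTOCOL)

# One big read() call, then unpickle from the in-memory bytes.
with open('myfile.pkl', 'rb') as f:
    raw = f.read()
restored = pickle.loads(raw)
print(restored == mydict)  # → True
```

Whether this helps depends on how your interpreter buffers file I/O, so measure it on your own data before committing to it.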
Try using the json module instead. According to this site:

> JSON is 25 times faster in reading (loads) and 15 times faster in writing (dumps).
See also this question: Is it faster to load a pickled dictionary object or to load a JSON file into a dictionary?
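If you do try json, note that it is not a drop-in replacement for pickle: keys must be strings (non-string keys are silently coerced to strings) and values must be JSON-serializable. A minimal round trip with hypothetical sample data:

```python
import json

mydict = {"1": "one", "2": ["a", "b"]}

# json works on text files ('w'/'r'), unlike the 'wb'/'rb' used for pickle.
with open('myfile1.json', 'w') as f:
    json.dump(mydict, f)
with open('myfile1.json') as f:
    restored = json.load(f)
print(restored == mydict)  # → True

# Beware: non-string keys survive a dump but come back as strings.
print(json.loads(json.dumps({1: "one"})))  # → {'1': 'one'}
```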
Upgrade Python, or use the marshal module (whose format is tied to a fixed Python version) for raw speed:
```python
try:
    import cPickle
except:
    import pickle as cPickle
import pickle
import json, marshal, random
from time import time
from hashlib import md5

test_runs = 1000

if __name__ == "__main__":
    payload = {
        "float": [(random.randrange(0, 99) + random.random()) for i in range(1000)],
        "int": [random.randrange(0, 9999) for i in range(1000)],
        "str": [md5(str(random.random()).encode('utf8')).hexdigest() for i in range(1000)]
    }
    modules = [json, pickle, cPickle, marshal]

    for payload_type in payload:
        data = payload[payload_type]
        for module in modules:
            start = time()
            if module.__name__ in ['pickle', 'cPickle']:
                for i in range(test_runs):
                    serialized = module.dumps(data, protocol=-1)
            else:
                for i in range(test_runs):
                    serialized = module.dumps(data)
            w = time() - start
            start = time()
            for i in range(test_runs):
                unserialized = module.loads(serialized)
            r = time() - start
            print("%s %s W %.3f R %.3f" % (module.__name__, payload_type, w, r))
```
Results:
```
C:\Python27\python.exe -u "serialization_benchmark.py"
json int W 0.125 R 0.156
pickle int W 2.808 R 1.139
cPickle int W 0.047 R 0.046
marshal int W 0.016 R 0.031
json float W 1.981 R 0.624
pickle float W 2.607 R 1.092
cPickle float W 0.063 R 0.062
marshal float W 0.047 R 0.031
json str W 0.172 R 0.437
pickle str W 5.149 R 2.309
cPickle str W 0.281 R 0.156
marshal str W 0.109 R 0.047

C:\pypy-1.6\pypy-c -u "serialization_benchmark.py"
json int W 0.515 R 0.452
pickle int W 0.546 R 0.219
cPickle int W 0.577 R 0.171
marshal int W 0.032 R 0.031
json float W 2.390 R 1.341
pickle float W 0.656 R 0.436
cPickle float W 0.593 R 0.406
marshal float W 0.327 R 0.203
json str W 1.141 R 1.186
pickle str W 0.702 R 0.546
cPickle str W 0.828 R 0.562
marshal str W 0.265 R 0.078

c:\Python34\python -u "serialization_benchmark.py"
json int W 0.203 R 0.140
pickle int W 0.047 R 0.062
pickle int W 0.031 R 0.062
marshal int W 0.031 R 0.047
json float W 1.935 R 0.749
pickle float W 0.047 R 0.062
pickle float W 0.047 R 0.062
marshal float W 0.047 R 0.047
json str W 0.281 R 0.187
pickle str W 0.125 R 0.140
pickle str W 0.125 R 0.140
marshal str W 0.094 R 0.078
```
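marshal wins most of the rows above, but it is an internal format intended for .pyc files: it is not guaranteed to be stable across Python versions, so it is only safe when the writer and the reader run the same interpreter version. A minimal sketch with hypothetical sample data:

```python
import marshal

data = {'ints': list(range(100)), 'floats': [0.5, 1.5], 'name': 'benchmark'}

# marshal handles the core built-in types (dict, list, str, int, float, ...)
# but will raise ValueError for arbitrary objects, unlike pickle.
blob = marshal.dumps(data)
restored = marshal.loads(blob)
print(restored == data)  # → True
```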
Python 3.4 uses pickle protocol 3 by default, and it showed no difference compared with protocol 4. Python 2 has protocol 2 as its highest pickle protocol (selected when a negative value is passed to dump), which is twice as slow as protocol 3. (In the Python 3.4 run above, cPickle falls back to pickle, which is why the pickle line appears twice.)
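The protocol choice alone makes a large difference: passing protocol=-1 (or pickle.HIGHEST_PROTOCOL) always selects the newest protocol the running interpreter supports, while protocol 0 is the slow, ASCII-based format that was Python 2's default. A small comparison sketch:

```python
import pickle

data = list(range(1000))

ascii_blob = pickle.dumps(data, protocol=0)    # Python 2's old default format
binary_blob = pickle.dumps(data, protocol=-1)  # highest available protocol

# The binary protocols are both faster and much more compact.
print(len(ascii_blob) > len(binary_blob))  # → True
print(pickle.loads(binary_blob) == data)   # → True
```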
I've had good results reading huge files (e.g. an ~750 MB igraph object stored as a binary pickle file) with cPickle itself, simply by wrapping the pickle load call (disabling the garbage collector around it) as mentioned here.
An example snippet for your case would look something like this:
```python
import timeit
import cPickle as pickle
import gc

def load_cpickle_gc():
    output = open('myfile3.pkl', 'rb')

    # disable garbage collector
    gc.disable()

    mydict = pickle.load(output)

    # enable garbage collector again
    gc.enable()
    output.close()

if __name__ == '__main__':
    print "cPickle load (with gc workaround):"
    t = timeit.Timer(stmt="pickle_wr.load_cpickle_gc()", setup="import pickle_wr")
    print t.timeit(1), ' '
```
There may of course be more appropriate ways to accomplish the task, but this workaround drastically reduces the time required (for me, from 843.04s down to 41.28s, roughly 20x).
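If you use this trick in more than one place, it can be wrapped in a small context manager so the collector is reliably re-enabled even when the load raises. gc_paused below is a hypothetical helper name, not part of the standard library:

```python
import gc
import pickle
from contextlib import contextmanager

@contextmanager
def gc_paused():
    # Remember the previous state so nested use does not
    # re-enable a collector that was already switched off.
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()

# Usage with a small stand-in pickle file:
data = {i: [i] * 3 for i in range(100)}
with open('big.pkl', 'wb') as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

with open('big.pkl', 'rb') as f, gc_paused():
    restored = pickle.load(f)
print(restored == data)  # → True
```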
If you are trying to store the dictionary in a single file, it's the load time for that one large file that is slowing you down. One of the easiest things you can do is write the dictionary to a directory on disk, with each dictionary entry stored as a separate file. Then the files can be pickled and unpickled in multiple threads (or using multiprocessing). For a very large dictionary, this should be much faster than reading the data to and from a single file, regardless of the serializer you choose. There are some packages, such as
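A minimal sketch of that idea, sharding the dictionary across several pickle files and loading them concurrently. dump_shards and load_shards are hypothetical names, and a thread pool stands in for the threads/multiprocessing mentioned above:

```python
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor

def dump_shards(d, dirname, nshards=4):
    # Split the dictionary into nshards smaller dicts and pickle
    # each one to its own file inside dirname.
    shards = [{} for _ in range(nshards)]
    for i, (k, v) in enumerate(d.items()):
        shards[i % nshards][k] = v
    for i, shard in enumerate(shards):
        with open(os.path.join(dirname, 'shard%d.pkl' % i), 'wb') as f:
            pickle.dump(shard, f, protocol=pickle.HIGHEST_PROTOCOL)

def _load_one(path):
    with open(path, 'rb') as f:
        return pickle.load(f)

def load_shards(dirname):
    # Unpickle every shard concurrently and merge the results.
    paths = [os.path.join(dirname, p) for p in os.listdir(dirname)]
    merged = {}
    with ThreadPoolExecutor() as pool:
        for shard in pool.map(_load_one, paths):
            merged.update(shard)
    return merged

tmpdir = tempfile.mkdtemp()
original = {str(i): i * i for i in range(1000)}
dump_shards(original, tmpdir)
print(load_shards(tmpdir) == original)  # → True
```

How much this wins in practice depends on your disk and on how much of the unpickling releases the GIL, so benchmark it against the single-file load.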