How to achieve Faster File I/O In Python?
I have a speed/efficiency related question about Python:

I need to extract multiple fields from a nested JSON file and write each of them out to its own .txt file.

Normally I would just put all of my data into one structure and write it out in a single pass, but for now I have resorted to simply assembling the lines as strings, which is a bit slow. So far I am doing the following:
- Assemble each line as a string (extracting the desired fields from the JSON)
- Write the string to the corresponding file
I have a couple of issues with this:
- it leads to many separate file.write() calls, which I suspect are slow as well
So my questions are:
- What is a good routine for this kind of problem? One that balances speed against memory consumption for the most efficient writing to disk.
- Should I increase my DEFAULT_BUFFER_SIZE? (it is currently 8192)
I have looked at File I/O in every programming language and the python org: IO docs, but they did not help much, except that (as far as I understand after going through them) file I/O should already be buffered in Python 3.6.x, and I found that my DEFAULT_BUFFER_SIZE is already 8192.
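For reference, a minimal sketch of how the default buffer size can be inspected and how a larger per-file write buffer can be requested; the file name and the 1 MiB value are only illustrative, not recommendations:

```python
import io

# Interpreter-wide default used for buffered binary I/O (typically 8192 bytes).
print(io.DEFAULT_BUFFER_SIZE)

# A larger buffer can be requested per file via the `buffering` argument of open();
# for text-mode files it sets the size (in bytes) of the underlying binary buffer.
with open('train_title.txt', 'w', encoding='utf-8', buffering=1024 * 1024) as f:
    f.write('some line\n')
```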
Thanks in advance for your help!!
Here's my snippet -
```python
import json
import os
import re

from tqdm import tqdm_notebook
# PATH_TO_RAW_DATA and strip_tags() are defined elsewhere in the notebook.


def read_json_line(line=None):
    result = None
    try:
        result = json.loads(line)
    except Exception as e:
        # Find the offending character index:
        idx_to_replace = int(str(e).split(' ')[-1].replace(')', ''))
        # Remove the offending character:
        new_line = list(line)
        new_line[idx_to_replace] = ' '
        new_line = ''.join(new_line)
        return read_json_line(line=new_line)
    return result


def extract_features_and_write(path_to_data, inp_filename, is_train=True):
    # It's currently having 8 lines of file.write(), which is probably making it slow
    # as writing to disk is involving a lot of overheads as well
    features = ['meta_tags__twitter-data1', 'url', 'meta_tags__article-author', 'domain', 'title',
                'published__$date', 'content', 'meta_tags__twitter-description']
    prefix = 'train' if is_train else 'test'
    feature_files = [open(os.path.join(path_to_data, '{}_{}.txt'.format(prefix, feat)),
                          'w', encoding='utf-8') for feat in features]

    with open(os.path.join(PATH_TO_RAW_DATA, inp_filename), encoding='utf-8') as inp_json_file:
        for line in tqdm_notebook(inp_json_file):
            for idx, features in enumerate(features):
                json_data = read_json_line(line)

                content = json_data['meta_tags']["twitter:data1"].replace('\n', ' ').replace('\r', ' ').split()[0]
                feature_files[0].write(content + '\n')

                content = json_data['url'].split('/')[-1].lower()
                feature_files[1].write(content + '\n')

                content = json_data['meta_tags']['article:author'].split('/')[-1].replace('@', '').lower()
                feature_files[2].write(content + '\n')

                content = json_data['domain']
                feature_files[3].write(content + '\n')

                content = json_data['title'].replace('\n', ' ').replace('\r', ' ').lower()
                feature_files[4].write(content + '\n')

                content = json_data['published']['$date']
                feature_files[5].write(content + '\n')

                content = json_data['content'].replace('\n', ' ').replace('\r', ' ')
                content = strip_tags(content).lower()
                content = re.sub(r"[^a-zA-Z0-9]", "", content)
                feature_files[6].write(content + '\n')

                content = json_data['meta_tags']["twitter:description"].replace('\n', ' ').replace('\r', ' ').lower()
                feature_files[7].write(content + '\n')
```
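One possible way to balance speed against memory here is to buffer the assembled strings in memory and only call write() every few thousand rows; below is a minimal sketch of that idea (the helper name, the chunk size of 10,000, and the example field extractor are illustrative assumptions, not part of the snippet above):

```python
import json

def write_in_chunks(inp_path, out_path, extract_field, chunk_size=10_000):
    """Buffer extracted values in memory and write them to disk in chunks
    instead of issuing one write() call per input line."""
    buffer = []
    with open(inp_path, encoding='utf-8') as inp, \
         open(out_path, 'w', encoding='utf-8') as out:
        for line in inp:
            buffer.append(extract_field(json.loads(line)))
            if len(buffer) >= chunk_size:           # flush every chunk_size lines
                out.write('\n'.join(buffer) + '\n')
                buffer.clear()
        if buffer:                                   # flush whatever is left
            out.write('\n'.join(buffer) + '\n')

# Hypothetical usage:
# write_in_chunks('raw.json', 'train_domain.txt', lambda d: d['domain'])
```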
From the comments:
Why do you think that 8 writes result in 8 physical writes to your hard disk? The file object itself buffers what is to be written; when it decides to hand the data to your operating system, the OS may also wait a while before physically writing it, and even then the hard drive has its own buffers that can hold on to the file contents for a while before it really starts writing. Have a look at How often does python flush to a file?
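To make those layers explicit, here is a small sketch (using a throwaway file name) of how the Python-level buffer and the OS-level cache can each be flushed explicitly; for the use case above, neither call should normally be necessary:

```python
import os

with open('example.txt', 'w', encoding='utf-8') as f:
    f.write('a line\n')     # lands in Python's internal buffer first
    f.flush()               # push Python's buffer down to the operating system
    os.fsync(f.fileno())    # ask the OS to push its cache to the physical disk
# Closing the file (here via the with-block) also flushes the Python-level buffer.
```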
You should not use exceptions as control flow, and you should not recurse where it is not needed. Each recursion sets up a new call stack for the function call, which takes resources and time, and all of those calls have to be unwound again afterwards.
The best thing to do would be to clean up your data before feeding it into json.loads(). The next best thing is to avoid the recursion ... try something along these lines:
```python
def read_json_line(line=None):
    result = None
    while result is None and line:  # an empty line is falsy, avoids an endless loop
        try:
            result = json.loads(line)
        except Exception as e:
            result = None
            # Find the offending character index:
            idx_to_replace = int(str(e).split(' ')[-1].replace(')', ''))
            # Slice away the offending character:
            line = line[:idx_to_replace] + line[idx_to_replace + 1:]
    return result
```
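A quick, hypothetical usage example (the stray trailing byte in the sample line is made up; json must already be imported, as in the snippets above):

```python
broken = '{"domain": "medium.com", "title": "Hello"}\x00'
print(read_json_line(broken))
# -> {'domain': 'medium.com', 'title': 'Hello'}
```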