Combine multiple data files into np.arrays, which are stored in dictionaries
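I am trying to load a large data set. I have about 8K daily files, each containing arrays with hundreds of measurements. I can load a single day's file into a set of numpy arrays stored in a dictionary. To load all of the daily files, I initialize a dictionary with the required keys. Then I loop over the list of files, load each one, and try to store its arrays in the larger dictionary.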
```python
import numpy as np

all_measurements = np.asarray([get_n_measurements(directory, name) for name in files])
error_files = []

# One preallocated array per key; .copy() gives every key its own buffer.
temp = np.full(all_measurements.sum(), fill_value, dtype=np.float64)
all_data = {key: temp.copy() for key in sample_file}

start_index = 0
for data_file, n_measurements in zip(file_list, all_measurements):
    file_data = one_file(data_file)  # Load one data file into a dict.
    for key, value in file_data.items():  # I've tried .iteritems(), .viewitems() as well.
        try:
            all_data[key][start_index : start_index + n_measurements] = value
        except ValueError as msg:
            error_files.append((data_file, msg))
    # Advance once per file, after all of its keys have been stored.
    start_index += n_measurements
```
I have already checked ...
Here is an example of the data structures:
```python
all_data = {'a': array([ 0.76290858, 0.83449302, ..., 0.06186873]),
            'b': array([ 0.32939997, 0.00111448, ..., 0.72303435])}

file_data = {'a': array([ 0.00915347, 0.39020354]),
             'b': array([ 0.8992421 , 0.18964702])}
```
It turned out that everything was being put into the same container. The code above has been edited and the problem is corrected.
If I have understood your code correctly, and if that is the case, then:
```python
all_measurements = np.array(
    [len(n_measurements) for n_measurements in file_list]
)
```
Alternatively, how about using EDOCX1[1] as the shape of the new np.array when you initialize it?
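One way to sketch the idea of sizing the combined arrays from the per-file lengths is shown below. The function `combine` and the in-memory `files` list are hypothetical stand-ins for loading real files, and the sketch assumes that all arrays within one file have the same length and that every file has the same keys.

```python
import numpy as np

def combine(file_dicts, fill_value=np.nan):
    """Combine a list of {key: 1-D array} dicts into one dict of arrays."""
    keys = list(file_dicts[0])
    # Number of measurements per file, read from the first key.
    lengths = np.array([len(d[keys[0]]) for d in file_dicts])
    # Start offset of each file inside the combined arrays.
    starts = np.concatenate(([0], np.cumsum(lengths)[:-1]))
    # One independent preallocated array per key.
    combined = {k: np.full(lengths.sum(), fill_value, dtype=np.float64)
                for k in keys}
    for d, start, n in zip(file_dicts, starts, lengths):
        for k in keys:
            combined[k][start:start + n] = d[k]
    return combined

# Hypothetical data standing in for two daily files.
files = [
    {"a": np.array([1.0, 2.0]), "b": np.array([3.0, 4.0])},
    {"a": np.array([5.0]), "b": np.array([6.0])},
]
result = combine(files)
```

Computing the offsets up front keeps the write position independent of how many keys each file holds.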