Python 3.x: concatenate zipped CSV files into one non-zipped CSV file

Here is my Python 3 code:
import zipfile
import os
import time
from timeit import default_timer as timer
import re
import glob
import pandas as pd

# local variables
# pc version
# the_dir = r'c:\ImpExpData'
# linux version
the_dir = '/home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95'

def main():
    """ this is the function that controls the processing """
    start_time = timer()
    for root, dirs, files in os.walk(the_dir):
        for file in files:
            if file.endswith(".zip"):
                print("working dir is ...", the_dir)
                zipPath = os.path.join(root, file)
                z = zipfile.ZipFile(zipPath, "r")
                for filename in z.namelist():
                    if filename.endswith(".csv"):
                        # print filename
                        if re.match(r'^Trade-Geo.*\.csv$', filename):
                            pass  # do something with geo file
                            # print" Geo data: ", filename
                        elif re.match(r'^Trade-Metadata.*\.csv$', filename):
                            pass  # do something with metadata file
                            # print"Metadata: ", filename
                        else:
                            try:
                                with zipfile.ZipFile(zipPath) as z:
                                    with z.open(filename) as f:
                                        # print("send to test def...", filename)
                                        # print(zipPath)
                                        with zipfile.ZipFile(zipPath) as z:
                                            with z.open(filename) as f:
                                                frame = pd.DataFrame()
                                                # EmptyDataError: No columns to parse from file -- how to deal with this error
                                                train_df = pd.read_csv(f, index_col=None, header=0, skiprows=1, encoding="cp1252")
                                                # train_df = pd.read_csv(f, header=0, skiprows=1, delimiter=",", encoding="cp1252")
                                                list_ = []
                                                list_.append(train_df)
                                                # print(list_)
                                                frame = pd.concat(list_, ignore_index=True)
                                                frame.to_csv('/home/ralph/Documents/lulumcusb/ImpExpData/Exports/concat_test.csv', encoding='cp1252')  # works
                            except:  # catches EmptyDataError: No columns to parse from file
                                print("EmptyDataError....", filename, "...", zipPath)
    # GetSubDirList(the_dir)
    end_time = timer()
    print("Elapsed time was %g seconds" % (end_time - start_time))

if __name__ == '__main__':
    main()
It mostly works; it just does not concatenate all of the zipped CSV files into one file. One of the files is empty; all the CSV files share the same field structure, and each CSV file has a different number of rows.
Here is what Spyder reports when the script runs:
runfile('/home/ralph/Documents/lulumcusb/Sep15_cocncatCSV.py', wdir='/home/ralph/Documents/lulumcusb')
working dir is ... /home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95
EmptyDataError.... Trade-Exports-Chp-77.csv ... /home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95/Trade-Exports-Yr1992-1995.zip
/home/ralph/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py:688: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
  execfile(filename, namespace)
Elapsed time was 104.857 seconds
The resulting CSV ends up being just the last zipped CSV file; the output CSV changes size as each file is processed.
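On the EmptyDataError noted in the code comment above: pandas exposes that exception directly, so an empty chapter file could be skipped explicitly instead of with a bare except. A minimal sketch, assuming the member is read the same way as in my script:

import zipfile
import pandas as pd

def read_member(zip_path, member_name):
    """Return a DataFrame for one CSV inside the archive, or None if it is empty."""
    with zipfile.ZipFile(zip_path) as z:
        with z.open(member_name) as f:
            try:
                return pd.read_csv(f, header=0, skiprows=1, encoding="cp1252")
            except pd.errors.EmptyDataError:
                print("empty member, skipping:", member_name)
                return None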
There are 99 CSV files in the zip archive, and I want to combine them into one non-compressed CSV file.
The field or column names are: colmnames=["hs_code","uom","country","state","prov","value","quantity","year","month"]
The CSV files are labelled chp01.csv, chp02.csv, and so on up to chp99.csv. "UOM" (unit of measure) is either blank, an integer, or a string, depending on the hs_code.
Question: how do I concatenate the zipped CSV files into one large (estimated 100 MB uncompressed) CSV file?
Additional details: I am trying to avoid extracting the CSV files, because then I would have to delete them afterwards. I need the concatenated file because I have further processing to do on it. Extracting the zipped CSV files is a workable option; I would just prefer not to have to.
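To make the goal concrete, here is roughly the shape of the result I am after. This is a sketch only: the Geo/Metadata filtering is simplified, and dtype=str is just an assumption to sidestep the mixed-type warning on "uom":

import zipfile
import pandas as pd

zip_path = '/home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95/Trade-Exports-Yr1992-1995.zip'
out_path = '/home/ralph/Documents/lulumcusb/ImpExpData/Exports/concat_test.csv'

frames = []
with zipfile.ZipFile(zip_path) as z:
    for name in z.namelist():
        # chapter files only; the Geo and Metadata CSVs get handled separately
        if not name.endswith('.csv') or 'Geo' in name or 'Metadata' in name:
            continue
        with z.open(name) as f:
            try:
                frames.append(pd.read_csv(f, header=0, skiprows=1,
                                           encoding='cp1252', dtype=str))
            except pd.errors.EmptyDataError:
                continue  # the one empty chapter file

# concatenate all chapters once, then write a single uncompressed CSV
pd.concat(frames, ignore_index=True).to_csv(out_path, index=False, encoding='cp1252')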
Is there any reason you don't want to do this in your shell?
Assuming the order of concatenation is irrelevant:
cd "/home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95"
unzip "Trade-Exports-Yr1992-1995.zip" -d unzipped && cd unzipped
for f in Trade-Exports-Chp*.csv; do tail --lines=+2 "$f" >> concat.csv; done
This strips the header line from each file before appending it to concat.csv.
If you had just done:
tail --lines=+2 Trade-Exports-Chp*.csv > concat.csv
you would end up with:
==> Trade-Exports-Chp-1.csv <==
...

==> Trade-Exports-Chp-10.csv <==
...

==> Trade-Exports-Chp-2.csv <==
...

etc.
If you care about the order, change Trade-Exports-Chp*.csv in the loop to Trade-Exports-Chp-{1..99}.csv so the chapters are appended in numeric order.
While this is doable in Python, I don't think it is the right tool for the job in this case.
If you want to do the job in place, without actually extracting the zip file:
for i in {1..99}; do
  unzip -p "Trade-Exports-Yr1992-1995.zip" "Trade-Exports-Chp$i.csv" | tail --lines=+2 >> concat.csv
done
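For reference, a rough Python equivalent of that pipeline (a sketch only: the same archive name and the member-name pattern from the loop above are assumed, and no pandas is needed for a plain append):

import io
import zipfile

with zipfile.ZipFile('Trade-Exports-Yr1992-1995.zip') as z, \
        open('concat.csv', 'w', encoding='cp1252') as out:
    for i in range(1, 100):
        name = 'Trade-Exports-Chp{}.csv'.format(i)
        with z.open(name) as member:
            text = io.TextIOWrapper(member, encoding='cp1252')
            next(text, None)  # drop the header line, like tail --lines=+2
            for line in text:
                out.write(line)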