Python 3.x: concatenate zipped CSV files into one non-zipped CSV file

Here is my Python 3 code:
import zipfile
import os
import time
from timeit import default_timer as timer
import re
import glob
import pandas as pd

# local variables
# pc version
# the_dir = r'c:\ImpExpData'
# linux version
the_dir = '/home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95'

def main():
    """ this is the function that controls the processing """
    start_time = timer()
    for root, dirs, files in os.walk(the_dir):
        for file in files:
            if file.endswith(".zip"):
                print("working dir is ...", the_dir)
                zipPath = os.path.join(root, file)
                z = zipfile.ZipFile(zipPath, "r")
                for filename in z.namelist():
                    if filename.endswith(".csv"):
                        # print filename
                        if re.match(r'^Trade-Geo.*\.csv$', filename):
                            pass  # do something with geo file
                            # print" Geo data: ", filename
                        elif re.match(r'^Trade-Metadata.*\.csv$', filename):
                            pass  # do something with metadata file
                            # print"Metadata: ", filename
                        else:
                            try:
                                with zipfile.ZipFile(zipPath) as z:
                                    with z.open(filename) as f:
                                        # print("send to test def...", filename)
                                        # print(zipPath)
                                        with zipfile.ZipFile(zipPath) as z:
                                            with z.open(filename) as f:
                                                frame = pd.DataFrame()
                                                # EmptyDataError: No columns to parse from file -- how to deal with this error
                                                train_df = pd.read_csv(f, index_col=None, header=0, skiprows=1, encoding="cp1252")
                                                # train_df = pd.read_csv(f, header=0, skiprows=1, delimiter=",", encoding="cp1252")
                                                list_ = []
                                                list_.append(train_df)
                                                # print(list_)
                                                frame = pd.concat(list_, ignore_index=True)
                                                frame.to_csv('/home/ralph/Documents/lulumcusb/ImpExpData/Exports/concat_test.csv', encoding='cp1252')  # works
                            except:  # catches EmptyDataError: No columns to parse from file
                                print("EmptyDataError....", filename, "...", zipPath)
    # GetSubDirList(the_dir)
    end_time = timer()
    print("Elapsed time was %g seconds" % (end_time - start_time))

if __name__ == '__main__':
    main()
It mostly works; it just does not concatenate all of the zipped CSV files into one file. One of the files is empty; all the CSV files share the same field structure, and each CSV file has a different number of rows.
Here is what Spyder reports when the script runs:
runfile('/home/ralph/Documents/lulumcusb/Sep15_cocncatCSV.py', wdir='/home/ralph/Documents/lulumcusb')
working dir is ... /home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95
EmptyDataError.... Trade-Exports-Chp-77.csv ... /home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95/Trade-Exports-Yr1992-1995.zip
/home/ralph/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py:688: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
  execfile(filename, namespace)
Elapsed time was 104.857 seconds
The resulting CSV ends up being just the last zipped CSV file; the output CSV changes size as each file is processed.
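On the EmptyDataError noted in the code comment above: pandas exposes that exception directly, so an empty chapter file could be skipped explicitly instead of with a bare except. A minimal sketch, assuming the member is read the same way as in my script:

import zipfile
import pandas as pd

def read_member(zip_path, member_name):
    """Return a DataFrame for one CSV inside the archive, or None if it is empty."""
    with zipfile.ZipFile(zip_path) as z:
        with z.open(member_name) as f:
            try:
                return pd.read_csv(f, header=0, skiprows=1, encoding="cp1252")
            except pd.errors.EmptyDataError:
                print("empty member, skipping:", member_name)
                return None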
There are 99 CSV files in the zip archive, and I want to combine them into one non-compressed CSV file.
The field or column names are: colmnames=["hs_code","uom","country","state","prov","value","quantity","year","month"]
The CSV files are labelled chp01.csv, chp02.csv, and so on up to chp99.csv. "UOM" (unit of measure) is either blank, an integer, or a string, depending on the hs_code.
Question: how do I concatenate the zipped CSV files into one large (estimated 100 MB uncompressed) CSV file?
Additional details: I am trying to avoid extracting the CSV files, because then I would have to delete them afterwards. I need the concatenated file because I have further processing to do on it. Extracting the zipped CSV files is a workable option; I would just prefer not to have to.
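To make the goal concrete, here is roughly the shape of the result I am after. This is a sketch only: the Geo/Metadata filtering is simplified, and dtype=str is just an assumption to sidestep the mixed-type warning on "uom":

import zipfile
import pandas as pd

zip_path = '/home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95/Trade-Exports-Yr1992-1995.zip'
out_path = '/home/ralph/Documents/lulumcusb/ImpExpData/Exports/concat_test.csv'

frames = []
with zipfile.ZipFile(zip_path) as z:
    for name in z.namelist():
        # chapter files only; the Geo and Metadata CSVs get handled separately
        if not name.endswith('.csv') or 'Geo' in name or 'Metadata' in name:
            continue
        with z.open(name) as f:
            try:
                frames.append(pd.read_csv(f, header=0, skiprows=1,
                                           encoding='cp1252', dtype=str))
            except pd.errors.EmptyDataError:
                continue  # the one empty chapter file

# concatenate all chapters once, then write a single uncompressed CSV
pd.concat(frames, ignore_index=True).to_csv(out_path, index=False, encoding='cp1252')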
Is there any reason you don't want to do this in your shell?
Assuming the order of concatenation is irrelevant:
cd "/home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95"
unzip "Trade-Exports-Yr1992-1995.zip" -d unzipped && cd unzipped
for f in Trade-Exports-Chp*.csv; do tail --lines=+2 "$f" >> concat.csv; done
This strips the header line from each file before appending it to concat.csv.
If you had just done:
tail --lines=+2 Trade-Exports-Chp*.csv > concat.csv
you would end up with:
==> Trade-Exports-Chp-1.csv <==
...

==> Trade-Exports-Chp-10.csv <==
...

==> Trade-Exports-Chp-2.csv <==
...

etc.
If you care about the order, change Trade-Exports-Chp*.csv in the loop to Trade-Exports-Chp-{1..99}.csv so the chapters are appended in numeric order.
While this is doable in Python, I don't think it is the right tool for the job in this case.
If you want to do the job in place, without actually extracting the zip file:
for i in {1..99}; do
  unzip -p "Trade-Exports-Yr1992-1995.zip" "Trade-Exports-Chp$i.csv" | tail --lines=+2 >> concat.csv
done
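For reference, a rough Python equivalent of that pipeline (a sketch only: the same archive name and the member-name pattern from the loop above are assumed, and no pandas is needed for a plain append):

import io
import zipfile

with zipfile.ZipFile('Trade-Exports-Yr1992-1995.zip') as z, \
        open('concat.csv', 'w', encoding='cp1252') as out:
    for i in range(1, 100):
        name = 'Trade-Exports-Chp{}.csv'.format(i)
        with z.open(name) as member:
            text = io.TextIOWrapper(member, encoding='cp1252')
            next(text, None)  # drop the header line, like tail --lines=+2
            for line in text:
                out.write(line)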