Parsing a CSV file using gawk
How do you parse a CSV file using gawk? Simply setting FS="," does not work, because a quoted field with a comma inside it gets split into multiple fields.
File contents:
    one,two,"three, four",five
    "six, seven",eight,"nine"
The gawk script:
    BEGIN { FS="," }
    {
        for (i=1; i<=NF; i++) printf "field #%d: %s\n", i, $(i)
        printf "---------------------------\n"
    }
Erroneous output:
    field #1: one
    field #2: two
    field #3: "three
    field #4:  four"
    field #5: five
    ---------------------------
    field #1: "six
    field #2:  seven"
    field #3: eight
    field #4: "nine"
    ---------------------------
Desired output:
    field #1: one
    field #2: two
    field #3: "three, four"
    field #4: five
    ---------------------------
    field #1: "six, seven"
    field #2: eight
    field #3: "nine"
    ---------------------------
The short answer is "I wouldn't use gawk to parse the CSV if the CSV contains awkward data", where "awkward" means things like commas inside the CSV field data.
The next question is "What other processing are you going to be doing?", since that will influence which alternatives you use.
I'd probably use Perl and the Text::CSV or Text::CSV_XS modules to read and process the data. Remember, Perl was originally written in part as an awk and sed killer, hence the a2p and s2p programs which convert awk and sed scripts into Perl.
The GAWK version 4 manual says to use FPAT to define fields by their content rather than by a separator; when FPAT is set, it takes precedence over FS.
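A minimal sketch of that approach, using the field pattern given in the gawk manual (it requires gawk >= 4.0 and does not handle empty fields or doubled quotes inside a field):

    BEGIN {
        # a field is either unquoted text without commas, or a double-quoted string
        FPAT = "([^,]+)|(\"[^\"]+\")"
    }
    {
        for (i = 1; i <= NF; i++) printf "field #%d: %s\n", i, $i
        printf "---------------------------\n"
    }

For the two sample lines in the question this yields the desired output, with the surrounding quotes kept on the quoted fields.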
You can use a simple wrapper called csvquote to sanitize the input and restore it after awk is done processing it. Pipe your data through it at the start and end, and everything should work out OK:
Before:
    gawk -f myprogram.awk input.csv
After:
    csvquote input.csv | gawk -f myprogram.awk | csvquote -u
See https://github.com/dbro/csvquote for code and documentation.
If permissible, I would use the Python csv module, paying special attention to the dialect used and the formatting parameters required, to parse the CSV file you have.
csv2delim.awk
    # csv2delim.awk converts comma delimited files with optional quotes to delim separated file
    #     delim can be any character, defaults to tab
    #     assumes no repl characters in text, any delim in line converts to repl
    #         repl can be any character, defaults to ~
    #     changes two consecutive quotes within quotes to '
    # usage: gawk -f csv2delim.awk [-v delim=d] [-v repl=`] input-file > output-file
    #     -v delim    delimiter, defaults to tab
    #     -v repl     replacement char, defaults to ~
    # e.g. gawk -v delim=; -v repl=` -f csv2delim.awk test.csv > test.txt
    # abe 2-28-7
    # abe 8-8-8   1.0 fixed empty fields, added replacement option
    # abe 8-27-8  1.1 used split
    # abe 8-27-8  1.2 inline rpl and "" = '
    # abe 8-27-8  1.3 revert to 1.0 as it is much faster, split most of the time
    # abe 8-29-8  1.4 better message if delim present

    BEGIN {
        if (delim == "") delim = "\t"
        if (repl == "") repl = "~"
        print "csv2delim.awk v.m 1.4 run at " strftime() > "/dev/stderr" ###########################################
    }

    {
        # if ($0 ~ repl) {
        #     print "Replacement character " repl " is on line " FNR ": " lineIn ";" > "/dev/stderr"
        # }
        if ($0 ~ delim) {
            print "Temp delimiter character " delim " is on line " FNR ": " lineIn ";" > "/dev/stderr"
            print "    replaced by " repl > "/dev/stderr"
        }
        gsub(delim, repl)

        $0 = gensub(/([^,])""/, "\\1'", "g")
        # $0 = gensub(/""([^,])/, "'\\1", "g")   # not needed, the line above covers all cases

        out = ""
        # for (i = 1; i <= length($0); i++)
        n = length($0)
        for (i = 1; i <= n; i++)
            if ((ch = substr($0, i, 1)) == "\"")
                inString = (inString) ? 0 : 1   # toggle inString
            else
                out = out ((ch == "," && ! inString) ? delim : ch)
        print out
    }

    END {
        print NR " records processed from " FILENAME " at " strftime() > "/dev/stderr"
    }
test.csv
    "first","second","third"
    "fir,st","second","third"
    "first","sec""ond","third"
    " first",sec ond,"third"
    "first" ,"second","th ird"
    "first","sec;ond","third"
    "first","second","th;ird"
    1,2,3
    ,2,3
    1,2,
    ,2,
    1,,2
    1,"2",3
    "1",2,"3"
    "1",,"3"
    1,"",3
    "","",""
    "","""aiyn","oh"""
    """","""",""""
    11,2~2,3
test.bat
    rem test csv2delim
    rem default is: -v delim={tab} -v repl=~
    gawk -f csv2delim.awk test.csv > test.txt
    gawk -v delim=; -f csv2delim.awk test.csv > testd.txt
    gawk -v delim=; -v repl=` -f csv2delim.awk test.csv > testdr.txt
    gawk -v repl=` -f csv2delim.awk test.csv > testr.txt
    {
        ColumnCount = 0
        $0 = $0 ","                             # Assures all fields end with comma
        while ($0)                              # Get fields by pattern, not by delimiter
        {
            match($0, / *"[^"]*" *,|[^,]*,/)    # Find a field with its delimiter suffix
            Field = substr($0, RSTART, RLENGTH) # Get the located field with its delimiter
            gsub(/^ *"?|"? *,$/, "", Field)     # Strip delimiter text: comma/space/quote
            Column[++ColumnCount] = Field       # Save field without delimiter in an array
            $0 = substr($0, RLENGTH + 1)        # Remove processed text from the raw data
        }
    }
Patterns that follow this one can access the fields in Column[]. ColumnCount indicates the number of elements found in Column[]. If not all rows contain the same number of columns, Column[] contains extra data after Column[ColumnCount] when processing the shorter rows.
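For example, a minimal follow-up rule (purely illustrative) that prints whatever the rule above collected could look like this:

    # runs after the field-collecting rule above; Column[] and ColumnCount are set there
    {
        for (i = 1; i <= ColumnCount; i++)
            printf "field #%d: %s\n", i, Column[i]
        print "---"
    }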
This implementation is slow, but it appears to mimic the splitting-by-content (FPAT) feature of gawk >= 4.0.0 mentioned in a previous answer.
I am not sure whether this is the right way of doing things. I would rather work with a CSV file in which either all values are quoted or none are. BTW, awk allows a regular expression to be used as the field separator. Check whether that is useful.
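As a small sketch of that (a regex field separator by itself still does not cope with commas inside quoted fields):

    # split on a comma with optional surrounding spaces
    BEGIN { FS = " *, *" }
    { print "second field:", $2 }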
Perl has the Text::CSV_XS module which is purpose-built to handle the quoted-comma weirdness. Alternatively, try the Text::CSV module.
Used as a command-line one-liner (the human-readable script version is shown below), it produces this output:
    field #0: one
    field #1: two
    field #2: three, four
    field #3: five
    ---
    field #0: six, seven
    field #1: eight
    field #2: nine
    ---
Here is a human-readable version. Save it as parsecsv, chmod +x it, and run it as "parsecsv file.csv":
    #!/usr/bin/perl
    use warnings;
    use strict;

    use Text::CSV_XS;
    my $csv = Text::CSV_XS->new();

    open(my $data, '<', $ARGV[0]) or die "Could not open '$ARGV[0]' $!\n";
    while (my $line = <$data>) {
        if ($csv->parse($line)) {
            my @f = $csv->fields();
            for my $n (0..$#f) {
                print "field #$n: $f[$n]\n";
            }
            print "---\n";
        }
    }
You may need to point to a different version of Perl on your machine, since the Text::CSV_XS module may not be installed on your default version of Perl.
    Can't locate Text/CSV_XS.pm in @INC (@INC contains: /home/gnu/lib/perl5/5.6.1/i686-linux /home/gnu/lib/perl5/5.6.1 /home/gnu/lib/perl5/site_perl/5.6.1/i686-linux /home/gnu/lib/perl5/site_perl/5.6.1 /home/gnu/lib/perl5/site_perl .).
    BEGIN failed--compilation aborted.
If none of your versions of Perl have Text::CSV_XS installed, you will need to install it (it is available from CPAN, and most distributions also package it).
Here is what I came up with. Any comments and/or better solutions would be appreciated.
    BEGIN { FS="," }
    {
        for (i=1; i<=NF; i++) {
            f[++n] = $i
            if (substr(f[n],1,1) == "\"") {
                # field starts with a quote: keep appending the comma-split pieces
                # until the closing (unescaped) quote is found
                while (substr(f[n], length(f[n])) != "\"" || substr(f[n], length(f[n])-1, 1) == "\\") {
                    f[n] = sprintf("%s,%s", f[n], $(++i))
                }
            }
        }
        for (i=1; i<=n; i++) printf "field #%d: %s\n", i, f[i]
        print "----------------------------------"
        n = 0   # reset the field counter so fields do not accumulate across records
    }
The basic idea is to loop over the fields, and any field that starts with a quote but does not end with one gets the next field appended to it.