Parsing a CSV file using gawk

How do you parse a CSV file using gawk? Simply setting FS="," is not enough, because a quoted field containing a comma will be treated as multiple fields.

Example using FS="," which does not work:

File contents:

one,two,"three, four",five
"six, seven",eight,"nine"

gawk script:

BEGIN { FS="," }
{
  for (i=1; i<=NF; i++) printf "field #%d: %s\n", i, $(i)
  printf "---------------------------\n"
}

Incorrect output:

field #1: one
field #2: two
field #3: "three
field #4:  four"
field #5: five
---------------------------
field #1: "six
field #2:  seven"
field #3: eight
field #4: "nine"
---------------------------

Desired output:

field #1: one
field #2: two
field #3: "three, four"
field #4: five
---------------------------
field #1: "six, seven"
field #2: eight
field #3: "nine"
---------------------------


The short answer is "I wouldn't use gawk to parse the CSV if the CSV contains awkward data", where "awkward" means things like commas in the CSV field data.

The next question is "what other processing are you going to be doing", since that will influence which alternative you use.

I would probably use Perl with the Text::CSV or Text::CSV_XS module to read and process the data. Remember, Perl was originally written in part as an awk and sed killer, hence the a2p and s2p programs still distributed with Perl, which convert awk and sed scripts (respectively) into Perl.


The gawk version 4 manual says to use FPAT = "([^,]*)|(\"[^\"]+\")"

When FPAT is defined, it disables FS and specifies fields by content instead of by separator.


You can use a simple wrapper function called csvquote to sanitize the input and restore it after awk is done processing it. Pipe your data through it at the start and end, and everything should work out fine:

Before:

gawk -f mypgoram.awk input.csv

After:

csvquote input.csv | gawk -f mypgoram.awk | csvquote -u

See https://github.com/dbro/csvquote for code and documentation.


If permissible, I would use the Python csv module, paying special attention to the dialect used and the formatting parameters required, to parse the CSV file you have.
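As a hedged sketch of that suggestion (not from the original answer), the sample lines can be piped through Python's csv module from the shell. Note that csv.reader strips the surrounding quotes, unlike the desired output shown in the question:

```shell
# csv.reader understands quoted fields with embedded commas natively.
printf '%s\n' 'one,two,"three, four",five' |
python3 -c '
import csv, sys
for row in csv.reader(sys.stdin):
    for i, field in enumerate(row, 1):
        print("field #%d: %s" % (i, field))
'
# Prints:
#   field #1: one
#   field #2: two
#   field #3: three, four
#   field #4: five
```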


csv2delim.awk

# csv2delim.awk converts comma delimited files with optional quotes to delim separated file
#     delim can be any character, defaults to tab
# assumes no repl characters in text, any delim in line converts to repl
#     repl can be any character, defaults to ~
# changes two consecutive quotes within quotes to '

# usage: gawk -f csv2delim.awk [-v delim=d] [-v repl=`"] input-file > output-file
#       -v delim    delimiter, defaults to tab
#       -v repl     replacement char, defaults to ~

# e.g. gawk -v delim=; -v repl=` -f csv2delim.awk test.csv > test.txt

# abe 2-28-7
# abe 8-8-8  1.0 fixed empty fields, added replacement option
# abe 8-27-8 1.1 used split
# abe 8-27-8 1.2 inline rpl and "" = '
# abe 8-27-8 1.3 revert to 1.0 as it is much faster, split most of the time
# abe 8-29-8 1.4 better message if delim present

BEGIN {
    if (delim == "") delim = "\t"
    if (repl == "") repl = "~"
    print "csv2delim.awk v.m 1.4 run at " strftime() > "/dev/stderr" ###########################################
}

{
    #if ($0 ~ repl) {
    #   print "Replacement character " repl " is on line " FNR ": " lineIn ";" > "/dev/stderr"
    #}
    if ($0 ~ delim) {
        print "Temp delimiter character " delim " is on line " FNR ": " lineIn ";" > "/dev/stderr"
        print "    replaced by " repl > "/dev/stderr"
    }
    gsub(delim, repl)

    $0 = gensub(/([^,])""/, "\\1'", "g")
#   $0 = gensub(/""([^,])/, "'\\1", "g")  # not needed, above covers all cases

    out = ""
    #for (i = 1;  i <= length($0);  i++)
    n = length($0)
    for (i = 1;  i <= n;  i++)
        if ((ch = substr($0, i, 1)) == "\"")
            inString = (inString) ? 0 : 1 # toggle inString
        else
            out = out ((ch == "," && ! inString) ? delim : ch)
    print out
}

END {
    print NR " records processed from " FILENAME " at " strftime() > "/dev/stderr"
}

test.csv

"first","second","third"
"fir,st","second","third"
"first","sec""ond","third"
" first",sec   ond,"third"
"first" ,"second","th  ird"
"first","sec;ond","third"
"first","second","th;ird"
1,2,3
,2,3
1,2,
,2,
1,,2
1,"2",3
"1",2,"3"
"1",,"3"
1,"",3
"","",""
"","""aiyn","oh"""
"""","""",""""
11,2~2,3

test.bat

rem test csv2delim
rem default is: -v delim={tab} -v repl=~
gawk                      -f csv2delim.awk test.csv > test.txt
gawk -v delim=;           -f csv2delim.awk test.csv > testd.txt
gawk -v delim=; -v repl=` -f csv2delim.awk test.csv > testdr.txt
gawk            -v repl=` -f csv2delim.awk test.csv > testr.txt


{
  ColumnCount = 0
  $0 = $0 ","                           # Assures all fields end with comma
  while ($0)                            # Get fields by pattern, not by delimiter
  {
    match($0, / *"[^"]*" *,|[^,]*,/)    # Find a field with its delimiter suffix
    Field = substr($0, RSTART, RLENGTH) # Get the located field with its delimiter
    gsub(/^ *"?|"? *,$/, "", Field)     # Strip delimiter text: comma/space/quote
    Column[++ColumnCount] = Field       # Save field without delimiter in an array
    $0 = substr($0, RLENGTH + 1)        # Remove processed text from the raw data
  }
}

Patterns that follow this model can access the fields in Column[]. ColumnCount indicates the number of elements found in Column[]. If not every row contains the same number of columns, Column[] holds extra data after Column[ColumnCount] when a shorter row is processed.

This implementation is slow, but it appears to emulate the FPAT/patsplit() feature found in gawk >= 4.0.0, which was mentioned in a previous answer.

Reference


I am not sure if this is the right way to do things. I would rather work with a CSV file in which either all values are quoted or none are. By the way, awk allows a regex to be the field separator. Check whether that is useful.
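As a small sketch of that last point, a regex field separator does not solve the quoting problem, but it does handle things like optional spaces around commas:

```shell
# FS may be a regex: here, a comma with any amount of surrounding spaces.
printf '%s\n' 'one , two,three' |
awk 'BEGIN { FS = " *, *" }
     { for (i = 1; i <= NF; i++) printf "field #%d: %s\n", i, $i }'
# Prints:
#   field #1: one
#   field #2: two
#   field #3: three
```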


Perl has the Text::CSV_XS module, which is purpose-built to handle the quoted-comma weirdness. Alternatively, try the Text::CSV module.

Run against the sample file above, it produces this output:

field #0: one
field #1: two
field #2: three, four
field #3: five
---
field #0: six, seven
field #1: eight
field #2: nine
---

Here is a human-readable version. Save it as parsecsv, chmod +x it, and run it as "parsecsv file.csv".

#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new();
open(my $data, '<', $ARGV[0]) or die "Could not open '$ARGV[0]' $!\n";
while (my $line = <$data>) {
    if ($csv->parse($line)) {
        my @f = $csv->fields();
        for my $n (0..$#f) {
            print "field #$n: $f[$n]\n";
        }
        print "---\n";
    }
}

You may need to point at a different version of Perl on your machine, since the Text::CSV_XS module may not be installed on your default version of Perl.

Can't locate Text/CSV_XS.pm in @INC (@INC contains: /home/gnu/lib/perl5/5.6.1/i686-linux /home/gnu/lib/perl5/5.6.1 /home/gnu/lib/perl5/site_perl/5.6.1/i686-linux /home/gnu/lib/perl5/site_perl/5.6.1 /home/gnu/lib/perl5/site_perl .).
BEGIN failed--compilation aborted.

If none of your versions of Perl have Text::CSV_XS installed, you will need to install it.


This is what I came up with. Any comments and/or better solutions would be appreciated.

BEGIN { FS="," }
{
  n = 0   # reset the field buffer for each record
  for (i=1; i<=NF; i++) {
    f[++n] = $i
    if (substr(f[n],1,1) == "\"") {
      while (substr(f[n], length(f[n])) != "\"" || substr(f[n], length(f[n])-1, 1) == "\\") {
        f[n] = sprintf("%s,%s", f[n], $(++i))
      }
    }
  }
  for (i=1; i<=n; i++) printf "field #%d: %s\n", i, f[i]
  print "----------------------------------"
}

The basic idea is to loop over the fields, and any field which starts with a quote but does not end with a quote gets the next field appended to it.