Linux: Converting an HTML table to a CSV file from the shell

I am trying to convert a file that contains an HTML table to CSV format. An excerpt of the file is shown below:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head id="Head1"><link rel="shortcut icon" href="favicon.ico" />
<title>Untitled Page</title>
</head>
<body>
<form name="form1" method="post" action="mypricelist.aspx" id="form1">
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/somethingrandom" />

<table id="price_list" border="0">
<tr>
<td>ProdCode</td><td>Description</td><td>Your Price</td>
</tr><tr>
<td>ab101</td><td>loruem</td><td>1.1</td>
</tr><tr>
<td>ab102</td><td>ipsum</td><td>0.1</td>
</tr><tr>

I tried using

xls2csv -x -c\; evprice.xls > evprice.csv

but that gave me an error saying

evprice.xls is not OLE file or Error

I searched Google. According to what I found, this happens because the file is not a real XLS file but HTML.
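The same check can be sketched in Python by looking at the first bytes of the file. This is a minimal sketch, assuming that legacy binary .xls files begin with the OLE2 compound-document signature `D0 CF 11 E0 A1 B1 1A E1`; the helper name `sniff_spreadsheet` and the HTML heuristic are illustrative and not part of any tool mentioned here.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound-document signature


def sniff_spreadsheet(path):
    """Return 'ole2', 'html', or 'unknown' based on the first bytes of the file."""
    with open(path, "rb") as fh:
        head = fh.read(1024)
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    # Heuristic (assumption): HTML exports start with a doctype or an
    # <html>/<table> tag, possibly after a UTF-8 BOM or whitespace.
    text = head.lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    if text.startswith((b"<!doctype html", b"<html", b"<table")):
        return "html"
    return "unknown"
```

Calling `sniff_spreadsheet("evprice.xls")` on a file like the excerpt above would be expected to return `'html'`, which matches the error reported by xls2csv.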

When I run

file evprice.xls

it reports that the file is HTML. I then found a "solution" that uses LibreOffice:

libreoffice --headless -convert-to csv ./evprice.xls

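This did not produce an error, but the CSV output file is garbage, like what you see when you open an exe file in Notepad.

It is full of strange characters like these:

—??-t9ü~?óXtK￠

Does anyone know why this happens, and is there a working solution?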

Related discussion

I have built a Python utility that converts every table in an HTML file into a separate CSV file.

You can find it here.

The core of the script is:

import csv
import sys

from bs4 import BeautifulSoup  # BeautifulSoup 4 (package name: beautifulsoup4)

filename = "MY_HTML_FILE"  # path to the HTML file to convert

print("Opening file")
with open(filename, "r") as fin:
    html = fin.read()

print("Parsing file")
# html.parser ships with Python; bs4 decodes HTML entities by default.
soup = BeautifulSoup(html, "html.parser")

print("Preemptively removing unnecessary tags")
for s in soup("script"):
    s.extract()

print("CSVing file")
# Each table is written to <prefix><n>.csv, where <prefix> is the first
# command-line argument and <n> is the table's zero-based index.
for tablecount, table in enumerate(soup.find_all("table")):
    print("Processing Table #%d" % tablecount)
    # newline="" lets the csv module control the line endings itself.
    with open(sys.argv[1] + str(tablecount) + ".csv", "w", newline="") as csvfile:
        fout = csv.writer(csvfile, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for row in table.find_all("tr"):
            cols = row.find_all(["td", "th"])
            if cols:
                fout.writerow([x.text for x in cols])
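
If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library alone. This is not the utility linked above, only a minimal sketch under a few assumptions: the class `TableRows` and the function `table_to_csv` are hypothetical names, the table is selected by its `id` attribute (`price_list` in the question's excerpt), and the default `;` delimiter mirrors the separator passed to xls2csv in the question.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every row inside <table id=table_id>."""

    def __init__(self, table_id):
        super().__init__(convert_charrefs=True)
        self.table_id = table_id
        self.depth = 0          # > 0 while inside the target table
        self.in_cell = False    # True between <td>/<th> and its end tag
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # Enter the target table, or track a table nested inside it.
            if self.depth or dict(attrs).get("id") == self.table_id:
                self.depth += 1
        elif self.depth and tag == "tr":
            self.rows.append([])
            self.in_cell = False
        elif self.depth and tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.depth and self.in_cell:
            self.rows[-1][-1] += data


def table_to_csv(html, table_id, delimiter=";"):
    """Return the rows of the selected table as CSV text."""
    parser = TableRows(table_id)
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    # Skip empty rows, such as a trailing <tr> with no cells.
    writer.writerows(row for row in parser.rows if row)
    return out.getvalue()


sample = """<table id="price_list" border="0">
<tr>
<td>ProdCode</td><td>Description</td><td>Your Price</td>
</tr><tr>
<td>ab101</td><td>loruem</td><td>1.1</td>
</tr><tr>
"""
print(table_to_csv(sample, "price_list"), end="")
# ProdCode;Description;Your Price
# ab101;loruem;1.1
```

The parser tolerates the truncated markup in the excerpt because it only records what it has seen so far; the price to pay is that nested tables are flattened into the rows of the outer table.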