Screen scraping: How can I scrape an HTML table to CSV?

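The Problem

I use a tool at work that lets me run queries and get back HTML tables of information. I have no back-end access to it.

Much of this information would be far more useful if I could put it into a spreadsheet for sorting, averaging, and so on. How can I screen-scrape this data into a CSV file?

My First Idea

Since I know jQuery, I thought I could use it to strip out the table formatting on screen, insert commas and line breaks, then copy the whole mess into Notepad and save it as a CSV. Any better ideas?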

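The Solution

Yes, folks, it really was as easy as copying and pasting. Don't I feel silly.

Specifically, when I pasted into the spreadsheet, I had to choose "Paste Special" and select the "Text" format. Otherwise it tried to paste everything into a single cell, even when I had highlighted the whole spreadsheet.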

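Related discussion

- Select the HTML table in your tool's UI and copy it to the clipboard (if that is possible).
- Paste it into Excel.
- Save as a CSV file.

However, this is a manual solution, not an automated one.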
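
For an automated take on the same steps, here is a minimal sketch that uses only the Python standard library (`html.parser` and `csv`). The names `TableParser` and `html_table_to_csv` are made up for illustration, and the sketch assumes simple tables: no nested tables and no rowspan/colspan handling.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def _end_cell(self):
        if self._cell is not None and self._row is not None:
            # Collapse runs of whitespace inside the cell text.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()      # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._end_cell()     # tolerate a missing </td> or </th>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()


def html_table_to_csv(html):
    """Return all table rows found in `html` as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()
```

Save the page from the tool (or copy its HTML source), pass the text to `html_table_to_csv`, and write the result to a `.csv` file. The `csv` module takes care of quoting cells that contain commas.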

Related discussion

Using Python:

For example, imagine you want to scrape forex quotes in CSV form from some site, such as fxquotes.

Then…

# Requires Python 3 and beautifulsoup4.
import os
from urllib.request import urlopen

from bs4 import BeautifulSoup

date_s = '&date1=01/01/08'
date_f = '&date=11/10/08'
fx_url = 'http://www.oanda.com/convert/fxhistory?date_fmt=us'
fx_url_end = '&lang=en&margin_fixed=0&format=CSV&redirected=1'
cur1, cur2 = 'USD', 'AUD'
fx_url = fx_url + date_f + date_s + '&exch=' + cur1 + '&exch2=' + cur1
fx_url = fx_url + '&expr=' + cur2 + '&expr2=' + cur2 + fx_url_end

data = urlopen(fx_url).read()
soup = BeautifulSoup(data, 'html.parser')
# The CSV text sits inside the first <pre> element; get_text() drops the tags.
data = soup.find('pre').get_text()

file_location = '/users/location_edit_this'  # edit this
file_name = os.path.join(file_location, 'usd_aus.csv')
with open(file_name, 'w') as f:
    f.write(data)

EDIT: to get values from a table (example from Palewire):

# Requires Python 3, mechanize and beautifulsoup4.
from mechanize import Browser
from bs4 import BeautifulSoup

mech = Browser()

url = "http://www.palewire.com/scrape/albums/2007.html"
page = mech.open(url)

html = page.read()
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table", border="1")

# Skip the header row, then read the cells of each data row.
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')

    rank = col[0].get_text(strip=True)
    artist = col[1].get_text(strip=True)
    album = col[2].get_text(strip=True)
    cover_link = col[3].img['src']

    record = (rank, artist, album, cover_link)
    print("|".join(record))

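Related discussion

This is my Python version, using the (currently) latest version of BeautifulSoup, which can be installed with:

$ pip install beautifulsoup4

The script reads HTML from standard input and writes the text found in all tables as properly formatted CSV.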

#!/usr/bin/env python3
import csv
import sys

from bs4 import BeautifulSoup


def cell_text(cell):
    # Join the text fragments of a cell, dropping surrounding whitespace.
    return " ".join(cell.stripped_strings)


soup = BeautifulSoup(sys.stdin.read(), 'html.parser')
output = csv.writer(sys.stdout)

for table in soup.find_all('table'):
    for row in table.find_all('tr'):
        col = [cell_text(cell) for cell in row.find_all(['td', 'th'])]
        output.writerow(col)
    output.writerow([])  # blank line between tables

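Related discussion

Even easier (because it saves the query for next time):

In Excel

Data / Import External Data / New Web Query

takes you to a URL prompt. Enter your URL and it will mark the tables on the page that are available to import. Voilà.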

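Related discussion

Two ways come to mind (especially for those of us who don't have Excel):

- Google Spreadsheets has an excellent importHTML function:
  - =importHTML("http://example.com/page/with/table", "table", index)
  - The index starts at 1
  - I recommend a copy and paste values shortly after import
  - File -> Download as -> CSV
- Python's superb pandas library has handy read_html and to_csv functions
  - Here's a basic Python 3 script that prompts for the URL, which table at that URL, and a file name for the CSV.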
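
The script mentioned in the last bullet is not reproduced here. As a rough idea of what such a script might look like, here is a minimal sketch built on `pandas.read_html` and `DataFrame.to_csv`. The function names `table_to_csv` and `prompt_and_save` are hypothetical, and it assumes pandas is installed together with an HTML parser it can use (lxml, or BeautifulSoup with html5lib).

```python
import pandas as pd


def table_to_csv(source, index, csv_path):
    """Save the index-th table (1-based, as in importHTML) found at `source`.

    `source` may be a URL, a file path or a file-like object.
    Returns the number of tables pandas found.
    """
    tables = pd.read_html(source)  # a list of DataFrames, one per <table>
    tables[index - 1].to_csv(csv_path, index=False)
    return len(tables)


def prompt_and_save():
    """Ask for the URL, the table number and the output file name."""
    url = input("URL: ")
    index = int(input("Table number (starting at 1): "))
    csv_path = input("CSV file name: ")
    found = table_to_csv(url, index, csv_path)
    print(f"Saved table {index} of {found} to {csv_path}")
```

Call `prompt_and_save()` at the bottom of the file to run it interactively. Note that `read_html` raises `ValueError` when the page contains no tables.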

A basic Python implementation using BeautifulSoup that also takes rowspan and colspan into account:

from bs4 import BeautifulSoup


def table2csv(html_txt):
    csvs = []
    soup = BeautifulSoup(html_txt, 'html.parser')
    tables = soup.find_all('table')

    for table in tables:
        csv = ''
        rows = table.find_all('tr')
        row_spans = []  # remaining row counts of cells that span several rows
        do_ident = False

        for tr in rows:
            cols = tr.find_all(['th', 'td'])

            for cell in cols:
                colspan = int(cell.get('colspan', 1))
                rowspan = int(cell.get('rowspan', 1))

                # Indent a row that starts below row-spanning cells.
                if do_ident:
                    do_ident = False
                    csv += ',' * len(row_spans)

                if rowspan > 1:
                    row_spans.append(rowspan)

                # Quote the cell text, doubling embedded quotes as CSV requires.
                text = cell.get_text().replace('"', '""')
                csv += '"{text}"'.format(text=text) + ',' * colspan

            if row_spans:
                # Iterate backwards so that pop(i) does not shift pending items.
                for i in range(len(row_spans) - 1, -1, -1):
                    row_spans[i] -= 1
                    if row_spans[i] < 1:
                        row_spans.pop(i)

            do_ident = bool(row_spans)

            csv += '\n'

        csvs.append(csv)

    return '\n\n'.join(csvs)

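Excel can open an HTTP page directly.

For example:

Click File, Open.

Under "File name", paste the URL, for example the URL of this page (How can I scrape an HTML table to CSV?).

Click OK.

Excel does its best to convert the HTML to a table.

It's not the most elegant solution, but it does work!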

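Quick and dirty:

Copy out of the browser into Excel, save as CSV.

Better solution (for long-term use):

Write a bit of code in the language of your choice that pulls the HTML contents down and scrapes out the bits you want. You could probably add all of the data operations (sorting, averaging, etc.) on top of the data retrieval. That way, you just run your code and get the actual report you want.

It all depends on how often you will be performing this particular task.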
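
To illustrate the "data operations on top of retrieval" idea, here is a minimal sketch of the reporting step only. It assumes the rows have already been scraped into a list of lists whose first row is the header; the function name `report` and the sample data are invented for the example.

```python
import statistics


def report(rows, column):
    """Sort the data rows by a numeric column and compute that column's mean.

    `rows` is a list of lists whose first row is the header.
    Returns (header + sorted data rows, mean of the column).
    """
    header, data = rows[0], rows[1:]
    idx = header.index(column)
    ordered = sorted(data, key=lambda row: float(row[idx]))
    mean = statistics.mean(float(row[idx]) for row in data)
    return [header] + ordered, mean


# Hypothetical scraped data: cell values arrive as strings.
rows = [["item", "price"], ["b", "3.5"], ["a", "1.5"], ["c", "2.5"]]
sorted_rows, average = report(rows, "price")
print(sorted_rows)
print(average)
```

Whatever scraping method you pick only needs to produce the list of rows; the report logic stays the same.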

Here is a tested example that combines grequests and BeautifulSoup to download large quantities of pages from a structured website:

#!/usr/bin/env python3

import time

import grequests
from bs4 import BeautifulSoup


def cell_text(cell):
    return " ".join(cell.stripped_strings)


def parse_table(body_html):
    soup = BeautifulSoup(body_html, 'html.parser')
    for table in soup.find_all('table'):
        for row in table.find_all('tr'):
            col = [cell_text(cell) for cell in row.find_all(['td', 'th'])]
            print(col)


def process_a_page(response, *args, **kwargs):
    # Response hook: called once for each downloaded page.
    parse_table(response.content)


def download_a_chunk(k):
    chunk_size = 10  # number of html pages
    x = "http://www.blahblah....com/inclusiones.php?p="
    x2 = "&name=..."
    # Pages k*chunk_size .. (k+1)*chunk_size - 1 belong to chunk k.
    urls = [x + str(i) + x2 for i in range(k * chunk_size, (k + 1) * chunk_size)]
    reqs = [grequests.get(url, hooks={'response': process_a_page}) for url in urls]
    grequests.map(reqs, size=10)


# download slowly so the server does not block you
for k in range(0, 500):
    print("downloading chunk", k)
    download_a_chunk(k)
    time.sleep(11)

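If you're screen scraping and the table you're trying to convert has a known ID, you could always do a regex parse of the HTML along with some scripting to generate a CSV.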
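
A minimal sketch of that regex approach, using only the standard library. The function name `table_by_id_to_csv` is hypothetical. Regexes are fragile on HTML: this assumes the table has no nested tables and that rows and cells have explicit closing tags.

```python
import csv
import html
import io
import re


def table_by_id_to_csv(page, table_id):
    """Extract the <table id="..."> block with regexes and return CSV text."""
    pattern = (r'<table\b[^>]*\sid\s*=\s*["\']?' + re.escape(table_id)
               + r'(?=["\'\s>])[^>]*>(.*?)</table>')
    table = re.search(pattern, page, re.IGNORECASE | re.DOTALL)
    if table is None:
        raise ValueError("no table with id %r" % table_id)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in re.findall(r"<tr\b[^>]*>(.*?)</tr>", table.group(1), re.I | re.S):
        cells = re.findall(r"<t[dh]\b[^>]*>(.*?)</t[dh]>", row, re.I | re.S)
        # Drop tags left inside each cell, decode entities, tidy whitespace.
        writer.writerow(
            " ".join(html.unescape(re.sub(r"<[^>]+>", "", cell)).split())
            for cell in cells)
    return out.getvalue()
```

For anything beyond a quick one-off, a real HTML parser is usually the safer choice, but for a single well-formed table with a stable ID this can be enough.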

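Have you tried opening it with Excel? If you save a spreadsheet in Excel as HTML, you'll see the format Excel uses. In a web app I wrote, I output this HTML format so that users can export to Excel.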
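
As a rough illustration of the export direction, here is a minimal sketch that renders rows as a plain HTML table. The function name `rows_to_html_table` is hypothetical, and this is not the exact markup Excel itself writes; it only assumes that Excel can open a simple `<table>` document.

```python
from html import escape


def rows_to_html_table(rows):
    """Render a header row plus data rows as a minimal HTML table."""
    header, *data = rows
    lines = ["<table>"]
    lines.append(
        "<tr>" + "".join(f"<th>{escape(str(c))}</th>" for c in header) + "</tr>")
    for row in data:
        lines.append(
            "<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in row) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)
```

Cell values are escaped so that characters such as `<` and `&` in the data do not break the markup.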