Options for HTML scraping?

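I'm thinking of trying Beautiful Soup, a Python package for HTML scraping. Are there any other HTML scraping packages I should be looking at? Python is not a requirement; I'm actually interested in hearing about other languages as well.

The story so far: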

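Python
- Beautiful Soup
- lxml
- HTQL
- Scrapy
- Mechanize
Ruby
- Nokogiri
- Hpricot
- Mechanize
- scrAPI
- scRUBYt!
- wombat
- Watir
.NET
- Html Agility Pack
- WatiN
Perl
- WWW::Mechanize
- Web-Scraper
Java
- Tag Soup
- HtmlUnit
- Web-Harvest
- jARVEST
- jsoup
- Jericho HTML Parser
JavaScript
- request
- cheerio
- artoo
- node-horseman
- phantomjs
PHP
- Goutte
- htmlSQL
- PHP Simple HTML DOM Parser
- PHP Scraping with CURL
- ScarletsQuery
Most of them
- Screen-Scraper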

The Ruby world's equivalent to Beautiful Soup is why_the_lucky_stiff's Hpricot.

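In the .NET world, I recommend the HTML Agility Pack. It is not nearly as simple as some of the options above (like HTMLSQL), but it is very flexible. It lets you manipulate poorly formed HTML as if it were well-formed XML, so you can use XPath or just iterate over nodes.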

http://www.codeplex.com/htmlagilitypack

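Beautiful Soup is a great way to go for HTML scraping. My previous job had me doing a lot of scraping, and I wish I had known about Beautiful Soup when I started. It's like the DOM with a lot more useful options, and it is a lot more Pythonic. If you want to try Ruby, there is a port of Beautiful Soup called RubyfulSoup, but it hasn't been updated in a while.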

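Other useful tools are HTMLParser or sgmllib.SGMLParser, which are part of the standard Python library. These work by calling methods every time you enter or exit a tag and every time HTML text is encountered. They're like Expat, if you're familiar with that. These libraries are especially useful if you are going to parse very large files, where creating a DOM tree would be slow and expensive.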
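A minimal sketch of the event-driven style described above, assuming Python 3, where the parser lives in `html.parser` (`sgmllib` no longer ships with Python 3). The `LinkCollector` class, the `collect_links` helper and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs without building a DOM tree."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> we are currently inside
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # Called every time the parser enters a tag.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Called for the text found between tags.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Called every time the parser leaves a tag.
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = '<p>See <a href="/a">first</a> and <a href="/b">second <b>link</b></a>'
print(collect_links(sample))
```

Because only the current link is held in memory, the same pattern can be fed a very large file chunk by chunk through repeated `feed()` calls.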

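Regular expressions aren't really necessary. Beautiful Soup accepts regular expressions, so if you need their power you can use it there. I say go with Beautiful Soup unless you need speed and a smaller memory footprint. If you find a better HTML parser for Python, let me know.

I found HTMLSQL to be a ridiculously simple way to screen scrape. It takes literally minutes to get results with it.

The queries are very intuitive, like: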

    SELECT title from img WHERE $class == 'userpic'

There are now some other alternatives that take the same approach.
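For comparison, here is a rough Python sketch of what the query above asks for: the `title` attribute of every `img` tag whose `class` equals `userpic`. It is not HTMLSQL itself, only a stand-in built on the standard `html.parser` module; the `AttributeSelect` class, the `select_attr` helper and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class AttributeSelect(HTMLParser):
    """Stand-in for: SELECT <select> FROM <tag> WHERE $<where_attr> == <where_value>."""

    def __init__(self, select, tag, where_attr, where_value):
        super().__init__()
        self._select = select
        self._tag = tag
        self._where_attr = where_attr
        self._where_value = where_value
        self.rows = []

    def handle_starttag(self, tag, attrs):
        # Also runs for self-closing tags such as <img ... />.
        attrs = dict(attrs)
        # Exact string comparison: class="userpic big" would not match.
        if tag == self._tag and attrs.get(self._where_attr) == self._where_value:
            self.rows.append(attrs.get(self._select))


def select_attr(html, select, tag, where_attr, where_value):
    query = AttributeSelect(select, tag, where_attr, where_value)
    query.feed(html)
    query.close()
    return query.rows


sample = (
    '<img class="userpic" title="alice" src="a.png">'
    '<img class="banner" title="ad" src="b.png">'
    '<img class="userpic" title="bob" src="c.png"/>'
)
print(select_attr(sample, "title", "img", "class", "userpic"))
```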

The Python lxml library acts as a Pythonic binding for the libxml2 and libxslt libraries. I particularly like its XPath support and its pretty-printing of the in-memory XML structure. It also supports parsing broken HTML. And I don't think you can find other Python libraries or bindings that parse XML faster than lxml.

For Perl, there's WWW::Mechanize.

Python has several options for HTML scraping in addition to Beautiful Soup. Here are some others:

- mechanize: similar to Perl's WWW::Mechanize. Gives you a browser-like object to interact with web pages.
- lxml: Python binding to libxml2. Supports various options to traverse and select elements (e.g. XPath and CSS selection).
- scrapemark: high-level library using templates to extract information from HTML.
- pyquery: allows you to make jQuery-like queries on XML documents.
- scrapy: a high-level scraping and web crawling framework. It can be used to write spiders, for data mining, and for monitoring and automated testing.

Why has no one mentioned jsoup for Java yet? http://jsoup.org/

"Simple HTML DOM Parser" is a good option for PHP. If you're familiar with jQuery or JavaScript selectors, you will find yourself at home.

Find it here.

There is also a blog post about it here.

The templatemaker utility from Adrian Holovaty (of Django fame) uses a very interesting approach: you feed it variations of the same page and it "learns" where the "holes" for variable data are. It's not HTML specific, so it would be good for scraping any other plaintext content as well. I've also used it for PDFs and HTML converted to plaintext (with pdftotext and lynx, respectively).
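The "learn the holes from variations" idea can be sketched with Python's standard `difflib`. This is not templatemaker's actual algorithm or API, only an illustration under simple assumptions: two sample pages are compared token by token, the shared runs become the template's literal parts, and every place where the pages differ becomes a hole. All names and sample pages here are made up.

```python
import difflib
import re

HOLE = None  # marker for a variable region in the learned template


def tokenize(text):
    # Words, whitespace runs and single punctuation characters.
    return re.findall(r"\w+|\s+|[^\w\s]", text)


def learn_template(page_a, page_b):
    """Return a list of literal strings and HOLE markers shared by both pages."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    template, i, j = [], 0, 0
    for block in matcher.get_matching_blocks():
        if block.a > i or block.b > j:
            template.append(HOLE)  # the pages differ here
        if block.size:
            template.append("".join(a[block.a:block.a + block.size]))
        i, j = block.a + block.size, block.b + block.size
    return template


def extract(template, page):
    """Pull the hole contents out of another page that follows the template."""
    pattern = "".join("(.*?)" if part is HOLE else re.escape(part) for part in template)
    match = re.fullmatch(pattern, page, flags=re.DOTALL)
    return list(match.groups()) if match else None


page_1 = "<h1>Hello Alice</h1><p>Score: 10</p>"
page_2 = "<h1>Hello Bob</h1><p>Score: 25</p>"
template = learn_template(page_1, page_2)
print(template)
print(extract(template, "<h1>Hello Carol</h1><p>Score: 7</p>"))
```

Two samples are the bare minimum; values that happen to coincide in both pages would be learned as fixed text, which is why feeding more variations gives a better template.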

I know and love Screen-Scraper.

Screen-Scraper is a tool for extracting data from websites. It is available in three editions.

I would first find out whether the site in question provides an API server or RSS feeds for accessing the data you require.
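When a feed does exist, reading it is usually far less fragile than scraping the page. A minimal sketch with the standard `xml.etree.ElementTree` module follows; the feed content and the `parse_rss` helper are invented for illustration, and in real use the XML would be downloaded first (for example with `urllib.request.urlopen`) rather than embedded in the script.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, embedded so the example runs offline.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Return the title and link of every <item> in an RSS document."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in root.iter("item")
    ]


for entry in parse_rss(SAMPLE_FEED):
    print(entry["title"], entry["link"])
```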

Another option for Perl would be Web::Scraper, which is based on Ruby's Scrapi. In a nutshell, with a nice and concise syntax, you can scrape a page directly into data structures.

Scraping Stack Overflow is especially easy with Shoes and Hpricot.

    require 'hpricot'

    Shoes.app :title => "Ask Stack Overflow", :width => 370 do
      SO_URL = "http://stackoverflow.com"
      stack do
        stack do
          caption "What is your question?"
          flow do
            @lookup = edit_line "stackoverflow", :width => "-115px"
            button "Ask", :width => "90px" do
              download SO_URL + "/search?s=" + @lookup.text do |s|
                doc = Hpricot(s.response.body)
                @rez.clear()
                (doc/:a).each do |l|
                  href = l["href"]
                  if href.to_s =~ /\/questions\/[0-9]+/ then
                    @rez.append do
                      para(link(l.inner_text) { visit(SO_URL + href) })
                    end
                  end
                end
                @rez.show()
              end
            end
          end
        end
        stack :margin => 25 do
          background white, :radius => 20
          @rez = stack do
          end
        end
        @rez.hide()
      end
    end

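I've had some success with HtmlUnit, in Java. It's a simple framework for writing unit tests on web UIs, but it is equally useful for HTML scraping.

Yahoo! Query Language, or YQL, can be used along with jQuery, AJAX and JSONP to screen scrape web pages.

There is this solution too: netty HttpClient.

There is another tool for .NET as well.

I use Hpricot on Ruby. As an example, this is a snippet of code that I use to retrieve all book titles from the six pages of my HireThings account (as they don't seem to provide a single page with this information):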

    # Fetch pages 1 to 6 through the proxy and print every book title.
    pagerange = 1..6
    proxy = Net::HTTP::Proxy(proxy, port, user, pwd)
    proxy.start('www.hirethings.co.nz') do |http|
      pagerange.each do |page|
        resp, data = http.get "/perth_dotnet?page=#{page}"
        if resp.class == Net::HTTPOK
          # Each title is a link inside an <h3> heading.
          (Hpricot(data)/"h3 a").each { |a| puts a.innerText }
        end
      end
    end

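It is pretty much complete. All that comes before this are library imports and the settings for my proxy.

I've used Beautiful Soup a lot with Python. It is much better than checking with regular expressions, because it works like using the DOM, even if the HTML is poorly formatted. You can quickly find HTML tags and text with simpler syntax than regular expressions. Once you find an element, you can iterate over it and its children, which is more useful for understanding the contents in code than regular expressions are. I wish Beautiful Soup had existed years ago when I had to do a lot of screen scraping. It would have saved me a lot of time and headaches, since HTML structure was so poor before people started validating it.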

Although it was designed for .NET web testing, I've been using the WatiN framework for this purpose. Since it is DOM-based, it is pretty easy to capture HTML, text, or images. Recently, I used it to dump a list of links from a MediaWiki All Pages namespace query into an Excel spreadsheet. The VB.NET code fragment I wrote for it is pretty crude, but it works.

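I have used LWP and HTML::TreeBuilder with Perl and have found them very useful.

LWP (short for libwww-perl) lets you connect to websites and scrape the HTML. You can get the module here, and the O'Reilly book seems to be online here.

TreeBuilder allows you to construct a tree from the HTML. Documentation and source are available at HTML::TreeBuilder - Parser that builds a HTML syntax tree.

There might still be too much heavy lifting to do with an approach like this, though. I have not looked at the Mechanize module suggested by another answer, so I may well do that.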

Implementations of the HTML5 parsing algorithm: html5lib (Python, Ruby), Validator.nu HTML Parser (Java, JavaScript; C++ in development), Hubbub (C), Twintsam (C#; upcoming).

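Well, if you want it done from the client side using only a browser, you have jcrawl.com. After designing your scraping service in the web application (http://www.jcrawl.com/app.html), you only need to add the generated script to an HTML page to start using or presenting your data.

All the scraping logic happens in the browser via JavaScript. I hope you find it useful. Click this link for a live example that extracts the latest news from Yahoo tennis.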

You would be a fool not to use Perl. Here come the flames.

Bone up on its scraping modules and ginsu any scrape around.

In Java, you can use TagSoup.

I've had mixed results in .NET using SgmlReader, which was originally started by Chris Lovett and appears to have been updated by MindTouch.

I've also had great success using Aptana's Jaxer + jQuery to parse pages. It's not as fast or "script-like" in nature, but jQuery selectors plus real JavaScript/DOM is a lifesaver on more complicated (or malformed) pages.

You probably have as much already, but I think this is what you are trying to do:

    import os
    import re

    # Download the profile page with wget; the cookie value is a placeholder.
    os.system('wget --no-cookies --header "Cookie: soba=(SeCreTCODe)" '
              'http://stackoverflow.com/users/30/myProfile.html')

    # Read the whole page; the with block closes the file for us.
    with open("myProfile.html") as f:
        profile = f.read()

    p = re.compile(r'summarycount">(\d+)')  # Rep is found here
    m = p.search(profile)
    if m:
        print(m.group(1))
        # Speak the reputation aloud (requires espeak to be installed).
        os.system('espeak "Rep is at ' + m.group(1) + ' points"')
    os.remove("myProfile.html")

I like Google Spreadsheets' ImportXML(URL, XPath) function.

It will repeat cells down the column if your XPath expression returns more than one value.

You can have up to 50 ImportXML() functions on one spreadsheet.

RapidMiner's Web Plugin is also pretty easy to use. It can do posts, accepts cookies, and can set the user-agent.

Regular expressions work pretty well for HTML scraping as well ;-) Though after looking at Beautiful Soup, I can see why it would be a valuable tool.
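As a small illustration of the regular-expression route, the sketch below pulls question links out of a page fragment with Python's `re` module. The pattern, the `question_links` helper and the sample markup are assumptions made for this example; the pattern expects double-quoted `href` values and will miss markup written differently, which is the usual weakness of this approach.

```python
import re

# Matches <a ... href="/questions/<digits>..." ...>text</a> and captures href and text.
QUESTION_LINK = re.compile(
    r'<a\s[^>]*href="(/questions/\d+[^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def question_links(html):
    """Return (href, text) pairs for links that point at a question."""
    links = []
    for href, text in QUESTION_LINK.findall(html):
        plain = re.sub(r"<[^>]+>", "", text).strip()  # drop tags nested in the link
        links.append((href, plain))
    return links


sample = (
    '<a href="/questions/11326954/about-scroll-bar-issue-in-tree" '
    'class="question-hyperlink">About Scroll bar issue in Tree</a> '
    '<a href="/users/30">me</a>'
)
print(question_links(sample))
```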

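I have made a very nice library, Internet Tools, for web scraping.

The idea is to match a template against the web page. The template extracts all the data from the page and also validates that the page structure is unchanged.

So you can just take the HTML of the web page you want to process, remove all dynamic or irrelevant content, and annotate the interesting parts.

For example, the HTML for a new question on the stackoverflow.com index page is: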

    <a title="Some times my tree list have vertical scroll ,then I scrolled very fast and the tree list shivered .Have any solution for this.
    " class="question-hyperlink" href="/questions/11326954/about-scroll-bar-issue-in-tree">About Scroll bar issue in Tree</a>

So you just remove this particular id, title and summary to create a template that reads all new questions into title, summary and link arrays:

    <t:loop>
      {title:=text(), summary:=@title, link:=@href}
    </t:loop>

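And of course it also supports the basic techniques: CSS 3 selectors, XPath 2 and XQuery 1 expressions.

The only problem is that I was foolish enough to make it a Free Pascal library. But there is also a language-independent web demo.

For more complex scraping applications, I would recommend the IRobotSoft web scraper. It is free software dedicated to screen scraping. It has a strong query language for HTML pages, and it provides a very simple web recording interface that will save you a lot of programming effort.

I do a lot of advanced web scraping, so I wanted to have total control over my stack and understand its limitations. This webscraping library is the result.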

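Scrubyt uses Ruby and Hpricot to do nice and easy web scraping. I wrote a scraper for my university's library service with it in about 30 minutes.

I've been using Feedity (http://feedity.com) for some of the scraping work (and conversion into RSS feeds) at my library. It works well for most web pages.

The recent talk by Dav Glass, "Welcome to the Jungle!" (YUIConf 2011 Opening Keynote), shows how you can use YUI 3 on Node.js to do client-side-like programming (with DOM selectors instead of string processing) on the server. It is very impressive.

When it comes to extracting data from an HTML document on the server side, Node.js is a fantastic option. I have used it successfully with two modules called request and cheerio.

You can see an example of how it works here.

For those who prefer a graphical workflow tool, RapidMiner (FOSS) has a nice web crawling and scraping facility.

Here's a series of videos: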

http://vancourverdata.blogspot.com/2011/04/rapidminer-web-crawling-rapid-miner-web.html

SharpQuery

It's basically jQuery for C#. It depends on the HTML Agility Pack for parsing the HTML.