Screen scraping in LXML with Python: extract specific data

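For the past few hours I've been trying to write a program to do what I thought would be a very simple task:

The program asks for user input (say, the word "happiness").

The program queries the website thinkexist using this format: "http://thinkexist.com/search/searchquotation.asp?search=<user input>".

The program returns the first quote from the website.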
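The query format in the second step can be sketched as follows. This is a minimal Python 3 illustration using only the standard library; `build_search_url` is a hypothetical helper name, and the `search` parameter is taken from the URL format above.

```python
from urllib.parse import urlencode

# Base URL from the question; the "search" parameter name is as given there.
SEARCH_URL = "http://thinkexist.com/search/searchquotation.asp"


def build_search_url(term):
    """Return the search URL for a term (hypothetical helper, not part of any library)."""
    # urlencode escapes spaces and special characters so the term is safe in a query string.
    return SEARCH_URL + "?" + urlencode({"search": term.strip()})


print(build_search_url("happiness"))
print(build_search_url("true love"))
```

Multi-word input is the case worth checking by hand, since the space has to be escaped before the URL is requested.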

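I've tried using XPath with lxml, but I have no experience with it, and every expression I construct returns an empty list.

The actual meat of the quote appears to be contained in elements with the class "sqq".

If I browse the site with Firebug and click the DOM tab, the quote appears to be in the text node's "wholeText" or "textContent" property, but I don't know how to use that knowledge programmatically.

Any ideas?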

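import urllib.parse

import lxml.html

site = 'http://thinkexist.com/search/searchquotation.asp'

# Build the query string, URL-encoding the user's search term
userInput = input('Search for: ').strip()
url = site + '?' + urllib.parse.urlencode({'search': userInput})

# Fetch and parse the results page, then select every link with class "sqq"
root = lxml.html.parse(url).getroot()
quotes = root.xpath('//a[@class="sqq"]')

# text_content() joins all text nodes inside the element
if quotes:
    print(quotes[0].text_content())
else:
    print('No quotes found.')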

...if you enter "Shakespeare", it returns:

In real life, unlike in Shakespeare, the sweetness
of the rose depends upon the name it bears. Things
are not only what they are. They are, in very important
respects, what they seem to be.
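
To tie this back to the Firebug observation in the question: lxml's `text_content()` plays roughly the role of the DOM's `textContent`, gathering the text of an element and all of its descendants, whereas `.text` holds only the text before the first child element. The sketch below runs offline on a made-up HTML fragment (the markup is an assumption, not the real thinkexist page), and `first_quote` is a hypothetical helper that returns `None` rather than raising `IndexError` when the XPath matches nothing. Python 3 is assumed.

```python
import lxml.html

# Made-up markup for illustration; the real page's structure may differ.
SAMPLE_HTML = """
<html><body>
  <a class="sqq" href="#">Happiness is <b>not</b> a goal.</a>
  <a class="other" href="#">Not a quote link.</a>
</body></html>
"""


def first_quote(html):
    """Return the text of the first a.sqq element, or None (hypothetical helper)."""
    root = lxml.html.fromstring(html)
    matches = root.xpath('//a[@class="sqq"]')
    if not matches:
        return None
    return matches[0].text_content().strip()


link = lxml.html.fromstring(SAMPLE_HTML).xpath('//a[@class="sqq"]')[0]
print(repr(link.text))                  # only the text before the <b> child
print(repr(first_quote(SAMPLE_HTML)))   # text of the whole element
```

If an expression comes back empty on the live page, comparing the class name and tag in the XPath against the actual markup is a reasonable first check.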