Scrapy using SgmlLinkExtractor

I am trying to crawl pages of the form http://www.wynk.in/music/song/variable_underlined_hydrometic_string.html. I want to hit these URLs from my laptop, but since they are only served to apps and WAP clients, I set the user agent in settings.py to 'Mozilla/5.0 (Linux; U; Android 2.3.4; fr-fr; HTC Desire Build/GRJ22) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1'. My code file reads:

from scrapy import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from wynks.items import WynksItem


class MySpider(CrawlSpider):
    name = "wynk"
    #allowed_domains = ["wynk.in"]
    start_urls = ["http://www.wynk.in/"]

    # Follow any link whose URL looks like /music/song/<name>.html
    # and hand the page to parse_item.
    rules = (
        Rule(SgmlLinkExtractor(allow=[r'/music/song/\w+\.html']),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = Selector(response)
        tds = hxs.xpath("//div[@class='songDetails']//tr//td")
        for td in tds.xpath('.//div'):
            for title in td.xpath("a/text()").extract():
                print title
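For reference, the user-agent override lives in settings.py. A minimal sketch, assuming a standard Scrapy project layout named wynks (everything except the USER_AGENT line is the usual generated boilerplate):

# settings.py -- minimal sketch of the override described above
BOT_NAME = 'wynks'

SPIDER_MODULES = ['wynks.spiders']
NEWSPIDER_MODULE = 'wynks.spiders'

# Pretend to be a mobile browser, since the song pages are only
# served to apps and WAP clients.
USER_AGENT = ('Mozilla/5.0 (Linux; U; Android 2.3.4; fr-fr; '
              'HTC Desire Build/GRJ22) AppleWebKit/533.1 '
              '(KHTML, like Gecko) Version/4.0 Mobile Safari/533.1')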

I start the crawl by running:

scrapy crawl wynk -o abcd.csv -t csv

However, all I get is this output:

Crawled (200) <http://www.wynk.in/> (referer: None)
2015-03-23 11:06:04+0530 [wynk] INFO: Closing spider (finished)

What am I doing wrong?


Since the home page has no direct links to URLs of the above form, the problem was solved by collecting all the links on each page and reaching the music/song pages through recursive requests, and by changing the spider to inherit from Spider instead of CrawlSpider.
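A minimal sketch of that approach, assuming the same Python 2 / contrib-era Scrapy environment as the question (the class name WynkSpider, the parse_song callback, and the link-collection details are illustrative, not the poster's actual code):

import re
import urlparse

from scrapy import Selector
from scrapy.http import Request
from scrapy.spider import Spider


class WynkSpider(Spider):
    name = "wynk"
    allowed_domains = ["wynk.in"]
    start_urls = ["http://www.wynk.in/"]

    # The pages we ultimately want, e.g. /music/song/some_name.html
    song_url = re.compile(r'/music/song/\w+\.html')

    def parse(self, response):
        # Collect every link on the page, since the home page has no
        # direct links to the song pages. Scrapy's built-in duplicate
        # filter keeps the recursion from revisiting pages.
        for href in Selector(response).xpath('//a/@href').extract():
            url = urlparse.urljoin(response.url, href)
            if self.song_url.search(url):
                yield Request(url, callback=self.parse_song)
            else:
                yield Request(url, callback=self.parse)

    def parse_song(self, response):
        tds = Selector(response).xpath("//div[@class='songDetails']//tr//td")
        for title in tds.xpath('.//div/a/text()').extract():
            print title

Because parse re-yields itself for every non-song link, the spider walks the whole site; allowed_domains keeps it from wandering off wynk.in.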