Pass a url consumed from RabbitMQ into scrapy's parse method
I am using scrapy to consume messages (urls) from RabbitMQ, but when I use yield to call the parse method with my url as a parameter, the program never enters the callback method. Below is the code of my spider:
```python
# -*- coding: utf-8 -*-
import scrapy
import pika
from scrapy import cmdline
import json


class MydeletespiderSpider(scrapy.Spider):
    name = 'Mydeletespider'
    allowed_domains = []
    start_urls = []

    def callback(self, ch, method, properties, body):
        print(" [x] Received %r" % body)
        body = json.loads(body)
        url = body.get('url')
        yield scrapy.Request(url=url, callback=self.parse)

    def start_requests(self):
        cre = pika.PlainCredentials('test', 'test')
        connection = pika.BlockingConnection(
            pika.ConnectionParameters(host='10.0.12.103', port=5672,
                                      credentials=cre, socket_timeout=60))
        channel = connection.channel()
        channel.basic_consume(self.callback,
                              queue='Deletespider_Batch_Test',
                              no_ack=True)
        print(' [*] Waiting for messages. To exit press CTRL+C')
        channel.start_consuming()

    def parse(self, response):
        print(response.url)


cmdline.execute('scrapy crawl Mydeletespider'.split())
```
My goal is to pass the url's response to the parse method.
To consume urls from RabbitMQ you can take a look at scrapy-rabbitmq, a tool that lets you feed and queue URLs from RabbitMQ in Scrapy spiders.
To enable it, set these values in your settings.py:
```python
# Enables scheduling storing requests queue in rabbitmq.
SCHEDULER = "scrapy_rabbitmq.scheduler.Scheduler"

# Don't cleanup rabbitmq queues, allows to pause/resume crawls.
SCHEDULER_PERSIST = True

# Schedule requests using a priority queue. (default)
SCHEDULER_QUEUE_CLASS = 'scrapy_rabbitmq.queue.SpiderQueue'

# RabbitMQ Queue to use to store requests
RABBITMQ_QUEUE_NAME = 'scrapy_queue'

# Provide host and port to RabbitMQ daemon
RABBITMQ_CONNECTION_PARAMETERS = {'host': 'localhost', 'port': 6666}

# Bonus:
# Store scraped item in rabbitmq for post-processing.
# ITEM_PIPELINES = {
#     'scrapy_rabbitmq.pipelines.RabbitMQPipeline': 1
# }
```
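For the spider to receive anything, the queue named in RABBITMQ_QUEUE_NAME has to be fed. Below is a minimal sketch of a producer matching the settings above; it assumes the queue already exists and that message bodies are plain URL strings (how scrapy-rabbitmq turns a message into a request depends on the configured queue class, so treat the body format as an assumption):

```python
import pika

# Hypothetical producer for the settings above: host/port and queue name
# mirror RABBITMQ_CONNECTION_PARAMETERS and RABBITMQ_QUEUE_NAME.
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host='localhost', port=6666))
channel = connection.channel()

# Publish one URL to the default exchange, routed to the spider's queue.
# Assumes the body is a bare URL string and the queue is already declared.
channel.basic_publish(exchange='',
                      routing_key='scrapy_queue',
                      body='http://example.com')
connection.close()
```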
And in your spider:
```python
from scrapy import Spider
from scrapy_rabbitmq.spiders import RabbitMQMixin


class RabbitSpider(RabbitMQMixin, Spider):
    name = 'rabbitspider'

    def parse(self, response):
        # the mixin takes urls from the rabbit queue by itself
        pass
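```

The spider is then launched as usual with `scrapy crawl rabbitspider`; instead of reading start_urls it keeps pulling URLs from the configured queue.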
Refer to this: http://30daydo.com/article/512
def start_requests(self) should return a generator, otherwise scrapy won't work.
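That is also why the question's code never reaches parse: pika calls callback, which, because it contains yield, merely returns a generator that nothing ever iterates, and start_consuming() blocks forever, so start_requests never yields a single request to Scrapy. Below is a minimal sketch of a generator-based start_requests, reusing the connection details from the question and assuming pika's 0.x API (basic_get with no_ack, matching the original code):

```python
# Sketch only: pulls messages one by one with basic_get instead of blocking
# in start_consuming(), so start_requests stays a generator Scrapy can iterate.
import json

import pika
import scrapy


class MydeletespiderSpider(scrapy.Spider):
    name = 'Mydeletespider'

    def start_requests(self):
        cre = pika.PlainCredentials('test', 'test')
        connection = pika.BlockingConnection(
            pika.ConnectionParameters(host='10.0.12.103', port=5672,
                                      credentials=cre, socket_timeout=60))
        channel = connection.channel()
        while True:
            # basic_get returns (None, None, None) when the queue is empty.
            method, properties, body = channel.basic_get(
                queue='Deletespider_Batch_Test', no_ack=True)
            if method is None:
                break  # queue drained, stop yielding requests
            url = json.loads(body).get('url')
            if url:
                yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(response.url)
```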