Pass a url consumed from RabbitMQ into scrapy's parse method
I am using scrapy to consume messages (urls) from RabbitMQ, but when I use yield to call the parse method with my url as a parameter, the program never enters the callback method. Below is the code of my spider:
```python
# -*- coding: utf-8 -*-
import scrapy
import pika
from scrapy import cmdline
import json


class MydeletespiderSpider(scrapy.Spider):
    name = 'Mydeletespider'
    allowed_domains = []
    start_urls = []

    def callback(self, ch, method, properties, body):
        print(" [x] Received %r" % body)
        body = json.loads(body)
        url = body.get('url')
        yield scrapy.Request(url=url, callback=self.parse)

    def start_requests(self):
        cre = pika.PlainCredentials('test', 'test')
        connection = pika.BlockingConnection(
            pika.ConnectionParameters(host='10.0.12.103', port=5672,
                                      credentials=cre, socket_timeout=60))
        channel = connection.channel()
        channel.basic_consume(self.callback,
                              queue='Deletespider_Batch_Test',
                              no_ack=True)
        print(' [*] Waiting for messages. To exit press CTRL+C')
        channel.start_consuming()

    def parse(self, response):
        print(response.url)


cmdline.execute('scrapy crawl Mydeletespider'.split())
```
My goal is to pass the url's response to the parse method.
To consume urls from RabbitMQ you can take a look at scrapy-rabbitmq, a tool that lets you feed and queue URLs from RabbitMQ in Scrapy spiders.
To enable it, set these values in your settings.py:
```python
# Enables scheduling storing requests queue in rabbitmq.
SCHEDULER = "scrapy_rabbitmq.scheduler.Scheduler"

# Don't cleanup rabbitmq queues, allows to pause/resume crawls.
SCHEDULER_PERSIST = True

# Schedule requests using a priority queue. (default)
SCHEDULER_QUEUE_CLASS = 'scrapy_rabbitmq.queue.SpiderQueue'

# RabbitMQ Queue to use to store requests
RABBITMQ_QUEUE_NAME = 'scrapy_queue'

# Provide host and port to RabbitMQ daemon
RABBITMQ_CONNECTION_PARAMETERS = {'host': 'localhost', 'port': 6666}

# Bonus:
# Store scraped item in rabbitmq for post-processing.
# ITEM_PIPELINES = {
#     'scrapy_rabbitmq.pipelines.RabbitMQPipeline': 1
# }
```
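For the spider to receive anything, the queue named in RABBITMQ_QUEUE_NAME has to be fed. Below is a minimal sketch of a producer matching the settings above; it assumes the queue already exists and that message bodies are plain URL strings (how scrapy-rabbitmq turns a message into a request depends on the configured queue class, so treat the body format as an assumption):

```python
import pika

# Hypothetical producer for the settings above: host/port and queue name
# mirror RABBITMQ_CONNECTION_PARAMETERS and RABBITMQ_QUEUE_NAME.
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host='localhost', port=6666))
channel = connection.channel()

# Publish one URL to the default exchange, routed to the spider's queue.
# Assumes the body is a bare URL string and the queue is already declared.
channel.basic_publish(exchange='',
                      routing_key='scrapy_queue',
                      body='http://example.com')
connection.close()
```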
And in your spider:
```python
from scrapy import Spider
from scrapy_rabbitmq.spiders import RabbitMQMixin


class RabbitSpider(RabbitMQMixin, Spider):
    name = 'rabbitspider'

    def parse(self, response):
        # the mixin takes urls from the rabbit queue by itself
        pass
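```

The spider is then launched as usual with `scrapy crawl rabbitspider`; instead of reading start_urls it keeps pulling URLs from the configured queue.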
Refer to this: http://30daydo.com/article/512
def start_requests(self) should return a generator, otherwise scrapy won't work.
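That is also why the question's code never reaches parse: pika calls callback, which, because it contains yield, merely returns a generator that nothing ever iterates, and start_consuming() blocks forever, so start_requests never yields a single request to Scrapy. Below is a minimal sketch of a generator-based start_requests, reusing the connection details from the question and assuming pika's 0.x API (basic_get with no_ack, matching the original code):

```python
# Sketch only: pulls messages one by one with basic_get instead of blocking
# in start_consuming(), so start_requests stays a generator Scrapy can iterate.
import json

import pika
import scrapy


class MydeletespiderSpider(scrapy.Spider):
    name = 'Mydeletespider'

    def start_requests(self):
        cre = pika.PlainCredentials('test', 'test')
        connection = pika.BlockingConnection(
            pika.ConnectionParameters(host='10.0.12.103', port=5672,
                                      credentials=cre, socket_timeout=60))
        channel = connection.channel()
        while True:
            # basic_get returns (None, None, None) when the queue is empty.
            method, properties, body = channel.basic_get(
                queue='Deletespider_Batch_Test', no_ack=True)
            if method is None:
                break  # queue drained, stop yielding requests
            url = json.loads(body).get('url')
            if url:
                yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(response.url)
```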