How to reschedule 403 HTTP status codes to be crawled later in scrapy?
From these instructions I can see that HTTP 500 errors, lost-connection errors and so on are always rescheduled, but I could not find anywhere whether 403 errors are also rescheduled, or whether they are simply treated as valid responses or ignored once the retry limit is reached.
Also from the same documentation:
Failed pages are collected on the scraping process and rescheduled at the end, once the spider has finished crawling all regular (non failed) pages. Once there are no more failed pages to retry, this middleware sends a signal (retry_complete), so other extensions could connect to that signal.
Also, when Scrapy hits an HTTP 400 status, I can see this message in the log:

2015-12-07 12:33:42 [scrapy] DEBUG: Ignoring response <400 http://example.com/q?x=12>: HTTP status code is not handled or not allowed

From that message I think it is clear that HTTP 400 responses are ignored and not rescheduled.
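As far as I can tell, that "Ignoring response" message comes from Scrapy's HttpErrorMiddleware, which filters out non-2xx responses unless their status is whitelisted. A minimal sketch of how a spider can opt in to receiving such responses (the spider name and URL are just placeholders):

# Sketch: let 400/403 responses reach the spider callback instead of being
# filtered out by HttpErrorMiddleware (spider name and URL are placeholders).
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com/q?x=12']

    # Statuses listed here are passed to the callback instead of being ignored.
    handle_httpstatus_list = [400, 403]

    def parse(self, response):
        if response.status in (400, 403):
            # Decide here whether to drop the response or re-yield the request.
            self.logger.info('Got %s for %s', response.status, response.url)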
I am not sure whether a 403 HTTP status is ignored, or rescheduled to be crawled at the end.
So I tried to reschedule all responses with HTTP status 403 based on those docs. This is what I have tried so far:
In the middlewares.py file:
def process_response(self, request, response, spider):
    if response.status == 403:
        return request
    else:
        return response
In settings.py:
RETRY_TIMES = 5
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 403, 404, 408]
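In case it is relevant, here is a slightly fuller sketch of that middleware I have been experimenting with; the meta key name ('my_403_retries') and the cap of 5 are my own assumptions, and dont_filter is set so the duplicate filter does not silently drop the rescheduled request:

# middlewares.py -- sketch of a 403-rescheduling downloader middleware.
# The meta key 'my_403_retries' and the cap of 5 are assumptions, not Scrapy APIs.
class Retry403Middleware(object):
    max_retries = 5

    def process_response(self, request, response, spider):
        if response.status != 403:
            return response
        retries = request.meta.get('my_403_retries', 0)
        if retries >= self.max_retries:
            # Give up and let the 403 response continue through the pipeline.
            return response
        # Re-issue the request; dont_filter=True bypasses the dupefilter,
        # which would otherwise discard an already-seen URL.
        new_request = request.replace(dont_filter=True)
        new_request.meta['my_403_retries'] = retries + 1
        return new_request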
My question is: how can I reschedule responses with HTTP status 403 so they are crawled again later, and is the approach above the right way to do it?
You can find the default statuses that are retried here.
Adding 403 to RETRY_HTTP_CODES in settings.py should be enough for the built-in RetryMiddleware to retry those responses.
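For completeness, a small sketch of what that looks like; the values are only examples, and if I am not mistaken RetryMiddleware also honours a couple of per-request meta keys:

# settings.py -- project-wide retry behaviour (the values are only examples).
RETRY_ENABLED = True
RETRY_TIMES = 5                      # retries on top of the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 403]
# Per request, RetryMiddleware also reads the meta keys
# 'max_retry_times' (override RETRY_TIMES) and 'dont_retry' (skip retrying).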
One way is to add a middleware to your Spider (source, link):
# File: middlewares.py
from twisted.internet import reactor
from twisted.internet.defer import Deferred

DEFAULT_DELAY = 5


class DelayedRequestsMiddleware(object):
    def process_request(self, request, spider):
        # process_request never sees a response, so the decision is based on
        # request.meta: an explicit 'delay_request_by' value, or 'retry_times',
        # which Scrapy's RetryMiddleware sets on requests it retries (for
        # example the 403s listed in RETRY_HTTP_CODES).
        delay_s = request.meta.get('delay_request_by', None)
        if delay_s is None and not request.meta.get('retry_times'):
            return None
        delay_s = delay_s or DEFAULT_DELAY
        # Returning a Deferred makes Scrapy wait delay_s seconds before
        # downloading, without blocking the reactor.
        deferred = Deferred()
        reactor.callLater(delay_s, deferred.callback, None)
        return deferred
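The middleware also has to be enabled in settings.py before Scrapy will call it; the module path and the priority of 350 below are assumptions for a project named 'myproject', any free slot works:

# settings.py -- enable the custom middleware ('myproject' is a placeholder).
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DelayedRequestsMiddleware': 350,
}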
In the example, you can invoke the delay with a meta key:
# This request will have itself delayed by 5 seconds
yield scrapy.Request(url='http://quotes.toscrape.com/page/1/',
                     meta={'delay_request_by': 5})
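A note on the design: reactor.callLater schedules the download asynchronously, so the delay does not block the rest of the crawl the way time.sleep() would; and because the delay is driven purely by request.meta, retried requests (such as 403s once they are in RETRY_HTTP_CODES) can be slowed down without touching the spider code.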