Web scraping: Issue with Scrapy due to Meta Refresh

I am new to the Scrapy framework and am trying to crawl a website with a spider. On this website, when I navigate from page 1 to page 2, an intermediate page with a meta refresh is inserted, which then redirects to page 2. However, on that redirect I keep getting a 302 error. I have tried the following:

Setting the user agent to "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"

Setting DOWNLOAD_DELAY = 15

Setting REDIRECT_MAX_METAREFRESH_DELAY = 100

but I did not succeed. I am a newbie, and I would really appreciate it if someone could give me guidance on how to solve this problem.
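For reference, this is roughly how those attempts look in `settings.py` (a minimal sketch; `REDIRECT_MAX_METAREFRESH_DELAY` is my reading of the translated setting name, since the question text does not show the exact identifier):

```python
# settings.py -- sketch of the settings described above

# Spoof a desktop Chrome user agent
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
)

# Wait 15 seconds between requests
DOWNLOAD_DELAY = 15

# Follow meta-refresh redirects that declare a delay of up to 100 seconds
REDIRECT_MAX_METAREFRESH_DELAY = 100
```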

Adding the log, as requested:

```
2017-02-17 21:02:43 [scrapy.core.engine] INFO: Spider opened
2017-02-17 21:02:43 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-02-17 21:02:43 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-02-17 21:02:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://xxxx.website.com/search-cases.htm> (referer: None)
2017-02-17 21:02:44 [quotes] INFO: http://www.xxxx.website2.com/eservices/home.page
2017-02-17 21:02:46 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (meta refresh) to <GET http://www.xxxx.website2.com/eservices/;jsessionid=D724B51CE14CFB9A06AB5A1C2BADC7BA?x=pQSPWmZkMdOltOc6jey5Pzm2g*gqQrsim1X*85dDjm1K*VwIS*xP-fdT9lRZBHHOA41kK1OaAco2dC8Un6N*uJtWnK50mGmm> from <GET http://www.courtrecords.alaska.gov/eservices/home.page>
2017-02-17 21:02:55 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.xxxx.website2.com/eservices/home.page> from <GET http://www.xxxx.website2.com/eservices/;jsessionid=D724B51CE14CFB9A06AB5A1C2BADC7BA?x=pQSPWmZkMdOltOc6jey5Pzm2g*gqQrsim1X*85dDjm1K*VwIS*xP-fdT9lRZBHHOA41kK1OaAco2dC8Un6N*uJtWnK50mGmm>
2017-02-17 21:02:55 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.xxxx.website2.com/eservices/home.page> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-02-17 21:02:55 [scrapy.core.engine] INFO: Closing spider (finished)
```

**Please note that I have changed the website names**


As @elrull mentioned in his comment, the problem was that the duplicate request was being filtered out. After setting `dont_filter=True` on the redirected request, the spider started crawling correctly.
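For anyone hitting the same wall, a minimal sketch of what that change looks like (the spider name, URLs, and callbacks below are placeholders standing in for the asker's actual code):

```python
import scrapy


class CasesSpider(scrapy.Spider):
    # Hypothetical spider; names and URLs are illustrative only
    name = "quotes"
    start_urls = ["http://xxxx.website.com/search-cases.htm"]

    def parse(self, response):
        # dont_filter=True lets this request through even though the
        # dupefilter has already seen the target URL at the end of the
        # meta-refresh -> 302 redirect chain
        yield scrapy.Request(
            "http://www.xxxx.website2.com/eservices/home.page",
            callback=self.parse_results,
            dont_filter=True,
        )

    def parse_results(self, response):
        self.logger.info("Reached %s", response.url)
```

Without `dont_filter=True`, Scrapy's default dupefilter sees the final redirect target as a URL it has already scheduled and silently drops it, which is exactly the "Filtered duplicate request" line in the log above.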