Issue with Scrapy caused by a meta refresh redirect
I am new to the Scrapy framework and am trying to crawl a website with a spider. On the site I am crawling, navigating from page 1 to page 2 goes through an intermediate page that uses a meta refresh to redirect to page 2. However, on that redirect I keep getting a 302. I have tried the following:
1. Setting the user agent to "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
2. Setting DOWNLOAD_DELAY = 15
3. Setting REDIRECT_MAX_METAREFRESH_DELAY = 100
But none of this helped; the settings I tried are sketched below. I am new to Scrapy, so I would be grateful for any guidance on solving this.
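For reference, a minimal sketch of those attempts in `settings.py`, assuming the Scrapy 1.x setting names (newer Scrapy versions rename `REDIRECT_MAX_METAREFRESH_DELAY` to `METAREFRESH_MAXDELAY`):

```python
# settings.py -- a minimal sketch of the settings I tried
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36')
DOWNLOAD_DELAY = 15                    # seconds to wait between requests
REDIRECT_MAX_METAREFRESH_DELAY = 100   # follow meta refreshes with delays up to 100 s
```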
Adding the log, as requested:
```
2017-02-17 21:02:43 [scrapy.core.engine] INFO: Spider opened
2017-02-17 21:02:43 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-02-17 21:02:43 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-02-17 21:02:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://xxxx.website.com/search-cases.htm> (referer: None)
2017-02-17 21:02:44 [quotes] INFO: http://www.xxxx.website2.com/eservices/home.page
2017-02-17 21:02:46 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (meta refresh) to <GET http://www.xxxx.website2.com/eservices/;jsessionid=D724B51CE14CFB9A06AB5A1C2BADC7BA?x=pQSPWmZkMdOltOc6jey5Pzm2g*gqQrsim1X*85dDjm1K*VwIS*xP-fdT9lRZBHHOA41kK1OaAco2dC8Un6N*uJtWnK50mGmm> from <GET http://www.courtrecords.alaska.gov/eservices/home.page>
2017-02-17 21:02:55 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.xxxx.website2.com/eservices/home.page> from <GET http://www.xxxx.website2.com/eservices/;jsessionid=D724B51CE14CFB9A06AB5A1C2BADC7BA?x=pQSPWmZkMdOltOc6jey5Pzm2g*gqQrsim1X*85dDjm1K*VwIS*xP-fdT9lRZBHHOA41kK1OaAco2dC8Un6N*uJtWnK50mGmm>
2017-02-17 21:02:55 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.xxxx.website2.com/eservices/home.page> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-02-17 21:02:55 [scrapy.core.engine] INFO: Closing spider (finished)
```
**Please note that I have changed the website names.**
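The last DEBUG line points at the cause: the dupefilter dropped the request. As the log itself suggests, enabling `DUPEFILTER_DEBUG` makes Scrapy log every filtered duplicate rather than only the first, which is handy while diagnosing this:

```python
# settings.py -- log every request dropped by the duplicate filter
DUPEFILTER_DEBUG = True
```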
As @elrull mentioned in his comment, the problem was that the duplicate request was being filtered: the meta refresh leads to a `;jsessionid` URL, which then 302-redirects back to `/eservices/home.page`, a URL the dupefilter has already seen, so the spider simply closes. After setting dont_filter=True on the redirected request, the spider started crawling correctly.
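A minimal sketch of what that fix looks like in a spider; the spider name matches the `[quotes]` logger in the log above, the URLs are the placeholders from the log, and the callback name is made up for illustration:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://xxxx.website.com/search-cases.htm"]

    def parse(self, response):
        # The meta refresh leads to a ;jsessionid URL that 302s back to
        # home.page; without dont_filter=True the dupefilter drops that
        # final request and the spider closes. The redirect middleware
        # preserves dont_filter when it follows the chain.
        yield scrapy.Request(
            "http://www.xxxx.website2.com/eservices/home.page",
            callback=self.parse_results,
            dont_filter=True,  # bypass the duplicate filter for this request
        )

    def parse_results(self, response):
        self.logger.info("Reached %s", response.url)
```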