How to move on if an error occurs in the response in Python with BeautifulSoup
I built a web crawler that takes thousands of URLs from a text file and then crawls the data on each of those pages. It now has a lot of URLs, and some of them are broken, so it gives me this error:
    Traceback (most recent call last):
      File "C:/Users/khize_000/PycharmProjects/untitled3/new.py", line 57, in <module>
        crawl_data("http://www.foasdasdasdasdodily.com/r/126e7649cc-sweetssssie-pies-mac-and-cheese-recipe-by-the-dr-oz-show")
      File "C:/Users/khize_000/PycharmProjects/untitled3/new.py", line 18, in crawl_data
        data = requests.get(url)
      File "C:\Python27\lib\site-packages\requests\api.py", line 67, in get
        return request('get', url, params=params, **kwargs)
      File "C:\Python27\lib\site-packages\requests\api.py", line 53, in request
        return session.request(method=method, url=url, **kwargs)
      File "C:\Python27\lib\site-packages\requests\sessions.py", line 468, in request
        resp = self.send(prep, **send_kwargs)
      File "C:\Python27\lib\site-packages\requests\sessions.py", line 576, in send
        r = adapter.send(request, **kwargs)
      File "C:\Python27\lib\site-packages\requests\adapters.py", line 437, in send
        raise ConnectionError(e, request=request)
    requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.foasdasdasdasdodily.com', port=80): Max retries exceeded with url: /r/126e7649cc-sweetssssie-pies-mac-and-cheese-recipe-by-the-dr-oz-show (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x0310FCB0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',))
Here is my code:
    def crawl_data(url):
        global connectString
        data = requests.get(url)
        response = str(data)
        if response != "<Response [200]>":
            return
        soup = BeautifulSoup(data.text, "lxml")
        titledb = soup.h1.string
But it still gives me the same exception/error.
I simply want it to ignore the URLs from which there is no response and move on to the next URL.
You need to learn about exception handling. The simplest way to ignore such errors is to surround the code that processes a single URL with:
    try:
        <process a single URL>
    except requests.exceptions.ConnectionError:
        pass
This means that if the specified exception occurs, your program will just execute the pass statement (do nothing) and move on to the next URL.

Applied to your function:
    def crawl_data(url):
        global connectString
        try:
            data = requests.get(url)
        except requests.exceptions.ConnectionError:
            return
        response = str(data)
        soup = BeautifulSoup(data.text, "lxml")
        titledb = soup.h1.string
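For completeness, here is a minimal, self-contained sketch of how that function could be driven from the text file of URLs the question mentions, so that broken URLs are simply skipped. The file name urls.txt and the print call are assumptions added for illustration, not part of the original code:

    import requests
    from bs4 import BeautifulSoup

    def crawl_data(url):
        try:
            data = requests.get(url)
        except requests.exceptions.ConnectionError:
            # Broken or unreachable URL: give up on this one and move on.
            return
        soup = BeautifulSoup(data.text, "lxml")
        titledb = soup.h1.string  # note: raises AttributeError if the page has no <h1>
        print(titledb)

    # "urls.txt" is an assumed file name: one URL per line.
    with open("urls.txt") as f:
        for line in f:
            url = line.strip()
            if url:
                crawl_data(url)

Because the except clause returns from the function instead of letting the ConnectionError propagate, the loop continues with the next URL whenever a host cannot be resolved or reached.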