Read timeout using either urllib2 or any other http library
I have code for reading a URL like this:
    from urllib2 import Request, urlopen

    req = Request(url)
    for key, val in headers.items():
        req.add_header(key, val)
    res = urlopen(req, timeout=timeout)
    # This line blocks
    content = res.read()
The timeout works for the urlopen() call. But then the code gets to the res.read() call, where I want to read the response data, and the timeout isn't applied there. So the read call may hang almost forever, waiting for data from the server. The only solution I've found is to use a signal to interrupt read(), which isn't suitable for me since I'm using threads.

What other options are there? Is there an HTTP library for Python that handles read timeouts? I've looked at httplib2 and requests, and they seem to suffer from the same problem. I don't want to write my own non-blocking network code using the socket module, because I think there should already be a library for this.
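For reference, the closest you can get with plain urllib2 and no extra library is an approximate total deadline enforced between chunk reads. This is only a sketch (read_with_deadline is a hypothetical helper, not part of any library): each chunk read can still block for up to the socket-level timeout, so the worst case is roughly deadline + socket_timeout.

    import time
    from urllib2 import Request, urlopen

    def read_with_deadline(url, deadline=30, socket_timeout=10, chunk_size=8192):
        # socket_timeout bounds each individual recv(); deadline bounds the total.
        res = urlopen(Request(url), timeout=socket_timeout)
        start = time.time()
        chunks = []
        while True:
            if time.time() - start > deadline:
                raise IOError('total read deadline exceeded')
            chunk = res.read(chunk_size)  # may still block up to socket_timeout
            if not chunk:
                break
            chunks.append(chunk)
        return ''.join(chunks)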
Update: None of the solutions below did it for me. You can see for yourself that setting the socket or urlopen timeout has no effect when downloading a large file:
    from urllib2 import urlopen
    url = 'http://iso.linuxquestions.org/download/388/7163/http/se.releases.ubuntu.com/ubuntu-12.04.3-desktop-i386.iso'
    c = urlopen(url)
    c.read()
At least on Windows with Python 2.7.3, the timeouts are being completely ignored.
It's not possible for any library to do this without using some kind of asynchronous timer, through threads or otherwise. The reason is that the timeout parameter used in httplib, urllib2, and similar libraries sets a timeout on the underlying socket, and what that actually does is explained in the POSIX documentation for SO_RCVTIMEO:
SO_RCVTIMEO
Sets the timeout value that specifies the maximum amount of time an input function waits until it completes. It accepts a timeval structure with the number of seconds and microseconds specifying the limit on how long to wait for an input operation to complete. If a receive operation has blocked for this much time without receiving additional data, it shall return with a partial count or errno set to [EAGAIN] or [EWOULDBLOCK] if no data is received.
The final sentence is the key part: a socket.timeout is raised only if not a single byte arrives for the whole duration of the timeout window. In other words, this is a timeout between received bytes, not a limit on the total read.

A simple workaround is to use threading.Timer to shut down the read side of the socket once a deadline passes:
    import httplib
    import socket
    import threading

    def download(host, path, timeout=10):
        content = None

        http = httplib.HTTPConnection(host)
        http.request('GET', path)
        response = http.getresponse()

        # After `timeout` seconds, shut down the read side of the socket;
        # the blocked response.read() then fails with IncompleteRead.
        timer = threading.Timer(timeout, http.sock.shutdown, [socket.SHUT_RD])
        timer.start()

        try:
            content = response.read()
        except httplib.IncompleteRead:
            pass

        timer.cancel()  # cancelling a Timer that already fired is safe

        http.close()
        return content

    >>> host = 'releases.ubuntu.com'
    >>> content = download(host, '/15.04/ubuntu-15.04-desktop-amd64.iso', 1)
    >>> print content is None
    True
    >>> content = download(host, '/15.04/MD5SUMS', 1)
    >>> print content is None
    False
Other than checking the return value for None, you could also let the httplib.IncompleteRead exception propagate and catch it outside the function instead. That variant won't work, however, if the HTTP response has no Content-Length header.
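For illustration, if the try/except inside download() were removed so that the exception propagates, caller-side handling could look like this sketch (IncompleteRead carries the bytes received so far in its partial attribute):

    import httplib

    try:
        content = download(host, '/15.04/ubuntu-15.04-desktop-amd64.iso', 1)
    except httplib.IncompleteRead as e:
        content = e.partial  # whatever arrived before the timer shut the socket down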
In my tests (using the technique described here), I found that a timeout set in the urlopen() call also affects the read() call:
    import urllib2 as u
    c = u.urlopen('http://localhost/', timeout=5.0)
    s = c.read(1<<20)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.7/socket.py", line 380, in read
        data = self._sock.recv(left)
      File "/usr/lib/python2.7/httplib.py", line 561, in read
        s = self.fp.read(amt)
      File "/usr/lib/python2.7/httplib.py", line 1298, in read
        return s + self._file.read(amt - len(s))
      File "/usr/lib/python2.7/socket.py", line 380, in read
        data = self._sock.recv(left)
    socket.timeout: timed out
Maybe it's a feature of newer versions? I'm using Python 2.7 on Ubuntu 12.04 straight out of the box.
One possible (imperfect) solution is to set the global socket timeout, explained in more detail here:
    import socket
    import urllib2

    # timeout in seconds
    socket.setdefaulttimeout(10)

    # this call to urllib2.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)
However, this only works if you're willing to globally modify the timeout for all users of the socket module. I'm running the request from within a Celery task, so doing this would mess up the timeouts for the Celery worker code itself.
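One partial mitigation, as a sketch only: save and restore the process-wide default around the call. This is still not thread-safe, since any other thread that creates a socket inside the window picks up the temporary value, so it doesn't really solve the Celery case.

    import socket
    import urllib2
    from contextlib import contextmanager

    @contextmanager
    def default_socket_timeout(timeout):
        # NOT thread-safe: the default is process-global for the duration.
        old = socket.getdefaulttimeout()
        socket.setdefaulttimeout(timeout)
        try:
            yield
        finally:
            socket.setdefaulttimeout(old)

    with default_socket_timeout(10):
        response = urllib2.urlopen('http://www.voidspace.org.uk')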
I'd be happy to hear about other solutions…
pycurl's TIMEOUT option works for the whole request:

    #!/usr/bin/env python3
    """Test that pycurl.TIMEOUT does limit the total request timeout."""
    import sys
    import pycurl

    timeout = 2  # NOTE: it does limit both the total *connection* and *read* timeouts
    c = pycurl.Curl()
    c.setopt(pycurl.CONNECTTIMEOUT, timeout)
    c.setopt(pycurl.TIMEOUT, timeout)
    c.setopt(pycurl.WRITEFUNCTION, sys.stdout.buffer.write)
    c.setopt(pycurl.HEADERFUNCTION, sys.stderr.buffer.write)
    c.setopt(pycurl.NOSIGNAL, 1)
    c.setopt(pycurl.URL, 'http://localhost:8000')
    c.setopt(pycurl.HTTPGET, 1)
    c.perform()
The code raises the timeout error in ~2 seconds. I tested the total read timeout against a server that sends the response in multiple chunks, where the pause between chunks is less than the timeout but the total transfer time exceeds it:
    $ python -mslow_http_server 1
where slow_http_server.py is:
    #!/usr/bin/env python
    """Usage: python -mslow_http_server [<read_timeout>]

       Return an http response with *read_timeout* seconds between parts.
    """
    import time
    try:
        from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer, test
    except ImportError:  # Python 3
        from http.server import BaseHTTPRequestHandler, HTTPServer, test

    def SlowRequestHandlerFactory(read_timeout):
        class HTTPRequestHandler(BaseHTTPRequestHandler):
            def do_GET(self):
                n = 5
                data = b'1 '
                self.send_response(200)
                self.send_header("Content-type", "text/plain; charset=utf-8")
                self.send_header("Content-Length", n*len(data))
                self.end_headers()
                for i in range(n):
                    self.wfile.write(data)
                    self.wfile.flush()
                    time.sleep(read_timeout)
        return HTTPRequestHandler

    if __name__ == "__main__":
        import sys
        read_timeout = int(sys.argv[1]) if len(sys.argv) > 1 else 5
        test(HandlerClass=SlowRequestHandlerFactory(read_timeout),
             ServerClass=HTTPServer)
I've tested the total connection timeout with http://google.com:22222.
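If you want a pure read (between-bytes) timeout rather than a cap on the whole transfer, curl also exposes low-speed aborts. A sketch using pycurl's LOW_SPEED_LIMIT/LOW_SPEED_TIME options (URL and thresholds are illustrative), which aborts once the transfer rate stays below 1 byte/s for 2 seconds:

    #!/usr/bin/env python3
    import sys
    import pycurl

    c = pycurl.Curl()
    c.setopt(pycurl.URL, 'http://localhost:8000')
    c.setopt(pycurl.WRITEFUNCTION, sys.stdout.buffer.write)
    # Abort when fewer than LOW_SPEED_LIMIT bytes/s arrive for LOW_SPEED_TIME
    # seconds -- a stall detector rather than a total-time cap.
    c.setopt(pycurl.LOW_SPEED_LIMIT, 1)
    c.setopt(pycurl.LOW_SPEED_TIME, 2)
    c.setopt(pycurl.NOSIGNAL, 1)
    c.perform()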
I expected this to be a common problem, and yet no answers are to be found anywhere… I've just built a solution for this using a timeout signal:
    import urllib2
    import socket

    timeout = 10
    socket.setdefaulttimeout(timeout)

    import time
    import signal

    def timeout_catcher(signum, _):
        raise urllib2.URLError("Read timeout")

    signal.signal(signal.SIGALRM, timeout_catcher)

    def safe_read(url, timeout_time):
        signal.setitimer(signal.ITIMER_REAL, timeout_time)
        url = 'http://uberdns.eu'
        content = urllib2.urlopen(url, timeout=timeout_time).read()
        signal.setitimer(signal.ITIMER_REAL, 0)
        # you should also catch any exceptions going out of urlopen here,
        # set the timer to 0, and pass the exceptions on.
Credit for the signal part of the solution goes here, btw: python timer mystery
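Following the comment at the end of that snippet, here is a sketch of the same safe_read with the timer reset moved into a finally block, so an exception from urlopen() or read() cannot leave the interval timer armed:

    import signal
    import urllib2

    def safe_read(url, timeout_time):
        signal.setitimer(signal.ITIMER_REAL, timeout_time)
        try:
            return urllib2.urlopen(url, timeout=timeout_time).read()
        finally:
            # always disarm the timer, even if urlopen()/read() raised
            signal.setitimer(signal.ITIMER_REAL, 0)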
Any asynchronous network library should allow you to enforce a total timeout on any I/O operation. For example, here's a gevent code example:
    #!/usr/bin/env python2
    import gevent
    import gevent.monkey  # $ pip install gevent
    gevent.monkey.patch_all()

    import urllib2

    with gevent.Timeout(2):  # enforce total timeout
        response = urllib2.urlopen('http://localhost:8000')
        encoding = response.headers.getparam('charset')
        print response.read().decode(encoding)
And here's the asyncio equivalent:
    #!/usr/bin/env python3.5
    import asyncio
    import aiohttp  # $ pip install aiohttp

    async def fetch_text(url):
        response = await aiohttp.get(url)
        return await response.text()

    text = asyncio.get_event_loop().run_until_complete(
        asyncio.wait_for(fetch_text('http://localhost:8000'), timeout=2))
    print(text)
The test HTTP server is the slow_http_server.py defined earlier.
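One caveat if you try the asyncio example today: the module-level aiohttp.get() helper was removed in later aiohttp releases. A sketch of the same total timeout with the ClientSession API (assuming aiohttp 3.x and Python 3.7+):

    #!/usr/bin/env python3
    import asyncio
    import aiohttp  # $ pip install aiohttp

    async def fetch_text(url):
        # ClientTimeout(total=...) caps connect + read for the whole request
        timeout = aiohttp.ClientTimeout(total=2)
        async with aiohttp.ClientSession(timeout=timeout) as session:
            async with session.get(url) as response:
                return await response.text()

    print(asyncio.run(fetch_text('http://localhost:8000')))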
That's not the behavior I see. I get a URLError when the call times out:
    from urllib2 import Request, urlopen

    req = Request('http://www.google.com')
    res = urlopen(req, timeout=0.000001)
    # Traceback (most recent call last):
    #   File "<stdin>", line 1, in <module>
    #   ...
    #   raise URLError(err)
    # urllib2.URLError: <urlopen error timed out>
Couldn't you catch this error and then avoid trying to read res? When I try to use res.read() after the timeout, I get NameError: name 'res' is not defined. Is something like this what you need:
    try:
        res = urlopen(req, timeout=3.0)
    except:
        print 'Doh!'
    finally:
        print 'yay!'
        print res.read()
I suppose the way to implement the timeout manually would be via multiprocessing, no? If the job hasn't finished, you can terminate it.
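For what it's worth, a sketch of that multiprocessing idea (hypothetical helper names; the payload has to travel back through a queue, which makes this heavyweight for large downloads):

    import multiprocessing
    import urllib2
    from Queue import Empty  # Python 2; use `from queue import Empty` on Python 3

    def _fetch(url, out):
        out.put(urllib2.urlopen(url).read())

    def fetch_with_deadline(url, deadline=10):
        out = multiprocessing.Queue()
        proc = multiprocessing.Process(target=_fetch, args=(url, out))
        proc.start()
        try:
            return out.get(timeout=deadline)  # blocks at most `deadline` seconds
        except Empty:
            proc.terminate()  # hard kill; no cleanup runs in the child
            return None
        finally:
            proc.join()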
I had the same issue with a socket timeout on the read statement. What worked for me was putting both the urlopen and the read inside a try statement. Hope this helps!
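That pattern, as a minimal sketch (note that a timeout raised inside read() surfaces as socket.timeout rather than URLError, as the traceback earlier in this thread shows, so both are caught; the URL is illustrative):

    import socket
    import urllib2

    try:
        res = urllib2.urlopen('http://example.com/', timeout=10)
        content = res.read()  # a stall here raises socket.timeout (between bytes)
    except (urllib2.URLError, socket.timeout):
        content = None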