Read timeout using either urllib2 or any other http library

I have code that reads a URL like this:

from urllib2 import Request, urlopen
req = Request(url)
for key, val in headers.items():
    req.add_header(key, val)
res = urlopen(req, timeout=timeout)
# This line blocks
content = res.read()

The timeout applies to the urlopen() call. But then the code reaches the res.read() call, where I want to read the response data, and the timeout is not applied there. So the read call may hang practically forever, waiting for data from the server. The only solution I've found is to use a signal to interrupt the read(), which is not suitable for me since I'm using threads.

What other options are there? Is there an HTTP library for Python that handles read timeouts? I've looked at httplib2 and requests, and they seem to suffer from the same problem as above. I don't want to write my own non-blocking network code with the socket module, because I think there should already be a library for this.

Update: none of the solutions below do it for me. You can see for yourself that setting the socket or urlopen timeout has no effect when downloading a large file:

from urllib2 import urlopen
url = 'http://iso.linuxquestions.org/download/388/7163/http/se.releases.ubuntu.com/ubuntu-12.04.3-desktop-i386.iso'
c = urlopen(url)
c.read()

At least on Windows with Python 2.7.3, the timeouts are completely ignored.


It is impossible for any library to do this without using some kind of asynchronous timer, through threads or otherwise. The reason is that the timeout parameter used in httplib, urllib2 and similar libraries sets the timeout on the underlying socket. What that actually does is explained in the documentation:

SO_RCVTIMEO

Sets the timeout value that specifies the maximum amount of time an input function waits until it completes. It accepts a timeval structure with the number of seconds and microseconds specifying the limit on how long to wait for an input operation to complete. If a receive operation has blocked for this much time without receiving additional data, it shall return with a partial count or errno set to [EAGAIN] or [EWOULDBLOCK] if no data is received.

The key part is the last sentence: a socket.timeout is only raised if not a single byte is received during the timeout window. In other words, it is a timeout between received bytes.
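
To see this concretely, here is a minimal sketch with a plain socket (it assumes something is listening on localhost:8000, such as the slow server shown in a later answer): each recv() call gets a fresh timeout window, so a server that trickles bytes steadily will never trip it.

import socket

sock = socket.create_connection(('localhost', 8000), timeout=5)
sock.sendall(b'GET / HTTP/1.0\r\nHost: localhost\r\n\r\n')
chunks = []
while True:
    chunk = sock.recv(4096)   # may block for up to 5 seconds per call,
    if not chunk:             # but the total elapsed time is unbounded
        break
    chunks.append(chunk)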

A simple function using threading.Timer could look as follows.

import httplib
import socket
import threading

def download(host, path, timeout=10):
    content = None

    http = httplib.HTTPConnection(host)
    http.request('GET', path)
    response = http.getresponse()

    timer = threading.Timer(timeout, http.sock.shutdown, [socket.SHUT_RD])
    timer.start()

    try:
        content = response.read()
    except httplib.IncompleteRead:
        pass

    timer.cancel() # cancel on triggered Timer is safe
    http.close()

    return content

>>> host = 'releases.ubuntu.com'
>>> content = download(host, '/15.04/ubuntu-15.04-desktop-amd64.iso', 1)
>>> print content is None
True
>>> content = download(host, '/15.04/MD5SUMS', 1)
>>> print content is None
False

Besides checking for None, you could also catch the httplib.IncompleteRead exception not inside the function but outside of it. The latter won't work, though, if the HTTP response is missing a Content-Length header. A self-contained sketch of that variant follows.
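
Here is a minimal sketch of that variant (download_strict is an illustrative name, not from the answer above): the timer still shuts the socket down, but the read error propagates to the caller.

import httplib
import socket
import threading

def download_strict(host, path, timeout=10):
    # same timer trick as download() above, but response.read() errors
    # propagate to the caller instead of being swallowed
    http = httplib.HTTPConnection(host)
    http.request('GET', path)
    response = http.getresponse()
    timer = threading.Timer(timeout, http.sock.shutdown, [socket.SHUT_RD])
    timer.start()
    try:
        return response.read()
    finally:
        timer.cancel()   # safe even if the timer already fired
        http.close()

try:
    content = download_strict('releases.ubuntu.com',
                              '/15.04/ubuntu-15.04-desktop-amd64.iso', 1)
except httplib.IncompleteRead:
    content = None   # the timer shut the socket down mid-read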


In my tests (using the technique described here), a timeout set in the urlopen() call also affects the read() call:

import urllib2 as u
c = u.urlopen('http://localhost/', timeout=5.0)
s = c.read(1<<20)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
  File "/usr/lib/python2.7/httplib.py", line 561, in read
    s = self.fp.read(amt)
  File "/usr/lib/python2.7/httplib.py", line 1298, in read
    return s + self._file.read(amt - len(s))
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
socket.timeout: timed out

Maybe this is a feature of newer versions? I'm using the stock Python 2.7 on Ubuntu 12.04.


One possible (imperfect) solution is to set the global socket timeout, explained in more detail here:

import socket
import urllib2

# timeout in seconds
socket.setdefaulttimeout(10)

# this call to urllib2.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)

However, this only works if you're willing to globally modify the timeout for all users of the socket module. I'm running the request from within a Celery task, and doing this would mess up the timeouts for the Celery worker code itself.
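
If the global default is tolerable for the duration of a single call, one partial mitigation is to save and restore it around the request. This is only a sketch (fetch_with_default_timeout is an illustrative name): other threads that create sockets while it runs still pick up the temporary default, and the timeout remains a between-bytes timeout, not a cap on the total read time.

import socket
import urllib2

def fetch_with_default_timeout(url, timeout):
    old = socket.getdefaulttimeout()
    socket.setdefaulttimeout(timeout)   # process-wide: every new socket sees this
    try:
        return urllib2.urlopen(url).read()
    finally:
        socket.setdefaulttimeout(old)   # restore for other users, e.g. the Celery worker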

I'd be happy to hear about any other solutions...


The pycurl.TIMEOUT option works for the whole request:

#!/usr/bin/env python3
"""Test that pycurl.TIMEOUT does limit the total request timeout."""
import sys
import pycurl

timeout = 2 #NOTE: it does limit both the total *connection* and *read* timeouts
c = pycurl.Curl()
c.setopt(pycurl.CONNECTTIMEOUT, timeout)
c.setopt(pycurl.TIMEOUT, timeout)
c.setopt(pycurl.WRITEFUNCTION, sys.stdout.buffer.write)
c.setopt(pycurl.HEADERFUNCTION, sys.stderr.buffer.write)
c.setopt(pycurl.NOSIGNAL, 1)
c.setopt(pycurl.URL, 'http://localhost:8000')
c.setopt(pycurl.HTTPGET, 1)
c.perform()

The code raises the timeout error in ~2 seconds. I've tested the total read timeout with a server that sends the response in multiple chunks, with the gaps between chunks smaller than the timeout:

$ python -mslow_http_server 1

where slow_http_server.py is:

#!/usr/bin/env python
"""Usage: python -mslow_http_server [<read_timeout>]

   Return an http response with *read_timeout* seconds between parts.
"""

import time
try:
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer, test
except ImportError: # Python 3
    from http.server import BaseHTTPRequestHandler, HTTPServer, test

def SlowRequestHandlerFactory(read_timeout):
    class HTTPRequestHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            n = 5
            data = b'1\n'

            self.send_response(200)
            self.send_header("Content-type","text/plain; charset=utf-8")
            self.send_header("Content-Length", n*len(data))
            self.end_headers()
            for i in range(n):
                self.wfile.write(data)
                self.wfile.flush()
                time.sleep(read_timeout)
    return HTTPRequestHandler

if __name__ == "__main__":
    import sys
    read_timeout = int(sys.argv[1]) if len(sys.argv) > 1 else 5
    test(HandlerClass=SlowRequestHandlerFactory(read_timeout),
         ServerClass=HTTPServer)

I've tested the total connect timeout with http://google.com:22222.


I would expect this to be a common problem, and yet no answers to be found anywhere... I just built a solution for it using timeout signals:

import urllib2
import socket

timeout = 10
socket.setdefaulttimeout(timeout)

import time
import signal

def timeout_catcher(signum, _):
    raise urllib2.URLError("Read timeout")

signal.signal(signal.SIGALRM, timeout_catcher)

def safe_read(url, timeout_time):
    signal.setitimer(signal.ITIMER_REAL, timeout_time)
    content = urllib2.urlopen(url, timeout=timeout_time).read()
    signal.setitimer(signal.ITIMER_REAL, 0)
    # you should also catch any exceptions coming out of urlopen here,
    # set the timer to 0, and re-raise them
    return content

Credits for the signal part of the solution go to this post on Python timers, btw. Note that SIGALRM-based timers only fire in the main thread, which is why this approach doesn't help in the threaded scenario from the question.


Any asynchronous network library should allow you to enforce a total timeout on any I/O operation, e.g., here's a gevent code example:

#!/usr/bin/env python2
import gevent
import gevent.monkey # $ pip install gevent
gevent.monkey.patch_all()

import urllib2

with gevent.Timeout(2): # enforce total timeout
    response = urllib2.urlopen('http://localhost:8000')
    encoding = response.headers.getparam('charset')
    print response.read().decode(encoding)

And here is the asyncio equivalent:

#!/usr/bin/env python3.5
import asyncio
import aiohttp # $ pip install aiohttp

async def fetch_text(url):
    response = await aiohttp.get(url)
    return await response.text()

text = asyncio.get_event_loop().run_until_complete(
    asyncio.wait_for(fetch_text('http://localhost:8000'), timeout=2))
print(text)

The test HTTP server is the slow_http_server.py defined above.


That is not the behavior I see. I get a URLError when the call times out:

from urllib2 import Request, urlopen
req = Request('http://www.google.com')
res = urlopen(req, timeout=0.000001)
#  Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#  ...
#  raise URLError(err)
#  urllib2.URLError: <urlopen error timed out>

Can't you just catch this error and then avoid trying to read res? When I try to use res.read() after that, I get NameError: name 'res' is not defined. Is something like this what you need:

from urllib2 import URLError

try:
    res = urlopen(req, timeout=3.0)
except URLError:
    print 'Doh!'
else:
    # reading in a finally block would raise NameError after a timeout,
    # since res was never bound; only read when urlopen succeeded
    print 'yay!'
    print res.read()

I guess the way to implement the timeout manually would be through multiprocessing, no? If the job hasn't finished, you can terminate it. A sketch of that idea is below.
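
A minimal sketch of that multiprocessing idea, assuming you can afford a child process per request (download_with_deadline and _fetch are illustrative names, not from any library):

import multiprocessing
from Queue import Empty          # Python 3: from queue import Empty
from urllib2 import urlopen     # Python 3: from urllib.request import urlopen

def _fetch(url, out_queue):
    # runs in the child process; a hung read() only hangs the child
    out_queue.put(urlopen(url).read())

def download_with_deadline(url, timeout):
    out_queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_fetch, args=(url, out_queue))
    proc.start()
    try:
        # wait for the body or the deadline, whichever comes first
        content = out_queue.get(timeout=timeout)
    except Empty:
        content = None
    proc.terminate()   # kill the child if it is still blocked on the read
    proc.join()
    return content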


I had the same problem with a socket timeout on the read statement. What worked for me was putting both the urlopen and the read inside a try statement. Hope this helps!
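
For completeness, a minimal sketch of that approach (the URL is illustrative). Per the accepted answer, this only catches the timeout when it actually fires between bytes; it does not bound the total read time:

import socket
from urllib2 import Request, urlopen, URLError

req = Request('http://example.com/')
try:
    res = urlopen(req, timeout=10)
    content = res.read()   # a read timeout surfaces here as socket.timeout
except (URLError, socket.timeout):
    content = None         # connect-phase timeouts arrive wrapped in URLError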