Limiting/throttling the rate of HTTP requests in GRequests
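I'm writing a small script in Python 2.7.3 with GRequests and lxml that lets me gather collectible card prices from various websites and compare them. The problem is that one of the websites limits the number of requests and sends back HTTP error 429 if I exceed it.
Is there a way to throttle the number of requests in GRequests so that I don't exceed the number of requests per second I specify? Also, how can I make GRequests retry after some time if HTTP 429 occurs?
On a side note, their limit is ridiculously low: something like 8 requests per 15 seconds. I have breached it with my browser on multiple occasions just by refreshing the page while waiting for price changes.
I'm going to answer my own question, since I had to figure this out by myself and there seems to be very little information about it around.
The idea is as follows. Every request object used with GRequests can take a session object as a parameter when it is created. Session objects, in turn, can have HTTP adapters mounted that are used when making requests. By creating our own adapter we can intercept requests and rate-limit them in whatever way suits our application best. In my case I ended up with the code below.
Object used for throttling: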
```python
import datetime

DEFAULT_BURST_WINDOW = datetime.timedelta(seconds=5)
DEFAULT_WAIT_WINDOW = datetime.timedelta(seconds=15)


class BurstThrottle(object):
    max_hits = None
    hits = None
    burst_window = None
    total_window = None
    timestamp = None

    def __init__(self, max_hits, burst_window, wait_window):
        self.max_hits = max_hits
        self.hits = 0
        self.burst_window = burst_window
        self.total_window = burst_window + wait_window
        self.timestamp = datetime.datetime.min

    def throttle(self):
        """Return how long the caller must wait; timedelta(0) means 'go now'."""
        now = datetime.datetime.utcnow()
        if now < self.timestamp + self.total_window:
            # Still inside the current burst + wait cycle.
            if (now < self.timestamp + self.burst_window) and (self.hits < self.max_hits):
                self.hits += 1
                return datetime.timedelta(0)
            else:
                # Burst used up or expired: wait until the whole cycle has passed.
                return self.timestamp + self.total_window - now
        else:
            # Start a new cycle with this request as its first hit.
            self.timestamp = now
            self.hits = 1
            return datetime.timedelta(0)
```
HTTP adapter:
```python
import datetime

import gevent
import requests.adapters

# BurstThrottle, DEFAULT_BURST_WINDOW and DEFAULT_WAIT_WINDOW come from the
# throttling code above.


class MyHttpAdapter(requests.adapters.HTTPAdapter):
    throttle = None

    def __init__(self, pool_connections=requests.adapters.DEFAULT_POOLSIZE,
                 pool_maxsize=requests.adapters.DEFAULT_POOLSIZE,
                 max_retries=requests.adapters.DEFAULT_RETRIES,
                 pool_block=requests.adapters.DEFAULT_POOLBLOCK,
                 burst_window=DEFAULT_BURST_WINDOW,
                 wait_window=DEFAULT_WAIT_WINDOW):
        self.throttle = BurstThrottle(pool_maxsize, burst_window, wait_window)
        super(MyHttpAdapter, self).__init__(pool_connections=pool_connections,
                                            pool_maxsize=pool_maxsize,
                                            max_retries=max_retries,
                                            pool_block=pool_block)

    def send(self, request, stream=False, timeout=None, verify=True, cert=None, proxies=None):
        request_successful = False
        response = None
        while not request_successful:
            # Sleep (cooperatively) until the throttle lets this request through.
            wait_time = self.throttle.throttle()
            while wait_time > datetime.timedelta(0):
                gevent.sleep(wait_time.total_seconds(), ref=True)
                wait_time = self.throttle.throttle()

            response = super(MyHttpAdapter, self).send(request, stream=stream, timeout=timeout,
                                                       verify=verify, cert=cert, proxies=proxies)

            # On HTTP 429, go around the loop again and retry.
            if response.status_code != 429:
                request_successful = True

        return response
```
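The adapter above keeps retrying for as long as the server answers 429. If you would rather give up after a few attempts, the retry loop could be bounded along these lines. This is only a sketch with stand-in objects: `send_with_retries`, `FakeResponse` and `default_wait` are names I made up for illustration, and it assumes the server may send a `Retry-After` header expressed in seconds, which the site in the question may or may not do.

```python
import time


class FakeResponse(object):
    """Stand-in for a response; only the attributes used below (hypothetical)."""

    def __init__(self, status_code, headers=None):
        self.status_code = status_code
        self.headers = headers or {}


def send_with_retries(send, max_attempts=5, default_wait=15.0, sleep=time.sleep):
    """Call `send()` until it returns a non-429 response or attempts run out.

    `send` is any zero-argument callable returning an object with
    `status_code` and `headers`. `sleep` is injectable, so gevent.sleep
    (or a test double) can be passed in instead of time.sleep.
    """
    response = None
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        if attempt == max_attempts - 1:
            break  # out of attempts; do not sleep again
        # Retry-After may also be an HTTP date; this sketch only handles
        # a number of seconds and falls back to `default_wait` otherwise.
        try:
            wait = float(response.headers.get('Retry-After', default_wait))
        except (TypeError, ValueError):
            wait = default_wait
        sleep(max(wait, 0.0))
    return response  # the last 429 response, so the caller can inspect it
```

Inside the adapter's `send`, the callable would wrap the `super(...).send(...)` call and `sleep` would be `gevent.sleep`.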
Setup:
```python
import datetime

import grequests
import requests

import adapter  # the module that holds MyHttpAdapter

# __CONCURRENT_LIMIT__, urls and handle_response are defined elsewhere in the script.
requests_adapter = adapter.MyHttpAdapter(
    pool_connections=__CONCURRENT_LIMIT__,
    pool_maxsize=__CONCURRENT_LIMIT__,
    max_retries=0,
    pool_block=False,
    burst_window=datetime.timedelta(seconds=5),
    wait_window=datetime.timedelta(seconds=20))

requests_session = requests.session()
requests_session.mount('http://', requests_adapter)
requests_session.mount('https://', requests_adapter)

unsent_requests = (grequests.get(url,
                                 hooks={'response': handle_response},
                                 session=requests_session) for url in urls)
grequests.map(unsent_requests, size=__CONCURRENT_LIMIT__)
```
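Take a look at this for automatic request throttling: https://pypi.python.org/pypi/requeststhrottler/0.2.2
You can set either a fixed delay between requests or a number of requests to send in a fixed number of seconds (which is basically the same thing):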
```python
import requests
from requests_throttler import BaseThrottler

request = requests.Request(method='GET', url='http://www.google.com')
reqs = [request for i in range(0, 5)]  # An example list of requests
with BaseThrottler(name='base-throttler', delay=1.5) as bt:
    throttled_requests = bt.multi_submit(reqs)
```
Here the function `multi_submit` returns a list of throttled requests. You can then access the responses:
```python
for tr in throttled_requests:
    print tr.response
```
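Alternatively, you can achieve the same thing by specifying a number of requests to send in a fixed amount of time (e.g. 15 requests every 60 seconds):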
```python
import requests
from requests_throttler import BaseThrottler

request = requests.Request(method='GET', url='http://www.google.com')
reqs = [request for i in range(0, 5)]  # An example list of requests
with BaseThrottler(name='base-throttler', reqs_over_time=(15, 60)) as bt:
    throttled_requests = bt.multi_submit(reqs)
```
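To see why the two options are "basically the same", here is the arithmetic as a tiny helper. It is just my illustration of the conversion (the name `delay_between_requests` is made up), not a description of how the library computes things internally.

```python
def delay_between_requests(max_requests, period_seconds):
    """Evenly spaced delay that keeps at most `max_requests` per `period_seconds`."""
    if max_requests <= 0:
        raise ValueError('max_requests must be positive')
    return float(period_seconds) / max_requests


# 15 requests every 60 seconds is one request every 4 seconds.
print(delay_between_requests(15, 60))
# The limit from the question, 8 requests per 15 seconds.
print(delay_between_requests(8, 15))
```

For the limit in the question that works out to 1.875 seconds between requests, so a fixed `delay` of about 2 seconds should stay under it, assuming the site counts requests evenly over the window.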
Both solutions can also be used without the `with` statement:
```python
import requests
from requests_throttler import BaseThrottler

request = requests.Request(method='GET', url='http://www.google.com')
reqs = [request for i in range(0, 5)]  # An example list of requests
bt = BaseThrottler(name='base-throttler', delay=1.5)
bt.start()
throttled_requests = bt.multi_submit(reqs)
bt.shutdown()
```
For more details, see: http://pythonhosted.org/requeststhrottler/index.html
I had a similar problem. Here's my solution. In your case, I would do:
```python
def worker():
    with rate_limit('slow.domain.com', 2):
        response = requests.get('https://slow.domain.com/path')
        text = response.text
    # Use `text`
```
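Assuming you are pulling from multiple domains, I would set up a dictionary mapping each domain to its delay.
This code assumes you are going to use gevent and monkey patching.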
```python
from contextlib import contextmanager
import gevent
from gevent.event import Event
from gevent.queue import Queue
from time import time


def rate_limit(resource, delay, _queues={}):
    """Delay use of `resource` until after `delay` seconds have passed.

    Example usage:

    def worker():
        with rate_limit('foo.bar.com', 1):
            response = requests.get('https://foo.bar.com/path')
            text = response.text
        # use `text`

    This will serialize and delay requests from multiple workers for resource
    'foo.bar.com' by 1 second.

    """

    # The mutable default `_queues` is deliberate: it holds one queue (and
    # one watcher greenlet) per resource across calls.
    if resource not in _queues:
        queue = Queue()
        gevent.spawn(_watch, queue)
        _queues[resource] = queue

    return _resource_manager(_queues[resource], delay)


def _watch(queue):
    "Watch `queue` and wake event listeners after delay."

    last = 0

    while True:
        event, delay = queue.get()

        now = time()

        if (now - last) < delay:
            gevent.sleep(delay - (now - last))

        event.set()   # Wake worker but keep control.
        event.clear()
        event.wait()  # Yield control until woken.

        last = time()


@contextmanager
def _resource_manager(queue, delay):
    "`with` statement support for `rate_limit`."

    event = Event()
    queue.put((event, delay))

    event.wait()  # Wait for queue watcher to wake us.

    yield

    event.set()   # Wake queue watcher.
```
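For the multiple-domain case, the dictionary could look something like this. It is a rough sketch: the domains, the delays, `DEFAULT_DELAY` and the helper `delay_for` are all placeholders of mine, so substitute the real limits of the sites you are hitting.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

# Hypothetical domains and delays, in seconds between requests.
DOMAIN_DELAYS = {
    'slow.domain.com': 2,
    'foo.bar.com': 1,
}
DEFAULT_DELAY = 0.5  # assumed fallback for domains not listed above


def delay_for(url):
    """Return (domain, delay) for `url`, falling back to DEFAULT_DELAY."""
    domain = urlparse(url).hostname
    return domain, DOMAIN_DELAYS.get(domain, DEFAULT_DELAY)
```

A worker can then do `with rate_limit(*delay_for(url)):` around its `requests.get(url)` call, so each domain gets its own queue and its own delay.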
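It doesn't look like there is any simple mechanism for handling this built into the requests or grequests code. The only hook that seems to be available is for responses.
Here's a super hacky workaround that at least proves it's possible: I modified grequests to keep a list of the times at which requests were issued, and to put the creation of the AsyncRequest to sleep until the request rate is below the maximum.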
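```python
import time

import gevent

q = []  # module-level list of the times at which requests were issued


class AsyncRequest(object):
    def __init__(self, method, url, **kwargs):
        print self,'init'
        waiting=True
        while waiting:
            # Fewer than 8 requests issued in the last 15 seconds?
            if len([x for x in q if x > time.time()-15]) < 8:
                q.append(time.time())
                waiting=False
            else:
                print self,'snoozing'
                gevent.sleep(1)
```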
You can use grequests.imap() to watch this interactively:
```python
import time
import rg  # presumably the modified copy of grequests described above

urls = [
    'http://www.heroku.com',
    'http://python-tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://kennethreitz.com',
    'http://www.cnn.com',
]

def print_url(r, *args, **kwargs):
    print(r.url),time.time()

hook_dict=dict(response=print_url)
rs = (rg.get(u, hooks=hook_dict) for u in urls)
for r in rg.imap(rs):
    print r
```
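I was hoping for a more elegant solution, but so far I can't find one. I looked around in sessions and adapters. Maybe the pool manager could be augmented instead?
Also, I wouldn't put this code in production: the 'q' list never gets trimmed and would eventually get pretty big. Plus, I don't know whether it actually works as advertised. It just looks like it does when I watch the console output.
Ugh. Just looking at this code I can tell it's 3am. Time to go to bed.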
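On the untrimmed `q` list: one way to keep it bounded would be a deque that drops timestamps once they fall out of the window. This is only a sketch of that idea, not what the hack above does. `SlidingWindowLimiter`, `try_acquire` and the injectable `clock` are names I invented, and the 8-per-15-seconds defaults are taken from the question.

```python
import time
from collections import deque


class SlidingWindowLimiter(object):
    """Allow at most `max_requests` in any `window` seconds."""

    def __init__(self, max_requests=8, window=15.0, clock=time.time):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock  # injectable so the logic can be tested without waiting
        self.stamps = deque()  # send times that are still inside the window

    def try_acquire(self):
        """Record a request and return True if one may be sent now."""
        now = self.clock()
        # Drop timestamps that have left the window, so the deque stays small.
        while self.stamps and self.stamps[0] <= now - self.window:
            self.stamps.popleft()
        if len(self.stamps) < self.max_requests:
            self.stamps.append(now)
            return True
        return False
```

The waiting loop would then be something like `while not limiter.try_acquire(): gevent.sleep(1)`, and the deque never holds more than `max_requests` entries.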