Python requests timeout has no effect when the site streams
The following works for 99.999% of websites, but I've randomly found one it doesn't work for:
```python
import requests

requests.get('http://arboleascity.com', timeout=(5, 5), verify=False)
```
I've already opened an issue on the project:
https://github.com/requests/requests/issues/4276
Any suggestions or ideas?
It streams a shoutcast stream (Content-Type: audio/aacp), so there is no timeout to hit; the server simply never stops sending bytes.

Set a `User-Agent` header to something browser-like if you want the homepage instead of the stream. If you want the audio stream, use `stream=True` and iterate over the response in chunks.

If you're writing a scraper, you may want to check the Content-Type on a HEAD request before attempting to fetch a chunked response.
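A minimal sketch of that HEAD check (assuming the server answers HEAD requests consistently with GET; the variable names here are illustrative, not from the answer):

```python
import requests

URL = 'http://arboleascity.com'

# Probe the Content-Type first: a HEAD response carries no body,
# so an endless audio stream cannot hang the client at this step.
head = requests.head(URL, timeout=(5, 5), allow_redirects=True)
content_type = head.headers.get('Content-Type', '')

if content_type.startswith('text/html'):
    # Only now commit to downloading the full body.
    page = requests.get(URL, timeout=(5, 5))
    print(page.text)
else:
    print('Not HTML, skipping:', content_type)
```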
The problem isn't with `requests` or its timeout handling. That is, the server appears to pick what it serves based on the `User-Agent`: with a valid browser signature it just returns the page HTML (`Content-Type: text/html`):
```
$ curl -vvv -A 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0' http://arboleascity.com >/dev/null
...
< Content-Type: text/html;charset=utf-8
...
100   118    0   118    0     0    297      0 --:--:-- --:--:-- --:--:--   297
* Connection #0 to host arboleascity.com left intact
```
However, if you keep curl's default `User-Agent`, you get the never-ending shoutcast stream instead:
```
$ curl -vvv http://arboleascity.com >/dev/null
...
< Content-Type: audio/aacp
...
< icy-notice1: This stream requires Winamp
< icy-notice2: SHOUTcast DNAS/posix(linux x64) v2.5.1.724
...
100  345k    0  345k    0     0  26975      0 --:--:--  0:00:13 --:--:--  7118^C
```
Or, the same thing from `requests`, with a browser-like `User-Agent` header:
```python
>>> headers = {'user-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0'}
>>> r = requests.get('http://arboleascity.com', headers=headers)
```
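If the `User-Agent` sniffing theory holds, the response headers should confirm which variant you received (the value below is taken from the curl transcript above, not a fresh run):

```python
>>> r.headers['Content-Type']
'text/html;charset=utf-8'
```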
Both timeout values, by the way, work exactly as documented:
> The connect timeout is the number of seconds Requests will wait for your client to establish a connection to a remote machine (corresponding to the `connect()` call on the socket). It's a good practice to set connect timeouts to slightly larger than a multiple of 3, which is the default TCP packet retransmission window.
>
> Once your client has connected to the server and sent the HTTP request, the read timeout is the number of seconds the client will wait for the server to send a response. (Specifically, it's the number of seconds that the client will wait between bytes sent from the server. In 99.9% of cases, this is the time before the server sends the first byte.)
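Note the consequence of that last sentence: a server that keeps sending bytes forever resets the read timeout on every read, so the request never fails. A minimal sketch of one workaround, assuming you want to cap total wall-clock time and size yourself (`MAX_SECONDS` and `MAX_BYTES` are arbitrary illustrative limits):

```python
import time
import requests

MAX_SECONDS = 10          # overall wall-clock budget for the download
MAX_BYTES = 1024 * 1024   # give up after 1 MiB

start = time.monotonic()
body = b''
# stream=True defers the body download so we can meter it ourselves;
# timeout=(5, 5) still bounds the connect and per-read waits.
with requests.get('http://arboleascity.com', stream=True, timeout=(5, 5)) as r:
    for chunk in r.iter_content(chunk_size=8192):
        body += chunk
        if time.monotonic() - start > MAX_SECONDS or len(body) > MAX_BYTES:
            break  # the server is still streaming; stop reading
```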