Content-length header not the same as when manually calculating it?
The answer given there (for the byte size of the raw response) is to just take the len() of the content of the response:
>>> response = requests.get('https://github.com/')
>>> len(response.content)
51671
But doing that does not give you the content length the server reported. For example, look at this Python code:
import sys
import requests

def proccessUrl(url):
    try:
        r = requests.get(url)
        print("Correct Content Length:"+r.headers['Content-Length'])
        print("bytes of r.text :"+str(sys.getsizeof(r.text)))
        print("bytes of r.content :"+str(sys.getsizeof(r.content)))
        print("len r.text :"+str(len(r.text)))
        print("len r.content :"+str(len(r.content)))
    except Exception as e:
        print(str(e))

#this url contains a content-length header, we will use that to see if the content length we calculate is the same.
proccessUrl("https://stackoverflow.com")
If we try to calculate the content length manually and compare it to the value in the header, we get a much larger answer:
Correct Content Length: 51504
bytes of r.text : 515142
bytes of r.content : 257623
len r.text : 257552
len r.content : 257606
Why do the manually calculated lengths differ so much from the Content-Length header?
The Content-Length header describes the body as it was sent over the wire, and that body was compressed; requests transparently decompresses it for you, so r.content (the decompressed bytes) and r.text (those bytes decoded to text) are both larger. If you want the raw, still-compressed bytes that the Content-Length header counts, enable streaming and read directly from the underlying urllib3 file object, then put the data back so the response object still works:
from io import BytesIO

r = requests.get(url, stream=True)
# read directly from the raw urllib3 connection
raw_content = r.raw.read()
content_length = len(raw_content)

# replace the internal file-object to serve the data again
r.raw._fp = BytesIO(raw_content)
Demo:
>>> import requests
>>> from io import BytesIO
>>> url = "https://stackoverflow.com"
>>> r = requests.get(url, stream=True)
>>> r.headers['Content-Encoding']  # a compressed response
'gzip'
>>> r.headers['Content-Length']  # the raw response contains 52055 bytes of compressed data
'52055'
>>> r.headers['Content-Type']  # we are served UTF-8 HTML data
'text/html; charset=utf-8'
>>> raw_content = r.raw.read()
>>> len(raw_content)  # the raw content body length
52055
>>> r.raw._fp = BytesIO(raw_content)
>>> len(r.content)  # the decompressed binary content, byte count
258719
>>> len(r.text)  # the Unicode content decoded from UTF-8, character count
258658
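The size relationship in that demo can be reproduced locally, with no network involved, by gzip-compressing a body ourselves (a minimal sketch; the HTML byte string is made up):

```python
import gzip

# A made-up response body, standing in for the HTML a server would send.
body = b"<html>" + b"<p>hello world</p>" * 1000 + b"</html>"

# These bytes are what travel over the wire; Content-Length counts them.
compressed = gzip.compress(body)

print(len(compressed) < len(body))          # True: repetitive HTML compresses well
print(gzip.decompress(compressed) == body)  # True: decompression restores the original exactly
```

requests performs the equivalent of that decompress step for you whenever the server sends Content-Encoding: gzip, which is exactly why len(r.content) exceeds the Content-Length header.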
This reads the full response into memory, so don't use this if you expect a large response! In that case, you can instead use shutil.copyfileobj() to copy the data from the r.raw file object to a spooled temporary file, take the size from that, and then attach the file back onto the response:
import requests
import shutil
import tempfile

def ensure_content_length(
        url, *args, method='GET', session=None,
        max_size=2**20,  # 1Mb
        **kwargs):
    kwargs['stream'] = True
    session = session or requests.Session()
    r = session.request(method, url, *args, **kwargs)
    if 'Content-Length' not in r.headers:
        # stream content into a temporary file so we can get the real size
        spool = tempfile.SpooledTemporaryFile(max_size)
        shutil.copyfileobj(r.raw, spool)
        r.headers['Content-Length'] = str(spool.tell())
        spool.seek(0)
        # replace the original socket with our temporary file
        r.raw._fp.close()
        r.raw._fp = spool
    return r
This accepts an existing session, and lets you specify the request method too. Adjust max_size to suit your memory constraints:
>>> r = ensure_content_length('https://github.com/')
>>> r
<Response [200]>
>>> r.headers['Content-Length']
'14490'
>>> len(r.content)
54814
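As an aside on the max_size parameter: SpooledTemporaryFile buffers its data in memory until the written size exceeds max_size, then transparently rolls over to a real temporary file on disk. A quick sketch (it peeks at _rolled, a private CPython attribute, purely for illustration):

```python
import tempfile

spool = tempfile.SpooledTemporaryFile(max_size=10)

spool.write(b"12345")       # 5 bytes written, within max_size
print(spool._rolled)        # False: still an in-memory buffer

spool.write(b"6789012345")  # 15 bytes total, exceeds max_size
print(spool._rolled)        # True: now backed by a temp file on disk

spool.seek(0)
print(spool.read())         # b'123456789012345': the data survives the rollover
```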
Note that if there is no Content-Encoding header present (or it is set to identity), the body was not compressed in transit, so an existing Content-Length header already matches len(r.content) and none of the above is needed.
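That condition can be expressed as a tiny helper (the function name is ours, and plain dicts stand in for the response's case-insensitive headers):

```python
def content_length_is_exact(headers):
    """Hypothetical check: can Content-Length be trusted to equal len(r.content)?

    True only when a Content-Length is present and the body was not
    compressed in transit (no Content-Encoding header, or 'identity').
    """
    encoding = headers.get('Content-Encoding', 'identity')
    return encoding == 'identity' and 'Content-Length' in headers

print(content_length_is_exact({'Content-Length': '51671'}))   # True
print(content_length_is_exact({'Content-Length': '52055',
                               'Content-Encoding': 'gzip'}))  # False
```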
As a side note: don't read anything into the sys.getsizeof() numbers above. sys.getsizeof() reports the memory footprint of the Python object itself, interpreter bookkeeping included, not the length of the data it holds.