Python: Content-Length header not the same as when manually calculating it?

The answer here (Size of raw response in bytes) says:

Just take the len() of the content of the response:

>>> response = requests.get('https://github.com/')
>>> len(response.content)
51671

But doing this does not give the accurate content length. For example, take a look at this Python code:

import sys
import requests

def proccessUrl(url):
    try:
        r = requests.get(url)
        print("Correct Content Length:"+r.headers['Content-Length'])
        print("bytes of r.text :"+str(sys.getsizeof(r.text)))
        print("bytes of r.content :"+str(sys.getsizeof(r.content)))
        print("len r.text :"+str(len(r.text)))
        print("len r.content :"+str(len(r.content)))
    except Exception as e:
        print(str(e))

#this url contains a content-length header, we will use that to see if the content length we calculate is the same.
proccessUrl("https://stackoverflow.com")

If we try to calculate the content length manually and compare it with the value in the header, we get a much larger number:

Correct Content Length: 51504
bytes of r.text : 515142
bytes of r.content : 257623
len r.text : 257552
len r.content : 257606

Why doesn't len(r.content) return the correct content length? And how can we calculate it accurately by hand if the header is missing?

Related discussion

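The Content-Length header reflects the body of the response as sent over the wire. That is not the same as the length of the text or content attributes, because the response may be compressed. requests decompresses the response for you.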
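To see why the two numbers differ, here is a minimal offline sketch that uses only the standard library. The `body` value is a made-up stand-in for an HTML page, and `gzip.compress` stands in for whatever a server sending `Content-Encoding: gzip` would do; real servers may use other settings or encodings.

```python
import gzip

# Hypothetical response body: repetitive HTML compresses well.
body = b"<p>Hello, world!</p>\n" * 1000

# What a server sending "Content-Encoding: gzip" would put on the wire.
wire_bytes = gzip.compress(body)

# Content-Length describes the bytes on the wire, not the decoded body.
content_length = len(wire_bytes)

# A client such as requests decompresses transparently, so this is the
# size that len(response.content) corresponds to.
decoded = gzip.decompress(wire_bytes)

print("Content-Length (compressed):", content_length)
print("len(content) (decompressed):", len(decoded))
```

The compressed size is much smaller than the decoded size, which matches the pattern in the question's output.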

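You would have to bypass a lot of internal plumbing to get the original, compressed, raw content, and then access some more internals if you want the response object to keep working correctly. The "easiest" method is to enable streaming, then read from the raw socket: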

from io import BytesIO

import requests

url = "https://stackoverflow.com"
r = requests.get(url, stream=True)
# read directly from the raw urllib3 connection
raw_content = r.raw.read()
content_length = len(raw_content)
# replace the internal file-object to serve the data again
r.raw._fp = BytesIO(raw_content)

Demo:

>>> import requests
>>> from io import BytesIO
>>> url ="https://stackoverflow.com"
>>> r = requests.get(url, stream=True)
>>> r.headers['Content-Encoding'] # a compressed response
'gzip'
>>> r.headers['Content-Length'] # the raw response contains 52055 bytes of compressed data
'52055'
>>> r.headers['Content-Type'] # we are served UTF-8 HTML data
'text/html; charset=utf-8'
>>> raw_content = r.raw.read()
>>> len(raw_content) # the raw content body length
52055
>>> r.raw._fp = BytesIO(raw_content)
>>> len(r.content) # the decompressed binary content, byte count
258719
>>> len(r.text) # the Unicode content decoded from UTF-8, character count
258658

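This reads the full response into memory, so don't use it if you expect large responses! In that case, you could use shutil.copyfileobj() to copy the data from the r.raw file to a spooled temporary file (which switches to an on-disk file once a certain size is reached), get the size of that file, and then put that file in place as r.raw._fp.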
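The spooling step can be tried on its own without any network access. The sketch below is an illustration under assumptions: `spool_and_measure` is a hypothetical helper name, and an `io.BytesIO` object stands in for `r.raw`.

```python
import io
import shutil
import tempfile

def spool_and_measure(fileobj, max_size=2**20):
    """Copy fileobj into a spooled temporary file; return (size, spool)."""
    # Data stays in memory until max_size bytes, then rolls over to disk.
    spool = tempfile.SpooledTemporaryFile(max_size=max_size)
    shutil.copyfileobj(fileobj, spool)
    size = spool.tell()  # position after the last byte written
    spool.seek(0)        # rewind so the data can be read again
    return size, spool

# Stand-in for r.raw: 5000 bytes, more than the 1024-byte in-memory limit.
fake_raw = io.BytesIO(b"x" * 5000)
size, spool = spool_and_measure(fake_raw, max_size=1024)
print(size)
```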

A function that adds a Content-Length header to any response that is missing that header would look like this:

import requests
import shutil
import tempfile

def ensure_content_length(
    url, *args, method='GET', session=None, max_size=2**20,  # 1 MiB
    **kwargs
):
    kwargs['stream'] = True
    session = session or requests.Session()
    r = session.request(method, url, *args, **kwargs)
    if 'Content-Length' not in r.headers:
        # stream content into a temporary file so we can get the real size
        spool = tempfile.SpooledTemporaryFile(max_size)
        shutil.copyfileobj(r.raw, spool)
        r.headers['Content-Length'] = str(spool.tell())
        spool.seek(0)
        # replace the original socket with our temporary file
        r.raw._fp.close()
        r.raw._fp = spool
    return r

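This accepts an existing session and lets you specify the request method too. Adjust max_size as needed for your memory constraints. Demo on https://github.com, which lacks a Content-Length header: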

>>> r = ensure_content_length('https://github.com/')
>>> r
<Response [200]>
>>> r.headers['Content-Length']
'14490'
>>> len(r.content)
54814

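Note that if there is no Content-Encoding header, or the value of that header is identity, and Content-Length is available, then you can simply rely on Content-Length as the full size of the response. In that case no compression has been applied.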
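That rule could be wrapped in a small helper. This is only a sketch and not part of requests: the name `content_length_is_full_size` is hypothetical, and it accepts any mapping of header names to values (a plain dict here; `r.headers` should work the same way).

```python
def content_length_is_full_size(headers):
    """Return True if Content-Length can be taken as the decoded body size."""
    # Header names are case-insensitive, so normalise them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    if "content-length" not in normalized:
        return False
    # A missing Content-Encoding header means the same as "identity".
    encoding = normalized.get("content-encoding", "identity").strip().lower()
    return encoding == "identity"

print(content_length_is_full_size({"Content-Length": "1234"}))
print(content_length_is_full_size(
    {"Content-Length": "52055", "Content-Encoding": "gzip"}
))
```

The first call reports that the header can be trusted; the second does not, because the body was gzip-compressed.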

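As a side note: you should not use sys.getsizeof() if what you are after is the length of a bytes or str object (the number of bytes or characters in that object). sys.getsizeof() gives you the internal memory footprint of a Python object, which covers more than just the number of bytes or characters in that object. See "What is the difference between len() and sys.getsizeof() methods in python?"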
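A short illustration of the difference, with made-up sample values. The exact sys.getsizeof() figures depend on the Python implementation and version, so none are shown here; the point is only that they exceed the len() results. The second half shows why len(r.text) can be smaller than len(r.content): one character may take several bytes in UTF-8.

```python
import sys

payload = b"abc"
print(len(payload))            # number of bytes in the object
print(sys.getsizeof(payload))  # memory footprint, includes object overhead

# "h\u00e9llo" is "héllo": five characters, but é needs two bytes in UTF-8.
text = "h\u00e9llo"
encoded = text.encode("utf-8")
print(len(text))     # character count
print(len(encoded))  # byte count
```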