Does Python urllib2 automatically uncompress gzip data fetched from a webpage?

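I'm using

data = urllib2.urlopen(url).read()

I want to know:

How can I tell if the data at a URL is gzipped?

Does urllib2 automatically uncompress the data if it is gzipped? Will the data always be a string?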

How can I tell if the data at a URL is gzipped?

This checks whether the content is gzipped and decompresses it:

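from StringIO import StringIO
import gzip
import urllib2

# Python 2 code: urllib2 and StringIO do not exist in Python 3.
request = urllib2.Request('http://example.com/')
# Ask the server for a gzip-compressed response.
request.add_header('Accept-encoding', 'gzip')
response = urllib2.urlopen(request)
if response.info().get('Content-Encoding') == 'gzip':
    # The body is still compressed; unpack it by hand.
    buf = StringIO(response.read())
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()
else:
    # The server did not compress the body, so use it as-is.
    data = response.read()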

Does urllib2 automatically uncompress the data if it is gzipped? Will the data always be a string?

No. urllib2 does not automatically uncompress the data, because the 'Accept-Encoding' header is not set by urllib2 but by you, using: request.add_header('Accept-Encoding','gzip, deflate')
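
Asking for 'gzip, deflate' means the reply may come back in either encoding, so the decoding step has to cover both. The following is a minimal sketch of such a step, not a definitive implementation. It is written for Python 3, and the function name decode_body is a hypothetical helper that does not come from urllib2 or any other library.

```python
import gzip
import zlib


def decode_body(body, content_encoding):
    """Undo an HTTP Content-Encoding and return the plain bytes.

    `body` is the raw response body, `content_encoding` is the value of
    the Content-Encoding response header (or None if it was absent).
    """
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            # In HTTP, "deflate" is supposed to mean zlib-wrapped data...
            return zlib.decompress(body)
        except zlib.error:
            # ...but some servers send a raw deflate stream instead.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # No encoding, or one this sketch does not know: return unchanged.
    return body
```

With a response object from urllib2 (or urllib.request in Python 3), this would be called roughly as decode_body(response.read(), response.info().get('Content-Encoding')).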

If you are talking about a simple .gz file, no, urllib2 will not decode it; you will get the unchanged .gz file as output.
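
For the plain .gz file case there is no Content-Encoding header to inspect, so one option is to look at the bytes themselves: every gzip stream starts with the two bytes 0x1f 0x8b. The sketch below is a heuristic under that assumption (other data could happen to start with the same bytes), and the names GZIP_MAGIC, looks_like_gzip and gunzip_if_needed are hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the two-byte signature that opens a gzip stream


def looks_like_gzip(data):
    """Heuristic check: does `data` start with the gzip signature?"""
    return data[:2] == GZIP_MAGIC


def gunzip_if_needed(data):
    """Decompress `data` if it looks like gzip, otherwise return it as-is."""
    if looks_like_gzip(data):
        return gzip.decompress(data)
    return data
```

The bytes returned by read() on the downloaded .gz file can be passed straight to gunzip_if_needed.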

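If you are talking about automatic HTTP-level compression using Content-Encoding: gzip or deflate, then that has to be deliberately requested by the client using an Accept-Encoding header.

urllib2 doesn't set this header, so the response it gets back will not be compressed. You can safely fetch the resource without having to worry about compression (though since compression isn't supported, the request may take longer).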
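
The negotiation described here can be tried without touching a real site. The sketch below is for Python 3, where urllib.request is the successor of urllib2, and it starts a toy server on localhost that compresses only when asked. Everything in it (PAYLOAD, DemoHandler, fetch, run_demo) is a hypothetical name made up for the illustration, and the toy server only approximates how a well-behaved real server acts.

```python
import gzip
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PAYLOAD = b"hello gzip " * 50  # made-up body served by the toy server


class DemoHandler(BaseHTTPRequestHandler):
    """Toy server: gzips the body only if the client asked for gzip."""

    def do_GET(self):
        body = PAYLOAD
        wants_gzip = "gzip" in self.headers.get("Accept-Encoding", "")
        if wants_gzip:
            body = gzip.compress(body)
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        if wants_gzip:
            self.send_header("Content-Encoding", "gzip")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch(url, ask_for_gzip):
    """Return (Content-Encoding header, raw body bytes) for one GET."""
    request = urllib.request.Request(url)
    if ask_for_gzip:
        request.add_header("Accept-Encoding", "gzip")
    # An empty ProxyHandler keeps the request on localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return response.headers.get("Content-Encoding"), response.read()


def run_demo():
    """Fetch the same URL without and with an Accept-Encoding header."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        plain = fetch(url, ask_for_gzip=False)
        packed = fetch(url, ask_for_gzip=True)
    finally:
        server.shutdown()
        server.server_close()
    return plain, packed
```

Calling run_demo() returns two pairs. The first request carries no gzip request, so its body is the plain payload with no Content-Encoding header. The second comes back marked gzip and its bytes are still compressed, which shows that read() hands over the body untouched and that gzip.decompress (or the GzipFile approach shown earlier) is still needed.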

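Your question has already been answered, but for a more comprehensive implementation, take a look at Mark Pilgrim's implementation of this. It covers gzip, deflate, safe URL parsing, and much more. It was written for a widely used RSS parser, but it is a useful reference nonetheless.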

It now appears that urllib3 handles this automatically.

Response headers:

HTTPHeaderDict({'ETag': '"112d13e-574c64196bcd9-gzip"', 'Vary':
'Accept-Encoding', 'Content-Encoding': 'gzip', 'X-Frame-Options':
'sameorigin', 'Server': 'Apache', 'Last-Modified': 'Sat, 01 Sep 2018
02:42:16 GMT', 'X-Content-Type-Options': 'nosniff',
'X-XSS-Protection': '1; mode=block', 'Content-Type': 'text/plain;
charset=utf-8', 'Strict-Transport-Security': 'max-age=315360000;
includeSubDomains', 'X-UA-Compatible': 'IE=edge', 'Date': 'Sat, 01 Sep
2018 14:20:16 GMT', 'Accept-Ranges': 'bytes', 'Transfer-Encoding':
'chunked'})

Reference code:

import urllib3


class EDDBMultiDataFetcher():
    def __init__(self):
        self.files_dict = {
            'Populated Systems': 'http://eddb.io/archive/v5/systems_populated.jsonl',
            'Stations': 'http://eddb.io/archive/v5/stations.jsonl',
            'Minor factions': 'http://eddb.io/archive/v5/factions.jsonl',
            'Commodities': 'http://eddb.io/archive/v5/commodities.json'
        }
        self.http = urllib3.PoolManager()

    def fetch_all(self):
        for item, url in self.files_dict.items():
            self.fetch(item, url)

    def fetch(self, item, url, save_file=None):
        print("Fetching:" + item)
        response = self.http.request(
            'GET',
            url,
            headers={
                'Accept-encoding': 'gzip, deflate, sdch'
            })
        # No gzip or io import needed: .data arrives already decompressed.
        data = response.data.decode('utf-8')
        print("Fetch complete")
        print(data)
        print(response.headers)
        # Stop after the first file.
        quit()


if __name__ == '__main__':
    print("Fetching files from eddb.io")
    fetcher = EDDBMultiDataFetcher()
    fetcher.fetch_all()