Python: How can I work with Gzip files which contain extra data?

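I'm writing a script that consumes data from an instrument as gzip streams. In about 90% of cases the gzip module works perfectly, but some of the streams cause it to raise IOError: Not a gzipped file. If I strip the gzip header and feed the deflate stream directly to zlib, I instead get Error -3 while decompressing data: incorrect header check. After about half a day of banging my head against the wall, I discovered that the problem streams contain a seemingly random number of extra bytes (which are not part of the gzip data) appended to the end.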

It strikes me as odd that Python cannot work with these files, for two reasons:

Both Gzip and 7zip are able to open these "padded" files without issue. (Gzip produces the message decompression OK, trailing garbage ignored; 7zip succeeds silently.)

Both the Gzip and Python documentation seem to indicate that this should work (emphasis mine):

Gzip's format.txt:

It must be possible to
detect the end of the compressed data with any compression method,
regardless of the actual size of the compressed data. In particular,
the decompressor must be able to detect and skip extra data appended
to a valid compressed file on a record-oriented file system, or when
the compressed data can only be read from a device in multiples of a
certain block size.

Python's gzip.GzipFile:

Calling a GzipFile object’s close() method does not close fileobj, since you might wish to append more material after the compressed data. This also allows you to pass a StringIO object opened for writing as fileobj, and retrieve the resulting memory buffer using the StringIO object’s getvalue() method.

Python's zlib.Decompress.unused_data:

A string which contains any bytes past the end of the compressed data. That is, this remains "" until the last byte that contains compression data is available. If the whole string turned out to contain compressed data, this is "", the empty string.

The only way to determine where a string of compressed data ends is by actually decompressing it. This means that when compressed data is contained part of a larger file, you can only find the end of it by reading data and feeding it followed by some non-empty string into a decompression object’s decompress() method until the unused_data attribute is no longer the empty string.
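To make the quoted behaviour concrete, here is a minimal sketch of how unused_data exposes bytes that follow a compressed stream. It uses a zlib-wrapped stream built in memory rather than one of the instrument files; the names payload and padding are hypothetical and exist only for illustration.

```python
import zlib

payload = b"instrument readings " * 50
padding = b"\x00\xffPAD"  # hypothetical trailing bytes, not part of the stream

# A zlib-wrapped stream followed by unrelated bytes.
blob = zlib.compress(payload) + padding

decompressor = zlib.decompressobj()
data = decompressor.decompress(blob)

# The decompressor stops at the end of the compressed stream;
# whatever follows it is exposed through unused_data.
trailing = decompressor.unused_data
```

In this sketch, data should equal the original payload and trailing should hold the appended bytes, which is the mechanism the documentation describes for finding where compressed data ends.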

Here are the four approaches I have tried. (These examples are Python 3.1, but I have tested 2.5 and 2.7 and had the same problem.)

import gzip
import zlib

# approach 1 - gzip.open
with gzip.open(filename) as datafile:
    data = datafile.read()

# approach 2 - gzip.GzipFile
with open(filename, "rb") as gzipfile:
    with gzip.GzipFile(fileobj=gzipfile) as datafile:
        data = datafile.read()

# approach 3 - zlib.decompress
# ([10:] skips the fixed 10-byte gzip header)
with open(filename, "rb") as gzipfile:
    data = zlib.decompress(gzipfile.read()[10:])

# approach 4 - zlib.decompressobj
with open(filename, "rb") as gzipfile:
    decompressor = zlib.decompressobj()
    data = decompressor.decompress(gzipfile.read()[10:])

Am I doing something wrong?

UPDATE

Okay, while the problem with gzip seems to be a bug in the module, my zlib problems were self-inflicted. ;-)

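While digging into gzip.py I realized what I was doing wrong: by default, zlib.decompress et al. expect zlib-wrapped streams, not bare deflate streams. By passing in a negative value for wbits, you can tell zlib to skip the zlib header and decompress the raw stream. Both of these work: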

import zlib

# approach 5 - zlib.decompress with negative wbits
with open(filename, "rb") as gzipfile:
    data = zlib.decompress(gzipfile.read()[10:], -zlib.MAX_WBITS)

# approach 6 - zlib.decompressobj with negative wbits
with open(filename, "rb") as gzipfile:
    decompressor = zlib.decompressobj(-zlib.MAX_WBITS)
    data = decompressor.decompress(gzipfile.read()[10:])
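A possible variation, offered as a sketch rather than a confirmed fix: zlib can also be asked to parse the gzip header and trailer itself by adding 16 to wbits. That avoids slicing off a fixed 10 bytes, which only works when the gzip header carries no optional fields such as a file name. The helper name gunzip_ignoring_trailing_bytes is hypothetical, the sample data is built in memory with gzip.compress (Python 3.2+), and the sketch assumes a single gzip member followed by padding.

```python
import gzip
import zlib


def gunzip_ignoring_trailing_bytes(raw):
    """Return (data, trailing) for one gzip member followed by arbitrary bytes.

    Hypothetical helper for illustration; assumes a single gzip member.
    """
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer
    # instead of a zlib wrapper or a raw deflate stream.
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    data = decompressor.decompress(raw)
    # Bytes after the end of the gzip member end up in unused_data.
    return data, decompressor.unused_data


# Build a "padded" stream in memory instead of reading an instrument file.
payload = b"sample payload " * 100
padded = gzip.compress(payload) + b"\x13\x37 padding"

data, trailing = gunzip_ignoring_trailing_bytes(padded)
```

With this approach the extra bytes are not an error; they are returned separately, so the caller can decide whether to log or discard them.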