How can I work with Gzip files which contain extra data?
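I am writing a script that will consume data from an instrument as gzip streams. In about 90% of cases, ...
There are two reasons I find it strange that Python cannot handle these files:
Both the Gzip and the Python documentation seem to indicate that this should work (emphasis mine):
Gzip's format.txt: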
It must be possible to
detect the end of the compressed data with any compression method,
regardless of the actual size of the compressed data. In particular,
the decompressor must be able to detect and skip extra data appended
to a valid compressed file on a record-oriented file system, or when
the compressed data can only be read from a device in multiples of a
certain block size.
Python's gzip.GzipFile:
Calling a GzipFile object’s close() method does not close fileobj, since you might wish to append more material after the compressed data. This also allows you to pass a StringIO object opened for writing as fileobj, and retrieve the resulting memory buffer using the StringIO object’s getvalue() method.
Python's zlib documentation, on the unused_data attribute of decompression objects:
A string which contains any bytes past the end of the compressed data. That is, this remains "" until the last byte that contains compression data is available. If the whole string turned out to contain compressed data, this is "", the empty string.
The only way to determine where a string of compressed data ends is by actually decompressing it. This means that when compressed data is contained part of a larger file, you can only find the end of it by reading data and feeding it followed by some non-empty string into a decompression object’s decompress() method until the unused_data attribute is no longer the empty string.
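To make the quoted unused_data behaviour concrete, here is a minimal sketch. It compresses a payload with zlib, appends some trailing bytes, and feeds the result to a decompression object. The payload and the trailer are made-up values for illustration only.

```python
import zlib

payload = b"instrument reading 42\n" * 3
trailer = b"EXTRA-BYTES"          # hypothetical junk appended after the stream

blob = zlib.compress(payload) + trailer

decompressor = zlib.decompressobj()
data = decompressor.decompress(blob)

print(data == payload)            # the compressed part decodes normally
print(decompressor.unused_data)   # everything past the end of the stream
print(decompressor.eof)           # set once the end-of-stream marker was seen
```

The point is that the decompression object stops at the end of the compressed stream and hands back the remainder instead of raising.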
Here are the four approaches I tried. (These examples are Python 3.1, but I tested 2.5 and 2.7 and hit the same problem.)

```python
import gzip
import zlib

# approach 1 - gzip.open
with gzip.open(filename) as datafile:
    data = datafile.read()

# approach 2 - gzip.GzipFile
with open(filename, "rb") as gzipfile:
    with gzip.GzipFile(fileobj=gzipfile) as datafile:
        data = datafile.read()

# approach 3 - zlib.decompress
with open(filename, "rb") as gzipfile:
    data = zlib.decompress(gzipfile.read()[10:])

# approach 4 - zlib.decompressobj
with open(filename, "rb") as gzipfile:
    decompressor = zlib.decompressobj()
    data = decompressor.decompress(gzipfile.read()[10:])
```

Am I doing something wrong?
UPDATE
OK, although ...
After digging into ...
```python
import zlib

# approach 5 - zlib.decompress with negative wbits
with open(filename, "rb") as gzipfile:
    data = zlib.decompress(gzipfile.read()[10:], -zlib.MAX_WBITS)

# approach 6 - zlib.decompressobj with negative wbits
with open(filename, "rb") as gzipfile:
    decompressor = zlib.decompressobj(-zlib.MAX_WBITS)
    data = decompressor.decompress(gzipfile.read()[10:])
```
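One caveat with slicing off exactly 10 bytes: that is only the fixed part of the gzip header. RFC 1952 allows optional fields (extra data, file name, comment, header CRC) after it. The sketch below computes the real header length before handing the rest to a raw-deflate decompressor. The names `gzip_header_length` and `inflate_first_member` and the file name `sample.txt` are hypothetical, chosen for this example.

```python
import gzip
import io
import struct
import zlib

FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10   # RFC 1952 flag bits

def gzip_header_length(data: bytes) -> int:
    """Return the number of header bytes at the start of a gzip member."""
    if data[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    if data[2] != 8:
        raise ValueError("unknown compression method")
    flags = data[3]
    pos = 10                                   # fixed part of the header
    if flags & FEXTRA:
        (xlen,) = struct.unpack_from("<H", data, pos)
        pos += 2 + xlen
    if flags & FNAME:
        pos = data.index(b"\x00", pos) + 1     # zero-terminated file name
    if flags & FCOMMENT:
        pos = data.index(b"\x00", pos) + 1     # zero-terminated comment
    if flags & FHCRC:
        pos += 2                               # CRC16 of the header
    return pos

def inflate_first_member(data: bytes):
    """Decompress the first gzip member; return (payload, leftover bytes).

    The leftover starts with the 8-byte gzip trailer (CRC32 and ISIZE),
    because a raw-deflate decompressor does not consume it.
    """
    decompressor = zlib.decompressobj(-zlib.MAX_WBITS)   # raw deflate
    payload = decompressor.decompress(data[gzip_header_length(data):])
    return payload, decompressor.unused_data

# Usage: a member whose header carries a file name, followed by extra bytes.
buffer = io.BytesIO()
with gzip.GzipFile(filename="sample.txt", mode="wb", fileobj=buffer) as member:
    member.write(b"hello")
blob = buffer.getvalue() + b"JUNK"

payload, leftover = inflate_first_member(blob)
print(gzip_header_length(blob))   # longer than 10 because of the file name
print(payload)
print(leftover[8:])               # skip the trailer to reach the extra data
```

This sketch does not verify the CRC32 in the trailer. Passing `zlib.MAX_WBITS | 16` instead lets zlib parse the header and check the trailer itself.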
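This is a bug. The quality of the gzip module in Python falls far short of what should be required of the Python standard library.
The problem here is that the gzip module assumes the file is a stream of gzip-format files. At the end of the compressed data it starts over, expecting a new gzip header; if it does not find one, it raises an exception. This is wrong.
Of course, concatenating two gzip files is valid, for example: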
```sh
echo testing > test.txt
gzip test.txt
cat test.txt.gz test.txt.gz > test2.txt.gz
zcat test2.txt.gz
# testing
# testing
```
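The same two cases can be reproduced from Python. This is a small sketch, assuming a CPython 3 interpreter where `gzip.compress` and `gzip.decompress` are available; the trailing bytes are made up for the example.

```python
import gzip

member = gzip.compress(b"testing\n")

# Two concatenated members decode as one stream, just like zcat.
print(gzip.decompress(member + member))

# A member followed by bytes that are not a gzip header is rejected.
try:
    gzip.decompress(member + b"EXTRA-BYTES")
except OSError as exc:      # BadGzipFile (Python 3.8+) subclasses OSError
    print("gzip refused the trailing data:", exc)
```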
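The gzip module's bug is that it should not raise an exception when there is no gzip header the second time around; it should simply end the file. It should only raise an exception when there is no header the first time.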
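The behaviour described above can be approximated without touching the gzip module by driving zlib directly. The following is only a sketch of that idea, not a patch for the standard library; the function name `decompress_ignoring_trailer` is hypothetical.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"

def decompress_ignoring_trailer(blob: bytes):
    """Decode all leading gzip members; return (payload, trailing bytes)."""
    if blob[:2] != GZIP_MAGIC:
        raise ValueError("not a gzip stream")   # no header the first time
    chunks = []
    rest = blob
    # Keep going only while the remaining bytes start with a gzip header.
    while rest[:2] == GZIP_MAGIC:
        decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)  # gzip wrapper
        chunks.append(decompressor.decompress(rest))
        if not decompressor.eof:
            raise EOFError("compressed data ended before the end-of-stream marker")
        rest = decompressor.unused_data
    return b"".join(chunks), rest

member = gzip.compress(b"testing\n")
payload, extra = decompress_ignoring_trailer(member + member + b"EXTRA-BYTES")
print(payload)
print(extra)
```

One limitation: trailing data that happens to begin with the gzip magic bytes but is not a valid member still raises `zlib.error` here.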
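There is no clean workaround short of modifying the gzip module directly; if you want to do that, look at ...
There are other bugs in this module. For example, it seeks unnecessarily, which makes it fail on non-seekable streams such as network sockets. This leaves me with very little confidence in the module: a developer who does not know that gzip needs to work without seeking is badly unqualified to implement it for the Python standard library.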
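For streams that cannot seek, one option is to feed chunks to a zlib decompression object, which only ever needs `read()`. The sketch below assumes a single gzip member; `iter_gunzip` and `ReadOnlyStream` are hypothetical names, and `ReadOnlyStream` merely stands in for a socket-like object.

```python
import gzip
import io
import zlib

def iter_gunzip(stream, chunk_size=64 * 1024):
    """Yield decompressed chunks from a stream that only supports read()."""
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)   # gzip wrapper
    while not decompressor.eof:
        chunk = stream.read(chunk_size)
        if not chunk:
            raise EOFError("stream ended before the gzip member was complete")
        data = decompressor.decompress(chunk)
        if data:
            yield data
    # Bytes read past the end of the member stay in decompressor.unused_data.

class ReadOnlyStream:
    """Stand-in for a socket-like object: read() only, no seek() or tell()."""
    def __init__(self, data: bytes):
        self._buffer = io.BytesIO(data)

    def read(self, size: int) -> bytes:
        return self._buffer.read(size)

source = ReadOnlyStream(gzip.compress(b"testing\n" * 1000) + b"EXTRA-BYTES")
result = b"".join(iter_gunzip(source, chunk_size=256))
print(len(result))
```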
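I have run into a similar problem in the past. I wrote a new module that works better with streams. You can try it and see whether it works for you.
I ran into exactly this problem, but none of these answers solved it for me. So here is what I did to solve it: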
```python
import zlib

# for gzip files
unzipped = zlib.decompress(gzip_data, zlib.MAX_WBITS | 16)

# for zlib files
unzipped = zlib.decompress(gzip_data, zlib.MAX_WBITS)

# automatic header detection (zlib or gzip)
unzipped = zlib.decompress(gzip_data, zlib.MAX_WBITS | 32)
```
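A quick way to see the difference between the three wbits values is to run them against data whose wrapper is known. This sketch generates its own gzip and zlib samples, so the payload is an invented value.

```python
import gzip
import zlib

payload = b"testing\n"
gzip_blob = gzip.compress(payload)   # gzip header and trailer
zlib_blob = zlib.compress(payload)   # zlib header and trailer

assert zlib.decompress(gzip_blob, zlib.MAX_WBITS | 16) == payload   # gzip only
assert zlib.decompress(zlib_blob, zlib.MAX_WBITS) == payload        # zlib only
assert zlib.decompress(gzip_blob, zlib.MAX_WBITS | 32) == payload   # auto-detect
assert zlib.decompress(zlib_blob, zlib.MAX_WBITS | 32) == payload   # auto-detect

# Using the wrong wrapper is an error rather than a silent fallback.
try:
    zlib.decompress(gzip_blob, zlib.MAX_WBITS)
except zlib.error as exc:
    print("zlib wrapper expected:", exc)
```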
Depending on your situation, you may need to decode the data, for example:

```python
unzipped = unzipped.decode()
```
https://docs.python.org/3/library/zlib.html
I could not get the techniques above to work, so I made a workaround using the zipfile package:

```python
import zipfile
from io import BytesIO

mock_file = BytesIO(data)  # data is the compressed string
z = zipfile.ZipFile(file=mock_file)
neat_data = z.read(z.namelist()[0])
```
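Note that zipfile reads ZIP archives, which are a different container format from gzip, so this workaround presumably applies when the data is really a ZIP archive. A hedged sketch of telling the two apart by their leading bytes follows; the helper name `unpack` and the member name `reading.txt` are hypothetical.

```python
import gzip
import io
import zipfile
import zlib

def unpack(blob: bytes) -> bytes:
    """Pick a decoder from the leading magic bytes (illustrative helper)."""
    if blob[:2] == b"\x1f\x8b":                      # gzip member
        # Trailing bytes after the member end up in unused_data and are dropped.
        return zlib.decompressobj(zlib.MAX_WBITS | 16).decompress(blob)
    if zipfile.is_zipfile(io.BytesIO(blob)):         # ZIP archive
        with zipfile.ZipFile(io.BytesIO(blob)) as archive:
            return archive.read(archive.namelist()[0])
    raise ValueError("neither gzip nor ZIP data")

# Build a small ZIP archive in memory to exercise the second branch.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as archive:
    archive.writestr("reading.txt", b"testing\n")

print(unpack(zip_buffer.getvalue()))
print(unpack(gzip.compress(b"testing\n") + b"EXTRA-BYTES"))
```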
It works perfectly.