gzip files can contain multiple concatenated gzips
TomiBelan opened this issue · 8 comments
What version are you using (`fq -v`)?
$ fq -v
0.8.0 (linux amd64)
How was fq installed?
go run
Can you reproduce the problem using the latest release or master branch?
Yes
What did you do?
$ printf aaaaaaaaaa | gzip > test.gz
$ printf bbbbbbbbbb | gzip >> test.gz
$ zcat test.gz; echo .
aaaaaaaaaabbbbbbbbbb.
$ go run github.com/wader/fq@master dd test.gz
go: downloading github.com/wader/fq v0.8.1-0.20231020164445-1a3823f1877b
|00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f|0123456789abcdef|.{}: test.gz (gzip)
0x000|1f 8b |.. | identification: raw bits (valid)
0x000| 08 | . | compression_method: "deflate" (8)
| | | flags{}:
0x000| 00 | . | text: false
0x000| 00 | . | header_crc: false
0x000| 00 | . | extra: false
0x000| 00 | . | name: false
0x000| 00 | . | comment: false
0x000| 00 | . | reserved: 0
0x000| 00 00 00 00 | .... | mtime: 0 (1970-01-01T00:00:00Z)
0x000| 00 | . | extra_flags: 0
0x000| 03 | . | os: "unix" (3)
|00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f|0123456789abcdef|
0x0|61 61 61 61 61 61 61 61 61 61| |aaaaaaaaaa| | uncompressed: raw bits
0x000| 4b 4c 84 01 00 | KL... | compressed: raw bits
0x000| f0| .| crc32: 0x4c11cdf0 (valid)
0x010|cd 11 4c |..L |
0x010| 0a 00 00 00 | .... | isize: 10
0x010| 1f 8b 08 00 00 00 00 00 00| .........| gap0: raw bits
0x020|03 4b 4a 82 01 00 f8 4c 2f 42 0a 00 00 00| |.KJ....L/B....| |
What result did you expect?
The top level type should not be an object with "identification", "compression_method" etc., but an array of such objects.
> A gzip file consists of a series of “members” (compressed data sets). The format of each member is specified in the following section. The members simply appear one after another in the file, with no additional information before, between, or after them. (RFC 1952)
What did you see instead?
The "bbbbbbbbbb" member is shown as gap0 and not parsed.
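For comparison, the member-by-member layout that RFC 1952 describes can be walked outside fq. This is a minimal Python sketch, not how fq implements it: it relies on the standard library `zlib.decompressobj` with `wbits=31` (gzip framing), which stops at the end of one member and leaves the rest of the input in `unused_data`. The function name `split_members` is made up for illustration.

```python
import gzip
import zlib


def split_members(data: bytes) -> list[tuple[bytes, bytes]]:
    """Return (compressed_member, uncompressed) pairs, one per gzip member."""
    members = []
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        d = zlib.decompressobj(wbits=31)
        uncompressed = d.decompress(data) + d.flush()
        if not d.eof:
            raise ValueError("truncated gzip member")
        # Everything zlib did not consume belongs to the following members.
        consumed = len(data) - len(d.unused_data)
        members.append((data[:consumed], uncompressed))
        data = d.unused_data
    return members


# Same shape as the test.gz built in the report: two members back to back.
blob = gzip.compress(b"a" * 10) + gzip.compress(b"b" * 10)
for index, (raw, plain) in enumerate(split_members(blob)):
    print(index, len(raw), plain)
```

Returning a list of members mirrors the "array of objects" shape requested here; joining the second element of each pair gives the stream that `zcat` prints.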
Huh, did not know that, that's interesting. I wonder if this is the same as or similar to using inflate/deflate flush to encode boundaries. I ran into this for TLS compression, but in that case there is no header for the trailing inflates.
Can come up with three ways to model this:
- Root is always an array. Maybe inconvenient?
- Root can optionally be an array. Currently not possible API-wise.
- Add a trailing field (an array or similar) with the trailing gzips
- Something else?
I think "root is always an array" most precisely models the underlying format.
Yep, some of the text in the tests was wrong. Fixed, thanks.
I wonder if it's bad that we won't provide the full concatenated uncompressed stream somehow? Also, the nested decoding should happen on the concatenation and not on each member's uncompressed data. So maybe the root should instead be a struct with a `members` array and an `uncompressed` raw bytes?
I didn't realize fq performs nested decoding. I'm not sure what to do. In most cases it might be better to have "a struct with a `members` array and an `uncompressed` raw bytes". But today I was analyzing a corrupted gz file where `zcat` said the CRC and size are wrong, and fq helped me discover that only the last member is corrupted and find out why. It was useful to see the `uncompressed` of each member and check they're fine. But I know this is an unusual situation.
I don't have a strong preference. I feel multi-member gz files are rare in practice, so either way is a decent choice.
Just for fun: This is how I used fq to analyze it. That was before I filed this issue, so I had to use gap0.
rm -f part* after*; cp original_input.gz after0.gz; i=0; while true; do o=$(./fq '.gap0|tobytesrange.start' after$i.gz) || break; [[ -z $o ]] && break; head -c$o after$i.gz > part$((i+1)).gz; tail -c+$((o+1)) after$i.gz > after$((i+1)).gz; ((i++)); done
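The per-member check described above (finding which member has a bad CRC or size) can also be sketched in Python. This is a rough sketch under assumptions, not what fq or `zcat` do internally: `skip_header` and `check_members` are hypothetical helper names, the header parser follows the RFC 1952 field order, and each member is inflated as raw deflate (`wbits=-15`) so that a bad trailer is reported instead of raising an error.

```python
import gzip
import struct
import zlib

FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10


def skip_header(data: bytes, pos: int) -> int:
    """Return the offset of the deflate stream of the member starting at pos."""
    if data[pos:pos + 2] != b"\x1f\x8b":
        raise ValueError(f"no gzip magic at offset {pos}")
    flags = data[pos + 3]
    pos += 10  # fixed part: magic, method, flags, mtime, extra flags, os
    if flags & FEXTRA:
        (xlen,) = struct.unpack_from("<H", data, pos)
        pos += 2 + xlen
    if flags & FNAME:
        pos = data.index(b"\x00", pos) + 1  # zero-terminated name
    if flags & FCOMMENT:
        pos = data.index(b"\x00", pos) + 1  # zero-terminated comment
    if flags & FHCRC:
        pos += 2
    return pos


def check_members(data: bytes):
    """Yield (offset, uncompressed, crc_ok, isize_ok) for each member."""
    pos = 0
    while pos < len(data):
        start = pos
        body = skip_header(data, pos)
        d = zlib.decompressobj(wbits=-15)  # raw deflate, trailer not checked
        plain = d.decompress(data[body:])
        if not d.eof or len(d.unused_data) < 8:
            raise ValueError(f"truncated member at offset {start}")
        # Trailer: CRC32 then ISIZE, both little-endian 32-bit.
        crc, isize = struct.unpack("<II", d.unused_data[:8])
        yield (start, plain,
               crc == zlib.crc32(plain),
               isize == (len(plain) & 0xFFFFFFFF))
        pos = len(data) - len(d.unused_data) + 8


good = gzip.compress(b"a" * 10)
bad = bytearray(gzip.compress(b"b" * 10))
bad[-8] ^= 0xFF  # flip one byte of the stored CRC32 in the last member
for offset, plain, crc_ok, isize_ok in check_members(good + bytes(bad)):
    print(offset, plain, crc_ok, isize_ok)
```

Inflating without the gzip framing is what makes it possible to see that the data of a member is fine even when its trailer is not.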
> I didn't realize fq performs nested decoding. I'm not sure what to do. In most cases it might be better to have "a struct with a `members` array and an `uncompressed` raw bytes". But today I was analyzing a corrupted gz file where `zcat` said the CRC and size are wrong, and fq helped me discover that only the last member is corrupted and find out why. It was useful to see the `uncompressed` of each member and check they're fine. But I know this is an unusual situation.
Yes, it does nested decoding by default, sometimes with options to disable it. This was added early on, as fq's roots are in debugging media containers and codecs, where it's common to have lots of nested subformats and muxers that slice up packets in various ways.
About each member's uncompressed data: in the PR I have now modelled it so that you have access to both each member's uncompressed data and a concatenation of them all.
> I don't have a strong preference. I feel multi-member gz files are rare in practice, so either way is a decent choice.
I think it makes sense, kind of the point of fq is to not hide details :)
Now I actually remember that Alpine packages use concatenated gzips.
> Just for fun: This is how I used fq to analyze it. That was before I filed this issue, so I had to use gap0.
> rm -f part* after*; cp original_input.gz after0.gz; i=0; while true; do o=$(./fq '.gap0|tobytesrange.start' after$i.gz) || break; [[ -z $o ]] && break; head -c$o after$i.gz > part$((i+1)).gz; tail -c+$((o+1)) after$i.gz > after$((i+1)).gz; ((i++)); done
Nice! You wanted to output each uncompressed to a file? What was the o+1 thing, skipping one byte from the gap0 start?
fq is not great for outputting multiple files at the moment, and I'm not sure how it could be done without adding messy IO functions, hmm. But I have used a hack using tar. So something like this:
Copy the `to_tar` snippet from https://github.com/wader/fq/wiki/snippets and put it in `tar.jq`, then do:
# -L . adds cwd to include path
# use include "tar" to include tar.jq
# iterate .members as {key: ..., value: ...} objects, as it's an array key will be 0,1,2,... and value the member itself
# to_tar(f) takes a function f as arg that outputs {filename: ..., data: ...} objects
$ fq -L . 'include "tar"; to_tar(.members | to_entries[] | {filename: "part\(.key)", data: .value.uncompressed})' format/gzip/testdata/multi_members.gz | tar tv
-rw-r--r-- 0 user group 11 Jan 1 1970 part0
-rw-r--r-- 0 user group 10 Jan 1 1970 part1
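The same packing idea can be tried without fq or jq. The following is only an illustrative Python sketch using the standard `zlib` and `tarfile` modules; `members_to_tar` is a hypothetical name, and the input here is two made-up members rather than the `multi_members.gz` test file.

```python
import gzip
import io
import tarfile
import zlib


def members_to_tar(data: bytes) -> bytes:
    """Pack each member's uncompressed data into a tar as part0, part1, ..."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w") as tar:
        index = 0
        while data:
            d = zlib.decompressobj(wbits=31)  # gzip framing, one member
            plain = d.decompress(data) + d.flush()
            if not d.eof:
                raise ValueError("truncated gzip member")
            info = tarfile.TarInfo(name=f"part{index}")
            info.size = len(plain)
            tar.addfile(info, io.BytesIO(plain))
            data = d.unused_data  # continue with the next member
            index += 1
    return out.getvalue()


archive = members_to_tar(gzip.compress(b"a" * 11) + gzip.compress(b"b" * 10))
with tarfile.open(fileobj=io.BytesIO(archive)) as tar:
    for entry in tar:
        print(entry.name, entry.size)
# prints:
# part0 11
# part1 10
```

Writing the returned bytes to a file gives an archive that `tar tv` can list in the same way as above.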
> Nice! You wanted to output each uncompressed to a file? What was the o+1 thing, skipping one byte from the gap0 start?
Right, I wanted to output each compressed member to a file, so I can look at them with zcat/fq/hexdump. $((o+1)) is just because tail counts from 1, e.g. "tail -c+9" discards first 8 bytes and starts printing from the 9th byte.
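The off-by-one between a zero-based byte offset and `tail`'s one-based count can be checked with plain slicing. A tiny sketch with made-up data, where `o` stands in for the offset that fq reported:

```python
data = bytes(range(20))
o = 8  # hypothetical zero-based offset where the next member starts

head = data[:o]  # like `head -c8`: the first 8 bytes
tail = data[o:]  # like `tail -c+9`: everything from the 9th byte on

assert head + tail == data
print(tail[0])  # prints 8: the 9th byte when counting from 1
```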
Interesting tar snippet. To be honest I don't really like or understand the jq language, but maybe I'll learn one day.
By the way just for fun, this is not related to fq, but I solved the mystery of the corrupted gz file I mentioned: The uncompressed data looks OK and the footer is present, but the footer CRC and isize are wrong. What could've caused that?
It is generated by a Python program which opens it as `with gzip.open(filename, "at") as f:`. The solution is that it got a KeyboardInterrupt exception just after executing this line. The compressed data was written, but self.crc and self.size weren't updated. The `with:` statement called the close() method and wrote a gzip footer, but not the correct values.
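The resulting kind of file can be imitated without the original program. The sketch below does not reproduce the interrupt itself. It hand-builds a member whose deflate data is complete but whose trailer holds stale values (zeros here, an assumption standing in for whatever the real file contained), then shows that the data still inflates while the trailer check fails. `build_member` is a hypothetical helper.

```python
import struct
import zlib

# Minimal header as in the dump above: magic, deflate, no flags,
# mtime 0, extra flags 0, os "unix" (3).
HEADER = b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03"


def build_member(payload: bytes, crc: int, isize: int) -> bytes:
    """Build a gzip member with caller-supplied trailer values."""
    c = zlib.compressobj(wbits=-15)  # raw deflate, no framing
    body = c.compress(payload) + c.flush()
    return HEADER + body + struct.pack("<II", crc, isize)


payload = b"b" * 10
ok = build_member(payload, zlib.crc32(payload), len(payload))
stale = build_member(payload, 0, 0)  # trailer values never updated

# A correct trailer passes the gzip-aware check.
assert zlib.decompress(ok, wbits=31) == payload

# The deflate data in the stale member is intact...
raw = zlib.decompressobj(wbits=-15)
assert raw.decompress(stale[len(HEADER):]) == payload

# ...but a gzip-aware reader rejects the trailer.
try:
    zlib.decompress(stale, wbits=31)
except zlib.error as exc:
    print("rejected:", exc)
```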
> Interesting tar snippet. To be honest I don't really like or understand the jq language, but maybe I'll learn one day.
I can relate, and it took quite a while to get my head around it; now I love it. But I think it fits very well for what I at least use fq for: doing lots of ad hoc queries to dig and poke around in half-broken and strange media and binary files. And I hope basic jq is easy enough for people to use... I've also noticed people use fq more or less just with `d` and `-V` etc. and then pipe to grep/less or whatnot :) whatever works
> By the way just for fun, this is not related to fq, but I solved the mystery of the corrupted gz file I mentioned: The uncompressed data looks OK and the footer is present, but the footer CRC and isize are wrong. What could've caused that? It is generated by a Python program which opens it as `with gzip.open(filename, "at") as f:`. The solution is that it got a KeyboardInterrupt exception just after executing this line. The compressed data was written, but self.crc and self.size weren't updated. The `with:` statement called the close() method and wrote a gzip footer, but not the correct values.
👍 aha tricky, glad you solved it! so it was just one odd gzip file or something that happened regularly?