wader/fq

gzip files can contain multiple concatenated gzips

TomiBelan opened this issue · 8 comments

What version are you using (fq -v)?

$ fq -v
0.8.0 (linux amd64)

How was fq installed?

go run

Can you reproduce the problem using the latest release or master branch?

Yes

What did you do?

$ printf aaaaaaaaaa | gzip > test.gz
$ printf bbbbbbbbbb | gzip >> test.gz
$ zcat test.gz; echo .
aaaaaaaaaabbbbbbbbbb.
$ go run github.com/wader/fq@master dd test.gz
go: downloading github.com/wader/fq v0.8.1-0.20231020164445-1a3823f1877b
     |00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f|0123456789abcdef|.{}: test.gz (gzip)
0x000|1f 8b                                          |..              |  identification: raw bits (valid)
0x000|      08                                       |  .             |  compression_method: "deflate" (8)
     |                                               |                |  flags{}:
0x000|         00                                    |   .            |    text: false
0x000|         00                                    |   .            |    header_crc: false
0x000|         00                                    |   .            |    extra: false
0x000|         00                                    |   .            |    name: false
0x000|         00                                    |   .            |    comment: false
0x000|         00                                    |   .            |    reserved: 0
0x000|            00 00 00 00                        |    ....        |  mtime: 0 (1970-01-01T00:00:00Z)
0x000|                        00                     |        .       |  extra_flags: 0
0x000|                           03                  |         .      |  os: "unix" (3)
     |00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f|0123456789abcdef|
  0x0|61 61 61 61 61 61 61 61 61 61|                 |aaaaaaaaaa|     |  uncompressed: raw bits
0x000|                              4b 4c 84 01 00   |          KL... |  compressed: raw bits
0x000|                                             f0|               .|  crc32: 0x4c11cdf0 (valid)
0x010|cd 11 4c                                       |..L             |
0x010|         0a 00 00 00                           |   ....         |  isize: 10
0x010|                     1f 8b 08 00 00 00 00 00 00|       .........|  gap0: raw bits
0x020|03 4b 4a 82 01 00 f8 4c 2f 42 0a 00 00 00|     |.KJ....L/B....| |

What result did you expect?

The top level type should not be an object with "identification", "compression_method" etc., but an array of such objects.

A gzip file consists of a series of “members” (compressed data sets). The format of each member is specified in the following section. The members simply appear one after another in the file, with no additional information before, between, or after them. RFC 1952

What did you see instead?

The "bbbbbbbbbb" member is shown as gap0 and not parsed.
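
For anyone who wants to check the RFC 1952 member layout outside fq, here is a minimal Python sketch using only the standard library. It rebuilds a two-member blob similar to the test.gz above and shows that a conforming decoder returns the concatenation. The exact bytes may differ from what the gzip CLI writes, since header fields such as os can vary, and the variable names are only for illustration.

```python
import gzip

# Two independent gzip members, like the two `gzip` invocations above.
# mtime=0 only makes the header bytes reproducible.
first = gzip.compress(b"aaaaaaaaaa", mtime=0)
second = gzip.compress(b"bbbbbbbbbb", mtime=0)

# Concatenating members yields a valid multi-member gzip file.
blob = first + second

# gzip.decompress handles multi-member data and returns the joined stream.
joined = gzip.decompress(blob)

# The second member starts right where the first ends, with its own magic
# bytes (1f 8b), which is the region fq reported as gap0.
second_magic = blob[len(first):len(first) + 2]
```

The point is that nothing sits between the members: the second header begins immediately after the first member's isize field.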

wader commented

Huh, did not know, that's interesting. I wonder if this is the same as or similar to using inflate/deflate flush to encode boundaries. I ran into this for TLS compression, but in that case there is no header for the trailing inflates.

Can come up with three ways to model this:

  • Root is always an array. Maybe inconvenient?
  • Root can optionally be an array. Currently not possible API-wise.
  • Add a trailing array field (or similar) holding the trailing gzips
  • Something else?

I think "root is always an array" most precisely models the underlying format.

wader commented

Have a look at #794. I think I agree: always an array is probably best.

wader commented

Yep, some of the text tests were wrong, fixed, thanks.

I wonder if it's bad that we won't provide the full concatenated uncompressed stream somehow? Also, the nested decoding should happen on the concatenation and not on each member's uncompressed data. So maybe the root should instead be a struct with a members array and an uncompressed raw bytes field?

I didn't realize fq performs nested decoding. I'm not sure what to do. In most cases it might be better to have "a struct with a members array and a uncompressed raw bytes". But today I was analyzing a corrupted gz file where zcat said CRC and size is wrong, and fq helped me to discover only the last member is corrupted and find out why. It was useful to see uncompressed of each member and check they're fine. But I know this is an unusual situation.

I don't have a strong preference. I feel multi-member gz files are rare in practice, so either way is a decent choice.

Just for fun: This is how I used fq to analyze it. That was before I filed this issue, so I had to use gap0.

rm -f part* after*; cp original_input.gz after0.gz; i=0; while true; do o=$(./fq '.gap0|tobytesrange.start' after$i.gz) || break; [[ -z $o ]] && break; head -c$o after$i.gz > part$((i+1)).gz; tail -c+$((o+1)) after$i.gz > after$((i+1)).gz; ((i++)); done

wader commented

> I didn't realize fq performs nested decoding. I'm not sure what to do. In most cases it might be better to have "a struct with a members array and a uncompressed raw bytes". But today I was analyzing a corrupted gz file where zcat said CRC and size is wrong, and fq helped me to discover only the last member is corrupted and find out why. It was useful to see uncompressed of each member and check they're fine. But I know this is an unusual situation.

Yes, it does nested decoding by default, sometimes with options to disable it. This was added early on, as fq's roots are in debugging media containers and codecs, where it's common to have lots of nested subformats and muxers that slice up packets in various ways.

About each member's uncompressed data: in the PR I have now modelled it so that you have access to both each member's uncompressed data and a concatenation of them all.
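
To make that shape concrete, here is a rough Python sketch of the same idea: a list of members, each with its own uncompressed bytes, plus the concatenation of them all. This is not how fq implements it; `split_members` and the dictionary keys are made-up names, and it relies on `zlib.decompressobj(wbits=31)` stopping at the end of one gzip member and leaving the rest in `unused_data`.

```python
import gzip
import zlib


def split_members(blob: bytes) -> list:
    """Split a multi-member gzip blob into per-member records (sketch)."""
    members = []
    offset = 0
    while offset < len(blob):
        # wbits=31 means: gzip header and trailer, maximum window size.
        d = zlib.decompressobj(wbits=31)
        # Raises zlib.error if the member's CRC or size check fails.
        data = d.decompress(blob[offset:])
        if not d.eof:
            raise ValueError(f"truncated member at offset {offset}")
        # Bytes past the end of this member are kept in unused_data.
        size = len(blob) - offset - len(d.unused_data)
        members.append({"offset": offset, "size": size, "uncompressed": data})
        offset += size
    return members


first = gzip.compress(b"aaaaaaaaaa", mtime=0)
blob = first + gzip.compress(b"bbbbbbbbbb", mtime=0)

members = split_members(blob)
# Per-member view and whole-stream view, side by side.
uncompressed = b"".join(m["uncompressed"] for m in members)
```

Nested decoding would then run on `uncompressed`, while each entry in `members` stays available for inspection.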

> I don't have a strong preference. I feel multi-member gz files are rare in practice, so either way is a decent choice.

I think it makes sense, kind of the point of fq is to not hide details :)

Now I actually remember that Alpine packages use concatenated gzips.

> Just for fun: This is how I used fq to analyze it. That was before I filed this issue, so I had to use gap0.
>
> rm -f part* after*; cp original_input.gz after0.gz; i=0; while true; do o=$(./fq '.gap0|tobytesrange.start' after$i.gz) || break; [[ -z $o ]] && break; head -c$o after$i.gz > part$((i+1)).gz; tail -c+$((o+1)) after$i.gz > after$((i+1)).gz; ((i++)); done

Nice! You wanted to output each uncompressed member to a file? What was the o+1 thing, skipping one byte from the start of gap0?

fq is not great at outputting multiple files atm, and I'm not sure how it could be done without adding messy IO functions, hmm. But I have used a hack involving tar. So something like this:

Copy the to_tar snippet from https://github.com/wader/fq/wiki/snippets and put it in tar.jq, then do:

# -L . adds cwd to include path
# use include "tar" to include tar.jq
# iterate .members as {key: ..., value: ...} objects, as it's an array key will be 0,1,2,... and value the member itself
# to_tar(f) takes a function f as arg that outputs {filename: ..., data: ...} objects
$ fq -L . 'include "tar"; to_tar(.members | to_entries[] | {filename: "part\(.key)", data: .value.uncompressed})' format/gzip/testdata/multi_members.gz | tar tv
-rw-r--r--  0 user   group      11 Jan  1  1970 part0
-rw-r--r--  0 user   group      10 Jan  1  1970 part1

> Nice! You wanted to output each uncompressed member to a file? What was the o+1 thing, skipping one byte from the start of gap0?

Right, I wanted to output each compressed member to a file, so I can look at them with zcat/fq/hexdump. $((o+1)) is just because tail counts from 1, e.g. "tail -c+9" discards first 8 bytes and starts printing from the 9th byte.
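
The off-by-one is easier to see with zero-based slicing. A tiny Python sketch of the same split follows; `split_at` is a made-up helper, with `o` playing the role of the gap0 start offset from the loop above.

```python
def split_at(data: bytes, o: int) -> tuple:
    """Split data at byte offset o, mirroring the head/tail pair."""
    # head -c o keeps the first o bytes.
    head = data[:o]
    # tail -c +(o+1) starts at the (o+1)-th byte counting from 1,
    # which is index o when counting from 0.
    tail = data[o:]
    return head, tail


# Same numbers as the "tail -c+9" example: o = 8 discards the first 8 bytes.
head, tail = split_at(b"0123456789abc", 8)
```

So head and tail together always cover the whole input with no byte dropped or repeated.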

Interesting tar snippet. To be honest I don't really like or understand the jq language, but maybe I'll learn one day.

By the way just for fun, this is not related to fq, but I solved the mystery of the corrupted gz file I mentioned: The uncompressed data looks OK and the footer is present, but the footer CRC and isize are wrong. What could've caused that?
It is generated by a Python program which opens it as with gzip.open(filename, "at") as f:. The solution is that it got a KeyboardInterrupt exception just after executing this line. The compressed data was written, but self.crc and self.size weren't updated. The with: statement called the close() method and wrote a gzip footer, but not the correct values.
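
For reference, this kind of footer mismatch can be checked by hand. The Python sketch below does not reproduce the interrupted writer; it only simulates a bad footer by overwriting the last 8 bytes of a good member, then inflates the raw deflate data and compares the stored crc32 and isize with recomputed values. `check_trailer` is a made-up name, and the sketch assumes a header with no optional fields (FLG == 0), like the one in the fq dump above.

```python
import gzip
import struct
import zlib


def check_trailer(member: bytes) -> dict:
    """Compare a gzip member's stored crc32/isize with recomputed values."""
    if member[:2] != b"\x1f\x8b" or member[3] != 0:
        raise ValueError("sketch only handles plain headers (FLG == 0)")
    # Skip the fixed 10-byte header and inflate the raw deflate stream;
    # wbits=-15 means no zlib/gzip framing, so no CRC check is enforced.
    d = zlib.decompressobj(wbits=-15)
    data = d.decompress(member[10:])
    # What follows the deflate stream is the 8-byte trailer.
    trailer = d.unused_data[:8]
    if len(trailer) < 8:
        raise ValueError("missing or truncated trailer")
    stored_crc, stored_isize = struct.unpack("<II", trailer)
    return {
        "uncompressed": data,
        "crc_ok": stored_crc == zlib.crc32(data),
        # isize is the uncompressed length modulo 2**32.
        "isize_ok": stored_isize == (len(data) & 0xFFFFFFFF),
    }


good = gzip.compress(b"aaaaaaaaaa", mtime=0)
# Simulated damage: keep the data, replace crc32 and isize with zeros.
bad = good[:-8] + struct.pack("<II", 0, 0)

good_result = check_trailer(good)
bad_result = check_trailer(bad)
```

With the damaged member the data still inflates fine and only the two footer checks fail, which matches the symptom described above.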

wader commented

> Interesting tar snippet. To be honest I don't really like or understand the jq language, but maybe I'll learn one day.

I can relate, and it took quite a while to get my head around it; now I love it. I think it fits really well for what I at least use fq for: doing lots of ad hoc queries to dig and poke around in half-broken and strange media and binary files. And I hope basic jq is easy enough for ppl to use... I've also noticed ppl use fq more or less just with d and -V etc. and then pipe to grep/less or whatnot :) Whatever works.

> By the way just for fun, this is not related to fq, but I solved the mystery of the corrupted gz file I mentioned: The uncompressed data looks OK and the footer is present, but the footer CRC and isize are wrong. What could've caused that? It is generated by a Python program which opens it as with gzip.open(filename, "at") as f:. The solution is that it got a KeyboardInterrupt exception just after executing this line. The compressed data was written, but self.crc and self.size weren't updated. The with: statement called the close() method and wrote a gzip footer, but not the correct values.

👍 Aha, tricky, glad you solved it! So was it just one odd gzip file, or something that happened regularly?