wader/fq

problems with zip support

0-wiz-0 opened this issue ยท 21 comments

I tried fq on zip archives, and I had two problems so far:

# fq . zip64.zip
    |00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18 19 1a 1b|0123456789abcdef0123456789ab|.{}: zip64.zip (zip)
0x00|50 4b 03 04 2d 00 00 00 00 00 4f 72 5b 40 07 a1 ea dd ff ff ff ff ff ff ff ff 01 00|PK..-.....Or[@..............|  local_files[0:1]:
0x1c|14 00 2d 01 00 10 00 02 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00 61 0a         |..-....................a.   |
0x1c|                                                                           50 4b 01|                         PK.|  central_directories[0:1]:
0x38|02 1e 03 2d 00 00 00 00 00 4f 72 5b 40 07 a1 ea dd 02 00 00 00 02 00 00 00 01 00 00|...-.....Or[@...............|
0x54|00 00 00 00 00 01 00 00 00 80 11 00 00 00 00 2d                                    |...............-            |
0x54|                                                50 4b 06 06 2c 00 00 00 00 00 00 00|                PK..,.......|  gap0: raw bits
0x70|1e 03 2d 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00|..-.........................|
*   |until 0xaf.7 (76)                                                                  |                            |
0xa8|                        50 4b 05 06 00 00 00 00 01 00 01 00 2f 00 00 00 35 00 00 00|        PK........../...5...|  end_of_central_directory_record{}:
0xc4|00 00|                                                                             |..|                         |
  • I wanted to try the interactive mode to look inside a zip inside a zip, and fq just hung completely, eating ~33GB RAM and 100% CPU:
fq -i . bigzero-zip.zip

using https://github.com/nih-at/libzip/blob/main/regress/bigzero-zip.zip

wader commented

Hey, sorry i somehow missed the notification email about this. Will have a look.

wader commented

Had quick look at zip64.zip. gap0 seems to be a "Zip64 end of central directory record" structure (signature 0x06064b50, confusengly is reverse, zip uses little endian). If i've understood the spec correctly the way to detect zip64 is that the "zip32" version of "end of central directory record" has a "offset of start of central directory ...." field that is 0xffffffff, see:

 4.4.24 offset of start of central directory with respect to
        the starting disk number:  (4 bytes)

  Offset of the start of the central directory on the
  disk on which the central directory starts. If an 
  archive is in ZIP64 format and the value in this 
  field is 0xFFFFFFFF, the size will be in the 
  corresponding 8 byte zip64 end of central 
  directory field.

In this file it seems to be 0x35 ๐Ÿค” do you know if there is some other way to detect it? or could this be some left over data etc because of how the zip writer is implemented?

$ fq -o line_bytes=8 '(.end_of_central_directory_record | d), (.gap0 | dd)' zip64.zip
    โ”‚00 01 02 03 04 05 06 07โ”‚01234567โ”‚.end_of_central_directory_record{}:
0xb0โ”‚50 4b 05 06            โ”‚PK..    โ”‚  signature: raw bits (valid)
0xb0โ”‚            00 00      โ”‚    ..  โ”‚  disk_nr: 0
0xb0โ”‚                  00 00โ”‚      ..โ”‚  central_directory_start_disk_nr: 0
0xb8โ”‚01 00                  โ”‚..      โ”‚  nr_of_central_directory_records_on_disk: 1
0xb8โ”‚      01 00            โ”‚  ..    โ”‚  nr_of_central_directory_records: 1
0xb8โ”‚            2f 00 00 00โ”‚    /...โ”‚  size_of_central_directory: 47
0xc0โ”‚35 00 00 00            โ”‚5...    โ”‚  offset_of_start_of_central_directory: 53
0xc0โ”‚            00 00โ”‚     โ”‚    ..โ”‚ โ”‚  comment_length: 0
    โ”‚                       โ”‚        โ”‚  comment: ""
    โ”‚00 01 02 03 04 05 06 07โ”‚01234567โ”‚
0x60โ”‚            50 4b 06 06โ”‚    PK..โ”‚.gap0: raw bits
0x68โ”‚2c 00 00 00 00 00 00 00โ”‚,.......โ”‚
0x70โ”‚1e 03 2d 00 00 00 00 00โ”‚..-.....โ”‚
0x78โ”‚00 00 00 00 01 00 00 00โ”‚........โ”‚
0x80โ”‚00 00 00 00 01 00 00 00โ”‚........โ”‚
0x88โ”‚00 00 00 00 2f 00 00 00โ”‚..../...โ”‚
0x90โ”‚00 00 00 00 35 00 00 00โ”‚....5...โ”‚
0x98โ”‚00 00 00 00 50 4b 06 07โ”‚....PK..โ”‚
0xa0โ”‚00 00 00 00 64 00 00 00โ”‚....d...โ”‚
0xa8โ”‚00 00 00 00 01 00 00 00โ”‚........โ”‚

You can identify a zip64 by the existence of the zip64 EOCD, which you can recognize by the 'PK\06\06'.
If any of the files is too large (even just when uncompressed), then the whole file needs to be zip64 even if none of the fields in the EOCD marker are too big. AFAIK there is no requirement to set the values in the EOCD to 0xffffffff if they fit.
Does this help?

Sorry, that was slightly wrong: You look for the Zip64 end of central directory locator (0x07064b50) directly before the standard ZIP EOCD that points you to the Zip64 EOCD and then you can parse from there.

wader commented

Yes helps, and that is actually you how the decoder works now, it heuristically seeks from the end backwards after the "zip32" EOCD signature. So should it look for both and decode but prefer the zip64 one? otherwise there might be a zip32 gap i guess?

No, in a zip64 there are both. From the appnote:

      [central directory header n]
      [zip64 end of central directory record]
      [zip64 end of central directory locator] 
      [end of central directory record]

So at the end you always have a zip(32) EOCD. If you go backwards from that and find a zip64 EOCD locator, it's a zip64 and you need to parse the zip64 EOCD (which it tells you how to find). If you find a central directory header instead, you know it's not a zip64.

wader commented

Aha i see, thanks. Will look into a fix soon, let me know if you want to have a go at it. The code looks a messy now, so could probably be cleanup up and refactored a bit also.

wader commented

Had a look at bigzero-zip.zip now also. Looks like the toml decoder is not very happy about the 4GB zeroes files. If i build without it in the probe group it succeeds decoding after some time (but is quite slow).

I did fq -d bytes '_registry.groups.probe[] as $f | $f, try decode($f) catch .' bigzero to figure which probe format that got stuck.

Will have a look at the toml code (uses github.com/BurntSushi/toml) and try figure out some way to fail toml decoding faster.

No hurry. Since I'm already maintaining a zip library, I'll leave this one up to you ;)

Interesting analysis about the toml decoder. I had thought files would be decoded more on-demand, not all in advance. Good to know!

wader commented

I was a bit fast to judge, toml is slow but finishes, it seems to actually be the xml decoder that eat a lot of memory for some reason hmm.

Currently i'm not doing any on-demand decoding, have thought about it and would be interesting to look into, but fq's decode and jq code is quite complex as it is :) so will see. Focus have mostly been on making things possible over speed and efficiency. But some format do have options to disable sub-decoding, the zip format should probably have decode_uncompressed option etc.

wader commented

#594 makes decoding bigzero-zip.zip quite a lot faster but will still use some cpu and memory as is uncompresses to memory. This will also speed up probing in many other cases. If your curious, fq does not have any special probe code instead it's up to decoder that are in the "probe" group to fail fast.

wader commented

Both issues should be resolved now. Give it a try if you can.

I noticed that there were some more (non-broken) files from the libzip regression tests that fq dont like, maybe i will have a look at those also.

Thanks, the zip64 eocd parsing looks good now.
I've tried the bigzero-zip.zip too but it doesn't look too good?

# ./fq -i . /tmp/bigzero-zip.zip                                                                                                                                                                                                                                 
error: /tmp/bigzero-zip.zip: probe: failed to decode: try fq -d FORMAT to force format, see fq -h formats for list
empty> ^D
# ./fq -d zip -i . /tmp/bigzero-zip.zip 
zip!> .
      |00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f|0123456789abcdef0123456789abcdef|.{}: /tmp/bigzero-zip.zip (zip)
      |                                                                                               |                                |  error: zip: RawLen(signature): failed at position 10489.7 (read size 0 seek pos 0): outside buffer
0x0000|50 4b 03 04 14 00 02 00 08 00 c8 78 84 45 54 81 15 ae 4e 28 00 00 b9 9a 3f 00 0b 00 1c 00 62 69|PK.........x.ET...N(....?.....bi|  gap0: raw bits
*     |until 0x28e3.7 (10468)                                                                         |                                |
0x28e0|            50 4b 05 06 00 00 00 00 01 00 01 00 51 00 00 00 93 28 00 00 00 00|                 |    PK..........Q....(....|     |  end_of_central_directory_record{}:
      |                                                                                               |                                |  end_of_central_directory_locator{}:
wader commented

Oh yes i messed up, this should fix it #596

All good now, thank you!

wader commented

๐Ÿฅณ thanks for reporting and nice bug report

wader commented

You added fq to NetBSD ports? thanks for that, were no issues? have no idea how the golang support is on *BSDs.

Also I can add you to the list of ppl i notify when doing a new release if you want?

Yes, I did; and pkgsrc is portable and also used e.g. on Illumos, macOS, Linux and other operating systems :)
If you have pkgsrc set up, just use pkgin install fq to get the program installed.

Golang support in pkgsrc and NetBSD in special is quite good.

Sure, let me know about new releases. Thank you

wader commented

Yes, I did; and pkgsrc is portable and also used e.g. on Illumos, macOS, Linux and other operating systems :) If you have pkgsrc set up, just use pkgin install fq to get the program installed.

Aha didn't know ๐Ÿ‘ could possibly add it to the fq README

Golang support in pkgsrc and NetBSD in special is quite good.

Good to know. The only thing i'm a bit worried about fq-wise is the readline module which has some os-specific code, but the REPL seems to work fine?

Sure, let me know about new releases. Thank you

Will do

Yes, the REPL works fine.

(NetBSD is the upstream for editline (a BSD licensed readline).)

wader commented

Good, fq uses a fork of https://github.com/chzyer/readline which is similar to libreadline etc but implemented in go so is very convenient to use and build. Trying to stay away from cgo to not have to deal with c build issues... have enough of those in my life anyway :)