problems with zip support
0-wiz-0 opened this issue ยท 21 comments
I tried fq on zip archives, and I had two problems so far:
- fq seems to have zip64 support, but it's not complete - at least on one file I tried it on - the eocd64 marker is shown as 'gap0' (example file: https://github.com/nih-at/libzip/blob/main/regress/zip64.zip):
# fq . zip64.zip
|00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18 19 1a 1b|0123456789abcdef0123456789ab|.{}: zip64.zip (zip)
0x00|50 4b 03 04 2d 00 00 00 00 00 4f 72 5b 40 07 a1 ea dd ff ff ff ff ff ff ff ff 01 00|PK..-.....Or[@..............| local_files[0:1]:
0x1c|14 00 2d 01 00 10 00 02 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00 61 0a |..-....................a. |
0x1c| 50 4b 01| PK.| central_directories[0:1]:
0x38|02 1e 03 2d 00 00 00 00 00 4f 72 5b 40 07 a1 ea dd 02 00 00 00 02 00 00 00 01 00 00|...-.....Or[@...............|
0x54|00 00 00 00 00 01 00 00 00 80 11 00 00 00 00 2d |...............- |
0x54| 50 4b 06 06 2c 00 00 00 00 00 00 00| PK..,.......| gap0: raw bits
0x70|1e 03 2d 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00|..-.........................|
* |until 0xaf.7 (76) | |
0xa8| 50 4b 05 06 00 00 00 00 01 00 01 00 2f 00 00 00 35 00 00 00| PK........../...5...| end_of_central_directory_record{}:
0xc4|00 00| |..| |
- I wanted to try the interactive mode to look inside a zip inside a zip, and fq just hung completely, eating ~33GB RAM and 100% CPU:
fq -i . bigzero-zip.zip
using https://github.com/nih-at/libzip/blob/main/regress/bigzero-zip.zip
Hey, sorry i somehow missed the notification email about this. Will have a look.
Had quick look at zip64.zip
. gap0
seems to be a "Zip64 end of central directory record" structure (signature 0x06064b50, confusengly is reverse, zip uses little endian). If i've understood the spec correctly the way to detect zip64 is that the "zip32" version of "end of central directory record" has a "offset of start of central directory ...." field that is 0xffffffff, see:
4.4.24 offset of start of central directory with respect to
the starting disk number: (4 bytes)
Offset of the start of the central directory on the
disk on which the central directory starts. If an
archive is in ZIP64 format and the value in this
field is 0xFFFFFFFF, the size will be in the
corresponding 8 byte zip64 end of central
directory field.
In this file it seems to be 0x35 ๐ค do you know if there is some other way to detect it? or could this be some left over data etc because of how the zip writer is implemented?
$ fq -o line_bytes=8 '(.end_of_central_directory_record | d), (.gap0 | dd)' zip64.zip
โ00 01 02 03 04 05 06 07โ01234567โ.end_of_central_directory_record{}:
0xb0โ50 4b 05 06 โPK.. โ signature: raw bits (valid)
0xb0โ 00 00 โ .. โ disk_nr: 0
0xb0โ 00 00โ ..โ central_directory_start_disk_nr: 0
0xb8โ01 00 โ.. โ nr_of_central_directory_records_on_disk: 1
0xb8โ 01 00 โ .. โ nr_of_central_directory_records: 1
0xb8โ 2f 00 00 00โ /...โ size_of_central_directory: 47
0xc0โ35 00 00 00 โ5... โ offset_of_start_of_central_directory: 53
0xc0โ 00 00โ โ ..โ โ comment_length: 0
โ โ โ comment: ""
โ00 01 02 03 04 05 06 07โ01234567โ
0x60โ 50 4b 06 06โ PK..โ.gap0: raw bits
0x68โ2c 00 00 00 00 00 00 00โ,.......โ
0x70โ1e 03 2d 00 00 00 00 00โ..-.....โ
0x78โ00 00 00 00 01 00 00 00โ........โ
0x80โ00 00 00 00 01 00 00 00โ........โ
0x88โ00 00 00 00 2f 00 00 00โ..../...โ
0x90โ00 00 00 00 35 00 00 00โ....5...โ
0x98โ00 00 00 00 50 4b 06 07โ....PK..โ
0xa0โ00 00 00 00 64 00 00 00โ....d...โ
0xa8โ00 00 00 00 01 00 00 00โ........โ
You can identify a zip64 by the existence of the zip64 EOCD, which you can recognize by the 'PK\06\06'.
If any of the files is too large (even just when uncompressed), then the whole file needs to be zip64 even if none of the fields in the EOCD marker are too big. AFAIK there is no requirement to set the values in the EOCD to 0xffffffff if they fit.
Does this help?
Sorry, that was slightly wrong: You look for the Zip64 end of central directory locator (0x07064b50) directly before the standard ZIP EOCD that points you to the Zip64 EOCD and then you can parse from there.
Yes helps, and that is actually you how the decoder works now, it heuristically seeks from the end backwards after the "zip32" EOCD signature. So should it look for both and decode but prefer the zip64 one? otherwise there might be a zip32 gap i guess?
No, in a zip64 there are both. From the appnote:
[central directory header n]
[zip64 end of central directory record]
[zip64 end of central directory locator]
[end of central directory record]
So at the end you always have a zip(32) EOCD. If you go backwards from that and find a zip64 EOCD locator, it's a zip64 and you need to parse the zip64 EOCD (which it tells you how to find). If you find a central directory header instead, you know it's not a zip64.
Aha i see, thanks. Will look into a fix soon, let me know if you want to have a go at it. The code looks a messy now, so could probably be cleanup up and refactored a bit also.
Had a look at bigzero-zip.zip
now also. Looks like the toml decoder is not very happy about the 4GB zeroes files. If i build without it in the probe group it succeeds decoding after some time (but is quite slow).
I did fq -d bytes '_registry.groups.probe[] as $f | $f, try decode($f) catch .' bigzero
to figure which probe format that got stuck.
Will have a look at the toml code (uses github.com/BurntSushi/toml) and try figure out some way to fail toml decoding faster.
No hurry. Since I'm already maintaining a zip library, I'll leave this one up to you ;)
Interesting analysis about the toml decoder. I had thought files would be decoded more on-demand, not all in advance. Good to know!
I was a bit fast to judge, toml is slow but finishes, it seems to actually be the xml decoder that eat a lot of memory for some reason hmm.
Currently i'm not doing any on-demand decoding, have thought about it and would be interesting to look into, but fq's decode and jq code is quite complex as it is :) so will see. Focus have mostly been on making things possible over speed and efficiency. But some format do have options to disable sub-decoding, the zip format should probably have decode_uncompressed
option etc.
#594 makes decoding bigzero-zip.zip quite a lot faster but will still use some cpu and memory as is uncompresses to memory. This will also speed up probing in many other cases. If your curious, fq does not have any special probe code instead it's up to decoder that are in the "probe" group to fail fast.
Both issues should be resolved now. Give it a try if you can.
I noticed that there were some more (non-broken) files from the libzip regression tests that fq dont like, maybe i will have a look at those also.
Thanks, the zip64 eocd parsing looks good now.
I've tried the bigzero-zip.zip too but it doesn't look too good?
# ./fq -i . /tmp/bigzero-zip.zip
error: /tmp/bigzero-zip.zip: probe: failed to decode: try fq -d FORMAT to force format, see fq -h formats for list
empty> ^D
# ./fq -d zip -i . /tmp/bigzero-zip.zip
zip!> .
|00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f|0123456789abcdef0123456789abcdef|.{}: /tmp/bigzero-zip.zip (zip)
| | | error: zip: RawLen(signature): failed at position 10489.7 (read size 0 seek pos 0): outside buffer
0x0000|50 4b 03 04 14 00 02 00 08 00 c8 78 84 45 54 81 15 ae 4e 28 00 00 b9 9a 3f 00 0b 00 1c 00 62 69|PK.........x.ET...N(....?.....bi| gap0: raw bits
* |until 0x28e3.7 (10468) | |
0x28e0| 50 4b 05 06 00 00 00 00 01 00 01 00 51 00 00 00 93 28 00 00 00 00| | PK..........Q....(....| | end_of_central_directory_record{}:
| | | end_of_central_directory_locator{}:
All good now, thank you!
๐ฅณ thanks for reporting and nice bug report
You added fq to NetBSD ports? thanks for that, were no issues? have no idea how the golang support is on *BSDs.
Also I can add you to the list of ppl i notify when doing a new release if you want?
Yes, I did; and pkgsrc is portable and also used e.g. on Illumos, macOS, Linux and other operating systems :)
If you have pkgsrc set up, just use pkgin install fq
to get the program installed.
Golang support in pkgsrc and NetBSD in special is quite good.
Sure, let me know about new releases. Thank you
Yes, I did; and pkgsrc is portable and also used e.g. on Illumos, macOS, Linux and other operating systems :) If you have pkgsrc set up, just use
pkgin install fq
to get the program installed.
Aha didn't know ๐ could possibly add it to the fq README
Golang support in pkgsrc and NetBSD in special is quite good.
Good to know. The only thing i'm a bit worried about fq-wise is the readline module which has some os-specific code, but the REPL seems to work fine?
Sure, let me know about new releases. Thank you
Will do
Yes, the REPL works fine.
(NetBSD is the upstream for editline (a BSD licensed readline).)
Good, fq uses a fork of https://github.com/chzyer/readline which is similar to libreadline etc but implemented in go so is very convenient to use and build. Trying to stay away from cgo to not have to deal with c build issues... have enough of those in my life anyway :)