extra = 0 confusion
magcius opened this issue · 4 comments
lzxd.c has this code: https://github.com/kyz/libmspack/blob/master/libmspack/mspack/lzxd.c#L589-L609
Which suggests that extra = 0 is not defined in the LZX spec. However, on this spec page: https://docs.microsoft.com/en-us/previous-versions/bb417343(v=msdn.10)#microsoft-lzx-data-compression-format
( The formatting is unfortunately very mangled, but it's the best I can find )
There is an explicit callout for the extra = 0 case:
else if (extra > 0) /* just some verbatim bits */
verbatim_bits ← readbits(extra)
aligned_bits ← 0
else /* no verbatim bits */
verbatim_bits ← 0
aligned_bits ← 0
endif
Implying that match_offset from the position slot table should be left alone, not reset to such a close offset. Was this another discrepancy between the Java decoder implementation and the spec, or did you misread the specification?
I noticed this when working on my own implementation; I have not checked to see if LZX files in the wild hit this case.
Looking back at LZXFMT.DOC from CABSDK.EXE, you're right, it does include extra == 0
logic in the pseudocode, exactly as seen on the page you linked. I must have misread it. It's also hiding in the LZX DELTA specification logic, where the logic would run verbatim_bits = readbits(0); aligned_bits = 0;
And yes, there's a discrepancy with the java implementation, which has the logic you see in lzxd.c:
int extra = extra_bits[position_slot];
match_offset = position_base[position_slot];
if (extra > 3) match_offset += readbits(extra - 3) << 3) + aligned_tree.readsym();
else if (extra == 3) match_offset += aligned_tree.readsym();
else if (extra != 0) match_offset += readbits(extra);
else match_offset = 1;
I have thousands of cab files with extra == 0
cases. cabextract, EXTRACT.EXE (from 1997) and EXPAND.EXE (from the 2000s) all agree on the decoded result. The smallest one is 305 bytes and I've printed it in base64 below. If I change lzxd.c to follow the specification, it starts to get the result wrong.
$ base64 < 0625.cab
TVNDRgAAAAAxAQAAAAAAACwAAAAAAAAAAwEBAAEAAAAAAAAASQAAAAEAAxX+AgAAAAAAAAAA6yJg
pwAAYXJyb3dfcmwuY3VyAPRpnfLgAP4CW4CAjQAg6C+NlADBAAADAAA0NAC2DoTSCjtLYidxLIHm
/yd1Iv9iT8QEXLoAIAAAZgDAjGEKCt9fOIXvSta7Fp3AUuICkRAKQpAQCuiY/U39/f/9gGgAAAAA
AAAMjICDBu4GEfv94vczBnjmSNJfZX1w+llcmmvGkPP/+LZ928xgk3+1jbc89tyWC07HWE9tZ+jf
uV2n6FtW7WQkrGciPVHl/IxMDt/omBEgIwAAAAAAAAIA/5EAoGYTN/eyUZSI1IJoddEsEqqlNLTO
WRfA5RFRJcvSQoqniAJRjqSUsII=
$ wine extract.exe /e 0625.cab
Microsoft (R) Diamond Extraction Tool - Version (32) 1.00.0601 (03/18/97)
Copyright (c) Microsoft Corp 1994-1997. All rights reserved.
Cabinet 0625.cab
Extracting arrow_rl.cur
$ md5sum arrow_rl.cur
c8f980bf103d1c781ed71b25e46e7219 arrow_rl.cur
$ WINEPREFIX=$HOME/.wine64 WINEARCH=win64 wine expand.exe 0625.cab file.out
Microsoft (R) File Expansion Utility Version 6.1.7600.16385
Copyright (c) Microsoft Corporation. All rights reserved.
Adding Z:\tmp\file.out to Extraction Queue
Expanding Files ....
Expanding Files Complete ...
$ md5sum file.out
c8f980bf103d1c781ed71b25e46e7219 file.out
$ cabextract -t 0625.cab
Testing cabinet: 0625.cab
arrow_rl.cur OK c8f980bf103d1c781ed71b25e46e7219
All done, no errors.
$ ./cabextract-to-spec -t 0625.cab
Testing cabinet: 0625.cab
arrow_rl.cur OK 2188d750947c1fbb24dba4080dd7b51d
That is bizarre to me, because it implies they're trying to code a 1 match_offset but are using a position_slot other than 3, which would already have 0 extra bits. But it seems like that's the correct behavior regardless?
Thanks for your recommendation to look at it. I rewrote the decoding from scratch in 202ab41 which does correctly decode files, including the example above. I'm not entirely sure where the discrepancy lies, but there is no longer a special case for extra == 0, and it does work. Thanks!
Nice, that's very similar to what I do now, except I handle the match_offset == 3 case explicitly in a separate test, which means I never have to test for extra > 0. It also means I can fold the -2 into the position_base_table, and omit the first four elements as well.