kyz/libmspack

extra = 0 confusion

magcius opened this issue · 4 comments

lzxd.c has this code: https://github.com/kyz/libmspack/blob/master/libmspack/mspack/lzxd.c#L589-L609

Which suggests that extra = 0 is not defined in the LZX spec. However, on this spec page: https://docs.microsoft.com/en-us/previous-versions/bb417343(v=msdn.10)#microsoft-lzx-data-compression-format

( The formatting is unfortunately very mangled, but it's the best I can find )

There is an explicit callout for the extra = 0 case:

                  else if (extra > 0) /* just some verbatim bits */
                        verbatim_bits ← readbits(extra)
                        aligned_bits  ← 0
                  else /* no verbatim bits */
                        verbatim_bits ← 0
                        aligned_bits  ← 0
                  endif

Implying that match_offset from the position slot table should be left alone, not reset to such a close offset. Was this another discrepancy between the Java decoder implementation and the spec, or did you misread the specification?

I noticed this when working on my own implementation; I have not checked to see if LZX files in the wild hit this case.

kyz commented

Looking back at LZXFMT.DOC from CABSDK.EXE, you're right, it does include extra == 0 logic in the pseudocode, exactly as seen on the page you linked. I must have misread it. It's also hiding in the LZX DELTA specification logic, where the logic would run verbatim_bits = readbits(0); aligned_bits = 0;

And yes, there's a discrepancy with the java implementation, which has the logic you see in lzxd.c:

    int extra = extra_bits[position_slot];
    match_offset = position_base[position_slot];
    if      (extra > 3)  match_offset += readbits(extra - 3) << 3) + aligned_tree.readsym();
    else if (extra == 3) match_offset += aligned_tree.readsym();
    else if (extra != 0) match_offset += readbits(extra);
    else                 match_offset = 1;

I have thousands of cab files with extra == 0 cases. cabextract, EXTRACT.EXE (from 1997) and EXPAND.EXE (from the 2000s) all agree on the decoded result. The smallest one is 305 bytes and I've printed it in base64 below. If I change lzxd.c to follow the specification, it starts to get the result wrong.

$ base64 < 0625.cab 
TVNDRgAAAAAxAQAAAAAAACwAAAAAAAAAAwEBAAEAAAAAAAAASQAAAAEAAxX+AgAAAAAAAAAA6yJg
pwAAYXJyb3dfcmwuY3VyAPRpnfLgAP4CW4CAjQAg6C+NlADBAAADAAA0NAC2DoTSCjtLYidxLIHm
/yd1Iv9iT8QEXLoAIAAAZgDAjGEKCt9fOIXvSta7Fp3AUuICkRAKQpAQCuiY/U39/f/9gGgAAAAA
AAAMjICDBu4GEfv94vczBnjmSNJfZX1w+llcmmvGkPP/+LZ928xgk3+1jbc89tyWC07HWE9tZ+jf
uV2n6FtW7WQkrGciPVHl/IxMDt/omBEgIwAAAAAAAAIA/5EAoGYTN/eyUZSI1IJoddEsEqqlNLTO
WRfA5RFRJcvSQoqniAJRjqSUsII=
$ wine extract.exe /e 0625.cab 
Microsoft (R) Diamond Extraction Tool - Version (32) 1.00.0601 (03/18/97)
Copyright (c) Microsoft Corp 1994-1997. All rights reserved.

 Cabinet 0625.cab

Extracting arrow_rl.cur
$ md5sum arrow_rl.cur 
c8f980bf103d1c781ed71b25e46e7219  arrow_rl.cur
$ WINEPREFIX=$HOME/.wine64 WINEARCH=win64 wine expand.exe 0625.cab file.out
Microsoft (R) File Expansion Utility  Version 6.1.7600.16385
Copyright (c) Microsoft Corporation. All rights reserved.

Adding Z:\tmp\file.out to Extraction Queue

Expanding Files ....

Expanding Files Complete ...
$ md5sum file.out 
c8f980bf103d1c781ed71b25e46e7219  file.out

$ cabextract -t 0625.cab 
Testing cabinet: 0625.cab
  arrow_rl.cur  OK                             c8f980bf103d1c781ed71b25e46e7219

All done, no errors.
$ ./cabextract-to-spec -t 0625.cab
Testing cabinet: 0625.cab
  arrow_rl.cur  OK                             2188d750947c1fbb24dba4080dd7b51d

That is bizarre to me, because it implies they're trying to code a 1 match_offset but are using a position_slot other than 3, which would already have 0 extra bits. But it seems like that's the correct behavior regardless?

kyz commented

Thanks for your recommendation to look at it. I rewrote the decoding from scratch in 202ab41 which does correctly decode files, including the example above. I'm not entirely sure where the discrepancy lies, but there is no longer a special case for extra == 0, and it does work. Thanks!

Nice, that's very similar to what I do now, except I handle the match_offset == 3 case explicitly in a separate test, which means I never have to test for extra > 0. It also means I can fold the -2 into the position_base_table, and omit the first four elements as well.

https://github.com/magcius/noclip.website/blob/99bdbc8742d44772eb25a4269aa3fa25e4ee2a76/src/Common/Compression/LZX.ts#L314-L346