Compression does not respect LZ4 official End of block conditions
I noticed that the library does not respect the end-of-block conditions specified in the official LZ4 repository. More specifically:
End of block conditions
- The last match must start at least 12 bytes before the end of block. The last match is part of the penultimate sequence. It is followed by the last sequence, which contains only literals.
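The rule quoted above reduces to a single comparison. The following is a minimal sketch, not code taken from lz4js or the reference implementation; the names `MIN_MATCH_DISTANCE_FROM_END` and `match_allowed` are hypothetical, and offsets are assumed to be 0-based positions in the uncompressed block.

```python
# Hypothetical constant name; the value 12 comes from the rule quoted above.
MIN_MATCH_DISTANCE_FROM_END = 12


def match_allowed(match_start: int, block_len: int) -> bool:
    """Return True if a match starting at `match_start` (0-based offset in the
    uncompressed block) is at least 12 bytes before the end of the block."""
    return block_len - match_start >= MIN_MATCH_DISTANCE_FROM_END


# A match starting at offset 32 of a 43-byte block is only 11 bytes from the end.
print(match_allowed(32, 43))  # False
# With one more input byte the same match starts exactly 12 bytes from the end.
print(match_allowed(32, 44))  # True
```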
For example, the following ASCII string
Abcdefghijklmnop0000000000000000Abcdefghijk
is encoded as
04 22 4d 18 40 70 df 1e 00 00 00 fb 02 41 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 30 01 00 02 20 00 50 67 68 69 6a 6b 00 00 00 00
frame header: 04 22 4d 18 40 70 df
block size: 1e 00 00 00 (30 bytes)
sequence 0: fb 02 41 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 30 01 00 (17 literals "Abcdefghijklmnop0", then a 15-byte match at offset 1 repeating "0")
sequence 1: 02 20 00 (no literals, a 6-byte match at offset 32 copying "Abcdef")
sequence 2: 50 67 68 69 6a 6b (5 literals "ghijk")
frame end mark: 00 00 00 00
Sequence 1 is a match that starts at offset 32 of the 43-byte input, only 11 bytes before the end of the block. This violates the rule above, so the block is not guaranteed to be decoded correctly.
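The violation can be checked mechanically by walking the sequences of the block and recording where each match starts in the decompressed output. The sketch below assumes a well-formed raw LZ4 block (the 30 bytes that follow the block size field in the encoding above); `match_starts` and `LZ4JS_BLOCK` are hypothetical names, and the code does no bounds checking.

```python
def match_starts(block: bytes):
    """Walk a raw LZ4 block and return (starts, total_len): the decompressed
    offset at which each match begins, and the total decompressed length."""
    pos = 0        # read position in the compressed block
    out_len = 0    # decompressed bytes produced so far
    starts = []
    while pos < len(block):
        token = block[pos]
        pos += 1
        # Literal length: high nibble, extended by extra bytes when it is 15.
        lit_len = token >> 4
        if lit_len == 15:
            while True:
                extra = block[pos]
                pos += 1
                lit_len += extra
                if extra != 255:
                    break
        pos += lit_len
        out_len += lit_len
        if pos >= len(block):
            break  # the last sequence holds literals only
        pos += 2   # skip the 2-byte little-endian match offset
        # Match length: low nibble plus 4, extended the same way as literals.
        match_len = token & 0x0F
        if match_len == 15:
            while True:
                extra = block[pos]
                pos += 1
                match_len += extra
                if extra != 255:
                    break
        match_len += 4
        starts.append(out_len)
        out_len += match_len
    return starts, out_len


# Block bytes produced by lz4js for the 43-byte example input.
LZ4JS_BLOCK = bytes.fromhex(
    "fb 02 41 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 30 01 00 "
    "02 20 00 50 67 68 69 6a 6b"
)

starts, total = match_starts(LZ4JS_BLOCK)
print(starts, total)       # [17, 32] 43
print(total - starts[-1])  # 11, which is less than the required 12
```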
In contrast, the official LZ4 encoder does not emit this match. Here is what it generates for the same input:
04 22 4D 18 60 40 82 21 00 00 00 FB 02 41 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F 70 30 01 00 B0 41 62 63 64 65 66 67 68 69 6A 6B 00 00 00 00
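frame header: 04 22 4D 18 60 40 82
block size: 21 00 00 00 (33 bytes)
sequence 0: FB 02 41 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F 70 30 01 00 (17 literals "Abcdefghijklmnop0", then a 15-byte match at offset 1 repeating "0")
sequence 1: B0 41 62 63 64 65 66 67 68 69 6A 6B (11 literals "Abcdefghijk", no match)
frame end mark: 00 00 00 00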
This was obtained with the following command:
$ echo -ne "Abcdefghijklmnop0000000000000000Abcdefghijk" | lz4 -c -12 --no-frame-crc | od -t x1 -A n
04 22 4d 18 60 40 82 21 00 00 00 fb 02 41 62 63
64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 30 01 00
b0 41 62 63 64 65 66 67 68 69 6a 6b 00 00 00 00
using the following version:
$ lz4 --version
*** LZ4 command line interface 64-bits v1.9.2, by Yann Collet ***
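To compare only the block payloads, the frame wrapper has to be stripped first. The sketch below splits a frame like the ones in this report into its descriptor, block, and trailer. It assumes a single compressed block and none of the optional frame fields (content size, dictionary ID, checksums), which matches the dumps shown here; `split_single_block_frame` and `OFFICIAL_FRAME` are hypothetical names.

```python
import struct

LZ4_FRAME_MAGIC = 0x184D2204  # stored little-endian as 04 22 4d 18


def split_single_block_frame(frame: bytes):
    """Return (descriptor, block, trailer) for a simple one-block LZ4 frame."""
    (magic,) = struct.unpack_from("<I", frame, 0)
    if magic != LZ4_FRAME_MAGIC:
        raise ValueError("not an LZ4 frame")
    flg = frame[4]
    # FLG bits 4, 3, 2 and 0 announce block checksums, content size,
    # content checksum and dictionary ID; this sketch handles none of them.
    if flg & 0b00011101:
        raise ValueError("optional frame fields are not handled in this sketch")
    descriptor = frame[4:7]  # FLG, BD, header checksum
    (size_field,) = struct.unpack_from("<I", frame, 7)
    if size_field >> 31:
        raise ValueError("block is stored uncompressed")
    block = frame[11:11 + size_field]
    trailer = frame[11 + size_field:]  # expected to be the 4-byte end mark
    return descriptor, block, trailer


# Output of the lz4 command shown above for the 43-byte input.
OFFICIAL_FRAME = bytes.fromhex(
    "04 22 4d 18 60 40 82 21 00 00 00 fb 02 41 62 63 "
    "64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 30 01 00 "
    "b0 41 62 63 64 65 66 67 68 69 6a 6b 00 00 00 00"
)

descriptor, block, trailer = split_single_block_frame(OFFICIAL_FRAME)
print(descriptor.hex(), len(block), trailer.hex())  # 604082 33 00000000
```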
Note that after adding one extra character, going from
Abcdefghijklmnop0000000000000000Abcdefghijk
to
Abcdefghijklmnop0000000000000000Abcdefghijkl
the match starts exactly 12 bytes before the end of the block, so emitting it is now legal. The official LZ4 encoder therefore produces the same output as lz4js:
$ echo -ne "Abcdefghijklmnop0000000000000000Abcdefghijkl" | lz4 -c -12 --no-frame-crc | od -t x1 -A n
04 22 4d 18 60 40 82 1e 00 00 00 fb 02 41 62 63
64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 30 01 00
03 20 00 50 68 69 6a 6b 6c 00 00 00 00