schnaader/precomp-cpp

Misdetection or mistreatment of streams

M-Gonzalo opened this issue · 1 comment

There is something wrong that is not right about precomp... That is about how technical I'm gonna get. Sorry, but I can't pin down exactly where the fault is. I can describe how it looks, though:

A second run of precomp identifies and processes several streams that should have been handled in the first one.

$precomp -cn -intense -brute -e test.bin

Recompressed streams: 6124/6247
GZip streams: 51/54
PNG streams: 871/871
PNG streams (multi): 2/2
GIF streams: 2/2
zLib streams (intense mode): 5190/5299
Brute mode streams: 8/19

You can speed up Precomp for THIS FILE with these parameters:
-d2

$precomp -cn -intense -brute -e test.bin.pcf

Recompressed streams: 193/230
GZip streams: 3/3
PNG streams: 1/3
zLib streams (intense mode): 189/216
Brute mode streams: 0/8

You can speed up Precomp for THIS FILE with these parameters:
-d1

At first I thought maybe there were some zlib literals stored by preflate, but in that case, it wouldn't be beneficial to re-recompress the file.

62,12 %		45094212		test.pcf_cn.pcf_cn.raz
64,68 %		46955772		test.pcf_cn.raz
67,92 %		49303633		test.pcf_cn.pcf_cn.arc
70,32 %		51048411		test.pcf_cn.arc
100,00 %	72595528		test

I decided to close this and remove the bug label. Let me explain this decision using the mozilla file from the Silesia corpus, where something similar happens.

TL;DR: Things look strange, but there are reasons. The main one: false positives in intense and brute mode. Although this issue is closed now, things might get better in later versions.

Here are the results (file sizes in bytes) for this file using different parameters, running Precomp either once or twice, then compressing with lzma afterwards (using -t+):

| Parameters for each pass | One pass, compressed | Two passes, compressed |
| --- | --- | --- |
| -cn | 11,903,939 | 11,910,923 |
| -cn -intense | 11,955,623 | 11,954,647 |
| -cn -brute | 11,915,247 | 11,908,187 |
| -cn -intense -brute | 11,954,887 | 11,950,595 |

The first row is what's expected: a second pass makes things worse, so everything's fine. Comparing its result to the others in the middle column is also still fine: there are false positives that make the compression ratio drop a bit, and the intense ones seem to be worse somehow.

The surprising numbers are the others, though: all intense/brute results get better in a second pass. I'll try to illustrate what's going on here with a brute stream that is detected in the second pass. In fact, it's the only stream processed in this pass for -cn -brute, and it improves the compression result a lot (11,915,247 bytes -> 11,908,187 bytes!):

(19.46%) Possible zLib-Stream (brute mode) found at position 11484623
Compressed size: 94
Can be decompressed to 1070 bytes
Non-ZLIB reconstruction data size: 25 bytes
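For context, brute mode essentially tries raw-deflate inflation at candidate offsets and accepts anything that decompresses cleanly. A minimal sketch of the idea in Python (not precomp's actual code; function name and thresholds are invented for illustration):

```python
import zlib

def scan_raw_deflate(data, min_out=16):
    """Try raw-deflate inflation at every offset; report apparent streams."""
    hits = []
    for pos in range(len(data)):
        d = zlib.decompressobj(wbits=-15)  # raw deflate, no zlib header
        try:
            out = d.decompress(data[pos:])
        except zlib.error:
            continue
        if d.eof and len(out) >= min_out:
            comp_len = len(data) - pos - len(d.unused_data)
            hits.append((pos, comp_len, len(out)))
    return hits

# Embed a genuine raw-deflate stream between some garbage bytes:
payload = b"hello, deflate! " * 16
c = zlib.compressobj(wbits=-15)
blob = b"\xde\xad\xbe\xef" + c.compress(payload) + c.flush() + b"\x00" * 8
hits = scan_raw_deflate(blob)
# The genuine stream at offset 4 is found; on real-world data, such a scan
# also reports byte ranges that merely happen to parse as valid deflate.
```

On random or textual input, some offsets will inflate "successfully" by pure coincidence, and those are exactly the false positives discussed below.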

Highlighting this part of the .pcf in a hex editor leads to:

Brute mode match, pass 2

We can see two things there:

  1. The file this comes from is human-readable, so it's definitely a false positive. It is, by the way, mozilla/chrome/messenger.jar/content/messenger/renameFolderDialog.xul.
  2. The last 6 highlighted bytes don't seem to belong to that file; the data starts to get non-textual there.

Let's confirm this by looking at the original file:

Original file (renameFolderDialog.xul)

Indeed, part of the highlighted data in the .pcf is not part of the .xul file. This is why recursion didn't detect the stream in the first pass: looking for brute streams in the .xul alone doesn't lead to a detection. In the second pass, however, the data is part of the .pcf and is followed by other data, which does lead to a detection. Welcome to the strange world of zLib streams... :suspect:
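This boundary effect is easy to reproduce with a deflate "stored" block, whose header promises a number of literal bytes that must follow. Whether inflation succeeds at a given offset then depends entirely on the bytes after it (a hand-made illustration, not taken from the actual .pcf):

```python
import zlib

# Raw-deflate stored block header: BFINAL=1, BTYPE=00, LEN=4, NLEN=~LEN.
# The header promises 4 literal bytes that are not part of this prefix yet.
prefix = b"\x01\x04\x00\xfb\xff"

d = zlib.decompressobj(wbits=-15)
d.decompress(prefix)
print(d.eof)        # False: at the end of a file, this is not a complete stream

d = zlib.decompressobj(wbits=-15)
out = d.decompress(prefix + b"ABCD" + b"trailing")
print(d.eof, out)   # True b'ABCD': with data following it, the very same
                    # prefix suddenly decodes as a valid deflate stream
```

So the same five bytes are "not a stream" at the end of the .xul, but become one once other data follows them in the .pcf.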

It gets even weirder if we look at the result of inflating these 94 bytes to 1070 bytes. Remember that the compression ratio improves quite a lot by transforming the highlighted text in the images above into this:

Decompression of the match

Welcome to the strange world of improving lzma compression ratio... :rage1:

So, at least for this special case where intense and brute mode detect only false positives, things are "easy":

  • A second pass sometimes helps, but this is only the result of strange coincidences.
  • There is no magic additional compression; using intense and/or brute mode in this case gives worse results than not using it. (Also note: I tried a third pass on mozilla, too; no additional streams are detected and all ratios get worse.)
  • This pops up in 0.4.7dev because preflate can recompress every zLib stream, which now includes the false positives.
  • The solution is: don't use intense and/or brute mode if there are only false positives.

However, on files where intense/brute mode produces both false positives and genuine improvements (like the test.bin file you posted), things are more complicated. In theory, you could analyse which streams are false positives and ignore them using -i, which would lead to a ratio better than 62,12 %. In practice, however, identifying the false positives among 5,000+ streams is a waste of time.

I'm sure there are ways to detect false positives like the example here: expanding 94 bytes of text data to 1070 bytes of binary gibberish using 25 bytes of preflate reconstruction data is a clear false positive. However, as always, if we go too far with this, we'll start categorizing real deflate streams as false positives.
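A heuristic along those lines could look roughly like this (entirely hypothetical, not precomp's logic; the function name and all thresholds are invented for illustration):

```python
def looks_like_false_positive(comp_len, decomp_len, recon_len, raw_bytes):
    """Flag suspicious 'streams': tiny, high reconstruction overhead,
    and 'compressed' input that is actually mostly printable text."""
    overhead = recon_len / comp_len      # preflate recon data vs. stream size
    printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in raw_bytes)
    printable_ratio = printable / max(len(raw_bytes), 1)
    return comp_len < 256 and overhead > 0.2 and printable_ratio > 0.9

# The stream from this issue: 94 -> 1070 bytes with 25 bytes of recon data,
# and the "compressed" bytes are readable XUL text (sample is made up).
sample = b'<dialog xmlns="http://www.mozilla.org/keymaster/">' * 2
print(looks_like_false_positive(94, 1070, 25, sample))  # prints True
```

Tighten those thresholds too much, though, and real deflate streams start getting flagged as well, which is exactly the trade-off described above.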