image-rs/deflate-rs

Performance improvements

oyvindln opened this issue · 1 comment

The performance is still not great compared to miniz/zlib on files with long runs of the same byte.

EDIT:
See next post.

Profiling reveals that lz77::longest_match and lz77::get_match_length are where most time is spent.

get_match_length is particularly problematic for data containing long runs of a single literal, which causes a lot of calls to this function (as there will be a large number of entries in the hash chain for the 3-byte sequences of that byte). Currently it uses two zipped iterators to compare the matches, which may not be ideal performance-wise. C implementations of deflate seem to check multiple bytes at once by casting the bytes to larger data types. I've tested this, but it didn't seem to make a difference.
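For reference, a minimal sketch of the "compare multiple bytes at once" approach from C implementations, translated to safe Rust. The function name and signature are illustrative, not the crate's actual API; the caller is assumed to guarantee that both `a + max_len` and `b + max_len` stay within `data`:

```rust
use std::convert::TryInto;

// Hypothetical sketch: compare 8 bytes per iteration by loading them as u64,
// instead of comparing byte-by-byte with zipped iterators.
fn match_length_wide(data: &[u8], a: usize, b: usize, max_len: usize) -> usize {
    let mut len = 0;
    // Compare full 8-byte chunks while one fits within max_len.
    while len + 8 <= max_len {
        let x = u64::from_le_bytes(data[a + len..a + len + 8].try_into().unwrap());
        let y = u64::from_le_bytes(data[b + len..b + len + 8].try_into().unwrap());
        let diff = x ^ y;
        if diff != 0 {
            // The lowest set bit of the XOR marks the first differing byte
            // (little-endian load, so low bits correspond to earlier bytes).
            return len + (diff.trailing_zeros() / 8) as usize;
        }
        len += 8;
    }
    // Byte-wise comparison for the remaining tail.
    while len < max_len && data[a + len] == data[b + len] {
        len += 1;
    }
    len
}
```

As noted above, this kind of rewrite didn't measurably help here, which suggests the bottleneck is elsewhere (e.g. the chain-walk itself) rather than the byte comparison.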

In the longest_match function, array lookups seem to be the main cause of the slowdown (maybe because further instructions depend on the array value?). If we can find a way to reduce the number of lookups, or the length of the hash chains without impacting compression ratio, that would help improve performance.

For lower compression levels, other compressors simply hard-limit the length of the hash chains, and further adaptively reduce the chain length once a decent match has been found.
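A rough sketch of that zlib-style strategy, not the crate's actual code: a hard cap on chain iterations, cut sharply once a "good enough" match exists. The `prev` table, sentinel value, and the `MAX_CHAIN`/`GOOD_LENGTH` constants are all illustrative assumptions (zlib tunes such values per compression level):

```rust
// Hypothetical hash chain: prev[pos] points at an earlier position with the
// same 3-byte hash, or usize::MAX at the end of the chain.
fn longest_match_limited(
    data: &[u8],
    prev: &[usize],
    start: usize,
    max_len: usize,
) -> (usize, usize) {
    const MAX_CHAIN: usize = 128; // hard limit, would be tuned per level
    const GOOD_LENGTH: usize = 32; // "decent match" threshold, as in zlib

    let mut chain_left = MAX_CHAIN;
    let (mut best_pos, mut best_len) = (0, 0);
    let mut cur = prev[start];
    while chain_left > 0 && cur != usize::MAX {
        // Byte-wise match length at this chain entry.
        let len = (0..max_len)
            .take_while(|&i| data[cur + i] == data[start + i])
            .count();
        if len > best_len {
            best_pos = cur;
            best_len = len;
            // Once the match is decent, search the rest of the chain
            // far less eagerly.
            if best_len >= GOOD_LENGTH {
                chain_left >>= 2;
            }
        }
        chain_left -= 1;
        cur = prev[cur];
    }
    (best_pos, best_len)
}
```

The appeal for the long-runs case is that the limit bounds the worst-case chain walk regardless of how many identical 3-byte entries pile up in the chain.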

So the main point where the current implementation (outside of RLE mode) spends the most time (according to operf) is the comparison after loading the next value from the hash chain in the matching function:
https://github.com/oyvindln/deflate-rs/blob/dd923fb9950b6fce37cec9ed6ab617a036c559b8/src/matching.rs#L103

Maybe this is the CPU stalling while waiting for data or something.

Current plans for improvements that may or may not help:

  • Avoid creating the extra temporary Huffman code length table.
  • Avoid using a temporary heap buffer in in_place_lengths.
  • Change the match function to behave more like miniz:
    • divide max iterations by 3, do 3 checks per iteration, jump to next iteration on a match
      • This didn't seem to make any difference to performance
  • Implement a longest_match function using this algorithm to possibly cut down on hash chain iterations: http://www.gildor.org/en/projects/zlib
  • Avoid bounds checks in various functions (huffman table length/distance lookups, hash chain, huffman length generation and others)
  • Check for faster CRC generators (the currently used library seems to be abandoned/inactive, with some PRs with improvements pending)
  • Consider using threads for some operations
  • Avoid buffering input when it's not needed (e.g. for the simple functions (deflate_*) and if write_all is called with a very long buffer.)
    • Specialisation is unstable, so we might want to make a VecEncoder or something similar instead for now
  • Avoid buffering output when using a writer where writing is "guaranteed" (excluding OOM) to succeed, e.g. Vec.
  • Try to heap-allocate buffers close together in memory. I don't think this is currently doable in an easy way on stable Rust without blowing up the stack (box syntax is unstable, but even that may not work in debug mode).
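On the bounds-check item in the list above, one pattern that can help without `unsafe` is making the index provably in range, e.g. by masking into a fixed-size table. This is a generic sketch, not taken from the crate; the table size and function are made up for illustration:

```rust
// Sketch: with a fixed-size array and a mask, `index & (TABLE_SIZE - 1)` is
// provably < TABLE_SIZE, so rustc can emit the lookup without a panic path.
const TABLE_SIZE: usize = 512; // must be a power of two for masking to work

fn lookup(table: &[u8; TABLE_SIZE], index: usize) -> u8 {
    table[index & (TABLE_SIZE - 1)]
}
```

The same idea applies to iterator-based loops over slices, where the iterator itself guarantees in-bounds access, avoiding a per-element check that an indexed loop might pay.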