lz4/lz4-java

Question: Why does LZ4BlockOutputStream use a checksum?

tomerz90 opened this issue · 5 comments

AFAIK, it's not a "convention" for output streams, and LZ4 itself doesn't use a checksum.
From my tests, it uses significant CPU, and if it's there for the reason I think it is, it is unnecessary for me and I can use a "NO-OP" checksum.
Can you please explain why LZ4BlockOutputStream needs a checksum?
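
For readers wondering what the "NO-OP" checksum mentioned above might look like, here is a minimal sketch. It relies only on the JDK's `java.util.zip.Checksum` interface; the class name `NoOpChecksum` is hypothetical and not part of lz4-java.

```java
import java.util.zip.Checksum;

/** A Checksum that ignores all input and always reports 0 (hypothetical helper). */
final class NoOpChecksum implements Checksum {
    @Override
    public void update(int b) {
        // Intentionally empty: no per-byte work is done.
    }

    @Override
    public void update(byte[] b, int off, int len) {
        // Intentionally empty: the buffer is never read.
    }

    @Override
    public long getValue() {
        // Constant value, so a writer and a reader using this class always agree.
        return 0L;
    }

    @Override
    public void reset() {
        // Nothing to reset.
    }
}
```

An instance could then be passed to the `LZ4BlockOutputStream` constructor that accepts a `Checksum`. As far as I understand, the reading side would need the same no-op checksum in `LZ4BlockInputStream`, since a reader using the default checksum would presumably reject such a stream as corrupted.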

The LZ4 block format does not use a checksum, but the LZ4 frame format does use a checksum. https://github.com/lz4/lz4/blob/master/doc/lz4_Frame_format.md I don't think it is weird for a streaming compression format to use a checksum.

Although LZ4BlockOutputStream forces you to use a checksum, LZ4FrameOutputStream allows you to disable it by not specifying CONTENT_CHECKSUM or BLOCK_CHECKSUM.

I never said it's weird :), I was just trying to understand its purpose.

  1. So the checksum is mostly useful when the compressed data is transferred over the network or through anything else that might cause corruption, right? If I have a process that writes the compressed data to disk and then reads it back (all locally on the same machine), there is no point in the checksum, right?
  2. I think I remember reading in the docs that the frame output stream is slightly slower than block, am I right?
  3. Why is disabling the checksum supported for the frame output stream but not for the block output stream?

  1. In general, right. However, whether to use a checksum for a local disk depends on which disk, file system, memory, and CPU you use and what level of reliability you need. Data corruption can happen anywhere, even in memory (though most soft errors should be fixed by ECC, if you use good memory).
  2. By "block", do you mean LZ4BlockOutputStream/LZ4BlockInputStream or the block-oriented APIs provided by LZ4Compressor/LZ4SafeDecompressor/LZ4FastDecompressor?
  3. It's just not implemented in LZ4BlockOutputStream yet. ;) I was not in the lz4-java project when LZ4BlockOutputStream was implemented, so I have no idea why the designer did not support it. I am open to supporting it, but I recommend that any new application use LZ4FrameOutputStream/LZ4FrameInputStream because they are compatible with the LZ4 frame format. LZ4BlockOutputStream/LZ4BlockInputStream are maintained for historical reasons.
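
To illustrate point 1, here is a small sketch of how a stored checksum exposes corruption when data is read back. It uses the JDK's `CRC32` purely for illustration; it is not the checksum lz4-java uses, and the class and method names (`ChecksumCorruptionDemo`, `crc32`, `flipBit`) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class ChecksumCorruptionDemo {
    /** Computes the CRC32 of the whole array. */
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /** Returns a copy of data with a single bit flipped in the byte at byteIndex. */
    static byte[] flipBit(byte[] data, int byteIndex, int bit) {
        byte[] copy = data.clone();
        copy[byteIndex] ^= (byte) (1 << bit);
        return copy;
    }

    public static void main(String[] args) {
        // Stand-in for a compressed block; the checksum is stored at write time.
        byte[] original = "compressed block payload".getBytes(StandardCharsets.UTF_8);
        long stored = crc32(original);

        // Simulate a single-bit error somewhere between write and read.
        byte[] corrupted = flipBit(original, 5, 3);

        System.out.println("intact matches:    " + (crc32(original) == stored));
        System.out.println("corrupted matches: " + (crc32(corrupted) == stored));
    }
}
```

Without a checksum, the corrupted payload would go to the decompressor unverified, which might fail or might silently produce wrong output.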

  1. Yes, LZ4BlockOutputStream/LZ4BlockInputStream.
  2. I understand, but I'm writing a high-performance application, and comparing frame output without a checksum against block output without a checksum, block output came out ~13% faster (it was a "quick-and-dirty" check done as part of some other code, so it might be skewed). So maintaining it might be worthwhile, and not just for historical reasons.

I don't remember seeing any documentation comparing LZ4Frame* with LZ4Block*. I have never measured them against each other myself, but a 13% difference is not something that can be ignored. Actually, I happen to be writing a benchmark suite to compare Java streaming compression algorithms, so I am going to try to reproduce your observation soon. Ideally, I would like to close the 13% gap rather than maintain LZ4Block* forever....