ulikunitz/xz

Current maturity of project (and other semantics)

dsoprea opened this issue · 3 comments

As mentioned in #1, this project was declared as "not even alpha" in 2015. What is the current maturity of this project?

As mentioned in #23, this project was considered to be not near the speed of XZ back in 2018. Has the speed improved since then? Were you referring to general code optimization or part of your LZMA implementation that needed to be reimplemented according to the specification in order to meet the standard speed expectations?

As mentioned in #26, this project implements a different LZMA algorithm than XZ. Would you be able to provide the names/URLs of your algorithm versus that algorithm?

Hi,

Thanks for your questions. I try to answer them as best as I can.

What is the current maturity of this project?

The master branch is stable and there are no known bugs.

But the software didn't advance a lot over the last years. One reason is that zstd compression is a very viable alternative (https://github.com/klauspost/compress/tree/master/zstd) and has almost the compression ratio of xz, but it's encoding algorithm is much faster. No amount of optimization for xz will result in performance parity with zstd due to the different encoding algorithms.

Has the speed improved since then (2018)? Were you referring to general code optimization or part of your LZMA implementation that needed to be reimplemented according to the specification in order to meet the standard speed expectations?

I need to clarify that the specifications describe the encoding format, it doesn't describe the actual algorithm to create the encoding. The critical part is the creation of the LZ sequences. It has influence on the compression rate and compression and decompression speed.

The speed has not improved since 2018. Originally I planned to work on following items.

  1. Multithreaded compressions for large blocks (e.g. 8 Megabyte)
  2. Optimizations in the encoder to exploit CPU caches better (very technical).
  3. Improvements of the generation of the Lempel-Ziv sequences.

Please note that this is a pure side project and has nothing to do with my full-time job.

Would you be able to provide the names/URLs of your algorithm versus their algorithm?

The encoding is exactly the same as documented for LZMA. Note that xz actually uses LZMA2 encoding, which creates a block structure for LZMA. xz itself is a container format.

What is not part of the specifications is the way the Lempel-Ziv sequences are generated. While there is some literature here, this is a little bit of an art form and you must spend time to read the source code of other implementations to learn the tricks. The implementation I used was straight forward and is definitely not the fastest possible and it is not directly documented in the literature. But most LZ sequencing implementations are not described in the literature because those algorithms are heavily optimized and have often very specific design goals.

I used following documents for the implementation:

XZ File Format Specification
LZMA Specification in the LZMA SDK
LZMA2 description, created by me

A good article about practical Lempel-Ziv sequencing is

Longest match string searching for Ziv-Lempel compression (1993); Bell and Kulp

A good introduction in the whole field is

Compression and Coding Algorithms; Moffat and Turpin

Thank you for that complete response. I'm compelled to help, though dubious that I would've had to participate in the current implementation in order to effectively understand where we're at and to then establish an optimization roadmap of how to improve it. Even at this point, it's been long-enough for you that you might also have to reorient yourself with the implementation.

I will appreciate any help. Yes, its true, there are parts of the implementation, where I would have to look around, but one of the advantages of Go is that it is good to read.