Multi-thread/parallel encoding/decoding for enhanced performance?

Question

timmyd7777 opened this issue 4 years ago · 3 comments

One of the main reasons we are interested in JPEG-LS is performance. We would like to losslessly compress large (5320x4600 mono 8-bit) video frames in near-real-time, and JPEG-LS is among the simplest (i.e. fastest) lossless compression schemes available. However, as far as I can tell - and I may be mistaken, so please correct me if I'm wrong - the CharLS implementation is running in a single thread. Almost all of today's big software performance increases come from parallelization. Common CPUs now have 8 or more cores, and dividing a task among them is often the best way to get noticeable performance gains. This is even more true for GPUs, which can run thousands of threads in parallel.

It seems to me that image compression is a task that could parallelize well. Each thread could compress one row, or perhaps one sub-rectangle, of the image. So with N threads working in parallel, you could (in theory) compress the image in 1/N-th the time. I've noticed that Nvidia has parallelized versions of JPEG and JPEG 2000 codecs, which absolutely destroy the single-threaded CPU versions (although, admittedly, they are parallelizing on GPUs, and not just multi-core CPUs):

https://developer.nvidia.com/nvjpeg

At any rate, I'm wondering if it is possible to parallelize CharLS. Again, we are most interested in JPEG-LS for performance reasons, and this approach seems like the most likely path to get dramatic performance gains. Your thoughts welcome!

Answer 1 · 2021-03-10T21:40:09.000Z

The JPEG-LS algorithm was developed around 1998, when multi-core CPUs and GPU hardware were not common. The design target at that time was a generic single-core CPU and hardware implementations. In hardware (FPGA) it is possible to execute the algorithm in a pipeline; see for example the publication “Efficient high-performance implementation of JPEG-LS encoder” by M. Papadonikolakis. On a generic CPU it makes more sense to parallelize at a higher level, as there are many data dependencies in the JPEG-LS algorithm.

CharLS is a library and uses the calling thread to execute the algorithm. It does not create threads on its own, and a single encoder/decoder instance is by design not thread safe. The expectation is that higher-level software handles thread synchronization. Different instances of a CharLS encoder are, however, independent of each other, so encoding 10 images with 10 encoders at the same time is perfectly fine.

There are several ways to leverage the processing power of a multi-core CPU to encode faster:

1. If some latency is acceptable, you can create several encoder instances (say 5) and 5 worker threads. As soon as the first video frame is available, you assign it to thread 1, the second frame to thread 2, etc. Depending on the worst-case encoding time and the required frame rate, one can compute how many worker threads are needed and how much latency is introduced.
2. If the receiver does not need 1 single large JPEG-LS image, you could encode the big image as 10 images of 532 x 4600. Encoding these images can be done in parallel with 10 worker threads. The decoder can also decode them with 10 threads.
3. JPEG-LS has a concept of restart markers (RST). They were originally intended for recovery from data corruption but are rarely used. It would be technically possible to use 10 threads and divide the scan lines into groups. After the first 532 lines an RST marker would then be inserted to notify the decoder to reset its state and start decoding the second 532 lines. Note: support for encoding/decoding JPEG-LS streams with RST markers is part of the ISO standard, but is not implemented by CharLS (it has never been requested).
4. Use another, more recent JPEG algorithm. These newer algorithms are designed to leverage the capabilities of modern hardware.

JPEG XS: released in 2018 and designed for visually lossless encoding of video streams.
JPEG XL: a replacement for the “classic” JPEG algorithm that also provides a lossless mode. The file format is stable, but the specification has not yet been released as an official ISO standard. A free open-source implementation is available.
jpeg.org has more info about these algorithms.
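
For the first suggestion (several encoders, one per worker thread), a rough sizing rule is: workers ≈ ceil(worst-case encode time × frame rate), with an added latency of roughly one worst-case encode time. The sketch below illustrates this under stated assumptions. `encode_frame` is a hypothetical placeholder (it just copies its input) standing in for a real JPEG-LS encode call that creates its own encoder instance; it is not a CharLS function. The batching is also simplified: a real pipeline would hand out frames as they arrive rather than wait for a whole batch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <future>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical placeholder for a real JPEG-LS encode call that uses its own
// encoder instance. Here it only copies the input so the sketch can run.
Bytes encode_frame(const Bytes& frame) { return frame; }

// Rough estimate of the worker threads needed to keep up with the source:
// ceil(worst-case encode time * frame rate).
int workers_needed(double worst_case_encode_seconds, double frames_per_second) {
    assert(worst_case_encode_seconds > 0.0 && frames_per_second > 0.0);
    return static_cast<int>(
        std::ceil(worst_case_encode_seconds * frames_per_second));
}

// Encode frames with at most `workers` frames in flight at once.
// Output order matches input order.
std::vector<Bytes> encode_frames(const std::vector<Bytes>& frames,
                                 std::size_t workers) {
    assert(workers >= 1);
    std::vector<Bytes> encoded(frames.size());
    for (std::size_t first = 0; first < frames.size(); first += workers) {
        const std::size_t last = std::min(first + workers, frames.size());
        std::vector<std::future<Bytes>> in_flight;
        // Start one asynchronous task per frame in this batch.
        for (std::size_t i = first; i < last; ++i) {
            in_flight.push_back(std::async(
                std::launch::async,
                [&frames, i] { return encode_frame(frames[i]); }));
        }
        // Collect the results in frame order.
        for (std::size_t i = first; i < last; ++i) {
            encoded[i] = in_flight[i - first].get();
        }
    }
    return encoded;
}
```

As an illustration with made-up numbers: a worst-case encode time of 0.25 s at 30 frames per second gives `workers_needed(0.25, 30.0)`, i.e. 8 workers.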

Notes:

JPEG-LS color images (RGB) could be encoded/decoded with 3 threads at the same time.
JPEG-LS part 2 defines a reversible color space conversion. This would be a good example of an algorithm that could be executed on a GPU.
The default build of CharLS targets a “common” x86/x64 CPU. If the hardware is known, some additional compiler optimizations can be enabled.
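
To illustrate why a reversible color conversion suits massively parallel hardware, here is a minimal sketch. It does not reproduce the transform defined in JPEG-LS part 2; it uses a simple subtract-green transform with modulo-256 arithmetic, chosen only as an example of a lossless, per-pixel operation. The names `Rgb`, `forward_transform` and `inverse_transform` are made up for this sketch. Each pixel is processed independently of all others, which is what would allow one GPU thread per pixel.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Rgb {
    std::uint8_t r, g, b;
};

// Illustrative reversible transform (NOT the JPEG-LS part 2 definition):
// store red and blue as differences from green, wrapping modulo 256.
inline Rgb forward_transform(Rgb p) {
    return {static_cast<std::uint8_t>(p.r - p.g), p.g,
            static_cast<std::uint8_t>(p.b - p.g)};
}

// Exact inverse: add green back, again wrapping modulo 256.
inline Rgb inverse_transform(Rgb p) {
    return {static_cast<std::uint8_t>(p.r + p.g), p.g,
            static_cast<std::uint8_t>(p.b + p.g)};
}

// Every pixel is independent, so this loop could be split across
// CPU threads or mapped to one GPU thread per pixel.
void transform_in_place(std::vector<Rgb>& pixels) {
    for (Rgb& p : pixels) {
        p = forward_transform(p);
    }
}
```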

Answer 2 · 2021-03-11T17:58:13.000Z

Thanks very much. In a pinch, your suggestion #2 would work for us, although it might be easier to split the original image horizontally (i.e. 10 images of 5320 x 460 instead of 532 x 4600).
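
A minimal sketch of that horizontal split, assuming the frame is stored row-major as 8-bit mono: each strip of whole rows is then a contiguous block of the buffer, so no pixels need to be copied before encoding. `Strip`, `split_rows`, `encode_strip` and `encode_strips_parallel` are names made up for this sketch, and `encode_strip` is a hypothetical placeholder for a real JPEG-LS encode call with its own encoder instance (here it only copies the pixels). Because each strip is encoded independently, the compression ratio may be slightly lower than for one large image.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// A view of consecutive rows inside a row-major 8-bit mono image.
struct Strip {
    const std::uint8_t* data;
    std::size_t width;
    std::size_t rows;
};

// Split the image into `count` strips of whole rows without copying.
// Leftover rows are spread over the first strips.
std::vector<Strip> split_rows(const std::vector<std::uint8_t>& image,
                              std::size_t width, std::size_t height,
                              std::size_t count) {
    assert(count >= 1 && image.size() == width * height);
    std::vector<Strip> strips;
    const std::size_t base = height / count;
    const std::size_t extra = height % count;
    std::size_t row = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t rows = base + (i < extra ? 1 : 0);
        strips.push_back({image.data() + row * width, width, rows});
        row += rows;
    }
    return strips;
}

// Hypothetical placeholder for a real JPEG-LS encode call.
std::vector<std::uint8_t> encode_strip(const Strip& s) {
    return std::vector<std::uint8_t>(s.data, s.data + s.width * s.rows);
}

// Encode every strip on its own thread; results keep the strip order.
std::vector<std::vector<std::uint8_t>> encode_strips_parallel(
    const std::vector<Strip>& strips) {
    std::vector<std::vector<std::uint8_t>> out(strips.size());
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < strips.size(); ++i) {
        // Each thread writes only to its own output slot.
        threads.emplace_back([&out, &strips, i] { out[i] = encode_strip(strips[i]); });
    }
    for (std::thread& t : threads) {
        t.join();
    }
    return out;
}
```

For the 5320 x 4600 frame, `split_rows(frame, 5320, 4600, 10)` yields 10 strips of 5320 x 460.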

JPEG XS sounds very interesting, and may be more what we need. I did a quick search, and could not find a reference C++ implementation that I could easily compile. If you know of one, can you provide a link?

Again, thanks very much.

Answer 3 · 2021-03-13T16:14:39.000Z

ISO sells reference software (ISO/IEC 21122-5), but I don't know how usable that reference software is.
There are patents involved with JPEG XS (www.jpegxs.com), which makes creating an open-source implementation difficult. The companies behind JPEG XS also have SDKs; perhaps they offer an evaluation version.