Optimization in zune-jpeg
Opened this issue · 6 comments
Hello there, first off thanks for this amazing piece of software; as I'm looking to write high performance code for image processing I absolutely love the work you've done here, for JPEG and other formats.
I've been poking at the AVX2 optimizations for the IDCT routine and have some questions/observations. I noticed there is a separate version of the IDCT that is written for AVX2. I've done very little tests but I found that the auto-vectorized scalar code for AVX2 is shorter than the hand-written AVX2 version, so I don't know which is better (I haven't checked for the presence of pesky branches or anything, just size so nothing really meaningful).
However I poked at the way the scalar code does the optimization pass by verifying if DCT coefficients are zero; the assembly generated from that is possibly the worst it can be with a basic loop over each element.
I quickly tested and managed to get an implementation that seem do quite decently auto-vectorize, including on AVX-512 (https://rust.godbolt.org/z/oKMWh44r8). Both combined, it might be worth checking if the hand-written AVX2 version is actually faster than the auto-vectorized scalar code for AVX2, or save maintenance trouble by dropping it if it isn't (at the cost of not having it "for free", including without appropriate flags).
Another thing I was considering, would it be possible to take advantage of the fact AVX2 can fit 16 u16 (and that AVX-512 can fit 32 of them) to "redesign" the AVX2 implementation to process 2 blocks at a time? I have a feeling this could be faster than processing blocks "sequentially" so-to-speak.
Further optimizations I was thinking of include doing quantization "as part of" the IDCT step so it can be done while data is already in registers and avoid some memory-register roundtrips - probably a small thing, but everything counts :p
Another one that might be interesting is avoid one of the transpose steps by performing it while reversing the zig-zag encoding, so the data is already ready to be processed.
That's all I've noticed while poking at the decoder, I haven't looked at it fully but I felt like it'd be worth an issue to being those ideas to the table and eventually have some feedback on them. I'm not super proficient when it comes to very high performance things, but I have some bit of knowledge and would love to be pointed in the right direction of those ideas are not good or could be improved!
Hi there, thank you for the praise :)
I quickly tested and managed to get an implementation that seem do quite decently auto-vectorize, including on AVX-512 (https://rust.godbolt.org/z/oKMWh44r8). Both combined, it might be worth checking if the hand-written AVX2 version is actually faster than the auto-vectorized scalar code for AVX2, or save maintenance trouble by dropping it if it isn't (at the cost of not having it "for free", including without appropriate flags).
Nice your is_zero
looks better than what I did, but I'd prefer if it were written in explicit simd, since e.g 1.62
generates some not so nice code ( https://rust.godbolt.org/z/ac7ovr1M3), but it's nice when it works
Another thing I was considering, would it be possible to take advantage of the fact AVX2 can fit 16 u16 (and that AVX-512 can fit 32 of them) to "redesign" the AVX2 implementation to process 2 blocks at a time? I have a feeling this could be faster than processing blocks "sequentially" so-to-speak.
This is a really big change, and one that would annoy anyone undertaking it and I'm not sure it would be worth it, specifically, you'd have to support different sampling factors, (4x4,2x2,1x1),handle edges, back up decoded MCU coefficients since we decode a 8x8 block to an array and idct it and determine how you will batch up, remember order is
Y,Cb,Cr,Y,Cb,Cr, so if you do Y,Cb
-> Idct
, you have to have some not so pretty code for handling subsampled code.
I once thought about doing this, in the early days when optimization was my only thought, sadly real life trumps optimizations.
Also libjpeg-turbo tried it, and didn't like it, see libjpeg-turbo/libjpeg-turbo#30 (comment)
Further optimizations I was thinking of include doing quantization "as part of" the IDCT step so it can be done while data is already in registers and avoid some memory-register roundtrips - probably a small thing, but everything counts :p
Yea,it was initially this way, but the decoding huffman part is ILP starved so I simply moved it there to just 'spice up' the CPU, but would be worth investigating.
Another one that might be interesting is avoid one of the transpose steps by performing it while reversing the zig-zag encoding, so the data is already ready to be processed.
Interesting, have any code that does this?
That's all I've noticed while poking at the decoder, I haven't looked at it fully but I felt like it'd be worth an issue to being those ideas to the table and eventually have some feedback on them. I'm not super proficient when it comes to very high performance things, but I have some bit of knowledge and would love to be pointed in the right direction of those ideas are not good or could be improved!
No one is an expert per se, enjoy exploring I'm happy to guide you why some part looks the way it does, there are some interesting optimizations to be found here.
Nice your
is_zero
looks better than what I did, but I'd prefer if it were written in explicit simd, since e.g1.62
generates some not so nice code ( https://rust.godbolt.org/z/ac7ovr1M3), but it's nice when it works
That's fair, I didn't take into account that factor. Even then I think it might be a good idea to revise the implementation to make it use SSE2 instructions if possible, which would be a perf gain for people who somehow need it. The first row can be handled as a separate case using the "classic" for loop and then use a plain 7x8 for loop to have this be transformed in fast code by LLVM when it can.
I'd prefer if it were written in explicit simd
In that case, unless I'm missing something it's possible to get the zero checking down to ~10 instructions using OR
operations and doing a single test at the end. https://rust.godbolt.org/z/e1Mxs8qj1
This is a really big change, and one that would annoy anyone undertaking it and I'm not sure it would be worth it
Fair enough. I had in mind that something like this, when it can work, coupled to doing smaller DCTs to reduce the image size by half / a quarter / an eighth of the original size can make for very fast image down-scaling, and keep feeding as much data as possible to the CPU.
I absolutely understand it might not be worth the trouble and there's a lot in the way, not everything is roses and sunshine
Interesting, have any code that does this?
Using this table for the un-zigzag routine should produce the transposed result directly without the need for a transpose. This is why I mentioned it, it just requires an update to the mapping table to drop the first transpose operation. (It should be noted this is also true for a JPEG encoder; when performing the Forward DCT it can do the 1D-DCT, transpose, 1D-DCT, transposed zigzag encode - skipping the re-transposition step)
0, 8, 1, 2, 9, 16, 24, 17,
10, 3, 4, 11, 18, 25, 32, 40,
33, 26, 19, 12, 5, 6, 13, 20,
27, 34, 41, 48, 56, 49, 42, 33,
28, 21, 14, 07, 15, 22, 29, 36,
43, 50, 57, 58, 51, 44, 37, 30,
23, 31, 38, 45, 52, 59, 60, 53,
46, 39, 47, 54, 61, 62, 55, 63,
In that case, unless I'm missing something it's possible to get the zero checking down to ~10 instructions using OR operations and doing a single test at the end. https://rust.godbolt.org/z/e1Mxs8qj1
Nice, I don't think there is any issue, let me try and replace it and run tests to see if anything goes wrong( or you can run it too)
Fair enough. I had in mind that something like this, when it can work, coupled to doing smaller DCTs to reduce the image size by half / a quarter / an eighth of the original size can make for very fast image down-scaling, and keep feeding as much data as possible to the CPU.
Yea jpeg-decoder and libjpeg-turbo do that, I don't due to the additional code maintenance
Using this table for the un-zigzag routine should produce the transposed result directly without the need for a transpose. This is why I mentioned it, it just requires an update to the mapping table to drop the first transpose operation. (It should be noted this is also true for a JPEG encoder; when performing the Forward DCT it can do the 1D-DCT, transpose, 1D-DCT, transposed zigzag encode - skipping the re-transposition step)
If you have some time, I'd like to see if you could test this to see if it's viable. It may end up being a performance benefit and those are always nice
Yea jpeg-decoder and libjpeg-turbo do that, I don't due to the additional code maintenance
Hm, that might be a problem for me, as I was looking forward to take advantage of certain shrink-on-load techniques to accelerate image downscaling for my use-case 😔
I was hoping some routines could be further optimized but I definitely get why it's a pain to do so. I might try to experiment with it on my end anyways, and see how far I can go until I give up 🤪
If you have some time, I'd like to see if you could test this to see if it's viable
I'll draft a PR when I have some time later this week then!
Been a bit more than a week already ;-; So! I did play with it a bit and I couldn't get it to play nice while in the middle of the huffman decode logic. I've done a bit more experiments, looking websites that look like it's 1997 explaining big brain stuff that hurt my small little brain 😵💫
I'll keep playing with it for now, maybe experiment some tricks I read from some scientific publications on optimizing huffman decode routines and see if I can put to good use my newly acquired knowledge about execution ports, pleasing the branch predictor, and all sorts of dark magic tricks that make a very specific type of sand think faster 🤔
Nice, here are some sites with some things about Huffman encoding/decoding.
Fabian's site https://fgiesen.wordpress.com/
Charles Bloom : http://cbloomrants.blogspot.com/
How libjpeg-turbo does it https://github.com/libjpeg-turbo/libjpeg-turbo/blob/main/jdhuff.c
Also stb has an interesting implementation here https://github.com/nothings/stb/blob/f4a71b13373436a2866c5d68f8f80ac6f0bc1ffe/stb_image.h#L2002-L2206