On performance of JPEG MMX encoder
mliszcz opened this issue · 7 comments
On the last kernel meeting I promised to do some benchmarks of the MMX code vs non-MMX code which can give a hint if it makes sense to have this MMX assembly in our codebase. But since it was decided that we should not implement encoding on our own, these results are of limited usefulness.
All tests were done on g++ 10.2.0 on Intel i7-8565U. I implemented a simple program that loads a bitmap (using CImg) and calls jpeg_encode_24
from cppTango. For those wanting to reproduce - I just placed the test program, CImg.h and all files from cppapi/server/jpeg and cppapi/server/jpeg_mmx in a single directory. Then I compiled with:
g++ test.cpp jpeg_*.cpp -o test -lX11 -lpthread -fPIE <any extra flags>
The test program runs encoding 10 times using source image stored in the same memory area (hot) and then runs encoding 10 times again but source is stored in a different place in memory (cold). Turns out that there are not much differences, maybe due to bitmap size, which is 4K (50MB, way too large for CPU cache). The program measures time for each pass in milliseconds.
flags | hot 1 | hot 2 | cold 1 | cold 2 | comments |
---|---|---|---|---|---|
9935 | 10780 | 11036 | 10872 | unoptimized 64bit | |
-m32 | 11399 | 11583 | 11740 | 11807 | unoptimized 32bit |
-O3 | 5430 | 5440 | 5221 | 5255 | optimized 64bit |
-O3 -m32 | 6317 | 6402 | 6342 | 6381 | optimized 32bit |
-m32 -DJPG_USE_ASM | 11432 | 11791 | 11444 | 11542 | unoptimized 32bit assembly |
-O3 -m32 -DJPG_USE_ASM | - | - | - | - | “./test” terminated by signal SIGSEGV (Address boundary error) |
-march=skylake | 10126 | 10184 | 10213 | 10501 | unoptimized 64bit skylake |
-m32 -march=skylake | 11121 | 12148 | 11877 | 11616 | unoptimized 32bit skylake |
-O3 -march=skylake | 4876 | 4914 | 5097 | 5166 | optimized 64bit skylake |
-O3 -m32 -march=skylake | 6321 | 6350 | 6318 | 6382 | optimized 32bit skylake |
-DJPG_USE_ASM -D_64BITS | jpeg_color_mmx.cpp:453: Error: operand type mismatch for `push' |
Conclusions:
- 64bit code is faster than 32bit code (executed on 64bit cpu and os)
- O3 flag improves run time a lot
- specifying target CPU architecture helps a bit (?)
- MMX version is not working with anything better than -O0 on my machine. Encoder crashes at runtime if compiled with -O1 or better.
- MMX version is not working in 64bit mode. This is not used by cppTango but still in MMX code we have branch depending on
_64BITS
macro being defined.
Regarding SIMD instructions generated by compiler:
flags | jpeg_dct_mmx.s | jpeg_dct.s |
---|---|---|
-m32 -DJPG_USE_ASM -S | MMX as specified in .cpp | no SIMD |
-m32 -S | empty | no SIMD |
-m32 -O3 -S | empty | SSE |
-m32 -O3 -march=skylake -S | empty | AVX |
The test code:
// CImg: https://github.com/dtschump/CImg/raw/master/CImg.h
// Test image: 4K / 51.26 MB from https://filesamples.com/formats/bmp
// https://filesamples.com/samples/image/bmp/sample_5184%C3%973456.bmp
#include <chrono>
#include <fstream>
#include <iostream>
#include <vector>
#include "CImg.h"
#include "jpeg_lib.h"
constexpr double quality = 90;
auto test_hot(int n, int width, int height, unsigned char* data)
{
int jpegSize0;
unsigned char* jpegData0;
jpeg_encode_rgb24(width, height, data, quality, &jpegSize0, &jpegData0);
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < n; ++i)
{
int jpegSize;
unsigned char* jpegData;
jpeg_encode_rgb24(width, height, data, quality, &jpegSize, &jpegData);
}
auto end = std::chrono::high_resolution_clock::now();
return end - start;
}
auto test_cold(int n, int width, int height, unsigned char* data)
{
auto data_size = width * height * 3; // 24 bpp
std::vector<unsigned char*> sources{};
std::vector<unsigned char*> blanks{};
for (int i = 0; i < n; ++i) {
unsigned char* data_copy = new unsigned char[data_size];
std::memcpy(data_copy, data, data_size);
sources.push_back(data_copy);
blanks.push_back(new unsigned char[data_size]);
}
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < n; ++i)
{
int jpegSize;
unsigned char* jpegData;
jpeg_encode_rgb24(width, height, sources[i], quality, &jpegSize, &jpegData);
}
auto end = std::chrono::high_resolution_clock::now();
return end - start;
}
int main(int, char**)
{
cimg_library::CImg<unsigned char> img("sample_5184_3456.bmp");
int n = 10;
auto time_hot = test_hot(n, img.width(), img.height(), img.data());
auto time_cold = test_cold(n, img.width(), img.height(), img.data());
std::cout << "hot = " << std::chrono::duration_cast<
std::chrono::milliseconds>(time_hot).count() << std::endl;
std::cout << "cold = " << std::chrono::duration_cast<
std::chrono::milliseconds>(time_cold).count() << std::endl;
return 0;
}
@bourtemb @t-b should we investigate the crashes of MMX code with -O1 and better? Do you remember if this has been reported already?
g++ test.cpp jpeg_*.cpp -o test -lX11 -lpthread -fPIE -g -Og -m32 -DJPG_USE_ASM
+(gdb) run
Starting program: /home/michal/Documents/tango/mmx/test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
0x565bc776 in conv_block_RGB24H2V2_mmx (width=1448851968,
rgb=0xf45ae01c "\256\261\300\252\270\231\253\236\244\250\301\243\233\261\264\255\245\256\215\257\270\257\263\247Ŗ\273\253\275\224\234\243\257\236\224\247\264\253\257\222\252\243\257\233\254\243\230\234\256\230\254\237\246\217\237\221\245\207\245\236\244\234\232\255\217\237\216\220\232\215\236\210\230\216\226\235\240\220\251\233\222\220\202\217\242\200\222\213\242\227\244\223\220\234\227\212\215\246\231\217\223\211\226\204\240\214\241\207\226\225\226\212\236\227\231\215\211\214\222\210\220\223\216\225\223\210\217\222\203\222\202\234\225\211\210~\230\202\214\205\226|\216x~\215u\202\202\201\227\206\220\177\207\204\214\177u\232\213u\205\205\221\202\242\226\224|\220\262\363\360͏gnrx\203\213\211|\204\200\213~}\212s\177\215r\200\206v\205\211}"..., y=0xf45ae010, cb=0x1440, cr=0xf11ea294) at jpeg_color_mmx.cpp:1064
1064 __asm__ (
+(gdb) bt
#0 0x565bc776 in conv_block_RGB24H2V2_mmx (width=1448851968,
rgb=0xf45ae01c "\256\261\300\252\270\231\253\236\244\250\301\243\233\261\264\255\245\256\215\257\270\257\263\247Ŗ\273\253\275\224\234\243\257\236\224\247\264\253\257\222\252\243\257\233\254\243\230\234\256\230\254\237\246\217\237\221\245\207\245\236\244\234\232\255\217\237\216\220\232\215\236\210\230\216\226\235\240\220\251\233\222\220\202\217\242\200\222\213\242\227\244\223\220\234\227\212\215\246\231\217\223\211\226\204\240\214\241\207\226\225\226\212\236\227\231\215\211\214\222\210\220\223\216\225\223\210\217\222\203\222\202\234\225\211\210~\230\202\214\205\226|\216x~\215u\202\202\201\227\206\220\177\207\204\214\177u\232\213u\205\205\221\202\242\226\224|\220\262\363\360͏gnrx\203\213\211|\204\200\213~}\212s\177\215r\200\206v\205\211}"..., y=0xf45ae010, cb=0x1440, cr=0xf11ea294) at jpeg_color_mmx.cpp:1064
#1 0xf11ea010 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
+(gdb)
Fun fact: I don't know if this a feature or a bug (and in which code) but when I ran this code:
int main(int, char**)
{
cimg_library::CImg<unsigned char> img("sample_640_426.bmp");
int jpegSize;
unsigned char* jpegData;
constexpr double quality = 90;
jpeg_encode_rgb24(img.width(), img.height(), img.data(),
quality, &jpegSize, &jpegData);
std::ofstream wf("out.jpeg", std::ios::out | std::ios::binary);
wf.write((char*)jpegData, jpegSize);
wf.close();
return 0;
}
On this image from the Internet:
I got it repeated 9 times in grayscale. The dimensions of both pictures are the same. The source is 24bpp.
should we investigate the crashes of MMX code with -O1 and better? Do you remember if this has been reported already?
We only ever compile with -O0
.
Fun fact: I don't know if this a feature or a bug (and in which code) but when I ran this code:
Not sure, I've never used the jpeg encoded attribute.
Thanks @t-b. I haven't noticed that we compile with only
-O0
.
Indeed. And there is probably a good reason for that. So, no need to investigate further these crashes indeed.
@bourtemb all jpeg code I took from latest tango-9-lts.
As for the grayscale 3x3 output image, I'm seeing the same results when encoding with libjpeg-turbo. So actually cppTango is correct here and it is CImg which does this.
I'm closing this issue as I think there is nothing more to add in this topic.