On performance of JPEG MMX encoder

On the last kernel meeting I promised to do some benchmarks of the MMX code vs non-MMX code which can give a hint if it makes sense to have this MMX assembly in our codebase. But since it was decided that we should not implement encoding on our own, these results are of limited usefulness.

All tests were done on g++ 10.2.0 on Intel i7-8565U. I implemented a simple program that loads a bitmap (using CImg) and calls jpeg_encode_24 from cppTango. For those wanting to reproduce - I just placed the test program, CImg.h and all files from cppapi/server/jpeg and cppapi/server/jpeg_mmx in a single directory. Then I compiled with:

g++ test.cpp jpeg_*.cpp -o test -lX11 -lpthread -fPIE <any extra flags>

The test program runs encoding 10 times using source image stored in the same memory area (hot) and then runs encoding 10 times again but source is stored in a different place in memory (cold). Turns out that there are not much differences, maybe due to bitmap size, which is 4K (50MB, way too large for CPU cache). The program measures time for each pass in milliseconds.

flags	hot 1	hot 2	cold 1	cold 2	comments
	9935	10780	11036	10872	unoptimized 64bit
-m32	11399	11583	11740	11807	unoptimized 32bit
-O3	5430	5440	5221	5255	optimized 64bit
-O3 -m32	6317	6402	6342	6381	optimized 32bit
-m32 -DJPG_USE_ASM	11432	11791	11444	11542	unoptimized 32bit assembly
-O3 -m32 -DJPG_USE_ASM	-	-	-	-	“./test” terminated by signal SIGSEGV (Address boundary error)
-march=skylake	10126	10184	10213	10501	unoptimized 64bit skylake
-m32 -march=skylake	11121	12148	11877	11616	unoptimized 32bit skylake
-O3 -march=skylake	4876	4914	5097	5166	optimized 64bit skylake
-O3 -m32 -march=skylake	6321	6350	6318	6382	optimized 32bit skylake
-DJPG_USE_ASM -D_64BITS					jpeg_color_mmx.cpp:453: Error: operand type mismatch for `push'

Conclusions:

64bit code is faster than 32bit code (executed on 64bit cpu and os)
O3 flag improves run time a lot
specifying target CPU architecture helps a bit (?)
MMX version is not working with anything better than -O0 on my machine. Encoder crashes at runtime if compiled with -O1 or better.
MMX version is not working in 64bit mode. This is not used by cppTango but still in MMX code we have branch depending on _64BITS macro being defined.

Regarding SIMD instructions generated by compiler:

flags	jpeg_dct_mmx.s	jpeg_dct.s
-m32 -DJPG_USE_ASM -S	MMX as specified in .cpp	no SIMD
-m32 -S	empty	no SIMD
-m32 -O3 -S	empty	SSE
-m32 -O3 -march=skylake -S	empty	AVX

The test code:

// CImg: https://github.com/dtschump/CImg/raw/master/CImg.h
// Test image: 4K / 51.26 MB from https://filesamples.com/formats/bmp
// https://filesamples.com/samples/image/bmp/sample_5184%C3%973456.bmp

#include <chrono>
#include <fstream>
#include <iostream>
#include <vector>

#include "CImg.h"
#include "jpeg_lib.h"

constexpr double quality = 90;

auto test_hot(int n, int width, int height, unsigned char* data)
{
    int jpegSize0;
    unsigned char* jpegData0;
    jpeg_encode_rgb24(width, height, data, quality, &jpegSize0, &jpegData0);

    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < n; ++i)
    {
        int jpegSize;
        unsigned char* jpegData;
        jpeg_encode_rgb24(width, height, data, quality, &jpegSize, &jpegData);
    }

    auto end = std::chrono::high_resolution_clock::now();

    return end - start;
}

auto test_cold(int n, int width, int height, unsigned char* data)
{
    auto data_size = width * height * 3; // 24 bpp
    std::vector<unsigned char*> sources{};
    std::vector<unsigned char*> blanks{};
    for (int i = 0; i < n; ++i) {
        unsigned char* data_copy = new unsigned char[data_size];
        std::memcpy(data_copy, data, data_size);
        sources.push_back(data_copy);
        blanks.push_back(new unsigned char[data_size]);
    }

    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < n; ++i)
    {
        int jpegSize;
        unsigned char* jpegData;
        jpeg_encode_rgb24(width, height, sources[i], quality, &jpegSize, &jpegData);
    }

    auto end = std::chrono::high_resolution_clock::now();

    return end - start;
}

int main(int, char**)
{
    cimg_library::CImg<unsigned char> img("sample_5184_3456.bmp");

    int n = 10;
    auto time_hot = test_hot(n, img.width(), img.height(), img.data());
    auto time_cold = test_cold(n, img.width(), img.height(), img.data());
    std::cout << "hot = " << std::chrono::duration_cast<
        std::chrono::milliseconds>(time_hot).count() << std::endl;
    std::cout << "cold = " << std::chrono::duration_cast<
        std::chrono::milliseconds>(time_cold).count() << std::endl;

    return 0;
}

@bourtemb @t-b should we investigate the crashes of MMX code with -O1 and better? Do you remember if this has been reported already?

g++ test.cpp jpeg_*.cpp -o test -lX11 -lpthread -fPIE -g -Og -m32 -DJPG_USE_ASM

+(gdb) run
Starting program: /home/michal/Documents/tango/mmx/test 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x565bc776 in conv_block_RGB24H2V2_mmx (width=1448851968, 
    rgb=0xf45ae01c "\256\261\300\252\270\231\253\236\244\250\301\243\233\261\264\255\245\256\215\257\270\257\263\247Ŗ\273\253\275\224\234\243\257\236\224\247\264\253\257\222\252\243\257\233\254\243\230\234\256\230\254\237\246\217\237\221\245\207\245\236\244\234\232\255\217\237\216\220\232\215\236\210\230\216\226\235\240\220\251\233\222\220\202\217\242\200\222\213\242\227\244\223\220\234\227\212\215\246\231\217\223\211\226\204\240\214\241\207\226\225\226\212\236\227\231\215\211\214\222\210\220\223\216\225\223\210\217\222\203\222\202\234\225\211\210~\230\202\214\205\226|\216x~\215u\202\202\201\227\206\220\177\207\204\214\177u\232\213u\205\205\221\202\242\226\224|\220\262\363\360͏gnrx\203\213\211|\204\200\213~}\212s\177\215r\200\206v\205\211}"..., y=0xf45ae010, cb=0x1440, cr=0xf11ea294) at jpeg_color_mmx.cpp:1064
1064          __asm__ (
+(gdb) bt
#0  0x565bc776 in conv_block_RGB24H2V2_mmx (width=1448851968, 
    rgb=0xf45ae01c "\256\261\300\252\270\231\253\236\244\250\301\243\233\261\264\255\245\256\215\257\270\257\263\247Ŗ\273\253\275\224\234\243\257\236\224\247\264\253\257\222\252\243\257\233\254\243\230\234\256\230\254\237\246\217\237\221\245\207\245\236\244\234\232\255\217\237\216\220\232\215\236\210\230\216\226\235\240\220\251\233\222\220\202\217\242\200\222\213\242\227\244\223\220\234\227\212\215\246\231\217\223\211\226\204\240\214\241\207\226\225\226\212\236\227\231\215\211\214\222\210\220\223\216\225\223\210\217\222\203\222\202\234\225\211\210~\230\202\214\205\226|\216x~\215u\202\202\201\227\206\220\177\207\204\214\177u\232\213u\205\205\221\202\242\226\224|\220\262\363\360͏gnrx\203\213\211|\204\200\213~}\212s\177\215r\200\206v\205\211}"..., y=0xf45ae010, cb=0x1440, cr=0xf11ea294) at jpeg_color_mmx.cpp:1064
#1  0xf11ea010 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
+(gdb)

Fun fact: I don't know if this a feature or a bug (and in which code) but when I ran this code:

int main(int, char**)
{
    cimg_library::CImg<unsigned char> img("sample_640_426.bmp");

    int jpegSize;
    unsigned char* jpegData;

    constexpr double quality = 90;

    jpeg_encode_rgb24(img.width(), img.height(), img.data(),
        quality, &jpegSize, &jpegData);

    std::ofstream wf("out.jpeg", std::ios::out | std::ios::binary);
    wf.write((char*)jpegData, jpegSize);
    wf.close();

    return 0;
}

On this image from the Internet:

I got it repeated 9 times in grayscale. The dimensions of both pictures are the same. The source is 24bpp.

should we investigate the crashes of MMX code with -O1 and better? Do you remember if this has been reported already?

We only ever compile with -O0.

cppTango/cppapi/server/jpeg_mmx/CMakeLists.txt

Line 22 in 1945e65

target_compile_options(jpeg_mmx_objects PRIVATE -O0)

Fun fact: I don't know if this a feature or a bug (and in which code) but when I ran this code:

Not sure, I've never used the jpeg encoded attribute.

Thanks @t-b. I haven't noticed that we compile with only -O0.

@bourtemb @t-b should we investigate the crashes of MMX code with -O1 and better? Do you remember if this has been reported already?

@mliszcz , I don't remember any report about that.
Which version of the code did you use during your tests?

Thanks @t-b. I haven't noticed that we compile with only -O0.

Indeed. And there is probably a good reason for that. So, no need to investigate further these crashes indeed.

@bourtemb all jpeg code I took from latest tango-9-lts.

As for the grayscale 3x3 output image, I'm seeing the same results when encoding with libjpeg-turbo. So actually cppTango is correct here and it is CImg which does this.

I'm closing this issue as I think there is nothing more to add in this topic.