tango-controls/cppTango

On performance of JPEG MMX encoder

mliszcz opened this issue · 7 comments

On the last kernel meeting I promised to do some benchmarks of the MMX code vs non-MMX code which can give a hint if it makes sense to have this MMX assembly in our codebase. But since it was decided that we should not implement encoding on our own, these results are of limited usefulness.

All tests were done on g++ 10.2.0 on Intel i7-8565U. I implemented a simple program that loads a bitmap (using CImg) and calls jpeg_encode_24 from cppTango. For those wanting to reproduce - I just placed the test program, CImg.h and all files from cppapi/server/jpeg and cppapi/server/jpeg_mmx in a single directory. Then I compiled with:

g++ test.cpp jpeg_*.cpp -o test -lX11 -lpthread -fPIE <any extra flags>

The test program runs encoding 10 times using source image stored in the same memory area (hot) and then runs encoding 10 times again but source is stored in a different place in memory (cold). Turns out that there are not much differences, maybe due to bitmap size, which is 4K (50MB, way too large for CPU cache). The program measures time for each pass in milliseconds.

flags hot 1 hot 2 cold 1 cold 2 comments
9935 10780 11036 10872 unoptimized 64bit
-m32 11399 11583 11740 11807 unoptimized 32bit
-O3 5430 5440 5221 5255 optimized 64bit
-O3 -m32 6317 6402 6342 6381 optimized 32bit
-m32 -DJPG_USE_ASM 11432 11791 11444 11542 unoptimized 32bit assembly
-O3 -m32 -DJPG_USE_ASM - - - - “./test” terminated by signal SIGSEGV (Address boundary error)
-march=skylake 10126 10184 10213 10501 unoptimized 64bit skylake
-m32 -march=skylake 11121 12148 11877 11616 unoptimized 32bit skylake
-O3 -march=skylake 4876 4914 5097 5166 optimized 64bit skylake
-O3 -m32 -march=skylake 6321 6350 6318 6382 optimized 32bit skylake
-DJPG_USE_ASM -D_64BITS jpeg_color_mmx.cpp:453: Error: operand type mismatch for `push'

Conclusions:

  • 64bit code is faster than 32bit code (executed on 64bit cpu and os)
  • O3 flag improves run time a lot
  • specifying target CPU architecture helps a bit (?)
  • MMX version is not working with anything better than -O0 on my machine. Encoder crashes at runtime if compiled with -O1 or better.
  • MMX version is not working in 64bit mode. This is not used by cppTango but still in MMX code we have branch depending on _64BITS macro being defined.

Regarding SIMD instructions generated by compiler:

flags jpeg_dct_mmx.s jpeg_dct.s
-m32 -DJPG_USE_ASM -S MMX as specified in .cpp no SIMD
-m32 -S empty no SIMD
-m32 -O3 -S empty SSE
-m32 -O3 -march=skylake -S empty AVX

The test code:

// CImg: https://github.com/dtschump/CImg/raw/master/CImg.h
// Test image: 4K / 51.26 MB from https://filesamples.com/formats/bmp
// https://filesamples.com/samples/image/bmp/sample_5184%C3%973456.bmp

#include <chrono>
#include <fstream>
#include <iostream>
#include <vector>

#include "CImg.h"
#include "jpeg_lib.h"

constexpr double quality = 90;

auto test_hot(int n, int width, int height, unsigned char* data)
{
    int jpegSize0;
    unsigned char* jpegData0;
    jpeg_encode_rgb24(width, height, data, quality, &jpegSize0, &jpegData0);

    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < n; ++i)
    {
        int jpegSize;
        unsigned char* jpegData;
        jpeg_encode_rgb24(width, height, data, quality, &jpegSize, &jpegData);
    }

    auto end = std::chrono::high_resolution_clock::now();

    return end - start;
}

auto test_cold(int n, int width, int height, unsigned char* data)
{
    auto data_size = width * height * 3; // 24 bpp
    std::vector<unsigned char*> sources{};
    std::vector<unsigned char*> blanks{};
    for (int i = 0; i < n; ++i) {
        unsigned char* data_copy = new unsigned char[data_size];
        std::memcpy(data_copy, data, data_size);
        sources.push_back(data_copy);
        blanks.push_back(new unsigned char[data_size]);
    }

    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < n; ++i)
    {
        int jpegSize;
        unsigned char* jpegData;
        jpeg_encode_rgb24(width, height, sources[i], quality, &jpegSize, &jpegData);
    }

    auto end = std::chrono::high_resolution_clock::now();

    return end - start;
}

int main(int, char**)
{
    cimg_library::CImg<unsigned char> img("sample_5184_3456.bmp");

    int n = 10;
    auto time_hot = test_hot(n, img.width(), img.height(), img.data());
    auto time_cold = test_cold(n, img.width(), img.height(), img.data());
    std::cout << "hot = " << std::chrono::duration_cast<
        std::chrono::milliseconds>(time_hot).count() << std::endl;
    std::cout << "cold = " << std::chrono::duration_cast<
        std::chrono::milliseconds>(time_cold).count() << std::endl;

    return 0;
}

@bourtemb @t-b should we investigate the crashes of MMX code with -O1 and better? Do you remember if this has been reported already?

g++ test.cpp jpeg_*.cpp -o test -lX11 -lpthread -fPIE -g -Og -m32 -DJPG_USE_ASM
+(gdb) run
Starting program: /home/michal/Documents/tango/mmx/test 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x565bc776 in conv_block_RGB24H2V2_mmx (width=1448851968, 
    rgb=0xf45ae01c "\256\261\300\252\270\231\253\236\244\250\301\243\233\261\264\255\245\256\215\257\270\257\263\247Ŗ\273\253\275\224\234\243\257\236\224\247\264\253\257\222\252\243\257\233\254\243\230\234\256\230\254\237\246\217\237\221\245\207\245\236\244\234\232\255\217\237\216\220\232\215\236\210\230\216\226\235\240\220\251\233\222\220\202\217\242\200\222\213\242\227\244\223\220\234\227\212\215\246\231\217\223\211\226\204\240\214\241\207\226\225\226\212\236\227\231\215\211\214\222\210\220\223\216\225\223\210\217\222\203\222\202\234\225\211\210~\230\202\214\205\226|\216x~\215u\202\202\201\227\206\220\177\207\204\214\177u\232\213u\205\205\221\202\242\226\224|\220\262\363\360͏gnrx\203\213\211|\204\200\213~}\212s\177\215r\200\206v\205\211}"..., y=0xf45ae010, cb=0x1440, cr=0xf11ea294) at jpeg_color_mmx.cpp:1064
1064          __asm__ (
+(gdb) bt
#0  0x565bc776 in conv_block_RGB24H2V2_mmx (width=1448851968, 
    rgb=0xf45ae01c "\256\261\300\252\270\231\253\236\244\250\301\243\233\261\264\255\245\256\215\257\270\257\263\247Ŗ\273\253\275\224\234\243\257\236\224\247\264\253\257\222\252\243\257\233\254\243\230\234\256\230\254\237\246\217\237\221\245\207\245\236\244\234\232\255\217\237\216\220\232\215\236\210\230\216\226\235\240\220\251\233\222\220\202\217\242\200\222\213\242\227\244\223\220\234\227\212\215\246\231\217\223\211\226\204\240\214\241\207\226\225\226\212\236\227\231\215\211\214\222\210\220\223\216\225\223\210\217\222\203\222\202\234\225\211\210~\230\202\214\205\226|\216x~\215u\202\202\201\227\206\220\177\207\204\214\177u\232\213u\205\205\221\202\242\226\224|\220\262\363\360͏gnrx\203\213\211|\204\200\213~}\212s\177\215r\200\206v\205\211}"..., y=0xf45ae010, cb=0x1440, cr=0xf11ea294) at jpeg_color_mmx.cpp:1064
#1  0xf11ea010 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
+(gdb) 

Fun fact: I don't know if this a feature or a bug (and in which code) but when I ran this code:

int main(int, char**)
{
    cimg_library::CImg<unsigned char> img("sample_640_426.bmp");

    int jpegSize;
    unsigned char* jpegData;

    constexpr double quality = 90;

    jpeg_encode_rgb24(img.width(), img.height(), img.data(),
        quality, &jpegSize, &jpegData);

    std::ofstream wf("out.jpeg", std::ios::out | std::ios::binary);
    wf.write((char*)jpegData, jpegSize);
    wf.close();

    return 0;
}

On this image from the Internet:
image

I got it repeated 9 times in grayscale. The dimensions of both pictures are the same. The source is 24bpp.
image

t-b commented

should we investigate the crashes of MMX code with -O1 and better? Do you remember if this has been reported already?

We only ever compile with -O0.

target_compile_options(jpeg_mmx_objects PRIVATE -O0)

Fun fact: I don't know if this a feature or a bug (and in which code) but when I ran this code:

Not sure, I've never used the jpeg encoded attribute.

Thanks @t-b. I haven't noticed that we compile with only -O0.

@bourtemb @t-b should we investigate the crashes of MMX code with -O1 and better? Do you remember if this has been reported already?

@mliszcz , I don't remember any report about that.
Which version of the code did you use during your tests?

Thanks @t-b. I haven't noticed that we compile with only -O0.

Indeed. And there is probably a good reason for that. So, no need to investigate further these crashes indeed.

@bourtemb all jpeg code I took from latest tango-9-lts.

As for the grayscale 3x3 output image, I'm seeing the same results when encoding with libjpeg-turbo. So actually cppTango is correct here and it is CImg which does this.

I'm closing this issue as I think there is nothing more to add in this topic.