The goal of this practical work is to use x86_64 SIMD (Single Instruction, Multiple Data) operations, and more particularly Intel's SSE extensions and their 128-bit registers, to decode a video in Motion JPEG (MJPEG) format.
Running the code requires an Intel x86_64 processor and the SDL1.2 library (or SDL12-compat). Documentation for the SSE instructions is available on Intel's website: Intel Intrinsics Guide
Everything we need is in the folder tp2_src_etd.
The video to be decoded is tp2_src_etd/ice_age_256x144_444.mjpeg.
The three implementations of float conversion are as follows:
conv-float.c: "classic" float conversion, without any particular optimization
conv-unrolled4-float-a-trou.c: conversion of floats four at a time (loop unrolled by 4), in contrast to the classic implementation
conv-sse-a-trou.c: conversion of floats four at a time using 128-bit registers and SSE instructions
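The difference between the three approaches can be illustrated with a minimal sketch. It is not the code from the conv-*.c files: it assumes the conversion is the YCbCr to RGB step of JPEG decoding and only computes the red channel with the standard JFIF formula R = Y + 1.402 * (Cr - 128). The function names red_scalar, red_unrolled4 and red_sse are hypothetical, and clamping to [0, 255] is left out for brevity.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128 and the _mm_*_ps functions */

/* Assumed coefficient from the standard JFIF formula
   R = Y + 1.402 * (Cr - 128); not taken from the practical's sources. */
#define CR_TO_R 1.402f

/* "Classic" version: one pixel per loop iteration. */
void red_scalar(const float *y, const float *cr, float *r, int n)
{
    for (int i = 0; i < n; i++)
        r[i] = y[i] + CR_TO_R * (cr[i] - 128.0f);
}

/* Unrolled version: four pixels per iteration, still scalar arithmetic.
   n must be a multiple of 4. */
void red_unrolled4(const float *y, const float *cr, float *r, int n)
{
    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4) {
        r[i]     = y[i]     + CR_TO_R * (cr[i]     - 128.0f);
        r[i + 1] = y[i + 1] + CR_TO_R * (cr[i + 1] - 128.0f);
        r[i + 2] = y[i + 2] + CR_TO_R * (cr[i + 2] - 128.0f);
        r[i + 3] = y[i + 3] + CR_TO_R * (cr[i + 3] - 128.0f);
    }
}

/* SSE version: four pixels per iteration, packed in one 128-bit register.
   Unaligned loads and stores are used so the buffers need no special
   alignment. n must be a multiple of 4. */
void red_sse(const float *y, const float *cr, float *r, int n)
{
    assert(n % 4 == 0);
    const __m128 coef   = _mm_set1_ps(CR_TO_R);  /* {1.402, 1.402, 1.402, 1.402} */
    const __m128 offset = _mm_set1_ps(128.0f);   /* {128, 128, 128, 128} */

    for (int i = 0; i < n; i += 4) {
        __m128 vy  = _mm_loadu_ps(y + i);        /* 4 luma values */
        __m128 vcr = _mm_loadu_ps(cr + i);       /* 4 Cr values */
        __m128 vr  = _mm_add_ps(vy, _mm_mul_ps(coef, _mm_sub_ps(vcr, offset)));
        _mm_storeu_ps(r + i, vr);                /* 4 red values at once */
    }
}
```

All three functions perform the same arithmetic; only the number of pixels handled per iteration, and whether they share one register, changes. An optimizing compiler may also vectorize the first two loops on its own.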
To compile one of these conversion methods, just use:
make <mjpeg-float | mjpeg-conv-unrolled4-float-a-trou | mjpeg-conv-sse-a-trou>
The resulting executable takes 1, 2 or 3 arguments. The first is the name of the video (ice_age_256x144_444.mjpeg), the second is an integer giving the number of frames to decode (-1 for all), and the third, which is optional, disables the display of the result so that the raw decoding speed can be measured.
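As an illustration of this command-line interface, here is a minimal sketch of how such arguments could be parsed. It is not the decoder's actual code: the struct decode_options and the function parse_decode_args are hypothetical, and the default of decoding all frames when only the video name is given is an assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical container for the three arguments described above. */
struct decode_options {
    const char *video_path;  /* first argument: name of the video */
    long frame_count;        /* second argument: frames to decode, -1 = all */
    int display;             /* 0 when a third argument is present */
};

/* Returns 0 on success, -1 on a usage error.
   argv[0] is the program name, as in main(). */
int parse_decode_args(int argc, char **argv, struct decode_options *opt)
{
    assert(argv != NULL && opt != NULL);

    if (argc < 2 || argc > 4)
        return -1;

    opt->video_path = argv[1];
    opt->frame_count = -1;   /* assumption: decode everything by default */
    opt->display = 1;

    if (argc >= 3) {
        char *end;
        opt->frame_count = strtol(argv[2], &end, 10);
        if (end == argv[2] || *end != '\0')
            return -1;       /* second argument is not an integer */
    }

    /* Only the presence of the third argument matters, not its value. */
    if (argc == 4)
        opt->display = 0;

    return 0;
}
```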
Then, to see how long decoding the full movie takes, just run the following (azerty is an arbitrary third argument whose presence disables the display):
time ./mjpeg-float ice_age_256x144_444.mjpeg -1 azerty
For each of the three implementations, over 10 runs each, I obtained the following results:
We can see that the unrolled float and SSE implementations are slightly faster than the classic float implementation. However, on the machine on which I carried out this work, using SSE instructions does not necessarily give better performance than the unrolled 4-by-4 conversion.
This work was done as part of the Conception and Exploration of multi-core architectures course given in the last year of the Information Systems Engineering specialization at Grenoble INP - Ensimag, UGA.