phoboslab/qoi

Change wire format to little-endian

nigeltao opened this issue ยท 4 comments

Spun out of #28 (comment) where I said:

I would instead advise to make everything little endian. Not just the header fields, but also the bytecodes. Almost all CPUs used today are little-endian, and support unaligned loads.

I just uploaded a proof of concept which showed an approx 1.05x improvement in decode speed. Copy/pasting the commit message:

Change wire format to little-endian

This is a backwards incompatible change to the wire format. For example,
the QOI_DIFF_24 bit-packing encoding changes from:

| MSB   6   5   4   3   2   1 LSB |
+---------------------------------+
|   1   1   1   0  r4  r3  r2  r1 | addr+0
|  r0  g4  g3  g2  g1  g0  b4  b3 | addr+1
|  b2  b1  b0  a4  a3  a2  a1  a0 | addr+2

to:

| MSB   6   5   4   3   2   1 LSB |
+---------------------------------+
|  r3  r2  r1  r0   0   1   1   1 | addr+0
|  b1  b0  g4  g3  g2  g1  g0  r4 | addr+1
|  a4  a3  a2  a1  a0  b4  b3  b2 | addr+2

----

Decode speed-up on an Intel NUC (Comet Lake, BXNUC10i5FNKPA):
1.07x  images/kodak
1.04x  images/misc
1.01x  images/screenshots
1.06x  images/textures
1.05x  images/wallpaper

In detail:

images/kodak
        decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:       8.0       144.2         49.02          2.73       717
stbi:         8.8        84.2         44.43          4.67       979
qoi:          3.6         4.2        109.80         93.41       771
qoile:        3.3         4.2        117.70         93.41       771

images/misc
        decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:       9.3        89.1         82.18          8.57       335
stbi:         8.2        77.5         93.17          9.85       497
qoi:          2.8         3.1        275.03        245.73       451
qoile:        2.7         3.1        285.92        245.95       451

images/screenshots
        decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:      45.3       519.5        181.85         15.84      2219
stbi:        35.2       622.5        233.73         13.22      2821
qoi:         24.3        23.9        339.08        344.26      2582
qoile:       23.9        23.9        343.80        344.20      2582

images/textures
        decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:       2.6        33.1         50.84          3.92       163
stbi:         2.5        19.8         52.26          6.56       232
qoi:          0.9         1.0        149.74        130.18       184
qoile:        0.8         1.0        158.48        130.22       184

images/wallpaper
        decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:     154.4      2289.3         60.71          4.09      9224
stbi:       190.0      1455.1         49.32          6.44     13299
qoi:         71.2        76.6        131.57        122.35     10647
qoile:       67.9        76.6        138.11        122.37     10647

The key difference between qoi.h and qoile.h is:

391c183
< void *qoi_decode(const void *data, int size, int *out_w, int *out_h, int channels) {
---
> void *qoile_decode(const void *data, int size, int *out_w, int *out_h, int channels) {
427c219
< 			int b1 = bytes[p++];
---
> 			uint32_t b = peek_u32le(bytes + p);
429,462c221,257
< 			if ((b1 & QOIBE_MASK_2) == QOIBE_INDEX) {
< 				px = index[b1 ^ QOIBE_INDEX];
< 			}
< 			else if ((b1 & QOIBE_MASK_3) == QOIBE_RUN_8) {
< 				run = (b1 & 0x1f);
< 			}
< 			else if ((b1 & QOIBE_MASK_3) == QOIBE_RUN_16) {
< 				int b2 = bytes[p++];
< 				run = (((b1 & 0x1f) << 8) | (b2)) + 32;
< 			}
< 			else if ((b1 & QOIBE_MASK_2) == QOIBE_DIFF_8) {
< 				px.rgba.r += ((b1 >> 4) & 0x03) - 1;
< 				px.rgba.g += ((b1 >> 2) & 0x03) - 1;
< 				px.rgba.b += ( b1       & 0x03) - 1;
< 			}
< 			else if ((b1 & QOIBE_MASK_3) == QOIBE_DIFF_16) {
< 				int b2 = bytes[p++];
< 				px.rgba.r += (b1 & 0x1f) - 15;
< 				px.rgba.g += (b2 >> 4) - 7;
< 				px.rgba.b += (b2 & 0x0f) - 7;
< 			}
< 			else if ((b1 & QOIBE_MASK_4) == QOIBE_DIFF_24) {
< 				int b2 = bytes[p++];
< 				int b3 = bytes[p++];
< 				px.rgba.r += (((b1 & 0x0f) << 1) | (b2 >> 7)) - 15;
< 				px.rgba.g +=  ((b2 & 0x7c) >> 2) - 15;
< 				px.rgba.b += (((b2 & 0x03) << 3) | ((b3 & 0xe0) >> 5)) - 15;
< 				px.rgba.a +=   (b3 & 0x1f) - 15;
< 			}
< 			else if ((b1 & QOIBE_MASK_4) == QOIBE_COLOR) {
< 				if (b1 & 8) { px.rgba.r = bytes[p++]; }
< 				if (b1 & 4) { px.rgba.g = bytes[p++]; }
< 				if (b1 & 2) { px.rgba.b = bytes[p++]; }
< 				if (b1 & 1) { px.rgba.a = bytes[p++]; }
---
> 			if ((b & QOILE_MASK_2) == QOILE_INDEX) {
> 				px = index[(b >> 2) & 63];
> 				p += 1;
> 			}
> 			else if ((b & QOILE_MASK_3) == QOILE_RUN_8) {
> 				run = ((b >> 3) & 0x1F);
> 				p += 1;
> 			}
> 			else if ((b & QOILE_MASK_3) == QOILE_RUN_16) {
> 				run = ((b >> 3) & 0x1FFF);
> 				p += 2;
> 			}
> 			else if ((b & QOILE_MASK_2) == QOILE_DIFF_8) {
> 				px.rgba.r += ((b >> 2) & 0x03) - 1;
> 				px.rgba.g += ((b >> 4) & 0x03) - 1;
> 				px.rgba.b += ((b >> 6) & 0x03) - 1;
> 				p += 1;
> 			}
> 			else if ((b & QOILE_MASK_3) == QOILE_DIFF_16) {
> 				px.rgba.r += ((b >>  3) & 0x1F) - 15;
> 				px.rgba.g += ((b >>  8) & 0x0F) - 7;
> 				px.rgba.b += ((b >> 12) & 0x0F) - 7;
> 				p += 2;
> 			}
> 			else if ((b & QOILE_MASK_4) == QOILE_DIFF_24) {
> 				px.rgba.r += ((b >>  4) & 0x1F) - 15;
> 				px.rgba.g += ((b >>  9) & 0x1F) - 15;
> 				px.rgba.b += ((b >> 14) & 0x1F) - 15;
> 				px.rgba.a += ((b >> 19) & 0x1F) - 15;
> 				p += 3;
> 			}
> 			else if ((b & QOILE_MASK_4) == QOILE_COLOR) {
> 				p += 1;
> 				if (b & 0x10) { px.rgba.r = bytes[p++]; }
> 				if (b & 0x20) { px.rgba.g = bytes[p++]; }
> 				if (b & 0x40) { px.rgba.b = bytes[p++]; }
> 				if (b & 0x80) { px.rgba.a = bytes[p++]; }

If the highest bits of a byte decide what should be done the checks can be simplified.
My (Seed7) code works with the following definitions:

const integer: QOI_INDEX   is 16#3f;  # 00xxxxxx
const integer: QOI_RUN_8   is 16#5f;  # 010xxxxx
const integer: QOI_RUN_16  is 16#7f;  # 011xxxxx
const integer: QOI_DIFF_8  is 16#bf;  # 10xxxxxx
const integer: QOI_DIFF_16 is 16#df;  # 110xxxxx
const integer: QOI_DIFF_24 is 16#ef;  # 1110xxxx
const integer: QOI_COLOR   is 16#ff;  # 1111xxxx

This allows checks without masks:

            if byte1 <= QOI_INDEX then
              ...
            elsif byte1 <= QOI_RUN_8 then
              ...
            elsif byte1 <= QOI_RUN_16 then
              ...
            elsif byte1 <= QOI_DIFF_8 then
              ...
            elsif byte1 <= QOI_DIFF_16 then
              ...
            elsif byte1 <= QOI_DIFF_24 then
              ...
            else  # byte1 <= QOI_COLOR
              ...
            end if;

This trick works also in C. I have not measured how much time is saved with this in C. If the lowest bits are used to do the decisions this type of optimization is not possible.

The file format for QOI will not change anymore. See #37 for more info.

Ideas for a successor to QOI should be discussed here: https://github.com/nigeltao/qoi2-bikeshed/issues

Qix- commented

Please reconsider. This is a huge wart for the spec, if you ask me. There's little reason to have big-endian ints in 2021.

As far as I can tell it's only the header fields that are big endian. The rest of this diff is about the bit encoding inside the bytes which isn't really endianness related (and I think @ThomasMertes is right, it makes sense to keep it as-is).

I totally agree though it's a super weird decision to make the headers big endian and should probably be changed.