Change wire format to little-endian

Question

Change wire format to little-endian

nigeltao opened this issue 3 years ago · 4 comments

Spun out of #28 (comment) where I said:

I would instead advise to make everything little endian. Not just the header fields, but also the bytecodes. Almost all CPUs used today are little-endian, and support unaligned loads.

I just uploaded a proof of concept which showed an approx 1.05x improvement in decode speed. Copy/pasting the commit message:

Change wire format to little-endian

This is a backwards incompatible change to the wire format. For example,
the QOI_DIFF_24 bit-packing encoding changes from:

| MSB   6   5   4   3   2   1 LSB |
+---------------------------------+
|   1   1   1   0  r4  r3  r2  r1 | addr+0
|  r0  g4  g3  g2  g1  g0  b4  b3 | addr+1
|  b2  b1  b0  a4  a3  a2  a1  a0 | addr+2

to:

| MSB   6   5   4   3   2   1 LSB |
+---------------------------------+
|  r3  r2  r1  r0   0   1   1   1 | addr+0
|  b1  b0  g4  g3  g2  g1  g0  r4 | addr+1
|  a4  a3  a2  a1  a0  b4  b3  b2 | addr+2

----

Decode speed-up on an Intel NUC (Comet Lake, BXNUC10i5FNKPA):
1.07x  images/kodak
1.04x  images/misc
1.01x  images/screenshots
1.06x  images/textures
1.05x  images/wallpaper

In detail:

images/kodak
        decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:       8.0       144.2         49.02          2.73       717
stbi:         8.8        84.2         44.43          4.67       979
qoi:          3.6         4.2        109.80         93.41       771
qoile:        3.3         4.2        117.70         93.41       771

images/misc
        decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:       9.3        89.1         82.18          8.57       335
stbi:         8.2        77.5         93.17          9.85       497
qoi:          2.8         3.1        275.03        245.73       451
qoile:        2.7         3.1        285.92        245.95       451

images/screenshots
        decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:      45.3       519.5        181.85         15.84      2219
stbi:        35.2       622.5        233.73         13.22      2821
qoi:         24.3        23.9        339.08        344.26      2582
qoile:       23.9        23.9        343.80        344.20      2582

images/textures
        decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:       2.6        33.1         50.84          3.92       163
stbi:         2.5        19.8         52.26          6.56       232
qoi:          0.9         1.0        149.74        130.18       184
qoile:        0.8         1.0        158.48        130.22       184

images/wallpaper
        decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:     154.4      2289.3         60.71          4.09      9224
stbi:       190.0      1455.1         49.32          6.44     13299
qoi:         71.2        76.6        131.57        122.35     10647
qoile:       67.9        76.6        138.11        122.37     10647

The key difference between qoi.h and qoile.h is:

391c183
< void *qoi_decode(const void *data, int size, int *out_w, int *out_h, int channels) {
---
> void *qoile_decode(const void *data, int size, int *out_w, int *out_h, int channels) {
427c219
< 			int b1 = bytes[p++];
---
> 			uint32_t b = peek_u32le(bytes + p);
429,462c221,257
< 			if ((b1 & QOIBE_MASK_2) == QOIBE_INDEX) {
< 				px = index[b1 ^ QOIBE_INDEX];
< 			}
< 			else if ((b1 & QOIBE_MASK_3) == QOIBE_RUN_8) {
< 				run = (b1 & 0x1f);
< 			}
< 			else if ((b1 & QOIBE_MASK_3) == QOIBE_RUN_16) {
< 				int b2 = bytes[p++];
< 				run = (((b1 & 0x1f) << 8) | (b2)) + 32;
< 			}
< 			else if ((b1 & QOIBE_MASK_2) == QOIBE_DIFF_8) {
< 				px.rgba.r += ((b1 >> 4) & 0x03) - 1;
< 				px.rgba.g += ((b1 >> 2) & 0x03) - 1;
< 				px.rgba.b += ( b1       & 0x03) - 1;
< 			}
< 			else if ((b1 & QOIBE_MASK_3) == QOIBE_DIFF_16) {
< 				int b2 = bytes[p++];
< 				px.rgba.r += (b1 & 0x1f) - 15;
< 				px.rgba.g += (b2 >> 4) - 7;
< 				px.rgba.b += (b2 & 0x0f) - 7;
< 			}
< 			else if ((b1 & QOIBE_MASK_4) == QOIBE_DIFF_24) {
< 				int b2 = bytes[p++];
< 				int b3 = bytes[p++];
< 				px.rgba.r += (((b1 & 0x0f) << 1) | (b2 >> 7)) - 15;
< 				px.rgba.g +=  ((b2 & 0x7c) >> 2) - 15;
< 				px.rgba.b += (((b2 & 0x03) << 3) | ((b3 & 0xe0) >> 5)) - 15;
< 				px.rgba.a +=   (b3 & 0x1f) - 15;
< 			}
< 			else if ((b1 & QOIBE_MASK_4) == QOIBE_COLOR) {
< 				if (b1 & 8) { px.rgba.r = bytes[p++]; }
< 				if (b1 & 4) { px.rgba.g = bytes[p++]; }
< 				if (b1 & 2) { px.rgba.b = bytes[p++]; }
< 				if (b1 & 1) { px.rgba.a = bytes[p++]; }
---
> 			if ((b & QOILE_MASK_2) == QOILE_INDEX) {
> 				px = index[(b >> 2) & 63];
> 				p += 1;
> 			}
> 			else if ((b & QOILE_MASK_3) == QOILE_RUN_8) {
> 				run = ((b >> 3) & 0x1F);
> 				p += 1;
> 			}
> 			else if ((b & QOILE_MASK_3) == QOILE_RUN_16) {
> 				run = ((b >> 3) & 0x1FFF);
> 				p += 2;
> 			}
> 			else if ((b & QOILE_MASK_2) == QOILE_DIFF_8) {
> 				px.rgba.r += ((b >> 2) & 0x03) - 1;
> 				px.rgba.g += ((b >> 4) & 0x03) - 1;
> 				px.rgba.b += ((b >> 6) & 0x03) - 1;
> 				p += 1;
> 			}
> 			else if ((b & QOILE_MASK_3) == QOILE_DIFF_16) {
> 				px.rgba.r += ((b >>  3) & 0x1F) - 15;
> 				px.rgba.g += ((b >>  8) & 0x0F) - 7;
> 				px.rgba.b += ((b >> 12) & 0x0F) - 7;
> 				p += 2;
> 			}
> 			else if ((b & QOILE_MASK_4) == QOILE_DIFF_24) {
> 				px.rgba.r += ((b >>  4) & 0x1F) - 15;
> 				px.rgba.g += ((b >>  9) & 0x1F) - 15;
> 				px.rgba.b += ((b >> 14) & 0x1F) - 15;
> 				px.rgba.a += ((b >> 19) & 0x1F) - 15;
> 				p += 3;
> 			}
> 			else if ((b & QOILE_MASK_4) == QOILE_COLOR) {
> 				p += 1;
> 				if (b & 0x10) { px.rgba.r = bytes[p++]; }
> 				if (b & 0x20) { px.rgba.g = bytes[p++]; }
> 				if (b & 0x40) { px.rgba.b = bytes[p++]; }
> 				if (b & 0x80) { px.rgba.a = bytes[p++]; }

Answer 1 · 2021-11-27T11:03:27.000Z

If the highest bits of a byte decide what should be done the checks can be simplified.
My (Seed7) code works with the following definitions:

const integer: QOI_INDEX   is 16#3f;  # 00xxxxxx
const integer: QOI_RUN_8   is 16#5f;  # 010xxxxx
const integer: QOI_RUN_16  is 16#7f;  # 011xxxxx
const integer: QOI_DIFF_8  is 16#bf;  # 10xxxxxx
const integer: QOI_DIFF_16 is 16#df;  # 110xxxxx
const integer: QOI_DIFF_24 is 16#ef;  # 1110xxxx
const integer: QOI_COLOR   is 16#ff;  # 1111xxxx

This allows checks without masks:

            if byte1 <= QOI_INDEX then
              ...
            elsif byte1 <= QOI_RUN_8 then
              ...
            elsif byte1 <= QOI_RUN_16 then
              ...
            elsif byte1 <= QOI_DIFF_8 then
              ...
            elsif byte1 <= QOI_DIFF_16 then
              ...
            elsif byte1 <= QOI_DIFF_24 then
              ...
            else  # byte1 <= QOI_COLOR
              ...
            end if;

This trick works also in C. I have not measured how much time is saved with this in C. If the lowest bits are used to do the decisions this type of optimization is not possible.

Answer 2 · 2021-11-28T16:49:21.000Z

The file format for QOI will not change anymore. See #37 for more info.

Ideas for a successor to QOI should be discussed here: https://github.com/nigeltao/qoi2-bikeshed/issues

Answer 3 · 2021-12-20T15:21:49.000Z

Please reconsider. This is a huge wart for the spec, if you ask me. There's little reason to have big-endian ints in 2021.

Answer 4 · 2021-12-20T21:57:30.000Z

As far as I can tell it's only the header fields that are big endian. The rest of this diff is about the bit encoding inside the bytes which isn't really endianness related (and I think @ThomasMertes is right, it makes sense to keep it as-is).

I totally agree though it's a super weird decision to make the headers big endian and should probably be changed.