phoboslab/qoi

Upcoming breaking changes & locking in the data format

phoboslab opened this issue · 50 comments

Saying that I'm surprised by the amount of attention this is getting would be an understatement. There's lots of discussion going on about how the data format and compression could be improved and what features could be added.

I want to give my views here and discuss how to go forward.

First and foremost, I want QOI to be simple. Please keep this in mind. I consider the general compression scheme to be done. There's lots of interesting ideas on how to improve compression. I want to tinker with these ideas - but not for QOI.

QOI will not be versioned. There will only be one version of QOI's data format. I'm hoping we will be able to strictly define what exactly that is in the coming days.

QOI will only support 24bit RGB and 32bit RGBA data. I acknowledge there's some need for fewer or more channels and also for higher bit depths or paletted color - QOI will not serve these needs.

So, with all that said, there's some breaking changes that are probably worthwhile. I want to discuss if and how to implement those.

Proposed changes

  1. width, height and size in the header should be stored as big endian for consistency with the rest of the format (this change already happened in c03edb2)

  2. Color differences (QOI_DIFF_*) should be stored with the same range as two's complement. That means:

  • 2bit: -2..1 instead of the current range -1..2
  • 4bit: -8..7 instead of the current range -7..8
  • 5bit: -16..15 instead of the current range -15..16
  3. The header should accommodate some more info. Currently there's demand for:
    3a) number of channels (#16)
    3b) the colorspace (#25 and this huge discussion on HN)
    3c) un-/premultiplied alpha (#13)
    3d) user-defined values

So, 1) is already implemented; 2) seems like the right thing to do (any objections?); 3) is imho worth discussing.
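
To make 2) concrete, here's a minimal sketch (illustrative only, not from qoi.h) of how a 2-bit channel difference would be stored and recovered under the proposed -2..1 range, using a simple +2 bias:

/* Illustrative helpers: biased storage of a 2-bit channel diff with the
 * proposed two's-complement-style range -2..1 (stored as diff + 2 = 0..3). */
static unsigned char encode_diff2(int diff) {  /* diff must be in -2..1 */
	return (unsigned char)(diff + 2);          /* -2..1  ->  0..3 */
}

static int decode_diff2(unsigned char bits) {  /* bits is the stored 2-bit value */
	return (int)bits - 2;                      /* 0..3  ->  -2..1 */
}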

3a) Storing the number of channels (3 or 4) in the header would allow a user of this library to omit specifying whether they want RGB or RGBA, and files would be more descriptive of their contents. You would still be able to enforce 3 or 4 channels when loading. This is consistent with what stbi_load does:

int x,y,n;
unsigned char *data = stbi_load(filename, &x, &y, &n, 0);
// ... process data if not NULL ...
// ... x = width, y = height, n = # 8-bit components per pixel ...
// ... replace '0' with '1'..'4' to force that many components per pixel
// ... but 'n' will always be the number that it would have been if you said 0

It is my opinion that the channels header value should be purely informative. Meaning, the en-/decoder will do exactly the same thing, regardless of the number of channels. The extra 5 bits for alpha in QOI_DIFF_24 will still be wasted for RGB files.

3b) I don't understand enough about the colorspace issue to gauge its significance. If we implement this, however, I would suggest giving it a full byte in the header, where 0 = sRGB and any non-zero value is another, user-defined(?) colorspace.

3c) I'm against an option for premultiplied alpha, because it puts more burden on any QOI implementation to decode in the right pixel format. We should just specify that QOI images have un-premultiplied alpha.

3d) For simplicity's sake I'd like to put 3a) and 3b) as one byte each into the header. I'm uncertain if we then should "pad" the u32 size in the header with two more bytes. This would make the size 4byte aligned again, but there's probably no need for it!? A u16 unused could also cause more confusion when other QOI libraries suddenly specify any of these bits to mean something.

With all this, the header would then be the following 16 bytes:

struct qoi_header_t {
	char magic[4];  // magic bytes "qoif"
	u16 width;      // image width in pixels (BE)
	u16 height;     // image height in pixels (BE)
	 u8 channels;   // must be 3 (RGB) or 4 (RGBA)
	 u8 colorspace; // 0 = sRGB (other values currently undefined)
	u16 unused;     // free for own use
	u32 size;       // number of data bytes following this header (BE)
};

The one issue I have with this is how to expose these extra header values to the user of this library. qoi_read("file.qoi", &w, &h, &channels_in_file, &colorspace, want_channels) looks like an ugly API. So maybe that would rather be implemented as a qoi_read_ex(), with qoi_read() staying as it is. I'm still not sure if I want that extended header...
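
For what it's worth, one possible shape for that split, purely as a sketch - qoi_file_info and qoi_read_ex are made-up names here, not existing qoi.h API:

/* Hypothetical API sketch only; none of these names exist in qoi.h today. */
typedef struct {
	unsigned int  width;
	unsigned int  height;
	unsigned char channels;   /* 3 or 4, as stored in the file */
	unsigned char colorspace; /* as stored in the file */
} qoi_file_info;

/* qoi_read() would keep its current shape; the extended variant fills the
 * info struct and decodes to want_channels (0 = whatever the file contains). */
void *qoi_read_ex(const char *filename, qoi_file_info *info, int want_channels);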

What's the opinion of the other library authors?

On color spaces... I'm one of those 'experts' the HN thread is complaining about :)

If the user is doing complex image operations in a color-accurate space, they're most likely going to be using OpenEXR, because the values are important, as is the colorimetric information, and there's a lot more that needs to be stored beyond color space (as is pointed out in the HN thread).

The intent of QOI is fast compress/decompress of data, obviously in a real time context.

The data QOI is going to be used with is therefore going to be either linear or sRGB. Drawing a line in the sand at those two seems like a clear, useful choice.

There's a rub though. It's normal to store linear alpha with sRGB data. Also, it's common to store linear data in color channels, for example, in a normal map or other material input such as "metalness" in a typical PBR set up.

I propose that it makes sense to simply make the rule that individual channels are either linear-and-user-interpreted, or sRGB-with-gamma, and have the format include a byte with a bitmask indicating which channel is which.

I also propose that a bit of 0 would mean gamma, and 1 linear, and ordered such that the A channel lands in the LSB. A mask of 0 therefore might mean all gamma, and a mask of 1 would mean sRGB and linear alpha, regardless of the number of channels. That would make the two most common combinations result in a byte of zero and one, for any number of channels with alpha. The conciseness of only needing to care about zero or one in general appeals to me.
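
To illustrate that layout (the constant names below are made up, not from any spec): alpha sits in bit 0, then B, G, R above it; a clear bit means sRGB/gamma, a set bit means linear:

/* Illustrative sketch of the proposed per-channel colorspace bitmask. */
enum {
	COLORSPACE_LINEAR_A = 1 << 0,  /* made-up names */
	COLORSPACE_LINEAR_B = 1 << 1,
	COLORSPACE_LINEAR_G = 1 << 2,
	COLORSPACE_LINEAR_R = 1 << 3
};

/* mask == 0x0: everything gamma-encoded
 * mask == 0x1: sRGB color channels with linear alpha */
static int channel_is_linear(unsigned char mask, unsigned char channel_bit) {
	return (mask & channel_bit) != 0;
}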

Color differences (QOI_DIFF_*) should be stored as two's-complement.

👍

Also my three cents:

  • Make width and height 32-bit.
  • Make it clear if the diffs wrap or not. Currently the decoder wraps, but the encoder doesn't take advantage of it.
  • There are three different code points allocated to "same pixel as last": index of last pixel, RLE 1 and diff24 with all flags clear.

u32 size; // number of data bytes following this header (BE)

What's the use of this? It requires the encoder to either seek the file, buffer the whole output or do a first pass just to compute this number.

u32 size; // number of data bytes following this header (BE)

What's the use of this? It requires the encoder to either seek the file, buffer the whole output or do a first pass just to compute this number.

Yea, I had the same question. This requires the writer to support seeking (which would yield problems with compression encoders), or you have to encode everything into a buffer first.

The only advantage, I guess, would be being able to skip qoi-encoded images if there were many of them streamed in a row, or something like that.

I agree with the previous two writers; I find the size field in the header restricting. My implementation verifies that the stream is indeed not more than size bytes long. The size field also allows skipping QOI images in a stream without decoding them.

It is my opinion that the channels header value should be purely informative.

I agree with this. I think additional fields like color space and such should also be purely informative, as otherwise the decoder might need to perform color space conversion. I would say that color space will only ever affect RGB; QOI will always use linear alpha.

I also agree with 1) and 2). We should not support premultiplied alpha, as it makes things more complex. I can also live with the current offset method for the signed deltas.

One question I had about the file format: what is the purpose of the four padding bytes at the end of the file in the original implementation? 🤔

Make width and height 32-bit.

What's your reasoning for this? Maybe I'm a bit naive, but a 64k x 64k image seems plenty large to me. The 12bit width/height in MPEG1 still holds up 30 years later (admittedly barely).

Make it clear if the diffs wrap or not. Currently the decoder wraps, but the encoder doesn't take advantage of it.

Uh, interesting! Didn't even think about it. I'll check if it makes a significant difference in compression. My hunch is that it does not - and the added complexity is not worth it.

There are three different code points allocated to "same pixel as last": index of last pixel, RLE 1 and diff24 with all flags clear.

True. I guess QOI_RUN_8 with 1 could be omitted and be always encoded as QOI_INDEX. QOI_RUN_8 could then be 2..33, but we don't gain much here. Omitting QOI_DIFF_24 with all 0 differences is difficult. I don't think this is an issue.

What's the use of u32 size?

Good question! I was thinking of this as an additional way to verify that we have a valid header. Streaming QOI images would also benefit from this. An alternative would be a special QOI_IMAGE_END code.

Case in point: I've had very frustrating experiences with MPEG1 not having any field for the frame size or an indicator for the frame end, resulting in one extra frame of latency - you have to wait until the next frame arrives before you can be sure you have enough data to decode the current one.

MPEG1 was conceived in a way where you could encode data and spit it out to the wire immediately. Same for receiving: as soon as you get any bits from the wire you can theoretically start decoding. Yet, none(?) of the (software) decoders take advantage of it. ffmpeg and libmpeg2 buffer data for a whole frame before they start decoding. It's non-trivial to have a streaming decoder.

I agree with the previous two writes that i find the size field in the header restricting.

Can you clarify what you mean by "restricting"?

What is the purpose of the four padding bytes at the end of the file

Since the longest chunk we can have is 5 bytes (QOI_COLOR with RGBA set), having an extra 4 bytes that we can ignore at the end means we only have to check once per loop iteration whether we ran out of data. Without these extra 4 bytes, we would need to check for every decoded chunk separately. E.g.:

else if ((b1 & QOI_MASK_3) == QOI_RUN_16) {
	// We need an extra byte here. Check if we have enough data
	if (p >= chunks_len) {
		free(pixels);
		return NULL;
	}
	int b2 = bytes[p++];
	run = (((b1 & 0x1f) << 8) | (b2)) + 32;
}
...
else if ((b1 & QOI_MASK_4) == QOI_COLOR) {
	// Calculate how many extra bytes we need
	int bytes_needed =
		((b1 >> 3) & 1) +
		((b1 >> 2) & 1) +
		((b1 >> 1) & 1) +
		((b1 >> 0) & 1);
	// Check if we have enough data
	if (p + bytes_needed > chunks_len) {
		free(pixels);
		return NULL;
	}
	if (b1 & 8) { px.rgba.r = bytes[p++]; }
	if (b1 & 4) { px.rgba.g = bytes[p++]; }
	if (b1 & 2) { px.rgba.b = bytes[p++]; }
	if (b1 & 1) { px.rgba.a = bytes[p++]; }
}

In contrast, the 4 bytes padding seemed like a nicer solution to guard against malicious input.

Make it clear if the diffs wrap or not. Currently the decoder wraps, but the encoder doesn't take advantage of it.

Uh, interesting! Didn't even think about it. I'll check if it makes a significant difference in compression. My hunch is that it does not - and the added complexity is not worth it.

It makes a significant difference in the decoder whether it has to check if the overflow/wraparound would happen and error out because of a format violation, or can just let the wraparound happen. The latter also improves encoding space, as hard on/off transitions can be encoded in a single byte instead of a QOI_COLOR chunk.

Case in point: I've had very frustrating experiences with MPEG1 not having any field for the frame size or an indicator for the frame end, resulting in one extra frame of latency - you have to wait until the next frame arrives before you can be sure you have enough data to decode the current one.

QOI always encodes all pixels, so you know the end of the frame is reached when you have decoded exactly width * height pixels. It doesn't allow jumping forward in a stream of QOI images, but omitting it lets you encode a QOI image directly into a socket instead of encoding it into memory and then sending the whole thing. Imho, it's better to omit the size field, as decoding QOI doesn't cost much time, especially if you only read the bytes.
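
Roughly, a streaming decoder's termination condition would then look like this (decode_next_chunk is a made-up placeholder for reading one chunk and emitting its pixels, not qoi.h API):

/* Illustrative only: with no size field, a streaming decoder simply stops
 * once width * height pixels have been produced. */
#include <stddef.h>

size_t decode_next_chunk(void *stream, unsigned char *pixels, size_t px_pos);

void decode_stream(void *stream, unsigned char *pixels,
                   size_t width, size_t height) {
	size_t px_len = width * height;  /* known from the header */
	size_t px_pos = 0;
	while (px_pos < px_len) {
		px_pos += decode_next_chunk(stream, pixels, px_pos);
	}
}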

True. I guess QOI_RUN_8 with 1 could be omitted and be always encoded as QOI_INDEX. QOI_RUN_8 could then be 2..33, but we don't gain much here. Omitting QOI_DIFF_24 with all 0 differences is difficult. I don't think this is an issue.

Agreed. Encoding a run as 2..33 would also make pictures a tad smaller. I think this is worth the improvement, as there isn't any difference in the decoder, and the encoder only gets slightly harder by introducing a single branch that checks if the run length is 1 and emits a QOI_INDEX instead of a QOI_RUN_8.
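
Roughly, the flush of a pending run in the encoder could then look like this - a sketch using the existing qoi.h tag and hash names, but the run == 1 fallback and the run - 2 rebasing are only the proposal, not current behavior:

/* Sketch only: a run of exactly 1 means "repeat the previous pixel", so it
 * can be emitted as a QOI_INDEX referencing that pixel's index slot, and
 * QOI_RUN_8 can store run - 2 to cover runs of 2..33. */
if (run > 0) {
	if (run == 1) {
		int prev_pos = QOI_COLOR_HASH(px_prev) % 64;
		bytes[p++] = QOI_INDEX | prev_pos;
	}
	else if (run <= 33) {
		bytes[p++] = QOI_RUN_8 | ((run - 2) & 0x1f);  /* 2..33 stored as 0..31 */
	}
	/* longer runs keep using QOI_RUN_16 as before */
	run = 0;
}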

In contrast, the 4 bytes padding seemed like a nicer solution to guard against malicious input.

Ah. My implementation checks for the overrun anyway, so I was wondering why it's there. Also note that malicious input doesn't have these 4 "safety" bytes on purpose, so we cannot rely on them being there.

What's your reasoning for this? Maybe I'm a bit naive, but a 64k x 64k image seems plenty large to me.

  1. I remember very well when 320x200 was called "hires". With 1bpp. Now I'm writing this on a 4k monitor.
  2. Not all images fit on screen, e.g. satellite photos. Phones with 108 Mpix cameras are readily available. That's 12k x 9k.
  3. If speed is a selling point for this format, it should support large images.

I'll check if it makes a significant difference in compression. My hunch is that it does not - and the added complexity is not worth it.

There's not much complexity here. Truncate the deltas in the encoder to 8 bits and document it.
Rejecting overflows in the decoder would make things complex.
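
A minimal sketch of what truncating to 8 bits means on the encoder side (illustrative, not current qoi.h code):

/* Illustrative only: compute the per-channel delta in wrapping 8-bit
 * arithmetic. A transition like 250 -> 5 then becomes +11 and fits a small
 * diff, because the decoder's  px.rgba.r += delta  wraps the same way. */
#include <stdint.h>

static int wrapped_delta(uint8_t cur, uint8_t prev) {
	return (int8_t)(uint8_t)(cur - prev);  /* subtraction wraps modulo 256 */
}
/* wrapped_delta(5, 250) == 11,  wrapped_delta(250, 5) == -11 */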

It's a significant difference in the decoder to check if the overflow/wraparound would happen (...)

Right. That makes sense! I see now that allowing for wrapping is the better choice.

Also note that malicious input doesn't have these 4 "safety" bytes on purpose, so we cannot rely on them being there.

Doesn't matter. qoi_decode() gets the full buffer size, including those 4 bytes. It just stops the decode 4 bytes early. A malicious file could specify a data size of 1 and then start a QOI_COLOR with RGBA, causing us to read 4 bytes past the buffer end without this padding.

two's-complement

To clarify: you want to keep the biased encoding and just move the range by one?
Because two's complement normally means that e.g. zero is stored as zero and the high bit is the sign.

To clarify: you want to keep the biased encoding and just move the range by one?

Sorry, yes. I just wanted to shift the range by one to be consistent with the range of a two's complement int.

Make width and height 32-bit.

I propose having a u8 that specifies the bit size of the width and height values. This would allow small images to take up a few fewer bytes and make super large images possible.

The one issue I have with this, is how to give these extra header value to the user of this library. qoi_read("file.qoi", &w, &h, &channels_in_file, &colorspace, want_channels) looks like an ugly API.

Perhaps I am missing something obvious, but why is the following not an option?
qoi_read("file.qoi", &header, want_channels)

Make width and height 32-bit.

I propose having a u8 that specifies the bit size of the width and height values. This would allow small images to take up a few fewer bytes and make super large images possible.

That sounds needlessly complex. I would just make it 32 bits (which is necessary for large images such as satellite photos, ...) and be done with it. Sure, we might have 24 additional bits in the header that are not necessary for small files, but imo that's an acceptable sacrifice.

From the OP:

struct qoi_header_t {
	char magic[4];  // magic bytes "qoif"
	u16 width;      // image width in pixels (BE)
	u16 height;     // image height in pixels (BE)
	 u8 channels;   // must be 3 (RGB) or 4 (RGBA)
	 u8 colorspace; // 0 = sRGB (other values currently undefined)
	u16 unused;     // free for own use
	u32 size;       // number of data bytes following this header (BE)
};

I've mentioned it before (#12), but for comparison, here's the 16-byte NIE header:

struct nie_header_t {
    u32 magic;
    u8  version;
    u8  order;    // BGRA vs BGRX vs RGBA vs RGBX.
    u8  alpha;    // non-premul vs premul.
    u8  depth;    // 8-bit vs 16-bit.
    u32 width;
    u32 height;
}

Like QOI, I also never expect to define a "version 2", so the version byte looks useless at first, but I've also learned over the years to never say never.

Also, the version byte is 0xFF, which guarantees that the header is both invalid ASCII and invalid UTF-8.

width, height and size in the header should be stored as big endian for consistency with the rest of the format

I would instead advise to make everything little endian. Not just the header fields, but also the bytecodes. Almost all CPUs used today are little-endian, and support unaligned loads.

The current qoi.h implementation reads one byte at a time:

if ((b1 & QOI_MASK_4) == QOI_DIFF_24) {
  int b2 = bytes[p++];
  int b3 = bytes[p++];
  px.rgba.r += (((b1 & 0x0f) << 1) | (b2 >> 7)) - 15;
  px.rgba.g +=  ((b2 & 0x7c) >> 2) - 15;
  px.rgba.b += (((b2 & 0x03) << 3) | ((b3 & 0xe0) >> 5)) - 15;
  px.rgba.a +=   (b3 & 0x1f) - 15;
}

By replacing the b1, b2 and b3 uint8_t variables with a single b uint32_t variable, and changing the QOI_MASK_ETC bits from high bits to low bits, this could be:

if ((b & QOI_MASK_4) == QOI_DIFF_24) {
  px.rgba.r += ((b >>  4) & 31) - 15;
  px.rgba.g += ((b >>  9) & 31) - 15;
  px.rgba.b += ((b >> 14) & 31) - 15;
  px.rgba.a += ((b >> 19) & 31) - 15;
  p += 3;
}

Which might be faster. It certainly looks simpler.

One downside of that implementation is that it would be harder for non-C languages to implement. If it's faster, though, it probably still makes sense.

Make width and height 32-bit.

The maximum RGB size with 16-bit width/height is 12GB and would already take around 30 seconds to decode (although the current implementation fails at 2GB). For these and larger images, you really want the format to support multi-threaded decoding (chunks/tiles), and probably zoom levels. This isn't "simple", so I'd keep the 16-bit size limit to stop people misusing the format.

I propose that it makes sense to simply make the rule that individual channels are either linear-and-user-interpreted, or sRGB-with-gamma, and have the format include a byte with a bitmask indicating which channel is which.

Yes, in practice this is all I would ever use having worked on various games and texture compression tools. Which channels are sRGB? Which are linear? That's it and gives me 99% of the value.

What's your reasoning for this? Maybe I'm a bit naive, but a 64k x 64k image seems plenty large to me.

  1. I remember very well when 320x200 was called "hires". With 1bpp. Now I'm writing this on a 4k monitor.
  2. Not all images fit on screen, e.g. satellite photos. Phones with 108 Mpix cameras are readily available. That's 12k x 9k.
  3. If speed is a selling point for this format, it should support large images.

Agreed on #1, #2, #3. Make it a DWORD. Every single time I've limited my texture compression tools to specific maximum sizes (like 8192x8192 or 16384x16384), somebody has yelled and I had to bump the limits. For some specific use cases 16K-32K are definitely a thing, and I don't see why >=64K shouldn't be supported. You need more than 16-bits for the width/height.

QOI is so much faster than PNG that it's a compelling use case for 64K-128K images. 640k is not enough:
https://www.wired.com/1997/01/did-gates-really-say-640k-is-enough-for-anyone/

Make width and height 32-bit.
The maximum RGB size with 16-bit width/height is 12GB and would already take around 30 seconds to decode (although the current implementation fails at 2GB). For these and larger images, you really want the format to support multi-threaded decoding (chunks/tiles), and probably zoom levels. This isn't "simple", so I'd keep the 16-bit size limit to stop people misusing the format.

We're all sitting here with what would have been viewed as personal supercomputers in the 80's/90's. I have 128 GB RAM and soon going to 256 GB. Give me 32-bit width & height please - we're ready now.

We're all sitting here with what would have been viewed as personal supercomputers in the 80's/90's. I have 128 GB RAM and soon going to 256 GB. Give me 32-bit width & height please - we're ready now.

True, but my supercomputer also has 16 cores, and I would much rather load the same image split into tiles in ~2 seconds than ~30 seconds. I guess you know this, and want 32-bit sizes anyway (e.g. to replace even worse performing PNGs of these sizes?), so go for it.

(I've tried working with a 20268x11309 JPEG (die shot, still well under the 16-bit limit, less than 1GB of raw RGB data), and there's no question that my computer could do it, but it just makes most software unusable. I'd love to have better formats for large images, but it's not a domain suited to simplicity-first projects.)

+1 for uint32 w/h

+1 for colorspace as a bitfield

Other color spaces will probably be better served by other formats.

Maybe we should also consider having a next field to indicate that another header with metadata follows this one; the magic bytes + size of that header indicate what it is. Similar to Vulkan's pNext and sType. Given it would only be used for metadata, it could be skipped over by interpreters.

Just a suggestion, probably a lil out of scope if you want to keep it simple and straightforward.

The maximum RGB size with 16-bit width/height is 12GB

Not all images are square. Especially in science it isn't unbelievable that you could want to encode a 4GB image that's 2^20 by 2^10 pixels (for example). Since the cost is only 32 bits extra, I think it's worth the extra flexibility.

Could also be useful for LUTs that are 1xLARGEVALUE.

Not all images are square.

Good point! I've seen bitmap fonts that don't fit on my 4K screen and they were from a 20th century computer with less than 1 MB RAM.

I've come to the following conclusion:

Things that will change

  • the range of QOI_DIFF will shift -1, to be consistent with the range of a two's complement int
  • QOI_DIFF will explicitly allow wrapping around. Whether the encoder makes use of this is outside of the spec
  • the size field in the header will be removed
  • width and height in the header will be widened to 32bit
  • a channels field will be added to the header. This is purely informative and will not change the behavior of the en-/decoder
  • a colorspace bitmap will be added to the header. This is purely informative and will not change the behavior of the en-/decoder.
  • the spec will mandate that the alpha channel is un-premultiplied

The header then looks like this:

struct qoi_header_t {
    char magic[4];  // magic bytes "qoif"
    u32 width;      // image width in pixels (BE)
    u32 height;     // image height in pixels (BE)
     u8 channels;   // must be 3 (RGB) or 4 (RGBA)
     u8 colorspace; // a bitmap 0000rgba where 
                    //   - a zero bit indicates sRGBA, 
                    //   - a one bit indicates linear (user interpreted)
                    //   colorspace for each channel
};
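
For reference, a sketch of reading that 14-byte header one byte at a time, which yields the same result on little- and big-endian hosts (helper names are illustrative, not from qoi.h):

/* Illustrative sketch only: byte-wise big-endian header parsing. */
#include <stdint.h>
#include <string.h>

static uint32_t read_u32_be(const unsigned char *b) {
	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] <<  8) |  (uint32_t)b[3];
}

static int parse_qoi_header(const unsigned char *b, uint32_t *w, uint32_t *h,
                            unsigned char *channels, unsigned char *colorspace) {
	if (memcmp(b, "qoif", 4) != 0) { return -1; }  /* bad magic */
	*w          = read_u32_be(b + 4);
	*h          = read_u32_be(b + 8);
	*channels   = b[12];
	*colorspace = b[13];
	return 0;
}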

Things that will stay the same

  • BE encoding will stay. To keep the library portable, implementers have to read 1 byte at a time anyway
  • Channel ordering will stay at RGBA
  • the range of QOI_RUN_8 will stay at 1..32. Shifting it to 2..33 and letting the 1 case always be handled by QOI_INDEX or QOI_DIFF would needlessly complicate the spec and encoder
  • the 4 bytes padding will stay to simplify range checks in the decoder

Any work on better compression methods, dividing an image into blocks to allow for better streaming, and other features should be rolled into a successor of QOI. Maybe we can call that QGI then :)

I'll try to implement these changes today.

I learned a lot; thank you all so much for your input! <3

I just wanted to write a proposal with which I'd be happy. It had one new idea, but scrap that. I'll go adjust my implementation.

Can I expect this to be frozen? Then I'll happily start PRing an improved spec in qoi.h with the bitfield documentation, and maybe also make a nice HTML/PDF version.

BE encoding will stay. To keep the library portable, implementers have to read 1 byte at a time anyway

This isn't about portability. It's about possible compiler's optimization: https://godbolt.org/z/3brMonWrE

This isn't about portability. It's about possible compiler's optimization: https://godbolt.org/z/3brMonWrE

We're talking about one (mov) or two instructions (mov+bswap) here, which doesn't really matter in the grand scheme of decoding a file format, as fetching data from even L1 is slower than executing a few instructions, at least for that specific x86_64 architecture.

Yes. If there aren't any severe mistakes in my previous post, this will be the final spec.

This isn't about portability. It's about possible compiler's optimization: https://godbolt.org/z/3brMonWrE

That's a good point, but I will not entertain it for QOI. QOI_DIFF_24 is the only place where it would matter, and I expect the CPU to spend most of the time on branch mispredicts. Sorry, I don't want to erect a bike-shed :)

u32 size; // number of data bytes following this header (BE)

What's the use of this? It requires the encoder to either seek the file, buffer the whole output or do a first pass just to compute this number.

As someone pointed out on HackerNews, it allows simply putting EXIF (or any other metadata format) at the end of the file; a library/user would only need to skip size bytes to reach the metadata.

As someone pointed out on HackerNews, it allows simply putting EXIF (or any other metadata format) at the end of the file; a library/user would only need to skip size bytes to reach the metadata.

Yeah, but this is also possible by simply seeking to the end of the file and reading the header there, similar to how ID3 tags are encoded.

I would instead advise to make everything little endian. Not just the header fields, but also the bytecodes. Almost all CPUs used today are little-endian, and support unaligned loads.

The current qoi.h implementation reads one byte at a time:

if ((b1 & QOI_MASK_4) == QOI_DIFF_24) {
  int b2 = bytes[p++];
  int b3 = bytes[p++];
  px.rgba.r += (((b1 & 0x0f) << 1) | (b2 >> 7)) - 15;
  px.rgba.g +=  ((b2 & 0x7c) >> 2) - 15;
  px.rgba.b += (((b2 & 0x03) << 3) | ((b3 & 0xe0) >> 5)) - 15;
  px.rgba.a +=   (b3 & 0x1f) - 15;
}

By replacing the b1, b2 and b3 uint8_t variables with a single b uint32_t variable, and changing the QOI_MASK_ETC bits from high bits to low bits, this could be:

if ((b & QOI_MASK_4) == QOI_DIFF_24) {
  px.rgba.r += ((b >>  4) & 31) - 15;
  px.rgba.g += ((b >>  9) & 31) - 15;
  px.rgba.b += ((b >> 14) & 31) - 15;
  px.rgba.a += ((b >> 19) & 31) - 15;
  p += 3;
}

Which might be faster. It certainly looks simpler.

Seconded. The truth is that little-endian "has won": everything performance-sensitive today should use little-endian, meaning it will be slightly faster on little-endian CPUs and slightly slower on big-endian CPUs, because practically everything today uses little-endian CPUs. (x86 and x86-64, Apple M1, iPhone, Samsung phones, the vast majority of smartphones: it's all little-endian. The last Apple big-endian system was released in 2005; they transitioned to little-endian in 2006 and have been little-endian ever since.)

  • last Apple hardware to use big-endian was 2005
  • last Intel CPU to use big-endian was 2017 (and even that was just a higher-clocked release of a 2012 model, with no architectural improvements, just higher clock speeds)

BE encoding will stay. To keep the library portable, implementers have to read 1 byte at a time anyway

Little-endian shows an approx 1.05x improvement in decode speed. The new qoile.h implementation is still portable. Let's move the endianness discussion to #36.

     u8 colorspace; // a bitmap 0000rgba where 
                    //   - a zero bit indicates sRGBA, 
                    //   - a one bit indicates linear (user interpreted)
                    //   colorspace for each channel

What does a 0b00000011 value mean? Is a "half sRGB, half other" color space a thing?

struct qoi_header_t {
    char magic[4];  // magic bytes "qoif"
    u32 width;      // image width in pixels (BE)
    u32 height;     // image height in pixels (BE)
     u8 channels;   // must be 3 (RGB) or 4 (RGBA)
     u8 colorspace; // a bitmap 0000rgba where 
                    //   - a zero bit indicates sRGBA, 
                    //   - a one bit indicates linear (user interpreted)
                    //   colorspace for each channel
};

Bikeshedding alert: you could re-pack from 14 back to 12 bytes if you really wanted to:

struct qoi_header_t { 
    char magic[4];    // magic bytes "qoif"
    u32 alpha  :  1;  // 0=RGBX (i.e. 3 channels), 1=RGBA (i.e. 4 channels)
    u32 width  : 31;  // image width in pixels
    u32 srgb   :  1;  // 0=Other, 1=sRGB
    u32 height : 31;  // image height in pixels
};

More bikeshedding: I'd then change the magic to "qoi\xFF" to be non-ASCII and non-UTF-8.

Bikeshedding alert: you could re-pack from 14 back to 12 bytes if you really wanted to:

I do not :)

What does a 0b00000011 value mean? Is a "half sRGB, half other" color space a thing?

You can store different planes in one file, where some of those are sRGB (e.g. a specular map(?)) while others are interpreted linearly (e.g. bump maps, normal maps). QOI just provides this info in the header. What the user does with this info is outside of this spec.

Little-endian shows an approx 1.05x improvement in decode speed.

I believe there's more gains to be had by rearranging/nesting the if-statements and other little tweaks. E.g. changing this range check else if (p < chunks_len) to a simple else leads to a similar speedup on my CPU.

Anyway, if we want to have the absolutely fastest encode/decode speed, this format would need to change a lot to accommodate SIMD instructions. It's certainly a worthwhile endeavor, but again, QOI is done! Thank you :)

You can store different planes in one file, where some of those are sRGB (e.g. a specular map(?)) while others are interpreted linearly (e.g. bump maps, normal maps).

This (e.g. a 1-channel 'sRGB' specular map) still doesn't make sense to me. sRGB-ness (combined with your monitor color profile) tells you how to convert an RGB triple from the source image color space to the destination monitor color space. If only the red and green parts of the source image are 'sRGB' then you can't make the conversion without knowing the blue part too in sRGB space, and the file format doesn't tell you that.

If you're storing specular maps, bump maps, normal maps, etc. then just tag the whole thing (all 4 channels) as not-sRGB. It's an all-or-nothing thing.

Little-endian shows an approx 1.05x improvement in decode speed.

I believe there's more gains to be had by rearranging/nesting the if-statements and other little tweaks.

Sure, but it's not "you can choose only one optimization". You can rearrange the if-statements (for e.g. +10%) and change to little endian (for e.g. +5%) and the two combine.

The thing is you can rearrange the if-statements 3 months from now. You can't change to little endian once you freeze the file format.

Anyway, if we want to have the absolutely fastest encode/decode speed, this format would need to change a lot to accommodate SIMD instructions.

It's not about having the absolutely fastest decode speed. It's about having a free lunch with no change in file format complexity. It's just a different bit packing into exactly the same number of bytes per opcode, with exactly the same opcode semantics.

I wrote something regarding little- vs. big-endian here. If the highest bits of the bytecode decide the operation, the masking operation is unnecessary and a comparison with <= is sufficient.

Re: color difference (2), why not use wrapping ops instead of essentially using saturating ops? (please correct me if I'm wrong)

For example, r1 = 5 (the current value) and r2 = 250 (the previous value). Computing the diff in higher-bit integer space gives -245, which falls outside the 5-bit diff encoding, but 5_u8.wrapping_sub(250_u8) = 11_u8, which fits in 5 bits just fine - using bias encoding you would store 11_u8 + 16_u8 = 27_u8 (assuming -16..15 like discussed above).

Another point is that in addition to the diff op itself you would only have to perform 2 ops on the diff: one comparison and one addition instead of 3 like it currently is (2 comparison ops, one addition):

  • Currently (sorry for Rust):
     let vr = (r1 as i16) - (r2 as i16);  // one -
     if vr >= -16 && vr <= 15 {           // two comparisons
        // store: (vr + 16) as u8         // one +
     }
  • Could be:
    let vr = r1.wrapping_sub(r2).wrapping_add(16);  // one +, one -
    if vr < 32 {                                    // one comparison
        // store: vr
    }

The potential compression disadvantage may be that some pixels further down the line with e.g. r = 225 would no longer count as valid for diff encoding since they would be too far away in wrapped space. So, expected: faster encoding speed, lower compression ratios. But it's not so obvious and probably needs testing, is it noticeable or not. (I can try testing it on qoibench.c)

the size field in the header will be removed

I think this is a bad idea. I have seen several file formats without size and it always looked like a bad idea.

The Entropy Coded Segment (ECS) of JPEG is such an example. All other JPEG segments have a length; the ECS does not. The ECS ends when a 0xff byte is not followed by a 0x00 byte. During decoding, the byte sequence 0xff 0x00 needs to be converted to 0xff. Every other JPEG segment can be skipped easily, because it has a length. An ECS must be processed byte by byte to find its end. This is absolutely crazy.

If the size is missing you need to process everything to get to the end of the data. The size opens the opportunity to have other information afterwards, e.g. another header with another image, or meta information. With a size field you can just jump to the information after the image without processing it. The size can also be used as redundant information (afterwards the file must end). I now think that the "channels" field is not needed, but a "size" field definitely makes sense.

@ThomasMertes There's pros and cons of not having size as a mandatory field:

  • Cons: what you've listed. You have to parse the data to find its end. This would hurt any streaming contexts where you receive consecutive frames, or metadata afterwards.
  • Pros: encoding now works in streaming mode. Whatever you're encoding into doesn't need to support seeking just for the sake of writing the size field, which is not needed in a typical decoding scenario.

Alternative middle ground: add an unused u32 field(s) in the header? So if you want to write size there, you can, e.g. if streaming multiple frames?

With a size field you can just jump to the information after the image without processing it.

Why not have this information before the image then?


To clarify: you want to keep the biased encoding and just move the range by one?

Sorry, yes. I just wanted to shift the range by one to be consistent with the range of a two's complement int.

Is there a reason behind using a bias instead of just two's complement there? For software implementations it might be kind of irrelevant, but for hardware implementations it requires adding more circuitry to account for the correct bias. OTOH, sign extension is trivial both in software and hardware.

I spent some more time thinking about how I could use QOI files in some of my applications and realized that the suggested 8 bits of custom information might not necessarily be enough for me.

What if, instead of storing the custom information in the header itself, we store an offset between the end of the header and the start of the image data? This way people could add their own information as a kind of secondary header, which would be completely ignored (besides a getter, maybe) by QOI.

@phoboslab has just said that "the data format is now fixed!".

Let's move the bikeshedding to:
https://github.com/nigeltao/qoi2-bikeshed/issues

Note that GitHub is having some server issues right now.


I'm working with 48-bit uncompressed scans of film. I might have to create a qoi-48 fork then. :-)
Also, given that more and more monitors support bit depths higher than 24, it's a sign of the times, I think.

There's pros and cons of not having size as a mandatory field

Indeed.

One option for "a sequence of QOI images", without an explicit size field, is to present a TAR file that contains QOI files. It's streaming / single-pass encoding friendly.

Similarly, for people wanting to stuff EXIF or other metadata after the, uh, non-meta data, you could instead use something based on IFF / RIFF / TIFF.