phoboslab/qoa

Frame header: frame size is calculable and can be used for other data, suggestion: improved seekability

Closed this issue · 17 comments

gmta commented

The 16 bits used to communicate the frame size are not necessary: the first 8 bits of each frame's header contain the number of channels, and from that you can calculate the total frame size, because all arrays depend only on the number of channels:

sizeof(frame_header) + num_channels * (sizeof(lms_state) + sizeof(qoa_slice_t) * 256)
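
As a rough sketch of that calculation, assuming the byte sizes from the layout described in qoa.h (an 8-byte frame header, 16 bytes of LMS state per channel, 8-byte slices). The names below are hypothetical and not part of the reference implementation, and the result only applies to full frames of 256 slices per channel:

```c
#include <assert.h>
#include <stdint.h>

/* Sizes in bytes, assumed from the frame layout described in qoa.h. */
enum {
    FRAME_HEADER_BYTES = 8,   /* num_channels, samplerate, fsamples, fsize */
    LMS_STATE_BYTES    = 16,  /* 4 history + 4 weights, 16 bits each */
    SLICE_BYTES        = 8,   /* one 64-bit slice */
    SLICES_PER_FRAME   = 256  /* per channel, in every full frame */
};

/* Size in bytes of a full frame for the given channel count. */
static uint32_t full_frame_size(uint32_t num_channels) {
    assert(num_channels > 0);
    return FRAME_HEADER_BYTES +
           num_channels * (LMS_STATE_BYTES + SLICE_BYTES * SLICES_PER_FRAME);
}
```

With these sizes a full 8-channel frame is 16520 bytes, and a 16-bit frame size field would first overflow at 32 channels (66056 bytes).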

By dropping the frame size bits, and by extension its value limit of an unsigned 16-bit int, we could theoretically also drop this channel limit:

#define QOA_MAX_CHANNELS 8

As a suggestion, I would propose to include more metadata to improve seekability through frames:

As it stands, each frame can change the number of channels and/or sample rate which means that you need to read each frame header to be able to seek through an audio stream. Even if the number of channels remains constant, you can only seek to certain sample offsets and cannot seek to certain timestamps or even calculate the timestamp after seeking without decoding all frames in between.

If we use the free 16 bits to encode additional metadata, we can include something like this in the frame header

bits 0123456789abcdef
     tvvvvvvvvvvvvvvv
      
t = 0
    you need to decode all frames
t = 1
    the value `v` (15 bits) indicates the number of following frames that do not deviate in number of channels or sample rate

For live streaming you would use t=0 but for streams encoded ahead of time, you could set t=1 and set the number of frames for which the encoder is certain that the number of channels or sample rate isn't going to change. Even if the encoder isn't sure ahead of time, if the encoder's output is seekable it could write the correct value after the fact.

This way, you only need to decode the first frame to be able to seek to any timestamp as well.

Good points!

Typically the qoa_desc struct is allocated on the stack, so I wanted to limit the size to something reasonable - hence QOA_MAX_CHANNELS 8. Not sure how to go about this. Is more than 8 channels used anywhere (where you want a lossy codec)?

The frame size in the header is mostly for convenience (and because I had some bytes left...). I agree that it's redundant and replacing it with something more useful is a good idea.

As for changing samplerate/channels:

In a valid QOA file all frames have the same number of channels and the same
samplerate. These restrictions may be relaxed for streaming. This remains to
be decided.

~ https://github.com/phoboslab/qoa/blob/master/qoa.h#L24-L26

I lack the experience here. My thought was that for files there would be no point in changing these. Is there any format and/or application that does this? For streaming it surely makes sense to drop channels or lower the samplerate on the fly.

gmta commented

Is more than 8 channels used anywhere (where you want a lossy codec)?

Not that I know of - 8 (7.1) channels would be the most common for lossy home audio purposes. I'm fine with keeping the maximum as-is.

My thought was that for files there would be no point in changing these. Is there any format and/or application that does this?

At least FLAC supports this. I can imagine recorded TV streams switching from 2.0 to 5.1, and for live streaming it is useful to reduce or increase the bitrate on the fly of course.

Typically the qoa_desc struct is allocated on the stack, so I wanted to limit the size to something reasonable - hence QOA_MAX_CHANNELS 8. Not sure how to go about this. Is more than 8 channels used anywhere (where you want a lossy codec)?

Considering games (which is one of QOA's main focus points if I understand that correctly): dynamic soundtracks which might have several tracks which are live mixed depending on player activities immediately come to mind. You'd need more than 4 tracks to exceed 8 channels with stereo tracks, but I'm not a game developer so I'm not sure how many are usually in use.

I lack the experience here. My thought was that for files there would be no point in changing these. Is there any format and/or application that does this? For streaming it surely makes sense to drop channels or lower the samplerate on the fly.

Realtime streaming changes bitrates dynamically depending on network capabilities. Because QOA can't do that directly, the only other pathway would be to change sample rates. Changing channel arrangements (excluding channel correlation types) is much less common; in FLAC I would expect it to happen with specially crafted large files that contain many varying tracks of music.

FWIW, in my SerenityOS implementation I'm already expecting to resample data on the fly and move channel data around, which we're already doing to good effect with FLAC.

gmta commented

Just a sidenote, the proposed bitformat for the 16 bits in my opening is a bit stupid. We can just use all 16 bits as a u16 where 0 immediately works as expected.

As it stands, each frame can change the number of channels and/or sample rate which means that you need to read each frame header to be able to seek through an audio stream. Even if the number of channels remains constant, you can only seek to certain sample offsets and cannot seek to certain timestamps or even calculate the timestamp after seeking without decoding all frames in between. ...

The way flac differentiates between fixed and variable blocksize encoding is with a flag present in every frame header, which must be either set or unset throughout. That approach would work neatly here: set to 1 when we know ahead of time that num_channels and samplerate are fixed, 0 otherwise. Some files encoded as variable could potentially be converted to fixed with another pass over the data, which would be useful to allow quicker seeking. But as qoa's framesize can be determined from the frame header (unlike, say, flac, which has to read the entire frame even to skip it), it's not the end of the world if seeking stays as it is (requiring every frame header to be read to step through).

The frame size in the header is mostly for convenience (and because I had some bytes left...). I agree that it's redundant and replacing it with something more useful is a good idea.

A checksum, probably crc8 as used in flac, would go a long way towards parity with most other formats that exist. It's not compute-heavy, but it may be more than you're willing to stomach for this format. It's also reasonable to do as you have and delegate checksums to an external source, the right tool for the right job.

Considering games (which is one of QOA's main focus points if I understand that correctly): dynamic soundtracks which might have several tracks which are live mixed depending on player activities immediately come to mind. You'd need more than 4 tracks to exceed 8 channels with stereo tracks, but I'm not a game developer so I'm not sure how many are usually in use.

Multi-language VO could also be conveniently encoded in a single file. Back in the PS1/PS2 era multi5 and multi8 were pretty common, PS3 has many releases using multi12/13/14, and now that digital downloads exist the average has probably increased. Or think of a game like Worms, where you can customise the language/accent/gibberish of your team's speech to potentially dozens and dozens of variations.

Not that single-file QOA should necessarily be used for dynamic soundtracks or multilang. It would be more efficient to DIY with separate files to only load what was necessary, but that does come with extra implementation hazards and complexity.

Typically the qoa_desc struct is allocated on the stack, so I wanted to limit the size to something reasonable - hence QOA_MAX_CHANNELS 8

With the above in mind I'm fine with the reference implementation being limited to 8 channels, as long as the spec is only limited by whatever the width of num_channels ends up being.

Realtime streaming changes bitrates dynamically depending on network capabilities.

Are you sure they don't just switch to a lower-bitrate stream through a mechanism on the site/player?

I'd be inclined to suggest that varying things like samplerate should be done by the streaming player seamlessly switching to a different audio stream, not something that an underlying format should concern itself with. Yes, flac provides its context every frame (samplerate etc.), allowing it to vary, but this was done to support multicasting, which as it turns out is not how most streaming is done AFAIK. The way streaming seems to tend to work is that the player gets context on load, then the stream is simply treated as a non-seekable file. There are examples of multicast but I assume they are niche. See the current flac maintainer's response to a tangentially related question: https://hydrogenaud.io/index.php/topic,123569.msg1021306.html#msg1021306

Also how relevant is streaming as a design goal? Efficient bitrate when streaming is important and opus exists for that. Maybe streaming can be fobbed off entirely as a consideration, let the transport layer solve that if someone wants to use qoa.

If anything I'd argue for moving samplerate and num_channels from the frame_header to the file_header. Enforce that these are fixed for the duration of the file to simplify things, and good seek performance is guaranteed for free. There'd then be enough room to potentially expand some limits:

	struct {
		char     magic[4];     // magic bytes 'qoaf'
		uint32_t samples;      // number of samples per channel in this file
		uint32_t samplerate;   // 32 bit as 24 bit is 16MHz, radio frequencies go higher and there's no reason not to support them given the extra space
		uint16_t num_channels; // 65536 channels is enough for anyone
		uint16_t tbd;          // maybe an optional crc16 of all frames combined as an integrity check, just because there's room.
		                       // Maybe expand samples to uint48_t instead.
		                       // Maybe just padded with 0x0000
	} file_header;             // = 128 bits

And if that's done only fsamples remains in frame_header, which should be 5120 for all frames except probably the last frame. If adaptable-streaming and an undetermined sample count in the header can be eliminated then so can the frame_header entirely. Alternatively the frame_header could be filled with mildly useful things:

	struct {
		char     sync[4];      // magic bytes 'qoa\0' as a way to try and detect/recover from corruption, particularly when non-seekable or offset is unknown
		uint16_t fsamples;     // sample count per channel in this frame
		uint16_t pad;          // 0x0000
	} frame_header;                 // = 64 bits

The way flac differentiates between fixed and variable blocksize encoding is with a flag present in every frame header, which must be either set or unset throughout. That approach would work neatly here: set to 1 when we know ahead of time that num_channels and samplerate are fixed, 0 otherwise. Some files encoded as variable could potentially be converted to fixed with another pass over the data, which would be useful to allow quicker seeking. But as qoa's framesize can be determined from the frame header (unlike, say, flac, which has to read the entire frame even to skip it), it's not the end of the world if seeking stays as it is (requiring every frame header to be read to step through).

As Dominic has stated multiple times (this is also in qoa.h, but some have probably overlooked it), the current pre-release QOA spec does not allow the audio parameters to vary within a file. The reference implementation will reject any files that violate that requirement. The SerenityOS implementation (which I wrote) issues a warning and loses the ability to seek, but can decode the files just fine (though the current "reduction" or "expansion" of audio data, such as resampling and channel combination, is crude and not lossless).

A checksum, probably crc8 as used in flac, would go a long way towards parity with most other formats that exist. It's not compute-heavy, but it may be more than you're willing to stomach for this format. It's also reasonable to do as you have and delegate checksums to an external source, the right tool for the right job.

To be honest I have not experienced a situation where checksums are particularly useful. Just a little thought reveals: (1) in streaming dropped data should be handled by the transport layer, (2) streamed audio should and will usually be encrypted and checked with signatures or MACs, making transmission errors easily detectable, and (3) file bitrot should be detected by modern file system implementations. If WAV is overwhelmingly fine without checksums, I think a deliberately simple format like QOA (arguably simpler than WAV if it weren't for compression) should not include this feature. If your specific application requires data integrity, checksumming the entire file or blocks of it is something that can be easily added on.

With the above in mind I'm fine with the reference implementation being limited to 8 channels, as long as the spec is only limited by whatever the width of num_channels ends up being.

I agree with this but it needs good wording in the spec. Something like "Implementations may choose to limit the number of channels they can handle, but at least 8 channels must be supported." It's important to not allow the limit to be arbitrarily low, as at that point many common files will stop being cross-compatible. 8 is a good limit for deliberately simple or constrained implementations such as the stack-allocating reference implementation, and anything higher is unusual in any case.

Are you sure they don't just switch to a lower-bitrate stream through a mechanism on the site/player?

I am not familiar with WebRTC and the like. I would suspect that changing the bitrate this way is better for interruption-free streaming, but I have no idea. Either way, since the spec currently doesn't allow it, it's not an important concern.

If anything I'd argue for moving samplerate and num_channels from the frame_header to the file_header. Enforce that these are fixed for the duration of the file to simplify things, and good seek performance is guaranteed for free.

Given that these limits already exist, I find your suggestion for moving the metadata to the header worth consideration. I also think your data allocation is reasonable. However, it will make QOA unstreamable, and Dominic has to decide whether he deems the simplification and bitrate reduction more important than streamability. I have no strong opinions either way; the things I like about QOA are its simplicity and fast decodability, and neither really speaks for or against this change.

And if that's done only fsamples remains in frame_header, which should be 5120 for all frames except probably the last frame. If adaptable-streaming and an undetermined sample count in the header can be eliminated then so can the frame_header entirely. Alternatively the frame_header could be filled with mildly useful things:

Don't forget that the frame header is responsible for occasionally providing a full-precision LMS state, which is required to prevent degradation of quality. 5120 samples mean almost 10 frames per second (about 8.6 at 44.1 kHz and 9.4 at 48 kHz), so we have enough "resets in quality" per second for the degradation not to be too noticeable.

	struct {
		char     sync[4];      // magic bytes 'qoa\0' as a way to try and detect/recover from corruption, particularly when non-seekable or offset is unknown
		uint16_t fsamples;     // sample count per channel in this frame
		uint16_t pad;          // 0x0000
	} frame_header;                 // = 64 bits

I do not like this suggestion. A sync code is decently useless if it can accidentally appear in the stream. For example, I consider the FLAC sync code to be almost useless because while they take great care to make it impossible in frame and subframe headers, it can always appear in checksums and especially residuals (a long series of zero residuals give you single 1 bits, which combined with the occasional 01 "1" will very quickly look like a sync code).

More interestingly, I think it would be a good idea to force the frame size to 5120 samples and zero-pad the data at the end if necessary (or, in a smart encoder, trim start and end silence to put only "real" data into the file). That would avoid all problems with forcing implementations to use 5120 samples where possible (because there is no other option) and special-casing the last block. Simplification :). There is precedent for this: MP3's frame size is fixed to 1152 samples, which makes this arguably complex format significantly simpler.
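
To illustrate what forcing every frame to 5120 samples would mean for an encoder, here is a minimal sketch of the padding arithmetic. The function names are hypothetical, and it assumes the total sample count per channel is known up front:

```c
#include <assert.h>
#include <stdint.h>

#define SAMPLES_PER_FRAME 5120u /* 256 slices of 20 samples each */

/* Number of fixed-size frames needed to hold `samples` samples per channel. */
static uint64_t frames_needed(uint64_t samples) {
    return (samples + SAMPLES_PER_FRAME - 1) / SAMPLES_PER_FRAME;
}

/* Zero samples the encoder would append to fill the final frame. */
static uint32_t padding_samples(uint64_t samples) {
    uint64_t padded = frames_needed(samples) * SAMPLES_PER_FRAME;
    assert(padded >= samples);
    return (uint32_t)(padded - samples);
}
```

For example, one second of audio at 44.1 kHz (44100 samples per channel) would occupy 9 frames, with 1980 padding samples in the last one. If the sample count in the file header is kept, a decoder could presumably still trim that padding on output.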

the current pre-release QOA spec does not allow the audio parameters to vary within a file

Sure, but I was hypothesising recording a stream. It might not be a valid "file" format but if streaming exists recording it will too?

Don't forget that the frame header is responsible for occasionally providing a full-precision LMS state, which is required to prevent degradation of quality

I thought it was done purely to chunk the data to allow seeking. Anyway I was just talking of the frame_header struct.

I do not like this suggestion. A sync code is decently useless if it can accidentally appear in the stream. For example, I consider the FLAC sync code to be almost useless because while they take great care to make it impossible in frame and subframe headers, it can always appear in checksums and especially residuals

A sync code is not very useful for qoa but FLAC's sync code is actually about as performant as a sync code can be without guaranteeing the string isn't present anywhere else. It is unlikely to appear in residuals, a rice code is a unary followed by n bits for an n-bit rice parameter (and if n is 0 the entire partition isn't written and all residuals have an implicit value of 0). Flac's sync code is 13 1's followed by 2 0's; for simplicity we could also include the following blocking strategy bit, which has to stay fixed for the duration, making 16 bits of "sync". Because residual values tend to follow a bell curve, it's unlikely that an encoder trying to output efficient results, with input that in any way resembles audio, will output that 16-bit string, and byte-aligned at that. The crc8 cannot form part of an erroneous sync code under normal circumstances (common samplerate, common blocksize). The crc16 can, and it's byte-aligned, but what follows it is an actual sync code, which guarantees the erroneous sync will fail cheap flac header validation checks and not trigger a full decode. An erroneous sync in an escape-coded partition is pretty easy, but escape codes are rarely used; the same applies to verbatim frames.

A sync code is not very useful for qoa but FLAC's sync code is actually about as performant as a sync code can be without guaranteeing the string isn't present anywhere else. It is unlikely to appear in residuals, a rice code is a unary followed by n bits for an n-bit rice parameter (and if n is 0 the entire partition isn't written and all residuals have an implicit value of 0).

Without arguing about an unrelated format any further (I have seen sync codes in residual data with my own two eyes), https://datatracker.ietf.org/doc/html/draft-ietf-cellar-flac-07#name-format-lay-out says (emphasis mine):

Since a decoder MAY start decoding in the middle of a stream, there MUST be a method to determine the start of a frame. A 15-bit sync code begins each frame. The sync code will not appear anywhere else in the frame header. However, since it MAY appear in the subframes, the decoder has two other ways of ensuring a correct sync. The first is to check that the rest of the frame header contains no invalid data. Even this is not foolproof since valid header patterns can still occur within the subframes. The decoder's final check is to generate an 8-bit CRC of the frame header and compare this to the CRC stored at the end of the frame header.

In other words: Significant bitstream decoding work has to be done to sync up with a FLAC stream. I do not think this kind of effort is warranted for QOA.

In other words: Significant bitstream decoding work has to be done to sync up with a FLAC stream.

The odds of a full erroneous decode are small, and that is the only time significant effort is required. An aligned sync code by itself in arbitrary data is rare enough that the sync code would do its job, let alone the effort flac goes to to reduce erroneous sync, which conservatively must push the probability down to 2^-25 at most, maybe much lower. Then factor in cheap header decode checks that push it down to 2^-35 at most. Even if against all odds an erroneous sync is generated that leads to a full erroneous decode, it's unlikely you could trigger it if you tried. You'd need to seek arbitrarily to the exact tenth of a second where you'd find the bad sync before a good sync. So basically it's a non-issue.

But that is beside the point. In qoa a sync code would only be mildly beneficial as stated, more of a check that the file isn't corrupt (or probably more likely is actually a qoa file) than a sync. If the choice is zero padding or something cheap that might be useful then why not.

p0nce commented

As it stands, each frame can change the number of channels and/or sample rate which means that you need to read each frame header to be able to seek through an audio stream.

No?

When I implemented QOA seeking I first thought of iterating through each frame header, but after more reading:

  • it's not legal to have frames with less than 256 slices before the end of the stream.
  • in the same QOA, number of channels and SR cannot change anyway.

So I end up seeking by assuming a fixed frame size then decoding a few samples (less than 256*20), which is O(1)
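
A minimal sketch of this kind of constant-time seek, assuming an 8-byte file header, the frame layout from qoa.h, and that every frame before the target is a full one. The `seek_target` type and `seek_to_sample` function are hypothetical names, not part of any existing implementation:

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t frame_offset; /* byte offset of the target frame in the file */
    uint32_t skip_samples; /* samples to decode and discard inside it */
} seek_target;

/* Locates the frame containing `sample` (a per-channel sample index). */
static seek_target seek_to_sample(uint32_t num_channels, uint64_t sample) {
    assert(num_channels > 0);
    /* 8-byte frame header + per channel: 16-byte LMS state, 256 8-byte slices */
    const uint64_t frame_size = 8 + (uint64_t)num_channels * (16 + 8 * 256);
    seek_target t;
    t.frame_offset = 8 + (sample / 5120) * frame_size; /* skip file header */
    t.skip_samples = (uint32_t)(sample % 5120);
    return t;
}
```

Seeking to a timestamp then reduces to passing `seconds * samplerate` as the sample index. The discarded part is always less than one frame (256*20 samples).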

I'll note that if QOA had a lossless mode (optional residual after predictors?) it could be a kind of in-memory sound storage like DXT.

gmta commented

@p0nce please read the issue, what you wrote was already established in the first comment here.

p0nce commented

Hello thanks but I did read the whole issue before, I was disagreeing with that sentence and still do:

Even if the number of channels remains constant, you can only seek to certain sample offsets and cannot seek to certain timestamps or even calculate the timestamp after seeking without decoding all frames in between.

This is only true if @kleinesfilmroellchen's suggestion of a fixed frame size is rejected.
There is a "frame length in frame header" vs "fixed frame" decision to be made, and the current qoa.h is "fixed frame" despite the header field existing. btw I don't have an opinion about this for the record.

I think it would be a good idea to force the frame size to 5120 samples and zero-pad the data at the end if necessary

Cool idea.

gmta commented

@p0nce yeah, that was in qoa.h all along I believe? As referenced earlier.

Just to clarify: yes, the current spec draft mandates the frame length to be exactly 256 slices per channel for all frames except the last.

I will align the comments and format description in qoa.h with the spec shortly.

I do not like this suggestion. A sync code is decently useless if it can accidentally appear in the stream.

A sync code is designed around the fact that it can appear anywhere in the data. The purpose of a sync code is to allow the decoder to resume decoding at the point where it reads the sync code and then verify that the decoding process is consistent with the data it's reading. If the previous suggestion was a frame header like this:

struct {
    char     sync[4];      // magic bytes 'qoa\0' as a way to try and detect/recover from corruption, particularly when non-seekable or offset is unknown
    uint16_t fsamples;     // sample count per channel in this frame
    uint16_t pad;          // 0x0000
} frame_header;                // = 64 bits

Then the decoder knows that 2 bytes later it needs to read 16 zeros (still possible in data but less probable) and that one frame length later, as given by fsamples, there will be another sync code (again possible in the data but even less probable). Using a sync code, the decoder can reasonably assume that after 2 or 3 frames (less than a second of audio) decoded without any inconsistency it can start to output decoded frames again.

Even if the ultimate solution for the problem of reconnecting to an audio stream is not using a sync code, something must be done, because right now the decoder cannot check the consistency of the data it's decoding

Frame sync is still possible without an explicit sync token. Consider the current frame header:

struct {
	uint8_t  num_channels; // no. of channels
	uint24_t samplerate;   // samplerate in hz
	uint16_t fsamples;     // samples per channel in this frame
	uint16_t fsize;        // frame size (includes this header)
} frame_header; 

fsamples must be 5120 and fsize can be calculated/verified from fsamples and num_channels. If these values are consistent for a few frames, sync has been found.

Granted, an explicit sync token would be more intuitive but – as far as I can tell – has no other advantages over this approach.
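
The consistency check described above could look roughly like this. It is only a sketch under assumptions: header fields are read as big-endian in the order shown in the struct, only full frames (fsamples == 5120) are accepted, and `plausible_frame_size` and `find_sync` are hypothetical helpers rather than functions from qoa.h:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the frame size if the 8 bytes at p look like the header of a
 * full frame, or 0 otherwise. */
static uint32_t plausible_frame_size(const uint8_t *p) {
    uint32_t channels   = p[0];
    uint32_t samplerate = ((uint32_t)p[1] << 16) | ((uint32_t)p[2] << 8) | p[3];
    uint32_t fsamples   = ((uint32_t)p[4] << 8) | p[5];
    uint32_t fsize      = ((uint32_t)p[6] << 8) | p[7];
    uint32_t expected   = 8 + channels * (16 + 8 * 256);
    if (channels == 0 || samplerate == 0) return 0;
    if (fsamples != 5120 || fsize != expected) return 0;
    return fsize;
}

/* Scans buf for the first offset at which `needed` consecutive headers are
 * consistent. Returns that offset, or len if none is found. */
static size_t find_sync(const uint8_t *buf, size_t len, int needed) {
    assert(needed > 0);
    for (size_t start = 0; start + 8 <= len; start++) {
        size_t pos = start;
        int ok = 0;
        while (ok < needed && pos + 8 <= len) {
            uint32_t size = plausible_frame_size(buf + pos);
            if (size == 0) break;
            pos += size; /* jump to where the next header should be */
            ok++;
        }
        if (ok == needed) return start;
    }
    return len;
}
```

How many consecutive frames to require is a trade-off between false positives and resync latency; the thread above suggests 2 or 3.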

If fsamples will always be 5120 (except the last frame, but that's too late to sync anyway) then that's effectively a sync code :)

I wasn't sure if fsamples had to be always the same value, based on the suggestions presented before, but if that's the case, I don't think frame_header needs to change at all.

You could even change samplerate dynamically and the sync process would be the same