phoboslab/qoa

Specification Draft

phoboslab opened this issue · 16 comments

While there are still some details to discuss (specifically fields in the file & frame headers), I started working on the file format specification. The current draft can be found here:

https://qoaformat.org/qoa-specification-draft-01.pdf

I'm sure I forgot to mention some details and/or need to clarify things. Please let me know!

Is there a valid reason to allow zero channels? I suppose it could be used to encode digital silence. If you don't want that, probably best to have num_channels represent 1 to 256 or at least 1 to 255. Very low bitrates trip up streams IIRC, which I think is why libFLAC implemented a min bitrate option.

Other than that it looks good to me (haven't had a chance to fully understand the algorithm so can't validate that), you've even used flac's default channel order which is nice.

The spec forbids 0 channels:

A valid QOA file must have at least one frame, containing at least one channel and one sample with a samplerate between 1..16777215

We could surely define the num_channels field to represent a range of 1 .. 256, but I think it's neat that the value in the file is the value, without any transformations (same for samplerate); and it's not like having just 255 channels instead of 256 is a big limitation :)

A valid QOA file must have at least one frame, containing at least one channel and one sample with a samplerate between 1..16777215

That forbids 0 channels for static files when paired with this

static files (those with samples set to a non-zero value), each frame must have the same number of channels and same samplerate.

but not a streaming context. Explicitly stating num_channels is 1..255 would solve that.
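
To make the proposal concrete, here is a minimal sketch of a per-frame check that would apply in both static and streaming contexts. The function name and constants are hypothetical and not taken from the spec or from qoa.h; the bounds are the ones quoted and proposed above.

```python
# Bounds taken from this thread: 1..255 channels (proposed above) and a
# samplerate of 1..16777215 (quoted from the draft).
MAX_CHANNELS = 255
MAX_SAMPLERATE = 16777215  # 2**24 - 1


def frame_header_is_valid(num_channels: int, samplerate: int) -> bool:
    """Hypothetical check: reject 0 channels and 0 samplerate in every frame."""
    return 1 <= num_channels <= MAX_CHANNELS and 1 <= samplerate <= MAX_SAMPLERATE
```

Running this on every frame header, rather than once per file, is what closes the streaming gap.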

Just trundling through the spec, (love it) and ever the stickler for understanding how tables are derived, is there a specific reason for the scalefactor generation to use a power value of 2.75 that I've missed?

Through careful deliberation (trial & error) I came to the conclusion that the prediction is usually accurate enough for the scalefactor to top out at 2048. It's also advantageous to have more precision on the lower end. pow(s, 2.75) satisfies both and is easy to document :)

Some more details here: https://github.com/phoboslab/qoa/blob/master/qoa.h#L155-L176
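
For anyone who wants to regenerate the table, here is a small sketch. It assumes 16 scalefactors derived as round(pow(s + 1, 2.75)) for s = 0..15, which is how I read the linked comment in qoa.h; the function name is made up for illustration.

```python
def scalefactor_table(exponent: float = 2.75, entries: int = 16) -> list[int]:
    """Assumed derivation: round((s + 1) ** exponent) for s in 0..entries-1."""
    return [round((s + 1) ** exponent) for s in range(entries)]


table = scalefactor_table()
print(table)
```

The first entry is 1 and the last is 16^2.75 = 2^11 = 2048, which is the ceiling mentioned above. The steps between entries grow with s, so the low end gets the finer resolution.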

Ahhh cool, I wondered if it was something obvious I'd missed or whether it was an exponent value that satisfied your requirements. Looking (well, sounding!) good. You realise you're going to have to end this series with a QOV too ;o) I had great fun writing simplified video codecs for use in games a few years back - they were quite OK! Keep up the great work.

Regarding the channel layout, I would like the standard non-film layout L, R, C, LFE, BL, BR, SL, SR (which the spec uses) to be "mandatory in general-purpose files" for channel counts 1 to 8. For how these are usually laid out; 1, 2, 3, and 8 are intuitive and https://datatracker.ietf.org/doc/html/draft-ietf-cellar-flac-07#name-channels-bits is in my opinion the best use of space in the other cases:

  • 4: ditch centre channel (from 3) in favour of two back/side (generic surround) channels: L, R, B/SL, B/SR
  • 5: add centre channel back: L, R, C, B/SL, B/SR
  • 6: add LFE: L, R, C, LFE, B/SL, B/SR
  • 7: add a single back channel before the last two side channels: L, R, C, LFE, B, SL, SR

This adds the most important channels first when increasing the channel count, makes centre and LFE have a constant position, and also ensures that for channel counts 4 and up, the last two channels are always some kind of surround channel.

This will prevent incompatible divergent implementations: Downmixing extra channels to stereo is the most important consideration.

Of course, if a file is application-specific (e.g. in games), it can deviate from this standard layout, but any general-use file should have to follow this layout.
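
The properties claimed above can be checked mechanically. This is only a sketch: the entries for 1, 2 and 3 channels are my reading of "intuitive" (mono, stereo, stereo plus centre), and the label "M" and the helper name are made up for illustration.

```python
# Proposed layouts per channel count; 4..8 are copied from the list above,
# 1..3 are assumed.
LAYOUTS = {
    1: ["M"],
    2: ["L", "R"],
    3: ["L", "R", "C"],
    4: ["L", "R", "B/SL", "B/SR"],
    5: ["L", "R", "C", "B/SL", "B/SR"],
    6: ["L", "R", "C", "LFE", "B/SL", "B/SR"],
    7: ["L", "R", "C", "LFE", "B", "SL", "SR"],
    8: ["L", "R", "C", "LFE", "BL", "BR", "SL", "SR"],
}

SURROUND = {"B/SL", "B/SR", "SL", "SR"}


def layouts_are_consistent(layouts: dict[int, list[str]]) -> bool:
    """Check the constant positions of C and LFE and the trailing surround pair."""
    for count, names in layouts.items():
        if len(names) != count:
            return False
        if "C" in names and names.index("C") != 2:
            return False
        if "LFE" in names and names.index("LFE") != 3:
            return False
        if count >= 4 and not set(names[-2:]) <= SURROUND:
            return False
    return True


print(layouts_are_consistent(LAYOUTS))
```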

For the detailed decoder explanation, some editorial suggestions:

  • The scalefactor for each slice is dequantized into sf by: -> The scalefactor **sf_quant** for each slice is dequantized into sf by:
  • Clarify the inclusivity of the final sample range.
  • History update could use the notation history[i] = history[i+1] which is more precise.
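
To illustrate the history notation, a minimal sketch of the shift it describes. The four-entry history is an assumption based on the reference qoa.h, and the function name is hypothetical.

```python
def update_history(history: list[int], sample: int) -> None:
    """Shift every entry down by one, then store the newest sample last."""
    for i in range(len(history) - 1):
        history[i] = history[i + 1]
    history[-1] = sample


history = [10, 20, 30, 40]
update_history(history, 50)
print(history)  # [20, 30, 40, 50]
```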

Finally, regarding frame sizes:

  • Mention the usual frame size of 5120 samples.
  • Clarify slice count specifications. That is: Each frame, except the last, MUST (RFC 2119) contain exactly 256 slices per channel. The last frame MAY contain any positive number of slices per channel, up to and including 256.

Great suggestions, thanks!

Here's an updated draft: https://qoaformat.org/qoa-specification-draft-02.pdf

Changes:

  • clarify that 0 channels, 0 samplerate is forbidden in ALL frames (regardless of file or streaming context)
  • specify all ranges n .. m as (inclusive)
  • more precise history update notation
  • channel layout (+ the spec now just says the layout is...)
  • state that channels are interleaved
  • wording and layout changed a bit

Question: what's the correct wording here?

  • Channels are interleaved per slice.
  • Slices are interleaved per channel.

Thanks for the corrections and the specification of standard channel layouts! There's a small typo "expcet" after the slice illustration.

Question: what's the correct wording here?

  • Channels are interleaved per slice.
  • Slices are interleaved per channel.

I think the latter is more accurate; the FLAC spec uses the term "channel-interleaved" regularly.
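
Whichever wording wins, a tiny example of the order might settle it. This sketch assumes the frame stores slices with the slice index as the outer loop and the channel as the inner one (slices[256][num_channels]); that is my reading of the draft rather than something stated in this thread, and the function name is hypothetical.

```python
def slice_order(num_channels: int, num_slices: int) -> list[tuple[int, int]]:
    """Return (slice index, channel index) pairs in assumed storage order."""
    return [(s, c) for s in range(num_slices) for c in range(num_channels)]


print(slice_order(num_channels=2, num_slices=3))
# [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
```

For stereo that reads as: slice 0 left, slice 0 right, slice 1 left, slice 1 right, and so on.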

I'm still missing the short mention of "usually 5120 samples per frame", did you forget or is there a particular reason?

Other than that, I'm very happy with the state of the spec!

I'm still missing the short mention of "usually 5120 samples per frame", did you forget or is there a particular reason?

I don't see the point. Specifying that a frame has 256 slices per channel and one slice has 20 samples should be sufficient!?
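
For reference, the arithmetic spelled out as a sketch: 256 slices of 20 samples give the 5120 figure, and only the last frame may be shorter. The function name is hypothetical and the counts are per channel.

```python
SLICES_PER_FRAME = 256
SAMPLES_PER_SLICE = 20
SAMPLES_PER_FRAME = SLICES_PER_FRAME * SAMPLES_PER_SLICE  # 5120


def frame_layout(total_samples: int) -> tuple[int, int]:
    """Return (number of frames, slices per channel in the last frame)."""
    if total_samples < 1:
        raise ValueError("a valid file needs at least one sample")
    full_frames, remainder = divmod(total_samples, SAMPLES_PER_FRAME)
    if remainder == 0:
        return full_frames, SLICES_PER_FRAME
    # Round up: a partially filled slice still occupies a whole slice.
    last_slices = -(-remainder // SAMPLES_PER_SLICE)
    return full_frames + 1, last_slices


print(frame_layout(44100))  # (9, 157)
```

One second at 44100 Hz therefore takes eight full frames plus a ninth with 157 slices per channel.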

Unrelated: Over in the Hydrogen Audio Forums there's a point being made to allow setting the channel allocation separately from the number of channels. I have to say that it seems like overkill for this otherwise very simplistic format. Is anything but 1, 2, 6 or 8 channels in common use these days? Also, FLAC enjoys widespread use despite not being able to allocate channels freely...

I believe FLAC can allocate channels freely, not in the format itself but through a well-supported extension stored as a Vorbis comment (WAVEFORMATEXTENSIBLE_CHANNEL_MASK), a scheme inherited from modern WAVE formats. A supporting player can read the comment to know the correct order of the channels.

I don't think it's necessary for qoa. It makes sense if the goal was maximum compatibility in a generic user-facing context, but the point of qoa is a simple way to store lossy audio that a programmer can shape to their whim if they have some specific needs like looping etc. Multiple channels and samplerate is the minimum and only absolute requirement to be a container for a chunk of something considered ready-to-go audio. Anything else can be bolted on if necessary IMO.

I agree. The channel allocation issue - while interesting - is drifting into bikeshedding territory. QOA is not (and does not want to be) a general purpose audio format in the same vein as MP3 or FLAC. I believe the current solution is certainly good enough for this format.

If nothing else pops up, I will declare the current spec draft 0.3 as final early next week.

Thanks everyone!

Repeating a typo note: "excpet" in the paragraph below the slice diagram. (That's also my last comment; everything else seems good)

I have declared the spec as final. It can be found on https://qoaformat.org

Closing this issue as completed!