AOMediaCodec/iamf

About mixes and sub-mixes


I'm trying to better understand the concept of mixes and sub-mixes. Below are my interpretation of the spec and some suggestions; I would be happy to get confirmation and/or corrections.

I found Figure 1 misleading:

  • It does not show sub-mixes. It is not clear whether the "Process Mix" box is about sub-mixes. I think it is, because sub-mixes are the only way to reference a given audio element multiple times in a mix presentation.
  • It gives the impression that a) sub-mixes are used to target loudspeakers vs. headphones and that b) they are composed of the same audio elements. On a), I think this is not the case (see more on that below). On b), is that always the case?

The RenderingConfig in a sub-mix is not optional. This means that it is present even when the audio element is not meant to be rendered on headphones, so it is NOT an indication that the sub-mix is made for headphones. In fact, the signaling inside the RenderingConfig can differ from one audio element to the next within the same sub-mix. As I understand it, the RenderingConfig indicates that, if the player decides to render the sub-mix on headphones, then rendering shall use stereo or binaural as signaled; the same sub-mix could still be rendered on loudspeakers.
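
To make that reading concrete, here is a minimal Python sketch of a player picking a renderer per audio element. The names `SubMixElement` and `choose_renderer`, the returned strings, and the enum values are hypothetical and only illustrate my interpretation; they are not taken from the spec or from any implementation.

```python
from dataclasses import dataclass
from enum import Enum


class HeadphonesRenderingMode(Enum):
    # Illustrative values only (assumption, not checked against the spec).
    STEREO = 0
    BINAURAL = 1


@dataclass
class SubMixElement:
    # Hypothetical container: one audio element inside a sub-mix,
    # with the mode signaled in its RenderingConfig.
    audio_element_id: int
    headphones_rendering_mode: HeadphonesRenderingMode


def choose_renderer(element: SubMixElement, playback_is_headphones: bool) -> str:
    """Return which renderer the player would use for this element."""
    if not playback_is_headphones:
        # The RenderingConfig is ignored here: the same sub-mix is
        # simply rendered to the loudspeaker layout.
        return "loudspeaker"
    if element.headphones_rendering_mode is HeadphonesRenderingMode.BINAURAL:
        return "binaural"
    return "stereo"
```

Under this reading, two elements of the same sub-mix can end up with different headphone renderers, and neither choice says anything about whether the sub-mix was "made for" headphones.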

Continuing on the notion of sub-mix: based on the syntax, two sub-mixes can differ because:
a) they select different audio elements, or
b) they use the same audio elements, but
b.1) the relative gains of the elements are different;
b.2) or the relative gains are the same but the rendering configs are different. This case does not seem to make sense.
b.3) or the relative gains and the rendering configs are the same, but the overall gain of the sub-mix is different. It does not seem to make sense to provide another sub-mix that differs only in its overall gain.

Only a) or b.1) seem to make sense. In that case, it is not clear why one would use one presentation with two sub-mixes rather than two presentations with one sub-mix each.
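
The case analysis above can be sketched as a small Python classifier. `ElementEntry`, `SubMix`, and `classify_difference` are hypothetical names for illustration, and the sketch simplifies by comparing per-element gains directly rather than normalizing them into relative gains.

```python
from dataclasses import dataclass


@dataclass
class ElementEntry:
    element_gain_db: float   # per-element gain within the sub-mix
    headphones_mode: str     # "stereo" or "binaural" (from the rendering config)


@dataclass
class SubMix:
    elements: dict           # audio_element_id -> ElementEntry
    output_gain_db: float    # overall gain of the sub-mix


def classify_difference(x: SubMix, y: SubMix) -> str:
    """Return which case (a, b.1, b.2, b.3) distinguishes two sub-mixes."""
    if set(x.elements) != set(y.elements):
        return "a"
    ids = list(x.elements)
    if any(x.elements[i].element_gain_db != y.elements[i].element_gain_db for i in ids):
        return "b.1"
    if any(x.elements[i].headphones_mode != y.elements[i].headphones_mode for i in ids):
        return "b.2"
    if x.output_gain_db != y.output_gain_db:
        return "b.3"
    return "identical"
```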

About loudness. As I understand it, loudness information is descriptive. It describes the loudness that will occur if a particular layout is selected and if the various gains are applied. It can be used to select a sub-mix over another one, but most likely it is used to inform any loudness normalization.
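
As a sketch of what "inform loudness normalization" could mean in practice, the snippet below derives a playback gain from a signaled loudness value and a player-chosen target. The function names and the idea of a single target value are assumptions for illustration; the spec does not prescribe this computation as far as I can tell.

```python
def normalization_gain_db(signaled_loudness_db: float, target_loudness_db: float) -> float:
    """Gain (in dB) a player would apply to move the signaled loudness to its target."""
    return target_loudness_db - signaled_loudness_db


def db_to_linear(gain_db: float) -> float:
    """Convert a gain in dB to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)
```

For example, a mix signaled at -16 with a player target of -24 would be attenuated by 8 dB; the signaled value itself is never changed, which is what makes it descriptive.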

About animatable mix-presentation-level parameters. As I understand it, the following can be animated with a parameter substream:

  • the gain applied to an element within a sub-mix (element_mix_config)
  • the gain applied to the overall sub-mix (output_mix_config)
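
Here is a minimal sketch of how I picture these two gains being applied when both vary over time. The linear ramp is just one possible animation, chosen for illustration, and `linear_ramp` and `mix_sub_mix` are hypothetical names; samples are mono floats to keep the example short.

```python
def db_to_linear(gain_db: float) -> float:
    return 10.0 ** (gain_db / 20.0)


def linear_ramp(start_db: float, end_db: float, num_samples: int) -> list:
    """Per-sample gain values moving linearly from start_db to end_db."""
    if num_samples == 1:
        return [float(start_db)]
    step = (end_db - start_db) / (num_samples - 1)
    return [start_db + step * i for i in range(num_samples)]


def mix_sub_mix(element_samples: list, element_gains_db: list, output_gains_db: list) -> list:
    """Sum the gained elements, then apply the overall sub-mix gain.

    element_samples[k][t]  : sample t of rendered element k
    element_gains_db[k][t] : element gain (dB) for element k at sample t
    output_gains_db[t]     : overall sub-mix gain (dB) at sample t
    """
    mixed = []
    for t in range(len(output_gains_db)):
        total = sum(
            samples[t] * db_to_linear(gains[t])
            for samples, gains in zip(element_samples, element_gains_db)
        )
        mixed.append(total * db_to_linear(output_gains_db[t]))
    return mixed
```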

About "7.3.1. Selecting a Mix Presentation"
The whole section talks about "mix", but it is not clear whether that means "mix presentation" or "sub-mix".

"If there are any user-selectable mixes, the IA parser SHOULD select the mix, or mixes, that match the user’s preferences. An example might be a mix with a specific language. Mix Presentations MAY use mix_presentation_friendly_label to describe such mixes."

  1. How does one identify if there are such "user-selectable" mixes?
  2. The language in the MixPresentationOBU does not identify the language of the Mix (or submixes) but the language in which the annotations are provided.
    I suggest rephrasing as:
    "A player may decide to expose to the user the labels present in the MixPresentationOBU, in the user's preferred language if labels are present in multiple languages, and let the user select the Mix Presentation(s) based on the label."

I note that a player can select between 2 mixes based on annotations. However, sub-mixes don't have annotations. How can a player decide to select a given sub-mix then?
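
The label-based selection I have in mind for mix presentations could look like the sketch below. `MixPresentationLabels`, `label_for_user`, and `build_menu` are hypothetical names, and the fallback to the first available label is my own assumption. Note that it operates on mix presentations only, since sub-mixes carry no label to show.

```python
from dataclasses import dataclass


@dataclass
class MixPresentationLabels:
    mix_presentation_id: int
    labels: dict  # annotation language tag -> label text


def label_for_user(mix: MixPresentationLabels, preferred_language: str):
    """Pick the label in the user's language, else the first one available."""
    if preferred_language in mix.labels:
        return mix.labels[preferred_language]
    return next(iter(mix.labels.values()), None)


def build_menu(mixes: list, preferred_language: str) -> dict:
    """Map each mix presentation id to the label the player would display."""
    return {m.mix_presentation_id: label_for_user(m, preferred_language) for m in mixes}
```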

"2. If there is more than one valid mix remaining,"
That part seems to hint that some mixes will be made for headphones and some will be made for loudspeakers. It seems to me that a mix is just a selection of audio elements with their relative gains. As discussed above, the rendering config is a hint on how the player shall render the element when headphones are used, but a given audio element will most likely always have a binaural-preferred rendering config no matter what sub-mix (or mix) it is part of.

"If there is no such mix, select the mix with the highest available loudness_layout."
What does "highest available loudness_layout" refer to? The highest value of layout_type or the highest value of sound_system?

Let me share my understanding.

All the text in the v1 specification is written under the assumption that one MixPresentationOBU has only one sub-mix.

When one MixPresentationOBU has multiple sub-mixes, there is no relationship between any two of the sub-mixes. They are simply multiplexed into one IA sequence, nothing more.
For example,

  • Service A: streams IA sequence with MixPresentationOBU-A including one sub-mix A
  • Service B: streams IA sequence with MixPresentationOBU-B including one sub-mix B
  • Device C: receives Services A and B, multiplexes them into one IA sequence with MixPresentationOBU-C including sub-mix A and sub-mix B, and then transfers that IA sequence to Device D.
  • Device D: presents sub-mix A and sub-mix B by using MixPresentationOBU-C.
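
The Device C step in the example above can be sketched as follows. `MixPresentation` and `multiplex` are hypothetical names, and sub-mixes are represented as plain strings just to show that they are carried over unchanged and independently of each other.

```python
from dataclasses import dataclass, field


@dataclass
class MixPresentation:
    mix_presentation_id: int
    sub_mixes: list = field(default_factory=list)


def multiplex(new_id: int, *presentations: MixPresentation) -> MixPresentation:
    """Build one presentation that carries every input sub-mix, in order."""
    merged = []
    for presentation in presentations:
        merged.extend(presentation.sub_mixes)
    return MixPresentation(new_id, merged)
```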

"If there is no such mix, select the mix with the highest available loudness_layout."
What does "highest available loudness_layout" refer to? The highest value of layout_type or the highest value of sound_system?
==> I think that the intention was the highest value of sound_system.
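
Under that interpretation, the fallback rule could be sketched as below. This assumes that comparing the numeric sound_system values is what "highest" means, which is only my guess at the intention; `Mix` and `select_highest_sound_system` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Mix:
    mix_presentation_id: int
    sound_systems: list  # sound_system value of each loudness_layout in the mix


def select_highest_sound_system(mixes: list):
    """Return the mix whose loudness layouts include the largest sound_system value."""
    candidates = [m for m in mixes if m.sound_systems]
    if not candidates:
        return None
    # Assumption: a larger numeric sound_system value counts as "higher".
    return max(candidates, key=lambda m: max(m.sound_systems))
```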

I am going to close this issue soon. Please let me know if you have a concern.