MPEGGroup/FileFormat

Simplify Draft 'pymd' Box

Opened this issue · 8 comments

Concerning the Image Pyramid Entity Group Box 'pymd' in WG03N1157_23524 HEIF Ed3 Amd2 Prelim WD:

The following four variables seem redundant and could be removed to greatly simplify the 'pymd' box:

  1. tile_size_x
  2. tile_size_y
  3. tiles_in_layer_row_minus1
  4. tiles_in_layer_column_minus1
pymd ImageGrid
tiles_in_layer_row_minus1 rows_minus_one
tiles_in_layer_column_minus1 columns_minus_one
tile_size_x (output_width / columns_minus_one)
tile_size_y (output_height / rows_minus_one)

If the gridding is handled within the codec, then the codec specific boxes would be used instead. For example, the uncompressed codec would use the uncC box:

pymd uncC
tiles_in_layer_row_minus1 num_tile_rows_minus_one
tiles_in_layer_column_minus1 num_tile_cols_minus_one
tile_size_x (ispe image_width / num_tile_cols_minus_one)
tile_size_y (ispe image_height / num_tile_rows_minus_one)

I imagine these variables were originally placed here for convenience. However, there are already well defined mechanisms for accessing them. In my opinion, duplicating them in the pymd box adds complexity for very little gain.

We could even leave away the layer_binning entry. It appears to be equivalent to ceil(base width/overview width) and it is thus redundant. That would eliminate all custom data from the pymd box, leaving just the list of pyramid image item IDs, sorted according to increasing resolution.

Another reason for leaving away layer_binning is that it restricts us to power-of-two integer scaling factors only, while I think it would be perfectly fine to also have finer resolution granularity, like a factor $\sqrt 2$ per layer. For example, layers with image widths 1000, 707, 500.

Is there any important reason for the restriction that every pyramid level should have the same tile size? I think this is overly restrictive. There are good reasons why tile sizes would vary for different layers. For example, the base layer might have the original uncompressed data, while overview layers use h265 with a tile size dictated by the encoder hardware). Or building the lower resolutions is much easier if this can be done for each base-level tile separately.

If it is important for an application to have the same tile size in each level, the encoder is free to do so even without enforcing this, and having similar tile sizes could be signaled as a flag (but it's also obvious when looking at the individual pyramid levels).

FYI, the layer_binning isn't restricted to power-of-two scaling. It is restricted to integer factors. For instance, the examples in the document show 2x2 binning as a layer_binning of 2 and 4x4 binning as a layer_binning of 4. One can also choose to implement a 3x3 binning as a layer_binning of 3, etc.

the layer_binning isn't restricted to power-of-two scaling

Right. This was a semantic typo. I mean integer scale factor. I've corrected it above.

fmaze commented

I imagine these variables were originally placed here for convenience

Indeed, the purpose was that the reader has all the information it needs in a single place. Depending on which kinds of overviews are composing the pyramid, the information may be more or less complicated to gather (grid, tiled coded image with constrained extents, uncompressed image). For instance, for a grid image the example above is not totally correct due to possible implicit cropping, To be accurate you would have to get the width/height of the input images.

I agree it is worth simplifying in some cases, what about using a flag so that this information would be conditionally present or not?

A conditional flag is one option, but I think that the "difficulty" to get the information is no strong argument. Any decoder that is able to decode the images obviously also knows how the get the image sizes. And from there, all parameters can be trivially computed. Furthermore, it is not so clear why anyone other than the decoder would be interested in the number of tiles in each layer.

I can see two use-cases:

  • a multiplexing/demultiplexing application might want to know the size of each layer without being able to decode it. For example, a network server might want to know which resolution layer to stream. This could be solved by making the ispe property mandatory when an image is used in a pymd. The ispe is usually present anyway so that this will not be a big change and double (and thus possibly inconsistent) information is avoided.
  • it might be useful information whether each resolution layer uses the same tile size. This would be a good case for your conditional flag. If the flag is set, the pymd is restricted to use the same tile size in every layer and the tile size is indicated in the pymd header.

Thus, the specification would be:

  • ispe is mandatory for each layer image
  • the header is defined like this:
aligned(8) class ImagePyramidEntityGroup
extends EntityToGroupBox ('pymd', version = 0, flags) {
        if (flags & 1) {
  	  unsigned int(16) tile_size_x;
	  unsigned int(16) tile_size_y;
        }
}

AMD1 of HEIF has been revised. We invite experts to have a new look at the text.