hmkx/HiNeRV

some questions about Hierarchical Encoding

Opened this issue · 5 comments

Hello! This is great work! But after reading the paper many times, I still haven't understood the idea of Hierarchical Encoding, and I couldn't find a good way to figure it out. Could you explain it in more detail? Thanks!

hmkx commented

Thanks for your interest in our work! Here are the ideas for using hierarchical encoding:

1.) In this work, we aim to utilize interpolation methods like bilinear instead of convolutional layers for upscaling the feature maps, as they are parameter-free. While these methods generate outputs with higher resolutions, the outputs do not contain richer information than the inputs. As shown in many works, applying positional encoding is one way to help neural representations model high-frequency details, which can also be beneficial for enhancing upscaled feature maps.

2.) To apply positional encoding for enhancing the upscaled feature maps, we prefer grid-based encoding as it learns faster; however, it requires a large amount of storage. Thus, we propose using the “local feature grids”, where the grids contain relative positional information and are much smaller.

In this paper, we consider the video signals and the latent representations generated by the network as 3D volumes, but we process them in 2D patches/frames. That is, the feature maps are 2D slices of the volume. In such a case, each pixel has a 3D “global coordinate” relative to the original volume.

Now, consider the case after upscaling feature maps by an upscaling factor S. For each “global coordinate” <t, u, v> in the upscaled feature maps, we compute the “local coordinate” as <t, u mod S, v mod S> and utilize this “local coordinate” for performing interpolation with the “local feature grids”, to obtain the hierarchical encoding. Since the local coordinates are at most <T - 1, S - 1, S - 1> (T is the number of frames, 0 <= t < T), and the scaling factor S is typically much smaller than the height and width, the local feature grids can be much more compact than the normal feature grids.
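The coordinate mapping and the storage argument above can be sketched as follows; the helper name `local_coord` and the concrete sizes are illustrative assumptions, not taken from the HiNeRV code:

```python
# Sketch of the local-coordinate mapping described above (names and sizes
# are illustrative, not from the HiNeRV code).

def local_coord(t: int, u: int, v: int, S: int) -> tuple[int, int, int]:
    # The temporal index is kept as-is; the spatial indexes wrap within
    # one S x S upscaling block, so they stay in the range [0, S - 1].
    return (t, u % S, v % S)

# Parameter count of a local grid vs. a normal (full-resolution) grid,
# for an illustrative 1080p sequence with upscaling factor S = 2:
T, H, W, C, S = 600, 1080, 1920, 16, 2
full_grid_params = T * H * W * C    # grows with the frame resolution
local_grid_params = T * S * S * C   # grows only with the upscaling factor
```

Because the spatial extent of the local grid is only S x S, its size is independent of the frame resolution, which is exactly why it can be so much more compact than a normal feature grid.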

3.) Note that if the hierarchical encoding depended only on the spatial local coordinates <u, v>, it would provide little more information than convolutional layers. In our case, the hierarchical encoding is a function of both the frame index t and the spatial local coordinates <u, v>, so the output contains rich temporal information as well as the relative positional information. It provides a significant improvement for video sequences that contain fast motion, as shown in our ablation study.

Please let me know if you have any further questions!

Hello! I also have a question about hierarchical encoding.
Could you please help with it? Thanks.

Q1.)
From the paper, this work uses local coordinates, i.e., (u_local, v_local, t), to obtain the hierarchical encoding.
But from the code, I can see that only (0, 0, t) is used in the trilinear interpolation to get the target-size feature grids (both width and height are 1).
Then concatenation is done for the fully connected layer.
I'm not sure whether it is correct that both u_local and v_local are 0 in GridEncodingBase().

hmkx commented

Hi, in our implementation, the encoding is actually computed in two steps, see compute_temp_local_encoding.

def compute_temp_local_encoding(self, x: torch.Tensor, idx: torch.IntTensor, idx_max: tuple[int, int, int],

Here are the two steps:

(1.) Suppose we want to compute a batch of encodings with shape [N, T, H, W, C]. In the TemporalLocalGridEncoding class, the i-th level feature grids have a shape of [T_grid_i, 1, 1, K[0], K[1], K[2], C_grid_i]. Here, K is set to the upscaling ratio, so K[0] is set to 1 since we do not use temporal upsampling, while K[1] and K[2] are integers between 2 and 5 in our settings. The output of TemporalLocalGridEncoding, with shape [N, T, 1, 1, K[0], K[1], K[2], C], is obtained by interpolating from the multiple grid levels, concatenating, and applying a fully connected layer, as shown in TemporalLocalGridEncoding.forward.
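Step (1) could be sketched roughly as below; the grid sizes, channel counts, and the helper `temp_interp` are illustrative assumptions, not the actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch of step (1): two grid levels with different temporal
# resolutions are each interpolated along t only, concatenated over channels,
# and passed through a fully connected layer. All sizes here are made up.
K = (1, 2, 2)                       # upscaling ratio, K[0] = 1 (no temporal upsampling)
T = 8                               # target number of temporal positions
grids = [torch.randn(4, *K, 3),     # level 0: T_grid = 4, C_grid = 3
         torch.randn(2, *K, 5)]     # level 1: T_grid = 2, C_grid = 5
fc = nn.Linear(3 + 5, 16)           # maps concatenated channels to C = 16

def temp_interp(g: torch.Tensor, T: int) -> torch.Tensor:
    # Linear interpolation along the temporal axis only.
    Tg, K0, K1, K2, C = g.shape
    x = g.reshape(Tg, -1).t().unsqueeze(0)      # [1, K0*K1*K2*C, Tg]
    x = F.interpolate(x, size=T, mode='linear', align_corners=True)
    return x.squeeze(0).t().reshape(T, K0, K1, K2, C)

enc = torch.cat([temp_interp(g, T) for g in grids], dim=-1)  # [T, 1, 2, 2, 8]
enc = fc(enc)                                                # [T, 1, 2, 2, 16]
```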

(2.) In the second step, the local coordinates (u_local, v_local) are used to map the encodings to the feature maps. In compute_temp_local_encoding, a matrix M_3d is first computed by comparing the local pixel indexes and the encoding indexes. This matrix M_3d is then used in a matrix multiplication (with reshape) to map the output from the TemporalLocalGridEncoding class, with shape [N, T, 1, 1, K[0], K[1], K[2], C], to the final local encoding with shape [N, T, H, W, C].
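A rough sketch of step (2), under the assumption that M_3d reduces to a one-hot selection matrix over the S x S local positions (the actual computation in compute_temp_local_encoding may differ; all sizes are illustrative):

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of step (2): build a selection matrix from the local
# pixel indexes, then use a matrix multiplication (with reshapes) to map the
# per-block encoding to the full feature-map resolution.
T, H, W, C = 4, 6, 6, 8
S = 3                                     # spatial upscaling factor (K[1] = K[2] = S)
enc = torch.randn(T, 1, S, S, C)          # output of step (1), batch dim omitted

# Flattened local coordinate of every pixel: (u mod S) * S + (v mod S).
u = torch.arange(H).remainder(S)                      # [H]
v = torch.arange(W).remainder(S)                      # [W]
local = (u[:, None] * S + v[None, :]).reshape(-1)     # [H*W]
M = F.one_hot(local, S * S).float()                   # [H*W, S*S], role of M_3d

out = M @ enc.reshape(T, S * S, C)        # broadcast matmul -> [T, H*W, C]
out = out.reshape(T, H, W, C)             # final local encoding per pixel
```

Each pixel thus picks out the encoding vector of its own local position within the S x S block, realized as a matrix multiplication rather than indexing.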

Although the paper mentions that trilinear interpolation is used to obtain the encoding, the spatial size of the feature grids in TemporalLocalGridEncoding is always the same as the upscaling ratio, i.e., the range of the local coordinates. We can therefore simply interpolate in the temporal dimension with the coordinate t, then extract the encoding with the local coordinates (u_local, v_local) by operations like indexing or matrix multiplication (here, we use matrix multiplication). The order of some operations is also changed in this implementation for better performance and code organization, i.e., the concatenation and the fully connected layer are applied at the end of the first step, but these changes do not affect the output.

Hi, thanks for your kind help.
By the way, is the local coordinate below a typo?
Should S_n be 2, and should the local coordinates be as below?
If my understanding is wrong, please kindly correct me.

They should map to one of the vectors that come from TemporalLocalGridEncoding:
(0,0) (1,0)
(0,1) (1,1)

image

hmkx commented

Yes, it is a typo.
Thank you for pointing that out!