EricGuo5513/momask-codes

1:1 mapping between motion and latent space assumed for in-painting?


Hi,
Great codebase and extensive results. For your in-painting experiment: since masked generation operates in the latent space of the VQ-VAE, how do you ensure that in-painting, say, frames [118-150] of a given sequence corresponds to certain specific tokens in the latent space? For instance, I see this line in the code:

_start = int(_start*seq_len)

Thank you for the clarification.

Hi, thank you for your interest.

Firstly, you may have noticed that the correspondence is actually 4 frames to 1 token. Therefore, the in-painting section (e.g., frames [118-150]) is rounded to multiples of 4.
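
To make the rounding concrete, here is a minimal sketch of the frame-to-token index conversion under the 4x downsampling described above; the function name and the exact floor/ceil convention are my own illustration, not the repo's code:

```python
DOWNSAMPLE = 4  # one VQ token covers 4 motion frames

def frames_to_token_range(start_frame: int, end_frame: int) -> tuple[int, int]:
    """Map a frame interval to the token interval that covers it.

    The start is rounded down and the end rounded up to multiples of 4,
    so the masked token span fully contains the requested frames.
    (Hypothetical helper for illustration only.)
    """
    start_tok = start_frame // DOWNSAMPLE          # floor
    end_tok = -(-end_frame // DOWNSAMPLE)          # ceil
    return start_tok, end_tok

# e.g., frames [118, 150] -> tokens [29, 38), i.e. frames [116, 152)
print(frames_to_token_range(118, 150))  # (29, 38)
```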

Secondly, regarding the latent-motion correspondence: we build our VQ-VAE with a shallow 1D convolutional network. Convolutions inherently preserve the temporal structure of the sequence, and the shallow depth keeps the receptive field relatively small. So while we cannot assert that one token maps precisely to a specific 4-frame motion clip, each token should carry the dominant information of its temporally corresponding 4-frame clip. Masking that token then effectively erases those 4 frames for in-painting.
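
As a rough illustration of that point, below is a hedged sketch of a shallow 1D convolutional encoder with a total temporal stride of 4; the layer count, kernel sizes, and dimensions are assumptions for illustration, not the actual MoMask architecture:

```python
import torch
import torch.nn as nn

class ShallowConvEncoder(nn.Module):
    """Illustrative shallow 1D conv encoder with 4x temporal downsampling."""

    def __init__(self, motion_dim: int = 263, latent_dim: int = 512):
        super().__init__()
        # Two stride-2 convolutions give a 4x temporal downsampling,
        # so each output latent aligns with a 4-frame window. With
        # kernel size 4 at both layers, each latent sees only ~10
        # input frames: a relatively small receptive field.
        self.net = nn.Sequential(
            nn.Conv1d(motion_dim, latent_dim, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(latent_dim, latent_dim, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, motion_dim) -> (batch, frames // 4, latent_dim)
        z = self.net(x.transpose(1, 2))
        return z.transpose(1, 2)

x = torch.randn(1, 152, 263)   # 152 frames of motion features
z = ShallowConvEncoder()(x)
print(z.shape)                 # torch.Size([1, 38, 512])
```

Because the receptive field spans only a handful of frames, each latent token is dominated by its local 4-frame window, which is what makes masking at the token level a reasonable proxy for erasing the corresponding frames.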

Hope my answer resolves your question.

Okay, I had understood it along the same lines. Much clearer now. Thanks!