EricGuo5513/momask-codes

Why is token_embed randomly initialized for training MaskedTransformer in transformer.py?


Thanks for your great work. In transformer.py, I think token_embed should be initialized from the pretrained codebook via the function load_and_freeze_token_emb during training. Looking forward to your reply.
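For reference, here is roughly the initialization I had in mind. This is only a sketch; the class name, argument names, and the two extra embedding rows are illustrative, not the exact code in the repo:

```python
import torch
import torch.nn as nn

class MaskTransformer(nn.Module):
    def __init__(self, nb_code=512, code_dim=512):
        super().__init__()
        # nb_code motion tokens plus two extra rows (e.g. [MASK] and padding)
        self.token_emb = nn.Embedding(nb_code + 2, code_dim)

    def load_and_freeze_token_emb(self, codebook: torch.Tensor):
        # codebook: (nb_code, code_dim) taken from the pretrained VQ-VAE.
        # Copy it into the motion-token rows and stop gradient updates.
        with torch.no_grad():
            self.token_emb.weight[:codebook.size(0)] = codebook
        self.token_emb.requires_grad_(False)
```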

In this way, the bidirectional attention does not utilize the unmasked motion tokens to infer the masked ones. Could you explain how to interpret this result?

In this way, the bidirectional attention does not utilize the unmasked motion tokens to infer the masked ones.

I want to make sure we are on the same page.

The embedding layer can always be randomly initialized, just as we do for any learnable parameters. But that doesn't mean the learned correspondence {[token 0]: weight_0, [token 1]: weight_1, ..., [token mask]: weight_-1} is random.

"utilize the unmasked motion tokens to infer masked ones" should refer to the training paradigm, where we only learn to predict the tokens which are masked in the input, but not about semantic pre-define token embeddings.

If you still have any concerns, please let me know and explain why you came to that conclusion. Thanks.

I know that the correspondence learned by the embedding layer is not random.

I just wonder why using randomly initialized learnable token embeddings, instead of the pre-trained codebook, for looking up the unmasked motion tokens achieves better results. Compared with the learnable embeddings, the codebook has already encoded the motion representations from the first training stage.

In my opinion, using a randomly initialized embedding layer for the unmasked tokens cannot provide the motion prior learned in the first training stage for mask prediction.

Thanks for your clarification.

First, we want to re-emphasize that loading the pre-trained codebook or initializing the embedding layer randomly has nothing to do with "utilizing the unmasked motion tokens to infer masked ones". They are just two initialization options.

Second, I agree that the codebook from the first auto-encoder training stage can provide motion priors. Although, empirically, random initialization leads to better results, that does not mean loading the pre-trained codebook is wrong.
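To be explicit, the two options differ only in how the embedding weights start out; a rough sketch, with variable names and the checkpoint path purely illustrative:

```python
import torch
import torch.nn as nn

nb_code, code_dim = 512, 512
token_emb = nn.Embedding(nb_code + 2, code_dim)  # option 1: random init (what we use)

use_pretrained_codebook = False
if use_pretrained_codebook:                      # option 2: start from the VQ codebook
    codebook = torch.load("vq_codebook.pt")      # hypothetical checkpoint path
    with torch.no_grad():
        token_emb.weight[:nb_code] = codebook
```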

As for the reason, we suspect that the auto-encoder codebook carries a reconstruction prior, while what we learn in the masked transformer is a generative prior. GPT-2 and many other discrete token prediction models, e.g. MaskGIT, also use this setting or provide it as an option.