ariG23498/mae-scalable-vision-learners

Unshuffle the patches?

Closed this issue · 2 comments

Your code helps me a lot! However, I still have some questions. In the paper, the authors say they unshuffle the full list before applying the decoder. In the MaskedAutoencoder class of your implementation,
```python
decoder_inputs = tf.concat([encoder_outputs, masked_embeddings], axis=1)
```
no unshuffling is applied. Could you explain the reasoning behind this? Thanks a lot!

The shuffling and unshuffling parts are implementation details. We implement masking and unmasking differently, without loss of generality.

Hey @changtaoli
To add to what @sayakpaul has said: the authors unshuffle the data to restore the patches to the order in which they are supposed to enter the decoder. In our implementation we do not unshuffle the patches; instead, we add the positional information to the patches so that the Transformer decoder still has all the signal it needs.

You can think of adding positional information as a neat trick that replaces the explicit unshuffling of the patches: a Transformer is permutation-equivariant apart from the positional embeddings, so as long as each token carries the embedding of its true patch location, the order in which the tokens are fed in does not matter.
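Here is a minimal sketch of the equivalence, assuming hypothetical names and shapes. `shuffle_indices`, `pos_embeddings`, the patch counts, and the random tensors standing in for real encoder outputs and mask tokens are all placeholders, not the repository's actual code:

```python
import tensorflow as tf

batch_size, num_patches, dim = 2, 16, 8
num_masked = 12  # e.g., 75% masking ratio

# A random permutation of patch indices per batch element, as used for masking.
shuffle_indices = tf.argsort(
    tf.random.uniform((batch_size, num_patches)), axis=-1
)

# Positional embeddings for every patch position (shared across the batch).
pos_embeddings = tf.random.normal((1, num_patches, dim))

# Stand-ins for the decoder inputs: visible-patch encodings followed by
# mask tokens, still in shuffled order.
encoder_outputs = tf.random.normal((batch_size, num_patches - num_masked, dim))
masked_embeddings = tf.random.normal((batch_size, num_masked, dim))
decoder_inputs = tf.concat([encoder_outputs, masked_embeddings], axis=1)

# Approach 1 (paper): unshuffle the tokens back into the original patch
# order before the decoder, by inverting the shuffle permutation.
unshuffle_indices = tf.argsort(shuffle_indices, axis=-1)
unshuffled = tf.gather(decoder_inputs, unshuffle_indices, axis=1, batch_dims=1)
decoder_inputs_v1 = unshuffled + pos_embeddings

# Approach 2 (this repo's idea): keep the shuffled order but gather the
# positional embeddings in the SAME shuffled order, so each token still
# carries the embedding of its true position.
shuffled_pos = tf.gather(
    tf.tile(pos_embeddings, (batch_size, 1, 1)),
    shuffle_indices, axis=1, batch_dims=1,
)
decoder_inputs_v2 = decoder_inputs + shuffled_pos
```

Either way, every token ends up paired with the positional embedding of its true patch location, which is all the decoder needs.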

Hope this helps!