TonyLianLong/CrossMAE

Question about the AE neck width vs input size

Closed this issue · 1 comment

Okay, so the input is 224x224 and is split into 16x16 patches with 3 channels, so each patch is 768 raw values. The embedding dimension is 1024 per patch token, which is larger than the patch itself, so nothing is compressed at that stage. A huge encoder runs on this, producing output the same size as its input. The decoder then linearly maps the 1024 features down to 512 (slightly smaller than the 768-value patch input), adds mask tokens and the class token, and finally projects 512 -> 768 at the end to reconstruct each patch...
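For reference, here is that dimension arithmetic as a minimal sketch (assuming the standard MAE ViT-Large defaults of embed_dim=1024 and decoder_embed_dim=512; CrossMAE may configure these differently):

img_size, patch_size, channels = 224, 16, 3
num_patches = (img_size // patch_size) ** 2        # 14 * 14 = 196 patch tokens
patch_values = patch_size * patch_size * channels  # 16 * 16 * 3 = 768 raw values per patch
embed_dim = 1024         # encoder width, larger than the 768 raw values
decoder_embed_dim = 512  # decoder width, ~33% narrower than a raw patch
print(patch_values, embed_dim, decoder_embed_dim)  # 768 1024 512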

It looks like the neck of the autoencoder is very wide compared to the input: 512 is only about 33% smaller than the 768 values per patch. Am I reading this wrong? Surely even a simple autoencoder could reconstruct well through such a wide neck, no?

Looking at the code here:

latent, mask, ids_restore = self.forward_encoder(imgs, mask_ratio)

Ah, never mind. The objective is recovering the masked patches, not compressing the embeddings, so the masking itself is the bottleneck and the decoder width has nothing to do with it.
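To make that concrete, here is a rough toy sketch of the MAE-style flow (hypothetical stand-in modules, not the CrossMAE code): 75% of the patch tokens are dropped before the encoder, and the decoder must predict the dropped patches' pixels without ever seeing them, so the 512-wide channel is not what constrains reconstruction.

import torch

# Toy numbers matching the discussion above
num_patches, mask_ratio = 196, 0.75
embed_dim, dec_dim, patch_values = 1024, 512, 768

tokens = torch.randn(1, num_patches, embed_dim)    # all patch embeddings
num_keep = int(num_patches * (1 - mask_ratio))     # 49 visible tokens

perm = torch.randperm(num_patches)
visible = tokens[:, perm[:num_keep]]               # encoder sees only these

latent = visible                                   # stand-in for the big encoder
dec = torch.nn.Linear(embed_dim, dec_dim)(latent)  # the 1024 -> 512 projection

# Mask tokens stand in for the 147 dropped positions, then the decoder
# predicts their raw pixels (512 -> 768 per patch)
mask_tokens = torch.zeros(1, num_patches - num_keep, dec_dim)
full = torch.cat([dec, mask_tokens], dim=1)
pred = torch.nn.Linear(dec_dim, patch_values)(full)
print(pred.shape)                                  # torch.Size([1, 196, 768])

(As I understand it, CrossMAE specifically reconstructs the masked patches via cross-attention from mask tokens to the visible tokens rather than full self-attention, but the dimensional point is the same.)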