facebookresearch/ConvNeXt-V2

On masking input images

xiaohao-lin1 opened this issue · 1 comment

Dear Author,

In the paper, you mention that masking is done on the raw images. However, in the code, masking is only applied after the stem layer. Can you explain the inconsistency? Thank you!

I think it is because the output of the stem layer is aligned to the patch boundaries, so masking before or after the stem gives the same result for the visible patches, and masking after the stem lets the mask live at a lower resolution. You could potentially save some compute by not computing the stem for masked-out blocks, but the overhead may prevent this from being worthwhile.
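To make the equivalence argument concrete, here is a minimal NumPy sketch, not the repository's code. It assumes the stem behaves like a non-overlapping 4x4, stride-4 convolution and that the mask is defined on the stem's output grid. `patchify_stem`, the tensor sizes, and the random weights are all hypothetical stand-ins chosen for illustration.

```python
import numpy as np

def patchify_stem(img, weight, bias, p=4):
    """Non-overlapping p x p, stride-p convolution (hypothetical stand-in for the stem)."""
    c, h, w = img.shape
    # Split the image into an (h/p, w/p) grid of patches, each flattened to c*p*p values.
    patches = img.reshape(c, h // p, p, w // p, p).transpose(1, 3, 0, 2, 4)
    patches = patches.reshape(h // p, w // p, c * p * p)
    return patches @ weight + bias  # shape: (h/p, w/p, out_dim)

rng = np.random.default_rng(0)
p, c, out_dim = 4, 3, 8  # illustrative sizes, not the model's real ones
img = rng.normal(size=(c, 16, 16))
weight = rng.normal(size=(c * p * p, out_dim))
bias = rng.normal(size=out_dim)

# Low-resolution mask on the stem's output grid: 1 = masked, 0 = visible.
mask = rng.integers(0, 2, size=(16 // p, 16 // p))

# Option A: mask the raw image, then apply the stem.
mask_px = np.kron(mask, np.ones((p, p), dtype=int))  # upsample mask to the pixel grid
feat_a = patchify_stem(img * (1 - mask_px), weight, bias, p)

# Option B: apply the stem, then mask its output.
feat_b = patchify_stem(img, weight, bias, p) * (1 - mask[..., None])

# Visible positions agree because each output cell only sees its own patch.
visible = mask == 0
same_on_visible = np.allclose(feat_a[visible], feat_b[visible])
```

Under these assumptions the visible positions match exactly. The masked positions differ (option A yields the bias term, option B yields zeros), but neither carries any information from the masked pixels.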