Questions about application to a plain ViT
Closed this issue · 1 comment
feivelliu commented
Very happy to see your code!
I am very interested in application to a plain ViT, can you provide some related tips?
Thank you so much!
impiga commented
Hi, you could easily create a new ViT
backbone class in backbone.py.
Following are some tips:
- For the implementation, you could refer to detectron2.
- ViT, by default, applies global attention in every layer. To enable window-based attention (similar to Swin Transformer), you could adjust the `window_size` and `window_block_indexes` options (here).
- Load a MAE pre-trained checkpoint.
- Add layer-wise learning rate decay for ViT. In our existing code, we have defined the `get_swin_layer_id` function for Swin Transformer; you could use it as a reference when adding an implementation for ViT. (Layer-wise learning rate decay is a widely adopted trick when fine-tuning Masked-Image-Modeling pretrained models.)
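For the window-attention tip, here is a minimal sketch of how the block indexes could be chosen. The helper name `window_block_indexes` and the "keep every third block global" pattern follow the ViTDet-style convention; treat this as an illustration, not the repository's actual API.

```python
def window_block_indexes(depth, global_every=3):
    """Return indexes of transformer blocks that should use windowed
    attention, keeping every `global_every`-th block as global attention
    (ViTDet-style interleaving; hypothetical helper for illustration)."""
    return [i for i in range(depth) if (i + 1) % global_every != 0]

# For a ViT-B (depth 12) with a global block every 3 layers:
# window_block_indexes(12) -> [0, 1, 3, 4, 6, 7, 9, 10]
```

The remaining blocks (2, 5, 8, 11 above) keep full global attention, which is what lets window-attended features propagate information across the whole image.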
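For the learning-rate-decay tip, a ViT counterpart of `get_swin_layer_id` could look like the sketch below. The function name `get_vit_layer_id` is hypothetical, and the parameter-name prefixes (`cls_token`, `pos_embed`, `patch_embed`, `blocks.N.`) assume a timm/MAE-style ViT; adapt them to whatever names your backbone actually uses.

```python
def get_vit_layer_id(param_name, num_layers):
    """Map a ViT parameter name to a layer id for layer-wise lr decay.
    Hypothetical analogue of get_swin_layer_id; prefixes assume a
    timm/MAE-style ViT. num_layers is typically depth + 1."""
    if param_name.startswith(("cls_token", "pos_embed", "patch_embed")):
        return 0  # embeddings get the smallest lr
    if param_name.startswith("blocks."):
        return int(param_name.split(".")[1]) + 1  # block i -> id i + 1
    return num_layers  # final norm / head: no extra decay


def lr_scale(layer_id, num_layers, decay_rate=0.75):
    """Per-layer lr multiplier: decay_rate ** (num_layers - layer_id),
    so later layers keep a larger learning rate."""
    return decay_rate ** (num_layers - layer_id)
```

With these ids you would build one optimizer parameter group per layer id, each with `base_lr * lr_scale(layer_id, num_layers)`.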