Questions about application to a plain ViT
Closed this issue · 1 comment
feivelliu commented
Very happy to see your code!
I am very interested in application to a plain ViT, can you provide some related tips?
Thank you so much!
impiga commented
Hi, you could easily create a new ViT
backbone class in backbone.py.
Following are some tips:
- For the implementation, you could refer to detectron2.
- ViT, by default, applies global attention in every layer. To enable window-based attention (similar to Swin Transformer), you could adjust the `window_size` and `window_block_indexes` options (here).
- Load a MAE pre-trained checkpoint.
- Add layer-wise learning rate decay for ViT. In our existing code, we have defined the `get_swin_layer_id` function for Swin Transformer; you could use it as a reference when adding an implementation for ViT. (Layer-wise learning rate decay is a widely adopted trick when fine-tuning Masked-Image-Modeling pretrained models.)
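For the window-attention tip, here is a minimal sketch of how the block indexes could be chosen. The helper name `window_block_indexes` and the "keep every third block global" pattern follow the ViTDet-style convention; treat this as an illustration, not the repository's actual API.

```python
def window_block_indexes(depth, global_every=3):
    """Return indexes of transformer blocks that should use windowed
    attention, keeping every `global_every`-th block as global attention
    (ViTDet-style interleaving; hypothetical helper for illustration)."""
    return [i for i in range(depth) if (i + 1) % global_every != 0]

# For a ViT-B (depth 12) with a global block every 3 layers:
# window_block_indexes(12) -> [0, 1, 3, 4, 6, 7, 9, 10]
```

The remaining blocks (2, 5, 8, 11 above) keep full global attention, which is what lets window-attended features propagate information across the whole image.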
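For the learning-rate-decay tip, a ViT counterpart of `get_swin_layer_id` could look like the sketch below. The function name `get_vit_layer_id` is hypothetical, and the parameter-name prefixes (`cls_token`, `pos_embed`, `patch_embed`, `blocks.N.`) assume a timm/MAE-style ViT; adapt them to whatever names your backbone actually uses.

```python
def get_vit_layer_id(param_name, num_layers):
    """Map a ViT parameter name to a layer id for layer-wise lr decay.
    Hypothetical analogue of get_swin_layer_id; prefixes assume a
    timm/MAE-style ViT. num_layers is typically depth + 1."""
    if param_name.startswith(("cls_token", "pos_embed", "patch_embed")):
        return 0  # embeddings get the smallest lr
    if param_name.startswith("blocks."):
        return int(param_name.split(".")[1]) + 1  # block i -> id i + 1
    return num_layers  # final norm / head: no extra decay


def lr_scale(layer_id, num_layers, decay_rate=0.75):
    """Per-layer lr multiplier: decay_rate ** (num_layers - layer_id),
    so later layers keep a larger learning rate."""
    return decay_rate ** (num_layers - layer_id)
```

With these ids you would build one optimizer parameter group per layer id, each with `base_lr * lr_scale(layer_id, num_layers)`.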