wangf3014/SCLIP

Using other ViT models yields lower mIoU scores


Thanks for open-sourcing this great work!

I am able to reproduce the reported metrics with the ViT-B/16 encoder backbone on the Cityscapes and VOC20 datasets. However, replacing the vision encoder with ViT-B/32 or ViT-L/14@336px in configs/base_config.py, while keeping all other configuration unchanged, results in noticeably lower scores (the exact change I made is sketched after the table). Below is a summary of dataset, pretrained vision encoder, and mIoU:

| dataset | encoder | mIoU |
| --- | --- | --- |
| Cityscapes | ViT-B/16 | 32.35 |
| Cityscapes | ViT-B/32 | 22.62 |
| Cityscapes | ViT-L/14@336px | 12.44 |
| VOC20 | ViT-B/16 | 81.53 |
| VOC20 | ViT-B/32 | 76.67 |
| VOC20 | ViT-L/14@336px | 50.35 |
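
For reference, the backbone swap was just a one-line edit in configs/base_config.py, roughly as sketched below. The key name `clip_path` is only illustrative here; the actual field name in the repo's config may differ, and everything else was left at its defaults.

```python
# configs/base_config.py -- sketch of the only edit I made (key name illustrative).
model = dict(
    clip_path='ViT-L/14@336px',  # originally 'ViT-B/16'; also tried 'ViT-B/32'
    # ... all other settings left unchanged ...
)
```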

There seems to be something off with the other pretrained CLIP vision encoders, especially ViT-L/14@336px. Are there any parameters or configuration options that need to be adjusted for other vision encoders? Could you suggest possible reasons for the lower performance? Thank you so much!
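
In case it is relevant, the one obvious difference I can think of is the expected input resolution. This is just a sanity check with the standard openai/clip package (outside of SCLIP, not something from this repo):

```python
import clip

# Quick check of each backbone's expected input resolution (openai/clip API).
# ViT-B/16 and ViT-B/32 expect 224x224 inputs, while ViT-L/14@336px expects
# 336x336, so I wonder whether an image-/crop-size setting should change too.
for name in ["ViT-B/16", "ViT-B/32", "ViT-L/14@336px"]:
    model, _ = clip.load(name, device="cpu")
    print(name, "->", model.visual.input_resolution)
```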