wangf3014/SCLIP

Using other ViT models yields lower mIoU scores


Thanks for open-sourcing this great work!

I am able to reproduce the reported metrics with the ViT-B/16 encoder backbone on the Cityscapes and VOC20 datasets. However, replacing the vision encoder with ViT-B/32 or ViT-L/14@336px in configs/base_config.py, while keeping all other configuration unchanged, results in noticeably lower scores (the exact change I made is sketched after the table). Below is a summary of dataset, pretrained vision encoder, and mIoU:

| dataset | encoder | mIoU |
| --- | --- | --- |
| Cityscapes | ViT-B/16 | 32.35 |
| Cityscapes | ViT-B/32 | 22.62 |
| Cityscapes | ViT-L/14@336px | 12.44 |
| VOC20 | ViT-B/16 | 81.53 |
| VOC20 | ViT-B/32 | 76.67 |
| VOC20 | ViT-L/14@336px | 50.35 |
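
For reference, the backbone swap was just a one-line edit in configs/base_config.py, roughly as sketched below. The key name `clip_path` is only illustrative here; the actual field name in the repo's config may differ, and everything else was left at its defaults.

```python
# configs/base_config.py -- sketch of the only edit I made (key name illustrative).
model = dict(
    clip_path='ViT-L/14@336px',  # originally 'ViT-B/16'; also tried 'ViT-B/32'
    # ... all other settings left unchanged ...
)
```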

There seems to be something off with the other pretrained CLIP vision encoders, especially ViT-L/14@336px. Are there any parameters or configuration options that need to be adjusted for other vision encoders? Could you suggest possible reasons for the lower performance? Thank you so much!
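
In case it is relevant, the one obvious difference I can think of is the expected input resolution. This is just a sanity check with the standard openai/clip package (outside of SCLIP, not something from this repo):

```python
import clip

# Quick check of each backbone's expected input resolution (openai/clip API).
# ViT-B/16 and ViT-B/32 expect 224x224 inputs, while ViT-L/14@336px expects
# 336x336, so I wonder whether an image-/crop-size setting should change too.
for name in ["ViT-B/16", "ViT-B/32", "ViT-L/14@336px"]:
    model, _ = clip.load(name, device="cpu")
    print(name, "->", model.visual.input_resolution)
```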