dingjiansw101/ZegFormer

Does the direct use of the CLIP model violate the principle of zero-shot learning?

Closed this issue · 2 comments

Hello, author. You directly use the CLIP model to classify the class-agnostic binary masks during the testing phase. This seems to violate the principle of zero-shot learning, because CLIP already contains information about the unseen classes.

CLIP may contain information about unseen classes, but only at the image level; no pixel-level information about unseen classes is available during training. Besides, CLIP is trained without any human-annotated labels. In fact, using CLIP models is a relaxed setting, but one with more practical value than strict zero-shot semantic segmentation, and it is similar to the setting of open-vocabulary object detection [1, 2].
In addition, the proposed decoupling framework also improves models in the strict zero-shot semantic segmentation setting.

[1] Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and Shih-Fu Chang. Open-vocabulary object detection using captions. In CVPR, 2021.
[2] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Zero-shot detection via vision and language knowledge distillation. arXiv, 2021.
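
To make the test-time step concrete, here is a minimal sketch (not the exact ZegFormer pipeline) of how a single class-agnostic mask proposal can be classified with CLIP: the masked region is encoded with CLIP's image encoder and compared against text embeddings of candidate class names, which may include unseen classes. It assumes OpenAI's `clip` package; the `classify_mask` helper and the prompt template are illustrative only.

```python
# Illustrative sketch: classify one class-agnostic mask proposal with CLIP.
# Assumes OpenAI's `clip` package (pip install git+https://github.com/openai/CLIP.git).
import clip
import numpy as np
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def classify_mask(image, mask, class_names):
    """Return the class name whose CLIP text embedding best matches the masked region.

    image: PIL.Image (RGB), mask: HxW binary numpy array, class_names: list of strings
    (may include classes never seen with pixel-level supervision).
    """
    # Zero out pixels outside the binary mask, then crop to its bounding box.
    img = np.array(image)
    img[~mask.astype(bool)] = 0
    ys, xs = np.nonzero(mask)
    crop = Image.fromarray(img[ys.min():ys.max() + 1, xs.min():xs.max() + 1])

    image_input = preprocess(crop).unsqueeze(0).to(device)
    text_input = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

    with torch.no_grad():
        image_feat = model.encode_image(image_input)
        text_feat = model.encode_text(text_input)

    # Cosine similarity between the region embedding and each class-name embedding.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    sims = (image_feat @ text_feat.T).squeeze(0)
    return class_names[sims.argmax().item()]
```

Note that only image-level knowledge from CLIP is used here: the mask proposals themselves come from a class-agnostic segmenter, and no pixel-level annotations of unseen classes are involved.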

I hope that I have explained it well. Feel free to reopen this issue if you have further questions.