dingjiansw101/ZegFormer

Does the direct use of the CLIP model violate the principle of zero-shot learning?

Closed this issue · 2 comments

Hello, author. You directly use the CLIP model to classify the class-agnostic binary masks during the testing phase. This seems to violate the principle of zero-shot learning, because CLIP already contains information about the unseen classes.

CLIP may contain information about unseen classes, but only at the image level; no pixel-level information about unseen classes is available during training. Besides, CLIP is trained without any human-annotated labels. In fact, using CLIP models is a relaxed setting, but one with more practical value than strict zero-shot semantic segmentation, and it is similar to the setting of open-vocabulary object detection [1, 2].
In addition, the proposed decoupling framework also improves models in the strict zero-shot semantic segmentation setting.

[1] Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and Shih-Fu Chang. Open-vocabulary object detection using captions. In CVPR, 2021.
[2] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Zero-shot detection via vision and language knowledge distillation. arXiv, 2021.
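
To make the test-time step concrete, here is a minimal sketch (not the exact ZegFormer pipeline) of how a single class-agnostic mask proposal can be classified with CLIP: the masked region is encoded with CLIP's image encoder and compared against text embeddings of candidate class names, which may include unseen classes. It assumes OpenAI's `clip` package; the `classify_mask` helper and the prompt template are illustrative only.

```python
# Illustrative sketch: classify one class-agnostic mask proposal with CLIP.
# Assumes OpenAI's `clip` package (pip install git+https://github.com/openai/CLIP.git).
import clip
import numpy as np
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def classify_mask(image, mask, class_names):
    """Return the class name whose CLIP text embedding best matches the masked region.

    image: PIL.Image (RGB), mask: HxW binary numpy array, class_names: list of strings
    (may include classes never seen with pixel-level supervision).
    """
    # Zero out pixels outside the binary mask, then crop to its bounding box.
    img = np.array(image)
    img[~mask.astype(bool)] = 0
    ys, xs = np.nonzero(mask)
    crop = Image.fromarray(img[ys.min():ys.max() + 1, xs.min():xs.max() + 1])

    image_input = preprocess(crop).unsqueeze(0).to(device)
    text_input = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

    with torch.no_grad():
        image_feat = model.encode_image(image_input)
        text_feat = model.encode_text(text_input)

    # Cosine similarity between the region embedding and each class-name embedding.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    sims = (image_feat @ text_feat.T).squeeze(0)
    return class_names[sims.argmax().item()]
```

Note that only image-level knowledge from CLIP is used here: the mask proposals themselves come from a class-agnostic segmenter, and no pixel-level annotations of unseen classes are involved.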

I hope that I have explained it well. Feel free to reopen this issue if you have further questions.