Question about the zero-shot setting
Zhimin-C opened this issue · 1 comment
Thanks for the great work! In this study, both 2D backbones, LSeg and OpenSeg, require text labels as input. This approach appears closer to weakly supervised learning (as described in https://openreview.net/pdf?id=4Q9CmC3ypdE), which leverages scene text labels, than to true zero-shot learning. Wouldn't it therefore be more appropriate to benchmark against weakly supervised learning baselines rather than zero-shot learning baselines?
@Zhimin-C thanks for your interest in our work! A CVPR reviewer actually asked roughly the same question, so I will reply by copying a paragraph from our answer :)
The terminology for zero-shot learning (ZSL) is ambiguous. In a theoretical ZSL system, there should be no training data of any kind from seen classes. However, almost all real-world ZSL systems utilize general-purpose feature extractors pretrained on proxy tasks over large datasets. For example, 3DGenZ proposes a ZSL variant utilizing image features pretrained on ImageNet (see Section 4.5 of their paper), and OpenSeg, LSeg, CLIP, and ALIGN propose ZSL methods trained on alt-text (as we do). The authors of those papers all describe their methods as ZSL, and we follow the same terminology.
Long story short: if you check the original CLIP paper and many of its follow-ups, they refer to their methods as zero-shot. We follow them :)
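To make the convention concrete, here is a minimal sketch of CLIP-style "zero-shot" classification, assuming the openai/CLIP package (`pip install git+https://github.com/openai/CLIP.git`). It shows the point being debated: text labels are consumed only at inference time to build the classifier, with no training on the target classes. The image path and label set below are hypothetical placeholders, not part of this repo.

```python
# Minimal sketch of CLIP-style zero-shot classification.
# Assumptions: the openai/CLIP package is installed; "example.jpg" and
# the class_names list are hypothetical placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# The text labels act as the "classifier weights": each class name is
# embedded once by the text encoder, at inference time only.
class_names = ["chair", "table", "sofa"]  # hypothetical label set
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    # Cosine similarity between the image embedding and each class embedding.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print("predicted class:", class_names[probs.argmax(dim=-1).item()])
```

Because the model was never trained on this label set, the CLIP line of work calls this zero-shot, even though the label names themselves are needed as input to form the text embeddings.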