Qinying-Liu/Awesome-Open-Vocabulary-Semantic-Segmentation

CLIP-DINOiser misclassified

Opened this issue · 3 comments

Hi, I do not fully understand why CLIP-DINOiser is considered weakly-supervised. There is no text supervision at all.

Thanks in advance!

In my opinion, weakly-supervised methods can be roughly divided into three groups: 1) methods trained on image-caption datasets; 2) methods trained on unlabeled data; 3) methods without training. The last two groups still rely on foundation models (e.g., CLIP) that were themselves trained on image-caption datasets, so I consider them essentially weakly-supervised as well. Although the term "training-free" has been accepted by the community, there is no widely used term for the second group. For now I place them in the first group, but I will continue to refine the classification of the methods.

Hi, thanks a lot for your reply.
I completely agree with you on the training-free methods, but I don't think that's the point here. Looking at the works listed in the weakly-supervised group, I notice that several of them do in fact quite different things. As you said, all of them rely on a foundation model (most often CLIP or CLIP-like), and I think the grouping should instead take into account the additional requirements of each method.
Most of the methods in the weakly-supervised group fine-tune, train additional modules, or train from scratch, but they still require yet another large-scale image-caption dataset (typically millions of image-caption pairs).
CLIP-DINOiser requires only 1k unlabeled images - this is different, and it most likely shows that there is a need for yet another group in this repo. To me, "unsupervised methods" makes sense, as there is training but no labels.

Let me know what you think! I highly appreciate this repo; it does a great job for the community, and I'm always impressed by your thoroughness. Since it is also very popular, there is a high risk of errors propagating quickly to the wider community (affecting how methods are compared against each other, how they are reviewed, and how they are discovered and referenced), and I believe our discussion points out one such error.

Thanks!
The authors of CLIP-DINOiser

I agree with your point about the necessity of a more detailed classification of weakly-supervised methods. However, I am unsure whether the term "unsupervised methods" is appropriate for these techniques, as they often build upon foundation models trained with labeled data. Nevertheless, I assure you that I will categorize the weakly-supervised methods more precisely, including adding tags to differentiate between them.