lezhang7/Enhance-FineGrained

About the performance on ELEVATER

hiker-lw opened this issue · 9 comments

Hello, sorry to bother you again. I noticed that CE-CLIP's performance on ELEVATER reported in the paper is 53.2, but in my case it is 44.4 using your provided checkpoint. Since this is a huge gap, I don't know whether my code is wrong or there is some other reason. Would you mind sharing your test code for ELEVATER? Thanks very much!

Hi,

There was a bug in the ELEVATER evaluation in the original arXiv version: we did not incorporate all of the datasets. As a result, we observed a drop in zero-shot image classification performance across all models, including NegCLIP, SVLC, and ours, due to the absence of LoRA.

Instead, we report ImageNet1k linear-probing performance to demonstrate that the visual representation is robust and retains its original capabilities, following the NegCLIP paper. You can find the test code at [LAION-AI/CLIP_benchmark](https://github.com/LAION-AI/CLIP_benchmark).
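If it helps, here is a rough sketch of the kind of linear-probe evaluation we mean, using open_clip and scikit-learn. The checkpoint and ImageNet paths are placeholders, and CLIP_benchmark's own probe differs in details (it trains the probe with SGD), so treat this as an illustration rather than our exact script:

```python
# A rough linear-probe sketch (not the authors' exact script): extract frozen
# CLIP image features with open_clip, then fit a logistic-regression probe.
# The checkpoint path and ImageNet directories below are placeholders.
import numpy as np
import torch
import open_clip
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="/path/to/ce_clip_checkpoint.pt"  # placeholder
)
model = model.to(device).eval()

def extract_features(split_dir):
    """Encode every image in an ImageFolder directory with the frozen encoder."""
    loader = DataLoader(ImageFolder(split_dir, transform=preprocess),
                        batch_size=256, num_workers=8)
    feats, labels = [], []
    with torch.no_grad():
        for images, targets in loader:
            f = model.encode_image(images.to(device))
            f = f / f.norm(dim=-1, keepdim=True)  # L2-normalize, as in CLIP
            feats.append(f.cpu().numpy())
            labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

train_x, train_y = extract_features("/path/to/imagenet/train")  # placeholder
val_x, val_y = extract_features("/path/to/imagenet/val")        # placeholder

# A simple L-BFGS logistic-regression probe as in the CLIP paper;
# CLIP_benchmark trains its probe with SGD, so exact numbers will differ.
probe = LogisticRegression(max_iter=1000, C=3.16)
probe.fit(train_x, train_y)
print("linear-probe top-1:", probe.score(val_x, val_y))
```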

Best regards,

Thank you sincerely for your reply! Would you mind describing the bug in more detail?

We did not incorporate some datasets during the ELEVATER evaluation.

You mean the non-trivial performance drop of negative-text-augmented models like NegCLIP, DAC, CLIP-SVLC, etc. does indeed exist, and there is no bug in the ELEVATER evaluation code?

Yes, all models with hard negative text generation show a drop in zero-shot image classification performance, but they maintain their performance on the linear-probing classification task. Finetuning the text encoder is what causes this; training with LoRA can alleviate it effectively, though.
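To make the LoRA point concrete, here is a minimal hand-rolled sketch (none of these papers necessarily implement it this way); the `c_fc`/`c_proj` target names assume open_clip's MLP layout and may differ in other codebases:

```python
# Hand-rolled LoRA sketch: freeze a pretrained linear layer and learn a
# low-rank residual on top of it, so finetuning drifts less from the
# original weights. Module names below assume open_clip's layout.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pretrained weight frozen
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank  # standard LoRA scaling

    def forward(self, x):
        # lora_b starts at zero, so training begins exactly at the
        # pretrained model and only gradually adds the low-rank update.
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale

def add_lora(module: nn.Module, rank: int = 4, targets=("c_fc", "c_proj")):
    """Recursively wrap the named nn.Linear children with LoRA adapters."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and name in targets:
            setattr(module, name, LoRALinear(child, rank=rank))
        else:
            add_lora(child, rank=rank, targets=targets)
    return module

# e.g. adapt only the text tower of an open_clip model:
# model.transformer = add_lora(model.transformer)
```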

But DAC and CLIP-SVLC are also trained with LoRA, and I still observed a big performance drop (-12% and -7%, respectively). I feel this problem is hard to avoid when training with hard negative texts. Anyway, thanks very, very much~~ If it's convenient, could we add each other on WeChat to communicate?

I have been working on this task for a year now and still have not made any significant progress. I sincerely hope to have the opportunity to discuss it further with you if you don't mind~

Yes, sure! My WeChat ID is Leo723_Z. My hypothesis is that training with hard negative texts biases models towards the image-to-text retrieval task; it is a kind of out-of-distribution finetuning that makes the model forget its original capabilities. One easy fix is to mix the training with pretraining data, as in https://github.com/wjpoom/SPEC.
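Concretely, the mixing could be as simple as interleaving the two data streams during finetuning. The loaders and the every-second-step ratio below are placeholders, and SPEC's actual recipe may differ:

```python
# Sketch of mixing pretraining data into hard-negative finetuning, so the
# model keeps seeing the original distribution and forgets less.

def _infinite(loader):
    # Re-iterate a DataLoader forever without caching batches in memory.
    while True:
        for batch in loader:
            yield batch

def mixed_batches(finetune_loader, pretrain_loader, pretrain_every=2):
    """Yield finetuning batches, inserting one pretraining batch every
    `pretrain_every` steps as a regularizer against forgetting."""
    pretrain_iter = _infinite(pretrain_loader)
    for step, batch in enumerate(finetune_loader):
        yield batch, "hard_negative"
        if (step + 1) % pretrain_every == 0:
            yield next(pretrain_iter), "pretraining"

# Usage with hypothetical loaders over hard-negative pairs and generic
# image-text pairs (e.g. a CC3M shard):
# for (images, texts), source in mixed_batches(ft_loader, pt_loader):
#     loss = contrastive_loss(model, images, texts)
```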

Thanks so much! You are a really kind researcher, best regards~