BeierZhu/Prompt-align

Why does Table 1 show CLIP performance inconsistent with the CoCoOp paper?

Closed this issue · 4 comments

Greetings.
Table 1 shows that vanilla CLIP obtains 65.13 (base) and 69.02 (new) accuracy. However, the accuracies reported in the CoCoOp paper are 69.34 (base) and 74.22 (new).
Note that vanilla CLIP is never modified, so its performance should always be fixed. Is this an error?
Besides, why do you choose 4 shots for the base-to-new generalization experiment in Table 1, given that both CoOp and CoCoOp use 16 shots? What is the purpose?

  1. Please read the CoCoOp paper carefully. CoCoOp uses a ViT backbone, while CoOp uses ResNet. To make a fair comparison with both CoOp and CoCoOp, we select ResNet for all experimental settings.

  2. Strictly speaking, 16 shots is not a few-shot setting (conventional few-shot learning uses 1 or 5 shots), so we choose 4.

Dear Ma Chengcheng,

If you are interested in this work, we can connect on WeChat for further discussion. My account is: Thu_E_E

Best wishes.

Another difference I forgot to mention: in the CoCoOp paper, all experiments are run for only 10 epochs due to the training inefficiency of CoCoOp. But according to my training log, 10 epochs is insufficient for model convergence, which makes the comparison between different models unfair. Thus I adopt the CoOp setting for the base-to-new experiments, i.e., 50 epochs for 1 shot, 100 epochs for 2 and 4 shots, and 200 epochs for 8 and 16 shots.
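The shots-to-epochs schedule described above can be sketched as a simple lookup. Note this is a hypothetical helper for illustration, not code from the repository; the actual training configuration may be set elsewhere (e.g. in config files):

```python
# CoOp-style schedule as described in the comment above.
# This mapping is an illustrative assumption, not the repository's actual config.
EPOCHS_BY_SHOTS = {1: 50, 2: 100, 4: 100, 8: 200, 16: 200}

def max_epochs(num_shots: int) -> int:
    """Return the number of training epochs for a given shot count."""
    if num_shots not in EPOCHS_BY_SHOTS:
        raise ValueError(f"unsupported shot count: {num_shots}")
    return EPOCHS_BY_SHOTS[num_shots]
```

For example, the 4-shot base-to-new runs in Table 1 would use `max_epochs(4)`, i.e. 100 epochs.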

Much appreciated~