Important changes made to Dassl's transforms.py
You might find that OpenAI's code produces around 59% accuracy for zero-shot CLIP (`vision_model=RN50`) on ImageNet with prompt ensembling, while CoOp's code gives only 57.81% for the same model (see Table 7 in the paper).
This difference is caused by different transforms: OpenAI's code applies `Resize(224)` to an image, while CoOp's code (the previous version) used `Resize((224, 224))`. More specifically, the former keeps the image's aspect ratio while the latter does not. To make the results produced by CoOp's code comparable to OpenAI's, we have made our transforms consistent with theirs. The transforms in the config files have therefore been changed from `["random_flip", "random_translation", "center_crop", "normalize"]` to `["random_resized_crop", "random_flip", "normalize"]`.
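To see the difference concretely, here is a minimal torchvision sketch (not taken from the repo) contrasting the two resize behaviors. The BICUBIC interpolation and the normalization statistics match OpenAI's released preprocessing; `example.jpg` is just a placeholder image.

```python
from PIL import Image
from torchvision import transforms

img = Image.open("example.jpg")  # e.g. a 640x480 photo (placeholder path)

# OpenAI-style preprocessing: an int argument to Resize keeps the aspect
# ratio (shorter side -> 224), then CenterCrop squares the image.
openai_style = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    # Normalization statistics from OpenAI's released CLIP code
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])

# CoOp's previous behavior: an (h, w) tuple forces both sides to 224,
# distorting the aspect ratio of any non-square image.
old_coop_style = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

print(openai_style(img).shape)    # torch.Size([3, 224, 224]), undistorted crop
print(old_coop_style(img).shape)  # torch.Size([3, 224, 224]), squashed image
```

Both pipelines output a 224x224 tensor, which is why the mismatch is easy to miss: the shapes agree, only the image content differs.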
If you are using our Dassl-based CoOp code, please update to the latest version. If you want to use your own code, you can simply copy CoOp's model code (i.e., `CustomCLIP`) and run the comparison on the same ground with whatever pipelines you are using.
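If you want to sanity-check such a comparison yourself, an evaluation loop might look like the sketch below. `model` stands in for whatever you copied (e.g. CoOp's `CustomCLIP`), `zero_shot_accuracy` is a hypothetical helper name, and the dataset is assumed to already apply the same transform to every method being compared.

```python
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def zero_shot_accuracy(model, dataset, batch_size=64, device="cuda"):
    """Top-1 accuracy of a classifier that maps images to class logits.

    `model` and `dataset` are placeholders: the only point is that every
    method under comparison must receive identically preprocessed images.
    """
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=4)
    model.eval().to(device)
    correct, total = 0, 0
    for images, labels in loader:
        logits = model(images.to(device))  # (batch, num_classes)
        correct += (logits.argmax(dim=1).cpu() == labels).sum().item()
        total += labels.numel()
    return 100.0 * correct / total
```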
For reference, we have rerun CoOp using the new config files; the comparison with Table 7's results is shown below.
Previous version
| Method | RN50 | RN101 | ViT-B/32 | ViT-B/16 |
|---|---|---|---|---|
| Prompt engineering | 55.41 | 58.72 | 59.88 | 64.71 |
| Prompt ensembling | 57.81 | 60.49 | 62.01 | 67.31 |
| CoOp | 60.46 | 64.39 | 64.92 | 70.13 |
Current version
| Method | RN50 | RN101 | ViT-B/32 | ViT-B/16 |
|---|---|---|---|---|
| Prompt engineering | 58.18 | 61.26 | 62.05 | 66.73 |
| Prompt ensembling | 60.41 | 62.54 | 63.71 | 68.74 |
| CoOp | 62.95 | 66.60 | 66.85 | 71.92 |