Code for the article "Frustratingly Easy Test-Time Adaptation of Vision-Language Models", arXiv, May 2024.
Authors: Matteo Farina, Gianni Franchi, Giovanni Iacca, Massimiliano Mancini, Elisa Ricci.
We provide both pip requirements and a conda environment to install the dependencies of this repository; feel free to choose whichever best suits your needs. The code was tested with Python 3.11.9.
Install pip requirements:
pip install -r requirements.txt
Install with conda:
conda env create -f environment.yaml
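After creating the environment, activate it before running any of the commands below. The environment name used here is an assumption; check the name field in environment.yaml for the actual one:
conda activate zero   # assumed environment name, see environment.yaml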
The only model weights you need to download are MaPLe's pretrained initializations. For your convenience, we provide a script to download them automatically. Simply run:
./scripts/download_maple.sh
You should now have a weights folder with the three MaPLe ImageNet pretrained checkpoints provided by the authors (weights/maple_seed1.pth, weights/maple_seed2.pth and weights/maple_seed3.pth). Please check that everything is in place. Should you have any problems, please download the weights from this link and rename them accordingly.
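As an optional sanity check, you can list the expected files (the paths below are exactly those created by the download step above):
ls -lh weights/maple_seed1.pth weights/maple_seed2.pth weights/maple_seed3.pth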
We strongly suggest you create a datasets folder under the root of this repository and store all datasets there.
For robustness to natural distribution shifts, we consider ImageNet-1k and four variants:
- ImageNet-A
- ImageNet-V2 (we use the validation set of the MatchedFrequency version)
- ImageNet-Sketch
- ImageNet-R
For all datasets, simply download, extract and place them in the ./datasets folder. You should have the following structure:
./datasets/
| imagenet/
| | train/
| | | # class folders
| | val/
| | | # class folders
| imagenet-a/
| | # class folders
| imagenet-r/
| | # class folders
| imagenet-sketch/
| | # class folders
| imagenetv2-matched-frequency-format-val/
| | # class folders (0 to 999)
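A minimal sketch of setting up and sanity-checking this layout from the repository root, assuming the archives have already been downloaded and extracted:
mkdir -p datasets
# the ImageNet validation split should contain one folder per class
ls datasets/imagenet/val | head
ls datasets/imagenet-a | head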
For fine-grained classification, we adopt the same splits as Zhou et al. Please refer to this page for the installation of all datasets and the JSON files for the splits. Once everything is downloaded, please organize it as follows:
./datasets/
| caltech-101/
| | images/
| | | # class folders
| | split_zhou_Caltech101.json
| dtd/
| | images/
| | | # class folders
| | split_zhou_DescribableTextures.json
| fgvc_aircraft/
| | images/
| | | # list of images
| | # a bunch of txt files
| flower102/
| | jpg/
| | | # list of images
| | split_zhou_OxfordFlowers.json
| food101/
| | images/
| | | # class folders
| | split_zhou_Food101.json
| oxford_pets/
| | images/
| | | # list of images
| | split_zhou_OxfordPets.json
| sun397/
| | images/
| | | # lettered folders ('a', 'b', 'c', etc.)
| | split_zhou_SUN397.json
| ucf101/
| | images/
| | | # class folders
| | split_zhou_UCF101.json
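An optional check that the Zhou et al. split files sit in the expected locations (the paths are those listed in the tree above):
for f in \
  caltech-101/split_zhou_Caltech101.json \
  dtd/split_zhou_DescribableTextures.json \
  flower102/split_zhou_OxfordFlowers.json \
  food101/split_zhou_Food101.json \
  oxford_pets/split_zhou_OxfordPets.json \
  sun397/split_zhou_SUN397.json \
  ucf101/split_zhou_UCF101.json; do
  [ -f "datasets/$f" ] && echo "OK      $f" || echo "MISSING $f"
done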
IMPORTANT: at the time of developing this work, the official Stanford Cars website was unreachable. Please download the images from this Kaggle page and the annotations from this Drive link. You should organize the files as follows:
./datasets/
| stanford_cars/
| | images/
| | | train/
| | | | # list of images
| | | test/
| | | | # list of images
| | annots/
| | | labels.csv
| | | metadata.csv
| | | split_coop.csv
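Likewise, a quick check that the annotation files ended up where the code expects them (paths as in the tree above):
ls datasets/stanford_cars/annots/labels.csv \
   datasets/stanford_cars/annots/metadata.csv \
   datasets/stanford_cars/annots/split_coop.csv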
The entrypoint for this repository is run.py. Please run python run.py --help for an overview of the arguments.
We provide different bash files in scripts to run different versions of Zero:
- zero.sh runs vanilla Zero;
- zero_rlcf.sh runs the Zero variant with a smaller CLIP-ViT-B-16 and a larger CLIP-ViT-L-14.
Note that the --templates flag activates the ensemble of textual templates (+Ensemble in Tables 1 and 2 of the article). The --maple flag uses a MaPLe pretraining (only available with CLIP-ViT-B-16).
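For example, a minimal sketch of launching the provided scripts (any extra arguments the scripts expect are not shown here; check the script bodies and python run.py --help for the exact interface):
bash scripts/zero.sh        # vanilla Zero
bash scripts/zero_rlcf.sh   # Zero variant with CLIP-ViT-B-16 + CLIP-ViT-L-14
# the --templates and --maple flags described above are passed through to run.py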
If you find this work useful, please consider citing:
@article{farina2024frustratingly,
title={Frustratingly Easy Test-Time Adaptation of Vision-Language Models},
author={Farina, Matteo and Franchi, Gianni and Iacca, Giovanni and Mancini, Massimiliano and Ricci, Elisa},
journal={arXiv preprint arXiv:2405.18330},
year={2024}
}
Parts of this repository are based on the TPT, RLCF, MaPLe and CoOp repositories. Huge thanks to all the authors!
Please do not hesitate to file an issue or to contact me at m.farina@unitn.it. I'll do my best to help!