This repository contains the code of our CVPR 2024 paper, in which we tackle transductive zero-shot and few-shot classification with the vision-language model CLIP. Instead of classifying each unlabeled image independently, our approach processes a whole batch of unlabeled images jointly, which improves accuracy over methods that treat each image separately. We introduce a classification framework that operates on probability (softmax) features, together with an optimization procedure in the spirit of the Expectation-Maximization algorithm. On zero-shot tasks with test batches of 75 samples, our methods EM-Dirichlet and Hard EM-Dirichlet yield a near 20% improvement in ImageNet accuracy over CLIP's zero-shot performance.
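To make the idea concrete, below is a minimal, illustrative sketch of a hard-assignment EM-style loop over softmax features. It uses a plain KL-based centroid update (closer in spirit to the KL K-means baseline than to the paper's Dirichlet model), and everything in it (function name, initialization, number of iterations) is an assumption for illustration rather than the repository's implementation; in the repository, the EM-Dirichlet and Hard EM-Dirichlet updates replace this simple centroid model with a Dirichlet likelihood.

```python
import numpy as np

def hard_em_on_softmax(p, n_classes, n_iter=20, eps=1e-12):
    """Toy hard-assignment EM-style clustering of softmax features.

    NOTE: illustrative simplification, not the paper's Dirichlet model.
    Each cluster is summarized by the average of its probability vectors,
    and samples are re-assigned to the centroid minimizing KL(p_i || centroid_k).
    p: (n_samples, n_classes) array whose rows lie on the probability simplex.
    """
    # Start the centroids at (smoothed) one-hot vectors, one per class.
    centroids = np.full((n_classes, n_classes), eps) + np.eye(n_classes)
    centroids /= centroids.sum(axis=1, keepdims=True)

    labels = None
    for _ in range(n_iter):
        # Hard E-step: minimizing KL(p || c) amounts to maximizing p . log(c).
        scores = p @ np.log(centroids + eps).T          # (n_samples, n_classes)
        labels = scores.argmax(axis=1)
        # M-step: each centroid becomes the mean of its assigned samples.
        for k in range(n_classes):
            members = p[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return labels
```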
- torch 1.13.1 (or later)
- torchvision
- tqdm
- numpy
- pillow
- pyyaml
- scipy
- clip
To download the datasets and splits, we follow the instructions given in the GitHub repository of TIP-Adapter. We use the train/val/test splits from CoOp's GitHub repository for all datasets except ImageNet, for which the validation set is used as the test set.
The downloaded datasets should be placed in the data/ folder as follows:
.
├── ...
├── data
│ ├── food101
│ ├── eurosat
│ ├── dtd
│ ├── oxfordpets
│ ├── flowers101
│ ├── caltech101
│ ├── ucf101
│ ├── fgvcaircraft
│ ├── stanfordcars
│ ├── sun397
│ └── imagenet
└── ...
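As a quick sanity check (not part of the repository), a few lines of Python can verify that every dataset folder from the layout above is in place:

```python
from pathlib import Path

# Dataset folders expected under data/, matching the layout shown above.
DATA_ROOT = Path("data")
EXPECTED = ["food101", "eurosat", "dtd", "oxfordpets", "flowers101",
            "caltech101", "ucf101", "fgvcaircraft", "stanfordcars",
            "sun397", "imagenet"]

missing = [name for name in EXPECTED if not (DATA_ROOT / name).is_dir()]
print("Missing dataset folders:", missing if missing else "none")
```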
For a fixed temperature T, the softmax features must first be extracted and saved. To do so, run bash scripts/extract_softmax_features.sh
For instance, for the eurosat dataset, the temperature T=30 and the RN50 backbone, the features will be saved under
eurosat
├── saved_features
│ ├── test_softmax_RN50_T30.plk
│ ├── val_softmax_RN50_T30.plk
│ ├── train_softmax_RN50_T30.plk
└── ...
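For reference, the sketch below shows in simplified form how such softmax features can be produced with the clip package: class prompts and images are encoded, the cosine similarities are scaled by the temperature T and passed through a softmax. The prompt template, class names, placeholder image batch and output path are illustrative assumptions; the repository's extraction script is the authoritative implementation.

```python
import pickle
from pathlib import Path

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

T = 30                                          # temperature, as in the example above
class_names = ["forest", "river", "highway"]    # hypothetical class names
prompts = clip.tokenize([f"a photo of a {c}." for c in class_names]).to(device)

with torch.no_grad():
    text_feat = model.encode_text(prompts)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    # Placeholder batch; in practice these are the preprocessed dataset images.
    images = torch.randn(8, 3, 224, 224, device=device)
    img_feat = model.encode_image(images)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

    # Temperature-scaled cosine similarities, turned into simplex features.
    softmax_features = (T * img_feat @ text_feat.t()).softmax(dim=-1)

out_dir = Path("saved_features")
out_dir.mkdir(exist_ok=True)
with open(out_dir / "test_softmax_RN50_T30.plk", "wb") as f:
    pickle.dump(softmax_features.cpu(), f)
```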
Alternatively, to reproduce the comparisons in the paper, you can also directly compute the visual embeddings by running bash scripts/extract_visual_features.sh
Extracting the features can be time-consuming, but once it is done, the methods operate quite efficiently on the cached features.
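For example, a cached feature file can be reloaded in a fraction of a second (the path follows the eurosat example above, and the exact structure of the pickle depends on the extraction script):

```python
import pickle

import torch

# Path follows the eurosat / RN50 / T=30 example; adjust to your setup.
with open("eurosat/saved_features/test_softmax_RN50_T30.plk", "rb") as f:
    cached = pickle.load(f)

# If the file stores a tensor of softmax features, each row should sum to 1.
if torch.is_tensor(cached):
    print(cached.shape, cached.sum(dim=-1)[:5])
else:
    print(type(cached))
```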
You can reproduce the results displayed in Table 1 of the paper by using the config/main_config.yaml file. Small variations in the results may be observed due to the randomization of the tasks.
The zero-shot methods are EM-Dirichlet (em_dirichlet), Hard EM-Dirichlet (hard_em_dirichlet), Hard K-means (hard_kmeans), Soft K-means (soft_kmeans), EM-Gaussian (Id cov) (em_gaussian), EM-Gaussian (diagonal cov) (em_dirichlet_cov) and KL K-means (kl_kmeans).
The methods can be tested either on the softmax features, by setting use_softmax_features=True, or on the visual features, by setting use_softmax_features=False.
For example, to run the method EM-Dirichlet on Caltech101 on 1000 realistic transductive zero-shot tasks:
python main.py --opts shots 0 dataset caltech101 batch_size 100 number_tasks 1000 use_softmax_feature True
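"Realistic" tasks here means query batches whose class proportions are imbalanced, following the Dirichlet-based sampling of the realistic transductive evaluation protocol cited at the end of this README. The sketch below illustrates one way such a batch could be drawn; the function name and parameter values are illustrative, not necessarily those used by the repository.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_realistic_task(labels, n_ways=10, batch_size=100, alpha=2.0):
    """Draw a class-imbalanced query batch with Dirichlet-distributed proportions."""
    classes = rng.choice(np.unique(labels), size=n_ways, replace=False)
    proportions = rng.dirichlet(alpha * np.ones(n_ways))   # imbalanced class ratios
    counts = rng.multinomial(batch_size, proportions)      # samples drawn per class
    query_idx = []
    for c, n_c in zip(classes, counts):
        pool = np.flatnonzero(labels == c)
        query_idx.extend(rng.choice(pool, size=min(n_c, len(pool)), replace=False))
    return np.array(query_idx), classes
```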
You can reproduce the results displayed in Table 2 of the paper by using the config/main_config.yaml file. Small variations in the results may be observed due to the randomization of the tasks.
The few-shot methods are EM-Dirichlet (em_dirichlet), Hard EM-Dirichlet (hard_em_dirichlet), α-TIM (alpha_tim), PADDLE (paddle), Laplacian Shot (laplacian_shot) and BDCSPN (bdcpsn).
The results of the methods on the validation set are saved in the folder results_few_shot/val/.
For example, to run the method EM-Dirichlet on Caltech101 on 1000 realistic transductive 4-shot tasks:
python main.py --opts shots 4 dataset caltech101 batch_size 100 number_tasks 1000 use_softmax_feature True
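To sweep over several shot settings, a small wrapper script (hypothetical, not provided by the repository) can simply shell out to main.py with the same options as the command above; the shot values below are only an example.

```python
import subprocess

# Illustrative sweep over a few shot values; the options mirror the command above.
for shots in (1, 2, 4, 8, 16):
    subprocess.run(
        ["python", "main.py", "--opts",
         "shots", str(shots),
         "dataset", "caltech101",
         "batch_size", "100",
         "number_tasks", "1000",
         "use_softmax_feature", "True"],
        check=True,
    )
```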
This repository was inspired by the publicly available code of the papers Realistic Evaluation of Transductive Few-shot Learning and TIP-Adapter.