
CDUL


Unofficial implementation of CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification (ICCV 2023)

Setup

Create a .env file at the root of the CDUL repository to store environment variables. Set DATASETS_ROOT (the path for downloading and preparing the datasets) and PROJECT_ROOT (the path to this directory). It is recommended to create a Weights & Biases account for logging experiments.

Example:

PROJECT_ROOT='<path to this repository>/CDUL'
DATASETS_ROOT='<path to your datasets folder>/datasets'
WANDB_API_KEY=<your wandb api key>
WANDB_ENTITY=<wandb entity (username)>
WANDB_PROJECT=CDUL

Also export the above environment variables by adding them to your .bashrc/.zshrc, or via a conda environment activation script.
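
Alternatively, the variables can be loaded programmatically from .env. A minimal sketch using python-dotenv (an assumption for illustration; this follows the usual Lightning-Hydra-Template convention, not necessarily this repository's exact mechanism):

# sketch: load variables from .env into os.environ (assumes python-dotenv is available)
import os
from dotenv import load_dotenv

load_dotenv()  # searches for the nearest .env file and loads it

print(os.environ["PROJECT_ROOT"])
print(os.environ["DATASETS_ROOT"])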

Creating a conda environment:

conda create -n cdul python=3.10.12
conda activate cdul
pip install -r requirements.txt

Project Structure

.
│
├── configs                   <- Hydra configs
│   ├── data                     <- Data configs
│   ├── experiment               <- Experiment configs
│   ├── hydra                    <- Hydra configs
│   ├── logger                   <- Logger configs
│   ├── model                    <- Model configs
│   ├── paths                    <- Project paths configs
│   │
│   └── config.yaml            <- Main config for running
│
│
├── clip_cache             <- Cache generated on the PASCAL VOC 2012 dataset for 'global' vectors and 'aggregate' vectors
├── logs                   <- Experiment Logs generated by hydra (generated when conducting experiments)
├── wandb                  <- Offline Logs generated by wandb (generated when conducting experiments)
│
│
├── src                    <- Source code
│   ├── data                     <- Data files
│   │   ├── data.py              <- Generic classes for manipulating data: CLIPCache, TileCropDataset
│   │   └── voc.py               <- Functions and classes specific to the PASCAL VOC 2012 dataset
│   │
│   ├── models                   <- Model files
│   ├── utils                    <- Utility files
│   │
│   ├── clip_cache.py            <- Run the cache generation for 'global', 'aggregate' vectors
│   ├── evaluate.py              <- Evaluate the mAP for a specific pseudo label initialization (i.e. evaluate the generated cache)
│   └── train.py                 <- Train the classifier using the generated cache
│
│
├── .env                      <- File for storing environment variables
├── .project-root             <- File for inferring the position of project root directory (do not delete)
├── Makefile                  <- Makefile with commands like `make train` or `make test`
├── requirements.txt          <- File for installing python dependencies
└── README.md                 <- README file specifying project instructions

Running

For convenience, a Makefile has been provided to execute the underlying commands for various tasks. Run `make help` to list all available commands (this assumes that you have make installed). See Hydra and the Lightning-Hydra-Template to learn more about using the configs.
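
For orientation, the entry points under src/ presumably follow the standard Hydra pattern; a minimal sketch (the decorator arguments are assumptions based on the configs/ layout above, not copied from this repository):

# sketch: a standard Hydra entry point (decorator arguments are assumed)
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(version_base="1.3", config_path="../configs", config_name="config")
def main(cfg: DictConfig) -> None:
    # cfg is the composed config (data, model, paths, logger, ...)
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()

Config groups can then be overridden on the command line in the usual Hydra way.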

Verifying Claims of the Original Paper

We need to verify the following central claims:

  1. The effectiveness of the aggregation of global and local alignments generated by CLIP in forming pseudo labels for training an unsupervised classifier.
  2. The effectiveness of the gradient-alignment training method, which recursively updates the network parameters and the pseudo labels, to improve the quality of the initial pseudo labels.

We currently verify these claims on the PASCAL VOC 2012 dataset.
To download the PASCAL VOC 2012 dataset into the DATASETS_ROOT folder, run `make voc2012`.
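
Under the hood, the download presumably amounts to fetching the standard VOC 2012 archive; a minimal sketch using torchvision (an assumption for illustration, the Makefile target is authoritative):

# sketch: downloading PASCAL VOC 2012 with torchvision (illustrative; use `make voc2012` in practice)
import os
from torchvision.datasets import VOCDetection

VOCDetection(root=os.environ["DATASETS_ROOT"], year="2012", image_set="trainval", download=True)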

Claim 1

To generate pseudo label vector caches for the global and aggregate alignment vectors using various snippet sizes (num_patches) and thresholds, run the following commands:

Note

There was initial confusion regarding the meaning of a snippet size of 3 x 3. Early experiments that used 3 x 3 pixel snippets with a large threshold of 0.5 did not improve over the mAP of the global similarity vectors. Interpreting snippets instead as a 3 x 3 grid of 9 image crops, with a threshold of 0, does improve the result over the global baseline.

# cache global similarity vectors
make clip_cache

# cache aggregate vectors with num_patches 3 x 3 and threshold 0
make clip_cache0

# cache aggregate vectors with num_patches 3 x 3 and threshold 0.1
make clip_cache1
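
To make the snippet notion concrete, the sketch below shows one plausible shape of the cache computation: a softmax-normalized CLIP similarity vector for the whole image (global), one for each crop in a 3 x 3 grid (local), a thresholded per-class aggregation of the local vectors, and the final pseudo label as the average of global and aggregate. The aggregation rule here (mean of per-crop scores above the threshold) is a simplification for illustration; the exact weighting follows the paper and the CLIPCache/TileCropDataset classes in src/data/data.py.

# sketch: global + aggregated-local CLIP similarity vectors for one image
# (illustrative only; CLIPCache and TileCropDataset implement the real logic)
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

CLASSES = ["aeroplane", "bicycle", "bird"]  # the 20 VOC classes in practice
text = clip.tokenize([f"a photo of a {c}" for c in CLASSES]).to(device)

def similarity(img):
    """Softmax-normalized CLIP image-text similarity vector."""
    with torch.no_grad():
        image = preprocess(img).unsqueeze(0).to(device)
        logits_per_image, _ = model(image, text)
        return logits_per_image.softmax(dim=-1).squeeze(0)

def tiles(img, n=3):
    """Split the image into an n x n grid of crops (the 3 x 3 = 9 snippets)."""
    w, h = img.size
    for i in range(n):
        for j in range(n):
            yield img.crop((j * w // n, i * h // n, (j + 1) * w // n, (i + 1) * h // n))

img = Image.open("example.jpg").convert("RGB")
global_vec = similarity(img)
local = torch.stack([similarity(t) for t in tiles(img)])  # shape: (9, num_classes)

tau = 0.0  # threshold; compare `make clip_cache0` vs `make clip_cache1`
mask = (local > tau).float()
agg = (local * mask).sum(0) / mask.sum(0).clamp(min=1)  # mean of per-crop scores above tau

final_pseudo_label = (global_vec + agg) / 2  # average of global and aggregate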

A clip_cache folder will get created under the dataset folder. For the PASCAL VOC 2012 dataset, the filetree looks like:

VOC2012
│
├── Annotations
├── clip_cache             <- cached vectors in different hierarchies as per the config values
├── ImageSets
├── JPEGImages
├── SegmentationClass
└── SegmentationObject

For evaluating the clip_cache provided in this repository, copy the clip_cache folder to the location shown above inside the downloaded PASCAL VOC 2012 dataset.

For evaluating the quality (mAP) of the initial pseudo labels run:

# Note: evaluation requires the clip_cache, either generated by the commands above or copied from the provided cache.

# to evaluate the pseudo labels initialized using only the global similarity vectors
make evaluate

# evaluate final pseudo labels (average of global and aggregate) using num_patches 3 x 3 and threshold 0
make evaluate0

# evaluate final pseudo labels using num_patches 3 x 3 and threshold 0.1
make evaluate1
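
For intuition, the mAP reported by these commands is the mean of the per-class average precision between the cached pseudo-label scores and the ground-truth tags; a minimal sketch using scikit-learn on random stand-in data (src/evaluate.py is the authoritative implementation):

# sketch: mean average precision over per-class scores vs. binary ground truth
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 20))   # multi-label ground truth (100 images, 20 VOC classes)
y_score = rng.random(size=(100, 20))          # pseudo-label scores in [0, 1]

ap = [average_precision_score(y_true[:, c], y_score[:, c]) for c in range(y_true.shape[1])]
print(f"mAP: {np.mean(ap):.4f}")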

Claim 2

For training the classifier network on the final similarity vectors as the initial pseudo labels, run `make train`.
More details can be found in the comments of the various config files.
The pseudo label update frequency and warmup can be varied accordingly.
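
For intuition, gradient-alignment training alternates two steps: update the network on the current soft pseudo labels, then treat the pseudo labels themselves as optimizable variables and step them against the same loss. A heavily simplified sketch (the loss, update rule, and scheduling are illustrative assumptions; see src/train.py and the paper for the actual procedure):

# sketch: alternating network / pseudo-label updates (not the exact CDUL rule)
import torch
import torch.nn.functional as F

def train_step(model, optimizer, images, pseudo_labels, label_lr=0.1):
    # 1) network update on the current (soft) pseudo labels
    optimizer.zero_grad()
    loss = F.binary_cross_entropy_with_logits(model(images), pseudo_labels)
    loss.backward()
    optimizer.step()

    # 2) pseudo-label update: descend the same loss w.r.t. the labels,
    #    keeping the network fixed (logits are detached)
    labels = pseudo_labels.detach().requires_grad_(True)
    loss_y = F.binary_cross_entropy_with_logits(model(images).detach(), labels)
    (grad,) = torch.autograd.grad(loss_y, labels)
    return (labels - label_lr * grad).clamp(0.0, 1.0).detach()

How often (and after how many warmup epochs) the label update runs is controlled by the config values mentioned above.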

To disable logging with Weights & Biases, prefix any command with WANDB_MODE=disabled. Logs for the experiments can be found at: https://wandb.ai/manan-shah/CDUL.

To run an extensive hyperparameter search, run:

python src/train.py -m hparams_search=voc2012_optuna

TODO

  • Generate a cache for num_patches 3 x 3 for different thresholds.
  • Use multiprocessing for cache generation on multiple GPUs.
  • Experiment with optimization hyper-parameters to get better mAP on val set.
  • Run experiments on other datasets, e.g. MS-COCO.

Citations

@article{shah2024reproducibility,
  title   = {Reproducibility Study of CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification},
  author  = {Manan Shah and Yash Bhalgat},
  year    = {2024},
  journal = {arXiv preprint arXiv:2405.11574}
}
@article{abdelfattah2023cdul,
  title   = {CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification},
  author  = {Rabab Abdelfattah and Qing Guo and Xiaoguang Li and Xiaofeng Wang and Song Wang},
  year    = {2023},
  journal = {arXiv preprint arXiv:2307.16634}
}