TGP-T🚀: Compound Text-Guided Prompt Tuning via Image-Adaptive Cues

Official implementation of the paper in AAAI 2024:

Compound Text-Guided Prompt Tuning via Image-Adaptive Cues

Hao Tan, Jun Li, Yizhuang Zhou, Jun Wan, Zhen Lei, Xiangyu Zhang

[PDF]

TGP-T Paradigm

Introduction🧭

TGP-T is an efficient prompt tuning framework for adapting VLMs with significantly lower resource demand. We introduce compound text supervision, i.e., category-wise and content-wise text supervision, to guide the optimization of prompts. Through a Bonder structure, we align the generated prompts with visual features. As a result, we only need two prompt inputs to the text encoder to produce state-of-the-art performance on 11 datasets for few-shot classification.

TGP-T Framework
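
For intuition only, here is a rough sketch of the data flow described above. Everything in it (module names, dimensions, and the cross-attention design of the Bonder) is an assumption made for illustration, not the released implementation:

import torch
import torch.nn as nn

class Bonder(nn.Module):
    # Hypothetical stand-in: learnable queries cross-attend to image features,
    # so the generated prompts become image-adaptive.
    def __init__(self, dim=512, num_queries=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, image_feats):  # image_feats: (B, N, dim) visual tokens
        queries = self.queries.expand(image_feats.size(0), -1, -1)
        prompts, _ = self.attn(queries, image_feats, image_feats)
        return prompts  # (B, num_queries, dim) image-adaptive prompts

# The generated prompts are fed to the frozen text encoder and supervised by the
# category-wise and content-wise text, respectively (see the paper for details).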

Requirements📨

Installation

We recommend installing the environment with conda and pip.

conda create -n tgpt python=3.8
conda activate tgpt

# Install the dependencies
pip install -r requirements.txt

Dataset

Follow these steps to prepare the datasets:

1. Images

  • Please follow data/DATASETS.md to download the 11 datasets along with ImageNetV2 and ImageNet-Sketch.

2. Content descriptions

  • Download all the content descriptions and preprocessed data files from Google Drive or Baidu Netdisk (password: ky6d).
  • We provide the content descriptions generated by MiniGPT-4 for all 11 datasets. The statistics are shown in the following table.
  • Each line in descriptions.txt contains three elements separated by \t, i.e., the image name, the content description, and the category of the image (a small reading sketch is given after the table).
Dataset | # Descriptions | Avg. Sentence Length | Example Description
Caltech101 | 8,242 | 20.04 | [Faces] The man in the image has a bald head and a scruffy beard.
DTD | 5,640 | 17.64 | [Studded] The couch has silver studs on the armrests and legs.
EuroSAT | 27,000 | 19.81 | [Pasture] The pasture land in this image is an open field with green grass and dotted with small trees and bushes.
FGVCAircraft | 3,334 | 21.43 | [A320] The A320 is a white airplane with red and white stripes and the German flag on the tail.
Food101 | 101,000 | 21.34 | [Hot dog] This hot dog has chili and cheese on it.
ImageNet | 90,053 | 22.03 | [Stingray] The stingray in the image is a large, majestic marine animal with a long, slender body and wide wings.
StanfordCars | 8,144 | 21.67 | [2007 Hyundai Elantra Sedan] The 2007 Hyundai Elantra Sedan is a sleek and stylish silver car on display at an auto show.
OxfordPets | 3,680 | 22.34 | [Ragdoll] The ragdoll cat in the image has blue eyes and a gray and white body with soft, fluffy fur.
Flowers102 | 8,189 | 18.11 | [Passion flower] The passion flower is a beautiful purple flower with white stripes and a long stem.
UCF101 | 7,639 | 19.62 | [Basketball_Dunk] This image shows a basketball player dunking the ball over an opponent during a game.
SUN397 | 19,850 | 23.00 | [Hospital_room] The hospital room has several beds, a desk, and modern medical equipment.
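
For reference, a descriptions.txt file with the layout above can be read with a few lines of Python. The load_descriptions helper and the field names are purely illustrative, not part of the released code:

from pathlib import Path

def load_descriptions(path):
    # Each non-empty line holds: image name, content description, category,
    # separated by tabs, as described above.
    records = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        image_name, description, category = line.split("\t")
        records.append({"image": image_name, "description": description, "category": category})
    return records

records = load_descriptions("imagenet/descriptions.txt")
print(len(records), records[0])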

3. Data organization

  • Put them in the same directory. For example, the directory structure should look like:
imagenet/
|-- descriptions.txt
|-- images/
|   |-- train/ # contains 1,000 folders like n01440764, n01443537, etc.
|   |-- val/
|-- split_fewshot/
|   |-- shot_16-seed_1.pkl  # shot_16-seed_2.pkl, shot_8-seed_1.pkl, etc.

Before training, make sure you update the image paths stored in split_fewshot/shot_{x}-seed_{x}.pkl.

We provide tools/convert_path.py to get this done. To trigger the conversion for all datasets, you can run this simple command:

sh convert_path.sh [/root/path/to/your/data]

# For example
# sh convert_path.sh ./recognition/data
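
If you prefer to patch a single split file by hand, the idea behind the conversion looks roughly like the sketch below. It assumes the pickle stores image paths as plain strings inside nested dicts/lists and that OLD_ROOT is the prefix baked into the released file; tools/convert_path.py remains the supported way to do this:

import pickle

OLD_ROOT = "/old/root/of/the/released/splits"  # hypothetical original prefix
NEW_ROOT = "/root/path/to/your/data"           # your local data root

def remap(node):
    # Recursively replace the old data root in every string path.
    if isinstance(node, str):
        return node.replace(OLD_ROOT, NEW_ROOT)
    if isinstance(node, dict):
        return {key: remap(value) for key, value in node.items()}
    if isinstance(node, (list, tuple)):
        return type(node)(remap(value) for value in node)
    return node

split_file = "imagenet/split_fewshot/shot_16-seed_1.pkl"
with open(split_file, "rb") as f:
    data = pickle.load(f)
with open(split_file, "wb") as f:
    pickle.dump(remap(data), f)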

Usage🧩

1. Configs

The running configurations can be modified in configs/configs/dataset.yaml, including the number of shots, visual encoders, and hyperparameters.

For simplicity, we provide the hyperparameters that achieve the overall best performance at 16 shots for each dataset, which is consistent with the scores in our paper. If tuned separately for each number of shots, the 1~16-shot performance can be further improved. You can edit MAX_ITER and LR for fine-grained tuning.
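
As a quick way to see which keys a config exposes before editing it, you can load the YAML and search for the hyperparameters mentioned above. This snippet assumes PyYAML is available from requirements.txt and makes no assumption about how MAX_ITER and LR are nested in the file:

import yaml

with open("configs/configs/imagenet.yaml") as f:
    cfg = yaml.safe_load(f)

def find_keys(node, targets, prefix=""):
    # Recursively collect (path, value) pairs for the requested keys.
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if key in targets:
                hits.append((path, value))
            hits.extend(find_keys(value, targets, path))
    return hits

for path, value in find_keys(cfg, {"MAX_ITER", "LR"}):
    print(path, "=", value)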

2. Running

Few-shot Recognition🎯

Training. To train on a specific dataset, all you need is train.py:

  • Specify the dataset configs by --config-file
  • Specify your path to the dataset by --root
  • Specify the path where you want to save the results (including training logs and weights) by --output-dir
  • You can turn on wandb logging by enabling --use-wandb. To specify the wandb project name, use --wandb-proj [your-proj-name]

For example, to run on ImageNet:

python train.py --config-file configs/configs/imagenet.yaml --use-wandb

Reproduction. We also provide a simple script, main.sh, to reproduce the results in our paper. It triggers 16-shot training on all 11 datasets with different seeds. To speed up training, you can parallelize the runs across multiple GPUs.

sh main.sh

Distribution Shift🎯

To perform the distribution shift experiments, all you need is train_xdomain.py. We use ImageNet as the "Source" dataset, and ImageNet-V2 and ImageNet-Sketch as the "Target" datasets.

Running the following command will automatically load the weights you have trained on 16-shot ImageNet. If you have not trained on ImageNet before, it will first run training and then evaluation.

python train_xdomain.py --use-wandb

Discussion💬

To inject rich semantic knowledge into the prompts, we take advantage of MiniGPT-4 to generate the content descriptions. Here are some related discussions:

  • To reduce noise from MiniGPT-4, i.e., to make it focus on the target object rather than background information, we adjust the input prompt for MiniGPT-4 and prevent the model from generating overly long sentences. After testing, we chose Describe the {classname} in this image in one sentence. as the final input prompt.
  • MiniGPT-4 is introduced only during the training phase and does not cause information leakage at test time. The enhancement that the general knowledge of large models brings to visual tasks is indeed highly interesting, and we hope this work can inspire more resource-efficient approaches in this direction.

Citation

@inproceedings{tan2024compound,
  title={Compound text-guided prompt tuning via image-adaptive cues},
  author={Tan, Hao and Li, Jun and Zhou, Yizhuang and Wan, Jun and Lei, Zhen and Zhang, Xiangyu},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={5},
  pages={5061--5069},
  year={2024}
}

Acknowledgements

This repo benefits from CLIP, CoOp, and Cross-Modal Adaptation. Thanks for their wonderful work.