Official implementation of the AAAI 2024 paper:
Compound Text-Guided Prompt Tuning via Image-Adaptive Cues
Hao Tan, Jun Li, Yizhuang Zhou, Jun Wan, Zhen Lei, Xiangyu Zhang
[PDF]
TGP-T is an efficient prompt tuning framework for adapting VLMs with significantly lower resource demand. We introduce compound text supervision, i.e., category-wise and content-wise text supervision, to guide the optimization of prompts. Through a Bonder structure, we align the generated prompts with visual features. As a result, we only need two prompt inputs to the text encoder to achieve state-of-the-art performance on 11 datasets for few-shot classification.
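For orientation, here is a rough conceptual sketch of the data flow described above. It is not the actual implementation: all module names, shapes, and the cross-attention design are assumptions made purely for illustration.

```python
# Conceptual sketch only (invented names/shapes): learnable queries are cross-attended
# over image features so the two generated prompts carry image-adaptive cues, and are
# then passed to the text encoder. See the paper/code for the real TGP-T design.
import torch
import torch.nn as nn

class BonderSketch(nn.Module):
    def __init__(self, dim=512, num_prompts=2, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_prompts, dim))  # one query per prompt
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feats):                      # image_feats: (B, N, dim) visual tokens
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        prompts, _ = self.attn(q, image_feats, image_feats)
        return prompts                                   # (B, 2, dim): prompts fed to the text encoder
```

During training, the two prompts are supervised by category-wise and content-wise text respectively, as described above.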
We recommend installing the environment with conda and pip.
conda create -n tgpt python=3.8
conda activate tgpt
# Install the dependencies
pip install -r requirements.txt
Follow these steps to prepare the datasets:
- Please follow data/DATASETS.md to download the 11 datasets along with ImageNetV2 and ImageNet-Sketch.
- Download all the content descriptions and preprocessed data files from Google Drive or Baidu Netdisk (password: ky6d).
- We provide the content descriptions generated by MiniGPT-4 for all 11 datasets. The statistics are shown in the following table.
- Each line in descriptions.txt contains three elements separated by a tab (\t), i.e., image name, content description, and category of the image. A minimal parsing sketch is given below the directory layout.
- Put them in the same directory. For example, the directory structure should look like:
imagenet/
|-- descriptions.txt
|-- images/
| |-- train/ # contains 1,000 folders like n01440764, n01443537, etc.
| |-- val/
|-- split_fewshot/
| |-- shot_16-seed_1.pkl # shot_16-seed_2.pkl, shot_8-seed_1.pkl, etc.
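As referenced above, here is a minimal sketch for reading descriptions.txt, assuming exactly three tab-separated fields per line as documented; the repo's own data loader may differ.

```python
# Minimal sketch: parse descriptions.txt (image name \t content description \t category).
# Assumes the three-field format documented above; the repo's own loader may differ.
from pathlib import Path

def load_descriptions(path):
    records = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        image_name, description, category = line.split("\t")
        records.append({"image": image_name, "description": description, "category": category})
    return records

if __name__ == "__main__":
    records = load_descriptions("imagenet/descriptions.txt")
    print(len(records), records[0])
```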
Before training, make sure you change the image paths in split_fewshot/shot_{x}-seed_{x}.pkl.
We provide tools/convert_path.py to get this done. To trigger the conversion for all datasets, you can run this simple command:
sh convert_path.sh [/root/path/to/your/data]
# For example
# sh convert_path.sh ./recognition/data
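For reference, the snippet below illustrates the idea behind the conversion: rewriting the image-path prefix stored in a few-shot split file. The pickle layout assumed here (subset lists of items carrying an "impath" string) is a guess for illustration only; tools/convert_path.py is the authoritative implementation.

```python
# Illustration only: rewrite the image-path prefix stored in a few-shot split .pkl.
# The layout assumed here (dict of subset -> list of items with an "impath" key) is a
# guess; use the provided tools/convert_path.py for the actual conversion.
import pickle

def rewrite_prefix(pkl_path, old_root, new_root):
    with open(pkl_path, "rb") as f:
        split = pickle.load(f)
    for subset in split.values():            # e.g. train/val item lists (assumed structure)
        for item in subset:
            item["impath"] = item["impath"].replace(old_root, new_root)
    with open(pkl_path, "wb") as f:
        pickle.dump(split, f)
```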
The running configurations can be modified in configs/configs/dataset.yaml, including the number of shots, visual encoders, and hyperparameters.
For simplicity, we provide the hyperparameters that achieve the overall best performance at 16 shots for each dataset, which corresponds to the scores in our paper. If tuned separately for each shot number, the 1~16-shot performance can be further improved. You can edit MAX_ITER and LR for fine-grained tuning.
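As a sketch, such an edit might look like the excerpt below; apart from MAX_ITER and LR (mentioned above), the values and any surrounding key nesting are assumptions, so check the actual configs/configs/*.yaml files.

```yaml
# Illustrative excerpt only; see configs/configs/<dataset>.yaml for the real keys and nesting.
MAX_ITER: 12800   # number of training iterations (value is a placeholder)
LR: 0.002         # learning rate (value is a placeholder)
```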
Training. To train on a specific dataset, all you need is train.py:
- Specify the dataset config with --config-file
- Specify the path to your dataset with --root
- Specify the path where you want to save the results (including training logs and weights) with --output-dir
- You can turn on logging to wandb by enabling --use-wandb. To specify the wandb project name, use --wandb-proj [your-proj-name]
For example, to run on ImageNet:
python train.py --config-file configs/configs/imagenet.yaml --use-wandb
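A fuller invocation that spells out the documented flags might look like this (all paths and the project name are placeholders):

```bash
python train.py --config-file configs/configs/imagenet.yaml \
    --root [/root/path/to/your/data] \
    --output-dir [/path/to/save/results] \
    --use-wandb --wandb-proj [your-proj-name]
```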
Reproduction. We also provide main.sh as an easy way to reproduce the results in our paper. It triggers 16-shot training on all 11 datasets, across different seeds. To speed up training, you can parallelize the runs on multiple GPUs (see the sketch after the command below).
sh main.sh
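One simple way to parallelize is to launch per-dataset runs on separate GPUs directly with train.py; the sketch below is illustrative only, and the second config name and GPU ids are placeholders.

```bash
# Illustrative only: run two datasets in parallel on two GPUs.
CUDA_VISIBLE_DEVICES=0 python train.py --config-file configs/configs/imagenet.yaml &
CUDA_VISIBLE_DEVICES=1 python train.py --config-file configs/configs/[another-dataset].yaml &
wait
```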
To perform distribution shift experiments, all you need is train_xdomain.py. We take ImageNet as the "Source" dataset, and ImageNet-V2 and ImageNet-Sketch as the "Target" datasets.
Running the following command will automatically load the weights you have trained on 16-shot ImageNet. If you haven't trained on ImageNet before, it will first run training and then evaluation.
python train_xdomain.py --use-wandb
To inject rich semantic knowledge into the prompts, we take advantage of MiniGPT-4 to generate the content descriptions. Here are some related discussions:
- To reduce noise from MiniGPT-4, i.e., to focus on the target object rather than background information, we adjust the input prompts for MiniGPT-4 and prevent the model from generating overly long sentences. After testing, we chose Describe the {classname} in this image in one sentence. as the final input prompt.
- MiniGPT-4 is introduced only during the training phase and does not cause information leakage in the test phase.
- The enhancement that the general knowledge of large models brings to visual tasks is indeed highly interesting. We also hope this work can inspire more resource-efficient approaches in this direction.
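As a sketch of how a descriptions.txt could be assembled with this prompt template, see below; generate_description is a hypothetical placeholder for a MiniGPT-4 inference call, not its real API.

```python
# Sketch only: build descriptions.txt (image name \t description \t category) using the
# prompt template above. generate_description() is a hypothetical stand-in for a
# MiniGPT-4 inference call, not its actual API.
PROMPT_TEMPLATE = "Describe the {classname} in this image in one sentence."

def generate_description(image_path, prompt):
    raise NotImplementedError("plug in your MiniGPT-4 inference here")

def build_descriptions(samples, out_path):
    # samples: iterable of (image_name, image_path, classname) triplets
    with open(out_path, "w", encoding="utf-8") as f:
        for image_name, image_path, classname in samples:
            prompt = PROMPT_TEMPLATE.format(classname=classname)
            description = generate_description(image_path, prompt)
            f.write(f"{image_name}\t{description}\t{classname}\n")
```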
@inproceedings{tan2024compound,
title={Compound text-guided prompt tuning via image-adaptive cues},
author={Tan, Hao and Li, Jun and Zhou, Yizhuang and Wan, Jun and Lei, Zhen and Zhang, Xiangyu},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={38},
number={5},
pages={5061--5069},
year={2024}
}
This repo benefits from CLIP, CoOp, and Cross-Modal Adaptation. Thanks for their wonderful work.