Multi-Modal-Adapter

The official PyTorch implementation of our CVPR 2024 paper "MMA: Multi-Modal Adapter for Vision-Language Models".


Lingxiao Yang, Ru-Yuan Zhang, Yanchen Wang, Xiaohua Xie

Highlights


Abstract: Pre-trained Vision-Language Models (VLMs) have served as excellent foundation models for transfer learning in diverse downstream tasks. However, tuning VLMs for few-shot generalization tasks faces a discrimination–generalization dilemma, i.e., general knowledge should be preserved and task-specific knowledge should be fine-tuned. How to precisely identify these two types of representations remains a challenge. In this paper, we propose a Multi-Modal Adapter (MMA) for VLMs to improve the alignment between representations from the text and vision branches. MMA aggregates features from different branches into a shared feature space so that gradients can be communicated across branches. To determine how to incorporate MMA, we systematically analyze the discriminability and generalizability of features across diverse datasets in both the vision and language branches, and find that (1) higher layers contain discriminable dataset-specific knowledge, while lower layers contain more generalizable knowledge, and (2) language features are more discriminable than visual features, and there are large semantic gaps between the features of the two modalities, especially in the lower layers. Therefore, we incorporate MMA only into a few higher layers of the transformers to achieve an optimal balance between discrimination and generalization. We evaluate the effectiveness of our approach on three tasks: generalization to novel classes, novel target datasets, and domain generalization. Compared to many state-of-the-art methods, our MMA achieves leading performance in all evaluations.
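
To make the adapter design concrete, below is a minimal PyTorch sketch of an MMA-style adapter with modality-specific down/up projections around a shared projection. The class name, default dimensions (768 for ViT-B/16 image tokens, 512 for CLIP text tokens), and the residual insertion are illustrative assumptions, not the official implementation in this repository.

```python
import torch
import torch.nn as nn


class MultiModalAdapterSketch(nn.Module):
    """Illustrative MMA-style adapter: separate projections per modality
    plus a shared projection so the two branches meet in a common space."""

    def __init__(self, vision_dim=768, text_dim=512, shared_dim=64):
        super().__init__()
        # Modality-specific down-projections into the shared space.
        self.vision_down = nn.Linear(vision_dim, shared_dim)
        self.text_down = nn.Linear(text_dim, shared_dim)
        # Shared projection: gradients from both branches flow through it.
        self.shared = nn.Linear(shared_dim, shared_dim)
        # Modality-specific up-projections back to each branch's width.
        self.vision_up = nn.Linear(shared_dim, vision_dim)
        self.text_up = nn.Linear(shared_dim, text_dim)
        self.act = nn.GELU()

    def forward_vision(self, x: torch.Tensor) -> torch.Tensor:
        # Residual adapter update for vision-branch features.
        return x + self.vision_up(self.act(self.shared(self.vision_down(x))))

    def forward_text(self, x: torch.Tensor) -> torch.Tensor:
        # Residual adapter update for text-branch features.
        return x + self.text_up(self.act(self.shared(self.text_down(x))))
```

Consistent with the analysis above, adapters like this would be attached only to the last few transformer blocks of the image and text encoders, typically with the CLIP backbone kept frozen so that only the adapter parameters are trained.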

Contributions

  • We introduce a dataset-level analysis method to systematically examine feature representations of transformer-based CLIP models. This analysis helps build more effective and efficient adapters for VLMs (a toy sketch of one such analysis metric is given after this list).
  • We propose a novel adapter that contains separate projection layers to improve the feature representations of the image and text encoders independently. We also introduce a shared projection to provide better alignment between the vision and language representations.
  • We integrate our adapter into the well-known CLIP model and evaluate it on various few-shot generalization tasks. Experimental results show that our method achieves leading performance among all compared approaches.
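
As a toy illustration of the dataset-level analysis in the first contribution, one simple way to score how discriminable a layer's features are is a Fisher-style ratio of between-class to within-class scatter. This is only a plausible stand-in metric, not necessarily the exact criterion used in the paper.

```python
import torch


def discriminability(features: torch.Tensor, labels: torch.Tensor) -> float:
    """Between-class / within-class scatter ratio for one layer's features.

    features: (N, D) activations extracted from a frozen encoder layer.
    labels:   (N,) integer class ids for the same samples.
    """
    overall_mean = features.mean(dim=0)
    between = features.new_zeros(())
    within = features.new_zeros(())
    for c in labels.unique():
        cls_feats = features[labels == c]      # samples of class c
        cls_mean = cls_feats.mean(dim=0)
        between = between + cls_feats.shape[0] * (cls_mean - overall_mean).pow(2).sum()
        within = within + (cls_feats - cls_mean).pow(2).sum()
    return (between / within.clamp_min(1e-8)).item()
```

Computed layer by layer for both encoders across several datasets, a score like this would be expected to grow toward the higher layers, which is the kind of trend the abstract describes.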

Base-to-Novel Generalization Results

The results reported below are average accuracies across 11 recognition datasets over 3 random seeds. Please refer to our paper for more details.

| Method     | Base Acc. (%) | Novel Acc. (%) | Harmonic Mean (%) |
|------------|---------------|----------------|-------------------|
| CLIP       | 69.34         | 74.22          | 71.70             |
| CoOp       | 82.69         | 63.22          | 71.66             |
| CoCoOp     | 80.47         | 71.69          | 75.83             |
| ProDA      | 81.56         | 72.30          | 76.65             |
| KgCoOp     | 80.73         | 73.60          | 77.00             |
| MaPLe      | 82.28         | 75.14          | 78.55             |
| LASP       | 82.70         | 74.90          | 78.61             |
| LASP-V     | 83.18         | 76.11          | 79.48             |
| RPO        | 81.13         | 75.00          | 77.78             |
| MMA (ours) | 83.20         | 76.80          | 79.87             |

Installation

This code is built on top of the awesome CoOp project, so you need to follow its setup steps:

First, you need to set up the dassl environment from Dassl.pytorch. Simply follow the instructions described in that repository to install dassl as well as PyTorch. After that, run pip install -r requirements.txt under Multi-Modal-Adapter/ to install a few additional packages required by CLIP (do this with the dassl environment activated).

Second, follow DATASETS.md to prepare the datasets.

How to Run

The script run_examples.sh provides a simple illustration. For example, to run training and evaluation for base-to-novel generalization with seed 1 on GPU 0, use the following command:

# arg1 = GPU id to use
# arg2 = random seed

bash run_examples.sh 0 1

Citation

If you find this work helpful for your research, please kindly cite the following paper:

@InProceedings{Yang_2024_CVPR,
    author    = {Yang, Lingxiao and Zhang, Ru-Yuan and Wang, Yanchen and Xie, Xiaohua},
    title     = {MMA: Multi-Modal Adapter for Vision-Language Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {23826-23837}
}