RobustVLM

[Paper] [HuggingFace] [BibTeX]

This repository contains code for the paper "Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models" (ICML 2024).

We fine-tune CLIP in an unsupervised manner to improve its robustness to visual adversarial attacks. We show that replacing the vision encoder of large vision-language models with our fine-tuned CLIP models yields state-of-the-art adversarial robustness on a variety of vision-language tasks, without requiring any training of the large VLMs themselves. Moreover, we improve the robustness of CLIP to adversarial attacks in zero-shot classification settings, while maintaining higher clean accuracy than previous adversarial fine-tuning methods.

Installation
Models
- Loading pretrained models
- Summary of results
Training
Evaluation

Installation

The code is tested with Python 3.11. To install the required packages, run:

pip install -r requirements.txt

Models

We provide the following adversarially fine-tuned ViT-L/14 CLIP models (approx. 1.1 GB each):

Model	Link	Proposed by	Notes
TeCoA²	Link	Mao et al. (2023)	Supervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{2}{255}$
TeCoA⁴	Link	Mao et al. (2023)	Supervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{4}{255}$
FARE²	Link	ours	Unsupervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{2}{255}$
FARE⁴	Link	ours	Unsupervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{4}{255}$

The models also available on HuggingFace.

All models are adversarially fine-tuned for two epochs on ImageNet. TeCoA is trained in a supervised fashion, utilizing ImageNet class labels. FARE, in contrast, does not require any labels for training.

Loading pretrained models

The provided checkpoints correspond to the vision encoder of CLIP. To load the full CLIP model (including the text encoder), you can use the following code:

import torch
from open_clip import create_model_and_transforms
model, _, image_processor = create_model_and_transforms(
            'ViT-L-14', pretrained='openai', device='cpu'
        )
checkpoint = torch.load('/path/to/fare_eps_2.pt', map_location=torch.device('cpu'))
model.visual.load_state_dict(checkpoint)

Alternatively load directly from HuggingFace:

from open_clip import create_model_and_transforms
model, _, image_processor = open_clip.create_model_and_transforms('hf-hub:chs20/fare2-clip')

Summary of results

We show a summary of results on zero-shot classification and vision-language tasks for original and fine-tuned ViT-L/14 CLIP models. CLIP-only means that we evaluate the respective CLIP model in a standalone fashion for zero-shot classification, whereas OpenFlamingo and LLaVA evaluation means that we use the respective CLIP model as a vision encoder as part of these large vision-language models. Results for individual zero-shot datasets and more VLM tasks are provided in the paper.

Clean evaluation:

	CLIP-only	OpenFlamingo 9B		LLaVA 1.5 7B
Model	Avg. zero-shot	COCO	TextVQA	COCO	TextVQA
OpenAI	73.1	79.7	23.8	115.5	37.1
TeCoA²	60.0	73.5	16.6	98.4	24.1
FARE²	67.0	79.1	21.6	109.9	31.9
TeCoA⁴	54.2	66.9	15.4	88.3	20.7
FARE⁴	61.1	74.1	18.6	102.4	27.6

Adversarial evaluation ($\ell_\infty, ~ \varepsilon=\frac{2}{255}$):

	CLIP-only	OpenFlamingo 9B		LLaVA 1.5 7B
Model	Avg. zero-shot	COCO	TextVQA	COCO	TextVQA
Openai	0.0	1.5	0.0	4.0	0.5
TeCoA²	43.6	31.6	3.5	44.2	12.1
FARE²	43.1	34.2	4.1	53.6	14.7
TeCoA⁴	42.3	28.5	2.1	50.9	12.6
FARE⁴	45.9	30.9	3.4	57.1	15.8

Adversarial evaluation ($\ell_\infty, ~ \varepsilon=\frac{4}{255}$):

	CLIP-only	OpenFlamingo 9B		LLaVA 1.5 7B
Model	Avg. zero-shot	COCO	TextVQA	COCO	TextVQA
Openai	0.0	1.1	0.0	3.1	0.0
TeCoA²	27.0	21.2	2.1	30.3	8.8
FARE²	20.5	19.5	1.9	31.0	9.1
TeCoA⁴	31.9	21.6	1.8	35.3	9.3
FARE⁴	32.4	22.8	2.9	40.9	10.9

Training

TeCoA⁴

python -m train.adversarial_training_clip --clip_model_name ViT-L-14 --pretrained openai --dataset imagenet --imagenet_root /path/to/imagenet --template std --output_normalize True --steps 20000 --warmup 1400 --batch_size 128 --loss ce --opt adamw --lr 1e-5 --wd 1e-4 --attack pgd --inner_loss ce --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir /path/to/out/dir --experiment_name TECOA4 --log_freq 10 --eval_freq 10```

FARE⁴

python -m train.adversarial_training_clip --clip_model_name ViT-L-14 --pretrained openai --dataset imagenet --imagenet_root /path/to/imagenet --template std --output_normalize False --steps 20000 --warmup 1400 --batch_size 128 --loss l2 --opt adamw --lr 1e-5 --wd 1e-4 --attack pgd --inner_loss l2 --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir /path/to/out/dir --experiment_name FARE4 --log_freq 10 --eval_freq 10

Set --eps 2 to obtain TeCoA² and FARE² models.

Evaluation

Make sure files in bash directory are executable: chmod +x bash/*

CLIP ImageNet

python -m CLIP_eval.clip_robustbench --clip_model_name ViT-L-14 --pretrained /path/to/ckpt.pt --dataset imagenet --imagenet_root /path/to/imagenet --wandb False --norm linf --eps 2

CLIP Zero-Shot

Set models to be evaluated in CLIP_benchmark/benchmark/models.txt and datasets in CLIP_benchmark/benchmark/datasets.txt (the datasets are downloaded from HuggingFace). Then run

cd CLIP_benchmark
./bash/run_benchmark_adv.sh

VLM Captioning and VQA

LLaVA

In /bash/llava_eval.sh supply paths for the datasets. The required annotation files for the datasets can be obtained from this HuggingFace repository. Set --vision_encoder_pretrained to openai or supply path to fine-tuned CLIP model checkpoint. Then run

./bash/llava_eval.sh

The LLaVA model will be automatically downloaded from HuggingFace.

OpenFlamingo

Download the OpenFlamingo 9B model, supply paths in /bash/of_eval_9B.sh and run

./bash/of_eval_9B.sh

VLM Stealthy Targeted Attacks

For targeted attacks on COCO, run

./bash/llava_eval_targeted.sh

For targeted attacks on self-selected images, set images and target captions in vlm_eval/run_evaluation_qualitative.py and run

python -m vlm_eval.run_evaluation_qualitative --precision float32 --attack apgd --eps 2 --steps 10000 --vlm_model_name llava --vision_encoder_pretrained openai --verbose

With 10,000 iterations it takes about 2 hours per image on an A100 GPU.

POPE

./bash/eval_pope.sh openai  # for clean  model evaluation
./bash/eval_pope.sh   # for robust  model evaluation - add path_to_ckpt in bash file

SQA

./bash/eval_scienceqa.sh openai  # for clean  model evaluation
./bash/eval_scienceqa.sh   # for robust  model evaluation - add path_to_ckpt in bash file

Acknowledgements

This repository gratefully forks from

Citation

If you find this repository useful, please consider citing our paper:

@article{schlarmann2024robustclip,
    title={Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models}, 
    author={Christian Schlarmann and Naman Deep Singh and Francesco Croce and Matthias Hein},
    year={2024},
    journal={ICML}
}

SaraGhazanfari/RobustVLM