
Code of the paper: On Evaluating Adversarial Robustness of Large Vision-Language Models


On Evaluating Adversarial Robustness of
Large Vision-Language Models

Yunqing Zhao*1,  Tianyu Pang*2†,  Chao Du2†,  Xiao Yang3,  Chongxuan Li4,
Ngai-Man Cheung1†,  Min Lin2      *Equal Contribution
1Singapore University of Technology and Design (SUTD)
2Sea AI Lab (SAIL), Singapore
3Tsinghua University    4Renmin University of China

arXiv-Preprint, 2023

Project Page /  Slides /  arXiv /  Data Repository 


TL;DR:

In this research, we evaluate the adversarial robustness of recent large vision-language models (VLMs) under the most realistic and challenging setting: a black-box threat model with a targeted goal.

Our proposed method crafts adversarial images that elicit targeted responses from large VLMs such as MiniGPT-4, LLaVA, Unidiffuser, BLIP/BLIP-2, Img2Prompt, etc.

In other words, we mislead the VLMs into saying whatever you want, regardless of the content of the input images.

Teaser image

Requirements

  • Platform: Linux
  • Hardware: NVIDIA A100 PCIe (40 GB)
  • Dependencies: lmdb, tqdm, wandb, torchvision, etc.

As we apply Stable Diffusion for targeted image generation, we initialize our conda environment following Latent Diffusion Models. A suitable base conda environment named ldm can be created and activated with:

conda env create -f environment.yaml
conda activate ldm

Note that for each victim model, we follow its official implementation and conda environment.

Targeted Image Generation

Teaser image

As discussed in our paper, to achieve a flexible targeted attack, we leverage a pretrained text-to-image model to generate a targeted image given a single caption as the targeted text. This way, you can specify the targeted caption for the attack yourself!

We use Stable Diffusion, DALL-E, or Midjourney as the text-to-image generator in our experiments. Here, we use Stable Diffusion for demonstration (thanks for open-sourcing!).

Prepare the scripts

git clone https://github.com/CompVis/stable-diffusion.git
cd stable-diffusion

Then, prepare the full set of targeted captions from MS-COCO, or download our processed and cleaned version:

https://drive.google.com/file/d/19tT036LBvqYonzI7PfU9qVi3jVGApKrg/view?usp=sharing

and move it to ./stable-diffusion/. In experiments, one can randomly sample a subset of COCO captions (e.g., 10, 100, 1K, 10K, 50K) for the adversarial attack. For example, let's assume we have randomly sampled 10K COCO captions as our targeted text $\boldsymbol{c}_\text{tar}$ and stored them in the following file:

https://drive.google.com/file/d/1e5W3Yim7ZJRw3_C64yqVZg_Na7dOawaF/view?usp=sharing
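
If you prefer to construct such a subset yourself, a minimal Python sketch is given below (the input/output file names are placeholders for your own paths):

import random

# read the full (cleaned) MS-COCO captions, one caption per line
with open('coco_captions_full.txt', 'r') as f:
    captions = [line.strip() for line in f if line.strip()]

# randomly sample a subset (e.g., 10K) to serve as the targeted text c_tar
random.seed(0)  # fix the seed for reproducibility
subset = random.sample(captions, k=10000)

with open('coco_captions_10k.txt', 'w') as f:
    f.write('\n'.join(subset))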

Generate the targeted images

The targeted images $\boldsymbol{h}_\xi(\boldsymbol{c}_\text{tar})$ can be obtained via Stable Diffusion by reading the text prompts from the sampled COCO captions, with the script below (note that hyperparameters can be adjusted to your preference):

python ./scripts/txt2img.py \
        --ddim_eta 0.0 \
        --n_samples 10 \
        --n_iter 1 \
        --scale 7.5 \
        --ddim_steps 50 \
        --plms \
        --skip_grid \
        --ckpt ./_model_pool/sd-v1-4-full-ema.ckpt \
        --from-file './name_of_your_coco_captions_file.txt' \
        --outdir './path_of_your_targeted_images'

Additional implementation details of text-to-image generation by Stable Diffusion can be found HERE.

Adversarial Attack & Black-box Query

Overview of our AttackVLM strategy

Teaser image

Prepare the VLM scripts

There are two steps in our adversarial attack on VLMs: (1) a transfer-based attacking strategy and (2) a query-based attacking strategy. Here, we use Unidiffuser as an example; other types of VLMs will be supported soon.

Unidiffuser

  • Installation
git clone https://github.com/thu-ml/unidiffuser.git
cd unidiffuser
cp ../unidff_tool/* ./

Then, create a suitable conda environment named unidiffuser following the steps HERE, and prepare the corresponding model weights (we use uvit_v1.pth as the U-ViT weights).

  • Transfer-based attacking strategy
conda activate unidiffuser

python _train_adv_img.py \
        --batch_size 250 \
        --num_samples 10000 \
        --steps 100 \
        --epsilon 8 \
        --cle_data_path 'path_of_your_clean_data_folders' \
        --tgt_data_path 'path_of_your_tgt_data_folders' \
        --output 'name_of_your_output_img_folder'
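
For intuition, the transfer-based step is essentially a PGD-style optimization on an $\ell_\infty$ ball that pushes the adversarial image's feature toward the feature of the targeted image under a surrogate image encoder. The snippet below is a simplified sketch (the surrogate encoder, hyperparameters, and function names are illustrative, not the repository's exact implementation):

import torch
import torch.nn.functional as F

def transfer_attack(x_clean, x_target, encoder, epsilon=8/255, alpha=1/255, steps=100):
    # PGD-style sketch: match adversarial-image features to targeted-image features
    delta = torch.zeros_like(x_clean, requires_grad=True)
    with torch.no_grad():
        feat_tgt = encoder(x_target)            # feature of the targeted image h_xi(c_tar)
    for _ in range(steps):
        feat_adv = encoder(x_clean + delta)     # feature of the current adversarial image
        loss = F.cosine_similarity(feat_adv, feat_tgt, dim=-1).mean()  # maximize similarity
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()                 # gradient-ascent step
            delta.clamp_(-epsilon, epsilon)                    # project back to the l_inf ball
            delta.add_(x_clean).clamp_(0, 1).sub_(x_clean)     # keep pixels in [0, 1]
        delta.grad.zero_()
    return (x_clean + delta).detach()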

The crafted adversarial images $\boldsymbol{x}_\text{trans}$ will be stored in ../_output_img/name_of_your_output_img_folder. Then, we perform image-to-text and store the generated responses for $\boldsymbol{x}_\text{trans}$. This can be achieved by:

python _eval_i2t_dataset.py \
        --batch_size 10 \
        --mode i2t \
        --img_path '../_output_img/name_of_your_output_img_folder' \
        --output 'name_of_your_output_txt_file'

where the generated responses will be stored in ./output_unidiffuser/name_of_your_output_txt_file.txt. We will use them for pseudo-gradient estimation via the RGF estimator.

  • Query-based attacking strategy (via RGF-estimator)
python _train_adv_img_query.py \
        --output unidiff_adv_query \
        --data_path '' \
        --text_path './output_unidiffuser/name_of_your_output_txt_file.txt' \
        --batch_size 1 \
        --num_samples 10000 \
        --steps 8 \
        --sigma 8 \
        --delta 'zero' \
        --num_query 50 \
        --num_sub_query 25 \
        --wandb \
        --wandb_project_name tmp \
        --wandb_run_name tmp
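
Conceptually, the query-based step estimates a pseudo-gradient of the black-box objective (the CLIP similarity between the victim model's caption and the targeted text) by finite differences along random directions, i.e., the RGF estimator. A simplified sketch follows (score_fn, sigma, and the query budget are illustrative placeholders, not the repository's exact code):

import torch

def rgf_estimate_grad(x_adv, score_fn, num_query=50, sigma=8/255):
    # score_fn(x) queries the victim VLM on image x and returns the CLIP similarity
    # between the generated caption and the targeted text c_tar
    base_score = score_fn(x_adv)
    grad_est = torch.zeros_like(x_adv)
    for _ in range(num_query):
        u = torch.randn_like(x_adv)
        u = u / u.norm()                                   # random unit direction
        grad_est += (score_fn(x_adv + sigma * u) - base_score) / sigma * u
    return grad_est / num_query

The estimated pseudo-gradient is then used to take a signed update on the perturbation, subject to the same $\ell_\infty$ budget as in the transfer-based step.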

Evaluation

We use different types of CLIP text encoders (e.g., RN50, ViT-B/32, ViT-L/14, etc.) to evaluate the similarity between (a) the generated response and (b) the predefined targeted text $\boldsymbol{c}_\text{tar}$. Refer to the following eval script as an example:

python eval_clip_text_score.py \
        --batch_size 250 \
        --num_samples 10000 \
        --pred_text_path ../_output_text/your_pred_captions.txt \
        --tgt_text_path ../_output_text/your_tgt_captions.txt
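
For reference, the metric itself is simply the cosine similarity between the CLIP text embeddings of each predicted caption and its targeted caption. A minimal sketch using the open-source CLIP package is shown below (the encoder name and file paths are illustrative; for large caption files you may want to encode in batches):

import clip
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, _ = clip.load('ViT-B/32', device=device)   # also try RN50, ViT-L/14, etc.

with open('your_pred_captions.txt') as f:
    preds = [line.strip() for line in f]
with open('your_tgt_captions.txt') as f:
    tgts = [line.strip() for line in f]

with torch.no_grad():
    e_pred = model.encode_text(clip.tokenize(preds, truncate=True).to(device)).float()
    e_tgt = model.encode_text(clip.tokenize(tgts, truncate=True).to(device)).float()
    e_pred = e_pred / e_pred.norm(dim=-1, keepdim=True)
    e_tgt = e_tgt / e_tgt.norm(dim=-1, keepdim=True)
    scores = (e_pred * e_tgt).sum(dim=-1)          # per-sample cosine similarity

print(f'mean CLIP text score: {scores.mean().item():.4f}')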

Alternatively, you can use wandb to dynamically monitor the moving average of the CLIP score, since the black-box query-based attack can be slow when processing abundant perturbed samples at the same time.
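
For example, a running mean of per-sample CLIP scores can be logged as the attack proceeds, along the lines of this minimal sketch (the project and metric names are placeholders):

import wandb

def log_moving_average(clip_scores, project='tmp', run_name='tmp'):
    # clip_scores: an iterable of per-sample CLIP scores, produced as the attack runs
    wandb.init(project=project, name=run_name)
    running_sum = 0.0
    for i, score in enumerate(clip_scores, start=1):
        running_sum += score
        wandb.log({'clip_score_moving_avg': running_sum / i}, step=i)
    wandb.finish()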

Bibtex

If you find this project useful in your research, please consider citing our paper:

@article{zhao2023evaluate,
  title={On Evaluating Adversarial Robustness of Large Vision-Language Models},
  author={Zhao, Yunqing and Pang, Tianyu and Du, Chao and Yang, Xiao and Li, Chongxuan and Cheung, Ngai-Man and Lin, Min},
  journal={arXiv preprint arXiv:2305.16934},
  year={2023}
}

Meanwhile, here is a related research work of ours on embedding a watermark into (multi-modal) diffusion models:

@article{zhao2023recipe,
  title={A Recipe for Watermarking Diffusion Models},
  author={Zhao, Yunqing and Pang, Tianyu and Du, Chao and Yang, Xiao and Cheung, Ngai-Man and Lin, Min},
  journal={arXiv preprint arXiv:2303.10137},
  year={2023}
}

Acknowledgement:

We appreciate the wonderful base implementations of MiniGPT-4, LLaVA, Unidiffuser, LAVIS, and CLIP. We also thank @MetaAI for open-sourcing their LLaMA checkpoints. We thank SiSi for providing some enjoyable and visually pleasant images generated by @Midjourney in our research.