/BridgeTower

Open source code for AAAI 2023 Paper "BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning"

Primary LanguagePythonMIT LicenseMIT

BridgeTower

This repo is the official Pytorch implementation of "BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning".

Updates

Abstract

Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder. Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose BridgeTower, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, BridgeTower achieves state-of-the-art performance on various downstream vision-language tasks. In particular, on the VQAv2 test-std set, BridgeTower achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, BridgeTower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets.

Architecture

Architecture

Main Results

Result1

Result2

Deployment

  • Run setup.sh to set up the environment.
  • [Optional] We use wandb to track experiments! Please remember to wandb login and paste your token before running the script.

Dataset Preparation

Checkpoints

  • Pre-trained checkpoints on 4M data: BASE and LARGE

  • Fine-tuned checkpoints for

  • Here is an example for downloading a checkpoint.

    # download azcopy
    wget https://aka.ms/downloadazcopy-v10-linux
    tar -xvf downloadazcopy-v10-linux
    sudo cp ./azcopy_linux_amd64_*/azcopy /usr/bin/
    sudo chmod -R 777 /usr/bin/azcopy
    # azcopy copy [remote path] [local path]
    azcopy copy "https://chenfei.blob.core.windows.net/data/G/LCI/best_checkpoints/BridgeTower_pt_base.ckpt?sv=2020-10-02&st=2022-11-24T12%3A18%3A49Z&se=2027-11-25T12%3A18%3A00Z&sr=b&sp=r&sig=BJigddAMHfNUtQuTGH8bJUrzAO3LfaeSm48AXUqZngY%3D" "./BridgeTower_pt_base.ckpt"

Pre-training on Image-Text Datasets

# Pre-train BridgeTower Base Model
bash scripts/pre_train.sh
# Pre-train BridgeTower Large Model
bash scripts/pre_train_large.sh

Fine-tuning on Downstream VL Tasks

  • VQAv2 Evaluation needs to submit the json file in the logs/ directory to eval.ai evaluation server to get the test-dev and/or test-std scores.
# Base Model on VQAv2 without VLP
bash scripts/ftfs_base_vqa.sh

# Large Model on VQAv2 without VLP
bash scripts/ftfs_large_vqa.sh

# Base Model on VQAv2 with VLP
bash scripts/ftfpt_base_vqa.sh

# Large Model on VQAv2 with VLP
bash scripts/ftfpt_large_vqa.sh

# Base Model on IRTR-Flickr30K with VLP (directly use ITM with multiple false texts)
bash scripts/ftfpt_base_irtr_f30k.sh

# Base Model on IRTR-Flickr30K with VLP (follow ALBEF to use ITC to sample hard negatives for ITM)
bash scripts/ftfpt_base_irtr_itm_itc_f30k.sh

# Base Model on SNLI-VE with VLP
bash scripts/ftfpt_base_snlive.sh

# Base Model on NLVR^2 with VLP
bash scripts/ftfpt_base_nlvr2.sh

# Base Model on IRTR-MSCOCO with VLP (follow ALBEF to use ITC to sample hard negatives for ITM)
bash scripts/ftfpt_base_irtr_itm_itc_coco.sh

Fine-tuning on Uni-Modal Tasks

# Base Model on CIFAR with VLP
bash scripts/ftfpt_base_cifar.sh

# Base Model on GLUE with VLP
bash scripts/ftfpt_base_glue.sh

Citation

@article{xu2022bridge,
  title={BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning},
  author={Xu, Xiao and Wu, Chenfei and Rosenman, Shachar and Lal, Vasudev and Che, Wanxiang and Duan, Nan},
  journal={arXiv preprint arXiv:2206.08657},
  year={2022}
}

Acknowledgement

We are highly grateful for the public code of the following papers, our code is partly based on them: