Vision Language Models

Vision-Language Pretraining

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks [NeurIPS 2019]

[paper] [code] Facebook AI Research

  • Architecture: Two stream 🔃 co-attentional transformer layers

  • Pretrain dataset: Conceptual Captions (~3.3M)

  • Pretrain Tasks

    • predicting the semantics of masked words and image regions given the unmasked inputs (Masked Multi-modal Modelling)

    image: Predict a distribution over semantic classes for each masked region, then minimize the KL divergence between this prediction and the class distribution that the pretrained detection model outputs for the same region.

    text: Same as BERT.

    • predicting whether an image and text segment correspond (Multi-modal Alignment) with [IMG] and [CLS] output
  • Image feature (Faster R-CNN)

    • <image coordinates (4), area fraction, visual feature> from a pretrained object detection network
    • the 5-d location vector (coordinates + area fraction) is projected to match the dimension of the visual feature, and the two are summed
  • Text feature: Google's WordPiece tokenizer
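The region input and the masked-region objective described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function names, the random projection matrix `w_loc`, and the three-class toy distribution are assumptions made for the example.

```python
import numpy as np

def location_vector(box, img_w, img_h):
    """5-d location feature: normalized (x1, y1, x2, y2) plus area fraction."""
    x1, y1, x2, y2 = box
    area_fraction = ((x2 - x1) * (y2 - y1)) / (img_w * img_h)
    return np.array([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h, area_fraction])

def region_embedding(visual_feat, loc, w_loc):
    """Project the 5-d location to the visual feature size and sum the two."""
    return visual_feat + loc @ w_loc

def masked_region_kl(detector_probs, predicted_logits):
    """KL(detector || model) over semantic classes for one masked region."""
    shifted = predicted_logits - predicted_logits.max()
    log_pred = shifted - np.log(np.exp(shifted).sum())  # log-softmax
    eps = 1e-12
    return float(np.sum(detector_probs * (np.log(detector_probs + eps) - log_pred)))

rng = np.random.default_rng(0)
loc = location_vector((40, 30, 200, 150), img_w=400, img_h=300)
w_loc = rng.normal(scale=0.02, size=(5, 2048))  # hypothetical learned projection
emb = region_embedding(rng.normal(size=2048), loc, w_loc)

target = np.array([0.7, 0.2, 0.1])                  # toy detector distribution
kl_same = masked_region_kl(target, np.log(target))  # model agrees with detector
kl_diff = masked_region_kl(target, np.zeros(3))     # model predicts uniform
```

The loss is close to zero when the model reproduces the detector's distribution and grows as the two diverge.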

LXMERT: Learning Cross-Modality Encoder Representations from Transformers [EMNLP 2019]

[paper] [code] The University of North Carolina

  • Architecture: Two stream --- Object relationship encoder (Image), language encoder (Text), cross-modality encoder.

  • Pretrain dataset: COCO + Visual Genome (9.18M image-sentence pairs)

  • Pretrain Tasks

    • MLM, Masked Object Prediction (MOP) [feature regression and label classification], Cross-modality Matching with only [CLS] output, Image Question Answering
  • Image feature (Faster R-CNN)

    • <bounding box coordinates, 2048-d region-of-interest>
    • projection
  • Text feature

VisualBERT: A Simple and Performant Baseline for Vision and Language [arXiv 2019/08, ACL 2020]

[paper] [code]

  • Architecture: Single stream BERT
  • Pretrain dataset: COCO (100k)
  • Pretrain tasks:
    • Task-Agnostic Pretraining
    • MLM with only text masked
    • Sentence-image matching (Cross-modality Matching) with only [CLS] output
    • Task-Specific Pretraining: MLM on the task-specific dataset, which helps the model adapt to the new target domain.
  • Features
    • Image feature (Faster R-CNN): the visual feature representation is bounding region feature + segment embedding + position embedding
    • Text feature: same as BERT

VL-BERT: Pre-training of Generic Visual-Linguistic Representations [ICLR 2020]

[paper] [code] USTC & MSRA

  • Architecture: Single stream BERT

  • Pretrain dataset: Conceptual Captions (3.3M) for visual-linguistic & BooksCorpus, English Wikipedia for pure text corpus

  • Pretrain Tasks

    • MLM, Masked RoI Classification with Linguistic Clues
    • The authors report that Cross-modality Matching (sentence-image relationship prediction) does not help pre-training, so it is not used.
  • Features

    • Visual Feature Embedding (Fast R-CNN)

    • visual appearance embedding: 2048-d feature. For non-visual elements, it is extracted from an RoI covering the whole input image.

    • visual geometry embedding: the box coordinates are embedded into a 2048-d representation by computing sine and cosine functions of different wavelengths, following "Relation networks for object detection"

    • Token Embedding

    • WordPiece Embedding. For visual elements, a special [IMG] token is assigned.

    • Segment Embedding: Learnable

    • Sequence Position Embedding: Learnable
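The visual geometry embedding can be illustrated with a short sinusoidal encoding of the four normalized box corners. This is only a sketch: the wavelength base of 1000 and the equal split of 512 dimensions per coordinate are assumptions for this example, and the exact scaling in the original implementation may differ.

```python
import numpy as np

def geometry_embedding(box, img_w, img_h, dim=2048, base=1000.0):
    """Embed a box (x_lt, y_lt, x_rb, y_rb) into `dim` sine/cosine features."""
    x_lt, y_lt, x_rb, y_rb = box
    coords = np.array([x_lt / img_w, y_lt / img_h, x_rb / img_w, y_rb / img_h])
    per_coord = dim // 4                 # dimensions assigned to each coordinate
    n_freq = per_coord // 2              # half sine, half cosine
    wavelengths = base ** (np.arange(n_freq) / n_freq)
    angles = coords[:, None] / wavelengths[None, :]                  # (4, n_freq)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=1)   # (4, per_coord)
    return emb.reshape(-1)

# Non-visual elements use an RoI that covers the whole image.
whole_image = geometry_embedding((0, 0, 640, 480), img_w=640, img_h=480)
```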

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training [AAAI 2020]


  • Architecture: Single stream BERT

  • Pretrain dataset: Conceptual Captions (3M) + SBU Captions (0.8M)

  • Pretrain tasks

    MLM + Masked Object Classification + Visual-linguistic Matching (Cross-modality Matching) with only [CLS] output

  • Features

    • Image feature (Faster R-CNN)
    • [IMG] token + segment embedding + position embedding + region embedding
    • region embedding: the location feature and the visual feature are each projected separately into the embedding space by an FC layer, then added up
    • Text feature: same as BERT

Unified Vision-Language Pre-Training for Image Captioning and VQA [AAAI 2020]

[code] (VLP)

UNITER: Learning Universal Image-text Representations [ECCV 2020]

[paper] [code]

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks [arXiv 2020/04, ECCV 2020]

[paper] [code]

Learning Transferable Visual Models From Natural Language Supervision [OpenAI papers 2021/01]

[paper] [blog] [code]

Video-Language Pretraining

VideoBERT: A Joint Model for Video and Language Representation Learning [ICCV 2019]


Multi-modal Transformer for Video Retrieval [ECCV 2020]

[paper] [code]

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training [EMNLP 2020]

[paper] [code]

UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation


Image-Text Retrieval & Matching

ImageBERT: Cross-Modal Pre-training with Large-scale Weak-supervised Image-text Data

arXiv 2020/01 [paper]

Cross-Probe BERT for Efficient and Effective Cross-Modal Search

ICLR 2021 submission. [paper]

Multi-Modality Cross Attention Network for Image and Sentence Matching [CVPR 2020]


Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [CVPR 2021]



12-in-1: Multi-Task Vision and Language Representation Learning [CVPR 2020]

[paper] [code] Multi-task Learning

Are we pretraining it right? Digging deeper into visio-linguistic pretraining

arXiv 2020/04 [paper] In-depth Analysis

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

arXiv 2020/05, ECCV 2020 Spotlight [paper] In-depth Analysis

Adaptive Transformers for Learning Multimodal Representations [ACL 2020]

[paper] Adaptive Transformer Analysis

Data, Architecture, or Losses: What Contributes Most to Multimodal Transformer Success? [TACL 2021]



Pre-trained Models for Natural Language Processing: A Survey [arXiv 2020/03]


A Survey on Contextual Embeddings [arXiv 2020/03]


Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods [arXiv 2019]


Deep Multimodal Representation Learning: A Survey [arXiv 2019]



A Survey on Visual Transformer [arXiv 2020/12]



Facebook MMF


Efficient Transformers

  1. Fixed Patterns

    Image Transformer [ICML 2018]

    • Blockwise Patterns
    • Strided Patterns
    • Compressed Patterns
  2. Combination of Patterns

    Combining two or more distinct access patterns.

  3. Learnable Patterns

    Reformer: The Efficient Transformer [ICLR 2020]

    In contrast to fixed patterns, learnable patterns aim to learn the access pattern in a data-driven fashion.

  4. Memory

    Longformer: The Long-Document Transformer [arXiv 2020/04]

    Leverage a side memory module to access multiple tokens at once.

  5. Low-Rank Methods

    Linformer: Self-Attention with Linear Complexity [arXiv 2020/06]

    Leverage low-rank approximations of the self-attention matrix.

  6. Kernels

    View the attention mechanism through kernelization, which enables a mathematical rewriting of self-attention that avoids explicitly computing the N*N matrix. This can be viewed as a form of low-rank method.

  7. Recurrence

    A natural extension to the blockwise method is to connect these blocks via recurrence.
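The kernel view in item 6 can be checked numerically. The sketch below assumes the feature map elu(x) + 1 as the kernel (one common choice, used here only for illustration) and shows that regrouping the matrix products gives the same output as the explicit N*N formulation without ever building that matrix.

```python
import numpy as np

def feature_map(x):
    """elu(x) + 1: a positive feature map (an assumed choice for this sketch)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def quadratic_kernel_attention(q, k, v):
    """Reference version that builds the full (N, N) similarity matrix."""
    scores = feature_map(q) @ feature_map(k).T
    return (scores / scores.sum(axis=1, keepdims=True)) @ v

def linear_kernel_attention(q, k, v):
    """Same result, computed as phi(Q) (phi(K)^T V) in O(N d^2)."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v               # (d, d_v), independent of N
    z = phi_k.sum(axis=0)          # (d,), normalizer statistics
    return (phi_q @ kv) / (phi_q @ z)[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(64, 8)) for _ in range(3))
out_quadratic = quadratic_kernel_attention(q, k, v)
out_linear = linear_kernel_attention(q, k, v)
```

Both functions return the same values up to floating-point error; only the order of multiplication differs.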

Performer: Rethinking Attention with Performers [arXiv 2020/09, Under review of ICLR 2021]

[paper] [code] Google & University of Cambridge & DeepMind & Alan Turing Institute

Linformer: Self-Attention with Linear Complexity [arXiv 2020/06]

[paper] [code] FAIR

  • Tasks: Natural language understanding and downstream tasks.
  • Contribution: Projecting (N, d) Key and Value to (k, d).
  • Complexity: O(n)
  • Restrictions:
    • The projection mixes information across sequence positions, which makes it non-trivial to maintain causal masking or to prevent past-future information mixing.
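A minimal sketch of the projection idea, assuming random (untrained) projection matrices `proj_k` and `proj_v` and small toy sizes. It also illustrates the restriction: because every projected key mixes all positions, changing the last token alters the output at the first position.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def linformer_attention(q, k, v, proj_k, proj_v):
    """Project (N, d) keys and values down to (k, d) before attention."""
    k_low = proj_k @ k                               # (k, d)
    v_low = proj_v @ v                               # (k, d)
    scores = q @ k_low.T / np.sqrt(q.shape[-1])      # (N, k) instead of (N, N)
    return softmax(scores) @ v_low, scores.shape

rng = np.random.default_rng(0)
n, d, k_dim = 128, 16, 8
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
proj_k = rng.normal(size=(k_dim, n)) / np.sqrt(n)    # hypothetical learned weights
proj_v = rng.normal(size=(k_dim, n)) / np.sqrt(n)
out, score_shape = linformer_attention(q, k, v, proj_k, proj_v)

# Perturb only the last key: the first position's output changes as well.
k_changed = k.copy()
k_changed[-1] += 1.0
out_changed, _ = linformer_attention(q, k_changed, v, proj_k, proj_v)
```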

Linear Transformer: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [ICML 2020]

[paper] [code] Idiap Research Institute

Synthesizer: Neural Speech Synthesis with Transformer Network [AAAI 2019]

[paper] [code] UESTC, MSRA

Sinkhorn Transformer: Sparse Sinkhorn Attention [ICML 2020]

[paper] [code] Google AI

Reformer: The Efficient Transformer [ICLR 2020]

[paper] [code] UCB & Google Research

  • Tasks: Machine translation

  • Contribution:

    • Locality Sensitive Hashing Attention (LSHA)

      • Weight approximation

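      For each query $q_i$, the attention is computed as $\mathrm{softmax}\left(q_i K^\top / \sqrt{d_k}\right) V$.

      Since $\mathrm{softmax}(x)_j = e^{x_j} / \sum_l e^{x_l}$, the few largest terms dominate and can roughly approximate the value.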

      [10, 7, 1, 0 ,2] ---softmax---> [95%, 4.7%, 0.012%, 0.0043%, 0.032%]

      • Shared-QK Transformer

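      For each position $j$, let $k_j = q_j / \lVert q_j \rVert$, so queries and keys share one projection.

      The paper reports that this does not affect Transformer performance.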

      • LSH bucket

      Split queries and keys into different buckets. Each query attends only to keys in the same bucket.

  • Complexity: see Table 3 of the paper.
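The three ideas in the LSH attention notes can be sketched together. This is a simplified single-round illustration, not the Reformer implementation: it omits sorting, chunking, multiple hash rounds and causal masking, and the sizes are arbitrary assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def lsh_buckets(x, n_buckets, rng):
    """Angular LSH: random projection R, then argmax over [xR ; -xR]."""
    r = rng.normal(size=(x.shape[-1], n_buckets // 2))
    proj = x @ r
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

def shared_qk_lsh_attention(qk, v, n_buckets, rng):
    """Shared-QK attention restricted to positions in the same bucket."""
    keys = qk / np.linalg.norm(qk, axis=-1, keepdims=True)  # k_j = q_j / ||q_j||
    buckets = lsh_buckets(qk, n_buckets, rng)
    out = np.zeros_like(v)
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]
        scores = qk[idx] @ keys[idx].T / np.sqrt(qk.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[idx] = weights @ v[idx]
    return out, buckets

# Weight approximation: the largest logits carry almost all of the mass.
probs = softmax(np.array([10.0, 7.0, 1.0, 0.0, 2.0]))

rng = np.random.default_rng(0)
qk = rng.normal(size=(64, 16))
v = rng.normal(size=(64, 16))
out, buckets = shared_qk_lsh_attention(qk, v, n_buckets=8, rng=rng)
```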

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context [arXiv 2019/06]

[paper] [code] CMU, Google Brain

Compressive Transformers for Long-Range Sequence Modelling [ICLR 2020]

[paper] [code] DeepMind, UCL

Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks [ICML 2019]

[paper] University of Oxford

Longformer: The Long-Document Transformer [arXiv 2020/04]

[paper] [code] Allen Institute for Artificial Intelligence

  • Tasks: Language modeling and long-document tasks such as summarization and question answering.
  • Contribution:
    • Sliding window attention

      A different window size w is used for each layer, so the receptive field grows as the model goes deeper.

    • Dilated sliding window

      In multi-head attention, the heads mix dilated and un-dilated sliding windows. Dilated windows focus on longer context, while un-dilated windows focus on local context.

    • Global + sliding window

      Global attention is added at a few task-specific positions: [CLS] for classification tasks and all question tokens for QA tasks.

  • Complexity: O(wn) for local attention, where w is the window size.
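The three attention patterns can be visualized as boolean masks. The helper below is a hypothetical illustration (its name and parameters are not from the Longformer code base), and it builds a dense (n, n) mask only for clarity; an efficient implementation would never materialize it.

```python
import numpy as np

def longformer_mask(n, window, dilation=1, global_idx=()):
    """Boolean (n, n) mask: True where query i may attend to key j."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    offset = np.abs(i - j)
    # window // 2 positions on each side, spaced `dilation` apart
    mask = (offset <= (window // 2) * dilation) & (offset % dilation == 0)
    for g in global_idx:
        mask[g, :] = True   # the global token attends to every position
        mask[:, g] = True   # every position attends to the global token
    return mask

local = longformer_mask(16, window=4)
dilated = longformer_mask(16, window=4, dilation=2)
with_cls = longformer_mask(16, window=4, global_idx=[0])
```

Each row of `local` has at most window + 1 allowed keys, which is where the linear cost in sequence length comes from.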

Routing Transformer: Efficient Content-Based Sparse Attention with Routing Transformers [arXiv 2020/10]

[paper] [code] Google Research

Big Bird: Transformers for Longer Sequences [NeurIPS 2020]

[paper] Google Research

ETC: Encoding Long and Structured Inputs in Transformers [EMNLP 2020]

[paper] [code] Google Research

Memory Compressed: Generating Wikipedia by Summarizing Long Sequences [ICLR 2018]

[paper] [code] Google Brain

  • Tasks: Generating Wikipedia articles by summarizing long source documents.
  • Contribution:
    • Local Attention
    • Memory-compressed Attention
  • Complexity: O(bn) for Local Attention, where b is the block size. O(n*n/k) for Memory-compressed Attention, where k is the kernel size and stride of the compressing convolution (e.g. nn.Conv1d).
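Both attention variants can be sketched in a few lines. Note the swap: this example uses average pooling over every `stride` positions as a stand-in for the learned strided convolution, so it illustrates the shapes and cost rather than the trained model. All sizes are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def local_attention(q, k, v, block):
    """Attend only within blocks: n/block blocks, each costing block^2."""
    out = np.zeros_like(v)
    for start in range(0, len(q), block):
        s = slice(start, start + block)
        out[s] = softmax(q[s] @ k[s].T / np.sqrt(q.shape[-1])) @ v[s]
    return out

def compress(x, stride):
    """Stand-in for a strided convolution: average every `stride` rows."""
    n, d = x.shape
    return x[: n - n % stride].reshape(n // stride, stride, d).mean(axis=1)

def memory_compressed_attention(q, k, v, stride):
    """Full-length queries attend to keys/values shortened by `stride`."""
    k_c, v_c = compress(k, stride), compress(v, stride)
    scores = q @ k_c.T / np.sqrt(q.shape[-1])        # (n, n / stride)
    return softmax(scores) @ v_c, scores.shape

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(32, 8)) for _ in range(3))
local_out = local_attention(q, k, v, block=8)
mc_out, mc_score_shape = memory_compressed_attention(q, k, v, stride=4)

# A change in the last block leaves the first block's local output untouched.
k_changed = k.copy()
k_changed[-1] += 5.0
local_out_changed = local_attention(q, k_changed, v, block=8)
```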

Blockwise Transformer: Blockwise Self-Attention for Long Document Understanding [arXiv 2020/10]

[paper] [code] Tsinghua University, FAIR

Image Transformer [ICML 2018]

[paper] [code1] [code2] Google Brain, UCB, Google AI

  • Tasks: Image Generation and Super Resolution
  • Contribution:
    • Query Block split & 2 Local Attention
  • Complexity: O(nm), where n is the length of the flattened image and m is the memory length.
  • Restrictions
    • Only attends to a local neighborhood, which can be an issue when global information is required to solve a task.
    • The constant term is introduced as an extra hyper-parameter.

Sparse Transformer: Generating Long Sequences with Sparse Transformers [arXiv 2019/04]

[paper] [code] OpenAI

Axial Transformer: Axial Attention in Multidimensional Transformers [arXiv 2019/12]

[paper] [code] UCB, Google Brain

Fastformer: Additive Attention Can Be All You Need [arXiv 2021/08]

[paper] [code] Tsinghua University, MSRA

Image Transformers

ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [Under review of ICLR 2021]

[paper] [code1] [code2] Google

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [arXiv 2021/03]

[paper] [code] MSRA

Transformer GAN

TransGAN: Two Transformers Can Make One Strong GAN [arXiv 2021/02]

[paper] [code]

  1. Architecture: Transformer-only

    • Up-sampling in Generator: pixelshuffle module from "Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network (CVPR 2016)"
  2. Tricks

    • Data augmentation

    • Super-resolution co-training

    • Locality-Aware Initialization for Self-Attention

    • Ablations

  3. Results

    • Scaling up model

    • Comparison with other models
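The pixelshuffle up-sampling listed under Architecture rearranges channels into spatial resolution. The NumPy sketch below follows the usual sub-pixel layout for a single (C*r*r, H, W) feature map; how TransGAN reshapes tokens to and from a 2D map around this step is not shown and would be an assumption.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) into (C, H*r, W*r)."""
    c_rr, h, w = x.shape
    c = c_rr // (r * r)
    x = x.reshape(c, r, r, h, w)       # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)     # reorder to (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)  # output pixel (h*r + i, w*r + j)

x = np.arange(4 * 2 * 2).reshape(4, 2, 2)
y = pixel_shuffle(x, r=2)
```

Four channels of a 2x2 map become one channel of a 4x4 map, so the resolution doubles while the channel count drops by a factor of r*r.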

GANsformer: Generative Adversarial Transformers [arXiv 2021/03]

[paper] [code] Stanford & Facebook

TFill: Image Completion via a Transformer-Based Architecture [arXiv 2021/04]

[paper] [code] Code will be released in July MIT

Transformer Visualizations

Transformer Interpretability Beyond Attention Visualization [arXiv 2020/12]

[paper] [code] FAIR

Transformer Internal Essence

Pretrained Transformers as Universal Computation Engines [arXiv 2021/03]

[paper] [code] Google Brain


Pre-trained Models for Natural Language Processing: A Survey

arXiv 2020/03 [paper]

A Survey on Contextual Embeddings

arXiv 2020/03 [paper]

Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods

arXiv 2019 [paper]

Deep Multimodal Representation Learning: A Survey

arXiv 2019 [paper]

Multimodal Machine Learning: A Survey and Taxonomy

TPAMI 2018 [paper]