
Accelerating Transformers with Spectrum-Preserving Token Merging

Hoai-Chau Tran* · Duy M. H. Nguyen* · Duy M. Nguyen · TrungTin Nguyen · Ngan Le · Pengtao Xie · Daniel Sonntag · James Y. Zou · Binh T. Nguyen · Mathias Niepert



This repository provides a PyTorch implementation of the paper Accelerating Transformers with Spectrum-Preserving Token Merging, accepted at NeurIPS 2024. In this work, we introduce a new algorithm called pitome, designed to compress Vision Transformers (ViT) across various applications through token merging. After each layer, tokens are progressively merged so that only a fraction r of them remains, as illustrated in the figure below.

Example Image
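
For intuition, the short sketch below (an illustration, not code from this repository) shows how a per-layer ratio r shrinks the token count across a 12-layer ViT:

# Toy illustration: number of tokens remaining after each layer when a
# fraction `ratio` of tokens is kept at every merging step.
num_tokens = 197   # e.g. ViT-B/16 on 224x224 images (196 patches + CLS)
ratio = 0.9        # fraction of tokens kept per layer

for layer in range(12):
    num_tokens = int(num_tokens * ratio)
    print(f"layer {layer + 1}: {num_tokens} tokens remain")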

News

  • Code for VQA with LLaVA 1.5 is under refactoring. Coming soon!
  • [27/10/2024] Release code for image classification task
  • [01/10/2024] Release code for text classification task
  • [29/09/2024] Release code for image-text retrieval task
  • [25/09/2024] Our paper has been accepted at NeurIPS 2024 as a Poster 🎉 🎉 🎉
  • [29/05/2024] Upload PrePrint on Arxiv

Abstract

Increasing the throughput of the Transformer architecture, a foundational component used in numerous state-of-the-art models for vision and language tasks (e.g., GPT, LLaVA), is an important problem in machine learning. One recent and effective strategy is to merge token representations within Transformer models, aiming to reduce computational and memory requirements while maintaining accuracy. Prior works have proposed algorithms based on Bipartite Soft Matching (BSM), which divides tokens into distinct sets and merges the top k similar tokens. However, these methods have significant drawbacks, such as sensitivity to token-splitting strategies and damage to informative tokens in later layers. This paper presents a novel paradigm called PiToMe, which prioritizes the preservation of informative tokens using an additional metric termed the energy score. This score identifies large clusters of similar tokens as high-energy, indicating potential candidates for merging, while smaller (unique and isolated) clusters are considered low-energy and preserved. Experimental findings demonstrate that PiToMe saves 40-60% of the FLOPs of the base models while exhibiting superior off-the-shelf performance on image classification (0.5% average performance drop of ViT-MAE-H compared to 2.6% for baselines), image-text retrieval (0.3% average performance drop of CLIP on Flickr30k compared to 4.5% for other approaches), and, analogously, in visual question answering with LLaVA-7B. Furthermore, PiToMe is theoretically shown to preserve the intrinsic spectral properties of the original token space under mild conditions.

Method

Example Image

All implementations of PiToMe and the baselines can be found in the algo folder.
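
The toy sketch below illustrates only the energy-score intuition from the abstract above; the margin value, the ReLU-based score, and the synthetic tokens are simplifying assumptions made for this example, not the exact formulation, which lives in the algo folder:

import torch
import torch.nn.functional as F

def toy_energy_score(tokens: torch.Tensor, margin: float = 0.9) -> torch.Tensor:
    # Tokens sitting in large clusters of near-duplicates get a high energy
    # score (merge candidates); isolated tokens get a low score (preserved).
    x = F.normalize(tokens, dim=-1)               # (N, D) unit-norm embeddings
    sim = x @ x.T                                 # pairwise cosine similarities
    return torch.relu(sim - margin).mean(dim=-1)  # (N,) higher = more redundant

# Ten nearly identical "background" tokens plus two distinct tokens:
# the background tokens receive much higher energy than the distinct ones.
background = torch.randn(1, 16) + 0.05 * torch.randn(10, 16)
unique = torch.randn(2, 16)
print(toy_energy_score(torch.cat([background, unique])))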

Installation

First, clone this repository:

git clone https://github.com/hchautran/PiToMe.git
cd PiToMe

Next, you need to install the required packages using the commands below:

conda create -n pitome python=3.10
conda activate pitome
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia 
pip install -r requirements.txt

Image-Text Retrieval

Using pitome with ITR models

Currently, only checkpoints from LAVIS are supported. You can download the pretrained weights and directly apply pitome to them:

from lavis.models import load_model_and_preprocess
from algo import pitome

# Load a pretrained model; can be blip/albef/blip2.
model, vis_processors, txt_processors = load_model_and_preprocess("blip_retrieval", "coco", is_eval=False)
# Patch the blip's visual encoder with PiToMe.
pitome.patch.blip(model.visual_encoder)
# Set the ratio of remaining tokens per layer. See the paper for details.
model.visual_encoder.ratio = 0.9 
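
As a quick sanity check, you can continue the snippet above and run a dummy image through the patched encoder (a sketch; the 384x384 input size is an assumption for the BLIP retrieval encoder, and with ratio < 1.0 the output should contain fewer tokens than the unpatched model):

import torch

dummy_image = torch.randn(1, 3, 384, 384)  # assumed BLIP retrieval input size
with torch.no_grad():
    features = model.visual_encoder(dummy_image)
print(features.shape)  # (1, num_remaining_tokens, hidden_dim)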

In the future, we plan to support checkpoints from HuggingFace.

Run

In our paper, we evaluate our method on two datasets: Flickr30k and MS-COCO.

Step 1: Configure the data storage path in the default.yml file by changing it to your preferred path. This file is located in the folder where lavis is installed; you can find it quickly with this command:

import lavis;print(f"{'/'.join(lavis.__file__.split('/')[:-1])}/configs");

Update the cache_root entry to your desired path.
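
If you prefer to script this change, below is a minimal sketch using PyYAML; the file name default.yml and the env.cache_root layout are assumptions based on the step above, so inspect the printed configs folder first:

import os
import yaml
import lavis

# Locate lavis's default config and point cache_root at your storage path
# (assumes a default.yml file with a top-level `env.cache_root` entry).
config_path = os.path.join(os.path.dirname(lavis.__file__), "configs", "default.yml")
with open(config_path) as f:
    config = yaml.safe_load(f)
config["env"]["cache_root"] = "/path/to/your/data"  # hypothetical target path
with open(config_path, "w") as f:
    yaml.safe_dump(config, f)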

Step 2: Download the data. You can download Flickr30k and MS-COCO using the available scripts:

python itr/download_coco.py
python itr/download_flickr.py

Currently, we support blip, blip2, clip, and albef. You can try directly compressing these models for off-the-shelf performance by running this command:

python -m torch.distributed.run \
    --nproc_per_node=5 main_itr.py \
    --cfg-path scripts/eval_scripts/blip_itr_coco.yml \
    --algo pitome \
    --ratio 0.95 \
    --model blip \
    --dataset flickr \
    --eval 

Or retrain these models using this command:

CUDA_VISIBLE_DEVICES=0 python -m accelerate.commands.launch --main_process_port 29500 main_ic.py \
   --batch-size $BATCH_SIZE \
   --model ${ARCH}_${SIZE}_patch16_${INPUT_SIZE}  \
   --algo ${ALGO} \
   --ratio ${RATIO} \
   --input-size ${INPUT_SIZE} \
   --epoch $EPOCH  \
   --lr 0.00001

You can also evaluate/train all other baselines with multiple ratios r by running:

sh scripts/eval_scripts/eval_itr_all.sh # off-the-shelf evaluation
sh scripts/train_scripts/train_itr_all.sh # retrain

The results will be printed and saved to the itr_outputs directory.

Image Classification

Using pitome with ViT models for image classification

We are currently supporting the DeiT and MAE models for image classification tasks.

from timm.models import create_model
from algo import pitome

# Load a pretrained model, can be any vit / deit model.
model = create_model("deit_base_patch16_224", pretrained=True)
# Patch the ViT model with PiToMe.
pitome.patch.deit(model)
# pitome.patch.mae(model)
# Set the ratio of remaining tokens per layer. See the paper for details.
model.ratio = 0.95 
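
As a quick usage check, you can continue the snippet above and time a forward pass (an illustrative sketch; actual speedups depend on your hardware and the chosen ratio):

import time
import torch

model.eval()
images = torch.randn(16, 3, 224, 224)  # dummy batch at the model's input size
with torch.no_grad():
    model(images)                      # warm-up pass
    start = time.time()
    logits = model(images)
print(f"logits: {tuple(logits.shape)}, forward pass took {time.time() - start:.3f}s")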

Run

In this task, all experiments are conducted on the ImageNet-1K dataset, a subset of ImageNet that contains 1000 classes. By default, all data and model checkpoints will be downloaded and saved into the folder specified by the DATA_PATH variable in tasks/ic/utils.py. You can change this to your preferred path.

You can try directly compressing these models for off-the-shelf performance:

python main_ic.py \
   --batch-size 256 \ 
   --model ${ARCH}-${MODEL_SIZE}-${INPUT_SIZE}  \ 
   --algo ${ALGO} \
   --ratio ${RATIO} \
   --eval

Or retrain them by running this command:

CUDA_VISIBLE_DEVICES=0 python -m accelerate.commands.launch --main_process_port 29500 main_ic.py \
   --batch-size $BATCH_SIZE \
   --model ${ARCH}-${MODEL_SIZE}-${INPUT_SIZE}  \ 
   --algo ${ALGO} \
   --ratio ${RATIO} \
   --epoch $EPOCH  \
   --lr 0.00001

You can also evaluate/train all models with all baselines using multiple ratios r by running:

sh scripts/eval_scripts/eval_ic_all.sh # off-the-shelf evaluation
sh scripts/train_scripts/train_ic_all.sh # retrain

The results will be printed and saved to the outputs/ic_outputs directory.

Text Classification

Using pitome with text classification models

We support bert and distilbert for text classification tasks.

from algo import pitome
from transformers import AutoModelForSequenceClassification

# Load a pretrained model; can be bert or distilbert.
model_ckt = 'JiaqiLee/imdb-finetuned-bert-base-uncased'
# model_ckt = 'bert-base-uncased'
# model_ckt = 'distilbert-base-uncased'
model =  AutoModelForSequenceClassification.from_pretrained(model_ckt)

# Patch the bert encoder with PiToMe.
pitome.patch.bert(model.bert.encoder)
# pitome.patch.distilbert(model.distilbert.transformer)

# Set the ratio of remaining tokens per layer. See the paper for details.
model.bert.encoder.ratio = 0.65 
# model.distilbert.transformer.ratio = 0.65
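
A short usage example, continuing from the snippet above (the tokenizer simply mirrors the checkpoint loaded there):

import torch
from transformers import AutoTokenizer

# Classify a sample review with the patched model (illustrative only).
tokenizer = AutoTokenizer.from_pretrained(model_ckt)
inputs = tokenizer("A surprisingly touching film with great performances.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))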

Run

In this task, all experiments are conducted on the following datasets: IMDb, SST-2, and Rotten Tomatoes. By default, all data and model checkpoints are downloaded and saved to the folder specified by the DATA_PATH variable in tasks/tc/config.py. You can modify this variable to specify a different path as needed.

You can directly evaluate off-the-shelf performance by running:

python main_tc.py \
   --algo $ALGO \
   --ratio $RATIO \
   --task $TASK \
   --model $MODEL \
   --eval 

Or retrain the model by running:

CUDA_VISIBLE_DEVICES=$5 python -m accelerate.commands.launch main_tc.py \
   --model $MODEL \
   --algo $ALGO \
   --ratio $RATIO \
   --task $TASK 

You can also evaluate all models with all baselines using multiple ratios r by running:

sh scripts/eval_scripts/eval_tc_all.sh # off-the-shelf evaluation
sh scripts/train_scripts/train_tc_all.sh # retrain

The results will be printed and saved to the outputs/tc_outputs directory.

Notebook

You can refer to the notebooks folder for example usages.

Citation

@article{tran2024accelerating,
  title={Accelerating Transformers with Spectrum-Preserving Token Merging},
  author={Tran, Hoai-Chau and Nguyen, Duy MH and Nguyen, Duy M and Nguyen, Trung-Tin and Le, Ngan and Xie, Pengtao and Sonntag, Daniel and Zou, James Y and Nguyen, Binh T and Niepert, Mathias},
  journal={Advances in Neural Information Processing Systems},
  year={2024}
}

If you have any issues, feel free to contact us at tranhoaichau.00@gmail.com or Ho_Minh_Duy.Nguyen@dfki.de.

Acknowledgement

Thanks to Token Merging: Your ViT But Faster (ToMe) for providing open-source code. This repository is built on the original ToMe structure. We also adopted baselines from ToFu, DiffRate, mctf, crossget, and dct.