/R-MeeTo

Give us minutes, we give back a faster Mamba. The official implementation of "Faster Vision Mamba is Rebuilt in Minutes via Merged Token Re-training".


R-MeeTo: Rebuild Your Faster Vision Mamba in Minutes


Mingjia Shi*, Yuhao Zhou*, Ruiji Yu, Zekai Li, Zhiyuan Liang, Xuanlei Zhao, Xiaojiang Peng, Tanmay Rajpurohit, Ramakrishna Vedantam, Wangbo Zhao, Kai Wang, Yang You

(*: equal contribution, †: corresponding authors)

🌟🌟 Mingjia, Ruiji, Zekai, and Zhiyuan are looking for Ph.D. positions; many thanks for considering their applications.

Paper Project Page

TL;DR

  • Why is Mamba sensitive to token reduction?
  • Why does R-MeeTo (i.e., Merging + Re-training) work?

The answer to both is key knowledge loss.

video_pre_v5.mp4

Key knowledge loss is the main cause of the heavier performance drop that Mamba suffers after token reduction. R-MeeTo is therefore proposed to quickly restore the key knowledge and thereby recover performance.

R-MeeTo is simple and effective, with only two main modules: merging and re-training. Merging lowers the knowledge loss, while re-training quickly recovers the knowledge structure of Mamba.

video_pre_method.mp4
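The merging step can be illustrated with a small sketch. This is not the repository's implementation: it assumes a ToMe-style bipartite matching (even-position tokens are matched to odd-position tokens by cosine similarity), and the names `cosine` and `merge_tokens` are hypothetical. Tokens are plain lists of floats and are assumed to be non-zero vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(p * q for p, q in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def merge_tokens(tokens, r):
    """Merge r tokens into their most similar partners, keeping token order.

    Hypothetical helper: even positions form set A, odd positions form set B.
    Each A token is matched to its most similar B token; the r best-matched
    A tokens are averaged into their partners and removed from the sequence.
    """
    a_idx = range(0, len(tokens), 2)
    b_idx = range(1, len(tokens), 2)
    # For each A token, find the most similar B token.
    matches = []
    for i in a_idx:
        sim, j = max((cosine(tokens[i], tokens[j]), j) for j in b_idx)
        matches.append((sim, i, j))
    # Take the r pairs with the highest similarity.
    matches.sort(reverse=True)
    groups = {j: [tokens[j]] for j in b_idx}
    dropped = set()
    for _, i, j in matches[:r]:
        groups[j].append(tokens[i])
        dropped.add(i)
    # Rebuild the sequence in its original order, averaging each group.
    out = []
    for k, tok in enumerate(tokens):
        if k in dropped:
            continue
        members = groups.get(k, [tok])
        out.append([sum(c) / len(members) for c in zip(*members)])
    return out

tokens = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(merge_tokens(tokens, 1))  # token 0 is averaged into token 1
```

The sketch keeps the surviving tokens in their original sequence order, which seems a natural choice for a sequence model such as Mamba; in R-MeeTo the merged model is then re-trained, a step this sketch does not cover.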

Overview

Figure: Sketch of the analysis: Mamba is sensitive to token reduction. (i) The token-reduction experiments are conducted with DeiT-S (Transformer) and Vim-S (Mamba) on ImageNet-1K. (ii) In the shuffled-token experiments, the reduction ratios are 0.14 for Vim-Ti and 0.31 for Vim-S/Vim-B. The shuffle strategy is an odd-even shuffle: [0,1,2,3]→[0,2], [1,3]→[0,2,1,3]. The empirical mutual information I(X;Y) between the inputs X and outputs Y of the Attention block and the SSM is measured with MINE on the middle layers of DeiT-S and Vim-S (the 7th of 12 layers and the 14th of 24 layers, respectively). See this implementation repo of MINE.
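The odd-even shuffle described in the caption can be written in a few lines. This is a minimal sketch on a plain Python list; the function name `odd_even_shuffle` is hypothetical and not taken from the repository.

```python
def odd_even_shuffle(tokens):
    """Place the even-indexed tokens first, followed by the odd-indexed ones."""
    return tokens[0::2] + tokens[1::2]

print(odd_even_shuffle([0, 1, 2, 3]))  # [0, 2, 1, 3]
```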

Abstract: Vision Mamba (e.g., Vim) has successfully been integrated into computer vision, and token reduction has yielded promising outcomes in Vision Transformers (ViTs). However, token reduction is less effective on Vision Mamba than on ViTs. Pruning informative tokens in Mamba leads to a high loss of key knowledge and a drop in performance, making pruning a poor choice for improving efficiency. Token merging, which preserves more token information than pruning, has demonstrated commendable performance in ViTs, but the performance of vanilla merging also decreases as the reduction ratio increases, failing to maintain the key knowledge and performance in Mamba. Re-training the model with token merging effectively rebuilds the key knowledge and enhances the performance of Mamba. Empirically, in our main evaluation, pruned Vims recovered with our proposed framework R-MeeTo drop at most 0.9% accuracy on ImageNet-1K. We show how simply and effectively this fast recovery can be achieved at the minute level: in particular, a 35.9% accuracy spike over 3 epochs of training on Vim-Ti. Moreover, Vim-Ti/S/B are re-trained within 5/7/17 minutes, and Vim-S drops only 1.3% accuracy with a 1.2× (up to 1.5×) inference speed-up.

🚀 News

  • 2024.12.12: The code is released.

⚡️ Faster Vision Mamba is Rebuilt in Minutes

Hardware                      | Vim-Ti    | Vim-S     | Vim-B
1 x 8 x H100 (single machine) | 16.2 mins | 25.2 mins | 57.6 mins
2 x 8 x H100 (Infiniband)     | 8.1 mins  | 12.9 mins | 30.6 mins
4 x 8 x H100 (Infiniband)     | 4.2 mins  | 6.8 mins  | 16.9 mins

Wall-clock time (in minutes) for re-training Vim-Ti, Vim-S, and Vim-B with R-MeeTo for 3 epochs on three hardware configurations. Give us minutes, we give back a faster Mamba.
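The wall times above also show how re-training scales with the number of machines. The sketch below only does arithmetic on the numbers copied from the table; the names `wall_minutes` and `scaling` are hypothetical, and "efficiency" here simply means speed-up divided by the number of 8 x H100 machines.

```python
# Wall time in minutes for 3 epochs, keyed by number of 8 x H100 machines.
wall_minutes = {
    1: {"Vim-Ti": 16.2, "Vim-S": 25.2, "Vim-B": 57.6},
    2: {"Vim-Ti": 8.1, "Vim-S": 12.9, "Vim-B": 30.6},
    4: {"Vim-Ti": 4.2, "Vim-S": 6.8, "Vim-B": 16.9},
}

def scaling(model):
    """Return {machines: (speed-up, efficiency)} relative to a single machine."""
    base = wall_minutes[1][model]
    result = {}
    for machines, times in wall_minutes.items():
        speedup = base / times[model]
        result[machines] = (speedup, speedup / machines)
    return result

for model in ("Vim-Ti", "Vim-S", "Vim-B"):
    speedup, efficiency = scaling(model)[4]
    print(f"{model}: {speedup:.2f}x on 4 machines, efficiency {efficiency:.2f}")
```

For Vim-Ti, for example, four machines give a speed-up of about 3.86x over one machine (16.2 / 4.2), i.e. an efficiency of about 0.96.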

🛠 Dataset Preparation

  • For the image dataset, we use ImageNet-1K.
  • For the video dataset, we use K400. You can download it from OpenDataLab or its official website. We follow the data list from here to split the dataset.

🛠 Installation

1. Clone the repository

git clone https://github.com/NUS-HPC-AI-Lab/R-MeeTo

2. Create a new Conda environment

conda env create -f environment.yml

or install the necessary packages from requirements.txt

conda create -n R_MeeTo python=3.10.12
conda activate R_MeeTo
pip install -r requirements.txt

3. Install Mamba package manually

  • For Vim baseline: pip install the mamba package and causal-conv1d (version:1.1.1) in the Vim repo.
git clone https://github.com/hustvl/Vim
cd Vim 
pip install -e causal_conv1d==1.1.0
pip install -e mamba-1p1p1
  • For VideoMamba baseline: pip install the mamba package and causal-conv1d (version:1.1.0) in the VideoMamba repo.
git clone https://github.com/OpenGVLab/VideoMamba
cd VideoMamba
pip install -e causal_conv1d
pip install -e mamba
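After the manual installation, a quick check that the packages can be found may save a failed run later. This is a minimal sketch, not part of the repository: the helper name `check_packages` is hypothetical, and it assumes the packages are importable as `torch`, `mamba_ssm`, and `causal_conv1d`.

```python
import importlib.util

def check_packages(names=("torch", "mamba_ssm", "causal_conv1d")):
    """Return {module name: True if it can be located in this environment}."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

for name, found in check_packages().items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```

The check only locates the modules; it does not import them or confirm their versions.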

4. Download the pretrained baseline models from the baselines' official sources

See PRETRAINED for downloading the pretrained model of our baseline.

⚙️ Usage

🛠️ Reproduce our results

For image task:

bash ./image_task/exp_sh/tab2/vim_tiny.sh

For video task:

bash ./video_task/exp_sh/tab13/videomamba_tiny.sh

Checkpoints:

See CKPT to find our reproduced checkpoints and logs of the main results.

⏱️ Measure inference speed

R-MeeTo effectively improves inference speed and is applicable to consumer-level, enterprise-level, and other high-performance devices. See this example for testing FLOPs (G) and throughput (im/s).
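For reference, throughput in im/s is the number of images processed divided by the elapsed wall time. The sketch below is a generic timing loop, not the repository's benchmark script; `measure_throughput` and `run_batch` are hypothetical names, and `run_batch` stands for any callable that runs one forward pass on a batch.

```python
import time

def measure_throughput(run_batch, batch_size, warmup=5, iters=20):
    """Return images per second for a callable that processes one batch."""
    # Warm-up iterations are excluded from timing (caches, lazy initialization).
    for _ in range(warmup):
        run_batch()
    # On a GPU, run_batch should wait for the device to finish before
    # returning (e.g. via torch.cuda.synchronize()); otherwise only the
    # kernel launches are timed.
    start = time.perf_counter()
    for _ in range(iters):
        run_batch()
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed

# Stand-in workload so the sketch runs without a model.
print(measure_throughput(lambda: sum(range(100_000)), batch_size=128))
```

With a real model, `run_batch` would wrap the forward pass on a fixed input batch, and the same loop can be run on the baseline and the re-trained model to compare them.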

🖼️ Visualization

See this example, which visualizes merged tokens on the ImageNet-1K validation set using a re-trained Vim-S.

Citation

If you find our work useful, please consider citing us.

@misc{shi2024faster,
      title={Faster Vision Mamba is Rebuilt in Minutes Via Merged Token Re-training},
      author={Shi, Mingjia and Zhou, Yuhao and Yu, Ruiji and Li, Zekai and Liang, Zhiyuan and Zhao, Xuanlei and
       Peng, Xiaojiang and Rajpurohit, Tanmay and Vedantam, Ramakrishna and
       Zhao, Wangbo and Wang, Kai and You, Yang},
      year={2024},
      eprint={2412.12496},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2412.12496},
}

Acknowledgement

The repo is partly built on ToMe, Vision Mamba, and VideoMamba. We are grateful for their generous contributions to open source.