Parameter-Efficient-Transfer-Learning-Benchmark

A Unified Parameter-Efficient Transfer Learning Benchmark for Computer Vision Tasks

Visual Parameter-Efficient Transfer Learning Benchmark


· Benchmark · Homepage · Document ·

🔥 News and Updates

  • ✅ [2024/07/01] Visual PEFT Benchmark begins releasing code.

  • ✅ [2024/06/20] Visual PEFT Benchmark homepage is created.

  • ✅ [2024/06/01] Visual PEFT Benchmark repo is created.

🔥 Introduction

Parameter-efficient transfer learning (PETL) methods show promise in adapting a pre-trained model to various downstream tasks while training only a few parameters. In the computer vision (CV) domain, numerous PETL algorithms have been proposed, but directly employing or comparing them remains inconvenient. To address this challenge, we construct a Unified Visual PETL Benchmark (V-PETL Bench) for the CV domain by selecting 30 diverse, challenging, and comprehensive datasets from image recognition, video action recognition, and dense prediction tasks. On these datasets, we systematically evaluate 25 dominant PETL algorithms and open-source a modular and extensible codebase for their fair evaluation. Running V-PETL Bench on NVIDIA A800 GPUs requires approximately 310 GPU-days, so we release all checkpoints and training logs to spare researchers this cost. Additionally, V-PETL Bench will be continuously updated with new PETL algorithms and CV tasks.

🔥 Getting Started

This is an example of how to set up V-PETL Bench locally.

To get a local copy up and running, follow these simple steps.

Environment Setup

V-PETL Bench is built on PyTorch, together with torchvision, torchaudio, and timm.

  • To install the required packages, first create a conda environment.
conda create --name v-petl-bench python=3.8
  • Activate the conda environment.
conda activate v-petl-bench
  • Use pip to install the required packages.
pip install -r requirements.txt
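
After installation, a quick sanity check confirms that the core dependencies import and that CUDA is visible. This is a minimal sketch; it assumes only the packages listed above.

```python
# Sanity-check the environment; versions are whatever requirements.txt resolves to.
import torch
import torchvision
import timm

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("timm:", timm.__version__)
print("CUDA available:", torch.cuda.is_available())
```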

Data Preparation

Image Classification Datasets

  • 1. Visual Task Adaptation Benchmark (VTAB)

    VTAB comprises 19 diverse visual classification datasets. We have processed all the datasets, and the data can be downloaded here: Download Link. For specific processing procedures and tips, please see VTAB_SETUP. A loading sketch follows this list.

  • 2. Fine-Grained Visual Classification tasks (FGVC)

    FGVC comprises 5 fine-grained visual classification datasets. The datasets can be downloaded via the official links. We split the training data when a public validation set is not available; the resulting splits can be found here: Download Link.
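
For illustration, the sketch below loads a processed dataset and carves out a held-out validation split. It is a minimal sketch, not the benchmark's actual data pipeline: the ImageFolder-style directory layout, the paths, the 90/10 ratio, and the seed are all assumptions, and the released splits at the download links above should be preferred.

```python
# Minimal sketch: load a processed dataset (assumed ImageFolder layout) and
# create a train/val split when no public validation set exists.
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, random_split

transform = transforms.Compose([
    transforms.Resize((224, 224)),  # ViT-B/16 input resolution
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
])

# Hypothetical path; point this at your processed VTAB or FGVC data.
full_train = datasets.ImageFolder("data/fgvc/cub200/train", transform=transform)

n_val = int(0.1 * len(full_train))  # illustrative 90/10 split
train_set, val_set = random_split(
    full_train,
    [len(full_train) - n_val, n_val],
    generator=torch.Generator().manual_seed(42),  # reproducible split
)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)
val_loader = DataLoader(val_set, batch_size=64, shuffle=False, num_workers=4)
```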

Video Action Recognition Datasets

  • 1. Kinetics-400

  • 2. Something-Something V2 (SSv2)

  • 3. HMDB51

Dense Prediction Datasets

  • 1. MS-COCO

  • 2. ADE20K

  • 3. PASCAL VOC

Pre-trained Model Preparation
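
The backbones used in the result tables (e.g., ViT-B/16 pre-trained on ImageNet-21K) can be fetched through timm, which is already among the requirements. The checkpoint tag below is an assumption; substitute whatever weights the benchmark specifies.

```python
# Minimal sketch of fetching an ImageNet-21K pre-trained ViT-B/16 via timm.
import timm

backbone = timm.create_model(
    "vit_base_patch16_224.augreg_in21k",  # assumed tag: ViT-B/16, ImageNet-21K weights
    pretrained=True,
    num_classes=0,  # drop the classifier head; PETL methods attach their own
)
```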

Quick Start
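
Until the official scripts are documented here, the sketch below illustrates the PETL recipe in the spirit of BitFit (one of the benchmarked algorithms): freeze the backbone and update only bias terms plus a fresh classification head. It is an illustration of the idea, not the benchmark's training code; the model tag, class count, and hyperparameters are assumptions.

```python
# BitFit-style PETL illustration: only biases and the new head are trained.
import timm
import torch

model = timm.create_model(
    "vit_base_patch16_224.augreg_in21k",  # assumed backbone tag
    pretrained=True,
    num_classes=200,                      # e.g. CUB-200-2011
)

for name, param in model.named_parameters():
    # Keep biases and the classifier head trainable; freeze everything else.
    param.requires_grad = name.endswith(".bias") or name.startswith("head.")

trainable = [p for p in model.parameters() if p.requires_grad]
print(f"Trainable params: {sum(p.numel() for p in trainable) / 1e6:.2f} M")

optimizer = torch.optim.AdamW(trainable, lr=1e-3, weight_decay=1e-4)
criterion = torch.nn.CrossEntropyLoss()

def train_step(images, labels):
    """One optimization step over a batch."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```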

Results

Benchmark results of image classification on FGVC

We evaluate 13 PETL algorithms on five FGVC datasets with ViT-B/16 models pre-trained on ImageNet-21K. We highlight the best and the second-best results.

| Method | CUB-200-2011 | NABirds | Oxford Flowers | Stanford Dogs | Stanford Cars | Mean | # Params. (M) | PPT |
|---|---|---|---|---|---|---|---|---|
| *Traditional Fine-tuning* | | | | | | | | |
| Full fine-tuning | 87.3 | 82.7 | 98.8 | 89.4 | 84.5 | 88.54 | 85.8 | - |
| Linear probing | 85.3 | 75.9 | 97.9 | 86.2 | 51.3 | 79.32 | 0 | 0.79 |
| *PETL Algorithms* | | | | | | | | |
| Adapter | 87.1 | 84.3 | 98.5 | 89.8 | 68.6 | 85.66 | 0.41 | 0.84 |
| AdaptFormer | 88.4 | 84.7 | 99.2 | 88.2 | 81.9 | 88.48 | 0.46 | 0.87 |
| Prefix Tuning | 87.5 | 82.0 | 98.0 | 74.2 | 90.2 | 86.38 | 0.36 | 0.85 |
| U-Tuning | 89.2 | 85.4 | 99.2 | 84.1 | 92.1 | 90.00 | 0.36 | 0.89 |
| BitFit | 87.7 | 85.2 | 99.2 | 86.5 | 81.5 | 88.02 | 0.10 | 0.88 |
| VPT-Shallow | 86.7 | 78.8 | 98.4 | 90.7 | 68.7 | 84.66 | 0.25 | 0.84 |
| VPT-Deep | 88.5 | 84.2 | 99.0 | 90.2 | 83.6 | 89.10 | 0.85 | 0.86 |
| SSF | 89.5 | 85.7 | 99.6 | 89.6 | 89.2 | 90.72 | 0.39 | 0.89 |
| LoRA | 85.6 | 79.8 | 98.9 | 87.6 | 72.0 | 84.78 | 0.77 | 0.82 |
| GPS | 89.9 | 86.7 | 99.7 | 92.2 | 90.4 | 91.78 | 0.66 | 0.90 |
| HST | 89.2 | 85.8 | 99.6 | 89.5 | 88.2 | 90.46 | 0.78 | 0.88 |
| LAST | 88.5 | 84.4 | 99.7 | 86.0 | 88.9 | 89.50 | 0.66 | 0.87 |
| SNF | 90.2 | 87.4 | 99.7 | 89.5 | 86.9 | 90.74 | 0.25 | 0.90 |
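
The PPT column in the result tables is the Performance-Parameter Trade-off, which discounts a method's task metric by its trainable-parameter budget so that lightweight methods are rewarded. The tabulated values are consistent with the reconstruction below, where $M$ is the task metric scaled to $[0, 1]$, $P$ is the trainable-parameter count, and $C = 10^7$ (10M parameters) is a normalization constant:

```latex
% Performance-Parameter Trade-off (PPT) for a method with task metric M
% (scaled to [0, 1]), trainable-parameter count P, and normalization
% constant C = 10^7 (10M parameters).
\mathrm{PPT} = M \times \exp\!\left(-\log_{10}\!\left(\frac{P}{C} + 1\right)\right)
```

For example, Adapter on FGVC has a mean of 85.66 with 0.41M trainable parameters, giving 0.8566 × exp(-log10(0.041 + 1)) ≈ 0.84, matching the table.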

Benchmark results of image classification on VTAB

Benchmark results on VTAB. We evaluate 18 PETL algorithms on 19 datasets with ViT-B/16 models pre-trained on ImageNet-21K. We highlight the best and the second-best results.

| Method | CIFAR-100 | Caltech101 | DTD | Flowers102 | Pets | SVHN | Sun397 | Patch Camelyon | EuroSAT | Resisc45 | Retinopathy | Clevr/count | Clevr/distance | DMLab | KITTI/distance | dSprites/loc | dSprites/ori | SmallNORB/azi | SmallNORB/ele | Mean | # Params. (M) | PPT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Traditional Fine-tuning* | | | | | | | | | | | | | | | | | | | | | | |
| Full fine-tuning | 68.9 | 87.7 | 64.3 | 97.2 | 86.9 | 87.4 | 38.8 | 79.7 | 95.7 | 84.2 | 73.9 | 56.3 | 58.6 | 41.7 | 65.5 | 57.5 | 46.7 | 25.7 | 29.1 | 65.57 | 85.8 | - |
| Linear probing | 63.4 | 85.0 | 63.2 | 97.0 | 86.3 | 36.6 | 51.0 | 78.5 | 87.5 | 68.6 | 74.0 | 34.3 | 30.6 | 33.2 | 55.4 | 12.5 | 20.0 | 9.6 | 19.2 | 52.94 | 0 | 0.53 |
| *PETL Algorithms* | | | | | | | | | | | | | | | | | | | | | | |
| Adapter | 69.2 | 90.1 | 68.0 | 98.8 | 89.9 | 82.8 | 54.3 | 84.0 | 94.9 | 81.9 | 75.5 | 80.9 | 65.3 | 48.6 | 78.3 | 74.8 | 48.5 | 29.9 | 41.6 | 71.44 | 0.16 | 0.71 |
| VPT-Shallow | 77.7 | 86.9 | 62.6 | 97.5 | 87.3 | 74.5 | 51.2 | 78.2 | 92.0 | 75.6 | 72.9 | 50.5 | 58.6 | 40.5 | 67.1 | 68.7 | 36.1 | 20.2 | 34.1 | 64.85 | 0.08 | 0.65 |
| VPT-Deep | 78.8 | 90.8 | 65.8 | 98.0 | 88.3 | 78.1 | 49.6 | 81.8 | 96.1 | 83.4 | 68.4 | 68.5 | 60.0 | 46.5 | 72.8 | 73.6 | 47.9 | 32.9 | 37.8 | 69.43 | 0.56 | 0.68 |
| BitFit | 72.8 | 87.0 | 59.2 | 97.5 | 85.3 | 59.9 | 51.4 | 78.7 | 91.6 | 72.9 | 69.8 | 61.5 | 55.6 | 32.4 | 55.9 | 66.6 | 40.0 | 15.7 | 25.1 | 62.05 | 0.10 | 0.61 |
| LoRA | 67.1 | 91.4 | 69.4 | 98.8 | 90.4 | 85.3 | 54.0 | 84.9 | 95.3 | 84.4 | 73.6 | 82.9 | 69.2 | 49.8 | 78.5 | 75.7 | 47.1 | 31.0 | 44.0 | 72.25 | 0.29 | 0.71 |
| AdaptFormer | 70.8 | 91.2 | 70.5 | 99.1 | 90.9 | 86.6 | 54.8 | 83.0 | 95.8 | 84.4 | 76.3 | 81.9 | 64.3 | 49.3 | 80.3 | 76.3 | 45.7 | 31.7 | 41.1 | 72.32 | 0.16 | 0.72 |
| SSF | 69.0 | 92.6 | 75.1 | 99.4 | 91.8 | 90.2 | 52.9 | 87.4 | 95.9 | 87.4 | 75.5 | 75.9 | 62.3 | 53.3 | 80.6 | 77.3 | 54.9 | 29.5 | 37.9 | 73.10 | 0.21 | 0.72 |
| NOAH | 69.6 | 92.7 | 70.2 | 99.1 | 90.4 | 86.1 | 53.7 | 84.4 | 95.4 | 83.9 | 75.8 | 82.8 | 68.9 | 49.9 | 81.7 | 81.8 | 48.3 | 32.8 | 44.2 | 73.25 | 0.43 | 0.72 |
| SCT | 75.3 | 91.6 | 72.2 | 99.2 | 91.1 | 91.2 | 55.0 | 85.0 | 96.1 | 86.3 | 76.2 | 81.5 | 65.1 | 51.7 | 80.2 | 75.4 | 46.2 | 33.2 | 45.7 | 73.59 | 0.11 | 0.73 |
| FacT | 70.6 | 90.6 | 70.8 | 99.1 | 90.7 | 88.6 | 54.1 | 84.8 | 96.2 | 84.5 | 75.7 | 82.6 | 68.2 | 49.8 | 80.7 | 80.8 | 47.4 | 33.2 | 43.0 | 73.23 | 0.07 | 0.73 |
| RepAdapter | 72.4 | 91.6 | 71.0 | 99.2 | 91.4 | 90.7 | 55.1 | 85.3 | 95.9 | 84.6 | 75.9 | 82.3 | 68.0 | 50.4 | 79.9 | 80.4 | 49.2 | 38.6 | 41.0 | 73.84 | 0.22 | 0.72 |
| Hydra | 72.7 | 91.3 | 72.0 | 99.2 | 91.4 | 90.7 | 55.5 | 85.8 | 96.0 | 86.1 | 75.9 | 83.2 | 68.2 | 50.9 | 82.3 | 80.3 | 50.8 | 34.5 | 43.1 | 74.21 | 0.28 | 0.73 |
| LST | 59.5 | 91.5 | 69.0 | 99.2 | 89.9 | 79.5 | 54.6 | 86.9 | 95.9 | 85.3 | 74.1 | 81.8 | 61.8 | 52.2 | 81.0 | 71.7 | 49.5 | 33.7 | 45.2 | 71.70 | 2.38 | 0.65 |
| DTL | 69.6 | 94.8 | 71.3 | 99.3 | 91.3 | 83.3 | 56.2 | 87.1 | 96.2 | 86.1 | 75.0 | 82.8 | 64.2 | 48.8 | 81.9 | 93.9 | 53.9 | 34.2 | 47.1 | 74.58 | 0.04 | 0.75 |
| HST | 76.7 | 94.1 | 74.8 | 99.6 | 91.1 | 91.2 | 52.3 | 87.1 | 96.3 | 88.6 | 76.5 | 85.4 | 63.7 | 52.9 | 81.7 | 87.2 | 56.8 | 35.8 | 52.1 | 75.99 | 0.78 | 0.74 |
| GPS | 81.1 | 94.2 | 75.8 | 99.4 | 91.7 | 91.6 | 52.4 | 87.9 | 96.2 | 86.5 | 76.5 | 79.9 | 62.6 | 55.0 | 82.4 | 84.0 | 55.4 | 29.7 | 46.1 | 75.18 | 0.22 | 0.74 |
| LAST | 66.7 | 93.4 | 76.1 | 99.6 | 89.8 | 86.1 | 54.3 | 86.2 | 96.3 | 86.8 | 75.4 | 81.9 | 65.9 | 49.4 | 82.6 | 87.9 | 46.7 | 32.3 | 51.5 | 74.15 | 0.66 | 0.72 |
| SNF | 84.0 | 94.0 | 72.7 | 99.3 | 91.3 | 90.3 | 54.9 | 87.2 | 97.3 | 85.5 | 74.5 | 82.3 | 63.8 | 49.8 | 82.5 | 75.8 | 49.2 | 31.4 | 42.1 | 74.10 | 0.25 | 0.73 |

Benchmark results of video action recognition on SSv2 and HMDB51.

Benchmark results on SSv2 and HMDB51. We evaluate 5 PETL algorithms with ViT-B (from VideoMAE) and Video Swin Transformer backbones. The results are Top-1 accuracy.

| Method | Model | Pre-training | # Params. (M) | SSv2 Top-1 | SSv2 PPT | HMDB51 Top-1 | HMDB51 PPT |
|---|---|---|---|---|---|---|---|
| *Vision Transformer (from VideoMAE)* | | | | | | | |
| Full fine-tuning | ViT-B | Kinetics-400 | 85.97 | 53.97% | - | 46.41% | - |
| Frozen | ViT-B | Kinetics-400 | 0 | 29.23% | 0.29 | 49.84% | 0.50 |
| AdaptFormer | ViT-B | Kinetics-400 | 1.19 | 59.02% | 0.56 | 55.69% | 0.53 |
| BAPAT | ViT-B | Kinetics-400 | 2.06 | 57.78% | 0.53 | 57.18% | 0.53 |
| *Video Swin Transformer* | | | | | | | |
| Full fine-tuning | Video Swin-B | Kinetics-400 | 87.64 | 50.99% | - | 68.07% | - |
| Frozen | Video Swin-B | Kinetics-400 | 0 | 24.13% | 0.24 | 71.28% | 0.71 |
| LoRA | Video Swin-B | Kinetics-400 | 0.75 | 38.34% | 0.37 | 62.12% | 0.60 |
| BitFit | Video Swin-B | Kinetics-400 | 1.09 | 45.94% | 0.44 | 68.26% | 0.65 |
| AdaptFormer | Video Swin-B | Kinetics-400 | 1.56 | 40.80% | 0.38 | 68.66% | 0.64 |
| Prefix-tuning | Video Swin-B | Kinetics-400 | 6.37 | 39.46% | 0.32 | 56.13% | 0.45 |
| BAPAT | Video Swin-B | Kinetics-400 | 6.18 | 53.36% | 0.43 | 71.93% | 0.58 |

Benchmark results of dense prediction on COCO

Benchmark results on COCO with Cascade Mask R-CNN. We evaluate 9 PETL algorithms with Swin-B models pre-trained on ImageNet-22K.

| Swin-B | # Params. (M) | Memory (MB) | $\mathrm{AP_{Box}}$ | PPT | $\mathrm{AP_{Mask}}$ | PPT |
|---|---|---|---|---|---|---|
| *Traditional Fine-tuning* | | | | | | |
| Full fine-tuning | 86.75 | 17061 | 51.9% | - | 45.0% | - |
| Frozen | 0.00 | 7137 | 43.5% | 0.44 | 38.6% | 0.39 |
| *PETL Algorithms* | | | | | | |
| BitFit | 0.20 | 13657 | 47.9% | 0.47 | 41.9% | 0.42 |
| LN TUNE | 0.06 | 12831 | 48.0% | 0.48 | 41.4% | 0.41 |
| Partial-1 | 12.60 | 7301 | 49.2% | 0.35 | 42.8% | 0.30 |
| Adapter | 3.11 | 12557 | 50.9% | 0.45 | 43.8% | 0.39 |
| LoRA | 3.03 | 11975 | 51.2% | 0.46 | 44.3% | 0.40 |
| AdaptFormer | 3.11 | 13186 | 51.4% | 0.46 | 44.5% | 0.40 |
| LoRand | 1.20 | 13598 | 51.0% | 0.49 | 43.9% | 0.42 |
| E$^3$VA | 1.20 | 7639 | 50.5% | 0.48 | 43.8% | 0.42 |
| Mona | 4.16 | 13996 | 53.4% | 0.46 | 46.0% | 0.40 |

Benchmark results of dense prediction on PASCAL VOC and ADE20K.

Benchmark results on PASCAL VOC and ADE20K. We evaluate 9 PETL algorithms with Swin-L models pre-trained on ImageNet-22K, using RetinaNet on PASCAL VOC and UPerNet on ADE20K.

| Swin-L | # Params. (M) | Memory (MB, VOC) | VOC $\mathrm{AP_{Box}}$ | VOC PPT | ADE20K $\mathrm{mIoU}$ | ADE20K PPT |
|---|---|---|---|---|---|---|
| *Traditional Fine-tuning* | | | | | | |
| Full fine-tuning | 198.58 | 15679 | 83.5% | - | 52.10% | - |
| Frozen | 0.00 | 3967 | 83.6% | 0.84 | 46.84% | 0.47 |
| *PETL Algorithms* | | | | | | |
| BitFit | 0.30 | 10861 | 85.7% | 0.85 | 48.37% | 0.48 |
| LN TUNE | 0.09 | 10123 | 85.8% | 0.86 | 47.98% | 0.48 |
| Partial-1 | 28.34 | 3943 | 85.4% | 0.48 | 47.44% | 0.27 |
| Adapter | 4.66 | 10793 | 87.1% | 0.74 | 50.78% | 0.43 |
| LoRA | 4.57 | 10127 | 87.5% | 0.74 | 50.34% | 0.43 |
| AdaptFormer | 4.66 | 11036 | 87.3% | 0.74 | 50.83% | 0.43 |
| LoRand | 1.31 | 11572 | 86.8% | 0.82 | 50.76% | 0.48 |
| E$^3$VA | 1.79 | 4819 | 86.5% | 0.81 | 49.64% | 0.46 |
| Mona | 5.08 | 11958 | 87.3% | 0.73 | 51.36% | 0.43 |

⭐ Citation

If you find our benchmark and repository useful for your research, please cite:

@article{xin2024bench,
  title={V-PETL Bench: A Unified Visual Parameter-Efficient Transfer Learning Benchmark},
  author={Xin, Yi and Luo, Siqi and Liu, Xuyang and Zhou, Haodi and Cheng, Xinyu and Lee, Christina Luoluo and Du, Junlong and Du, Yuntao and Wang, Haozhe and Chen, MingCai and Liu, Ting and Hu, Guimin and Wan, Zhongwei and Zhang, Rongchao and Li, Aoxue and Yi, Mingyang and Liu, Xiaohong},
  year={2024}
}