Down-Sampling Inter-Layer Adapter for Parameter and Computation Efficient Ultra-Fine-Grained Image Recognition

Official PyTorch code for the paper: Down-Sampling Inter-Layer Adapter for Parameter and Computation Efficient Ultra-Fine-Grained Image Recognition.

We propose a novel down-sampling adapter module, inserted between transformer encoder layers, that adapts and down-samples features in the parameter-efficient transfer learning (PETL) setting.

This alleviates the attention collapse issue observed for ViTs trained on ultra-fine-grained image recognition (UFGIR) datasets under PETL settings.

Our method obtains favorable results across ten UFGIR datasets.

It also achieves a superior accuracy vs. cost trade-off (in terms of total trainable parameters and FLOPs) compared to alternatives.

Pre-trained checkpoints are available on Hugging Face!
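For example, a checkpoint can be fetched with the huggingface_hub library; note that the repository id and filename below are placeholders, not the actual checkpoint names:

# Hypothetical sketch: repo_id and filename are placeholders; substitute the
# actual values from our Hugging Face page.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="user/ila-checkpoints",        # placeholder repository id
    filename="soyageing_vit_b16_ila.pth",  # placeholder checkpoint file
)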

The code for our model (and the ViT backbone) is in fgir_vit/model_utils/modules_others/adapters.py.
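For orientation, below is a minimal, hypothetical sketch of a bottleneck adapter combined with token down-sampling; it only illustrates the general idea and does not reproduce the exact module in adapters.py:

import torch
import torch.nn as nn

class DownSamplingAdapter(nn.Module):
    # Hypothetical sketch: a standard bottleneck adapter (down-project,
    # nonlinearity, up-project) with a residual connection, followed by
    # strided average pooling over the token sequence. The actual module
    # in adapters.py differs (e.g., in how special tokens are handled).
    def __init__(self, dim, bottleneck=64, stride=2):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        self.pool = nn.AvgPool1d(kernel_size=stride, stride=stride)

    def forward(self, x):  # x: (batch, tokens, dim)
        h = x + self.up(self.act(self.down(x)))           # adapter + residual
        h = self.pool(h.transpose(1, 2)).transpose(1, 2)  # down-sample tokens
        return h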

Setup

pip install -e . 

Preparation

Datasets are downloaded from:

Xiaohan Yu, Yang Zhao, Yongsheng Gao, Xiaohui Yuan, Shengwu Xiong (2021). Benchmark Platform for Ultra-Fine-Grained Visual Categorization Beyond Human Performance. In ICCV 2021.
https://github.com/XiaohanYu-GU/Ultra-FGVC?tab=readme-ov-file

Train

To train a ViT B-16 with ILA++ on SoyLocal using image size 224:

python tools/train.py --serial 1 --cfg configs/soylocal_ft_weakaugs.yaml --model_name vit_b16 --freeze_backbone --classifier cls --adapter adapter --ila --ila_locs --cpu_workers 8 --seed 1 --lr 0.1

Similarly, for image size 448:

python tools/train.py --serial 3 --cfg configs/soylocal_ft_weakaugs.yaml --model_name vit_b16 --freeze_backbone --image_size 448 --classifier cls --adapter adapter --ila --ila_locs --cpu_workers 8 --seed 100 --lr 0.1

Compute CKA Similarity

For a frozen vanilla ViT B-16:

python -u tools/compute_feature_metrics.py --debugging --serial 32 --cfg configs/soyageing_ft_weakaugs.yaml --model_name vit_b16 --fp16 --compute_attention_cka --ckpt_path ckpts/soyageing_vit_b16_cls_fz_1.pth

For frozen ViT B-16 with our proposed ILA:

python -u tools/compute_feature_metrics.py --debugging --serial 32 --cfg configs/soyageing_ft_weakaugs.yaml --model_name vit_b16 --fp16 --compute_attention_cka --ckpt_path ckpts/soyageing_vit_b16_ila_dso_cls_adapter_fz_1.pth --adapter adapter --ila --ila_locs

Results will be saved in the results_inference directory.
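For reference, linear CKA between two feature matrices (rows are examples) follows the standard definition of Kornblith et al. (2019); the sketch below assumes NumPy and is not necessarily the exact implementation used in compute_feature_metrics.py:

import numpy as np

def linear_cka(x, y):
    # x, y: (num_examples, num_features). Center each feature column, then
    # compute ||X^T Y||_F^2 normalized by ||X^T X||_F * ||Y^T Y||_F.
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    return np.linalg.norm(x.T @ y) ** 2 / (
        np.linalg.norm(x.T @ x) * np.linalg.norm(y.T @ y))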

Visualize attention

To visualize attention rollout for a ViT with ILA over the first encoder group (the first four encoder blocks, 0_4):

python -u tools/vis_dfsm.py --batch_size 8 --vis_cols 8 --vis_mask rollout_0_4 --serial 30 --cfg configs/soyageing_ft_weakaugs.yaml --model_name vit_b16 --fp16 --ckpt_path ../results_ila/serial1_ckpts/soyageing_vit_b16_ila_dso_cls_adapter_fz_1.pth --adapter adapter --ila --ila_locs
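As background, attention rollout (Abnar & Zuidema, 2020) recursively multiplies per-layer attention matrices, averaged over heads and mixed with the identity to account for residual connections. A minimal sketch of the idea (not the exact code in vis_dfsm.py):

import torch

def attention_rollout(attentions):
    # attentions: list of per-layer tensors of shape
    # (batch, heads, tokens, tokens), e.g., the first four blocks
    # for the mask rollout_0_4.
    result = None
    for attn in attentions:
        a = attn.mean(dim=1)                            # average over heads
        a = a + torch.eye(a.size(-1), device=a.device)  # residual connection
        a = a / a.sum(dim=-1, keepdim=True)             # re-normalize rows
        result = a if result is None else a @ result    # accumulate layers
    return result  # (batch, tokens, tokens)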

Citation

If you find our work helpful in your research, please cite it as:

[1] E. A. Rios, F. Oyerinde, M.-C. Hu, and B.-C. Lai, "Down-Sampling Inter-Layer Adapter for Parameter and Computation Efficient Ultra-Fine-Grained Image Recognition," arXiv preprint arXiv:2409.11051, Sep. 2024. doi: 10.48550/arXiv.2409.11051.

Acknowledgements

We thank NYCU's HPC Center and the National Center for High-performance Computing (NCHC) for providing computational and storage resources.

We thank the authors of TransFG, FFVT, SimTrans, CAL, MPN-COV, VPT, VQT, ConvPass and timm for providing implementations for comparison. We also thank the authors of the Ultra-FGVC datasets.

We also thank Weights & Biases for their experiment management platform.