iTPN

(CVPR2023/TPAMI2024) Integrally Pre-Trained Transformer Pyramid Networks -- A Hierarchical Vision Transformer for Masked Image Modeling

[CVPR2023/TPAMI2024]

(A Simple Hierarchical Vision Transformer Meets Masked Image Modeling)

Figure 1: The comparison between a conventional pre-training (left) and the proposed integral pre-training framework (right). We use a feature pyramid as the unified neck module and apply masked feature modeling for pre-training the feature pyramid. The green and red blocks indicate that the network weights are pre-trained and un-trained (i.e., randomly initialized for fine-tuning), respectively.
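As a rough illustration of the masking step that masked image/feature modeling relies on, the sketch below draws a random mask over a grid of patches. It is not taken from the iTPN code: the function name `random_patch_mask`, the 196-patch grid, and the 0.75 mask ratio are all illustrative assumptions, and the masking strategy actually used for pre-training may differ.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Return a list of booleans; True marks a patch hidden from the encoder.

    Hypothetical helper for illustration only (not part of the iTPN codebase).
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    # Sample the masked positions without replacement.
    masked_ids = set(rng.sample(range(num_patches), num_masked))
    return [i in masked_ids for i in range(num_patches)]


# Assumed setting: a 14x14 patch grid with 75% of the patches masked.
mask = random_patch_mask(num_patches=196, mask_ratio=0.75, seed=0)
visible_ids = [i for i, is_masked in enumerate(mask) if not is_masked]
```

Only the visible patches would be fed to the backbone; the reconstruction targets (pixels or teacher features) are then predicted at the masked positions.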

Updates

11/Jul./2024

Fast-iTPN is accepted by TPAMI2024.

08/Jan./2024

Fast-iTPN is publicly available on arXiv. Fast-iTPN is a more powerful version of iTPN.

26/Dec./2023

| model | Para. (M) | Pre-train | teacher | input/patch | 21K ft? | Acc on IN.1K | checkpoint | checkpoint (21K) |
|---|---|---|---|---|---|---|---|---|
| Fast-iTPN-T | 24 | IN.1K | CLIP-L | 224/16 | N | 85.1% | baidu/google | |
| Fast-iTPN-T | 24 | IN.1K | CLIP-L | 384/16 | N | 86.2% | | |
| Fast-iTPN-T | 24 | IN.1K | CLIP-L | 512/16 | N | 86.5% | | |
| Fast-iTPN-S | 40 | IN.1K | CLIP-L | 224/16 | N | 86.4% | baidu/google | |
| Fast-iTPN-S | 40 | IN.1K | CLIP-L | 384/16 | N | 86.95% | | |
| Fast-iTPN-S | 40 | IN.1K | CLIP-L | 512/16 | N | 87.8% | | |
| Fast-iTPN-B | 85 | IN.1K | CLIP-L | 224/16 | N | 87.4% | baidu/google | |
| Fast-iTPN-B | 85 | IN.1K | CLIP-L | 512/16 | N | 88.5% | | |
| Fast-iTPN-B | 85 | IN.1K | CLIP-L | 512/16 | Y | 88.75% | | baidu/google |
| Fast-iTPN-L | 312 | IN.1K | CLIP-L | 640/16 | N | 89.5% | baidu/google | |

All the pre-trained Fast-iTPN models are now available (password: itpn)! To our knowledge, the tiny/small/base-scale models achieve the best reported performance on ImageNet-1K. Use them for your own tasks! See Details.
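To make the input/patch column in the table above concrete, the following sketch computes the token grid for each listed resolution. It assumes that an entry such as 224/16 means a square input of side 224 split into 16x16 patches at the final stage; `patch_grid` is a hypothetical helper, and a hierarchical backbone uses finer grids in its earlier stages.

```python
def patch_grid(input_size, patch_size):
    """Return (grid side, number of tokens) for a square input.

    Hypothetical helper; assumes the input side is a multiple of the patch side.
    """
    if input_size % patch_size != 0:
        raise ValueError("input_size must be divisible by patch_size")
    side = input_size // patch_size
    return side, side * side


for size in (224, 384, 512, 640):
    side, tokens = patch_grid(size, 16)
    print(f"{size}/16 -> {side}x{side} grid, {tokens} tokens")
```

Under this reading, going from 224 to 512 input raises the token count from 196 to 1024, which is the main source of the extra fine-tuning cost at higher resolutions.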

30/May/2023

| model | Pre-train | teacher | input/patch | 21K ft? | Acc on IN.1K |
|---|---|---|---|---|---|
| EVA-02-B | IN.21K | EVA-CLIP-g | 196/14 | N | 87.0% |
| EVA-02-B | IN.21K | EVA-CLIP-g | 448/14 | N | 88.3% |
| EVA-02-B | IN.21K | EVA-CLIP-g | 448/14 | Y | 88.6% |
| Fast-iTPN-B | IN.1K | CLIP-L | 224/16 | N | 87.4% |
| Fast-iTPN-B | IN.1K | CLIP-L | 512/16 | N | 88.5% |
| Fast-iTPN-B | IN.1K | CLIP-L | 512/16 | Y | 88.7% |

The Fast-iTPN models above are pre-trained only on ImageNet-1K and will be available soon.

29/May/2023

The iTPN-L-CLIP/16 intermediate fine-tuned models are available (password: itpn): pretrained on 21K, and fine-tuned on 1K. Evaluating the latter on ImageNet-1K yields 89.2% accuracy.

28/Feb./2023

iTPN is accepted by CVPR2023!

08/Feb./2023

The iTPN-L-CLIP/16 model reaches 89.2% fine-tuning performance on ImageNet-1K.

configurations: intermediate fine-tuning on ImageNet-21K + 384 input size

21/Jan./2023

Our HiViT is accepted by ICLR2023!

HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer

08/Dec./2022

Get checkpoints (password: abcd):

| | iTPN-B-pixel | iTPN-B-CLIP | iTPN-L-pixel | iTPN-L-CLIP/16 |
|---|---|---|---|---|
| baidu drive | download | download | download | download |
| google drive | download | download | download | download |

25/Nov./2022

The preprint is publicly available on arXiv.

Requirements

  • Ubuntu
  • Python 3.7+
  • CUDA 10.2+
  • GCC 5+
  • PyTorch 1.7+
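A quick way to sanity-check part of the list above is a small script like the one below. This is only a sketch: `parse_version` and `check_environment` are hypothetical helpers rather than part of the repository, and only the Python and PyTorch versions are checked (CUDA and GCC are left out).

```python
import importlib.util
import sys


def parse_version(text):
    """Turn a version string such as '1.7.1+cu102' into (1, 7, 1)."""
    core = text.split("+")[0]  # drop any local build suffix
    numbers = []
    for piece in core.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        numbers.append(int(digits) if digits else 0)
    return tuple(numbers)


def check_environment():
    """Report whether Python and PyTorch meet the listed minimum versions."""
    report = {"python": sys.version_info[:2] >= (3, 7)}
    if importlib.util.find_spec("torch") is None:
        report["torch"] = None  # PyTorch is not installed
    else:
        import torch
        report["torch"] = parse_version(torch.__version__) >= (1, 7)
    return report


if __name__ == "__main__":
    print(check_environment())
```

A value of `None` for `torch` means the package was not found, while `False` means an installed version is older than the stated minimum.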

Dataset

  • ImageNet-1K
  • COCO2017
  • ADE20K

Get Started

Prepare the environment:

conda create --name itpn python=3.8 -y
conda activate itpn

git clone git@github.com:sunsmarterjie/iTPN.git
cd iTPN

pip install torch==1.7.1 torchvision==0.8.2 timm==0.3.2 tensorboard einops

iTPN supports pre-training with either pixel or CLIP supervision. For the latter, please first download the CLIP models (we use the CLIP-B/16 and CLIP-L/14 models in the paper).

Main Results

Table 1: Top-1 classification accuracy (%) by fine-tuning the pre-trained models on ImageNet-1K. We compare models of different levels and supervisions (e.g., with and without CLIP) separately.
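For readers unfamiliar with the metric, top-1 accuracy is the percentage of samples whose highest-scoring class equals the ground-truth label. The sketch below shows the computation on toy data; `top1_accuracy` is a hypothetical helper and the scores are made up for illustration, not results from the paper.

```python
def top1_accuracy(logits, labels):
    """Return the percentage of samples whose arg-max class matches the label."""
    if len(logits) != len(labels) or not labels:
        raise ValueError("logits and labels must be non-empty and equally long")
    correct = 0
    for scores, label in zip(logits, labels):
        # Index of the highest score is the predicted class.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy example: four samples, three classes.
logits = [
    [0.1, 2.0, 0.3],
    [1.5, 0.2, 0.1],
    [0.0, 0.4, 0.9],
    [0.8, 0.7, 0.1],
]
labels = [1, 0, 2, 1]
acc = top1_accuracy(logits, labels)
```

Here three of the four predictions match their labels, so `acc` is 75.0.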

Table 2: Visual recognition results (%) on COCO and ADE20K. Mask R-CNN (abbr. MR, 1x/3x) and Cascade Mask R-CNN (abbr. CMR, 1x) are used on COCO, and UPerHead with 512x512 input is used on ADE20K. For the base-level models, each cell of COCO results contains object detection (box) and instance segmentation (mask) APs. For the large-level models, the accuracy of 1x Mask R-CNN surpasses all existing methods.

License

iTPN is released under the License.

Citation

@article{tian2024fast,
  title={Fast-iTPN: Integrally pre-trained transformer pyramid network with token migration},
  author={Tian, Yunjie and Xie, Lingxi and Qiu, Jihao and Jiao, Jianbin and Wang, Yaowei and Tian, Qi and Ye, Qixiang},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2024},
  publisher={IEEE}
}
@inproceedings{tian2023integrally,
  title={Integrally pre-trained transformer pyramid networks},
  author={Tian, Yunjie and Xie, Lingxi and Wang, Zhaozhi and Wei, Longhui and Zhang, Xiaopeng and Jiao, Jianbin and Wang, Yaowei and Tian, Qi and Ye, Qixiang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={18610--18620},
  year={2023}
}
@inproceedings{zhang2023hivit,
  title={HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer},
  author={Zhang, Xiaosong and Tian, Yunjie and Xie, Lingxi and Huang, Wei and Dai, Qi and Ye, Qixiang and Tian, Qi},
  booktitle={International Conference on Learning Representations},
  year={2023}
}