Results | Updates | Usage | Todo | Acknowledge
This branch contains the PyTorch implementation of ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation. It obtains 81.1 AP on the MS COCO Keypoint test-dev set.
- Integrated into Hugging Face Spaces 🤗 using Gradio. Try out the Web Demo for video and images.
Results on the MS COCO val set (single-task training), using detection results from a detector that obtains 56 mAP on the person class. The configs here can be used for both training and evaluation.
With classic decoder
Model | Pretrain | Resolution | AP | AR | config | log | weight |
---|---|---|---|---|---|---|---|
ViTPose-B | MAE | 256x192 | 75.8 | 81.1 | config | log | Onedrive |
ViTPose-L | MAE | 256x192 | 78.3 | 83.5 | config | log | Onedrive |
ViTPose-H | MAE | 256x192 | 79.1 | 84.1 | config | log | Onedrive |
With simple decoder
Model | Pretrain | Resolution | AP | AR | config | log | weight |
---|---|---|---|---|---|---|---|
ViTPose-B | MAE | 256x192 | 75.5 | 80.9 | config | log | Onedrive |
ViTPose-L | MAE | 256x192 | 78.2 | 83.4 | config | log | Onedrive |
ViTPose-H | MAE | 256x192 | 78.9 | 84.0 | config | log | Onedrive |
Results on the MS COCO val set (multi-task training), using detection results from a detector that obtains 56 mAP on the person class. Note the configs here are only for evaluation.
Model | Dataset | Resolution | AP | AR | config | weight |
---|---|---|---|---|---|---|
ViTPose-B | COCO+AIC+MPII+CrowdPose | 256x192 | 77.5 | 82.6 | config | Onedrive |
ViTPose-L | COCO+AIC+MPII+CrowdPose | 256x192 | 79.1 | 84.1 | config | Onedrive |
ViTPose-H | COCO+AIC+MPII+CrowdPose | 256x192 | 79.8 | 84.8 | config | Onedrive |
ViTPose-G | COCO+AIC+MPII+CrowdPose | 576x432 | 81.0 | 85.6 | - | - |
Results on the OCHuman test set (multi-task training), using ground-truth bounding boxes. Note the configs here are only for evaluation.
Model | Dataset | Resolution | AP | AR | config | weight |
---|---|---|---|---|---|---|
ViTPose-B | COCO+AIC+MPII+CrowdPose | 256x192 | 88.2 | 90.0 | config | Onedrive |
ViTPose-L | COCO+AIC+MPII+CrowdPose | 256x192 | 91.5 | 92.8 | config | Onedrive |
ViTPose-H | COCO+AIC+MPII+CrowdPose | 256x192 | 91.6 | 92.8 | config | Onedrive |
ViTPose-G | COCO+AIC+MPII+CrowdPose | 576x432 | 93.3 | 94.3 | - | - |
Results on the CrowdPose test set (multi-task training), using the YOLOv3 human detector. Note the configs here are only for evaluation.
Model | Dataset | Resolution | AP | AP (hard) | config | weight |
---|---|---|---|---|---|---|
ViTPose-B | COCO+AIC+MPII+CrowdPose | 256x192 | 74.7 | 63.3 | config | Onedrive |
ViTPose-L | COCO+AIC+MPII+CrowdPose | 256x192 | 76.6 | 65.9 | config | Onedrive |
ViTPose-H | COCO+AIC+MPII+CrowdPose | 256x192 | 76.3 | 65.6 | config | Onedrive |
ViTPose-G | COCO+AIC+MPII+CrowdPose | 576x432 | 78.3 | 67.9 | - | - |
Results on the MPII val set (multi-task training), using ground-truth bounding boxes. Note the configs here are only for evaluation. The metric is PCKh.
Model | Dataset | Resolution | Mean | config | weight |
---|---|---|---|---|---|
ViTPose-B | COCO+AIC+MPII+CrowdPose | 256x192 | 93.4 | config | Onedrive |
ViTPose-L | COCO+AIC+MPII+CrowdPose | 256x192 | 93.9 | config | Onedrive |
ViTPose-H | COCO+AIC+MPII+CrowdPose | 256x192 | 94.1 | config | Onedrive |
ViTPose-G | COCO+AIC+MPII+CrowdPose | 576x432 | 94.3 | - | - |
Results on the AI Challenger test set (multi-task training), using ground-truth bounding boxes. Note the configs here are only for evaluation.
Model | Dataset | Resolution | AP | AR | config | weight |
---|---|---|---|---|---|---|
ViTPose-B | COCO+AIC+MPII+CrowdPose | 256x192 | 31.9 | 36.3 | config | Onedrive |
ViTPose-L | COCO+AIC+MPII+CrowdPose | 256x192 | 34.6 | 39.0 | config | Onedrive |
ViTPose-H | COCO+AIC+MPII+CrowdPose | 256x192 | 35.3 | 39.8 | config | Onedrive |
ViTPose-G | COCO+AIC+MPII+CrowdPose | 576x432 | 43.2 | 47.1 | - | - |
- [2022-05-24] Upload the single-task training code, single-task pretrained models, and multi-task pretrained models.
- [2022-05-06] Upload the logs for the base, large, and huge models!
- [2022-04-27] Our ViTPose with ViTAE-G obtains 81.1 AP on the COCO test-dev set!
Applications of the ViTAE Transformer include: image classification | object detection | semantic segmentation | animal pose estimation | remote sensing | matting | VSA | ViTDet
We use PyTorch 1.9.0 or NGC docker 21.06, and mmcv 1.3.9 for the experiments. First build mmcv from source, then install this repo:
git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout v1.3.9
MMCV_WITH_OPS=1 pip install -e .
cd ..
git clone https://github.com/ViTAE-Transformer/ViTPose.git
cd ViTPose
pip install -v -e .
After installing the two repos, install timm and einops:
pip install timm==0.4.9 einops
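After installation, a quick sanity check can confirm that the compiled mmcv ops and the packages above resolve correctly (a minimal check; it assumes the editable install above put mmpose, which this repo is built on, on your path):

```bash
# Check that mmcv (with compiled ops), mmpose, timm, and einops all import.
python -c "import mmcv; print('mmcv', mmcv.__version__)"
python -c "from mmcv.ops import get_compiling_cuda_version; print('cuda', get_compiling_cuda_version())"
python -c "import mmpose; print('mmpose', mmpose.__version__)"
python -c "import timm, einops; print('timm', timm.__version__)"
```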
Download the pretrained models from MAE or ViTAE, and then run the experiments with:
# for single machine
bash tools/dist_train.sh <Config PATH> <NUM GPUs> --cfg-options model.pretrained=<Pretrained PATH> --seed 0
# for multiple machines
python -m torch.distributed.launch --nnodes <Num Machines> --node_rank <Rank of Machine> --nproc_per_node <GPUs Per Machine> --master_addr <Master Addr> --master_port <Master Port> tools/train.py <Config PATH> --cfg-options model.pretrained=<Pretrained PATH> --launcher pytorch --seed 0
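As a concrete example, a single-machine run of ViTPose-B on 8 GPUs could look like the sketch below. The MAE checkpoint URL and the config path are illustrative assumptions; check the official MAE release and this repo's configs directory for the exact files:

```bash
# Fetch an MAE-pretrained ViT-B backbone (URL assumed from the official MAE release).
wget https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth

# Train ViTPose-B on COCO with 8 GPUs, initializing the backbone from the MAE weights.
bash tools/dist_train.sh \
    configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/ViTPose_base_coco_256x192.py \
    8 \
    --cfg-options model.pretrained=mae_pretrain_vit_base.pth \
    --seed 0
```

Here `--cfg-options model.pretrained=...` overrides the `model.pretrained` field of the config at launch time, so the backbone initialization can be swapped without editing the config file.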
To test the performance of the pretrained models, run:
bash tools/dist_test.sh <Config PATH> <Checkpoint PATH> <NUM GPUs>
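For instance, evaluating a downloaded ViTPose-B checkpoint on COCO with 8 GPUs might look like the following (the config path and checkpoint name are illustrative; `--eval mAP` asks the underlying mmpose test script to report COCO keypoint AP):

```bash
# Evaluate a checkpoint; extra args after the GPU count are forwarded to tools/test.py.
bash tools/dist_test.sh \
    configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/ViTPose_base_coco_256x192.py \
    vitpose-b.pth \
    8 \
    --eval mAP
```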
This repo currently contains modifications including:
- Upload configs and pretrained models
- More models with SOTA results
- Upload multi-task training config
We acknowledge the excellent implementations from mmpose and MAE. If you find this repo useful, please consider citing:
@misc{xu2022vitpose,
title={ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation},
author={Yufei Xu and Jing Zhang and Qiming Zhang and Dacheng Tao},
year={2022},
eprint={2204.12484},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
For ViTAE and ViTAEv2, please refer to:
@article{xu2021vitae,
title={ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias},
author={Xu, Yufei and Zhang, Qiming and Zhang, Jing and Tao, Dacheng},
journal={Advances in Neural Information Processing Systems},
volume={34},
year={2021}
}
@article{zhang2022vitaev2,
title={ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond},
author={Zhang, Qiming and Xu, Yufei and Zhang, Jing and Tao, Dacheng},
journal={arXiv preprint arXiv:2202.10108},
year={2022}
}