
HAP

📚 Contents

- 📋 Introduction
- 📂 Datasets
- 🛠️ Environment
- 🚀 Get Started
- 🏆 Results
- 💗 Acknowledgement
- ✅ Citation
- 🤝 Contribute & Contact

📋 Introduction

This repository contains the implementation code for the paper:

HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception

Advances in Neural Information Processing Systems (NeurIPS) 2023

[arXiv]   [project page]

HAP is the first masked image modeling framework for human-centric pre-training. It leverages body-structure-aware training to learn general human visual representations, and achieves state-of-the-art (SOTA) performance across several human-centric benchmarks.

📂 Datasets

Pre-Training Data

We use LUPerson for pre-training. To make pre-training more efficient, we use only half of the dataset, selected by the "CFS_list.pkl" list from TransReID-SSL. The keypoint information that guides masking during pre-training is extracted by running ViTPose inference on LUPerson. You can download our pose dataset here; a minimal loading sketch follows the directory layout below.

Put the dataset directories outside the HAP project:

root
├── HAP
├── LUPerson-data  # LUPerson data
│   ├── xxx.jpg
│   └── ...
└── LUPerson-pose  # LUPerson with pose keypoints
    ├── xxx.npy
    └── ...
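To check that the two directories line up, here is a minimal loading sketch. It assumes, without having verified the release files, that each pose file shares its stem with an image file and stores a (num_keypoints, 3) array of (x, y, score) values from ViTPose inference:

from pathlib import Path

import numpy as np
from PIL import Image

data_root = Path("../LUPerson-data")
pose_root = Path("../LUPerson-pose")

img_path = next(data_root.glob("*.jpg"))         # any sample image
kpt_path = pose_root / (img_path.stem + ".npy")  # its matching pose file

image = Image.open(img_path).convert("RGB")
keypoints = np.load(kpt_path)                    # assumed shape: (K, 3)
print(image.size, keypoints.shape)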

🛠️ Environment

Conda is recommended for configuring the environment:

conda env create -f env-hap.yaml && conda activate env_hap
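As a quick sanity check (not part of the official setup), confirm that PyTorch in the new environment can see your GPUs:

import torch

print(torch.__version__)          # version pinned by env-hap.yaml
print(torch.cuda.is_available())  # True if CUDA GPUs are visible
print(torch.cuda.device_count())  # should match --nproc_per_node below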

🚀 Get Started

The default pre-training setting is 400 epochs with a total batch size of 4096.

Pre-training may require 32 GPUs, each with more than 32 GB of memory (e.g., NVIDIA V100).

# -------------------- Pre-Training HAP on LUPerson --------------------
cd HAP/

MODEL=pose_mae_vit_base_patch16

# Download the official MAE model pre-trained on ImageNet and move it here
CKPT=mae_pretrain_vit_base.pth

# Download the CFS list and move it here
CFS_PATH=cfs_list.pkl

# Set the distributed variables (NNODES, RANK, ADDRESS, PRETRAIN_PORT,
# NPROC_PER_NODE) according to your cluster before launching
OMP_NUM_THREADS=1 python -m torch.distributed.launch \
    --nnodes=${NNODES} \
    --node_rank=${RANK} \
    --master_addr=${ADDRESS} \
    --master_port=${PRETRAIN_PORT} \
    --nproc_per_node=${NPROC_PER_NODE} \
    main_pretrain.py \
    --dataset LUPersonPose \
    --data_path ../LUPerson-data \
    --pose_path ../LUPerson-pose \
    --sample_split_source ${CFS_PATH} \
    --batch_size 256 \
    --model ${MODEL} \
    --resume ${CKPT} \
    --ckpt_pos_embed 14 14 \
    --mask_ratio 0.5 \
    --align 0.05 \
    --epochs 400 \
    --blr 1.5e-4 \
    --ckpt_overwrite \
    --seed 0 \
    --tag default
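For intuition on what --mask_ratio 0.5 does together with the pose files, below is a conceptual sketch of keypoint-guided masking. It is not the code in main_pretrain.py: the function names are illustrative, and the rule shown (mask patches containing body joints first, then fill up with random background patches) is just one plausible instance of structure-aware masking.

import numpy as np

def keypoint_patch_ids(keypoints, img_hw=(256, 128), patch=16):
    """Map (x, y, score) keypoints to flat indices on the patch grid."""
    grid_h, grid_w = img_hw[0] // patch, img_hw[1] // patch
    ids = set()
    for x, y, score in keypoints:
        if score <= 0:                            # skip undetected joints
            continue
        row = min(int(y) // patch, grid_h - 1)
        col = min(int(x) // patch, grid_w - 1)
        ids.add(row * grid_w + col)
    return np.array(sorted(ids), dtype=int), grid_h * grid_w

def structure_aware_mask(keypoints, mask_ratio=0.5, img_hw=(256, 128)):
    """Mask mask_ratio of all patches, preferring patches with body joints."""
    kp_ids, num_patches = keypoint_patch_ids(keypoints, img_hw)
    num_mask = int(num_patches * mask_ratio)
    rng = np.random.default_rng()
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(kp_ids)[:num_mask]] = True   # joint patches first
    shortfall = num_mask - int(mask.sum())
    if shortfall > 0:                                 # pad with background
        background = np.flatnonzero(~mask)
        mask[rng.choice(background, shortfall, replace=False)] = True
    return mask                                       # True = masked patch

The separate --align flag (0.05 above) weights the structure-invariant alignment objective; see the paper for its precise formulation.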

🏆 Results

We evaluate HAP on the following downstream tasks. Click each one to find implementation instructions.

You can download the checkpoint of the pre-trained HAP model here. The results are given below; a sketch of loading the checkpoint follows the tables.

| task | dataset | resolution | structure | result |
| --- | --- | --- | --- | --- |
| Person ReID | MSMT17 | (256, 128) | ViT | 76.4 (mAP) |
| Person ReID | MSMT17 | (384, 128) | ViT | 76.8 (mAP) |
| Person ReID | MSMT17 | (256, 128) | ViT-lem | 78.0 (mAP) |
| Person ReID | MSMT17 | (384, 128) | ViT-lem | 78.1 (mAP) |
| Person ReID | Market-1501 | (256, 128) | ViT | 91.7 (mAP) |
| Person ReID | Market-1501 | (384, 128) | ViT | 91.9 (mAP) |
| Person ReID | Market-1501 | (256, 128) | ViT-lem | 93.8 (mAP) |
| Person ReID | Market-1501 | (384, 128) | ViT-lem | 93.9 (mAP) |

| task | dataset | resolution | training | result |
| --- | --- | --- | --- | --- |
| 2D Pose Estimation | MPII | (256, 192) | single-dataset | 91.8 (PCKh) |
| 2D Pose Estimation | MPII | (384, 288) | single-dataset | 92.6 (PCKh) |
| 2D Pose Estimation | MPII | (256, 192) | multi-dataset | 93.4 (PCKh) |
| 2D Pose Estimation | MPII | (384, 288) | multi-dataset | 93.6 (PCKh) |
| 2D Pose Estimation | COCO | (256, 192) | single-dataset | 75.9 (AP) |
| 2D Pose Estimation | COCO | (384, 288) | single-dataset | 77.2 (AP) |
| 2D Pose Estimation | COCO | (256, 192) | multi-dataset | 77.0 (AP) |
| 2D Pose Estimation | COCO | (384, 288) | multi-dataset | 78.2 (AP) |
| 2D Pose Estimation | AIC | (256, 192) | single-dataset | 31.5 (AP) |
| 2D Pose Estimation | AIC | (384, 288) | single-dataset | 37.7 (AP) |
| 2D Pose Estimation | AIC | (256, 192) | multi-dataset | 32.2 (AP) |
| 2D Pose Estimation | AIC | (384, 288) | multi-dataset | 38.1 (AP) |

| task | dataset | result |
| --- | --- | --- |
| Pedestrian Attribute Recognition | PA-100K | 86.54 (mA) |
| Pedestrian Attribute Recognition | RAP | 82.91 (mA) |
| Pedestrian Attribute Recognition | PETA | 88.36 (mA) |

| task | dataset | result |
| --- | --- | --- |
| Text-to-Image Person ReID | CUHK-PEDES | 68.05 (Rank-1) |
| Text-to-Image Person ReID | ICFG-PEDES | 61.80 (Rank-1) |
| Text-to-Image Person ReID | RSTPReid | 49.35 (Rank-1) |

| task | dataset | result |
| --- | --- | --- |
| 3D Pose Estimation | 3DPW | 90.1 (MPJPE), 56.0 (PA-MPJPE), 106.3 (MPVPE) |
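To fine-tune on a downstream task, the released checkpoint can be loaded into a plain ViT-Base encoder. A minimal sketch, assuming the checkpoint follows the MAE convention of storing weights under a "model" key and that decoder weights are dropped for encoder-only transfer (the filename here is hypothetical):

import timm
import torch

ckpt = torch.load("hap_pretrain_vit_base.pth", map_location="cpu")  # hypothetical filename
state = ckpt.get("model", ckpt)
state = {k: v for k, v in state.items()
         if not k.startswith(("decoder", "mask_token"))}            # encoder weights only

model = timm.create_model("vit_base_patch16_224", num_classes=0)
missing, unexpected = model.load_state_dict(state, strict=False)
print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")
# Note: for person-shaped inputs such as 256x128, positional embeddings
# generally need interpolation (cf. --ckpt_pos_embed 14 14 above).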

💗 Acknowledgement

We acknowledge the open-source projects this work builds on, including MAE, ViTPose, and TransReID-SSL.

✅ Citation

@article{yuan2023hap,
  title={HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception},
  author={Yuan, Junkun and Zhang, Xinyu and Zhou, Hao and Wang, Jian and Qiu, Zhongwei and Shao, Zhiyin and Zhang, Shaofeng and Long, Sifan and Kuang, Kun and Yao, Kun and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2023}
}

🤝 Contribute & Contact

Feel free to star and contribute to our repository.

If you have any questions or advice, contact us through GitHub issues or email (yuanjk0921@outlook.com).