
Fast 3D Human Pose Estimation

Introduction

This is a PyTorch implementation of the method from Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation, applied to stereo images to reconstruct human poses in the 3D world. We also compare it with a naive approach based on Simple Baselines for Human Pose Estimation and Tracking, which uses an encoder-decoder structure to predict the 2D pose from each view. We evaluate performance using the Mean Per Joint Position Error (MPJPE) metric in both 2D and 3D scenarios. Additionally, we employ data augmentation techniques that mask out a small block on the human in the images, following methods such as Cutout and Hide-and-Seek, to improve the accuracy of the models.
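
For reference, here is a minimal sketch of the MPJPE metric, assuming predicted and ground-truth poses are tensors of shape (batch, num_joints, dim) with dim = 2 or 3; the names are illustrative, not the repo's actual evaluation code:

import torch

def mpjpe(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # Mean Per Joint Position Error: Euclidean distance per joint,
    # averaged over joints and batch.
    return torch.norm(pred - gt, dim=-1).mean()

# Works for both the 2D (dim=2) and 3D (dim=3) scenarios.
pred, gt = torch.randn(4, 17, 3), torch.randn(4, 17, 3)
print(mpjpe(pred, gt))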

Contribution

  • We implement and extend the method Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation from scratch with slight modifications and apply it to stereo reconstruction tasks.
  • We find that random masking data augmentation strategies can ease self-occlusion and improve MPJPE to some extent (see the sketch after this list).
  • We experiment with different tricks (different loss functions, gradient clipping) to further improve both the accuracy and the stability of the training process.
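
A minimal sketch of the Cutout-style random masking and the gradient clip mentioned above, assuming (C, H, W) image tensors; random_mask and the training-step names are illustrative assumptions, not the repo's API:

import torch

def random_mask(img: torch.Tensor, size: int = 32) -> torch.Tensor:
    # Zero out one random square block of a (C, H, W) image (Cutout-style).
    _, h, w = img.shape
    y = int(torch.randint(0, h - size + 1, (1,)))
    x = int(torch.randint(0, w - size + 1, (1,)))
    img = img.clone()
    img[:, y:y + size, x:x + size] = 0.0
    return img

# Inside a hypothetical training step, clip gradients before the update:
# loss.backward()
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# optimizer.step()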

Dataset

We pretrain our model on the MPII Dataset, which includes around 25K images containing over 40K people with annotated body joints. We then fine-tune on the stereo data from the MADS Dataset, which consists of martial arts actions (Tai-chi and Karate), dancing actions (hip-hop and jazz), and sports actions (basketball, volleyball, football, rugby, tennis and badminton). Two martial arts masters, two dancers and an athlete performed these actions while being recorded with either multiple cameras or a stereo depth camera.

Please download the data and arrange it into this pattern:

Your_WorkingSpace/
├── ...
├── ...
└── data/
    ├── MADS_depth
    └── MADS_multiview

And run the code to extract the training/validation data:

$ python extract_data.py

Train

CDRNET

Run the following cmd to train the CDRNET.

$ python train_cdr.py

Note: You need the backbone weights (see the "Weights" section below) before training the CDRNET.
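
For intuition on the 3D lifting step, here is a minimal two-view DLT triangulation sketch in NumPy, assuming 3x4 projection matrices and one matched 2D point per view; it illustrates classic triangulation only, not CDRNET's learned, camera-disentangled pipeline:

import numpy as np

def triangulate(P1, P2, x1, x2):
    # Direct Linear Transform: find the 3D point that best satisfies
    # x1 ~ P1 @ X and x2 ~ P2 @ X in the least-squares (algebraic) sense.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # de-homogenize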

Backbone

Run the following cmd to train your customized ResNet backbone:

$ python train.py

Inference

Run the following cmd after extracting data:

$ bash scripts/inference.sh

Weights

You can download the weights via the link.

You should keep the weights in this structure to run the inference.

Your_WorkingSpace/
├── ...
├── ...
└── weights/
    ├── mads_3d_256_101
    │   └── best.pth
    └── mpii_256_101
        └── latest.pth

Results

Best: HipHop_best, Sports_best (qualitative result visualizations)

Baseline: HipHop_base, Sports_base (qualitative result visualizations)

References

CDRNet
DiffDLT
learnable-triangulation-pytorch
R-YOLOv4