SmoothNet: A Plug-and-Play Network for Refining Human Poses in Videos (ECCV 2022)

This repo is the official implementation of "SmoothNet: A Plug-and-Play Network for Refining Human Poses in Videos". [Paper] [Project]

Update

Support SmoothNet in MMPose Release v0.25.0 and MMHuman3D as a smoothing strategy!
Clean version is released!
To further improve SmoothNet as a near online smoothing strategy, we reduce the original window size 64 to 32 frames by default!
We also provide the pretrained models with the window size 8, 16, 32 and 64 frames here.

It currently includes code, data, log and models for the following tasks:

2D human pose estimation
3D human pose estimation
Body recovery via a SMPL model

Major Features

Model training and evaluation for 2D pose, 3D pose, and SMPL body representation
Supporting 6 popular datasets (AIST++, Human3.6M, Sub-JHMDB, MPI-INF-3DHP, MuPoTS-3D, 3DPW) and providing cleaned estimation results of 13 popular pose estimation backbones(SPIN, TCMR, VIBE, CPN, FCN, Hourglass, HRNet, RLE, VideoPose3D, TposeNet, EFT, PARE, SimplePose)

Description

When analyzing human motion videos, the output jitters from existing pose estimators are highly-unbalanced with varied estimation errors across frames. Most frames in a video are relatively easy to estimate and only suffer from slight jitters. In contrast, for rarely seen or occluded actions, the estimated positions of multiple joints largely deviate from the ground truth values for a consecutive sequence of frames, rendering significant jitters on them.

To tackle this problem, we propose to attach a dedicated temporal-only refinement network to existing pose estimators for jitter mitigation, named SmoothNet. Unlike existing learning-based solutions that employ spatio-temporal models to co-optimize per-frame precision and temporal smoothness at all the joints, SmoothNet models the natural smoothness characteristics in body movements by learning the long-range temporal relations of every joint without considering the noisy correlations among joints. With a simple yet effective motion-aware fully-connected network, SmoothNet improves the temporal smoothness of existing pose estimators significantly and enhances the estimation accuracy of those challenging frames as a side-effect. Moreover, as a temporal-only model, a unique advantage of SmoothNet is its strong transferability across various types of estimators and datasets. Comprehensive experiments on five datasets with eleven popular backbone networks across 2D and 3D pose estimation and body recovery tasks demonstrate the efficacy of the proposed solution. Our code and datasets are provided in the supplementary materials.

Results

SmoothNet is a plug-and-play post-processing network to smooth any outputs of existing pose estimators. To fit well across datasets, backbones, and modalities with lower MPJPE and PA-MPJPE, we provide THREE pre-trained models (Train on AIST-VIBE-3D, 3DPW-SPIN-3D, and H36M-FCN-3D) to handle all existing issues.

Please refer to our supplementary materials to check the cross-model validation in detail. Noted that all models can obtain lower and similar Accels than the compared backbone estimators. The differences are in MPJPEs and PA-MPJPEs.

Due to the temporal-only network without spatial modelings, SmoothNet is trained on 3D position representations only, and can be tested on 2D, 3D, and 6D representations, respectively.

3D Keypoint Results

Dataset	Estimator	MPJPE (Input/Output):arrow_down:	Accel (Input/Output):arrow_down:	Pretrain model
AIST++	SPIN	107.17/95.21	33.19/4.17	checkpoint / config
AIST++	TCMR*	106.72/105.51	6.4/4.24	checkpoint / config
AIST++	VIBE*	106.90/97.47	31.64/4.15	checkpoint / config
Human3.6M	FCN	54.55/52.72	19.17/1.03	checkpoint / config
Human3.6M	RLE	48.87/48.27	7.75/0.90	checkpoint / config
Human3.6M	TCMR*	73.57/73.89	3.77/2.79	checkpoint / config
Human3.6M	VIBE*	78.10/77.23	15.81/2.86	checkpoint / config
Human3.6M	Videopose(T=27)*	50.13/50.04	3.53/0.88	checkpoint / config
Human3.6M	Videopose(T=81)*	48.97/48.89	3.06/0.87	checkpoint / config
Human3.6M	Videopose(T=243)*	48.11/48.05	2.82/0.87	checkpoint / config
MPI-INF-3DHP	SPIN	100.74/92.89	28.54/6.54	checkpoint / config
MPI-INF-3DHP	TCMR*	92.83/88.93	7.92/6.49	checkpoint / config
MPI-INF-3DHP	VIBE*	92.39/87.57	22.37/6.5	checkpoint / config
MuPoTS	TposeNet*	103.33/100.78	12.7/7.23	checkpoint / config
MuPoTS	TposeNet+RefineNet*	93.97/91.78	9.53/7.21	checkpoint / config
3DPW	EFT	90.32/88.40	32.71/6.07	checkpoint / config
3DPW	EFT	90.32/86.39	32.71/6.30	checkpoint / config(additional training)
3DPW	PARE	78.91/78.11	25.64/5.91	checkpoint / config
3DPW	SPIN	96.85/95.84	34.55/6.17	checkpoint / config
3DPW	TCMR*	86.46/86.48	6.76/5.95	checkpoint / config
3DPW	VIBE*	82.97/81.49	23.16/5.98	checkpoint / config

2D Keypoint Results

Dataset	Estimator	MPJPE (Input/Output):arrow_down:	Accel (Input/Output):arrow_down:	Pretrain model
Human3.6M	CPN	6.67/6.45	2.91/0.14	checkpoint / config
Human3.6M	Hourglass	9.42/9.25	1.54/0.15	checkpoint / config
Human3.6M	HRNet	4.59/4.54	1.01/0.13	checkpoint / config
Human3.6M	RLE	5.14/5.11	0.9/0.13	checkpoint / config

SMPL Results

Dataset	Estimator	MPJPE (Input/Output):arrow_down:	Accel (Input/Output):arrow_down:	Pretrain model
AIST++	SPIN	107.72/103.00	33.21/5.72	checkpoint / config
AIST++	TCMR*	106.95/106.39	6.47/4.68	checkpoint / config
AIST++	VIBE*	107.41/102.06	31.65/5.95	checkpoint / config
3DPW	EFT	91.60/89.57	33.38/7.89	checkpoint / config
3DPW	PARE	79.93/78.68	26.45/6.31	checkpoint / config
3DPW	SPIN	99.28/97.81	34.95/7.40	checkpoint / config
3DPW	TCMR*	88.46/88.37	7.12/6.52	checkpoint / config
3DPW	VIBE*	84.27/83.14	23.59/7.24	checkpoint / config

* means the used pose estimators are using temporal information.
The usage of SmoothNet for better performance: a SOTA single-frame estimator (e.g., PARE) + SmoothNet
Since TCMR uses a sliding window method to smooth the poses, which causes over-smoothness issue, SmoothNet will be hard to further decrease the MPJPE, PA-MPJPE.

Getting Started

Environment Requirement

SmoothNet has been implemented and tested on Pytorch 1.10.1 with python >= 3.6. It supports both GPU and CPU inference.

Clone the repo:

git clone https://github.com/cure-lab/SmoothNet.git

We recommend you prepare the environment using conda:

# conda
source scripts/install_conda.sh

Prepare Data

All the data used in our experiment can be downloaded here.

Google Drive

Baidu Netdisk

The sructure of the repository should look like this:

|-- configs
    |-- aist_vibe_3D.yaml
    |-- ...
|-- data
    |-- checkpoints         # pretrained checkpoints
    |-- poses               # cleaned detected poses and groundtruth poses
    |-- smpl                # SMPL parameters
|-- lib
    |-- core
        |-- ...
    |-- dataset
        |-- ...
    |-- models
        |-- ...
    |-- utils
        |-- ...
|-- results                 # folders including log files, checkpoints, running config and tensorboard logs
|-- scripts
    |-- install_conda.sh
|-- eval_smoothnet.py       # SmoothNet evaluation
|-- train_smoothnet.py      # SmoothNet training
|-- README.md
|-- LICENSE
|-- requirements.txt

If you want to add your own dataset, please follow these steps (noted that this is also how the provided data is organized):

Organize your data into corresponding type according to the body representation. The file structure is shown as follows:
```
|-- [your dataset]\_[estimator]\_[2D/3D/smpl]
    |-- detected
        |-- [your dataset]\_[estimator]\_[2D/3D/smpl]_test.npz
        |-- [your dataset]\_[estimator]\_[2D/3D/smpl]_train.npz
    |-- groundtruth
        |-- [your dataset]\_gt\_[2D/3D/smpl]_test.npz
        |-- [your dataset]\_gt\_[2D/3D/smpl]_train.npz
```
It is fine if you only have training or testing data. The content in each .npz file is consist of "imgname" and "human poses", which is related to the body representation you use.
- 3D keypoints:
  - imgname: Strings containing the image and sequence name with format [sequence_name]/[image_name](string "" if the sequence_name and image_name not available).
  - keypoints_3d: 3D joint position. The shape of each sequence is corresponding_sequence_length*(keypoints_number*3). The order of it is the same with imgname
- 2D keypoints
  - imgname: Strings containing the image and sequence name with format [sequence_name]/[image_name](string "" if the sequence_name and image_name not available).
  - keypoints_2d: 2D joint position. The shape of each sequence is corresponding_sequence_length*(keypoints_number*2). The order of it is the same with imgname
- SMPL
  - imgname: Strings containing the image and sequence name with format [sequence_name]/[image_name](string "" if the sequence_name and image_name not available).
  - pose: pose parameters. The shape of each sequence is corresponding_sequence_length*72. The order of it is the same with imgname
  - shape: shape parameters. The shape of each sequence is corresponding_sequence_length*10. The order of it is the same with imgname
If you use 3D keypoints as the body representation, add the root of all keypoints cfg.DATASET.ROOT_[your dataset]_[estimator]_3D in evaluate_config.py, train_config.py or visualize_config.py according to your purpose(test, train or visualize).
Construct your own dataset following the existing dataset files. You might need to modify the detailed implementation depending on your data characteristics.

Training

Run the commands below to start training:

python train_smoothnet.py --cfg [config file] --dataset_name [dataset name] --estimator [backbone estimator you use] --body_representation [smpl/3D/2D] --slide_window_size [slide window size]

For example, you can train on 3D representation of Human3.6M using backbone estimator FCN with silde window size 8 by:

python train_smoothnet.py --cfg configs/h36m_fcn_3D.yaml --dataset_name h36m --estimator fcn --body_representation 3D --slide_window_size 8

You can easily train on multiple datasets using "," to split multiple datasets / estimator / body representation. For example, you can train on AIST++ - VIBE - 3D and 3DPW - SPIN - 3D with silde window size 8 by:

python train_smoothnet.py --cfg configs/h36m_fcn_3D.yaml --dataset_name aist,pw3d --estimator vibe,spin --body_representation 3D,3D  --slide_window_size 8

Note that the training and testing datasets should be downloaded and prepared before training.

Evaluation

Run the commands below to start evaluation:

python eval_smoothnet.py --cfg [config file] --checkpoint [pretrained checkpoint] --dataset_name [dataset name] --estimator [backbone estimator you use] --body_representation [smpl/3D/2D] --slide_window_size [slide window size] --tradition [savgol/oneeuro/gaus1d]

For example, you can evaluate MPI-INF-3DHP - TCMR - 3D and MPI-INF-3DHP - VIBE - 3D using SmoothNet trained on 3DPW - SPIN - 3D with silde window size 8, and compare the results with traditional filters oneeuro by:

python eval_smoothnet.py --cfg configs/pw3d_spin_3D.yaml --checkpoint data/checkpoints/pw3d_spin_3D/checkpoint_8.pth.tar --dataset_name mpiinf3dhp,mpiinf3dhp --estimator tcmr,vibe --body_representation 3D,3D --slide_window_size 8 --tradition oneeuro

Note that the pretrained checkpoints and testing datasets should be downloaded and prepared before evaluation.

The data and checkpoints used in our experiment can be downloaded here.

Google Drive

Baidu Netdisk

Visualization

Here, we only provide demo visualization based on offline processed detected poses of specific datasets(e.g. AIST++, Human3.6M, and 3DPW). To visualize on arbitrary given video, please refer to the inference/demo of MMHuman3D.

un the commands below to start evaluation:

python visualize_smoothnet.py --cfg [config file] --checkpoint [pretrained checkpoint] --dataset_name [dataset name] --estimator [backbone estimator you use] --body_representation [smpl/3D/2D] --slide_window_size [slide window size] --visualize_video_id [visualize sequence id] --output_video_path [visualization output video path]

For example, you can visualize the second sequence of 3DPW - SPIN - 3D using SmoothNet trained on 3DPW - SPIN - 3D with silde window size 32, and output the video to ./visualize by:

python visualize_smoothnet.py --cfg configs/pw3d_spin_3D.yaml --checkpoint data/checkpoints/pw3d_spin_3D/checkpoint_8.pth.tar --dataset_name pw3d --estimator spin --body_representation 3D --slide_window_size 32 --visualize_video_id 2 --output_video_path ./visualize

Citing SmoothNet

If you find this repository useful for your work, please consider citing it as follows:

@inproceedings{zeng2022smoothnet,
      title={SmoothNet: A Plug-and-Play Network for Refining Human Poses in Videos},
      author={Zeng, Ailing and Yang, Lei and Ju, Xuan and Li, Jiefeng and Wang, Jianyi and Xu, Qiang},
      booktitle={European Conference on Computer Vision},
      year={2022},
      organization={Springer}
}

Please remember to cite all the datasets and backbone estimators if you use them in your experiments.

License

This code is available for non-commercial scientific research purposes as defined in the LICENSE file. By downloading and using this code you agree to the terms in the LICENSE. Third-party datasets and software are subject to their respective licenses.

Li-Lai/SmoothNet