World-Grounded Human Motion Recovery via Gravity-View Coordinates
Zehong Shen*, Huaijin Pi*, Yan Xia, Zhi Cen, Sida Peng†, Zechen Hu, Hujun Bao, Ruizhen Hu, Xiaowei Zhou
SIGGRAPH Asia 2024
Please see installation for details. Do not install this repo as a package; doing so causes errors when importing other modules.
Install hamer and link the vitpose-wholebody checkpoint in hamer (./_DATA/vitpose_ckpts/vitpose+_huge/wholebody.pth) to this repo at ./inputs/checkpoints/vitpose/vitpose-h-coco-wholebody.pth.
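The linking step can also be scripted. The following is a minimal sketch, not part of the repository: `link_checkpoint` is a hypothetical helper name, and the location of your hamer checkout is an assumption you need to fill in yourself.

```python
import os
from pathlib import Path


def link_checkpoint(src, dst):
    """Create a symlink at `dst` that points to the checkpoint file `src`.

    Hypothetical helper (not part of GVHMR or hamer). Parent directories of
    `dst` are created if missing; a stale symlink at `dst` is replaced, but a
    regular file at `dst` is never overwritten.
    """
    src = Path(src).resolve()
    dst = Path(dst)
    if not src.is_file():
        raise FileNotFoundError(f"checkpoint not found: {src}")
    dst.parent.mkdir(parents=True, exist_ok=True)
    if dst.is_symlink():
        dst.unlink()  # replace an existing (possibly stale) link
    elif dst.exists():
        raise FileExistsError(f"refusing to overwrite regular file: {dst}")
    os.symlink(src, dst)
    return dst


# Example usage (the hamer path below is an assumption, adjust it):
# hamer_root = Path("/path/to/hamer")
# link_checkpoint(
#     hamer_root / "_DATA/vitpose_ckpts/vitpose+_huge/wholebody.pth",
#     Path("inputs/checkpoints/vitpose/vitpose-h-coco-wholebody.pth"),
# )
```

A symlink avoids duplicating the checkpoint file on disk; copying the file instead would work just as well.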
Demo entries are provided in tools/demo. Use -s to skip visual odometry if you know the camera is static; otherwise the camera motion will be estimated by DPVO.
We also provide a script, demo_folder.py, to run inference on an entire folder.
python -m tools.demo.demo --video=docs/example_video/tennis.mp4 -s
python -m tools.demo.demo_folder -f inputs/demo/folder_in -d outputs/demo/folder_out -s
python -m tools.demo.demo_multiperson --video=docs/example_video/two_persons.mp4 --output_root outputs/demo_mp --batch_size 64 --export_npy
python -m tools.demo.demo_multiperson --video=docs/example_video/vertical_dance.mp4 --output_root outputs/demo_mp -s
- Test: To reproduce the 3DPW, RICH, and EMDB results in a single run, use the following command:
python tools/train.py global/task=gvhmr/test_3dpw_emdb_rich exp=gvhmr/mixed/mixed ckpt_path=inputs/checkpoints/gvhmr/gvhmr_siga24_release.ckpt
To test individual datasets, change global/task to gvhmr/test_3dpw, gvhmr/test_rich, or gvhmr/test_emdb.
- Train: To train the model, use the following command:
# The gvhmr_siga24_release.ckpt is trained with 2x4090 for 420 epochs, note that different GPU settings may lead to different results.
python tools/train.py exp=gvhmr/mixed/mixed
Note that evaluation during training does not apply the post-processing used in the test script, so the global metrics will differ (but should still be good for comparison with baseline methods).
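For the test runs described earlier in this section, the per-dataset commands differ only in the global/task override. The sketch below builds them programmatically; `build_test_command` is a hypothetical helper, and the only values it uses are the task names and paths quoted in this README.

```python
import shlex
from typing import List

# Values copied from the test instructions in this README.
CKPT = "inputs/checkpoints/gvhmr/gvhmr_siga24_release.ckpt"
TASKS = ("gvhmr/test_3dpw", "gvhmr/test_rich", "gvhmr/test_emdb")


def build_test_command(task: str, ckpt_path: str = CKPT) -> List[str]:
    """Return the argument list for testing a single dataset.

    Hypothetical helper: it only assembles the command shown in the README
    with a different global/task value; it does not run anything.
    """
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}, expected one of {TASKS}")
    return [
        "python",
        "tools/train.py",
        f"global/task={task}",
        "exp=gvhmr/mixed/mixed",
        f"ckpt_path={ckpt_path}",
    ]


if __name__ == "__main__":
    # Print one shell-ready command per dataset.
    for task in TASKS:
        print(shlex.join(build_test_command(task)))
```

Returning an argument list rather than a string means the result can be passed directly to `subprocess.run` without shell quoting issues.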
This version of the repository includes modifications to support multi-person HMR:
- Multi-person tracking:
  - Updated the Tracker class to return bounding boxes for multiple people using get_all_tracks instead of get_one_track.
  - Modified preprocessing to handle multiple person detections and features.
- Multi-person pose estimation:
  - Adapted the VitPoseExtractor to process multiple people simultaneously.
  - Updated the feature extraction process to handle batches of multiple people.
- Multi-person SMPL reconstruction:
  - Modified the DemoPL class to predict SMPL parameters for multiple people.
  - Updated the rendering process to handle multiple SMPL models in both in-camera and global coordinate systems.
- Rendering improvements:
  - Implemented merged faces creation for rendering multiple SMPL models simultaneously.
  - Added support for retargeting global translations to better align with in-camera positions.
- New demo script:
  - Added demo_multiperson.py to showcase the multi-person reconstruction pipeline.
  - Includes options for batch processing and verbose output for debugging.
- Performance optimizations:
  - Introduced batch processing for VitPose and feature extraction to improve efficiency.
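The merged-faces idea mentioned under rendering improvements can be illustrated as follows. This is a minimal NumPy sketch of the general technique (concatenate all vertices, then offset each copy of the face indices), not the code used in this repository; `merge_meshes` is a hypothetical name, and it assumes every person shares the same mesh topology, as SMPL meshes do.

```python
import numpy as np


def merge_meshes(verts, faces):
    """Merge P meshes with shared topology into a single mesh.

    verts: array of shape (P, V, 3), one vertex set per person.
    faces: integer array of shape (F, 3), triangle indices into one vertex set.
    Returns (merged_verts, merged_faces) with shapes (P*V, 3) and (P*F, 3).
    """
    verts = np.asarray(verts)
    faces = np.asarray(faces)
    num_people, num_verts, _ = verts.shape
    # Person p's vertices start at row p * V of the flattened vertex array,
    # so that person's face indices must be shifted by the same amount.
    offsets = (np.arange(num_people) * num_verts)[:, None, None]  # (P, 1, 1)
    merged_faces = (faces[None] + offsets).reshape(-1, 3)
    merged_verts = verts.reshape(-1, 3)
    return merged_verts, merged_faces
```

With a single merged mesh, a renderer can draw all people in one call instead of one call per person.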
- /preprocess/bbx.pt:
  - Contains bounding box information for multiple people
  - bbx_xyxy: Tensor of shape (P, L, 4), where P is the number of people and L is the number of frames
  - bbx_xys: Tensor of shape (P, L, 3), containing center coordinates and scale for each bounding box
- /preprocess/slam_results.pt:
  - Camera pose estimation results (if not using static camera)
  - NumPy array of shape (L, 7), where each row contains [x, y, z, qx, qy, qz, qw]
- /preprocess/vitpose.pt:
  - 2D pose estimation results
  - Tensor of shape (P, L, 17, 3), where 17 is the number of keypoints and 3 represents [x, y, confidence]
- /preprocess/vit_features.pt:
  - Image features extracted from the video frames
  - Tensor of shape (P, L, 1024), where 1024 is the feature dimension
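For downstream use, the (L, 7) camera poses in slam_results.pt are often more convenient as 4x4 homogeneous matrices. The sketch below converts rows of [x, y, z, qx, qy, qz, qw] with the standard unit-quaternion formula. `poses_to_matrices` is a hypothetical helper, and whether the poses map camera to world or world to camera is not stated here, so the sketch makes no assumption about direction.

```python
import numpy as np


def poses_to_matrices(poses):
    """Convert (L, 7) rows [x, y, z, qx, qy, qz, qw] to (L, 4, 4) transforms."""
    poses = np.asarray(poses, dtype=np.float64)
    if poses.ndim != 2 or poses.shape[1] != 7:
        raise ValueError(f"expected shape (L, 7), got {poses.shape}")
    t = poses[:, :3]
    q = poses[:, 3:]
    # Normalize so small numerical drift does not distort the rotation.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    x, y, z, w = q[:, 0], q[:, 1], q[:, 2], q[:, 3]
    # Standard rotation matrix of a unit quaternion, written row by row.
    rot = np.stack(
        [
            1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w),
            2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w),
            2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y),
        ],
        axis=1,
    ).reshape(-1, 3, 3)
    mats = np.tile(np.eye(4), (len(poses), 1, 1))
    mats[:, :3, :3] = rot
    mats[:, :3, 3] = t
    return mats
```

The function takes any array-like input, so the array loaded from the file can be passed in directly.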
The main reconstruction results are stored in hmr4d_results.pt, which contains the following keys:
- smpl_params_global and smpl_params_incam:
  - SMPL parameters for global and in-camera coordinate systems
  - Each contains:
    - body_pose: Tensor of shape (P, L, 21, 3, 3)
    - betas: Tensor of shape (P, L, 10)
    - global_orient: Tensor of shape (P, L, 1, 3, 3)
    - transl: Tensor of shape (P, L, 3)
- K_fullimg:
  - Camera intrinsic matrix
  - Tensor of shape (L, 3, 3), same across all frames
- focal_length, width, height:
  - Focal length, width, and height of the video frames
  - Tensor of shape (L,)
- net_outputs:
  - Additional network outputs (not used for now)
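When consuming these results, it can help to check the documented shapes and slice out one person at a time. The following is a minimal sketch under the shapes listed above; `check_smpl_params` and `select_person` are hypothetical helpers, and they rely only on `.shape` and integer indexing, so they should work on both PyTorch tensors and NumPy arrays.

```python
import numpy as np

# Trailing dimensions after the leading (P, L), as documented above.
EXPECTED_TRAILING = {
    "body_pose": (21, 3, 3),
    "betas": (10,),
    "global_orient": (1, 3, 3),
    "transl": (3,),
}


def check_smpl_params(params):
    """Validate one SMPL parameter dict and return (P, L).

    `params` is expected to look like smpl_params_global or
    smpl_params_incam: a mapping from key to an array of shape (P, L, ...).
    """
    lead = None
    for key, trailing in EXPECTED_TRAILING.items():
        shape = tuple(params[key].shape)
        if len(shape) != 2 + len(trailing) or shape[2:] != trailing:
            raise ValueError(f"{key}: unexpected shape {shape}")
        if lead is None:
            lead = shape[:2]
        elif shape[:2] != lead:
            raise ValueError(f"{key}: leading dims {shape[:2]} != {lead}")
    return lead


def select_person(params, person):
    """Return the parameters of one person, each with shape (L, ...)."""
    return {key: params[key][person] for key in EXPECTED_TRAILING}


if __name__ == "__main__":
    # Stand-in data with P=2 people and L=5 frames.
    demo = {k: np.zeros((2, 5) + t) for k, t in EXPECTED_TRAILING.items()}
    print(check_smpl_params(demo))
```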
The smpl_params_global parameters of different people all start from the same origin. To visualize the results, I retarget the global translations based on the first frame of the smpl_params_incam parameters.
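One simple way to do such retargeting is a per-person constant offset taken from the first frame. The sketch below illustrates that idea only and is not necessarily what this repository does. `retarget_global_transl` is a hypothetical helper, and the optional `cam_to_global` rotation is an assumption: with the identity default, camera and global axes are treated as aligned in the first frame, which generally does not hold exactly.

```python
import numpy as np


def retarget_global_transl(transl_global, transl_incam, cam_to_global=None):
    """Shift each person's global trajectory by a constant first-frame offset.

    transl_global, transl_incam: arrays of shape (P, L, 3).
    cam_to_global: optional (3, 3) rotation taking first-frame camera
        directions to global directions. ASSUMPTION: identity if omitted.
    Person 0 is left unchanged; every other person is placed relative to
    person 0 according to their in-camera positions in the first frame.
    """
    g = np.asarray(transl_global, dtype=np.float64)
    c = np.asarray(transl_incam, dtype=np.float64)
    if g.shape != c.shape or g.ndim != 3 or g.shape[-1] != 3:
        raise ValueError("expected two arrays of identical shape (P, L, 3)")
    rot = np.eye(3) if cam_to_global is None else np.asarray(cam_to_global)
    # First-frame position of each person relative to person 0, in camera
    # coordinates, rotated into the global frame.
    rel = (c[:, 0] - c[0, 0]) @ rot.T  # (P, 3)
    # Constant offset that moves each first-frame global position to
    # person 0's first-frame global position plus that relative placement.
    offset = g[0, 0] + rel - g[:, 0]  # (P, 3)
    return g + offset[:, None, :]
```

Because the offset is constant over time, each person's global motion is preserved; only the starting positions change.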
If you find this code useful for your research, please use the following BibTeX entry.
@inproceedings{shen2024gvhmr,
title={World-Grounded Human Motion Recovery via Gravity-View Coordinates},
author={Shen, Zehong and Pi, Huaijin and Xia, Yan and Cen, Zhi and Peng, Sida and Hu, Zechen and Bao, Hujun and Hu, Ruizhen and Zhou, Xiaowei},
booktitle={SIGGRAPH Asia Conference Proceedings},
year={2024}
}
We thank the authors of WHAM, 4D-Humans, and ViTPose-Pytorch for their great work, without which this project and its code would not be possible.