REC-MV: REconstructing 3D Dynamic Cloth from Monocular Videos (CVPR2023)

Project Page | Paper

This is the official PyTorch implementation of REC-MV.

TODO:triangular_flag_on_post:

Requirements

Python 3 Pytorch3d (0.4.0, some compatibility issues may occur in higher versions of pytorch3d)

PyTorch<=1.10.2

pytorch-scatter==2.0.9

Note: A GTX 3090 is recommended to run REC-MV, make sure enough GPU memory if using other cards.

Install

conda env create REC-MV
conda activate REC-MV
pip install -r requirements.txt
bash install.sh

It is recommended to install pytorch3d 0.4.0 from source.

wget -O pytorch3d-0.4.0.zip https://github.com/facebookresearch/pytorch3d/archive/refs/tags/v0.4.0.zip
unzip pytorch3d-0.4.0.zip
cd pytorch3d-0.4.0 && python setup.py install && cd ..

To download the SMPL models from here and move pkls to smpl_pytorch/model.

To download DeepFashion3D templates with feature curve labels from here and move this folder to ../smpl_clothes_template

A Gentle Introduction

Reconstructing dynamic 3D garment surfaces with open boundaries from monocular videos is an important problem as it provides a practical and low-cost solution for clothes digitization. Recent neural rendering methods achieve high-quality dynamic clothed human reconstruction results from monocular video, but these methods cannot separate the garment surface from the body. To address the above limitations, in this paper, we formulate this task as an optimization problem of 3D garment feature curves and surface reconstruction from monocular video. We introduce a novel approach, called REC-MV, to jointly optimize the explicit feature curves and the implicit signed distance field (SDF) of the garments. Then the open garment meshes can be extracted via garment template registration in the canonical space.

Preprocess Datasets

Click to expand

SMPL Prediction

The preprocessing of dataset is described here. If you want to optimize your own data, you can run VideoAvatar or TCMR to get the initial SMPL estimation. Surely, you can use your own SMPL initialization and normal prediction method then use REC-MV to reconstruct.

Normal map Prediction

To enable our normal optimization, you have to install PIFuHD and Lightweight Openpose in your $ROOT1 and $ROOT2 first. Then copy generate_normals.py and generate_boxs.py to $ROOT1 and $ROOT2 seperately, and run the following code to extract normals before running REC-MV:

cd $ROOT2
python generate_boxs.py --data $ROOT/video-category/imgs
cd $ROOT1
python generate_normals.py --imgpath $ROOT/video-category/imgs

Parsing Foreground Mask

We utilize a awesome RobustVideoMatting to parsing human mask from monocular Videos.

Parsing Garment Semantic label.

Self-Correction-Human-Parsing is employed to segment garment labels. Note that we find ATR pretrained weight is better than other checkpoints, so we suggest you to load the ATR checkpoint.

Initialized Voxel Skinning Weights.

To better model skirts or dresses skinning weights, we apply fite to diffuse skinning weights into the whole voxel space. Specifically, we initialized skinning weights as the step1 said(Link)

The following commands give you an example to obtain PeopleSnapshot diffused skinning weights.

#!/bin/bash
#! example for processing people snapshot
# name_list=( female-3-casual female-3-sport female-4-casual female-6-plaza female-7-plaza )
name_list=( male-2-casual )
for name in ${name_list[@]}; do
    python -m step1_diffused_skinning.compute_diffused_skinning --config configs/step1/${name}.yaml
done

# clean tmp files
rm -rf ./data_tmp_constraints/*
rm -rf ./data_tmp_skinning_grid/*

Preprocess Datasets from Ours.

We provide links to the datas we have already processed

OneDrive

Baidu Drive

The dataset folder is like the following example:

xxx_tic/
├── xxx_tic_diffused -> /data4/lingtengqiu/REC-MV/CUHKszCAP/xxx_tic_diffused/
├── xxx_tic_large_pose -> /data4/lingtengqiu/REC-MV/CUHKszCAP/xxx_tic_large_pose/
├── xxx_tic_tcmr_output.pkl
├── camera.npz
├── diffused_skinning_weights.npy
├── featurelines -> ../featurelines/xxx_tic
├── imgs
│   ├── 000000.jpg
│   ├── 000000_rect.txt
│   ├── 000001.jpg
│   ├── 000001_rect.txt
├── mask2fl
│   ├── 000000.json
│   ├── 000015.json
├── masks
│   ├── 000000.png
│   ├── 000001.png
├── normals
│   ├── 000000.png
│   ├── 000001.png
├── parsing_SCH_ATR
│   ├── 000000.npy
│   ├── 000000.png
│   ├── mask_parsing_000000.npy
├── result
│   ├── config.conf
│   └── debug
└── smpl_rec.npz

Preprocess your Dataset

The following commands give you a guidance to process your videos.

# An example to guide you to process your videos. Assuming you video data putted into folder, namely female_large_pose.
# and processed data folder denoted as female_large_pose_process_new.


# 1.data prepare
# e.g. openpose detection
bash ./scripts/openpose_predict.sh ./female_large_pose/${video-name}/ ./female_large_pose/${video-name}/joints2d
# parsing human mask
# parsing a human mask by RobustVideoMatting
# >>>>> https://github.com/PeterL1n/RobustVideoMatting
#e.g.
bash ./scripts/matting_video.sh 2 ./female_large_pose/${video-name}/imgs/ ./female_large_pose/${video-name}/body_masks ./female_large_pose/${video-name}/masks
# e.g. resize images to  [1080, 1080]

bash ./scripts/resize_video_imgs.sh ${video-name}
# video-avatar predict smpl results
# path: ~/cvpr2023/REC-MV/lib/videoavatars
# e.g.
bash ./scripts/build_large_pose.sh ${video-name}

# process videoavatar data to current datasets
python tools/people_aposefemale_process.py --root ~/cvpr2023/REC-MV/lib/videoavatars/datasets/${video-name}/ --save_root ./female_large_pose_process_new/${video-name}/

# prdice bbox
cd lib/lightweight-human-pose-estimation.pytorch/
python generate_boxs.py --data ./female_large_pose_process_new/${video-name}/imgs/
# predict normal map from pifuhd
cd -
cd lib/pifuhd/
python ./generate_normals.py --imgpath ./female_large_pose_process_new/${video-name}/imgs/
# predict TCMR, joints to optimize beta at the begining
bash ./scripts/get_smpl_from_video.sh ./$raw_video.mp4$ ${gpu_id}

#2. parsing garment mask
# detect or label 2d curve points from the key frames.
# e.g.
bash ./scripts/parsing_mask.sh 0 ./configs/female_large_pose_process_new/${video-name}.conf ./female_large_pose_process_new/${video-name}/
# parsing fl from two key points
python ./tools/parsing_mask_to_fl.py --parsing_type ATR  --input_path ./female_large_pose_process_new/${video-name}/ --output_path ./female_large_pose_process_new/${video-name}/mask2fl

#3. training
bash ./scripts/female_large_pose_process_new/${video-name}.sh ${gpu_id} ${save_folder} ${wandb_logger_name}

Huggingface preprocess

We build huggingface model to help ones to obtain videos' training data.

Demo

Download the pretrained weights for self-rotated (Onedrive/Baidu Drive) and large motion(Onedrive/Baidu Drive) into CUHKszCap-L/anran_tic (Note that you need download CUHKszCap-L first).

Download CUHKszCap-L, its checkpoints soft link to working space, like:

├── anran_tic_large_pose (large_pose checkpoint)
├── anran_tic_self_rotated (self-rotated checkpoint)
├── featurelines -> ../featurelines/anran_tic
├── imgs
├── mask2fl
├── masks
├── normals
├── parsing_SCH_ATR
└── result
    └── debug

Run the following code to generate garment meshes from monocular videos.

# Download CUHKszCap-L, its checkpoints soft link to working space
ln -s ../xxx gap-female-largepose 
# 1.self-rotated garment capturing
# bash ./scripts/large_pose/test_large_pose_A_fl.sh ${gpu_id} ${subject_name} ${checkpoint name} 
bash ./scripts/large_pose/test_large_pose_A_fl.sh 2 anran_tic anran_tic_self_rotated
# 2.large-motion garment capturing
# ${gpu_id} ${subject_name} ${checkpoint name} 
bash ./scripts/large_pose/test_large_pose_A_fl.sh 2 anran_tic anran_tic_self_rotated
# generating video
# define pngs to video function
encodepngffmpeg()
{
	# $1: target folder
	# $2: save video name
    rm -rf ${2}
    ffmpeg -r ${1} -pattern_type glob -i '*.png' -vcodec libx264 -crf 18 -vf "pad=ceil(iw/2)*2:ceil(ih/2)*2" -pix_fmt yuv420p ${2}
}

# 3. producing self-rotated video
cd ./gap-female-largepose/anran_tic/anran_tic_self_rotated/colors
encodepngffmpeg 30 ./demo.mp4

# producing large-pose video
cd ./gap-female-largepose/anran_tic/anran_tic_large_pose/colors
encodepngffmpeg 30 ./demo.mp4

Training

For training phase.

The work direction likes the following folder tree:

├── a_pose_female_process -> ../../a_pose_female_process/
├── configs
│   ├── female_large_pose
│   ├── gap-female
│   ├── people_snapshot
│   └── sythe
├── dataset
│   └── __pycache__
├── debug
│   ├── register
│   └── smpl_beta
│       └── imgs
├── engineer
│   ├── core
│   │   └── __pycache__
│   ├── networks
│   │   └── __pycache__
│   ├── optimizer
│   │   └── __pycache__
│   ├── registry
│   ├── utils
│   │   └── __pycache__
│   └── visualizer
│       └── __pycache__
├── FastMinv
│   ├── build
│   │   ├── bdist.linux-x86_64
│   │   ├── lib.linux-x86_64-3.8
│   │   └── temp.linux-x86_64-3.8
│   ├── dist
│   └── FastMinv.egg-info
├── logs -> ../logs/
├── MCAcc
│   ├── cuda
│   │   ├── build
│   │   │   ├── bdist.linux-x86_64
│   │   │   ├── lib.linux-x86_64-3.8
│   │   │   └── temp.linux-x86_64-3.8
│   │   ├── dist
│   │   └── interplate.egg-info
│   └── __pycache__
├── MCGpu
│   ├── build
│   │   ├── bdist.linux-x86_64
│   │   ├── lib.linux-x86_64-3.8
│   │   └── temp.linux-x86_64-3.8
│   ├── dist
│   └── MCGpu.egg-info
├── model
│   └── __pycache__
├── people_snapshot_public_proprecess -> ../people_snapshot_public_proprecess/
├── preprocess
├── scripts
│   ├── gap-female
│   ├── large_pose
│   ├── people_snapshot
│   ├── preprocess
│   └── sythe
├── smpl_pytorch -> ../smpl_pytorch/
├── tools
└── utils
    └── __pycache__

Run the following code to fitting garment meshes from monocular videos.

# Training from scratch: self-rotated videos.

## Peoplesnapshot
### softlink preprocess peoplesnapshot dataset in ./REC-MV
ln -s ../people_snapshot_public_proprecess/ ./
### e.g. training codes
bash ./scripts/people_snapshot/train_female-3-casual.sh 0 ${exp_name} ${wandb name}

## Gap-Female
bash ./scripts/gap-female/train_anran_garment_fl.sh 1 anran_exp anran_exp

# Training for large motion video after self-rotated fitting 
## e.g. training codes
bash ./scripts/gap-female/train_anran_garment_fl.sh 1 ${exp_self_rotated_name} ${wandb name}
## copy self-rotated folder to large pose, and also its weights
cp -rd ./gap-female-largepose/anran_tic/${exp_self_rotated_name}/  cp -rd ./gap-female-largepose/anran_tic/${exp_large_pose_name}/
mv ./gap-female-largepose/anran_tic/${exp_large_pose_name}/latest.pth ./gap-female-largepose/anran_tic/${exp_large_pose_name}/a-pose.pth
bash ./scripts/gap-female/train_anran_garment_fl.sh 1 ${exp_large_pose_name} ${wandb name}

# Good luck in Garment-Verse.

CVPR2023 Offline Questions

Due to the delay in obtaining my Canadian visa, I was unable to attend the in-person CVPR2023 conference in Vancouver, Canada.

Thanks to Zhongjin Luo for helping collect the offline questions.

Q1: How to obtain templates; How to split garment and human?

For the garment templates, we follow DeepFashion3D and use garment templates defined by them.
we apply the NICP algorithm to register category-specific meshes from implicit clothed human surfaces in canonical space, guided by 3D explicit curves.

Q2 : How to obtain 2D curve; How to optimize 3D curves?

First, the garment semantic labels are parsed from a sequence by simply applying an off-the-shelf Estimator. Then we train the garment landmark predictor to obtain garment 2 key points. Finally, the 2D visible curves can be parsed from the garment semantic contour (It is the shortest path starting from one key point to another one on the garment contour). More details can be found in the supplementary.
We define explicit 3D curves in canonical spaces. Following the dynamic human NeRF modeling methodology, we utilize non-rigid offset MLPs and skinning weight to warp the 3D curves to observation space. Then we project 3D curves into pixel spaces and optimize them through 2D projection loss.

Q3: Can we use it to build a dataset? what is different from directly scanning and registering?

Yes, we can. However, the quality of parsing mesh depends on the accuracy of the predicted garment normals.
Direct scanning requires purchasing advanced equipment and hiring manpower to annotate feature lines which are time-consuming and money-consuming( - -) while our method can obtain high-quality garment mesh from a monocular video shoot by smartphone.

Q4 What is the difference between REC-MV and ReEF?

ReEF is an image-based method, while our method is video-based. ReEF struggles to model a video sequence, as it lacks consistency in the temporal domain.

Q5 What field do you use to represent open-boundary clothing? SDF?

We use SDF with Explicit Curves to represent open-boundary clothing. As we know, SDF hardly handles open-boundary meshes, especially for garments. Inspired by ReEF, if we can obtain dynamic SDF as well as explicit curves in temporal space, we are able to model dynamic open-boundary clothing.

Citation

If you use REC-MV in your research, please consider the following BibTeX entry and give us a star🌟!

@inproceedings{qiu2023recmv
  title={REC-MV: REconstructing 3D Dynamic Cloth from Monocular Videos},
  author={Qiu, Lingteng and Chen, Guanying and Zhou, Jiapeng, Xu Mutian and Wang, Junle, and Han, Xiaoguang},
  booktitle={CVPR},
  year={2023}
}

Acknowledgements

Here are some great resources we benefit or utilize from:

SelfRecon and Open-PIFuhd for Our code base.
VideoAvatar and TCMR for SMPL initialization.
SMPL for Parametric Body Representation
Self-Correction-Human-Parsing for Garment Parsing
RobustVideoMatting for Foreground Parsing
PyTorch3D for Differential Explicit Rendering
Fite for Skinning weights initialization

Zevrap-81/REC-MV