[CVPR 2024] SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion


Single-view Textured Human Reconstruction with Image-Conditioned Diffusion

Official code release for CVPR 2024 paper SiTH.

What you can find in this repo:

  • Demo for reconstructing a fully textured 3D human from a single image in 2 minutes (tested on an RTX 3090 GPU).
  • A minimal script for fitting the SMPL-X model to an image.
  • A new evaluation benchmark for single-view 3D human reconstruction.
  • A Gradio demo for creating 3D humans with poses and text prompts.
  • Training scripts for the diffusion model and the mesh reconstruction model.

If you find our code and paper useful, please cite them as:

    title={SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion},
    author={Ho, Hsuan-I and Song, Jie and Hilliges, Otmar},
    booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},


News

  • [June 14, 2024] Release the training code for the diffusion model and the mesh reconstruction model. More instructions will be provided soon.
  • [May 15, 2024] Update an application of 3D avatar animation.
  • [April 24, 2024] Gradio demo for 3D human creation is now available.
  • [April 15, 2024] Release demo code, models, and the evaluation benchmark.


Installation

Our code has been tested with Ubuntu 22.04, PyTorch 2.1.0, CUDA 12.1, and an RTX 3090 GPU.

Simply run the following command to install relevant packages:

pip install -r requirements.txt

Quick Start

  1. Download the checkpoint files into the checkpoints folder.
bash tools/download.sh
  2. Download the SMPL-X models and move them to the data/body_models folder. You should have the following data structure:
        ├── SMPLX_NEUTRAL.pkl
        ├── SMPLX_NEUTRAL.npz
        ├── SMPLX_MALE.pkl
        ├── SMPLX_MALE.npz
        ├── SMPLX_FEMALE.pkl
        └── SMPLX_FEMALE.npz
  3. Run the script for body fitting, back hallucination, and mesh reconstruction.
bash run.sh
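
Before running the steps above, it can help to confirm that the SMPL-X files are in place. The following is a minimal sketch, not part of the repository: the helper name missing_smplx_files is hypothetical, and it searches recursively under data/body_models so it does not assume any particular subfolder layout.

```python
from pathlib import Path

# File names listed in the Quick Start data structure.
EXPECTED = [
    f"SMPLX_{gender}.{ext}"
    for gender in ("NEUTRAL", "MALE", "FEMALE")
    for ext in ("pkl", "npz")
]


def missing_smplx_files(root="data/body_models"):
    """Return the expected SMPL-X file names not found anywhere under root."""
    root = Path(root)
    if not root.is_dir():
        return list(EXPECTED)
    found = {p.name for p in root.rglob("SMPLX_*") if p.is_file()}
    return [name for name in EXPECTED if name not in found]


if __name__ == "__main__":
    missing = missing_smplx_files()
    if missing:
        print("Missing SMPL-X files:", ", ".join(missing))
    else:
        print("All SMPL-X files found.")
```

Run it from the repository root; an empty result means all six files were found.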

SiTH Pipeline

Data Preparation

You can prepare your own RGBA images and put them into the data/examples/rgba folder. For example, you can create photos from OutfitAnyone, and remove the background with Segment Anything or Clipdrop.

  1. Run the script to generate square, centered input images in the data/examples/images folder. The default size is 1024x1024. You can adjust this with the --size and --ratio arguments.
python tools/centralize_rgba.py
  2. Install and run OpenPose to get .json files of COCO-25 body, hand, and face keypoints. For example, we used the following command. Afterwards, your image folder should contain files as in data/examples/images.
cd /path/to/openpose_dir

./build/examples/openpose/openpose.bin --image_dir /path/to/images_dir --write_json /path/to/images_dir --display 0 --net_resolution -1x544 --scale_number 3 --scale_gap 0.25 --hand --face --render_pose 0
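
To sanity-check the OpenPose output before fitting, the keypoint files can be inspected with a few lines of Python. This is a sketch, not code from this repository: the function name load_openpose_keypoints is hypothetical. It relies on the standard OpenPose JSON layout, where each detected person stores flat lists of x, y, confidence values under keys such as pose_keypoints_2d.

```python
import json


def load_openpose_keypoints(json_path):
    """Read one OpenPose *_keypoints.json file and return the first person's keypoints.

    Each returned list holds (x, y, confidence) tuples.
    """
    with open(json_path) as f:
        data = json.load(f)
    if not data.get("people"):
        raise ValueError(f"No person detected in {json_path}")
    person = data["people"][0]

    def triplets(key):
        # OpenPose stores keypoints as a flat list [x0, y0, c0, x1, y1, c1, ...].
        flat = person.get(key, [])
        return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]

    return {
        "body": triplets("pose_keypoints_2d"),
        "left_hand": triplets("hand_left_keypoints_2d"),
        "right_hand": triplets("hand_right_keypoints_2d"),
        "face": triplets("face_keypoints_2d"),
    }
```

For example, load_openpose_keypoints("data/examples/images/000_keypoints.json") should return non-empty hand and face lists if the --hand and --face flags were used. An error here usually means OpenPose found no person in the image.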

SMPL-X Fitting

Next, we fit the SMPL-X body model to each input image and align it within a [-1, 1] cube. By default, we use the following command, which optimizes the global orientation, body shape, scale, and X, Y offset parameters.

python fit.py --opt_orient --opt_betas

There are also additional arguments and hyperparameters for customized fitting. For example, if the initial body pose is not well aligned, you can use the --opt_pose flag to optimize specific body joints. You can visualize the fitting results with the --debug flag.
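
To illustrate what the scale and X, Y offset parameters do geometrically, here is a small sketch. It is not taken from fit.py: the function names are hypothetical, and it assumes an orthographic camera in which x and y in [-1, 1] span the full image, with +y pointing up.

```python
def transform_vertex(v, scale, offset_xy):
    """Apply a uniform scale and an X, Y offset to one 3D vertex (x, y, z)."""
    x, y, z = v
    return (scale * x + offset_xy[0], scale * y + offset_xy[1], scale * z)


def project_to_pixel(v, image_size=1024):
    """Orthographic projection of a vertex in the [-1, 1] cube to (column, row).

    Assumption: +y points up in 3D, so image rows are flipped.
    """
    x, y, _ = v
    col = (x + 1.0) * 0.5 * image_size
    row = (1.0 - y) * 0.5 * image_size
    return (col, row)


def inside_unit_cube(v, eps=1e-6):
    """True if every coordinate of v lies within [-1, 1]."""
    return all(-1.0 - eps <= c <= 1.0 + eps for c in v)
```

Under these assumptions, the cube center projects to the image center, and a vertex that leaves the cube after scaling and offsetting would fall outside the image.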

Back-view Hallucination

Given the front-view images and SMPL-X parameters, we generate back-view images with our image-conditioned diffusion model. The following command generates images in the data/examples/back_images folder.

python hallucinate.py --num_validation_image 8

Note that sampling from a generative model is random, so the script generates multiple images per input. Choose the best one and keep it in data/examples/back_images. There are several parameters you can play with:

  • --guidance_scale: Classifier-free guidance (CFG) scale.
  • --conditioning_scale: ControlNet conditioning scale.
  • --num_inference_steps: Denoising steps.
  • --pretrained_model_name_or_path: The default model is trained on 500 human scans. We offer a new model trained with 2000+ scans and more view angles. To use it, set this argument to hohs/SiTH-diffusion-2000.
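
For intuition on the two scale parameters above, the sketch below shows the standard classifier-free guidance combination and the usual way a ControlNet conditioning scale is applied. It assumes hallucinate.py follows these common conventions; the function names are illustrative, and plain Python lists stand in for tensors.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine unconditional and conditional noise predictions.

    guidance_scale = 1 returns the conditional prediction unchanged;
    larger values push the result further toward the condition.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def scale_controlnet_residuals(residuals, conditioning_scale):
    """Scale ControlNet residuals before they are added to the denoiser features."""
    return [conditioning_scale * r for r in residuals]
```

In practice, a higher guidance scale tends to follow the conditioning image more closely at the cost of diversity, and a lower conditioning scale weakens the ControlNet's influence on the output.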

Textured Human Reconstruction

Before reconstructing the 3D meshes, make sure the following folders and images are ready.

    |   ├── 000.png
    |   ├── 000_keypoints.json
    |   ...
    |   ├── 000_smplx.obj
    |   ...
        ├── 000_00X.png

The following command will reconstruct textured meshes under data/examples/meshes:

python reconstruct.py --test_folder data/examples --config recon/config.yaml --resume checkpoints/recon_model.pth

The default --grid_size for marching cubes is set to 512. If your images contain noisy segmentation borders, you can increase --erode_iter to shrink your segmentation mask.
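
To show how erosion shrinks a mask, here is a pure-Python sketch of binary erosion. It assumes --erode_iter counts iterations of a standard morphological erosion; the 3x3 square kernel and the function name erode are choices made for this example, not details confirmed by the repository.

```python
def erode(mask, iterations=1):
    """Binary erosion with a 3x3 square structuring element.

    mask is a list of rows containing 0 or 1. Pixels outside the image
    count as background, so the mask also shrinks at the image border.
    """
    h, w = len(mask), len(mask[0])
    for _ in range(iterations):
        out = [[0] * w for _ in range(h)]
        for r in range(h):
            for c in range(w):
                keep = True
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        outside = not (0 <= rr < h and 0 <= cc < w)
                        if outside or mask[rr][cc] == 0:
                            keep = False
                out[r][c] = 1 if keep else 0
        mask = out
    return mask
```

Each iteration removes roughly one pixel from the mask boundary, which trims thin halos left by imperfect background removal but also eats into fine details such as fingers or hair.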

Training Models

Please see TRAINING.md


Texts to 3D Humans

Instruction Video

We built an application that combines SiTH with ControlNet for 3D human creation. In the demo, users can create 3D humans with a few button clicks.

You can either try our Online Demo or launch the web UI locally. To run the demo on your local machine, run

python app.py

You will see the following web UI on

Animation-ready Avatars

SiTH can be used to create animatable 3D avatars from images. To generate a textured mesh with a UV map, replace the reconstruction command in run.sh with

python reconstruct.py --test_folder data/examples --config recon/config.yaml --resume checkpoints/recon_model.pth --grid_size 300 --save_uv

⚠️ UV unwrapping requires an additional package, which you can install with pip install xatlas. Note that UV unwrapping is slow (>10 mins per mesh), so it should only be used for the avatar animation application.

We fit and repose the reconstructed textured meshes using Editable-humans. Please check their demo code to see how to repose a 3D human mesh.

Evaluation Benchmark

We created an evaluation benchmark using the CustomHumans dataset. Please apply for the dataset directly, and you will find the necessary files in the download link.

Note that we trained our models with 526 human scans provided in the THuman2.0 dataset and tested on 60 scans in the CustomHumans dataset. We used the default hyperparameters and commands suggested in run.sh. The evaluation scripts can be found here and here. You will need to install two additional packages for evaluation:

pip install torchmetrics[image] mediapipe

Single-view human 3D reconstruction benchmark

Methods             P-to-S (cm) ↓   S-to-P (cm) ↓   NC ↑    f-Score ↑
PIFu [Saito2019]    2.209           2.582           0.805   34.881
PIFuHD [Saito2020]  2.107           2.228           0.804   39.076
PaMIR [Zheng2021]   2.181           2.507           0.813   35.847
FOF [Feng2022]      2.079           2.644           0.808   36.013
2K2K [Han2023]      2.488           3.292           0.796   30.186
ICON* [Xiu2022]     2.256           2.795           0.791   30.437
ECON* [Xiu2023]     2.483           2.680           0.797   30.894
SiTH* (Ours)        1.871           2.045           0.826   37.029

  • * indicates methods trained on the same THuman2.0 dataset.
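
As a rough guide to reading the table, the sketch below computes two directed distances and an f-Score on point samples. It is not the benchmark's evaluation script. It assumes that P-to-S and S-to-P are the two directed distances between the prediction and the ground-truth scan, and that the f-Score is the harmonic mean of precision and recall at a distance threshold, reported in percent. The threshold value and the use of point samples instead of surfaces are simplifications for this example.

```python
import math


def directed_distances(src, dst):
    """For each point in src, the Euclidean distance to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def mean(values):
    return sum(values) / len(values)


def f_score(pred, scan, tau):
    """Harmonic mean of precision and recall at distance threshold tau, in percent."""
    d_pred_to_scan = directed_distances(pred, scan)
    d_scan_to_pred = directed_distances(scan, pred)
    precision = sum(d <= tau for d in d_pred_to_scan) / len(d_pred_to_scan)
    recall = sum(d <= tau for d in d_scan_to_pred) / len(d_scan_to_pred)
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)
```

With pred and scan given as lists of 3D points, mean(directed_distances(pred, scan)) and mean(directed_distances(scan, pred)) give the two directed averages. The brute-force nearest-neighbor search is quadratic, so it is only suitable for small illustrative inputs.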

Back-view hallucination benchmark

Methods                 SSIM ↑   LPIPS ↓   KID (×10^-3) ↓   Joints Err. (pixel) ↓
Pix2PixHD [Wang2018]    0.816    0.141     86.2             53.1
DreamPose [Karras2023]  0.844    0.132     86.7             76.7
Zero-1-to-3 [Liu2023]   0.862    0.119     30.0             73.4
ControlNet [Zhang2023]  0.851    0.202     39.0             35.7
SiTH (Ours)             0.950    0.063     3.2              21.5
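
The joint error column can be read as a 2D keypoint distance in pixels. The sketch below shows one plausible definition, the mean Euclidean distance between corresponding joints. This is an assumption for illustration: the benchmark's exact joint set and detection procedure are not described here, and the function name is hypothetical.

```python
import math


def mean_joint_error(pred_joints, gt_joints):
    """Mean Euclidean distance in pixels between corresponding 2D joints."""
    if len(pred_joints) != len(gt_joints) or not gt_joints:
        raise ValueError("joint lists must be non-empty and of equal length")
    total = sum(math.dist(p, g) for p, g in zip(pred_joints, gt_joints))
    return total / len(gt_joints)
```

For instance, joints detected on a generated back-view image could be compared against joints detected on the ground-truth back view at the same resolution.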


Acknowledgement

We used code from other great research work, including occupancy_networks, pifuhd, kaolin-wisp, mmpose, smplx, SMPLer-X, and editable-humans.

We created all the videos using the powerful aitviewer.

We sincerely thank the authors for their awesome work!


Contact

For any questions or problems, please open an issue or contact Hsuan-I Ho.