This repository contains the implementation of the following paper:
HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image Generation [Project Page] [Paper] [Code] [Video] [Data]
Xuan Ju∗12, Ailing Zeng∗1, Chenchen Zhao∗2, Jianan Wang1, Lei Zhang1, Qiang Xu2
∗ Equal contribution 1International Digital Economy Academy 2The Chinese University of Hong Kong
In this work, we propose HumanSD, a native skeleton-guided diffusion model for controllable human image generation (HIG). Instead of performing image editing with dual-branch diffusion, we fine-tune the original Stable Diffusion (SD) model using a novel heatmap-guided denoising loss. This strategy effectively and efficiently strengthens the given skeleton condition during model training while mitigating catastrophic forgetting. HumanSD is fine-tuned on a combination of three large-scale human-centric datasets with text-image-pose information, two of which are established in this work.
- (a) a generation by the pre-trained pose-less text-guided stable diffusion (SD)
- (b) pose skeleton images as the condition to ControlNet and our proposed HumanSD
- (c) a generation by ControlNet
- (d) a generation by HumanSD (ours). ControlNet and HumanSD receive both text and pose conditions.
HumanSD shows advantages in (I) challenging poses, (II) accurate painting styles, (III) pose control capability, (IV) multi-person scenarios, and (V) delicate details.
- Release inference code and pretrained models
- Release Gradio UI demo
- Public training data (LAION-Human)
- Release training code (will be public after received)
HumanSD has been implemented and tested on PyTorch 1.12.1 with Python 3.9.
Clone the repo:
git clone --recursive git@github.com:IDEA-Research/HumanSD.git
We recommend first installing PyTorch following the official instructions. For example:
# conda
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
Then install the required packages with:
pip install -r requirements.txt
You also need to install MMPose following the instructions here. Note that you only need to install MMPose as a Python package.
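After installation, a quick check that the packages are visible to the interpreter can save a failed demo launch. The following is a minimal sketch, not part of the repository: the helper names `installed_versions` and `report` are hypothetical, and it assumes the packages are registered under the distribution names `torch`, `torchvision`, and `mmpose`. The expected versions are the ones quoted in the install command above.

```python
from importlib import metadata

# Versions quoted in the installation steps above; None means "any version".
# The distribution names are an assumption about how the packages register.
EXPECTED = {"torch": "1.12.1", "torchvision": "0.13.1", "mmpose": None}


def installed_versions(packages=EXPECTED):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report(packages=EXPECTED):
    """Print one status line per package and return the names needing attention."""
    problems = []
    for name, version in installed_versions(packages).items():
        wanted = packages[name]
        if version is None:
            status = "missing"
            problems.append(name)
        elif wanted and not version.startswith(wanted):
            status = f"found {version}, tested with {wanted}"
            problems.append(name)
        else:
            status = f"ok ({version})"
        print(f"{name}: {status}")
    return problems
```

Calling `report()` prints the status of each package. A version mismatch is only a hint, since other versions may also work.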
Checkpoints
Download the necessary HumanSD checkpoints, which can be found here. The directory structure should look like this:
|-- humansd_data
    |-- checkpoints
        |-- higherhrnet_w48_humanart_512x512_udp.pth
        |-- v2-1_512-ema-pruned.ckpt
        |-- humansd-v1.ckpt
Note that v2-1_512-ema-pruned.ckpt should be downloaded from Stable Diffusion.
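To confirm that the checkpoints are in place before launching the demo, a small check along these lines may help. It is a sketch only: `missing_checkpoints` is a hypothetical helper, and the file names are copied from the directory listing above.

```python
from pathlib import Path

# File names taken from the checkpoint directory listing above.
CHECKPOINTS = (
    "higherhrnet_w48_humanart_512x512_udp.pth",
    "v2-1_512-ema-pruned.ckpt",
    "humansd-v1.ckpt",
)


def missing_checkpoints(data_root="humansd_data"):
    """Return the expected checkpoint files that are absent under data_root."""
    checkpoint_dir = Path(data_root) / "checkpoints"
    return [name for name in CHECKPOINTS if not (checkpoint_dir / name).is_file()]
```

An empty list from `missing_checkpoints()` means all three files were found under `humansd_data/checkpoints`.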
models
You also need to prepare the configs of the MMPose models. You can directly download mmpose/configs and put it under humansd_data/models/mmpose. The directory structure will then be:
|-- humansd_data
    |-- models
        |-- mmpose
            |-- configs
                |-- _base_
                |-- animal
                |-- ...
You can run the demo with:
python scripts/gradio/pose2img.py
We also provide a comparison with ControlNet and T2I-Adapter; you can run all of these methods in one demo. To compare against the ControlNet and T2I-Adapter results, you first need to download the corresponding models and checkpoints:
(1) Initialize ControlNet and T2I-Adapter as submodules using:
git submodule init
git submodule update
(2) Download the checkpoints from: a. T2I-Adapter b. ControlNet, and put them into humansd_data/checkpoints.
Then, run:
python scripts/gradio/pose2img.py --controlnet --t2i
Note that you may have to modify some code in T2I-Adapter due to path conflicts.
For example, use
from comparison_models.T2IAdapter.ldm.models.diffusion.ddim import DDIMSampler
instead of
from T2IAdapter.ldm.models.diffusion.ddim import DDIMSampler
You may refer to the code here for loading the data.
Laion-Human
You may apply for access to Laion-Human here. Note that we provide the pose annotations, the images' .parquet files, and the mapping file; please download the images according to the .parquet files. The key in a .parquet file is the corresponding image index. For example, the image with key=338717 in 00033.parquet corresponds to images/00000/000338717.jpg. If you downloaded LAION-Aesthetics as tar files, which differs from our data structure, we recommend extracting the tar files with the following code:
import os
import tarfile

import cv2
import numpy as np
from PIL import Image, TarIO

tar_name = "00000.tar"  # 00000.tar - 00286.tar
present_tar_path = f"xxxxxx/{tar_name}"  # replace xxxxxx with the directory holding the tar files
save_dir = "humansd_data/datasets/Laion/Aesthetics_Human/images"

# Use a separate name for the open archive so tar_name stays a string.
with tarfile.open(present_tar_path, "r") as tar:
    for present_file in tar.getmembers():
        if present_file.name.endswith(".jpg"):
            print(f" image:- {present_file.name} -")
            # Target: <save_dir>/<tar name without extension>/<member name>
            image_save_path = os.path.join(save_dir, tar_name.replace(".tar", ""), present_file.name)
            present_image_fp = TarIO.TarIO(present_tar_path, present_file.name)
            present_image = Image.open(present_image_fp).convert("RGB")
            # OpenCV writes images in BGR channel order.
            present_image_numpy = cv2.cvtColor(np.array(present_image), cv2.COLOR_RGB2BGR)
            os.makedirs(os.path.dirname(image_save_path), exist_ok=True)
            cv2.imwrite(image_save_path, present_image_numpy)
The file data structure should be like:
|-- humansd_data
    |-- datasets
        |-- Laion
            |-- Aesthetics_Human
                |-- images
                    |-- 00000
                        |-- 000000000.jpg
                        |-- 000000001.jpg
                        |-- ...
                    |-- 00001
                    |-- ...
                |-- pose
                    |-- 00000
                        |-- 000000000.npz
                        |-- 000000001.npz
                        |-- ...
                    |-- 00001
                    |-- ...
                |-- mapping_file_training.json
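The naming in the key example above (key=338717, 00033.parquet, 000338717.jpg) can be sketched as follows. Both helper names are hypothetical. The nine-digit zero padding matches the file names in the listing; the rule that the first five digits of the padded key name the .parquet file is an assumption inferred from that single example, so please verify it against the provided mapping file.

```python
def key_to_filename(key, extension=".jpg"):
    """Zero-pad an image key to nine digits, e.g. 338717 -> '000338717.jpg'."""
    return f"{int(key):09d}{extension}"


def key_to_parquet(key):
    """Assumption: the first five digits of the padded key name the .parquet file."""
    return f"{int(key):09d}"[:5] + ".parquet"


print(key_to_filename(338717))          # 000338717.jpg
print(key_to_filename(338717, ".npz"))  # 000338717.npz
print(key_to_parquet(338717))           # 00033.parquet
```

The sketch deliberately does not derive the image subfolder from the key, because the example alone does not determine that rule.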
Human-Art
You may download the Human-Art dataset here.
The file data structure should be like:
|-- humansd_data
    |-- datasets
        |-- HumanArt
            |-- images
                |-- 2D_virtual_human
                    |-- cartoon
                        |-- 000000000007.jpg
                        |-- 000000000019.jpg
                        |-- ...
                    |-- digital_art
                    |-- ...
                |-- 3D_virtual_human
                |-- real_human
            |-- pose
                |-- 2D_virtual_human
                    |-- cartoon
                        |-- 000000000007.npz
                        |-- 000000000019.npz
                        |-- ...
                    |-- digital_art
                    |-- ...
                |-- 3D_virtual_human
                |-- real_human
            |-- mapping_file_training.json
            |-- mapping_file_validation.json
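Both dataset layouts above mirror each image under images/ with a .npz file at the same relative path under pose/. A quick sanity check of a download could look like the sketch below. `pair_images_with_poses` is a hypothetical helper, and it assumes all images use the .jpg extension shown in the listings. The mapping files remain the authoritative index; this only checks the directory structure.

```python
from pathlib import Path


def pair_images_with_poses(dataset_root):
    """Return (pairs, unmatched) for a dataset laid out as images/ and pose/."""
    root = Path(dataset_root)
    image_dir, pose_dir = root / "images", root / "pose"
    pairs, unmatched = [], []
    for image_path in sorted(image_dir.rglob("*.jpg")):
        # Same relative path as the image, with the extension swapped to .npz.
        pose_path = pose_dir / image_path.relative_to(image_dir).with_suffix(".npz")
        if pose_path.is_file():
            pairs.append((image_path, pose_path))
        else:
            unmatched.append(image_path)
    return pairs, unmatched
```

For example, `pair_images_with_poses("humansd_data/datasets/HumanArt")` returns the matched image and pose paths along with any images that have no pose file.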
- (a) a generation by the pre-trained text-guided stable diffusion (SD)
- (b) pose skeleton images as the condition to ControlNet, T2I-Adapter and our proposed HumanSD
- (c) a generation by ControlNet
- (d) a generation by T2I-Adapter
- (e) a generation by HumanSD (ours).
ControlNet, T2I-Adapter, and HumanSD receive both text and pose conditions.
@article{ju2023humansd,
title={Human{SD}: A Native Skeleton-Guided Diffusion Model for Human Image Generation},
author={Ju, Xuan and Zeng, Ailing and Zhao, Chenchen and Wang, Jianan and Zhang, Lei and Xu, Qiang},
journal={arXiv preprint arXiv:2304.04269},
year={2023}
}
@inproceedings{ju2023human,
title={Human-Art: A Versatile Human-Centric Dataset Bridging Natural and Artificial Scenes},
author={Ju, Xuan and Zeng, Ailing and Wang, Jianan and Xu, Qiang and Zhang, Lei},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2023},
}
- Our code is modified on the basis of Stable Diffusion, thanks to all the contributors!
- HumanSD would not be possible without LAION and their efforts to create open, large-scale datasets.
- Thanks to the DeepFloyd team at Stability AI, for creating the subset of LAION-5B dataset used to train HumanSD.
- HumanSD uses OpenCLIP, trained by Romain Beaumont.