DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model
Official PyTorch implementation of the CVPR 2023 paper
Gwanghyun Kim, Se Young Chun
CVPR 2023
Abstract:
Recent 3D generative models have achieved remarkable performance in synthesizing high resolution photorealistic images with view consistency and detailed 3D shapes, but training them for diverse domains is challenging since it requires massive training images and their camera distribution information.
Text-guided domain adaptation methods have shown impressive performance on converting the 2D generative model on one domain into the models on other domains with different styles by leveraging the CLIP (Contrastive Language-Image Pre-training), rather than collecting massive datasets for those domains. However, one drawback of them is that the sample diversity in the original generative model is not well-preserved in the domain-adapted generative models due to the deterministic nature of the CLIP text encoder. Text-guided domain adaptation will be even more challenging for 3D generative models not only because of catastrophic diversity loss, but also because of inferior text-image correspondence and poor image quality. Here we propose DATID-3D, a novel pipeline of text-guided domain adaptation tailored for 3D generative models using text-to-image diffusion models that can synthesize diverse images per text prompt without collecting additional images and camera information for the target domain. Unlike 3D extensions of prior text-guided domain adaptation methods, our novel pipeline was able to fine-tune the state-of-the-art 3D generator of the source domain to synthesize high resolution, multi-view consistent images in text-guided targeted domains without additional data, outperforming the existing text-guided domain adaptation methods in diversity and text-image correspondence. Furthermore, we propose and demonstrate diverse 3D image manipulations such as one-shot instance-selected adaptation and single-view manipulated 3D reconstruction to fully enjoy diversity in text.
- 2023.03.31: Code & Colab demo are released.
- 2023.04.03: Gradio demo is released.
- 2023.06.20: Hugging Face demo is now available.
- We have used Linux (Ubuntu 20.04).
- We have used 1 NVIDIA A100 GPU for text-guided domain adaptation, and have used 1 NVIDIA A100 or RTX3090 GPU for the test using the shifted generators.
- 1–8 high-end NVIDIA GPUs. We have done all testing and development using V100, RTX3090, and A100 GPUs.
- Python 3.8, PyTorch 1.12.1 (or later), CUDA toolkit 11.6 (or later).
- Python libraries: see environment.yml for exact library dependencies. You can use the following commands with Miniconda3 to create and activate your Python environment:
git clone https://github.com/gwang-kim/DATID-3D_tmp.git
cd DATID-3D
conda env create -n datid3d -f environment.yml
conda activate datid3d
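After activating the environment, it can help to confirm that the installed versions meet the minimums listed above. The following is a minimal sketch and not part of the repository: the helper names `version_tuple` and `meets_minimum` are hypothetical, and the PyTorch import is wrapped so the script still runs where PyTorch is missing.

```python
import re


def version_tuple(version: str) -> tuple:
    """Turn a version string such as '1.12.1+cu116' into (1, 12, 1)."""
    # Drop any local build suffix after '+', then keep the first three numbers.
    numbers = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(n) for n in numbers[:3])


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print("PyTorch", torch.__version__,
              "meets 1.12.1:", meets_minimum(torch.__version__, "1.12.1"))
        print("CUDA available:", torch.cuda.is_available())
        print("CUDA toolkit used to build PyTorch:", torch.version.cuda)
```

Tuples are compared element by element, so `(2, 0, 0)` correctly ranks above `(1, 12, 1)`, which a plain string comparison would get wrong.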
- We use the pretrained EG3D models as our pretrained 3D generative models. The pretrained EG3D models are downloaded automatically for convenience. Alternatively, you can download the pretrained EG3D models yourself and put afhqcats512-128.pkl and ffhqrebalanced512-128.pkl in ~/eg3d/pretrained/.
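If you place the checkpoints manually, a quick check that both files are where they should be can save a failed run. This is a minimal sketch under stated assumptions: the function name `missing_checkpoints` is hypothetical, the file names are those of the EG3D checkpoints, and the directory is passed in explicitly because the location of eg3d/pretrained/ depends on where you cloned the repository.

```python
from pathlib import Path

# File names of the two pretrained EG3D checkpoints.
EXPECTED_CHECKPOINTS = ("afhqcats512-128.pkl", "ffhqrebalanced512-128.pkl")


def missing_checkpoints(pretrained_dir, expected=EXPECTED_CHECKPOINTS):
    """Return the expected checkpoint names that are not present as files."""
    root = Path(pretrained_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # 'eg3d/pretrained' is an assumed relative path; adjust it to your checkout.
    missing = missing_checkpoints("eg3d/pretrained")
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All pretrained EG3D checkpoints found")
```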
- We provide an interactive Gradio app demo.
python datid3d_gradio_app.py
- We provide a Colab demo for you to play with DATID-3D! Because Colab limits VRAM to 12GB, we only provide the code for inference and applications with 3D generative models fine-tuned using DATID-3D, not the fine-tuning code.
3D generative models fine-tuned using the DATID-3D pipeline are stored as *.pkl files.
You can download the models from our Hugging Face model pages.
mkdir finetuned
wget https://huggingface.co/gwang-kim/datid3d-finetuned-eg3d-models/resolve/main/finetuned_models/ffhq-pixar.pkl -O finetuned/ffhq-pixar.pkl
You can sample images, shapes (as .mrc files), and pose-controlled videos using the shifted 3D generative model. For example:
# Sample images and shapes (as .mrc files) using the shifted 3D generative model
python datid3d_test.py --mode image \
--generator_type='ffhq' \
--outdir='test_runs' \
--seeds='100-200' \
--trunc='0.7' \
--shape=True \
--network=finetuned/ffhq-pixar.pkl
# Sample pose-controlled videos using the shifted 3D generative model
python datid3d_test.py --mode video \
--generator_type='ffhq' \
--outdir='test_runs' \
--seeds='100-200' \
--trunc='0.7' \
--grid=4x4 \
--network=finetuned/ffhq-pixar.pkl
The results are saved to ~/test_runs/image or ~/test_runs/video.
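The `--seeds='100-200'` value above is a range specification. This README does not define its syntax, so the sketch below assumes the convention used by the EG3D sampling scripts (comma-separated values, with `a-b` meaning an inclusive range). The function name `parse_seed_spec` is hypothetical and only illustrates how many samples a given specification would request.

```python
import re


def parse_seed_spec(spec: str) -> list:
    """Expand a spec such as '0,3-5' into [0, 3, 4, 5] (ranges are inclusive)."""
    seeds = []
    for part in spec.split(","):
        part = part.strip()
        match = re.fullmatch(r"(\d+)-(\d+)", part)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            seeds.extend(range(start, end + 1))
        else:
            seeds.append(int(part))
    return seeds


if __name__ == "__main__":
    seeds = parse_seed_spec("100-200")
    print(len(seeds), "seeds, from", seeds[0], "to", seeds[-1])
```

Under this assumption, '100-200' requests 101 samples, which is worth keeping in mind when shapes are also exported, since each seed then produces an additional .mrc file.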
Following EG3D, we visualize our .mrc shape files with UCSF ChimeraX.
To visualize a shape in ChimeraX, do the following:
- Import the .mrc file with File > Open
- Find the selected shape in the Volume Viewer tool
  - The Volume Viewer tool is located under Tools > Volume Data > Volume Viewer
- Change volume type to "Surface"
- Change step size to 1
- Change level set to 10
  - Note that the optimal level can vary by each object, but is usually between 2 and 20. Individual adjustment may make certain shapes slightly sharper
- In the Lighting menu in the top bar, change lighting to "Full"
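Before opening a shape in ChimeraX, you may want to confirm that a .mrc file holds a volume of the size you expect. The sketch below reads only the fixed 1024-byte MRC header using the standard library. It assumes a little-endian file in the MRC2014 layout, and the function name `read_mrc_header` is hypothetical; for anything beyond a quick look, a dedicated reader such as the `mrcfile` package is the safer choice.

```python
import struct


def read_mrc_header(path):
    """Read grid size, data mode, and data offset from an MRC file header."""
    with open(path, "rb") as f:
        header = f.read(1024)
    if len(header) < 1024:
        raise ValueError("file too short to contain an MRC header")
    # Words 1-4 of the header: NX, NY, NZ, MODE (assumed little-endian int32).
    nx, ny, nz, mode = struct.unpack("<4i", header[:16])
    # Word 24 (byte offset 92): size of the extended header in bytes.
    (nsymbt,) = struct.unpack("<i", header[92:96])
    return {"nx": nx, "ny": ny, "nz": nz, "mode": mode,
            "data_offset": 1024 + nsymbt}


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        print(read_mrc_header(sys.argv[1]))
```

Mode 2 denotes 32-bit floating-point voxels in the MRC format; the level set chosen in the Volume Viewer is a threshold on those voxel values.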
Text-guided manipulated 3D reconstruction includes alignment -> pose extraction -> 3D GAN inversion -> generation of images using the fine-tuned generator.
We use Deep3DFaceRecon as the pose estimation model.
The pretrained pose estimation model is downloaded automatically for convenience.
Alternatively, you can download the pretrained pose estimation model and BFM files yourself: put epoch_20.pth in ~/pose_estimation/checkpoints/pretrained/ and unzip BFM.zip into ~/pose_estimation/.
For example:
# Text-guided manipulated 3D reconstruction from images using the shifted 3D generative model
python datid3d_test.py --mode manip \
--indir='input_imgs' \
--generator_type='ffhq' \
--outdir='test_runs' \
--trunc='0.7' \
--network=finetuned/ffhq-pixar.pkl
The results are saved to ~/test_runs/manip_3D_recon/4_manip_result.
You can do text-guided domain adaptation of a 3D generator with your own text prompt using datid3d_train.py. For example:
python datid3d_train.py \
--mode='ft' \
--pdg_prompt='a FHD photo of face of beautiful Elf with silver hair in the live action movie' \
--pdg_generator_type='ffhq' \
--pdg_strength=0.7 \
--pdg_num_images=1000 \
--pdg_sd_model_id='stabilityai/stable-diffusion-2-1-base' \
--pdg_num_inference_steps=50 \
--ft_generator_type='same' \
--ft_batch=20 \
--ft_kimg=200
The results of each training run are saved to a newly created directory, for example ~/training_runs/00011-ffhq-data_ffhq_a_FHD_photo_of_face_of_beautiful_Elf_with_silver_hair_in_the_live_action_movie-gpus1-batch20-gamma5.
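Because every run gets its own directory, locating the most recent one by hand becomes tedious after a few experiments. The sketch below picks the run with the highest numeric prefix. It assumes that run directories keep the zero-padded `NNNNN-` prefix shown in the example above; the function name `latest_run_dir` is hypothetical and not part of the repository.

```python
import re
from pathlib import Path


def latest_run_dir(training_runs):
    """Return the run directory with the largest numeric prefix, or None."""
    numbered = []
    for child in Path(training_runs).iterdir():
        match = re.match(r"(\d+)-", child.name)
        # Ignore stray files and directories without a numeric prefix.
        if child.is_dir() and match:
            numbered.append((int(match.group(1)), child))
    if not numbered:
        return None
    return max(numbered, key=lambda item: item[0])[1]


if __name__ == "__main__":
    root = Path("training_runs")  # assumed location; adjust to your checkout
    if root.is_dir():
        print("Latest run:", latest_run_dir(root))
```

Comparing the prefix as an integer rather than as text keeps the ordering correct even if the zero padding ever changes width.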
@inproceedings{kim2022datid3d,
author = {Gwanghyun Kim and Se Young Chun},
title = {DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model},
booktitle = {CVPR},
year = {2023}
}
We thank the public projects that shared their code. We apply our pipeline to EG3D, one of the 3D generative models, and adopt Stable Diffusion as our text-to-image diffusion model and Deep3DFaceRecon as our pose estimation model. We also utilize part of the code from HFGI3D.