[CVPR 2023] Official implementation of "DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model"

DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model
Official PyTorch implementation of the CVPR 2023 paper

Gwanghyun Kim, Se Young Chun
CVPR 2023

gwang-kim.github.io/datid_3d

Abstract:
Recent 3D generative models have achieved remarkable performance in synthesizing high resolution photorealistic images with view consistency and detailed 3D shapes, but training them for diverse domains is challenging since it requires massive training images and their camera distribution information.
Text-guided domain adaptation methods have shown impressive performance on converting the 2D generative model on one domain into the models on other domains with different styles by leveraging the CLIP (Contrastive Language-Image Pre-training), rather than collecting massive datasets for those domains. However, one drawback of them is that the sample diversity in the original generative model is not well-preserved in the domain-adapted generative models due to the deterministic nature of the CLIP text encoder. Text-guided domain adaptation will be even more challenging for 3D generative models not only because of catastrophic diversity loss, but also because of inferior text-image correspondence and poor image quality. Here we propose DATID-3D, a novel pipeline of text-guided domain adaptation tailored for 3D generative models using text-to-image diffusion models that can synthesize diverse images per text prompt without collecting additional images and camera information for the target domain. Unlike 3D extensions of prior text-guided domain adaptation methods, our novel pipeline was able to fine-tune the state-of-the-art 3D generator of the source domain to synthesize high resolution, multi-view consistent images in text-guided targeted domains without additional data, outperforming the existing text-guided domain adaptation methods in diversity and text-image correspondence. Furthermore, we propose and demonstrate diverse 3D image manipulations such as one-shot instance-selected adaptation and single-view manipulated 3D reconstruction to fully enjoy diversity in text.

Recent Updates

  • 2023.03.31: Code & Colab demo are released.
  • 2023.04.03: Gradio demo is released.
  • 2023.06.20: The Hugging Face Spaces demo is now available.

Requirements

  • We have used Linux (Ubuntu 20.04).
  • 1–8 high-end NVIDIA GPUs. We have done all testing and development using V100, RTX 3090, and A100 GPUs. We used 1 NVIDIA A100 GPU for text-guided domain adaptation, and 1 NVIDIA A100 or RTX 3090 GPU for testing with the shifted generators.
  • Python 3.8, PyTorch 1.12.1 (or later), CUDA toolkit 11.6 (or later).
  • Python libraries: see environment.yml for exact library dependencies. You can use the following commands with Miniconda3 to create and activate your Python environment:
    git clone https://github.com/gwang-kim/DATID-3D_tmp.git
    cd DATID-3D_tmp
    conda env create -n datid3d -f environment.yml
    conda activate datid3d
  • We use the pretrained EG3D models as our pretrained 3D generative models. The pretrained EG3D models are downloaded automatically for convenience. Alternatively, you can download them manually and put afhqcats512-128.pkl and ffhqrebalanced512-128.pkl in ~/eg3d/pretrained/.
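
If you place the checkpoints manually, a quick check like the one below can confirm that both files are where the code expects them before you start a run. This is only a sketch: the helper name missing_checkpoints is hypothetical and not part of this repository, and the relative path eg3d/pretrained assumes you run it from the repository root.

```python
from pathlib import Path

# The two EG3D checkpoint files named in the requirements.
EXPECTED_CHECKPOINTS = ("afhqcats512-128.pkl", "ffhqrebalanced512-128.pkl")


def missing_checkpoints(pretrained_dir):
    """Return the expected checkpoint names that are absent from pretrained_dir.

    Hypothetical helper for illustration; it only checks that the files exist,
    not that they are complete or loadable.
    """
    root = Path(pretrained_dir)
    return [name for name in EXPECTED_CHECKPOINTS if not (root / name).is_file()]


if __name__ == "__main__":
    # Assumption: the script is run from the repository root.
    missing = missing_checkpoints("eg3d/pretrained")
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All pretrained EG3D checkpoints found.")
```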

Demo

Gradio Demo

  • We provide an interactive Gradio app demo.
python datid3d_gradio_app.py

Colab Demo

  • We provide a Colab demo for you to play with DATID-3D! Because of the 12 GB VRAM limit in Colab, the demo includes only the inference and application code for 3D generative models fine-tuned using DATID-3D, not the fine-tuning code.

Download Fine-tuned 3D Generative Models

3D generative models fine-tuned using the DATID-3D pipeline are stored as *.pkl files. You can download the models from our Hugging Face model pages.

mkdir -p finetuned
wget https://huggingface.co/gwang-kim/datid3d-finetuned-eg3d-models/resolve/main/finetuned_models/ffhq-pixar.pkl -P finetuned
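
If wget is not available, the same download can be sketched in Python with the standard library only. The helper names model_url and download_model are hypothetical; the base URL is the one from the command above, and ffhq-pixar.pkl is the only file name taken from this README.

```python
import urllib.request
from pathlib import Path

# Base URL copied from the wget command in this README.
BASE_URL = (
    "https://huggingface.co/gwang-kim/datid3d-finetuned-eg3d-models"
    "/resolve/main/finetuned_models"
)


def model_url(filename):
    """Build the download URL for one fine-tuned model file."""
    return f"{BASE_URL}/{filename}"


def download_model(filename, out_dir="finetuned"):
    """Download filename into out_dir unless it already exists; return its path."""
    out_path = Path(out_dir) / filename
    out_path.parent.mkdir(parents=True, exist_ok=True)
    if not out_path.exists():
        urllib.request.urlretrieve(model_url(filename), str(out_path))
    return out_path


if __name__ == "__main__":
    # Only prints the URL; call download_model("ffhq-pixar.pkl") to fetch the file.
    print(model_url("ffhq-pixar.pkl"))
```

Calling download_model("ffhq-pixar.pkl") should leave the file at finetuned/ffhq-pixar.pkl, which is the path used by the sampling commands in this README.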

Sample Images, Shapes and Videos

You can sample images, shapes (as .mrc files), and pose-controlled videos using the shifted 3D generative model. For example:

# Sample images and shapes (as .mrc files) using the shifted 3D generative model

python datid3d_test.py --mode image \
--generator_type='ffhq' \
--outdir='test_runs' \
--seeds='100-200' \
--trunc='0.7' \
--shape=True \
--network=finetuned/ffhq-pixar.pkl

# Sample pose-controlled videos using the shifted 3D generative model

python datid3d_test.py --mode video \
--generator_type='ffhq' \
--outdir='test_runs' \
--seeds='100-200' \
--trunc='0.7' \
--grid=4x4 \
--network=finetuned/ffhq-pixar.pkl 

The results are saved to ~/test_runs/image or ~/test_runs/video.
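
The --seeds value '100-200' above is a range string. As a rough illustration of how such a string can be expanded into individual seeds, here is a minimal sketch modeled on the range parsers commonly found in StyleGAN-family scripts. The function name parse_seed_range is hypothetical, and treating the upper bound as inclusive is an assumption; check datid3d_test.py for the exact behavior.

```python
import re


def parse_seed_range(spec):
    """Expand a string such as '1,2,5-7' into a list of ints.

    Assumption: ranges are inclusive on both ends, so '5-7' gives 5, 6, 7.
    """
    range_re = re.compile(r"^(\d+)-(\d+)$")
    seeds = []
    for part in spec.split(","):
        part = part.strip()
        match = range_re.match(part)
        if match:
            first, last = int(match.group(1)), int(match.group(2))
            seeds.extend(range(first, last + 1))
        else:
            seeds.append(int(part))
    return seeds


if __name__ == "__main__":
    seeds = parse_seed_range("100-200")
    print(len(seeds), seeds[0], seeds[-1])
```

Under that assumption, '100-200' expands to 101 seeds, so expect roughly that many sampled images and shapes per run.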

Following EG3D, we visualize our .mrc shape files with UCSF ChimeraX.

To visualize a shape in ChimeraX do the following:

  1. Import the .mrc file with File > Open
  2. Find the selected shape in the Volume Viewer tool
    1. The Volume Viewer tool is located under Tools > Volume Data > Volume Viewer
  3. Change volume type to "Surface"
  4. Change step size to 1
  5. Change level set to 10
    1. Note that the optimal level varies by object, but is usually between 2 and 20. Adjusting it per shape may make certain shapes slightly sharper
  6. In the Lighting menu in the top bar, change lighting to "Full"
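
Since the best level differs per object, it can help to look at the value range of a shape file before opening it in ChimeraX. The sketch below reads the fixed 1024-byte MRC header and scans the voxel values using only the standard library. It assumes a little-endian MRC2014 file in mode 2 (32-bit floats), which we have not verified for every file this repository writes; the function name mrc_value_range is hypothetical.

```python
import struct
import sys
from array import array


def mrc_value_range(path, chunk_voxels=1 << 20):
    """Return ((nx, ny, nz), min_value, max_value) for a float32 MRC volume.

    Assumptions: little-endian MRC2014 layout, mode 2 (32-bit float) data.
    The volume is scanned in chunks so large grids do not need to fit in memory.
    """
    with open(path, "rb") as f:
        header = f.read(1024)
        if len(header) < 1024:
            raise ValueError("file is too short to contain an MRC header")
        # Words 1-4 of the header: grid size along x, y, z and the data mode.
        nx, ny, nz, mode = struct.unpack_from("<4i", header, 0)
        # Word 24 (byte offset 92): size of the optional extended header.
        (ext_size,) = struct.unpack_from("<i", header, 92)
        if mode != 2:
            raise ValueError(f"unsupported MRC mode {mode}; this sketch handles mode 2 only")
        f.seek(1024 + ext_size)
        remaining = nx * ny * nz
        lowest, highest = float("inf"), float("-inf")
        while remaining > 0:
            count = min(remaining, chunk_voxels)
            buf = f.read(4 * count)
            if len(buf) != 4 * count:
                raise ValueError("file ends before all voxels were read")
            chunk = array("f")
            chunk.frombytes(buf)
            if sys.byteorder == "big":
                chunk.byteswap()  # file data is assumed little-endian
            lowest = min(lowest, min(chunk))
            highest = max(highest, max(chunk))
            remaining -= count
    return (nx, ny, nz), lowest, highest
```

If the reported maximum is below the level you set in ChimeraX, the surface will be empty, so choose a level inside the reported range.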

Single-shot Text-guided 2D-to-3D

Text-guided Manipulated 3D Reconstruction

This includes alignment -> pose extraction -> 3D GAN inversion -> generation of images using the fine-tuned generator. We use Deep3DFaceRecon as the pose estimation model. The pretrained pose estimation model is downloaded automatically for convenience. Alternatively, you can download the pretrained pose estimation model and BFM files manually, put epoch_20.pth in ~/pose_estimation/checkpoints/pretrained/, and unzip BFM.zip in ~/pose_estimation/. For example:

# Text-guided manipulated 3D reconstruction from images using the shifted 3D generative model

python datid3d_test.py --mode manip \
--indir='input_imgs' \
--generator_type='ffhq' \
--outdir='test_runs' \
--trunc='0.7' \
--network=finetuned/ffhq-pixar.pkl 

The results are saved to ~/test_runs/manip_3D_recon/4_manip_result.

Text-guided Domain Adaptation of 3D Generator

You can do text-guided domain adaptation of a 3D generator with your own text prompt using datid3d_train.py. For example:

python datid3d_train.py \
   --mode='ft' \
   --pdg_prompt='a FHD photo of face of beautiful Elf with silver hair in the live action movie' \
   --pdg_generator_type='ffhq' \
   --pdg_strength=0.7 \
   --pdg_num_images=1000 \
   --pdg_sd_model_id='stabilityai/stable-diffusion-2-1-base' \
   --pdg_num_inference_steps=50 \
   --ft_generator_type='same' \
   --ft_batch=20 \
   --ft_kimg=200

The results of each training run are saved to a newly created directory, for example ~/training_runs/00011-ffhq-data_ffhq_a_FHD_photo_of_face_of_beautiful_Elf_with_silver_hair_in_the_live_action_movie-gpus1-batch20-gamma5.
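
To adapt the generator to several prompts in a row, the command above can be assembled programmatically. The sketch below only builds and prints the command lines; the flags and default values are copied from the example above, while the helper name build_train_command and the second prompt are hypothetical and used purely for illustration.

```python
import shlex


def build_train_command(prompt, generator_type="ffhq", num_images=1000, batch=20, kimg=200):
    """Return the datid3d_train.py argument list for one text prompt.

    Flags and defaults mirror the example command in this README.
    """
    return [
        "python", "datid3d_train.py",
        "--mode=ft",
        f"--pdg_prompt={prompt}",
        f"--pdg_generator_type={generator_type}",
        "--pdg_strength=0.7",
        f"--pdg_num_images={num_images}",
        "--pdg_sd_model_id=stabilityai/stable-diffusion-2-1-base",
        "--pdg_num_inference_steps=50",
        "--ft_generator_type=same",
        f"--ft_batch={batch}",
        f"--ft_kimg={kimg}",
    ]


if __name__ == "__main__":
    prompts = [
        "a FHD photo of face of beautiful Elf with silver hair in the live action movie",
        "a 3D render of a face in Pixar style",  # hypothetical prompt, not from the paper
    ]
    for prompt in prompts:
        # shlex.join quotes the prompt so the printed line can be pasted into a shell.
        print(shlex.join(build_train_command(prompt)))
```

To launch the runs instead of printing them, pass each argument list to subprocess.run(cmd, check=True) from the repository root.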

Citation

@inproceedings{kim2022datid3d,
  author = {Gwanghyun Kim and Se Young Chun},
  title = {DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model},
  booktitle = {CVPR},
  year = {2023}
}

Acknowledgements

We thank the authors of the following public projects for sharing their code. We apply our pipeline to EG3D, one of the 3D generative models, and adopt Stable Diffusion as our text-to-image diffusion model and Deep3DFaceRecon as our pose estimation model. We also use part of the code from HFGI3D.