
Envision3D: One Image to 3D with Anchor Views Interpolation


Envision3D: One Image to 3D with Anchor Views Interpolation, ArXiv, Project Page, Model Weights

Yatian Pang, Tanghui Jia, Yujun Shi, Zhenyu Tang, Junwu Zhang, Xinhua Cheng, Xing Zhou, Francis E.H. Tay, Li Yuan

TL;DR

We propose a novel cascade diffusion framework to efficiently generate dense (32) multi-view consistent images and extract high-quality 3D content. Inference with the cascade diffusion framework takes less than 12 GB of VRAM.


Abstract

We present Envision3D, a novel method for efficiently generating high-quality 3D content from a single image. Recent methods that extract 3D content from multi-view images generated by diffusion models show great potential. However, it remains challenging for diffusion models to generate dense multi-view consistent images, which is crucial for the quality of 3D content extraction. To address this issue, we propose a novel cascade diffusion framework, which decomposes the challenging dense view generation task into two tractable stages, namely anchor views generation and anchor views interpolation. In the first stage, we train the image diffusion model to generate globally consistent anchor views conditioned on image-normal pairs. Subsequently, leveraging our video diffusion model fine-tuned on consecutive multi-view images, we interpolate between the anchor views to generate extra dense views. This framework yields dense, multi-view consistent images, providing comprehensive 3D information. To further enhance the overall generation quality, we introduce a coarse-to-fine sampling strategy for the reconstruction algorithm to robustly extract textured meshes from the generated dense images. Extensive experiments demonstrate that our method is capable of generating high-quality 3D content in terms of texture and geometry, surpassing previous image-to-3D baseline methods.

Setup

pip install -r req.txt
pip install carvekit --no-deps
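
If the installation succeeded, a quick sanity check such as the following should run without errors and confirm that a CUDA device is visible (a minimal sketch; torch and diffusers are assumed to be pulled in via req.txt):

import torch      # core dependency for all inference scripts
import diffusers  # diffusion pipelines, assumed to be listed in req.txt

print("torch:", torch.__version__)
print("diffusers:", diffusers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # the full cascade inference is reported to need less than 12 GB of VRAM
    print("GPU:", torch.cuda.get_device_name(0))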

Inference

1. Download model checkpoints

Download our pre-trained model checkpoints from here.

Download the image normal estimation model omnidata_dpt_normal_v2.ckpt from here.

Place all of them under the pretrained_models directory.
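
A quick way to confirm the checkpoints are in place is a small check like this (a sketch; omnidata_dpt_normal_v2.ckpt is the only filename named above, and the remaining contents of the directory depend on which checkpoints you downloaded):

from pathlib import Path

ckpt_dir = Path("pretrained_models")
# the normal estimation model named in the instructions above
assert (ckpt_dir / "omnidata_dpt_normal_v2.ckpt").exists(), "normal estimation checkpoint missing"
# list whatever else was placed alongside it
for f in sorted(ckpt_dir.iterdir()):
    print(f.name)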

2. Pre-process input image

Run the following command to resize the input image and predict the normal map.

CUDA_VISIBLE_DEVICES=0 python process_img.py example_imgs/pumpkin.png processed_imgs/ --size 256 --recenter
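
For reference, the resize and recenter step corresponds roughly to the sketch below (an illustration only, using PIL and assuming an RGBA input whose background has already been removed; process_img.py is the authoritative implementation and also predicts the normal map):

import os
import numpy as np
from PIL import Image

os.makedirs("processed_imgs", exist_ok=True)
img = Image.open("example_imgs/pumpkin.png").convert("RGBA")

# find the bounding box of non-transparent pixels and crop to the object
alpha = np.array(img)[..., 3]
ys, xs = np.nonzero(alpha > 0)
obj = img.crop((xs.min(), ys.min(), xs.max() + 1, ys.max() + 1))

# paste the object into the center of a square canvas and resize to 256x256
side = int(max(obj.size) * 1.2)  # the margin factor here is an assumption
canvas = Image.new("RGBA", (side, side), (0, 0, 0, 0))
canvas.paste(obj, ((side - obj.width) // 2, (side - obj.height) // 2), obj)
canvas.resize((256, 256), Image.LANCZOS).save("processed_imgs/pumpkin.png")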

3. Inference stage I and stage II

Modify the config files in the cfgs directory and run the following commands for inference.

CUDA_VISIBLE_DEVICES=0 python gen_s1.py --config cfgs/s1.yaml  validation_dataset.filepaths=['pumpkin.png'] validation_dataset.crop_size=224
CUDA_VISIBLE_DEVICES=0 python gen_s2.py --config cfgs/s2.yaml  validation_dataset.scene=pumpkin
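
The key=value arguments after --config follow an OmegaConf-style dotlist convention, so the overrides are merged onto the YAML config roughly as follows (a sketch assuming the scripts load their configs with OmegaConf; the field names are the ones used in the commands above):

from omegaconf import OmegaConf

# base config from the YAML file ...
base = OmegaConf.load("cfgs/s1.yaml")

# ... overridden by the dotlist arguments passed on the command line
overrides = OmegaConf.from_dotlist([
    "validation_dataset.filepaths=['pumpkin.png']",
    "validation_dataset.crop_size=224",
])
cfg = OmegaConf.merge(base, overrides)
print(OmegaConf.to_yaml(cfg.validation_dataset))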

4. Textured mesh extraction

After obtaining the 32 generated views, set the correct path to the output images and run the following commands for 3D content extraction.

cd instant-nsr-pl/
python launch.py --config configs/neuralangelo-pinhole-wmask-opt.yaml --gpu 0 --train dataset.scene=pumpkin
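
Once training finishes, the extracted textured mesh can be inspected with a few lines of Python (a sketch; the output path shown is hypothetical, as instant-nsr-pl writes meshes under its exp/ directory with a run-dependent subfolder):

import trimesh

# illustrative path only -- substitute the mesh file produced by your run
mesh = trimesh.load("exp/pumpkin/save/mesh.obj", force="mesh")
print("vertices:", len(mesh.vertices), "faces:", len(mesh.faces))
mesh.show()  # opens an interactive viewer if a display is available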

Results


Work in progress ...

  • Inference code
  • Checkpoints
  • Instructions
  • Training code

Acknowledgements

We thank the authors of Wonder3D, Stable Video Diffusion, omnidata, Diffusers and AnimateAnything for their great work and open-source code.

Citation

@misc{pang2024envision3d,
      title={Envision3D: One Image to 3D with Anchor Views Interpolation}, 
      author={Yatian Pang and Tanghui Jia and Yujun Shi and Zhenyu Tang and Junwu Zhang and Xinhua Cheng and Xing Zhou and Francis E. H. Tay and Li Yuan},
      year={2024},
      eprint={2403.08902},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}