StableDreamer: Taming Noisy Score Distillation Sampling for Text-to-3D

Pengsheng Guo, Hans Hao, Adam Caccavale, Zhongzheng Ren, Edward Zhang, Qi Shan, Aditya Sankar, Alex Schwing, Alex Colburn, Fangchang Ma
Apple

Summary

We present StableDreamer, a methodology incorporating three advancements to tame noisy score distillation sampling (SDS). It reduces multi-face geometries, generates fine details, and converges stably. Specifically, our contributions include:

  1. Interpreting SDS as a reparametrized supervised reconstruction problem, leading to new visualization that motivates the use of an annealing schedule for noise levels.
  2. A two-stage training framework that combines image and latent diffusion for enhanced geometry and color quality.
  3. Integration of 3D Gaussians as the representation for text-to-3D generation, with novel regularization techniques that improve fidelity, detail, and convergence.
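The equivalence in contribution 1 can be checked numerically: with x_t = α_t·x + σ_t·ε and the one-step denoised estimate x̂ = (x_t − σ_t·ε_φ(x_t))/α_t held fixed (stop-gradient), the gradient of the L2 loss ½‖x − x̂‖² equals the SDS gradient w(t)·(ε_φ(x_t) − ε) with w(t) = σ_t/α_t. The sketch below uses a stub denoiser and illustrative constants; `eps_phi` stands in for the 2D diffusion network and is not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: x is the rendered image (flattened); eps is the injected noise.
x = rng.normal(size=8)
eps = rng.normal(size=8)
alpha_t, sigma_t = 0.8, 0.6            # noise-schedule coefficients at step t
x_t = alpha_t * x + sigma_t * eps      # forward-diffused sample

def eps_phi(x_t):
    # Stand-in for the frozen diffusion network's noise prediction (hypothetical).
    return np.tanh(x_t)

# (a) SDS gradient with weighting w(t) = sigma_t / alpha_t.
sds_grad = (sigma_t / alpha_t) * (eps_phi(x_t) - eps)

# (b) Gradient of the supervised loss 0.5 * ||x - x_hat||^2, where x_hat is the
# one-step denoised estimate treated as a constant target (stop-gradient).
x_hat = (x_t - sigma_t * eps_phi(x_t)) / alpha_t
l2_grad = x - x_hat

assert np.allclose(sds_grad, l2_grad)
```

Viewing SDS this way turns the denoised estimate x̂ into a per-iteration pseudo ground truth that can be visualized directly, which is the debugging tool described in the abstract.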
Abstract

In the realm of text-to-3D generation, utilizing 2D diffusion models through score distillation sampling (SDS) frequently leads to issues such as blurred appearances and multi-faced geometry, primarily due to the intrinsically noisy nature of the SDS loss. Our analysis identifies the core of these challenges as the interaction among noise levels in the 2D diffusion process, the architecture of the diffusion network, and the 3D model representation. To overcome these limitations, we present StableDreamer, a methodology incorporating three advances. First, inspired by InstructNeRF2NeRF, we formalize the equivalence of the SDS generative prior and a simple supervised L2 reconstruction loss. This finding provides a novel tool to debug SDS, which we use to show the impact of time-annealing noise levels on reducing multi-faced geometries. Second, our analysis shows that while image-space diffusion contributes to geometric precision, latent-space diffusion is crucial for vivid color rendition. Based on this observation, StableDreamer introduces a two-stage training strategy that effectively combines these aspects, resulting in high-fidelity 3D models. Third, we adopt an anisotropic 3D Gaussian representation, replacing Neural Radiance Fields (NeRFs), to enhance overall quality, reduce memory usage during training, accelerate rendering, and better capture semi-transparent objects. StableDreamer reduces multi-face geometries, generates fine details, and converges stably.
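The time-annealing of noise levels mentioned above can be sketched as a linearly decaying upper bound on the sampled diffusion step, so late iterations see lower noise and sharper supervision. The function name and all constants below are illustrative, not the paper's exact schedule.

```python
def annealed_t_range(step, total_steps, t_min=0.02, t_max_start=0.98, t_max_end=0.5):
    # Linearly anneal the upper bound of the sampled noise level t over training.
    # Early steps allow high noise (coarse structure); later steps cap t lower,
    # which the paper's analysis links to fewer multi-face artifacts.
    frac = step / max(total_steps - 1, 1)
    t_max = t_max_start + frac * (t_max_end - t_max_start)
    return t_min, t_max
```

At each iteration, t would then be sampled uniformly from the returned range before noising the rendered image.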

Results

[Video comparisons, left to right: DreamFusion, ProlificDreamer, Gsgen, StableDreamer (Ours), on the prompts below.]
a zoomed out dslr photo of a baby bunny sitting on top of a stack of pancakes
a zoomed out dslr photo of a rabbit cutting grass with a lawnmower
a wide angle dslr photo of a colorful rooster
a dslr photo of a blue jay standing on a large basket of rainbow macarons
a dslr photo of a tarantula, highly detailed
a zoomed out dslr photo of a corgi wearing a top hat
humoristic san goku body mixed with wild boar head running, amazing high tech fitness room digital illustration

Citation

@misc{guo2023stabledreamer,
      title={StableDreamer: Taming Noisy Score Distillation Sampling for Text-to-3D}, 
      author={Pengsheng Guo and Hans Hao and Adam Caccavale and Zhongzheng Ren and Edward Zhang and Qi Shan and Aditya Sankar and Alexander G. Schwing and Alex Colburn and Fangchang Ma},
      year={2023},
      eprint={2312.02189},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}