/MVDream

Multi-view Diffusion for 3D Generation

Primary LanguagePythonMIT LicenseMIT

MVDream

Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, Xiao Yang

| Project Page | 3D Generation | Paper | HuggingFace Demo (Coming) |

multiview diffusion

3D Generation

  • This repository only includes the diffusion model and 2D image generation code of MVDream paper.
  • For 3D Generation, please check MVDream-threestudio.

Installation

You can use the same environment as in Stable-Diffusion for this repo. Or you can set up the environment by installing the given requirements

pip install -r requirements.txt

To use MVDream as a python module, you can install it by pip install -e . or:

pip install git+https://github.com/bytedance/MVDream

Model Card

Our models are provided on the Huggingface Model Page with the OpenRAIL license.

Model Base Model Resolution
sd-v2.1-base-4view Stable Diffusion 2.1 Base 4x256x256
sd-v1.5-4view Stable Diffusion 1.5 4x256x256

By default, we use the SD-2.1-base model in our experiments.

Note that you don't have to manually download the checkpoints for the following scripts.

Text-to-Image

You can simply generate multi-view images by running the following command:

python scripts/t2i.py --text "an astronaut riding a horse"

We also provide a gradio script to try out with GUI:

python scripts/gradio_app.py

Usage

Load the Model

We provide two ways to load the models of MVDream:

  • Automatic: load the model config with model name and weights from huggingface.
from mvdream.model_zoo import build_model
model = build_model("sd-v2.1-base-4view")
  • Manual: load the model with a config file and a checkpoint file.
from omegaconf import OmegaConf
from mvdream.ldm.util import instantiate_from_config
config = OmegaConf.load("mvdream/configs/sd-v2-base.yaml")
model = instantiate_from_config(config.model)
model.load_state_dict(torch.load("path/to/sd-v2.1-base-4view.th", map_location='cpu'))

Inference

Here is a simple example for model inference:

import torch
from mvdream.camera_utils import get_camera
model.eval()
model.cuda()
with torch.no_grad():
    noise = torch.randn(4,4,32,32, device="cuda") # batch of 4x for 4 views, latent size 32=256/8
    t = torch.tensor([999]*4, dtype=torch.long, device="cuda") # same timestep for 4 views
    cond = {
        "context": model.get_learned_conditioning([""]*4).cuda(), # text embeddings
        "camera": get_camera(4).cuda(),
        "num_frames": 4,
    }
    eps = model.apply_model(noise, t, cond=cond)

Acknowledgement

This repository is heavily based on Stable Diffusion. We would like to thank the authors of these work for publicly releasing their code.

Citation

@article{shi2023MVDream,
  author = {Shi, Yichun and Wang, Peng and Ye, Jianglong and Mai, Long and Li, Kejie and Yang, Xiao},
  title = {MVDream: Multi-view Diffusion for 3D Generation},
  journal = {arXiv:2308.16512},
  year = {2023},
}