
ViewFusion: Learning Composable Diffusion Models for Novel View Synthesis

Fig. 1. Architecture Overview. ViewFusion takes an arbitrary number of unordered and pose-free views coupled with the noise at timestep t-1. The inputs are denoised in parallel using the U-Net conditioned on timestep t and target viewing angle. The model then produces noise predictions and corresponding weights for timestep t. A composed noise prediction, computed as a weighted sum of individual contributions, is then subtracted from the previous timestep prediction. Ultimately, after T timesteps, a fully denoised target view is obtained.
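
The composition step described in the caption can be illustrated with a short sketch (an illustrative example only, not the repository's exact code; tensor names and shapes are assumptions):

import torch

def compose_noise(eps, w):
    # eps: per-view noise predictions of shape (B, N, C, H, W)
    # w:   unnormalized per-view weights of shape (B, N, 1, H, W)
    weights = torch.softmax(w, dim=1)   # normalize the weights across the N views
    return (weights * eps).sum(dim=1)   # composed prediction of shape (B, C, H, W)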

This is the official implementation of "ViewFusion: Learning Composable Diffusion Models for Novel View Synthesis".

@misc{spiegl2024viewfusion,
      title={ViewFusion: Learning Composable Diffusion Models for Novel View Synthesis},
      author={Bernard Spiegl and Andrea Perin and Stéphane Deny and Alexander Ilin},
      year={2024},
      eprint={2402.02906},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Setup

Environment

You can install and activate the conda environment by simply running:

conda env create -f environment.yml
conda activate view-fusion

For ARM-based macOS run:

conda env create -f environment_osx.yml
conda activate view-fusion

Dataset

The version of the NMR ShapeNet dataset we use is hosted by Niemeyer et al. and is downloadable here.
Please note that our current setup is optimized for use in a cluster computing environment and requires sharding.

To shard the dataset, place NMR_Dataset.zip in data/nmr/ and run python data/dataset_prep.py. By default, the dataset is split into four shards. To enable parallelization, the number of shards has to be divisible by the number of GPUs you use, as sketched below.
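
The divisibility requirement ensures that every GPU rank can read an equal number of shards. An illustrative sketch (not part of the repository):

num_shards = 4   # default produced by data/dataset_prep.py
num_gpus = 4     # e.g. a single node with four GPUs

assert num_shards % num_gpus == 0, "shard count must be divisible by GPU count"

shards_per_rank = num_shards // num_gpus
for rank in range(num_gpus):
    shard_ids = list(range(rank * shards_per_rank, (rank + 1) * shards_per_rank))
    print(f"rank {rank} reads shards {shard_ids}")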

Experiments - Work In Progress!

Configurations for various experiments are located in configs/.

Training

To launch training on a single GPU run:

python main.py -c configs/small-v100.yaml -g -t --wandb

For a distributed setup run:

torchrun --nnodes=$NUM_NODES --nproc_per_node=$NUM_GPUS main.py -c configs/small-v100-4.yaml -g -t --wandb

where $NUM_NODES and $NUM_GPUS can, for instance, be replaced by 1 and 4, respectively. This would correspond to a single-node setup with four V100 GPUs.
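For example, that single-node, four-GPU setup would be launched as:

torchrun --nnodes=1 --nproc_per_node=4 main.py -c configs/small-v100-4.yaml -g -t --wandb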

(In case you are using Slurm, more example scripts are available in slurm/.)

Inference

Coming soon.

Eval

Coming soon.

Using Only the Model

If you want to implement your own data pipelines or training procedures, all architecture details are available in model/.

At training time, the model receives:

  • y_0: the target (ground-truth) view of shape (B C H W),
  • y_cond: all of the input views, of shape (B N C H W), where N denotes the total number of views (24 in our case),
  • view_count: of shape (B,), the number of views used as conditioning for each sample in the batch,
  • angle: also of shape (B,), indicating the target viewing angle for each sample.

At inference time, y_0 is omitted; everything else remains the same as during training.
See the paper for full implementation details.
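
As a quick orientation, the expected input shapes can be sketched as follows (the variable names, the 64x64 resolution, and the forward call are assumptions for illustration; see model/ for the actual interface):

import torch

B, N, C, H, W = 8, 24, 3, 64, 64             # batch, views, channels, height, width (resolution assumed)

y_0 = torch.randn(B, C, H, W)                # target (ground-truth) view
y_cond = torch.randn(B, N, C, H, W)          # all input views
view_count = torch.randint(1, N + 1, (B,))   # number of conditioning views per sample
angle = torch.rand(B)                        # target viewing angle per sample (encoding is an assumption)

# At training time the model would receive all four inputs, e.g.:
#   loss = model(y_0=y_0, y_cond=y_cond, view_count=view_count, angle=angle)
# At inference time, y_0 is omitted.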

Resource Requirements

NB: Training configurations require a significant amount of VRAM.
The model referenced in the paper was trained using the configs/multi-view-composable-variable-small-v100-4.yaml configuration for 710k steps (approx. 6.5 days) on 4x V100 GPUs, each with 32GB VRAM.
Pretrained model weights will be made available soon.

Results
