/pixelsplat

[CVPR 2024 Oral] Code for "pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction" by David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann

Primary LanguagePythonMIT LicenseMIT

pixelSplat

This is the code for pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction by David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann.

Check out the project website here. We will be presenting pixelSplat at CVPR 2024 in Seattle.

teaser.mp4

Camera-ready Updates

This version of the codebase has been updated slightly to reflect the CVPR camera-ready version of the paper (and the latest version of the paper on arXiv). Here are the changes:

  • The models have all been retrained with small bug fixes and a slight architectural improvement (per-image self-attention + convolution has been replaced with only per-image self-attention in the epipolar transformer). This has improved the results slightly across the board:
Run Name PSNR SSIM LPIPS
re10k (old) 25.89 0.858 0.142
re10k (new) 26.09 0.863 0.136
acid (old) 28.14 0.839 0.150
acid (new) 28.27 0.843 0.146
  • A configuration for 3-view pixelSplat has been added. In general, it's now possible to run pixelSplat with an arbitrary number of views, although you'll need a lot of GPU memory to do so.
  • The original version of the code base can be found at commit 787a8896deb232ec652426bf157f9679d4046c3c.

Installation

To get started, create a virtual environment using Python 3.10+:

python3.10 -m venv venv
source venv/bin/activate
# Install these first! Also, make sure you have python3.11-dev installed if using Ubuntu.
pip install wheel torch torchvision torchaudio
pip install -r requirements.txt

If your system does not use CUDA 12.1 by default, see the troubleshooting tips below.

Troubleshooting

The Gaussian splatting CUDA code (diff-gaussian-rasterization) must be compiled using the same version of CUDA that PyTorch was compiled with. As of December 2023, the version of PyTorch you get when doing pip install torch was built using CUDA 12.1. If your system does not use CUDA 12.1 by default, you can try the following:

  • Install a version of PyTorch that was built using your CUDA version. For example, to get PyTorch with CUDA 11.8, use the following command (more details here):
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  • Install CUDA Toolkit 12.1 on your system. One approach (try this at your own risk!) is to install a second CUDA Toolkit version using the runfile (local) option here. When you run the installer, disable the options that install GPU drivers and update the default CUDA symlinks. If you do this, you can point your system to CUDA 12.1 during installation as follows:
LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64 pip install -r requirements.txt
# If everything else was installed but you're missing diff-gaussian-rasterization, do:
LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64 pip install git+https://github.com/dcharatan/diff-gaussian-rasterization-modified

Acquiring Datasets

pixelSplat was trained using versions of the RealEstate10k and ACID datasets that were split into ~100 MB chunks for use on server cluster file systems. Small subsets of the Real Estate 10k and ACID datasets in this format can be found here. To use them, simply unzip them into a newly created datasets folder in the project root directory.

If you would like to convert downloaded versions of the Real Estate 10k and ACID datasets to our format, you can use the scripts here. Reach out to us if you want the full versions of our processed datasets, which are about 500 GB and 160 GB for Real Estate 10k and ACID respectively.

Acquiring Pre-trained Checkpoints

You can find pre-trained checkpoints here. You can find the checkpoints for the original codebase (without the improvements from the camera-ready version of the paper) here.

Running the Code

Training

The main entry point is src/main.py. Call it via:

python3 -m src.main +experiment=re10k

This configuration requires a single GPU with 80 GB of VRAM (A100 or H100). To reduce memory usage, you can change the batch size as follows:

python3 -m src.main +experiment=re10k data_loader.train.batch_size=1

Our code supports multi-GPU training. The above batch size is the per-GPU batch size.

Evaluation

To render frames from an existing checkpoint, run the following:

# Real Estate 10k
python3 -m src.main +experiment=re10k mode=test dataset/view_sampler=evaluation dataset.view_sampler.index_path=assets/evaluation_index_re10k.json checkpointing.load=checkpoints/re10k.ckpt

# ACID
python3 -m src.main +experiment=acid mode=test dataset/view_sampler=evaluation dataset.view_sampler.index_path=assets/evaluation_index_acid.json checkpointing.load=checkpoints/acid.ckpt

Note that you can also use the evaluation indices that end with _video (in /assets) to render the videos shown on the website.

Ablations

You can run the ablations from the paper by using the corresponding experiment configurations. For example, to ablate the epipolar encoder:

python3 -m src.main +experiment=re10k_ablation_no_epipolar_transformer

Our collection of pre-trained checkpoints includes checkpoints for the ablations.

VS Code Launch Configuration

We provide VS Code launch configurations for easy debugging.

Camera Conventions

Our extrinsics are OpenCV-style camera-to-world matrices. This means that +Z is the camera look vector, +X is the camera right vector, and -Y is the camera up vector. Our intrinsics are normalized, meaning that the first row is divided by image width, and the second row is divided by image height.

Figure Generation Code

We've included the scripts that generate tables and figures in the paper. Note that since these are one-offs, they might have to be modified to be run.

Notes on Bugs

Since the original release of the pixelSplat codebase, the following bugs have been identified:

  • The LPIPS loss was using the wrong input range (0 to 1 instead of -1 to 1). Results should be slightly better with this fixed. Thank you to Katja Schwarz for finding this bug!
  • The view sampler at src/dataset/view_sampler/view_sampler_bounded.py was incorrectly using min_gap in place of max_gap during training. This bug has been fixed, and the training configurations have been updated to reflect the unintended behavior, so this shouldn't affect the results. Note that views sampled during evaluation are chosen differently, so those were not affected. Thank you to Chris Wewer for finding this bug!

Related Papers

Check out the following papers that build on top of pixelSplat's codebase:

  • MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images by Yuedong Chen et al. (webpage, code): This method solves the same problem as pixelSplat using plane sweeping/cost volumes. This yields slightly better novel view synthesis results, much cleaner 3D Gaussian point clouds, and improved cross-dataset generalization.
  • latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction by Christopher Wewer et al. (webpage): This is a generative model built on top of pixelSplat. It predicts semantic Gaussians in 3D space, splats them into 2D, and then converts the resulting 2D features into images using a 2D generative model.

If you used ideas or code from pixelSplat and would like to be featured here, send an email to charatan@mit.edu!

BibTeX

@inproceedings{charatan23pixelsplat,
      title={pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction},
      author={David Charatan and Sizhe Li and Andrea Tagliasacchi and Vincent Sitzmann},
      year={2023},
      booktitle={arXiv},
}

Acknowledgements

This work was supported by the National Science Foundation under Grant No. 2211259, by the Singapore DSTA under DST00OECI20300823 (New Representations for Vision), by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/ Interior Business Center (DOI/IBC) under 140D0423C0075, and by the Amazon Science Hub. The Toyota Research Institute also partially supported this work. The views and conclusions contained herein reflect the opinions and conclusions of its authors and no other entity.