A high-performance C++ and CUDA implementation of 3D Gaussian Splatting, built upon the gsplat rasterization backend.
Reduce training time by half and win the $1500 prize!
Details here: Issue #135
This competition is sponsored by @vincentwoo, @mazy1998, and myself – each contributing $300. @toshas and @ChrisAtKIRI are each contributing $200. Finally, @JulienBlanchon and Drew Moffitt are each contributing $100.
This is a solution for Issue #135. It implements improvements that reduce the total training time from the original 49m 50s to 20m 30s, a speedup of over 2.4x. To get below 20 minutes, I also added an optional trick that sacrifices some quality and brings the time down to 19m 27s.
The key changes can be summarized as follows:
- Instead of the modular gsplat implementation, the new implementation tries to use as few CUDA kernels as possible.
- I redid all the math for the forward and backward pass to skip all unnecessary computations.
- All activation functions of the Gaussians' parameters are fused into the respective kernels of the forward and backward pass.
- I implemented an improved version of the kernel for the blending backward pass based on Taming 3DGS.
- I separated the sorting into depth and tile sorting passes similar to how it is done in Splatshop.
- I integrated tile-based culling and load balancing based on StopThePop.
- Also see `fastgs/rasterization/README.md`.
- After every training iteration, MCMC adds noise to the position of each Gaussian. I fused the required operations into a single kernel.
- The functions for adding and relocating Gaussians could be fused too, but because they do not happen that often during training, the benefit would be small. Therefore, I did not do so and only fixed a few inefficiencies in the libtorch implementation.
- Gaussians are now also "dead" when the quaternion used to represent their rotation has a squared norm below `1e-8`, as optimization and rendering would then be numerically unstable anyway. The new rasterizer also culls such Gaussians during training.
- Like most 3DGS implementations, the old implementation called the libtorch equivalent of `torch.cuda.empty_cache()` after every densification step to solve memory fragmentation issues. This slightly slows down training as temporary memory is first freed and then reallocated. Enabling the torch memory allocator's setting `expandable_segments:True` makes this call unnecessary.
- I removed the `abs()` calls in the opacity and scale regularization losses. They are unnecessary because the Gaussians' opacity and scale are always positive after applying the respective activations.
- The libtorch Adam implementation is slow because it launches all operations for each update step as separate CUDA kernels. I implemented a fused version of the Adam optimizer step that is significantly faster.
- During the first 1000 iterations, the higher-degree SH coefficients will not be used and therefore always have zero gradients. Therefore, the optimizer step for those can be skipped for a small speedup.
- (optional) Between iterations 1000 and 25000, the expensive update for higher-degree SH coefficients can be batched over two iterations to achieve a small speedup. This is disabled by default as it seems to slightly reduce quality in my tests.
- The torch dataloader was re-created after every epoch. Now only the random sampler is reset, which avoids unnecessary overhead.
- Image normalization to the [0, 1] range was done on the CPU. Doing it after uploading to the GPU is much faster.
- Images were always loaded from disk, which is actually not that bad with a torch dataloader that uses multiple worker threads, but it has some overhead. Now, heuristics are used to determine whether the dataset fits into GPU memory. If yes, images are cached in VRAM.
- Since the VRAM needed for storing the view matrices and camera positions for all views is negligible, they are now precomputed and stored in VRAM. Example: 10k views -> 0.76 MB (could be lower as I store the full 4x4 view matrix instead of just the relevant 3x4 part)
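The 0.76 MB figure in the last item can be reproduced with a quick back-of-the-envelope calculation. The sketch below assumes 32-bit floats, one view matrix plus one 3-component camera position per view, and a hypothetical helper name; it is not taken from the code base.

```python
BYTES_PER_FLOAT32 = 4  # assumption: matrices and positions are stored as float32


def camera_cache_bytes(num_views: int, matrix_rows: int = 4, matrix_cols: int = 4) -> int:
    """Bytes needed to cache one view matrix and one camera position per view."""
    floats_per_view = matrix_rows * matrix_cols + 3  # view matrix + xyz position
    return num_views * floats_per_view * BYTES_PER_FLOAT32


full = camera_cache_bytes(10_000)                    # full 4x4 view matrix
compact = camera_cache_bytes(10_000, matrix_rows=3)  # only the relevant 3x4 part
print(f"4x4: {full / 1e6:.2f} MB, 3x4: {compact / 1e6:.2f} MB")
# prints "4x4: 0.76 MB, 3x4: 0.60 MB"
```

Under these assumptions, storing only the 3x4 part would save about 0.16 MB per 10k views, which supports the point that the difference is negligible.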
Timings are quite consistent across runs as long as one does not touch the PC during benchmarking.
=========================================
BENCHMARK SUMMARY
=========================================
garden : 00h 02m 28s
bicycle : 00h 02m 14s
stump : 00h 02m 20s
bonsai : 00h 03m 17s
counter : 00h 03m 35s
kitchen : 00h 03m 42s
room : 00h 02m 54s
-----------------------------------------
Total time: 0h 20m 30s
=========================================
With the optional SH trick enabled, training is slightly faster and the total is below 20 minutes:
=========================================
BENCHMARK SUMMARY
=========================================
garden : 00h 02m 18s
bicycle : 00h 02m 06s
stump : 00h 02m 11s
bonsai : 00h 03m 07s
counter : 00h 03m 25s
kitchen : 00h 03m 35s
room : 00h 02m 45s
-----------------------------------------
Total time: 0h 19m 27s
=========================================
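The totals and the quoted speedup can be checked by summing the per-scene timings from the two summaries. The helper name below is hypothetical and only used for this check; the 49m 50s baseline is the original total stated earlier.

```python
import re


def to_seconds(stamp: str) -> int:
    """Convert a string like '00h 02m 28s' to seconds."""
    match = re.fullmatch(r"(\d+)h (\d+)m (\d+)s", stamp)
    hours, minutes, seconds = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds


# Per-scene timings in the order garden, bicycle, stump, bonsai, counter, kitchen, room.
default_run = ["00h 02m 28s", "00h 02m 14s", "00h 02m 20s", "00h 03m 17s",
               "00h 03m 35s", "00h 03m 42s", "00h 02m 54s"]
sh_trick_run = ["00h 02m 18s", "00h 02m 06s", "00h 02m 11s", "00h 03m 07s",
                "00h 03m 25s", "00h 03m 35s", "00h 02m 45s"]

original_total = to_seconds("00h 49m 50s")
default_total = sum(map(to_seconds, default_run))
sh_total = sum(map(to_seconds, sh_trick_run))

print(default_total, sh_total)  # 1230 1167 (20m 30s and 19m 27s)
print(round(original_total / default_total, 2))  # 2.43
print(round(original_total / sh_total, 2))       # 2.56
```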
Here are the average quality metrics that I got with the original implementation and the new one without and with the optional SH trick enabled. Note that metric computation still uses gsplat to render the images. This highlights that the new rasterizer used during training is not required during inference to obtain high quality.
| Version | PSNR | SSIM | LPIPS |
|---|---|---|---|
| original | 29.235601 | 0.879749 | 0.227879 |
| solution | 29.277397 | 0.879477 | 0.228021 |
| solution w/ SH trick | 29.226835 | 0.879472 | 0.228022 |
I spent a lot of time making sure the new implementation does not reduce quality. Note that the results in this table do not confirm that the new implementation is strictly better than the original one. There are small differences in how things are computed in a numerically stable manner in the rasterizer. However, this did not make a measurable difference in practice. The main problem when repeatedly testing both the original and the new implementation is that metrics fluctuate between runs, making it impossible to tell which implementation achieves better quality.
- Improve the viewer, i.e., better camera controls, more interactive features.
- Migrate UI from Dear ImGui to RmlUi.
- Support SuperSplat-like editing features, just more interactive.
Contributions are very welcome!
Please open pull requests towards the dev branch. On dev, changes will be licensed as GPLv3. Once we have reached a new stable state on dev (the viewer will be improved as the next priority), we will merge back to master. The repo will then be licensed as GPLv3.
- [2025-06-28]: A docker dev container has arrived.
- [2025-06-27]: Removed submodules. Dependencies are now managed via vcpkg. This simplifies the build process and reduces complexity.
- [2025-06-26]: We have new sponsors, each adding $200, for a total prize pool of $1300!
The implementation uses weights/lpips_vgg.pt, which is exported from torchmetrics.image.lpip.LearnedPerceptualImagePatchSimilarity with:
- Network type: VGG
- Normalize: False (model expects inputs in [-1, 1] range)
- Model includes: VGG backbone with pretrained ImageNet weights and the scaling normalization layer
Note: While the model was exported with normalize=False, the C++ implementation handles the [0,1] to [-1,1] conversion internally during LPIPS computation, ensuring compatibility with images loaded in [0,1] range.
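The range conversion mentioned in the note is a simple affine map, x -> 2x - 1. A minimal sketch in plain Python follows; the function name is illustrative and the C++ implementation itself is not shown here.

```python
def to_lpips_range(pixels):
    """Map pixel values from [0, 1] to [-1, 1] via x -> 2x - 1."""
    return [2.0 * p - 1.0 for p in pixels]


print(to_lpips_range([0.0, 0.25, 0.5, 1.0]))  # [-1.0, -0.5, 0.0, 1.0]
```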
| Scene | Iteration | PSNR | SSIM | LPIPS | Num Gaussians |
|---|---|---|---|---|---|
| garden | 30000 | 27.538504 | 0.866146 | 0.148426 | 1000000 |
| bicycle | 30000 | 25.771051 | 0.790709 | 0.244115 | 1000000 |
| stump | 30000 | 27.141726 | 0.805854 | 0.246617 | 1000000 |
| bonsai | 30000 | 32.586533 | 0.953505 | 0.224543 | 1000000 |
| counter | 30000 | 29.346529 | 0.923511 | 0.223990 | 1000000 |
| kitchen | 30000 | 31.840155 | 0.938906 | 0.141826 | 1000000 |
| room | 30000 | 32.511021 | 0.938708 | 0.253696 | 1000000 |
| mean | 30000 | 29.533646 | 0.888191 | 0.211888 | 1000000 |
For reference, the metrics for the official gsplat-mcmc implementation are listed below. However, the LPIPS results are not directly comparable, as the gsplat-mcmc implementation uses a different LPIPS model.
| Scene | Iteration | PSNR | SSIM | LPIPS | Num Gaussians |
|---|---|---|---|---|---|
| garden | 30000 | 27.307266 | 0.854643 | 0.103883 | 1000000 |
| bicycle | 30000 | 25.615253 | 0.774689 | 0.182401 | 1000000 |
| stump | 30000 | 26.964493 | 0.789816 | 0.162758 | 1000000 |
| bonsai | 30000 | 32.735737 | 0.953360 | 0.105922 | 1000000 |
| counter | 30000 | 29.495266 | 0.924103 | 0.129898 | 1000000 |
| kitchen | 30000 | 31.660593 | 0.935315 | 0.087113 | 1000000 |
| room | 30000 | 32.265732 | 0.937518 | 0.132472 | 1000000 |
| mean | 30000 | 29.434906 | 0.881349 | 0.129207 | 1000000 |
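The mean rows in both tables are plain arithmetic means over the seven scenes. As a quick check, the snippet below recomputes the mean PSNR of each table from the per-scene values and prints the difference; the variable names are only for illustration.

```python
from statistics import fmean

# Per-scene PSNR in the order garden, bicycle, stump, bonsai, counter, kitchen, room.
psnr_this = [27.538504, 25.771051, 27.141726, 32.586533,
             29.346529, 31.840155, 32.511021]
psnr_gsplat_mcmc = [27.307266, 25.615253, 26.964493, 32.735737,
                    29.495266, 31.660593, 32.265732]

mean_this = fmean(psnr_this)
mean_ref = fmean(psnr_gsplat_mcmc)
print(f"{mean_this:.6f} {mean_ref:.6f} {mean_this - mean_ref:+.3f} dB")
# prints "29.533646 29.434906 +0.099 dB"
```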
Join our growing community for discussions, support, and updates:
- 💬 Discord Community - Get help, share results, and discuss development
- 🌐 mrnerf.com - Visit our website for more resources
- 📚 Awesome 3D Gaussian Splatting - Comprehensive paper list and resources
- 🐦 @janusch_patas - Follow for the latest updates
- Linux (tested with Ubuntu 22.04+) or Windows
- CMake 3.24 or higher
- C++23 compatible compiler (GCC 14+ or Clang 17+)
- CUDA 12.8 or higher (required; older CUDA versions are no longer supported)
- Python with development headers
- LibTorch 2.7.0 - Setup instructions below
- vcpkg for dependency management
- Other dependencies are handled automatically by vcpkg
⚠️ Important: This project now requires CUDA 12.8+ and C++23. If you need to use older CUDA versions, please use an earlier release of this project.
- NVIDIA GPU with CUDA support and compute capability 7.5+
- Successfully tested: RTX 4090, RTX A5000, RTX 3090Ti, A100, RTX 2060 SUPER
- Known issue with RTX 3080Ti on larger datasets (see #21)
- GPU Memory: Minimum 8GB VRAM recommended for training
If you successfully run on other hardware, please share your experience in the Discussions section!
# Set up vcpkg once
git clone https://github.com/microsoft/vcpkg.git
cd vcpkg && ./bootstrap-vcpkg.sh -disableMetrics && cd ..
export VCPKG_ROOT=/path/to/vcpkg # ideally should put this in ~/.bashrc to make it permanent
# Clone the repository
git clone https://github.com/MrNeRF/gaussian-splatting-cuda
cd gaussian-splatting-cuda
# Download and setup LibTorch 2.7.0 with CUDA 12.8 support
wget https://download.pytorch.org/libtorch/cu128/libtorch-cxx11-abi-shared-with-deps-2.7.0%2Bcu128.zip
unzip libtorch-cxx11-abi-shared-with-deps-2.7.0+cu128.zip -d external/
rm libtorch-cxx11-abi-shared-with-deps-2.7.0+cu128.zip
# Build the project
cmake -B build -DCMAKE_BUILD_TYPE=Release -G Ninja
cmake --build build -- -j$(nproc)
Instructions must be run in the VS Developer Command Prompt.
# Set up vcpkg once
git clone https://github.com/microsoft/vcpkg.git
cd vcpkg && .\bootstrap-vcpkg.bat -disableMetrics && cd ..
set VCPKG_ROOT=%CD%\vcpkg
# Note: Add VCPKG_ROOT to your system environment variables permanently via System Properties > Advanced > Environment Variables
# Clone the repository
git clone https://github.com/MrNeRF/gaussian-splatting-cuda
cd gaussian-splatting-cuda
# Download and setup LibTorch 2.7.0 with CUDA 12.8 support
# Create directories for debug and release versions (create directories if they don't exist)
if not exist external mkdir external
if not exist external\debug mkdir external\debug
if not exist external\release mkdir external\release
# LibTorch must be downloaded separately for debug and release in Windows
# Download and extract debug version
curl -L -o libtorch-win-shared-with-deps-debug-2.7.0+cu128.zip https://download.pytorch.org/libtorch/cu128/libtorch-win-shared-with-deps-debug-2.7.0%2Bcu128.zip
tar -xf libtorch-win-shared-with-deps-debug-2.7.0+cu128.zip -C external\debug
del libtorch-win-shared-with-deps-debug-2.7.0+cu128.zip
# Download and extract release version
curl -L -o libtorch-win-shared-with-deps-2.7.0+cu128.zip https://download.pytorch.org/libtorch/cu128/libtorch-win-shared-with-deps-2.7.0%2Bcu128.zip
tar -xf libtorch-win-shared-with-deps-2.7.0+cu128.zip -C external\release
del libtorch-win-shared-with-deps-2.7.0+cu128.zip
# Build the project
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
Note: Building in Debug mode requires building debug python libraries (python3*_d.dll, python3*_d.lib) separately.
If you encounter CUDA library linking errors:
# Verify CUDA libraries exist
ls -la /usr/local/cuda-12.8/lib64/libcudart*
ls -la /usr/local/cuda-12.8/lib64/libcurand*
ls -la /usr/local/cuda-12.8/lib64/libcublas*
# Install missing development packages
sudo apt install cuda-cudart-dev-12-8 cuda-curand-dev-12-8 cuda-cublas-dev-12-8
Ensure you have a modern compiler:
# Check compiler version
gcc --version # Should be 14+ for full C++23 support
# Install GCC 14 (Ubuntu 24.04+)
sudo apt update
sudo apt install gcc-14 g++-14 gfortran-14 # updating gfortran to 14 is not necessary for this project but compilation for other projects could break with gfortran still staying at 13 while gcc/g++ is 14
# Set as default
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-14 60
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-14 60
sudo update-alternatives --install /usr/bin/gfortran gfortran /usr/bin/gfortran-14 60
# Select the gcc/g++-14
sudo update-alternatives --config gcc
sudo update-alternatives --config g++
sudo update-alternatives --config gfortran
GCC 14 must be built from source on Ubuntu 22.04:
# Install build dependencies
sudo apt update
sudo apt install build-essential
sudo apt install libmpfr-dev libgmp3-dev libmpc-dev -y
# Download and build GCC 14.1.0
wget http://ftp.gnu.org/gnu/gcc/gcc-14.1.0/gcc-14.1.0.tar.gz
tar -xf gcc-14.1.0.tar.gz
cd gcc-14.1.0
# Configure build (this may take several minutes)
./configure -v --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu \
--prefix=/usr/local/gcc-14.1.0 --enable-checking=release --enable-languages=c,c++ \
--disable-multilib --program-suffix=-14.1.0
# Build GCC (this will take 1-2 hours depending on your system)
make -j$(nproc)
# Install GCC
sudo make install
# Set up alternatives to use the new GCC version
sudo update-alternatives --install /usr/bin/gcc gcc /usr/local/gcc-14.1.0/bin/gcc-14.1.0 14
sudo update-alternatives --install /usr/bin/g++ g++ /usr/local/gcc-14.1.0/bin/g++-14.1.0 14
# Verify installation
gcc --version
g++ --version
Note: Building GCC from source is time-intensive (1-2 hours). Consider using Ubuntu 24.04+ or Docker for faster setup if possible.
This project uses LibTorch 2.7.0 for optimal performance and compatibility:
- Enhanced Performance: Improved optimization and memory management
- API Stability: Latest stable PyTorch C++ API
- CUDA 12.8 Support: Full compatibility with modern CUDA versions
- C++23 Features: Leverages modern C++ features for better performance
- Bug Fixes: Resolved optimizer state management issues
Note: Make sure to download the CUDA 12.8 compatible version of LibTorch as shown in the build instructions.
This project also supports a Docker-based environment for simplified setup and reproducible builds with CUDA 12.8 support.
- Docker
- Docker Compose
- NVIDIA Container Toolkit (required for GPU support)
To build and run the container, use the provided helper script:
# Build the Docker image (with cache)
./docker/run_docker.sh -b
# OR build without cache
./docker/run_docker.sh -n
# Start the container and enter it
./docker/run_docker.sh -u
# Stop and remove containers
./docker/run_docker.sh -c
# Build and start the Docker image with CUDA 12.8
./docker/run_docker.sh -bu 12.8.0
This will mount your current project directory into the container, enabling live development.
GPU acceleration and GUI support (e.g., OpenGL viewers) are enabled if supported by your system.
- Update CUDA to 12.8+ if using an older version
- Update compiler to support C++23 (GCC 14+ or Clang 17+)
- Download the new LibTorch version with CUDA 12.8 support using the build instructions
- Clean your build directory: `rm -rf build/`
- Rebuild the project
Download the dataset from the original repository: Tanks & Trains Dataset
Extract it to the data folder in the project root.
- `-d, --data-path [PATH]`
  Path to the training data containing COLMAP sparse reconstruction (required)
- `-o, --output-path [PATH]`
  Path to save the trained model (default: `./output`)
- `-i, --iter [NUM]`
  Number of training iterations (default: 30000)
  - Paper suggests 30k, but 6k-7k often yields good preliminary results
  - Outputs are saved every 7k iterations and at completion
- `-r, --resolution [NUM]`
  Set the resolution for training images
  - `-1`: Use original resolution (default)
  - Positive values: Target resolution for image loading
- `--steps-scaler [NUM]`
  Scale all training steps by this factor (default: 1)
  - Multiplies iterations, refinement steps, and evaluation/save intervals
  - Creates multiple scaled checkpoints for each original step
- `--max-cap [NUM]`
  Maximum number of Gaussians for MCMC strategy (default: 1000000)
  - Controls the upper limit of Gaussian splats during training
  - Useful for memory-constrained environments
- `--images [FOLDER]`
  Images folder name (default: `images`)
  - Options: `images`, `images_2`, `images_4`, `images_8`
  - Mip-NeRF 360 dataset uses different resolutions
- `--test-every [NUM]`
  Every N-th image is used as a test image (default: 8)
  - Used for train/validation split
- `--eval`
  Enable evaluation during training
  - Computes metrics (PSNR, SSIM, LPIPS) at specified steps
  - Evaluation steps defined in `parameter/optimization_params.json`
- `--save-eval-images`
  Save evaluation images during training
  - Requires `--eval` to be enabled
  - Saves comparison images and depth maps (if applicable)
- `--render-mode [MODE]`
  Render mode for training and evaluation (default: `RGB`)
  - `RGB`: Color only
  - `D`: Accumulated depth only
  - `ED`: Expected depth only
  - `RGB_D`: Color + accumulated depth
  - `RGB_ED`: Color + expected depth
- `--headless`
  Enable the Visualization mode
  - Displays the current state of the Gaussian splatting in a window
  - Useful for debugging and monitoring training progress
- `--bilateral-grid`
  Enable bilateral grid for appearance modeling
  - Helps with per-image appearance variations
  - Adds TV (Total Variation) regularization
- `--sh-degree-interval [NUM]`
  Interval for increasing spherical harmonics degree
  - Controls how often SH degree is incremented during training
- `-h, --help`
  Display the help menu
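To illustrate the `--test-every` option listed above, here is a minimal sketch of a hold-out split in which every N-th view becomes a test view. It assumes that counting starts at index 0, which is a common convention in 3DGS code but is not confirmed for this project; the function name is hypothetical.

```python
def split_views(num_views: int, test_every: int = 8):
    """Return (train_indices, test_indices); every `test_every`-th view is held out."""
    # Assumption: views 0, N, 2N, ... are the test views.
    test = [i for i in range(num_views) if i % test_every == 0]
    train = [i for i in range(num_views) if i % test_every != 0]
    return train, test


train, test = split_views(20)
print(test)        # [0, 8, 16]
print(len(train))  # 17
```

With the default of 8, roughly one eighth of the images is held out for evaluation and the rest is used for training.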
Basic training:
./build/gaussian_splatting_cuda -d /path/to/data -o /path/to/output
MCMC training with limited Gaussians:
./build/gaussian_splatting_cuda -d /path/to/data -o /path/to/output --max-cap 500000
Training with evaluation and custom settings:
./build/gaussian_splatting_cuda \
-d data/garden \
-o output/garden \
--images images_4 \
--test-every 8 \
--eval \
--save-eval-images \
--render-mode RGB_D \
-i 30000
Force overwrite existing output:
./build/gaussian_splatting_cuda -d data/garden -o output/garden -f
Training with step scaling for multiple checkpoints:
./build/gaussian_splatting_cuda \
-d data/garden \
-o output/garden \
--steps-scaler 3 \
-i 10000
The implementation uses JSON configuration files located in the parameter/ directory:
`optimization_params.json` controls training hyperparameters, including:
- Learning rates for different components
- Regularization weights
- Refinement schedules
- Evaluation and save steps
- Render mode settings
- Bilateral grid parameters
Key parameters can be overridden via command-line options.
We welcome contributions! Here's how to get started:
- Getting Started:
  - Check out issues labeled as good first issues for beginner-friendly tasks
  - For new ideas, open a discussion or join our Discord
- Development Requirements:
  - C++23 compatible compiler (GCC 14+ or Clang 17+)
  - CUDA 12.8+ for GPU development
  - Follow the build instructions above
- Before Submitting a PR:
  - Apply `clang-format` for consistent code style
  - Use the pre-commit hook: `cp tools/pre-commit .git/hooks/`
  - Discuss new dependencies in an issue first, as we aim to minimize dependencies
  - Ensure your changes work with both Debug and Release builds
This implementation builds upon several key projects:
- gsplat: We use gsplat's highly optimized CUDA rasterization backend, which provides significant performance improvements and better memory efficiency.
- Original 3D Gaussian Splatting: Based on the groundbreaking work by Kerbl et al.
If you use this software in your research, please cite the original work:
@article{kerbl3Dgaussians,
author = {Kerbl, Bernhard and Kopanas, Georgios and Leimkühler, Thomas and Drettakis, George},
title = {3D Gaussian Splatting for Real-Time Radiance Field Rendering},
journal = {ACM Transactions on Graphics},
number = {4},
volume = {42},
month = {July},
year = {2023},
url = {https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/}
}
See LICENSE file for details.
Connect with us:
- 🌐 Website: mrnerf.com
- 📚 Papers: Awesome 3D Gaussian Splatting
- 💬 Discord: Join our community
- 🐦 Twitter: Follow @janusch_patas for development updates
