MrNeRF/gaussian-splatting-cuda

Performance Measure

Closed this issue · 5 comments

When I measure performance on my 2080 Ti GPU on the truck and train scenes, I get the following metrics:

Image size: 979 x 546 TRUCK Scene
This repo: trained on RTX 2080Ti, 7000 iterations | PSNR: 23.9 | time: 259 s | splats: 1.38e6
Official repo: trained on RTX 2080Ti, 7000 iterations | PSNR: 24.6 | time: 254 s | splats: 1.85e6

Image size: 979 x 546 TRAIN Scene
This repo: trained on RTX 2080Ti, 7000 iterations | PSNR: 18.5 | time: 209 s | splats: 4.83e5
Official repo: trained on RTX 2080Ti, 7000 iterations | PSNR: 20.1 | time: 180 s | splats: 6.64e5

This repo seems to produce fewer splats than the original, but PSNR and time seem to be better for the original. Is this what you are noticing on a 3090?

MrNeRF commented

I know that my implementation gives a lower splat count, but the performance is much better.
Interesting. When I run those scenes on my RTX 4090 (I don't have a 3090), I get the following measurements at the same resolution, 979 x 546, over 7k iterations:

Truck Scene:
Official Implementation:
02:05, 55.96 it/s, PSNR 24.708, 1,845,626 splats (2:17.92 total including loading and saving)
My Implementation:
84.113 sec, avg 83.2 iter/sec, 1,375,162 splats, PSNR: 24.563 (1:25.42 total including loading and saving)

Train Scene:
Official Implementation:
01:31, 76.39 it/s, PSNR 19.957, 660,769 splats (1:40.95 total including loading and saving)
My Implementation:
63.102 sec, avg 110.9 iter/sec, 489,100 splats, PSNR: 18.068 (1:04.42 total including loading and saving)

So, at least on my graphics card, my implementation runs faster. There are also some measurements by others in the README (measuring the truck scene). So far, all users have measured much faster training with this implementation than with the original implementation.

Might it be due to your older graphics card? The PSNR is quite volatile, isn't it? It depends on which images you compare when evaluating, since the order is random. At least for the truck scene, I have observed that in most cases my PSNR is higher. I have not measured the train scene very often, so I can't say.

=> TL;DR: My implementation is much faster with a lower splat count; about the overall PSNR I'm not sure.

Interesting. On the older GPU (2080 Ti) I wonder why the official repo is faster. My PSNR varies by +/- 1-2 on average, whereas the official repo's PSNR gives almost the same value every run. Maybe I need to look more closely at how they are evaluating.
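One way to take the randomness out of the comparison would be to average PSNR over the same fixed set of held-out images on every run, instead of scoring whichever images the shuffled order happens to pick. A minimal host-side sketch (the function names are just illustrative, not from either repo):

```cuda
#include <cmath>
#include <cstddef>
#include <vector>

// PSNR for images with values in [0, 1]: 10 * log10(1 / MSE).
double psnr(const std::vector<float>& pred, const std::vector<float>& gt) {
    double mse = 0.0;
    for (std::size_t i = 0; i < pred.size(); ++i) {
        double d = static_cast<double>(pred[i]) - static_cast<double>(gt[i]);
        mse += d * d;
    }
    mse /= static_cast<double>(pred.size());
    return 10.0 * std::log10(1.0 / mse);
}

// Averaging over a fixed evaluation set removes the run-to-run variance
// that comes from scoring whatever image the shuffled training order picks.
double eval_psnr(const std::vector<std::vector<float>>& preds,
                 const std::vector<std::vector<float>>& gts) {
    double sum = 0.0;
    for (std::size_t i = 0; i < preds.size(); ++i)
        sum += psnr(preds[i], gts[i]);
    return sum / static_cast<double>(preds.size());
}
```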

MrNeRF commented

I believe it might have something to do with the optimizations, which to a large extent build upon better and more exhaustive use of shared memory. The Ada architecture has more shared memory per SM and might utilize it better, which might negatively impact older architectures.
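A quick way to see that difference on both cards is to query the shared-memory limits with the standard CUDA runtime API. A minimal sketch (not code from either repo):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;
    int smem_per_sm = 0, smem_per_block_optin = 0;

    // Shared memory available per streaming multiprocessor (bytes).
    cudaDeviceGetAttribute(&smem_per_sm,
                           cudaDevAttrMaxSharedMemoryPerMultiprocessor, device);
    // Maximum dynamic shared memory a single block may opt in to (bytes).
    cudaDeviceGetAttribute(&smem_per_block_optin,
                           cudaDevAttrMaxSharedMemoryPerBlockOptin, device);

    printf("shared memory per SM:           %d bytes\n", smem_per_sm);
    printf("opt-in shared memory per block: %d bytes\n", smem_per_block_optin);
    return 0;
}
```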

To test it, you could swap out the gradient computation in backward.cu and replace it with the original computations:
https://github.com/graphdeco-inria/diff-gaussian-rasterization/blob/main/cuda_rasterizer/backward.cu#L399-L557
https://github.com/MrNeRF/gaussian-splatting-cuda/blob/master/cuda_rasterizer/backward.cu#L1C1-L601C2

It might give quite similar performance. But this is only a hypothesis, if you want to try it.
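If you try the swap, you can time just the backward launch with CUDA events to compare the two variants in isolation. A minimal sketch, with a placeholder kernel standing in for the real backward pass:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder standing in for either backward implementation.
__global__ void backwardKernelStub(float* grad, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) grad[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float* grad = nullptr;
    cudaMalloc(&grad, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time only the kernel, excluding data loading and saving.
    cudaEventRecord(start);
    backwardKernelStub<<<(n + 255) / 256, 256>>>(grad, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("backward kernel: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(grad);
    return 0;
}
```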

I replaced the "renderBackwardsCUDA" method and it worked!

This repo before: trained on RTX 2080Ti, 7000 iterations | PSNR: 23.9 | time: 259 s | splats: 1.38e6
This repo after: trained on RTX 2080Ti, 7000 iterations | PSNR: 24.1 | time: 229 s | splats: 1.60e6
Official repo: trained on RTX 2080Ti, 7000 iterations | PSNR: 24.6 | time: 254 s | splats: 1.85e6

MrNeRF commented

You could also set D=2:
https://github.com/MrNeRF/gaussian-splatting-cuda/blob/master/cuda_rasterizer/backward.cu#L386
Maybe this works with your hardware and you see a speedup. It's interesting how these optimizations have a negative effect on older hardware. But I optimized towards an RTX 4090.
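Purely as an illustration, assuming D is a compile-time work/tiling factor of that kind, one could also pick it per target architecture at compile time instead of hard-coding the RTX 4090 value. The names and values below are guesses, not the repo's actual code:

```cuda
// Illustrative only: choose a compile-time work factor per GPU architecture.
// "D" is a stand-in for the tuning constant in backward.cu; the values are
// guesses, not measured optima.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 890   // Ada (e.g. RTX 4090)
    #define D 4
#else                                                // Turing (e.g. RTX 2080 Ti) and older
    #define D 2
#endif

__global__ void backwardTileStub(const float* in, float* out, int n) {
    // Each thread handles D consecutive elements, so a larger D means fewer,
    // heavier threads -- a trade-off that can favor newer architectures.
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * D;
    #pragma unroll
    for (int k = 0; k < D; ++k) {
        int i = base + k;
        if (i < n) out[i] = in[i] * 0.5f;
    }
}
```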