Scaling regularization falls to NaN after ~500 iterations
mbernier-arcturus opened this issue · 6 comments
There seems to be an issue with how scaling is computed: it always falls to NaN after a while. In most experiments, changing the opacity and scaling regularization parameters either drives training toward a "scaling_reg = NaN" error, causes GPU memory to grow until it is full and training stalls (issue #6), or completes successfully. I'm wondering if there is a bug in the CUDA code under submodules/diff-gaussian-rasterization, since we already have to modify a line to make it work (issue #4).
Training progress: 3%|█████▋ | 800/30000 [00:22<15:26, 31.51it/s, Loss=0.3607646]Loss: 0.3645972013473511 [25/06 14:53:42]
ssim_loss: 0.5151825547218323 [25/06 14:53:42]
opacity_reg: 0.0003654684405773878 [25/06 14:53:42]
scaling_reg: 0.14932723343372345 [25/06 14:53:42]
Training progress: 3%|███████▎ | 1000/30000 [00:27<08:40, 55.76it/s, Loss=nan]Loss: nan [25/06 14:53:47]
ssim_loss: 0.9891835451126099 [25/06 14:53:47]
opacity_reg: 6.815180677222088e-05 [25/06 14:53:47]
scaling_reg: nan [25/06 14:53:47]
Training progress: 4%|████████▋ | 1200/30000 [00:34<09:18, 51.56it/s, Loss=nan]Loss: nan [25/06 14:53:54]
ssim_loss: 0.9988929033279419 [25/06 14:53:54]
opacity_reg: 7.748496136628091e-05 [25/06 14:53:54]
scaling_reg: nan [25/06 14:53:54]
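One way the regularizer itself can produce NaN is an `exp()` overflow to inf on a single exploded log-scale. The sketch below is only an illustration of that failure mode, not the repo's actual code; `clamped_scaling_reg` and `max_log` are hypothetical names, and the mean-of-exp form is an assumption about how the term is computed:

```python
import math

def clamped_scaling_reg(log_scales, max_log=10.0):
    """Hypothetical sketch: a mean-of-scales regularizer with the log-scales
    clamped before exp(), so one exploded Gaussian cannot push the whole
    term to inf (and then NaN through the gradients)."""
    total = 0.0
    for s in log_scales:
        total += math.exp(min(s, max_log))
    return total / len(log_scales)
```

With the clamp in place, even a degenerate log-scale like `1e6` yields a finite value instead of overflowing.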
Can you please specify the setup and the scene you are trying to run, as well as your environment?
The code is only tested on Ubuntu; the specifics are now documented in the README.
Hi, thanks for replying!
I have a rather large scene filmed using 70 cameras (colmap already done), with partial obstructions in some of the views and a busy background.
I am on Windows, torch 2.3.1+cu121, with an RTX 3090.
Playing with either opacity_lr or opacity_reg seems to break scaling as well... I'm currently trying to figure out exactly where it happens. For example, if I increase opacity_reg to 0.9, scaling falls to NaN at 2% of training instead of 6%.
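To pinpoint the first iteration at which each term diverges, a small watcher like the one below can be dropped into the training loop. This is a hypothetical debugging helper, not part of the repo; the loss names (`scaling_reg`, `opacity_reg`) just mirror the log output above:

```python
import math

class NaNWatch:
    """Track named scalar losses and record the first iteration at which
    each one becomes NaN, so the divergence point can be correlated with
    parameter changes like opacity_reg."""

    def __init__(self):
        self.first_nan = {}

    def update(self, iteration, **losses):
        # Record only the first NaN occurrence per loss name.
        for name, value in losses.items():
            if name not in self.first_nan and math.isnan(value):
                self.first_nan[name] = iteration
        return self.first_nan
```

Calling `watch.update(it, scaling_reg=..., opacity_reg=...)` each logging step then yields, e.g., `{'scaling_reg': 1000}` once that term first goes NaN.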
I'm also getting NaN values for the loss, but after 4200 iterations. My scene was initialized with 300,000 points, with cap_max set to 600,000 (default values otherwise).
@mbernier-arcturus or @amballa Can you please share the data you are training on so I can reproduce and debug the issue? I never encountered NaN with the 5 datasets we tested.
Closing this issue due to inactivity. Please feel free to reopen it if you continue to encounter the problem.