Scaling regularization falls to NaN after ~500 iterations
mbernier-arcturus opened this issue · 6 comments
There seems to be an issue with how scaling is computed: it always falls to NaN after a while. In most experiments, changing the opacity and scaling regularization parameters either drives training toward a "scaling_reg = NaN" error, causes GPU memory to grow until it is full and training stalls (issue #6), or completes successfully. I'm wondering if there is a bug in the CUDA code under submodules/diff-gaussian-rasterization, since we already have to modify a line to make it work (issue #4).
Training progress: 3%|█████▋ | 800/30000 [00:22<15:26, 31.51it/s, Loss=0.3607646]Loss: 0.3645972013473511 [25/06 14:53:42]
ssim_loss: 0.5151825547218323 [25/06 14:53:42]
opacity_reg: 0.0003654684405773878 [25/06 14:53:42]
scaling_reg: 0.14932723343372345 [25/06 14:53:42]
Training progress: 3%|███████▎ | 1000/30000 [00:27<08:40, 55.76it/s, Loss=nan]Loss: nan [25/06 14:53:47]
ssim_loss: 0.9891835451126099 [25/06 14:53:47]
opacity_reg: 6.815180677222088e-05 [25/06 14:53:47]
scaling_reg: nan [25/06 14:53:47]
Training progress: 4%|████████▋ | 1200/30000 [00:34<09:18, 51.56it/s, Loss=nan]Loss: nan [25/06 14:53:54]
ssim_loss: 0.9988929033279419 [25/06 14:53:54]
opacity_reg: 7.748496136628091e-05 [25/06 14:53:54]
scaling_reg: nan [25/06 14:53:54]
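One way the regularizer itself can produce NaN is an `exp()` overflow to inf on a single exploded log-scale. The sketch below is only an illustration of that failure mode, not the repo's actual code; `clamped_scaling_reg` and `max_log` are hypothetical names, and the mean-of-exp form is an assumption about how the term is computed:

```python
import math

def clamped_scaling_reg(log_scales, max_log=10.0):
    """Hypothetical sketch: a mean-of-scales regularizer with the log-scales
    clamped before exp(), so one exploded Gaussian cannot push the whole
    term to inf (and then NaN through the gradients)."""
    total = 0.0
    for s in log_scales:
        total += math.exp(min(s, max_log))
    return total / len(log_scales)
```

With the clamp in place, even a degenerate log-scale like `1e6` yields a finite value instead of overflowing.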
Can you please specify the setup and the scene you are trying to run, as well as your environment?
The code is only tested on Ubuntu; the specifics are now documented in the README.
Hi, thanks for replying!
I have a rather large scene filmed using 70 cameras (colmap already done), with partial obstructions in some of the views and a busy background.
I am on Windows, torch 2.3.1+cu121, with an RTX 3090.
Playing with either opacity_lr or opacity_reg seems to break scaling as well... I'm currently trying to figure out exactly where it happens. For example, if I increase opacity_reg to 0.9, scaling falls to NaN at 2% of training instead of 6%.
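To pinpoint the first iteration at which each term diverges, a small watcher like the one below can be dropped into the training loop. This is a hypothetical debugging helper, not part of the repo; the loss names (`scaling_reg`, `opacity_reg`) just mirror the log output above:

```python
import math

class NaNWatch:
    """Track named scalar losses and record the first iteration at which
    each one becomes NaN, so the divergence point can be correlated with
    parameter changes like opacity_reg."""

    def __init__(self):
        self.first_nan = {}

    def update(self, iteration, **losses):
        # Record only the first NaN occurrence per loss name.
        for name, value in losses.items():
            if name not in self.first_nan and math.isnan(value):
                self.first_nan[name] = iteration
        return self.first_nan
```

Calling `watch.update(it, scaling_reg=..., opacity_reg=...)` each logging step then yields, e.g., `{'scaling_reg': 1000}` once that term first goes NaN.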
I'm also getting NaN values for the loss, but after 4200 iterations. My scene was initialized with 300,000 points, with cap_max set to 600,000 (default values otherwise).
@mbernier-arcturus or @amballa Can you please share the data you are training on so I can reproduce and debug the issue? I never encountered NaN with the 5 datasets we tested.
Closing this issue due to inactivity. Please feel free to reopen it if you continue to encounter the problem.