19reborn/NeuS2

Training loss: nan using DTU Scan24 data


zzzolo commented

Hi, @19reborn @AuthorityWang @erin47
Thanks for your awesome work.
After installing the related environment, I ran the DTU Scan24 scene with the JSON config config/nerf/dtu.json for testing. The command is:
python scripts/run.py --scene ./data/scan24/dtu_scan24/transform.json --name dtu_scan24 --network dtu.json --n_steps 15000

However, the loss becomes nan around iteration 400 (it happens at roughly the same step every time):

Training: 15%|█▌ | 2263/15000 [01:08<06:41, 32.55step/s, loss=nan]

Do you have any idea how to solve this problem?

I'm encountering the same problem.

Hi @zzzolo, @2454511550Lin, I think the problem may be related to this issue (#8). Could you please check?
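For a quick look at what your run is actually using, something like this should show the current value (the key name is taken from that issue; adjust the config path if your checkout lays it out differently):

# show where (and to what) the Eikonal loss weight is currently set
grep -n "ek_loss_weight" config/nerf/dtu.json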

zzzolo commented

Hi @19reborn, I set ek_loss_weight=0.0 and ran the code with the command python scripts/run.py --scene ./data/scan24/dtu_scan24/transform.json --name dtu_scan24 --network dtu.json --n_steps 15000 in the main folder without the GUI, but it does not help, unlike what is reported in that issue (#8 (comment)).

I also tried running python scripts/run.py with the command below, and the loss still became nan around step 400. Tweaking ek_loss_weight does not help either (I tried 0, 0.1, and 0.2).
python scripts/run.py --scene ./data/scan24/dtu_scan24/transform.json --name dtu --network dtu.json --n_steps 10000
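For reference, one way to run that kind of sweep, as a rough sketch: it assumes GNU sed and that the weight appears literally as "ek_loss_weight": <number> in config/nerf/dtu.json, so back the file up and adjust the path/pattern if your config differs.

# NOTE: assumes GNU sed and a literal "ek_loss_weight": <number> entry in the config
for w in 0.0 0.1 0.2; do
    # overwrite the weight in place, then retrain under a distinct run name
    sed -i -E "s/(\"ek_loss_weight\"[[:space:]]*:[[:space:]]*)[0-9.eE+-]+/\1$w/" config/nerf/dtu.json
    python scripts/run.py --scene ./data/scan24/dtu_scan24/transform.json --name dtu_ek_$w --network dtu.json --n_steps 10000
done

Editing the json between runs keeps the command line identical, so any change in behaviour can be attributed to the weight alone.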

FYI, this may result from the CUDA toolchain not being visible during compilation. In a new container I forgot to add

export PATH="/usr/local/cuda-11.8/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH"

to my ~/.bashrc before compiling, and I got nan loss around iteration 400 (same as @zzzolo). After adding the CUDA paths and recompiling the project, the loss is normal again.
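For anyone hitting the same thing, a rough sketch of the steps (paths assume CUDA 11.8 under /usr/local, and the CMake commands follow the instant-ngp-style build that NeuS2 inherits; adjust both for your setup):

# make sure the intended CUDA toolkit is picked up before configuring
export PATH="/usr/local/cuda-11.8/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH"
nvcc --version   # should report release 11.8

# rebuild from a clean build directory so CMake re-detects the CUDA toolkit
rm -rf build
cmake . -B build
cmake --build build --config RelWithDebInfo -j

Starting from a clean build directory matters here; otherwise the cached CMake configuration may keep pointing at whatever toolkit was detected the first time.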