CUDA illegal memory access when training with web viewer running
Hi, first of all I just wanted to thank you for this amazing project!
I've wanted to leverage a depth camera as a prior for training Gaussian splatting for a while and can't believe it took me this long to stumble upon this project.
I'm currently facing this issue when training with the nerfstudio viewer on:
Traceback (most recent call last):
File "C:\Users\haven\miniconda3\envs\nerfstudio\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\haven\miniconda3\envs\nerfstudio\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\haven\miniconda3\envs\nerfstudio\Scripts\ns-train.exe\__main__.py", line 7, in <module>
sys.exit(entrypoint())
File "C:\Users\haven\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\scripts\train.py", line 262, in entrypoint
main(
File "C:\Users\haven\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\scripts\train.py", line 247, in main
launch(
File "C:\Users\haven\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\scripts\train.py", line 189, in launch
main_func(local_rank=0, world_size=world_size, config=config)
File "C:\Users\haven\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\scripts\train.py", line 100, in train_loop
trainer.train()
File "C:\Users\haven\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\engine\trainer.py", line 261, in train
loss, loss_dict, metrics_dict = self.train_iteration(step)
File "C:\Users\haven\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\utils\profiler.py", line 112, in inner
out = func(*args, **kwargs)
File "C:\Users\haven\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\engine\trainer.py", line 496, in train_iteration
_, loss_dict, metrics_dict = self.pipeline.get_train_loss_dict(step=step)
File "C:\Users\haven\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\utils\profiler.py", line 112, in inner
out = func(*args, **kwargs)
File "C:\Users\haven\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\pipelines\base_pipeline.py", line 302, in get_train_loss_dict
metrics_dict = self.model.get_metrics_dict(model_outputs, batch)
File "C:\Users\haven\code\nerfstudio\dn-splatter\dn_splatter\dn_model.py", line 750, in get_metrics_dict
"rgb_mse": float(rgb_mse),
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
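(Note: as the message above says, CUDA kernel errors are reported asynchronously, so the rgb_mse line in get_metrics_dict is not necessarily where the illegal access actually happens. Re-running with launch blocking enabled should produce a more accurate stack trace; on Windows that would look something like the sketch below, where the ns-train arguments are placeholders.)
> set CUDA_LAUNCH_BLOCKING=1
> ns-train dn-splatter --data <my-data>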
My data was captured with an Azure Kinect sensor using SAI (Spectacular AI). At first I used the included process_sai.py
to preprocess the recorded data, but the transforms.json output gave camera intrinsics that nerfstudio's undistort function didn't like (k4 wasn't 0), so I copied the camera intrinsic values from sai-cli process
(which gave k1 = k2 = p1 = p2 = 0; I'm not sure the distortion values matter much on the Kinect), and training now starts properly.
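In case someone hits the same undistort complaint, here is a minimal sketch of zeroing the distortion coefficients in transforms.json before training. It assumes the coefficients are stored as top-level k1/k2/k3/k4/p1/p2 keys, which may not match what process_sai.py actually writes; the path is a placeholder.
import json
from pathlib import Path

transforms_path = Path("data/my_capture/transforms.json")  # placeholder path
meta = json.loads(transforms_path.read_text())

# Zero out the distortion coefficients that nerfstudio's undistortion rejects.
# Assumes they live at the top level; adjust if they are stored per frame.
for key in ("k1", "k2", "k3", "k4", "p1", "p2"):
    if key in meta:
        meta[key] = 0.0

transforms_path.write_text(json.dumps(meta, indent=4))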
When I run ns-train with the nerfstudio web viewer on, it throws a CUDA illegal memory access error at around 5000-7000 steps; without the web viewer running it finishes without complaining. I've run ns-train multiple times with and without the web viewer, and it only fails when the viewer is running. Has anyone seen similar behavior?
System info:
>>> torch.__version__
'2.1.2+cu118'
>conda list nerfstudio
# Name Version Build Channel
nerfstudio 1.1.3 pypi_0 pypi
>nvidia-smi
Sat Oct 19 21:03:41 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.94 Driver Version: 560.94 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 WDDM | 00000000:06:00.0 On | N/A |
| 0% 62C P2 212W / 350W | 3135MiB / 24576MiB | 76% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Thanks again
@Haven-Lau , I have seen a similar phenomenon in my early experiments. What I found was that the viewer can crash if you move the camera in the viewer so that it does not see the Gaussian scene properly (like turning 180 degrees and looking at nothing). I wonder if this is the same issue for you.
@maturk Thanks for the quick reply!
Yes, it does sound similar. It crashes regardless of where I'm looking (eventually, if I have the viewer running), but it is definitely more likely to crash when I pan around quickly or stare at nothing. I wonder if it's a race condition between ns-train dn-splatter and the viewer. However, today I had my first crash without the viewer running. Is there a way to save checkpoints throughout the training process instead of only at 100%?
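(nerfstudio's trainer config appears to expose periodic checkpointing options on the CLI; a hedged sketch, with flag names not verified against v1.1.3 and the data path as a placeholder:)
> ns-train dn-splatter --data <my-data> --steps-per-save 1000 --save-only-latest-checkpoint False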
Just to make sure, you are using nerfstudio v1.1.3 and gsplat v1.0.0?
Correct
# Name Version Build Channel
nerfstudio 1.1.3 pypi_0 pypi
gsplat 1.0.0 pypi_0 pypi
I'm running Windows, hopefully that's not the cause, but I can try to spin up an Ubuntu environment at some point, since I couldn't get the download scripts running on Windows anyway (the CLI commands differ between operating systems, I think).
Have you tried any other dataset to see if it occurs there? I am wondering if there are some issues with the optimization (densification/culling) due to the depth supervision. Pictures of the scene at or near the crash might help me debug as well. Maybe try turning off the depth loss and see if the crash still happens in that scenario, e.g. with the command sketched below.
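A sketch of such a run with depth supervision disabled (the data path is a placeholder; any other flags you normally pass stay the same):
> ns-train dn-splatter --data <your-data> --pipeline.model.use-depth-loss False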
For my own scene I turned on --pipeline.model.use-normal-loss True --pipeline.model.use-normal-tv-loss True
and that caused it to crash at 51% (15400 steps); without the normal loss it trains to 100% without crashing. I'm not loading any normal maps.
This is what it looked like ~1000 steps before the crash (this time, with the viewer on, it crashed at 12xxx steps instead of 15xxx).
This is one of the training inputs.
I'll try training again with one of the MuSHRoom datasets and report back.
I tried using the MuSHRoom honka kinect short raw
dataset and processed the raw color and depth .mkv files using process_sai.py
, then ran:
> ns-train dn-splatter --data data\honka_processed
--pipeline.model.use-depth-loss True
--pipeline.model.depth-lambda 0.2
--pipeline.model.use-normal-loss True
--pipeline.model.use-normal-tv-loss True
--pipeline.model.normal-supervision depth
normal-nerfstudio --load-normals False
This time I was able to use normal-loss and normal-tv-loss without crashing.
However, this time I saw a degradation issue similar to #68, where towards the end of the training process a bunch of big splats got introduced and some surfaces now have holes.
Could you see if you can reproduce similar issues with the MuSHRoom dataset using the same steps?
Eventually I want to figure out how to process my own raw Kinect data using the same steps as demonstrated in the MuSHRoom paper, since that output seems to be very good. There seems to be quite a gap between processing raw Kinect data with the process_sai
tool vs. the preprocessed Kinect data provided by the MuSHRoom dataset. Or do you think it is my ns-train params?
Hi @Haven-Lau, may I ask which camera poses you use for the MuSHRoom dataset? The MuSHRoom dataparser in dn-splatter also supports the Kinect sequence; the command can look like:
ns-train dn-splatter --data mushroom_sequence
--pipeline.model.use-depth-loss True
--pipeline.model.depth-lambda 0.2
--pipeline.model.use-normal-loss True
--pipeline.model.use-normal-tv-loss True
--pipeline.model.normal-supervision depth
mushroom --load-normals False --mode kinect
What should I do if I want to add a mask to the mushroom dataset?
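For the standard nerfstudio transforms.json format, masks can typically be attached per frame via a mask_path entry pointing at a binary image; whether the MuSHRoom dataparser in dn-splatter reads these is not confirmed here. A minimal sketch under that assumption, with hypothetical paths:
import json
from pathlib import Path

data_dir = Path("data/honka_processed")   # hypothetical dataset root
meta_path = data_dir / "transforms.json"
meta = json.loads(meta_path.read_text())

# Attach a binary mask to every frame (non-zero pixels are kept, zero pixels
# are ignored). Assumes masks live in data_dir/"masks" with matching names.
for frame in meta["frames"]:
    frame["mask_path"] = f"masks/{Path(frame['file_path']).name}"

meta_path.write_text(json.dumps(meta, indent=4))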