Reduce VRAM usage
YerldSHO opened this issue · 2 comments
✨ Pixi task (nrgbd_wr in default): python -m neural_graph_mapping.run_mapping --config nrgbd_dataset.yaml neural_graph_map.yaml coslam_eval.yaml --dataset_config.root_dir $NGM_DATA_DIR/nrgbd/ --dataset_config.scene whiteroom $NGM_EXTRA_ARGS --rerun_vis True
/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/site-packages/torch/utils/data/dataloader.py:558: UserWarning: This DataLoader will create 32 worker processes in total. Our suggested max number of worker in current system is 12, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
Traceback (most recent call last):
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/alex/projects/neural_graph_mapping/src/neural_graph_mapping/run_mapping.py", line 2428, in
main()
File "/home/alex/projects/neural_graph_mapping/src/neural_graph_mapping/run_mapping.py", line 2421, in main
neural_graph_map.fit()
File "/home/alex/projects/neural_graph_mapping/src/neural_graph_mapping/run_mapping.py", line 1032, in fit
self._init_mv_training_data()
File "/home/alex/projects/neural_graph_mapping/src/neural_graph_mapping/utils.py", line 83, in wrapper
result = f(*args, **kwargs)
File "/home/alex/projects/neural_graph_mapping/src/neural_graph_mapping/run_mapping.py", line 1692, in _init_mv_training_data
self._nc_rgbd_tensor = torch.empty(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.58 GiB. GPU 0 has a total capacity of 5.78 GiB of which 66.44 MiB is free. Process 11099 has 5.30 GiB memory in use. Including non-PyTorch memory, this process has 116.00 MiB memory in use. Of the allocated memory 9.94 MiB is allocated by PyTorch, and 12.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Exception in thread Thread-1 (_pin_memory_loop):
Traceback (most recent call last):
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 53, in _pin_memory_loop
do_one_step()
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 30, in do_one_step
r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 495, in rebuild_storage_fd
fd = df.detach()
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/multiprocessing/resource_sharer.py", line 86, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/multiprocessing/connection.py", line 508, in Client
answer_challenge(c, authkey)
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/multiprocessing/connection.py", line 752, in answer_challenge
message = connection.recv_bytes(256) # reject large message
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Good afternoon. I was running your code and ran into this problem. What could be causing it, and how can I solve it?
You are running out of VRAM.
The code and parameters were tested with an RTX 4090 (24 GB). I believe there are a few easy ways to save memory (reducing the number of fields optimized in parallel, reducing the number of rays per field), but getting it down to your 6 GB might require a few changes to the code: we currently preallocate a buffer for the keyframes, but they could instead be loaded on demand from disk and/or downsampled without noticeable loss in quality.
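To illustrate the on-demand idea, here is a rough sketch (the class and names below are made up for illustration, not the actual code in this repository): keyframes stay in CPU RAM or on disk, and only a downsampled copy of the frame that is actually needed gets moved to the GPU.

```python
# Minimal sketch of on-demand keyframe loading with downsampling.
# KeyframeStore and its methods are hypothetical, not this repo's API.
import torch
import torch.nn.functional as F


class KeyframeStore:
    """Holds keyframes as CPU tensors and serves downsampled GPU copies on demand."""

    def __init__(self, downscale: int = 2, device: str = "cuda"):
        self.downscale = downscale
        self.device = device
        self._frames: dict[int, torch.Tensor] = {}  # keyframe id -> (4, H, W) RGB-D on CPU

    def add(self, frame_id: int, rgbd: torch.Tensor) -> None:
        # Keep the full-resolution frame on the CPU; pinned memory would
        # speed up the later host-to-device copy.
        self._frames[frame_id] = rgbd.cpu()

    def get(self, frame_id: int) -> torch.Tensor:
        rgbd = self._frames[frame_id]
        if self.downscale > 1:
            # Downsample before the device copy so the GPU only ever sees the small tensor.
            rgbd = F.interpolate(
                rgbd.unsqueeze(0),
                scale_factor=1.0 / self.downscale,
                mode="nearest",  # nearest avoids mixing depth values across edges
            ).squeeze(0)
        return rgbd.to(self.device, non_blocking=True)


if __name__ == "__main__":
    store = KeyframeStore(downscale=2, device="cuda" if torch.cuda.is_available() else "cpu")
    store.add(0, torch.rand(4, 480, 640))  # fake 640x480 RGB-D keyframe
    print(store.get(0).shape)              # -> torch.Size([4, 240, 320])
```

With a downscale factor of 2 the keyframe buffer would shrink by roughly 4x, which is why I think downsampling alone could already make a noticeable difference.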
I'll spend a bit of time on this to see if I can come up with a low-VRAM configuration.
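In the meantime, two cheap things the log itself points at, with the caveat that neither is a real fix: the allocator hint in the OOM message only helps with fragmentation (it probably won't help when a single 4.58 GiB allocation is larger than the free VRAM), and lowering the DataLoader worker count below the 12 suggested in the warning reduces CPU/process overhead rather than VRAM.

```
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
pixi run nrgbd_wr
```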
Thanks for your answer, I'm looking forward to it. I will also keep working on the memory issue on my side, since it affects more than one project.
I would be grateful if you could leave a few pointers on where to start looking.