drprojects/superpoint_transformer

OutOfMemoryError when running eval.py

vvuonghn opened this issue · 5 comments

Hi @drprojects
Thank you for your research, it is very useful.
I completed training the model on S3DIS data and got the checkpoint. But when I run the command to evaluate the model, the log shows the error below.
I am using RTX 4090 Ti, my GPU memory is free at that time (~24 GB).
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.62 GiB (GPU 0; 23.62 GiB total capacity; 15.77 GiB already allocated; 7.08 GiB free; 15.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Oh, after moving to a server with more GPU memory, I can run eval.py. Testing with the public checkpoints,
GPU memory consumption is ~48 GB. Could the model run eval.py on a 24 GB GPU?

Hello, I've also been trying to test the evaluation script on DALES, and I also ran into an OutOfMemoryError, this time in the fourth cell of the demo_dales.ipynb file:
OutOfMemoryError: CUDA out of memory. Tried to allocate 22.00 MiB (GPU 0; 10.75 GiB total capacity; 10.44 GiB already allocated; 18.50 MiB free; 10.54 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I tried setting max_split_size_mb to 128, but it didn't help.
I am using the spt-2_dales.ckpt provided in the README.
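For reference, max_split_size_mb is not a flag of the script itself but an option of PyTorch's caching allocator, passed through the PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch of setting it before launching the evaluation (the value 128 mirrors the attempt above; tune it to your GPU):

```shell
# Set the CUDA caching-allocator option before launching evaluation.
# max_split_size_mb:128 is the value tried above, not a recommended default.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```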

Hi @vvuonghn thanks for your interest in the project.

Can you please clarify:

  • are you using the same GPU at inference time as the one you used at train time?
  • have you made any modifications to the project, even minor ones? (i.e. what is your git diff 2e155e7c95cb88b24c30c7faaf108951176e17fd)

Also, please provide the full traceback message so we can see what caused the error. Make sure you set CUDA_LAUNCH_BLOCKING=1 before running, this way the traceback will be more accurate.

Hi @drprojects
I am using the same GPU for training and inference (one NVIDIA 4090 Ti, 24 GB). The training process completed and reported scores, but when I run eval.py this bug appears.

I am using S3DIS. My machine has one GPU; I tried setting export CUDA_VISIBLE_DEVICES=0 and the bug still happens.
I did not modify the source code; I just prepared the data and trained the model.
I attach the log files for the train and eval processes below.

train.log
eval.log

CUDA_LAUNCH_BLOCKING=1 is a debug environment variable that makes kernel launches synchronous, so the proper stack trace is reported as soon as an assert is triggered. You should not use it in production, only during debugging. In other words, when debugging code running on CUDA, set CUDA_LAUNCH_BLOCKING=1 if you want access to the proper traceback; otherwise the message you get may be unhelpful in locating the source of the error.
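As a sketch, the variable can be set for a single run like this (the script path, experiment name, and checkpoint path are placeholders for your actual evaluation command):

```shell
# Debug only: synchronous kernel launches make the traceback point at the
# actual failing CUDA call. Arguments below are placeholders.
CUDA_LAUNCH_BLOCKING=1 python src/eval.py experiment=s3dis ckpt_path=/path/to/checkpoint.ckpt
```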

I had a look at the logs you shared. I notice you are using the s3dis_11g config at training time but not at evaluation time. This is probably the reason for your memory error. If you look at the difference between these two configs, you will notice that they prepare the data differently.

The provided pretrained weights are trained with s3dis and not s3dis_11g. We do not provide pretrained weights for s3dis_11g so you will have to train yours and evaluate them using the s3dis_11g config (and not the s3dis as you did).

Also, using s3dis_11g on a 24 GB GPU is a waste of memory. I recommend you use the s3dis config as a default instead and modify some settings to fit into a 24 GB GPU. Have a look at the README for a list of settings you can play with to mitigate CUDA errors. I do not have a 24 GB GPU at hand to help you with that, but I recommend using datamodule.xy_tiling=2 or 3 for a start, and playing with datamodule.sample_graph_k and datamodule.sample_graph_r from there. Have a look at the configs and code to get a grasp of what each does.
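Such settings can be passed as Hydra-style command-line overrides; a hypothetical sketch (the script path, experiment name, and checkpoint path are placeholders for your setup):

```shell
# Hypothetical evaluation command with a memory-reducing override;
# adjust paths and values to your environment.
python src/eval.py experiment=s3dis \
    datamodule.xy_tiling=2 \
    ckpt_path=/path/to/your/checkpoint.ckpt
```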