zzh8829/yolov3-tf2

Loss becomes NaN after a couple steps (+failed to allocate 4.00G CUDA_ERROR_OUT_OF_MEMORY)

aaaa-trsh opened this issue · 1 comment

I'm getting an error that says:
tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 4.00G (4294967296 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Despite this error, training still starts normally, but after around 20 steps the loss becomes NaN:
[screenshot: training log showing the loss turning to NaN]
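The allocation message usually just means TensorFlow tried to reserve the whole GPU up front. A minimal sketch of enabling on-demand memory growth instead, assuming TensorFlow 2.x (place this before the model is built, e.g. near the top of train.py):

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing the full
# device at startup, which avoids the 4.00G allocation failure.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
```

If the warning persists, lowering --batch_size further also reduces the peak allocation.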

I am attempting to train YOLOv3-tiny on my own dataset with a single class:
python train.py --batch_size 8 --dataset ./data/custom_train.tfrecord --val_dataset ./data/custom_val.tfrecord --epochs 10 --mode eager_fit --transfer none --tiny
Can anyone help me understand why this is happening?
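A loss that turns NaN after only ~20 steps usually points to an exploding gradient or a class-count/label mismatch rather than the memory warning. A hedged sketch of two common mitigations (this is not the repo's exact training code; model, loss, and train_dataset are placeholders), assuming the standard Keras fit path used by --mode eager_fit:

```python
import tensorflow as tf

# Lower the learning rate and clip gradient norms so one bad batch
# cannot blow up the loss, and stop training as soon as NaN appears.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, clipnorm=1.0)
callbacks = [tf.keras.callbacks.TerminateOnNaN()]

# model.compile(optimizer=optimizer, loss=loss)
# model.fit(train_dataset, epochs=10, callbacks=callbacks)
```

It is also worth double-checking that the class count passed to train.py (if it exposes such a flag; see python train.py --help) matches your single-class .names file, since a mismatch between labels and model outputs is another common source of NaN.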

For me, CUDA 10.1 worked; anything else didn't. What version of CUDA are you running?
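To check which CUDA/cuDNN versions your TensorFlow build expects versus what the GPU setup provides, a quick sketch (get_build_info is available in TF 2.3+; nvidia-smi and nvcc --version show what is installed on the system):

```python
import tensorflow as tf

# Report the TensorFlow version, the CUDA/cuDNN versions it was
# built against, and whether it can actually see the GPU.
print(tf.version.VERSION)
info = tf.sysconfig.get_build_info()
print(info.get('cuda_version'), info.get('cudnn_version'))
print(tf.config.list_physical_devices('GPU'))
```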