Loss becomes NaN after a couple steps (+failed to allocate 4.00G CUDA_ERROR_OUT_OF_MEMORY)
aaaa-trsh opened this issue · 1 comment
aaaa-trsh commented
I'm getting an error that says:
tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 4.00G (4294967296 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
After this error, training starts normally (somehow?), but after around 20 steps the loss becomes NaN.
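(For context: the failed 4.00G allocation usually just means TensorFlow tried to pre-allocate more GPU memory than was free. A minimal sketch of the common memory-growth workaround, assuming TF 2.x; this snippet is not part of train.py:)

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing a large block up front,
# which is often what produces the CUDA_ERROR_OUT_OF_MEMORY line above.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
```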
I am attempting to train YOLOv3-tiny on my own dataset, which has a single class:
python train.py --batch_size 8 --dataset ./data/custom_train.tfrecord --val_dataset ./data/custom_val.tfrecord --epochs 10 --mode eager_fit --transfer none --tiny
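For reference, if the NaN turns out to be unrelated to the memory warning, loss diverging within the first ~20 steps is commonly a learning-rate or exploding-gradient problem. A generic stabilization sketch (illustrative values, plain tf.keras Adam; not necessarily what train.py configures):

```python
import tensorflow as tf

# Illustrative only: a smaller learning rate plus gradient-norm clipping
# often keeps the loss finite when it blows up to NaN within a few steps.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, clipnorm=1.0)
```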
Can anyone help me understand why this is happening?
berserkr commented
For me, only CUDA 10.1 worked; nothing else did. What version of CUDA are you running?
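A quick way to check what your environment actually reports (assuming TF 2.x; the installed CUDA toolkit version itself can be confirmed with nvidia-smi or nvcc --version):

```python
import tensorflow as tf

# Print the TensorFlow version, whether this build was compiled against CUDA,
# and whether the GPU is actually visible to TensorFlow at runtime.
print(tf.__version__)
print(tf.test.is_built_with_cuda())
print(tf.config.experimental.list_physical_devices('GPU'))
```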