GPU memory
freecui opened this issue · 13 comments
Could you tell me how much GPU memory is needed? I use an 8 GB GPU, but I get 'CUDA out of memory'. I have changed the max_mel_frames and tacotron_batch_size parameters, but I still can't resolve the out-of-memory error. Is there any other way to solve 'CUDA out of memory' with an 8 GB GPU?
You might increase the reduction factor to 5 or larger to fit a smaller GPU memory, with little side effect on evaluation quality. I am using a GTX 1080 Ti.
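To give an intuition for why this helps, here is a rough sketch (not code from this repo, and the frame count is made up): with reduction factor r the autoregressive decoder emits r mel frames per step, so the number of decoder steps, and the activations kept around for backprop, shrink roughly by a factor of r.

```python
# Back-of-the-envelope sketch: how the reduction factor r cuts decoder steps.
max_mel_frames = 1000  # hypothetical clip length in mel frames

for r in (1, 2, 5, 10):
    decoder_steps = -(-max_mel_frames // r)  # ceiling division
    print(f"r={r:2d} -> {decoder_steps:4d} decoder steps")
```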
I have set '--n-frames-per-step=5', but I get 'RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED' after the process runs for some steps.
The error:
Traceback (most recent call last):
File "train.py", line 414, in
main()
File "train.py", line 339, in main
scaled_loss.backward()
File "/home/avatar/.local/lib/python3.6/site-packages/torch/tensor.py", line 166, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/avatar/.local/lib/python3.6/site-packages/torch/autograd/init.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Strange, you might try r=10 to test whether your GPU memory is all right.
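If you want to check whether an out-of-memory condition is hiding behind the cuDNN message, one generic trick (standard PyTorch switches, not settings taken from train.py) is to disable the cuDNN autotuner, whose benchmarking workspaces can push an almost-full GPU over the edge, and print how much memory PyTorch is currently holding:

```python
import torch

# Sketch only: standard PyTorch knobs, not repo-specific configuration.
torch.backends.cudnn.benchmark = False   # skip autotuner workspace allocations
# torch.backends.cudnn.enabled = False   # optional: bypass cuDNN kernels entirely

if torch.cuda.is_available():
    print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.0f} MiB")
    print(f"peak:      {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
```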
What do you mean by 'r=10'? I printed my GPU status while training:
Mon Mar 16 08:56:01 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.26 Driver Version: 440.26 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:01:00.0 Off | N/A |
| 34% 58C P2 144W / 250W | 7971MiB / 7981MiB | 64% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1107 G /usr/lib/xorg/Xorg 18MiB |
| 0 1285 G /usr/bin/gnome-shell 49MiB |
| 0 2462 G /usr/lib/xorg/Xorg 93MiB |
| 0 2797 G /usr/bin/gnome-shell 92MiB |
| 0 4264 G ...uest-channel-token=15602922648757120850 31MiB |
| 0 12961 C python3 7679MiB |
+-----------------------------------------------------------------------------+
Traceback (most recent call last):
File "train.py", line 414, in
main()
File "train.py", line 339, in main
scaled_loss.backward()
File "/home/avatar/.local/lib/python3.6/site-packages/torch/tensor.py", line 166, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/avatar/.local/lib/python3.6/site-packages/torch/autograd/init.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
I wonder if GPU memory usage increases with the number of training steps, so 'RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED' may actually be an out-of-memory error.
There might be memory thrashing during training with PyTorch due to its dynamic-graph design. You could resume the bash script from the latest checkpoint, reduce the batch size to 24, or increase the reduction factor to 5 or more to leave some headroom in GPU memory.
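To confirm whether usage really creeps up with the step count, you could log the allocator's peak every few hundred steps. A minimal sketch using the standard torch.cuda counters (the helper name and interval below are made up for illustration, not taken from train.py):

```python
import torch

def log_gpu_memory(step, every=200):
    """Print current and peak GPU memory held by PyTorch every `every` steps."""
    if step % every == 0 and torch.cuda.is_available():
        alloc = torch.cuda.memory_allocated() / 2**20
        peak = torch.cuda.max_memory_allocated() / 2**20
        print(f"step {step}: allocated {alloc:.0f} MiB, peak {peak:.0f} MiB")
        torch.cuda.reset_max_memory_allocated()  # start a fresh peak window
```

If the peak climbs steadily between calls, the crash is memory pressure rather than a genuine cuDNN bug.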
By the way, t2 has been updated with a new and better STFT in preprocessing.
And I can tell you a secret: the 1080 Ti is more stable than the 2080 Ti, with far fewer CUDNN_STATUS_EXECUTION_FAILED errors, even though NVIDIA has discontinued it.
I don't have a 1080 Ti or a 2080 Ti; I use a 2080 SUPER. I didn't buy a 1080 Ti because, as you said, it has been discontinued.
Did you drop any very long texts or clips during preprocessing?
Yes, I did.
I have set '--n-frames-per-step=7' and reduced the batch size to 24, and training uses 5661 MB; 'RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED' no longer appears, so I think the cuDNN error was due to GPU memory.
Of course, and I am glad you have made it work. If the error appears again, just continue training; the latest checkpoint will have been saved automatically.
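For reference, resuming typically just means loading the newest checkpoint before training continues. A generic PyTorch pattern, with a placeholder model and a hypothetical checkpoint layout that will not match this repo exactly:

```python
import glob
import os
import torch

# Placeholder model/optimizer so the sketch is self-contained; in practice
# these are the Tacotron model and optimizer that train.py builds.
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.Adam(model.parameters())

# Hypothetical checkpoint directory and key names, for illustration only.
checkpoints = glob.glob("checkpoints/*.pt")
if checkpoints:
    latest = max(checkpoints, key=os.path.getmtime)
    state = torch.load(latest, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state.get("step", 0)
```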
OK, thank you for all your replies.
Fixed in 53f0d31.