GPU memory
freecui opened this issue · 13 comments
Could you tell me how much GPU memory is needed? I use an 8 GB GPU, but I get 'CUDA out of memory'. I have changed the max_mel_frames and tacotron_batch_size parameters, but I still can't resolve the out-of-memory error. Is there any other way to solve 'CUDA out of memory' with an 8 GB GPU?
You might increase the reduction factor to 5 or larger to fit a smaller GPU memory, with little side effect on evaluation quality. I am using a GTX 1080 Ti.
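To give an intuition for why this helps, here is a rough sketch (not code from this repo, and the frame count is made up): with reduction factor r the autoregressive decoder emits r mel frames per step, so the number of decoder steps, and the activations kept around for backprop, shrink roughly by a factor of r.

```python
# Back-of-the-envelope sketch: how the reduction factor r cuts decoder steps.
max_mel_frames = 1000  # hypothetical clip length in mel frames

for r in (1, 2, 5, 10):
    decoder_steps = -(-max_mel_frames // r)  # ceiling division
    print(f"r={r:2d} -> {decoder_steps:4d} decoder steps")
```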
I have set '--n-frames-per-step=5', but I get 'RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED' after the process runs for some steps.
The error:
Traceback (most recent call last):
File "train.py", line 414, in
main()
File "train.py", line 339, in main
scaled_loss.backward()
File "/home/avatar/.local/lib/python3.6/site-packages/torch/tensor.py", line 166, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/avatar/.local/lib/python3.6/site-packages/torch/autograd/init.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Strange, you might try r=10 to test whether your GPU memory is all right.
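If you want to check whether an out-of-memory condition is hiding behind the cuDNN message, one generic trick (standard PyTorch switches, not settings taken from train.py) is to disable the cuDNN autotuner, whose benchmarking workspaces can push an almost-full GPU over the edge, and print how much memory PyTorch is currently holding:

```python
import torch

# Sketch only: standard PyTorch knobs, not repo-specific configuration.
torch.backends.cudnn.benchmark = False   # skip autotuner workspace allocations
# torch.backends.cudnn.enabled = False   # optional: bypass cuDNN kernels entirely

if torch.cuda.is_available():
    print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.0f} MiB")
    print(f"peak:      {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
```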
What do you mean by 'r=10'? I printed my GPU status while training:
Mon Mar 16 08:56:01 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.26 Driver Version: 440.26 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:01:00.0 Off | N/A |
| 34% 58C P2 144W / 250W | 7971MiB / 7981MiB | 64% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1107 G /usr/lib/xorg/Xorg 18MiB |
| 0 1285 G /usr/bin/gnome-shell 49MiB |
| 0 2462 G /usr/lib/xorg/Xorg 93MiB |
| 0 2797 G /usr/bin/gnome-shell 92MiB |
| 0 4264 G ...uest-channel-token=15602922648757120850 31MiB |
| 0 12961 C python3 7679MiB |
+-----------------------------------------------------------------------------+
Traceback (most recent call last):
File "train.py", line 414, in
main()
File "train.py", line 339, in main
scaled_loss.backward()
File "/home/avatar/.local/lib/python3.6/site-packages/torch/tensor.py", line 166, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/avatar/.local/lib/python3.6/site-packages/torch/autograd/init.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
I wonder if GPU memory usage increases with the number of training steps, so 'RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED' may actually be an out-of-memory error.
There might be memory thrashing during training with PyTorch due to its dynamic-graph design. You could resume the bash script from the latest checkpoint, reduce the batch size to 24, or increase the reduction factor to 5 or more to leave some headroom in GPU memory.
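To confirm whether usage really creeps up with the step count, you could log the allocator's peak every few hundred steps. A minimal sketch using the standard torch.cuda counters (the helper name and interval below are made up for illustration, not taken from train.py):

```python
import torch

def log_gpu_memory(step, every=200):
    """Print current and peak GPU memory held by PyTorch every `every` steps."""
    if step % every == 0 and torch.cuda.is_available():
        alloc = torch.cuda.memory_allocated() / 2**20
        peak = torch.cuda.max_memory_allocated() / 2**20
        print(f"step {step}: allocated {alloc:.0f} MiB, peak {peak:.0f} MiB")
        torch.cuda.reset_max_memory_allocated()  # start a fresh peak window
```

If the peak climbs steadily between calls, the crash is memory pressure rather than a genuine cuDNN bug.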
By the way, t2 has been updated with a new and better STFT in preprocessing.
And I can tell you a secret: the 1080 Ti is more stable than the 2080 Ti, with far fewer CUDNN_STATUS_EXECUTION_FAILED errors, even though NVIDIA has discontinued it.
I don't have a 1080 Ti or a 2080 Ti; I use a 2080 SUPER. I didn't buy a 1080 Ti because, as you said, it has been discontinued.
Did you drop any very long texts or clips during preprocessing?
Yes, I did.
I have set '--n-frames-per-step=7' and reduced the batch size to 24, and training uses 5661 MB; 'RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED' no longer appears, so I think the cuDNN error was due to GPU memory.
Of course, and I am glad you have made it work. If the error appears again, just continue training; the latest checkpoint will have been saved automatically.
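For reference, resuming typically just means loading the newest checkpoint before training continues. A generic PyTorch pattern, with a placeholder model and a hypothetical checkpoint layout that will not match this repo exactly:

```python
import glob
import os
import torch

# Placeholder model/optimizer so the sketch is self-contained; in practice
# these are the Tacotron model and optimizer that train.py builds.
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.Adam(model.parameters())

# Hypothetical checkpoint directory and key names, for illustration only.
checkpoints = glob.glob("checkpoints/*.pt")
if checkpoints:
    latest = max(checkpoints, key=os.path.getmtime)
    state = torch.load(latest, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state.get("step", 0)
```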
OK, thank you for all your replies.
Fixed in 53f0d31.