hsinyuan-huang/FlowQA

RuntimeError: CUDA error: out of memory

zysNLP opened this issue · 10 comments

When I execute python train_QuAC.py, I get the following errors:

After Input LSTM, the vector_sizes [doc, query] are [ 250 250 ] * 2
Self deep-attention 250 rays in 750-dim space
Before answer span finding, hidden size are 250 250
12/20/2018 04:43:43 [dev] Total number of params: 11852394
12/20/2018 04:43:45 Epoch 1
12/20/2018 04:43:46 updates[ 1] train loss[15.87973] remaining[1:26:47]
12/20/2018 04:44:00 updates[ 21] train loss[10.87507] remaining[0:45:12]
12/20/2018 04:44:14 updates[ 41] train loss[10.16175] remaining[0:44:17]
12/20/2018 04:44:29 updates[ 61] train loss[9.96472] remaining[0:45:30]
12/20/2018 04:44:48 updates[ 81] train loss[9.56536] remaining[0:49:21]
12/20/2018 04:45:03 updates[ 101] train loss[9.38102] remaining[0:48:40]
12/20/2018 04:45:17 updates[ 121] train loss[9.11970] remaining[0:47:07]
12/20/2018 04:45:35 updates[ 141] train loss[8.99858] remaining[0:48:30]
12/20/2018 04:45:49 updates[ 161] train loss[8.76992] remaining[0:47:21]
12/20/2018 04:46:03 updates[ 181] train loss[8.64918] remaining[0:46:42]
12/20/2018 04:46:17 updates[ 201] train loss[8.58265] remaining[0:46:12]
12/20/2018 04:46:30 updates[ 221] train loss[8.47283] remaining[0:45:20]
12/20/2018 04:46:43 updates[ 241] train loss[8.38734] remaining[0:44:24]
12/20/2018 04:46:57 updates[ 261] train loss[8.36940] remaining[0:44:00]
Traceback (most recent call last):
  File "train_QuAC.py", line 324, in <module>
    main()
  File "train_QuAC.py", line 209, in main
    model.update(batch)
  File "/home/zys/文档/FlowQA-master/QA_model/model_QuAC.py", line 83, in update
    score_s, score_e, score_no_answ = self.network(*inputs)
  File "/home/zys/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zys/文档/FlowQA-master/QA_model/detail_model.py", line 306, in forward
    highlvl_self_attn_hiddens = self.highlvl_self_att(x1_att, x1_att, x1_mask, x3=doc_hiddens, drop_diagonal=True)
  File "/home/zys/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zys/文档/FlowQA-master/QA_model/layers.py", line 285, in forward
    alpha = F.softmax(scores, dim=2)
  File "/home/zys/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 889, in softmax
    return input.softmax(dim)
RuntimeError: CUDA error: out of memory

Where in the code could I optimize to reduce memory usage as much as possible? My machine has a 1080 Ti.

Is your batch_size = 1?
You need to set the batch size to 1 to avoid running out of memory.
Besides, you could add torch.cuda.empty_cache() at the end of the update() function in train_CoQA.py.
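For reference, a minimal sketch of what that suggestion might look like, assuming update() roughly follows the traceback above; prepare_inputs and compute_loss are illustrative stand-ins, not the actual FlowQA helpers:

import torch

def update(self, batch):
    self.network.train()
    inputs = self.prepare_inputs(batch)            # hypothetical helper
    score_s, score_e, score_no_answ = self.network(*inputs)
    loss = self.compute_loss(score_s, score_e,     # hypothetical helper
                             score_no_answ, batch)
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()
    # Return cached-but-unused blocks to the driver; this does not shrink
    # live tensors, but can reduce fragmentation-induced OOMs.
    torch.cuda.empty_cache()

Note that empty_cache() only releases PyTorch's cached blocks; if the live activations themselves exceed the card's memory, it will not help on its own.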

Where in the code could I optimize to reduce memory usage as much as possible? My machine has a 1080 Ti.

Have you solved this problem? @zysNLP

Hi, I met the same problem as you. Have you solved it?

May I ask, if I want to run the train_QuAC.py part, roughly how much GPU memory is required?

@a410661 More than 8 GB. I have tried it on a 1070 Ti (8 GB).
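If you want to check the peak usage on your own card, here is a minimal sketch; torch.cuda.max_memory_allocated() reports the high-water mark of tensor allocations, and the exact reset function name depends on your PyTorch version:

import torch

torch.cuda.reset_max_memory_allocated()        # restart the peak counter
model.update(batch)                            # one training step (illustrative)
peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
print('peak GPU memory: %.2f GB' % peak_gb)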

Thanks! But I tried the batch_size=1 case and still ran out of memory.

I see, thanks for the answer!
Looks like I will have to think of another solution....

QuAC needs about 11 GB, so 8 GB is not enough, even with the batch size set to 1.

Thank you all for replying to this issue. It seems the problem now has some answers. I think setting batch_size=1 is not a good solution, even if it may work; a better algorithm is probably needed. I hope there will be more answers.
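One generic way to keep the effective batch size above 1 without raising peak memory is gradient accumulation: run several small forward/backward passes and call optimizer.step() once. A minimal sketch, not taken from the FlowQA code; model, loss_fn, optimizer, and loader are illustrative:

import torch

accum_steps = 4                                  # effective batch = 4 small batches
optimizer.zero_grad()
for i, batch in enumerate(loader):
    loss = loss_fn(model(batch)) / accum_steps   # scale so gradients average
    loss.backward()                              # .grad accumulates across passes
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()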