carpedm20/ENAS-pytorch

CUDA out of memory

MJHutchinson opened this issue · 1 comment

First off, thanks for making this, looks great!

I downloaded the repo and I'm trying to run the examples before moving on. Unfortunately, training runs out of CUDA memory almost immediately. I'm running on a GTX 1050 with 4GB of RAM (about 3GB available for training), similar to the 980 you mentioned you were running on? I was just wondering if you had any ideas about what could be causing this issue. Full error message below.

python main.py --network_type rnn --dataset ptb --controller_optim adam --controller_lr 0.00035 --shared_optim sgd --shared_lr 20.0 --entropy_coeff 0.0001
2018-02-16 22:22:54,351:INFO::[*] Make directories : logs/ptb_2018-02-16_22-22-54
2018-02-16 22:22:59,204:INFO::# of parameters: 146,014,000
2018-02-16 22:22:59,315:INFO::[*] MODEL dir: logs/ptb_2018-02-16_22-22-54
2018-02-16 22:22:59,316:INFO::[*] PARAM path: logs/ptb_2018-02-16_22-22-54/params.json
train_shared:   0%|   | 0/14524 [00:00<?, ?it/s]
/home/mjhutchinson/Documents/Machine Learning/ENAS-pytorch/models/controller.py:96: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  probs = F.softmax(logits)
/home/mjhutchinson/Documents/Machine Learning/ENAS-pytorch/models/controller.py:97: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
  log_prob = F.log_softmax(logits)
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "main.py", line 45, in <module>
    main(args)
  File "main.py", line 34, in main
    trainer.train()
  File "/home/mjhutchinson/Documents/Machine Learning/ENAS-pytorch/trainer.py", line 87, in train
    self.train_shared()
  File "/home/mjhutchinson/Documents/Machine Learning/ENAS-pytorch/trainer.py", line 143, in train_shared
    loss.backward()
  File "/home/mjhutchinson/.conda/envs/pytorch/lib/python3.6/site-packages/torch/autograd/variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/home/mjhutchinson/.conda/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/generic/THCStorage.cu:58
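
As a side note, I assume the softmax deprecation warnings are unrelated to the out-of-memory error; they look like they would go away if the two calls in controller.py passed an explicit dimension, e.g. something like:

probs = F.softmax(logits, dim=-1)        # dim=-1 assumes the last dim of logits is the choice dim
log_prob = F.log_softmax(logits, dim=-1)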

If there's any other info that would be helpful please let me know!

What I am using is a GTX 980 Ti, which has 6GB of memory. You can reduce --shared_embed and --shared_hid, which are the major factors controlling the required memory size.
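
For example, starting from the command you posted (the 500s below are just illustrative values, not recommended settings; tune them to fit your 3GB budget):

# shared_embed / shared_hid values here are illustrative; lower them further if you still OOM
python main.py --network_type rnn --dataset ptb --controller_optim adam --controller_lr 0.00035 --shared_optim sgd --shared_lr 20.0 --entropy_coeff 0.0001 --shared_embed 500 --shared_hid 500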