edouardelasalles/srvp

Memory out on SM-MNIST

fanshuhuangjia opened this issue · 9 comments

I trained the model easily by following your instructions, but I got "OSError: [Errno 12] Cannot allocate memory" at iteration 319999/1100000. I have tried setting n_workers=0 and pin_memory=False, but it didn't work. So, I wonder how much CPU memory I need to train the model on SM-MNIST? (My CPU memory: 80GB)
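Roughly what I changed in the data loading, using the standard torch.utils.data.DataLoader argument names (the dataset and batch size below are just placeholders, not the repo's real values):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; only num_workers=0 and pin_memory=False are the point here.
dataset = TensorDataset(torch.zeros(16, 1, 64, 64))
loader = DataLoader(dataset, batch_size=4, num_workers=0, pin_memory=False)
```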

Hi, thank you for your interest in our work!

We are currently investigating the issue and trying to reproduce your error. In the meantime and to help us understand your problem, could you please provide additional details such as the exact command that you executed to launch training, the line at which the error occurs (if available) or any other relevant information on your experimental setup (such as package versions different than those given in the requirements.txt file)?

Hi,

thank you for the answer.

The environment I am using has the same modules as the requirements file, except for torch: I use torch==1.5.0 in order to be compatible with CUDA 10.2.
The command I use:
OMP_NUM_THREADS=4 python -m torch.distributed.launch --nproc_per_node=2 train.py --device 0 1 --apex_amp --ny 20 --nz 20 --beta_z 2 --nt_cond 5 --nt_inf 5 --dataset smmnist --nc 1 --seq_len 15 --data_dir data --save_path logs

I used 2 GeForce RTX 2080Ti GPUs to train the model.
The error is shown below:
[screenshot of the error traceback]

So, I changed n_workers=0 and pin_memory=False, but it still used about 53GB of CPU memory at 46% (501884/1100000).
Thank you!

Unfortunately, we were not able to reproduce this error: on our setup, the program runs on one GPU without exceeding 30GB of RAM for hundreds of thousands of iterations. However, here are some possible workarounds that you might want to try.

  1. If you cannot use PyTorch 1.4.0 for CUDA-related reasons, you should probably use version 1.5.1 instead of 1.5.0, since 1.5.1 fixes many bugs present in 1.5.0.

  2. Our model for MNIST should be trainable on your hardware with only one GPU. Reducing the number of GPUs significantly reduces memory usage, so this might help you complete training.

  3. Validation steps in our code use additional memory; you can space them out with the --val_interval option to reduce memory consumption.

  4. Besides configuration-dependent matters, this issue could originate from the way PyTorch handles data loading together with Python multiprocessing, as described in this PyTorch issue. We just pushed a new version of our code that uses the workaround discussed there, which may help solve the issue.

  5. If the above solution does not work, it could be difficult for us to solve the issue without substantially changing our code architecture, which we would like to preserve for reproducibility purposes. However, you could try other workarounds from this issue, such as implementing shared memory as suggested in this message. In our case, this should be applied to the self.data attribute, since it contains the sequence of MNIST digits that is used in every data loading process (see the sketch below).
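As an illustration, here is a minimal sketch of what the shared-memory workaround could look like when applied to a dataset's self.data attribute. The class and the dummy data below are placeholders, not our actual dataset code:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class SharedDataDataset(Dataset):
    """Placeholder dataset illustrating the shared-memory workaround."""

    def __init__(self, digits: np.ndarray):
        # Move the array of MNIST digits into a shared-memory tensor, so that
        # DataLoader worker processes read from the same buffer instead of each
        # one progressively copying it (copy-on-write triggered by refcounts).
        self.data = torch.from_numpy(np.ascontiguousarray(digits))
        self.data.share_memory_()

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, index):
        # Workers only index into the shared buffer; no per-worker copy is made.
        return self.data[index]


# Placeholder usage with dummy data shaped like MNIST digits.
digits = np.zeros((100, 28, 28), dtype=np.uint8)
loader = DataLoader(SharedDataDataset(digits), batch_size=8, num_workers=2)
```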

Please let us know whether any of these suggestions solves your problem!

Thank you for the suggestions! I'm sorry I didn't reply in time.
I tried to follow your suggestions (1, 2, 3 and 4), but they didn't work. By running on only one GPU, I finished the whole training, at a large cost in CPU memory. (I will put up the results soon after evaluation.)
[screenshot of CPU memory usage during training]
As you can see, it needs about 100GB of CPU memory to train the model.
I will try to train the model on other computers to check whether something is wrong with my machine.
Thanks again for your time and effort! It helps me a lot!

Though the problem hasn't been solved, I am happy that the results match the paper (PSNR 16.93 ± 0.07, SSIM 0.7799 ± 0.0020).
[screenshot of the evaluation results]
Thanks again for your detailed instructions!

No worries, thank you for the update!

There might be another explanation: this Apex issue reports that Apex usage in specific configurations leads to a CPU memory leak such as the one you encounter. If possible, could you please try to train our model without the --apex_amp option to check whether your issue still occurs? Training at full precision is significantly slower, but you should be able to quickly confirm the absence of a memory leak with the setup that you first used when reporting this problem.
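For reference, this would simply be the command you posted with the --apex_amp flag removed:
OMP_NUM_THREADS=4 python -m torch.distributed.launch --nproc_per_node=2 train.py --device 0 1 --ny 20 --nz 20 --beta_z 2 --nt_cond 5 --nt_inf 5 --dataset smmnist --nc 1 --seq_len 15 --data_dir data --save_path logs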

If this is indeed the cause of the memory leak, there is unfortunately not much we can do on our side except report it in our instructions. You then have two options:

  1. try the workarounds suggested in the corresponding issue, such as installing different versions of PyTorch or compiling it yourself;
  2. train the model using the --torch_amp option with the latest version of PyTorch (1.7.1), which integrates Apex's mixed-precision training directly into PyTorch (note that this feature is experimental, as we have not reproduced our results with this option yet); a short sketch of what this native mixed-precision API looks like is given below.
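To illustrate option 2, here is a minimal sketch of PyTorch's native mixed-precision API (torch.cuda.amp), which is what --torch_amp relies on. The model, optimizer, and data below are placeholders, not our actual training loop:

```python
import torch

# Placeholder model, optimizer, and data; only the amp-related lines matter.
model = torch.nn.Linear(128, 1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
batches = [torch.randn(32, 128) for _ in range(10)]

scaler = torch.cuda.amp.GradScaler()
for batch in batches:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # forward pass runs in mixed precision
        loss = model(batch.cuda()).pow(2).mean()
    scaler.scale(loss).backward()        # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)               # unscales the gradients, then steps
    scaler.update()                      # adjusts the loss scale factor
```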

Please let us know whether this helps solve your issue.

Hi, following your suggestions, I trained the model without --apex_amp and the CPU memory leak did not occur anymore. This confirms your explanation that Apex usage in specific configurations leads to a CPU memory leak.

As FabianIsensee said in this Apex issue, the problem should go away if you compile PyTorch yourself with a more recent version of cuDNN. So I checked my cuDNN version, but found that cuDNN was not installed on my computer. However, the issue still occurred after installing it.
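For reference, this is roughly how I checked which cuDNN PyTorch reports (note that pre-built PyTorch wheels bundle their own cuDNN, so this can differ from the system-wide installation):

```python
import torch

print(torch.backends.cudnn.is_available())  # whether PyTorch can use cuDNN
print(torch.backends.cudnn.version())       # loaded cuDNN version, e.g. 7605
```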

Now, I'd like to train the model using PyTorch 1.7.1 and check my Apex installation environment.
Thanks again for your time and effort!!!

Nice, thank you for your help! We are closing this issue since the source of the problem was found. Please let us know if you have any other questions!

Hi, I trained the model using PyTorch 1.7.1 with --torch_amp, and the issue did not occur anymore. I got the following results:
psnr 16.743256 +/- 0.06728019263638955
ssim 0.77448565 +/- 0.0019244341182183178
lpips 0.11437263 +/- 0.001075665273820588
These results are similar; you can use them as a reference.
Thank you!