he-dhamo/simsg

PyTorch process is killed

buaaswf opened this issue · 3 comments

Dear @azadef @he-dhamo,
Thanks for sharing your work. I am following it and have been running the training code,
but I ran into the following problem:
during training, the process gets killed at a random iteration.
I am using PyTorch 1.1, and my server should be sufficient, with 64 GB of RAM and a 32 GB GPU.
Have you ever encountered this problem?
I have tried reducing the batch size and the mask size, but the process is still killed even when both are set to 1.

python scripts/run_train.py args_64_crn_vg.yaml
t = 15500 / 300000
G [L1_pixel_loss]: 0.5709
G [bbox_pred]: 0.2680
G [ac_loss]: 0.4748
G [g_gan_obj_loss]: 0.0387
G [g_gan_img_loss]: 0.0231
G [total_loss]: 1.3755
D_obj [d_obj_gan_loss]: 0.4348
D_obj [d_ac_loss_real]: 4.9174
D_obj [d_ac_loss_fake]: 4.7477
D_img [d_img_gan_loss]: 1.8805
6%|██████ | 3787/62565 [04:47<1:10:53, 13.82it/s]Killed

Thanks.
Wenfeng
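
A bare "Killed" line with no Python traceback, as in the log above, usually means the Linux out-of-memory killer terminated the process because host RAM (not GPU memory) ran out. A minimal sketch for checking this, assuming the psutil package is installed and that a hypothetical log_rss helper is called once per iteration inside the training loop of scripts/run_train.py:

import os
import psutil  # assumption: installed via `pip install psutil`

proc = psutil.Process(os.getpid())

def log_rss(step, every=500):
    # Resident set size of this process; if it climbs steadily across
    # iterations, the OOM killer is the likely cause of the bare "Killed".
    if step % every == 0:
        rss_gib = proc.memory_info().rss / 1024 ** 3
        print(f"step {step}: host RSS = {rss_gib:.2f} GiB")

# Usage sketch: call log_rss(t) at the top of each training iteration.
for t in range(1000):
    log_rss(t)

If the reported RSS grows steadily toward the 64 GB limit, the leak is on the CPU side (for example, tensors or histories kept alive across iterations) rather than a GPU memory issue.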

Hi Wenfeng,

Thanks for reporting this. We haven't encountered such an issue. Our experiments for this work ran on an 11 GB GPU with 64 GB of RAM, so memory is probably not the problem. We suspect the issue is related to your CUDA / PyTorch version, or perhaps your server enforces a constraint that caps how much memory a single process can use.
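
One hedged way to check for such a per-process cap is to print the resource limits the server applies to the Python process, using the standard-library resource module (Linux only); a cgroup or container memory limit would not show up here and would have to be checked separately:

import resource

def fmt(value):
    # RLIM_INFINITY means no limit is configured for that resource.
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value / 1024 ** 3:.1f} GiB"

for name in ("RLIMIT_AS", "RLIMIT_DATA", "RLIMIT_RSS"):
    soft, hard = resource.getrlimit(getattr(resource, name))
    print(f"{name}: soft={fmt(soft)}, hard={fmt(hard)}")

Note that a low RLIMIT_AS usually surfaces as a MemoryError rather than a bare "Killed", so if these report unlimited, a cgroup / container memory cap or a system-wide OOM kill (visible in dmesg) is the more likely explanation.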

Dear @he-dhamo,
I have changed the PyTorch version to 1.2, and it works well now.
Thanks
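
For anyone hitting the same problem, the active versions can be confirmed with the standard PyTorch attributes below (a quick environment sanity check, not specific to this repository):

import torch

print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))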

Great!