RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
jxzhangjhu opened this issue · 1 comment
I'm hitting the following error; could you take a look?
```
(/home/ec2-user/SageMaker/env/test) sh-4.2$ python synthetic_train.py --num_res_blocks 3 --diffusion_steps 4000 --noise_schedule linear --lr 1e-4 --batch_size 20000 --task 1
Logging to /tmp/openai-2022-11-08-05-25-53-353605
args: Namespace(task=1, schedule_sampler='uniform', lr=0.0001, weight_decay=0.0, lr_anneal_steps=1000, batch_size=20000, microbatch=-1, ema_rate='0.9999', log_interval=10, save_interval=10000, resume_checkpoint='', use_fp16=False, fp16_scale_growth=0.001, num_channels=256, num_res_blocks=3, dropout=0.2, use_checkpoint=False, in_channels=2, learn_sigma=False, diffusion_steps=4000, noise_schedule='linear', timestep_respacing='', use_kl=False, predict_xstart=False, rescale_timesteps=False, rescale_learned_sigmas=False)
[W ProcessGroupGloo.cpp:694] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
Logging to /tmp/openai-2022-11-08-05-25-53-357098
creating 2d model and diffusion...
creating 2d data loader...
training 2d model...
```
```
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/DDIB/ddib/synthetic_train.py", line 82, in <module>
    main()
  File "/home/ec2-user/SageMaker/DDIB/ddib/synthetic_train.py", line 40, in main
    TrainLoop(
  File "/home/ec2-user/SageMaker/DDIB/ddib/guided_diffusion/train_util.py", line 67, in __init__
    self._load_and_sync_parameters()
  File "/home/ec2-user/SageMaker/DDIB/ddib/guided_diffusion/train_util.py", line 122, in _load_and_sync_parameters
    dist_util.sync_params(self.model.parameters())
  File "/home/ec2-user/SageMaker/DDIB/ddib/guided_diffusion/dist_util.py", line 83, in sync_params
    dist.broadcast(p, 0)
  File "/home/ec2-user/SageMaker/env/test/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1408, in broadcast
    work.wait()
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
```
I found the fix: in `sync_params` (`guided_diffusion/dist_util.py`), change `dist.broadcast(p, 0)` to `dist.broadcast(p.detach(), 0)`.
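For reference, here is a minimal sketch of the patched helper, assuming `sync_params` mirrors the upstream guided-diffusion version; the only change is broadcasting into `p.detach()`:

```python
import torch.distributed as dist


def sync_params(params):
    """
    Broadcast every parameter from rank 0 so all workers start from
    identical weights. Assumes the process group has already been
    initialized (e.g. via dist.init_process_group).

    dist.broadcast writes into its tensor argument in place. Model
    parameters are leaf tensors with requires_grad=True, so broadcasting
    into them directly trips autograd's in-place check. p.detach()
    shares storage with p but is not tracked by autograd, so the
    in-place write is allowed and the parameter data is still updated.
    """
    for p in params:
        dist.broadcast(p.detach(), 0)
```

Wrapping the broadcast in `with torch.no_grad():` instead achieves the same thing, since autograd's in-place check only applies to tensors it is tracking.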
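To see why the original call fails and the detached one doesn't, here is a standalone repro (a hypothetical example, not code from the repo):

```python
import torch

# A leaf tensor with requires_grad=True stands in for a model parameter.
p = torch.zeros(3, requires_grad=True)

try:
    p.add_(1.0)  # in-place write into a grad-requiring leaf tensor
except RuntimeError as e:
    # "a leaf Variable that requires grad is being used in an in-place operation."
    print(e)

# p.detach() is a view on the same storage that autograd does not track,
# so the in-place write succeeds and still updates p's data.
p.detach().add_(1.0)
print(p)  # tensor([1., 1., 1.], requires_grad=True)
```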