Does gradient clipping in `to_fp16` really take effect?
richarddwang opened this issue · 1 comment
richarddwang commented
I tried to find the answer using ipdb; below is what I got.
It shows that only the gradients of the parameters in `master_pgs` get clipped, while the gradients of `self.model`'s parameters and of the parameters the optimizer uses for the update stay unchanged (a minimal standalone reproduction follows the transcript).
tensor(-0.4746, device='cuda:0', dtype=torch.float16, grad_fn=) # p list(self.model.parameters())[50].grad.mean()
tensor(5.0221e-06, device='cuda:0') # p self.master_pgs[-3][0].grad.mean()
>>> unt 116 # from 114
> /home/yisiang/fastai2/fastai2/callback/fp16.py(116)after_backward()
    114         if self.clip is not None:
    115             for group in self.master_pgs: nn.utils.clip_grad_norm_(group, self.clip)
--> 116         if self.dynamic:
    117             self.count += 1
    118             if self.count == self.scale_wait:
tensor(-0.4746, device='cuda:0', dtype=torch.float16, grad_fn=) # p list(self.model.parameters())[50].grad.mean()
tensor(3.1092e-06, device='cuda:0') # p self.master_pgs[-3][0].grad.mean()
tensor(6.3187e-07, device='cuda:0') # p self.opt.all_params(with_grad=True)[33][0].grad.mean()
>>> unt 116 # from 114
> /home/yisiang/fastai2/fastai2/callback/fp16.py(116)after_backward()
    114         if self.clip is not None:
    115             for group in self.master_pgs: nn.utils.clip_grad_norm_(group, self.clip)
--> 116         if self.dynamic:
    117             self.count += 1
    118             if self.count == self.scale_wait:
tensor(6.3187e-07, device='cuda:0') # p self.opt.all_params(with_grad=True)[33][0].grad.mean()
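For reference, here is a minimal standalone sketch (toy tensors only, not fastai code) of the same observation: calling `nn.utils.clip_grad_norm_` on the fp32 master copies rescales only their gradients, and the fp16 gradients on the model itself are left untouched.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# an fp16 "model" parameter and its fp32 master copy, as kept in mixed-precision training
model_param = nn.Parameter(torch.randn(100).half())
master_param = nn.Parameter(model_param.detach().float().clone())

# pretend backward has run: the fp16 param gets a grad, which is copied to the master copy
model_param.grad = torch.randn(100).half()
master_param.grad = model_param.grad.detach().float().clone()

print("model grad norm before clip :", model_param.grad.float().norm().item())
print("master grad norm before clip:", master_param.grad.norm().item())

# clip only the master copies, mirroring line 115 in the transcript above
nn.utils.clip_grad_norm_([master_param], max_norm=0.1)

print("model grad norm after clip  :", model_param.grad.float().norm().item())  # unchanged
print("master grad norm after clip :", master_param.grad.norm().item())         # now <= 0.1
```

Running this prints the same model grad norm before and after the clip, while the master grad norm drops to the clip value.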
jph00 commented
According to the docs, the optimizer uses the params in the master model, so I believe this behavior is correct.
http://dev.fast.ai/callback.fp16#A-little-bit-of-theory
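To spell out why that resolves the question, here is a rough sketch (generic PyTorch, not the fastai implementation) of the master-weights pattern the linked docs describe: the optimizer only ever steps on the fp32 master copies, so clipping their gradients is what actually limits the weight update; the fp16 model is then overwritten from the masters before the next forward pass.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1).half()                            # fp16 model
master_params = [p.detach().float().clone().requires_grad_(True)
                 for p in model.parameters()]              # fp32 master copies
opt = torch.optim.SGD(master_params, lr=0.1)               # the optimizer only sees the masters

# pretend a (scaled) backward pass has filled the fp16 grads
for p in model.parameters():
    p.grad = torch.randn(p.shape).half()

# copy the fp16 grads onto the fp32 masters (division by the loss scale omitted here)
for mp, p in zip(master_params, model.parameters()):
    mp.grad = p.grad.detach().float()

nn.utils.clip_grad_norm_(master_params, max_norm=1.0)      # clip the fp32 grads only
opt.step()                                                 # the update uses the clipped grads

# copy the updated fp32 weights back into the fp16 model
with torch.no_grad():
    for mp, p in zip(master_params, model.parameters()):
        p.copy_(mp)                                        # copy_ casts fp32 -> fp16
```

So the fp16 grads on `self.model` are never used directly for the step; they only feed the master copies, which is why clipping only the masters (as the `after_backward` lines in the transcript do) is enough for the clip to take effect.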