Does gradient clipping in `to_fp16` really take effect?
richarddwang opened this issue · 1 comment
richarddwang commented
I tried to find the answer using ipdb; below is what I got.
It shows that only the gradients of the parameters in `master_pgs` get clipped, while the gradients of `self.model`'s parameters and of the parameters the optimizer uses for the update stay unchanged (a minimal standalone reproduction follows the transcript).
tensor(-0.4746, device='cuda:0', dtype=torch.float16, grad_fn=) # p list(self.model.parameters())[50].grad.mean()
tensor(5.0221e-06, device='cuda:0') # p self.master_pgs[-3][0].grad.mean()
>>> unt 116 # from 114
> /home/yisiang/fastai2/fastai2/callback/fp16.py(116)after_backward()
    114         if self.clip is not None:
    115             for group in self.master_pgs: nn.utils.clip_grad_norm_(group, self.clip)
--> 116         if self.dynamic:
    117             self.count += 1
    118             if self.count == self.scale_wait:
tensor(-0.4746, device='cuda:0', dtype=torch.float16, grad_fn=) # p list(self.model.parameters())[50].grad.mean()
tensor(3.1092e-06, device='cuda:0') # p self.master_pgs[-3][0].grad.mean()
tensor(6.3187e-07, device='cuda:0') # p self.opt.all_params(with_grad=True)[33][0].grad.mean()
>>> unt 116 # from 114
> /home/yisiang/fastai2/fastai2/callback/fp16.py(116)after_backward()
    114         if self.clip is not None:
    115             for group in self.master_pgs: nn.utils.clip_grad_norm_(group, self.clip)
--> 116         if self.dynamic:
    117             self.count += 1
    118             if self.count == self.scale_wait:
tensor(6.3187e-07, device='cuda:0') # p self.opt.all_params(with_grad=True)[33][0].grad.mean()
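For reference, here is a minimal standalone sketch (toy tensors only, not fastai code) of the same observation: calling `nn.utils.clip_grad_norm_` on the fp32 master copies rescales only their gradients, and the fp16 gradients on the model itself are left untouched.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# an fp16 "model" parameter and its fp32 master copy, as kept in mixed-precision training
model_param = nn.Parameter(torch.randn(100).half())
master_param = nn.Parameter(model_param.detach().float().clone())

# pretend backward has run: the fp16 param gets a grad, which is copied to the master copy
model_param.grad = torch.randn(100).half()
master_param.grad = model_param.grad.detach().float().clone()

print("model grad norm before clip :", model_param.grad.float().norm().item())
print("master grad norm before clip:", master_param.grad.norm().item())

# clip only the master copies, mirroring line 115 in the transcript above
nn.utils.clip_grad_norm_([master_param], max_norm=0.1)

print("model grad norm after clip  :", model_param.grad.float().norm().item())  # unchanged
print("master grad norm after clip :", master_param.grad.norm().item())         # now <= 0.1
```

Running this prints the same model grad norm before and after the clip, while the master grad norm drops to the clip value.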
jph00 commented
According to the docs, the optimizer uses the params in the master model, so I believe this behavior is correct.
http://dev.fast.ai/callback.fp16#A-little-bit-of-theory
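To spell out why that resolves the question, here is a rough sketch (generic PyTorch, not the fastai implementation) of the master-weights pattern the linked docs describe: the optimizer only ever steps on the fp32 master copies, so clipping their gradients is what actually limits the weight update; the fp16 model is then overwritten from the masters before the next forward pass.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1).half()                            # fp16 model
master_params = [p.detach().float().clone().requires_grad_(True)
                 for p in model.parameters()]              # fp32 master copies
opt = torch.optim.SGD(master_params, lr=0.1)               # the optimizer only sees the masters

# pretend a (scaled) backward pass has filled the fp16 grads
for p in model.parameters():
    p.grad = torch.randn(p.shape).half()

# copy the fp16 grads onto the fp32 masters (division by the loss scale omitted here)
for mp, p in zip(master_params, model.parameters()):
    mp.grad = p.grad.detach().float()

nn.utils.clip_grad_norm_(master_params, max_norm=1.0)      # clip the fp32 grads only
opt.step()                                                 # the update uses the clipped grads

# copy the updated fp32 weights back into the fp16 model
with torch.no_grad():
    for mp, p in zip(master_params, model.parameters()):
        p.copy_(mp)                                        # copy_ casts fp32 -> fp16
```

So the fp16 grads on `self.model` are never used directly for the step; they only feed the master copies, which is why clipping only the masters (as the `after_backward` lines in the transcript do) is enough for the clip to take effect.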