pkuxmq/Invertible-Image-Rescaling

What is the reason for using gradient clipping here?

johnnylu305 opened this issue · 3 comments

Hi, I want to train the network without gradient clipping. However, the loss converges at first and then suddenly diverges. Do you know the reason for this? Why did you use gradient clipping?

There are exp() operations and multiplications in the coupling layer architecture, so gradient explosion can occur easily. Therefore, we restrict the range of exp() and also apply gradient clipping.
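For illustration, here is a minimal PyTorch sketch of both ideas: squashing the scale with a centered sigmoid so exp() stays in a bounded range, and clipping gradient norms during the update. This is not the repository's exact code; the module structure, `clamp` value, and `max_norm` threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Toy affine coupling step: y2 = x2 * exp(s(x1)) + t(x1).
    The raw scale is squashed with a centered sigmoid so that
    exp(s) stays within [exp(-clamp), exp(clamp)]."""
    def __init__(self, channels, clamp=1.0):
        super().__init__()
        self.clamp = clamp  # illustrative bound on the scale
        self.s = nn.Conv2d(channels, channels, 3, padding=1)
        self.t = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x1, x2):
        # centered sigmoid maps the raw scale into (-clamp, clamp)
        s = self.clamp * (torch.sigmoid(self.s(x1)) * 2 - 1)
        return x1, x2 * torch.exp(s) + self.t(x1)

# One optimization step with gradient clipping (threshold is illustrative).
model = AffineCoupling(channels=3)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
x1, x2 = torch.randn(1, 3, 8, 8), torch.randn(1, 3, 8, 8)
y1, y2 = model(x1, x2)
loss = y1.pow(2).mean() + y2.pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
opt.step()
```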

@pkuxmq Thank you for your explanation. I understand that the exp() operation is unstable for the network, which is why you apply a centered sigmoid to it. But why is the centered sigmoid not enough? Or is gradient clipping meant for the multiplication rather than exp()?

Yes, each InvBlock has a scaling term involving exp(), so stacking multiple InvBlocks can still lead to exponential scaling and gradient explosion even if a single exp() is restricted to a bounded range.
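To make the compounding explicit: if each block contributes a bounded scale s_i with |s_i| <= clamp, the product of the per-block factors is exp(s_1 + ... + s_N), which can still reach exp(N * clamp) across N blocks. A tiny numerical sketch (the clamp value and depth below are illustrative assumptions, not the repository's settings):

```python
import torch

clamp, n_blocks = 1.0, 16              # illustrative per-block bound and depth
x = torch.tensor(1.0, requires_grad=True)
y = x
for _ in range(n_blocks):
    # each block multiplies by exp(s) with s at its bound; the total
    # factor is exp(n_blocks * clamp), so values and gradients grow ~e^16
    y = y * torch.exp(torch.tensor(clamp))
y.backward()
print(y.item(), x.grad.item())          # both ~ e^16 ≈ 8.9e6
```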