pkuxmq/Invertible-Image-Rescaling

What is the reason for using gradient clipping here?

johnnylu305 opened this issue · 3 comments

Hi, I want to train the network without gradient clipping. However, the loss converges at first and then suddenly diverges. Do you know the reason for this? Why did you use gradient clipping?

There are exp() operations and multiplications in the coupling layer architecture, so gradient explosion can occur easily. Therefore, we restrict the range of exp() and also apply gradient clipping.
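For illustration, here is a minimal PyTorch sketch of both ideas: squashing the scale with a centered sigmoid so exp() stays in a bounded range, and clipping gradient norms during the update. This is not the repository's exact code; the module structure, `clamp` value, and `max_norm` threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Toy affine coupling step: y2 = x2 * exp(s(x1)) + t(x1).
    The raw scale is squashed with a centered sigmoid so that
    exp(s) stays within [exp(-clamp), exp(clamp)]."""
    def __init__(self, channels, clamp=1.0):
        super().__init__()
        self.clamp = clamp  # illustrative bound on the scale
        self.s = nn.Conv2d(channels, channels, 3, padding=1)
        self.t = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x1, x2):
        # centered sigmoid maps the raw scale into (-clamp, clamp)
        s = self.clamp * (torch.sigmoid(self.s(x1)) * 2 - 1)
        return x1, x2 * torch.exp(s) + self.t(x1)

# One optimization step with gradient clipping (threshold is illustrative).
model = AffineCoupling(channels=3)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
x1, x2 = torch.randn(1, 3, 8, 8), torch.randn(1, 3, 8, 8)
y1, y2 = model(x1, x2)
loss = y1.pow(2).mean() + y2.pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
opt.step()
```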

@pkuxmq Thank you for your explanation. I understand that the exp() operation is unstable for the network, which is why you apply a centered sigmoid to it. But why is the centered sigmoid not enough? Or is gradient clipping meant for the multiplication rather than exp()?

Yes, each InvBlock has a scaling term involving exp(), so stacking multiple InvBlocks can still lead to exponential scaling and gradient explosion even if a single exp() is restricted to a bounded range.
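To make the compounding explicit: if each block contributes a bounded scale s_i with |s_i| <= clamp, the product of the per-block factors is exp(s_1 + ... + s_N), which can still reach exp(N * clamp) across N blocks. A tiny numerical sketch (the clamp value and depth below are illustrative assumptions, not the repository's settings):

```python
import torch

clamp, n_blocks = 1.0, 16              # illustrative per-block bound and depth
x = torch.tensor(1.0, requires_grad=True)
y = x
for _ in range(n_blocks):
    # each block multiplies by exp(s) with s at its bound; the total
    # factor is exp(n_blocks * clamp), so values and gradients grow ~e^16
    y = y * torch.exp(torch.tensor(clamp))
y.backward()
print(y.item(), x.grad.item())          # both ~ e^16 ≈ 8.9e6
```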