Calculation of scale term
meowcakes opened this issue · 8 comments
Hi guys,
In the paper, Table 1 specifies that the NN learns the log of the scale, and thus the scale is calculated as `s = exp(log s)`. However, in your code the scale is calculated by `scale = tf.nn.sigmoid(h[:, :, :, 1::2] + 2.)`. Would it be possible to elaborate on why this calculation was used instead? I'm assuming it's for reasons of numerical stability?
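For reference, a minimal sketch of the two parameterizations side by side (`h` is just a stand-in for the coupling network's output; the interpretation of the `+ 2.` is my guess):

```python
import tensorflow as tf

h = tf.random.normal([1, 8, 8, 4])  # stand-in for the coupling network's output

# Paper (Table 1): the network predicts log s, and the scale is recovered with exp.
scale_paper = tf.exp(h[:, :, :, 1::2])             # unbounded above

# Released code: the scale is squashed into (0, 1); the + 2. presumably biases it
# towards 1 at initialization (sigmoid(2) ≈ 0.88).
scale_code = tf.nn.sigmoid(h[:, :, :, 1::2] + 2.)
```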
I'm also curious about this. What was the reasoning for switching from `exp` to `sigmoid`? Was it just to keep the result bounded?
So I asked the authors at NeurIPS last year: the `sigmoid` here is used to bound the gradients of the affine coupling layer. In the previous Real-NVP work a `tanh` is used for the same reason. I've tried training without this kind of bounding and it didn't converge.
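For anyone who finds this later, here is how I picture the two bounding schemes (the Real-NVP-style variant is only illustrative, not the exact released code; `s_max` is a hypothetical learnable bound):

```python
import tensorflow as tf

h = tf.random.normal([1, 8, 8, 4])      # stand-in for the coupling network's output
raw = h[:, :, :, 1::2]

# Glow code: scale in (0, 1), so the per-element gradient dy/dx = scale never exceeds 1.
scale_glow = tf.nn.sigmoid(raw + 2.)

# Real-NVP-style bounding (illustrative): squash the log-scale with tanh and rescale it
# by a learnable factor, so exp(.) stays within [exp(-s_max), exp(s_max)].
s_max = tf.Variable(1.0)                # hypothetical learnable bound
scale_tanh = tf.exp(s_max * tf.tanh(raw))
```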
Ah, I see. I wonder if you could use something like `ReLU(x) + 1`. Then your gradient would always be nice and strong, and the constant would prevent the divide-by-zero problems.
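Concretely, the change I have in mind would be something like this (hypothetical, not from the released code):

```python
import tensorflow as tf

h = tf.random.normal([1, 8, 8, 4])  # stand-in for the coupling network's output

# ReLU(x) + 1: the scale is always >= 1, so the inverse x = (y - shift) / scale
# never divides by zero.
scale = tf.nn.relu(h[:, :, :, 1::2]) + 1.
```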
I think the problem is not that the gradient is not strong enough. Actually, quite the opposite: you want to bound it.
Sure, I understand that. But the gradient of the `ReLU` is bounded (it's constant) as well. And it's a simpler function, without the vanishing-gradient problem of the `sigmoid`. I don't know whether it would perform any better, though.
Note that `y = scale * x + shift`, with `scale = tf.nn.relu(h[:, :, :, 1::2])` and `shift = h[:, :, :, ::2]`. I agree that `dy/dh` is bounded here due to the `relu`, but `dy/dx = scale`, which corresponds to the entries of the Jacobian matrix, is still unbounded. In that case the determinant of the Jacobian can be huge, suggesting a dramatic volume growth from x to y. This made training unstable in my experiments. The idea of using `sigmoid` is to bound `dy/dx` from above; in fact it only allows the volume to shrink. I think it sacrifices capacity for stability. However, I don't have any further intuition beyond these observations. I guess it's something to overcome.
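A minimal sketch of that argument in code (the coupling network `nn` is a placeholder; the slicing follows the Glow code):

```python
import tensorflow as tf

def affine_coupling(x1, x2, nn):
    """Transform x2 conditioned on x1: y2 = scale * x2 + shift."""
    h = nn(x1)
    shift = h[:, :, :, 0::2]
    scale = tf.nn.sigmoid(h[:, :, :, 1::2] + 2.)      # every entry in (0, 1)
    y2 = scale * x2 + shift
    # dy2/dx2 is diagonal with entries `scale`, so the log-determinant is:
    logdet = tf.reduce_sum(tf.math.log(scale), axis=[1, 2, 3])
    # scale < 1  =>  log(scale) < 0  =>  logdet < 0: the volume can only shrink.
    # With an unbounded scale (e.g. relu), logdet has no upper bound.
    return y2, logdet
```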
@wanglouis49 Thanks for your insight.
Has anyone tried simply splitting the output of the NN in the coupling layer along the channel dimension and using the first half as the scale and the second half as the translation parameter, like in RealNVP/NICE? Is there any reason why GLOW uses the odd-indexed channels for the scale and the even-indexed channels for the translation?
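For what it's worth, the two splitting schemes would look like this (sketch only; `h` stands in for the coupling network's output):

```python
import tensorflow as tf

h = tf.random.normal([1, 8, 8, 4])  # stand-in for the coupling network's output

# Glow code: interleaved split over the channel axis.
shift_a     = h[:, :, :, 0::2]
scale_raw_a = h[:, :, :, 1::2]

# RealNVP/NICE-style alternative: contiguous halves.
shift_b, scale_raw_b = tf.split(h, num_or_size_splits=2, axis=-1)
```

Since `h` comes out of an ordinary convolution, I'd guess the two choices are equivalent up to a permutation of that convolution's output channels, so it shouldn't make a practical difference.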