You can use slim.batch_norm(scale=True) to achieve the same ability as Adaptive Normalization
mzh0 opened this issue · 5 comments
slim.batch_norm(scale=True) is effectively equivalent to slim.batch_norm(scale=False) in our case, according to the docstring:

> scale: If True, multiply by `gamma`. If False, `gamma` is not used. When the next layer is linear (also e.g. `nn.relu`), this can be disabled since the scaling can be done by the next layer.

described in https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/layers/python/layers/layers.py
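In other words, a linear layer that follows the normalization can absorb \gamma into its own weights, so the two settings span the same function class. A minimal numpy sketch of that folding (all names here are illustrative):

import numpy as np

rng = np.random.default_rng(0)
x_hat = rng.standard_normal(4)               # normalized activations
gamma, beta = rng.standard_normal(4), rng.standard_normal(4)
W, b = rng.standard_normal((3, 4)), rng.standard_normal(3)

# scale=True: the next linear layer sees gamma * x_hat + beta
y_scaled = W @ (gamma * x_hat + beta) + b
# scale=False: fold gamma into the weights and beta into the bias
y_folded = (W * gamma) @ x_hat + (W @ beta + b)
assert np.allclose(y_scaled, y_folded)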
Yesterday I also tried to run a model with a normalization function like this for the Rudin-Osher-Fatemi (ROF) task:

import tensorflow.contrib.slim as slim

def nm(x):
    return slim.batch_norm(x, scale=True)
But the performance is similar to slim.batch_norm(x, scale=False): it reaches an MSE of 56, while our adaptive normalization achieves an MSE of 0.6.
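For context, such a function is typically attached through slim's normalizer_fn argument; a sketch, with a made-up input shape and channel count (CAN itself stacks dilated 3x3 convolutions):

import tensorflow as tf  # slim and nm as defined above

inputs = tf.placeholder(tf.float32, [None, None, None, 3])
# normalizer_fn applies nm to the conv output before the activation
net = slim.conv2d(inputs, 24, [3, 3], rate=2, normalizer_fn=nm)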
The difference might come from a different parametrization or initialization.
What makes scale=True equivalent to scale=False for CAN?
- \mu is initialized to 0 by default
- \sigma is initialized to 1 by default
- \gamma is initialized to 1 by default
- \beta is initialized to 0 by default
Since you initialize w_0 = 1 and w_1 = 0, the two normalizations should have the same starting point. I would be very interested to see how the weights of the two parameterizations evolve during training.
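For reference, a minimal sketch of the adaptive normalization under discussion, i.e. w_0 * x + w_1 * BN(x) with trainable scalars (the function and variable names here are mine, not from the repo):

import tensorflow as tf
import tensorflow.contrib.slim as slim

def adaptive_norm(x):
    # w0 = 1, w1 = 0 makes the layer start as the identity mapping
    w0 = tf.Variable(1.0, name='w0')
    w1 = tf.Variable(0.0, name='w1')
    return w0 * x + w1 * slim.batch_norm(x)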
I also tried an experiment with adaptive normalization where w_0 = 0 and w_1 = 1. The performance is not good, with an MSE of 37 for ROF. So the initialization matters a lot.
On the other hand, batch normalization may not be suitable for learning an identity mapping, because the batch statistics \mu and \sigma keep changing during training. To achieve a perfect identity mapping we need \gamma = \sigma and \beta = \mu, but \mu and \sigma are recomputed for every batch. It seems hard for gradient descent to keep \gamma = \sigma and \beta = \mu satisfied closely.
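To spell that out with the standard batch-norm formula: BN(x) = \gamma * (x - \mu) / \sigma + \beta, so BN(x) = x for every x exactly when \gamma = \sigma and \beta = \mu; \gamma and \beta would have to track two moving targets.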
> I also tried an experiment with adaptive normalization where w_0 = 0 and w_1 = 0. The performance is not good, with an MSE of 37 for ROF. So the initialization matters a lot.
Do you mean w_0 = 0 and w_1 = 1?
Right, I meant w_0 = 0 and w_1 = 1. Sorry for the typo.