You can use slim.batch_norm(scale=True) to achieve the same ability as Adaptive Normalization
mzh0 opened this issue · 5 comments
slim.batch_norm(scale=True) is effectively equivalent to slim.batch_norm(scale=False) in our case, according to the docstring:

> scale: If True, multiply by `gamma`. If False, `gamma` is not used. When the next layer is linear (also e.g. `nn.relu`), this can be disabled since the scaling can be done by the next layer.

described in https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/layers/python/layers/layers.py
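In other words, a linear layer that follows the normalization can absorb \gamma into its own weights, so the two settings span the same function class. A minimal numpy sketch of that folding (all names here are illustrative):

import numpy as np

rng = np.random.default_rng(0)
x_hat = rng.standard_normal(4)               # normalized activations
gamma, beta = rng.standard_normal(4), rng.standard_normal(4)
W, b = rng.standard_normal((3, 4)), rng.standard_normal(3)

# scale=True: the next linear layer sees gamma * x_hat + beta
y_scaled = W @ (gamma * x_hat + beta) + b
# scale=False: fold gamma into the weights and beta into the bias
y_folded = (W * gamma) @ x_hat + (W @ beta + b)
assert np.allclose(y_scaled, y_folded)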
Yesterday I also tried to run a model with a normalization function like this for the Rudin-Osher-Fatemi (ROF) task:

import tensorflow.contrib.slim as slim

def nm(x):
    return slim.batch_norm(x, scale=True)
But the performance is similar to slim.batch_norm(x, scale=False): it reaches an MSE of 56, while our adaptive normalization achieves an MSE of 0.6.
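For context, such a function is typically attached through slim's normalizer_fn argument; a sketch, with a made-up input shape and channel count (CAN itself stacks dilated 3x3 convolutions):

import tensorflow as tf  # slim and nm as defined above

inputs = tf.placeholder(tf.float32, [None, None, None, 3])
# normalizer_fn applies nm to the conv output before the activation
net = slim.conv2d(inputs, 24, [3, 3], rate=2, normalizer_fn=nm)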
The difference might come from a different parametrization or initialization.
What makes scale=True equivalent to scale=False for CAN?
- \mu is initialized to 0 by default
- \sigma is initialized to 1 by default
- \gamma is initialized to 1 by default
- \beta is initialized to 0 by default
Since you initialize w_0 = 1 and w_1 = 0, the two normalizations should have the same starting point. I would be very interested to see how the weights of the two parameterizations evolve during training.
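For reference, a minimal sketch of the adaptive normalization under discussion, i.e. w_0 * x + w_1 * BN(x) with trainable scalars (the function and variable names here are mine, not from the repo):

import tensorflow as tf
import tensorflow.contrib.slim as slim

def adaptive_norm(x):
    # w0 = 1, w1 = 0 makes the layer start as the identity mapping
    w0 = tf.Variable(1.0, name='w0')
    w1 = tf.Variable(0.0, name='w1')
    return w0 * x + w1 * slim.batch_norm(x)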
I also tried an experiment with adaptive normalization where w_0 = 0 and w_1 = 1. The performance is not good, with an MSE of 37 for ROF. So the initialization matters a lot.
On the other hand, batch normalization may not be suitable for learning an identity mapping, because the batch statistics \mu and \sigma keep changing during training. To achieve a perfect identity mapping we need \gamma = \sigma and \beta = \mu, but \mu and \sigma are recomputed for every batch. It seems hard for gradient descent to keep \gamma = \sigma and \beta = \mu satisfied closely.
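To spell that out with the standard batch-norm formula: BN(x) = \gamma * (x - \mu) / \sigma + \beta, so BN(x) = x for every x exactly when \gamma = \sigma and \beta = \mu; \gamma and \beta would have to track two moving targets.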
> I also tried an experiment with adaptive normalization where w_0 = 0 and w_1 = 0. The performance is not good, with an MSE of 37 for ROF. So the initialization matters a lot.
Do you mean w_0 = 0 and w_1 = 1?
Right, I meant w_0 = 0 and w_1 = 1. Sorry for the typo.