DeepRec-AI/DeepRec

MaskNet: replace BatchNorm with LayerNorm

zippeurfou opened this issue · 6 comments

The MaskNet paper uses LayerNorm; however, the code implementation uses BatchNorm.
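
For reference, a rough sketch of the two options in question (illustrative only, not the actual MaskNet code; hidden stands in for a MaskBlock's feed-forward output, and the dims are made up):

import tensorflow as tf

hidden = tf.random.normal([8, 4])  # [batch_size, embedding_dim]

# What the paper specifies: LayerNorm over the feature axis of each sample.
ln_out = tf.keras.layers.LayerNormalization(axis=-1)(hidden)

# What the example code used instead: BatchNorm on the same tensor.
bn_out = tf.layers.batch_normalization(hidden, training=True)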

Hi Marc,

Thanks for bringing this up! This is indeed a bug, and we are fixing it.

Hi Marc,

Upon checking, this is not a bug. When BatchNorm is applied on the default axis (the last dim), it reduces to LayerNorm, and since the size of gamma/beta depends on the shape of the input tensor, the original implementation is still correct.
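
The parameter-shape point can be checked directly; here is a minimal sketch (arbitrary dims, using the Keras layers for brevity):

import tensorflow as tf

bn = tf.keras.layers.BatchNormalization(axis=-1)
ln = tf.keras.layers.LayerNormalization(axis=-1)

# Both layers size gamma/beta from the last dim of the input shape.
bn.build([None, 4])
ln.build([None, 4])

print(bn.gamma.shape, bn.beta.shape)  # (4,) (4,)
print(ln.gamma.shape, ln.beta.shape)  # (4,) (4,)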

However, for code clarity, we updated the example (ref PR #816).

Thanks for the comment!

I am not sure I am following; see this screenshot.
[screenshot dated 2023-04-18]
What am I missing?

Because your code isn't in training mode.

tf.layers.batch_normalization() dispatches to the class BatchNormalizationBase:

class BatchNormalizationBase(Layer):

tf.keras.layers.LayerNormalization() dispatches to the class LayerNormalization:

class LayerNormalization(Layer):

In LayerNormalization, mean and variance are computed by nn.moments:

mean, variance = nn.moments(inputs, self.axis, keep_dims=True)

and then nn.batch_normalization produces the result:

outputs = nn.batch_normalization(
    inputs,
    mean,
    variance,
    offset=offset,
    scale=scale,
    variance_epsilon=self.epsilon)
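
To confirm that this is really all the layer does, here is a small sketch (TF 2.x eager style for brevity, arbitrary numbers) that recomputes LayerNormalization from those two primitives:

import numpy as np
import tensorflow as tf

x = np.random.randn(8, 4).astype(np.float32)
y = tf.keras.layers.LayerNormalization(axis=-1, epsilon=1e-3)(x)

# Same two primitives as the layer: per-sample stats over the last axis,
# then the normalization op (gamma=1, beta=0 at init, so scale/offset=None).
mean, variance = tf.nn.moments(tf.constant(x), axes=[-1], keepdims=True)
manual = tf.nn.batch_normalization(x, mean, variance, offset=None,
                                   scale=None, variance_epsilon=1e-3)
np.testing.assert_allclose(y.numpy(), manual.numpy(), rtol=1e-3, atol=1e-5)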

Without its extra features, BN is the same:

def _moments(self, inputs, reduction_axes, keep_dims):
  mean, variance = nn.moments(inputs, reduction_axes, keep_dims=keep_dims)
  # TODO(b/129279393): Support zero batch input in non DistributionStrategy
  # code as well.
  if self._support_zero_size_input():
    inputs_size = array_ops.size(inputs)
    mean = array_ops.where(inputs_size > 0, mean, K.zeros_like(mean))
    variance = array_ops.where(inputs_size > 0, variance,
                               K.zeros_like(variance))
  return mean, variance

mean, variance = self._moments(
    math_ops.cast(inputs, self._param_dtype),
    reduction_axes,
    keep_dims=keep_dims)

outputs = nn.batch_normalization(inputs,
                                 _broadcast(mean),
                                 _broadcast(variance),
                                 offset,
                                 scale,
                                 self.epsilon)
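
The same check works for BatchNorm in training mode (again a TF 2.x eager sketch with arbitrary numbers); note that reduction_axes here is every axis except the feature axis:

import numpy as np
import tensorflow as tf

x = np.random.randn(8, 4).astype(np.float32)
y = tf.keras.layers.BatchNormalization(axis=-1, epsilon=1e-3)(x, training=True)

# In training, BN reduces over the batch axis (axis 0 for a 2-D input) and
# feeds the batch statistics into the same nn.batch_normalization primitive.
mean, variance = tf.nn.moments(tf.constant(x), axes=[0], keepdims=True)
manual = tf.nn.batch_normalization(x, mean, variance, offset=None,
                                   scale=None, variance_epsilon=1e-3)
np.testing.assert_allclose(y.numpy(), manual.numpy(), rtol=1e-3, atol=1e-5)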

But the difference is that when you are not in training mode, the mean and variance of BN are replaced by the moving statistics:

mean = tf_utils.smart_cond(training,
                           lambda: mean,
                           lambda: ops.convert_to_tensor(moving_mean))
variance = tf_utils.smart_cond(
    training,
    lambda: variance,
    lambda: ops.convert_to_tensor(moving_variance))

You can pass the input parameter moving_mean_initializer='ones' (it defaults to 'zeros') and observe that the output changes.
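
A runnable version of that experiment (TF 2.x eager sketch, arbitrary numbers):

import numpy as np
import tensorflow as tf

x = np.random.randn(8, 4).astype(np.float32)

# Default: moving_mean initialized to zeros, moving_variance to ones.
y_default = tf.keras.layers.BatchNormalization(axis=-1)(x, training=False)

# Same layer config, but with moving_mean initialized to ones instead.
y_ones = tf.keras.layers.BatchNormalization(
    axis=-1, moving_mean_initializer='ones')(x, training=False)

# The inference output shifts, showing the moving statistics (not the
# batch statistics) are used when training=False.
assert not np.allclose(y_default.numpy(), y_ones.numpy())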

Thanks @Duyi-Wang, that makes sense. I was confused by it as well, but the doc clearly states it. Thanks for pointing out the code.
Adding a screenshot for posterity.
[screenshot dated 2023-04-19]
Feel free to close this one.