There is only one BN layer in the network; is that useful?
As far as I can tell, the paper says nothing about batch normalization, yet in the code only the first feature layer applies it (line 92 at commit 53e481a). Is that useful? Could we apply it to every feature layer before compute_heads?
In the paper, the authors state:

> Since, as pointed out in [12], conv4_3 has a different feature scale compared to the other layers, we use the L2 normalization technique introduced in [12] to scale the feature norm at each location in the feature map to 20 and learn the scale during back propagation.
Because there wasn't an equivalent implementation for TensorFlow at the time, I just used a BatchNorm layer instead. I tried to implement an L2 normalization layer on my own, but the loss didn't converge as expected.
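For anyone who wants to try, a minimal sketch of such a layer might look like the following. This is a Keras-style sketch of the technique from [12] as the paper describes it, not code from this repo; the class name `L2Normalization` and the `initial_scale` parameter are my own:

```python
import tensorflow as tf

class L2Normalization(tf.keras.layers.Layer):
    """L2-normalize features across channels, then rescale them with a
    learnable per-channel factor (the technique from [12])."""

    def __init__(self, initial_scale=20.0, **kwargs):
        super().__init__(**kwargs)
        self.initial_scale = initial_scale

    def build(self, input_shape):
        # One learnable scale per channel, initialized to 20 as in the
        # paper, and updated during back propagation.
        self.gamma = self.add_weight(
            name="gamma",
            shape=(int(input_shape[-1]),),
            initializer=tf.keras.initializers.Constant(self.initial_scale),
            trainable=True,
        )

    def call(self, inputs):
        # Scale each spatial location's feature vector (NHWC, so the last
        # axis) to unit L2 norm, then rescale channel-wise by gamma.
        return self.gamma * tf.math.l2_normalize(inputs, axis=-1)

# Hypothetical usage on the conv4_3 feature map:
# conv4_3_normed = L2Normalization()(conv4_3_features)
```

One possible convergence pitfall with this layer is the scale initialization: unit-norm features can be too small for the subsequent layers to learn from, which is presumably why the paper initializes the scale to 20 rather than 1.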