There is only one BN layer in the network; is that useful?
As far as I can tell, the paper says nothing about batch normalization, yet in the code only the first feature layer applies it (line 92 at commit 53e481a). Is that useful? Could we apply it to every feature layer before compute_heads?
In the paper, the authors state:

> Since, as pointed out in [12], conv4_3 has a different feature scale compared to the other layers, we use the L2 normalization technique introduced in [12] to scale the feature norm at each location in the feature map to 20 and learn the scale during back propagation.
Because there wasn't an equivalent implementation for TensorFlow at the time, I just used a BatchNorm layer instead. I tried to implement an L2 normalization layer on my own, but the loss didn't converge as expected.
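For anyone who wants to try, a minimal sketch of such a layer might look like the following. This is a Keras-style sketch of the technique from [12] as the paper describes it, not code from this repo; the class name `L2Normalization` and the `initial_scale` parameter are my own:

```python
import tensorflow as tf

class L2Normalization(tf.keras.layers.Layer):
    """L2-normalize features across channels, then rescale them with a
    learnable per-channel factor (the technique from [12])."""

    def __init__(self, initial_scale=20.0, **kwargs):
        super().__init__(**kwargs)
        self.initial_scale = initial_scale

    def build(self, input_shape):
        # One learnable scale per channel, initialized to 20 as in the
        # paper, and updated during back propagation.
        self.gamma = self.add_weight(
            name="gamma",
            shape=(int(input_shape[-1]),),
            initializer=tf.keras.initializers.Constant(self.initial_scale),
            trainable=True,
        )

    def call(self, inputs):
        # Scale each spatial location's feature vector (NHWC, so the last
        # axis) to unit L2 norm, then rescale channel-wise by gamma.
        return self.gamma * tf.math.l2_normalize(inputs, axis=-1)

# Hypothetical usage on the conv4_3 feature map:
# conv4_3_normed = L2Normalization()(conv4_3_features)
```

One possible convergence pitfall with this layer is the scale initialization: unit-norm features can be too small for the subsequent layers to learn from, which is presumably why the paper initializes the scale to 20 rather than 1.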