google/tirg

Why is normalize_scale set to 4.0?


I found in the code that the normalization layer is constructed with `normalize_scale=4.0, learn_scale=True`, i.e. normalize_scale is set to 4.0 from the start. It is interesting because performance drops a lot if I set the scale to another number such as 1.0.
I wonder why this value is important and why 4.0 is an ideal number.
Thanks!

Great question! This number affects the range of the logit value (the input to the logistic function, https://en.wikipedia.org/wiki/Logistic_function). If you look at the logistic function, the logit range must be big enough to cover the whole 0 to 1 output range (but if it is too big, it can land in the flat-gradient zone).
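
To make the range argument concrete, here is a small sketch (mine, not the repo's code) of how the scale stretches cosine-similarity logits before the sigmoid:

```python
import torch

# Cosine similarities of L2-normalized vectors are bounded in [-1, 1].
cos_sim = torch.linspace(-1.0, 1.0, steps=5)

for scale in [1.0, 4.0, 20.0]:
    probs = torch.sigmoid(scale * cos_sim)
    print(f"scale={scale:4.1f}  probs={probs.tolist()}")

# scale=1.0 -> sigmoid only spans ~[0.27, 0.73]: it can never output
#              a confident probability, no matter how good the match.
# scale=4.0 -> spans ~[0.02, 0.98]: covers nearly the full (0, 1) range.
# scale=20  -> saturates near 0 and 1: gradients vanish in the flat zones.
```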

I just found experimentally that 4.0 works. You can make it a learnable weight, though optimizing it could be more difficult. Finally, I have seen similar implementations where people did make it work with 1.0; I'm not sure how.
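
For reference, a minimal sketch of what a normalization layer with a learnable scale could look like (the class and parameter names here are my own assumptions, not necessarily what the repo uses):

```python
import torch
import torch.nn.functional as F

class NormalizationLayer(torch.nn.Module):
    """L2-normalizes features, then multiplies by a scale that can be learned."""

    def __init__(self, normalize_scale=4.0, learn_scale=True):
        super().__init__()
        # If learn_scale is True, the scale is trained with the model;
        # otherwise it stays fixed at its initial value.
        self.scale = torch.nn.Parameter(
            torch.tensor(float(normalize_scale)), requires_grad=learn_scale)

    def forward(self, x):
        return self.scale * F.normalize(x, dim=1)
```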

Thank you for your reply. I wanted to replace it with torch.nn.functional.normalize at first, but the results show it is not as simple as I thought.
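
For what it's worth, bare `torch.nn.functional.normalize` corresponds to the scale=1.0 case, the setting reported above to hurt performance; the scaled variant just multiplies its output, as in this sketch:

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 512)
unit = F.normalize(x, dim=1)   # unit-length rows, i.e. scale 1.0
scaled = 4.0 * unit            # the scale-4 variant discussed above
print(unit.norm(dim=1)[:2], scaled.norm(dim=1)[:2])  # ~1.0 vs ~4.0
```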