th-b/bla

Beta Activation Function

Closed this issue · 1 comment

Thanks for this excellent paper!
I notice that in the paper you use min(x, 0) as the beta activation function. I'm a bit confused about why you don't use max(0, x), like ReLU; it seems that ReLU would make more sense?
Thank you!

th-b commented

Hey, thank you for your interest in our work!

The point of using beta(x) = min(x, 0) is that we want to approximate the logits of the probability distribution p, which is uniform on its support. So the logits of p are either 0 (or any constant you like) on the support, or -infinity off it. Since min(x, 0) only produces values in (-infinity, 0], it matches this target, whereas ReLU's outputs max(0, x) are bounded below by 0 and can never represent the very negative logits needed to assign (near-)zero probability outside the support.
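
To make this concrete, here is a minimal sketch (my own illustration, not code from the paper) showing how min(x, 0)-activated logits, passed through a softmax, give an approximately uniform distribution over the support. The raw scores and function names are invented for the example:

```python
import numpy as np

def beta(x):
    # beta(x) = min(x, 0): outputs lie in (-inf, 0], matching the logits of a
    # distribution that is uniform on its support (0 on it, -inf off it).
    return np.minimum(x, 0.0)

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical raw scores from some network head:
# large positive -> "on the support", large negative -> "off the support".
raw = np.array([5.0, 3.0, -50.0, 7.0, -40.0])

logits = beta(raw)           # -> [0, 0, -50, 0, -40]
p = softmax(logits)          # ~ uniform over entries 0, 1 and 3
print(np.round(p, 3))        # approx [0.333, 0.333, 0.   , 0.333, 0.   ]
```

Since the softmax is invariant to adding a constant to all logits, using 0 on the support (or any other constant, as noted above) gives the same distribution.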