szagoruyko/attention-transfer

Setting of β

Opened this issue · 1 comment

Hi.

In the paper, the authors said "As for parameter β in eq. 2, it usually varies about 0.1, as we set it to 10^3 divided by number of elements in attention map and batch size for each layer."

But I am still confused. What does 10^3 mean, and how was 0.1 obtained?

@tangbohu I assume that β is 10^3 / batch_size / (feature_map_size)^2. In practice this division happens inside the averaging in the loss function here: the batch size is 128 by default and the feature map size varies over 32x32, 16x16, and 8x8, so the resulting value comes out around 0.1. Just my own conjecture from the code.
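
For what it's worth, here is a minimal PyTorch sketch of where that division would happen. It paraphrases the `at`/`at_loss` helpers from the repo's `utils.py` (not a verbatim copy) and assumes β is passed as 1e3 on the command line, as conjectured above; the shapes and the 8x8 example are illustrative.

```python
import torch
import torch.nn.functional as F

def at(x):
    # Spatial attention map: average squared activations over channels,
    # flatten to (batch, H*W), and L2-normalize per sample.
    return F.normalize(x.pow(2).mean(1).view(x.size(0), -1))

def at_loss(x, y):
    # .mean() averages over both the batch and the H*W elements, so scaling
    # this loss by beta = 1e3 is equivalent to weighting the *summed* loss by
    # 1e3 / (batch_size * num_elements_in_attention_map).
    return (at(x) - at(y)).pow(2).mean()

# Illustrative example: batch of 128, 8x8 feature maps (the smallest stage).
student = torch.randn(128, 64, 8, 8)
teacher = torch.randn(128, 64, 8, 8)
loss = 1e3 * at_loss(student, teacher)

# Effective per-element weight for this stage:
# 1e3 / (128 * 8 * 8) ≈ 0.12, i.e. "about 0.1"; larger maps (16x16, 32x32)
# give smaller values, which is presumably why the paper says it "varies".
```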