yzhangcs/parser

A question about the initialization of biaffine module

speedcell4 opened this issue · 6 comments

Hi~

Why do you initialize the weight to be all zeros?

def reset_parameters(self):
    nn.init.zeros_(self.weight)
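For context, here is a minimal NumPy sketch of what a biaffine scorer with zero-initialized weights computes. The class name, shapes, and omission of bias terms are illustrative simplifications, not the repo's exact code:

```python
import numpy as np

class Biaffine:
    """Minimal biaffine scorer: s(x, y) = x^T W y.

    Illustrative sketch only; the real module also handles bias
    augmentation and batched inputs.
    """

    def __init__(self, n_in):
        # Mirrors nn.init.zeros_(self.weight)
        self.weight = np.zeros((n_in, n_in))

    def __call__(self, x, y):
        # With an all-zero weight matrix, every score starts at exactly 0.
        return x @ self.weight @ y

scorer = Biaffine(4)
x = np.random.randn(4)
y = np.random.randn(4)
print(scorer(x, y))  # 0.0 at initialization
```

So at initialization every candidate arc gets the same score (zero), which is what prompts the question below.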

As I recall, PyTorch's default initialization is different. Could you please explain why you made this choice?

Hi @speedcell4

My implementation references this repo.
In practice, I found that the default initialization strategy and orthogonal initialization also work very well.

So there is no significant performance difference among the three initialization strategies, i.e., zeros, uniform, and orthogonal.
Honestly, initializing with all zeros seems quite strange to me: doesn't it map every input vector to a zero vector and accumulate zero gradients?

@speedcell4 Sorry, it's been a long time and I don't remember the exact details. But empirically I have found that zero init consistently performs well.

Okay, I got it. Thanks for your replies~

@speedcell4

doesn't it map every input vector to a zero vector and accumulate zero gradients?

I don't think this leads to zero gradients: the gold dependencies backpropagate gradients of 1 to the Biaffine layer, which helps it move away from the zero init quickly.
However, zero init does have some potential problems, and in some cases it makes the model difficult to train.
In practice, I found normal init performs much better than zero init on Chinese constituency parsing.
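To see why the gradient is nonzero even when W = 0: for a single score s = xᵀWy, the gradient ∂s/∂W = xyᵀ depends only on the inputs, not on W. A small NumPy check (the vectors here are arbitrary illustrative values):

```python
import numpy as np

# Score s = x^T W y; its gradient w.r.t. W is the outer product x y^T,
# which is independent of the current value of W.
n = 4
W = np.zeros((n, n))              # zero init, as discussed above
x = np.array([1.0, 2.0, 0.5, -1.0])
y = np.array([0.3, -0.7, 1.2, 0.9])

grad_W = np.outer(x, y)           # dL/dW when the upstream gradient is 1

print(np.allclose(grad_W, 0))     # False: gradients flow despite W == 0
```

After one gradient step, W is no longer zero, so subsequent updates are unconstrained; the all-zero start is a symmetric but not a stuck point for this layer.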
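A sketch of the comparison being described, with NumPy standing in for the PyTorch initializers; the standard deviation 0.02 is an arbitrary illustrative choice, not the value used in the repo:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in = 4

w_zero = np.zeros((n_in, n_in))                   # zero init
w_normal = rng.normal(0.0, 0.02, (n_in, n_in))    # normal init (illustrative std)

# Unlike the zero matrix, the normal-initialized weight breaks symmetry
# immediately: different input pairs get different scores from step one.
x = np.ones(n_in)
y = np.ones(n_in)
print(x @ w_zero @ y)    # always 0.0 before training
print(x @ w_normal @ y)  # a small nonzero value
```

Which start works better appears to be task-dependent, per the observation above about Chinese constituency parsing.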

I understand it now, thank you for your explanation~