A question about the initialization of biaffine module
speedcell4 opened this issue · 6 comments
Hi~
Why do you initialize the weight to all zeros?
parser/supar/modules/affine.py
Lines 54 to 55 in 16ad395
As I remember, PyTorch initializes such weights differently by default. Could you please explain your choice?
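For context, the lines in question zero-initialize the biaffine weight tensor. A minimal sketch of such a module (simplified and hedged: this is not the exact supar code, and the names and shapes are my own) might look like:

```python
import torch
import torch.nn as nn


class Biaffine(nn.Module):
    """Minimal biaffine scorer sketch: s(x, y) = x^T W y for each head/dependent pair."""

    def __init__(self, n_in, n_out=1, bias_x=True, bias_y=True):
        super().__init__()
        self.bias_x, self.bias_y = bias_x, bias_y
        # the initialization being asked about: all zeros
        self.weight = nn.Parameter(
            torch.zeros(n_out, n_in + bias_x, n_in + bias_y))

    def forward(self, x, y):
        # x, y: [batch, seq_len, n_in]
        if self.bias_x:
            x = torch.cat((x, torch.ones_like(x[..., :1])), -1)
        if self.bias_y:
            y = torch.cat((y, torch.ones_like(y[..., :1])), -1)
        # scores: [batch, n_out, seq_len, seq_len]
        return torch.einsum('bxi,oij,byj->boxy', x, self.weight, y)
```

With zero weights, every pairwise score starts out exactly 0, which is what motivates the question.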
Hi @speedcell4
My implementation references this repo.
In practice, I found that the default (uniform) initialization and orthogonal initialization also work very well,
so there is no significant performance difference among the three strategies, i.e., zeros, uniform, and orthogonal.
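As a sketch, the three strategies under discussion can be written with `torch.nn.init` (the dimension here is arbitrary; to my knowledge, PyTorch's `nn.Bilinear` defaults to uniform init in ±1/sqrt(fan_in), but treat the exact bound as an assumption):

```python
import torch
import torch.nn as nn

d = 5  # arbitrary feature dimension for illustration

# 1. zero init (the choice questioned in this issue)
w_zero = nn.init.zeros_(torch.empty(d, d))

# 2. uniform init, roughly what PyTorch's nn.Bilinear does by default
bound = d ** -0.5
w_unif = nn.init.uniform_(torch.empty(d, d), -bound, bound)

# 3. orthogonal init
w_orth = nn.init.orthogonal_(torch.empty(d, d))
```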
Honestly, zero initialization seems quite strange to me. Doesn't it map every input vector to a zero vector and accumulate zero gradients?
@speedcell4 Sorry, it's been a long time and I don't remember the exact details. But empirically I found that zero init has always performed well.
Okay, I got it. Thanks for your replies~
Doesn't it map every input vector to a zero vector and accumulate zero gradients?
I don't think this leads to zero gradients: the gold dependencies backpropagate gradients of 1 to the Biaffine layers, which helps the weights move away from zero init quickly.
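A quick way to see this: for a bilinear score s = xᵀWy, the gradient ∂s/∂W = x yᵀ does not depend on W at all, so it is nonzero even when W starts at zero. A toy check (shapes are purely illustrative):

```python
import torch

torch.manual_seed(0)
x = torch.randn(4)
y = torch.randn(4)
W = torch.zeros(4, 4, requires_grad=True)  # zero init

score = x @ W @ y  # forward score is exactly 0 at init
score.backward()   # W.grad = outer(x, y), independent of W's value
```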
However, it does have some potential problems, and in some cases zero init is difficult to train.
In practice, I found that normal init performs much better than zero init on Chinese constituency parsing.
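For reference, such a normal init is a one-liner with `torch.nn.init` (the shape and the `std` value here are my own placeholders, not supar's actual settings):

```python
import torch
import torch.nn as nn

w = torch.empty(400, 400)
nn.init.normal_(w, mean=0.0, std=0.02)  # std=0.02 is a hypothetical choice
```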
I understand it now, thank you for your explanation~