<unk> in the second notebook
pietz opened this issue · 3 comments
Could somebody explain why we manually change the unk_init from zeros to a normal distribution and then afterwards overwrite the weights in the embedding layer from the normally distributed values back to zeros? This seems redundant.
The unk_init isn't just used to set the initial embedding of the <unk> token; it is used to set the initial embedding of every token in your vocabulary that is not in your pre-trained embeddings. E.g. if the word "bananas" is in your vocabulary but not in your pre-trained embeddings, then the embedding for "bananas" will be initialized from whatever you specify your unk_init function to be.
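To make that concrete, here's a minimal sketch of the behaviour in plain NumPy (the vocab, vectors, and unk_init below are hypothetical stand-ins, not torchtext's actual internals): every vocab token missing from the pre-trained set gets a vector drawn from unk_init, including <unk> and <pad> themselves.

```python
import numpy as np

# Hypothetical toy vocab and pre-trained vectors standing in for
# a real torchtext vocab and GloVe embeddings.
vocab = ["<unk>", "<pad>", "the", "bananas"]      # "bananas" is not pre-trained
pretrained = {"the": np.array([0.1, 0.2, 0.3])}
dim = 3

def unk_init(dim, rng):
    # Called for every vocab token missing from the pre-trained vectors --
    # here N(0, 1), matching torch.Tensor.normal_ in the notebook.
    return rng.normal(loc=0.0, scale=1.0, size=dim)

rng = np.random.default_rng(0)
embedding = np.stack([
    pretrained[w] if w in pretrained else unk_init(dim, rng)
    for w in vocab
])

# Only "the" keeps its pre-trained vector; <unk>, <pad> and "bananas"
# all come from unk_init.
```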
That said, I was told it was good practice to initialize the <unk> token to all zeros, but I no longer believe this is the case; I think only the <pad> token should be initialized to zeros - although, from experimenting, it makes basically no difference to the final results.
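The overwrite step the question asks about can be sketched in PyTorch like this (a hedged illustration, not the notebook's exact code; the sizes are made up, and UNK_IDX/PAD_IDX assume torchtext's usual convention of <unk> at index 0 and <pad> at index 1): first copy the full vector matrix into the embedding layer, then zero out just the <unk> and <pad> rows.

```python
import torch
import torch.nn as nn

# Hypothetical sizes and indices for illustration.
VOCAB_SIZE, EMBEDDING_DIM = 4, 3
UNK_IDX, PAD_IDX = 0, 1

embedding = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM, padding_idx=PAD_IDX)

# Stand-in for vocab.vectors: every row already filled, either from the
# pre-trained embeddings or from unk_init.
pretrained_vectors = torch.randn(VOCAB_SIZE, EMBEDDING_DIM)
embedding.weight.data.copy_(pretrained_vectors)

# Overwrite only the <unk> and <pad> rows with zeros; all other rows
# (pre-trained or unk_init-initialized) are left untouched.
embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)
```

So the unk_init choice still matters for every out-of-pre-trained-vocabulary word; only the two special tokens get zeroed afterwards.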
Thanks @bentrevett. That makes perfect sense. I'm not an NLP expert but I would also question why zero vectors are the default choice for unknown words.
I am not sure either. The only related work I am aware of is this, which tries different embedding initialization techniques and finds that there's not much difference between zeros, Xavier, He, N(0, 0.1), N(0, 0.01) and N(0, 0.001) - see table 1 on page 4.
However, their experiments focus on initializing all of the embeddings, not just those outside the pre-trained embedding vocabulary, which they initialize with N(0, 0.01) - see the last paragraph of section 3.1.