GloVe Embeddings
Dear author,
I just noticed that the GloVe embeddings that were used are filtered. For future reference, I am wondering how these embeddings were filtered. Were they filtered based on their frequency, or perhaps based on the words that occur in the datasets used?
To make the model lightweight, I selected only the words that appear in the train, dev, and test sets. This is equivalent to using the whole of GloVe; the latter just makes the model much larger. If the model is evaluated on other test sets, it's recommended to use the whole of GloVe.
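A rough sketch of that kind of filtering could look like the following (the file names and whitespace tokenisation are only placeholders, not the repo's actual build script):

```python
# Keep only GloVe vectors for words that appear in the train/dev/test sets.
def load_dataset_vocab(paths):
    vocab = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                vocab.update(line.strip().split())
    return vocab

def filter_glove(glove_path, vocab, out_path):
    kept = 0
    with open(glove_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            word = line.split(" ", 1)[0]  # GloVe lines are "word v1 v2 ... vd"
            if word in vocab:
                dst.write(line)
                kept += 1
    return kept

vocab = load_dataset_vocab(["train.txt", "dev.txt", "test.txt"])
n = filter_glove("glove.840B.300d.txt", vocab, "glove.filtered.txt")
print(f"kept {n} vectors")
```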
Thanks for the quick response!
And how was the #UNK# token computed?
PAD and UNK tokens are randomly initialized and fixed.
I see. I also noticed that in your code you check whether an #UNK# token was added; if so, you take the mean of the entity vectors and use that as the #UNK# token. Why did you decide to use a randomly initialized vector rather than the mean of all entities/words?
I remember I have two ways to initialize the UNK token: random, and the mean of all word vectors. In practice I didn't see any significant difference in performance. You could try either of them.
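For example, something along these lines (the uniform range and the toy embedding matrix are only illustrative, not the values used in the repo):

```python
import numpy as np

rng = np.random.RandomState(0)
embeddings = rng.randn(5, 300).astype(np.float32)  # stand-in for the loaded GloVe matrix
dim = embeddings.shape[1]

unk_random = rng.uniform(-0.1, 0.1, size=dim)  # option 1: random vector, then kept fixed
unk_mean = embeddings.mean(axis=0)             # option 2: mean of all word vectors
pad = np.zeros(dim, dtype=np.float32)          # PAD row, also kept fixed
```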
So I read that you removed words that do not occur more than 2-3 times in AIDA. Is that all the filtering that was applied? That is, the only words removed were those that did not occur more than 2-3 times in AIDA-train, while all words from the remaining datasets were kept?
If I were to use this model on other datasets, what would your recommendation be? Use the filtered GloVe embeddings and simply replace all out-of-vocabulary words with the #UNK# token, or use all GloVe embeddings (as was stated before)?
I don't remember that part. Can you tell me which file (and line) you are looking at?
Sorry, by words I meant tokens from the set of GloVe embeddings. You stated this in a previous issue [0].
[0] #12
OK, now I remember. In fact, I didn't include the code for building the vocabulary here, as it is just a bash command line. So I have to admit I'm writing from memory (sorry for being unprofessional). Right, I don't do any further filtering. If you see anything wrong, please let me know.
No problem! So to clarify, you did not perform any filtering and all words in the training, validation, and test sets were considered? Or did you in fact remove words that occurred fewer than 2-3 times in the training set?
Thank you so much for taking the time to answer my questions!
I think it's easy to check by comparing the model's vocabulary with the vocabularies of GloVe and of all the datasets. I don't think I do any more filtering.
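For instance, a quick check could look like this (file names are placeholders; the model and dataset files are assumed to be plain whitespace-separated text):

```python
def read_vocab(path, first_token_only=False):
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            words.add(line.split()[0] if first_token_only else line)
    return words

model_vocab = read_vocab("model_vocab.txt")
glove_vocab = read_vocab("glove.840B.300d.txt", first_token_only=True)

dataset_vocab = set()
for path in ["train.txt", "dev.txt", "test.txt"]:
    with open(path, encoding="utf-8") as f:
        for line in f:
            dataset_vocab.update(line.strip().split())

# If no extra filtering was applied, every model word should appear both
# in GloVe and in at least one of the dataset splits.
print("in model but not in GloVe:", len(model_vocab - glove_vocab))
print("in model but not in any dataset:", len(model_vocab - dataset_vocab))
```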