Source data for training embedding glove.840B.300d
lfoppiano opened this issue · 5 comments
lfoppiano commented
Dear all,
I'm collecting information about various embedding approaches, and I'm looking for details on how the training of these embeddings was performed: `Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download): glove.840B.300d.zip`
The paper does not discuss these details.
In particular, I'm interested in:
- Do you have the exact source data that was used? Which version of Common Crawl?
- Were these embeddings trained only on English text, or were other languages mixed in?
Thank you in advance
AngledLuffa commented
The intent was English only, although other things may have snuck in. I don't think we have records of which version of Common Crawl, unfortunately.
lfoppiano commented
Thank you
lfoppiano commented
I have another question: do you have, by any chance, the parameters of the command that was used to train these embeddings?
AngledLuffa commented
I'm sorry, but the person who did the original training is long gone and didn't leave behind any notes.
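For reference, the standard pipeline from the official GloVe repository (`vocab_count` → `cooccur` → `shuffle` → `glove`, as in its `demo.sh`) looks roughly like the sketch below. This is an assumption-laden reconstruction, not the actual command used for `glove.840B.300d`: the corpus path is a placeholder, and the hyperparameter values are taken from the GloVe paper's general description of its 300d models (x_max = 100, window size 10, 100 iterations), since the real settings are unrecorded.

```shell
# Hypothetical sketch of the standard GloVe training pipeline; the exact
# corpus and parameters used for glove.840B.300d are not known.
CORPUS=commoncrawl.txt   # placeholder: the real source corpus is unrecorded

# 1. Build the vocabulary from the corpus
./build/vocab_count -min-count 5 -verbose 2 < $CORPUS > vocab.txt

# 2. Compute word-word co-occurrence statistics (window size per the paper)
./build/cooccur -memory 4.0 -vocab-file vocab.txt -verbose 2 \
    -window-size 10 < $CORPUS > cooccurrence.bin

# 3. Shuffle the co-occurrence records before training
./build/shuffle -memory 4.0 -verbose 2 \
    < cooccurrence.bin > cooccurrence.shuf.bin

# 4. Train 300d vectors (x-max and iteration count per the paper's 300d setup)
./build/glove -save-file vectors -threads 8 \
    -input-file cooccurrence.shuf.bin -vocab-file vocab.txt \
    -x-max 100 -iter 100 -vector-size 300 -binary 2 -verbose 2
```

This is a command sketch only; it requires the compiled GloVe tools from the official repository and is not runnable on its own.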
lfoppiano commented
Thanks