stanfordnlp/GloVe

Source data for training the glove.840B.300d embeddings

lfoppiano opened this issue · 5 comments

Dear all,
I'm collecting information about various embedding approaches, and I'm looking for details on how you trained the embeddings `Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download): glove.840B.300d.zip`

Indeed, the paper does not discuss these details.

In particular, I'm interested in:

  • Do you have the exact source data that was used? Which version of Common Crawl was it?
  • Were these embeddings trained only on English text, or was English mixed with other languages?

Thank you in advance


I have another question: do you happen to have the command-line parameters that were used to train these embeddings?
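For reference, here is a minimal sketch of the training pipeline as laid out in this repository's `demo.sh`; the corpus file and all hyperparameter values below are placeholders of my own, not the actual settings used for glove.840B.300d (those settings are exactly what I'm asking about):

```sh
# Hypothetical pipeline following the tools built by this repo;
# corpus.txt and every parameter value below are placeholder assumptions.

# 1. Build the vocabulary, dropping rare tokens
./build/vocab_count -min-count 5 -verbose 2 < corpus.txt > vocab.txt

# 2. Accumulate word-word co-occurrence counts within a context window
./build/cooccur -memory 4.0 -vocab-file vocab.txt -window-size 15 -verbose 2 \
    < corpus.txt > cooccurrence.bin

# 3. Shuffle the co-occurrence records before optimization
./build/shuffle -memory 4.0 -verbose 2 < cooccurrence.bin > cooccurrence.shuf.bin

# 4. Train vectors (300 dimensions, matching glove.840B.300d)
./build/glove -input-file cooccurrence.shuf.bin -vocab-file vocab.txt \
    -save-file vectors -vector-size 300 -x-max 10 -iter 15 -threads 8 \
    -binary 2 -verbose 2
```

Knowing the actual values you used for parameters such as `-min-count`, `-window-size`, `-x-max`, and `-iter` would be very helpful.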

Thanks