Source data for training embedding glove.840B.300d
lfoppiano opened this issue · 5 comments
lfoppiano commented
Dear all,
I'm collecting information about various embedding approaches, and I'm looking for details on how the training of these embeddings was performed: `Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download): glove.840B.300d.zip`
The paper does not discuss these details.
In particular, I'm interested in:
- Do you have the exact source data that was used? Which version of Common Crawl?
- Were these embeddings trained only on English text, or were other languages mixed in?
Thank you in advance
AngledLuffa commented
The intent was English only, although other things may have snuck in. I don't think we have records of which version of Common Crawl, unfortunately.
lfoppiano commented
Thank you
lfoppiano commented
I have another question: do you have, by any chance, the parameters of the command that was used to train these embeddings?
AngledLuffa commented
I'm sorry, but the person who did the original training is long gone and didn't leave behind any notes.
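For reference, the standard pipeline from the official GloVe repository (`vocab_count` → `cooccur` → `shuffle` → `glove`, as in its `demo.sh`) looks roughly like the sketch below. This is an assumption-laden reconstruction, not the actual command used for `glove.840B.300d`: the corpus path is a placeholder, and the hyperparameter values are taken from the GloVe paper's general description of its 300d models (x_max = 100, window size 10, 100 iterations), since the real settings are unrecorded.

```shell
# Hypothetical sketch of the standard GloVe training pipeline; the exact
# corpus and parameters used for glove.840B.300d are not known.
CORPUS=commoncrawl.txt   # placeholder: the real source corpus is unrecorded

# 1. Build the vocabulary from the corpus
./build/vocab_count -min-count 5 -verbose 2 < $CORPUS > vocab.txt

# 2. Compute word-word co-occurrence statistics (window size per the paper)
./build/cooccur -memory 4.0 -vocab-file vocab.txt -verbose 2 \
    -window-size 10 < $CORPUS > cooccurrence.bin

# 3. Shuffle the co-occurrence records before training
./build/shuffle -memory 4.0 -verbose 2 \
    < cooccurrence.bin > cooccurrence.shuf.bin

# 4. Train 300d vectors (x-max and iteration count per the paper's 300d setup)
./build/glove -save-file vectors -threads 8 \
    -input-file cooccurrence.shuf.bin -vocab-file vocab.txt \
    -x-max 100 -iter 100 -vector-size 300 -binary 2 -verbose 2
```

This is a command sketch only; it requires the compiled GloVe tools from the official repository and is not runnable on its own.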
lfoppiano commented
Thanks