aws-samples/sagemaker-101-workshop

Alternative word vector source?


The NLP example currently uses GloVe word vectors from Stanford's repository, but these are:

  • Sometimes slow to download on our typical instance type (~6min30), because the combined zip of the 50/100/200/300D vectors is downloaded and the unused files discarded: there don't seem to be separate downloads offered for just the 100D vectors the model uses (see the extraction sketch after this list).
  • Only offered pre-trained in English, which makes the exercise less transferable for ASEAN customers.
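As a stopgap, extracting only the 100D file from the downloaded archive at least avoids unpacking the unused variants to disk. A minimal sketch, assuming the archive URL and member filenames currently listed on the Stanford GloVe download page (`glove.6B.zip`, `glove.6B.100d.txt`):

```python
import urllib.request
import zipfile

# URL and member names as listed on the Stanford GloVe download page -
# verify before relying on them:
GLOVE_URL = "https://nlp.stanford.edu/data/glove.6B.zip"

zip_path, _ = urllib.request.urlretrieve(GLOVE_URL, "glove.6B.zip")

# The whole archive still has to come over the wire, but extracting just the
# 100D member skips unpacking the unused 50/200/300D files:
with zipfile.ZipFile(zip_path) as zf:
    zf.extract("glove.6B.100d.txt", path="data")
```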

We could maybe instead consider:

  • Using fastText embeddings: they offer pre-trained "Word vectors for 157 languages", but only in 300D, so we'd still need to downsample to 100D with fastText's dimension-reduction tool (see the sketch after this list), and the download would still be slower than necessary.
  • Using some other embedding source (?)
  • Pre-preparing and hosting the embeddings ourselves (e.g. in S3, as sketched below) for optimized download times, at the expense of transparency about how the vectors are created.
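For the fastText option, the dimension reduction could look roughly like the sketch below, using the `fasttext.util.reduce_model` helper from the official `fasttext` Python package (the model name assumes the English vectors; swap the language code for other ASEAN languages):

```python
import fasttext
import fasttext.util

# Download the pre-trained 300D vectors for a given language
# ('en' here; e.g. 'th', 'vi', 'id' for ASEAN languages):
fasttext.util.download_model("en", if_exists="ignore")  # saves cc.en.300.bin

ft = fasttext.load_model("cc.en.300.bin")
print(ft.get_dimension())  # 300

# Reduce to the 100D the workshop model expects (PCA-based, in place):
fasttext.util.reduce_model(ft, 100)
print(ft.get_dimension())  # 100

ft.save_model("cc.en.100.bin")
```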
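And if we did pre-prepare the reduced vectors, hosting them in S3 would be a one-time upload plus a fast in-region download from the notebook. A sketch with boto3; the bucket and key names here are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# One-time: publish the pre-reduced vectors (bucket/key are hypothetical):
s3.upload_file("cc.en.100.bin", "example-workshop-assets",
               "embeddings/cc.en.100.bin")

# In the notebook: a single in-region transfer instead of a slow
# download from an external host:
s3.download_file("example-workshop-assets", "embeddings/cc.en.100.bin",
                 "cc.en.100.bin")
```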

Already addressed in both the PyTorch and TensorFlow variants, so closing the issue.