Alternative word vector source?
Closed this issue · 1 comment
athewsey commented
The NLP example currently uses GloVe word vectors from Stanford's repository, but these are:
- Sometimes slow to download on our typical instance type (~6min 30s), because the combined zip of the 50/100/200/300D vectors is downloaded and the unused files are discarded. There don't seem to be separate downloads offered for just the 100D vectors the model uses (see the sketch after this list).
- Only offered pre-trained in English, which makes the exercise less transferable for ASEAN customers.
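For reference, a minimal sketch of the current download pattern (assuming Stanford's standard combined `glove.6B.zip` archive; the URL and member names are taken from their downloads page):

```python
import urllib.request
import zipfile

# The combined archive bundles the 50/100/200/300D vectors together;
# there is no separate download for the 100D file alone.
GLOVE_URL = "https://nlp.stanford.edu/data/glove.6B.zip"

# The full archive has to be fetched even though only one member is kept...
urllib.request.urlretrieve(GLOVE_URL, "glove.6B.zip")

# ...and everything except the 100D vectors is then discarded.
with zipfile.ZipFile("glove.6B.zip") as zf:
    zf.extract("glove.6B.100d.txt")
```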
We could maybe instead consider:
- Using FastText embeddings: they offer pre-trained "Word vectors for 157 languages", but only in 300D... so we'd still need to downsample to 100D using their tool to adapt the dimension (see the sketch after this list), and the download would still be slower than necessary.
- Using some other embedding source (?)
- Pre-preparing and hosting the embeddings ourselves (e.g. in S3) for optimized download times, at the expense of transparency about how the vectors are created (a fetch sketch also follows below).
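For the FastText option, a rough sketch of the downsampling step, assuming the `fasttext` Python package and its `util` helpers (the language code and file paths here are illustrative):

```python
import fasttext
import fasttext.util

# Fetch the pre-trained 300D vectors for a target language, e.g. Thai ('th')
# for ASEAN audiences; English ('en') works the same way.
fasttext.util.download_model('th', if_exists='ignore')  # saves cc.th.300.bin
ft = fasttext.load_model('cc.th.300.bin')

# Downsample from 300D to the 100D size the model expects, then save.
fasttext.util.reduce_model(ft, 100)
ft.save_model('cc.th.100.bin')
```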
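And if we pre-prepared and hosted the vectors ourselves, the notebook-side fetch could shrink to something like this (the bucket and key are hypothetical placeholders for wherever we'd stage the processed file):

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical staging location - not a real bucket/key.
s3.download_file(
    Bucket="example-workshop-assets",
    Key="embeddings/glove.6B.100d.txt",
    Filename="glove.6B.100d.txt",
)
```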
athewsey commented
Already addressed on both the PyTorch and TensorFlow variants - closing the issue.