aws-samples/sagemaker-101-workshop

Alternative word vector source?


The NLP example currently uses GloVe word vectors from Stanford's repository, but these are:

  • Sometimes slow to download on our typical instance type (~6min30), because the combined zip of the 50/100/200/300D vectors is downloaded and the unused files discarded: there don't seem to be separate downloads offered for just the 100D vectors the model uses (see the extraction sketch after this list).
  • Only offered pre-trained in English, which makes the exercise less transferable for ASEAN customers.
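As a stopgap, extracting only the 100D file from the downloaded archive at least avoids unpacking the unused variants to disk. A minimal sketch, assuming the archive URL and member filenames currently listed on the Stanford GloVe download page (`glove.6B.zip`, `glove.6B.100d.txt`):

```python
import urllib.request
import zipfile

# URL and member names as listed on the Stanford GloVe download page -
# verify before relying on them:
GLOVE_URL = "https://nlp.stanford.edu/data/glove.6B.zip"

zip_path, _ = urllib.request.urlretrieve(GLOVE_URL, "glove.6B.zip")

# The whole archive still has to come over the wire, but extracting just the
# 100D member skips unpacking the unused 50/200/300D files:
with zipfile.ZipFile(zip_path) as zf:
    zf.extract("glove.6B.100d.txt", path="data")
```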

We could maybe instead consider:

  • Using fastText embeddings: they offer pre-trained "Word vectors for 157 languages", but only in 300D, so we'd still need to downsample to 100D with fastText's dimension-reduction tool (see the sketch after this list), and the download would still be slower than necessary.
  • Using some other embedding source (?)
  • Pre-preparing and hosting the embeddings ourselves (e.g. in S3, as sketched below) for optimized download times, at the expense of transparency about how the vectors are created.
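For the fastText option, the dimension reduction could look roughly like the sketch below, using the `fasttext.util.reduce_model` helper from the official `fasttext` Python package (the model name assumes the English vectors; swap the language code for other ASEAN languages):

```python
import fasttext
import fasttext.util

# Download the pre-trained 300D vectors for a given language
# ('en' here; e.g. 'th', 'vi', 'id' for ASEAN languages):
fasttext.util.download_model("en", if_exists="ignore")  # saves cc.en.300.bin

ft = fasttext.load_model("cc.en.300.bin")
print(ft.get_dimension())  # 300

# Reduce to the 100D the workshop model expects (PCA-based, in place):
fasttext.util.reduce_model(ft, 100)
print(ft.get_dimension())  # 100

ft.save_model("cc.en.100.bin")
```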
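And if we did pre-prepare the reduced vectors, hosting them in S3 would be a one-time upload plus a fast in-region download from the notebook. A sketch with boto3; the bucket and key names here are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# One-time: publish the pre-reduced vectors (bucket/key are hypothetical):
s3.upload_file("cc.en.100.bin", "example-workshop-assets",
               "embeddings/cc.en.100.bin")

# In the notebook: a single in-region transfer instead of a slow
# download from an external host:
s3.download_file("example-workshop-assets", "embeddings/cc.en.100.bin",
                 "cc.en.100.bin")
```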

Already addressed in both the PyTorch and TensorFlow variants, so closing the issue.