explosion/sense2vec

GitHub Releases download slowly

neelkamath opened this issue · 2 comments

Use case: I need to download the pretrained vectors fresh on CI systems and similar environments, and GitHub Releases seems to serve downloads at under 1 MB/s on average for most users (I found this out through casual online searches).

Suggestion: If possible, perhaps the vectors could be uploaded somewhere with faster download speeds (e.g., a mirror on SourceForge, if that's faster). I'd be happy to maintain mirrors if I'm able to, as I know everyone is low on bandwidth (even GitHub; no pun intended).

You're welcome to rehost the models for your own use: you could download the model, unpack it, and then call load() with a path to the directory instead of the name.

We really can't maintain mirrors though, and downloads from CI systems get expensive.

After some more trial and error, it seems that the US servers are speedy, but Indian ones are (ridiculously) slow (at least for now). I think it isn't worth maintaining a mirror since GitHub will eventually speed things up elsewhere, so I'll close this.

If anyone sees this and still has the issue, here's what you can do, since the pretrained vectors are too large to store in the git repo:

  1. Instruct in the README to download and extract the vectors after cloning the repo for local usage.
  2. For Docker, add the vector directory to the .dockerignore. This will require you to download the vectors in CI pipelines each time. If downloads to your CI runners are too slow in your region, use a shell script which only downloads the file if it isn't already present in the cache from a previous CI run.
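
As a rough sketch of the "download only if it isn't cached" idea in step 2, here's a Python version of what such a script might look like (written in Python rather than shell, using only the standard library). The function name fetch_if_missing, the example URL, and the cache path are all placeholders I made up for illustration; substitute whatever release asset and cache directory your CI setup actually uses.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_if_missing(url: str, dest: Path) -> bool:
    """Download `url` to `dest` unless `dest` already exists.

    Returns True if a download happened, False if the cached file was reused.
    (Hypothetical helper, not part of sense2vec.)
    """
    dest = Path(dest)
    if dest.exists():
        # A previous CI run already cached the file, so skip the slow download.
        return False

    dest.parent.mkdir(parents=True, exist_ok=True)

    # Write to a temporary ".part" file first so an interrupted download
    # is never mistaken for a complete cached file on the next run.
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(dest)
    return True


# Example call (placeholder URL and path, adjust to your setup):
# fetch_if_missing("https://example.com/vectors.tar.gz", Path(".cache/vectors.tar.gz"))
```

This only helps if your CI system is configured to persist the cache directory between runs; extracting the archive afterwards is left out here.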