feedly/transfer-nlp

Downloader Plugin

kireet opened this issue · 1 comments

One good point from the talk today was that reproducibility problems often stem from data inconsistencies. To that end, I think we should have a DataDownloader component that can download data from URLs and save it locally to disk.

  • If the files already exist locally, the downloader can skip the download.
  • The downloader should calculate checksums for downloaded files and produce a checksums.cfg file to simplify reusing them in configuration later.
  • The downloader should allow checksums to be configured in the experiment file. When set, the downloader would verify that the downloaded file matches the one specified in the experiment.
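The skip/verify behavior above could look roughly like this. A minimal sketch; `sha256sum` and `needs_download` are hypothetical names, and SHA-256 is just one reasonable checksum choice:

```python
import hashlib
from pathlib import Path
from typing import Optional


def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large downloads don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path: Path, expected: Optional[str]) -> bool:
    """Skip the download when the file exists; if a checksum is configured,
    verify the existing file against it and fail loudly on a mismatch."""
    if not path.exists():
        return True
    if expected is not None and sha256sum(path) != expected:
        raise ValueError(f"checksum mismatch for {path}")
    return False
```

Failing hard on a mismatch (rather than silently re-downloading) seems right for reproducibility: it surfaces the data inconsistency instead of papering over it.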

So an example JSON config could be:

{
  "_name": "Downloader",
  "local_dir": "$my_path",
  "checksums": "$WORK_DIR/checksums_2019_05_23.cfg", <-- produced by a previous download 
  "sentences.txt.gz": {
    "url": "$BASE_URL/sentences.txt.gz",
    "decompress": true
  },
  "word_embeddings.npy": {
    "url": "$BASE_URL/word_embeddings.npy"
  }
}

Also, maybe add an option for the maximum number of parallel downloads.