Downloader Plugin
kireet opened this issue · 1 comments
kireet commented
One good point from today's talk was that reproducibility problems often stem from data inconsistencies. To that end, I think we should have a DataDownloader component that downloads data from URLs and saves it to local disk.
- If a file already exists locally, the downloader can skip the download.
- The downloader should calculate checksums for downloaded files and produce a checksums.cfg file, to simplify reusing them in configuration later.
- The downloader should allow checksums to be configured in the experiment file. When set, the downloader would verify that each downloaded file matches the checksum specified in the experiment.
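To make the three points above concrete, here is a rough Python sketch, not a proposed final design. The function names (`sha256_of`, `fetch`, `write_checksums`), the choice of SHA-256, and the `name = digest` layout of checksums.cfg are all assumptions for illustration; none of them are decided in this issue.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Dict, Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(url: str, local_dir: Path, expected: Optional[str] = None) -> str:
    """Download `url` into `local_dir` unless the file is already there.

    Returns the file's checksum and raises ValueError when `expected`
    is given and does not match.
    """
    local_dir.mkdir(parents=True, exist_ok=True)
    target = local_dir / url.rsplit("/", 1)[-1]
    if not target.exists():  # skip the download when the file exists
        urllib.request.urlretrieve(url, target)
    actual = sha256_of(target)
    if expected is not None and actual != expected:
        raise ValueError(f"checksum mismatch for {target.name}")
    return actual


def write_checksums(checksums: Dict[str, str], cfg_path: Path) -> None:
    """Write one `name = digest` line per file (assumed cfg layout)."""
    lines = [f"{name} = {digest}" for name, digest in sorted(checksums.items())]
    cfg_path.write_text("\n".join(lines) + "\n")
```

A second run could then pass the digests recorded in checksums.cfg back in as `expected`, so that a changed upstream file fails loudly instead of silently changing the experiment.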
So an example JSON config could be the following, where the checksums file was produced by a previous download:
{
  "_name": "Downloader",
  "local_dir": "$my_path",
  "checksums": "$WORK_DIR/checksums_2019_05_23.cfg",
  "sentences.txt.gz": {
    "url": "$BASE_URL/sentences.txt.gz",
    "decompress": true
  },
  "word_embeddings.npy": {
    "url": "$BASE_URL/word_embeddings.npy"
  }
}
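One possible way to read such a config, once it has been parsed as JSON, is sketched below. It assumes that `_name`, `local_dir`, and `checksums` are the only settings keys, that every other key with an object value names a file to download, and that `$NAME` placeholders are expanded from a plain dictionary. These conventions, and the helper names `expand` and `file_entries`, are guesses for illustration only.

```python
import json
from string import Template

# Assumption: these keys are downloader settings, everything else is a file.
SETTING_KEYS = {"_name", "local_dir", "checksums"}


def expand(value, variables):
    """Recursively replace $NAME placeholders in strings; leave the rest."""
    if isinstance(value, str):
        return Template(value).safe_substitute(variables)
    if isinstance(value, dict):
        return {key: expand(item, variables) for key, item in value.items()}
    return value


def file_entries(config_text, variables):
    """Return {file name: options} for the file entries of a config."""
    config = expand(json.loads(config_text), variables)
    return {
        name: options
        for name, options in config.items()
        if name not in SETTING_KEYS and isinstance(options, dict)
    }
```

Expanding placeholders after parsing, rather than on the raw text, keeps a substituted value containing quotes or backslashes from corrupting the JSON.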
kireet commented
Also, maybe an option for the maximum number of parallel downloads.
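A cap on parallel downloads could be sketched with a thread pool from the standard library, roughly as follows. The `max_parallel` parameter name and the `download_all` helper are hypothetical, and `download` stands in for whatever per-file download function the component ends up with.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def download_all(
    urls: Dict[str, str],
    download: Callable[[str], str],
    max_parallel: int = 4,
) -> Dict[str, str]:
    """Run `download` on every URL with at most `max_parallel` at a time.

    Returns {file name: result of download(url)}. An exception raised by
    any single download propagates to the caller.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map yields results in the same order as the input URLs.
        results = pool.map(download, urls.values())
        return dict(zip(urls.keys(), results))
```

Threads are a reasonable fit here because downloads spend most of their time waiting on the network rather than using the CPU.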