Add option to skip downloading output from S3 to local for AWS runs
Innixma opened this issue · 1 comments
Currently for AWS runs, the results dir of each task is downloaded from S3 to the local machine that executed the AWS run.
With large-scale runs and additional metadata, these downloads can become very large (multiple terabytes), which can fill the disk on the host machine and cause network errors or hit bandwidth limits.
It would be nice to have an option to skip the local download while still having the worker nodes save the files to S3. I primarily work with the S3 files directly for post-run aggregation, which is generally more convenient.
Would welcome a PR. I propose to make this configurable by adding a parameter to the `aws` namespace in the configuration (e.g., `aws.download`). Ideally it would support three options:
- All: downloads all files, current behavior, should remain default
- Results: download only result files (the `_download_results` function internally already identifies those, so it should be easy to filter?)
- None: don't download files
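
To make the three options concrete, here is a minimal sketch of how such a setting could filter the S3 keys before anything is downloaded. It is not the project's actual code: `DownloadMode`, `is_result_file`, and `select_keys_to_download` are hypothetical names, and the assumption that result files are named `results.csv` is only for illustration (the real `_download_results` logic may identify them differently).

```python
from enum import Enum
from typing import Iterable, List, Union


class DownloadMode(Enum):
    """Hypothetical values for a setting such as `aws.download`."""
    ALL = "all"          # download everything (current behavior, default)
    RESULTS = "results"  # download only result files
    NONE = "none"        # download nothing; files stay on S3


def is_result_file(key: str) -> bool:
    # Assumption for illustration: a result file is any key whose
    # file name is exactly "results.csv".
    return key.rsplit("/", 1)[-1] == "results.csv"


def select_keys_to_download(
    keys: Iterable[str], mode: Union[str, DownloadMode] = DownloadMode.ALL
) -> List[str]:
    """Return the subset of S3 keys to download for the given mode."""
    if isinstance(mode, str):
        # Accept "All", "results", "NONE", ... as written in a config file.
        mode = DownloadMode(mode.lower())
    if mode is DownloadMode.ALL:
        return list(keys)
    if mode is DownloadMode.RESULTS:
        return [k for k in keys if is_result_file(k)]
    return []


if __name__ == "__main__":
    example_keys = [
        "run1/task_a/results.csv",
        "run1/task_a/logs/runbenchmark.log",
        "run1/task_a/predictions/fold0.csv",
    ]
    for mode in ("all", "results", "none"):
        print(mode, select_keys_to_download(example_keys, mode))
```

Keeping the filter as a pure function over key names would leave the existing download loop unchanged: it would simply iterate over fewer keys, or none.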
I don't know off the top of my head whether downloading the results file is always required just for the remainder of the logic to work (to know whether a task has finished). If it is, then `None` could simply choose not to save it to disk (or if there's a non-invasive way to allow it to finish the task without downloading the file, that would work too).
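
If the results file does turn out to be needed for completion detection, one possible shape for the "don't save to disk" variant is to read it into memory only. The sketch below is an assumption-heavy illustration: `task_finished` and the `fetch_bytes` callable are hypothetical, the results file is assumed to be a CSV with a header row, and "finished" is assumed to mean "has at least one data row", which may not match the actual logic.

```python
import csv
import io
from typing import Callable, Optional


def task_finished(
    fetch_bytes: Callable[[str], Optional[bytes]], results_key: str
) -> bool:
    """Decide whether a task finished without writing anything to disk.

    `fetch_bytes` is a hypothetical callable that returns the object's
    content as bytes, or None if the object does not exist on S3 yet.
    """
    payload = fetch_bytes(results_key)
    if payload is None:
        return False
    # Parse the CSV in memory; assume a header row plus one or more data rows
    # means the task has produced its results.
    rows = list(csv.reader(io.StringIO(payload.decode("utf-8"))))
    return len(rows) > 1


if __name__ == "__main__":
    # A dict stands in for the S3 bucket in this example.
    fake_bucket = {"run1/task_a/results.csv": b"task,fold,result\nadult,0,0.91\n"}
    print(task_finished(fake_bucket.get, "run1/task_a/results.csv"))
    print(task_finished(fake_bucket.get, "run1/task_b/results.csv"))
```

In a real implementation, `fetch_bytes` could wrap an in-memory S3 read. If only the existence of the file matters, a metadata-only request such as boto3's `head_object` would avoid transferring the content at all.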