openml/automlbenchmark

Add option to skip downloading output from S3 to local for AWS runs

Innixma opened this issue · 1 comments

Currently for AWS runs, the results dir of each task is downloaded from S3 to the local machine that executed the AWS run.

With large-scale runs and additional meta-data, these downloads can become very large (multiple terabytes), causing the host machine to run out of disk space and potentially hitting network errors or bandwidth limits.

It would be nice to have an option to skip the local download (while still having the files saved to S3 from the worker nodes). I primarily work with the S3 files directly for post-run aggregation, which is generally more convenient.

Would welcome a PR. I propose making this configurable by adding a parameter to the aws namespace in the configuration (e.g., aws.download). Ideally it would support three options:

  • All: download all files (current behavior, should remain the default)
  • Results: download only result files (the _download_results function internally already identifies those, so it should be easy to filter?)
  • None: don't download any files
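A rough sketch of how the three options could be applied to the list of S3 keys before downloading. Everything here is hypothetical and not taken from the codebase: the `DownloadMode` enum, the `parse_download_mode`, `is_result_file` and `select_keys` helpers, and the assumption that result files can be recognised by the basename `results.csv`. The real check should reuse whatever `_download_results` already does to identify result files.

```python
from enum import Enum
from typing import Iterable, List, Optional


class DownloadMode(Enum):
    """Hypothetical values for an `aws.download` setting."""
    ALL = "all"          # current behavior, the default
    RESULTS = "results"  # only result files
    NONE = "none"        # nothing is downloaded


# Assumption: result files are recognised by basename.
RESULT_FILE_NAMES = {"results.csv"}


def parse_download_mode(value: Optional[str]) -> DownloadMode:
    """Map a config value to a mode; an unset value keeps the current behavior."""
    if value is None:
        return DownloadMode.ALL
    # Raises ValueError for anything other than all/results/none.
    return DownloadMode(str(value).strip().lower())


def is_result_file(key: str) -> bool:
    """Return True if the S3 key points at a result file (by basename)."""
    return key.rsplit("/", 1)[-1] in RESULT_FILE_NAMES


def select_keys(keys: Iterable[str], mode: DownloadMode) -> List[str]:
    """Return the subset of S3 keys to download for the given mode."""
    if mode is DownloadMode.ALL:
        return list(keys)
    if mode is DownloadMode.RESULTS:
        return [k for k in keys if is_result_file(k)]
    return []
```

Treating an unset value as `ALL` keeps existing configurations working unchanged, and an unknown value fails loudly instead of silently skipping downloads.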

I don't know off the top of my head whether downloading the results file is always required for the remainder of the logic to work (to know whether a task has finished). If it is, then None could simply choose not to save it to disk (or if there's a non-invasive way to allow it to finish the task without downloading the file, that would work too).
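If the results file does turn out to be needed, one way to avoid touching disk is to read it into memory and inspect it there. A minimal sketch, assuming the results file is a CSV with a header row and that "finished" means at least one data row is present. `task_finished` and the `fetch_bytes` callable are hypothetical stand-ins; in practice `fetch_bytes` would wrap an S3 read (for example boto3's `Object.get()["Body"].read()`) and would need to translate the client's own "not found" error.

```python
import csv
import io
from typing import Callable


def task_finished(fetch_bytes: Callable[[str], bytes], results_key: str) -> bool:
    """Check completion from the results file without writing it to disk.

    `fetch_bytes` is a hypothetical callable that returns the object's content
    and raises KeyError when the key does not exist.
    """
    try:
        payload = fetch_bytes(results_key)
    except KeyError:
        # No results file yet: treat the task as not finished.
        return False
    rows = list(csv.reader(io.StringIO(payload.decode("utf-8"))))
    # Assumption: first row is the header, any further row is a result.
    return len(rows) > 1
```

For a quick local check, a dict can stand in for the bucket: `task_finished({"k": b"id,task\n1,foo\n"}.__getitem__, "k")`.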