A collection of notebooks and recipes that demonstrate larger data processing workflows in EASI.
Use the code in this repository as examples and guides to be adapted to your workflows.
We regularly see users struggle to move efficiently from a development notebook that works on a small area to a workflow that runs on a larger set of data or operationally. The main challenges we see are in:
- Efficient use of dask parameters tuned to the workflow
- Resilient and cost-effective workflows
Common patterns that occur for each data processing workflow:
- Get work
  - Space, time, product and processing parameters
  - Select batching and tiling options
  - Output is a list of work to do (number of batches)
- Do work
  - Launch processes, each of which does a batch
  - Select optional dask configuration
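The "get work, do work" pattern above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: the grid of `(x, y)` tile indices and the names `make_tiles`, `make_batches` and `process_batch` are hypothetical, and the per-batch processing is a placeholder for whatever each launched process would actually run.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Tile = Tuple[int, int]  # hypothetical (x, y) tile index


def make_tiles(nx: int, ny: int) -> List[Tile]:
    """Get work: enumerate tile indices over an assumed nx-by-ny grid."""
    return [(x, y) for x in range(nx) for y in range(ny)]


def make_batches(tiles: Iterable[Tile], batch_size: int) -> Iterator[List[Tile]]:
    """Split the list of work into batches of at most batch_size tiles."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(tiles)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_batch(batch: List[Tile]) -> int:
    """Do work: placeholder that only counts the tiles in the batch."""
    return len(batch)


tiles = make_tiles(3, 4)                           # 12 tiles in total
batches = list(make_batches(tiles, batch_size=5))  # the list of work to do
results = [process_batch(batch) for batch in batches]
print(f"{len(batches)} batches, tiles per batch: {results}")
```

Keeping the work list separate from the processing step means the same batches can be handed to a local loop during development and to separate processes when scaling up.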
There are three main patterns that can be explored. The best solution will likely depend on your workflow and requirements.
- Launch one dask cluster per process.
- Job tiling with ODC code.
- Group work into batches, each of which is run by a single Argo worker.
  - Can control the number of simultaneous Argo workers.
  - If an Argo worker dies then the batch will be restarted. In this case, ensure your code can skip work that was previously done.
  - Each Argo worker can itself launch a dask cluster and a grid workflow, or any complex processing task.
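One way to make a restarted batch skip work that was previously done is to check for each item's output before processing it. The sketch below assumes one output file per work item on a local filesystem; the function name `run_batch`, the file naming and the use of local paths are illustrative assumptions (a real workflow may write to object storage instead).

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def run_batch(batch: List[str], out_dir: Path, process: Callable[[str], str]) -> List[str]:
    """Process each item unless its output already exists.

    Returns the items that were actually processed in this call.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for item in batch:
        target = out_dir / f"{item}.txt"
        if target.exists():
            continue  # finished in an earlier attempt, so skip it
        # Write to a temporary name first so that a worker dying mid-write
        # does not leave a partial file that looks like a finished result.
        tmp = target.with_suffix(".tmp")
        tmp.write_text(process(item))
        tmp.replace(target)
        processed.append(item)
    return processed


with tempfile.TemporaryDirectory() as workdir:
    out = Path(workdir) / "results"
    # First attempt finishes two items before the (simulated) worker failure.
    first = run_batch(["tile_0", "tile_1"], out, str.upper)
    # The restarted batch only processes the item that is still missing.
    second = run_batch(["tile_0", "tile_1", "tile_2"], out, str.upper)
    print(first, second)
```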
Contributions are welcome.
A `pre-commit` hook is provided in `/bin`. For each notebook committed this will:
- Attempt to strip any AWS secrets.
- Render an HTML copy of the notebook (with outputs) into `html/`.
- Strip outputs from the notebook to reduce the size of the repository.
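To illustrate what stripping outputs means, the following sketch clears the outputs of code cells in a notebook represented as a Python dict. It is not the hook's implementation, which is not shown here; it only assumes the standard nbformat 4 layout, in which each code cell carries `outputs` and `execution_count` fields. The function name `strip_outputs` and the example notebook are made up for illustration.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat 4 notebook dict with code-cell outputs cleared."""
    stripped = json.loads(json.dumps(notebook))  # deep copy via a JSON round trip
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


# A tiny, made-up notebook with one markdown cell and one executed code cell.
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["1 + 1"],
            "execution_count": 3,
            "outputs": [
                {
                    "output_type": "execute_result",
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                    "execution_count": 3,
                }
            ],
        },
    ],
}

clean = strip_outputs(example)
print(json.dumps(clean["cells"][1], indent=1))
```

Because large outputs such as figures are stored inline in the notebook file, clearing them before a commit keeps the repository small, while the rendered HTML copy preserves the results for readers.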
The `apply_hooks.sh` script creates a symlink to `bin/pre-commit`.
```shell
# Run this in your local repository
sh bin/apply_hooks.sh
```
For contributors:
- Apply the pre-commit hook.
- Run each notebook (that has been updated) to populate the figures, tables and other outputs as you want them.
- Add a link into `html/readme.md` for each new notebook.
- Add an item to `whats_new.md`.
- `git add` and `git commit`.
- If everything looks ok, `git push` to your fork of this repository and create a pull request.