pangeo-data/pangeo-stacks

Unable to modify dask-kubernetes configuration

Closed this issue · 11 comments

Derivative images seem to be unable to modify the dask-kubernetes configuration without rebuilding the base image from scratch.

This seems rather impractical, since it's only natural that one might want to modify the configuration of the dask workers. For example, we might want to taint the worker nodes so that core pods cannot be scheduled there (this is a problem for me at the moment). This means the dask pods need the corresponding toleration, which cannot presently be added without rebuilding base.
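To make the toleration case concrete, here is a rough sketch of the change I'd want to apply to the worker config. It assumes the worker pod template lives under the `kubernetes.worker-template` key of the dask config; the taint key, value, and effect are placeholders, not values from my cluster.

```python
import copy

# Hypothetical taint: key/value/effect are placeholders for illustration only.
EXAMPLE_TOLERATION = {
    "key": "dedicated",
    "operator": "Equal",
    "value": "dask-worker",
    "effect": "NoSchedule",
}


def add_toleration(config, toleration):
    """Return a copy of `config` with `toleration` added to the worker pod spec.

    Assumes the pod template sits under kubernetes -> worker-template -> spec.
    """
    new = copy.deepcopy(config)
    spec = (
        new.setdefault("kubernetes", {})
        .setdefault("worker-template", {})
        .setdefault("spec", {})
    )
    tolerations = spec.setdefault("tolerations", [])
    # Avoid duplicating an identical toleration if it is already present.
    if toleration not in tolerations:
        tolerations.append(toleration)
    return new
```

The input dict is left untouched, so the same default config can be reused for other variants.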

I wanted to go ahead and start a discussion on how to implement this.

cc @jacobtomlinson if you have any thoughts.

@bgroenks96 - can you provide some more details on the workflow you have going right now? How are you using the Docker images provided here, and at what point would you like to update the dask-kubernetes configuration?

I'm using the onbuild docker images. Ideally, I would like to be able to modify the dask configuration at the point where I build my derivative image from onbuild, similar to how we are able to modify the conda and pip environments.

You should be able to use the postBuild file to update/overwrite the default dask-kubernetes config. Have you tried this with the onbuild image?

Not yet, no. But it's not clear to me where exactly the dask-kubernetes config lives (I don't know what KERNEL_PYTHON_PREFIX is). It also seems a bit convoluted to have to write a script to read/modify/replace this file, so I thought perhaps we could make it a more "official" configuration option.

One possibility would be checking for a dask-config.yaml file in the child image in r2doverlay like what's done for conda and pip, and then doing a YAML merge with the default, with user config given preference in conflicts.
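Roughly what I have in mind for the merge, as a sketch on plain dicts (as they would look after loading the two YAML files). `merge_config` is a made-up name, not something r2doverlay provides today. Dask also has its own `dask.config.merge` helper, which might be a better fit than hand-rolling this.

```python
def merge_config(default, user):
    """Recursively merge two config dicts, preferring `user` on conflicts.

    Nested dicts are merged key by key; any other value (including lists)
    from `user` replaces the default outright.
    """
    merged = dict(default)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged
```

Note that lists are replaced rather than concatenated, so a user file that sets `tolerations` would need to list every toleration it wants.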

@bgroenks96 - I think I see what you are going for. It might be worth reading up (if you haven't already) on 1) the dask configuration system (https://docs.dask.org/en/latest/configuration.html) and 2) the repo2docker configuration system (https://repo2docker.readthedocs.io/en/latest/config_files.html). r2doverlay implements a subset of the repo2docker functionality. You'll notice that there isn't an option in repo2docker for dask configuration files so we use the postBuild utility instead. I think this is still your best bet. You may not need to merge the files though, you can just overwrite the existing one with your own "opinionated" configuration.

BTW, ${KERNEL_PYTHON_PREFIX} is set by repo2docker to sys.prefix, so dask configs placed there get picked up automatically (see the dask config docs).

Is KERNEL_PYTHON_PREFIX available in the child image docker build?

Yes, it should be.

So the idea, then, would be to add a postBuild script that copies a local dask config file to the same location that the base notebook's postBuild writes to?

I suppose that should work, provided that the child postBuild runs after the base notebook's.

That's right! This is a fairly well established pattern so I'd be surprised if this didn't work. Let us know how it goes.
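For anyone landing here later, a minimal sketch of such a postBuild, written as a Python script. It assumes the config directory is `etc/dask` under `${KERNEL_PYTHON_PREFIX}` (one of the locations dask searches, per the dask config docs) and that the local file is called `dask_config.yaml`; both the file name and the `install_dask_config` helper are illustrative, not taken from the base image.

```python
#!/usr/bin/env python
import os
import shutil
import sys
from pathlib import Path


def install_dask_config(src, prefix=None, dest_name=None):
    """Copy `src` into <prefix>/etc/dask and return the destination path.

    `prefix` defaults to $KERNEL_PYTHON_PREFIX, falling back to sys.prefix.
    `dest_name` defaults to the source file name.
    """
    prefix = prefix or os.environ.get("KERNEL_PYTHON_PREFIX", sys.prefix)
    dest_dir = Path(prefix) / "etc" / "dask"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / (dest_name or Path(src).name)
    shutil.copyfile(src, dest)  # overwrites an existing file of the same name
    return dest


if __name__ == "__main__":
    local = Path("dask_config.yaml")  # hypothetical file name in the child repo
    if local.exists():
        print("installed", install_dask_config(local))
    else:
        print("no dask_config.yaml found; leaving base config untouched")
```

To overwrite the base config rather than add a second file next to it, `dest_name` has to match whatever name the base image's postBuild used; that name isn't spelled out in this thread, so check the base image first.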

It worked! I'll go ahead and close this issue.