
Docker image is rebuilt for every `cluster job submit`

I notice that Docker rebuilds the image for every cluster job submit, and pushes the respective changes to GCP. There is exactly one large layer that is pushed. I assume this is the working directory that is copied into the image. This is annoying since my working directory is quite large (it has a large executable), and I would like to speed this up.

I am not making any changes to the working directory. I don't know what changes trigger this. I assume that these are either spurious changes (Caliban/Docker doesn't detect that nothing actually changes), or maybe there is a time stamp that is needlessly updated.
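One way to test the "nothing actually changes" assumption is to fingerprint the working directory before two consecutive submits and compare the results. The sketch below is only a diagnostic aid, not part of Caliban or Docker. The helper name `tree_fingerprint` is hypothetical. It hashes relative paths, permission bits, and file contents, and deliberately leaves out modification times, so a changed fingerprint points at contents or permissions rather than timestamps.

```python
import hashlib
import os
import stat


def tree_fingerprint(root):
    """Return a SHA-256 hex digest over paths, permission bits, and contents under root.

    Hypothetical helper: modification times are intentionally not hashed.
    """
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            info = os.lstat(path)
            if not stat.S_ISREG(info.st_mode):
                continue  # this sketch skips symlinks and special files
            digest.update(os.path.relpath(path, root).encode("utf-8"))
            digest.update(b"\0")
            digest.update(oct(stat.S_IMODE(info.st_mode)).encode("ascii"))
            digest.update(b"\0")
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so a large executable is not loaded at once.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    print(tree_fingerprint("."))
```

If the fingerprint is identical across two submits and the layer is still rebuilt, the cause is more likely in how the build context is assembled than in the files themselves.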

(It would be convenient if caliban build had an option --dry_run that would output the Dockerfile that it generates.)

Here is a typical output:

$ caliban cluster job submit --cluster_name einsteintoolkit-cluster --min_cpu 2000 --nogpu ./ -- /usr/app/Cactus/exe/cactus_sim /usr/app/Cactus/repos/cactusamrex/azure-pipelines/carpetx.par
I0802 15:22:43.826885 4368600512] Generating Docker image with parameters:
I0802 15:22:43.827569 4368600512] {'adc_path': '/Users/eschnett/.config/gcloud/application_default_credentials.json',
 'build_path': '/Users/eschnett/src/CarpetX',
 'caliban_config': {'apt_packages': {'cpu': [], 'gpu': []},
                    'base_image': 'eschnett/carpetx-caliban:cpu',
                    'build_time_credentials': False,
                    'default_mode': <JobMode.CPU: 'CPU'>,
                    'gcloud': {},
                    'julia_version': None,
                    'mlflow_config': None},
 'conda_env_path': None,
 'credentials_path': '/Users/eschnett/.config/service_key.json',
 'extra_dirs': None,
 'job_mode': <JobMode.CPU: 'CPU'>,
 'no_cache': False,
 'package': Package(executable=['/bin/bash'], package_path='.', script_path='./', main_module=None),
 'requirements_path': None,
 'setup_extras': None}
I0802 15:22:43.831600 4368600512] Running command: docker build --rm -f- /Users/eschnett/src/CarpetX
I0802 15:25:05.499978 4368600512] jobs submitted, visit to monitor

Note that the last copy (copying the current directory) does not use the cache, and the respective layer is pushed.

Erik, I'll have a look, as you are right both that this shouldn't happen and that it's quite annoying.

I believe the problem is using . as source of the COPY command. When I copy another large directory explicitly, the copy command is properly cached.

If that is indeed so, then one work-around might be to explicitly list all files (including dot files, of course) from the package directory in the copy command.
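A minimal sketch of what "explicitly list all files" could look like, assuming one has a way to edit the generated Dockerfile (Caliban writes it itself, so this is illustrative only). The function name `explicit_copy_lines` is hypothetical, and the destination `/usr/app` is an assumption taken from the paths in the command above. Whether this actually restores caching is untested here.

```python
import json
import os


def explicit_copy_lines(package_dir, dest="/usr/app"):
    """Build one COPY instruction per top-level entry of package_dir.

    Hypothetical helper; `dest` is an assumed target directory inside the image.
    """
    lines = []
    for name in sorted(os.listdir(package_dir)):  # listdir includes dot files
        source = os.path.join(package_dir, name)
        # COPY on a directory copies its contents, so name the target directory explicitly.
        target = f"{dest}/{name}/" if os.path.isdir(source) else f"{dest}/{name}"
        # The JSON form of COPY tolerates spaces in file names.
        lines.append("COPY " + json.dumps([name, target]))
    return lines


if __name__ == "__main__":
    print("\n".join(explicit_copy_lines(".")))
```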

@eschnett ah, nice find. Another workaround would be to move your files out of the top level directory; by default, Caliban only includes the folder containing the script you've passed it, and then you have the ability to add more directories with the -d flag.

So you might change:

caliban cluster job submit ./

to:

caliban cluster job submit -d data -d misc bin/

or something like that, and that should fix it. Hopefully this will help for now!

Also, we're close to removing the --nogpu requirement, by

  • making --nogpu the default
  • allowing you to set a sticky default for your own machine,
  • and allowing you to override the gpu or nogpu default in .calibanconfig.json locally.

Shorter command strings are always a bonus!

@sritchie I have a folder called Cactus in the top level directory of the project. This Cactus folder contains the files that are not cached. This folder is copied into the Docker image. Are you recommending to use two separate directories, one project directory (which contains e.g. the .calibanconfig), and another directory tree outside it which contains the large files? In this case, I would need to specify ../Cactus or similar, not just Cactus, right?

Thanks for the --nogpu.

@eschnett almost - instead of:


caliban cluster job submit ./

do this:


caliban cluster job submit -d Cactus bin/

I bet that will do the trick (unless @ajslone tells us that caliban cluster job submit doesn't handle -d, but I think it does).

Yes, this works! Thanks.

I would usually write ./bin/ instead of bin/, but that doesn't work here. You cannot have the initial ./ in the path.

It seems the data directories are added to the Docker image before the apt dependencies are installed. This is the wrong order for me. I agree that this is the right order if one uses large, unchanging data sets (and I'm not advocating changing this order), but it makes the work-around a bit worse for me.

@eschnett yes, this is a real problem for a few reasons... we have a place where sometimes we need credentials before ANY dependencies, so we can get private deps. But if you don't need that feature, it's very disruptive to rebuild everything before the cache.

The solution here could be either:

  • letting users hint the order in .calibanconfig.json, or
  • figuring out how we can more aggressively use Docker's multi-stage builds to get out of this linear structure that is hurting us.

I want the tool to keep its sane defaults, but I do want to get to a world where our Docker building looks like building a list of data structures internally, and then taking hints from the user about how to change that build order.
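To make the "list of data structures plus user hints" idea concrete, here is a rough sketch. Nothing in it is Caliban's actual internals: `BuildStep`, `DEFAULT_STEPS`, `apply_order_hint`, the step names, and the Dockerfile instructions (including the apt package) are all hypothetical. The default order only mirrors what is described in this thread: credentials first, extra directories before apt dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildStep:
    name: str         # hypothetical label a user could refer to in a hint
    instruction: str  # Dockerfile text emitted for this step


# Illustrative default order only; not the order Caliban really uses.
DEFAULT_STEPS = [
    BuildStep("base", "FROM eschnett/carpetx-caliban:cpu"),
    BuildStep("credentials", "COPY service_key.json /.creds/"),
    BuildStep("extra_dirs", "COPY Cactus /usr/app/Cactus"),
    BuildStep("apt", "RUN apt-get update && apt-get install -y build-essential"),
    BuildStep("code", "COPY bin /usr/app/bin"),
]


def apply_order_hint(steps, hint):
    """Rearrange the steps named in hint into the hinted order.

    The hinted steps reuse the slots they already occupied, so every
    step that is not mentioned keeps its position (the default).
    """
    by_name = {step.name: step for step in steps}
    unknown = [name for name in hint if name not in by_name]
    if unknown:
        raise ValueError(f"unknown step names in hint: {unknown}")
    hinted = set(hint)
    replacements = iter([by_name[name] for name in hint])
    return [next(replacements) if step.name in hinted else step for step in steps]


if __name__ == "__main__":
    # A user who wants apt dependencies cached ahead of their data directories:
    for step in apply_order_hint(DEFAULT_STEPS, ["apt", "extra_dirs"]):
        print(step.instruction)
```

With a hint like this, only the two named steps swap places, so the defaults stay intact for anyone who gives no hint.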

I don't have an opinion either way. If putting a job script into the root directory could be made to use the cache, I probably wouldn't need data directories. My workflow starts from scratch and produces large output files.