GH Actions (docker): Run a job for "make build-local" first, cache image for job "make build"
mkoeppe opened this issue · 112 comments
Follow up from #32703, where this was implemented for macOS.
Here we implement it for the docker-based workflows
via incremental builds (#34228) on top of previous stages pushed to ghcr.io
Depends on #34115
Depends on #34228
CC: @tobiasdiez @dimpase
Component: porting
Author: Matthias Koeppe
Branch/Commit: 5713937
Reviewer: Dima Pasechnik
Issue created by migration from https://trac.sagemath.org/ticket/33062
Description changed:
---
+++
@@ -1,8 +1,7 @@
Follow up from #32703, where this was implemented for macOS.
-Here we implement it for the `docker`-based workflows.
-
-References
-- https://evilmartians.com/chronicles/build-images-on-github-actions-with-docker-layer-caching
+Here we implement it for the `docker`-based workflows by Docker layer caching via ghcr.io
+
+Description changed:
---
+++
@@ -2,6 +2,6 @@
Here we implement it for the `docker`-based workflows by Docker layer caching via ghcr.io
+(split out from #33222)
-Description changed:
---
+++
@@ -2,6 +2,26 @@
Here we implement it for the `docker`-based workflows by Docker layer caching via ghcr.io
+To use:
+
+```
+$ DOCKER_BUILDKIT=1 EXTRA_CACHE_FROM_REPOSITORIES=ghcr.io/mkoeppe/sage/ tox -e docker-ubuntu-focal-standard
+ => [internal] load metadata for docker.io/library/ubuntu:focal 0.0s
+ => ERROR importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-bootstrapped:dev 1.3s
+ => ERROR importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-with-system-packages:9.6.beta0-133-gc9c7d9f8d9 1.6s
+ => importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-with-targets-pre:dev 2.4s
+ => ERROR importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-configured:9.6.beta0-133-gc9c7d9f8d9 1.6s
+ => importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-configured:dev 2.5s
+ => ERROR importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-with-targets-pre:9.6.beta0-133-gc9c7d9f8d9 1.6s
+ => importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-with-targets:dev 2.4s
+ => ERROR importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-with-targets:9.6.beta0-133-gc9c7d9f8d9 1.4s
+ => importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-with-targets-optional:dev 2.4s
+ => ERROR importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-with-targets-optional:9.6.beta0-133-gc9c7d9f8d9 1.4s
+ => importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-with-system-packages:dev 2.4s
+ => ERROR importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-bootstrapped:9.6.beta0-133-gc9c7d9f8d9 1.5s
+```
+
+
(split out from #33222)
Last 10 new commits:
531cd14 | Merge #33103 |
51191fe | .github/workflows/tox.yml (local-macos): Fix filtering of experimental packages |
12f88cf | .github/workflows/tox.yml (local-macos): Group optional/experimental package builds into fewer jobs |
49236dd | tox.ini: Add local-macos-python3_pythonorg-python3.9 etc., refactor CONFIG_CONFIGURE_ARGS_1=...--with-python |
1fcdc91 | tox.ini: Updates for macos and manylinux from #31396 |
41ec3aa | tox.ini: Reindent |
d7066aa | Merge #33140 |
98a5bae | Merge tag '9.6.beta0' into t/33062/gh_actions__docker___run_a_job_for__make_build_local__first__cache_image_for_job__make_build_ |
c446d33 | .github/workflows, src/doc/en/developer/portability_testing.rst: Switch from docker.pkg.github.com to ghcr.io |
c9c7d9f | tox.ini (docker): Support Docker build layer caching |
Description changed:
---
+++
@@ -2,7 +2,7 @@
Here we implement it for the `docker`-based workflows by Docker layer caching via ghcr.io
-To use:
+To use for local builds via tox:
```
$ DOCKER_BUILDKIT=1 EXTRA_CACHE_FROM_REPOSITORIES=ghcr.io/mkoeppe/sage/ tox -e docker-ubuntu-focal-standardDescription changed:
---
+++
@@ -24,4 +24,5 @@
(split out from #33222)
+We reorganize the CI workflows: `ci-linux.yml` and `ci-macos.yml` replace `tox.yml`, `tox-optional.yml`, `tox-experimental.yml`.
Author: Matthias Koeppe
Obviously there is again a lot of cut+paste going on here, but I'm not sure yet whether this is best refactored using (1) scripts to be put in .ci/; (2) new top-level Makefile targets; (3) reusable workflows in the same repo; (4) reusable workflows from a different repo. So until then, copy+paste it is
Branch pushed to git repo; I updated commit sha1. New commits:
47b636d | .github/workflows/ci-linux.yml: Fix typo |
A lot of jobs are failing because there doesn't exist a corresponding maximal image.
What is the distinction between the docker and docker-optional jobs?
Replying to @tobiasdiez:
What is the distinction between the
dockeranddocker-optionaljobs?
docker-optional also runs the stages that build optional and experimental packages.
Replying to @tobiasdiez:
A lot of jobs are failing because there doesn't exist a corresponding
maximalimage.
Not sure if that's it but there are a number of new failures introduced by switching to DOCKER_BUILDKIT=1, for example https://github.com/mkoeppe/sage/runs/5087229478?check_suite_focus=true
> [with-system-packages 130/204] RUN zypper --ignore-unknown install --no-confirm --auto-agree-with-licenses --no-recommends --details "pkgconfig(gf2x)" || echo "(ignoring error)":
------
failed to prepare d4bec9ys962tp15dnd39jkkya: max depth exceeded
This will require some changes in build/bin/write-dockerfile.sh
Replying to @mkoeppe:
Replying to @tobiasdiez:
What is the distinction between the
dockeranddocker-optionaljobs?
docker-optionalalso runs the stages that build optional and experimental packages.
Okay, thanks! Would it then make sense to base the docker-optional on the image created in the docker job to speed up the build (i.e. as a "stage")?
It's currently using a different package configuration (-maximal instead of -standard)
Replying to @mkoeppe:
It's currently using a different package configuration (
-maximalinstead of-standard)
But wasn't your idea with the stages, that the first stage should build all required images (i.e. minimal, standard, maximum) which could then be reused in different ways for the other stages?
Replying to @tobiasdiez:
wasn't your idea with the stages, that the first stage should build all required images (i.e. minimal, standard, maximum) which could then be reused in different ways for the other stages?
Yes, it is doing that. But it is only building optional packages on top of the -maximal configuration, not on top of the other configurations.
Okay, what confuses me is that it seems to not be part of the "stage" design in the sense that it's not using the stage-1 of the normal docker job.
It's using stage-1 of its own docker job because it needs "maximal", not "minimal"/"standard".
Ah okay, now I understand. And it is not possible to have one common "stage-1" that build all three variants?
GH Actions appears to be running on 8-bit technology: Matrix jobs that would create > 256 jobs are silently dropped from the workflow.
... and 6 stages * 3 packages factors * 46(?) system factors would exceed that.
Replying to @mkoeppe:
... and 6 stages * 3 packages factors * 46(?) system factors would exceed that.
Why 6 stages? What I meant is that you have one job creating all docker images, then build jobs on top of this. I feel like this design is also more flexible, since you can easily run certain follow up stages only on a subset of configs.
2 stages * 3 * 46 also exceeds 256.
In any case, the scope of the ticket is to change how it is build (= in 2 stages, to solve the 6-hour timeouts for some platforms). I am keeping what is built/tested the same as before. One ticket, one change.
Replying to @mkoeppe:
2 stages * 3 * 46 also exceeds 256.
My point was that you shouldn't use the matrix for stages, so that you don't run into these constraints.
- Job "build": (min, standard, max) for each system (=3 * 46 < 256) - build docker image
- Job "test-basic": (min, standard) for each system - build sage, run tests
- Job "test-optional": max for each system - build sage with optional packages, run tests
This would also keep what is build and runs in stages.
Job "build" does not reliably fit into 6 hours.
Replying to @mkoeppe:
Job "build" does not reliably fit into 6 hours.
You seem to have the same problem with the current setup as well (e.g. first stage of debian-bookworm fails due timeout).
In this case, you can selectively add a second stage for these problematic systems using matrices as in https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#example-including-new-combinations
Another reason for the split into the 2 build stages is explained in the ticket description of #32703.
Replying to @tobiasdiez:
you can selectively add a second stage for these problematic systems using matrices
-1
Replying to @mkoeppe:
Another reason for the split into the 2 build stages is explained in the ticket description of #32703.
I don't see anything there that would be easier with matrix-stages over job-stages. In the contrary, the design I proposed above can easily be extended using additional jobs (that run on top of "build"). While your matrix-stage design would need an additional stage, which is not possible due to 3 stages * 2 configs * 46 systems > 256.
comment:31
Replying to @tobiasdiez:
Replying to @mkoeppe:
Job "build" does not reliably fit into 6 hours.
You seem to have the same problem with the current setup as well (e.g. first stage of debian-bookworm fails due timeout).
No, that's likely only a sporadic failure.
Here is a version where I converted your first stage to a "prebuild" job. It runs exactly the same code (except if I made a mistake and missed some hidden environment variables). So the prebuild job timeouts iff if your stage 1 job would time out.
https://github.com/sagemath/sagetrac-mirror/actions/runs/1814856715
https://github.com/sagemath/sagetrac-mirror/blob/public/build/refactor-ci-runtests/.github/workflows/ci-docker.yml
OK, so to summarize, the revised plan of comment:30 would be:
- Job "prebuild": (minimal, standard, maximal) for each system (=3 * 46 < 256) - build
-with-targets-preimages - Job "build": (minimal, standard, maximal) for each system (=3 * 46 < 256) - build
-with-targetsimages - Job "test-basic": (minimal, standard) for each system - build sage, run tests
- Job "test-optional": (maximal) for each system - build sage with optional packages, run tests
This would work. Minor detail: After this change, prebuild and build would build the union of the platforms currently run for "standard" and for "maximal". That's a small change, but that would be fine with me.
In fact, the first two jobs should probably be named "targets-pre" and "targets"
Replying to @mkoeppe:
OK, so to summarize, the revised plan of comment:30 would be:
- Job "prebuild": (minimal, standard, maximal) for each system (=3 * 46 < 256) - build
-with-targets-preimages- Job "build": (minimal, standard, maximal) for each system (=3 * 46 < 256) - build
-with-targetsimages- Job "test-basic": (minimal, standard) for each system - build sage, run tests
- Job "test-optional": (maximal) for each system - build sage with optional packages, run tests
Yes, that's what I had in mind. And if you extract the current workflow into an reusable workflow and replace the tox command by an input, then this should also be possible without any code duplication (except for the duplicate matrix).
Replying to @mkoeppe:
In fact, the first two jobs should probably be named "targets-pre" and "targets"
To be honest, I don't find "targets" a very illuminating name.
A "target" is what make makes
But why is this information important for the docker images and the ci steps? I find something like with-system-packages or with-python-packages more descriptive.
with-system-packages already exists and it's not that
Consistency matters. Can't invent a new name for the same thing at every turn.
Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
0590cd4 | .github/workflows, src/doc/en/developer/portability_testing.rst: Switch from docker.pkg.github.com to ghcr.io |
1bbb9a9 | tox.ini (docker): Support Docker build layer caching |
dd6c0e0 | .github/workflows/tox.yml: Integrate optional and experimental runs here |
15f5bb2 | .github/workflows/ci-{linux,macos}.yml: Split from tox.yml |
7bc769b | .github/workflows/ci-linux.yml: Fix typo |
Rebased
Replying to @tobiasdiez:
Replying to @mkoeppe:
OK, so to summarize, the revised plan of comment:30 would be:
- Job "prebuild": (minimal, standard, maximal) for each system (=3 * 46 < 256) - build
-with-targets-preimages- Job "build": (minimal, standard, maximal) for each system (=3 * 46 < 256) - build
-with-targetsimages- Job "test-basic": (minimal, standard) for each system - build sage, run tests
- Job "test-optional": (maximal) for each system - build sage with optional packages, run tests
Yes, that's what I had in mind. And if you extract the current workflow into an reusable workflow and replace the tox command by an input, then this should also be possible without any code duplication (except for the duplicate matrix).
No objections if you want to do this work. On the present ticket, I don't want to start with reusable workflows though. I have opened #33338 for such refactoring steps.
Branch pushed to git repo; I updated commit sha1. New commits:
0a12596 | tox.ini (slackware-current): New |
ad913db | tox.ini (opensuse-15.2.1,-15.4): New |
8d789f6 | build/pkgs/_prereq/distros/slackware.txt: Add python3 |
c62fc35 | build/pkgs/_prereq/distros/slackware.txt: Add packages needed on -current |
9286299 | Merge #33196 |
a5dca4a | tox.ini (centos-8): Remove |
b1167bd | .github/workflows/tox*.yml: Replace centos-8 by centos-stream-8 |
4898796 | .github/workflows/tox*.yml: Replace opensuse-15 (same as opensuse-15.3) by opensuse-15.2.1 |
e26283a | Merge #33103 |
2ed27f1 | Merge #33306 |
Branch pushed to git repo; I updated commit sha1. New commits:
21cfb54 | .github/workflows/ci-linux.yml: Use build/make/Makefile as no-op TARGETS_OPTIONAL |
Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
bfe561f | .github/workflows/ci-{linux,macos}.yml: Use build/make/Makefile as no-op TARGETS_OPTIONAL |
Replying to @mkoeppe:
Replying to @tobiasdiez:
Replying to @mkoeppe:
Job "build" does not reliably fit into 6 hours.
You seem to have the same problem with the current setup as well (e.g. first stage of debian-bookworm fails due timeout).
No, that's likely only a sporadic failure.
Actually it was building too much, fixed now in bfe561f
Branch pushed to git repo; I updated commit sha1. New commits:
7961b22 | .github/workflows/ci-linux.yml: Fix typo in system list |
Work Issues: reduce Docker layers
opensuse-15.3-maximal:
#135 [with-system-packages 130/204] RUN zypper --ignore-unknown install --no-confirm --auto-agree-with-licenses --no-recommends --details "pkgconfig(gf2x)" || echo "(ignoring error)"
#135 sha256:da2e30a5ebcf7a7f99e521987cff4345a05792735e3866c4f70a29402e19512e
#135 ERROR: failed to prepare 956694vunjg38oyu4pzr7kgxx: max depth exceeded
Here build/bin/write-dockerfile.sh uses too many RUN commands because $EXISTS is not defined
Branch pushed to git repo; I updated commit sha1. New commits:
7f1dc97 | build/bin/write-dockerfile.sh (opensuse): Define EXISTS |
Also caching still does not seem to work reliably.
For example, https://github.com/mkoeppe/sage/runs/5178183571?check_suite_focus=true has pushed like ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-minimal-configured:9.5-81701-ge8de33c091, and https://github.com/mkoeppe/sage/runs/5178187220?check_suite_focus=true is able to download it; but it proceeds to build everything from scratch, which leads to the 6h-timeout
Maybe use the github workflow run id as the docker tag?
Branch pushed to git repo; I updated commit sha1. New commits:
ac4d5ff | .github/workflows/ci-linux.yml: Use same TARGETS_PRE, TARGETS for all stages; do not push 2-optional*, 2-experimental* |
Replying to @tobiasdiez:
Maybe use the github workflow run id as the docker tag?
I think this was not the problem
Changed work issues from reduce Docker layers to none
Still broken - e.g. https://github.com/mkoeppe/sage/runs/5204591833?check_suite_focus=true
Branch pushed to git repo; I updated commit sha1. This was a forced push. Last 10 new commits:
a3d2200 | .github/workflows/tox-{optional,experimental}: Remove ubuntu-{hirsute,impish} here too |
03c3124 | Merge #34115 |
a9eaae5 | tox.ini, build/bin/write-dockerfile.sh: Add 'tox -e docker-...-incremental' |
b9bfbf9 | tox.ini: Add comment |
4c0d7f5 | tox.ini: Use FROM_DOCKER_REPOSITORY |
a07874d | build/bin/write-dockerfile.sh: In incremental build, keep logs |
b254da7 | Merge #34228 |
51a335b | Replace tox-optional, tox-experimental.yml by use of reusable workflow in ci-linux.yml |
6f8ef7a | build/bin/write-dockerfile.sh: Do not use persistent env var PACKAGES (except on nix) |
cce8fca | tox.ini (ubuntu-jammy, debian-bookworm, fedora-35, fedora-36): Do not IGNORE_MISSING_SYSTEM_PACKAGES |
Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
61f9063 | tox.ini (ubuntu-jammy, debian-bookworm, fedora-35, fedora-36): Do not IGNORE_MISSING_SYSTEM_PACKAGES |
1ecd6b1 | build/bin/write-dockerfile.sh: Do not use persistent env var PACKAGES (except on nix) |
6d3caa2 | Replace tox-optional, tox-experimental.yml by use of reusable workflow in ci-linux.yml |
Description changed:
---
+++
@@ -1,28 +1,5 @@
Follow up from #32703, where this was implemented for macOS.
-Here we implement it for the `docker`-based workflows by Docker layer caching via ghcr.io
+Here we implement it for the `docker`-based workflows
+via incremental builds (#34228) on top of previous stages pushed to ghcr.io
-To use for local builds via tox:
-
-```
-$ DOCKER_BUILDKIT=1 EXTRA_CACHE_FROM_REPOSITORIES=ghcr.io/mkoeppe/sage/ tox -e docker-ubuntu-focal-standard
- => [internal] load metadata for docker.io/library/ubuntu:focal 0.0s
- => ERROR importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-bootstrapped:dev 1.3s
- => ERROR importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-with-system-packages:9.6.beta0-133-gc9c7d9f8d9 1.6s
- => importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-with-targets-pre:dev 2.4s
- => ERROR importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-configured:9.6.beta0-133-gc9c7d9f8d9 1.6s
- => importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-configured:dev 2.5s
- => ERROR importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-with-targets-pre:9.6.beta0-133-gc9c7d9f8d9 1.6s
- => importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-with-targets:dev 2.4s
- => ERROR importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-with-targets:9.6.beta0-133-gc9c7d9f8d9 1.4s
- => importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-with-targets-optional:dev 2.4s
- => ERROR importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-with-targets-optional:9.6.beta0-133-gc9c7d9f8d9 1.4s
- => importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-with-system-packages:dev 2.4s
- => ERROR importing cache manifest from ghcr.io/mkoeppe/sage/sage-docker-ubuntu-focal-standard-bootstrapped:9.6.beta0-133-gc9c7d9f8d9 1.5s
-```
-
-
-(split out from #33222)
-
-We reorganize the CI workflows: `ci-linux.yml` and `ci-macos.yml` replace `tox.yml`, `tox-optional.yml`, `tox-experimental.yml`.
-Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
ae269d1 | tox.ini (docker-incremental): Do not include '-incremental' in the Docker image name |
d01263a | Merge #34228 |
ec79d4f | tox.ini (ubuntu-jammy, debian-bookworm, fedora-35, fedora-36): Do not IGNORE_MISSING_SYSTEM_PACKAGES |
e1caf6d | build/bin/write-dockerfile.sh: Do not use persistent env var PACKAGES (except on nix) |
31c5be9 | Replace tox-optional, tox-experimental.yml by use of reusable workflow in ci-linux.yml |
Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
951130a | Replace tox-optional, tox-experimental.yml by use of reusable workflow in ci-linux.yml |
Branch pushed to git repo; I updated commit sha1. New commits:
5b425ac | .github/workflows/ci-linux.yml: Run following stages even when a job in the previous stage fails |
Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
b354a38 | tox.ini, .github/workflows/docker.yml: Reimplement -incremental via SKIP_SYSTEM_PACKAGES |
5decadd | build/bin/write-dockerfile.sh: Do not use persistent env var PACKAGES (except on nix) |
6c32034 | Merge #34228 |
979779a | tox.ini (ubuntu-jammy, debian-bookworm, fedora-35, fedora-36): Do not IGNORE_MISSING_SYSTEM_PACKAGES |
326f61d | Replace tox-optional, tox-experimental.yml by use of reusable workflow in ci-linux.yml |
31c4722 | .github/workflows/ci-linux.yml: Run following stages even when a job in the previous stage fails |