coq/bot

Ideas for reducing the CI runner load

jfehrle opened this issue · 3 comments

  • Suppose the PR has a failure only in ci_fiat_crypto and you make a fix to address that. It would be helpful to be able to rerun only that job and its prerequisites--or just the jobs that failed in the last full run, or a specific named group of jobs and their prerequisites. Of course, you would later run a full build before the merge. A possible user interface (if it can be implemented) is to define labels for each CI step and have the build process remove labels for jobs that succeed and add labels for jobs that fail or don't run because of failures in prerequisites.
  • Detect whether a build job is only updating documentation and limit the CI jobs accordingly
  • Apply time limits to jobs. For example, not long ago we saw the test suite timing out after 3 hours. Maybe a limit of 1 hour is enough?
  • Do we currently detect the number of cores on the runner and configure NJOBS to match?

A possible user interface (if it can be implemented) is to define labels for each CI step and have the build process remove labels for jobs that succeed and add labels for jobs that fail or don't run because of failures in prerequisites.

This looks technically feasible but also potentially too many labels being added and removed at each step, risking polluting the PR timeline.

Detect whether a build job is only updating documentation and limit the CI jobs accordingly

This should be feasible much more easily. I guess we would still want to run the linter, the build (dependency of the doc) and the doc jobs.

Apply time limits to jobs. For example, not long ago we saw the test suite timing out after 3 hours. Maybe a limit of 1 hour is enough?

I think this has been changed. To check in the Coq repo (but in any case, outside of scope here).

Do we currently detect the number of cores on the runner and configure NJOBS to match?

No, but I think we also control what our runners are and the question of whether the current value of NJOBS was appropriate was asked and answered. Anyway, also out of scope for here.

I think this has been changed. To check in the Coq repo (but in any case, outside of scope here).

coq/coq#16669

Detect whether a build job is only updating documentation and limit the CI jobs accordingly

If we could cache our builds that's solved by itself, also a bit the first point, tho in general we will have a very low hit rate for that kind of fixes. See also coq/coq#16201

Dune will detect the number of cores available, but in some runners setup that's dangerous if such runner drives parallel jobs.