ocurrent/ocaml-ci

Local Opam Vars Job Causing Outages

mtelvers opened this issue · 2 comments

OCaml-CI runs some jobs locally before it begins submitting jobs to the cluster. First, each base image is pulled (~60 images), and then the opam-vars and opam-vars (lower-bound) jobs are run on each image. On a new installation, all of the pulls and the subsequent jobs run concurrently. The data gathered by opam-vars is relatively static and could be, and in some cases has been, hard-coded into OCaml-CI. As the jobs all run locally, they only run on the OCaml-CI host architecture (currently AMD64), and other platforms are assumed to be the same.

ocaml-ci/lib/platform.ml

Lines 153 to 166 in 047de92

let opam_template arch =
  let arch = Option.value ~default:"%{arch}%" arch in
  Fmt.str
    {|
    {
      "arch": "%s",
      "os": "%%{os}%%",
      "os_family": "%%{os-family}%%",
      "os_distribution": "%%{os-distribution}%%",
      "os_version": "%%{os-version}%%",
      "opam_version": "%%{opam-version}%%"
    }
  |}
    arch
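
To make the escaping in the snippet above easier to follow, here is a minimal stdlib-only sketch of the same idea. It is an illustration, not the project's code: it uses Printf.sprintf in place of Fmt.str and trims the field list. In an OCaml format string, %% produces a literal %, so the rendered text contains %{os}% style placeholders, which are presumably expanded later by opam inside each image. Only the arch field is filled in by OCaml.

```ocaml
(* Stdlib-only sketch of the template above. Printf.sprintf stands in for
   Fmt.str, and the field list is shortened for illustration. *)
let opam_template arch =
  (* With no explicit architecture, emit a placeholder instead of a value. *)
  let arch = Option.value ~default:"%{arch}%" arch in
  Printf.sprintf
    {|{
  "arch": "%s",
  "os": "%%{os}%%",
  "opam_version": "%%{opam-version}%%"
}|}
    arch

let () =
  (* Print the template with the arch placeholder left in place. *)
  print_endline (opam_template None);
  (* Print the template with an explicitly supplied architecture. *)
  print_endline (opam_template (Some "arm64"))
```

With None, the arch field carries the %{arch}% placeholder through unchanged; with Some "arm64", the architecture is fixed by the caller and the remaining fields are still left as placeholders.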

The jobs are rebuilt every 30 days.

let schedule = Current_cache.Schedule.v ~valid_for:(Duration.of_day 30) () in

Disk space on the OCaml-CI machine is finite. A cron job runs every hour, deleting the oldest log data and maintaining the volume at 90% capacity. Cron also runs a docker system prune to clear old images. However, when a large number of jobs rebuild simultaneously, the machine can run out of space (see #946).

Options:

  1. Hardcode these data as they change very slowly and could already be up to 30 days out of date;
  2. Submit the jobs to OCluster, thus avoiding the space and capacity issues; or
  3. Update OCaml-CI to manage the number of concurrent jobs.
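
As a rough illustration of option 3, the sketch below splits the list of images into batches of at most `limit` and processes one batch at a time, so no more than `limit` local jobs would be in flight at once. This is a simplified, sequential model using only the standard library. The names `batches` and `run_in_batches` and the image names are hypothetical, and a real change would more likely bound concurrency inside the pipeline (for example with a resource pool) rather than batch by hand.

```ocaml
(* Split [xs] into consecutive batches of at most [n] elements each. *)
let batches n xs =
  if n <= 0 then invalid_arg "batches: n must be positive";
  let rec go acc cur count = function
    | [] -> List.rev (if cur = [] then acc else (List.rev cur :: acc))
    | x :: rest ->
        if count = n then go (List.rev cur :: acc) [ x ] 1 rest
        else go acc (x :: cur) (count + 1) rest
  in
  go [] [] 0 xs

(* Apply [job] to every image, one batch at a time. In a real pipeline the
   jobs within a batch would run concurrently; here they run sequentially. *)
let run_in_batches ~limit ~job images =
  List.concat (List.map (fun batch -> List.map job batch) (batches limit images))

let () =
  (* Hypothetical image names, for illustration only. *)
  let images = [ "image-a"; "image-b"; "image-c"; "image-d"; "image-e" ] in
  let job image = Printf.sprintf "opam-vars done for %s" image in
  List.iter print_endline (run_in_batches ~limit:2 ~job images)
```

With around 60 images and a small limit, a fresh installation would stage its pulls and opam-vars runs instead of starting them all at once, at the cost of a slower first start.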

Were you able to get a sense for what these local jobs do? What is the result of the runs used for?

Ideally you want option 2, which could also remove the hardcoding for the other platforms. That needs some changes to OCluster so that it can return the results of running the commands on a cluster worker.