jaegertracing/jaeger

Do not fail all CI matrix jobs if one of them fails

yurishkuro opened this issue · 0 comments

Many of our CI workflows use a matrix strategy. Unfortunately, when one of the jobs fails, which sometimes happens due to a transient issue (e.g. a timeout when downloading a dependency), all of the other jobs in the matrix are aborted. We would like those jobs to continue, since they are usually independent.

The cancellations could happen for two reasons:

  1. We do not set continue-on-error on the matrix jobs, which may be the cause.
  2. We usually have workflow-level concurrency settings with cancel-in-progress: true to avoid running multiple jobs for different commits of the same pull request (we only want the latest commit to run). This setting may be responsible for the cancellations if the group key is not granular enough. For instance, in the build-binaries workflow we have group: ${{ github.workflow }}-${{ (github.event.pull_request && github.event.pull_request.number) || github.ref || github.run_id }}, so all jobs in the matrix get the same key and might all be cancelled at once.

We need to validate which of these two reasons is the root cause (by introducing a failure into one of the jobs) and test out a fix.
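
One way to run the validation described above is a throwaway workflow that forces a single matrix job to fail while the others are still running. The sketch below is an assumption for illustration, not the project's actual configuration: the workflow name, job name, matrix values, and step contents are all hypothetical. It also sets strategy.fail-fast: false, a documented GitHub Actions setting that defaults to true and cancels in-progress and queued matrix jobs when any job in the matrix fails; it may be worth ruling out alongside the two candidates listed above.

```yaml
# Hypothetical test workflow; names and matrix values are illustrative only.
name: matrix-cancellation-test

on:
  pull_request:

# Same concurrency pattern as quoted from the build-binaries workflow.
concurrency:
  group: ${{ github.workflow }}-${{ (github.event.pull_request && github.event.pull_request.number) || github.ref || github.run_id }}
  cancel-in-progress: true

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      # Defaults to true, which cancels the remaining matrix jobs
      # as soon as one of them fails. Flip this to compare behavior.
      fail-fast: false
      matrix:
        platform: [linux, darwin, windows]
    steps:
      # Deliberately fail one matrix entry to simulate a transient error.
      - name: Simulate a failure in one matrix job
        if: matrix.platform == 'darwin'
        run: exit 1

      # Stand-in for the real build; the sleep keeps the other jobs
      # in progress long enough to observe whether they get cancelled.
      - name: Placeholder build step
        run: sleep 60 && echo "built ${{ matrix.platform }}"
```

Running this once with fail-fast: true and once with fail-fast: false, while leaving the concurrency block unchanged, should show whether the concurrency group plays any role in the cancellations.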