google/zerocopy

Optimize CI caching strategy


Per @sl4m in #1375 (comment):

Amazing, thank you for fixing this! I'm not sure I understand why this fix works, though?

Ok, I think I know what's really going on. My change worked in my fork because I did not have existing caches.

I believe cargo-nextest was recently introduced in #1307? If so, this may be a result of an old cache. A new cache is created only if Cargo.toml changes (see here), and since GitHub Actions caches are immutable, the old cache did not preserve the cargo-nextest installation.

Cargo.toml has since been updated via google-pr-creation bot, so the caches should now have cargo-nextest.

I checked a recent build, and it looks like it's skipping the cargo-nextest installation as expected:

Running Miri tests with 8 threads
    Updating crates.io index
     Ignored package `cargo-nextest v0.9.72` is already installed, use --force to override
Preparing a sysroot for Miri (target: aarch64-unknown-linux-gnu)... done

I think I will close this PR if you agree, and you can probably close #1312.

Since the cache is immutable based on the key, our current caching strategy has a few issues:

  • If there is a cache hit, then our "generate cache" job is pointless - it's doing work that will not be persisted
  • If there is a cache hit but dependencies have changed since the cache was generated, the "generate cache" job's updated output is discarded, so all jobs will needlessly re-do some amount of work

We should do the following. (Note that I use "inter-workflow cache hit" here to refer to a cache hit in the "generate cache" job, which happens when a workflow execution uses the same cache key as a previous workflow execution. Our CI is designed so that, once the "generate cache" job has run, all subsequent/dependent jobs always have cache hits and can reuse the artifacts generated by that job.)

  • Figure out whether it's possible to detect a cache hit; in that case, the "generate cache" job should simply be a no-op
  • Consider using a different cache key. A few considerations:
    • The problem only arises on inter-workflow cache hits. Every time we have an inter-workflow cache miss, the "generate cache" job has the maximum possible impact (namely, all of its artifacts will be used by dependent jobs). This suggests that we should want inter-workflow cache misses to happen more frequently, possibly even on every workflow execution (e.g., we could use the workflow run ID as part of the cache key)
    • Even when the cache is somewhat stale (e.g., doesn't contain all of the artifacts that jobs need), it still has some benefit, so the difference between an inter-workflow cache hit (possibly somewhat stale) and an inter-workflow cache miss (updated anew and so definitely not stale) is not the difference between doing 100% of the work from scratch and doing 0% of it from scratch. It's unclear what the real percentage is, but it's not 100%.
    • It takes time to generate the cache in the first place as part of the "generate cache" job. This suggests that an inter-workflow cache hit can improve performance. It's likely the case that, most of the time, a given workflow execution has either mostly or entirely the same artifacts that it needs to build as other, recent workflow executions. We should be careful not to be too aggressive, or else we'll lose out on this benefit.
    • It may be possible to design the cache key such that it always changes when the required artifacts change (although this would risk false positives, i.e., changing the cache key when it's unnecessary). See [1].
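The first bullet above should be feasible: the stock actions/cache action exposes a `cache-hit` output, which is `'true'` on an exact key match, and subsequent steps can be gated on it. A rough sketch (the step id, job name, cache path, and key are placeholders, not taken from our actual workflow):

```yaml
# Hypothetical "generate cache" job. The `cache-hit` output of
# actions/cache is 'true' only on an exact key match, so the
# expensive population step becomes a no-op on a cache hit.
jobs:
  generate-cache:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - id: cargo-cache              # step id is an assumption
        uses: actions/cache@v4
        with:
          path: ~/.cargo             # placeholder path
          key: cargo-${{ hashFiles('**/Cargo.toml') }}
      - name: Populate cache
        if: steps.cargo-cache.outputs.cache-hit != 'true'
        run: cargo install cargo-nextest
```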

[1] The way we have things structured, the "generate cache" job is always the first job to use a given cache key. Thus, it's the only job that can ever observe a cache miss, and so it's the only job that can ever populate the cache. Thus, it's sufficient to look at that job to figure out what artifacts might be cached. As of this writing:

cargo check --workspace --tests &> /dev/null &
cargo metadata &> /dev/null &
cargo install cargo-readme --version 3.2.0 &> /dev/null &
cargo install --locked kani-verifier &> /dev/null &
cargo install cargo-nextest &> /dev/null &
cargo kani setup &> /dev/null &

This suggests that the following would be an acceptable cache key:

  • .github/workflows/ci.yml (probably best to do .github/** in case we rename things or split ci.yml in the future)
  • All Cargo.toml files

The only question marks here are the `cargo install` targets, whose installed versions may change when new versions are uploaded to crates.io even though no file in the repository changes (see also #1379).
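As a sketch, the key suggested above could be expressed with `hashFiles` (the `path` and the `cargo-` prefix are placeholders). `restore-keys` performs a prefix match on a miss, which would capture the "somewhat stale cache is still useful" point from earlier:

```yaml
# Hypothetical actions/cache configuration for the suggested key.
with:
  path: ~/.cargo                     # placeholder path
  key: cargo-${{ hashFiles('.github/**', '**/Cargo.toml') }}
  # On an exact-key miss, fall back to the most recent cache whose
  # key shares this prefix; stale but still partially useful.
  restore-keys: |
    cargo-
```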