openforcefield/openff-docs

Race condition in `cookbook_preproc.yaml`

Closed this issue · 1 comments

If more than one copy of cookbook_preproc.yaml runs at the same time, one will win and the other will do a lot of computation and then lose. This is because both will clone the cache, compute a bunch, and then overwrite the cache with their own data.

Only one copy of the workflow can run at a time on each branch/pr, with earlier copies being cancelled when later ones run, so this isn't a complete disaster. But I should still come up with some git fanciness to avoid this.

The problem would be a lot simpler if we were keeping history in the cache, but it should still be fixable.

Easy mitigation: Clone the cache right before deploying changes to minimise the time the race condition is possible.

Proper fix:

  • Change cookbook_preproc's concurrency group to just be the workflow and the DEPLOY_SUBDIR. This way if two runs are going at the same time, they're definitely writing to different folders of the cache and so won't create merge conflicts.
  • When it's time to deploy, repeatedly until it succeeds:
    1. merge in any changes
    2. collapse the history
    3. git push --force-with-lease
    4. wait a random amount of time (so that two runs deploying at exactly the same time eventually fall into an order)

Note to self - cookbook_pr_close.yaml needs to be included in this mitigation/fix.

It might be easier and more robust to have a different tag/branch for each cache, rather than a different folder in the same branch - that plus one concurrency group per branch should solve the race condition without needed to mess around with merges. They could be named _cookbook_data_main, _cookbook_data_PR15 etc. Would also simplify the cleanup action - it just needs to delete a branch, rather than make a commit.