palantir/bulldozer

Account for delays in GitHub API updates

bluekeyes opened this issue · 6 comments

We've observed a rare race condition where the commit status API does not show the latest values for statuses when processing a status event. This can cause PRs to get stuck unmerged, since there are no more events.

I'm not sure how to fix this, but one option might be to thread through the status payload from the event and merge it with the values from the API. This doesn't help if two different status payloads arrive at the same time and both don't appear in the API response but it might help in the common case.

FWIW, I believe I observe this intermittently (and more often) in OSS repo basts.

Here are some PRs from an excavator pass where bulldozer failed to merge until it was poked with a manual comment:

Just wanted to flag that I don't think this is necessarily rare -- looks like there were at least 5-6 instances of this occurring for a round of excavator PRs, and anecdotally I find that there are often excavator PRs that are unmerged on my repos (wasn't aware of the trick to poke via comment until today)

I believe we may be hitting this as well, in combination with Policybot. Policybot UI indicates that the policy matches and is fulfilled, but the commit status seems to be hung and Bulldozer won't merge until someone pokes with a comment. It appears to happen once we have more than 10-15 PRs being managed at once actively, which seems low

@shimmerjs in your case, it sounds like GitHub is still showing a pending status for the Policybot check. Is that correct? If so, leaving a comment also triggers Policybot evaluation again, which will post a new status and trigger Bulldozer. Viewing the UI re-evaluates the policy (without posting a new status), so it's possible for it to disagree with the last posted status.

If you have access to the Policybot logs, there might be something there to explain why the status wasn't updated. 10-15 PRs is very low compared to the number we manage internally, but there are several possible reasons why it might happen: rate limits, saturated processing queue, etc.

@bluekeyes thanks for the response.

in your case, it sounds like GitHub is still showing a pending status for the Policybot check.

Specifically, it is actually still showing failure (no policies match -- our default policy simply requires CI status checks and a review from a member of our team, so the Policybot status remains in a failure state until those CI status checks are present & passing), instead of pending. It seems to get hung in that state even though the UI will correctly show pending (CI status checks passed + no review) or successful (CI status checks passed + one review).

If you have access to the Policybot logs, there might be something there to explain why the status wasn't updated.

I do. I was tailing them on Friday when I noticed the issue and wasn't able to see anything logged as error/warn that indicated anything was incorrectly happening, in fact I saw that policies were being successfully applied with the intended result consistently, which led me to thinking that Bulldozer might be the issue.

I did however notice some calls for team membership were returning size: -1, and assumed it was because Policybot wasn't pulling children team members by default -- I added all children team members to the parent team referenced in our Policybot configuration and that doesn't seem to have changed anything. I will continue investigating as cycles become available and sharing information here. Unless you believe I should file a Policybot issue?

Thanks again

Ah, thank you for clarifying. That does sound like it could be a similar issue, where Policy Bot does not detect the new status after receiving a status event. Since the root cause might be different and the solution will probably be different, it would help to have a separate issue on Policy Bot. If you can include the logs for one of the evaluations that should have updated but didn't, that would be great, but I understand if it's too hard to remove the identifying information.