palantir/bulldozer

PRs not being automerged reliably

AnetaKvapilova opened this issue · 7 comments

Hello, this is more of a question / request for advice.
We're using Bulldozer to automatically merge all PRs that are approved by policy-bot, so the Bulldozer configuration file is quite simple:

version: 1
merge:
  method: merge
  delete_after_merge: true
  required_statuses:
    - "policy-bot: master"
  ignore:
    labels: ["no automerge"]

However, we're running into issues with automatic merging. It has never worked 100% of the time, but the number of unmerged PRs was acceptable. A few days ago, though, PRs suddenly stopped being merged. I can see PRs approved by policy-bot, but only e.g. 2 out of 14 actually get merged; the rest fail with:

is deemed not mergeable because of unfulfilled status checks: [policy-bot: master]

We have to manually update the PR (e.g. add a comment) and then it's merged just fine. Nothing has changed in the configuration itself.
Example from one of the PRs:
policy-bot log:

{"level":"info","rid":"c3i1u8v3pb9eigu4cs0g","github_event_type":"status","github_delivery_id":"xxx","github_installation_id":47,"github_repository_owner":"xxx","github_repository_name":"xxx","time":"2021-07-06T09:15:16.561970528Z","message":"Setting \"policy-bot: master\" status on 71e5926718615c934eab33951dafb4833eceebac to success: All rules are approved"}

bulldozer log:

{"level":"debug","rid":"c3i1u973pb9emgbbnia0","github_event_type":"status","github_delivery_id":"xxx","github_installation_id":40,"github_repository_owner":"xxx","github_repository_name":"xxx","github_pr_num":2985,"time":"2021-07-06T09:15:16.846796347Z","message":"xxx/xxx#2985 is deemed not mergeable because of unfulfilled status checks: [policy-bot: master]"}

Also (maybe related, maybe not) - when looking into Bulldozer's logs we can see a lot of errors like:

422 Reference does not exist []

A few days before this started happening, Bulldozer was updated to version 1.13.2; we're running it with GitHub Enterprise v3.1.1.
Is it possible that something in the new version is causing the described issue? Is there anything we can do to make it work again? I can add additional info if needed.
We have multiple projects that rely on Bulldozer and we would really like to make it work reliably.
cc @sideeffffect.

Sorry that you're having reliability issues with Bulldozer. We're also running Bulldozer 1.13.2 with GitHub Enterprise 3.1.1 and are not seeing the behavior you describe, so I don't think this is necessarily related to the upgrade.

Do you know which version of Bulldozer you were using before upgrading to 1.13.2? That would tell me which new features we included in the upgrade.

For the problem with missed merges, the most common reason I've seen for this is hitting GitHub rate limits. If this is the case, you should be able to see these errors in the logs (although the messages are not tagged with the repository/PR number.)

If rate limits are not the problem, it will be helpful to have complete logs for a single pull request that had this problem, from when it was opened to when it merged. If you can share a redacted copy of these logs, that would be great.

Regarding the 422 Reference does not exist error, I believe this is expected. You'll see this when Bulldozer tries to delete a branch after merging if the branch was already deleted (either by the user, by another bot, or by GitHub's native feature to delete branches on merge.) I filed #253 to track removing this error message from the logs.

We upgraded Bulldozer to 1.13.2 from 1.10.1 (and Policy-bot to 1.22.1 from 1.20.0).

@AnetaKvapilova could you fetch the logs from one of your problematic PRs for @bluekeyes 🙏 ? Let me know if you need any help with that.

For the problem with missed merges, the most common reason I've seen for this is hitting GitHub rate limits. If this is the case, you should be able to see these errors in the logs (although the messages are not tagged with the repository/PR number.)

Could you please share how exactly the message should look? We've tried to search for it in the logs, but don't see any errors like that. We can try again if we know exactly what to look for.

Regarding the 422 Reference does not exist error, I believe this is expected.

OK, thank you very much for the explanation.

I'm attaching logs from both bots - Bulldozer and policy-bot. I tried to select only the relevant parts, but as I said, we're using them for multiple repos, so apologies if some messages from other PRs are left in (I definitely didn't remove any error messages etc. - from that point of view the logs are complete).
policy.log
bulldozer.log

If you are encountering rate limits, you should see messages like this in the logs:

{
  "level": "error",
  "rid": "c35j7irt535deut1lk60",
  "github_event_type": "status",
  "github_delivery_id": "4a768d8a-cf60-11eb-98bd-2f28b1d0cb41",
  "error": "failed to determine open pull requests matching the status context change: failed to list pull requests for repository owner/repo: GET https://github.internal.domain/api/v3/repos/owner/repo/pulls?per_page=100&state=open: 403 API rate limit of 12500 still exceeded until 2021-06-17 11:36:44 +0000 UTC, not making remote request. [rate reset in 0s]\ngithub.com/palantir/bulldozer/pull.ListOpenPullRequests\n\t/home/runner/work/bulldozer/bulldozer/pull/pull_requests.go:80\n...,
  "time": "2021-06-17T11:36:43.888108239Z",
  "message": "Unexpected error handling webhook"
}

The exact stack trace and message will probably be different, but the rate limit of N still exceeded and the Unexpected error handling webhook parts should be there.
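If it helps, one quick way to search for those markers (a rough sketch, assuming your Bulldozer logs are newline-delimited JSON in a file such as the bulldozer.log you attached):

# search for rate-limit errors and webhook handler failures in the Bulldozer logs
grep -E 'rate limit of [0-9]+ still exceeded|Unexpected error handling webhook' bulldozer.log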

Thank you for providing the additional logs. I'll take a look and see if I find anything.

After reviewing the logs, this may be an issue with GitHub or with Bulldozer's caching. I've seen problems before where events that normally appear synchronous are actually asynchronous, but you can't observe this unless the GitHub appliance is under heavy load or experiencing another issue. Do you have access to GitHub metrics to see if there is unusual load?

I can also see that Bulldozer used a cached response when listing the statuses and check runs for 71e5926718615c934eab33951dafb4833eceebac. It should always revalidate the cache (by making a conditional request to GitHub), but I can't confirm from the application logs that this actually happened. If you'd like to confirm and have GitHub logs, look for a request to the commit status endpoint for the previous commit around 2021-07-06T09:15:16.846135069Z that returned a 304 response. It's possible there is a bug in the caching logic or that GitHub falsely claimed there was no change.
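For context, the cache revalidation is just a standard conditional request against the GitHub API: if nothing has changed, GitHub answers 304 Not Modified and the cached body is reused. A minimal sketch of what that exchange looks like (the host comes from the earlier log example; the token, ETag, and SHA are placeholders, and the exact endpoint Bulldozer queries may differ):

# conditional request against the combined commit status endpoint;
# a 304 response means GitHub claims the cached statuses are still current
curl -i \
  -H "Authorization: token <token>" \
  -H 'If-None-Match: "<etag-from-cached-response>"' \
  "https://github.internal.domain/api/v3/repos/owner/repo/commits/<sha>/status"
# HTTP/1.1 304 Not Modified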

To help confirm that this is an event vs. API inconsistency, I'd like to know more about the webhook with delivery ID ad90cfd2-de3a-11eb-93db-b506fb2e4007. If you have SSH access to your GitHub Enterprise appliance, you can find this by running ghe-webhook-logs:

ghe-webhook-logs -g ad90cfd2-de3a-11eb-93db-b506fb2e4007

This will give you the full payload of the event and some extra information. This should be a status event, so I'd like to see the status context, the status value, and the commit SHA.

Hi @bluekeyes, we were unable to get the webhook logs; they had probably rolled out of log retention by the time we got to it.
The issue keeps happening, though. The next time we encounter a missed PR, what diagnostic information would you like us to provide?

  • Will Bulldozer logs filtered to the given PR do, or shall we send over full Bulldozer logs?
  • How do we arrive at the webhook delivery ID(s) for which to search webhook logs?
  • Is the Recent Deliveries page on Bulldozer's Advanced tab the same thing as what ghe-webhook-logs would give us? We have to ask a GHE admin to run ghe-webhook-logs, which typically takes some time, but we can access Recent Deliveries without admin permissions, so if we could peek in there instead, it would make things easier.

Thank you!

Will Bulldozer logs filtered to the given PR do, or shall we send over full Bulldozer logs?

Logs filtered to a PR should be enough. There are a few messages early in the webhook handler that aren't associated with a PR number, but these shouldn't matter for the issue you've described.
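If it's useful, one way to do that filtering (a sketch, assuming the logs are newline-delimited JSON like the snippets above; substitute your owner, repo, and PR number):

# keep only the log lines tagged with a specific PR (here #2985 from your earlier example)
jq -c 'select(.github_repository_owner == "xxx" and .github_repository_name == "xxx" and .github_pr_num == 2985)' bulldozer.log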

How do we arrive at the webhook delivery ID(s) for which to search webhook logs?

Bulldozer logs the delivery ID as one of the log fields, so if you have a particular log message of interest, you can see which delivery caused it. You can also find delivery IDs by viewing the "Recent Deliveries" page in the UI if you want to go the other way: find all the logs associated with a particular event.
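For example, to collect the delivery IDs that touched a given PR (same assumptions as the sketch above), something like:

# list the unique delivery IDs associated with PR #2985 in the Bulldozer logs
jq -r 'select(.github_pr_num == 2985) | .github_delivery_id' bulldozer.log | sort -u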

Is the Recent Deliveries page on Bulldozer's Advanced tab the same thing as what ghe-webhook-logs would give us?

Yes, it displays the most important information, like the webhook payload. The disadvantage of the "Recent Deliveries" page is that it shows all events for all installations of Bulldozer and doesn't have a search feature, which can make it hard to find the particular event of interest.