barosl/homu

Pending state is fragile to CI/builder failures

Closed this issue · 6 comments

In cases where homu sets a pull to pending, but something goes wrong with the builders or CI systems, it's easy for homu to end up stuck but with nothing going on. Worse, the Synchronize button doesn't help, and issuing a retry appears to do nothing, either.

What's the right thing to do here? We've run into this a bit when either Travis CI or Linode (where we host buildbot) or GitHub goes wonky under a DDoS, as we start losing messages, getting aborted builders w/o any status messages, etc.

Today, I manually restart all the services on the server, go through GH and re-deliver messages, and close PRs as necessary to get things going again. It would be great if there were either:

  1. A different form of synchronize that just "forgot" all pending work and transitioned things back to the approved state
  2. Something we could put in the PRs (e.g., @homu reset?) to clear the pending back to approved

cc @Manishearth for more feedback/ideas and @metajack @edunham since it's been an ugly week or two :-)

retry force clean might be the right invocation.

I don't know if clean on its own is the reset you want; it clears build details but doesn't touch pending.

In my (probably controversial) opinion, if we ran clean on every single build and it caused fewer issues than what we experience now, I'd consider that an improvement.

That could probably be arranged in the (probably inevitable) fork. I think things like closing and reopening should clean. I'm okay with retry not cleaning.

Why is a fork inevitable?

issuing a retry appears to do nothing, either.

In general, retry should trigger a rebuild even when the PR is in the pending state. If that didn't work, I guess this is more likely due to a bad interaction between Homu and Buildbot. As Travis is informed about the state change directly through the GitHub webhooks, it is more reliable, or at least more reliable than Buildbot. Working with Buildbot using the auto branch is quite fragile.

  1. A different form of synchronize that just "forgot" all pending work and transitioned things back to the approved state

This is exactly what retry does. The problem might be in the fragile communication with Buildbot, which plays out as follows: Homu sets up the auto branch in the hope that Buildbot will respond, but Buildbot's response packets are somehow lost. Homu does not wait for Travis in the same scenario, because the merge commit is reported to Travis directly through the GitHub webhooks. So I've been thinking that the right thing to do is to introduce some timeouts on the connection between Homu and Buildbot.
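To make the timeout idea a bit more concrete, here is a minimal sketch, not Homu's actual code. `PullState`, `start_build`, `on_builder_result`, and `expire_stale_pending` are hypothetical names invented for illustration, and the status strings are assumptions. The idea is only that a pull which has been pending for too long without hearing back from a builder gets moved back to approved so it can be picked up again.

```python
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical model of a pull request's state; not Homu's real data structure.
@dataclass
class PullState:
    num: int
    status: str = "approved"               # assumed values: approved, pending, success, failure
    pending_since: Optional[float] = None  # timestamp of when the build was started

    def start_build(self, now: float) -> None:
        """Mark the pull as pending and remember when we started waiting."""
        self.status = "pending"
        self.pending_since = now

    def on_builder_result(self, ok: bool) -> None:
        """Record a result that actually arrived from a builder."""
        self.status = "success" if ok else "failure"
        self.pending_since = None


def expire_stale_pending(states: List[PullState], timeout_secs: float, now: float) -> List[int]:
    """Reset pulls that have been pending longer than timeout_secs back to approved.

    Returns the numbers of the pulls that were reset.
    """
    reset = []
    for state in states:
        if (
            state.status == "pending"
            and state.pending_since is not None
            and now - state.pending_since > timeout_secs
        ):
            state.status = "approved"
            state.pending_since = None
            reset.append(state.num)
    return reset
```

A periodic task could call `expire_stale_pending` with the current time. Choosing the timeout is the hard part: it has to be longer than the slowest legitimate build, otherwise healthy builds would be reset too.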

Also note that force and clean are Buildbot-specific commands, so they are not applicable to Travis. (Actually, I've just found that clean is mistakenly enabled for Travis-enabled repositories too, which seems to be a severe bug.)
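One way such a guard could look, purely as an illustration: the sketch below filters Buildbot-only command words for repositories that do not use Buildbot. `BUILDBOT_ONLY`, `split_commands`, and the `uses_buildbot` flag are hypothetical and do not reflect how Homu actually parses commands.

```python
from typing import Iterable, List, Tuple

# Assumption: these are the only Buildbot-specific command words, per the comment above.
BUILDBOT_ONLY = {"force", "clean"}


def split_commands(words: Iterable[str], uses_buildbot: bool) -> Tuple[List[str], List[str]]:
    """Split command words into (accepted, rejected) for a given repository.

    Buildbot-only words are rejected when the repository does not use Buildbot.
    """
    accepted: List[str] = []
    rejected: List[str] = []
    for word in words:
        if word in BUILDBOT_ONLY and not uses_buildbot:
            rejected.append(word)
        else:
            accepted.append(word)
    return accepted, rejected
```

For example, `split_commands(["retry", "force", "clean"], uses_buildbot=False)` would accept only `retry`, while the same call with `uses_buildbot=True` would accept all three.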

  2. Something we could put in the PRs (e.g., @homu reset?) to clear the pending back to approved

As I said above, that's what retry does!

@barosl Aha! I'll close this, then. I suspect the issue is related to our buildbot problems, rather than things in homu itself.

We were also using force way too much during these errors, which made the problems even worse.