Pending state is fragile to CI/builder failures
Closed this issue · 6 comments
In cases where homu sets a pull to `pending`, but something goes wrong with the builders or CI systems, it's easy for homu to end up stuck but with nothing going on. Worse, the Synchronize button doesn't help, and issuing a `retry` appears to do nothing, either.
What's the right thing to do here? We've run into this a bit when either Travis CI or linode (where we host buildbot) or GitHub goes wonky under a DDOS, as we start having lost messages, aborted builders w/o any status messages, etc.
Today, I manually restart all the services on the server, go through GH and re-deliver messages, and close PRs as necessary to get things going again. It would be great if there were either:
- A different form of synchronize that just "forgot" all pending work and transitioned things back to the `approved` state
- Something we could put in the PRs (e.g., `@homu reset`?) to clear the `pending` back to `approved`
cc @Manishearth for more feedback/ideas and @metajack @edunham since it's been an ugly week or two :-)
`retry force clean` might be the right invocation.
I don't know if `clean` on its own is the `reset` you want; it clears build details but doesn't touch `pending`.
In my (probably controversial) opinion, if we run `clean` on every single build and it causes fewer issues than what we experience now, I'd consider that an improvement.
That could probably be arranged in the (probably inevitable) fork. I think things like closing and reopening should clean. I'm okay with retry not cleaning.
Why is a fork inevitable?
issuing a `retry` appears to do nothing, either.
In general, `retry` should trigger a rebuild even when the PR is in the `pending` state. If that didn't work, I guess this is more likely due to a bad interaction between Homu and Buildbot. As Travis is directly informed about the state change through the GitHub webhooks, it is more reliable than Buildbot. Working with Buildbot using the `auto` branch is quite fragile.
- A different form of synchronize that just "forgot" all pending work and transitioned things back to the approved state
This is exactly what `retry` does. The problem might be the fragile communication with Buildbot, which can fail as follows: Homu sets up the `auto` branch in the hope of Buildbot responding, but Buildbot's response packets are somehow lost. Homu does not wait for Travis in the same scenario, because the merge commit is reported to Travis directly through the GitHub webhooks. So I've been thinking the right fix is to introduce some timeouts on the connection between Homu and Buildbot.
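To make the timeout idea concrete, here is a minimal sketch of a pull that falls back from `pending` to `approved` once it has waited too long for a builder response. Everything in it is an assumption for illustration: `PullState`, `start_build`, `expire_if_stale`, and the two-hour `PENDING_TIMEOUT_SECS` are hypothetical names and values, not Homu's actual classes or configuration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical value; the thread does not propose a specific timeout.
PENDING_TIMEOUT_SECS = 2 * 60 * 60


@dataclass
class PullState:
    """Hypothetical stand-in for a queued pull request (not Homu's real class)."""
    num: int
    status: str = "approved"
    pending_since: Optional[float] = None

    def start_build(self, now: float) -> None:
        # Record when the build was handed off so staleness can be detected.
        self.status = "pending"
        self.pending_since = now

    def expire_if_stale(self, now: float,
                        timeout: float = PENDING_TIMEOUT_SECS) -> bool:
        """Reset a pull that has been pending for at least `timeout` seconds.

        Returns True if the pull was reset to "approved", False otherwise.
        """
        if self.status != "pending" or self.pending_since is None:
            return False
        if now - self.pending_since < timeout:
            return False
        self.status = "approved"
        self.pending_since = None
        return True
```

A periodic task could call `expire_if_stale(time.time())` on each queued pull, so a lost Buildbot response eventually puts the pull back in the queue instead of leaving it stuck.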
Also note that `force` and `clean` are Buildbot-specific commands, so they are not applicable to Travis. (Actually, I've just found that `clean` is mistakenly enabled for Travis-enabled repositories too; this seems to be a severe bug.)
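One way to avoid that kind of mix-up would be to filter command words by CI backend before acting on them. The sketch below is only an illustration of that idea: `BUILDBOT_ONLY`, `filter_commands`, and the backend strings are hypothetical and do not reflect how Homu actually parses comments.

```python
from typing import Iterable, List

# Per the comment above, `force` and `clean` only make sense for Buildbot.
BUILDBOT_ONLY = frozenset({"force", "clean"})


def filter_commands(words: Iterable[str], backend: str) -> List[str]:
    """Drop Buildbot-specific words unless the repository uses Buildbot.

    `backend` is a hypothetical label such as "buildbot" or "travis".
    """
    return [w for w in words
            if backend == "buildbot" or w not in BUILDBOT_ONLY]
```

With this, `retry force clean` on a Travis-only repository would reduce to a plain `retry`, while a Buildbot repository would keep all three words.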
- Something we could put in the PRs (e.g., @homu reset?) to clear the pending back to approved
As I said above, that's what `retry` does!
@barosl Aha! I'll close this, then. I suspect the issue is related to our buildbot problems, rather than things in homu itself.
We were also using `force` way too much during these errors, which made the problems even worse.