luci/recipes-py

Provide a way to timeout a step

Closed this issue · 23 comments

Yesterday one of our steps hung, stalling the buildbot. It wasn't picking up new builds, resulting in two days' worth of lost test coverage. It would be really great if api.step had a timeout. The default could be conservative, on the order of hours, with an optional parameter to set a custom timeout.

/cc @riannucci

Yes, it's probably time to implement this for real. The last time this came up, it was in the context of using timeouts to 'catch slow tests', which wasn't a good use of the functionality at the time (that sort of enforcement should live closer to the actual tests, so that devs hit it before the build does, and it can be fine-grained per test case, etc.). However, as an 'if this step takes more than X minutes, the world is on fire' measure, it's very reasonable (and good).

I think there's room for 3 types of timeouts (all of which should be optional):

  • global timeout (recipe should never take more than X time to run)
  • per-step timeout (step should never take more than X time to run)
  • per-step silence timeout (if the step fails to produce output for X time, assume it's dead and kill it)

I'd also propose that these raise a different kind of exception than a plain failure: probably a StepTimeout derived from StepFailure.
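A rough sketch of what the proposed surface could look like (the `timeout` and `output_timeout` parameter names are made up for illustration; nothing here exists yet):

```python
from recipe_engine import recipe_api


class StepTimeout(recipe_api.StepFailure):
  """Proposed: raised when a step exceeds one of its time limits."""


def RunSteps(api):
  # Per-step timeout: fail the step if it runs longer than 30 minutes.
  api.step('compile', ['ninja', '-C', 'out/Release'], timeout=30 * 60)

  # Per-step silence timeout: fail the step if it produces no output
  # for 5 minutes, on the assumption that it's hung.
  api.step('tests', ['run_tests.py'], output_timeout=5 * 60)
```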

/cc @martiniss @erikus

Makes sense. In order of utility I think we want per-step timeouts, then global, then silence, but I could be convinced that global should come first.

Sorry for jumping in, but is it also possible to have a callback on timeout?

I suppose buildbot does a global timeout now, and when a global timeout fires I usually find it nice to have a mechanism to upload logs so that I can analyze them later.

Rather than a callback, I think the recipe engine should handle a timeout by raising an explicit StepTimeout exception, which the recipe itself could then catch and handle.
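For example, a recipe could catch the (still hypothetical) StepTimeout and do a best-effort log upload before re-raising; the upload script here is just a placeholder:

```python
def RunSteps(api):
  try:
    api.step('compile', ['ninja', '-C', 'out/Release'], timeout=60 * 60)
  except StepTimeout:  # assumes the proposed exception from above
    # Best-effort postmortem: preserve whatever diagnostics exist
    # before the build ends, then surface the original failure.
    api.step('upload goma logs', ['python', 'upload_goma_logs.py'])
    raise
```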

BTW, timeouts are much easier to implement for swarming tasks than for local tasks, since swarming tasks can handle non-graceful shutdown much better. We shouldn't focus too much on local task timeouts, since they're always going to be broken in one way or another.

I think they address different problems. I know at least in the case for @yoshisatoyanagisawa he wants to set a timeout for stuck goma compiles. Would the swarming task timeout help for cases where recipes are on swarming, setting a timeout for the whole recipe? For long-running recipes I think it especially makes sense to support step timeouts.

Thanks @erikus, what we would like to do is a postmortem after a timeout or sudden death of a buildbot slave, which could be something related to goma. I want to use something like Python's finally clause to preserve all the data needed to investigate the cause later.

I think the right way to do that sort of investigation is to stream the debugging data ASAP via an alternate logdog stream. Finally clauses are unreliable on bots. Using logdog would preserve the data in the cloud as soon as it's produced.


Can I have a pointer to documentation on LogDog streams?

@danjacques for logdog stream advice.

@riannucci why are finally clauses unreliable on bots? Isn't a common pattern to try a step and perform some action regardless of failure, e.g. doing something with api.step.active_result?

@erikus finally is unreliable on bots because the bot can, and does, crash (e.g. the CPU/GPU stops, the kernel panics, etc.). In events like these, the finally clause will not run. If you're relying on finally to clean up files on disk, and your bot crashes semi-frequently, you'll just run out of disk space.

The right pattern (assuming persistent bots, which we have) is to clean up first, then run your workload. This resets all assumptions on the bot before you run code that depends on them (e.g. if you assume that things are clean, the only way to be sure they're clean is to have cleaned them yourself).

If you don't have persistent bots, then you just do whatever you want and leave stuff messy: the hypervisor will clean up for you.

It's like rewinding tapes; it's really nice if the person before you did the rewind, but practically speaking you should just assume that you need to rewind the tape before you watch it.
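A minimal sketch of that clean-up-first pattern (the scratch path and the ninja invocation are just placeholders):

```python
import os
import shutil
import subprocess

SCRATCH = '/b/scratch'  # hypothetical per-build working directory


def run_build():
  # Reset all assumptions first: remove whatever a previous (possibly
  # crashed) run left behind, then recreate a clean directory.
  if os.path.isdir(SCRATCH):
    shutil.rmtree(SCRATCH)
  os.makedirs(SCRATCH)
  # Only now run the workload that assumes a clean directory. If the
  # bot dies mid-run, the *next* invocation does the cleanup, instead
  # of a finally clause that may never execute.
  subprocess.check_call(['ninja', '-C', 'out/Release'], cwd=SCRATCH)
```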

Yeah, I get that for cleanups; I was specifically talking about best-effort uploading of debug data for stuck goma compiles. We wouldn't need streaming logs for best-effort uploads, and catching the timeout exception or using finally would be much simpler to implement.

ukai commented

I think we have two kinds of logging support: logdog and monitoring logging (cloudtail). Which is better for goma's logs? I heard logdog is for users' logs and monitoring is for infra's logs, so it might be better to use cloudtail for goma's logs? (Anyway, I don't think I have permission to access go/chrome-infra-logs...)

Let's create a separate bug (crbug.com?) for the monitoring talk. This issue is about step timeouts.

On swarming task hard timeout, the bot sends a SIGTERM first, gives the child process a grace period to exit cleanly, then sends SIGKILL. The default grace period is 30s. You can dump additional files to the output directory during that time frame.

@maruel are you suggesting that we provide the same interface for the recipe engine here? SIGTERM, then SIGKILL after a 30s delay? If this is a feature of subprocess42, then it should be very easy to add.

When I looked at the subprocess42 module, it looked like it had something like that, but I don't think we could use that exact API, because we look at the streaming IO while the process is running.

@martiniss subprocess42 was explicitly designed for stdout/stderr streaming, and this is what is used to implement stdout streaming in the swarming bot; see yield_any().

We are using that in the code; see https://cs.chromium.org/chromium/infra/recipes-py/recipe_engine/step_runner.py?q=step_runner.py&sq=package:chromium&l=343

The timeout currently applies to each individual poll for output, but we need a global timeout.

I'm fine with using the same logic you mention above, though.

The timeout parameter can be a callback. Make it lambda: max(0, timeout + start_time - time.time()).
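The idea, sketched out (the commented loop is pseudocode in the style of subprocess42's yield_any(), not its exact signature):

```python
import time

start_time = time.time()
timeout = 4.0  # overall budget in seconds

# Each poll is given only the time remaining until an absolute
# deadline, so the whole wait is bounded even if individual polls
# keep returning output.
remaining = lambda: max(0, timeout + start_time - time.time())

# for pipe, data in proc.yield_any(timeout=remaining):
#   if remaining() == 0:
#     break  # overall deadline hit; the caller can then kill proc
```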

That doesn't kill the process if there is still more output to be processed. I want to quit immediately, but yield_any waits for more output. Right?

See Patch 1 on https://codereview.chromium.org/2080723006; if I run python recipes.py --package=infra/config/recipes.cfg run engine_tests/timeout_streaming, it doesn't time out after 4 seconds, even though I want it to.

It does time out with the current code.