fpco/stackage-curator

RFE: timeouts for long-running test processes

juhp opened this issue · 6 comments

juhp commented

Sometimes testsuite processes never finish. It would be nice if they could be caught after a sufficiently long timeout and killed, to stop them from eating lots of CPU and preventing curation builds from finishing.

This has happened at least 3 times now recently to curators (today for lts-5 binary-search).

Any thoughts on how the timeout should be handled? A hard-coded value, something configurable in the YAML, something per-package, etc?

juhp commented

Good question: I hadn't really thought about it so much...

Maybe starting with a hard-coded value would be good enough for now (like 30min?).
Probably 30min is too long but 10min too short. Would 20min be enough perhaps, or is that cutting it too fine? I don't know the spectrum of time required for testsuites in Stackage.
I think 1 hour should certainly be safe.

It might be nice to make it configurable later in yaml after more usage.

@bergmark @DanBurton: any thoughts?

I think even 10 mins is very generous. I'd say default 2 mins, and a configuration option to allow longer on a per-package basis. (That's 2 mins for the test suite to run; building it is not counted against this time limit.) Let's time things and gather data on how long packages and their test suites take to build and run.
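To make the proposal concrete, here is a minimal sketch of a default limit with per-package overrides, applied only to running the test suite and not to building it. It is written in Python for brevity rather than in the curator's own Haskell. The names `run_test_suite` and `PACKAGE_TIMEOUTS` and the `slow-package` entry are hypothetical, not anything stackage-curator provides.

```python
import subprocess
import sys

# The 2-minute default suggested in this comment, in seconds.
DEFAULT_TIMEOUT_SECS = 120

# Hypothetical per-package overrides (seconds); the package name is made up.
PACKAGE_TIMEOUTS = {"slow-package": 600}


def run_test_suite(package, command, timeouts=PACKAGE_TIMEOUTS,
                   default=DEFAULT_TIMEOUT_SECS):
    """Run an already-built test suite, killing it if it exceeds its limit.

    Returns "passed", "failed", or "timed-out".
    """
    limit = timeouts.get(package, default)
    try:
        # subprocess.run kills the child process when the timeout expires.
        completed = subprocess.run(command, timeout=limit)
    except subprocess.TimeoutExpired:
        print(f"{package}: test suite exceeded {limit}s and was killed",
              file=sys.stderr)
        return "timed-out"
    return "passed" if completed.returncode == 0 else "failed"
```

One caveat with this approach: only the direct child process is killed, so a test suite that spawns its own subprocesses might need process-group handling on top of this.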

Perhaps we can have "retry" logic for timed-out runs as well; are we sure that the test suites that hang do so consistently?
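A rough sketch of what such retry logic could look like, again in Python and with hypothetical names (`run_once`, `run_with_retries`). It keeps the outcome of every attempt, so the result also shows whether a suite hangs consistently or only some of the time.

```python
import subprocess
import sys


def run_once(command, limit_secs):
    """Run a command once; return "passed", "failed", or "timed-out"."""
    try:
        completed = subprocess.run(command, timeout=limit_secs)
    except subprocess.TimeoutExpired:
        return "timed-out"
    return "passed" if completed.returncode == 0 else "failed"


def run_with_retries(command, limit_secs, max_attempts=3):
    """Retry only on timeout; return the list of outcomes, one per attempt."""
    outcomes = []
    for attempt in range(1, max_attempts + 1):
        outcome = run_once(command, limit_secs)
        outcomes.append(outcome)
        if outcome != "timed-out":
            break  # a real pass or failure is final, no retry needed
        print(f"attempt {attempt} timed out after {limit_secs}s",
              file=sys.stderr)
    return outcomes
```

A result of all `"timed-out"` entries suggests a consistent hang, while a timeout followed by `"passed"` points at a flaky one.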

juhp commented

Okay, but this only affects very few builds (testsuites), so I would make the default a bit lenient, at least to avoid the risk of prematurely killing longer testsuites. But okay, I think you're right that 10min should be quite plenty. 5min might well be fine too. Anyway, I would start with a generous value and then reduce it as needed, rather than the other way round (it would be nice to gather build-time stats one day).
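If such build-time stats were available, comparing candidate limits would be straightforward. The sketch below assumes a plain mapping from package name to test-suite run time in seconds. The function names are hypothetical, and no real Stackage timings are implied.

```python
# Candidate limits mentioned in this thread, in minutes.
CANDIDATE_LIMITS_MIN = [2, 5, 10, 20, 30, 60]


def suites_exceeding(durations_secs, limit_min):
    """Return the sorted names of suites that ran longer than the limit."""
    limit_secs = limit_min * 60
    return sorted(name for name, secs in durations_secs.items()
                  if secs > limit_secs)


def compare_limits(durations_secs, limits_min=CANDIDATE_LIMITS_MIN):
    """Map each candidate limit to the suites it would have killed."""
    return {limit: suites_exceeding(durations_secs, limit)
            for limit in limits_min}
```

Starting generous and tightening later then amounts to picking the smallest limit for which the list of killed suites contains only genuine hangs.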

I had 10 min in mind too. It should be enough for everyone; if it isn't, they might want to speed up their test-suites a bit...