thejerf/suture

Checking status of supervisor/services


Just stumbled across this library. Cool! I skimmed the godoc and couldn't find any way to check the health of a supervisor (i.e. "good", "restarting", "crashed"...). Is that supported in Erlang? Is it supported in your library, or had you anticipated that it could be opt-in by hooking into Spec.Log? Just curious...

No, suture can't support it and Erlang doesn't either. Such information is fundamentally racy... by the time you get it, it may already be false, and there's no way around it. That's beyond just "Erlang" or "suture", it's just the way that sort of information is. The only thing you can do is to write your calls to handle failure gracefully, and yeah, that's easier said than done :)
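To make "handle failure gracefully" a little more concrete, here's a minimal sketch. None of this is suture API; `callWithRetry`, the backoff numbers, and the simulated failure are all just illustrative. The point is that instead of asking "is the service up?" (an answer that may be stale before you act on it), you make the call and react to the error:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callWithRetry is a hypothetical helper: rather than checking health
// first, it attempts the call and retries with exponential backoff,
// giving up when the context expires.
func callWithRetry(ctx context.Context, call func() error) error {
	backoff := 50 * time.Millisecond
	for {
		err := call()
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("giving up: %w", err)
		case <-time.After(backoff):
			backoff *= 2 // exponential backoff between attempts
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	attempts := 0
	err := callWithRetry(ctx, func() error {
		attempts++
		if attempts < 3 {
			return errors.New("service unavailable") // simulated failure
		}
		return nil
	})
	fmt.Println("attempts:", attempts, "err:", err)
}
```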

That said... the logging functions can be abused to put an event you can synchronize on around failure and starting. The unit tests use this. Note this occurs in the supervisor's main thread, so loggers that block end up blocking the whole supervisor. And even then, you only get weaker guarantees... if you're logging a failure event, you know that the service has gone down, but you don't know when it went down, and on the other side, when bringing a service back up, you know the supervisor is starting the service up, but you don't know when that will actually happen. (The service can notify you that it has started, but that notification may be arbitrarily delayed too.) You know, the usual multithreading stuff. While I'm not sure I recommend this, I can't stop you. :)
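For what it's worth, here's a minimal sketch of that abuse, assuming the 2014-era API where Spec carries a Log func(string) hook (the Spec.Log mentioned above). FlakyService and the channel plumbing are purely illustrative:

```go
package main

import (
	"fmt"
	"time"

	"github.com/thejerf/suture"
)

// FlakyService is a hypothetical service that panics shortly after
// starting, just to generate supervisor log events to observe.
type FlakyService struct{}

func (s *FlakyService) Serve() {
	time.Sleep(100 * time.Millisecond)
	panic("simulated failure")
}

func (s *FlakyService) Stop() {}

func main() {
	// events receives every line the supervisor logs. Keep the channel
	// buffered and the send non-blocking: the hook runs on the
	// supervisor's own goroutine, so blocking here blocks the whole
	// supervisor, as noted above.
	events := make(chan string, 16)

	sup := suture.New("main", suture.Spec{
		Log: func(line string) {
			select {
			case events <- line:
			default: // drop the event rather than block the supervisor
			}
		},
	})
	sup.Add(&FlakyService{})
	go sup.Serve()

	// By the time we read an event, the state it describes may already
	// be stale; treat it as a notification of a past event, never as a
	// health check.
	for i := 0; i < 3; i++ {
		fmt.Println("event:", <-events)
	}
}
```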

Jeremy,

Thanks for that lengthy answer! I hadn't considered that race condition - a good reminder to stay away from those status checks...

Cheers,

Jens