UniversalDevicesInc/Polyglot

How should timeout settings be scoped and managed?

Opened this issue · 6 comments

Currently, as a node server goes about its business, messages it sends to the ISY are sent using the default timeout value -- which is somewhere in the 5-minute range.

If all is going well, or if there are only occasional network "hiccups", this is of no matter -- once in a while, node server to ISY messages may queue up for 5 minutes until a failed call to the ISY finally times out, at which point the queue of messages is (usually) processed quickly, and things "catch up".

However, there may be cases where a more aggressive timeout makes sense. If many messages get lost (perhaps due to a poor wifi connection?), then it would be desirable to time out quickly and move on to the next message, so that we don't end up with huge queues of unsent messages stacking up. This suggests that a global timeout setting for Polyglot would be desirable. Alternatively, it might imply that the default timeout for Polyglot should be proactively changed to a more aggressive value -- perhaps 60 or 90 seconds would make more sense?

It is also possible for an individual node server to desire more aggressive timeouts. For example, if a monitored value is changing rapidly (perhaps it changes every 3 or 5 seconds), then it might be desirable to set a very short timeout (perhaps as small as a few seconds), on the presumption that it's far better to drop a few status updates than to end up many minutes behind on updates of this rapidly-changing value.

So, this issue is requesting input from node server authors on how they would like to control timeouts:

a) leave it as is, the correct solution is to have the user fix their network!

b) change the timeout to "fill-in-the-blank" seconds globally, and then see (a) above.

c) create a node-server-level timeout setting, scoped to all ISY calls for a given node server, that allows the node server author to set the timeout.

d) implement (c) above, and provide another per-driver (and per-command) list of timeouts so that the node server author can control timeouts down to the granularity of each network operation.

e) implement (d) above, and provide a means for a timeout to be reported to the node server, so that the node server author can implement their own network error detection and retry mechanism down to the granularity of each network operation.

Thoughts? Comments?

I vote for giving authors as much control as possible. That being said, set defaults to normal limits. So probably e, but don't force them to care.

Oh and as I am a network engineer for a living... I vote a, a, a and a.

I concur with James.

:) Ok, so full control it is... I'll add the ability to set the variable "_timeout" at the Node Server level and at the Node level, plus the ability to tweak that on a per-driver and per-command basis for each Node.
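To illustrate the intent, here's a rough sketch of how that scoping could resolve at call time. Only "_timeout" itself is the real proposed variable; the helper and the per-driver/per-command attribute names are placeholders, and the actual implementation may look quite different:

```python
# Hypothetical sketch of "_timeout" resolution (most specific wins). Only the
# "_timeout" attribute comes from this discussion; the per-command/per-driver
# dict names and the helper itself are placeholders for illustration.

def resolve_timeout(node_server, node=None, driver=None, command=None, default=25):
    """Return the most specific timeout configured, falling back level by level."""
    if node is not None:
        if command is not None:
            t = getattr(node, '_command_timeouts', {}).get(command)
            if t is not None:
                return t
        if driver is not None:
            t = getattr(node, '_driver_timeouts', {}).get(driver)
            if t is not None:
                return t
        t = getattr(node, '_timeout', None)
        if t is not None:
            return t
    t = getattr(node_server, '_timeout', None)
    return t if t is not None else default
```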

So now part II -- the default timeout is rather long: it's effectively whatever the requests Python library allows when no timeout is specified (requests itself imposes no default, so the wait is bounded only by the underlying socket behavior). I suggest that we establish a default timeout for Polyglot (which will be used if none has been specified by a driver, node, or node server). I'd also suggest that we lower that timeout (right now it's around 5 minutes) to something far less -- I'm suggesting 25 seconds.
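To make the fallback concrete: requests only applies a timeout when one is passed on each call, so a Polyglot-wide default would simply be the value supplied whenever nothing more specific has been set. A minimal sketch (the `isy_get` helper is hypothetical, not Polyglot's actual code):

```python
import requests

DEFAULT_TIMEOUT = 25  # the proposed Polyglot-wide default, in seconds

def isy_get(url, timeout=None):
    # Use the driver/node/node-server timeout if one was specified,
    # otherwise fall back to the global Polyglot default.
    try:
        return requests.get(url, timeout=timeout or DEFAULT_TIMEOUT)
    except requests.exceptions.Timeout:
        # The caller decides whether to drop, retry, or log the failure.
        return None
```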

Here's the logic. Consider the following thought experiment.

a) The ISY goes offline, perhaps for a reboot.
b) A node server has a status change, and Polyglot starts a REST API call to report that.
c) The ISY is offline or busy, and the 5 minute timer is started...
d) Meanwhile, other nodes begin reporting their updates and changes.
e) Eventually, the node-server-to-polyglot-thread queue fills up and begins overflowing.
f) The ISY finally comes back.
g) Polyglot's REST call from (b) completes.
h) The ISY processes the status update, and begins running a program based on it.
i) Polyglot begins processing the backlog of REST calls in the queue from (d).
j) The queue empties, and requests begin to be processed in near-real-time again.

Note a significant problem with the above: the sequence does not follow the principle of "Least Surprise". Specifically, when the ISY restarts, the expected behavior (based on Insteon and ZWave devices) is that messages sent while it was offline will be lost, and that any message actually received by the ISY can be expected to be reasonably current. In the sequence above, what actually happens is that the ISY receives (at step g) a very stale set of messages (the first 10 or so, however deep the queue is) from when it originally went offline, before it receives the current set of messages. So it will run programs or process events from a long time ago (in automation terms) -- this would play havoc with user expectations. Imagine a user banging on some virtual device to control their Sonos system, with no response -- and then, 5 minutes later, the ISY boots back up and starts processing all those button pushes from 5 minutes ago...

The solution to the above is to change the queue logic (we should drop the OLDEST message off the queue, not drop the newest when the queue is full), but we ALSO need to ensure that we don't wait too long for the ISY, lest we feed it very stale data.
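As a rough illustration of the drop-oldest policy only (not Polyglot's actual queue -- the depth and names are placeholders), a bounded `collections.deque` already behaves this way:

```python
from collections import deque

isy_queue = deque(maxlen=100)  # depth chosen arbitrarily for illustration

def enqueue(message):
    # When the deque is full, append() silently evicts the oldest (leftmost)
    # entry, so fresh messages displace stale ones rather than being dropped.
    isy_queue.append(message)

def dequeue():
    # FIFO: consume from the left so messages still go out in arrival order.
    return isy_queue.popleft() if isy_queue else None
```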

For that latter item, I propose a global timeout of 25 seconds (to ensure the timeout is slightly less than the default long_poll time). Based on the experiments done regarding the 404 errors during the 3AM query, we know that if the network is operational, the ISY's normal response is sub-second -- so it seems reasonable to time out after a matter of seconds rather than the current minutes. This will result in a few more 404 errors during the 3AM query, but we already know that UDI will be fixing that, so the 404 problem is really the lesser of the two evils.

Ok, thoughts and comments on this? Anyone see any problems that I've not considered? (This will be a potentially-disruptive change if done incorrectly, hence I'm looking for buy-in or objections...)

I agree with both the 25-second default change and the drop-oldest queue method. Unless someone specifically coded for a 5-minute timeout in their nodeserver, I think this should be fine from a backwards-compatibility standpoint as well.

I'll buy in, but what exactly did you mean by potentially disruptive? Just the 5 minute to 25 second change in timeout right?

Regarding "potentially disruptive" -- I've vastly oversimplified the queue management part. We need to ensure that the queue is deep enough that we don't lose messages when they arrive in a massive burst from a node server doing huge updates, but that makes the problem of delivering very stale messages worse, as described originally. So the real solution is that instead of a small queue and throwing away the items at the end of that queue, we need to add ISY messages to the queue with a timestamp and honor that timestamp when pulling items off the queue or when the queue is full.

In other words, if I have a status update that is worthless after 10 seconds, my node server will specify a 10-second timeout. When we pull that message off the queue, or look for old messages when the queue fills up, we need to subtract the time that the message sat in the queue from the timeout in the message -- and throw the message away only if the timeout has expired while on the queue (or if the queue is full, throw away the messages that have the least amount of lifetime remaining). And then when we pass the message to the actual requests library to make the REST API call to the ISY, we need to adjust the timeout down to the remaining time by subtracting the time it spent in the queue before we send it out on the network.
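Here's a rough sketch of that bookkeeping, with hypothetical names and structures (the real Polyglot queue and ISY call path will differ):

```python
import time
import requests
from collections import deque

def enqueue_isy_call(queue, maxlen, url, timeout):
    # When full, evict the entry with the least lifetime remaining, then add
    # the new message along with the moment it was queued.
    now = time.monotonic()
    if len(queue) >= maxlen:
        victim = min(queue, key=lambda m: m['timeout'] - (now - m['queued_at']))
        queue.remove(victim)
    queue.append({'url': url, 'timeout': timeout, 'queued_at': now})

def send_next(queue):
    # Discard anything whose lifetime expired while it waited; the survivor
    # gets only its remaining lifetime as the network timeout.
    while queue:
        msg = queue.popleft()
        remaining = msg['timeout'] - (time.monotonic() - msg['queued_at'])
        if remaining <= 0:
            continue  # expired on the queue: throw it away, don't send stale data
        try:
            return requests.get(msg['url'], timeout=remaining)
        except requests.exceptions.Timeout:
            return None  # timed out on the network; caller decides what to do
    return None

# Usage sketch: q = deque(); enqueue_isy_call(q, 100, some_isy_url, 10); send_next(q)
```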

I suspect I'll make some mistakes in that logic, and it's going to be a real challenge to test properly, hence potentially disruptive.

(BTW, it turns out that the default timeout was implemented some time ago -- set at 30 seconds! So that part's already done and tested. I will drop it to 25 seconds; I want to make sure that if a node server is using all the defaults, a message will never live past a single long_poll() period. I just have a bad feeling about multiple timers having the same period; it seems the sort of thing that turns simple bugs into difficult, intermittent, race-condition sorts of bugs.)

Mike, thanks so very much for the excellent feedback. This reminds me of how ISY subscriptions work: every time a client subscribes to ISY, ISY sends the status of EVERYTHING back to the client (not just those in the queue). This way, at subscription time, the status of the client is identical to what ISY thinks the status of everything should be.

Can something like this be implemented on Polyglot side?

With kind regards,
Michel
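
(For illustration of Michel's suggestion only -- this is not an existing Polyglot feature: a "resync on reconnect" step could clear whatever is queued and push the current value of every driver instead, much like the ISY subscription behavior described above. The node/driver structures and the reporting call below are hypothetical.)

```python
def resync_on_reconnect(nodes, report_driver, pending_queue):
    # Stale queued updates are redundant once the full current state is resent.
    pending_queue.clear()
    for node in nodes:
        for driver, value in node.drivers.items():
            report_driver(node, driver, value)  # push the latest known value
```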