maximvelichko/pyvera

Polling loop/thread fragile, gets knocked out by common conditions too easily.

toggledbits opened this issue · 4 comments

I use pyvera as part of my HomeAssistant installation. I've noticed that a restart of the Vera, either full reboot or restart of Luup (Vera's HA service), causes the polling loop to take an unhandled exception and quietly die. Luup restarts on Vera are very common--almost any device configuration change will cause one, and they occur spontaneously several times a day. That crashes the polling loop and HASS then goes "blind" to device changes on the Vera until HASS itself is restarted. Others have also reported this, for some time, apparently, so I went investigating. I've managed to fix this pretty easily, although perhaps not completely (more below), but I think the result is much more robust than the current implementation.

Apologies in advance for the text wall, but I want to be as clear and thorough as possible.

The observed conditions on my configuration (Ubuntu 16.04.3, Python 3.5.2, Home Assistant 0.63.3, Vera Plus firmware 3232) are:

  1. As it's going down during Luup reload, Vera sometimes returns a nonsense/nonspec "{ status: 0 }" response (valid JSON) that does not contain any data that get_changed_devices() is looking for, resulting in an empty response that corrupts the timestamp values;
  2. Later in the reload, the Vera will return HTTP 500s;
  3. In the final stage of the reload, Vera will return HTTP 200s with valid JSON containing no device data but valid, unchanging loadtime and dataversion for several seconds, until it's settled, then normal operation resumes;
  4. The JSON library bundled with HASSIO throws an exception that is not caught by the device polling loop (simplejson.errors.JSONDecodeError);
  5. The documentation for the JSON handling in the Response object (requests 2.18.4) says that a ValueError may be thrown by the method, but the code does not have a specific handler for this advertised exception (see more below);
  6. There seems to be some general odd exception-handling problem (perhaps related to threading?) in which none of the exception handlers in the polling loop (even the default one) traps any exception thrown by get_changed_devices()--this is very odd and very troubling/counter-intuitive, and I have not yet figured out why this is, but my other fixes are an effective work-around for much of it.

Most of the loop's fragility comes from get_changed_devices()'s reliance on Response.json() throwing one particular recognized exception. There is enough variance in the underlying implementations that this is a risky assumption: there is no guarantee the exception you're looking for is the one you'll get. For unexpected exceptions (e.g. ValueError), the loop's default exception handler passes the exception up, terminating the loop/thread. Also, if the data is parseable but does not contain the expected values, there are no checks, so the remainder of the method corrupts the pass-through timestamp values.

Tightening up the checks on the Response object before attempting to parse the data, and then also checking the parsed data afterward, yields tremendous benefits in stability, even without an answer for the exception-handling mystery. Here's what I've done:
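To illustrate the variance: the stdlib decoder and simplejson raise different JSONDecodeError types, but both subclass ValueError, so a handler for ValueError covers either backend. A small stdlib-only demonstration:

```python
import json

# Python's stdlib decoder raises json.JSONDecodeError; when simplejson is
# installed, requests delegates to it and the failure surfaces as
# simplejson.errors.JSONDecodeError instead. Both subclass ValueError,
# so catching ValueError is the portable choice.
assert issubclass(json.JSONDecodeError, ValueError)

try:
    json.loads('{"status": 0')  # truncated body, as a Vera mid-reload might send
except ValueError as exc:
    print(type(exc).__name__)  # prints "JSONDecodeError"
```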

I changed my version of get_changed_devices() like this: it does not directly call Response.json() on the return value of data_request(), but first checks Response.status_code for a 200; anything other returns [ None, timestamp ] (returning the timestamp passed in, untouched, but no device data, which is effectively a no-op for the polling loop). If the HTTP status is OK, it parses the JSON at that point, and then goes on to check for the existence of loadtime and dataversion in the parsed data. If they do not exist, the [ None, timestamp ] list is returned; otherwise, the timestamp is updated and returned with the device data to the polling loop/caller.
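A minimal sketch of that hardened method, assuming the controller exposes a data_request() helper that returns a requests-style Response and that the pass-through timestamp is a dict of loadtime/dataversion (the names and payload fields here are illustrations based on the description above, not pyvera's actual internals):

```python
def get_changed_devices(controller, timestamp):
    """Poll the Vera for device changes, tolerating restart noise.

    Returns [devices, timestamp]; on any abnormal response the caller's
    timestamp is returned untouched with no device data (a no-op poll).
    """
    no_op = [None, timestamp]

    response = controller.data_request({
        "id": "lu_sdata",
        "loadtime": timestamp["loadtime"],
        "dataversion": timestamp["dataversion"],
    })

    # Anything other than HTTP 200 (e.g. the 500s Vera returns mid-reload)
    # is treated as "no changes".
    if response.status_code != 200:
        return no_op

    try:
        result = response.json()
    except ValueError:
        # Covers both json.JSONDecodeError and simplejson's variant,
        # since both subclass ValueError.
        return no_op

    # Vera sometimes returns parseable JSON with no device data;
    # don't let that corrupt the pass-through timestamps.
    if "loadtime" not in result or "dataversion" not in result:
        return no_op

    new_timestamp = {
        "loadtime": result["loadtime"],
        "dataversion": result["dataversion"],
    }
    return [result.get("devices"), new_timestamp]
```

With this shape, every failure mode listed above (500s, nonsense bodies, valid-but-empty JSON) degrades to a harmless no-op poll instead of killing the thread.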

This has made the polling loop very robust in my HASS/Vera installation--tons of reboots and restarts with reliable recovery. If you're interested in having these changes, let me know how you want to receive them.

Sounds good. Would welcome a PR.

If you fork the repo into your GitHub account, create a new local branch, and make the changes (and test them), please send me a PR. GitHub makes this very easy. Shout if you need more help.

What you describe sounds like a great change (and I probably should have done it that way from the beginning!)

I almost never reboot my Vera - so this hasn't been an issue for me - but it's a great improvement.

OK. Been using GitHub a while, but have never done a PR. Give me another day to really shake this change down. I figured out the exception puzzle, so that makes for a cleaner, more complete solution that I now have running. Stay tuned! Thanks for being gracious about the suggestion/changes.

OK. Pull request is in. Hope I did it right.

Now released as 0.2.42. Thanks very much for the contribution.