UniversalDevicesInc/Polyglot

Node server died during 3am query with no info

Closed this issue · 17 comments

I see polyglot having the communication issues during the 3am query which is expected, but my ihcam node server died as well with no information in the log:

ERROR    [03-14-2016 03:02:27] polyglot.element_manager.isy: Timed out waiting for response from the ISY device.
ERROR    [03-14-2016 03:02:31] polyglot.nodeserver_manager: Node Server ihcam has stopped responding.

I'll run it in very verbose mode and see if there is more info, but if you have any ideas please let me know.

this is the debugging info, still not sure why polyglot is not giving a traceback of the node server death, will have to look further. The "ERROR"s from ihcam are not really errors, I'm just calling send_error to show the requests in the log since I don't know another way.

DEBUG    [03-15-2016 03:01:35] polyglot.nodeserver_manager: ihcam STDIN: {"ping": {}}
DEBUG    [03-15-2016 03:01:35] polyglot.nodeserver_manager: ihcam STDOUT: {"pong": {}}
INFO     [03-15-2016 03:01:37] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.81ms
INFO     [03-15-2016 03:01:37] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.73ms
INFO     [03-15-2016 03:01:39] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.91ms
INFO     [03-15-2016 03:01:39] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.32ms
INFO     [03-15-2016 03:01:41] tornado.access: 304 GET /api/servers/active (192.168.1.61) 4.15ms
INFO     [03-15-2016 03:01:41] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.45ms
INFO     [03-15-2016 03:01:42] tornado.access: 200 GET /ns/3wb9F/nodes/n003_00626e495be9/query?requestId=191 (192.168.1.64) 3.92ms
DEBUG    [03-15-2016 03:01:42] polyglot.nodeserver_manager: ihcam STDIN: {"query": {"node_address": "00626e495be9", "request_id": "191"}}
ERROR    [03-15-2016 03:01:42] polyglot.nodeserver_manager: ihcam: Sending: http://192.168.1.111:8080/get_params.cgi {}
DEBUG    [03-15-2016 03:01:42] polyglot.nodeserver_manager: ihcam STDOUT: {"status": {"node_address": "00626e495be9", "driver_control": "GV2", "uom": 56, "value": 1}}
DEBUG    [03-15-2016 03:01:42] polyglot.element_manager.isy: ISY Request: http://192.168.1.64:80/rest/ns/3/nodes/n003_00626e495be9/report/status/GV2/1/56
INFO     [03-15-2016 03:01:42] urllib3.connectionpool: Starting new HTTP connection (1): 192.168.1.64
INFO     [03-15-2016 03:01:43] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.78ms
INFO     [03-15-2016 03:01:43] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.25ms
INFO     [03-15-2016 03:01:45] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.71ms
INFO     [03-15-2016 03:01:45] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.23ms
INFO     [03-15-2016 03:01:47] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.70ms
INFO     [03-15-2016 03:01:47] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.50ms
INFO     [03-15-2016 03:01:49] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.93ms
INFO     [03-15-2016 03:01:49] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.65ms
INFO     [03-15-2016 03:01:51] tornado.access: 304 GET /api/servers/active (192.168.1.61) 4.14ms
INFO     [03-15-2016 03:01:51] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.37ms
INFO     [03-15-2016 03:01:52] tornado.access: 200 GET /ns/3wb9F/nodes/n003_00626e495be9m/query?requestId=192 (192.168.1.64) 4.20ms
DEBUG    [03-15-2016 03:01:52] polyglot.nodeserver_manager: ihcam STDIN: {"query": {"node_address": "00626e495be9m", "request_id": "192"}}
ERROR    [03-15-2016 03:01:52] polyglot.nodeserver_manager: ihcam: Sending: http://192.168.1.111:8080/get_status.cgi {}
ERROR    [03-15-2016 03:01:52] polyglot.element_manager.isy: Timed out waiting for response from the ISY device.
DEBUG    [03-15-2016 03:01:52] polyglot.nodeserver_manager: ihcam STDOUT: {"status": {"node_address": "00626e495be9", "driver_control": "GV3", "uom": 56, "value": 7}}
DEBUG    [03-15-2016 03:01:52] polyglot.element_manager.isy: ISY Request: http://192.168.1.64:80/rest/ns/3/nodes/n003_00626e495be9/report/status/GV3/7/56
INFO     [03-15-2016 03:01:52] urllib3.connectionpool: Starting new HTTP connection (1): 192.168.1.64
INFO     [03-15-2016 03:01:53] tornado.access: 304 GET /api/servers/active (192.168.1.61) 4.02ms
INFO     [03-15-2016 03:01:53] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.18ms
INFO     [03-15-2016 03:01:55] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.72ms
INFO     [03-15-2016 03:01:55] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.21ms
INFO     [03-15-2016 03:01:57] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.69ms
INFO     [03-15-2016 03:01:57] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.20ms
INFO     [03-15-2016 03:01:59] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.70ms
INFO     [03-15-2016 03:01:59] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.18ms
INFO     [03-15-2016 03:02:01] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.74ms
INFO     [03-15-2016 03:02:01] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.36ms
INFO     [03-15-2016 03:02:02] tornado.access: 200 GET /ns/3wb9F/nodes/n003_00626e495f4c/query?requestId=193 (192.168.1.64) 4.19ms
DEBUG    [03-15-2016 03:02:02] polyglot.nodeserver_manager: ihcam STDIN: {"query": {"node_address": "00626e495f4c", "request_id": "193"}}
ERROR    [03-15-2016 03:02:02] polyglot.nodeserver_manager: ihcam: Sending: http://192.168.1.112:8080/get_params.cgi {}
ERROR    [03-15-2016 03:02:02] polyglot.element_manager.isy: Timed out waiting for response from the ISY device.
DEBUG    [03-15-2016 03:02:02] polyglot.nodeserver_manager: ihcam STDOUT: {"status": {"node_address": "00626e495be9", "driver_control": "GV1", "uom": 56, "value": 1}}
DEBUG    [03-15-2016 03:02:02] polyglot.element_manager.isy: ISY Request: http://192.168.1.64:80/rest/ns/3/nodes/n003_00626e495be9/report/status/GV1/1/56
INFO     [03-15-2016 03:02:02] urllib3.connectionpool: Starting new HTTP connection (1): 192.168.1.64
INFO     [03-15-2016 03:02:03] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.88ms
INFO     [03-15-2016 03:02:03] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.61ms
INFO     [03-15-2016 03:02:05] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.84ms
INFO     [03-15-2016 03:02:05] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.51ms
INFO     [03-15-2016 03:02:07] tornado.access: 304 GET /api/servers/active (192.168.1.61) 4.20ms
INFO     [03-15-2016 03:02:07] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.43ms
DEBUG    [03-15-2016 03:02:07] polyglot.nodeserver_manager: ihcam STDIN: {"ping": {}}
INFO     [03-15-2016 03:02:09] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.90ms
INFO     [03-15-2016 03:02:09] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.30ms
INFO     [03-15-2016 03:02:11] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.71ms
INFO     [03-15-2016 03:02:11] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.13ms
INFO     [03-15-2016 03:02:12] tornado.access: 200 GET /ns/3wb9F/nodes/n003_00626e495f4cm/query?requestId=194 (192.168.1.64) 4.12ms
DEBUG    [03-15-2016 03:02:12] polyglot.nodeserver_manager: ihcam STDIN: {"query": {"node_address": "00626e495f4cm", "request_id": "194"}}
ERROR    [03-15-2016 03:02:12] polyglot.nodeserver_manager: ihcam: Sending: http://192.168.1.112:8080/get_status.cgi {}
ERROR    [03-15-2016 03:02:12] polyglot.element_manager.isy: Timed out waiting for response from the ISY device.
DEBUG    [03-15-2016 03:02:12] polyglot.nodeserver_manager: ihcam STDOUT: {"status": {"node_address": "00626e495be9", "driver_control": "GV4", "uom": 56, "value": 1}}
DEBUG    [03-15-2016 03:02:12] polyglot.element_manager.isy: ISY Request: http://192.168.1.64:80/rest/ns/3/nodes/n003_00626e495be9/report/status/GV4/1/56
INFO     [03-15-2016 03:02:12] urllib3.connectionpool: Starting new HTTP connection (1): 192.168.1.64
INFO     [03-15-2016 03:02:13] tornado.access: 304 GET /api/servers/active (192.168.1.61) 4.13ms
INFO     [03-15-2016 03:02:13] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.46ms
INFO     [03-15-2016 03:02:15] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.96ms
INFO     [03-15-2016 03:02:15] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.82ms
INFO     [03-15-2016 03:02:17] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.65ms
INFO     [03-15-2016 03:02:17] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.03ms
INFO     [03-15-2016 03:02:19] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.64ms
INFO     [03-15-2016 03:02:19] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.31ms
INFO     [03-15-2016 03:02:21] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.77ms
INFO     [03-15-2016 03:02:21] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.32ms
INFO     [03-15-2016 03:02:22] tornado.access: 200 GET /ns/3wb9F/nodes/n003_000c5ddc9d6c/query?requestId=195 (192.168.1.64) 4.32ms
DEBUG    [03-15-2016 03:02:22] polyglot.nodeserver_manager: ihcam STDIN: {"query": {"node_address": "000c5ddc9d6c", "request_id": "195"}}
ERROR    [03-15-2016 03:02:22] polyglot.nodeserver_manager: ihcam: Sending: http://192.168.1.110:8080/get_params.cgi {}
ERROR    [03-15-2016 03:02:23] polyglot.element_manager.isy: ISY Could not recieve response from device because of a network issue.
DEBUG    [03-15-2016 03:02:23] polyglot.nodeserver_manager: ihcam STDOUT: {"request": {"success": true, "request_id": "191"}}
DEBUG    [03-15-2016 03:02:23] polyglot.element_manager.isy: ISY Request: http://192.168.1.64:80/rest/ns/3/report/request/status/191/success
INFO     [03-15-2016 03:02:23] urllib3.connectionpool: Starting new HTTP connection (1): 192.168.1.64
INFO     [03-15-2016 03:02:23] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.67ms
INFO     [03-15-2016 03:02:23] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.17ms
INFO     [03-15-2016 03:02:25] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.95ms
INFO     [03-15-2016 03:02:25] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.43ms
INFO     [03-15-2016 03:02:27] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.93ms
INFO     [03-15-2016 03:02:27] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.14ms
INFO     [03-15-2016 03:02:29] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.64ms
INFO     [03-15-2016 03:02:29] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.07ms
INFO     [03-15-2016 03:02:31] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.90ms
INFO     [03-15-2016 03:02:31] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.20ms
INFO     [03-15-2016 03:02:32] tornado.access: 200 GET /ns/3wb9F/nodes/n003_000c5ddc9d6cm/query?requestId=196 (192.168.1.64) 3.73ms
DEBUG    [03-15-2016 03:02:32] polyglot.nodeserver_manager: ihcam STDIN: {"query": {"node_address": "000c5ddc9d6cm", "request_id": "196"}}
ERROR    [03-15-2016 03:02:32] polyglot.nodeserver_manager: ihcam: Sending: http://192.168.1.110:8080/get_status.cgi {}
ERROR    [03-15-2016 03:02:33] polyglot.element_manager.isy: ISY Could not recieve response from device because of a network issue.
DEBUG    [03-15-2016 03:02:33] polyglot.nodeserver_manager: ihcam STDOUT: {"request": {"success": true, "request_id": "192"}}
DEBUG    [03-15-2016 03:02:33] polyglot.element_manager.isy: ISY Request: http://192.168.1.64:80/rest/ns/3/report/request/status/192/success
INFO     [03-15-2016 03:02:33] urllib3.connectionpool: Starting new HTTP connection (1): 192.168.1.64
INFO     [03-15-2016 03:02:33] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.99ms
INFO     [03-15-2016 03:02:33] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.19ms
INFO     [03-15-2016 03:02:35] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.81ms
INFO     [03-15-2016 03:02:35] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.17ms
INFO     [03-15-2016 03:02:37] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.90ms
INFO     [03-15-2016 03:02:37] tornado.access: 200 GET /api/server/3wb9F (192.168.1.61) 4.96ms
ERROR    [03-15-2016 03:02:37] polyglot.nodeserver_manager: Node Server ihcam has stopped responding.
INFO     [03-15-2016 03:02:39] tornado.access: 200 GET /api/servers/active (192.168.1.61) 3.61ms
INFO     [03-15-2016 03:02:39] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.01ms
INFO     [03-15-2016 03:02:41] tornado.access: 304 GET /api/servers/active (192.168.1.61) 3.59ms
INFO     [03-15-2016 03:02:41] tornado.access: 304 GET /api/server/3wb9F (192.168.1.61) 5.03ms

Still have not figured out this issue, saving this log to look at it later.

INFO     [03-25-2016 03:02:09] tornado.access: 200 GET /ns/iR32U/nodes/n003_cams/query?requestId=313 (192.168.1.64) 4.14ms
DEBUG    [03-25-2016 03:02:09] polyglot.nodeserver_manager: ihcam STDIN: {"query": {"node_address": "cams", "request_id": "313"}}
DEBUG    [03-25-2016 03:02:12] polyglot.nodeserver_manager: ihcam STDOUT: {"config": {"manifest": {"cams": {"node_def_id": "CameraServer", "drivers": {"GV2": 3, "GV3": 2, "GV1": 0.1}, "added": false, "name": "Camera Server"}, "00626e495f4c": {"node_def_id": "FoscamMJPEG", "drivers": {"GV2": 1, "GV3": 1, "GV1": 1, "GV6": 1, "GV4": 0, "GV5": 5}, "added": false, "name": "DiningRoom"}, "00626e495f4cm": {"node_def_id": "CamMotion", "drivers": {"ST": 0}, "added": false, "name": "DiningRoom-Motion"}, "00626e495be9": {"node_def_id": "FoscamMJPEG", "drivers": {"GV2": 1, "GV3": 1, "GV1": 1, "GV6": 1, "GV4": 0, "GV5": 5}, "added": false, "name": "CameraGarage"}, "00626e495be9m": {"node_def_id": "CamMotion", "drivers": {"ST": 0}, "added": false, "name": "CameraGarage-Motion"}, "000c5ddc9d6c": {"node_def_id": "FoscamMJPEG", "drivers": {"GV2": 1, "GV3": 1, "GV1": 1, "GV6": 1, "GV4": 0, "GV5": 9}, "added": false, "name": "CameraFrontDoor"}, "000c5ddc9d6cm": {"node_def_id": "CamMotion", "drivers": {"ST": 0}, "added": false, "name": "CameraFrontDoor-Motion"}}}}
INFO     [03-25-2016 03:02:12] polyglot.core: Saving Configuration
DEBUG    [03-25-2016 03:02:12] polyglot.config_manager: Config files match no need to write to config file.
DEBUG    [03-25-2016 03:02:15] polyglot.nodeserver_manager: ihcam STDOUT: {"add": {"node_address": "00626e495f4c", "node_def_id": "FoscamMJPEG", "primary": "00626e495f4c", "name": "DiningRoom"}}
DEBUG    [03-25-2016 03:02:15] polyglot.element_manager.isy: ISY Request: http://192.168.1.64:80/rest/ns/3/nodes/n003_00626e495f4c/add/FoscamMJPEG?primary=n003_00626e495f4c&name=DiningRoom
INFO     [03-25-2016 03:02:15] urllib3.connectionpool: Starting new HTTP connection (1): 192.168.1.64
INFO     [03-25-2016 03:02:19] tornado.access: 200 GET /ns/iR32U/nodes/n003_00626e495f4c/query?requestId=314 (192.168.1.64) 4.46ms
DEBUG    [03-25-2016 03:02:19] polyglot.nodeserver_manager: ihcam STDIN: {"query": {"node_address": "00626e495f4c", "request_id": "314"}}
DEBUG    [03-25-2016 03:02:20] polyglot.nodeserver_manager: ihcam STDIN: {"ping": {}}
ERROR    [03-25-2016 03:02:25] polyglot.element_manager.isy: Timed out waiting for response from the ISY device.
DEBUG    [03-25-2016 03:02:25] polyglot.nodeserver_manager: ihcam STDOUT: {"status": {"node_address": "00626e495f4c", "driver_control": "GV2", "uom": 56, "value": 0}}
DEBUG    [03-25-2016 03:02:25] polyglot.element_manager.isy: ISY Request: http://192.168.1.64:80/rest/ns/3/nodes/n003_00626e495f4c/report/status/GV2/0/56
INFO     [03-25-2016 03:02:25] urllib3.connectionpool: Starting new HTTP connection (1): 192.168.1.64
INFO     [03-25-2016 03:02:29] tornado.access: 200 GET /ns/iR32U/nodes/n003_00626e495f4cm/query?requestId=315 (192.168.1.64) 4.15ms
DEBUG    [03-25-2016 03:02:29] polyglot.nodeserver_manager: ihcam STDIN: {"query": {"node_address": "00626e495f4cm", "request_id": "315"}}
ERROR    [03-25-2016 03:02:35] polyglot.element_manager.isy: Timed out waiting for response from the ISY device.
DEBUG    [03-25-2016 03:02:35] polyglot.nodeserver_manager: ihcam STDOUT: {"status": {"node_address": "00626e495f4c", "driver_control": "GV3", "uom": 56, "value": 0}}
DEBUG    [03-25-2016 03:02:35] polyglot.element_manager.isy: ISY Request: http://192.168.1.64:80/rest/ns/3/nodes/n003_00626e495f4c/report/status/GV3/0/56
INFO     [03-25-2016 03:02:35] urllib3.connectionpool: Starting new HTTP connection (1): 192.168.1.64
INFO     [03-25-2016 03:02:39] tornado.access: 200 GET /ns/iR32U/nodes/n003_00626e495be9/query?requestId=316 (192.168.1.64) 4.13ms
DEBUG    [03-25-2016 03:02:39] polyglot.nodeserver_manager: ihcam STDIN: {"query": {"node_address": "00626e495be9", "request_id": "316"}}
ERROR    [03-25-2016 03:02:45] polyglot.element_manager.isy: Timed out waiting for response from the ISY device.
DEBUG    [03-25-2016 03:02:45] polyglot.nodeserver_manager: ihcam STDOUT: {"status": {"node_address": "00626e495f4c", "driver_control": "GV1", "uom": 56, "value": 0}}
DEBUG    [03-25-2016 03:02:45] polyglot.element_manager.isy: ISY Request: http://192.168.1.64:80/rest/ns/3/nodes/n003_00626e495f4c/report/status/GV1/0/56
INFO     [03-25-2016 03:02:45] urllib3.connectionpool: Starting new HTTP connection (1): 192.168.1.64
INFO     [03-25-2016 03:02:49] tornado.access: 200 GET /ns/iR32U/nodes/n003_00626e495be9m/query?requestId=317 (192.168.1.64) 2.96ms
DEBUG    [03-25-2016 03:02:49] polyglot.nodeserver_manager: ihcam STDIN: {"query": {"node_address": "00626e495be9m", "request_id": "317"}}
ERROR    [03-25-2016 03:02:50] polyglot.nodeserver_manager: Node Server ihcam has stopped responding.
ERROR    [03-25-2016 03:02:55] polyglot.element_manager.isy: ISY Could not recieve response from device because of a network issue.

I just observed this same event. Nothing at all of interest in the log file, just the declaration from polyglot that the node server had stopped responding. The node server process was in a zombie state.

What's interesting is that the node server in question was written in Perl, and used the STDIN/STDOUT API to talk to polyglot. The Perl node server has its own logging mechanism, which reflected normal operation up until the node server was abruptly terminated for unknown reasons. Since it's Perl, this rules out the problem being in the Python node server code. And since this has now been observed with two radically-different node servers, it tends to point to the fault being in the actual Polyglot nodeserver_manager.py (and related) code.

I suspect that what may be happening is that the 3AM query is overloading something somewhere, resulting in Polyglot getting "stalled" in timeouts on network threads to the ISY - this may result in missed "pong" messages back from the node servers themselves.

I'm glad someone else saw this, I put a bunch of logging in my node server to try and debug this but could never see an issue on the node server side. I also modified the polyglot code in my checked out area to not kill the node server and just retry the:

                # try to get a line from the queue
                line = self._inq.get(True, 5)

and it would always read a new line after the initial timeout. I was doing some reading and there are warnings about reading from process pipes causing deadlocks and to use communicate instead in
https://docs.python.org/2/library/subprocess.html but I have not had time to review the code or debug further, and I'm currently out of town and won't be back till after next weekend, so if you could look into it further that would be appreciated.

Hi Mike,

This can be tested by Querying ISY (right mouse click ISY or Network | Query). The main question is whether or not we can reproduce this at will. Since Polyglot does not subscribe to ISY, I suspect that the Query is calling a Polyglot query which does not work properly.

With kind regards,


Michel Kohanim
CEO

(p) 818.631.0333
(f) 818.436.0702
http://www.universal-devices.comhttp://www.universal-devices.com/


From: Mike Westerhof [mailto:notifications@github.com]
Sent: Sunday, April 3, 2016 9:30 AM
To: UniversalDevicesInc/Polyglot Polyglot@noreply.github.com
Subject: Re: [UniversalDevicesInc/Polyglot] Node server died during 3am query with no info (#13)

I just observed this same event. Nothing at all of interest in the log file, just the declaration from polyglot that the node server had stopped responding. The node server process was in a zombie state.

What's interesting is that the node server in question was written in Perl, and used the STDIN/STDOUT API to talk to polyglot. The Perl node server has its own logging mechanism, which reflected normal operation up until the node server was abruptly terminated for unknown reasons. Since it's Perl, this rules out the problem being in the Python node server code. And since this has now been observed with two radically-different node servers, it tends to point to the fault being in the actual Polyglot nodeserver_manager.py (and related) code.

I suspect that what may be happening is that the 3AM query is overloading something somewhere, resulting in Polyglot getting "stalled" in timeouts on network threads to the ISY - this may result in missed "pong" messages back from the node servers themselves.


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHubhttps://github.com//issues/13#issuecomment-205007156

Just to reiterate: Polyglot does not subscribe to ISY so it cannot be related to massive traffic. It will have to do with one of the node servers.

With kind regards,


Michel Kohanim
CEO

(p) 818.631.0333
(f) 818.436.0702
http://www.universal-devices.comhttp://www.universal-devices.com/


From: jimboca [mailto:notifications@github.com]
Sent: Sunday, April 3, 2016 9:49 AM
To: UniversalDevicesInc/Polyglot Polyglot@noreply.github.com
Subject: Re: [UniversalDevicesInc/Polyglot] Node server died during 3am query with no info (#13)

I'm glad someone else saw this, I put a bunch of logging in my node server to try and debug this but could never see an issue on the node server side. I also modified the polyglot code in my checked out area to not kill the node server and just retry the:

try to get a line from the queue

line = self._inq.get(True, 5)

and it would always read a new line after the initial timeout. I was doing some reading and there are warnings about reading from process pipes causing deadlocks and to use communicate instead in
https://docs.python.org/2/library/subprocess.html but I have not had time to review the code or debug further, and I'm currently out of town and won't be back till after next weekend, so if you could look into it further that would be appreciated.


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHubhttps://github.com//issues/13#issuecomment-205008864

I'm reasonably sure I've identified the root cause of this issue. The fix is not hard, but non-trivial; it'll need some testing.

Polyglot's node server manager is threaded - but not very much. We're interested in the one thread for each node server that is assigned to handle the output from the node server (which would be things like status updates and pong responses). The code in Polyglot reads a line from the node server, figures out what command it is, then goes and does the "right thing" for that command, and upon completion, reads the next line (or waits for the next line).

The "ping" message is sent to the node server from the manager about every 30 seconds. The node server must respond with a "pong" within about 30 seconds, or the manager will determine that the node server has failed, and terminate it.

Consider the following scenario:
a) A ping/pong exchange has been completed.
b) Time passes - 29 seconds for this illustration.
c) Suddenly, the node server has an event occur that requires it to update its status to the ISY -- for example, if the node server is a Hue Hub, perhaps 6 bulbs just changed color and intensity.
d) The node server sends a series of status commands, with arguments -- in our example, that would be color and intensity for 6 bulbs, which might be 18 separate updates to send.
e) The main thread assigned to processing our node server's output in Polyglot reads the first status update, and makes a REST API call to the ISY, and begins repeating this process for the remaining 17 or so status updates in the queue.
f) Meanwhile, the node server's 30 second timer is up, and a "ping" message is sent to the node server.
g) The node server, which is doing nothing right now (it sent the 18 status updates, it's Polyglot's node server output processing thread that's tied up all busy right now!). So the node server happily responds with a "pong".
h) The node server output processing thread runs into a network issue -- one of the requests is taking far longer than usual. After 10 seconds, it times out and the thread continues.
i) The "pong" is still sitting in the queue - but the output processing thread is still stuck, running into timeouts on the network, and each timeout costs 10 seconds.
j) The 30-second ping timer expires -- the pong is there, in the queue, but unprocessed, because the output processing thread is still stuck on the network messages it's trying to get to the ISY. The timer thread determines that the node server must have failed, and terminates it.

Problem 1: The output processing thread is doing work that can take an undetermined amount of time.

Problem 2: Network timeouts seem to be rather common with Polyglot during certain specific times.

For Problem 1, I propose that we create a separate queue and worker thread for REST calls to the ISY from a node server. Instead of making the REST calls itself, the existing output processing code will simply transfer those commands and arguments over to a request queue, which will be processed in FIFO order by a request thread (one per node server) which will do all the network REST calls for us. Code to do this is currently being tested; I'll commit some changes if it passes overnight.

For Problem 2, we need more testing to find out if the problem is in the requests library, the RPi networking code, the network switch/cables, or the ISY. That's a separate issue, though.

Nice. Good catch.

Mike,

Excellent catch! I agree with the approach.

THANK YOU.

With kind regards,


Michel Kohanim
CEO

Changes committed - please test.

Note that I'll create a separate issue to capture the network timeout issue - so don't worry that this fix won't correct any of the frequent network timeouts and failures recorded by Polyglot when attempting to update the ISY. The fix is only intended to ensure that Polyglot no longer wrongly terminates a node server.

Thanks Mike! I will test when I get home Sunday or Monday.

On Wed, Apr 6, 2016, 4:27 PM Mike Westerhof notifications@github.com
wrote:

Changes committed - please test.

Note that I'll create a separate issue to capture the network timeout
issue - so don't worry that this fix won't correct any of the frequent
network timeouts and failures recorded by Polyglot when attempting to
update the ISY. The fix is only intended to ensure that Polyglot no longer
wrongly terminates a node server.


You are receiving this because you were assigned.
Reply to this email directly or view it on GitHub
#13 (comment)

Thank you Mike!

With kind regards,


Michel Kohanim
CEO

This was stable again for me overnight, despite a bucket-load (that's a lot!) :-) of HTTP errors during the 3AM query. I've bumped up the version number to 0.0.2 just to make sure that we can easily distinguish log files that include this change from earlier ones.

I'll leave this open until we get some broader testing, but I'm pretty sure this fixed the "node server termination" problem without breaking anything else...

Thanks for all the work on this. I am finally back home and the wife is gone so I will have time again to test this.

Running with the development branch and Polyglot survived the 3AM. Still get some HTTP errors, but it's still running. Nice work!

@mjwesterhof this has been running great for me. Should we close this issue?

Yep, if it works for you as well, I'm comfortable with it. I'll close it.