openvstorage/openvstorage-health-check

Cached storagerouter object triggers UnableToConnectException

matthiasdeblock opened this issue · 3 comments

Problem description

The healthcheck caches the storagerouter object. If the healthcheck keeps the object too long, you can run into the following UnableToConnectException:

Traceback (most recent call last):
  File "/opt/jumpscale7/libext/CloudscalerLibcloud/openvstorage.py", line 39, in run_healthcheck
    hcresults = HealthCheckCLIRunner.run_method(modulename, testname)
  File "/opt/OpenvStorage/ovs/extensions/healthcheck/expose_to_cli.py", line 310, in run_method
    result_handler.info('Starting OpenvStorage Healthcheck version {0}'.format(Helper.get_healthcheck_version()))
  File "/opt/OpenvStorage/ovs/extensions/healthcheck/helpers/helper.py", line 61, in get_healthcheck_version
    client = SSHClient(System.get_my_storagerouter())
  File "/opt/OpenvStorage/ovs/extensions/generic/sshclient.py", line 62, in __init__
    raise UnableToConnectException(message)
UnableToConnectException: StorageRouter 10.16.6.64 process heartbeat > 300s

First fetch of the storagerouter object:

In [5]: storagerouter.last_heartbeat - time.time()
Out[5]: -36.863505125045776

In [6]: storagerouter.last_heartbeat - time.time()
Out[6]: -37.91942310333252

In [7]: storagerouter.last_heartbeat - time.time()
Out[7]: -43.97763514518738

In [8]: storagerouter.last_heartbeat - time.time()
Out[8]: -46.687355041503906

In [9]: storagerouter.last_heartbeat - time.time()
Out[9]: -55.87161207199097

In [10]: storagerouter.last_heartbeat - time.time()
Out[10]: -62.55178117752075

In [11]: storagerouter.last_heartbeat - time.time()
Out[11]: -88.7434561252594

In [12]: storagerouter.last_heartbeat - time.time()
Out[12]: -97.71954417228699

In [13]: storagerouter.last_heartbeat - time.time()
Out[13]: -104.4878420829773

Once the cached heartbeat is more than 300s old, the error is triggered.

A second fetch of the storagerouter object:

In [15]: storagerouter.last_heartbeat - time.time()
Out[15]: -56.2856969833374
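To make the effect easier to reproduce outside a live cluster, here is a minimal sketch of the same behaviour. It does not use the OpenvStorage API: `HeartbeatStore`, `Snapshot` and `is_stale` are hypothetical names, and fixed timestamps replace `time.time()` so the result is deterministic. Only the 300s limit is taken from the error message above.

```python
HEARTBEAT_LIMIT = 300  # seconds, as reported in the exception message


class HeartbeatStore(object):
    """Hypothetical stand-in for the place where the live heartbeat is stored."""

    def __init__(self, now):
        self.last_heartbeat = now

    def beat(self, now):
        # The running process keeps updating the stored heartbeat.
        self.last_heartbeat = now


class Snapshot(object):
    """Hypothetical cached object: copies the heartbeat once, at fetch time."""

    def __init__(self, store):
        self.last_heartbeat = store.last_heartbeat


def is_stale(snapshot, now, limit=HEARTBEAT_LIMIT):
    # True when the heartbeat seen by this snapshot is older than the limit.
    return now - snapshot.last_heartbeat > limit


store = HeartbeatStore(now=1000.0)
cached = Snapshot(store)  # first fetch, kept around by the caller

# The process stays healthy and beats every 60 seconds.
for t in range(1060, 1400, 60):
    store.beat(float(t))

now = 1360.0
stale_cached = is_stale(cached, now)  # cached copy still holds the old value
fresh = Snapshot(store)               # second fetch reads the current value
stale_fresh = is_stale(fresh, now)
```

The cached snapshot looks dead even though the process never stopped beating, while a second fetch at the same moment looks healthy.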

@matthiasdeblock Was there anything special going on (removal of a Storage Router)? Why would you get an `UnableToConnect` otherwise?

@wimpers, no, there isn't any removal ongoing. The issue is triggered because the object is cached. This was also mentioned in the ticket.

@wimpers
The issue was that the healthcheck could provide an out-of-date storagerouter object. This object is passed to the SSHClient, which checks the object's heartbeat property. When working with an older object, this property is never refreshed, causing the issue.

Either way, this issue has been fixed in both 3.6 and 3.4 as part of some backporting work I did.
Another failsafe was built inside the SSHClient itself: openvstorage/framework#1895
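
For readers wondering what a failsafe of this kind could look like, below is a rough sketch. It is not the change made in openvstorage/framework#1895, and it does not use the real SSHClient. It assumes a plain dict for the storagerouter and a caller-supplied `reload_router` callable; `check_heartbeat` and `UnableToConnect` are hypothetical names. The idea is to re-fetch the object once before deciding that the heartbeat is too old.

```python
class UnableToConnect(Exception):
    """Hypothetical stand-in for UnableToConnectException."""


def check_heartbeat(router, reload_router, now, limit=300):
    """Return a router whose heartbeat is within the limit, or raise.

    router        -- dict with 'ip' and 'last_heartbeat' keys (assumed shape)
    reload_router -- callable returning a freshly fetched router dict
    now           -- current timestamp in seconds
    """
    if now - router['last_heartbeat'] > limit:
        # The object may simply be an old cached copy: fetch it again once.
        router = reload_router()
        if now - router['last_heartbeat'] > limit:
            # Still too old after a reload, so the process really is silent.
            raise UnableToConnect(
                'StorageRouter {0} process heartbeat > {1}s'.format(router['ip'], limit)
            )
    return router
```

With a stale cached dict and a reload that returns current data, the call returns the fresh object. If the reloaded data is also too old, the exception is raised as before.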