openvstorage/openvstorage-health-check

arakoon collapse test broken on bighorn-rc-2

jeroenmaelbrancke opened this issue · 3 comments

Problem description

The healthcheck is unable to list the contents of a tlog directory, even though SSH to each node works for both the root and ovs users.
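The `ChannelException: (1, 'Administratively prohibited')` further down is the reply sshd gives when a single connection requests more sessions than its `MaxSessions` limit allows. A quick diagnostic sketch (run as root on the affected storagerouter) to check the node's effective limit:

```shell
# Dump sshd's effective configuration and show the per-connection session
# cap; the compiled-in default is 10. Falls back to the default if sshd -T
# produces no output (e.g. not run as root).
sshd -T 2>/dev/null | grep -i '^maxsessions' || echo "maxsessions 10 (default)"

# Also check whether the limit was set explicitly in the config file:
grep -i '^MaxSessions' /etc/ssh/sshd_config || echo "MaxSessions not set explicitly (default 10)"
```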

Logs

[INFO] Storagerouter Id: Wta3zxDzJWsK95m5
[INFO] Environment Os: Ubuntu 16.04 xenial
[INFO] Hostname: pocops-voldrv03
[INFO] Cluster Id: 1RyBXEa8RadMxmM0
[INFO] Storagerouter Type: MASTER
Disconnecting
Closing underlying paramiko
[INFO] Starting OpenvStorage Healthcheck version 3.6.7-1
[INFO] ======================
[INFO] Fetching available arakoon clusters.
[INFO] Starting Arakoon collapse test
[INFO] Retrieving all collapsing statistics. This might take a while
Refcounter:  1
Refcounter:  1
[INFO] Retrieving collapse information for Arakoon cluster cacc on node 10.100.200.13
[INFO] Retrieving collapse information for Arakoon cluster cacc on node 10.100.200.11
[INFO] Retrieving collapse information for Arakoon cluster ovsdb on node 10.100.200.13
[INFO] Retrieving collapse information for Arakoon cluster ovsdb on node 10.100.200.11
[INFO] Retrieving collapse information for Arakoon cluster voldrv on node 10.100.200.13
[INFO] Retrieving collapse information for Arakoon cluster voldrv on node 10.100.200.11
[INFO] Retrieving collapse information for Arakoon cluster hddbackend01-abm on node 10.100.200.13
[INFO] Retrieving collapse information for Arakoon cluster hddbackend01-abm on node 10.100.200.11
[INFO] Retrieving collapse information for Arakoon cluster ssdbackend01-abm on node 10.100.200.13
[INFO] Retrieving collapse information for Arakoon cluster ssdbackend01-abm on node 10.100.200.11
[INFO] Retrieving collapse information for Arakoon cluster hddbackend01-nsm_01 on node 10.100.200.13
[INFO] Retrieving collapse information for Arakoon cluster hddbackend01-nsm_01 on node 10.100.200.11
[INFO] Retrieving collapse information for Arakoon cluster hddbackend01-nsm_02 on node 10.100.200.13
[INFO] Retrieving collapse information for Arakoon cluster hddbackend01-nsm_02 on node 10.100.200.11
[INFO] Retrieving collapse information for Arakoon cluster hddbackend01-nsm_03 on node 10.100.200.13
[INFO] Retrieving collapse information for Arakoon cluster hddbackend01-nsm_03 on node 10.100.200.11
[INFO] Retrieving collapse information for Arakoon cluster ssdbackend01-nsm_01 on node 10.100.200.13
[INFO] Retrieving collapse information for Arakoon cluster ssdbackend01-nsm_01 on node 10.100.200.11
No handlers could be found for logger "paramiko.transport"
Traceback (most recent call last):
  File "/opt/OpenvStorage/ovs/extensions/healthcheck/arakoon_hc.py", line 422, in _collapse_worker
    output['avail_size'] = _client.run("df {0} | tail -1 | awk '{{print $4}}'".format(path), allow_insecure=True)
  File "/usr/lib/python2.7/dist-packages/ovs_extensions/generic/sshclient.py", line 61, in inner_function
    return outer_function(self, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/ovs_extensions/generic/sshclient.py", line 400, in run
    _, stdout, stderr = self._client.exec_command(command, timeout=timeout)  # stdin, stdout, stderr
  File "/usr/lib/python2.7/dist-packages/paramiko/client.py", line 401, in exec_command
    chan = self._transport.open_session(timeout=timeout)
  File "/usr/lib/python2.7/dist-packages/paramiko/transport.py", line 703, in open_session
    timeout=timeout)
  File "/usr/lib/python2.7/dist-packages/paramiko/transport.py", line 835, in open_channel
    raise e
ChannelException: (1, 'Administratively prohibited')
Traceback (most recent call last):
  File "/opt/OpenvStorage/ovs/extensions/healthcheck/arakoon_hc.py", line 421, in _collapse_worker
    timestamp_files = _client.run('stat -c "%Y %n %s" {0}'.format(path), allow_insecure=True)
  File "/usr/lib/python2.7/dist-packages/ovs_extensions/generic/sshclient.py", line 61, in inner_function
    return outer_function(self, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/ovs_extensions/generic/sshclient.py", line 400, in run
    _, stdout, stderr = self._client.exec_command(command, timeout=timeout)  # stdin, stdout, stderr
  File "/usr/lib/python2.7/dist-packages/paramiko/client.py", line 401, in exec_command
    chan = self._transport.open_session(timeout=timeout)
  File "/usr/lib/python2.7/dist-packages/paramiko/transport.py", line 703, in open_session
    timeout=timeout)
  File "/usr/lib/python2.7/dist-packages/paramiko/transport.py", line 835, in open_channel
    raise e
SSHException: Unable to open channel.
[INFO] Retrieving collapse information for Arakoon cluster ssdbackend01-nsm_02 on node 10.100.200.13
[WARNING] Could not retrieve the collapse information for Arakoon cluster ssdbackend01-abm on node 10.100.200.11 ((1, 'Administratively prohibited'))
[INFO] Retrieving collapse information for Arakoon cluster ssdbackend01-nsm_02 on node 10.100.200.11
[INFO] Retrieving collapse information for Arakoon cluster ssdbackend01-nsm_03 on node 10.100.200.13
[WARNING] Could not retrieve the collapse information for Arakoon cluster ssdbackend01-nsm_01 on node 10.100.200.11 (Unable to open channel.)
[INFO] Retrieving collapse information for Arakoon cluster ssdbackend01-nsm_03 on node 10.100.200.11
Disconnecting
Closing underlying paramiko
Disconnecting
Closing underlying paramiko
[INFO] Retrieving all collapsing statistics succeeded (duration: 1.28319716454)
[INFO] Testing the collapse of CFG Arakoons
[SUCCESS] Spare space for local collapse is  sufficient (n > 4x head.db size)
[SKIPPED] Arakoon cluster cacc on node 10.100.200.11 only has 2 tlx, not worth collapsing (required: 10)
[SUCCESS] Spare space for local collapse is  sufficient (n > 4x head.db size)
[SKIPPED] Arakoon cluster cacc on node 10.100.200.13 only has 2 tlx, not worth collapsing (required: 10)
[INFO] Testing the collapse of FWK Arakoons
[SKIPPED] Arakoon cluster ovsdb on node 10.100.200.11 only has 6 tlx, not worth collapsing (required: 10)
[SKIPPED] Arakoon cluster ovsdb on node 10.100.200.13 only has 6 tlx, not worth collapsing (required: 10)
[INFO] Testing the collapse of SD Arakoons
[SKIPPED] Arakoon cluster voldrv on node 10.100.200.11 only has 2 tlx, not worth collapsing (required: 10)
[SKIPPED] Arakoon cluster voldrv on node 10.100.200.13 only has 2 tlx, not worth collapsing (required: 10)
[INFO] Testing the collapse of ABM Arakoons
[SKIPPED] Arakoon cluster hddbackend01-abm on node 10.100.200.11 only has 3 tlx, not worth collapsing (required: 10)
[SKIPPED] Arakoon cluster hddbackend01-abm on node 10.100.200.13 only has 3 tlx, not worth collapsing (required: 10)
[EXCEPTION] Unable to list the contents of the tlog directory (/mnt/ssd1/arakoon/ssdbackend01-abm/tlogs) for Arakoon cluster ssdbackend01-abm on node 10.100.200.11
[SKIPPED] Arakoon cluster ssdbackend01-abm on node 10.100.200.13 only has 2 tlx, not worth collapsing (required: 10)
[INFO] Testing the collapse of NSM Arakoons
[SKIPPED] Arakoon cluster hddbackend01-nsm_01 on node 10.100.200.11 only has 1 tlx, not worth collapsing (required: 10)
[SKIPPED] Arakoon cluster hddbackend01-nsm_01 on node 10.100.200.13 only has 1 tlx, not worth collapsing (required: 10)
[SKIPPED] Arakoon cluster hddbackend01-nsm_02 on node 10.100.200.11 only has 1 tlx, not worth collapsing (required: 10)
[SKIPPED] Arakoon cluster hddbackend01-nsm_02 on node 10.100.200.13 only has 1 tlx, not worth collapsing (required: 10)
[SKIPPED] Arakoon cluster hddbackend01-nsm_03 on node 10.100.200.11 only has 1 tlx, not worth collapsing (required: 10)
[SKIPPED] Arakoon cluster hddbackend01-nsm_03 on node 10.100.200.13 only has 1 tlx, not worth collapsing (required: 10)
[EXCEPTION] Unable to list the contents of the tlog directory (/mnt/ssd1/arakoon/ssdbackend01-nsm_01/tlogs) for Arakoon cluster ssdbackend01-nsm_01 on node 10.100.200.11
[SKIPPED] Arakoon cluster ssdbackend01-nsm_01 on node 10.100.200.13 only has 1 tlx, not worth collapsing (required: 10)
[SKIPPED] Arakoon cluster ssdbackend01-nsm_02 on node 10.100.200.11 only has 1 tlx, not worth collapsing (required: 10)
[SKIPPED] Arakoon cluster ssdbackend01-nsm_02 on node 10.100.200.13 only has 1 tlx, not worth collapsing (required: 10)
[SKIPPED] Arakoon cluster ssdbackend01-nsm_03 on node 10.100.200.11 only has 1 tlx, not worth collapsing (required: 10)
[SKIPPED] Arakoon cluster ssdbackend01-nsm_03 on node 10.100.200.13 only has 1 tlx, not worth collapsing (required: 10)
[INFO] Recap of Health Check module arakoon test collapse-test!
[INFO] ======================
[INFO] SUCCESS=2 FAILED=0 SKIPPED=20 WARNING=2 EXCEPTION=2

Additional information

Packages

ii  openvstorage                         2.11.5-1                                   amd64        OpenvStorage
ii  openvstorage-backend                 1.11.2-1                                   amd64        OpenvStorage Backend plugin
ii  openvstorage-backend-core            1.11.2-1                                   amd64        OpenvStorage Backend plugin core
ii  openvstorage-backend-webapps         1.11.2-1                                   amd64        OpenvStorage Backend plugin Web Applications
ii  openvstorage-core                    2.11.5-1                                   amd64        OpenvStorage core
ii  openvstorage-extensions              0.3.3-1                                    amd64        Extensions for Open vStorage
ii  openvstorage-hc                      1.11.2-1                                   amd64        OpenvStorage Backend plugin HyperConverged
ii  openvstorage-health-check            3.6.7-1                                    amd64        Open vStorage HealthCheck
ii  openvstorage-sdm                     1.11.1-1                                   amd64        Open vStorage Backend ASD Manager
ii  openvstorage-webapps                 2.11.5-1                                   amd64        OpenvStorage Web Applications

This is the MaxSessions error: we are hitting the MaxSessions limit (default 10) of sshd on that node.
There is not much we can do about the limit from the client side, but we could try to reduce the number of open sessions in the sshclient by opening them only when we need them.
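A minimal sketch of that idea (the names `run_limited` and `MAX_CONCURRENT_SESSIONS` are hypothetical, not the actual change in #451): cap the number of in-flight exec calls per node with a semaphore, so worker threads block and wait for a free slot instead of tripping sshd's MaxSessions:

```python
import threading

# Keep the number of concurrently open SSH sessions per node below
# sshd's MaxSessions (default 10). Hypothetical sketch, not the #451 fix.
MAX_CONCURRENT_SESSIONS = 8
_session_slots = threading.Semaphore(MAX_CONCURRENT_SESSIONS)

def run_limited(run_command, command):
    """Run `command` via `run_command` (e.g. SSHClient.run), blocking
    until a session slot is free instead of raising ChannelException."""
    with _session_slots:
        return run_command(command)
```

The collapse workers would then call `run_limited(_client.run, ...)` instead of `_client.run(...)` directly; per the comment above, the direction actually taken was to open sessions only when they are needed.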

Fixed by #451
Packaged in: openvstorage-health-check_3.7.0-dev.1525262004.35685b0-1_amd64.deb

Released in https://github.com/openvstorage/openvstorage-health-check/releases/tag/3.7.0
Packaged in openvstorage-health-check_3.7.0-1_amd64.deb