gluster/glusto-tests

Glusto stuck in a loop


Scenario:
1 - Set up a couple of nodes (e.g. 3)
2 - start gluster on all of these node ('glusterd')
3 - Run glusto tests

Result:
The test framework is stuck in an infinite loop trying to restart the glusterd process on the nodes.

Console output from the node running the tests:
glusto -c "/root/glusto-tests/tests/gluster_basic_config.yml" --pytest='/root/glusto-tests/tests/functional/dht/test_dht_create_dir.py'
Log glustomain created as /tmp/glustomain.log with log level DEBUG
Starting glusto via main()
clients: [barak-pc]
clients_info:
  barak-pc: {host: barak-pc}
gluster:
  volume_create_force: true
  volume_types:
    arbiter: {arbiter_count: 1, replica_count: 3, transport: tcp, type: arbiter}
    dispersed: {disperse_count: 6, redundancy_count: 2, transport: tcp, type: dispersed}
    distributed: {dist_count: 3, transport: tcp, type: distributed}
    distributed-arbiter: {arbiter_count: 1, dist_count: 2, replica_count: 3, transport: tcp,
      type: distributed-arbiter}
    distributed-dispersed: {disperse_count: 6, dist_count: 2, redundancy_count: 2,
      transport: tcp, type: distributed-dispersed}
    distributed-replicated: {dist_count: 1, replica_count: 3, transport: tcp, type: distributed-replicated}
    replicated: {arbiter_count: 1, replica_count: 3, transport: tcp, type: replicated}
log_file: /var/log/tests/gluster_tests.log
log_level: DEBUG
servers: [vm1, vm2, vm3]
servers_info:
  vm1:
    brick_root: /mnt/disk2
    devices: [/dev/vdb]
    host: vm1
  vm2:
    brick_root: /mnt/disk2
    devices: [/dev/vdb]
    host: vm2
  vm3:
    brick_root: /mnt/disk2
    devices: [/dev/vdb]
    host: vm3

pytest: /root/glusto-tests/tests/functional/dht/test_dht_create_dir.py
=================================================================================================================================================================================== test session starts ====================================================================================================================================================================================
platform linux -- Python 3.8.0, pytest-6.0.0rc1, py-1.9.0, pluggy-0.13.1
rootdir: /root/glusto-tests/tests/functional/dht
collected 2 items

../root/glusto-tests/tests/functional/dht/test_dht_create_dir.py ^C

===================================================================================================================================================================================== warnings summary =====================================================================================================================================================================================
test_dht_create_dir.py::TestDhtClass_cplex_distributed_glusterfs::test_create_directory
test_dht_create_dir.py::TestDhtClass_cplex_distributed_glusterfs::test_create_directory
test_dht_create_dir.py::TestDhtClass_cplex_distributed_glusterfs::test_create_directory
test_dht_create_dir.py::TestDhtClass_cplex_distributed_glusterfs::test_create_directory
test_dht_create_dir.py::TestDhtClass_cplex_distributed_glusterfs::test_create_directory
test_dht_create_dir.py::TestDhtClass_cplex_distributed_glusterfs::test_create_directory
/usr/local/lib/python3.8/site-packages/glustolibs_gluster-0.22-py3.8.egg/glustolibs/gluster/peer_ops.py:293: DeprecationWarning: This method will be removed in future versions. Use 'list(elem)' or iteration over elem instead.

test_dht_create_dir.py::TestDhtClass_cplex_distributed_glusterfs::test_create_directory
test_dht_create_dir.py::TestDhtClass_cplex_distributed_glusterfs::test_create_directory
test_dht_create_dir.py::TestDhtClass_cplex_distributed_glusterfs::test_create_directory
test_dht_create_dir.py::TestDhtClass_cplex_distributed_glusterfs::test_create_directory
test_dht_create_dir.py::TestDhtClass_cplex_distributed_glusterfs::test_create_directory
test_dht_create_dir.py::TestDhtClass_cplex_distributed_glusterfs::test_create_directory
/usr/local/lib/python3.8/site-packages/glustolibs_gluster-0.22-py3.8.egg/glustolibs/gluster/peer_ops.py:296: DeprecationWarning: This method will be removed in future versions. Use 'list(elem)' or iteration over elem instead.

test_dht_create_dir.py: 168 warnings
/usr/local/lib/python3.8/site-packages/glustolibs_gluster-0.22-py3.8.egg/glustolibs/gluster/volume_ops.py:707: DeprecationWarning: This method will be removed in future versions. Use 'list(elem)' or iteration over elem instead.

test_dht_create_dir.py: 168 warnings
/usr/local/lib/python3.8/site-packages/glustolibs_gluster-0.22-py3.8.egg/glustolibs/gluster/volume_ops.py:713: DeprecationWarning: This method will be removed in future versions. Use 'list(elem)' or iteration over elem instead.

test_dht_create_dir.py: 168 warnings
/usr/local/lib/python3.8/site-packages/glustolibs_gluster-0.22-py3.8.egg/glustolibs/gluster/volume_ops.py:716: DeprecationWarning: This method will be removed in future versions. Use 'list(elem)' or iteration over elem instead.

test_dht_create_dir.py: 504 warnings
/usr/local/lib/python3.8/site-packages/glustolibs_gluster-0.22-py3.8.egg/glustolibs/gluster/volume_ops.py:719: DeprecationWarning: This method will be removed in future versions. Use 'list(elem)' or iteration over elem instead.

test_dht_create_dir.py: 504 warnings
/usr/local/lib/python3.8/site-packages/glustolibs_gluster-0.22-py3.8.egg/glustolibs/gluster/volume_ops.py:727: DeprecationWarning: This method will be removed in future versions. Use 'list(elem)' or iteration over elem instead.

test_dht_create_dir.py::TestDhtClass_cplex_distributed_glusterfs::test_create_directory
test_dht_create_dir.py::TestDhtClass_cplex_distributed_glusterfs::test_create_directory
test_dht_create_dir.py::TestDhtClass_cplex_distributed_glusterfs::test_create_directory
/usr/local/lib/python3.8/site-packages/glustolibs_gluster-0.22-py3.8.egg/glustolibs/gluster/volume_ops.py:453: DeprecationWarning: This method will be removed in future versions. Use 'list(elem)' or iteration over elem instead.

test_dht_create_dir.py: 80 warnings
/usr/local/lib/python3.8/site-packages/glustolibs_gluster-0.22-py3.8.egg/glustolibs/gluster/volume_ops.py:820: DeprecationWarning: This method will be removed in future versions. Use 'list(elem)' or iteration over elem instead.

-- Docs: https://docs.pytest.org/en/stable/warnings.html
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/usr/lib64/python3.8/selectors.py:415: KeyboardInterrupt
(to show a full traceback on KeyboardInterrupt use --full-trace)
============================================================================================================================================================================ 1607 warnings in 158.15s (0:02:38) ============================================================================================================================================================================
Ending glusto via main()

Glusto log is also attached.
glustomain.log

From the log file shared in this issue, I see that glusterd has crashed and isn't starting, which is why the glusto tests are going into a loop.

2020-08-09 12:06:07,538 ERROR (restart_glusterd) Unable to restart glusterd on server vm1
2020-08-09 12:06:07,538 ERROR (restart_glusterd) Unable to restart glusterd on server vm2
2020-08-09 12:06:07,538 ERROR (restart_glusterd) Unable to restart glusterd on server vm3
2020-08-09 12:06:07,538 INFO (run_async) root@vm1 (cp): systemctl reset-failed glusterd
2020-08-09 12:06:07,538 DEBUG (_get_ssh_connection) Retrieved connection from cache: root@vm1
2020-08-09 12:06:07,542 INFO (run_async) root@vm2 (cp): systemctl reset-failed glusterd
2020-08-09 12:06:07,543 DEBUG (_get_ssh_connection) Retrieved connection from cache: root@vm2
2020-08-09 12:06:07,545 INFO (run_async) root@vm3 (cp): systemctl reset-failed glusterd
2020-08-09 12:06:07,546 DEBUG (_get_ssh_connection) Retrieved connection from cache: root@vm3
2020-08-09 12:06:07,611 INFO (_log_results) RETCODE (root@vm1): 0
2020-08-09 12:06:07,611 INFO (_log_results) RETCODE (root@vm2): 0
2020-08-09 12:06:07,611 INFO (_log_results) RETCODE (root@vm3): 0
2020-08-09 12:06:07,611 INFO (run_async) root@vm1 (cp): service glusterd restart
2020-08-09 12:06:07,611 DEBUG (_get_ssh_connection) Retrieved connection from cache: root@vm1
2020-08-09 12:06:07,616 INFO (run_async) root@vm2 (cp): service glusterd restart
2020-08-09 12:06:07,616 DEBUG (_get_ssh_connection) Retrieved connection from cache: root@vm2
2020-08-09 12:06:07,627 INFO (run_async) root@vm3 (cp): service glusterd restart
2020-08-09 12:06:07,628 DEBUG (_get_ssh_connection) Retrieved connection from cache: root@vm3
2020-08-09 12:06:17,840 INFO (_log_results) RETCODE (root@vm1): 1
2020-08-09 12:06:17,841 INFO (_log_results) STDOUT (root@vm1)...
Restarting glusterd (via systemctl):  [FAILED]
2020-08-09 12:06:17,841 INFO (_log_results) STDERR (root@vm1)...
Job for glusterd.service failed because the control process exited with error code.
See "systemctl status glusterd.service" and "journalctl -xe" for details.
2020-08-09 12:06:17,842 INFO (_log_results) RETCODE (root@vm2): 1
2020-08-09 12:06:17,842 INFO (_log_results) STDOUT (root@vm2)...
Restarting glusterd (via systemctl):  [FAILED]
2020-08-09 12:06:17,842 INFO (_log_results) STDERR (root@vm2)...
Job for glusterd.service failed because the control process exited with error code.
See "systemctl status glusterd.service" and "journalctl -xe" for details.
2020-08-09 12:06:17,851 INFO (_log_results) RETCODE (root@vm3): 1
2020-08-09 12:06:17,851 INFO (_log_results) STDOUT (root@vm3)...
Restarting glusterd (via systemctl):  [FAILED]
2020-08-09 12:06:17,851 INFO (_log_results) STDERR (root@vm3)...
Job for glusterd.service failed because the control process exited with error code.
See "systemctl status glusterd.service" and "journalctl -xe" for details.

This looks like a glusterfs nightly build issue and not a glusto-tests or glusto issue; I would suggest you open an issue at [1].

[1] https://github.com/gluster/glusterfs/issues

Hi Barak,

Could you let us know if this issue is seen when running the tests with the latest glusterfs stable build?

Thanks,
Arthy

@aloganat @kshithijiyer
Tested with Gluster 7.7 and the problem is reproducible consistently.
Adding to my previous comment, the two key components required to reproduce this are:
1 - Have Gluster already running on the server nodes.
2 - Invoke some failure in the tests (the easiest way I found is to configure the wrong number of nodes on purpose; see the config sketch after this list).
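
For illustration only, here is one way such a failure could be provoked, assuming the same config layout shown earlier in this issue; the vm4 entry is hypothetical and this is my guess at what "wrong number of nodes" could look like, not necessarily the exact misconfiguration used:

  # Hypothetical misconfiguration: vm4 is listed under servers but does not
  # exist / has no servers_info entry, so setup against it fails and the
  # framework falls into its glusterd restart path.
  servers: [vm1, vm2, vm3, vm4]
  servers_info:
    vm1: {host: vm1, brick_root: /mnt/disk2, devices: [/dev/vdb]}
    vm2: {host: vm2, brick_root: /mnt/disk2, devices: [/dev/vdb]}
    vm3: {host: vm3, brick_root: /mnt/disk2, devices: [/dev/vdb]}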

It seems that the test framework itself is not handling failures well - instead of terminating with an error, it keeps trying to restart Gluster, ending up stuck in an endless loop.
To determine whether a problem also exists on Gluster's side, I'll contact @sanjurakonde, but clearly a problem exists with the test framework itself.

Attaching Glusto logs and in addition Gluster logs from one of the server nodes.
glustomain.log
glusterd.log

@kshithijiyer I see no trace of a glusterd crash in any of the logs here.

The reason for the looping is that a glusterd process is already running on the nodes on which glusto is trying to start glusterd again.
The difference is that the glusterd process which is already running was started using the "glusterd" command, while glusto tries to start it using systemd.

from the logs:
[2020-08-12 08:43:09.431454] I [MSGID: 100030] [glusterfsd.c:2867:main] 0-glusterd: Started running glusterd version 7.7 (args: glusterd) <-- glusterd started using glusterd command

[2020-08-12 08:43:41.978421] I [MSGID: 100030] [glusterfsd.c:2867:main] 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd version 7.7 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO) <-- systemd tried to start glusterd
[2020-08-12 08:43:41.979351] I [glusterfsd.c:2594:daemonize] 0-glusterfs: Pid of current running process is 91272
[2020-08-12 08:43:41.981727] I [MSGID: 106478] [glusterd.c:1426:init] 0-management: Maximum allowed open file descriptors set to 65536
[2020-08-12 08:43:41.981756] I [MSGID: 106479] [glusterd.c:1482:init] 0-management: Using /var/lib/glusterd as working directory
[2020-08-12 08:43:41.981765] I [MSGID: 106479] [glusterd.c:1488:init] 0-management: Using /var/run/gluster as pid file working directory
[2020-08-12 08:43:41.985509] E [socket.c:977:__socket_server_bind] 0-socket.management: binding to failed: Address already in use
[2020-08-12 08:43:41.985543] E [socket.c:979:__socket_server_bind] 0-socket.management: Port is already in use
[2020-08-12 08:43:42.985703] E [socket.c:977:__socket_server_bind] 0-socket.management: binding to failed: Address already in use
[2020-08-12 08:43:42.985750] E [socket.c:979:__socket_server_bind] 0-socket.management: Port is already in use
[2020-08-12 08:43:43.985888] E [socket.c:977:__socket_server_bind] 0-socket.management: binding to failed: Address already in use
[2020-08-12 08:43:43.985931] E [socket.c:979:__socket_server_bind] 0-socket.management: Port is already in use <-- port is used by glusterd process which is already running

This is not an issue in glusterfs. The glusto library which tries to start glusterd should return a failure if it is unable to start the glusterd process; it should not go into a loop.

Thanks,
Sanju
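
To illustrate Sanju's point, here is a minimal sketch (my own illustration, not existing glusto-tests code) of the kind of pre-check that would let the library fail fast: it uses glusto's g.run() to detect a glusterd that is running outside systemd's control. The helper name and the exact shell commands are assumptions on my part.

  from glusto.core import Glusto as g

  def glusterd_running_outside_systemd(server):
      """Hypothetical helper: return True if a glusterd process exists on
      'server' but systemd does not consider the glusterd unit active."""
      # Is any glusterd process running at all?
      ret, _, _ = g.run(server, "pgrep -x glusterd")
      process_present = (ret == 0)

      # Does systemd think the glusterd unit is active?
      ret, _, _ = g.run(server, "systemctl is-active glusterd")
      unit_active = (ret == 0)

      # A process without an active unit means glusterd was started manually,
      # so "systemctl restart glusterd" will keep failing (port already in use).
      return process_present and not unit_active

A caller could run this check once and return failure immediately instead of retrying the restart.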

Glusto-tests uses service/systemctl to restart or start glusterd; please refer to the links. Also, from the logs I see that the glusterd restart fails from the very first restart attempt:

2020-08-12 11:43:45,498 DEBUG (_get_ssh_connection) Retrieved connection from cache: root@vm3
2020-08-12 11:43:55,612 INFO (_log_results) RETCODE (root@vm1): 1
2020-08-12 11:43:55,613 INFO (_log_results) STDOUT (root@vm1)...
Restarting glusterd (via systemctl):  [FAILED]

This means glusterd was started using the glusterd command when the setup was created. If that is the case, this is a setup issue and not a glusto-tests or glusterfs issue.


def restart_glusterd(servers, enable_retry=True):

@BarakSason could you please confirm if you are seeing this same issue on a proper setup?

@kshithijiyer The reason why glusterd failed to restart is irrelevant here.
I asked @sanjurakonde to look into the matter because you mentioned a glusterd crash in your previous comment; Sanju confirmed no crash occurred.

The test framework should NOT be stuck in an infinite loop trying to restart gluster, regardless of the reason why gluster is failing to restart.
If retries are built into the system, that is fine, but the number of restarts should be limited (e.g. 3).
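
For what it's worth, a rough sketch of what a bounded retry could look like, built on glusto's g.run(); this is my own illustration of the request and not the actual restart_glusterd() implementation in glustolibs, and the helper name and retry count are assumptions:

  from glusto.core import Glusto as g

  def restart_glusterd_with_limit(servers, max_attempts=3):
      """Illustrative only: restart glusterd on each server, giving up after
      'max_attempts' failed tries instead of retrying forever."""
      if isinstance(servers, str):
          servers = [servers]

      failed = []
      for server in servers:
          for attempt in range(1, max_attempts + 1):
              ret, _, err = g.run(server, "systemctl restart glusterd")
              if ret == 0:
                  break
              g.log.error("Attempt %d/%d: restart of glusterd failed on %s: %s",
                          attempt, max_attempts, server, err)
          else:
              # All attempts exhausted -- record the failure instead of looping.
              failed.append(server)

      return len(failed) == 0

The exact retry count or backoff is a detail; the point is that the helper eventually returns False so the test run can fail cleanly.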

@BarakSason The framework can't do anything if the setup provided is a stale setup where glusterd was started using the glusterd command and not systemctl start glusterd or service glusterd start, which means it's a setup issue and we can't fix this in the framework, as far as I know.

@aloganat Please share your views.

Hi Barak,

Thanks for the suggestion. It looks like a setup issue, so could you please give it a try with a proper test setup by starting glusterd using systemctl?

For handling the infinite loop, it's an RFE, so please feel free to send a patch to enhance the framework.

Thanks,
Arthy

As per the design implementation, glusterd needs to be started using systemctl. The issue was seen in a setup where glusterd had been started in a different way. Hence, I am closing the issue. If required, please feel free to reopen the issue and send a patch to enhance the framework.

I don't think this issue should have been closed - see my comment from Aug 17th.
The reason why glusterd failed to restart is irrelevant (for the sake of the argument, glusterd could fail to restart due to a bug).
The test framework should NOT be stuck in an infinite loop, no matter the cause.

I have closed the bug since there has been no activity on this since Aug 18 and the behavior is as per the design implementation. Please feel free to reopen and provide a fix for the issue.