site-container.yml fails on 'restart ceph osds daemon(s)'

Question

DavePiperMicrosoft opened this issue 3 years ago · 3 comments

DavePiperMicrosoft commented 3 years ago

Bug Report

What happened:

Tried to deploy a new ceph cluster using site-container.yml from ceph-ansible stable-5.0 branch. Ansible hit a timeout during the "ceph-handler : restart ceph osds daemon(s)" stage whilst restarting the first OSD.

Error while running 'ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring --cluster ceph -s', PGs were not reported as active+clean

After some time, the cluster does eventually report all PGs as active+clean. Re-running the site-container.yml playbook at that point completes successfully.

What you expected to happen:

site-container.yml will run to completion

How to reproduce it (minimal and precise):

We've seen this several times now, but it is intermittent, so it possibly depends on some timing window. It happens maybe 1 in 10 times.

Environment:

OS (e.g. from /etc/os-release): CentOS 7
Kernel (e.g. uname -a): Linux condor_sc0 5.4.119-1.el7.elrepo.x86_64 #1 SMP Fri May 14 06:13:51 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
Docker version if applicable (e.g. docker version): 19.03.15
Ansible version (e.g. ansible-playbook --version): 2.9.25
ceph-ansible version (e.g. git head or tag or stable branch): stable-5.0
ansible.log
Ceph version (e.g. ceph -v): ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
all_yml.txt
hosts.txt

Answer 1 · 2021-11-04T20:13:19.000Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

Answer 2 · 2021-11-11T20:13:37.000Z

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

Answer 3 · 2021-12-01T07:57:10.000Z

@DavePiperMicrosoft you might need to adjust the handler_health_osd_check_delay and/or handler_health_osd_check_retries variables.
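
As a rough sketch of what that could look like: the two variables named above can be overridden in the group variables used for the deployment. The file location (group_vars/all.yml) and the values below are assumptions for illustration only, not recommendations; the handler should wait for roughly retries × delay seconds in total before giving up, so pick values that cover how long the PGs take to reach active+clean in your environment.

```yaml
# group_vars/all.yml (assumed location; use wherever your ceph-ansible vars live)

# How many times the OSD restart handler re-checks PG state before failing.
# Illustrative value only.
handler_health_osd_check_retries: 40

# Seconds to wait between two consecutive checks.
# Illustrative value only.
handler_health_osd_check_delay: 30

# With these example values the handler would wait up to about
# 40 * 30 = 1200 seconds (20 minutes) for PGs to become active+clean.
```

Raising the retry count rather than the delay keeps each individual check short, so the handler can move on soon after the cluster settles.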