ceph/ceph-ansible

site-container.yml fails on 'restart ceph osds daemon(s)'

DavePiperMicrosoft opened this issue · 3 comments

Bug Report

What happened:

Tried to deploy a new Ceph cluster using site-container.yml from the ceph-ansible stable-5.0 branch. Ansible hit a timeout in the "ceph-handler : restart ceph osds daemon(s)" handler while restarting the first OSD.

Error while running 'ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring --cluster ceph -s', PGs were not reported as active+clean

After some time has passed, the cluster does eventually report all PGs as active+clean. Re-running the site-container.yml playbook at that point completes successfully.

What you expected to happen:

site-container.yml will run to completion

How to reproduce it (minimal and precise):

We've seen this several times now, but it is intermittent (roughly 1 in 10 runs), so it possibly depends on some timing window.

Environment:

  • OS (e.g. from /etc/os-release): CentOS 7
  • Kernel (e.g. uname -a): Linux condor_sc0 5.4.119-1.el7.elrepo.x86_64 #1 SMP Fri May 14 06:13:51 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Docker version if applicable (e.g. docker version): 19.03.15
  • Ansible version (e.g. ansible-playbook --version): 2.9.25
  • ceph-ansible version (e.g. git head or tag or stable branch): stable-5.0
    ansible.log
  • Ceph version (e.g. ceph -v): ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
    all_yml.txt
    hosts.txt

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

guits commented

@DavePiperMicrosoft you might need to adapt handler_health_osd_check_delay and/or handler_health_osd_check_retries
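
As a minimal sketch of that suggestion, the two variables could be overridden in the inventory's group variables. The file location (group_vars/all.yml) and the values below are assumptions for illustration only, not project defaults or recommendations. The handler is assumed to wait up to roughly retries × delay seconds for the PGs to report active+clean, so raising either value lengthens that window.

```yaml
# group_vars/all.yml (assumed location for overrides)
# Illustrative values only; tune them to how long your cluster
# takes to settle after an OSD restart.
handler_health_osd_check_retries: 60   # how many times the PG health check is retried
handler_health_osd_check_delay: 30     # seconds to wait between retries
```

The same variables could presumably also be passed for a single run with ansible-playbook's `-e` option, which is convenient when testing whether a longer window avoids the intermittent timeout.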