site-container.yml fails on 'restart ceph osds daemon(s)'
DavePiperMicrosoft opened this issue · 3 comments
Bug Report
What happened:
Tried to deploy a new ceph cluster using site-container.yml from ceph-ansible stable-5.0 branch. Ansible hit a timeout during the "ceph-handler : restart ceph osds daemon(s)" stage whilst restarting the first OSD.
Error while running 'ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring --cluster ceph -s', PGs were not reported as active+clean
After some time has passed, the cluster does eventually report all PGs as active+clean. Re-running the site-container.yml playbook at this point completes successfully.
What you expected to happen:
site-container.yml will run to completion
How to reproduce it (minimal and precise):
We've seen this several times now, but it is intermittent (roughly 1 in 10 runs), so it possibly depends on some timing window.
Environment:
- OS (e.g. from /etc/os-release): CentOS 7
- Kernel (e.g. uname -a): Linux condor_sc0 5.4.119-1.el7.elrepo.x86_64 #1 SMP Fri May 14 06:13:51 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
- Docker version if applicable (e.g. docker version): 19.03.15
- Ansible version (e.g. ansible-playbook --version): 2.9.25
- ceph-ansible version (e.g. git head or tag or stable branch): stable-5.0
- Ceph version (e.g. ceph -v): ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
ansible.log
all_yml.txt
hosts.txt
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
@DavePiperMicrosoft you might need to adapt handler_health_osd_check_delay and/or handler_health_osd_check_retries
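For anyone landing here with the same timeout, a minimal sketch of what that could look like. It assumes the variables are set in group_vars/all.yml and that the handler polls for active+clean up to `retries` times with `delay` seconds between attempts, so the total wait is roughly retries × delay. The numbers are illustrative only, not recommendations and not necessarily the shipped defaults.

```yaml
# group_vars/all.yml (assumed location; any file ceph-ansible reads variables from should work)
# Illustrative values only. Tune them to how long your PGs actually take
# to return to active+clean after an OSD restart.
handler_health_osd_check_retries: 60   # number of times the PG state is re-checked
handler_health_osd_check_delay: 30     # seconds to wait between checks
# Assuming a simple poll loop, this allows roughly 60 * 30 = 1800 s (30 min) per restart.
```

The same variables can also be passed for a single run as Ansible extra vars, e.g. `-e handler_health_osd_check_retries=60`, which avoids editing the vars file while experimenting.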