ceph/ceph-ansible

Can't deploy version 15.2.12 on yocto OS - rocksdb: NotFound: db/ - _read_fsid unparsable uuid

insatomcat opened this issue · 5 comments

What happened:

I'm trying to use ceph-ansible (stable-5.0) to deploy a Ceph cluster on servers running a Yocto OS (ceph version 15.2.12, the version packaged for "honister": https://layers.openembedded.org/layerindex/recipe/192188/).
The OS image already contains the ceph binaries.

The playbook works fine up to the OSD creation step, where it fails with an error I could not find much about:

Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 2 --monmap /var/lib/ceph/osd/ceph-2/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-2/ --osd-uuid 0886ca96-9af5-4381-8f17-7924b1ccf5fd --setuser ceph --setgroup ceph
stderr: 2022-03-01T08:03:56.942+0000 740d98abfd00 -1 bluestore(/var/lib/ceph/osd/ceph-2/) _read_fsid unparsable uuid
stderr: 2022-03-01T08:03:56.984+0000 740d98abfd00 -1 rocksdb: NotFound: db/: No such file or directory
stderr: 2022-03-01T08:03:56.984+0000 740d98abfd00 -1 bluestore(/var/lib/ceph/osd/ceph-2/) _open_db erroring opening db:
stderr: 2022-03-01T08:03:57.456+0000 740d98abfd00 -1 bluestore(/var/lib/ceph/osd/ceph-2/) mkfs failed, (5) Input/output error
stderr: 2022-03-01T08:03:57.456+0000 740d98abfd00 -1 OSD::mkfs: ObjectStore::mkfs failed with error (5) Input/output error
stderr: 2022-03-01T08:03:57.457+0000 740d98abfd00 -1  ** ERROR: error creating empty object store in /var/lib/ceph/osd/ceph-2/: (5) Input/output error
    --> Was unable to complete a new OSD, will rollback changes

What you expected to happen:

With a previous Yocto version (Yocto dunfell, ceph 15.2.0), I did not have this problem.
This might be a Ceph issue, but it could also be a problem with the Yocto integration. I'm trying to understand what is happening so that I can find the root cause (maybe something related to rocksdb?).

How to reproduce it (minimal and precise):

  • create a yocto image with honister including ceph (default version will be 15.2.12)
  • deploy a ceph cluster (mon, osd, mgr on all nodes, 3 in my setup) with ceph-ansible (branch stable-5.0)

Share your group_vars files, inventory and full ceph-ansible log

Environment:

  • OS (e.g. from /etc/os-release): Yocto honister
  • Kernel (e.g. uname -a): 5.15.14-rt27-mainline-rt SMP PREEMPT_RT
  • Docker version if applicable (e.g. docker version): N/A
  • Ansible version (e.g. ansible-playbook --version): ansible-playbook 2.9.6
  • ceph-ansible version (e.g. git head or tag or stable branch): stable-5.0
  • Ceph version (e.g. ceph -v): 15.2.12 (ceph version 128-NOTFOUND (8f69994803975eda09ba6fbec77701982c33af34) octopus (rc))

ansible.log
ceph_group_vars.tar.gz
ceph-ansible-site.yaml.gz

Thanks in advance !

guits commented

At first glance, (5) Input/output error usually means the device is faulty, could you check that?

This is a fully virtual environment: ceph is given "/dev/vdb" to create the OSD, and those disks are brand new qcow2 files (created with qemu-img create -f qcow2 vm1-osd.qcow2 30G).
I think we can rule out a hardware problem...
Thanks.

guits commented

can you show the output of ls -l /var/lib/ceph/osd/ceph-2/ ?

+do you have the full ceph-volume.log ?

I created a brand new Yocto qcow2 image (to be sure there is no corruption). After the playbook failed, I ran the command manually on one node. This is the result:

root@hypervisor1-aure:~# vgremove ceph-49e1dc09-cc76-4b74-b03f-7074e56d55ae
Do you really want to remove volume group "ceph-49e1dc09-cc76-4b74-b03f-7074e56d55ae" containing 1 logical volumes? [y/n]: y
Do you really want to remove active logical volume ceph-49e1dc09-cc76-4b74-b03f-7074e56d55ae/osd-block-5a6fcb4c-9416-4968-9986-de52c531b3b1? [y/n]: y
  Logical volume "osd-block-5a6fcb4c-9416-4968-9986-de52c531b3b1" successfully removed
  Volume group "ceph-49e1dc09-cc76-4b74-b03f-7074e56d55ae" successfully removed

root@hypervisor1-aure:~# ceph-volume --cluster ceph lvm batch --bluestore --yes /dev/vdb
--> DEPRECATION NOTICE
--> You are using the legacy automatic disk sorting behavior
--> The Pacific release will change the default to --no-auto
--> passed data devices: 1 physical, 0 LVM
--> relative data size: 1.0
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new c38afdbd-8bc4-48a2-86c4-531103a9565c
Running command: /usr/sbin/vgcreate --force --yes ceph-e4bb2b09-a0d4-4c86-a999-2b0cff143ea0 /dev/vdb
 stdout: Volume group "ceph-e4bb2b09-a0d4-4c86-a999-2b0cff143ea0" successfully created
Running command: /usr/sbin/lvcreate --yes -l 7679 -n osd-block-c38afdbd-8bc4-48a2-86c4-531103a9565c ceph-e4bb2b09-a0d4-4c86-a999-2b0cff143ea0
 stdout: Wiping ceph_bluestore signature on /dev/ceph-e4bb2b09-a0d4-4c86-a999-2b0cff143ea0/osd-block-c38afdbd-8bc4-48a2-86c4-531103a9565c.
 stdout: Logical volume "osd-block-c38afdbd-8bc4-48a2-86c4-531103a9565c" created.
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-0
--> Executable selinuxenabled not in PATH: /usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin
Running command: /bin/chown -h ceph:ceph /dev/ceph-e4bb2b09-a0d4-4c86-a999-2b0cff143ea0/osd-block-c38afdbd-8bc4-48a2-86c4-531103a9565c
Running command: /bin/chown -R ceph:ceph /dev/dm-0
Running command: /bin/ln -s /dev/ceph-e4bb2b09-a0d4-4c86-a999-2b0cff143ea0/osd-block-c38afdbd-8bc4-48a2-86c4-531103a9565c /var/lib/ceph/osd/ceph-0/block
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-0/activate.monmap
 stderr: got monmap epoch 1
Running command: /usr/bin/ceph-authtool /var/lib/ceph/osd/ceph-0/keyring --create-keyring --name osd.0 --add-key AQAn1h9iME4+IRAAVo0ARiLaZntkfoTSwXEZiA==
 stdout: creating /var/lib/ceph/osd/ceph-0/keyring
added entity osd.0 auth(key=AQAn1h9iME4+IRAAVo0ARiLaZntkfoTSwXEZiA==)
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/keyring
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/
Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-0/ --osd-uuid c38afdbd-8bc4-48a2-86c4-531103a9565c --setuser ceph --setgroup ceph
 stderr: 2022-03-02T20:40:09.202+0000 72602867ed00 -1 bluestore(/var/lib/ceph/osd/ceph-0/) _read_fsid unparsable uuid
 stderr: 2022-03-02T20:40:09.266+0000 72602867ed00 -1 rocksdb: NotFound: db/: No such file or directory
 stderr: 2022-03-02T20:40:09.266+0000 72602867ed00 -1 bluestore(/var/lib/ceph/osd/ceph-0/) _open_db erroring opening db:
 stderr: 2022-03-02T20:40:09.716+0000 72602867ed00 -1 bluestore(/var/lib/ceph/osd/ceph-0/) mkfs failed, (5) Input/output error
 stderr: 2022-03-02T20:40:09.716+0000 72602867ed00 -1 OSD::mkfs: ObjectStore::mkfs failed with error (5) Input/output error
 stderr: 2022-03-02T20:40:09.716+0000 72602867ed00 -1  ** ERROR: error creating empty object store in /var/lib/ceph/osd/ceph-0/: (5) Input/output error
--> Was unable to complete a new OSD, will rollback changes
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new osd.0 --yes-i-really-mean-it
 stderr: purged osd.0
-->  RuntimeError: Command failed with exit code 250: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-0/ --osd-uuid c38afdbd-8bc4-48a2-86c4-531103a9565c --setuser ceph --setgroup ceph

root@hypervisor1-aure:~# ls -l /var/lib/ceph/osd/ceph-0/
total 12
-rw-r--r-- 1 ceph ceph 514 Mar  2 20:40 activate.monmap
lrwxrwxrwx 1 ceph ceph  93 Mar  2 20:40 block -> /dev/ceph-e4bb2b09-a0d4-4c86-a999-2b0cff143ea0/osd-block-c38afdbd-8bc4-48a2-86c4-531103a9565c
-rw-r--r-- 1 ceph ceph   0 Mar  2 20:40 fsid
-rw------- 1 ceph ceph  56 Mar  2 20:40 keyring
-rw------- 1 ceph ceph  10 Mar  2 20:40 type

This is the ceph-volume.log:

ceph-volume.log.gz

Hope this helps...
Thanks

OK, my problem was this bug: https://tracker.ceph.com/issues/49815
It occurs when using a rocksdb version >= 6.15 (Yocto Honister uses v6.20, Yocto Dunfell uses v6.6).

The fix was backported to octopus only as of v15.2.14 (https://tracker.ceph.com/issues/49981), hence my problem with ceph v15.2.12 + rocksdb 6.20.
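
For anyone checking their own image, the version condition described above can be written down as a minimal Python sketch. The function names `parse_version` and `is_affected` are made up for this illustration and are not part of ceph or ceph-ansible, and the check only covers the octopus 15.2.x series, since that is the only series discussed in this issue.

```python
def parse_version(text):
    """Turn a dotted version such as '15.2.12' or 'v6.20' into a tuple of ints."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def is_affected(ceph_version, rocksdb_version):
    """Return True for the combination reported in this issue.

    Assumption: only octopus 15.2.x is evaluated. The thresholds come
    straight from the comment above: rocksdb >= 6.15 triggers the bug,
    and the octopus backport landed in 15.2.14.
    """
    ceph = parse_version(ceph_version)
    rocksdb = parse_version(rocksdb_version)
    is_unfixed_octopus = ceph[:2] == (15, 2) and ceph < (15, 2, 14)
    return is_unfixed_octopus and rocksdb >= (6, 15)


if __name__ == "__main__":
    # Honister combination from this issue (fails at OSD mkfs).
    print(is_affected("15.2.12", "6.20"))
    # Dunfell combination from this issue (worked).
    print(is_affected("15.2.0", "6.6"))
```

Versions are compared as integer tuples rather than strings, so that rocksdb 6.6 correctly sorts below 6.15.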

Thanks.