Replace a failed OSD Drive procedure.
grharry opened this issue · 8 comments
Hello ceph people!
Help required here.
It's not clear at all to me how to replace a failed disk drive (luminous).
In other words, I cannot locate a clear procedure or how-to stating the steps to replace failed drives in a Ceph storage system using ansible.
Am I the only one with this problem?
Regards,
Harry.
Hi @grharry
I use ceph-ansible on an almost weekly basis to replace one of our thousands of drives.
I'm currently running pacific, but I started the cluster off on luminous before eventually upgrading it; the process is basically the same.
You will have issues if your inventory's disk placement is not configured 100% correctly. In my case I used manual disk placements rather than osd_autodiscovery from osds.yml (since osd_autodiscovery is not recommended).
E.g. one node would be in your inventory
[nodepool02]
B-02-40-cephosd.maas osd_objectstore=bluestore devices="[ '/dev/sda', '/dev/sdb', '/dev/sdc', '/dev/sdd', '/dev/sde', '/dev/sdf', '/dev/sdg', '/dev/sdh', '/dev/sdi', '/dev/sdj', '/dev/sdk', '/dev/sdl', '/dev/sdm', '/dev/sdn', '/dev/sdo', '/dev/sdp', '/dev/sdq', '/dev/sdr', '/dev/sds', '/dev/sdt', '/dev/sdu', '/dev/sdv', '/dev/sdw', '/dev/sdx' ]" dedicated_devices="[ '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1' ]"
You will require the same number of dedicated_devices entries (the RocksDB WAL/DB offload) as bluestore devices entries, otherwise you will have issues with the playbook. Note, however, that your cluster must have been built like this from the start: if it is not a 1:1 ratio, you will need to configure it the way you initially built it, i.e. your inventory must match your nodes & disks 100% as in the current setup.
If you're sure everything is correct you can just run:
ansible-playbook -i /opt/ceph-ansible/inventory -e 'ansible_python_interpreter=/usr/bin/python3' infrastructure-playbooks/shrink-osd.yml -e osd_to_kill=1592
Where osd_to_kill is the ID of the faulty OSD. This will format the disk and remove it from CRUSH / from ceph completely; no other action is required.
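As an optional sanity check before running the playbook (shown here against the example OSD id 1592 used above), you can confirm which OSDs are down and where the one you're about to remove lives:
ceph osd tree down        # list only the OSDs that are currently down
ceph osd find 1592        # show the host and CRUSH location of osd.1592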
If the disk is still readable from time to time, i.e. not completely dead, I generally weigh it out manually with ceph osd reweight osd.x 0, which is the equivalent of setting the OSD out. Ceph will still try to read data from the disk while draining it, which makes for quicker backfilling. Once it is drained, I run shrink-osd.yml. In the event that the disk is completely dead, just run shrink-osd.yml as above.
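A rough sketch of how you could watch the drain before shrinking (osd.1592 is just an example id):
ceph -s                               # overall backfill/recovery progress
ceph osd df tree | grep 'osd.1592'    # the PGS count and %USE of the draining OSD should drop towards 0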
If you don't know the faulty OSD and only have the hard drive serial number from IPMI/iDRAC/iLO, you can use ceph device ls | grep 'serial number' to get the corresponding OSD, or run ceph-volume inventory from within one of your osd/mds/mon/mgr docker containers (exec -it ... /bin/bash).
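A minimal sketch of both lookups; the serial number and container name below are hypothetical, list your own containers with docker ps:
ceph device ls | grep 'WD-WX12A3456789'   # hypothetical serial; the matching line shows the osd.N it backs
docker exec -it ceph-osd-12 /bin/bash     # hypothetical container name
ceph-volume inventory                     # run inside the container to list that node's disks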
Once the disk is replaced with a new one you can run:
ansible-playbook -i /opt/ceph-ansible/inventory -e 'ansible_python_interpreter=/usr/bin/python3' site-container.yml --limit=HOSTwithNEWdisk
with --limit= pointing at the ceph OSD node that has the new/replaced disk.
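After the playbook completes, a quick way to confirm the replacement OSD was created and is coming in:
ceph osd tree    # the new osd.<id> should appear under the host with the replaced disk
ceph -s          # backfilling should start as the new OSD weighs in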
If you run site-container.yml and it completes successfully but you don't see your new (replacement) OSD added to ceph, I generally ssh to the host with the new disk, run lsblk -p, find the new disk's /dev value and run dd if=/dev/zero of=/dev/sdX bs=1M to wipe it.
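As a rough sketch of that manual wipe (double-check the device name with lsblk before writing zeros; /dev/sdx here is only an example):
ssh HOSTwithNEWdisk
lsblk -p                                            # identify the new, empty disk
dd if=/dev/zero of=/dev/sdx bs=1M status=progress   # run as root; wipes the old partitions/signatures (status=progress is optional)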
Then proceed to run ansible-playbook -i /opt/ceph-ansible/inventory -e 'ansible_python_interpreter=/usr/bin/python3' site-container.yml --limit=HOSTwithNEWdisk again.
Depending on your environment, the ansible_python_interpreter setting may not be required.
Hope it helps; there is no clear guide, this is from a couple of years of figuring stuff out and assistance from @guits on IRC from time to time.
Update: (I remembered a couple of things)
Take into account that ceph osd reweight osd.x 0 will cause backfilling, and if you run shrink-osd afterwards it will backfill again, since you drained the disk but not its weight in the CRUSH map. You can change the CRUSH weight manually with ceph osd crush reweight osd.x 0, but I don't.
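Side by side, the two knobs mentioned above (osd.1592 as an example id):
ceph osd reweight osd.1592 0         # temporary override weight: drains the OSD, like marking it out
ceph osd crush reweight osd.1592 0   # changes the CRUSH weight itself, so removing the OSD later does not trigger a second backfill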
You can control the speed of the amount of backfills with:
ceph tell osd.* injectargs '--osd_max_backfills 1'
Increasing it will increase the number of PG backfills per OSD, but it puts more strain on the disks and you might get slow pgs if you take it up too high (high on magnetic disks, for me, is anything over 7). From experience, my magnetic disks start to fall over / crash if it goes higher than 8.
For recovery processes you can use:
ceph tell osd.* injectargs '--osd_recovery_max_active 1'
ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
(I think osd_recovery_op_priority goes up to 254.)
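To see what an OSD is currently running with before or after injecting new values (osd.0 as an example; ceph config show needs mimic or newer, on luminous use ceph daemon osd.0 config get ... on that OSD's host instead):
ceph config show osd.0 osd_max_backfills
ceph config show osd.0 osd_recovery_max_active
ceph config show osd.0 osd_recovery_op_priority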
In case you accidentally set it too high and want to revert, a quick way is to run the upmap script from:
https://gitlab.cern.ch/ceph/ceph-scripts/-/blob/master/tools/upmap/upmap-remapped.py
upmap-remapped.py | sh
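If you don't already have the script locally, a minimal sketch of fetching it (the raw URL is derived from the repository link above; you can run it without | sh first to review the ceph osd pg-upmap-items commands it would apply):
curl -sO https://gitlab.cern.ch/ceph/ceph-scripts/-/raw/master/tools/upmap/upmap-remapped.py
chmod +x upmap-remapped.py
./upmap-remapped.py    # review the generated commands, then pipe to sh as above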
In actual fact I always run this script after adding a new disk or "resetting my backfills", so that it brings the PGs back to 100% active+clean and gradually weighs in the new disk.
Also use ceph balancer on and ceph balancer status.
Wow !!!
At LAST!
Thank you so much !!!!
I owe U at least a BEER !!!
Regards,
Harry!
@jeevadotnet I think having as many dedicated_devices as devices is no longer a requirement.
This 1:1 relation disappeared after we dropped ceph-disk support (stable-4.0).
If you take a look at this task in the ceph-facts role, you will see that we use the | unique filter:
ceph-ansible/roles/ceph-facts/tasks/devices.yml, lines 45 to 50 in f288364
@jeevadotnet very useful feedback. I'm considering writing documentation out of it.
Thanks!
Will try it today when rebuilding the testbed with your recommendation as per #7283. However, maybe I did it wrong previously, but I tested luminous and octopus with the many:1 relation and then it only created a partition for my /dev/sda.
Haha, pleasure. You're always great at helping out here and teaching me the 'way of the ceph', so over time I've learned enough to be able to reply to other people's issues.
Hello Again!
I need some clarification.
On my ceph installation (luminous),
I've got 4 OSDs DOWN (waiting for replacement drives),
and my current pg status shows:
5 pgs inactive, 3 pgs down, 2 pgs peering, 12 pgs stale
Querying the down pgs, they seem to be stuck because of the dead OSDs.
Do I proceed with shrink-osd.yml -e osd_to_kill=xxx first, or with ceph pg force_create_pg using the IDs of the down pgs?
Thanks again for your help!
Harry.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.