Assess the risk of data loss or other side effects of setting start_dirty_degraded flag
There is an md_mod kernel module parameter called start_dirty_degraded. It is documented mostly as a way to start a dirty, degraded bootable RAID array, and as something to be set on the kernel command line.
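For reference, these are the usual ways to set a module parameter like this one (a sketch; whether the sysfs file is writable at runtime depends on the kernel build, so treat that variant as an assumption):
# at boot, on the kernel command line
md_mod.start_dirty_degraded=1
# or persistently, via modprobe configuration
echo "options md_mod start_dirty_degraded=1" > /etc/modprobe.d/md.conf
# or at runtime, if the parameter is exposed as writable
echo 1 > /sys/module/md_mod/parameters/start_dirty_degraded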
Yet I have another scenario: a RAID 5 pool with a thin provisioned volume. The volume is extended iteratively with lvextend as new data arrives, and each extension causes the pool to synchronise data. If I eject a physical disk during the synchronisation process and run lvconvert -y --repair, I expect the volume to stay accessible as usual during the rebuild. Instead, I get messages from md in the kernel log:
cannot start dirty degraded array
A subsequent lvextend operation fails with:
device-mapper: reload ioctl on (253:85) failed: Input/output error
Failed to suspend logical volume poolname/thin_vg.
Finally the volume ends up in the out_of_data state, as it fails to extend any further.
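For context, the layout and workflow are roughly like this (a sketch with illustrative names and sizes, not my exact commands):
# thin pool whose data and metadata LVs are raid5 across 5 PVs
lvcreate --type raid5 -i 4 -L 100G -n pool_data vg
lvcreate --type raid5 -i 4 -L 2G -n pool_meta vg
lvconvert --type thin-pool --poolmetadata vg/pool_meta vg/pool_data
lvcreate -V 500G -T vg/pool_data -n thinvol
# the pool is grown as data arrives; each extension of the raid5 data LV
# starts a resync of the newly added extents
lvextend -L+10G vg/pool_data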
Setting the start_dirty_degraded MD parameter avoids the lvextend failure, and I have not detected any issues while continuing to write data to the thin volume during the rebuild.
In the kernel log I see:
kernel: md/raid:mdX: starting dirty degraded array - data corruption possible.
And that is the main issue: how can I estimate the probability of data loss when I use this flag?
Also, is there any other solution that would not lead to possible data corruption?
Best regards.
In a normal situation you should repair your failing raid volume - likely by adding a new disk to replace the failed one in your RAID5 array.
Follow 'man lvmraid' for info on how to repair a raid volume.
If the repair of the raid volume is failing - provide the full log of your command and the kernel messages, together with the version of your kernel and tooling.
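For completeness, the usual repair sequence from lvmraid(7) looks roughly like this (a sketch; the device path and LV names are placeholders):
# add a replacement PV to the VG, then let lvm rebuild onto it
vgextend vg /dev/replacement_disk
lvconvert --repair vg/raid_lv
# or, without prompts:
lvconvert -y --repair vg/raid_lv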
@zkabelac I have the same question. What is the probability of data corruption? Why does this message pop up at all?
kernel: md/raid:mdX: starting dirty degraded array - data corruption possible.
@zkabelac, thank you. I've read man lvmraid, and it answered some of my questions but not this one.
I did a bit more investigation.
Initially I have a RAID5 array made of 5 disks and a thin provisioned volume.
Once usage exceeds 85% I call:
/usr/sbin/lvextend -L+10737418240B lvmr5
This adds another 10 GiB; lvmr5 is the VG name.
I also tried adding the --nosync flag, expecting that there would be no synchronization after each extension:
/usr/sbin/lvextend -L+10737418240B --nosync lvmr5
Here is an example lvs report during an extension:
[Thu Aug 22 17:57:21 @ ~]:> lvs -a lvmr5
  LV                       VG    Attr       LSize    Pool    Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lvol                     lvmr5 Vwi-aotz-- 1010.00g thin_vg         52.70
  thin_vg                  lvmr5 twi-aotz--  630.00g                 84.50   1.35
  [thin_vg_tdata]          lvmr5 Rwi-aor---  630.00g                                           99.05
  [thin_vg_tdata_rimage_0] lvmr5 Iwi-aor---  157.50g
  [thin_vg_tdata_rimage_1] lvmr5 Iwi-aor---  157.50g
  [thin_vg_tdata_rimage_2] lvmr5 Iwi-aor---  157.50g
  [thin_vg_tdata_rimage_3] lvmr5 Iwi-aor---  157.50g
  [thin_vg_tdata_rimage_4] lvmr5 Iwi-aor---  157.50g
  [thin_vg_tdata_rmeta_0]  lvmr5 ewi-aor---    4.00m
  [thin_vg_tdata_rmeta_1]  lvmr5 ewi-aor---    4.00m
  [thin_vg_tdata_rmeta_2]  lvmr5 ewi-aor---    4.00m
  [thin_vg_tdata_rmeta_3]  lvmr5 ewi-aor---    4.00m
  [thin_vg_tdata_rmeta_4]  lvmr5 ewi-aor---    4.00m
  [thin_vg_tmeta]          lvmr5 ewi-aor---   15.00g                                          100.00
  [thin_vg_tmeta_rimage_0] lvmr5 iwi-aor---    3.75g
  [thin_vg_tmeta_rimage_1] lvmr5 iwi-aor---    3.75g
  [thin_vg_tmeta_rimage_2] lvmr5 iwi-aor---    3.75g
  [thin_vg_tmeta_rimage_3] lvmr5 iwi-aor---    3.75g
  [thin_vg_tmeta_rimage_4] lvmr5 iwi-aor---    3.75g
  [thin_vg_tmeta_rmeta_0]  lvmr5 ewi-aor---    4.00m
  [thin_vg_tmeta_rmeta_1]  lvmr5 ewi-aor---    4.00m
  [thin_vg_tmeta_rmeta_2]  lvmr5 ewi-aor---    4.00m
  [thin_vg_tmeta_rmeta_3]  lvmr5 ewi-aor---    4.00m
  [thin_vg_tmeta_rmeta_4]  lvmr5 ewi-aor---    4.00m
Regardless of whether I add the --nosync flag, there are images with the attribute Iwi-aor---, where the capital I means "image out of sync", and Cpy%Sync indicates that synchronization is in progress.
While synchronization is in progress, I eject one of the physical disks. Here is an example kernel log from right before the disk ejection up to the cannot start dirty degraded array message:
kern.log
A few seconds after the disk is lost, the following commands are run to replace the missing disk with a spare one:
vgreduce --removemissing lvmr5
vgreduce --removemissing --mirrorsonly --force lvmr5
dmsetup remove lvmr5-thin_vg_tdata_rimage_0-missing_0_0
dmsetup remove lvmr5-thin_vg_tdata_rmeta_0-missing_0_0
vgextend lvmr5 /dev/disk/by-id/scsi-35000c500f2b47ae7-part1
lvconvert -y --repair lvmr5/thin_vg_tdata
lvconvert -y --repair lvmr5/thin_vg_tmeta
lvchange --addtag 'rebuild_is_started' lvmr5
lvremove -y /dev/lvmr5/lvol0_pmspare
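To confirm afterwards that the repair actually brings the images back in sync, I check something like the following (the reporting fields are from lvs -o help, and the dm device name is assumed from the naming seen above, so adjust as needed):
# raid health and rebuild progress per LV
lvs -a -o lv_name,lv_attr,copy_percent,lv_health_status lvmr5
# low-level view of the dm-raid data target (device health and sync ratio)
dmsetup status lvmr5-thin_vg_tdata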
I also made a few runs with additional debug messages in the raid5.c MD driver, to check why the pool ends up in the "dirty+degraded" state here:
https://github.com/torvalds/linux/blob/47ac09b91befbb6a235ab620c32af719f8208399/drivers/md/raid5.c#L7990
The dirty_parity_disks variable is 0; this line was never executed:
https://github.com/torvalds/linux/blob/47ac09b91befbb6a235ab620c32af719f8208399/drivers/md/raid5.c#L7972C3-L7972C21
mddev->degraded is 1 after this increment:
https://github.com/torvalds/linux/blob/47ac09b91befbb6a235ab620c32af719f8208399/drivers/md/raid5.c#L717
So I assume the degraded disk is my newly added one (the spare), and there are no "dirty, but parity only" disks, in the sense of this commit message:
torvalds/linux@c148ffd
I doubt this logic works as expected here, because there is no way for a newly added disk not to be degraded.
Also, I suspect, but am not sure, that the mddev->recovery_cp != MaxSector condition is true while Cpy%Sync in the lvs output is less than 100%.
If this is true, then there is no way to replace a faulty disk if the failure happens during a thin provisioned volume extension, before synchronisation is over. But under heavy write load, synchronisation is almost permanent.
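If that hypothesis is right, the only mitigation I can think of is to gate each extension on the array being fully in sync, roughly like this (a sketch; it assumes the pool can be addressed as lvmr5/thin_vg and that a capital I in lv_attr marks an out-of-sync image, as above):
# wait until no sub-LV is still out of sync before extending
while lvs -a --noheadings -o lv_attr lvmr5 | grep -q '^ *I'; do
    sleep 5
done
/usr/sbin/lvextend -L+10737418240B lvmr5/thin_vg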
Tested this with kernel 6.1.77
lvs --version
LVM version: 2.03.11(2) (2021-01-08)
Library version: 1.02.175 (2021-01-08)
Driver version: 4.47.0