
Assess the risk of data loss or other side effects of setting start_dirty_degraded flag


There is an md_mod kernel module parameter called start_dirty_degraded. It is documented mostly as a way to start a dirty, degraded bootable RAID array, and is meant to be used as a kernel command-line parameter.

Yet I have another scenario: a RAID 5 pool with a thin provisioned volume on top. The volume is extended iteratively with lvextend as new data arrives, and each extension causes the pool to synchronise data. If I eject a physical disk during the synchronisation process and start lvconvert -y --repair, I expect the volume to remain accessible as usual during the rebuild. Instead I see messages from md in the kernel log:

cannot start dirty degraded array

A subsequent lvextend operation fails with:

device-mapper: reload ioctl on  (253:85) failed: Input/output error
Failed to suspend logical volume poolname/thin_vg.

Eventually the volume ends up in the out_of_data state, as it fails to extend any further.


Setting the start_dirty_degraded MD parameter avoids the lvextend failure, and I have not detected any issues with continuing to write data to the thin volume during the rebuild process.
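For reference, the two ways I know of to enable it (worth double-checking against the md documentation for your kernel): at boot via the kernel command line, or at runtime through the module parameter in sysfs before the degraded array is assembled:

# on the kernel command line:
md_mod.start_dirty_degraded=1

# or at runtime (as root), before the degraded array is assembled:
echo 1 > /sys/module/md_mod/parameters/start_dirty_degraded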

In the kernel log I see:

kernel: md/raid:mdX: starting dirty degraded array - data corruption possible.

And that's the main question: how can I estimate the probability of data loss when using this flag?

Also, is there any other solution that wouldn't lead to possible data corruption?

Best regards.

In a normal situation you should repair your failing raid volume - typically by adding a new disk to replace the failed one in your RAID5 array.

Follow 'man lvmraid' for info about how to repair a raid volume.
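For reference, the repair flow described in lvmraid(7) is roughly the following, with /dev/sdX standing in for a hypothetical replacement disk and the LV names taken from this issue:

vgextend lvmr5 /dev/sdX
lvconvert -y --repair lvmr5/thin_vg_tdata
lvconvert -y --repair lvmr5/thin_vg_tmeta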

If the repair of the raid volume is failing, provide the full log of your command and the kernel messages, along with the version of your kernel and tooling.

@zkabelac I have the same question. What is the probability of data corruption? Why does this message pop up at all?

kernel: md/raid:mdX: starting dirty degraded array - data corruption possible.

@zkabelac, thank you. I've read man lvmraid, and it answered some of my questions, but not this one.


I did a bit more investigation.

Initially I have a RAID5 array made of 5 disks and a thin provisioned volume.

Once actual data usage exceeds 85% I call:

/usr/sbin/lvextend -L+10737418240B lvmr5

This adds another 10 GiB (lvmr5 is the VG name).

I also tried adding the --nosync flag, expecting that there would be no synchronisation after each extension:

/usr/sbin/lvextend -L+10737418240B --nosync lvmr5
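In essence the extension logic is a simple extend-on-threshold loop; a minimal sketch (simplified, assuming the thin pool LV is lvmr5/thin_vg and a fixed 10 GiB step) looks like this:

# simplified sketch of the extend-on-threshold loop (illustrative only)
while true; do
    used=$(lvs --noheadings -o data_percent lvmr5/thin_vg | tr -d ' ')
    if [ "${used%.*}" -ge 85 ]; then
        /usr/sbin/lvextend -L+10737418240B lvmr5/thin_vg
    fi
    sleep 60
done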

Here is an example lvs report during an extension:

[Thu Aug 22 17:57:21 @ ~]:> lvs -a lvmr5
  LV                       VG    Attr       LSize    Pool    Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lvol                     lvmr5 Vwi-aotz-- 1010.00g thin_vg        52.70                                  
  thin_vg                  lvmr5 twi-aotz--  630.00g                84.50  1.35                            
  [thin_vg_tdata]          lvmr5 Rwi-aor---  630.00g                                       99.05           
  [thin_vg_tdata_rimage_0] lvmr5 Iwi-aor---  157.50g                                                       
  [thin_vg_tdata_rimage_1] lvmr5 Iwi-aor---  157.50g                                                       
  [thin_vg_tdata_rimage_2] lvmr5 Iwi-aor---  157.50g                                                       
  [thin_vg_tdata_rimage_3] lvmr5 Iwi-aor---  157.50g                                                       
  [thin_vg_tdata_rimage_4] lvmr5 Iwi-aor---  157.50g                                                       
  [thin_vg_tdata_rmeta_0]  lvmr5 ewi-aor---    4.00m                                                       
  [thin_vg_tdata_rmeta_1]  lvmr5 ewi-aor---    4.00m                                                       
  [thin_vg_tdata_rmeta_2]  lvmr5 ewi-aor---    4.00m                                                       
  [thin_vg_tdata_rmeta_3]  lvmr5 ewi-aor---    4.00m                                                       
  [thin_vg_tdata_rmeta_4]  lvmr5 ewi-aor---    4.00m                                                       
  [thin_vg_tmeta]          lvmr5 ewi-aor---   15.00g                                       100.00          
  [thin_vg_tmeta_rimage_0] lvmr5 iwi-aor---    3.75g                                                       
  [thin_vg_tmeta_rimage_1] lvmr5 iwi-aor---    3.75g                                                       
  [thin_vg_tmeta_rimage_2] lvmr5 iwi-aor---    3.75g                                                       
  [thin_vg_tmeta_rimage_3] lvmr5 iwi-aor---    3.75g                                                       
  [thin_vg_tmeta_rimage_4] lvmr5 iwi-aor---    3.75g                                                       
  [thin_vg_tmeta_rmeta_0]  lvmr5 ewi-aor---    4.00m                                                       
  [thin_vg_tmeta_rmeta_1]  lvmr5 ewi-aor---    4.00m                                                       
  [thin_vg_tmeta_rmeta_2]  lvmr5 ewi-aor---    4.00m                                                       
  [thin_vg_tmeta_rmeta_3]  lvmr5 ewi-aor---    4.00m                                                       
  [thin_vg_tmeta_rmeta_4]  lvmr5 ewi-aor---    4.00m

Regardless of whether I add the --nosync flag, there are images with attribute Iwi-aor---, where the capital I means "image out of sync", and Cpy%Sync indicates that synchronisation is in progress.
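As a side note, the sync progress can be watched with a plain lvs query (field names as understood for lvm2 2.03; worth verifying):

lvs -a -o lv_name,lv_attr,copy_percent lvmr5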


While synchronisation is in progress, I eject one of the physical disks. Here is an example kernel log from right before the disk ejection up to the cannot start dirty degraded array message:
kern.log

A few seconds after the disk is lost, the following commands are run to replace the missing disk with a spare one:

vgreduce --removemissing lvmr5
vgreduce --removemissing --mirrorsonly --force lvmr5
dmsetup remove lvmr5-thin_vg_tdata_rimage_0-missing_0_0
dmsetup remove lvmr5-thin_vg_tdata_rmeta_0-missing_0_0
vgextend lvmr5 /dev/disk/by-id/scsi-35000c500f2b47ae7-part1
lvconvert -y --repair lvmr5/thin_vg_tdata
lvconvert -y --repair lvmr5/thin_vg_tmeta
lvchange --addtag 'rebuild_is_started' lvmr5
lvremove -y /dev/lvmr5/lvol0_pmspare

I also made a few runs with additional debug messages in the raid5.c MD driver, to check the conditions under which the pool ends up in the "dirty + degraded" state here:
https://github.com/torvalds/linux/blob/47ac09b91befbb6a235ab620c32af719f8208399/drivers/md/raid5.c#L7990
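For context, that check (paraphrased from the linked raid5.c run(); the PPL branch is omitted and the exact code may differ between kernel versions) is roughly:

/* paraphrased check from raid5.c run(), near the linked line */
if (mddev->degraded > dirty_parity_disks &&
    mddev->recovery_cp != MaxSector) {
        if (mddev->ok_start_degraded)   /* set from md_mod.start_dirty_degraded */
                pr_crit("md/raid:%s: starting dirty degraded array - data corruption possible.\n",
                        mdname(mddev));
        else {
                pr_crit("md/raid:%s: cannot start dirty degraded array.\n",
                        mdname(mddev));
                goto abort;
        }
}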

The dirty_parity_disks variable is 0; this line was never executed:
https://github.com/torvalds/linux/blob/47ac09b91befbb6a235ab620c32af719f8208399/drivers/md/raid5.c#L7972C3-L7972C21

mddev->degraded is 1 after this increment:
https://github.com/torvalds/linux/blob/47ac09b91befbb6a235ab620c32af719f8208399/drivers/md/raid5.c#L717

So I assume that the degraded disk is my newly added one (the spare), and that there are no disks that are dirty but hold only parity, according to this commit message:
torvalds/linux@c148ffd

I doubt this logic works as expected, because there is no way for a newly added disk not to be degraded.

Also I suspect, but am not sure, that the mddev->recovery_cp != MaxSector condition holds while Cpy%Sync in the lvs output is less than 100%.

If this is true, then there is no way to replace a faulty disk if the failure happens during a thin provisioned volume extension, before synchronisation is over. And under heavy write load, synchronisation is almost permanent.


Tested this with kernel 6.1.77

lvs --version
  LVM version:     2.03.11(2) (2021-01-08)
  Library version: 1.02.175 (2021-01-08)
  Driver version:  4.47.0