dm-vdo/kvdo

vdo 6.2.6.3 crash on kernel 4.19

qqshow opened this issue · 22 comments

Hi team,
VDO crashed the system on kernel 4.19.
Here is the crash info:
[135468.438992] CPU: 28 PID: 0 Comm: swapper/28 Kdump: loaded Tainted: G W OE 4.19.36

[135468.438996] x25: 0000000000000000
[135468.438998] pstate: 40400009 (nZcv daif +PAN -UAO)
[135468.439000] pc : __list_del_entry_valid+0x60/0xc8
[135468.439002] x24: 0000000000000105
[135468.439004] lr : __list_del_entry_valid+0x60/0xc8
[135468.439006] sp : ffff0000815dbd40
[135468.439009] x29: ffff0000815dbd40 x28: ffffa02f6be1eb80
[135468.439012] x23: ffff0000811d9000
[135468.439017] x22: ffff00000126ad68
[135468.439018] x27: dead000000000100 x26: ffff0000811d9000
[135468.439022] x21: ffffa02fab0cf400
[135468.439025] x25: dead000000000200
[135468.439028] x20: ffff0000811d9000
[135468.439030] x24: ffffa02f69eb6000
[135468.439036] x19: ffffa02f69c4a380
[135468.439038] x23: dead000000000100 x22: ffffa02fab0cf400
[135468.439041] x18: ffffffffffffffff
[135468.439047] x17: 0000000000000000
[135468.439049] x21: dead000000000200 x20: ffff0000815dbdc8
[135468.439054] x16: ffff0000804e0508
[135468.439061] x19: ffffa02f69eb6498 x18: ffffffffffffffff
[135468.439066] x15: ffff0000811d9708 x14: 6461656420736177
[135468.439069] x17: 0000000000000000 x16: ffff0000804e0508
[135468.439076] x13: 20747562202c3839
[135468.439077] x15: ffff0000811d9708 x14: 3030303030303064
[135468.439081] x12: 3461346339366632
[135468.439086] x13: 6165642820314e4f x12: 53494f505f545349
[135468.439089] x11: 3061666666662065
[135468.439093] x10: 0000000000000000
[135468.439095] x11: 4c20736920747865 x10: 6e3e2d3839343662
[135468.439102] x9 : ffff0000811dbaf0
[135468.439104] x9 : 6539366632306166 x8 : ffff0000813d9a54
[135468.439111] x8 : 00000000001a91e0
[135468.439113] x7 : 0000000000000000
[135468.439116] x7 : ffff0000813b6000
[135468.439118] x6 : 00000c066ce23796
[135468.439121] x6 : 0000000000000001
[135468.439123] x5 : 0000000000000001 x4 : ffff802fbfb9a400
[135468.439131] x5 : 0000000000000001
[135468.439133] x3 : ffff802fbfb9a400 x2 : 0000000000000007
[135468.439139] x4 : ffffa02fbf9e0400
[135468.439144] x1 : 45680c97a7e6cd00 x0 : 0000000000000000
[135468.439147] x3 : ffffa02fbf9e0400
[135468.439150] Call trace:
[135468.439152] x2 : 0000000000000007
[135468.439155] __list_del_entry_valid+0x60/0xc8
[135468.439176] timeoutIndexOperations+0x14c/0x268 [kvdo]
[135468.439180] call_timer_fn+0x34/0x178
[135468.439183] x1 : 45680c97a7e6cd00
[135468.439185] expire_timers+0xec/0x158
[135468.439189] run_timer_softirq+0xc0/0x1f8
[135468.439193] __do_softirq+0x120/0x324
[135468.439194] x0 : 0000000000000000
[135468.439199] irq_exit+0x11c/0x140
[135468.439202] __handle_domain_irq+0x6c/0xc0
[135468.439205] Call trace:
[135468.439207] gic_handle_irq+0x6c/0x150
[135468.439211] __list_del_entry_valid+0xa4/0xc8
[135468.439213] el1_irq+0xb8/0x140
[135468.439217] arch_cpu_idle+0x38/0x1c0
[135468.439221] default_idle_call+0x30/0x4c
[135468.439238] finishIndexOperation+0x150/0x208 [kvdo]
[135468.439241] do_idle+0x22c/0x300
[135468.439250] handleCallbacks+0x48/0x80 [uds]
[135468.439260] requestQueueWorker+0x74/0x400 [uds]
[135468.439263] cpu_startup_entry+0x28/0x30
[135468.439273] threadStarter+0xa0/0xd8 [uds]
[135468.439277] secondary_start_kernel+0x17c/0x1c8
[135468.439282] ---[ end trace 488bf3b169748bff ]---
[135468.439285] kthread+0x134/0x138

Below is the vmcore-dmesg file.
20220314vdolog.zip

The vdo config info:

config: !Configuration
vdos:
dataDB: !VDOService
_operationState: finished
ackThreads: 2
activated: enabled
bioRotationInterval: 64
bioThreads: 6
blockMapCacheSize: 128M
blockMapPeriod: 16380
compression: enabled
cpuThreads: 6
deduplication: enabled
device: /dev/disk/by-id/dm-uuid-LVM-HHcI02NQf6VP6OHSd84FypcdUkND0fTVZ3qdcw5qKO0M61ruXLRxAHyBLec453x6
hashZoneThreads: 4
indexCfreq: 0
indexMemory: 1
indexSparse: enabled
indexThreads: 0
logicalBlockSize: 4096
logicalSize: 10T
logicalThreads: 4
maxDiscardSize: 4K
name: dataDB
physicalSize: 10T
physicalThreads: 4
slabSize: 32G
writePolicy: async
dataFILE: !VDOService
_operationState: finished
ackThreads: 2
activated: enabled
bioRotationInterval: 64
bioThreads: 6
blockMapCacheSize: 128M
blockMapPeriod: 16380
compression: enabled
cpuThreads: 6
deduplication: enabled
device: /dev/disk/by-id/dm-uuid-LVM-HHcI02NQf6VP6OHSd84FypcdUkND0fTV53ChuOvn3e8cDoGoOdHMwgfdjYvePt6x
hashZoneThreads: 4
indexCfreq: 0
indexMemory: 1
indexSparse: enabled
indexThreads: 0
logicalBlockSize: 4096
logicalSize: 91575992M
logicalThreads: 4
maxDiscardSize: 4K
name: dataFILE
physicalSize: 91575992M
physicalThreads: 4
slabSize: 32G
writePolicy: async
dataVM: !VDOService
_operationState: finished
ackThreads: 2
activated: enabled
bioRotationInterval: 64
bioThreads: 6
blockMapCacheSize: 128M
blockMapPeriod: 16380
compression: enabled
cpuThreads: 6
deduplication: enabled
device: /dev/disk/by-id/dm-uuid-LVM-HHcI02NQf6VP6OHSd84FypcdUkND0fTV5OiUTrBefDxxNTdmCfn9rK0SIN3XtVRu
hashZoneThreads: 4
indexCfreq: 0
indexMemory: 1
indexSparse: enabled
indexThreads: 0
logicalBlockSize: 4096
logicalSize: 30T
logicalThreads: 4
maxDiscardSize: 4K
name: dataVM
physicalSize: 30T
physicalThreads: 4
slabSize: 32G
writePolicy: async
version: 538380551

The vdostats info:

vdostats

That seems exciting. Apologies.

[135468.440272] list_del corruption. prev->next should be ffffa02f6baefc98, but was dead000000000100
is a message from your vmcore-dmesg that indicates prev->next's value is LIST_POISON1, which implies the previous member of the list had list_del() called on it, but was still linked to this entry.
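
To make that message concrete, here is a small userspace sketch (not the kvdo or kernel code itself; the list helpers and poison values below are simplified re-creations of include/linux/list.h and poison.h) that reproduces this style of report when an entry is deleted while a stale neighbor pointer still references an already-deleted, poisoned entry:

/* Userspace sketch only: simplified re-creation of the kernel's list_del()
 * poisoning and the __list_del_entry_valid() check that fired here. */
#include <stdbool.h>
#include <stdio.h>

#define LIST_POISON1 ((struct list_head *)0xdead000000000100)
#define LIST_POISON2 ((struct list_head *)0xdead000000000200)

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *head) { head->next = head->prev = head; }

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Mirrors the checks in __list_del_entry_valid(). */
static bool list_del_entry_valid(struct list_head *entry)
{
    if (entry->prev->next != entry) {
        printf("list_del corruption. prev->next should be %p, but was %p\n",
               (void *)entry, (void *)entry->prev->next);
        return false;
    }
    if (entry->next->prev != entry) {
        printf("list_del corruption. next->prev should be %p, but was %p\n",
               (void *)entry, (void *)entry->next->prev);
        return false;
    }
    return true;
}

static void list_del(struct list_head *entry)
{
    if (!list_del_entry_valid(entry))
        return;
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    entry->next = LIST_POISON1;   /* poison so a stale reference is caught */
    entry->prev = LIST_POISON2;
}

int main(void)
{
    struct list_head head, a, b;

    list_init(&head);
    list_add_tail(&a, &head);
    list_add_tail(&b, &head);

    list_del(&a);   /* A is unlinked and its pointers are poisoned */
    b.prev = &a;    /* simulate B still holding a stale pointer to A */
    list_del(&b);   /* prev->next is now LIST_POISON1 -> corruption report */
    return 0;
}

On a 64-bit host this prints the same "prev->next should be ..., but was 0xdead000000000100" shape as the report above.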

DataKVIO *dataKVIO = list_first_entry(&expiredHead, DataKVIO,
and
DedupeContext *dedupeContext = &dataKVIO->dedupeContext;
seem likely to be the relevant pieces of code. It seems to me there's a window where the timer could put a timed-out request on the expired list; then, as the timer processes that list, if a request on the expired list comes back at just the wrong moment on a different thread, that request could see a previous pointer which has just been list_del'd, without a lock, on the timer thread.

Let's call them requests A and B on the expired list. Thread 1 is expiring A and has just deleted A; thread 2 gets the callback for B and list_del's it inside a lock, but views B's old pointer to A and A's freshly deleted state, and therefore explodes.
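
Roughly, in sketch form (the names expired_head, timeout_lock, and struct dedupe_request are illustrative stand-ins, not the real kvdo structures):

/* Hypothetical sketch of the A/B scenario above; names are illustrative. */
#include <pthread.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

struct dedupe_request {
    struct list_head entry;   /* linked on the shared expired list */
};

static pthread_mutex_t timeout_lock = PTHREAD_MUTEX_INITIALIZER;

static void list_del(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    e->next = e->prev = NULL;  /* stand-in for kernel poisoning */
}

/* Thread 1: the timer path expiring A -- imagined here to unlink it
 * without holding timeout_lock.                                     */
static void timer_expires_a(struct dedupe_request *a)
{
    list_del(&a->entry);                    /* step (1) */
}

/* Thread 2: the index callback for B, which sits next to A. */
static void callback_finishes_b(struct dedupe_request *b)
{
    pthread_mutex_lock(&timeout_lock);
    /* step (2): B's prev still points at A, whose pointers were just
     * rewritten in step (1); a checked list_del would report
     * "prev->next should be B, but was ..." at this point.          */
    list_del(&b->entry);
    pthread_mutex_unlock(&timeout_lock);
}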

I'm not sure whether this is feasible though...

@raeburn you might be able to tell me if this is impossible? If it's not, though, this race would still exist on the unstable branch and everything in between.

I think the isPending flag ought to prevent that, but it's possible there's a bug in that somewhere. I'll take a closer look soon…

@raeburn you're right, isPending should indeed be protected by the lock, so the scenario I laid out is impossible...

What if thread 2 processes the callback (not deleting it from any lists, since it's not pending); reuses it for a new request to the index (thus setting isPending again); and then gets a response to that (therefore doing the list_del this time), all before thread 1 gets around to deleting the item off of expiredList? Theoretically, I think thread 1 could then see a mix of pointer values...

@sweettea, that reuse scenario is prevented by dedupeContext->requestState. Even when a data_vio is reused for a subsequent write, its dedupe_context is not (and cannot be) reused until the index request actually completes and calls finishIndexOperation.
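
Roughly, the kind of guard I mean, sketched with an atomic state field (the state names and helpers below are simplified stand-ins, not the actual requestState code):

/* Hypothetical sketch only; not the actual kvdo requestState implementation. */
#include <stdatomic.h>

enum request_state {
    REQ_IDLE,       /* dedupe context free for reuse                   */
    REQ_BUSY,       /* index query outstanding                         */
    REQ_TIMED_OUT,  /* timer gave up waiting, query still outstanding  */
};

struct dedupe_context {
    _Atomic int request_state;
};

/* The timer path only flags the timeout; the context stays unusable. */
static void timeout_path(struct dedupe_context *ctx)
{
    int expected = REQ_BUSY;
    atomic_compare_exchange_strong(&ctx->request_state, &expected,
                                   REQ_TIMED_OUT);
}

/* Only the completion of the index request (finishIndexOperation) returns
 * the context to IDLE, so it cannot be handed to a new request while a
 * query is still in flight.                                              */
static void finish_index_operation(struct dedupe_context *ctx)
{
    atomic_store(&ctx->request_state, REQ_IDLE);
}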

@QQStar, We did see similar kinds of list_del corruption in testing last year, between about March 2021 and June 2021. The failures seemed to suggest some problem in the implementation of the list macros themselves. (The problem appeared without code changes in this project, and similarly disappeared without code changes.) Unfortunately I don't think we ever identified a specific kernel bug that might have caused it. It may be worth making sure your kernel is up-to-date, though.

@lorelei-sakai I have seen similar crash stacks several times, on different kernel and vdo versions: kernel 3.10 and 4.19, with vdo versions 6.1.1.125, 6.1.3.23, and 6.2.6.

@raeburn
@sweettea
Issue #46 reports a similar list_del corruption with vdo version 6.1 on kernel 3.10.
But no luck there: no vmcore was generated because the system hung.

@raeburn @lorelei-sakai @sweettea Please let me know if you need any other information.

I'm not so sure that requestState provides sufficient protection there. It will guard against both timeoutIndexOperations and finishIndexOperation trying to make the same request structure usable again, but timeoutIndexOperations is using the pendingList field to manage its expired-entry list (after having already removed them from the actual pending list), and will unconditionally remove all entries from it, regardless of requestState. Meanwhile finishIndexOperation will move the VIO toward completion and reuse, causing the list structure to be reinitialized, without regard for whether timeoutIndexOperations has yet removed it from its local expired-entry list.

There are other variations that could cause faults too. Simply having added the entry back to the pending list could cause the list_del in timeoutIndexOperations to detect an inconsistency in the prev/next pointers.

Once timeoutIndexOperations has grabbed an entry for its expired list, we can't let finishIndexOperation by itself free that entry up for reuse, so long as timeoutIndexOperations is using it in its list structure. (Of course timeoutIndexOperations alone isn't enough either; as now, we still need to wait for the dedupe request to complete before we can reuse the request structure, lest that response scribble over some later request.)
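
A heavily simplified, hypothetical sketch of the hazard (the names and helpers below are illustrative, not the real kvdo code):

/* Hypothetical sketch of the hazard described above; illustrative only. */
#include <pthread.h>

struct list_head { struct list_head *next, *prev; };

#define LIST_INIT(name) { &(name), &(name) }

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static int  list_empty(const struct list_head *h) { return h->next == h; }

static void __list_del(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
}

static void list_move_tail(struct list_head *e, struct list_head *head)
{
    __list_del(e);
    e->prev = head->prev;
    e->next = head;
    head->prev->next = e;
    head->prev = e;
}

struct data_vio {
    /* One field does double duty: it links the VIO on the shared pending
     * list AND on the timer's private expired list.                     */
    struct list_head pending_entry;
};

static pthread_mutex_t pending_lock = PTHREAD_MUTEX_INITIALIZER;
static struct list_head pending_head = LIST_INIT(pending_head);

/* Timer path: move timed-out entries to a local list, then drain it. */
static void timeout_index_operations(void)
{
    struct list_head expired_head = LIST_INIT(expired_head);

    pthread_mutex_lock(&pending_lock);
    while (!list_empty(&pending_head))
        list_move_tail(pending_head.next, &expired_head);
    pthread_mutex_unlock(&pending_lock);

    /* HAZARD: entries are drained here unconditionally, regardless of any
     * per-request state, while a racing finish_index_operation() may be
     * reinitializing the very same pending_entry field as the data_vio
     * heads toward completion and reuse.                                 */
    while (!list_empty(&expired_head))
        __list_del(expired_head.next);
}

/* Completion path: moves the VIO toward reuse, which reinitializes its
 * list field even if the timer still has it linked on expired_head.   */
static void finish_index_operation(struct data_vio *vio)
{
    list_init(&vio->pending_entry);
}

The point of the sketch is the double duty of the single list field: the completion path can reinitialize it while the timer path still has the entry linked on its private expired list, which is exactly the inconsistency the checked list_del then reports.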

@qqshow I don't think so personally. If you happened to have a vmcore, and a kernel-debuginfo package (or just a version that's publicly available), that might be nice, but it's not strictly necessary and it can contain sensitive information. (My hope would be that we could look through the vios and see whether the bad list_entry is in fact part of a VIO in progress.)

@raeburn @lorelei-sakai might have an idea of something useful though?

vmcore-demsg.zip
Here is the vmcore-dmesg from the machine running vdo 6.1 on kernel 3.10.

Hope it's useful!

@raeburn @sweettea How can we avoid this problem? Can you give some advice?
I tried reducing the vdo threads, but the problem still occurred.
vdo modify --all --vdoAckThreads=1 --vdoBioThreads=2 --vdoCpuThreads=1 --vdoHashZoneThreads=1 --vdoLogicalThreads=1 --vdoPhysicalThreads=1

This problem came up because of a race condition between two paths for handling deduplication advice and incorrect handling of a data structure on one of those paths. (One path for receiving advice, one for timing out waiting to get the advice.)
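
Purely as an illustration of the shape of such a race (this is not the actual fix), one common way to close it is to make both paths agree through a single atomic state transition, so that exactly one side ends up unlinking the entry and finishing the VIO:

/* Illustrative sketch only -- not the committed fix; names are hypothetical. */
#include <stdatomic.h>
#include <stdbool.h>

enum advice_state {
    ADVICE_PENDING,    /* still waiting for the index              */
    ADVICE_RECEIVED,   /* the completion path claimed the request  */
    ADVICE_TIMED_OUT,  /* the timeout path claimed the request     */
};

struct advice_request {
    _Atomic int state;
};

/* Whichever path moves the request out of ADVICE_PENDING first becomes
 * the single owner that unlinks it from the pending list and finishes
 * the VIO; the loser backs off and touches neither the list nor the VIO. */
static bool claim_request(struct advice_request *req, int new_state)
{
    int expected = ADVICE_PENDING;
    return atomic_compare_exchange_strong(&req->state, &expected, new_state);
}

static void advice_received(struct advice_request *req)
{
    if (claim_request(req, ADVICE_RECEIVED)) {
        /* unlink from the pending list and complete with advice */
    }
}

static void advice_timed_out(struct advice_request *req)
{
    if (claim_request(req, ADVICE_TIMED_OUT)) {
        /* unlink from the pending list and complete without advice */
    }
}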

I’ve got a fix for the problem for older branches of our code, but the “unstable” branch was in the middle of getting a significant rewrite of some of that code, which has been checked into our internal repository in the weeks since the last push to github at the end of April. The changes to the code should have fixed this problem along the way, though I haven’t had a chance to look through the code with an eye towards this particular issue and confirm it.

I don’t know how long it’ll be until the newest code is put up on github.

In the meantime, as a workaround, you may be able to reduce the likelihood of hitting the bug by increasing the deduplication timeout via the module parameter “deduplication_timeout_interval”, which is a number of milliseconds and is set to 5000 (5 seconds) by default. It can be set to values up to 120000 (120 seconds).

Raising the timeout can reduce throughput if VDO (specifically the UDS index portion) is being slow for some reason. On the other hand, if it’s taking much more than 5 seconds to get deduplication advice back, your setup is already having problems…

Ah, I mis-"spoke" ... the part of the rewrite with the bug fix isn't in yet, even internally; it's still in the works.

Porting the fix applied to older versions forward would be possible but difficult, due to other changes in the code.

Thank you very much for your reply!

Raising the timeout can reduce throughput if VDO (specifically the UDS index portion) is being slow for some reason. On the other hand, if it’s taking much more than 5 seconds to get deduplication advice back, your setup is already having problems…

@raeburn As you said, I tried increasing "deduplication_timeout_interval", but vdo still got my system stuck.
Do you have any schedule for pushing the fix to github?
Could you give me the fix to test on my system?

The fix for the 6.2 code should be getting pushed to github within the next day or so, actually. The rewrite going into "unstable" will be a while longer in coming, but if you're still on the 4.19 kernel the 6.2 version is the one that matters.

@raeburn Thank you very much! I will test the fix.
72f94c2

@raeburn The kvdo 6.2 version hangs my system occasionally. My system runs the 4.19 kernel.
The problem always occurs under heavy reads.

I have updated to 6.2.8.6.
The kernel messages contain no related information when the hang happens.

Hmm, this sounds like something new we haven’t seen.

Is it hanging everything, or can you still issue commands to parts of the system not hung up waiting on I/O to VDO? It would be useful to know if some of the VDO threads are still running (“top”), and if I/O is still happening to the backing storage device (“sar -d 1”) or if it’s sitting idle.

If you can run commands, these should dump some possibly-useful info into the kernel log:

dmsetup message <VDO device name> 0 dump pools
dmsetup message <VDO device name> 0 dump queues
dmsetup message <VDO device name> 0 dump vdo

The “dmsetup message” command should be able to run on a VDO device even if I/O is hung up somehow.