firelzrd/bore-scheduler

Since upgrading BORE v2.2.3 -> v2.2.6, I am getting soft lockups/RCU stalls in several processes (bash, emerge, python, …)

DiabeticCrab opened this issue · 10 comments

Mai 16 14:26:51 Gentoo kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Mai 16 14:26:51 Gentoo kernel: rcu:         10-....: (21000 ticks this GP) idle=8ba4/1/0x4000000000000000 softirq=2355521/2355521 fqs=4556
Mai 16 14:26:51 Gentoo kernel: rcu:         (t=21001 jiffies g=17817745 q=6862 ncpus=24)
Mai 16 14:26:51 Gentoo kernel: CPU: 10 PID: 53603 Comm: xargs Not tainted 6.3.2-gentoo-BOREv2.2.6 #1
Mai 16 14:26:51 Gentoo kernel: Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F34 07/08/2021
Mai 16 14:26:51 Gentoo kernel: RIP: 0010:sched_fork+0x15e/0x2a0
Mai 16 14:26:51 Gentoo kernel: Code: 00 48 8d bd 38 05 00 00 48 8d 88 b8 fa ff ff 48 39 c7 0f 84 30 01 00 00 31 c0 31 f6 ba 08 00 00 00 c4 e2 eb f7 91 f0 >
Mai 16 14:26:51 Gentoo kernel: RSP: 0018:ffff9dbf9ceefc30 EFLAGS: 00000206
Mai 16 14:26:51 Gentoo kernel: RAX: 00015ea74dbf1618 RBX: ffff8e75794b8000 RCX: ffff8e71b78cbe00
Mai 16 14:26:51 Gentoo kernel: RDX: 0000000000004c19 RSI: 000000009ba209d8 RDI: ffff8e7259f00538
Mai 16 14:26:51 Gentoo kernel: RBP: ffff8e7259f00000 R08: ffff9dbf9ceefbf0 R09: ffff8e75794b82c0
Mai 16 14:26:51 Gentoo kernel: R10: 0000000001200000 R11: 0000000000000000 R12: 0000000000000000
Mai 16 14:26:51 Gentoo kernel: R13: 0000000000000000 R14: ffff9dbf9ceefe10 R15: ffff8e75794b8000
Mai 16 14:26:51 Gentoo kernel: FS:  00007f729989c740(0000) GS:ffff8e785ec80000(0000) knlGS:0000000000000000
Mai 16 14:26:51 Gentoo kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mai 16 14:26:51 Gentoo kernel: CR2: 00007f2b2bf1e72c CR3: 00000005f3b7a000 CR4: 0000000000750ee0
Mai 16 14:26:51 Gentoo kernel: PKRU: 55555554
Mai 16 14:26:51 Gentoo kernel: Call Trace:
Mai 16 14:26:51 Gentoo kernel:  <TASK>
Mai 16 14:26:51 Gentoo kernel:  copy_process+0x9b3/0x24d0
Mai 16 14:26:51 Gentoo kernel:  ? raw_spin_rq_lock_nested+0x5/0x10
Mai 16 14:26:51 Gentoo kernel:  ? newidle_balance.constprop.0+0x223/0x3a0
Mai 16 14:26:51 Gentoo kernel:  ? xa_load+0xa7/0xd0
Mai 16 14:26:51 Gentoo kernel:  kernel_clone+0xba/0x3f0
Mai 16 14:26:51 Gentoo kernel:  __x64_sys_clone+0x84/0xb0
Mai 16 14:26:51 Gentoo kernel:  do_syscall_64+0x5b/0x80
Mai 16 14:26:51 Gentoo kernel:  ? exit_to_user_mode_prepare+0x74/0x110
Mai 16 14:26:51 Gentoo kernel:  entry_SYSCALL_64_after_hwframe+0x4b/0xb5
Mai 16 14:26:51 Gentoo kernel: RIP: 0033:0x7f7299976ae3
Mai 16 14:26:51 Gentoo kernel: Code: 00 00 0f 1f 44 00 00 64 48 8b 04 25 10 00 00 00 45 31 c0 31 d2 31 f6 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 >
Mai 16 14:26:51 Gentoo kernel: RSP: 002b:00007ffdabcdff98 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
Mai 16 14:26:51 Gentoo kernel: RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f7299976ae3
Mai 16 14:26:51 Gentoo kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
Mai 16 14:26:51 Gentoo kernel: RBP: 00007ffdabce00d0 R08: 0000000000000000 R09: 0000000000000000
Mai 16 14:26:51 Gentoo kernel: R10: 00007f729989ca10 R11: 0000000000000246 R12: 0000000000000000
Mai 16 14:26:51 Gentoo kernel: R13: 0000000000000000 R14: 00005581be207220 R15: 000000000000000b
Mai 16 14:26:51 Gentoo kernel:  </TASK>

Kernel version 6.3.2 w/ voluntary preemption. Problems started after 12 hrs of uptime.
Ryzen 5900X. This is without constgran or EEVDF.

I suspect that 21b257b and/or 29eb369 may be the cause, since the stall happened right in sched_fork, i.e. it affected a newly forked child task.

full_dmesg.log
kernel_config.txt

Thank you for the report.

I'm investigating the case based on the provided log and config files.
That characteristic je followed by the two consecutive xor instructions in the Code: line:

  12:	0f 84 30 01 00 00      	je     0x142
  18:	31 c0                  	xor    %eax,%eax
  1a:	31 f6                  	xor    %esi,%esi

indicates that this is probably happening somewhere around

if (likely(p)) {
	if (p->child_burst_last_cached + sched_burst_cache_lifetime < ktime) {
		p->child_burst_last_cached = ktime;

or around

list_for_each_entry(child, &p->children, sibling)

I'll update you with more info as soon as I find anything more.

@firelzrd

Update: I've been reading the commits and saw that the line p->se.burst_time = 0; introduced in 29eb369 went missing in 9ccfa4a without being replaced.

I don't fully understand the code, but by the looks of it, this (unintentionally?) changes program behaviour. I'll test re-adding it as se->burst_time = 0; to match the access pattern of 29eb369 and see if it makes a difference.
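
For reference, the spot I'm re-adding it to looks roughly like this - a sketch only, pieced together from the fragments quoted above and my reading of the code, not the verbatim source:

static void update_task_initial_burst_time(struct task_struct *task) {
	struct sched_entity *se = &task->se;
	struct task_struct *p = task->real_parent;
	u64 ktime = ktime_to_ns(ktime_get());

	se->burst_time = 0;	/* <-- the line from 29eb369, re-added for this test */

	if (likely(p)) {
		if (p->child_burst_last_cached + sched_burst_cache_lifetime < ktime) {
			p->child_burst_last_cached = ktime;
			update_task_child_burst_time_cache(p);
		}
		/* the child's initial burst values are then seeded from
		   p->child_burst_cache (exact lines omitted, as I'm unsure of them) */
	}
}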

Maybe due to the absence of an exclusive lock while crawling through the task's child list?

[41165.952040] RIP: 0010:sched_fork+0x15e/0x2a0
00 48                  	add    %cl,(%rax)
8d bd 38 05 00 00      	lea    0x538(%rbp),%edi
48 8d 88 b8 fa ff ff 	lea    -0x548(%rax),%rcx
48 39 c7               	cmp    %rax,%rdi
0f 84 30 01 00 00      	je     0x142
31 c0                  	xor    %eax,%eax
31 f6                  	xor    %esi,%esi
ba 08 00 00 00         	mov    $0x8,%edx
c4 e2 eb f7            	vpmovzxbd %xmm3,%xmm7
91                     	xchg   %eax,%ecx
f0 00 00 00            	lock add %al,(%rax)
48 01 d0               	add    %rdx,%rax <----------RIP
48 8b 91 48 05 00 00   	mov    0x548(%rcx),%rdx
ff c6                  	inc    %esi
48 8d 8a b8 fa ff ff   	lea    -0x548(%rdx),%rcx
48 39 d7               	cmp    %rdx,%rdi
[41165.952044] RSP: 0018:ffff9dbf9ceefc30 EFLAGS: 00000206
[41165.952046] RAX: 00015ea74dbf1618 RBX: ffff8e75794b8000 RCX: ffff8e71b78cbe00
[41165.952046] RDX: 0000000000004c19 RSI: 000000009ba209d8 RDI: ffff8e7259f00538
[41165.952047] RBP: ffff8e7259f00000 R08: ffff9dbf9ceefbf0 R09: ffff8e75794b82c0
[41165.952047] R10: 0000000001200000 R11: 0000000000000000 R12: 0000000000000000
[41165.952048] R13: 0000000000000000 R14: ffff9dbf9ceefe10 R15: ffff8e75794b8000
[41165.952049] FS:  00007f729989c740(0000) GS:ffff8e785ec80000(0000) knlGS:0000000000000000
[41165.952049] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[41165.952050] CR2: 00007f2b2bf1e72c CR3: 00000005f3b7a000 CR4: 0000000000750ee0
[41165.952050] PKRU: 55555554



[41298.037824] RIP: 0010:sched_fork+0x161/0x2a0
bd 38 05 00 00         	lea    0x538(%rbp),%edi
48 8d 88 b8 fa ff ff   	lea    -0x548(%rax),%rcx
48 39 c7               	cmp    %rax,%rdi
0f 84 30 01 00 00      	je     0x143
31 c0                  	xor    %eax,%eax
31 f6                  	xor    %esi,%esi
ba 08 00 00 00         	mov    $0x8,%edx
c4 e2 eb f7            	vpmovzxbd %xmm3,%xmm7
91                     	xchg   %eax,%ecx
f0 00 00 00            	lock add %al,(%rax)
48 01 d0               	add    %rdx,%rax
48 8b 91 48 05 00 00   	mov    0x548(%rcx),%rdx <----------RIP
ff c6                  	inc    %esi
48 8d 8a b8 fa ff ff   	lea    -0x548(%rdx),%rcx
48 39 d7               	cmp    %rdx,%rdi
75 da                  	jne    0x25
[41298.037829] RSP: 0018:ffff9dbf9ceefc30 EFLAGS: 00000216
[41298.037830] RAX: 0009dd08d91c74b4 RBX: ffff8e75794b8000 RCX: ffff8e71b78cbe00
[41298.037830] RDX: 0000000000004c19 RSI: 000000002e76f0d3 RDI: ffff8e7259f00538
[41298.037831] RBP: ffff8e7259f00000 R08: ffff9dbf9ceefbf0 R09: ffff8e75794b82c0
[41298.037831] R10: 0000000001200000 R11: 0000000000000000 R12: 0000000000000000
[41298.037832] R13: 0000000000000000 R14: ffff9dbf9ceefe10 R15: ffff8e75794b8000
[41298.037832] FS:  00007f729989c740(0000) GS:ffff8e785ec80000(0000) knlGS:0000000000000000
[41298.037833] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[41298.037834] CR2: 00007f2b2bf1e72c CR3: 00000005f3b7a000 CR4: 0000000000750ee0
[41298.037834] PKRU: 55555554



[41480.964272] RIP: 0010:sched_fork+0x16a/0x2a0
fa                     	cli
ff ff                  	jmpq   *%rdi
48 39 c7               	cmp    %rax,%rdi
0f 84 30 01 00 00      	je     0x136
31 c0                  	xor    %eax,%eax
31 f6                  	xor    %esi,%esi
ba 08 00 00 00         	mov    $0x8,%edx
c4 e2 eb f7            	vpmovzxbd %xmm3,%xmm7
91                     	xchg   %eax,%ecx
f0 00 00 00            	lock add %al,(%rax)
48 01 d0               	add    %rdx,%rax
48 8b 91 48 05 00 00   	mov    0x548(%rcx),%rdx
ff c6                  	inc    %esi
48 8d 8a b8 fa ff ff   	lea    -0x548(%rdx),%rcx <----------RIP
48 39 d7               	cmp    %rdx,%rdi
75 da                  	jne    0x1a
31 d2                  	xor    %edx,%edx
85 f6                  	test   %esi,%esi
74 0c                  	je     0x28
31 d2                  	xor    %edx,%edx
48 f7                  	div    %rsi
[41480.964277] RSP: 0018:ffff9dbf9ceefc30 EFLAGS: 00000202
[41480.964278] RAX: 0015c89e99c1fba2 RBX: ffff8e75794b8000 RCX: ffff8e71b78cbe00
[41480.964279] RDX: ffff8e71b78cc348 RSI: 00000000484becf2 RDI: ffff8e7259f00538
[41480.964279] RBP: ffff8e7259f00000 R08: ffff9dbf9ceefbf0 R09: ffff8e75794b82c0
[41480.964280] R10: 0000000001200000 R11: 0000000000000000 R12: 0000000000000000
[41480.964280] R13: 0000000000000000 R14: ffff9dbf9ceefe10 R15: ffff8e75794b8000
[41480.964281] FS:  00007f729989c740(0000) GS:ffff8e785ec80000(0000) knlGS:0000000000000000
[41480.964282] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[41480.964282] CR2: 00007f2b2bf1e72c CR3: 00000005f3b7a000 CR4: 0000000000750ee0
[41480.964283] PKRU: 55555554



[41417.961844] RIP: 0010:sched_fork+0x174/0x2a0
00 00                  	add    %al,(%rax)
31 c0                  	xor    %eax,%eax
31 f6                  	xor    %esi,%esi
ba 08 00 00 00         	mov    $0x8,%edx
c4 e2 eb f7            	vpmovzxbd %xmm3,%xmm7
91                     	xchg   %eax,%ecx
f0 00 00 00            	lock add %al,(%rax)
48 01 d0               	add    %rdx,%rax
48 8b 91 48 05 00 00   	mov    0x548(%rcx),%rdx
ff c6                  	inc    %esi
48 8d 8a b8 fa ff ff   	lea    -0x548(%rdx),%rcx
48 39 d7               	cmp    %rdx,%rdi
75 da                  	jne    0x1a <----------RIP
31 d2                  	xor    %edx,%edx
85 f6                  	test   %esi,%esi
74 0c                  	je     0x28
31 d2                  	xor    %edx,%edx
48 f7 f6               	div    %rsi
48 c1 e0 08            	shl    $0x8,%rax
48 89 c2               	mov    %rax,%rdx
[41417.961849] RSP: 0018:ffff9dbf9ceefc30 EFLAGS: 00000206
[41417.961850] RAX: 0011ac98c8e2e313 RBX: ffff8e75794b8000 RCX: ffff8e71b78cbe00
[41417.961851] RDX: ffff8e71b78cc348 RSI: 0000000075304e0b RDI: ffff8e7259f00538
[41417.961851] RBP: ffff8e7259f00000 R08: ffff9dbf9ceefbf0 R09: ffff8e75794b82c0
[41417.961852] R10: 0000000001200000 R11: 0000000000000000 R12: 0000000000000000
[41417.961852] R13: 0000000000000000 R14: ffff9dbf9ceefe10 R15: ffff8e75794b8000
[41417.961853] FS:  00007f729989c740(0000) GS:ffff8e785ec80000(0000) knlGS:0000000000000000
[41417.961854] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[41417.961854] CR2: 00007f2b2bf1e72c CR3: 00000005f3b7a000 CR4: 0000000000750ee0
[41417.961855] PKRU: 55555554

Maybe due to the absence of an exclusive lock while crawling through the task's child list?

What I noticed is that each time the issue occurred, I was compiling software on all threads, which could point to exactly such a locking, serialization or synchronization issue. Then again, it could also have been plain coincidence that it happened while the system was under load.

I've been non-stop recompiling LLVM and the Linux kernel for the last 2 hrs trying to provoke it again, but so far no luck. This is with the se->burst_time = 0; line restored in sched/core.c: static void update_task_initial_burst_time(struct task_struct *p).

However, the earliest this issue has ever occurred for me was after 4-5 hrs of uptime, so it's too early to tell whether the code change actually helped…

Could you please try this experimental patch?
I've just confirmed that it at least compiles; I'm going to reboot into it now.
As you can see, the function now read-locks tasklist_lock while going through the child processes.
This shouldn't have any observable performance impact, since the function runs at most once every 6 ms per forking parent (the cache-lifetime check quoted earlier rate-limits it), and 6 ms is practically forever to a scheduler.
Reference: https://stackoverflow.com/questions/10021240/significance-of-read-un-locktasklist-lock
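
Put together, the function reads like this with the patch applied (reconstructed from the hunk below; only the num_child declaration isn't visible there and is assumed):

static inline void update_task_child_burst_time_cache(struct task_struct *p) {
	u32 num_child = 0;	/* declaration assumed - it sits above the hunk context */
	u64 sum_burst_time = 0, avg_burst_time = 0;
	struct task_struct *child;

	read_lock(&tasklist_lock);
	list_for_each_entry(child, &p->children, sibling) {
		num_child++;
		sum_burst_time += child->se.max_burst_time >> 8;
	}
	read_unlock(&tasklist_lock);

	if (num_child) avg_burst_time = div_u64(sum_burst_time, num_child) << 8;
	p->child_burst_cache = max(avg_burst_time, p->se.max_burst_time);
}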

From 64522f53f93e5dc43c696e678e6de08bf74b708f Mon Sep 17 00:00:00 2001
From: Masahito S <firelzrd@gmail.com>
Date: Wed, 17 May 2023 05:12:39 +0900
Subject: [PATCH] read_lock(&tasklist_lock) while crawling through children for
 their burst times child_burst_cache and child_burst_last_cached are now reset
 upon fork

---
 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 214aad93d..a80806511 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4426,10 +4426,12 @@ static inline void update_task_child_burst_time_cache(struct task_struct *p) {
 	u64 sum_burst_time = 0, avg_burst_time = 0;
 	struct task_struct *child;
 
+	read_lock(&tasklist_lock);
 	list_for_each_entry(child, &p->children, sibling) {
 		num_child++;
 		sum_burst_time += child->se.max_burst_time >> 8;
 	}
+	read_unlock(&tasklist_lock);
 
 	if (num_child) avg_burst_time = div_u64(sum_burst_time, num_child) << 8;
 	p->child_burst_cache = max(avg_burst_time, p->se.max_burst_time);
@@ -4469,6 +4471,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
 #ifdef CONFIG_SCHED_BORE
+	p->child_burst_cache = 0;
+	p->child_burst_last_cached = 0;
 	p->se.burst_time      = 0;
 #endif // CONFIG_SCHED_BORE
 	p->se.dur_avg			= 0;
-- 
2.25.1

Compiled and running - this time without se->burst_time = 0; re-added, so as not to test several modifications at once (unless you'd explicitly want me to). I'll keep you updated!

Thank you for your kind cooperation.

Not 100% sure, but I assume this issue is caused by entries in the parent task's children list being deleted by other processors (e.g. via the exit syscall) while this task's fork is traversing the list.
If that is indeed how and why it happened, then the issue also affects the EEVDF version.
Now that this hotfix boots fine and no significant performance hit is observed, I'll incorporate it into a new version.
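
The suspected interleaving, roughly sketched (simplified and partly assumed - the exact unlink path on the exit side is my guess, not taken from the logs):

/*
 * CPU A: fork path (unpatched)            CPU B: exit path
 * ----------------------------            ------------------------------
 * list_for_each_entry(child,              write_lock_irq(&tasklist_lock);
 *         &p->children, sibling) {        list_del_init(&child->sibling);
 *     num_child++;                        write_unlock_irq(&tasklist_lock);
 *     ...
 * }
 *
 * If CPU B unlinks the very entry CPU A is standing on, that entry's
 * ->next can end up pointing back to itself, so CPU A keeps chasing
 * pointers that never reach &p->children again - the tight
 * mov/inc/lea/cmp/jne loop visible in the disassembly above.
 */

Consistent with that, RSI in the register dumps above - the incremented child counter - had reached values in the hundreds of millions, which fits an endless list walk far better than any real child count.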

I'll keep this issue open for a while.
Let us hope the problem doesn't persist.

Big thanks goes to you once again, DiabeticCrab.
Please feel free to talk to me regarding any matter.

I think your patch did the trick: I've now had 68 hrs of uptime without any issues. I'll keep monitoring my logs for a little longer, just to be absolutely certain, but I'm optimistic that it's all good now.

Thank you for your quick responses and friendly communication. It has been a pleasure working with you on this! I'll post another update in a couple of days and if nothing happens until then, we can probably close the issue.

I've now had 120 hrs (5 days) of continuous uptime without any issues and thus consider this issue solved.
Again, thank you very much!

Thank YOU for the help and report!
Have a nice day!