sched-ext/scx

scx_bpfland: race with p->cpus_ptr

arighi opened this issue · 21 comments

@sirlucjan found an interesting bug with bpfland, but it can potentially affect any scheduler that uses per-CPU DSQs.

In short, under certain rare conditions a task can be dispatched to a per-CPU DSQ even though the task's cpumask doesn't allow it to run on that particular CPU; the task is then never consumed from the DSQ, so we get a stall.

Example:

paź 02 22:24:03 cachyos bash[612]:   R ksoftirqd/2[34] -7430ms
paź 02 22:24:03 cachyos bash[612]:       scx_state/flags=3/0x9 dsq_flags=0x1 ops_state/qseq=0/0
paź 02 22:24:03 cachyos bash[612]:       sticky/holding_cpu=-1/-1 dsq_id=0x4 dsq_vtime=51313284698144
paź 02 22:24:03 cachyos bash[612]:       cpus=4
paź 02 22:24:03 cachyos bash[612]:     kthread+0x24c/0x290
paź 02 22:24:03 cachyos bash[612]:     ret_from_fork+0x37/0x50
paź 02 22:24:03 cachyos bash[612]:     ret_from_fork_asm+0x1a/0x30
paź 02 22:24:03 cachyos bash[612]:   R kworker/2:2[302578] -1800ms
paź 02 22:24:03 cachyos bash[612]:       scx_state/flags=3/0x9 dsq_flags=0x1 ops_state/qseq=0/0
paź 02 22:24:03 cachyos bash[612]:       sticky/holding_cpu=-1/-1 dsq_id=0x4 dsq_vtime=51322424785271
paź 02 22:24:03 cachyos bash[612]:       cpus=4
paź 02 22:24:03 cachyos bash[612]:     kthread+0x24c/0x290
paź 02 22:24:03 cachyos bash[612]:     ret_from_fork+0x37/0x50
paź 02 22:24:03 cachyos bash[612]:     ret_from_fork_asm+0x1a/0x30

In this case ksoftirqd/2 was dispatched to DSQ 4, but according to its cpumask (cpus=4, i.e. only bit 2 set) it can only run on CPU 2. Same for kworker/2 (and it's interesting to notice that in this trace two kthreads were affected by the issue).

In theory this should never happen: ksoftirqd should never change its cpumask (maybe the kworker can?), and even if it did, we always check whether the CPU is allowed before dispatching to the DSQ:

if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {

The direct dispatch to the per-CPU DSQ (dispatch_direct_cpu()) is used only from ops.select_cpu() or ops.enqueue(), but @sirlucjan also tested a version that was doing direct dispatch only from ops.select_cpu() and he could still reproduce the bug.
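For reference, the direct-dispatch path looks roughly like this (a minimal sketch, not the exact bpfland code; cpu_to_dsq() is the scheduler's own CPU-to-DSQ-id helper and the slice/vtime arguments are illustrative):

static int dispatch_direct_cpu(struct task_struct *p, s32 cpu, u64 enq_flags)
{
        /* Refuse to queue the task on a per-CPU DSQ that its cpumask doesn't allow */
        if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
                return -EINVAL;

        scx_bpf_dispatch_vtime(p, cpu_to_dsq(cpu), SCX_SLICE_DFL,
                               p->scx.dsq_vtime, enq_flags);
        return 0;
}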

The only reasonable explanation I see is that under certain rare conditions p->cpus_ptr is unreliable. It could be racy while the task changes its cpumask, but that still doesn't explain the ksoftirqd case, which in theory should never change its cpumask (so even re-checking the validity of the CPU after dispatch won't help much, I think).

The other problem is that this bug seems to be extremely rare (only @sirlucjan was able to reproduce it after 2-3 hours of intense stress tests), so it's quite difficult to debug.

Any pointers / ideas / suggestions? I'm running a bit out of ideas at the moment...

I need to do more stress tests (and I'm not entirely sure this is a proper fix), but with this kernel patch applied I can't trigger the kthread stalls anymore:

arighi/sched_ext@30e5b1c

I'm sharing the patch here for now, since it might be useful for discussion. If it makes sense I'll send it in a proper way.

I'd just open code that in ext.c. Can you share the theory of the bug?

I'd just open code that in ext.c. Can you share the theory of the bug?

What I see is that some kthreads are dispatched to DSQs that have nothing to do with the kthread's assigned CPU, so I was assuming that task_cpu() might not be perfectly synchronized with kthread->cpu in some cases, hence the patch above. But I'm still trying to understand the code and I'm not sure this is really the case.

The point is: how can a per-CPU kthread end up being dispatched to a DSQ associated with a CPU that has nothing to do with the kthread? Could it be that prev_cpu is unreliable in some cases? But even if prev_cpu is no longer usable when the task is re-enqueued, I'm still checking its validity against p->cpus_ptr, so I still don't understand how the kthread ends up in the wrong DSQ...

CPU 9   : nr_run=2 flags=0x1 cpu_rel=0 ops_qseq=500351 pnt_seq=458771
          curr=ld.lld[107123] class=ext_sched_class

 *R ld.lld[107123] -1ms
      scx_state/flags=3/0x5 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=(n/a) dsq_vtime=2754064726029
      cpus=ffff

  R kworker/9:1[71844] -6444ms
      scx_state/flags=3/0x9 dsq_flags=0x1 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x10 dsq_vtime=2753085541508
      cpus=0200

    kthread+0xcf/0x100
    ret_from_fork+0x31/0x50
    ret_from_fork_asm+0x1a/0x30

^ This one, for example: kworker/9 was dispatched to dsq_id=0x10, which can be consumed only by CPU 16, but kworker/9 of course can't run on CPU 16 => stall.

Also kworker/9 shouldn't change its affinity, so I don't think it's even a race with p->cpus_ptr...

Actually, kworkers can definitely change their affinity, so maybe it's all a race between prev_cpu and the cpumask: a kthread is dispatched to a DSQ that can be consumed only by a specific CPU, then kthread_set_per_cpu() is called to bind it to a different CPU and the cpumask is updated, and at that point it's too late and we hit a stall.

If this scenario is possible, we can't prevent it from happening and we can only take action after the fact, such as using a bpf_for_each() to iterate through all tasks in a DSQ, identify those that shouldn't be there, and then migrate them elsewhere.

A cpumask update always dequeues and re-enqueues, even for kthreads (kthread_set_per_cpu() eventually uses kthread_bind(), which does do_set_cpus_allowed(), which does all the usual stuff), and, as long as a task is on a DSQ, the sched_ext kernel side should always execute the dequeues correctly. Maybe prev_cpu could be out of sync with the cpumask while it's being updated?

Maybe this is what happens (this is in ops.select_cpu() / ops.enqueue()):

 bpf_cpumask_test_cpu(prev_cpu, p->cpus_ptr)
 => cpumask updated (prev_cpu becomes invalid and there's no dequeue, because the task hasn't been dispatched yet)
 scx_bpf_dispatch_vtime(p, cpu_to_dsq(prev_cpu), ...)
 => stall

Therefore we may need to do something like this:

 bpf_cpumask_test_cpu(prev_cpu, p->cpus_ptr)
 scx_bpf_dispatch_vtime(p, cpu_to_dsq(prev_cpu), ...)
 if (!bpf_cpumask_test_cpu(prev_cpu, p->cpus_ptr)) {
     // remove p from the DSQ and dispatch it either to a valid CPU or to the shared DSQ
 }

And I guess I need to use a BPF iterator and scx_bpf_dispatch_from_dsq(), because this is happening in ops.select_cpu() / ops.enqueue(), so I can't use scx_bpf_dispatch_cancel().
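Concretely, something along these lines (just a sketch under the assumptions above: cpu_to_dsq() and shared_dsq_id are the scheduler's own helper/DSQ id, and BPF_FOR_EACH_ITER is the DSQ iterator handle from the scx common headers):

static void bounce_misplaced_tasks(s32 cpu)
{
        struct task_struct *p;

        /* Walk the per-CPU DSQ and move away tasks that can no longer run here */
        bpf_for_each(scx_dsq, p, cpu_to_dsq(cpu), 0) {
                if (bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
                        continue;
                /* Bounce the misplaced task to the shared DSQ, so any CPU can pick it up */
                scx_bpf_dispatch_from_dsq(BPF_FOR_EACH_ITER, p, shared_dsq_id, 0);
        }
}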

Does it make sense?

I'll look into it further but yeah prev_cpu being out of sync with cpumask makes sense to me as a scenario. I wonder whether the kernel side should always guarantee that prev_cpu is within the current cpus_ptr.

Also, I don't think you'd need to cancel the task. If the task was queued on a DSQ, it'd be dequeued to change the cpumask anyway and if not, there's nothing to cancel.

I'll look into it further but yeah prev_cpu being out of sync with cpumask makes sense to me as a scenario. I wonder whether the kernel side should always guarantee that prev_cpu is within the current cpus_ptr.

That'd be nice, and it would more closely mimic the in-kernel select_task_rq(); it could even return cpumask_any(p->cpus_ptr) when prev_cpu isn't valid anymore. In the end it doesn't need to be 100% perfect, but it should at least provide a valid CPU, so that we can prevent issues like this in the future.
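In BPF terms, the scheduler-side equivalent of that fallback would be something like this (just a sketch; bpf_cpumask_any_distribute() is assumed to be available as a cpumask kfunc on the target kernel):

static s32 validate_prev_cpu(const struct task_struct *p, s32 prev_cpu)
{
        if (bpf_cpumask_test_cpu(prev_cpu, p->cpus_ptr))
                return prev_cpu;

        /* prev_cpu is stale: fall back to any CPU the task is allowed to use */
        return (s32)bpf_cpumask_any_distribute(p->cpus_ptr);
}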

Also, I don't think you'd need to cancel the task. If the task was queued on a DSQ, it'd be dequeued to change the cpumask anyway and if not, there's nothing to cancel.

Hm... but if the cpumask changes before the dispatch (yet after checking prev_cpu), then I dispatch to the invalid DSQ and at that point I need to cancel it, no? I can't just do another dispatch to a different DSQ (that would be a double dispatch, which should trigger an error I think). Am I missing something?

Whenever cpumask changes, the kernel side dequeues the task, updates the cpumask and re-enqueues it. If a task is on a DSQ, when a task gets dequeued, it gets auto-removed from the DSQ. So, when the task gets re-enqueued after the cpumask update, the scheduler gets to do a brand new enqueue again.

So there is a re-enqueue even if the task hasn't been dispatched to any DSQ yet, right?

Yes, it will get auto-dequeued from the current DSQ and then get re-enqueued.

Wait... then I don't understand why the current code is affected by the stall.

Basically we have this (simplifying a bit, but the order of operations is pretty much the following):

  1. check validity of prev_cpu
  2. dispatch to prev_cpu DSQ
  3. return

If the cpumask changes before 1. or between 2. and 3., all good. If the cpumask changes between 1. and 2., the task is not in any DSQ (there is no current DSQ), but we still get another re-enqueue event that will do another dispatch, so which one wins? I guess the first one, which is invalid => stall (so in this case we need to cancel the first dispatch).

Otherwise, if the second one wins, then we shouldn't get the stall condition, because the first dispatch will be automatically overridden by the second one. Maybe I'm missing something...

So, during a normal cpumask update, prev_cpu should match the new cpumask on the re-enqueue. The kthread_bind path might not match that (I haven't checked thoroughly, but that seems possible). So, it's not a race condition per se, but rather the kernel providing a prev_cpu value that doesn't match the cpumask.

OK, I'll do some tests; I think I now have a better understanding of what is happening. Thanks a ton for the help @htejun, and BTW this is also really good documentation material.

Yeah, if this actually is the case, I think we just need to fix the kernel so that we always provide a prev_cpu which is in the allowed CPUs. The current behavior isn't intentional; it's mostly just happenstance, as none of this mattered to the in-kernel scheduler.

Well, checking the validity of the CPU after the dispatch doesn't make any difference: same stall issue, even with the logic to bounce the task to the shared DSQ (even if it shouldn't really be needed, I wanted to make sure).

  R kworker/1:20[5522] -5312ms
      scx_state/flags=3/0x9 dsq_flags=0x1 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x10 dsq_vtime=2867286468845
      cpus=0002

Back to the drawing board...

Full error (current bpfland-next)

paź 10 11:51:04 cachyos systemd[1]: Started Start scx_scheduler.
paź 10 11:51:04 cachyos bash[722]: 09:51:04 [INFO] scx_bpfland 1.0.5-g59cfd40-dirty x86_64-unknown-linux-gnu SMT on
paź 10 11:51:04 cachyos bash[722]: 09:51:04 [INFO] primary CPU domain = 0xffff
paź 10 11:51:04 cachyos bash[722]: 09:51:04 [INFO] cpufreq performance level: max
paź 10 11:51:04 cachyos bash[722]: 09:51:04 [INFO] L2 cache ID 6: sibling CPUs: [6, 14]
paź 10 11:51:04 cachyos bash[722]: 09:51:04 [INFO] L2 cache ID 1: sibling CPUs: [1, 9]
paź 10 11:51:04 cachyos bash[722]: 09:51:04 [INFO] L2 cache ID 4: sibling CPUs: [4, 12]
paź 10 11:51:04 cachyos bash[722]: 09:51:04 [INFO] L2 cache ID 2: sibling CPUs: [2, 10]
paź 10 11:51:04 cachyos bash[722]: 09:51:04 [INFO] L2 cache ID 3: sibling CPUs: [3, 11]
paź 10 11:51:04 cachyos bash[722]: 09:51:04 [INFO] L2 cache ID 5: sibling CPUs: [5, 13]
paź 10 11:51:04 cachyos bash[722]: 09:51:04 [INFO] L2 cache ID 7: sibling CPUs: [7, 15]
paź 10 11:51:04 cachyos bash[722]: 09:51:04 [INFO] L2 cache ID 0: sibling CPUs: [0, 8]
paź 10 11:51:04 cachyos bash[722]: 09:51:04 [INFO] L3 cache ID 0: sibling CPUs: [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]
paź 10 12:29:29 cachyos bash[722]: DEBUG DUMP
paź 10 12:29:29 cachyos bash[722]: ================================================================================
paź 10 12:29:29 cachyos bash[722]: kworker/u64:5[70732] triggered exit kind 1026:
paź 10 12:29:29 cachyos bash[722]:   runnable task stall (kworker/1:225[39010] failed to run for 5.632s)
paź 10 12:29:29 cachyos bash[722]: Backtrace:
paź 10 12:29:29 cachyos bash[722]:   scx_watchdog_workfn+0x119/0x190
paź 10 12:29:29 cachyos bash[722]:   process_scheduled_works+0x208/0x470
paź 10 12:29:29 cachyos bash[722]:   worker_thread+0x349/0x4f0
paź 10 12:29:29 cachyos bash[722]:   kthread+0x14a/0x170
paź 10 12:29:29 cachyos bash[722]:   ret_from_fork+0x37/0x50
paź 10 12:29:29 cachyos bash[722]:   ret_from_fork_asm+0x1a/0x30
paź 10 12:29:29 cachyos bash[722]: CPU states
paź 10 12:29:29 cachyos bash[722]: ----------
paź 10 12:29:29 cachyos bash[722]: CPU 1   : nr_run=3 flags=0x1 cpu_rel=0 ops_qseq=47771 pnt_seq=804627
paź 10 12:29:29 cachyos bash[722]:           curr=git[71042] class=ext_sched_class
paź 10 12:29:29 cachyos bash[722]:  *R git[71042] +0ms
paź 10 12:29:29 cachyos bash[722]:       scx_state/flags=3/0x5 dsq_flags=0x0 ops_state/qseq=0/0
paź 10 12:29:29 cachyos bash[722]:       sticky/holding_cpu=-1/-1 dsq_id=(n/a) dsq_vtime=1743036489827
paź 10 12:29:29 cachyos bash[722]:       cpus=ffff
paź 10 12:29:29 cachyos bash[722]:   R kworker/1:225[39010] -5632ms
paź 10 12:29:29 cachyos bash[722]:       scx_state/flags=3/0x9 dsq_flags=0x1 ops_state/qseq=0/0
paź 10 12:29:29 cachyos bash[722]:       sticky/holding_cpu=-1/-1 dsq_id=0x1 dsq_vtime=1739465749216
paź 10 12:29:29 cachyos bash[722]:       cpus=0002
paź 10 12:29:29 cachyos bash[722]:     kthread+0x14a/0x170
paź 10 12:29:29 cachyos bash[722]:     ret_from_fork+0x37/0x50
paź 10 12:29:29 cachyos bash[722]:     ret_from_fork_asm+0x1a/0x30
paź 10 12:29:29 cachyos bash[722]:   R ksoftirqd/1[28] -5127ms
paź 10 12:29:29 cachyos bash[722]:       scx_state/flags=3/0x9 dsq_flags=0x1 ops_state/qseq=0/0
paź 10 12:29:29 cachyos bash[722]:       sticky/holding_cpu=-1/-1 dsq_id=0x1 dsq_vtime=1740107717031
paź 10 12:29:29 cachyos bash[722]:       cpus=0002
paź 10 12:29:29 cachyos bash[722]:     kthread+0x14a/0x170
paź 10 12:29:29 cachyos bash[722]:     ret_from_fork+0x37/0x50
paź 10 12:29:29 cachyos bash[722]:     ret_from_fork_asm+0x1a/0x30
paź 10 12:29:29 cachyos bash[722]: CPU 5   : nr_run=1 flags=0x1 cpu_rel=0 ops_qseq=35729 pnt_seq=696541
paź 10 12:29:29 cachyos bash[722]:           curr=kworker/u64:5[70732] class=ext_sched_class
paź 10 12:29:29 cachyos bash[722]:  *R kworker/u64:5[70732] +0ms
paź 10 12:29:29 cachyos bash[722]:       scx_state/flags=3/0xd dsq_flags=0x0 ops_state/qseq=0/0
paź 10 12:29:29 cachyos bash[722]:       sticky/holding_cpu=-1/-1 dsq_id=(n/a) dsq_vtime=1746472757576
paź 10 12:29:29 cachyos bash[722]:       cpus=ffff
paź 10 12:29:29 cachyos bash[722]:     scx_dump_state+0x63f/0x830
paź 10 12:29:29 cachyos bash[722]:     scx_ops_error_irq_workfn+0x23/0x50
paź 10 12:29:29 cachyos bash[722]:     irq_work_run+0xa8/0x140
paź 10 12:29:29 cachyos bash[722]:     __sysvec_irq_work+0x1c/0xc0
paź 10 12:29:29 cachyos bash[722]:     sysvec_irq_work+0x6c/0x80
paź 10 12:29:29 cachyos bash[722]:     asm_sysvec_irq_work+0x1a/0x20
paź 10 12:29:29 cachyos bash[722]:     scx_watchdog_workfn+0x137/0x190
paź 10 12:29:29 cachyos bash[722]:     process_scheduled_works+0x208/0x470
paź 10 12:29:29 cachyos bash[722]:     worker_thread+0x349/0x4f0
paź 10 12:29:29 cachyos bash[722]:     kthread+0x14a/0x170
paź 10 12:29:29 cachyos bash[722]:     ret_from_fork+0x37/0x50
paź 10 12:29:29 cachyos bash[722]:     ret_from_fork_asm+0x1a/0x30
paź 10 12:29:29 cachyos bash[722]: CPU 12  : nr_run=1 flags=0x1 cpu_rel=0 ops_qseq=35886 pnt_seq=694380
paź 10 12:29:29 cachyos bash[722]:           curr=kwin_wayla:cs0[1016] class=ext_sched_class
paź 10 12:29:29 cachyos bash[722]:  *R kwin_wayla:cs0[1016] -1ms
paź 10 12:29:29 cachyos bash[722]:       scx_state/flags=3/0xd dsq_flags=0x0 ops_state/qseq=0/0
paź 10 12:29:29 cachyos bash[722]:       sticky/holding_cpu=-1/-1 dsq_id=(n/a) dsq_vtime=1746472757576
paź 10 12:29:29 cachyos bash[722]:       cpus=ffff
paź 10 12:29:29 cachyos bash[722]: CPU 14  : nr_run=1 flags=0x1 cpu_rel=0 ops_qseq=37311 pnt_seq=687254
paź 10 12:29:29 cachyos bash[722]:           curr=kworker/u64:3[14642] class=ext_sched_class
paź 10 12:29:29 cachyos bash[722]:  *R kworker/u64:3[14642] +0ms
paź 10 12:29:29 cachyos bash[722]:       scx_state/flags=3/0xd dsq_flags=0x0 ops_state/qseq=0/0
paź 10 12:29:29 cachyos bash[722]:       sticky/holding_cpu=-1/-1 dsq_id=(n/a) dsq_vtime=1746473994261
paź 10 12:29:29 cachyos bash[722]:       cpus=ffff
paź 10 12:29:29 cachyos bash[722]: CPU 15  : nr_run=0 flags=0x1 cpu_rel=0 ops_qseq=47110 pnt_seq=849075
paź 10 12:29:29 cachyos bash[722]:           curr=kwin_wayland[984] class=rt_sched_class
paź 10 12:29:29 cachyos bash[722]: ================================================================================
paź 10 12:29:29 cachyos bash[722]: 10:29:29 [INFO] Unregister scx_bpfland scheduler
paź 10 12:29:29 cachyos bash[722]: Error: EXIT: runnable task stall (kworker/1:225[39010] failed to run for 5.632s)
paź 10 12:29:29 cachyos systemd[1]: scx.service: Main process exited, code=exited, status=1/FAILURE
paź 10 12:29:29 cachyos systemd[1]: scx.service: Failed with result 'exit-code'.
paź 10 12:29:29 cachyos systemd[1]: scx.service: Scheduled restart job, restart counter is at 1.
paź 10 12:29:29 cachyos systemd[1]: Started Start scx_scheduler.
paź 10 12:29:29 cachyos bash[71244]: 10:29:29 [INFO] scx_bpfland 1.0.5-g59cfd40-dirty x86_64-unknown-linux-gnu SMT on
paź 10 12:29:29 cachyos bash[71244]: 10:29:29 [INFO] primary CPU domain = 0xffff
paź 10 12:29:29 cachyos bash[71244]: 10:29:29 [INFO] cpufreq performance level: max
paź 10 12:29:29 cachyos bash[71244]: 10:29:29 [INFO] L2 cache ID 6: sibling CPUs: [6, 14]
paź 10 12:29:29 cachyos bash[71244]: 10:29:29 [INFO] L2 cache ID 2: sibling CPUs: [2, 10]
paź 10 12:29:29 cachyos bash[71244]: 10:29:29 [INFO] L2 cache ID 4: sibling CPUs: [4, 12]
paź 10 12:29:29 cachyos bash[71244]: 10:29:29 [INFO] L2 cache ID 0: sibling CPUs: [0, 8]
paź 10 12:29:29 cachyos bash[71244]: 10:29:29 [INFO] L2 cache ID 7: sibling CPUs: [7, 15]
paź 10 12:29:29 cachyos bash[71244]: 10:29:29 [INFO] L2 cache ID 3: sibling CPUs: [3, 11]
paź 10 12:29:29 cachyos bash[71244]: 10:29:29 [INFO] L2 cache ID 1: sibling CPUs: [1, 9]
paź 10 12:29:29 cachyos bash[71244]: 10:29:29 [INFO] L2 cache ID 5: sibling CPUs: [5, 13]
paź 10 12:29:29 cachyos bash[71244]: 10:29:29 [INFO] L3 cache ID 0: sibling CPUs: [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]

As already discussed with @sirlucjan in other channels, this last trace looks different: the cpumask is coherent with the DSQ, so this looks more like a "regular" stall, caused by the task simply not being consumed. I'll take a look at this, but it seems to be a different issue.

Reporting another trace that looks interesting (different issue, but maybe related) with some thoughts:

DEBUG DUMP
================================================================================

kworker/u64:4[10624] triggered exit kind 1026:
  runnable task stall (kworker/9:0[12850] failed to run for 6.208s)

Backtrace:
  scx_watchdog_workfn+0x119/0x190
  process_scheduled_works+0x208/0x470
  worker_thread+0x349/0x4f0
  kthread+0x14a/0x170
  ret_from_fork+0x37/0x50
  ret_from_fork_asm+0x1a/0x30

CPU states
----------

CPU 9   : nr_run=2 flags=0x1 cpu_rel=0 ops_qseq=19257 pnt_seq=9897
          curr=cc1plus[14157] class=ext_sched_class

 *R cc1plus[14157] -2ms
      scx_state/flags=3/0x5 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=(n/a) dsq_vtime=104153400457
      cpus=ffff

  R kworker/9:0[12850] -6208ms
      scx_state/flags=3/0x9 dsq_flags=0x1 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x10 dsq_vtime=0
      cpus=0200

    kthread+0x14a/0x170
    ret_from_fork+0x37/0x50
    ret_from_fork_asm+0x1a/0x30

CPU 11  : nr_run=1 flags=0x1 cpu_rel=0 ops_qseq=18848 pnt_seq=16861
          curr=kworker/u64:4[10624] class=ext_sched_class

 *R kworker/u64:4[10624] +0ms
      scx_state/flags=3/0xd dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=(n/a) dsq_vtime=105066799924
      cpus=ffff

    scx_dump_state+0x63f/0x830
    scx_ops_error_irq_workfn+0x23/0x50
    irq_work_run+0xa8/0x140
    __sysvec_irq_work+0x1c/0xc0
    sysvec_irq_work+0x6c/0x80
    asm_sysvec_irq_work+0x1a/0x20
    scx_watchdog_workfn+0x137/0x190
    process_scheduled_works+0x208/0x470
    worker_thread+0x349/0x4f0
    kthread+0x14a/0x170
    ret_from_fork+0x37/0x50
    ret_from_fork_asm+0x1a/0x30

CPU 13  : nr_run=1 flags=0x5 cpu_rel=0 ops_qseq=32705 pnt_seq=18984
          curr=cc1plus[14160] class=ext_sched_class

 *R cc1plus[14160] -312ms
      scx_state/flags=3/0x5 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=(n/a) dsq_vtime=104930392745
      cpus=ffff

================================================================================

In this case, as a test, I'm always dispatching per-CPU kthreads with dsq_vtime=0 to make sure they're consumed before any other task, but on CPU 9 kworker/9 has been sitting in dsq_id=0x10 for more than 6 seconds.
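For context, the vtime=0 test mentioned above is roughly this (illustrative sketch, not the exact bpfland code; assumed to be called from ops.enqueue() for tasks with PF_KTHREAD set and p->nr_cpus_allowed == 1):

static void dispatch_percpu_kthread(struct task_struct *p, s32 cpu, u64 enq_flags)
{
        /* vtime 0 sorts the kthread ahead of everything else in the per-CPU DSQ */
        scx_bpf_dispatch_vtime(p, cpu_to_dsq(cpu), SCX_SLICE_DFL, 0, enq_flags);
}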

This is not a cpumask issue, because dsq_id=0x10 is a global DSQ in this case (there are 16 CPUs, so DSQ 16 is the global shared one).
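The assumed DSQ numbering here, as a sketch: per-CPU DSQs use ids 0..nr_cpus-1 and the shared DSQ takes the next id, which is why 0x10 (16) is the shared one on this 16-CPU box.

const volatile u64 shared_dsq_id;       /* assumed set to nr_cpus (16 = 0x10 here) */

static inline u64 cpu_to_dsq(s32 cpu)
{
        return (u64)cpu;                /* per-CPU DSQ id == CPU id */
}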

So, it could be a stall due to the fact that we don't consume from the shared DSQ. But I'm already doing this in ops.dispatch() to make sure the per-CPU DSQs don't starve the shared DSQ:

        /* Move at most one task from this CPU's DSQ to the local DSQ */
        if (scx_bpf_consume(cpu_to_dsq(cpu)))
                dispatched++;
        /* ... and at most one task from the shared DSQ */
        if (scx_bpf_consume(shared_dsq_id))
                dispatched++;
        barrier();
        /* Something was moved to the local DSQ: we're done for this round */
        if (dispatched > 0)
                return;

So, every time ops.dispatch() is called, I potentially move to the local DSQ one task from the per-CPU DSQ and one from the shared DSQ. And from the trace we can see that CPU 9 was active, so it's not starvation caused by a CPU going idle.

It could be that the running task was monopolizing the CPU (time slice too long), but that's also not possible, because the time slice is always set to 20 ms / number of waiting tasks, so the CPU can be held for at most 20 ms.
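That scaling is basically the following (sketch; the 20 ms constant and the use of scx_bpf_dsq_nr_queued() on the shared DSQ as the waiting-task count are assumptions for illustration):

static u64 task_slice(void)
{
        /* 20 ms budget divided by the number of tasks waiting in the shared DSQ */
        u64 nr_waiting = scx_bpf_dsq_nr_queued(shared_dsq_id) + 1;

        return 20ULL * 1000 * 1000 / nr_waiting;        /* 20 ms, in ns */
}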

Anyway, that was just a brain dump, I'll keep investigating....

bpfland doesn't use per-CPU DSQs anymore, so we can close this one for now.