sched-ext/scx

[Bug Report] scx_lavd failed to launch

Closed · 11 comments

Summary

Related to chaotic-cx/nyx#684.

Since commit b9d57e8, the scx_lavd program fails to run.

Ref: https://github.com/chaotic-cx/nyx/blob/main/pkgs/scx/common.nix#L9

Steps to reproduce

⯁ flake git:(master) ✗ ❯❯❯ just b
[sudo] password for kev:
warning: Git tree '/home/kev/flake' is dirty
building the system configuration...
warning: Git tree '/home/kev/flake' is dirty
stopping the following units: scx.service
activating the configuration...
setting up /etc...
sops-install-secrets: Imported /etc/ssh/ssh_host_rsa_key as GPG key with fingerprint 4ab394fe0264b5028190ed02231ba03f2115cb3f
sops-install-secrets: Imported /etc/ssh/ssh_host_ed25519_key as age key with fingerprint age1px9v42s7k0urw8af4mt8qc8jrchc02k2qkj0ysu50a0pztfclslqzpr097
reloading user units for kev...
restarting sysinit-reactivation.target
reloading the following units: dbus-broker.service
restarting the following units: polkit.service
starting the following units: scx.service
the following new units were started: battery_charge_threshold.service, sysinit-reactivation.target, systemd-tmpfiles-resetup.service
warning: the following units failed: scx.service

× scx.service - Start scx_scheduler
     Loaded: loaded (/etc/systemd/system/scx.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Fri 2024-04-26 09:14:57 HKT; 9s ago
   Duration: 37ms
       Docs: https://github.com/sched-ext/scx
    Process: 39561 ExecStart=/nix/store/xm7spb165ls4d27n6fgxbgwzkvqqwdrx-scx (code=exited, status=1/FAILURE)
   Main PID: 39561 (code=exited, status=1/FAILURE)
         IP: 0B in, 0B out
        CPU: 36ms

Apr 26 09:14:57 nixos-x1-carbon systemd[1]: scx.service: Scheduled restart job, restart counter is at 5.
Apr 26 09:14:57 nixos-x1-carbon systemd[1]: scx.service: Start request repeated too quickly.
Apr 26 09:14:57 nixos-x1-carbon systemd[1]: scx.service: Failed with result 'exit-code'.
Apr 26 09:14:57 nixos-x1-carbon systemd[1]: Failed to start Start scx_scheduler.
warning: error(s) occurred while switching to the new configuration
error: Recipe `rebuild` failed on line 25 with exit code 1
✖ 1 flake git:(master) ✗ ❯❯❯ sudo scx_lavd
Error: Failed to load BPF program

Caused by:
    Invalid argument (os error 22)

Additional Hardware Information

⯁ ~ ❯❯❯ uname -ar
Linux nixos-x1-carbon 6.8.6-cachyos #1-NixOS SMP PREEMPT_DYNAMIC Sat Apr 13 11:10:12 UTC 2024 x86_64 GNU/Linux

Could you try to launch with sudo scx_lavd -vvv -s $(nproc)?

Since scx_lavd doesn't start for me either, this is the output of sudo scx_lavd -vvv -s $(nproc):
lavd_out.txt

> Could you try to launch with sudo scx_lavd -vvv -s $(nproc)?

Sure. https://fars.ee/JV0P

I can't tell why the verifier is unhappy about the code. It looks fine to me. I asked BPF folks and they wanna see whether inlining the submit function would resolve the issue. Can you please try the following patch?

diff --git a/scheds/rust/scx_lavd/src/bpf/main.bpf.c b/scheds/rust/scx_lavd/src/bpf/main.bpf.c
index 1062222..96a4ca2 100644
--- a/scheds/rust/scx_lavd/src/bpf/main.bpf.c
+++ b/scheds/rust/scx_lavd/src/bpf/main.bpf.c
@@ -513,8 +513,8 @@ static void flip_sys_cpu_util(void)
 	__sys_cpu_util_idx ^= 0x1;
 }
 
-static int submit_task_ctx(struct task_struct *p, struct task_ctx *taskc,
-			   u16 cpu_id)
+static __attribute__((always_inline))
+int submit_task_ctx(struct task_struct *p, struct task_ctx *taskc, u16 cpu_id)
 {
 	struct sys_cpu_util *cutil_cur = get_sys_cpu_util_cur();
 	struct msg_task_ctx *m;

BPF folks are also asking for:

  • The .bpf.o file. For lavd, this should be build/scheds/rust/scx_lavd/debug/build/scx_lavd-*/out/bpf.bpf.o.
  • Full log from scx_lavd.
  • Kernel version and .config.

Alexei says it's most likely caused by missing 6fceea0fa59f ("bpf: Transfer RCU lock state between subprog calls") in the kernel. The commit was from Feb this year, which would explain why only some people are seeing this failure. Can you guys please update the kernels you're using? Thanks.

@htejun Thanks a lot! It would be better to add the always_inline attribute to the submit_task_ctx() function for the time being.

> Alexei says it's most likely caused by missing 6fceea0fa59f ("bpf: Transfer RCU lock state between subprog calls") in the kernel. The commit was from Feb this year, which would explain why only some people are seeing this failure. Can you guys please update the kernels you're using? Thanks.

This patch has landed neither in 6.8 nor in the official sched-ext 6.8 patchset.
You should consider adding it there as well.

It only landed in 6.9.

PR #247 should fix the problem by inlining the problematic function regardless of the kernel version.

> PR #247 should fix the problem

Can confirm it's working with b1bb2a5, without any extra kernel patch.

> PR #247 should fix the problem by inlining the problematic function regardless of the kernel version.

Hey @multics69, thanks for your kind help. I can confirm that #247 does fix the issue.

Also thanks @PedroHLC for the prompt action!
