firelzrd/bore-scheduler

Unexpected WARN_ON in 6.7.2-rc1 in place_entity()

hhoffstaette opened this issue · 4 comments

While testing BORE as a possible alternative to BMQ/PDS (and overall seeing very encouraging results!) I just hit the following while compiling in the background:

Jan 23 22:30:54 hho kernel: ------------[ cut here ]------------
Jan 23 22:30:54 hho kernel: WARNING: CPU: 10 PID: 11339 at kernel/sched/fair.c:5429 place_entity+0x20d/0x270
Jan 23 22:30:54 hho kernel: Modules linked in: mq_deadline rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace sunrpc sch_fq_codel bpf_preload mousedev iwlmvm amdgpu mac80211 snd_ctl_led snd_hda_codec_realtek i2c_algo_bit drm_ttm_helper snd_hda_codec_generic snd_hda_codec_hdmi libarc4 pkcs8_key_parser ttm drm_exec drm_suballoc_helper snd_hda_intel edac_mce_amd amdxcp snd_intel_dspcfg uvcvideo drm_buddy crct10dif_pclmul snd_hda_codec videobuf2_vmalloc crc32_pclmul gpu_sched lm92 iwlwifi videobuf2_memops crc32c_intel snd_hwdep uvc ghash_clmulni_intel drm_display_helper videobuf2_v4l2 sha512_ssse3 snd_hda_core r8169 sha256_ssse3 drivetemp wmi_bmof sha1_ssse3 cec videodev thinkpad_acpi snd_pcm cfg80211 realtek ledtrig_audio drm_kms_helper snd_timer snd_rn_pci_acp3x psmouse mdio_devres snd_acp_config videobuf2_common platform_profile ipmi_devintf ucsi_acpi snd snd_soc_acpi rapl mc serio_raw k10temp i2c_piix4 snd_pci_acp3x soundcore typec_ucsi rfkill libphy ipmi_msghandler roles drm typec battery ac video wmi i2c_scmi button
Jan 23 22:30:54 hho kernel: CPU: 10 PID: 11339 Comm: cc1plus Not tainted 6.7.2-rc1 #1
Jan 23 22:30:54 hho kernel: Hardware name: LENOVO 20U50001GE/20U50001GE, BIOS R19ET32W (1.16 ) 01/26/2021
Jan 23 22:30:54 hho kernel: RIP: 0010:place_entity+0x20d/0x270
Jan 23 22:30:54 hho kernel: Code: 49 8d 34 0a eb c2 4c 89 e6 48 89 4c 24 10 48 89 7c 24 18 e8 c5 d7 ff ff 48 89 c3 48 8b 4c 24 10 48 8b 7c 24 18 e9 44 fe ff ff <0f> 0b eb ab 4c 89 e6 44 88 44 24 0f 48 89 4c 24 10 4c 89 4c 24 18
Jan 23 22:30:54 hho kernel: RSP: 0000:ffffc9000ac17dc0 EFLAGS: 00010046
Jan 23 22:30:54 hho kernel: RAX: 00000000b3ac4821 RBX: 0000000000009637 RCX: 0000000000015ab9
Jan 23 22:30:54 hho kernel: RDX: 0000000000015ab9 RSI: 0000000000015ab9 RDI: 0000000000000000
Jan 23 22:30:54 hho kernel: RBP: ffffc9000ac17e08 R08: 0000000000000000 R09: ffff888109a9be00
Jan 23 22:30:54 hho kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8881009c6980
Jan 23 22:30:54 hho kernel: R13: ffff8887ef6abfc0 R14: 00007e7d1f0dc537 R15: 0000000000000009
Jan 23 22:30:54 hho kernel: FS:  00007fa0b2f4cf00(0000) GS:ffff8887ef680000(0000) knlGS:0000000000000000
Jan 23 22:30:54 hho kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 23 22:30:54 hho kernel: CR2: 0000000000ac1c70 CR3: 00000001c5d36000 CR4: 0000000000350ef0
Jan 23 22:30:54 hho kernel: Call Trace:
Jan 23 22:30:54 hho kernel:  <TASK>
Jan 23 22:30:54 hho kernel:  ? place_entity+0x20d/0x270
Jan 23 22:30:54 hho kernel:  ? __warn+0x7d/0x120
Jan 23 22:30:54 hho kernel:  ? place_entity+0x20d/0x270
Jan 23 22:30:54 hho kernel:  ? report_bug+0x155/0x180
Jan 23 22:30:54 hho kernel:  ? handle_bug+0x36/0x70
Jan 23 22:30:54 hho kernel:  ? exc_invalid_op+0x13/0x60
Jan 23 22:30:54 hho kernel:  ? asm_exc_invalid_op+0x16/0x20
Jan 23 22:30:54 hho kernel:  ? place_entity+0x20d/0x270
Jan 23 22:30:54 hho kernel:  enqueue_task_fair+0x16d/0x570
Jan 23 22:30:54 hho kernel:  activate_task+0x54/0x90
Jan 23 22:30:54 hho kernel:  ttwu_do_activate.isra.0+0x49/0x140
Jan 23 22:30:54 hho kernel:  try_to_wake_up+0x19a/0x400
Jan 23 22:30:54 hho kernel:  __do_softirq+0x249/0x255
Jan 23 22:30:54 hho kernel:  irq_exit_rcu+0x62/0x80
Jan 23 22:30:54 hho kernel:  sysvec_apic_timer_interrupt+0x32/0x80
Jan 23 22:30:54 hho kernel:  asm_sysvec_apic_timer_interrupt+0x16/0x20
Jan 23 22:30:54 hho kernel: RIP: 0033:0xea4b2e
Jan 23 22:30:54 hho kernel: Code: ff ca 83 fa 02 76 75 48 83 c4 08 31 c0 5b 5d 41 5c 41 5d c3 0f 1f 84 00 00 00 00 00 48 c1 e0 06 45 31 ed 80 b8 c2 89 8e 02 00 <75> 40 e8 3b c5 22 00 83 fd 26 0f 94 c2 48 89 03 44 08 e2 75 05 83
Jan 23 22:30:54 hho kernel: RSP: 002b:00007ffd2a5a3950 EFLAGS: 00000246
Jan 23 22:30:54 hho kernel: RAX: 00000000000020c0 RBX: 00007fa0b1141260 RCX: 0000000000000006
Jan 23 22:30:54 hho kernel: RDX: 000000000000000a RSI: 00007ffd2a5a39c4 RDI: 00007fa0b1140b00
Jan 23 22:30:54 hho kernel: RBP: 0000000000000083 R08: 0000000000000000 R09: 0000000000000009
Jan 23 22:30:54 hho kernel: R10: 0000000000000000 R11: 00007fa0b2c277e0 R12: 0000000000000000
Jan 23 22:30:54 hho kernel: R13: 0000000000000000 R14: 00007fa0b1140b00 R15: 0000000000000020
Jan 23 22:30:54 hho kernel:  </TASK>
Jan 23 22:30:54 hho kernel: ---[ end trace 0000000000000000 ]---

It seems to correspond to this line which would indicate that a calculation resulted in an unexpected load=0. At least that's my reading of the trace. Hope this helps.

I cannot be certain this was caused by BORE (i.e. it could also be a rare generic condition in EEVDF), but this has never happened before - only after adding the BORE patch.

Hi.
Thank you for the report.
I'll check how a typical avg_load for a tg (=task group) looks like in an EEVDF kernel.

Both on EEVDF and EEVDF-BORE on a 64-bit architecture, certain (I don't know in what condition) cfs_rq's typical .se->load.weight is 2.
I don't know where this value comes from though.
Let's think about the case where such a thing is to be inserted into a parent cfs_rq.
When scale_load_down(weight) macro is called, anything 1 <= weight <= 3071 causes the result to be as low as 2.
Not only a cfs_rq, but also a task may take such low weight value (<=3071).
Because se->load.weight was 2 (<=3071), scale_load_down'ed result is 2 as well.
Only in EEVDF-BORE, when a task's slice_score reaches at >=24 (that often happens to compiler tasks under the default BORE preset), this value is then further scaled down into 0. (2 * 423 / 1024 becomes 0)
This doesn't happen in EEVDF.
So I'm thinking whether I should modify the calculation logics to never let this value to reach 0, or don't check this value.
Ideally these (load weight-related) calculations should be done in much higher precision, not scale_load_down'ed scale.
Moreover, in place_entity where you pointed out, if load == 0, then it works as ==1 after all. This isn't favorable.
Please give me a little more moment to think over.
I need to find out why they use 2 as the minimum value, and also where another value 2 is set on a cfs_rq.

I have just uploaded the fixed version 4.1.3.
BORE now uses its own high precision (higher by 5 bits compared to the previous version) to calculate and store cfs_rq->avg_{vruntime,load}-related stuff in entity_weight().
entity_weight() does not return lower number than 2 if se->load.weight >0.
These changes should prevent the WARN message, and also it may improve the precision of the derived arithmetics, and improve the performance.
Please test it and tell me how you like it. If you continue experiencing any issues, please feel free to let me know.
Thank you for very helpful report and suggestion. It really did help :)

I have been running with the new patch and the WARN has not happened again, so I'm closing this for now. Also context switch performance seems to have improved a bit, from ~4µs to ~3.6. Nice!