This repository has been archived by the owner on Jun 18, 2024. It is now read-only.

Fix invalid state transition while scheduler is loaded #140

Merged

merged 2 commits into sched_ext on Feb 8, 2024

Conversation

htejun (Collaborator) commented Feb 8, 2024

Trying to load a scheduler while `stress-ng --race-sched 8 --timeout 30` is
running easily triggers the following warning.

  sched_ext: Invalid task state transition 0 -> 3 for stress-ng-race-[8698]
  WARNING: CPU: 1 PID: 1527 at kernel/sched/ext.c:2378 scx_set_task_state+0xb2/0x1a0
  CPU: 1 PID: 1527 Comm: stress-ng-race- Not tainted 6.7.0-work-00411-gc1c1b3b1133b-dirty #404
  Sched_ext: rustland (enabled+all), task: runnable_at=-76ms
  RIP: 0010:scx_set_task_state+0xb2/0x1a0
  ...
  Call Trace:
   <TASK>
   scx_ops_enable_task+0xd5/0x180
   switching_to_scx+0x17/0xa0
   __sched_setscheduler+0x623/0x810
   do_sched_setscheduler+0xea/0x170
   __x64_sys_sched_setscheduler+0x1c/0x30
   do_syscall_64+0x40/0xe0
   entry_SYSCALL_64_after_hwframe+0x46/0x4e

This race happens when a sched_setscheduler() syscall races the SCX
ops_enable path on a DEAD task. If the task is DEAD, scx_ops_enable()
directly calls scx_ops_exit_task() instead of enabling it. If another task
was already in the middle of sched_setscheduler() with the task's reference
acquired, it can then proceed to __sched_setscheduler(), which will then
call scx_ops_enable_task(). However, the task has already been exited, so
it ends up trying to transition the task from SCX_TASK_NONE to
SCX_TASK_ENABLED, triggering the above warning.

This is because the ops_enable path leaves the task in an inconsistent
state - scx_enabled() && scx_switching_all holds, so scx_should_scx(p) is
true, but p is SCX_TASK_NONE instead of READY or ENABLED. If the task gets
freed afterwards, it's fine, but if an attribute change operation gets in
between, that operation ends up trying to enable SCX on the unprepared dead
task.
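
To make the window concrete, here's a minimal userspace replay of the
interleaving. It's an illustration only, not kernel code: the enum mirrors
the 0 -> 3 values from the warning (the name of the intermediate INIT state
is an assumption), and set_task_state() is a hypothetical stand-in for the
check in scx_set_task_state().

```c
/*
 * Minimal userspace replay of the race described above. Illustration only:
 * set_task_state() is a hypothetical stand-in for the kernel's
 * scx_set_task_state() check, not the actual implementation.
 */
#include <stdbool.h>
#include <stdio.h>

enum scx_task_state {
	SCX_TASK_NONE,		/* 0: sched_ext not initialized for this task */
	SCX_TASK_INIT,		/* 1: per-task init done (name assumed) */
	SCX_TASK_READY,		/* 2: fully initialized, not on sched_ext yet */
	SCX_TASK_ENABLED,	/* 3: fully initialized and on sched_ext */
};

struct task {
	enum scx_task_state	scx_state;
	bool			dead;
};

/* each state may only be entered from its immediate predecessor, except for
 * dropping back to NONE when the task is exited */
static void set_task_state(struct task *p, enum scx_task_state state)
{
	if (state != SCX_TASK_NONE && p->scx_state != state - 1) {
		printf("Invalid task state transition %d -> %d\n",
		       p->scx_state, state);
		return;
	}
	p->scx_state = state;
}

int main(void)
{
	struct task dead_task = { .scx_state = SCX_TASK_NONE, .dead = true };
	bool switching_all = false;

	/* ops_enable path: the current code exits DEAD tasks instead of
	 * enabling them, leaving them at NONE, yet still switches every
	 * task over to SCX */
	if (dead_task.dead)
		set_task_state(&dead_task, SCX_TASK_NONE);
	switching_all = true;

	/* racing sched_setscheduler() caller: it still holds a reference to
	 * the task and, seeing that everything should be on SCX, tries to
	 * enable it directly, reproducing the 0 -> 3 warning above */
	if (switching_all)
		set_task_state(&dead_task, SCX_TASK_ENABLED);

	return 0;
}
```

Compiled and run, the sketch prints the same "Invalid task state
transition 0 -> 3" line as the kernel warning.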

The reason why both the ops_enable and disable paths handle DEAD tasks
specially is, I think, historical. Before scx_tasks was introduced, the
paths used the usual tasklist iteration. However, tasks are removed from the
tasklist before they transition to DEAD, so depending on the timing, a DEAD
task may show up in one iteration and disappear by the next one, e.g. while
disabling is in progress, without any way to detect the event. My memory is
hazy, so the details may not be accurate. Anyway, there were issues around
reliably iterating DEAD tasks and thus there was code to skip them.

With scx_tasks, we're strongly synchronized against tasks being destroyed,
so this is no longer a concern and it's likely that we can just remove this
special handling and things will just work.

After this patch, the kernel survives the load/unload torture test, but it's
possible that I'm completely misremembering things, in which case we'll have
to relearn why TASK_DEAD has to be special.

Signed-off-by: Tejun Heo <[email protected]>
htejun requested review from Byte-Lab and arighi on February 8, 2024 01:37
htejun (Collaborator, Author) commented Feb 8, 2024

So, looking at the old code, I think this is correct. The key difference is that we used to call the exit path from task_dead_ext(), which was called from finish_task_switch(), so a DEAD task shouldn't have been in the enabled state, which led to the exceptions in both the init and exit paths. Now that we've pushed the exit path out to sched_ext_free(), which is called from the task_struct free path, this should no longer be a problem.
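
To spell out why the new ordering holds up, here's a purely illustrative
refcount sketch (the helper names are made up, not kernel APIs): because the
SCX exit now runs from the task_struct free path, it can only happen after
every reference holder - including a racing sched_setscheduler() caller -
has dropped its reference, so no attribute change can see a half-torn-down
task.

```c
/*
 * Purely illustrative refcount sketch (made-up helper names, not kernel
 * code). The point: once the SCX exit runs from the free path, it is
 * ordered after every reference holder is done with the task, so the
 * enable path no longer needs to special-case DEAD tasks.
 */
#include <stdio.h>

struct task {
	int refcount;
	int scx_state;		/* 0 = NONE, 3 = ENABLED (simplified) */
};

/* stand-in for sched_ext_free(): runs only at final task_struct free */
static void sched_ext_free_sketch(struct task *p)
{
	p->scx_state = 0;	/* SCX exit happens here */
	printf("final free: scx state reset to NONE\n");
}

/* stand-in for dropping a task reference */
static void put_task_sketch(struct task *p)
{
	if (--p->refcount == 0)
		sched_ext_free_sketch(p);
}

int main(void)
{
	/* a DEAD task: one reference held by a racing sched_setscheduler()
	 * caller, one final reference released when the task is reaped */
	struct task p = { .refcount = 2, .scx_state = 0 };

	/* ops_enable pass: no DEAD special case needed - the task is
	 * enabled like any other, keeping its state consistent */
	p.scx_state = 3;

	/* the racing attribute change sees a consistently ENABLED task,
	 * then drops its reference */
	put_task_sketch(&p);	/* 2 -> 1: no free, no SCX exit yet */

	/* only the last reference triggers the free path and the exit */
	put_task_sketch(&p);	/* 1 -> 0: sched_ext_free_sketch() runs */
	return 0;
}
```

This only restates the ordering argument above; the actual synchronization
in the kernel goes through scx_tasks and the task_struct free path.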

Byte-Lab (Collaborator) left a comment

Thanks for the fix

Byte-Lab merged commit 6f5bf31 into sched_ext on Feb 8, 2024
1 check passed
arighi (Collaborator) commented Feb 8, 2024

I've tried to verify whether we can hit some issues with DEAD tasks by running `stress-ng --exec 8` (which constantly forks and exits tasks) and trying to start some schedulers.

With scx_rusty I can immediately hit this error (which also happens without this PR applied, so it's totally unrelated, but JFYI @Decave):

Error: EXIT: scx_bpf_error (cpu7 dom2 load underflow (load=-100 adj=-100))

With scx_rustland, without this PR I can hit "normal" scheduler stalls (which can be fixed/improved in rustland itself, so another unrelated issue), but with this PR applied I can trigger the following condition (which doesn't seem to happen after reverting the PR):

[   66.891276] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[   66.891912] rcu: 	Tasks blocked on level-0 rcu_node (CPUs 0-15): P7958/1:b..l P7811/1:b..l P7878/1:b..l
[   66.892813] rcu: 	(detected by 2, t=21002 jiffies, g=18629, q=180 ncpus=16)
[   66.893370] task:stress-ng       state:R  running task     stack:13328 pid:7878  tgid:7878  ppid:257    flags:0x00004000
[   66.894212] Sched_ext: rustland (enabled+all), task: runnable_at=-21003ms
[   66.894214] Call Trace:
[   66.894954]  <TASK>
[   66.895185]  __schedule+0x36d/0x1410
[   66.896208]  ? srso_alias_return_thunk+0x5/0xfbef5
[   66.896635]  ? __rhashtable_lookup.constprop.0+0xc5/0x120
[   66.897069]  preempt_schedule+0x39/0x50
[   66.897395]  preempt_schedule_thunk+0x1a/0x30
[   66.901239]  _raw_spin_unlock+0x1f/0x30
[   66.901565]  unmap_page_range+0x67a/0xe20
[   66.901882]  ? free_unref_page_prepare+0x7a/0x2f0
[   66.902313]  zap_page_range_single+0x133/0x200
[   66.902742]  unmap_mapping_pages+0x102/0x130
[   66.903170]  invalidate_inode_pages2_range+0x19f/0x430
[   66.903603]  fuse_open_common+0x1ce/0x270
[   66.903926]  ? __pfx_fuse_open+0x10/0x10
[   66.904256]  do_dentry_open+0x1f6/0x530
[   66.904581]  backing_file_open+0x76/0xb0
[   66.904901]  ovl_open_realfile+0xd2/0xe0
[   66.905231]  ? __pfx_ovl_open+0x10/0x10
[   66.905554]  ovl_open+0xb1/0x100
[   66.905876]  do_dentry_open+0x1f6/0x530
[   66.906205]  path_openat+0xc66/0x1000
[   66.906528]  ? srso_alias_return_thunk+0x5/0xfbef5
[   66.906950]  ? filemap_map_pages+0x3b7/0x450
[   66.907376]  do_filp_open+0xb3/0x160
[   66.907702]  ? __pfx_page_put_link+0x10/0x10
[   66.908129]  do_sys_openat2+0xab/0xe0
[   66.908452]  __x64_sys_openat+0x57/0xa0
[   66.908782]  do_syscall_64+0x47/0xf0
[   66.910179]  entry_SYSCALL_64_after_hwframe+0x6f/0x77
[   66.910593] RIP: 0033:0x145eaabd2ede
[   66.910908] RSP: 002b:00007ffe175c0fe8 EFLAGS: 00000206 ORIG_RAX: 0000000000000101
[   66.911529] RAX: ffffffffffffffda RBX: 0000145eaa05b9a0 RCX: 0000145eaabd2ede
[   66.912149] RDX: 0000000000080000 RSI: 0000145eaa05b9a0 RDI: 00000000ffffff9c
[   66.914509] RBP: 00007ffe175c1030 R08: 00007ffe175c1097 R09: 0000000000000000
[   66.915128] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
[   66.915761] R13: 0000000000000000 R14: 0000145eaabe5000 R15: 00007ffe175c10b0
[   66.916386]  </TASK>

When this happens the system becomes completely unresponsive and I need to kill the VM. Not sure if it's related to no longer special-casing DEAD tasks.
