This repository has been archived by the owner on Jun 18, 2024. It is now read-only.

Fix invalid state transition while scheduler is loaded #140

Merged

merged 2 commits into sched_ext on Feb 8, 2024

Conversation

htejun (Collaborator) commented Feb 8, 2024

Trying to load a scheduler while `stress-ng --race-sched 8 --timeout 30` is
running easily triggers the following warning.

  sched_ext: Invalid task state transition 0 -> 3 for stress-ng-race-[8698]
  WARNING: CPU: 1 PID: 1527 at kernel/sched/ext.c:2378 scx_set_task_state+0xb2/0x1a0
  CPU: 1 PID: 1527 Comm: stress-ng-race- Not tainted 6.7.0-work-00411-gc1c1b3b1133b-dirty #404
  Sched_ext: rustland (enabled+all), task: runnable_at=-76ms
  RIP: 0010:scx_set_task_state+0xb2/0x1a0
  ...
  Call Trace:
   <TASK>
   scx_ops_enable_task+0xd5/0x180
   switching_to_scx+0x17/0xa0
   __sched_setscheduler+0x623/0x810
   do_sched_setscheduler+0xea/0x170
   __x64_sys_sched_setscheduler+0x1c/0x30
   do_syscall_64+0x40/0xe0
   entry_SYSCALL_64_after_hwframe+0x46/0x4e

This race happens when a sched_setscheduler() syscall races the SCX
ops_enable path on a DEAD task. If the task is DEAD, scx_ops_enable()
directly calls scx_ops_exit_task() instead of enabling it. If another task
was already in the middle of sched_setscheduler() with the task's reference
acquired, it can then proceed to __sched_setscheduler(), which will then
call scx_ops_enable_task(). However, the task has already been exited, so
it ends up trying to transition the task from SCX_TASK_NONE to
SCX_TASK_ENABLED, triggering the above warning.

This is because the ops_enable path leaves the task in an inconsistent
state - scx_enabled() && scx_switching_all holds, so scx_should_scx(p) is
true, but p is SCX_TASK_NONE instead of READY or ENABLED. If the task gets
freed afterwards, it's fine, but if an attribute change operation gets in
between, that operation ends up trying to enable SCX on the unprepared dead
task.
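
To make the window concrete, here's a minimal userspace replay of the
interleaving. It's an illustration only, not kernel code: the enum mirrors
the 0 -> 3 values from the warning (the name of the intermediate INIT state
is an assumption), and set_task_state() is a hypothetical stand-in for the
check in scx_set_task_state().

```c
/*
 * Minimal userspace replay of the race described above. Illustration only:
 * set_task_state() is a hypothetical stand-in for the kernel's
 * scx_set_task_state() check, not the actual implementation.
 */
#include <stdbool.h>
#include <stdio.h>

enum scx_task_state {
	SCX_TASK_NONE,		/* 0: sched_ext not initialized for this task */
	SCX_TASK_INIT,		/* 1: per-task init done (name assumed) */
	SCX_TASK_READY,		/* 2: fully initialized, not on sched_ext yet */
	SCX_TASK_ENABLED,	/* 3: fully initialized and on sched_ext */
};

struct task {
	enum scx_task_state	scx_state;
	bool			dead;
};

/* each state may only be entered from its immediate predecessor, except for
 * dropping back to NONE when the task is exited */
static void set_task_state(struct task *p, enum scx_task_state state)
{
	if (state != SCX_TASK_NONE && p->scx_state != state - 1) {
		printf("Invalid task state transition %d -> %d\n",
		       p->scx_state, state);
		return;
	}
	p->scx_state = state;
}

int main(void)
{
	struct task dead_task = { .scx_state = SCX_TASK_NONE, .dead = true };
	bool switching_all = false;

	/* ops_enable path: the current code exits DEAD tasks instead of
	 * enabling them, leaving them at NONE, yet still switches every
	 * task over to SCX */
	if (dead_task.dead)
		set_task_state(&dead_task, SCX_TASK_NONE);
	switching_all = true;

	/* racing sched_setscheduler() caller: it still holds a reference to
	 * the task and, seeing that everything should be on SCX, tries to
	 * enable it directly, reproducing the 0 -> 3 warning above */
	if (switching_all)
		set_task_state(&dead_task, SCX_TASK_ENABLED);

	return 0;
}
```

Compiled and run, the sketch prints the same "Invalid task state
transition 0 -> 3" line as the kernel warning.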

The reason why both the ops_enable and disable paths handle DEAD tasks
specially is, I think, historical. Before scx_tasks was introduced, the
paths used the usual tasklist iteration. However, tasks are removed from the
tasklist before they transition to DEAD, so depending on the timing, a DEAD
task may show up in one iteration and disappear by the next one, e.g. while
disabling is in progress, without any way to detect the event. My memory is
hazy, so the details may not be accurate. Anyway, there were issues around
reliably iterating DEAD tasks and thus there was code to skip them.

With scx_tasks, we're strongly synchronized against tasks being destroyed,
so this is no longer a concern and it's likely that we can just remove this
special handling and things will just work.

After this patch, the kernel survives the load/unload torture test, but it's
possible that I'm completely misremembering things, in which case we'll have
to relearn why TASK_DEAD has to be special.

Signed-off-by: Tejun Heo <[email protected]>
htejun requested review from Byte-Lab and arighi on February 8, 2024 01:37
htejun (Collaborator, Author) commented Feb 8, 2024

So, looking at the old code, I think this is correct. The key difference is that we used to call the exit path from task_dead_ext(), which was called from finish_task_switch(), so a DEAD task shouldn't have been in the enabled state, which led to the exceptions in both the init and exit paths. Now that we've pushed the exit path out to sched_ext_free(), which is called from the task_struct free path, this should no longer be a problem.
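
To spell out why the new ordering holds up, here's a purely illustrative
refcount sketch (the helper names are made up, not kernel APIs): because the
SCX exit now runs from the task_struct free path, it can only happen after
every reference holder - including a racing sched_setscheduler() caller -
has dropped its reference, so no attribute change can see a half-torn-down
task.

```c
/*
 * Purely illustrative refcount sketch (made-up helper names, not kernel
 * code). The point: once the SCX exit runs from the free path, it is
 * ordered after every reference holder is done with the task, so the
 * enable path no longer needs to special-case DEAD tasks.
 */
#include <stdio.h>

struct task {
	int refcount;
	int scx_state;		/* 0 = NONE, 3 = ENABLED (simplified) */
};

/* stand-in for sched_ext_free(): runs only at final task_struct free */
static void sched_ext_free_sketch(struct task *p)
{
	p->scx_state = 0;	/* SCX exit happens here */
	printf("final free: scx state reset to NONE\n");
}

/* stand-in for dropping a task reference */
static void put_task_sketch(struct task *p)
{
	if (--p->refcount == 0)
		sched_ext_free_sketch(p);
}

int main(void)
{
	/* a DEAD task: one reference held by a racing sched_setscheduler()
	 * caller, one final reference released when the task is reaped */
	struct task p = { .refcount = 2, .scx_state = 0 };

	/* ops_enable pass: no DEAD special case needed - the task is
	 * enabled like any other, keeping its state consistent */
	p.scx_state = 3;

	/* the racing attribute change sees a consistently ENABLED task,
	 * then drops its reference */
	put_task_sketch(&p);	/* 2 -> 1: no free, no SCX exit yet */

	/* only the last reference triggers the free path and the exit */
	put_task_sketch(&p);	/* 1 -> 0: sched_ext_free_sketch() runs */
	return 0;
}
```

This only restates the ordering argument above; the actual synchronization
in the kernel goes through scx_tasks and the task_struct free path.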

Byte-Lab (Collaborator) left a comment

Thanks for the fix

Byte-Lab merged commit 6f5bf31 into sched_ext on Feb 8, 2024
1 check passed
arighi (Collaborator) commented Feb 8, 2024

I've tried to verify whether we can hit some issues with DEAD tasks by running `stress-ng --exec 8` (which constantly forks and exits tasks) and trying to start some schedulers.

With scx_rusty I can immediately hit this error (which also happens without this PR applied, so it's totally unrelated, but JFYI @Decave):

Error: EXIT: scx_bpf_error (cpu7 dom2 load underflow (load=-100 adj=-100))

With scx_rustland, without this PR I can hit "normal" scheduler stalls (which can be fixed/improved in rustland itself, so another unrelated issue), but with this PR applied I can trigger the following condition (which doesn't seem to happen after reverting the PR):

[   66.891276] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[   66.891912] rcu: 	Tasks blocked on level-0 rcu_node (CPUs 0-15): P7958/1:b..l P7811/1:b..l P7878/1:b..l
[   66.892813] rcu: 	(detected by 2, t=21002 jiffies, g=18629, q=180 ncpus=16)
[   66.893370] task:stress-ng       state:R  running task     stack:13328 pid:7878  tgid:7878  ppid:257    flags:0x00004000
[   66.894212] Sched_ext: rustland (enabled+all), task: runnable_at=-21003ms
[   66.894214] Call Trace:
[   66.894954]  <TASK>
[   66.895185]  __schedule+0x36d/0x1410
[   66.896208]  ? srso_alias_return_thunk+0x5/0xfbef5
[   66.896635]  ? __rhashtable_lookup.constprop.0+0xc5/0x120
[   66.897069]  preempt_schedule+0x39/0x50
[   66.897395]  preempt_schedule_thunk+0x1a/0x30
[   66.901239]  _raw_spin_unlock+0x1f/0x30
[   66.901565]  unmap_page_range+0x67a/0xe20
[   66.901882]  ? free_unref_page_prepare+0x7a/0x2f0
[   66.902313]  zap_page_range_single+0x133/0x200
[   66.902742]  unmap_mapping_pages+0x102/0x130
[   66.903170]  invalidate_inode_pages2_range+0x19f/0x430
[   66.903603]  fuse_open_common+0x1ce/0x270
[   66.903926]  ? __pfx_fuse_open+0x10/0x10
[   66.904256]  do_dentry_open+0x1f6/0x530
[   66.904581]  backing_file_open+0x76/0xb0
[   66.904901]  ovl_open_realfile+0xd2/0xe0
[   66.905231]  ? __pfx_ovl_open+0x10/0x10
[   66.905554]  ovl_open+0xb1/0x100
[   66.905876]  do_dentry_open+0x1f6/0x530
[   66.906205]  path_openat+0xc66/0x1000
[   66.906528]  ? srso_alias_return_thunk+0x5/0xfbef5
[   66.906950]  ? filemap_map_pages+0x3b7/0x450
[   66.907376]  do_filp_open+0xb3/0x160
[   66.907702]  ? __pfx_page_put_link+0x10/0x10
[   66.908129]  do_sys_openat2+0xab/0xe0
[   66.908452]  __x64_sys_openat+0x57/0xa0
[   66.908782]  do_syscall_64+0x47/0xf0
[   66.910179]  entry_SYSCALL_64_after_hwframe+0x6f/0x77
[   66.910593] RIP: 0033:0x145eaabd2ede
[   66.910908] RSP: 002b:00007ffe175c0fe8 EFLAGS: 00000206 ORIG_RAX: 0000000000000101
[   66.911529] RAX: ffffffffffffffda RBX: 0000145eaa05b9a0 RCX: 0000145eaabd2ede
[   66.912149] RDX: 0000000000080000 RSI: 0000145eaa05b9a0 RDI: 00000000ffffff9c
[   66.914509] RBP: 00007ffe175c1030 R08: 00007ffe175c1097 R09: 0000000000000000
[   66.915128] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
[   66.915761] R13: 0000000000000000 R14: 0000145eaabe5000 R15: 00007ffe175c10b0
[   66.916386]  </TASK>

When this happens the system becomes completely unresponsive and I need to kill the VM. Not sure if it's related to no longer special-casing DEAD tasks.
