-
Notifications
You must be signed in to change notification settings - Fork 29
scx: fix NULL pointer dereference with scx_exit_info #129
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for finding this! The scx_ops_enable()
part LGTM. Left a couple comments on the disable path to discuss.
kernel/sched/ext.c
Outdated
ei = scx_exit_info; | ||
if (!ei) { | ||
mutex_unlock(&scx_ops_enable_mutex); | ||
goto done; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we move the mutex acquisition above here, we'll also need to drop the lock on line 3324.
kernel/sched/ext.c
Outdated
ei = scx_exit_info; | ||
if (!ei) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this should never happen, right? If we're able to swap out scx_exit_kind
, it means that we were previously successfully running, and were able to successfully allocate scx_exit_info
when the scheduler was loaded. AFAICT, the only fix we should need to add is the one you added below in scx_ops_enable()
where we don't incorrectly clear the exit info.
Note that we can't race with someone enabling a scheduler either because if we were able to get here, it means that scx_ops_enable_state()
will not be SCX_OPS_DISABLED
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're right. Also, with the fix to scx_ops_enable()
in place, there's no way to have multiple concurrent scx_ops_disable_workfn()
running at the same time, so we don't need any extra protection there.
I've updated the PR to include just the fix for the NULL pointer deref in scx_ops_enable()
. Thanks!
Trying to load multiple schedulers at the same time can trigger the following NULL pointer dereference: [22053.942960] BUG: kernel NULL pointer dereference, address: 0000000000000000 [22053.942966] #PF: supervisor write access in kernel mode [22053.942968] #PF: error_code(0x0002) - not-present page [22053.942969] PGD 0 P4D 0 [22053.942972] Oops: 0002 [#1] PREEMPT SMP NOPTI [22053.942976] CPU: 6 PID: 3550 Comm: sched_ext_ops_h Tainted: G O 6.7.0-4-generic #4+scx3-Ubuntu [22053.942978] Hardware name: GPD G1621-02/G1621-02, BIOS 2.04 09/01/2022 [22053.942980] Sched_ext: central (enabled+all) [22053.942981] RIP: 0010:scx_ops_disable_workfn+0x85/0x610 [22053.942987] Code: 89 df f3 48 ab b9 01 00 00 00 83 fa 01 0f 86 c3 00 00 00 89 d0 f0 0f b1 0d 60 ff 11 02 0f 85 10 05 00 00 48 8b 85 90 fe ff ff <89> 10 81 fa 00 04 00 00 0f 84 ef 04 00 00 0f 8e cb 00 00 00 48 c7 [22053.942989] RSP: 0018:ffffad420257fd30 EFLAGS: 00010246 [22053.942991] RAX: 0000000000000000 RBX: ffffad420257fd58 RCX: 0000000000000001 [22053.942993] RDX: 0000000000000040 RSI: 0000000000000000 RDI: ffffad420257fd98 [22053.942994] RBP: ffffad420257fea8 R08: 0000000000000000 R09: 0000000000000000 [22053.942996] R10: 0000000000000000 R11: 0000000000000000 R12: ffffad420257fd98 [22053.942998] R13: ffff96a4d07eddc0 R14: ffff96a4d07eddc4 R15: ffffffff8897c3b0 [22053.942999] FS: 0000000000000000(0000) GS:ffff96a7dfb00000(0000) knlGS:0000000000000000 [22053.943002] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [22053.943003] CR2: 0000000000000000 CR3: 00000001f6a3c001 CR4: 0000000000f70ef0 [22053.943005] PKRU: 55555554 [22053.943006] Call Trace: [22053.943008] <TASK> [22053.943012] ? show_regs+0x6d/0x80 [22053.943016] ? __die+0x24/0x80 [22053.943018] ? page_fault_oops+0x99/0x1b0 [22053.943021] ? do_user_addr_fault+0x2ee/0x6b0 [22053.943024] ? exc_page_fault+0x83/0x1b0 [22053.943028] ? asm_exc_page_fault+0x27/0x30 [22053.943030] ? __pfx_scx_ops_disable_workfn+0x10/0x10 [22053.943034] ? scx_ops_disable_workfn+0x85/0x610 [22053.943036] ? asm_sysvec_irq_work+0x1b/0x20 [22053.943042] ? __pfx_scx_ops_disable_workfn+0x10/0x10 [22053.943043] kthread_worker_fn+0x9e/0x230 [22053.943048] ? __pfx_kthread_worker_fn+0x10/0x10 [22053.943050] kthread+0xef/0x120 [22053.943053] ? __pfx_kthread+0x10/0x10 [22053.943056] ret_from_fork+0x44/0x70 [22053.943058] ? __pfx_kthread+0x10/0x10 [22053.943061] ret_from_fork_asm+0x1b/0x30 [22053.943064] </TASK> This happens because in scx_ops_enable(), if a scheduler is already running, we are freeing scx_exit_info, that is still owned by the running scheduler. Therefore, as soon as we stop the running scheduler we can hit the NULL pointer dereference. Reproducer: - start any scheduler - try to start another scheduler - stop the running scheduler - BUG Fix this by not freeing scx_exit_info in error path of scx_ops_enable() when there is a running scheduler. Fixes: 26ae1b0 ("scx: Make scx_exit_info fields dynamically allocated") Signed-off-by: Andrea Righi <[email protected]>
57d57d8
to
fa75b1f
Compare
In #129, Andrea fixed a bug where we trip over a NULL pointer deref by trying to load multiple schedulers at a time. Let's add a stress test that tries to load multiple schedulers in a tight loop at the same time. Additionally, do a bit of cleanup in the build system to have testcases take all BPF progs as dependencies. We don't really gain anything by artificially coupling the name of testcases to the BPF progs they use. Signed-off-by: David Vernet <[email protected]>
In #129, Andrea fixed a bug where we trip over a NULL pointer deref by trying to load multiple schedulers at a time. Let's add a stress test that tries to load multiple schedulers in a tight loop at the same time. Additionally, do a bit of cleanup in the build system to have testcases take all BPF progs as dependencies. We don't really gain anything by artificially coupling the name of testcases to the BPF progs they use. Signed-off-by: David Vernet <[email protected]>
In #129, Andrea fixed a bug where we trip over a NULL pointer deref by trying to load multiple schedulers at a time. Let's add a stress test that tries to load multiple schedulers in a tight loop at the same time. Additionally, do a bit of cleanup in the build system to have testcases take all BPF progs as dependencies. We don't really gain anything by artificially coupling the name of testcases to the BPF progs they use. Signed-off-by: David Vernet <[email protected]>
In #129, Andrea fixed a bug where we trip over a NULL pointer deref by trying to load multiple schedulers at a time. Let's add a stress test that tries to load multiple schedulers in a tight loop at the same time. Additionally, do a bit of cleanup in the build system to have testcases take all BPF progs as dependencies. We don't really gain anything by artificially coupling the name of testcases to the BPF progs they use. Signed-off-by: David Vernet <[email protected]>
In #129, Andrea fixed a bug where we trip over a NULL pointer deref by trying to load multiple schedulers at a time. Let's add a stress test that tries to load multiple schedulers in a tight loop at the same time. Additionally, do a bit of cleanup in the build system to have testcases take all BPF progs as dependencies. We don't really gain anything by artificially coupling the name of testcases to the BPF progs they use. Signed-off-by: David Vernet <[email protected]>
Trying to load multiple schedulers at the same time can trigger the following NULL pointer dereference:
[22053.942960] BUG: kernel NULL pointer dereference, address: 0000000000000000 [22053.942966] #PF: supervisor write access in kernel mode [22053.942968] #PF: error_code(0x0002) - not-present page [22053.942969] PGD 0 P4D 0
[22053.942972] Oops: 0002 [#1] PREEMPT SMP NOPTI
[22053.942976] CPU: 6 PID: 3550 Comm: sched_ext_ops_h Tainted: G O 6.7.0-4-generic #4+scx3-Ubuntu
[22053.942978] Hardware name: GPD G1621-02/G1621-02, BIOS 2.04 09/01/2022
[22053.942980] Sched_ext: central (enabled+all)
[22053.942981] RIP: 0010:scx_ops_disable_workfn+0x85/0x610
[22053.942987] Code: 89 df f3 48 ab b9 01 00 00 00 83 fa 01 0f 86 c3 00 00 00 89 d0 f0 0f b1 0d 60 ff 11 02 0f 85 10 05 00 00 48 8b 85 90 fe ff ff <89> 10 81 fa 00 04 00 00 0f 84 ef 04 00 00 0f 8e cb 00 00 00 48 c7
[22053.942989] RSP: 0018:ffffad420257fd30 EFLAGS: 00010246
[22053.942991] RAX: 0000000000000000 RBX: ffffad420257fd58 RCX: 0000000000000001
[22053.942993] RDX: 0000000000000040 RSI: 0000000000000000 RDI: ffffad420257fd98
[22053.942994] RBP: ffffad420257fea8 R08: 0000000000000000 R09: 0000000000000000
[22053.942996] R10: 0000000000000000 R11: 0000000000000000 R12: ffffad420257fd98
[22053.942998] R13: ffff96a4d07eddc0 R14: ffff96a4d07eddc4 R15: ffffffff8897c3b0
[22053.942999] FS: 0000000000000000(0000) GS:ffff96a7dfb00000(0000) knlGS:0000000000000000
[22053.943002] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[22053.943003] CR2: 0000000000000000 CR3: 00000001f6a3c001 CR4: 0000000000f70ef0
[22053.943005] PKRU: 55555554
[22053.943006] Call Trace:
[22053.943008]
[22053.943012] ? show_regs+0x6d/0x80
[22053.943016] ? __die+0x24/0x80
[22053.943018] ? page_fault_oops+0x99/0x1b0
[22053.943021] ? do_user_addr_fault+0x2ee/0x6b0
[22053.943024] ? exc_page_fault+0x83/0x1b0
[22053.943028] ? asm_exc_page_fault+0x27/0x30
[22053.943030] ? __pfx_scx_ops_disable_workfn+0x10/0x10
[22053.943034] ? scx_ops_disable_workfn+0x85/0x610
[22053.943036] ? asm_sysvec_irq_work+0x1b/0x20
[22053.943042] ? __pfx_scx_ops_disable_workfn+0x10/0x10
[22053.943043] kthread_worker_fn+0x9e/0x230
[22053.943048] ? __pfx_kthread_worker_fn+0x10/0x10
[22053.943050] kthread+0xef/0x120
[22053.943053] ? __pfx_kthread+0x10/0x10
[22053.943056] ret_from_fork+0x44/0x70
[22053.943058] ? __pfx_kthread+0x10/0x10
[22053.943061] ret_from_fork_asm+0x1b/0x30
[22053.943064]
This happens because in scx_ops_enable(), if a scheduler is already running, we are freeing scx_exit_info, that is still owned by the running scheduler.
Therefore, as soon as we stop the running scheduler we can hit the NULL pointer dereference.
Reproducer:
Fix this by not freeing scx_exit_info in error path of scx_ops_enable() when there is a running scheduler.
Moreover, also make sure to access scx_exit_info with the scx_ops_enable_mutex held in scx_ops_disable_workfn(), to prevent other potential race conditions.
Fixes: 26ae1b0 ("scx: Make scx_exit_info fields dynamically allocated")