scx: fix NULL pointer dereference with scx_exit_info #129

arighi · 2024-01-26T22:59:36Z

Trying to load multiple schedulers at the same time can trigger the following NULL pointer dereference:

[22053.942960] BUG: kernel NULL pointer dereference, address: 0000000000000000 [22053.942966] #PF: supervisor write access in kernel mode [22053.942968] #PF: error_code(0x0002) - not-present page [22053.942969] PGD 0 P4D 0
[22053.942972] Oops: 0002 [#1] PREEMPT SMP NOPTI
[22053.942976] CPU: 6 PID: 3550 Comm: sched_ext_ops_h Tainted: G O 6.7.0-4-generic #4+scx3-Ubuntu
[22053.942978] Hardware name: GPD G1621-02/G1621-02, BIOS 2.04 09/01/2022
[22053.942980] Sched_ext: central (enabled+all)
[22053.942981] RIP: 0010:scx_ops_disable_workfn+0x85/0x610
[22053.942987] Code: 89 df f3 48 ab b9 01 00 00 00 83 fa 01 0f 86 c3 00 00 00 89 d0 f0 0f b1 0d 60 ff 11 02 0f 85 10 05 00 00 48 8b 85 90 fe ff ff <89> 10 81 fa 00 04 00 00 0f 84 ef 04 00 00 0f 8e cb 00 00 00 48 c7
[22053.942989] RSP: 0018:ffffad420257fd30 EFLAGS: 00010246
[22053.942991] RAX: 0000000000000000 RBX: ffffad420257fd58 RCX: 0000000000000001
[22053.942993] RDX: 0000000000000040 RSI: 0000000000000000 RDI: ffffad420257fd98
[22053.942994] RBP: ffffad420257fea8 R08: 0000000000000000 R09: 0000000000000000
[22053.942996] R10: 0000000000000000 R11: 0000000000000000 R12: ffffad420257fd98
[22053.942998] R13: ffff96a4d07eddc0 R14: ffff96a4d07eddc4 R15: ffffffff8897c3b0
[22053.942999] FS: 0000000000000000(0000) GS:ffff96a7dfb00000(0000) knlGS:0000000000000000
[22053.943002] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[22053.943003] CR2: 0000000000000000 CR3: 00000001f6a3c001 CR4: 0000000000f70ef0
[22053.943005] PKRU: 55555554
[22053.943006] Call Trace:
[22053.943008]
[22053.943012] ? show_regs+0x6d/0x80
[22053.943016] ? __die+0x24/0x80
[22053.943018] ? page_fault_oops+0x99/0x1b0
[22053.943021] ? do_user_addr_fault+0x2ee/0x6b0
[22053.943024] ? exc_page_fault+0x83/0x1b0
[22053.943028] ? asm_exc_page_fault+0x27/0x30
[22053.943030] ? __pfx_scx_ops_disable_workfn+0x10/0x10
[22053.943034] ? scx_ops_disable_workfn+0x85/0x610
[22053.943036] ? asm_sysvec_irq_work+0x1b/0x20
[22053.943042] ? __pfx_scx_ops_disable_workfn+0x10/0x10
[22053.943043] kthread_worker_fn+0x9e/0x230
[22053.943048] ? __pfx_kthread_worker_fn+0x10/0x10
[22053.943050] kthread+0xef/0x120
[22053.943053] ? __pfx_kthread+0x10/0x10
[22053.943056] ret_from_fork+0x44/0x70
[22053.943058] ? __pfx_kthread+0x10/0x10
[22053.943061] ret_from_fork_asm+0x1b/0x30
[22053.943064]

This happens because in scx_ops_enable(), if a scheduler is already running, we are freeing scx_exit_info, that is still owned by the running scheduler.

Therefore, as soon as we stop the running scheduler we can hit the NULL pointer dereference.

Reproducer:

start any scheduler
try to start another scheduler
stop the running scheduler
BUG

Fix this by not freeing scx_exit_info in error path of scx_ops_enable() when there is a running scheduler.

Moreover, also make sure to access scx_exit_info with the scx_ops_enable_mutex held in scx_ops_disable_workfn(), to prevent other potential race conditions.

Fixes: 26ae1b0 ("scx: Make scx_exit_info fields dynamically allocated")

Byte-Lab

Thanks for finding this! The scx_ops_enable() part LGTM. Left a couple comments on the disable path to discuss.

Byte-Lab · 2024-01-26T23:11:54Z

kernel/sched/ext.c

+	ei = scx_exit_info;
+	if (!ei) {
+		mutex_unlock(&scx_ops_enable_mutex);
+		goto done;


If we move the mutex acquisition above here, we'll also need to drop the lock on line 3324.

Byte-Lab · 2024-01-26T23:20:45Z

kernel/sched/ext.c

+	ei = scx_exit_info;
+	if (!ei) {


Hmm, this should never happen, right? If we're able to swap out scx_exit_kind, it means that we were previously successfully running, and were able to successfully allocate scx_exit_info when the scheduler was loaded. AFAICT, the only fix we should need to add is the one you added below in scx_ops_enable() where we don't incorrectly clear the exit info.

Note that we can't race with someone enabling a scheduler either because if we were able to get here, it means that scx_ops_enable_state() will not be SCX_OPS_DISABLED.

I think you're right. Also, with the fix to scx_ops_enable() in place, there's no way to have multiple concurrent scx_ops_disable_workfn() running at the same time, so we don't need any extra protection there.

I've updated the PR to include just the fix for the NULL pointer deref in scx_ops_enable(). Thanks!

Trying to load multiple schedulers at the same time can trigger the following NULL pointer dereference: [22053.942960] BUG: kernel NULL pointer dereference, address: 0000000000000000 [22053.942966] #PF: supervisor write access in kernel mode [22053.942968] #PF: error_code(0x0002) - not-present page [22053.942969] PGD 0 P4D 0 [22053.942972] Oops: 0002 [#1] PREEMPT SMP NOPTI [22053.942976] CPU: 6 PID: 3550 Comm: sched_ext_ops_h Tainted: G O 6.7.0-4-generic #4+scx3-Ubuntu [22053.942978] Hardware name: GPD G1621-02/G1621-02, BIOS 2.04 09/01/2022 [22053.942980] Sched_ext: central (enabled+all) [22053.942981] RIP: 0010:scx_ops_disable_workfn+0x85/0x610 [22053.942987] Code: 89 df f3 48 ab b9 01 00 00 00 83 fa 01 0f 86 c3 00 00 00 89 d0 f0 0f b1 0d 60 ff 11 02 0f 85 10 05 00 00 48 8b 85 90 fe ff ff <89> 10 81 fa 00 04 00 00 0f 84 ef 04 00 00 0f 8e cb 00 00 00 48 c7 [22053.942989] RSP: 0018:ffffad420257fd30 EFLAGS: 00010246 [22053.942991] RAX: 0000000000000000 RBX: ffffad420257fd58 RCX: 0000000000000001 [22053.942993] RDX: 0000000000000040 RSI: 0000000000000000 RDI: ffffad420257fd98 [22053.942994] RBP: ffffad420257fea8 R08: 0000000000000000 R09: 0000000000000000 [22053.942996] R10: 0000000000000000 R11: 0000000000000000 R12: ffffad420257fd98 [22053.942998] R13: ffff96a4d07eddc0 R14: ffff96a4d07eddc4 R15: ffffffff8897c3b0 [22053.942999] FS: 0000000000000000(0000) GS:ffff96a7dfb00000(0000) knlGS:0000000000000000 [22053.943002] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [22053.943003] CR2: 0000000000000000 CR3: 00000001f6a3c001 CR4: 0000000000f70ef0 [22053.943005] PKRU: 55555554 [22053.943006] Call Trace: [22053.943008] <TASK> [22053.943012] ? show_regs+0x6d/0x80 [22053.943016] ? __die+0x24/0x80 [22053.943018] ? page_fault_oops+0x99/0x1b0 [22053.943021] ? do_user_addr_fault+0x2ee/0x6b0 [22053.943024] ? exc_page_fault+0x83/0x1b0 [22053.943028] ? asm_exc_page_fault+0x27/0x30 [22053.943030] ? __pfx_scx_ops_disable_workfn+0x10/0x10 [22053.943034] ? scx_ops_disable_workfn+0x85/0x610 [22053.943036] ? asm_sysvec_irq_work+0x1b/0x20 [22053.943042] ? __pfx_scx_ops_disable_workfn+0x10/0x10 [22053.943043] kthread_worker_fn+0x9e/0x230 [22053.943048] ? __pfx_kthread_worker_fn+0x10/0x10 [22053.943050] kthread+0xef/0x120 [22053.943053] ? __pfx_kthread+0x10/0x10 [22053.943056] ret_from_fork+0x44/0x70 [22053.943058] ? __pfx_kthread+0x10/0x10 [22053.943061] ret_from_fork_asm+0x1b/0x30 [22053.943064] </TASK> This happens because in scx_ops_enable(), if a scheduler is already running, we are freeing scx_exit_info, that is still owned by the running scheduler. Therefore, as soon as we stop the running scheduler we can hit the NULL pointer dereference. Reproducer: - start any scheduler - try to start another scheduler - stop the running scheduler - BUG Fix this by not freeing scx_exit_info in error path of scx_ops_enable() when there is a running scheduler. Fixes: 26ae1b0 ("scx: Make scx_exit_info fields dynamically allocated") Signed-off-by: Andrea Righi <[email protected]>

In #129, Andrea fixed a bug where we trip over a NULL pointer deref by trying to load multiple schedulers at a time. Let's add a stress test that tries to load multiple schedulers in a tight loop at the same time. Additionally, do a bit of cleanup in the build system to have testcases take all BPF progs as dependencies. We don't really gain anything by artificially coupling the name of testcases to the BPF progs they use. Signed-off-by: David Vernet <[email protected]>

arighi mentioned this pull request Jan 26, 2024

systemd-services: setting conflict between schedulers sched-ext/scx#108

Merged

Byte-Lab suggested changes Jan 26, 2024

View reviewed changes

arighi force-pushed the fix-null-scx_exit_info branch from 57d57d8 to fa75b1f Compare January 27, 2024 07:13

htejun approved these changes Jan 27, 2024

View reviewed changes

Byte-Lab approved these changes Jan 28, 2024

View reviewed changes

Byte-Lab merged commit 65c2651 into sched_ext Jan 28, 2024
2 checks passed

Byte-Lab deleted the fix-null-scx_exit_info branch March 14, 2024 18:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scx: fix NULL pointer dereference with scx_exit_info #129

scx: fix NULL pointer dereference with scx_exit_info #129

arighi commented Jan 26, 2024

Byte-Lab left a comment

Byte-Lab Jan 26, 2024

Byte-Lab Jan 26, 2024

arighi Jan 27, 2024

scx: fix NULL pointer dereference with scx_exit_info #129

scx: fix NULL pointer dereference with scx_exit_info #129

Conversation

arighi commented Jan 26, 2024

Byte-Lab left a comment

Choose a reason for hiding this comment

Byte-Lab Jan 26, 2024

Choose a reason for hiding this comment

Byte-Lab Jan 26, 2024

Choose a reason for hiding this comment

arighi Jan 27, 2024

Choose a reason for hiding this comment