This repository has been archived by the owner on Jun 18, 2024. It is now read-only.

scx: Implement scx_bpf_dispatch_cancel() #135

Merged
merged 1 commit into from
Feb 4, 2024

Conversation

htejun
Collaborator

@htejun htejun commented Feb 3, 2024

This can be used to implement interlocking against dequeue. For example, let's say there is one dsq per CPU and the scheduler can stall if a task is dispatched to a dsq whose CPU isn't in p->cpus_ptr. In the dispatch path, if you do the following:

if (bpf_cpumask_test_cpu(target_cpu, p->cpus_ptr))
	scx_bpf_dispatch(p, target_dsq, slice, 0);

The task can still end up in the wrong dsq because the cpumask can change between bpf_cpumask_test_cpu() and scx_bpf_dispatch(). However,

scx_bpf_dispatch(p, target_dsq, slice, 0);
if (!bpf_cpumask_test_cpu(target_cpu, p->cpus_ptr))
	scx_bpf_dispatch_cancel();

eliminates the race window - either the set_cpumask path sees the task dispatched or the dispatch path sees the updated cpumask. In both cases, the dispatch gets canceled as it should.

Note that in schedulers which don't implement .dequeue(), depending on the implementation, there can be multiple enqueued instances of the same task. However, as long as the scheduler guarantees at least one dispatch attempt on the task since its last enqueueing, and that dispatch attempt interlocks against the dequeue path as above, it's guaranteed not to act on stale information or miss dispatching an enqueued task.


Signed-off-by: Tejun Heo <[email protected]>
Cc: Andrea Righi <[email protected]>
@htejun htejun requested review from arighi and Byte-Lab February 3, 2024 19:37
Collaborator

@arighi arighi left a comment


Looks good to me, thanks for doing this @htejun .

One question, theoretically can we have the same race if we do direct dispatch from .enqueue() or .select_cpu() targeting a CPU DSQ instead of using SCX_DSQ_LOCAL?

arighi pushed a commit to sched-ext/scx that referenced this pull request Feb 4, 2024
Use scx_bpf_dispatch_cancel() to invalidate dispatches to the wrong
per-CPU DSQ due to cpumask race conditions.

Link: sched-ext/sched_ext#135
Signed-off-by: Andrea Righi <[email protected]>
@htejun
Collaborator Author

htejun commented Feb 4, 2024

One question, theoretically can we have the same race if we do direct dispatch from .enqueue() or .select_cpu() targeting a CPU DSQ instead of using SCX_DSQ_LOCAL?

We're holding p->pi_lock across .select_cpu() and .enqueue(), so we're atomic against any attribute changes. We should be okay.

@htejun htejun merged commit 1ecced8 into sched_ext Feb 4, 2024
2 checks passed
arighi pushed a commit to sched-ext/scx that referenced this pull request Feb 8, 2024
Use scx_bpf_dispatch_cancel() to invalidate dispatches to the wrong
per-CPU DSQ due to cpumask race conditions.

This prevents dispatching tasks to CPUs that cannot be used according to
the task's cpumask.

With this applied the scheduler passed all the `stress-ng --race-sched`
stress tests.

Link: sched-ext/sched_ext#135
Signed-off-by: Andrea Righi <[email protected]>
arighi pushed a commit to sched-ext/scx that referenced this pull request Feb 8, 2024
Use scx_bpf_dispatch_cancel() to invalidate dispatches to the wrong
per-CPU DSQ due to cpumask race conditions, and redirect them to the
shared DSQ.

This prevents dispatching tasks to CPUs that cannot be used according to
the task's cpumask.

With this applied the scheduler passed all the `stress-ng --race-sched`
stress tests.

Moreover, introduce a counter that is periodically reported to stdout as
an additional statistic, which can be helpful for debugging.

Link: sched-ext/sched_ext#135
Signed-off-by: Andrea Righi <[email protected]>