From 83e31d68117c941d479fc3f4995e774ecc922922 Mon Sep 17 00:00:00 2001
From: David Vernet
Date: Thu, 28 Mar 2024 09:45:35 -0500
Subject: [PATCH] scx: Update conditions for WAKE_SYNC migration

In scx_select_cpu_dfl(), we currently migrate the waking task to the
CPU of the waker in the following scenario:

1. WAKE_SYNC is specified in wake_flags
2. There is at least one idle core in the system
3. The wakee can run on the waker CPU

The assumption implicit in (2) is that the system is undersaturated,
and that therefore the wakee's runqueue delay would not be impacted by
migrating to the waker's CPU rather than migrating to an idle core.
This assumption doesn't always hold in practice, though. Consider the
following scenario:

1. The system is overloaded, and at least one core becomes idle
2. Some groups of pairs of tasks that communicate over IPC are spawned.
3. Sender tasks are running on cores that still have enqueued tasks
   from when the system was overloaded, and they repeatedly wake wakee
   tasks with WAKE_SYNC.
4. The waker tasks observe that the system is underloaded, and so think
   that it's optimal for the wakee to be migrated to their CPU despite
   their CPU having a deep runqueue.

This can cause serious performance regressions for such workloads. For
example, hackbench regresses by nearly 10x relative to EEVDF:

[1]+  ./scx_simple > /dev/null 2> /dev/null &
[root@virtme-ng bin]# hackbench --loops 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 2.944

[root@virtme-ng bin]# fg
./scx_simple > /dev/null 2> /dev/null
^C
[root@virtme-ng bin]# hackbench --loops 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.345

What we really want is to only migrate to the waker CPU if nobody else
is already enqueued there. This will cause tasks to fan out over any
available idle CPUs if the waker's rq is overloaded, and then
eventually to start enjoying wakeups on the waker's CPU once load has
been distributed and tasks are no longer piling up on a subset of
cores.

With this patch, the regression is addressed:

[root@virtme-ng bin]# ./scx_simple > /dev/null &
[1] 336
[root@virtme-ng bin]# hackbench --loops 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.348

[root@virtme-ng bin]# fg
./scx_simple > /dev/null
^CEXIT: BPF scheduler unregistered
[root@virtme-ng bin]# hackbench --loops 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.352

Signed-off-by: David Vernet
---
 kernel/sched/ext.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index ca44bee6a1b37..2c7eda86e0836 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2084,12 +2084,18 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
 	}
 
 	/*
-	 * If WAKE_SYNC and the machine isn't fully saturated, wake up @p to the
-	 * local DSQ of the waker.
+	 * If WAKE_SYNC, the waker's local DSQ is empty, and the system is
+	 * under utilized, wake up @p to the local DSQ of the waker. Checking
+	 * only for an empty local DSQ is insufficient as it could give the
+	 * wakee an unfair advantage when the system is oversaturated.
+	 * Checking only for the presence of idle CPUs is also insufficient as
+	 * the local DSQ of the waker could have tasks piled up on it even if
+	 * there is an idle core elsewhere on the system.
 	 */
+	cpu = smp_processor_id();
 	if ((wake_flags & SCX_WAKE_SYNC) && p->nr_cpus_allowed > 1 &&
-	    !cpumask_empty(idle_masks.cpu) && !(current->flags & PF_EXITING)) {
-		cpu = smp_processor_id();
+	    !cpumask_empty(idle_masks.cpu) && !(current->flags & PF_EXITING) &&
+	    cpu_rq(cpu)->scx.local_dsq.nr == 0) {
 		if (cpumask_test_cpu(cpu, p->cpus_ptr))
 			goto cpu_found;
 	}
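
For readers without the kernel tree at hand, the sketch below is a
minimal, self-contained userspace model of the placement decision the
patched check makes. It is not kernel code and is not part of the
patch: struct wake_ctx, its field names, and should_migrate_to_waker()
are invented for illustration, and the inner cpumask_test_cpu() check
from the kernel is folded into the same predicate.

/*
 * Toy model of the updated WAKE_SYNC placement decision. All names are
 * illustrative stand-ins for the kernel state referenced in the patch.
 */
#include <stdbool.h>
#include <stdio.h>

struct wake_ctx {
	bool wake_sync;          /* wake_flags & SCX_WAKE_SYNC */
	int  wakee_cpus_allowed; /* p->nr_cpus_allowed */
	bool any_idle_cpu;       /* !cpumask_empty(idle_masks.cpu) */
	bool waker_exiting;      /* current->flags & PF_EXITING */
	int  waker_local_dsq_nr; /* cpu_rq(cpu)->scx.local_dsq.nr */
	bool wakee_can_run_here; /* cpumask_test_cpu(cpu, p->cpus_ptr) */
};

/* Return true if the wakee should be placed on the waker's CPU. */
static bool should_migrate_to_waker(const struct wake_ctx *c)
{
	return c->wake_sync &&
	       c->wakee_cpus_allowed > 1 &&
	       c->any_idle_cpu &&
	       !c->waker_exiting &&
	       c->waker_local_dsq_nr == 0 && /* the condition this patch adds */
	       c->wakee_can_run_here;
}

int main(void)
{
	/* Waker already has a task queued locally: don't pile on. */
	struct wake_ctx busy_waker = {
		.wake_sync = true, .wakee_cpus_allowed = 4,
		.any_idle_cpu = true, .waker_exiting = false,
		.waker_local_dsq_nr = 1, .wakee_can_run_here = true,
	};
	/* Same situation, but the waker's local DSQ is empty. */
	struct wake_ctx idle_waker = busy_waker;
	idle_waker.waker_local_dsq_nr = 0;

	printf("busy waker -> migrate to waker CPU? %d\n",
	       should_migrate_to_waker(&busy_waker));
	printf("idle waker -> migrate to waker CPU? %d\n",
	       should_migrate_to_waker(&idle_waker));
	return 0;
}

The first example mirrors the hackbench pathology above (idle cores
exist elsewhere, but the waker's local DSQ already holds a task), so
the sync migration is declined; the second, with an empty local DSQ,
allows it.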