[v2,2/2] sched_ext: Add cpuperf support

Message ID ZnM2ywDVRZbrN6OC@slm.duckdns.org
State Accepted
Commit d86adb4fc0655a0867da811d000df75d2a325ef6

Commit Message

Tejun Heo June 19, 2024, 7:51 p.m. UTC
sched_ext currently does not integrate with schedutil. When schedutil is the
governor, frequencies are left unregulated and usually get stuck close to
the highest performance level from running RT tasks.

Add CPU performance monitoring and scaling support by integrating into
schedutil. The following kfuncs are added:

- scx_bpf_cpuperf_cap(): Query the relative performance capacity of
  different CPUs in the system.

- scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
  relative to its max performance.

- scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.

This gives direct control over CPU performance setting to the BPF scheduler.
The only changes on the schedutil side are accounting for the utilization
factor from sched_ext and disabling the frequency holding heuristics, as they
may not apply well to sched_ext schedulers, which may have a much weaker
connection between tasks and their current / last CPU.

With cpuperf support added, there is no reason to block uclamp. Enable it
while at it.

A toy implementation of cpuperf is added to scx_qmap as a demonstration of
the feature.

v2: Ignore cpu_util_cfs_boost() when scx_switched_all() in sugov_get_util()
    to avoid factoring in stale util metric. (Christian)

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Christian Loehle <christian.loehle@arm.com>
---
 kernel/sched/cpufreq_schedutil.c         |   12 ++
 kernel/sched/ext.c                       |   83 +++++++++++++++++-
 kernel/sched/ext.h                       |    9 +
 kernel/sched/sched.h                     |    1 
 tools/sched_ext/include/scx/common.bpf.h |    3 
 tools/sched_ext/scx_qmap.bpf.c           |  142 ++++++++++++++++++++++++++++++-
 tools/sched_ext/scx_qmap.c               |    8 +
 7 files changed, 252 insertions(+), 6 deletions(-)
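
The scale shared by the three kfuncs can be illustrated with a small
user-space sketch. It assumes SCX_CPUPERF_ONE equals SCHED_CAPACITY_SCALE
(1024), as the patch defines it. The helper names abs_perf() and
weighted_avg_perf() are hypothetical; they only mirror the formula in the
scx_bpf_cpuperf_cur() comment and the averaging done by monitor_cpuperf() in
scx_qmap, and are not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: SCX_CPUPERF_ONE == SCHED_CAPACITY_SCALE == 1024. */
#define SCX_CPUPERF_ONE 1024u

/*
 * Hypothetical helper: convert a CPU's current level (relative to its own
 * maximum) into a level relative to the most performant CPU in the system.
 * @cap models scx_bpf_cpuperf_cap(), @cur models scx_bpf_cpuperf_cur().
 */
static uint32_t abs_perf(uint32_t cap, uint32_t cur)
{
	return (uint64_t)cap * cur / SCX_CPUPERF_ONE;
}

/*
 * Hypothetical helper: capacity-weighted average of the current levels,
 * following the arithmetic used by monitor_cpuperf(). Returns 0 when no
 * CPU is given.
 */
static uint32_t weighted_avg_perf(const uint32_t *cap, const uint32_t *cur,
				  size_t nr)
{
	uint64_t cap_sum = 0, cur_sum = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		/* scale each CPU's level down by its capacity before summing */
		cur_sum += abs_perf(cap[i], cur[i]);
		cap_sum += cap[i];
	}
	return cap_sum ? cur_sum * SCX_CPUPERF_ONE / cap_sum : 0;
}
```

For example, a CPU with capacity 512 running at half of its own maximum
(cur = 512) sits at 256 on the system-wide scale.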

Comments

Tejun Heo July 5, 2024, 6:22 p.m. UTC | #1
Hello, Vincent.

On Fri, Jul 05, 2024 at 02:41:41PM +0200, Vincent Guittot wrote:
> >  static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> >  {
> > -       unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> > +       unsigned long min, max, util = scx_cpuperf_target(sg_cpu->cpu);
> >
> > +       if (!scx_switched_all())
> > +               util += cpu_util_cfs_boost(sg_cpu->cpu);
> 
> I don't see the need for this. If fair is not used, this returns zero

There's scx_enabled() and scx_switched_all(). The former is set when some
tasks may be on sched_ext. The latter when all tasks are on sched_ext. When
some tasks may be on sched_ext but other tasks may be on fair, the condition
is scx_enabled() && !scx_switched_all(). So, the above if statement
condition is true for all cases that tasks may be on CFS (sched_ext is
disabled or is enabled in partial mode).

Thanks.
Vincent Guittot July 6, 2024, 9:01 a.m. UTC | #2
On Fri, 5 Jul 2024 at 20:22, Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Vincent.
>
> On Fri, Jul 05, 2024 at 02:41:41PM +0200, Vincent Guittot wrote:
> > >  static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> > >  {
> > > -       unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> > > +       unsigned long min, max, util = scx_cpuperf_target(sg_cpu->cpu);
> > >
> > > +       if (!scx_switched_all())
> > > +               util += cpu_util_cfs_boost(sg_cpu->cpu);
> >
> > I don't see the need for this. If fair is not used, this returns zero
>
> There's scx_enabled() and scx_switched_all(). The former is set when some
> tasks may be on sched_ext. The latter when all tasks are on sched_ext. When
> some tasks may be on sched_ext but other tasks may be on fair, the condition
> is scx_enabled() && !scx_switched_all(). So, the above if statement
> condition is true for all cases that tasks may be on CFS (sched_ext is
> disabled or is enabled in partial mode).

My point is that if there is no fair task, cpu_util_cfs_boost() will
already return 0 so there is no need to add a sched_ext if statement
there

Vincent

>
> Thanks.
>
> --
> tejun
Tejun Heo July 7, 2024, 1:44 a.m. UTC | #3
Hello,

On Sat, Jul 06, 2024 at 11:01:20AM +0200, Vincent Guittot wrote:
> > There's scx_enabled() and scx_switched_all(). The former is set when some
> > tasks may be on sched_ext. The latter when all tasks are on sched_ext. When
> > some tasks may be on sched_ext but other tasks may be on fair, the condition
> > is scx_enabled() && !scx_switched_all(). So, the above if statement
> > condition is true for all cases that tasks may be on CFS (sched_ext is
> > disabled or is enabled in partial mode).
> 
> My point is that if there is no fair task, cpu_util_cfs_boost() will
> already return 0 so there is no need to add a sched_ext if statement
> there

I see, but scx_switched_all() is a static key while cpu_util_cfs_boost()
isn't necessarily trivial. I can remove the conditional but wouldn't it make
more sense to keep it?

Thanks.
Tejun Heo July 8, 2024, 6:20 p.m. UTC | #4
Hello, Vincent.

On Mon, Jul 08, 2024 at 08:37:06AM +0200, Vincent Guittot wrote:
> I prefer to minimize (if not remove) sched_ext related calls in the
> fair path so we can easily rework it if needed. And this will also
> ensure that all fair task are cleanly removed when they are all
> switched to sched_ext

Unless we add a WARN_ON_ONCE, if it doesn't behave as expected, the end
result will most likely be cpufreq sometimes picking a higher freq than
requested, which won't be the easiest to notice. Would you be against adding
WARN_ON_ONCE(scx_switched_all && !util) too?

Thanks.
Vincent Guittot July 8, 2024, 7:51 p.m. UTC | #5
On Mon, 8 Jul 2024 at 20:20, Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Vincent.
>
> On Mon, Jul 08, 2024 at 08:37:06AM +0200, Vincent Guittot wrote:
> > I prefer to minimize (if not remove) sched_ext related calls in the
> > fair path so we can easily rework it if needed. And this will also
> > ensure that all fair task are cleanly removed when they are all
> > switched to sched_ext
>
> Unless we add a WARN_ON_ONCE, if it doesn't behave as expected, the end
> result will most likely be cpufreq sometimes picking a higher freq than
> requested, which won't be the easiest to notice. Would you be against adding
> WARN_ON_ONCE(scx_switched_all && !util) too?

A WARN_ON_ONCE to detect misbehavior would be ok

>
> Thanks.
>
> --
> tejun
Tejun Heo July 8, 2024, 9:08 p.m. UTC | #6
Hello, Vincent.

On Mon, Jul 08, 2024 at 09:51:08PM +0200, Vincent Guittot wrote:
> > Unless we add a WARN_ON_ONCE, if it doesn't behave as expected, the end
> > result will most likely be cpufreq sometimes picking a higher freq than
> > requested, which won't be the easiest to notice. Would you be against adding
> > WARN_ON_ONCE(scx_switched_all && !util) too?
> 
> A WARN_ON_ONCE to detect misbehavior would be ok

I tried this and it's a bit problematic. Migrating out all the tasks does
bring the numbers pretty close to zero but the math doesn't work out exactly
and it often leaves 1 in the averages. While the fair class is in use, they
would decay quickly through __update_blocked_fair(); however, when all tasks
are switched to sched_ext, that function doesn't get called and the
remaining small value never decays.

Now, the value being really low, it doesn't really matter but it's an
unnecessary complication. I can make sched_ext keep calling
__update_blocked_fair() in addition to update_other_load_avgs() to decay
fair's averages but that seems a lot more complicated than having one
scx_switched_all() test.

Thanks.
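
The trade-off in this exchange can be modelled in user space. The sketch
below is a simplified stand-in for the start of sugov_get_util(), with the
static keys and the fair-class utilization passed in as plain arguments (an
assumption made for illustration; model_util() is a hypothetical name, not a
kernel function). It shows that a stale residual of 1 in the fair averages
reaches the result only when the scx_switched_all() test is dropped.

```c
#include <stdbool.h>

/*
 * Hypothetical user-space model of the utilization fed to schedutil.
 * @scx_target:   value of rq->scx.cpuperf_target
 * @cfs_util:     what cpu_util_cfs_boost() would return
 * @enabled:      stands in for scx_enabled()
 * @switched_all: stands in for scx_switched_all()
 * @gate:         whether the scx_switched_all() test is kept
 */
static unsigned long model_util(unsigned long scx_target,
				unsigned long cfs_util,
				bool enabled, bool switched_all, bool gate)
{
	/* scx_cpuperf_target() returns 0 unless sched_ext is enabled */
	unsigned long util = enabled ? scx_target : 0;

	/* with the gate, fair utilization is skipped once all tasks are on SCX */
	if (!gate || !switched_all)
		util += cfs_util;
	return util;
}
```

With a target of 512 and a leftover fair utilization of 1, the gated variant
yields 512 while the ungated one yields 513. In partial mode both variants
add the fair utilization.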
Vincent Guittot July 9, 2024, 1:36 p.m. UTC | #7
On Mon, 8 Jul 2024 at 23:09, Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Vincent.
>
> On Mon, Jul 08, 2024 at 09:51:08PM +0200, Vincent Guittot wrote:
> > > Unless we add a WARN_ON_ONCE, if it doesn't behave as expected, the end
> > > result will most likely be cpufreq sometimes picking a higher freq than
> > > requested, which won't be the easiest to notice. Would you be against adding
> > > WARN_ON_ONCE(scx_switched_all && !util) too?
> >
> > A WARN_ON_ONCE to detect misbehavior would be ok
>
> I tried this and it's a bit problematic. Migrating out all the tasks do
> bring the numbers pretty close to zero but the math doesn't work out exactly
> and it often leaves 1 in the averages. While the fair class is in use, they

Hmm, interesting. Such a small remaining value could be expected for
load_avg but not for util_avg, which is normally a direct propagation.
Do you have a particular sequence?

> would decay quickly through __update_blocked_fair(); however, when all tasks
> are switched to sched_ext, that function doesn't get called and the
> remaining small value never decays.
>
> Now, the value being really low, it doesn't really matter but it's an
> unnecessary complication. I can make sched_ext keep calling
> __update_blocked_fair() in addition to update_other_load_avgs() to decay
> fair's averages but that seems a lot more complicated than having one
> scx_switched_all() test.
>
> Thanks.
>
> --
> tejun
Tejun Heo July 9, 2024, 4:43 p.m. UTC | #8
Hello,

On Tue, Jul 09, 2024 at 03:36:34PM +0200, Vincent Guittot wrote:
> > I tried this and it's a bit problematic. Migrating out all the tasks do
> > bring the numbers pretty close to zero but the math doesn't work out exactly
> > and it often leaves 1 in the averages. While the fair class is in use, they
> 
> hmm interesting, such remaining small value could be expected for
> load_avg but not with util_avg which is normally a direct propagation.
> Do you have a sequence in particular ?

Oh, I thought it was a byproduct of decay calculations not exactly matching
up between the sum and the components but I haven't really checked. It's
really easy to reproduce. Just boot a kernel with sched_ext enabled (with
some instrumentation added to monitor the util calculation), run some
stress workload to be sure, and run a sched_ext scheduler (make -C
tools/sched_ext && tools/sched_ext/build/bin/scx_simple).

Thanks.
Vincent Guittot July 12, 2024, 10:12 a.m. UTC | #9
On Tue, 9 Jul 2024 at 18:43, Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, Jul 09, 2024 at 03:36:34PM +0200, Vincent Guittot wrote:
> > > I tried this and it's a bit problematic. Migrating out all the tasks do
> > > bring the numbers pretty close to zero but the math doesn't work out exactly
> > > and it often leaves 1 in the averages. While the fair class is in use, they
> >
> > hmm interesting, such remaining small value could be expected for
> > load_avg but not with util_avg which is normally a direct propagation.
> > Do you have a sequence in particular ?
>
> Oh, I thought it was a byproduct of decay calculations not exactly matching
> up between the sum and the components but I haven't really checked. It's
> really easy to reproduce. Just boot a kernel with sched_ext enabled (with
> some instrumentations added to monitor the util calculation), run some
> stress workload to be sure and run a sched_ext scheduler (make -C
> tools/sched_ext && tools/sched_ext/build/bin/scx_simple).

I failed to set up my dev system to reproduce your use case in time,
and I'm going to be away for the coming weeks, so I suppose that you
should move forward and I will look at that when I am back at my dev system.

It seems that "make -C tools/sched_ext ARCH=arm64 LLVM=-16" doesn't
use clang-16 everywhere like the rest of the kernel, which triggers
an error on my system:

make -C <path-to-linux>/linux/tools/sched_ext ARCH=arm64
LOCALVERSION=+ LLVM=-16
O=<path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext
...
clang-16 -g -O0 -fPIC -std=gnu89 -Wbad-function-cast
-Wdeclaration-after-statement -Wformat-security -Wformat-y2k
-Winit-self -Wmissing-declarations -Wmissing-prototypes
-Wnested-externs -Wno-system-headers -Wold-style-definition -Wpacked
-Wredundant-decls -Wstrict-prototypes -Wswitch-default -Wswitch-enum
-Wundef -Wwrite-strings -Wformat -Wno-type-limits -Wshadow
-Wno-switch-enum -Werror -Wall
-I<path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/libbpf/
-I<path-to-linux>/linux/tools/include
-I<path-to-linux>/linux/tools/include/uapi -fvisibility=hidden
-D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64  \
--shared -Wl,-soname,libbpf.so.1 \
-Wl,--version-script=libbpf.map
<path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/libbpf/sharedobjs/libbpf-in.o
-lelf -lz -o <path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/libbpf/libbpf.so.1.5.0
...
clang -g -D__TARGET_ARCH_arm64 -mlittle-endian
-I<path-to-linux>/linux/tools/sched_ext/include
-I<path-to-linux>/linux/tools/sched_ext/include/bpf-compat
-I<path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/include
-I<path-to-linux>/linux/tools/include/uapi -I../../include -idirafter
/usr/lib/llvm-14/lib/clang/14.0.0/include -idirafter
/usr/local/include -idirafter /usr/include/x86_64-linux-gnu -idirafter
/usr/include  -Wall -Wno-compare-distinct-pointer-types -O2 -mcpu=v3
-target bpf -c scx_simple.bpf.c -o
<path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/sched_ext/scx_simple.bpf.o
In file included from scx_simple.bpf.c:23:
<path-to-linux>/linux/tools/sched_ext/include/scx/common.bpf.h:27:17:
error: use of undeclared identifier 'SCX_DSQ_FLAG_BUILTIN'
        _Static_assert(SCX_DSQ_FLAG_BUILTIN,
                       ^
...
fatal error: too many errors emitted, stopping now [-ferror-limit=]
5 warnings and 20 errors generated.

Vincent

>
> Thanks.
>
> --
> tejun
Tejun Heo July 12, 2024, 5:10 p.m. UTC | #10
Hello,

On Fri, Jul 12, 2024 at 12:12:32PM +0200, Vincent Guittot wrote:
...
> I failed to setup my dev system for reproducing your use case in time
> and I'm going to be away for the coming weeks so I suppose that you
> should move forward and I will look at that when back to my dev system

Thankfully, this should be pretty easy to fix up however we want afterwards.

> It seems that "make -C tools/sched_ext ARCH=arm64 LLVM=-16" doesn't
> use clang-16 everywhere like the rest of the kernel which triggers
> error on my system:

Hmm... there is llvm prefix/suffix handling in the Makefile. I wonder what's
broken.

> make -C <path-to-linux>/linux/tools/sched_ext ARCH=arm64
> LOCALVERSION=+ LLVM=-16
> O=<path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext
> ...
> clang-16 -g -O0 -fPIC -std=gnu89 -Wbad-function-cast
> -Wdeclaration-after-statement -Wformat-security -Wformat-y2k
> -Winit-self -Wmissing-declarations -Wmissing-prototypes
> -Wnested-externs -Wno-system-headers -Wold-style-definition -Wpacked
> -Wredundant-decls -Wstrict-prototypes -Wswitch-default -Wswitch-enum
> -Wundef -Wwrite-strings -Wformat -Wno-type-limits -Wshadow
> -Wno-switch-enum -Werror -Wall
> -I<path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/libbpf/
> -I<path-to-linux>/linux/tools/include
> -I<path-to-linux>/linux/tools/include/uapi -fvisibility=hidden
> -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64  \
> --shared -Wl,-soname,libbpf.so.1 \
> -Wl,--version-script=libbpf.map
> <path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/libbpf/sharedobjs/libbpf-in.o
> -lelf -lz -o <path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/libbpf/libbpf.so.1.5.0

So, this is regular arm target building.

> clang -g -D__TARGET_ARCH_arm64 -mlittle-endian
> -I<path-to-linux>/linux/tools/sched_ext/include
> -I<path-to-linux>/linux/tools/sched_ext/include/bpf-compat
> -I<path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/include
> -I<path-to-linux>/linux/tools/include/uapi -I../../include -idirafter
> /usr/lib/llvm-14/lib/clang/14.0.0/include -idirafter
> /usr/local/include -idirafter /usr/include/x86_64-linux-gnu -idirafter
> /usr/include  -Wall -Wno-compare-distinct-pointer-types -O2 -mcpu=v3
> -target bpf -c scx_simple.bpf.c -o
> <path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/sched_ext/scx_simple.bpf.o
> In file included from scx_simple.bpf.c:23:
> <path-to-linux>/linux/tools/sched_ext/include/scx/common.bpf.h:27:17:
> error: use of undeclared identifier 'SCX_DSQ_FLAG_BUILTIN'
>         _Static_assert(SCX_DSQ_FLAG_BUILTIN,
>                        ^
> fatal error: too many errors emitted, stopping now [-ferror-limit=]
> 5 warnings and 20 errors generated.

This is BPF.

The Makefile is mostly copied from other existing BPF Makefiles under tools,
so I don't quite understand why things are set up this way but

  CC := $(LLVM_PREFIX)clang$(LLVM_SUFFIX) $(CLANG_FLAGS) -fintegrated-as

is what's used to build regular targets, while

  $(CLANG) $(BPF_CFLAGS) -target bpf -c $< -o $@

is what's used to build BPF targets. It's not too out there to use a
different compiler for BPF targets, so maybe that's why? I'll ask BPF folks.

Thanks.

Patch

--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -197,8 +197,10 @@  unsigned long sugov_effective_cpu_perf(i
 
 static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
 {
-	unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
+	unsigned long min, max, util = scx_cpuperf_target(sg_cpu->cpu);
 
+	if (!scx_switched_all())
+		util += cpu_util_cfs_boost(sg_cpu->cpu);
 	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
 	util = max(util, boost);
 	sg_cpu->bw_min = min;
@@ -330,6 +332,14 @@  static bool sugov_hold_freq(struct sugov
 	unsigned long idle_calls;
 	bool ret;
 
+	/*
+	 * The heuristics in this function is for the fair class. For SCX, the
+	 * performance target comes directly from the BPF scheduler. Let's just
+	 * follow it.
+	 */
+	if (scx_switched_all())
+		return false;
+
 	/* if capped by uclamp_max, always update to be in compliance */
 	if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
 		return false;
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -16,6 +16,8 @@  enum scx_consts {
 	SCX_EXIT_BT_LEN			= 64,
 	SCX_EXIT_MSG_LEN		= 1024,
 	SCX_EXIT_DUMP_DFL_LEN		= 32768,
+
+	SCX_CPUPERF_ONE			= SCHED_CAPACITY_SCALE,
 };
 
 enum scx_exit_kind {
@@ -3520,7 +3522,7 @@  DEFINE_SCHED_CLASS(ext) = {
 	.update_curr		= update_curr_scx,
 
 #ifdef CONFIG_UCLAMP_TASK
-	.uclamp_enabled		= 0,
+	.uclamp_enabled		= 1,
 #endif
 };
 
@@ -4393,7 +4395,7 @@  static int scx_ops_enable(struct sched_e
 	struct scx_task_iter sti;
 	struct task_struct *p;
 	unsigned long timeout;
-	int i, ret;
+	int i, cpu, ret;
 
 	mutex_lock(&scx_ops_enable_mutex);
 
@@ -4442,6 +4444,9 @@  static int scx_ops_enable(struct sched_e
 
 	atomic_long_set(&scx_nr_rejected, 0);
 
+	for_each_possible_cpu(cpu)
+		cpu_rq(cpu)->scx.cpuperf_target = SCX_CPUPERF_ONE;
+
 	/*
 	 * Keep CPUs stable during enable so that the BPF scheduler can track
 	 * online CPUs by watching ->on/offline_cpu() after ->init().
@@ -5836,6 +5841,77 @@  __bpf_kfunc void scx_bpf_dump_bstr(char
 }
 
 /**
+ * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU
+ * @cpu: CPU of interest
+ *
+ * Return the maximum relative capacity of @cpu in relation to the most
+ * performant CPU in the system. The return value is in the range [1,
+ * %SCX_CPUPERF_ONE]. See scx_bpf_cpuperf_cur().
+ */
+__bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu)
+{
+	if (ops_cpu_valid(cpu, NULL))
+		return arch_scale_cpu_capacity(cpu);
+	else
+		return SCX_CPUPERF_ONE;
+}
+
+/**
+ * scx_bpf_cpuperf_cur - Query the current relative performance of a CPU
+ * @cpu: CPU of interest
+ *
+ * Return the current relative performance of @cpu in relation to its maximum.
+ * The return value is in the range [1, %SCX_CPUPERF_ONE].
+ *
+ * The current performance level of a CPU in relation to the maximum performance
+ * available in the system can be calculated as follows:
+ *
+ *   scx_bpf_cpuperf_cap() * scx_bpf_cpuperf_cur() / %SCX_CPUPERF_ONE
+ *
+ * The result is in the range [1, %SCX_CPUPERF_ONE].
+ */
+__bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu)
+{
+	if (ops_cpu_valid(cpu, NULL))
+		return arch_scale_freq_capacity(cpu);
+	else
+		return SCX_CPUPERF_ONE;
+}
+
+/**
+ * scx_bpf_cpuperf_set - Set the relative performance target of a CPU
+ * @cpu: CPU of interest
+ * @perf: target performance level [0, %SCX_CPUPERF_ONE]
+ * @flags: %SCX_CPUPERF_* flags
+ *
+ * Set the target performance level of @cpu to @perf. @perf is in linear
+ * relative scale between 0 and %SCX_CPUPERF_ONE. This determines how the
+ * schedutil cpufreq governor chooses the target frequency.
+ *
+ * The actual performance level chosen, CPU grouping, and the overhead and
+ * latency of the operations are dependent on the hardware and cpufreq driver in
+ * use. Consult hardware and cpufreq documentation for more information. The
+ * current performance level can be monitored using scx_bpf_cpuperf_cur().
+ */
+__bpf_kfunc void scx_bpf_cpuperf_set(u32 cpu, u32 perf)
+{
+	if (unlikely(perf > SCX_CPUPERF_ONE)) {
+		scx_ops_error("Invalid cpuperf target %u for CPU %d", perf, cpu);
+		return;
+	}
+
+	if (ops_cpu_valid(cpu, NULL)) {
+		struct rq *rq = cpu_rq(cpu);
+
+		rq->scx.cpuperf_target = perf;
+
+		rcu_read_lock_sched_notrace();
+		cpufreq_update_util(cpu_rq(cpu), 0);
+		rcu_read_unlock_sched_notrace();
+	}
+}
+
+/**
  * scx_bpf_nr_cpu_ids - Return the number of possible CPU IDs
  *
  * All valid CPU IDs in the system are smaller than the returned value.
@@ -6045,6 +6121,9 @@  BTF_ID_FLAGS(func, scx_bpf_destroy_dsq)
 BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_dump_bstr, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_set)
 BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids)
 BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE)
 BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE)
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -48,6 +48,14 @@  int scx_check_setscheduler(struct task_s
 bool task_should_scx(struct task_struct *p);
 void init_sched_ext_class(void);
 
+static inline u32 scx_cpuperf_target(s32 cpu)
+{
+	if (scx_enabled())
+		return cpu_rq(cpu)->scx.cpuperf_target;
+	else
+		return 0;
+}
+
 static inline const struct sched_class *next_active_class(const struct sched_class *class)
 {
 	class++;
@@ -89,6 +97,7 @@  static inline void scx_pre_fork(struct t
 static inline int scx_fork(struct task_struct *p) { return 0; }
 static inline void scx_post_fork(struct task_struct *p) {}
 static inline void scx_cancel_fork(struct task_struct *p) {}
+static inline u32 scx_cpuperf_target(s32 cpu) { return 0; }
 static inline bool scx_can_stop_tick(struct rq *rq) { return true; }
 static inline void scx_rq_activate(struct rq *rq) {}
 static inline void scx_rq_deactivate(struct rq *rq) {}
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -743,6 +743,7 @@  struct scx_rq {
 	u64			extra_enq_flags;	/* see move_task_to_local_dsq() */
 	u32			nr_running;
 	u32			flags;
+	u32			cpuperf_target;		/* [0, SCHED_CAPACITY_SCALE] */
 	bool			cpu_released;
 	cpumask_var_t		cpus_to_kick;
 	cpumask_var_t		cpus_to_kick_if_idle;
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -42,6 +42,9 @@  void scx_bpf_destroy_dsq(u64 dsq_id) __k
 void scx_bpf_exit_bstr(s64 exit_code, char *fmt, unsigned long long *data, u32 data__sz) __ksym __weak;
 void scx_bpf_error_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym;
 void scx_bpf_dump_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym __weak;
+u32 scx_bpf_cpuperf_cap(s32 cpu) __ksym __weak;
+u32 scx_bpf_cpuperf_cur(s32 cpu) __ksym __weak;
+void scx_bpf_cpuperf_set(s32 cpu, u32 perf) __ksym __weak;
 u32 scx_bpf_nr_cpu_ids(void) __ksym __weak;
 const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak;
 const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak;
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -69,6 +69,18 @@  struct {
 };
 
 /*
+ * If enabled, CPU performance target is set according to the queue index
+ * according to the following table.
+ */
+static const u32 qidx_to_cpuperf_target[] = {
+	[0] = SCX_CPUPERF_ONE * 0 / 4,
+	[1] = SCX_CPUPERF_ONE * 1 / 4,
+	[2] = SCX_CPUPERF_ONE * 2 / 4,
+	[3] = SCX_CPUPERF_ONE * 3 / 4,
+	[4] = SCX_CPUPERF_ONE * 4 / 4,
+};
+
+/*
  * Per-queue sequence numbers to implement core-sched ordering.
  *
  * Tail seq is assigned to each queued task and incremented. Head seq tracks the
@@ -95,6 +107,8 @@  struct {
 struct cpu_ctx {
 	u64	dsp_idx;	/* dispatch index */
 	u64	dsp_cnt;	/* remaining count */
+	u32	avg_weight;
+	u32	cpuperf_target;
 };
 
 struct {
@@ -107,6 +121,8 @@  struct {
 /* Statistics */
 u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued;
 u64 nr_core_sched_execed;
+u32 cpuperf_min, cpuperf_avg, cpuperf_max;
+u32 cpuperf_target_min, cpuperf_target_avg, cpuperf_target_max;
 
 s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
 		   s32 prev_cpu, u64 wake_flags)
@@ -313,6 +329,29 @@  void BPF_STRUCT_OPS(qmap_dispatch, s32 c
 	}
 }
 
+void BPF_STRUCT_OPS(qmap_tick, struct task_struct *p)
+{
+	struct cpu_ctx *cpuc;
+	u32 zero = 0;
+	int idx;
+
+	if (!(cpuc = bpf_map_lookup_elem(&cpu_ctx_stor, &zero))) {
+		scx_bpf_error("failed to look up cpu_ctx");
+		return;
+	}
+
+	/*
+	 * Use the running avg of weights to select the target cpuperf level.
+	 * This is a demonstration of the cpuperf feature rather than a
+	 * practical strategy to regulate CPU frequency.
+	 */
+	cpuc->avg_weight = cpuc->avg_weight * 3 / 4 + p->scx.weight / 4;
+	idx = weight_to_idx(cpuc->avg_weight);
+	cpuc->cpuperf_target = qidx_to_cpuperf_target[idx];
+
+	scx_bpf_cpuperf_set(scx_bpf_task_cpu(p), cpuc->cpuperf_target);
+}
+
 /*
  * The distance from the head of the queue scaled by the weight of the queue.
  * The lower the number, the older the task and the higher the priority.
@@ -422,8 +461,9 @@  void BPF_STRUCT_OPS(qmap_dump_cpu, struc
 	if (!(cpuc = bpf_map_lookup_percpu_elem(&cpu_ctx_stor, &zero, cpu)))
 		return;
 
-	scx_bpf_dump("QMAP: dsp_idx=%llu dsp_cnt=%llu",
-		     cpuc->dsp_idx, cpuc->dsp_cnt);
+	scx_bpf_dump("QMAP: dsp_idx=%llu dsp_cnt=%llu avg_weight=%u cpuperf_target=%u",
+		     cpuc->dsp_idx, cpuc->dsp_cnt, cpuc->avg_weight,
+		     cpuc->cpuperf_target);
 }
 
 void BPF_STRUCT_OPS(qmap_dump_task, struct scx_dump_ctx *dctx, struct task_struct *p)
@@ -492,11 +532,106 @@  void BPF_STRUCT_OPS(qmap_cpu_offline, s3
 	print_cpus();
 }
 
+struct monitor_timer {
+	struct bpf_timer timer;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, u32);
+	__type(value, struct monitor_timer);
+} monitor_timer SEC(".maps");
+
+/*
+ * Print out the min, avg and max performance levels of CPUs every second to
+ * demonstrate the cpuperf interface.
+ */
+static void monitor_cpuperf(void)
+{
+	u32 zero = 0, nr_cpu_ids;
+	u64 cap_sum = 0, cur_sum = 0, cur_min = SCX_CPUPERF_ONE, cur_max = 0;
+	u64 target_sum = 0, target_min = SCX_CPUPERF_ONE, target_max = 0;
+	const struct cpumask *online;
+	int i, nr_online_cpus = 0;
+
+	nr_cpu_ids = scx_bpf_nr_cpu_ids();
+	online = scx_bpf_get_online_cpumask();
+
+	bpf_for(i, 0, nr_cpu_ids) {
+		struct cpu_ctx *cpuc;
+		u32 cap, cur;
+
+		if (!bpf_cpumask_test_cpu(i, online))
+			continue;
+		nr_online_cpus++;
+
+		/* collect the capacity and current cpuperf */
+		cap = scx_bpf_cpuperf_cap(i);
+		cur = scx_bpf_cpuperf_cur(i);
+
+		cur_min = cur < cur_min ? cur : cur_min;
+		cur_max = cur > cur_max ? cur : cur_max;
+
+		/*
+		 * $cur is relative to $cap. Scale it down accordingly so that
+		 * it's in the same scale as other CPUs and $cur_sum/$cap_sum
+		 * makes sense.
+		 */
+		cur_sum += cur * cap / SCX_CPUPERF_ONE;
+		cap_sum += cap;
+
+		if (!(cpuc = bpf_map_lookup_percpu_elem(&cpu_ctx_stor, &zero, i))) {
+			scx_bpf_error("failed to look up cpu_ctx");
+			goto out;
+		}
+
+		/* collect target */
+		cur = cpuc->cpuperf_target;
+		target_sum += cur;
+		target_min = cur < target_min ? cur : target_min;
+		target_max = cur > target_max ? cur : target_max;
+	}
+
+	cpuperf_min = cur_min;
+	cpuperf_avg = cur_sum * SCX_CPUPERF_ONE / cap_sum;
+	cpuperf_max = cur_max;
+
+	cpuperf_target_min = target_min;
+	cpuperf_target_avg = target_sum / nr_online_cpus;
+	cpuperf_target_max = target_max;
+out:
+	scx_bpf_put_cpumask(online);
+}
+
+static int monitor_timerfn(void *map, int *key, struct bpf_timer *timer)
+{
+	monitor_cpuperf();
+
+	bpf_timer_start(timer, ONE_SEC_IN_NS, 0);
+	return 0;
+}
+
 s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
 {
+	u32 key = 0;
+	struct bpf_timer *timer;
+	s32 ret;
+
 	print_cpus();
 
-	return scx_bpf_create_dsq(SHARED_DSQ, -1);
+	ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
+	if (ret)
+		return ret;
+
+	timer = bpf_map_lookup_elem(&monitor_timer, &key);
+	if (!timer)
+		return -ESRCH;
+
+	bpf_timer_init(timer, &monitor_timer, CLOCK_MONOTONIC);
+	bpf_timer_set_callback(timer, monitor_timerfn);
+
+	return bpf_timer_start(timer, ONE_SEC_IN_NS, 0);
 }
 
 void BPF_STRUCT_OPS(qmap_exit, struct scx_exit_info *ei)
@@ -509,6 +644,7 @@  SCX_OPS_DEFINE(qmap_ops,
 	       .enqueue			= (void *)qmap_enqueue,
 	       .dequeue			= (void *)qmap_dequeue,
 	       .dispatch		= (void *)qmap_dispatch,
+	       .tick			= (void *)qmap_tick,
 	       .core_sched_before	= (void *)qmap_core_sched_before,
 	       .cpu_release		= (void *)qmap_cpu_release,
 	       .init_task		= (void *)qmap_init_task,
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -116,6 +116,14 @@  int main(int argc, char **argv)
 		       nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
 		       skel->bss->nr_reenqueued, skel->bss->nr_dequeued,
 		       skel->bss->nr_core_sched_execed);
+		if (__COMPAT_has_ksym("scx_bpf_cpuperf_cur"))
+			printf("cpuperf: cur min/avg/max=%u/%u/%u target min/avg/max=%u/%u/%u\n",
+			       skel->bss->cpuperf_min,
+			       skel->bss->cpuperf_avg,
+			       skel->bss->cpuperf_max,
+			       skel->bss->cpuperf_target_min,
+			       skel->bss->cpuperf_target_avg,
+			       skel->bss->cpuperf_target_max);
 		fflush(stdout);
 		sleep(1);
 	}
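
The toy policy in qmap_tick() can be tried out in user space with the sketch
below. It assumes SCX_CPUPERF_ONE is 1024. weight_to_idx() is not part of
this diff, so weight_to_idx_sketch() is a hypothetical stand-in whose
thresholds are illustrative and may differ from those in scx_qmap.bpf.c;
next_avg_weight() and cpuperf_target_for() are likewise hypothetical names.

```c
#include <stdint.h>

/* Assumption: SCX_CPUPERF_ONE == SCHED_CAPACITY_SCALE == 1024. */
#define SCX_CPUPERF_ONE 1024u

/* Same table as in the patch: queue index -> cpuperf target. */
static const uint32_t qidx_to_cpuperf_target[] = {
	[0] = SCX_CPUPERF_ONE * 0 / 4,
	[1] = SCX_CPUPERF_ONE * 1 / 4,
	[2] = SCX_CPUPERF_ONE * 2 / 4,
	[3] = SCX_CPUPERF_ONE * 3 / 4,
	[4] = SCX_CPUPERF_ONE * 4 / 4,
};

/* Hypothetical mapping from weight to queue index; thresholds are assumed. */
static int weight_to_idx_sketch(uint32_t weight)
{
	if (weight <= 25)
		return 0;
	else if (weight <= 50)
		return 1;
	else if (weight < 200)
		return 2;
	else if (weight < 400)
		return 3;
	else
		return 4;
}

/* Running average as in qmap_tick(): 3/4 old value plus 1/4 new sample. */
static uint32_t next_avg_weight(uint32_t avg_weight, uint32_t task_weight)
{
	return avg_weight * 3 / 4 + task_weight / 4;
}

/* Target that would be passed to scx_bpf_cpuperf_set() for a given average. */
static uint32_t cpuperf_target_for(uint32_t avg_weight)
{
	return qidx_to_cpuperf_target[weight_to_idx_sketch(avg_weight)];
}
```

Starting from an average of 0, two consecutive ticks with a task of weight
100 move the average to 25 and then to 43. Under the assumed thresholds the
target stays at 0 after the first tick and rises to 256 after the second.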