Message ID: ZnM2ywDVRZbrN6OC@slm.duckdns.org
State: Accepted
Commit: d86adb4fc0655a0867da811d000df75d2a325ef6
Hello, Vincent.

On Fri, Jul 05, 2024 at 02:41:41PM +0200, Vincent Guittot wrote:
> >  static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> >  {
> > -	unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> > +	unsigned long min, max, util = scx_cpuperf_target(sg_cpu->cpu);
> >
> > +	if (!scx_switched_all())
> > +		util += cpu_util_cfs_boost(sg_cpu->cpu);
>
> I don't see the need for this. If fair is not used, this returns zero

There's scx_enabled() and scx_switched_all(). The former is set when some
tasks may be on sched_ext. The latter when all tasks are on sched_ext. When
some tasks may be on sched_ext but other tasks may be on fair, the condition
is scx_enabled() && !scx_switched_all(). So, the above if statement
condition is true for all cases that tasks may be on CFS (sched_ext is
disabled or is enabled in partial mode).

Thanks.
On Fri, 5 Jul 2024 at 20:22, Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Vincent.
>
> On Fri, Jul 05, 2024 at 02:41:41PM +0200, Vincent Guittot wrote:
> > >  static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> > >  {
> > > -	unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> > > +	unsigned long min, max, util = scx_cpuperf_target(sg_cpu->cpu);
> > >
> > > +	if (!scx_switched_all())
> > > +		util += cpu_util_cfs_boost(sg_cpu->cpu);
> >
> > I don't see the need for this. If fair is not used, this returns zero
>
> There's scx_enabled() and scx_switched_all(). The former is set when some
> tasks may be on sched_ext. The latter when all tasks are on sched_ext. When
> some tasks may be on sched_ext but other tasks may be on fair, the condition
> is scx_enabled() && !scx_switched_all(). So, the above if statement
> condition is true for all cases that tasks may be on CFS (sched_ext is
> disabled or is enabled in partial mode).

My point is that if there is no fair task, cpu_util_cfs_boost() will
already return 0, so there is no need to add a sched_ext if statement
there.

Vincent

>
> Thanks.
>
> --
> tejun
Hello,

On Sat, Jul 06, 2024 at 11:01:20AM +0200, Vincent Guittot wrote:
> > There's scx_enabled() and scx_switched_all(). The former is set when some
> > tasks may be on sched_ext. The latter when all tasks are on sched_ext. When
> > some tasks may be on sched_ext but other tasks may be on fair, the condition
> > is scx_enabled() && !scx_switched_all(). So, the above if statement
> > condition is true for all cases that tasks may be on CFS (sched_ext is
> > disabled or is enabled in partial mode).
>
> My point is that if there is no fair task, cpu_util_cfs_boost() will
> already return 0 so there is no need to add a sched_ext if statement
> there

I see, but scx_switched_all() is a static key while cpu_util_cfs_boost()
isn't necessarily trivial. I can remove the conditional but wouldn't it make
more sense to keep it?

Thanks.
Hello, Vincent.

On Mon, Jul 08, 2024 at 08:37:06AM +0200, Vincent Guittot wrote:
> I prefer to minimize (if not remove) sched_ext related calls in the
> fair path so we can easily rework it if needed. And this will also
> ensure that all fair tasks are cleanly removed when they are all
> switched to sched_ext

Unless we add a WARN_ON_ONCE, if it doesn't behave as expected, the end
result will most likely be cpufreq sometimes picking a higher freq than
requested, which won't be the easiest to notice. Would you be against adding
WARN_ON_ONCE(scx_switched_all() && !util) too?

Thanks.
On Mon, 8 Jul 2024 at 20:20, Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Vincent.
>
> On Mon, Jul 08, 2024 at 08:37:06AM +0200, Vincent Guittot wrote:
> > I prefer to minimize (if not remove) sched_ext related calls in the
> > fair path so we can easily rework it if needed. And this will also
> > ensure that all fair tasks are cleanly removed when they are all
> > switched to sched_ext
>
> Unless we add a WARN_ON_ONCE, if it doesn't behave as expected, the end
> result will most likely be cpufreq sometimes picking a higher freq than
> requested, which won't be the easiest to notice. Would you be against adding
> WARN_ON_ONCE(scx_switched_all() && !util) too?

A WARN_ON_ONCE to detect misbehavior would be ok.

>
> Thanks.
>
> --
> tejun
Hello, Vincent.

On Mon, Jul 08, 2024 at 09:51:08PM +0200, Vincent Guittot wrote:
> > Unless we add a WARN_ON_ONCE, if it doesn't behave as expected, the end
> > result will most likely be cpufreq sometimes picking a higher freq than
> > requested, which won't be the easiest to notice. Would you be against adding
> > WARN_ON_ONCE(scx_switched_all() && !util) too?
>
> A WARN_ON_ONCE to detect misbehavior would be ok

I tried this and it's a bit problematic. Migrating out all the tasks does
bring the numbers pretty close to zero but the math doesn't work out exactly
and it often leaves 1 in the averages. While the fair class is in use, they
would decay quickly through __update_blocked_fair(); however, when all tasks
are switched to sched_ext, that function doesn't get called and the
remaining small value never decays.

Now, the value being really low, it doesn't really matter, but it's an
unnecessary complication. I can make sched_ext keep calling
__update_blocked_fair() in addition to update_other_load_avgs() to decay
fair's averages, but that seems a lot more complicated than having one
scx_switched_all() test.

Thanks.
On Mon, 8 Jul 2024 at 23:09, Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Vincent.
>
> On Mon, Jul 08, 2024 at 09:51:08PM +0200, Vincent Guittot wrote:
> > > Unless we add a WARN_ON_ONCE, if it doesn't behave as expected, the end
> > > result will most likely be cpufreq sometimes picking a higher freq than
> > > requested, which won't be the easiest to notice. Would you be against adding
> > > WARN_ON_ONCE(scx_switched_all() && !util) too?
> >
> > A WARN_ON_ONCE to detect misbehavior would be ok
>
> I tried this and it's a bit problematic. Migrating out all the tasks does
> bring the numbers pretty close to zero but the math doesn't work out exactly
> and it often leaves 1 in the averages. While the fair class is in use, they

Hmm, interesting. Such a remaining small value could be expected for
load_avg but not with util_avg, which is normally a direct propagation.
Do you have a sequence in particular?

> would decay quickly through __update_blocked_fair(); however, when all tasks
> are switched to sched_ext, that function doesn't get called and the
> remaining small value never decays.
>
> Now, the value being really low, it doesn't really matter but it's an
> unnecessary complication. I can make sched_ext keep calling
> __update_blocked_fair() in addition to update_other_load_avgs() to decay
> fair's averages but that seems a lot more complicated than having one
> scx_switched_all() test.
>
> Thanks.
>
> --
> tejun
Hello,

On Tue, Jul 09, 2024 at 03:36:34PM +0200, Vincent Guittot wrote:
> > I tried this and it's a bit problematic. Migrating out all the tasks does
> > bring the numbers pretty close to zero but the math doesn't work out exactly
> > and it often leaves 1 in the averages. While the fair class is in use, they
>
> hmm interesting, such remaining small value could be expected for
> load_avg but not with util_avg which is normally a direct propagation.
> Do you have a sequence in particular ?

Oh, I thought it was a byproduct of decay calculations not exactly matching
up between the sum and the components, but I haven't really checked. It's
really easy to reproduce. Just boot a kernel with sched_ext enabled (with
some instrumentation added to monitor the util calculation), run some
stress workload to be sure, and run a sched_ext scheduler (make -C
tools/sched_ext && tools/sched_ext/build/bin/scx_simple).

Thanks.
On Tue, 9 Jul 2024 at 18:43, Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, Jul 09, 2024 at 03:36:34PM +0200, Vincent Guittot wrote:
> > > I tried this and it's a bit problematic. Migrating out all the tasks does
> > > bring the numbers pretty close to zero but the math doesn't work out exactly
> > > and it often leaves 1 in the averages. While the fair class is in use, they
> >
> > hmm interesting, such remaining small value could be expected for
> > load_avg but not with util_avg which is normally a direct propagation.
> > Do you have a sequence in particular ?
>
> Oh, I thought it was a byproduct of decay calculations not exactly matching
> up between the sum and the components but I haven't really checked. It's
> really easy to reproduce. Just boot a kernel with sched_ext enabled (with
> some instrumentations added to monitor the util calculation), run some
> stress workload to be sure and run a sched_ext scheduler (make -C
> tools/sched_ext && tools/sched_ext/build/bin/scx_simple).

I failed to set up my dev system for reproducing your use case in time
and I'm going to be away for the coming weeks, so I suppose that you
should move forward and I will look at that when back at my dev system.

It seems that "make -C tools/sched_ext ARCH=arm64 LLVM=-16" doesn't
use clang-16 everywhere like the rest of the kernel, which triggers
errors on my system:

make -C <path-to-linux>/linux/tools/sched_ext ARCH=arm64 LOCALVERSION=+ LLVM=-16 O=<path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext
...
clang-16 -g -O0 -fPIC -std=gnu89 -Wbad-function-cast -Wdeclaration-after-statement -Wformat-security -Wformat-y2k -Winit-self -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wno-system-headers -Wold-style-definition -Wpacked -Wredundant-decls -Wstrict-prototypes -Wswitch-default -Wswitch-enum -Wundef -Wwrite-strings -Wformat -Wno-type-limits -Wshadow -Wno-switch-enum -Werror -Wall -I<path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/libbpf/ -I<path-to-linux>/linux/tools/include -I<path-to-linux>/linux/tools/include/uapi -fvisibility=hidden -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 \
	--shared -Wl,-soname,libbpf.so.1 \
	-Wl,--version-script=libbpf.map <path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/libbpf/sharedobjs/libbpf-in.o -lelf -lz -o <path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/libbpf/libbpf.so.1.5.0
...
clang -g -D__TARGET_ARCH_arm64 -mlittle-endian -I<path-to-linux>/linux/tools/sched_ext/include -I<path-to-linux>/linux/tools/sched_ext/include/bpf-compat -I<path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/include -I<path-to-linux>/linux/tools/include/uapi -I../../include -idirafter /usr/lib/llvm-14/lib/clang/14.0.0/include -idirafter /usr/local/include -idirafter /usr/include/x86_64-linux-gnu -idirafter /usr/include -Wall -Wno-compare-distinct-pointer-types -O2 -mcpu=v3 -target bpf -c scx_simple.bpf.c -o <path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/sched_ext/scx_simple.bpf.o
In file included from scx_simple.bpf.c:23:
<path-to-linux>/linux/tools/sched_ext/include/scx/common.bpf.h:27:17: error: use of undeclared identifier 'SCX_DSQ_FLAG_BUILTIN'
_Static_assert(SCX_DSQ_FLAG_BUILTIN,
                ^
...
fatal error: too many errors emitted, stopping now [-ferror-limit=]
5 warnings and 20 errors generated.

Vincent

>
> Thanks.
>
> --
> tejun
Hello,

On Fri, Jul 12, 2024 at 12:12:32PM +0200, Vincent Guittot wrote:
...
> I failed to set up my dev system for reproducing your use case in time
> and I'm going to be away for the coming weeks, so I suppose that you
> should move forward and I will look at that when back at my dev system.

Thankfully, this should be pretty easy to fix up however we want afterwards.

> It seems that "make -C tools/sched_ext ARCH=arm64 LLVM=-16" doesn't
> use clang-16 everywhere like the rest of the kernel, which triggers
> errors on my system:

Hmm... there is llvm prefix/suffix handling in the Makefile. I wonder
what's broken.

> make -C <path-to-linux>/linux/tools/sched_ext ARCH=arm64
> LOCALVERSION=+ LLVM=-16
> O=<path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext
> ...
> clang-16 -g -O0 -fPIC -std=gnu89 -Wbad-function-cast
> -Wdeclaration-after-statement -Wformat-security -Wformat-y2k
> -Winit-self -Wmissing-declarations -Wmissing-prototypes
> -Wnested-externs -Wno-system-headers -Wold-style-definition -Wpacked
> -Wredundant-decls -Wstrict-prototypes -Wswitch-default -Wswitch-enum
> -Wundef -Wwrite-strings -Wformat -Wno-type-limits -Wshadow
> -Wno-switch-enum -Werror -Wall
> -I<path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/libbpf/
> -I<path-to-linux>/linux/tools/include
> -I<path-to-linux>/linux/tools/include/uapi -fvisibility=hidden
> -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 \
> --shared -Wl,-soname,libbpf.so.1 \
> -Wl,--version-script=libbpf.map
> <path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/libbpf/sharedobjs/libbpf-in.o
> -lelf -lz -o <path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/libbpf/libbpf.so.1.5.0

So, this is regular arm target building.

> clang -g -D__TARGET_ARCH_arm64 -mlittle-endian
> -I<path-to-linux>/linux/tools/sched_ext/include
> -I<path-to-linux>/linux/tools/sched_ext/include/bpf-compat
> -I<path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/include
> -I<path-to-linux>/linux/tools/include/uapi -I../../include -idirafter
> /usr/lib/llvm-14/lib/clang/14.0.0/include -idirafter
> /usr/local/include -idirafter /usr/include/x86_64-linux-gnu -idirafter
> /usr/include -Wall -Wno-compare-distinct-pointer-types -O2 -mcpu=v3
> -target bpf -c scx_simple.bpf.c -o
> <path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/sched_ext/scx_simple.bpf.o
> In file included from scx_simple.bpf.c:23:
> <path-to-linux>/linux/tools/sched_ext/include/scx/common.bpf.h:27:17:
> error: use of undeclared identifier 'SCX_DSQ_FLAG_BUILTIN'
> _Static_assert(SCX_DSQ_FLAG_BUILTIN,
>                 ^
> fatal error: too many errors emitted, stopping now [-ferror-limit=]
> 5 warnings and 20 errors generated.

This is BPF. The Makefile is mostly copied from other existing BPF Makefiles
under tools, so I don't quite understand why things are set up this way, but

  CC := $(LLVM_PREFIX)clang$(LLVM_SUFFIX) $(CLANG_FLAGS) -fintegrated-as

is what's used to build regular targets, while

  $(CLANG) $(BPF_CFLAGS) -target bpf -c $< -o $@

is what's used to build BPF targets. It's not too out there to use a
different compiler for BPF targets, so maybe that's why? I'll ask BPF folks.

Thanks.
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -197,8 +197,10 @@ unsigned long sugov_effective_cpu_perf(i
 static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
 {
-	unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
+	unsigned long min, max, util = scx_cpuperf_target(sg_cpu->cpu);
 
+	if (!scx_switched_all())
+		util += cpu_util_cfs_boost(sg_cpu->cpu);
 	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
 	util = max(util, boost);
 	sg_cpu->bw_min = min;
@@ -330,6 +332,14 @@ static bool sugov_hold_freq(struct sugov
 	unsigned long idle_calls;
 	bool ret;
 
+	/*
+	 * The heuristics in this function is for the fair class. For SCX, the
+	 * performance target comes directly from the BPF scheduler. Let's just
+	 * follow it.
+	 */
+	if (scx_switched_all())
+		return false;
+
 	/* if capped by uclamp_max, always update to be in compliance */
 	if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
 		return false;
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -16,6 +16,8 @@ enum scx_consts {
 	SCX_EXIT_BT_LEN		= 64,
 	SCX_EXIT_MSG_LEN	= 1024,
 	SCX_EXIT_DUMP_DFL_LEN	= 32768,
+
+	SCX_CPUPERF_ONE		= SCHED_CAPACITY_SCALE,
 };
 
 enum scx_exit_kind {
@@ -3520,7 +3522,7 @@ DEFINE_SCHED_CLASS(ext) = {
 	.update_curr		= update_curr_scx,
 
 #ifdef CONFIG_UCLAMP_TASK
-	.uclamp_enabled		= 0,
+	.uclamp_enabled		= 1,
 #endif
 };
 
@@ -4393,7 +4395,7 @@ static int scx_ops_enable(struct sched_e
 	struct scx_task_iter sti;
 	struct task_struct *p;
 	unsigned long timeout;
-	int i, ret;
+	int i, cpu, ret;
 
 	mutex_lock(&scx_ops_enable_mutex);
 
@@ -4442,6 +4444,9 @@ static int scx_ops_enable(struct sched_e
 
 	atomic_long_set(&scx_nr_rejected, 0);
 
+	for_each_possible_cpu(cpu)
+		cpu_rq(cpu)->scx.cpuperf_target = SCX_CPUPERF_ONE;
+
 	/*
 	 * Keep CPUs stable during enable so that the BPF scheduler can track
 	 * online CPUs by watching ->on/offline_cpu() after ->init().
@@ -5836,6 +5841,77 @@ __bpf_kfunc void scx_bpf_dump_bstr(char
 }
 
 /**
+ * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU
+ * @cpu: CPU of interest
+ *
+ * Return the maximum relative capacity of @cpu in relation to the most
+ * performant CPU in the system. The return value is in the range [1,
+ * %SCX_CPUPERF_ONE]. See scx_bpf_cpuperf_cur().
+ */
+__bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu)
+{
+	if (ops_cpu_valid(cpu, NULL))
+		return arch_scale_cpu_capacity(cpu);
+	else
+		return SCX_CPUPERF_ONE;
+}
+
+/**
+ * scx_bpf_cpuperf_cur - Query the current relative performance of a CPU
+ * @cpu: CPU of interest
+ *
+ * Return the current relative performance of @cpu in relation to its maximum.
+ * The return value is in the range [1, %SCX_CPUPERF_ONE].
+ *
+ * The current performance level of a CPU in relation to the maximum performance
+ * available in the system can be calculated as follows:
+ *
+ *   scx_bpf_cpuperf_cap() * scx_bpf_cpuperf_cur() / %SCX_CPUPERF_ONE
+ *
+ * The result is in the range [1, %SCX_CPUPERF_ONE].
+ */
+__bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu)
+{
+	if (ops_cpu_valid(cpu, NULL))
+		return arch_scale_freq_capacity(cpu);
+	else
+		return SCX_CPUPERF_ONE;
+}
+
+/**
+ * scx_bpf_cpuperf_set - Set the relative performance target of a CPU
+ * @cpu: CPU of interest
+ * @perf: target performance level [0, %SCX_CPUPERF_ONE]
+ * @flags: %SCX_CPUPERF_* flags
+ *
+ * Set the target performance level of @cpu to @perf. @perf is in linear
+ * relative scale between 0 and %SCX_CPUPERF_ONE. This determines how the
+ * schedutil cpufreq governor chooses the target frequency.
+ *
+ * The actual performance level chosen, CPU grouping, and the overhead and
+ * latency of the operations are dependent on the hardware and cpufreq driver in
+ * use. Consult hardware and cpufreq documentation for more information. The
+ * current performance level can be monitored using scx_bpf_cpuperf_cur().
+ */
+__bpf_kfunc void scx_bpf_cpuperf_set(u32 cpu, u32 perf)
+{
+	if (unlikely(perf > SCX_CPUPERF_ONE)) {
+		scx_ops_error("Invalid cpuperf target %u for CPU %d", perf, cpu);
+		return;
+	}
+
+	if (ops_cpu_valid(cpu, NULL)) {
+		struct rq *rq = cpu_rq(cpu);
+
+		rq->scx.cpuperf_target = perf;
+
+		rcu_read_lock_sched_notrace();
+		cpufreq_update_util(cpu_rq(cpu), 0);
+		rcu_read_unlock_sched_notrace();
+	}
+}
+
+/**
  * scx_bpf_nr_cpu_ids - Return the number of possible CPU IDs
  *
  * All valid CPU IDs in the system are smaller than the returned value.
@@ -6045,6 +6121,9 @@ BTF_ID_FLAGS(func, scx_bpf_destroy_dsq)
 BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_dump_bstr, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_set)
 BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids)
 BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE)
 BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE)
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -48,6 +48,14 @@ int scx_check_setscheduler(struct task_s
 bool task_should_scx(struct task_struct *p);
 void init_sched_ext_class(void);
 
+static inline u32 scx_cpuperf_target(s32 cpu)
+{
+	if (scx_enabled())
+		return cpu_rq(cpu)->scx.cpuperf_target;
+	else
+		return 0;
+}
+
 static inline const struct sched_class *next_active_class(const struct sched_class *class)
 {
 	class++;
@@ -89,6 +97,7 @@ static inline void scx_pre_fork(struct t
 static inline int scx_fork(struct task_struct *p) { return 0; }
 static inline void scx_post_fork(struct task_struct *p) {}
 static inline void scx_cancel_fork(struct task_struct *p) {}
+static inline u32 scx_cpuperf_target(s32 cpu) { return 0; }
 static inline bool scx_can_stop_tick(struct rq *rq) { return true; }
 static inline void scx_rq_activate(struct rq *rq) {}
 static inline void scx_rq_deactivate(struct rq *rq) {}
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -743,6 +743,7 @@ struct scx_rq {
 	u64			extra_enq_flags;	/* see move_task_to_local_dsq() */
 	u32			nr_running;
 	u32			flags;
+	u32			cpuperf_target;		/* [0, SCHED_CAPACITY_SCALE] */
 	bool			cpu_released;
 	cpumask_var_t		cpus_to_kick;
 	cpumask_var_t		cpus_to_kick_if_idle;
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -42,6 +42,9 @@ void scx_bpf_destroy_dsq(u64 dsq_id) __k
 void scx_bpf_exit_bstr(s64 exit_code, char *fmt, unsigned long long *data, u32 data__sz) __ksym __weak;
 void scx_bpf_error_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym;
 void scx_bpf_dump_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym __weak;
+u32 scx_bpf_cpuperf_cap(s32 cpu) __ksym __weak;
+u32 scx_bpf_cpuperf_cur(s32 cpu) __ksym __weak;
+void scx_bpf_cpuperf_set(s32 cpu, u32 perf) __ksym __weak;
 u32 scx_bpf_nr_cpu_ids(void) __ksym __weak;
 const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak;
 const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak;
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -69,6 +69,18 @@ struct {
 };
 
 /*
+ * If enabled, CPU performance target is set according to the queue index
+ * according to the following table.
+ */
+static const u32 qidx_to_cpuperf_target[] = {
+	[0] = SCX_CPUPERF_ONE * 0 / 4,
+	[1] = SCX_CPUPERF_ONE * 1 / 4,
+	[2] = SCX_CPUPERF_ONE * 2 / 4,
+	[3] = SCX_CPUPERF_ONE * 3 / 4,
+	[4] = SCX_CPUPERF_ONE * 4 / 4,
+};
+
+/*
  * Per-queue sequence numbers to implement core-sched ordering.
  *
  * Tail seq is assigned to each queued task and incremented. Head seq tracks the
@@ -95,6 +107,8 @@ struct {
 struct cpu_ctx {
 	u64		dsp_idx;	/* dispatch index */
 	u64		dsp_cnt;	/* remaining count */
+	u32		avg_weight;
+	u32		cpuperf_target;
 };
 
 struct {
@@ -107,6 +121,8 @@ struct {
 /* Statistics */
 u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued;
 u64 nr_core_sched_execed;
+u32 cpuperf_min, cpuperf_avg, cpuperf_max;
+u32 cpuperf_target_min, cpuperf_target_avg, cpuperf_target_max;
 
 s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
 		   s32 prev_cpu, u64 wake_flags)
@@ -313,6 +329,29 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 c
 	}
 }
 
+void BPF_STRUCT_OPS(qmap_tick, struct task_struct *p)
+{
+	struct cpu_ctx *cpuc;
+	u32 zero = 0;
+	int idx;
+
+	if (!(cpuc = bpf_map_lookup_elem(&cpu_ctx_stor, &zero))) {
+		scx_bpf_error("failed to look up cpu_ctx");
+		return;
+	}
+
+	/*
+	 * Use the running avg of weights to select the target cpuperf level.
+	 * This is a demonstration of the cpuperf feature rather than a
+	 * practical strategy to regulate CPU frequency.
+	 */
+	cpuc->avg_weight = cpuc->avg_weight * 3 / 4 + p->scx.weight / 4;
+	idx = weight_to_idx(cpuc->avg_weight);
+	cpuc->cpuperf_target = qidx_to_cpuperf_target[idx];
+
+	scx_bpf_cpuperf_set(scx_bpf_task_cpu(p), cpuc->cpuperf_target);
+}
+
 /*
  * The distance from the head of the queue scaled by the weight of the queue.
  * The lower the number, the older the task and the higher the priority.
@@ -422,8 +461,9 @@ void BPF_STRUCT_OPS(qmap_dump_cpu, struc
 	if (!(cpuc = bpf_map_lookup_percpu_elem(&cpu_ctx_stor, &zero, cpu)))
 		return;
 
-	scx_bpf_dump("QMAP: dsp_idx=%llu dsp_cnt=%llu",
-		     cpuc->dsp_idx, cpuc->dsp_cnt);
+	scx_bpf_dump("QMAP: dsp_idx=%llu dsp_cnt=%llu avg_weight=%u cpuperf_target=%u",
+		     cpuc->dsp_idx, cpuc->dsp_cnt, cpuc->avg_weight,
+		     cpuc->cpuperf_target);
 }
 
 void BPF_STRUCT_OPS(qmap_dump_task, struct scx_dump_ctx *dctx, struct task_struct *p)
@@ -492,11 +532,106 @@ void BPF_STRUCT_OPS(qmap_cpu_offline, s3
 	print_cpus();
 }
 
+struct monitor_timer {
+	struct bpf_timer timer;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, u32);
+	__type(value, struct monitor_timer);
+} monitor_timer SEC(".maps");
+
+/*
+ * Print out the min, avg and max performance levels of CPUs every second to
+ * demonstrate the cpuperf interface.
+ */
+static void monitor_cpuperf(void)
+{
+	u32 zero = 0, nr_cpu_ids;
+	u64 cap_sum = 0, cur_sum = 0, cur_min = SCX_CPUPERF_ONE, cur_max = 0;
+	u64 target_sum = 0, target_min = SCX_CPUPERF_ONE, target_max = 0;
+	const struct cpumask *online;
+	int i, nr_online_cpus = 0;
+
+	nr_cpu_ids = scx_bpf_nr_cpu_ids();
+	online = scx_bpf_get_online_cpumask();
+
+	bpf_for(i, 0, nr_cpu_ids) {
+		struct cpu_ctx *cpuc;
+		u32 cap, cur;
+
+		if (!bpf_cpumask_test_cpu(i, online))
+			continue;
+		nr_online_cpus++;
+
+		/* collect the capacity and current cpuperf */
+		cap = scx_bpf_cpuperf_cap(i);
+		cur = scx_bpf_cpuperf_cur(i);
+
+		cur_min = cur < cur_min ? cur : cur_min;
+		cur_max = cur > cur_max ? cur : cur_max;
+
+		/*
+		 * $cur is relative to $cap. Scale it down accordingly so that
+		 * it's in the same scale as other CPUs and $cur_sum/$cap_sum
+		 * makes sense.
+		 */
+		cur_sum += cur * cap / SCX_CPUPERF_ONE;
+		cap_sum += cap;
+
+		if (!(cpuc = bpf_map_lookup_percpu_elem(&cpu_ctx_stor, &zero, i))) {
+			scx_bpf_error("failed to look up cpu_ctx");
+			goto out;
+		}
+
+		/* collect target */
+		cur = cpuc->cpuperf_target;
+		target_sum += cur;
+		target_min = cur < target_min ? cur : target_min;
+		target_max = cur > target_max ? cur : target_max;
+	}
+
+	cpuperf_min = cur_min;
+	cpuperf_avg = cur_sum * SCX_CPUPERF_ONE / cap_sum;
+	cpuperf_max = cur_max;
+
+	cpuperf_target_min = target_min;
+	cpuperf_target_avg = target_sum / nr_online_cpus;
+	cpuperf_target_max = target_max;
+out:
+	scx_bpf_put_cpumask(online);
+}
+
+static int monitor_timerfn(void *map, int *key, struct bpf_timer *timer)
+{
+	monitor_cpuperf();
+
+	bpf_timer_start(timer, ONE_SEC_IN_NS, 0);
+	return 0;
+}
+
 s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
 {
+	u32 key = 0;
+	struct bpf_timer *timer;
+	s32 ret;
+
 	print_cpus();
-	return scx_bpf_create_dsq(SHARED_DSQ, -1);
+
+	ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
+	if (ret)
+		return ret;
+
+	timer = bpf_map_lookup_elem(&monitor_timer, &key);
+	if (!timer)
+		return -ESRCH;
+
+	bpf_timer_init(timer, &monitor_timer, CLOCK_MONOTONIC);
+	bpf_timer_set_callback(timer, monitor_timerfn);
+
+	return bpf_timer_start(timer, ONE_SEC_IN_NS, 0);
 }
 
 void BPF_STRUCT_OPS(qmap_exit, struct scx_exit_info *ei)
@@ -509,6 +644,7 @@ SCX_OPS_DEFINE(qmap_ops,
 	       .enqueue			= (void *)qmap_enqueue,
 	       .dequeue			= (void *)qmap_dequeue,
 	       .dispatch		= (void *)qmap_dispatch,
+	       .tick			= (void *)qmap_tick,
 	       .core_sched_before	= (void *)qmap_core_sched_before,
 	       .cpu_release		= (void *)qmap_cpu_release,
 	       .init_task		= (void *)qmap_init_task,
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -116,6 +116,14 @@ int main(int argc, char **argv)
 		       nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
 		       skel->bss->nr_reenqueued, skel->bss->nr_dequeued,
 		       skel->bss->nr_core_sched_execed);
+		if (__COMPAT_has_ksym("scx_bpf_cpuperf_cur"))
+			printf("cpuperf: cur min/avg/max=%u/%u/%u target min/avg/max=%u/%u/%u\n",
+			       skel->bss->cpuperf_min,
+			       skel->bss->cpuperf_avg,
+			       skel->bss->cpuperf_max,
+			       skel->bss->cpuperf_target_min,
+			       skel->bss->cpuperf_target_avg,
+			       skel->bss->cpuperf_target_max);
 		fflush(stdout);
 		sleep(1);
 	}