From patchwork Fri Dec 8 01:52:39 2023
X-Patchwork-Submitter: Qais Yousef
X-Patchwork-Id: 751986
From: Qais Yousef
Wysocki" , Viresh Kumar , Vincent Guittot , Dietmar Eggemann Cc: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org, Lukasz Luba , Wei Wang , Rick Yiu , Chung-Kai Mei , Hongyan Xia , Qais Yousef Subject: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util() Date: Fri, 8 Dec 2023 01:52:39 +0000 Message-Id: <20231208015242.385103-2-qyousef@layalina.io> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20231208015242.385103-1-qyousef@layalina.io> References: <20231208015242.385103-1-qyousef@layalina.io> Precedence: bulk X-Mailing-List: linux-pm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Due to the way code is structured, it makes a lot of sense to trigger cpufreq_update_util() from update_load_avg(). But this is too aggressive as in most cases we are iterating through entities in a loop to update_load_avg() in the hierarchy. So we end up sending too many request in an loop as we're updating the hierarchy. Combine this with the rate limit in schedutil, we could end up prematurely send up a wrong frequency update before we have actually updated all entities appropriately. Be smarter about it by limiting the trigger to perform frequency updates after all accounting logic has done. This ended up being in the following points: 1. enqueue/dequeue_task_fair() 2. throttle/unthrottle_cfs_rq() 3. attach/detach_task_cfs_rq() 4. task_tick_fair() 5. __sched_group_set_shares() This is not 100% ideal still due to other limitations that might be a bit harder to handle. Namely we can end up with premature update request in the following situations: a. Simultaneous task enqueue on the CPU where 2nd task is bigger and requires higher freq. The trigger to cpufreq_update_util() by the first task will lead to dropping the 2nd request until tick. Or another CPU in the same policy trigger a freq update. b. CPUs sharing a policy can end up with the same race in a but the simultaneous enqueue happens on different CPUs in the same policy. The above though are limitations in the governor/hardware, and from scheduler point of view at least that's the best we can do. The governor might consider smarter logic to aggregate near simultaneous request and honour the higher one. Signed-off-by: Qais Yousef (Google) --- kernel/sched/fair.c | 55 ++++++++++++--------------------------------- 1 file changed, 14 insertions(+), 41 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b83448be3f79..f99910fc6705 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3997,29 +3997,6 @@ static inline void update_cfs_group(struct sched_entity *se) } #endif /* CONFIG_FAIR_GROUP_SCHED */ -static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags) -{ - struct rq *rq = rq_of(cfs_rq); - - if (&rq->cfs == cfs_rq) { - /* - * There are a few boundary cases this might miss but it should - * get called often enough that that should (hopefully) not be - * a real problem. - * - * It will not get called when we go idle, because the idle - * thread is a different class (!fair), nor will the utilization - * number include things like RT tasks. - * - * As is, the util number is not freq-invariant (we'd have to - * implement arch_scale_freq_capacity() for that). - * - * See cpu_util_cfs(). 
-		 */
-		cpufreq_update_util(rq, flags);
-	}
-}
-
 #ifdef CONFIG_SMP
 static inline bool load_avg_is_decayed(struct sched_avg *sa)
 {
@@ -4648,8 +4625,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 
 	add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
 
-	cfs_rq_util_change(cfs_rq, 0);
-
 	trace_pelt_cfs_tp(cfs_rq);
 }
 
@@ -4678,8 +4653,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 
 	add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
 
-	cfs_rq_util_change(cfs_rq, 0);
-
 	trace_pelt_cfs_tp(cfs_rq);
 }
 
@@ -4726,11 +4699,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 		 */
 		detach_entity_load_avg(cfs_rq, se);
 		update_tg_load_avg(cfs_rq);
-	} else if (decayed) {
-		cfs_rq_util_change(cfs_rq, 0);
-
-		if (flags & UPDATE_TG)
-			update_tg_load_avg(cfs_rq);
+	} else if (decayed && (flags & UPDATE_TG)) {
+		update_tg_load_avg(cfs_rq);
 	}
 }
 
@@ -5114,7 +5084,6 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 static inline void update_load_avg(struct cfs_rq *cfs_rq,
 				   struct sched_entity *se, int not_used1)
 {
-	cfs_rq_util_change(cfs_rq, 0);
 }
 
 static inline void remove_entity_load_avg(struct sched_entity *se) {}
@@ -5807,6 +5776,8 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	sub_nr_running(rq, task_delta);
 
 done:
+	cpufreq_update_util(rq, 0);
+
 	/*
 	 * Note: distribution will already see us throttled via the
	 * throttled-list.  rq->lock protects completion.
@@ -5899,6 +5870,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 unthrottle_throttle:
 	assert_list_leaf_cfs_rq(rq);
 
+	cpufreq_update_util(rq, 0);
+
 	/* Determine whether we need to wake up potentially idle CPU: */
 	if (rq->curr == rq->idle && rq->cfs.nr_running)
 		resched_curr(rq);
@@ -6704,14 +6677,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	 */
 	util_est_enqueue(&rq->cfs, p);
 
-	/*
-	 * If in_iowait is set, the code below may not trigger any cpufreq
-	 * utilization updates, so do it here explicitly with the IOWAIT flag
-	 * passed.
-	 */
-	if (p->in_iowait)
-		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
-
 	for_each_sched_entity(se) {
 		if (se->on_rq)
 			break;
@@ -6772,6 +6737,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 enqueue_throttle:
 	assert_list_leaf_cfs_rq(rq);
 
+	cpufreq_update_util(rq, p->in_iowait ? SCHED_CPUFREQ_IOWAIT : 0);
+
 	hrtick_update(rq);
 }
 
@@ -6849,6 +6816,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
+	cpufreq_update_util(rq, 0);
 	hrtick_update(rq);
 }
 
@@ -8482,6 +8450,7 @@ done: __maybe_unused;
 
 	update_misfit_status(p, rq);
 	sched_fair_update_stop_tick(rq, p);
+	cpufreq_update_util(rq, 0);
 
 	return p;
 
@@ -12615,6 +12584,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	update_misfit_status(curr, rq);
 	update_overutilized_status(task_rq(curr));
+	cpufreq_update_util(rq, 0);
 
 	task_tick_core(rq, curr);
 }
@@ -12739,6 +12709,7 @@ static void detach_task_cfs_rq(struct task_struct *p)
 	struct sched_entity *se = &p->se;
 
 	detach_entity_cfs_rq(se);
+	cpufreq_update_util(task_rq(p), 0);
 }
 
 static void attach_task_cfs_rq(struct task_struct *p)
@@ -12746,6 +12717,7 @@ static void attach_task_cfs_rq(struct task_struct *p)
 	struct sched_entity *se = &p->se;
 
 	attach_entity_cfs_rq(se);
+	cpufreq_update_util(task_rq(p), 0);
 }
 
 static void switched_from_fair(struct rq *rq, struct task_struct *p)
@@ -12991,6 +12963,7 @@ static int __sched_group_set_shares(struct task_group *tg, unsigned long shares)
 		update_load_avg(cfs_rq_of(se), se, UPDATE_TG);
 		update_cfs_group(se);
 	}
+	cpufreq_update_util(rq, 0);
 	rq_unlock_irqrestore(rq, &rf);
 }
 
From patchwork Fri Dec 8 01:52:40 2023
X-Patchwork-Submitter: Qais Yousef
X-Patchwork-Id: 751985
From: Qais Yousef
To: Ingo Molnar, Peter Zijlstra, "Rafael J. Wysocki", Viresh Kumar,
    Vincent Guittot, Dietmar Eggemann
Cc: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org, Lukasz Luba,
    Wei Wang, Rick Yiu, Chung-Kai Mei, Hongyan Xia, Qais Yousef
Subject: [PATCH 2/4] sched/uclamp: Remove rq max aggregation
Date: Fri, 8 Dec 2023 01:52:40 +0000
Message-Id: <20231208015242.385103-3-qyousef@layalina.io>
In-Reply-To: <20231208015242.385103-1-qyousef@layalina.io>
References: <20231208015242.385103-1-qyousef@layalina.io>

Max aggregation was intended to facilitate combining the different uclamp
performance hints from all RUNNABLE tasks on the rq. But in practice this
ended up being a pain, especially for making uclamp_max effective. What
we've seen in practice is that a busy task whose performance (or power
consumption) we really want to limit can easily be disrupted by a lot of
small tasks waking up during its busy run period. Combined with the
schedutil rate limit, we end up effectively running at full throttle
without the capping being effective, as these small tasks keep being added
to the rq and push the max aggregated value to 1024 most (if not
effectively all) of the time.

Instead of max aggregation, we can go with the simpler approach of
applying the uclamp values of the currently RUNNING task.

Note that max aggregation matters only for frequency selection. For task
placement we already look only at the task's own uclamp values in the load
balancer. EAS looks at the rq, but only because EAS cares about frequency
selection too, so it had to take max aggregation into account when
deciding where to place a task.

To make sure the cpufreq governor follows the new uclamp settings on
context switch, we add a new call to cpufreq_update_util() at the end of
the context switch. The governor should then clamp the util with the
uclamp values of whatever task is currently running on each CPU in the
policy. A new flag, SCHED_CPUFREQ_PERF_HINTS, is added to help distinguish
this request.

This approach allows us to simplify the code by removing all notion of
buckets, the uclamp static key, and the need to inc/dec refcounts in the
enqueue path, which was considered an expensive addition to the fast path.

The downside is that we still have the issue of small tasks waking up and
causing the CPU to momentarily spike to a high frequency instigated by the
capped task. It is less severe though, as a busy task will now be capped
for the majority of the time it is RUNNING (limited by the DVFS hardware's
response time to change frequency at each context switch), whereas
previously we lifted the cap whenever an uncapped task was enqueued.
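To illustrate the intent, the context-switch hook this patch adds
(reproduced from the kernel/sched/sched.h hunk below, lightly annotated)
boils down to the following sketch; the governor is only poked when the
RUNNING task's clamps would actually alter the rq's CFS utilization:

	/* Request freq update on context switch if necessary */
	static inline void uclamp_context_switch(struct rq *rq)
	{
		unsigned long uclamp_min;
		unsigned long uclamp_max;
		unsigned long util;

		/* Only RT and FAIR tasks are aware of uclamp */
		if (!rt_policy(current->policy) && !fair_policy(current->policy))
			return;

		uclamp_min = uclamp_eff_value(current, UCLAMP_MIN);
		uclamp_max = uclamp_eff_value(current, UCLAMP_MAX);
		util = rq->cfs.avg.util_avg;

		/* If util already sits inside [min, max], clamping is a no-op */
		if (uclamp_min > util || uclamp_max < util)
			cpufreq_update_util(rq, SCHED_CPUFREQ_PERF_HINTS);
	}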
The problem of these small tasks lifting the capping (or even causing a
boost if they have a high uclamp_min value) should be shifted to the
governor. If a task's runtime is shorter than the DVFS hardware's ability
to change frequency, then its requirements should be ignored, as it won't
run long enough to observe the change given the hardware limitation.

To keep access to uclamp_eff_value() fast, we continue to store the
effective value in p->uclamp. But instead of relying on the active flag,
which was set at enqueue and cleared at dequeue, we now update it
unconditionally whenever uclamp_min/max is changed via a sched_setattr()
syscall or when uclamp_min/max changes anywhere in the hierarchy,
including the root group. A new helper, update_uclamp_req(), keeps both
p->uclamp_req and p->uclamp in sync.

Signed-off-by: Qais Yousef (Google)
---
 include/linux/sched.h            |  16 --
 include/linux/sched/cpufreq.h    |   3 +-
 init/Kconfig                     |  31 ---
 kernel/sched/core.c              | 426 +++++--------------------------
 kernel/sched/cpufreq_schedutil.c |   2 +-
 kernel/sched/fair.c              |  80 ++----
 kernel/sched/rt.c                |   4 -
 kernel/sched/sched.h             | 128 ++--------
 8 files changed, 104 insertions(+), 586 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8d258162deb0..73b0c6ef8fd9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -700,25 +700,11 @@ struct sched_dl_entity {
 };
 
 #ifdef CONFIG_UCLAMP_TASK
-/* Number of utilization clamp buckets (shorter alias) */
-#define UCLAMP_BUCKETS CONFIG_UCLAMP_BUCKETS_COUNT
-
 /*
  * Utilization clamp for a scheduling entity
  * @value:		clamp value "assigned" to a se
- * @bucket_id:		bucket index corresponding to the "assigned" value
- * @active:		the se is currently refcounted in a rq's bucket
  * @user_defined:	the requested clamp value comes from user-space
  *
- * The bucket_id is the index of the clamp bucket matching the clamp value
- * which is pre-computed and stored to avoid expensive integer divisions from
- * the fast path.
- *
- * The active bit is set whenever a task has got an "effective" value assigned,
- * which can be different from the clamp value "requested" from user-space.
- * This allows to know a task is refcounted in the rq's bucket corresponding
- * to the "effective" bucket_id.
- *
  * The user_defined bit is set whenever a task has got a task-specific clamp
  * value requested from user-space, i.e. the system defaults apply to this task
  * just as a restriction. This allows to relax default clamps when a less
@@ -728,8 +714,6 @@ struct sched_dl_entity {
  */
 struct uclamp_se {
 	unsigned int value		: bits_per(SCHED_CAPACITY_SCALE);
-	unsigned int bucket_id		: bits_per(UCLAMP_BUCKETS);
-	unsigned int active		: 1;
 	unsigned int user_defined	: 1;
 };
 #endif /* CONFIG_UCLAMP_TASK */
diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index d01755d3142f..185f2470597b 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -8,7 +8,8 @@
  * Interface between cpufreq drivers and the scheduler:
  */
 
-#define SCHED_CPUFREQ_IOWAIT	(1U << 0)
+#define SCHED_CPUFREQ_IOWAIT		(1U << 0)
+#define SCHED_CPUFREQ_PERF_HINTS	(1U << 1)
 
 #ifdef CONFIG_CPU_FREQ
 struct cpufreq_policy;
diff --git a/init/Kconfig b/init/Kconfig
index 9ffb103fc927..8f0cf36cd577 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -809,37 +809,6 @@ config UCLAMP_TASK
 
 	  If in doubt, say N.
 
-config UCLAMP_BUCKETS_COUNT
-	int "Number of supported utilization clamp buckets"
-	range 5 20
-	default 5
-	depends on UCLAMP_TASK
-	help
-	  Defines the number of clamp buckets to use. The range of each bucket
-	  will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
-	  number of clamp buckets the finer their granularity and the higher
-	  the precision of clamping aggregation and tracking at run-time.
-
-	  For example, with the minimum configuration value we will have 5
-	  clamp buckets tracking 20% utilization each. A 25% boosted tasks will
-	  be refcounted in the [20..39]% bucket and will set the bucket clamp
-	  effective value to 25%.
-	  If a second 30% boosted task should be co-scheduled on the same CPU,
-	  that task will be refcounted in the same bucket of the first task and
-	  it will boost the bucket clamp effective value to 30%.
-	  The clamp effective value of a bucket is reset to its nominal value
-	  (20% in the example above) when there are no more tasks refcounted in
-	  that bucket.
-
-	  An additional boost/capping margin can be added to some tasks. In the
-	  example above the 25% task will be boosted to 30% until it exits the
-	  CPU. If that should be considered not acceptable on certain systems,
-	  it's always possible to reduce the margin by increasing the number of
-	  clamp buckets to trade off used memory for run-time tracking
-	  precision.
-
-	  If in doubt, use the default value.
-
 endmenu #
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9c8626b4ddff..5095f40edf0e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1389,37 +1389,9 @@ static unsigned int sysctl_sched_uclamp_util_min_rt_default = SCHED_CAPACITY_SCA
 /* All clamps are required to be less or equal than these values */
 static struct uclamp_se uclamp_default[UCLAMP_CNT];
 
-/*
- * This static key is used to reduce the uclamp overhead in the fast path. It
- * primarily disables the call to uclamp_rq_{inc, dec}() in
- * enqueue/dequeue_task().
- *
- * This allows users to continue to enable uclamp in their kernel config with
- * minimum uclamp overhead in the fast path.
- *
- * As soon as userspace modifies any of the uclamp knobs, the static key is
- * enabled, since we have an actual users that make use of uclamp
- * functionality.
- *
- * The knobs that would enable this static key are:
- *
- *   * A task modifying its uclamp value with sched_setattr().
- *   * An admin modifying the sysctl_sched_uclamp_{min, max} via procfs.
- *   * An admin modifying the cgroup cpu.uclamp.{min, max}
- */
-DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);
-
-/* Integer rounded range for each bucket */
-#define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
-
 #define for_each_clamp_id(clamp_id) \
 	for ((clamp_id) = 0; (clamp_id) < UCLAMP_CNT; (clamp_id)++)
 
-static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
-{
-	return min_t(unsigned int, clamp_value / UCLAMP_BUCKET_DELTA, UCLAMP_BUCKETS - 1);
-}
-
 static inline unsigned int uclamp_none(enum uclamp_id clamp_id)
 {
 	if (clamp_id == UCLAMP_MIN)
@@ -1431,73 +1403,25 @@ static inline void uclamp_se_set(struct uclamp_se *uc_se,
 				 unsigned int value, bool user_defined)
 {
 	uc_se->value = value;
-	uc_se->bucket_id = uclamp_bucket_id(value);
 	uc_se->user_defined = user_defined;
 }
 
-static inline unsigned int
-uclamp_idle_value(struct rq *rq, enum uclamp_id clamp_id,
-		  unsigned int clamp_value)
-{
-	/*
-	 * Avoid blocked utilization pushing up the frequency when we go
-	 * idle (which drops the max-clamp) by retaining the last known
-	 * max-clamp.
-	 */
-	if (clamp_id == UCLAMP_MAX) {
-		rq->uclamp_flags |= UCLAMP_FLAG_IDLE;
-		return clamp_value;
-	}
-
-	return uclamp_none(UCLAMP_MIN);
-}
-
-static inline void uclamp_idle_reset(struct rq *rq, enum uclamp_id clamp_id,
-				     unsigned int clamp_value)
-{
-	/* Reset max-clamp retention only on idle exit */
-	if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
-		return;
-
-	uclamp_rq_set(rq, clamp_id, clamp_value);
-}
-
-static inline
-unsigned int uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
-				 unsigned int clamp_value)
-{
-	struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
-	int bucket_id = UCLAMP_BUCKETS - 1;
-
-	/*
-	 * Since both min and max clamps are max aggregated, find the
-	 * top most bucket with tasks in.
-	 */
-	for ( ; bucket_id >= 0; bucket_id--) {
-		if (!bucket[bucket_id].tasks)
-			continue;
-		return bucket[bucket_id].value;
-	}
-
-	/* No tasks -- default clamp values */
-	return uclamp_idle_value(rq, clamp_id, clamp_value);
-}
+static inline void update_uclamp_req(struct task_struct *p,
+				     enum uclamp_id clamp_id,
+				     unsigned int value, bool user_defined);
 
 static void __uclamp_update_util_min_rt_default(struct task_struct *p)
 {
 	unsigned int default_util_min;
-	struct uclamp_se *uc_se;
 
 	lockdep_assert_held(&p->pi_lock);
 
-	uc_se = &p->uclamp_req[UCLAMP_MIN];
-
 	/* Only sync if user didn't override the default */
-	if (uc_se->user_defined)
+	if (p->uclamp_req[UCLAMP_MIN].user_defined)
 		return;
 
 	default_util_min = sysctl_sched_uclamp_util_min_rt_default;
-	uclamp_se_set(uc_se, default_util_min, false);
+	update_uclamp_req(p, UCLAMP_MIN, default_util_min, false);
 }
 
 static void uclamp_update_util_min_rt_default(struct task_struct *p)
@@ -1545,7 +1469,7 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
  *   group or in an autogroup
  * - the system default clamp value, defined by the sysadmin
  */
-static inline struct uclamp_se
+static inline unsigned long
 uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
 {
 	struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
@@ -1553,241 +1477,27 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
 
 	/* System default restrictions always apply */
 	if (unlikely(uc_req.value > uc_max.value))
-		return uc_max;
+		return uc_max.value;
 
-	return uc_req;
+	return uc_req.value;
 }
 
 unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
 {
-	struct uclamp_se uc_eff;
-
-	/* Task currently refcounted: use back-annotated (effective) value */
-	if (p->uclamp[clamp_id].active)
-		return (unsigned long)p->uclamp[clamp_id].value;
-
-	uc_eff = uclamp_eff_get(p, clamp_id);
-
-	return (unsigned long)uc_eff.value;
+	/* This should be kept up-to-date whenever uclamp value is changed */
+	return (unsigned long)p->uclamp[clamp_id].value;
 }
 
-/*
- * When a task is enqueued on a rq, the clamp bucket currently defined by the
- * task's uclamp::bucket_id is refcounted on that rq. This also immediately
- * updates the rq's clamp value if required.
- *
- * Tasks can have a task-specific value requested from user-space, track
- * within each bucket the maximum value for tasks refcounted in it.
- * This "local max aggregation" allows to track the exact "requested" value
- * for each bucket when all its RUNNABLE tasks require the same clamp.
- */
-static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
-				    enum uclamp_id clamp_id)
+/* Update task's p->uclamp_req and effective p->uclamp in one go */
+static inline void update_uclamp_req(struct task_struct *p,
+				     enum uclamp_id clamp_id,
+				     unsigned int value, bool user_defined)
 {
-	struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
-	struct uclamp_se *uc_se = &p->uclamp[clamp_id];
-	struct uclamp_bucket *bucket;
-
-	lockdep_assert_rq_held(rq);
-
-	/* Update task effective clamp */
-	p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
-
-	bucket = &uc_rq->bucket[uc_se->bucket_id];
-	bucket->tasks++;
-	uc_se->active = true;
-
-	uclamp_idle_reset(rq, clamp_id, uc_se->value);
-
-	/*
-	 * Local max aggregation: rq buckets always track the max
-	 * "requested" clamp value of its RUNNABLE tasks.
-	 */
-	if (bucket->tasks == 1 || uc_se->value > bucket->value)
-		bucket->value = uc_se->value;
-
-	if (uc_se->value > uclamp_rq_get(rq, clamp_id))
-		uclamp_rq_set(rq, clamp_id, uc_se->value);
-}
-
-/*
- * When a task is dequeued from a rq, the clamp bucket refcounted by the task
- * is released. If this is the last task reference counting the rq's max
- * active clamp value, then the rq's clamp value is updated.
- *
- * Both refcounted tasks and rq's cached clamp values are expected to be
- * always valid. If it's detected they are not, as defensive programming,
- * enforce the expected state and warn.
- */
-static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
-				    enum uclamp_id clamp_id)
-{
-	struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
-	struct uclamp_se *uc_se = &p->uclamp[clamp_id];
-	struct uclamp_bucket *bucket;
-	unsigned int bkt_clamp;
-	unsigned int rq_clamp;
-
-	lockdep_assert_rq_held(rq);
-
-	/*
-	 * If sched_uclamp_used was enabled after task @p was enqueued,
-	 * we could end up with unbalanced call to uclamp_rq_dec_id().
-	 *
-	 * In this case the uc_se->active flag should be false since no uclamp
-	 * accounting was performed at enqueue time and we can just return
-	 * here.
-	 *
-	 * Need to be careful of the following enqueue/dequeue ordering
-	 * problem too
-	 *
-	 *	enqueue(taskA)
-	 *	// sched_uclamp_used gets enabled
-	 *	enqueue(taskB)
-	 *	dequeue(taskA)
-	 *	// Must not decrement bucket->tasks here
-	 *	dequeue(taskB)
-	 *
-	 * where we could end up with stale data in uc_se and
-	 * bucket[uc_se->bucket_id].
-	 *
-	 * The following check here eliminates the possibility of such race.
-	 */
-	if (unlikely(!uc_se->active))
-		return;
-
-	bucket = &uc_rq->bucket[uc_se->bucket_id];
-
-	SCHED_WARN_ON(!bucket->tasks);
-	if (likely(bucket->tasks))
-		bucket->tasks--;
-
-	uc_se->active = false;
-
-	/*
-	 * Keep "local max aggregation" simple and accept to (possibly)
-	 * overboost some RUNNABLE tasks in the same bucket.
-	 * The rq clamp bucket value is reset to its base value whenever
-	 * there are no more RUNNABLE tasks refcounting it.
-	 */
-	if (likely(bucket->tasks))
-		return;
-
-	rq_clamp = uclamp_rq_get(rq, clamp_id);
-	/*
-	 * Defensive programming: this should never happen. If it happens,
-	 * e.g. due to future modification, warn and fixup the expected value.
-	 */
-	SCHED_WARN_ON(bucket->value > rq_clamp);
-	if (bucket->value >= rq_clamp) {
-		bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
-		uclamp_rq_set(rq, clamp_id, bkt_clamp);
-	}
-}
-
-static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
-{
-	enum uclamp_id clamp_id;
-
-	/*
-	 * Avoid any overhead until uclamp is actually used by the userspace.
-	 *
-	 * The condition is constructed such that a NOP is generated when
-	 * sched_uclamp_used is disabled.
-	 */
-	if (!static_branch_unlikely(&sched_uclamp_used))
-		return;
-
-	if (unlikely(!p->sched_class->uclamp_enabled))
-		return;
-
-	for_each_clamp_id(clamp_id)
-		uclamp_rq_inc_id(rq, p, clamp_id);
-
-	/* Reset clamp idle holding when there is one RUNNABLE task */
-	if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
-		rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
-}
-
-static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
-{
-	enum uclamp_id clamp_id;
-
-	/*
-	 * Avoid any overhead until uclamp is actually used by the userspace.
-	 *
-	 * The condition is constructed such that a NOP is generated when
-	 * sched_uclamp_used is disabled.
-	 */
-	if (!static_branch_unlikely(&sched_uclamp_used))
-		return;
-
-	if (unlikely(!p->sched_class->uclamp_enabled))
-		return;
-
-	for_each_clamp_id(clamp_id)
-		uclamp_rq_dec_id(rq, p, clamp_id);
-}
-
-static inline void uclamp_rq_reinc_id(struct rq *rq, struct task_struct *p,
-				      enum uclamp_id clamp_id)
-{
-	if (!p->uclamp[clamp_id].active)
-		return;
-
-	uclamp_rq_dec_id(rq, p, clamp_id);
-	uclamp_rq_inc_id(rq, p, clamp_id);
-
-	/*
-	 * Make sure to clear the idle flag if we've transiently reached 0
-	 * active tasks on rq.
-	 */
-	if (clamp_id == UCLAMP_MAX && (rq->uclamp_flags & UCLAMP_FLAG_IDLE))
-		rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
-}
-
-static inline void
-uclamp_update_active(struct task_struct *p)
-{
-	enum uclamp_id clamp_id;
-	struct rq_flags rf;
-	struct rq *rq;
-
-	/*
-	 * Lock the task and the rq where the task is (or was) queued.
-	 *
-	 * We might lock the (previous) rq of a !RUNNABLE task, but that's the
-	 * price to pay to safely serialize util_{min,max} updates with
-	 * enqueues, dequeues and migration operations.
-	 * This is the same locking schema used by __set_cpus_allowed_ptr().
-	 */
-	rq = task_rq_lock(p, &rf);
-
-	/*
-	 * Setting the clamp bucket is serialized by task_rq_lock().
-	 * If the task is not yet RUNNABLE and its task_struct is not
-	 * affecting a valid clamp bucket, the next time it's enqueued,
-	 * it will already see the updated clamp bucket value.
-	 */
-	for_each_clamp_id(clamp_id)
-		uclamp_rq_reinc_id(rq, p, clamp_id);
-
-	task_rq_unlock(rq, p, &rf);
+	uclamp_se_set(&p->uclamp_req[clamp_id], value, user_defined);
+	uclamp_se_set(&p->uclamp[clamp_id], uclamp_eff_get(p, clamp_id), user_defined);
 }
 
 #ifdef CONFIG_UCLAMP_TASK_GROUP
-static inline void
-uclamp_update_active_tasks(struct cgroup_subsys_state *css)
-{
-	struct css_task_iter it;
-	struct task_struct *p;
-
-	css_task_iter_start(css, 0, &it);
-	while ((p = css_task_iter_next(&it)))
-		uclamp_update_active(p);
-	css_task_iter_end(&it);
-}
-
 static void cpu_util_update_eff(struct cgroup_subsys_state *css);
 #endif
 
@@ -1874,15 +1584,11 @@ static int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 		update_root_tg = true;
 	}
 
-	if (update_root_tg) {
-		static_branch_enable(&sched_uclamp_used);
+	if (update_root_tg)
 		uclamp_update_root_tg();
-	}
 
-	if (old_min_rt != sysctl_sched_uclamp_util_min_rt_default) {
-		static_branch_enable(&sched_uclamp_used);
+	if (old_min_rt != sysctl_sched_uclamp_util_min_rt_default)
 		uclamp_sync_util_min_rt_default();
-	}
 
 	/*
 	 * We update all RUNNABLE tasks only when task groups are in use.
@@ -1923,15 +1629,6 @@ static int uclamp_validate(struct task_struct *p,
 	if (util_min != -1 && util_max != -1 && util_min > util_max)
 		return -EINVAL;
 
-	/*
-	 * We have valid uclamp attributes; make sure uclamp is enabled.
-	 *
-	 * We need to do that here, because enabling static branches is a
-	 * blocking operation which obviously cannot be done while holding
-	 * scheduler locks.
-	 */
-	static_branch_enable(&sched_uclamp_used);
-
 	return 0;
 }
 
@@ -1981,7 +1678,7 @@ static void __setscheduler_uclamp(struct task_struct *p,
 		else
 			value = uclamp_none(clamp_id);
 
-		uclamp_se_set(uc_se, value, false);
+		update_uclamp_req(p, clamp_id, value, false);
 
 	}
 
@@ -1990,14 +1687,14 @@ static void __setscheduler_uclamp(struct task_struct *p,
 
 	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN &&
 	    attr->sched_util_min != -1) {
-		uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
-			      attr->sched_util_min, true);
+
+		update_uclamp_req(p, UCLAMP_MIN, attr->sched_util_min, true);
 	}
 
 	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX &&
 	    attr->sched_util_max != -1) {
-		uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
-			      attr->sched_util_max, true);
+
+		update_uclamp_req(p, UCLAMP_MAX, attr->sched_util_max, true);
 	}
 }
 
@@ -2009,15 +1706,12 @@ static void uclamp_fork(struct task_struct *p)
 	 * We don't need to hold task_rq_lock() when updating p->uclamp_* here
	 * as the task is still at its early fork stages.
	 */
-	for_each_clamp_id(clamp_id)
-		p->uclamp[clamp_id].active = false;
-
 	if (likely(!p->sched_reset_on_fork))
 		return;
 
 	for_each_clamp_id(clamp_id) {
-		uclamp_se_set(&p->uclamp_req[clamp_id],
-			      uclamp_none(clamp_id), false);
+		update_uclamp_req(&init_task, clamp_id,
+				  uclamp_none(clamp_id), false);
 	}
 }
 
@@ -2026,32 +1720,17 @@ static void uclamp_post_fork(struct task_struct *p)
 	uclamp_update_util_min_rt_default(p);
 }
 
-static void __init init_uclamp_rq(struct rq *rq)
-{
-	enum uclamp_id clamp_id;
-	struct uclamp_rq *uc_rq = rq->uclamp;
-
-	for_each_clamp_id(clamp_id) {
-		uc_rq[clamp_id] = (struct uclamp_rq) {
-			.value = uclamp_none(clamp_id)
-		};
-	}
-
-	rq->uclamp_flags = UCLAMP_FLAG_IDLE;
-}
-
 static void __init init_uclamp(void)
 {
 	struct uclamp_se uc_max = {};
 	enum uclamp_id clamp_id;
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		init_uclamp_rq(cpu_rq(cpu));
 
 	for_each_clamp_id(clamp_id) {
 		uclamp_se_set(&init_task.uclamp_req[clamp_id],
 			      uclamp_none(clamp_id), false);
+
+		uclamp_se_set(&init_task.uclamp[clamp_id],
+			      uclamp_none(clamp_id), false);
 	}
 
 	/* System defaults allow max clamp values for both indexes */
@@ -2066,8 +1745,6 @@ static void __init init_uclamp(void)
 }
 
 #else /* CONFIG_UCLAMP_TASK */
-static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
-static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
 static inline int uclamp_validate(struct task_struct *p,
 				  const struct sched_attr *attr)
 {
@@ -2114,7 +1791,6 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 		psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
 	}
 
-	uclamp_rq_inc(rq, p);
 	p->sched_class->enqueue_task(rq, p, flags);
 
 	if (sched_core_enabled(rq))
@@ -2134,7 +1810,6 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 		psi_dequeue(p, flags & DEQUEUE_SLEEP);
 	}
 
-	uclamp_rq_dec(rq, p);
 	p->sched_class->dequeue_task(rq, p, flags);
 }
 
@@ -5150,6 +4825,7 @@ static inline void finish_lock_switch(struct rq *rq)
 	 * prev into current:
	 */
	spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
+	uclamp_context_switch(rq);
	__balance_callbacks(rq);
	raw_spin_rq_unlock_irq(rq);
 }
@@ -7469,13 +7145,17 @@ int sched_core_idle_cpu(int cpu)
  */
 unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
				 unsigned long *min,
-				 unsigned long *max)
+				 unsigned long *max,
+				 struct task_struct *p)
 {
	unsigned long util, irq, scale;
	struct rq *rq = cpu_rq(cpu);

	scale = arch_scale_cpu_capacity(cpu);

+	if (!p)
+		p = rq->curr;
+
	/*
	 * Early check to see if IRQ/steal time saturates the CPU, can be
	 * because of inaccuracies in how we track these -- see
@@ -7497,13 +7177,13 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
	 *   steals time to the deadline task.
	 * - The minimum performance requirement for CFS and/or RT.
	 */
-	*min = max(irq + cpu_bw_dl(rq), uclamp_rq_get(rq, UCLAMP_MIN));
+	*min = max(irq + cpu_bw_dl(rq), uclamp_eff_value(p, UCLAMP_MIN));

	/*
	 * When an RT task is runnable and uclamp is not used, we must
	 * ensure that the task will run at maximum compute capacity.
	 */
-	if (!uclamp_is_used() && rt_rq_is_runnable(&rq->rt))
+	if (!IS_ENABLED(CONFIG_UCLAMP_TASK) && rt_rq_is_runnable(&rq->rt))
		*min = max(*min, scale);
 }
@@ -7521,7 +7201,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
	 * than the actual utilization because of uclamp_max requirements.
	 */
	if (max)
-		*max = min(scale, uclamp_rq_get(rq, UCLAMP_MAX));
+		*max = min(scale, uclamp_eff_value(p, UCLAMP_MAX));

	if (util >= scale)
		return scale;
@@ -7543,7 +7223,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,

 unsigned long sched_cpu_util(int cpu)
 {
-	return effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL);
+	return effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL, NULL);
 }
 #endif /* CONFIG_SMP */

@@ -10585,6 +10265,30 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 }

 #ifdef CONFIG_UCLAMP_TASK_GROUP
+static inline void
+uclamp_update_css_tasks(struct cgroup_subsys_state *css)
+{
+	enum uclamp_id clamp_id;
+	struct css_task_iter it;
+	struct task_struct *p;
+
+	/* Update effective uclamp value for all the css tasks */
+	css_task_iter_start(css, 0, &it);
+	while ((p = css_task_iter_next(&it))) {
+
+		/* Only RT and fair support uclamp */
+		if (!rt_policy(p->policy) && !fair_policy(p->policy))
+			continue;
+
+		/* Update task effect value */
+		for_each_clamp_id(clamp_id) {
+			uclamp_se_set(&p->uclamp[clamp_id],
+				      uclamp_eff_get(p, clamp_id), false);
+		}
+	}
+	css_task_iter_end(&it);
+}
+
 static void cpu_util_update_eff(struct cgroup_subsys_state *css)
 {
	struct cgroup_subsys_state *top_css = css;
@@ -10620,7 +10324,6 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
			if (eff[clamp_id] == uc_se[clamp_id].value)
				continue;
			uc_se[clamp_id].value = eff[clamp_id];
-			uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
			clamps |= (0x1 << clamp_id);
		}
		if (!clamps) {
@@ -10628,8 +10331,7 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
			continue;
		}

-		/* Immediately update descendants RUNNABLE tasks */
-		uclamp_update_active_tasks(css);
+		uclamp_update_css_tasks(css);
	}
 }

@@ -10687,8 +10389,6 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
	if (req.ret)
		return req.ret;

-	static_branch_enable(&sched_uclamp_used);
-
	guard(mutex)(&uclamp_mutex);
	guard(rcu)();

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 788208becc13..137636f62593 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -244,7 +244,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
 {
	unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);

-	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
+	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max, NULL);
	util = max(util, boost);
	sg_cpu->bw_min = min;
	sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f99910fc6705..74326179658c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4931,7 +4931,7 @@ static inline int util_fits_cpu(unsigned long util,
	 */
	fits = fits_capacity(util, capacity);

-	if (!uclamp_is_used())
+	if (!IS_ENABLED(CONFIG_UCLAMP_TASK))
		return fits;

	/*
@@ -6624,11 +6624,18 @@ static inline void hrtick_update(struct rq *rq)
 #ifdef CONFIG_SMP
 static inline bool cpu_overutilized(int cpu)
 {
-	unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
-	unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
+	struct task_struct *p = cpu_rq(cpu)->curr;
+	unsigned long util_min = uclamp_eff_value(p, UCLAMP_MIN);
+	unsigned long util_max = uclamp_eff_value(p, UCLAMP_MAX);

-	/* Return true only if the utilization doesn't fit CPU's capacity */
-	return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
+	/*
+	 * Return true only if the utilization doesn't fit CPU's capacity
+	 *
+	 * TODO: this definition of overutilized = misfit() is stale and no
+	 * longer fit for purpose and must be updated to reflect when the
+	 * system or a cluster is truly overloaded.
+	 */
+	return !util_fits_cpu(cpu_util_cfs(cpu), util_min, util_max, cpu);
 }

 static inline void update_overutilized_status(struct rq *rq)
@@ -7776,7 +7783,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
	for_each_cpu(cpu, pd_cpus) {
		unsigned long util = cpu_util(cpu, p, -1, 0);

-		busy_time += effective_cpu_util(cpu, util, NULL, NULL);
+		busy_time += effective_cpu_util(cpu, util, NULL, NULL, NULL);
	}

	eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
@@ -7801,30 +7808,9 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
		unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
		unsigned long eff_util, min, max;

-		/*
-		 * Performance domain frequency: utilization clamping
-		 * must be considered since it affects the selection
-		 * of the performance domain frequency.
-		 * NOTE: in case RT tasks are running, by default the
-		 * FREQUENCY_UTIL's utilization can be max OPP.
-		 */
-		eff_util = effective_cpu_util(cpu, util, &min, &max);
-
-		/* Task's uclamp can modify min and max value */
-		if (tsk && uclamp_is_used()) {
-			min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
-
-			/*
-			 * If there is no active max uclamp constraint,
-			 * directly use task's one, otherwise keep max.
-			 */
-			if (uclamp_rq_is_idle(cpu_rq(cpu)))
-				max = uclamp_eff_value(p, UCLAMP_MAX);
-			else
-				max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
-		}
-
+		eff_util = effective_cpu_util(cpu, util, &min, &max, tsk);
		eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
+
		max_util = max(max_util, eff_util);
	}

@@ -7897,8 +7883,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 {
	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
	unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
-	unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
-	unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
+	unsigned long p_util_min = uclamp_eff_value(p, UCLAMP_MIN);
+	unsigned long p_util_max = uclamp_eff_value(p, UCLAMP_MAX);
	struct root_domain *rd = this_rq()->rd;
	int cpu, best_energy_cpu, target = -1;
	int prev_fits = -1, best_fits = -1;
@@ -7932,10 +7918,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
	eenv_task_busy_time(&eenv, p, prev_cpu);

	for (; pd; pd = pd->next) {
-		unsigned long util_min = p_util_min, util_max = p_util_max;
		unsigned long cpu_cap, cpu_thermal_cap, util;
		long prev_spare_cap = -1, max_spare_cap = -1;
-		unsigned long rq_util_min, rq_util_max;
		unsigned long cur_delta, base_energy;
		int max_spare_cap_cpu = -1;
		int fits, max_fits = -1;
@@ -7954,8 +7938,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
		eenv.pd_cap = 0;

		for_each_cpu(cpu, cpus) {
-			struct rq *rq = cpu_rq(cpu);
-
			eenv.pd_cap += cpu_thermal_cap;

			if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
@@ -7967,29 +7949,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
			util = cpu_util(cpu, p, cpu, 0);
			cpu_cap = capacity_of(cpu);

-			/*
-			 * Skip CPUs that cannot satisfy the capacity request.
-			 * IOW, placing the task there would make the CPU
-			 * overutilized. Take uclamp into account to see how
-			 * much capacity we can get out of the CPU; this is
-			 * aligned with sched_cpu_util().
-			 */
-			if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
-				/*
-				 * Open code uclamp_rq_util_with() except for
-				 * the clamp() part. Ie: apply max aggregation
-				 * only. util_fits_cpu() logic requires to
-				 * operate on non clamped util but must use the
-				 * max-aggregated uclamp_{min, max}.
-				 */
-				rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
-				rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
-
-				util_min = max(rq_util_min, p_util_min);
-				util_max = max(rq_util_max, p_util_max);
-			}
-
-			fits = util_fits_cpu(util, util_min, util_max, cpu);
+			fits = util_fits_cpu(util, p_util_min, p_util_max, cpu);
			if (!fits)
				continue;

@@ -13135,10 +13095,6 @@ DEFINE_SCHED_CLASS(fair) = {
 #ifdef CONFIG_SCHED_CORE
	.task_is_throttled	= task_is_throttled_fair,
 #endif
-
-#ifdef CONFIG_UCLAMP_TASK
-	.uclamp_enabled		= 1,
-#endif
 };

 #ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3261b067b67e..86733bed0e3c 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2681,10 +2681,6 @@ DEFINE_SCHED_CLASS(rt) = {
 #ifdef CONFIG_SCHED_CORE
	.task_is_throttled	= task_is_throttled_rt,
 #endif
-
-#ifdef CONFIG_UCLAMP_TASK
-	.uclamp_enabled		= 1,
-#endif
 };

 #ifdef CONFIG_RT_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a7c89c623250..f05a0674a036 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -912,50 +912,6 @@ extern void rto_push_irq_work_func(struct irq_work *work);
 #endif
 #endif /* CONFIG_SMP */

-#ifdef CONFIG_UCLAMP_TASK
-/*
- * struct uclamp_bucket - Utilization clamp bucket
- * @value: utilization clamp value for tasks on this clamp bucket
- * @tasks: number of RUNNABLE tasks on this clamp bucket
- *
- * Keep track of how many tasks are RUNNABLE for a given utilization
- * clamp value.
- */
-struct uclamp_bucket {
-	unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
-	unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
-};
-
-/*
- * struct uclamp_rq - rq's utilization clamp
- * @value: currently active clamp values for a rq
- * @bucket: utilization clamp buckets affecting a rq
- *
- * Keep track of RUNNABLE tasks on a rq to aggregate their clamp values.
- * A clamp value is affecting a rq when there is at least one task RUNNABLE
- * (or actually running) with that value.
- *
- * There are up to UCLAMP_CNT possible different clamp values, currently there
- * are only two: minimum utilization and maximum utilization.
- *
- * All utilization clamping values are MAX aggregated, since:
- * - for util_min: we want to run the CPU at least at the max of the minimum
- *   utilization required by its currently RUNNABLE tasks.
- * - for util_max: we want to allow the CPU to run up to the max of the
- *   maximum utilization allowed by its currently RUNNABLE tasks.
- *
- * Since on each system we expect only a limited number of different
- * utilization clamp values (UCLAMP_BUCKETS), use a simple array to track
- * the metrics required to compute all the per-rq utilization clamp values.
- */
-struct uclamp_rq {
-	unsigned int value;
-	struct uclamp_bucket bucket[UCLAMP_BUCKETS];
-};
-
-DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
-#endif /* CONFIG_UCLAMP_TASK */
-
 struct rq;
 struct balance_callback {
	struct balance_callback *next;
@@ -994,13 +950,6 @@ struct rq {
 #endif
	u64			nr_switches;

-#ifdef CONFIG_UCLAMP_TASK
-	/* Utilization clamp values based on CPU's RUNNABLE tasks */
-	struct uclamp_rq	uclamp[UCLAMP_CNT] ____cacheline_aligned;
-	unsigned int		uclamp_flags;
-#define UCLAMP_FLAG_IDLE 0x01
-#endif
-
	struct cfs_rq		cfs;
	struct rt_rq		rt;
	struct dl_rq		dl;
@@ -2248,10 +2197,6 @@ extern s64 update_curr_common(struct rq *rq);

 struct sched_class {

-#ifdef CONFIG_UCLAMP_TASK
-	int uclamp_enabled;
-#endif
-
	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
	void (*yield_task)   (struct rq *rq);
@@ -2997,7 +2942,8 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 #ifdef CONFIG_SMP
 unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
				 unsigned long *min,
-				 unsigned long *max);
+				 unsigned long *max,
+				 struct task_struct *p);

 unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
				       unsigned long min,
@@ -3072,49 +3018,35 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
 #ifdef CONFIG_UCLAMP_TASK
 unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);

-static inline unsigned long uclamp_rq_get(struct rq *rq,
-					  enum uclamp_id clamp_id)
-{
-	return READ_ONCE(rq->uclamp[clamp_id].value);
-}
-
-static inline void uclamp_rq_set(struct rq *rq, enum uclamp_id clamp_id,
-				 unsigned int value)
-{
-	WRITE_ONCE(rq->uclamp[clamp_id].value, value);
-}
-
-static inline bool uclamp_rq_is_idle(struct rq *rq)
-{
-	return rq->uclamp_flags & UCLAMP_FLAG_IDLE;
-}
-
 /* Is the rq being capped/throttled by uclamp_max? */
 static inline bool uclamp_rq_is_capped(struct rq *rq)
 {
	unsigned long rq_util;
	unsigned long max_util;

-	if (!static_branch_likely(&sched_uclamp_used))
-		return false;
-
	rq_util = cpu_util_cfs(cpu_of(rq)) + cpu_util_rt(rq);
-	max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
+	max_util = uclamp_eff_value(rq->curr, UCLAMP_MAX);

	return max_util != SCHED_CAPACITY_SCALE && rq_util >= max_util;
 }

-/*
- * When uclamp is compiled in, the aggregation at rq level is 'turned off'
- * by default in the fast path and only gets turned on once userspace performs
- * an operation that requires it.
- *
- * Returns true if userspace opted-in to use uclamp and aggregation at rq level
- * hence is active.
- */
-static inline bool uclamp_is_used(void)
+/* Request freq update on context switch if necessary */
+static inline void uclamp_context_switch(struct rq *rq)
 {
-	return static_branch_likely(&sched_uclamp_used);
+	unsigned long uclamp_min;
+	unsigned long uclamp_max;
+	unsigned long util;
+
+	/* Only RT and FAIR tasks are aware of uclamp */
+	if (!rt_policy(current->policy) && !fair_policy(current->policy))
+		return;
+
+	uclamp_min = uclamp_eff_value(current, UCLAMP_MIN);
+	uclamp_max = uclamp_eff_value(current, UCLAMP_MAX);
+	util = rq->cfs.avg.util_avg;
+
+	if (uclamp_min > util || uclamp_max < util)
+		cpufreq_update_util(rq, SCHED_CPUFREQ_PERF_HINTS);
 }
 #else /* CONFIG_UCLAMP_TASK */
 static inline unsigned long uclamp_eff_value(struct task_struct *p,
@@ -3128,28 +3060,8 @@ static inline unsigned long uclamp_eff_value(struct task_struct *p,

 static inline bool uclamp_rq_is_capped(struct rq *rq) { return false; }

-static inline bool uclamp_is_used(void)
+static inline void uclamp_context_switch(struct rq *rq)
 {
-	return false;
-}
-
-static inline unsigned long uclamp_rq_get(struct rq *rq,
-					  enum uclamp_id clamp_id)
-{
-	if (clamp_id == UCLAMP_MIN)
-		return 0;
-
-	return SCHED_CAPACITY_SCALE;
-}
-
-static inline void uclamp_rq_set(struct rq *rq, enum uclamp_id clamp_id,
-				 unsigned int value)
-{
-}
-
-static inline bool uclamp_rq_is_idle(struct rq *rq)
-{
-	return false;
-}
 #endif /* CONFIG_UCLAMP_TASK */

From patchwork Fri Dec 8 01:52:41 2023
X-Patchwork-Submitter: Qais Yousef
X-Patchwork-Id: 752420
From: Qais Yousef
To: Ingo Molnar, Peter Zijlstra, "Rafael J. Wysocki", Viresh Kumar,
    Vincent Guittot, Dietmar Eggemann
Cc: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org, Lukasz Luba,
    Wei Wang, Rick Yiu, Chung-Kai Mei, Hongyan Xia, Qais Yousef
Subject: [PATCH 3/4] sched/schedutil: Ignore update requests for short running tasks
Date: Fri, 8 Dec 2023 01:52:41 +0000
Message-Id: <20231208015242.385103-4-qyousef@layalina.io>
In-Reply-To: <20231208015242.385103-1-qyousef@layalina.io>
References: <20231208015242.385103-1-qyousef@layalina.io>

Ignore freq updates made to honour uclamp requests if the task is short
running. It won't run long enough to see the changes, so avoid the
unnecessary work and noise.

Make sure the SCHED_CPUFREQ_PERF_HINTS flag is set in task_tick_fair() so
that we can take corrective action if the task continues to run long
enough that it is no longer considered a short task.

This should address the problem of noisy short running tasks unnecessarily
causing frequency spikes when waking up on a CPU that is running a busy
task capped by UCLAMP_MAX.

Move the helper functions that access task_util_est() and related
attributes to sched.h to enable using them from cpufreq_schedutil.c.

Signed-off-by: Qais Yousef (Google)
---
 kernel/sched/cpufreq_schedutil.c | 59 ++++++++++++++++++++++++++++++++
 kernel/sched/fair.c              | 24 +-------------
 kernel/sched/sched.h             | 22 ++++++++++++
 3 files changed, 82 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 137636f62593..04a5cfcdbbf2 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -8,11 +8,18 @@
 
 #define IOWAIT_BOOST_MIN	(SCHED_CAPACITY_SCALE / 8)
 
+/*
+ * Min runtime in us a task should run for above rate_limit_us so that we don't
+ * ignore it in ignore_short_tasks().
+ */
+#define SHORT_TASK_MIN		500
+
 DEFINE_PER_CPU_READ_MOSTLY(unsigned long, response_time_mult);
 
 struct sugov_tunables {
 	struct gov_attr_set	attr_set;
 	unsigned int		rate_limit_us;
+	unsigned long		rate_limit_util;
 	unsigned int		response_time_ms;
 };
 
@@ -127,6 +134,49 @@ sugov_apply_response_time(unsigned long util, int cpu)
 	return mult >> SCHED_CAPACITY_SHIFT;
 }
 
+/*
+ * Ignore updates if current task's runtime is too short for the rate limit.
+ * The task must run for an average of rate_limit_us + SHORT_TASK_MIN for it
+ * not to be ignored.
+ *
+ * If we made a wrong decision and the task has changed characteristic such
+ * that it is no longer a short task, we should detect that at tick. Which can
+ * be a high penalty if the tick value is too high.
+ *
+ * XXX: can we take TICK_US into account somehow when verifying if we can
+ * ignore it?
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 137636f62593..04a5cfcdbbf2 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -8,11 +8,18 @@
 
 #define IOWAIT_BOOST_MIN	(SCHED_CAPACITY_SCALE / 8)
 
+/*
+ * Min runtime in us a task should run for above rate_limit_us so that we don't
+ * ignore it in ignore_short_tasks().
+ */
+#define SHORT_TASK_MIN	500
+
 DEFINE_PER_CPU_READ_MOSTLY(unsigned long, response_time_mult);
 
 struct sugov_tunables {
 	struct gov_attr_set	attr_set;
 	unsigned int		rate_limit_us;
+	unsigned long		rate_limit_util;
 	unsigned int		response_time_ms;
 };
 
@@ -127,6 +134,49 @@ sugov_apply_response_time(unsigned long util, int cpu)
 	return mult >> SCHED_CAPACITY_SHIFT;
 }
 
+/*
+ * Ignore updates if the current task's runtime is too short for the rate
+ * limit. The task must run for an average of rate_limit_us + SHORT_TASK_MIN
+ * for it not to be ignored.
+ *
+ * If we made a wrong decision and the task has changed characteristics such
+ * that it is no longer a short task, we should detect that at the next tick,
+ * which can be a high penalty if the tick period is long.
+ *
+ * XXX: can we take TICK_US into account somehow when verifying if we can
+ * ignore it?
+ *
+ * This is only valid for requests containing the SCHED_CPUFREQ_PERF_HINTS
+ * flag, i.e. uclamp hint requests at context switches.
+ *
+ * This flag is expected to be passed on context switch and tick. Only fair
+ * tasks are considered for now as we use util to approximate their average
+ * runtime. We can't do the same for RT without tracking the average runtime
+ * of the RT task in our accounting. And it might be risky to temporarily
+ * ignore the RT task's perf requirements as a mistake could have a higher
+ * consequence.
+ *
+ * Once fair gains the concept of latency sensitive tasks, we might need to
+ * consider the consequence of ignoring them here too, for the same reason
+ * that ignoring RT tasks is risky.
+ */
+static inline bool ignore_short_tasks(int cpu,
+				      struct sugov_policy *sg_policy,
+				      unsigned int flags)
+{
+	struct task_struct *p = cpu_rq(cpu)->curr;
+	unsigned long task_util;
+
+	if (!(flags & SCHED_CPUFREQ_PERF_HINTS))
+		return false;
+
+	if (!fair_policy(p->policy))
+		return false;
+
+	task_util = task_util_est(p);
+
+	return task_util < sg_policy->tunables->rate_limit_util;
+}
+
 static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
 {
 	s64 delta_ns;
@@ -396,8 +446,12 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
 					      u64 time, unsigned long max_cap,
 					      unsigned int flags)
 {
+	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
 	unsigned long boost;
 
+	if (ignore_short_tasks(sg_cpu->cpu, sg_policy, flags))
+		return false;
+
 	sugov_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
 
@@ -526,6 +580,9 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
 	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
 	unsigned int next_f;
 
+	if (ignore_short_tasks(sg_cpu->cpu, sg_policy, flags))
+		return;
+
 	raw_spin_lock(&sg_policy->update_lock);
 
 	sugov_iowait_boost(sg_cpu, time, flags);
@@ -612,6 +669,7 @@ rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count
 		return -EINVAL;
 
 	tunables->rate_limit_us = rate_limit_us;
+	tunables->rate_limit_util = approximate_util_avg(0, rate_limit_us + SHORT_TASK_MIN);
 
 	list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) {
 
@@ -853,6 +911,7 @@ static int sugov_init(struct cpufreq_policy *policy)
 	sg_policy->tunables = tunables;
 
 	tunables->rate_limit_us = cpufreq_policy_transition_delay_us(policy);
+	tunables->rate_limit_util = approximate_util_avg(0, tunables->rate_limit_us + SHORT_TASK_MIN);
 	tunables->response_time_ms = sugov_calc_freq_response_ms(sg_policy);
 	sugov_update_response_time_mult(sg_policy);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 74326179658c..b824e633ac2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4754,28 +4754,6 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
 
 static int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
 
-static inline unsigned long task_util(struct task_struct *p)
-{
-	return READ_ONCE(p->se.avg.util_avg);
-}
-
-static inline unsigned long task_runnable(struct task_struct *p)
-{
-	return READ_ONCE(p->se.avg.runnable_avg);
-}
-
-static inline unsigned long _task_util_est(struct task_struct *p)
-{
-	struct util_est ue = READ_ONCE(p->se.avg.util_est);
-
-	return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
-}
-
-static inline unsigned long task_util_est(struct task_struct *p)
-{
-	return max(task_util(p), _task_util_est(p));
-}
-
 static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
 				    struct task_struct *p)
 {
@@ -12544,7 +12522,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	update_misfit_status(curr, rq);
 	update_overutilized_status(task_rq(curr));
 
-	cpufreq_update_util(rq, 0);
+	cpufreq_update_util(rq, SCHED_CPUFREQ_PERF_HINTS);
 
 	task_tick_core(rq, curr);
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f05a0674a036..b7a8cd768bef 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2952,6 +2952,28 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
 unsigned long approximate_util_avg(unsigned long util, u64 delta);
 u64 approximate_runtime(unsigned long util);
 
+static inline unsigned long task_util(struct task_struct *p)
+{
+	return READ_ONCE(p->se.avg.util_avg);
+}
+
+static inline unsigned long task_runnable(struct task_struct *p)
+{
+	return READ_ONCE(p->se.avg.runnable_avg);
+}
+
+static inline unsigned long _task_util_est(struct task_struct *p)
+{
+	struct util_est ue = READ_ONCE(p->se.avg.util_est);
+
+	return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
+}
+
+static inline unsigned long task_util_est(struct task_struct *p)
+{
+	return max(task_util(p), _task_util_est(p));
+}
+
 /*
  * Any governor that relies on util signal to drive DVFS, must populate these
  * percpu dvfs_update_delay variables.
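
The task_util_est() aggregation the sched.h hunk above exposes is what
ignore_short_tasks() compares against rate_limit_util. A small standalone
sketch (not kernel code) of that aggregation with invented numbers; the
UTIL_AVG_UNCHANGED flag bit is modelled as bit 0 here purely for
illustration:

#include <stdio.h>

#define UTIL_AVG_UNCHANGED	0x1	/* illustrative flag bit */

struct util_est { unsigned int enqueued, ewma; };

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* max of util_avg, the EWMA and the last enqueued snapshot (flag cleared) */
static unsigned long task_util_est(unsigned long util_avg, struct util_est ue)
{
	unsigned long est = max_ul(ue.ewma, ue.enqueued & ~UTIL_AVG_UNCHANGED);

	return max_ul(util_avg, est);
}

int main(void)
{
	/*
	 * A periodic task: its instantaneous util decayed to 120 while it
	 * slept, but history (ewma 300, last enqueued 280) says it is
	 * bigger. The estimate keeps the short-task filter from wrongly
	 * dropping its update on wakeup.
	 */
	struct util_est ue = { .enqueued = 280, .ewma = 300 };

	printf("task_util_est = %lu\n", task_util_est(120, ue));	/* 300 */
	return 0;
}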
From patchwork Fri Dec 8 01:52:42 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Qais Yousef
X-Patchwork-Id: 752419
From: Qais Yousef
To: Ingo Molnar, Peter Zijlstra, "Rafael J. Wysocki", Viresh Kumar,
    Vincent Guittot, Dietmar Eggemann
Cc: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org, Lukasz Luba,
    Wei Wang, Rick Yiu, Chung-Kai Mei, Hongyan Xia, Qais Yousef
Subject: [PATCH 4/4] sched/documentation: Remove reference to max aggregation
Date: Fri, 8 Dec 2023 01:52:42 +0000
Message-Id: <20231208015242.385103-5-qyousef@layalina.io>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20231208015242.385103-1-qyousef@layalina.io>
References: <20231208015242.385103-1-qyousef@layalina.io>
Precedence: bulk
X-Mailing-List: linux-pm@vger.kernel.org

Now that the max aggregation complexity has been removed, update the doc to
reflect the new working design. And since we fixed one of the limitations,
remove it as well.

Signed-off-by: Qais Yousef (Google)
---
 Documentation/scheduler/sched-util-clamp.rst | 239 ++++---------------
 1 file changed, 45 insertions(+), 194 deletions(-)
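
The per-task interface behind the uclamp requests this doc describes (its
section 3 covers the interfaces) is sched_setattr(2). A hedged userspace
sketch follows: struct sched_attr is declared locally in case
<linux/sched/types.h> is unavailable, the flag values are believed to match
the uapi ones at the time of writing, and the clamp values are invented for
the example.

#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_FLAG_UTIL_CLAMP_MIN
#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
#endif
#define SCHED_FLAG_KEEP_ALL		0x18	/* keep policy and params */

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime, sched_deadline, sched_period;
	uint32_t sched_util_min, sched_util_max;
};

int main(void)
{
	/* Ask for at least ~30% and at most ~50% of capacity (0..1024). */
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_flags	= SCHED_FLAG_KEEP_ALL |
				  SCHED_FLAG_UTIL_CLAMP_MIN |
				  SCHED_FLAG_UTIL_CLAMP_MAX,
		.sched_util_min	= 307,
		.sched_util_max	= 512,
	};

	if (syscall(SYS_sched_setattr, 0 /* self */, &attr, 0))
		perror("sched_setattr");
	return 0;
}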
diff --git a/Documentation/scheduler/sched-util-clamp.rst b/Documentation/scheduler/sched-util-clamp.rst
index 74d5b7c6431d..642e5f386f8e 100644
--- a/Documentation/scheduler/sched-util-clamp.rst
+++ b/Documentation/scheduler/sched-util-clamp.rst
@@ -129,172 +129,50 @@ and the scheduler needs to select a suitable CPU for it to run on.
 
 Since the goal of util clamp is to allow requesting a minimum and maximum
 performance point for a task to run on, it must be able to influence the
-frequency selection as well as task placement to be most effective. Both of
-which have implications on the utilization value at CPU runqueue (rq for short)
-level, which brings us to the main design challenge.
-
-When a task wakes up on an rq, the utilization signal of the rq will be
-affected by the uclamp settings of all the tasks enqueued on it. For example if
-a task requests to run at UTIL_MIN = 512, then the util signal of the rq needs
-to respect to this request as well as all other requests from all of the
-enqueued tasks.
-
-To be able to aggregate the util clamp value of all the tasks attached to the
-rq, uclamp must do some housekeeping at every enqueue/dequeue, which is the
-scheduler hot path. Hence care must be taken since any slow down will have
-significant impact on a lot of use cases and could hinder its usability in
-practice.
-
-The way this is handled is by dividing the utilization range into buckets
-(struct uclamp_bucket) which allows us to reduce the search space from every
-task on the rq to only a subset of tasks on the top-most bucket.
-
-When a task is enqueued, the counter in the matching bucket is incremented,
-and on dequeue it is decremented. This makes keeping track of the effective
-uclamp value at rq level a lot easier.
-
-As tasks are enqueued and dequeued, we keep track of the current effective
-uclamp value of the rq. See :ref:`section 2.1 ` for details on
-how this works.
-
-Later at any path that wants to identify the effective uclamp value of the rq,
-it will simply need to read this effective uclamp value of the rq at that exact
-moment of time it needs to take a decision.
+frequency selection as well as task placement to be most effective.
 
 For task placement case, only Energy Aware and Capacity Aware Scheduling
 (EAS/CAS) make use of uclamp for now, which implies that it is applied on
-heterogeneous systems only.
-When a task wakes up, the scheduler will look at the current effective uclamp
-value of every rq and compare it with the potential new value if the task were
-to be enqueued there. Favoring the rq that will end up with the most energy
-efficient combination.
+heterogeneous systems only. When a task wakes up, the scheduler will look at
+the effective uclamp values of the woken task to check if it will fit the
+capacity of the CPU, favouring the most energy efficient CPU that fits the
+performance request hints. This enables it to favour a bigger CPU if
+uclamp_min is high even if the utilization of the task is low, or to run the
+task on a smaller CPU if UCLAMP_MAX is low, even if the utilization of the
+task is high.
 
 Similarly in schedutil, when it needs to make a frequency update it will look
-at the current effective uclamp value of the rq which is influenced by the set
-of tasks currently enqueued there and select the appropriate frequency that
-will satisfy constraints from requests.
+at the effective uclamp values of the currently running task on the rq and
+select the appropriate frequency that will satisfy the performance request
+hints of the task, taking into account the current utilization of the rq.
 
 Other paths like setting overutilization state (which effectively disables
 EAS) make use of uclamp as well. Such cases are considered necessary
 housekeeping to allow the 2 main use cases above and will not be covered in
 detail here as they could change with implementation details.
 
-.. _uclamp-buckets:
+2.1. Frequency hints are applied at context switch
+--------------------------------------------------
 
-2.1. Buckets
-------------
+At context switch, we tell schedutil the new uclamp values (or min/max perf
+requirements) of the newly RUNNING task. It is up to the governor to try its
+best to honour these requests.
 
-::
-
-                           [struct rq]
-
-  (bottom)                                                    (top)
-
-    0                                                          1024
-    |                                                           |
-    +-----------+-----------+-----------+----   ----+-----------+
-    |  Bucket 0 |  Bucket 1 |  Bucket 2 |    ...    |  Bucket N |
-    +-----------+-----------+-----------+----   ----+-----------+
-       :           :                                   :
-       +- p0       +- p3                               +- p4
-       :           :
-       +- p1       +- p5
-       :
-       +- p2
-
-
-.. note::
-   The diagram above is an illustration rather than a true depiction of the
-   internal data structure.
-
-To reduce the search space when trying to decide the effective uclamp value of
-an rq as tasks are enqueued/dequeued, the whole utilization range is divided
-into N buckets where N is configured at compile time by setting
-CONFIG_UCLAMP_BUCKETS_COUNT. By default it is set to 5.
-
-The rq has a bucket for each uclamp_id tunables: [UCLAMP_MIN, UCLAMP_MAX].
-
-The range of each bucket is 1024/N. For example, for the default value of
-5 there will be 5 buckets, each of which will cover the following range:
-
-::
-
-        DELTA = round_closest(1024/5) = 204.8 = 205
-
-        Bucket 0: [0:204]
-        Bucket 1: [205:409]
-        Bucket 2: [410:614]
-        Bucket 3: [615:819]
-        Bucket 4: [820:1024]
-
-When a task p with following tunable parameters
-
-::
-
-        p->uclamp[UCLAMP_MIN] = 300
-        p->uclamp[UCLAMP_MAX] = 1024
-
-is enqueued into the rq, bucket 1 will be incremented for UCLAMP_MIN and bucket
-4 will be incremented for UCLAMP_MAX to reflect the fact the rq has a task in
-this range.
-
-The rq then keeps track of its current effective uclamp value for each
-uclamp_id.
-
-When a task p is enqueued, the rq value changes to:
-
-::
-
-        // update bucket logic goes here
-        rq->uclamp[UCLAMP_MIN] = max(rq->uclamp[UCLAMP_MIN], p->uclamp[UCLAMP_MIN])
-        // repeat for UCLAMP_MAX
+For uclamp to be effective, it is desirable to have hardware with reasonably
+fast DVFS (rate_limit_us is short).
 
-Similarly, when p is dequeued the rq value changes to:
-
-::
-
-        // update bucket logic goes here
-        rq->uclamp[UCLAMP_MIN] = search_top_bucket_for_highest_value()
-        // repeat for UCLAMP_MAX
-
-When all buckets are empty, the rq uclamp values are reset to system defaults.
-See :ref:`section 3.4 ` for details on default values.
+It is believed that most modern hardware (including lower-end ones) has
+rate_limit_us <= 2ms which should be good enough. Having 500us or below would
+be ideal so the hardware can reasonably catch up with each task's performance
+constraints.
 
+Schedutil might ignore the task's performance request if its historical
+runtime has been shorter than rate_limit_us.
 
-2.2. Max aggregation
---------------------
+See :ref:`Section 5.2 ` for more details on
+schedutil related limitations.
 
-Util clamp is tuned to honour the request for the task that requires the
-highest performance point.
-
-When multiple tasks are attached to the same rq, then util clamp must make sure
-the task that needs the highest performance point gets it even if there's
-another task that doesn't need it or is disallowed from reaching this point.
-
-For example, if there are multiple tasks attached to an rq with the following
-values:
-
-::
-
-        p0->uclamp[UCLAMP_MIN] = 300
-        p0->uclamp[UCLAMP_MAX] = 900
-
-        p1->uclamp[UCLAMP_MIN] = 500
-        p1->uclamp[UCLAMP_MAX] = 500
-
-then assuming both p0 and p1 are enqueued to the same rq, both UCLAMP_MIN
-and UCLAMP_MAX become:
-
-::
-
-        rq->uclamp[UCLAMP_MIN] = max(300, 500) = 500
-        rq->uclamp[UCLAMP_MAX] = max(900, 500) = 900
-
-As we shall see in :ref:`section 5.1 `, this max
-aggregation is the cause of one of limitations when using util clamp, in
-particular for UCLAMP_MAX hint when user space would like to save power.
-
-2.3. Hierarchical aggregation
+2.2. Hierarchical aggregation
 -----------------------------
 
 As stated earlier, util clamp is a property of every task in the system. But
@@ -324,7 +202,7 @@ In other words, this aggregation will not cause an error when a task changes
 its uclamp values, but rather the system may not be able to satisfy requests
 based on those factors.
 
-2.4. Range
+2.3. Range
 ----------
 
 Uclamp performance request has the range of 0 to 1024 inclusive.
@@ -332,6 +210,14 @@ Uclamp performance request has the range of 0 to 1024 inclusive.
 For cgroup interface percentage is used (that is 0 to 100 inclusive).
 Just like other cgroup interfaces, you can use 'max' instead of 100.
 
+2.4. Older design
+-----------------
+
+The older design behaved differently due to what was called the
+max-aggregation rule, which added high complexity and had some limitations.
+Please consult the docs corresponding to your kernel version, as parts of
+this doc might not reflect how it works on your kernel.
+
 .. _uclamp-interfaces:
 
 3. Interfaces
@@ -594,39 +480,7 @@ possible.
 5. Limitations
 ==============
 
-.. _uclamp-capping-fail:
-
-5.1. Capping frequency with uclamp_max fails under certain conditions
----------------------------------------------------------------------
-
-If task p0 is capped to run at 512:
-
-::
-
-        p0->uclamp[UCLAMP_MAX] = 512
-
-and it shares the rq with p1 which is free to run at any performance point:
-
-::
-
-        p1->uclamp[UCLAMP_MAX] = 1024
-
-then due to max aggregation the rq will be allowed to reach max performance
-point:
-
-::
-
-        rq->uclamp[UCLAMP_MAX] = max(512, 1024) = 1024
-
-Assuming both p0 and p1 have UCLAMP_MIN = 0, then the frequency selection for
-the rq will depend on the actual utilization value of the tasks.
-
-If p1 is a small task but p0 is a CPU intensive task, then due to the fact that
-both are running at the same rq, p1 will cause the frequency capping to be left
-from the rq although p1, which is allowed to run at any performance point,
-doesn't actually need to run at that frequency.
-
-5.2. UCLAMP_MAX can break PELT (util_avg) signal
+5.1. UCLAMP_MAX can break PELT (util_avg) signal
 ------------------------------------------------
 
 PELT assumes that frequency will always increase as the signals grow to ensure
@@ -650,10 +504,6 @@ CPU is capable of.
 The max CPU frequency (Fmax) matters here as well, since it designates the
 shortest computational time to finish the task's work on this CPU.
 
-::
-
-        rq->uclamp[UCLAMP_MAX] = 0
-
 If the ratio of Fmax/Fmin is 3, then maximum value will be:
 
 ::
@@ -689,9 +539,8 @@ If task p1 wakes up on this CPU, which have:
         p1->util_avg = 200
         p1->uclamp[UCLAMP_MAX] = 1024
 
-then the effective UCLAMP_MAX for the CPU will be 1024 according to max
-aggregation rule. But since the capped p0 task was running and throttled
-severely, then the rq->util_avg will be:
+Since the capped p0 task was running and throttled severely, the
+rq->util_avg will be:
 
 ::
 
@@ -699,9 +548,9 @@ severely, then the rq->util_avg will be:
         p1->util_avg = 200
 
         rq->util_avg = 1024
-        rq->uclamp[UCLAMP_MAX] = 1024
 
-Hence lead to a frequency spike since if p0 wasn't throttled we should get:
+Hence we get a frequency spike when p1 is running. If p0 wasn't throttled we
+should get:
 
 ::
 
@@ -710,9 +559,11 @@ Hence lead to a frequency spike since if p0 wasn't throttled we should get:
 
         rq->util_avg = 500
 
-and run somewhere near mid performance point of that CPU, not the Fmax we get.
+and run somewhere near the mid performance point of that CPU, not the Fmax
+p1 gets.
+
+.. _schedutil_response_time_issues:
 
-5.3. Schedutil response time issues
+5.2. Schedutil response time issues
 -----------------------------------
 
 schedutil has three limitations: