From patchwork Thu Nov 26 17:18:24 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mel Gorman X-Patchwork-Id: 333941 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.2 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED, USER_AGENT_SANE_1 autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E4AFAC56202 for ; Thu, 26 Nov 2020 17:25:13 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A02BF21D7E for ; Thu, 26 Nov 2020 17:25:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2404314AbgKZRZC (ORCPT ); Thu, 26 Nov 2020 12:25:02 -0500 Received: from outbound-smtp14.blacknight.com ([46.22.139.231]:58093 "EHLO outbound-smtp14.blacknight.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2404312AbgKZRZC (ORCPT ); Thu, 26 Nov 2020 12:25:02 -0500 X-Greylist: delayed 393 seconds by postgrey-1.27 at vger.kernel.org; Thu, 26 Nov 2020 12:25:00 EST Received: from mail.blacknight.com (pemlinmail01.blacknight.ie [81.17.254.10]) by outbound-smtp14.blacknight.com (Postfix) with ESMTPS id 919C81C4444 for ; Thu, 26 Nov 2020 17:18:26 +0000 (GMT) Received: (qmail 18254 invoked from network); 26 Nov 2020 17:18:26 -0000 Received: from unknown (HELO techsingularity.net) (mgorman@techsingularity.net@[84.203.22.4]) by 81.17.254.9 with ESMTPSA (AES256-SHA encrypted, authenticated); 26 Nov 2020 17:18:26 -0000 Date: Thu, 26 Nov 2020 17:18:24 +0000 From: Mel Gorman To: "Rafael J. Wysocki" Cc: Daniel Lezcano , Marcelo Tosatti , Linux Kernel Mailing List , Linux PM Subject: [PATCH] cpuidle: Allow configuration of the polling interval before cpuidle enters a c-state Message-ID: <20201126171824.GK3371@techsingularity.net> MIME-Version: 1.0 Content-Disposition: inline User-Agent: Mutt/1.10.1 (2018-07-13) Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org It was noted that a few workloads that idle rapidly regressed when commit 36fcb4292473 ("cpuidle: use first valid target residency as poll time") was merged. The workloads in question were heavy communicators that idle rapidly and were impacted by the c-state exit latency as the active CPUs were not polling at the time of wakeup. As they were not particularly realistic workloads, it was not considered to be a major problem. Unfortunately, a bug was then reported for a real workload in a production environment that relied on large numbers of threads operating in a worker pool pattern. These threads would idle for periods of time slightly longer than the C1 exit latency and so incurred the c-state exit latency. The application is sensitive to wakeup latency and appears to indirectly rely on behaviour prior to commit on a37b969a61c1 ("cpuidle: poll_state: Add time limit to poll_idle()") to poll for long enough to avoid the exit latency cost. The current behaviour favours power consumption over wakeup latency and it is reasonable behaviour but it should be tunable. In theory applications could use /dev/cpu_dma_latency but not all applications are aware of cpu_dma_latency. Similarly, a tool could be installed that opens cpu_dma_latency for the whole system but such a tool is not always available, is not always known to the sysadmin or the tool can have unexpected side-effects if it tunes more than cpu_dma_latency. In practice, it is more common for sysadmins to try idle=poll (which is x86 specific) or try disabling c-states and hope for the best. This patch makes it straight-forward to configure how long a CPU should poll before entering a c-state. By default, there is no behaviour change. At build time a decision can be made to favour performance over power by default even if that potentially impacts turbo boosting for workloads that are sensitive to wakeup latency. In the event the kernel default is not suitable, the kernel command line can be used as a substitute for implementing cpu_dma_latency support in an application or requiring an additional tool to be installed. Note that it is not expected that tuning for longer polling times will be a universal win. For example, extra polling might prevent a turbo state being used or prevent hyperthread resources being released to an SMT sibling. By default, nothing has changed but here is an example of tbench4 comparing the default "poll based on the min cstate" vs "poll based on the max cstate" tbench4 min-cstate max-cstate Hmean 1 512.88 ( 0.00%) 566.74 * 10.50%* Hmean 2 999.47 ( 0.00%) 1090.01 * 9.06%* Hmean 4 1960.83 ( 0.00%) 2106.62 * 7.44%* Hmean 8 3828.61 ( 0.00%) 4097.93 * 7.03%* Hmean 16 6899.44 ( 0.00%) 7120.38 * 3.20%* Hmean 32 10718.38 ( 0.00%) 10672.44 * -0.43%* Hmean 64 12672.21 ( 0.00%) 12608.15 * -0.51%* Hmean 128 20744.83 ( 0.00%) 21147.02 * 1.94%* Hmean 256 20646.60 ( 0.00%) 20608.48 * -0.18%* Hmean 320 20892.89 ( 0.00%) 20831.99 * -0.29%* Signed-off-by: Mel Gorman --- Documentation/admin-guide/kernel-parameters.txt | 25 +++++++++ drivers/cpuidle/Kconfig | 57 +++++++++++++++++++++ drivers/cpuidle/cpuidle.c | 67 +++++++++++++++++++++++-- include/linux/moduleparam.h | 5 ++ kernel/params.c | 1 + 5 files changed, 151 insertions(+), 4 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 526d65d8573a..382126781b0e 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -719,6 +719,31 @@ policy to use. This governor must be registered in the kernel before the cpufreq driver probes. + cpuidle.poll= + [CPU_IDLE] + Format: + Set the time in microseconds a CPU should poll + in cpuidle for a new task before entering + a sleep state. The default is usually fine + but may need to be tuned for workloads that + are both latency-sensitive and idles rapidly + for short durations. Limiting c-states can + be insufficient if the polling time is still + too short, the application has no knowledge + of /dev/cpu_dma_latency, there are multiple + applications or the environment does not allow + the installation of a userspace tool that controls + cpu_dma_latency. + + -1: default, poll based on the shallowest enabled c-state + -2: poll based on the deepest enabled c-state. If no + c-states are enabled then poll based on the tick + -3: poll up until a tick would force a resched event + >0: poll for a fixed duration in nanoseconds. Poll + range is limited. The lowest bound is the + shallowest enabled c-state. The maximum + bound is based on the tick + cpu_init_udelay=N [X86] Delay for N microsec between assert and de-assert of APIC INIT to start processors. This delay occurs diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig index c0aeedd66f02..7c9d2073db85 100644 --- a/drivers/cpuidle/Kconfig +++ b/drivers/cpuidle/Kconfig @@ -47,6 +47,63 @@ config CPU_IDLE_GOV_HALTPOLL config DT_IDLE_STATES bool +choice + prompt "CPU Idle default polling interval before entering an idle state" + default CPUIDLE_DEFAULT_POLL_MINCSTATE + help + CPU Idle polls briefly waiting for a new task to be runnable. This + setting determines what the default polling interval is. Lower + polling intervals will optimise for power consumption while higher + intervals will favour wakeup latency in cases where a task is + queued before the CPU enters a c-state. If c-states are unavailable + at runtime, polling will poll for up to a tick. + + config CPUIDLE_DEFAULT_POLL_MINCSTATE + bool "interval mincstate" + help + Polling interval will be the same as the shallowest enabled c-state + e.g. C1 exit latency + + config CPUIDLE_DEFAULT_POLL_MAXCSTATE + bool "interval maxcstate" + help + Polling interval will be the same as the deepest enabled c-state + e.g. C3 exit latency + + config CPUIDLE_DEFAULT_POLL_TICK + bool "interval tick" + help + Polling interval will be up to a tick interval or until the idle + task needs to reschedule. e.g. a kernel with HZ of 250 could + potentially poll up to 4ms. + + config CPUIDLE_DEFAULT_POLL_FIXED + bool "interval fixed" + help + Polling can be specified as a fixed interval regardless of + c-states. This option is only useful in the case where there + is a known class of applications that have a fixed latency + that is not necessarily related to the underlying c-state + capabilities and there is a need for the polling interval + to be be hardware invariant. +endchoice + +config CPUIDLE_DEFAULT_POOL_FIXED_INTERVAL + depends on CPUIDLE_DEFAULT_POLL_FIXED + int "Default polling inverval for CPU Idle" + default 2 + help + This is the fixed interval used for cpuidle polling if + CPUIDLE_DEFAULT_POLL_FIXED is selected. + +config CPUIDLE_DEFAULT_POLL + int + default -1 if CPUIDLE_DEFAULT_POLL_MINCSTATE + default -2 if CPUIDLE_DEFAULT_POLL_MAXCSTATE + default -3 if CPUIDLE_DEFAULT_POLL_TICK + default CPUIDLE_DEFAULT_POOL_FIXED_INTERVAL if CPUIDLE_DEFAULT_POLL_FIXED + + menu "ARM CPU Idle Drivers" depends on ARM || ARM64 source "drivers/cpuidle/Kconfig.arm" diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index 83af15f77f66..32a745bb7eb8 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -368,6 +368,25 @@ void cpuidle_reflect(struct cpuidle_device *dev, int index) cpuidle_curr_governor->reflect(dev, index); } +static struct kernel_param_ops poll_param_ops = { + .set = param_set_llong, + .get = param_get_llong, +}; + +/* Poll based on the shallowest enabled cstate */ +#define CPUIDLE_POLL_MINCSTATE -1 + +/* Poll based on the deepest enabled cstate */ +#define CPUIDLE_POLL_MAXCSTATE -2 + +/* Poll up to when a tick forces a resched event */ +#define CPUIDLE_POLL_TICK -3 + +/* Sanity check to guard against future options */ +#define CPUIDLE_POLL_BOUND CPUIDLE_POLL_TICK + +static long long param_poll = CONFIG_CPUIDLE_DEFAULT_POLL; + /** * cpuidle_poll_time - return amount of time to poll for, * governors can override dev->poll_limit_ns if necessary @@ -380,21 +399,60 @@ u64 cpuidle_poll_time(struct cpuidle_driver *drv, struct cpuidle_device *dev) { int i; - u64 limit_ns; + u64 min_limit_ns, max_limit_ns; if (dev->poll_limit_ns) return dev->poll_limit_ns; - limit_ns = TICK_NSEC; + /* Sanitise param_poll. */ + param_poll = clamp_t(int, param_poll, CPUIDLE_POLL_BOUND, TICK_NSEC); + if (param_poll > 0) + param_poll *= NSEC_PER_USEC; + + /* Find the min and max exit latencies for c-states */ + max_limit_ns = 0; + min_limit_ns = TICK_NSEC; for (i = 1; i < drv->state_count; i++) { + u64 state_limit; + if (dev->states_usage[i].disable) continue; - limit_ns = drv->states[i].target_residency_ns; + state_limit = drv->states[i].target_residency_ns; + min_limit_ns = min_t(u64, min_limit_ns, state_limit); + max_limit_ns = max_t(u64, max_limit_ns, state_limit); + } + + /* Select a polling interval */ + switch (param_poll) { + case CPUIDLE_POLL_MINCSTATE: + dev->poll_limit_ns = min_limit_ns; + break; + case CPUIDLE_POLL_MAXCSTATE: + /* If there are no enabled c-states, use the tick interval */ + if (!max_limit_ns) + max_limit_ns = TICK_NSEC; + + dev->poll_limit_ns = max_limit_ns; + break; + case CPUIDLE_POLL_TICK: + dev->poll_limit_ns = TICK_NSEC; + break; + default: + /* If there are no enabled c-states, make sure param_poll is used */ + if (!max_limit_ns) + min_limit_ns = 1; + + dev->poll_limit_ns = clamp_t(u64, param_poll, min_limit_ns, TICK_NSEC); break; } - dev->poll_limit_ns = limit_ns; + + /* + * Polling parameter reported as usec to match the values reported + * for c-cstate exit latencies in sysfs. + */ + param_poll = dev->poll_limit_ns / NSEC_PER_USEC;; return dev->poll_limit_ns; } @@ -755,4 +813,5 @@ static int __init cpuidle_init(void) module_param(off, int, 0444); module_param_string(governor, param_governor, CPUIDLE_NAME_LEN, 0444); +module_param_cb(poll, &poll_param_ops, ¶m_poll, 0444); core_initcall(cpuidle_init); diff --git a/include/linux/moduleparam.h b/include/linux/moduleparam.h index 6388eb9734a5..418cf714580d 100644 --- a/include/linux/moduleparam.h +++ b/include/linux/moduleparam.h @@ -438,6 +438,11 @@ extern int param_set_long(const char *val, const struct kernel_param *kp); extern int param_get_long(char *buffer, const struct kernel_param *kp); #define param_check_long(name, p) __param_check(name, p, long) +extern const struct kernel_param_ops param_ops_llong; +extern int param_set_llong(const char *val, const struct kernel_param *kp); +extern int param_get_llong(char *buffer, const struct kernel_param *kp); +#define param_check_llong(name, p) __param_check(name, p, long long) + extern const struct kernel_param_ops param_ops_ulong; extern int param_set_ulong(const char *val, const struct kernel_param *kp); extern int param_get_ulong(char *buffer, const struct kernel_param *kp); diff --git a/kernel/params.c b/kernel/params.c index 164d79330849..a2b0e5b1b26b 100644 --- a/kernel/params.c +++ b/kernel/params.c @@ -239,6 +239,7 @@ STANDARD_PARAM_DEF(ushort, unsigned short, "%hu", kstrtou16); STANDARD_PARAM_DEF(int, int, "%i", kstrtoint); STANDARD_PARAM_DEF(uint, unsigned int, "%u", kstrtouint); STANDARD_PARAM_DEF(long, long, "%li", kstrtol); +STANDARD_PARAM_DEF(llong, long long, "%lli", kstrtoll); STANDARD_PARAM_DEF(ulong, unsigned long, "%lu", kstrtoul); STANDARD_PARAM_DEF(ullong, unsigned long long, "%llu", kstrtoull); STANDARD_PARAM_DEF(hexint, unsigned int, "%#08x", kstrtouint);