
[RFC] sched: Consolidate cpufreq updates

Message ID 20240324020139.1032473-1-qyousef@layalina.io
State New
Series: [RFC] sched: Consolidate cpufreq updates

Commit Message

Qais Yousef March 24, 2024, 2:01 a.m. UTC
Improve the interaction with cpufreq governors by making the
cpufreq_update_util() calls more intentional.

At the moment we send them when load is updated for CFS, when bandwidth
changes for DL, and at enqueue/dequeue for RT. But this can lead to too
many updates being sent in a short period of time, some of which may be
ignored at a critical moment due to the rate_limit_us in schedutil.

For example, consider two tasks enqueued on the same CPU almost
simultaneously, where the 2nd task is bigger and requires a higher freq.
The cpufreq_update_util() call triggered by the first task will lead to
the 2nd request being dropped until the tick, or until another CPU in
the same policy triggers a freq update shortly after.

Updates at enqueue for RT are not strictly required, though they do help
to reduce the delay in switching the frequency and the potential
observation of a lower frequency during this delay. But the current
logic doesn't intentionally (at least to my understanding) try to speed
up the request.

To help reduce the amount of cpufreq updates and make them more
purposeful, consolidate them into these locations:

1. context_switch()
2. task_tick_fair()
3. {attach, detach}_entity_load_avg()
4. update_blocked_averages()

The update at context switch should help guarantee that DL and RT get
the right frequency straightaway when they're RUNNING. As mentioned,
the update will now happen slightly after enqueue_task(), but in an
ideal world these tasks should be RUNNING ASAP and this additional
delay should be negligible. For fair tasks we need to make sure we send
a single update for every decay of the root cfs_rq. Any changes to the
rq will be deferred until the next task is ready to run, or until we hit
TICK. But we are guaranteed the task is running at a level that meets
its requirements after enqueue.

To guarantee that RT and DL task updates are never missed, we add a new
SCHED_CPUFREQ_FORCE_UPDATE flag to ignore the rate_limit_us. If we are
already running at the right freq, the governor will end up doing
nothing, but we eliminate the risk of the task accidentally ending up
running at the wrong freq due to rate_limit_us.

The same applies to iowait boost. We also handle the case of a boost
being reset prematurely by adding a guard against TICK_NSEC in
sugov_iowait_apply(), in similar fashion to sugov_iowait_reset().

The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit
time stamps, otherwise we can end up delaying updates for normal
requests.

We also teach sugov to ignore cpufreq updates from its sugov workers. It
doesn't make sense for the kworker that applies the frequency update
(which is a DL task) to trigger a frequency update itself.

There's room for an optimization that I haven't pursued yet (but plan to
follow up on in the future), which is not to do an update for RT/DL if
the frequency level is already the same. sugov currently already handles
this, but since we now force ignoring the rate limit, it would be ideal
not to force too many frequency updates in a row if there's not much to
do. We need to compare uclamp_min values for RT and bandwidth values for
DL between the next and prev tasks.
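
A minimal sketch of what such a check could look like (hypothetical and
not part of this patch; the helper name is made up, and the RT
comparison assumes CONFIG_UCLAMP_TASK):

	/*
	 * Hypothetical helper: skip the forced update at context switch if
	 * prev and next would request the same level anyway.
	 */
	static inline bool rt_dl_freq_unchanged(struct task_struct *prev,
						struct task_struct *next)
	{
		/* RT frequency selection is driven by uclamp_min */
		if (task_has_rt_policy(prev) && task_has_rt_policy(next))
			return uclamp_eff_value(prev, UCLAMP_MIN) ==
			       uclamp_eff_value(next, UCLAMP_MIN);

		/* DL requests are driven by the task's bandwidth */
		if (task_has_dl_policy(prev) && task_has_dl_policy(next))
			return prev->dl.dl_bw == next->dl.dl_bw;

		return false;
	}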

The update at task_tick_fair will guarantee that the governor will
follow any updates to load for tasks/CPU or due to new enqueues/dequeues
to the rq. Since DL and RT always run at constant frequencies and have
no load tracking, this is only required for fair tasks.

The update at attach/detach_entity_load_avg() will ensure we adapt to
big changes when tasks are added/removed from cgroups.

The update at update_blocked_averages() will ensure we decay frequency
as the CPU becomes idle for long enough.

I am contemplating making all updates, except the CFS one at context
switch, forced updates; I'd welcome thoughts on this. Context switch
should be our major driver for frequency changes, and the other
locations should be treated as out-of-line updates that must be
honoured and not accidentally dropped by the rate limit. I'm also still
considering the impact on shared policies if we go down that route.
Thoughts would be appreciated.
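
As a rough illustration of that idea (an assumption, not what this patch
currently does), the out-of-line update points would simply pass the new
flag, e.g. at the tick:

	/* Hypothetical variant: let the tick-time update ignore rate_limit_us too */
	cpufreq_update_util(rq, SCHED_CPUFREQ_FORCE_UPDATE);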

To make sure governors that don't register a cpufreq_update_util()
handler aren't impacted, we protect the call with a static key so that
it is only active when the current governor makes use of it.

Results of `perf stat --repeat 10 perf bench sched pipe` on an AMD 3900X
to check for any potential overhead because of the addition at context
switch:

Before:
-------

	Performance counter stats for 'perf bench sched pipe' (10 runs):

		 16,839.74 msec task-clock:u              #    1.158 CPUs utilized            ( +-  0.52% )
			 0      context-switches:u        #    0.000 /sec
			 0      cpu-migrations:u          #    0.000 /sec
		     1,390      page-faults:u             #   83.903 /sec                     ( +-  0.06% )
	       333,773,107      cycles:u                  #    0.020 GHz                      ( +-  0.70% )  (83.72%)
		67,050,466      stalled-cycles-frontend:u #   19.94% frontend cycles idle     ( +-  2.99% )  (83.23%)
		37,763,775      stalled-cycles-backend:u  #   11.23% backend cycles idle      ( +-  2.18% )  (83.09%)
		84,456,137      instructions:u            #    0.25  insn per cycle
							  #    0.83  stalled cycles per insn  ( +-  0.02% )  (83.01%)
		34,097,544      branches:u                #    2.058 M/sec                    ( +-  0.02% )  (83.52%)
		 8,038,902      branch-misses:u           #   23.59% of all branches          ( +-  0.03% )  (83.44%)

		   14.5464 +- 0.0758 seconds time elapsed  ( +-  0.52% )

After:
-------

	Performance counter stats for 'perf bench sched pipe' (10 runs):

		 16,219.58 msec task-clock:u              #    1.130 CPUs utilized            ( +-  0.80% )
			 0      context-switches:u        #    0.000 /sec
			 0      cpu-migrations:u          #    0.000 /sec
		     1,391      page-faults:u             #   85.163 /sec                     ( +-  0.06% )
	       342,768,312      cycles:u                  #    0.021 GHz                      ( +-  0.63% )  (83.36%)
		66,231,208      stalled-cycles-frontend:u #   18.91% frontend cycles idle     ( +-  2.34% )  (83.95%)
		39,055,410      stalled-cycles-backend:u  #   11.15% backend cycles idle      ( +-  1.80% )  (82.73%)
		84,475,662      instructions:u            #    0.24  insn per cycle
							  #    0.82  stalled cycles per insn  ( +-  0.02% )  (83.05%)
		34,067,160      branches:u                #    2.086 M/sec                    ( +-  0.02% )  (83.67%)
		 8,042,888      branch-misses:u           #   23.60% of all branches          ( +-  0.07% )  (83.25%)

		    14.358 +- 0.116 seconds time elapsed  ( +-  0.81% )

It is worth noting that we still have the following race condition on
systems that have a shared policy:

* CPUs with a shared policy can end up sending simultaneous cpufreq
  update requests, where the 2nd one will be unlucky and get blocked by
  the rate_limit_us (schedutil).

We can potentially address this limitation later, but it is out of the
scope of this patch.

Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
 include/linux/sched/cpufreq.h    |  3 +-
 kernel/sched/core.c              | 51 +++++++++++++++++++++++
 kernel/sched/cpufreq.c           |  5 +++
 kernel/sched/cpufreq_schedutil.c | 71 +++++++++++++++++++++++++-------
 kernel/sched/deadline.c          |  4 --
 kernel/sched/fair.c              | 53 ++++--------------------
 kernel/sched/rt.c                |  8 +---
 kernel/sched/sched.h             | 10 +++++
 8 files changed, 133 insertions(+), 72 deletions(-)

Comments

Christian Loehle March 26, 2024, 2:44 p.m. UTC | #1
Hi Qais,
Some first thoughts; I'll do some more thinking and testing though.
I wonder if the increased latency of applying the new freq later
(context switch vs enqueue) has any measurable impact for some
workloads on platforms with somewhat higher switching delays.

On 24/03/2024 02:01, Qais Yousef wrote:
> Improve the interaction with cpufreq governors by making the
> cpufreq_update_util() calls more intentional.
> 
> At the moment we send them when load is updated for CFS, bandwidth for
> DL and at enqueue/dequeue for RT. But this can lead to too many updates
> sent in a short period of time and potentially be ignored at a critical
> moment due to the rate_limit_us in schedutil.
> 
> For example, simultaneous task enqueue on the CPU where 2nd task is
> bigger and requires higher freq. The trigger to cpufreq_update_util() by
> the first task will lead to dropping the 2nd request until tick. Or
> another CPU in the same policy triggers a freq update shortly after.
Out of curiosity: Is that significant anywhere?
It is unfortunate for sure, but the delay until the 'big' freq update
is also bounded.

> 
> Updates at enqueue for RT are not strictly required. Though they do help
> to reduce the delay for switching the frequency and the potential
> observation of lower frequency during this delay. But current logic
> doesn't intentionally (at least to my understanding) try to speed up the
> request.
> 
> To help reduce the amount of cpufreq updates and make them more
> purposeful, consolidate them into these locations:
> 
> 1. context_switch()
> 2. task_tick_fair()
> 3. {attach, detach}_entity_load_avg()
> 4. update_blocked_averages()
> 
> The update at context switch should help guarantee that DL and RT get
> the right frequency straightaway when they're RUNNING. As mentioned
> though the update will happen slightly after enqueue_task(); though in
> an ideal world these tasks should be RUNNING ASAP and this additional
> delay should be negligible. For fair tasks we need to make sure we send
> a single update for every decay for the root cfs_rq. Any changes to the
> rq will be deferred until the next task is ready to run, or we hit TICK.
> But we are guaranteed the task is running at a level that meets its
> requirements after enqueue.
> 
> To guarantee RT and DL tasks updates are never missed, we add a new
> SCHED_CPUFREQ_FORCE_UPDATE to ignore the rate_limit_us. If we are
> already running at the right freq, the governor will end up doing
> nothing, but we eliminate the risk of the task ending up accidentally
> running at the wrong freq due to rate_limit_us.
> 
> Similarly for iowait boost. We also handle a case of a boost reset
> prematurely by adding a guard against TICK_NSEC in sugov_iowait_apply()
> in similar fashion to sugov_iowait_reset().
> 
> The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit
> time stamps otherwise we can end up delaying updates for normal
> requests.
With the new updates and the assumed default of 2ms, does rate_limit_us
even make sense then? It is very often ignored with these changes, so
unless rate_limit_us == 2000 && CONFIG_HZ == 1000,
SCHED_CPUFREQ_FORCE_UPDATE would dominate for many workloads, wouldn't
it? Is that fine for all platforms? Platforms I'm aware of will drop
older requests that couldn't be served yet when a new one is sent; can
we assume that generally?

> 
> We also teach sugov to ignore cpufreq updates from its sugov workers. It
> doesn't make sense for the kworker that applies the frequency update
> (which is a DL task) to trigger a frequency update itself.
> 
> There's room for an optimization that I haven't pursued yet (but plan to
> follow up with in the future) which is not to do an update for RT/DL if
> the frequency level is already the same. sugov currently already handles
> this, but since we force ignoring rate limit now, it would be ideal not
> to force too many frequency updates in a row if there's not much to do.
> We need to compare uclamp_min values for RT and bandwidth values for DL
> between next and prev tasks.
> 
> The update at task_tick_fair will guarantee that the governor will
> follow any updates to load for tasks/CPU or due to new enqueues/dequeues
> to the rq. Since DL and RT always run at constant frequencies and have
> no load tracking, this is only required for fair tasks.
> 
> The update at attach/detach_entity_load_avg() will ensure we adapt to
> big changes when tasks are added/removed from cgroups.
> 
> The update at update_blocked_averages() will ensure we decay frequency
> as the CPU becomes idle for long enough.
> 
> I am contemplating to make all updates except for CFS at context switch
> a forced updates. I'd welcome thoughts. Context switch should be our
> major drive for frequency change, and the other operations should be
> treated as out-of-line updates that must be honoured and not be
> accidentally dropped by rate limit. Contemplating the impact on shared
> policies if we go down that route too. Thoughts would be appreciated.

See below.

> 
> To make sure governors that don't register a cpufreq_update_util()
> handler aren't impacted, we protect the call with a static key to ensure
> that it is only active when the current governor makes use of it.
> 
> Results of `perf stat --repeat 10 perf bench sched pipe` on AMD 3900X to
> verify any potential overhead because of the addition at context switc
> 
> Before:
> -------
> 
> 	Performance counter stats for 'perf bench sched pipe' (10 runs):
> 
> 		 16,839.74 msec task-clock:u              #    1.158 CPUs utilized            ( +-  0.52% )
> 			 0      context-switches:u        #    0.000 /sec
> 			 0      cpu-migrations:u          #    0.000 /sec
> 		     1,390      page-faults:u             #   83.903 /sec                     ( +-  0.06% )
> 	       333,773,107      cycles:u                  #    0.020 GHz                      ( +-  0.70% )  (83.72%)
> 		67,050,466      stalled-cycles-frontend:u #   19.94% frontend cycles idle     ( +-  2.99% )  (83.23%)
> 		37,763,775      stalled-cycles-backend:u  #   11.23% backend cycles idle      ( +-  2.18% )  (83.09%)
> 		84,456,137      instructions:u            #    0.25  insn per cycle
> 							  #    0.83  stalled cycles per insn  ( +-  0.02% )  (83.01%)
> 		34,097,544      branches:u                #    2.058 M/sec                    ( +-  0.02% )  (83.52%)
> 		 8,038,902      branch-misses:u           #   23.59% of all branches          ( +-  0.03% )  (83.44%)
> 
> 		   14.5464 +- 0.0758 seconds time elapsed  ( +-  0.52% )
> 
> After:
> -------
> 
> 	Performance counter stats for 'perf bench sched pipe' (10 runs):
> 
> 		 16,219.58 msec task-clock:u              #    1.130 CPUs utilized            ( +-  0.80% )
> 			 0      context-switches:u        #    0.000 /sec
> 			 0      cpu-migrations:u          #    0.000 /sec
> 		     1,391      page-faults:u             #   85.163 /sec                     ( +-  0.06% )
> 	       342,768,312      cycles:u                  #    0.021 GHz                      ( +-  0.63% )  (83.36%)
> 		66,231,208      stalled-cycles-frontend:u #   18.91% frontend cycles idle     ( +-  2.34% )  (83.95%)
> 		39,055,410      stalled-cycles-backend:u  #   11.15% backend cycles idle      ( +-  1.80% )  (82.73%)
> 		84,475,662      instructions:u            #    0.24  insn per cycle
> 							  #    0.82  stalled cycles per insn  ( +-  0.02% )  (83.05%)
> 		34,067,160      branches:u                #    2.086 M/sec                    ( +-  0.02% )  (83.67%)
> 		 8,042,888      branch-misses:u           #   23.60% of all branches          ( +-  0.07% )  (83.25%)
> 
> 		    14.358 +- 0.116 seconds time elapsed  ( +-  0.81% )
> 
> Note worthy that we still have the following race condition on systems
> that have shared policy:
> 
> * CPUs with shared policy can end up sending simultaneous cpufreq
>   updates requests where the 2nd one will be unlucky and get blocked by
>   the rate_limit_us (schedutil).
> 
> We can potentially address this limitation later, but it is out of the
> scope of this patch.

Which has now gotten worse, I'm afraid.
With a shared policy the "update at context switch" at least theoretically
no longer works for any freq update delay > 0.
It could be a non-issue, but confirming that isn't straightforward IMO.

> 
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
>  include/linux/sched/cpufreq.h    |  3 +-
>  kernel/sched/core.c              | 51 +++++++++++++++++++++++
>  kernel/sched/cpufreq.c           |  5 +++
>  kernel/sched/cpufreq_schedutil.c | 71 +++++++++++++++++++++++++-------
>  kernel/sched/deadline.c          |  4 --
>  kernel/sched/fair.c              | 53 ++++--------------------
>  kernel/sched/rt.c                |  8 +---
>  kernel/sched/sched.h             | 10 +++++
>  8 files changed, 133 insertions(+), 72 deletions(-)
> 
> diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> index bdd31ab93bc5..2d0a45aba16f 100644
> --- a/include/linux/sched/cpufreq.h
> +++ b/include/linux/sched/cpufreq.h
> @@ -8,7 +8,8 @@
>   * Interface between cpufreq drivers and the scheduler:
>   */
>  
> -#define SCHED_CPUFREQ_IOWAIT	(1U << 0)
> +#define SCHED_CPUFREQ_IOWAIT		(1U << 0)
> +#define SCHED_CPUFREQ_FORCE_UPDATE	(1U << 1) /* ignore transition_delay_us */
>  
>  #ifdef CONFIG_CPU_FREQ
>  struct cpufreq_policy;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 929fce69f555..563cb61dbf79 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5134,6 +5134,52 @@ static inline void balance_callbacks(struct rq *rq, struct balance_callback *hea
>  
>  #endif
>  
> +static inline void update_cpufreq_ctx_switch(struct rq *rq)
> +{
> +#ifdef CONFIG_CPU_FREQ
> +	unsigned int flags = 0;
> +
> +	if (!static_branch_likely(&cpufreq_update_enabled))
> +		return;
> +
> +#ifdef CONFIG_SMP
> +	if (unlikely(current->sched_class == &stop_sched_class))
> +		return;
> +#endif
> +
> +	if (unlikely(current->sched_class == &idle_sched_class))
> +		return;
> +
> +	if (unlikely(task_has_idle_policy(current)))
> +		return;
> +
> +	if (likely(fair_policy(current->policy))) {
> +
> +		if (unlikely(current->in_iowait)) {
> +			flags |= SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE;
> +			goto force_update;
> +		}
> +
> +#ifdef CONFIG_SMP
> +		/*
> +		 * Allow cpufreq updates once for every update_load_avg() decay.
> +		 */
> +		if (unlikely(rq->cfs.decayed)) {
> +			rq->cfs.decayed = false;
> +			goto force_update;
> +		}
> +#endif
> +		return;
> +	}
> +
> +	/* RT and DL should always send a freq update */
> +	flags |= SCHED_CPUFREQ_FORCE_UPDATE;
> +
> +force_update:
> +	cpufreq_update_util(rq, flags);
> +#endif
> +}
> +
>  static inline void
>  prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
>  {
> @@ -5160,6 +5206,11 @@ static inline void finish_lock_switch(struct rq *rq)
>  	 */
>  	spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
>  	__balance_callbacks(rq);
> +	/*
> +	 * Request freq update after __balance_callbacks to take into account
> +	 * any changes to rq.
> +	 */
> +	update_cpufreq_ctx_switch(rq);
>  	raw_spin_rq_unlock_irq(rq);
>  }
>  
> diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
> index 5252fb191fae..369eb2c8c6ae 100644
> --- a/kernel/sched/cpufreq.c
> +++ b/kernel/sched/cpufreq.c
> @@ -8,6 +8,8 @@
>  
>  DEFINE_PER_CPU(struct update_util_data __rcu *, cpufreq_update_util_data);
>  
> +DEFINE_STATIC_KEY_FALSE(cpufreq_update_enabled);
> +
>  /**
>   * cpufreq_add_update_util_hook - Populate the CPU's update_util_data pointer.
>   * @cpu: The CPU to set the pointer for.
> @@ -33,6 +35,8 @@ void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
>  	if (WARN_ON(!data || !func))
>  		return;
>  
> +	static_branch_enable(&cpufreq_update_enabled);
> +
FWIW here's the lockdep splat:
[    3.798984] cpu cpu0: EM: created perf domain
[    3.804664] 
[    3.804825] ============================================
[    3.805296] WARNING: possible recursive locking detected
[    3.805768] 6.9.0-rc1-00003-gdc02af665938-dirty #251 Not tainted
[    3.806303] --------------------------------------------
[    3.806774] kworker/u24:0/10 is trying to acquire lock:
[    3.807241] ffff800085fde048 (cpu_hotplug_lock){++++}-{0:0}, at: cpus_read_lock+0x18/0x30
[    3.808007] 
[    3.808007] but task is already holding lock:
[    3.808524] ffff800085fde048 (cpu_hotplug_lock){++++}-{0:0}, at: cpus_read_lock+0x18/0x30
[    3.809277] 
[    3.809277] other info that might help us debug this:
[    3.809852]  Possible unsafe locking scenario:
[    3.809852] 
[    3.810376]        CPU0
[    3.810599]        ----
[    3.810821]   lock(cpu_hotplug_lock);
[    3.811159]   lock(cpu_hotplug_lock);
[    3.811496] 
[    3.811496]  *** DEADLOCK ***
[    3.811496] 
[    3.812018]  May be due to missing lock nesting notation
[    3.812018] 
[    3.812616] 6 locks held by kworker/u24:0/10:
[    3.813008]  #0: ffff00000042cd48 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1a8/0x648
[    3.813936]  #1: ffff80008732bde0 (deferred_probe_work){+.+.}-{0:0}, at: process_one_work+0x1d0/0x648
[    3.814783]  #2: ffff0000011b10f8 (&dev->mutex){....}-{3:3}, at: __device_attach+0x40/0x1c0
[    3.815554]  #3: ffff800085fde048 (cpu_hotplug_lock){++++}-{0:0}, at: cpus_read_lock+0x18/0x30
[    3.816344]  #4: ffff0000011f9118 (subsys mutex#3){+.+.}-{3:3}, at: subsys_interface_register+0x58/0x128
[    3.817214]  #5: ffff00000a787b80 (&policy->rwsem){+.+.}-{3:3}, at: cpufreq_online+0x4a8/0xa50
[    3.818010] 
[    3.818010] stack backtrace:
[    3.818400] CPU: 4 PID: 10 Comm: kworker/u24:0 Not tainted 6.9.0-rc1-00003-gdc02af665938-dirty #251
[    3.819202] Hardware name: Pine64 RockPro64 v2.1 (DT)
[    3.819654] Workqueue: events_unbound deferred_probe_work_func
[    3.820182] Call trace:
[    3.820405]  dump_backtrace+0x9c/0x100
[    3.820748]  show_stack+0x20/0x38
[    3.821051]  dump_stack_lvl+0x90/0xd0
[    3.821391]  dump_stack+0x18/0x28
[    3.821699]  print_deadlock_bug+0x25c/0x348
[    3.822084]  __lock_acquire+0x10c4/0x2018
[    3.822451]  lock_acquire.part.0+0xc8/0x210
[    3.822833]  lock_acquire+0x68/0x88
[    3.823154]  percpu_down_read.constprop.0+0x3c/0x158
[    3.823603]  cpus_read_lock+0x18/0x30
[    3.823937]  static_key_enable+0x20/0x48
[    3.824299]  cpufreq_add_update_util_hook+0x44/0xa0
[    3.824740]  sugov_start+0xc0/0x140
[    3.825062]  cpufreq_start_governor+0x5c/0xa0
[    3.825461]  cpufreq_set_policy+0x290/0x350
[    3.825843]  cpufreq_online+0x28c/0xa50
[    3.826196]  cpufreq_add_dev+0x88/0xa8
[    3.826541]  subsys_interface_register+0x104/0x128
[    3.826974]  cpufreq_register_driver+0x158/0x248
[    3.827392]  dt_cpufreq_probe+0x150/0x498
[    3.827764]  platform_probe+0x70/0xf0
[    3.828101]  really_probe+0xc4/0x2a8
[    3.828428]  __driver_probe_device+0x80/0x140
[    3.828823]  driver_probe_device+0xe4/0x178
[    3.829202]  __device_attach_driver+0xc0/0x148
[    3.829605]  bus_for_each_drv+0x88/0xf0
[    3.829954]  __device_attach+0xb0/0x1c0
[    3.830304]  device_initial_probe+0x1c/0x30
[    3.830684]  bus_probe_device+0xb4/0xc0
[    3.831033]  deferred_probe_work_func+0x90/0xd8
[    3.831441]  process_one_work+0x228/0x648
[    3.831810]  worker_thread+0x260/0x3b0
[    3.832155]  kthread+0x120/0x130
[    3.832455]  ret_from_fork+0x10/0x20
[    3.836084] cpu cpu4: EM: created perf domain

>  	if (WARN_ON(per_cpu(cpufreq_update_util_data, cpu)))
>  		return;
>  
> @@ -54,6 +58,7 @@ EXPORT_SYMBOL_GPL(cpufreq_add_update_util_hook);
>  void cpufreq_remove_update_util_hook(int cpu)
>  {
>  	rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), NULL);
> +	static_branch_disable(&cpufreq_update_enabled);
>  }
>  EXPORT_SYMBOL_GPL(cpufreq_remove_update_util_hook);
>  
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index eece6244f9d2..2f83ac898c34 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -59,7 +59,8 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
>  
>  /************************ Governor internals ***********************/
>  
> -static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time,
> +				     unsigned int flags)
>  {
>  	s64 delta_ns;
>  
> @@ -87,13 +88,16 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
>  		return true;
>  	}
>  
> +	if (unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
> +		return true;
> +
>  	delta_ns = time - sg_policy->last_freq_update_time;
>  
>  	return delta_ns >= sg_policy->freq_update_delay_ns;
>  }
>  
>  static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
> -				   unsigned int next_freq)
> +				   unsigned int next_freq, unsigned int flags)
>  {
>  	if (sg_policy->need_freq_update)
>  		sg_policy->need_freq_update = cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS);
> @@ -101,7 +105,9 @@ static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
>  		return false;
>  
>  	sg_policy->next_freq = next_freq;
> -	sg_policy->last_freq_update_time = time;
> +
> +	if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
> +		sg_policy->last_freq_update_time = time;
>  
>  	return true;
>  }
> @@ -249,9 +255,10 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
>  			       unsigned int flags)
>  {
>  	bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
> +	bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE;
>  
>  	/* Reset boost if the CPU appears to have been idle enough */
> -	if (sg_cpu->iowait_boost &&
> +	if (sg_cpu->iowait_boost && !forced_update &&
>  	    sugov_iowait_reset(sg_cpu, time, set_iowait_boost))
>  		return;
>  
> @@ -294,17 +301,29 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
>   * being more conservative on tasks which does sporadic IO operations.
>   */
>  static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> -			       unsigned long max_cap)
> +			       unsigned long max_cap, unsigned int flags)
>  {
> +	bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE;
> +	s64 delta_ns = time - sg_cpu->last_update;
> +
>  	/* No boost currently required */
>  	if (!sg_cpu->iowait_boost)
>  		return 0;
>  
> +	if (forced_update)
> +		goto apply_boost;
> +
>  	/* Reset boost if the CPU appears to have been idle enough */
>  	if (sugov_iowait_reset(sg_cpu, time, false))
>  		return 0;
>  
>  	if (!sg_cpu->iowait_boost_pending) {
> +		/*
> +		 * Reduce boost only if a tick has elapsed since last request.
> +		 */
> +		if (delta_ns <= TICK_NSEC)
> +			goto apply_boost;
> +

That makes the entire reduce logic below dead code.
If delta_ns > TICK_NSEC then sugov_iowait_reset() will have returned early
anyway and reset the boost.
In theory, the current iowait boost (probabilistically, because of
rate_limit_us) determined the boost level more or less by the fraction
of iowait wakeups to non-iowait wakeups.
Did that ever work in a meaningful way? Maybe not.
But of course just relying on the reset opens up the boost to linger for
even longer.
Anyway the iowait boost needs some rework in both intel_pstate and sugov
if we switch to a context-switch freq update IMO.


>  		/*
>  		 * No boost pending; reduce the boost value.
>  		 */
> @@ -315,6 +334,7 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
>  		}
>  	}
>  
> +apply_boost:
>  	sg_cpu->iowait_boost_pending = false;
>  
>  	/*
> @@ -351,17 +371,28 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
>  					      u64 time, unsigned long max_cap,
>  					      unsigned int flags)
>  {
> +	bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE;
> +	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> +	bool iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
>  	unsigned long boost;
>  
> +	/*
> +	 * Forced updates are initiated by iowait and RT/DL tasks. If the
> +	 * latter, verify that it's not our worker thread that is initiating
> +	 * this forced update.
> +	 */
> +	if (forced_update && !iowait_boost && current == sg_policy->thread)
> +		return false;
> +
I don't quite see why this isn't just
if (current == sg_policy->thread)
Is there a reason?

>  	sugov_iowait_boost(sg_cpu, time, flags);
>  	sg_cpu->last_update = time;
>  
>  	ignore_dl_rate_limit(sg_cpu);
>  
> -	if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
> +	if (!sugov_should_update_freq(sg_cpu->sg_policy, time, flags))
>  		return false;
>  
> -	boost = sugov_iowait_apply(sg_cpu, time, max_cap);
> +	boost = sugov_iowait_apply(sg_cpu, time, max_cap, flags);
>  	sugov_get_util(sg_cpu, boost);
>  
>  	return true;
> @@ -397,7 +428,7 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
>  		sg_policy->cached_raw_freq = cached_freq;
>  	}
>  
> -	if (!sugov_update_next_freq(sg_policy, time, next_f))
> +	if (!sugov_update_next_freq(sg_policy, time, next_f, flags))
>  		return;
>  
>  	/*
> @@ -449,10 +480,12 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
>  	cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
>  				   sg_cpu->util, max_cap);
>  
> -	sg_cpu->sg_policy->last_freq_update_time = time;
> +	if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
> +		sg_cpu->sg_policy->last_freq_update_time = time;
Is that unlikely? Or rather don't we intend to optimise for the
SCHED_CPUFREQ_FORCE_UPDATE case, as it's the more time-critical one?
Just a thought though.

>  }
>  
> -static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
> +static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time,
> +					   unsigned int flags)
>  {
>  	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
>  	struct cpufreq_policy *policy = sg_policy->policy;
> @@ -465,7 +498,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
>  		struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
>  		unsigned long boost;
>  
> -		boost = sugov_iowait_apply(j_sg_cpu, time, max_cap);
> +		boost = sugov_iowait_apply(j_sg_cpu, time, max_cap, flags);
>  		sugov_get_util(j_sg_cpu, boost);
>  
>  		util = max(j_sg_cpu->util, util);
> @@ -478,9 +511,19 @@ static void
>  sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
>  {
>  	struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
> +	bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE;
>  	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> +	bool iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
>  	unsigned int next_f;
>  
> +	/*
> +	 * Forced updates are initiated by iowait and RT/DL tasks. If the
> +	 * latter, verify that it's not our worker thread that is initiating
> +	 * this forced update.
> +	 */
> +	if (forced_update && !iowait_boost && current == sg_policy->thread)
> +		return;
> +
Same question as above.

>  	raw_spin_lock(&sg_policy->update_lock);
>  
>  	sugov_iowait_boost(sg_cpu, time, flags);
> @@ -488,10 +531,10 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
>  
>  	ignore_dl_rate_limit(sg_cpu);
>  
> -	if (sugov_should_update_freq(sg_policy, time)) {
> -		next_f = sugov_next_freq_shared(sg_cpu, time);
> +	if (sugov_should_update_freq(sg_policy, time, flags)) {
> +		next_f = sugov_next_freq_shared(sg_cpu, time, flags);
>  
> -		if (!sugov_update_next_freq(sg_policy, time, next_f))
> +		if (!sugov_update_next_freq(sg_policy, time, next_f, flags))
>  			goto unlock;
>  
>  		if (sg_policy->policy->fast_switch_enabled)
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index a04a436af8cc..02c9c2488091 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -252,8 +252,6 @@ void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
>  	dl_rq->running_bw += dl_bw;
>  	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
>  	SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
> -	/* kick cpufreq (see the comment in kernel/sched/sched.h). */
> -	cpufreq_update_util(rq_of_dl_rq(dl_rq), 0);
>  }
>  
>  static inline
> @@ -266,8 +264,6 @@ void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
>  	SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
>  	if (dl_rq->running_bw > old)
>  		dl_rq->running_bw = 0;
> -	/* kick cpufreq (see the comment in kernel/sched/sched.h). */
> -	cpufreq_update_util(rq_of_dl_rq(dl_rq), 0);
>  }
>  
>  static inline
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c8e50fbac345..a0692e308d2d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3982,29 +3982,6 @@ static inline void update_cfs_group(struct sched_entity *se)
>  }
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>  
> -static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
> -{
> -	struct rq *rq = rq_of(cfs_rq);
> -
> -	if (&rq->cfs == cfs_rq) {
> -		/*
> -		 * There are a few boundary cases this might miss but it should
> -		 * get called often enough that that should (hopefully) not be
> -		 * a real problem.
> -		 *
> -		 * It will not get called when we go idle, because the idle
> -		 * thread is a different class (!fair), nor will the utilization
> -		 * number include things like RT tasks.
> -		 *
> -		 * As is, the util number is not freq-invariant (we'd have to
> -		 * implement arch_scale_freq_capacity() for that).
> -		 *
> -		 * See cpu_util_cfs().
> -		 */
> -		cpufreq_update_util(rq, flags);
> -	}
> -}
> -
>  #ifdef CONFIG_SMP
>  static inline bool load_avg_is_decayed(struct sched_avg *sa)
>  {
> @@ -4682,7 +4659,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  
>  	add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
>  
> -	cfs_rq_util_change(cfs_rq, 0);
> +	cpufreq_update_util(rq_of(cfs_rq), 0);
>  
>  	trace_pelt_cfs_tp(cfs_rq);
>  }
> @@ -4712,7 +4689,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  
>  	add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
>  
> -	cfs_rq_util_change(cfs_rq, 0);
> +	cpufreq_update_util(rq_of(cfs_rq), 0);
>  
>  	trace_pelt_cfs_tp(cfs_rq);
>  }
> @@ -4729,7 +4706,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
>  	u64 now = cfs_rq_clock_pelt(cfs_rq);
> -	int decayed;
>  
>  	/*
>  	 * Track task load average for carrying it to new CPU after migrated, and
> @@ -4738,8 +4714,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
>  		__update_load_avg_se(now, cfs_rq, se);
>  
> -	decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> -	decayed |= propagate_entity_load_avg(se);
> +	cfs_rq->decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> +	cfs_rq->decayed |= propagate_entity_load_avg(se);
>  
>  	if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
>  
> @@ -4760,11 +4736,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  		 */
>  		detach_entity_load_avg(cfs_rq, se);
>  		update_tg_load_avg(cfs_rq);
> -	} else if (decayed) {
> -		cfs_rq_util_change(cfs_rq, 0);
> -
> -		if (flags & UPDATE_TG)
> -			update_tg_load_avg(cfs_rq);
> +	} else if (cfs_rq->decayed && (flags & UPDATE_TG)) {
> +		update_tg_load_avg(cfs_rq);
>  	}
>  }
>  
> @@ -5126,7 +5099,6 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>  
>  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
>  {
> -	cfs_rq_util_change(cfs_rq, 0);
>  }
>  
>  static inline void remove_entity_load_avg(struct sched_entity *se) {}
> @@ -6716,14 +6688,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	 */
>  	util_est_enqueue(&rq->cfs, p);
>  
> -	/*
> -	 * If in_iowait is set, the code below may not trigger any cpufreq
> -	 * utilization updates, so do it here explicitly with the IOWAIT flag
> -	 * passed.
> -	 */
> -	if (p->in_iowait)
> -		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
> -

I'm sure you're aware, but of course this also changes behavior in intel_pstate.

>  	for_each_sched_entity(se) {
>  		if (se->on_rq)
>  			break;
> @@ -9282,10 +9246,6 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
>  	unsigned long thermal_pressure;
>  	bool decayed;
>  
> -	/*
> -	 * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
> -	 * DL and IRQ signals have been updated before updating CFS.
> -	 */
>  	curr_class = rq->curr->sched_class;
>  
>  	thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
> @@ -12623,6 +12583,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>  
>  	update_misfit_status(curr, rq);
>  	update_overutilized_status(task_rq(curr));
> +	cpufreq_update_util(rq, 0);
>  
>  	task_tick_core(rq, curr);
>  }
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 3261b067b67e..fe6d8b0ffa95 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -556,11 +556,8 @@ static void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
>  
>  	rt_se = rt_rq->tg->rt_se[cpu];
>  
> -	if (!rt_se) {
> +	if (!rt_se)
>  		dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running);
> -		/* Kick cpufreq (see the comment in kernel/sched/sched.h). */
> -		cpufreq_update_util(rq_of_rt_rq(rt_rq), 0);
> -	}
>  	else if (on_rt_rq(rt_se))
>  		dequeue_rt_entity(rt_se, 0);
>  }
> @@ -1065,9 +1062,6 @@ enqueue_top_rt_rq(struct rt_rq *rt_rq)
>  		add_nr_running(rq, rt_rq->rt_nr_running);
>  		rt_rq->rt_queued = 1;
>  	}
> -
> -	/* Kick cpufreq (see the comment in kernel/sched/sched.h). */
> -	cpufreq_update_util(rq, 0);
>  }
>  
>  #if defined CONFIG_SMP
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 41024c1c49b4..6a6de92448d1 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -618,6 +618,11 @@ struct cfs_rq {
>  		unsigned long	runnable_avg;
>  	} removed;
>  
> +	/*
> +	 * Store whether last update_load_avg() has decayed
> +	 */
> +	bool			decayed;
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	u64			last_update_tg_load_avg;
>  	unsigned long		tg_load_avg_contrib;
> @@ -2961,6 +2966,8 @@ static inline u64 irq_time_read(int cpu)
>  #ifdef CONFIG_CPU_FREQ
>  DECLARE_PER_CPU(struct update_util_data __rcu *, cpufreq_update_util_data);
>  
> +DECLARE_STATIC_KEY_FALSE(cpufreq_update_enabled);
> +
>  /**
>   * cpufreq_update_util - Take a note about CPU utilization changes.
>   * @rq: Runqueue to carry out the update for.
> @@ -2987,6 +2994,9 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
>  {
>  	struct update_util_data *data;
>  
> +	if (!static_branch_likely(&cpufreq_update_enabled))
> +		return;
> +
>  	data = rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data,
>  						  cpu_of(rq)));
>  	if (data)

Apart from the two issues of:
- whether spamming cpufreq updates faster than rate_limit_us has any side effects on some platforms, and
- iowait boosting needing to be revisited,
I don't see anything fundamentally wrong with the approach as of now.

Kind Regards,
Christian
Hongyan Xia March 27, 2024, 1:35 p.m. UTC | #2
On 24/03/2024 02:01, Qais Yousef wrote:
> Improve the interaction with cpufreq governors by making the
> cpufreq_update_util() calls more intentional.
> 
> At the moment we send them when load is updated for CFS, bandwidth for
> DL and at enqueue/dequeue for RT. But this can lead to too many updates
> sent in a short period of time and potentially be ignored at a critical
> moment due to the rate_limit_us in schedutil.
> 
> For example, simultaneous task enqueue on the CPU where 2nd task is
> bigger and requires higher freq. The trigger to cpufreq_update_util() by
> the first task will lead to dropping the 2nd request until tick. Or
> another CPU in the same policy triggers a freq update shortly after.
> 
> Updates at enqueue for RT are not strictly required. Though they do help
> to reduce the delay for switching the frequency and the potential
> observation of lower frequency during this delay. But current logic
> doesn't intentionally (at least to my understanding) try to speed up the
> request.
> 
> To help reduce the amount of cpufreq updates and make them more
> purposeful, consolidate them into these locations:
> 
> 1. context_switch()
> 2. task_tick_fair()
> 3. {attach, detach}_entity_load_avg()
> 4. update_blocked_averages()
> 
> The update at context switch should help guarantee that DL and RT get
> the right frequency straightaway when they're RUNNING. As mentioned
> though the update will happen slightly after enqueue_task(); though in
> an ideal world these tasks should be RUNNING ASAP and this additional
> delay should be negligible. For fair tasks we need to make sure we send
> a single update for every decay for the root cfs_rq. Any changes to the
> rq will be deferred until the next task is ready to run, or we hit TICK.
> But we are guaranteed the task is running at a level that meets its
> requirements after enqueue.
> 
> To guarantee RT and DL tasks updates are never missed, we add a new
> SCHED_CPUFREQ_FORCE_UPDATE to ignore the rate_limit_us. If we are
> already running at the right freq, the governor will end up doing
> nothing, but we eliminate the risk of the task ending up accidentally
> running at the wrong freq due to rate_limit_us.

There may be two things in this patch:

1. Have well-defined, centralized places where we update CPU frequency.
2. The FORCE_UPDATE flag.

I agree that at the moment, frequency updates inside the scheduler are 
scattered around in many places, and they can be called consecutively in 
a short period of time. Defining those places explicitly instead of 
triggering frequency updates here and there sounds like a good idea, so 
I definitely support 1.

Not sure about 2. I think rate limit is there for a reason, although I 
don't have that many platforms to test on to know whether forcing the 
update is a problem.

> 
> [...]
Qais Yousef March 28, 2024, 9:55 p.m. UTC | #3
On 03/26/24 09:20, Ingo Molnar wrote:
> 
> * Qais Yousef <qyousef@layalina.io> wrote:
> 
> > Results of `perf stat --repeat 10 perf bench sched pipe` on AMD 3900X to
> > verify any potential overhead because of the addition at context switch
> > 
> > Before:
> > -------
> > 
> > 	Performance counter stats for 'perf bench sched pipe' (10 runs):
> > 
> > 		 16,839.74 msec task-clock:u              #    1.158 CPUs utilized            ( +-  0.52% )
> > 			 0      context-switches:u        #    0.000 /sec
> > 			 0      cpu-migrations:u          #    0.000 /sec
> > 		     1,390      page-faults:u             #   83.903 /sec                     ( +-  0.06% )
> > 	       333,773,107      cycles:u                  #    0.020 GHz                      ( +-  0.70% )  (83.72%)
> > 		67,050,466      stalled-cycles-frontend:u #   19.94% frontend cycles idle     ( +-  2.99% )  (83.23%)
> > 		37,763,775      stalled-cycles-backend:u  #   11.23% backend cycles idle      ( +-  2.18% )  (83.09%)
> > 		84,456,137      instructions:u            #    0.25  insn per cycle
> > 							  #    0.83  stalled cycles per insn  ( +-  0.02% )  (83.01%)
> > 		34,097,544      branches:u                #    2.058 M/sec                    ( +-  0.02% )  (83.52%)
> > 		 8,038,902      branch-misses:u           #   23.59% of all branches          ( +-  0.03% )  (83.44%)
> > 
> > 		   14.5464 +- 0.0758 seconds time elapsed  ( +-  0.52% )
> > 
> > After:
> > -------
> > 
> > 	Performance counter stats for 'perf bench sched pipe' (10 runs):
> > 
> > 		 16,219.58 msec task-clock:u              #    1.130 CPUs utilized            ( +-  0.80% )
> > 			 0      context-switches:u        #    0.000 /sec
> > 			 0      cpu-migrations:u          #    0.000 /sec
> > 		     1,391      page-faults:u             #   85.163 /sec                     ( +-  0.06% )
> > 	       342,768,312      cycles:u                  #    0.021 GHz                      ( +-  0.63% )  (83.36%)
> > 		66,231,208      stalled-cycles-frontend:u #   18.91% frontend cycles idle     ( +-  2.34% )  (83.95%)
> > 		39,055,410      stalled-cycles-backend:u  #   11.15% backend cycles idle      ( +-  1.80% )  (82.73%)
> > 		84,475,662      instructions:u            #    0.24  insn per cycle
> > 							  #    0.82  stalled cycles per insn  ( +-  0.02% )  (83.05%)
> > 		34,067,160      branches:u                #    2.086 M/sec                    ( +-  0.02% )  (83.67%)
> > 		 8,042,888      branch-misses:u           #   23.60% of all branches          ( +-  0.07% )  (83.25%)
> > 
> > 		    14.358 +- 0.116 seconds time elapsed  ( +-  0.81% )
> 
> Noise caused by too many counters & the vagaries of multi-CPU scheduling is 
> drowning out any results here.
> 
> I'd suggest somethig like this to measure same-CPU context-switching 
> overhead:
> 
>     taskset 1 perf stat --repeat 10 -e cycles,instructions,task-clock perf bench sched pipe
> 
> ... and make sure the cpufreq governor is at 'performance' first:

The performance governor won't stress the patch, as the static key
should bypass the new code.

> 
>     for ((cpu=0; cpu < $(nproc); cpu++)); do echo performance > /sys/devices/system/cpu/cpu$cpu/cpufreq/scaling_governor; done

There's this shorthand, if you like:

	echo performance | sudo tee /sys/devices/system/cpu/cpufreq/policy*/scaling_governor

> 
> With that approach you should much, much lower noise levels even with just 
> 3 runs:
> 
>  Performance counter stats for 'perf bench sched pipe' (3 runs):
> 
>     51,616,501,297      cycles                           #    3.188 GHz                         ( +-  0.05% )
>     37,523,641,203      instructions                     #    0.73  insn per cycle              ( +-  0.08% )
>          16,191.01 msec task-clock                       #    0.999 CPUs utilized               ( +-  0.04% )
> 
>           16.20511 +- 0.00578 seconds time elapsed  ( +-  0.04% )

Thanks for the tips!

I repeated the test using taskset and fewer counters, for both the
performance and schedutil governors:


tip: schedutil:
---------------

 Performance counter stats for 'perf bench sched pipe' (10 runs):

       829,076,881      cycles:u                  #    0.077 GHz                      ( +-  1.26% )
        82,712,937      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
         10,735.67 msec task-clock:u              #    1.002 CPUs utilized            ( +-  0.08% )

          10.71758 +- 0.00840 seconds time elapsed  ( +-  0.08% )

tip: performance:
-----------------

 Performance counter stats for 'perf bench sched pipe' (10 runs):

       871,744,951      cycles:u                  #    0.079 GHz                      ( +-  1.04% )
        82,711,239      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
         11,076.50 msec task-clock:u              #    1.004 CPUs utilized            ( +-  0.20% )

           11.0374 +- 0.0216 seconds time elapsed  ( +-  0.20% )

tip+patch: schedutil:
---------------------

 Performance counter stats for 'perf bench sched pipe' (10 runs):

       836,767,470      cycles:u                  #    0.078 GHz                      ( +-  0.69% )
        82,712,893      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
         10,825.83 msec task-clock:u              #    1.005 CPUs utilized            ( +-  0.12% )

           10.7751 +- 0.0128 seconds time elapsed  ( +-  0.12% )

tip+patch: performance:
-----------------------

 Performance counter stats for 'perf bench sched pipe' (10 runs):

       842,037,546      cycles:u                  #    0.077 GHz                      ( +-  0.97% )
        82,717,942      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
         10,921.37 msec task-clock:u              #    0.996 CPUs utilized            ( +-  0.18% )

           10.9629 +- 0.0202 seconds time elapsed  ( +-  0.18% )


Thanks!

--
Qais Yousef
Qais Yousef March 28, 2024, 10:37 p.m. UTC | #4
On 03/26/24 14:44, Christian Loehle wrote:
> Hi Qais,
> Some first thoughts, I'll do some more thinking and testing though.

Thanks for having a look.

> I wonder if the increased latency of applying the new freq later
> (context-switch vs enqueue) has any measurable impact for some
> workloads for somewhat higher delay switching platforms.
> 
> On 24/03/2024 02:01, Qais Yousef wrote:
> > Improve the interaction with cpufreq governors by making the
> > cpufreq_update_util() calls more intentional.
> > 
> > At the moment we send them when load is updated for CFS, bandwidth for
> > DL and at enqueue/dequeue for RT. But this can lead to too many updates
> > sent in a short period of time and potentially be ignored at a critical
> > moment due to the rate_limit_us in schedutil.
> > 
> > For example, simultaneous task enqueue on the CPU where 2nd task is
> > bigger and requires higher freq. The trigger to cpufreq_update_util() by
> > the first task will lead to dropping the 2nd request until tick. Or
> > another CPU in the same policy triggers a freq update shortly after.
> Out of curiosity: Is that significant anywhere?

It is hard to quantify, but simultaneous enqueue problems are common in
practice. I have been raising the issue in various places on the list
about how our wake-up path in EAS needs to improve how it balances;
there are a number of corner cases where it lags behind.

This is not the major motivation FWIW. Just an example.

> It is unfortunate for sure, but the delay until the 'big' freq update
> is also bounded.

Could you expand on this please?

> 
> > 
> > Updates at enqueue for RT are not strictly required. Though they do help
> > to reduce the delay for switching the frequency and the potential
> > observation of lower frequency during this delay. But current logic
> > doesn't intentionally (at least to my understanding) try to speed up the
> > request.
> > 
> > To help reduce the amount of cpufreq updates and make them more
> > purposeful, consolidate them into these locations:
> > 
> > 1. context_switch()
> > 2. task_tick_fair()
> > 3. {attach, detach}_entity_load_avg()
> > 4. update_blocked_averages()
> > 
> > The update at context switch should help guarantee that DL and RT get
> > the right frequency straightaway when they're RUNNING. As mentioned
> > though the update will happen slightly after enqueue_task(); though in
> > an ideal world these tasks should be RUNNING ASAP and this additional
> > delay should be negligible. For fair tasks we need to make sure we send
> > a single update for every decay for the root cfs_rq. Any changes to the
> > rq will be deferred until the next task is ready to run, or we hit TICK.
> > But we are guaranteed the task is running at a level that meets its
> > requirements after enqueue.
> > 
> > To guarantee RT and DL tasks updates are never missed, we add a new
> > SCHED_CPUFREQ_FORCE_UPDATE to ignore the rate_limit_us. If we are
> > already running at the right freq, the governor will end up doing
> > nothing, but we eliminate the risk of the task ending up accidentally
> > running at the wrong freq due to rate_limit_us.
> > 
> > Similarly for iowait boost. We also handle a case of a boost reset
> > prematurely by adding a guard against TICK_NSEC in sugov_iowait_apply()
> > in similar fashion to sugov_iowait_reset().
> > 
> > The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit
> > time stamps otherwise we can end up delaying updates for normal
> > requests.
> With the new updates and the assumed default of 2ms does rate_limit_us
> even make sense then? It is very often ignored with these changes, so

The normal case is fair tasks triggering the updates. And these tasks are not
in_iowait most of the time. So the common case is still not to force an update.

> unless rate_limit_us==2000 && CONFIG_HZ=1000 SCHED_CPUFREQ_FORCE_UPDATE
> would dominate for many workloads, wouldn't it?

I don't think so. RT/DL tasks are privileged and a minority, and iowait
is not the common case. There are scenarios where it is, but that still
doesn't make it the overall common case.

Also keep in mind that schedutil will not send an update unless it will
lead to a frequency change. RT by default will run at max, so if there
are multiple RT tasks context switching, only the first one will send a
request. The rest will only add overhead, which, as I alluded to in the
commit message, I can optimize away.

> Is that fine for all platforms? Platforms I'm aware of will drop older
> requests when they couldn't be served yet and a new one is sent, can
> we assume that generally?

Aren't we dropping the requests in schedutil here too, leading to the
same problem?

The current contract, based on my understanding, is that the kernel
doesn't expect these requests to be dropped. If we do expect them to be
dropped, then there has to be error reporting by the driver, otherwise
Linux will mistakenly think the frequency changed and drop future
requests for the next frequency.

I don't think rate_limit is the mechanism to handle a hardware failure
to honour the request, if that is a problem; we need the driver to let
us know. To my knowledge there's no interface between drivers and
governors for this, but I could be wrong. If there really isn't one,
maybe we should add one.
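
Purely as a hypothetical sketch of what such an interface could look
like (nothing like this exists today and the names are made up):

	/*
	 * Hypothetical, made-up interface: let a cpufreq driver tell the
	 * governor that a previously accepted request was not honoured, so
	 * the governor doesn't assume the switch happened.
	 */
	struct cpufreq_request_feedback {
		void (*request_dropped)(struct cpufreq_policy *policy,
					unsigned int requested_freq);
	};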

> > Note worthy that we still have the following race condition on systems
> > that have shared policy:
> > 
> > * CPUs with shared policy can end up sending simultaneous cpufreq
> >   updates requests where the 2nd one will be unlucky and get blocked by
> >   the rate_limit_us (schedutil).
> > 
> > We can potentially address this limitation later, but it is out of the
> > scope of this patch.
> 
> Which has now gotten worse I'm afraid.

How come?

> With a shared policy the "update at context switch" at least theoretically
> no longer works for any freq update delay > 0.
> It could be a non-issue, but confirming that isn't straightforward IMO.

Hmm. I did run tests on Pixel 6 and Apple M1 and didn't notice anything.
It'd be good if you could elaborate more on the problem you're seeing.

FWIW, the race over which CPU initiates the update between CPUs in
shared policies doesn't change with this patch. Whether it is done at
enqueue or at context switch doesn't matter, no?

> > +DEFINE_STATIC_KEY_FALSE(cpufreq_update_enabled);
> > +
> >  /**
> >   * cpufreq_add_update_util_hook - Populate the CPU's update_util_data pointer.
> >   * @cpu: The CPU to set the pointer for.
> > @@ -33,6 +35,8 @@ void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
> >  	if (WARN_ON(!data || !func))
> >  		return;
> >  
> > +	static_branch_enable(&cpufreq_update_enabled);
> > +
> FWIW here's the lockdep splat:

Thanks for catching this.

> > @@ -294,17 +301,29 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> >   * being more conservative on tasks which does sporadic IO operations.
> >   */
> >  static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> > -			       unsigned long max_cap)
> > +			       unsigned long max_cap, unsigned int flags)
> >  {
> > +	bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE;
> > +	s64 delta_ns = time - sg_cpu->last_update;
> > +
> >  	/* No boost currently required */
> >  	if (!sg_cpu->iowait_boost)
> >  		return 0;
> >  
> > +	if (forced_update)
> > +		goto apply_boost;
> > +
> >  	/* Reset boost if the CPU appears to have been idle enough */
> >  	if (sugov_iowait_reset(sg_cpu, time, false))
> >  		return 0;
> >  
> >  	if (!sg_cpu->iowait_boost_pending) {
> > +		/*
> > +		 * Reduce boost only if a tick has elapsed since last request.
> > +		 */
> > +		if (delta_ns <= TICK_NSEC)
> > +			goto apply_boost;
> > +
> 
> That makes the entire reduce logic below dead code.

Hmm, yes, you're right. This is a band-aid, tbh; this logic needs
reworking. I did go down the path of converting this to per-task iowait
boost FWIW. But hopefully I can leave this to you now ;-)

For the time being, I could:

1. Check against a constant 1ms (a rough sketch follows this list).
2. Drop the reduce logic in favour of the reset logic (my vote goes here)
3. Keep the current behavior the same and guard iowait boost updates with the
   decay. But I think this defeats the purpose of this patch to make sure
   changes are intentional and aren't accidentally dropped if unlucky.
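
For option 1, a rough sketch (my wording, building on the guard this
patch adds in sugov_iowait_apply(); the constant is an assumption) could
be:

	if (!sg_cpu->iowait_boost_pending) {
		/*
		 * Option 1 (hypothetical): keep applying the boost if the
		 * last request was less than a fixed 1ms ago, instead of
		 * comparing against TICK_NSEC.
		 */
		if (delta_ns <= NSEC_PER_MSEC)
			goto apply_boost;

		/* otherwise reduce the boost value as before */
	}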

FWIW, I did hit this problem when changing the default sched period. So it has
more hidden dependencies.

> If delta_ns > TICK_NSEC then sugov_iowait_reset() will have returned early
> anyway and reset the boost.
> In theory the current iowait boost (probabilisticly because of rate_limit_us)
> determined the boost level more or less by the fraction of iowait wakeups
> to non-wakeups.
> Did that ever work in a meaningful way? Maybe not.
> But of course just relying on the reset opens up the boost to linger for
> even longer.
> Anyway the iowait boost needs some rework in both intel_pstate and sugov
> if we switch to a context-switch freq update IMO.
> 
> 
> >  		/*
> >  		 * No boost pending; reduce the boost value.
> >  		 */
> > @@ -315,6 +334,7 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> >  		}
> >  	}
> >  
> > +apply_boost:
> >  	sg_cpu->iowait_boost_pending = false;
> >  
> >  	/*
> > @@ -351,17 +371,28 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
> >  					      u64 time, unsigned long max_cap,
> >  					      unsigned int flags)
> >  {
> > +	bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE;
> > +	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> > +	bool iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
> >  	unsigned long boost;
> >  
> > +	/*
> > +	 * Forced updates are initiated by iowait and RT/DL tasks. If the
> > +	 * latter, verify that it's not our worker thread that is initiating
> > +	 * this forced update.
> > +	 */
> > +	if (forced_update && !iowait_boost && current == sg_policy->thread)
> > +		return false;
> > +
> I don't quite see why this isn't just
> if (current == sg_policy->thread)
> Is there a reason?

The reason is what's written in the comment: these are the conditions under
which this check makes sense. We don't want a random out-of-line cpufreq update
that happens while the sugov kthread is running to be accidentally dropped.
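
To summarise my reading of the guard (informal, just to make the combinations
explicit):

/*
 *  FORCE | IOWAIT | current == sugov kthread  =>  handled?
 * -------+--------+---------------------------------------
 *   no   |   *    |   *                       =>  yes (normal rate-limited path)
 *   yes  |  yes   |   *                       =>  yes (iowait boost)
 *   yes  |   no   |  yes                      =>  no  (kthread's own activity)
 *   yes  |   no   |  no                       =>  yes (RT/DL task needs the freq)
 *
 * With a bare "current == sg_policy->thread" check, the first row would also
 * be dropped whenever the kthread happens to be the one running, which is
 * exactly the case I want to keep.
 */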

> 
> >  	sugov_iowait_boost(sg_cpu, time, flags);
> >  	sg_cpu->last_update = time;
> >  
> >  	ignore_dl_rate_limit(sg_cpu);
> >  
> > -	if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
> > +	if (!sugov_should_update_freq(sg_cpu->sg_policy, time, flags))
> >  		return false;
> >  
> > -	boost = sugov_iowait_apply(sg_cpu, time, max_cap);
> > +	boost = sugov_iowait_apply(sg_cpu, time, max_cap, flags);
> >  	sugov_get_util(sg_cpu, boost);
> >  
> >  	return true;
> > @@ -397,7 +428,7 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
> >  		sg_policy->cached_raw_freq = cached_freq;
> >  	}
> >  
> > -	if (!sugov_update_next_freq(sg_policy, time, next_f))
> > +	if (!sugov_update_next_freq(sg_policy, time, next_f, flags))
> >  		return;
> >  
> >  	/*
> > @@ -449,10 +480,12 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
> >  	cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
> >  				   sg_cpu->util, max_cap);
> >  
> > -	sg_cpu->sg_policy->last_freq_update_time = time;
> > +	if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
> > +		sg_cpu->sg_policy->last_freq_update_time = time;
> Is that unlikely? Or rather don't we intend to optimise for the
> SCHED_CPUFREQ_FORCE_UPDATE case, as it's the more time-critical one?
> Just a thought though.

It is unlikely. Fair tasks that are not in iowait boost are the common case.
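
FWIW, if the double negation reads awkwardly, the same hint could be spelled
as below (just a stylistic alternative; I believe the generated expectation is
identical):

	/* Absence of the force flag is the expected case */
	if (likely(!(flags & SCHED_CPUFREQ_FORCE_UPDATE)))
		sg_cpu->sg_policy->last_freq_update_time = time;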

> >  static inline void remove_entity_load_avg(struct sched_entity *se) {}
> > @@ -6716,14 +6688,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  	 */
> >  	util_est_enqueue(&rq->cfs, p);
> >  
> > -	/*
> > -	 * If in_iowait is set, the code below may not trigger any cpufreq
> > -	 * utilization updates, so do it here explicitly with the IOWAIT flag
> > -	 * passed.
> > -	 */
> > -	if (p->in_iowait)
> > -		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
> > -
> 
> I'm sure you're aware, but of course this also changes behavior in intel_pstate.

I don't expect this to make a difference to intel_pstate. We still send
cpufreq_update_util() with the iowait boost flag.

I didn't do another test before sending this patch, but I remember trying on an
intel_pstate machine in the past. I'll make sure to rerun this test before the
next iteration.


Thanks!

--
Qais Yousef

Patch

diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index bdd31ab93bc5..2d0a45aba16f 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -8,7 +8,8 @@ 
  * Interface between cpufreq drivers and the scheduler:
  */
 
-#define SCHED_CPUFREQ_IOWAIT	(1U << 0)
+#define SCHED_CPUFREQ_IOWAIT		(1U << 0)
+#define SCHED_CPUFREQ_FORCE_UPDATE	(1U << 1) /* ignore transition_delay_us */
 
 #ifdef CONFIG_CPU_FREQ
 struct cpufreq_policy;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 929fce69f555..563cb61dbf79 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5134,6 +5134,52 @@  static inline void balance_callbacks(struct rq *rq, struct balance_callback *hea
 
 #endif
 
+static inline void update_cpufreq_ctx_switch(struct rq *rq)
+{
+#ifdef CONFIG_CPU_FREQ
+	unsigned int flags = 0;
+
+	if (!static_branch_likely(&cpufreq_update_enabled))
+		return;
+
+#ifdef CONFIG_SMP
+	if (unlikely(current->sched_class == &stop_sched_class))
+		return;
+#endif
+
+	if (unlikely(current->sched_class == &idle_sched_class))
+		return;
+
+	if (unlikely(task_has_idle_policy(current)))
+		return;
+
+	if (likely(fair_policy(current->policy))) {
+
+		if (unlikely(current->in_iowait)) {
+			flags |= SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE;
+			goto force_update;
+		}
+
+#ifdef CONFIG_SMP
+		/*
+		 * Allow cpufreq updates once for every update_load_avg() decay.
+		 */
+		if (unlikely(rq->cfs.decayed)) {
+			rq->cfs.decayed = false;
+			goto force_update;
+		}
+#endif
+		return;
+	}
+
+	/* RT and DL should always send a freq update */
+	flags |= SCHED_CPUFREQ_FORCE_UPDATE;
+
+force_update:
+	cpufreq_update_util(rq, flags);
+#endif
+}
+
 static inline void
 prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
 {
@@ -5160,6 +5206,11 @@  static inline void finish_lock_switch(struct rq *rq)
 	 */
 	spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
 	__balance_callbacks(rq);
+	/*
+	 * Request freq update after __balance_callbacks to take into account
+	 * any changes to rq.
+	 */
+	update_cpufreq_ctx_switch(rq);
 	raw_spin_rq_unlock_irq(rq);
 }
 
diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
index 5252fb191fae..369eb2c8c6ae 100644
--- a/kernel/sched/cpufreq.c
+++ b/kernel/sched/cpufreq.c
@@ -8,6 +8,8 @@ 
 
 DEFINE_PER_CPU(struct update_util_data __rcu *, cpufreq_update_util_data);
 
+DEFINE_STATIC_KEY_FALSE(cpufreq_update_enabled);
+
 /**
  * cpufreq_add_update_util_hook - Populate the CPU's update_util_data pointer.
  * @cpu: The CPU to set the pointer for.
@@ -33,6 +35,8 @@  void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
 	if (WARN_ON(!data || !func))
 		return;
 
+	static_branch_enable(&cpufreq_update_enabled);
+
 	if (WARN_ON(per_cpu(cpufreq_update_util_data, cpu)))
 		return;
 
@@ -54,6 +58,7 @@  EXPORT_SYMBOL_GPL(cpufreq_add_update_util_hook);
 void cpufreq_remove_update_util_hook(int cpu)
 {
 	rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), NULL);
+	static_branch_disable(&cpufreq_update_enabled);
 }
 EXPORT_SYMBOL_GPL(cpufreq_remove_update_util_hook);
 
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index eece6244f9d2..2f83ac898c34 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -59,7 +59,8 @@  static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
 
 /************************ Governor internals ***********************/
 
-static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
+static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time,
+				     unsigned int flags)
 {
 	s64 delta_ns;
 
@@ -87,13 +88,16 @@  static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
 		return true;
 	}
 
+	if (unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
+		return true;
+
 	delta_ns = time - sg_policy->last_freq_update_time;
 
 	return delta_ns >= sg_policy->freq_update_delay_ns;
 }
 
 static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
-				   unsigned int next_freq)
+				   unsigned int next_freq, unsigned int flags)
 {
 	if (sg_policy->need_freq_update)
 		sg_policy->need_freq_update = cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS);
@@ -101,7 +105,9 @@  static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
 		return false;
 
 	sg_policy->next_freq = next_freq;
-	sg_policy->last_freq_update_time = time;
+
+	if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
+		sg_policy->last_freq_update_time = time;
 
 	return true;
 }
@@ -249,9 +255,10 @@  static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
 			       unsigned int flags)
 {
 	bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
+	bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE;
 
 	/* Reset boost if the CPU appears to have been idle enough */
-	if (sg_cpu->iowait_boost &&
+	if (sg_cpu->iowait_boost && !forced_update &&
 	    sugov_iowait_reset(sg_cpu, time, set_iowait_boost))
 		return;
 
@@ -294,17 +301,29 @@  static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
  * being more conservative on tasks which does sporadic IO operations.
  */
 static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
-			       unsigned long max_cap)
+			       unsigned long max_cap, unsigned int flags)
 {
+	bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE;
+	s64 delta_ns = time - sg_cpu->last_update;
+
 	/* No boost currently required */
 	if (!sg_cpu->iowait_boost)
 		return 0;
 
+	if (forced_update)
+		goto apply_boost;
+
 	/* Reset boost if the CPU appears to have been idle enough */
 	if (sugov_iowait_reset(sg_cpu, time, false))
 		return 0;
 
 	if (!sg_cpu->iowait_boost_pending) {
+		/*
+		 * Reduce boost only if a tick has elapsed since last request.
+		 */
+		if (delta_ns <= TICK_NSEC)
+			goto apply_boost;
+
 		/*
 		 * No boost pending; reduce the boost value.
 		 */
@@ -315,6 +334,7 @@  static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
 		}
 	}
 
+apply_boost:
 	sg_cpu->iowait_boost_pending = false;
 
 	/*
@@ -351,17 +371,28 @@  static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
 					      u64 time, unsigned long max_cap,
 					      unsigned int flags)
 {
+	bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE;
+	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+	bool iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
 	unsigned long boost;
 
+	/*
+	 * Forced updates are initiated by iowait and RT/DL tasks. If the
+	 * latter, verify that it's not our worker thread that is initiating
+	 * this forced update.
+	 */
+	if (forced_update && !iowait_boost && current == sg_policy->thread)
+		return false;
+
 	sugov_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
 
 	ignore_dl_rate_limit(sg_cpu);
 
-	if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
+	if (!sugov_should_update_freq(sg_cpu->sg_policy, time, flags))
 		return false;
 
-	boost = sugov_iowait_apply(sg_cpu, time, max_cap);
+	boost = sugov_iowait_apply(sg_cpu, time, max_cap, flags);
 	sugov_get_util(sg_cpu, boost);
 
 	return true;
@@ -397,7 +428,7 @@  static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
 		sg_policy->cached_raw_freq = cached_freq;
 	}
 
-	if (!sugov_update_next_freq(sg_policy, time, next_f))
+	if (!sugov_update_next_freq(sg_policy, time, next_f, flags))
 		return;
 
 	/*
@@ -449,10 +480,12 @@  static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
 	cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
 				   sg_cpu->util, max_cap);
 
-	sg_cpu->sg_policy->last_freq_update_time = time;
+	if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
+		sg_cpu->sg_policy->last_freq_update_time = time;
 }
 
-static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
+static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time,
+					   unsigned int flags)
 {
 	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
 	struct cpufreq_policy *policy = sg_policy->policy;
@@ -465,7 +498,7 @@  static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
 		struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
 		unsigned long boost;
 
-		boost = sugov_iowait_apply(j_sg_cpu, time, max_cap);
+		boost = sugov_iowait_apply(j_sg_cpu, time, max_cap, flags);
 		sugov_get_util(j_sg_cpu, boost);
 
 		util = max(j_sg_cpu->util, util);
@@ -478,9 +511,19 @@  static void
 sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
 {
 	struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
+	bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE;
 	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+	bool iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
 	unsigned int next_f;
 
+	/*
+	 * Forced updates are initiated by iowait and RT/DL tasks. If the
+	 * latter, verify that it's not our worker thread that is initiating
+	 * this forced update.
+	 */
+	if (forced_update && !iowait_boost && current == sg_policy->thread)
+		return;
+
 	raw_spin_lock(&sg_policy->update_lock);
 
 	sugov_iowait_boost(sg_cpu, time, flags);
@@ -488,10 +531,10 @@  sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
 
 	ignore_dl_rate_limit(sg_cpu);
 
-	if (sugov_should_update_freq(sg_policy, time)) {
-		next_f = sugov_next_freq_shared(sg_cpu, time);
+	if (sugov_should_update_freq(sg_policy, time, flags)) {
+		next_f = sugov_next_freq_shared(sg_cpu, time, flags);
 
-		if (!sugov_update_next_freq(sg_policy, time, next_f))
+		if (!sugov_update_next_freq(sg_policy, time, next_f, flags))
 			goto unlock;
 
 		if (sg_policy->policy->fast_switch_enabled)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a04a436af8cc..02c9c2488091 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -252,8 +252,6 @@  void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 	dl_rq->running_bw += dl_bw;
 	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
 	SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
-	/* kick cpufreq (see the comment in kernel/sched/sched.h). */
-	cpufreq_update_util(rq_of_dl_rq(dl_rq), 0);
 }
 
 static inline
@@ -266,8 +264,6 @@  void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 	SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
 	if (dl_rq->running_bw > old)
 		dl_rq->running_bw = 0;
-	/* kick cpufreq (see the comment in kernel/sched/sched.h). */
-	cpufreq_update_util(rq_of_dl_rq(dl_rq), 0);
 }
 
 static inline
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c8e50fbac345..a0692e308d2d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3982,29 +3982,6 @@  static inline void update_cfs_group(struct sched_entity *se)
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
-{
-	struct rq *rq = rq_of(cfs_rq);
-
-	if (&rq->cfs == cfs_rq) {
-		/*
-		 * There are a few boundary cases this might miss but it should
-		 * get called often enough that that should (hopefully) not be
-		 * a real problem.
-		 *
-		 * It will not get called when we go idle, because the idle
-		 * thread is a different class (!fair), nor will the utilization
-		 * number include things like RT tasks.
-		 *
-		 * As is, the util number is not freq-invariant (we'd have to
-		 * implement arch_scale_freq_capacity() for that).
-		 *
-		 * See cpu_util_cfs().
-		 */
-		cpufreq_update_util(rq, flags);
-	}
-}
-
 #ifdef CONFIG_SMP
 static inline bool load_avg_is_decayed(struct sched_avg *sa)
 {
@@ -4682,7 +4659,7 @@  static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 
 	add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
 
-	cfs_rq_util_change(cfs_rq, 0);
+	cpufreq_update_util(rq_of(cfs_rq), 0);
 
 	trace_pelt_cfs_tp(cfs_rq);
 }
@@ -4712,7 +4689,7 @@  static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 
 	add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
 
-	cfs_rq_util_change(cfs_rq, 0);
+	cpufreq_update_util(rq_of(cfs_rq), 0);
 
 	trace_pelt_cfs_tp(cfs_rq);
 }
@@ -4729,7 +4706,6 @@  static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	u64 now = cfs_rq_clock_pelt(cfs_rq);
-	int decayed;
 
 	/*
 	 * Track task load average for carrying it to new CPU after migrated, and
@@ -4738,8 +4714,8 @@  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
 		__update_load_avg_se(now, cfs_rq, se);
 
-	decayed  = update_cfs_rq_load_avg(now, cfs_rq);
-	decayed |= propagate_entity_load_avg(se);
+	cfs_rq->decayed  = update_cfs_rq_load_avg(now, cfs_rq);
+	cfs_rq->decayed |= propagate_entity_load_avg(se);
 
 	if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
 
@@ -4760,11 +4736,8 @@  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 		 */
 		detach_entity_load_avg(cfs_rq, se);
 		update_tg_load_avg(cfs_rq);
-	} else if (decayed) {
-		cfs_rq_util_change(cfs_rq, 0);
-
-		if (flags & UPDATE_TG)
-			update_tg_load_avg(cfs_rq);
+	} else if (cfs_rq->decayed && (flags & UPDATE_TG)) {
+		update_tg_load_avg(cfs_rq);
 	}
 }
 
@@ -5126,7 +5099,6 @@  static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 
 static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
 {
-	cfs_rq_util_change(cfs_rq, 0);
 }
 
 static inline void remove_entity_load_avg(struct sched_entity *se) {}
@@ -6716,14 +6688,6 @@  enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	 */
 	util_est_enqueue(&rq->cfs, p);
 
-	/*
-	 * If in_iowait is set, the code below may not trigger any cpufreq
-	 * utilization updates, so do it here explicitly with the IOWAIT flag
-	 * passed.
-	 */
-	if (p->in_iowait)
-		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
-
 	for_each_sched_entity(se) {
 		if (se->on_rq)
 			break;
@@ -9282,10 +9246,6 @@  static bool __update_blocked_others(struct rq *rq, bool *done)
 	unsigned long thermal_pressure;
 	bool decayed;
 
-	/*
-	 * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
-	 * DL and IRQ signals have been updated before updating CFS.
-	 */
 	curr_class = rq->curr->sched_class;
 
 	thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
@@ -12623,6 +12583,7 @@  static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	update_misfit_status(curr, rq);
 	update_overutilized_status(task_rq(curr));
+	cpufreq_update_util(rq, 0);
 
 	task_tick_core(rq, curr);
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3261b067b67e..fe6d8b0ffa95 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -556,11 +556,8 @@  static void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
 
 	rt_se = rt_rq->tg->rt_se[cpu];
 
-	if (!rt_se) {
+	if (!rt_se)
 		dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running);
-		/* Kick cpufreq (see the comment in kernel/sched/sched.h). */
-		cpufreq_update_util(rq_of_rt_rq(rt_rq), 0);
-	}
 	else if (on_rt_rq(rt_se))
 		dequeue_rt_entity(rt_se, 0);
 }
@@ -1065,9 +1062,6 @@  enqueue_top_rt_rq(struct rt_rq *rt_rq)
 		add_nr_running(rq, rt_rq->rt_nr_running);
 		rt_rq->rt_queued = 1;
 	}
-
-	/* Kick cpufreq (see the comment in kernel/sched/sched.h). */
-	cpufreq_update_util(rq, 0);
 }
 
 #if defined CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 41024c1c49b4..6a6de92448d1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -618,6 +618,11 @@  struct cfs_rq {
 		unsigned long	runnable_avg;
 	} removed;
 
+	/*
+	 * Store whether last update_load_avg() has decayed
+	 */
+	bool			decayed;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	u64			last_update_tg_load_avg;
 	unsigned long		tg_load_avg_contrib;
@@ -2961,6 +2966,8 @@  static inline u64 irq_time_read(int cpu)
 #ifdef CONFIG_CPU_FREQ
 DECLARE_PER_CPU(struct update_util_data __rcu *, cpufreq_update_util_data);
 
+DECLARE_STATIC_KEY_FALSE(cpufreq_update_enabled);
+
 /**
  * cpufreq_update_util - Take a note about CPU utilization changes.
  * @rq: Runqueue to carry out the update for.
@@ -2987,6 +2994,9 @@  static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
 {
 	struct update_util_data *data;
 
+	if (!static_branch_likely(&cpufreq_update_enabled))
+		return;
+
 	data = rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data,
 						  cpu_of(rq)));
 	if (data)