x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula

Phoronix.com discovered a severe performance regression on AMD APYC
introduced on schedutil [see link 1] by the following commits from v5.11-rc1

    commit 41ea667227ba ("x86, sched: Calculate frequency invariance for AMD systems")
    commit 976df7e5730e ("x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC")

Furthermore commit db865272d9c4 ("cpufreq: Avoid configuring old governors as
default with intel_pstate") from v5.10 made it extremely easy to default to
schedutil even if the preferred driver is acpi_cpufreq. Distros are likely to
build both intel_pstate and acpi_cpufreq on x86, and the presence of the
former removes ondemand from the defaults. This situation amplifies the
visibility of the bug we're addressing.

[link 1] https://www.phoronix.com/scan.php?page=article&item=linux511-amd-schedutil&num=1

1. PROBLEM DESCRIPTION   : over-utilization and schedutil
2. PROPOSED SOLUTION     : raise freq_max in schedutil formula
3. DATA TABLE            : image processing benchmark
4. ANALYSIS AND COMMENTS : with over-utilization, freq-invariance is lost

1. PROBLEM DESCRIPTION (over-utilization and schedutil)

The problem happens on CPU-bound workloads spanning a large number of cores.
In this case schedutil won't select the maximum P-State. Actually, it's
likely that it will select the minimum one.

A CPU-bound workload puts the machine in a state generally called
"over-utilization": an increase in CPU speed doesn't result in an increase of
capacity. The fraction of time tasks spend on CPU becomes constant regardless
of clock frequency (the tasks eat whatever we throw at them), and the PELT
invariant util goes up and down with the frequency (i.e. it's not invariant
anymore).

2. PROPOSED SOLUTION (raise freq_max in schedutil formula)

The solution we implement here is a stop-gap one: when the driver is
acpi_cpufreq and the machine an AMD EPYC, schedutil will use max_boost instead
of max_P as the value for freq_max in its formula

    freq_next = 1.25 * freq_max * util

essentially giving freq_next some more headroom to grow in the over-utilized
case. This is the approach also followed by intel_pstate in passive mode.

The correct way to attack this problem would be to have schedutil detect
over-utilization and select freq_max irrespective of the util value, which has
no meaning at that point. This approach is too risky for an -rc5 submission so
we defer it to the next cycle.

3. DATA TABLE (image processing benchmark)

What follow is a more detailed account of the effects on a specific test.

TEST        : Intel Open Image Denoise, www.openimagedenoise.org
INVOCATION  : ./denoise -hdr memorial.pfm -out out.pfm -bench 200 -threads $NTHREADS
CPU         : MODEL            : 2x AMD EPYC 7742
              FREQUENCY TABLE  : P2: 1.50 GHz
                                 P1: 2.00 GHz
				 P0: 2.25 GHz
              MAX BOOST        :     3.40 GHz

Results: threads, msecs (ratio). Lower is better.

               v5.10          v5.11-rc4    v5.11-rc4-patch
    -------------------------------------------------------
      1   1069.85 (1.00)   1071.84 (1.00)   1070.42 (1.00)
      2    542.24 (1.00)    544.40 (1.00)    544.48 (1.00)
      4    278.00 (1.00)    278.44 (1.00)    277.72 (1.00)
      8    149.81 (1.00)    149.61 (1.00)    149.87 (1.00)
     16     79.01 (1.00)     79.31 (1.00)     78.94 (1.00)
     24     58.01 (1.00)     58.51 (1.01)     58.15 (1.00)
     32     46.58 (1.00)     48.30 (1.04)     46.66 (1.00)
     48     37.29 (1.00)     51.29 (1.38)     37.27 (1.00)
     64     34.01 (1.00)     49.59 (1.46)     33.71 (0.99)
     80     31.09 (1.00)     44.27 (1.42)     31.33 (1.01)
     96     28.56 (1.00)     40.82 (1.43)     28.47 (1.00)
    112     28.09 (1.00)     40.06 (1.43)     28.63 (1.02)
    120     28.73 (1.00)     39.78 (1.38)     28.14 (0.98)
    128     28.93 (1.00)     39.60 (1.37)     29.38 (1.02)

See how the 128 threads case is almost 40% worse than baseline in v5.11-rc4.

4. ANALYSIS AND COMMENTS (with over-utilization freq-invariance is lost)

Statistics for NTHREADS=128 (number of physical cores of the machine)

                                      v5.10          v5.11-rc4
                                      ------------------------
CPU activity (mpstat)                 80-90%         80-90%
schedutil requests (tracepoint)       always P0      mostly P2
CPU frequency (HW feedback)           ~2.2 GHz       ~1.5 GHz
PELT root rq util (tracepoint)        ~825           ~450

mpstat shows that the workload is CPU-bound and usage doesn't change with
clock speed. What is striking is that the PELT util of any root runqueue in
v5.11-rc4 is half of what used to be before the frequency invariant support
(v5.10), leading to wrong frequency choices. How did we get there?

This workload is constant in time, so instead of using the PELT sum we can
pretend that scale invariance is obtained with

    util_inv = util_raw * freq_curr / freq_max1        [formula-1]

where util_raw is the PELT util from v5.10 (which is to say, not invariant),
and util_inv is the PELT util from v5.11-rc4. freq_max1 comes from
commit 976df7e5730e ("x86, sched: Use midpoint of max_boost and max_P for
frequency invariance on AMD EPYC") and is (P0+max_boost)/2 = (2.25+3.4)/2 =
2.825 GHz.  Then we have the schedutil formula

    freq_next = 1.25 * freq_max2 * util_inv            [formula-2]

Here v5.11-rc4 uses freq_max2 = P0 = 2.25 GHz (and this patch changes it to
3.4 GHz).

Since all cores are busy, there is no boost available. Let's be generous and say
the tasks initially get P0, i.e. freq_curr = 2.25 GHz. Combining the formulas
above and taking util_raw = 825/1024 = 0.8, freq_next is:

    freq_next = 1.25 * 2.25 * 0.8 * 2.25 / 2.825 = 1.79 GHz

After quantization (pick the next frequency up in the table), freq_next is
P1 = 2.0 GHz. See how we lost 250 MHz in the process. Iterate once more,
freq_next become 1.59 GHz. Since it's > P2, it's saved by quantization and P1
is selected, but if util_raw fluctuates a little and goes below 0.75, P0 is
selected and that kills util_inv by formula-1, which gives util_inv = 0.4.

The culprit of the problem is that with over-utilization, util_raw and
freq_curr in formula-1 are independent. In the nominal case, if freq_curr goes
up then util_raw goes down and viceversa. Here util_raw doesn't care and stays
constant. If freq_curr descrease, util_inv decreases too and so forth (it's a
feedback loop).

Fixes: 41ea667227ba ("x86, sched: Calculate frequency invariance for AMD systems")
Fixes: 976df7e5730e ("x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC")
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
---
 drivers/cpufreq/freq_table.c     | 51 ++++++++++++++++++++++++++++++++
 include/linux/cpufreq.h          |  5 ++++
 kernel/sched/cpufreq_schedutil.c |  8 +++--
 3 files changed, 62 insertions(+), 2 deletions(-)

Message ID	20210121003550.20415-1-ggherdovich@suse.cz
State	New
Headers	show Return-Path: <linux-pm-owner@kernel.org> X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 508CAC4321A for <linux-pm@archiver.kernel.org>; Thu, 21 Jan 2021 02:09:40 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 350D82388E for <linux-pm@archiver.kernel.org>; Thu, 21 Jan 2021 02:09:40 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728574AbhAUAwX (ORCPT <rfc822;linux-pm@archiver.kernel.org>); Wed, 20 Jan 2021 19:52:23 -0500 Received: from mx2.suse.de ([195.135.220.15]:56374 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726671AbhAUAgl (ORCPT <rfc822;linux-pm@vger.kernel.org>); Wed, 20 Jan 2021 19:36:41 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id 5E99DAC9B; Thu, 21 Jan 2021 00:35:59 +0000 (UTC) From: Giovanni Gherdovich <ggherdovich@suse.cz> To: Borislav Petkov <bp@alien8.de>, Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>, "Rafael J . Wysocki" <rjw@rjwysocki.net>, Viresh Kumar <viresh.kumar@linaro.org> Cc: Jon Grimm <Jon.Grimm@amd.com>, Nathan Fontenot <Nathan.Fontenot@amd.com>, Yazen Ghannam <Yazen.Ghannam@amd.com>, Thomas Lendacky <Thomas.Lendacky@amd.com>, Suthikulpanit Suravee <Suravee.Suthikulpanit@amd.com>, Mel Gorman <mgorman@techsingularity.net>, Pu Wen <puwen@hygon.cn>, Juri Lelli <juri.lelli@redhat.com>, Vincent Guittot <vincent.guittot@linaro.org>, Dietmar Eggemann <dietmar.eggemann@arm.com>, x86@kernel.org, linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org, Giovanni Gherdovich <ggherdovich@suse.cz> Subject: [PATCH] x86, sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula Date: Thu, 21 Jan 2021 01:35:50 +0100 Message-Id: <20210121003550.20415-1-ggherdovich@suse.cz> X-Mailer: git-send-email 2.26.2 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: <linux-pm.vger.kernel.org> X-Mailing-List: linux-pm@vger.kernel.org
Series	x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula \| expand x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula

x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula

Commit Message

Patch