mbox series

[v7,0/2] Improve VM CPUfreq and task placement behavior

Message ID 20240919000837.1004642-1-davidai@google.com
Headers show
Series Improve VM CPUfreq and task placement behavior | expand

Message

David Dai Sept. 19, 2024, 12:08 a.m. UTC
Hi,

This patch series is a continuation of the talk Saravana gave at LPC 2022
titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist
of the talk is that workloads running in a guest VM get terrible task
placement and CPUfreq behavior when compared to running the same workload
in the host. Effectively, no EAS(Energy Aware Scheduling) for threads
inside VMs. This would make power and performance terrible just by running
the workload in a VM even if we assume there is zero virtualization
overhead.

With this series, a workload running in a VM gets the same task placement
and CPUfreq behavior as it would when running in the host.

The idea is to improve VM CPUfreq/sched behavior by:
- Having guest kernel do accurate load tracking by taking host CPU
  arch/type and frequency into account.
- Sharing vCPU frequency requirements with the host so that the
  host can do proper frequency scaling and task placement on the host side.

Based on feedback from RFC v1 proposal[4], we've revised our
implementation to using MMIO reads and writes to pass information
from/to host instead of using hypercalls. In our example, the
VMM(Virtual Machine Manager) translates the frequency requests into
Uclamp_min and applies it to the vCPU thread as a hint to the host
kernel.

To achieve the results below, configure the host to:
- Affine vCPUs to specific clusters.
- Set vCPU capacity to match the host CPU they are running on.

To make it easy for folks to try this out with CrosVM, we have put up
userspace patches[5][6]. With those patches, you can configure CrosVM
correctly by adding the options "--host-cpu-topology" and "--virt-cpufreq".

Results:
========

Here are some side-by-side comparisons of RFC v1 proposal vs the current
patch series and are labelled as follows.

- (RFC v1) UtilHyp = hypercall + util_guest
- (current) UClampMMIO = MMIO + UClamp_min

Use cases running a minimal system inside a VM on a Pixel 6:
============================================================

FIO
Higher is better
+-------------------+----------+---------+--------+------------+--------+
| Usecase(avg MB/s) | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| Seq Write         |     13.3 |    16.4 |   +23% |       13.6 |    +2% |
+-------------------+----------+---------+--------+------------+--------+
| Rand Write        |     11.2 |    12.9 |   +15% |       11.8 |    +8% |
+-------------------+----------+---------+--------+------------+--------+
| Seq Read          |      100 |     168 |   +68% |        138 |   +38% |
+-------------------+----------+---------+--------+------------+--------+
| Rand Read         |     20.5 |    35.6 |   +74% |       31.0 |   +51% |
+-------------------+----------+---------+--------+------------+--------+

CPU-based ML Inference Benchmark
Lower is better
+----------------+----------+------------+--------+------------+--------+
| Test Case (ms) | Baseline | UtilHyp    | %delta | UClampMMIO | %delta |
+----------------+----------+------------+--------+------------+--------+
| Cached Sample  |          |            |        |            |        |
| Inference      |     3.40 |       2.37 |   -30% |       2.99 |   -12% |
+----------------+----------+------------+--------+------------+--------+
| Small Sample   |          |            |        |            |        |
| Inference      |     9.87 |       6.78 |   -31% |       7.65 |   -22% |
+----------------+----------+------------+--------+------------+--------+
| Large Sample   |          |            |        |            |        |
| Inference      |    33.35 |      26.74 |   -20% |      31.05 |    -7% |
+----------------+----------+------------+--------+------------+--------+

Use cases running Android inside a VM on a Chromebook:
======================================================

PCMark (Emulates real world usecases)
Higher is better
+-------------------+----------+---------+--------+------------+--------+
| Test Case (score) | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| Weighted Total    |     6190 |    7442 |   +20% |       7171 |   +16% |
+-------------------+----------+---------+--------+------------+--------+
| Web Browsing      |     5461 |    6620 |   +21% |       6284 |   +15% |
+-------------------+----------+---------+--------+------------+--------+
| Video Editing     |     4891 |    5376 |   +10% |       5344 |    +9% |
+-------------------+----------+---------+--------+------------+--------+
| Writing           |     6929 |    8791 |   +27% |       8457 |   +22% |
+-------------------+----------+---------+--------+------------+--------+
| Photo Editing     |     7966 |   12057 |   +51% |      11881 |   +49% |
+-------------------+----------+---------+--------+------------+--------+
| Data Manipulation |     5596 |    6057 |    +8% |       5694 |    +2% |
+-------------------+----------+---------+--------+------------+--------+

PCMark Performance/mAh
Higher is better
+-------------------+----------+---------+--------+------------+--------+
|                   | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| Score/mAh         |       87 |     100 |   +15% |         92 |    +5% |
+-------------------+----------+---------+--------+------------+--------+

Roblox
Higher is better
+-------------------+----------+---------+--------+------------+--------+
|                   | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| FPS               |    17.92 |   21.82 |   +22% |      20.02 |   +12% |
+-------------------+----------+---------+--------+------------+--------+

Roblox Frames/mAh
Higher is better
+-------------------+----------+---------+--------+------------+--------+
|                   | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| Frames/mAh        |    77.91 |   84.46 |    +8% |     81.71  |     5% |
+-------------------+----------+---------+--------+------------+--------+

We've simplified our implementation based on community feedback to make
it less intrusive and to use a more generic MMIO interface for
communication with the host. The results show that the current design
still has tangible improvements over baseline. We'll continue looking
into ways to reduce the overhead of the MMIO read/writes and submit
separate and generic patches for that if we find any good optimizations.

Thanks,
David & Saravana

Cc: Saravana Kannan <saravanak@google.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Masami Hiramatsu <mhiramat@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Pavan Kondeti <quic_pkondeti@quicinc.com>
Cc: Gupta Pankaj <pankaj.gupta@amd.com>
Cc: Mel Gorman <mgorman@suse.de>

v6 -> v7:
-Updated .exit and .remove function type from int to void to match
 required types
-Added Reviewed-by tag from Rob on dt-bindings patch
-Dropped "depends on OF" as the driver doesn't depend on it

v5 -> v6:
-Renamed dt-binding documentation file to match compatible string
-Removed opp bindings from dt-binding examples
-Added register interface description as a comment in driver
-Performance info now initialized from the device via MMIO instead
 of opp bindings
-Updated driver to use perf tables or max perf depending on VMM
-Added initialization for sharing perf domain
-Updated driver to use .target instead of .target_index
-Updated .verify to handle both perf tables and max perf case
-Updated Kconfig dependencies

v4 -> v5:
-Added dt-binding description to allow for normalized frequencies
-Updated dt-binding examples with normalized frequency values
-Updated cpufreq exit to use dev_pm_opp_free_cpufreq_table to free tables
-Updated fast_switch and target_index to use entries from cpufreq tables
-Refreshed benchmark numbers using indexed frequencies
-Added missing header that was indirectly being used

v3 -> v4:
-Fixed dt-binding formatting issues
-Added additional dt-binding descriptions for “HW interfaces”
-Changed dt-binding to “qemu,virtual-cpufreq”
-Fixed Kconfig formatting issues
-Removed frequency downscaling when requesting frequency updates
-Removed ops and cpufreq driver data
-Added check to limit freq_scale to 1024
-Added KHZ in the register offset naming
-Added comments to explain FIE and not allowing dvfs_possible_from_any_cpu

v2 -> v3:
- Dropped patches adding new hypercalls
- Dropped patch adding util_guest in sched/fair
- Cpufreq driver now populates frequency using opp bindings
- Removed transition_delay_us=1 cpufreq setting as it was configured too
  agressively and resulted in poor I/O performance
- Modified guest cpufreq driver to read/write MMIO regions instead of
  using hypercalls to communicate with the host
- Modified guest cpufreq driver to pass frequency info instead of
  utilization of the current vCPU's runqueue which now takes
  iowait_boost into account from the schedutil governor
- Updated DT bindings for a virtual CPU frequency device
Userspace changes:
- Updated CrosVM patches to emulate a virtual cpufreq device
- Updated to newer userspace binaries when collecting more recent
  benchmark data

v1 -> v2:
- No functional changes.
- Added description for EAS and removed DVFS in coverletter.
- Added a v2 tag to the subject.
- Fixed up the inconsistent "units" between tables.
- Made sure everyone is To/Cc-ed for all the patches in the series.

[1] - https://lpc.events/event/16/contributions/1195/
[2] - https://lpc.events/event/16/contributions/1195/attachments/970/1893/LPC%202022%20-%20VM%20DVFS.pdf
[3] - https://www.youtube.com/watch?v=hIg_5bg6opU
[4] - https://lore.kernel.org/all/20230331014356.1033759-1-davidai@google.com/
[5] - https://chromium-review.googlesource.com/c/crosvm/crosvm/+/4208668
[6] - https://chromium-review.googlesource.com/q/topic:%22virtcpufreq-v6%22

David Dai (2):
  dt-bindings: cpufreq: add virtual cpufreq device
  cpufreq: add virtual-cpufreq driver

 .../cpufreq/qemu,virtual-cpufreq.yaml         |  48 +++
 drivers/cpufreq/Kconfig                       |  14 +
 drivers/cpufreq/Makefile                      |   1 +
 drivers/cpufreq/virtual-cpufreq.c             | 333 ++++++++++++++++++
 include/linux/arch_topology.h                 |   1 +
 5 files changed, 397 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,virtual-cpufreq.yaml
 create mode 100644 drivers/cpufreq/virtual-cpufreq.c

Comments

Viresh Kumar Oct. 1, 2024, 9:25 a.m. UTC | #1
On 18-09-24, 17:08, David Dai wrote:
> Hi,
> 
> This patch series is a continuation of the talk Saravana gave at LPC 2022
> titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist
> of the talk is that workloads running in a guest VM get terrible task
> placement and CPUfreq behavior when compared to running the same workload
> in the host. Effectively, no EAS(Energy Aware Scheduling) for threads
> inside VMs. This would make power and performance terrible just by running
> the workload in a VM even if we assume there is zero virtualization
> overhead.

> David Dai (2):
>   dt-bindings: cpufreq: add virtual cpufreq device
>   cpufreq: add virtual-cpufreq driver
> 
>  .../cpufreq/qemu,virtual-cpufreq.yaml         |  48 +++
>  drivers/cpufreq/Kconfig                       |  14 +
>  drivers/cpufreq/Makefile                      |   1 +
>  drivers/cpufreq/virtual-cpufreq.c             | 333 ++++++++++++++++++
>  include/linux/arch_topology.h                 |   1 +
>  5 files changed, 397 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,virtual-cpufreq.yaml
>  create mode 100644 drivers/cpufreq/virtual-cpufreq.c

LGTM.

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Saravana Kannan Oct. 25, 2024, 10:25 p.m. UTC | #2
On Tue, Oct 1, 2024 at 2:25 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 18-09-24, 17:08, David Dai wrote:
> > Hi,
> >
> > This patch series is a continuation of the talk Saravana gave at LPC 2022
> > titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist
> > of the talk is that workloads running in a guest VM get terrible task
> > placement and CPUfreq behavior when compared to running the same workload
> > in the host. Effectively, no EAS(Energy Aware Scheduling) for threads
> > inside VMs. This would make power and performance terrible just by running
> > the workload in a VM even if we assume there is zero virtualization
> > overhead.
>
> > David Dai (2):
> >   dt-bindings: cpufreq: add virtual cpufreq device
> >   cpufreq: add virtual-cpufreq driver
> >
> >  .../cpufreq/qemu,virtual-cpufreq.yaml         |  48 +++
> >  drivers/cpufreq/Kconfig                       |  14 +
> >  drivers/cpufreq/Makefile                      |   1 +
> >  drivers/cpufreq/virtual-cpufreq.c             | 333 ++++++++++++++++++
> >  include/linux/arch_topology.h                 |   1 +
> >  5 files changed, 397 insertions(+)
> >  create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,virtual-cpufreq.yaml
> >  create mode 100644 drivers/cpufreq/virtual-cpufreq.c
>
> LGTM.
>
> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

Rafael/Viresh,

Nudge... Any chance this will get pulled into 6.12?

Thanks,
Saravana
Rafael J. Wysocki Oct. 28, 2024, 11:39 a.m. UTC | #3
On Sat, Oct 26, 2024 at 12:26 AM Saravana Kannan <saravanak@google.com> wrote:
>
> On Tue, Oct 1, 2024 at 2:25 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> >
> > On 18-09-24, 17:08, David Dai wrote:
> > > Hi,
> > >
> > > This patch series is a continuation of the talk Saravana gave at LPC 2022
> > > titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist
> > > of the talk is that workloads running in a guest VM get terrible task
> > > placement and CPUfreq behavior when compared to running the same workload
> > > in the host. Effectively, no EAS(Energy Aware Scheduling) for threads
> > > inside VMs. This would make power and performance terrible just by running
> > > the workload in a VM even if we assume there is zero virtualization
> > > overhead.
> >
> > > David Dai (2):
> > >   dt-bindings: cpufreq: add virtual cpufreq device
> > >   cpufreq: add virtual-cpufreq driver
> > >
> > >  .../cpufreq/qemu,virtual-cpufreq.yaml         |  48 +++
> > >  drivers/cpufreq/Kconfig                       |  14 +
> > >  drivers/cpufreq/Makefile                      |   1 +
> > >  drivers/cpufreq/virtual-cpufreq.c             | 333 ++++++++++++++++++
> > >  include/linux/arch_topology.h                 |   1 +
> > >  5 files changed, 397 insertions(+)
> > >  create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,virtual-cpufreq.yaml
> > >  create mode 100644 drivers/cpufreq/virtual-cpufreq.c
> >
> > LGTM.
> >
> > Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
>
> Rafael/Viresh,
>
> Nudge... Any chance this will get pulled into 6.12?

This is not a fix AFAICS, so 6.12 is out of the question.

As for 6.13, Viresh thinks that this change is a good idea (or he
wouldn't have ACKed it), so it's up to him.  I'm still not convinced
that it will work on x86 or anything that doesn't use DT.

Viresh, I think that this falls into your bucket.
Sudeep Holla Oct. 28, 2024, 12:33 p.m. UTC | #4
On Mon, Oct 28, 2024 at 12:39:31PM +0100, Rafael J. Wysocki wrote:
> On Sat, Oct 26, 2024 at 12:26 AM Saravana Kannan <saravanak@google.com> wrote:
> >
> > On Tue, Oct 1, 2024 at 2:25 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > >
> > > On 18-09-24, 17:08, David Dai wrote:
> > > > Hi,
> > > >
> > > > This patch series is a continuation of the talk Saravana gave at LPC 2022
> > > > titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist
> > > > of the talk is that workloads running in a guest VM get terrible task
> > > > placement and CPUfreq behavior when compared to running the same workload
> > > > in the host. Effectively, no EAS(Energy Aware Scheduling) for threads
> > > > inside VMs. This would make power and performance terrible just by running
> > > > the workload in a VM even if we assume there is zero virtualization
> > > > overhead.
> > >
> > > > David Dai (2):
> > > >   dt-bindings: cpufreq: add virtual cpufreq device
> > > >   cpufreq: add virtual-cpufreq driver
> > > >
> > > >  .../cpufreq/qemu,virtual-cpufreq.yaml         |  48 +++
> > > >  drivers/cpufreq/Kconfig                       |  14 +
> > > >  drivers/cpufreq/Makefile                      |   1 +
> > > >  drivers/cpufreq/virtual-cpufreq.c             | 333 ++++++++++++++++++
> > > >  include/linux/arch_topology.h                 |   1 +
> > > >  5 files changed, 397 insertions(+)
> > > >  create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,virtual-cpufreq.yaml
> > > >  create mode 100644 drivers/cpufreq/virtual-cpufreq.c
> > >
> > > LGTM.
> > >
> > > Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
> >
> > Rafael/Viresh,
> >
> > Nudge... Any chance this will get pulled into 6.12?
>
> This is not a fix AFAICS, so 6.12 is out of the question.
>
> As for 6.13, Viresh thinks that this change is a good idea (or he
> wouldn't have ACKed it), so it's up to him.  I'm still not convinced
> that it will work on x86 or anything that doesn't use DT.
>

+1, I was about to comment on DT bindings patch, but then I assumed it is
accepted to have a device object with similar CID and CRS(for register address)
in ACPI for example. But yes, the patch itself is not adding support for that
yet. If not is not the way, then we need to come up with a way that works
for both ACPI and DT.

--
Regards,
Sudeep
Sudeep Holla Oct. 28, 2024, 12:52 p.m. UTC | #5
On Mon, Oct 28, 2024 at 01:43:38PM +0100, Rafael J. Wysocki wrote:
> On Mon, Oct 28, 2024 at 1:33 PM Sudeep Holla <sudeep.holla@arm.com> wrote:
> >
> > On Mon, Oct 28, 2024 at 12:39:31PM +0100, Rafael J. Wysocki wrote:
> > > On Sat, Oct 26, 2024 at 12:26 AM Saravana Kannan <saravanak@google.com> wrote:
> > > >
> > > > On Tue, Oct 1, 2024 at 2:25 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > > > >
> > > > > On 18-09-24, 17:08, David Dai wrote:
> > > > > > Hi,
> > > > > >
> > > > > > This patch series is a continuation of the talk Saravana gave at LPC 2022
> > > > > > titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist
> > > > > > of the talk is that workloads running in a guest VM get terrible task
> > > > > > placement and CPUfreq behavior when compared to running the same workload
> > > > > > in the host. Effectively, no EAS(Energy Aware Scheduling) for threads
> > > > > > inside VMs. This would make power and performance terrible just by running
> > > > > > the workload in a VM even if we assume there is zero virtualization
> > > > > > overhead.
> > > > >
> > > > > > David Dai (2):
> > > > > >   dt-bindings: cpufreq: add virtual cpufreq device
> > > > > >   cpufreq: add virtual-cpufreq driver
> > > > > >
> > > > > >  .../cpufreq/qemu,virtual-cpufreq.yaml         |  48 +++
> > > > > >  drivers/cpufreq/Kconfig                       |  14 +
> > > > > >  drivers/cpufreq/Makefile                      |   1 +
> > > > > >  drivers/cpufreq/virtual-cpufreq.c             | 333 ++++++++++++++++++
> > > > > >  include/linux/arch_topology.h                 |   1 +
> > > > > >  5 files changed, 397 insertions(+)
> > > > > >  create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,virtual-cpufreq.yaml
> > > > > >  create mode 100644 drivers/cpufreq/virtual-cpufreq.c
> > > > >
> > > > > LGTM.
> > > > >
> > > > > Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
> > > >
> > > > Rafael/Viresh,
> > > >
> > > > Nudge... Any chance this will get pulled into 6.12?
> > >
> > > This is not a fix AFAICS, so 6.12 is out of the question.
> > >
> > > As for 6.13, Viresh thinks that this change is a good idea (or he
> > > wouldn't have ACKed it), so it's up to him.  I'm still not convinced
> > > that it will work on x86 or anything that doesn't use DT.
> > >
> >
> > +1, I was about to comment on DT bindings patch, but then I assumed it is
> > accepted to have a device object with similar CID and CRS(for register address)
> > in ACPI for example.
> 
> Well, where would the device ID be defined for this?  The spec or
> somewhere else?  If the latter, then where again?
>

Yes, we need to figure those details, but I assumed that is the general
idea to get it working in ACPI. We can figure out details when we have to
add it.

> > But yes, the patch itself is not adding support for that
> > yet. If not is not the way, then we need to come up with a way that works
> > for both ACPI and DT.
> 
> The DT use case is there I think and so I don't want to block it just
> because there is no ACPI counterpart.  It can be added later if the
> use case is relevant enough.

Agreed and that was my thoughts as well.

--
Regards,
Sudeep
Saravana Kannan Oct. 30, 2024, 8:24 p.m. UTC | #6
On Mon, Oct 28, 2024 at 11:36 PM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 18-09-24, 17:08, David Dai wrote:
> > v6 -> v7:
> > -Updated .exit and .remove function type from int to void to match
> >  required types
> > -Added Reviewed-by tag from Rob on dt-bindings patch
> > -Dropped "depends on OF" as the driver doesn't depend on it
>
> > David Dai (2):
> >   dt-bindings: cpufreq: add virtual cpufreq device
> >   cpufreq: add virtual-cpufreq driver
> >
> >  .../cpufreq/qemu,virtual-cpufreq.yaml         |  48 +++
> >  drivers/cpufreq/Kconfig                       |  14 +
> >  drivers/cpufreq/Makefile                      |   1 +
> >  drivers/cpufreq/virtual-cpufreq.c             | 333 ++++++++++++++++++
> >  include/linux/arch_topology.h                 |   1 +
> >  5 files changed, 397 insertions(+)
> >  create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,virtual-cpufreq.yaml
> >  create mode 100644 drivers/cpufreq/virtual-cpufreq.c
>
> Applied for v6.13. Thanks.

Thanks!

-Saravana