[RFC,00/26] Intel Thread Director Virtualization

Message ID 20240203091214.411862-1-zhao1.liu@linux.intel.com

Message

Zhao Liu Feb. 3, 2024, 9:11 a.m. UTC
From: Zhao Liu <zhao1.liu@intel.com>

Hi list,

This is our RFC to virtualize the Intel Thread Director (ITD) feature
for Guest. It is based on Ricardo's patch series that adds ITD related
support to the HFI driver ("[PATCH 0/9] thermal: intel: hfi: Prework for
the virtualization of HFI" [1]).

In short, the purpose of this patch set is to enable the ITD-based
scheduling logic in Guest so that Guest can better schedule Guest tasks
on Intel hybrid platforms.

Currently, ITD is necessary for Windows VMs. With ITD virtualization
support, a Windows 11 Guest can see a significant performance
improvement (for example, up to 14%+ on 3DMARK on an i9-13900K).

Our ITD virtualization is not bound to the VM's hybrid topology or the
vCPUs' CPU affinity. However, in our practice, the ITD scheduling
optimization for Win11 VMs works best when combined with hybrid topology
and CPU affinity (this is related to the specific implementation of
Win11 scheduling). For more details, please see Section 1.2 "About
hybrid topology and vCPU pinning".

To enable the ITD related scheduling optimization in a Win11 VM, some
other thermal related support is also needed (HWP, CPPC), but we can
emulate it with dummy values in the VMM (we'll send out extra patches
for these in the future).

Your feedback is welcome!


1. Background and Motivation
============================

1.1. Background
^^^^^^^^^^^^^^^

Our use case is running games in a client Windows VM as a cloud gaming
solution.

Gaming VMs are performance-sensitive VMs on Client, so they usually
have two characteristics to ensure interactivity and performance:

i) The number of vCPUs is equal to or close to the number of Host pCPUs.

ii) The vCPUs of a Gaming VM are often bound to pCPUs to get exclusive
resources and avoid the overhead of migration.

In this case, Host can't provide effective scheduling for Guest, so we
need to deliver more hardware-assisted scheduling capabilities to Guest
to enhance Guest's scheduling.

Windows 11 (and future Windows products) is heavily optimized for the
Intel hybrid platform. To get the best performance, we need to
virtualize hybrid scheduling features (HFI/ITD) for Windows Guest.


1.2. About hybrid topology and vCPU pinning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Our ITD virtualization can support most vCPU topologies (except multiple
packages/dies, see details in 3.5 Restrictions on Guest Topology), and
can also support non-pinned vCPUs (i.e. it can handle vCPU thread
migration).

The following is our performance measurement on an i9-13900K machine
(2995 MHz, 24 cores, 32 threads (8+16), RAM: 14 GB (16 GB physical)),
with iGPU passthrough, running 3DMARK in a Win11 Professional Guest:


compared with smp topo case       smp topo        smp topo        smp topo      hybrid topo       hybrid topo     hybrid topo     hybrid topo
                                + affinity      + ITD           + ITD                           + affinity      + ITD           + ITD
                                                                + affinity                                                      + affinity
Time Spy - Overall                0.179%        -0.250%           0.179%        -0.107%           0.143%        -0.179%         -0.107%
Graphics score                    0.124%        -0.249%           0.124%        -0.083%           0.124%        -0.166%         -0.249%
CPU score                         0.916%        -0.485%           1.149%        -0.076%           0.722%        -0.324%         11.915%
Fire Strike Extreme - Overall     0.149%         0.000%           0.224%        -1.021%          -3.361%        -1.319%         -3.361%
Graphics score                    0.100%         0.050%           0.150%        -1.376%          -3.427%        -1.676%         -3.652%
Physics score                     5.060%         0.759%           0.518%        -2.907%         -10.914%        -0.897%         14.638%
Combined  score                   0.120%        -0.179%           0.418%         0.060%          -2.929%        -0.179%         -2.809%
Fire Strike - Overall             0.350%        -0.085%           0.193%        -1.377%          -1.365%        -1.509%         -1.787%
Graphics score                    0.256%        -0.047%           0.210%        -1.527%          -1.376%        -1.504%         -2.320%
Physics score                     3.695%        -2.180%           0.629%        -1.581%          -6.846%        -1.444%         14.100%
Combined  score                   0.415%        -0.128%           0.128%        -0.957%          -1.052%        -1.594%         -0.957%
CPU Profile Max Threads           1.836%         0.298%           1.786%        -0.069%           1.545%         0.025%          9.472%
16 Threads                        4.290%         0.989%           3.588%         0.595%           1.580%         0.848%         11.295%
8 Threads                       -22.632%        -0.602%         -23.167%        -0.988%          -1.345%        -1.340%          8.648%
4 Threads                       -21.598%         0.449%         -21.429%        -0.817%           1.951%        -0.832%          2.084%
2 Threads                       -12.912%        -0.014%         -12.006%        -0.481%          -0.609%        -0.595%          1.161%
1 Threads                        -3.793%        -0.137%          -3.793%        -0.495%          -3.189%        -0.495%          1.154%


Based on the above results, exposing only HFI/ITD to Win11 VMs without
hybrid topology or CPU affinity (case "smp topo + ITD") won't hurt
performance, but won't bring any performance improvement either.

With both hybrid topology and CPU affinity set for ITD, Win11 VMs get a
significant performance improvement (up to 14%+, compared with the case
of smp topology without CPU affinity).

The improvement is not limited to the 3DMARK scores: in practice, there
is also a significant improvement in the frame rate of games.

Also, the more powerful the machine, the more significant the
performance gains!

Therefore, the best practice for enabling the ITD scheduling
optimization is to set up both CPU affinity and hybrid topology for the
Win11 Guest while enabling our ITD virtualization.

Our earlier QEMU prototype RFC [2] presented the initial hybrid
topology support for VMs. Our other proposal, "QOM topology" [3], has
since been raised in the QEMU community as the first step towards a
hybrid topology implementation based on the QOM approach.


2. Introduction of HFI and ITD
==============================

Intel provides the Hardware Feedback Interface (HFI) feature to allow
hardware to give guidance to the OS scheduler, so that it can perform
optimal workload scheduling, through a hardware feedback interface
structure in memory [4]. This HFI structure is called the HFI table.

For now, the guidance includes performance and energy efficiency
hints, and it can be updated via thermal interrupt as the actual
operating conditions of the processor change during run time.

The Intel Thread Director (ITD) feature extends HFI to provide
performance and energy efficiency data for advanced classes of
instructions.

Since ITD is an extension of HFI, our ITD virtualization also
virtualizes the native HFI feature.


3. Dependencies of ITD
======================

ITD is a thermal feature that requires:
* PTM (Package Thermal Management, alias PTS)
* HFI (Hardware Feedback Interface)

In order to support the notification mechanism of ITD/HFI dynamic
update, we also need to add thermal interrupt related support,
including the following two features:
* ACPI (Thermal Monitor and Software Controlled Clock Facilities)
* TM (Thermal Monitor, alias, TM1/ACC)

Therefore, we must also consider support for the emulation of all
the above dependencies.
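
As a rough illustration of the dependency chain above, the check below
refuses an ITD-enabled Guest CPUID unless ACPI, TM, PTM and HFI are set
as well. The ACPI, TM and PTM bit positions are the ones quoted in the
patch titles of this series; the HFI and ITD bit positions
(CPUID.0x06.eax bits 19 and 23) are taken from our reading of the SDM.
The structure and the helper name are hypothetical and are not taken
from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* CPUID.0x01.edx feature bits (positions as quoted in the patch titles). */
#define CPUID_01_EDX_ACPI  (1u << 22)
#define CPUID_01_EDX_TM    (1u << 29)

/* CPUID.0x06.eax feature bits (HFI/ITD positions per our reading of the SDM). */
#define CPUID_06_EAX_PTM   (1u << 6)
#define CPUID_06_EAX_HFI   (1u << 19)
#define CPUID_06_EAX_ITD   (1u << 23)

/* Hypothetical container for the two Guest CPUID registers we look at. */
struct guest_cpuid {
	uint32_t leaf1_edx;
	uint32_t leaf6_eax;
};

/*
 * Hypothetical helper: return true if the Guest CPUID is consistent with
 * the dependencies listed in section 3. A Guest without ITD is always
 * accepted by this particular check.
 */
static bool itd_deps_satisfied(const struct guest_cpuid *c)
{
	const uint32_t need_edx = CPUID_01_EDX_ACPI | CPUID_01_EDX_TM;
	const uint32_t need_eax = CPUID_06_EAX_PTM | CPUID_06_EAX_HFI;

	assert(c != NULL);

	if (!(c->leaf6_eax & CPUID_06_EAX_ITD))
		return true;

	return (c->leaf1_edx & need_edx) == need_edx &&
	       (c->leaf6_eax & need_eax) == need_eax;
}
```

A VMM could run a check of this kind before setting the Guest CPUID;
whether such a check lives in the VMM or in KVM is not specified here.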


3.1. ACPI emulation
^^^^^^^^^^^^^^^^^^^

For ACPI, we can support it by emulating the RDMSR/WRMSR of the
associated MSRs and adding the ability to inject thermal interrupts.
But in fact, we don't really inject thermal interrupts into Guest for
the thermal conditions corresponding to ACPI. Here the thermal interrupt
is prepared for the subsequent HFI/ITD.
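
Emulating WRMSR for the thermal MSRs means honouring the type of each
bit: the series adds helpers for read-only (R/O) bits and for bits that
software can only clear by writing 0 (R/WC0). The sketch below shows
one way to express these write semantics. The function name and the
mask parameters are hypothetical, and it leaves out reserved-bit checks
and fault injection, which a real emulation would also need.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: compute the new value of an emulated MSR when the
 * Guest writes `data` over the current value `cur`.
 *
 *  - bits in ro_mask keep their current value;
 *  - bits in rwc0_mask can only be cleared: writing 0 clears the bit,
 *    writing 1 leaves it unchanged;
 *  - all remaining bits are plain read/write and take the written value.
 */
static uint64_t msr_apply_write(uint64_t cur, uint64_t data,
				uint64_t ro_mask, uint64_t rwc0_mask)
{
	uint64_t rw_mask;

	/* A bit is either R/O or R/WC0, never both. */
	assert((ro_mask & rwc0_mask) == 0);
	rw_mask = ~(ro_mask | rwc0_mask);

	return (cur & ro_mask) |
	       (cur & data & rwc0_mask) |
	       (data & rw_mask);
}
```

With this shape, a Guest write of 1 to an R/WC0 bit that is currently 0
cannot set it; only the emulation itself can.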


3.2. TM emulation
^^^^^^^^^^^^^^^^^

TM is a hardware feature and its CPUID bit only indicates the presence
of the automatic thermal monitoring facilities. For TM, there's no
interactive interface between OS and hardware, but its flag is one of
the prerequisites for the OS to enable thermal interrupt.

Therefore, to support TM, it is enough for us to expose its CPUID flag
to Guest.


3.3. PTM emulation
^^^^^^^^^^^^^^^^^^

PTM is a package-scope feature that includes package-level MSR and
package-level thermal interrupt. Unfortunately, KVM currently only
supports thread-scope MSR handling, and also doesn't care about the
specific Guest's topology.

But our purpose in supporting PTM in KVM is to further support ITD, and
the current platforms with ITD all have 1 package, so we emulate the
package-scope MSRs provided by PTM at the VM level.

As a result, the VMM is required to set a topology with only one
package when PTM is exposed. In order to alleviate this limitation, we
only expose the PTM feature bit to Guest when ITD needs to be supported.


3.4. HFI emulation
^^^^^^^^^^^^^^^^^^

ITD is the extension of HFI, so both HFI and ITD depend on HFI table.
HFI itself is used on the Host for power-related management control, so
we should only expose HFI to Guest when we need to enable ITD.

HFI also relies on PTM interrupt control, so it also has requirements
for package topology, and we also emulate HFI (including ITD) at the VM
level.

In addition, because the HFI driver allocates HFI instances per die,
HFI (and ITD) emulation is also affected, and the Guest must be limited
to a single die.


3.5. Restrictions on Guest Topology
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Due to KVM's incomplete support for MSR topology and the requirement for
HFI instance management in the kernel, PTM, HFI, and ITD limit the
topology of the Guest (mainly restricting the topology types created on
the VMM side).

Therefore, we only expose PTM, HFI, and ITD to userspace when we need to
support ITD. At the same time, considering that currently, ITD is only
used on the client platform with 1 package and 1 die, such temporary
restrictions will not have too much impact.
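
Put as code, the restriction amounts to a check like the one below. It
is only a sketch of the rule stated above: the structure, its field
names and the helper are hypothetical, and the real enforcement point
(VMM or KVM) is not implied by it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of the topology the VMM creates for the Guest. */
struct guest_topo {
	unsigned int packages;
	unsigned int dies_per_package;
};

/*
 * PTM, HFI and ITD are emulated at the VM level, so they can only be
 * exposed when the Guest has exactly one package and one die.
 */
static bool topo_allows_itd(const struct guest_topo *t)
{
	assert(t != NULL);
	return t->packages == 1 && t->dies_per_package == 1;
}
```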


4. Overview of ITD (and HFI) virtualization
===========================================

The main tasks of ITD (including HFI) virtualization are:
* maintain a virtual HFI table for VM.
* inject thermal interrupt when HFI table updates.
* handle related MSRs' emulation and adjust HFI table based on MSR's
  control bits.
* expose ITD/HFI configuration info in related CPUID leaves.

The most important of these is the maintenance of the virtual HFI table.
Although the HFI table should also be per package, since ITD/HFI related
MSRs are treated as per VM in KVM, we also treat the virtual HFI table
as per VM.


4.1. HFI table building
^^^^^^^^^^^^^^^^^^^^^^^

The HFI table contains a table header and many table entries. Each table
entry is identified by an HFI table index, and each CPU corresponds to
one of the HFI table indexes.

The ITD and HFI features both depend on the HFI table, but their HFI
tables are a little different. The HFI table provided by the ITD feature
has more classes (i.e. more columns in the table) than the HFI table of
the native HFI feature.

The virtual HFI table in KVM is built based on the actual HFI table,
which is maintained by the HFI instance in the HFI driver. We extract
the HFI data of the pCPUs on which the vCPUs are running to form the
virtual HFI table.
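
The building step can be pictured with the simplified model below. It
is not the memory layout defined by the SDM and not the code in these
patches: the structures, the fixed sizes and the function name are all
assumptions made for illustration. Each row holds one performance and
one energy efficiency value per class, and the virtual table is filled
by copying, for every vCPU, the host row of the pCPU it runs on into
the row selected by the vCPU's HFI table index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_CLASSES 4	/* assumed upper bound on classes (columns) */
#define MAX_ROWS    64	/* assumed upper bound on table entries (rows) */

/* One class column of a row: performance and energy efficiency hints. */
struct hfi_caps {
	uint8_t perf;
	uint8_t ee;
};

struct hfi_row {
	struct hfi_caps cls[MAX_CLASSES];
};

/* Simplified table: a small header followed by the rows. */
struct hfi_table {
	uint64_t timestamp;
	unsigned int nr_classes;	/* more columns when ITD is exposed */
	struct hfi_row row[MAX_ROWS];
};

/* Hypothetical per-vCPU mapping: Guest row index and current host row. */
struct vcpu_map {
	unsigned int hfi_index;	/* row of this vCPU in the virtual table */
	unsigned int pcpu_row;	/* host row of the pCPU it is running on */
};

static void build_virtual_table(struct hfi_table *virt,
				const struct hfi_table *host,
				const struct vcpu_map *map, size_t nr_vcpus)
{
	assert(host->nr_classes <= MAX_CLASSES);

	memset(virt, 0, sizeof(*virt));
	virt->timestamp = host->timestamp;
	virt->nr_classes = host->nr_classes;

	for (size_t i = 0; i < nr_vcpus; i++) {
		assert(map[i].hfi_index < MAX_ROWS);
		assert(map[i].pcpu_row < MAX_ROWS);
		/* Copy all class columns of the pCPU's row in one go. */
		virt->row[map[i].hfi_index] = host->row[map[i].pcpu_row];
	}
}
```

Rows of the virtual table that no vCPU maps to stay zeroed in this
model.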


4.2. HFI table index
^^^^^^^^^^^^^^^^^^^^

There are many entries in the HFI table, and each vCPU is assigned an
HFI table index that specifies the entry it maps to. KVM fills the
pCPU's HFI data (for the pCPU that the vCPU is running on) into the
entry of the virtual HFI table that corresponds to the vCPU's HFI table
index.

This index is set by VMM in CPUID.
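
For reference, the sketch below packs and unpacks the HFI fields of
CPUID.0x06.edx as we understand them from the SDM: bits [7:0] are the
bitmap of supported capabilities, bits [11:8] are the table size in
4 KiB pages minus one, and bits [31:16] are the row index of this
logical processor. The macro and function names are hypothetical;
please check the field layout against the SDM before relying on it.

```c
#include <assert.h>
#include <stdint.h>

/* CPUID.0x06.edx fields related to HFI (layout per our reading of the SDM). */
#define HFI_EDX_CAPS_MASK    0x000000ffu  /* [7:0]   capability bitmap */
#define HFI_EDX_PAGES_MASK   0x00000f00u  /* [11:8]  size in 4 KiB pages - 1 */
#define HFI_EDX_PAGES_SHIFT  8
#define HFI_EDX_INDEX_SHIFT  16           /* [31:16] row index of this CPU */

/* Hypothetical helper a VMM could use to compose the value for one vCPU. */
static uint32_t hfi_edx_encode(uint8_t caps, unsigned int nr_pages,
			       uint16_t index)
{
	/* The 4-bit size field stores nr_pages - 1, so 1..16 pages fit. */
	assert(nr_pages >= 1 && nr_pages <= 16);

	return (uint32_t)caps |
	       ((uint32_t)(nr_pages - 1) << HFI_EDX_PAGES_SHIFT) |
	       ((uint32_t)index << HFI_EDX_INDEX_SHIFT);
}

/* Row index of the vCPU in the (virtual) HFI table. */
static uint16_t hfi_edx_index(uint32_t edx)
{
	return (uint16_t)(edx >> HFI_EDX_INDEX_SHIFT);
}

/* Table size in 4 KiB pages. */
static unsigned int hfi_edx_pages(uint32_t edx)
{
	return ((edx & HFI_EDX_PAGES_MASK) >> HFI_EDX_PAGES_SHIFT) + 1;
}

/* Bitmap of supported capabilities. */
static uint8_t hfi_edx_caps(uint32_t edx)
{
	return (uint8_t)(edx & HFI_EDX_CAPS_MASK);
}
```

Since the index is per logical processor, the VMM would set a different
value in this leaf for each vCPU.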


4.3. HFI table updating
^^^^^^^^^^^^^^^^^^^^^^^

On some platforms, the HFI table is dynamically updated, and the update
is signalled with thermal interrupts. In order to update the virtual HFI
table in time, we add a per-VM notifier to the HFI driver. It notifies
KVM to update the virtual HFI table for the VM, and KVM then injects a
thermal interrupt into the VM to notify the Guest.

There is another case that requires updating the virtual HFI table:
when a vCPU is migrated, the pCPU it runs on changes, and the
corresponding virtual HFI data should be updated to the new pCPU's
data. In this case, in order to reduce overhead, we update only the
data of that single vCPU without traversing the entire virtual HFI
table.
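
The migration case can be sketched as a single-row refresh. The code
below is an illustration under several assumptions: the row layout, the
sizes and the names are hypothetical, the host rows are indexed directly
by pCPU number for simplicity, and the return value that tells the
caller whether Guest-visible data changed (and a thermal interrupt may
be worth injecting) is our own choice, not something taken from the
patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NR_CLASSES 4	/* assumed number of classes (columns) */
#define NR_ROWS    64	/* assumed number of rows in both tables */

/* Simplified row: one perf and one energy efficiency value per class. */
struct hfi_row {
	uint8_t perf[NR_CLASSES];
	uint8_t ee[NR_CLASSES];
};

/* Hypothetical per-vCPU state. */
struct vcpu {
	unsigned int hfi_index;	/* row of this vCPU in the virtual table */
	unsigned int pcpu;	/* pCPU the vCPU currently runs on */
};

/*
 * Refresh only the migrated vCPU's row from the new pCPU's host row.
 * Returns true if the Guest-visible data changed.
 */
static bool vcpu_migrated(struct hfi_row *virt, const struct hfi_row *host,
			  struct vcpu *v, unsigned int new_pcpu)
{
	assert(v->hfi_index < NR_ROWS && new_pcpu < NR_ROWS);

	v->pcpu = new_pcpu;

	/* Rows contain only bytes, so a bytewise compare is sufficient. */
	if (!memcmp(&virt[v->hfi_index], &host[new_pcpu], sizeof(*virt)))
		return false;

	virt[v->hfi_index] = host[new_pcpu];
	return true;
}
```

The cost is constant per migration, independent of the number of vCPUs,
which is the point of not walking the whole virtual table.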


5. Patch Summary
================

Patch 01-03: Prepare the bit definition, the hfi helpers and hfi data
             structures that KVM needs.
Patch 04-05: Add the sched_out arch hook and reset the classification
             history at sched_in()/sched_out().
Patch 06-10: Add emulations of ACPI, TM and PTM, mainly about CPUID and
             related MSRs.
Patch 11-20: Add the emulation support for HFI, including maintaining
             the HFI table for VM.
Patch 21-23: Add the emulation support for ITD, including extending HFI
             to ITD and passing through the classification MSRs.
Patch 24-25: Add HRESET emulation support, which is also used by IPC
             classes feature.
Patch 26:    Add the brief doc about the per-VM lock - pkg_therm_lock.


6. References
=============

[1]: [PATCH 0/9] thermal: intel: hfi: Prework for the virtualization of HFI
     https://lore.kernel.org/lkml/20240203040515.23947-1-ricardo.neri-calderon@linux.intel.com/
[2]: [RFC 00/52] Introduce hybrid CPU topology,
     https://lore.kernel.org/qemu-devel/20230213095035.158240-1-zhao1.liu@linux.intel.com/
[3]: [RFC 00/41] qom-topo: Abstract Everything about CPU Topology,
     https://lore.kernel.org/qemu-devel/20231130144203.2307629-1-zhao1.liu@linux.intel.com/
[4]: SDM, vol. 3B, section 15.6 HARDWARE FEEDBACK INTERFACE AND INTEL
     THREAD DIRECTOR


Thanks and Best Regards,
Zhao
---
Zhao Liu (17):
  thermal: Add bit definition for x86 thermal related MSRs
  KVM: Add kvm_arch_sched_out() hook
  KVM: x86: Reset hardware history at vCPU's sched_in/out
  KVM: VMX: Add helpers to handle the writes to MSR's R/O and R/WC0 bits
  KVM: x86: cpuid: Define CPUID 0x06.eax by kvm_cpu_cap_mask()
  KVM: VMX: Introduce HFI description structure
  KVM: VMX: Introduce HFI table index for vCPU
  KVM: x86: Introduce the HFI dynamic update request and kvm_x86_ops
  KVM: VMX: Allow to inject thermal interrupt without HFI update
  KVM: VMX: Emulate HFI related bits in package thermal MSRs
  KVM: VMX: Emulate the MSRs of HFI feature
  KVM: x86: Expose HFI feature bit and HFI info in CPUID
  KVM: VMX: Extend HFI table and MSR emulation to support ITD
  KVM: VMX: Pass through ITD classification related MSRs to Guest
  KVM: x86: Expose ITD feature bit and related info in CPUID
  KVM: VMX: Emulate the MSR of HRESET feature
  Documentation: KVM: Add description of pkg_therm_lock

Zhuocheng Ding (9):
  thermal: intel: hfi: Add helpers to build HFI/ITD structures
  thermal: intel: hfi: Add HFI notifier helpers to notify HFI update
  KVM: VMX: Emulate ACPI (CPUID.0x01.edx[bit 22]) feature
  KVM: x86: Expose TM/ACC (CPUID.0x01.edx[bit 29]) feature bit to VM
  KVM: VMX: Emulate PTM/PTS (CPUID.0x06.eax[bit 6]) feature
  KVM: VMX: Support virtual HFI table for VM
  KVM: VMX: Sync update of Host HFI table to Guest
  KVM: VMX: Update HFI table when vCPU migrates
  KVM: x86: Expose HRESET feature's CPUID to Guest

 Documentation/virt/kvm/locking.rst  |  13 +-
 arch/arm64/include/asm/kvm_host.h   |   1 +
 arch/mips/include/asm/kvm_host.h    |   1 +
 arch/powerpc/include/asm/kvm_host.h |   1 +
 arch/riscv/include/asm/kvm_host.h   |   1 +
 arch/s390/include/asm/kvm_host.h    |   1 +
 arch/x86/include/asm/hfi.h          |  28 ++
 arch/x86/include/asm/kvm-x86-ops.h  |   3 +-
 arch/x86/include/asm/kvm_host.h     |   2 +
 arch/x86/include/asm/msr-index.h    |  54 +-
 arch/x86/kvm/cpuid.c                | 201 +++++++-
 arch/x86/kvm/irq.h                  |   1 +
 arch/x86/kvm/lapic.c                |   9 +
 arch/x86/kvm/svm/svm.c              |   8 +
 arch/x86/kvm/vmx/vmx.c              | 751 +++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/vmx.h              |  79 ++-
 arch/x86/kvm/x86.c                  |  18 +
 drivers/thermal/intel/intel_hfi.c   | 212 +++++++-
 drivers/thermal/intel/therm_throt.c |   1 -
 include/linux/kvm_host.h            |   1 +
 virt/kvm/kvm_main.c                 |   1 +
 21 files changed, 1343 insertions(+), 44 deletions(-)

Comments

Zhao Liu Feb. 22, 2024, 7:42 a.m. UTC | #1
Ping Paolo & Sean,

Do you have any comment? Or do you think ITD virtualization is
appropriate to discuss at PUCK?

Thanks,
Zhao
