
[RFC,v2,00/34] MSR refactor with new MSR instructions support

Message ID 20250422082216.1954310-1-xin@zytor.com

Message

Xin Li April 22, 2025, 8:21 a.m. UTC
Obviously the existing MSR code and the pv_ops MSR access APIs need some
love: https://lore.kernel.org/lkml/87y1h81ht4.ffs@tglx/

hpa started a discussion about how to refactor it last October:
https://lore.kernel.org/lkml/7a4de623-ecda-4369-a7ae-0c43ef328177@zytor.com/

The consensus so far is to utilize the alternatives mechanism to eliminate
the Xen MSR access overhead on native systems and enable new MSR instructions
based on their availability.

To achieve this, a code refactor is required:

Patch 1 relocates rdtsc{,_ordered}() from <asm/msr.h> to <asm/tsc.h> and
removes the inclusion of <asm/msr.h> in <asm/tsc.h>.  As a result,
<asm/msr.h> must now be explicitly included in several source files where
it was previously included implicitly through <asm/tsc.h>.

Patches 2 ~ 6 refactor the code to use the alternatives mechanism to read
PMC.

Patches 7 ~ 16 unify and simplify the MSR API definitions and usages.

Patches 17 ~ 19 add basic support for immediate form MSR instructions,
e.g., its CPU feature bit and opcode.

Patch 20 adds a new exception type to allow a function call inside an
alternative for instruction emulation to "kick back" the exception into
the alternatives pattern, possibly invoking a different exception handling
pattern there, or at least indicating the "real" location of the fault.

Patches 21 and 22 refactor the code to use the alternatives mechanism to
read and write MSR.

Patches 23 ~ 34 are follow-up cleanups.
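
As a rough illustration of the direction (a sketch only, not the actual
patch code; error handling, tracing and the immediate form are omitted),
an alternative-patched MSR write conceptually looks like:

    /* needs <asm/alternative.h> and <asm/cpufeatures.h> */
    static __always_inline void wrmsrq_sketch(u32 msr, u64 val)
    {
        /*
         * Sketch only: default to the native WRMSR and let alternative
         * patching redirect to the Xen asm stub when running as a Xen PV
         * guest.  asm_xen_write_msr (the stub added by this series) is
         * assumed to take its operands in %ecx and %edx:%eax and to
         * preserve all other registers, i.e. it is not a C-ABI call.
         */
        asm volatile(ALTERNATIVE("wrmsr",
                                 "call asm_xen_write_msr",
                                 X86_FEATURE_XENPV)
                     : /* no outputs */
                     : "c" (msr), "a" ((u32)val), "d" ((u32)(val >> 32))
                     : "memory");
    }

The immediate-form instruction then becomes just one more alternative at
the same patch site once its CPU feature bit is available.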


H. Peter Anvin (Intel) (1):
  x86/extable: Implement EX_TYPE_FUNC_REWIND

Xin Li (Intel) (33):
  x86/msr: Move rdtsc{,_ordered}() to <asm/tsc.h>
  x86/msr: Remove rdpmc()
  x86/msr: Rename rdpmcl() to rdpmcq()
  x86/msr: Convert rdpmcq() into a function
  x86/msr: Return u64 consistently in Xen PMC read functions
  x86/msr: Use the alternatives mechanism to read PMC
  x86/msr: Convert __wrmsr() uses to native_wrmsr{,q}() uses
  x86/msr: Convert a native_wrmsr() use to native_wrmsrq()
  x86/msr: Add the native_rdmsrq() helper
  x86/msr: Convert __rdmsr() uses to native_rdmsrq() uses
  x86/msr: Remove calling native_{read,write}_msr{,_safe}() in
    pmu_msr_{read,write}()
  x86/msr: Remove pmu_msr_{read,write}()
  x86/xen/msr: Remove the error pointer argument from set_reg()
  x86/msr: refactor pv_cpu_ops.write_msr{_safe}()
  x86/msr: Replace wrmsr(msr, low, 0) with wrmsrq(msr, low)
  x86/msr: Change function type of native_read_msr_safe()
  x86/cpufeatures: Add a CPU feature bit for MSR immediate form
    instructions
  x86/opcode: Add immediate form MSR instructions
  x86/extable: Add support for immediate form MSR instructions
  x86/msr: Utilize the alternatives mechanism to write MSR
  x86/msr: Utilize the alternatives mechanism to read MSR
  x86/extable: Remove new dead code in ex_handler_msr()
  x86/mce: Use native MSR API __native_{wr,rd}msrq()
  x86/msr: Rename native_wrmsrq() to native_wrmsrq_no_trace()
  x86/msr: Rename native_wrmsr() to native_wrmsr_no_trace()
  x86/msr: Rename native_write_msr() to native_wrmsrq()
  x86/msr: Rename native_write_msr_safe() to native_wrmsrq_safe()
  x86/msr: Rename native_rdmsrq() to native_rdmsrq_no_trace()
  x86/msr: Rename native_rdmsr() to native_rdmsr_no_trace()
  x86/msr: Rename native_read_msr() to native_rdmsrq()
  x86/msr: Rename native_read_msr_safe() to native_rdmsrq_safe()
  x86/msr: Move the ARGS macros after the MSR read/write APIs
  x86/msr: Convert native_rdmsr_no_trace() uses to
    native_rdmsrq_no_trace() uses

 arch/x86/boot/startup/sme.c                   |   5 +-
 arch/x86/events/amd/brs.c                     |   4 +-
 arch/x86/events/amd/uncore.c                  |   2 +-
 arch/x86/events/core.c                        |   2 +-
 arch/x86/events/intel/core.c                  |   4 +-
 arch/x86/events/intel/ds.c                    |   2 +-
 arch/x86/events/msr.c                         |   3 +
 arch/x86/events/perf_event.h                  |   1 +
 arch/x86/events/probe.c                       |   2 +
 arch/x86/hyperv/hv_apic.c                     |   6 +-
 arch/x86/hyperv/hv_vtl.c                      |   4 +-
 arch/x86/hyperv/ivm.c                         |   7 +-
 arch/x86/include/asm/apic.h                   |   4 +-
 arch/x86/include/asm/asm.h                    |   6 +
 arch/x86/include/asm/cpufeatures.h            |   1 +
 arch/x86/include/asm/extable_fixup_types.h    |   1 +
 arch/x86/include/asm/fred.h                   |   3 +-
 arch/x86/include/asm/microcode.h              |  10 +-
 arch/x86/include/asm/mshyperv.h               |   3 +-
 arch/x86/include/asm/msr.h                    | 637 ++++++++++++------
 arch/x86/include/asm/paravirt.h               |  78 ---
 arch/x86/include/asm/paravirt_types.h         |  13 -
 arch/x86/include/asm/sev-internal.h           |   9 +-
 arch/x86/include/asm/spec-ctrl.h              |   2 +-
 arch/x86/include/asm/suspend_32.h             |   1 +
 arch/x86/include/asm/suspend_64.h             |   1 +
 arch/x86/include/asm/switch_to.h              |   4 +-
 arch/x86/include/asm/tsc.h                    |  76 ++-
 arch/x86/kernel/cpu/amd.c                     |   2 +-
 arch/x86/kernel/cpu/common.c                  |  10 +-
 arch/x86/kernel/cpu/mce/core.c                |  61 +-
 arch/x86/kernel/cpu/microcode/amd.c           |  10 +-
 arch/x86/kernel/cpu/microcode/core.c          |   4 +-
 arch/x86/kernel/cpu/microcode/intel.c         |   8 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c     |  25 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c        |   2 +-
 arch/x86/kernel/cpu/scattered.c               |   1 +
 arch/x86/kernel/cpu/umwait.c                  |   4 +-
 arch/x86/kernel/fpu/xstate.h                  |   1 +
 arch/x86/kernel/hpet.c                        |   1 +
 arch/x86/kernel/kvm.c                         |   2 +-
 arch/x86/kernel/kvmclock.c                    |   2 +-
 arch/x86/kernel/paravirt.c                    |   5 -
 arch/x86/kernel/process_64.c                  |   1 +
 arch/x86/kernel/trace_clock.c                 |   2 +-
 arch/x86/kernel/tsc_sync.c                    |   1 +
 arch/x86/kvm/svm/svm.c                        |  34 +-
 arch/x86/kvm/vmx/vmx.c                        |  12 +-
 arch/x86/lib/kaslr.c                          |   2 +-
 arch/x86/lib/x86-opcode-map.txt               |   5 +-
 arch/x86/mm/extable.c                         | 181 +++--
 arch/x86/realmode/init.c                      |   1 +
 arch/x86/xen/enlighten_pv.c                   | 112 ++-
 arch/x86/xen/pmu.c                            |  63 +-
 arch/x86/xen/xen-asm.S                        | 113 ++++
 arch/x86/xen/xen-ops.h                        |  14 +-
 drivers/acpi/processor_perflib.c              |   1 +
 drivers/acpi/processor_throttling.c           |   3 +-
 drivers/cpufreq/amd-pstate-ut.c               |   2 +
 drivers/hwmon/hwmon-vid.c                     |   4 +
 drivers/net/vmxnet3/vmxnet3_drv.c             |   6 +-
 .../intel/speed_select_if/isst_if_common.c    |   1 +
 drivers/platform/x86/intel/turbo_max_3.c      |   1 +
 tools/arch/x86/lib/x86-opcode-map.txt         |   5 +-
 64 files changed, 988 insertions(+), 605 deletions(-)


base-commit: f30a0c0d2b08b355c01392538de8fc872387cb2b

Comments

Sean Christopherson April 22, 2025, 3:09 p.m. UTC | #1
On Tue, Apr 22, 2025, Xin Li (Intel) wrote:
> __rdmsr() is the lowest level primitive MSR read API, and its direct
> use is NOT preferred.

Doesn't mean it's wrong.

> Use its wrapper function native_rdmsrq() instead.

...

> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 1547bfacd40f..e73c1d5ba6c4 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -380,7 +380,7 @@ static __always_inline void vmx_disable_fb_clear(struct vcpu_vmx *vmx)
>  	if (!vmx->disable_fb_clear)
>  		return;
>  
> -	msr = __rdmsr(MSR_IA32_MCU_OPT_CTRL);
> +	msr = native_rdmsrq(MSR_IA32_MCU_OPT_CTRL);
>  	msr |= FB_CLEAR_DIS;
>  	native_wrmsrq(MSR_IA32_MCU_OPT_CTRL, msr);
>  	/* Cache the MSR value to avoid reading it later */
> @@ -7307,7 +7307,7 @@ void noinstr vmx_spec_ctrl_restore_host(struct vcpu_vmx *vmx,
>  		return;
>  
>  	if (flags & VMX_RUN_SAVE_SPEC_CTRL)
> -		vmx->spec_ctrl = __rdmsr(MSR_IA32_SPEC_CTRL);
> +		vmx->spec_ctrl = native_rdmsrq(MSR_IA32_SPEC_CTRL);

And what guarantees that native_rdmsrq() won't have tracing?  Ugh, a later patch
renames native_rdmsrq() => native_rdmsrq_no_trace().

I really don't like this.  It makes simple and obvious code:

	vmx->spec_ctrl = __rdmsr(MSR_IA32_SPEC_CTRL);

so much harder to read:

	vmx->spec_ctrl = native_rdmsrq_no_trace(MSR_IA32_SPEC_CTRL);

and does so in a way that is difficult to review, e.g. I have to peek ahead to
understand that this is even ok.

I strongly prefer that we find a way to not require such verbose APIs, especially
if KVM ends up using native variants throughout.  Xen PV is supposed to be the
odd one out, yet native code is what suffers.  Blech.
Luck, Tony April 22, 2025, 6:05 p.m. UTC | #2
> >> base-commit: f30a0c0d2b08b355c01392538de8fc872387cb2b
> >
> > This commit doesn't exist in Linus' tree or the tip tree, and the series doesn't
> > apply cleanly on any of the "obvious" choices.  Reviewing a 34 patches series
> > without being able to apply it is a wee bit difficult...
> >
>
> $ git show f30a0c0d2b08b355c01392538de8fc872387cb2b
> commit f30a0c0d2b08b355c01392538de8fc872387cb2b
> Merge: 49b517e68cf7 e396dd85172c
> Author: Ingo Molnar <mingo@kernel.org>
> Date:   Tue Apr 22 08:37:32 2025 +0200
>
>      Merge branch into tip/master: 'x86/sev'
>
>       # New commits in x86/sev:
>          e396dd85172c ("x86/sev: Register tpm-svsm platform device")
>          93b7c6b3ce91 ("tpm: Add SNP SVSM vTPM driver")
>          b2849b072366 ("svsm: Add header with SVSM_VTPM_CMD helpers")
>          770de678bc28 ("x86/sev: Add SVSM vTPM probe/send_command
> functions")
>
>      Signed-off-by: Ingo Molnar <mingo@kernel.org>
>
>
> You probably need to git pull from the tip tree :-)

If possible, you should avoid basing a series on tip/master as it gets recreated
frequently by merging all the topic branches. The SHA1 is here today, gone
tomorrow.

If your changes only depend on one TIP topic branch, base on that and mention
it in the cover letter (as well as the SHA1 supplied by git format-patch
--base=xxx).

If you do depend on multiple tip topic branches, then maybe tip/master is your
only hope. But in that case the cover letter should say "tip/master as of
yyyy-mm-dd".
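
For example (branch and directory names here are just placeholders):

  $ git fetch git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/msr
  $ git checkout -b msr-refactor FETCH_HEAD
    ... develop, commit ...
  $ git format-patch --cover-letter --base=FETCH_HEAD -o outgoing/ FETCH_HEAD..HEAD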

-Tony
Ingo Molnar April 22, 2025, 7:44 p.m. UTC | #3
* Luck, Tony <tony.luck@intel.com> wrote:

> > >> base-commit: f30a0c0d2b08b355c01392538de8fc872387cb2b
> > >
> > > This commit doesn't exist in Linus' tree or the tip tree, and the series doesn't
> > > apply cleanly on any of the "obvious" choices.  Reviewing a 34 patches series
> > > without being able to apply it is a wee bit difficult...
> > >
> >
> > $ git show f30a0c0d2b08b355c01392538de8fc872387cb2b
> > commit f30a0c0d2b08b355c01392538de8fc872387cb2b
> > Merge: 49b517e68cf7 e396dd85172c
> > Author: Ingo Molnar <mingo@kernel.org>
> > Date:   Tue Apr 22 08:37:32 2025 +0200
> >
> >      Merge branch into tip/master: 'x86/sev'
> >
> >       # New commits in x86/sev:
> >          e396dd85172c ("x86/sev: Register tpm-svsm platform device")
> >          93b7c6b3ce91 ("tpm: Add SNP SVSM vTPM driver")
> >          b2849b072366 ("svsm: Add header with SVSM_VTPM_CMD helpers")
> >          770de678bc28 ("x86/sev: Add SVSM vTPM probe/send_command
> > functions")
> >
> >      Signed-off-by: Ingo Molnar <mingo@kernel.org>
> >
> >
> > You probably need to git pull from the tip tree :-)
> 
> If possible, you should avoid basing a series on tip/master as it 
> gets recreated frequently by merging all the topic branches. The SHA1 
> is here today, gone tomorrow.

Correct, although for x86 patch submissions via email it's not wrong: 
what applies today will likely apply tomorrow as well, regardless of 
the SHA1 change. :-)

> If your changes only depend on one TIP topic branch, base on that and 
> mention it in the cover letter (as well as the SHA1 supplied by git 
> format-patch --base=xxx).

Yeah, the main dependency this series has is tip:x86/msr I believe:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/msr

Which SHA1s should be stable at this point.

Thanks,

	Ingo
Xin Li April 23, 2025, 8:51 a.m. UTC | #4
On 4/22/2025 2:57 AM, Jürgen Groß wrote:
> On 22.04.25 10:22, Xin Li (Intel) wrote:
>> The story started from tglx's reply in [1]:
>>
>>    For actual performance relevant code the current PV ops mechanics
>>    are a horrorshow when the op defaults to the native instruction.
>>
>>    look at wrmsrl():
>>
>>    wrmsrl(msr, val
>>     wrmsr(msr, (u32)val, (u32)val >> 32))
>>      paravirt_write_msr(msr, low, high)
>>        PVOP_VCALL3(cpu.write_msr, msr, low, high)
>>
>>    Which results in
>>
>>     mov    $msr, %edi
>>     mov    $val, %rdx
>>     mov    %edx, %esi
>>     shr    $0x20, %rdx
>>     call    native_write_msr
>>
>>    and native_write_msr() does at minimum:
>>
>>     mov    %edi,%ecx
>>     mov    %esi,%eax
>>     wrmsr
>>     ret
>>
>>    In the worst case 'ret' is going through the return thunk. Not to
>>    talk about function prologues and whatever.
>>
>>    This becomes even more silly for trivial instructions like STI/CLI
>>    or in the worst case paravirt_nop().
> 
> This is nonsense.
> 
> In the non-Xen case the initial indirect call is directly replaced with
> STI/CLI via alternative patching, while for Xen it is replaced by a direct
> call.
> 
> The paravirt_nop() case is handled in alt_replace_call() by replacing the
> indirect call with a nop in case the target of the call was paravirt_nop()
> (which is in fact no_func()).
> 
>>
>>    The call makes only sense, when the native default is an actual
>>    function, but for the trivial cases it's a blatant engineering
>>    trainwreck.
> 
> The trivial cases are all handled as stated above: a direct replacement
> instruction is placed at the indirect call position.

The above comment was given in 2023 IIRC, and you have addressed it.

> 
>> Later a consensus was reached to utilize the alternatives mechanism to
>> eliminate the indirect call overhead introduced by the pv_ops APIs:
>>
>>      1) When built with !CONFIG_XEN_PV, X86_FEATURE_XENPV becomes a
>>         disabled feature, preventing the Xen code from being built
>>         and ensuring the native code is executed unconditionally.
> 
> This is the case today already. There is no need for any change to have
> this in place.
> 
>>
>>      2) When built with CONFIG_XEN_PV:
>>
>>         2.1) If not running on the Xen hypervisor (!X86_FEATURE_XENPV),
>>              the kernel runtime binary is patched to unconditionally
>>              jump to the native MSR write code.
>>
>>         2.2) If running on the Xen hypervisor (X86_FEATURE_XENPV), the
>>              kernel runtime binary is patched to unconditionally jump
>>              to the Xen MSR write code.
> 
> I can't see what is different here compared to today's state.
> 
>>
>> The alternatives mechanism is also used to choose the new immediate
>> form MSR write instruction when it's available.
> 
> Yes, this needs to be added.
> 
>> Consequently, remove the pv_ops MSR write APIs and the Xen callbacks.
> 
> I still don't see a major difference to today's solution.

The existing code generates:

     ...
     bf e0 06 00 00          mov    $0x6e0,%edi
     89 d6                   mov    %edx,%esi
     48 c1 ea 20             shr    $0x20,%rdx
     ff 15 07 48 8c 01       call   *0x18c4807(%rip)  # <pv_ops+0xb8>
     31 c0                   xor    %eax,%eax
     ...

And on native, the indirect call instruction is patched to a direct call
as you mentioned:

     ...
     bf e0 06 00 00          mov    $0x6e0,%edi
     89 d6                   mov    %edx,%esi
     48 c1 ea 20             shr    $0x20,%rdx
     e8 60 3e 01 00          call   <{native,xen}_write_msr> # direct
     90                      nop
     31 c0                   xor    %eax,%eax
     ...


This patch set generates assembly w/o CALL on native:

     ...
     e9 e6 22 c6 01          jmp    1f   # on native or nop on Xen
     b9 e0 06 00 00          mov    $0x6e0,%ecx
     e8 91 d4 fa ff          call   ffffffff8134ee80 <asm_xen_write_msr>
     e9 a4 9f eb 00          jmp    ffffffff8225b9a0 <__x86_return_thunk>
         ...
1:  b9 e0 06 00 00          mov    $0x6e0,%ecx   # immediate form here
     48 89 c2                mov    %rax,%rdx
     48 c1 ea 20             shr    $0x20,%rdx
     3e 0f 30                ds wrmsr
     ...

It's not a major change, but when it is patched to use the immediate 
form MSR write instruction, it's straightforwardly streamlined.

> 
> Only the "paravirt" term has been eliminated.

Yes.

But a PV guest doesn't operate at the highest privilege level, which
means MSR instructions typically result in a #GP fault.  I actually 
think the pv_ops MSR APIs are unnecessary because of this inherent
limitation.

Looking at the Xen MSR code, except for PMU and just a few MSRs, it falls
back to executing native MSR instructions.  As MSR instructions trigger
#GP, Xen takes control and handles them in 2 ways:

   1) emulate (or ignore) an MSR operation and skip the guest instruction.

   2) inject the #GP back into the guest OS and let its #GP handler handle it.
      But the Linux MSR exception handler just ignores the MSR instruction
      (an MCE MSR exception will panic).

So why not let Xen handle all the details which it already tries to do?
(Linux w/ such a change may not be able to run on old Xen hypervisors.)

BTW, if performance is a concern, writes to MSR_KERNEL_GS_BASE and
MSR_GS_BASE are hypercalls into Xen anyway.

Thanks!
     Xin
Xin Li April 23, 2025, 9:27 a.m. UTC | #5
On 4/22/2025 8:09 AM, Sean Christopherson wrote:
> On Tue, Apr 22, 2025, Xin Li (Intel) wrote:
>> __rdmsr() is the lowest level primitive MSR read API, and its direct
>> use is NOT preferred.
> 
> Doesn't mean it's wrong.

I wouldn't go so far as to claim that it's wrong :-)

>> Use its wrapper function native_rdmsrq() instead.

The current code exhibits a somewhat haphazard use of MSR APIs, so I
wanted to clarify which API to employ in specific situations with
verbose function naming.

Here is an example where Boris had to fix the use of MSR APIs:

https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f980f9c31a923e9040dee0bc679a5f5b09e61f40

> 
> ...
> 
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 1547bfacd40f..e73c1d5ba6c4 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -380,7 +380,7 @@ static __always_inline void vmx_disable_fb_clear(struct vcpu_vmx *vmx)
>>   	if (!vmx->disable_fb_clear)
>>   		return;
>>   
>> -	msr = __rdmsr(MSR_IA32_MCU_OPT_CTRL);
>> +	msr = native_rdmsrq(MSR_IA32_MCU_OPT_CTRL);
>>   	msr |= FB_CLEAR_DIS;
>>   	native_wrmsrq(MSR_IA32_MCU_OPT_CTRL, msr);
>>   	/* Cache the MSR value to avoid reading it later */
>> @@ -7307,7 +7307,7 @@ void noinstr vmx_spec_ctrl_restore_host(struct vcpu_vmx *vmx,
>>   		return;
>>   
>>   	if (flags & VMX_RUN_SAVE_SPEC_CTRL)
>> -		vmx->spec_ctrl = __rdmsr(MSR_IA32_SPEC_CTRL);
>> +		vmx->spec_ctrl = native_rdmsrq(MSR_IA32_SPEC_CTRL);
> 
> And what guarantees that native_rdmsrq() won't have tracing?  Ugh, a later patch
> renames native_rdmsrq() => native_rdmsrq_no_trace().
> 
> I really don't like this.  It makes simple and obvious code:
> 
> 	vmx->spec_ctrl = __rdmsr(MSR_IA32_SPEC_CTRL);
> 
> so much harder to read:
> 
> 	vmx->spec_ctrl = native_rdmsrq_no_trace(MSR_IA32_SPEC_CTRL);
> 
> and does so in a way that is difficult to review, e.g. I have to peek ahead to
> understand that this is even ok.
> 
> I strongly prefer that we find a way to not require such verbose APIs, especially
> if KVM ends up using native variants throughout.  Xen PV is supposed to be the
> odd one out, yet native code is what suffers.  Blech.

Will try to figure out how to name the APIs.

One reason I chose verbose names is that short names are in use and
renaming needs to touch a lot of files (and not fun at all).

Thanks!
     Xin
Sean Christopherson April 23, 2025, 1:37 p.m. UTC | #6
On Wed, Apr 23, 2025, Xin Li wrote:
> On 4/22/2025 8:09 AM, Sean Christopherson wrote:
> > I strongly prefer that we find a way to not require such verbose APIs, especially
> > if KVM ends up using native variants throughout.  Xen PV is supposed to be the
> > odd one out, yet native code is what suffers.  Blech.
> 
> Will try to figure out how to name the APIs.
> 
> One reason I chose verbose names is that short names are in use and
> renaming needs to touch a lot of files (and not fun at all).

Yeah, I've looked at modifying rdmsrl() to "return" a value more than once, and
ran away screaming every time.

But since you're already doing a pile of renames, IMO this is the perfect time to
do an aggressive cleanup.
Dave Hansen April 23, 2025, 2:02 p.m. UTC | #7
On 4/23/25 02:27, Xin Li wrote:
> One reason I chose verbose names is that short names are in use and
> renaming needs to touch a lot of files (and not fun at all).

This series is getting *WAY* too big.

Could you please peel the renaming stuff out and we can get it applied
independently of the new instruction gunk?
Dave Hansen April 23, 2025, 2:24 p.m. UTC | #8
On 4/22/25 01:21, Xin Li (Intel) wrote:
> Signed-off-by: Xin Li (Intel) <xin@zytor.com>

We had a non-trivial discussion about the l=>q renames. Please at least
include a sentence or two about those discussions.

For the code:

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Dave Hansen April 23, 2025, 2:25 p.m. UTC | #9
On 4/22/25 01:21, Xin Li (Intel) wrote:
> Signed-off-by: Xin Li (Intel) <xin@zytor.com>

Code: good.  No changelog: bad.

Once there's some semblance of a changelog:

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Sean Christopherson April 23, 2025, 2:28 p.m. UTC | #10
On Tue, Apr 22, 2025, Xin Li (Intel) wrote:
> Signed-off-by: Xin Li (Intel) <xin@zytor.com>
> ---
>  arch/x86/events/amd/uncore.c              |  2 +-
>  arch/x86/events/core.c                    |  2 +-
>  arch/x86/events/intel/core.c              |  4 ++--
>  arch/x86/events/intel/ds.c                |  2 +-
>  arch/x86/include/asm/msr.h                |  2 +-
>  arch/x86/include/asm/paravirt.h           |  2 +-
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 12 ++++++------
>  7 files changed, 13 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c
> index f231e1078e51..b9933ab3116c 100644
> --- a/arch/x86/events/amd/uncore.c
> +++ b/arch/x86/events/amd/uncore.c
> @@ -152,7 +152,7 @@ static void amd_uncore_read(struct perf_event *event)
>  	if (hwc->event_base_rdpmc < 0)
>  		rdmsrq(hwc->event_base, new);
>  	else
> -		rdpmcl(hwc->event_base_rdpmc, new);
> +		rdpmcq(hwc->event_base_rdpmc, new);

Now that rdpmc() is gone, i.e. rdpmcl/rdpmcq() is the only helper, why not simply
rename rdpmcl() => rdpmc()?  I see no point in adding a 'q' qualifier; it doesn't
disambiguate anything and IMO is pure noise.
Dave Hansen April 23, 2025, 3:06 p.m. UTC | #11
On 4/23/25 07:28, Sean Christopherson wrote:
> Now that rdpmc() is gone, i.e. rdpmcl/rdpmcq() is the only helper, why not simply
> rename rdpmcl() => rdpmc()?  I see no point in adding a 'q' qualifier; it doesn't
> disambiguate anything and IMO is pure noise.

That makes total sense to me.
Dave Hansen April 23, 2025, 3:51 p.m. UTC | #12
On 4/22/25 01:21, Xin Li (Intel) wrote:
>  static __always_inline void sev_es_wr_ghcb_msr(u64 val)
>  {
> -	u32 low, high;
> -
> -	low  = (u32)(val);
> -	high = (u32)(val >> 32);
> -
> -	native_wrmsr(MSR_AMD64_SEV_ES_GHCB, low, high);
> +	native_wrmsrq(MSR_AMD64_SEV_ES_GHCB, val);
>  }

A note on ordering: Had this been a native_wrmsr()=>__wrmsr()
conversion, it could be sucked into the tree easily before the big
__wrmsr()=>native_wrmsrq() conversion.

Yeah, you'd have to base the big rename on top of this. But with a
series this big, I'd prioritize whatever gets it trimmed down.
Jürgen Groß April 23, 2025, 4:05 p.m. UTC | #13
On 23.04.25 10:51, Xin Li wrote:
> On 4/22/2025 2:57 AM, Jürgen Groß wrote:
>> On 22.04.25 10:22, Xin Li (Intel) wrote:
>>> The story started from tglx's reply in [1]:
>>>
>>>    For actual performance relevant code the current PV ops mechanics
>>>    are a horrorshow when the op defaults to the native instruction.
>>>
>>>    look at wrmsrl():
>>>
>>>    wrmsrl(msr, val
>>>     wrmsr(msr, (u32)val, (u32)val >> 32))
>>>      paravirt_write_msr(msr, low, high)
>>>        PVOP_VCALL3(cpu.write_msr, msr, low, high)
>>>
>>>    Which results in
>>>
>>>     mov    $msr, %edi
>>>     mov    $val, %rdx
>>>     mov    %edx, %esi
>>>     shr    $0x20, %rdx
>>>     call    native_write_msr
>>>
>>>    and native_write_msr() does at minimum:
>>>
>>>     mov    %edi,%ecx
>>>     mov    %esi,%eax
>>>     wrmsr
>>>     ret
>>>
>>>    In the worst case 'ret' is going through the return thunk. Not to
>>>    talk about function prologues and whatever.
>>>
>>>    This becomes even more silly for trivial instructions like STI/CLI
>>>    or in the worst case paravirt_nop().
>>
>> This is nonsense.
>>
>> In the non-Xen case the initial indirect call is directly replaced with
>> STI/CLI via alternative patching, while for Xen it is replaced by a direct
>> call.
>>
>> The paravirt_nop() case is handled in alt_replace_call() by replacing the
>> indirect call with a nop in case the target of the call was paravirt_nop()
>> (which is in fact no_func()).
>>
>>>
>>>    The call makes only sense, when the native default is an actual
>>>    function, but for the trivial cases it's a blatant engineering
>>>    trainwreck.
>>
>> The trivial cases are all handled as stated above: a direct replacement
>> instruction is placed at the indirect call position.
> 
> The above comment was given in 2023 IIRC, and you have addressed it.
> 
>>
>>> Later a consensus was reached to utilize the alternatives mechanism to
>>> eliminate the indirect call overhead introduced by the pv_ops APIs:
>>>
>>>      1) When built with !CONFIG_XEN_PV, X86_FEATURE_XENPV becomes a
>>>         disabled feature, preventing the Xen code from being built
>>>         and ensuring the native code is executed unconditionally.
>>
>> This is the case today already. There is no need for any change to have
>> this in place.
>>
>>>
>>>      2) When built with CONFIG_XEN_PV:
>>>
>>>         2.1) If not running on the Xen hypervisor (!X86_FEATURE_XENPV),
>>>              the kernel runtime binary is patched to unconditionally
>>>              jump to the native MSR write code.
>>>
>>>         2.2) If running on the Xen hypervisor (X86_FEATURE_XENPV), the
>>>              kernel runtime binary is patched to unconditionally jump
>>>              to the Xen MSR write code.
>>
>> I can't see what is different here compared to today's state.
>>
>>>
>>> The alternatives mechanism is also used to choose the new immediate
>>> form MSR write instruction when it's available.
>>
>> Yes, this needs to be added.
>>
>>> Consequently, remove the pv_ops MSR write APIs and the Xen callbacks.
>>
>> I still don't see a major difference to today's solution.
> 
> The existing code generates:
> 
>      ...
>      bf e0 06 00 00          mov    $0x6e0,%edi
>      89 d6                   mov    %edx,%esi
>      48 c1 ea 20             shr    $0x20,%rdx
>      ff 15 07 48 8c 01       call   *0x18c4807(%rip)  # <pv_ops+0xb8>
>      31 c0                   xor    %eax,%eax
>      ...
> 
> And on native, the indirect call instruction is patched to a direct call
> as you mentioned:
> 
>      ...
>      bf e0 06 00 00          mov    $0x6e0,%edi
>      89 d6                   mov    %edx,%esi
>      48 c1 ea 20             shr    $0x20,%rdx
>      e8 60 3e 01 00          call   <{native,xen}_write_msr> # direct
>      90                      nop
>      31 c0                   xor    %eax,%eax
>      ...
> 
> 
> This patch set generates assembly w/o CALL on native:
> 
>      ...
>      e9 e6 22 c6 01          jmp    1f   # on native or nop on Xen
>      b9 e0 06 00 00          mov    $0x6e0,%ecx
>      e8 91 d4 fa ff          call   ffffffff8134ee80 <asm_xen_write_msr>
>      e9 a4 9f eb 00          jmp    ffffffff8225b9a0 <__x86_return_thunk>
>          ...
> 1:  b9 e0 06 00 00          mov    $0x6e0,%ecx   # immediate form here
>      48 89 c2                mov    %rax,%rdx
>      48 c1 ea 20             shr    $0x20,%rdx
>      3e 0f 30                ds wrmsr
>      ...
> 
> It's not a major change, but when it is patched to use the immediate form MSR 
> write instruction, it's straightforwardly streamlined.

It should be rather easy to switch the current wrmsr/rdmsr paravirt patching
locations to use the rdmsr/wrmsr instructions instead of doing a call to
native_*msr().

The case of the new immediate form could be handled the same way.

> 
>>
>> Only the "paravirt" term has been eliminated.
> 
> Yes.
> 
> But a PV guest doesn't operate at the highest privilege level, which
> means MSR instructions typically result in a #GP fault.  I actually think the 
> pv_ops MSR APIs are unnecessary because of this inherent
> limitation.
> 
> Looking at the Xen MSR code, except PMU and just a few MSRs, it falls
> back to executes native MSR instructions.  As MSR instructions trigger
> #GP, Xen takes control and handles them in 2 ways:
> 
>    1) emulate (or ignore) a MSR operation and skip the guest instruction.
> 
>    2) inject the #GP back to guest OS and let its #GP handler handle it.
>       But Linux MSR exception handler just ignores the MSR instruction
>       (MCE MSR exception will panic).
> 
> So why not let Xen handle all the details which it already tries to do?

Some MSRs are not handled that way, but via a kernel-internal emulation.
And those are handled that way mostly for performance reasons. And some
need special treatment.

> (Linux w/ such a change may not be able to run on old Xen hypervisors.)

Yes, and this is something to avoid.

And remember that Linux isn't the only PV-mode guest existing.

> BTW, if performance is a concern, writes to MSR_KERNEL_GS_BASE and
> MSR_GS_BASE anyway are hpyercalls into Xen.

Yes, and some other MSR writes are just NOPs with Xen-PV.


Juergen
Xin Li April 23, 2025, 5:23 p.m. UTC | #14
On 4/23/2025 8:06 AM, Dave Hansen wrote:
> On 4/23/25 07:28, Sean Christopherson wrote:
>> Now that rdpmc() is gone, i.e. rdpmcl/rdpmcq() is the only helper, why not simply
>> rename rdpmcl() => rdpmc()?  I see no point in adding a 'q' qualifier; it doesn't
>> disambiguate anything and IMO is pure noise.
> 
> That makes total sense to me.
> 

Unable to argue with two maintainers on a simple naming question ;), so will
make the change.
Xin Li April 23, 2025, 5:27 p.m. UTC | #15
On 4/23/2025 8:51 AM, Dave Hansen wrote:
> On 4/22/25 01:21, Xin Li (Intel) wrote:
>>   static __always_inline void sev_es_wr_ghcb_msr(u64 val)
>>   {
>> -	u32 low, high;
>> -
>> -	low  = (u32)(val);
>> -	high = (u32)(val >> 32);
>> -
>> -	native_wrmsr(MSR_AMD64_SEV_ES_GHCB, low, high);
>> +	native_wrmsrq(MSR_AMD64_SEV_ES_GHCB, val);
>>   }
> 
> A note on ordering: Had this been a native_wrmsr()=>__wrmsr()
> conversion, it could be sucked into the tree easily before the big
> __wrmsr()=>native_wrmsrq() conversion.
> 
> Yeah, you'd have to base the big rename on top of this. But with a
> series this big, I'd prioritize whatever gets it trimmed down.

Okay, I will focus on cleanup first.
Xin Li April 23, 2025, 11:23 p.m. UTC | #16
On 4/23/2025 8:51 AM, Dave Hansen wrote:
> On 4/22/25 01:21, Xin Li (Intel) wrote:
>>   static __always_inline void sev_es_wr_ghcb_msr(u64 val)
>>   {
>> -	u32 low, high;
>> -
>> -	low  = (u32)(val);
>> -	high = (u32)(val >> 32);
>> -
>> -	native_wrmsr(MSR_AMD64_SEV_ES_GHCB, low, high);
>> +	native_wrmsrq(MSR_AMD64_SEV_ES_GHCB, val);
>>   }
> 
> A note on ordering: Had this been a native_wrmsr()=>__wrmsr()
> conversion, it could be sucked into the tree easily before the big
> __wrmsr()=>native_wrmsrq() conversion.

Can't reorder the 2 patches, because __wrmsr() takes two u32 arguments
and the split has to be done explicitly in sev_es_wr_ghcb_msr().

Thanks!
     Xin
Mi, Dapeng April 24, 2025, 6:25 a.m. UTC | #17
On 4/22/2025 4:21 PM, Xin Li (Intel) wrote:
> hpa found that pmu_msr_write() is actually a completely pointless
> function [1]: all it does is shuffle some arguments, then calls
> pmu_msr_chk_emulated() and if it returns true AND the emulated flag
> is clear then does *exactly the same thing* that the calling code
> would have done if pmu_msr_write() itself had returned true.  And
> pmu_msr_read() does the equivalent stupidity.
>
> Remove the calls to native_{read,write}_msr{,_safe}() within
> pmu_msr_{read,write}().  Instead reuse the existing calling code
> that decides whether to call native_{read,write}_msr{,_safe}() based
> on the return value from pmu_msr_{read,write}().  Consequently,
> eliminate the need to pass an error pointer to pmu_msr_{read,write}().
>
> While at it, refactor pmu_msr_write() to take the MSR value as a u64
> argument, replacing the current dual u32 arguments, because the dual
> u32 arguments were only used to call native_write_msr{,_safe}(), which
> has now been removed.
>
> [1]: https://lore.kernel.org/lkml/0ec48b84-d158-47c6-b14c-3563fd14bcc4@zytor.com/
>
> Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
> Sign-off-by: Xin Li (Intel) <xin@zytor.com>
> ---
>  arch/x86/xen/enlighten_pv.c |  6 +++++-
>  arch/x86/xen/pmu.c          | 27 ++++-----------------------
>  arch/x86/xen/xen-ops.h      |  4 ++--
>  3 files changed, 11 insertions(+), 26 deletions(-)
>
> diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
> index 9fbe187aff00..1418758b57ff 100644
> --- a/arch/x86/xen/enlighten_pv.c
> +++ b/arch/x86/xen/enlighten_pv.c
> @@ -1132,6 +1132,8 @@ static void set_seg(unsigned int which, unsigned int low, unsigned int high,
>  static void xen_do_write_msr(unsigned int msr, unsigned int low,
>  			     unsigned int high, int *err)
>  {
> +	u64 val;
> +
>  	switch (msr) {
>  	case MSR_FS_BASE:
>  		set_seg(SEGBASE_FS, low, high, err);
> @@ -1158,7 +1160,9 @@ static void xen_do_write_msr(unsigned int msr, unsigned int low,
>  		break;
>  
>  	default:
> -		if (!pmu_msr_write(msr, low, high, err)) {
> +		val = (u64)high << 32 | low;
> +
> +		if (!pmu_msr_write(msr, val)) {
>  			if (err)
>  				*err = native_write_msr_safe(msr, low, high);
>  			else
> diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c
> index 9c1682af620a..95caae97a394 100644
> --- a/arch/x86/xen/pmu.c
> +++ b/arch/x86/xen/pmu.c
> @@ -313,37 +313,18 @@ static bool pmu_msr_chk_emulated(unsigned int msr, uint64_t *val, bool is_read,
>  	return true;
>  }
>  
> -bool pmu_msr_read(unsigned int msr, uint64_t *val, int *err)
> +bool pmu_msr_read(u32 msr, u64 *val)

The function name is somewhat misleading now. With the change, this
function only reads the PMU MSR's value if it's emulated; otherwise it
won't really read the PMU MSR. How about changing the name to
"pmu_emulated_msr_read" or something similar?


>  {
>  	bool emulated;
>  
> -	if (!pmu_msr_chk_emulated(msr, val, true, &emulated))
> -		return false;
> -
> -	if (!emulated) {
> -		*val = err ? native_read_msr_safe(msr, err)
> -			   : native_read_msr(msr);
> -	}
> -
> -	return true;
> +	return pmu_msr_chk_emulated(msr, val, true, &emulated) && emulated;
>  }
>  
> -bool pmu_msr_write(unsigned int msr, uint32_t low, uint32_t high, int *err)
> +bool pmu_msr_write(u32 msr, u64 val)

ditto.


>  {
> -	uint64_t val = ((uint64_t)high << 32) | low;
>  	bool emulated;
>  
> -	if (!pmu_msr_chk_emulated(msr, &val, false, &emulated))
> -		return false;
> -
> -	if (!emulated) {
> -		if (err)
> -			*err = native_write_msr_safe(msr, low, high);
> -		else
> -			native_write_msr(msr, low, high);
> -	}
> -
> -	return true;
> +	return pmu_msr_chk_emulated(msr, &val, false, &emulated) && emulated;
>  }
>  
>  static u64 xen_amd_read_pmc(int counter)
> diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
> index dc886c3cc24d..a1875e10be31 100644
> --- a/arch/x86/xen/xen-ops.h
> +++ b/arch/x86/xen/xen-ops.h
> @@ -271,8 +271,8 @@ void xen_pmu_finish(int cpu);
>  static inline void xen_pmu_init(int cpu) {}
>  static inline void xen_pmu_finish(int cpu) {}
>  #endif
> -bool pmu_msr_read(unsigned int msr, uint64_t *val, int *err);
> -bool pmu_msr_write(unsigned int msr, uint32_t low, uint32_t high, int *err);
> +bool pmu_msr_read(u32 msr, u64 *val);

The prototype of pmu_msr_read() has been changed, but why is there no
corresponding change in its caller (xen_do_read_msr())?


> +bool pmu_msr_write(u32 msr, u64 val);
>  int pmu_apic_update(uint32_t reg);
>  u64 xen_read_pmc(int counter);
>
Xin Li April 24, 2025, 7:16 a.m. UTC | #18
On 4/23/2025 11:25 PM, Mi, Dapeng wrote:

>> -bool pmu_msr_read(unsigned int msr, uint64_t *val, int *err)
>> +bool pmu_msr_read(u32 msr, u64 *val)
> 
> The function name is somewhat misleading now. With the change, this
> function only reads the PMU MSR's value if it's emulated; otherwise it
> won't really read the PMU MSR. How about changing the name to
> "pmu_emulated_msr_read" or something similar?

This makes sense!

>> -bool pmu_msr_read(unsigned int msr, uint64_t *val, int *err);
>> -bool pmu_msr_write(unsigned int msr, uint32_t low, uint32_t high, int *err);
>> +bool pmu_msr_read(u32 msr, u64 *val);
> 
> The prototype of pmu_msr_read() has been changed, but why is there no
> corresponding change in its caller (xen_do_read_msr())?

Good catch.  I didn't compile one by one thus missed it.
Xin Li April 24, 2025, 7:50 a.m. UTC | #19
On 4/24/2025 12:43 AM, Mi, Dapeng wrote:
> These 2 patches are not complicated; it won't be difficult to review them if
> they're merged into one, as long as the commit message mentions it clearly.
> Anyway, I'm fine if you prefer to keep them as two patches.

Simple Small Steps...
Xin Li April 24, 2025, 8:06 a.m. UTC | #20
On 4/23/2025 9:05 AM, Jürgen Groß wrote:
>> It's not a major change, but when it is patched to use the immediate 
>> form MSR write instruction, it's straightforwardly streamlined.
> 
> It should be rather easy to switch the current wrmsr/rdmsr paravirt 
> patching
> locations to use the rdmsr/wrmsr instructions instead of doing a call to
> native_*msr().
> 
> The case of the new immediate form could be handled the same way.

Actually, that is how we get this patch with the existing alternatives
infrastructure.  And we took a step further to also remove the pv_ops
MSR APIs...

It looks to me that you want to add a new facility to the alternatives
infrastructure first?


>>> Only the "paravirt" term has been eliminated.
>>
>> Yes.
>>
>> But a PV guest doesn't operate at the highest privilege level, which
>> means MSR instructions typically result in a #GP fault.  I actually 
>> think the pv_ops MSR APIs are unnecessary because of this inherent
>> limitation.
>>
>> Looking at the Xen MSR code, except PMU and just a few MSRs, it falls
>> back to executes native MSR instructions.  As MSR instructions trigger
>> #GP, Xen takes control and handles them in 2 ways:
>>
>>    1) emulate (or ignore) a MSR operation and skip the guest instruction.
>>
>>    2) inject the #GP back to guest OS and let its #GP handler handle it.
>>       But Linux MSR exception handler just ignores the MSR instruction
>>       (MCE MSR exception will panic).
>>
>> So why not let Xen handle all the details which it already tries to do?
> 
> Some MSRs are not handled that way, but via a kernel internal emulation.
> And those are handled that way mostly due to performance reasons. And some
> need special treatment.
> 
>> (Linux w/ such a change may not be able to run on old Xen hypervisors.)
> 
> Yes, and this is something to avoid.
> 
> And remember that Linux isn't the only PV-mode guest existing.
> 
>> BTW, if performance is a concern, writes to MSR_KERNEL_GS_BASE and
>> MSR_GS_BASE anyway are hpyercalls into Xen.
> 
> Yes, and some other MSR writes are just NOPs with Xen-PV.
> 

I will do some cleanup and refactor first.

BTW, at least we can merge the safe() APIs into the non-safe() ones.
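
E.g. something like this (purely a sketch with made-up names, using the
existing native_write_msr_safe() only as a stand-in for whatever single
implementation we end up with):

    /* the one real, exception-table-protected implementation */
    static __always_inline int wrmsrq_try(u32 msr, u64 val)
    {
        return native_write_msr_safe(msr, (u32)val, (u32)(val >> 32));
    }

    /* the non-safe API becomes a thin wrapper that just complains */
    static __always_inline void wrmsrq_merged(u32 msr, u64 val)
    {
        WARN_ON_ONCE(wrmsrq_try(msr, val));
    }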

Thanks!
     Xin
Jürgen Groß April 24, 2025, 8:14 a.m. UTC | #21
On 24.04.25 10:06, Xin Li wrote:
> On 4/23/2025 9:05 AM, Jürgen Groß wrote:
>>> It's not a major change, but when it is patched to use the immediate form MSR 
>>> write instruction, it's straightforwardly streamlined.
>>
>> It should be rather easy to switch the current wrmsr/rdmsr paravirt patching
>> locations to use the rdmsr/wrmsr instructions instead of doing a call to
>> native_*msr().
>>
>> The case of the new immediate form could be handled the same way.
> 
> Actually, that is how we get this patch with the existing alternatives
> infrastructure.  And we took a step further to also remove the pv_ops
> MSR APIs...

And this is what I'm questioning. IMHO this approach is adding more
code by removing the pv_ops MSR APIs just because "pv_ops is bad". And
I believe most of the refusal of pv_ops is based on reasoning that is no
longer valid.

> It looks to me that you want to add a new facility to the alternatives
> infrastructure first?

Why would we need a new facility in the alternatives infrastructure?


Juergen
Jürgen Groß April 24, 2025, 10:05 a.m. UTC | #22
On 22.04.25 10:21, Xin Li (Intel) wrote:
> As pmu_msr_{read,write}() are now wrappers of pmu_msr_chk_emulated(),
> remove them and use pmu_msr_chk_emulated() directly.
> 
> While at it, convert the data type of MSR index to u32 in functions
> called in pmu_msr_chk_emulated().
> 
> Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
> Signed-off-by: Xin Li (Intel) <xin@zytor.com>
> ---
>   arch/x86/xen/enlighten_pv.c | 17 ++++++++++-------
>   arch/x86/xen/pmu.c          | 24 ++++--------------------
>   arch/x86/xen/xen-ops.h      |  3 +--
>   3 files changed, 15 insertions(+), 29 deletions(-)
> 
> diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
> index 1418758b57ff..b5a8bceb5f56 100644
> --- a/arch/x86/xen/enlighten_pv.c
> +++ b/arch/x86/xen/enlighten_pv.c
> @@ -1089,8 +1089,9 @@ static void xen_write_cr4(unsigned long cr4)
>   static u64 xen_do_read_msr(unsigned int msr, int *err)
>   {
>   	u64 val = 0;	/* Avoid uninitialized value for safe variant. */
> +	bool emulated;
>   
> -	if (pmu_msr_read(msr, &val, err))
> +	if (pmu_msr_chk_emulated(msr, &val, true, &emulated) && emulated)
>   		return val;
>   
>   	if (err)
> @@ -1133,6 +1134,7 @@ static void xen_do_write_msr(unsigned int msr, unsigned int low,
>   			     unsigned int high, int *err)
>   {
>   	u64 val;
> +	bool emulated;
>   
>   	switch (msr) {
>   	case MSR_FS_BASE:
> @@ -1162,12 +1164,13 @@ static void xen_do_write_msr(unsigned int msr, unsigned int low,
>   	default:
>   		val = (u64)high << 32 | low;
>   
> -		if (!pmu_msr_write(msr, val)) {
> -			if (err)
> -				*err = native_write_msr_safe(msr, low, high);
> -			else
> -				native_write_msr(msr, low, high);
> -		}
> +		if (pmu_msr_chk_emulated(msr, &val, false, &emulated) && emulated)
> +			return;
> +
> +		if (err)
> +			*err = native_write_msr_safe(msr, low, high);
> +		else
> +			native_write_msr(msr, low, high);
>   	}
>   }
>   
> diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c
> index 95caae97a394..afb02f43ee3f 100644
> --- a/arch/x86/xen/pmu.c
> +++ b/arch/x86/xen/pmu.c
> @@ -128,7 +128,7 @@ static inline uint32_t get_fam15h_addr(u32 addr)
>   	return addr;
>   }
>   
> -static inline bool is_amd_pmu_msr(unsigned int msr)
> +static bool is_amd_pmu_msr(u32 msr)
>   {
>   	if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD &&
>   	    boot_cpu_data.x86_vendor != X86_VENDOR_HYGON)
> @@ -194,8 +194,7 @@ static bool is_intel_pmu_msr(u32 msr_index, int *type, int *index)
>   	}
>   }
>   
> -static bool xen_intel_pmu_emulate(unsigned int msr, u64 *val, int type,
> -				  int index, bool is_read)
> +static bool xen_intel_pmu_emulate(u32 msr, u64 *val, int type, int index, bool is_read)
>   {
>   	uint64_t *reg = NULL;
>   	struct xen_pmu_intel_ctxt *ctxt;
> @@ -257,7 +256,7 @@ static bool xen_intel_pmu_emulate(unsigned int msr, u64 *val, int type,
>   	return false;
>   }
>   
> -static bool xen_amd_pmu_emulate(unsigned int msr, u64 *val, bool is_read)
> +static bool xen_amd_pmu_emulate(u32 msr, u64 *val, bool is_read)
>   {
>   	uint64_t *reg = NULL;
>   	int i, off = 0;
> @@ -298,8 +297,7 @@ static bool xen_amd_pmu_emulate(unsigned int msr, u64 *val, bool is_read)
>   	return false;
>   }
>   
> -static bool pmu_msr_chk_emulated(unsigned int msr, uint64_t *val, bool is_read,
> -				 bool *emul)
> +bool pmu_msr_chk_emulated(u32 msr, u64 *val, bool is_read, bool *emul)
>   {
>   	int type, index = 0;
>   
> @@ -313,20 +311,6 @@ static bool pmu_msr_chk_emulated(unsigned int msr, uint64_t *val, bool is_read,
>   	return true;
>   }
>   
> -bool pmu_msr_read(u32 msr, u64 *val)
> -{
> -	bool emulated;
> -
> -	return pmu_msr_chk_emulated(msr, val, true, &emulated) && emulated;
> -}
> -
> -bool pmu_msr_write(u32 msr, u64 val)
> -{
> -	bool emulated;
> -
> -	return pmu_msr_chk_emulated(msr, &val, false, &emulated) && emulated;
> -}
> -
>   static u64 xen_amd_read_pmc(int counter)
>   {
>   	struct xen_pmu_amd_ctxt *ctxt;
> diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
> index a1875e10be31..fde9f9d7415f 100644
> --- a/arch/x86/xen/xen-ops.h
> +++ b/arch/x86/xen/xen-ops.h
> @@ -271,8 +271,7 @@ void xen_pmu_finish(int cpu);
>   static inline void xen_pmu_init(int cpu) {}
>   static inline void xen_pmu_finish(int cpu) {}
>   #endif
> -bool pmu_msr_read(u32 msr, u64 *val);
> -bool pmu_msr_write(u32 msr, u64 val);
> +bool pmu_msr_chk_emulated(u32 msr, u64 *val, bool is_read, bool *emul);
>   int pmu_apic_update(uint32_t reg);
>   u64 xen_read_pmc(int counter);
>   

May I suggest getting rid of the "emul" parameter of pmu_msr_chk_emulated()?
It has no real value, as pmu_msr_chk_emulated() could easily return false in
the cases where it would set *emul to false.
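
I.e. something like this (just a sketch; the body is approximated from the
helpers quoted above, untested):

    bool pmu_msr_chk_emulated(u32 msr, u64 *val, bool is_read)
    {
        int type, index = 0;

        if (is_amd_pmu_msr(msr))
            return xen_amd_pmu_emulate(msr, val, is_read);

        if (is_intel_pmu_msr(msr, &type, &index))
            return xen_intel_pmu_emulate(msr, val, type, index, is_read);

        /* not a PMU MSR at all, so nothing to emulate */
        return false;
    }

and the callers then shrink to e.g.:

        if (pmu_msr_chk_emulated(msr, &val, true))
            return val;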


Juergen
Xin Li April 24, 2025, 5:50 p.m. UTC | #23
On 4/24/2025 3:11 AM, Jürgen Groß wrote:
> set_seg(), please (further up, too).

Good catch, thanks a lot!
Xin Li April 24, 2025, 10:24 p.m. UTC | #24
> By the way, this patch should have "xen" in its subject tag.
> 

Right, I should add it.
H. Peter Anvin April 25, 2025, 3:44 a.m. UTC | #25
On 4/24/25 18:15, H. Peter Anvin wrote:
> On 4/24/25 01:14, Jürgen Groß wrote:
>>>
>>> Actually, that is how we get this patch with the existing alternatives
>>> infrastructure.  And we took a step further to also remove the pv_ops
>>> MSR APIs...
>>
>> And this is what I'm questioning. IMHO this approach is adding more
>> code by removing the pv_ops MSR_APIs just because "pv_ops is bad". And
>> I believe most refusal of pv_ops is based on no longer valid reasoning.
>>
> 
> pvops are a headache because it is effectively a secondary alternatives 
> infrastructure that is incompatible with the alternatives one...
> 
>>> It looks to me that you want to add a new facility to the alternatives
>>> infrastructure first?
>>
>> Why would we need a new facility in the alternatives infrastructure?
> 
> I'm not sure what Xin means with "facility", but a key motivation for 
> this is to:
> 
> a. Avoid using the pvops for MSRs when the only remaining user 
> thereof (Xen) is only using it for a very small subset of MSRs and for 
> the rest it is just overhead, even for Xen;
> 
> b. Being able to do wrmsrns immediate/wrmsrns/wrmsr and rdmsr immediate/ 
> rdmsr alternatives.
> 
> Of these, (b) is by far the biggest motivation. The architectural 
> direction for supervisor states is to avoid ad hoc and XSAVES ISA and 
> instead use MSRs. The immediate forms are expected to be significantly 
> faster, because they make the MSR index available at the very beginning 
> of the pipeline instead of at a relatively late stage.
> 

Note that to support the immediate forms, we *must* do these inline, or 
the const-ness of the MSR index -- which applies to by far the vast 
majority of MSR references -- gets lost. pvops does exactly that.

Furthermore, the MSR immediate instructions take a 64-bit number in a 
single register; as these instructions are by necessity relatively long, 
it makes sense for the alternative sequence to accept a 64-bit input 
register and do the %eax/%edx shuffle in the legacy fallback code... we 
did a bunch of experiments to see what made most sense.

	-hpa
Jürgen Groß April 25, 2025, 7:01 a.m. UTC | #26
On 25.04.25 05:44, H. Peter Anvin wrote:
> On 4/24/25 18:15, H. Peter Anvin wrote:
>> On 4/24/25 01:14, Jürgen Groß wrote:
>>>>
>>>> Actually, that is how we get this patch with the existing alternatives
>>>> infrastructure.  And we took a step further to also remove the pv_ops
>>>> MSR APIs...
>>>
>>> And this is what I'm questioning. IMHO this approach is adding more
>>> code by removing the pv_ops MSR_APIs just because "pv_ops is bad". And
>>> I believe most refusal of pv_ops is based on no longer valid reasoning.
>>>
>>
>> pvops are a headache because it is effectively a secondary alternatives 
>> infrastructure that is incompatible with the alternatives one...
>>
>>>> It looks to me that you want to add a new facility to the alternatives
>>>> infrastructure first?
>>>
>>> Why would we need a new facility in the alternatives infrastructure?
>>
>> I'm not sure what Xin means with "facility", but a key motivation for this is to:
>>
>> a. Avoid using the pvops for MSRs when on the only remaining user thereof 
>> (Xen) is only using it for a very small subset of MSRs and for the rest it is 
>> just overhead, even for Xen;
>>
>> b. Being able to do wrmsrns immediate/wrmsrns/wrmsr and rdmsr immediate/ rdmsr 
>> alternatives.
>>
>> Of these, (b) is by far the biggest motivation. The architectural direction 
>> for supervisor states is to avoid ad hoc and XSAVES ISA and instead use MSRs. 
>> The immediate forms are expected to be significantly faster, because they make 
>> the MSR index available at the very beginning of the pipeline instead of at a 
>> relatively late stage.
>>
> 
> Note that to support the immediate forms, we *must* do these inline, or the 
> const-ness of the MSR index -- which applies to by far the vast majority of MSR 
> references -- gets lost. pvops does exactly that.
> 
> Furthermore, the MSR immediate instructions take a 64-bit number in a single 
> register; as these instructions are by necessity relatively long, it makes sense 
> for the alternative sequence to accept a 64-bit input register and do the %eax/ 
> %edx shuffle in the legacy fallback code... we did a bunch of experiments to see 
> what made most sense.

Yes, I understand that.

And I'm totally in favor of Xin's rework of the MSR low level functions.

Inlining the MSR access instructions with pv_ops should not be very
complicated. We do that with other instructions (STI/CLI, PTE accesses)
today, so this is no new kind of functionality.

I could have a try writing a patch achieving that, but I would only start
that work in case you might consider taking it instead of Xin's patch
removing the pv_ops usage for rdmsr/wrmsr. In case it turns out that my
version results in more code changes than Xin's patch, I'd be fine to drop
my patch, of course.


Juergen
Jürgen Groß April 25, 2025, 12:51 p.m. UTC | #27
On 25.04.25 14:33, Peter Zijlstra wrote:
> On Wed, Apr 23, 2025 at 06:05:19PM +0200, Jürgen Groß wrote:
> 
>>> It's not a major change, but when it is patched to use the immediate
>>> form MSR write instruction, it's straightforwardly streamlined.
>>
>> It should be rather easy to switch the current wrmsr/rdmsr paravirt patching
>> locations to use the rdmsr/wrmsr instructions instead of doing a call to
>> native_*msr().
> 
> Right, just make the Xen functions asm stubs that expect the instruction
> registers instead of C-abi and ALT_NOT_XEN the thing.
> 
> Shouldn't be hard at all.

Correct. And for the new immediate form we can use ALTERNATIVE_3().


Juergen
H. Peter Anvin April 25, 2025, 3:28 p.m. UTC | #28
On April 25, 2025 12:01:29 AM PDT, "Jürgen Groß" <jgross@suse.com> wrote:
>On 25.04.25 05:44, H. Peter Anvin wrote:
>> On 4/24/25 18:15, H. Peter Anvin wrote:
>>> On 4/24/25 01:14, Jürgen Groß wrote:
>>>>> 
>>>>> Actually, that is how we get this patch with the existing alternatives
>>>>> infrastructure.  And we took a step further to also remove the pv_ops
>>>>> MSR APIs...
>>>> 
>>>> And this is what I'm questioning. IMHO this approach is adding more
>>>> code by removing the pv_ops MSR_APIs just because "pv_ops is bad". And
>>>> I believe most refusal of pv_ops is based on no longer valid reasoning.
>>>> 
>>> 
>>> pvops are a headache because it is effectively a secondary alternatives infrastructure that is incompatible with the alternatives one...
>>> 
>>>>> It looks to me that you want to add a new facility to the alternatives
>>>>> infrastructure first?
>>>> 
>>>> Why would we need a new facility in the alternatives infrastructure?
>>> 
>>> I'm not sure what Xin means with "facility", but a key motivation for this is to:
>>> 
>>> a. Avoid using the pvops for MSRs when on the only remaining user thereof (Xen) is only using it for a very small subset of MSRs and for the rest it is just overhead, even for Xen;
>>> 
>>> b. Being able to do wrmsrns immediate/wrmsrns/wrmsr and rdmsr immediate/ rdmsr alternatives.
>>> 
>>> Of these, (b) is by far the biggest motivation. The architectural direction for supervisor states is to avoid ad hoc and XSAVES ISA and instead use MSRs. The immediate forms are expected to be significantly faster, because they make the MSR index available at the very beginning of the pipeline instead of at a relatively late stage.
>>> 
>> 
>> Note that to support the immediate forms, we *must* do these inline, or the const-ness of the MSR index -- which applies to by far the vast majority of MSR references -- gets lost. pvops does exactly that.
>> 
>> Furthermore, the MSR immediate instructions take a 64-bit number in a single register; as these instructions are by necessity relatively long, it makes sense for the alternative sequence to accept a 64-bit input register and do the %eax/ %edx shuffle in the legacy fallback code... we did a bunch of experiments to see what made most sense.
>
>Yes, I understand that.
>
>And I'm totally in favor of Xin's rework of the MSR low level functions.
>
>Inlining the MSR access instructions with pv_ops should not be very
>complicated. We do that with other instructions (STI/CLI, PTE accesses)
>today, so this is no new kind of functionality.
>
>I could have a try writing a patch achieving that, but I would only start
>that work in case you might consider taking it instead of Xin's patch
>removing the pv_ops usage for rdmsr/wrmsr. In case it turns out that my
>version results in more code changes than Xin's patch, I'd be fine to drop
>my patch, of course.
>
>
>Juergen

The wrapper in question is painfully opaque, but if it is much simpler, then I'm certainly willing to consider it... but I don't really see how it would be possible given, among other things, the need for trap points for the safe MSRs.

Keep in mind this needs to work even without PV enabled!

Note that Andrew encouraged us to pursue the pvops removal for MSRs. Note that Xen benefits pretty heavily because it can dispatch to the proper path for the few that are left in the common case of fixed MSRs.
H. Peter Anvin April 25, 2025, 3:29 p.m. UTC | #29
On April 25, 2025 5:33:17 AM PDT, Peter Zijlstra <peterz@infradead.org> wrote:
>On Wed, Apr 23, 2025 at 06:05:19PM +0200, Jürgen Groß wrote:
>
>> > It's not a major change, but when it is patched to use the immediate
>> > form MSR write instruction, it's straightforwardly streamlined.
>> 
>> It should be rather easy to switch the current wrmsr/rdmsr paravirt patching
>> locations to use the rdmsr/wrmsr instructions instead of doing a call to
>> native_*msr().
>
>Right, just make the Xen functions asm stubs that expect the instruction
>registers instead of C-abi and ALT_NOT_XEN the thing.
>
>Shouldn't be hard at all.

And that's what we will be doing. We already have code for that.
H. Peter Anvin April 25, 2025, 8:12 p.m. UTC | #30
On April 25, 2025 5:51:27 AM PDT, "Jürgen Groß" <jgross@suse.com> wrote:
>On 25.04.25 14:33, Peter Zijlstra wrote:
>> On Wed, Apr 23, 2025 at 06:05:19PM +0200, Jürgen Groß wrote:
>> 
>>>> It's not a major change, but when it is patched to use the immediate
>>>> form MSR write instruction, it's straightforwardly streamlined.
>>> 
>>> It should be rather easy to switch the current wrmsr/rdmsr paravirt patching
>>> locations to use the rdmsr/wrmsr instructions instead of doing a call to
>>> native_*msr().
>> 
>> Right, just make the Xen functions asm stubs that expect the instruction
>> registers instead of C-abi and ALT_NOT_XEN the thing.
>> 
>> Shouldn't be hard at all.
>
>Correct. And for the new immediate form we can use ALTERNATIVE_3().
>
>
>Juergen

Yes; in the ultimate case there are *four* alternatives, but the concept is the same and again we have it implemented already.