Message ID: 20250422082216.1954310-1-xin@zytor.com
Series:     MSR refactor with new MSR instructions support
On 4/22/2025 8:09 AM, Sean Christopherson wrote:
> On Tue, Apr 22, 2025, Xin Li (Intel) wrote:
>> __rdmsr() is the lowest level primitive MSR read API, and its direct
>> use is NOT preferred.
>
> Doesn't mean it's wrong.

I wouldn't go so far as to claim that it's wrong :-)

>> Use its wrapper function native_rdmsrq() instead.

The current code exhibits a somewhat haphazard use of MSR APIs, so I
wanted to clarify which API to employ in specific situations with
verbose function naming.

Here is an example where Boris had to fix the use of MSR APIs:

https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f980f9c31a923e9040dee0bc679a5f5b09e61f40

>
> ...
>
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 1547bfacd40f..e73c1d5ba6c4 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -380,7 +380,7 @@ static __always_inline void vmx_disable_fb_clear(struct vcpu_vmx *vmx)
>>  	if (!vmx->disable_fb_clear)
>>  		return;
>>
>> -	msr = __rdmsr(MSR_IA32_MCU_OPT_CTRL);
>> +	msr = native_rdmsrq(MSR_IA32_MCU_OPT_CTRL);
>>  	msr |= FB_CLEAR_DIS;
>>  	native_wrmsrq(MSR_IA32_MCU_OPT_CTRL, msr);
>>  	/* Cache the MSR value to avoid reading it later */
>> @@ -7307,7 +7307,7 @@ void noinstr vmx_spec_ctrl_restore_host(struct vcpu_vmx *vmx,
>>  		return;
>>
>>  	if (flags & VMX_RUN_SAVE_SPEC_CTRL)
>> -		vmx->spec_ctrl = __rdmsr(MSR_IA32_SPEC_CTRL);
>> +		vmx->spec_ctrl = native_rdmsrq(MSR_IA32_SPEC_CTRL);
>
> And what guarantees that native_rdmsrq() won't have tracing?  Ugh, a later patch
> renames native_rdmsrq() => native_rdmsrq_no_trace().
>
> I really don't like this.  It makes simple and obvious code:
>
> 	vmx->spec_ctrl = __rdmsr(MSR_IA32_SPEC_CTRL);
>
> so much harder to read:
>
> 	vmx->spec_ctrl = native_rdmsrq_no_trace(MSR_IA32_SPEC_CTRL);
>
> and does so in a way that is difficult to review, e.g. I have to peek ahead to
> understand that this is even ok.
>
> I strongly prefer that we find a way to not require such verbose APIs, especially
> if KVM ends up using native variants throughout.  Xen PV is supposed to be the
> odd one out, yet native code is what suffers.  Blech.

Will try to figure out how to name the APIs.

One reason I chose verbose names is that short names are in use and
renaming needs to touch a lot of files (and not fun at all).

Thanks!
    Xin
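For context, a minimal sketch of the layering under discussion, assuming
native_rdmsrq() is nothing more than a non-tracing wrapper around the
lowest-level __rdmsr() primitive (the names come from the thread; the
bodies below are illustrative and are not copied from the kernel tree):

    /* Illustrative sketch only -- assumes kernel types u32/u64. */
    static __always_inline u64 __rdmsr(u32 msr)
    {
            u32 low, high;

            /* Lowest-level primitive: raw RDMSR, no tracing, no fault handling. */
            asm volatile("rdmsr" : "=a" (low), "=d" (high) : "c" (msr));
            return ((u64)high << 32) | low;
    }

    static __always_inline u64 native_rdmsrq(u32 msr)
    {
            /*
             * Same instruction as __rdmsr(); the longer name is meant to mark
             * the call site as deliberately native and untraced.
             */
            return __rdmsr(msr);
    }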
On 4/23/25 02:27, Xin Li wrote:
> One reason I chose verbose names is that short names are in use and
> renaming needs to touch a lot of files (and not fun at all).

This series is getting *WAY* too big. Could you please peel the renaming
stuff out and we can get it applied independently of the new instruction
gunk?
On 4/22/25 01:21, Xin Li (Intel) wrote:
> Signed-off-by: Xin Li (Intel) <xin@zytor.com>

Code: good. No changelog: bad.

Once there's some semblance of a changelog:

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
On 23.04.25 10:51, Xin Li wrote:
> On 4/22/2025 2:57 AM, Jürgen Groß wrote:
>> On 22.04.25 10:22, Xin Li (Intel) wrote:
>>> The story started from tglx's reply in [1]:
>>>
>>>   For actual performance relevant code the current PV ops mechanics
>>>   are a horrorshow when the op defaults to the native instruction.
>>>
>>>   look at wrmsrl():
>>>
>>>   wrmsrl(msr, val
>>>    wrmsr(msr, (u32)val, (u32)val >> 32))
>>>     paravirt_write_msr(msr, low, high)
>>>      PVOP_VCALL3(cpu.write_msr, msr, low, high)
>>>
>>>   Which results in
>>>
>>>       mov $msr, %edi
>>>       mov $val, %rdx
>>>       mov %edx, %esi
>>>       shr $0x20, %rdx
>>>       call native_write_msr
>>>
>>>   and native_write_msr() does at minimum:
>>>
>>>       mov %edi,%ecx
>>>       mov %esi,%eax
>>>       wrmsr
>>>       ret
>>>
>>>   In the worst case 'ret' is going through the return thunk. Not to
>>>   talk about function prologues and whatever.
>>>
>>>   This becomes even more silly for trivial instructions like STI/CLI
>>>   or in the worst case paravirt_nop().
>>
>> This is nonsense.
>>
>> In the non-Xen case the initial indirect call is directly replaced with
>> STI/CLI via alternative patching, while for Xen it is replaced by a direct
>> call.
>>
>> The paravirt_nop() case is handled in alt_replace_call() by replacing the
>> indirect call with a nop in case the target of the call was paravirt_nop()
>> (which is in fact no_func()).
>>
>>>   The call makes only sense, when the native default is an actual
>>>   function, but for the trivial cases it's a blatant engineering
>>>   trainwreck.
>>
>> The trivial cases are all handled as stated above: a direct replacement
>> instruction is placed at the indirect call position.
>
> The above comment was given in 2023 IIRC, and you have addressed it.
>
>>> Later a consensus was reached to utilize the alternatives mechanism to
>>> eliminate the indirect call overhead introduced by the pv_ops APIs:
>>>
>>>   1) When built with !CONFIG_XEN_PV, X86_FEATURE_XENPV becomes a
>>>      disabled feature, preventing the Xen code from being built
>>>      and ensuring the native code is executed unconditionally.
>>
>> This is the case today already. There is no need for any change to have
>> this in place.
>>
>>>   2) When built with CONFIG_XEN_PV:
>>>
>>>      2.1) If not running on the Xen hypervisor (!X86_FEATURE_XENPV),
>>>           the kernel runtime binary is patched to unconditionally
>>>           jump to the native MSR write code.
>>>
>>>      2.2) If running on the Xen hypervisor (X86_FEATURE_XENPV), the
>>>           kernel runtime binary is patched to unconditionally jump
>>>           to the Xen MSR write code.
>>
>> I can't see what is different here compared to today's state.
>>
>>> The alternatives mechanism is also used to choose the new immediate
>>> form MSR write instruction when it's available.
>>
>> Yes, this needs to be added.
>>
>>> Consequently, remove the pv_ops MSR write APIs and the Xen callbacks.
>>
>> I still don't see a major difference to today's solution.
>
> The existing code generates:
>
>   ...
>   bf e0 06 00 00        mov    $0x6e0,%edi
>   89 d6                 mov    %edx,%esi
>   48 c1 ea 20           shr    $0x20,%rdx
>   ff 15 07 48 8c 01     call   *0x18c4807(%rip)    # <pv_ops+0xb8>
>   31 c0                 xor    %eax,%eax
>   ...
>
> And on native, the indirect call instruction is patched to a direct call
> as you mentioned:
>
>   ...
>   bf e0 06 00 00        mov    $0x6e0,%edi
>   89 d6                 mov    %edx,%esi
>   48 c1 ea 20           shr    $0x20,%rdx
>   e8 60 3e 01 00        call   <{native,xen}_write_msr>    # direct
>   90                    nop
>   31 c0                 xor    %eax,%eax
>   ...
>
> This patch set generates assembly w/o CALL on native:
>
>   ...
>   e9 e6 22 c6 01        jmp    1f    # on native or nop on Xen
>   b9 e0 06 00 00        mov    $0x6e0,%ecx
>   e8 91 d4 fa ff        call   ffffffff8134ee80 <asm_xen_write_msr>
>   e9 a4 9f eb 00        jmp    ffffffff8225b9a0 <__x86_return_thunk>
>   ...
> 1: b9 e0 06 00 00       mov    $0x6e0,%ecx    # immediate form here
>   48 89 c2              mov    %rax,%rdx
>   48 c1 ea 20           shr    $0x20,%rdx
>   3e 0f 30              ds wrmsr
>   ...
>
> It's not a major change, but when it is patched to use the immediate form
> MSR write instruction, it's straightforwardly streamlined.

It should be rather easy to switch the current wrmsr/rdmsr paravirt patching
locations to use the rdmsr/wrmsr instructions instead of doing a call to
native_*msr().

The case of the new immediate form could be handled the same way.

>> Only the "paravirt" term has been eliminated.
>
> Yes.
>
> But a PV guest doesn't operate at the highest privilege level, which
> means MSR instructions typically result in a #GP fault. I actually think
> the pv_ops MSR APIs are unnecessary because of this inherent limitation.
>
> Looking at the Xen MSR code, except PMU and just a few MSRs, it falls
> back to executing native MSR instructions. As MSR instructions trigger
> #GP, Xen takes control and handles them in 2 ways:
>
>   1) emulate (or ignore) a MSR operation and skip the guest instruction.
>
>   2) inject the #GP back to the guest OS and let its #GP handler handle it.
>      But the Linux MSR exception handler just ignores the MSR instruction
>      (an MCE MSR exception will panic).
>
> So why not let Xen handle all the details which it already tries to do?

Some MSRs are not handled that way, but via a kernel internal emulation.
And those are handled that way mostly due to performance reasons. And some
need special treatment.

> (Linux w/ such a change may not be able to run on old Xen hypervisors.)

Yes, and this is something to avoid. And remember that Linux isn't the only
PV-mode guest existing.

> BTW, if performance is a concern, writes to MSR_KERNEL_GS_BASE and
> MSR_GS_BASE anyway are hypercalls into Xen.

Yes, and some other MSR writes are just NOPs with Xen-PV.


Juergen
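To make the intent of the generated code above concrete, here is a rough
C-level sketch of the call-site shape being discussed: the native path
executes WRMSR inline with no CALL, while the Xen PV path branches to a
helper. cpu_feature_enabled() and X86_FEATURE_XENPV are existing kernel
facilities; write_msr_patched() and xen_write_msr_stub() are hypothetical
names used only for illustration, and the real series performs the
selection with alternatives patching rather than a plain branch:

    static __always_inline void write_msr_patched(u32 msr, u64 val)
    {
            if (cpu_feature_enabled(X86_FEATURE_XENPV)) {
                    /* Xen PV: go through a (hypothetical) call-based helper. */
                    xen_write_msr_stub(msr, val);
            } else {
                    /* Native: WRMSR inline, no CALL/RET, no return thunk. */
                    asm volatile("wrmsr"
                                 : /* no outputs */
                                 : "c" (msr), "a" ((u32)val), "d" ((u32)(val >> 32))
                                 : "memory");
            }
    }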
On 4/23/2025 8:51 AM, Dave Hansen wrote:
> On 4/22/25 01:21, Xin Li (Intel) wrote:
>>  static __always_inline void sev_es_wr_ghcb_msr(u64 val)
>>  {
>> -	u32 low, high;
>> -
>> -	low  = (u32)(val);
>> -	high = (u32)(val >> 32);
>> -
>> -	native_wrmsr(MSR_AMD64_SEV_ES_GHCB, low, high);
>> +	native_wrmsrq(MSR_AMD64_SEV_ES_GHCB, val);
>>  }
>
> A note on ordering: Had this been a native_wrmsr()=>__wrmsr()
> conversion, it could be sucked into the tree easily before the big
> __wrmsr()=>native_wrmsrq() conversion.
>
> Yeah, you'd have to base the big rename on top of this. But with a
> series this big, I'd prioritize whatever gets it trimmed down.

Okay, I will focus on cleanup first.
On 4/23/2025 8:51 AM, Dave Hansen wrote:
> On 4/22/25 01:21, Xin Li (Intel) wrote:
>>  static __always_inline void sev_es_wr_ghcb_msr(u64 val)
>>  {
>> -	u32 low, high;
>> -
>> -	low  = (u32)(val);
>> -	high = (u32)(val >> 32);
>> -
>> -	native_wrmsr(MSR_AMD64_SEV_ES_GHCB, low, high);
>> +	native_wrmsrq(MSR_AMD64_SEV_ES_GHCB, val);
>>  }
>
> A note on ordering: Had this been a native_wrmsr()=>__wrmsr()
> conversion, it could be sucked into the tree easily before the big
> __wrmsr()=>native_wrmsrq() conversion.

Can't reorder the 2 patches, because __wrmsr() takes two u32 arguments
and the split has to be done explicitly in sev_es_wr_ghcb_msr().

Thanks!
    Xin
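For reference, the signature difference that blocks the reordering, written
out as illustrative prototypes (not copied from the tree): __wrmsr() and
native_wrmsr() take the value already split into two u32 halves, while
native_wrmsrq() takes a single u64, so a native_wrmsr() => __wrmsr()
conversion would still leave the explicit low/high split inside
sev_es_wr_ghcb_msr().

    void __wrmsr(u32 msr, u32 low, u32 high);      /* split-argument form */
    void native_wrmsr(u32 msr, u32 low, u32 high); /* split-argument form */
    void native_wrmsrq(u32 msr, u64 val);          /* single 64-bit value  */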
On 4/22/2025 4:21 PM, Xin Li (Intel) wrote:
> As pmu_msr_{read,write}() are now wrappers of pmu_msr_chk_emulated(),
> remove them and use pmu_msr_chk_emulated() directly.
>
> While at it, convert the data type of MSR index to u32 in functions
> called in pmu_msr_chk_emulated().
>
> Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
> Signed-off-by: Xin Li (Intel) <xin@zytor.com>
> ---
>  arch/x86/xen/enlighten_pv.c | 17 ++++++++++-------
>  arch/x86/xen/pmu.c          | 24 ++++--------------------
>  arch/x86/xen/xen-ops.h      |  3 +--
>  3 files changed, 15 insertions(+), 29 deletions(-)
>
> diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
> index 1418758b57ff..b5a8bceb5f56 100644
> --- a/arch/x86/xen/enlighten_pv.c
> +++ b/arch/x86/xen/enlighten_pv.c
> @@ -1089,8 +1089,9 @@ static void xen_write_cr4(unsigned long cr4)
>  static u64 xen_do_read_msr(unsigned int msr, int *err)
>  {
>  	u64 val = 0;	/* Avoid uninitialized value for safe variant. */
> +	bool emulated;
>
> -	if (pmu_msr_read(msr, &val, err))
> +	if (pmu_msr_chk_emulated(msr, &val, true, &emulated) && emulated)

ah, here it is. Could we merge this patch and previous patch into a single
patch? It's unnecessary to just modify the pmu_msr_read()/pmu_msr_write()
in previous patch and delete them immediately. It just wastes the effort.

>  		return val;
>
>  	if (err)
> @@ -1133,6 +1134,7 @@ static void xen_do_write_msr(unsigned int msr, unsigned int low,
>  			     unsigned int high, int *err)
>  {
>  	u64 val;
> +	bool emulated;
>
>  	switch (msr) {
>  	case MSR_FS_BASE:
> @@ -1162,12 +1164,13 @@ static void xen_do_write_msr(unsigned int msr, unsigned int low,
>  	default:
>  		val = (u64)high << 32 | low;
>
> -		if (!pmu_msr_write(msr, val)) {
> -			if (err)
> -				*err = native_write_msr_safe(msr, low, high);
> -			else
> -				native_write_msr(msr, low, high);
> -		}
> +		if (pmu_msr_chk_emulated(msr, &val, false, &emulated) && emulated)
> +			return;
> +
> +		if (err)
> +			*err = native_write_msr_safe(msr, low, high);
> +		else
> +			native_write_msr(msr, low, high);
>  	}
>  }
>
> diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c
> index 95caae97a394..afb02f43ee3f 100644
> --- a/arch/x86/xen/pmu.c
> +++ b/arch/x86/xen/pmu.c
> @@ -128,7 +128,7 @@ static inline uint32_t get_fam15h_addr(u32 addr)
>  	return addr;
>  }
>
> -static inline bool is_amd_pmu_msr(unsigned int msr)
> +static bool is_amd_pmu_msr(u32 msr)
>  {
>  	if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD &&
>  	    boot_cpu_data.x86_vendor != X86_VENDOR_HYGON)
> @@ -194,8 +194,7 @@ static bool is_intel_pmu_msr(u32 msr_index, int *type, int *index)
>  	}
>  }
>
> -static bool xen_intel_pmu_emulate(unsigned int msr, u64 *val, int type,
> -				  int index, bool is_read)
> +static bool xen_intel_pmu_emulate(u32 msr, u64 *val, int type, int index, bool is_read)
>  {
>  	uint64_t *reg = NULL;
>  	struct xen_pmu_intel_ctxt *ctxt;
> @@ -257,7 +256,7 @@ static bool xen_intel_pmu_emulate(unsigned int msr, u64 *val, int type,
>  	return false;
>  }
>
> -static bool xen_amd_pmu_emulate(unsigned int msr, u64 *val, bool is_read)
> +static bool xen_amd_pmu_emulate(u32 msr, u64 *val, bool is_read)
>  {
>  	uint64_t *reg = NULL;
>  	int i, off = 0;
> @@ -298,8 +297,7 @@ static bool xen_amd_pmu_emulate(unsigned int msr, u64 *val, bool is_read)
>  	return false;
>  }
>
> -static bool pmu_msr_chk_emulated(unsigned int msr, uint64_t *val, bool is_read,
> -				 bool *emul)
> +bool pmu_msr_chk_emulated(u32 msr, u64 *val, bool is_read, bool *emul)
>  {
>  	int type, index = 0;
>
> @@ -313,20 +311,6 @@ static bool pmu_msr_chk_emulated(unsigned int msr, uint64_t *val, bool is_read,
>  	return true;
>  }
>
> -bool pmu_msr_read(u32 msr, u64 *val)
> -{
> -	bool emulated;
> -
> -	return pmu_msr_chk_emulated(msr, val, true, &emulated) && emulated;
> -}
> -
> -bool pmu_msr_write(u32 msr, u64 val)
> -{
> -	bool emulated;
> -
> -	return pmu_msr_chk_emulated(msr, &val, false, &emulated) && emulated;
> -}
> -
>  static u64 xen_amd_read_pmc(int counter)
>  {
>  	struct xen_pmu_amd_ctxt *ctxt;
> diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
> index a1875e10be31..fde9f9d7415f 100644
> --- a/arch/x86/xen/xen-ops.h
> +++ b/arch/x86/xen/xen-ops.h
> @@ -271,8 +271,7 @@ void xen_pmu_finish(int cpu);
>  static inline void xen_pmu_init(int cpu) {}
>  static inline void xen_pmu_finish(int cpu) {}
>  #endif
> -bool pmu_msr_read(u32 msr, u64 *val);
> -bool pmu_msr_write(u32 msr, u64 val);
> +bool pmu_msr_chk_emulated(u32 msr, u64 *val, bool is_read, bool *emul);
>  int pmu_apic_update(uint32_t reg);
>  u64 xen_read_pmc(int counter);
On 4/23/2025 11:33 PM, Mi, Dapeng wrote:
> Could we merge this patch and previous patch into a single patch? It's
> unnecessary to just modify the pmu_msr_read()/pmu_msr_write() in previous
> patch and delete them immediately. It just wastes the effort.

No, it's not wasting effort; it's for easier review.

Look at this patch: you can easily tell that pmu_msr_read() and
pmu_msr_write() are nothing more than pmu_msr_chk_emulated(), and then
removing them makes a lot of sense.
On 4/24/2025 12:43 AM, Mi, Dapeng wrote:
> These 2 patches are not complicated, it won't be difficult to review if
> merging them into one as long as the commit message mentions it clearly.
> Anyway I'm fine if you hope to keep them into two patches.

Simple Small Steps...
On 24.04.25 10:06, Xin Li wrote:
> On 4/23/2025 9:05 AM, Jürgen Groß wrote:
>>> It's not a major change, but when it is patched to use the immediate form
>>> MSR write instruction, it's straightforwardly streamlined.
>>
>> It should be rather easy to switch the current wrmsr/rdmsr paravirt patching
>> locations to use the rdmsr/wrmsr instructions instead of doing a call to
>> native_*msr().
>>
>> The case of the new immediate form could be handled the same way.
>
> Actually, that is how we get this patch with the existing alternatives
> infrastructure. And we took a step further to also remove the pv_ops
> MSR APIs...

And this is what I'm questioning.

IMHO this approach is adding more code by removing the pv_ops MSR APIs
just because "pv_ops is bad". And I believe most refusal of pv_ops is
based on no longer valid reasoning.

> It looks to me that you want to add a new facility to the alternatives
> infrastructure first?

Why would we need a new facility in the alternatives infrastructure?


Juergen
On 22.04.25 10:21, Xin Li (Intel) wrote:
> set_reg() is used to write the following MSRs on Xen:
>
>   MSR_FS_BASE
>   MSR_KERNEL_GS_BASE
>   MSR_GS_BASE
>
> But none of these MSRs are written using any MSR write safe API.
> Therefore there is no need to pass an error pointer argument to
> set_reg() for returning an error code to be used in MSR safe APIs.

set_seg(), please (further up, too).

>
> Remove the error pointer argument.
>
> Signed-off-by: Xin Li (Intel) <xin@zytor.com>

Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen
On 4/24/2025 3:11 AM, Jürgen Groß wrote:
> set_seg(), please (further up, too).
Good catch, thanks a lot!
> By the way, this patch should have "xen" in its subject tag.

Right, I should add it.