Message ID: 20250422082216.1954310-1-xin@zytor.com
Series:     MSR refactor with new MSR instructions support
On 4/22/2025 8:09 AM, Sean Christopherson wrote:
> On Tue, Apr 22, 2025, Xin Li (Intel) wrote:
>> __rdmsr() is the lowest level primitive MSR read API, and its direct
>> use is NOT preferred.
>
> Doesn't mean it's wrong.

I wouldn't go so far as to claim that it's wrong :-)

>> Use its wrapper function native_rdmsrq() instead.

The current code exhibits a somewhat haphazard use of MSR APIs, so I
wanted to clarify which API to employ in specific situations with
verbose function naming.

Here is an example where Boris had to fix the use of MSR APIs:

https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f980f9c31a923e9040dee0bc679a5f5b09e61f40

>
> ...
>
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 1547bfacd40f..e73c1d5ba6c4 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -380,7 +380,7 @@ static __always_inline void vmx_disable_fb_clear(struct vcpu_vmx *vmx)
>>  	if (!vmx->disable_fb_clear)
>>  		return;
>>
>> -	msr = __rdmsr(MSR_IA32_MCU_OPT_CTRL);
>> +	msr = native_rdmsrq(MSR_IA32_MCU_OPT_CTRL);
>>  	msr |= FB_CLEAR_DIS;
>>  	native_wrmsrq(MSR_IA32_MCU_OPT_CTRL, msr);
>>  	/* Cache the MSR value to avoid reading it later */
>> @@ -7307,7 +7307,7 @@ void noinstr vmx_spec_ctrl_restore_host(struct vcpu_vmx *vmx,
>>  		return;
>>
>>  	if (flags & VMX_RUN_SAVE_SPEC_CTRL)
>> -		vmx->spec_ctrl = __rdmsr(MSR_IA32_SPEC_CTRL);
>> +		vmx->spec_ctrl = native_rdmsrq(MSR_IA32_SPEC_CTRL);
>
> And what guarantees that native_rdmsrq() won't have tracing?  Ugh, a later patch
> renames native_rdmsrq() => native_rdmsrq_no_trace().
>
> I really don't like this.  It makes simple and obvious code:
>
> 	vmx->spec_ctrl = __rdmsr(MSR_IA32_SPEC_CTRL);
>
> so much harder to read:
>
> 	vmx->spec_ctrl = native_rdmsrq_no_trace(MSR_IA32_SPEC_CTRL);
>
> and does so in a way that is difficult to review, e.g. I have to peek ahead to
> understand that this is even ok.
>
> I strongly prefer that we find a way to not require such verbose APIs, especially
> if KVM ends up using native variants throughout.  Xen PV is supposed to be the
> odd one out, yet native code is what suffers.  Blech.

Will try to figure out how to name the APIs.

One reason I chose verbose names is that short names are in use and
renaming needs to touch a lot of files (and not fun at all).

Thanks!
    Xin
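For context, a minimal sketch of the layering under discussion, assuming
native_rdmsrq() is nothing more than a non-tracing wrapper around the
lowest-level __rdmsr() primitive (the names come from the thread; the
bodies below are illustrative and are not copied from the kernel tree):

    /* Illustrative sketch only -- assumes kernel types u32/u64. */
    static __always_inline u64 __rdmsr(u32 msr)
    {
            u32 low, high;

            /* Lowest-level primitive: raw RDMSR, no tracing, no fault handling. */
            asm volatile("rdmsr" : "=a" (low), "=d" (high) : "c" (msr));
            return ((u64)high << 32) | low;
    }

    static __always_inline u64 native_rdmsrq(u32 msr)
    {
            /*
             * Same instruction as __rdmsr(); the longer name is meant to mark
             * the call site as deliberately native and untraced.
             */
            return __rdmsr(msr);
    }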
On 4/23/25 02:27, Xin Li wrote:
> One reason I chose verbose names is that short names are in use and
> renaming needs to touch a lot of files (and not fun at all).

This series is getting *WAY* too big. Could you please peel the renaming
stuff out and we can get it applied independently of the new instruction
gunk?
On 4/22/25 01:21, Xin Li (Intel) wrote:
> Signed-off-by: Xin Li (Intel) <xin@zytor.com>

Code: good. No changelog: bad.

Once there's some semblance of a changelog:

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
On 23.04.25 10:51, Xin Li wrote:
> On 4/22/2025 2:57 AM, Jürgen Groß wrote:
>> On 22.04.25 10:22, Xin Li (Intel) wrote:
>>> The story started from tglx's reply in [1]:
>>>
>>>   For actual performance relevant code the current PV ops mechanics
>>>   are a horrorshow when the op defaults to the native instruction.
>>>
>>>   look at wrmsrl():
>>>
>>>   wrmsrl(msr, val
>>>    wrmsr(msr, (u32)val, (u32)val >> 32))
>>>     paravirt_write_msr(msr, low, high)
>>>      PVOP_VCALL3(cpu.write_msr, msr, low, high)
>>>
>>>   Which results in
>>>
>>>       mov $msr, %edi
>>>       mov $val, %rdx
>>>       mov %edx, %esi
>>>       shr $0x20, %rdx
>>>       call native_write_msr
>>>
>>>   and native_write_msr() does at minimum:
>>>
>>>       mov %edi,%ecx
>>>       mov %esi,%eax
>>>       wrmsr
>>>       ret
>>>
>>>   In the worst case 'ret' is going through the return thunk. Not to
>>>   talk about function prologues and whatever.
>>>
>>>   This becomes even more silly for trivial instructions like STI/CLI
>>>   or in the worst case paravirt_nop().
>>
>> This is nonsense.
>>
>> In the non-Xen case the initial indirect call is directly replaced with
>> STI/CLI via alternative patching, while for Xen it is replaced by a direct
>> call.
>>
>> The paravirt_nop() case is handled in alt_replace_call() by replacing the
>> indirect call with a nop in case the target of the call was paravirt_nop()
>> (which is in fact no_func()).
>>
>>>   The call makes only sense, when the native default is an actual
>>>   function, but for the trivial cases it's a blatant engineering
>>>   trainwreck.
>>
>> The trivial cases are all handled as stated above: a direct replacement
>> instruction is placed at the indirect call position.
>
> The above comment was given in 2023 IIRC, and you have addressed it.
>
>>> Later a consensus was reached to utilize the alternatives mechanism to
>>> eliminate the indirect call overhead introduced by the pv_ops APIs:
>>>
>>>   1) When built with !CONFIG_XEN_PV, X86_FEATURE_XENPV becomes a
>>>      disabled feature, preventing the Xen code from being built
>>>      and ensuring the native code is executed unconditionally.
>>
>> This is the case today already. There is no need for any change to have
>> this in place.
>>
>>>   2) When built with CONFIG_XEN_PV:
>>>
>>>      2.1) If not running on the Xen hypervisor (!X86_FEATURE_XENPV),
>>>           the kernel runtime binary is patched to unconditionally
>>>           jump to the native MSR write code.
>>>
>>>      2.2) If running on the Xen hypervisor (X86_FEATURE_XENPV), the
>>>           kernel runtime binary is patched to unconditionally jump
>>>           to the Xen MSR write code.
>>
>> I can't see what is different here compared to today's state.
>>
>>> The alternatives mechanism is also used to choose the new immediate
>>> form MSR write instruction when it's available.
>>
>> Yes, this needs to be added.
>>
>>> Consequently, remove the pv_ops MSR write APIs and the Xen callbacks.
>>
>> I still don't see a major difference to today's solution.
>
> The existing code generates:
>
>   ...
>   bf e0 06 00 00        mov    $0x6e0,%edi
>   89 d6                 mov    %edx,%esi
>   48 c1 ea 20           shr    $0x20,%rdx
>   ff 15 07 48 8c 01     call   *0x18c4807(%rip)    # <pv_ops+0xb8>
>   31 c0                 xor    %eax,%eax
>   ...
>
> And on native, the indirect call instruction is patched to a direct call
> as you mentioned:
>
>   ...
>   bf e0 06 00 00        mov    $0x6e0,%edi
>   89 d6                 mov    %edx,%esi
>   48 c1 ea 20           shr    $0x20,%rdx
>   e8 60 3e 01 00        call   <{native,xen}_write_msr>    # direct
>   90                    nop
>   31 c0                 xor    %eax,%eax
>   ...
>
> This patch set generates assembly w/o CALL on native:
>
>   ...
>   e9 e6 22 c6 01        jmp    1f    # on native or nop on Xen
>   b9 e0 06 00 00        mov    $0x6e0,%ecx
>   e8 91 d4 fa ff        call   ffffffff8134ee80 <asm_xen_write_msr>
>   e9 a4 9f eb 00        jmp    ffffffff8225b9a0 <__x86_return_thunk>
>   ...
> 1: b9 e0 06 00 00       mov    $0x6e0,%ecx    # immediate form here
>   48 89 c2              mov    %rax,%rdx
>   48 c1 ea 20           shr    $0x20,%rdx
>   3e 0f 30              ds wrmsr
>   ...
>
> It's not a major change, but when it is patched to use the immediate form
> MSR write instruction, it's straightforwardly streamlined.

It should be rather easy to switch the current wrmsr/rdmsr paravirt patching
locations to use the rdmsr/wrmsr instructions instead of doing a call to
native_*msr().

The case of the new immediate form could be handled the same way.

>> Only the "paravirt" term has been eliminated.
>
> Yes.
>
> But a PV guest doesn't operate at the highest privilege level, which
> means MSR instructions typically result in a #GP fault. I actually think
> the pv_ops MSR APIs are unnecessary because of this inherent limitation.
>
> Looking at the Xen MSR code, except PMU and just a few MSRs, it falls
> back to executing native MSR instructions. As MSR instructions trigger
> #GP, Xen takes control and handles them in 2 ways:
>
>   1) emulate (or ignore) a MSR operation and skip the guest instruction.
>
>   2) inject the #GP back to the guest OS and let its #GP handler handle it.
>      But the Linux MSR exception handler just ignores the MSR instruction
>      (an MCE MSR exception will panic).
>
> So why not let Xen handle all the details which it already tries to do?

Some MSRs are not handled that way, but via a kernel internal emulation.
And those are handled that way mostly due to performance reasons. And some
need special treatment.

> (Linux w/ such a change may not be able to run on old Xen hypervisors.)

Yes, and this is something to avoid. And remember that Linux isn't the only
PV-mode guest existing.

> BTW, if performance is a concern, writes to MSR_KERNEL_GS_BASE and
> MSR_GS_BASE anyway are hypercalls into Xen.

Yes, and some other MSR writes are just NOPs with Xen-PV.


Juergen
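To make the intent of the generated code above concrete, here is a rough
C-level sketch of the call-site shape being discussed: the native path
executes WRMSR inline with no CALL, while the Xen PV path branches to a
helper. cpu_feature_enabled() and X86_FEATURE_XENPV are existing kernel
facilities; write_msr_patched() and xen_write_msr_stub() are hypothetical
names used only for illustration, and the real series performs the
selection with alternatives patching rather than a plain branch:

    static __always_inline void write_msr_patched(u32 msr, u64 val)
    {
            if (cpu_feature_enabled(X86_FEATURE_XENPV)) {
                    /* Xen PV: go through a (hypothetical) call-based helper. */
                    xen_write_msr_stub(msr, val);
            } else {
                    /* Native: WRMSR inline, no CALL/RET, no return thunk. */
                    asm volatile("wrmsr"
                                 : /* no outputs */
                                 : "c" (msr), "a" ((u32)val), "d" ((u32)(val >> 32))
                                 : "memory");
            }
    }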
On 4/23/2025 8:51 AM, Dave Hansen wrote:
> On 4/22/25 01:21, Xin Li (Intel) wrote:
>>  static __always_inline void sev_es_wr_ghcb_msr(u64 val)
>>  {
>> -	u32 low, high;
>> -
>> -	low  = (u32)(val);
>> -	high = (u32)(val >> 32);
>> -
>> -	native_wrmsr(MSR_AMD64_SEV_ES_GHCB, low, high);
>> +	native_wrmsrq(MSR_AMD64_SEV_ES_GHCB, val);
>>  }
>
> A note on ordering: Had this been a native_wrmsr()=>__wrmsr()
> conversion, it could be sucked into the tree easily before the big
> __wrmsr()=>native_wrmsrq() conversion.
>
> Yeah, you'd have to base the big rename on top of this. But with a
> series this big, I'd prioritize whatever gets it trimmed down.

Okay, I will focus on cleanup first.
On 4/23/2025 8:51 AM, Dave Hansen wrote:
> On 4/22/25 01:21, Xin Li (Intel) wrote:
>>  static __always_inline void sev_es_wr_ghcb_msr(u64 val)
>>  {
>> -	u32 low, high;
>> -
>> -	low  = (u32)(val);
>> -	high = (u32)(val >> 32);
>> -
>> -	native_wrmsr(MSR_AMD64_SEV_ES_GHCB, low, high);
>> +	native_wrmsrq(MSR_AMD64_SEV_ES_GHCB, val);
>>  }
>
> A note on ordering: Had this been a native_wrmsr()=>__wrmsr()
> conversion, it could be sucked into the tree easily before the big
> __wrmsr()=>native_wrmsrq() conversion.

Can't reorder the 2 patches, because __wrmsr() takes two u32 arguments
and the split has to be done explicitly in sev_es_wr_ghcb_msr().

Thanks!
    Xin
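For reference, the signature difference that blocks the reordering, written
out as illustrative prototypes (not copied from the tree): __wrmsr() and
native_wrmsr() take the value already split into two u32 halves, while
native_wrmsrq() takes a single u64, so a native_wrmsr() => __wrmsr()
conversion would still leave the explicit low/high split inside
sev_es_wr_ghcb_msr().

    void __wrmsr(u32 msr, u32 low, u32 high);      /* split-argument form */
    void native_wrmsr(u32 msr, u32 low, u32 high); /* split-argument form */
    void native_wrmsrq(u32 msr, u64 val);          /* single 64-bit value  */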
On 4/22/2025 4:21 PM, Xin Li (Intel) wrote:
> As pmu_msr_{read,write}() are now wrappers of pmu_msr_chk_emulated(),
> remove them and use pmu_msr_chk_emulated() directly.
>
> While at it, convert the data type of MSR index to u32 in functions
> called in pmu_msr_chk_emulated().
>
> Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
> Signed-off-by: Xin Li (Intel) <xin@zytor.com>
> ---
>  arch/x86/xen/enlighten_pv.c | 17 ++++++++++-------
>  arch/x86/xen/pmu.c          | 24 ++++--------------------
>  arch/x86/xen/xen-ops.h      |  3 +--
>  3 files changed, 15 insertions(+), 29 deletions(-)
>
> diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
> index 1418758b57ff..b5a8bceb5f56 100644
> --- a/arch/x86/xen/enlighten_pv.c
> +++ b/arch/x86/xen/enlighten_pv.c
> @@ -1089,8 +1089,9 @@ static void xen_write_cr4(unsigned long cr4)
>  static u64 xen_do_read_msr(unsigned int msr, int *err)
>  {
>  	u64 val = 0;	/* Avoid uninitialized value for safe variant. */
> +	bool emulated;
>
> -	if (pmu_msr_read(msr, &val, err))
> +	if (pmu_msr_chk_emulated(msr, &val, true, &emulated) && emulated)

ah, here it is. Could we merge this patch and previous patch into a single
patch? It's unnecessary to just modify the pmu_msr_read()/pmu_msr_write()
in previous patch and delete them immediately. It just wastes the effort.

>  		return val;
>
>  	if (err)
> @@ -1133,6 +1134,7 @@ static void xen_do_write_msr(unsigned int msr, unsigned int low,
>  			     unsigned int high, int *err)
>  {
>  	u64 val;
> +	bool emulated;
>
>  	switch (msr) {
>  	case MSR_FS_BASE:
> @@ -1162,12 +1164,13 @@ static void xen_do_write_msr(unsigned int msr, unsigned int low,
>  	default:
>  		val = (u64)high << 32 | low;
>
> -		if (!pmu_msr_write(msr, val)) {
> -			if (err)
> -				*err = native_write_msr_safe(msr, low, high);
> -			else
> -				native_write_msr(msr, low, high);
> -		}
> +		if (pmu_msr_chk_emulated(msr, &val, false, &emulated) && emulated)
> +			return;
> +
> +		if (err)
> +			*err = native_write_msr_safe(msr, low, high);
> +		else
> +			native_write_msr(msr, low, high);
>  	}
>  }
>
> diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c
> index 95caae97a394..afb02f43ee3f 100644
> --- a/arch/x86/xen/pmu.c
> +++ b/arch/x86/xen/pmu.c
> @@ -128,7 +128,7 @@ static inline uint32_t get_fam15h_addr(u32 addr)
>  	return addr;
>  }
>
> -static inline bool is_amd_pmu_msr(unsigned int msr)
> +static bool is_amd_pmu_msr(u32 msr)
>  {
>  	if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD &&
>  	    boot_cpu_data.x86_vendor != X86_VENDOR_HYGON)
> @@ -194,8 +194,7 @@ static bool is_intel_pmu_msr(u32 msr_index, int *type, int *index)
>  	}
>  }
>
> -static bool xen_intel_pmu_emulate(unsigned int msr, u64 *val, int type,
> -				  int index, bool is_read)
> +static bool xen_intel_pmu_emulate(u32 msr, u64 *val, int type, int index, bool is_read)
>  {
>  	uint64_t *reg = NULL;
>  	struct xen_pmu_intel_ctxt *ctxt;
> @@ -257,7 +256,7 @@ static bool xen_intel_pmu_emulate(unsigned int msr, u64 *val, int type,
>  	return false;
>  }
>
> -static bool xen_amd_pmu_emulate(unsigned int msr, u64 *val, bool is_read)
> +static bool xen_amd_pmu_emulate(u32 msr, u64 *val, bool is_read)
>  {
>  	uint64_t *reg = NULL;
>  	int i, off = 0;
> @@ -298,8 +297,7 @@ static bool xen_amd_pmu_emulate(unsigned int msr, u64 *val, bool is_read)
>  	return false;
>  }
>
> -static bool pmu_msr_chk_emulated(unsigned int msr, uint64_t *val, bool is_read,
> -				 bool *emul)
> +bool pmu_msr_chk_emulated(u32 msr, u64 *val, bool is_read, bool *emul)
>  {
>  	int type, index = 0;
>
> @@ -313,20 +311,6 @@ static bool pmu_msr_chk_emulated(unsigned int msr, uint64_t *val, bool is_read,
>  	return true;
>  }
>
> -bool pmu_msr_read(u32 msr, u64 *val)
> -{
> -	bool emulated;
> -
> -	return pmu_msr_chk_emulated(msr, val, true, &emulated) && emulated;
> -}
> -
> -bool pmu_msr_write(u32 msr, u64 val)
> -{
> -	bool emulated;
> -
> -	return pmu_msr_chk_emulated(msr, &val, false, &emulated) && emulated;
> -}
> -
>  static u64 xen_amd_read_pmc(int counter)
>  {
>  	struct xen_pmu_amd_ctxt *ctxt;
> diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
> index a1875e10be31..fde9f9d7415f 100644
> --- a/arch/x86/xen/xen-ops.h
> +++ b/arch/x86/xen/xen-ops.h
> @@ -271,8 +271,7 @@ void xen_pmu_finish(int cpu);
>  static inline void xen_pmu_init(int cpu) {}
>  static inline void xen_pmu_finish(int cpu) {}
>  #endif
> -bool pmu_msr_read(u32 msr, u64 *val);
> -bool pmu_msr_write(u32 msr, u64 val);
> +bool pmu_msr_chk_emulated(u32 msr, u64 *val, bool is_read, bool *emul);
>  int pmu_apic_update(uint32_t reg);
>  u64 xen_read_pmc(int counter);
On 4/23/2025 11:33 PM, Mi, Dapeng wrote:
> Could we merge this patch and previous patch into a single patch? It's
> unnecessary to just modify the pmu_msr_read()/pmu_msr_write() in previous
> patch and delete them immediately. It just wastes the effort.

No, it's not wasting effort; it's for easier review.

Look at this patch: you can easily tell that pmu_msr_read() and
pmu_msr_write() are nothing more than pmu_msr_chk_emulated(), and then
removing them makes a lot of sense.
On 4/24/2025 12:43 AM, Mi, Dapeng wrote:
> These 2 patches are not complicated, it won't be difficult to review if
> merging them into one as long as the commit message mentions it clearly.
> Anyway I'm fine if you hope to keep them into two patches.

Simple Small Steps...
On 24.04.25 10:06, Xin Li wrote:
> On 4/23/2025 9:05 AM, Jürgen Groß wrote:
>>> It's not a major change, but when it is patched to use the immediate form
>>> MSR write instruction, it's straightforwardly streamlined.
>>
>> It should be rather easy to switch the current wrmsr/rdmsr paravirt patching
>> locations to use the rdmsr/wrmsr instructions instead of doing a call to
>> native_*msr().
>>
>> The case of the new immediate form could be handled the same way.
>
> Actually, that is how we get this patch with the existing alternatives
> infrastructure. And we took a step further to also remove the pv_ops
> MSR APIs...

And this is what I'm questioning.

IMHO this approach is adding more code by removing the pv_ops MSR APIs
just because "pv_ops is bad". And I believe most refusal of pv_ops is
based on no longer valid reasoning.

> It looks to me that you want to add a new facility to the alternatives
> infrastructure first?

Why would we need a new facility in the alternatives infrastructure?


Juergen
On 22.04.25 10:21, Xin Li (Intel) wrote:
> set_reg() is used to write the following MSRs on Xen:
>
>   MSR_FS_BASE
>   MSR_KERNEL_GS_BASE
>   MSR_GS_BASE
>
> But none of these MSRs are written using any MSR write safe API.
> Therefore there is no need to pass an error pointer argument to
> set_reg() for returning an error code to be used in MSR safe APIs.

set_seg(), please (further up, too).

>
> Remove the error pointer argument.
>
> Signed-off-by: Xin Li (Intel) <xin@zytor.com>

Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen
On 4/24/2025 3:11 AM, Jürgen Groß wrote:
> set_seg(), please (further up, too).
Good catch, thanks a lot!
> By the way, this patch should have "xen" in its subject tag.

Right, I should add it.