Message ID | 20211222124052.644626-14-jing2.liu@intel.com |
---|---|
State | New |
Headers | show |
Series | AMX Support in KVM | expand |
On Wed, Dec 22, 2021, Jing Liu wrote: > Guest IA32_XFD_ERR is generally modified in two places: > > - Set by CPU when #NM is triggered; > - Cleared by guest in its #NM handler; > > Intercept #NM for the first case, if guest writes XFD as nonzero for > the first time which indicates guest is possible to use XFD generating > the exception. #NM is rare if the guest doesn't use dynamic features. > Otherwise, there is at most one exception per guest task given a > dynamic feature. > > Save the current XFD_ERR value to the guest_fpu container in the #NM > VM-exit handler. This must be done with interrupt/preemption disabled, Assuming my below understanding is correct, drop the "preemption" bit, it's misleading. > otherwise the unsaved MSR value may be clobbered by host operations. > > Inject a virtual #NM to the guest after saving the MSR value. > > Restore the host value (always ZERO outside of the host #NM > handler) before enabling preemption. AIUI, changelog is wrong, code is right. This must be done before _IRQs_ are enabled, same as handling TIF_NEED_FPU_LOAD. > Restore the guest value from the guest_fpu container right before > entering the guest (with preemption disabled). Same complaint about preemption. > Suggested-by: Thomas Gleixner <tglx@linutronix.de> > Signed-off-by: Jing Liu <jing2.liu@intel.com> > --- > arch/x86/include/asm/kvm_host.h | 1 + > arch/x86/kvm/vmx/vmcs.h | 5 +++++ > arch/x86/kvm/vmx/vmx.c | 22 +++++++++++++++++++++- > arch/x86/kvm/x86.c | 6 ++++++ > 4 files changed, 33 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 555f4de47ef2..f7a661f35d1a 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -640,6 +640,7 @@ struct kvm_vcpu_arch { > u64 smi_count; > bool tpr_access_reporting; > bool xsaves_enabled; > + bool trap_nm; > u64 ia32_xss; > u64 microcode_version; > u64 arch_capabilities; ... > @@ -763,6 +764,9 @@ void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu) > vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, match); > } > > + if (vcpu->arch.trap_nm) > + eb |= (1u << NM_VECTOR); > + > vmcs_write32(EXCEPTION_BITMAP, eb); > } > > @@ -1960,6 +1964,13 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > case MSR_KERNEL_GS_BASE: > vmx_write_guest_kernel_gs_base(vmx, data); > break; > + case MSR_IA32_XFD: > + ret = kvm_set_msr_common(vcpu, msr_info); > + if (!ret && data) { > + vcpu->arch.trap_nm = true; > + vmx_update_exception_bitmap(vcpu); This is wrong, it fails to clear vcpu->arch.trap_nm and update the bitmap if the MSR is cleared. But why even bother with an extra flag? Can't vmx_update_exception_bitmap() get the guest's MSR_IA32_XFD value and intercept #NM accordingly? Then you could even handle this fully in kvm_set_msr_common(), e.g. diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 2c9606380bca..c6c936d2b298 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3704,6 +3704,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) return 1; fpu_update_guest_xfd(&vcpu->arch.guest_fpu, data); + /* Blah blah blah blah */ + static_call(kvm_x86_update_exception_bitmap)(vcpu); break; case MSR_IA32_XFD_ERR: if (!msr_info->host_initiated && > + } > + break; > #endif > case MSR_IA32_SYSENTER_CS: > if (is_guest_mode(vcpu)) > @@ -4746,7 +4757,7 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu) > vect_info = vmx->idt_vectoring_info; > intr_info = vmx_get_intr_info(vcpu); > > - if (is_machine_check(intr_info) || is_nmi(intr_info)) > + if (is_machine_check(intr_info) || is_nmi(intr_info) || is_nm(intr_info)) > return 1; /* handled by handle_exception_nmi_irqoff() */ > > if (is_invalid_opcode(intr_info)) > @@ -6350,6 +6361,12 @@ static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu, > kvm_after_interrupt(vcpu); > } > > +static void handle_exception_nm(struct kvm_vcpu *vcpu) This needs a different name, it's waaaay too close to the base handle_exception_nmi(), which runs with IRQs _on_. And please add "_irqoff" at the end. Maybe handle_nm_fault_irqoff()? > +{ > + rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err); > + kvm_queue_exception(vcpu, NM_VECTOR); > +} > + > static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx) > { > const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist; > @@ -6358,6 +6375,9 @@ static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx) > /* if exit due to PF check for async PF */ > if (is_page_fault(intr_info)) > vmx->vcpu.arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags(); > + /* if exit due to NM, handle before preemptions are enabled */ > + else if (is_nm(intr_info)) Same naming complaint about this helper, it looks like an is_nmi() typo. is_nm_fault()? > + handle_exception_nm(&vmx->vcpu); > /* Handle machine checks before interrupts are enabled */ > else if (is_machine_check(intr_info)) > kvm_machine_check();
> From: Sean Christopherson <seanjc@google.com> > Sent: Wednesday, December 29, 2021 8:10 AM > > On Wed, Dec 22, 2021, Jing Liu wrote: > > Guest IA32_XFD_ERR is generally modified in two places: > > > > - Set by CPU when #NM is triggered; > > - Cleared by guest in its #NM handler; > > > > Intercept #NM for the first case, if guest writes XFD as nonzero for > > the first time which indicates guest is possible to use XFD generating > > the exception. #NM is rare if the guest doesn't use dynamic features. > > Otherwise, there is at most one exception per guest task given a > > dynamic feature. > > > > Save the current XFD_ERR value to the guest_fpu container in the #NM > > VM-exit handler. This must be done with interrupt/preemption disabled, > > Assuming my below understanding is correct, drop the "preemption" bit, it's > misleading. code-wise yes. In concept we just want to highlight that this operation must be completed when both interrupt and preemption are disabled. But we can also drop preemption if you prefer to, since preemption is certainly disabled when interrupt is disabled. > > > otherwise the unsaved MSR value may be clobbered by host operations. > > > > Inject a virtual #NM to the guest after saving the MSR value. > > > > Restore the host value (always ZERO outside of the host #NM > > handler) before enabling preemption. > > AIUI, changelog is wrong, code is right. This must be done before _IRQs_ are > enabled, same as handling TIF_NEED_FPU_LOAD. yes > > > Restore the guest value from the guest_fpu container right before > > entering the guest (with preemption disabled). > > Same complaint about preemption. > > > Suggested-by: Thomas Gleixner <tglx@linutronix.de> > > Signed-off-by: Jing Liu <jing2.liu@intel.com> > > --- > > arch/x86/include/asm/kvm_host.h | 1 + > > arch/x86/kvm/vmx/vmcs.h | 5 +++++ > > arch/x86/kvm/vmx/vmx.c | 22 +++++++++++++++++++++- > > arch/x86/kvm/x86.c | 6 ++++++ > > 4 files changed, 33 insertions(+), 1 deletion(-) > > > > diff --git a/arch/x86/include/asm/kvm_host.h > b/arch/x86/include/asm/kvm_host.h > > index 555f4de47ef2..f7a661f35d1a 100644 > > --- a/arch/x86/include/asm/kvm_host.h > > +++ b/arch/x86/include/asm/kvm_host.h > > @@ -640,6 +640,7 @@ struct kvm_vcpu_arch { > > u64 smi_count; > > bool tpr_access_reporting; > > bool xsaves_enabled; > > + bool trap_nm; > > u64 ia32_xss; > > u64 microcode_version; > > u64 arch_capabilities; > > ... > > > @@ -763,6 +764,9 @@ void vmx_update_exception_bitmap(struct > kvm_vcpu *vcpu) > > vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, match); > > } > > > > + if (vcpu->arch.trap_nm) > > + eb |= (1u << NM_VECTOR); > > + > > vmcs_write32(EXCEPTION_BITMAP, eb); > > } > > > > @@ -1960,6 +1964,13 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, > struct msr_data *msr_info) > > case MSR_KERNEL_GS_BASE: > > vmx_write_guest_kernel_gs_base(vmx, data); > > break; > > + case MSR_IA32_XFD: > > + ret = kvm_set_msr_common(vcpu, msr_info); > > + if (!ret && data) { > > + vcpu->arch.trap_nm = true; > > + vmx_update_exception_bitmap(vcpu); > > This is wrong, it fails to clear vcpu->arch.trap_nm and update the bitmap if > the > MSR is cleared. In concept you are right if just looking at this patch. It's pointless to trap #NM if guest xfd is cleared. But here we need think about patch22 which disables write interception for xfd. With that in consideration we use the 1st non-zero write as the hint indicating that guest might enable xfd-related usages thus always trap #NM after this point. It's not a good ordering, but Paolo wants to put the optimization in the end of this series. But we do need to put a clear comment here explaining the always-trap policy. > > But why even bother with an extra flag? Can't > vmx_update_exception_bitmap() get > the guest's MSR_IA32_XFD value and intercept #NM accordingly? Then you Above is the reason for the extra flag > could > even handle this fully in kvm_set_msr_common(), e.g. > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 2c9606380bca..c6c936d2b298 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -3704,6 +3704,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, > struct msr_data *msr_info) > return 1; > > fpu_update_guest_xfd(&vcpu->arch.guest_fpu, data); > + /* Blah blah blah blah */ > + static_call(kvm_x86_update_exception_bitmap)(vcpu); > break; > case MSR_IA32_XFD_ERR: > if (!msr_info->host_initiated && > > > + } > > + break; > > #endif > > case MSR_IA32_SYSENTER_CS: > > if (is_guest_mode(vcpu)) > > @@ -4746,7 +4757,7 @@ static int handle_exception_nmi(struct kvm_vcpu > *vcpu) > > vect_info = vmx->idt_vectoring_info; > > intr_info = vmx_get_intr_info(vcpu); > > > > - if (is_machine_check(intr_info) || is_nmi(intr_info)) > > + if (is_machine_check(intr_info) || is_nmi(intr_info) || > is_nm(intr_info)) > > return 1; /* handled by handle_exception_nmi_irqoff() */ > > > > if (is_invalid_opcode(intr_info)) > > @@ -6350,6 +6361,12 @@ static void handle_interrupt_nmi_irqoff(struct > kvm_vcpu *vcpu, > > kvm_after_interrupt(vcpu); > > } > > > > +static void handle_exception_nm(struct kvm_vcpu *vcpu) > > This needs a different name, it's waaaay too close to the base > handle_exception_nmi(), > which runs with IRQs _on_. And please add "_irqoff" at the end. Maybe > handle_nm_fault_irqoff()? sounds good. > > > +{ > > + rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err); > > + kvm_queue_exception(vcpu, NM_VECTOR); > > +} > > + > > static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx) > > { > > const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist; > > @@ -6358,6 +6375,9 @@ static void handle_exception_nmi_irqoff(struct > vcpu_vmx *vmx) > > /* if exit due to PF check for async PF */ > > if (is_page_fault(intr_info)) > > vmx->vcpu.arch.apf.host_apf_flags = > kvm_read_and_reset_apf_flags(); > > + /* if exit due to NM, handle before preemptions are enabled */ > > + else if (is_nm(intr_info)) > > Same naming complaint about this helper, it looks like an is_nmi() typo. > is_nm_fault()? will fix > > > + handle_exception_nm(&vmx->vcpu); > > /* Handle machine checks before interrupts are enabled */ > > else if (is_machine_check(intr_info)) > > kvm_machine_check();
> From: Tian, Kevin > Sent: Wednesday, December 29, 2021 10:53 AM > > > + case MSR_IA32_XFD: > > > + ret = kvm_set_msr_common(vcpu, msr_info); > > > + if (!ret && data) { > > > + vcpu->arch.trap_nm = true; > > > + vmx_update_exception_bitmap(vcpu); > > > > This is wrong, it fails to clear vcpu->arch.trap_nm and update the bitmap if > > the > > MSR is cleared. > > In concept you are right if just looking at this patch. It's pointless to > trap #NM if guest xfd is cleared. > > But here we need think about patch22 which disables write interception > for xfd. With that in consideration we use the 1st non-zero write as the > hint indicating that guest might enable xfd-related usages thus always > trap #NM after this point. > > It's not a good ordering, but Paolo wants to put the optimization in the > end of this series. But we do need to put a clear comment here explaining > the always-trap policy. Given write emulation of XFD is not disabled in this patch, it reads cleaner to always update exception bitmap according to guest xfd value at this stage. So we will follow your suggestion here and then change to check msr bitmap when write emulation is disabled in patch22. Thanks Kevin
> From: Sean Christopherson <seanjc@google.com> > Sent: Wednesday, December 29, 2021 8:10 AM > > But why even bother with an extra flag? Can't > vmx_update_exception_bitmap() get > the guest's MSR_IA32_XFD value and intercept #NM accordingly? Then you > could > even handle this fully in kvm_set_msr_common(), e.g. > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 2c9606380bca..c6c936d2b298 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -3704,6 +3704,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, > struct msr_data *msr_info) > return 1; > > fpu_update_guest_xfd(&vcpu->arch.guest_fpu, data); > + /* Blah blah blah blah */ > + static_call(kvm_x86_update_exception_bitmap)(vcpu); > break; btw follow Paolo's suggestion here: https://lore.kernel.org/all/6e95b6f7-44dc-7e48-4e6e-81cf85fc11c6@redhat.com/ He prefers to disabling xfd interception in vmx specific code instead of introducing a new callback to be called in common code. Since we want to always trap #NM after xfd interception is disabled (which is done after calling the msr common code), it also reads cleaner to call update_exception_bitmap() from vmx.c instead of here. Thanks Kevin
On Wed, Dec 29, 2021, Tian, Kevin wrote: > > From: Sean Christopherson <seanjc@google.com> > > Sent: Wednesday, December 29, 2021 8:10 AM > > > > On Wed, Dec 22, 2021, Jing Liu wrote: > > > Guest IA32_XFD_ERR is generally modified in two places: > > > > > > - Set by CPU when #NM is triggered; > > > - Cleared by guest in its #NM handler; > > > > > > Intercept #NM for the first case, if guest writes XFD as nonzero for > > > the first time which indicates guest is possible to use XFD generating > > > the exception. #NM is rare if the guest doesn't use dynamic features. > > > Otherwise, there is at most one exception per guest task given a > > > dynamic feature. > > > > > > Save the current XFD_ERR value to the guest_fpu container in the #NM > > > VM-exit handler. This must be done with interrupt/preemption disabled, > > > > Assuming my below understanding is correct, drop the "preemption" bit, it's > > misleading. > > code-wise yes. In concept we just want to highlight that this operation > must be completed when both interrupt and preemption are disabled. No no no no no. Yes, disabling IRQs also disables preemption, but that's not at all relevant, e.g. KVM could handle preemption via kvm_sched_{in,out}(). Handling this with IRQs disable is 100% mandatory because MSR_IA32_XFD_ERR can be indirectly consumed in (soft) IRQ context, end of story. > But we can also drop preemption if you prefer to, since preemption is > certainly disabled when interrupt is disabled.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 555f4de47ef2..f7a661f35d1a 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -640,6 +640,7 @@ struct kvm_vcpu_arch { u64 smi_count; bool tpr_access_reporting; bool xsaves_enabled; + bool trap_nm; u64 ia32_xss; u64 microcode_version; u64 arch_capabilities; diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h index 6e5de2e2b0da..c57798b56f95 100644 --- a/arch/x86/kvm/vmx/vmcs.h +++ b/arch/x86/kvm/vmx/vmcs.h @@ -129,6 +129,11 @@ static inline bool is_machine_check(u32 intr_info) return is_exception_n(intr_info, MC_VECTOR); } +static inline bool is_nm(u32 intr_info) +{ + return is_exception_n(intr_info, NM_VECTOR); +} + /* Undocumented: icebp/int1 */ static inline bool is_icebp(u32 intr_info) { diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 5974a88c9d35..57d256fbf34b 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -36,6 +36,7 @@ #include <asm/debugreg.h> #include <asm/desc.h> #include <asm/fpu/api.h> +#include <asm/fpu/xstate.h> #include <asm/idtentry.h> #include <asm/io.h> #include <asm/irq_remapping.h> @@ -763,6 +764,9 @@ void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu) vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, match); } + if (vcpu->arch.trap_nm) + eb |= (1u << NM_VECTOR); + vmcs_write32(EXCEPTION_BITMAP, eb); } @@ -1960,6 +1964,13 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case MSR_KERNEL_GS_BASE: vmx_write_guest_kernel_gs_base(vmx, data); break; + case MSR_IA32_XFD: + ret = kvm_set_msr_common(vcpu, msr_info); + if (!ret && data) { + vcpu->arch.trap_nm = true; + vmx_update_exception_bitmap(vcpu); + } + break; #endif case MSR_IA32_SYSENTER_CS: if (is_guest_mode(vcpu)) @@ -4746,7 +4757,7 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu) vect_info = vmx->idt_vectoring_info; intr_info = vmx_get_intr_info(vcpu); - if (is_machine_check(intr_info) || is_nmi(intr_info)) + if (is_machine_check(intr_info) || is_nmi(intr_info) || is_nm(intr_info)) return 1; /* handled by handle_exception_nmi_irqoff() */ if (is_invalid_opcode(intr_info)) @@ -6350,6 +6361,12 @@ static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu, kvm_after_interrupt(vcpu); } +static void handle_exception_nm(struct kvm_vcpu *vcpu) +{ + rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err); + kvm_queue_exception(vcpu, NM_VECTOR); +} + static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx) { const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist; @@ -6358,6 +6375,9 @@ static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx) /* if exit due to PF check for async PF */ if (is_page_fault(intr_info)) vmx->vcpu.arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags(); + /* if exit due to NM, handle before preemptions are enabled */ + else if (is_nm(intr_info)) + handle_exception_nm(&vmx->vcpu); /* Handle machine checks before interrupts are enabled */ else if (is_machine_check(intr_info)) kvm_machine_check(); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 36677b754ac9..b22defad5cab 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9893,6 +9893,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) if (test_thread_flag(TIF_NEED_FPU_LOAD)) switch_fpu_return(); + if (vcpu->arch.guest_fpu.xfd_err) + wrmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err); + if (unlikely(vcpu->arch.switch_db_regs)) { set_debugreg(0, 7); set_debugreg(vcpu->arch.eff_db[0], 0); @@ -9956,6 +9959,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) static_call(kvm_x86_handle_exit_irqoff)(vcpu); + if (vcpu->arch.guest_fpu.xfd_err) + wrmsrl(MSR_IA32_XFD_ERR, 0); + /* * Consume any pending interrupts, including the possible source of * VM-Exit on SVM and any ticks that occur between VM-Exit and now.
Guest IA32_XFD_ERR is generally modified in two places: - Set by CPU when #NM is triggered; - Cleared by guest in its #NM handler; Intercept #NM for the first case, if guest writes XFD as nonzero for the first time which indicates guest is possible to use XFD generating the exception. #NM is rare if the guest doesn't use dynamic features. Otherwise, there is at most one exception per guest task given a dynamic feature. Save the current XFD_ERR value to the guest_fpu container in the #NM VM-exit handler. This must be done with interrupt/preemption disabled, otherwise the unsaved MSR value may be clobbered by host operations. Inject a virtual #NM to the guest after saving the MSR value. Restore the host value (always ZERO outside of the host #NM handler) before enabling preemption. Restore the guest value from the guest_fpu container right before entering the guest (with preemption disabled). Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jing Liu <jing2.liu@intel.com> --- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/vmx/vmcs.h | 5 +++++ arch/x86/kvm/vmx/vmx.c | 22 +++++++++++++++++++++- arch/x86/kvm/x86.c | 6 ++++++ 4 files changed, 33 insertions(+), 1 deletion(-)