
[Xen-devel] xen/arm: Trap and yield on WFE instructions

Message ID 1405501458-2822-1-git-send-email-anup.patel@linaro.org
State New

Commit Message

Anup Patel July 16, 2014, 9:04 a.m. UTC
If we have a Guest/DomU with two or more of its VCPUs running
on the same host CPU, then it is quite likely that these VCPUs
will contend for the same spinlock and one of them will waste CPU
cycles spinning on a WFE instruction. This patch makes the WFE
instruction trap and forces the trapping VCPU to yield its timeslice.

KVM ARM/ARM64 does a similar thing to handle WFE
instructions. (Please refer to
https://lists.cs.columbia.edu/pipermail/kvmarm/2013-November/006259.html)

In general, this patch is more of an optimization for an
oversubscribed system that has more VCPUs than underlying
host CPUs.

Signed-off-by: Anup Patel <anup.patel@linaro.org>
Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
Tested-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
---
 xen/arch/arm/traps.c |   27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

Comments

Ian Campbell July 16, 2014, 9:36 a.m. UTC | #1
On Wed, 2014-07-16 at 14:34 +0530, Anup Patel wrote:
> If we have a Guest/DomU with two or more of its VCPUs running
> on same host CPU then it can quite likely happen that these
> VCPUs fight for same spinlock and one of them will waste CPU
> cycles in WFE instruction. This patch makes WFE instruction
> trap for VCPU and forces VCPU to yield its timeslice.
> 
> The KVM ARM/ARM64 also does similar thing for handling WFE
> instructions. (Please refer,
> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-November/006259.html)
> 
> In general, this patch is more of an optimization for an
> oversubscribed system having number of VCPUs more than
> underlying host CPUs.

Sounds good. Do you have any numbers for the improvement?

> Signed-off-by: Anup Patel <anup.patel@linaro.org>
> Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
> Tested-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
> ---
>  xen/arch/arm/traps.c |   27 ++++++++++++++++-----------
>  1 file changed, 16 insertions(+), 11 deletions(-)
> 
> diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
> index 686d8b7..bc9f67a 100644
> --- a/xen/arch/arm/traps.c
> +++ b/xen/arch/arm/traps.c
> @@ -90,7 +90,7 @@ void __cpuinit init_traps(void)
>  
>      /* Setup hypervisor traps */
>      WRITE_SYSREG(HCR_PTW|HCR_BSU_INNER|HCR_AMO|HCR_IMO|HCR_FMO|HCR_VM|
> -                 HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2);
> +                 HCR_TWE|HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2);
>      isb();
>  }
>  
> @@ -1803,16 +1803,21 @@ asmlinkage void do_trap_hypervisor(struct cpu_user_regs *regs)
>              advance_pc(regs, hsr);
>              return;
>          }
> -        /* at the moment we only trap WFI */
> -        vcpu_block();
> -        /* The ARM spec declares that even if local irqs are masked in
> -         * the CPSR register, an irq should wake up a cpu from WFI anyway.
> -         * For this reason we need to check for irqs that need delivery,
> -         * ignoring the CPSR register, *after* calling SCHEDOP_block to
> -         * avoid races with vgic_vcpu_inject_irq.
> -         */
> -        if ( local_events_need_delivery_nomask() )
> -            vcpu_unblock(current);
> +        if ( hsr.bits & 0x1 ) {

Please can you add an appropriate struct hsr_wfe_wfi to union hsr and
use that instead of opencoding bit patterns in hsr.bits.
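An illustrative sketch of what such a decoder could look like, following the architectural ISS encoding for trapped WFI/WFE, where bit [0] distinguishes WFE from WFI. The field names and layout here are hypothetical (not necessarily what was merged into Xen), and `uint32_t` bitfields are used so the sketch is portable:

```c
#include <stdint.h>

/* Hypothetical WFI/WFE member for union hsr, modeled on the ISS
 * encoding of a trapped WFI/WFE: on little-endian targets the first
 * declared bitfield occupies the least significant bits. */
union hsr {
    uint32_t bits;
    struct hsr_wfi_wfe {
        uint32_t ti:1;        /* [0]     Trapped instruction: 0 = WFI, 1 = WFE */
        uint32_t res0:19;     /* [19:1]  Reserved */
        uint32_t cc:4;        /* [23:20] Condition code of the instruction */
        uint32_t ccvalid:1;   /* [24]    CC field is valid */
        uint32_t len:1;       /* [25]    Instruction length: 0 = 16-bit, 1 = 32-bit */
        uint32_t ec:6;        /* [31:26] Exception class (WFI/WFE trap is 0x01) */
    } wfi_wfe;
};
```

The handler could then test `hsr.wfi_wfe.ti` instead of open-coding `hsr.bits & 0x1`.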

> +            /* Yield the VCPU for WFE */
> +            vcpu_force_reschedule(current);

This is fine by me, but I'm wondering: does the HW have any support
for actually blocking until an event occurs in another vcpu (some sort
of trap on sev)? I don't think so, nor am I really sure how such a h/w
feature might work...

Ian.
Anup Patel July 16, 2014, 10:07 a.m. UTC | #2
On 16 July 2014 15:06, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Wed, 2014-07-16 at 14:34 +0530, Anup Patel wrote:
>> If we have a Guest/DomU with two or more of its VCPUs running
>> on same host CPU then it can quite likely happen that these
>> VCPUs fight for same spinlock and one of them will waste CPU
>> cycles in WFE instruction. This patch makes WFE instruction
>> trap for VCPU and forces VCPU to yield its timeslice.
>>
>> The KVM ARM/ARM64 also does similar thing for handling WFE
>> instructions. (Please refer,
>> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-November/006259.html)
>>
>> In general, this patch is more of an optimization for an
>> oversubscribed system having number of VCPUs more than
>> underlying host CPUs.
>
> Sounds good. Do you have any numbers for the improvement?

I did not run any benchmarks myself, but hackbench shows a good
improvement for KVM, hence it is a worthy optimization for Xen too.

I found this change missing from Xen, hence this patch.

>
>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
>> Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
>> Tested-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
>> ---
>>  xen/arch/arm/traps.c |   27 ++++++++++++++++-----------
>>  1 file changed, 16 insertions(+), 11 deletions(-)
>>
>> diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
>> index 686d8b7..bc9f67a 100644
>> --- a/xen/arch/arm/traps.c
>> +++ b/xen/arch/arm/traps.c
>> @@ -90,7 +90,7 @@ void __cpuinit init_traps(void)
>>
>>      /* Setup hypervisor traps */
>>      WRITE_SYSREG(HCR_PTW|HCR_BSU_INNER|HCR_AMO|HCR_IMO|HCR_FMO|HCR_VM|
>> -                 HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2);
>> +                 HCR_TWE|HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2);
>>      isb();
>>  }
>>
>> @@ -1803,16 +1803,21 @@ asmlinkage void do_trap_hypervisor(struct cpu_user_regs *regs)
>>              advance_pc(regs, hsr);
>>              return;
>>          }
>> -        /* at the moment we only trap WFI */
>> -        vcpu_block();
>> -        /* The ARM spec declares that even if local irqs are masked in
>> -         * the CPSR register, an irq should wake up a cpu from WFI anyway.
>> -         * For this reason we need to check for irqs that need delivery,
>> -         * ignoring the CPSR register, *after* calling SCHEDOP_block to
>> -         * avoid races with vgic_vcpu_inject_irq.
>> -         */
>> -        if ( local_events_need_delivery_nomask() )
>> -            vcpu_unblock(current);
>> +        if ( hsr.bits & 0x1 ) {
>
> Please can you add an appropriate struct hsr_wfe_wfi to union hsr and
> use that instead of opencoding bit patterns in hsr.bits.

Sure, I will update this and send it right away.

>
>> +            /* Yield the VCPU for WFE */
>> +            vcpu_force_reschedule(current);
>
> This is OK fine by me, but I'm wondering: Does the HW have any support
> for actually blocking until an event occurs in another vcpu (some sort
> of trap on sev)? I don't think so, nor am I really sure how such a h/w
> feature might work...

Typically, on HW the WFE instruction forces a core to go into
low-power mode. The core comes out of low-power mode due to an
interrupt or an SEV executed on another core.
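(As a rough host-side model of the two behaviours, not Xen code: where bare metal would execute WFE in the wait loop and SEV in the release path, a trapped WFE under this patch becomes a yield of the timeslice. A toy ticket lock using POSIX threads and `sched_yield()` illustrates the idea:

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Toy ticket lock whose wait loop yields instead of busy-spinning,
 * mirroring what the hypervisor does when it traps a guest's WFE.
 * On real ARM hardware the loop body would be WFE and the unlock
 * path would finish with SEV to wake the waiters. */
static atomic_uint next_ticket;
static atomic_uint now_serving;
static int shared_counter;

static void lock(void)
{
    unsigned int ticket = atomic_fetch_add(&next_ticket, 1);
    while (atomic_load(&now_serving) != ticket)
        sched_yield();                  /* stand-in for a trapped WFE */
}

static void unlock(void)
{
    atomic_fetch_add(&now_serving, 1);  /* stand-in for SEV */
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        lock();
        shared_counter++;               /* protected by the ticket lock */
        unlock();
    }
    return NULL;
}

int run_demo(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return shared_counter;
}
```

With two workers doing 10000 locked increments each, `run_demo()` returns 20000 only if the lock is correct.)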

>
> Ian.
>

--
Anup
Ian Campbell July 16, 2014, 10:09 a.m. UTC | #3
On Wed, 2014-07-16 at 15:37 +0530, Anup Patel wrote:
> On 16 July 2014 15:06, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Wed, 2014-07-16 at 14:34 +0530, Anup Patel wrote:
> >> If we have a Guest/DomU with two or more of its VCPUs running
> >> on same host CPU then it can quite likely happen that these
> >> VCPUs fight for same spinlock and one of them will waste CPU
> >> cycles in WFE instruction. This patch makes WFE instruction
> >> trap for VCPU and forces VCPU to yield its timeslice.
> >>
> >> The KVM ARM/ARM64 also does similar thing for handling WFE
> >> instructions. (Please refer,
> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-November/006259.html)
> >>
> >> In general, this patch is more of an optimization for an
> >> oversubscribed system having number of VCPUs more than
> >> underlying host CPUs.
> >
> > Sounds good. Do you have any numbers for the improvement?
> 
> I did not try any benchmarks myself but hackbench shows good
> improvement for KVM hence it is a worthy optimization for Xen too.
> 
> I found this change missing for Xen hence this patch.
> 
> >
> >> Signed-off-by: Anup Patel <anup.patel@linaro.org>
> >> Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
> >> Tested-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
> >> ---
> >>  xen/arch/arm/traps.c |   27 ++++++++++++++++-----------
> >>  1 file changed, 16 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
> >> index 686d8b7..bc9f67a 100644
> >> --- a/xen/arch/arm/traps.c
> >> +++ b/xen/arch/arm/traps.c
> >> @@ -90,7 +90,7 @@ void __cpuinit init_traps(void)
> >>
> >>      /* Setup hypervisor traps */
> >>      WRITE_SYSREG(HCR_PTW|HCR_BSU_INNER|HCR_AMO|HCR_IMO|HCR_FMO|HCR_VM|
> >> -                 HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2);
> >> +                 HCR_TWE|HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2);
> >>      isb();
> >>  }
> >>
> >> @@ -1803,16 +1803,21 @@ asmlinkage void do_trap_hypervisor(struct cpu_user_regs *regs)
> >>              advance_pc(regs, hsr);
> >>              return;
> >>          }
> >> -        /* at the moment we only trap WFI */
> >> -        vcpu_block();
> >> -        /* The ARM spec declares that even if local irqs are masked in
> >> -         * the CPSR register, an irq should wake up a cpu from WFI anyway.
> >> -         * For this reason we need to check for irqs that need delivery,
> >> -         * ignoring the CPSR register, *after* calling SCHEDOP_block to
> >> -         * avoid races with vgic_vcpu_inject_irq.
> >> -         */
> >> -        if ( local_events_need_delivery_nomask() )
> >> -            vcpu_unblock(current);
> >> +        if ( hsr.bits & 0x1 ) {
> >
> > Please can you add an appropriate struct hsr_wfe_wfi to union hsr and
> > use that instead of opencoding bit patterns in hsr.bits.
> 
> Sure, I will update this and send it right away.

Thanks.

> >> +            /* Yield the VCPU for WFE */
> >> +            vcpu_force_reschedule(current);
> >
> > This is OK fine by me, but I'm wondering: Does the HW have any support
> > for actually blocking until an event occurs in another vcpu (some sort
> > of trap on sev)? I don't think so, nor am I really sure how such a h/w
> > feature might work...
> 
> Typically, on HW the WFE instruction forces cores to go in low power
> mode. The core will come out of low power mode due to some interrupt
> or SEV executed from another core.

Right, but for virt in order to put a wfe-ing vcpu to sleep we would
need to be able to get some sort of hypervisor event when another vcpu
in the same guest executed sev.

Otherwise the best we can do is to yield as you've done here, but that
will still keep spinning if there are no other vcpus to run.

Ian.
Anup Patel July 16, 2014, 10:17 a.m. UTC | #4
On 16 July 2014 15:39, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Wed, 2014-07-16 at 15:37 +0530, Anup Patel wrote:
>> On 16 July 2014 15:06, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>> > On Wed, 2014-07-16 at 14:34 +0530, Anup Patel wrote:
>> >> If we have a Guest/DomU with two or more of its VCPUs running
>> >> on same host CPU then it can quite likely happen that these
>> >> VCPUs fight for same spinlock and one of them will waste CPU
>> >> cycles in WFE instruction. This patch makes WFE instruction
>> >> trap for VCPU and forces VCPU to yield its timeslice.
>> >>
>> >> The KVM ARM/ARM64 also does similar thing for handling WFE
>> >> instructions. (Please refer,
>> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-November/006259.html)
>> >>
>> >> In general, this patch is more of an optimization for an
>> >> oversubscribed system having number of VCPUs more than
>> >> underlying host CPUs.
>> >
>> > Sounds good. Do you have any numbers for the improvement?
>>
>> I did not try any benchmarks myself but hackbench shows good
>> improvement for KVM hence it is a worthy optimization for Xen too.
>>
>> I found this change missing for Xen hence this patch.
>>
>> >
>> >> Signed-off-by: Anup Patel <anup.patel@linaro.org>
>> >> Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
>> >> Tested-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
>> >> ---
>> >>  xen/arch/arm/traps.c |   27 ++++++++++++++++-----------
>> >>  1 file changed, 16 insertions(+), 11 deletions(-)
>> >>
>> >> diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
>> >> index 686d8b7..bc9f67a 100644
>> >> --- a/xen/arch/arm/traps.c
>> >> +++ b/xen/arch/arm/traps.c
>> >> @@ -90,7 +90,7 @@ void __cpuinit init_traps(void)
>> >>
>> >>      /* Setup hypervisor traps */
>> >>      WRITE_SYSREG(HCR_PTW|HCR_BSU_INNER|HCR_AMO|HCR_IMO|HCR_FMO|HCR_VM|
>> >> -                 HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2);
>> >> +                 HCR_TWE|HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2);
>> >>      isb();
>> >>  }
>> >>
>> >> @@ -1803,16 +1803,21 @@ asmlinkage void do_trap_hypervisor(struct cpu_user_regs *regs)
>> >>              advance_pc(regs, hsr);
>> >>              return;
>> >>          }
>> >> -        /* at the moment we only trap WFI */
>> >> -        vcpu_block();
>> >> -        /* The ARM spec declares that even if local irqs are masked in
>> >> -         * the CPSR register, an irq should wake up a cpu from WFI anyway.
>> >> -         * For this reason we need to check for irqs that need delivery,
>> >> -         * ignoring the CPSR register, *after* calling SCHEDOP_block to
>> >> -         * avoid races with vgic_vcpu_inject_irq.
>> >> -         */
>> >> -        if ( local_events_need_delivery_nomask() )
>> >> -            vcpu_unblock(current);
>> >> +        if ( hsr.bits & 0x1 ) {
>> >
>> > Please can you add an appropriate struct hsr_wfe_wfi to union hsr and
>> > use that instead of opencoding bit patterns in hsr.bits.
>>
>> Sure, I will update this and send it right away.
>
> Thanks.
>
>> >> +            /* Yield the VCPU for WFE */
>> >> +            vcpu_force_reschedule(current);
>> >
>> > This is OK fine by me, but I'm wondering: Does the HW have any support
>> > for actually blocking until an event occurs in another vcpu (some sort
>> > of trap on sev)? I don't think so, nor am I really sure how such a h/w
>> > feature might work...
>>
>> Typically, on HW the WFE instruction forces cores to go in low power
>> mode. The core will come out of low power mode due to some interrupt
>> or SEV executed from another core.
>
> Right, but for virt in order to put a wfe-ing vcpu to sleep we would
> need to be able to get some sort of hypervisor event when another vcpu
> in the same guest executed sev.
>
> Otherwise the best we can do is to yield as you've done here, but that
> will still keep spinning if there are no other vcpus to run.

Yes, we cannot trap the SEV instruction.

Yield seems to be the only option here, unless we modify the guest and
replace the SEV instruction with a hypercall.

--
Anup

>
> Ian.
>
Ian Campbell July 16, 2014, 10:22 a.m. UTC | #5
On Wed, 2014-07-16 at 15:47 +0530, Anup Patel wrote:
> On 16 July 2014 15:39, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Wed, 2014-07-16 at 15:37 +0530, Anup Patel wrote:
> >> On 16 July 2014 15:06, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> >> > On Wed, 2014-07-16 at 14:34 +0530, Anup Patel wrote:
> >> >> If we have a Guest/DomU with two or more of its VCPUs running
> >> >> on same host CPU then it can quite likely happen that these
> >> >> VCPUs fight for same spinlock and one of them will waste CPU
> >> >> cycles in WFE instruction. This patch makes WFE instruction
> >> >> trap for VCPU and forces VCPU to yield its timeslice.
> >> >>
> >> >> The KVM ARM/ARM64 also does similar thing for handling WFE
> >> >> instructions. (Please refer,
> >> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-November/006259.html)
> >> >>
> >> >> In general, this patch is more of an optimization for an
> >> >> oversubscribed system having number of VCPUs more than
> >> >> underlying host CPUs.
> >> >
> >> > Sounds good. Do you have any numbers for the improvement?
> >>
> >> I did not try any benchmarks myself but hackbench shows good
> >> improvement for KVM hence it is a worthy optimization for Xen too.
> >>
> >> I found this change missing for Xen hence this patch.
> >>
> >> >
> >> >> Signed-off-by: Anup Patel <anup.patel@linaro.org>
> >> >> Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
> >> >> Tested-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
> >> >> ---
> >> >>  xen/arch/arm/traps.c |   27 ++++++++++++++++-----------
> >> >>  1 file changed, 16 insertions(+), 11 deletions(-)
> >> >>
> >> >> diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
> >> >> index 686d8b7..bc9f67a 100644
> >> >> --- a/xen/arch/arm/traps.c
> >> >> +++ b/xen/arch/arm/traps.c
> >> >> @@ -90,7 +90,7 @@ void __cpuinit init_traps(void)
> >> >>
> >> >>      /* Setup hypervisor traps */
> >> >>      WRITE_SYSREG(HCR_PTW|HCR_BSU_INNER|HCR_AMO|HCR_IMO|HCR_FMO|HCR_VM|
> >> >> -                 HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2);
> >> >> +                 HCR_TWE|HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2);
> >> >>      isb();
> >> >>  }
> >> >>
> >> >> @@ -1803,16 +1803,21 @@ asmlinkage void do_trap_hypervisor(struct cpu_user_regs *regs)
> >> >>              advance_pc(regs, hsr);
> >> >>              return;
> >> >>          }
> >> >> -        /* at the moment we only trap WFI */
> >> >> -        vcpu_block();
> >> >> -        /* The ARM spec declares that even if local irqs are masked in
> >> >> -         * the CPSR register, an irq should wake up a cpu from WFI anyway.
> >> >> -         * For this reason we need to check for irqs that need delivery,
> >> >> -         * ignoring the CPSR register, *after* calling SCHEDOP_block to
> >> >> -         * avoid races with vgic_vcpu_inject_irq.
> >> >> -         */
> >> >> -        if ( local_events_need_delivery_nomask() )
> >> >> -            vcpu_unblock(current);
> >> >> +        if ( hsr.bits & 0x1 ) {
> >> >
> >> > Please can you add an appropriate struct hsr_wfe_wfi to union hsr and
> >> > use that instead of opencoding bit patterns in hsr.bits.
> >>
> >> Sure, I will update this and send it right away.
> >
> > Thanks.
> >
> >> >> +            /* Yield the VCPU for WFE */
> >> >> +            vcpu_force_reschedule(current);
> >> >
> >> > This is OK fine by me, but I'm wondering: Does the HW have any support
> >> > for actually blocking until an event occurs in another vcpu (some sort
> >> > of trap on sev)? I don't think so, nor am I really sure how such a h/w
> >> > feature might work...
> >>
> >> Typically, on HW the WFE instruction forces cores to go in low power
> >> mode. The core will come out of low power mode due to some interrupt
> >> or SEV executed from another core.
> >
> > Right, but for virt in order to put a wfe-ing vcpu to sleep we would
> > need to be able to get some sort of hypervisor event when another vcpu
> > in the same guest executed sev.
> >
> > Otherwise the best we can do is to yield as you've done here, but that
> > will still keep spinning if there are no other vcpus to run.
> 
> Yes, we cannot trap SEV instruction.

Plus it would be unwieldy to have to enable that for all *other* vcpus
when one did a WFE.

> Yield seems to be the only option here, unless we modify the guest and
> replace SEV instruction with a hypercall.

Perhaps one day we can add PARAVIRT_SPINLOCKS to arm. But not now ;-)

Ian.

Patch

diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
index 686d8b7..bc9f67a 100644
--- a/xen/arch/arm/traps.c
+++ b/xen/arch/arm/traps.c
@@ -90,7 +90,7 @@  void __cpuinit init_traps(void)
 
     /* Setup hypervisor traps */
     WRITE_SYSREG(HCR_PTW|HCR_BSU_INNER|HCR_AMO|HCR_IMO|HCR_FMO|HCR_VM|
-                 HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2);
+                 HCR_TWE|HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2);
     isb();
 }
 
@@ -1803,16 +1803,21 @@  asmlinkage void do_trap_hypervisor(struct cpu_user_regs *regs)
             advance_pc(regs, hsr);
             return;
         }
-        /* at the moment we only trap WFI */
-        vcpu_block();
-        /* The ARM spec declares that even if local irqs are masked in
-         * the CPSR register, an irq should wake up a cpu from WFI anyway.
-         * For this reason we need to check for irqs that need delivery,
-         * ignoring the CPSR register, *after* calling SCHEDOP_block to
-         * avoid races with vgic_vcpu_inject_irq.
-         */
-        if ( local_events_need_delivery_nomask() )
-            vcpu_unblock(current);
+        if ( hsr.bits & 0x1 ) {
+            /* Yield the VCPU for WFE */
+            vcpu_force_reschedule(current);
+        } else {
+            /* Block the VCPU for WFI */
+            vcpu_block();
+            /* The ARM spec declares that even if local irqs are masked in
+             * the CPSR register, an irq should wake up a cpu from WFI anyway.
+             * For this reason we need to check for irqs that need delivery,
+             * ignoring the CPSR register, *after* calling SCHEDOP_block to
+             * avoid races with vgic_vcpu_inject_irq.
+             */
+            if ( local_events_need_delivery_nomask() )
+                vcpu_unblock(current);
+        }
         advance_pc(regs, hsr);
         break;
     case HSR_EC_CP15_32: