
[v2] arm/arm64: KVM: Properly account for guest CPU time

Message ID 1432838950-28774-1-git-send-email-christoffer.dall@linaro.org
State Superseded

Commit Message

Christoffer Dall May 28, 2015, 6:49 p.m. UTC
Until now we have been calling kvm_guest_exit after re-enabling
interrupts when we come back from the guest, but this has the
unfortunate effect that CPU time accounting done in the context of timer
interrupts occurring while the guest is running doesn't properly notice
that the time since the last tick was spent in the guest.

Inspired by the comment in the x86 code, move the kvm_guest_exit() call
below the local_irq_enable() call and change __kvm_guest_exit() to
kvm_guest_exit(), because we are now calling this function with
interrupts enabled.  We now have to explicitly disable preemption, and
must not re-enable it until after kvm_guest_exit() has been called,
since otherwise we could be preempted and everything happening before
we eventually get scheduled again would be accounted for as guest time.

At the same time, move the trace_kvm_exit() call outside of the atomic
section, since there is no reason for us to do that with interrupts
disabled.

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
---
This patch is based on kvm/queue, because it has the kvm_guest_enter/exit
rework recently posted by Christian Borntraeger.  I hope I got the logic
of this right, there were 2 slightly worrying facts about this:

First, we now enable, disable, and enable interrupts on each exit
path, but I couldn't see any performance overhead on hackbench - yes,
the only benchmark we care about.
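A typical hackbench invocation, for reference (the exact command used is
not given in the thread, so treat this as illustrative):

	# 100 groups of communicating processes, 500 message loops each
	hackbench 100 process 500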

Second, looking at the ppc and mips code, they seem to also call
kvm_guest_exit() before enabling interrupts, so I don't understand how
guest CPU time accounting works on those architectures.

Changes since v1:
 - Tweak comment and commit text based on Marc's feedback.
 - Explicitly disable preemption and enable it only after kvm_guest_exit().

 arch/arm/kvm/arm.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

Comments

Christoffer Dall May 31, 2015, 6:59 a.m. UTC | #1
Hi Mario,

On Fri, May 29, 2015 at 03:34:47PM -0700, Mario Smarduch wrote:
> On 05/28/2015 11:49 AM, Christoffer Dall wrote:
> > Until now we have been calling kvm_guest_exit after re-enabling
> > interrupts when we come back from the guest, but this has the
> > unfortunate effect that CPU time accounting done in the context of timer
> > interrupts occurring while the guest is running doesn't properly notice
> > that the time since the last tick was spent in the guest.
> > 
> > Inspired by the comment in the x86 code, move the kvm_guest_exit() call
> > below the local_irq_enable() call and change __kvm_guest_exit() to
> > kvm_guest_exit(), because we are now calling this function with
> > interrupts enabled.  We now have to explicitly disable preemption, and
> > must not re-enable it until after kvm_guest_exit() has been called,
> > since otherwise we could be preempted and everything happening before
> > we eventually get scheduled again would be accounted for as guest time.
> > 
> > At the same time, move the trace_kvm_exit() call outside of the atomic
> > section, since there is no reason for us to do that with interrupts
> > disabled.
> > 
> > Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
> > ---
> > This patch is based on kvm/queue, because it has the kvm_guest_enter/exit
> > rework recently posted by Christian Borntraeger.  I hope I got the logic
> > of this right, there were 2 slightly worrying facts about this:
> > 
> > First, we now enable, disable, and enable interrupts on each exit
> > path, but I couldn't see any performance overhead on hackbench - yes,
> > the only benchmark we care about.
> > 
> > Second, looking at the ppc and mips code, they seem to also call
> > kvm_guest_exit() before enabling interrupts, so I don't understand how
> > guest CPU time accounting works on those architectures.
> > 
> > Changes since v1:
> >  - Tweak comment and commit text based on Marc's feedback.
> >  - Explicitly disable preemption and enable it only after kvm_guest_exit().
> > 
> >  arch/arm/kvm/arm.c | 21 +++++++++++++++++----
> >  1 file changed, 17 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> > index e41cb11..fe8028d 100644
> > --- a/arch/arm/kvm/arm.c
> > +++ b/arch/arm/kvm/arm.c
> > @@ -532,6 +532,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >  		kvm_vgic_flush_hwstate(vcpu);
> >  		kvm_timer_flush_hwstate(vcpu);
> >  
> > +		preempt_disable();
> >  		local_irq_disable();
> >  
> >  		/*
> > @@ -544,6 +545,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >  
> >  		if (ret <= 0 || need_new_vmid_gen(vcpu->kvm)) {
> >  			local_irq_enable();
> > +			preempt_enable();
> >  			kvm_timer_sync_hwstate(vcpu);
> >  			kvm_vgic_sync_hwstate(vcpu);
> >  			continue;
> > @@ -559,8 +561,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >  		ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);
> >  
> >  		vcpu->mode = OUTSIDE_GUEST_MODE;
> > -		__kvm_guest_exit();
> > -		trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
> > +		/*
> > +		 * Back from guest
> > +		 *************************************************************/
> > +
> >  		/*
> >  		 * We may have taken a host interrupt in HYP mode (ie
> >  		 * while executing the guest). This interrupt is still
> > @@ -574,8 +578,17 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >  		local_irq_enable();
> >  
> >  		/*
> > -		 * Back from guest
> > -		 *************************************************************/
> > +		 * We do local_irq_enable() before calling kvm_guest_exit() so
> > +		 * that if a timer interrupt hits while running the guest we
> > +		 * account that tick as being spent in the guest.  We enable
> > +		 * preemption after calling kvm_guest_exit() so that if we get
> > +		 * preempted we make sure ticks after that is not counted as
> > +		 * guest time.
> > +		 */
> > +		kvm_guest_exit();
> > +		trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
> > +		preempt_enable();
> > +
> >  
> >  		kvm_timer_sync_hwstate(vcpu);
> >  		kvm_vgic_sync_hwstate(vcpu);
> > 
> 
> Hi Christoffer,
>  so currently we take a snapshot when we enter the guest
> (tsk->vtime_snap) and upon exit add the time we spent in
> the guest and update the accrued time, which appears correct.

not on arm64, because we don't select HAVE_VIRT_CPU_ACCOUNTING_GEN.  Or
am I missing something obvious here?

> 
> With this patch it appears that interrupts running
> in host mode are accrued to Guest time, and additional preemption
> latency is added.
> 
It is true that interrupts processed in host mode (if they hit on a CPU
while it is running a guest) are accrued to guest time, but without this
patch on current arm64 we accrue no CPU time to guest time at all, which
is hardly more correct.

If this patch is incorrect, then how does it work on x86, where
handle_external_intr() is called (with a barrier in between) before
kvm_guest_exit(), and where handle_external_intr() is simply
local_irq_enable() on SVM and something more complicated on VMX?
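
For context, the x86 exit path referred to here reads roughly as follows
in vcpu_enter_guest() of that era (paraphrased from arch/x86/kvm/x86.c,
so check the actual tree before relying on details):

	vcpu->mode = OUTSIDE_GUEST_MODE;
	smp_wmb();

	/* interrupts are re-enabled inside handle_external_intr() */
	kvm_x86_ops->handle_external_intr(vcpu);

	++vcpu->stat.exits;

	/*
	 * An instruction is needed between local_irq_enable() and
	 * kvm_guest_exit() so a pending timer interrupt is not delayed
	 * by the interrupt shadow; the barrier() prevents reordering.
	 */
	barrier();
	kvm_guest_exit();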

Finally, how exactly is preemption latency added here?  Won't IRQ
processing run with higher priority than any task on your system, so
that the order of (1) process pending IRQs and (2) call schedule if
needed is still preserved here, with kvm_guest_exit() merely called
between (1) and (2) instead of before (1)?

It is entirely possible that I'm missing the mark here and everything
gets solved by enabling HAVE_VIRT_CPU_ACCOUNTING_GEN, or perhaps we
need some extra logic?

Thanks,
-Christoffer
Christoffer Dall June 1, 2015, 9:08 a.m. UTC | #2
On Mon, Jun 01, 2015 at 09:47:46AM +0200, Christian Borntraeger wrote:
> Am 28.05.2015 um 20:49 schrieb Christoffer Dall:
> > Until now we have been calling kvm_guest_exit after re-enabling
> > interrupts when we come back from the guest, but this has the
> > unfortunate effect that CPU time accounting done in the context of timer
> > interrupts occurring while the guest is running doesn't properly notice
> > that the time since the last tick was spent in the guest.
> 
> Can you verify that a CPU-bound guest has almost zero guest time?
> Assuming that your answer is "yes", your patch makes sense, as host
> timer interrupts should be the only reason for guest exits then.
> 

Yes, 'cat /dev/urandom > /dev/null' in the guest shows up as 100% sys in
mpstat on the host, 0% guest.
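
Concretely, the check amounts to something like this (illustrative):

	# in the guest: a purely CPU-bound workload
	cat /dev/urandom > /dev/null &

	# on the host: per-CPU utilization at 1s intervals; without the
	# patch the load shows up under %sys, with it under %guest
	mpstat -P ALL 1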

> 
> > Inspired by the comment in the x86 code, move the kvm_guest_exit() call
> > below the local_irq_enable() call and change __kvm_guest_exit() to
> > kvm_guest_exit(), because we are now calling this function with
> > interrupts enabled.  We now have to explicitly disable preemption, and
> > must not re-enable it until after kvm_guest_exit() has been called,
> > since otherwise we could be preempted and everything happening before
> > we eventually get scheduled again would be accounted for as guest time.
> > 
> > At the same time, move the trace_kvm_exit() call outside of the atomic
> > section, since there is no reason for us to do that with interrupts
> > disabled.
> > 
> > Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
> > ---
> > This patch is based on kvm/queue, because it has the kvm_guest_enter/exit
> > rework recently posted by Christian Borntraeger.  I hope I got the logic
> > of this right, there were 2 slightly worrying facts about this:
> > 
> > First, we now enable, disable, and enable interrupts on each exit
> > path, but I couldn't see any performance overhead on hackbench - yes,
> > the only benchmark we care about.
> 
> This should be somewhat similar to the situation before my patch.
> There it was
> 
> 1: "disable", "guest", "disable again and save", "restore to disable", "enable"
> and now it is
> 2: "disable", "guest", "enable"
> and with your patch it is
> 3: "disable", "guest", "enable", "disable", "enable"
> 
> I assume that 3 and 1 are similar in their costs, so this is probably ok.
> 
> 
> > 
> > Second, looking at the ppc and mips code, they seem to also call
> > kvm_guest_exit() before enabling interrupts, so I don't understand how
> > guest CPU time accounting works on those architectures.
> 
> Not an expert here, but I assume mips has the same logic as arm, so if
> your patch is right for arm it's probably also right for mips.
> 
> powerpc looks similar to what s390 does (not using the tick; instead it
> uses a hw-timer), so this should be fine.
> 
I wonder if we can simply enable HAVE_VIRT_CPU_ACCOUNTING_GEN and get
this for free, which would avoid the need for this patch?

Thanks,
-Christoffer
Christoffer Dall June 1, 2015, 1:35 p.m. UTC | #3
On Mon, Jun 01, 2015 at 11:21:19AM +0200, Christian Borntraeger wrote:
> Am 01.06.2015 um 11:08 schrieb Christoffer Dall:
> 
> >>>
> >>> Second, looking at the ppc and mips code, they seem to also call
> >>> kvm_guest_exit() before enabling interrupts, so I don't understand how
> >>> guest CPU time accounting works on those architectures.
> >>
> >> Not an expert here, but I assume mips has the same logic as arm so if your
> >> patch is right for arm its probably also for mips.
> >>
> >> powerpc looks similar to what s390 does (not using the tick, instead it uses
> >> a hw-timer) so this should be fine.
> >>
> > I wonder if we can simply enable HAVE_VIRT_CPU_ACCOUNTING_GEN and get
> > this for free which would avoid the need for this patch?
> 
> Assuming that HAVE_VIRT_CPU_ACCOUNTING_GEN behaves similarly to
> HAVE_VIRT_CPU_ACCOUNTING on s390/power with respect to not relying on
> ticks - yes, it might work out. Can you give it a try?
> 
Adding HAVE_VIRT_CPU_ACCOUNTING_GEN to arch/arm64/Kconfig works, but it
has no effect unless you also enable CONFIG_NO_HZ_FULL, so that hardly
feels like a fix, since it would be a shame to force users to enable
that config option just to get CPU usage reported correctly.
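
For reference, the experiment amounts to selecting the symbol in the
arch Kconfig, roughly:

	config ARM64
		select HAVE_VIRT_CPU_ACCOUNTING_GEN

plus CONFIG_NO_HZ_FULL=y in the kernel config for it to take effect.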

I'm not entirely sure what the history and meaning behind these configs
are, so maybe there is an entirely different rework needed here.  It
seems logical that you could simply sample the counter at entry/exit of
the guest, but if there is nowhere to store this data without
NO_HZ_FULL+VIRT_CPU_ACCOUNTING_GEN, then I guess that would explain it.
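
For what it's worth, the GEN accounting already does that sampling at
guest entry; kernel/sched/cputime.c of that era reads roughly like this
(paraphrased):

	void vtime_guest_enter(struct task_struct *tsk)
	{
		write_seqlock(&tsk->vtime_seqlock);
		__vtime_account_system(tsk);	/* flush time accrued so far */
		tsk->flags |= PF_VCPU;		/* from here on, guest time */
		write_sequnlock(&tsk->vtime_seqlock);
	}

and the tsk->vtime_snap field behind __vtime_account_system() only
exists under CONFIG_VIRT_CPU_ACCOUNTING_GEN, which would indeed be why.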

Thanks,
-Christoffer
Christoffer Dall June 2, 2015, 9:27 a.m. UTC | #4
On Mon, Jun 01, 2015 at 08:48:22AM -0700, Mario Smarduch wrote:
> On 05/30/2015 11:59 PM, Christoffer Dall wrote:
> > Hi Mario,
> > 
> > On Fri, May 29, 2015 at 03:34:47PM -0700, Mario Smarduch wrote:
> >> On 05/28/2015 11:49 AM, Christoffer Dall wrote:
> >>> Until now we have been calling kvm_guest_exit after re-enabling
> >>> interrupts when we come back from the guest, but this has the
> >>> unfortunate effect that CPU time accounting done in the context of timer
> >>> interrupts occurring while the guest is running doesn't properly notice
> >>> that the time since the last tick was spent in the guest.
> >>>
> >>> Inspired by the comment in the x86 code, move the kvm_guest_exit() call
> >>> below the local_irq_enable() call and change __kvm_guest_exit() to
> >>> kvm_guest_exit(), because we are now calling this function with
> >>> interrupts enabled.  We now have to explicitly disable preemption, and
> >>> must not re-enable it until after kvm_guest_exit() has been called,
> >>> since otherwise we could be preempted and everything happening before
> >>> we eventually get scheduled again would be accounted for as guest time.
> >>>
> >>> At the same time, move the trace_kvm_exit() call outside of the atomic
> >>> section, since there is no reason for us to do that with interrupts
> >>> disabled.
> >>>
> >>> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
> >>> ---
> >>> This patch is based on kvm/queue, because it has the kvm_guest_enter/exit
> >>> rework recently posted by Christian Borntraeger.  I hope I got the logic
> >>> of this right, there were 2 slightly worrying facts about this:
> >>>
> >>> First, we now enable, disable, and enable interrupts on each exit
> >>> path, but I couldn't see any performance overhead on hackbench - yes,
> >>> the only benchmark we care about.
> >>>
> >>> Second, looking at the ppc and mips code, they seem to also call
> >>> kvm_guest_exit() before enabling interrupts, so I don't understand how
> >>> guest CPU time accounting works on those architectures.
> >>>
> >>> Changes since v1:
> >>>  - Tweak comment and commit text based on Marc's feedback.
> >>>  - Explicitly disable preemption and enable it only after kvm_guest_exit().
> >>>
> >>>  arch/arm/kvm/arm.c | 21 +++++++++++++++++----
> >>>  1 file changed, 17 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> >>> index e41cb11..fe8028d 100644
> >>> --- a/arch/arm/kvm/arm.c
> >>> +++ b/arch/arm/kvm/arm.c
> >>> @@ -532,6 +532,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >>>  		kvm_vgic_flush_hwstate(vcpu);
> >>>  		kvm_timer_flush_hwstate(vcpu);
> >>>  
> >>> +		preempt_disable();
> >>>  		local_irq_disable();
> >>>  
> >>>  		/*
> >>> @@ -544,6 +545,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >>>  
> >>>  		if (ret <= 0 || need_new_vmid_gen(vcpu->kvm)) {
> >>>  			local_irq_enable();
> >>> +			preempt_enable();
> >>>  			kvm_timer_sync_hwstate(vcpu);
> >>>  			kvm_vgic_sync_hwstate(vcpu);
> >>>  			continue;
> >>> @@ -559,8 +561,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >>>  		ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);
> >>>  
> >>>  		vcpu->mode = OUTSIDE_GUEST_MODE;
> >>> -		__kvm_guest_exit();
> >>> -		trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
> >>> +		/*
> >>> +		 * Back from guest
> >>> +		 *************************************************************/
> >>> +
> >>>  		/*
> >>>  		 * We may have taken a host interrupt in HYP mode (ie
> >>>  		 * while executing the guest). This interrupt is still
> >>> @@ -574,8 +578,17 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >>>  		local_irq_enable();
> >>>  
> >>>  		/*
> >>> -		 * Back from guest
> >>> -		 *************************************************************/
> >>> +		 * We do local_irq_enable() before calling kvm_guest_exit() so
> >>> +		 * that if a timer interrupt hits while running the guest we
> >>> +		 * account that tick as being spent in the guest.  We enable
> >>> +		 * preemption after calling kvm_guest_exit() so that if we get
> >>> +		 * preempted we make sure ticks after that is not counted as
> >>> +		 * guest time.
> >>> +		 */
> >>> +		kvm_guest_exit();
> >>> +		trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
> >>> +		preempt_enable();
> >>> +
> >>>  
> >>>  		kvm_timer_sync_hwstate(vcpu);
> >>>  		kvm_vgic_sync_hwstate(vcpu);
> >>>
> >>
> >> Hi Christoffer,
> >>  so currently we take a snapshot when we enter the guest
> >> (tsk->vtime_snap) and upon exit add the time we spent in
> >> the guest and update the accrued time, which appears correct.
> > 
> > not on arm64, because we don't select HAVE_VIRT_CPU_ACCOUNTING_GEN.  Or
> > am I missing something obvious here?
> I see what you mean; we can't use cycle-based accounting to accrue
> guest time.
> 

See the other thread; we can enable this in the config, but it still
only works with NO_HZ_FULL.

> > 
> >>
> >> With this patch it appears that interrupts running
> >> in host mode are accrued to Guest time, and additional preemption
> >> latency is added.
> >>
> > It is true that interrupts processed in host mode (if they hit on a CPU
> > while it is running a guest) are accrued to guest time, but without this
> > patch on current arm64 we accrue no CPU time to guest time at all, which
> > is hardly more correct.
> Yes, if only ticks are supported then it's definitely better!
> 
> Nevertheless, with a high interrupt rate it will complicate
> root-causing issues; a lot of that time will go to the guest.

That sounds like a separate fix to me; if there's an existing mechanism
to account for time spent on hw IRQs and it is somehow invalidated by
the PF_VCPU flag being set, then that feels wrong, at least for how
arm64 works, but it doesn't make this patch less correct.
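
For reference, the interplay in question lives in account_system_time(),
which in kernel/sched/cputime.c of that era reads roughly (paraphrased):

	void account_system_time(struct task_struct *p, int hardirq_offset,
				 cputime_t cputime, cputime_t cputime_scaled)
	{
		int index;

		/* a tick landing in guest context is charged to guest time... */
		if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
			account_guest_time(p, cputime, cputime_scaled);
			return;
		}

		/* ...unless it interrupts hardirq/softirq processing */
		if (hardirq_count() - hardirq_offset)
			index = CPUTIME_IRQ;
		else if (in_serving_softirq())
			index = CPUTIME_SOFTIRQ;
		else
			index = CPUTIME_SYSTEM;

		__account_system_time(p, cputime, cputime_scaled, index);
	}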

> 
> > 
> > If this patch is incorrect, then how does it work on x86, where
> > handle_external_intr() is called (with a barrier in between) before
> > kvm_guest_exit(), and where handle_external_intr() is simply
> > local_irq_enable() on SVM and something more complicated on VMX?
> > 
> > Finally, how exactly is preemption latency added here?  Won't IRQ
> > processing run with higher priority than any task on your system, so
> > that the order of (1) process pending IRQs and (2) call schedule if
> > needed is still preserved here, with kvm_guest_exit() merely called
> > between (1) and (2) instead of before (1)?
> 
> I may be missing something, but on return from interrupt with
> preemption disabled we can't take the need-resched path, and need to
> return to KVM, no?

preempt_enable() will call __preempt_schedule() and cause preemption
there, so you're talking about adding these lines of latency:

	kvm_guest_exit();
	trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));

And these were called with interrupts disabled before, so I don't see
the issue??
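
For reference, preempt_enable() on a preemptible kernel boils down to
roughly this (include/linux/preempt.h, paraphrased):

	#define preempt_enable() \
	do { \
		barrier(); \
		if (unlikely(preempt_count_dec_and_test())) \
			__preempt_schedule(); \
	} while (0)

so a reschedule that became pending while preemption was off is honored
right at the preempt_enable() call.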

However, your question is making me think about whether we have a race
in the current code on fully preemptible kernels: if we get preempted
before calling kvm_timer_sync_hwstate() and kvm_vgic_sync_hwstate(),
then we could potentially schedule another vcpu on this core and
lose/corrupt state, can we not?  We probably need to check for this in
kvm_vcpu_load/kvm_vcpu_put.  I need to think more about whether this is
a real issue or if I'm seeing ghosts.

> > 
> > It is entirely possible that I'm missing the mark here and everything
> > gets solved by enabling HAVE_VIRT_CPU_ACCOUNTING_GEN, or perhaps we
> > need some extra logic?
> 
> I think this is something for us to look into, taking a low-level
> issue up to the application level - for state-machine based
> applications (I guess like NFV) it causes problems when root-causing
> issues, since a lot of activity runs between ticks.  For VMs,
> cycle-based accounting is probably a must in that case.
> 
Would you run with NO_HZ_FULL in this case?  Because then we should just
enable HAVE_VIRT_CPU_ACCOUNTING_GEN, and I think that would be a good
start.

Thanks,
-Christoffer
Christoffer Dall June 2, 2015, 9:28 a.m. UTC | #5
On Mon, Jun 01, 2015 at 03:37:32PM +0200, Christian Borntraeger wrote:
> Am 01.06.2015 um 15:35 schrieb Christoffer Dall:
> > On Mon, Jun 01, 2015 at 11:21:19AM +0200, Christian Borntraeger wrote:
> >> Am 01.06.2015 um 11:08 schrieb Christoffer Dall:
> >>
> >>>>>
> >>>>> Second, looking at the ppc and mips code, they seem to also call
> >>>>> kvm_guest_exit() before enabling interrupts, so I don't understand how
> >>>>> guest CPU time accounting works on those architectures.
> >>>>
> >>>> Not an expert here, but I assume mips has the same logic as arm, so
> >>>> if your patch is right for arm it's probably also right for mips.
> >>>>
> >>>> powerpc looks similar to what s390 does (not using the tick; instead
> >>>> it uses a hw-timer), so this should be fine.
> >>>>
> >>> I wonder if we can simply enable HAVE_VIRT_CPU_ACCOUNTING_GEN and get
> >>> this for free, which would avoid the need for this patch?
> >>
> >> Assuming that HAVE_VIRT_CPU_ACCOUNTING_GEN behaves similarly to
> >> HAVE_VIRT_CPU_ACCOUNTING on s390/power with respect to not relying on
> >> ticks - yes, it might work out. Can you give it a try?
> >>
> > Adding HAVE_VIRT_CPU_ACCOUNTING_GEN to arch/arm64/Kconfig works, but it
> > has no effect unless you also enable CONFIG_NO_HZ_FULL, so that hardly
> > feels like a fix, since it would be a shame to force users to enable
> > that config option just to get CPU usage reported correctly.
> > 
> > I'm not entirely sure what the history and meaning behind these configs
> > are, so maybe there is an entirely different rework needed here.  It
> > seems logical that you could simply sample the counter at entry/exit of
> > the guest, but if there is nowhere to store this data without
> > NO_HZ_FULL+VIRT_CPU_ACCOUNTING_GEN, then I guess that would explain it.
> 
> Given Paolo's response that irq_disable/enable is faster than
> save/restore, at least on x86, your v2 patch might actually be the
> right thing to do.
> 
Thanks, I think so too, but we should enable
HAVE_VIRT_CPU_ACCOUNTING_GEN for arm64 as well, assuming there are no
mysterious side effects of doing so.

-Christoffer
Christoffer Dall June 2, 2015, 11:55 a.m. UTC | #6
[replying to myself]

On Tue, Jun 02, 2015 at 11:27:59AM +0200, Christoffer Dall wrote:

[..]

> > > 
> > > If this patch is incorrect, then how does it work on x86, where
> > > handle_external_intr() is called (with a barrier in between) before
> > > kvm_guest_exit(), and where handle_external_intr() is simply
> > > local_irq_enable() on SVM and something more complicated on VMX?
> > > 
> > > Finally, how exactly is preemption latency added here?  Won't IRQ
> > > processing run with higher priority than any task on your system, so
> > > that the order of (1) process pending IRQs and (2) call schedule if
> > > needed is still preserved here, with kvm_guest_exit() merely called
> > > between (1) and (2) instead of before (1)?
> > 
> > I may be missing something, but on return from interrupt with
> > preemption disabled we can't take the need-resched path, and need to
> > return to KVM, no?
> 
> preempt_enable() will call __preempt_schedule() and cause preemption
> there, so you're talking about adding these lines of latency:
> 
> 	kvm_guest_exit();
> 	trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
> 
> And these were called with interrupts disabled before, so I don't see
> the issue??
> 
> However, your question is making me think about whether we have a race
> in the current code on fully preemptible kernels: if we get preempted
> before calling kvm_timer_sync_hwstate() and kvm_vgic_sync_hwstate(),
> then we could potentially schedule another vcpu on this core and
> lose/corrupt state, can we not?  We probably need to check for this in
> kvm_vcpu_load/kvm_vcpu_put.  I need to think more about whether this is
> a real issue or if I'm seeing ghosts.
> 
I've thought about it and I don't think there's a race because those
functions don't access the hardware directly, but only manipulate
per-vcpu data structures.

-Christoffer
Christoffer Dall June 8, 2015, 11:35 a.m. UTC | #7
On Fri, Jun 05, 2015 at 05:24:07AM -0700, Mario Smarduch wrote:
> On 06/02/2015 02:27 AM, Christoffer Dall wrote:
> > On Mon, Jun 01, 2015 at 08:48:22AM -0700, Mario Smarduch wrote:
> >> On 05/30/2015 11:59 PM, Christoffer Dall wrote:
> >>> Hi Mario,
> >>>
> >>> On Fri, May 29, 2015 at 03:34:47PM -0700, Mario Smarduch wrote:
> >>>> On 05/28/2015 11:49 AM, Christoffer Dall wrote:
> >>>>> Until now we have been calling kvm_guest_exit after re-enabling
> >>>>> interrupts when we come back from the guest, but this has the
> >>>>> unfortunate effect that CPU time accounting done in the context of timer
> >>>>> interrupts occurring while the guest is running doesn't properly notice
> >>>>> that the time since the last tick was spent in the guest.
> >>>>>
> >>>>> Inspired by the comment in the x86 code, move the kvm_guest_exit() call
> >>>>> below the local_irq_enable() call and change __kvm_guest_exit() to
> >>>>> kvm_guest_exit(), because we are now calling this function with
> >>>>> interrupts enabled.  We now have to explicitly disable preemption, and
> >>>>> must not re-enable it until after kvm_guest_exit() has been called,
> >>>>> since otherwise we could be preempted and everything happening before
> >>>>> we eventually get scheduled again would be accounted for as guest time.
> >>>>>
> >>>>> At the same time, move the trace_kvm_exit() call outside of the atomic
> >>>>> section, since there is no reason for us to do that with interrupts
> >>>>> disabled.
> >>>>>
> >>>>> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
> >>>>> ---
> >>>>> This patch is based on kvm/queue, because it has the kvm_guest_enter/exit
> >>>>> rework recently posted by Christian Borntraeger.  I hope I got the logic
> >>>>> of this right, there were 2 slightly worrying facts about this:
> >>>>>
> >>>>> First, we now enable, disable, and enable interrupts on each exit
> >>>>> path, but I couldn't see any performance overhead on hackbench - yes,
> >>>>> the only benchmark we care about.
> >>>>>
> >>>>> Second, looking at the ppc and mips code, they seem to also call
> >>>>> kvm_guest_exit() before enabling interrupts, so I don't understand how
> >>>>> guest CPU time accounting works on those architectures.
> >>>>>
> >>>>> Changes since v1:
> >>>>>  - Tweak comment and commit text based on Marc's feedback.
> >>>>>  - Explicitly disable preemption and enable it only after kvm_guest_exit().
> >>>>>
> >>>>>  arch/arm/kvm/arm.c | 21 +++++++++++++++++----
> >>>>>  1 file changed, 17 insertions(+), 4 deletions(-)
> >>>>>
> >>>>> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> >>>>> index e41cb11..fe8028d 100644
> >>>>> --- a/arch/arm/kvm/arm.c
> >>>>> +++ b/arch/arm/kvm/arm.c
> >>>>> @@ -532,6 +532,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >>>>>  		kvm_vgic_flush_hwstate(vcpu);
> >>>>>  		kvm_timer_flush_hwstate(vcpu);
> >>>>>  
> >>>>> +		preempt_disable();
> >>>>>  		local_irq_disable();
> >>>>>  
> >>>>>  		/*
> >>>>> @@ -544,6 +545,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >>>>>  
> >>>>>  		if (ret <= 0 || need_new_vmid_gen(vcpu->kvm)) {
> >>>>>  			local_irq_enable();
> >>>>> +			preempt_enable();
> >>>>>  			kvm_timer_sync_hwstate(vcpu);
> >>>>>  			kvm_vgic_sync_hwstate(vcpu);
> >>>>>  			continue;
> >>>>> @@ -559,8 +561,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >>>>>  		ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);
> >>>>>  
> >>>>>  		vcpu->mode = OUTSIDE_GUEST_MODE;
> >>>>> -		__kvm_guest_exit();
> >>>>> -		trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
> >>>>> +		/*
> >>>>> +		 * Back from guest
> >>>>> +		 *************************************************************/
> >>>>> +
> >>>>>  		/*
> >>>>>  		 * We may have taken a host interrupt in HYP mode (ie
> >>>>>  		 * while executing the guest). This interrupt is still
> >>>>> @@ -574,8 +578,17 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >>>>>  		local_irq_enable();
> >>>>>  
> >>>>>  		/*
> >>>>> -		 * Back from guest
> >>>>> -		 *************************************************************/
> >>>>> +		 * We do local_irq_enable() before calling kvm_guest_exit() so
> >>>>> +		 * that if a timer interrupt hits while running the guest we
> >>>>> +		 * account that tick as being spent in the guest.  We enable
> >>>>> +		 * preemption after calling kvm_guest_exit() so that if we get
> >>>>> +		 * preempted we make sure ticks after that is not counted as
> >>>>> +		 * guest time.
> >>>>> +		 */
> >>>>> +		kvm_guest_exit();
> >>>>> +		trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
> >>>>> +		preempt_enable();
> >>>>> +
> >>>>>  
> >>>>>  		kvm_timer_sync_hwstate(vcpu);
> >>>>>  		kvm_vgic_sync_hwstate(vcpu);
> >>>>>
> >>>>
> >>>> Hi Christoffer,
> >>>>  so currently we take a snapshot when we enter the guest
> >>>> (tsk->vtime_snap) and upon exit add the time we spent in
> >>>> the guest and update the accrued time, which appears correct.
> >>>
> >>> not on arm64, because we don't select HAVE_VIRT_CPU_ACCOUNTING_GEN.  Or
> >>> am I missing something obvious here?
> >> I see what you mean; we can't use cycle-based accounting to accrue
> >> guest time.
> >>
> > 
> > See the other thread; we can enable this in the config, but it still
> > only works with NO_HZ_FULL.
> > 
> >>>
> >>>>
> >>>> With this patch it appears that interrupts running
> >>>> in host mode are accrued to Guest time, and additional preemption
> >>>> latency is added.
> >>>>
> >>> It is true that interrupts processed in host mode (if they hit on a CPU
> >>> while it is running a guest) are accrued to guest time, but without this
> >>> patch on current arm64 we accrue no CPU time to guest time at all, which
> >>> is hardly more correct.
> >> Yes, if only ticks are supported then it's definitely better!
> >>
> >> Nevertheless, with a high interrupt rate it will complicate
> >> root-causing issues; a lot of that time will go to the guest.
> > 
> > That sounds like a separate fix to me; if there's an existing mechanism
> > to account for time spent on hw IRQs and it is somehow invalidated by
> > the PF_VCPU flag being set, then that feels wrong, at least for how
> > arm64 works, but it doesn't make this patch less correct.
> 
> Tracing through the code (account_system_time()), it appears that if
> the timer fires while an IRQ runs, the tick is accounted to host IRQ
> mode (CPUTIME_IRQ); otherwise it's accrued to the guest.  Under heavy
> interrupt load it's likely the guest will miss some ticks.  That is
> the reverse of what I initially thought, but in practice guest time
> should be ok as far as ticks go.
> 
> > 
> >>
> >>>
> >>> If this patch is incorrect, then how does it work on x86, where
> >>> handle_external_intr() is called (with a barrier in between) before
> >>> kvm_guest_exit(), and where handle_external_intr() is simply
> >>> local_irq_enable() on SVM and something more complicated on VMX?
> >>>
> >>> Finally, how exactly is preemption latency added here?  Won't IRQ
> >>> processing run with higher priority than any task on your system, so
> >>> that the order of (1) process pending IRQs and (2) call schedule if
> >>> needed is still preserved here, with kvm_guest_exit() merely called
> >>> between (1) and (2) instead of before (1)?
> >>
> >> I may be missing something, but on return from interrupt with
> >> preemption disabled we can't take the need-resched path, and need to
> >> return to KVM, no?
> > 
> > preempt_enable() will call __preempt_schedule() and cause preemption
> > there, so you're talking about adding these lines of latency:
> > 
> > 	kvm_guest_exit();
> > 	trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
> 
> On return from IRQ this should execute - and el1_preempt won't
> get called.
> 
> #ifdef CONFIG_PREEMPT
>         get_thread_info tsk
>         ldr     w24, [tsk, #TI_PREEMPT]         // get preempt count
>         cbnz    w24, 1f                         // preempt count != 0
>         ldr     x0, [tsk, #TI_FLAGS]            // get flags
>         tbz     x0, #TIF_NEED_RESCHED, 1f       // needs rescheduling?
>         bl      el1_preempt
> 1:
> #endif
> 

I understand that, but then you call preempt_enable() right after,
which calls __preempt_schedule(), which has the same effect as that asm
snippet you pasted here.

> 
> > 
> > And these were called with interrupts disabled before, so I don't see
> > the issue??
> > 
> > However, your question is making me think about whether we have a race
> > in the current code on fully preemptible kernels: if we get preempted
> > before calling kvm_timer_sync_hwstate() and kvm_vgic_sync_hwstate(),
> > then we could potentially schedule another vcpu on this core and
> > lose/corrupt state, can we not?  We probably need to check for this in
> > kvm_vcpu_load/kvm_vcpu_put.  I need to think more about whether this is
> > a real issue or if I'm seeing ghosts.
> 
> Yes, it appears like it could be an issue in PREEMPT mode.

see separate mail, I don't believe this to be an issue anymore.

> > 
> >>>
> >>> It is entirely possible that I'm missing the mark here and everything
> >>> gets solved by enabling HAVE_VIRT_CPU_ACCOUNTING_GEN, or perhaps we
> >>> need some extra logic?
> >>
> >> I think this is something for us to look into, taking a low-level
> >> issue up to the application level - for state-machine based
> >> applications (I guess like NFV) it causes problems when root-causing
> >> issues, since a lot of activity runs between ticks.  For VMs,
> >> cycle-based accounting is probably a must in that case.
> >>
> > Would you run with NO_HZ_FULL in this case?  Because then we should just
> > enable HAVE_VIRT_CPU_ACCOUNTING_GEN, and I think that would be a good
> > start.
> It may have a use case to run an isolated vCPU, but in general any mode
> may be used (NO_HZ, even low PERIODIC).
> 
ok, but I still think it would be more correct to have this patch than
not to.

-Christoffer
Christoffer Dall June 9, 2015, 2:43 p.m. UTC | #8
On Mon, Jun 08, 2015 at 06:50:08PM +0100, Marc Zyngier wrote:
> Hi Christoffer,
> 
> On 28/05/15 19:49, Christoffer Dall wrote:
> > Until now we have been calling kvm_guest_exit after re-enabling
> > interrupts when we come back from the guest, but this has the
> > unfortunate effect that CPU time accounting done in the context of timer
> > interrupts occurring while the guest is running doesn't properly notice
> > that the time since the last tick was spent in the guest.
> > 
> > Inspired by the comment in the x86 code, move the kvm_guest_exit() call
> > below the local_irq_enable() call and change __kvm_guest_exit() to
> > kvm_guest_exit(), because we are now calling this function with
> > interrupts enabled.  We now have to explicitly disable preemption, and
> > must not re-enable it until after kvm_guest_exit() has been called,
> > since otherwise we could be preempted and everything happening before
> > we eventually get scheduled again would be accounted for as guest time.
> > 
> > At the same time, move the trace_kvm_exit() call outside of the atomic
> > section, since there is no reason for us to do that with interrupts
> > disabled.
> > 
> > Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
> > ---
> > This patch is based on kvm/queue, because it has the kvm_guest_enter/exit
> > rework recently posted by Christian Borntraeger.  I hope I got the logic
> > of this right, there were 2 slightly worrying facts about this:
> > 
> > First, we now enable, disable, and enable interrupts on each exit
> > path, but I couldn't see any performance overhead on hackbench - yes,
> > the only benchmark we care about.
> > 
> > Second, looking at the ppc and mips code, they seem to also call
> > kvm_guest_exit() before enabling interrupts, so I don't understand how
> > guest CPU time accounting works on those architectures.
> > 
> > Changes since v1:
> >  - Tweak comment and commit text based on Marc's feedback.
> >  - Explicitly disable preemption and enable it only after kvm_guest_exit().
> > 
> >  arch/arm/kvm/arm.c | 21 +++++++++++++++++----
> >  1 file changed, 17 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> > index e41cb11..fe8028d 100644
> > --- a/arch/arm/kvm/arm.c
> > +++ b/arch/arm/kvm/arm.c
> > @@ -532,6 +532,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >  		kvm_vgic_flush_hwstate(vcpu);
> >  		kvm_timer_flush_hwstate(vcpu);
> >  
> > +		preempt_disable();
> >  		local_irq_disable();
> >  
> >  		/*
> > @@ -544,6 +545,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >  
> >  		if (ret <= 0 || need_new_vmid_gen(vcpu->kvm)) {
> >  			local_irq_enable();
> > +			preempt_enable();
> >  			kvm_timer_sync_hwstate(vcpu);
> >  			kvm_vgic_sync_hwstate(vcpu);
> >  			continue;
> > @@ -559,8 +561,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >  		ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);
> >  
> >  		vcpu->mode = OUTSIDE_GUEST_MODE;
> > -		__kvm_guest_exit();
> > -		trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
> > +		/*
> > +		 * Back from guest
> > +		 *************************************************************/
> > +
> >  		/*
> >  		 * We may have taken a host interrupt in HYP mode (ie
> >  		 * while executing the guest). This interrupt is still
> > @@ -574,8 +578,17 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> >  		local_irq_enable();
> >  
> >  		/*
> > -		 * Back from guest
> > -		 *************************************************************/
> > +		 * We do local_irq_enable() before calling kvm_guest_exit() so
> > +		 * that if a timer interrupt hits while running the guest we
> > +		 * account that tick as being spent in the guest.  We enable
> > +		 * preemption after calling kvm_guest_exit() so that if we get
> > +		 * preempted we make sure ticks after that is not counted as
> > +		 * guest time.
> > +		 */
> > +		kvm_guest_exit();
> > +		trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
> > +		preempt_enable();
> > +
> >  
> >  		kvm_timer_sync_hwstate(vcpu);
> >  		kvm_vgic_sync_hwstate(vcpu);
> > 
> 
> I've been thinking about this a bit more, and I wonder if we can
> simplify it a bit. At the moment, we disable interrupts around the
> whole HYP entry. But now that you have introduced preempt_disable, it
> looks like we can move the local_irq_disable call to be just after
> __kvm_guest_enter, and not bother with having such a long critical
> section.
> 
> This is possible because entering HYP mode automatically masks
> interrupts, and we restore PSTATE on exception return.
> 
> I think this would slightly reduce the amount of code we run on the host
> that gets accounted to the guest.
> 
> Thoughts?
> 
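For concreteness, a hypothetical sketch of the reordering Marc seems to
be proposing (editorial illustration, not code from the thread):

	vcpu->mode = IN_GUEST_MODE;
	__kvm_guest_enter();
	local_irq_disable();	/* short critical section; HYP entry
				 * masks interrupts anyway */

	ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);

	vcpu->mode = OUTSIDE_GUEST_MODE;
	local_irq_enable();

The signal_pending() re-check would then run with interrupts enabled,
which is exactly the concern raised in the reply below.
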
Isn't there a situation then where the guest can get stuck because we
don't properly check for signal handling?

As I recall, we have the local_irq_disable() call before checking
signal_pending(), so that if, for example, another thread sends a signal
to that VCPU we either:
 (1) handle the IPI and see we have a signal pending, so we abort, or
 (2) don't handle the IPI because IRQs are disabled, enter the guest,
     but as soon as we run the guest the interrupt hits, we go back,
     see the signal, and exit.
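
The pattern being relied on is roughly this (paraphrased from the
existing kvm_arch_vcpu_ioctl_run()):

	local_irq_disable();

	/*
	 * Re-check atomic conditions: either the signal IPI has already
	 * run and signal_pending() is set, or it is still pending and
	 * will fire the moment the guest is entered, forcing an exit.
	 */
	if (signal_pending(current)) {
		ret = -EINTR;
		run->exit_reason = KVM_EXIT_INTR;
	}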

There was something like this which caused a guest to get stuck with
userspace timers on v7.

Am I making sense at all?

-Christoffer

Patch

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index e41cb11..fe8028d 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -532,6 +532,7 @@  int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
 		kvm_vgic_flush_hwstate(vcpu);
 		kvm_timer_flush_hwstate(vcpu);
 
+		preempt_disable();
 		local_irq_disable();
 
 		/*
@@ -544,6 +545,7 @@  int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
 
 		if (ret <= 0 || need_new_vmid_gen(vcpu->kvm)) {
 			local_irq_enable();
+			preempt_enable();
 			kvm_timer_sync_hwstate(vcpu);
 			kvm_vgic_sync_hwstate(vcpu);
 			continue;
@@ -559,8 +561,10 @@  int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
 		ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);
 
 		vcpu->mode = OUTSIDE_GUEST_MODE;
-		__kvm_guest_exit();
-		trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
+		/*
+		 * Back from guest
+		 *************************************************************/
+
 		/*
 		 * We may have taken a host interrupt in HYP mode (ie
 		 * while executing the guest). This interrupt is still
@@ -574,8 +578,17 @@  int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
 		local_irq_enable();
 
 		/*
-		 * Back from guest
-		 *************************************************************/
+		 * We do local_irq_enable() before calling kvm_guest_exit() so
+		 * that if a timer interrupt hits while running the guest we
+		 * account that tick as being spent in the guest.  We enable
+		 * preemption after calling kvm_guest_exit() so that if we get
+		 * preempted we make sure ticks after that is not counted as
+		 * guest time.
+		 */
+		kvm_guest_exit();
+		trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
+		preempt_enable();
+
 
 		kvm_timer_sync_hwstate(vcpu);
 		kvm_vgic_sync_hwstate(vcpu);