diff mbox series

[v2,01/15] KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init()

Message ID 20240427111929.9600-2-dwmw2@infradead.org
State New
Headers show
Series [v2,01/15] KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init() | expand

Commit Message

David Woodhouse April 27, 2024, 11:04 a.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

The KVM clock is an interesting thing. It is defined as "nanoseconds
since the guest was created", but in practice it runs at two *different*
rates — or three different rates, if you count implementation bugs.

Definition A is that it runs synchronously with the CLOCK_MONOTONIC_RAW
of the host, with a delta of kvm->arch.kvmclock_offset.

But that version doesn't actually get used in the common case, where the
host has a reliable TSC and the guest TSCs are all running at the same
rate and in sync with each other, and kvm->arch.use_master_clock is set.

In that common case, definition B is used: There is a reference point in
time at kvm->arch.master_kernel_ns (again a CLOCK_MONOTONIC_RAW time),
and a corresponding host TSC value kvm->arch.master_cycle_now. This
fixed point in time is converted to guest units (the time offset by
kvmclock_offset and the TSC Value scaled and offset to be a guest TSC
value) and advertised to the guest in the pvclock structure. While in
this 'use_master_clock' mode, the fixed point in time never needs to be
changed, and the clock runs precisely in time with the guest TSC, at the
rate advertised in the pvclock structure.

The third definition C is implemented in kvm_get_wall_clock_epoch() and
__get_kvmclock(), using the master_cycle_now and master_kernel_ns fields
but converting the *host* TSC cycles directly to a value in nanoseconds
instead of scaling via the guest TSC.

One might naïvely think that all three definitions are identical, since
CLOCK_MONOTONIC_RAW is not skewed by NTP frequency corrections; all
three are just the result of counting the host TSC at a known frequency,
or the scaled guest TSC at a known precise fraction of the host's
frequency. The problem is with arithmetic precision, and the way that
frequency scaling is done in a division-free way by multiplying by a
scale factor, then shifting right. In practice, all three ways of
calculating the KVM clock will suffer a systemic drift from each other.

Eventually, definition C should just be eliminated. Commit 451a707813ae
("KVM: x86/xen: improve accuracy of Xen timers") worked around it for
the specific case of Xen timers, which are defined in terms of the KVM
clock and suffered from a continually increasing error in timer expiry
times. That commit notes that get_kvmclock_ns() is non-trivial to fix
and says "I'll come back to that", which remains true.

Definitions A and B do need to coexist, the former to handle the case
where the host or guest TSC is suboptimally configured. But KVM should
be more careful about switching between them, and the discontinuity in
guest time which could result.

In particular, KVM_REQ_MASTERCLOCK_UPDATE will take a new snapshot of
time as the reference in master_kernel_ns and master_cycle_now, yanking
the guest's clock back to match definition A at that moment.

When invoked from in 'use_master_clock' mode, kvm_update_masterclock()
should probably *adjust* kvm->arch.kvmclock_offset to account for the
drift, instead of yanking the clock back to defintion A. But in the
meantime there are a bunch of places where it just doesn't need to be
invoked at all.

To start with: there is no need to do such an update when a Xen guest
populates the shared_info page. This seems to have been a hangover from
the very first implementation of shared_info which automatically
populated the vcpu_info structures at their default locations, but even
then it should just have raised KVM_REQ_CLOCK_UPDATE on each vCPU
instead of using KVM_REQ_MASTERCLOCK_UPDATE. And now that userspace is
expected to explicitly set the vcpu_info even in its default locations,
there's not even any need for that either.

Fixes: 629b5348841a1 ("KVM: x86/xen: update wallclock region")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 arch/x86/kvm/xen.c | 2 --
 1 file changed, 2 deletions(-)

Comments

Marcelo Tosatti May 4, 2024, 7:42 a.m. UTC | #1
On Sat, Apr 27, 2024 at 12:04:58PM +0100, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> The KVM clock is an interesting thing. It is defined as "nanoseconds
> since the guest was created", but in practice it runs at two *different*
> rates — or three different rates, if you count implementation bugs.
> 
> Definition A is that it runs synchronously with the CLOCK_MONOTONIC_RAW
> of the host, with a delta of kvm->arch.kvmclock_offset.
> 
> But that version doesn't actually get used in the common case, where the
> host has a reliable TSC and the guest TSCs are all running at the same
> rate and in sync with each other, and kvm->arch.use_master_clock is set.
> 
> In that common case, definition B is used: There is a reference point in
> time at kvm->arch.master_kernel_ns (again a CLOCK_MONOTONIC_RAW time),
> and a corresponding host TSC value kvm->arch.master_cycle_now. This
> fixed point in time is converted to guest units (the time offset by
> kvmclock_offset and the TSC Value scaled and offset to be a guest TSC
> value) and advertised to the guest in the pvclock structure. While in
> this 'use_master_clock' mode, the fixed point in time never needs to be
> changed, and the clock runs precisely in time with the guest TSC, at the
> rate advertised in the pvclock structure.
> 
> The third definition C is implemented in kvm_get_wall_clock_epoch() and
> __get_kvmclock(), using the master_cycle_now and master_kernel_ns fields
> but converting the *host* TSC cycles directly to a value in nanoseconds
> instead of scaling via the guest TSC.
> 
> One might naïvely think that all three definitions are identical, since
> CLOCK_MONOTONIC_RAW is not skewed by NTP frequency corrections; all
> three are just the result of counting the host TSC at a known frequency,
> or the scaled guest TSC at a known precise fraction of the host's
> frequency. The problem is with arithmetic precision, and the way that
> frequency scaling is done in a division-free way by multiplying by a
> scale factor, then shifting right. In practice, all three ways of
> calculating the KVM clock will suffer a systemic drift from each other.
> 
> Eventually, definition C should just be eliminated. Commit 451a707813ae
> ("KVM: x86/xen: improve accuracy of Xen timers") worked around it for
> the specific case of Xen timers, which are defined in terms of the KVM
> clock and suffered from a continually increasing error in timer expiry
> times. That commit notes that get_kvmclock_ns() is non-trivial to fix
> and says "I'll come back to that", which remains true.
> 
> Definitions A and B do need to coexist, the former to handle the case
> where the host or guest TSC is suboptimally configured. But KVM should
> be more careful about switching between them, and the discontinuity in
> guest time which could result.
> 
> In particular, KVM_REQ_MASTERCLOCK_UPDATE will take a new snapshot of
> time as the reference in master_kernel_ns and master_cycle_now, yanking
> the guest's clock back to match definition A at that moment.

KVM_REQ_MASTERCLOCK_UPDATE stops the vcpus because:

 * To avoid that problem, do not allow visibility of distinct
 * system_timestamp/tsc_timestamp values simultaneously: use a master
 * copy of host monotonic time values. Update that master copy
 * in lockstep.

> When invoked from in 'use_master_clock' mode, kvm_update_masterclock()
> should probably *adjust* kvm->arch.kvmclock_offset to account for the
> drift, instead of yanking the clock back to defintion A.

You are likely correct...

> But in the meantime there are a bunch of places where it just doesn't need to be
> invoked at all.
> 
> To start with: there is no need to do such an update when a Xen guest
> populates the shared_info page. This seems to have been a hangover from
> the very first implementation of shared_info which automatically
> populated the vcpu_info structures at their default locations, but even
> then it should just have raised KVM_REQ_CLOCK_UPDATE on each vCPU
> instead of using KVM_REQ_MASTERCLOCK_UPDATE. And now that userspace is
> expected to explicitly set the vcpu_info even in its default locations,
> there's not even any need for that either.
> 
> Fixes: 629b5348841a1 ("KVM: x86/xen: update wallclock region")
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> Reviewed-by: Paul Durrant <paul@xen.org>
> ---
>  arch/x86/kvm/xen.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index f65b35a05d91..5a83a8154b79 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -98,8 +98,6 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
>  	wc->version = wc_version + 1;
>  	read_unlock_irq(&gpc->lock);
>  
> -	kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE);
> -
>  out:
>  	srcu_read_unlock(&kvm->srcu, idx);
>  	return ret;
> -- 
> 2.44.0

So KVM_REQ_MASTERCLOCK_UPDATE is to avoid the race above.

In what contexes is kvm_xen_shared_info_init called from again?

Not clear to me KVM_REQ_MASTERCLOCK_UPDATE is not needed (or that is
needed, for that matter...).
David Woodhouse May 7, 2024, 7:08 p.m. UTC | #2
On Sat, 2024-05-04 at 04:42 -0300, Marcelo Tosatti wrote:
> On Sat, Apr 27, 2024 at 12:04:58PM +0100, David Woodhouse wrote:
> > 
> > In particular, KVM_REQ_MASTERCLOCK_UPDATE will take a new snapshot of
> > time as the reference in master_kernel_ns and master_cycle_now, yanking
> > the guest's clock back to match definition A at that moment.
> 
> KVM_REQ_MASTERCLOCK_UPDATE stops the vcpus because:

Took me a while to work that one out, btw. The fact that the 
KVM_REQ_MCLOCK_INPROGRESS request is asserted but never actually
*handled*, so all it does is repeatedly kick the vCPU out and make it
spin until the request is cleared is... interesting. Likewise the way
that we set KVM_REQ_MASTERCLOCK_UPDATE on *all* vCPUs, so they *all*
call kvm_update_masterclock(), when only one of them needed to. I may
clean that up a little.

>  * To avoid that problem, do not allow visibility of distinct
>  * system_timestamp/tsc_timestamp values simultaneously: use a master
>  * copy of host monotonic time values. Update that master copy
>  * in lockstep.

Right. That comment is a lot longer than the part you cited here, and
starts with 'assuming a stable TSC across pCPUS, and a stable TSC
across vCPUs'. It's the "if (ka->use_master_clock)" case.

And yes, what it's basically saying is a special case of the fact that
if you let the KVM clock run at its "natural" rate based on the guest
TSC (definition B), but each vCPU runs at that rate from a *different*
point on the line that is definition A (the host CLOCK_MONOTONIC_RAW),
bad things will happen.

I'm OK with it stopping the vCPUs (although I'd like it to do so in a
less implicitly awful way). But when we don't need to update the
reference time at all, let's not do so.

> > When invoked from in 'use_master_clock' mode, kvm_update_masterclock()
> > should probably *adjust* kvm->arch.kvmclock_offset to account for the
> > drift, instead of yanking the clock back to defintion A.
> 
> You are likely correct...
> 
> > But in the meantime there are a bunch of places where it just doesn't need to be
> > invoked at all.
> > 
> > To start with: there is no need to do such an update when a Xen guest
> > populates the shared_info page. This seems to have been a hangover from
> > the very first implementation of shared_info which automatically
> > populated the vcpu_info structures at their default locations, but even
> > then it should just have raised KVM_REQ_CLOCK_UPDATE on each vCPU
> > instead of using KVM_REQ_MASTERCLOCK_UPDATE. And now that userspace is
> > expected to explicitly set the vcpu_info even in its default locations,
> > there's not even any need for that either.
> > 
> > Fixes: 629b5348841a1 ("KVM: x86/xen: update wallclock region")
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > Reviewed-by: Paul Durrant <paul@xen.org>
> > ---
> >  arch/x86/kvm/xen.c | 2 --
> >  1 file changed, 2 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> > index f65b35a05d91..5a83a8154b79 100644
> > --- a/arch/x86/kvm/xen.c
> > +++ b/arch/x86/kvm/xen.c
> > @@ -98,8 +98,6 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
> >         wc->version = wc_version + 1;
> >         read_unlock_irq(&gpc->lock);
> >  
> > -       kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE);
> > -
> >  out:
> >         srcu_read_unlock(&kvm->srcu, idx);
> >         return ret;
> > -- 
> > 2.44.0
> 
> So KVM_REQ_MASTERCLOCK_UPDATE is to avoid the race above.
> 
> In what contexes is kvm_xen_shared_info_init called from again?
> 
> Not clear to me KVM_REQ_MASTERCLOCK_UPDATE is not needed (or that is
> needed, for that matter...).

We cal kvm_xen_shared_info_init() when the Xen "shared info" page is
set up. The only interesting part of that is the *wallclock* epoch.

The wallclock (just like KSR_KVM_WALL_CLOCK{,_NEW}) is *entirely* hosed
ever since the KVM clock stopped being based on CLOCK_MONOTONIC, since
that means that the value of "wallclock time minus KVM clock time"
actually *changes* as the KVM clock runs at a different rate to
wallclock time. 

I'm looking at a replacement for that which uses the gtod information
to give the guest a direct mapping of guest TSC to host CLOCK_TAI. And
in doing so we can *also* indicate when live migration has potentially
disrupted the guest TSC, so any further NTP/PTP refinement that the
guest may have done for itself needs to be thrown away.
diff mbox series

Patch

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index f65b35a05d91..5a83a8154b79 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -98,8 +98,6 @@  static int kvm_xen_shared_info_init(struct kvm *kvm)
 	wc->version = wc_version + 1;
 	read_unlock_irq(&gpc->lock);
 
-	kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE);
-
 out:
 	srcu_read_unlock(&kvm->srcu, idx);
 	return ret;