
[RFC,0/10] Cleaning up the KVM clock mess

Message ID 20240418193528.41780-1-dwmw2@infradead.org


David Woodhouse April 18, 2024, 7:34 p.m. UTC
I lied; there aren't three different definitions of the KVM clock. The
fourth is on i386, where it's still based on CLOCK_MONOTONIC (well,
boot time, which might as well be the same since we offset it anyway).

If we fix *that* to be based on CLOCK_MONOTONIC_RAW then we can rip out
a whole bunch of mechanisms which were added to cope with NTP frequency
skew.

This cleans up the mess and gets us back down to the two unavoidable
definitions of the KVM clock: when the host and guest TSCs are well
behaved and in sync, it's in "master clock" mode where it's defined as
a simple arithmetic function of the guest TSC, otherwise it's clamped
to the host's CLOCK_MONOTONIC_RAW.

It includes Jack's KVM_[GS]ET_CLOCK_GUEST patch to allow accurate
migration. Also my KVM_VCPU_TSC_SCALE which exposes the precise TSC
scaling factors. This is needed to get accurate migration of the guest
TSC, and can *also* be used by userspace to have vDSO-style access to
the KVM clock. Thus allowing hypercalls and other emulated clock devices
(e.g. PIT, HPET, ACPI timer) to be based on the KVM clock too, giving
*consistency* across a live migration.

I do still need to fix KVM_REQ_MASTERCLOCK_UPDATE so that it doesn't
clamp the clock back to CLOCK_MONOTONIC_RAW; we should *update* the
ka->kvmclock_offset when we've been running in use_master_clock mode.
Should probably do that in kvm_arch_hardware_disable() for timekeeping
across hibernation too, but I haven't finished working that one out.

I think there are still some places left where KVM reads the time twice 
in close(ish) succession and then assumes they were at the *same* time, 
which I'll audit and fix too.

I also need to flesh out the test cases and do some real testing from
VMMs, but I think it's ready for some heckling at least.

https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/clocks

David Woodhouse (8):
      KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init()
      KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force
      KVM: x86: Explicitly disable TSC scaling without CONSTANT_TSC
      KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration
      KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host
      KVM: x86: Remove periodic global clock updates
      KVM: x86: Kill KVM_REQ_GLOBAL_CLOCK_UPDATE
      KVM: x86: Fix KVM clock precision in __get_kvmclock()

Jack Allister (2):
      KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
      KVM: selftests: Add KVM/PV clock selftest to prove timer correction

 Documentation/virt/kvm/api.rst                    |  37 ++
 Documentation/virt/kvm/devices/vcpu.rst           | 115 ++++--
 arch/x86/include/asm/kvm_host.h                   |   8 +-
 arch/x86/include/uapi/asm/kvm.h                   |   6 +
 arch/x86/kvm/svm/svm.c                            |   3 +-
 arch/x86/kvm/vmx/vmx.c                            |   2 +-
 arch/x86/kvm/x86.c                                | 438 +++++++++++++---------
 arch/x86/kvm/xen.c                                |   4 +-
 include/uapi/linux/kvm.h                          |   3 +
 tools/testing/selftests/kvm/Makefile              |   1 +
 tools/testing/selftests/kvm/x86_64/pvclock_test.c | 192 ++++++++++
 11 files changed, 600 insertions(+), 209 deletions(-)

Comments

David Woodhouse April 19, 2024, 12:52 p.m. UTC | #1
On Thu, 2024-04-18 at 20:34 +0100, David Woodhouse wrote:
> 
>       KVM: x86: Remove periodic global clock updates
>       KVM: x86: Kill KVM_REQ_GLOBAL_CLOCK_UPDATE

Meh, I might have to put those back. They were originally introduced to
cope with NTP frequency skew which is no longer a problem, but we now
know there's a systemic skew even between the host CLOCK_MONOTONIC_RAW
and the KVM clock as calculated via the guest's TSC.

So at least when !ka->use_master_clock I think the forced resync does
need to happen.
Paul Durrant April 19, 2024, 3:49 p.m. UTC | #2
On 18/04/2024 20:34, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> The documentation on TSC migration using KVM_VCPU_TSC_OFFSET is woefully
> inadequate. It ignores TSC scaling, and ignores the fact that the host
> TSC may differ from one host to the next (and in fact because of the way
> the kernel calibrates it, it generally differs from one boot to the next
> even on the same hardware).
> 
> Add KVM_VCPU_TSC_SCALE to extract the actual scale ratio and frac_bits,
> and attempt to document the *awful* process that we're requiring userspace
> to follow to merely preserve the TSC across migration.
> 
> I may have thrown up in my mouth a little when writing that documentation.
> It's an awful API. If we do this, we should be ashamed of ourselves.
> (I also haven't tested the documented process yet).
> 
> Let's use Simon's KVM_VCPU_TSC_VALUE instead.
> https://lore.kernel.org/all/20230202165950.483430-1-sveith@amazon.de/
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   Documentation/virt/kvm/devices/vcpu.rst | 115 ++++++++++++++++++------
>   arch/x86/include/uapi/asm/kvm.h         |   6 ++
>   arch/x86/kvm/x86.c                      |  15 ++++
>   3 files changed, 109 insertions(+), 27 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>
David Woodhouse April 19, 2024, 3:53 p.m. UTC | #3
On Fri, 2024-04-19 at 16:49 +0100, Paul Durrant wrote:
> On 18/04/2024 20:34, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > The documentation on TSC migration using KVM_VCPU_TSC_OFFSET is woefully
> > inadequate. It ignores TSC scaling, and ignores the fact that the host
> > TSC may differ from one host to the next (and in fact because of the way
> > the kernel calibrates it, it generally differs from one boot to the next
> > even on the same hardware).
> > 
> > Add KVM_VCPU_TSC_SCALE to extract the actual scale ratio and frac_bits,
> > and attempt to document the *awful* process that we're requiring userspace
> > to follow to merely preserve the TSC across migration.
> > 
> > I may have thrown up in my mouth a little when writing that documentation.
> > It's an awful API. If we do this, we should be ashamed of ourselves.
> > (I also haven't tested the documented process yet).
> > 
> > Let's use Simon's KVM_VCPU_TSC_VALUE instead.
> > https://lore.kernel.org/all/20230202165950.483430-1-sveith@amazon.de/

Oops, I meant to change that commit message. I've changed my mind, as
KVM_VCPU_TSC_VALUE sets the TSC using the KVM clock as a reference...
which is utterly backwards, since the KVM clock is *defined* in terms
of the TSC.

Will do so before posting it again.

May also include the "loop over KVM_GET_CLOCK until tai_offset didn't
change" part to fix the leap second problem.

I think we do need to implement a better way to get { TAI, host TSC }
than abusing KVM_GET_CLOCK when we don't even *care* about the KVM
clock. We need a way to migrate the arch counters on platforms other
than x86 too. But I'm trying to leave that part for later. At least on
x86 we *do* have a way to migrate the TSC accurately, even if it's
awful.

> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > ---
> >   Documentation/virt/kvm/devices/vcpu.rst | 115 ++++++++++++++++++------
> >   arch/x86/include/uapi/asm/kvm.h         |   6 ++
> >   arch/x86/kvm/x86.c                      |  15 ++++
> >   3 files changed, 109 insertions(+), 27 deletions(-)
> > 
> 
> Reviewed-by: Paul Durrant <paul@xen.org>
>
Paul Durrant April 19, 2024, 4:07 p.m. UTC | #4
On 18/04/2024 20:34, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> When in 'master clock mode' (i.e. when host and guest TSCs are behaving
> sanely and in sync), the KVM clock is defined in terms of the guest TSC.
> 
> When TSC scaling is used, calculating the KVM clock directly from *host*
> TSC cycles leads to a systemic drift from the values calculated by the
> guest from its TSC.
> 
> Commit 451a707813ae ("KVM: x86/xen: improve accuracy of Xen timers")
> had a simple workaround for the specific case of Xen timers, as it had an
> actual vCPU to hand and could use its scaling information. That commit
> noted that it was broken for the general case of get_kvmclock_ns(), and
> said "I'll come back to that".
> 
> Since __get_kvmclock() is invoked without a specific CPU, it needs to
> be able to find or generate the scaling values required to perform the
> correct calculation.
> 
> Thankfully, TSC scaling can only happen with X86_FEATURE_CONSTANT_TSC,
> so it isn't as complex as it might have been.
> 
> In __kvm_synchronize_tsc(), note the current vCPU's scaling ratio in
> kvm->arch.last_tsc_scaling_ratio. That is only protected by the
> tsc_writE_lock, so in pvclock_update_vm_gtod_copy(), copy it into a

^ typo: capitalization of the 'E'

> separate kvm->arch.master_tsc_scaling_ratio so that it can be accessed
> using the kvm->arch.pvclock_sc seqcount lock. Also generate the mul and
> shift factors to convert to nanoseconds for the corresponding KVM clock,
> just as kvm_guest_time_update() would.
> 
> In __get_kvmclock(), which runs within a seqcount retry loop, use those
> values to convert host to guest TSC and then to nanoseconds. Only fall
> back to using get_kvmclock_base_ns() when not in master clock mode.
> 
> There was previously a code path in __get_kvmclock() which looked like
> it could set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME, perhaps
> even on 32-bit hosts. In practice that could never happen as the
> ka->use_master_clock flag couldn't be set on 32-bit, and even on 64-bit
> hosts it would never be set when the system clock isn't TSC-based. So
> that code path is now removed.
> 
> The kvm_get_wall_clock_epoch() function had the same problem; make it
> just call get_kvmclock() and subtract kvmclock from wallclock, with
> the same fallback as before.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   arch/x86/include/asm/kvm_host.h |   4 +
>   arch/x86/kvm/x86.c              | 150 ++++++++++++++++----------------
>   2 files changed, 78 insertions(+), 76 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index cfac72b4aa64..13f979dd14b9 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1353,6 +1353,7 @@ struct kvm_arch {
>   	u64 last_tsc_write;
>   	u32 last_tsc_khz;
>   	u64 last_tsc_offset;
> +	u64 last_tsc_scaling_ratio;
>   	u64 cur_tsc_nsec;
>   	u64 cur_tsc_write;
>   	u64 cur_tsc_offset;
> @@ -1366,6 +1367,9 @@ struct kvm_arch {
>   	bool use_master_clock;
>   	u64 master_kernel_ns;
>   	u64 master_cycle_now;
> +	u64 master_tsc_scaling_ratio;
> +	u32 master_tsc_mul;
> +	s8 master_tsc_shift;
>   	struct delayed_work kvmclock_update_work;
>   	struct delayed_work kvmclock_sync_work;
>   
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f870e29d2558..5cd92f4b4c97 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2671,6 +2671,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
>   	kvm->arch.last_tsc_nsec = ns;
>   	kvm->arch.last_tsc_write = tsc;
>   	kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
> +	kvm->arch.last_tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
>   	kvm->arch.last_tsc_offset = offset;
>   
>   	vcpu->arch.last_guest_tsc = tsc;
> @@ -3006,6 +3007,7 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
>   {
>   #ifdef CONFIG_X86_64
>   	struct kvm_arch *ka = &kvm->arch;
> +	uint64_t last_tsc_hz;
>   	int vclock_mode;
>   	bool host_tsc_clocksource, vcpus_matched;
>   
> @@ -3025,6 +3027,34 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
>   				&& !ka->backwards_tsc_observed
>   				&& !ka->boot_vcpu_runs_old_kvmclock;
>   
> +	/*
> +	 * When TSC scaling is in use (which can thankfully only happen
> +	 * with X86_FEATURE_CONSTANT_TSC), the host must calculate the
> +	 * KVM clock precisely as the guest would, by scaling through
> +	 * the guest TSC frequency. Otherwise, differences in arithmetic
> +	 * precision lead to systemic drift between the guest's and the
> +	 * host's idea of the time.
> +	 */
> +	if (kvm_caps.has_tsc_control) {
> +		/*
> +		 * Copy from the field protected solely by ka->tsc_write_lock,
> +		 * to the field protected by the ka->pvclock_sc seqlock.
> +		 */
> +		ka->master_tsc_scaling_ratio = ka->last_tsc_scaling_ratio;
> +
> +		/*
> +		 * Calculate the scaling factors precisely the same way
> +		 * that kvm_guest_time_update() does.
> +		 */
> +		last_tsc_hz = kvm_scale_tsc(tsc_khz * 1000,
> +					    ka->last_tsc_scaling_ratio);
> +		kvm_get_time_scale(NSEC_PER_SEC, last_tsc_hz,
> +				   &ka->master_tsc_shift, &ka->master_tsc_mul);
> +	} else if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
> +		kvm_get_time_scale(NSEC_PER_SEC, tsc_khz * 1009,

1009?

> +				   &ka->master_tsc_shift, &ka->master_tsc_mul);
> +	}
> +
>   	if (ka->use_master_clock)
>   		atomic_set(&kvm_guest_has_master_clock, 1);
>