diff mbox

[RFC] time: Add warning about imminent deprecation of CONFIG_GENERIC_TIME_VSYSCALL_OLD

Message ID 1494613450-4713-1-git-send-email-john.stultz@linaro.org
State Superseded
Headers show

Commit Message

John Stultz May 12, 2017, 6:24 p.m. UTC
CONFIG_GENERIC_TIME_VSYSCALL_OLD was introduced five years ago
to allow a transition from the old vsyscall implementations to
the new method (which simplified internal accounting and made
timekeeping more precise).

However, PPC and IA64 have yet to make the transition, despite
in some cases me sending test patches to try to help it along.

http://patches.linaro.org/patch/30501/
http://patches.linaro.org/patch/35412/

If its helpful, my last pass at the patches can be found here:
https://git.linaro.org/people/john.stultz/linux.git dev/oldvsyscall-cleanup

So I think its time to set a deadline and make it clear this
is going away. So this patch adds warnings about this
functionality being dropped. Likely to be in v4.14.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>

---
 kernel/time/timekeeping.c | 1 +
 1 file changed, 1 insertion(+)

-- 
2.7.4

Comments

Michael Ellerman May 22, 2017, 12:58 a.m. UTC | #1
John Stultz <john.stultz@linaro.org> writes:

> CONFIG_GENERIC_TIME_VSYSCALL_OLD was introduced five years ago

> to allow a transition from the old vsyscall implementations to

> the new method (which simplified internal accounting and made

> timekeeping more precise).


I'm sure it's completely obvious to everyone except me what needs to be
done, but can you spell it out for me? Keeping in mind that I don't know
anything about the time keeping code.

> However, PPC and IA64 have yet to make the transition, despite

> in some cases me sending test patches to try to help it along.

>

> http://patches.linaro.org/patch/30501/

> http://patches.linaro.org/patch/35412/


I've never seen a PPC patch, did you send one?

> If its helpful, my last pass at the patches can be found here:

> https://git.linaro.org/people/john.stultz/linux.git dev/oldvsyscall-cleanup


Looks like it's just a draft for PPC. Do you think that should work and
it just needs testing? The comment about the vdso being "slightly
behind" is a little concerning.

cheers
John Stultz May 22, 2017, 7:06 p.m. UTC | #2
On Sun, May 21, 2017 at 5:58 PM, Michael Ellerman <mpe@ellerman.id.au> wrote:
> John Stultz <john.stultz@linaro.org> writes:

>

>> CONFIG_GENERIC_TIME_VSYSCALL_OLD was introduced five years ago

>> to allow a transition from the old vsyscall implementations to

>> the new method (which simplified internal accounting and made

>> timekeeping more precise).

>

> I'm sure it's completely obvious to everyone except me what needs to be

> done, but can you spell it out for me? Keeping in mind that I don't know

> anything about the time keeping code.


No. Apologies, I probably should have included something like this.

Basically long ago, timekeeping was handled (roughly) like:

clock_gettime():
    now = tk->clock->read()
    offset_ns = ((now - tk->cycle_last) * tk->clock->mult) >> tk->clock->shift;
    return timespec_add_ns(tk->xtime, offset_ns);

But since for error handling use sub-ns precision, combined with that
for update performance, we accumulate in fixed intervals, there are
situations where in the update, we could accumulate half of a
nanosecond into the base tk->xtime value and leaving half of a
nanosecond in the offset.   This caused the split nanosecond to be
truncated out by the math, causing 1ns discontinuities.

So to address this, we came up with sort of a hack, which when we
accumulate rounds up that partial nanosecond, and adds the amount we
rounded up to the error (which will cause the freq correction code to
slow the clock down slightly). This is the code that is now done in
the old_vsyscall_fixup() logic.

Unfortunately this fix (which generates up to a nanosecond of error
per tick) then made the freq correction code do more work and made it
more difficult to have a stable clock.

So we went for a more proper fix, which was to properly handle the
sub-nanosecond portion of the timekeeping throughout the logic, doing
the truncation last.

clock_gettime():
    now = tk->clock->read()
    ret.tv_sec = tk->xtime_sec;
    offset_sns = (now - tk->cycle_last) * tk->clock->mult;
    ret.tv_nsec = (offset_sns + tk->tkr_mono.xtime_nsec) >> tk->clock->shift;
    return ret;

So in the above, we now use the tk->tkr_mono.xtime_nsec (which despite
its unfortunate name, stores the accumulated shifted nanoseconds), and
add it to the (current_cycle_delta * clock->mult), then we do the
shift last to preserve as much precision as we can.

Unfortunately we need all the reader code to do the same, which wasn't
easy to transition in some cases. So we provided the
CONFIG_GENERIC_TIME_VSYSCALL_OLD option to preserve the old round-up
behavior while arch maintainers could make the transition.


>> However, PPC and IA64 have yet to make the transition, despite

>> in some cases me sending test patches to try to help it along.

>>

>> http://patches.linaro.org/patch/30501/

>> http://patches.linaro.org/patch/35412/

>

> I've never seen a PPC patch, did you send one?


Yea. The PPC patch I never felt comfortable enough with to send.


>> If its helpful, my last pass at the patches can be found here:

>> https://git.linaro.org/people/john.stultz/linux.git dev/oldvsyscall-cleanup

>

> Looks like it's just a draft for PPC. Do you think that should work and

> it just needs testing? The comment about the vdso being "slightly

> behind" is a little concerning.


So, long ago I talked w/ Paul Mackerras about the ppc vdso code, as
ppc has some other legacy "userspace time" code that has to be
maintained as well (I believe there's not code page, just data page
that userspace pulls directly from). So for that case, we have the
problem where we can't do this sub-ns accounting, so the hack there is
rather then rounding-up and adding to ntp_error in the accumulation
code (which then causes the mult to slow), we're basically doing it
in the reader, slowing down mult by one. This will cause userspace
time to have "steps" where after an accumulation time jumps forward a
bit, but avoids the possibility of a discontinuity where time jumps
backwards.

But again, I don't want to pretend I'm not an expert on the ppc side.
This draft patch doesn't even touch the __kernel_clock_gettime()
implementations, and was trying to preserve the existing ppc logic
while transitioning the core code. Its likely that a better fix should
be done deeper in the ppc side (likely splitting the legacy userspace
data formats out so any hacks only apply to them).

thanks
-john
Benjamin Herrenschmidt May 22, 2017, 8:45 p.m. UTC | #3
On Mon, 2017-05-22 at 12:06 -0700, John Stultz wrote:
> So, long ago I talked w/ Paul Mackerras about the ppc vdso code, as

> ppc has some other legacy "userspace time" code that has to be

> maintained as well (I believe there's not code page, just data page

> that userspace pulls directly from).


Hrm... the ppc VDSO has a code page, userspace just calls
__kernel_clock_*

I don't know if anything still uses the direct mapped values, we should
enquire internally.

Cheers,
Ben.
Michael Ellerman May 24, 2017, 1:31 p.m. UTC | #4
John Stultz <john.stultz@linaro.org> writes:
...
> So, long ago I talked w/ Paul Mackerras about the ppc vdso code, as

> ppc has some other legacy "userspace time" code that has to be

> maintained as well (I believe there's not code page, just data page

> that userspace pulls directly from). So for that case, we have the

> problem where we can't do this sub-ns accounting, so the hack there is

> rather then rounding-up and adding to ntp_error in the accumulation

> code (which then causes the mult to slow), we're basically doing it

> in the reader, slowing down mult by one. This will cause userspace

> time to have "steps" where after an accumulation time jumps forward a

> bit, but avoids the possibility of a discontinuity where time jumps

> backwards.

>

> But again, I don't want to pretend I'm not an expert on the ppc side.

> This draft patch doesn't even touch the __kernel_clock_gettime()

> implementations, and was trying to preserve the existing ppc logic

> while transitioning the core code. Its likely that a better fix should

> be done deeper in the ppc side (likely splitting the legacy userspace

> data formats out so any hacks only apply to them).


Thanks for all the detail, that's very helpful.

We do export a bunch of values directly to userspace, so we'll have to
come up with some way of making those continue to work, while updating
things enough to unblock the generic code.

Hopefully I can get Paul interested enough to look at it :)

cheers
Paul Mackerras May 25, 2017, 12:03 p.m. UTC | #5
On Mon, May 22, 2017 at 12:06:04PM -0700, John Stultz wrote:
> 

> Basically long ago, timekeeping was handled (roughly) like:

> 

> clock_gettime():

>     now = tk->clock->read()

>     offset_ns = ((now - tk->cycle_last) * tk->clock->mult) >> tk->clock->shift;

>     return timespec_add_ns(tk->xtime, offset_ns);

> 

> But since for error handling use sub-ns precision, combined with that

> for update performance, we accumulate in fixed intervals, there are

> situations where in the update, we could accumulate half of a

> nanosecond into the base tk->xtime value and leaving half of a

> nanosecond in the offset.   This caused the split nanosecond to be

> truncated out by the math, causing 1ns discontinuities.

> 

> So to address this, we came up with sort of a hack, which when we

> accumulate rounds up that partial nanosecond, and adds the amount we

> rounded up to the error (which will cause the freq correction code to

> slow the clock down slightly). This is the code that is now done in

> the old_vsyscall_fixup() logic.

> 

> Unfortunately this fix (which generates up to a nanosecond of error

> per tick) then made the freq correction code do more work and made it

> more difficult to have a stable clock.

> 

> So we went for a more proper fix, which was to properly handle the

> sub-nanosecond portion of the timekeeping throughout the logic, doing

> the truncation last.

> 

> clock_gettime():

>     now = tk->clock->read()

>     ret.tv_sec = tk->xtime_sec;

>     offset_sns = (now - tk->cycle_last) * tk->clock->mult;

>     ret.tv_nsec = (offset_sns + tk->tkr_mono.xtime_nsec) >> tk->clock->shift;

>     return ret;

> 

> So in the above, we now use the tk->tkr_mono.xtime_nsec (which despite

> its unfortunate name, stores the accumulated shifted nanoseconds), and

> add it to the (current_cycle_delta * clock->mult), then we do the

> shift last to preserve as much precision as we can.

> 

> Unfortunately we need all the reader code to do the same, which wasn't

> easy to transition in some cases. So we provided the

> CONFIG_GENERIC_TIME_VSYSCALL_OLD option to preserve the old round-up

> behavior while arch maintainers could make the transition.


The VDSO code on PPC computes the offset in units of 2^-32 seconds,
not nanoseconds, because that makes it easy to handle the split of the
offset into whole seconds and fractional seconds (which is handled in
the generic code by the slightly icky __iter_div_u64_rem function),
and also means that we can use PPC's instruction that computes
(a * b) >> 32 to convert the fractional part to either nanoseconds or
microseconds without doing a division.

I could pretty easily change the computations done at update_vsyscall
time to convert the tk->tkr_mono.xtime_nsec value to units of 2^-32
seconds for use by the VDSO.  That would mean we would no longer need
CONFIG_GENERIC_TIME_VSYSCALL_OLD, and would give us values returned by
the VDSO gettimeofday() and clock_gettime() that should be within
about 1/4 ns of what the generic code in the kernel would give (on
average, I mean, given that the results have at best nanosecond
resolution).  Since that corresponds to about 1 CPU clock cycle, it
seems like it should be good enough.

Alternatively I could make the VDSO computations use a smaller unit
(maybe 2^-36 or 2^-40 seconds), or else rewrite them to use exactly
the same algorithm as the generic code - which would be a bigger
change, and would mean having to do an iterative division.

So, do you think the 1/4 ns resolution is good enough for the VDSO
computations?

Paul.
John Stultz May 27, 2017, 3:58 a.m. UTC | #6
On Thu, May 25, 2017 at 5:03 AM, Paul Mackerras <paulus@ozlabs.org> wrote:
> On Mon, May 22, 2017 at 12:06:04PM -0700, John Stultz wrote:

>>

>> Basically long ago, timekeeping was handled (roughly) like:

>>

>> clock_gettime():

>>     now = tk->clock->read()

>>     offset_ns = ((now - tk->cycle_last) * tk->clock->mult) >> tk->clock->shift;

>>     return timespec_add_ns(tk->xtime, offset_ns);

>>

>> But since for error handling use sub-ns precision, combined with that

>> for update performance, we accumulate in fixed intervals, there are

>> situations where in the update, we could accumulate half of a

>> nanosecond into the base tk->xtime value and leaving half of a

>> nanosecond in the offset.   This caused the split nanosecond to be

>> truncated out by the math, causing 1ns discontinuities.

>>

>> So to address this, we came up with sort of a hack, which when we

>> accumulate rounds up that partial nanosecond, and adds the amount we

>> rounded up to the error (which will cause the freq correction code to

>> slow the clock down slightly). This is the code that is now done in

>> the old_vsyscall_fixup() logic.

>>

>> Unfortunately this fix (which generates up to a nanosecond of error

>> per tick) then made the freq correction code do more work and made it

>> more difficult to have a stable clock.

>>

>> So we went for a more proper fix, which was to properly handle the

>> sub-nanosecond portion of the timekeeping throughout the logic, doing

>> the truncation last.

>>

>> clock_gettime():

>>     now = tk->clock->read()

>>     ret.tv_sec = tk->xtime_sec;

>>     offset_sns = (now - tk->cycle_last) * tk->clock->mult;

>>     ret.tv_nsec = (offset_sns + tk->tkr_mono.xtime_nsec) >> tk->clock->shift;

>>     return ret;

>>

>> So in the above, we now use the tk->tkr_mono.xtime_nsec (which despite

>> its unfortunate name, stores the accumulated shifted nanoseconds), and

>> add it to the (current_cycle_delta * clock->mult), then we do the

>> shift last to preserve as much precision as we can.

>>

>> Unfortunately we need all the reader code to do the same, which wasn't

>> easy to transition in some cases. So we provided the

>> CONFIG_GENERIC_TIME_VSYSCALL_OLD option to preserve the old round-up

>> behavior while arch maintainers could make the transition.

>

> The VDSO code on PPC computes the offset in units of 2^-32 seconds,

> not nanoseconds, because that makes it easy to handle the split of the

> offset into whole seconds and fractional seconds (which is handled in

> the generic code by the slightly icky __iter_div_u64_rem function),

> and also means that we can use PPC's instruction that computes

> (a * b) >> 32 to convert the fractional part to either nanoseconds or

> microseconds without doing a division.

>

> I could pretty easily change the computations done at update_vsyscall

> time to convert the tk->tkr_mono.xtime_nsec value to units of 2^-32

> seconds for use by the VDSO.  That would mean we would no longer need

> CONFIG_GENERIC_TIME_VSYSCALL_OLD, and would give us values returned by

> the VDSO gettimeofday() and clock_gettime() that should be within

> about 1/4 ns of what the generic code in the kernel would give (on

> average, I mean, given that the results have at best nanosecond

> resolution).  Since that corresponds to about 1 CPU clock cycle, it

> seems like it should be good enough.

>

> Alternatively I could make the VDSO computations use a smaller unit

> (maybe 2^-36 or 2^-40 seconds), or else rewrite them to use exactly

> the same algorithm as the generic code - which would be a bigger

> change, and would mean having to do an iterative division.

>

> So, do you think the 1/4 ns resolution is good enough for the VDSO

> computations?


Sorry not to have gotten back to you yesterday on this!  So yea, the
above sounds reasonable to me. We only return ns resolution to
userspace, so I don't believe one would be able to make a normal
syscall that calculates time and then make a gettime call within a
1/4th of a nanosecond.

But, I can't say I've ever really groked the ppc logic, so take my
opinion with a grain of salt. :)

thanks
-john
diff mbox

Patch

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 9652bc5..8c1523d 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -488,6 +488,7 @@  static void halt_fast_timekeeper(struct timekeeper *tk)
 }
 
 #ifdef CONFIG_GENERIC_TIME_VSYSCALL_OLD
+#warning Please contact your maintainers, as GENERIC_TIME_VSYSCALL_OLD compatibity will disappear soon.
 
 static inline void update_vsyscall(struct timekeeper *tk)
 {