Message ID | 1432931068-4980-5-git-send-email-john.stultz@linaro.org |
---|---|
State | New |
Headers | show |
On Tue, Jun 2, 2015 at 2:21 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, Jun 02, 2015 at 11:01:54AM +0200, Ingo Molnar wrote: >> > Currently, leapsecond adjustments are done at tick time. >> > >> > As a result, the leapsecond was applied at the first timer >> > tick *after* the leapsecond (~1-10ms late depending on HZ), >> > rather then exactly on the second edge. > > So why not program a hrtimer to expire at the exact right time? Then > even the VDSO gets its time 'right' and you avoid touching all the > fast paths. Well, the hrtimer won't always expire at the exact right time (due to irq latency, possible clockevent drift/error, etc) , so that would close the window a bit, but wouldn't eliminate the issue. thanks -john -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Tue, Jun 2, 2015 at 2:01 AM, Ingo Molnar <mingo@kernel.org> wrote: > > * John Stultz <john.stultz@linaro.org> wrote: > >> Currently, leapsecond adjustments are done at tick time. >> >> As a result, the leapsecond was applied at the first timer >> tick *after* the leapsecond (~1-10ms late depending on HZ), >> rather then exactly on the second edge. >> >> This was in part historical from back when we were always >> tick based, but correcting this since has been avoided since >> it adds extra conditional checks in the gettime fastpath, >> which has performance overhead. >> >> However, it was recently pointed out that ABS_TIME >> CLOCK_REALTIME timers set for right after the leapsecond >> could fire a second early, since some timers may be expired >> before we trigger the timekeeping timer, which then applies >> the leapsecond. >> >> This isn't quite as bad as it sounds, since behaviorally >> it is similar to what is possible w/ ntpd made leapsecond >> adjustments done w/o using the kernel discipline. Where >> due to latencies, timers may fire just prior to the >> settimeofday call. (Also, one should note that all >> applications using CLOCK_REALTIME timers should always be >> careful, since they are prone to quirks from settimeofday() >> disturbances.) >> >> However, the purpose of having the kernel do the leap adjustment >> is to avoid such latencies, so I think this is worth fixing. >> >> So in order to properly keep those timers from firing a second >> early, this patch modifies the gettime accessors to do the >> extra checks to apply the leapsecond adjustment on the second >> edge. This prevents the timer core from expiring timers too >> early. >> >> This patch does not handle VDSO time implementations, so >> userspace using vdso gettime will still see the leapsecond >> applied at the first timer tick after the leapsecond. >> This is a bit of a tradeoff, since the performance impact >> would be greatest to VDSO implementations, and since vdso >> interfaces don't provide the TIME_OOP flag, one can't >> distinquish the leapsecond from a time discontinuity (such >> as settimeofday), so correcting the VDSO may not be as >> important there. >> >> Apologies to Richard Cochran, who pushed for such a change >> years ago, which I resisted due to the concerns about the >> performance overhead. >> >> While I suspect this isn't extremely critical, folks who >> care about strict leap-second correctness will likely >> want to watch this, and it will likely be a -stable candidate. >> >> Cc: Prarit Bhargava <prarit@redhat.com> >> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> >> Cc: Richard Cochran <richardcochran@gmail.com> >> Cc: Jan Kara <jack@suse.cz> >> Cc: Jiri Bohac <jbohac@suse.cz> >> Cc: Thomas Gleixner <tglx@linutronix.de> >> Cc: Ingo Molnar <mingo@redhat.com> >> Cc: Shuah Khan <shuahkh@osg.samsung.com> >> Originally-suggested-by: Richard Cochran <richardcochran@gmail.com> >> Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com> >> Reported-by: Prarit Bhargava <prarit@redhat.com> >> Signed-off-by: John Stultz <john.stultz@linaro.org> >> --- >> include/linux/time64.h | 1 + >> include/linux/timekeeper_internal.h | 7 +++ >> kernel/time/ntp.c | 73 +++++++++++++++++++++++++--- >> kernel/time/ntp_internal.h | 1 + >> kernel/time/timekeeping.c | 97 ++++++++++++++++++++++++++++++++----- >> 5 files changed, 159 insertions(+), 20 deletions(-) > > So I don't like the complexity of this at all: why do we add over 100 lines of > code for something that occurs (literally) once in a blue moon? So yea. I very much felt the same way before the early timer expiration issue came up. > ... and for that reason I'm not surprised at all that it broke in non-obvious > ways. > > Instead of having these super rare special events, how about implementing leap > second smearing instead? That's far less radical and a lot easier to test as well, > as it's a continuous mechanism. It will also confuse user-space a lot less, > because there are no sudden time jumps. So yea. Leap smearing/slewing is an attractive solution. The first issue is that there's no standard yet for the range of time that the slew occurs (or even if the slew is linear or a curve). The second is I don't think we can actually get away from supporting UTC w/ leap, as applications may depend on precision. Also things like NTP sync w/ mixed systems would be problematic, as NTPd and others need to become savvy of which mode they are working with. The leap smearing method of only doing it in private networks and controlling it by the NTP server is becoming more widespread, but it has its own problems, since it doesn't handle CLOCK_TAI properly, and since CLOCK_REALTIME isn't yet frequency steerable separately from the other clockids, this method ends up slowing down CLOCK_TAI and CLOCK_MONOTONIC as well. I'd like to try to get something working in the kernel so we could support CLOCK_UTC and CLOCK_UTCSLS (smeared-leap-second) clockids, then allow applications that care to migrate explicitly to the one they care about. Possibly allowing CLOCK_REALTIME to be compile-time directed to CLOCK_UTCSLS so that most applications that don't care can just ignore it. But finding time to do this has been hard (if anyone is interested in working on it, I'd be excited to hear!). But if you think this patch is complicated, creating a new separately steered clockid is not going to be trvial (as there will be lots of ugly edge cases, like what if a leap second is cancelled mid-way through the slewing adjustment, etc). > Secondly, why is there a directional flag? I thought leap seconds can only be > inserted. A leap delete isn't likely to occur, but its supported by the adjtimex interface. And given the irregularity of the earths rotation, I'm not sure I'd rule it out completely. > So all in one, the leap second code is fragile and complex - lets re-think the > whole topic instead of complicating it even more ... So the core complexity with this patch is that we're basically having to do state-machine transitions in a read-only path (since the reads may happen before the update path runs). Since there's a number of read-paths, there's some duplication, and in some cases variance if the read path exports more state (ie: adjtimex). I do agree that the complexity of the time subsystem is getting hard to manage. I'm at the point where I think we need to avoid keeping duplicated timespec and ktime_t data (we can leave the ktime->timespec caching to the VDSOs). That will help cut down the read paths a bit, but will also simplify updates since we'll have less data to keep in sync. How we manage the ntp state also needs a rework, since the locking rules are getting too complex (bit me in an earlier version of this patch), and we're in effect duplicating some of that state in the timekeeper with this patch to handle the reads safely. But even assuming all those changes were already made, I think we'd still need something close to this patch. thanks -john -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Wed, Jun 3, 2015 at 2:04 AM, Ingo Molnar <mingo@kernel.org> wrote: > > * John Stultz <john.stultz@linaro.org> wrote: > >> > Instead of having these super rare special events, how about implementing leap >> > second smearing instead? That's far less radical and a lot easier to test as >> > well, as it's a continuous mechanism. It will also confuse user-space a lot >> > less, because there are no sudden time jumps. >> >> So yea. Leap smearing/slewing is an attractive solution. The first issue is that >> there's no standard yet for the range of time that the slew occurs (or even if >> the slew is linear or a curve). The second is I don't think we can actually get >> away from supporting UTC w/ leap, as applications may depend on precision. Also >> things like NTP sync w/ mixed systems would be problematic, as NTPd and others >> need to become savvy of which mode they are working with. > > Supporting it minimally is fine - supporting it with clearly unmaintainable > complexity is not. > > So as long as we offer good smearing of the leap second (with a configurable > parameter for how long the period should be), people in need of better leap > second handling can take that. > >> The leap smearing method of only doing it in private networks and controlling it >> by the NTP server is becoming more widespread, but it has its own problems, >> since it doesn't handle CLOCK_TAI properly, and since CLOCK_REALTIME isn't yet >> frequency steerable separately from the other clockids, this method ends up >> slowing down CLOCK_TAI and CLOCK_MONOTONIC as well. > > All real time clock derived clocks should smear in sync as well. Eeerrr.. So CLOCK_TAI is UTC without leapseconds, to smear TAI would be wrong. Similarly, CLOCK_MONOTONIC/BOOTTIME probably shouldn't be smeared either (but those are defined less strictly). >> I'd like to try to get something working in the kernel so we could support >> CLOCK_UTC and CLOCK_UTCSLS (smeared-leap-second) clockids, then allow >> applications that care to migrate explicitly to the one they care about. >> Possibly allowing CLOCK_REALTIME to be compile-time directed to CLOCK_UTCSLS so >> that most applications that don't care can just ignore it. But finding time to >> do this has been hard (if anyone is interested in working on it, I'd be excited >> to hear!). > > There should definitely be a Kconfig option to just map all relevant clocks to > smeared seconds. Hopefully this ends up being the standard in a few years and we > can pin down the exact parameters as well. > > Having separate clockids for mixed uses would be fine as well. Maybe. > >> But if you think this patch is complicated, creating a new separately steered >> clockid is not going to be trvial (as there will be lots of ugly edge cases, >> like what if a leap second is cancelled mid-way through the slewing adjustment, >> etc). > > Well, I think the main advantage of leap second smearing is that it's not a > binary, but a continuous interface, and so it's way easier to test than 'sudden' > leap second insertions. > > In fact we could essentially implement leap second smearing via the usual adjtimex > mechanisms: as far as the time code is concerned it does not matter why a gradual > adjustment occurs, only the rate of change and the method of convergence is an > open parameter. > > In fact I'd suggest we implement even original leap seconds by doing a high-rate > 'smearing' in the final X minutes leading up to the leap second, where 'X' could > be 1 by default. This way we could eliminate leap seconds as a separate logical > entity mostly. > > This should be far more gentle to applications as well than sudden jumps, and > timers will just work fine as well. Well, again the problem with high-rate smearing as you describe is that it would affect CLOCK_MONOTONIC as well, which could cause periodic timers used for sampling, etc (imagine recording audio, etc) to slow as well, possibly causing application problems. This is why the smeared leap-seconds are usually done across a day at a slow rate. To allow for CLOCK_REALTIME to be frequency adjusted separately from CLOCK_MONOTONIC/CLOCK_TAI, which would would have the least unwanted side-effects, we're probably going to have to manage it separately (like we do w/ MONOTONIC_RAW time). But again, this creates a lot more complexity. >> > Secondly, why is there a directional flag? I thought leap seconds can only be >> > inserted. >> >> A leap delete isn't likely to occur, but its supported by the adjtimex >> interface. And given the irregularity of the earths rotation, I'm not sure I'd >> rule it out completely. > > Well, the long term trend is clear and unambiguous: the rotation of Earth is > slowing down (the main component of which is losing angular momentum to the Moon), > hence the days are getting longer and we have to insert a leap second every second > year or so. > > The short term trends (discounting massive asteorid strikes, at which point leap > seconds will be the least of our problems) are somewhat chaotic: > > - glaciation (which shifts water mass assymetrically) > > - global warming (one component of which is thermal expansion, which expands > oceans assymetrically and shifts water mass - the other component is changing > climatology: different oceanic currents, etc. - which all shift mass around) > > - tectonics (slow rearrangement of mass plus earthquakes). > > - even slower scale rearrangement of mass (mantle plumes, etc.) > > but the long term trend still dominates. Look at this graph of measurements of the > Earth's rotation: > > http://en.wikipedia.org/wiki/File:Deviation_of_day_length_from_SI_day.svg > > See how the mean (the green line) was always above zero in the measured past. The > monotonically increasing nature comes from that. > > and given how many problems we had with leap second insertion, on millions of > installed systems, guess the likelihood of there being a leap second deleted? How > many OSs that can do leap second insertion are unable to do leap second deletion? > > Also note that leap second deletion means a jump in time backward. Daylight saving > time is already causing problems with that. Err.. Other way around. Leap-second deletion is a jump in time forward (jumping from 23:59:58 to 00:00:00, skipping 23:59:59). Which is simpler to deal with. And luckily (at least for us) daylight savings is done in userspace (as UTC, including leapseconds, ideally would be from the kernel providing TAI time). But yes, I agree that the leap deletion logic is likely to never run outside of testing. >> > So all in one, the leap second code is fragile and complex - lets re-think the >> > whole topic instead of complicating it even more ... >> >> So the core complexity with this patch is that we're basically having to do >> state-machine transitions in a read-only path (since the reads may happen before >> the update path runs). Since there's a number of read-paths, there's some >> duplication, and in some cases variance if the read path exports more state (ie: >> adjtimex). > > My fundamental observation is: the cost/benefit ratio is insanely high. I agree. In a perfect world, the kernel would export TAI not UTC, leaving the translation to UTC to userspace (take heed developers of new IoT OSes!). But the trouble is that historical posix/linux provides UTC (without a leapsecond representation, which is why we have to repeat a second). And as more folks (userspace developers, not really kernel developers) are caring about strict UTC correctness around the leapsecond, its hard to rationalize avoiding the complexity (since they don't really care, they just don't want to deal with anything unexpected in their application). > Interrupts are fundamentally jittery, there's no guarantee of their accuracy - you > yourself said that as a reply to PeterZ's suggestion to drive leap seconds via > hrtimers - and the motivation was to make interrupts arrive more accurately around > leap seconds. > > So why make the code more fragile, more complex, just to solve a scenario that > cannot really be done perfectly? So here I worry I didn't communicate clearly enough what the patch does. :( Its not about making interrupts more accurate around the leapsecond, its about applying the leapsecond transition in the read-path precisely at the leapsecond edge (rather then a short while later when the timer fires and we update the timekeeping structures). But more importantly, this change to the read path prevents timers that may be expired before update_wall_time timer runs (most likely on other cpus) from being expired early. Since the time read that is used by the hrtimer expiration logic is adjusted properly right on that edge. > Especially as second smearing appears to be the way superior future method of > handling leap seconds. > So here the problem is it depends on the user. For probably most users, who really don't care, the leap-smear is ideal behavior for CLOCK_REALTIME (I think leap-smears causing any change to other clockids would be surprising). However, there are some users who expect posix UTC leapsecond behavior. Either because they're positioning telescopes doing things that do depend on strict solar time, or because they are required (in some cases by law) to use UTC. I don't think we can just abandon/break those users, for leap-smearing. So I don't know if we can get away from that complexity. But maybe I'm not thinking "boldly" here. >> I do agree that the complexity of the time subsystem is getting hard to manage. > > That's rather an understatement. > >> I'm at the point where I think we need to avoid keeping duplicated timespec and >> ktime_t data (we can leave the ktime->timespec caching to the VDSOs). That will >> help cut down the read paths a bit, but will also simplify updates since we'll >> have less data to keep in sync. How we manage the ntp state also needs a >> rework, since the locking rules are getting too complex (bit me in an earlier >> version of this patch), and we're in effect duplicating some of that state in >> the timekeeper with this patch to handle the reads safely. > > Agreed. > >> But even assuming all those changes were already made, I think we'd still need >> something close to this patch. > > I disagree rather strongly. I do really appreciate the review and thoughts here, and respect and share your concern about complexity, but I'm not yet seeing a viable path forward with your proposals above. So additional ideas or clarifications would be welcome. So, I think with this push back, we're unlikely to have a solution that will be deploy-able by the leap second at the end of the month (though the issue was reported late enough that getting something merged/backported/deployed in mass wasn't super realistic). So we'll get to hear how much folks actually care about this issue. Since the leap is a discontinuity, and there is no way to set a ABS_TIME CLOCK_REALTIME timer for the 23:59:60 leap second, having a few very early timers targeted for the next second expire early on that repeated second is probably not a major issue in practice. thanks -john -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Wed, Jun 3, 2015 at 11:48 PM, Ingo Molnar <mingo@kernel.org> wrote: > * John Stultz <john.stultz@linaro.org> wrote: >> >> But more importantly, this change to the read path prevents timers that may be >> expired before update_wall_time timer runs (most likely on other cpus) from >> being expired early. Since the time read that is used by the hrtimer expiration >> logic is adjusted properly right on that edge. > > But with leap second smearing we'd have the same benefits and some more. I'm not sure how this follows. Leap smearing is a different behavior and I'd like to support it (as a separate clockid) as well, but adding that support doesn't change that the existing behavior applying the leapsecond to UTC/CLOCK_REALTIME via a timer results in behavior that isn't strictly correct. >> > Especially as second smearing appears to be the way superior future method of >> > handling leap seconds. >> >> So here the problem is it depends on the user. For probably most users, who >> really don't care, the leap-smear is ideal behavior for CLOCK_REALTIME (I think >> leap-smears causing any change to other clockids would be surprising). However, >> there are some users who expect posix UTC leapsecond behavior. Either because >> they're positioning telescopes doing things that do depend on strict solar time, >> or because they are required (in some cases by law) to use UTC. > > That usecase is perfectly OK: they can use the old leap second logic. Which again, the current leap second logic has slight behavioral issues the patch I'm proposing is trying to fix. > What I argue is that we should not add significant complexity to logic that is > fragile already as-is, and which runs at most only once per year. > >> I don't think we can just abandon/break those users, for leap-smearing. So I >> don't know if we can get away from that complexity. > > That's a false dichotomy - I'm not suggesting to 'abandon' them. > > I'm suggesting one of these options: > > - Keep the current leap second code as-is, because it's proven. Concentrate on > leap second smearing instead that is superior and which might emerge as a > future standard. I understand that you're saying "focus on the future", which is good advice. But if the objection is complexity, adding leap-smearing isn't going to reduce complexity. I'd also argue the current leap second code isn't quite "proven". Or at least it didn't prove itself well last time around. Those major issues have been addressed, but the strict correctness issue where the leap second is done at timer time, instead of the second edge is not currently addressed. > - or make leap second insertion much more obvious and easier to verify (i.e. not > a +100 LOC delta!) I can work on simplifying and compressing some of the state to try to reduce the line count and make it easier to verify. But again, applying the leap-second at the exact second edge means we have to mimic the state transition in the various read paths, which is going to still be a non-trivial patch. > - or make leap second handling a part of some other existing facility that is > used much more frequently: for example have you considered making it a > special, kernel-internal case of settimeofday()? If settimeofday was > implemented in a fashion to set the time at a given 'time edge', then > thousands of systems would test that facility day in and out. Leap second > insertion would simply be a special settimeofday() call that sets the time > back by one second at midnight June 30 or so. Normal settimeofday() would be a > call that sets the time back (or forward) at the next seconds edge right now - > but it would fundamentally use the same facility. > > and yes, this variant would necessarily complicate settimeofday(), but that's > OK, as it's used so frequently so we have adequate testing of it. I will have to think more about this, but initially it doesn't seem to make much sense to me (again, I still worry that the specifics of the issue aren't clear). settimeofday() calls happen immediately, and under the timekeeping write lock. The issues here with leap seconds is that we are supposed to make a second repeat exactly on the second boundary. Since nothing asynchronous can be reliably done at a specific time, the only way to have correct behavior is to handle the leap adjustment in the read path, basically checking if we have crossed that edge, and if so, setting the time we return back by one second. For the adjtime case its more complicated because we also have to return other state data (leap-in-progress TIME_OOP flag, or after the leapsecond is over, the TIME_WAIT flag) which we have to adjust on that edge as well. The larger point of finding a way to ensure the code paths are tested well in an everyday fashion is a good one. I'm just not sure how to make that happen. (One could argue having ntp apply the leapsecond from userspace via settimeofday() rather then having the kernel do it would be more along this line of though, but again the problem there is userspace can't ensure something happens at a specific time, so it has the same issues of being "late" in applying the leap adjustment that the kernel currently has). > A third facility would effectiely be merged with this as well: when NTP > adjusts large offsets (when running with -g, etc.) it will use settimeofday(), > right? So.. it depends, a method has been added to adjtimex to allow clock syncing applications to inject time offsets, rather then trying to set the time absolutely to correct for error (since the calculation of setting the time absolutely is inherently racy). But again, those mechanisms happen immediately and don't have the same behavioral expectation that leap seconds have. > I don't think we've exhausted all of these options. So again, lots of good things to think about, but I'm still not seeing too much in the way of actionable suggestions yet. The leap-smear work is interesting, but also I see it as independent to the issue brought up by Prarit. In the mean-time, to try to address the complexity concern of this issue, I am looking at trying to simplify some of the NTP state handling, per Richard's comment, as well as moving that global state into the timekeeper so we don't have to copy it over so often. This will hopefully simplify my proposed change a bit (although it will make it hell for folks trying to backport a fix). Further changes to avoid duplicate state in the timekeeping structure (as well as duplicate accessors) will further reduce the footprint of the change (though I could see Thomas or you objecting to those on performance grounds). I can see about also removing support for leap-second removals (STA_DEL) in the core code, which might reduce the proposed patch by a few lines. I'm hoping after that it will appear simple enough. :) Thanks again for the review and thoughts! -john -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Fri, Jun 5, 2015 at 12:29 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Jun 04, 2015 at 05:08:11PM -0700, John Stultz wrote: >> I'm not sure how this follows. Leap smearing is a different behavior >> and I'd like to support it (as a separate clockid) as well, but adding >> that support doesn't change that the existing behavior applying the >> leapsecond to UTC/CLOCK_REALTIME via a timer results in behavior that >> isn't strictly correct. > > So the big problem of course is that your patches do not handle the VDSO > time read, and that is the biggest source of userspace time. > > So userspace will still see incorrect time, only in-kernel users (timers > being your prime example) get the leap second at the 'right' place. So yes. And as I mentioned in the patch, this is somewhat of a tradeoff, since adding extra conditionals in the highly optimized vdso gettime is probably more controversial. Further, since only adjtimex() provides both the time and the leap-state, its the only interface which can differentiate between 23:59:59 and 23:59:60, I'm not as sure leap-insertion on exactly the leap edge is as critical for users not using adjtimex. Fixing the issue in the vdso would likely be a following discussion (which I think is harder to weigh). But this patch ensures the kernel's internal state is correct, so we don't fire timers early. > Also note that your argument that timers will now get the correct time > is subject to the very same timer interrupt jitter as driving the leap > second stuff from an hrtimer itself -- they're all timers. So. The key here is that timers set for *after* the leapsecond, will not fire before. That's the problem. Timers aren't supposed to fire early, and that's the invariant we're currently not following. So if you set a timer for midnight after the leapsecond, it may expire ~a second early. Now, since there's no real representation of the leap-second, you can't set timers for that second, at least until the leap-second has been applied. For timers that are set just prior to the leap, its possible they could think they were expired early, since they may be delayed and not run until after the leapsecond is applied. So simple gettime calls might see the time as "before" the expiration time. However, if they use adjtimex they can see they're in the 23:59:59 w/ TIME_OOP, and things were correct. > That leaves the question; for who is this exact second edge important? I don't have specific cases in mind. Again, I argued against this sort of change when Richard suggested it earlier. The rational that its a discontinuity, and it shouldn't matter if that discontinuity is slightly late, since most folks aren't using adjtimex() and carefully noting the time state flags to see if a leap is in progress. However, when Prarit brought up the early timer expiration issue, it is harder for me to rationalize ignoring. Timers should not fire early. thanks -john -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Fri, Jun 5, 2015 at 7:12 AM, Richard Cochran <richardcochran@gmail.com> wrote: > On Fri, Jun 05, 2015 at 11:10:08AM +0200, Peter Zijlstra wrote: >> Firstly I would strongly suggest such applications not use UTC because >> of this, I think TAI was invented for just this reason. > > So I wonder whether the bug in the original post affects TAI timers as > well... No, I don't believe so. The TAI timeline doesn't have a discontinuity(ie: each second is properly represented over a leapsecond), so it wouldn't have the same issue with the specifics of when that discontinuity is applied. Our TAI adjustment is done atomically with the leap adjustment, so late or not it shouldn't have problematic behavior. thanks -john -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Sat, Jun 6, 2015 at 2:44 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > On Fri, 5 Jun 2015, Peter Zijlstra wrote: > >> On Fri, 2015-06-05 at 11:04 +0200, Richard Cochran wrote: >> > On Fri, Jun 05, 2015 at 09:29:13AM +0200, Peter Zijlstra wrote: >> > > That leaves the question; for who is this exact second edge important? >> > >> > Distributed applications using the UTC time scale. >> > >> > Many control applications are done with a 1 millisecond period. >> > Having the time wrong by a second for 10 or 100 loops is bad news. >> >> Firstly I would strongly suggest such applications not use UTC because >> of this, I think TAI was invented for just this reason. >> >> Secondly, how would John's patches help with this? Usespace loops >> reading time would be using the VDSO and would still not get the right >> time, and timers would be subject to the same IRQ latency that a hrtimer >> based leap second insert would, and would still very much not be in-sync >> across the cluster. > > So the only thing which is fixed are in kernel users and therefor > hrtimers. Well, for vdso systems, hrtimers and adjtimex (which is the only interface that provides enough information to understand where in a leapsecond you actually are). And again, vdsos are fixable, but I hesitated due to my concerns about the extra performance overhead, the smaller benefit it provides relative to not having timers expiring early. > That means the whole leap mess added into the gettime fast path is > just constant overhead for that corner case. > > We can be smarter than that and just block hrtimer delivery for clock > realtime timers across the leap edge. There should be ways to do that > non intrusive if we think hard enough about it. This approach seems like it would add more work to the timer-add function (to check leapstate and adjust the expiration), which might be a heavier use case (we adjust for each timer) then the similar logic done in the update_base_offsets_now() at hrtimer_interrupt time (adjust for each hrtimer_interrupt). Now, It could be possible to do a lighter weight version of my patch, which just does the adjustment only for the hrtimer_interrupt code (leaving the rest of the read paths alone). If that is something you'd prefer. I'll have to think a bit to ensure the internal inconsistency isn't otherwise problematic. Though I suspect fixing adjtimex is worth it as well, since its really the only interface that can provide a sane view of the leapsecond, and isn't normally considered a hot path. thanks -john -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
diff --git a/include/linux/time64.h b/include/linux/time64.h index a383147..ff46e87 100644 --- a/include/linux/time64.h +++ b/include/linux/time64.h @@ -28,6 +28,7 @@ struct timespec64 { #define FSEC_PER_SEC 1000000000000000LL /* Located here for timespec[64]_valid_strict */ +#define TIME64_MAX ((s64)~((u64)1 << 63)) #define KTIME_MAX ((s64)~((u64)1 << 63)) #define KTIME_SEC_MAX (KTIME_MAX / NSEC_PER_SEC) diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h index fb86963..78980ea 100644 --- a/include/linux/timekeeper_internal.h +++ b/include/linux/timekeeper_internal.h @@ -60,6 +60,9 @@ struct tk_read_base { * shifted nano seconds. * @ntp_error_shift: Shift conversion between clock shifted nano seconds and * ntp shifted nano seconds. + * @next_leap_sec: Second value of the next leap sec (or TIME64_MAX) + * @next_leap_ktime: ktime_t value of the next leap sec (or KTIME_MAX) + * @leap_direction: Direction of pending leap adjustment * * Note: For timespec(64) based interfaces wall_to_monotonic is what * we need to add to xtime (or xtime corrected for sub jiffie times) @@ -104,6 +107,10 @@ struct timekeeper { s64 ntp_error; u32 ntp_error_shift; u32 ntp_err_mult; + /* Leapsecond status */ + time64_t next_leap_sec; + ktime_t next_leap_ktime; + int leap_direction; }; #ifdef CONFIG_GENERIC_TIME_VSYSCALL diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c index 472591e..6e15fbb 100644 --- a/kernel/time/ntp.c +++ b/kernel/time/ntp.c @@ -35,6 +35,7 @@ unsigned long tick_nsec; static u64 tick_length; static u64 tick_length_base; +#define SECS_PER_DAY 86400 #define MAX_TICKADJ 500LL /* usecs */ #define MAX_TICKADJ_SCALED \ (((MAX_TICKADJ * NSEC_PER_USEC) << NTP_SCALE_SHIFT) / NTP_INTERVAL_FREQ) @@ -76,6 +77,9 @@ static long time_adjust; /* constant (boot-param configurable) NTP tick adjustment (upscaled) */ static s64 ntp_tick_adj; +/* second value of the next pending leapsecond, or TIME64_MAX if no leap */ +static time64_t ntp_next_leap_sec = TIME64_MAX; + #ifdef CONFIG_NTP_PPS /* @@ -349,6 +353,7 @@ void ntp_clear(void) tick_length = tick_length_base; time_offset = 0; + ntp_next_leap_sec = TIME64_MAX; /* Clear PPS state variables */ pps_clear(); } @@ -359,6 +364,33 @@ u64 ntp_tick_length(void) return tick_length; } +/** + * get_leap_state - Returns the NTP leap state + * @next_leap_sec: Next leapsecond in time64_t + * @next_leap_ktime: Next leapsecond in ktime_t + * + * Provides NTP leapsecond state. Returns direction + * of the leapsecond adjustment as an integer. + */ +int get_leap_state(time64_t *next_leap_sec, ktime_t *next_leap_ktime) +{ + int dir; + + if ((time_state == TIME_INS) && (time_status & STA_INS)) { + dir = -1; + *next_leap_sec = ntp_next_leap_sec; + *next_leap_ktime = ktime_set(ntp_next_leap_sec, 0); + } else if ((time_state == TIME_DEL) && (time_status & STA_DEL)) { + dir = 1; + *next_leap_sec = ntp_next_leap_sec; + *next_leap_ktime = ktime_set(ntp_next_leap_sec, 0); + } else { + dir = 0; + *next_leap_sec = TIME64_MAX; + next_leap_ktime->tv64 = KTIME_MAX; + } + return dir; +} /* * this routine handles the overflow of the microsecond field @@ -382,15 +414,21 @@ int second_overflow(unsigned long secs) */ switch (time_state) { case TIME_OK: - if (time_status & STA_INS) + if (time_status & STA_INS) { time_state = TIME_INS; - else if (time_status & STA_DEL) + ntp_next_leap_sec = secs + SECS_PER_DAY - + (secs % SECS_PER_DAY); + } else if (time_status & STA_DEL) { time_state = TIME_DEL; + ntp_next_leap_sec = secs + SECS_PER_DAY - + ((secs+1) % SECS_PER_DAY); + } break; case TIME_INS: - if (!(time_status & STA_INS)) + if (!(time_status & STA_INS)) { + ntp_next_leap_sec = TIME64_MAX; time_state = TIME_OK; - else if (secs % 86400 == 0) { + } else if (secs % SECS_PER_DAY == 0) { leap = -1; time_state = TIME_OOP; printk_deferred(KERN_NOTICE @@ -398,19 +436,21 @@ int second_overflow(unsigned long secs) } break; case TIME_DEL: - if (!(time_status & STA_DEL)) + if (!(time_status & STA_DEL)) { + ntp_next_leap_sec = TIME64_MAX; time_state = TIME_OK; - else if ((secs + 1) % 86400 == 0) { + } else if ((secs + 1) % SECS_PER_DAY == 0) { leap = 1; + ntp_next_leap_sec = TIME64_MAX; time_state = TIME_WAIT; printk_deferred(KERN_NOTICE "Clock: deleting leap second 23:59:59 UTC\n"); } break; case TIME_OOP: + ntp_next_leap_sec = TIME64_MAX; time_state = TIME_WAIT; break; - case TIME_WAIT: if (!(time_status & (STA_INS | STA_DEL))) time_state = TIME_OK; @@ -547,6 +587,7 @@ static inline void process_adj_status(struct timex *txc, struct timespec64 *ts) if ((time_status & STA_PLL) && !(txc->status & STA_PLL)) { time_state = TIME_OK; time_status = STA_UNSYNC; + ntp_next_leap_sec = TIME64_MAX; /* restart PPS frequency calibration */ pps_reset_freq_interval(); } @@ -711,6 +752,24 @@ int __do_adjtimex(struct timex *txc, struct timespec64 *ts, s32 *time_tai) if (!(time_status & STA_NANO)) txc->time.tv_usec /= NSEC_PER_USEC; + /* Handle leapsec adjustments */ + if (unlikely(ts->tv_sec >= ntp_next_leap_sec)) { + if ((time_state == TIME_INS) && (time_status & STA_INS)) { + result = TIME_OOP; + txc->tai++; + txc->time.tv_sec--; + } + if ((time_state == TIME_DEL) && (time_status & STA_DEL)) { + result = TIME_WAIT; + txc->tai--; + txc->time.tv_sec++; + } + if ((time_state == TIME_OOP) && + (ts->tv_sec == ntp_next_leap_sec)) { + result = TIME_WAIT; + } + } + return result; } diff --git a/kernel/time/ntp_internal.h b/kernel/time/ntp_internal.h index bbd102a..cd831b6 100644 --- a/kernel/time/ntp_internal.h +++ b/kernel/time/ntp_internal.h @@ -5,6 +5,7 @@ extern void ntp_init(void); extern void ntp_clear(void); /* Returns how long ticks are at present, in ns / 2^NTP_SCALE_SHIFT. */ extern u64 ntp_tick_length(void); +extern int get_leap_state(time64_t *next_leap_sec, ktime_t *next_leap_ktime); extern int second_overflow(unsigned long secs); extern int ntp_validate_timex(struct timex *); extern int __do_adjtimex(struct timex *, struct timespec64 *, s32 *); diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index 946acb7..9313190 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -591,6 +591,10 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action) ntp_clear(); } + /* Capture leapsecond state */ + tk->leap_direction = get_leap_state(&tk->next_leap_sec, + &tk->next_leap_ktime); + tk_update_ktime_data(tk); update_vsyscall(tk); @@ -634,6 +638,24 @@ static void timekeeping_forward_now(struct timekeeper *tk) } /** + * __getnstimeofday64_preleap - Returns the time of day in a timespec64, + * @tk: pointer to the timekeeper structure to use + * @ts: pointer to the timespec to be set + * + * Internal function. Does not take lock. Updates the time of day in the + * timespec, WITHOUT the leapsecond edge adjustment. + */ +static void __getnstimeofday64_preleap(struct timekeeper *tk, struct timespec64 *ts) +{ + s64 nsecs; + + ts->tv_sec = tk->xtime_sec; + ts->tv_nsec = 0; + nsecs = timekeeping_get_ns(&tk->tkr_mono); + timespec64_add_ns(ts, nsecs); +} + +/** * __getnstimeofday64 - Returns the time of day in a timespec64. * @ts: pointer to the timespec to be set * @@ -643,20 +665,22 @@ static void timekeeping_forward_now(struct timekeeper *tk) int __getnstimeofday64(struct timespec64 *ts) { struct timekeeper *tk = &tk_core.timekeeper; + time64_t next_leap; + int dir; unsigned long seq; - s64 nsecs = 0; do { seq = read_seqcount_begin(&tk_core.seq); - ts->tv_sec = tk->xtime_sec; - nsecs = timekeeping_get_ns(&tk->tkr_mono); + __getnstimeofday64_preleap(tk, ts); + next_leap = tk->next_leap_sec; + dir = tk->leap_direction; } while (read_seqcount_retry(&tk_core.seq, seq)); - ts->tv_nsec = 0; - timespec64_add_ns(ts, nsecs); - + /* Apply leapsecond adjustment */ + if (unlikely(ts->tv_sec >= next_leap)) + ts->tv_sec += dir; /* * Do not bail out early, in case there were callers still using * the value, even in the face of the WARN_ON. @@ -710,6 +734,8 @@ ktime_t ktime_get_with_offset(enum tk_offsets offs) struct timekeeper *tk = &tk_core.timekeeper; unsigned int seq; ktime_t base, *offset = offsets[offs]; + ktime_t next_leap; + int dir; s64 nsecs; WARN_ON(timekeeping_suspended); @@ -718,11 +744,17 @@ ktime_t ktime_get_with_offset(enum tk_offsets offs) seq = read_seqcount_begin(&tk_core.seq); base = ktime_add(tk->tkr_mono.base, *offset); nsecs = timekeeping_get_ns(&tk->tkr_mono); + next_leap = tk->next_leap_ktime; + dir = tk->leap_direction; } while (read_seqcount_retry(&tk_core.seq, seq)); - return ktime_add_ns(base, nsecs); - + base = ktime_add_ns(base, nsecs); + /* apply leapsecond adjustment */ + if (offs == TK_OFFS_REAL) + if (unlikely(base.tv64 >= next_leap.tv64)) + base = ktime_add(base, ktime_set(dir, 0)); + return base; } EXPORT_SYMBOL_GPL(ktime_get_with_offset); @@ -733,15 +765,23 @@ EXPORT_SYMBOL_GPL(ktime_get_with_offset); */ ktime_t ktime_mono_to_any(ktime_t tmono, enum tk_offsets offs) { + struct timekeeper *tk = &tk_core.timekeeper; ktime_t *offset = offsets[offs]; unsigned long seq; - ktime_t tconv; + ktime_t tconv, next_leap; + int dir; do { seq = read_seqcount_begin(&tk_core.seq); tconv = ktime_add(tmono, *offset); + next_leap = tk->next_leap_ktime; + dir = tk->leap_direction; } while (read_seqcount_retry(&tk_core.seq, seq)); + /* apply leapsecond adjustment */ + if (offs == TK_OFFS_REAL) + if (unlikely(tconv.tv64 >= next_leap.tv64)) + tconv = ktime_add(tconv, ktime_set(dir, 0)); return tconv; } EXPORT_SYMBOL_GPL(ktime_mono_to_any); @@ -862,6 +902,8 @@ void getnstime_raw_and_real(struct timespec *ts_raw, struct timespec *ts_real) struct timekeeper *tk = &tk_core.timekeeper; unsigned long seq; s64 nsecs_raw, nsecs_real; + time64_t next_leap; + int dir; WARN_ON_ONCE(timekeeping_suspended); @@ -875,10 +917,17 @@ void getnstime_raw_and_real(struct timespec *ts_raw, struct timespec *ts_real) nsecs_raw = timekeeping_get_ns(&tk->tkr_raw); nsecs_real = timekeeping_get_ns(&tk->tkr_mono); + next_leap = tk->next_leap_sec; + dir = tk->leap_direction; + } while (read_seqcount_retry(&tk_core.seq, seq)); timespec_add_ns(ts_raw, nsecs_raw); timespec_add_ns(ts_real, nsecs_real); + + /* apply leapsecond adjustment */ + if (unlikely(ts_real->tv_sec >= next_leap)) + ts_real->tv_sec += dir; } EXPORT_SYMBOL(getnstime_raw_and_real); @@ -1252,6 +1301,10 @@ void __init timekeeping_init(void) set_normalized_timespec64(&tmp, -boot.tv_sec, -boot.tv_nsec); tk_set_wall_to_mono(tk, tmp); + /* Capture leapsecond state */ + tk->leap_direction = get_leap_state(&tk->next_leap_sec, + &tk->next_leap_ktime); + timekeeping_update(tk, TK_MIRROR); write_seqcount_end(&tk_core.seq); @@ -1422,6 +1475,10 @@ void timekeeping_resume(void) tk->tkr_mono.cycle_last = cycle_now; tk->tkr_raw.cycle_last = cycle_now; + /* Capture leapsecond state */ + tk->leap_direction = get_leap_state(&tk->next_leap_sec, + &tk->next_leap_ktime); + tk->ntp_error = 0; timekeeping_suspended = 0; timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET); @@ -1825,6 +1882,10 @@ void update_wall_time(void) */ clock_set |= accumulate_nsecs_to_secs(tk); + /* Capture leapsecond state */ + tk->leap_direction = get_leap_state(&tk->next_leap_sec, + &tk->next_leap_ktime); + write_seqcount_begin(&tk_core.seq); /* * Update the real timekeeper. @@ -1970,7 +2031,8 @@ ktime_t ktime_get_update_offsets_now(ktime_t *offs_real, ktime_t *offs_boot, { struct timekeeper *tk = &tk_core.timekeeper; unsigned int seq; - ktime_t base; + ktime_t base, next_leap; + int dir; u64 nsecs; do { @@ -1982,9 +2044,18 @@ ktime_t ktime_get_update_offsets_now(ktime_t *offs_real, ktime_t *offs_boot, *offs_real = tk->offs_real; *offs_boot = tk->offs_boot; *offs_tai = tk->offs_tai; + + next_leap = tk->next_leap_ktime; + dir = tk->leap_direction; } while (read_seqcount_retry(&tk_core.seq, seq)); - return ktime_add_ns(base, nsecs); + base = ktime_add_ns(base, nsecs); + + /* apply leapsecond adjustment */ + if (unlikely(ktime_add(base, *offs_real).tv64 >= next_leap.tv64)) + *offs_real = ktime_add(*offs_real, ktime_set(dir, 0)); + + return base; } #endif @@ -2015,11 +2086,11 @@ int do_adjtimex(struct timex *txc) return ret; } - getnstimeofday64(&ts); - raw_spin_lock_irqsave(&timekeeper_lock, flags); write_seqcount_begin(&tk_core.seq); + __getnstimeofday64_preleap(tk, &ts); + orig_tai = tai = tk->tai_offset; ret = __do_adjtimex(txc, &ts, &tai);
Currently, leapsecond adjustments are done at tick time. As a result, the leapsecond was applied at the first timer tick *after* the leapsecond (~1-10ms late depending on HZ), rather then exactly on the second edge. This was in part historical from back when we were always tick based, but correcting this since has been avoided since it adds extra conditional checks in the gettime fastpath, which has performance overhead. However, it was recently pointed out that ABS_TIME CLOCK_REALTIME timers set for right after the leapsecond could fire a second early, since some timers may be expired before we trigger the timekeeping timer, which then applies the leapsecond. This isn't quite as bad as it sounds, since behaviorally it is similar to what is possible w/ ntpd made leapsecond adjustments done w/o using the kernel discipline. Where due to latencies, timers may fire just prior to the settimeofday call. (Also, one should note that all applications using CLOCK_REALTIME timers should always be careful, since they are prone to quirks from settimeofday() disturbances.) However, the purpose of having the kernel do the leap adjustment is to avoid such latencies, so I think this is worth fixing. So in order to properly keep those timers from firing a second early, this patch modifies the gettime accessors to do the extra checks to apply the leapsecond adjustment on the second edge. This prevents the timer core from expiring timers too early. This patch does not handle VDSO time implementations, so userspace using vdso gettime will still see the leapsecond applied at the first timer tick after the leapsecond. This is a bit of a tradeoff, since the performance impact would be greatest to VDSO implementations, and since vdso interfaces don't provide the TIME_OOP flag, one can't distinquish the leapsecond from a time discontinuity (such as settimeofday), so correcting the VDSO may not be as important there. Apologies to Richard Cochran, who pushed for such a change years ago, which I resisted due to the concerns about the performance overhead. While I suspect this isn't extremely critical, folks who care about strict leap-second correctness will likely want to watch this, and it will likely be a -stable candidate. Cc: Prarit Bhargava <prarit@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Jiri Bohac <jbohac@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Shuah Khan <shuahkh@osg.samsung.com> Originally-suggested-by: Richard Cochran <richardcochran@gmail.com> Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com> Reported-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: John Stultz <john.stultz@linaro.org> --- include/linux/time64.h | 1 + include/linux/timekeeper_internal.h | 7 +++ kernel/time/ntp.c | 73 +++++++++++++++++++++++++--- kernel/time/ntp_internal.h | 1 + kernel/time/timekeeping.c | 97 ++++++++++++++++++++++++++++++++----- 5 files changed, 159 insertions(+), 20 deletions(-)