[BUG,bisect] hrtimer: severe lag after suspend & resume

Message ID CALAqxLUAXDHimaqA0mqUNcY0inOZGRf92SxSbX8dDMzJUBRvmQ@mail.gmail.com
State New
Headers show

Commit Message

John Stultz June 4, 2015, 10:54 p.m.
On Wed, Jun 3, 2015 at 5:56 PM, Jeremiah Mahler <jmmahler@gmail.com> wrote:
> all,
>
> After a fresh boot, the Chrome web browser behaves normally.  Pages
> load quickly and scroll fast.  Even image heavy sites such as
> images.google.com work fine.  However, after a suspend and resume
> cycle, Chrome becomes very slow.  Pages take ten seconds or more to
> load.  The scroll bars and buttons are almost completely
> unresponsive.  Interestingly, I can run Firefox on the same sites
> and it has no issue whatsoever.
>
> I have bisected the kernel and found that the following commit
> introduced the bug.  It is present in the latest linux-next (20150602).
>
>   From 868a3e915f7f5eba8f8cb4f7da2276760807c51c Mon Sep 17 00:00:00 2001
>   From: Thomas Gleixner <tglx@linutronix.de>
>   Date: Tue, 14 Apr 2015 21:08:37 +0000
>   Subject: [PATCH] hrtimer: Make offset update smarter
>
>   On every tick/hrtimer interrupt we update the offset variables of the
>   clock bases. That's silly because these offsets change very seldom.
>
>   Add a sequence counter to the time keeping code which keeps track of
>   the offset updates (clock_was_set()). Have a sequence cache in the
>   hrtimer cpu bases to evaluate whether the offsets must be updated or
>   not. This allows us later to avoid pointless cacheline pollution.
>
>   Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>   Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
>   Acked-by: Peter Zijlstra <peterz@infradead.org>
>   Cc: Viresh Kumar <viresh.kumar@linaro.org>
>   Cc: Marcelo Tosatti <mtosatti@redhat.com>
>   Cc: Frederic Weisbecker <fweisbec@gmail.com>
>   Cc: John Stultz <john.stultz@linaro.org>
>   Link: http://lkml.kernel.org/r/20150414203501.132820245@linutronix.de
>   Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>   Cc: John Stultz <john.stultz@linaro.org>
>   ---
>    include/linux/hrtimer.h             |  4 ++--
>    include/linux/timekeeper_internal.h |  2 ++
>    kernel/time/hrtimer.c               |  3 ++-
>    kernel/time/timekeeping.c           | 23 ++++++++++++++++-------
>    kernel/time/timekeeping.h           |  7 ++++---
>    5 files changed, 26 insertions(+), 13 deletions(-)


So I suspect the problem is the change to clock_was_set_seq in
timekeeping_update is done prior to mirroring the time state to the
shadow-timekeeper. Thus the next time we do update_wall_time() the
updated sequence is overwritten by whats in the shadow copy. The
attached patch moving the modification up seems to avoid the issue for
me.

Thomas:  Looking at the problematic change, I'm not a big fan of it.
Caching timekeeping state here in the hrtimer code has been a source
of bugs in the past, and I'm not sure I see how avoiding copying
24bytes is that big of a win. Especially since it adds more state to
the timekeeper and hrtimer base that we have to read and mange.
Personally I'd prefer a revert to my fix.

thanks
-john

Comments

John Stultz June 5, 2015, 6:52 p.m. | #1
On Fri, Jun 5, 2015 at 3:07 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> On Thu, 4 Jun 2015, John Stultz wrote:
>> > On Wed, Jun 3, 2015 at 5:56 PM, Jeremiah Mahler <jmmahler@gmail.com> wrote:
>> > So I suspect the problem is the change to clock_was_set_seq in
>> > timekeeping_update is done prior to mirroring the time state to the
>> > shadow-timekeeper. Thus the next time we do update_wall_time() the
>> > updated sequence is overwritten by whats in the shadow copy. The
>> > attached patch moving the modification up seems to avoid the issue for
>> > me.
>>
>> Duh, yes.
>>
>> > Thomas: Looking at the problematic change, I'm not a big fan of it. Caching
>> > timekeeping state here in the hrtimer code has been a source of bugs in the
>> > past, and I'm not sure I see how avoiding copying 24bytes is that big of a
>> > win. Especially since it adds more state to the timekeeper and hrtimer base
>> > that we have to read and mange.
>>
>> It's not about copying 24 bytes. It's about touching 3 cache lines for nothing.
>> In situations where we run high frequency periodic timers on clock monotonic and
>> nothing is going on in the other clock domains, which is a pretty common
>> situation, this is measurable in terms of cache utilization. [...]
>
> It's not just about 'touching': it's about _dirtying_ cachelines from a globally
> executed function (timekeeping), which is then accessed by per-CPU functionality
> (hrtimers).

Right, but part of that issue is that we're caching in the hrtimer cpu
bases data that *should not be cached*. That was the core issue that
caused the 2012 leapsecond issue, and I'd prefer to not reintroduce
it.

The offset data is only valid for the monotonic time its read for. So
dirtying three cache lines really is just due to the fact that the
data is stored in the cpu_base structure where I'd argue it doesn't
provide real value (other then convenience of indexing it cleanly).

Reading the offset data into three values from the stack would be fine
too, and (I think) would avoid dirtying much extra (we have to store
the now value anyway).

BTW: Thomas, what are you using to do measurements here? I hesitate to
argue in these sorts of performance discussions, since I really only
have a embarrassing theoretical understanding of the issues and
suspect myself a bit naive here. Additionally these sorts of
constraints aren't always clearly documented, so being able to measure
and compare would be helpful to ensure future changes don't impact
things here.

> That makes it far more expensive, it has similar scalability limiting effects as a
> global lock - while if we do it smart it can perform as essentially lockless code
> in most cases.

Another reason why I don't like this approach of caching the data is
that it also prevents fixing the leap-second adjustment to happen on
the second edge, because we have to have an async update to the
seqcounter in order to refresh the cached real_offset. Or we have to
also export more ntp state data so we can duplicate the adjustment to
the cached data in the hrtimer code, which is more of the complexity
you've objected to.

thanks
-john
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
John Stultz June 8, 2015, 5:37 p.m. | #2
On Mon, Jun 8, 2015 at 12:44 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 5 Jun 2015, John Stultz wrote:
>> On Fri, Jun 5, 2015 at 3:07 AM, Ingo Molnar <mingo@kernel.org> wrote:
>> > * Thomas Gleixner <tglx@linutronix.de> wrote:
>> >> It's not about copying 24 bytes. It's about touching 3 cache lines for nothing.
>> >> In situations where we run high frequency periodic timers on clock monotonic and
>> >> nothing is going on in the other clock domains, which is a pretty common
>> >> situation, this is measurable in terms of cache utilization. [...]
>> >
>> > It's not just about 'touching': it's about _dirtying_ cachelines from a globally
>> > executed function (timekeeping), which is then accessed by per-CPU functionality
>> > (hrtimers).
>>
>> Right, but part of that issue is that we're caching in the hrtimer cpu
>> bases data that *should not be cached*. That was the core issue that
>> caused the 2012 leapsecond issue, and I'd prefer to not reintroduce
>> it.
>>
>> The offset data is only valid for the monotonic time its read for. So
>> dirtying three cache lines really is just due to the fact that the
>> data is stored in the cpu_base structure where I'd argue it doesn't
>> provide real value (other then convenience of indexing it cleanly).
>>
>> Reading the offset data into three values from the stack would be fine
>> too, and (I think) would avoid dirtying much extra (we have to store
>> the now value anyway).
>
> Well, the problem is that we need to fetch that data on several
> occasions:
>
>         - hrtimer_start (if it is the first expiring timer of a clock)
>         - hrtimer_reprogram (after canceling the first timer)
>         - hrtimer_interrupt

hrtimer_interrupt is really the one I'm most concerned with, because
that's what does timer expiration. For that usage, having duplicated
time state has been repeatedly problematic.

For start and reprogram, I guess I'm not as concerned because bad
cache values there can only result in errors in programming the
clockevent hardware, which can only cause timers to firing late.


> So I really prefer to have cached values instead of a function
> call. And even if we do not cache stuff, there is no guarantee that we
> wont expire a timer too early:
>
> CPU0                                    CPU1
> hrtimer_interrupt()
>            ------------------------------------- leap second edge
>    get_time_and_offsets_uncached()
>                                         do_leap_second_adjustment()
>    expire_timers()
>
> So, if the do_leap_second_adjustment() happens a bit too late, then
> clock monotonic + offset_realtime will have advanced over the leap
> second and expire a timer and the sleeper will then observe that it is
> expired too early because the leap second adjustment finished before
> it returned to user space.

Well, the patch I proposed (which does leapsecond adjustments in the
read paths) avoids this because update_base_and_offsets_now() checks
to see if the monotonic base + offset would cross the leapsecond edge,
and corrects the offset appropriately.


>
> You do not even need two cpus for this. You can observe the same issue
> on a UP machine. Assume we have two hrtimers programmed to go off at
> the leap seconds edge:
>
>     1) a user space timer
>     2) the tick/leap second one
>
> If the user space timer has been enqueued before the leap one, then it
> will be expired first and if the timer interrupt got delayed a bit it
> again will see that its over the programmed time and happily expire to
> early.
>
> So what ever we do vs. the hrtimer offsets, cached or not will not
> prevent that we expire timers early.

Right, so without my proposed patch, this is an issue, but my proposed
patch requires that the hrtimer_interrupt not use cached offsets in
order to ensure the read-state is adjusted properly for the
leapsecond.  (Or it requries the hrtimer_interrupt path to also cache
the leapsecond state so it can do the same adjustment to the cached
data, but this seems terribly duplicative).


>> BTW: Thomas, what are you using to do measurements here? I hesitate
>> to argue in these sorts of performance discussions, since I really
>> only have a embarrassing theoretical understanding of the issues and
>> suspect myself a bit naive here. Additionally these sorts of
>> constraints aren't always clearly documented, so being able to
>> measure and compare would be helpful to ensure future changes don't
>> impact things here.
>
> performance counters and tests which stress the particular subsystems.

Sweet. Vague hand-waving results are simple to reproduce.
[Gestures, points assertively a few times] Results validated! Science! :)

But seriously, I'm earnestly looking for specifics (like which stress
tests are you caring about) here, so I can try to also watch that
patches I write or take don't undo your performance optimizations, so
in the future you'll have to yell at fewer people.


>> > That makes it far more expensive, it has similar scalability limiting effects as a
>> > global lock - while if we do it smart it can perform as essentially lockless code
>> > in most cases.
>>
>> Another reason why I don't like this approach of caching the data is
>> that it also prevents fixing the leap-second adjustment to happen on
>> the second edge, because we have to have an async update to the
>> seqcounter in order to refresh the cached real_offset. Or we have to
>> also export more ntp state data so we can duplicate the adjustment to
>> the cached data in the hrtimer code, which is more of the complexity
>> you've objected to.
>
> There is no guarantee that it happens at the seconds edge. Timer might
> be delayed, vcpu scheduled out ....

Again, I worry you've not looked closely at my patch. The only way to
assure the change happens at second edge is to do the adjustment in
the read path.

> All you will be able to do is to narrow the window, but as I explained
> above it wont prevent early expiry and it wont prevent VDSO seing the
> time go over the leap second and then jump back.

The vdso is fixable as well. I just avoided it because fixing it has
less benefit then fixing the timer expiration issue, and expected the
performance overhead (which is an extra comparison and subtraction,
not huge, but I respect that the vdsos are very optimized) would raise
more objections.

(If we want to get fancy, there's even the possibility of switching
the vdso execution page so we use the fixed adjustment function only
near the leap-second and go back to the normal function otherwise -
but regardless, if its to be done, have to fix the core first)

> We just have to accept that timekeeping, time readout and hrtimers
> have asynchronous behaviour. And there is no way around that unless
> you want to kill performance completely for the sake of this leap
> second nonsense.

Even outside of my leapsecond correctness concern, I think caching
time state like you're doing here is a maintenance issue. It has bit
us a number of times already, and things are complicated enough that
even fixing the issues that stem from it are non-trivial.

thanks
-john
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Patch

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 90ed5db..53be796 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -580,6 +580,9 @@  static void timekeeping_update(struct timekeeper *tk, unsigned int action)
 		ntp_clear();
 	}
 
+	if (action & TK_CLOCK_WAS_SET)
+		tk->clock_was_set_seq++;
+
 	tk_update_ktime_data(tk);
 
 	update_vsyscall(tk);
@@ -591,9 +594,6 @@  static void timekeeping_update(struct timekeeper *tk, unsigned int action)
 
 	update_fast_timekeeper(&tk->tkr_mono, &tk_fast_mono);
 	update_fast_timekeeper(&tk->tkr_raw,  &tk_fast_raw);
-
-	if (action & TK_CLOCK_WAS_SET)
-		tk->clock_was_set_seq++;
 }
 
 /**