
[RFC] sched_clock: Avoid tearing during read from NMI

Message ID 1421859236-19782-1-git-send-email-daniel.thompson@linaro.org
State New

Commit Message

Daniel Thompson Jan. 21, 2015, 4:53 p.m. UTC
Currently it is possible for an NMI (or FIQ on ARM) to come in and
read sched_clock() whilst update_sched_clock() has half updated the
state. This results in a bad time value being observed.

This patch fixes that problem in a similar manner to Thomas Gleixner's
4396e058c52e("timekeeping: Provide fast and NMI safe access to
CLOCK_MONOTONIC").

Note that ripping out the seqcount lock from sched_clock_register() and
replacing it with a large comment is not nearly as bad as it looks! The
locking here is actually pretty useless since most of the variables
modified within the write lock are not covered by the read lock. As a
result a big comment and the sequence bump implicit in the call
to update_epoch() should work pretty much the same.

Suggested-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
---

Notes:
    This patch has only had fairly light testing at this point; however, it
    survives basic tests. In particular I am running perf from FIQ/NMI and
    have instrumented it with some monotonicity tests, none of which have
    reported any problem.
    

 kernel/time/sched_clock.c | 63 +++++++++++++++++++++++++++++++++++++----------
 1 file changed, 50 insertions(+), 13 deletions(-)

--
1.9.3

Comments

John Stultz Jan. 21, 2015, 5:29 p.m. UTC | #1
On Wed, Jan 21, 2015 at 8:53 AM, Daniel Thompson
<daniel.thompson@linaro.org> wrote:
> Currently it is possible for an NMI (or FIQ on ARM) to come in and
> read sched_clock() whilst update_sched_clock() has half updated the
> state. This results in a bad time value being observed.
>
> This patch fixes that problem in a similar manner to Thomas Gleixner's
> 4396e058c52e("timekeeping: Provide fast and NMI safe access to
> CLOCK_MONOTONIC").
>
> Note that ripping out the seqcount lock from sched_clock_register() and
> replacing it with a large comment is not nearly as bad as it looks! The
> locking here is actually pretty useless since most of the variables
> modified within the write lock are not covered by the read lock. As a
> result a big comment and the sequence bump implicit in the call
> to update_epoch() should work pretty much the same.

It still looks pretty bad, even with the current explanation.


> -       raw_write_seqcount_begin(&cd.seq);
> +       /*
> +        * sched_clock will report a bad value if it executes
> +        * concurrently with the following code. No locking exists to
> +        * prevent this; we rely mostly on this function being called
> +        * early during kernel boot up before we have lots of other
> +        * stuff going on.
> +        */
>         read_sched_clock = read;
>         sched_clock_mask = new_mask;
>         cd.rate = rate;
>         cd.wrap_kt = new_wrap_kt;
>         cd.mult = new_mult;
>         cd.shift = new_shift;
> -       cd.epoch_cyc = new_epoch;
> -       cd.epoch_ns = ns;
> -       raw_write_seqcount_end(&cd.seq);
> +       update_epoch(new_epoch, ns);


So looking at this, the sched_clock_register() function may not be
called super early, so I was looking to see what prevented bad reads
prior to registration. And from quick inspection, it's nothing. I
suspect the undocumented trick that makes this work is that the mult
value is initialized to zero, so sched_clock() returns 0 until things
have been registered.
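
(For reference, cyc_to_ns() is just a multiply-and-shift, so while
mult is zero the computed delta is always zero and sched_clock()
falls back to the initial epoch_ns of 0 -- roughly:)

        static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
        {
                return (cyc * mult) >> shift;   /* always 0 while mult == 0 */
        }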

So it does seem like it would be worthwhile to do the initialization
under the lock, or possibly use the suspend flag to make the first
initialization safe.

thanks
-john
Daniel Thompson Jan. 21, 2015, 8:20 p.m. UTC | #2
On 21/01/15 17:29, John Stultz wrote:
> On Wed, Jan 21, 2015 at 8:53 AM, Daniel Thompson
> <daniel.thompson@linaro.org> wrote:
>> Currently it is possible for an NMI (or FIQ on ARM) to come in and
>> read sched_clock() whilst update_sched_clock() has half updated the
>> state. This results in a bad time value being observed.
>>
>> This patch fixes that problem in a similar manner to Thomas Gleixner's
>> 4396e058c52e("timekeeping: Provide fast and NMI safe access to
>> CLOCK_MONOTONIC").
>>
>> Note that ripping out the seqcount lock from sched_clock_register() and
>> replacing it with a large comment is not nearly as bad as it looks! The
>> locking here is actually pretty useless since most of the variables
>> modified within the write lock are not covered by the read lock. As a
>> result a big comment and the sequence bump implicit in the call
>> to update_epoch() should work pretty much the same.
> 
> It still looks pretty bad, even with the current explanation.

I'm inclined to agree. Although, to be clear, the code I proposed should
be no more broken than the code we have today (and is arguably more honest).

>> -       raw_write_seqcount_begin(&cd.seq);
>> +       /*
>> +        * sched_clock will report a bad value if it executes
>> +        * concurrently with the following code. No locking exists to
>> +        * prevent this; we rely mostly on this function being called
>> +        * early during kernel boot up before we have lots of other
>> +        * stuff going on.
>> +        */
>>         read_sched_clock = read;
>>         sched_clock_mask = new_mask;
>>         cd.rate = rate;
>>         cd.wrap_kt = new_wrap_kt;
>>         cd.mult = new_mult;
>>         cd.shift = new_shift;
>> -       cd.epoch_cyc = new_epoch;
>> -       cd.epoch_ns = ns;
>> -       raw_write_seqcount_end(&cd.seq);
>> +       update_epoch(new_epoch, ns);
> 
> 
> So looking at this, the sched_clock_register() function may not be
> called super early, so I was looking to see what prevented bad reads
> prior to registration.

Certainly not super early but, from the WARN_ON() at the top of the
function, I thought it was intended to be called before start_kernel()
unmasks interrupts...

> And from quick inspection, it's nothing. I
> suspect the undocumented trick that makes this work is that the mult
> value is initialized to zero, so sched_clock() returns 0 until things
> have been registered.
> 
> So it does seem like it would be worthwhile to do the initialization
> under the lock, or possibly use the suspend flag to make the first
> initialization safe.

As mentioned the existing write lock doesn't really do very much at the
moment.

The simple and (I think) strictly correct approach is to duplicate the
whole of the clock_data (minus the seqcount) and make the read lock in
sched_clock cover all accesses to the structure.

This would substantially enlarge the critical section in sched_clock()
meaning we might loop round the seqcount fractionally more often.
However if that causes any real problems it would be a sign the epoch
was being updated too frequently.
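
Roughly the reader side I have in mind (completely untested sketch;
cd.read_data[] would be a two-entry bank holding copies of everything
sched_clock() currently reads, and cyc_to_ns() is the existing helper):

        unsigned long long notrace sched_clock(void)
        {
                u64 cyc, res;
                unsigned long seq;
                struct clock_read_data *rd;

                do {
                        seq = raw_read_seqcount(&cd.seq);
                        /* seq & 1 selects the bank not currently being updated */
                        rd = cd.read_data + (seq & 1);

                        cyc = (rd->read_sched_clock() - rd->epoch_cyc) &
                              rd->sched_clock_mask;
                        res = rd->epoch_ns +
                              cyc_to_ns(cyc, rd->mult, rd->shift);
                } while (read_seqcount_retry(&cd.seq, seq));

                return res;
        }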

Unless I get any objections (or you really want me to look closely at
using suspend) then I'll try this approach in the next day or two.


Daniel.
Daniel Thompson Jan. 30, 2015, 7:03 p.m. UTC | #3
This patchset optimizes the generic sched_clock implementation to
significantly reduce the data cache profile. It also makes it safe to call
sched_clock() from NMI (or FIQ on ARM).

The data cache profile of sched_clock() in both the original code and
my previous patch was somewhere between 2 and 3 (64-byte) cache lines,
depending on alignment of struct clock_data. After patching, the cache
profile for the normal case should be a single cacheline.
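
To make that concrete, the idea is that everything the fast path
touches lives in a single read-side structure, which packs comfortably
inside one 64-byte line on a 64-bit build (rough sketch; names and
exact ordering are illustrative):

        struct clock_read_data {
                u64 epoch_ns;                   /*  8 bytes */
                u64 epoch_cyc;                  /*  8 bytes */
                u64 sched_clock_mask;           /*  8 bytes */
                u64 (*read_sched_clock)(void);  /*  8 bytes */
                u32 mult;                       /*  4 bytes */
                u32 shift;                      /*  4 bytes */
        };                                      /* 40 bytes in total */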

NMI safety was tested on i.MX6 with perf drowning the system in FIQs and
using the perf handler to check that sched_clock() returned monotonic
values. At the same time I forcefully reduced kt_wrap so that
update_sched_clock() is being called at >1000Hz.

Without the patches the above system is grossly unstable, surviving
[9K,115K,25K] perf event cycles during three separate runs. With the
patch I ran for over 9M perf event cycles before getting bored.
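
For reference, the monotonicity instrumentation is nothing fancier
than a per-cpu last-value check along these lines (illustrative only,
not part of this series):

        static DEFINE_PER_CPU(u64, last_sc);
        static atomic_t sc_went_backwards = ATOMIC_INIT(0);

        /* called from the FIQ/NMI perf handler; counter is reported later */
        static void notrace check_sched_clock(void)
        {
                u64 now = sched_clock();
                u64 *last = this_cpu_ptr(&last_sc);

                if (now < *last)
                        atomic_inc(&sc_went_backwards);
                *last = now;
        }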

v3:
* Optimized to minimise cache profile, including elimination of
  the suspended flag (Thomas Gleixner).
* Replaced the update_bank_begin/end with a single update function
  (Thomas Gleixner).
* Split into multiple patches to aid review.

v2:

* Extended the scope of the read lock in sched_clock() so we can bank
  all data consumed there (John Stultz)


Daniel Thompson (4):
  sched_clock: Match scope of read and write seqcounts
  sched_clock: Optimize cache line usage
  sched_clock: Remove suspend from clock_read_data
  sched_clock: Avoid deadlock during read from NMI

 kernel/time/sched_clock.c | 163 ++++++++++++++++++++++++++++++----------------
 1 file changed, 107 insertions(+), 56 deletions(-)

--
1.9.3
Daniel Thompson Feb. 5, 2015, 9:05 a.m. UTC | #4
On 05/02/15 00:50, Stephen Boyd wrote:
> On 01/30, Daniel Thompson wrote:
>> This patchset optimizes the generic sched_clock implementation to
>> significantly reduce the data cache profile. It also makes it safe to call
>> sched_clock() from NMI (or FIQ on ARM).
>>
>> The data cache profile of sched_clock() in both the original code and
>> my previous patch was somewhere between 2 and 3 (64-byte) cache lines,
>> depending on alignment of struct clock_data. After patching, the cache
>> profile for the normal case should be a single cacheline.
>>
>> NMI safety was tested on i.MX6 with perf drowning the system in FIQs and
>> using the perf handler to check that sched_clock() returned monotonic
>> values. At the same time I forcefully reduced kt_wrap so that
>> update_sched_clock() is being called at >1000Hz.
>>
>> Without the patches the above system is grossly unstable, surviving
>> [9K,115K,25K] perf event cycles during three separate runs. With the
>> patch I ran for over 9M perf event cycles before getting bored.
> 
> I wanted to see if there was any speedup from these changes so I
> made a tight loop around sched_clock() that ran for 10 seconds
> and I ran it 10 times before and after this patch series:
> 
>         unsigned long long clock, start_clock;
>         int count = 0; 
> 
>         clock = start_clock = sched_clock();
>         while ((clock - start_clock) < 10ULL * NSEC_PER_SEC) {
>                 clock = sched_clock();
>                 count++;
>         }
> 
>         pr_info("Made %d calls in %llu ns\n", count, clock - start_clock);
> 
> Before
> ------
>  Made 19218953 calls in 10000000439 ns
>  Made 19212790 calls in 10000000438 ns
>  Made 19217121 calls in 10000000142 ns
>  Made 19227304 calls in 10000000142 ns
>  Made 19217559 calls in 10000000142 ns
>  Made 19230193 calls in 10000000290 ns
>  Made 19212715 calls in 10000000290 ns
>  Made 19234446 calls in 10000000438 ns
>  Made 19226274 calls in 10000000439 ns
>  Made 19236118 calls in 10000000143 ns
>  
> After
> -----
>  Made 19434797 calls in 10000000438 ns
>  Made 19435733 calls in 10000000439 ns
>  Made 19434499 calls in 10000000438 ns
>  Made 19438482 calls in 10000000438 ns
>  Made 19435604 calls in 10000000142 ns
>  Made 19438551 calls in 10000000438 ns
>  Made 19444550 calls in 10000000290 ns
>  Made 19437580 calls in 10000000290 ns
>  Made 19439429 calls in 10000048142 ns
>  Made 19439493 calls in 10000000438 ns
> 
> So it seems to be a small improvement.
> 

Awesome!

I guess this is mostly the effect of simplifying the suspend logic since
the changes to the cache profile probably wouldn't reveal much in such a
tight loop.

I will re-run this after acting on your other review comments. BTW what
device did you run on?


Daniel.
Daniel Thompson Feb. 8, 2015, 12:02 p.m. UTC | #5
This patchset optimizes the generic sched_clock implementation by
removing branches and significantly reducing the data cache profile. It
also makes it safe to call sched_clock() from NMI (or FIQ on ARM).

The data cache profile of sched_clock() in the original code is
somewhere between 2 and 3 (64-byte) cache lines, depending on alignment
of struct clock_data. After patching, the cache profile for the normal
case should be a single cacheline.

NMI safety was tested on i.MX6 with perf drowning the system in FIQs and
using the perf handler to check that sched_clock() returned monotonic
values. At the same time I forcefully reduced kt_wrap so that
update_sched_clock() is being called at >1000Hz.

Without the patches the above system is grossly unstable, surviving
[9K,115K,25K] perf event cycles during three separate runs. With the
patch I ran for over 9M perf event cycles before getting bored.

v4:
* Optimized sched_clock() to be branchless by introducing a dummy
  function to provide clock values while the clock is suspended
  (Stephen Boyd); a rough sketch of this idea follows the changelog below.
* Improved commenting, including the kerneldoc comments (Stephen Boyd).
* Removed a redundant notrace from the update logic (Steven Rostedt).

v3:
* Optimized to minimise cache profile, including elimination of
  the suspended flag (Thomas Gleixner).
* Replaced the update_bank_begin/end with a single update function
  (Thomas Gleixner).
* Split into multiple patches to aid review.

v2:
* Extended the scope of the read lock in sched_clock() so we can bank
  all data consumed there (John Stultz)
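
The gist of the branchless suspend handling mentioned under v4 above
(a sketch of the idea only; see the patches for the real thing): rather
than testing a 'suspended' flag on every call, suspend swaps in a dummy
read function that keeps returning the cycle count captured at suspend
time, so sched_clock() itself needs no extra branch:

        static u64 notrace suspended_sched_clock_read(void)
        {
                /* return the cycle count frozen by the final update */
                return cd.read_data[raw_read_seqcount(&cd.seq) & 1].epoch_cyc;
        }

        static int sched_clock_suspend(void)
        {
                struct clock_read_data *rd = &cd.read_data[0];

                update_sched_clock();
                hrtimer_cancel(&sched_clock_timer);
                rd->read_sched_clock = suspended_sched_clock_read;

                return 0;
        }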


Daniel Thompson (5):
  sched_clock: Match scope of read and write seqcounts
  sched_clock: Optimize cache line usage
  sched_clock: Remove suspend from clock_read_data
  sched_clock: Remove redundant notrace from update function
  sched_clock: Avoid deadlock during read from NMI

 kernel/time/sched_clock.c | 195 ++++++++++++++++++++++++++++++++--------------
 1 file changed, 138 insertions(+), 57 deletions(-)

--
2.1.0
Daniel Thompson Feb. 8, 2015, 12:09 p.m. UTC | #6
On 05/02/15 17:05, Daniel Thompson wrote:
> On 05/02/15 00:50, Stephen Boyd wrote:
>> On 01/30, Daniel Thompson wrote:
>>> This patchset optimizes the generic sched_clock implementation to
>>> significantly reduce the data cache profile. It also makes it safe to call
>>> sched_clock() from NMI (or FIQ on ARM).
>>>
>>> The data cache profile of sched_clock() in both the original code and
>>> my previous patch was somewhere between 2 and 3 (64-byte) cache lines,
>>> depending on alignment of struct clock_data. After patching, the cache
>>> profile for the normal case should be a single cacheline.
>>>
>>> NMI safety was tested on i.MX6 with perf drowning the system in FIQs and
>>> using the perf handler to check that sched_clock() returned monotonic
>>> values. At the same time I forcefully reduced kt_wrap so that
>>> update_sched_clock() is being called at >1000Hz.
>>>
>>> Without the patches the above system is grossly unstable, surviving
>>> [9K,115K,25K] perf event cycles during three separate runs. With the
>>> patch I ran for over 9M perf event cycles before getting bored.
>>
>> I wanted to see if there was any speedup from these changes so I
>> made a tight loop around sched_clock() that ran for 10 seconds
>> and I ran it 10 times before and after this patch series:
>>
>>         unsigned long long clock, start_clock;
>>         int count = 0; 
>>
>>         clock = start_clock = sched_clock();
>>         while ((clock - start_clock) < 10ULL * NSEC_PER_SEC) {
>>                 clock = sched_clock();
>>                 count++;
>>         }
>>
>>         pr_info("Made %d calls in %llu ns\n", count, clock - start_clock);
>>
>> Before
>> ------
>>  Made 19218953 calls in 10000000439 ns
>>  Made 19212790 calls in 10000000438 ns
>>  Made 19217121 calls in 10000000142 ns
>>  Made 19227304 calls in 10000000142 ns
>>  Made 19217559 calls in 10000000142 ns
>>  Made 19230193 calls in 10000000290 ns
>>  Made 19212715 calls in 10000000290 ns
>>  Made 19234446 calls in 10000000438 ns
>>  Made 19226274 calls in 10000000439 ns
>>  Made 19236118 calls in 10000000143 ns
>>  
>> After
>> -----
>>  Made 19434797 calls in 10000000438 ns
>>  Made 19435733 calls in 10000000439 ns
>>  Made 19434499 calls in 10000000438 ns
>>  Made 19438482 calls in 10000000438 ns
>>  Made 19435604 calls in 10000000142 ns
>>  Made 19438551 calls in 10000000438 ns
>>  Made 19444550 calls in 10000000290 ns
>>  Made 19437580 calls in 10000000290 ns
>>  Made 19439429 calls in 10000048142 ns
>>  Made 19439493 calls in 10000000438 ns
>>
>> So it seems to be a small improvement.
>>
> 
> Awesome!
> 
> I guess this is mostly the effect of simplifying the suspend logic since
> the changes to the cache profile probably wouldn't reveal much in such a
> tight loop.
> 
> I will re-run this after acting on your other review comments. BTW what
> device did you run on?

I ran the same test on my Snapdragon 600 board. The results are a
little strange: there is an odd quantization effect that I cannot
easily explain, and the results of the v4 patch seem almost too good
to be true.

My results are below but I'd be very interested to see what results you
get with the v4 patch!

Latest (branchless approach):

Made 18736519 calls in 10000000439 ns
Made 19958774 calls in 10000000439 ns
Made 18736500 calls in 10000000587 ns
Made 21703993 calls in 10000000439 ns
Made 18734458 calls in 10000000142 ns
Made 18736175 calls in 10000000439 ns
Made 19961406 calls in 10000000291 ns
Made 19953920 calls in 10000000143 ns
Made 21709619 calls in 10000000290 ns
Made 18734077 calls in 10000000142 ns

v3:

Made 15971788 calls in 10000000438 ns
Made 14594357 calls in 10000000734 ns
Made 14590951 calls in 10000000735 ns
Made 14595048 calls in 10000000290 ns
Made 14595157 calls in 10000000143 ns
Made 14594117 calls in 10000000142 ns
Made 14597277 calls in 10000000142 ns
Made 14594472 calls in 10000000586 ns
Made 14601292 calls in 10000000587 ns
Made 15968630 calls in 10000000587 ns

Current:

Made 14274871 calls in 10000000587 ns
Made 15634136 calls in 10000000587 ns
Made 16453528 calls in 10000000142 ns
Made 14275854 calls in 10000000586 ns
Made 15634128 calls in 10000000438 ns
Made 14277672 calls in 10000000143 ns
Made 14282904 calls in 10000000290 ns
Made 14278673 calls in 10000000142 ns
Made 14276096 calls in 10000000290 ns
Made 14275336 calls in 10000000143 ns
Daniel Thompson March 2, 2015, 3:56 p.m. UTC | #7
This patchset optimizes the generic sched_clock implementation by
removing branches and significantly reducing the data cache profile. It
also makes it safe to call sched_clock() from NMI (or FIQ on ARM).

The data cache profile of sched_clock() in the original code is
somewhere between 2 and 3 (64-byte) cache lines, depending on alignment
of struct clock_data. After patching, the cache profile for the normal
case should be a single cacheline.

NMI safety was tested on i.MX6 with perf drowning the system in FIQs and
using the perf handler to check that sched_clock() returned monotonic
values. At the same time I forcefully reduced kt_wrap so that
update_sched_clock() is being called at >1000Hz.

Without the patches the above system is grossly unstable, surviving
[9K,115K,25K] perf event cycles during three separate runs. With the
patch I ran for over 9M perf event cycles before getting bored.

Performance testing has primarily been performed using a simple
tight loop test (i.e. one that is unlikely to benefit from the
cache profile improvements). Summary results show benefit on all
CPUs although magnitude varies significantly:

  Cortex A9 @ 792MHz	 4.1% speedup
  Cortex A9 @ 1GHz	 0.4% speedup  (different SoC to above)
  Scorpion		13.6% speedup
  Krait			35.1% speedup
  Cortex A53 @ 1GHz	 1.6% speedup
  Cortex A57 @ 1GHz	 5.0% speedup

Benchmarking was done by Stephen Boyd and myself, full data for the
above summaries can be found here:
https://docs.google.com/spreadsheets/d/1Zd2xN42U4oAVZcArqAYdAWgFI5oDFRysURCSYNmBpZA/edit?usp=sharing

v5:
* Summarized benchmark results in the patchset cover letter and
  added some Reviewed-by:s.
* Rebased on 4.0-rc1.

v4:
* Optimized sched_clock() to be branchless by introducing a dummy
  function to provide clock values while the clock is suspended
  (Stephen Boyd).
* Improved commenting, including the kerneldoc comments (Stephen Boyd).
* Removed a redundant notrace from the update logic (Steven Rostedt).

v3:
* Optimized to minimise cache profile, including elimination of
  the suspended flag (Thomas Gleixner).
* Replaced the update_bank_begin/end with a single update function
  (Thomas Gleixner).
* Split into multiple patches to aid review.

v2:
* Extended the scope of the read lock in sched_clock() so we can bank
  all data consumed there (John Stultz)


Daniel Thompson (5):
  sched_clock: Match scope of read and write seqcounts
  sched_clock: Optimize cache line usage
  sched_clock: Remove suspend from clock_read_data
  sched_clock: Remove redundant notrace from update function
  sched_clock: Avoid deadlock during read from NMI

 kernel/time/sched_clock.c | 195 ++++++++++++++++++++++++++++++++--------------
 1 file changed, 138 insertions(+), 57 deletions(-)

--
2.1.0

Patch

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 01d2d15aa662..485d5070259c 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -27,6 +27,10 @@  struct clock_data {
 	u32 mult;
 	u32 shift;
 	bool suspended;
+
+	/* Used only temporarily whilst we are updating the primary copy */
+	u64 old_epoch_ns;
+	u64 old_epoch_cyc;
 };

 static struct hrtimer sched_clock_timer;
@@ -67,9 +71,14 @@  unsigned long long notrace sched_clock(void)
 		return cd.epoch_ns;

 	do {
-		seq = raw_read_seqcount_begin(&cd.seq);
-		epoch_cyc = cd.epoch_cyc;
-		epoch_ns = cd.epoch_ns;
+		seq = raw_read_seqcount(&cd.seq);
+		if (likely(0 == (seq & 1))) {
+			epoch_cyc = cd.epoch_cyc;
+			epoch_ns = cd.epoch_ns;
+		} else {
+			epoch_cyc = cd.old_epoch_cyc;
+			epoch_ns = cd.old_epoch_ns;
+		}
 	} while (read_seqcount_retry(&cd.seq, seq));

 	cyc = read_sched_clock();
@@ -78,6 +87,35 @@  unsigned long long notrace sched_clock(void)
 }

 /*
+ * Update the epoch without allowing sched_clock to observe
+ * a mismatched epoch pair even if called from NMI.
+ *
+ * We do this by maintaining an odd/even copy of the epoch data and
+ * steering sched_clock to one or the other using a sequence counter.
+ * In order to preserve the (average case) data cache profile of
+ * sched_clock the system reverts back to the even copy as soon as
+ * possible; the odd copy is used *only* during an update.
+ *
+ * The caller is responsible for avoiding simultaneous updates.
+ */
+static void notrace update_epoch(u64 cyc, u64 ns)
+{
+	/* Update the backup copy */
+	cd.old_epoch_cyc = cd.epoch_cyc;
+	cd.old_epoch_ns = cd.epoch_ns;
+
+	/* Force readers to use the backup (odd) copy */
+	raw_write_seqcount_latch(&cd.seq);
+
+	/* Update the primary copy */
+	cd.epoch_cyc = cyc;
+	cd.epoch_ns = ns;
+
+	/* Steer readers back to the primary (even) copy */
+	raw_write_seqcount_latch(&cd.seq);
+}
+
+/*
  * Atomically update the sched_clock epoch.
  */
 static void notrace update_sched_clock(void)
@@ -91,12 +129,7 @@  static void notrace update_sched_clock(void)
 		cyc_to_ns((cyc - cd.epoch_cyc) & sched_clock_mask,
 			  cd.mult, cd.shift);

-	raw_local_irq_save(flags);
-	raw_write_seqcount_begin(&cd.seq);
-	cd.epoch_ns = ns;
-	cd.epoch_cyc = cyc;
-	raw_write_seqcount_end(&cd.seq);
-	raw_local_irq_restore(flags);
+	update_epoch(cyc, ns);
 }

 static enum hrtimer_restart sched_clock_poll(struct hrtimer *hrt)
@@ -135,16 +168,20 @@  void __init sched_clock_register(u64 (*read)(void), int bits,
 	ns = cd.epoch_ns + cyc_to_ns((cyc - cd.epoch_cyc) & sched_clock_mask,
 			  cd.mult, cd.shift);

-	raw_write_seqcount_begin(&cd.seq);
+	/*
+	 * sched_clock will report a bad value if it executes
+	 * concurrently with the following code. No locking exists to
+	 * prevent this; we rely mostly on this function being called
+	 * early during kernel boot up before we have lots of other
+	 * stuff going on.
+	 */
 	read_sched_clock = read;
 	sched_clock_mask = new_mask;
 	cd.rate = rate;
 	cd.wrap_kt = new_wrap_kt;
 	cd.mult = new_mult;
 	cd.shift = new_shift;
-	cd.epoch_cyc = new_epoch;
-	cd.epoch_ns = ns;
-	raw_write_seqcount_end(&cd.seq);
+	update_epoch(new_epoch, ns);

 	r = rate;
 	if (r >= 4000000) {