Message ID | 1404978747-20869-1-git-send-email-linus.walleij@linaro.org |
---|---|
State | Accepted |
Commit | 7806f60e1d205db46eca6ad24429b3f86eda2588 |
On Thu, 10 Jul 2014, Linus Walleij wrote:

> This adds some documentation about clock sources, clock events,
> the weak sched_clock() function and delay timers that answers
> questions that repeatedly arise on the mailing lists.
>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Nicolas Pitre <nico@fluxnic.net>
> Cc: Colin Cross <ccross@google.com>
> Cc: John Stultz <john.stultz@linaro.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

Acked-by: Nicolas Pitre <nico@linaro.org>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
On 07/10/2014 12:52 AM, Linus Walleij wrote:
> This adds some documentation about clock sources, clock events,
> the weak sched_clock() function and delay timers that answers
> questions that repeatedly arise on the mailing lists.
>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Nicolas Pitre <nico@fluxnic.net>
> Cc: Colin Cross <ccross@google.com>
> Cc: John Stultz <john.stultz@linaro.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

Queued for 3.17. Thanks!
-john
diff --git a/Documentation/timers/00-INDEX b/Documentation/timers/00-INDEX
index 6d042dc1cce0..ee212a27772f 100644
--- a/Documentation/timers/00-INDEX
+++ b/Documentation/timers/00-INDEX
@@ -12,6 +12,8 @@ Makefile
 	- Build and link hpet_example
 NO_HZ.txt
 	- Summary of the different methods for the scheduler clock-interrupts management.
+timekeeping.txt
+	- Clock sources, clock events, sched_clock() and delay timer notes
 timers-howto.txt
 	- how to insert delays in the kernel the right (tm) way.
 timer_stats.txt
diff --git a/Documentation/timers/timekeeping.txt b/Documentation/timers/timekeeping.txt
new file mode 100644
index 000000000000..f3a8cf28f802
--- /dev/null
+++ b/Documentation/timers/timekeeping.txt
@@ -0,0 +1,179 @@
+Clock sources, Clock events, sched_clock() and delay timers
+-----------------------------------------------------------
+
+This document tries to briefly explain some basic kernel timekeeping
+abstractions. It partly pertains to the drivers usually found in
+drivers/clocksource in the kernel tree, but the code may be spread out
+across the kernel.
+
+If you grep through the kernel source you will find a number of architecture-
+specific implementations of clock sources, clockevents and several likewise
+architecture-specific overrides of the sched_clock() function and some
+delay timers.
+
+To provide timekeeping for your platform, the clock source provides
+the basic timeline, whereas clock events shoot interrupts on certain points
+on this timeline, providing facilities such as high-resolution timers.
+sched_clock() is used for scheduling and timestamping, and delay timers
+provide an accurate delay source using hardware counters.
+
+
+Clock sources
+-------------
+
+The purpose of the clock source is to provide a timeline for the system that
+tells you where you are in time. For example issuing the command 'date' on
+a Linux system will eventually read the clock source to determine exactly
+what time it is.
+
+Typically the clock source is a monotonic, atomic counter which will provide
+n bits which count from 0 to (2^n)-1 and then wraps around to 0 and starts over.
+It will ideally NEVER stop ticking as long as the system is running. It
+may stop during system suspend.
+
+The clock source shall have as high resolution as possible, and the frequency
+shall be as stable and correct as possible as compared to a real-world wall
+clock. It should not move unpredictably back and forth in time or miss a few
+cycles here and there.
+
+It must be immune to the kind of effects that occur in hardware where e.g.
+the counter register is read in two phases on the bus lowest 16 bits first
+and the higher 16 bits in a second bus cycle with the counter bits
+potentially being updated in between leading to the risk of very strange
+values from the counter.
+
+When the wall-clock accuracy of the clock source isn't satisfactory, there
+are various quirks and layers in the timekeeping code for e.g. synchronizing
+the user-visible time to RTC clocks in the system or against networked time
+servers using NTP, but all they do basically is update an offset against
+the clock source, which provides the fundamental timeline for the system.
+These measures do not affect the clock source per se, they only adapt the
+system to the shortcomings of it.
+
+The clock source struct shall provide means to translate the provided counter
+into a nanosecond value as an unsigned long long (unsigned 64 bit) number.
+Since this operation may be invoked very often, doing this in a strict
+mathematical sense is not desirable: instead the number is taken as close as
+possible to a nanosecond value using only the arithmetic operations
+multiply and shift, so in clocksource_cyc2ns() you find:
+
+  ns ~= (clocksource * mult) >> shift
+
+You will find a number of helper functions in the clock source code intended
+to aid in providing these mult and shift values, such as
+clocksource_khz2mult(), clocksource_hz2mult() that help determine the
+mult factor from a fixed shift, and clocksource_register_hz() and
+clocksource_register_khz() which will help out assigning both shift and mult
+factors using the frequency of the clock source as the only input.
+
+For real simple clock sources accessed from a single I/O memory location
+there is nowadays even clocksource_mmio_init() which will take a memory
+location, bit width, a parameter telling whether the counter in the
+register counts up or down, and the timer clock rate, and then conjure all
+necessary parameters.
+
+Since a 32-bit counter at say 100 MHz will wrap around to zero after some 43
+seconds, the code handling the clock source will have to compensate for this.
+That is the reason why the clock source struct also contains a 'mask'
+member telling how many bits of the source are valid. This way the timekeeping
+code knows when the counter will wrap around and can insert the necessary
+compensation code on both sides of the wrap point so that the system timeline
+remains monotonic.
+
+
+Clock events
+------------
+
+Clock events are the conceptual reverse of clock sources: they take a
+desired time specification value and calculate the values to poke into
+hardware timer registers.
+
+Clock events are orthogonal to clock sources. The same hardware
+and register range may be used for the clock event, but it is essentially
+a different thing. The hardware driving clock events has to be able to
+fire interrupts, so as to trigger events on the system timeline. On an SMP
+system, it is ideal (and customary) to have one such event driving timer per
+CPU core, so that each core can trigger events independently of any other
+core.
+
+You will notice that the clock event device code is based on the same basic
+idea about translating counters to nanoseconds using mult and shift
+arithmetic, and you find the same family of helper functions again for
+assigning these values. The clock event driver does not need a 'mask'
+attribute however: the system will not try to plan events beyond the time
+horizon of the clock event.
+
+
+sched_clock()
+-------------
+
+In addition to the clock sources and clock events there is a special weak
+function in the kernel called sched_clock(). This function shall return the
+number of nanoseconds since the system was started. An architecture may or
+may not provide an implementation of sched_clock() on its own. If a local
+implementation is not provided, the system jiffy counter will be used as
+sched_clock().
+
+As the name suggests, sched_clock() is used for scheduling the system,
+determining the absolute timeslice for a certain process in the CFS scheduler
+for example. It is also used for printk timestamps when you have selected to
+include time information in printk for things like bootcharts.
+
+Compared to clock sources, sched_clock() has to be very fast: it is called
+much more often, especially by the scheduler. If you have to do trade-offs
+between accuracy compared to the clock source, you may sacrifice accuracy
+for speed in sched_clock(). It however requires some of the same basic
+characteristics as the clock source, i.e. it should be monotonic.
+
+The sched_clock() function may wrap only on unsigned long long boundaries,
+i.e. after 64 bits. Since this is a nanosecond value this will mean it wraps
+after circa 585 years. (For most practical systems this means "never".)
+
+If an architecture does not provide its own implementation of this function,
+it will fall back to using jiffies, making its maximum resolution 1/HZ of the
+jiffy frequency for the architecture. This will affect scheduling accuracy
+and will likely show up in system benchmarks.
+
+The clock driving sched_clock() may stop or reset to zero during system
+suspend/sleep. This does not matter to the function it serves of scheduling
+events on the system. However it may result in interesting timestamps in
+printk().
+
+The sched_clock() function should be callable in any context, IRQ- and
+NMI-safe and return a sane value in any context.
+
+Some architectures may have a limited set of time sources and lack a nice
+counter to derive a 64-bit nanosecond value, so for example on the ARM
+architecture, special helper functions have been created to provide a
+sched_clock() nanosecond base from a 16- or 32-bit counter. Sometimes the
+same counter that is also used as clock source is used for this purpose.
+
+On SMP systems, it is crucial for performance that sched_clock() can be called
+independently on each CPU without any synchronization performance hits.
+Some hardware (such as the x86 TSC) will cause the sched_clock() function to
+drift between the CPUs on the system. The kernel can work around this by
+enabling the CONFIG_HAVE_UNSTABLE_SCHED_CLOCK option. This is another aspect
+that makes sched_clock() different from the ordinary clock source.
+
+
+Delay timers (some architectures only)
+--------------------------------------
+
+On systems with variable CPU frequency, the various kernel delay() functions
+will sometimes behave strangely. Basically these delays usually use a hard
+loop to delay a certain number of jiffy fractions using a "lpj" (loops per
+jiffy) value, calibrated on boot.
+
+Let's hope that your system is running on maximum frequency when this value
+is calibrated: as an effect when the frequency is geared down to half the
+full frequency, any delay() will be twice as long. Usually this does not
+hurt, as you're commonly requesting that amount of delay *or more*. But
+basically the semantics are quite unpredictable on such systems.
+
+Enter timer-based delays. Using these, a timer read may be used instead of
+a hard-coded loop for providing the desired delay.
+
+This is done by declaring a struct delay_timer and assigning the appropriate
+function pointers and rate settings for this delay timer.
+
+This is available on some architectures like OpenRISC or ARM.
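
As a rough illustration of the mult/shift conversion and the 'mask'
handling described in the "Clock sources" section of the patch, here is a
minimal userspace sketch in C. It is not kernel code: struct
demo_clocksource and the demo_*() helpers are hypothetical names made up
for this example, the rounding in demo_hz2mult() is an assumption about how
a mult factor could be derived from a fixed shift, and the shift value of 24
used below is an arbitrary choice.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model; NOT the kernel's struct clocksource. */
struct demo_clocksource {
	uint64_t mask;   /* valid counter bits, e.g. 0xffffffff for 32 bits */
	uint32_t mult;   /* multiplier used in the cycles -> ns conversion */
	uint32_t shift;  /* right shift used in the cycles -> ns conversion */
};

/*
 * Derive mult from a frequency in Hz and a fixed shift so that
 * (cycles * mult) >> shift approximates nanoseconds:
 *   mult = (10^9 << shift) / hz, rounded to nearest.
 * Assumption: the caller picks hz and shift so the result fits in 32 bits.
 */
static uint32_t demo_hz2mult(uint32_t hz, uint32_t shift)
{
	uint64_t tmp;

	assert(hz != 0 && shift < 32);
	tmp = (uint64_t)1000000000 << shift;
	tmp += hz / 2;
	return (uint32_t)(tmp / hz);
}

/*
 * ns ~= (cycles * mult) >> shift
 * The product can overflow 64 bits for very large cycle counts; this
 * sketch does not guard against that.
 */
static uint64_t demo_cyc2ns(const struct demo_clocksource *cs, uint64_t cycles)
{
	return (cycles * cs->mult) >> cs->shift;
}

/*
 * Cycles elapsed between two raw counter reads. Masking the difference
 * gives the right answer across one wrap of an n-bit counter.
 */
static uint64_t demo_delta(const struct demo_clocksource *cs,
			   uint64_t now, uint64_t last)
{
	return (now - last) & cs->mask;
}
```

With a 100 MHz counter and shift = 24, demo_hz2mult() yields mult =
167772160 (exactly 10 << 24), so every cycle converts to 10 ns. On a 32-bit
counter, a read of 5 following a read of 0xfffffffb gives a delta of 10
cycles instead of a large backwards jump.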
This adds some documentation about clock sources, clock events,
the weak sched_clock() function and delay timers that answers
questions that repeatedly arise on the mailing lists.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nicolas Pitre <nico@fluxnic.net>
Cc: Colin Cross <ccross@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
ChangeLog v2->v3:
- Minor spelling and tweak comments (PeterZ)
- Emphasize clocksource_register_[k]hz (John Stultz)
- Spelling fixes (Randy Dunlap)
ChangeLog v1->v2:
- Included paragraphs and minor edits to account for PeterZ's
  comments on addressing SMP use cases, which makes especially
  the semantics of sched_clock() much clearer.
---
 Documentation/timers/00-INDEX        |   2 +
 Documentation/timers/timekeeping.txt | 179 +++++++++++++++++++++++++++++++++++
 2 files changed, 181 insertions(+)
 create mode 100644 Documentation/timers/timekeeping.txt
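
The two wrap-around figures quoted in the new document (some 43 seconds for
a 32-bit counter at 100 MHz, and circa 585 years for a 64-bit nanosecond
value) can be checked with a few lines of C. This is only a sketch for
verifying the arithmetic: demo_wrap_seconds() is a hypothetical helper, not
a kernel function, and it truncates to whole seconds.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Whole seconds until a counter of the given width, ticking at hz,
 * reaches its maximum value and wraps. Hypothetical helper for
 * illustration only.
 */
static uint64_t demo_wrap_seconds(unsigned int bits, uint64_t hz)
{
	uint64_t max;

	assert(bits >= 1 && bits <= 64 && hz != 0);
	/* Shifting a 64-bit value by 64 is undefined, so special-case it. */
	max = (bits == 64) ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);
	return max / hz;
}
```

demo_wrap_seconds(32, 100000000) returns 42, since (2^32 - 1) / 10^8 is
roughly 42.9 seconds. For a nanosecond value, demo_wrap_seconds(64,
1000000000) returns 18446744073 seconds, which is a little under 585 years
of 365 days each.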