diff mbox series

[API-NEXT,8/8] linux-gen: time: use hw time counter when available

Message ID 20170421131134.27992-9-petri.savolainen@linaro.org
State Superseded
Headers show
Series Use HW time counter | expand

Commit Message

Petri Savolainen April 21, 2017, 1:11 p.m. UTC
Use 64 bit HW time counter when available. It is used on
x86 when invariant TSC CPU flag indicates that TSC frequency
is constant. Otherwise, the system time is used as before. Direct
HW time counter usage avoids system call, and related latency
and performance issues.

Signed-off-by: Petri Savolainen <petri.savolainen@linaro.org>

---
 platform/linux-generic/Makefile.am                 |   1 +
 platform/linux-generic/arch/arm/odp_cpu_arch.c     |  11 +
 platform/linux-generic/arch/default/odp_cpu_arch.c |  11 +
 platform/linux-generic/arch/mips64/odp_cpu_arch.c  |  11 +
 platform/linux-generic/arch/powerpc/odp_cpu_arch.c |  11 +
 platform/linux-generic/arch/x86/cpu_flags.c        |   9 +
 platform/linux-generic/arch/x86/odp_cpu_arch.c     |  59 ++++
 .../include/odp/api/plat/time_types.h              |  23 +-
 platform/linux-generic/include/odp_time_internal.h |  24 ++
 platform/linux-generic/odp_time.c                  | 303 ++++++++++++++++-----
 10 files changed, 398 insertions(+), 65 deletions(-)
 create mode 100644 platform/linux-generic/include/odp_time_internal.h

-- 
2.11.0

Comments

Brian Brooks April 19, 2017, 6:36 p.m. UTC | #1
On 04/26 07:11:57, Savolainen, Petri (Nokia - FI/Espoo) wrote:
> 

> 

> > > From coverletter:

> > > "This patch set modifies time implementation to use TSC when running on

> > a x86

> > > CPU that has invarint TSC CPU flag set. Otherwise, the same Linux system

> > time

> > > is used as before. TSC is much more efficient both in performance and

> > > latency/jitter wise than Linux system call. This can be seen also with

> > > scheduler latency test which time stamps events with this API. All

> > latency

> > > measurements (min, ave, max) improved significantly."

> > 

> > odp_sched_latency currently uses clock_gettime. It is my understanding

> > that clock_gettime does not have the over head of the system call. Can

> > you elaborate more on the 'improved significantly' part?

> > 

> 

> clock_gettime() uses the same TSC, but when you profile it with perf you can see tens of kernel functions including system call entry, RCU maintenance, etc.

> 

> E.g. in sched_latency test kernel consumed about 10% of all the cycles. Also latency measurement results improved like this:

> * min >3x lower

> * avg 2x lower

> * max more stable and 50% lower


You might want to share more information on the environment
where you're seeing such significant improvements because the
results on Broadwell do not match the above interpretation.

PS - This patch series breaks the build on ARM.

before / after numbers on 2650v4 (HT disabled, Linux 4.4.6, GCC 5.3.1):

 before:

  HIGH priority
  Thread   Avg[ns]    Min[ns]    Max[ns]    Samples    Total
  ---------------------------------------------------------------
  1        619        351        2463       2103       2103
  2        652        382        1509       2019       2019
  3        637        360        1950       1867       1867
  4        606        373        2328       2073       2073
  5        611        371        2677       2096       2096
  6        643        378        3045       2106       2106
  7        631        354        1677       1923       1923
  8        603        367        4721       2054       2054
  9        617        373        1524       2111       2111
  10       641        369        1808       2024       2024
  ---------------------------------------------------------------
  Total    626        351        4721       20376      20376

  LOW priority
  Thread   Avg[ns]    Min[ns]    Max[ns]    Samples    Total
  ---------------------------------------------------------------
  1        30498      480        914522     2097       4192201
  2        44302      491        584995     1980       4192285
  3        84258      680        515286     2001       4192437
  4        127714     746        473280     2011       4192231
  5        24568      455        724637     2109       4192208
  6        42436      473        523936     2041       4192198
  7        85438      554        486851     2017       4192381
  8        126164     841        203464     2058       4192250
  9        23085      492        671478     2091       4192192
  10       41748      499        515091     1970       4192280
  ---------------------------------------------------------------
  Total    62725      455        914522     20375      41922663

 after:

  HIGH priority
  Thread   Avg[ns]    Min[ns]    Max[ns]    Samples    Total
  ---------------------------------------------------------------
  1        523        202        4671       2152       2152
  2        551        276        1540       2058       2058
  3        492        257        1274       1928       1928
  4        496        269        1201       2035       2035
  5        520        252        1506       2165       2165
  6        548        291        1540       2002       2002
  7        491        251        1274       1969       1969
  8        486        259        3007       1951       1951
  9        528        276        1601       2091       2091
  10       555        264        1611       2001       2001
  ---------------------------------------------------------------
  Total    519        202        4671       20352      20352

  LOW priority
  Thread   Avg[ns]    Min[ns]    Max[ns]    Samples    Total
  ---------------------------------------------------------------
  1        28303      432        828632     2141       4192152
  2        43635      373        662999     2005       4192246
  3        85032      579        664442     1991       4192376
  4        128053     471        306203     2066       4192269
  5        25289      431        591153     2139       4192139
  6        41118      362        693192     2013       4192302
  7        87817      523        696300     2090       4192335
  8        128484     398        232439     2008       4192353
  9        23565      507        716741     1969       4192212
  10       41952      338        614098     1929       4192303
  ---------------------------------------------------------------
  Total    63273      338        828632     20351      41922687

> -Petri

>
Brian Brooks April 19, 2017, 6:45 p.m. UTC | #2
On 04/26 07:30:15, Savolainen, Petri (Nokia - FI/Espoo) wrote:
> 

> > > > This function (cpu_global_time()) is called only when we have first

> > checked that TSC is invariant. Also we measure the TSC frequency in that

> > case. This function is defined in the same file as cpu_cycles(), and the

> > file is x86 specific. So, we know what we are doing, and just re-using the

> > code to read TSC.

> > 

> > What sort of timing accuracy is expected from the app?

> > 

> > From benchmarking the maximum single-threaded rate of these reads:

> > 

> >  x86_64:

> > 

> >    read       7 ns/op

> >    read_sync  22 ns/op

> > 

> >  A57:

> > 

> >    read       4 ns/op

> >    read_sync  26 ns/op

> > 

> > read_sync issues a synchronizing instruction for greater timing accuracy

> > but clearly takes more time to return the time value read from the core.

> 

> Accuracy is as good as implementation can offer with reasonable overhead. We do not put any nsec figures into API spec. ODP API should offer application the most efficient way to read time anyway.


'reasonable' is what we need to define.

Another reason why you're seeing a performance boost on x86 is that when
switching from clock_gettime() to RDTSC, you're no longer issuing a synchronizing
instruction (fence). As shown above, this can be a significant factor depending
on how often the time is being sampled.

However, there is a loss in timing accuracy because the load of the value
may not happen at the time it happens in program order. This is why a
synchronizing instruction needs to be used, but it slows down the execution
of the thread on the core...

> This patch does not take a position which way TSC should be read. There are three options: rdtsc, rdtsc + barrier, rdtscp. I think the current code is good enough for the accuracy. Barrier adds slight overhead. Rdtscp is not as widely supported as rdtsc. This detail is a magnitude less significant compared to: use system call vs direct TSC read. It can be tuned later. This patch set helps if rdtscp should be used later on (introduces x86 cpu flags).


So you're saying that you do not need the synchronizing instruction, and the
loss of timing accuracy is OK, right?

> -Petri

>
Brian Brooks April 19, 2017, 7:04 p.m. UTC | #3
On 04/26 08:02:09, Savolainen, Petri (Nokia - FI/Espoo) wrote:
> 

> 

> > > > > diff --git a/platform/linux-

> > generic/include/odp/api/plat/time_types.h

> > > > b/platform/linux-generic/include/odp/api/plat/time_types.h

> > > > > index 4847f3b1..8ae7ce82 100644

> > > > > --- a/platform/linux-generic/include/odp/api/plat/time_types.h

> > > > > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h

> > > > > @@ -26,11 +26,28 @@ extern "C" {

> > > > >   * the linux timespec structure, which is dependent on POSIX

> > extension

> > > > level.

> > > > >   */

> > > > >  typedef struct odp_time_t {

> > > > > -	int64_t tv_sec;      /**< @internal Seconds */

> > > > > -	int64_t tv_nsec;     /**< @internal Nanoseconds */

> > > > > +	union {

> > > > > +		/** @internal Posix timespec */

> > > > > +		struct {

> > > > > +			/** @internal Seconds */

> > > > > +			int64_t tv_sec;

> > > > > +

> > > > > +			/** @internal Nanoseconds */

> > > > > +			int64_t tv_nsec;

> > > > > +		} spec;

> > > > > +

> > > > > +		/** @internal HW time counter */

> > > > > +		struct {

> > > > > +			/** @internal Counter value */

> > > > > +			uint64_t count;

> > > > > +

> > > > > +			/** @internal Reserved */

> > > > > +			uint64_t res;

> > > > > +		} hw;

> > > > > +	};

> > > >

> > > > A processor's tick counter cannot be used to measure calendar time by

> > > > itself. The delta between two ticks, however, can be converted to

> > > > calendar time.

> > > >

> > > > Please see the proposal that introduces odp_tick_t which eliminates

> > > > the wasted u64 reserved field in the union above.

> > >

> > >

> > > Nothing is wasted here today.

> > 

> > The 64-bit CPU time is stored in a 128-bit data structure where 64-bits

> > are wasted. You can use just a 64-bit type and then convert that to

> > a timespec (using a starting timestamp taken on global init) if needed.

> > 

> 

> Nothing is wasted compared to the situation today. Entire timespec is stored as before. If you want to compress timespec, it's another topic. Compress means additional complexity and overhead. This patch set just adds ability to use 64 but HW time when available. Timespec side implementation remains as is.


You are re-using the calendar time type to store a value read from the CPU
counter. A different approach is to use a different type (64 bits only for no
wasted space) for the value read from the CPU.

This value must be converted to a calendar time time if needed. It cannot just
be used to represent calendar time.

However, for some applications and use cases, you don't need to convert to calendar time
in order to measure across two reads of the counter. You can also do direct arithmetic
on the value instead of arithmetic on a timespec.

If you want to read code instead of email or design documentation, look at this:

  https://git.linaro.org/people/ola.liljedahl/sw_scheduler.git/tree/time_bench.c
  https://git.linaro.org/people/ola.liljedahl/sw_scheduler.git/tree/time.h

All you need is a 64 bit type, a way to read from the counter as well as get
the frequency, and 2 very small conversion functions.

> > > This is because all our code (in this and other files) does "static

> > inline" for functions that we want to be inlined. Static is used for

> > functions that are static, but we don't care if those are inlined (slow

> > path). Most time API functions are fast path.

> > 

> > I am saying that adding the 'inline' qualifier here doesn't actually do

> > anything

> > because the compiler will consider static functions for inlining anyways.

> 

> 

> We all know that those are considered. We use "inline" as C standard describes: to suggest  which function calls should be as fast as possible.

>  

> 

> > > > > +

> > > > > +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2)

> > > > > +{

> > > > > +	odp_time_t time;

> > > > > +

> > > > > +	time.hw.res = 0;

> > > > > +	time.hw.count = t1.hw.count + t2.hw.count;

> > > > > +

> > > > > +	return time;

> > > > > +}

> > > >

> > > > If a single u64 were used this code wouldn't need to exist.

> > >

> > > Which code? hw.res = 0 ? It not actually used for anything and could be

> > removed. Although, the performance gain would not be huge. Anyway, I'll

> > remove it for v2.

> > 

> > If the CPU time was stored in a 64-bit type, you can use arithmetic

> > operators

> > on the values directly instead of doing arithmetic on a 128-bit compound

> > datatype where 64-bits are unused. This is the union above.

> 

> 

> The implementation above does 64bit arithmetic. The reserved field is not used for anything. V2 removed all .res references (except one due to a false GCC warning). We have abstract API definition, so that application remains portable also, when time cannot be represented with a single integer type.

> 

> It's implementation defined, if e.g. the above sum function is inlined and thus results no overhead at all. 

> 

> -Petri

> 

>
Brian Brooks April 20, 2017, 10:10 p.m. UTC | #4
On 04/27 10:02:24, Savolainen, Petri (Nokia - FI/Espoo) wrote:
> 

> 

> > -----Original Message-----

> > From: Brian Brooks [mailto:brian.brooks@arm.com]

> > Sent: Wednesday, April 19, 2017 10:05 PM

> > To: Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia.com>

> > Cc: lng-odp@lists.linaro.org

> > Subject: Re: [lng-odp] [API-NEXT PATCH 8/8] linux-gen: time: use hw time

> > counter when available

> > 

> > On 04/26 08:02:09, Savolainen, Petri (Nokia - FI/Espoo) wrote:

> > >

> > >

> > > > > > > diff --git a/platform/linux-

> > > > generic/include/odp/api/plat/time_types.h

> > > > > > b/platform/linux-generic/include/odp/api/plat/time_types.h

> > > > > > > index 4847f3b1..8ae7ce82 100644

> > > > > > > --- a/platform/linux-generic/include/odp/api/plat/time_types.h

> > > > > > > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h

> > > > > > > @@ -26,11 +26,28 @@ extern "C" {

> > > > > > >   * the linux timespec structure, which is dependent on POSIX

> > > > extension

> > > > > > level.

> > > > > > >   */

> > > > > > >  typedef struct odp_time_t {

> > > > > > > -	int64_t tv_sec;      /**< @internal Seconds */

> > > > > > > -	int64_t tv_nsec;     /**< @internal Nanoseconds */

> > > > > > > +	union {

> > > > > > > +		/** @internal Posix timespec */

> > > > > > > +		struct {

> > > > > > > +			/** @internal Seconds */

> > > > > > > +			int64_t tv_sec;

> > > > > > > +

> > > > > > > +			/** @internal Nanoseconds */

> > > > > > > +			int64_t tv_nsec;

> > > > > > > +		} spec;

> > > > > > > +

> > > > > > > +		/** @internal HW time counter */

> > > > > > > +		struct {

> > > > > > > +			/** @internal Counter value */

> > > > > > > +			uint64_t count;

> > > > > > > +

> > > > > > > +			/** @internal Reserved */

> > > > > > > +			uint64_t res;

> > > > > > > +		} hw;

> > > > > > > +	};

> > > > > >

> > > > > > A processor's tick counter cannot be used to measure calendar time

> > by

> > > > > > itself. The delta between two ticks, however, can be converted to

> > > > > > calendar time.

> > > > > >

> > > > > > Please see the proposal that introduces odp_tick_t which

> > eliminates

> > > > > > the wasted u64 reserved field in the union above.

> > > > >

> > > > >

> > > > > Nothing is wasted here today.

> > > >

> > > > The 64-bit CPU time is stored in a 128-bit data structure where 64-

> > bits

> > > > are wasted. You can use just a 64-bit type and then convert that to

> > > > a timespec (using a starting timestamp taken on global init) if

> > needed.

> > > >

> > >

> > > Nothing is wasted compared to the situation today. Entire timespec is

> > stored as before. If you want to compress timespec, it's another topic.

> > Compress means additional complexity and overhead. This patch set just

> > adds ability to use 64 but HW time when available. Timespec side

> > implementation remains as is.

> > 

> > You are re-using the calendar time type to store a value read from the CPU

> > counter. A different approach is to use a different type (64 bits only for

> > no

> > wasted space) for the value read from the CPU.

> > 

> > This value must be converted to a calendar time time if needed. It cannot

> > just

> > be used to represent calendar time.

> > 

> > However, for some applications and use cases, you don't need to convert to

> > calendar time

> > in order to measure across two reads of the counter. You can also do

> > direct arithmetic

> > on the value instead of arithmetic on a timespec.

> > 

> > If you want to read code instead of email or design documentation, look at

> > this:

> > 

> > 

> > https://git.linaro.org/people/ola.liljedahl/sw_scheduler.git/tree/time_ben

> > ch.c

> >   https://git.linaro.org/people/ola.liljedahl/sw_scheduler.git/tree/time.h

> > 

> > All you need is a 64 bit type, a way to read from the counter as well as

> > get

> > the frequency, and 2 very small conversion functions.

> 

> 

> First, odp_time_t is our storage type for time values. Today odp-linux defines that as 2x64bits. Nobody has complained about too high time storage consumption. I'm not changing size of the type here. It's not purpose of this patch set to e.g. compress timespec and thus minimize memory foot print of time storage.


odp_time_t represents a calendar time. You can "get" an odp_time_t and tell
what calendar time it is because its value represents a defined offset from
some point in calendar time. You cannot do the same with the value read from
the CPU counter because there is no defined offset from a point in calendar
time. The HW does not guarantee this. That is why I have proposed a separate
odp_tick_t type. This type need also only be 64 bits. Wasting an additional
64 bits (to store a tick in a odp_time_t) is costly when you have many in-
flight events in the system.

> Second, are you proposing an API change? Change of time type definition?


Again, it is documented here: https://docs.google.com/document/d/1sY7rOxqCNu-bMqjBiT5_keAIohrX1ZW-eL0oGLAQ4OM/edit#heading=h.2a3mt9nt0qsp

> Ola's cntr_now() function changes as soon as your time source is not a 64bit counter, but e.g. timespec (2x64bit).


It is not Ola's code, but if it changes the perspective in a positive way it
can be. "as soon as" can be quantified as not ARMv8 and not x64_64. I think
that is the exception rather than the norm. In this case, a struct timespec
can be converted into nanosecond representation and stored in a 64 bit value.

That is the way to design in favor of ARMv8 and x86_64; no space waste and
no expensive conversion into a timespec. Of course, this means that application
code has to use a new type: odp_tick_t.

> It would need to compress that. How application would do arithmetic on the compressed timespec ?


You multiply seconds by 1,000,000,000 and add the nanoseconds to get the epoch
offset in terms of nanoseconds - this can be stored in a uint64_t. And, now you
can "spec" that time is in terms of nanoseconds not "implementation defined".
The application can now do arithmetic too.

> Third, compression/conversion usually mean division operation. Division is tens of times heavier operation than sum, dec, mult. Currently, ODP API is defined so that conversion is done only when application asks for it (odp_time_t <-> nsec time). Compression is a trade-off between storage space size and performance.


I prefer to measure. Perhaps time operations using CPU tick masked by a timespec
should be added here:

  https://git.linaro.org/people/ola.liljedahl/sw_scheduler.git/tree/time_bench.c

Taking the difference between two 64-bit values and calling nsec_from_cntr() is
quite fast - as fast (faster on ARM) as calling clock_gettime() once through the vDSO.

> -Petri

> 

>
Brian Brooks April 20, 2017, 10:14 p.m. UTC | #5
On 04/27 08:21:34, Savolainen, Petri (Nokia - FI/Espoo) wrote:
> 

> 

> > -----Original Message-----

> > From: Brian Brooks [mailto:brian.brooks@arm.com]

> > Sent: Wednesday, April 19, 2017 9:46 PM

> > To: Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia.com>

> > Cc: lng-odp@lists.linaro.org

> > Subject: Re: [lng-odp] [API-NEXT PATCH 8/8] linux-gen: time: use hw time

> > counter when available

> > 

> > On 04/26 07:30:15, Savolainen, Petri (Nokia - FI/Espoo) wrote:

> > >

> > > > > > This function (cpu_global_time()) is called only when we have

> > first

> > > > checked that TSC is invariant. Also we measure the TSC frequency in

> > that

> > > > case. This function is defined in the same file as cpu_cycles(), and

> > the

> > > > file is x86 specific. So, we know what we are doing, and just re-using

> > the

> > > > code to read TSC.

> > > >

> > > > What sort of timing accuracy is expected from the app?

> > > >

> > > > From benchmarking the maximum single-threaded rate of these reads:

> > > >

> > > >  x86_64:

> > > >

> > > >    read       7 ns/op

> > > >    read_sync  22 ns/op

> > > >

> > > >  A57:

> > > >

> > > >    read       4 ns/op

> > > >    read_sync  26 ns/op

> > > >

> > > > read_sync issues a synchronizing instruction for greater timing

> > accuracy

> > > > but clearly takes more time to return the time value read from the

> > core.

> > >

> > > Accuracy is as good as implementation can offer with reasonable

> > overhead. We do not put any nsec figures into API spec. ODP API should

> > offer application the most efficient way to read time anyway.

> > 

> > 'reasonable' is what we need to define.

> > 

> > Another reason why you're seeing a performance boost on x86 is that when

> > switching from clock_gettime() to RDTSC, you're no longer issuing a

> > synchronizing

> > instruction (fence). As shown above, this can be a significant factor

> > depending

> > on how often the time is being sampled.

> > 

> > However, there is a loss in timing accuracy because the load of the value

> > may not happen at the time it happens in program order. This is why a

> > synchronizing instruction needs to be used, but it slows down the

> > execution

> > of the thread on the core...

> > 

> > > This patch does not take a position which way TSC should be read. There

> > are three options: rdtsc, rdtsc + barrier, rdtscp. I think the current

> > code is good enough for the accuracy. Barrier adds slight overhead. Rdtscp

> > is not as widely supported as rdtsc. This detail is a magnitude less

> > significant compared to: use system call vs direct TSC read. It can be

> > tuned later. This patch set helps if rdtscp should be used later on

> > (introduces x86 cpu flags).

> > 

> > So you're saying that you do not need the synchronizing instruction, and

> > the

> > loss of timing accuracy is OK, right?

> 

> 

> What is our timing accuracy today? How much jitter the system call (and everything may launch) causes in the current implementation? How much we are losing accuracy compared what we have today? I'd say we are better off that today anyway, just because we avoid the system call.

> 

> We don't have fence today on rdtsc read for cycle count. This patch does not add it, either. If we are good without it for cycles, we are good for nsec also.

> 

> The scale time scale application typically measures is in order of micro to milliseconds - thousands to millions of nanoseconds. If out of order execution moves the sample position e.g. 20 nsec (40 CPU cycles), it is not significant error to the measurement. Already single cache miss during a measurement causes same kind of noise to the measurement. Also CPU crystal might not be too accurate either, etc.


Thanks for confirming that from the application point of view. Can also
comment as to whether this is the same position for Timer Pool, Timer,
and Timeout events?

> -Petri

> 

>
Brian Brooks April 21, 2017, 5:29 p.m. UTC | #6
On 04/21 16:11:34, Petri Savolainen wrote:
> Use 64 bit HW time counter when available. It is used on

> x86 when invariant TSC CPU flag indicates that TSC frequency

> is constant. Otherwise, the system time is used as before. Direct

> HW time counter usage avoids system call, and related latency

> and performance issues.

> 

> Signed-off-by: Petri Savolainen <petri.savolainen@linaro.org>

> ---

>  platform/linux-generic/Makefile.am                 |   1 +

>  platform/linux-generic/arch/arm/odp_cpu_arch.c     |  11 +

>  platform/linux-generic/arch/default/odp_cpu_arch.c |  11 +

>  platform/linux-generic/arch/mips64/odp_cpu_arch.c  |  11 +

>  platform/linux-generic/arch/powerpc/odp_cpu_arch.c |  11 +

>  platform/linux-generic/arch/x86/cpu_flags.c        |   9 +

>  platform/linux-generic/arch/x86/odp_cpu_arch.c     |  59 ++++

>  .../include/odp/api/plat/time_types.h              |  23 +-

>  platform/linux-generic/include/odp_time_internal.h |  24 ++

>  platform/linux-generic/odp_time.c                  | 303 ++++++++++++++++-----

>  10 files changed, 398 insertions(+), 65 deletions(-)

>  create mode 100644 platform/linux-generic/include/odp_time_internal.h

> 

> diff --git a/platform/linux-generic/Makefile.am b/platform/linux-generic/Makefile.am

> index 60b7f849..ed66fecf 100644

> --- a/platform/linux-generic/Makefile.am

> +++ b/platform/linux-generic/Makefile.am

> @@ -171,6 +171,7 @@ noinst_HEADERS = \

>  		  ${srcdir}/include/odp_schedule_if.h \

>  		  ${srcdir}/include/odp_sorted_list_internal.h \

>  		  ${srcdir}/include/odp_shm_internal.h \

> +		  ${srcdir}/include/odp_time_internal.h \

>  		  ${srcdir}/include/odp_timer_internal.h \

>  		  ${srcdir}/include/odp_timer_wheel_internal.h \

>  		  ${srcdir}/include/odp_traffic_mngr_internal.h \

> diff --git a/platform/linux-generic/arch/arm/odp_cpu_arch.c b/platform/linux-generic/arch/arm/odp_cpu_arch.c

> index 2ac223e0..3a87f09c 100644

> --- a/platform/linux-generic/arch/arm/odp_cpu_arch.c

> +++ b/platform/linux-generic/arch/arm/odp_cpu_arch.c

> @@ -13,6 +13,7 @@

>  #include <odp/api/hints.h>

>  #include <odp/api/system_info.h>

>  #include <odp_debug_internal.h>

> +#include <odp_time_internal.h>

>  

>  #define GIGA 1000000000

>  

> @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void)

>  {

>  	return 1;

>  }

> +

> +uint64_t cpu_global_time(void)

> +{

> +	return 0;

> +}

> +

> +uint64_t cpu_global_time_freq(void)

> +{

> +	return 0;

> +}

> diff --git a/platform/linux-generic/arch/default/odp_cpu_arch.c b/platform/linux-generic/arch/default/odp_cpu_arch.c

> index 2ac223e0..3a87f09c 100644

> --- a/platform/linux-generic/arch/default/odp_cpu_arch.c

> +++ b/platform/linux-generic/arch/default/odp_cpu_arch.c

> @@ -13,6 +13,7 @@

>  #include <odp/api/hints.h>

>  #include <odp/api/system_info.h>

>  #include <odp_debug_internal.h>

> +#include <odp_time_internal.h>

>  

>  #define GIGA 1000000000

>  

> @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void)

>  {

>  	return 1;

>  }

> +

> +uint64_t cpu_global_time(void)

> +{

> +	return 0;

> +}

> +

> +uint64_t cpu_global_time_freq(void)

> +{

> +	return 0;

> +}

> diff --git a/platform/linux-generic/arch/mips64/odp_cpu_arch.c b/platform/linux-generic/arch/mips64/odp_cpu_arch.c

> index 646acf9c..a9a94531 100644

> --- a/platform/linux-generic/arch/mips64/odp_cpu_arch.c

> +++ b/platform/linux-generic/arch/mips64/odp_cpu_arch.c

> @@ -7,6 +7,7 @@

>  #include <odp/api/cpu.h>

>  #include <odp/api/hints.h>

>  #include <odp/api/system_info.h>

> +#include <odp_time_internal.h>

>  

>  uint64_t odp_cpu_cycles(void)

>  {

> @@ -29,3 +30,13 @@ uint64_t odp_cpu_cycles_resolution(void)

>  {

>  	return 1;

>  }

> +

> +uint64_t cpu_global_time(void)

> +{

> +	return 0;

> +}

> +

> +uint64_t cpu_global_time_freq(void)

> +{

> +	return 0;

> +}

> diff --git a/platform/linux-generic/arch/powerpc/odp_cpu_arch.c b/platform/linux-generic/arch/powerpc/odp_cpu_arch.c

> index 2ac223e0..3a87f09c 100644

> --- a/platform/linux-generic/arch/powerpc/odp_cpu_arch.c

> +++ b/platform/linux-generic/arch/powerpc/odp_cpu_arch.c

> @@ -13,6 +13,7 @@

>  #include <odp/api/hints.h>

>  #include <odp/api/system_info.h>

>  #include <odp_debug_internal.h>

> +#include <odp_time_internal.h>

>  

>  #define GIGA 1000000000

>  

> @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void)

>  {

>  	return 1;

>  }

> +

> +uint64_t cpu_global_time(void)

> +{

> +	return 0;

> +}

> +

> +uint64_t cpu_global_time_freq(void)

> +{

> +	return 0;

> +}

> diff --git a/platform/linux-generic/arch/x86/cpu_flags.c b/platform/linux-generic/arch/x86/cpu_flags.c

> index 8fb9477a..cde8ad3e 100644

> --- a/platform/linux-generic/arch/x86/cpu_flags.c

> +++ b/platform/linux-generic/arch/x86/cpu_flags.c

> @@ -38,6 +38,7 @@

>   */

>  

>  #include <arch/x86/cpu_flags.h>

> +#include <odp_time_internal.h>

>  #include <stdio.h>

>  #include <stdint.h>

>  

> @@ -347,3 +348,11 @@ void cpu_flags_print_all(void)

>  

>  	printf("\n\n");

>  }

> +

> +int cpu_has_global_time(void)

> +{

> +	if (cpu_get_flag_enabled(RTE_CPUFLAG_INVTSC) > 0)

> +		return 1;

> +

> +	return 0;

> +}

> diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c b/platform/linux-generic/arch/x86/odp_cpu_arch.c

> index c8cf27b6..9ba601a3 100644

> --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c

> +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c

> @@ -3,7 +3,14 @@

>   *

>   * SPDX-License-Identifier:     BSD-3-Clause

>   */

> +

> +#include <odp_posix_extensions.h>

> +

>  #include <odp/api/cpu.h>

> +#include <odp_time_internal.h>

> +#include <odp_debug_internal.h>

> +

> +#include <time.h>

>  

>  uint64_t odp_cpu_cycles(void)

>  {

> @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void)

>  {

>  	return 1;

>  }

> +

> +uint64_t cpu_global_time(void)

> +{

> +	return odp_cpu_cycles();


A cycle counter cannot always be used to measure time. Even on x86,
odp_cpu_cycles() will return the value of RDTSC which is not actually
representative of the cycle count. Even if the x86 processor is set
to a fixed frequency, the Invariant TSC may run at a different fixed
frequency. Please take a look at the odp_tick_t proposal here:

https://docs.google.com/document/d/1sY7rOxqCNu-bMqjBiT5_keAIohrX1ZW-eL0oGLAQ4OM/edit?usp=sharing

> +}

> +

> +#define SEC_IN_NS 1000000000ULL

> +

> +/* Measure TSC frequency. Frequency information registers are defined for x86,

> + * but those are often not enumerated. */

> +uint64_t cpu_global_time_freq(void)

> +{


The 0x40000010 leaf is standardized in Hyper-V as returning the TSC's
frequency in kHz. It would be better to use this when running under a
hypervisor. It appears to work under VMware as well.

> +	struct timespec sleep, ts1, ts2;

> +	uint64_t t1, t2, ts_nsec, cycles, hz;

> +	int i;

> +	uint64_t avg = 0;

> +	int rounds = 4;

> +

> +	for (i = 0; i < rounds; i++) {

> +		sleep.tv_sec  = 0;

> +		sleep.tv_nsec = SEC_IN_NS / 10;

> +

> +		if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts1)) {

> +			ODP_DBG("clock_gettime failed\n");

> +			return 0;

> +		}

> +

> +		t1 = cpu_global_time();

> +

> +		if (nanosleep(&sleep, NULL) < 0) {

> +			ODP_DBG("nanosleep failed\n");

> +			return 0;

> +		}

> +

> +		if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts2)) {

> +			ODP_DBG("clock_gettime failed\n");

> +			return 0;

> +		}

> +

> +		t2 = cpu_global_time();

> +

> +		ts_nsec  = (ts2.tv_sec - ts1.tv_sec) * SEC_IN_NS;

> +		ts_nsec += ts2.tv_nsec - ts1.tv_nsec;

> +

> +		cycles = t2 - t1;

> +

> +		hz = (cycles * SEC_IN_NS) / ts_nsec;

> +		avg += hz;

> +	}

> +

> +	return avg / rounds;

> +}

> diff --git a/platform/linux-generic/include/odp/api/plat/time_types.h b/platform/linux-generic/include/odp/api/plat/time_types.h

> index 4847f3b1..8ae7ce82 100644

> --- a/platform/linux-generic/include/odp/api/plat/time_types.h

> +++ b/platform/linux-generic/include/odp/api/plat/time_types.h

> @@ -26,11 +26,28 @@ extern "C" {

>   * the linux timespec structure, which is dependent on POSIX extension level.

>   */

>  typedef struct odp_time_t {

> -	int64_t tv_sec;      /**< @internal Seconds */

> -	int64_t tv_nsec;     /**< @internal Nanoseconds */

> +	union {

> +		/** @internal Posix timespec */

> +		struct {

> +			/** @internal Seconds */

> +			int64_t tv_sec;

> +

> +			/** @internal Nanoseconds */

> +			int64_t tv_nsec;

> +		} spec;

> +

> +		/** @internal HW time counter */

> +		struct {

> +			/** @internal Counter value */

> +			uint64_t count;

> +

> +			/** @internal Reserved */

> +			uint64_t res;

> +		} hw;

> +	};


A processor's tick counter cannot be used to measure calendar time by
itself. The delta between two ticks, however, can be converted to
calendar time.

Please see the proposal that introduces odp_tick_t which eliminates
the wasted u64 reserved field in the union above.


>  } odp_time_t;

>  

> -#define ODP_TIME_NULL ((odp_time_t){0, 0})

> +#define ODP_TIME_NULL ((odp_time_t){.spec = {0, 0} })

>  

>  /**

>   * @}

> diff --git a/platform/linux-generic/include/odp_time_internal.h b/platform/linux-generic/include/odp_time_internal.h

> new file mode 100644

> index 00000000..99ac7977

> --- /dev/null

> +++ b/platform/linux-generic/include/odp_time_internal.h

> @@ -0,0 +1,24 @@

> +/* Copyright (c) 2017, Linaro Limited

> + * All rights reserved.

> + *

> + * SPDX-License-Identifier:     BSD-3-Clause

> + */

> +

> +#ifndef ODP_TIME_INTERNAL_H_

> +#define ODP_TIME_INTERNAL_H_

> +

> +#ifdef __cplusplus

> +extern "C" {

> +#endif

> +

> +#include <stdint.h>

> +

> +int cpu_has_global_time(void);

> +uint64_t cpu_global_time(void);

> +uint64_t cpu_global_time_freq(void);

> +

> +#ifdef __cplusplus

> +}

> +#endif

> +

> +#endif

> diff --git a/platform/linux-generic/odp_time.c b/platform/linux-generic/odp_time.c

> index 81e05224..2dd8f2c4 100644

> --- a/platform/linux-generic/odp_time.c

> +++ b/platform/linux-generic/odp_time.c

> @@ -10,36 +10,39 @@

>  #include <odp/api/time.h>

>  #include <odp/api/hints.h>

>  #include <odp_debug_internal.h>

> +#include <odp_time_internal.h>

> +#include <string.h>

> +#include <inttypes.h>

>  

> -static odp_time_t start_time;

> +typedef struct time_global_t {

> +	odp_time_t start_time;

> +	int        use_hw;

> +	uint64_t   hw_start;

> +	uint64_t   hw_freq_hz;

> +} time_global_t;

>  

> -static inline

> -uint64_t time_to_ns(odp_time_t time)

> -{

> -	uint64_t ns;

> -

> -	ns = time.tv_sec * ODP_TIME_SEC_IN_NS;

> -	ns += time.tv_nsec;

> +static time_global_t global;

>  

> -	return ns;

> -}

> +/*

> + * Posix timespec based functions

> + */

>  

> -static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1)

> +static inline odp_time_t time_spec_diff(odp_time_t t2, odp_time_t t1)


Static functions in .c files will always be considered for inlining, so the
inline qualifier is unnecessary here. You can use always_inline compiler
attribute if the disassembly and run time performance is not as expected.

Comment applies to nearly every function in this file.

>  {

>  	odp_time_t time;

>  

> -	time.tv_sec = t2.tv_sec - t1.tv_sec;

> -	time.tv_nsec = t2.tv_nsec - t1.tv_nsec;

> +	time.spec.tv_sec = t2.spec.tv_sec - t1.spec.tv_sec;

> +	time.spec.tv_nsec = t2.spec.tv_nsec - t1.spec.tv_nsec;

>  

> -	if (time.tv_nsec < 0) {

> -		time.tv_nsec += ODP_TIME_SEC_IN_NS;

> -		--time.tv_sec;

> +	if (time.spec.tv_nsec < 0) {

> +		time.spec.tv_nsec += ODP_TIME_SEC_IN_NS;

> +		--time.spec.tv_sec;

>  	}

>  

>  	return time;

>  }

>  

> -static inline odp_time_t time_local(void)

> +static inline odp_time_t time_spec_cur(void)

>  {

>  	int ret;

>  	odp_time_t time;

> @@ -49,77 +52,237 @@ static inline odp_time_t time_local(void)

>  	if (odp_unlikely(ret != 0))

>  		ODP_ABORT("clock_gettime failed\n");

>  

> -	time.tv_sec = sys_time.tv_sec;

> -	time.tv_nsec = sys_time.tv_nsec;

> +	time.spec.tv_sec = sys_time.tv_sec;

> +	time.spec.tv_nsec = sys_time.tv_nsec;

>  

> -	return time_diff(time, start_time);

> +	return time_spec_diff(time, global.start_time);

>  }

>  

> -static inline int time_cmp(odp_time_t t2, odp_time_t t1)

> +static inline uint64_t time_spec_res(void)

>  {

> -	if (t2.tv_sec < t1.tv_sec)

> +	int ret;

> +	struct timespec tres;

> +

> +	ret = clock_getres(CLOCK_MONOTONIC_RAW, &tres);

> +	if (odp_unlikely(ret != 0))

> +		ODP_ABORT("clock_getres failed\n");

> +

> +	return ODP_TIME_SEC_IN_NS / (uint64_t)tres.tv_nsec;

> +}

> +

> +static inline int time_spec_cmp(odp_time_t t2, odp_time_t t1)

> +{

> +	if (t2.spec.tv_sec < t1.spec.tv_sec)

>  		return -1;

>  

> -	if (t2.tv_sec > t1.tv_sec)

> +	if (t2.spec.tv_sec > t1.spec.tv_sec)

>  		return 1;

>  

> -	return t2.tv_nsec - t1.tv_nsec;

> +	return t2.spec.tv_nsec - t1.spec.tv_nsec;

>  }

>  

> -static inline odp_time_t time_sum(odp_time_t t1, odp_time_t t2)

> +static inline odp_time_t time_spec_sum(odp_time_t t1, odp_time_t t2)

>  {

>  	odp_time_t time;

>  

> -	time.tv_sec = t2.tv_sec + t1.tv_sec;

> -	time.tv_nsec = t2.tv_nsec + t1.tv_nsec;

> +	time.spec.tv_sec = t2.spec.tv_sec + t1.spec.tv_sec;

> +	time.spec.tv_nsec = t2.spec.tv_nsec + t1.spec.tv_nsec;

>  

> -	if (time.tv_nsec >= (long)ODP_TIME_SEC_IN_NS) {

> -		time.tv_nsec -= ODP_TIME_SEC_IN_NS;

> -		++time.tv_sec;

> +	if (time.spec.tv_nsec >= (long)ODP_TIME_SEC_IN_NS) {

> +		time.spec.tv_nsec -= ODP_TIME_SEC_IN_NS;

> +		++time.spec.tv_sec;

>  	}

>  

>  	return time;

>  }

>  

> -static inline odp_time_t time_local_from_ns(uint64_t ns)

> +static inline uint64_t time_spec_to_ns(odp_time_t time)

> +{

> +	uint64_t ns;

> +

> +	ns = time.spec.tv_sec * ODP_TIME_SEC_IN_NS;

> +	ns += time.spec.tv_nsec;

> +

> +	return ns;

> +}

> +

> +static inline odp_time_t time_spec_from_ns(uint64_t ns)

>  {

>  	odp_time_t time;

>  

> -	time.tv_sec = ns / ODP_TIME_SEC_IN_NS;

> -	time.tv_nsec = ns - time.tv_sec * ODP_TIME_SEC_IN_NS;

> +	time.spec.tv_sec = ns / ODP_TIME_SEC_IN_NS;

> +	time.spec.tv_nsec = ns - time.spec.tv_sec * ODP_TIME_SEC_IN_NS;

>  

>  	return time;

>  }

>  

> -static inline void time_wait_until(odp_time_t time)

> +/*

> + * HW time counter based functions

> + */

> +

> +static inline odp_time_t time_hw_cur(void)

>  {

> -	odp_time_t cur;

> +	odp_time_t time;

>  

> -	do {

> -		cur = time_local();

> -	} while (time_cmp(time, cur) > 0);

> +	time.hw.res   = 0;

> +	time.hw.count = cpu_global_time() - global.hw_start;

> +

> +	return time;

>  }

>  

> -static inline uint64_t time_local_res(void)

> +static inline uint64_t time_hw_res(void)


I think 'freq' is a more descriptive name than 'res'.

It would also be good to document at either the function declaration
or the function definition (if multiple implementations differ) whether
the frequency is in Hz or kHz. For example, on ARMv8 it is just a
register read to get Hz. No estimation, and no lowering to kHz.

>  {

> -	int ret;

> -	struct timespec tres;

> +	/* Promise a bit lower resolution than average cycle counter

> +	 * frequency */

> +	return global.hw_freq_hz / 10;

> +}

>  

> -	ret = clock_getres(CLOCK_MONOTONIC_RAW, &tres);

> -	if (odp_unlikely(ret != 0))

> -		ODP_ABORT("clock_getres failed\n");

> +static inline int time_hw_cmp(odp_time_t t2, odp_time_t t1)

> +{

> +	if (odp_likely(t2.hw.count > t1.hw.count))

> +		return 1;

>  

> -	return ODP_TIME_SEC_IN_NS / (uint64_t)tres.tv_nsec;

> +	if (t2.hw.count < t1.hw.count)

> +		return -1;

> +

> +	return 0;

> +}

> +

> +static inline odp_time_t time_hw_diff(odp_time_t t2, odp_time_t t1)

> +{

> +	odp_time_t time;

> +

> +	time.hw.res = 0;

> +	time.hw.count = t2.hw.count - t1.hw.count;

> +

> +	return time;

> +}

> +

> +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2)

> +{

> +	odp_time_t time;

> +

> +	time.hw.res = 0;

> +	time.hw.count = t1.hw.count + t2.hw.count;

> +

> +	return time;

> +}


If a single u64 were used this code wouldn't need to exist.

> +static inline uint64_t time_hw_to_ns(odp_time_t time)

> +{

> +	uint64_t nsec;

> +	uint64_t freq_hz = global.hw_freq_hz;

> +	uint64_t count = time.hw.count;

> +	uint64_t sec = 0;

> +

> +	if (count >= freq_hz) {

> +		sec   = count / freq_hz;

> +		count = count - sec * freq_hz;

> +	}

> +

> +	nsec = (ODP_TIME_SEC_IN_NS * count) / freq_hz;

> +

> +	return (sec * ODP_TIME_SEC_IN_NS) + nsec;

> +}

> +

> +static inline odp_time_t time_hw_from_ns(uint64_t ns)

> +{

> +	odp_time_t time;

> +	uint64_t count;

> +	uint64_t freq_hz = global.hw_freq_hz;

> +	uint64_t sec = 0;

> +

> +	if (ns >= ODP_TIME_SEC_IN_NS) {

> +		sec = ns / ODP_TIME_SEC_IN_NS;

> +		ns  = ns - sec * ODP_TIME_SEC_IN_NS;

> +	}

> +

> +	count  = sec * freq_hz;

> +	count += (ns * freq_hz) / ODP_TIME_SEC_IN_NS;

> +

> +	time.hw.res   = 0;

> +	time.hw.count = count;

> +

> +	return time;

> +}

> +

> +/*

> + * Common functions

> + */

> +

> +static inline odp_time_t time_cur(void)

> +{

> +	if (global.use_hw)

> +		return time_hw_cur();

> +

> +	return time_spec_cur();

> +}

> +

> +static inline uint64_t time_res(void)

> +{

> +	if (global.use_hw)

> +		return time_hw_res();

> +

> +	return time_spec_res();

> +}

> +

> +static inline int time_cmp(odp_time_t t2, odp_time_t t1)

> +{

> +	if (global.use_hw)

> +		return time_hw_cmp(t2, t1);

> +

> +	return time_spec_cmp(t2, t1);

> +}

> +

> +static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1)

> +{

> +	if (global.use_hw)

> +		return time_hw_diff(t2, t1);

> +

> +	return time_spec_diff(t2, t1);

> +}

> +

> +static inline odp_time_t time_sum(odp_time_t t1, odp_time_t t2)

> +{

> +	if (global.use_hw)

> +		return time_hw_sum(t1, t2);

> +

> +	return time_spec_sum(t1, t2);

> +}

> +

> +static inline uint64_t time_to_ns(odp_time_t time)

> +{

> +	if (global.use_hw)

> +		return time_hw_to_ns(time);

> +

> +	return time_spec_to_ns(time);

> +}

> +

> +static inline odp_time_t time_from_ns(uint64_t ns)

> +{

> +	if (global.use_hw)

> +		return time_hw_from_ns(ns);

> +

> +	return time_spec_from_ns(ns);

> +}

> +

> +static inline void time_wait_until(odp_time_t time)

> +{

> +	odp_time_t cur;

> +

> +	do {

> +		cur = time_cur();

> +	} while (time_cmp(time, cur) > 0);

>  }

>  

>  odp_time_t odp_time_local(void)

>  {

> -	return time_local();

> +	return time_cur();

>  }

>  

>  odp_time_t odp_time_global(void)

>  {

> -	return time_local();

> +	return time_cur();

>  }

>  

>  odp_time_t odp_time_diff(odp_time_t t2, odp_time_t t1)

> @@ -134,12 +297,12 @@ uint64_t odp_time_to_ns(odp_time_t time)

>  

>  odp_time_t odp_time_local_from_ns(uint64_t ns)

>  {

> -	return time_local_from_ns(ns);

> +	return time_from_ns(ns);

>  }

>  

>  odp_time_t odp_time_global_from_ns(uint64_t ns)

>  {

> -	return time_local_from_ns(ns);

> +	return time_from_ns(ns);

>  }

>  

>  int odp_time_cmp(odp_time_t t2, odp_time_t t1)

> @@ -154,18 +317,18 @@ odp_time_t odp_time_sum(odp_time_t t1, odp_time_t t2)

>  

>  uint64_t odp_time_local_res(void)

>  {

> -	return time_local_res();

> +	return time_res();

>  }

>  

>  uint64_t odp_time_global_res(void)

>  {

> -	return time_local_res();

> +	return time_res();

>  }

>  

>  void odp_time_wait_ns(uint64_t ns)

>  {

> -	odp_time_t cur = time_local();

> -	odp_time_t wait = time_local_from_ns(ns);

> +	odp_time_t cur = time_cur();

> +	odp_time_t wait = time_from_ns(ns);

>  	odp_time_t end_time = time_sum(cur, wait);

>  

>  	time_wait_until(end_time);

> @@ -193,15 +356,31 @@ uint64_t odp_time_to_u64(odp_time_t time)

>  

>  int odp_time_init_global(void)

>  {

> -	int ret;

> -	struct timespec time;

> -

> -	ret = clock_gettime(CLOCK_MONOTONIC_RAW, &time);

> -	if (ret) {

> -		start_time = ODP_TIME_NULL;

> -	} else {

> -		start_time.tv_sec = time.tv_sec;

> -		start_time.tv_nsec = time.tv_nsec;

> +	struct timespec sys_time;

> +	int ret = 0;

> +

> +	memset(&global, 0, sizeof(time_global_t));

> +

> +	if (cpu_has_global_time()) {

> +		global.use_hw = 1;

> +		global.hw_freq_hz  = cpu_global_time_freq();

> +

> +		if (global.hw_freq_hz == 0)

> +			return -1;

> +

> +		printf("HW time counter freq: %" PRIu64 " hz\n\n",

> +		       global.hw_freq_hz);

> +

> +		global.hw_start = cpu_global_time();

> +		return 0;

> +	}

> +

> +	global.start_time = ODP_TIME_NULL;

> +

> +	ret = clock_gettime(CLOCK_MONOTONIC_RAW, &sys_time);

> +	if (ret == 0) {

> +		global.start_time.spec.tv_sec  = sys_time.tv_sec;

> +		global.start_time.spec.tv_nsec = sys_time.tv_nsec;

>  	}

>  

>  	return ret;

> -- 

> 2.11.0

>
Bill Fischofer April 21, 2017, 7:19 p.m. UTC | #7
Good stuff. I'll put this on the agenda for Monday's ARCH call to discuss.

On Fri, Apr 21, 2017 at 12:29 PM, Brian Brooks <brian.brooks@arm.com> wrote:

> On 04/21 16:11:34, Petri Savolainen wrote:

> > Use 64 bit HW time counter when available. It is used on

> > x86 when invariant TSC CPU flag indicates that TSC frequency

> > is constant. Otherwise, the system time is used as before. Direct

> > HW time counter usage avoids system call, and related latency

> > and performance issues.

> >

> > Signed-off-by: Petri Savolainen <petri.savolainen@linaro.org>

> > ---

> >  platform/linux-generic/Makefile.am                 |   1 +

> >  platform/linux-generic/arch/arm/odp_cpu_arch.c     |  11 +

> >  platform/linux-generic/arch/default/odp_cpu_arch.c |  11 +

> >  platform/linux-generic/arch/mips64/odp_cpu_arch.c  |  11 +

> >  platform/linux-generic/arch/powerpc/odp_cpu_arch.c |  11 +

> >  platform/linux-generic/arch/x86/cpu_flags.c        |   9 +

> >  platform/linux-generic/arch/x86/odp_cpu_arch.c     |  59 ++++

> >  .../include/odp/api/plat/time_types.h              |  23 +-

> >  platform/linux-generic/include/odp_time_internal.h |  24 ++

> >  platform/linux-generic/odp_time.c                  | 303

> ++++++++++++++++-----

> >  10 files changed, 398 insertions(+), 65 deletions(-)

> >  create mode 100644 platform/linux-generic/include/odp_time_internal.h

> >

> > diff --git a/platform/linux-generic/Makefile.am

> b/platform/linux-generic/Makefile.am

> > index 60b7f849..ed66fecf 100644

> > --- a/platform/linux-generic/Makefile.am

> > +++ b/platform/linux-generic/Makefile.am

> > @@ -171,6 +171,7 @@ noinst_HEADERS = \

> >                 ${srcdir}/include/odp_schedule_if.h \

> >                 ${srcdir}/include/odp_sorted_list_internal.h \

> >                 ${srcdir}/include/odp_shm_internal.h \

> > +               ${srcdir}/include/odp_time_internal.h \

> >                 ${srcdir}/include/odp_timer_internal.h \

> >                 ${srcdir}/include/odp_timer_wheel_internal.h \

> >                 ${srcdir}/include/odp_traffic_mngr_internal.h \

> > diff --git a/platform/linux-generic/arch/arm/odp_cpu_arch.c

> b/platform/linux-generic/arch/arm/odp_cpu_arch.c

> > index 2ac223e0..3a87f09c 100644

> > --- a/platform/linux-generic/arch/arm/odp_cpu_arch.c

> > +++ b/platform/linux-generic/arch/arm/odp_cpu_arch.c

> > @@ -13,6 +13,7 @@

> >  #include <odp/api/hints.h>

> >  #include <odp/api/system_info.h>

> >  #include <odp_debug_internal.h>

> > +#include <odp_time_internal.h>

> >

> >  #define GIGA 1000000000

> >

> > @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void)

> >  {

> >       return 1;

> >  }

> > +

> > +uint64_t cpu_global_time(void)

> > +{

> > +     return 0;

> > +}

> > +

> > +uint64_t cpu_global_time_freq(void)

> > +{

> > +     return 0;

> > +}

> > diff --git a/platform/linux-generic/arch/default/odp_cpu_arch.c

> b/platform/linux-generic/arch/default/odp_cpu_arch.c

> > index 2ac223e0..3a87f09c 100644

> > --- a/platform/linux-generic/arch/default/odp_cpu_arch.c

> > +++ b/platform/linux-generic/arch/default/odp_cpu_arch.c

> > @@ -13,6 +13,7 @@

> >  #include <odp/api/hints.h>

> >  #include <odp/api/system_info.h>

> >  #include <odp_debug_internal.h>

> > +#include <odp_time_internal.h>

> >

> >  #define GIGA 1000000000

> >

> > @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void)

> >  {

> >       return 1;

> >  }

> > +

> > +uint64_t cpu_global_time(void)

> > +{

> > +     return 0;

> > +}

> > +

> > +uint64_t cpu_global_time_freq(void)

> > +{

> > +     return 0;

> > +}

> > diff --git a/platform/linux-generic/arch/mips64/odp_cpu_arch.c

> b/platform/linux-generic/arch/mips64/odp_cpu_arch.c

> > index 646acf9c..a9a94531 100644

> > --- a/platform/linux-generic/arch/mips64/odp_cpu_arch.c

> > +++ b/platform/linux-generic/arch/mips64/odp_cpu_arch.c

> > @@ -7,6 +7,7 @@

> >  #include <odp/api/cpu.h>

> >  #include <odp/api/hints.h>

> >  #include <odp/api/system_info.h>

> > +#include <odp_time_internal.h>

> >

> >  uint64_t odp_cpu_cycles(void)

> >  {

> > @@ -29,3 +30,13 @@ uint64_t odp_cpu_cycles_resolution(void)

> >  {

> >       return 1;

> >  }

> > +

> > +uint64_t cpu_global_time(void)

> > +{

> > +     return 0;

> > +}

> > +

> > +uint64_t cpu_global_time_freq(void)

> > +{

> > +     return 0;

> > +}

> > diff --git a/platform/linux-generic/arch/powerpc/odp_cpu_arch.c

> b/platform/linux-generic/arch/powerpc/odp_cpu_arch.c

> > index 2ac223e0..3a87f09c 100644

> > --- a/platform/linux-generic/arch/powerpc/odp_cpu_arch.c

> > +++ b/platform/linux-generic/arch/powerpc/odp_cpu_arch.c

> > @@ -13,6 +13,7 @@

> >  #include <odp/api/hints.h>

> >  #include <odp/api/system_info.h>

> >  #include <odp_debug_internal.h>

> > +#include <odp_time_internal.h>

> >

> >  #define GIGA 1000000000

> >

> > @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void)

> >  {

> >       return 1;

> >  }

> > +

> > +uint64_t cpu_global_time(void)

> > +{

> > +     return 0;

> > +}

> > +

> > +uint64_t cpu_global_time_freq(void)

> > +{

> > +     return 0;

> > +}

> > diff --git a/platform/linux-generic/arch/x86/cpu_flags.c

> b/platform/linux-generic/arch/x86/cpu_flags.c

> > index 8fb9477a..cde8ad3e 100644

> > --- a/platform/linux-generic/arch/x86/cpu_flags.c

> > +++ b/platform/linux-generic/arch/x86/cpu_flags.c

> > @@ -38,6 +38,7 @@

> >   */

> >

> >  #include <arch/x86/cpu_flags.h>

> > +#include <odp_time_internal.h>

> >  #include <stdio.h>

> >  #include <stdint.h>

> >

> > @@ -347,3 +348,11 @@ void cpu_flags_print_all(void)

> >

> >       printf("\n\n");

> >  }

> > +

> > +int cpu_has_global_time(void)

> > +{

> > +     if (cpu_get_flag_enabled(RTE_CPUFLAG_INVTSC) > 0)

> > +             return 1;

> > +

> > +     return 0;

> > +}

> > diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c

> b/platform/linux-generic/arch/x86/odp_cpu_arch.c

> > index c8cf27b6..9ba601a3 100644

> > --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c

> > +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c

> > @@ -3,7 +3,14 @@

> >   *

> >   * SPDX-License-Identifier:     BSD-3-Clause

> >   */

> > +

> > +#include <odp_posix_extensions.h>

> > +

> >  #include <odp/api/cpu.h>

> > +#include <odp_time_internal.h>

> > +#include <odp_debug_internal.h>

> > +

> > +#include <time.h>

> >

> >  uint64_t odp_cpu_cycles(void)

> >  {

> > @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void)

> >  {

> >       return 1;

> >  }

> > +

> > +uint64_t cpu_global_time(void)

> > +{

> > +     return odp_cpu_cycles();

>

> A cycle counter cannot always be used to measure time. Even on x86,

> odp_cpu_cycles() will return the value of RDTSC which is not actually

> representative of the cycle count. Even if the x86 processor is set

> to a fixed frequency, the Invariant TSC may run at a different fixed

> frequency. Please take a look at the odp_tick_t proposal here:

>

> https://docs.google.com/document/d/1sY7rOxqCNu-bMqjBiT5_keAIohrX1ZW-

> eL0oGLAQ4OM/edit?usp=sharing

>

> > +}

> > +

> > +#define SEC_IN_NS 1000000000ULL

> > +

> > +/* Measure TSC frequency. Frequency information registers are defined

> for x86,

> > + * but those are often not enumerated. */

> > +uint64_t cpu_global_time_freq(void)

> > +{

>

> The 0x40000010 leaf is standardized in Hyper-V as returning the TSC's

> frequency in kHz. It would be better to use this when running under a

> hypervisor. It appears to work under VMware as well.

>

> > +     struct timespec sleep, ts1, ts2;

> > +     uint64_t t1, t2, ts_nsec, cycles, hz;

> > +     int i;

> > +     uint64_t avg = 0;

> > +     int rounds = 4;

> > +

> > +     for (i = 0; i < rounds; i++) {

> > +             sleep.tv_sec  = 0;

> > +             sleep.tv_nsec = SEC_IN_NS / 10;

> > +

> > +             if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts1)) {

> > +                     ODP_DBG("clock_gettime failed\n");

> > +                     return 0;

> > +             }

> > +

> > +             t1 = cpu_global_time();

> > +

> > +             if (nanosleep(&sleep, NULL) < 0) {

> > +                     ODP_DBG("nanosleep failed\n");

> > +                     return 0;

> > +             }

> > +

> > +             if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts2)) {

> > +                     ODP_DBG("clock_gettime failed\n");

> > +                     return 0;

> > +             }

> > +

> > +             t2 = cpu_global_time();

> > +

> > +             ts_nsec  = (ts2.tv_sec - ts1.tv_sec) * SEC_IN_NS;

> > +             ts_nsec += ts2.tv_nsec - ts1.tv_nsec;

> > +

> > +             cycles = t2 - t1;

> > +

> > +             hz = (cycles * SEC_IN_NS) / ts_nsec;

> > +             avg += hz;

> > +     }

> > +

> > +     return avg / rounds;

> > +}

> > diff --git a/platform/linux-generic/include/odp/api/plat/time_types.h

> b/platform/linux-generic/include/odp/api/plat/time_types.h

> > index 4847f3b1..8ae7ce82 100644

> > --- a/platform/linux-generic/include/odp/api/plat/time_types.h

> > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h

> > @@ -26,11 +26,28 @@ extern "C" {

> >   * the linux timespec structure, which is dependent on POSIX extension

> level.

> >   */

> >  typedef struct odp_time_t {

> > -     int64_t tv_sec;      /**< @internal Seconds */

> > -     int64_t tv_nsec;     /**< @internal Nanoseconds */

> > +     union {

> > +             /** @internal Posix timespec */

> > +             struct {

> > +                     /** @internal Seconds */

> > +                     int64_t tv_sec;

> > +

> > +                     /** @internal Nanoseconds */

> > +                     int64_t tv_nsec;

> > +             } spec;

> > +

> > +             /** @internal HW time counter */

> > +             struct {

> > +                     /** @internal Counter value */

> > +                     uint64_t count;

> > +

> > +                     /** @internal Reserved */

> > +                     uint64_t res;

> > +             } hw;

> > +     };

>

> A processor's tick counter cannot be used to measure calendar time by

> itself. The delta between two ticks, however, can be converted to

> calendar time.

>

> Please see the proposal that introduces odp_tick_t which eliminates

> the wasted u64 reserved field in the union above.

>

>

> >  } odp_time_t;

> >

> > -#define ODP_TIME_NULL ((odp_time_t){0, 0})

> > +#define ODP_TIME_NULL ((odp_time_t){.spec = {0, 0} })

> >

> >  /**

> >   * @}

> > diff --git a/platform/linux-generic/include/odp_time_internal.h

> b/platform/linux-generic/include/odp_time_internal.h

> > new file mode 100644

> > index 00000000..99ac7977

> > --- /dev/null

> > +++ b/platform/linux-generic/include/odp_time_internal.h

> > @@ -0,0 +1,24 @@

> > +/* Copyright (c) 2017, Linaro Limited

> > + * All rights reserved.

> > + *

> > + * SPDX-License-Identifier:     BSD-3-Clause

> > + */

> > +

> > +#ifndef ODP_TIME_INTERNAL_H_

> > +#define ODP_TIME_INTERNAL_H_

> > +

> > +#ifdef __cplusplus

> > +extern "C" {

> > +#endif

> > +

> > +#include <stdint.h>

> > +

> > +int cpu_has_global_time(void);

> > +uint64_t cpu_global_time(void);

> > +uint64_t cpu_global_time_freq(void);

> > +

> > +#ifdef __cplusplus

> > +}

> > +#endif

> > +

> > +#endif

> > diff --git a/platform/linux-generic/odp_time.c

> b/platform/linux-generic/odp_time.c

> > index 81e05224..2dd8f2c4 100644

> > --- a/platform/linux-generic/odp_time.c

> > +++ b/platform/linux-generic/odp_time.c

> > @@ -10,36 +10,39 @@

> >  #include <odp/api/time.h>

> >  #include <odp/api/hints.h>

> >  #include <odp_debug_internal.h>

> > +#include <odp_time_internal.h>

> > +#include <string.h>

> > +#include <inttypes.h>

> >

> > -static odp_time_t start_time;

> > +typedef struct time_global_t {

> > +     odp_time_t start_time;

> > +     int        use_hw;

> > +     uint64_t   hw_start;

> > +     uint64_t   hw_freq_hz;

> > +} time_global_t;

> >

> > -static inline

> > -uint64_t time_to_ns(odp_time_t time)

> > -{

> > -     uint64_t ns;

> > -

> > -     ns = time.tv_sec * ODP_TIME_SEC_IN_NS;

> > -     ns += time.tv_nsec;

> > +static time_global_t global;

> >

> > -     return ns;

> > -}

> > +/*

> > + * Posix timespec based functions

> > + */

> >

> > -static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1)

> > +static inline odp_time_t time_spec_diff(odp_time_t t2, odp_time_t t1)

>

> Static functions in .c files will always be considered for inlining, so the

> inline qualifier is unnecessary here. You can use always_inline compiler

> attribute if the disassembly and run time performance is not as expected.

>

> Comment applies to nearly every function in this file.

>

> >  {

> >       odp_time_t time;

> >

> > -     time.tv_sec = t2.tv_sec - t1.tv_sec;

> > -     time.tv_nsec = t2.tv_nsec - t1.tv_nsec;

> > +     time.spec.tv_sec = t2.spec.tv_sec - t1.spec.tv_sec;

> > +     time.spec.tv_nsec = t2.spec.tv_nsec - t1.spec.tv_nsec;

> >

> > -     if (time.tv_nsec < 0) {

> > -             time.tv_nsec += ODP_TIME_SEC_IN_NS;

> > -             --time.tv_sec;

> > +     if (time.spec.tv_nsec < 0) {

> > +             time.spec.tv_nsec += ODP_TIME_SEC_IN_NS;

> > +             --time.spec.tv_sec;

> >       }

> >

> >       return time;

> >  }

> >

> > -static inline odp_time_t time_local(void)

> > +static inline odp_time_t time_spec_cur(void)

> >  {

> >       int ret;

> >       odp_time_t time;

> > @@ -49,77 +52,237 @@ static inline odp_time_t time_local(void)

> >       if (odp_unlikely(ret != 0))

> >               ODP_ABORT("clock_gettime failed\n");

> >

> > -     time.tv_sec = sys_time.tv_sec;

> > -     time.tv_nsec = sys_time.tv_nsec;

> > +     time.spec.tv_sec = sys_time.tv_sec;

> > +     time.spec.tv_nsec = sys_time.tv_nsec;

> >

> > -     return time_diff(time, start_time);

> > +     return time_spec_diff(time, global.start_time);

> >  }

> >

> > -static inline int time_cmp(odp_time_t t2, odp_time_t t1)

> > +static inline uint64_t time_spec_res(void)

> >  {

> > -     if (t2.tv_sec < t1.tv_sec)

> > +     int ret;

> > +     struct timespec tres;

> > +

> > +     ret = clock_getres(CLOCK_MONOTONIC_RAW, &tres);

> > +     if (odp_unlikely(ret != 0))

> > +             ODP_ABORT("clock_getres failed\n");

> > +

> > +     return ODP_TIME_SEC_IN_NS / (uint64_t)tres.tv_nsec;

> > +}

> > +

> > +static inline int time_spec_cmp(odp_time_t t2, odp_time_t t1)

> > +{

> > +     if (t2.spec.tv_sec < t1.spec.tv_sec)

> >               return -1;

> >

> > -     if (t2.tv_sec > t1.tv_sec)

> > +     if (t2.spec.tv_sec > t1.spec.tv_sec)

> >               return 1;

> >

> > -     return t2.tv_nsec - t1.tv_nsec;

> > +     return t2.spec.tv_nsec - t1.spec.tv_nsec;

> >  }

> >

> > -static inline odp_time_t time_sum(odp_time_t t1, odp_time_t t2)

> > +static inline odp_time_t time_spec_sum(odp_time_t t1, odp_time_t t2)

> >  {

> >       odp_time_t time;

> >

> > -     time.tv_sec = t2.tv_sec + t1.tv_sec;

> > -     time.tv_nsec = t2.tv_nsec + t1.tv_nsec;

> > +     time.spec.tv_sec = t2.spec.tv_sec + t1.spec.tv_sec;

> > +     time.spec.tv_nsec = t2.spec.tv_nsec + t1.spec.tv_nsec;

> >

> > -     if (time.tv_nsec >= (long)ODP_TIME_SEC_IN_NS) {

> > -             time.tv_nsec -= ODP_TIME_SEC_IN_NS;

> > -             ++time.tv_sec;

> > +     if (time.spec.tv_nsec >= (long)ODP_TIME_SEC_IN_NS) {

> > +             time.spec.tv_nsec -= ODP_TIME_SEC_IN_NS;

> > +             ++time.spec.tv_sec;

> >       }

> >

> >       return time;

> >  }

> >

> > -static inline odp_time_t time_local_from_ns(uint64_t ns)

> > +static inline uint64_t time_spec_to_ns(odp_time_t time)

> > +{

> > +     uint64_t ns;

> > +

> > +     ns = time.spec.tv_sec * ODP_TIME_SEC_IN_NS;

> > +     ns += time.spec.tv_nsec;

> > +

> > +     return ns;

> > +}

> > +

> > +static inline odp_time_t time_spec_from_ns(uint64_t ns)

> >  {

> >       odp_time_t time;

> >

> > -     time.tv_sec = ns / ODP_TIME_SEC_IN_NS;

> > -     time.tv_nsec = ns - time.tv_sec * ODP_TIME_SEC_IN_NS;

> > +     time.spec.tv_sec = ns / ODP_TIME_SEC_IN_NS;

> > +     time.spec.tv_nsec = ns - time.spec.tv_sec * ODP_TIME_SEC_IN_NS;

> >

> >       return time;

> >  }

> >

> > -static inline void time_wait_until(odp_time_t time)

> > +/*

> > + * HW time counter based functions

> > + */

> > +

> > +static inline odp_time_t time_hw_cur(void)

> >  {

> > -     odp_time_t cur;

> > +     odp_time_t time;

> >

> > -     do {

> > -             cur = time_local();

> > -     } while (time_cmp(time, cur) > 0);

> > +     time.hw.res   = 0;

> > +     time.hw.count = cpu_global_time() - global.hw_start;

> > +

> > +     return time;

> >  }

> >

> > -static inline uint64_t time_local_res(void)

> > +static inline uint64_t time_hw_res(void)

>

> I think 'freq' is a more descriptive name than 'res'.

>

> It would also be good to document at either the function declaration

> or the function definition (if multiple implementations differ) whether

> the frequency is in Hz or kHz. For example, on ARMv8 it is just a

> register read to get Hz. No estimation, and no lowering to kHz.

>

> >  {

> > -     int ret;

> > -     struct timespec tres;

> > +     /* Promise a bit lower resolution than average cycle counter

> > +      * frequency */

> > +     return global.hw_freq_hz / 10;

> > +}

> >

> > -     ret = clock_getres(CLOCK_MONOTONIC_RAW, &tres);

> > -     if (odp_unlikely(ret != 0))

> > -             ODP_ABORT("clock_getres failed\n");

> > +static inline int time_hw_cmp(odp_time_t t2, odp_time_t t1)

> > +{

> > +     if (odp_likely(t2.hw.count > t1.hw.count))

> > +             return 1;

> >

> > -     return ODP_TIME_SEC_IN_NS / (uint64_t)tres.tv_nsec;

> > +     if (t2.hw.count < t1.hw.count)

> > +             return -1;

> > +

> > +     return 0;

> > +}

> > +

> > +static inline odp_time_t time_hw_diff(odp_time_t t2, odp_time_t t1)

> > +{

> > +     odp_time_t time;

> > +

> > +     time.hw.res = 0;

> > +     time.hw.count = t2.hw.count - t1.hw.count;

> > +

> > +     return time;

> > +}

> > +

> > +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2)

> > +{

> > +     odp_time_t time;

> > +

> > +     time.hw.res = 0;

> > +     time.hw.count = t1.hw.count + t2.hw.count;

> > +

> > +     return time;

> > +}

>

> If a single u64 were used this code wouldn't need to exist.

>

> > +static inline uint64_t time_hw_to_ns(odp_time_t time)

> > +{

> > +     uint64_t nsec;

> > +     uint64_t freq_hz = global.hw_freq_hz;

> > +     uint64_t count = time.hw.count;

> > +     uint64_t sec = 0;

> > +

> > +     if (count >= freq_hz) {

> > +             sec   = count / freq_hz;

> > +             count = count - sec * freq_hz;

> > +     }

> > +

> > +     nsec = (ODP_TIME_SEC_IN_NS * count) / freq_hz;

> > +

> > +     return (sec * ODP_TIME_SEC_IN_NS) + nsec;

> > +}

> > +

> > +static inline odp_time_t time_hw_from_ns(uint64_t ns)

> > +{

> > +     odp_time_t time;

> > +     uint64_t count;

> > +     uint64_t freq_hz = global.hw_freq_hz;

> > +     uint64_t sec = 0;

> > +

> > +     if (ns >= ODP_TIME_SEC_IN_NS) {

> > +             sec = ns / ODP_TIME_SEC_IN_NS;

> > +             ns  = ns - sec * ODP_TIME_SEC_IN_NS;

> > +     }

> > +

> > +     count  = sec * freq_hz;

> > +     count += (ns * freq_hz) / ODP_TIME_SEC_IN_NS;

> > +

> > +     time.hw.res   = 0;

> > +     time.hw.count = count;

> > +

> > +     return time;

> > +}

> > +

> > +/*

> > + * Common functions

> > + */

> > +

> > +static inline odp_time_t time_cur(void)

> > +{

> > +     if (global.use_hw)

> > +             return time_hw_cur();

> > +

> > +     return time_spec_cur();

> > +}

> > +

> > +static inline uint64_t time_res(void)

> > +{

> > +     if (global.use_hw)

> > +             return time_hw_res();

> > +

> > +     return time_spec_res();

> > +}

> > +

> > +static inline int time_cmp(odp_time_t t2, odp_time_t t1)

> > +{

> > +     if (global.use_hw)

> > +             return time_hw_cmp(t2, t1);

> > +

> > +     return time_spec_cmp(t2, t1);

> > +}

> > +

> > +static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1)

> > +{

> > +     if (global.use_hw)

> > +             return time_hw_diff(t2, t1);

> > +

> > +     return time_spec_diff(t2, t1);

> > +}

> > +

> > +static inline odp_time_t time_sum(odp_time_t t1, odp_time_t t2)

> > +{

> > +     if (global.use_hw)

> > +             return time_hw_sum(t1, t2);

> > +

> > +     return time_spec_sum(t1, t2);

> > +}

> > +

> > +static inline uint64_t time_to_ns(odp_time_t time)

> > +{

> > +     if (global.use_hw)

> > +             return time_hw_to_ns(time);

> > +

> > +     return time_spec_to_ns(time);

> > +}

> > +

> > +static inline odp_time_t time_from_ns(uint64_t ns)

> > +{

> > +     if (global.use_hw)

> > +             return time_hw_from_ns(ns);

> > +

> > +     return time_spec_from_ns(ns);

> > +}

> > +

> > +static inline void time_wait_until(odp_time_t time)

> > +{

> > +     odp_time_t cur;

> > +

> > +     do {

> > +             cur = time_cur();

> > +     } while (time_cmp(time, cur) > 0);

> >  }

> >

> >  odp_time_t odp_time_local(void)

> >  {

> > -     return time_local();

> > +     return time_cur();

> >  }

> >

> >  odp_time_t odp_time_global(void)

> >  {

> > -     return time_local();

> > +     return time_cur();

> >  }

> >

> >  odp_time_t odp_time_diff(odp_time_t t2, odp_time_t t1)

> > @@ -134,12 +297,12 @@ uint64_t odp_time_to_ns(odp_time_t time)

> >

> >  odp_time_t odp_time_local_from_ns(uint64_t ns)

> >  {

> > -     return time_local_from_ns(ns);

> > +     return time_from_ns(ns);

> >  }

> >

> >  odp_time_t odp_time_global_from_ns(uint64_t ns)

> >  {

> > -     return time_local_from_ns(ns);

> > +     return time_from_ns(ns);

> >  }

> >

> >  int odp_time_cmp(odp_time_t t2, odp_time_t t1)

> > @@ -154,18 +317,18 @@ odp_time_t odp_time_sum(odp_time_t t1, odp_time_t

> t2)

> >

> >  uint64_t odp_time_local_res(void)

> >  {

> > -     return time_local_res();

> > +     return time_res();

> >  }

> >

> >  uint64_t odp_time_global_res(void)

> >  {

> > -     return time_local_res();

> > +     return time_res();

> >  }

> >

> >  void odp_time_wait_ns(uint64_t ns)

> >  {

> > -     odp_time_t cur = time_local();

> > -     odp_time_t wait = time_local_from_ns(ns);

> > +     odp_time_t cur = time_cur();

> > +     odp_time_t wait = time_from_ns(ns);

> >       odp_time_t end_time = time_sum(cur, wait);

> >

> >       time_wait_until(end_time);

> > @@ -193,15 +356,31 @@ uint64_t odp_time_to_u64(odp_time_t time)

> >

> >  int odp_time_init_global(void)

> >  {

> > -     int ret;

> > -     struct timespec time;

> > -

> > -     ret = clock_gettime(CLOCK_MONOTONIC_RAW, &time);

> > -     if (ret) {

> > -             start_time = ODP_TIME_NULL;

> > -     } else {

> > -             start_time.tv_sec = time.tv_sec;

> > -             start_time.tv_nsec = time.tv_nsec;

> > +     struct timespec sys_time;

> > +     int ret = 0;

> > +

> > +     memset(&global, 0, sizeof(time_global_t));

> > +

> > +     if (cpu_has_global_time()) {

> > +             global.use_hw = 1;

> > +             global.hw_freq_hz  = cpu_global_time_freq();

> > +

> > +             if (global.hw_freq_hz == 0)

> > +                     return -1;

> > +

> > +             printf("HW time counter freq: %" PRIu64 " hz\n\n",

> > +                    global.hw_freq_hz);

> > +

> > +             global.hw_start = cpu_global_time();

> > +             return 0;

> > +     }

> > +

> > +     global.start_time = ODP_TIME_NULL;

> > +

> > +     ret = clock_gettime(CLOCK_MONOTONIC_RAW, &sys_time);

> > +     if (ret == 0) {

> > +             global.start_time.spec.tv_sec  = sys_time.tv_sec;

> > +             global.start_time.spec.tv_nsec = sys_time.tv_nsec;

> >       }

> >

> >       return ret;

> > --

> > 2.11.0

> >

>
Brian Brooks April 24, 2017, 3:59 a.m. UTC | #8
On 04/21 14:19:27, Bill Fischofer wrote:
> Good stuff. I'll put this on the agenda for Monday's ARCH call to discuss.


Hi Bill, Petri,

I will only be able to join the Wednesday ARCH call this week.

Thanks,
Brian

> On Fri, Apr 21, 2017 at 12:29 PM, Brian Brooks <brian.brooks@arm.com> wrote:

> 

> > On 04/21 16:11:34, Petri Savolainen wrote:

> > > Use 64 bit HW time counter when available. It is used on

> > > x86 when invariant TSC CPU flag indicates that TSC frequency

> > > is constant. Otherwise, the system time is used as before. Direct

> > > HW time counter usage avoids system call, and related latency

> > > and performance issues.

> > >

> > > Signed-off-by: Petri Savolainen <petri.savolainen@linaro.org>

> > > ---

> > >  platform/linux-generic/Makefile.am                 |   1 +

> > >  platform/linux-generic/arch/arm/odp_cpu_arch.c     |  11 +

> > >  platform/linux-generic/arch/default/odp_cpu_arch.c |  11 +

> > >  platform/linux-generic/arch/mips64/odp_cpu_arch.c  |  11 +

> > >  platform/linux-generic/arch/powerpc/odp_cpu_arch.c |  11 +

> > >  platform/linux-generic/arch/x86/cpu_flags.c        |   9 +

> > >  platform/linux-generic/arch/x86/odp_cpu_arch.c     |  59 ++++

> > >  .../include/odp/api/plat/time_types.h              |  23 +-

> > >  platform/linux-generic/include/odp_time_internal.h |  24 ++

> > >  platform/linux-generic/odp_time.c                  | 303

> > ++++++++++++++++-----

> > >  10 files changed, 398 insertions(+), 65 deletions(-)

> > >  create mode 100644 platform/linux-generic/include/odp_time_internal.h

> > >

> > > diff --git a/platform/linux-generic/Makefile.am

> > b/platform/linux-generic/Makefile.am

> > > index 60b7f849..ed66fecf 100644

> > > --- a/platform/linux-generic/Makefile.am

> > > +++ b/platform/linux-generic/Makefile.am

> > > @@ -171,6 +171,7 @@ noinst_HEADERS = \

> > >                 ${srcdir}/include/odp_schedule_if.h \

> > >                 ${srcdir}/include/odp_sorted_list_internal.h \

> > >                 ${srcdir}/include/odp_shm_internal.h \

> > > +               ${srcdir}/include/odp_time_internal.h \

> > >                 ${srcdir}/include/odp_timer_internal.h \

> > >                 ${srcdir}/include/odp_timer_wheel_internal.h \

> > >                 ${srcdir}/include/odp_traffic_mngr_internal.h \

> > > diff --git a/platform/linux-generic/arch/arm/odp_cpu_arch.c

> > b/platform/linux-generic/arch/arm/odp_cpu_arch.c

> > > index 2ac223e0..3a87f09c 100644

> > > --- a/platform/linux-generic/arch/arm/odp_cpu_arch.c

> > > +++ b/platform/linux-generic/arch/arm/odp_cpu_arch.c

> > > @@ -13,6 +13,7 @@

> > >  #include <odp/api/hints.h>

> > >  #include <odp/api/system_info.h>

> > >  #include <odp_debug_internal.h>

> > > +#include <odp_time_internal.h>

> > >

> > >  #define GIGA 1000000000

> > >

> > > @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void)

> > >  {

> > >       return 1;

> > >  }

> > > +

> > > +uint64_t cpu_global_time(void)

> > > +{

> > > +     return 0;

> > > +}

> > > +

> > > +uint64_t cpu_global_time_freq(void)

> > > +{

> > > +     return 0;

> > > +}

> > > diff --git a/platform/linux-generic/arch/default/odp_cpu_arch.c

> > b/platform/linux-generic/arch/default/odp_cpu_arch.c

> > > index 2ac223e0..3a87f09c 100644

> > > --- a/platform/linux-generic/arch/default/odp_cpu_arch.c

> > > +++ b/platform/linux-generic/arch/default/odp_cpu_arch.c

> > > @@ -13,6 +13,7 @@

> > >  #include <odp/api/hints.h>

> > >  #include <odp/api/system_info.h>

> > >  #include <odp_debug_internal.h>

> > > +#include <odp_time_internal.h>

> > >

> > >  #define GIGA 1000000000

> > >

> > > @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void)

> > >  {

> > >       return 1;

> > >  }

> > > +

> > > +uint64_t cpu_global_time(void)

> > > +{

> > > +     return 0;

> > > +}

> > > +

> > > +uint64_t cpu_global_time_freq(void)

> > > +{

> > > +     return 0;

> > > +}

> > > diff --git a/platform/linux-generic/arch/mips64/odp_cpu_arch.c

> > b/platform/linux-generic/arch/mips64/odp_cpu_arch.c

> > > index 646acf9c..a9a94531 100644

> > > --- a/platform/linux-generic/arch/mips64/odp_cpu_arch.c

> > > +++ b/platform/linux-generic/arch/mips64/odp_cpu_arch.c

> > > @@ -7,6 +7,7 @@

> > >  #include <odp/api/cpu.h>

> > >  #include <odp/api/hints.h>

> > >  #include <odp/api/system_info.h>

> > > +#include <odp_time_internal.h>

> > >

> > >  uint64_t odp_cpu_cycles(void)

> > >  {

> > > @@ -29,3 +30,13 @@ uint64_t odp_cpu_cycles_resolution(void)

> > >  {

> > >       return 1;

> > >  }

> > > +

> > > +uint64_t cpu_global_time(void)

> > > +{

> > > +     return 0;

> > > +}

> > > +

> > > +uint64_t cpu_global_time_freq(void)

> > > +{

> > > +     return 0;

> > > +}

> > > diff --git a/platform/linux-generic/arch/powerpc/odp_cpu_arch.c

> > b/platform/linux-generic/arch/powerpc/odp_cpu_arch.c

> > > index 2ac223e0..3a87f09c 100644

> > > --- a/platform/linux-generic/arch/powerpc/odp_cpu_arch.c

> > > +++ b/platform/linux-generic/arch/powerpc/odp_cpu_arch.c

> > > @@ -13,6 +13,7 @@

> > >  #include <odp/api/hints.h>

> > >  #include <odp/api/system_info.h>

> > >  #include <odp_debug_internal.h>

> > > +#include <odp_time_internal.h>

> > >

> > >  #define GIGA 1000000000

> > >

> > > @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void)

> > >  {

> > >       return 1;

> > >  }

> > > +

> > > +uint64_t cpu_global_time(void)

> > > +{

> > > +     return 0;

> > > +}

> > > +

> > > +uint64_t cpu_global_time_freq(void)

> > > +{

> > > +     return 0;

> > > +}

> > > diff --git a/platform/linux-generic/arch/x86/cpu_flags.c

> > b/platform/linux-generic/arch/x86/cpu_flags.c

> > > index 8fb9477a..cde8ad3e 100644

> > > --- a/platform/linux-generic/arch/x86/cpu_flags.c

> > > +++ b/platform/linux-generic/arch/x86/cpu_flags.c

> > > @@ -38,6 +38,7 @@

> > >   */

> > >

> > >  #include <arch/x86/cpu_flags.h>

> > > +#include <odp_time_internal.h>

> > >  #include <stdio.h>

> > >  #include <stdint.h>

> > >

> > > @@ -347,3 +348,11 @@ void cpu_flags_print_all(void)

> > >

> > >       printf("\n\n");

> > >  }

> > > +

> > > +int cpu_has_global_time(void)

> > > +{

> > > +     if (cpu_get_flag_enabled(RTE_CPUFLAG_INVTSC) > 0)

> > > +             return 1;

> > > +

> > > +     return 0;

> > > +}

> > > diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c

> > b/platform/linux-generic/arch/x86/odp_cpu_arch.c

> > > index c8cf27b6..9ba601a3 100644

> > > --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c

> > > +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c

> > > @@ -3,7 +3,14 @@

> > >   *

> > >   * SPDX-License-Identifier:     BSD-3-Clause

> > >   */

> > > +

> > > +#include <odp_posix_extensions.h>

> > > +

> > >  #include <odp/api/cpu.h>

> > > +#include <odp_time_internal.h>

> > > +#include <odp_debug_internal.h>

> > > +

> > > +#include <time.h>

> > >

> > >  uint64_t odp_cpu_cycles(void)

> > >  {

> > > @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void)

> > >  {

> > >       return 1;

> > >  }

> > > +

> > > +uint64_t cpu_global_time(void)

> > > +{

> > > +     return odp_cpu_cycles();

> >

> > A cycle counter cannot always be used to measure time. Even on x86,

> > odp_cpu_cycles() will return the value of RDTSC which is not actually

> > representative of the cycle count. Even if the x86 processor is set

> > to a fixed frequency, the Invariant TSC may run at a different fixed

> > frequency. Please take a look at the odp_tick_t proposal here:

> >

> > https://docs.google.com/document/d/1sY7rOxqCNu-bMqjBiT5_keAIohrX1ZW-

> > eL0oGLAQ4OM/edit?usp=sharing

> >

> > > +}

> > > +

> > > +#define SEC_IN_NS 1000000000ULL

> > > +

> > > +/* Measure TSC frequency. Frequency information registers are defined

> > for x86,

> > > + * but those are often not enumerated. */

> > > +uint64_t cpu_global_time_freq(void)

> > > +{

> >

> > The 0x40000010 leaf is standardized in Hyper-V as returning the TSC's

> > frequency in kHz. It would be better to use this when running under a

> > hypervisor. It appears to work under VMware as well.

> >

> > > +     struct timespec sleep, ts1, ts2;

> > > +     uint64_t t1, t2, ts_nsec, cycles, hz;

> > > +     int i;

> > > +     uint64_t avg = 0;

> > > +     int rounds = 4;

> > > +

> > > +     for (i = 0; i < rounds; i++) {

> > > +             sleep.tv_sec  = 0;

> > > +             sleep.tv_nsec = SEC_IN_NS / 10;

> > > +

> > > +             if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts1)) {

> > > +                     ODP_DBG("clock_gettime failed\n");

> > > +                     return 0;

> > > +             }

> > > +

> > > +             t1 = cpu_global_time();

> > > +

> > > +             if (nanosleep(&sleep, NULL) < 0) {

> > > +                     ODP_DBG("nanosleep failed\n");

> > > +                     return 0;

> > > +             }

> > > +

> > > +             if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts2)) {

> > > +                     ODP_DBG("clock_gettime failed\n");

> > > +                     return 0;

> > > +             }

> > > +

> > > +             t2 = cpu_global_time();

> > > +

> > > +             ts_nsec  = (ts2.tv_sec - ts1.tv_sec) * SEC_IN_NS;

> > > +             ts_nsec += ts2.tv_nsec - ts1.tv_nsec;

> > > +

> > > +             cycles = t2 - t1;

> > > +

> > > +             hz = (cycles * SEC_IN_NS) / ts_nsec;

> > > +             avg += hz;

> > > +     }

> > > +

> > > +     return avg / rounds;

> > > +}

> > > diff --git a/platform/linux-generic/include/odp/api/plat/time_types.h

> > b/platform/linux-generic/include/odp/api/plat/time_types.h

> > > index 4847f3b1..8ae7ce82 100644

> > > --- a/platform/linux-generic/include/odp/api/plat/time_types.h

> > > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h

> > > @@ -26,11 +26,28 @@ extern "C" {

> > >   * the linux timespec structure, which is dependent on POSIX extension

> > level.

> > >   */

> > >  typedef struct odp_time_t {

> > > -     int64_t tv_sec;      /**< @internal Seconds */

> > > -     int64_t tv_nsec;     /**< @internal Nanoseconds */

> > > +     union {

> > > +             /** @internal Posix timespec */

> > > +             struct {

> > > +                     /** @internal Seconds */

> > > +                     int64_t tv_sec;

> > > +

> > > +                     /** @internal Nanoseconds */

> > > +                     int64_t tv_nsec;

> > > +             } spec;

> > > +

> > > +             /** @internal HW time counter */

> > > +             struct {

> > > +                     /** @internal Counter value */

> > > +                     uint64_t count;

> > > +

> > > +                     /** @internal Reserved */

> > > +                     uint64_t res;

> > > +             } hw;

> > > +     };

> >

> > A processor's tick counter cannot be used to measure calendar time by

> > itself. The delta between two ticks, however, can be converted to

> > calendar time.

> >

> > Please see the proposal that introduces odp_tick_t which eliminates

> > the wasted u64 reserved field in the union above.

> >

> >

> > >  } odp_time_t;

> > >

> > > -#define ODP_TIME_NULL ((odp_time_t){0, 0})

> > > +#define ODP_TIME_NULL ((odp_time_t){.spec = {0, 0} })

> > >

> > >  /**

> > >   * @}

> > > diff --git a/platform/linux-generic/include/odp_time_internal.h

> > b/platform/linux-generic/include/odp_time_internal.h

> > > new file mode 100644

> > > index 00000000..99ac7977

> > > --- /dev/null

> > > +++ b/platform/linux-generic/include/odp_time_internal.h

> > > @@ -0,0 +1,24 @@

> > > +/* Copyright (c) 2017, Linaro Limited

> > > + * All rights reserved.

> > > + *

> > > + * SPDX-License-Identifier:     BSD-3-Clause

> > > + */

> > > +

> > > +#ifndef ODP_TIME_INTERNAL_H_

> > > +#define ODP_TIME_INTERNAL_H_

> > > +

> > > +#ifdef __cplusplus

> > > +extern "C" {

> > > +#endif

> > > +

> > > +#include <stdint.h>

> > > +

> > > +int cpu_has_global_time(void);

> > > +uint64_t cpu_global_time(void);

> > > +uint64_t cpu_global_time_freq(void);

> > > +

> > > +#ifdef __cplusplus

> > > +}

> > > +#endif

> > > +

> > > +#endif

> > > diff --git a/platform/linux-generic/odp_time.c

> > b/platform/linux-generic/odp_time.c

> > > index 81e05224..2dd8f2c4 100644

> > > --- a/platform/linux-generic/odp_time.c

> > > +++ b/platform/linux-generic/odp_time.c

> > > @@ -10,36 +10,39 @@

> > >  #include <odp/api/time.h>

> > >  #include <odp/api/hints.h>

> > >  #include <odp_debug_internal.h>

> > > +#include <odp_time_internal.h>

> > > +#include <string.h>

> > > +#include <inttypes.h>

> > >

> > > -static odp_time_t start_time;

> > > +typedef struct time_global_t {

> > > +     odp_time_t start_time;

> > > +     int        use_hw;

> > > +     uint64_t   hw_start;

> > > +     uint64_t   hw_freq_hz;

> > > +} time_global_t;

> > >

> > > -static inline

> > > -uint64_t time_to_ns(odp_time_t time)

> > > -{

> > > -     uint64_t ns;

> > > -

> > > -     ns = time.tv_sec * ODP_TIME_SEC_IN_NS;

> > > -     ns += time.tv_nsec;

> > > +static time_global_t global;

> > >

> > > -     return ns;

> > > -}

> > > +/*

> > > + * Posix timespec based functions

> > > + */

> > >

> > > -static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1)

> > > +static inline odp_time_t time_spec_diff(odp_time_t t2, odp_time_t t1)

> >

> > Static functions in .c files will always be considered for inlining, so the

> > inline qualifier is unnecessary here. You can use always_inline compiler

> > attribute if the disassembly and run time performance is not as expected.

> >

> > Comment applies to nearly every function in this file.

> >

> > >  {

> > >       odp_time_t time;

> > >

> > > -     time.tv_sec = t2.tv_sec - t1.tv_sec;

> > > -     time.tv_nsec = t2.tv_nsec - t1.tv_nsec;

> > > +     time.spec.tv_sec = t2.spec.tv_sec - t1.spec.tv_sec;

> > > +     time.spec.tv_nsec = t2.spec.tv_nsec - t1.spec.tv_nsec;

> > >

> > > -     if (time.tv_nsec < 0) {

> > > -             time.tv_nsec += ODP_TIME_SEC_IN_NS;

> > > -             --time.tv_sec;

> > > +     if (time.spec.tv_nsec < 0) {

> > > +             time.spec.tv_nsec += ODP_TIME_SEC_IN_NS;

> > > +             --time.spec.tv_sec;

> > >       }

> > >

> > >       return time;

> > >  }

> > >

> > > -static inline odp_time_t time_local(void)

> > > +static inline odp_time_t time_spec_cur(void)

> > >  {

> > >       int ret;

> > >       odp_time_t time;

> > > @@ -49,77 +52,237 @@ static inline odp_time_t time_local(void)

> > >       if (odp_unlikely(ret != 0))

> > >               ODP_ABORT("clock_gettime failed\n");

> > >

> > > -     time.tv_sec = sys_time.tv_sec;

> > > -     time.tv_nsec = sys_time.tv_nsec;

> > > +     time.spec.tv_sec = sys_time.tv_sec;

> > > +     time.spec.tv_nsec = sys_time.tv_nsec;

> > >

> > > -     return time_diff(time, start_time);

> > > +     return time_spec_diff(time, global.start_time);

> > >  }

> > >

> > > -static inline int time_cmp(odp_time_t t2, odp_time_t t1)

> > > +static inline uint64_t time_spec_res(void)

> > >  {

> > > -     if (t2.tv_sec < t1.tv_sec)

> > > +     int ret;

> > > +     struct timespec tres;

> > > +

> > > +     ret = clock_getres(CLOCK_MONOTONIC_RAW, &tres);

> > > +     if (odp_unlikely(ret != 0))

> > > +             ODP_ABORT("clock_getres failed\n");

> > > +

> > > +     return ODP_TIME_SEC_IN_NS / (uint64_t)tres.tv_nsec;

> > > +}

> > > +

> > > +static inline int time_spec_cmp(odp_time_t t2, odp_time_t t1)

> > > +{

> > > +     if (t2.spec.tv_sec < t1.spec.tv_sec)

> > >               return -1;

> > >

> > > -     if (t2.tv_sec > t1.tv_sec)

> > > +     if (t2.spec.tv_sec > t1.spec.tv_sec)

> > >               return 1;

> > >

> > > -     return t2.tv_nsec - t1.tv_nsec;

> > > +     return t2.spec.tv_nsec - t1.spec.tv_nsec;

> > >  }

> > >

> > > -static inline odp_time_t time_sum(odp_time_t t1, odp_time_t t2)

> > > +static inline odp_time_t time_spec_sum(odp_time_t t1, odp_time_t t2)

> > >  {

> > >       odp_time_t time;

> > >

> > > -     time.tv_sec = t2.tv_sec + t1.tv_sec;

> > > -     time.tv_nsec = t2.tv_nsec + t1.tv_nsec;

> > > +     time.spec.tv_sec = t2.spec.tv_sec + t1.spec.tv_sec;

> > > +     time.spec.tv_nsec = t2.spec.tv_nsec + t1.spec.tv_nsec;

> > >

> > > -     if (time.tv_nsec >= (long)ODP_TIME_SEC_IN_NS) {

> > > -             time.tv_nsec -= ODP_TIME_SEC_IN_NS;

> > > -             ++time.tv_sec;

> > > +     if (time.spec.tv_nsec >= (long)ODP_TIME_SEC_IN_NS) {

> > > +             time.spec.tv_nsec -= ODP_TIME_SEC_IN_NS;

> > > +             ++time.spec.tv_sec;

> > >       }

> > >

> > >       return time;

> > >  }

> > >

> > > -static inline odp_time_t time_local_from_ns(uint64_t ns)

> > > +static inline uint64_t time_spec_to_ns(odp_time_t time)

> > > +{

> > > +     uint64_t ns;

> > > +

> > > +     ns = time.spec.tv_sec * ODP_TIME_SEC_IN_NS;

> > > +     ns += time.spec.tv_nsec;

> > > +

> > > +     return ns;

> > > +}

> > > +

> > > +static inline odp_time_t time_spec_from_ns(uint64_t ns)

> > >  {

> > >       odp_time_t time;

> > >

> > > -     time.tv_sec = ns / ODP_TIME_SEC_IN_NS;

> > > -     time.tv_nsec = ns - time.tv_sec * ODP_TIME_SEC_IN_NS;

> > > +     time.spec.tv_sec = ns / ODP_TIME_SEC_IN_NS;

> > > +     time.spec.tv_nsec = ns - time.spec.tv_sec * ODP_TIME_SEC_IN_NS;

> > >

> > >       return time;

> > >  }

> > >

> > > -static inline void time_wait_until(odp_time_t time)

> > > +/*

> > > + * HW time counter based functions

> > > + */

> > > +

> > > +static inline odp_time_t time_hw_cur(void)

> > >  {

> > > -     odp_time_t cur;

> > > +     odp_time_t time;

> > >

> > > -     do {

> > > -             cur = time_local();

> > > -     } while (time_cmp(time, cur) > 0);

> > > +     time.hw.res   = 0;

> > > +     time.hw.count = cpu_global_time() - global.hw_start;

> > > +

> > > +     return time;

> > >  }

> > >

> > > -static inline uint64_t time_local_res(void)

> > > +static inline uint64_t time_hw_res(void)

> >

> > I think 'freq' is a more descriptive name than 'res'.

> >

> > It would also be good to document at either the function declaration

> > or the function definition (if multiple implementations differ) whether

> > the frequency is in Hz or kHz. For example, on ARMv8 it is just a

> > register read to get Hz. No estimation, and no lowering to kHz.

> >

> > >  {

> > > -     int ret;

> > > -     struct timespec tres;

> > > +     /* Promise a bit lower resolution than average cycle counter

> > > +      * frequency */

> > > +     return global.hw_freq_hz / 10;

> > > +}

> > >

> > > -     ret = clock_getres(CLOCK_MONOTONIC_RAW, &tres);

> > > -     if (odp_unlikely(ret != 0))

> > > -             ODP_ABORT("clock_getres failed\n");

> > > +static inline int time_hw_cmp(odp_time_t t2, odp_time_t t1)

> > > +{

> > > +     if (odp_likely(t2.hw.count > t1.hw.count))

> > > +             return 1;

> > >

> > > -     return ODP_TIME_SEC_IN_NS / (uint64_t)tres.tv_nsec;

> > > +     if (t2.hw.count < t1.hw.count)

> > > +             return -1;

> > > +

> > > +     return 0;

> > > +}

> > > +

> > > +static inline odp_time_t time_hw_diff(odp_time_t t2, odp_time_t t1)

> > > +{

> > > +     odp_time_t time;

> > > +

> > > +     time.hw.res = 0;

> > > +     time.hw.count = t2.hw.count - t1.hw.count;

> > > +

> > > +     return time;

> > > +}

> > > +

> > > +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2)

> > > +{

> > > +     odp_time_t time;

> > > +

> > > +     time.hw.res = 0;

> > > +     time.hw.count = t1.hw.count + t2.hw.count;

> > > +

> > > +     return time;

> > > +}

> >

> > If a single u64 were used this code wouldn't need to exist.

> >

> > > +static inline uint64_t time_hw_to_ns(odp_time_t time)

> > > +{

> > > +     uint64_t nsec;

> > > +     uint64_t freq_hz = global.hw_freq_hz;

> > > +     uint64_t count = time.hw.count;

> > > +     uint64_t sec = 0;

> > > +

> > > +     if (count >= freq_hz) {

> > > +             sec   = count / freq_hz;

> > > +             count = count - sec * freq_hz;

> > > +     }

> > > +

> > > +     nsec = (ODP_TIME_SEC_IN_NS * count) / freq_hz;

> > > +

> > > +     return (sec * ODP_TIME_SEC_IN_NS) + nsec;

> > > +}

> > > +

> > > +static inline odp_time_t time_hw_from_ns(uint64_t ns)

> > > +{

> > > +     odp_time_t time;

> > > +     uint64_t count;

> > > +     uint64_t freq_hz = global.hw_freq_hz;

> > > +     uint64_t sec = 0;

> > > +

> > > +     if (ns >= ODP_TIME_SEC_IN_NS) {

> > > +             sec = ns / ODP_TIME_SEC_IN_NS;

> > > +             ns  = ns - sec * ODP_TIME_SEC_IN_NS;

> > > +     }

> > > +

> > > +     count  = sec * freq_hz;

> > > +     count += (ns * freq_hz) / ODP_TIME_SEC_IN_NS;

> > > +

> > > +     time.hw.res   = 0;

> > > +     time.hw.count = count;

> > > +

> > > +     return time;

> > > +}

> > > +

> > > +/*

> > > + * Common functions

> > > + */

> > > +

> > > +static inline odp_time_t time_cur(void)

> > > +{

> > > +     if (global.use_hw)

> > > +             return time_hw_cur();

> > > +

> > > +     return time_spec_cur();

> > > +}

> > > +

> > > +static inline uint64_t time_res(void)

> > > +{

> > > +     if (global.use_hw)

> > > +             return time_hw_res();

> > > +

> > > +     return time_spec_res();

> > > +}

> > > +

> > > +static inline int time_cmp(odp_time_t t2, odp_time_t t1)

> > > +{

> > > +     if (global.use_hw)

> > > +             return time_hw_cmp(t2, t1);

> > > +

> > > +     return time_spec_cmp(t2, t1);

> > > +}

> > > +

> > > +static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1)

> > > +{

> > > +     if (global.use_hw)

> > > +             return time_hw_diff(t2, t1);

> > > +

> > > +     return time_spec_diff(t2, t1);

> > > +}

> > > +

> > > +static inline odp_time_t time_sum(odp_time_t t1, odp_time_t t2)

> > > +{

> > > +     if (global.use_hw)

> > > +             return time_hw_sum(t1, t2);

> > > +

> > > +     return time_spec_sum(t1, t2);

> > > +}

> > > +

> > > +static inline uint64_t time_to_ns(odp_time_t time)

> > > +{

> > > +     if (global.use_hw)

> > > +             return time_hw_to_ns(time);

> > > +

> > > +     return time_spec_to_ns(time);

> > > +}

> > > +

> > > +static inline odp_time_t time_from_ns(uint64_t ns)

> > > +{

> > > +     if (global.use_hw)

> > > +             return time_hw_from_ns(ns);

> > > +

> > > +     return time_spec_from_ns(ns);

> > > +}

> > > +

> > > +static inline void time_wait_until(odp_time_t time)

> > > +{

> > > +     odp_time_t cur;

> > > +

> > > +     do {

> > > +             cur = time_cur();

> > > +     } while (time_cmp(time, cur) > 0);

> > >  }

> > >

> > >  odp_time_t odp_time_local(void)

> > >  {

> > > -     return time_local();

> > > +     return time_cur();

> > >  }

> > >

> > >  odp_time_t odp_time_global(void)

> > >  {

> > > -     return time_local();

> > > +     return time_cur();

> > >  }

> > >

> > >  odp_time_t odp_time_diff(odp_time_t t2, odp_time_t t1)

> > > @@ -134,12 +297,12 @@ uint64_t odp_time_to_ns(odp_time_t time)

> > >

> > >  odp_time_t odp_time_local_from_ns(uint64_t ns)

> > >  {

> > > -     return time_local_from_ns(ns);

> > > +     return time_from_ns(ns);

> > >  }

> > >

> > >  odp_time_t odp_time_global_from_ns(uint64_t ns)

> > >  {

> > > -     return time_local_from_ns(ns);

> > > +     return time_from_ns(ns);

> > >  }

> > >

> > >  int odp_time_cmp(odp_time_t t2, odp_time_t t1)

> > > @@ -154,18 +317,18 @@ odp_time_t odp_time_sum(odp_time_t t1, odp_time_t

> > t2)

> > >

> > >  uint64_t odp_time_local_res(void)

> > >  {

> > > -     return time_local_res();

> > > +     return time_res();

> > >  }

> > >

> > >  uint64_t odp_time_global_res(void)

> > >  {

> > > -     return time_local_res();

> > > +     return time_res();

> > >  }

> > >

> > >  void odp_time_wait_ns(uint64_t ns)

> > >  {

> > > -     odp_time_t cur = time_local();

> > > -     odp_time_t wait = time_local_from_ns(ns);

> > > +     odp_time_t cur = time_cur();

> > > +     odp_time_t wait = time_from_ns(ns);

> > >       odp_time_t end_time = time_sum(cur, wait);

> > >

> > >       time_wait_until(end_time);

> > > @@ -193,15 +356,31 @@ uint64_t odp_time_to_u64(odp_time_t time)

> > >

> > >  int odp_time_init_global(void)

> > >  {

> > > -     int ret;

> > > -     struct timespec time;

> > > -

> > > -     ret = clock_gettime(CLOCK_MONOTONIC_RAW, &time);

> > > -     if (ret) {

> > > -             start_time = ODP_TIME_NULL;

> > > -     } else {

> > > -             start_time.tv_sec = time.tv_sec;

> > > -             start_time.tv_nsec = time.tv_nsec;

> > > +     struct timespec sys_time;

> > > +     int ret = 0;

> > > +

> > > +     memset(&global, 0, sizeof(time_global_t));

> > > +

> > > +     if (cpu_has_global_time()) {

> > > +             global.use_hw = 1;

> > > +             global.hw_freq_hz  = cpu_global_time_freq();

> > > +

> > > +             if (global.hw_freq_hz == 0)

> > > +                     return -1;

> > > +

> > > +             printf("HW time counter freq: %" PRIu64 " hz\n\n",

> > > +                    global.hw_freq_hz);

> > > +

> > > +             global.hw_start = cpu_global_time();

> > > +             return 0;

> > > +     }

> > > +

> > > +     global.start_time = ODP_TIME_NULL;

> > > +

> > > +     ret = clock_gettime(CLOCK_MONOTONIC_RAW, &sys_time);

> > > +     if (ret == 0) {

> > > +             global.start_time.spec.tv_sec  = sys_time.tv_sec;

> > > +             global.start_time.spec.tv_nsec = sys_time.tv_nsec;

> > >       }

> > >

> > >       return ret;

> > > --

> > > 2.11.0

> > >

> >
Savolainen, Petri (Nokia - FI/Espoo) April 24, 2017, 8:07 a.m. UTC | #9
> > diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c

> b/platform/linux-generic/arch/x86/odp_cpu_arch.c

> > index c8cf27b6..9ba601a3 100644

> > --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c

> > +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c

> > @@ -3,7 +3,14 @@

> >   *

> >   * SPDX-License-Identifier:     BSD-3-Clause

> >   */

> > +

> > +#include <odp_posix_extensions.h>

> > +

> >  #include <odp/api/cpu.h>

> > +#include <odp_time_internal.h>

> > +#include <odp_debug_internal.h>

> > +

> > +#include <time.h>

> >

> >  uint64_t odp_cpu_cycles(void)

> >  {

> > @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void)

> >  {

> >  	return 1;

> >  }

> > +

> > +uint64_t cpu_global_time(void)

> > +{

> > +	return odp_cpu_cycles();

> 

> A cycle counter cannot always be used to measure time. Even on x86,

> odp_cpu_cycles() will return the value of RDTSC which is not actually

> representative of the cycle count. Even if the x86 processor is set

> to a fixed frequency, the Invariant TSC may run at a different fixed

> frequency. Please take a look at the odp_tick_t proposal here:

> 

> https://docs.google.com/document/d/1sY7rOxqCNu-bMqjBiT5_keAIohrX1ZW-

> eL0oGLAQ4OM/edit?usp=sharing

>


From coverletter:
"This patch set modifies time implementation to use TSC when running on a x86 
CPU that has invarint TSC CPU flag set. Otherwise, the same Linux system time 
is used as before. TSC is much more efficient both in performance and 
latency/jitter wise than Linux system call. This can be seen also with 
scheduler latency test which time stamps events with this API. All latency 
measurements (min, ave, max) improved significantly."

This function (cpu_global_time()) is called only when we have first checked that TSC is invariant. Also we measure the TSC frequency in that case. This function is defined in the same file as cpu_cycles(), and the file is x86 specific. So, we know what we are doing, and just re-using the code to read TSC.


 
> > +}

> > +

> > +#define SEC_IN_NS 1000000000ULL

> > +

> > +/* Measure TSC frequency. Frequency information registers are defined

> for x86,

> > + * but those are often not enumerated. */

> > +uint64_t cpu_global_time_freq(void)

> > +{

> 

> The 0x40000010 leaf is standardized in Hyper-V as returning the TSC's

> frequency in kHz. It would be better to use this when running under a

> hypervisor. It appears to work under VMware as well.



Sure, that can be done later, if we want to add any code into odp-linux that would behave differently under VM vs. host. Today, we don't have any.

Also there's leaf 0x15, which should give you TSC frequency but apparently is not enumerated often. So, I ended up to do the same as DPDK - just measure it.


> > diff --git a/platform/linux-generic/include/odp/api/plat/time_types.h

> b/platform/linux-generic/include/odp/api/plat/time_types.h

> > index 4847f3b1..8ae7ce82 100644

> > --- a/platform/linux-generic/include/odp/api/plat/time_types.h

> > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h

> > @@ -26,11 +26,28 @@ extern "C" {

> >   * the linux timespec structure, which is dependent on POSIX extension

> level.

> >   */

> >  typedef struct odp_time_t {

> > -	int64_t tv_sec;      /**< @internal Seconds */

> > -	int64_t tv_nsec;     /**< @internal Nanoseconds */

> > +	union {

> > +		/** @internal Posix timespec */

> > +		struct {

> > +			/** @internal Seconds */

> > +			int64_t tv_sec;

> > +

> > +			/** @internal Nanoseconds */

> > +			int64_t tv_nsec;

> > +		} spec;

> > +

> > +		/** @internal HW time counter */

> > +		struct {

> > +			/** @internal Counter value */

> > +			uint64_t count;

> > +

> > +			/** @internal Reserved */

> > +			uint64_t res;

> > +		} hw;

> > +	};

> 

> A processor's tick counter cannot be used to measure calendar time by

> itself. The delta between two ticks, however, can be converted to

> calendar time.

> 

> Please see the proposal that introduces odp_tick_t which eliminates

> the wasted u64 reserved field in the union above.



Nothing is wasted here today. We need to support Posix timespec anyway. Timespec is used when there's no x86 invariant TSC available - so all non-x86 and even on some x86 systems.

Also "count" here starts from zero (we save cpu counter value at start). So, it takes 580 years to wrap at 1GHz. Why 64 bit counter value (that starts from zero) is not good for calendar time ?




> > +/*

> > + * Posix timespec based functions

> > + */

> >

> > -static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1)

> > +static inline odp_time_t time_spec_diff(odp_time_t t2, odp_time_t t1)

> 

> Static functions in .c files will always be considered for inlining, so

> the

> inline qualifier is unnecessary here. You can use always_inline compiler

> attribute if the disassembly and run time performance is not as expected.

> 

> Comment applies to nearly every function in this file.



This is because all our code (in this and other files) does "static inline" for functions that we want to be inlined. Static is used for functions that are static, but we don't care if those are inlined (slow path). Most time API functions are fast path.


> >

> > -static inline uint64_t time_local_res(void)

> > +static inline uint64_t time_hw_res(void)

> 

> I think 'freq' is a more descriptive name than 'res'.

> 

> It would also be good to document at either the function declaration

> or the function definition (if multiple implementations differ) whether

> the frequency is in Hz or kHz. For example, on ARMv8 it is just a

> register read to get Hz. No estimation, and no lowering to kHz.



This implements resolution API functions ...

/**
 * Global time resolution in hertz
 *
 * @return      Global time resolution in hertz
 */
uint64_t odp_time_global_res(void);


... so, from that context you know that it's Hz. In this implementation, I lower the measured Hz to be conservative.


> > +

> > +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2)

> > +{

> > +	odp_time_t time;

> > +

> > +	time.hw.res = 0;

> > +	time.hw.count = t1.hw.count + t2.hw.count;

> > +

> > +	return time;

> > +}

> 

> If a single u64 were used this code wouldn't need to exist.


Which code? hw.res = 0 ? It not actually used for anything and could be removed. Although, the performance gain would not be huge. Anyway, I'll remove it for v2.

-Petri
Brian Brooks April 25, 2017, 5:25 p.m. UTC | #10
On 04/24 08:07:58, Savolainen, Petri (Nokia - FI/Espoo) wrote:
> > > diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c

> > b/platform/linux-generic/arch/x86/odp_cpu_arch.c

> > > index c8cf27b6..9ba601a3 100644

> > > --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c

> > > +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c

> > > @@ -3,7 +3,14 @@

> > >   *

> > >   * SPDX-License-Identifier:     BSD-3-Clause

> > >   */

> > > +

> > > +#include <odp_posix_extensions.h>

> > > +

> > >  #include <odp/api/cpu.h>

> > > +#include <odp_time_internal.h>

> > > +#include <odp_debug_internal.h>

> > > +

> > > +#include <time.h>

> > >

> > >  uint64_t odp_cpu_cycles(void)

> > >  {

> > > @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void)

> > >  {

> > >  	return 1;

> > >  }

> > > +

> > > +uint64_t cpu_global_time(void)

> > > +{

> > > +	return odp_cpu_cycles();

> > 

> > A cycle counter cannot always be used to measure time. Even on x86,

> > odp_cpu_cycles() will return the value of RDTSC which is not actually

> > representative of the cycle count. Even if the x86 processor is set

> > to a fixed frequency, the Invariant TSC may run at a different fixed

> > frequency. Please take a look at the odp_tick_t proposal here:

> > 

> > https://docs.google.com/document/d/1sY7rOxqCNu-bMqjBiT5_keAIohrX1ZW-

> > eL0oGLAQ4OM/edit?usp=sharing

> >

> 

> From coverletter:

> "This patch set modifies time implementation to use TSC when running on a x86 

> CPU that has invarint TSC CPU flag set. Otherwise, the same Linux system time 

> is used as before. TSC is much more efficient both in performance and 

> latency/jitter wise than Linux system call. This can be seen also with 

> scheduler latency test which time stamps events with this API. All latency 

> measurements (min, ave, max) improved significantly."

> 

> This function (cpu_global_time()) is called only when we have first checked that TSC is invariant. Also we measure the TSC frequency in that case. This function is defined in the same file as cpu_cycles(), and the file is x86 specific. So, we know what we are doing, and just re-using the code to read TSC.

> 

> 

>  

> > > +}

> > > +

> > > +#define SEC_IN_NS 1000000000ULL

> > > +

> > > +/* Measure TSC frequency. Frequency information registers are defined

> > for x86,

> > > + * but those are often not enumerated. */

> > > +uint64_t cpu_global_time_freq(void)

> > > +{

> > 

> > The 0x40000010 leaf is standardized in Hyper-V as returning the TSC's

> > frequency in kHz. It would be better to use this when running under a

> > hypervisor. It appears to work under VMware as well.

> 

> 

> Sure, that can be done later, if we want to add any code into odp-linux that would behave differently under VM vs. host. Today, we don't have any.

> 

> Also there's leaf 0x15, which should give you TSC frequency but apparently is not enumerated often. So, I ended up to do the same as DPDK - just measure it.

> 

> 

> > > diff --git a/platform/linux-generic/include/odp/api/plat/time_types.h

> > b/platform/linux-generic/include/odp/api/plat/time_types.h

> > > index 4847f3b1..8ae7ce82 100644

> > > --- a/platform/linux-generic/include/odp/api/plat/time_types.h

> > > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h

> > > @@ -26,11 +26,28 @@ extern "C" {

> > >   * the linux timespec structure, which is dependent on POSIX extension

> > level.

> > >   */

> > >  typedef struct odp_time_t {

> > > -	int64_t tv_sec;      /**< @internal Seconds */

> > > -	int64_t tv_nsec;     /**< @internal Nanoseconds */

> > > +	union {

> > > +		/** @internal Posix timespec */

> > > +		struct {

> > > +			/** @internal Seconds */

> > > +			int64_t tv_sec;

> > > +

> > > +			/** @internal Nanoseconds */

> > > +			int64_t tv_nsec;

> > > +		} spec;

> > > +

> > > +		/** @internal HW time counter */

> > > +		struct {

> > > +			/** @internal Counter value */

> > > +			uint64_t count;

> > > +

> > > +			/** @internal Reserved */

> > > +			uint64_t res;

> > > +		} hw;

> > > +	};

> > 

> > A processor's tick counter cannot be used to measure calendar time by

> > itself. The delta between two ticks, however, can be converted to

> > calendar time.

> > 

> > Please see the proposal that introduces odp_tick_t which eliminates

> > the wasted u64 reserved field in the union above.

> 

> 

> Nothing is wasted here today.


The 64-bit CPU time is stored in a 128-bit data structure where 64-bits
are wasted. You can use just a 64-bit type and then convert that to
a timespec (using a starting timestamp taken on global init) if needed.

> We need to support Posix timespec anyway. Timespec is used when there's no x86 invariant TSC available - so all non-x86 and even on some x86 systems.

> 

> Also "count" here starts from zero (we save cpu counter value at start). So, it takes 580 years to wrap at 1GHz. Why 64 bit counter value (that starts from zero) is not good for calendar time ?

> 

> 

> 

> 

> > > +/*

> > > + * Posix timespec based functions

> > > + */

> > >

> > > -static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1)

> > > +static inline odp_time_t time_spec_diff(odp_time_t t2, odp_time_t t1)

> > 

> > Static functions in .c files will always be considered for inlining, so

> > the

> > inline qualifier is unnecessary here. You can use always_inline compiler

> > attribute if the disassembly and run time performance is not as expected.

> > 

> > Comment applies to nearly every function in this file.

> 

> 

> This is because all our code (in this and other files) does "static inline" for functions that we want to be inlined. Static is used for functions that are static, but we don't care if those are inlined (slow path). Most time API functions are fast path.


I am saying that adding the 'inline' qualifier here doesn't actually do anything
because the compiler will consider static functions for inlining anyways.

> 

> > >

> > > -static inline uint64_t time_local_res(void)

> > > +static inline uint64_t time_hw_res(void)

> > 

> > I think 'freq' is a more descriptive name than 'res'.

> > 

> > It would also be good to document at either the function declaration

> > or the function definition (if multiple implementations differ) whether

> > the frequency is in Hz or kHz. For example, on ARMv8 it is just a

> > register read to get Hz. No estimation, and no lowering to kHz.

> 

> 

> This implements resolution API functions ...

> 

> /**

>  * Global time resolution in hertz

>  *

>  * @return      Global time resolution in hertz

>  */

> uint64_t odp_time_global_res(void);

> 

> 

> ... so, from that context you know that it's Hz. In this implementation, I lower the measured Hz to be conservative.

> 

> 

> > > +

> > > +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2)

> > > +{

> > > +	odp_time_t time;

> > > +

> > > +	time.hw.res = 0;

> > > +	time.hw.count = t1.hw.count + t2.hw.count;

> > > +

> > > +	return time;

> > > +}

> > 

> > If a single u64 were used this code wouldn't need to exist.

> 

> Which code? hw.res = 0 ? It not actually used for anything and could be removed. Although, the performance gain would not be huge. Anyway, I'll remove it for v2.


If the CPU time was stored in a 64-bit type, you can use arithmetic operators
on the values directly instead of doing arithmetic on a 128-bit compound
datatype where 64-bits are unused. This is the union above.

> -Petri

>
Brian Brooks April 25, 2017, 6:51 p.m. UTC | #11
On 04/25 12:25:03, Brian Brooks wrote:
> On 04/24 08:07:58, Savolainen, Petri (Nokia - FI/Espoo) wrote:

> > > > diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c

> > > b/platform/linux-generic/arch/x86/odp_cpu_arch.c

> > > > index c8cf27b6..9ba601a3 100644

> > > > --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c

> > > > +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c

> > > > @@ -3,7 +3,14 @@

> > > >   *

> > > >   * SPDX-License-Identifier:     BSD-3-Clause

> > > >   */

> > > > +

> > > > +#include <odp_posix_extensions.h>

> > > > +

> > > >  #include <odp/api/cpu.h>

> > > > +#include <odp_time_internal.h>

> > > > +#include <odp_debug_internal.h>

> > > > +

> > > > +#include <time.h>

> > > >

> > > >  uint64_t odp_cpu_cycles(void)

> > > >  {

> > > > @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void)

> > > >  {

> > > >  	return 1;

> > > >  }

> > > > +

> > > > +uint64_t cpu_global_time(void)

> > > > +{

> > > > +	return odp_cpu_cycles();

> > > 

> > > A cycle counter cannot always be used to measure time. Even on x86,

> > > odp_cpu_cycles() will return the value of RDTSC which is not actually

> > > representative of the cycle count. Even if the x86 processor is set

> > > to a fixed frequency, the Invariant TSC may run at a different fixed

> > > frequency. Please take a look at the odp_tick_t proposal here:

> > > 

> > > https://docs.google.com/document/d/1sY7rOxqCNu-bMqjBiT5_keAIohrX1ZW-

> > > eL0oGLAQ4OM/edit?usp=sharing

> > >

> > 

> > From coverletter:

> > "This patch set modifies time implementation to use TSC when running on a x86 

> > CPU that has invarint TSC CPU flag set. Otherwise, the same Linux system time 

> > is used as before. TSC is much more efficient both in performance and 

> > latency/jitter wise than Linux system call. This can be seen also with 

> > scheduler latency test which time stamps events with this API. All latency 

> > measurements (min, ave, max) improved significantly."

> > 

> > This function (cpu_global_time()) is called only when we have first checked that TSC is invariant. Also we measure the TSC frequency in that case. This function is defined in the same file as cpu_cycles(), and the file is x86 specific. So, we know what we are doing, and just re-using the code to read TSC.


What sort of timing accuracy is expected from the app?

From benchmarking the maximum single-threaded rate of these reads:

 x86_64:

   read       7 ns/op
   read_sync  22 ns/op

 A57:

   read       4 ns/op
   read_sync  26 ns/op

read_sync issues a synchronizing instruction for greater timing accuracy
but clearly takes more time to return the time value read from the core.

> > > > +}

> > > > +

> > > > +#define SEC_IN_NS 1000000000ULL

> > > > +

> > > > +/* Measure TSC frequency. Frequency information registers are defined

> > > for x86,

> > > > + * but those are often not enumerated. */

> > > > +uint64_t cpu_global_time_freq(void)

> > > > +{

> > > 

> > > The 0x40000010 leaf is standardized in Hyper-V as returning the TSC's

> > > frequency in kHz. It would be better to use this when running under a

> > > hypervisor. It appears to work under VMware as well.

> > 

> > 

> > Sure, that can be done later, if we want to add any code into odp-linux that would behave differently under VM vs. host. Today, we don't have any.

> > 

> > Also there's leaf 0x15, which should give you TSC frequency but apparently is not enumerated often. So, I ended up to do the same as DPDK - just measure it.

> > 

> > 

> > > > diff --git a/platform/linux-generic/include/odp/api/plat/time_types.h

> > > b/platform/linux-generic/include/odp/api/plat/time_types.h

> > > > index 4847f3b1..8ae7ce82 100644

> > > > --- a/platform/linux-generic/include/odp/api/plat/time_types.h

> > > > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h

> > > > @@ -26,11 +26,28 @@ extern "C" {

> > > >   * the linux timespec structure, which is dependent on POSIX extension

> > > level.

> > > >   */

> > > >  typedef struct odp_time_t {

> > > > -	int64_t tv_sec;      /**< @internal Seconds */

> > > > -	int64_t tv_nsec;     /**< @internal Nanoseconds */

> > > > +	union {

> > > > +		/** @internal Posix timespec */

> > > > +		struct {

> > > > +			/** @internal Seconds */

> > > > +			int64_t tv_sec;

> > > > +

> > > > +			/** @internal Nanoseconds */

> > > > +			int64_t tv_nsec;

> > > > +		} spec;

> > > > +

> > > > +		/** @internal HW time counter */

> > > > +		struct {

> > > > +			/** @internal Counter value */

> > > > +			uint64_t count;

> > > > +

> > > > +			/** @internal Reserved */

> > > > +			uint64_t res;

> > > > +		} hw;

> > > > +	};

> > > 

> > > A processor's tick counter cannot be used to measure calendar time by

> > > itself. The delta between two ticks, however, can be converted to

> > > calendar time.

> > > 

> > > Please see the proposal that introduces odp_tick_t which eliminates

> > > the wasted u64 reserved field in the union above.

> > 

> > 

> > Nothing is wasted here today.

> 

> The 64-bit CPU time is stored in a 128-bit data structure where 64-bits

> are wasted. You can use just a 64-bit type and then convert that to

> a timespec (using a starting timestamp taken on global init) if needed.

> 

> > We need to support Posix timespec anyway. Timespec is used when there's no x86 invariant TSC available - so all non-x86 and even on some x86 systems.

> > 

> > Also "count" here starts from zero (we save cpu counter value at start). So, it takes 580 years to wrap at 1GHz. Why 64 bit counter value (that starts from zero) is not good for calendar time ?

> > 

> > 

> > 

> > 

> > > > +/*

> > > > + * Posix timespec based functions

> > > > + */

> > > >

> > > > -static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1)

> > > > +static inline odp_time_t time_spec_diff(odp_time_t t2, odp_time_t t1)

> > > 

> > > Static functions in .c files will always be considered for inlining, so

> > > the

> > > inline qualifier is unnecessary here. You can use always_inline compiler

> > > attribute if the disassembly and run time performance is not as expected.

> > > 

> > > Comment applies to nearly every function in this file.

> > 

> > 

> > This is because all our code (in this and other files) does "static inline" for functions that we want to be inlined. Static is used for functions that are static, but we don't care if those are inlined (slow path). Most time API functions are fast path.

> 

> I am saying that adding the 'inline' qualifier here doesn't actually do anything

> because the compiler will consider static functions for inlining anyways.

> 

> > 

> > > >

> > > > -static inline uint64_t time_local_res(void)

> > > > +static inline uint64_t time_hw_res(void)

> > > 

> > > I think 'freq' is a more descriptive name than 'res'.

> > > 

> > > It would also be good to document at either the function declaration

> > > or the function definition (if multiple implementations differ) whether

> > > the frequency is in Hz or kHz. For example, on ARMv8 it is just a

> > > register read to get Hz. No estimation, and no lowering to kHz.

> > 

> > 

> > This implements resolution API functions ...

> > 

> > /**

> >  * Global time resolution in hertz

> >  *

> >  * @return      Global time resolution in hertz

> >  */

> > uint64_t odp_time_global_res(void);

> > 

> > 

> > ... so, from that context you know that it's Hz. In this implementation, I lower the measured Hz to be conservative.

> > 

> > 

> > > > +

> > > > +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2)

> > > > +{

> > > > +	odp_time_t time;

> > > > +

> > > > +	time.hw.res = 0;

> > > > +	time.hw.count = t1.hw.count + t2.hw.count;

> > > > +

> > > > +	return time;

> > > > +}

> > > 

> > > If a single u64 were used this code wouldn't need to exist.

> > 

> > Which code? hw.res = 0 ? It not actually used for anything and could be removed. Although, the performance gain would not be huge. Anyway, I'll remove it for v2.

> 

> If the CPU time was stored in a 64-bit type, you can use arithmetic operators

> on the values directly instead of doing arithmetic on a 128-bit compound

> datatype where 64-bits are unused. This is the union above.

> 

> > -Petri

> >
Maxim Uvarov April 25, 2017, 7:47 p.m. UTC | #12
On 04/25/17 21:51, Brian Brooks wrote:
> On 04/25 12:25:03, Brian Brooks wrote:

>> On 04/24 08:07:58, Savolainen, Petri (Nokia - FI/Espoo) wrote:

>>>>> diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c

>>>> b/platform/linux-generic/arch/x86/odp_cpu_arch.c

>>>>> index c8cf27b6..9ba601a3 100644

>>>>> --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c

>>>>> +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c

>>>>> @@ -3,7 +3,14 @@

>>>>>   *

>>>>>   * SPDX-License-Identifier:     BSD-3-Clause

>>>>>   */

>>>>> +

>>>>> +#include <odp_posix_extensions.h>

>>>>> +

>>>>>  #include <odp/api/cpu.h>

>>>>> +#include <odp_time_internal.h>

>>>>> +#include <odp_debug_internal.h>

>>>>> +

>>>>> +#include <time.h>

>>>>>

>>>>>  uint64_t odp_cpu_cycles(void)

>>>>>  {

>>>>> @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void)

>>>>>  {

>>>>>  	return 1;

>>>>>  }

>>>>> +

>>>>> +uint64_t cpu_global_time(void)

>>>>> +{

>>>>> +	return odp_cpu_cycles();

>>>>

>>>> A cycle counter cannot always be used to measure time. Even on x86,

>>>> odp_cpu_cycles() will return the value of RDTSC which is not actually

>>>> representative of the cycle count. Even if the x86 processor is set

>>>> to a fixed frequency, the Invariant TSC may run at a different fixed

>>>> frequency. Please take a look at the odp_tick_t proposal here:

>>>>

>>>> https://docs.google.com/document/d/1sY7rOxqCNu-bMqjBiT5_keAIohrX1ZW-

>>>> eL0oGLAQ4OM/edit?usp=sharing

>>>>

>>>

>>> From coverletter:

>>> "This patch set modifies time implementation to use TSC when running on a x86 

>>> CPU that has invarint TSC CPU flag set. Otherwise, the same Linux system time 

>>> is used as before. TSC is much more efficient both in performance and 

>>> latency/jitter wise than Linux system call. This can be seen also with 

>>> scheduler latency test which time stamps events with this API. All latency 

>>> measurements (min, ave, max) improved significantly."

>>>

>>> This function (cpu_global_time()) is called only when we have first checked that TSC is invariant. Also we measure the TSC frequency in that case. This function is defined in the same file as cpu_cycles(), and the file is x86 specific. So, we know what we are doing, and just re-using the code to read TSC.

> 

> What sort of timing accuracy is expected from the app?

> 

> From benchmarking the maximum single-threaded rate of these reads:

> 

>  x86_64:

> 

>    read       7 ns/op

>    read_sync  22 ns/op

> 

>  A57:

> 

>    read       4 ns/op

>    read_sync  26 ns/op

> 

> read_sync issues a synchronizing instruction for greater timing accuracy

> but clearly takes more time to return the time value read from the core.

> 



it has to be depend on cpu frequency.

Maxim.

>>>>> +}

>>>>> +

>>>>> +#define SEC_IN_NS 1000000000ULL

>>>>> +

>>>>> +/* Measure TSC frequency. Frequency information registers are defined

>>>> for x86,

>>>>> + * but those are often not enumerated. */

>>>>> +uint64_t cpu_global_time_freq(void)

>>>>> +{

>>>>

>>>> The 0x40000010 leaf is standardized in Hyper-V as returning the TSC's

>>>> frequency in kHz. It would be better to use this when running under a

>>>> hypervisor. It appears to work under VMware as well.

>>>

>>>

>>> Sure, that can be done later, if we want to add any code into odp-linux that would behave differently under VM vs. host. Today, we don't have any.

>>>

>>> Also there's leaf 0x15, which should give you TSC frequency but apparently is not enumerated often. So, I ended up to do the same as DPDK - just measure it.

>>>

>>>

>>>>> diff --git a/platform/linux-generic/include/odp/api/plat/time_types.h

>>>> b/platform/linux-generic/include/odp/api/plat/time_types.h

>>>>> index 4847f3b1..8ae7ce82 100644

>>>>> --- a/platform/linux-generic/include/odp/api/plat/time_types.h

>>>>> +++ b/platform/linux-generic/include/odp/api/plat/time_types.h

>>>>> @@ -26,11 +26,28 @@ extern "C" {

>>>>>   * the linux timespec structure, which is dependent on POSIX extension

>>>> level.

>>>>>   */

>>>>>  typedef struct odp_time_t {

>>>>> -	int64_t tv_sec;      /**< @internal Seconds */

>>>>> -	int64_t tv_nsec;     /**< @internal Nanoseconds */

>>>>> +	union {

>>>>> +		/** @internal Posix timespec */

>>>>> +		struct {

>>>>> +			/** @internal Seconds */

>>>>> +			int64_t tv_sec;

>>>>> +

>>>>> +			/** @internal Nanoseconds */

>>>>> +			int64_t tv_nsec;

>>>>> +		} spec;

>>>>> +

>>>>> +		/** @internal HW time counter */

>>>>> +		struct {

>>>>> +			/** @internal Counter value */

>>>>> +			uint64_t count;

>>>>> +

>>>>> +			/** @internal Reserved */

>>>>> +			uint64_t res;

>>>>> +		} hw;

>>>>> +	};

>>>>

>>>> A processor's tick counter cannot be used to measure calendar time by

>>>> itself. The delta between two ticks, however, can be converted to

>>>> calendar time.

>>>>

>>>> Please see the proposal that introduces odp_tick_t which eliminates

>>>> the wasted u64 reserved field in the union above.

>>>

>>>

>>> Nothing is wasted here today.

>>

>> The 64-bit CPU time is stored in a 128-bit data structure where 64-bits

>> are wasted. You can use just a 64-bit type and then convert that to

>> a timespec (using a starting timestamp taken on global init) if needed.

>>

>>> We need to support Posix timespec anyway. Timespec is used when there's no x86 invariant TSC available - so all non-x86 and even on some x86 systems.

>>>

>>> Also "count" here starts from zero (we save cpu counter value at start). So, it takes 580 years to wrap at 1GHz. Why 64 bit counter value (that starts from zero) is not good for calendar time ?

>>>

>>>

>>>

>>>

>>>>> +/*

>>>>> + * Posix timespec based functions

>>>>> + */

>>>>>

>>>>> -static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1)

>>>>> +static inline odp_time_t time_spec_diff(odp_time_t t2, odp_time_t t1)

>>>>

>>>> Static functions in .c files will always be considered for inlining, so

>>>> the

>>>> inline qualifier is unnecessary here. You can use always_inline compiler

>>>> attribute if the disassembly and run time performance is not as expected.

>>>>

>>>> Comment applies to nearly every function in this file.

>>>

>>>

>>> This is because all our code (in this and other files) does "static inline" for functions that we want to be inlined. Static is used for functions that are static, but we don't care if those are inlined (slow path). Most time API functions are fast path.

>>

>> I am saying that adding the 'inline' qualifier here doesn't actually do anything

>> because the compiler will consider static functions for inlining anyways.

>>

>>>

>>>>>

>>>>> -static inline uint64_t time_local_res(void)

>>>>> +static inline uint64_t time_hw_res(void)

>>>>

>>>> I think 'freq' is a more descriptive name than 'res'.

>>>>

>>>> It would also be good to document at either the function declaration

>>>> or the function definition (if multiple implementations differ) whether

>>>> the frequency is in Hz or kHz. For example, on ARMv8 it is just a

>>>> register read to get Hz. No estimation, and no lowering to kHz.

>>>

>>>

>>> This implements resolution API functions ...

>>>

>>> /**

>>>  * Global time resolution in hertz

>>>  *

>>>  * @return      Global time resolution in hertz

>>>  */

>>> uint64_t odp_time_global_res(void);

>>>

>>>

>>> ... so, from that context you know that it's Hz. In this implementation, I lower the measured Hz to be conservative.

>>>

>>>

>>>>> +

>>>>> +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2)

>>>>> +{

>>>>> +	odp_time_t time;

>>>>> +

>>>>> +	time.hw.res = 0;

>>>>> +	time.hw.count = t1.hw.count + t2.hw.count;

>>>>> +

>>>>> +	return time;

>>>>> +}

>>>>

>>>> If a single u64 were used this code wouldn't need to exist.

>>>

>>> Which code? hw.res = 0 ? It not actually used for anything and could be removed. Although, the performance gain would not be huge. Anyway, I'll remove it for v2.

>>

>> If the CPU time was stored in a 64-bit type, you can use arithmetic operators

>> on the values directly instead of doing arithmetic on a 128-bit compound

>> datatype where 64-bits are unused. This is the union above.

>>

>>> -Petri

>>>
Brian Brooks April 25, 2017, 7:57 p.m. UTC | #13
On 04/25 22:47:11, Maxim Uvarov wrote:
> On 04/25/17 21:51, Brian Brooks wrote:

> > On 04/25 12:25:03, Brian Brooks wrote:

> >> On 04/24 08:07:58, Savolainen, Petri (Nokia - FI/Espoo) wrote:

> >>>>> diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c

> >>>> b/platform/linux-generic/arch/x86/odp_cpu_arch.c

> >>>>> index c8cf27b6..9ba601a3 100644

> >>>>> --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c

> >>>>> +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c

> >>>>> @@ -3,7 +3,14 @@

> >>>>>   *

> >>>>>   * SPDX-License-Identifier:     BSD-3-Clause

> >>>>>   */

> >>>>> +

> >>>>> +#include <odp_posix_extensions.h>

> >>>>> +

> >>>>>  #include <odp/api/cpu.h>

> >>>>> +#include <odp_time_internal.h>

> >>>>> +#include <odp_debug_internal.h>

> >>>>> +

> >>>>> +#include <time.h>

> >>>>>

> >>>>>  uint64_t odp_cpu_cycles(void)

> >>>>>  {

> >>>>> @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void)

> >>>>>  {

> >>>>>  	return 1;

> >>>>>  }

> >>>>> +

> >>>>> +uint64_t cpu_global_time(void)

> >>>>> +{

> >>>>> +	return odp_cpu_cycles();

> >>>>

> >>>> A cycle counter cannot always be used to measure time. Even on x86,

> >>>> odp_cpu_cycles() will return the value of RDTSC which is not actually

> >>>> representative of the cycle count. Even if the x86 processor is set

> >>>> to a fixed frequency, the Invariant TSC may run at a different fixed

> >>>> frequency. Please take a look at the odp_tick_t proposal here:

> >>>>

> >>>> https://docs.google.com/document/d/1sY7rOxqCNu-bMqjBiT5_keAIohrX1ZW-

> >>>> eL0oGLAQ4OM/edit?usp=sharing

> >>>>

> >>>

> >>> From coverletter:

> >>> "This patch set modifies time implementation to use TSC when running on a x86 

> >>> CPU that has invarint TSC CPU flag set. Otherwise, the same Linux system time 

> >>> is used as before. TSC is much more efficient both in performance and 

> >>> latency/jitter wise than Linux system call. This can be seen also with 

> >>> scheduler latency test which time stamps events with this API. All latency 

> >>> measurements (min, ave, max) improved significantly."

> >>>

> >>> This function (cpu_global_time()) is called only when we have first checked that TSC is invariant. Also we measure the TSC frequency in that case. This function is defined in the same file as cpu_cycles(), and the file is x86 specific. So, we know what we are doing, and just re-using the code to read TSC.

> > 

> > What sort of timing accuracy is expected from the app?

> > 

> > From benchmarking the maximum single-threaded rate of these reads:

> > 

> >  x86_64:

> > 

> >    read       7 ns/op

> >    read_sync  22 ns/op

> > 

> >  A57:

> > 

> >    read       4 ns/op

> >    read_sync  26 ns/op

> > 

> > read_sync issues a synchronizing instruction for greater timing accuracy

> > but clearly takes more time to return the time value read from the core.

> > 

> 

> 

> it has to be depend on cpu frequency.

> 

> Maxim.


We are showing the difference between 'read' and 'read_sync' on the
same machine here.
Honnappa Nagarahalli April 26, 2017, 4:21 a.m. UTC | #14
On 24 April 2017 at 03:07, Savolainen, Petri (Nokia - FI/Espoo)
<petri.savolainen@nokia-bell-labs.com> wrote:
>> > diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c

>> b/platform/linux-generic/arch/x86/odp_cpu_arch.c

>> > index c8cf27b6..9ba601a3 100644

>> > --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c

>> > +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c

>> > @@ -3,7 +3,14 @@

>> >   *

>> >   * SPDX-License-Identifier:     BSD-3-Clause

>> >   */

>> > +

>> > +#include <odp_posix_extensions.h>

>> > +

>> >  #include <odp/api/cpu.h>

>> > +#include <odp_time_internal.h>

>> > +#include <odp_debug_internal.h>

>> > +

>> > +#include <time.h>

>> >

>> >  uint64_t odp_cpu_cycles(void)

>> >  {

>> > @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void)

>> >  {

>> >     return 1;

>> >  }

>> > +

>> > +uint64_t cpu_global_time(void)

>> > +{

>> > +   return odp_cpu_cycles();

>>

>> A cycle counter cannot always be used to measure time. Even on x86,

>> odp_cpu_cycles() will return the value of RDTSC which is not actually

>> representative of the cycle count. Even if the x86 processor is set

>> to a fixed frequency, the Invariant TSC may run at a different fixed

>> frequency. Please take a look at the odp_tick_t proposal here:

>>

>> https://docs.google.com/document/d/1sY7rOxqCNu-bMqjBiT5_keAIohrX1ZW-

>> eL0oGLAQ4OM/edit?usp=sharing

>>

>

> From coverletter:

> "This patch set modifies time implementation to use TSC when running on a x86

> CPU that has invarint TSC CPU flag set. Otherwise, the same Linux system time

> is used as before. TSC is much more efficient both in performance and

> latency/jitter wise than Linux system call. This can be seen also with

> scheduler latency test which time stamps events with this API. All latency

> measurements (min, ave, max) improved significantly."


odp_sched_latency currently uses clock_gettime. It is my understanding
that clock_gettime does not have the over head of the system call. Can
you elaborate more on the 'improved significantly' part?

>

> This function (cpu_global_time()) is called only when we have first checked that TSC is invariant. Also we measure the TSC frequency in that case. This function is defined in the same file as cpu_cycles(), and the file is x86 specific. So, we know what we are doing, and just re-using the code to read TSC.

>

>

>

>> > +}

>> > +

>> > +#define SEC_IN_NS 1000000000ULL

>> > +

>> > +/* Measure TSC frequency. Frequency information registers are defined

>> for x86,

>> > + * but those are often not enumerated. */

>> > +uint64_t cpu_global_time_freq(void)

>> > +{

>>

>> The 0x40000010 leaf is standardized in Hyper-V as returning the TSC's

>> frequency in kHz. It would be better to use this when running under a

>> hypervisor. It appears to work under VMware as well.

>

>

> Sure, that can be done later, if we want to add any code into odp-linux that would behave differently under VM vs. host. Today, we don't have any.

>

> Also there's leaf 0x15, which should give you TSC frequency but apparently is not enumerated often. So, I ended up to do the same as DPDK - just measure it.

>

>

>> > diff --git a/platform/linux-generic/include/odp/api/plat/time_types.h

>> b/platform/linux-generic/include/odp/api/plat/time_types.h

>> > index 4847f3b1..8ae7ce82 100644

>> > --- a/platform/linux-generic/include/odp/api/plat/time_types.h

>> > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h

>> > @@ -26,11 +26,28 @@ extern "C" {

>> >   * the linux timespec structure, which is dependent on POSIX extension

>> level.

>> >   */

>> >  typedef struct odp_time_t {

>> > -   int64_t tv_sec;      /**< @internal Seconds */

>> > -   int64_t tv_nsec;     /**< @internal Nanoseconds */

>> > +   union {

>> > +           /** @internal Posix timespec */

>> > +           struct {

>> > +                   /** @internal Seconds */

>> > +                   int64_t tv_sec;

>> > +

>> > +                   /** @internal Nanoseconds */

>> > +                   int64_t tv_nsec;

>> > +           } spec;

>> > +

>> > +           /** @internal HW time counter */

>> > +           struct {

>> > +                   /** @internal Counter value */

>> > +                   uint64_t count;

>> > +

>> > +                   /** @internal Reserved */

>> > +                   uint64_t res;

>> > +           } hw;

>> > +   };

>>

>> A processor's tick counter cannot be used to measure calendar time by

>> itself. The delta between two ticks, however, can be converted to

>> calendar time.

>>

>> Please see the proposal that introduces odp_tick_t which eliminates

>> the wasted u64 reserved field in the union above.

>

>

> Nothing is wasted here today. We need to support Posix timespec anyway. Timespec is used when there's no x86 invariant TSC available - so all non-x86 and even on some x86 systems.

>

> Also "count" here starts from zero (we save cpu counter value at start). So, it takes 580 years to wrap at 1GHz. Why 64 bit counter value (that starts from zero) is not good for calendar time ?

>

>

>

>

>> > +/*

>> > + * Posix timespec based functions

>> > + */

>> >

>> > -static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1)

>> > +static inline odp_time_t time_spec_diff(odp_time_t t2, odp_time_t t1)

>>

>> Static functions in .c files will always be considered for inlining, so

>> the

>> inline qualifier is unnecessary here. You can use always_inline compiler

>> attribute if the disassembly and run time performance is not as expected.

>>

>> Comment applies to nearly every function in this file.

>

>

> This is because all our code (in this and other files) does "static inline" for functions that we want to be inlined. Static is used for functions that are static, but we don't care if those are inlined (slow path). Most time API functions are fast path.

>

>

>> >

>> > -static inline uint64_t time_local_res(void)

>> > +static inline uint64_t time_hw_res(void)

>>

>> I think 'freq' is a more descriptive name than 'res'.

>>

>> It would also be good to document at either the function declaration

>> or the function definition (if multiple implementations differ) whether

>> the frequency is in Hz or kHz. For example, on ARMv8 it is just a

>> register read to get Hz. No estimation, and no lowering to kHz.

>

>

> This implements resolution API functions ...

>

> /**

>  * Global time resolution in hertz

>  *

>  * @return      Global time resolution in hertz

>  */

> uint64_t odp_time_global_res(void);

>

>

> ... so, from that context you know that it's Hz. In this implementation, I lower the measured Hz to be conservative.

>

>

>> > +

>> > +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2)

>> > +{

>> > +   odp_time_t time;

>> > +

>> > +   time.hw.res = 0;

>> > +   time.hw.count = t1.hw.count + t2.hw.count;

>> > +

>> > +   return time;

>> > +}

>>

>> If a single u64 were used this code wouldn't need to exist.

>

> Which code? hw.res = 0 ? It not actually used for anything and could be removed. Although, the performance gain would not be huge. Anyway, I'll remove it for v2.

>

> -Petri

>
Honnappa Nagarahalli April 26, 2017, 4:38 a.m. UTC | #15
On 21 April 2017 at 08:11, Petri Savolainen <petri.savolainen@linaro.org> wrote:
> Use 64 bit HW time counter when available. It is used on

> x86 when invariant TSC CPU flag indicates that TSC frequency

> is constant. Otherwise, the system time is used as before. Direct

> HW time counter usage avoids system call, and related latency

> and performance issues.

>

> Signed-off-by: Petri Savolainen <petri.savolainen@linaro.org>

> ---

>  platform/linux-generic/Makefile.am                 |   1 +

>  platform/linux-generic/arch/arm/odp_cpu_arch.c     |  11 +

>  platform/linux-generic/arch/default/odp_cpu_arch.c |  11 +

>  platform/linux-generic/arch/mips64/odp_cpu_arch.c  |  11 +

>  platform/linux-generic/arch/powerpc/odp_cpu_arch.c |  11 +

>  platform/linux-generic/arch/x86/cpu_flags.c        |   9 +

>  platform/linux-generic/arch/x86/odp_cpu_arch.c     |  59 ++++

>  .../include/odp/api/plat/time_types.h              |  23 +-

>  platform/linux-generic/include/odp_time_internal.h |  24 ++

>  platform/linux-generic/odp_time.c                  | 303 ++++++++++++++++-----

>  10 files changed, 398 insertions(+), 65 deletions(-)

>  create mode 100644 platform/linux-generic/include/odp_time_internal.h

>

> diff --git a/platform/linux-generic/Makefile.am b/platform/linux-generic/Makefile.am

> index 60b7f849..ed66fecf 100644

> --- a/platform/linux-generic/Makefile.am

> +++ b/platform/linux-generic/Makefile.am

> @@ -171,6 +171,7 @@ noinst_HEADERS = \

>                   ${srcdir}/include/odp_schedule_if.h \

>                   ${srcdir}/include/odp_sorted_list_internal.h \

>                   ${srcdir}/include/odp_shm_internal.h \

> +                 ${srcdir}/include/odp_time_internal.h \

>                   ${srcdir}/include/odp_timer_internal.h \

>                   ${srcdir}/include/odp_timer_wheel_internal.h \

>                   ${srcdir}/include/odp_traffic_mngr_internal.h \

> diff --git a/platform/linux-generic/arch/arm/odp_cpu_arch.c b/platform/linux-generic/arch/arm/odp_cpu_arch.c

> index 2ac223e0..3a87f09c 100644

> --- a/platform/linux-generic/arch/arm/odp_cpu_arch.c

> +++ b/platform/linux-generic/arch/arm/odp_cpu_arch.c

> @@ -13,6 +13,7 @@

>  #include <odp/api/hints.h>

>  #include <odp/api/system_info.h>

>  #include <odp_debug_internal.h>

> +#include <odp_time_internal.h>

>

>  #define GIGA 1000000000

>

> @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void)

>  {

>         return 1;

>  }

> +

> +uint64_t cpu_global_time(void)

> +{

> +       return 0;

> +}

> +

> +uint64_t cpu_global_time_freq(void)

> +{

> +       return 0;

> +}

> diff --git a/platform/linux-generic/arch/default/odp_cpu_arch.c b/platform/linux-generic/arch/default/odp_cpu_arch.c

> index 2ac223e0..3a87f09c 100644

> --- a/platform/linux-generic/arch/default/odp_cpu_arch.c

> +++ b/platform/linux-generic/arch/default/odp_cpu_arch.c

> @@ -13,6 +13,7 @@

>  #include <odp/api/hints.h>

>  #include <odp/api/system_info.h>

>  #include <odp_debug_internal.h>

> +#include <odp_time_internal.h>

>

>  #define GIGA 1000000000

>

> @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void)

>  {

>         return 1;

>  }

> +

> +uint64_t cpu_global_time(void)

> +{

> +       return 0;

> +}

> +

> +uint64_t cpu_global_time_freq(void)

> +{

> +       return 0;

> +}

> diff --git a/platform/linux-generic/arch/mips64/odp_cpu_arch.c b/platform/linux-generic/arch/mips64/odp_cpu_arch.c

> index 646acf9c..a9a94531 100644

> --- a/platform/linux-generic/arch/mips64/odp_cpu_arch.c

> +++ b/platform/linux-generic/arch/mips64/odp_cpu_arch.c

> @@ -7,6 +7,7 @@

>  #include <odp/api/cpu.h>

>  #include <odp/api/hints.h>

>  #include <odp/api/system_info.h>

> +#include <odp_time_internal.h>

>

>  uint64_t odp_cpu_cycles(void)

>  {

> @@ -29,3 +30,13 @@ uint64_t odp_cpu_cycles_resolution(void)

>  {

>         return 1;

>  }

> +

> +uint64_t cpu_global_time(void)

> +{

> +       return 0;

> +}

> +

> +uint64_t cpu_global_time_freq(void)

> +{

> +       return 0;

> +}

> diff --git a/platform/linux-generic/arch/powerpc/odp_cpu_arch.c b/platform/linux-generic/arch/powerpc/odp_cpu_arch.c

> index 2ac223e0..3a87f09c 100644

> --- a/platform/linux-generic/arch/powerpc/odp_cpu_arch.c

> +++ b/platform/linux-generic/arch/powerpc/odp_cpu_arch.c

> @@ -13,6 +13,7 @@

>  #include <odp/api/hints.h>

>  #include <odp/api/system_info.h>

>  #include <odp_debug_internal.h>

> +#include <odp_time_internal.h>

>

>  #define GIGA 1000000000

>

> @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void)

>  {

>         return 1;

>  }

> +

> +uint64_t cpu_global_time(void)

> +{

> +       return 0;

> +}

> +

> +uint64_t cpu_global_time_freq(void)

> +{

> +       return 0;

> +}

> diff --git a/platform/linux-generic/arch/x86/cpu_flags.c b/platform/linux-generic/arch/x86/cpu_flags.c

> index 8fb9477a..cde8ad3e 100644

> --- a/platform/linux-generic/arch/x86/cpu_flags.c

> +++ b/platform/linux-generic/arch/x86/cpu_flags.c

> @@ -38,6 +38,7 @@

>   */

>

>  #include <arch/x86/cpu_flags.h>

> +#include <odp_time_internal.h>

>  #include <stdio.h>

>  #include <stdint.h>

>

> @@ -347,3 +348,11 @@ void cpu_flags_print_all(void)

>

>         printf("\n\n");

>  }

> +

> +int cpu_has_global_time(void)

> +{

> +       if (cpu_get_flag_enabled(RTE_CPUFLAG_INVTSC) > 0)

> +               return 1;

> +

> +       return 0;

> +}

> diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c b/platform/linux-generic/arch/x86/odp_cpu_arch.c

> index c8cf27b6..9ba601a3 100644

> --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c

> +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c

> @@ -3,7 +3,14 @@

>   *

>   * SPDX-License-Identifier:     BSD-3-Clause

>   */

> +

> +#include <odp_posix_extensions.h>

> +

>  #include <odp/api/cpu.h>

> +#include <odp_time_internal.h>

> +#include <odp_debug_internal.h>

> +

> +#include <time.h>

>

>  uint64_t odp_cpu_cycles(void)

>  {

> @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void)

>  {

>         return 1;

>  }

> +

> +uint64_t cpu_global_time(void)

> +{

> +       return odp_cpu_cycles();

> +}

> +

> +#define SEC_IN_NS 1000000000ULL

> +

> +/* Measure TSC frequency. Frequency information registers are defined for x86,

> + * but those are often not enumerated. */

> +uint64_t cpu_global_time_freq(void)

> +{

> +       struct timespec sleep, ts1, ts2;

> +       uint64_t t1, t2, ts_nsec, cycles, hz;

> +       int i;

> +       uint64_t avg = 0;

> +       int rounds = 4;

> +

> +       for (i = 0; i < rounds; i++) {

> +               sleep.tv_sec  = 0;

> +               sleep.tv_nsec = SEC_IN_NS / 10;

> +

> +               if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts1)) {

> +                       ODP_DBG("clock_gettime failed\n");

> +                       return 0;

> +               }

> +

> +               t1 = cpu_global_time();

> +

> +               if (nanosleep(&sleep, NULL) < 0) {

> +                       ODP_DBG("nanosleep failed\n");

> +                       return 0;

> +               }

> +

> +               if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts2)) {

> +                       ODP_DBG("clock_gettime failed\n");

> +                       return 0;

> +               }

> +

> +               t2 = cpu_global_time();

> +

> +               ts_nsec  = (ts2.tv_sec - ts1.tv_sec) * SEC_IN_NS;

> +               ts_nsec += ts2.tv_nsec - ts1.tv_nsec;

> +

> +               cycles = t2 - t1;

> +

> +               hz = (cycles * SEC_IN_NS) / ts_nsec;

> +               avg += hz;

> +       }

> +

> +       return avg / rounds;

> +}


This function is not very accurate. Ideally, ts1 and t1 (ts2 and t2)
should be read at the same instance for this to be accurate. There is
also the possibility that the 'rdtsc' in cpu_global_time might get
executed ahead or later than what is in the code here.

Since this is called during init, the 'rounds' can be increased to a
higher value to get a better average. Initial values can be discarded
to ignore the cache warming latencies.

We should fall back to this only if the frequency information
registers are not available.

There is a good white paper on methods to use for measuring cycles for
code that takes small amount of cycles.
http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf


> diff --git a/platform/linux-generic/include/odp/api/plat/time_types.h b/platform/linux-generic/include/odp/api/plat/time_types.h

> index 4847f3b1..8ae7ce82 100644

> --- a/platform/linux-generic/include/odp/api/plat/time_types.h

> +++ b/platform/linux-generic/include/odp/api/plat/time_types.h

> @@ -26,11 +26,28 @@ extern "C" {

>   * the linux timespec structure, which is dependent on POSIX extension level.

>   */

>  typedef struct odp_time_t {

> -       int64_t tv_sec;      /**< @internal Seconds */

> -       int64_t tv_nsec;     /**< @internal Nanoseconds */

> +       union {

> +               /** @internal Posix timespec */

> +               struct {

> +                       /** @internal Seconds */

> +                       int64_t tv_sec;

> +

> +                       /** @internal Nanoseconds */

> +                       int64_t tv_nsec;

> +               } spec;

> +

> +               /** @internal HW time counter */

> +               struct {

> +                       /** @internal Counter value */

> +                       uint64_t count;

> +

> +                       /** @internal Reserved */

> +                       uint64_t res;

> +               } hw;

> +       };

>  } odp_time_t;

>

> -#define ODP_TIME_NULL ((odp_time_t){0, 0})

> +#define ODP_TIME_NULL ((odp_time_t){.spec = {0, 0} })

>

>  /**

>   * @}

> diff --git a/platform/linux-generic/include/odp_time_internal.h b/platform/linux-generic/include/odp_time_internal.h

> new file mode 100644

> index 00000000..99ac7977

> --- /dev/null

> +++ b/platform/linux-generic/include/odp_time_internal.h

> @@ -0,0 +1,24 @@

> +/* Copyright (c) 2017, Linaro Limited

> + * All rights reserved.

> + *

> + * SPDX-License-Identifier:     BSD-3-Clause

> + */

> +

> +#ifndef ODP_TIME_INTERNAL_H_

> +#define ODP_TIME_INTERNAL_H_

> +

> +#ifdef __cplusplus

> +extern "C" {

> +#endif

> +

> +#include <stdint.h>

> +

> +int cpu_has_global_time(void);

> +uint64_t cpu_global_time(void);

> +uint64_t cpu_global_time_freq(void);

> +

> +#ifdef __cplusplus

> +}

> +#endif

> +

> +#endif

> diff --git a/platform/linux-generic/odp_time.c b/platform/linux-generic/odp_time.c

> index 81e05224..2dd8f2c4 100644

> --- a/platform/linux-generic/odp_time.c

> +++ b/platform/linux-generic/odp_time.c

> @@ -10,36 +10,39 @@

>  #include <odp/api/time.h>

>  #include <odp/api/hints.h>

>  #include <odp_debug_internal.h>

> +#include <odp_time_internal.h>

> +#include <string.h>

> +#include <inttypes.h>

>

> -static odp_time_t start_time;

> +typedef struct time_global_t {

> +       odp_time_t start_time;

> +       int        use_hw;

> +       uint64_t   hw_start;

> +       uint64_t   hw_freq_hz;

> +} time_global_t;

>

> -static inline

> -uint64_t time_to_ns(odp_time_t time)

> -{

> -       uint64_t ns;

> -

> -       ns = time.tv_sec * ODP_TIME_SEC_IN_NS;

> -       ns += time.tv_nsec;

> +static time_global_t global;

>

> -       return ns;

> -}

> +/*

> + * Posix timespec based functions

> + */

>

> -static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1)

> +static inline odp_time_t time_spec_diff(odp_time_t t2, odp_time_t t1)

>  {

>         odp_time_t time;

>

> -       time.tv_sec = t2.tv_sec - t1.tv_sec;

> -       time.tv_nsec = t2.tv_nsec - t1.tv_nsec;

> +       time.spec.tv_sec = t2.spec.tv_sec - t1.spec.tv_sec;

> +       time.spec.tv_nsec = t2.spec.tv_nsec - t1.spec.tv_nsec;

>

> -       if (time.tv_nsec < 0) {

> -               time.tv_nsec += ODP_TIME_SEC_IN_NS;

> -               --time.tv_sec;

> +       if (time.spec.tv_nsec < 0) {

> +               time.spec.tv_nsec += ODP_TIME_SEC_IN_NS;

> +               --time.spec.tv_sec;

>         }

>

>         return time;

>  }

>

> -static inline odp_time_t time_local(void)

> +static inline odp_time_t time_spec_cur(void)

>  {

>         int ret;

>         odp_time_t time;

> @@ -49,77 +52,237 @@ static inline odp_time_t time_local(void)

>         if (odp_unlikely(ret != 0))

>                 ODP_ABORT("clock_gettime failed\n");

>

> -       time.tv_sec = sys_time.tv_sec;

> -       time.tv_nsec = sys_time.tv_nsec;

> +       time.spec.tv_sec = sys_time.tv_sec;

> +       time.spec.tv_nsec = sys_time.tv_nsec;

>

> -       return time_diff(time, start_time);

> +       return time_spec_diff(time, global.start_time);

>  }

>

> -static inline int time_cmp(odp_time_t t2, odp_time_t t1)

> +static inline uint64_t time_spec_res(void)

>  {

> -       if (t2.tv_sec < t1.tv_sec)

> +       int ret;

> +       struct timespec tres;

> +

> +       ret = clock_getres(CLOCK_MONOTONIC_RAW, &tres);

> +       if (odp_unlikely(ret != 0))

> +               ODP_ABORT("clock_getres failed\n");

> +

> +       return ODP_TIME_SEC_IN_NS / (uint64_t)tres.tv_nsec;

> +}

> +

> +static inline int time_spec_cmp(odp_time_t t2, odp_time_t t1)

> +{

> +       if (t2.spec.tv_sec < t1.spec.tv_sec)

>                 return -1;

>

> -       if (t2.tv_sec > t1.tv_sec)

> +       if (t2.spec.tv_sec > t1.spec.tv_sec)

>                 return 1;

>

> -       return t2.tv_nsec - t1.tv_nsec;

> +       return t2.spec.tv_nsec - t1.spec.tv_nsec;

>  }

>

> -static inline odp_time_t time_sum(odp_time_t t1, odp_time_t t2)

> +static inline odp_time_t time_spec_sum(odp_time_t t1, odp_time_t t2)

>  {

>         odp_time_t time;

>

> -       time.tv_sec = t2.tv_sec + t1.tv_sec;

> -       time.tv_nsec = t2.tv_nsec + t1.tv_nsec;

> +       time.spec.tv_sec = t2.spec.tv_sec + t1.spec.tv_sec;

> +       time.spec.tv_nsec = t2.spec.tv_nsec + t1.spec.tv_nsec;

>

> -       if (time.tv_nsec >= (long)ODP_TIME_SEC_IN_NS) {

> -               time.tv_nsec -= ODP_TIME_SEC_IN_NS;

> -               ++time.tv_sec;

> +       if (time.spec.tv_nsec >= (long)ODP_TIME_SEC_IN_NS) {

> +               time.spec.tv_nsec -= ODP_TIME_SEC_IN_NS;

> +               ++time.spec.tv_sec;

>         }

>

>         return time;

>  }

>

> -static inline odp_time_t time_local_from_ns(uint64_t ns)

> +static inline uint64_t time_spec_to_ns(odp_time_t time)

> +{

> +       uint64_t ns;

> +

> +       ns = time.spec.tv_sec * ODP_TIME_SEC_IN_NS;

> +       ns += time.spec.tv_nsec;

> +

> +       return ns;

> +}

> +

> +static inline odp_time_t time_spec_from_ns(uint64_t ns)

>  {

>         odp_time_t time;

>

> -       time.tv_sec = ns / ODP_TIME_SEC_IN_NS;

> -       time.tv_nsec = ns - time.tv_sec * ODP_TIME_SEC_IN_NS;

> +       time.spec.tv_sec = ns / ODP_TIME_SEC_IN_NS;

> +       time.spec.tv_nsec = ns - time.spec.tv_sec * ODP_TIME_SEC_IN_NS;

>

>         return time;

>  }

>

> -static inline void time_wait_until(odp_time_t time)

> +/*

> + * HW time counter based functions

> + */

> +

> +static inline odp_time_t time_hw_cur(void)

>  {

> -       odp_time_t cur;

> +       odp_time_t time;

>

> -       do {

> -               cur = time_local();

> -       } while (time_cmp(time, cur) > 0);

> +       time.hw.res   = 0;

> +       time.hw.count = cpu_global_time() - global.hw_start;

> +

> +       return time;

>  }

>

> -static inline uint64_t time_local_res(void)

> +static inline uint64_t time_hw_res(void)

>  {

> -       int ret;

> -       struct timespec tres;

> +       /* Promise a bit lower resolution than average cycle counter

> +        * frequency */

> +       return global.hw_freq_hz / 10;

> +}

>

> -       ret = clock_getres(CLOCK_MONOTONIC_RAW, &tres);

> -       if (odp_unlikely(ret != 0))

> -               ODP_ABORT("clock_getres failed\n");

> +static inline int time_hw_cmp(odp_time_t t2, odp_time_t t1)

> +{

> +       if (odp_likely(t2.hw.count > t1.hw.count))

> +               return 1;

>

> -       return ODP_TIME_SEC_IN_NS / (uint64_t)tres.tv_nsec;

> +       if (t2.hw.count < t1.hw.count)

> +               return -1;

> +

> +       return 0;

> +}

> +

> +static inline odp_time_t time_hw_diff(odp_time_t t2, odp_time_t t1)

> +{

> +       odp_time_t time;

> +

> +       time.hw.res = 0;

> +       time.hw.count = t2.hw.count - t1.hw.count;

> +

> +       return time;

> +}

> +

> +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2)

> +{

> +       odp_time_t time;

> +

> +       time.hw.res = 0;

> +       time.hw.count = t1.hw.count + t2.hw.count;

> +

> +       return time;

> +}

> +

> +static inline uint64_t time_hw_to_ns(odp_time_t time)

> +{

> +       uint64_t nsec;

> +       uint64_t freq_hz = global.hw_freq_hz;

> +       uint64_t count = time.hw.count;

> +       uint64_t sec = 0;

> +

> +       if (count >= freq_hz) {

> +               sec   = count / freq_hz;

> +               count = count - sec * freq_hz;

> +       }

> +

> +       nsec = (ODP_TIME_SEC_IN_NS * count) / freq_hz;

> +

> +       return (sec * ODP_TIME_SEC_IN_NS) + nsec;

> +}

> +

> +static inline odp_time_t time_hw_from_ns(uint64_t ns)

> +{

> +       odp_time_t time;

> +       uint64_t count;

> +       uint64_t freq_hz = global.hw_freq_hz;

> +       uint64_t sec = 0;

> +

> +       if (ns >= ODP_TIME_SEC_IN_NS) {

> +               sec = ns / ODP_TIME_SEC_IN_NS;

> +               ns  = ns - sec * ODP_TIME_SEC_IN_NS;

> +       }

> +

> +       count  = sec * freq_hz;

> +       count += (ns * freq_hz) / ODP_TIME_SEC_IN_NS;

> +

> +       time.hw.res   = 0;

> +       time.hw.count = count;

> +

> +       return time;

> +}

> +

> +/*

> + * Common functions

> + */

> +

> +static inline odp_time_t time_cur(void)

> +{

> +       if (global.use_hw)

> +               return time_hw_cur();

> +

> +       return time_spec_cur();

> +}

> +

> +static inline uint64_t time_res(void)

> +{

> +       if (global.use_hw)

> +               return time_hw_res();

> +

> +       return time_spec_res();

> +}

> +

> +static inline int time_cmp(odp_time_t t2, odp_time_t t1)

> +{

> +       if (global.use_hw)

> +               return time_hw_cmp(t2, t1);

> +

> +       return time_spec_cmp(t2, t1);

> +}

> +

> +static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1)

> +{

> +       if (global.use_hw)

> +               return time_hw_diff(t2, t1);

> +

> +       return time_spec_diff(t2, t1);

> +}

> +

> +static inline odp_time_t time_sum(odp_time_t t1, odp_time_t t2)

> +{

> +       if (global.use_hw)

> +               return time_hw_sum(t1, t2);

> +

> +       return time_spec_sum(t1, t2);

> +}

> +

> +static inline uint64_t time_to_ns(odp_time_t time)

> +{

> +       if (global.use_hw)

> +               return time_hw_to_ns(time);

> +

> +       return time_spec_to_ns(time);

> +}

> +

> +static inline odp_time_t time_from_ns(uint64_t ns)

> +{

> +       if (global.use_hw)

> +               return time_hw_from_ns(ns);

> +

> +       return time_spec_from_ns(ns);

> +}

> +

> +static inline void time_wait_until(odp_time_t time)

> +{

> +       odp_time_t cur;

> +

> +       do {

> +               cur = time_cur();

> +       } while (time_cmp(time, cur) > 0);

>  }

>

>  odp_time_t odp_time_local(void)

>  {

> -       return time_local();

> +       return time_cur();

>  }

>

>  odp_time_t odp_time_global(void)

>  {

> -       return time_local();

> +       return time_cur();

>  }

>

>  odp_time_t odp_time_diff(odp_time_t t2, odp_time_t t1)

> @@ -134,12 +297,12 @@ uint64_t odp_time_to_ns(odp_time_t time)

>

>  odp_time_t odp_time_local_from_ns(uint64_t ns)

>  {

> -       return time_local_from_ns(ns);

> +       return time_from_ns(ns);

>  }

>

>  odp_time_t odp_time_global_from_ns(uint64_t ns)

>  {

> -       return time_local_from_ns(ns);

> +       return time_from_ns(ns);

>  }

>

>  int odp_time_cmp(odp_time_t t2, odp_time_t t1)

> @@ -154,18 +317,18 @@ odp_time_t odp_time_sum(odp_time_t t1, odp_time_t t2)

>

>  uint64_t odp_time_local_res(void)

>  {

> -       return time_local_res();

> +       return time_res();

>  }

>

>  uint64_t odp_time_global_res(void)

>  {

> -       return time_local_res();

> +       return time_res();

>  }

>

>  void odp_time_wait_ns(uint64_t ns)

>  {

> -       odp_time_t cur = time_local();

> -       odp_time_t wait = time_local_from_ns(ns);

> +       odp_time_t cur = time_cur();

> +       odp_time_t wait = time_from_ns(ns);

>         odp_time_t end_time = time_sum(cur, wait);

>

>         time_wait_until(end_time);

> @@ -193,15 +356,31 @@ uint64_t odp_time_to_u64(odp_time_t time)

>

>  int odp_time_init_global(void)

>  {

> -       int ret;

> -       struct timespec time;

> -

> -       ret = clock_gettime(CLOCK_MONOTONIC_RAW, &time);

> -       if (ret) {

> -               start_time = ODP_TIME_NULL;

> -       } else {

> -               start_time.tv_sec = time.tv_sec;

> -               start_time.tv_nsec = time.tv_nsec;

> +       struct timespec sys_time;

> +       int ret = 0;

> +

> +       memset(&global, 0, sizeof(time_global_t));

> +

> +       if (cpu_has_global_time()) {

> +               global.use_hw = 1;

> +               global.hw_freq_hz  = cpu_global_time_freq();

> +

> +               if (global.hw_freq_hz == 0)

> +                       return -1;

> +

> +               printf("HW time counter freq: %" PRIu64 " hz\n\n",

> +                      global.hw_freq_hz);

> +

> +               global.hw_start = cpu_global_time();

> +               return 0;

> +       }

> +

> +       global.start_time = ODP_TIME_NULL;

> +

> +       ret = clock_gettime(CLOCK_MONOTONIC_RAW, &sys_time);

> +       if (ret == 0) {

> +               global.start_time.spec.tv_sec  = sys_time.tv_sec;

> +               global.start_time.spec.tv_nsec = sys_time.tv_nsec;

>         }

>

>         return ret;

> --

> 2.11.0

>
Savolainen, Petri (Nokia - FI/Espoo) April 26, 2017, 7:11 a.m. UTC | #16
> > From coverletter:

> > "This patch set modifies time implementation to use TSC when running on

> a x86

> > CPU that has invarint TSC CPU flag set. Otherwise, the same Linux system

> time

> > is used as before. TSC is much more efficient both in performance and

> > latency/jitter wise than Linux system call. This can be seen also with

> > scheduler latency test which time stamps events with this API. All

> latency

> > measurements (min, ave, max) improved significantly."

> 

> odp_sched_latency currently uses clock_gettime. It is my understanding

> that clock_gettime does not have the over head of the system call. Can

> you elaborate more on the 'improved significantly' part?

> 


clock_gettime() uses the same TSC, but when you profile it with perf you can see tens of kernel functions including system call entry, RCU maintenance, etc.

E.g. in sched_latency test kernel consumed about 10% of all the cycles. Also latency measurement results improved like this:
* min >3x lower
* avg 2x lower
* max more stable and 50% lower



-Petri
Savolainen, Petri (Nokia - FI/Espoo) April 26, 2017, 7:30 a.m. UTC | #17
> > > This function (cpu_global_time()) is called only when we have first

> checked that TSC is invariant. Also we measure the TSC frequency in that

> case. This function is defined in the same file as cpu_cycles(), and the

> file is x86 specific. So, we know what we are doing, and just re-using the

> code to read TSC.

> 

> What sort of timing accuracy is expected from the app?

> 

> From benchmarking the maximum single-threaded rate of these reads:

> 

>  x86_64:

> 

>    read       7 ns/op

>    read_sync  22 ns/op

> 

>  A57:

> 

>    read       4 ns/op

>    read_sync  26 ns/op

> 

> read_sync issues a synchronizing instruction for greater timing accuracy

> but clearly takes more time to return the time value read from the core.


Accuracy is as good as implementation can offer with reasonable overhead. We do not put any nsec figures into API spec. ODP API should offer application the most efficient way to read time anyway.

This patch does not take a position which way TSC should be read. There are three options: rdtsc, rdtsc + barrier, rdtscp. I think the current code is good enough for the accuracy. Barrier adds slight overhead. Rdtscp is not as widely supported as rdtsc. This detail is a magnitude less significant compared to: use system call vs direct TSC read. It can be tuned later. This patch set helps if rdtscp should be used later on (introduces x86 cpu flags).

-Petri
Savolainen, Petri (Nokia - FI/Espoo) April 26, 2017, 8:02 a.m. UTC | #18
> > > > diff --git a/platform/linux-

> generic/include/odp/api/plat/time_types.h

> > > b/platform/linux-generic/include/odp/api/plat/time_types.h

> > > > index 4847f3b1..8ae7ce82 100644

> > > > --- a/platform/linux-generic/include/odp/api/plat/time_types.h

> > > > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h

> > > > @@ -26,11 +26,28 @@ extern "C" {

> > > >   * the linux timespec structure, which is dependent on POSIX

> extension

> > > level.

> > > >   */

> > > >  typedef struct odp_time_t {

> > > > -	int64_t tv_sec;      /**< @internal Seconds */

> > > > -	int64_t tv_nsec;     /**< @internal Nanoseconds */

> > > > +	union {

> > > > +		/** @internal Posix timespec */

> > > > +		struct {

> > > > +			/** @internal Seconds */

> > > > +			int64_t tv_sec;

> > > > +

> > > > +			/** @internal Nanoseconds */

> > > > +			int64_t tv_nsec;

> > > > +		} spec;

> > > > +

> > > > +		/** @internal HW time counter */

> > > > +		struct {

> > > > +			/** @internal Counter value */

> > > > +			uint64_t count;

> > > > +

> > > > +			/** @internal Reserved */

> > > > +			uint64_t res;

> > > > +		} hw;

> > > > +	};

> > >

> > > A processor's tick counter cannot be used to measure calendar time by

> > > itself. The delta between two ticks, however, can be converted to

> > > calendar time.

> > >

> > > Please see the proposal that introduces odp_tick_t which eliminates

> > > the wasted u64 reserved field in the union above.

> >

> >

> > Nothing is wasted here today.

> 

> The 64-bit CPU time is stored in a 128-bit data structure where 64-bits

> are wasted. You can use just a 64-bit type and then convert that to

> a timespec (using a starting timestamp taken on global init) if needed.

> 


Nothing is wasted compared to the situation today. Entire timespec is stored as before. If you want to compress timespec, it's another topic. Compress means additional complexity and overhead. This patch set just adds ability to use 64 but HW time when available. Timespec side implementation remains as is.


> > This is because all our code (in this and other files) does "static

> inline" for functions that we want to be inlined. Static is used for

> functions that are static, but we don't care if those are inlined (slow

> path). Most time API functions are fast path.

> 

> I am saying that adding the 'inline' qualifier here doesn't actually do

> anything

> because the compiler will consider static functions for inlining anyways.



We all know that those are considered. We use "inline" as C standard describes: to suggest  which function calls should be as fast as possible.
 

> > > > +

> > > > +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2)

> > > > +{

> > > > +	odp_time_t time;

> > > > +

> > > > +	time.hw.res = 0;

> > > > +	time.hw.count = t1.hw.count + t2.hw.count;

> > > > +

> > > > +	return time;

> > > > +}

> > >

> > > If a single u64 were used this code wouldn't need to exist.

> >

> > Which code? hw.res = 0 ? It not actually used for anything and could be

> removed. Although, the performance gain would not be huge. Anyway, I'll

> remove it for v2.

> 

> If the CPU time was stored in a 64-bit type, you can use arithmetic

> operators

> on the values directly instead of doing arithmetic on a 128-bit compound

> datatype where 64-bits are unused. This is the union above.



The implementation above does 64bit arithmetic. The reserved field is not used for anything. V2 removed all .res references (except one due to a false GCC warning). We have abstract API definition, so that application remains portable also, when time cannot be represented with a single integer type.

It's implementation defined, if e.g. the above sum function is inlined and thus results no overhead at all. 

-Petri
Savolainen, Petri (Nokia - FI/Espoo) April 26, 2017, 9:37 a.m. UTC | #19
> > +#define SEC_IN_NS 1000000000ULL

> > +

> > +/* Measure TSC frequency. Frequency information registers are defined

> for x86,

> > + * but those are often not enumerated. */

> > +uint64_t cpu_global_time_freq(void)

> > +{

> > +       struct timespec sleep, ts1, ts2;

> > +       uint64_t t1, t2, ts_nsec, cycles, hz;

> > +       int i;

> > +       uint64_t avg = 0;

> > +       int rounds = 4;

> > +

> > +       for (i = 0; i < rounds; i++) {

> > +               sleep.tv_sec  = 0;

> > +               sleep.tv_nsec = SEC_IN_NS / 10;

> > +

> > +               if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts1)) {

> > +                       ODP_DBG("clock_gettime failed\n");

> > +                       return 0;

> > +               }

> > +

> > +               t1 = cpu_global_time();

> > +

> > +               if (nanosleep(&sleep, NULL) < 0) {

> > +                       ODP_DBG("nanosleep failed\n");

> > +                       return 0;

> > +               }

> > +

> > +               if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts2)) {

> > +                       ODP_DBG("clock_gettime failed\n");

> > +                       return 0;

> > +               }

> > +

> > +               t2 = cpu_global_time();

> > +

> > +               ts_nsec  = (ts2.tv_sec - ts1.tv_sec) * SEC_IN_NS;

> > +               ts_nsec += ts2.tv_nsec - ts1.tv_nsec;

> > +

> > +               cycles = t2 - t1;

> > +

> > +               hz = (cycles * SEC_IN_NS) / ts_nsec;

> > +               avg += hz;

> > +       }

> > +

> > +       return avg / rounds;

> > +}

> 

> This function is not very accurate. Ideally, ts1 and t1 (ts2 and t2)

> should be read at the same instance for this to be accurate. There is

> also the possibility that the 'rdtsc' in cpu_global_time might get

> executed ahead or later than what is in the code here.

> 

> Since this is called during init, the 'rounds' can be increased to a

> higher value to get a better average. Initial values can be discarded

> to ignore the cache warming latencies.

> 

> We should fall back to this only if the frequency information

> registers are not available.

> 

> There is a good white paper on methods to use for measuring cycles for

> code that takes small amount of cycles.

> http://www.intel.com/content/dam/www/public/us/en/documents/white-

> papers/ia-32-ia-64-benchmark-code-execution-paper.pdf

> 


Execution or latency does not affect accuracy here since we measure over 100ms (about 200 M CPU cycles). E.g. a time stamp error of 1000 cycles would result an error in the 6th digit of the result, which is not very significant. DPDK uses similar algorithm to measure the frequency. This measurement takes now 0.4 seconds in total. I would not want to consume too many seconds on every application start up for this.

E.g. on my Haswell those registers are not populated. Also DPDK does not refer to those anywhere. So, don't expect those to be populated often.

I think this should be accurate enough as is. It can be optimized later if needed.

-Petri
Peltonen, Janne (Nokia - FI/Espoo) April 26, 2017, 9:40 a.m. UTC | #20
> > odp_sched_latency currently uses clock_gettime. It is my understanding

> > that clock_gettime does not have the over head of the system call. Can

> > you elaborate more on the 'improved significantly' part?

> >

> 

> clock_gettime() uses the same TSC, but when you profile it with perf you can see tens of

> kernel functions including system call entry, RCU maintenance, etc.


clock_gettime() does not use the vdso implementation without syscall overhead
on x86 if clock id is CLOCK_MONOTONIC_RAW as it seems to be in ODP. I think
new enough kernels do support CLOCK_MONOTONIC_RAW in vsdo for arm64 though.

CLOCK_MONOTONIC is supported in vdso in x86 and would not cause syscall
overhead provided that the kernel time source is tsc (which it often is,
but not always (e.g. in some VMs)).

	Janne
Maxim Uvarov April 26, 2017, 7:35 p.m. UTC | #21
On 04/26/17 12:40, Peltonen, Janne (Nokia - FI/Espoo) wrote:
>>> odp_sched_latency currently uses clock_gettime. It is my understanding

>>> that clock_gettime does not have the over head of the system call. Can

>>> you elaborate more on the 'improved significantly' part?

>>>

>>

>> clock_gettime() uses the same TSC, but when you profile it with perf you can see tens of

>> kernel functions including system call entry, RCU maintenance, etc.

> 

> clock_gettime() does not use the vdso implementation without syscall overhead

> on x86 if clock id is CLOCK_MONOTONIC_RAW as it seems to be in ODP. I think

> new enough kernels do support CLOCK_MONOTONIC_RAW in vsdo for arm64 though.

> 

> CLOCK_MONOTONIC is supported in vdso in x86 and would not cause syscall

> overhead provided that the kernel time source is tsc (which it often is,

> but not always (e.g. in some VMs)).

> 

> 	Janne

> 

> 


here we need to be very careful with 2 things:
1) if api says nanosecond should be returned than it has to be nanoseconds.

2) We need to go in more generic way. If on fresh kernels
clock_gettime() shows great results then maybe it's reasonable to say
with it. But in that case we need to do measurements and define minimal
kernel version. I think that on any call kernel does some internal time
store which might trigger other sybsystem to take an action (soft irq
timers and rcu run as Petri saw).

This thread looks like very old but it says that CLOCK_MONOTONIC_RAW has
worst performance according to rdtsc:
http://btorpey.github.io/blog/2014/02/18/clock-sources-in-linux/

ps. link also has test code to measure values.

Maxim.
Savolainen, Petri (Nokia - FI/Espoo) April 27, 2017, 8:11 a.m. UTC | #22
> -----Original Message-----

> From: Brian Brooks [mailto:brian.brooks@arm.com]

> Sent: Wednesday, April 19, 2017 9:37 PM

> To: Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia.com>

> Cc: Honnappa Nagarahalli <honnappa.nagarahalli@linaro.org>; lng-

> odp@lists.linaro.org

> Subject: Re: [lng-odp] [API-NEXT PATCH 8/8] linux-gen: time: use hw time

> counter when available

> 

> On 04/26 07:11:57, Savolainen, Petri (Nokia - FI/Espoo) wrote:

> >

> >

> > > > From coverletter:

> > > > "This patch set modifies time implementation to use TSC when running

> on

> > > a x86

> > > > CPU that has invarint TSC CPU flag set. Otherwise, the same Linux

> system

> > > time

> > > > is used as before. TSC is much more efficient both in performance

> and

> > > > latency/jitter wise than Linux system call. This can be seen also

> with

> > > > scheduler latency test which time stamps events with this API. All

> > > latency

> > > > measurements (min, ave, max) improved significantly."

> > >

> > > odp_sched_latency currently uses clock_gettime. It is my understanding

> > > that clock_gettime does not have the over head of the system call. Can

> > > you elaborate more on the 'improved significantly' part?

> > >

> >

> > clock_gettime() uses the same TSC, but when you profile it with perf you

> can see tens of kernel functions including system call entry, RCU

> maintenance, etc.

> >

> > E.g. in sched_latency test kernel consumed about 10% of all the cycles.

> Also latency measurement results improved like this:

> > * min >3x lower

> > * avg 2x lower

> > * max more stable and 50% lower

> 

> You might want to share more information on the environment

> where you're seeing such significant improvements because the

> results on Broadwell do not match the above interpretation.

> 

> PS - This patch series breaks the build on ARM.

> 


Use v2. It should build on ARM.

-Petri
Savolainen, Petri (Nokia - FI/Espoo) April 27, 2017, 8:21 a.m. UTC | #23
> -----Original Message-----

> From: Brian Brooks [mailto:brian.brooks@arm.com]

> Sent: Wednesday, April 19, 2017 9:46 PM

> To: Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia.com>

> Cc: lng-odp@lists.linaro.org

> Subject: Re: [lng-odp] [API-NEXT PATCH 8/8] linux-gen: time: use hw time

> counter when available

> 

> On 04/26 07:30:15, Savolainen, Petri (Nokia - FI/Espoo) wrote:

> >

> > > > > This function (cpu_global_time()) is called only when we have

> first

> > > checked that TSC is invariant. Also we measure the TSC frequency in

> that

> > > case. This function is defined in the same file as cpu_cycles(), and

> the

> > > file is x86 specific. So, we know what we are doing, and just re-using

> the

> > > code to read TSC.

> > >

> > > What sort of timing accuracy is expected from the app?

> > >

> > > From benchmarking the maximum single-threaded rate of these reads:

> > >

> > >  x86_64:

> > >

> > >    read       7 ns/op

> > >    read_sync  22 ns/op

> > >

> > >  A57:

> > >

> > >    read       4 ns/op

> > >    read_sync  26 ns/op

> > >

> > > read_sync issues a synchronizing instruction for greater timing

> accuracy

> > > but clearly takes more time to return the time value read from the

> core.

> >

> > Accuracy is as good as implementation can offer with reasonable

> overhead. We do not put any nsec figures into API spec. ODP API should

> offer application the most efficient way to read time anyway.

> 

> 'reasonable' is what we need to define.

> 

> Another reason why you're seeing a performance boost on x86 is that when

> switching from clock_gettime() to RDTSC, you're no longer issuing a

> synchronizing

> instruction (fence). As shown above, this can be a significant factor

> depending

> on how often the time is being sampled.

> 

> However, there is a loss in timing accuracy because the load of the value

> may not happen at the time it happens in program order. This is why a

> synchronizing instruction needs to be used, but it slows down the

> execution

> of the thread on the core...

> 

> > This patch does not take a position which way TSC should be read. There

> are three options: rdtsc, rdtsc + barrier, rdtscp. I think the current

> code is good enough for the accuracy. Barrier adds slight overhead. Rdtscp

> is not as widely supported as rdtsc. This detail is a magnitude less

> significant compared to: use system call vs direct TSC read. It can be

> tuned later. This patch set helps if rdtscp should be used later on

> (introduces x86 cpu flags).

> 

> So you're saying that you do not need the synchronizing instruction, and

> the

> loss of timing accuracy is OK, right?



What is our timing accuracy today? How much jitter the system call (and everything may launch) causes in the current implementation? How much we are losing accuracy compared what we have today? I'd say we are better off that today anyway, just because we avoid the system call.

We don't have fence today on rdtsc read for cycle count. This patch does not add it, either. If we are good without it for cycles, we are good for nsec also.

The scale time scale application typically measures is in order of micro to milliseconds - thousands to millions of nanoseconds. If out of order execution moves the sample position e.g. 20 nsec (40 CPU cycles), it is not significant error to the measurement. Already single cache miss during a measurement causes same kind of noise to the measurement. Also CPU crystal might not be too accurate either, etc.


-Petri
Savolainen, Petri (Nokia - FI/Espoo) April 27, 2017, 10:02 a.m. UTC | #24
> -----Original Message-----

> From: Brian Brooks [mailto:brian.brooks@arm.com]

> Sent: Wednesday, April 19, 2017 10:05 PM

> To: Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia.com>

> Cc: lng-odp@lists.linaro.org

> Subject: Re: [lng-odp] [API-NEXT PATCH 8/8] linux-gen: time: use hw time

> counter when available

> 

> On 04/26 08:02:09, Savolainen, Petri (Nokia - FI/Espoo) wrote:

> >

> >

> > > > > > diff --git a/platform/linux-

> > > generic/include/odp/api/plat/time_types.h

> > > > > b/platform/linux-generic/include/odp/api/plat/time_types.h

> > > > > > index 4847f3b1..8ae7ce82 100644

> > > > > > --- a/platform/linux-generic/include/odp/api/plat/time_types.h

> > > > > > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h

> > > > > > @@ -26,11 +26,28 @@ extern "C" {

> > > > > >   * the linux timespec structure, which is dependent on POSIX

> > > extension

> > > > > level.

> > > > > >   */

> > > > > >  typedef struct odp_time_t {

> > > > > > -	int64_t tv_sec;      /**< @internal Seconds */

> > > > > > -	int64_t tv_nsec;     /**< @internal Nanoseconds */

> > > > > > +	union {

> > > > > > +		/** @internal Posix timespec */

> > > > > > +		struct {

> > > > > > +			/** @internal Seconds */

> > > > > > +			int64_t tv_sec;

> > > > > > +

> > > > > > +			/** @internal Nanoseconds */

> > > > > > +			int64_t tv_nsec;

> > > > > > +		} spec;

> > > > > > +

> > > > > > +		/** @internal HW time counter */

> > > > > > +		struct {

> > > > > > +			/** @internal Counter value */

> > > > > > +			uint64_t count;

> > > > > > +

> > > > > > +			/** @internal Reserved */

> > > > > > +			uint64_t res;

> > > > > > +		} hw;

> > > > > > +	};

> > > > >

> > > > > A processor's tick counter cannot be used to measure calendar time

> by

> > > > > itself. The delta between two ticks, however, can be converted to

> > > > > calendar time.

> > > > >

> > > > > Please see the proposal that introduces odp_tick_t which

> eliminates

> > > > > the wasted u64 reserved field in the union above.

> > > >

> > > >

> > > > Nothing is wasted here today.

> > >

> > > The 64-bit CPU time is stored in a 128-bit data structure where 64-

> bits

> > > are wasted. You can use just a 64-bit type and then convert that to

> > > a timespec (using a starting timestamp taken on global init) if

> needed.

> > >

> >

> > Nothing is wasted compared to the situation today. Entire timespec is

> stored as before. If you want to compress timespec, it's another topic.

> Compress means additional complexity and overhead. This patch set just

> adds ability to use 64 but HW time when available. Timespec side

> implementation remains as is.

> 

> You are re-using the calendar time type to store a value read from the CPU

> counter. A different approach is to use a different type (64 bits only for

> no

> wasted space) for the value read from the CPU.

> 

> This value must be converted to a calendar time time if needed. It cannot

> just

> be used to represent calendar time.

> 

> However, for some applications and use cases, you don't need to convert to

> calendar time

> in order to measure across two reads of the counter. You can also do

> direct arithmetic

> on the value instead of arithmetic on a timespec.

> 

> If you want to read code instead of email or design documentation, look at

> this:

> 

> 

> https://git.linaro.org/people/ola.liljedahl/sw_scheduler.git/tree/time_ben

> ch.c

>   https://git.linaro.org/people/ola.liljedahl/sw_scheduler.git/tree/time.h

> 

> All you need is a 64 bit type, a way to read from the counter as well as

> get

> the frequency, and 2 very small conversion functions.



First, odp_time_t is our storage type for time values. Today odp-linux defines that as 2x64bits. Nobody has complained about too high time storage consumption. I'm not changing size of the type here. It's not purpose of this patch set to e.g. compress timespec and thus minimize memory foot print of time storage.

Second, are you proposing an API change? Change of time type definition?
Ola's cntr_now() function changes as soon as your time source is not a 64bit counter, but e.g. timespec (2x64bit). It would need to compress that. How application would do arithmetic on the compressed timespec ?

Third, compression/conversion usually mean division operation. Division is tens of times heavier operation than sum, dec, mult. Currently, ODP API is defined so that conversion is done only when application asks for it (odp_time_t <-> nsec time). Compression is a trade-off between storage space size and performance.

-Petri
Savolainen, Petri (Nokia - FI/Espoo) April 28, 2017, 12:36 p.m. UTC | #25
> You multiply seconds by 1,000,000,000 and add the nanoseconds to get the

> epoch

> offset in terms of nanoseconds - this can be stored in a uint64_t. And,

> now you

> can "spec" that time is in terms of nanoseconds not "implementation

> defined".

> The application can now do arithmetic too.

> 

> > Third, compression/conversion usually mean division operation. Division

> is tens of times heavier operation than sum, dec, mult. Currently, ODP API

> is defined so that conversion is done only when application asks for it

> (odp_time_t <-> nsec time). Compression is a trade-off between storage

> space size and performance.

> 

> I prefer to measure. Perhaps time operations using CPU tick masked by a

> timespec

> should be added here:

> 

> 

> https://git.linaro.org/people/ola.liljedahl/sw_scheduler.git/tree/time_ben

> ch.c

> 

> Taking the difference between two 64-bit values and calling

> nsec_from_cntr() is

> quite fast - as fast (faster on ARM) as calling clock_gettime() once

> through the vDSO.

>


Realized after sending yesterday that timespec is pretty fast to convert into nsec. So, I did add that as the last patch of v3.  So, size of odp_time_t (in odp-linux) is 64 bits after this set. Hopefully, everybody are happy with this now. I'd really like to work one change at the time: first take HW time counter into use, then optimize timespec implementation, etc. One problem, one patch set.

-Petri
diff mbox series

Patch

diff --git a/platform/linux-generic/Makefile.am b/platform/linux-generic/Makefile.am
index 60b7f849..ed66fecf 100644
--- a/platform/linux-generic/Makefile.am
+++ b/platform/linux-generic/Makefile.am
@@ -171,6 +171,7 @@  noinst_HEADERS = \
 		  ${srcdir}/include/odp_schedule_if.h \
 		  ${srcdir}/include/odp_sorted_list_internal.h \
 		  ${srcdir}/include/odp_shm_internal.h \
+		  ${srcdir}/include/odp_time_internal.h \
 		  ${srcdir}/include/odp_timer_internal.h \
 		  ${srcdir}/include/odp_timer_wheel_internal.h \
 		  ${srcdir}/include/odp_traffic_mngr_internal.h \
diff --git a/platform/linux-generic/arch/arm/odp_cpu_arch.c b/platform/linux-generic/arch/arm/odp_cpu_arch.c
index 2ac223e0..3a87f09c 100644
--- a/platform/linux-generic/arch/arm/odp_cpu_arch.c
+++ b/platform/linux-generic/arch/arm/odp_cpu_arch.c
@@ -13,6 +13,7 @@ 
 #include <odp/api/hints.h>
 #include <odp/api/system_info.h>
 #include <odp_debug_internal.h>
+#include <odp_time_internal.h>
 
 #define GIGA 1000000000
 
@@ -46,3 +47,13 @@  uint64_t odp_cpu_cycles_resolution(void)
 {
 	return 1;
 }
+
+uint64_t cpu_global_time(void)
+{
+	return 0;
+}
+
+uint64_t cpu_global_time_freq(void)
+{
+	return 0;
+}
diff --git a/platform/linux-generic/arch/default/odp_cpu_arch.c b/platform/linux-generic/arch/default/odp_cpu_arch.c
index 2ac223e0..3a87f09c 100644
--- a/platform/linux-generic/arch/default/odp_cpu_arch.c
+++ b/platform/linux-generic/arch/default/odp_cpu_arch.c
@@ -13,6 +13,7 @@ 
 #include <odp/api/hints.h>
 #include <odp/api/system_info.h>
 #include <odp_debug_internal.h>
+#include <odp_time_internal.h>
 
 #define GIGA 1000000000
 
@@ -46,3 +47,13 @@  uint64_t odp_cpu_cycles_resolution(void)
 {
 	return 1;
 }
+
+uint64_t cpu_global_time(void)
+{
+	return 0;
+}
+
+uint64_t cpu_global_time_freq(void)
+{
+	return 0;
+}
diff --git a/platform/linux-generic/arch/mips64/odp_cpu_arch.c b/platform/linux-generic/arch/mips64/odp_cpu_arch.c
index 646acf9c..a9a94531 100644
--- a/platform/linux-generic/arch/mips64/odp_cpu_arch.c
+++ b/platform/linux-generic/arch/mips64/odp_cpu_arch.c
@@ -7,6 +7,7 @@ 
 #include <odp/api/cpu.h>
 #include <odp/api/hints.h>
 #include <odp/api/system_info.h>
+#include <odp_time_internal.h>
 
 uint64_t odp_cpu_cycles(void)
 {
@@ -29,3 +30,13 @@  uint64_t odp_cpu_cycles_resolution(void)
 {
 	return 1;
 }
+
+uint64_t cpu_global_time(void)
+{
+	return 0;
+}
+
+uint64_t cpu_global_time_freq(void)
+{
+	return 0;
+}
diff --git a/platform/linux-generic/arch/powerpc/odp_cpu_arch.c b/platform/linux-generic/arch/powerpc/odp_cpu_arch.c
index 2ac223e0..3a87f09c 100644
--- a/platform/linux-generic/arch/powerpc/odp_cpu_arch.c
+++ b/platform/linux-generic/arch/powerpc/odp_cpu_arch.c
@@ -13,6 +13,7 @@ 
 #include <odp/api/hints.h>
 #include <odp/api/system_info.h>
 #include <odp_debug_internal.h>
+#include <odp_time_internal.h>
 
 #define GIGA 1000000000
 
@@ -46,3 +47,13 @@  uint64_t odp_cpu_cycles_resolution(void)
 {
 	return 1;
 }
+
+uint64_t cpu_global_time(void)
+{
+	return 0;
+}
+
+uint64_t cpu_global_time_freq(void)
+{
+	return 0;
+}
diff --git a/platform/linux-generic/arch/x86/cpu_flags.c b/platform/linux-generic/arch/x86/cpu_flags.c
index 8fb9477a..cde8ad3e 100644
--- a/platform/linux-generic/arch/x86/cpu_flags.c
+++ b/platform/linux-generic/arch/x86/cpu_flags.c
@@ -38,6 +38,7 @@ 
  */
 
 #include <arch/x86/cpu_flags.h>
+#include <odp_time_internal.h>
 #include <stdio.h>
 #include <stdint.h>
 
@@ -347,3 +348,11 @@  void cpu_flags_print_all(void)
 
 	printf("\n\n");
 }
+
+int cpu_has_global_time(void)
+{
+	if (cpu_get_flag_enabled(RTE_CPUFLAG_INVTSC) > 0)
+		return 1;
+
+	return 0;
+}
diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c b/platform/linux-generic/arch/x86/odp_cpu_arch.c
index c8cf27b6..9ba601a3 100644
--- a/platform/linux-generic/arch/x86/odp_cpu_arch.c
+++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c
@@ -3,7 +3,14 @@ 
  *
  * SPDX-License-Identifier:     BSD-3-Clause
  */
+
+#include <odp_posix_extensions.h>
+
 #include <odp/api/cpu.h>
+#include <odp_time_internal.h>
+#include <odp_debug_internal.h>
+
+#include <time.h>
 
 uint64_t odp_cpu_cycles(void)
 {
@@ -31,3 +38,55 @@  uint64_t odp_cpu_cycles_resolution(void)
 {
 	return 1;
 }
+
+uint64_t cpu_global_time(void)
+{
+	return odp_cpu_cycles();
+}
+
+#define SEC_IN_NS 1000000000ULL
+
+/* Measure TSC frequency. Frequency information registers are defined for x86,
+ * but those are often not enumerated. */
+uint64_t cpu_global_time_freq(void)
+{
+	struct timespec sleep, ts1, ts2;
+	uint64_t t1, t2, ts_nsec, cycles, hz;
+	int i;
+	uint64_t avg = 0;
+	int rounds = 4;
+
+	for (i = 0; i < rounds; i++) {
+		sleep.tv_sec  = 0;
+		sleep.tv_nsec = SEC_IN_NS / 10;
+
+		if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts1)) {
+			ODP_DBG("clock_gettime failed\n");
+			return 0;
+		}
+
+		t1 = cpu_global_time();
+
+		if (nanosleep(&sleep, NULL) < 0) {
+			ODP_DBG("nanosleep failed\n");
+			return 0;
+		}
+
+		if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts2)) {
+			ODP_DBG("clock_gettime failed\n");
+			return 0;
+		}
+
+		t2 = cpu_global_time();
+
+		ts_nsec  = (ts2.tv_sec - ts1.tv_sec) * SEC_IN_NS;
+		ts_nsec += ts2.tv_nsec - ts1.tv_nsec;
+
+		cycles = t2 - t1;
+
+		hz = (cycles * SEC_IN_NS) / ts_nsec;
+		avg += hz;
+	}
+
+	return avg / rounds;
+}
diff --git a/platform/linux-generic/include/odp/api/plat/time_types.h b/platform/linux-generic/include/odp/api/plat/time_types.h
index 4847f3b1..8ae7ce82 100644
--- a/platform/linux-generic/include/odp/api/plat/time_types.h
+++ b/platform/linux-generic/include/odp/api/plat/time_types.h
@@ -26,11 +26,28 @@  extern "C" {
  * the linux timespec structure, which is dependent on POSIX extension level.
  */
 typedef struct odp_time_t {
-	int64_t tv_sec;      /**< @internal Seconds */
-	int64_t tv_nsec;     /**< @internal Nanoseconds */
+	union {
+		/** @internal Posix timespec */
+		struct {
+			/** @internal Seconds */
+			int64_t tv_sec;
+
+			/** @internal Nanoseconds */
+			int64_t tv_nsec;
+		} spec;
+
+		/** @internal HW time counter */
+		struct {
+			/** @internal Counter value */
+			uint64_t count;
+
+			/** @internal Reserved */
+			uint64_t res;
+		} hw;
+	};
 } odp_time_t;
 
-#define ODP_TIME_NULL ((odp_time_t){0, 0})
+#define ODP_TIME_NULL ((odp_time_t){.spec = {0, 0} })
 
 /**
  * @}
diff --git a/platform/linux-generic/include/odp_time_internal.h b/platform/linux-generic/include/odp_time_internal.h
new file mode 100644
index 00000000..99ac7977
--- /dev/null
+++ b/platform/linux-generic/include/odp_time_internal.h
@@ -0,0 +1,24 @@ 
+/* Copyright (c) 2017, Linaro Limited
+ * All rights reserved.
+ *
+ * SPDX-License-Identifier:     BSD-3-Clause
+ */
+
+#ifndef ODP_TIME_INTERNAL_H_
+#define ODP_TIME_INTERNAL_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stdint.h>
+
+int cpu_has_global_time(void);
+uint64_t cpu_global_time(void);
+uint64_t cpu_global_time_freq(void);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/platform/linux-generic/odp_time.c b/platform/linux-generic/odp_time.c
index 81e05224..2dd8f2c4 100644
--- a/platform/linux-generic/odp_time.c
+++ b/platform/linux-generic/odp_time.c
@@ -10,36 +10,39 @@ 
 #include <odp/api/time.h>
 #include <odp/api/hints.h>
 #include <odp_debug_internal.h>
+#include <odp_time_internal.h>
+#include <string.h>
+#include <inttypes.h>
 
-static odp_time_t start_time;
+typedef struct time_global_t {
+	odp_time_t start_time;
+	int        use_hw;
+	uint64_t   hw_start;
+	uint64_t   hw_freq_hz;
+} time_global_t;
 
-static inline
-uint64_t time_to_ns(odp_time_t time)
-{
-	uint64_t ns;
-
-	ns = time.tv_sec * ODP_TIME_SEC_IN_NS;
-	ns += time.tv_nsec;
+static time_global_t global;
 
-	return ns;
-}
+/*
+ * Posix timespec based functions
+ */
 
-static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1)
+static inline odp_time_t time_spec_diff(odp_time_t t2, odp_time_t t1)
 {
 	odp_time_t time;
 
-	time.tv_sec = t2.tv_sec - t1.tv_sec;
-	time.tv_nsec = t2.tv_nsec - t1.tv_nsec;
+	time.spec.tv_sec = t2.spec.tv_sec - t1.spec.tv_sec;
+	time.spec.tv_nsec = t2.spec.tv_nsec - t1.spec.tv_nsec;
 
-	if (time.tv_nsec < 0) {
-		time.tv_nsec += ODP_TIME_SEC_IN_NS;
-		--time.tv_sec;
+	if (time.spec.tv_nsec < 0) {
+		time.spec.tv_nsec += ODP_TIME_SEC_IN_NS;
+		--time.spec.tv_sec;
 	}
 
 	return time;
 }
 
-static inline odp_time_t time_local(void)
+static inline odp_time_t time_spec_cur(void)
 {
 	int ret;
 	odp_time_t time;
@@ -49,77 +52,237 @@  static inline odp_time_t time_local(void)
 	if (odp_unlikely(ret != 0))
 		ODP_ABORT("clock_gettime failed\n");
 
-	time.tv_sec = sys_time.tv_sec;
-	time.tv_nsec = sys_time.tv_nsec;
+	time.spec.tv_sec = sys_time.tv_sec;
+	time.spec.tv_nsec = sys_time.tv_nsec;
 
-	return time_diff(time, start_time);
+	return time_spec_diff(time, global.start_time);
 }
 
-static inline int time_cmp(odp_time_t t2, odp_time_t t1)
+static inline uint64_t time_spec_res(void)
 {
-	if (t2.tv_sec < t1.tv_sec)
+	int ret;
+	struct timespec tres;
+
+	ret = clock_getres(CLOCK_MONOTONIC_RAW, &tres);
+	if (odp_unlikely(ret != 0))
+		ODP_ABORT("clock_getres failed\n");
+
+	return ODP_TIME_SEC_IN_NS / (uint64_t)tres.tv_nsec;
+}
+
+static inline int time_spec_cmp(odp_time_t t2, odp_time_t t1)
+{
+	if (t2.spec.tv_sec < t1.spec.tv_sec)
 		return -1;
 
-	if (t2.tv_sec > t1.tv_sec)
+	if (t2.spec.tv_sec > t1.spec.tv_sec)
 		return 1;
 
-	return t2.tv_nsec - t1.tv_nsec;
+	return t2.spec.tv_nsec - t1.spec.tv_nsec;
 }
 
-static inline odp_time_t time_sum(odp_time_t t1, odp_time_t t2)
+static inline odp_time_t time_spec_sum(odp_time_t t1, odp_time_t t2)
 {
 	odp_time_t time;
 
-	time.tv_sec = t2.tv_sec + t1.tv_sec;
-	time.tv_nsec = t2.tv_nsec + t1.tv_nsec;
+	time.spec.tv_sec = t2.spec.tv_sec + t1.spec.tv_sec;
+	time.spec.tv_nsec = t2.spec.tv_nsec + t1.spec.tv_nsec;
 
-	if (time.tv_nsec >= (long)ODP_TIME_SEC_IN_NS) {
-		time.tv_nsec -= ODP_TIME_SEC_IN_NS;
-		++time.tv_sec;
+	if (time.spec.tv_nsec >= (long)ODP_TIME_SEC_IN_NS) {
+		time.spec.tv_nsec -= ODP_TIME_SEC_IN_NS;
+		++time.spec.tv_sec;
 	}
 
 	return time;
 }
 
-static inline odp_time_t time_local_from_ns(uint64_t ns)
+static inline uint64_t time_spec_to_ns(odp_time_t time)
+{
+	uint64_t ns;
+
+	ns = time.spec.tv_sec * ODP_TIME_SEC_IN_NS;
+	ns += time.spec.tv_nsec;
+
+	return ns;
+}
+
+static inline odp_time_t time_spec_from_ns(uint64_t ns)
 {
 	odp_time_t time;
 
-	time.tv_sec = ns / ODP_TIME_SEC_IN_NS;
-	time.tv_nsec = ns - time.tv_sec * ODP_TIME_SEC_IN_NS;
+	time.spec.tv_sec = ns / ODP_TIME_SEC_IN_NS;
+	time.spec.tv_nsec = ns - time.spec.tv_sec * ODP_TIME_SEC_IN_NS;
 
 	return time;
 }
 
-static inline void time_wait_until(odp_time_t time)
+/*
+ * HW time counter based functions
+ */
+
+static inline odp_time_t time_hw_cur(void)
 {
-	odp_time_t cur;
+	odp_time_t time;
 
-	do {
-		cur = time_local();
-	} while (time_cmp(time, cur) > 0);
+	time.hw.res   = 0;
+	time.hw.count = cpu_global_time() - global.hw_start;
+
+	return time;
 }
 
-static inline uint64_t time_local_res(void)
+static inline uint64_t time_hw_res(void)
 {
-	int ret;
-	struct timespec tres;
+	/* Promise a bit lower resolution than average cycle counter
+	 * frequency */
+	return global.hw_freq_hz / 10;
+}
 
-	ret = clock_getres(CLOCK_MONOTONIC_RAW, &tres);
-	if (odp_unlikely(ret != 0))
-		ODP_ABORT("clock_getres failed\n");
+static inline int time_hw_cmp(odp_time_t t2, odp_time_t t1)
+{
+	if (odp_likely(t2.hw.count > t1.hw.count))
+		return 1;
 
-	return ODP_TIME_SEC_IN_NS / (uint64_t)tres.tv_nsec;
+	if (t2.hw.count < t1.hw.count)
+		return -1;
+
+	return 0;
+}
+
+static inline odp_time_t time_hw_diff(odp_time_t t2, odp_time_t t1)
+{
+	odp_time_t time;
+
+	time.hw.res = 0;
+	time.hw.count = t2.hw.count - t1.hw.count;
+
+	return time;
+}
+
+static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2)
+{
+	odp_time_t time;
+
+	time.hw.res = 0;
+	time.hw.count = t1.hw.count + t2.hw.count;
+
+	return time;
+}
+
+static inline uint64_t time_hw_to_ns(odp_time_t time)
+{
+	uint64_t nsec;
+	uint64_t freq_hz = global.hw_freq_hz;
+	uint64_t count = time.hw.count;
+	uint64_t sec = 0;
+
+	if (count >= freq_hz) {
+		sec   = count / freq_hz;
+		count = count - sec * freq_hz;
+	}
+
+	nsec = (ODP_TIME_SEC_IN_NS * count) / freq_hz;
+
+	return (sec * ODP_TIME_SEC_IN_NS) + nsec;
+}
+
+static inline odp_time_t time_hw_from_ns(uint64_t ns)
+{
+	odp_time_t time;
+	uint64_t count;
+	uint64_t freq_hz = global.hw_freq_hz;
+	uint64_t sec = 0;
+
+	if (ns >= ODP_TIME_SEC_IN_NS) {
+		sec = ns / ODP_TIME_SEC_IN_NS;
+		ns  = ns - sec * ODP_TIME_SEC_IN_NS;
+	}
+
+	count  = sec * freq_hz;
+	count += (ns * freq_hz) / ODP_TIME_SEC_IN_NS;
+
+	time.hw.res   = 0;
+	time.hw.count = count;
+
+	return time;
+}
+
+/*
+ * Common functions
+ */
+
+static inline odp_time_t time_cur(void)
+{
+	if (global.use_hw)
+		return time_hw_cur();
+
+	return time_spec_cur();
+}
+
+static inline uint64_t time_res(void)
+{
+	if (global.use_hw)
+		return time_hw_res();
+
+	return time_spec_res();
+}
+
+static inline int time_cmp(odp_time_t t2, odp_time_t t1)
+{
+	if (global.use_hw)
+		return time_hw_cmp(t2, t1);
+
+	return time_spec_cmp(t2, t1);
+}
+
+static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1)
+{
+	if (global.use_hw)
+		return time_hw_diff(t2, t1);
+
+	return time_spec_diff(t2, t1);
+}
+
+static inline odp_time_t time_sum(odp_time_t t1, odp_time_t t2)
+{
+	if (global.use_hw)
+		return time_hw_sum(t1, t2);
+
+	return time_spec_sum(t1, t2);
+}
+
+static inline uint64_t time_to_ns(odp_time_t time)
+{
+	if (global.use_hw)
+		return time_hw_to_ns(time);
+
+	return time_spec_to_ns(time);
+}
+
+static inline odp_time_t time_from_ns(uint64_t ns)
+{
+	if (global.use_hw)
+		return time_hw_from_ns(ns);
+
+	return time_spec_from_ns(ns);
+}
+
+static inline void time_wait_until(odp_time_t time)
+{
+	odp_time_t cur;
+
+	do {
+		cur = time_cur();
+	} while (time_cmp(time, cur) > 0);
 }
 
 odp_time_t odp_time_local(void)
 {
-	return time_local();
+	return time_cur();
 }
 
 odp_time_t odp_time_global(void)
 {
-	return time_local();
+	return time_cur();
 }
 
 odp_time_t odp_time_diff(odp_time_t t2, odp_time_t t1)
@@ -134,12 +297,12 @@  uint64_t odp_time_to_ns(odp_time_t time)
 
 odp_time_t odp_time_local_from_ns(uint64_t ns)
 {
-	return time_local_from_ns(ns);
+	return time_from_ns(ns);
 }
 
 odp_time_t odp_time_global_from_ns(uint64_t ns)
 {
-	return time_local_from_ns(ns);
+	return time_from_ns(ns);
 }
 
 int odp_time_cmp(odp_time_t t2, odp_time_t t1)
@@ -154,18 +317,18 @@  odp_time_t odp_time_sum(odp_time_t t1, odp_time_t t2)
 
 uint64_t odp_time_local_res(void)
 {
-	return time_local_res();
+	return time_res();
 }
 
 uint64_t odp_time_global_res(void)
 {
-	return time_local_res();
+	return time_res();
 }
 
 void odp_time_wait_ns(uint64_t ns)
 {
-	odp_time_t cur = time_local();
-	odp_time_t wait = time_local_from_ns(ns);
+	odp_time_t cur = time_cur();
+	odp_time_t wait = time_from_ns(ns);
 	odp_time_t end_time = time_sum(cur, wait);
 
 	time_wait_until(end_time);
@@ -193,15 +356,31 @@  uint64_t odp_time_to_u64(odp_time_t time)
 
 int odp_time_init_global(void)
 {
-	int ret;
-	struct timespec time;
-
-	ret = clock_gettime(CLOCK_MONOTONIC_RAW, &time);
-	if (ret) {
-		start_time = ODP_TIME_NULL;
-	} else {
-		start_time.tv_sec = time.tv_sec;
-		start_time.tv_nsec = time.tv_nsec;
+	struct timespec sys_time;
+	int ret = 0;
+
+	memset(&global, 0, sizeof(time_global_t));
+
+	if (cpu_has_global_time()) {
+		global.use_hw = 1;
+		global.hw_freq_hz  = cpu_global_time_freq();
+
+		if (global.hw_freq_hz == 0)
+			return -1;
+
+		printf("HW time counter freq: %" PRIu64 " hz\n\n",
+		       global.hw_freq_hz);
+
+		global.hw_start = cpu_global_time();
+		return 0;
+	}
+
+	global.start_time = ODP_TIME_NULL;
+
+	ret = clock_gettime(CLOCK_MONOTONIC_RAW, &sys_time);
+	if (ret == 0) {
+		global.start_time.spec.tv_sec  = sys_time.tv_sec;
+		global.start_time.spec.tv_nsec = sys_time.tv_nsec;
 	}
 
 	return ret;