
doc/users-guide: add helpers section

Message ID 1449863556-31260-1-git-send-email-mike.holmes@linaro.org
State New

Commit Message

Mike Holmes Dec. 11, 2015, 7:52 p.m. UTC
Signed-off-by: Mike Holmes <mike.holmes@linaro.org>
---
 doc/users-guide/users-guide.adoc | 161 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 161 insertions(+)

Comments

Bill Fischofer Dec. 12, 2015, 3:13 p.m. UTC | #1
On Fri, Dec 11, 2015 at 1:52 PM, Mike Holmes <mike.holmes@linaro.org> wrote:

> Signed-off-by: Mike Holmes <mike.holmes@linaro.org>

> ---

>  doc/users-guide/users-guide.adoc | 161

> +++++++++++++++++++++++++++++++++++++++

>  1 file changed, 161 insertions(+)

>

> diff --git a/doc/users-guide/users-guide.adoc

> b/doc/users-guide/users-guide.adoc

> index cf77fa0..d2e1a16 100644

> --- a/doc/users-guide/users-guide.adoc

> +++ b/doc/users-guide/users-guide.adoc

> @@ -431,6 +431,167 @@ Applications only include the 'include/odp.h file

> which includes the 'platform/<

>  The doxygen documentation defining the behavior of the ODP API is all

> contained in the public API files, and the actual definitions for an

> implementation will be found in the per platform directories.

>  Per-platform data that might normally be a #define can be recovered via

> the appropriate access function if the #define is not directly visible to

> the application.

>

> +== Helpers

>


Calling this section Helpers is confusing since we're already using that
for the ODP helper functions (odph prefix). These are really separate
sections that are awkwardly under a "Miscellaneous" heading.  Better to
promote them to first level sections that cover a wider theme rather than
just enumerating.  For example "Core Management and Isolation", "Memory and
Cache Management", "Synchronization", etc.


> +Many small helper functions and definitions are needed to enable ODP

> +applications to be hardware optimized but not tied to a particular

> hardware or

> +execution environment. These are typically implemented with inline

> functions,

> +preprocessor macros, or compiler built-in features. Thus API definitions

> are

> +normally inline when possible.

> +

> +=== Core enumeration

> +Application or middleware need to handle physical and/or logical core

> IDs, core

> +counts and core masks quite often. Core enumeration has to remain

> consistent

> +even when core deployment may change during application execution (e.g.,

> due to

> +adaptation to changing traffic profile, etc).

> +

> +* +odp_cpumask_from_str()+

> +* +odp_cpumask_to_str()+

> +* +odp_cpumask_zero()+

> +* +odp_cpumask_set()+

> +* +odp_cpumask_setall()+

> +* +odp_cpumask_clr()+

> +* +odp_cpumask_isset()+

> +* +odp_cpumask_count()+

> +* +odp_cpumask_and()+

> +* +odp_cpumask_or()+

> +* +odp_cpumask_xor()+

> +* +odp_cpumask_equal()+

> +* +odp_cpumask_copy()+

> +* +odp_cpumask_first()+

> +* +odp_cpumask_last()+

> +* +odp_cpumask_next()+

> +* +odp_cpumask_default_worker()+

> +* +odp_cpumask_default_control()+

>


Good start, but we want to do more than just enumerate functions here (you
can get that from the API reference).  The User's Guide should cover why
and how these functions should be used and factored into application
design. OK for now as a skeleton, but we need to expand here.
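
As an illustration of the kind of "why and how" material meant here, a
minimal sketch in plain C follows. It assumes a mask of at most 64 CPUs held
in a single word; the demo_ names are hypothetical stand-ins and not the ODP
implementation, whose odp_cpumask_t layout is platform defined.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an opaque CPU mask type. */
#define DEMO_MAX_CPUS 64

typedef struct {
    uint64_t bits;
} demo_cpumask_t;

static void demo_cpumask_zero(demo_cpumask_t *mask)
{
    mask->bits = 0;
}

static void demo_cpumask_set(demo_cpumask_t *mask, int cpu)
{
    assert(cpu >= 0 && cpu < DEMO_MAX_CPUS);
    mask->bits |= UINT64_C(1) << cpu;
}

static int demo_cpumask_isset(const demo_cpumask_t *mask, int cpu)
{
    assert(cpu >= 0 && cpu < DEMO_MAX_CPUS);
    return (int)((mask->bits >> cpu) & 1u);
}

/* Number of CPUs present in the mask. */
static int demo_cpumask_count(const demo_cpumask_t *mask)
{
    int count = 0;

    /* Each iteration clears the lowest set bit. */
    for (uint64_t b = mask->bits; b != 0; b &= b - 1)
        count++;
    return count;
}

/* Lowest CPU above 'cpu' that is in the mask, or -1 if there is none.
 * Passing cpu = -1 returns the first CPU in the mask. */
static int demo_cpumask_next(const demo_cpumask_t *mask, int cpu)
{
    for (int i = cpu + 1; i < DEMO_MAX_CPUS; i++)
        if (demo_cpumask_isset(mask, i))
            return i;
    return -1;
}
```

An application would typically walk such a mask with the first/next pattern
to start one worker per selected CPU, which is the design point the guide
text could explain.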


> +

> +=== Memory alignments

> +For optimal performance and scalability (e.g., to avoid false sharing and

> cache

> +line aliasing), some application data structures need to be aligned to

> cache

> +(cache line) and/or memory subsystem (page, DRAM burst) alignments.  NUMA

> +systems also support location-awareness and potentially different cache

> line

> +sizes on a per-memory basis. Static memory allocation Serves application

> needs

> +for portable definitions for global and core/thread local data.

> +

> +* +ODP_ALIGNED+

> +* +ODP_PACKED+

> +* +ODP_OFFSETOF+

> +* +ODP_FIELD_SIZEOF+

> +* +ODP_CACHE_LINE_SIZE+

> +* +ODP_PAGE_SIZE+

> +* +ODP_ALIGNED_CACHE+

> +* +ODP_ALIGNED_PAGE+

> +
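
The false sharing point above can be sketched with standard C11 alignment
support. This is an illustration only: the 64-byte line size and the demo_
names are assumptions, and it uses alignas rather than the ODP_ALIGNED_CACHE
macro so that it compiles without ODP.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size for this sketch only; a real platform
 * supplies its own value. */
#define DEMO_CACHE_LINE_SIZE 64

/* One counter block per core. Aligning the first member raises the
 * alignment of the whole struct, and the compiler then pads the struct
 * size to a multiple of that alignment, so neighbouring array elements
 * never share a cache line. */
typedef struct {
    alignas(DEMO_CACHE_LINE_SIZE) uint64_t rx_packets;
    uint64_t rx_bytes;
} demo_core_stats_t;

static_assert(sizeof(demo_core_stats_t) % DEMO_CACHE_LINE_SIZE == 0,
              "per-core stats must fill whole cache lines");

static demo_core_stats_t demo_stats[4];

/* Byte distance between two consecutive per-core entries. */
static size_t demo_stats_stride(void)
{
    return (size_t)((char *)&demo_stats[1] - (char *)&demo_stats[0]);
}
```

Without the alignment, four cores updating their own 16-byte entries would
all write to the same line and bounce it between caches.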

> +=== Compiler hints

> +The compiler and linker can do better optimizations if code includes

> hints on

> +expected application  behavior.  Examples of these are classification of

> +branches with likely/unlikely hints, or marking  code with hot (optimize

> for

> +speed) or cold (optimize for size) tags.

> +

> +* +odp_likely()+

> +* +odp_unlikely()+

> +* +odp_prefetch()+

> +* +odp_prefetch_store()+

> +
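
A minimal sketch of how branch and prefetch hints are commonly built,
assuming a GCC or Clang compiler that provides __builtin_expect and
__builtin_prefetch. The demo_ macros are hypothetical and are not the ODP
definitions of odp_likely() and friends.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hints map to compiler builtins where available and fall back to
 * plain evaluation elsewhere. */
#if defined(__GNUC__) || defined(__clang__)
#define demo_likely(x)   __builtin_expect(!!(x), 1)
#define demo_unlikely(x) __builtin_expect(!!(x), 0)
#define demo_prefetch(p) __builtin_prefetch((p))
#else
#define demo_likely(x)   (!!(x))
#define demo_unlikely(x) (!!(x))
#define demo_prefetch(p) ((void)(p))
#endif

/* Sum the lengths of valid packets. A zero length is treated as the
 * rare error case and counted separately. */
static uint64_t demo_sum_lengths(const uint32_t *len, size_t n,
                                 size_t *errors)
{
    uint64_t total = 0;

    assert(len != NULL && errors != NULL);
    *errors = 0;

    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n)
            demo_prefetch(&len[i + 1]); /* hint only, never changes results */

        if (demo_unlikely(len[i] == 0)) {
            (*errors)++;
            continue;
        }
        total += len[i];
    }
    return total;
}
```

The hints only influence code layout and cache behavior; the function
returns the same values with or without them, which is why they are safe to
define away on compilers that lack the builtins.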

> +=== Atomic operations

> +Modern ISAs offers various atomic instructions to access/manipulate data

> +concurrently from multiple cores. Well scalable multicore software is

> possible

> +only through correct usage (and combination) of hardware acceleration and

> +atomic instructions. Applications use atomic operations to update global

> +statistics, sequence counters, quotas, etc., and to build concurrent data

> +structures.

> +

> +* +odp_atomic_init_u64()+

> +* +odp_atomic_load_u64()+

> +* +odp_atomic_store_u64()+

> +* +odp_atomic_fetch_add_u64()+

> +* +odp_atomic_add_u64()+

> +* +odp_atomic_fetch_sub_u64()+

> +* +odp_atomic_sub_u64()+

> +* +odp_atomic_fetch_inc_u64()+

> +* +odp_atomic_inc_u64()+

> +* +odp_atomic_fetch_dec_u64()+

> +* +odp_atomic_dec_u64()+

> +
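
The statistics and sequence counter uses mentioned above could look roughly
like this. The sketch uses C11 <stdatomic.h> instead of the odp_atomic_*
calls so that it stands alone; the demo_ names are hypothetical, and relaxed
ordering is an assumption that fits pure counters but not data publication.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    _Atomic uint64_t packets;  /* global statistic */
    _Atomic uint64_t next_seq; /* sequence number source */
} demo_counters_t;

static void demo_counters_init(demo_counters_t *c)
{
    assert(c != NULL);
    atomic_init(&c->packets, 0);
    atomic_init(&c->next_seq, 0);
}

/* Plain add: the caller does not need the previous value. */
static void demo_count_packets(demo_counters_t *c, uint64_t n)
{
    atomic_fetch_add_explicit(&c->packets, n, memory_order_relaxed);
}

/* Fetch-and-increment: returns the value before the increment, so
 * concurrent callers each receive a distinct sequence number. */
static uint64_t demo_next_seq(demo_counters_t *c)
{
    return atomic_fetch_add_explicit(&c->next_seq, 1, memory_order_relaxed);
}

static uint64_t demo_read_packets(demo_counters_t *c)
{
    return atomic_load_explicit(&c->packets, memory_order_relaxed);
}
```

The split between "add" and "fetch and add" variants in the list mirrors
this: the first is enough for statistics, the second is needed whenever the
old value carries meaning.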

> +=== Memory synchronization barriers

> +Application (or middleware) needs a portable way to synchronize data

> +modifications into main memory before messaging other cores or hardware

> +acceleration about the changes. The nature of the synchronization needs

> are

> +cache coherence protocol specific.

> +

> +* +odp_barrier_t()+


> +* +odp_rwlock_t()+
> +* +odp_ticketlock_t()+

>


The _t names are types, not functions, so omit the () suffixes.


> +* +odp_barrier_init()+

> +* +odp_barrier_wait()+

> +* +odp_rwlock_init()+

> +* +odp_rwlock_read_lock()+

> +* +odp_rwlock_read_unlock()+

> +* +odp_rwlock_write_lock()+

> +* +odp_rwlock_write_unlock()+

> +* +odp_sync_stores()+

> +* +odp_ticketlock_init()+

> +* +odp_ticketlock_lock()+

> +* +odp_ticketlock_trylock()+

> +* +odp_ticketlock_unlock()+

> +* +odp_ticketlock_is_locked()+

> +
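
The "synchronize data before messaging another core" case can be sketched
with C11 fences. This is an illustration under assumptions: it covers
core-to-core publication through coherent memory only, the demo_ names are
hypothetical, and it does not show what odp_sync_stores() does on any given
platform.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mailbox shared between a producer and a consumer core. */
typedef struct {
    uint32_t payload;  /* ordinary data, written before publishing */
    atomic_bool ready; /* tells the consumer that payload is valid */
} demo_mailbox_t;

static void demo_mailbox_init(demo_mailbox_t *mb)
{
    assert(mb != NULL);
    mb->payload = 0;
    atomic_init(&mb->ready, false);
}

/* Producer: write the data, then issue a release fence so the payload
 * store cannot be reordered after the flag store. */
static void demo_publish(demo_mailbox_t *mb, uint32_t value)
{
    mb->payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&mb->ready, true, memory_order_relaxed);
}

/* Consumer: returns true and fills *value once data has been published.
 * The acquire fence keeps the payload load after the flag load. */
static bool demo_try_consume(demo_mailbox_t *mb, uint32_t *value)
{
    assert(value != NULL);
    if (!atomic_load_explicit(&mb->ready, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);
    *value = mb->payload;
    return true;
}
```

Messaging a hardware accelerator may need stronger, platform-specific
barriers than these fences, which is the portability gap the section is
describing.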

> +=== Execution barriers and spinlocks

> +Although software locking should be avoided (especially in fast path

> code), at

> +times there is no practical way to synchronize cores other than using

> execution

> +barriers or spinlocks. For example, the application initialization phase

> +typically is not performance critical and may be much simpler with

> synchronous

> +interfaces and locking.

> +

> +* +odp_spinlock_t()+

>


Omit ()


> +* +odp_spinlock_init()+

> +* +odp_spinlock_lock()+

> +* +odp_spinlock_trylock()+

> +* +odp_spinlock_unlock()+

> +* +odp_spinlock_is_locked()+

> +
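
For readers new to the init/lock/trylock/unlock pattern, here is a minimal
test-and-set spinlock in C11. It is a sketch with hypothetical demo_ names,
assuming nothing about how odp_spinlock_t is actually implemented.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    atomic_flag locked;
} demo_spinlock_t;

static void demo_spinlock_init(demo_spinlock_t *lock)
{
    assert(lock != NULL);
    atomic_flag_clear(&lock->locked);
}

/* Returns true if the lock was taken, false if it was already held. */
static bool demo_spinlock_trylock(demo_spinlock_t *lock)
{
    return !atomic_flag_test_and_set_explicit(&lock->locked,
                                              memory_order_acquire);
}

/* Busy-waits until the lock becomes free. */
static void demo_spinlock_lock(demo_spinlock_t *lock)
{
    while (!demo_spinlock_trylock(lock))
        ; /* spin */
}

static void demo_spinlock_unlock(demo_spinlock_t *lock)
{
    atomic_flag_clear_explicit(&lock->locked, memory_order_release);
}
```

Because a waiting core burns cycles in the loop, this fits short critical
sections such as initialization, in line with the advice to keep locking out
of fast path code.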

> +=== Profiling and debugging

> +Although there are (external) tools for profiling and debugging, some

> level of

> +application code instrumentation is typically needed (e.g., for on field

> +debug/profiling). Typically an SoC supports CPU level (e.g., cycle count,

> cache

> +misses, branch prediction misses) and SoC level (system cache misses,

> +interconnect/DRAM utilization) performance counters.

> +

> +* +odp_errno()+

> +* +odp_errno_zero()+

> +* +odp_errno_print()+

> +* +odp_errno_str()+

> +

> +* +odp_override_log()+

> +* +odp_override_abort()+

> +

> +=== SoC Hardware info

> +The application may be interested in generic performance characteristics

> of the

> +SoC it is running on to have optimal adaption to the system.

> +

> +* +odp_cpu_id()+

> +* +odp_cpu_count()+

> +* +odp_cpu_cycles()+

> +* +odp_cpu_cycles_diff()+

> +* +odp_cpu_cycles_max()+

> +* +odp_cpu_cycles_resolution()+

> +

> +=== Data manipulation

> +There are some data manipulation operations that are typical to networking

> +applications. Examples of these are byte order swap for big/little-endian

> +conversion, various checksum algorithms, and bit shuffling/shifting.

> +

> +* +odp_be_to_cpu_16()+

> +* +odp_be_to_cpu_32()+

> +* +odp_be_to_cpu_64()+

> +* +odp_cpu_to_be_16()+

> +* +odp_cpu_to_be_32()+

> +* +odp_cpu_to_be_64()+

> +* +odp_le_to_cpu_16()+

> +* +odp_le_to_cpu_32()+

> +* +odp_le_to_cpu_64()+

> +* +odp_cpu_to_le_16()+

> +* +odp_cpu_to_le_32()+

> +* +odp_cpu_to_le_64()+

> +
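
The byte order case can be shown without any ODP calls. The sketch below
reads and writes big-endian (network order) fields by shifting, which gives
the same result on little- and big-endian CPUs; the demo_ names are
hypothetical and this is not how the odp_be_to_cpu_*() functions are
implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 16-bit big-endian field from a byte buffer. */
static uint16_t demo_load_be16(const uint8_t *p)
{
    assert(p != NULL);
    return (uint16_t)(((uint16_t)p[0] << 8) | p[1]);
}

/* Read a 32-bit big-endian field from a byte buffer. */
static uint32_t demo_load_be32(const uint8_t *p)
{
    assert(p != NULL);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Write a 32-bit value to a byte buffer in big-endian order. */
static void demo_store_be32(uint8_t *p, uint32_t v)
{
    assert(p != NULL);
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}
```

For example, the bytes 08 00 decode to the value 0x0800 regardless of host
byte order, which is the property a parser needs when reading header fields
off the wire.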

>  .Users include structure

>  ----

>  ./

> --

> 2.5.0

>

> _______________________________________________

> lng-odp mailing list

> lng-odp@lists.linaro.org

> https://lists.linaro.org/mailman/listinfo/lng-odp

>
Mike Holmes Dec. 14, 2015, 10:28 p.m. UTC | #2
On 12 December 2015 at 10:13, Bill Fischofer <bill.fischofer@linaro.org>
wrote:

>

>

> On Fri, Dec 11, 2015 at 1:52 PM, Mike Holmes <mike.holmes@linaro.org>

> wrote:

>

>> Signed-off-by: Mike Holmes <mike.holmes@linaro.org>

>> ---

>>  doc/users-guide/users-guide.adoc | 161

>> +++++++++++++++++++++++++++++++++++++++

>>  1 file changed, 161 insertions(+)

>>

>> diff --git a/doc/users-guide/users-guide.adoc

>> b/doc/users-guide/users-guide.adoc

>> index cf77fa0..d2e1a16 100644

>> --- a/doc/users-guide/users-guide.adoc

>> +++ b/doc/users-guide/users-guide.adoc

>> @@ -431,6 +431,167 @@ Applications only include the 'include/odp.h file

>> which includes the 'platform/<

>>  The doxygen documentation defining the behavior of the ODP API is all

>> contained in the public API files, and the actual definitions for an

>> implementation will be found in the per platform directories.

>>  Per-platform data that might normally be a #define can be recovered via

>> the appropriate access function if the #define is not directly visible to

>> the application.

>>

>> +== Helpers

>>

>

> Calling this section Helpers is confusing since we're already using that

> for the ODP helper functions (odph prefix). These are really separate

> sections that are awkwardly under a "Miscellaneous" heading.  Better to

> promote them to first level sections that cover a wider theme rather than

> just enumerating.  For example "Core Management and Isolation", "Memory and

> Cache Management", "Synchronization", etc.

>


This comes from our original overview docs and fleshes out little more than
placeholders for the sections; I will sync with you and see what we can
do.  I think we need to get in some initial guidance on how to do things
like atomics, and I agree the lists need to expand into real prose on the
subject.  I think we can add detail to them one at a time if we flesh out
how they will fit in, basically the same as you just did by filling in the
queues and adding real meat.





-- 
Mike Holmes
Technical Manager - Linaro Networking Group
Linaro.org <http://www.linaro.org/> │ Open source software for ARM SoCs
Bill Fischofer Dec. 14, 2015, 10:36 p.m. UTC | #3
On Mon, Dec 14, 2015 at 4:28 PM, Mike Holmes <mike.holmes@linaro.org> wrote:

>

>

> On 12 December 2015 at 10:13, Bill Fischofer <bill.fischofer@linaro.org>

> wrote:

>

>>

>>

>> On Fri, Dec 11, 2015 at 1:52 PM, Mike Holmes <mike.holmes@linaro.org>

>> wrote:

>>

>>> Signed-off-by: Mike Holmes <mike.holmes@linaro.org>

>>> ---

>>>  doc/users-guide/users-guide.adoc | 161

>>> +++++++++++++++++++++++++++++++++++++++

>>>  1 file changed, 161 insertions(+)

>>>

>>> diff --git a/doc/users-guide/users-guide.adoc

>>> b/doc/users-guide/users-guide.adoc

>>> index cf77fa0..d2e1a16 100644

>>> --- a/doc/users-guide/users-guide.adoc

>>> +++ b/doc/users-guide/users-guide.adoc

>>> @@ -431,6 +431,167 @@ Applications only include the 'include/odp.h file

>>> which includes the 'platform/<

>>>  The doxygen documentation defining the behavior of the ODP API is all

>>> contained in the public API files, and the actual definitions for an

>>> implementation will be found in the per platform directories.

>>>  Per-platform data that might normally be a #define can be recovered via

>>> the appropriate access function if the #define is not directly visible to

>>> the application.

>>>

>>> +== Helpers

>>>

>>

>> Calling this section Helpers is confusing since we're already using that

>> for the ODP helper functions (odph prefix). These are really separate

>> sections that are awkwardly under a "Miscellaneous" heading.  Better to

>> promote them to first level sections that cover a wider theme rather than

>> just enumerating.  For example "Core Management and Isolation", "Memory and

>> Cache Management", "Synchronization", etc.

>>

>

> This comes from our original overview docs and fleshes out little more than

> placeholders for the sections; I will sync with you and see what we can

> do.  I think we need to get in some initial guidance on how to do things

> like atomics, and I agree the lists need to expand into real prose on the

> subject.  I think we can add detail to them one at a time if we flesh out

> how they will fit in, basically the same as you just did by filling in the

> queues and adding real meat.

>

>


Agree.  I just think we shouldn't call this section "Helpers"; that's
confusing because these are ODP_ functions, not ODPH_ functions.



Mike Holmes Dec. 14, 2015, 10:37 p.m. UTC | #4
On 14 December 2015 at 17:36, Bill Fischofer <bill.fischofer@linaro.org>
wrote:

>

> On Mon, Dec 14, 2015 at 4:28 PM, Mike Holmes <mike.holmes@linaro.org>

> wrote:

>

>>

>>

>> On 12 December 2015 at 10:13, Bill Fischofer <bill.fischofer@linaro.org>

>> wrote:

>>

>>>

>>>

>>> On Fri, Dec 11, 2015 at 1:52 PM, Mike Holmes <mike.holmes@linaro.org>

>>> wrote:

>>>

>>>> Signed-off-by: Mike Holmes <mike.holmes@linaro.org>

>>>> ---

>>>>  doc/users-guide/users-guide.adoc | 161

>>>> +++++++++++++++++++++++++++++++++++++++

>>>>  1 file changed, 161 insertions(+)

>>>>

>>>> diff --git a/doc/users-guide/users-guide.adoc

>>>> b/doc/users-guide/users-guide.adoc

>>>> index cf77fa0..d2e1a16 100644

>>>> --- a/doc/users-guide/users-guide.adoc

>>>> +++ b/doc/users-guide/users-guide.adoc

>>>> @@ -431,6 +431,167 @@ Applications only include the 'include/odp.h file

>>>> which includes the 'platform/<

>>>>  The doxygen documentation defining the behavior of the ODP API is all

>>>> contained in the public API files, and the actual definitions for an

>>>> implementation will be found in the per platform directories.

>>>>  Per-platform data that might normally be a #define can be recovered

>>>> via the appropriate access function if the #define is not directly visible

>>>> to the application.

>>>>

>>>> +== Helpers

>>>>

>>>

>>> Calling this section Helpers is confusing since we're already using that

>>> for the ODP helper functions (odph prefix). These are really separate

>>> sections that are awkwardly under a "Miscellaneous" heading.  Better to

>>> promote them to first level sections that cover a wider theme rather than

>>> just enumerating.  For example "Core Management and Isolation", "Memory and

>>> Cache Management", "Synchronization", etc.

>>>

>>

>> This comes from our original overview docs and fleshes out little more than

>> placeholders for the sections; I will sync with you and see what we can

>> do.  I think we need to get in some initial guidance on how to do things

>> like atomics, and I agree the lists need to expand into real prose on the

>> subject.  I think we can add detail to them one at a time if we flesh out

>> how they will fit in, basically the same as you just did by filling in the

>> queues and adding real meat.

>>

>>

>

> Agree.  I just think we shouldn't call this section "Helpers"; that's

> confusing because these are ODP_ functions, not ODPH_ functions.

>


Will fix


>

>

>

>>

>>>

>>>> +Many small helper functions and definitions are needed to enable ODP

>>>> +applications to be hardware optimized but not tied to a particular

>>>> hardware or

>>>> +execution environment. These are typically implemented with inline

>>>> functions,

>>>> +preprocessor macros, or compiler built­in features. Thus API

>>>> definitions are

>>>> +normally inline when possible.

>>>> +

>>>> +=== Core enumeration

>>>> +Application or middleware need to handle physical and/or logical core

>>>> IDs, core

>>>> +counts and core masks quite often. Core enumeration has to remain

>>>> consistent

>>>> +even when core deployment may change during application execution

>>>> (e.g., due to

>>>> +adaptation to changing traffic profile, etc).

>>>> +

>>>> +* +odp_cpumask_from_str()+

>>>> +* +odp_cpumask_to_str()+

>>>> +* +odp_cpumask_zero()+

>>>> +* +odp_cpumask_set()+

>>>> +* +odp_cpumask_setall()+

>>>> +* +odp_cpumask_clr()+

>>>> +* +odp_cpumask_isset()+

>>>> +* +odp_cpumask_count()+

>>>> +* +odp_cpumask_and()+

>>>> +* +odp_cpumask_or()+

>>>> +* +odp_cpumask_xor()+

>>>> +* +odp_cpumask_equal()+

>>>> +* +odp_cpumask_copy()+

>>>> +* +odp_cpumask_first()+

>>>> +* +odp_cpumask_last()+

>>>> +* +odp_cpumask_next()+

>>>> +* +odp_cpumask_default_worker()+

>>>> +* +odp_cpumask_default_control()+

>>>>

>>>

>>> Good start, but we want to do more than just enumerate functions here

>>> (you can get that from the API reference).  The User's Guide should cover

>>> why and how these functions should be used and factored into application

>>> design. OK for now as a skeleton, but we need to expand here.

>>>

>>>

>>>> +

>>>> +=== Memory alignments

>>>> +For optimal performance and scalability (e.g., to avoid false sharing

>>>> and cache

>>>> +line aliasing), some application data structures need to be aligned to

>>>> cache

>>>> +(cache line) and/or memory subsystem (page, DRAM burst) alignments.

>>>> NUMA

>>>> +systems also support location­awareness and potentially different

>>>> cache line

>>>> +sizes on a per­memory basis. Static memory allocation Serves

>>>> application needs

>>>> +for portable definitions for global and core/thread local data.

>>>> +

>>>> +* +ODP_ALIGNED+

>>>> +* +ODP_PACKED+

>>>> +* +ODP_OFFSETOF+

>>>> +* +ODP_FIELD_SIZEOF+

>>>> +* +ODP_CACHE_LINE_SIZE+

>>>> +* +ODP_PAGE_SIZE+

>>>> +* +ODP_ALIGNED_CACHE+

>>>> +* +ODP_ALIGNED_PAGE+

>>>> +

>>>> +=== Compiler hints

>>>> +The compiler and linker can do better optimizations if code includes

>>>> hints on

>>>> +expected application behavior. Examples of these are classification

>>>> of

>>>> +branches with likely/unlikely hints, or marking code with hot

>>>> (optimize for

>>>> +speed) or cold (optimize for size) tags.

>>>> +

>>>> +* +odp_likely()+

>>>> +* +odp_unlikely()+

>>>> +* +odp_prefetch()+

>>>> +* +odp_prefetch_store()+

>>>> +

>>>> +=== Atomic operations

>>>> +Modern ISAs offer various atomic instructions to access/manipulate

>>>> data

>>>> +concurrently from multiple cores. Scalable multicore software is

>>>> possible

>>>> +only through correct usage (and combination) of hardware acceleration

>>>> and

>>>> +atomic instructions. Applications use atomic operations to update

>>>> global

>>>> +statistics, sequence counters, quotas, etc., and to build concurrent

>>>> data

>>>> +structures.

>>>> +

>>>> +* +odp_atomic_init_u64()+

>>>> +* +odp_atomic_load_u64()+

>>>> +* +odp_atomic_store_u64()+

>>>> +* +odp_atomic_fetch_add_u64()+

>>>> +* +odp_atomic_add_u64()+

>>>> +* +odp_atomic_fetch_sub_u64()+

>>>> +* +odp_atomic_sub_u64()+

>>>> +* +odp_atomic_fetch_inc_u64()+

>>>> +* +odp_atomic_inc_u64()+

>>>> +* +odp_atomic_fetch_dec_u64()+

>>>> +* +odp_atomic_dec_u64()+

>>>> +

>>>> +=== Memory synchronization barriers

>>>> +Applications (or middleware) need a portable way to synchronize data

>>>> +modifications into main memory before messaging other cores or hardware

>>>> +accelerators about the changes. The synchronization needs are

>>>> specific to the

>>>> +cache coherence protocol.

>>>> +

>>>> +* +odp_barrier_t()+

>>>

>>> +* +odp_rwlock_t()+

>>>> +* +odp_ticketlock_t()+

>>>>

>>>

>>> The _t names are types, not functions, so omit the () suffixes.

>>>

>>>

>>>> +* +odp_barrier_init()+

>>>> +* +odp_barrier_wait()+

>>>> +* +odp_rwlock_init()+

>>>> +* +odp_rwlock_read_lock()+

>>>> +* +odp_rwlock_read_unlock()+

>>>> +* +odp_rwlock_write_lock()+

>>>> +* +odp_rwlock_write_unlock()+

>>>> +* +odp_sync_stores()+

>>>> +* +odp_ticketlock_init()+

>>>> +* +odp_ticketlock_lock()+

>>>> +* +odp_ticketlock_trylock()+

>>>> +* +odp_ticketlock_unlock()+

>>>> +* +odp_ticketlock_is_locked()+

>>>> +

>>>> +=== Execution barriers and spinlocks

>>>> +Although software locking should be avoided (especially in fast path

>>>> code), at

>>>> +times there is no practical way to synchronize cores other than using

>>>> execution

>>>> +barriers or spinlocks. For example, the application initialization

>>>> phase

>>>> +typically is not performance critical and may be much simpler with

>>>> synchronous

>>>> +interfaces and locking.

>>>> +

>>>> +* +odp_spinlock_t()+

>>>>

>>>

>>> Omit ()

>>>

>>>

>>>> +* +odp_spinlock_init()+

>>>> +* +odp_spinlock_lock()+

>>>> +* +odp_spinlock_trylock()+

>>>> +* +odp_spinlock_unlock()+

>>>> +* +odp_spinlock_is_locked()+

>>>> +

>>>> +=== Profiling and debugging

>>>> +Although there are (external) tools for profiling and debugging, some

>>>> level of

>>>> +application code instrumentation is typically needed (e.g., for

>>>> in-field

>>>> +debug/profiling). Typically an SoC supports CPU level (e.g., cycle

>>>> count, cache

>>>> +misses, branch prediction misses) and SoC level (system cache misses,

>>>> +interconnect/DRAM utilization) performance counters.

>>>> +

>>>> +* +odp_errno()+

>>>> +* +odp_errno_zero()+

>>>> +* +odp_errno_print()+

>>>> +* +odp_errno_str()+

>>>> +

>>>> +* +odp_override_log()+

>>>> +* +odp_override_abort()+

>>>> +

>>>> +=== SoC Hardware info

>>>> +The application may be interested in generic performance

>>>> characteristics of the

>>>> +SoC it is running on to adapt optimally to the system.

>>>> +

>>>> +* +odp_cpu_id()+

>>>> +* +odp_cpu_count()+

>>>> +* +odp_cpu_cycles()+

>>>> +* +odp_cpu_cycles_diff()+

>>>> +* +odp_cpu_cycles_max()+

>>>> +* +odp_cpu_cycles_resolution()+

>>>> +

>>>> +=== Data manipulation

>>>> +There are some data manipulation operations that are typical of

>>>> networking

>>>> +applications. Examples of these are byte order swap for

>>>> big/little-endian

>>>> +conversion, various checksum algorithms, and bit shuffling/shifting.

>>>> +

>>>> +* +odp_be_to_cpu_16()+

>>>> +* +odp_be_to_cpu_32()+

>>>> +* +odp_be_to_cpu_64()+

>>>> +* +odp_cpu_to_be_16()+

>>>> +* +odp_cpu_to_be_32()+

>>>> +* +odp_cpu_to_be_64()+

>>>> +* +odp_le_to_cpu_16()+

>>>> +* +odp_le_to_cpu_32()+

>>>> +* +odp_le_to_cpu_64()+

>>>> +* +odp_cpu_to_le_16()+

>>>> +* +odp_cpu_to_le_32()+

>>>> +* +odp_cpu_to_le_64()+

>>>> +

>>>>  .Users include structure

>>>>  ----

>>>>  ./

>>>> --

>>>> 2.5.0



-- 
Mike Holmes
Technical Manager - Linaro Networking Group
Linaro.org <http://www.linaro.org/> *│ *Open source software for ARM SoCs

Patch

diff --git a/doc/users-guide/users-guide.adoc b/doc/users-guide/users-guide.adoc
index cf77fa0..d2e1a16 100644
--- a/doc/users-guide/users-guide.adoc
+++ b/doc/users-guide/users-guide.adoc
@@ -431,6 +431,167 @@  Applications only include the 'include/odp.h file which includes the 'platform/<
 The doxygen documentation defining the behavior of the ODP API is all contained in the public API files, and the actual definitions for an implementation will be found in the per platform directories.
 Per-platform data that might normally be a #define can be recovered via the appropriate access function if the #define is not directly visible to the application.
 
+== Helpers
+Many small helper functions and definitions are needed to enable ODP
+applications to be hardware optimized but not tied to a particular hardware or
+execution environment. These are typically implemented with inline functions,
+preprocessor macros, or compiler built-in features. Thus API definitions are
+normally inline when possible.
+
+=== Core enumeration
+Applications and middleware often need to handle physical and/or logical core
+IDs, core counts, and core masks. Core enumeration has to remain consistent
+even when core deployment changes during application execution (e.g., due to
+adaptation to a changing traffic profile).
+
+* +odp_cpumask_from_str()+
+* +odp_cpumask_to_str()+
+* +odp_cpumask_zero()+
+* +odp_cpumask_set()+
+* +odp_cpumask_setall()+
+* +odp_cpumask_clr()+
+* +odp_cpumask_isset()+
+* +odp_cpumask_count()+
+* +odp_cpumask_and()+
+* +odp_cpumask_or()+
+* +odp_cpumask_xor()+
+* +odp_cpumask_equal()+
+* +odp_cpumask_copy()+
+* +odp_cpumask_first()+
+* +odp_cpumask_last()+
+* +odp_cpumask_next()+
+* +odp_cpumask_default_worker()+
+* +odp_cpumask_default_control()+
+
+=== Memory alignments
+For optimal performance and scalability (e.g., to avoid false sharing and cache
+line aliasing), some application data structures need to be aligned to cache
+(cache line) and/or memory subsystem (page, DRAM burst) alignments.  NUMA
+systems also support location-awareness and potentially different cache line
+sizes on a per-memory basis. Static memory allocation serves application needs
+for portable definitions for global and core/thread local data.
+
+* +ODP_ALIGNED+
+* +ODP_PACKED+
+* +ODP_OFFSETOF+
+* +ODP_FIELD_SIZEOF+
+* +ODP_CACHE_LINE_SIZE+
+* +ODP_PAGE_SIZE+
+* +ODP_ALIGNED_CACHE+
+* +ODP_ALIGNED_PAGE+
+
+=== Compiler hints
+The compiler and linker can do better optimizations if code includes hints on
+expected application behavior. Examples of these are classification of
+branches with likely/unlikely hints, or marking code with hot (optimize for
+speed) or cold (optimize for size) tags.
+
+* +odp_likely()+
+* +odp_unlikely()+
+* +odp_prefetch()+
+* +odp_prefetch_store()+
+
+=== Atomic operations
+Modern ISAs offer various atomic instructions to access/manipulate data
+concurrently from multiple cores. Scalable multicore software is possible
+only through correct usage (and combination) of hardware acceleration and
+atomic instructions. Applications use atomic operations to update global
+statistics, sequence counters, quotas, etc., and to build concurrent data
+structures.
+
+* +odp_atomic_init_u64()+
+* +odp_atomic_load_u64()+
+* +odp_atomic_store_u64()+
+* +odp_atomic_fetch_add_u64()+
+* +odp_atomic_add_u64()+
+* +odp_atomic_fetch_sub_u64()+
+* +odp_atomic_sub_u64()+
+* +odp_atomic_fetch_inc_u64()+
+* +odp_atomic_inc_u64()+
+* +odp_atomic_fetch_dec_u64()+
+* +odp_atomic_dec_u64()+
+
+=== Memory synchronization barriers
+Applications (or middleware) need a portable way to synchronize data
+modifications into main memory before messaging other cores or hardware
+accelerators about the changes. The synchronization needs are specific to the
+cache coherence protocol.
+
+* +odp_barrier_t()+
+* +odp_rwlock_t()+
+* +odp_ticketlock_t()+
+* +odp_barrier_init()+
+* +odp_barrier_wait()+
+* +odp_rwlock_init()+
+* +odp_rwlock_read_lock()+
+* +odp_rwlock_read_unlock()+
+* +odp_rwlock_write_lock()+
+* +odp_rwlock_write_unlock()+
+* +odp_sync_stores()+
+* +odp_ticketlock_init()+
+* +odp_ticketlock_lock()+
+* +odp_ticketlock_trylock()+
+* +odp_ticketlock_unlock()+
+* +odp_ticketlock_is_locked()+
+
+=== Execution barriers and spinlocks
+Although software locking should be avoided (especially in fast path code), at
+times there is no practical way to synchronize cores other than using execution
+barriers or spinlocks. For example, the application initialization phase
+typically is not performance critical and may be much simpler with synchronous
+interfaces and locking.
+
+* +odp_spinlock_t()+
+* +odp_spinlock_init()+
+* +odp_spinlock_lock()+
+* +odp_spinlock_trylock()+
+* +odp_spinlock_unlock()+
+* +odp_spinlock_is_locked()+
+
+=== Profiling and debugging
+Although there are (external) tools for profiling and debugging, some level of
+application code instrumentation is typically needed (e.g., for in-field
+debug/profiling). Typically an SoC supports CPU level (e.g., cycle count, cache
+misses, branch prediction misses) and SoC level (system cache misses,
+interconnect/DRAM utilization) performance counters.
+
+* +odp_errno()+
+* +odp_errno_zero()+
+* +odp_errno_print()+
+* +odp_errno_str()+
+
+* +odp_override_log()+
+* +odp_override_abort()+
+
+=== SoC Hardware info
+The application may be interested in generic performance characteristics of the
+SoC it is running on to adapt optimally to the system.
+
+* +odp_cpu_id()+
+* +odp_cpu_count()+
+* +odp_cpu_cycles()+
+* +odp_cpu_cycles_diff()+
+* +odp_cpu_cycles_max()+
+* +odp_cpu_cycles_resolution()+
+
+=== Data manipulation
+There are some data manipulation operations that are typical of networking
+applications. Examples of these are byte order swap for big/little-endian
+conversion, various checksum algorithms, and bit shuffling/shifting.
+
+* +odp_be_to_cpu_16()+
+* +odp_be_to_cpu_32()+
+* +odp_be_to_cpu_64()+
+* +odp_cpu_to_be_16()+
+* +odp_cpu_to_be_32()+
+* +odp_cpu_to_be_64()+
+* +odp_le_to_cpu_16()+
+* +odp_le_to_cpu_32()+
+* +odp_le_to_cpu_64()+
+* +odp_cpu_to_le_16()+
+* +odp_cpu_to_le_32()+
+* +odp_cpu_to_le_64()+
+
 .Users include structure
 ----
 ./