[API-NEXT,RFC,1/1] api: cpu: performance profiling start/stop

Message ID 1462340707-14796-2-git-send-email-yi.he@linaro.org
State New

Commit Message

Yi He May 4, 2016, 5:45 a.m. UTC
Establish a performance profiling environment that guarantees meaningful
and consistent results from consecutive invocations of the odp_cpu_xxx()
APIs. Once profiling is done, restore the execution environment to its
multi-core optimized state.

Signed-off-by: Yi He <yi.he@linaro.org>
---
 include/odp/api/spec/cpu.h | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)
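
A rough usage sketch of the proposed API (odp_profiler_t and the start/stop
functions are the ones added by this patch; the measured work and the cycle
accounting around it are illustrative only, and odp_api.h is assumed as the
umbrella header):

#include <odp_api.h>

/* Profile one code section with consistent processor-local cycle counts */
static uint64_t profile_section(void (*work)(void))
{
	odp_profiler_t prof;
	uint64_t c1, c2;

	prof = odp_profiler_start();  /* proposed: set up profiling env */

	c1 = odp_cpu_cycles();
	work();
	c2 = odp_cpu_cycles();

	if (odp_profiler_stop(prof))  /* proposed: restore normal operation */
		return 0;

	return odp_cpu_cycles_diff(c2, c1);  /* handles counter wrap */
}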

Comments

Mike Holmes May 4, 2016, 3:23 p.m. UTC | #1
It sounded like the arch call was leaning towards documenting that on
odp-linux the application must ensure that odp_threads are pinned to cores
when launched.
This is a restriction that some platforms may not need to make, versus the
idea that a piece of ODP code can use these APIs to ensure the behavior it
needs without knowledge of, or reliance on, the wider system.

On 4 May 2016 at 01:45, Yi He <yi.he@linaro.org> wrote:

> Establish a performance profiling environment guarantees meaningful

> and consistency of consecutive invocations of the odp_cpu_xxx() APIs.

> While after profiling was done restore the execution environment to

> its multi-core optimized state.

>

> Signed-off-by: Yi He <yi.he@linaro.org>

> ---

>  include/odp/api/spec/cpu.h | 31 +++++++++++++++++++++++++++++++

>  1 file changed, 31 insertions(+)

>

> diff --git a/include/odp/api/spec/cpu.h b/include/odp/api/spec/cpu.h

> index 2789511..0bc9327 100644

> --- a/include/odp/api/spec/cpu.h

> +++ b/include/odp/api/spec/cpu.h

> @@ -27,6 +27,21 @@ extern "C" {

>

>

>  /**

> + * @typedef odp_profiler_t

> + * ODP performance profiler handle

> + */

> +

> +/**

> + * Setup a performance profiling environment

> + *

> + * A performance profiling environment guarantees meaningful and

> consistency of

> + * consecutive invocations of the odp_cpu_xxx() APIs.

> + *

> + * @return performance profiler handle

> + */

> +odp_profiler_t odp_profiler_start(void);

> +

> +/**

>   * CPU identifier

>   *

>   * Determine CPU identifier on which the calling is running. CPU

> numbering is

> @@ -170,6 +185,22 @@ uint64_t odp_cpu_cycles_resolution(void);

>  void odp_cpu_pause(void);

>

>  /**

> + * Stop the performance profiling environment

> + *

> + * Stop performance profiling and restore the execution environment to its

> + * multi-core optimized state, won't preserve meaningful and consistency

> of

> + * consecutive invocations of the odp_cpu_xxx() APIs anymore.

> + *

> + * @param profiler  performance profiler handle

> + *

> + * @retval 0 on success

> + * @retval <0 on failure

> + *

> + * @see odp_profiler_start()

> + */

> +int odp_profiler_stop(odp_profiler_t profiler);

> +

> +/**

>   * @}

>   */

>

> --

> 1.9.1

>

> _______________________________________________

> lng-odp mailing list

> lng-odp@lists.linaro.org

> https://lists.linaro.org/mailman/listinfo/lng-odp

>




-- 
Mike Holmes
Technical Manager - Linaro Networking Group
Linaro.org <http://www.linaro.org/> *│ *Open source software for ARM SoCs
"Work should be fun and collaborative, the rest follows"
Bill Fischofer May 4, 2016, 3:32 p.m. UTC | #2
I think there are two fallouts from this discussion.  First, there is the
question of the precise semantics of the existing timing APIs as they
relate to processor locality. Applications such as profiling tests, to the
extent that they use APIs that have processor-local semantics, must ensure
that the thread(s) using these APIs are pinned for the duration of the
measurement.

The other point is the one that Petri brought up about having other APIs
that provide timing information based on wall time or other metrics that
are not processor-local.  While these may not have the same performance
characteristics, they would be independent of thread migration
considerations.

Of course all this depends on exactly what one is trying to measure. Since
thread migration is not free, allowing such activity may or may not be
relevant to what is being measured, so ODP probably wants to have both
processor-local and system-wide timing APIs.  We just need to be sure they
are specified precisely so that applications know how to use them properly.
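
For example, a minimal sketch of the two styles using today's APIs (the
work() function is a placeholder):

#include <odp_api.h>

/* Processor-local: only meaningful if the calling thread stays on the same
 * CPU between the two odp_cpu_cycles() calls. */
static uint64_t measure_cycles(void (*work)(void))
{
	uint64_t c1 = odp_cpu_cycles();

	work();

	return odp_cpu_cycles_diff(odp_cpu_cycles(), c1);
}

/* Wall time: independent of which CPU(s) the thread ran on. */
static uint64_t measure_ns(void (*work)(void))
{
	odp_time_t t1 = odp_time_local();

	work();

	return odp_time_to_ns(odp_time_diff(odp_time_local(), t1));
}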

On Wed, May 4, 2016 at 10:23 AM, Mike Holmes <mike.holmes@linaro.org> wrote:

> It sounded like the arch call was leaning towards documenting that on

> odp-linux  the application must ensure that odp_threads are pinned to cores

> when launched.

> This is a restriction that some platforms may not need to make, vs the

> idea that a piece of ODP code can use these APIs to ensure the behavior it

> needs without knowledge or reliance on the wider system.

>

> On 4 May 2016 at 01:45, Yi He <yi.he@linaro.org> wrote:

>

>> Establish a performance profiling environment guarantees meaningful

>> and consistency of consecutive invocations of the odp_cpu_xxx() APIs.

>> While after profiling was done restore the execution environment to

>> its multi-core optimized state.

>>

>> Signed-off-by: Yi He <yi.he@linaro.org>

>> ---

>>  include/odp/api/spec/cpu.h | 31 +++++++++++++++++++++++++++++++

>>  1 file changed, 31 insertions(+)

>>

>> diff --git a/include/odp/api/spec/cpu.h b/include/odp/api/spec/cpu.h

>> index 2789511..0bc9327 100644

>> --- a/include/odp/api/spec/cpu.h

>> +++ b/include/odp/api/spec/cpu.h

>> @@ -27,6 +27,21 @@ extern "C" {

>>

>>

>>  /**

>> + * @typedef odp_profiler_t

>> + * ODP performance profiler handle

>> + */

>> +

>> +/**

>> + * Setup a performance profiling environment

>> + *

>> + * A performance profiling environment guarantees meaningful and

>> consistency of

>> + * consecutive invocations of the odp_cpu_xxx() APIs.

>> + *

>> + * @return performance profiler handle

>> + */

>> +odp_profiler_t odp_profiler_start(void);

>> +

>> +/**

>>   * CPU identifier

>>   *

>>   * Determine CPU identifier on which the calling is running. CPU

>> numbering is

>> @@ -170,6 +185,22 @@ uint64_t odp_cpu_cycles_resolution(void);

>>  void odp_cpu_pause(void);

>>

>>  /**

>> + * Stop the performance profiling environment

>> + *

>> + * Stop performance profiling and restore the execution environment to

>> its

>> + * multi-core optimized state, won't preserve meaningful and consistency

>> of

>> + * consecutive invocations of the odp_cpu_xxx() APIs anymore.

>> + *

>> + * @param profiler  performance profiler handle

>> + *

>> + * @retval 0 on success

>> + * @retval <0 on failure

>> + *

>> + * @see odp_profiler_start()

>> + */

>> +int odp_profiler_stop(odp_profiler_t profiler);

>> +

>> +/**

>>   * @}

>>   */

>>

>> --

>> 1.9.1

>>

>> _______________________________________________

>> lng-odp mailing list

>> lng-odp@lists.linaro.org

>> https://lists.linaro.org/mailman/listinfo/lng-odp

>>

>

>

>

> --

> Mike Holmes

> Technical Manager - Linaro Networking Group

> Linaro.org <http://www.linaro.org/> *│ *Open source software for ARM SoCs

> "Work should be fun and collaborative, the rest follows"

>

>

>

> _______________________________________________

> lng-odp mailing list

> lng-odp@lists.linaro.org

> https://lists.linaro.org/mailman/listinfo/lng-odp

>

>
Yi He May 5, 2016, 2:45 a.m. UTC | #3
Hi, thanks Mike and Bill,

From your clear summary, can we put it into several TO-DO decisions? (We
can have a discussion in the next ARCH call.)

   1. How to address the precise semantics of the existing timing APIs
   (odp_cpu_xxx) as they relate to processor locality.

   - *A:* guarantee this by adding a constraint to the ODP thread concept:
   every ODP thread shall be deployed and pinned on one CPU core.
      - A sub-question: my understanding is that application programmers
      only need to specify available CPU sets for control/worker threads,
      and it is up to ODP to arrange the threads onto each CPU core while
      launching, right?
   - *B:* guarantee this by adding new APIs to disable/enable CPU migration
   (a minimal sketch follows below).
   - Then document this clearly in the user's guide or API documentation.

   2. Understand the requirement to have both processor-local and
   system-wide timing APIs:

   - There are some APIs available in time.h (odp_time_local(), etc).
   - We can have a thread to work out the relationship, usage scenarios,
   and constraints of the APIs in time.h and cpu.h.
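
For option B, a minimal odp-linux-oriented sketch of what "disable CPU
migration" could do underneath (pin_to_current_cpu() is hypothetical, not
an existing ODP API):

#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>

/* Hypothetical: pin the calling thread to the CPU it is currently running
 * on, so consecutive odp_cpu_xxx() calls stay processor-local. */
static int pin_to_current_cpu(void)
{
	cpu_set_t set;
	int cpu = sched_getcpu();

	if (cpu < 0)
		return -1;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);

	return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}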

Best Regards, Yi

On 4 May 2016 at 23:32, Bill Fischofer <bill.fischofer@linaro.org> wrote:

> I think there are two fallouts form this discussion.  First, there is the

> question of the precise semantics of the existing timing APIs as they

> relate to processor locality. Applications such as profiling tests, to the

> extent that they APIs that have processor-local semantics, must ensure that

> the thread(s) using these APIs are pinned for the duration of the

> measurement.

>

> The other point is the one that Petri brought up about having other APIs

> that provide timing information based on wall time or other metrics that

> are not processor-local.  While these may not have the same performance

> characteristics, they would be independent of thread migration

> considerations.

>

> Of course all this depends on exactly what one is trying to measure. Since

> thread migration is not free, allowing such activity may or may not be

> relevant to what is being measured, so ODP probably wants to have both

> processor-local and systemwide timing APIs.  We just need to be sure they

> are specified precisely so that applications know how to use them properly.

>

> On Wed, May 4, 2016 at 10:23 AM, Mike Holmes <mike.holmes@linaro.org>

> wrote:

>

>> It sounded like the arch call was leaning towards documenting that on

>> odp-linux  the application must ensure that odp_threads are pinned to cores

>> when launched.

>> This is a restriction that some platforms may not need to make, vs the

>> idea that a piece of ODP code can use these APIs to ensure the behavior it

>> needs without knowledge or reliance on the wider system.

>>

>> On 4 May 2016 at 01:45, Yi He <yi.he@linaro.org> wrote:

>>

>>> Establish a performance profiling environment guarantees meaningful

>>> and consistency of consecutive invocations of the odp_cpu_xxx() APIs.

>>> While after profiling was done restore the execution environment to

>>> its multi-core optimized state.

>>>

>>> Signed-off-by: Yi He <yi.he@linaro.org>

>>> ---

>>>  include/odp/api/spec/cpu.h | 31 +++++++++++++++++++++++++++++++

>>>  1 file changed, 31 insertions(+)

>>>

>>> diff --git a/include/odp/api/spec/cpu.h b/include/odp/api/spec/cpu.h

>>> index 2789511..0bc9327 100644

>>> --- a/include/odp/api/spec/cpu.h

>>> +++ b/include/odp/api/spec/cpu.h

>>> @@ -27,6 +27,21 @@ extern "C" {

>>>

>>>

>>>  /**

>>> + * @typedef odp_profiler_t

>>> + * ODP performance profiler handle

>>> + */

>>> +

>>> +/**

>>> + * Setup a performance profiling environment

>>> + *

>>> + * A performance profiling environment guarantees meaningful and

>>> consistency of

>>> + * consecutive invocations of the odp_cpu_xxx() APIs.

>>> + *

>>> + * @return performance profiler handle

>>> + */

>>> +odp_profiler_t odp_profiler_start(void);

>>> +

>>> +/**

>>>   * CPU identifier

>>>   *

>>>   * Determine CPU identifier on which the calling is running. CPU

>>> numbering is

>>> @@ -170,6 +185,22 @@ uint64_t odp_cpu_cycles_resolution(void);

>>>  void odp_cpu_pause(void);

>>>

>>>  /**

>>> + * Stop the performance profiling environment

>>> + *

>>> + * Stop performance profiling and restore the execution environment to

>>> its

>>> + * multi-core optimized state, won't preserve meaningful and

>>> consistency of

>>> + * consecutive invocations of the odp_cpu_xxx() APIs anymore.

>>> + *

>>> + * @param profiler  performance profiler handle

>>> + *

>>> + * @retval 0 on success

>>> + * @retval <0 on failure

>>> + *

>>> + * @see odp_profiler_start()

>>> + */

>>> +int odp_profiler_stop(odp_profiler_t profiler);

>>> +

>>> +/**

>>>   * @}

>>>   */

>>>

>>> --

>>> 1.9.1

>>>

>>> _______________________________________________

>>> lng-odp mailing list

>>> lng-odp@lists.linaro.org

>>> https://lists.linaro.org/mailman/listinfo/lng-odp

>>>

>>

>>

>>

>> --

>> Mike Holmes

>> Technical Manager - Linaro Networking Group

>> Linaro.org <http://www.linaro.org/> *│ *Open source software for ARM SoCs

>> "Work should be fun and collaborative, the rest follows"

>>

>>

>>

>> _______________________________________________

>> lng-odp mailing list

>> lng-odp@lists.linaro.org

>> https://lists.linaro.org/mailman/listinfo/lng-odp

>>

>>

>
Bill Fischofer May 5, 2016, 10:50 a.m. UTC | #4
I've added this to the agenda for Monday's call; however, I suggest we
continue the dialog here as well, as background.

Regarding thread pinning, there's always been a tradeoff there. On the one
hand, dedicating cores to threads is ideal for scale-out in many-core
systems; on the other, ODP does not require many-core environments to work
effectively, so ODP APIs enable but do not require or assume that cores are
dedicated to threads. That's really a question of application design and
fit to the particular platform it's running on. In embedded environments
you'll likely see this model more, since the application knows which
platform it's being targeted for. In VNF environments, by contrast, you're
more likely to see a blend where applications take advantage of however
many cores are available to them but will still run without dedicated
cores in environments with more modest resources.

On Wed, May 4, 2016 at 9:45 PM, Yi He <yi.he@linaro.org> wrote:

> Hi, thanks Mike and Bill,

>

> From your clear summarize can we put it into several TO-DO decisions: (we

> can have a discussion in next ARCH call):

>

>    1. How to addressing the precise semantics of the existing timing APIs

>    (odp_cpu_xxx) as they relate to processor locality.

>

>

>    - *A:* guarantee by adding constraint to ODP thread concept: every ODP

>    thread shall be deployed and pinned on one CPU core.

>       - A sub-question: my understanding is that application programmers

>       only need to specify available CPU sets for control/worker threads, and it

>       is ODP to arrange the threads onto each CPU core while launching, right?

>    - *B*: guarantee by adding new APIs to disable/enable CPU migration.

>    - Then document clearly in user's guide or API document.

>

>

>    1. Understand the requirement to have both processor-local and

>    system-wide timing APIs:

>

>

>    - There are some APIs available in time.h (odp_time_local(), etc).

>    - We can have a thread to understand the relationship, usage scenarios

>    and constraints of APIs in time.h and cpu.h.

>

> Best Regards, Yi

>

> On 4 May 2016 at 23:32, Bill Fischofer <bill.fischofer@linaro.org> wrote:

>

>> I think there are two fallouts form this discussion.  First, there is the

>> question of the precise semantics of the existing timing APIs as they

>> relate to processor locality. Applications such as profiling tests, to the

>> extent that they APIs that have processor-local semantics, must ensure that

>> the thread(s) using these APIs are pinned for the duration of the

>> measurement.

>>

>> The other point is the one that Petri brought up about having other APIs

>> that provide timing information based on wall time or other metrics that

>> are not processor-local.  While these may not have the same performance

>> characteristics, they would be independent of thread migration

>> considerations.

>>

>> Of course all this depends on exactly what one is trying to measure.

>> Since thread migration is not free, allowing such activity may or may not

>> be relevant to what is being measured, so ODP probably wants to have both

>> processor-local and systemwide timing APIs.  We just need to be sure they

>> are specified precisely so that applications know how to use them properly.

>>

>> On Wed, May 4, 2016 at 10:23 AM, Mike Holmes <mike.holmes@linaro.org>

>> wrote:

>>

>>> It sounded like the arch call was leaning towards documenting that on

>>> odp-linux  the application must ensure that odp_threads are pinned to cores

>>> when launched.

>>> This is a restriction that some platforms may not need to make, vs the

>>> idea that a piece of ODP code can use these APIs to ensure the behavior it

>>> needs without knowledge or reliance on the wider system.

>>>

>>> On 4 May 2016 at 01:45, Yi He <yi.he@linaro.org> wrote:

>>>

>>>> Establish a performance profiling environment guarantees meaningful

>>>> and consistency of consecutive invocations of the odp_cpu_xxx() APIs.

>>>> While after profiling was done restore the execution environment to

>>>> its multi-core optimized state.

>>>>

>>>> Signed-off-by: Yi He <yi.he@linaro.org>

>>>> ---

>>>>  include/odp/api/spec/cpu.h | 31 +++++++++++++++++++++++++++++++

>>>>  1 file changed, 31 insertions(+)

>>>>

>>>> diff --git a/include/odp/api/spec/cpu.h b/include/odp/api/spec/cpu.h

>>>> index 2789511..0bc9327 100644

>>>> --- a/include/odp/api/spec/cpu.h

>>>> +++ b/include/odp/api/spec/cpu.h

>>>> @@ -27,6 +27,21 @@ extern "C" {

>>>>

>>>>

>>>>  /**

>>>> + * @typedef odp_profiler_t

>>>> + * ODP performance profiler handle

>>>> + */

>>>> +

>>>> +/**

>>>> + * Setup a performance profiling environment

>>>> + *

>>>> + * A performance profiling environment guarantees meaningful and

>>>> consistency of

>>>> + * consecutive invocations of the odp_cpu_xxx() APIs.

>>>> + *

>>>> + * @return performance profiler handle

>>>> + */

>>>> +odp_profiler_t odp_profiler_start(void);

>>>> +

>>>> +/**

>>>>   * CPU identifier

>>>>   *

>>>>   * Determine CPU identifier on which the calling is running. CPU

>>>> numbering is

>>>> @@ -170,6 +185,22 @@ uint64_t odp_cpu_cycles_resolution(void);

>>>>  void odp_cpu_pause(void);

>>>>

>>>>  /**

>>>> + * Stop the performance profiling environment

>>>> + *

>>>> + * Stop performance profiling and restore the execution environment to

>>>> its

>>>> + * multi-core optimized state, won't preserve meaningful and

>>>> consistency of

>>>> + * consecutive invocations of the odp_cpu_xxx() APIs anymore.

>>>> + *

>>>> + * @param profiler  performance profiler handle

>>>> + *

>>>> + * @retval 0 on success

>>>> + * @retval <0 on failure

>>>> + *

>>>> + * @see odp_profiler_start()

>>>> + */

>>>> +int odp_profiler_stop(odp_profiler_t profiler);

>>>> +

>>>> +/**

>>>>   * @}

>>>>   */

>>>>

>>>> --

>>>> 1.9.1

>>>>

>>>> _______________________________________________

>>>> lng-odp mailing list

>>>> lng-odp@lists.linaro.org

>>>> https://lists.linaro.org/mailman/listinfo/lng-odp

>>>>

>>>

>>>

>>>

>>> --

>>> Mike Holmes

>>> Technical Manager - Linaro Networking Group

>>> Linaro.org <http://www.linaro.org/> *│ *Open source software for ARM

>>> SoCs

>>> "Work should be fun and collaborative, the rest follows"

>>>

>>>

>>>

>>> _______________________________________________

>>> lng-odp mailing list

>>> lng-odp@lists.linaro.org

>>> https://lists.linaro.org/mailman/listinfo/lng-odp

>>>

>>>

>>

>
Yi He May 5, 2016, 12:59 p.m. UTC | #5
Hi, thanks Bill

I now understand the ODP thread concept more deeply, and that in embedded
apps developers are involved in target-platform tuning/optimization.

Can I give a little example: say we have a data-plane app which includes 3
ODP threads, and we would like to install and run it on 2 platforms.

   - Platform A: 2 cores.
   - Platform B: 10 cores.

Question: which of the assumptions below matches the current ODP
programming model?

*1, *The application developer writes target-platform-specific code to
specify that:

On platform A, run thread (0) on core (0) and threads (1,2) on core (1).
On platform B, run thread (0) on core (0), let thread (1) scale out into 8
instances on cores (1~8), and run thread (2) on core (9).

Installing and running on a different platform requires the above
platform-specific code and recompilation for the target.

*2, *The application developer writes code to specify:

Threads (0, 2) would not scale out
Thread (1) can scale out (up to a limit N?)
Platform A has 3 cores available (as a command-line parameter?)
Platform B has 10 cores available (as a command-line parameter?)

Installing and running on a different platform may not require
re-compilation; ODP intelligently arranges the threads according to the
information provided.

Last question: in a case like power-save mode, where the available cores
shrink, would ODP intelligently re-arrange the ODP threads dynamically at
runtime?

Thanks and Best Regards, Yi

On 5 May 2016 at 18:50, Bill Fischofer <bill.fischofer@linaro.org> wrote:

> I've added this to the agenda for Monday's call, however I suggest we

> continue the dialog here as well as background.

>

> Regarding thread pinning, there's always been a tradeoff on that.  On the

> one hand dedicating cores to threads is ideal for scale out in many core

> systems, however ODP does not require many core environments to work

> effectively, so ODP APIs enable but do not require or assume that cores are

> dedicated to threads. That's really a question of application design and

> fit to the particular platform it's running on. In embedded environments

> you'll likely see this model more since the application knows which

> platform it's being targeted for. In VNF environments, by contrast, you're

> more likely to see a blend where applications will take advantage of

> however many cores are available to it but will still run without dedicated

> cores in environments with more modest resources.

>

> On Wed, May 4, 2016 at 9:45 PM, Yi He <yi.he@linaro.org> wrote:

>

>> Hi, thanks Mike and Bill,

>>

>> From your clear summarize can we put it into several TO-DO decisions: (we

>> can have a discussion in next ARCH call):

>>

>>    1. How to addressing the precise semantics of the existing timing

>>    APIs (odp_cpu_xxx) as they relate to processor locality.

>>

>>

>>    - *A:* guarantee by adding constraint to ODP thread concept: every

>>    ODP thread shall be deployed and pinned on one CPU core.

>>       - A sub-question: my understanding is that application programmers

>>       only need to specify available CPU sets for control/worker threads, and it

>>       is ODP to arrange the threads onto each CPU core while launching, right?

>>    - *B*: guarantee by adding new APIs to disable/enable CPU migration.

>>    - Then document clearly in user's guide or API document.

>>

>>

>>    1. Understand the requirement to have both processor-local and

>>    system-wide timing APIs:

>>

>>

>>    - There are some APIs available in time.h (odp_time_local(), etc).

>>    - We can have a thread to understand the relationship, usage

>>    scenarios and constraints of APIs in time.h and cpu.h.

>>

>> Best Regards, Yi

>>

>> On 4 May 2016 at 23:32, Bill Fischofer <bill.fischofer@linaro.org> wrote:

>>

>>> I think there are two fallouts form this discussion.  First, there is

>>> the question of the precise semantics of the existing timing APIs as they

>>> relate to processor locality. Applications such as profiling tests, to the

>>> extent that they APIs that have processor-local semantics, must ensure that

>>> the thread(s) using these APIs are pinned for the duration of the

>>> measurement.

>>>

>>> The other point is the one that Petri brought up about having other APIs

>>> that provide timing information based on wall time or other metrics that

>>> are not processor-local.  While these may not have the same performance

>>> characteristics, they would be independent of thread migration

>>> considerations.

>>>

>>> Of course all this depends on exactly what one is trying to measure.

>>> Since thread migration is not free, allowing such activity may or may not

>>> be relevant to what is being measured, so ODP probably wants to have both

>>> processor-local and systemwide timing APIs.  We just need to be sure they

>>> are specified precisely so that applications know how to use them properly.

>>>

>>> On Wed, May 4, 2016 at 10:23 AM, Mike Holmes <mike.holmes@linaro.org>

>>> wrote:

>>>

>>>> It sounded like the arch call was leaning towards documenting that on

>>>> odp-linux  the application must ensure that odp_threads are pinned to cores

>>>> when launched.

>>>> This is a restriction that some platforms may not need to make, vs the

>>>> idea that a piece of ODP code can use these APIs to ensure the behavior it

>>>> needs without knowledge or reliance on the wider system.

>>>>

>>>> On 4 May 2016 at 01:45, Yi He <yi.he@linaro.org> wrote:

>>>>

>>>>> Establish a performance profiling environment guarantees meaningful

>>>>> and consistency of consecutive invocations of the odp_cpu_xxx() APIs.

>>>>> While after profiling was done restore the execution environment to

>>>>> its multi-core optimized state.

>>>>>

>>>>> Signed-off-by: Yi He <yi.he@linaro.org>

>>>>> ---

>>>>>  include/odp/api/spec/cpu.h | 31 +++++++++++++++++++++++++++++++

>>>>>  1 file changed, 31 insertions(+)

>>>>>

>>>>> diff --git a/include/odp/api/spec/cpu.h b/include/odp/api/spec/cpu.h

>>>>> index 2789511..0bc9327 100644

>>>>> --- a/include/odp/api/spec/cpu.h

>>>>> +++ b/include/odp/api/spec/cpu.h

>>>>> @@ -27,6 +27,21 @@ extern "C" {

>>>>>

>>>>>

>>>>>  /**

>>>>> + * @typedef odp_profiler_t

>>>>> + * ODP performance profiler handle

>>>>> + */

>>>>> +

>>>>> +/**

>>>>> + * Setup a performance profiling environment

>>>>> + *

>>>>> + * A performance profiling environment guarantees meaningful and

>>>>> consistency of

>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs.

>>>>> + *

>>>>> + * @return performance profiler handle

>>>>> + */

>>>>> +odp_profiler_t odp_profiler_start(void);

>>>>> +

>>>>> +/**

>>>>>   * CPU identifier

>>>>>   *

>>>>>   * Determine CPU identifier on which the calling is running. CPU

>>>>> numbering is

>>>>> @@ -170,6 +185,22 @@ uint64_t odp_cpu_cycles_resolution(void);

>>>>>  void odp_cpu_pause(void);

>>>>>

>>>>>  /**

>>>>> + * Stop the performance profiling environment

>>>>> + *

>>>>> + * Stop performance profiling and restore the execution environment

>>>>> to its

>>>>> + * multi-core optimized state, won't preserve meaningful and

>>>>> consistency of

>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs anymore.

>>>>> + *

>>>>> + * @param profiler  performance profiler handle

>>>>> + *

>>>>> + * @retval 0 on success

>>>>> + * @retval <0 on failure

>>>>> + *

>>>>> + * @see odp_profiler_start()

>>>>> + */

>>>>> +int odp_profiler_stop(odp_profiler_t profiler);

>>>>> +

>>>>> +/**

>>>>>   * @}

>>>>>   */

>>>>>

>>>>> --

>>>>> 1.9.1

>>>>>

>>>>> _______________________________________________

>>>>> lng-odp mailing list

>>>>> lng-odp@lists.linaro.org

>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp

>>>>>

>>>>

>>>>

>>>>

>>>> --

>>>> Mike Holmes

>>>> Technical Manager - Linaro Networking Group

>>>> Linaro.org <http://www.linaro.org/> *│ *Open source software for ARM

>>>> SoCs

>>>> "Work should be fun and collaborative, the rest follows"

>>>>

>>>>

>>>>

>>>> _______________________________________________

>>>> lng-odp mailing list

>>>> lng-odp@lists.linaro.org

>>>> https://lists.linaro.org/mailman/listinfo/lng-odp

>>>>

>>>>

>>>

>>

>
Bill Fischofer May 6, 2016, 2:23 p.m. UTC | #6
These are all good questions. ODP divides threads into worker threads and
control threads. The distinction is that worker threads are supposed to be
performance sensitive and perform optimally with dedicated cores while
control threads perform more "housekeeping" functions and would be less
impacted by sharing cores.

In the absence of explicit API calls, it is unspecified how an ODP
implementation assigns threads to cores. The distinction between worker and
control threads is a hint to the underlying implementation that it should
use in managing available processor resources.

The APIs in cpumask.h enable applications to determine how many CPUs are
available to it and how to divide them among worker and control threads
(odp_cpumask_default_worker() and odp_cpumask_default_control()).  Note
that ODP does not provide APIs for setting specific threads to specific
CPUs, so keep that in mind in the answers below.
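
As a sketch of that inquiry/partitioning flow (assumes odp_init_global()
has already been called):

#include <stdio.h>
#include <odp_api.h>

/* Discover available CPUs and split them between control and worker use */
static void size_threads(void)
{
	odp_cpumask_t all, control, worker;
	int num_all, num_control, num_worker;

	num_all = odp_cpumask_all_available(&all);

	/* Ask for one control CPU; leave the rest for workers. The
	 * implementation returns what it can actually provide. */
	num_control = odp_cpumask_default_control(&control, 1);
	num_worker = odp_cpumask_default_worker(&worker,
						num_all > 1 ? num_all - 1 : 1);

	printf("CPUs: %d available, %d control, %d worker\n",
	       num_all, num_control, num_worker);
}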


On Thu, May 5, 2016 at 7:59 AM, Yi He <yi.he@linaro.org> wrote:

> Hi, thanks Bill

>

> I understand more deeply of ODP thread concept and in embedded app

> developers are involved in target platform tuning/optimization.

>

> Can I have a little example: say we have a data-plane app which includes 3

> ODP threads. And would like to install and run it upon 2 platforms.

>

>    - Platform A: 2 cores.

>    - Platform B: 10 cores

>

During initialization, the application can use odp_cpumask_all_available()
to determine how many CPUs are available and can (optionally) use
odp_cpumask_default_worker() and odp_cpumask_default_control() to divide
them into CPUs that should be used for worker and control threads,
respectively. For an application designed for scale-out, the number of
available CPUs would typically be used to control how many worker threads
the application creates. If the number of worker threads matches the number
of worker CPUs then the ODP implementation would be expected to dedicate a
worker core to each worker thread. If more threads are created than there
are corresponding cores, then it is up to each implementation how it
multiplexes them among the available cores in a fair manner.
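
A sketch of that one-thread-per-worker-CPU fan-out on odp-linux, using the
worker mask directly (worker_fn() is the application's worker loop; this is
roughly what the odph_linux_* helpers do on the application's behalf):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <odp_api.h>

extern void *worker_fn(void *arg);  /* application's worker loop (assumed) */

/* Create one worker thread per CPU in the worker mask and pin it there */
static int launch_workers(const odp_cpumask_t *workers, pthread_t *tbl)
{
	int num = 0;
	int cpu = odp_cpumask_first(workers);

	while (cpu >= 0) {
		cpu_set_t set;
		pthread_attr_t attr;

		CPU_ZERO(&set);
		CPU_SET(cpu, &set);
		pthread_attr_init(&attr);
		pthread_attr_setaffinity_np(&attr, sizeof(set), &set);

		if (pthread_create(&tbl[num], &attr, worker_fn, NULL))
			return -1;

		pthread_attr_destroy(&attr);
		num++;
		cpu = odp_cpumask_next(workers, cpu);
	}

	return num;  /* number of workers launched */
}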


> Question, which one of the below assumptions is the current ODP

> programming model?

>

> *1, *Application developer writes target platform specific code to tell

> that:

>

> On platform A run threads (0) on core (0), and threads (1,2) on core (1).

> On platform B run threads (0) on core (0), and threads (1) can scale out

> and duplicate 8 instances on core (1~8), and thread (2) on core (9).

>


As noted, ODP does not provide APIs that permit specific threads to be
assigned to specific cores. Instead it is up to each ODP implementation how
it maps ODP threads to available CPUs, subject to the advisory information
provided by the ODP thread type and the cpumask assignments for control and
worker threads. So in these examples, suppose the application has two
control threads and one or more workers.  For Platform A you might have
core 0 defined for control threads and core 1 for worker threads; in this
case threads 0 and 1 would run on core 0 while thread 2 ran on core 1. For
Platform B it's again up to the application how it wants to divide the 10
CPUs between control and worker. It may want to have 2 control CPUs so that
each control thread can have its own core, leaving 8 worker threads, or it
might have the control threads share a single CPU and have 9 worker threads
with their own cores.


>

>

> Install and run on different platform requires above platform specific code
> and recompilation for target.

>


No. As noted, the model is the same. The only difference is how many
control/worker threads the application chooses to create based on the
information it gets during initialization by odp_cpumask_all_available().


>

> *2, *Application developer writes code to specify:

>

> Threads (0, 2) would not scale out

> Threads (1) can scale out (up to a limit N?)

> Platform A has 3 cores available (as command line parameter?)

> Platform B has 10 cores available (as command line parameter?)

>

> Install and run on different platform may not requires re-compilation.

> ODP intelligently arrange the threads according to the information

> provided.

>


Applications determine the minimum number of threads they require. For most
applications they would tend to have a fixed number of control threads
(based on the application's functional design) and a variable number of
worker threads (minimum 1) based on available processing resources. These
application-defined minimums determine the minimum configuration the
application might need for optimal performance, with scale out to larger
configurations performed automatically.


>

> Last question: in some case like power save mode available cores shrink

> would ODP intelligently re-arrange the ODP threads dynamically in runtime?

>


The intent is that while control threads may have distinct roles and
responsibilities (thus requiring that all always be eligible to be
scheduled) worker threads are symmetric and interchangeable. So in this
case if I have N worker threads to match to the N available worker CPUs and
power save mode wants to reduce that number to N-1, then the only effect is
that the worker CPU entering power save mode goes dormant along with the
thread that is running on it. That thread isn't redistributed to some other
core because it's the same as the other worker threads. It is expected
that cores would only enter power save state at odp_schedule() boundaries.
So for example, if odp_schedule() determines that there is no work to
dispatch to this thread then that might trigger the associated CPU to enter
low power mode. When later that core wakes up odp_schedule() would continue
and then return work to its reactivated thread.

A slight wrinkle here is the concept of scheduler groups, which allows work
classes to be dispatched to different groups of worker threads.  In this
case the implementation might want to take scheduler group membership into
consideration in determining which cores to idle for power savings.
However, the ODP API itself is silent on this subject as it is
implementation dependent how power save modes are managed.


>

> Thanks and Best Regards, Yi

>


Thank you for these questions. In answering them I realized we do not (yet)
have this information covered in the ODP User Guide. I'll be using this
information to help fill in that gap.


>

> On 5 May 2016 at 18:50, Bill Fischofer <bill.fischofer@linaro.org> wrote:

>

>> I've added this to the agenda for Monday's call, however I suggest we

>> continue the dialog here as well as background.

>>

>> Regarding thread pinning, there's always been a tradeoff on that.  On the

>> one hand dedicating cores to threads is ideal for scale out in many core

>> systems, however ODP does not require many core environments to work

>> effectively, so ODP APIs enable but do not require or assume that cores are

>> dedicated to threads. That's really a question of application design and

>> fit to the particular platform it's running on. In embedded environments

>> you'll likely see this model more since the application knows which

>> platform it's being targeted for. In VNF environments, by contrast, you're

>> more likely to see a blend where applications will take advantage of

>> however many cores are available to it but will still run without dedicated

>> cores in environments with more modest resources.

>>

>> On Wed, May 4, 2016 at 9:45 PM, Yi He <yi.he@linaro.org> wrote:

>>

>>> Hi, thanks Mike and Bill,

>>>

>>> From your clear summarize can we put it into several TO-DO decisions:

>>> (we can have a discussion in next ARCH call):

>>>

>>>    1. How to addressing the precise semantics of the existing timing

>>>    APIs (odp_cpu_xxx) as they relate to processor locality.

>>>

>>>

>>>    - *A:* guarantee by adding constraint to ODP thread concept: every

>>>    ODP thread shall be deployed and pinned on one CPU core.

>>>       - A sub-question: my understanding is that application

>>>       programmers only need to specify available CPU sets for control/worker

>>>       threads, and it is ODP to arrange the threads onto each CPU core while

>>>       launching, right?

>>>    - *B*: guarantee by adding new APIs to disable/enable CPU migration.

>>>    - Then document clearly in user's guide or API document.

>>>

>>>

>>>    1. Understand the requirement to have both processor-local and

>>>    system-wide timing APIs:

>>>

>>>

>>>    - There are some APIs available in time.h (odp_time_local(), etc).

>>>    - We can have a thread to understand the relationship, usage

>>>    scenarios and constraints of APIs in time.h and cpu.h.

>>>

>>> Best Regards, Yi

>>>

>>> On 4 May 2016 at 23:32, Bill Fischofer <bill.fischofer@linaro.org>

>>> wrote:

>>>

>>>> I think there are two fallouts form this discussion.  First, there is

>>>> the question of the precise semantics of the existing timing APIs as they

>>>> relate to processor locality. Applications such as profiling tests, to the

>>>> extent that they APIs that have processor-local semantics, must ensure that

>>>> the thread(s) using these APIs are pinned for the duration of the

>>>> measurement.

>>>>

>>>> The other point is the one that Petri brought up about having other

>>>> APIs that provide timing information based on wall time or other metrics

>>>> that are not processor-local.  While these may not have the same

>>>> performance characteristics, they would be independent of thread migration

>>>> considerations.

>>>>

>>>> Of course all this depends on exactly what one is trying to measure.

>>>> Since thread migration is not free, allowing such activity may or may not

>>>> be relevant to what is being measured, so ODP probably wants to have both

>>>> processor-local and systemwide timing APIs.  We just need to be sure they

>>>> are specified precisely so that applications know how to use them properly.

>>>>

>>>> On Wed, May 4, 2016 at 10:23 AM, Mike Holmes <mike.holmes@linaro.org>

>>>> wrote:

>>>>

>>>>> It sounded like the arch call was leaning towards documenting that on

>>>>> odp-linux  the application must ensure that odp_threads are pinned to cores

>>>>> when launched.

>>>>> This is a restriction that some platforms may not need to make, vs the

>>>>> idea that a piece of ODP code can use these APIs to ensure the behavior it

>>>>> needs without knowledge or reliance on the wider system.

>>>>>

>>>>> On 4 May 2016 at 01:45, Yi He <yi.he@linaro.org> wrote:

>>>>>

>>>>>> Establish a performance profiling environment guarantees meaningful

>>>>>> and consistency of consecutive invocations of the odp_cpu_xxx() APIs.

>>>>>> While after profiling was done restore the execution environment to

>>>>>> its multi-core optimized state.

>>>>>>

>>>>>> Signed-off-by: Yi He <yi.he@linaro.org>

>>>>>> ---

>>>>>>  include/odp/api/spec/cpu.h | 31 +++++++++++++++++++++++++++++++

>>>>>>  1 file changed, 31 insertions(+)

>>>>>>

>>>>>> diff --git a/include/odp/api/spec/cpu.h b/include/odp/api/spec/cpu.h

>>>>>> index 2789511..0bc9327 100644

>>>>>> --- a/include/odp/api/spec/cpu.h

>>>>>> +++ b/include/odp/api/spec/cpu.h

>>>>>> @@ -27,6 +27,21 @@ extern "C" {

>>>>>>

>>>>>>

>>>>>>  /**

>>>>>> + * @typedef odp_profiler_t

>>>>>> + * ODP performance profiler handle

>>>>>> + */

>>>>>> +

>>>>>> +/**

>>>>>> + * Setup a performance profiling environment

>>>>>> + *

>>>>>> + * A performance profiling environment guarantees meaningful and

>>>>>> consistency of

>>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs.

>>>>>> + *

>>>>>> + * @return performance profiler handle

>>>>>> + */

>>>>>> +odp_profiler_t odp_profiler_start(void);

>>>>>> +

>>>>>> +/**

>>>>>>   * CPU identifier

>>>>>>   *

>>>>>>   * Determine CPU identifier on which the calling is running. CPU

>>>>>> numbering is

>>>>>> @@ -170,6 +185,22 @@ uint64_t odp_cpu_cycles_resolution(void);

>>>>>>  void odp_cpu_pause(void);

>>>>>>

>>>>>>  /**

>>>>>> + * Stop the performance profiling environment

>>>>>> + *

>>>>>> + * Stop performance profiling and restore the execution environment

>>>>>> to its

>>>>>> + * multi-core optimized state, won't preserve meaningful and

>>>>>> consistency of

>>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs anymore.

>>>>>> + *

>>>>>> + * @param profiler  performance profiler handle

>>>>>> + *

>>>>>> + * @retval 0 on success

>>>>>> + * @retval <0 on failure

>>>>>> + *

>>>>>> + * @see odp_profiler_start()

>>>>>> + */

>>>>>> +int odp_profiler_stop(odp_profiler_t profiler);

>>>>>> +

>>>>>> +/**

>>>>>>   * @}

>>>>>>   */

>>>>>>

>>>>>> --

>>>>>> 1.9.1

>>>>>>

>>>>>> _______________________________________________

>>>>>> lng-odp mailing list

>>>>>> lng-odp@lists.linaro.org

>>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp

>>>>>>

>>>>>

>>>>>

>>>>>

>>>>> --

>>>>> Mike Holmes

>>>>> Technical Manager - Linaro Networking Group

>>>>> Linaro.org <http://www.linaro.org/> *│ *Open source software for ARM

>>>>> SoCs

>>>>> "Work should be fun and collaborative, the rest follows"

>>>>>

>>>>>

>>>>>

>>>>> _______________________________________________

>>>>> lng-odp mailing list

>>>>> lng-odp@lists.linaro.org

>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp

>>>>>

>>>>>

>>>>

>>>

>>

>
Yi He May 9, 2016, 6:50 a.m. UTC | #7
Hi, Bill

Thanks very much for your detailed explanation. I understand the
programming practice to be like this:

/* First, the developer gets a chance to specify core availability for
 * the application instance.
 */
odp_init_global(... odp_init_t *param->worker_cpus & ->control_cpus ... )

*So it is possible to run an application with a different core-availability
spec on a different platform, **and possible to run multiple application
instances on one platform in isolation.*

*A: Making the above command-line parameters can help keep the application
binary portable; running it on platform A or B requires no re-compilation,
only different invocation parameters.*

/* The application developer fans out worker/control threads depending on
 * the needs and the actual availability.
 */
actually_available_cores =
    odp_cpumask_default_worker(&cores, needs_to_fanout_N_workers);

iterator( actually_available_cores ) {

    /* Fan out one worker thread instance */
    odph_linux_pthread_create(...upon one available core...);
}

*B: Is odph_linux_pthread_create() a temporary helper API that will
converge into a platform-independent odp_thread_create(..one core spec...)
in future? Or is it deliberately left as a platform-dependent helper API?*

Based on the above understanding, and back to the ODP-427 problem: it seems
only the main thread (program entry) was accidentally not pinned to one
core :). The main thread is also an ODP_THREAD_CONTROL, but was not
instantiated through odph_linux_pthread_create().

A solution can be: in the odp_init_global() API, after
odp_cpumask_init_global(), pin the main thread to the first available
control core (a sketch follows below). This adds a new behavioural
specification to the API, but seems natural. Actually Ivan's patch did most
of this, except that the core was fixed to 0. We can discuss in today's
meeting.
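
A minimal sketch of that proposal for odp-linux (hypothetical; not taken
from Ivan's patch or any existing API), pinning the main thread to the
first control core instead of a hard-coded core 0:

#define _GNU_SOURCE
#include <sched.h>
#include <odp_api.h>

/* Pin the calling (main) thread to the first CPU reserved for control
 * threads; intended to run inside odp_init_global() after the global
 * cpumasks have been set up. */
static int pin_main_to_first_control_cpu(void)
{
	odp_cpumask_t control;
	cpu_set_t set;
	int cpu;

	odp_cpumask_default_control(&control, 1);
	cpu = odp_cpumask_first(&control);
	if (cpu < 0)
		return -1;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);

	return sched_setaffinity(0, sizeof(set), &set);  /* 0 = calling thread */
}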

Thanks and Best Regards, Yi

On 6 May 2016 at 22:23, Bill Fischofer <bill.fischofer@linaro.org> wrote:

> These are all good questions. ODP divides threads into worker threads and

> control threads. The distinction is that worker threads are supposed to be

> performance sensitive and perform optimally with dedicated cores while

> control threads perform more "housekeeping" functions and would be less

> impacted by sharing cores.

>

> In the absence of explicit API calls, it is unspecified how an ODP

> implementation assigns threads to cores. The distinction between worker and

> control thread is a hint to the underlying implementation that should be

> used in managing available processor resources.

>

> The APIs in cpumask.h enable applications to determine how many CPUs are

> available to it and how to divide them among worker and control threads

> (odp_cpumask_default_worker() and odp_cpumask_default_control()).  Note

> that ODP does not provide APIs for setting specific threads to specific

> CPUs, so keep that in mind in the answers below.

>

>

> On Thu, May 5, 2016 at 7:59 AM, Yi He <yi.he@linaro.org> wrote:

>

>> Hi, thanks Bill

>>

>> I understand more deeply of ODP thread concept and in embedded app

>> developers are involved in target platform tuning/optimization.

>>

>> Can I have a little example: say we have a data-plane app which includes

>> 3 ODP threads. And would like to install and run it upon 2 platforms.

>>

>>    - Platform A: 2 cores.

>>    - Platform B: 10 cores

>>

>> During initialization, the application can use

> odp_cpumask_all_available() to determine how many CPUs are available and

> can (optionally) use odp_cpumask_default_worker() and

> odp_cpumask_default_control() to divide them into CPUs that should be used

> for worker and control threads, respectively. For an application designed

> for scale-out, the number of available CPUs would typically be used to

> control how many worker threads the application creates. If the number of

> worker threads matches the number of worker CPUs then the ODP

> implementation would be expected to dedicate a worker core to each worker

> thread. If more threads are created than there are corresponding cores,

> then it is up to each implementation as to how it multiplexes them among

> the available cores in a fair manner.

>

>

>> Question, which one of the below assumptions is the current ODP

>> programming model?

>>

>> *1, *Application developer writes target platform specific code to tell

>> that:

>>

>> On platform A run threads (0) on core (0), and threads (1,2) on core (1).

>> On platform B run threads (0) on core (0), and threads (1) can scale out

>> and duplicate 8 instances on core (1~8), and thread (2) on core (9).

>>

>

> As noted, ODP does not provide APIs that permit specific threads to be

> assigned to specific cores. Instead it is up to each ODP implementation as

> to how it maps ODP threads to available CPUs, subject to the advisory

> information provided by the ODP thread type and the cpumask assignments for

> control and worker threads. So in these examples suppose what the

> application has is two control threads and one or more workers.  For

> Platform A you might have core 0 defined for control threads and Core 1 for

> worker threads. In this case threads 0 and 1 would run on Core 0 while

> thread 2 ran on Core 1. For Platform B it's again up to the application how

> it wants to divide the 10 CPUs between control and worker. It may want to

> have 2 control CPUs so that each control thread can have its own core,

> leaving 8 worker threads, or it might have the control threads share a

> single CPU and have 9 worker threads with their own cores.

>

>

>>

>>

> Install and run on different platform requires above platform specific

>> code and recompilation for target.

>>

>

> No. As noted, the model is the same. The only difference is how many

> control/worker threads the application chooses to create based on the

> information it gets during initialization by odp_cpumask_all_available().

>

>

>>

>> *2, *Application developer writes code to specify:

>>

>> Threads (0, 2) would not scale out

>> Threads (1) can scale out (up to a limit N?)

>> Platform A has 3 cores available (as command line parameter?)

>> Platform B has 10 cores available (as command line parameter?)

>>

>> Install and run on different platform may not requires re-compilation.

>> ODP intelligently arrange the threads according to the information

>> provided.

>>

>

> Applications determine the minimum number of threads they require. For

> most applications they would tend to have a fixed number of control threads

> (based on the application's functional design) and a variable number of

> worker threads (minimum 1) based on available processing resources. These

> application-defined minimums determine the minimum configuration the

> application might need for optimal performance, with scale out to larger

> configurations performed automatically.

>

>

>>

>> Last question: in some case like power save mode available cores shrink

>> would ODP intelligently re-arrange the ODP threads dynamically in runtime?

>>

>

> The intent is that while control threads may have distinct roles and

> responsibilities (thus requiring that all always be eligible to be

> scheduled) worker threads are symmetric and interchangeable. So in this

> case if I have N worker threads to match to the N available worker CPUs and

> power save mode wants to reduce that number to N-1, then the only effect is

> that the worker CPU entering power save mode goes dormant along with the

> thread that is running on it. That thread isn't redistributed to some other

> core because it's the same as the other worker threads.  Its is expected

> that cores would only enter power save state at odp_schedule() boundaries.

> So for example, if odp_schedule() determines that there is no work to

> dispatch to this thread then that might trigger the associated CPU to enter

> low power mode. When later that core wakes up odp_schedule() would continue

> and then return work to its reactivated thread.

>

> A slight wrinkle here is the concept of scheduler groups, which allows

> work classes to be dispatched to different groups of worker threads.  In

> this case the implementation might want to take scheduler group membership

> into consideration in determining which cores to idle for power savings.

> However, the ODP API itself is silent on this subject as it is

> implementation dependent how power save modes are managed.

>

>

>>

>> Thanks and Best Regards, Yi

>>

>

> Thank you for these questions. I answering them I realized we do not (yet)

> have this information covered in the ODP User Guide. I'll be using this

> information to help fill in that gap.

>

>

>>

>> On 5 May 2016 at 18:50, Bill Fischofer <bill.fischofer@linaro.org> wrote:

>>

>>> I've added this to the agenda for Monday's call, however I suggest we

>>> continue the dialog here as well as background.

>>>

>>> Regarding thread pinning, there's always been a tradeoff on that.  On

>>> the one hand dedicating cores to threads is ideal for scale out in many

>>> core systems, however ODP does not require many core environments to work

>>> effectively, so ODP APIs enable but do not require or assume that cores are

>>> dedicated to threads. That's really a question of application design and

>>> fit to the particular platform it's running on. In embedded environments

>>> you'll likely see this model more since the application knows which

>>> platform it's being targeted for. In VNF environments, by contrast, you're

>>> more likely to see a blend where applications will take advantage of

>>> however many cores are available to it but will still run without dedicated

>>> cores in environments with more modest resources.

>>>

>>> On Wed, May 4, 2016 at 9:45 PM, Yi He <yi.he@linaro.org> wrote:

>>>

>>>> Hi, thanks Mike and Bill,

>>>>

>>>> From your clear summarize can we put it into several TO-DO decisions:

>>>> (we can have a discussion in next ARCH call):

>>>>

>>>>    1. How to addressing the precise semantics of the existing timing

>>>>    APIs (odp_cpu_xxx) as they relate to processor locality.

>>>>

>>>>

>>>>    - *A:* guarantee by adding constraint to ODP thread concept: every

>>>>    ODP thread shall be deployed and pinned on one CPU core.

>>>>       - A sub-question: my understanding is that application

>>>>       programmers only need to specify available CPU sets for control/worker

>>>>       threads, and it is ODP to arrange the threads onto each CPU core while

>>>>       launching, right?

>>>>    - *B*: guarantee by adding new APIs to disable/enable CPU migration.

>>>>    - Then document clearly in user's guide or API document.

>>>>

>>>>

>>>>    1. Understand the requirement to have both processor-local and

>>>>    system-wide timing APIs:

>>>>

>>>>

>>>>    - There are some APIs available in time.h (odp_time_local(), etc).

>>>>    - We can have a thread to understand the relationship, usage

>>>>    scenarios and constraints of APIs in time.h and cpu.h.

>>>>

>>>> Best Regards, Yi

>>>>

>>>> On 4 May 2016 at 23:32, Bill Fischofer <bill.fischofer@linaro.org>

>>>> wrote:

>>>>

>>>>> I think there are two fallouts form this discussion.  First, there is

>>>>> the question of the precise semantics of the existing timing APIs as they

>>>>> relate to processor locality. Applications such as profiling tests, to the

>>>>> extent that they APIs that have processor-local semantics, must ensure that

>>>>> the thread(s) using these APIs are pinned for the duration of the

>>>>> measurement.

>>>>>

>>>>> The other point is the one that Petri brought up about having other

>>>>> APIs that provide timing information based on wall time or other metrics

>>>>> that are not processor-local.  While these may not have the same

>>>>> performance characteristics, they would be independent of thread migration

>>>>> considerations.

>>>>>

>>>>> Of course all this depends on exactly what one is trying to measure.

>>>>> Since thread migration is not free, allowing such activity may or may not

>>>>> be relevant to what is being measured, so ODP probably wants to have both

>>>>> processor-local and systemwide timing APIs.  We just need to be sure they

>>>>> are specified precisely so that applications know how to use them properly.

>>>>>

>>>>> On Wed, May 4, 2016 at 10:23 AM, Mike Holmes <mike.holmes@linaro.org>

>>>>> wrote:

>>>>>

>>>>>> It sounded like the arch call was leaning towards documenting that on

>>>>>> odp-linux  the application must ensure that odp_threads are pinned to cores

>>>>>> when launched.

>>>>>> This is a restriction that some platforms may not need to make, vs

>>>>>> the idea that a piece of ODP code can use these APIs to ensure the behavior

>>>>>> it needs without knowledge or reliance on the wider system.

>>>>>>

>>>>>> On 4 May 2016 at 01:45, Yi He <yi.he@linaro.org> wrote:

>>>>>>

>>>>>>> Establish a performance profiling environment guarantees meaningful

>>>>>>> and consistency of consecutive invocations of the odp_cpu_xxx() APIs.

>>>>>>> While after profiling was done restore the execution environment to

>>>>>>> its multi-core optimized state.

>>>>>>>

>>>>>>> Signed-off-by: Yi He <yi.he@linaro.org>

>>>>>>> ---

>>>>>>>  include/odp/api/spec/cpu.h | 31 +++++++++++++++++++++++++++++++

>>>>>>>  1 file changed, 31 insertions(+)

>>>>>>>

>>>>>>> diff --git a/include/odp/api/spec/cpu.h b/include/odp/api/spec/cpu.h

>>>>>>> index 2789511..0bc9327 100644

>>>>>>> --- a/include/odp/api/spec/cpu.h

>>>>>>> +++ b/include/odp/api/spec/cpu.h

>>>>>>> @@ -27,6 +27,21 @@ extern "C" {

>>>>>>>

>>>>>>>

>>>>>>>  /**

>>>>>>> + * @typedef odp_profiler_t

>>>>>>> + * ODP performance profiler handle

>>>>>>> + */

>>>>>>> +

>>>>>>> +/**

>>>>>>> + * Setup a performance profiling environment

>>>>>>> + *

>>>>>>> + * A performance profiling environment guarantees meaningful and

>>>>>>> consistency of

>>>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs.

>>>>>>> + *

>>>>>>> + * @return performance profiler handle

>>>>>>> + */

>>>>>>> +odp_profiler_t odp_profiler_start(void);

>>>>>>> +

>>>>>>> +/**

>>>>>>>   * CPU identifier

>>>>>>>   *

>>>>>>>   * Determine CPU identifier on which the calling is running. CPU

>>>>>>> numbering is

>>>>>>> @@ -170,6 +185,22 @@ uint64_t odp_cpu_cycles_resolution(void);

>>>>>>>  void odp_cpu_pause(void);

>>>>>>>

>>>>>>>  /**

>>>>>>> + * Stop the performance profiling environment

>>>>>>> + *

>>>>>>> + * Stop performance profiling and restore the execution environment

>>>>>>> to its

>>>>>>> + * multi-core optimized state, won't preserve meaningful and

>>>>>>> consistency of

>>>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs anymore.

>>>>>>> + *

>>>>>>> + * @param profiler  performance profiler handle

>>>>>>> + *

>>>>>>> + * @retval 0 on success

>>>>>>> + * @retval <0 on failure

>>>>>>> + *

>>>>>>> + * @see odp_profiler_start()

>>>>>>> + */

>>>>>>> +int odp_profiler_stop(odp_profiler_t profiler);

>>>>>>> +

>>>>>>> +/**

>>>>>>>   * @}

>>>>>>>   */

>>>>>>>

>>>>>>> --

>>>>>>> 1.9.1

>>>>>>>

>>>>>>> _______________________________________________

>>>>>>> lng-odp mailing list

>>>>>>> lng-odp@lists.linaro.org

>>>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp

>>>>>>>

>>>>>>

>>>>>>

>>>>>>

>>>>>> --

>>>>>> Mike Holmes

>>>>>> Technical Manager - Linaro Networking Group

>>>>>> Linaro.org <http://www.linaro.org/> *│ *Open source software for ARM

>>>>>> SoCs

>>>>>> "Work should be fun and collaborative, the rest follows"

>>>>>>

>>>>>>

>>>>>>

>>>>>> _______________________________________________

>>>>>> lng-odp mailing list

>>>>>> lng-odp@lists.linaro.org

>>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp

>>>>>>

>>>>>>

>>>>>

>>>>

>>>

>>

>
Bill Fischofer May 9, 2016, 8:54 p.m. UTC | #8
On Mon, May 9, 2016 at 1:50 AM, Yi He <yi.he@linaro.org> wrote:

> Hi, Bill

>

> Thanks very much for your detailed explanation. I understand the

> programming practise like:

>

> /* Firstly developer got a chance to specify core availabilities to the

>  * application instance.

>  */

> odp_init_global(... odp_init_t *param->worker_cpus & ->control_cpus ... )

>

> *So It is possible to run an application with different core

> availabilities spec on different platform, **and possible to run multiple

> application instances on one platform in isolation.*

>

> *A: Make the above as command line parameters can help application binary

> portable, run it on platform A or B requires no re-compilation, but only

> invocation parameters change.*

>


The intent behind the ability to specify cpumasks at odp_init_global() time
is to allow a launcher script that is configured by some provisioning agent
(e.g., OpenDaylight) to communicate core assignments down to the ODP
implementation in a platform-independent manner.  So applications will fall
into two categories: those that have provisioned coremasks that simply get
passed through, and more "stand alone" applications that will use
odp_cpumask_all_available() and odp_cpumask_default_worker/control() as
noted earlier to size themselves dynamically to the available processing
resources.  In both cases there is no need to recompile the application but
rather to simply have it create an appropriate number of control/worker
threads as determined either by external configuration or inquiry.
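
To make the "stand alone" case concrete, a rough (untested) sketch of the
self-sizing logic could look like the following, where launch_worker() is
only a placeholder for whatever thread creation helper the platform
provides (e.g. odph_linux_pthread_create() in odp-linux):

/* Sketch only: size the worker pool from what the platform reports.
 * As in the earlier example in this thread, odp_cpumask_default_worker()
 * returns the number of CPUs actually placed in the mask. */
odp_cpumask_t workers, control;
int num_workers, num_control, i;

num_workers = odp_cpumask_default_worker(&workers, 0); /* 0: all available */
num_control = odp_cpumask_default_control(&control, 1);

for (i = 0; i < num_workers; i++)
        launch_worker(&workers);   /* placeholder for the platform helper */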


>

> /* Application developer fanout worker/control threads depends on

>  * the needs and actual availabilities.

>  */

> actually_available_cores =

>     odp_cpumask_default_worker(&cores, needs_to_fanout_N_workers);

>

> iterator( actually_available_cores ) {

>

>     /* Fanout one work thread instance */

>     odph_linux_pthread_create(...upon one available core...);

> }

>

> *B: Is odph_linux_pthread_create() a temporary helper API and will

> converge into platform-independent odp_thread_create(..one core spec...) in

> future? Or, is it deliberately left as platform dependant helper API?*

>

> Based on above understanding and back to ODP-427 problem, which seems only

> the main thread (program entrance) was accidentally not pinned on one core

> :), the main thread is also an ODP_THREAD_CONTROL, but was not instantiated

> through odph_linux_pthread_create().

>


ODP provides no APIs or helpers to control thread pinning. The only
controls ODP provides are the ability to know the number of available cores,
to partition them for use by worker and control threads, and the ability
(via helpers) to create a number of threads of the application's choosing.
The implementation is expected to schedule these threads to available cores
in a fair manner, so if the number of application threads is less than or
equal to the available number of cores then implementations SHOULD (but are
not required to) pin each thread to its own core. Applications SHOULD NOT
be designed to require or depend on any specific thread-to-core mapping, both
for portability and because what constitutes a "core" in a virtual
environment may or may not represent dedicated hardware.


>

> A solution can be: in odp_init_global() API, after

> odp_cpumask_init_global(), pin the main thread to the 1st available core

> for control thread. This adds new behavioural specification to this API,

> but seems natural. Actually Ivan's patch did most of this, except that the

> core was fixed to 0. we can discuss in today's meeting.

>


An application may consist of more than a single thread at the time it
calls odp_init_global(); however, it is RECOMMENDED that odp_init_global()
be called only from the application's initial thread and before it creates
any other threads to avoid the address space confusion that has been the
subject of the past couple of ARCH calls and that we are looking to achieve
consensus on. I'd like to move that discussion to a separate thread from
this one, if you don't mind.


>

> Thanks and Best Regards, Yi

>

> On 6 May 2016 at 22:23, Bill Fischofer <bill.fischofer@linaro.org> wrote:

>

>> These are all good questions. ODP divides threads into worker threads and

>> control threads. The distinction is that worker threads are supposed to be

>> performance sensitive and perform optimally with dedicated cores while

>> control threads perform more "housekeeping" functions and would be less

>> impacted by sharing cores.

>>

>> In the absence of explicit API calls, it is unspecified how an ODP

>> implementation assigns threads to cores. The distinction between worker and

>> control thread is a hint to the underlying implementation that should be

>> used in managing available processor resources.

>>

>> The APIs in cpumask.h enable applications to determine how many CPUs are

>> available to it and how to divide them among worker and control threads

>> (odp_cpumask_default_worker() and odp_cpumask_default_control()).  Note

>> that ODP does not provide APIs for setting specific threads to specific

>> CPUs, so keep that in mind in the answers below.

>>

>>

>> On Thu, May 5, 2016 at 7:59 AM, Yi He <yi.he@linaro.org> wrote:

>>

>>> Hi, thanks Bill

>>>

>>> I understand more deeply of ODP thread concept and in embedded app

>>> developers are involved in target platform tuning/optimization.

>>>

>>> Can I have a little example: say we have a data-plane app which includes

>>> 3 ODP threads. And would like to install and run it upon 2 platforms.

>>>

>>>    - Platform A: 2 cores.

>>>    - Platform B: 10 cores

>>>

>>> During initialization, the application can use

>> odp_cpumask_all_available() to determine how many CPUs are available and

>> can (optionally) use odp_cpumask_default_worker() and

>> odp_cpumask_default_control() to divide them into CPUs that should be used

>> for worker and control threads, respectively. For an application designed

>> for scale-out, the number of available CPUs would typically be used to

>> control how many worker threads the application creates. If the number of

>> worker threads matches the number of worker CPUs then the ODP

>> implementation would be expected to dedicate a worker core to each worker

>> thread. If more threads are created than there are corresponding cores,

>> then it is up to each implementation as to how it multiplexes them among

>> the available cores in a fair manner.

>>

>>

>>> Question, which one of the below assumptions is the current ODP

>>> programming model?

>>>

>>> *1, *Application developer writes target platform specific code to tell

>>> that:

>>>

>>> On platform A run threads (0) on core (0), and threads (1,2) on core (1).

>>> On platform B run threads (0) on core (0), and threads (1) can scale out

>>> and duplicate 8 instances on core (1~8), and thread (2) on core (9).

>>>

>>

>> As noted, ODP does not provide APIs that permit specific threads to be

>> assigned to specific cores. Instead it is up to each ODP implementation as

>> to how it maps ODP threads to available CPUs, subject to the advisory

>> information provided by the ODP thread type and the cpumask assignments for

>> control and worker threads. So in these examples suppose what the

>> application has is two control threads and one or more workers.  For

>> Platform A you might have core 0 defined for control threads and Core 1 for

>> worker threads. In this case threads 0 and 1 would run on Core 0 while

>> thread 2 ran on Core 1. For Platform B it's again up to the application how

>> it wants to divide the 10 CPUs between control and worker. It may want to

>> have 2 control CPUs so that each control thread can have its own core,

>> leaving 8 worker threads, or it might have the control threads share a

>> single CPU and have 9 worker threads with their own cores.

>>

>>

>>>

>>>

>> Install and run on different platform requires above platform specific

>>> code and recompilation for target.

>>>

>>

>> No. As noted, the model is the same. The only difference is how many

>> control/worker threads the application chooses to create based on the

>> information it gets during initialization by odp_cpumask_all_available().

>>

>>

>>>

>>> *2, *Application developer writes code to specify:

>>>

>>> Threads (0, 2) would not scale out

>>> Threads (1) can scale out (up to a limit N?)

>>> Platform A has 3 cores available (as command line parameter?)

>>> Platform B has 10 cores available (as command line parameter?)

>>>

>>> Install and run on different platform may not requires re-compilation.

>>> ODP intelligently arrange the threads according to the information

>>> provided.

>>>

>>

>> Applications determine the minimum number of threads they require. For

>> most applications they would tend to have a fixed number of control threads

>> (based on the application's functional design) and a variable number of

>> worker threads (minimum 1) based on available processing resources. These

>> application-defined minimums determine the minimum configuration the

>> application might need for optimal performance, with scale out to larger

>> configurations performed automatically.

>>

>>

>>>

>>> Last question: in some case like power save mode available cores shrink

>>> would ODP intelligently re-arrange the ODP threads dynamically in runtime?

>>>

>>

>> The intent is that while control threads may have distinct roles and

>> responsibilities (thus requiring that all always be eligible to be

>> scheduled) worker threads are symmetric and interchangeable. So in this

>> case if I have N worker threads to match to the N available worker CPUs and

>> power save mode wants to reduce that number to N-1, then the only effect is

>> that the worker CPU entering power save mode goes dormant along with the

>> thread that is running on it. That thread isn't redistributed to some other

>> core because it's the same as the other worker threads.  Its is expected

>> that cores would only enter power save state at odp_schedule() boundaries.

>> So for example, if odp_schedule() determines that there is no work to

>> dispatch to this thread then that might trigger the associated CPU to enter

>> low power mode. When later that core wakes up odp_schedule() would continue

>> and then return work to its reactivated thread.

>>

>> A slight wrinkle here is the concept of scheduler groups, which allows

>> work classes to be dispatched to different groups of worker threads.  In

>> this case the implementation might want to take scheduler group membership

>> into consideration in determining which cores to idle for power savings.

>> However, the ODP API itself is silent on this subject as it is

>> implementation dependent how power save modes are managed.

>>

>>

>>>

>>> Thanks and Best Regards, Yi

>>>

>>

>> Thank you for these questions. I answering them I realized we do not

>> (yet) have this information covered in the ODP User Guide. I'll be using

>> this information to help fill in that gap.

>>

>>

>>>

>>> On 5 May 2016 at 18:50, Bill Fischofer <bill.fischofer@linaro.org>

>>> wrote:

>>>

>>>> I've added this to the agenda for Monday's call, however I suggest we

>>>> continue the dialog here as well as background.

>>>>

>>>> Regarding thread pinning, there's always been a tradeoff on that.  On

>>>> the one hand dedicating cores to threads is ideal for scale out in many

>>>> core systems, however ODP does not require many core environments to work

>>>> effectively, so ODP APIs enable but do not require or assume that cores are

>>>> dedicated to threads. That's really a question of application design and

>>>> fit to the particular platform it's running on. In embedded environments

>>>> you'll likely see this model more since the application knows which

>>>> platform it's being targeted for. In VNF environments, by contrast, you're

>>>> more likely to see a blend where applications will take advantage of

>>>> however many cores are available to it but will still run without dedicated

>>>> cores in environments with more modest resources.

>>>>

>>>> On Wed, May 4, 2016 at 9:45 PM, Yi He <yi.he@linaro.org> wrote:

>>>>

>>>>> Hi, thanks Mike and Bill,

>>>>>

>>>>> From your clear summarize can we put it into several TO-DO decisions:

>>>>> (we can have a discussion in next ARCH call):

>>>>>

>>>>>    1. How to addressing the precise semantics of the existing timing

>>>>>    APIs (odp_cpu_xxx) as they relate to processor locality.

>>>>>

>>>>>

>>>>>    - *A:* guarantee by adding constraint to ODP thread concept: every

>>>>>    ODP thread shall be deployed and pinned on one CPU core.

>>>>>       - A sub-question: my understanding is that application

>>>>>       programmers only need to specify available CPU sets for control/worker

>>>>>       threads, and it is ODP to arrange the threads onto each CPU core while

>>>>>       launching, right?

>>>>>    - *B*: guarantee by adding new APIs to disable/enable CPU

>>>>>    migration.

>>>>>    - Then document clearly in user's guide or API document.

>>>>>

>>>>>

>>>>>    1. Understand the requirement to have both processor-local and

>>>>>    system-wide timing APIs:

>>>>>

>>>>>

>>>>>    - There are some APIs available in time.h (odp_time_local(), etc).

>>>>>    - We can have a thread to understand the relationship, usage

>>>>>    scenarios and constraints of APIs in time.h and cpu.h.

>>>>>

>>>>> Best Regards, Yi

>>>>>

>>>>> On 4 May 2016 at 23:32, Bill Fischofer <bill.fischofer@linaro.org>

>>>>> wrote:

>>>>>

>>>>>> I think there are two fallouts form this discussion.  First, there is

>>>>>> the question of the precise semantics of the existing timing APIs as they

>>>>>> relate to processor locality. Applications such as profiling tests, to the

>>>>>> extent that they APIs that have processor-local semantics, must ensure that

>>>>>> the thread(s) using these APIs are pinned for the duration of the

>>>>>> measurement.

>>>>>>

>>>>>> The other point is the one that Petri brought up about having other

>>>>>> APIs that provide timing information based on wall time or other metrics

>>>>>> that are not processor-local.  While these may not have the same

>>>>>> performance characteristics, they would be independent of thread migration

>>>>>> considerations.

>>>>>>

>>>>>> Of course all this depends on exactly what one is trying to measure.

>>>>>> Since thread migration is not free, allowing such activity may or may not

>>>>>> be relevant to what is being measured, so ODP probably wants to have both

>>>>>> processor-local and systemwide timing APIs.  We just need to be sure they

>>>>>> are specified precisely so that applications know how to use them properly.

>>>>>>

>>>>>> On Wed, May 4, 2016 at 10:23 AM, Mike Holmes <mike.holmes@linaro.org>

>>>>>> wrote:

>>>>>>

>>>>>>> It sounded like the arch call was leaning towards documenting that

>>>>>>> on odp-linux  the application must ensure that odp_threads are pinned to

>>>>>>> cores when launched.

>>>>>>> This is a restriction that some platforms may not need to make, vs

>>>>>>> the idea that a piece of ODP code can use these APIs to ensure the behavior

>>>>>>> it needs without knowledge or reliance on the wider system.

>>>>>>>

>>>>>>> On 4 May 2016 at 01:45, Yi He <yi.he@linaro.org> wrote:

>>>>>>>

>>>>>>>> Establish a performance profiling environment guarantees meaningful

>>>>>>>> and consistency of consecutive invocations of the odp_cpu_xxx()

>>>>>>>> APIs.

>>>>>>>> While after profiling was done restore the execution environment to

>>>>>>>> its multi-core optimized state.

>>>>>>>>

>>>>>>>> Signed-off-by: Yi He <yi.he@linaro.org>

>>>>>>>> ---

>>>>>>>>  include/odp/api/spec/cpu.h | 31 +++++++++++++++++++++++++++++++

>>>>>>>>  1 file changed, 31 insertions(+)

>>>>>>>>

>>>>>>>> diff --git a/include/odp/api/spec/cpu.h b/include/odp/api/spec/cpu.h

>>>>>>>> index 2789511..0bc9327 100644

>>>>>>>> --- a/include/odp/api/spec/cpu.h

>>>>>>>> +++ b/include/odp/api/spec/cpu.h

>>>>>>>> @@ -27,6 +27,21 @@ extern "C" {

>>>>>>>>

>>>>>>>>

>>>>>>>>  /**

>>>>>>>> + * @typedef odp_profiler_t

>>>>>>>> + * ODP performance profiler handle

>>>>>>>> + */

>>>>>>>> +

>>>>>>>> +/**

>>>>>>>> + * Setup a performance profiling environment

>>>>>>>> + *

>>>>>>>> + * A performance profiling environment guarantees meaningful and

>>>>>>>> consistency of

>>>>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs.

>>>>>>>> + *

>>>>>>>> + * @return performance profiler handle

>>>>>>>> + */

>>>>>>>> +odp_profiler_t odp_profiler_start(void);

>>>>>>>> +

>>>>>>>> +/**

>>>>>>>>   * CPU identifier

>>>>>>>>   *

>>>>>>>>   * Determine CPU identifier on which the calling is running. CPU

>>>>>>>> numbering is

>>>>>>>> @@ -170,6 +185,22 @@ uint64_t odp_cpu_cycles_resolution(void);

>>>>>>>>  void odp_cpu_pause(void);

>>>>>>>>

>>>>>>>>  /**

>>>>>>>> + * Stop the performance profiling environment

>>>>>>>> + *

>>>>>>>> + * Stop performance profiling and restore the execution

>>>>>>>> environment to its

>>>>>>>> + * multi-core optimized state, won't preserve meaningful and

>>>>>>>> consistency of

>>>>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs anymore.

>>>>>>>> + *

>>>>>>>> + * @param profiler  performance profiler handle

>>>>>>>> + *

>>>>>>>> + * @retval 0 on success

>>>>>>>> + * @retval <0 on failure

>>>>>>>> + *

>>>>>>>> + * @see odp_profiler_start()

>>>>>>>> + */

>>>>>>>> +int odp_profiler_stop(odp_profiler_t profiler);

>>>>>>>> +

>>>>>>>> +/**

>>>>>>>>   * @}

>>>>>>>>   */

>>>>>>>>

>>>>>>>> --

>>>>>>>> 1.9.1

>>>>>>>>

>>>>>>>> _______________________________________________

>>>>>>>> lng-odp mailing list

>>>>>>>> lng-odp@lists.linaro.org

>>>>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp

>>>>>>>>

>>>>>>>

>>>>>>>

>>>>>>>

>>>>>>> --

>>>>>>> Mike Holmes

>>>>>>> Technical Manager - Linaro Networking Group

>>>>>>> Linaro.org <http://www.linaro.org/> *│ *Open source software for

>>>>>>> ARM SoCs

>>>>>>> "Work should be fun and collaborative, the rest follows"

>>>>>>>

>>>>>>>

>>>>>>>

>>>>>>> _______________________________________________

>>>>>>> lng-odp mailing list

>>>>>>> lng-odp@lists.linaro.org

>>>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp

>>>>>>>

>>>>>>>

>>>>>>

>>>>>

>>>>

>>>

>>

>
Yi He May 10, 2016, 1:04 p.m. UTC | #9
Hi, Petri

While we can continue the processor-related discussions in Bill's new
comprehensive email thread, can we make a decision in tomorrow's ARCH
meeting between the two choices below for ODP-427 (how to guarantee the
processor locality of the odp_cpu_xxx() APIs)?

*Choice one: *add a constraint to the ODP thread concept: every ODP thread
will be pinned to one CPU core. In this case only the main thread was
accidentally left unpinned :) It is an ODP_THREAD_CONTROL, but is not
instantiated through odph_linux_pthread_create().

The solution can be: in the odp_init_global() API, after
odp_cpumask_init_global(), pin the main thread to the 1st available core
for control threads.
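
Roughly, the change could look like this inside odp-linux (illustrative
only; the pthread calls are the Linux-specific part and need _GNU_SOURCE,
<pthread.h> and <sched.h>; error handling is omitted):

/* Inside odp_init_global(), after odp_cpumask_init_global():
 * pin the calling (main) thread to the first control-thread CPU. */
odp_cpumask_t control;
cpu_set_t cpuset;
int cpu;

odp_cpumask_default_control(&control, 1);
cpu = odp_cpumask_first(&control);

CPU_ZERO(&cpuset);
CPU_SET(cpu, &cpuset);
pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);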

*Choice two: *if ODP threads are allowed to migrate between CPU cores, new
APIs are required to enable/disable CPU migration on the fly (as this patch
suggests).
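
As a usage sketch, with the APIs proposed in this patch a measurement would
then be bracketed like below (run_measured_code() is only a placeholder for
the code under test):

odp_profiler_t prof;
uint64_t c1, c2, cycles;

prof = odp_profiler_start();   /* locality guaranteed from here on */
c1 = odp_cpu_cycles();
run_measured_code();
c2 = odp_cpu_cycles();
odp_profiler_stop(prof);       /* back to the multi-core optimized state */

cycles = odp_cpu_cycles_diff(c2, c1);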

Let's talk tomorrow. Thanks and Best Regards, Yi

On 10 May 2016 at 04:54, Bill Fischofer <bill.fischofer@linaro.org> wrote:

>

>

> On Mon, May 9, 2016 at 1:50 AM, Yi He <yi.he@linaro.org> wrote:

>

>> Hi, Bill

>>

>> Thanks very much for your detailed explanation. I understand the

>> programming practise like:

>>

>> /* Firstly developer got a chance to specify core availabilities to the

>>  * application instance.

>>  */

>> odp_init_global(... odp_init_t *param->worker_cpus & ->control_cpus ... )

>>

>> *So It is possible to run an application with different core

>> availabilities spec on different platform, **and possible to run

>> multiple application instances on one platform in isolation.*

>>

>> *A: Make the above as command line parameters can help application binary

>> portable, run it on platform A or B requires no re-compilation, but only

>> invocation parameters change.*

>>

>

> The intent behind the ability to specify cpumasks at odp_init_global()

> time is to allow a launcher script that is configured by some provisioning

> agent (e.g., OpenDaylight) to communicate core assignments down to the ODP

> implementation in a platform-independent manner.  So applications will fall

> into two categories, those that have provisioned coremasks that simply get

> passed through and more "stand alone" applications that will us

> odp_cpumask_all_available() and odp_cpumask_default_worker/control() as

> noted earlier to size themselves dynamically to the available processing

> resources.  In both cases there is no need to recompile the application but

> rather to simply have it create an appropriate number of control/worker

> threads as determined either by external configuration or inquiry.

>

>

>>

>> /* Application developer fanout worker/control threads depends on

>>  * the needs and actual availabilities.

>>  */

>> actually_available_cores =

>>     odp_cpumask_default_worker(&cores, needs_to_fanout_N_workers);

>>

>> iterator( actually_available_cores ) {

>>

>>     /* Fanout one work thread instance */

>>     odph_linux_pthread_create(...upon one available core...);

>> }

>>

>> *B: Is odph_linux_pthread_create() a temporary helper API and will

>> converge into platform-independent odp_thread_create(..one core spec...) in

>> future? Or, is it deliberately left as platform dependant helper API?*

>>

>> Based on above understanding and back to ODP-427 problem, which seems

>> only the main thread (program entrance) was accidentally not pinned on one

>> core :), the main thread is also an ODP_THREAD_CONTROL, but was not

>> instantiated through odph_linux_pthread_create().

>>

>

> ODP provides no APIs or helpers to control thread pinning. The only

> controls ODP provides is the ability to know the number of available cores,

> to partition them for use by worker and control threads, and the ability

> (via helpers) to create a number of threads of the application's choosing.

> The implementation is expected to schedule these threads to available cores

> in a fair manner, so if the number of application threads is less than or

> equal to the available number of cores then implementations SHOULD (but are

> not required to) pin each thread to its own core. Applications SHOULD NOT

> be designed to require or depend on any specify thread-to-core mapping both

> for portability as well as because what constitutes a "core" in a virtual

> environment may or may not represent dedicated hardware.

>

>

>>

>> A solution can be: in odp_init_global() API, after

>> odp_cpumask_init_global(), pin the main thread to the 1st available core

>> for control thread. This adds new behavioural specification to this API,

>> but seems natural. Actually Ivan's patch did most of this, except that the

>> core was fixed to 0. we can discuss in today's meeting.

>>

>

> An application may consist of more than a single thread at the time it

> calls odp_init_global(), however it is RECOMMENDED that odp_init_global()

> be called only from the application's initial thread and before it creates

> any other threads to avoid the address space confusion that has been the

> subject of the past couple of ARCH calls and that we are looking to achieve

> consensus on. I'd like to move that discussion to a separate discussion

> thread from this one, if you don't mind.

>

>

>>

>> Thanks and Best Regards, Yi

>>

>> On 6 May 2016 at 22:23, Bill Fischofer <bill.fischofer@linaro.org> wrote:

>>

>>> These are all good questions. ODP divides threads into worker threads

>>> and control threads. The distinction is that worker threads are supposed to

>>> be performance sensitive and perform optimally with dedicated cores while

>>> control threads perform more "housekeeping" functions and would be less

>>> impacted by sharing cores.

>>>

>>> In the absence of explicit API calls, it is unspecified how an ODP

>>> implementation assigns threads to cores. The distinction between worker and

>>> control thread is a hint to the underlying implementation that should be

>>> used in managing available processor resources.

>>>

>>> The APIs in cpumask.h enable applications to determine how many CPUs are

>>> available to it and how to divide them among worker and control threads

>>> (odp_cpumask_default_worker() and odp_cpumask_default_control()).  Note

>>> that ODP does not provide APIs for setting specific threads to specific

>>> CPUs, so keep that in mind in the answers below.

>>>

>>>

>>> On Thu, May 5, 2016 at 7:59 AM, Yi He <yi.he@linaro.org> wrote:

>>>

>>>> Hi, thanks Bill

>>>>

>>>> I understand more deeply of ODP thread concept and in embedded app

>>>> developers are involved in target platform tuning/optimization.

>>>>

>>>> Can I have a little example: say we have a data-plane app which

>>>> includes 3 ODP threads. And would like to install and run it upon 2

>>>> platforms.

>>>>

>>>>    - Platform A: 2 cores.

>>>>    - Platform B: 10 cores

>>>>

>>>> During initialization, the application can use

>>> odp_cpumask_all_available() to determine how many CPUs are available and

>>> can (optionally) use odp_cpumask_default_worker() and

>>> odp_cpumask_default_control() to divide them into CPUs that should be used

>>> for worker and control threads, respectively. For an application designed

>>> for scale-out, the number of available CPUs would typically be used to

>>> control how many worker threads the application creates. If the number of

>>> worker threads matches the number of worker CPUs then the ODP

>>> implementation would be expected to dedicate a worker core to each worker

>>> thread. If more threads are created than there are corresponding cores,

>>> then it is up to each implementation as to how it multiplexes them among

>>> the available cores in a fair manner.

>>>

>>>

>>>> Question, which one of the below assumptions is the current ODP

>>>> programming model?

>>>>

>>>> *1, *Application developer writes target platform specific code to

>>>> tell that:

>>>>

>>>> On platform A run threads (0) on core (0), and threads (1,2) on core

>>>> (1).

>>>> On platform B run threads (0) on core (0), and threads (1) can scale

>>>> out and duplicate 8 instances on core (1~8), and thread (2) on core (9).

>>>>

>>>

>>> As noted, ODP does not provide APIs that permit specific threads to be

>>> assigned to specific cores. Instead it is up to each ODP implementation as

>>> to how it maps ODP threads to available CPUs, subject to the advisory

>>> information provided by the ODP thread type and the cpumask assignments for

>>> control and worker threads. So in these examples suppose what the

>>> application has is two control threads and one or more workers.  For

>>> Platform A you might have core 0 defined for control threads and Core 1 for

>>> worker threads. In this case threads 0 and 1 would run on Core 0 while

>>> thread 2 ran on Core 1. For Platform B it's again up to the application how

>>> it wants to divide the 10 CPUs between control and worker. It may want to

>>> have 2 control CPUs so that each control thread can have its own core,

>>> leaving 8 worker threads, or it might have the control threads share a

>>> single CPU and have 9 worker threads with their own cores.

>>>

>>>

>>>>

>>>>

>>> Install and run on different platform requires above platform specific

>>>> code and recompilation for target.

>>>>

>>>

>>> No. As noted, the model is the same. The only difference is how many

>>> control/worker threads the application chooses to create based on the

>>> information it gets during initialization by odp_cpumask_all_available().

>>>

>>>

>>>>

>>>> *2, *Application developer writes code to specify:

>>>>

>>>> Threads (0, 2) would not scale out

>>>> Threads (1) can scale out (up to a limit N?)

>>>> Platform A has 3 cores available (as command line parameter?)

>>>> Platform B has 10 cores available (as command line parameter?)

>>>>

>>>> Install and run on different platform may not requires re-compilation.

>>>> ODP intelligently arrange the threads according to the information

>>>> provided.

>>>>

>>>

>>> Applications determine the minimum number of threads they require. For

>>> most applications they would tend to have a fixed number of control threads

>>> (based on the application's functional design) and a variable number of

>>> worker threads (minimum 1) based on available processing resources. These

>>> application-defined minimums determine the minimum configuration the

>>> application might need for optimal performance, with scale out to larger

>>> configurations performed automatically.

>>>

>>>

>>>>

>>>> Last question: in some case like power save mode available cores shrink

>>>> would ODP intelligently re-arrange the ODP threads dynamically in runtime?

>>>>

>>>

>>> The intent is that while control threads may have distinct roles and

>>> responsibilities (thus requiring that all always be eligible to be

>>> scheduled) worker threads are symmetric and interchangeable. So in this

>>> case if I have N worker threads to match to the N available worker CPUs and

>>> power save mode wants to reduce that number to N-1, then the only effect is

>>> that the worker CPU entering power save mode goes dormant along with the

>>> thread that is running on it. That thread isn't redistributed to some other

>>> core because it's the same as the other worker threads.  Its is expected

>>> that cores would only enter power save state at odp_schedule() boundaries.

>>> So for example, if odp_schedule() determines that there is no work to

>>> dispatch to this thread then that might trigger the associated CPU to enter

>>> low power mode. When later that core wakes up odp_schedule() would continue

>>> and then return work to its reactivated thread.

>>>

>>> A slight wrinkle here is the concept of scheduler groups, which allows

>>> work classes to be dispatched to different groups of worker threads.  In

>>> this case the implementation might want to take scheduler group membership

>>> into consideration in determining which cores to idle for power savings.

>>> However, the ODP API itself is silent on this subject as it is

>>> implementation dependent how power save modes are managed.

>>>

>>>

>>>>

>>>> Thanks and Best Regards, Yi

>>>>

>>>

>>> Thank you for these questions. I answering them I realized we do not

>>> (yet) have this information covered in the ODP User Guide. I'll be using

>>> this information to help fill in that gap.

>>>

>>>

>>>>

>>>> On 5 May 2016 at 18:50, Bill Fischofer <bill.fischofer@linaro.org>

>>>> wrote:

>>>>

>>>>> I've added this to the agenda for Monday's call, however I suggest we

>>>>> continue the dialog here as well as background.

>>>>>

>>>>> Regarding thread pinning, there's always been a tradeoff on that.  On

>>>>> the one hand dedicating cores to threads is ideal for scale out in many

>>>>> core systems, however ODP does not require many core environments to work

>>>>> effectively, so ODP APIs enable but do not require or assume that cores are

>>>>> dedicated to threads. That's really a question of application design and

>>>>> fit to the particular platform it's running on. In embedded environments

>>>>> you'll likely see this model more since the application knows which

>>>>> platform it's being targeted for. In VNF environments, by contrast, you're

>>>>> more likely to see a blend where applications will take advantage of

>>>>> however many cores are available to it but will still run without dedicated

>>>>> cores in environments with more modest resources.

>>>>>

>>>>> On Wed, May 4, 2016 at 9:45 PM, Yi He <yi.he@linaro.org> wrote:

>>>>>

>>>>>> Hi, thanks Mike and Bill,

>>>>>>

>>>>>> From your clear summarize can we put it into several TO-DO decisions:

>>>>>> (we can have a discussion in next ARCH call):

>>>>>>

>>>>>>    1. How to addressing the precise semantics of the existing timing

>>>>>>    APIs (odp_cpu_xxx) as they relate to processor locality.

>>>>>>

>>>>>>

>>>>>>    - *A:* guarantee by adding constraint to ODP thread concept:

>>>>>>    every ODP thread shall be deployed and pinned on one CPU core.

>>>>>>       - A sub-question: my understanding is that application

>>>>>>       programmers only need to specify available CPU sets for control/worker

>>>>>>       threads, and it is ODP to arrange the threads onto each CPU core while

>>>>>>       launching, right?

>>>>>>    - *B*: guarantee by adding new APIs to disable/enable CPU

>>>>>>    migration.

>>>>>>    - Then document clearly in user's guide or API document.

>>>>>>

>>>>>>

>>>>>>    1. Understand the requirement to have both processor-local and

>>>>>>    system-wide timing APIs:

>>>>>>

>>>>>>

>>>>>>    - There are some APIs available in time.h (odp_time_local(), etc).

>>>>>>    - We can have a thread to understand the relationship, usage

>>>>>>    scenarios and constraints of APIs in time.h and cpu.h.

>>>>>>

>>>>>> Best Regards, Yi

>>>>>>

>>>>>> On 4 May 2016 at 23:32, Bill Fischofer <bill.fischofer@linaro.org>

>>>>>> wrote:

>>>>>>

>>>>>>> I think there are two fallouts form this discussion.  First, there

>>>>>>> is the question of the precise semantics of the existing timing APIs as

>>>>>>> they relate to processor locality. Applications such as profiling tests, to

>>>>>>> the extent that they APIs that have processor-local semantics, must ensure

>>>>>>> that the thread(s) using these APIs are pinned for the duration of the

>>>>>>> measurement.

>>>>>>>

>>>>>>> The other point is the one that Petri brought up about having other

>>>>>>> APIs that provide timing information based on wall time or other metrics

>>>>>>> that are not processor-local.  While these may not have the same

>>>>>>> performance characteristics, they would be independent of thread migration

>>>>>>> considerations.

>>>>>>>

>>>>>>> Of course all this depends on exactly what one is trying to measure.

>>>>>>> Since thread migration is not free, allowing such activity may or may not

>>>>>>> be relevant to what is being measured, so ODP probably wants to have both

>>>>>>> processor-local and systemwide timing APIs.  We just need to be sure they

>>>>>>> are specified precisely so that applications know how to use them properly.

>>>>>>>

>>>>>>> On Wed, May 4, 2016 at 10:23 AM, Mike Holmes <mike.holmes@linaro.org

>>>>>>> > wrote:

>>>>>>>

>>>>>>>> It sounded like the arch call was leaning towards documenting that

>>>>>>>> on odp-linux  the application must ensure that odp_threads are pinned to

>>>>>>>> cores when launched.

>>>>>>>> This is a restriction that some platforms may not need to make, vs

>>>>>>>> the idea that a piece of ODP code can use these APIs to ensure the behavior

>>>>>>>> it needs without knowledge or reliance on the wider system.

>>>>>>>>

>>>>>>>> On 4 May 2016 at 01:45, Yi He <yi.he@linaro.org> wrote:

>>>>>>>>

>>>>>>>>> Establish a performance profiling environment guarantees meaningful

>>>>>>>>> and consistency of consecutive invocations of the odp_cpu_xxx()

>>>>>>>>> APIs.

>>>>>>>>> While after profiling was done restore the execution environment to

>>>>>>>>> its multi-core optimized state.

>>>>>>>>>

>>>>>>>>> Signed-off-by: Yi He <yi.he@linaro.org>

>>>>>>>>> ---

>>>>>>>>>  include/odp/api/spec/cpu.h | 31 +++++++++++++++++++++++++++++++

>>>>>>>>>  1 file changed, 31 insertions(+)

>>>>>>>>>

>>>>>>>>> diff --git a/include/odp/api/spec/cpu.h

>>>>>>>>> b/include/odp/api/spec/cpu.h

>>>>>>>>> index 2789511..0bc9327 100644

>>>>>>>>> --- a/include/odp/api/spec/cpu.h

>>>>>>>>> +++ b/include/odp/api/spec/cpu.h

>>>>>>>>> @@ -27,6 +27,21 @@ extern "C" {

>>>>>>>>>

>>>>>>>>>

>>>>>>>>>  /**

>>>>>>>>> + * @typedef odp_profiler_t

>>>>>>>>> + * ODP performance profiler handle

>>>>>>>>> + */

>>>>>>>>> +

>>>>>>>>> +/**

>>>>>>>>> + * Setup a performance profiling environment

>>>>>>>>> + *

>>>>>>>>> + * A performance profiling environment guarantees meaningful and

>>>>>>>>> consistency of

>>>>>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs.

>>>>>>>>> + *

>>>>>>>>> + * @return performance profiler handle

>>>>>>>>> + */

>>>>>>>>> +odp_profiler_t odp_profiler_start(void);

>>>>>>>>> +

>>>>>>>>> +/**

>>>>>>>>>   * CPU identifier

>>>>>>>>>   *

>>>>>>>>>   * Determine CPU identifier on which the calling is running. CPU

>>>>>>>>> numbering is

>>>>>>>>> @@ -170,6 +185,22 @@ uint64_t odp_cpu_cycles_resolution(void);

>>>>>>>>>  void odp_cpu_pause(void);

>>>>>>>>>

>>>>>>>>>  /**

>>>>>>>>> + * Stop the performance profiling environment

>>>>>>>>> + *

>>>>>>>>> + * Stop performance profiling and restore the execution

>>>>>>>>> environment to its

>>>>>>>>> + * multi-core optimized state, won't preserve meaningful and

>>>>>>>>> consistency of

>>>>>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs anymore.

>>>>>>>>> + *

>>>>>>>>> + * @param profiler  performance profiler handle

>>>>>>>>> + *

>>>>>>>>> + * @retval 0 on success

>>>>>>>>> + * @retval <0 on failure

>>>>>>>>> + *

>>>>>>>>> + * @see odp_profiler_start()

>>>>>>>>> + */

>>>>>>>>> +int odp_profiler_stop(odp_profiler_t profiler);

>>>>>>>>> +

>>>>>>>>> +/**

>>>>>>>>>   * @}

>>>>>>>>>   */

>>>>>>>>>

>>>>>>>>> --

>>>>>>>>> 1.9.1

>>>>>>>>>

>>>>>>>>> _______________________________________________

>>>>>>>>> lng-odp mailing list

>>>>>>>>> lng-odp@lists.linaro.org

>>>>>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp

>>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> --

>>>>>>>> Mike Holmes

>>>>>>>> Technical Manager - Linaro Networking Group

>>>>>>>> Linaro.org <http://www.linaro.org/> *│ *Open source software for

>>>>>>>> ARM SoCs

>>>>>>>> "Work should be fun and collaborative, the rest follows"

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> _______________________________________________

>>>>>>>> lng-odp mailing list

>>>>>>>> lng-odp@lists.linaro.org

>>>>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp

>>>>>>>>

>>>>>>>>

>>>>>>>

>>>>>>

>>>>>

>>>>

>>>

>>

>
Bill Fischofer May 11, 2016, 12:08 a.m. UTC | #10
We didn't get around to discussing this during today's public call, but
we'll try to cover this during tomorrow's ARCH call.

As I noted earlier, the question of thread assignment to cores becomes
complicated in virtual environments, where it's not clear that a (virtual)
core necessarily implies that there is any dedicated HW behind it. I think
the simplest approach is simply to say that as long as the number
of ODP threads is less than or equal to the number reported by
odp_cpumask_all_available(), the number of control threads does not
exceed odp_cpumask_default_control(), and the number of worker threads does
not exceed odp_cpumask_default_worker(), then the application can assume
that each ODP thread will have its own CPU. If the thread count exceeds
these numbers, then it is implementation-defined as to how ODP threads are
multiplexed onto available CPUs in a fair manner.

Applications that want best performance will adapt their thread usage to
the number of CPUs available to them (subject to application-defined
minimums, perhaps) to ensure that they don't have more threads than CPUs.
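
Expressed as code, the convention would amount to something like this
(sketch only; requested_workers and requested_control stand for whatever
the application would like to create):

odp_cpumask_t all, workers, control;
int total, max_workers, max_control, own_cpu_per_thread;

odp_cpumask_all_available(&all);
total       = odp_cpumask_count(&all);
max_workers = odp_cpumask_default_worker(&workers, 0);
max_control = odp_cpumask_default_control(&control, 0);

/* Each ODP thread can be assumed to get its own CPU only if the
 * application's thread counts fit within what was reported. */
own_cpu_per_thread = requested_workers <= max_workers &&
                     requested_control <= max_control &&
                     requested_workers + requested_control <= total;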

If we have this convention then perhaps no additional APIs are needed to
cover pinning/migration considerations?

On Tue, May 10, 2016 at 8:04 AM, Yi He <yi.he@linaro.org> wrote:

> Hi, Petri

>

> While we can continue processor-related discussions in Bill's new

> comprehensive email thread, about ODP-427 of how to guarantee locality of

> odp_cpu_xxx() APIs, can we make a decision between two choices in

> tomorrow's ARCH meeting?

>

> *Choice one: *constraint to ODP thread concept: every ODP thread will be

> pinned on one CPU core. In this case, only the main thread was accidentally

> not pinned on one core :), it is an ODP_THREAD_CONTROL, but is not

> instantiated through odph_linux_pthread_create().

>

> The solution can be: in odp_init_global() API, after

> odp_cpumask_init_global(), pin the main thread to the 1st available core

> for control threads.

>

> *Choice two: *in case to allow ODP thread migration between CPU cores,

> new APIs are required to enable/disable CPU migration on the fly. (as patch

> suggested).

>

> Let's talk in tomorrow. Thanks and Best Regards, Yi

>

> On 10 May 2016 at 04:54, Bill Fischofer <bill.fischofer@linaro.org> wrote:

>

>>

>>

>> On Mon, May 9, 2016 at 1:50 AM, Yi He <yi.he@linaro.org> wrote:

>>

>>> Hi, Bill

>>>

>>> Thanks very much for your detailed explanation. I understand the

>>> programming practise like:

>>>

>>> /* Firstly developer got a chance to specify core availabilities to the

>>>  * application instance.

>>>  */

>>> odp_init_global(... odp_init_t *param->worker_cpus & ->control_cpus ... )

>>>

>>> *So It is possible to run an application with different core

>>> availabilities spec on different platform, **and possible to run

>>> multiple application instances on one platform in isolation.*

>>>

>>> *A: Make the above as command line parameters can help application

>>> binary portable, run it on platform A or B requires no re-compilation, but

>>> only invocation parameters change.*

>>>

>>

>> The intent behind the ability to specify cpumasks at odp_init_global()

>> time is to allow a launcher script that is configured by some provisioning

>> agent (e.g., OpenDaylight) to communicate core assignments down to the ODP

>> implementation in a platform-independent manner.  So applications will fall

>> into two categories, those that have provisioned coremasks that simply get

>> passed through and more "stand alone" applications that will us

>> odp_cpumask_all_available() and odp_cpumask_default_worker/control() as

>> noted earlier to size themselves dynamically to the available processing

>> resources.  In both cases there is no need to recompile the application but

>> rather to simply have it create an appropriate number of control/worker

>> threads as determined either by external configuration or inquiry.

>>

>>

>>>

>>> /* Application developer fanout worker/control threads depends on

>>>  * the needs and actual availabilities.

>>>  */

>>> actually_available_cores =

>>>     odp_cpumask_default_worker(&cores, needs_to_fanout_N_workers);

>>>

>>> iterator( actually_available_cores ) {

>>>

>>>     /* Fanout one work thread instance */

>>>     odph_linux_pthread_create(...upon one available core...);

>>> }

>>>

>>> *B: Is odph_linux_pthread_create() a temporary helper API and will

>>> converge into platform-independent odp_thread_create(..one core spec...) in

>>> future? Or, is it deliberately left as platform dependant helper API?*

>>>

>>> Based on above understanding and back to ODP-427 problem, which seems

>>> only the main thread (program entrance) was accidentally not pinned on one

>>> core :), the main thread is also an ODP_THREAD_CONTROL, but was not

>>> instantiated through odph_linux_pthread_create().

>>>

>>

>> ODP provides no APIs or helpers to control thread pinning. The only

>> controls ODP provides is the ability to know the number of available cores,

>> to partition them for use by worker and control threads, and the ability

>> (via helpers) to create a number of threads of the application's choosing.

>> The implementation is expected to schedule these threads to available cores

>> in a fair manner, so if the number of application threads is less than or

>> equal to the available number of cores then implementations SHOULD (but are

>> not required to) pin each thread to its own core. Applications SHOULD NOT

>> be designed to require or depend on any specify thread-to-core mapping both

>> for portability as well as because what constitutes a "core" in a virtual

>> environment may or may not represent dedicated hardware.

>>

>>

>>>

>>> A solution can be: in odp_init_global() API, after

>>> odp_cpumask_init_global(), pin the main thread to the 1st available core

>>> for control thread. This adds new behavioural specification to this API,

>>> but seems natural. Actually Ivan's patch did most of this, except that the

>>> core was fixed to 0. we can discuss in today's meeting.

>>>

>>

>> An application may consist of more than a single thread at the time it

>> calls odp_init_global(), however it is RECOMMENDED that odp_init_global()

>> be called only from the application's initial thread and before it creates

>> any other threads to avoid the address space confusion that has been the

>> subject of the past couple of ARCH calls and that we are looking to achieve

>> consensus on. I'd like to move that discussion to a separate discussion

>> thread from this one, if you don't mind.

>>

>>

>>>

>>> Thanks and Best Regards, Yi

>>>

>>> On 6 May 2016 at 22:23, Bill Fischofer <bill.fischofer@linaro.org>

>>> wrote:

>>>

>>>> These are all good questions. ODP divides threads into worker threads

>>>> and control threads. The distinction is that worker threads are supposed to

>>>> be performance sensitive and perform optimally with dedicated cores while

>>>> control threads perform more "housekeeping" functions and would be less

>>>> impacted by sharing cores.

>>>>

>>>> In the absence of explicit API calls, it is unspecified how an ODP

>>>> implementation assigns threads to cores. The distinction between worker and

>>>> control thread is a hint to the underlying implementation that should be

>>>> used in managing available processor resources.

>>>>

>>>> The APIs in cpumask.h enable applications to determine how many CPUs

>>>> are available to it and how to divide them among worker and control threads

>>>> (odp_cpumask_default_worker() and odp_cpumask_default_control()).  Note

>>>> that ODP does not provide APIs for setting specific threads to specific

>>>> CPUs, so keep that in mind in the answers below.

>>>>

>>>>

>>>> On Thu, May 5, 2016 at 7:59 AM, Yi He <yi.he@linaro.org> wrote:

>>>>

>>>>> Hi, thanks Bill

>>>>>

>>>>> I understand more deeply of ODP thread concept and in embedded app

>>>>> developers are involved in target platform tuning/optimization.

>>>>>

>>>>> Can I have a little example: say we have a data-plane app which

>>>>> includes 3 ODP threads. And would like to install and run it upon 2

>>>>> platforms.

>>>>>

>>>>>    - Platform A: 2 cores.

>>>>>    - Platform B: 10 cores

>>>>>

>>>>> During initialization, the application can use

>>>> odp_cpumask_all_available() to determine how many CPUs are available and

>>>> can (optionally) use odp_cpumask_default_worker() and

>>>> odp_cpumask_default_control() to divide them into CPUs that should be used

>>>> for worker and control threads, respectively. For an application designed

>>>> for scale-out, the number of available CPUs would typically be used to

>>>> control how many worker threads the application creates. If the number of

>>>> worker threads matches the number of worker CPUs then the ODP

>>>> implementation would be expected to dedicate a worker core to each worker

>>>> thread. If more threads are created than there are corresponding cores,

>>>> then it is up to each implementation as to how it multiplexes them among

>>>> the available cores in a fair manner.

>>>>

>>>>

>>>>> Question, which one of the below assumptions is the current ODP

>>>>> programming model?

>>>>>

>>>>> *1, *Application developer writes target platform specific code to

>>>>> tell that:

>>>>>

>>>>> On platform A run threads (0) on core (0), and threads (1,2) on core

>>>>> (1).

>>>>> On platform B run threads (0) on core (0), and threads (1) can scale

>>>>> out and duplicate 8 instances on core (1~8), and thread (2) on core (9).

>>>>>

>>>>

>>>> As noted, ODP does not provide APIs that permit specific threads to be

>>>> assigned to specific cores. Instead it is up to each ODP implementation as

>>>> to how it maps ODP threads to available CPUs, subject to the advisory

>>>> information provided by the ODP thread type and the cpumask assignments for

>>>> control and worker threads. So in these examples suppose what the

>>>> application has is two control threads and one or more workers.  For

>>>> Platform A you might have core 0 defined for control threads and Core 1 for

>>>> worker threads. In this case threads 0 and 1 would run on Core 0 while

>>>> thread 2 ran on Core 1. For Platform B it's again up to the application how

>>>> it wants to divide the 10 CPUs between control and worker. It may want to

>>>> have 2 control CPUs so that each control thread can have its own core,

>>>> leaving 8 worker threads, or it might have the control threads share a

>>>> single CPU and have 9 worker threads with their own cores.

>>>>

>>>>

>>>>>

>>>>>

>>>> Install and run on different platform requires above platform specific

>>>>> code and recompilation for target.

>>>>>

>>>>

>>>> No. As noted, the model is the same. The only difference is how many

>>>> control/worker threads the application chooses to create based on the

>>>> information it gets during initialization by odp_cpumask_all_available().

>>>>

>>>>

>>>>>

>>>>> *2, *Application developer writes code to specify:

>>>>>

>>>>> Threads (0, 2) would not scale out

>>>>> Threads (1) can scale out (up to a limit N?)

>>>>> Platform A has 3 cores available (as command line parameter?)

>>>>> Platform B has 10 cores available (as command line parameter?)

>>>>>

>>>>> Install and run on different platform may not requires re-compilation.

>>>>> ODP intelligently arrange the threads according to the information

>>>>> provided.

>>>>>

>>>>

>>>> Applications determine the minimum number of threads they require. For

>>>> most applications they would tend to have a fixed number of control threads

>>>> (based on the application's functional design) and a variable number of

>>>> worker threads (minimum 1) based on available processing resources. These

>>>> application-defined minimums determine the minimum configuration the

>>>> application might need for optimal performance, with scale out to larger

>>>> configurations performed automatically.

>>>>

>>>>

>>>>>

>>>>> Last question: in some case like power save mode available cores

>>>>> shrink would ODP intelligently re-arrange the ODP threads dynamically in

>>>>> runtime?

>>>>>

>>>>

>>>> The intent is that while control threads may have distinct roles and

>>>> responsibilities (thus requiring that all always be eligible to be

>>>> scheduled) worker threads are symmetric and interchangeable. So in this

>>>> case if I have N worker threads to match to the N available worker CPUs and

>>>> power save mode wants to reduce that number to N-1, then the only effect is

>>>> that the worker CPU entering power save mode goes dormant along with the

>>>> thread that is running on it. That thread isn't redistributed to some other

>>>> core because it's the same as the other worker threads.  Its is expected

>>>> that cores would only enter power save state at odp_schedule() boundaries.

>>>> So for example, if odp_schedule() determines that there is no work to

>>>> dispatch to this thread then that might trigger the associated CPU to enter

>>>> low power mode. When later that core wakes up odp_schedule() would continue

>>>> and then return work to its reactivated thread.

>>>>

>>>> A slight wrinkle here is the concept of scheduler groups, which allows

>>>> work classes to be dispatched to different groups of worker threads.  In

>>>> this case the implementation might want to take scheduler group membership

>>>> into consideration in determining which cores to idle for power savings.

>>>> However, the ODP API itself is silent on this subject as it is

>>>> implementation dependent how power save modes are managed.

>>>>

>>>>

>>>>>

>>>>> Thanks and Best Regards, Yi

>>>>>

>>>>

>>>> Thank you for these questions. I answering them I realized we do not

>>>> (yet) have this information covered in the ODP User Guide. I'll be using

>>>> this information to help fill in that gap.

>>>>

>>>>

>>>>>

>>>>> On 5 May 2016 at 18:50, Bill Fischofer <bill.fischofer@linaro.org>

>>>>> wrote:

>>>>>

>>>>>> I've added this to the agenda for Monday's call, however I suggest we

>>>>>> continue the dialog here as well as background.

>>>>>>

>>>>>> Regarding thread pinning, there's always been a tradeoff on that.  On

>>>>>> the one hand dedicating cores to threads is ideal for scale out in many

>>>>>> core systems, however ODP does not require many core environments to work

>>>>>> effectively, so ODP APIs enable but do not require or assume that cores are

>>>>>> dedicated to threads. That's really a question of application design and

>>>>>> fit to the particular platform it's running on. In embedded environments

>>>>>> you'll likely see this model more since the application knows which

>>>>>> platform it's being targeted for. In VNF environments, by contrast, you're

>>>>>> more likely to see a blend where applications will take advantage of

>>>>>> however many cores are available to it but will still run without dedicated

>>>>>> cores in environments with more modest resources.

>>>>>>

>>>>>> On Wed, May 4, 2016 at 9:45 PM, Yi He <yi.he@linaro.org> wrote:

>>>>>>

>>>>>>> Hi, thanks Mike and Bill,

>>>>>>>

>>>>>>> From your clear summarize can we put it into several TO-DO

>>>>>>> decisions: (we can have a discussion in next ARCH call):

>>>>>>>

>>>>>>>    1. How to addressing the precise semantics of the existing

>>>>>>>    timing APIs (odp_cpu_xxx) as they relate to processor locality.

>>>>>>>

>>>>>>>

>>>>>>>    - *A:* guarantee by adding constraint to ODP thread concept:

>>>>>>>    every ODP thread shall be deployed and pinned on one CPU core.

>>>>>>>       - A sub-question: my understanding is that application

>>>>>>>       programmers only need to specify available CPU sets for control/worker

>>>>>>>       threads, and it is ODP to arrange the threads onto each CPU core while

>>>>>>>       launching, right?

>>>>>>>    - *B*: guarantee by adding new APIs to disable/enable CPU

>>>>>>>    migration.

>>>>>>>    - Then document clearly in user's guide or API document.

>>>>>>>

>>>>>>>

>>>>>>>    1. Understand the requirement to have both processor-local and

>>>>>>>    system-wide timing APIs:

>>>>>>>

>>>>>>>

>>>>>>>    - There are some APIs available in time.h (odp_time_local(),

>>>>>>>    etc).

>>>>>>>    - We can have a thread to understand the relationship, usage

>>>>>>>    scenarios and constraints of APIs in time.h and cpu.h.

>>>>>>>

>>>>>>> Best Regards, Yi

>>>>>>>

>>>>>>> On 4 May 2016 at 23:32, Bill Fischofer <bill.fischofer@linaro.org>

>>>>>>> wrote:

>>>>>>>

>>>>>>>> I think there are two fallouts form this discussion.  First, there

>>>>>>>> is the question of the precise semantics of the existing timing APIs as

>>>>>>>> they relate to processor locality. Applications such as profiling tests, to

>>>>>>>> the extent that they APIs that have processor-local semantics, must ensure

>>>>>>>> that the thread(s) using these APIs are pinned for the duration of the

>>>>>>>> measurement.

>>>>>>>>

>>>>>>>> The other point is the one that Petri brought up about having other

>>>>>>>> APIs that provide timing information based on wall time or other metrics

>>>>>>>> that are not processor-local.  While these may not have the same

>>>>>>>> performance characteristics, they would be independent of thread migration

>>>>>>>> considerations.

>>>>>>>>

>>>>>>>> Of course all this depends on exactly what one is trying to

>>>>>>>> measure. Since thread migration is not free, allowing such activity may or

>>>>>>>> may not be relevant to what is being measured, so ODP probably wants to

>>>>>>>> have both processor-local and systemwide timing APIs.  We just need to be

>>>>>>>> sure they are specified precisely so that applications know how to use them

>>>>>>>> properly.

>>>>>>>>

>>>>>>>> On Wed, May 4, 2016 at 10:23 AM, Mike Holmes <

>>>>>>>> mike.holmes@linaro.org> wrote:

>>>>>>>>

>>>>>>>>> It sounded like the arch call was leaning towards documenting that

>>>>>>>>> on odp-linux  the application must ensure that odp_threads are pinned to

>>>>>>>>> cores when launched.

>>>>>>>>> This is a restriction that some platforms may not need to make, vs

>>>>>>>>> the idea that a piece of ODP code can use these APIs to ensure the behavior

>>>>>>>>> it needs without knowledge or reliance on the wider system.

>>>>>>>>>

>>>>>>>>> On 4 May 2016 at 01:45, Yi He <yi.he@linaro.org> wrote:

>>>>>>>>>

>>>>>>>>>> Establish a performance profiling environment guarantees

>>>>>>>>>> meaningful

>>>>>>>>>> and consistency of consecutive invocations of the odp_cpu_xxx()

>>>>>>>>>> APIs.

>>>>>>>>>> While after profiling was done restore the execution environment

>>>>>>>>>> to

>>>>>>>>>> its multi-core optimized state.

>>>>>>>>>>

>>>>>>>>>> Signed-off-by: Yi He <yi.he@linaro.org>

>>>>>>>>>> ---

>>>>>>>>>>  include/odp/api/spec/cpu.h | 31 +++++++++++++++++++++++++++++++

>>>>>>>>>>  1 file changed, 31 insertions(+)

>>>>>>>>>>

>>>>>>>>>> diff --git a/include/odp/api/spec/cpu.h

>>>>>>>>>> b/include/odp/api/spec/cpu.h

>>>>>>>>>> index 2789511..0bc9327 100644

>>>>>>>>>> --- a/include/odp/api/spec/cpu.h

>>>>>>>>>> +++ b/include/odp/api/spec/cpu.h

>>>>>>>>>> @@ -27,6 +27,21 @@ extern "C" {

>>>>>>>>>>

>>>>>>>>>>

>>>>>>>>>>  /**

>>>>>>>>>> + * @typedef odp_profiler_t

>>>>>>>>>> + * ODP performance profiler handle

>>>>>>>>>> + */

>>>>>>>>>> +

>>>>>>>>>> +/**

>>>>>>>>>> + * Setup a performance profiling environment

>>>>>>>>>> + *

>>>>>>>>>> + * A performance profiling environment guarantees meaningful and

>>>>>>>>>> consistency of

>>>>>>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs.

>>>>>>>>>> + *

>>>>>>>>>> + * @return performance profiler handle

>>>>>>>>>> + */

>>>>>>>>>> +odp_profiler_t odp_profiler_start(void);

>>>>>>>>>> +

>>>>>>>>>> +/**

>>>>>>>>>>   * CPU identifier

>>>>>>>>>>   *

>>>>>>>>>>   * Determine CPU identifier on which the calling is running. CPU

>>>>>>>>>> numbering is

>>>>>>>>>> @@ -170,6 +185,22 @@ uint64_t odp_cpu_cycles_resolution(void);

>>>>>>>>>>  void odp_cpu_pause(void);

>>>>>>>>>>

>>>>>>>>>>  /**

>>>>>>>>>> + * Stop the performance profiling environment

>>>>>>>>>> + *

>>>>>>>>>> + * Stop performance profiling and restore the execution

>>>>>>>>>> environment to its

>>>>>>>>>> + * multi-core optimized state, won't preserve meaningful and

>>>>>>>>>> consistency of

>>>>>>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs anymore.

>>>>>>>>>> + *

>>>>>>>>>> + * @param profiler  performance profiler handle

>>>>>>>>>> + *

>>>>>>>>>> + * @retval 0 on success

>>>>>>>>>> + * @retval <0 on failure

>>>>>>>>>> + *

>>>>>>>>>> + * @see odp_profiler_start()

>>>>>>>>>> + */

>>>>>>>>>> +int odp_profiler_stop(odp_profiler_t profiler);

>>>>>>>>>> +

>>>>>>>>>> +/**

>>>>>>>>>>   * @}

>>>>>>>>>>   */

>>>>>>>>>>

>>>>>>>>>> --

>>>>>>>>>> 1.9.1

>>>>>>>>>>

>>>>>>>>>> _______________________________________________

>>>>>>>>>> lng-odp mailing list

>>>>>>>>>> lng-odp@lists.linaro.org

>>>>>>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp

>>>>>>>>>>

>>>>>>>>>

>>>>>>>>>

>>>>>>>>>

>>>>>>>>> --

>>>>>>>>> Mike Holmes

>>>>>>>>> Technical Manager - Linaro Networking Group

>>>>>>>>> Linaro.org <http://www.linaro.org/> *│ *Open source software for

>>>>>>>>> ARM SoCs

>>>>>>>>> "Work should be fun and collaborative, the rest follows"

>>>>>>>>>

>>>>>>>>>

>>>>>>>>>

>>>>>>>>> _______________________________________________

>>>>>>>>> lng-odp mailing list

>>>>>>>>> lng-odp@lists.linaro.org

>>>>>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp

>>>>>>>>>

>>>>>>>>>

>>>>>>>>

>>>>>>>

>>>>>>

>>>>>

>>>>

>>>

>>

>
Yi He May 12, 2016, 8:40 a.m. UTC | #11
Hi, Bill and Petri

After yesterday's discussion and emails I realize ARCH has some strict rules:

   1. The ODP APIs do not want to handle thread-to-core arrangement; they only
   provide availability info, right? (may need repeating several times until I
   get it :)).
   2. Then the application needs the odp-helper API to accomplish this; in other
   words, the odp-helper API takes the responsibility of providing methods for
   "ODP threads' deployment and instantiation" (see the sketch after this list).
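
Here is a minimal sketch of that split, assuming odp-linux and the odp_api.h
umbrella header; size_workers() is a made-up helper name, not part of any API,
and the actual launch and pinning of the threads stays with the odp-helper
(e.g. odph_linux_pthread_create()):

#include <odp_api.h>

/* Cap the number of worker threads to the number of worker CPUs that
 * ODP reports as available, so the implementation can keep one ODP
 * thread per core; thread creation itself is left to the helper. */
static int size_workers(int requested)
{
        odp_cpumask_t workers;
        int avail = odp_cpumask_default_worker(&workers, requested);

        return avail < requested ? avail : requested;
}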

For ODP-427, let's narrow the goal down to "pin the main thread to core 0 from
the odp-helper to prevent validation test failures, while adding no
constraints to the ODP APIs at all"; does that sound good?

Then here is the proposal: in the libodphelper-linux library, add an
__attribute__((constructor)) function that pins the main thread to core 0.
This fulfils the goal above and requires no application code additions. (The
only drawback is that it cannot pin to the 1st available control-thread core,
because it runs before odp_init_global(); no perfect world :)).
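
As a minimal sketch of that constructor, assuming Linux sched_setaffinity()
and a GCC/Clang toolchain; the name odph_pin_main_thread() is made up here and
is not taken from existing helper code:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Constructors run before main(), i.e. on the main thread, when the
 * helper library is linked in; pin that thread to core 0. */
static void __attribute__((constructor)) odph_pin_main_thread(void)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(0, &set);

        if (sched_setaffinity(0, sizeof(set), &set))
                perror("sched_setaffinity");
}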

In the meanwhile we can open a new thread to further discuss the big topics:
1, the ODP thread concept and how threads are deployed and instantiated on a platform;
2, further advanced topics around it (to be listed).

thanks and best regards, Yi


On 11 May 2016 at 08:08, Bill Fischofer <bill.fischofer@linaro.org> wrote:

> We didn't get around to discussing this during today's public call, but

> we'll try to cover this during tomorrow's ARCH call.

>

> As I noted earlier, the question of thread assignment to cores becomes

> complicated in virtual environment when it's not clear that a (virtual)

> core necessarily implies that there is any dedicated HW behind it. I think

> the simplest approach to take is simply to say that as long as the number

> of ODP threads is less than or equal to the number reported by

> odp_cpumask_all_available() and the number of control threads does not

> exceed odp_cpumask_control_default() and the number of worker threads does

> not exceed odp_cpumask_worker_default(), then the application can assume

> that each ODP thread will have its own CPU. If the thread count exceeds

> these numbers, then it is implementation-defined as to how ODP threads are

> multiplexed onto available CPUs in a fair manner.

>

> Applications that want best performance will adapt their thread usage to

> the number of CPUs available to it (subject to application-defined

> minimums, perhaps) to ensure that they don't have more threads than CPUs.

>

> If we have this convention then perhaps no additional APIs are needed to

> cover pinning/migration considerations?

>

> On Tue, May 10, 2016 at 8:04 AM, Yi He <yi.he@linaro.org> wrote:

>

>> Hi, Petri

>>

>> While we can continue processor-related discussions in Bill's new

>> comprehensive email thread, about ODP-427 of how to guarantee locality of

>> odp_cpu_xxx() APIs, can we make a decision between two choices in

>> tomorrow's ARCH meeting?

>>

>> *Choice one: *constraint to ODP thread concept: every ODP thread will be

>> pinned on one CPU core. In this case, only the main thread was accidentally

>> not pinned on one core :), it is an ODP_THREAD_CONTROL, but is not

>> instantiated through odph_linux_pthread_create().

>>

>> The solution can be: in odp_init_global() API, after

>> odp_cpumask_init_global(), pin the main thread to the 1st available core

>> for control threads.

>>

>> *Choice two: *in case to allow ODP thread migration between CPU cores,

>> new APIs are required to enable/disable CPU migration on the fly. (as patch

>> suggested).

>>

>> Let's talk in tomorrow. Thanks and Best Regards, Yi

>>

>> On 10 May 2016 at 04:54, Bill Fischofer <bill.fischofer@linaro.org>

>> wrote:

>>

>>>

>>>

>>> On Mon, May 9, 2016 at 1:50 AM, Yi He <yi.he@linaro.org> wrote:

>>>

>>>> Hi, Bill

>>>>

>>>> Thanks very much for your detailed explanation. I understand the

>>>> programming practise like:

>>>>

>>>> /* Firstly developer got a chance to specify core availabilities to the

>>>>  * application instance.

>>>>  */

>>>> odp_init_global(... odp_init_t *param->worker_cpus & ->control_cpus ...

>>>> )

>>>>

>>>> *So It is possible to run an application with different core

>>>> availabilities spec on different platform, **and possible to run

>>>> multiple application instances on one platform in isolation.*

>>>>

>>>> *A: Make the above as command line parameters can help application

>>>> binary portable, run it on platform A or B requires no re-compilation, but

>>>> only invocation parameters change.*

>>>>

>>>

>>> The intent behind the ability to specify cpumasks at odp_init_global()

>>> time is to allow a launcher script that is configured by some provisioning

>>> agent (e.g., OpenDaylight) to communicate core assignments down to the ODP

>>> implementation in a platform-independent manner.  So applications will fall

>>> into two categories, those that have provisioned coremasks that simply get

>>> passed through and more "stand alone" applications that will us

>>> odp_cpumask_all_available() and odp_cpumask_default_worker/control() as

>>> noted earlier to size themselves dynamically to the available processing

>>> resources.  In both cases there is no need to recompile the application but

>>> rather to simply have it create an appropriate number of control/worker

>>> threads as determined either by external configuration or inquiry.

>>>

>>>

>>>>

>>>> /* Application developer fanout worker/control threads depends on

>>>>  * the needs and actual availabilities.

>>>>  */

>>>> actually_available_cores =

>>>>     odp_cpumask_default_worker(&cores, needs_to_fanout_N_workers);

>>>>

>>>> iterator( actually_available_cores ) {

>>>>

>>>>     /* Fanout one work thread instance */

>>>>     odph_linux_pthread_create(...upon one available core...);

>>>> }

>>>>

>>>> *B: Is odph_linux_pthread_create() a temporary helper API and will

>>>> converge into platform-independent odp_thread_create(..one core spec...) in

>>>> future? Or, is it deliberately left as platform dependant helper API?*

>>>>

>>>> Based on above understanding and back to ODP-427 problem, which seems

>>>> only the main thread (program entrance) was accidentally not pinned on one

>>>> core :), the main thread is also an ODP_THREAD_CONTROL, but was not

>>>> instantiated through odph_linux_pthread_create().

>>>>

>>>

>>> ODP provides no APIs or helpers to control thread pinning. The only

>>> controls ODP provides is the ability to know the number of available cores,

>>> to partition them for use by worker and control threads, and the ability

>>> (via helpers) to create a number of threads of the application's choosing.

>>> The implementation is expected to schedule these threads to available cores

>>> in a fair manner, so if the number of application threads is less than or

>>> equal to the available number of cores then implementations SHOULD (but are

>>> not required to) pin each thread to its own core. Applications SHOULD NOT

>>> be designed to require or depend on any specify thread-to-core mapping both

>>> for portability as well as because what constitutes a "core" in a virtual

>>> environment may or may not represent dedicated hardware.

>>>

>>>

>>>>

>>>> A solution can be: in odp_init_global() API, after

>>>> odp_cpumask_init_global(), pin the main thread to the 1st available core

>>>> for control thread. This adds new behavioural specification to this API,

>>>> but seems natural. Actually Ivan's patch did most of this, except that the

>>>> core was fixed to 0. we can discuss in today's meeting.

>>>>

>>>

>>> An application may consist of more than a single thread at the time it

>>> calls odp_init_global(), however it is RECOMMENDED that odp_init_global()

>>> be called only from the application's initial thread and before it creates

>>> any other threads to avoid the address space confusion that has been the

>>> subject of the past couple of ARCH calls and that we are looking to achieve

>>> consensus on. I'd like to move that discussion to a separate discussion

>>> thread from this one, if you don't mind.

>>>

>>>

>>>>

>>>> Thanks and Best Regards, Yi

>>>>

>>>> On 6 May 2016 at 22:23, Bill Fischofer <bill.fischofer@linaro.org>

>>>> wrote:

>>>>

>>>>> These are all good questions. ODP divides threads into worker threads

>>>>> and control threads. The distinction is that worker threads are supposed to

>>>>> be performance sensitive and perform optimally with dedicated cores while

>>>>> control threads perform more "housekeeping" functions and would be less

>>>>> impacted by sharing cores.

>>>>>

>>>>> In the absence of explicit API calls, it is unspecified how an ODP

>>>>> implementation assigns threads to cores. The distinction between worker and

>>>>> control thread is a hint to the underlying implementation that should be

>>>>> used in managing available processor resources.

>>>>>

>>>>> The APIs in cpumask.h enable applications to determine how many CPUs

>>>>> are available to it and how to divide them among worker and control threads

>>>>> (odp_cpumask_default_worker() and odp_cpumask_default_control()).  Note

>>>>> that ODP does not provide APIs for setting specific threads to specific

>>>>> CPUs, so keep that in mind in the answers below.

>>>>>

>>>>>

>>>>> On Thu, May 5, 2016 at 7:59 AM, Yi He <yi.he@linaro.org> wrote:

>>>>>

>>>>>> Hi, thanks Bill

>>>>>>

>>>>>> I understand more deeply of ODP thread concept and in embedded app

>>>>>> developers are involved in target platform tuning/optimization.

>>>>>>

>>>>>> Can I have a little example: say we have a data-plane app which

>>>>>> includes 3 ODP threads. And would like to install and run it upon 2

>>>>>> platforms.

>>>>>>

>>>>>>    - Platform A: 2 cores.

>>>>>>    - Platform B: 10 cores

>>>>>>

>>>>>> During initialization, the application can use

>>>>> odp_cpumask_all_available() to determine how many CPUs are available and

>>>>> can (optionally) use odp_cpumask_default_worker() and

>>>>> odp_cpumask_default_control() to divide them into CPUs that should be used

>>>>> for worker and control threads, respectively. For an application designed

>>>>> for scale-out, the number of available CPUs would typically be used to

>>>>> control how many worker threads the application creates. If the number of

>>>>> worker threads matches the number of worker CPUs then the ODP

>>>>> implementation would be expected to dedicate a worker core to each worker

>>>>> thread. If more threads are created than there are corresponding cores,

>>>>> then it is up to each implementation as to how it multiplexes them among

>>>>> the available cores in a fair manner.

>>>>>

>>>>>

>>>>>> Question, which one of the below assumptions is the current ODP

>>>>>> programming model?

>>>>>>

>>>>>> *1, *Application developer writes target platform specific code to

>>>>>> tell that:

>>>>>>

>>>>>> On platform A run threads (0) on core (0), and threads (1,2) on core

>>>>>> (1).

>>>>>> On platform B run threads (0) on core (0), and threads (1) can scale

>>>>>> out and duplicate 8 instances on core (1~8), and thread (2) on core (9).

>>>>>>

>>>>>

>>>>> As noted, ODP does not provide APIs that permit specific threads to be

>>>>> assigned to specific cores. Instead it is up to each ODP implementation as

>>>>> to how it maps ODP threads to available CPUs, subject to the advisory

>>>>> information provided by the ODP thread type and the cpumask assignments for

>>>>> control and worker threads. So in these examples suppose what the

>>>>> application has is two control threads and one or more workers.  For

>>>>> Platform A you might have core 0 defined for control threads and Core 1 for

>>>>> worker threads. In this case threads 0 and 1 would run on Core 0 while

>>>>> thread 2 ran on Core 1. For Platform B it's again up to the application how

>>>>> it wants to divide the 10 CPUs between control and worker. It may want to

>>>>> have 2 control CPUs so that each control thread can have its own core,

>>>>> leaving 8 worker threads, or it might have the control threads share a

>>>>> single CPU and have 9 worker threads with their own cores.

>>>>>

>>>>>

>>>>>>

>>>>>>

>>>>> Install and run on different platform requires above platform specific

>>>>>> code and recompilation for target.

>>>>>>

>>>>>

>>>>> No. As noted, the model is the same. The only difference is how many

>>>>> control/worker threads the application chooses to create based on the

>>>>> information it gets during initialization by odp_cpumask_all_available().

>>>>>

>>>>>

>>>>>>

>>>>>> *2, *Application developer writes code to specify:

>>>>>>

>>>>>> Threads (0, 2) would not scale out

>>>>>> Threads (1) can scale out (up to a limit N?)

>>>>>> Platform A has 3 cores available (as command line parameter?)

>>>>>> Platform B has 10 cores available (as command line parameter?)

>>>>>>

>>>>>> Install and run on different platform may not requires re-compilation.

>>>>>> ODP intelligently arrange the threads according to the information

>>>>>> provided.

>>>>>>

>>>>>

>>>>> Applications determine the minimum number of threads they require. For

>>>>> most applications they would tend to have a fixed number of control threads

>>>>> (based on the application's functional design) and a variable number of

>>>>> worker threads (minimum 1) based on available processing resources. These

>>>>> application-defined minimums determine the minimum configuration the

>>>>> application might need for optimal performance, with scale out to larger

>>>>> configurations performed automatically.

>>>>>

>>>>>

>>>>>>

>>>>>> Last question: in some case like power save mode available cores

>>>>>> shrink would ODP intelligently re-arrange the ODP threads dynamically in

>>>>>> runtime?

>>>>>>

>>>>>

>>>>> The intent is that while control threads may have distinct roles and

>>>>> responsibilities (thus requiring that all always be eligible to be

>>>>> scheduled) worker threads are symmetric and interchangeable. So in this

>>>>> case if I have N worker threads to match to the N available worker CPUs and

>>>>> power save mode wants to reduce that number to N-1, then the only effect is

>>>>> that the worker CPU entering power save mode goes dormant along with the

>>>>> thread that is running on it. That thread isn't redistributed to some other

>>>>> core because it's the same as the other worker threads.  Its is expected

>>>>> that cores would only enter power save state at odp_schedule() boundaries.

>>>>> So for example, if odp_schedule() determines that there is no work to

>>>>> dispatch to this thread then that might trigger the associated CPU to enter

>>>>> low power mode. When later that core wakes up odp_schedule() would continue

>>>>> and then return work to its reactivated thread.

>>>>>

>>>>> A slight wrinkle here is the concept of scheduler groups, which allows

>>>>> work classes to be dispatched to different groups of worker threads.  In

>>>>> this case the implementation might want to take scheduler group membership

>>>>> into consideration in determining which cores to idle for power savings.

>>>>> However, the ODP API itself is silent on this subject as it is

>>>>> implementation dependent how power save modes are managed.

>>>>>

>>>>>

>>>>>>

>>>>>> Thanks and Best Regards, Yi

>>>>>>

>>>>>

>>>>> Thank you for these questions. I answering them I realized we do not

>>>>> (yet) have this information covered in the ODP User Guide. I'll be using

>>>>> this information to help fill in that gap.

>>>>>

>>>>>

>>>>>>

>>>>>> On 5 May 2016 at 18:50, Bill Fischofer <bill.fischofer@linaro.org>

>>>>>> wrote:

>>>>>>

>>>>>>> I've added this to the agenda for Monday's call, however I suggest

>>>>>>> we continue the dialog here as well as background.

>>>>>>>

>>>>>>> Regarding thread pinning, there's always been a tradeoff on that.

>>>>>>> On the one hand dedicating cores to threads is ideal for scale out in many

>>>>>>> core systems, however ODP does not require many core environments to work

>>>>>>> effectively, so ODP APIs enable but do not require or assume that cores are

>>>>>>> dedicated to threads. That's really a question of application design and

>>>>>>> fit to the particular platform it's running on. In embedded environments

>>>>>>> you'll likely see this model more since the application knows which

>>>>>>> platform it's being targeted for. In VNF environments, by contrast, you're

>>>>>>> more likely to see a blend where applications will take advantage of

>>>>>>> however many cores are available to it but will still run without dedicated

>>>>>>> cores in environments with more modest resources.

>>>>>>>

>>>>>>> On Wed, May 4, 2016 at 9:45 PM, Yi He <yi.he@linaro.org> wrote:

>>>>>>>

>>>>>>>> Hi, thanks Mike and Bill,

>>>>>>>>

>>>>>>>> From your clear summarize can we put it into several TO-DO

>>>>>>>> decisions: (we can have a discussion in next ARCH call):

>>>>>>>>

>>>>>>>>    1. How to addressing the precise semantics of the existing

>>>>>>>>    timing APIs (odp_cpu_xxx) as they relate to processor locality.

>>>>>>>>

>>>>>>>>

>>>>>>>>    - *A:* guarantee by adding constraint to ODP thread concept:

>>>>>>>>    every ODP thread shall be deployed and pinned on one CPU core.

>>>>>>>>       - A sub-question: my understanding is that application

>>>>>>>>       programmers only need to specify available CPU sets for control/worker

>>>>>>>>       threads, and it is ODP to arrange the threads onto each CPU core while

>>>>>>>>       launching, right?

>>>>>>>>    - *B*: guarantee by adding new APIs to disable/enable CPU

>>>>>>>>    migration.

>>>>>>>>    - Then document clearly in user's guide or API document.

>>>>>>>>

>>>>>>>>

>>>>>>>>    1. Understand the requirement to have both processor-local and

>>>>>>>>    system-wide timing APIs:

>>>>>>>>

>>>>>>>>

>>>>>>>>    - There are some APIs available in time.h (odp_time_local(),

>>>>>>>>    etc).

>>>>>>>>    - We can have a thread to understand the relationship, usage

>>>>>>>>    scenarios and constraints of APIs in time.h and cpu.h.

>>>>>>>>

>>>>>>>> Best Regards, Yi

>>>>>>>>

>>>>>>>> On 4 May 2016 at 23:32, Bill Fischofer <bill.fischofer@linaro.org>

>>>>>>>> wrote:

>>>>>>>>

>>>>>>>>> I think there are two fallouts form this discussion.  First, there

>>>>>>>>> is the question of the precise semantics of the existing timing APIs as

>>>>>>>>> they relate to processor locality. Applications such as profiling tests, to

>>>>>>>>> the extent that they APIs that have processor-local semantics, must ensure

>>>>>>>>> that the thread(s) using these APIs are pinned for the duration of the

>>>>>>>>> measurement.

>>>>>>>>>

>>>>>>>>> The other point is the one that Petri brought up about having

>>>>>>>>> other APIs that provide timing information based on wall time or other

>>>>>>>>> metrics that are not processor-local.  While these may not have the same

>>>>>>>>> performance characteristics, they would be independent of thread migration

>>>>>>>>> considerations.

>>>>>>>>>

>>>>>>>>> Of course all this depends on exactly what one is trying to

>>>>>>>>> measure. Since thread migration is not free, allowing such activity may or

>>>>>>>>> may not be relevant to what is being measured, so ODP probably wants to

>>>>>>>>> have both processor-local and systemwide timing APIs.  We just need to be

>>>>>>>>> sure they are specified precisely so that applications know how to use them

>>>>>>>>> properly.

>>>>>>>>>

>>>>>>>>> On Wed, May 4, 2016 at 10:23 AM, Mike Holmes <

>>>>>>>>> mike.holmes@linaro.org> wrote:

>>>>>>>>>

>>>>>>>>>> It sounded like the arch call was leaning towards documenting

>>>>>>>>>> that on odp-linux  the application must ensure that odp_threads are pinned

>>>>>>>>>> to cores when launched.

>>>>>>>>>> This is a restriction that some platforms may not need to make,

>>>>>>>>>> vs the idea that a piece of ODP code can use these APIs to ensure the

>>>>>>>>>> behavior it needs without knowledge or reliance on the wider system.

>>>>>>>>>>

>>>>>>>>>> On 4 May 2016 at 01:45, Yi He <yi.he@linaro.org> wrote:

>>>>>>>>>>

>>>>>>>>>>> Establish a performance profiling environment guarantees

>>>>>>>>>>> meaningful

>>>>>>>>>>> and consistency of consecutive invocations of the odp_cpu_xxx()

>>>>>>>>>>> APIs.

>>>>>>>>>>> While after profiling was done restore the execution environment

>>>>>>>>>>> to

>>>>>>>>>>> its multi-core optimized state.

>>>>>>>>>>>

>>>>>>>>>>> Signed-off-by: Yi He <yi.he@linaro.org>

>>>>>>>>>>> ---

>>>>>>>>>>>  include/odp/api/spec/cpu.h | 31 +++++++++++++++++++++++++++++++

>>>>>>>>>>>  1 file changed, 31 insertions(+)

>>>>>>>>>>>

>>>>>>>>>>> diff --git a/include/odp/api/spec/cpu.h

>>>>>>>>>>> b/include/odp/api/spec/cpu.h

>>>>>>>>>>> index 2789511..0bc9327 100644

>>>>>>>>>>> --- a/include/odp/api/spec/cpu.h

>>>>>>>>>>> +++ b/include/odp/api/spec/cpu.h

>>>>>>>>>>> @@ -27,6 +27,21 @@ extern "C" {

>>>>>>>>>>>

>>>>>>>>>>>

>>>>>>>>>>>  /**

>>>>>>>>>>> + * @typedef odp_profiler_t

>>>>>>>>>>> + * ODP performance profiler handle

>>>>>>>>>>> + */

>>>>>>>>>>> +

>>>>>>>>>>> +/**

>>>>>>>>>>> + * Setup a performance profiling environment

>>>>>>>>>>> + *

>>>>>>>>>>> + * A performance profiling environment guarantees meaningful

>>>>>>>>>>> and consistency of

>>>>>>>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs.

>>>>>>>>>>> + *

>>>>>>>>>>> + * @return performance profiler handle

>>>>>>>>>>> + */

>>>>>>>>>>> +odp_profiler_t odp_profiler_start(void);

>>>>>>>>>>> +

>>>>>>>>>>> +/**

>>>>>>>>>>>   * CPU identifier

>>>>>>>>>>>   *

>>>>>>>>>>>   * Determine CPU identifier on which the calling is running.

>>>>>>>>>>> CPU numbering is

>>>>>>>>>>> @@ -170,6 +185,22 @@ uint64_t odp_cpu_cycles_resolution(void);

>>>>>>>>>>>  void odp_cpu_pause(void);

>>>>>>>>>>>

>>>>>>>>>>>  /**

>>>>>>>>>>> + * Stop the performance profiling environment

>>>>>>>>>>> + *

>>>>>>>>>>> + * Stop performance profiling and restore the execution

>>>>>>>>>>> environment to its

>>>>>>>>>>> + * multi-core optimized state, won't preserve meaningful and

>>>>>>>>>>> consistency of

>>>>>>>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs anymore.

>>>>>>>>>>> + *

>>>>>>>>>>> + * @param profiler  performance profiler handle

>>>>>>>>>>> + *

>>>>>>>>>>> + * @retval 0 on success

>>>>>>>>>>> + * @retval <0 on failure

>>>>>>>>>>> + *

>>>>>>>>>>> + * @see odp_profiler_start()

>>>>>>>>>>> + */

>>>>>>>>>>> +int odp_profiler_stop(odp_profiler_t profiler);

>>>>>>>>>>> +

>>>>>>>>>>> +/**

>>>>>>>>>>>   * @}

>>>>>>>>>>>   */

>>>>>>>>>>>

>>>>>>>>>>> --

>>>>>>>>>>> 1.9.1

>>>>>>>>>>>

>>>>>>>>>>> _______________________________________________

>>>>>>>>>>> lng-odp mailing list

>>>>>>>>>>> lng-odp@lists.linaro.org

>>>>>>>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp

>>>>>>>>>>>

>>>>>>>>>>

>>>>>>>>>>

>>>>>>>>>>

>>>>>>>>>> --

>>>>>>>>>> Mike Holmes

>>>>>>>>>> Technical Manager - Linaro Networking Group

>>>>>>>>>> Linaro.org <http://www.linaro.org/> *│ *Open source software for

>>>>>>>>>> ARM SoCs

>>>>>>>>>> "Work should be fun and collaborative, the rest follows"

>>>>>>>>>>

>>>>>>>>>>

>>>>>>>>>>

>>>>>>>>>> _______________________________________________

>>>>>>>>>> lng-odp mailing list

>>>>>>>>>> lng-odp@lists.linaro.org

>>>>>>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp

>>>>>>>>>>

>>>>>>>>>>

>>>>>>>>>

>>>>>>>>

>>>>>>>

>>>>>>

>>>>>

>>>>

>>>

>>

>
Bill Fischofer May 12, 2016, 8:33 p.m. UTC | #12
On Thu, May 12, 2016 at 3:40 AM, Yi He <yi.he@linaro.org> wrote:

> Hi, Bill and Petri

>

> After yesterday's discussion/email I realized some strict rules in ARCH:

>

>    1. odp apis do not want to handle thread-to-core arrangement except

>    only providing availability info, right? (may be repeated several time

>    until I got it :)).

>    2. then application needs odp-helper api to accomplish this, in other

>    words, odp-helper api takes the responsibility to provide methods for "odp

>    threads' deployment and instantiation".

>

> For ODP-427, narrow down the goal to "pin main thread to core 0 by

> odp-helper to prevent validation tests fail, meanwhile adding no

> constraints to odp apis at all", sounds good?

>

> Then here is the proposal: for libodphelper-linux library, add

> __attribute__(constructor) constructor function to pin the main thread to

> core 0. fulfil above goal and require no application code addition. (only

> drawback is that cannot pin to the 1st available control thread core

> because it is before odp_init_global(), no perfect world :)).

>


That's an interesting suggestion. Core 0 is a distinguished core on most

systems anyway, so I don't see that as an undue restriction. We'll add this

to the discussion topic list for Monday's ARCH call.
>

> In meanwhile we can open new thread to further discuss on the big topics:

> 1, odp thread concept and their deploy and instantiation on platform.

> 2, further advanced topics around (to list).

>

> thanks and best regards, Yi

>

>

> On 11 May 2016 at 08:08, Bill Fischofer <bill.fischofer@linaro.org> wrote:

>

>> We didn't get around to discussing this during today's public call, but

>> we'll try to cover this during tomorrow's ARCH call.

>>

>> As I noted earlier, the question of thread assignment to cores becomes

>> complicated in virtual environment when it's not clear that a (virtual)

>> core necessarily implies that there is any dedicated HW behind it. I think

>> the simplest approach to take is simply to say that as long as the number

>> of ODP threads is less than or equal to the number reported by

>> odp_cpumask_all_available() and the number of control threads does not

>> exceed odp_cpumask_control_default() and the number of worker threads does

>> not exceed odp_cpumask_worker_default(), then the application can assume

>> that each ODP thread will have its own CPU. If the thread count exceeds

>> these numbers, then it is implementation-defined as to how ODP threads are

>> multiplexed onto available CPUs in a fair manner.

>>

>> Applications that want best performance will adapt their thread usage to

>> the number of CPUs available to it (subject to application-defined

>> minimums, perhaps) to ensure that they don't have more threads than CPUs.

>>

>> If we have this convention then perhaps no additional APIs are needed to

>> cover pinning/migration considerations?

>>

>> On Tue, May 10, 2016 at 8:04 AM, Yi He <yi.he@linaro.org> wrote:

>>

>>> Hi, Petri

>>>

>>> While we can continue processor-related discussions in Bill's new

>>> comprehensive email thread, about ODP-427 of how to guarantee locality of

>>> odp_cpu_xxx() APIs, can we make a decision between two choices in

>>> tomorrow's ARCH meeting?

>>>

>>> *Choice one: *constraint to ODP thread concept: every ODP thread will

>>> be pinned on one CPU core. In this case, only the main thread was

>>> accidentally not pinned on one core :), it is an ODP_THREAD_CONTROL, but is

>>> not instantiated through odph_linux_pthread_create().

>>>

>>> The solution can be: in odp_init_global() API, after

>>> odp_cpumask_init_global(), pin the main thread to the 1st available core

>>> for control threads.

>>>

>>> *Choice two: *in case to allow ODP thread migration between CPU cores,

>>> new APIs are required to enable/disable CPU migration on the fly. (as patch

>>> suggested).

>>>

>>> Let's talk in tomorrow. Thanks and Best Regards, Yi

>>>

>>> On 10 May 2016 at 04:54, Bill Fischofer <bill.fischofer@linaro.org>

>>> wrote:

>>>

>>>>

>>>>

>>>> On Mon, May 9, 2016 at 1:50 AM, Yi He <yi.he@linaro.org> wrote:

>>>>

>>>>> Hi, Bill

>>>>>

>>>>> Thanks very much for your detailed explanation. I understand the

>>>>> programming practise like:

>>>>>

>>>>> /* Firstly developer got a chance to specify core availabilities to the

>>>>>  * application instance.

>>>>>  */

>>>>> odp_init_global(... odp_init_t *param->worker_cpus & ->control_cpus

>>>>> ... )

>>>>>

>>>>> *So It is possible to run an application with different core

>>>>> availabilities spec on different platform, **and possible to run

>>>>> multiple application instances on one platform in isolation.*

>>>>>

>>>>> *A: Make the above as command line parameters can help application

>>>>> binary portable, run it on platform A or B requires no re-compilation, but

>>>>> only invocation parameters change.*

>>>>>

>>>>

>>>> The intent behind the ability to specify cpumasks at odp_init_global()

>>>> time is to allow a launcher script that is configured by some provisioning

>>>> agent (e.g., OpenDaylight) to communicate core assignments down to the ODP

>>>> implementation in a platform-independent manner.  So applications will fall

>>>> into two categories, those that have provisioned coremasks that simply get

>>>> passed through and more "stand alone" applications that will us

>>>> odp_cpumask_all_available() and odp_cpumask_default_worker/control() as

>>>> noted earlier to size themselves dynamically to the available processing

>>>> resources.  In both cases there is no need to recompile the application but

>>>> rather to simply have it create an appropriate number of control/worker

>>>> threads as determined either by external configuration or inquiry.

>>>>

>>>>

>>>>>

>>>>> /* Application developer fanout worker/control threads depends on

>>>>>  * the needs and actual availabilities.

>>>>>  */

>>>>> actually_available_cores =

>>>>>     odp_cpumask_default_worker(&cores, needs_to_fanout_N_workers);

>>>>>

>>>>> iterator( actually_available_cores ) {

>>>>>

>>>>>     /* Fanout one work thread instance */

>>>>>     odph_linux_pthread_create(...upon one available core...);

>>>>> }

>>>>>

>>>>> *B: Is odph_linux_pthread_create() a temporary helper API and will

>>>>> converge into platform-independent odp_thread_create(..one core spec...) in

>>>>> future? Or, is it deliberately left as platform dependant helper API?*

>>>>>

>>>>> Based on above understanding and back to ODP-427 problem, which seems

>>>>> only the main thread (program entrance) was accidentally not pinned on one

>>>>> core :), the main thread is also an ODP_THREAD_CONTROL, but was not

>>>>> instantiated through odph_linux_pthread_create().

>>>>>

>>>>

>>>> ODP provides no APIs or helpers to control thread pinning. The only

>>>> controls ODP provides is the ability to know the number of available cores,

>>>> to partition them for use by worker and control threads, and the ability

>>>> (via helpers) to create a number of threads of the application's choosing.

>>>> The implementation is expected to schedule these threads to available cores

>>>> in a fair manner, so if the number of application threads is less than or

>>>> equal to the available number of cores then implementations SHOULD (but are

>>>> not required to) pin each thread to its own core. Applications SHOULD NOT

>>>> be designed to require or depend on any specify thread-to-core mapping both

>>>> for portability as well as because what constitutes a "core" in a virtual

>>>> environment may or may not represent dedicated hardware.

>>>>

>>>>

>>>>>

>>>>> A solution can be: in odp_init_global() API, after

>>>>> odp_cpumask_init_global(), pin the main thread to the 1st available core

>>>>> for control thread. This adds new behavioural specification to this API,

>>>>> but seems natural. Actually Ivan's patch did most of this, except that the

>>>>> core was fixed to 0. we can discuss in today's meeting.

>>>>>

>>>>

>>>> An application may consist of more than a single thread at the time it

>>>> calls odp_init_global(), however it is RECOMMENDED that odp_init_global()

>>>> be called only from the application's initial thread and before it creates

>>>> any other threads to avoid the address space confusion that has been the

>>>> subject of the past couple of ARCH calls and that we are looking to achieve

>>>> consensus on. I'd like to move that discussion to a separate discussion

>>>> thread from this one, if you don't mind.

>>>>

>>>>

>>>>>

>>>>> Thanks and Best Regards, Yi

>>>>>

>>>>> On 6 May 2016 at 22:23, Bill Fischofer <bill.fischofer@linaro.org>

>>>>> wrote:

>>>>>

>>>>>> These are all good questions. ODP divides threads into worker threads

>>>>>> and control threads. The distinction is that worker threads are supposed to

>>>>>> be performance sensitive and perform optimally with dedicated cores while

>>>>>> control threads perform more "housekeeping" functions and would be less

>>>>>> impacted by sharing cores.

>>>>>>

>>>>>> In the absence of explicit API calls, it is unspecified how an ODP

>>>>>> implementation assigns threads to cores. The distinction between worker and

>>>>>> control thread is a hint to the underlying implementation that should be

>>>>>> used in managing available processor resources.

>>>>>>

>>>>>> The APIs in cpumask.h enable applications to determine how many CPUs

>>>>>> are available to it and how to divide them among worker and control threads

>>>>>> (odp_cpumask_default_worker() and odp_cpumask_default_control()).  Note

>>>>>> that ODP does not provide APIs for setting specific threads to specific

>>>>>> CPUs, so keep that in mind in the answers below.

>>>>>>

>>>>>>

>>>>>> On Thu, May 5, 2016 at 7:59 AM, Yi He <yi.he@linaro.org> wrote:

>>>>>>

>>>>>>> Hi, thanks Bill

>>>>>>>

>>>>>>> I understand more deeply of ODP thread concept and in embedded app

>>>>>>> developers are involved in target platform tuning/optimization.

>>>>>>>

>>>>>>> Can I have a little example: say we have a data-plane app which

>>>>>>> includes 3 ODP threads. And would like to install and run it upon 2

>>>>>>> platforms.

>>>>>>>

>>>>>>>    - Platform A: 2 cores.

>>>>>>>    - Platform B: 10 cores

>>>>>>>

>>>>>>> During initialization, the application can use

>>>>>> odp_cpumask_all_available() to determine how many CPUs are available and

>>>>>> can (optionally) use odp_cpumask_default_worker() and

>>>>>> odp_cpumask_default_control() to divide them into CPUs that should be used

>>>>>> for worker and control threads, respectively. For an application designed

>>>>>> for scale-out, the number of available CPUs would typically be used to

>>>>>> control how many worker threads the application creates. If the number of

>>>>>> worker threads matches the number of worker CPUs then the ODP

>>>>>> implementation would be expected to dedicate a worker core to each worker

>>>>>> thread. If more threads are created than there are corresponding cores,

>>>>>> then it is up to each implementation as to how it multiplexes them among

>>>>>> the available cores in a fair manner.

>>>>>>

>>>>>>

>>>>>>> Question, which one of the below assumptions is the current ODP

>>>>>>> programming model?

>>>>>>>

>>>>>>> *1, *Application developer writes target platform specific code to

>>>>>>> tell that:

>>>>>>>

>>>>>>> On platform A run threads (0) on core (0), and threads (1,2) on core

>>>>>>> (1).

>>>>>>> On platform B run threads (0) on core (0), and threads (1) can scale

>>>>>>> out and duplicate 8 instances on core (1~8), and thread (2) on core (9).

>>>>>>>

>>>>>>

>>>>>> As noted, ODP does not provide APIs that permit specific threads to

>>>>>> be assigned to specific cores. Instead it is up to each ODP implementation

>>>>>> as to how it maps ODP threads to available CPUs, subject to the advisory

>>>>>> information provided by the ODP thread type and the cpumask assignments for

>>>>>> control and worker threads. So in these examples suppose what the

>>>>>> application has is two control threads and one or more workers.  For

>>>>>> Platform A you might have core 0 defined for control threads and Core 1 for

>>>>>> worker threads. In this case threads 0 and 1 would run on Core 0 while

>>>>>> thread 2 ran on Core 1. For Platform B it's again up to the application how

>>>>>> it wants to divide the 10 CPUs between control and worker. It may want to

>>>>>> have 2 control CPUs so that each control thread can have its own core,

>>>>>> leaving 8 worker threads, or it might have the control threads share a

>>>>>> single CPU and have 9 worker threads with their own cores.

>>>>>>

>>>>>>

>>>>>>>

>>>>>>>

>>>>>> Install and run on different platform requires above platform

>>>>>>> specific code and recompilation for target.

>>>>>>>

>>>>>>

>>>>>> No. As noted, the model is the same. The only difference is how many

>>>>>> control/worker threads the application chooses to create based on the

>>>>>> information it gets during initialization by odp_cpumask_all_available().

>>>>>>

>>>>>>

>>>>>>>

>>>>>>> *2, *Application developer writes code to specify:

>>>>>>>

>>>>>>> Threads (0, 2) would not scale out

>>>>>>> Threads (1) can scale out (up to a limit N?)

>>>>>>> Platform A has 3 cores available (as command line parameter?)

>>>>>>> Platform B has 10 cores available (as command line parameter?)

>>>>>>>

>>>>>>> Install and run on different platform may not requires

>>>>>>> re-compilation.

>>>>>>> ODP intelligently arrange the threads according to the information

>>>>>>> provided.

>>>>>>>

>>>>>>

>>>>>> Applications determine the minimum number of threads they require.

>>>>>> For most applications they would tend to have a fixed number of control

>>>>>> threads (based on the application's functional design) and a variable

>>>>>> number of worker threads (minimum 1) based on available processing

>>>>>> resources. These application-defined minimums determine the minimum

>>>>>> configuration the application might need for optimal performance, with

>>>>>> scale out to larger configurations performed automatically.

>>>>>>

>>>>>>

>>>>>>>

>>>>>>> Last question: in some case like power save mode available cores

>>>>>>> shrink would ODP intelligently re-arrange the ODP threads dynamically in

>>>>>>> runtime?

>>>>>>>

>>>>>>

>>>>>> The intent is that while control threads may have distinct roles and

>>>>>> responsibilities (thus requiring that all always be eligible to be

>>>>>> scheduled) worker threads are symmetric and interchangeable. So in this

>>>>>> case if I have N worker threads to match to the N available worker CPUs and

>>>>>> power save mode wants to reduce that number to N-1, then the only effect is

>>>>>> that the worker CPU entering power save mode goes dormant along with the

>>>>>> thread that is running on it. That thread isn't redistributed to some other

>>>>>> core because it's the same as the other worker threads.  Its is expected

>>>>>> that cores would only enter power save state at odp_schedule() boundaries.

>>>>>> So for example, if odp_schedule() determines that there is no work to

>>>>>> dispatch to this thread then that might trigger the associated CPU to enter

>>>>>> low power mode. When later that core wakes up odp_schedule() would continue

>>>>>> and then return work to its reactivated thread.

>>>>>>

>>>>>> A slight wrinkle here is the concept of scheduler groups, which

>>>>>> allows work classes to be dispatched to different groups of worker

>>>>>> threads.  In this case the implementation might want to take scheduler

>>>>>> group membership into consideration in determining which cores to idle for

>>>>>> power savings. However, the ODP API itself is silent on this subject as it

>>>>>> is implementation dependent how power save modes are managed.

>>>>>>

>>>>>>

>>>>>>>

>>>>>>> Thanks and Best Regards, Yi

>>>>>>>

>>>>>>

>>>>>> Thank you for these questions. I answering them I realized we do not

>>>>>> (yet) have this information covered in the ODP User Guide. I'll be using

>>>>>> this information to help fill in that gap.

>>>>>>

>>>>>>

>>>>>>>

>>>>>>> On 5 May 2016 at 18:50, Bill Fischofer <bill.fischofer@linaro.org>

>>>>>>> wrote:

>>>>>>>

>>>>>>>> I've added this to the agenda for Monday's call, however I suggest

>>>>>>>> we continue the dialog here as well as background.

>>>>>>>>

>>>>>>>> Regarding thread pinning, there's always been a tradeoff on that.

>>>>>>>> On the one hand dedicating cores to threads is ideal for scale out in many

>>>>>>>> core systems, however ODP does not require many core environments to work

>>>>>>>> effectively, so ODP APIs enable but do not require or assume that cores are

>>>>>>>> dedicated to threads. That's really a question of application design and

>>>>>>>> fit to the particular platform it's running on. In embedded environments

>>>>>>>> you'll likely see this model more since the application knows which

>>>>>>>> platform it's being targeted for. In VNF environments, by contrast, you're

>>>>>>>> more likely to see a blend where applications will take advantage of

>>>>>>>> however many cores are available to it but will still run without dedicated

>>>>>>>> cores in environments with more modest resources.

>>>>>>>>

>>>>>>>> On Wed, May 4, 2016 at 9:45 PM, Yi He <yi.he@linaro.org> wrote:

>>>>>>>>

>>>>>>>>> Hi, thanks Mike and Bill,

>>>>>>>>>

>>>>>>>>> From your clear summarize can we put it into several TO-DO

>>>>>>>>> decisions: (we can have a discussion in next ARCH call):

>>>>>>>>>

>>>>>>>>>    1. How to addressing the precise semantics of the existing

>>>>>>>>>    timing APIs (odp_cpu_xxx) as they relate to processor locality.

>>>>>>>>>

>>>>>>>>>

>>>>>>>>>    - *A:* guarantee by adding constraint to ODP thread concept:

>>>>>>>>>    every ODP thread shall be deployed and pinned on one CPU core.

>>>>>>>>>       - A sub-question: my understanding is that application

>>>>>>>>>       programmers only need to specify available CPU sets for control/worker

>>>>>>>>>       threads, and it is ODP to arrange the threads onto each CPU core while

>>>>>>>>>       launching, right?

>>>>>>>>>    - *B*: guarantee by adding new APIs to disable/enable CPU

>>>>>>>>>    migration.

>>>>>>>>>    - Then document clearly in user's guide or API document.

>>>>>>>>>

>>>>>>>>>

>>>>>>>>>    1. Understand the requirement to have both processor-local and

>>>>>>>>>    system-wide timing APIs:

>>>>>>>>>

>>>>>>>>>

>>>>>>>>>    - There are some APIs available in time.h (odp_time_local(),

>>>>>>>>>    etc).

>>>>>>>>>    - We can have a thread to understand the relationship, usage

>>>>>>>>>    scenarios and constraints of APIs in time.h and cpu.h.

>>>>>>>>>

>>>>>>>>> Best Regards, Yi

>>>>>>>>>

>>>>>>>>> On 4 May 2016 at 23:32, Bill Fischofer <bill.fischofer@linaro.org>

>>>>>>>>> wrote:

>>>>>>>>>

>>>>>>>>>> I think there are two fallouts form this discussion.  First,

>>>>>>>>>> there is the question of the precise semantics of the existing timing APIs

>>>>>>>>>> as they relate to processor locality. Applications such as profiling tests,

>>>>>>>>>> to the extent that they APIs that have processor-local semantics, must

>>>>>>>>>> ensure that the thread(s) using these APIs are pinned for the duration of

>>>>>>>>>> the measurement.

>>>>>>>>>>

>>>>>>>>>> The other point is the one that Petri brought up about having

>>>>>>>>>> other APIs that provide timing information based on wall time or other

>>>>>>>>>> metrics that are not processor-local.  While these may not have the same

>>>>>>>>>> performance characteristics, they would be independent of thread migration

>>>>>>>>>> considerations.

>>>>>>>>>>

>>>>>>>>>> Of course all this depends on exactly what one is trying to

>>>>>>>>>> measure. Since thread migration is not free, allowing such activity may or

>>>>>>>>>> may not be relevant to what is being measured, so ODP probably wants to

>>>>>>>>>> have both processor-local and systemwide timing APIs.  We just need to be

>>>>>>>>>> sure they are specified precisely so that applications know how to use them

>>>>>>>>>> properly.

>>>>>>>>>>

>>>>>>>>>> On Wed, May 4, 2016 at 10:23 AM, Mike Holmes <

>>>>>>>>>> mike.holmes@linaro.org> wrote:

>>>>>>>>>>

>>>>>>>>>>> It sounded like the arch call was leaning towards documenting

>>>>>>>>>>> that on odp-linux  the application must ensure that odp_threads are pinned

>>>>>>>>>>> to cores when launched.

>>>>>>>>>>> This is a restriction that some platforms may not need to make,

>>>>>>>>>>> vs the idea that a piece of ODP code can use these APIs to ensure the

>>>>>>>>>>> behavior it needs without knowledge or reliance on the wider system.

>>>>>>>>>>>

>>>>>>>>>>> On 4 May 2016 at 01:45, Yi He <yi.he@linaro.org> wrote:

>>>>>>>>>>>

>>>>>>>>>>>> Establish a performance profiling environment guarantees

>>>>>>>>>>>> meaningful

>>>>>>>>>>>> and consistency of consecutive invocations of the odp_cpu_xxx()

>>>>>>>>>>>> APIs.

>>>>>>>>>>>> While after profiling was done restore the execution

>>>>>>>>>>>> environment to

>>>>>>>>>>>> its multi-core optimized state.

>>>>>>>>>>>>

>>>>>>>>>>>> Signed-off-by: Yi He <yi.he@linaro.org>

>>>>>>>>>>>> ---

>>>>>>>>>>>>  include/odp/api/spec/cpu.h | 31 +++++++++++++++++++++++++++++++

>>>>>>>>>>>>  1 file changed, 31 insertions(+)

>>>>>>>>>>>>

>>>>>>>>>>>> diff --git a/include/odp/api/spec/cpu.h

>>>>>>>>>>>> b/include/odp/api/spec/cpu.h

>>>>>>>>>>>> index 2789511..0bc9327 100644

>>>>>>>>>>>> --- a/include/odp/api/spec/cpu.h

>>>>>>>>>>>> +++ b/include/odp/api/spec/cpu.h

>>>>>>>>>>>> @@ -27,6 +27,21 @@ extern "C" {

>>>>>>>>>>>>

>>>>>>>>>>>>

>>>>>>>>>>>>  /**

>>>>>>>>>>>> + * @typedef odp_profiler_t

>>>>>>>>>>>> + * ODP performance profiler handle

>>>>>>>>>>>> + */

>>>>>>>>>>>> +

>>>>>>>>>>>> +/**

>>>>>>>>>>>> + * Setup a performance profiling environment

>>>>>>>>>>>> + *

>>>>>>>>>>>> + * A performance profiling environment guarantees meaningful

>>>>>>>>>>>> and consistency of

>>>>>>>>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs.

>>>>>>>>>>>> + *

>>>>>>>>>>>> + * @return performance profiler handle

>>>>>>>>>>>> + */

>>>>>>>>>>>> +odp_profiler_t odp_profiler_start(void);

>>>>>>>>>>>> +

>>>>>>>>>>>> +/**

>>>>>>>>>>>>   * CPU identifier

>>>>>>>>>>>>   *

>>>>>>>>>>>>   * Determine CPU identifier on which the calling is running.

>>>>>>>>>>>> CPU numbering is

>>>>>>>>>>>> @@ -170,6 +185,22 @@ uint64_t odp_cpu_cycles_resolution(void);

>>>>>>>>>>>>  void odp_cpu_pause(void);

>>>>>>>>>>>>

>>>>>>>>>>>>  /**

>>>>>>>>>>>> + * Stop the performance profiling environment

>>>>>>>>>>>> + *

>>>>>>>>>>>> + * Stop performance profiling and restore the execution

>>>>>>>>>>>> environment to its

>>>>>>>>>>>> + * multi-core optimized state, won't preserve meaningful and

>>>>>>>>>>>> consistency of

>>>>>>>>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs anymore.

>>>>>>>>>>>> + *

>>>>>>>>>>>> + * @param profiler  performance profiler handle

>>>>>>>>>>>> + *

>>>>>>>>>>>> + * @retval 0 on success

>>>>>>>>>>>> + * @retval <0 on failure

>>>>>>>>>>>> + *

>>>>>>>>>>>> + * @see odp_profiler_start()

>>>>>>>>>>>> + */

>>>>>>>>>>>> +int odp_profiler_stop(odp_profiler_t profiler);

>>>>>>>>>>>> +

>>>>>>>>>>>> +/**

>>>>>>>>>>>>   * @}

>>>>>>>>>>>>   */

>>>>>>>>>>>>

>>>>>>>>>>>> --

>>>>>>>>>>>> 1.9.1

>>>>>>>>>>>>

>>>>>>>>>>>> _______________________________________________

>>>>>>>>>>>> lng-odp mailing list

>>>>>>>>>>>> lng-odp@lists.linaro.org

>>>>>>>>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp

>>>>>>>>>>>>

>>>>>>>>>>>

>>>>>>>>>>>

>>>>>>>>>>>

>>>>>>>>>>> --

>>>>>>>>>>> Mike Holmes

>>>>>>>>>>> Technical Manager - Linaro Networking Group

>>>>>>>>>>> Linaro.org <http://www.linaro.org/> *│ *Open source software

>>>>>>>>>>> for ARM SoCs

>>>>>>>>>>> "Work should be fun and collaborative, the rest follows"

>>>>>>>>>>>

>>>>>>>>>>>

>>>>>>>>>>>

>>>>>>>>>>> _______________________________________________

>>>>>>>>>>> lng-odp mailing list

>>>>>>>>>>> lng-odp@lists.linaro.org

>>>>>>>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp

>>>>>>>>>>>

>>>>>>>>>>>

>>>>>>>>>>

>>>>>>>>>

>>>>>>>>

>>>>>>>

>>>>>>

>>>>>

>>>>

>>>

>>

>
diff mbox

Patch

diff --git a/include/odp/api/spec/cpu.h b/include/odp/api/spec/cpu.h
index 2789511..0bc9327 100644
--- a/include/odp/api/spec/cpu.h
+++ b/include/odp/api/spec/cpu.h
@@ -27,6 +27,21 @@  extern "C" {
 
 
 /**
+ * @typedef odp_profiler_t
+ * ODP performance profiler handle
+ */
+
+/**
+ * Setup a performance profiling environment
+ *
+ * A performance profiling environment guarantees meaningful and consistency of
+ * consecutive invocations of the odp_cpu_xxx() APIs.
+ *
+ * @return performance profiler handle
+ */
+odp_profiler_t odp_profiler_start(void);
+
+/**
  * CPU identifier
  *
  * Determine CPU identifier on which the calling is running. CPU numbering is
@@ -170,6 +185,22 @@  uint64_t odp_cpu_cycles_resolution(void);
 void odp_cpu_pause(void);
 
 /**
+ * Stop the performance profiling environment
+ *
+ * Stop performance profiling and restore the execution environment to its
+ * multi-core optimized state, won't preserve meaningful and consistency of
+ * consecutive invocations of the odp_cpu_xxx() APIs anymore.
+ *
+ * @param profiler  performance profiler handle
+ *
+ * @retval 0 on success
+ * @retval <0 on failure
+ *
+ * @see odp_profiler_start()
+ */
+int odp_profiler_stop(odp_profiler_t profiler);
+
+/**
  * @}
  */
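
A minimal usage sketch of the proposed (RFC) profiler pair, assuming the
declarations above were merged into <odp/api/spec/cpu.h>; work_under_test() is
a placeholder workload, not an ODP API:

#include <stdio.h>
#include <odp_api.h>

extern void work_under_test(void);

/* Hold the profiling environment across the whole measurement so that
 * consecutive odp_cpu_cycles() calls stay meaningful and consistent,
 * as the RFC above describes. */
static uint64_t measure_cycles(void)
{
        odp_profiler_t profiler = odp_profiler_start();
        uint64_t c1 = odp_cpu_cycles();

        work_under_test();

        uint64_t c2 = odp_cpu_cycles();

        if (odp_profiler_stop(profiler) < 0)
                fprintf(stderr, "odp_profiler_stop() failed\n");

        return odp_cpu_cycles_diff(c2, c1);
}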