[RFC,0/3] tqs: add thread quiescent state library

Message ID 20181122033055.3431-1-honnappa.nagarahalli@arm.com
Series
  • tqs: add thread quiescent state library

Message

Honnappa Nagarahalli Nov. 22, 2018, 3:30 a.m.
Lock-less data structures provide scalability and determinism.
They enable use cases where locking may not be allowed
(for ex: real-time applications).

In the following paragraphs, the term 'memory' refers to memory allocated
by typical APIs like malloc, or to anything that is representative of
memory, for ex: an index into a free element array.

Since these data structures are lock-less, writers and readers access
them simultaneously. Hence, while removing an element from a data
structure, a writer cannot return the memory to the allocator without
knowing that no reader is still referencing that element/memory. The
operation of removing an element therefore has to be separated into
2 steps:

Delete: in this step, the writer removes the element from the
data structure but does not return the associated memory to the allocator.
This will ensure that new readers will not get a reference to the removed
element. Removing the reference is an atomic operation.

Free: in this step, the writer returns the memory to the
memory allocator, only after knowing that all the readers have stopped
referencing the removed element.

This library helps the writer determine when it is safe to free the
memory.

This library makes use of Thread Quiescent State (TQS). TQS can be
defined as 'any point in the thread execution where the thread does
not hold a reference to shared memory'. It is up to the application to
determine its quiescent state. Let us consider the following diagram:

    Time -------------------------------------------------->

                                |     |
  RT1   $++++****D1****+++***D2*|**+++|+++**D3*****++++$
                                |     |
  RT2          $++++****D1****++|+**D2|***++++++**D3*****++++$
                                |     |
  RT3      $++++****D1****+++***|D2***|++++++**D3*****++++$
                                |     |
                                |<--->|
                               Del | Free
                                   |
                              Cannot free memory
                              during this period

RTx - Reader thread
< and > - Start and end of while(1) loop
***Dx*** - Reader thread is accessing the shared data structure Dx.
           i.e. critical section.
+++ - Reader thread is not accessing any shared data structure.
      i.e. non critical section or quiescent state.
Del - Point in time when the reference to the entry is removed using
      atomic operation.
Free - Point in time when the writer can free the entry.

As shown, thread RT1 accesses data structures D1, D2 and D3. When it is
accessing D2, if the writer has to remove an element from D2, the
writer cannot return the memory associated with that element to the
allocator. The writer can return the memory to the allocator only after
the reader stops referencing D2. In other words, reader thread RT1
has to enter a quiescent state.

Similarly, since thread RT3 is also accessing D2, the writer has to wait
until RT3 enters a quiescent state as well.

However, the writer does not need to wait for RT2 to enter quiescent state.
Thread RT2 was not accessing D2 when the delete operation happened.
So, RT2 will not get a reference to the deleted entry.

It can be noted that the critical sections for D2 and D3 are quiescent states
for D1. i.e., for a given data structure Dx, any point in the thread execution
that does not reference Dx is a quiescent state.

For DPDK applications, the start and end of the while(1) loop (where no shared
data structures are accessed) act as perfect quiescent states. This
combines all the shared data structure accesses into a single critical
section and keeps the overhead introduced by this library to a minimum.

However, the time taken to identify the end of a critical section is
proportional to the length of the critical section and to the number of
reader threads. So, if the application desires, it should be possible to
identify the end of the critical section for each data structure individually.

To provide the required flexibility, this library has a concept of TQS
variable. The application can create one or more TQS variables to help it
track the end of one or more critical sections.

The application can create a TQS variable using the API rte_tqs_alloc.
It takes a mask of lcore IDs that will report their quiescent states
using this variable. This mask can be empty to start with.

The rte_tqs_register_lcore API registers a reader thread to report its
quiescent state. It can be called from any control plane thread or from
the reader thread itself. The application can create a TQS variable with no
reader threads and add the threads dynamically using this API.

The application can trigger the reader threads to report their quiescent
state status by calling the API rte_tqs_start. It is possible for multiple
writer threads to query the quiescent state status simultaneously. Hence,
rte_tqs_start returns a token to each caller.

The application has to call the rte_tqs_check API with the token to get the
current status. An option to block until all the threads have entered the
quiescent state is provided. If this API indicates that all the threads have
entered the quiescent state, the application can free the deleted entry.

The separation of triggering the reporting from querying the status provides
the writer threads flexibility to do useful work instead of waiting for the
reader threads to enter the quiescent state.

The rte_tqs_unregister_lcore API removes a reader thread from reporting its
quiescent state via a TQS variable. The rte_tqs_check API will no longer
wait for this reader thread to report its quiescent state.

Finally, a TQS variable can be deleted by calling the rte_tqs_free API. The
application must make sure that the reader threads are no longer referencing
the TQS variable before deleting it.

The reader threads should call the rte_tqs_update API to indicate that they
have entered a quiescent state. This API checks if a writer has triggered a
quiescent state query and updates the state accordingly.

Next Steps:
1) Add more test cases
2) Convert to patch
3) Incorporate feedback from community
4) Add documentation

Dharmik Thakkar (1):
  test/tqs: Add API and functional tests

Honnappa Nagarahalli (2):
  log: add TQS log type
  tqs: add thread quiescent state library

 config/common_base                      |   6 +
 lib/Makefile                            |   2 +
 lib/librte_eal/common/include/rte_log.h |   1 +
 lib/librte_tqs/Makefile                 |  23 +
 lib/librte_tqs/meson.build              |   5 +
 lib/librte_tqs/rte_tqs.c                | 249 +++++++++++
 lib/librte_tqs/rte_tqs.h                | 352 +++++++++++++++
 lib/librte_tqs/rte_tqs_version.map      |  16 +
 lib/meson.build                         |   2 +-
 mk/rte.app.mk                           |   1 +
 test/test/Makefile                      |   2 +
 test/test/autotest_data.py              |   6 +
 test/test/meson.build                   |   5 +-
 test/test/test_tqs.c                    | 540 ++++++++++++++++++++++++
 14 files changed, 1208 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_tqs/Makefile
 create mode 100644 lib/librte_tqs/meson.build
 create mode 100644 lib/librte_tqs/rte_tqs.c
 create mode 100644 lib/librte_tqs/rte_tqs.h
 create mode 100644 lib/librte_tqs/rte_tqs_version.map
 create mode 100644 test/test/test_tqs.c

-- 
2.17.1

Comments

Ilya Maximets Nov. 22, 2018, 7:31 a.m. | #1
Hi.
Are there any differentiation points with liburcu [1]?
Is there any benefit to having our own implementation inside DPDK?

[1] http://liburcu.org/
    https://lwn.net/Articles/573424/

Best regards, Ilya Maximets.

> Lock-less data structures provide scalability and determinism.
> They enable use cases where locking may not be allowed
> (for ex: real-time applications).
>
> <snip detailed RFC commit message>
Stephen Hemminger Nov. 27, 2018, 10:28 p.m. | #2
On Wed, 21 Nov 2018 21:30:52 -0600
Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> wrote:

> Lock-less data structures provide scalability and determinism.
> They enable use cases where locking may not be allowed
> (for ex: real-time applications).
>
> <snip detailed RFC commit message>

Mixed feelings about this one.

Love to see RCU used for more things since it is much better than reader/writer
locks for many applications. But hate to see DPDK reinventing every other library
and not reusing code. Userspace RCU https://liburcu.org/ is widely supported by
distros, more thoroughly tested and documented, and more flexible.

The issue with many of these reinventions is a tradeoff of DPDK growing
another dependency on an external library versus using common code.

For RCU, the big issue for me is the testing and documentation of how to do RCU
safely. Many people get it wrong!
Van Haaren, Harry Nov. 27, 2018, 10:49 p.m. | #3
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Stephen Hemminger
> Sent: Tuesday, November 27, 2018 2:28 PM
> To: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Cc: dev@dpdk.org; nd@arm.com; dharmik.thakkar@arm.com; malvika.gupta@arm.com;
> gavin.hu@arm.com
> Subject: Re: [dpdk-dev] [RFC 0/3] tqs: add thread quiescent state library
>
> On Wed, 21 Nov 2018 21:30:52 -0600
> Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> wrote:
>
> > Lock-less data structures provide scalability and determinism.
> > They enable use cases where locking may not be allowed
> > (for ex: real-time applications).
> >

<snip detailed RFC commit message>

> Mixed feelings about this one.
>
> Love to see RCU used for more things since it is much better than
> reader/writer locks for many applications. But hate to see DPDK
> reinventing every other library and not reusing code. Userspace RCU
> https://liburcu.org/ is widely supported by distros, more thoroughly
> tested and documented, and more flexible.
>
> The issue with many of these reinventions is a tradeoff of DPDK growing
> another dependency on an external library versus using common code.
>
> For RCU, the big issue for me is the testing and documentation of how to do
> RCU safely. Many people get it wrong!



Some notes on liburcu (and my amateur understanding of LGPL, I'm not a license lawyer :)

Liburcu is LGPL, which AFAIK means we must dynamically link applications if the application code is BSD or other permissive licenses.

The side effect of this is that urcu function calls must be "real" function calls and inlining them is not possible. Therefore using liburcu in LGPL mode could have a performance impact in this case. I expect estimating the performance cost would be
difficult, as it's pretty much case-by-case depending on what you're doing in the surrounding code.

Generally I'm in favour of using established libraries (particularly for complex functionality like RCU) but in this case I think there's a tradeoff with raw performance.
Honnappa Nagarahalli Nov. 28, 2018, 5:31 a.m. | #4
> >
> > On Wed, 21 Nov 2018 21:30:52 -0600
> > Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> wrote:
> >
> > > Lock-less data structures provide scalability and determinism.
> > > They enable use cases where locking may not be allowed (for ex:
> > > real-time applications).
> > >
>
> <snip detailed RFC commit message>
>
> > Mixed feelings about this one.
> >
> > Love to see RCU used for more things since it is much better than
> > reader/writer locks for many applications. But hate to see DPDK
> > reinventing every other library and not reusing code. Userspace RCU
> > https://liburcu.org/ is widely supported by distros, more thoroughly
> > tested and documented, and more flexible.
> >
> > The issue with many of these reinventions is a tradeoff of DPDK
> > growing another dependency on an external library versus using common code.
> >

Agree with the dependency issues. Sometimes flexibility also causes confusion, and brings features that are not necessarily required for a targeted use case. I have seen that much of the functionality that could be left to the application is implemented as part of the library.
I think having it in DPDK will give us control over the amount of capability this library will have and freedom over changes we would like to make to such a library. I also view DPDK as one package where all things required for data plane development are available.

> > For RCU, the big issue for me is the testing and documentation of how
> > to do RCU safely. Many people get it wrong!

Hopefully, we all will do a better job collectively :)

>
> Some notes on liburcu (and my amateur understanding of LGPL, I'm not a
> license lawyer :)
>
> Liburcu is LGPL, which AFAIK means we must dynamically link applications if
> the application code is BSD or other permissive licenses.
>
> The side effect of this is that urcu function calls must be "real" function
> calls and inlining them is not possible. Therefore using liburcu in LGPL mode
> could have a performance impact in this case. I expect estimating the
> performance cost would be difficult, as it's pretty much case-by-case
> depending on what you're doing in the surrounding code.
>
> Generally I'm in favour of using established libraries (particularly for
> complex functionality like RCU) but in this case I think there's a tradeoff
> with raw performance.
Stephen Hemminger Nov. 28, 2018, 11:23 p.m. | #5
On Wed, 28 Nov 2018 05:31:56 +0000
Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:

> > > Mixed feelings about this one.
> > >
> > > Love to see RCU used for more things since it is much better than
> > > reader/writer locks for many applications. But hate to see DPDK
> > > reinventing every other library and not reusing code. Userspace RCU
> > > https://liburcu.org/ is widely supported by distros, more thoroughly
> > > tested and documented, and more flexible.
> > >
> > > The issue with many of these reinventions is a tradeoff of DPDK
> > > growing another dependency on an external library versus using common code.
> > >
> Agree with the dependency issues. Sometimes flexibility also causes confusion, and brings features that are not necessarily required for a targeted use case. I have seen that much of the functionality that could be left to the application is implemented as part of the library.
> I think having it in DPDK will give us control over the amount of capability this library will have and freedom over changes we would like to make to such a library. I also view DPDK as one package where all things required for data plane development are available.
>
> > > For RCU, the big issue for me is the testing and documentation of how
> > > to do RCU safely. Many people get it wrong!
> Hopefully, we all will do a better job collectively :)


Reinventing RCU is not helping anyone.


DPDK needs to fix its dependency model, and just admit that it is ok
to build off of more than glibc.

Having used liburcu, it can be done in a small manner and really isn't that
confusing. 

Is your real issue the LGPL license of liburcu for your skittish customers?
Honnappa Nagarahalli Nov. 30, 2018, 2:13 a.m. | #6
> > > > Mixed feelings about this one.
> > > >
> > > > Love to see RCU used for more things since it is much better than
> > > > reader/writer locks for many applications. But hate to see DPDK
> > > > reinventing every other library and not reusing code. Userspace
> > > > RCU https://liburcu.org/ is widely supported by distros, more
> > > > thoroughly tested and documented, and more flexible.
> > > >
> > > > The issue with many of these reinventions is a tradeoff of DPDK
> > > > growing another dependency on external library versus using common code.
> > > >
> > Agree with the dependency issues. Sometimes flexibility also causes confusion and brings features that are not necessarily required for a targeted use case. I have seen that much of the functionality that could be left to the application is implemented as part of the library.
> > I think having it in DPDK will give us control over the amount of capability this library will have and freedom over the changes we would like to make to such a library. I also view DPDK as one package where all things required for data plane development are available.
> >
> > > > For RCU, the big issue for me is the testing and documentation of
> > > > how to do RCU safely. Many people get it wrong!
> > Hopefully, we all will do a better job collectively :)
>
> Reinventing RCU is not helping anyone.

IMO, this depends on what rte_tqs has to offer and what the requirements are. Before starting this patch, I looked at the liburcu APIs. I have to say, fairly quickly (no offense) I concluded that they do not address DPDK's needs. I took a deeper look at the APIs/code over the past day and reached the same conclusion. My partial analysis (more APIs can be analyzed, I do not have cycles at this point) is as follows:

The reader threads' information is maintained in a linked list [1]. This linked list is protected by a mutex lock [2]. Any additions/deletions/traversals of this list are blocking and cannot happen in parallel.

The API 'synchronize_rcu' [3] (similar in functionality to the rte_tqs_check call) is a blocking call. There is no option provided to make it non-blocking. The writer spins cycles while waiting for the grace period to end.

'synchronize_rcu' also takes a grace period lock [4]. If I have multiple writers running on data plane threads, I cannot call this API to reclaim memory in the worker threads, as it will block the other worker threads. This means an extra thread is required (on the control plane?) which does garbage collection, along with a method to push the pointers from the worker threads to that garbage collection thread. This also means the time from delete to free increases, putting pressure on the amount of memory held up.
Since this API cannot be called concurrently by multiple writers, each writer has to wait for the other writers' grace periods to end (i.e. multiple writer threads cannot overlap their grace periods).

This API also has to traverse the linked list, which is not well suited for calling on the data plane.

I have not gone too deep into the rcu_thread_offline [5] API. This again needs to be used on worker cores and does not look very optimal.

I have glanced at rcu_quiescent_state [6]; it wakes up the thread calling 'synchronize_rcu', which seems like a good amount of code for the data plane.

[1] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h#L85
[2] https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c#L68
[3] https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c#L344
[4] https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c#L58
[5] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h#L211
[6] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h#L193

Coming to what is provided in rte_tqs:

The synchronize_rcu functionality is split into 2 APIs: rte_tqs_start and rte_tqs_check. The reader data is maintained as an array.

Both APIs are lock-free, allowing them to be called from multiple threads concurrently. This allows multiple writers to wait for their grace periods concurrently as well as to overlap their grace periods. The rte_tqs_start API returns a token which provides the ability to separate the quiescent state waiting of different writers. Hence, no writer waits for another writer's grace period to end.
Since these 2 APIs are lock-free, they can be called from writers running on worker cores as well, without the need for a separate thread to do garbage collection.

The separation into 2 APIs gives writers the ability to not spin cycles waiting for the grace period to end. This enables different ways of doing garbage collection. For ex: a data structure delete API could remove the entry from the data structure, call rte_tqs_start and return to the caller. On the next API call into the library, the API can call rte_tqs_check (which will mostly indicate that the grace period is complete) and free the previously deleted entry.

rte_tqs_update (mapping to rcu_quiescent_state) is pretty small and simple.

The rte_tqs_register and rte_tqs_unregister APIs are lock-free. Hence additional APIs like rcu_thread_online and rcu_thread_offline are not required. The rte_tqs_unregister API (when compared to rcu_thread_offline) is much simpler and more conducive to use in worker threads.
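To make the token scheme described above concrete, here is a minimal, self-contained sketch using plain C11 atomics. The tqs_* names are hypothetical stand-ins for the rte_tqs_* APIs under discussion, not the actual patch implementation; the real APIs may differ in signatures and memory ordering.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_READERS 8

static atomic_uint_fast64_t tqs_token = 1;           /* global counter */
static atomic_uint_fast64_t reader_cnt[MAX_READERS]; /* 0 = unregistered */

/* Reader side: register, report a quiescent state, unregister. Each is a
 * single atomic store, hence callable from worker threads without locks. */
static void tqs_register(int id)
{
	atomic_store(&reader_cnt[id], atomic_load(&tqs_token));
}
static void tqs_update(int id) /* maps to rcu_quiescent_state */
{
	atomic_store(&reader_cnt[id], atomic_load(&tqs_token));
}
static void tqs_unregister(int id)
{
	atomic_store(&reader_cnt[id], 0);
}

/* Writer side: tqs_start bumps the counter and returns a token;
 * tqs_check polls whether every registered reader has since passed
 * through a quiescent state, i.e. observed a counter >= token. */
static uint64_t tqs_start(void)
{
	return atomic_fetch_add(&tqs_token, 1) + 1;
}
static int tqs_check(uint64_t token)
{
	for (int i = 0; i < MAX_READERS; i++) {
		uint64_t c = atomic_load(&reader_cnt[i]);
		if (c != 0 && c < token)
			return 0; /* reader i may still hold references */
	}
	return 1; /* grace period over: safe to free */
}
```

A writer that deleted an element would keep the token returned by tqs_start() alongside the element and free it once tqs_check() returns 1, polled on a later API call instead of spinning.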

>
> DPDK needs to fix its dependency model, and just admit that it is ok to build off
> of more than glibc.
>
> Having used liburcu, it can be done in a small manner and really isn't that
> confusing.
>
> Is your real issue the LGPL license of liburcu for your skittish customers?

I have not had any discussions on this. Customers are mainly focused on having a solution over which they have meaningful control. They want to be able to submit a patch and change things if required. For ex: the barriers for Arm [7] are not optimal. How easy is it to change this and get it into the distros (there are both internal and external factors here)?

[7] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/arch/arm.h#L44
Honnappa Nagarahalli Nov. 30, 2018, 2:25 a.m. | #7
> > Mixed feelings about this one.
> >
> > Love to see RCU used for more things since it is much better than
> > reader/writer locks for many applications. But hate to see DPDK
> > reinventing every other library and not reusing code. Userspace RCU
> > https://liburcu.org/ is widely supported by distros, more thoroughly
> > tested and documented, and more flexible.
> >
> > The issue with many of these reinventions is a tradeoff of DPDK
> > growing another dependency on external library versus using common code.
> >
> > For RCU, the big issue for me is the testing and documentation of how
> > to do RCU safely. Many people get it wrong!
>
> Some notes on liburcu (and my amateur understanding of LGPL, I'm not a
> license lawyer :)
>
> Liburcu is LGPL, which AFAIK means we must dynamically link applications if
> the application code is BSD or another permissive license.
>
> The side effect of this is that urcu function calls must be "real" function calls
> and inlining them is not possible. Therefore using liburcu in LGPL mode could
> have a performance impact in this case. I expect estimating the performance
> cost would be difficult, as it's pretty much a case-by-case matter depending on
> what you're doing in the surrounding code.
>
> Generally I'm in favour of using established libraries (particularly for complex
> functionality like RCU) but in this case I think there's a tradeoff with raw
> performance.

The licensing info [1] is very interesting. Again, I am no lawyer :)

[1] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h#L184
Luca Boccassi Nov. 30, 2018, 4:26 p.m. | #8
On Fri, 2018-11-30 at 02:13 +0000, Honnappa Nagarahalli wrote:
> >
> > > > > Mixed feelings about this one.
> > > > >
> > > > > Love to see RCU used for more things since it is much better than
> > > > > reader/writer locks for many applications. But hate to see DPDK
> > > > > reinventing every other library and not reusing code. Userspace
> > > > > RCU https://liburcu.org/ is widely supported by distros, more
> > > > > thoroughly tested and documented, and more flexible.
> > > > >
> > > > > The issue with many of these reinventions is a tradeoff of DPDK
> > > > > growing another dependency on external library versus using common code.
> > > > >
> > > Agree with the dependency issues. Sometimes flexibility also causes confusion and brings features that are not necessarily required for a targeted use case. I have seen that much of the functionality that could be left to the application is implemented as part of the library.
> > > I think having it in DPDK will give us control over the amount of capability this library will have and freedom over the changes we would like to make to such a library. I also view DPDK as one package where all things required for data plane development are available.
> > >
> > > > > For RCU, the big issue for me is the testing and documentation of
> > > > > how to do RCU safely. Many people get it wrong!
> > > Hopefully, we all will do a better job collectively :)
> >
> > Reinventing RCU is not helping anyone.
>
> IMO, this depends on what rte_tqs has to offer and what the requirements are. Before starting this patch, I looked at the liburcu APIs. I have to say, fairly quickly (no offense) I concluded that they do not address DPDK's needs. I took a deeper look at the APIs/code over the past day and reached the same conclusion. My partial analysis (more APIs can be analyzed, I do not have cycles at this point) is as follows:
>
> The reader threads' information is maintained in a linked list [1]. This linked list is protected by a mutex lock [2]. Any additions/deletions/traversals of this list are blocking and cannot happen in parallel.
>
> The API 'synchronize_rcu' [3] (similar in functionality to the rte_tqs_check call) is a blocking call. There is no option provided to make it non-blocking. The writer spins cycles while waiting for the grace period to end.
>
> 'synchronize_rcu' also takes a grace period lock [4]. If I have multiple writers running on data plane threads, I cannot call this API to reclaim memory in the worker threads, as it will block the other worker threads. This means an extra thread is required (on the control plane?) which does garbage collection, along with a method to push the pointers from the worker threads to that garbage collection thread. This also means the time from delete to free increases, putting pressure on the amount of memory held up.
> Since this API cannot be called concurrently by multiple writers, each writer has to wait for the other writers' grace periods to end (i.e. multiple writer threads cannot overlap their grace periods).
>
> This API also has to traverse the linked list, which is not well suited for calling on the data plane.
>
> I have not gone too deep into the rcu_thread_offline [5] API. This again needs to be used on worker cores and does not look very optimal.
>
> I have glanced at rcu_quiescent_state [6]; it wakes up the thread calling 'synchronize_rcu', which seems like a good amount of code for the data plane.
>
> [1] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h#L85
> [2] https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c#L68
> [3] https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c#L344
> [4] https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c#L58
> [5] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h#L211
> [6] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h#L193
>
> Coming to what is provided in rte_tqs:
>
> The synchronize_rcu functionality is split into 2 APIs: rte_tqs_start and rte_tqs_check. The reader data is maintained as an array.
>
> Both APIs are lock-free, allowing them to be called from multiple threads concurrently. This allows multiple writers to wait for their grace periods concurrently as well as to overlap their grace periods. The rte_tqs_start API returns a token which provides the ability to separate the quiescent state waiting of different writers. Hence, no writer waits for another writer's grace period to end.
> Since these 2 APIs are lock-free, they can be called from writers running on worker cores as well, without the need for a separate thread to do garbage collection.
>
> The separation into 2 APIs gives writers the ability to not spin cycles waiting for the grace period to end. This enables different ways of doing garbage collection. For ex: a data structure delete API could remove the entry from the data structure, call rte_tqs_start and return to the caller. On the next API call into the library, the API can call rte_tqs_check (which will mostly indicate that the grace period is complete) and free the previously deleted entry.
>
> rte_tqs_update (mapping to rcu_quiescent_state) is pretty small and simple.
>
> The rte_tqs_register and rte_tqs_unregister APIs are lock-free. Hence additional APIs like rcu_thread_online and rcu_thread_offline are not required. The rte_tqs_unregister API (when compared to rcu_thread_offline) is much simpler and more conducive to use in worker threads.


liburcu has many flavours already, qsbr being one of them. If none of
those are optimal for this use case, why not work with upstream to
either improve the existing flavours, or add a new one?

You have the specific knowledge and expertise about the requirements and
needs of the implementation for this use case, and they have the long-time
and extensive experience of maintaining the library on a wide range
of systems and use cases. Why not combine both?
I might be wrong, but to me, nothing described above seems to be
particularly or uniquely tied to implementing a software dataplane or
to DPDK. IMHO we should pool and share wherever possible, rather than
build an ecosystem closed onto itself.

Just my 2c of course!

> > DPDK needs to fix its dependency model, and just admit that it is
> > ok to build off of more than glibc.
> >
> > Having used liburcu, it can be done in a small manner and really
> > isn't that confusing.
> >
> > Is your real issue the LGPL license of liburcu for your skittish
> > customers?
>
> I have not had any discussions on this. Customers are mainly focused
> on having a solution over which they have meaningful control. They want
> to be able to submit a patch and change things if required. For ex:
> the barriers for Arm [7] are not optimal. How easy is it to change this
> and get it into the distros (there are both internal and external factors
> here)?

It's just as easy (or as hard) as it is with DPDK. So it's either wait
for the distros to update, or rebuild locally.

-- 
Kind regards,
Luca Boccassi
Stephen Hemminger Nov. 30, 2018, 6:32 p.m. | #9
On Fri, 30 Nov 2018 16:26:25 +0000
Luca Boccassi <bluca@debian.org> wrote:

> On Fri, 2018-11-30 at 02:13 +0000, Honnappa Nagarahalli wrote:
> > >
> > > > > > Mixed feelings about this one.
> > > > > >
> > > > > > Love to see RCU used for more things since it is much better than
> > > > > > reader/writer locks for many applications. But hate to see DPDK
> > > > > > reinventing every other library and not reusing code. Userspace
> > > > > > RCU https://liburcu.org/ is widely supported by distros, more
> > > > > > thoroughly tested and documented, and more flexible.
> > > > > >
> > > > > > The issue with many of these reinventions is a tradeoff of DPDK
> > > > > > growing another dependency on external library versus using common code.
> > > > > >
> > > > Agree with the dependency issues. Sometimes flexibility also causes confusion and brings features that are not necessarily required for a targeted use case. I have seen that much of the functionality that could be left to the application is implemented as part of the library.
> > > > I think having it in DPDK will give us control over the amount of capability this library will have and freedom over the changes we would like to make to such a library. I also view DPDK as one package where all things required for data plane development are available.
> > > >
> > > > > > For RCU, the big issue for me is the testing and documentation of
> > > > > > how to do RCU safely. Many people get it wrong!
> > > > Hopefully, we all will do a better job collectively :)
> > >
> > > Reinventing RCU is not helping anyone.
> >
> > IMO, this depends on what rte_tqs has to offer and what the requirements are. Before starting this patch, I looked at the liburcu APIs. I have to say, fairly quickly (no offense) I concluded that they do not address DPDK's needs. I took a deeper look at the APIs/code over the past day and reached the same conclusion. My partial analysis (more APIs can be analyzed, I do not have cycles at this point) is as follows:
> >
> > The reader threads' information is maintained in a linked list [1]. This linked list is protected by a mutex lock [2]. Any additions/deletions/traversals of this list are blocking and cannot happen in parallel.
> >
> > The API 'synchronize_rcu' [3] (similar in functionality to the rte_tqs_check call) is a blocking call. There is no option provided to make it non-blocking. The writer spins cycles while waiting for the grace period to end.
> >
> > 'synchronize_rcu' also takes a grace period lock [4]. If I have multiple writers running on data plane threads, I cannot call this API to reclaim memory in the worker threads, as it will block the other worker threads. This means an extra thread is required (on the control plane?) which does garbage collection, along with a method to push the pointers from the worker threads to that garbage collection thread. This also means the time from delete to free increases, putting pressure on the amount of memory held up.
> > Since this API cannot be called concurrently by multiple writers, each writer has to wait for the other writers' grace periods to end (i.e. multiple writer threads cannot overlap their grace periods).
> >
> > This API also has to traverse the linked list, which is not well suited for calling on the data plane.
> >
> > I have not gone too deep into the rcu_thread_offline [5] API. This again needs to be used on worker cores and does not look very optimal.
> >
> > I have glanced at rcu_quiescent_state [6]; it wakes up the thread calling 'synchronize_rcu', which seems like a good amount of code for the data plane.
> >
> > [1] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h#L85
> > [2] https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c#L68
> > [3] https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c#L344
> > [4] https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c#L58
> > [5] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h#L211
> > [6] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h#L193
> >
> > Coming to what is provided in rte_tqs:
> >
> > The synchronize_rcu functionality is split into 2 APIs: rte_tqs_start and rte_tqs_check. The reader data is maintained as an array.
> >
> > Both APIs are lock-free, allowing them to be called from multiple threads concurrently. This allows multiple writers to wait for their grace periods concurrently as well as to overlap their grace periods. The rte_tqs_start API returns a token which provides the ability to separate the quiescent state waiting of different writers. Hence, no writer waits for another writer's grace period to end.
> > Since these 2 APIs are lock-free, they can be called from writers running on worker cores as well, without the need for a separate thread to do garbage collection.
> >
> > The separation into 2 APIs gives writers the ability to not spin cycles waiting for the grace period to end. This enables different ways of doing garbage collection. For ex: a data structure delete API could remove the entry from the data structure, call rte_tqs_start and return to the caller. On the next API call into the library, the API can call rte_tqs_check (which will mostly indicate that the grace period is complete) and free the previously deleted entry.
> >
> > rte_tqs_update (mapping to rcu_quiescent_state) is pretty small and simple.
> >
> > The rte_tqs_register and rte_tqs_unregister APIs are lock-free. Hence additional APIs like rcu_thread_online and rcu_thread_offline are not required. The rte_tqs_unregister API (when compared to rcu_thread_offline) is much simpler and more conducive to use in worker threads.
>
> liburcu has many flavours already, qsbr being one of them. If none of
> those are optimal for this use case, why not work with upstream to
> either improve the existing flavours, or add a new one?
>
> You have the specific knowledge and expertise about the requirements and
> needs of the implementation for this use case, and they have the long-time
> and extensive experience of maintaining the library on a wide range
> of systems and use cases. Why not combine both?
> I might be wrong, but to me, nothing described above seems to be
> particularly or uniquely tied to implementing a software dataplane or
> to DPDK. IMHO we should pool and share wherever possible, rather than
> build an ecosystem closed onto itself.
>
> Just my 2c of course!
>
> > > DPDK needs to fix its dependency model, and just admit that it is
> > > ok to build off of more than glibc.
> > >
> > > Having used liburcu, it can be done in a small manner and really
> > > isn't that confusing.
> > >
> > > Is your real issue the LGPL license of liburcu for your skittish
> > > customers?
> >
> > I have not had any discussions on this. Customers are mainly focused
> > on having a solution over which they have meaningful control. They want
> > to be able to submit a patch and change things if required. For ex:
> > the barriers for Arm [7] are not optimal. How easy is it to change this
> > and get it into the distros (there are both internal and external factors
> > here)?
>
> It's just as easy (or as hard) as it is with DPDK. So it's either wait
> for the distros to update, or rebuild locally.


Either way it would be useful to have the broader RCU community
in on the discussion.
Honnappa Nagarahalli Nov. 30, 2018, 8:20 p.m. | #10
Hi Luca,
	Appreciate your comments.

> liburcu has many flavours already, qsbr being one of them. If none of those
> are optimal for this use case, why not work with upstream to either improve
> the existing flavours, or add a new one?
>
> You have the specific knowledge and expertise about the requirements and
> needs of the implementation for this use case, and they have the long-time
> and extensive experience of maintaining the library on a wide range of
> systems and use cases. Why not combine both?
> I might be wrong, but to me, nothing described above seems to be
> particularly or uniquely tied to implementing a software dataplane or to
> DPDK.

This comment does not help much; I would prefer a more concrete comment. If possible, can you please mention where else these features are required? IMO, if these were required for other use cases, they would have been present in liburcu already.

> IMHO we should pool and share wherever possible, rather than build
> an ecosystem closed onto itself.
>
> Just my 2c of course!

I would say this is true for some of the other libraries in DPDK as well [1], or in fact for the whole of DPDK. The Linux kernel has introduced XDP. I would say the Linux kernel has a much more vibrant history and community. If XDP lacks some of the features we want, why not work with the Linux community to improve it and not do DPDK?

[1] https://github.com/urcu/userspace-rcu/blob/master/doc/uatomic-api.md

IMO, we should focus our discussion on how relevant rte_tqs is to DPDK, whether it solves the problems people face while using DPDK and the value add it brings. This should be the basis for the acceptance rather than what is available elsewhere.

>
> > > DPDK needs to fix its dependency model, and just admit that it is ok
> > > to build off of more than glibc.
> > >
> > > Having used liburcu, it can be done in a small manner and really
> > > isn't that confusing.
> > >
> > > Is your real issue the LGPL license of liburcu for your skittish
> > > customers?
> >
> > I have not had any discussions on this. Customers are mainly focused
> > on having a solution over which they have meaningful control. They want
> > to be able to submit a patch and change things if required. For ex:
> > the barriers for Arm [7] are not optimal. How easy is it to change this
> > and get it into the distros (there are both internal and external factors
> > here)?
>
> It's just as easy (or as hard) as it is with DPDK. So it's either wait for the
> distros to update, or rebuild locally.
>

I have not worked with that community, hence I cannot comment on that. Apologies for asking some hard questions, but do you know how easy it would be to change the license of liburcu? Can the DPDK governing board take a decision to change the license of liburcu?

> --
> Kind regards,
> Luca Boccassi
Mattias Rönnblom Nov. 30, 2018, 8:56 p.m. | #11
On 2018-11-30 03:13, Honnappa Nagarahalli wrote:
>>
>> Reinventing RCU is not helping anyone.
> IMO, this depends on what rte_tqs has to offer and what the requirements are. Before starting this patch, I looked at the liburcu APIs. I have to say, fairly quickly (no offense) I concluded that they do not address DPDK's needs. I took a deeper look at the APIs/code over the past day and reached the same conclusion. My partial analysis (more APIs can be analyzed, I do not have cycles at this point) is as follows:
>
> The reader threads' information is maintained in a linked list [1]. This linked list is protected by a mutex lock [2]. Any additions/deletions/traversals of this list are blocking and cannot happen in parallel.
>
> The API 'synchronize_rcu' [3] (similar in functionality to the rte_tqs_check call) is a blocking call. There is no option provided to make it non-blocking. The writer spins cycles while waiting for the grace period to end.
>

Wouldn't the options be call_rcu, which rarely blocks, or defer_rcu(),
which never does? Why would the average application want to wait for the
grace period to be over anyway?

> 'synchronize_rcu' also takes a grace period lock [4]. If I have multiple writers running on data plane threads, I cannot call this API to reclaim memory in the worker threads, as it will block the other worker threads. This means an extra thread is required (on the control plane?) which does garbage collection, along with a method to push the pointers from the worker threads to that garbage collection thread. This also means the time from delete to free increases, putting pressure on the amount of memory held up.
> Since this API cannot be called concurrently by multiple writers, each writer has to wait for the other writers' grace periods to end (i.e. multiple writer threads cannot overlap their grace periods).

"Real" DPDK applications typically have to interact with the outside
world using interfaces beyond DPDK packet I/O, and this is best done via
an intermediate "control plane" thread running in the DPDK application.
Typically, this thread would also be the RCU writer and "garbage
collector", I would say.

>
> This API also has to traverse the linked list, which is not well suited for calling on the data plane.
>
> I have not gone too deep into the rcu_thread_offline [5] API. This again needs to be used on worker cores and does not look very optimal.
>
> I have glanced at rcu_quiescent_state [6]; it wakes up the thread calling 'synchronize_rcu', which seems like a good amount of code for the data plane.
>

Wouldn't the typical DPDK lcore worker call rcu_quiescent_state() after
processing a burst of packets? If so, I would lean more toward
"negligible overhead" than "a good amount of code".

I must admit I didn't look at your library in detail, but I must still
ask: if TQS is basically RCU, why isn't it called RCU? And why aren't the
API calls named in a similar manner?
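[For context on the per-burst cost being debated here, a self-contained sketch of the reader-side loop. The names are hypothetical stand-ins built on plain C11 atomics, not the real liburcu or rte_tqs calls; with either library, the per-burst addition is conceptually the one call in the loop below.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

static atomic_uint_fast64_t writer_token = 1; /* bumped by the writer side */
static atomic_uint_fast64_t worker_seen;      /* this lcore's last report */

/* Hypothetical stand-in for receiving and processing one burst of packets. */
static size_t process_burst(void)
{
	return 32;
}

/* The entire per-burst QSBR cost on the reader side: one load and one
 * store, analogous to rcu_quiescent_state() or rte_tqs_update. */
static void report_quiescent(void)
{
	uint64_t t = atomic_load_explicit(&writer_token, memory_order_acquire);
	atomic_store_explicit(&worker_seen, t, memory_order_release);
}

/* Simplified lcore main loop: process a burst, then report that this
 * thread currently holds no references to shared data. */
static uint64_t run_worker(int bursts)
{
	uint64_t pkts = 0;
	for (int i = 0; i < bursts; i++) {
		pkts += process_burst();
		report_quiescent();
	}
	return pkts;
}
```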
Mattias Rönnblom Nov. 30, 2018, 9:03 p.m. | #12
On 2018-11-30 03:25, Honnappa Nagarahalli wrote:
>> Generally I'm in favour of using established libraries (particularly for complex
>> functionality like RCU) but in this case I think there's a tradeoff with raw
>> performance.
> The licensing info [1] is very interesting. Again, I am no lawyer :)
>
> [1] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h#L184

If you don't know the macro/inline function exception of LGPL 2.1, maybe
it's time to read the license text. Lawyer or not.
Stephen Hemminger Nov. 30, 2018, 11:44 p.m. | #13
On Fri, 30 Nov 2018 21:56:30 +0100
Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:

> On 2018-11-30 03:13, Honnappa Nagarahalli wrote:
> >>
> >> Reinventing RCU is not helping anyone.
> > IMO, this depends on what the rte_tqs has to offer and what the requirements are. Before starting this patch, I looked at the liburcu APIs. I have to say, fairly quickly (no offense) I concluded that this does not address DPDK's needs. I took a deeper look at the APIs/code in the past day and I still concluded the same. My partial analysis (analysis of more APIs can be done, I do not have cycles at this point) is as follows:
> >
> > The reader threads' information is maintained in a linked list[1]. This linked list is protected by a mutex lock[2]. Any additions/deletions/traversals of this list are blocking and cannot happen in parallel.
> >
> > The API, 'synchronize_rcu' [3] (similar functionality to rte_tqs_check call) is a blocking call. There is no option provided to make it non-blocking. The writer spins cycles while waiting for the grace period to get over.
> >
>
> Wouldn't the options be call_rcu, which rarely blocks, or defer_rcu() 
> which never? Why would the average application want to wait for the 
> grace period to be over anyway?
>
> > 'synchronize_rcu' also has grace period lock [4]. If I have multiple writers running on data plane threads, I cannot call this API to reclaim the memory in the worker threads as it will block other worker threads. This means, there is an extra thread required (on the control plane?) which does garbage collection and a method to push the pointers from worker threads to the garbage collection thread. This also means the time duration from delete to free increases, putting pressure on the amount of memory held up.
> > Since this API cannot be called concurrently by multiple writers, each writer has to wait for other writers' grace periods to get over (i.e. multiple writer threads cannot overlap their grace periods).
>
> "Real" DPDK applications typically have to interact with the outside 
> world using interfaces beyond DPDK packet I/O, and this is best done via 
> an intermediate "control plane" thread running in the DPDK application. 
> Typically, this thread would also be the RCU writer and "garbage 
> collector", I would say.
>
> > This API also has to traverse the linked list, which is not very well suited for calling on the data plane.
> >
> > I have not gone too much into the rcu_thread_offline[5] API. This again needs to be used in worker cores and does not look to be very optimal.
> >
> > I have glanced at rcu_quiescent_state [6]; it wakes up the thread calling 'synchronize_rcu', which seems like a good amount of code for the data plane.
> >
>
> Wouldn't the typical DPDK lcore worker call rcu_quiescent_state() after 
> processing a burst of packets? If so, I would lean more toward 
> "negligible overhead" than "a good amount of code".
>
> I must admit I didn't look at your library in detail, but I must still 
> ask: if TQS is basically RCU, why isn't it called RCU? And why aren't the 
> API calls named in a similar manner?



We used liburcu at Brocade with DPDK. It was just a case of putting rcu_quiescent_state in the packet handling
loop. There were a bunch more cases where the control thread needed to register/unregister as part of RCU.
I think any library would have that issue with user supplied threads. You need a "worry about me" and
a "don't worry about me" API in the library.

There is also a tradeoff between call_rcu and defer_rcu about what context the RCU callback happens in.
You really need a control thread to handle the RCU cleanup.

The point is that RCU steps into the application design, and liburcu seems to be flexible enough
and well documented enough to allow for more options.
Honnappa Nagarahalli Dec. 1, 2018, 6:37 p.m. | #14
> On Fri, 30 Nov 2018 21:56:30 +0100
> Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
>
> > On 2018-11-30 03:13, Honnappa Nagarahalli wrote:
> > >>
> > >> Reinventing RCU is not helping anyone.
> > > IMO, this depends on what the rte_tqs has to offer and what the requirements are. Before starting this patch, I looked at the liburcu APIs. I have to say, fairly quickly (no offense) I concluded that this does not address DPDK's needs. I took a deeper look at the APIs/code in the past day and I still concluded the same. My partial analysis (analysis of more APIs can be done, I do not have cycles at this point) is as follows:
> > >
> > > The reader threads' information is maintained in a linked list[1]. This linked list is protected by a mutex lock[2]. Any additions/deletions/traversals of this list are blocking and cannot happen in parallel.
> > >
> > > The API, 'synchronize_rcu' [3] (similar functionality to rte_tqs_check call) is a blocking call. There is no option provided to make it non-blocking. The writer spins cycles while waiting for the grace period to get over.
> > >
> >
> > Wouldn't the options be call_rcu, which rarely blocks, or defer_rcu()
> > which never?

call_rcu (I do not know about defer_rcu; have you looked at the implementation to verify your claim?) requires a separate thread that does garbage collection (this forces a programming model; the thread is even launched by the library). call_rcu() allows you to batch and defer the work to the garbage-collector thread. In the garbage-collector thread, when 'synchronize_rcu' is called, it spins for at least one grace period. Deferring and batching also have the side effect that memory is held up for a longer time.

> > Why would the average application want to wait for the
> > grace period to be over anyway?

I assume when you say 'average application', you mean the writer(s) are on the control plane.
It has been agreed (in the context of rte_hash) that writer(s) can be on the data plane. In this case, 'synchronize_rcu' cannot be called from the data plane. If call_rcu has to be called, it adds additional cycles on the data plane to push the pointers (or any data) to the garbage-collector thread. I kindly suggest you take a look at the liburcu code and the rte_tqs code for more details.
Additionally, the call_rcu function is more than 10 lines.
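[Editor's note] For illustration, here is a toy model (not liburcu's actual call_rcu machinery; all names are made up) of the deferral pattern under discussion: the data-plane side cheaply enqueues a pointer tagged with the grace period it must wait for, and a separate collector later frees whatever has become safe. It also shows the side effect mentioned above: entries whose grace period has not yet completed stay queued, i.e. the memory is held longer.

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy model of deferred reclamation (illustrative only). */
#define DEFER_MAX 128

struct deferred {
    void *ptr;          /* memory removed from the data structure */
    uint64_t target;    /* grace period that must complete before freeing */
};

static struct deferred dq[DEFER_MAX];
static unsigned dq_len;
static uint64_t gp_completed;   /* highest grace period known to be complete */

/* Data-plane side: cheap enqueue; the element needs one full grace
 * period after the current one before it can be freed. */
static void defer_free(void *p, uint64_t current_gp)
{
    dq[dq_len].ptr = p;
    dq[dq_len].target = current_gp + 1;
    dq_len++;
}

/* Collector side: free what is safe, keep the rest queued (held up). */
static unsigned collector_run(void)
{
    unsigned freed = 0, keep = 0;

    for (unsigned i = 0; i < dq_len; i++) {
        if (dq[i].target <= gp_completed) {
            free(dq[i].ptr);
            freed++;
        } else {
            dq[keep++] = dq[i];
        }
    }
    dq_len = keep;
    return freed;
}
```

The enqueue is cheap, but the cost is paid in memory lifetime: nothing is reclaimed until the collector runs and the required grace period has elapsed.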

> >
> > > 'synchronize_rcu' also has grace period lock [4]. If I have multiple writers running on data plane threads, I cannot call this API to reclaim the memory in the worker threads as it will block other worker threads. This means, there is an extra thread required (on the control plane?) which does garbage collection and a method to push the pointers from worker threads to the garbage collection thread. This also means the time duration from delete to free increases, putting pressure on the amount of memory held up.
> > > Since this API cannot be called concurrently by multiple writers, each writer has to wait for other writers' grace periods to get over (i.e. multiple writer threads cannot overlap their grace periods).
> >
> > "Real" DPDK applications typically have to interact with the outside
> > world using interfaces beyond DPDK packet I/O, and this is best done
> > via an intermediate "control plane" thread running in the DPDK application.
> > Typically, this thread would also be the RCU writer and "garbage
> > collector", I would say.
> >

Agree, that is one way to do it, and it comes with its own issues, as I described above.

> > >
> > > This API also has to traverse the linked list, which is not very well suited for calling on the data plane.
> > >
> > > I have not gone too much into the rcu_thread_offline[5] API. This again needs to be used in worker cores and does not look to be very optimal.
> > >
> > > I have glanced at rcu_quiescent_state [6]; it wakes up the thread calling 'synchronize_rcu', which seems like a good amount of code for the data plane.
> > >
> >
> > Wouldn't the typical DPDK lcore worker call rcu_quiescent_state()
> > after processing a burst of packets? If so, I would lean more toward
> > "negligible overhead" than "a good amount of code".

DPDK is being used in embedded and real-time applications as well. There, processing a burst of packets is not possible due to low-latency requirements. Hence it is not possible to amortize the cost.

> >
> > I must admit I didn't look at your library in detail, but I must still
> > ask: if TQS is basically RCU, why isn't it called RCU? And why aren't
> > the API calls named in a similar manner?

I kindly request you to take a look at the patch. More than that, if you have not done so already, please take a look at the liburcu implementation as well.
TQS is not RCU (Read-Copy-Update); TQS helps implement RCU. TQS helps determine when the threads have passed through the quiescent state.
I am also not sure why the name liburcu has RCU in it; it does not do any Read-Copy-Update itself.
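[Editor's note] To make the distinction concrete: a quiescent-state tracker only answers "when have all pre-existing readers moved on?", and the application builds the RCU-style two-step reclaim (delete, then free) on top of it. A minimal single-threaded sketch follows; the tracker check is stubbed out and trivially true here, whereas real code would poll per-thread quiescent counters. All names are illustrative.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct node {
    int value;
};

static struct node *_Atomic head;

/* Step 1 -- Delete: atomically unlink the element so that NEW readers
 * can no longer obtain a reference to it. The memory is NOT freed yet. */
static struct node *list_delete(void)
{
    return atomic_exchange(&head, NULL);
}

/* Stand-in for a quiescent-state check (trivially true in this
 * single-threaded sketch; a real tracker polls per-thread counters). */
static int all_readers_quiesced(void)
{
    return 1;
}

/* Step 2 -- Free: return the memory to the allocator only once every
 * pre-existing reader has passed through a quiescent state. */
static void list_free(struct node *removed)
{
    while (!all_readers_quiesced())
        ;   /* with a non-blocking check, do other work and retry instead */
    free(removed);
}
```

The quiescent-state library stays out of the delete and free operations themselves; it only tells the writer when step 2 is safe.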

>
> We used liburcu at Brocade with DPDK. It was just a case of putting
> rcu_quiescent_state in the packet handling
> loop. There were a bunch more cases where the control thread needed to
> register/unregister as part of RCU.

I assume that the packet-handling loop was a polling loop (correct me if I am wrong). With the support of eventdev, we have the rte_event_dequeue_burst API, which supports blocking till packets are available (or blocking for an extended period of time). This means that, before calling this API, the thread needs to say "don't worry about me". Once this API returns, it needs to say "worry about me". So these two APIs need to be efficient. Please look at the rte_tqs_register/unregister APIs.

> I think any library would have that issue with user supplied threads. You need
> a "worry about me" and
> a "don't worry about me" API in the library.
>
> There is also a tradeoff between call_rcu and defer_rcu about what context
> the RCU callback happens in.
> You really need a control thread to handle the RCU cleanup.

That is, if you choose to use liburcu. rte_tqs provides the ability to do cleanup efficiently, without the need for a control-plane thread, in DPDK use cases.

>
> The point is that RCU steps into the application design, and liburcu seems to
> be flexible enough
> and well documented enough to allow for more options.

Agree that RCU steps into the application design. That is the reason rte_tqs does just enough and gives the application the flexibility to implement RCU however it sees fit. DPDK has also stepped into application design by providing libraries like hash, LPM, etc.

I do not understand why you think liburcu is flexible enough for DPDK use cases. I mentioned the specific use cases where liburcu is not useful, and I did not find anything in the documentation to help me solve them. I would appreciate it if you could help me understand how liburcu can be used to solve these use cases.