[RFC,v2,0/2] rcu: add RCU library supporting QSBR mechanism

Message ID 20181222021420.5114-1-honnappa.nagarahalli@arm.com

Honnappa Nagarahalli Dec. 22, 2018, 2:14 a.m. UTC
Lock-less data structures provide scalability and determinism.
They enable use cases where locking may not be allowed
(for ex: real-time applications).

In the following paragraphs, the term 'memory' refers to memory allocated
by typical APIs like malloc, or anything that is representative of
memory, for ex: an index of a free element array.

Since these data structures are lock-less, the writers and readers
access them simultaneously. Hence, while removing an element from a
data structure, the writer cannot return the memory to the allocator
without knowing that the readers are no longer referencing that
element/memory. It is therefore required to separate the operation of
removing an element into 2 steps:

Delete: in this step, the writer removes the reference to the element from
the data structure but does not return the associated memory to the
allocator. This will ensure that new readers will not get a reference to
the removed element. Removing the reference is an atomic operation.

Free(Reclaim): in this step, the writer returns the memory to the
memory allocator, only after knowing that all the readers have stopped
referencing the deleted element.

This library helps the writer determine when it is safe to free the
memory.
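
As an illustration of the two steps, here is a rough sketch using plain C11
atomics and a hypothetical singly linked list; the node layout and helper
names are illustrative only and not part of this library.

#include <stdatomic.h>
#include <stdlib.h>

struct node {
	struct node *_Atomic next;
	/* application data */
};

/* Delete: atomically unlink 'n' so that new readers cannot reach it.
 * Assumes a single writer. The memory is NOT returned to the allocator
 * here. */
static void
delete_node(struct node *_Atomic *prev_next, struct node *n)
{
	struct node *nxt = atomic_load_explicit(&n->next, memory_order_relaxed);

	atomic_store_explicit(prev_next, nxt, memory_order_release);
}

/* Free (Reclaim): called only after this library reports that all readers
 * have passed through a quiescent state since the delete. */
static void
reclaim_node(struct node *n)
{
	free(n);
}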

This library makes use of thread Quiescent State (QS). QS can be
defined as 'any point in the thread execution where the thread does
not hold a reference to shared memory'. It is up to the application to
determine its quiescent state. Let us consider the following diagram:

    Time -------------------------------------------------->

                                |     |
  RT1   $++++****D1****+++***D2*|**+++|+++**D3*****++++$
                                |     |
  RT2          $++++****D1****++|+**D2|***++++++**D3*****++++$
                                |     |
  RT3      $++++****D1****+++***|D2***|++++++**D3*****++++$
                                |     |
                                |<--->|
                               Del | Free
                                   |
                              Cannot free memory
                              during this period
                              (Grace Period)

RTx - Reader thread
$ - Start and end of while(1) loop
***Dx*** - Reader thread is accessing the shared data structure Dx.
           i.e. critical section.
+++ - Reader thread is not accessing any shared data structure.
      i.e. non critical section or quiescent state.
Del - Point in time when the reference to the entry is removed using
      atomic operation.
Free - Point in time when the writer can free the entry.
Grace Period - Time duration between Del and Free, during which memory cannot
               be freed.

As shown, thread RT1 accesses data structures D1, D2 and D3. When it is
accessing D2, if the writer has to remove an element from D2, the
writer cannot free the memory associated with that element immediately.
The writer can return the memory to the allocator only after the reader
stops referencing D2. In other words, reader thread RT1 has to enter
a quiescent state.

Similarly, since thread RT3 is also accessing D2, the writer has to wait
until RT3 enters a quiescent state as well.

However, the writer does not need to wait for RT2 to enter quiescent state.
Thread RT2 was not accessing D2 when the delete operation happened.
So, RT2 will not get a reference to the deleted entry.

Note that the critical sections for D2 and D3 are quiescent states
for D1, i.e. for a given data structure Dx, any point in the thread execution
that does not reference Dx is a quiescent state for Dx.

Since memory is not freed immediately, there might be a need for
provisioning of additional memory, depending on the application requirements.

It is important that this library keeps the overhead of identifying the
end of the grace period and the subsequent freeing of memory to a
minimum. One has to understand how the grace period and the critical
section affect this overhead.

The writer has to poll the readers to identify the end of the grace period.
Polling introduces memory accesses and wastes CPU cycles. In addition, the
memory is not available for reuse during the grace period. Longer grace
periods exacerbate these problems.

The duration of the grace period is proportional to the length of the
critical sections and the number of reader threads. Keeping the critical
sections smaller keeps the grace period smaller. However, smaller critical
sections require additional CPU cycles (due to additional reporting) in
the readers.

Hence, what is needed is a small grace period combined with large critical
sections. This library addresses this by allowing the writer to do
other work without having to block until the readers report their quiescent
state.

For DPDK applications, the start and end of the while(1) loop (where no
shared data structures are accessed) act as perfect quiescent states. This
combines all the shared data structure accesses into a single, large
critical section, which helps keep the overhead on the reader side to
a minimum.
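
To make that placement concrete, here is a rough sketch of such a reader
loop; process_d1()/process_d2() are hypothetical application functions, and
the actual reporting call is shown with the rte_rcu_qsbr_update API further
below.

/* Hypothetical per-data-structure work, for illustration only. */
static void process_d1(void) { }
static void process_d2(void) { }

static int
reader_lcore(void *arg)
{
	(void)arg;

	while (1) {
		/* Single, large critical section: all shared data
		 * structure accesses happen here. */
		process_d1();
		process_d2();

		/* Quiescent state: no references to shared data structures
		 * are held at this point in the loop. Report it here, once
		 * per iteration (see rte_rcu_qsbr_update below). */
	}

	return 0;	/* not reached */
}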

DPDK supports the pipeline model of packet processing and service cores.
In these use cases, a given data structure may not be used by all the
workers in the application. To provide the required flexibility, this
library has a concept of QS variable. The application can create one
QS variable per data structure to help it track the end of grace
period for each data structure.

The application can initialize a QS variable using the API rte_rcu_qsbr_init.

Each reader thread is assumed to have a unique thread ID. Currently, the
management of the thread ID (for ex: allocation/free) is left to the
application. The thread ID should be in the range of 0 to
RTE_RCU_MAX_THREADS. The application could also use lcore_id as the
thread ID where applicable.
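
For illustration, allocating and initializing one QS variable for a data
structure D2 might look roughly like the sketch below. The struct name, the
allocation scheme and the exact prototype of rte_rcu_qsbr_init are
assumptions; rte_rcu_qsbr.h in this series is authoritative.

#include <rte_lcore.h>
#include <rte_malloc.h>
#include <rte_rcu_qsbr.h>

/* One QS variable per shared data structure; this one tracks readers of D2. */
static struct rte_rcu_qsbr *d2_qsv;

static int
d2_qsv_setup(void)
{
	/* Size and alignment assumed for illustration. */
	d2_qsv = rte_zmalloc("D2_QSV", sizeof(*d2_qsv), RTE_CACHE_LINE_SIZE);
	if (d2_qsv == NULL)
		return -1;

	rte_rcu_qsbr_init(d2_qsv);	/* prototype assumed for illustration */

	return 0;
}

/* The lcore_id doubles as the thread ID; it must stay within
 * [0, RTE_RCU_MAX_THREADS). */
static inline unsigned int
reader_thread_id(void)
{
	return rte_lcore_id();
}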

The rte_rcu_qsbr_register_thread API will register a reader thread
to report its quiescent state. This can be called from a reader thread.
A control plane thread can also call this on behalf of a reader thread.
However, the application must ensure that the reader thread is ready to
report the QS status before the writer checks the QS.
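
A control plane thread could, for example, register each worker before
launching it, so that every reader is accounted for before the writer starts
checking. The prototype of rte_rcu_qsbr_register_thread and the use of lcore
ids below are assumptions for illustration; reader_lcore is the loop sketch
shown earlier.

#include <rte_launch.h>

static void
launch_readers(struct rte_rcu_qsbr *v)
{
	unsigned int lcore_id;

	RTE_LCORE_FOREACH_SLAVE(lcore_id) {
		/* Register before launch, so the reader is ready to report
		 * its QS before any writer checks it. */
		rte_rcu_qsbr_register_thread(v, lcore_id);
		rte_eal_remote_launch(reader_lcore, v, lcore_id);
	}
}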

The application can trigger the reader threads to report their QS
status by calling the API rte_rcu_qsbr_start. It is possible for multiple
writer threads to query the quiescent state status simultaneously. Hence,
rte_rcu_qsbr_start returns a token to each caller.

The application has to call the rte_rcu_qsbr_check API with the token to get
the current QS status. An option to block until all the reader threads enter
the QS is provided. If this API indicates that all the reader threads have
entered the QS, the application can free the deleted entry.
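
Put together, a blocking writer-side sequence could look roughly like this.
The return type of rte_rcu_qsbr_start, the token type and the 'wait'
argument of rte_rcu_qsbr_check are assumptions for illustration;
delete_node/reclaim_node come from the earlier sketch.

static void
writer_remove_and_free(struct rte_rcu_qsbr *v,
		       struct node *_Atomic *prev_next, struct node *n)
{
	uint64_t token;

	/* Step 1: Delete - remove the reference atomically. */
	delete_node(prev_next, n);

	/* Trigger the registered readers to report their QS;
	 * each caller gets its own token. */
	token = rte_rcu_qsbr_start(v);

	/* Block until every registered reader has entered a QS
	 * after the token was issued ('wait' argument assumed). */
	rte_rcu_qsbr_check(v, token, true);

	/* Step 2: Free - it is now safe to return the memory. */
	reclaim_node(n);
}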

The APIs rte_rcu_qsbr_start and rte_rcu_qsbr_check are lock-free. Hence, they
can be called concurrently from multiple writers even while running
as worker threads.

The separation of triggering the reporting from querying the status provides
the writer threads flexibility to do useful work instead of blocking for the
reader threads to enter the QS. This reduces the memory accesses due
to continuous polling for the status.
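
The same APIs also allow a non-blocking pattern: the writer remembers the
token alongside the deleted element and frees it later, once a non-blocking
check succeeds. The 'wait = false' argument and the return value convention
of rte_rcu_qsbr_check below are assumptions for illustration.

struct deferred_free {
	uint64_t token;		/* token returned by rte_rcu_qsbr_start */
	struct node *elem;	/* deleted, not yet reclaimed */
};

static void
writer_try_reclaim(struct rte_rcu_qsbr *v, struct deferred_free *d)
{
	if (d->elem == NULL)
		return;

	/* Non-blocking check; assumed to return 1 when all registered
	 * readers have reported a QS after d->token. */
	if (rte_rcu_qsbr_check(v, d->token, false) == 1) {
		reclaim_node(d->elem);
		d->elem = NULL;
	}
	/* Otherwise: keep doing useful work and try again later. */
}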

The rte_rcu_qsbr_unregister_thread API will remove a reader thread from
reporting its QS. The rte_rcu_qsbr_check API will no longer wait for this
reader thread to report its QS status.
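
For example, a reader that is about to block for a long time or exit can
take itself out of the picture so that writers do not wait on it; the
prototype below is an assumption for illustration.

static void
reader_going_away(struct rte_rcu_qsbr *v, unsigned int thread_id)
{
	/* After this call, rte_rcu_qsbr_check no longer waits for this
	 * thread's QS reports. */
	rte_rcu_qsbr_unregister_thread(v, thread_id);
}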

The reader threads should call the rte_rcu_qsbr_update API to indicate that
they have entered a quiescent state. This API checks whether a writer has
triggered a quiescent state query and updates the state accordingly.
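
Tying this back to the reader loop sketched earlier, the reporting call sits
at the quiescent point of the loop. The prototype of rte_rcu_qsbr_update is
an assumption for illustration, and the thread is assumed to have been
registered already (e.g. by the control plane sketch above).

static int
reader_lcore_qsbr(void *arg)
{
	struct rte_rcu_qsbr *v = arg;
	unsigned int thread_id = rte_lcore_id();

	while (1) {
		/* Critical section: access D1, D2, D3, ... */
		process_d1();
		process_d2();

		/* Quiescent state: cheap when no writer has triggered a
		 * query, otherwise records this thread's QS. */
		rte_rcu_qsbr_update(v, thread_id);
	}

	return 0;	/* not reached */
}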

v2:
1) Cover letter changes
   a) Explain the parameters that affect the overhead of using RCU
      and their effect
   b) Explain how this library addresses these effects to keep
      the overhead to a minimum
2) Library changes
   a) Rename the library to avoid confusion (Matias, Bruce, Konstantin)
   b) Simplify the code/remove APIs to keep this library inline with
      other synchronisation mechanisms like locks (Konstantin)
   c) Change the design to support more than 64 threads (Konstantin)
   d) Fixed version map to remove static inline functions
3) Testcase changes
   a) Add boundary and additional functional test cases
   b) Add stress test cases (Paul E. McKenney)

Next Steps:
1) rte_rcu_qsbr_register_thread/rte_rcu_qsbr_unregister_thread can be
   optimized to avoid accessing the common bitmap array. This is required
   as these are data plane APIs. Plan is to introduce
   rte_rcu_qsbr_thread_online/rte_rcu_qsbr_thread_offline which will not
   touch the common bitmap array.
2) Add debug logs to enable debugging
3) Documentation
4) Convert to patch

Dharmik Thakkar (1):
  test/rcu_qsbr: add API and functional tests

Honnappa Nagarahalli (1):
  rcu: add RCU library supporting QSBR mechanism

 config/common_base                 |   6 +
 lib/Makefile                       |   2 +
 lib/librte_rcu/Makefile            |  23 +
 lib/librte_rcu/meson.build         |   5 +
 lib/librte_rcu/rte_rcu_qsbr.c      |  63 +++
 lib/librte_rcu/rte_rcu_qsbr.h      | 321 ++++++++++++
 lib/librte_rcu/rte_rcu_version.map |   8 +
 lib/meson.build                    |   2 +-
 mk/rte.app.mk                      |   1 +
 test/test/Makefile                 |   2 +
 test/test/autotest_data.py         |   6 +
 test/test/meson.build              |   5 +-
 test/test/test_rcu_qsbr.c          | 801 +++++++++++++++++++++++++++++
 13 files changed, 1243 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_rcu/Makefile
 create mode 100644 lib/librte_rcu/meson.build
 create mode 100644 lib/librte_rcu/rte_rcu_qsbr.c
 create mode 100644 lib/librte_rcu/rte_rcu_qsbr.h
 create mode 100644 lib/librte_rcu/rte_rcu_version.map
 create mode 100644 test/test/test_rcu_qsbr.c

-- 
2.17.1