Message ID: 20190426044000.32670-1-honnappa.nagarahalli@arm.com
Series: lib/rcu: add RCU library supporting QSBR mechanism
> -----Original Message-----
> From: Honnappa Nagarahalli [mailto:honnappa.nagarahalli@arm.com]
> Sent: Friday, April 26, 2019 5:40 AM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; stephen@networkplumber.org; paulmck@linux.ibm.com; Kovacevic, Marko <marko.kovacevic@intel.com>; dev@dpdk.org
> Cc: honnappa.nagarahalli@arm.com; gavin.hu@arm.com; dharmik.thakkar@arm.com; malvika.gupta@arm.com
> Subject: [PATCH v8 0/4] lib/rcu: add RCU library supporting QSBR mechanism
>
> Lock-less data structures provide scalability and determinism.
> They enable use cases where locking may not be allowed
> (for example, real-time applications).
>
> In the following paragraphs, the term 'memory' refers to memory
> allocated by typical APIs like malloc, or anything that is
> representative of memory, for example an index into a free-element
> array.
>
> Since these data structures are lock-less, the writers and readers
> access them concurrently. Hence, while removing an element from a
> data structure, the writer cannot return the memory to the allocator
> without knowing that no reader is still referencing that
> element/memory. The operation of removing an element must therefore
> be separated into 2 steps:
>
> Delete: in this step, the writer removes the reference to the element
> from the data structure but does not return the associated memory to
> the allocator. This ensures that new readers will not get a reference
> to the removed element. Removing the reference is an atomic operation.
>
> Free (Reclaim): in this step, the writer returns the memory to the
> memory allocator, only after knowing that all the readers have stopped
> referencing the deleted element.
>
> This library helps the writer determine when it is safe to free the
> memory.
>
> This library makes use of thread Quiescent State (QS). QS can be
> defined as 'any point in the thread execution where the thread does
> not hold a reference to shared memory'. It is up to the application to
> determine its quiescent state. Consider the following diagram:
>
> Time -------------------------------------------------->
>
>                                 |     |
> RT1   $++++****D1****+++***D2*|**+++|+++**D3*****++++$
>                                 |     |
> RT2      $++++****D1****++|+**D2|***++++++**D3*****++++$
>                                 |     |
> RT3   $++++****D1****+++***|D2***|++++++**D3*****++++$
>                                 |     |
>                                 |<--->|
>                                Del    | Free
>                                       |
>                         Cannot free memory
>                         during this period
>                         (Grace Period)
>
> RTx - Reader thread
> $ ... $ - Start and end of while(1) loop
> ***Dx*** - Reader thread is accessing the shared data structure Dx,
>            i.e. critical section.
> +++ - Reader thread is not accessing any shared data structure,
>       i.e. non-critical section or quiescent state.
> Del - Point in time when the reference to the entry is removed using
>       an atomic operation.
> Free - Point in time when the writer can free the entry.
> Grace Period - Time duration between Del and Free, during which memory
>                cannot be freed.
>
> As shown, thread RT1 accesses data structures D1, D2 and D3. When it
> is accessing D2, if the writer has to remove an element from D2, the
> writer cannot free the memory associated with that element
> immediately. The writer can return the memory to the allocator only
> after the reader stops referencing D2. In other words, reader thread
> RT1 has to enter a quiescent state.
>
> Similarly, since thread RT3 is also accessing D2, the writer has to
> wait until RT3 enters a quiescent state as well.
>
> However, the writer does not need to wait for RT2 to enter a quiescent
> state: RT2 was not accessing D2 when the delete operation happened, so
> RT2 will not get a reference to the deleted entry.
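>
> To make the two steps concrete, here is a minimal sketch (purely
> illustrative; 'struct node', 'delete_head' and the single-writer
> assumption are not part of this patch):
>
>     #include <stdlib.h>
>
>     struct node {
>         struct node *next;
>         int data;
>     };
>
>     /* Step 1 (Delete): atomically unlink the element so that no new
>      * reader can obtain a reference to it. Readers that already hold
>      * a reference may still be using it. */
>     static struct node *
>     delete_head(struct node **head)
>     {
>         struct node *n = __atomic_load_n(head, __ATOMIC_RELAXED);
>
>         /* Single writer assumed: publish the new head atomically. */
>         __atomic_store_n(head, n->next, __ATOMIC_RELEASE);
>         return n;
>     }
>
>     /* Step 2 (Free/Reclaim): return the memory only after the grace
>      * period, i.e. after every reader that might still hold a
>      * reference has passed through a quiescent state. */
>     static void
>     reclaim(struct node *n)
>     {
>         /* wait for the grace period here (see the QSBR APIs below) */
>         free(n);
>     }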
> It can be noted that the critical sections for D2 and D3 are quiescent
> states for D1, i.e. for a given data structure Dx, any point in the
> thread execution that does not reference Dx is a quiescent state.
>
> Since memory is not freed immediately, there might be a need to
> provision additional memory, depending on the application
> requirements.
>
> It is important to make sure that this library keeps the overhead of
> identifying the end of the grace period, and of the subsequent freeing
> of memory, to a minimum. The following paragraphs explain how the
> grace period and critical sections affect this overhead.
>
> The writer has to poll the readers to identify the end of the grace
> period. Polling introduces memory accesses and wastes CPU cycles. The
> memory is not available for reuse during the grace period. Longer
> grace periods exacerbate these conditions.
>
> The duration of the grace period is proportional to the length of the
> critical sections and the number of reader threads. Keeping the
> critical sections smaller will keep the grace period smaller. However,
> smaller critical sections require additional CPU cycles (due to
> additional reporting) in the readers.
>
> Hence, the desirable characteristics are a small grace period combined
> with large critical sections. This library addresses this by allowing
> the writer to do other work without having to block until the readers
> report their quiescent state.
>
> For DPDK applications, the start and end of the while(1) loop (where
> no references to shared data structures are kept) act as perfect
> quiescent states. This combines all the shared data structure accesses
> into a single, large critical section, which helps keep the overhead
> on the reader side to a minimum.
>
> DPDK supports the pipeline model of packet processing and service
> cores. In these use cases, a given data structure may not be used by
> all the workers in the application. The writer does not have to wait
> for all the workers to report their quiescent state. To provide the
> required flexibility, this library has the concept of a QS variable.
> The application can create one QS variable per data structure to help
> it track the end of the grace period for each data structure. This
> helps keep the grace period to a minimum.
>
> The application has to allocate memory for and initialize a QS
> variable.
>
> The application can call rte_rcu_qsbr_get_memsize to calculate the
> size of memory to allocate. This API takes the maximum number of
> reader threads that will use this variable as a parameter. Currently,
> a maximum of 1024 threads is supported.
>
> Further, the application can initialize a QS variable using the API
> rte_rcu_qsbr_init.
>
> Each reader thread is assumed to have a unique thread ID. Currently,
> the management of the thread ID (for example, allocation/free) is left
> to the application. The thread ID should be in the range of 0 to
> (maximum number of threads - 1) provided while creating the QS
> variable. The application could also use the lcore_id as the thread ID
> where applicable.
>
> The rte_rcu_qsbr_thread_register API will register a reader thread to
> report its quiescent state. This can be called from a reader thread.
> A control plane thread can also call this on behalf of a reader
> thread. The reader thread must call the rte_rcu_qsbr_thread_online API
> to start reporting its quiescent state.
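>
> As an illustration (a minimal sketch, not taken from the patch;
> MAX_READER_THREADS and the function names here are made up), the
> allocation, initialization and reader setup could look like:
>
>     #include <rte_rcu_qsbr.h>
>     #include <rte_malloc.h>
>
>     #define MAX_READER_THREADS 8
>
>     /* Control plane: size, allocate and initialize the QS variable,
>      * one per shared data structure to be tracked. */
>     static struct rte_rcu_qsbr *
>     create_qs_var(void)
>     {
>         size_t sz = rte_rcu_qsbr_get_memsize(MAX_READER_THREADS);
>         struct rte_rcu_qsbr *v;
>
>         v = rte_zmalloc(NULL, sz, RTE_CACHE_LINE_SIZE);
>         if (v == NULL || rte_rcu_qsbr_init(v, MAX_READER_THREADS) != 0)
>             return NULL;
>         return v;
>     }
>
>     /* Reader thread (or control plane on its behalf): register, then
>      * go online before accessing the shared data structures. */
>     static void
>     reader_setup(struct rte_rcu_qsbr *v, unsigned int thread_id)
>     {
>         rte_rcu_qsbr_thread_register(v, thread_id);
>         rte_rcu_qsbr_thread_online(v, thread_id);
>     }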
>
> Some use cases might require the reader threads to make blocking API
> calls (for example, while using eventdev APIs). The writer thread
> should not wait for such reader threads to enter a quiescent state.
> The reader thread must call the rte_rcu_qsbr_thread_offline API before
> calling blocking APIs. It can call the rte_rcu_qsbr_thread_online API
> once the blocking API call returns.
>
> The writer thread can trigger the reader threads to report their
> quiescent state by calling the API rte_rcu_qsbr_start. It is possible
> for multiple writer threads to query the quiescent state status
> simultaneously. Hence, rte_rcu_qsbr_start returns a token to each
> caller.
>
> The writer thread has to call the rte_rcu_qsbr_check API with the
> token to get the current quiescent state status. An option to block
> until all the reader threads enter the quiescent state is provided. If
> this API indicates that all the reader threads have entered the
> quiescent state, the application can free the deleted entry.
>
> The APIs rte_rcu_qsbr_start and rte_rcu_qsbr_check are lock-free.
> Hence, they can be called concurrently from multiple writers, even
> while running as worker threads.
>
> The separation of triggering the reporting from querying the status
> gives the writer threads the flexibility to do useful work instead of
> blocking until the reader threads enter the quiescent state or go
> offline. This reduces the memory accesses due to continuous polling
> for the status.
>
> The rte_rcu_qsbr_synchronize API combines the functionality of
> rte_rcu_qsbr_start and a blocking rte_rcu_qsbr_check into a single
> API. It triggers the reader threads to report their quiescent state
> and polls until all the readers enter the quiescent state or go
> offline. This API does not allow the writer to do useful work while
> waiting, and it also introduces additional memory accesses due to the
> continuous polling.
>
> The reader thread must call the rte_rcu_qsbr_thread_offline and
> rte_rcu_qsbr_thread_unregister APIs to stop reporting its quiescent
> state. The rte_rcu_qsbr_check API will then no longer wait for this
> reader thread to report its quiescent state status.
>
> The reader threads should call the rte_rcu_qsbr_quiescent API (renamed
> from rte_rcu_qsbr_update in v2) to indicate that they have entered a
> quiescent state. This API checks if a writer has triggered a quiescent
> state query and updates the state accordingly.
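>
> Putting the writer and reader sides together, a minimal sketch (again
> illustrative; the element removal and the function names are made-up
> application code, not part of this library):
>
>     #include <stdbool.h>
>     #include <rte_rcu_qsbr.h>
>     #include <rte_malloc.h>
>
>     /* Writer: delete, trigger the query, do other work, then free. */
>     static void
>     writer_delete_and_free(struct rte_rcu_qsbr *v, void *elem)
>     {
>         uint64_t token;
>
>         /* Delete: atomically remove the reference to 'elem' from the
>          * data structure (application specific, see earlier sketch). */
>
>         token = rte_rcu_qsbr_start(v);
>
>         /* ... do other useful work here instead of blocking ... */
>
>         /* Free: blocking check; returns once all registered readers
>          * have entered a quiescent state or gone offline. With
>          * wait == false, the writer could poll and keep working. */
>         rte_rcu_qsbr_check(v, token, true);
>         rte_free(elem);    /* assuming rte_malloc-family allocation */
>     }
>
>     /* Reader: one quiescent state report per while(1) iteration. */
>     static void
>     reader_loop(struct rte_rcu_qsbr *v, unsigned int thread_id)
>     {
>         while (1) {
>             /* ... access the shared data structures ... */
>             rte_rcu_qsbr_quiescent(v, thread_id);
>         }
>     }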
>
> Patch v8:
> 1) Library changes
>    a) Symbols prefixed with '__RTE' or 'rte_' as required (Thomas)
>    b) Used PRI?64 macros to support 32b compilation (Thomas)
>    c) Fixed shared library compilation (Thomas)
> 2) Test cases
>    a) Fixed segmentation fault when more than 20 cores are used for
>       testing (Jerin)
>    b) Used PRI?64 macros to support 32b compilation (Thomas)
>    c) Testing done on x86, ThunderX2, Octeon TX, BlueField for
>       32b (x86 only)/64b, debug/non-debug, shared/static linking,
>       meson/makefile with various numbers of cores
>
> Patch v7:
> 1) Library changes
>    a) Added macro RCU_IS_LOCK_CNT_ZERO
>    b) Added lock counter validation to rte_rcu_qsbr_thread_online/
>       rte_rcu_qsbr_thread_offline/rte_rcu_qsbr_thread_register/
>       rte_rcu_qsbr_thread_unregister APIs (Paul)
>
> Patch v6:
> 1) Library changes
>    a) Fixed and tested meson build on Arm and x86 (Konstantin)
>    b) Moved rte_rcu_qsbr_synchronize API to rte_rcu_qsbr.c
>
> Patch v5:
> 1) Library changes
>    a) Removed extra alignment in rte_rcu_qsbr_get_memsize API (Paul)
>    b) Added rte_rcu_qsbr_lock/rte_rcu_qsbr_unlock APIs (Paul)
>    c) Clarified the need for 64b counters (Paul)
> 2) Test cases
>    a) Added additional performance test cases to benchmark
>       __rcu_qsbr_check_all
>    b) Added rte_rcu_qsbr_lock/rte_rcu_qsbr_unlock calls in various
>       test cases
> 3) Documentation
>    a) Added rte_rcu_qsbr_lock/rte_rcu_qsbr_unlock usage description
>
> Patch v4:
> 1) Library changes
>    a) Fixed the compilation issue on x86 (Konstantin)
>    b) Rebased with latest master
>
> Patch v3:
> 1) Library changes
>    a) Moved the registered thread ID array to the end of the
>       structure (Konstantin)
>    b) Removed the compile time constant RTE_RCU_MAX_THREADS
>    c) Added code to keep track of the registered number of threads
>
> Patch v2:
> 1) Library changes
>    a) Corrected the RTE_ASSERT checks (Konstantin)
>    b) Replaced RTE_ASSERT with 'if' checks for non-datapath APIs
>       (Konstantin)
>    c) Made rte_rcu_qsbr_thread_register/unregister non-datapath
>       critical APIs
>    d) Renamed rte_rcu_qsbr_update to rte_rcu_qsbr_quiescent (Ola)
>    e) Used rte_smp_mb() in rte_rcu_qsbr_thread_online API for x86
>       (Konstantin)
>    f) Removed the macro to access the thread QS counters (Konstantin)
> 2) Test cases
>    a) Added additional test cases for removing RTE_ASSERT
> 3) Documentation
>    a) Changed the figure to make it bigger (Marko)
>    b) Spelling and format corrections (Marko)
>
> Patch v1:
> 1) Library changes
>    a) Changed the maximum number of reader threads to 1024
>    b) Renamed rte_rcu_qsbr_register/unregister_thread to
>       rte_rcu_qsbr_thread_register/unregister
>    c) Added rte_rcu_qsbr_thread_online/offline APIs. These are
>       optimized versions of the rte_rcu_qsbr_thread_register/
>       unregister APIs. They also provide flexibility for performance
>       when the requested maximum number of threads is higher than the
>       current number of threads.
>    d) Corrected memory orderings in rte_rcu_qsbr_update
>    e) Changed the signature of the rte_rcu_qsbr_start API to return
>       the token
>    f) Changed the signature of the rte_rcu_qsbr_start API to not take
>       the expected number of QS states to wait for
>    g) Added debug logs
>    h) Added API and programmer guide documentation
>
> RFC v3:
> 1) Library changes
>    a) Rebased to latest master
>    b) Added new API rte_rcu_qsbr_get_memsize
>    c) Added support for memory allocation for the QSBR variable
>       (Konstantin)
>    d) Fixed a bug in rte_rcu_qsbr_check (Konstantin)
> 2) Testcase changes
>    a) Separated stress tests into a performance test case file
>    b) Added performance statistics
>
> RFC v2:
> 1) Cover letter changes
>    a) Explain the parameters that affect the overhead of using RCU
>       and their effect
>    b) Explain how this library addresses these effects to keep the
>       overhead to a minimum
> 2) Library changes
>    a) Renamed the library to avoid confusion (Matias, Bruce,
>       Konstantin)
>    b) Simplified the code/removed APIs to keep this library in line
>       with other synchronisation mechanisms like locks (Konstantin)
>    c) Changed the design to support more than 64 threads (Konstantin)
>    d) Fixed the version map to remove static inline functions
> 3) Testcase changes
>    a) Added boundary and additional functional test cases
>    b) Added stress test cases (Paul E. McKenney)
>
> Dharmik Thakkar (1):
>   test/rcu_qsbr: add API and functional tests
>
> Honnappa Nagarahalli (3):
>   rcu: add RCU library supporting QSBR mechanism
>   doc/rcu: add lib_rcu documentation
>   doc: added RCU to the release notes
>
>  MAINTAINERS                                   |    5 +
>  app/test/Makefile                             |    2 +
>  app/test/autotest_data.py                     |   12 +
>  app/test/meson.build                          |    7 +-
>  app/test/test_rcu_qsbr.c                      | 1014 +++++++++++++++++
>  app/test/test_rcu_qsbr_perf.c                 |  704 ++++++++++++
>  config/common_base                            |    6 +
>  doc/api/doxy-api-index.md                     |    3 +-
>  doc/api/doxy-api.conf.in                      |    1 +
>  .../prog_guide/img/rcu_general_info.svg       |  509 +++++++++
>  doc/guides/prog_guide/index.rst               |    1 +
>  doc/guides/prog_guide/rcu_lib.rst             |  185 +++
>  doc/guides/rel_notes/release_19_05.rst        |    8 +
>  lib/Makefile                                  |    2 +
>  lib/librte_rcu/Makefile                       |   23 +
>  lib/librte_rcu/meson.build                    |    7 +
>  lib/librte_rcu/rte_rcu_qsbr.c                 |  277 +++++
>  lib/librte_rcu/rte_rcu_qsbr.h                 |  641 +++++++++++
>  lib/librte_rcu/rte_rcu_version.map            |   13 +
>  lib/meson.build                               |    2 +-
>  mk/rte.app.mk                                 |    1 +
>  21 files changed, 3420 insertions(+), 3 deletions(-)
>  create mode 100644 app/test/test_rcu_qsbr.c
>  create mode 100644 app/test/test_rcu_qsbr_perf.c
>  create mode 100644 doc/guides/prog_guide/img/rcu_general_info.svg
>  create mode 100644 doc/guides/prog_guide/rcu_lib.rst
>  create mode 100644 lib/librte_rcu/Makefile
>  create mode 100644 lib/librte_rcu/meson.build
>  create mode 100644 lib/librte_rcu/rte_rcu_qsbr.c
>  create mode 100644 lib/librte_rcu/rte_rcu_qsbr.h
>  create mode 100644 lib/librte_rcu/rte_rcu_version.map
>
> --

Ran UT on my box (SKX) for both x86_64 and i686 over 96 cores.
All passed.

Tested-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> 2.17.1