[RFC,2/3] tqs: add thread quiescent state library

Message ID	20181122033055.3431-3-honnappa.nagarahalli@arm.com
State	New
Headers	show Delivered-To: patch@linaro.org Received-SPF: pass (google.com: domain of dev-bounces@dpdk.org designates 92.243.14.124 as permitted sender) client-ip=92.243.14.124; From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> To: dev@dpdk.org Cc: nd@arm.com, honnappa.nagarahalli@arm.com, dharmik.thakkar@arm.com, malvika.gupta@arm.com, gavin.hu@arm.com Date: Wed, 21 Nov 2018 21:30:54 -0600 Message-Id: <20181122033055.3431-3-honnappa.nagarahalli@arm.com> In-Reply-To: <20181122033055.3431-1-honnappa.nagarahalli@arm.com> References: <20181122033055.3431-1-honnappa.nagarahalli@arm.com> Subject: [dpdk-dev] [RFC 2/3] tqs: add thread quiescent state library Precedence: list Errors-To: dev-bounces@dpdk.org Sender: "dev" <dev-bounces@dpdk.org>
Series	tqs: add thread quiescent state library \| expand [RFC,0/3] tqs: add thread quiescent state library [RFC,1/3] log: add TQS log type [RFC,2/3] tqs: add thread quiescent state library [RFC,3/3] test/tqs: Add API and functional tests

Honnappa Nagarahalli Nov. 22, 2018, 3:30 a.m. UTC

Add Thread Quiescent State (TQS) library. This library helps identify
the quiescent state of the reader threads so that the writers
can free the memory associated with the lock less data structures.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Reviewed-by: Steve Capper <steve.capper@arm.com>

Reviewed-by: Gavin Hu <gavin.hu@arm.com>

---
 config/common_base                 |   6 +
 lib/Makefile                       |   2 +
 lib/librte_tqs/Makefile            |  23 ++
 lib/librte_tqs/meson.build         |   5 +
 lib/librte_tqs/rte_tqs.c           | 249 ++++++++++++++++++++
 lib/librte_tqs/rte_tqs.h           | 352 +++++++++++++++++++++++++++++
 lib/librte_tqs/rte_tqs_version.map |  16 ++
 lib/meson.build                    |   2 +-
 mk/rte.app.mk                      |   1 +
 9 files changed, 655 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_tqs/Makefile
 create mode 100644 lib/librte_tqs/meson.build
 create mode 100644 lib/librte_tqs/rte_tqs.c
 create mode 100644 lib/librte_tqs/rte_tqs.h
 create mode 100644 lib/librte_tqs/rte_tqs_version.map

-- 
2.17.1

Ananyev, Konstantin Nov. 24, 2018, 12:18 p.m. UTC | #1

Hi Honnappa,

> +

> +/* Allocate a new TQS variable with the name *name* in memory. */

> +struct rte_tqs * __rte_experimental

> +rte_tqs_alloc(const char *name, int socket_id, uint64_t lcore_mask)

> +{

> +	char tqs_name[RTE_TQS_NAMESIZE];

> +	struct rte_tailq_entry *te, *tmp_te;

> +	struct rte_tqs_list *tqs_list;

> +	struct rte_tqs *v, *tmp_v;

> +	int ret;

> +

> +	if (name == NULL) {

> +		RTE_LOG(ERR, TQS, "Invalid input parameters\n");

> +		rte_errno = -EINVAL;

> +		return NULL;

> +	}

> +

> +	te = rte_zmalloc("TQS_TAILQ_ENTRY", sizeof(*te), 0);

> +	if (te == NULL) {

> +		RTE_LOG(ERR, TQS, "Cannot reserve memory for tailq\n");

> +		rte_errno = -ENOMEM;

> +		return NULL;

> +	}

> +

> +	snprintf(tqs_name, sizeof(tqs_name), "%s", name);

> +	v = rte_zmalloc_socket(tqs_name, sizeof(struct rte_tqs),

> +				RTE_CACHE_LINE_SIZE, socket_id);

> +	if (v == NULL) {

> +		RTE_LOG(ERR, TQS, "Cannot reserve memory for TQS variable\n");

> +		rte_errno = -ENOMEM;

> +		goto alloc_error;

> +	}

> +

> +	ret = snprintf(v->name, sizeof(v->name), "%s", name);

> +	if (ret < 0 || ret >= (int)sizeof(v->name)) {

> +		rte_errno = -ENAMETOOLONG;

> +		goto alloc_error;

> +	}

> +

> +	te->data = (void *) v;

> +	v->lcore_mask = lcore_mask;

> +

> +	rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);

> +

> +	tqs_list = RTE_TAILQ_CAST(rte_tqs_tailq.head, rte_tqs_list);

> +

> +	/* Search if a TQS variable with the same name exists already */

> +	TAILQ_FOREACH(tmp_te, tqs_list, next) {

> +		tmp_v = (struct rte_tqs *) tmp_te->data;

> +		if (strncmp(name, tmp_v->name, RTE_TQS_NAMESIZE) == 0)

> +			break;

> +	}

> +

> +	if (tmp_te != NULL) {

> +		rte_errno = -EEXIST;

> +		goto tqs_exist;

> +	}

> +

> +	TAILQ_INSERT_TAIL(tqs_list, te, next);

> +

> +	rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);

> +

> +	return v;

> +

> +tqs_exist:

> +	rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);

> +

> +alloc_error:

> +	rte_free(te);

> +	rte_free(v);

> +	return NULL;

> +}


That seems quite heavy-weight function just to allocate sync variable.
As size of struct rte_tqs is constant and known to the user, might be better
just provide rte_tqs_init(struct rte_tqs *tqs, ...) and let user allocate/free
memory for it by himself.

> +

> +/* Add a reader thread, running on an lcore, to the list of threads

> + * reporting their quiescent state on a TQS variable.

> + */

> +int __rte_experimental

> +rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id)

> +{

> +	TQS_RETURN_IF_TRUE((v == NULL || lcore_id >= RTE_TQS_MAX_LCORE),

> +				-EINVAL);


It is not very good practice to make function return different values and behave
in a different way in debug/non-debug mode.
I'd say that for slow-path (functions in .c) it is always good to check input parameters.
For fast-path (functions in .h) we sometimes skip such checking,
but debug mode can probably use RTE_ASSERT() or so.


lcore_id >= RTE_TQS_MAX_LCORE

Is this limitation really necessary?
First it means that only lcores can use that API (at least data-path part), second
even today many machines have more than 64 cores.
I think you can easily avoid such limitation, if instead of requiring lcore_id as
input parameter, you'll just make it return index of next available entry in w[].
Then tqs_update() can take that index as input parameter. 

> +


> +	/* Worker thread has to count the quiescent states

> +	 * only from the current value of token.

> +	 */

> +	v->w[lcore_id].cnt = v->token;


Wonder what would happen, if new reader will
call register(), after writer calls start()?
Looks like a race-condition.
Or such pattern is not supported?

> +

> +	/* Release the store to initial TQS count so that workers

> +	 * can use it immediately after this function returns.

> +	 */

> +	__atomic_fetch_or(&v->lcore_mask, (1UL << lcore_id), __ATOMIC_RELEASE);

> +

> +	return 0;

> +}

> +

> +/**

> + * @warning

> + * @b EXPERIMENTAL: this API may change without prior notice

> + *

> + * Trigger the worker threads to report the quiescent state

> + * status.

> + *

> + * This is implemented as a lock-free function. It is multi-thread

> + * safe and can be called from the worker threads as well.

> + *

> + * @param v

> + *   TQS variable

> + * @param n

> + *   Expected number of times the quiescent state is entered

> + * @param t

> + *   - If successful, this is the token for this call of the API.

> + *     This should be passed to rte_tqs_check API.

> + * @return

> + *   - -EINVAL if the parameters are invalid (debug mode compilation only).

> + *   - 0 Otherwise and always (non-debug mode compilation).

> + */

> +static __rte_always_inline int __rte_experimental

> +rte_tqs_start(struct rte_tqs *v, unsigned int n, uint32_t *t)

> +{

> +	TQS_RETURN_IF_TRUE((v == NULL || t == NULL), -EINVAL);

> +

> +	/* This store release will ensure that changes to any data

> +	 * structure are visible to the workers before the token

> +	 * update is visible.

> +	 */

> +	*t = __atomic_add_fetch(&v->token, n, __ATOMIC_RELEASE);

> +

> +	return 0;

> +}

> +

> +/**

> + * @warning

> + * @b EXPERIMENTAL: this API may change without prior notice

> + *

> + * Update quiescent state for the worker thread on a lcore.

> + *

> + * This is implemented as a lock-free function. It is multi-thread safe.

> + * All the worker threads registered to report their quiescent state

> + * on the TQS variable must call this API.

> + *

> + * @param v

> + *   TQS variable

> + */

> +static __rte_always_inline void __rte_experimental

> +rte_tqs_update(struct rte_tqs *v, unsigned int lcore_id)

> +{

> +	uint32_t t;

> +

> +	TQS_ERR_LOG_IF_TRUE(v == NULL || lcore_id >= RTE_TQS_MAX_LCORE);

> +

> +	/* Load the token before the worker thread loads any other

> +	 * (lock-free) data structure. This ensures that updates

> +	 * to the data structures are visible if the update

> +	 * to token is visible.

> +	 */

> +	t = __atomic_load_n(&v->token, __ATOMIC_ACQUIRE);


Hmm, I am not very familiar with C11 model, but it looks like a race
condition to me: 
as I understand, update() supposed be called at the end of reader's
critical section, correct?
But ACQUIRE is only a hoist barrier, which means compiler and cpu
are free to move earlier reads (and writes) after it.
It probably needs to be a full ACQ_REL here.

> +	if (v->w[lcore_id].cnt != t)

> +		v->w[lcore_id].cnt++;

> +}

> +

> +/**

> + * @warning

> + * @b EXPERIMENTAL: this API may change without prior notice

> + *

> + * Checks if all the worker threads have entered the quiescent state

> + * 'n' number of times. 'n' is provided in rte_tqs_start API.

> + *

> + * This is implemented as a lock-free function. It is multi-thread

> + * safe and can be called from the worker threads as well.

> + *

> + * @param v

> + *   TQS variable

> + * @param t

> + *   Token returned by rte_tqs_start API

> + * @param wait

> + *   If true, block till all the worker threads have completed entering

> + *   the quiescent state 'n' number of times

> + * @return

> + *   - 0 if all worker threads have NOT passed through specified number

> + *     of quiescent states.

> + *   - 1 if all worker threads have passed through specified number

> + *     of quiescent states.

> + *   - -EINVAL if the parameters are invalid (debug mode compilation only).

> + */

> +static __rte_always_inline int __rte_experimental

> +rte_tqs_check(struct rte_tqs *v, uint32_t t, bool wait)

> +{

> +	uint64_t l;

> +	uint64_t lcore_mask;

> +

> +	TQS_RETURN_IF_TRUE((v == NULL), -EINVAL);

> +

> +	do {

> +		/* Load the current lcore_mask before loading the

> +		 * worker thread quiescent state counters.

> +		 */

> +		lcore_mask = __atomic_load_n(&v->lcore_mask, __ATOMIC_ACQUIRE);


What would happen if reader will call unregister() simultaneously
with check() and will update lcore_mask straight after that load?
As I understand check() might hang in such case.

> +

> +		while (lcore_mask) {

> +			l = __builtin_ctz(lcore_mask);

> +			if (v->w[l].cnt != t)

> +				break;


As I understand, that makes control-path function progress dependent
on simultaneous invocation of data-path functions.
In some cases that might cause control-path to hang.
Let say if data-path function wouldn't be called, or user invokes
control-path and data-path functions from the same thread.

> +

> +			lcore_mask &= ~(1UL << l);

> +		}

> +

> +		if (lcore_mask == 0)

> +			return 1;

> +

> +		rte_pause();

> +	} while (wait);

> +

> +	return 0;

> +}

> +

Honnappa Nagarahalli Nov. 27, 2018, 9:32 p.m. UTC | #2

> 

> Hi Honnappa,

Thank you for reviewing the patch, appreciate your comments.

> 

> > +

> > +/* Allocate a new TQS variable with the name *name* in memory. */

> > +struct rte_tqs * __rte_experimental rte_tqs_alloc(const char *name,

> > +int socket_id, uint64_t lcore_mask) {

> > +	char tqs_name[RTE_TQS_NAMESIZE];

> > +	struct rte_tailq_entry *te, *tmp_te;

> > +	struct rte_tqs_list *tqs_list;

> > +	struct rte_tqs *v, *tmp_v;

> > +	int ret;

> > +

> > +	if (name == NULL) {

> > +		RTE_LOG(ERR, TQS, "Invalid input parameters\n");

> > +		rte_errno = -EINVAL;

> > +		return NULL;

> > +	}

> > +

> > +	te = rte_zmalloc("TQS_TAILQ_ENTRY", sizeof(*te), 0);

> > +	if (te == NULL) {

> > +		RTE_LOG(ERR, TQS, "Cannot reserve memory for tailq\n");

> > +		rte_errno = -ENOMEM;

> > +		return NULL;

> > +	}

> > +

> > +	snprintf(tqs_name, sizeof(tqs_name), "%s", name);

> > +	v = rte_zmalloc_socket(tqs_name, sizeof(struct rte_tqs),

> > +				RTE_CACHE_LINE_SIZE, socket_id);

> > +	if (v == NULL) {

> > +		RTE_LOG(ERR, TQS, "Cannot reserve memory for TQS

> variable\n");

> > +		rte_errno = -ENOMEM;

> > +		goto alloc_error;

> > +	}

> > +

> > +	ret = snprintf(v->name, sizeof(v->name), "%s", name);

> > +	if (ret < 0 || ret >= (int)sizeof(v->name)) {

> > +		rte_errno = -ENAMETOOLONG;

> > +		goto alloc_error;

> > +	}

> > +

> > +	te->data = (void *) v;

> > +	v->lcore_mask = lcore_mask;

> > +

> > +	rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);

> > +

> > +	tqs_list = RTE_TAILQ_CAST(rte_tqs_tailq.head, rte_tqs_list);

> > +

> > +	/* Search if a TQS variable with the same name exists already */

> > +	TAILQ_FOREACH(tmp_te, tqs_list, next) {

> > +		tmp_v = (struct rte_tqs *) tmp_te->data;

> > +		if (strncmp(name, tmp_v->name, RTE_TQS_NAMESIZE) == 0)

> > +			break;

> > +	}

> > +

> > +	if (tmp_te != NULL) {

> > +		rte_errno = -EEXIST;

> > +		goto tqs_exist;

> > +	}

> > +

> > +	TAILQ_INSERT_TAIL(tqs_list, te, next);

> > +

> > +	rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);

> > +

> > +	return v;

> > +

> > +tqs_exist:

> > +	rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);

> > +

> > +alloc_error:

> > +	rte_free(te);

> > +	rte_free(v);

> > +	return NULL;

> > +}

> 

> That seems quite heavy-weight function just to allocate sync variable.

> As size of struct rte_tqs is constant and known to the user, might be better just

> provide rte_tqs_init(struct rte_tqs *tqs, ...) and let user allocate/free memory

> for it by himself.

> 

I believe, when you say heavy-weight, you are referring to adding tqs variable to the TAILQ and allocating the memory for it. Agree. I also do not expect that there are a whole lot of tqs variables used in an application. Even in rte_tqs_free, there is similar overhead.

The extra part is due to the way the TQS variable will get identified by data plane threads. I am thinking that a data plane thread will use the rte_tqs_lookup API to identify a TQS variable. However, it is possible to share this with data plane threads via a simple shared structure as well.

Along with not allocating the memory, are you suggesting that we could skip maintaining a list of TQS variables in the TAILQ? This will remove rte_tqs_lookup, rte_tqs_free, rte_tqs_list_dump APIs. I am fine with this approach.

> > +

> > +/* Add a reader thread, running on an lcore, to the list of threads

> > + * reporting their quiescent state on a TQS variable.

> > + */

> > +int __rte_experimental

> > +rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {

> > +	TQS_RETURN_IF_TRUE((v == NULL || lcore_id >=

> RTE_TQS_MAX_LCORE),

> > +				-EINVAL);

> 

> It is not very good practice to make function return different values and behave

> in a different way in debug/non-debug mode.

> I'd say that for slow-path (functions in .c) it is always good to check input

> parameters.

> For fast-path (functions in .h) we sometimes skip such checking, but debug

> mode can probably use RTE_ASSERT() or so.

Makes sense, I will change this in the next version.

> 

> 

> lcore_id >= RTE_TQS_MAX_LCORE

> 

> Is this limitation really necessary?

I added this limitation because currently DPDK application cannot take a mask more than 64bit wide. Otherwise, this should be as big as RTE_MAX_LCORE.
I see that in the case of '-lcores' option, the number of lcores can be more than the number of PEs. In this case, we still need a MAX limit (but can be bigger than 64).
 
> First it means that only lcores can use that API (at least data-path part), second

> even today many machines have more than 64 cores.

> I think you can easily avoid such limitation, if instead of requiring lcore_id as

> input parameter, you'll just make it return index of next available entry in w[].

> Then tqs_update() can take that index as input parameter.

I had thought about a similar approach based on IDs. I was concerned that ID will be one more thing to manage for the application. But, I see the limitations of the current approach now. I will change it to allocation based. This will support even non-EAL pthreads as well.

> 

> > +

> 

> > +	/* Worker thread has to count the quiescent states

> > +	 * only from the current value of token.

> > +	 */

> > +	v->w[lcore_id].cnt = v->token;

> 

> Wonder what would happen, if new reader will call register(), after writer calls

> start()?

> Looks like a race-condition.

> Or such pattern is not supported?

The start should be called only after the reference to the entry in the data structure is 'deleted'. Hence the new reader will not get the reference to the deleted entry and does not have to increment its counter. When rte_tqs_check is called, it will see that the counter is already up to date. (I am missing a load-acquire on the token, I will correct that in the next version).

> 

> > +

> > +	/* Release the store to initial TQS count so that workers

> > +	 * can use it immediately after this function returns.

> > +	 */

> > +	__atomic_fetch_or(&v->lcore_mask, (1UL << lcore_id),

> > +__ATOMIC_RELEASE);

> > +

> > +	return 0;

> > +}

> > +

> > +/**

> > + * @warning

> > + * @b EXPERIMENTAL: this API may change without prior notice

> > + *

> > + * Trigger the worker threads to report the quiescent state

> > + * status.

> > + *

> > + * This is implemented as a lock-free function. It is multi-thread

> > + * safe and can be called from the worker threads as well.

> > + *

> > + * @param v

> > + *   TQS variable

> > + * @param n

> > + *   Expected number of times the quiescent state is entered

> > + * @param t

> > + *   - If successful, this is the token for this call of the API.

> > + *     This should be passed to rte_tqs_check API.

> > + * @return

> > + *   - -EINVAL if the parameters are invalid (debug mode compilation only).

> > + *   - 0 Otherwise and always (non-debug mode compilation).

> > + */

> > +static __rte_always_inline int __rte_experimental

> > +rte_tqs_start(struct rte_tqs *v, unsigned int n, uint32_t *t) {

> > +	TQS_RETURN_IF_TRUE((v == NULL || t == NULL), -EINVAL);

> > +

> > +	/* This store release will ensure that changes to any data

> > +	 * structure are visible to the workers before the token

> > +	 * update is visible.

> > +	 */

> > +	*t = __atomic_add_fetch(&v->token, n, __ATOMIC_RELEASE);

> > +

> > +	return 0;

> > +}

> > +

> > +/**

> > + * @warning

> > + * @b EXPERIMENTAL: this API may change without prior notice

> > + *

> > + * Update quiescent state for the worker thread on a lcore.

> > + *

> > + * This is implemented as a lock-free function. It is multi-thread safe.

> > + * All the worker threads registered to report their quiescent state

> > + * on the TQS variable must call this API.

> > + *

> > + * @param v

> > + *   TQS variable

> > + */

> > +static __rte_always_inline void __rte_experimental

> > +rte_tqs_update(struct rte_tqs *v, unsigned int lcore_id) {

> > +	uint32_t t;

> > +

> > +	TQS_ERR_LOG_IF_TRUE(v == NULL || lcore_id >=

> RTE_TQS_MAX_LCORE);

> > +

> > +	/* Load the token before the worker thread loads any other

> > +	 * (lock-free) data structure. This ensures that updates

> > +	 * to the data structures are visible if the update

> > +	 * to token is visible.

> > +	 */

> > +	t = __atomic_load_n(&v->token, __ATOMIC_ACQUIRE);

> 

> Hmm, I am not very familiar with C11 model, but it looks like a race condition

> to me:

> as I understand, update() supposed be called at the end of reader's critical

> section, correct?

Yes, the understanding is correct.

> But ACQUIRE is only a hoist barrier, which means compiler and cpu are free to

> move earlier reads (and writes) after it.

Yes, your understanding is correct.

> It probably needs to be a full ACQ_REL here.

> 

The sequence of operations is as follows:
1) Writer 'deletes' an entry from a lock-free data structure
2) Writer calls rte_tqs_start - This API increments the 'token' and does a store-release. So, any earlier stores would be visible if the store to 'token' is visible (to the data plane threads).
3) Reader calls rte_tqs_update - This API load-acquires the 'token'.
	a) If this 'token' is the updated value from 2) then the entry deleted from 1) will not be available for the reader to reference (even if that reference is due to earlier reads being moved after load-acquire of 'token').
	b) If this 'token' is not the updated value from 2) then the entry deleted from 1) may or may not be available for the reader to reference. In this case the w[lcore_id].cnt is not updated, hence the writer will wait to 'free' the deleted entry from 1)


> > +	if (v->w[lcore_id].cnt != t)

> > +		v->w[lcore_id].cnt++;

> > +}

> > +

> > +/**

> > + * @warning

> > + * @b EXPERIMENTAL: this API may change without prior notice

> > + *

> > + * Checks if all the worker threads have entered the quiescent state

> > + * 'n' number of times. 'n' is provided in rte_tqs_start API.

> > + *

> > + * This is implemented as a lock-free function. It is multi-thread

> > + * safe and can be called from the worker threads as well.

> > + *

> > + * @param v

> > + *   TQS variable

> > + * @param t

> > + *   Token returned by rte_tqs_start API

> > + * @param wait

> > + *   If true, block till all the worker threads have completed entering

> > + *   the quiescent state 'n' number of times

> > + * @return

> > + *   - 0 if all worker threads have NOT passed through specified number

> > + *     of quiescent states.

> > + *   - 1 if all worker threads have passed through specified number

> > + *     of quiescent states.

> > + *   - -EINVAL if the parameters are invalid (debug mode compilation only).

> > + */

> > +static __rte_always_inline int __rte_experimental

> > +rte_tqs_check(struct rte_tqs *v, uint32_t t, bool wait) {

> > +	uint64_t l;

> > +	uint64_t lcore_mask;

> > +

> > +	TQS_RETURN_IF_TRUE((v == NULL), -EINVAL);

> > +

> > +	do {

> > +		/* Load the current lcore_mask before loading the

> > +		 * worker thread quiescent state counters.

> > +		 */

> > +		lcore_mask = __atomic_load_n(&v->lcore_mask,

> __ATOMIC_ACQUIRE);

> 

> What would happen if reader will call unregister() simultaneously with check()

> and will update lcore_mask straight after that load?

> As I understand check() might hang in such case.

If the 'lcore_mask' is updated after this load, it will affect only the current iteration of the while loop below. In the next iteration the 'lcore_mask' is loaded again.

> 

> > +

> > +		while (lcore_mask) {

> > +			l = __builtin_ctz(lcore_mask);

> > +			if (v->w[l].cnt != t)

> > +				break;

> 

> As I understand, that makes control-path function progress dependent on

> simultaneous invocation of data-path functions.

I agree that the control-path function progress (for ex: how long to wait for freeing the memory) depends on invocation of the data-path functions. The separation of 'start', 'check' and the option not to block in 'check' provide the flexibility for control-path to do some other work if it chooses to.  

> In some cases that might cause control-path to hang.

> Let say if data-path function wouldn't be called, or user invokes control-path

> and data-path functions from the same thread.

I agree with the case of data-path function not getting called. I would consider that as programming error. I can document that warning in the rte_tqs_check API.

In the case of same thread calling both control-path and data-path functions, it would depend on the sequence of the calls. The following sequence should not cause any hangs:
Worker thread
1) 'deletes' an entry from a lock-free data structure
2) rte_tqs_start
3) rte_tqs_update 
4) rte_tqs_check (wait == 1 or wait == 0)
5) 'free' the entry deleted in 1)

If 3) and 4) are interchanged, then there will be a hang if wait is set to 1. If wait is set to 0, there should not be a hang.
I can document this as part of the documentation (I do not think API documentation is required for this).

> 

> > +

> > +			lcore_mask &= ~(1UL << l);

> > +		}

> > +

> > +		if (lcore_mask == 0)

> > +			return 1;

> > +

> > +		rte_pause();

> > +	} while (wait);

> > +

> > +	return 0;

> > +}

> > +

Ananyev, Konstantin Nov. 28, 2018, 3:25 p.m. UTC | #3

> >

> > Hi Honnappa,

> Thank you for reviewing the patch, appreciate your comments.

> 

> >

> > > +

> > > +/* Allocate a new TQS variable with the name *name* in memory. */

> > > +struct rte_tqs * __rte_experimental rte_tqs_alloc(const char *name,

> > > +int socket_id, uint64_t lcore_mask) {

> > > +	char tqs_name[RTE_TQS_NAMESIZE];

> > > +	struct rte_tailq_entry *te, *tmp_te;

> > > +	struct rte_tqs_list *tqs_list;

> > > +	struct rte_tqs *v, *tmp_v;

> > > +	int ret;

> > > +

> > > +	if (name == NULL) {

> > > +		RTE_LOG(ERR, TQS, "Invalid input parameters\n");

> > > +		rte_errno = -EINVAL;

> > > +		return NULL;

> > > +	}

> > > +

> > > +	te = rte_zmalloc("TQS_TAILQ_ENTRY", sizeof(*te), 0);

> > > +	if (te == NULL) {

> > > +		RTE_LOG(ERR, TQS, "Cannot reserve memory for tailq\n");

> > > +		rte_errno = -ENOMEM;

> > > +		return NULL;

> > > +	}

> > > +

> > > +	snprintf(tqs_name, sizeof(tqs_name), "%s", name);

> > > +	v = rte_zmalloc_socket(tqs_name, sizeof(struct rte_tqs),

> > > +				RTE_CACHE_LINE_SIZE, socket_id);

> > > +	if (v == NULL) {

> > > +		RTE_LOG(ERR, TQS, "Cannot reserve memory for TQS

> > variable\n");

> > > +		rte_errno = -ENOMEM;

> > > +		goto alloc_error;

> > > +	}

> > > +

> > > +	ret = snprintf(v->name, sizeof(v->name), "%s", name);

> > > +	if (ret < 0 || ret >= (int)sizeof(v->name)) {

> > > +		rte_errno = -ENAMETOOLONG;

> > > +		goto alloc_error;

> > > +	}

> > > +

> > > +	te->data = (void *) v;

> > > +	v->lcore_mask = lcore_mask;

> > > +

> > > +	rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);

> > > +

> > > +	tqs_list = RTE_TAILQ_CAST(rte_tqs_tailq.head, rte_tqs_list);

> > > +

> > > +	/* Search if a TQS variable with the same name exists already */

> > > +	TAILQ_FOREACH(tmp_te, tqs_list, next) {

> > > +		tmp_v = (struct rte_tqs *) tmp_te->data;

> > > +		if (strncmp(name, tmp_v->name, RTE_TQS_NAMESIZE) == 0)

> > > +			break;

> > > +	}

> > > +

> > > +	if (tmp_te != NULL) {

> > > +		rte_errno = -EEXIST;

> > > +		goto tqs_exist;

> > > +	}

> > > +

> > > +	TAILQ_INSERT_TAIL(tqs_list, te, next);

> > > +

> > > +	rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);

> > > +

> > > +	return v;

> > > +

> > > +tqs_exist:

> > > +	rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);

> > > +

> > > +alloc_error:

> > > +	rte_free(te);

> > > +	rte_free(v);

> > > +	return NULL;

> > > +}

> >

> > That seems quite heavy-weight function just to allocate sync variable.

> > As size of struct rte_tqs is constant and known to the user, might be better just

> > provide rte_tqs_init(struct rte_tqs *tqs, ...) and let user allocate/free memory

> > for it by himself.

> >

> I believe, when you say heavy-weight, you are referring to adding tqs variable to the TAILQ and allocating the memory for it. 


Yes.

> Agree. I also

> do not expect that there are a whole lot of tqs variables used in an application. Even in rte_tqs_free, there is similar overhead.

> 

> The extra part is due to the way the TQS variable will get identified by data plane threads. I am thinking that a data plane thread will use the

> rte_tqs_lookup API to identify a TQS variable. However, it is possible to share this with data plane threads via a simple shared structure as

> well.

> 

> Along with not allocating the memory, are you suggesting that we could skip maintaining a list of TQS variables in the TAILQ? This will

> remove rte_tqs_lookup, rte_tqs_free, rte_tqs_list_dump APIs. I am fine with this approach.


Yes, that's what I suggest.
My thought was - it is just another data structure used for synchronization (as spinlock, rwlock, etc.).
So should be possible to allocate it statically and we probably don't need to have an ability to lookup
such variable by name via tailq.

> 

> > > +

> > > +/* Add a reader thread, running on an lcore, to the list of threads

> > > + * reporting their quiescent state on a TQS variable.

> > > + */

> > > +int __rte_experimental

> > > +rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {

> > > +	TQS_RETURN_IF_TRUE((v == NULL || lcore_id >=

> > RTE_TQS_MAX_LCORE),

> > > +				-EINVAL);

> >

> > It is not very good practice to make function return different values and behave

> > in a different way in debug/non-debug mode.

> > I'd say that for slow-path (functions in .c) it is always good to check input

> > parameters.

> > For fast-path (functions in .h) we sometimes skip such checking, but debug

> > mode can probably use RTE_ASSERT() or so.

> Makes sense, I will change this in the next version.

> 

> >

> >

> > lcore_id >= RTE_TQS_MAX_LCORE

> >

> > Is this limitation really necessary?

> I added this limitation because currently DPDK application cannot take a mask more than 64bit wide. Otherwise, this should be as big as

> RTE_MAX_LCORE.

> I see that in the case of '-lcores' option, the number of lcores can be more than the number of PEs. In this case, we still need a MAX limit

> (but can be bigger than 64).

> 

> > First it means that only lcores can use that API (at least data-path part), second

> > even today many machines have more than 64 cores.

> > I think you can easily avoid such limitation, if instead of requiring lcore_id as

> > input parameter, you'll just make it return index of next available entry in w[].

> > Then tqs_update() can take that index as input parameter.

> I had thought about a similar approach based on IDs. I was concerned that ID will be one more thing to manage for the application. But, I

> see the limitations of the current approach now. I will change it to allocation based. This will support even non-EAL pthreads as well.


Yes, with such approach non-lcore threads will be able to use it also.

 
> >

> > > +

> >

> > > +	/* Worker thread has to count the quiescent states

> > > +	 * only from the current value of token.

> > > +	 */

> > > +	v->w[lcore_id].cnt = v->token;

> >

> > Wonder what would happen, if new reader will call register(), after writer calls

> > start()?

> > Looks like a race-condition.

> > Or such pattern is not supported?

> The start should be called only after the reference to the entry in the data structure is 'deleted'. Hence the new reader will not get the

> reference to the deleted entry and does not have to increment its counter. When rte_tqs_check is called, it will see that the counter is

> already up to date. (I am missing a load-acquire on the token, I will correct that in the next version).


Yes, with _acquire_ in place it seems to be good here.  

> 

> >

> > > +

> > > +	/* Release the store to initial TQS count so that workers

> > > +	 * can use it immediately after this function returns.

> > > +	 */

> > > +	__atomic_fetch_or(&v->lcore_mask, (1UL << lcore_id),

> > > +__ATOMIC_RELEASE);

> > > +

> > > +	return 0;

> > > +}

> > > +

> > > +/**

> > > + * @warning

> > > + * @b EXPERIMENTAL: this API may change without prior notice

> > > + *

> > > + * Trigger the worker threads to report the quiescent state

> > > + * status.

> > > + *

> > > + * This is implemented as a lock-free function. It is multi-thread

> > > + * safe and can be called from the worker threads as well.

> > > + *

> > > + * @param v

> > > + *   TQS variable

> > > + * @param n

> > > + *   Expected number of times the quiescent state is entered

> > > + * @param t

> > > + *   - If successful, this is the token for this call of the API.

> > > + *     This should be passed to rte_tqs_check API.

> > > + * @return

> > > + *   - -EINVAL if the parameters are invalid (debug mode compilation only).

> > > + *   - 0 Otherwise and always (non-debug mode compilation).

> > > + */

> > > +static __rte_always_inline int __rte_experimental

> > > +rte_tqs_start(struct rte_tqs *v, unsigned int n, uint32_t *t) {

> > > +	TQS_RETURN_IF_TRUE((v == NULL || t == NULL), -EINVAL);

> > > +

> > > +	/* This store release will ensure that changes to any data

> > > +	 * structure are visible to the workers before the token

> > > +	 * update is visible.

> > > +	 */

> > > +	*t = __atomic_add_fetch(&v->token, n, __ATOMIC_RELEASE);

> > > +

> > > +	return 0;

> > > +}

> > > +

> > > +/**

> > > + * @warning

> > > + * @b EXPERIMENTAL: this API may change without prior notice

> > > + *

> > > + * Update quiescent state for the worker thread on a lcore.

> > > + *

> > > + * This is implemented as a lock-free function. It is multi-thread safe.

> > > + * All the worker threads registered to report their quiescent state

> > > + * on the TQS variable must call this API.

> > > + *

> > > + * @param v

> > > + *   TQS variable

> > > + */

> > > +static __rte_always_inline void __rte_experimental

> > > +rte_tqs_update(struct rte_tqs *v, unsigned int lcore_id) {

> > > +	uint32_t t;

> > > +

> > > +	TQS_ERR_LOG_IF_TRUE(v == NULL || lcore_id >=

> > RTE_TQS_MAX_LCORE);

> > > +

> > > +	/* Load the token before the worker thread loads any other

> > > +	 * (lock-free) data structure. This ensures that updates

> > > +	 * to the data structures are visible if the update

> > > +	 * to token is visible.

> > > +	 */

> > > +	t = __atomic_load_n(&v->token, __ATOMIC_ACQUIRE);

> >

> > Hmm, I am not very familiar with C11 model, but it looks like a race condition

> > to me:

> > as I understand, update() supposed be called at the end of reader's critical

> > section, correct?

> Yes, the understanding is correct.

> 

> > But ACQUIRE is only a hoist barrier, which means compiler and cpu are free to

> > move earlier reads (and writes) after it.

> Yes, your understanding is correct.

> 

> > It probably needs to be a full ACQ_REL here.

> >

> The sequence of operations is as follows:

> 1) Writer 'deletes' an entry from a lock-free data structure

> 2) Writer calls rte_tqs_start - This API increments the 'token' and does a store-release. So, any earlier stores would be visible if the store to

> 'token' is visible (to the data plane threads).

> 3) Reader calls rte_tqs_update - This API load-acquires the 'token'.

> 	a) If this 'token' is the updated value from 2) then the entry deleted from 1) will not be available for the reader to reference (even if

> that reference is due to earlier reads being moved after load-acquire of 'token').

> 	b) If this 'token' is not the updated value from 2) then the entry deleted from 1) may or may not be available for the reader to

> reference. In this case the w[lcore_id].cnt is not updated, hence the writer will wait to 'free' the deleted entry from 1)


Yes, you right, it's me being confused.

> 

> 

> > > +	if (v->w[lcore_id].cnt != t)

> > > +		v->w[lcore_id].cnt++;

> > > +}

> > > +

> > > +/**

> > > + * @warning

> > > + * @b EXPERIMENTAL: this API may change without prior notice

> > > + *

> > > + * Checks if all the worker threads have entered the quiescent state

> > > + * 'n' number of times. 'n' is provided in rte_tqs_start API.

> > > + *

> > > + * This is implemented as a lock-free function. It is multi-thread

> > > + * safe and can be called from the worker threads as well.

> > > + *

> > > + * @param v

> > > + *   TQS variable

> > > + * @param t

> > > + *   Token returned by rte_tqs_start API

> > > + * @param wait

> > > + *   If true, block till all the worker threads have completed entering

> > > + *   the quiescent state 'n' number of times

> > > + * @return

> > > + *   - 0 if all worker threads have NOT passed through specified number

> > > + *     of quiescent states.

> > > + *   - 1 if all worker threads have passed through specified number

> > > + *     of quiescent states.

> > > + *   - -EINVAL if the parameters are invalid (debug mode compilation only).

> > > + */

> > > +static __rte_always_inline int __rte_experimental

> > > +rte_tqs_check(struct rte_tqs *v, uint32_t t, bool wait) {

> > > +	uint64_t l;

> > > +	uint64_t lcore_mask;

> > > +

> > > +	TQS_RETURN_IF_TRUE((v == NULL), -EINVAL);

> > > +

> > > +	do {

> > > +		/* Load the current lcore_mask before loading the

> > > +		 * worker thread quiescent state counters.

> > > +		 */

> > > +		lcore_mask = __atomic_load_n(&v->lcore_mask,

> > __ATOMIC_ACQUIRE);

> >

> > What would happen if reader will call unregister() simultaneously with check()

> > and will update lcore_mask straight after that load?

> > As I understand check() might hang in such case.

> If the 'lcore_mask' is updated after this load, it will affect only the current iteration of the while loop below. In the next iteration the

> 'lcore_mask' is loaded again.


True, my confusion again.

> 

> >

> > > +

> > > +		while (lcore_mask) {

> > > +			l = __builtin_ctz(lcore_mask);

> > > +			if (v->w[l].cnt != t)

> > > +				break;

> >

> > As I understand, that makes control-path function progress dependent on

> > simultaneous invocation of data-path functions.

> I agree that the control-path function progress (for ex: how long to wait for freeing the memory) depends on invocation of the data-path

> functions. The separation of 'start', 'check' and the option not to block in 'check' provide the flexibility for control-path to do some other

> work if it chooses to.

> 

> > In some cases that might cause control-path to hang.

> > Let say if data-path function wouldn't be called, or user invokes control-path

> > and data-path functions from the same thread.

> I agree with the case of data-path function not getting called. I would consider that as programming error. I can document that warning in

> the rte_tqs_check API.


Sure, it can be documented.
Though that means, that each data-path thread would have to do explicit update() call
for every tqs it might use.
I just think that it would complicate things and might limit usage of the library quite significantly.
 
> 

> In the case of same thread calling both control-path and data-path functions, it would depend on the sequence of the calls. The following

> sequence should not cause any hangs:

> Worker thread

> 1) 'deletes' an entry from a lock-free data structure

> 2) rte_tqs_start

> 3) rte_tqs_update

> 4) rte_tqs_check (wait == 1 or wait == 0)

> 5) 'free' the entry deleted in 1)


That an interesting idea, and that should help, I think.
Probably worth to have {2,3,4} sequence as a new high level function.

> 

> If 3) and 4) are interchanged, then there will be a hang if wait is set to 1. If wait is set to 0, there should not be a hang.

> I can document this as part of the documentation (I do not think API documentation is required for this).

> 

> >

> > > +

> > > +			lcore_mask &= ~(1UL << l);

> > > +		}

> > > +

> > > +		if (lcore_mask == 0)

> > > +			return 1;

> > > +

> > > +		rte_pause();

> > > +	} while (wait);

> > > +

> > > +	return 0;

> > > +}

> > > +

Honnappa Nagarahalli Dec. 7, 2018, 7:27 a.m. UTC | #4

> >

> > > > +

> > > > +/* Add a reader thread, running on an lcore, to the list of

> > > > +threads

> > > > + * reporting their quiescent state on a TQS variable.

> > > > + */

> > > > +int __rte_experimental

> > > > +rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {

> > > > +	TQS_RETURN_IF_TRUE((v == NULL || lcore_id >=

> > > RTE_TQS_MAX_LCORE),

> > > > +				-EINVAL);

> > >

> > > It is not very good practice to make function return different

> > > values and behave in a different way in debug/non-debug mode.

> > > I'd say that for slow-path (functions in .c) it is always good to

> > > check input parameters.

> > > For fast-path (functions in .h) we sometimes skip such checking, but

> > > debug mode can probably use RTE_ASSERT() or so.

> > Makes sense, I will change this in the next version.

> >

> > >

> > >

> > > lcore_id >= RTE_TQS_MAX_LCORE

> > >

> > > Is this limitation really necessary?

> > I added this limitation because currently DPDK application cannot take

> > a mask more than 64bit wide. Otherwise, this should be as big as

> RTE_MAX_LCORE.

> > I see that in the case of '-lcores' option, the number of lcores can

> > be more than the number of PEs. In this case, we still need a MAX limit (but

> can be bigger than 64).

> >

> > > First it means that only lcores can use that API (at least data-path

> > > part), second even today many machines have more than 64 cores.

> > > I think you can easily avoid such limitation, if instead of

> > > requiring lcore_id as input parameter, you'll just make it return index of

> next available entry in w[].

> > > Then tqs_update() can take that index as input parameter.

> > I had thought about a similar approach based on IDs. I was concerned

> > that ID will be one more thing to manage for the application. But, I see the

> limitations of the current approach now. I will change it to allocation based.

> This will support even non-EAL pthreads as well.

> 

> Yes, with such approach non-lcore threads will be able to use it also.

> 

I realized that rte_tqs_register_lcore/ rte_tqs_unregister_lcore need to be efficient as they can be called from the worker's packet processing loop (rte_event_dequeue_burst allows blocking. So, the worker thread needs to call rte_tqs_unregister_lcore before calling rte_event_dequeue_burst and rte_tqs_register_lcore before starting packet processing). Allocating the thread ID in these functions will make them more complex.

I suggest that we change the argument 'lcore_id' to 'thread_id'. The application could use 'lcore_id' as 'thread_id' if threads are mapped to physical cores 1:1.

If the threads are not mapped 1:1 to physical cores, the threads need to use a thread_id in the range of 0 - RTE_TQS_MAX_THREADS. I do not see that DPDK has a thread_id concept. For TQS, the thread IDs are global (i.e. not per TQS variable). I could provide APIs to do the thread ID allocation, but I think the thread ID allocation should not be part of this library. Such thread ID might be useful for other libraries.

<snip>

> 

> >

> > >

> > > > +

> > > > +		while (lcore_mask) {

> > > > +			l = __builtin_ctz(lcore_mask);

> > > > +			if (v->w[l].cnt != t)

> > > > +				break;

> > >

> > > As I understand, that makes control-path function progress dependent

> > > on simultaneous invocation of data-path functions.

> > I agree that the control-path function progress (for ex: how long to

> > wait for freeing the memory) depends on invocation of the data-path

> > functions. The separation of 'start', 'check' and the option not to block in

> 'check' provide the flexibility for control-path to do some other work if it

> chooses to.

> >

> > > In some cases that might cause control-path to hang.

> > > Let say if data-path function wouldn't be called, or user invokes

> > > control-path and data-path functions from the same thread.

> > I agree with the case of data-path function not getting called. I

> > would consider that as programming error. I can document that warning in

> the rte_tqs_check API.

> 

> Sure, it can be documented.

> Though that means, that each data-path thread would have to do explicit

> update() call for every tqs it might use.

> I just think that it would complicate things and might limit usage of the library

> quite significantly.

Each data path thread has to report its quiescent state. Hence, each data-path thread has to call update() (similar to how rte_timer_manage() has to be called periodically on the worker thread). 
Do you have any particular use case in mind where this fails?

> 

> >

> > In the case of same thread calling both control-path and data-path

> > functions, it would depend on the sequence of the calls. The following

> sequence should not cause any hangs:

> > Worker thread

> > 1) 'deletes' an entry from a lock-free data structure

> > 2) rte_tqs_start

> > 3) rte_tqs_update

> > 4) rte_tqs_check (wait == 1 or wait == 0)

> > 5) 'free' the entry deleted in 1)

> 

> That an interesting idea, and that should help, I think.

> Probably worth to have {2,3,4} sequence as a new high level function.

> 

Yes, this is a good idea. Such a function would be applicable only in the worker thread. I would prefer to leave it to the application to take care.

Stephen Hemminger Dec. 7, 2018, 5:29 p.m. UTC | #5

On Fri, 7 Dec 2018 07:27:16 +0000
Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:

> > >  

> > > > > +

> > > > > +/* Add a reader thread, running on an lcore, to the list of

> > > > > +threads

> > > > > + * reporting their quiescent state on a TQS variable.

> > > > > + */

> > > > > +int __rte_experimental

> > > > > +rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {

> > > > > +	TQS_RETURN_IF_TRUE((v == NULL || lcore_id >=  

> > > > RTE_TQS_MAX_LCORE),  

> > > > > +				-EINVAL);  

> > > >

> > > > It is not very good practice to make function return different

> > > > values and behave in a different way in debug/non-debug mode.

> > > > I'd say that for slow-path (functions in .c) it is always good to

> > > > check input parameters.

> > > > For fast-path (functions in .h) we sometimes skip such checking, but

> > > > debug mode can probably use RTE_ASSERT() or so.  

> > > Makes sense, I will change this in the next version.

> > >  

> > > >

> > > >

> > > > lcore_id >= RTE_TQS_MAX_LCORE

> > > >

> > > > Is this limitation really necessary?  

> > > I added this limitation because currently DPDK application cannot take

> > > a mask more than 64bit wide. Otherwise, this should be as big as  

> > RTE_MAX_LCORE.  

> > > I see that in the case of '-lcores' option, the number of lcores can

> > > be more than the number of PEs. In this case, we still need a MAX limit (but  

> > can be bigger than 64).  

> > >  

> > > > First it means that only lcores can use that API (at least data-path

> > > > part), second even today many machines have more than 64 cores.

> > > > I think you can easily avoid such limitation, if instead of

> > > > requiring lcore_id as input parameter, you'll just make it return index of  

> > next available entry in w[].  

> > > > Then tqs_update() can take that index as input parameter.  

> > > I had thought about a similar approach based on IDs. I was concerned

> > > that ID will be one more thing to manage for the application. But, I see the  

> > limitations of the current approach now. I will change it to allocation based.

> > This will support even non-EAL pthreads as well.

> > 

> > Yes, with such approach non-lcore threads will be able to use it also.

> >   

> I realized that rte_tqs_register_lcore/ rte_tqs_unregister_lcore need to be efficient as they can be called from the worker's packet processing loop (rte_event_dequeue_burst allows blocking. So, the worker thread needs to call rte_tqs_unregister_lcore before calling rte_event_dequeue_burst and rte_tqs_register_lcore before starting packet processing). Allocating the thread ID in these functions will make them more complex.

> 

> I suggest that we change the argument 'lcore_id' to 'thread_id'. The application could use 'lcore_id' as 'thread_id' if threads are mapped to physical cores 1:1.

> 

> If the threads are not mapped 1:1 to physical cores, the threads need to use a thread_id in the range of 0 - RTE_TQS_MAX_THREADS. I do not see that DPDK has a thread_id concept. For TQS, the thread IDs are global (i.e. not per TQS variable). I could provide APIs to do the thread ID allocation, but I think the thread ID allocation should not be part of this library. Such thread ID might be useful for other libraries.

> 

> <snip



Thread id is problematic since Glibc doesn't want to give it out.
You have to roll your own function to do gettid().
It is not as easy as just that.  Plus what about preemption?

Honnappa Nagarahalli Dec. 11, 2018, 6:40 a.m. UTC | #6

> 

> > > >

> > > > > > +

> > > > > > +/* Add a reader thread, running on an lcore, to the list of

> > > > > > +threads

> > > > > > + * reporting their quiescent state on a TQS variable.

> > > > > > + */

> > > > > > +int __rte_experimental

> > > > > > +rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {

> > > > > > +	TQS_RETURN_IF_TRUE((v == NULL || lcore_id >=

> > > > > RTE_TQS_MAX_LCORE),

> > > > > > +				-EINVAL);

> > > > >

> > > > > It is not very good practice to make function return different

> > > > > values and behave in a different way in debug/non-debug mode.

> > > > > I'd say that for slow-path (functions in .c) it is always good

> > > > > to check input parameters.

> > > > > For fast-path (functions in .h) we sometimes skip such checking,

> > > > > but debug mode can probably use RTE_ASSERT() or so.

> > > > Makes sense, I will change this in the next version.

> > > >

> > > > >

> > > > >

> > > > > lcore_id >= RTE_TQS_MAX_LCORE

> > > > >

> > > > > Is this limitation really necessary?

> > > > I added this limitation because currently DPDK application cannot

> > > > take a mask more than 64bit wide. Otherwise, this should be as big

> > > > as

> > > RTE_MAX_LCORE.

> > > > I see that in the case of '-lcores' option, the number of lcores

> > > > can be more than the number of PEs. In this case, we still need a

> > > > MAX limit (but

> > > can be bigger than 64).

> > > >

> > > > > First it means that only lcores can use that API (at least

> > > > > data-path part), second even today many machines have more than 64

> cores.

> > > > > I think you can easily avoid such limitation, if instead of

> > > > > requiring lcore_id as input parameter, you'll just make it

> > > > > return index of

> > > next available entry in w[].

> > > > > Then tqs_update() can take that index as input parameter.

> > > > I had thought about a similar approach based on IDs. I was

> > > > concerned that ID will be one more thing to manage for the

> > > > application. But, I see the

> > > limitations of the current approach now. I will change it to allocation

> based.

> > > This will support even non-EAL pthreads as well.

> > >

> > > Yes, with such approach non-lcore threads will be able to use it also.

> > >

> > I realized that rte_tqs_register_lcore/ rte_tqs_unregister_lcore need to be

> efficient as they can be called from the worker's packet processing loop

> (rte_event_dequeue_burst allows blocking. So, the worker thread needs to

> call rte_tqs_unregister_lcore before calling rte_event_dequeue_burst and

> rte_tqs_register_lcore before starting packet processing). Allocating the

> thread ID in these functions will make them more complex.

> >

> > I suggest that we change the argument 'lcore_id' to 'thread_id'. The

> application could use 'lcore_id' as 'thread_id' if threads are mapped to

> physical cores 1:1.

> >

> > If the threads are not mapped 1:1 to physical cores, the threads need to use

> a thread_id in the range of 0 - RTE_TQS_MAX_THREADS. I do not see that

> DPDK has a thread_id concept. For TQS, the thread IDs are global (i.e. not per

> TQS variable). I could provide APIs to do the thread ID allocation, but I think

> the thread ID allocation should not be part of this library. Such thread ID

> might be useful for other libraries.

> >

> > <snip

> 

> 

> Thread id is problematic since Glibc doesn't want to give it out.

> You have to roll your own function to do gettid().

> It is not as easy as just that.  Plus what about preemption?


Agree. I looked into this further. The rte_gettid function uses a system call (BSD and Linux). I am not clear on the space of the ID returned (as well). I do not think it is guaranteed that it will be with in a narrow range that is required here.

My suggestion would be to add a set of APIs that would allow for allocation of thread IDs which are within a given range of 0 to <predefined MAX>

Ananyev, Konstantin Dec. 12, 2018, 9:29 a.m. UTC | #7

> > >

> > > > > +

> > > > > +/* Add a reader thread, running on an lcore, to the list of

> > > > > +threads

> > > > > + * reporting their quiescent state on a TQS variable.

> > > > > + */

> > > > > +int __rte_experimental

> > > > > +rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {

> > > > > +	TQS_RETURN_IF_TRUE((v == NULL || lcore_id >=

> > > > RTE_TQS_MAX_LCORE),

> > > > > +				-EINVAL);

> > > >

> > > > It is not very good practice to make function return different

> > > > values and behave in a different way in debug/non-debug mode.

> > > > I'd say that for slow-path (functions in .c) it is always good to

> > > > check input parameters.

> > > > For fast-path (functions in .h) we sometimes skip such checking, but

> > > > debug mode can probably use RTE_ASSERT() or so.

> > > Makes sense, I will change this in the next version.

> > >

> > > >

> > > >

> > > > lcore_id >= RTE_TQS_MAX_LCORE

> > > >

> > > > Is this limitation really necessary?

> > > I added this limitation because currently DPDK application cannot take

> > > a mask more than 64bit wide. Otherwise, this should be as big as

> > RTE_MAX_LCORE.

> > > I see that in the case of '-lcores' option, the number of lcores can

> > > be more than the number of PEs. In this case, we still need a MAX limit (but

> > can be bigger than 64).

> > >

> > > > First it means that only lcores can use that API (at least data-path

> > > > part), second even today many machines have more than 64 cores.

> > > > I think you can easily avoid such limitation, if instead of

> > > > requiring lcore_id as input parameter, you'll just make it return index of

> > next available entry in w[].

> > > > Then tqs_update() can take that index as input parameter.

> > > I had thought about a similar approach based on IDs. I was concerned

> > > that ID will be one more thing to manage for the application. But, I see the

> > limitations of the current approach now. I will change it to allocation based.

> > This will support even non-EAL pthreads as well.

> >

> > Yes, with such approach non-lcore threads will be able to use it also.

> >

> I realized that rte_tqs_register_lcore/ rte_tqs_unregister_lcore need to be efficient as they can be called from the worker's packet

> processing loop (rte_event_dequeue_burst allows blocking. So, the worker thread needs to call rte_tqs_unregister_lcore before calling

> rte_event_dequeue_burst and rte_tqs_register_lcore before starting packet processing). Allocating the thread ID in these functions will

> make them more complex.

> 

> I suggest that we change the argument 'lcore_id' to 'thread_id'. The application could use 'lcore_id' as 'thread_id' if threads are mapped to

> physical cores 1:1.

> 

> If the threads are not mapped 1:1 to physical cores, the threads need to use a thread_id in the range of 0 - RTE_TQS_MAX_THREADS. I do

> not see that DPDK has a thread_id concept. For TQS, the thread IDs are global (i.e. not per TQS variable). I could provide APIs to do the

> thread ID allocation, but I think the thread ID allocation should not be part of this library. Such thread ID might be useful for other libraries.


I don't think there is any point to introduce new thread_id concept just for that library.
After all we already have a concept of lcore_id which pretty much serves the same purpose.
I still think that we need to either:
a) make register/unregister to work with any valid lcore_id (<=  RTE_MAX_LCORE)
b) make register/unregister to return index in w[]

For a) will need mask bigger than 64bits. 
b)  would allow to use data-path API by non-lcores threads too,
plus w[] would occupy less space, and check() might be faster.
Though yes, as a drawback, for b) register/unregister probably would need
extra 'while(CAS(...));' loop.
I suppose the question here do you foresee a lot of concurrent register/unregister
at data-path? 

> 

> <snip>

> 

> >

> > >

> > > >

> > > > > +

> > > > > +		while (lcore_mask) {

> > > > > +			l = __builtin_ctz(lcore_mask);

> > > > > +			if (v->w[l].cnt != t)

> > > > > +				break;

> > > >

> > > > As I understand, that makes control-path function progress dependent

> > > > on simultaneous invocation of data-path functions.

> > > I agree that the control-path function progress (for ex: how long to

> > > wait for freeing the memory) depends on invocation of the data-path

> > > functions. The separation of 'start', 'check' and the option not to block in

> > 'check' provide the flexibility for control-path to do some other work if it

> > chooses to.

> > >

> > > > In some cases that might cause control-path to hang.

> > > > Let say if data-path function wouldn't be called, or user invokes

> > > > control-path and data-path functions from the same thread.

> > > I agree with the case of data-path function not getting called. I

> > > would consider that as programming error. I can document that warning in

> > the rte_tqs_check API.

> >

> > Sure, it can be documented.

> > Though that means, that each data-path thread would have to do explicit

> > update() call for every tqs it might use.

> > I just think that it would complicate things and might limit usage of the library

> > quite significantly.

> Each data path thread has to report its quiescent state. Hence, each data-path thread has to call update() (similar to how

> rte_timer_manage() has to be called periodically on the worker thread).


I understand that.
Though that means that each data-path thread has to know explicitly what rcu vars it accesses.
Would be hard to adopt such API with rcu vars used inside some library.
But ok, as I understand people do use QSBR approach in their apps and
find it useful.

> Do you have any particular use case in mind where this fails?


Let say it means that library can't be used to add/del RX/TX ethdev callbacks
in a safe manner.

BTW, two side questions:
1) As I understand what you propose is very similar to QSBR main concept.
Wouldn't it be better to name it accordingly to avoid confusion (or at least document it somewhere).
I think someone else already raised that question.
2) Would QSBR be the only technique in that lib?
Any plans to add something similar to GP one too (with MBs at reader-side)?

> 

> >

> > >

> > > In the case of same thread calling both control-path and data-path

> > > functions, it would depend on the sequence of the calls. The following

> > sequence should not cause any hangs:

> > > Worker thread

> > > 1) 'deletes' an entry from a lock-free data structure

> > > 2) rte_tqs_start

> > > 3) rte_tqs_update

> > > 4) rte_tqs_check (wait == 1 or wait == 0)

> > > 5) 'free' the entry deleted in 1)

> >

> > That an interesting idea, and that should help, I think.

> > Probably worth to have {2,3,4} sequence as a new high level function.

> >

> Yes, this is a good idea. Such a function would be applicable only in the worker thread. I would prefer to leave it to the application to take

> care.


Yes, it would be applicable only to worker thread, but why we can't have a function for it?
Let say it could be 2 different functions: one doing {2,3,4} - for worker threads,
and second doing just {2,4} - for control threads.
Or it could be just one function that takes extra parameter: lcore_id/w[] index.
If it is some predefined invalid value (-1 or so), step #3 will be skipped.

Konstantin

Honnappa Nagarahalli Dec. 13, 2018, 7:39 a.m. UTC | #8

> 

> 

> > > >

> > > > > > +

> > > > > > +/* Add a reader thread, running on an lcore, to the list of

> > > > > > +threads

> > > > > > + * reporting their quiescent state on a TQS variable.

> > > > > > + */

> > > > > > +int __rte_experimental

> > > > > > +rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {

> > > > > > +	TQS_RETURN_IF_TRUE((v == NULL || lcore_id >=

> > > > > RTE_TQS_MAX_LCORE),

> > > > > > +				-EINVAL);

> > > > >

> > > > > It is not very good practice to make function return different

> > > > > values and behave in a different way in debug/non-debug mode.

> > > > > I'd say that for slow-path (functions in .c) it is always good

> > > > > to check input parameters.

> > > > > For fast-path (functions in .h) we sometimes skip such checking,

> > > > > but debug mode can probably use RTE_ASSERT() or so.

> > > > Makes sense, I will change this in the next version.

> > > >

> > > > >

> > > > >

> > > > > lcore_id >= RTE_TQS_MAX_LCORE

> > > > >

> > > > > Is this limitation really necessary?

> > > > I added this limitation because currently DPDK application cannot

> > > > take a mask more than 64bit wide. Otherwise, this should be as big

> > > > as

> > > RTE_MAX_LCORE.

> > > > I see that in the case of '-lcores' option, the number of lcores

> > > > can be more than the number of PEs. In this case, we still need a

> > > > MAX limit (but

> > > can be bigger than 64).

> > > >

> > > > > First it means that only lcores can use that API (at least

> > > > > data-path part), second even today many machines have more than 64

> cores.

> > > > > I think you can easily avoid such limitation, if instead of

> > > > > requiring lcore_id as input parameter, you'll just make it

> > > > > return index of

> > > next available entry in w[].

> > > > > Then tqs_update() can take that index as input parameter.

> > > > I had thought about a similar approach based on IDs. I was

> > > > concerned that ID will be one more thing to manage for the

> > > > application. But, I see the

> > > limitations of the current approach now. I will change it to allocation

> based.

> > > This will support even non-EAL pthreads as well.

> > >

> > > Yes, with such approach non-lcore threads will be able to use it also.

> > >

> > I realized that rte_tqs_register_lcore/ rte_tqs_unregister_lcore need

> > to be efficient as they can be called from the worker's packet

> > processing loop (rte_event_dequeue_burst allows blocking. So, the

> > worker thread needs to call rte_tqs_unregister_lcore before calling

> rte_event_dequeue_burst and rte_tqs_register_lcore before starting packet

> processing). Allocating the thread ID in these functions will make them more

> complex.

> >

> > I suggest that we change the argument 'lcore_id' to 'thread_id'. The

> > application could use 'lcore_id' as 'thread_id' if threads are mapped to

> physical cores 1:1.

> >

> > If the threads are not mapped 1:1 to physical cores, the threads need

> > to use a thread_id in the range of 0 - RTE_TQS_MAX_THREADS. I do not

> > see that DPDK has a thread_id concept. For TQS, the thread IDs are global

> (i.e. not per TQS variable). I could provide APIs to do the thread ID allocation,

> but I think the thread ID allocation should not be part of this library. Such

> thread ID might be useful for other libraries.

> 

> I don't think there is any point to introduce new thread_id concept just for

> that library.

Currently, we have rte_gettid API. It is being used by rte_spinlock. However, the thread ID returned here is the thread ID as defined by OS. rte_spinlock APIs do not care who defines the thread ID as long as those IDs are unique per thread. I think, if we have a thread_id concept that covers non-eal threads as well, it might help other libraries too. For ex: [1] talks about the limitation of per-lcore cache. I think this limitation can be removed easily if we could have a thread_id that is in a small, well defined space (rather than OS defined thread ID which may be an arbitrary number). I see similar issues mentioned for rte_timer.
It might be useful in the dynamic threads Bruce talked about at the Dublin summit (I am not sure on this one, just speculating).

[1] https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#known-issue-label

> After all we already have a concept of lcore_id which pretty much serves the

> same purpose.

> I still think that we need to either:

> a) make register/unregister to work with any valid lcore_id (<=

> RTE_MAX_LCORE)

I have made this change already, it will be there in the next version.

> b) make register/unregister to return index in w[]

> 

> For a) will need mask bigger than 64bits.

> b)  would allow to use data-path API by non-lcores threads too, plus w[]

> would occupy less space, and check() might be faster.

> Though yes, as a drawback, for b) register/unregister probably would need

> extra 'while(CAS(...));' loop.

Along with the CAS, we also need to search for available index in the array.

> I suppose the question here do you foresee a lot of concurrent

> register/unregister at data-path?

IMO, yes, because of the event dev API being blocking.
We can solve this by providing separate APIs for allocation/freeing of the IDs. I am just questioning where these APIs should be.

> 

> >

> > <snip>

> >

> > >

> > > >

> > > > >

> > > > > > +

> > > > > > +		while (lcore_mask) {

> > > > > > +			l = __builtin_ctz(lcore_mask);

> > > > > > +			if (v->w[l].cnt != t)

> > > > > > +				break;

> > > > >

> > > > > As I understand, that makes control-path function progress

> > > > > dependent on simultaneous invocation of data-path functions.

> > > > I agree that the control-path function progress (for ex: how long

> > > > to wait for freeing the memory) depends on invocation of the

> > > > data-path functions. The separation of 'start', 'check' and the

> > > > option not to block in

> > > 'check' provide the flexibility for control-path to do some other

> > > work if it chooses to.

> > > >

> > > > > In some cases that might cause control-path to hang.

> > > > > Let say if data-path function wouldn't be called, or user

> > > > > invokes control-path and data-path functions from the same thread.

> > > > I agree with the case of data-path function not getting called. I

> > > > would consider that as programming error. I can document that

> > > > warning in

> > > the rte_tqs_check API.

> > >

> > > Sure, it can be documented.

> > > Though that means, that each data-path thread would have to do

> > > explicit

> > > update() call for every tqs it might use.

> > > I just think that it would complicate things and might limit usage

> > > of the library quite significantly.

> > Each data path thread has to report its quiescent state. Hence, each

> > data-path thread has to call update() (similar to how

> > rte_timer_manage() has to be called periodically on the worker thread).

> 

> I understand that.

> Though that means that each data-path thread has to know explicitly what rcu

> vars it accesses.

Yes. That is correct. It is both good and bad. It is providing flexibility to reduce the overhead. For ex: in pipeline mode, it may be that a particular data structure is accessed only by some of the threads in the application. In this case, this library allows for per data structure vars, which reduces the over head. This applies for service cores as well.

> Would be hard to adopt such API with rcu vars used inside some library.

> But ok, as I understand people do use QSBR approach in their apps and find it

> useful.

It can be adopted in the library with different levels of assumptions/constraints.
1) With the assumption that the data plane threads will update the quiescent state. For ex: for rte_hash library we could ask the user to pass the TQS variable as input and rte_hash writer APIs can call rte_tqs_start and rte_tqs_check APIs.
2) If the assumption in 1) is not good, updating of the quiescent state can be hidden in the library, but again with the assumption that the data plane library API is called on a regular basis. For ex: the rte_tqs_update can be called within rte_hash_lookup API.
3) If we do not want to assume that the data plane API will be called on a regular basis, then the rte_tqs_register/unregister APIs need to be used before and after entering the critical section along with calling rte_tqs_update API. For ex: rte_hash_lookup should have the sequence rte_tqs_register, <critical section>, rte_tqs_unregister, rte_tqs_update. (very similar to GP)

> 

> > Do you have any particular use case in mind where this fails?

> 

> Let say it means that library can't be used to add/del RX/TX ethdev callbacks

> in a safe manner.

I need to understand this better. I will look at rte_ethdev library.

> 

> BTW, two side questions:

> 1) As I understand what you propose is very similar to QSBR main concept.

> Wouldn't it be better to name it accordingly to avoid confusion (or at least

> document it somewhere).

> I think someone else already raised that question.

QSBR stands for Quiescent State Based Reclamation. This library already has 'Thread Quiescent State' in the name. Others have questioned/suggested why not use RCU instead. I called it thread quiescent state as this library just helps determine if all the readers have entered the quiescent state. It does not do anything else.

However, you are also bringing up an important point, 'will we add other methods of memory reclamation'? With that in mind, may be we should not call it RCU. But, may be call it as rte_rcu_qsbr_xxx? It will also future proof the API incase we want to add additional RCU types.

> 2) Would QSBR be the only technique in that lib?

> Any plans to add something similar to GP one too (with MBs at reader-side)?

I believe, by GP, you mean general-purpose RCU. In my understanding QSBR is the one with least overhead. For DPDK applications, I think reducing that overhead is important. The GP adds additional over head on the reader side. I did not see a need to add any additional ones as of now. But if there are use cases that cannot be achieved with the proposed APIs, we can definitely expand it.

> 

> >

> > >

> > > >

> > > > In the case of same thread calling both control-path and data-path

> > > > functions, it would depend on the sequence of the calls. The

> > > > following

> > > sequence should not cause any hangs:

> > > > Worker thread

> > > > 1) 'deletes' an entry from a lock-free data structure

> > > > 2) rte_tqs_start

> > > > 3) rte_tqs_update

> > > > 4) rte_tqs_check (wait == 1 or wait == 0)

> > > > 5) 'free' the entry deleted in 1)

> > >

> > > That an interesting idea, and that should help, I think.

> > > Probably worth to have {2,3,4} sequence as a new high level function.

> > >

> > Yes, this is a good idea. Such a function would be applicable only in

> > the worker thread. I would prefer to leave it to the application to take care.

> 

> Yes, it would be applicable only to worker thread, but why we can't have a

> function for it?

> Let say it could be 2 different functions: one doing {2,3,4} - for worker threads,

> and second doing just {2,4} - for control threads.

> Or it could be just one function that takes extra parameter: lcore_id/w[] index.

> If it is some predefined invalid value (-1 or so), step #3 will be skipped.

The rte_tqs_start and rte_tqs_check are separated into 2 APIs so that the writers do not have to spend CPU/memory cycles polling for the readers' quiescent state. In the context of DPDK, this overhead will be significant (at least equal to the length of 1 while loop on the worker core). This is one of the key features of this library. Combining 2,[3], 4 will defeat this purpose. For ex: in the rte_hash library, whenever a writer on the data path calls rte_hash_add, (with 2,3,4 combined) it will wait for the rest of the readers to enter quiescent state. i.e. the performance will come down whenever a rte_hash_add is called.

I am trying to understand your concern. If it is that, 'programmers may not use the right sequence', I would prefer to treat that as programming error. May be it is better addressed by providing debugging capabilities.

> 

> Konstantin

Burakov, Anatoly Dec. 13, 2018, 12:26 p.m. UTC | #9

On 11-Dec-18 6:40 AM, Honnappa Nagarahalli wrote:
>>

>>>>>

>>>>>>> +

>>>>>>> +/* Add a reader thread, running on an lcore, to the list of

>>>>>>> +threads

>>>>>>> + * reporting their quiescent state on a TQS variable.

>>>>>>> + */

>>>>>>> +int __rte_experimental

>>>>>>> +rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {

>>>>>>> +	TQS_RETURN_IF_TRUE((v == NULL || lcore_id >=

>>>>>> RTE_TQS_MAX_LCORE),

>>>>>>> +				-EINVAL);

>>>>>>

>>>>>> It is not very good practice to make function return different

>>>>>> values and behave in a different way in debug/non-debug mode.

>>>>>> I'd say that for slow-path (functions in .c) it is always good

>>>>>> to check input parameters.

>>>>>> For fast-path (functions in .h) we sometimes skip such checking,

>>>>>> but debug mode can probably use RTE_ASSERT() or so.

>>>>> Makes sense, I will change this in the next version.

>>>>>

>>>>>>

>>>>>>

>>>>>> lcore_id >= RTE_TQS_MAX_LCORE

>>>>>>

>>>>>> Is this limitation really necessary?

>>>>> I added this limitation because currently DPDK application cannot

>>>>> take a mask more than 64bit wide. Otherwise, this should be as big

>>>>> as

>>>> RTE_MAX_LCORE.

>>>>> I see that in the case of '-lcores' option, the number of lcores

>>>>> can be more than the number of PEs. In this case, we still need a

>>>>> MAX limit (but

>>>> can be bigger than 64).

>>>>>

>>>>>> First it means that only lcores can use that API (at least

>>>>>> data-path part), second even today many machines have more than 64

>> cores.

>>>>>> I think you can easily avoid such limitation, if instead of

>>>>>> requiring lcore_id as input parameter, you'll just make it

>>>>>> return index of

>>>> next available entry in w[].

>>>>>> Then tqs_update() can take that index as input parameter.

>>>>> I had thought about a similar approach based on IDs. I was

>>>>> concerned that ID will be one more thing to manage for the

>>>>> application. But, I see the

>>>> limitations of the current approach now. I will change it to allocation

>> based.

>>>> This will support even non-EAL pthreads as well.

>>>>

>>>> Yes, with such approach non-lcore threads will be able to use it also.

>>>>

>>> I realized that rte_tqs_register_lcore/ rte_tqs_unregister_lcore need to be

>> efficient as they can be called from the worker's packet processing loop

>> (rte_event_dequeue_burst allows blocking. So, the worker thread needs to

>> call rte_tqs_unregister_lcore before calling rte_event_dequeue_burst and

>> rte_tqs_register_lcore before starting packet processing). Allocating the

>> thread ID in these functions will make them more complex.

>>>

>>> I suggest that we change the argument 'lcore_id' to 'thread_id'. The

>> application could use 'lcore_id' as 'thread_id' if threads are mapped to

>> physical cores 1:1.

>>>

>>> If the threads are not mapped 1:1 to physical cores, the threads need to use

>> a thread_id in the range of 0 - RTE_TQS_MAX_THREADS. I do not see that

>> DPDK has a thread_id concept. For TQS, the thread IDs are global (i.e. not per

>> TQS variable). I could provide APIs to do the thread ID allocation, but I think

>> the thread ID allocation should not be part of this library. Such thread ID

>> might be useful for other libraries.

>>>

>>> <snip

>>

>>

>> Thread id is problematic since Glibc doesn't want to give it out.

>> You have to roll your own function to do gettid().

>> It is not as easy as just that.  Plus what about preemption?

> 

> Agree. I looked into this further. The rte_gettid function uses a system call (BSD and Linux). I am not clear on the space of the ID returned (as well). I do not think it is guaranteed that it will be with in a narrow range that is required here.

> 

> My suggestion would be to add a set of APIs that would allow for allocation of thread IDs which are within a given range of 0 to <predefined MAX>

> 


System-provided thread-ID's would probably also be potentially 
non-unique in multiprocess scenario?

-- 
Thanks,
Anatoly

Ananyev, Konstantin Dec. 17, 2018, 1:14 p.m. UTC | #10

> >

> > > > >

> > > > > > > +

> > > > > > > +/* Add a reader thread, running on an lcore, to the list of

> > > > > > > +threads

> > > > > > > + * reporting their quiescent state on a TQS variable.

> > > > > > > + */

> > > > > > > +int __rte_experimental

> > > > > > > +rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {

> > > > > > > +	TQS_RETURN_IF_TRUE((v == NULL || lcore_id >=

> > > > > > RTE_TQS_MAX_LCORE),

> > > > > > > +				-EINVAL);

> > > > > >

> > > > > > It is not very good practice to make function return different

> > > > > > values and behave in a different way in debug/non-debug mode.

> > > > > > I'd say that for slow-path (functions in .c) it is always good

> > > > > > to check input parameters.

> > > > > > For fast-path (functions in .h) we sometimes skip such checking,

> > > > > > but debug mode can probably use RTE_ASSERT() or so.

> > > > > Makes sense, I will change this in the next version.

> > > > >

> > > > > >

> > > > > >

> > > > > > lcore_id >= RTE_TQS_MAX_LCORE

> > > > > >

> > > > > > Is this limitation really necessary?

> > > > > I added this limitation because currently DPDK application cannot

> > > > > take a mask more than 64bit wide. Otherwise, this should be as big

> > > > > as

> > > > RTE_MAX_LCORE.

> > > > > I see that in the case of '-lcores' option, the number of lcores

> > > > > can be more than the number of PEs. In this case, we still need a

> > > > > MAX limit (but

> > > > can be bigger than 64).

> > > > >

> > > > > > First it means that only lcores can use that API (at least

> > > > > > data-path part), second even today many machines have more than 64

> > cores.

> > > > > > I think you can easily avoid such limitation, if instead of

> > > > > > requiring lcore_id as input parameter, you'll just make it

> > > > > > return index of

> > > > next available entry in w[].

> > > > > > Then tqs_update() can take that index as input parameter.

> > > > > I had thought about a similar approach based on IDs. I was

> > > > > concerned that ID will be one more thing to manage for the

> > > > > application. But, I see the

> > > > limitations of the current approach now. I will change it to allocation

> > based.

> > > > This will support even non-EAL pthreads as well.

> > > >

> > > > Yes, with such approach non-lcore threads will be able to use it also.

> > > >

> > > I realized that rte_tqs_register_lcore/ rte_tqs_unregister_lcore need

> > > to be efficient as they can be called from the worker's packet

> > > processing loop (rte_event_dequeue_burst allows blocking. So, the

> > > worker thread needs to call rte_tqs_unregister_lcore before calling

> > rte_event_dequeue_burst and rte_tqs_register_lcore before starting packet

> > processing). Allocating the thread ID in these functions will make them more

> > complex.

> > >

> > > I suggest that we change the argument 'lcore_id' to 'thread_id'. The

> > > application could use 'lcore_id' as 'thread_id' if threads are mapped to

> > physical cores 1:1.

> > >

> > > If the threads are not mapped 1:1 to physical cores, the threads need

> > > to use a thread_id in the range of 0 - RTE_TQS_MAX_THREADS. I do not

> > > see that DPDK has a thread_id concept. For TQS, the thread IDs are global

> > (i.e. not per TQS variable). I could provide APIs to do the thread ID allocation,

> > but I think the thread ID allocation should not be part of this library. Such

> > thread ID might be useful for other libraries.

> >

> > I don't think there is any point to introduce new thread_id concept just for

> > that library.

> Currently, we have rte_gettid API. It is being used by rte_spinlock. However, the thread ID returned here is the thread ID as defined by OS.

> rte_spinlock APIs do not care who defines the thread ID as long as those IDs are unique per thread. I think, if we have a thread_id concept

> that covers non-eal threads as well, it might help other libraries too. For ex: [1] talks about the limitation of per-lcore cache.

> I think this

> limitation can be removed easily if we could have a thread_id that is in a small, well defined space (rather than OS defined thread ID which

> may be an arbitrary number). I see similar issues mentioned for rte_timer.


If we'll just introduce new ID (let's name it thread_id) then we'll just replace one limitation with the other.
If it still would be local_cache[], now based on some thread_id instead of current lcore_id.
I don't see how it will be better than current one.
To make any arbitrary thread to use mempool's cache we need something smarter
then just local_cache[] for each id, but without loss of performance.

> It might be useful in the dynamic threads Bruce talked about at the Dublin summit (I am not sure on this one, just speculating).


That's probably about  make lcore_id allocation/freeing to be dynamic.
> 

> [1] https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#known-issue-label

> 

> > After all we already have a concept of lcore_id which pretty much serves the

> > same purpose.

> > I still think that we need to either:

> > a) make register/unregister to work with any valid lcore_id (<=

> > RTE_MAX_LCORE)

> I have made this change already, it will be there in the next version.


Ok.

> 

> > b) make register/unregister to return index in w[]

> >

> > For a) will need mask bigger than 64bits.

> > b)  would allow to use data-path API by non-lcores threads too, plus w[]

> > would occupy less space, and check() might be faster.

> > Though yes, as a drawback, for b) register/unregister probably would need

> > extra 'while(CAS(...));' loop.

> Along with the CAS, we also need to search for available index in the array.


Sure, but I thought that one is relatively cheap comparing to CAS itself
(probably not, as cache line with data will be shared between cores).

> 

> > I suppose the question here do you foresee a lot of concurrent

> > register/unregister at data-path?

> IMO, yes, because of the event dev API being blocking.

> We can solve this by providing separate APIs for allocation/freeing of the IDs. I am just questioning where these APIs should be.

> 

> >

> > >

> > > <snip>

> > >

> > > >

> > > > >

> > > > > >

> > > > > > > +

> > > > > > > +		while (lcore_mask) {

> > > > > > > +			l = __builtin_ctz(lcore_mask);

> > > > > > > +			if (v->w[l].cnt != t)

> > > > > > > +				break;

> > > > > >

> > > > > > As I understand, that makes control-path function progress

> > > > > > dependent on simultaneous invocation of data-path functions.

> > > > > I agree that the control-path function progress (for ex: how long

> > > > > to wait for freeing the memory) depends on invocation of the

> > > > > data-path functions. The separation of 'start', 'check' and the

> > > > > option not to block in

> > > > 'check' provide the flexibility for control-path to do some other

> > > > work if it chooses to.

> > > > >

> > > > > > In some cases that might cause control-path to hang.

> > > > > > Let say if data-path function wouldn't be called, or user

> > > > > > invokes control-path and data-path functions from the same thread.

> > > > > I agree with the case of data-path function not getting called. I

> > > > > would consider that as programming error. I can document that

> > > > > warning in

> > > > the rte_tqs_check API.

> > > >

> > > > Sure, it can be documented.

> > > > Though that means, that each data-path thread would have to do

> > > > explicit

> > > > update() call for every tqs it might use.

> > > > I just think that it would complicate things and might limit usage

> > > > of the library quite significantly.

> > > Each data path thread has to report its quiescent state. Hence, each

> > > data-path thread has to call update() (similar to how

> > > rte_timer_manage() has to be called periodically on the worker thread).

> >

> > I understand that.

> > Though that means that each data-path thread has to know explicitly what rcu

> > vars it accesses.

> Yes. That is correct. It is both good and bad. It is providing flexibility to reduce the overhead. For ex: in pipeline mode, it may be that a

> particular data structure is accessed only by some of the threads in the application. In this case, this library allows for per data structure

> vars, which reduces the over head. This applies for service cores as well.

> 

> > Would be hard to adopt such API with rcu vars used inside some library.

> > But ok, as I understand people do use QSBR approach in their apps and find it

> > useful.

> It can be adopted in the library with different levels of assumptions/constraints.

> 1) With the assumption that the data plane threads will update the quiescent state. For ex: for rte_hash library we could ask the user to pass

> the TQS variable as input and rte_hash writer APIs can call rte_tqs_start and rte_tqs_check APIs.

> 2) If the assumption in 1) is not good, updating of the quiescent state can be hidden in the library, but again with the assumption that the

> data plane library API is called on a regular basis. For ex: the rte_tqs_update can be called within rte_hash_lookup API.

> 3) If we do not want to assume that the data plane API will be called on a regular basis, then the rte_tqs_register/unregister APIs need to be

> used before and after entering the critical section along with calling rte_tqs_update API. For ex: rte_hash_lookup should have the sequence

> rte_tqs_register, <critical section>, rte_tqs_unregister, rte_tqs_update. (very similar to GP)


#3 is surely possible but it seems quite expensive.
Anyway, as I said before, people do use QSBR approach -
it has the small overhead for readers and relatively straightforward.
So let start with that one, and have some ability to extend the lib
with new methods  in future.

> 

> >

> > > Do you have any particular use case in mind where this fails?

> >

> > Let say it means that library can't be used to add/del RX/TX ethdev callbacks

> > in a safe manner.

> I need to understand this better. I will look at rte_ethdev library.


Ok, you can also have a look at: lib/librte_bpf/bpf_pkt.c
to check how we overcome it now.

> 

> >

> > BTW, two side questions:

> > 1) As I understand what you propose is very similar to QSBR main concept.

> > Wouldn't it be better to name it accordingly to avoid confusion (or at least

> > document it somewhere).

> > I think someone else already raised that question.

> QSBR stands for Quiescent State Based Reclamation. This library already has 'Thread Quiescent State' in the name. Others have

> questioned/suggested why not use RCU instead. I called it thread quiescent state as this library just helps determine if all the readers have

> entered the quiescent state. It does not do anything else.

> 

> However, you are also bringing up an important point, 'will we add other methods of memory reclamation'? With that in mind, may be we

> should not call it RCU. But, may be call it as rte_rcu_qsbr_xxx? It will also future proof the API incase we want to add additional RCU types.


Yep, that sounds like a good approach to me.

> 

> > 2) Would QSBR be the only technique in that lib?

> > Any plans to add something similar to GP one too (with MBs at reader-side)?

> I believe, by GP, you mean general-purpose RCU.


Yes.

> In my understanding QSBR is the one with least overhead. For DPDK applications, I think

> reducing that overhead is important. The GP adds additional over head on the reader side.


Yes, but it provides better flexibility.
> I did not see a need to add any additional ones as of now.


Let say your #3 solution above.
I think GP will be cheaper than register/unregister for each library call.

> But if there are use cases that cannot be achieved with the proposed APIs, we can definitely expand it.

> 

> >

> > >

> > > >

> > > > >

> > > > > In the case of same thread calling both control-path and data-path

> > > > > functions, it would depend on the sequence of the calls. The

> > > > > following

> > > > sequence should not cause any hangs:

> > > > > Worker thread

> > > > > 1) 'deletes' an entry from a lock-free data structure

> > > > > 2) rte_tqs_start

> > > > > 3) rte_tqs_update

> > > > > 4) rte_tqs_check (wait == 1 or wait == 0)

> > > > > 5) 'free' the entry deleted in 1)

> > > >

> > > > That an interesting idea, and that should help, I think.

> > > > Probably worth to have {2,3,4} sequence as a new high level function.

> > > >

> > > Yes, this is a good idea. Such a function would be applicable only in

> > > the worker thread. I would prefer to leave it to the application to take care.

> >

> > Yes, it would be applicable only to worker thread, but why we can't have a

> > function for it?

> > Let say it could be 2 different functions: one doing {2,3,4} - for worker threads,

> > and second doing just {2,4} - for control threads.

> > Or it could be just one function that takes extra parameter: lcore_id/w[] index.

> > If it is some predefined invalid value (-1 or so), step #3 will be skipped.

> The rte_tqs_start and rte_tqs_check are separated into 2 APIs so that the writers do not have to spend CPU/memory cycles polling for the

> readers' quiescent state. In the context of DPDK, this overhead will be significant (at least equal to the length of 1 while loop on the worker

> core). This is one of the key features of this library. Combining 2,[3], 4 will defeat this purpose. For ex: in the rte_hash library, whenever a

> writer on the data path calls rte_hash_add, (with 2,3,4 combined) it will wait for the rest of the readers to enter quiescent state. i.e. the

> performance will come down whenever a rte_hash_add is called.


I am not suggesting to replace start+[update+]check with one mega-function.
NP with currently defined API 
I am talking about an additional function for the users where performance is not a main concern -
they just need a function that would do things in  a proper way for themc.
I think having such extra function will simplify their life,
again they can use it as a reference to understand the proper sequence they need to call on their own.

Honnappa Nagarahalli Dec. 18, 2018, 4:30 a.m. UTC | #11

> 

> On 11-Dec-18 6:40 AM, Honnappa Nagarahalli wrote:

> >>

> >>>>>

> >>>>>>> +

> >>>>>>> +/* Add a reader thread, running on an lcore, to the list of

> >>>>>>> +threads

> >>>>>>> + * reporting their quiescent state on a TQS variable.

> >>>>>>> + */

> >>>>>>> +int __rte_experimental

> >>>>>>> +rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {

> >>>>>>> +	TQS_RETURN_IF_TRUE((v == NULL || lcore_id >=

> >>>>>> RTE_TQS_MAX_LCORE),

> >>>>>>> +				-EINVAL);

> >>>>>>

> >>>>>> It is not very good practice to make function return different

> >>>>>> values and behave in a different way in debug/non-debug mode.

> >>>>>> I'd say that for slow-path (functions in .c) it is always good to

> >>>>>> check input parameters.

> >>>>>> For fast-path (functions in .h) we sometimes skip such checking,

> >>>>>> but debug mode can probably use RTE_ASSERT() or so.

> >>>>> Makes sense, I will change this in the next version.

> >>>>>

> >>>>>>

> >>>>>>

> >>>>>> lcore_id >= RTE_TQS_MAX_LCORE

> >>>>>>

> >>>>>> Is this limitation really necessary?

> >>>>> I added this limitation because currently DPDK application cannot

> >>>>> take a mask more than 64bit wide. Otherwise, this should be as big

> >>>>> as

> >>>> RTE_MAX_LCORE.

> >>>>> I see that in the case of '-lcores' option, the number of lcores

> >>>>> can be more than the number of PEs. In this case, we still need a

> >>>>> MAX limit (but

> >>>> can be bigger than 64).

> >>>>>

> >>>>>> First it means that only lcores can use that API (at least

> >>>>>> data-path part), second even today many machines have more than

> >>>>>> 64

> >> cores.

> >>>>>> I think you can easily avoid such limitation, if instead of

> >>>>>> requiring lcore_id as input parameter, you'll just make it return

> >>>>>> index of

> >>>> next available entry in w[].

> >>>>>> Then tqs_update() can take that index as input parameter.

> >>>>> I had thought about a similar approach based on IDs. I was

> >>>>> concerned that ID will be one more thing to manage for the

> >>>>> application. But, I see the

> >>>> limitations of the current approach now. I will change it to

> >>>> allocation

> >> based.

> >>>> This will support even non-EAL pthreads as well.

> >>>>

> >>>> Yes, with such approach non-lcore threads will be able to use it also.

> >>>>

> >>> I realized that rte_tqs_register_lcore/ rte_tqs_unregister_lcore

> >>> need to be

> >> efficient as they can be called from the worker's packet processing

> >> loop (rte_event_dequeue_burst allows blocking. So, the worker thread

> >> needs to call rte_tqs_unregister_lcore before calling

> >> rte_event_dequeue_burst and rte_tqs_register_lcore before starting

> >> packet processing). Allocating the thread ID in these functions will make

> them more complex.

> >>>

> >>> I suggest that we change the argument 'lcore_id' to 'thread_id'. The

> >> application could use 'lcore_id' as 'thread_id' if threads are mapped

> >> to physical cores 1:1.

> >>>

> >>> If the threads are not mapped 1:1 to physical cores, the threads

> >>> need to use

> >> a thread_id in the range of 0 - RTE_TQS_MAX_THREADS. I do not see

> >> that DPDK has a thread_id concept. For TQS, the thread IDs are global

> >> (i.e. not per TQS variable). I could provide APIs to do the thread ID

> >> allocation, but I think the thread ID allocation should not be part

> >> of this library. Such thread ID might be useful for other libraries.

> >>>

> >>> <snip

> >>

> >>

> >> Thread id is problematic since Glibc doesn't want to give it out.

> >> You have to roll your own function to do gettid().

> >> It is not as easy as just that.  Plus what about preemption?

> >

> > Agree. I looked into this further. The rte_gettid function uses a system call

> (BSD and Linux). I am not clear on the space of the ID returned (as well). I do

> not think it is guaranteed that it will be with in a narrow range that is required

> here.

> >

> > My suggestion would be to add a set of APIs that would allow for

> > allocation of thread IDs which are within a given range of 0 to

> > <predefined MAX>

> >

> 

> System-provided thread-ID's would probably also be potentially non-unique in

> multiprocess scenario?

For linux, rte_gettid is implemented as:
int rte_sys_gettid(void)
{
        return (int)syscall(SYS_gettid);
}

Came across [1] which states, thread-IDs are unique across the system.

For BSD, thr_self is used. [2] says it provides system wide unique thread IDs.

[1] https://stackoverflow.com/questions/6372102/what-is-the-difference-between-pthread-self-and-gettid-which-one-should-i-u
[2] https://nxmnpg.lemoda.net/2/thr_self

> 

> --

> Thanks,

> Anatoly

Stephen Hemminger Dec. 18, 2018, 6:31 a.m. UTC | #12

On Tue, 18 Dec 2018 04:30:39 +0000
Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:

> > 

> > On 11-Dec-18 6:40 AM, Honnappa Nagarahalli wrote:  

> > >>  

> > >>>>>  

> > >>>>>>> +

> > >>>>>>> +/* Add a reader thread, running on an lcore, to the list of

> > >>>>>>> +threads

> > >>>>>>> + * reporting their quiescent state on a TQS variable.

> > >>>>>>> + */

> > >>>>>>> +int __rte_experimental

> > >>>>>>> +rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {

> > >>>>>>> +	TQS_RETURN_IF_TRUE((v == NULL || lcore_id >=  

> > >>>>>> RTE_TQS_MAX_LCORE),  

> > >>>>>>> +				-EINVAL);  

> > >>>>>>

> > >>>>>> It is not very good practice to make function return different

> > >>>>>> values and behave in a different way in debug/non-debug mode.

> > >>>>>> I'd say that for slow-path (functions in .c) it is always good to

> > >>>>>> check input parameters.

> > >>>>>> For fast-path (functions in .h) we sometimes skip such checking,

> > >>>>>> but debug mode can probably use RTE_ASSERT() or so.  

> > >>>>> Makes sense, I will change this in the next version.

> > >>>>>  

> > >>>>>>

> > >>>>>>

> > >>>>>> lcore_id >= RTE_TQS_MAX_LCORE

> > >>>>>>

> > >>>>>> Is this limitation really necessary?  

> > >>>>> I added this limitation because currently DPDK application cannot

> > >>>>> take a mask more than 64bit wide. Otherwise, this should be as big

> > >>>>> as  

> > >>>> RTE_MAX_LCORE.  

> > >>>>> I see that in the case of '-lcores' option, the number of lcores

> > >>>>> can be more than the number of PEs. In this case, we still need a

> > >>>>> MAX limit (but  

> > >>>> can be bigger than 64).  

> > >>>>>  

> > >>>>>> First it means that only lcores can use that API (at least

> > >>>>>> data-path part), second even today many machines have more than

> > >>>>>> 64  

> > >> cores.  

> > >>>>>> I think you can easily avoid such limitation, if instead of

> > >>>>>> requiring lcore_id as input parameter, you'll just make it return

> > >>>>>> index of  

> > >>>> next available entry in w[].  

> > >>>>>> Then tqs_update() can take that index as input parameter.  

> > >>>>> I had thought about a similar approach based on IDs. I was

> > >>>>> concerned that ID will be one more thing to manage for the

> > >>>>> application. But, I see the  

> > >>>> limitations of the current approach now. I will change it to

> > >>>> allocation  

> > >> based.  

> > >>>> This will support even non-EAL pthreads as well.

> > >>>>

> > >>>> Yes, with such approach non-lcore threads will be able to use it also.

> > >>>>  

> > >>> I realized that rte_tqs_register_lcore/ rte_tqs_unregister_lcore

> > >>> need to be  

> > >> efficient as they can be called from the worker's packet processing

> > >> loop (rte_event_dequeue_burst allows blocking. So, the worker thread

> > >> needs to call rte_tqs_unregister_lcore before calling

> > >> rte_event_dequeue_burst and rte_tqs_register_lcore before starting

> > >> packet processing). Allocating the thread ID in these functions will make  

> > them more complex.  

> > >>>

> > >>> I suggest that we change the argument 'lcore_id' to 'thread_id'. The  

> > >> application could use 'lcore_id' as 'thread_id' if threads are mapped

> > >> to physical cores 1:1.  

> > >>>

> > >>> If the threads are not mapped 1:1 to physical cores, the threads

> > >>> need to use  

> > >> a thread_id in the range of 0 - RTE_TQS_MAX_THREADS. I do not see

> > >> that DPDK has a thread_id concept. For TQS, the thread IDs are global

> > >> (i.e. not per TQS variable). I could provide APIs to do the thread ID

> > >> allocation, but I think the thread ID allocation should not be part

> > >> of this library. Such thread ID might be useful for other libraries.  

> > >>>

> > >>> <snip  

> > >>

> > >>

> > >> Thread id is problematic since Glibc doesn't want to give it out.

> > >> You have to roll your own function to do gettid().

> > >> It is not as easy as just that.  Plus what about preemption?  

> > >

> > > Agree. I looked into this further. The rte_gettid function uses a system call  

> > (BSD and Linux). I am not clear on the space of the ID returned (as well). I do

> > not think it is guaranteed that it will be with in a narrow range that is required

> > here.  

> > >

> > > My suggestion would be to add a set of APIs that would allow for

> > > allocation of thread IDs which are within a given range of 0 to

> > > <predefined MAX>

> > >  

> > 

> > System-provided thread-ID's would probably also be potentially non-unique in

> > multiprocess scenario?  

> For linux, rte_gettid is implemented as:

> int rte_sys_gettid(void)

> {

>         return (int)syscall(SYS_gettid);

> }

> 

> Came across [1] which states, thread-IDs are unique across the system.

> 

> For BSD, thr_self is used. [2] says it provides system wide unique thread IDs.

> 

> [1] https://stackoverflow.com/questions/6372102/what-is-the-difference-between-pthread-self-and-gettid-which-one-should-i-u

> [2] https://nxmnpg.lemoda.net/2/thr_self

> 

> > 

> > --

> > Thanks,

> > Anatoly  


Using thread id directly on Linux is battling against the glibc gods wishes.
Bad things may come to those that disobey them :-)

But really many libraries need to do the same thing, it is worth looking around.
The bigger issue is pid and thread id recycling with the limited range allowed.

[RFC,2/3] tqs: add thread quiescent state library

Commit Message

Comments

Patch