diff mbox series

[v2,1/6] service: fix race condition for MT unsafe service

Message ID 1587659482-27133-2-git-send-email-phil.yang@arm.com
State New
Headers show
Series [v2,1/6] service: fix race condition for MT unsafe service | expand

Commit Message

Phil Yang April 23, 2020, 4:31 p.m. UTC
From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>


The MT unsafe service might get configured to run on another core
while the service is running currently. This might result in the
MT unsafe service running on multiple cores simultaneously. Use
'execute_lock' always when the service is MT unsafe.

Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
Cc: stable@dpdk.org

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Reviewed-by: Phil Yang <phil.yang@arm.com>

---
 lib/librte_eal/common/rte_service.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

-- 
2.7.4

Comments

Van Haaren, Harry April 29, 2020, 4:51 p.m. UTC | #1
> -----Original Message-----

> From: Phil Yang <phil.yang@arm.com>

> Sent: Thursday, April 23, 2020 5:31 PM

> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org

> Cc: thomas@monjalon.net; david.marchand@redhat.com; Ananyev, Konstantin

> <konstantin.ananyev@intel.com>; jerinj@marvell.com;

> hemant.agrawal@nxp.com; Honnappa.Nagarahalli@arm.com;

> gavin.hu@arm.com; nd@arm.com; Honnappa Nagarahalli

> <honnappa.nagarahalli@arm.com>; stable@dpdk.org

> Subject: [PATCH v2 1/6] service: fix race condition for MT unsafe service

> 

> From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

> 

> The MT unsafe service might get configured to run on another core

> while the service is running currently. This might result in the

> MT unsafe service running on multiple cores simultaneously. Use

> 'execute_lock' always when the service is MT unsafe.

> 

> Fixes: e9139a32f6e8 ("service: add function to run on app lcore")

> Cc: stable@dpdk.org

> 

> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

> Reviewed-by: Phil Yang <phil.yang@arm.com>

> ---


Thanks for spinning a new revision - based on ML discussion previously,
it seems like the "use service-run-count" to avoid this race would be a
complex solution.

Suggesting the following;
1) Take the approach as per this patch, to always take the atomic, fixing the race condition.
2) Add an API to service-cores, which allows "committing" of mappings. Committing the mapping would imply that the mappings will not be changed in future. With runtime-remapping being removed from the equation, the existing branch-over-atomic optimization is valid again.

So this would offer applications two situations
A) No application change: possible performance regression due to atomic always taken.
B) Call "commit" API, and regain the performance as per previous DPDK versions.

Thoughts/opinions on the above?  I've flagged the rest of the patchset for review ASAP. Regards, -Harry

>  lib/librte_eal/common/rte_service.c | 11 +++++------

>  1 file changed, 5 insertions(+), 6 deletions(-)

> 

> diff --git a/lib/librte_eal/common/rte_service.c

> b/lib/librte_eal/common/rte_service.c

> index 70d17a5..b8c465e 100644

> --- a/lib/librte_eal/common/rte_service.c

> +++ b/lib/librte_eal/common/rte_service.c

> @@ -50,6 +50,10 @@ struct rte_service_spec_impl {

>  	uint8_t internal_flags;

> 

>  	/* per service statistics */

> +	/* Indicates how many cores the service is mapped to run on.

> +	 * It does not indicate the number of cores the service is running

> +	 * on currently.

> +	 */

>  	rte_atomic32_t num_mapped_cores;

>  	uint64_t calls;

>  	uint64_t cycles_spent;

> @@ -370,12 +374,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t

> service_mask,

> 

>  	cs->service_active_on_lcore[i] = 1;

> 

> -	/* check do we need cmpset, if MT safe or <= 1 core

> -	 * mapped, atomic ops are not required.

> -	 */

> -	const int use_atomics = (service_mt_safe(s) == 0) &&

> -				(rte_atomic32_read(&s->num_mapped_cores) > 1);

> -	if (use_atomics) {

> +	if (service_mt_safe(s) == 0) {

>  		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))

>  			return -EBUSY;

> 

> --

> 2.7.4
Honnappa Nagarahalli April 29, 2020, 10:48 p.m. UTC | #2
Hi Harry,
	Thanks for getting back on this.

<snip>

> > Subject: [PATCH v2 1/6] service: fix race condition for MT unsafe

> > service

> >

> > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

> >

> > The MT unsafe service might get configured to run on another core

> > while the service is running currently. This might result in the MT

> > unsafe service running on multiple cores simultaneously. Use

> > 'execute_lock' always when the service is MT unsafe.

> >

> > Fixes: e9139a32f6e8 ("service: add function to run on app lcore")

> > Cc: stable@dpdk.org

> >

> > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

> > Reviewed-by: Phil Yang <phil.yang@arm.com>

> > ---

> 

> Thanks for spinning a new revision - based on ML discussion previously, it

> seems like the "use service-run-count" to avoid this race would be a complex

> solution.

> 

> Suggesting the following;

> 1) Take the approach as per this patch, to always take the atomic, fixing the

> race condition.

Ok

> 2) Add an API to service-cores, which allows "committing" of mappings.

> Committing the mapping would imply that the mappings will not be changed

> in future. With runtime-remapping being removed from the equation, the

> existing branch-over-atomic optimization is valid again.

Ok. Just to make sure I understand this:
a) on the data plane, if commit API is called (probably a new state variable) and num_mapped_cores is set to 1, there is no need to take the lock.
b) possible implementation of the commit API would check if num_mapped_cores for the service is set to 1 and set a variable to indicate that the lock is not required.

What do you think about asking the application to set  the service capability to MT_SAFE if it knows that the service will run on a single core? This would require us to change the documentation and does not require additional code.

> 

> So this would offer applications two situations

> A) No application change: possible performance regression due to atomic

> always taken.

> B) Call "commit" API, and regain the performance as per previous DPDK

> versions.

> 

> Thoughts/opinions on the above?  I've flagged the rest of the patchset for

> review ASAP. Regards, -Harry

> 

> >  lib/librte_eal/common/rte_service.c | 11 +++++------

> >  1 file changed, 5 insertions(+), 6 deletions(-)

> >

> > diff --git a/lib/librte_eal/common/rte_service.c

> > b/lib/librte_eal/common/rte_service.c

> > index 70d17a5..b8c465e 100644

> > --- a/lib/librte_eal/common/rte_service.c

> > +++ b/lib/librte_eal/common/rte_service.c

> > @@ -50,6 +50,10 @@ struct rte_service_spec_impl {

> >  	uint8_t internal_flags;

> >

> >  	/* per service statistics */

> > +	/* Indicates how many cores the service is mapped to run on.

> > +	 * It does not indicate the number of cores the service is running

> > +	 * on currently.

> > +	 */

> >  	rte_atomic32_t num_mapped_cores;

> >  	uint64_t calls;

> >  	uint64_t cycles_spent;

> > @@ -370,12 +374,7 @@ service_run(uint32_t i, struct core_state *cs,

> > uint64_t service_mask,

> >

> >  	cs->service_active_on_lcore[i] = 1;

> >

> > -	/* check do we need cmpset, if MT safe or <= 1 core

> > -	 * mapped, atomic ops are not required.

> > -	 */

> > -	const int use_atomics = (service_mt_safe(s) == 0) &&

> > -				(rte_atomic32_read(&s-

> >num_mapped_cores) > 1);

> > -	if (use_atomics) {

> > +	if (service_mt_safe(s) == 0) {

> >  		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))

> >  			return -EBUSY;

> >

> > --

> > 2.7.4
Van Haaren, Harry May 1, 2020, 2:21 p.m. UTC | #3
> -----Original Message-----

> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>

> Sent: Wednesday, April 29, 2020 11:49 PM

> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Phil Yang

> <Phil.Yang@arm.com>; dev@dpdk.org

> Cc: thomas@monjalon.net; david.marchand@redhat.com; Ananyev, Konstantin

> <konstantin.ananyev@intel.com>; jerinj@marvell.com;

> hemant.agrawal@nxp.com; Gavin Hu <Gavin.Hu@arm.com>; nd

> <nd@arm.com>; stable@dpdk.org; Eads, Gage <gage.eads@intel.com>;

> Richardson, Bruce <bruce.richardson@intel.com>; Honnappa Nagarahalli

> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>

> Subject: RE: [PATCH v2 1/6] service: fix race condition for MT unsafe service

> 

> Hi Harry,

> 	Thanks for getting back on this.

> 

> <snip>

> 

> > > Subject: [PATCH v2 1/6] service: fix race condition for MT unsafe

> > > service

> > >

> > > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

> > >

> > > The MT unsafe service might get configured to run on another core

> > > while the service is running currently. This might result in the MT

> > > unsafe service running on multiple cores simultaneously. Use

> > > 'execute_lock' always when the service is MT unsafe.

> > >

> > > Fixes: e9139a32f6e8 ("service: add function to run on app lcore")

> > > Cc: stable@dpdk.org

> > >

> > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

> > > Reviewed-by: Phil Yang <phil.yang@arm.com>

> > > ---

> >

> > Thanks for spinning a new revision - based on ML discussion previously, it

> > seems like the "use service-run-count" to avoid this race would be a complex

> > solution.

> >

> > Suggesting the following;

> > 1) Take the approach as per this patch, to always take the atomic, fixing the

> > race condition.

> Ok


I've micro-benchmarked this code change inside the service cores autotest, and it
introduces around 35 cycles of overhead per service call.  This is not ideal, but given
it's a bugfix, and by far the simplest method to fix this race-condition. Having
discussed and investigated multiple other solutions, I believe this is the right solution.
Thanks Honnappa and Phil for identifying and driving a solution.

I suggest to post the benchmarking unit-test addition patch, and integrate that
*before* the series under review here gets merged? This makes benchmarking
of the "before bugfix" performance in future easier should it be required.


> > 2) Add an API to service-cores, which allows "committing" of mappings.

> > Committing the mapping would imply that the mappings will not be changed

> > in future. With runtime-remapping being removed from the equation, the

> > existing branch-over-atomic optimization is valid again.

> Ok. Just to make sure I understand this:

> a) on the data plane, if commit API is called (probably a new state variable) and

> num_mapped_cores is set to 1, there is no need to take the lock.

> b) possible implementation of the commit API would check if

> num_mapped_cores for the service is set to 1 and set a variable to indicate that

> the lock is not required.

> 

> What do you think about asking the application to set  the service capability to

> MT_SAFE if it knows that the service will run on a single core? This would

> require us to change the documentation and does not require additional code.


That's a nice idea - I like that if applications want to micro-optimize around
the atomic, that they have a workaround/solution to do so, particularly that it
doesn't require code-changes and backporting.

Will send review and send feedback on the patches themselves.
Regards, -Harry

> > So this would offer applications two situations

> > A) No application change: possible performance regression due to atomic

> > always taken.

> > B) Call "commit" API, and regain the performance as per previous DPDK

> > versions.

> >

> > Thoughts/opinions on the above?  I've flagged the rest of the patchset for

> > review ASAP. Regards, -Harry

> >

> > >  lib/librte_eal/common/rte_service.c | 11 +++++------

> > >  1 file changed, 5 insertions(+), 6 deletions(-)

<snip patch changes>
Honnappa Nagarahalli May 1, 2020, 2:56 p.m. UTC | #4
<snip>
> >

> > > > Subject: [PATCH v2 1/6] service: fix race condition for MT unsafe

> > > > service

> > > >

> > > > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

> > > >

> > > > The MT unsafe service might get configured to run on another core

> > > > while the service is running currently. This might result in the

> > > > MT unsafe service running on multiple cores simultaneously. Use

> > > > 'execute_lock' always when the service is MT unsafe.

> > > >

> > > > Fixes: e9139a32f6e8 ("service: add function to run on app lcore")

> > > > Cc: stable@dpdk.org

> > > >

> > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

> > > > Reviewed-by: Phil Yang <phil.yang@arm.com>

> > > > ---

> > >

> > > Thanks for spinning a new revision - based on ML discussion

> > > previously, it seems like the "use service-run-count" to avoid this

> > > race would be a complex solution.

> > >

> > > Suggesting the following;

> > > 1) Take the approach as per this patch, to always take the atomic,

> > > fixing the race condition.

> > Ok

> 

> I've micro-benchmarked this code change inside the service cores autotest,

> and it introduces around 35 cycles of overhead per service call.  This is not

> ideal, but given it's a bugfix, and by far the simplest method to fix this race-

> condition. Having discussed and investigated multiple other solutions, I

> believe this is the right solution.

> Thanks Honnappa and Phil for identifying and driving a solution.

You are welcome. Thank you for your timely responses.

> 

> I suggest to post the benchmarking unit-test addition patch, and integrate

> that

> *before* the series under review here gets merged? This makes

> benchmarking of the "before bugfix" performance in future easier should it be

> required.

I do not see any issues, would be happy to review. I think we still have time to catch up with RC2 (May 8th).
You had also mentioned about calling out that, the control plane APIs are not MT safe. Should I add that to this patch?

> 

> 

> > > 2) Add an API to service-cores, which allows "committing" of mappings.

> > > Committing the mapping would imply that the mappings will not be

> > > changed in future. With runtime-remapping being removed from the

> > > equation, the existing branch-over-atomic optimization is valid again.

> > Ok. Just to make sure I understand this:

> > a) on the data plane, if commit API is called (probably a new state

> > variable) and num_mapped_cores is set to 1, there is no need to take the

> lock.

> > b) possible implementation of the commit API would check if

> > num_mapped_cores for the service is set to 1 and set a variable to

> > indicate that the lock is not required.

> >

> > What do you think about asking the application to set  the service

> > capability to MT_SAFE if it knows that the service will run on a

> > single core? This would require us to change the documentation and does

> not require additional code.

> 

> That's a nice idea - I like that if applications want to micro-optimize around

> the atomic, that they have a workaround/solution to do so, particularly that it

> doesn't require code-changes and backporting.

> 

> Will send review and send feedback on the patches themselves.

> Regards, -Harry

> 

> > > So this would offer applications two situations

> > > A) No application change: possible performance regression due to

> > > atomic always taken.

> > > B) Call "commit" API, and regain the performance as per previous

> > > DPDK versions.

> > >

> > > Thoughts/opinions on the above?  I've flagged the rest of the

> > > patchset for review ASAP. Regards, -Harry

> > >

> > > >  lib/librte_eal/common/rte_service.c | 11 +++++------

> > > >  1 file changed, 5 insertions(+), 6 deletions(-)

> <snip patch changes>
Van Haaren, Harry May 1, 2020, 5:51 p.m. UTC | #5
> -----Original Message-----

> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>

> Sent: Friday, May 1, 2020 3:56 PM

> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Phil Yang

> <Phil.Yang@arm.com>; dev@dpdk.org

> Cc: thomas@monjalon.net; david.marchand@redhat.com; Ananyev, Konstantin

> <konstantin.ananyev@intel.com>; jerinj@marvell.com;

> hemant.agrawal@nxp.com; Gavin Hu <Gavin.Hu@arm.com>; nd

> <nd@arm.com>; stable@dpdk.org; Eads, Gage <gage.eads@intel.com>;

> Richardson, Bruce <bruce.richardson@intel.com>; nd <nd@arm.com>;

> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>

> Subject: RE: [PATCH v2 1/6] service: fix race condition for MT unsafe service

> 

> <snip>

> > >

> > > > > Subject: [PATCH v2 1/6] service: fix race condition for MT unsafe

> > > > > service

> > > > >

> > > > > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

> > > > >

> > > > > The MT unsafe service might get configured to run on another core

> > > > > while the service is running currently. This might result in the

> > > > > MT unsafe service running on multiple cores simultaneously. Use

> > > > > 'execute_lock' always when the service is MT unsafe.

> > > > >

> > > > > Fixes: e9139a32f6e8 ("service: add function to run on app lcore")

> > > > > Cc: stable@dpdk.org

> > > > >

> > > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

> > > > > Reviewed-by: Phil Yang <phil.yang@arm.com>

> > > > > ---

> > > >

> > > > Thanks for spinning a new revision - based on ML discussion

> > > > previously, it seems like the "use service-run-count" to avoid this

> > > > race would be a complex solution.

> > > >

> > > > Suggesting the following;

> > > > 1) Take the approach as per this patch, to always take the atomic,

> > > > fixing the race condition.

> > > Ok

> >

> > I've micro-benchmarked this code change inside the service cores autotest,

> > and it introduces around 35 cycles of overhead per service call.  This is not

> > ideal, but given it's a bugfix, and by far the simplest method to fix this race-

> > condition. Having discussed and investigated multiple other solutions, I

> > believe this is the right solution.

> > Thanks Honnappa and Phil for identifying and driving a solution.

> You are welcome. Thank you for your timely responses.


Perhaps not so timely after all ... I'll review C11 patches Tuesday morning,
it's a long weekend in Ireland!

> > I suggest to post the benchmarking unit-test addition patch, and integrate

> > that

> > *before* the series under review here gets merged? This makes

> > benchmarking of the "before bugfix" performance in future easier should it be

> > required.

> I do not see any issues, would be happy to review.


Thanks for volunteering, you're on CC when sent, for convenience:
http://patches.dpdk.org/patch/69651/

> I think we still have time to catch up with RC2 (May 8th).


Agree, merge into RC2 would be great.

> You had also mentioned about calling out that, the control plane APIs are not

> MT safe. Should I add that to this patch?


Yes, that'd be great.

<snip discussion details>

Cheers, -Harry
diff mbox series

Patch

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 70d17a5..b8c465e 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -50,6 +50,10 @@  struct rte_service_spec_impl {
 	uint8_t internal_flags;
 
 	/* per service statistics */
+	/* Indicates how many cores the service is mapped to run on.
+	 * It does not indicate the number of cores the service is running
+	 * on currently.
+	 */
 	rte_atomic32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
@@ -370,12 +374,7 @@  service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	/* check do we need cmpset, if MT safe or <= 1 core
-	 * mapped, atomic ops are not required.
-	 */
-	const int use_atomics = (service_mt_safe(s) == 0) &&
-				(rte_atomic32_read(&s->num_mapped_cores) > 1);
-	if (use_atomics) {
+	if (service_mt_safe(s) == 0) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;