diff mbox series

[net] net: fix hangup on napi_disable for threaded napi

Message ID 996c4bb33166b5cf8d881871ea8b61e54ad4da24.1617230551.git.pabeni@redhat.com
State New
Headers show
Series [net] net: fix hangup on napi_disable for threaded napi | expand

Commit Message

Paolo Abeni March 31, 2021, 10:46 p.m. UTC
I hit an hangup on napi_disable(), when the threaded
mode is enabled and the napi is under heavy traffic.

If the relevant napi has been scheduled and the napi_disable()
kicks in before the next napi_threaded_wait() completes - so
that the latter quits due to the napi_disable_pending() condition,
the existing code leaves the NAPI_STATE_SCHED bit set and the
napi_disable() loop waiting for such bit will hang.

Address the issue explicitly clearing the SCHED_BIT on napi_thread
termination, if the thread is owns the napi.

Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/core/dev.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Jakub Kicinski April 1, 2021, 1:41 a.m. UTC | #1
On Thu,  1 Apr 2021 00:46:18 +0200 Paolo Abeni wrote:
> I hit an hangup on napi_disable(), when the threaded

> mode is enabled and the napi is under heavy traffic.

> 

> If the relevant napi has been scheduled and the napi_disable()

> kicks in before the next napi_threaded_wait() completes - so

> that the latter quits due to the napi_disable_pending() condition,

> the existing code leaves the NAPI_STATE_SCHED bit set and the

> napi_disable() loop waiting for such bit will hang.

> 

> Address the issue explicitly clearing the SCHED_BIT on napi_thread

> termination, if the thread is owns the napi.

> 

> Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")

> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

> ---

>  net/core/dev.c | 8 ++++++++

>  1 file changed, 8 insertions(+)

> 

> diff --git a/net/core/dev.c b/net/core/dev.c

> index b4c67a5be606d..e2e716ba027b8 100644

> --- a/net/core/dev.c

> +++ b/net/core/dev.c

> @@ -7059,6 +7059,14 @@ static int napi_thread_wait(struct napi_struct *napi)

>  		set_current_state(TASK_INTERRUPTIBLE);

>  	}

>  	__set_current_state(TASK_RUNNING);

> +

> +	/* if the thread owns this napi, and the napi itself has been disabled

> +	 * in-between napi_schedule() and the above napi_disable_pending()

> +	 * check, we need to clear the SCHED bit here, or napi_disable

> +	 * will hang waiting for such bit being cleared

> +	 */

> +	if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state) || woken)

> +		clear_bit(NAPI_STATE_SCHED, &napi->state);


Not sure this covers 100% of the cases. We depend on the ability to go
through schedule() "unnecessarily" when the napi gets scheduled after
we go into TASK_INTERRUPTIBLE.

If we just check woken outside of the loop it may be false even though
we got a "wake event".


Looking closer now I don't really understand where we ended up with
disable handling :S  Seems like the thread exits on napi_disable(),
but is reaped by netif_napi_del(). Some drivers (*cough* nfp) will
go napi_disable() -> napi_enable()... and that will break. 

Am I missing something?

Should we not stay in the wait loop on napi_disable()?
Paolo Abeni April 1, 2021, 9:55 a.m. UTC | #2
On Wed, 2021-03-31 at 18:41 -0700, Jakub Kicinski wrote:
> On Thu,  1 Apr 2021 00:46:18 +0200 Paolo Abeni wrote:

> > I hit an hangup on napi_disable(), when the threaded

> > mode is enabled and the napi is under heavy traffic.

> > 

> > If the relevant napi has been scheduled and the napi_disable()

> > kicks in before the next napi_threaded_wait() completes - so

> > that the latter quits due to the napi_disable_pending() condition,

> > the existing code leaves the NAPI_STATE_SCHED bit set and the

> > napi_disable() loop waiting for such bit will hang.

> > 

> > Address the issue explicitly clearing the SCHED_BIT on napi_thread

> > termination, if the thread is owns the napi.

> > 

> > Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")

> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>

> > ---

> >  net/core/dev.c | 8 ++++++++

> >  1 file changed, 8 insertions(+)

> > 

> > diff --git a/net/core/dev.c b/net/core/dev.c

> > index b4c67a5be606d..e2e716ba027b8 100644

> > --- a/net/core/dev.c

> > +++ b/net/core/dev.c

> > @@ -7059,6 +7059,14 @@ static int napi_thread_wait(struct napi_struct *napi)

> >  		set_current_state(TASK_INTERRUPTIBLE);

> >  	}

> >  	__set_current_state(TASK_RUNNING);

> > +

> > +	/* if the thread owns this napi, and the napi itself has been disabled

> > +	 * in-between napi_schedule() and the above napi_disable_pending()

> > +	 * check, we need to clear the SCHED bit here, or napi_disable

> > +	 * will hang waiting for such bit being cleared

> > +	 */

> > +	if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state) || woken)

> > +		clear_bit(NAPI_STATE_SCHED, &napi->state);

> 

> Not sure this covers 100% of the cases. We depend on the ability to go

> through schedule() "unnecessarily" when the napi gets scheduled after

> we go into TASK_INTERRUPTIBLE.


Empirically this patch fixes my test case (napi_disable/napi_enable in
a loop with the relevant napi under a lot of UDP traffic).

If I understand correctly, the critical scenario you see is something
alike:

CPU0			CPU1					CPU2
			// napi_threaded_poll() main loop
			napi_complete_done()
			// napi_threaded_poll() loop completes
	
napi_schedule()
// set SCHED bit
// NOT set SCHED_THREAD
// wake_up_process() is
// a no op
								napi_disable()
								// set DISABLE bit
			
			napi_thread_wait()
			set_current_state(TASK_INTERRUPTIBLE);
			// napi_thread_wait() loop completes,
			// SCHED_THREAD bit is cleared and
			// wake is false
	
> If we just check woken outside of the loop it may be false even though

> we got a "wake event".


I think in the above example even the normal processing will be
fooled?!? e.g. even without the napi_disable(), napi_thread_wait() will
 will miss the event/will not understand to it really own the napi and
will call schedule().

It looks a different problem to me ?!?

I *think* that replacing inside the napi_thread_wait() loop:

	if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state) || woken) 

with:

	unsigned long state = READ_ONCE(napi->state);

	if (state & NAPIF_STATE_SCHED &&
	    !(state & (NAPIF_STATE_IN_BUSY_POLL | NAPIF_STATE_DISABLE)) 

should solve it and should also allow removing the
NAPI_STATE_SCHED_THREADED bit. I feel like I'm missing some relevant
point here.

> Looking closer now I don't really understand where we ended up with

> disable handling :S  Seems like the thread exits on napi_disable(),

> but is reaped by netif_napi_del(). Some drivers (*cough* nfp) will

> go napi_disable() -> napi_enable()... and that will break. 

> 

> Am I missing something?

> 

> Should we not stay in the wait loop on napi_disable()?


Here I do not follow?!? Modulo the tiny race (which i was unable to
trigger so far) above napi_disable()/napi_enable() loops work correctly
here.

Could you please re-phrase?

Thanks!

Paolo
Jakub Kicinski April 1, 2021, 11:44 p.m. UTC | #3
On Thu, 01 Apr 2021 11:55:45 +0200 Paolo Abeni wrote:
> On Wed, 2021-03-31 at 18:41 -0700, Jakub Kicinski wrote:

> > On Thu,  1 Apr 2021 00:46:18 +0200 Paolo Abeni wrote:  

> > > I hit an hangup on napi_disable(), when the threaded

> > > mode is enabled and the napi is under heavy traffic.

> > > 

> > > If the relevant napi has been scheduled and the napi_disable()

> > > kicks in before the next napi_threaded_wait() completes - so

> > > that the latter quits due to the napi_disable_pending() condition,

> > > the existing code leaves the NAPI_STATE_SCHED bit set and the

> > > napi_disable() loop waiting for such bit will hang.

> > > 

> > > Address the issue explicitly clearing the SCHED_BIT on napi_thread

> > > termination, if the thread is owns the napi.

> > > 

> > > Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")

> > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>

> > > ---

> > >  net/core/dev.c | 8 ++++++++

> > >  1 file changed, 8 insertions(+)

> > > 

> > > diff --git a/net/core/dev.c b/net/core/dev.c

> > > index b4c67a5be606d..e2e716ba027b8 100644

> > > --- a/net/core/dev.c

> > > +++ b/net/core/dev.c

> > > @@ -7059,6 +7059,14 @@ static int napi_thread_wait(struct napi_struct *napi)

> > >  		set_current_state(TASK_INTERRUPTIBLE);

> > >  	}

> > >  	__set_current_state(TASK_RUNNING);

> > > +

> > > +	/* if the thread owns this napi, and the napi itself has been disabled

> > > +	 * in-between napi_schedule() and the above napi_disable_pending()

> > > +	 * check, we need to clear the SCHED bit here, or napi_disable

> > > +	 * will hang waiting for such bit being cleared

> > > +	 */

> > > +	if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state) || woken)

> > > +		clear_bit(NAPI_STATE_SCHED, &napi->state);  

> > 

> > Not sure this covers 100% of the cases. We depend on the ability to go

> > through schedule() "unnecessarily" when the napi gets scheduled after

> > we go into TASK_INTERRUPTIBLE.  

> 

> Empirically this patch fixes my test case (napi_disable/napi_enable in

> a loop with the relevant napi under a lot of UDP traffic).

> 

> If I understand correctly, the critical scenario you see is something

> alike:

> 

> CPU0			CPU1					CPU2

> 			// napi_threaded_poll() main loop

> 			napi_complete_done()

> 			// napi_threaded_poll() loop completes

> 	

> napi_schedule()

> // set SCHED bit

> // NOT set SCHED_THREAD


Why does it not set SCHED_THREAD if task is RUNNING?

> // wake_up_process() is

> // a no op

> 								napi_disable()

> 								// set DISABLE bit

> 			

> 			napi_thread_wait()

> 			set_current_state(TASK_INTERRUPTIBLE);

> 			// napi_thread_wait() loop completes,

> 			// SCHED_THREAD bit is cleared and

> 			// wake is false


I was thinking of:

CPU0                        CPU1                            CPU2
====                        ====                            ====
napi_complete_done()
set INTERRUPTIBLE
                                                            napi_schedule
                                                            set RUNNING
                            napi_disable()
if (should_stop() || 
    disable_pending())
// does not enter loop
// test from this patch:
if (SCHED_THREADED || woken)
// .. is false


> > If we just check woken outside of the loop it may be false even though

> > we got a "wake event".  

> 

> I think in the above example even the normal processing will be

> fooled?!? e.g. even without the napi_disable(), napi_thread_wait() will

>  will miss the event/will not understand to it really own the napi and

> will call schedule().

> 

> It looks a different problem to me ?!?

> 

> I *think* that replacing inside the napi_thread_wait() loop:

> 

> 	if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state) || woken) 

> 

> with:

> 

> 	unsigned long state = READ_ONCE(napi->state);

> 

> 	if (state & NAPIF_STATE_SCHED &&

> 	    !(state & (NAPIF_STATE_IN_BUSY_POLL | NAPIF_STATE_DISABLE)) 

> 

> should solve it and should also allow removing the

> NAPI_STATE_SCHED_THREADED bit. I feel like I'm missing some relevant

> point here.


Heh, that's closer to the proposal Eric put forward.

I strongly dislike the idea that every NAPI consumer needs to be aware
of all the other consumers to make things work. That's n^2 mental
complexity.

> > Looking closer now I don't really understand where we ended up with

> > disable handling :S  Seems like the thread exits on napi_disable(),

> > but is reaped by netif_napi_del(). Some drivers (*cough* nfp) will

> > go napi_disable() -> napi_enable()... and that will break. 

> > 

> > Am I missing something?

> > 

> > Should we not stay in the wait loop on napi_disable()?  

> 

> Here I do not follow?!? Modulo the tiny race (which i was unable to

> trigger so far) above napi_disable()/napi_enable() loops work correctly

> here.

> 

> Could you please re-phrase?


After napi_disable() the thread will exit right? (napi_thread_wait()
returns -1, the loop in napi_threaded_poll() breaks, and the function
returns).

napi_enable() will not re-start the thread.

What driver are you testing with? You driver must always call
netif_napi_del() and netif_napi_add().
Paolo Abeni April 7, 2021, 2:54 p.m. UTC | #4
Hello,

I'm sorry for the lag,

On Thu, 2021-04-01 at 16:44 -0700, Jakub Kicinski wrote:
> On Thu, 01 Apr 2021 11:55:45 +0200 Paolo Abeni wrote:

> > On Wed, 2021-03-31 at 18:41 -0700, Jakub Kicinski wrote:

> > > On Thu,  1 Apr 2021 00:46:18 +0200 Paolo Abeni wrote:  

> > > > I hit an hangup on napi_disable(), when the threaded

> > > > mode is enabled and the napi is under heavy traffic.

> > > > 

> > > > If the relevant napi has been scheduled and the napi_disable()

> > > > kicks in before the next napi_threaded_wait() completes - so

> > > > that the latter quits due to the napi_disable_pending() condition,

> > > > the existing code leaves the NAPI_STATE_SCHED bit set and the

> > > > napi_disable() loop waiting for such bit will hang.

> > > > 

> > > > Address the issue explicitly clearing the SCHED_BIT on napi_thread

> > > > termination, if the thread is owns the napi.

> > > > 

> > > > Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")

> > > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>

> > > > ---

> > > >  net/core/dev.c | 8 ++++++++

> > > >  1 file changed, 8 insertions(+)

> > > > 

> > > > diff --git a/net/core/dev.c b/net/core/dev.c

> > > > index b4c67a5be606d..e2e716ba027b8 100644

> > > > --- a/net/core/dev.c

> > > > +++ b/net/core/dev.c

> > > > @@ -7059,6 +7059,14 @@ static int napi_thread_wait(struct napi_struct *napi)

> > > >  		set_current_state(TASK_INTERRUPTIBLE);

> > > >  	}

> > > >  	__set_current_state(TASK_RUNNING);

> > > > +

> > > > +	/* if the thread owns this napi, and the napi itself has been disabled

> > > > +	 * in-between napi_schedule() and the above napi_disable_pending()

> > > > +	 * check, we need to clear the SCHED bit here, or napi_disable

> > > > +	 * will hang waiting for such bit being cleared

> > > > +	 */

> > > > +	if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state) || woken)

> > > > +		clear_bit(NAPI_STATE_SCHED, &napi->state);  

> > > 

> > > Not sure this covers 100% of the cases. We depend on the ability to go

> > > through schedule() "unnecessarily" when the napi gets scheduled after

> > > we go into TASK_INTERRUPTIBLE.  

> > 

> > Empirically this patch fixes my test case (napi_disable/napi_enable in

> > a loop with the relevant napi under a lot of UDP traffic).

> > 

> > If I understand correctly, the critical scenario you see is something

> > alike:

> > 

> > CPU0			CPU1					CPU2

> > 			// napi_threaded_poll() main loop

> > 			napi_complete_done()

> > 			// napi_threaded_poll() loop completes

> > 	

> > napi_schedule()

> > // set SCHED bit

> > // NOT set SCHED_THREAD

> 

> Why does it not set SCHED_THREAD if task is RUNNING?


Because I'm dumb, I saw the race only on paper, and I mismatched the
napi thread status. The actual race I was thinking is what you wrote
below ;)

> > // wake_up_process() is

> > // a no op

> > 								napi_disable()

> > 								// set DISABLE bit

> > 			

> > 			napi_thread_wait()

> > 			set_current_state(TASK_INTERRUPTIBLE);

> > 			// napi_thread_wait() loop completes,

> > 			// SCHED_THREAD bit is cleared and

> > 			// wake is false

> 

> I was thinking of:

> 

> CPU0                        CPU1                            CPU2

> ====                        ====                            ====

> napi_complete_done()

> set INTERRUPTIBLE

>                                                             napi_schedule

>                                                             set RUNNING

>                             napi_disable()

> if (should_stop() || 

>     disable_pending())

> // does not enter loop

> // test from this patch:

> if (SCHED_THREADED || woken)

> // .. is false

> 

> 

> > > If we just check woken outside of the loop it may be false even though

> > > we got a "wake event".  

> > 

> > I think in the above example even the normal processing will be

> > fooled?!? e.g. even without the napi_disable(), napi_thread_wait() will

> >  will miss the event/will not understand to it really own the napi and

> > will call schedule().

> > 

> > It looks a different problem to me ?!?

> > 

> > I *think* that replacing inside the napi_thread_wait() loop:

> > 

> > 	if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state) || woken) 

> > 

> > with:

> > 

> > 	unsigned long state = READ_ONCE(napi->state);

> > 

> > 	if (state & NAPIF_STATE_SCHED &&

> > 	    !(state & (NAPIF_STATE_IN_BUSY_POLL | NAPIF_STATE_DISABLE)) 

> > 

> > should solve it and should also allow removing the

> > NAPI_STATE_SCHED_THREADED bit. I feel like I'm missing some relevant

> > point here.

> 

> Heh, that's closer to the proposal Eric put forward.

> 

> I strongly dislike 


I guess that can't be addressed ;)

> the idea that every NAPI consumer needs to be aware

> of all the other consumers to make things work. That's n^2 mental

> complexity.


IMHO the overall complexity is not that bad: both napi_disable() and
NAPI poll already set their own specific NAPI bit when acquiring the
NAPI instance, they don't need to be aware of any other NAPI consumer
internal.

The only NAPI user that needs to be aware of others is napi threaded,
and I guess/hope we are not going to add more different kind of NAPI
users.

If you have strong opinion against the above, the only other option I
can think of is patching napi_schedule_prep() to set
both NAPI_STATE_SCHED and NAPI_STATE_SCHED_THREADED if threaded mode is
enabled for the running NAPI. That looks more complex and error prone,
so I really would avoid that.

Any other better option?

Side note: regardless of the above, I think we still need something
similar to the code in this patch, can we address the different issues
separately?

> > > Looking closer now I don't really understand where we ended up with

> > > disable handling :S  Seems like the thread exits on napi_disable(),

> > > but is reaped by netif_napi_del(). Some drivers (*cough* nfp) will

> > > go napi_disable() -> napi_enable()... and that will break. 

> > > 

> > > Am I missing something?

> > > 

> > > Should we not stay in the wait loop on napi_disable()?  

> > 

> > Here I do not follow?!? Modulo the tiny race (which i was unable to

> > trigger so far) above napi_disable()/napi_enable() loops work correctly

> > here.

> > 

> > Could you please re-phrase?

> 

> After napi_disable() the thread will exit right? (napi_thread_wait()

> returns -1, the loop in napi_threaded_poll() breaks, and the function

> returns).

> 

> napi_enable() will not re-start the thread.

> 

> What driver are you testing with? You driver must always call

> netif_napi_del() and netif_napi_add().


veth + some XDP dummy prog - used just to enable NAPI.

Yep, it does a full netif_napi_del()/netif_napi_add().

Looks like plain napi_disable()/napi_enable() is not going to work in
threaded mode.

Cheers,

Paolo
Jakub Kicinski April 7, 2021, 6:13 p.m. UTC | #5
On Wed, 07 Apr 2021 16:54:29 +0200 Paolo Abeni wrote:
> > > I think in the above example even the normal processing will be

> > > fooled?!? e.g. even without the napi_disable(), napi_thread_wait() will

> > >  will miss the event/will not understand to it really own the napi and

> > > will call schedule().

> > > 

> > > It looks a different problem to me ?!?

> > > 

> > > I *think* that replacing inside the napi_thread_wait() loop:

> > > 

> > > 	if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state) || woken) 

> > > 

> > > with:

> > > 

> > > 	unsigned long state = READ_ONCE(napi->state);

> > > 

> > > 	if (state & NAPIF_STATE_SCHED &&

> > > 	    !(state & (NAPIF_STATE_IN_BUSY_POLL | NAPIF_STATE_DISABLE)) 

> > > 

> > > should solve it and should also allow removing the

> > > NAPI_STATE_SCHED_THREADED bit. I feel like I'm missing some relevant

> > > point here.  

> > 

> > Heh, that's closer to the proposal Eric put forward.

> > 

> > I strongly dislike   

> 

> I guess that can't be addressed ;)


I'm not _that_ unreasonable, I hope :) if there is multiple people
disagreeing with me then so be it.

> > the idea that every NAPI consumer needs to be aware

> > of all the other consumers to make things work. That's n^2 mental

> > complexity.  

> 

> IMHO the overall complexity is not that bad: both napi_disable() and

> NAPI poll already set their own specific NAPI bit when acquiring the

> NAPI instance, they don't need to be aware of any other NAPI consumer

> internal.

> 

> The only NAPI user that needs to be aware of others is napi threaded,

> and I guess/hope we are not going to add more different kind of NAPI

> users.


I thought we agreed that we should leave the door open for other
pollers as a condition of merging this simplistic thread thing.

> If you have strong opinion against the above, the only other option I

> can think of is patching napi_schedule_prep() to set

> both NAPI_STATE_SCHED and NAPI_STATE_SCHED_THREADED if threaded mode is

> enabled for the running NAPI. That looks more complex and error prone,

> so I really would avoid that.

> 

> Any other better option?

> 

> Side note: regardless of the above, I think we still need something

> similar to the code in this patch, can we address the different issues

> separately?


Not sure what issues you're referring to.

> > > Here I do not follow?!? Modulo the tiny race (which i was unable to

> > > trigger so far) above napi_disable()/napi_enable() loops work correctly

> > > here.

> > > 

> > > Could you please re-phrase?  

> > 

> > After napi_disable() the thread will exit right? (napi_thread_wait()

> > returns -1, the loop in napi_threaded_poll() breaks, and the function

> > returns).

> > 

> > napi_enable() will not re-start the thread.

> > 

> > What driver are you testing with? You driver must always call

> > netif_napi_del() and netif_napi_add().  

> 

> veth + some XDP dummy prog - used just to enable NAPI.

> 

> Yep, it does a full netif_napi_del()/netif_napi_add().

> 

> Looks like plain napi_disable()/napi_enable() is not going to work in

> threaded mode.


Right, I think the problem is disable_pending check is out of place.

How about this:

diff --git a/net/core/dev.c b/net/core/dev.c
index 9d1a8fac793f..e53f8bfed6a1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7041,7 +7041,7 @@ static int napi_thread_wait(struct napi_struct *napi)
 
        set_current_state(TASK_INTERRUPTIBLE);
 
-       while (!kthread_should_stop() && !napi_disable_pending(napi)) {
+       while (!kthread_should_stop()) {
                /* Testing SCHED_THREADED bit here to make sure the current
                 * kthread owns this napi and could poll on this napi.
                 * Testing SCHED bit is not enough because SCHED bit might be
@@ -7049,8 +7049,14 @@ static int napi_thread_wait(struct napi_struct *napi)
                 */
                if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state) || woken) {
                        WARN_ON(!list_empty(&napi->poll_list));
-                       __set_current_state(TASK_RUNNING);
-                       return 0;
+                       if (unlikely(napi_disable_pending(napi))) {
+                               clear_bit(NAPI_STATE_SCHED, &napi->state);
+                               clear_bit(NAPI_STATE_SCHED_THREADED,
+                                         &napi->state);
+                       } else {
+                               __set_current_state(TASK_RUNNING);
+                               return 0;
+                       }
                }
 
                schedule();
Paolo Abeni April 9, 2021, 9:24 a.m. UTC | #6
On Wed, 2021-04-07 at 11:13 -0700, Jakub Kicinski wrote:
> On Wed, 07 Apr 2021 16:54:29 +0200 Paolo Abeni wrote:

> > > > I think in the above example even the normal processing will be

> > > > fooled?!? e.g. even without the napi_disable(), napi_thread_wait() will

> > > >  will miss the event/will not understand to it really own the napi and

> > > > will call schedule().

> > > > 

> > > > It looks a different problem to me ?!?

> > > > 

> > > > I *think* that replacing inside the napi_thread_wait() loop:

> > > > 

> > > > 	if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state) || woken) 

> > > > 

> > > > with:

> > > > 

> > > > 	unsigned long state = READ_ONCE(napi->state);

> > > > 

> > > > 	if (state & NAPIF_STATE_SCHED &&

> > > > 	    !(state & (NAPIF_STATE_IN_BUSY_POLL | NAPIF_STATE_DISABLE)) 

> > > > 

> > > > should solve it and should also allow removing the

> > > > NAPI_STATE_SCHED_THREADED bit. I feel like I'm missing some relevant

> > > > point here.  

> > > 

> > > Heh, that's closer to the proposal Eric put forward.

> > > 

> > > I strongly dislike   

> > 

> > I guess that can't be addressed ;)

> 

> I'm not _that_ unreasonable, I hope :) if there is multiple people

> disagreeing with me then so be it.


I'm sorry, I mean no offence! Just joking about the fact that is
usually hard changing preferences ;)

> > If you have strong opinion against the above, the only other option I

> > can think of is patching napi_schedule_prep() to set

> > both NAPI_STATE_SCHED and NAPI_STATE_SCHED_THREADED if threaded mode is

> > enabled for the running NAPI. That looks more complex and error prone,

> > so I really would avoid that.

> > 

> > Any other better option?

> > 

> > Side note: regardless of the above, I think we still need something

> > similar to the code in this patch, can we address the different issues

> > separately?

> 

> Not sure what issues you're referring to.


The patch that started this thread was ment to address a slightly
different race: napi_disable() hanging because napi_threaded_poll()
don't clear the NAPI_STATE_SCHED even when owning the napi instance.

> Right, I think the problem is disable_pending check is out of place.

> 

> How about this:

> 

> diff --git a/net/core/dev.c b/net/core/dev.c

> index 9d1a8fac793f..e53f8bfed6a1 100644

> --- a/net/core/dev.c

> +++ b/net/core/dev.c

> @@ -7041,7 +7041,7 @@ static int napi_thread_wait(struct napi_struct *napi)

>  

>         set_current_state(TASK_INTERRUPTIBLE);

>  

> -       while (!kthread_should_stop() && !napi_disable_pending(napi)) {

> +       while (!kthread_should_stop()) {

>                 /* Testing SCHED_THREADED bit here to make sure the current

>                  * kthread owns this napi and could poll on this napi.

>                  * Testing SCHED bit is not enough because SCHED bit might be

> @@ -7049,8 +7049,14 @@ static int napi_thread_wait(struct napi_struct *napi)

>                  */

>                 if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state) || woken) {

>                         WARN_ON(!list_empty(&napi->poll_list));

> -                       __set_current_state(TASK_RUNNING);

> -                       return 0;

> +                       if (unlikely(napi_disable_pending(napi))) {

> +                               clear_bit(NAPI_STATE_SCHED, &napi->state);

> +                               clear_bit(NAPI_STATE_SCHED_THREADED,

> +                                         &napi->state);

> +                       } else {

> +                               __set_current_state(TASK_RUNNING);

> +                               return 0;

> +                       }

>                 }

>  

>                 schedule();


It looks like the above works, and also fixes the problem I originally
reported. 

I think it can be simplified as the following - if NAPIF_STATE_DISABLE
is set, napi_threaded_poll()/__napi_poll() will return clearing the
SCHEDS bits after trying to poll the device:

---
diff --git a/net/core/dev.c b/net/core/dev.c
index d9db02d4e044..5cb6f411043d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6985,7 +6985,7 @@ static int napi_thread_wait(struct napi_struct *napi)
 
        set_current_state(TASK_INTERRUPTIBLE);
 
-       while (!kthread_should_stop() && !napi_disable_pending(napi)) {
+       while (!kthread_should_stop()) {
                /* Testing SCHED_THREADED bit here to make sure the current
                 * kthread owns this napi and could poll on this napi.
                 * Testing SCHED bit is not enough because SCHED bit might be

---

And works as intended here. I'll submit that formally, unless there are
objection.

Thanks!

Paolo
>
Eric Dumazet April 9, 2021, 10:08 a.m. UTC | #7
On 4/9/21 11:24 AM, Paolo Abeni wrote:
> On Wed, 2021-04-07 at 11:13 -0700, Jakub Kicinski wrote:

>> On Wed, 07 Apr 2021 16:54:29 +0200 Paolo Abeni wrote:

>>>>> I think in the above example even the normal processing will be

>>>>> fooled?!? e.g. even without the napi_disable(), napi_thread_wait() will

>>>>>  will miss the event/will not understand to it really own the napi and

>>>>> will call schedule().

>>>>>

>>>>> It looks a different problem to me ?!?

>>>>>

>>>>> I *think* that replacing inside the napi_thread_wait() loop:

>>>>>

>>>>> 	if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state) || woken) 

>>>>>

>>>>> with:

>>>>>

>>>>> 	unsigned long state = READ_ONCE(napi->state);

>>>>>

>>>>> 	if (state & NAPIF_STATE_SCHED &&

>>>>> 	    !(state & (NAPIF_STATE_IN_BUSY_POLL | NAPIF_STATE_DISABLE)) 

>>>>>

>>>>> should solve it and should also allow removing the

>>>>> NAPI_STATE_SCHED_THREADED bit. I feel like I'm missing some relevant

>>>>> point here.  

>>>>

>>>> Heh, that's closer to the proposal Eric put forward.

>>>>

>>>> I strongly dislike   

>>>

>>> I guess that can't be addressed ;)

>>

>> I'm not _that_ unreasonable, I hope :) if there is multiple people

>> disagreeing with me then so be it.

> 

> I'm sorry, I mean no offence! Just joking about the fact that is

> usually hard changing preferences ;)

> 

>>> If you have strong opinion against the above, the only other option I

>>> can think of is patching napi_schedule_prep() to set

>>> both NAPI_STATE_SCHED and NAPI_STATE_SCHED_THREADED if threaded mode is

>>> enabled for the running NAPI. That looks more complex and error prone,

>>> so I really would avoid that.

>>>

>>> Any other better option?

>>>

>>> Side note: regardless of the above, I think we still need something

>>> similar to the code in this patch, can we address the different issues

>>> separately?

>>

>> Not sure what issues you're referring to.

> 

> The patch that started this thread was ment to address a slightly

> different race: napi_disable() hanging because napi_threaded_poll()

> don't clear the NAPI_STATE_SCHED even when owning the napi instance.

> 

>> Right, I think the problem is disable_pending check is out of place.

>>

>> How about this:

>>

>> diff --git a/net/core/dev.c b/net/core/dev.c

>> index 9d1a8fac793f..e53f8bfed6a1 100644

>> --- a/net/core/dev.c

>> +++ b/net/core/dev.c

>> @@ -7041,7 +7041,7 @@ static int napi_thread_wait(struct napi_struct *napi)

>>  

>>         set_current_state(TASK_INTERRUPTIBLE);

>>  

>> -       while (!kthread_should_stop() && !napi_disable_pending(napi)) {

>> +       while (!kthread_should_stop()) {

>>                 /* Testing SCHED_THREADED bit here to make sure the current

>>                  * kthread owns this napi and could poll on this napi.

>>                  * Testing SCHED bit is not enough because SCHED bit might be

>> @@ -7049,8 +7049,14 @@ static int napi_thread_wait(struct napi_struct *napi)

>>                  */

>>                 if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state) || woken) {

>>                         WARN_ON(!list_empty(&napi->poll_list));

>> -                       __set_current_state(TASK_RUNNING);

>> -                       return 0;

>> +                       if (unlikely(napi_disable_pending(napi))) {

>> +                               clear_bit(NAPI_STATE_SCHED, &napi->state);

>> +                               clear_bit(NAPI_STATE_SCHED_THREADED,

>> +                                         &napi->state);

>> +                       } else {

>> +                               __set_current_state(TASK_RUNNING);

>> +                               return 0;

>> +                       }

>>                 }

>>  

>>                 schedule();

> 

> It looks like the above works, and also fixes the problem I originally

> reported. 

> 

> I think it can be simplified as the following - if NAPIF_STATE_DISABLE

> is set, napi_threaded_poll()/__napi_poll() will return clearing the

> SCHEDS bits after trying to poll the device:

> 

> ---

> diff --git a/net/core/dev.c b/net/core/dev.c

> index d9db02d4e044..5cb6f411043d 100644

> --- a/net/core/dev.c

> +++ b/net/core/dev.c

> @@ -6985,7 +6985,7 @@ static int napi_thread_wait(struct napi_struct *napi)

>  

>         set_current_state(TASK_INTERRUPTIBLE);

>  

> -       while (!kthread_should_stop() && !napi_disable_pending(napi)) {

> +       while (!kthread_should_stop()) {

>                 /* Testing SCHED_THREADED bit here to make sure the current

>                  * kthread owns this napi and could poll on this napi.

>                  * Testing SCHED bit is not enough because SCHED bit might be

> 

> ---

> 

> And works as intended here. I'll submit that formally, unless there are

> objection.

> 


This looks much better ;)

> Thanks!

> 

> Paolo

>>

>
Jakub Kicinski April 9, 2021, 3:15 p.m. UTC | #8
On Fri, 09 Apr 2021 11:24:07 +0200 Paolo Abeni wrote:
> I think it can be simplified as the following - if NAPIF_STATE_DISABLE

> is set, napi_threaded_poll()/__napi_poll() will return clearing the

> SCHEDS bits after trying to poll the device:


Indeed!

> diff --git a/net/core/dev.c b/net/core/dev.c

> index d9db02d4e044..5cb6f411043d 100644

> --- a/net/core/dev.c

> +++ b/net/core/dev.c

> @@ -6985,7 +6985,7 @@ static int napi_thread_wait(struct napi_struct *napi)

>  

>         set_current_state(TASK_INTERRUPTIBLE);

>  

> -       while (!kthread_should_stop() && !napi_disable_pending(napi)) {

> +       while (!kthread_should_stop()) {

>                 /* Testing SCHED_THREADED bit here to make sure the current

>                  * kthread owns this napi and could poll on this napi.

>                  * Testing SCHED bit is not enough because SCHED bit might be

> 

> ---

> 

> And works as intended here. I'll submit that formally, unless there are

> objection.


Please and thank you!
diff mbox series

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index b4c67a5be606d..e2e716ba027b8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7059,6 +7059,14 @@  static int napi_thread_wait(struct napi_struct *napi)
 		set_current_state(TASK_INTERRUPTIBLE);
 	}
 	__set_current_state(TASK_RUNNING);
+
+	/* if the thread owns this napi, and the napi itself has been disabled
+	 * in-between napi_schedule() and the above napi_disable_pending()
+	 * check, we need to clear the SCHED bit here, or napi_disable
+	 * will hang waiting for such bit being cleared
+	 */
+	if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state) || woken)
+		clear_bit(NAPI_STATE_SCHED, &napi->state);
 	return -1;
 }