[net] net: fix race between napi kthread mode and busy poll

Message ID 20210223234130.437831-1-weiwan@google.com
State New
Series [net] net: fix race between napi kthread mode and busy poll

Commit Message

Wei Wang Feb. 23, 2021, 11:41 p.m. UTC
Currently, napi_thread_wait() checks for NAPI_STATE_SCHED bit to
determine if the kthread owns this napi and could call napi->poll() on
it. However, if socket busy poll is enabled, it is possible that the
busy poll thread grabs this SCHED bit (after the previous napi->poll()
invokes napi_complete_done() and clears SCHED bit) and tries to poll
on the same napi.
This patch tries to fix this race by adding a new bit
NAPI_STATE_SCHED_BUSY_POLL in napi->state. This bit gets set in
napi_busy_loop() together with NAPI_STATE_SCHED, and gets cleared in
napi_complete_done() together with NAPI_STATE_SCHED. This helps
distinguish the ownership of the napi between kthread and the busy poll
thread, and prevents the kthread from polling on the napi when this napi
is still owned by the busy poll thread.

Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
Reported-by: Martin Zaharinov <micron10@gmail.com>
Suggested-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
---
 include/linux/netdevice.h |  4 +++-
 net/core/dev.c            | 10 ++++++++--
 2 files changed, 11 insertions(+), 3 deletions(-)
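
The diff body is not reproduced on this page; below is a minimal sketch
of the approach the commit message describes, with exact hunk placement
assumed rather than copied from the patch:

/* napi_busy_loop(): take busy poll ownership together with SCHED */
if (cmpxchg(&napi->state, val,
	    val | NAPIF_STATE_IN_BUSY_POLL | NAPIF_STATE_SCHED |
		  NAPIF_STATE_SCHED_BUSY_POLL) != val)
	goto count;

/* napi_complete_done(): drop both ownership bits together */
new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED |
	      NAPIF_STATE_SCHED_BUSY_POLL);

/* napi_thread_wait(): the kthread polls only when SCHED is set and
 * busy poll does not still own this napi
 */
if (test_bit(NAPI_STATE_SCHED, &napi->state) &&
    !test_bit(NAPI_STATE_SCHED_BUSY_POLL, &napi->state)) {
	__set_current_state(TASK_RUNNING);
	return 0;
}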

Comments

Jakub Kicinski Feb. 24, 2021, 7:48 p.m. UTC | #1
On Tue, 23 Feb 2021 15:41:30 -0800 Wei Wang wrote:
> Currently, napi_thread_wait() checks for NAPI_STATE_SCHED bit to
> determine if the kthread owns this napi and could call napi->poll() on
> it. However, if socket busy poll is enabled, it is possible that the
> busy poll thread grabs this SCHED bit (after the previous napi->poll()
> invokes napi_complete_done() and clears SCHED bit) and tries to poll
> on the same napi.
> This patch tries to fix this race by adding a new bit
> NAPI_STATE_SCHED_BUSY_POLL in napi->state. This bit gets set in
> napi_busy_loop() together with NAPI_STATE_SCHED, and gets cleared in
> napi_complete_done() together with NAPI_STATE_SCHED. This helps
> distinguish the ownership of the napi between kthread and the busy poll
> thread, and prevents the kthread from polling on the napi when this napi
> is still owned by the busy poll thread.
>
> Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
> Reported-by: Martin Zaharinov <micron10@gmail.com>
> Suggested-by: Alexander Duyck <alexanderduyck@fb.com>
> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
> Reviewed-by: Eric Dumazet <edumazet@google.com>

AFAIU sched bit controls the ownership of the poll_list. Can we please
add a poll_list for the thread and make sure the thread polls based on
the list?
IMO that's far clearer than defining a forest of ownership state bits.

I think with just the right (wrong?) timing this patch will still not
protect against disabling the NAPI.
Eric Dumazet Feb. 24, 2021, 8:37 p.m. UTC | #2
On Wed, Feb 24, 2021 at 8:48 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 23 Feb 2021 15:41:30 -0800 Wei Wang wrote:
> > Currently, napi_thread_wait() checks for NAPI_STATE_SCHED bit to
> > determine if the kthread owns this napi and could call napi->poll() on
> > it. However, if socket busy poll is enabled, it is possible that the
> > busy poll thread grabs this SCHED bit (after the previous napi->poll()
> > invokes napi_complete_done() and clears SCHED bit) and tries to poll
> > on the same napi.
> > This patch tries to fix this race by adding a new bit
> > NAPI_STATE_SCHED_BUSY_POLL in napi->state. This bit gets set in
> > napi_busy_loop() together with NAPI_STATE_SCHED, and gets cleared in
> > napi_complete_done() together with NAPI_STATE_SCHED. This helps
> > distinguish the ownership of the napi between kthread and the busy poll
> > thread, and prevents the kthread from polling on the napi when this napi
> > is still owned by the busy poll thread.
> >
> > Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
> > Reported-by: Martin Zaharinov <micron10@gmail.com>
> > Suggested-by: Alexander Duyck <alexanderduyck@fb.com>
> > Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
> > Reviewed-by: Eric Dumazet <edumazet@google.com>
>
> AFAIU sched bit controls the ownership of the poll_list

I disagree. BUSY POLL never inserted the napi into a list,
because the user thread was polling one napi.

Same for the kthread.

wake_up_process() should be good enough.

> . Can we please
> add a poll_list for the thread and make sure the thread polls based on
> the list?

A list ? That would require a spinlock or something ?

> IMO that's far clearer than defining a forest of ownership state bits.

Adding a bit seems simpler than adding a list.

>
> I think with just the right (wrong?) timing this patch will still not
> protect against disabling the NAPI.

Maybe, but this patch is solving one issue that was easy to trigger.

disabling the NAPI is handled already.
Jakub Kicinski Feb. 24, 2021, 9:30 p.m. UTC | #3
On Wed, 24 Feb 2021 21:37:36 +0100 Eric Dumazet wrote:
> On Wed, Feb 24, 2021 at 8:48 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Tue, 23 Feb 2021 15:41:30 -0800 Wei Wang wrote:
> > > Currently, napi_thread_wait() checks for NAPI_STATE_SCHED bit to
> > > determine if the kthread owns this napi and could call napi->poll() on
> > > it. However, if socket busy poll is enabled, it is possible that the
> > > busy poll thread grabs this SCHED bit (after the previous napi->poll()
> > > invokes napi_complete_done() and clears SCHED bit) and tries to poll
> > > on the same napi.
> > > This patch tries to fix this race by adding a new bit
> > > NAPI_STATE_SCHED_BUSY_POLL in napi->state. This bit gets set in
> > > napi_busy_loop() together with NAPI_STATE_SCHED, and gets cleared in
> > > napi_complete_done() together with NAPI_STATE_SCHED. This helps
> > > distinguish the ownership of the napi between kthread and the busy poll
> > > thread, and prevents the kthread from polling on the napi when this napi
> > > is still owned by the busy poll thread.
> > >
> > > Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
> > > Reported-by: Martin Zaharinov <micron10@gmail.com>
> > > Suggested-by: Alexander Duyck <alexanderduyck@fb.com>
> > > Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
> > > Reviewed-by: Eric Dumazet <edumazet@google.com>
> >
> > AFAIU sched bit controls the ownership of the poll_list
>
> I disagree. BUSY POLL never inserted the napi into a list,
> because the user thread was polling one napi.
>
> Same for the kthread.

There is no delayed execution in busy_poll. It either got the sched bit
and it knows it, or it didn't.

> wake_up_process() should be good enough.

Well, if that's the direction maybe we should depend on the thread
state more?  IOW pay less attention to SCHED and have
napi_complete_done() set_current_state() if thread is running?

I didn't think that through fully but you can't say "wake_up_process()
should be good enough" and at the same time add another bit proving
it's not enough.

> > Can we please add a poll_list for the thread and make sure the
> > thread polls based on the list?
>
> A list ? That would require a spinlock or something ?

Does the softnet list require a spinlock?

Obviously with current code the list would only ever have one napi
instance per thread but I think it's worth the code simplicity.
napi_complete_done() dels from the list / releases that ownership 
already.

> > IMO that's far clearer than defining a forest of ownership state
> > bits.
>
> Adding a bit seems simpler than adding a list.

In terms of what? LoC? 

Just to find out what the LoC is I sketched this out:

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ddf4cfc12615..77f09ced9ee4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -348,6 +348,7 @@ struct napi_struct {
        struct hlist_node       napi_hash_node;
        unsigned int            napi_id;
        struct task_struct      *thread;
+       struct list_head        thread_poll_list;
 };
 
 enum {
diff --git a/net/core/dev.c b/net/core/dev.c
index 6c5967e80132..99ff083232e9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4294,6 +4294,8 @@ static inline void ____napi_schedule(struct softnet_data *sd,
                 */
                thread = READ_ONCE(napi->thread);
                if (thread) {
+                       list_add_tail(&napi->poll_list,
+                                     &napi->thread_poll_list);
                        wake_up_process(thread);
                        return;
                }
@@ -6777,6 +6779,7 @@ void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
                return;
 
        INIT_LIST_HEAD(&napi->poll_list);
+       INIT_LIST_HEAD(&napi->thread_poll_list);
        INIT_HLIST_NODE(&napi->napi_hash_node);
        hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
        napi->timer.function = napi_watchdog;
@@ -6971,8 +6974,7 @@ static int napi_thread_wait(struct napi_struct *napi)
        set_current_state(TASK_INTERRUPTIBLE);
 
        while (!kthread_should_stop() && !napi_disable_pending(napi)) {
-               if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
-                       WARN_ON(!list_empty(&napi->poll_list));
+               if (!list_empty(&napi->thread_poll_list)) {
                        __set_current_state(TASK_RUNNING);
                        return 0;
                }

$ git diff --stat 
 include/linux/netdevice.h | 1 +
 net/core/dev.c            | 6 ++++--
 2 files changed, 5 insertions(+), 2 deletions(-)

> > I think with just the right (wrong?) timing this patch will still
> > not protect against disabling the NAPI.
>
> Maybe, but this patch is solving one issue that was easy to trigger.
>
> disabling the NAPI is handled already.

The thread checks if NAPI is getting disabled, then time passes, then
it checks if it's scheduled. If napi gets disabled in the "time passes"
period thread will think that it got scheduled again.

Sure, we can go and make special annotations in all other parts of NAPI
infra, but isn't that an obvious sign of a bad design?

I wanted to add that I have spent quite a bit of time hacking around
the threaded NAPI thing before I had to maintain it, and (admittedly my
brain is not very capable but) I had a hard time getting things working
reliably with netpoll, busy polling, disabling etc. IOW I'm not just
claiming that "more bits" is not a good solution on a whim.
Wei Wang Feb. 24, 2021, 10:29 p.m. UTC | #4
On Wed, Feb 24, 2021 at 1:30 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 24 Feb 2021 21:37:36 +0100 Eric Dumazet wrote:
> > On Wed, Feb 24, 2021 at 8:48 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > On Tue, 23 Feb 2021 15:41:30 -0800 Wei Wang wrote:
> > > > Currently, napi_thread_wait() checks for NAPI_STATE_SCHED bit to
> > > > determine if the kthread owns this napi and could call napi->poll() on
> > > > it. However, if socket busy poll is enabled, it is possible that the
> > > > busy poll thread grabs this SCHED bit (after the previous napi->poll()
> > > > invokes napi_complete_done() and clears SCHED bit) and tries to poll
> > > > on the same napi.
> > > > This patch tries to fix this race by adding a new bit
> > > > NAPI_STATE_SCHED_BUSY_POLL in napi->state. This bit gets set in
> > > > napi_busy_loop() together with NAPI_STATE_SCHED, and gets cleared in
> > > > napi_complete_done() together with NAPI_STATE_SCHED. This helps
> > > > distinguish the ownership of the napi between kthread and the busy poll
> > > > thread, and prevents the kthread from polling on the napi when this napi
> > > > is still owned by the busy poll thread.
> > > >
> > > > Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
> > > > Reported-by: Martin Zaharinov <micron10@gmail.com>
> > > > Suggested-by: Alexander Duyck <alexanderduyck@fb.com>
> > > > Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
> > > > Reviewed-by: Eric Dumazet <edumazet@google.com>
> > >
> > > AFAIU sched bit controls the ownership of the poll_list
> >
> > I disagree. BUSY POLL never inserted the napi into a list,
> > because the user thread was polling one napi.
> >
> > Same for the kthread.
>
> There is no delayed execution in busy_poll. It either got the sched bit
> and it knows it, or it didn't.
>
> > wake_up_process() should be good enough.
>
> Well, if that's the direction maybe we should depend on the thread
> state more?  IOW pay less attention to SCHED and have
> napi_complete_done() set_current_state() if thread is running?
>
> I didn't think that through fully but you can't say "wake_up_process()
> should be good enough" and at the same time add another bit proving
> it's not enough.
>
> > > Can we please add a poll_list for the thread and make sure the
> > > thread polls based on the list?
> >
> > A list ? That would require a spinlock or something ?
>
> Does the softnet list require a spinlock?
>
> Obviously with current code the list would only ever have one napi
> instance per thread but I think it's worth the code simplicity.
> napi_complete_done() dels from the list / releases that ownership
> already.
>

I think what Jakub proposed here should work. But I have a similar
concern as Eric. I think the kthread belongs to the NAPI instance, and
the kthread only polls on that specific NAPI if threaded mode is
enabled. Adding the NAPI to a list that the kthread polls seems to be
a reversal of that logic. And it is unlike the sd->poll_list, where multiple
NAPI instances could be added to that list and get polled. But
functionality-wise, it does seem it will work.

> > > IMO that's far clearer than defining a forest of ownership state
> > > bits.
> >
> > Adding a bit seems simpler than adding a list.
>
> In terms of what? LoC?
>
> Just to find out what the LoC is I sketched this out:
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index ddf4cfc12615..77f09ced9ee4 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -348,6 +348,7 @@ struct napi_struct {
>         struct hlist_node       napi_hash_node;
>         unsigned int            napi_id;
>         struct task_struct      *thread;
> +       struct list_head        thread_poll_list;
>  };
>
>  enum {
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 6c5967e80132..99ff083232e9 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4294,6 +4294,8 @@ static inline void ____napi_schedule(struct softnet_data *sd,
>                  */
>                 thread = READ_ONCE(napi->thread);
>                 if (thread) {
> +                       list_add_tail(&napi->poll_list,
> +                                     &napi->thread_poll_list);
>                         wake_up_process(thread);
>                         return;
>                 }
> @@ -6777,6 +6779,7 @@ void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
>                 return;
>
>         INIT_LIST_HEAD(&napi->poll_list);
> +       INIT_LIST_HEAD(&napi->thread_poll_list);
>         INIT_HLIST_NODE(&napi->napi_hash_node);
>         hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
>         napi->timer.function = napi_watchdog;
> @@ -6971,8 +6974,7 @@ static int napi_thread_wait(struct napi_struct *napi)
>         set_current_state(TASK_INTERRUPTIBLE);
>
>         while (!kthread_should_stop() && !napi_disable_pending(napi)) {
> -               if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
> -                       WARN_ON(!list_empty(&napi->poll_list));
> +               if (!list_empty(&napi->thread_poll_list)) {
>                         __set_current_state(TASK_RUNNING);
>                         return 0;
>                 }
>
> $ git diff --stat
>  include/linux/netdevice.h | 1 +
>  net/core/dev.c            | 6 ++++--
>  2 files changed, 5 insertions(+), 2 deletions(-)
>
> > > > I think with just the right (wrong?) timing this patch will still
> > > > not protect against disabling the NAPI.
> > >
> > > Maybe, but this patch is solving one issue that was easy to trigger.
> > >
> > > disabling the NAPI is handled already.
> >
> > The thread checks if NAPI is getting disabled, then time passes, then
> > it checks if it's scheduled. If napi gets disabled in the "time passes"
> > period thread will think that it got scheduled again.
> >

Not sure if I understand it correctly, when you say "then it checks if
it's scheduled", do you mean the schedule() call in napi_thread_wait()
that re-enters this function? If so, it still checks to make sure
!napi_disable_pending(napi) before it goes to poll on the napi
instance. I think that is sufficient to make sure we don't poll on a
NAPI that is in DISABLE state?

> Sure, we can go and make special annotations in all other parts of NAPI
> infra, but isn't that an obvious sign of a bad design?
>
> I wanted to add that I have spent quite a bit of time hacking around
> the threaded NAPI thing before I had to maintain it, and (admittedly my
> brain is not very capable but) I had a hard time getting things working
> reliably with netpoll, busy polling, disabling etc. IOW I'm not just
> claiming that "more bits" is not a good solution on a whim.
Jakub Kicinski Feb. 24, 2021, 11:29 p.m. UTC | #5
On Wed, 24 Feb 2021 14:29:21 -0800 Wei Wang wrote:
> On Wed, Feb 24, 2021 at 1:30 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Wed, 24 Feb 2021 21:37:36 +0100 Eric Dumazet wrote:
> > > On Wed, Feb 24, 2021 at 8:48 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > On Tue, 23 Feb 2021 15:41:30 -0800 Wei Wang wrote:
> > > > > Currently, napi_thread_wait() checks for NAPI_STATE_SCHED bit to
> > > > > determine if the kthread owns this napi and could call napi->poll() on
> > > > > it. However, if socket busy poll is enabled, it is possible that the
> > > > > busy poll thread grabs this SCHED bit (after the previous napi->poll()
> > > > > invokes napi_complete_done() and clears SCHED bit) and tries to poll
> > > > > on the same napi.
> > > > > This patch tries to fix this race by adding a new bit
> > > > > NAPI_STATE_SCHED_BUSY_POLL in napi->state. This bit gets set in
> > > > > napi_busy_loop() together with NAPI_STATE_SCHED, and gets cleared in
> > > > > napi_complete_done() together with NAPI_STATE_SCHED. This helps
> > > > > distinguish the ownership of the napi between kthread and the busy poll
> > > > > thread, and prevents the kthread from polling on the napi when this napi
> > > > > is still owned by the busy poll thread.
> > > > >
> > > > > Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
> > > > > Reported-by: Martin Zaharinov <micron10@gmail.com>
> > > > > Suggested-by: Alexander Duyck <alexanderduyck@fb.com>
> > > > > Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
> > > > > Reviewed-by: Eric Dumazet <edumazet@google.com>
> > > >
> > > > AFAIU sched bit controls the ownership of the poll_list
> > >
> > > I disagree. BUSY POLL never inserted the napi into a list,
> > > because the user thread was polling one napi.
> > >
> > > Same for the kthread.
> >
> > There is no delayed execution in busy_poll. It either got the sched bit
> > and it knows it, or it didn't.
> >
> > > wake_up_process() should be good enough.
> >
> > Well, if that's the direction maybe we should depend on the thread
> > state more?  IOW pay less attention to SCHED and have
> > napi_complete_done() set_current_state() if thread is running?
> >
> > I didn't think that through fully but you can't say "wake_up_process()
> > should be good enough" and at the same time add another bit proving
> > it's not enough.
> >
> > > > Can we please add a poll_list for the thread and make sure the
> > > > thread polls based on the list?
> > >
> > > A list ? That would require a spinlock or something ?
> >
> > Does the softnet list require a spinlock?
> >
> > Obviously with current code the list would only ever have one napi
> > instance per thread but I think it's worth the code simplicity.
> > napi_complete_done() dels from the list / releases that ownership
> > already.
>
> I think what Jakub proposed here should work. But I have a similar
> concern as Eric. I think the kthread belongs to the NAPI instance, and
> the kthread only polls on that specific NAPI if threaded mode is
> enabled. Adding the NAPI to a list that the kthread polls seems to be
> a reversal of that logic. And it is unlike the sd->poll_list, where multiple
> NAPI instances could be added to that list and get polled. But
> functionality-wise, it does seem it will work.

My perspective is that the SCHED bit says "this NAPI has been scheduled
by someone to be processed". It doesn't say processed by who, so if the
ownership needs to be preserved the way to do that is napi->poll_list.

If NAPI is scheduled for sirq processing it goes on the sd list, if
it's threaded it goes on the thread's list.

Sure - today threads can only poll one NAPI so we could add a state bit
that says "this NAPI has been claimed by its thread". If you prefer
that strongly we can discuss, but IMO poll_list is a good abstraction
of linking the owner to the NAPI, no need for per-poller bits.

My mental model is that NAPI is always claimed or delegated to a poller
each time SCHED gets set. IIUC you're saying that it appears backwards
to give the NAPI to its dedicated thread, since the thread is expected
to own the NAPI. In my experience assuming the thread has the ownership
of the NAPI by virtue of having been started causes issues around the
hand-offs. It's much easier to establish that ownership on each SCHED.

> > > > IMO that's far clearer than defining a forest of ownership state
> > > > bits.
> > >
> > > Adding a bit seems simpler than adding a list.
> >
> > In terms of what? LoC?
> >
> > Just to find out what the LoC is I sketched this out:
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index ddf4cfc12615..77f09ced9ee4 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -348,6 +348,7 @@ struct napi_struct {
> >         struct hlist_node       napi_hash_node;
> >         unsigned int            napi_id;
> >         struct task_struct      *thread;
> > +       struct list_head        thread_poll_list;
> >  };
> >
> >  enum {
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 6c5967e80132..99ff083232e9 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4294,6 +4294,8 @@ static inline void ____napi_schedule(struct softnet_data *sd,
> >                  */
> >                 thread = READ_ONCE(napi->thread);
> >                 if (thread) {
> > +                       list_add_tail(&napi->poll_list,
> > +                                     &napi->thread_poll_list);
> >                         wake_up_process(thread);
> >                         return;
> >                 }
> > @@ -6777,6 +6779,7 @@ void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
> >                 return;
> >
> >         INIT_LIST_HEAD(&napi->poll_list);
> > +       INIT_LIST_HEAD(&napi->thread_poll_list);
> >         INIT_HLIST_NODE(&napi->napi_hash_node);
> >         hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
> >         napi->timer.function = napi_watchdog;
> > @@ -6971,8 +6974,7 @@ static int napi_thread_wait(struct napi_struct *napi)
> >         set_current_state(TASK_INTERRUPTIBLE);
> >
> >         while (!kthread_should_stop() && !napi_disable_pending(napi)) {
> > -               if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
> > -                       WARN_ON(!list_empty(&napi->poll_list));
> > +               if (!list_empty(&napi->thread_poll_list)) {
> >                         __set_current_state(TASK_RUNNING);
> >                         return 0;
> >                 }
> >
> > $ git diff --stat
> >  include/linux/netdevice.h | 1 +
> >  net/core/dev.c            | 6 ++++--
> >  2 files changed, 5 insertions(+), 2 deletions(-)
> >
> > > > I think with just the right (wrong?) timing this patch will still
> > > > not protect against disabling the NAPI.
> > >
> > > Maybe, but this patch is solving one issue that was easy to trigger.
> > >
> > > disabling the NAPI is handled already.
> >
> > The thread checks if NAPI is getting disabled, then time passes, then
> > it checks if it's scheduled. If napi gets disabled in the "time passes"
> > period thread will think that it got scheduled again.
>
> Not sure if I understand it correctly, when you say "then it checks if
> it's scheduled", do you mean the schedule() call in napi_thread_wait()
> that re-enters this function? If so, it still checks to make sure
> !napi_disable_pending(napi) before it goes to poll on the napi
> instance. I think that is sufficient to make sure we don't poll on a
> NAPI that is in DISABLE state?

Let me do a mash up of the code - this is what I'm thinking:
(prefix AA for CPU A, prefix BB for CPU B)

AA	while (!kthread_should_stop() && !napi_disable_pending(napi)) {
AA	// condition true, enter the loop... but then an IRQ comes..

BB	// napi_disable()
BB	set_bit(NAPI_STATE_DISABLE, &n->state);
BB	while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
BB	// and CPU B continues on its merry way..

AA	if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
AA		__set_current_state(TASK_RUNNING);
AA		return 0;

AA	// return to napi_threaded_poll()
AA	local_bh_disable();
AA	have = netpoll_poll_lock(napi);
AA	__napi_poll(napi, &repoll);

AA	// __napi_poll()
AA	weight = n->weight;
AA	work = 0;
AA	if (test_bit(NAPI_STATE_SCHED, &n->state)) {
AA		work = n->poll(n, weight);

> > Sure, we can go and make special annotations in all other parts of NAPI
> > infra, but isn't that an obvious sign of a bad design?
> >
> > I wanted to add that I have spent quite a bit of time hacking around
> > the threaded NAPI thing before I had to maintain it, and (admittedly my
> > brain is not very capable but) I had a hard time getting things working
> > reliably with netpoll, busy polling, disabling etc. IOW I'm not just
> > claiming that "more bits" is not a good solution on a whim.
Jakub Kicinski Feb. 24, 2021, 11:52 p.m. UTC | #6
On Wed, 24 Feb 2021 23:30:23 +0100 Eric Dumazet wrote:
> On Wed, Feb 24, 2021 at 10:30 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > Just to find out what the LoC is I sketched this out:
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index ddf4cfc12615..77f09ced9ee4 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -348,6 +348,7 @@ struct napi_struct {
> >         struct hlist_node       napi_hash_node;
> >         unsigned int            napi_id;
> >         struct task_struct      *thread;
> > +       struct list_head        thread_poll_list;
> >  };
>
> offlist, since it seems this conversation is upsetting you.

Interesting, vger seems to be CCed but it isn't appearing on the ML.
Perhaps just a vger delay :S

Not really upsetting. I'm just trying to share what I learned devising
more advanced pollers. The bits get really messy really quickly.
Especially that the proposed fix adds a bit for a poor bystander (busy
poll) while it's the threaded IRQ that is incorrectly not preserving
its ownership.

> Additional 16 bytes here, possibly in a shared cache line, [1]
> I prefer using a bit in hot n->state, we have plenty of them available.

Right, presumably the location of the new member could be optimized.
I typed this proposal up in a couple of minutes.

> We worked hours with Alexander, Wei, I am sorry you think we did a poor job.
> I really thought we instead solved the issue at hand.
>
> May I suggest you defer your idea of redesigning the NAPI model for
> net-next ?

Seems like you decided on this solution off list and now the fact that
there is a discussion on the list is upsetting you. May I suggest that
discussions should be conducted on list to avoid such situations?
Eric Dumazet Feb. 24, 2021, 11:59 p.m. UTC | #7
On Thu, Feb 25, 2021 at 12:52 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 24 Feb 2021 23:30:23 +0100 Eric Dumazet wrote:
> > On Wed, Feb 24, 2021 at 10:30 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > Just to find out what the LoC is I sketched this out:
> > >
> > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > index ddf4cfc12615..77f09ced9ee4 100644
> > > --- a/include/linux/netdevice.h
> > > +++ b/include/linux/netdevice.h
> > > @@ -348,6 +348,7 @@ struct napi_struct {
> > >         struct hlist_node       napi_hash_node;
> > >         unsigned int            napi_id;
> > >         struct task_struct      *thread;
> > > +       struct list_head        thread_poll_list;
> > >  };
> >
> > offlist, since it seems this conversation is upsetting you.
>
> Interesting, vger seems to be CCed but it isn't appearing on the ML.
> Perhaps just a vger delay :S
>
> Not really upsetting. I'm just trying to share what I learned devising
> more advanced pollers. The bits get really messy really quickly.
> Especially that the proposed fix adds a bit for a poor bystander (busy
> poll) while it's the threaded IRQ that is incorrectly not preserving
> its ownership.
>
> > Additional 16 bytes here, possibly in a shared cache line, [1]
> > I prefer using a bit in hot n->state, we have plenty of them available.
>
> Right, presumably the location of the new member could be optimized.
> I typed this proposal up in a couple of minutes.
>
> > We worked hours with Alexander, Wei, I am sorry you think we did a poor job.
> > I really thought we instead solved the issue at hand.
> >
> > May I suggest you defer your idea of redesigning the NAPI model for
> > net-next ?
>
> Seems like you decided on this solution off list and now the fact that
> there is a discussion on the list is upsetting you. May I suggest that
> discussions should be conducted on list to avoid such situations?

We were trying to not pollute the list (with about 40 different emails so far)

(Note this was not something I initiated, I only hit Reply all button)

OK, I will shut up, since you seem to take over this matter, and it is
1am here in France.
Jakub Kicinski Feb. 25, 2021, 12:07 a.m. UTC | #8
On Thu, 25 Feb 2021 00:59:25 +0100 Eric Dumazet wrote:
> On Thu, Feb 25, 2021 at 12:52 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > Interesting, vger seems to be CCed but it isn't appearing on the ML.
> > Perhaps just a vger delay :S
> >
> > Not really upsetting. I'm just trying to share what I learned devising
> > more advanced pollers. The bits get really messy really quickly.
> > Especially that the proposed fix adds a bit for a poor bystander (busy
> > poll) while it's the threaded IRQ that is incorrectly not preserving
> > its ownership.
> >
> > > Additional 16 bytes here, possibly in a shared cache line, [1]
> > > I prefer using a bit in hot n->state, we have plenty of them available.
> >
> > Right, presumably the location of the new member could be optimized.
> > I typed this proposal up in a couple of minutes.
> >
> > > We worked hours with Alexander, Wei, I am sorry you think we did a poor job.
> > > I really thought we instead solved the issue at hand.
> > >
> > > May I suggest you defer your idea of redesigning the NAPI model for
> > > net-next ?
> >
> > Seems like you decided on this solution off list and now the fact that
> > there is a discussion on the list is upsetting you. May I suggest that
> > discussions should be conducted on list to avoid such situations?
>
> We were trying to not pollute the list (with about 40 different emails so far)
>
> (Note this was not something I initiated, I only hit Reply all button)
>
> OK, I will shut up, since you seem to take over this matter, and it is
> 1am here in France.

Are you okay with adding a SCHED_THREADED bit for threaded NAPI to be
set in addition to SCHED? At least that way the bit is associated with
its user. IIUC since the extra clear_bit() in busy poll was okay so
should be a new set_bit()?
Alexander Duyck Feb. 25, 2021, 12:11 a.m. UTC | #9
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Wednesday, February 24, 2021 4:07 PM
> To: Eric Dumazet <edumazet@google.com>
> Cc: Wei Wang <weiwan@google.com>; David S . Miller
> <davem@davemloft.net>; netdev <netdev@vger.kernel.org>; Paolo Abeni
> <pabeni@redhat.com>; Hannes Frederic Sowa
> <hannes@stressinduktion.org>; Alexander Duyck
> <alexanderduyck@fb.com>; Martin Zaharinov <micron10@gmail.com>
> Subject: Re: [PATCH net] net: fix race between napi kthread mode and busy
> poll
> 
> On Thu, 25 Feb 2021 00:59:25 +0100 Eric Dumazet wrote:
> > On Thu, Feb 25, 2021 at 12:52 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > > Interesting, vger seems to be CCed but it isn't appearing on the ML.
> > > Perhaps just a vger delay :S
> > >
> > > Not really upsetting. I'm just trying to share what I learned
> > > devising more advanced pollers. The bits get really messy really quickly.
> > > Especially that the proposed fix adds a bit for a poor bystander
> > > (busy
> > > poll) while it's the threaded IRQ that is incorrectly not preserving
> > > its ownership.
> > >
> > > > Additional 16 bytes here, possibly in a shared cache line, [1] I
> > > > prefer using a bit in hot n->state, we have plenty of them available.
> > >
> > > Right, presumably the location of the new member could be optimized.
> > > I typed this proposal up in a couple of minutes.
> > >
> > > > We worked hours with Alexander, Wei, I am sorry you think we did a
> poor job.
> > > > I really thought we instead solved the issue at hand.
> > > >
> > > > May I suggest you defer your idea of redesigning the NAPI model
> > > > for net-next ?
> > >
> > > Seems like you decided on this solution off list and now the fact
> > > that there is a discussion on the list is upsetting you. May I
> > > suggest that discussions should be conducted on list to avoid such
> situations?
> >
> > We were trying to not pollute the list (with about 40 different emails
> > so far)
> >
> > (Note this was not something I initiated, I only hit Reply all button)
> >
> > OK, I will shut up, since you seem to take over this matter, and it is
> > 1am here in France.
> 
> Are you okay with adding a SCHED_THREADED bit for threaded NAPI to be
> set in addition to SCHED? At least that way the bit is associated with its user.
> IIUC since the extra clear_bit() in busy poll was okay so should be a new
> set_bit()?

The problem with adding a bit for SCHED_THREADED is that you would have
to heavily modify napi_schedule_prep so that it would add the bit. That
is the reason for going with adding the bit to the busy poll logic
because it added no additional overhead. Adding another atomic bit
setting operation or heavily modifying the existing one would add
considerable overhead as it is either adding a complicated conditional
check to all NAPI calls, or adding an atomic operation to the path for
the threaded NAPI.
Wei Wang Feb. 25, 2021, 12:16 a.m. UTC | #10
On Wed, Feb 24, 2021 at 4:11 PM Alexander Duyck <alexanderduyck@fb.com> wrote:
>
> > -----Original Message-----
> > From: Jakub Kicinski <kuba@kernel.org>
> > Sent: Wednesday, February 24, 2021 4:07 PM
> > To: Eric Dumazet <edumazet@google.com>
> > Cc: Wei Wang <weiwan@google.com>; David S . Miller
> > <davem@davemloft.net>; netdev <netdev@vger.kernel.org>; Paolo Abeni
> > <pabeni@redhat.com>; Hannes Frederic Sowa
> > <hannes@stressinduktion.org>; Alexander Duyck
> > <alexanderduyck@fb.com>; Martin Zaharinov <micron10@gmail.com>
> > Subject: Re: [PATCH net] net: fix race between napi kthread mode and busy
> > poll
> >
> > On Thu, 25 Feb 2021 00:59:25 +0100 Eric Dumazet wrote:
> > > On Thu, Feb 25, 2021 at 12:52 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > Interesting, vger seems to be CCed but it isn't appearing on the ML.
> > > > Perhaps just a vger delay :S
> > > >
> > > > Not really upsetting. I'm just trying to share what I learned
> > > > devising more advanced pollers. The bits get really messy really quickly.
> > > > Especially that the proposed fix adds a bit for a poor bystander
> > > > (busy
> > > > poll) while it's the threaded IRQ that is incorrectly not preserving
> > > > its ownership.
> > > >
> > > > > Additional 16 bytes here, possibly in a shared cache line, [1] I
> > > > > prefer using a bit in hot n->state, we have plenty of them available.
> > > >
> > > > Right, presumably the location of the new member could be optimized.
> > > > I typed this proposal up in a couple of minutes.
> > > >
> > > > > We worked hours with Alexander, Wei, I am sorry you think we did a
> > poor job.
> > > > > I really thought we instead solved the issue at hand.
> > > > >
> > > > > May I suggest you defer your idea of redesigning the NAPI model
> > > > > for net-next ?
> > > >
> > > > Seems like you decided on this solution off list and now the fact
> > > > that there is a discussion on the list is upsetting you. May I
> > > > suggest that discussions should be conducted on list to avoid such
> > situations?
> > >
> > > We were trying to not pollute the list (with about 40 different emails
> > > so far)
> > >
> > > (Note this was not something I initiated, I only hit Reply all button)
> > >
> > > OK, I will shut up, since you seem to take over this matter, and it is
> > > 1am here in France.
> >
> > Are you okay with adding a SCHED_THREADED bit for threaded NAPI to be
> > set in addition to SCHED? At least that way the bit is associated with its user.
> > IIUC since the extra clear_bit() in busy poll was okay so should be a new
> > set_bit()?
>
> The problem with adding a bit for SCHED_THREADED is that you would have
> to heavily modify napi_schedule_prep so that it would add the bit. That
> is the reason for going with adding the bit to the busy poll logic
> because it added no additional overhead. Adding another atomic bit
> setting operation or heavily modifying the existing one would add
> considerable overhead as it is either adding a complicated conditional
> check to all NAPI calls, or adding an atomic operation to the path for
> the threaded NAPI.

Please help hold on to the patch for now. I think Martin is still
seeing issues on his setup even with this patch applied. I have not
yet figured out why. But I think we should not merge this patch until
the issue is cleared. Will update this thread with progress.
Jakub Kicinski Feb. 25, 2021, 12:20 a.m. UTC | #11
On Thu, 25 Feb 2021 00:11:34 +0000 Alexander Duyck wrote:
> > > We were trying to not pollute the list (with about 40 different emails
> > > so far)
> > >
> > > (Note this was not something I initiated, I only hit Reply all button)
> > >
> > > OK, I will shut up, since you seem to take over this matter, and it is
> > > 1am here in France.
> >
> > Are you okay with adding a SCHED_THREADED bit for threaded NAPI to be
> > set in addition to SCHED? At least that way the bit is associated with its user.
> > IIUC since the extra clear_bit() in busy poll was okay so should be a new
> > set_bit()?
>
> The problem with adding a bit for SCHED_THREADED is that you would
> have to heavily modify napi_schedule_prep so that it would add the
> bit. That is the reason for going with adding the bit to the busy
> poll logic because it added no additional overhead. Adding another
> atomic bit setting operation or heavily modifying the existing one
> would add considerable overhead as it is either adding a complicated
> conditional check to all NAPI calls, or adding an atomic operation to
> the path for the threaded NAPI.

I wasn't thinking of modifying the main schedule logic, just the
threaded parts:

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ddf4cfc12615..6953005d06af 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -360,6 +360,7 @@ enum {
        NAPI_STATE_IN_BUSY_POLL,        /* sk_busy_loop() owns this NAPI */
        NAPI_STATE_PREFER_BUSY_POLL,    /* prefer busy-polling over softirq processing*/
        NAPI_STATE_THREADED,            /* The poll is performed inside its own thread*/
+       NAPI_STATE_SCHED_THREAD,        /* Thread owns the NAPI and will poll */
 };
 
 enum {
diff --git a/net/core/dev.c b/net/core/dev.c
index 6c5967e80132..23e53f971478 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4294,6 +4294,7 @@ static inline void ____napi_schedule(struct softnet_data *sd,
                 */
                thread = READ_ONCE(napi->thread);
                if (thread) {
+                       set_bit(NAPI_STATE_SCHED_THREAD, &napi->state);
                        wake_up_process(thread);
                        return;
                }
@@ -6486,7 +6487,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
                WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));
 
                new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED |
-                             NAPIF_STATE_PREFER_BUSY_POLL);
+                             NAPIF_STATE_PREFER_BUSY_POLL |
+                             NAPIF_STATE_SCHED_THREAD);
 
                /* If STATE_MISSED was set, leave STATE_SCHED set,
                 * because we will call napi->poll() one more time.
@@ -6971,7 +6973,9 @@ static int napi_thread_wait(struct napi_struct *napi)
        set_current_state(TASK_INTERRUPTIBLE);
 
        while (!kthread_should_stop() && !napi_disable_pending(napi)) {
-               if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
+               if (test_bit(NAPI_STATE_SCHED_THREAD, &napi->state)) {
+                       WARN_ON(!test_bit(NAPI_STATE_SCHED,
+                                         &napi->state));
                        WARN_ON(!list_empty(&napi->poll_list));
                        __set_current_state(TASK_RUNNING);
                        return 0;
Jakub Kicinski Feb. 25, 2021, 12:32 a.m. UTC | #12
On Wed, 24 Feb 2021 16:16:58 -0800 Wei Wang wrote:
> On Wed, Feb 24, 2021 at 4:11 PM Alexander Duyck <alexanderduyck@fb.com> wrote:
> >
> > The problem with adding a bit for SCHED_THREADED is that you would
> > have to heavily modify napi_schedule_prep so that it would add the
> > bit. That is the reason for going with adding the bit to the busy
> > poll logic because it added no additional overhead. Adding another
> > atomic bit setting operation or heavily modifying the existing one
> > would add considerable overhead as it is either adding a
> > complicated conditional check to all NAPI calls, or adding an
> > atomic operation to the path for the threaded NAPI.
>
> Please help hold on to the patch for now. I think Martin is still
> seeing issues on his setup even with this patch applied. I have not
> yet figured out why. But I think we should not merge this patch until
> the issue is cleared. Will update this thread with progress.

If I'm looking right __busy_poll_stop() is only called if the last
napi poll used to re-enable IRQs consumed full budget. You need to
clear your new bit in busy_poll_stop(), not in __busy_poll_stop().
That will fix the case when hand off back to the normal poller (sirq, 
or thread) happens without going thru __napi_schedule().
Wei Wang Feb. 25, 2021, 12:44 a.m. UTC | #13
On Wed, Feb 24, 2021 at 4:33 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 24 Feb 2021 16:16:58 -0800 Wei Wang wrote:
> > On Wed, Feb 24, 2021 at 4:11 PM Alexander Duyck <alexanderduyck@fb.com> wrote:
> > >
> > > The problem with adding a bit for SCHED_THREADED is that you would
> > > have to heavily modify napi_schedule_prep so that it would add the
> > > bit. That is the reason for going with adding the bit to the busy
> > > poll logic because it added no additional overhead. Adding another
> > > atomic bit setting operation or heavily modifying the existing one
> > > would add considerable overhead as it is either adding a
> > > complicated conditional check to all NAPI calls, or adding an
> > > atomic operation to the path for the threaded NAPI.
> >
> > Please help hold on to the patch for now. I think Martin is still
> > seeing issues on his setup even with this patch applied. I have not
> > yet figured out why. But I think we should not merge this patch until
> > the issue is cleared. Will update this thread with progress.
>
> If I'm looking right __busy_poll_stop() is only called if the last
> napi poll used to re-enable IRQs consumed full budget. You need to
> clear your new bit in busy_poll_stop(), not in __busy_poll_stop().
> That will fix the case when hand off back to the normal poller (sirq,
> or thread) happens without going thru __napi_schedule().

If the budget is not fully consumed, napi_complete_done() should have
been called by the driver, which will clear the SCHED_BUSY_POLL bit.
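
For reference, the driver-side contract this relies on looks roughly as
follows; this is a sketch, and the example_* helpers are made-up names,
not functions from any real driver:

static int example_napi_poll(struct napi_struct *napi, int budget)
{
	/* hypothetical RX cleanup helper */
	int work = example_clean_rx(napi, budget);

	/* under budget means the NAPI is done for now, so the driver
	 * calls napi_complete_done(), which (with this patch) clears
	 * SCHED and SCHED_BUSY_POLL together
	 */
	if (work < budget && napi_complete_done(napi, work))
		example_enable_irq(napi); /* hypothetical helper */

	return work;
}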
Jakub Kicinski Feb. 25, 2021, 12:49 a.m. UTC | #14
On Wed, 24 Feb 2021 16:44:55 -0800 Wei Wang wrote:
> On Wed, Feb 24, 2021 at 4:33 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Wed, 24 Feb 2021 16:16:58 -0800 Wei Wang wrote:
> > > On Wed, Feb 24, 2021 at 4:11 PM Alexander Duyck <alexanderduyck@fb.com> wrote:
> >  [...]
> > >
> > > Please help hold on to the patch for now. I think Martin is still
> > > seeing issues on his setup even with this patch applied. I have not
> > > yet figured out why. But I think we should not merge this patch until
> > > the issue is cleared. Will update this thread with progress.
> >
> > If I'm looking right __busy_poll_stop() is only called if the last
> > napi poll used to re-enable IRQs consumed full budget. You need to
> > clear your new bit in busy_poll_stop(), not in __busy_poll_stop().
> > That will fix the case when hand off back to the normal poller (sirq,
> > or thread) happens without going thru __napi_schedule().
>
> If the budget is not fully consumed, napi_complete_done() should have
> been called by the driver, which will clear the SCHED_BUSY_POLL bit.

Ah, right.
Wei Wang Feb. 25, 2021, 1:06 a.m. UTC | #15
I really have a hard time reproducing the warning Martin was seeing in
his setup. The difference between my setup and his is that mine uses
mlx4 driver, while Martin is using ixgbe driver.

To keep everyone up to date with Martin's previous email, with this
patch applied to 5.11.1, the following warning is triggered when
enabling threaded mode without enabling busy poll:
echo 1 > /sys/class/net/eth0/threaded
echo 1 > /sys/class/net/eth1/threaded
echo 1 > /sys/class/net/eth2/threaded
echo 1 > /sys/class/net/eth3/threaded

Warning message:
[   92.883326] WARNING: CPU: 10 PID: 37100 at net/core/dev.c:6993
napi_threaded_poll+0x144/0x150
[   92.883333] Modules linked in: iptable_filter xt_TCPMSS
iptable_mangle xt_addrtype xt_nat iptable_nat ip_tables  pppoe pppox
ppp_generic slhc team_mode_loadbalance team netconsole coretemp ixgbe
mdio mdio_devres libphy
[   92.884616]  ip_tables  pppoe pppox ppp_generic slhc
team_mode_loadbalance team netconsole coretemp ixgbe mdio mdio_devres
libphy nf_nat_sip nf_conntrack_sip nf_nat_pptp nf_conntrack_pptp
[   92.886001]  nf_nat_sip
[   92.887262]  nf_nat_tftp
[   92.891169]  nf_conntrack_sip
[   92.894696]  nf_conntrack_tftp nf_nat_ftp nf_conntrack_ftp nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 acpi_ipmi ipmi_si
ipmi_devintf ipmi_msghandler rtc_cmos
[   92.894705] CPU: 18 PID: 37132 Comm: napi/eth3-703 Tainted: G
    O      5.11.1 #1
[   92.895133]  nf_nat_pptp
[   92.895570] Hardware name: Supermicro Super Server/X10DRi-LN4+,
BIOS 3.2 11/19/2019
[   92.895572] RIP: 0010:napi_threaded_poll+0x144/0x150
[   92.895576] Code: 83 e8 f0 2b 9f ff 80 7c 24 07 00 0f 84 e9 fe ff
ff e8 40 75 1f 00 e9 77 ff ff ff 48 8d 74 24 07 48 89 df e8 2e fd ff
ff eb cb <0f> 0b e9 53 ff ff ff 0f 1f 44 00 00 41 57 41 56 41 55 41 54
55 53
[   92.896097]  nf_conntrack_pptp
[   92.898490] RSP: 0018:ffffa3af62857ee0 EFLAGS: 00010287
[   92.898493] RAX: ffffa3af434fcf50 RBX: ffff94e5da281050 RCX: 0000000000000000
[   92.898494] RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffff94e5da281050
[   92.898495] RBP: ffff94e60f463b00 R08: ffff94e9afa21758 R09: ffff94e9afa21758
[   92.898496] R10: 0000000000000000 R11: 0000000000000000 R12: ffff94e5db0bf800
[   92.898497] R13: ffffa3af4521fd10 R14: ffff94e5da281050 R15: ffff94e60f463b00
[   92.898499] FS:  0000000000000000(0000) GS:ffff94e9afa00000(0000)
knlGS:0000000000000000
[   92.898501] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   92.898502] CR2: 00007f76000a1b60 CR3: 00000001db40a005 CR4: 00000000001706e0
[   92.898503] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   92.898504] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   92.898506] Call Trace:
[   92.898508]  ? __kthread_parkme+0x43/0x60
[   92.898514]  ? __napi_poll+0x190/0x190
[   92.898516]  kthread+0xea/0x120
[   92.898520]  ? kthread_park+0x80/0x80
[   92.898523]  ret_from_fork+0x1f/0x30
[   92.898527] ---[ end trace 51046c7b7172e5a2 ]---

This is the line in net/core/dev.c:6993
 WARN_ON(!list_empty(&napi->poll_list));
in napi_thread_wait()

Martin, do you think the driver version you are using could be at fault here?

On Wed, Feb 24, 2021 at 4:49 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 24 Feb 2021 16:44:55 -0800 Wei Wang wrote:
> > On Wed, Feb 24, 2021 at 4:33 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > On Wed, 24 Feb 2021 16:16:58 -0800 Wei Wang wrote:
> > > > On Wed, Feb 24, 2021 at 4:11 PM Alexander Duyck <alexanderduyck@fb.com> wrote:
> >  [...]
> > > >
> > > > Please help hold on to the patch for now. I think Martin is still
> > > > seeing issues on his setup even with this patch applied. I have not
> > > > yet figured out why. But I think we should not merge this patch until
> > > > the issue is cleared. Will update this thread with progress.
> > >
> > > If I'm looking right __busy_poll_stop() is only called if the last
> > > napi poll used to re-enable IRQs consumed full budget. You need to
> > > clear your new bit in busy_poll_stop(), not in __busy_poll_stop().
> > > That will fix the case when hand off back to the normal poller (sirq,
> > > or thread) happens without going thru __napi_schedule().
> >
> > If the budget is not fully consumed, napi_complete_done() should have
> > been called by the driver, which will clear the SCHED_BUSY_POLL bit.
>
> Ah, right.
Alexander Duyck Feb. 25, 2021, 1:22 a.m. UTC | #16
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Wednesday, February 24, 2021 4:21 PM
> To: Alexander Duyck <alexanderduyck@fb.com>
> Cc: Eric Dumazet <edumazet@google.com>; Wei Wang
> <weiwan@google.com>; David S . Miller <davem@davemloft.net>; netdev
> <netdev@vger.kernel.org>; Paolo Abeni <pabeni@redhat.com>; Hannes
> Frederic Sowa <hannes@stressinduktion.org>; Martin Zaharinov
> <micron10@gmail.com>
> Subject: Re: [PATCH net] net: fix race between napi kthread mode and busy
> poll
>
> On Thu, 25 Feb 2021 00:11:34 +0000 Alexander Duyck wrote:
> > > > We were trying to not pollute the list (with about 40 different
> > > > emails so far)
> > > >
> > > > (Note this was not something I initiated, I only hit Reply all
> > > > button)
> > > >
> > > > OK, I will shut up, since you seem to take over this matter, and
> > > > it is 1am here in France.
> > >
> > > Are you okay with adding a SCHED_THREADED bit for threaded NAPI to
> > > be set in addition to SCHED? At least that way the bit is associated with its
> user.
> > > IIUC since the extra clear_bit() in busy poll was okay so should be
> > > a new set_bit()?
> >
> > The problem with adding a bit for SCHED_THREADED is that you would
> > have to heavily modify napi_schedule_prep so that it would add the
> > bit. That is the reason for going with adding the bit to the busy poll
> > logic because it added no additional overhead. Adding another atomic
> > bit setting operation or heavily modifying the existing one would add
> > considerable overhead as it is either adding a complicated conditional
> > check to all NAPI calls, or adding an atomic operation to the path for
> > the threaded NAPI.
>
> I wasn't thinking of modifying the main schedule logic, just the threaded
> parts:
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index
> ddf4cfc12615..6953005d06af 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -360,6 +360,7 @@ enum {
>         NAPI_STATE_IN_BUSY_POLL,        /* sk_busy_loop() owns this NAPI */
>         NAPI_STATE_PREFER_BUSY_POLL,    /* prefer busy-polling over softirq
> processing*/
>         NAPI_STATE_THREADED,            /* The poll is performed inside its own
> thread*/
> +       NAPI_STATE_SCHED_THREAD,        /* Thread owns the NAPI and will poll
> */
>  };
>
>  enum {
> diff --git a/net/core/dev.c b/net/core/dev.c index
> 6c5967e80132..23e53f971478 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4294,6 +4294,7 @@ static inline void ____napi_schedule(struct
> softnet_data *sd,
>                  */
>                 thread = READ_ONCE(napi->thread);
>                 if (thread) {
> +                       set_bit(NAPI_STATE_SCHED_THREAD, &napi->state);
>                         wake_up_process(thread);
>                         return;
>                 }
> @@ -6486,7 +6487,8 @@ bool napi_complete_done(struct napi_struct *n,
> int work_done)
>                 WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));
>
>                 new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED |
> -                             NAPIF_STATE_PREFER_BUSY_POLL);
> +                             NAPIF_STATE_PREFER_BUSY_POLL |
> +                             NAPIF_STATE_SCHED_THREAD);
>
>                 /* If STATE_MISSED was set, leave STATE_SCHED set,
>                  * because we will call napi->poll() one more time.
> @@ -6971,7 +6973,9 @@ static int napi_thread_wait(struct napi_struct
> *napi)
>         set_current_state(TASK_INTERRUPTIBLE);
>
>         while (!kthread_should_stop() && !napi_disable_pending(napi)) {
> -               if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
> +               if (test_bit(NAPI_STATE_SCHED_THREAD, &napi->state)) {
> +                       WARN_ON(!test_bit(NAPI_STATE_SCHED,
> +                                         &napi->state));
>                         WARN_ON(!list_empty(&napi->poll_list));
>                         __set_current_state(TASK_RUNNING);
>                         return 0;

Yeah, that was the patch Wei had done earlier. Eric complained about
the extra set_bit atomic operation in the threaded path. That is when
I came up with the idea of just adding a bit to the busy poll logic so
that the only extra cost in the threaded path was having to check 2
bits instead of 1.
Jakub Kicinski Feb. 25, 2021, 1:40 a.m. UTC | #17
On Wed, 24 Feb 2021 17:06:13 -0800 Wei Wang wrote:
> I really have a hard time reproducing the warning Martin was seeing in
> his setup. The difference between my setup and his is that mine uses
> mlx4 driver, while Martin is using ixgbe driver.
>
> To keep everyone up to date with Martin's previous email, with this
> patch applied to 5.11.1, the following warning is triggered when
> enabling threaded mode without enabling busy poll:
> echo 1 > /sys/class/net/eth0/threaded
> echo 1 > /sys/class/net/eth1/threaded
> echo 1 > /sys/class/net/eth2/threaded
> echo 1 > /sys/class/net/eth3/threaded
>
> Warning message:
> [...]
>
> This is the line in net/core/dev.c:6993
>  WARN_ON(!list_empty(&napi->poll_list));
> in napi_thread_wait()
>
> Martin, do you think the driver version you are using could be at fault here?

We do kthread_run() meaning the thread gets immediately woken up, even
when sirq is polling the NAPI and owns it. Right?
Jakub Kicinski Feb. 25, 2021, 2:03 a.m. UTC | #18
On Thu, 25 Feb 2021 01:22:08 +0000 Alexander Duyck wrote:
> Yeah, that was the patch Wei had done earlier. Eric complained about
> the extra set_bit atomic operation in the threaded path. That is when
> I came up with the idea of just adding a bit to the busy poll logic so
> that the only extra cost in the threaded path was having to check 2
> bits instead of 1.

Maybe we can set the bit only if the thread is running? When thread
comes out of schedule() it can be sure that it has an NAPI to service.
But when it enters napi_thread_wait() and before it hits schedule()
it must be careful to make sure the NAPI is still (or already in the
very first run after creation) owned by it.
Wei Wang Feb. 25, 2021, 2:16 a.m. UTC | #19
On Wed, Feb 24, 2021 at 5:40 PM Jakub Kicinski <kuba@kernel.org> wrote:
>

> On Wed, 24 Feb 2021 17:06:13 -0800 Wei Wang wrote:

> > I really have a hard time reproducing the warning Martin was seeing in

> > his setup. The difference between my setup and his is that mine uses

> > mlx4 driver, while Martin is using ixgbe driver.

> >

> > To keep everyone up to date with Martin's previous email, with this

> > patch applied to 5.11.1, the following warning is triggered when

> > enabling threaded mode without enabling busy poll:

> > echo 1 > /sys/class/net/eth0/threaded

> > echo 1 > /sys/class/net/eth1/threaded

> > echo 1 > /sys/class/net/eth2/threaded

> > echo 1 > /sys/class/net/eth3/threaded

> >

> > Warning message:

> > [...]

> >

> > This is the line in net/core/dev.c:6993

> >  WARN_ON(!list_empty(&napi->poll_list));

> > in napi_thread_wait()

> >

> > Martin, do you think the driver version you are using could be at fault here?

>

> We do kthread_run() meaning the thread gets immediately woken up, even

> when sirq is polling the NAPI and owns it. Right?


Indeed. Good catch! Changing it to kthread_create() should solve the
issue I think. We were using kthread_run() because kthread_create()
messes up certain thread stats if I remember it correctly.
Wei Wang Feb. 25, 2021, 2:31 a.m. UTC | #20
On Wed, Feb 24, 2021 at 6:03 PM Jakub Kicinski <kuba@kernel.org> wrote:
>

> On Thu, 25 Feb 2021 01:22:08 +0000 Alexander Duyck wrote:

> > Yeah, that was the patch Wei had done earlier. Eric complained about the extra set_bit atomic operation in the threaded path. That is when I came up with the idea of just adding a bit to the busy poll logic so that the only extra cost in the threaded path was having to check 2 bits instead of 1.

>

> Maybe we can set the bit only if the thread is running? When thread

> comes out of schedule() it can be sure that it has an NAPI to service.

> But when it enters napi_thread_wait() and before it hits schedule()

> it must be careful to make sure the NAPI is still (or already in the

> very first run after creation) owned by it.


Are you suggesting setting the SCHED_THREAD bit in napi_thread_wait()
somewhere instead of in ____napi_schedule() as you previously plotted?
What does it help? I think if we have to do an extra set_bit(), it
seems cleaner to set it in ____napi_schedule(). This would solve the
warning issue as well.
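
For reference, setting it unconditionally in ____napi_schedule() is the hunk from the earlier patch quoted above, roughly:

                thread = READ_ONCE(napi->thread);
                if (thread) {
                        /* Mark the NAPI as owned by the kthread before the
                         * wakeup; costs one atomic set_bit() per schedule.
                         */
                        set_bit(NAPI_STATE_SCHED_THREAD, &napi->state);
                        wake_up_process(thread);
                        return;
                }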
Martin Zaharinov Feb. 25, 2021, 5:52 a.m. UTC | #21
Hi Wei, Jakub and Eric

I have attached the new crash log below.

@Wei  I tried both the external ixgbe driver from SourceForge and the in-kernel driver.
With the external driver, the machine crashed and rebooted immediately at boot.
With the in-kernel driver on kernel 5.11.1, one time I got the BUG info right as the system finished booting and threaded mode was enabled. Another time the system booted without any errors and then crashed and rebooted after 1-2 hours.


Two crashes with system reboot:


Feb 25 05:12:26  [20079.489376][ T3787] ------------[ cut here ]------------
Feb 25 05:12:26  [20079.512029][ T3787] list_del corruption. next->prev should be ffff94e5c4bfe800, but was ffff94e5e2265b00
Feb 25 05:12:26  [20079.557299][ T3787] WARNING: CPU: 17 PID: 3787 at lib/list_debug.c:54 __list_del_entry_valid+0x8a/0x90
Feb 25 05:12:26  [20079.602545][ T3787] Modules linked in: udp_diag raw_diag unix_diag nf_conntrack_netlink nfnetlink iptable_filter xt_TCPMSS iptable_mangle xt_addrtype xt_nat iptable_nat ip_tables  pppoe pppox ppp_generic slhc team_mode_loadbalance team netconsole coretemp ixgbe mdio mdio_devres libphy nf_nat_sip nf_conntrack_sip nf_nat_pptp nf_conntrack_pptp nf_nat_tftp nf_conntrack_tftp nf_nat_ftp nf_conntrack_ftp nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler rtc_cmos
Feb 25 05:12:26  [20079.786930][ T3787] CPU: 17 PID: 3787 Comm: kresd Tainted: G        W  O      5.11.1 #1
Feb 25 05:12:26  [20079.832711][ T3787] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS 3.2 11/19/2019
Feb 25 05:12:26  [20079.878486][ T3787] RIP: 0010:__list_del_entry_valid+0x8a/0x90
Feb 25 05:12:26  [20079.901452][ T3787] Code: 51 00 0f 0b 31 c0 c3 48 89 f2 48 89 fe 48 c7 c7 38 f2 34 84 e8 8f 43 51 00 0f 0b 31 c0 c3 48 c7 c7 78 f2 34 84 e8 7e 43 51 00 <0f> 0b 31 c0 c3 cc 89 f8 48 85 d2 74 20 48 8d 0c 16 0f b6 16 48 ff
Feb 25 05:12:26  [20079.970301][ T3787] RSP: 0018:ffffa3af46073ce0 EFLAGS: 00010282
Feb 25 05:12:26  [20079.993714][ T3787] RAX: 0000000000000054 RBX: ffff94e5db11a158 RCX: 00000000ffefffff
Feb 25 05:12:27  [20080.041094][ T3787] RDX: 00000000ffefffff RSI: ffff94e9a6800000 RDI: 0000000000000001
Feb 25 05:12:27  [20080.090577][ T3787] RBP: ffff94e5db11a158 R08: ffff94e9ad7fffe8 R09: 00000000ffffffea
Feb 25 05:12:27  [20080.141361][ T3787] R10: 00000000ffefffff R11: 80000000fff00000 R12: ffff94e5c4bfe800
Feb 25 05:12:27  [20080.192432][ T3787] R13: 0000000000000005 R14: ffffa3af46073d08 R15: ffff94e5c4bfe800
Feb 25 05:12:27  [20080.243987][ T3787] FS:  00007fe4f6383d80(0000) GS:ffff94e9af9c0000(0000) knlGS:0000000000000000
Feb 25 05:12:27  [20080.295701][ T3787] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 25 05:12:27  [20080.321667][ T3787] CR2: 00007fe4f6309114 CR3: 0000000151a82001 CR4: 00000000001706e0
Feb 25 05:12:27  [20080.372339][ T3787] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 25 05:12:27  [20080.422779][ T3787] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 25 05:12:27  [20080.472894][ T3787] Call Trace:
Feb 25 05:12:27  [20080.497205][ T3787]  netif_receive_skb_list_internal+0x26c/0x2c0
Feb 25 05:12:27  [20080.521674][ T3787]  busy_poll_stop+0x171/0x1a0
Feb 25 05:12:27  [20080.545845][ T3787]  ? ixgbe_clean_rx_irq+0xa10/0xa10 [ixgbe]
Feb 25 05:12:27  [20080.569820][ T3787]  ? ep_destroy_wakeup_source+0x20/0x20
Feb 25 05:12:27  [20080.593359][ T3787]  napi_busy_loop+0x32a/0x340
Feb 25 05:12:27  [20080.616508][ T3787]  ? ktime_get_ts64+0x44/0xe0
Feb 25 05:12:27  [20080.639370][ T3787]  ep_poll+0xc6/0x660
Feb 25 05:12:27  [20080.661570][ T3787]  do_epoll_pwait.part.0+0xf3/0x160
Feb 25 05:12:27  [20080.683715][ T3787]  __x64_sys_epoll_pwait+0x6a/0x100
Feb 25 05:12:27  [20080.705454][ T3787]  ? do_syscall_64+0x2d/0x40
Feb 25 05:12:27  [20080.726932][ T3787]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Feb 25 05:12:27  [20080.748279][ T3787] ---[ end trace 51046c7b7172e5a4 ]---
Feb 25 05:12:27  [20080.769384][    C5] BUG: kernel NULL pointer dereference, address: 0000000000000008
Feb 25 05:12:27  [20080.810675][    C5] #PF: supervisor write access in kernel mode
Feb 25 05:12:27  [20080.831353][    C5] #PF: error_code(0x0002) - not-present page
Feb 25 05:12:27  [20080.851636][    C5] PGD 105b18067 P4D 105b18067 PUD 1024cf067 PMD 0
Feb 25 05:12:27  [20080.871706][    C5] Oops: 0002 [#1] SMP NOPTI
Feb 25 05:12:27  [20080.891182][    C5] CPU: 5 PID: 37104 Comm: napi/eth1-691 Tainted: G        W  O      5.11.1 #1
Feb 25 05:12:27  [20080.929470][    C5] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS 3.2 11/19/2019
Feb 25 05:12:27  [20080.967621][    C5] RIP: 0010:process_backlog+0xcf/0x230
Feb 25 05:12:27  [20080.986894][    C5] Code: 01 00 00 41 8b b7 10 ff ff ff 8d 56 ff 41 89 97 10 ff ff ff 48 8b 08 48 8b 50 08 48 c7 00 00 00 00 00 48 c7 40 08 00 00 00 00 <48> 89 51 08 48 89 0a 0f 1f 44 00 00 48 89 44 24 08 48 8d 54 24 10
Feb 25 05:12:27  [20081.044247][    C5] RSP: 0018:ffffa3af43420eb8 EFLAGS: 00010286
Feb 25 05:12:28  [20081.063017][    C5] RAX: ffff94e5c4bfe800 RBX: ffff94e5af961e50 RCX: 0000000000000000
Feb 25 05:12:28  [20081.100074][    C5] RDX: ffff94e5af961e50 RSI: 0000000000000002 RDI: ffffffff836e3601
Feb 25 05:12:28  [20081.138274][    C5] RBP: 0000000000000040 R08: 0000000000000000 R09: 0000000000000003
Feb 25 05:12:28  [20081.177716][    C5] R10: ffff94e36e9d2ac0 R11: ffff94e35a0a8660 R12: 0000000000000003
Feb 25 05:12:28  [20081.218624][    C5] R13: ffff94e5da7b2c40 R14: ffff94e243ce4000 R15: ffff94e5af961f50
Feb 25 05:12:28  [20081.261329][    C5] FS:  0000000000000000(0000) GS:ffff94e5af940000(0000) knlGS:0000000000000000
Feb 25 05:12:28  [20081.306019][    C5] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 25 05:12:28  [20081.329045][    C5] CR2: 0000000000000008 CR3: 000000021d4b2006 CR4: 00000000001706e0
Feb 25 05:12:28  [20081.374528][    C5] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 25 05:12:28  [20081.420343][    C5] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 25 05:12:28  [20081.467189][    C5] Call Trace:
Feb 25 05:12:28  [20081.490382][    C5]  <IRQ>
Feb 25 05:12:28  [20081.513088][    C5]  __napi_poll+0x21/0x190
Feb 25 05:12:28  [20081.535582][    C5]  net_rx_action+0x239/0x2f0
Feb 25 05:12:28  [20081.557602][    C5]  __do_softirq+0xad/0x1d8
Feb 25 05:12:28  [20081.579133][    C5]  asm_call_irq_on_stack+0xf/0x20
Feb 25 05:12:28  [20081.600495][    C5]  </IRQ>
Feb 25 05:12:28  [20081.621370][    C5]  do_softirq_own_stack+0x32/0x40
Feb 25 05:12:28  [20081.642058][    C5]  do_softirq+0x42/0x50
Feb 25 05:12:28  [20081.662354][    C5]  __local_bh_enable_ip+0x45/0x50
Feb 25 05:12:28  [20081.682435][    C5]  napi_threaded_poll+0x120/0x150
Feb 25 05:12:28  [20081.702239][    C5]  ? __napi_poll+0x190/0x190
Feb 25 05:12:28  [20081.721585][    C5]  kthread+0xea/0x120
Feb 25 05:12:28  [20081.740384][    C5]  ? kthread_park+0x80/0x80
Feb 25 05:12:28  [20081.758773][    C5]  ret_from_fork+0x1f/0x30
Feb 25 05:12:28  [20081.776646][    C5] Modules linked in: udp_diag raw_diag unix_diag nf_conntrack_netlink nfnetlink iptable_filter xt_TCPMSS iptable_mangle xt_addrtype xt_nat iptable_nat ip_tables  pppoe pppox ppp_generic slhc  team_mode_loadbalance team netconsole coretemp ixgbe mdio mdio_devres libphy nf_nat_sip nf_conntrack_sip nf_nat_pptp nf_conntrack_pptp nf_nat_tftp nf_conntrack_tftp nf_nat_ftp nf_conntrack_ftp nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler rtc_cmos 
Feb 25 05:12:28  [20081.924344][    C5] CR2: 0000000000000008
Feb 25 05:12:28  [20081.943038][    C5] ---[ end trace 51046c7b7172e5a5 ]---
Feb 25 05:12:28  [20081.961529][    C5] RIP: 0010:process_backlog+0xcf/0x230
Feb 25 05:12:28  [20081.979654][    C5] Code: 01 00 00 41 8b b7 10 ff ff ff 8d 56 ff 41 89 97 10 ff ff ff 48 8b 08 48 8b 50 08 48 c7 00 00 00 00 00 48 c7 40 08 00 00 00 00 <48> 89 51 08 48 89 0a 0f 1f 44 00 00 48 89 44 24 08 48 8d 54 24 10
Feb 25 05:12:28  [20082.033915][    C5] RSP: 0018:ffffa3af43420eb8 EFLAGS: 00010286
Feb 25 05:12:29  [20082.052231][    C5] RAX: ffff94e5c4bfe800 RBX: ffff94e5af961e50 RCX: 0000000000000000
Feb 25 05:12:29  [20082.089750][    C5] RDX: ffff94e5af961e50 RSI: 0000000000000002 RDI: ffffffff836e3601
Feb 25 05:12:29  [20082.129164][    C5] RBP: 0000000000000040 R08: 0000000000000000 R09: 0000000000000003
Feb 25 05:12:29  [20082.170336][    C5] R10: ffff94e36e9d2ac0 R11: ffff94e35a0a8660 R12: 0000000000000003
Feb 25 05:12:29  [20082.213161][    C5] R13: ffff94e5da7b2c40 R14: ffff94e243ce4000 R15: ffff94e5af961f50
Feb 25 05:12:29  [20082.257809][    C5] FS:  0000000000000000(0000) GS:ffff94e5af940000(0000) knlGS:0000000000000000
Feb 25 05:12:29  [20082.304448][    C5] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 25 05:12:29  [20082.328708][    C5] CR2: 0000000000000008 CR3: 000000021d4b2006 CR4: 00000000001706e0
Feb 25 05:12:29  [20082.377167][    C5] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 25 05:12:29  [20082.425510][    C5] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 25 05:12:29  [20082.473439][    C5] Kernel panic - not syncing: Fatal exception in interrupt
Feb 25 05:12:29  [20082.497657][   C36] ------------[ cut here ]------------
Feb 25 05:12:29  [20082.521414][   C36] sched: Unexpected reschedule of offline CPU#12!
Feb 25 05:12:29  [20082.544949][   C36] WARNING: CPU: 36 PID: 0 at arch/x86/kernel/apic/ipi.c:68 native_smp_send_reschedule+0x2c/0x30
Feb 25 05:12:29  [20082.591983][   C36] Modules linked in: udp_diag raw_diag unix_diag nf_conntrack_netlink nfnetlink iptable_filter xt_TCPMSS iptable_mangle xt_addrtype xt_nat iptable_nat ip_tables pppoe pppox ppp_generic slhc team_mode_loadbalance team netconsole coretemp ixgbe mdio mdio_devres libphy nf_nat_sip nf_conntrack_sip nf_nat_pptp nf_conntrack_pptp nf_nat_tftp nf_conntrack_tftp nf_nat_ftp nf_conntrack_ftp nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler rtc_cmos
Feb 25 05:12:29  [20082.792688][   C36] CPU: 36 PID: 0 Comm: swapper/36 Tainted: G      D W  O      5.11.1 #1
Feb 25 05:12:29  [20082.843263][   C36] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS 3.2 11/19/2019
Feb 25 05:12:29  [20082.893782][   C36] RIP: 0010:native_smp_send_reschedule+0x2c/0x30
Feb 25 05:12:29  [20082.919265][   C36] Code: 48 0f a3 05 f6 f2 62 01 73 12 48 8b 05 4d b6 36 01 be fd 00 00 00 48 8b 40 30 ff e0 89 fe 48 c7 c7 20 d4 32 84 e8 3c 95 7d 00 <0f> 0b c3 90 48 8b 05 29 b6 36 01 be fb 00 00 00 48 8b 40 30 ff e0
Feb 25 05:12:29  [20082.995368][   C36] RSP: 0018:ffffa3af4397cf18 EFLAGS: 00010096
Feb 25 05:12:30  [20083.020383][   C36] RAX: 000000000000002f RBX: 0000000000000000 RCX: 00000000ffefffff
Feb 25 05:12:30  [20083.069872][   C36] RDX: 00000000ffefffff RSI: ffff94e9a6800000 RDI: 0000000000000001
Feb 25 05:12:30  [20083.119633][   C36] RBP: ffffa3af4397cf68 R08: ffff94e9ad7fffe8 R09: 00000000ffffffea
Feb 25 05:12:30  [20083.169439][   C36] R10: 00000000ffefffff R11: 80000000fff00000 R12: 0000000000000002
Feb 25 05:12:30  [20083.219384][   C36] R13: ffff94e5da7b3b00 R14: 0000000000020dc0 R15: ffff94e9af8a0dc0
Feb 25 05:12:30  [20083.269795][   C36] FS:  0000000000000000(0000) GS:ffff94e9afc00000(0000) knlGS:0000000000000000
Feb 25 05:12:30  [20083.320848][   C36] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 25 05:12:30  [20083.346500][   C36] CR2: 00007fe4f6368414 CR3: 0000000151a82004 CR4: 00000000001706e0
Feb 25 05:12:30  [20083.396487][   C36] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 25 05:12:30  [20083.446178][   C36] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 25 05:12:30  [20083.495481][   C36] Call Trace:
Feb 25 05:12:30  [20083.519343][   C36]  <IRQ>
Feb 25 05:12:30  [20083.524266][    C5] Shutting down cpus with NMI
Feb 25 05:12:30  [20083.542492][   C36]  try_to_wake_up+0x5da/0x780
Feb 25 05:12:30  [20083.666286][    C5] Kernel Offset: 0x2000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Feb 25 05:12:30  [20083.734632][    C5] Rebooting in 10 seconds..
Feb 25 05:12:40  [20093.758840][    C5] ACPI MEMORY or I/O RESET_REG.


Feb 25 06:16:25  [ 3732.006467][ T3768] ------------[ cut here ]------------
Feb 25 06:16:25  [ 3732.007210][ T3768] list_del corruption. prev->next should be ffff9a80f49ad700, but was ffff9a7fd593f158
Feb 25 06:16:25  [ 3732.008527][ T3768] WARNING: CPU: 0 PID: 3768 at lib/list_debug.c:51 __list_del_entry_valid+0x79/0x90
Feb 25 06:16:25  [ 3732.009798][ T3768] Modules linked in: nf_conntrack_netlink nfnetlink udp_diag raw_diag unix_diag iptable_filter xt_TCPMSS iptable_mangle xt_addrtype xt_nat iptable_nat ip_tables pppoe pppox ppp_generic slhc team_mode_loadbalance team netconsole coretemp ixgbe mdio mdio_devres libphy nf_nat_sip nf_conntrack_sip nf_nat_pptp nf_conntrack_pptp nf_nat_tftp nf_conntrack_tftp nf_nat_ftp nf_conntrack_ftp nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler rtc_cmos
Feb 25 06:16:25  [ 3732.017363][ T3768] CPU: 0 PID: 3768 Comm: kresd Tainted: G           O      5.11.1 #1
Feb 25 06:16:25  [ 3732.018450][ T3768] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS 3.2 11/19/2019
Feb 25 06:16:25  [ 3732.019591][ T3768] RIP: 0010:__list_del_entry_valid+0x79/0x90
Feb 25 06:16:25  [ 3732.020392][ T3768] Code: c3 48 89 fe 4c 89 c2 48 c7 c7 00 f2 34 b0 e8 a6 43 51 00 0f 0b 31 c0 c3 48 89 f2 48 89 fe 48 c7 c7 38 f2 34 b0 e8 8f 43 51 00 <0f> 0b 31 c0 c3 48 c7 c7 78 f2 34 b0 e8 7e 43 51 00 0f 0b 31 c0 c3
Feb 25 06:16:25  [ 3732.023055][ T3768] RSP: 0018:ffffb362c489fce0 EFLAGS: 00010282
Feb 25 06:16:25  [ 3732.023865][ T3768] RAX: 0000000000000054 RBX: ffff9a7fd593f158 RCX: 00000000ffefffff
Feb 25 06:16:25  [ 3732.024934][ T3768] RDX: 00000000ffefffff RSI: ffff9a8726800000 RDI: 0000000000000001
Feb 25 06:16:25  [ 3732.050484][ T3768] RBP: ffff9a7fd593f158 R08: ffff9a872d7fffe8 R09: 00000000ffffffea
Feb 25 06:16:25  [ 3732.101572][ T3768] R10: 00000000ffefffff R11: 80000000fff00000 R12: ffff9a80f49ad700
Feb 25 06:16:26  [ 3732.153197][ T3768] R13: 0000000000000010 R14: ffffb362c489fd08 R15: ffff9a80f49ad700
Feb 25 06:16:26  [ 3732.205130][ T3768] FS:  00007f70035a0d80(0000) GS:ffff9a832f800000(0000) knlGS:0000000000000000
Feb 25 06:16:26  [ 3732.256908][ T3768] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 25 06:16:26  [ 3732.283060][ T3768] CR2: 00007ffc9012dd58 CR3: 0000000137d8a001 CR4: 00000000001706f0
Feb 25 06:16:26  [ 3732.334006][ T3768] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 25 06:16:26  [ 3732.384341][ T3768] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 25 06:16:26  [ 3732.434478][ T3768] Call Trace:
Feb 25 06:16:26  [ 3732.458838][ T3768]  netif_receive_skb_list_internal+0x26c/0x2c0
Feb 25 06:16:26  [ 3732.483356][ T3768]  busy_poll_stop+0x171/0x1a0
Feb 25 06:16:26  [ 3732.507593][ T3768]  ? ixgbe_clean_rx_irq+0xa10/0xa10 [ixgbe]
Feb 25 06:16:26  [ 3732.531658][ T3768]  ? ep_destroy_wakeup_source+0x20/0x20
Feb 25 06:16:26  [ 3732.555290][ T3768]  napi_busy_loop+0x32a/0x340
Feb 25 06:16:26  [ 3732.578556][ T3768]  ? ktime_get_ts64+0x44/0xe0
Feb 25 06:16:26  [ 3732.601494][ T3768]  ep_poll+0xc6/0x660
Feb 25 06:16:26  [ 3732.623816][ T3768]  do_epoll_pwait.part.0+0xf3/0x160
Feb 25 06:16:26  [ 3732.646157][ T3768]  __x64_sys_epoll_pwait+0x6a/0x100
Feb 25 06:16:26  [ 3732.668097][ T3768]  ? do_syscall_64+0x2d/0x40
Feb 25 06:16:26  [ 3732.689774][ T3768]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Feb 25 06:16:26  [ 3732.711306][ T3768] ---[ end trace 4a3b032bd7c76cab ]---
Feb 25 06:16:26  [ 3732.749411][    C0] BUG: kernel NULL pointer dereference, address: 0000000000000008
Feb 25 06:16:26  [ 3732.791034][    C0] #PF: supervisor write access in kernel mode
Feb 25 06:16:26  [ 3732.811904][    C0] #PF: error_code(0x0002) - not-present page
Feb 25 06:16:26  [ 3732.832354][    C0] PGD 105979067 P4D 105979067 PUD 133337067 PMD 0
Feb 25 06:16:26  [ 3732.852769][    C0] Oops: 0002 [#1] SMP NOPTI
Feb 25 06:16:26  [ 3732.872440][    C0] CPU: 0 PID: 10 Comm: ksoftirqd/0 Tainted: G        W  O      5.11.1 #1
Feb 25 06:16:26  [ 3732.910862][    C0] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS 3.2 11/19/2019
Feb 25 06:16:26  [ 3732.949036][    C0] RIP: 0010:process_backlog+0xcf/0x230
Feb 25 06:16:26  [ 3732.968335][    C0] Code: 01 00 00 41 8b b7 10 ff ff ff 8d 56 ff 41 89 97 10 ff ff ff 48 8b 08 48 8b 50 08 48 c7 00 00 00 00 00 48 c7 40 08 00 00 00 00 <48> 89 51 08 48 89 0a 0f 1f 44 00 00 48 89 44 24 08 48 8d 54 24 10
Feb 25 06:16:26  [ 3733.025756][    C0] RSP: 0018:ffffb362c31f3da8 EFLAGS: 00010286
Feb 25 06:16:26  [ 3733.044533][    C0] RAX: ffff9a80f49ad700 RBX: ffff9a832f821e50 RCX: 0000000000000000
Feb 25 06:16:26  [ 3733.081729][    C0] RDX: ffff9a832f821e50 RSI: 000000000000060c RDI: ffffffffaf6e3601
Feb 25 06:16:26  [ 3733.119967][    C0] RBP: 0000000000000040 R08: 0000000000000000 R09: 0000000000000003
Feb 25 06:16:27  [ 3733.159415][    C0] R10: ffff9a835b65eec0 R11: ffff9a83428d7620 R12: 0000000000000031
Feb 25 06:16:27  [ 3733.200430][    C0] R13: ffff9a7fc1d5c9c0 R14: ffff9a8359740000 R15: ffff9a832f821f50
Feb 25 06:16:27  [ 3733.243461][    C0] FS:  0000000000000000(0000) GS:ffff9a832f800000(0000) knlGS:0000000000000000
Feb 25 06:16:27  [ 3733.288289][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 25 06:16:27  [ 3733.311467][    C0] CR2: 0000000000000008 CR3: 0000000137d8a001 CR4: 00000000001706f0
Feb 25 06:16:27  [ 3733.357178][    C0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 25 06:16:27  [ 3733.403142][    C0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 25 06:16:27  [ 3733.450233][    C0] Call Trace:
Feb 25 06:16:27  [ 3733.473595][    C0]  __napi_poll+0x21/0x190
Feb 25 06:16:27  [ 3733.496875][    C0]  net_rx_action+0x239/0x2f0
Feb 25 06:16:27  [ 3733.519867][    C0]  __do_softirq+0xad/0x1d8
Feb 25 06:16:27  [ 3733.542374][    C0]  run_ksoftirqd+0x15/0x20
Feb 25 06:16:27  [ 3733.564322][    C0]  smpboot_thread_fn+0xb3/0x140
Feb 25 06:16:27  [ 3733.586069][    C0]  ? sort_range+0x20/0x20
Feb 25 06:16:27  [ 3733.607479][    C0]  kthread+0xea/0x120
Feb 25 06:16:27  [ 3733.628223][    C0]  ? kthread_park+0x80/0x80
Feb 25 06:16:27  [ 3733.648796][    C0]  ret_from_fork+0x1f/0x30
Feb 25 06:16:27  [ 3733.668921][    C0] Modules linked in: nf_conntrack_netlink nfnetlink udp_diag raw_diag unix_diag iptable_filter xt_TCPMSS iptable_mangle xt_addrtype xt_nat iptable_nat ip_tables  pppoe pppox ppp_generic slhc team_mode_loadbalance team netconsole coretemp ixgbe mdio mdio_devres libphy nf_nat_sip nf_conntrack_sip nf_nat_pptp nf_conntrack_pptp nf_nat_tftp nf_conntrack_tftp nf_nat_ftp nf_conntrack_ftp nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler rtc_cmos 
Feb 25 06:16:27  [ 3733.835634][    C0] CR2: 0000000000000008
Feb 25 06:16:27  [ 3733.856445][    C0] ---[ end trace 4a3b032bd7c76cac ]---
Feb 25 06:16:27  [ 3733.877035][    C0] RIP: 0010:process_backlog+0xcf/0x230
Feb 25 06:16:27  [ 3733.897229][    C0] Code: 01 00 00 41 8b b7 10 ff ff ff 8d 56 ff 41 89 97 10 ff ff ff 48 8b 08 48 8b 50 08 48 c7 00 00 00 00 00 48 c7 40 08 00 00 00 00 <48> 89 51 08 48 89 0a 0f 1f 44 00 00 48 89 44 24 08 48 8d 54 24 10
Feb 25 06:16:27  [ 3733.957692][    C0] RSP: 0018:ffffb362c31f3da8 EFLAGS: 00010286
Feb 25 06:16:27  [ 3733.977756][    C0] RAX: ffff9a80f49ad700 RBX: ffff9a832f821e50 RCX: 0000000000000000
Feb 25 06:16:27  [ 3734.016952][    C0] RDX: ffff9a832f821e50 RSI: 000000000000060c RDI: ffffffffaf6e3601
Feb 25 06:16:27  [ 3734.056124][    C0] RBP: 0000000000000040 R08: 0000000000000000 R09: 0000000000000003
Feb 25 06:16:27  [ 3734.096738][    C0] R10: ffff9a835b65eec0 R11: ffff9a83428d7620 R12: 0000000000000031
Feb 25 06:16:27  [ 3734.139326][    C0] R13: ffff9a7fc1d5c9c0 R14: ffff9a8359740000 R15: ffff9a832f821f50
Feb 25 06:16:28  [ 3734.184006][    C0] FS:  0000000000000000(0000) GS:ffff9a832f800000(0000) knlGS:0000000000000000
Feb 25 06:16:28  [ 3734.230851][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 25 06:16:28  [ 3734.255442][    C0] CR2: 0000000000000008 CR3: 0000000137d8a001 CR4: 00000000001706f0
Feb 25 06:16:28  [ 3734.304596][    C0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 25 06:16:28  [ 3734.353227][    C0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 25 06:16:28  [ 3734.401522][    C0] Kernel panic - not syncing: Fatal exception in interrupt
Feb 25 06:16:28  [ 3734.425941][   C10] ------------[ cut here ]------------
Feb 25 06:16:28  [ 3734.449850][   C10] sched: Unexpected reschedule of offline CPU#31!
Feb 25 06:16:28  [ 3734.473526][   C10] WARNING: CPU: 10 PID: 0 at arch/x86/kernel/apic/ipi.c:68 native_smp_send_reschedule+0x2c/0x30
Feb 25 06:16:28  [ 3734.520834][   C10] Modules linked in: nf_conntrack_netlink nfnetlink udp_diag raw_diag unix_diag iptable_filter xt_TCPMSS iptable_mangle xt_addrtype xt_nat iptable_nat ip_tables pppoe pppox ppp_generic slhc  team_mode_loadbalance team netconsole coretemp ixgbe mdio mdio_devres libphy nf_nat_sip nf_conntrack_sip nf_nat_pptp nf_conntrack_pptp nf_nat_tftp nf_conntrack_tftp nf_nat_ftp nf_conntrack_ftp nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler rtc_cmos 
Feb 25 06:16:28  [ 3734.723018][   C10] CPU: 10 PID: 0 Comm: swapper/10 Tainted: G      D W  O      5.11.1 #1
Feb 25 06:16:28  [ 3734.773886][   C10] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS 3.2 11/19/2019
Feb 25 06:16:28  [ 3734.824595][   C10] RIP: 0010:native_smp_send_reschedule+0x2c/0x30
Feb 25 06:16:28  [ 3734.850148][   C10] Code: 48 0f a3 05 f6 f2 62 01 73 12 48 8b 05 4d b6 36 01 be fd 00 00 00 48 8b 40 30 ff e0 89 fe 48 c7 c7 20 d4 32 b0 e8 3c 95 7d 00 <0f> 0b c3 90 48 8b 05 29 b6 36 01 be fb 00 00 00 48 8b 40 30 ff e0
Feb 25 06:16:28  [ 3734.926375][   C10] RSP: 0018:ffffb362c34fcf18 EFLAGS: 00010096
Feb 25 06:16:28  [ 3734.951428][   C10] RAX: 000000000000002f RBX: 0000000000000000 RCX: 00000000ffefffff
Feb 25 06:16:28  [ 3735.000932][   C10] RDX: 00000000ffefffff RSI: ffff9a8726800000 RDI: 0000000000000001
Feb 25 06:16:28  [ 3735.050766][   C10] RBP: ffffb362c34fcf68 R08: ffff9a872d7fffe8 R09: 00000000ffffffea
Feb 25 06:16:28  [ 3735.100642][   C10] R10: 00000000ffefffff R11: 80000000fff00000 R12: 0000000000000002
Feb 25 06:16:29  [ 3735.150651][   C10] R13: ffff9a8359acac40 R14: 0000000000020dc0 R15: ffff9a872fae0dc0
Feb 25 06:16:29  [ 3735.201238][   C10] FS:  0000000000000000(0000) GS:ffff9a872f800000(0000) knlGS:0000000000000000
Feb 25 06:16:29  [ 3735.252469][   C10] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 25 06:16:29  [ 3735.278217][   C10] CR2: 00007f1ef4480828 CR3: 0000000246356005 CR4: 00000000001706e0
Feb 25 06:16:29  [ 3735.328368][   C10] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 25 06:16:29  [ 3735.378197][   C10] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 25 06:16:29  [ 3735.427739][   C10] Call Trace:
Feb 25 06:16:29  [ 3735.451761][   C10]  <IRQ>
Feb 25 06:16:29  [ 3735.452551][    C0] Shutting down cpus with NMI
Feb 25 06:16:29  [ 3735.475047][   C10]  try_to_wake_up+0x5da/0x780
Feb 25 06:16:29  [ 3735.612306][    C0] Kernel Offset: 0x2e000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Feb 25 06:16:29  [ 3735.680997][    C0] Rebooting in 10 seconds..
Feb 25 06:16:39  [ 3745.704638][    C0] ACPI MEMORY or I/O RESET_REG.


Martin
Jakub Kicinski Feb. 25, 2021, 8:21 a.m. UTC | #22
On Wed, 24 Feb 2021 18:31:55 -0800 Wei Wang wrote:
> On Wed, Feb 24, 2021 at 6:03 PM Jakub Kicinski <kuba@kernel.org> wrote:

> >

> > On Thu, 25 Feb 2021 01:22:08 +0000 Alexander Duyck wrote:  

> > > Yeah, that was the patch Wei had done earlier. Eric complained about the extra set_bit atomic operation in the threaded path. That is when I came up with the idea of just adding a bit to the busy poll logic so that the only extra cost in the threaded path was having to check 2 bits instead of 1.  

> >

> > Maybe we can set the bit only if the thread is running? When thread

> > comes out of schedule() it can be sure that it has an NAPI to service.

> > But when it enters napi_thread_wait() and before it hits schedule()

> > it must be careful to make sure the NAPI is still (or already in the

> > very first run after creation) owned by it.  

> 

> Are you suggesting setting the SCHED_THREAD bit in napi_thread_wait()

> somewhere instead of in ____napi_schedule() as you previously plotted?

> What does it help? I think if we have to do an extra set_bit(), it

> seems cleaner to set it in ____napi_schedule(). This would solve the

> warning issue as well.


I was thinking of something roughly like this:

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ddf4cfc12615..3bce94e8c110 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -360,6 +360,7 @@ enum {
        NAPI_STATE_IN_BUSY_POLL,        /* sk_busy_loop() owns this NAPI */
        NAPI_STATE_PREFER_BUSY_POLL,    /* prefer busy-polling over softirq processing*/
        NAPI_STATE_THREADED,            /* The poll is performed inside its own thread*/
+       NAPI_STATE_SCHED_THREAD,        /* Thread owns the NAPI and will poll */
 };
 
 enum {
@@ -372,6 +373,7 @@ enum {
        NAPIF_STATE_IN_BUSY_POLL        = BIT(NAPI_STATE_IN_BUSY_POLL),
        NAPIF_STATE_PREFER_BUSY_POLL    = BIT(NAPI_STATE_PREFER_BUSY_POLL),
        NAPIF_STATE_THREADED            = BIT(NAPI_STATE_THREADED),
+       NAPIF_STATE_SCHED_THREAD        = BIT(NAPI_STATE_SCHED_THREAD),
 };
 
 enum gro_result {
diff --git a/net/core/dev.c b/net/core/dev.c
index 6c5967e80132..852b992d0ebb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4294,6 +4294,8 @@ static inline void ____napi_schedule(struct softnet_data *sd,
                 */
                thread = READ_ONCE(napi->thread);
                if (thread) {
+                       if (thread->state == TASK_RUNNING)
+                               set_bit(NAPI_STATE_SCHED_THREAD, &napi->state);
                        wake_up_process(thread);
                        return;
                }
@@ -6486,7 +6488,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
                WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));
 
                new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED |
-                             NAPIF_STATE_PREFER_BUSY_POLL);
+                             NAPIF_STATE_PREFER_BUSY_POLL |
+                             NAPIF_STATE_SCHED_THREAD);
 
                /* If STATE_MISSED was set, leave STATE_SCHED set,
                 * because we will call napi->poll() one more time.
@@ -6968,16 +6971,24 @@ static int napi_poll(struct napi_struct *n, struct list_head *repoll)
 
 static int napi_thread_wait(struct napi_struct *napi)
 {
+       bool woken = false;
+
        set_current_state(TASK_INTERRUPTIBLE);
 
        while (!kthread_should_stop() && !napi_disable_pending(napi)) {
-               if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
+               unsigned long state = READ_ONCE(napi->state);
+
+               if ((state & NAPIF_STATE_SCHED) &&
+                   ((state & NAPIF_STATE_SCHED_THREAD) || woken)) {
                        WARN_ON(!list_empty(&napi->poll_list));
                        __set_current_state(TASK_RUNNING);
                        return 0;
+               } else {
+                       WARN_ON(woken);
                }
 
                schedule();
+               woken = true;
                set_current_state(TASK_INTERRUPTIBLE);
        }
        __set_current_state(TASK_RUNNING);


Extra set_bit() is only done if napi_schedule() comes early enough to
see the thread still running. When the thread is woken we continue to
assume ownership.

It's just an idea (but it may solve the first run and the disable case).
Wei Wang Feb. 25, 2021, 6:29 p.m. UTC | #23
On Thu, Feb 25, 2021 at 12:21 AM Jakub Kicinski <kuba@kernel.org> wrote:
>

> On Wed, 24 Feb 2021 18:31:55 -0800 Wei Wang wrote:

> > On Wed, Feb 24, 2021 at 6:03 PM Jakub Kicinski <kuba@kernel.org> wrote:

> > >

> > > On Thu, 25 Feb 2021 01:22:08 +0000 Alexander Duyck wrote:

> > > > Yeah, that was the patch Wei had done earlier. Eric complained about the extra set_bit atomic operation in the threaded path. That is when I came up with the idea of just adding a bit to the busy poll logic so that the only extra cost in the threaded path was having to check 2 bits instead of 1.

> > >

> > > Maybe we can set the bit only if the thread is running? When thread

> > > comes out of schedule() it can be sure that it has an NAPI to service.

> > > But when it enters napi_thread_wait() and before it hits schedule()

> > > it must be careful to make sure the NAPI is still (or already in the

> > > very first run after creation) owned by it.

> >

> > Are you suggesting setting the SCHED_THREAD bit in napi_thread_wait()

> > somewhere instead of in ____napi_schedule() as you previously plotted?

> > What does it help? I think if we have to do an extra set_bit(), it

> > seems cleaner to set it in ____napi_schedule(). This would solve the

> > warning issue as well.

>

> I was thinking of something roughly like this:

>

> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h

> index ddf4cfc12615..3bce94e8c110 100644

> --- a/include/linux/netdevice.h

> +++ b/include/linux/netdevice.h

> @@ -360,6 +360,7 @@ enum {

>         NAPI_STATE_IN_BUSY_POLL,        /* sk_busy_loop() owns this NAPI */

>         NAPI_STATE_PREFER_BUSY_POLL,    /* prefer busy-polling over softirq processing*/

>         NAPI_STATE_THREADED,            /* The poll is performed inside its own thread*/

> +       NAPI_STATE_SCHED_THREAD,        /* Thread owns the NAPI and will poll */

>  };

>

>  enum {

> @@ -372,6 +373,7 @@ enum {

>         NAPIF_STATE_IN_BUSY_POLL        = BIT(NAPI_STATE_IN_BUSY_POLL),

>         NAPIF_STATE_PREFER_BUSY_POLL    = BIT(NAPI_STATE_PREFER_BUSY_POLL),

>         NAPIF_STATE_THREADED            = BIT(NAPI_STATE_THREADED),

> +       NAPIF_STATE_SCHED_THREAD        = BIT(NAPI_STATE_SCHED_THREAD),

>  };

>

>  enum gro_result {

> diff --git a/net/core/dev.c b/net/core/dev.c

> index 6c5967e80132..852b992d0ebb 100644

> --- a/net/core/dev.c

> +++ b/net/core/dev.c

> @@ -4294,6 +4294,8 @@ static inline void ____napi_schedule(struct softnet_data *sd,

>                  */

>                 thread = READ_ONCE(napi->thread);

>                 if (thread) {

> +                       if (thread->state == TASK_RUNNING)

> +                               set_bit(NAPI_STATE_SCHED_THREAD, &napi->state);

>                         wake_up_process(thread);

>                         return;

>                 }

> @@ -6486,7 +6488,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)

>                 WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));

>

>                 new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED |

> -                             NAPIF_STATE_PREFER_BUSY_POLL);

> +                             NAPIF_STATE_PREFER_BUSY_POLL |

> +                             NAPIF_STATE_SCHED_THREAD);

>

>                 /* If STATE_MISSED was set, leave STATE_SCHED set,

>                  * because we will call napi->poll() one more time.

> @@ -6968,16 +6971,24 @@ static int napi_poll(struct napi_struct *n, struct list_head *repoll)

>

>  static int napi_thread_wait(struct napi_struct *napi)

>  {

> +       bool woken = false;

> +

>         set_current_state(TASK_INTERRUPTIBLE);

>

>         while (!kthread_should_stop() && !napi_disable_pending(napi)) {

> -               if (test_bit(NAPI_STATE_SCHED, &napi->state)) {

> +               unsigned long state = READ_ONCE(napi->state);

> +

> +               if ((state & NAPIF_STATE_SCHED) &&

> +                   ((state & NAPIF_STATE_SCHED_THREAD) || woken)) {

>                         WARN_ON(!list_empty(&napi->poll_list));

>                         __set_current_state(TASK_RUNNING);

>                         return 0;

> +               } else {

> +                       WARN_ON(woken);

>                 }

>

>                 schedule();

> +               woken = true;

>                 set_current_state(TASK_INTERRUPTIBLE);

>         }

>         __set_current_state(TASK_RUNNING);

>

>

> Extra set_bit() is only done if napi_schedule() comes early enough to

> see the thread still running. When the thread is woken we continue to

> assume ownership.

>

> It's just an idea (but it may solve the first run and the disable case).


Hmm... I don't think the above patch would work. Consider a situation that:
1. At first, the kthread is in sleep mode.
2. Then someone calls napi_schedule() to schedule work on this napi.
So ____napi_schedule() is called. But at this moment, the kthread is
not yet in RUNNING state. So this function does not set SCHED_THREAD
bit.
3. Then wake_up_process() is called to wake up the thread.
4. Then napi_threaded_poll() calls napi_thread_wait(). woken is false
and SCHED_THREAD bit is not set. So the kthread will go to sleep again
(in INTERRUPTIBLE mode) when schedule() is called, and waits to be
woken up by the next napi_schedule().
That will introduce arbitrary delay for the napi->poll() to be called.
Isn't it? Please enlighten me if I did not understand it correctly.
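
For reference, this is the check from the draft above that the scenario never satisfies; with woken == false and SCHED_THREAD unset, the thread falls through to schedule():

                if ((state & NAPIF_STATE_SCHED) &&
                    ((state & NAPIF_STATE_SCHED_THREAD) || woken)) {
                        WARN_ON(!list_empty(&napi->poll_list));
                        __set_current_state(TASK_RUNNING);
                        return 0;
                }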

I personally prefer to directly set SCHED_THREAD bit in ____napi_schedule().
Or stick with SCHED_BUSY_POLL solution and replace kthread_run() with
kthread_create().
Jakub Kicinski Feb. 25, 2021, 11 p.m. UTC | #24
On Thu, 25 Feb 2021 10:29:47 -0800 Wei Wang wrote:
> Hmm... I don't think the above patch would work. Consider a situation that:

> 1. At first, the kthread is in sleep mode.

> 2. Then someone calls napi_schedule() to schedule work on this napi.

> So ____napi_schedule() is called. But at this moment, the kthread is

> not yet in RUNNING state. So this function does not set SCHED_THREAD

> bit.

> 3. Then wake_up_process() is called to wake up the thread.

> 4. Then napi_threaded_poll() calls napi_thread_wait().


But how is the task not in running state outside of napi_thread_wait()?

My scheduler knowledge is rudimentary, but AFAIU off CPU tasks which
were not put to sleep are still in RUNNING state, so unless we set
INTERRUPTIBLE the task will be running, even if it's stuck in cond_resched().

> woken is false

> and SCHED_THREAD bit is not set. So the kthread will go to sleep again

> (in INTERRUPTIBLE mode) when schedule() is called, and waits to be

> woken up by the next napi_schedule().

> That will introduce arbitrary delay for the napi->poll() to be called.

> Isn't it? Please enlighten me if I did not understand it correctly.


Probably just me not understanding the scheduler :)

> I personally prefer to directly set SCHED_THREAD bit in ____napi_schedule().

> Or stick with SCHED_BUSY_POLL solution and replace kthread_run() with

> kthread_create().


Well, I'm fine with that too, no point arguing further if I'm not
convincing anyone. But we need a fix which fixes the issue completely,
not just one of three incarnations.
Wei Wang Feb. 26, 2021, 12:16 a.m. UTC | #25
On Thu, Feb 25, 2021 at 3:00 PM Jakub Kicinski <kuba@kernel.org> wrote:
>

> On Thu, 25 Feb 2021 10:29:47 -0800 Wei Wang wrote:

> > Hmm... I don't think the above patch would work. Consider a situation that:

> > 1. At first, the kthread is in sleep mode.

> > 2. Then someone calls napi_schedule() to schedule work on this napi.

> > So ____napi_schedule() is called. But at this moment, the kthread is

> > not yet in RUNNING state. So this function does not set SCHED_THREAD

> > bit.

> > 3. Then wake_up_process() is called to wake up the thread.

> > 4. Then napi_threaded_poll() calls napi_thread_wait().

>

> But how is the task not in running state outside of napi_thread_wait()?

>

> My scheduler knowledge is rudimentary, but AFAIU off CPU tasks which

> were not put to sleep are still in RUNNING state, so unless we set

> INTERRUPTIBLE the task will be running, even if it's stuck in cond_resched().

>


I think the thread is only in RUNNING state after wake_up_process() is
called on the thread in ____napi_schedule(). Before that, it should be
in INTERRUPTIBLE state. napi_thread_wait() explicitly calls
set_current_state(TASK_INTERRUPTIBLE) when it finishes 1 round of
polling.

> > woken is false

> > and SCHED_THREAD bit is not set. So the kthread will go to sleep again

> > (in INTERRUPTIBLE mode) when schedule() is called, and waits to be

> > woken up by the next napi_schedule().

> > That will introduce arbitrary delay for the napi->poll() to be called.

> > Isn't it? Please enlighten me if I did not understand it correctly.

>

> Probably just me not understanding the scheduler :)

>

> > I personally prefer to directly set SCHED_THREAD bit in ____napi_schedule().

> > Or stick with SCHED_BUSY_POLL solution and replace kthread_run() with

> > kthread_create().

>

> Well, I'm fine with that too, no point arguing further if I'm not

> convincing anyone. But we need a fix which fixes the issue completely,

> not just one of three incarnations.


Alexander and Eric,
Do you guys have preference on which approach to take?
If we keep the current SCHED_BUSY_POLL patch, I think we need to
change kthread_run() to kthread_create() to address the warning Martin
reported.
Or if we choose to set SCHED_THREADED, we could keep kthread_run().
But there is 1 extra set_bit() operation.

Thanks.
Wei
Jakub Kicinski Feb. 26, 2021, 1:18 a.m. UTC | #26
On Thu, 25 Feb 2021 16:16:20 -0800 Wei Wang wrote:
> On Thu, Feb 25, 2021 at 3:00 PM Jakub Kicinski <kuba@kernel.org> wrote:

> > On Thu, 25 Feb 2021 10:29:47 -0800 Wei Wang wrote:  

> > > Hmm... I don't think the above patch would work. Consider a situation that:

> > > 1. At first, the kthread is in sleep mode.

> > > 2. Then someone calls napi_schedule() to schedule work on this napi.

> > > So ____napi_schedule() is called. But at this moment, the kthread is

> > > not yet in RUNNING state. So this function does not set SCHED_THREAD

> > > bit.

> > > 3. Then wake_up_process() is called to wake up the thread.

> > > 4. Then napi_threaded_poll() calls napi_thread_wait().  

> >

> > But how is the task not in running state outside of napi_thread_wait()?

> >

> > My scheduler knowledge is rudimentary, but AFAIU off CPU tasks which

> > were not put to sleep are still in RUNNING state, so unless we set

> > INTERRUPTIBLE the task will be running, even if it's stuck in cond_resched().

>

> I think the thread is only in RUNNING state after wake_up_process() is

> called on the thread in ____napi_schedule(). Before that, it should be

> in INTERRUPTIBLE state. napi_thread_wait() explicitly calls

> set_current_state(TASK_INTERRUPTIBLE) when it finishes 1 round of

> polling.


Are you concerned about it not being in RUNNING state after it's
spawned but before it's first parked?

> > > woken is false

> > > and SCHED_THREAD bit is not set. So the kthread will go to sleep again

> > > (in INTERRUPTIBLE mode) when schedule() is called, and waits to be

> > > woken up by the next napi_schedule().

> > > That will introduce arbitrary delay for the napi->poll() to be called.

> > > Isn't it? Please enlighten me if I did not understand it correctly.  

> >

> > Probably just me not understanding the scheduler :)

> >  

> > > I personally prefer to directly set SCHED_THREAD bit in ____napi_schedule().

> > > Or stick with SCHED_BUSY_POLL solution and replace kthread_run() with

> > > kthread_create().  

> >

> > Well, I'm fine with that too, no point arguing further if I'm not

> > convincing anyone. But we need a fix which fixes the issue completely,

> > not just one of three incarnations.  

> 

> Alexander and Eric,

> Do you guys have preference on which approach to take?

> If we keep the current SCHED_BUSY_POLL patch, I think we need to

> change kthread_run() to kthread_create() to address the warning Martin

> reported.

> Or if we choose to set SCHED_THREADED, we could keep kthread_run().

> But there is 1 extra set_bit() operation.


To be clear extra set_bit() only if thread is running, which if IRQ
coalescing works should be rather rare.
Wei Wang Feb. 26, 2021, 1:49 a.m. UTC | #27
On Thu, Feb 25, 2021 at 5:19 PM Jakub Kicinski <kuba@kernel.org> wrote:
>

> On Thu, 25 Feb 2021 16:16:20 -0800 Wei Wang wrote:

> > On Thu, Feb 25, 2021 at 3:00 PM Jakub Kicinski <kuba@kernel.org> wrote:

> > > On Thu, 25 Feb 2021 10:29:47 -0800 Wei Wang wrote:

> > > > Hmm... I don't think the above patch would work. Consider a situation that:

> > > > 1. At first, the kthread is in sleep mode.

> > > > 2. Then someone calls napi_schedule() to schedule work on this napi.

> > > > So ____napi_schedule() is called. But at this moment, the kthread is

> > > > not yet in RUNNING state. So this function does not set SCHED_THREAD

> > > > bit.

> > > > 3. Then wake_up_process() is called to wake up the thread.

> > > > 4. Then napi_threaded_poll() calls napi_thread_wait().

> > >

> > > But how is the task not in running state outside of napi_thread_wait()?

> > >

> > > My scheduler knowledge is rudimentary, but AFAIU off CPU tasks which

> > > were not put to sleep are still in RUNNING state, so unless we set

> > > INTERRUPTIBLE the task will be running, even if it's stuck in cond_resched().

> >

> > I think the thread is only in RUNNING state after wake_up_process() is

> > called on the thread in ____napi_schedule(). Before that, it should be

> > in INTERRUPTIBLE state. napi_thread_wait() explicitly calls

> > set_current_state(TASK_INTERRUPTIBLE) when it finishes 1 round of

> > polling.

>

> Are you concerned about it not being in RUNNING state after it's

> spawned but before it's first parked?


I think it's fine. As long as ____napi_schedule() calls
wake_up_process() to wake up the thread when in threaded mode, I think
that serves the purpose.

>

> > > > woken is false

> > > > and SCHED_THREAD bit is not set. So the kthread will go to sleep again

> > > > (in INTERRUPTIBLE mode) when schedule() is called, and waits to be

> > > > woken up by the next napi_schedule().

> > > > That will introduce arbitrary delay for the napi->poll() to be called.

> > > > Isn't it? Please enlighten me if I did not understand it correctly.

> > >

> > > Probably just me not understanding the scheduler :)

> > >

> > > > I personally prefer to directly set SCHED_THREAD bit in ____napi_schedule().

> > > > Or stick with SCHED_BUSY_POLL solution and replace kthread_run() with

> > > > kthread_create().

> > >

> > > Well, I'm fine with that too, no point arguing further if I'm not

> > > convincing anyone. But we need a fix which fixes the issue completely,

> > > not just one of three incarnations.

> >

> > Alexander and Eric,

> > Do you guys have preference on which approach to take?

> > If we keep the current SCHED_BUSY_POLL patch, I think we need to

> > change kthread_run() to kthread_create() to address the warning Martin

> > reported.

> > Or if we choose to set SCHED_THREADED, we could keep kthread_run().

> > But there is 1 extra set_bit() operation.

>

> To be clear extra set_bit() only if thread is running, which if IRQ

> coalescing works should be rather rare.
Alexander Duyck Feb. 26, 2021, 3:52 a.m. UTC | #28
On Thu, Feb 25, 2021 at 5:20 PM Jakub Kicinski <kuba@kernel.org> wrote:
>

> On Thu, 25 Feb 2021 16:16:20 -0800 Wei Wang wrote:

> > On Thu, Feb 25, 2021 at 3:00 PM Jakub Kicinski <kuba@kernel.org> wrote:

> > > On Thu, 25 Feb 2021 10:29:47 -0800 Wei Wang wrote:

> > > > Hmm... I don't think the above patch would work. Consider a situation that:

> > > > 1. At first, the kthread is in sleep mode.

> > > > 2. Then someone calls napi_schedule() to schedule work on this napi.

> > > > So ____napi_schedule() is called. But at this moment, the kthread is

> > > > not yet in RUNNING state. So this function does not set SCHED_THREAD

> > > > bit.

> > > > 3. Then wake_up_process() is called to wake up the thread.

> > > > 4. Then napi_threaded_poll() calls napi_thread_wait().

> > >

> > > But how is the task not in running state outside of napi_thread_wait()?

> > >

> > > My scheduler knowledge is rudimentary, but AFAIU off CPU tasks which

> > > were not put to sleep are still in RUNNING state, so unless we set

> > > INTERRUPTIBLE the task will be running, even if it's stuck in cond_resched().

> >

> > I think the thread is only in RUNNING state after wake_up_process() is

> > called on the thread in ____napi_schedule(). Before that, it should be

> > in INTERRUPTIBLE state. napi_thread_wait() explicitly calls

> > set_current_state(TASK_INTERRUPTIBLE) when it finishes 1 round of

> > polling.

>

> Are you concerned about it not being in RUNNING state after it's

> spawned but before it's first parked?

>

> > > > woken is false

> > > > and SCHED_THREAD bit is not set. So the kthread will go to sleep again

> > > > (in INTERRUPTIBLE mode) when schedule() is called, and waits to be

> > > > woken up by the next napi_schedule().

> > > > That will introduce arbitrary delay for the napi->poll() to be called.

> > > > Isn't it? Please enlighten me if I did not understand it correctly.

> > >

> > > Probably just me not understanding the scheduler :)

> > >

> > > > I personally prefer to directly set SCHED_THREAD bit in ____napi_schedule().

> > > > Or stick with SCHED_BUSY_POLL solution and replace kthread_run() with

> > > > kthread_create().

> > >

> > > Well, I'm fine with that too, no point arguing further if I'm not

> > > convincing anyone. But we need a fix which fixes the issue completely,

> > > not just one of three incarnations.

> >

> > Alexander and Eric,

> > Do you guys have preference on which approach to take?

> > If we keep the current SCHED_BUSY_POLL patch, I think we need to

> > change kthread_run() to kthread_create() to address the warning Martin

> > reported.

> > Or if we choose to set SCHED_THREADED, we could keep kthread_run().

> > But there is 1 extra set_bit() operation.

>

> To be clear extra set_bit() only if thread is running, which if IRQ

> coalescing works should be rather rare.


I was good with either approach. My preference would be to probably
use kthread_create regardless as it doesn't make much sense to have
the thread running until we really need it anyway.
Wei Wang Feb. 26, 2021, 6:28 p.m. UTC | #29
Hi Martin,
Could you help try the following new patch on your setup and let me
know if there are still issues?

Thanks.
Wei

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ddf4cfc12615..9ed0f89ccdd5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -357,9 +357,10 @@ enum {
        NAPI_STATE_NPSVC,               /* Netpoll - don't dequeue from poll_list */
        NAPI_STATE_LISTED,              /* NAPI added to system lists */
        NAPI_STATE_NO_BUSY_POLL,        /* Do not add in napi_hash, no busy polling */
-       NAPI_STATE_IN_BUSY_POLL,        /* sk_busy_loop() owns this NAPI */
+       NAPI_STATE_IN_BUSY_POLL,        /* sk_busy_loop() grabs SHED bit and could busy poll */
        NAPI_STATE_PREFER_BUSY_POLL,    /* prefer busy-polling over softirq processing*/
        NAPI_STATE_THREADED,            /* The poll is performed inside its own thread*/
+       NAPI_STATE_SCHED_BUSY_POLL,     /* Napi is currently scheduled in busy poll mode */
 };

 enum {
@@ -372,6 +373,7 @@ enum {
        NAPIF_STATE_IN_BUSY_POLL        = BIT(NAPI_STATE_IN_BUSY_POLL),
        NAPIF_STATE_PREFER_BUSY_POLL    = BIT(NAPI_STATE_PREFER_BUSY_POLL),
        NAPIF_STATE_THREADED            = BIT(NAPI_STATE_THREADED),
+       NAPIF_STATE_SCHED_BUSY_POLL     = BIT(NAPI_STATE_SCHED_BUSY_POLL),
 };

 enum gro_result {
diff --git a/net/core/dev.c b/net/core/dev.c
index 6c5967e80132..c717b67ce137 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1501,15 +1501,14 @@ static int napi_kthread_create(struct napi_struct *n)
 {
        int err = 0;

-       /* Create and wake up the kthread once to put it in
-        * TASK_INTERRUPTIBLE mode to avoid the blocked task
-        * warning and work with loadavg.
+       /* Avoid using  kthread_run() here to prevent race
+        * between softirq and kthread polling.
         */
-       n->thread = kthread_run(napi_threaded_poll, n, "napi/%s-%d",
-                               n->dev->name, n->napi_id);
+       n->thread = kthread_create(napi_threaded_poll, n, "napi/%s-%d",
+                                  n->dev->name, n->napi_id);
        if (IS_ERR(n->thread)) {
                err = PTR_ERR(n->thread);
-               pr_err("kthread_run failed with err %d\n", err);
+               pr_err("kthread_create failed with err %d\n", err);
                n->thread = NULL;
        }

@@ -6486,6 +6485,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
                WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));

                new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED |
+                             NAPIF_STATE_SCHED_BUSY_POLL |
                              NAPIF_STATE_PREFER_BUSY_POLL);

                /* If STATE_MISSED was set, leave STATE_SCHED set,
@@ -6525,6 +6525,7 @@ static struct napi_struct *napi_by_id(unsigned int napi_id)

 static void __busy_poll_stop(struct napi_struct *napi, bool skip_schedule)
 {
+       clear_bit(NAPI_STATE_SCHED_BUSY_POLL, &napi->state);
        if (!skip_schedule) {
                gro_normal_list(napi);
                __napi_schedule(napi);
@@ -6624,7 +6625,8 @@ void napi_busy_loop(unsigned int napi_id,
                        }
                        if (cmpxchg(&napi->state, val,
                                    val | NAPIF_STATE_IN_BUSY_POLL |
-                                         NAPIF_STATE_SCHED) != val) {
+                                         NAPIF_STATE_SCHED |
+                                         NAPIF_STATE_SCHED_BUSY_POLL) != val) {
                                if (prefer_busy_poll)
                                        set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
                                goto count;
@@ -6971,7 +6973,10 @@ static int napi_thread_wait(struct napi_struct *napi)
        set_current_state(TASK_INTERRUPTIBLE);

        while (!kthread_should_stop() && !napi_disable_pending(napi)) {
-               if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
+               unsigned long val = READ_ONCE(napi->state);
+
+               if (val & NAPIF_STATE_SCHED &&
+                   !(val & NAPIF_STATE_SCHED_BUSY_POLL)) {
                        WARN_ON(!list_empty(&napi->poll_list));
                        __set_current_state(TASK_RUNNING);
                        return 0;

On Thu, Feb 25, 2021 at 7:52 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>

> On Thu, Feb 25, 2021 at 5:20 PM Jakub Kicinski <kuba@kernel.org> wrote:

> >

> > On Thu, 25 Feb 2021 16:16:20 -0800 Wei Wang wrote:

> > > On Thu, Feb 25, 2021 at 3:00 PM Jakub Kicinski <kuba@kernel.org> wrote:

> > > > On Thu, 25 Feb 2021 10:29:47 -0800 Wei Wang wrote:

> > > > > Hmm... I don't think the above patch would work. Consider a situation that:

> > > > > 1. At first, the kthread is in sleep mode.

> > > > > 2. Then someone calls napi_schedule() to schedule work on this napi.

> > > > > So ____napi_schedule() is called. But at this moment, the kthread is

> > > > > not yet in RUNNING state. So this function does not set SCHED_THREAD

> > > > > bit.

> > > > > 3. Then wake_up_process() is called to wake up the thread.

> > > > > 4. Then napi_threaded_poll() calls napi_thread_wait().

> > > >

> > > > But how is the task not in running state outside of napi_thread_wait()?

> > > >

> > > > My scheduler knowledge is rudimentary, but AFAIU off CPU tasks which

> > > > were not put to sleep are still in RUNNING state, so unless we set

> > > > INTERRUPTIBLE the task will be running, even if it's stuck in cond_resched().

> > >

> > > I think the thread is only in RUNNING state after wake_up_process() is

> > > called on the thread in ____napi_schedule(). Before that, it should be

> > > in INTERRUPTIBLE state. napi_thread_wait() explicitly calls

> > > set_current_state(TASK_INTERRUPTIBLE) when it finishes 1 round of

> > > polling.

> >

> > Are you concerned about it not being in RUNNING state after it's

> > spawned but before it's first parked?

> >

> > > > > woken is false

> > > > > and SCHED_THREAD bit is not set. So the kthread will go to sleep again

> > > > > (in INTERRUPTIBLE mode) when schedule() is called, and waits to be

> > > > > woken up by the next napi_schedule().

> > > > > That will introduce arbitrary delay for the napi->poll() to be called.

> > > > > Isn't it? Please enlighten me if I did not understand it correctly.

> > > >

> > > > Probably just me not understanding the scheduler :)

> > > >

> > > > > I personally prefer to directly set SCHED_THREAD bit in ____napi_schedule().

> > > > > Or stick with SCHED_BUSY_POLL solution and replace kthread_run() with

> > > > > kthread_create().

> > > >

> > > > Well, I'm fine with that too, no point arguing further if I'm not

> > > > convincing anyone. But we need a fix which fixes the issue completely,

> > > > not just one of three incarnations.

> > >

> > > Alexander and Eric,

> > > Do you guys have preference on which approach to take?

> > > If we keep the current SCHED_BUSY_POLL patch, I think we need to

> > > change kthread_run() to kthread_create() to address the warning Martin

> > > reported.

> > > Or if we choose to set SCHED_THREADED, we could keep kthread_run().

> > > But there is 1 extra set_bit() operation.

> >

> > To be clear extra set_bit() only if thread is running, which if IRQ

> > coalescing works should be rather rare.

>

> I was good with either approach. My preference would be to probably

> use kthread_create regardless as it doesn't make much sense to have

> the thread running until we really need it anyway.
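(For reference: kthread_run() is just kthread_create() plus an immediate
wake-up, so the only behavioral difference is whether the task starts out
runnable. Roughly, from include/linux/kthread.h of this era:

#define kthread_run(threadfn, data, namefmt, ...)			   \
({									   \
	struct task_struct *__k						   \
		= kthread_create(threadfn, data, namefmt, ## __VA_ARGS__); \
	if (!IS_ERR(__k))						   \
		wake_up_process(__k);					   \
	__k;								   \
})

With plain kthread_create() the new task stays asleep until the first
explicit wake_up_process(), i.e. until ____napi_schedule() wakes it.)
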
Jakub Kicinski Feb. 26, 2021, 9:35 p.m. UTC | #30
On Fri, 26 Feb 2021 10:28:00 -0800 Wei Wang wrote:
> Hi Martin,

> Could you help try the following new patch on your setup and let me

> know if there are still issues?


FWIW your email got line wrapped for me.

> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h

> index ddf4cfc12615..9ed0f89ccdd5 100644

> --- a/include/linux/netdevice.h

> +++ b/include/linux/netdevice.h

> @@ -357,9 +357,10 @@ enum {

>         NAPI_STATE_NPSVC,               /* Netpoll - don't dequeue

> from poll_list */

>         NAPI_STATE_LISTED,              /* NAPI added to system lists */

>         NAPI_STATE_NO_BUSY_POLL,        /* Do not add in napi_hash, no

> busy polling */

> -       NAPI_STATE_IN_BUSY_POLL,        /* sk_busy_loop() owns this NAPI */

> +       NAPI_STATE_IN_BUSY_POLL,        /* sk_busy_loop() grabs SHED


nit: SHED -> SCHED

> bit and could busy poll */

>         NAPI_STATE_PREFER_BUSY_POLL,    /* prefer busy-polling over

> softirq processing*/

>         NAPI_STATE_THREADED,            /* The poll is performed

> inside its own thread*/

> +       NAPI_STATE_SCHED_BUSY_POLL,     /* Napi is currently scheduled

> in busy poll mode */


nit: Napi -> NAPI ?

>  };

> 

>  enum {

> @@ -372,6 +373,7 @@ enum {

>         NAPIF_STATE_IN_BUSY_POLL        = BIT(NAPI_STATE_IN_BUSY_POLL),

>         NAPIF_STATE_PREFER_BUSY_POLL    = BIT(NAPI_STATE_PREFER_BUSY_POLL),

>         NAPIF_STATE_THREADED            = BIT(NAPI_STATE_THREADED),

> +       NAPIF_STATE_SCHED_BUSY_POLL     = BIT(NAPI_STATE_SCHED_BUSY_POLL),

>  };

> 

>  enum gro_result {

> diff --git a/net/core/dev.c b/net/core/dev.c

> index 6c5967e80132..c717b67ce137 100644

> --- a/net/core/dev.c

> +++ b/net/core/dev.c

> @@ -1501,15 +1501,14 @@ static int napi_kthread_create(struct napi_struct *n)

>  {

>         int err = 0;

> 

> -       /* Create and wake up the kthread once to put it in

> -        * TASK_INTERRUPTIBLE mode to avoid the blocked task

> -        * warning and work with loadavg.

> +       /* Avoid using  kthread_run() here to prevent race

> +        * between softirq and kthread polling.

>          */

> -       n->thread = kthread_run(napi_threaded_poll, n, "napi/%s-%d",

> -                               n->dev->name, n->napi_id);

> +       n->thread = kthread_create(napi_threaded_poll, n, "napi/%s-%d",

> +                                  n->dev->name, n->napi_id);


I'm not sure this takes care of rapid:

dev_set_threaded(0)
 # NAPI gets sent to sirq
dev_set_threaded(1)

since subsequent set_threaded(1) doesn't spawn the thread "afresh".

>         if (IS_ERR(n->thread)) {

>                 err = PTR_ERR(n->thread);

> -               pr_err("kthread_run failed with err %d\n", err);

> +               pr_err("kthread_create failed with err %d\n", err);

>                 n->thread = NULL;

>         }

> 

> @@ -6486,6 +6485,7 @@ bool napi_complete_done(struct napi_struct *n,

> int work_done)

>                 WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));

> 

>                 new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED |

> +                             NAPIF_STATE_SCHED_BUSY_POLL |

>                               NAPIF_STATE_PREFER_BUSY_POLL);

> 

>                 /* If STATE_MISSED was set, leave STATE_SCHED set,

> @@ -6525,6 +6525,7 @@ static struct napi_struct *napi_by_id(unsigned

> int napi_id)

> 

>  static void __busy_poll_stop(struct napi_struct *napi, bool skip_schedule)

>  {

> +       clear_bit(NAPI_STATE_SCHED_BUSY_POLL, &napi->state);

>         if (!skip_schedule) {

>                 gro_normal_list(napi);

>                 __napi_schedule(napi);

> @@ -6624,7 +6625,8 @@ void napi_busy_loop(unsigned int napi_id,

>                         }

>                         if (cmpxchg(&napi->state, val,

>                                     val | NAPIF_STATE_IN_BUSY_POLL |

> -                                         NAPIF_STATE_SCHED) != val) {

> +                                         NAPIF_STATE_SCHED |

> +                                         NAPIF_STATE_SCHED_BUSY_POLL) != val) {

>                                 if (prefer_busy_poll)

> 

> set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);

>                                 goto count;

> @@ -6971,7 +6973,10 @@ static int napi_thread_wait(struct napi_struct *napi)

>         set_current_state(TASK_INTERRUPTIBLE);

> 

>         while (!kthread_should_stop() && !napi_disable_pending(napi)) {

> -               if (test_bit(NAPI_STATE_SCHED, &napi->state)) {

> +               unsigned long val = READ_ONCE(napi->state);

> +

> +               if (val & NAPIF_STATE_SCHED &&

> +                   !(val & NAPIF_STATE_SCHED_BUSY_POLL)) {


Again, not protected from the napi_disable() case AFAICT.

>                         WARN_ON(!list_empty(&napi->poll_list));

>                         __set_current_state(TASK_RUNNING);

>                         return 0;
Wei Wang Feb. 26, 2021, 10:24 p.m. UTC | #31
On Fri, Feb 26, 2021 at 1:35 PM Jakub Kicinski <kuba@kernel.org> wrote:
>

> On Fri, 26 Feb 2021 10:28:00 -0800 Wei Wang wrote:

> > Hi Martin,

> > Could you help try the following new patch on your setup and let me

> > know if there are still issues?

>

> FWIW your email got line wrapped for me.

>

> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h

> > index ddf4cfc12615..9ed0f89ccdd5 100644

> > --- a/include/linux/netdevice.h

> > +++ b/include/linux/netdevice.h

> > @@ -357,9 +357,10 @@ enum {

> >         NAPI_STATE_NPSVC,               /* Netpoll - don't dequeue

> > from poll_list */

> >         NAPI_STATE_LISTED,              /* NAPI added to system lists */

> >         NAPI_STATE_NO_BUSY_POLL,        /* Do not add in napi_hash, no

> > busy polling */

> > -       NAPI_STATE_IN_BUSY_POLL,        /* sk_busy_loop() owns this NAPI */

> > +       NAPI_STATE_IN_BUSY_POLL,        /* sk_busy_loop() grabs SHED

>

> nit: SHED -> SCHED

Ack.

>

> > bit and could busy poll */

> >         NAPI_STATE_PREFER_BUSY_POLL,    /* prefer busy-polling over

> > softirq processing*/

> >         NAPI_STATE_THREADED,            /* The poll is performed

> > inside its own thread*/

> > +       NAPI_STATE_SCHED_BUSY_POLL,     /* Napi is currently scheduled

> > in busy poll mode */

>

> nit: Napi -> NAPI ?

Ack.

>

> >  };

> >

> >  enum {

> > @@ -372,6 +373,7 @@ enum {

> >         NAPIF_STATE_IN_BUSY_POLL        = BIT(NAPI_STATE_IN_BUSY_POLL),

> >         NAPIF_STATE_PREFER_BUSY_POLL    = BIT(NAPI_STATE_PREFER_BUSY_POLL),

> >         NAPIF_STATE_THREADED            = BIT(NAPI_STATE_THREADED),

> > +       NAPIF_STATE_SCHED_BUSY_POLL     = BIT(NAPI_STATE_SCHED_BUSY_POLL),

> >  };

> >

> >  enum gro_result {

> > diff --git a/net/core/dev.c b/net/core/dev.c

> > index 6c5967e80132..c717b67ce137 100644

> > --- a/net/core/dev.c

> > +++ b/net/core/dev.c

> > @@ -1501,15 +1501,14 @@ static int napi_kthread_create(struct napi_struct *n)

> >  {

> >         int err = 0;

> >

> > -       /* Create and wake up the kthread once to put it in

> > -        * TASK_INTERRUPTIBLE mode to avoid the blocked task

> > -        * warning and work with loadavg.

> > +       /* Avoid using  kthread_run() here to prevent race

> > +        * between softirq and kthread polling.

> >          */

> > -       n->thread = kthread_run(napi_threaded_poll, n, "napi/%s-%d",

> > -                               n->dev->name, n->napi_id);

> > +       n->thread = kthread_create(napi_threaded_poll, n, "napi/%s-%d",

> > +                                  n->dev->name, n->napi_id);

>

> I'm not sure this takes care of rapid:

>

> dev_set_threaded(0)

>  # NAPI gets sent to sirq

> dev_set_threaded(1)

>

> since subsequent set_threaded(1) doesn't spawn the thread "afresh".

>


I think the race between softirq and kthread could be purely dependent
on the SCHED bit. In napi_schedule_prep(), we check if SCHED bit is
set. And we only call ____napi_schedule() when SCHED bit is not set.
In ____napi_schedule(), we either wake up kthread, or raise softirq,
never both.
So as long as we don't wake up the kthread when creating it, there
should not be a chance of race between softirq and kthread.
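
A condensed sketch of the dispatch logic described here, abridged from
net/core/dev.c of this era (softnet plumbing elided): the winner of
napi_schedule_prep()'s cmpxchg on the SCHED bit hands the NAPI to
exactly one poller.

static inline void ____napi_schedule(struct softnet_data *sd,
				     struct napi_struct *napi)
{
	struct task_struct *thread;

	if (test_bit(NAPI_STATE_THREADED, &napi->state)) {
		thread = READ_ONCE(napi->thread);
		if (thread) {
			/* kthread path: wake the poller thread... */
			wake_up_process(thread);
			return;	/* ...and never also raise the softirq */
		}
	}

	/* softirq path: queue on this CPU's poll list */
	list_add_tail(&napi->poll_list, &sd->poll_list);
	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
}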

> > [...]

> > @@ -6971,7 +6973,10 @@ static int napi_thread_wait(struct napi_struct *napi)

> >         set_current_state(TASK_INTERRUPTIBLE);

> >

> >         while (!kthread_should_stop() && !napi_disable_pending(napi)) {

> > -               if (test_bit(NAPI_STATE_SCHED, &napi->state)) {

> > +               unsigned long val = READ_ONCE(napi->state);

> > +

> > +               if (val & NAPIF_STATE_SCHED &&

> > +                   !(val & NAPIF_STATE_SCHED_BUSY_POLL)) {

>

> Again, not protected from the napi_disable() case AFAICT.


Hmmm..... Yes. I think you are right. I missed that napi_disable()
also grabs the SCHED bit. In this case, I think we have to use the
SCHED_THREADED bit. The SCHED_BUSY_POLL bit is not enough to protect
the race between napi_disable() and napi_threaded_poll(). :(
Sorry, I missed this point when evaluating both solutions. I will have
to switch to use the SCHED_THREADED bit.
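
For reference, napi_disable() does take that same bit; abridged from
net/core/dev.c of this era:

void napi_disable(struct napi_struct *n)
{
	might_sleep();
	set_bit(NAPI_STATE_DISABLE, &n->state);

	/* Ownership is claimed via the very SCHED bit that
	 * ____napi_schedule() and napi_busy_loop() compete for, so a
	 * bare SCHED test in the kthread cannot tell "scheduled for
	 * the thread" apart from "being disabled".
	 */
	while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
		msleep(1);
	while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
		msleep(1);

	hrtimer_cancel(&n->timer);

	clear_bit(NAPI_STATE_PREFER_BUSY_POLL, &n->state);
	clear_bit(NAPI_STATE_DISABLE, &n->state);
}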


Wei Wang Feb. 26, 2021, 10:37 p.m. UTC | #32
On Fri, Feb 26, 2021 at 2:36 PM Martin Zaharinov <micron10@gmail.com> wrote:
>
> [...]
>
> should I wait with the test

> when you fix this?

>

Yes. Please. Sorry for the confusion.

Jakub Kicinski Feb. 26, 2021, 11:10 p.m. UTC | #33
On Fri, 26 Feb 2021 14:24:29 -0800 Wei Wang wrote:
> > I'm not sure this takes care of rapid:

> >

> > dev_set_threaded(0)

> >  # NAPI gets sent to sirq

> > dev_set_threaded(1)

> >

> > since subsequent set_threaded(1) doesn't spawn the thread "afresh".

> 

> I think the race between softirq and kthread could be purely dependent

> on the SCHED bit. In napi_schedule_prep(), we check if SCHED bit is

> set. And we only call ____napi_schedule() when SCHED bit is not set.

> In ____napi_schedule(), we either wake up kthread, or raise softirq,

> never both.

> So as long as we don't wake up the kthread when creating it, there

> should not be a chance of race between softirq and kthread.


But we don't destroy the thread when dev_set_threaded(0) is called, or
make sure that it gets parked; we just clear NAPI_STATE_THREADED and
that's it.

The thread could be running long after NAPI_STATE_THREADED was cleared,
and long after it gave up NAPI_STATE_SCHED. E.g. if some heavy sirq
processing kicks in at the very moment we reenable BH.
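
One hypothetical interleaving of the above (timing illustrative only):

/*
 *	config path			napi kthread
 *	-----------			------------
 *					napi->poll() in progress,
 *					owns SCHED
 *	dev_set_threaded(0)
 *	  clears NAPI_STATE_THREADED
 *					napi_complete_done()
 *					  releases SCHED
 *	irq -> napi_schedule()
 *	  THREADED now clear, so the
 *	  NAPI goes to NET_RX_SOFTIRQ
 *					heavy sirq work on BH reenable
 *					delays the thread's sleep; on
 *					its next loop it sees SCHED set
 *					(owned by the sirq!) and polls
 *					concurrently
 *	dev_set_threaded(1)
 *	  existing thread reused, not
 *	  parked or spawned afresh
 */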

> > >         while (!kthread_should_stop() && !napi_disable_pending(napi)) {

> > > -               if (test_bit(NAPI_STATE_SCHED, &napi->state)) {

> > > +               unsigned long val = READ_ONCE(napi->state);

> > > +

> > > +               if (val & NAPIF_STATE_SCHED &&

> > > +                   !(val & NAPIF_STATE_SCHED_BUSY_POLL)) {  

> >

> > Again, not protected from the napi_disable() case AFAICT.  

> 

> Hmmm..... Yes. I think you are right. I missed that napi_disable()

> also grabs the SCHED bit. In this case, I think we have to use the

> SCHED_THREADED bit. The SCHED_BUSY_POLL bit is not enough to protect

> the race between napi_disable() and napi_threaded_poll(). :(

> Sorry, I missed this point when evaluating both solutions. I will have

> to switch to use the SCHED_THREADED bit.


Alright, AFAICT SCHED_THREADED doesn't suffer either of the problems 
I brought up here.
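
For completeness, a sketch of the SCHED_THREADED direction settled on
here -- an illustration assembled from this thread, not a quote of the
eventual patch. The wake-up side records kthread ownership explicitly
(the extra set_bit() only when the thread is already running, as noted
above), and napi_thread_wait() keys off that bit plus a local 'woken'
flag instead of the shared SCHED bit:

/* in ____napi_schedule(), threaded case: */
if (READ_ONCE(thread->state) != TASK_INTERRUPTIBLE)
	set_bit(NAPI_STATE_SCHED_THREADED, &napi->state);
wake_up_process(thread);

/* in napi_thread_wait(): */
while (!kthread_should_stop() && !napi_disable_pending(napi)) {
	/* SCHED alone may be held by busy poll or napi_disable();
	 * SCHED_THREADED, or an explicit wake-up tracked via 'woken',
	 * means this thread owns the napi.
	 */
	if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state) || woken) {
		WARN_ON(!list_empty(&napi->poll_list));
		__set_current_state(TASK_RUNNING);
		return 0;
	}

	schedule();
	woken = true;
	set_current_state(TASK_INTERRUPTIBLE);
}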
diff mbox series

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ddf4cfc12615..9ed0f89ccdd5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -357,9 +357,10 @@  enum {
 	NAPI_STATE_NPSVC,		/* Netpoll - don't dequeue from poll_list */
 	NAPI_STATE_LISTED,		/* NAPI added to system lists */
 	NAPI_STATE_NO_BUSY_POLL,	/* Do not add in napi_hash, no busy polling */
-	NAPI_STATE_IN_BUSY_POLL,	/* sk_busy_loop() owns this NAPI */
+	NAPI_STATE_IN_BUSY_POLL,	/* sk_busy_loop() grabs SHED bit and could busy poll */
 	NAPI_STATE_PREFER_BUSY_POLL,	/* prefer busy-polling over softirq processing*/
 	NAPI_STATE_THREADED,		/* The poll is performed inside its own thread*/
+	NAPI_STATE_SCHED_BUSY_POLL,	/* Napi is currently scheduled in busy poll mode */
 };
 
 enum {
@@ -372,6 +373,7 @@  enum {
 	NAPIF_STATE_IN_BUSY_POLL	= BIT(NAPI_STATE_IN_BUSY_POLL),
 	NAPIF_STATE_PREFER_BUSY_POLL	= BIT(NAPI_STATE_PREFER_BUSY_POLL),
 	NAPIF_STATE_THREADED		= BIT(NAPI_STATE_THREADED),
+	NAPIF_STATE_SCHED_BUSY_POLL	= BIT(NAPI_STATE_SCHED_BUSY_POLL),
 };
 
 enum gro_result {
diff --git a/net/core/dev.c b/net/core/dev.c
index 6c5967e80132..ec1a30d95d8b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6486,6 +6486,7 @@  bool napi_complete_done(struct napi_struct *n, int work_done)
 		WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));
 
 		new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED |
+			      NAPIF_STATE_SCHED_BUSY_POLL |
 			      NAPIF_STATE_PREFER_BUSY_POLL);
 
 		/* If STATE_MISSED was set, leave STATE_SCHED set,
@@ -6525,6 +6526,7 @@  static struct napi_struct *napi_by_id(unsigned int napi_id)
 
 static void __busy_poll_stop(struct napi_struct *napi, bool skip_schedule)
 {
+	clear_bit(NAPI_STATE_SCHED_BUSY_POLL, &napi->state);
 	if (!skip_schedule) {
 		gro_normal_list(napi);
 		__napi_schedule(napi);
@@ -6624,7 +6626,8 @@  void napi_busy_loop(unsigned int napi_id,
 			}
 			if (cmpxchg(&napi->state, val,
 				    val | NAPIF_STATE_IN_BUSY_POLL |
-					  NAPIF_STATE_SCHED) != val) {
+					  NAPIF_STATE_SCHED |
+					  NAPIF_STATE_SCHED_BUSY_POLL) != val) {
 				if (prefer_busy_poll)
 					set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
 				goto count;
@@ -6971,7 +6974,10 @@  static int napi_thread_wait(struct napi_struct *napi)
 	set_current_state(TASK_INTERRUPTIBLE);
 
 	while (!kthread_should_stop() && !napi_disable_pending(napi)) {
-		if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
+		unsigned long val = READ_ONCE(napi->state);
+
+		if (val & NAPIF_STATE_SCHED &&
+		    !(val & NAPIF_STATE_SCHED_BUSY_POLL)) {
 			WARN_ON(!list_empty(&napi->poll_list));
 			__set_current_state(TASK_RUNNING);
 			return 0;
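
For context on the userspace side of this race: napi_busy_loop() is
typically entered because an application opted a socket into busy
polling. A minimal sketch (assumes CONFIG_NET_RX_BUSY_POLL and a
hypothetical already-open socket fd):

#include <stdio.h>
#include <sys/socket.h>

/* Ask the kernel to busy-poll this socket's NAPI for up to 'usecs'
 * microseconds on blocking receives instead of sleeping right away.
 */
static int enable_busy_poll(int fd, int usecs)
{
	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
		       &usecs, sizeof(usecs)) < 0) {
		perror("setsockopt(SO_BUSY_POLL)");
		return -1;
	}
	return 0;
}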