[net-next,1/2] net: add a napi variant for RT-well-behaved drivers

Message ID 20210514222402.295157-1-kuba@kernel.org

Commit Message

Jakub Kicinski May 14, 2021, 10:24 p.m.
Most networking drivers use napi_schedule_irqoff() to schedule
NAPI from the hardware IRQ handler. Unfortunately, as explained in
commit 8380c81d5c4f ("net: Treat __napi_schedule_irqoff() as
__napi_schedule() on PREEMPT_RT") the current implementation
is problematic for RT.

The best solution seems to be to mark the irq handler with
IRQF_NO_THREAD, to avoid going through an irq thread just
to schedule NAPI and therefore wake up ksoftirqd.

Since analyzing the 40 callers of napi_schedule_irqoff()
to figure out which handlers are light-weight enough to
warrant IRQF_NO_THREAD seems like a larger effort, add
a new helper for drivers which set IRQF_NO_THREAD.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 include/linux/netdevice.h | 21 ++++++++++++++++-----
 net/core/dev.c            | 13 +++++++++++--
 2 files changed, 27 insertions(+), 7 deletions(-)
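
For illustration only (not part of the patch): a driver adopting the new helper would register its handler with IRQF_NO_THREAD and call napi_schedule_irq() from hardirq context. The sketch below models that shape in plain, userspace-compilable C with stubbed-out kernel types, since the real code depends on <linux/netdevice.h>; the driver name and the stub bodies are hypothetical.

```c
#include <stdbool.h>

/* Stubs standing in for kernel types/constants -- illustration only. */
#define IRQF_NO_THREAD 0x00010000
typedef enum { IRQ_NONE, IRQ_HANDLED } irqreturn_t;

struct napi_struct { bool scheduled; };

/* Model of napi_schedule_prep(): claim NAPI_STATE_SCHED and return true
 * only if we won the race, i.e. the caller must actually schedule. */
static bool napi_schedule_prep(struct napi_struct *n)
{
	if (n->scheduled)
		return false;
	n->scheduled = true;
	return true;
}

/* Model of the proposed helper: prep, then raise NET_RX_SOFTIRQ
 * (represented here by a flag on this CPU's softnet_data). */
static bool softirq_raised;
static void napi_schedule_irq(struct napi_struct *n)
{
	if (napi_schedule_prep(n))
		softirq_raised = true;
}

/* The trivial hardirq handler the series targets. In a real driver it
 * would be registered via request_irq(irq, my_drv_isr, IRQF_NO_THREAD,
 * "my_drv", priv) so it runs in hard interrupt context even on RT. */
static irqreturn_t my_drv_isr(int irq, void *data)
{
	struct napi_struct *n = data;

	napi_schedule_irq(n);
	return IRQ_HANDLED;
}
```

A second interrupt arriving before the poll routine completes fails the napi_schedule_prep() check and is a no-op, which is why the out-of-line __napi_schedule_irq() itself can skip the check.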

Comments

Thomas Gleixner May 15, 2021, 12:17 a.m. | #1
On Fri, May 14 2021 at 15:24, Jakub Kicinski wrote:
>  
> +void __napi_schedule_irq(struct napi_struct *n)
> +{
> +	____napi_schedule(this_cpu_ptr(&softnet_data), n);

Not that I have any clue, but why does this not need the
napi_schedule_prep() check?

Thanks,

        tglx
Jakub Kicinski May 15, 2021, 12:21 a.m. | #2
On Sat, 15 May 2021 02:17:50 +0200 Thomas Gleixner wrote:
> On Fri, May 14 2021 at 15:24, Jakub Kicinski wrote:
> >  
> > +void __napi_schedule_irq(struct napi_struct *n)
> > +{
> > +	____napi_schedule(this_cpu_ptr(&softnet_data), n);
> 
> Not that I have any clue, but why does this not need the
> napi_schedule_prep() check?

napi_schedule_prep() is in the non-__ version in linux/netdevice.h:

static inline void napi_schedule_irq(struct napi_struct *n)
{
	if (napi_schedule_prep(n))
		__napi_schedule_irq(n);
}
Thomas Gleixner May 15, 2021, 12:29 a.m. | #3
On Fri, May 14 2021 at 17:21, Jakub Kicinski wrote:
> On Sat, 15 May 2021 02:17:50 +0200 Thomas Gleixner wrote:
>> On Fri, May 14 2021 at 15:24, Jakub Kicinski wrote:
>> >  
>> > +void __napi_schedule_irq(struct napi_struct *n)
>> > +{
>> > +	____napi_schedule(this_cpu_ptr(&softnet_data), n);
>> 
>> Not that I have any clue, but why does this not need the
>> napi_schedule_prep() check?
>
> napi_schedule_prep() is in the non-__ version in linux/netdevice.h:
>
> static inline void napi_schedule_irq(struct napi_struct *n)
> {
> 	if (napi_schedule_prep(n))
> 		__napi_schedule_irq(n);
> }

I clearly should go to bed now :)
Thomas Gleixner May 15, 2021, 9:49 a.m. | #4
On Fri, May 14 2021 at 15:24, Jakub Kicinski wrote:
>  /**
> - *	napi_schedule_irqoff - schedule NAPI poll
> - *	@n: NAPI context
> + * napi_schedule_irq() - schedule NAPI poll from hardware IRQ
> + * @n: NAPI context
>   *
>   * Variant of napi_schedule(), assuming hard irqs are masked.
> + * Hardware interrupt handler must be marked with IRQF_NO_THREAD
> + * to safely invoke this function on CONFIG_RT=y kernels (unless
> + * it manually masks the interrupts already).
>   */
> -static inline void napi_schedule_irqoff(struct napi_struct *n)
> +static inline void napi_schedule_irq(struct napi_struct *n)
>  {
>  	if (napi_schedule_prep(n))
> -		__napi_schedule_irqoff(n);
> +		__napi_schedule_irq(n);

As this is intended for the trivial

   irqhandler()
        napi_schedule_irq(n);
        return IRQ_HANDLED;

case, wouldn't it make sense to bring napi_schedule_irq() out of line
and have the prep invocation right there?

void napi_schedule_irq(struct napi_struct *n)
{
 	if (napi_schedule_prep(n))
		____napi_schedule(this_cpu_ptr(&softnet_data), n);
}
EXPORT_SYMBOL(napi_schedule_irq);

As that spares a function call and lets the compiler be smarter about
it. I might be missing something though, but at least brain is more
awake now :)

Thanks,

        tglx
Sebastian Andrzej Siewior May 15, 2021, 11:07 a.m. | #5
On 2021-05-14 15:24:01 [-0700], Jakub Kicinski wrote:
> Most networking drivers use napi_schedule_irqoff() to schedule
> NAPI from the hardware IRQ handler. Unfortunately, as explained in
> commit 8380c81d5c4f ("net: Treat __napi_schedule_irqoff() as
> __napi_schedule() on PREEMPT_RT") the current implementation
> is problematic for RT.
> 
> The best solution seems to be to mark the irq handler with
> IRQF_NO_THREAD, to avoid going through an irq thread just
> to schedule NAPI and therefore wake up ksoftirqd.

I'm not sure whether this is the best solution.
With a threaded handler, the primary handler simply wakes the thread
and the IRQ thread (almost only) schedules NAPI by setting the matching
softirq bit. Since the threaded handler is invoked with disabled BH,
upon enabling BH again (after the threaded handler's routine completed) the
(just raised) softirq handler gets invoked. This still happens in the
context of the IRQ thread with the higher (SCHED_FIFO) thread priority.
No ksoftirqd is involved here.

One deviation from the just described flow is when the timer-tick comes
into the mix. The hardirq handler (for the timer) schedules
TIMER_SOFTIRQ. Since this softirq can not be handled at the end of the
hardirq on PREEMPT_RT, it wakes the ksoftirqd which will handle it.
Once ksoftirqd is in state TASK_RUNNING, all the softirqs which are
raised later (such as the NAPI softirq) won't be handled in the IRQ
thread but by the ksoftirqd thread.
From now on we have to hop HARDIRQ -> IRQ-THREAD -> KSOFTIRQD.
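
The context hop described above can be modelled as a toy decision function (this is an illustration of the described behaviour, not kernel code; the names are made up for the sketch):

```c
#include <stdbool.h>

/* Toy model of the flow above: on PREEMPT_RT a softirq raised from an
 * IRQ thread is handled when the thread re-enables BH -- unless
 * ksoftirqd is already TASK_RUNNING (e.g. woken for TIMER_SOFTIRQ),
 * in which case later softirqs are deferred to ksoftirqd. */

enum softirq_ctx { CTX_IRQ_THREAD, CTX_KSOFTIRQD };

static bool ksoftirqd_running;	/* woken earlier, e.g. by the timer tick */

/* Where does a NET_RX softirq raised by the IRQ thread end up running? */
static enum softirq_ctx napi_softirq_context(void)
{
	if (!ksoftirqd_running)
		/* Handled on local_bh_enable() in the IRQ thread itself,
		 * still at its SCHED_FIFO priority. */
		return CTX_IRQ_THREAD;
	/* ksoftirqd (SCHED_OTHER) already owns softirq processing:
	 * HARDIRQ -> IRQ-THREAD -> KSOFTIRQD. */
	return CTX_KSOFTIRQD;
}
```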

ksoftirqd runs as SCHED_OTHER and competes with other SCHED_OTHER tasks
for CPU resources. The IRQ-thread (which is SCHED_FIFO) was obviously
preferred. Once the ksoftirqd is running, it won't be preempted on
!PREEMPT_RT due to the implicitly disabled preemption as part of
local_bh_disable(). On PREEMPT_RT preemption may happen by a thread with
higher priority.
Once things get mangled into ksoftirqd, all thread priority is lost (for
the non-RT-veteran readers here: we had various softirq handling
strategies over the years, like "only handle the in-context raised
softirqs", just to mention one of them. It all comes with a price in
terms of bugs / duct tape. What we have now as softirq handling is very
close to what !RT does, resulting in zero patches as duct tape).

Now assume another interrupt comes in which wakes a force-threaded
handler (while ksoftirqd is preempted). Before the forced-threaded
handler is invoked, BH is disabled via local_bh_disable(). Since
ksoftirqd is preempted with BH disabled, disabling BH forces the
ksoftirqd thread to the priority of the interrupt thread (SCHED_FIFO,
prio 50 by default) due to the priority inheritance protocol. The
threaded handler will run once ksoftirqd is done which has now been
accelerated.
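
The boosting in that last scenario is plain priority inheritance; a toy model of just that mechanism (illustration only, not the kernel's rtmutex code) looks like:

```c
/* Toy priority-inheritance model of the scenario above: on PREEMPT_RT
 * the softirq "lock" is a sleeping lock, so a SCHED_FIFO irq thread
 * blocking in local_bh_disable() while ksoftirqd holds it lends
 * ksoftirqd the waiter's priority until ksoftirqd is done. */
struct task {
	int prio;		/* 0 = SCHED_OTHER; 1..99 = SCHED_FIFO */
	int boosted_prio;	/* inherited from a blocked waiter */
};

static int effective_prio(const struct task *t)
{
	return t->boosted_prio > t->prio ? t->boosted_prio : t->prio;
}

/* waiter blocks on a lock owned by owner: boost the owner if needed. */
static void pi_boost(struct task *owner, const struct task *waiter)
{
	if (effective_prio(waiter) > effective_prio(owner))
		owner->boosted_prio = effective_prio(waiter);
}
```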

Part of the problem from RT perspective is the heavy use of softirq and
the BH disabled regions which act as a BKL. I *think* having the network
driver running in a thread would be better (in terms of prioritisation).
I know, davem explained the benefits of NAPI/softirq when it comes to
routing/forwarding (incl. NET_RX/TX priority) and the part where NAPI kicks
in during heavy load (say a packet flood) and the system is still
responsive.

> Since analyzing the 40 callers of napi_schedule_irqoff()
> to figure out which handlers are light-weight enough to
> warrant IRQF_NO_THREAD seems like a larger effort, add
> a new helper for drivers which set IRQF_NO_THREAD.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Sebastian
Jakub Kicinski May 15, 2021, 8:31 p.m. | #6
On Sat, 15 May 2021 13:07:40 +0200 Sebastian Andrzej Siewior wrote:
> On 2021-05-14 15:24:01 [-0700], Jakub Kicinski wrote:
> > Most networking drivers use napi_schedule_irqoff() to schedule
> > NAPI from the hardware IRQ handler. Unfortunately, as explained in
> > commit 8380c81d5c4f ("net: Treat __napi_schedule_irqoff() as
> > __napi_schedule() on PREEMPT_RT") the current implementation
> > is problematic for RT.
> > 
> > The best solution seems to be to mark the irq handler with
> > IRQF_NO_THREAD, to avoid going through an irq thread just
> > to schedule NAPI and therefore wake up ksoftirqd.
> 
> I'm not sure whether this is the best solution.
> With a threaded handler, the primary handler simply wakes the thread
> and the IRQ thread (almost only) schedules NAPI by setting the matching
> softirq bit. Since the threaded handler is invoked with disabled BH,
> upon enabling BH again (after the threaded handler's routine completed) the
> (just raised) softirq handler gets invoked. This still happens in the
> context of the IRQ thread with the higher (SCHED_FIFO) thread priority.
> No ksoftirqd is involved here.
> 
> One deviation from the just described flow is when the timer-tick comes
> into the mix. The hardirq handler (for the timer) schedules
> TIMER_SOFTIRQ. Since this softirq can not be handled at the end of the
> hardirq on PREEMPT_RT, it wakes the ksoftirqd which will handle it.
> Once ksoftirqd is in state TASK_RUNNING, all the softirqs which are
> raised later (such as the NAPI softirq) won't be handled in the IRQ
> thread but by the ksoftirqd thread.
> From now on we have to hop HARDIRQ -> IRQ-THREAD -> KSOFTIRQD.
> 
> ksoftirqd runs as SCHED_OTHER and competes with other SCHED_OTHER tasks
> for CPU resources. The IRQ-thread (which is SCHED_FIFO) was obviously
> preferred. Once the ksoftirqd is running, it won't be preempted on
> !PREEMPT_RT due to the implicitly disabled preemption as part of
> local_bh_disable(). On PREEMPT_RT preemption may happen by a thread with
> higher priority.
> Once things get mangled into ksoftirqd, all thread priority is lost (for
> the non-RT-veteran readers here: we had various softirq handling
> strategies over the years, like "only handle the in-context raised
> softirqs", just to mention one of them. It all comes with a price in
> terms of bugs / duct tape. What we have now as softirq handling is very
> close to what !RT does, resulting in zero patches as duct tape).
> 
> Now assume another interrupt comes in which wakes a force-threaded
> handler (while ksoftirqd is preempted). Before the forced-threaded
> handler is invoked, BH is disabled via local_bh_disable(). Since
> ksoftirqd is preempted with BH disabled, disabling BH forces the
> ksoftirqd thread to the priority of the interrupt thread (SCHED_FIFO,
> prio 50 by default) due to the priority inheritance protocol. The
> threaded handler will run once ksoftirqd is done which has now been
> accelerated.

Thanks for the explanation. I'm not married to the patch, if you prefer
we can keep the status quo.

I'd think, however, that always deferring to ksoftirqd is conceptually
easier to comprehend. For power users who need networking there is
prefer-busy-poll (which allows an application to ask the kernel to service
queues when it wants to, with some minimal poll frequency guarantees)
and threaded NAPI - which some RT users have already started to adopt.

Your call.

> Part of the problem from RT perspective is the heavy use of softirq and
> the BH disabled regions which act as a BKL. I *think* having the network
> driver running in a thread would be better (in terms of prioritisation).
> I know, davem explained the benefits of NAPI/softirq when it comes to
> routing/forwarding (incl. NET_RX/TX priority) and the part where NAPI kicks
> in during heavy load (say a packet flood) and the system is still
> responsive.

Right, although with modern multi-core systems where only a subset
of cores process network Rx, things look different.
Jakub Kicinski May 15, 2021, 8:38 p.m. | #7
On Sat, 15 May 2021 11:49:02 +0200 Thomas Gleixner wrote:
> On Fri, May 14 2021 at 15:24, Jakub Kicinski wrote:
> >  /**
> > - *	napi_schedule_irqoff - schedule NAPI poll
> > - *	@n: NAPI context
> > + * napi_schedule_irq() - schedule NAPI poll from hardware IRQ
> > + * @n: NAPI context
> >   *
> >   * Variant of napi_schedule(), assuming hard irqs are masked.
> > + * Hardware interrupt handler must be marked with IRQF_NO_THREAD
> > + * to safely invoke this function on CONFIG_RT=y kernels (unless
> > + * it manually masks the interrupts already).
> >   */
> > -static inline void napi_schedule_irqoff(struct napi_struct *n)
> > +static inline void napi_schedule_irq(struct napi_struct *n)
> >  {
> >  	if (napi_schedule_prep(n))
> > -		__napi_schedule_irqoff(n);
> > +		__napi_schedule_irq(n);
> 
> As this is intended for the trivial
> 
>    irqhandler()
>         napi_schedule_irq(n);
>         return IRQ_HANDLED;
> 
> case, wouldn't it make sense to bring napi_schedule_irq() out of line
> and have the prep invocation right there?
> 
> void napi_schedule_irq(struct napi_struct *n)
> {
>  	if (napi_schedule_prep(n))
> 		____napi_schedule(this_cpu_ptr(&softnet_data), n);
> }
> EXPORT_SYMBOL(napi_schedule_irq);
> 
> As that spares a function call and lets the compiler be smarter about
> it. I might be missing something though, but at least brain is more
> awake now :)

I'm just copying the existing two handlers (reviewer's favorite
response, I know) :)

My guess is .._prep() used to be a static inline itself. I can look
at modifying all the helpers if the assembly really looks better, but based
on Sebastian's email I'm not sure we're gonna go ahead with the new
helper at all?
Thomas Gleixner May 15, 2021, 8:53 p.m. | #8
On Sat, May 15 2021 at 13:31, Jakub Kicinski wrote:
> On Sat, 15 May 2021 13:07:40 +0200 Sebastian Andrzej Siewior wrote:
>> Now assume another interrupt comes in which wakes a force-threaded
>> handler (while ksoftirqd is preempted). Before the forced-threaded
>> handler is invoked, BH is disabled via local_bh_disable(). Since
>> ksoftirqd is preempted with BH disabled, disabling BH forces the
>> ksoftirqd thread to the priority of the interrupt thread (SCHED_FIFO,
>> prio 50 by default) due to the priority inheritance protocol. The
>> threaded handler will run once ksoftirqd is done which has now been
>> accelerated.
>
> Thanks for the explanation. I'm not married to the patch, if you prefer
> we can keep the status quo.
>
> I'd think, however, that always deferring to ksoftirqd is conceptually
> easier to comprehend. For power users who need networking there is
> prefer-busy-poll (which allows an application to ask the kernel to service
> queues when it wants to, with some minimal poll frequency guarantees)
> and threaded NAPI - which some RT users have already started to adopt.
>
> Your call.
>
>> Part of the problem from RT perspective is the heavy use of softirq and
>> the BH disabled regions which act as a BKL. I *think* having the network
>> driver running in a thread would be better (in terms of prioritisation).
>> I know, davem explained the benefits of NAPI/softirq when it comes to
>> routing/forwarding (incl. NET_RX/TX priority) and the part where NAPI kicks
>> in during heavy load (say a packet flood) and the system is still
>> responsive.
>
> Right, although with modern multi-core systems where only a subset
> of cores process network Rx, things look different.

Bah, I completely forgot about that aspect. Thanks Sebastian for
bringing it up. I was too focussed on the other questions and there is
obviously the onset of Alzheimer's.

Anyway it's a tough choice to make. There are too many options to choose
from nowadays. 10 years ago running the softirq at the back of the
threaded irq handler which just scheduled NAPI was definitely a win, but
with threaded NAPI, zero copy and other things it's not that important
anymore IMO. But I might be missing something obviously.

I've cc'ed a few RT folks @RHT who might give some insight.

Thanks,

        tglx
Juri Lelli May 21, 2021, 2:11 p.m. | #9
Hi,

On 15/05/21 22:53, Thomas Gleixner wrote:
> On Sat, May 15 2021 at 13:31, Jakub Kicinski wrote:
> > On Sat, 15 May 2021 13:07:40 +0200 Sebastian Andrzej Siewior wrote:
> >> Now assume another interrupt comes in which wakes a force-threaded
> >> handler (while ksoftirqd is preempted). Before the forced-threaded
> >> handler is invoked, BH is disabled via local_bh_disable(). Since
> >> ksoftirqd is preempted with BH disabled, disabling BH forces the
> >> ksoftirqd thread to the priority of the interrupt thread (SCHED_FIFO,
> >> prio 50 by default) due to the priority inheritance protocol. The
> >> threaded handler will run once ksoftirqd is done which has now been
> >> accelerated.
> >
> > Thanks for the explanation. I'm not married to the patch, if you prefer
> > we can keep the status quo.
> >
> > I'd think, however, that always deferring to ksoftirqd is conceptually
> > easier to comprehend. For power users who need networking there is
> > prefer-busy-poll (which allows an application to ask the kernel to service
> > queues when it wants to, with some minimal poll frequency guarantees)
> > and threaded NAPI - which some RT users have already started to adopt.
> >
> > Your call.
> >
> >> Part of the problem from RT perspective is the heavy use of softirq and
> >> the BH disabled regions which act as a BKL. I *think* having the network
> >> driver running in a thread would be better (in terms of prioritisation).
> >> I know, davem explained the benefits of NAPI/softirq when it comes to
> >> routing/forwarding (incl. NET_RX/TX priority) and the part where NAPI kicks
> >> in during heavy load (say a packet flood) and the system is still
> >> responsive.
> >
> > Right, although with modern multi-core systems where only a subset
> > of cores process network Rx, things look different.
> 
> Bah, I completely forgot about that aspect. Thanks Sebastian for
> bringing it up. I was too focussed on the other questions and there is
> obviously the onset of Alzheimer's.
> 
> Anyway it's a tough choice to make. There are too many options to choose
> from nowadays. 10 years ago running the softirq at the back of the
> threaded irq handler which just scheduled NAPI was definitely a win, but
> with threaded NAPI, zero copy and other things it's not that important
> anymore IMO. But I might be missing something obviously.
> 
> I've cc'ed a few RT folks @RHT who might give some insight.

So, I asked around, but got mixed types of answers. My feeling is that an
opt-in approach, if feasible, might be useful to accommodate setups that
might indeed benefit from priority inheritance kicking in. In most cases
I'm aware of, net handling is done by housekeeping CPUs, and isolated CPUs
(running the RT workload) mostly don't enter the kernel, but there are
exceptions to that.

I'll keep trying to ping more people. :)

Thanks,
Juri

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5cbc950b34df..457e2e3ef5a5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -436,7 +436,8 @@  typedef enum rx_handler_result rx_handler_result_t;
 typedef rx_handler_result_t rx_handler_func_t(struct sk_buff **pskb);
 
 void __napi_schedule(struct napi_struct *n);
-void __napi_schedule_irqoff(struct napi_struct *n);
+void __napi_schedule_irqoff(struct napi_struct *n); /* deprecated */
+void __napi_schedule_irq(struct napi_struct *n);
 
 static inline bool napi_disable_pending(struct napi_struct *n)
 {
@@ -463,16 +464,26 @@  static inline void napi_schedule(struct napi_struct *n)
 		__napi_schedule(n);
 }
 
+/* Deprecated, use napi_schedule_irq(). */
+static inline void napi_schedule_irqoff(struct napi_struct *n)
+{
+	if (napi_schedule_prep(n))
+		__napi_schedule_irqoff(n);
+}
+
 /**
- *	napi_schedule_irqoff - schedule NAPI poll
- *	@n: NAPI context
+ * napi_schedule_irq() - schedule NAPI poll from hardware IRQ
+ * @n: NAPI context
  *
  * Variant of napi_schedule(), assuming hard irqs are masked.
+ * Hardware interrupt handler must be marked with IRQF_NO_THREAD
+ * to safely invoke this function on CONFIG_RT=y kernels (unless
+ * it manually masks the interrupts already).
  */
-static inline void napi_schedule_irqoff(struct napi_struct *n)
+static inline void napi_schedule_irq(struct napi_struct *n)
 {
 	if (napi_schedule_prep(n))
-		__napi_schedule_irqoff(n);
+		__napi_schedule_irq(n);
 }
 
 /* Try to reschedule poll. Called by dev->poll() after napi_complete().  */
diff --git a/net/core/dev.c b/net/core/dev.c
index febb23708184..2e20858b5df6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6497,20 +6497,29 @@  bool napi_schedule_prep(struct napi_struct *n)
 }
 EXPORT_SYMBOL(napi_schedule_prep);
 
+void __napi_schedule_irq(struct napi_struct *n)
+{
+	____napi_schedule(this_cpu_ptr(&softnet_data), n);
+}
+EXPORT_SYMBOL(__napi_schedule_irq);
+
 /**
  * __napi_schedule_irqoff - schedule for receive
  * @n: entry to schedule
  *
- * Variant of __napi_schedule() assuming hard irqs are masked.
+ * Legacy variant of __napi_schedule() assuming hard irqs are masked.
  *
  * On PREEMPT_RT enabled kernels this maps to __napi_schedule()
  * because the interrupt disabled assumption might not be true
  * due to force-threaded interrupts and spinlock substitution.
+ *
+ * For light weight IRQ handlers prefer use of napi_schedule_irq(),
+ * and marking IRQ handler with IRQF_NO_THREAD.
  */
 void __napi_schedule_irqoff(struct napi_struct *n)
 {
 	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
-		____napi_schedule(this_cpu_ptr(&softnet_data), n);
+		__napi_schedule_irq(n);
 	else
 		__napi_schedule(n);
 }