
[net-next,v2,0/5] implement kthread based napi poll

Message ID 20201002222514.1159492-1-weiwan@google.com

Message

Wei Wang Oct. 2, 2020, 10:25 p.m. UTC
The idea of moving napi poll processing out of softirq context into a
kernel thread based context is not new.
Paolo Abeni and Hannes Frederic Sowa proposed patches to move napi
poll to a kthread back in 2016, and Felix Fietkau proposed patches
along similar lines a few weeks ago, using a workqueue to process
napi poll.

The main reason we'd like to push forward with this idea is that the
scheduler has poor visibility into cpu cycles spent in softirq context,
and is not able to make optimal scheduling decisions for the user
threads. For example, in one of our application benchmarks where
network load is high, the CPUs handling network softirqs have ~80% cpu
utilization, yet user threads are still scheduled on those CPUs despite
other, more idle cpus being available in the system, and we see very
high tail latencies. In that case, we have to explicitly pin user
threads away from the CPUs handling network softirqs to ensure good
performance.
With napi poll moved to kthreads, the scheduler is in charge of
scheduling both the kthreads handling network load and the user
threads, and is able to make better decisions. In the previous
benchmark, if we do this and pin the kthreads processing napi poll to
specific CPUs, the scheduler is able to schedule user threads away from
these CPUs automatically.

The reason we prefer 1 kthread per napi, instead of 1 workqueue entity
per host, is that a kthread is more configurable than a workqueue: we
can leverage existing tuning tools for threads, such as taskset and
chrt, to adjust the scheduling class, cpu set, etc. Another reason is
that if we eventually want to provide a busy poll feature using kernel
threads for napi poll, a kthread seems more suitable than a workqueue.
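
To make the tuning point concrete, the sketch below is illustrative
userspace code (not part of the series) that pins a napi kthread,
identified by its PID, to a dedicated CPU using the same syscall
taskset uses; chrt maps to sched_setscheduler() in the same way. The
kthread would be located by its name, napi/<dev>-<napi_id>.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin a napi kthread to one CPU, equivalent to "taskset -pc <cpu> <pid>". */
int main(int argc, char **argv)
{
	cpu_set_t set;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <napi-kthread-pid> <cpu>\n", argv[0]);
		return 1;
	}

	CPU_ZERO(&set);
	CPU_SET(atoi(argv[2]), &set);

	/* Same syscall taskset uses to move an existing thread. */
	if (sched_setaffinity(atoi(argv[1]), sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}
	return 0;
}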

In this patch series, I revived Paolo and Hannes's patches from 2016
and kept them as the first 2 patches. On top of those are changes
proposed by Felix, Jakub, Paolo and myself, with suggestions from
Eric Dumazet.

In terms of performance, I ran tcp_rr tests with 1000 flows and
various request/response sizes, with RFS/RPS disabled, and compared
performance between softirq and kthread. The host has 56 hyperthreads
and a 100Gbps NIC.

        req/resp   QPS   50%tile    90%tile    99%tile    99.9%tile
softirq   1B/1B   2.19M   284us       987us      1.1ms      1.56ms
kthread   1B/1B   2.14M   295us       987us      1.0ms      1.17ms

softirq 5KB/5KB   1.31M   869us      1.06ms     1.28ms      2.38ms
kthread 5KB/5KB   1.32M   878us      1.06ms     1.26ms      1.66ms

softirq 1MB/1MB  10.78K   84ms       166ms      234ms       294ms
kthread 1MB/1MB  10.83K   82ms       173ms      262ms       320ms

I also ran one application benchmark where the user threads have more
work to do. We do see a good amount of tail latency reduction with the
kthread model.

Changes since v1:
Replaced kthread_create() with kthread_run() in patch 5 as suggested by
Felix Fietkau.

Changes since RFC:
Renamed the kthreads to be napi/<dev>-<napi_id> in patch 5 as suggested
by Hannes Frederic Sowa.

Paolo Abeni (2):
  net: implement threaded-able napi poll loop support
  net: add sysfs attribute to control napi threaded mode
Felix Fietkau (1):
  net: extract napi poll functionality to __napi_poll()
Jakub Kicinski (1):
  net: modify kthread handler to use __napi_poll()
Wei Wang (1):
  net: improve napi threaded config

 include/linux/netdevice.h |   5 ++
 net/core/dev.c            | 143 +++++++++++++++++++++++++++++++++++---
 net/core/net-sysfs.c      | 100 ++++++++++++++++++++++++++
 3 files changed, 239 insertions(+), 9 deletions(-)

Comments

David Laight Oct. 3, 2020, 9:57 a.m. UTC | #1
From: Wei Wang
> Sent: 02 October 2020 23:25
>
> The idea of moving the napi poll process out of softirq context to a
> kernel thread based context is not new.
> Paolo Abeni and Hannes Frederic Sowa have proposed patches to move napi
> poll to kthread back in 2016. And Felix Fietkau has also proposed
> patches of similar ideas to use workqueue to process napi poll just a
> few weeks ago.

What default scheduler priority are you planning to use?

The current 'softint' is (effectively) slightly higher priority
than the highest RT priority.

I think you need to use a 'middle' priority RT process so that
applications can decide whether they need to be higher/lower
priority than the network code.
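
For illustration only, such a 'middle' RT priority could be applied to
a napi kthread with chrt, which boils down to the hypothetical
userspace sketch below (priority 50 is just an example midpoint of the
1..99 SCHED_FIFO range, not anything the series defines):

#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Equivalent of "chrt -f -p 50 <pid>": put the napi kthread in the
 * middle of the RT range so applications can sit above or below it.
 */
int main(int argc, char **argv)
{
	struct sched_param sp = { .sched_priority = 50 };

	if (argc != 2) {
		fprintf(stderr, "usage: %s <napi-kthread-pid>\n", argv[0]);
		return 1;
	}

	if (sched_setscheduler(atoi(argv[1]), SCHED_FIFO, &sp)) {
		perror("sched_setscheduler");
		return 1;
	}
	return 0;
}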

But then you hit the problem that the scheduler gives RT
processes a very 'sticky' cpu affinity.
IIRC they don't ever get 'stolen' by an idle cpu, so only
migrate when the scheduler for the cpu they last ran on
decides to run something of a higher priority.
This is problematic if a low priority process is looping
in kernel space somewhere (without a cond_resched()).
(I've been running ftrace...)

Given that the napi cpu cycles have to happen sometime,
the biggest problem I found with the current softint
implementation is that a hardware interrupt can happen
while an application is holding a (user space) mutex.
This will block other application threads from acquiring
the mutex until not only the hardware interrupt
completes, but also all the associated softint (typically
napi and rcu) processing has completed.
This can take a while!
Moving the 'softint' processing to a separate thread
will allow the interrupted process to release the mutex
so that the other application threads can continue.

I guess the downside of using a thread is that the
data needed is likely to be in the wrong cache.

	David

David Laight Oct. 3, 2020, 10:56 a.m. UTC | #2
From: Wei Wang
> Sent: 02 October 2020 23:25
>
> The idea of moving the napi poll process out of softirq context to a
> kernel thread based context is not new.
> Paolo Abeni and Hannes Frederic Sowa have proposed patches to move napi
> poll to kthread back in 2016. And Felix Fietkau has also proposed
> patches of similar ideas to use workqueue to process napi poll just a
> few weeks ago.

I didn't spot anything that makes this continue to work?

static inline bool netdev_xmit_more(void)
{
        return __this_cpu_read(softnet_data.xmit.more);
}

I assume it normally relies on the softint code running with
pre-emption disabled.

(It also needs a level of indirection.
xmit.more is only set if more packets are queued when the tx
call is done.
I've seen a workload that manages to repeatedly add an extra
packet while the tx setup is in progress.)
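
One way this can keep working is for the kthread to run the actual
poll with bottom halves disabled, which also keeps it non-preemptible
on that CPU, so per-CPU softnet_data accesses such as xmit.more behave
as they do under the softirq today. Below is a rough sketch
paraphrasing the approach of patches 1 and 4, not the exact patch
code; napi_thread_wait() and the __napi_poll() signature are assumed
here:

static int napi_threaded_poll(void *data)
{
	struct napi_struct *napi = data;
	bool repoll;

	/* assumed helper: sleep until the irq handler hands this
	 * napi instance over to the kthread
	 */
	while (!napi_thread_wait(napi)) {
		do {
			repoll = false;

			local_bh_disable();
			/* per-CPU state (softnet_data, xmit.more) is
			 * stable while BHs are off on this CPU
			 */
			__napi_poll(napi, &repoll);
			local_bh_enable();

			cond_resched();
		} while (repoll);
	}
	return 0;
}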

	David
