[WIP] net+mlx4: auto doorbell

Message ID 20161130123810.581ba21f@redhat.com
State New

Commit Message

Jesper Dangaard Brouer Nov. 30, 2016, 11:38 a.m.
I've played with a somewhat similar patch (from Achiad Shochat) for
mlx5 (attached).  While it gives huge improvements, the problem I ran
into was that TX performance became a function of the TX completion
time/interrupt and could easily be throttled if configured too
high/slow.

Can your patch be affected by this too?

Adjustable via:
 ethtool -C mlx5p2 tx-usecs 16 tx-frames 32
 

On Mon, 28 Nov 2016 22:58:36 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
> not used.
> 
> lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt 
> lpaa23:~# sar -n DEV 1 10|grep eth0
[...]
> Average:         eth0      9.50 5707925.60      0.99 585285.69      0.00      0.00      0.50
> lpaa23:~# echo 1 >/sys/class/net/eth0/doorbell_opt 
> lpaa23:~# sar -n DEV 1 10|grep eth0
[...]
> Average:         eth0      2.40 9985214.90      0.31 1023874.60      0.00      0.00      0.50

This +75% number is from pktgen without "burst", and definitely shows that
your patch activates xmit_more.
What is the pps performance number when using pktgen "burst" option?


> And about 11 % improvement on a mono-flow UDP_STREAM test.
> 
> skb_set_owner_w() is now the most consuming function.
> 
> 
> lpaa23:~# ./udpsnd -4 -H 10.246.7.152 -d 2 &
> [1] 13696
> lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
[...]
> Average:         eth0      9.00 1307380.50      1.00 308356.18      0.00      0.00      0.50
> lpaa23:~# echo 3 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
[...]
> Average:         eth0      3.10 1459558.20      0.44 344267.57      0.00      0.00      0.50

The +11% number seems consistent with my perf observations that approx
12% was "fakely" spent on the xmit spin_lock.


[...]
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index 4b597dca5c52..affebb435679 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
[...]
> -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
>  {
> -	return ring->prod - ring->cons > ring->full_size;
> +	return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
>  }
[...]

> @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>  	}
>  	send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
>  
> +	/* Doorbell avoidance : We can omit doorbell if we know a TX completion
> +	 * will happen shortly.
> +	 */
> +	if (send_doorbell &&
> +	    dev->doorbell_opt &&
> +	    (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)

It would be nice to have a function call with an appropriate name instead
of an open-coded queue size check.  I'm also confused by the "ncons" name.

> +		send_doorbell = false;
> +

[...]

> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 574bcbb1b38f..c3fd0deda198 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
>  	 */
>  	u32			last_nr_txbb;
>  	u32			cons;
> +	u32			ncons;

Maybe we can find a better name than "ncons"?

>  	unsigned long		wake_queue;
>  	struct netdev_queue	*tx_queue;
>  	u32			(*free_tx_desc)(struct mlx4_en_priv *priv,
> @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
>  
>  	/* cache line used and dirtied in mlx4_en_xmit() */
>  	u32			prod ____cacheline_aligned_in_smp;
> +	u32			prod_bell;

Good descriptive variable name.

>  	unsigned int		tx_dropped;
>  	unsigned long		bytes;
>  	unsigned long		packets;


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

Comments

Eric Dumazet Nov. 30, 2016, 3:56 p.m. | #1
On Wed, 2016-11-30 at 12:38 +0100, Jesper Dangaard Brouer wrote:
> I've played with a somewhat similar patch (from Achiad Shochat) for
> mlx5 (attached).  While it gives huge improvements, the problem I ran
> into was that TX performance became a function of the TX completion
> time/interrupt and could easily be throttled if configured too
> high/slow.
> 
> Can your patch be affected by this too?

Like all TX business, you should know this, Jesper.

No need to constantly remind us of something very well known.

> Adjustable via:
>  ethtool -C mlx5p2 tx-usecs 16 tx-frames 32

Generally speaking, these settings impact TX throughput/latencies.

Since NIC IRQs then trigger softirqs, we already know that IRQ handling
is critical, and some tuning can help, depending on the particular load.

Now the trick is when you want a generic kernel, being deployed on hosts
shared by millions of jobs.


> On Mon, 28 Nov 2016 22:58:36 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
> > not used.
> > 
> > lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt 
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average:         eth0      9.50 5707925.60      0.99 585285.69      0.00      0.00      0.50
> > lpaa23:~# echo 1 >/sys/class/net/eth0/doorbell_opt 
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average:         eth0      2.40 9985214.90      0.31 1023874.60      0.00      0.00      0.50
> 
> This +75% number is from pktgen without "burst", and definitely shows that
> your patch activates xmit_more.
> What is the pps performance number when using pktgen "burst" option?

About the same really. Almost all packets now get the xmit_more effect.

> 
> > And about 11 % improvement on a mono-flow UDP_STREAM test.
> > 
> > skb_set_owner_w() is now the most consuming function.
> > 
> > 
> > lpaa23:~# ./udpsnd -4 -H 10.246.7.152 -d 2 &
> > [1] 13696
> > lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average:         eth0      9.00 1307380.50      1.00 308356.18      0.00      0.00      0.50
> > lpaa23:~# echo 3 >/sys/class/net/eth0/doorbell_opt
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average:         eth0      3.10 1459558.20      0.44 344267.57      0.00      0.00      0.50
> 
> The +11% number seems consistent with my perf observations that approx
> 12% was "fakely" spent on the xmit spin_lock.
> 
> [...]

> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > index 4b597dca5c52..affebb435679 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> [...]
> > -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> > +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
> >  {
> > -	return ring->prod - ring->cons > ring->full_size;
> > +	return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
> >  }
> [...]
> 
> > @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> >  	}
> >  	send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
> >  
> > +	/* Doorbell avoidance : We can omit doorbell if we know a TX completion
> > +	 * will happen shortly.
> > +	 */
> > +	if (send_doorbell &&
> > +	    dev->doorbell_opt &&
> > +	    (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
> 
> It would be nice to have a function call with an appropriate name instead
> of an open-coded queue size check.  I'm also confused by the "ncons" name.
> 
> > +		send_doorbell = false;
> > +
> [...]

> 
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > index 574bcbb1b38f..c3fd0deda198 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
> >  	 */
> >  	u32			last_nr_txbb;
> >  	u32			cons;
> > +	u32			ncons;
> 
> Maybe we can find a better name than "ncons"?

That's because 'cons' in this driver really means 'old cons',
and new cons = old cons + last_nr_txbb;


> >  	unsigned long		wake_queue;
> >  	struct netdev_queue	*tx_queue;
> >  	u32			(*free_tx_desc)(struct mlx4_en_priv *priv,
> > @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
> >  
> >  	/* cache line used and dirtied in mlx4_en_xmit() */
> >  	u32			prod ____cacheline_aligned_in_smp;
> > +	u32			prod_bell;
> 
> Good descriptive variable name.
Jesper Dangaard Brouer Nov. 30, 2016, 7:17 p.m. | #2
On Wed, 30 Nov 2016 07:56:26 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2016-11-30 at 12:38 +0100, Jesper Dangaard Brouer wrote:
> > I've played with a somewhat similar patch (from Achiad Shochat) for
> > mlx5 (attached).  While it gives huge improvements, the problem I ran
> > into was that TX performance became a function of the TX completion
> > time/interrupt and could easily be throttled if configured too
> > high/slow.
> > 
> > Can your patch be affected by this too?
> 
> Like all TX business, you should know this, Jesper.
> No need to constantly remind us of something very well known.

Don't take it as critique, Eric.  I was hoping your patch would have
solved this issue of being sensitive to TX completion adjustments.  You
usually have good solutions for difficult issues. I basically rejected
Achiad's approach/patch because it was too sensitive to these kinds of
adjustments.

> > On Mon, 28 Nov 2016 22:58:36 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
[...]
> > 
> > This +75% number is from pktgen without "burst", and definitely shows that
> > your patch activates xmit_more.
> > What is the pps performance number when using pktgen "burst" option?
> 
> About the same really. Almost all packets now get the xmit_more effect.

Perfect!

> > [...]
> > > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > > index 4b597dca5c52..affebb435679 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > [...]
> > > -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> > > +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
> > >  {
> > > -	return ring->prod - ring->cons > ring->full_size;
> > > +	return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
> > >  }
> > [...]
> > 
> > > @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> > >  	}
> > >  	send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
> > >  
> > > +	/* Doorbell avoidance : We can omit doorbell if we know a TX completion
> > > +	 * will happen shortly.
> > > +	 */
> > > +	if (send_doorbell &&
> > > +	    dev->doorbell_opt &&
> > > +	    (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
> > 
> > It would be nice to have a function call with an appropriate name instead
> > of an open-coded queue size check.  I'm also confused by the "ncons" name.
> > 
> > > +		send_doorbell = false;
> > > +
> > [...]
> > 

> > > diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > > index 574bcbb1b38f..c3fd0deda198 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > > +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > > @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
> > >  	 */
> > >  	u32			last_nr_txbb;
> > >  	u32			cons;
> > > +	u32			ncons;
> > 
> > Maybe we can find a better name than "ncons"?
> 
> That's because 'cons' in this driver really means 'old cons',
> and new cons = old cons + last_nr_txbb;

It was not clear to me that "n" meant "new".  It was also not clear that
this driver has the issue of "cons" (consumer) tracking the "old" cons.

  
> > >  	unsigned long		wake_queue;
> > >  	struct netdev_queue	*tx_queue;
> > >  	u32			(*free_tx_desc)(struct mlx4_en_priv *priv,
> > > @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
> > >  
> > >  	/* cache line used and dirtied in mlx4_en_xmit() */
> > >  	u32			prod ____cacheline_aligned_in_smp;
> > > +	u32			prod_bell;
> > 
> > Good descriptive variable name.



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Eric Dumazet Nov. 30, 2016, 7:30 p.m. | #3
On Wed, 2016-11-30 at 20:17 +0100, Jesper Dangaard Brouer wrote:
> Don't take it as critique, Eric.  I was hoping your patch would have
> solved this issue of being sensitive to TX completion adjustments.  You
> usually have good solutions for difficult issues. I basically rejected
> Achiad's approach/patch because it was too sensitive to these kinds of
> adjustments.

Well, this patch can hurt latencies, because a doorbell can be delayed,
and softirqs can be delayed by many hundreds of usec in some cases.

I would not enable this behavior by default.
Eric Dumazet Nov. 30, 2016, 10:40 p.m. | #4
On Wed, 2016-11-30 at 23:30 +0100, Jesper Dangaard Brouer wrote:
> On Wed, 30 Nov 2016 11:30:00 -0800
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Wed, 2016-11-30 at 20:17 +0100, Jesper Dangaard Brouer wrote:
> > 
> > > Don't take it as critique, Eric.  I was hoping your patch would have
> > > solved this issue of being sensitive to TX completion adjustments.  You
> > > usually have good solutions for difficult issues. I basically rejected
> > > Achiad's approach/patch because it was too sensitive to these kinds of
> > > adjustments.
> > 
> > Well, this patch can hurt latencies, because a doorbell can be delayed,
> > and softirqs can be delayed by many hundreds of usec in some cases.
> > 
> > I would not enable this behavior by default.
> 
> What about another scheme, where dev_hard_start_xmit() can return an
> indication that the driver chose not to flush (based on TX queue depth),
> thereby requesting the stack to call flush at a later point.
> 
> Would that introduce fewer latency issues?

Again, how is it going to change anything when your netperf UDP sends
one packet at a time?

The qdisc gets one packet, dequeues it immediately.

No pressure -> the doorbell will be sent as before.

Some TX producers do not even use a qdisc to begin with.

Please think of a solution that does not involve the qdisc layer at all.
Eric Dumazet Dec. 1, 2016, 12:27 a.m. | #5
Another issue I found during my tests over the last few days is a problem
with BQL, and more generally when the driver stops/starts the queue.

When under stress and BQL stops the queue, driver TX completion does a
lot of work, and the servicing CPU also takes over further qdisc_run().

The work-flow is :

1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
unmap things, queue skbs for freeing.

2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes); 

if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
     netif_schedule_queue(dev_queue);

This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
(in practice they have absolutely no chance to grab it).

So we end up with one cpu doing the ndo_start_xmit() and TX completions,
and RX work.

This problem is magnified when XPS is used, if one mono-threaded application deals with
thousands of TCP sockets.

We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
to allow another cpu to service the qdisc and spare us.
Tom Herbert Dec. 1, 2016, 1:16 a.m. | #6
On Wed, Nov 30, 2016 at 4:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Another issue I found during my tests last days, is a problem with BQL,
> and more generally when driver stops/starts the queue.
>
> When under stress and BQL stops the queue, driver TX completion does a
> lot of work, and servicing CPU also takes over further qdisc_run().

Hmm, hard to say if this is a problem or a feature, I think ;-). One way
to "solve" this would be to use IRQ round robin; that would spread the
load of networking across CPUs, but it gives us no additional
parallelism and reduces locality -- it's generally considered a bad
idea. The question might be: is it better to continuously ding one CPU
with all the networking work, or to try to artificially spread it out?
Note this is orthogonal to MQ also, so we should already have multiple
CPUs doing netif_schedule_queue for the queues they manage.

Do you have a test or application that shows this is causing pain?

> The work-flow is :
>
> 1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
> unmap things, queue skbs for freeing.
>
> 2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
>
> if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
>      netif_schedule_queue(dev_queue);
>
> This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
> (They absolutely have no chance to grab it)
>
> So we end up with one cpu doing the ndo_start_xmit() and TX completions,
> and RX work.
>
> This problem is magnified when XPS is used, if one mono-threaded application deals with
> thousands of TCP sockets.
>

Do you know of an application doing this? The typical way XPS and most
of the other steering mechanisms are configured assumes that workloads
tend towards a normal distribution. Such mechanisms can become
problematic under asymmetric loads, but then I would expect these are
likely dedicated servers, so the mechanisms can be tuned
accordingly. For instance, XPS can be configured to map more than one
queue to a CPU. Alternatively, IIRC Windows has some functionality to
tune networking for the load (spin up queues, reconfigure things
similar to XPS/RPS, etc.) -- that's promising up to the point that we need
a lot of heuristics and measurement; but then we lose determinism and
risk edge cases where we get completely unsatisfactory results (one of my
concerns with the recent attempt to put configuration in the kernel).

> We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
> to allow another cpu to service the qdisc and spare us.
>

Wouldn't this need to be an active operation? That is, to queue the
qdisc on another CPU's output_queue?

Tom
Eric Dumazet Dec. 1, 2016, 2:32 a.m. | #7
On Wed, 2016-11-30 at 17:16 -0800, Tom Herbert wrote:
> On Wed, Nov 30, 2016 at 4:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > Another issue I found during my tests last days, is a problem with BQL,
> > and more generally when driver stops/starts the queue.
> >
> > When under stress and BQL stops the queue, driver TX completion does a
> > lot of work, and servicing CPU also takes over further qdisc_run().
> >
[...]
> Do you have a test or application that shows this is causing pain?

Yes, just launch enough TCP senders (more than 10,000) to fully utilize
the NIC, with small messages.

super_netperf is not good for that, because you would need 10,000
processes and would spend too many cycles just dealing with an enormous
working set; you would not activate BQL.

> > This problem is magnified when XPS is used, if one mono-threaded application deals with
> > thousands of TCP sockets.
> >
> Do you know of an application doing this? The typical way XPS and most
> of the other steering mechanisms are configured assumes that workloads
> tend towards a normal distribution. [...]

We have thousands of applications, and some of them 'kind of multicast'
events to a broad number of TCP sockets.

Very often, application writers use a single thread for doing this,
when all they need is to send small packets to 10,000 sockets, and they
do not really care about doing this very fast. They also do not want to
hurt other applications sharing the NIC.

Very often, the process scheduler will also run this single thread on a
single cpu, i.e. avoiding expensive migrations if they are not needed.

The problem is that this behavior locks one TX queue for the duration of
the multicast, since XPS will force all the TX packets to go to one TX
queue.

Other flows that would share the locked CPU experience high P99
latencies.

> > We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
> > to allow another cpu to service the qdisc and spare us.
> >
> Wouldn't this need to be an active operation? That is, to queue the
> qdisc on another CPU's output_queue?

I simply suggest we try to queue the qdisc for further servicing as we
do today, from net_tx_action(), but we might use a different bit, so
that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
before we grab it from net_tx_action(), maybe 100 usec later (time to
flush all skbs queued in napi_consume_skb(), and maybe RX processing,
since most NICs handle TX completion before doing RX processing from
their napi poll handler).

Should be doable with few changes in __netif_schedule() and
net_tx_action(), plus some control paths that will need to take care of
the new bit at dismantle time, right?
Eric Dumazet Dec. 1, 2016, 2:50 a.m. | #8
On Wed, 2016-11-30 at 18:32 -0800, Eric Dumazet wrote:
> I simply suggest we try to queue the qdisc for further servicing as we
> do today, from net_tx_action(), but we might use a different bit, so
> that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
> before we grab it from net_tx_action(), maybe 100 usec later (time to
> flush all skbs queued in napi_consume_skb(), and maybe RX processing,
> since most NICs handle TX completion before doing RX processing from
> their napi poll handler).
> 
> Should be doable with few changes in __netif_schedule() and
> net_tx_action(), plus some control paths that will need to take care of
> the new bit at dismantle time, right?

Hmm... this is silly. The code already implements a different bit.

qdisc_run() seems to run more often from net_tx_action(); I have to find
out why.
Tom Herbert Dec. 1, 2016, 5:03 a.m. | #9
On Wed, Nov 30, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-11-30 at 17:16 -0800, Tom Herbert wrote:
[...]
> We have thousands of applications, and some of them 'kind of multicast'
> events to a broad number of TCP sockets.
>
> Very often, application writers use a single thread for doing this,
> when all they need is to send small packets to 10,000 sockets, and they
> do not really care about doing this very fast. They also do not want to
> hurt other applications sharing the NIC.
>
> Very often, the process scheduler will also run this single thread on a
> single cpu, i.e. avoiding expensive migrations if they are not needed.
>
> The problem is that this behavior locks one TX queue for the duration of
> the multicast, since XPS will force all the TX packets to go to one TX
> queue.
>

The fact that XPS is forcing TX packets to go over one CPU is a result
of how you've configured XPS. There are other potential
configurations that present different tradeoffs. For instance, TX
queues are plentiful nowadays, to the point that we could map a number
of queues to each CPU while respecting NUMA locality between the
sending CPU and where TX completions occur. If the set is sufficiently
large, this would also help to address device lock contention. The
other thing I'm wondering is if Willem's concept of distributing DOS
attacks in RPS might be applicable in XPS. If we could detect that a
TX queue is "under attack", maybe we could automatically back off to
distributing the load to more queues in XPS somehow.

Tom

> Other flows that would share the locked CPU experience high P99
> latencies.
[...]
Willem de Bruijn Dec. 1, 2016, 7:24 p.m. | #10
>>> > So we end up with one cpu doing the ndo_start_xmit() and TX completions,

>>> > and RX work.


This problem is somewhat tangential to the doorbell avoidance discussion.

>>> >

>>> > This problem is magnified when XPS is used, if one mono-threaded application deals with

>>> > thousands of TCP sockets.

>>> >

>> We have thousands of applications, and some of them 'kind of multicast'

>> events to a broad number of TCP sockets.

>>

>> Very often, applications writers use a single thread for doing this,

>> when all they need is to send small packets to 10,000 sockets, and they

>> do not really care of doing this very fast. They also do not want to

>> hurt other applications sharing the NIC.

>>

>> Very often, process scheduler will also run this single thread in a

>> single cpu, ie avoiding expensive migrations if they are not needed.

>>

>> Problem is this behavior locks one TX queue for the duration of the

>> multicast, since XPS will force all the TX packets to go to one TX

>> queue.

>>

> The fact that XPS is forcing TX packets to go over one CPU is a result

> of how you've configured XPS. There are other potentially

> configurations that present different tradeoffs.


Right, XPS supports multiple txqueues mappings, using skb_tx_hash
to decide among them. Unfortunately cross-cpu is more expensive
than tx + completion on the same core, so this is suboptimal for
the common case where there is no excessive load imbalance.

> For instance, TX

> queues are plentiful now days to the point that we could map a number

> of queues to each CPU while respecting NUMA locality between the

> sending CPU and where TX completions occur. If the set is sufficiently

> large this would also helps to address device lock contention. The

> other thing I'm wondering is if Willem's concepts distributing DOS

> attacks in RPS might be applicable in XPS. If we could detect that a

> TX queue is "under attack" maybe we can automatically backoff to

> distributing the load to more queues in XPS somehow.


If only targeting states of imbalance, that indeed could work. For the
10,000 socket case, instead of load balancing qdisc servicing, we
could perhaps modify tx queue selection in __netdev_pick_tx to
choose another queue if the initial choice is paused or otherwise
backlogged.
Eric Dumazet Dec. 2, 2016, 6:16 p.m. | #11
On Wed, 2016-11-30 at 18:50 -0800, Eric Dumazet wrote:
> On Wed, 2016-11-30 at 18:32 -0800, Eric Dumazet wrote:

> 
> > I simply suggest we try to queue the qdisc for further servicing as we
> > do today, from net_tx_action(), but we might use a different bit, so
> > that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
> > before we grab it from net_tx_action(), maybe 100 usec later (time to
> > flush all skbs queued in napi_consume_skb() and maybe RX processing,
> > since most NICs handle TX completion before doing RX processing from
> > their napi poll handler).
> > 
> > Should be doable with few changes in __netif_schedule() and
> > net_tx_action(), plus some control paths that will need to take care of
> > the new bit at dismantle time, right ?
> 
> Hmm... this is silly. Code already implements a different bit.
> 
> qdisc_run() seems to run more often from net_tx_action(), I have to find
> out why.


After more analysis I believe TSQ was one of the bottlenecks.

I prepared a patch series that helped my use cases.


From: Tariq Toukan <tariqt@mellanox.com>
To: Jesper Dangaard Brouer <brouer@redhat.com>,
        Achiad Shochat <achiad@mellanox.com>,
        Rana Shahout <ranas@mellanox.com>,
        Saeed Mahameed <saeedm@mellanox.com>
Subject: [PATCH] net/mlx5e: force tx skb bulking
Date: Wed, 17 Aug 2016 12:14:51 +0300
Message-Id: <1471425291-1782-1-git-send-email-tariqt@mellanox.com>

From: Achiad Shochat <achiad@mellanox.com>

To improve the SW message rate in case the HW is faster:
heuristically detect cases where the message rate is high but there
is still no skb bulking, and if so, stop the txq for a while to try
to force the bulking.

Change-Id: Icb925135e69b030943cb4666117c47d1cc04da97
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h    | 5 +++++
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 9 ++++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 74edd01..78a0661 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -394,6 +394,10 @@  enum {
 	MLX5E_SQ_STATE_TX_TIMEOUT,
 };
 
+enum {
+	MLX5E_SQ_STOP_ONCE,
+};
+
 struct mlx5e_ico_wqe_info {
 	u8  opcode;
 	u8  num_wqebbs;
@@ -403,6 +407,7 @@  struct mlx5e_sq {
 	/* data path */
 
 	/* dirtied @completion */
+	unsigned long              flags;
 	u16                        cc;
 	u32                        dma_fifo_cc;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index e073bf59..034eef0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -351,8 +351,10 @@  static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_sq *sq, struct sk_buff *skb)
 	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
 		skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
 
-	if (unlikely(!mlx5e_sq_has_room_for(sq, MLX5E_SQ_STOP_ROOM))) {
+	if (test_bit(MLX5E_SQ_STOP_ONCE, &sq->flags) ||
+	    unlikely(!mlx5e_sq_has_room_for(sq, MLX5E_SQ_STOP_ROOM))) {
 		netif_tx_stop_queue(sq->txq);
+		clear_bit(MLX5E_SQ_STOP_ONCE, &sq->flags);
 		sq->stats.stopped++;
 	}
 
@@ -429,6 +431,7 @@  bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 	u32 dma_fifo_cc;
 	u32 nbytes;
 	u16 npkts;
+	u16 ncqes;
 	u16 sqcc;
 	int i;
 
@@ -439,6 +442,7 @@  bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 
 	npkts = 0;
 	nbytes = 0;
+	ncqes = 0;
 
 	/* sq->cc must be updated only after mlx5_cqwq_update_db_record(),
 	 * otherwise a cq overrun may occur
@@ -458,6 +462,7 @@  bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 			break;
 
 		mlx5_cqwq_pop(&cq->wq);
+		ncqes++;
 
 		wqe_counter = be16_to_cpu(cqe->wqe_counter);
 
@@ -508,6 +513,8 @@  bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 
 	sq->dma_fifo_cc = dma_fifo_cc;
 	sq->cc = sqcc;
+	if ((npkts > 7) && ((npkts >> (ilog2(ncqes))) < 8))
+		set_bit(MLX5E_SQ_STOP_ONCE, &sq->flags);
 
 	netdev_tx_completed_queue(sq->txq, npkts, nbytes);
 
-- 
1.8.3.1