[WIP] net+mlx4: auto doorbell

Message ID 20161130123810.581ba21f@redhat.com
State New

Commit Message

Jesper Dangaard Brouer Nov. 30, 2016, 11:38 a.m.
I've played with a somewhat similar patch (from Achiad Shochat) for
mlx5 (attached).  While it gives huge improvements, the problem I ran
into was that TX performance became a function of the TX completion
time/interrupt and could easily be throttled if configured too
high/slow.

Can your patch be affected by this too?

Adjustable via:
 ethtool -C mlx5p2 tx-usecs 16 tx-frames 32
 

On Mon, 28 Nov 2016 22:58:36 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
> not used.
> 
> lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt 
> lpaa23:~# sar -n DEV 1 10|grep eth0
[...]
> Average:         eth0      9.50 5707925.60      0.99 585285.69      0.00      0.00      0.50
> lpaa23:~# echo 1 >/sys/class/net/eth0/doorbell_opt 
> lpaa23:~# sar -n DEV 1 10|grep eth0
[...]
> Average:         eth0      2.40 9985214.90      0.31 1023874.60      0.00      0.00      0.50

This +75% number is from pktgen without "burst", and definitely shows that
your patch activates xmit_more.
What is the pps performance number when using pktgen "burst" option?


> And about 11 % improvement on a mono-flow UDP_STREAM test.
> 
> skb_set_owner_w() is now the most consuming function.
> 
> 
> lpaa23:~# ./udpsnd -4 -H 10.246.7.152 -d 2 &
> [1] 13696
> lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
[...]
> Average:         eth0      9.00 1307380.50      1.00 308356.18      0.00      0.00      0.50
> lpaa23:~# echo 3 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
[...]
> Average:         eth0      3.10 1459558.20      0.44 344267.57      0.00      0.00      0.50

The +11% number seems consistent with my perf observations that approx
12% was "fakely" spent on the xmit spin_lock.


[...]
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index 4b597dca5c52..affebb435679 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
[...]
> -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
>  {
> -	return ring->prod - ring->cons > ring->full_size;
> +	return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
>  }
[...]

> @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>  	}
>  	send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
>  
> +	/* Doorbell avoidance : We can omit doorbell if we know a TX completion
> +	 * will happen shortly.
> +	 */
> +	if (send_doorbell &&
> +	    dev->doorbell_opt &&
> +	    (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)

It would be nice to have a function call with an appropriate name instead
of an open-coded queue size check.  I'm also confused by the "ncons" name.

> +		send_doorbell = false;
> +

[...]

> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 574bcbb1b38f..c3fd0deda198 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
>  	 */
>  	u32			last_nr_txbb;
>  	u32			cons;
> +	u32			ncons;

Maybe we can find a better name than "ncons"?

>  	unsigned long		wake_queue;
>  	struct netdev_queue	*tx_queue;
>  	u32			(*free_tx_desc)(struct mlx4_en_priv *priv,
> @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
>  
>  	/* cache line used and dirtied in mlx4_en_xmit() */
>  	u32			prod ____cacheline_aligned_in_smp;
> +	u32			prod_bell;

Good descriptive variable name.

>  	unsigned int		tx_dropped;
>  	unsigned long		bytes;
>  	unsigned long		packets;


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

Comments

Eric Dumazet Nov. 30, 2016, 3:56 p.m. | #1
On Wed, 2016-11-30 at 12:38 +0100, Jesper Dangaard Brouer wrote:
> I've played with a somewhat similar patch (from Achiad Shochat) for
> mlx5 (attached).  While it gives huge improvements, the problem I ran
> into was that TX performance became a function of the TX completion
> time/interrupt and could easily be throttled if configured too
> high/slow.
> 
> Can your patch be affected by this too?

Like all TX business, you should know this, Jesper.

No need to constantly remind us of something very well known.

> Adjustable via:
>  ethtool -C mlx5p2 tx-usecs 16 tx-frames 32

Generally speaking, these settings impact TX throughput/latencies.

Since NIC IRQs then trigger softirqs, we already know that IRQ handling
is critical, and some tuning can help, depending on the particular load.

Now the trick is when you want a generic kernel, being deployed on hosts
shared by millions of jobs.


> On Mon, 28 Nov 2016 22:58:36 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
> > not used.
> > 
> > lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt 
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average:         eth0      9.50 5707925.60      0.99 585285.69      0.00      0.00      0.50
> > lpaa23:~# echo 1 >/sys/class/net/eth0/doorbell_opt 
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average:         eth0      2.40 9985214.90      0.31 1023874.60      0.00      0.00      0.50
> 
> This +75% number is from pktgen without "burst", and definitely shows that
> your patch activates xmit_more.
> What is the pps performance number when using pktgen "burst" option?

About the same really. Almost all packets now get the xmit_more effect.

> 
> > And about 11 % improvement on a mono-flow UDP_STREAM test.
> > 
> > skb_set_owner_w() is now the most consuming function.
> > 
> > 
> > lpaa23:~# ./udpsnd -4 -H 10.246.7.152 -d 2 &
> > [1] 13696
> > lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average:         eth0      9.00 1307380.50      1.00 308356.18      0.00      0.00      0.50
> > lpaa23:~# echo 3 >/sys/class/net/eth0/doorbell_opt
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average:         eth0      3.10 1459558.20      0.44 344267.57      0.00      0.00      0.50
> 
> The +11% number seems consistent with my perf observations that approx
> 12% was "fakely" spent on the xmit spin_lock.
> 
> [...]

> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > index 4b597dca5c52..affebb435679 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> [...]
> > -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> > +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
> >  {
> > -	return ring->prod - ring->cons > ring->full_size;
> > +	return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
> >  }
> [...]
> 
> > @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> >  	}
> >  	send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
> >  
> > +	/* Doorbell avoidance : We can omit doorbell if we know a TX completion
> > +	 * will happen shortly.
> > +	 */
> > +	if (send_doorbell &&
> > +	    dev->doorbell_opt &&
> > +	    (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
> 
> It would be nice to have a function call with an appropriate name instead
> of an open-coded queue size check.  I'm also confused by the "ncons" name.
> 
> > +		send_doorbell = false;
> > +
> [...]

> 
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > index 574bcbb1b38f..c3fd0deda198 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
> >  	 */
> >  	u32			last_nr_txbb;
> >  	u32			cons;
> > +	u32			ncons;
> 
> Maybe we can find a better name than "ncons"?

That's because 'cons' in this driver really means 'old cons',
and new cons = old cons + last_nr_txbb;


> >  	unsigned long		wake_queue;
> >  	struct netdev_queue	*tx_queue;
> >  	u32			(*free_tx_desc)(struct mlx4_en_priv *priv,
> > @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
> >  
> >  	/* cache line used and dirtied in mlx4_en_xmit() */
> >  	u32			prod ____cacheline_aligned_in_smp;
> > +	u32			prod_bell;
> 
> Good descriptive variable name.
Jesper Dangaard Brouer Nov. 30, 2016, 7:17 p.m. | #2
On Wed, 30 Nov 2016 07:56:26 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2016-11-30 at 12:38 +0100, Jesper Dangaard Brouer wrote:
> > I've played with a somewhat similar patch (from Achiad Shochat) for
> > mlx5 (attached).  While it gives huge improvements, the problem I ran
> > into was that TX performance became a function of the TX completion
> > time/interrupt and could easily be throttled if configured too
> > high/slow.
> > 
> > Can your patch be affected by this too?
> 
> Like all TX business, you should know this, Jesper.
> No need to constantly remind us of something very well known.

Don't take it as critique, Eric.  I was hoping your patch would have
solved this issue of being sensitive to TX completion adjustments.  You
usually have good solutions for difficult issues. I basically rejected
Achiad's approach/patch because it was too sensitive to these kinds of
adjustments.

> > On Mon, 28 Nov 2016 22:58:36 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
[...]
> > 
> > This +75% number is from pktgen without "burst", and definitely shows that
> > your patch activates xmit_more.
> > What is the pps performance number when using pktgen "burst" option?
> 
> About the same really. Almost all packets now get the xmit_more effect.

Perfect!

> > [...]
> > > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > > index 4b597dca5c52..affebb435679 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > [...]
> > > -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> > > +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
> > >  {
> > > -	return ring->prod - ring->cons > ring->full_size;
> > > +	return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
> > >  }
> > [...]
> > 
> > > @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> > >  	}
> > >  	send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
> > >  
> > > +	/* Doorbell avoidance : We can omit doorbell if we know a TX completion
> > > +	 * will happen shortly.
> > > +	 */
> > > +	if (send_doorbell &&
> > > +	    dev->doorbell_opt &&
> > > +	    (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
> > 
> > It would be nice to have a function call with an appropriate name instead
> > of an open-coded queue size check.  I'm also confused by the "ncons" name.
> > 
> > > +		send_doorbell = false;
> > > +
> > [...]
> > 

> > > diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > > index 574bcbb1b38f..c3fd0deda198 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > > +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > > @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
> > >  	 */
> > >  	u32			last_nr_txbb;
> > >  	u32			cons;
> > > +	u32			ncons;
> > 
> > Maybe we can find a better name than "ncons"?
> 
> That's because 'cons' in this driver really means 'old cons',
> and new cons = old cons + last_nr_txbb;

It was not clear to me that "n" meant "new".  It was also not clear that
this driver has the issue of "cons" (consumer) tracking the "old" cons.

  
> > >  	unsigned long		wake_queue;
> > >  	struct netdev_queue	*tx_queue;
> > >  	u32			(*free_tx_desc)(struct mlx4_en_priv *priv,
> > > @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
> > >  
> > >  	/* cache line used and dirtied in mlx4_en_xmit() */
> > >  	u32			prod ____cacheline_aligned_in_smp;
> > > +	u32			prod_bell;
> > 
> > Good descriptive variable name.



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Eric Dumazet Nov. 30, 2016, 7:30 p.m. | #3
On Wed, 2016-11-30 at 20:17 +0100, Jesper Dangaard Brouer wrote:
> Don't take it as critique, Eric.  I was hoping your patch would have
> solved this issue of being sensitive to TX completion adjustments.  You
> usually have good solutions for difficult issues. I basically rejected
> Achiad's approach/patch because it was too sensitive to these kinds of
> adjustments.

Well, this patch can hurt latencies, because a doorbell can be delayed,
and softirqs can be delayed by many hundreds of usec in some cases.

I would not enable this behavior by default.
Eric Dumazet Nov. 30, 2016, 10:40 p.m. | #4
On Wed, 2016-11-30 at 23:30 +0100, Jesper Dangaard Brouer wrote:
> On Wed, 30 Nov 2016 11:30:00 -0800
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Wed, 2016-11-30 at 20:17 +0100, Jesper Dangaard Brouer wrote:
> > 
> > > Don't take it as critique, Eric.  I was hoping your patch would have
> > > solved this issue of being sensitive to TX completion adjustments.  You
> > > usually have good solutions for difficult issues. I basically rejected
> > > Achiad's approach/patch because it was too sensitive to these kinds of
> > > adjustments.
> > 
> > Well, this patch can hurt latencies, because a doorbell can be delayed,
> > and softirqs can be delayed by many hundreds of usec in some cases.
> > 
> > I would not enable this behavior by default.
> 
> What about another scheme, where dev_hard_start_xmit() can return an
> indication that the driver chose not to flush (based on TX queue depth),
> thereby requesting the stack to call flush at a later point.
> 
> Would that introduce fewer latency issues?

Again, how is it going to change anything when your netperf UDP sends
one packet at a time?

The qdisc gets one packet, dequeues it immediately.

No pressure -> the doorbell will be sent as before.

Some TX producers do not even use a qdisc to begin with.

Please think of a solution that does not involve the qdisc layer at all.
Eric Dumazet Dec. 1, 2016, 12:27 a.m. | #5
Another issue I found during my tests over the last few days is a problem
with BQL, and more generally when the driver stops/starts the queue.

When under stress and BQL stops the queue, driver TX completion does a
lot of work, and the servicing CPU also takes over further qdisc_run().

The work-flow is :

1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
unmap things, queue skbs for freeing.

2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes); 

if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
     netif_schedule_queue(dev_queue);

This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
(in practice they have absolutely no chance to grab it).

So we end up with one cpu doing the ndo_start_xmit() and TX completions,
and RX work.

This problem is magnified when XPS is used, if one mono-threaded application deals with
thousands of TCP sockets.

We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
to allow another cpu to service the qdisc and spare us.
Tom Herbert Dec. 1, 2016, 1:16 a.m. | #6
On Wed, Nov 30, 2016 at 4:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Another issue I found during my tests last days, is a problem with BQL,
> and more generally when driver stops/starts the queue.
>
> When under stress and BQL stops the queue, driver TX completion does a
> lot of work, and servicing CPU also takes over further qdisc_run().

Hmm, hard to say if this is a problem or a feature, I think ;-). One way
to "solve" this would be to use IRQ round robin; that would spread the
load of networking across CPUs, but it gives us no additional
parallelism and reduces locality -- it's generally considered a bad
idea. The question might be: is it better to continuously ding one CPU
with all the networking work, or to try to artificially spread it out?
Note this is orthogonal to MQ also, so we should already have multiple
CPUs doing netif_schedule_queue for the queues they manage.

Do you have a test or application that shows this is causing pain?

> The work-flow is :
>
> 1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
> unmap things, queue skbs for freeing.
>
> 2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
>
> if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
>      netif_schedule_queue(dev_queue);
>
> This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
> (They absolutely have no chance to grab it)
>
> So we end up with one cpu doing the ndo_start_xmit() and TX completions,
> and RX work.
>
> This problem is magnified when XPS is used, if one mono-threaded application deals with
> thousands of TCP sockets.
>

Do you know of an application doing this? The typical way XPS and most
of the other steering mechanisms are configured assumes that workloads
tend towards a normal distribution. Such mechanisms can become
problematic under asymmetric loads, but then I would expect these are
likely dedicated servers, so the mechanisms can be tuned
accordingly. For instance, XPS can be configured to map more than one
queue to a CPU. Alternatively, IIRC Windows has some functionality to
tune networking for the load (spin up queues, reconfigure things
similar to XPS/RPS, etc.) -- that's promising up to the point that we need
a lot of heuristics and measurement; but then we lose determinism and
risk edge cases where we get completely unsatisfactory results (one of my
concerns with the recent attempt to put configuration in the kernel).

> We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
> to allow another cpu to service the qdisc and spare us.
>

Wouldn't this need to be an active operation? That is, to queue the
qdisc on another CPU's output_queue?

Tom
Eric Dumazet Dec. 1, 2016, 2:32 a.m. | #7
On Wed, 2016-11-30 at 17:16 -0800, Tom Herbert wrote:
> On Wed, Nov 30, 2016 at 4:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > Another issue I found during my tests last days, is a problem with BQL,
> > and more generally when driver stops/starts the queue.
> >
> > When under stress and BQL stops the queue, driver TX completion does a
> > lot of work, and servicing CPU also takes over further qdisc_run().
> >
[...]
> Do you have a test or application that shows this is causing pain?

Yes, just launch enough TCP senders (more than 10,000) to fully utilize
the NIC, with small messages.

super_netperf is not good for that, because you would need 10,000
processes and would spend too many cycles just dealing with an enormous
working set; you would not activate BQL.

> > This problem is magnified when XPS is used, if one mono-threaded application deals with
> > thousands of TCP sockets.
> >
> Do you know of an application doing this? The typical way XPS and most
> of the other steering mechanisms are configured assumes that workloads
> tend towards a normal distribution. [...]

We have thousands of applications, and some of them 'kind of multicast'
events to a broad number of TCP sockets.

Very often, application writers use a single thread for doing this,
when all they need is to send small packets to 10,000 sockets, and they
do not really care about doing this very fast. They also do not want to
hurt other applications sharing the NIC.

Very often, the process scheduler will also run this single thread on a
single cpu, i.e. avoiding expensive migrations if they are not needed.

The problem is that this behavior locks one TX queue for the duration of
the multicast, since XPS will force all the TX packets to go to one TX
queue.

Other flows that would share the locked CPU experience high P99
latencies.

> > We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
> > to allow another cpu to service the qdisc and spare us.
> >
> Wouldn't this need to be an active operation? That is, to queue the
> qdisc on another CPU's output_queue?

I simply suggest we try to queue the qdisc for further servicing as we
do today, from net_tx_action(), but we might use a different bit, so
that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
before we grab it from net_tx_action(), maybe 100 usec later (time to
flush all skbs queued in napi_consume_skb(), and maybe RX processing,
since most NICs handle TX completion before doing RX processing from
their napi poll handler).

Should be doable with few changes in __netif_schedule() and
net_tx_action(), plus some control paths that will need to take care of
the new bit at dismantle time, right?
Eric Dumazet Dec. 1, 2016, 2:50 a.m. | #8
On Wed, 2016-11-30 at 18:32 -0800, Eric Dumazet wrote:
> I simply suggest we try to queue the qdisc for further servicing as we
> do today, from net_tx_action(), but we might use a different bit, so
> that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
> before we grab it from net_tx_action(), maybe 100 usec later (time to
> flush all skbs queued in napi_consume_skb(), and maybe RX processing,
> since most NICs handle TX completion before doing RX processing from
> their napi poll handler).
> 
> Should be doable with few changes in __netif_schedule() and
> net_tx_action(), plus some control paths that will need to take care of
> the new bit at dismantle time, right?

Hmm... this is silly. The code already implements a different bit.

qdisc_run() seems to run more often from net_tx_action(); I have to find
out why.
Tom Herbert Dec. 1, 2016, 5:03 a.m. | #9
On Wed, Nov 30, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-11-30 at 17:16 -0800, Tom Herbert wrote:
[...]
> We have thousands of applications, and some of them 'kind of multicast'
> events to a broad number of TCP sockets.
>
> Very often, application writers use a single thread for doing this,
> when all they need is to send small packets to 10,000 sockets, and they
> do not really care about doing this very fast. They also do not want to
> hurt other applications sharing the NIC.
>
> Very often, the process scheduler will also run this single thread on a
> single cpu, i.e. avoiding expensive migrations if they are not needed.
>
> The problem is that this behavior locks one TX queue for the duration of
> the multicast, since XPS will force all the TX packets to go to one TX
> queue.
>

The fact that XPS is forcing TX packets to go over one CPU is a result
of how you've configured XPS. There are other potential
configurations that present different tradeoffs. For instance, TX
queues are plentiful nowadays, to the point that we could map a number
of queues to each CPU while respecting NUMA locality between the
sending CPU and where TX completions occur. If the set is sufficiently
large, this would also help to address device lock contention. The
other thing I'm wondering is if Willem's concept of distributing DOS
attacks in RPS might be applicable in XPS. If we could detect that a
TX queue is "under attack", maybe we could automatically back off to
distributing the load to more queues in XPS somehow.

Tom

> Other flows that would share the locked CPU experience high P99
> latencies.
[...]
Willem de Bruijn Dec. 1, 2016, 7:24 p.m. | #10
>>> > So we end up with one cpu doing the ndo_start_xmit() and TX completions,

>>> > and RX work.


This problem is somewhat tangential to the doorbell avoidance discussion.

>>> >

>>> > This problem is magnified when XPS is used, if one mono-threaded application deals with

>>> > thousands of TCP sockets.

>>> >

>> We have thousands of applications, and some of them 'kind of multicast'

>> events to a broad number of TCP sockets.

>>

>> Very often, applications writers use a single thread for doing this,

>> when all they need is to send small packets to 10,000 sockets, and they

>> do not really care of doing this very fast. They also do not want to

>> hurt other applications sharing the NIC.

>>

>> Very often, process scheduler will also run this single thread in a

>> single cpu, ie avoiding expensive migrations if they are not needed.

>>

>> Problem is this behavior locks one TX queue for the duration of the

>> multicast, since XPS will force all the TX packets to go to one TX

>> queue.

>>

> The fact that XPS is forcing TX packets to go over one CPU is a result

> of how you've configured XPS. There are other potentially

> configurations that present different tradeoffs.


Right, XPS supports multiple txqueues mappings, using skb_tx_hash
to decide among them. Unfortunately cross-cpu is more expensive
than tx + completion on the same core, so this is suboptimal for
the common case where there is no excessive load imbalance.

> For instance, TX

> queues are plentiful now days to the point that we could map a number

> of queues to each CPU while respecting NUMA locality between the

> sending CPU and where TX completions occur. If the set is sufficiently

> large this would also helps to address device lock contention. The

> other thing I'm wondering is if Willem's concepts distributing DOS

> attacks in RPS might be applicable in XPS. If we could detect that a

> TX queue is "under attack" maybe we can automatically backoff to

> distributing the load to more queues in XPS somehow.


If only targeting states of imbalance, that indeed could work. For the
10,000 socket case, instead of load balancing qdisc servicing, we
could perhaps modify tx queue selection in __netdev_pick_tx to
choose another queue if the initial choice is paused or otherwise
backlogged.
Eric Dumazet Dec. 2, 2016, 6:16 p.m. | #11
On Wed, 2016-11-30 at 18:50 -0800, Eric Dumazet wrote:
> On Wed, 2016-11-30 at 18:32 -0800, Eric Dumazet wrote:

> 
> > I simply suggest we try to queue the qdisc for further servicing as we
> > do today, from net_tx_action(), but we might use a different bit, so
> > that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
> > before we grab it from net_tx_action(), maybe 100 usec later (time to
> > flush all skbs queued in napi_consume_skb() and maybe RX processing,
> > since most NICs handle TX completion before doing RX processing from
> > their napi poll handler).
> > 
> > Should be doable with few changes in __netif_schedule() and
> > net_tx_action(), plus some control paths that will need to take care of
> > the new bit at dismantle time, right ?
> 
> Hmm... this is silly. Code already implements a different bit.
> 
> qdisc_run() seems to run more often from net_tx_action(), I have to find
> out why.


After more analysis I believe TSQ was one of the bottlenecks.

I prepared a patch series that helped my use cases.


From: Tariq Toukan <tariqt@mellanox.com>
To: Jesper Dangaard Brouer <brouer@redhat.com>,
        Achiad Shochat <achiad@mellanox.com>,
        Rana Shahout <ranas@mellanox.com>,
        Saeed Mahameed <saeedm@mellanox.com>
Subject: [PATCH] net/mlx5e: force tx skb bulking
Date: Wed, 17 Aug 2016 12:14:51 +0300
Message-Id: <1471425291-1782-1-git-send-email-tariqt@mellanox.com>

From: Achiad Shochat <achiad@mellanox.com>

To improve the SW message rate in case the HW is faster:
heuristically detect cases where the message rate is high but there
is still no skb bulking, and if so, stop the txq for a while to try
to force the bulking.

Change-Id: Icb925135e69b030943cb4666117c47d1cc04da97
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h    | 5 +++++
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 9 ++++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 74edd01..78a0661 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -394,6 +394,10 @@  enum {
 	MLX5E_SQ_STATE_TX_TIMEOUT,
 };
 
+enum {
+	MLX5E_SQ_STOP_ONCE,
+};
+
 struct mlx5e_ico_wqe_info {
 	u8  opcode;
 	u8  num_wqebbs;
@@ -403,6 +407,7 @@  struct mlx5e_sq {
 	/* data path */
 
 	/* dirtied @completion */
+	unsigned long              flags;
 	u16                        cc;
 	u32                        dma_fifo_cc;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index e073bf59..034eef0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -351,8 +351,10 @@  static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_sq *sq, struct sk_buff *skb)
 	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
 		skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
 
-	if (unlikely(!mlx5e_sq_has_room_for(sq, MLX5E_SQ_STOP_ROOM))) {
+	if (test_bit(MLX5E_SQ_STOP_ONCE, &sq->flags) ||
+	    unlikely(!mlx5e_sq_has_room_for(sq, MLX5E_SQ_STOP_ROOM))) {
 		netif_tx_stop_queue(sq->txq);
+		clear_bit(MLX5E_SQ_STOP_ONCE, &sq->flags);
 		sq->stats.stopped++;
 	}
 
@@ -429,6 +431,7 @@  bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 	u32 dma_fifo_cc;
 	u32 nbytes;
 	u16 npkts;
+	u16 ncqes;
 	u16 sqcc;
 	int i;
 
@@ -439,6 +442,7 @@  bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 
 	npkts = 0;
 	nbytes = 0;
+	ncqes = 0;
 
 	/* sq->cc must be updated only after mlx5_cqwq_update_db_record(),
 	 * otherwise a cq overrun may occur
@@ -458,6 +462,7 @@  bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 			break;
 
 		mlx5_cqwq_pop(&cq->wq);
+		ncqes++;
 
 		wqe_counter = be16_to_cpu(cqe->wqe_counter);
 
@@ -508,6 +513,8 @@  bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 
 	sq->dma_fifo_cc = dma_fifo_cc;
 	sq->cc = sqcc;
+	if ((npkts > 7) && ((npkts >> (ilog2(ncqes))) < 8))
+		set_bit(MLX5E_SQ_STOP_ONCE, &sq->flags);
 
 	netdev_tx_completed_queue(sq->txq, npkts, nbytes);
 
-- 
1.8.3.1