[RFC,net-next,0/3] sock: Fix sock queue mapping to include device

Message ID 20201021194743.781583-1-harshitha.ramamurthy@intel.com

Message

Harshitha Ramamurthy Oct. 21, 2020, 7:47 p.m. UTC
In XPS, the transmit queue selected for a packet is saved in the associated
sock for the packet and is then used to avoid recalculating the queue
on subsequent sends. The problem is that the corresponding device is not
also recorded, so when the queue mapping is referenced it may
correspond to a different device than the sending one, resulting in an
incorrect queue being used for transmit. Particularly with xps_rxqs, this
can lead to non-deterministic behaviour as illustrated below.

Consider a case where xps_rxqs is configured and the number of Tx and Rx
queues differs. Suppose we have two devices, A and B. Device A has queues
0-7 and Device B has queues 0-15. Packets are transmitted from Device A
but received on Device B. For packets received on queues 0-7 of Device B,
xps_rxqs is applied and reply packets are transmitted on Device A's
queues 0-7. However, when packets are received on queues 8-15 of
Device B, normal XPS is used for replies transmitted from Device A. This
leads to non-deterministic behaviour. The case where there are fewer
receive queues is even more insidious. Consider Device A, the
transmitting device, with queues 0-15 and Device B, the receiver, with
queues 0-7. With xps_rxqs enabled, packets will be received only on
queues 0-7 of Device B, but sent only on queues 0-7 of Device A, thereby
causing a load imbalance.

This patch set fixes the issue by recording both the device (via
ifindex) and the queue in the sock mapping. The pair is set and
retrieved atomically. While retrieving the queue using the get
functions, we check if the ifindex held is the same as the ifindex
stored before returning the queue held. For instance during transmit,
we return a valid queue number only after checking if the ifindex stored
matches the device currently held.

This patch set contains:
	- Definition of dev_and_queue structure to hold the ifindex
	  and queue number
	- Generic functions to get, set, and clear dev_and_queue
	  structure
	- Change sk_tx_queue_{get,set,clear} to
	  sk_tx_dev_and_queue_{get,set,clear}
	- Modify callers of above to use new interface
	- Change sk_rx_queue_{get,set,clear} to
	  sk_rx_dev_and_queue_{get,set,clear}
	- Modify callers of above to use new interface

This patch set was tested as follows:
	- XPS with both xps_cpus and xps_rxqs works as expected
	- the queue index is calculated only once per connection when
	  picking a Tx queue, for example in netdev_pick_tx

Tom Herbert (3):
  sock: Definition and general functions for dev_and_queue structure
  sock: Use dev_and_queue structure for RX queue mapping in sock
  sock: Use dev_and_queue structure for TX queue mapping in sock

 .../mellanox/mlx5/core/en_accel/ktls_rx.c     |   6 +-
 drivers/net/hyperv/netvsc_drv.c               |   4 +-
 include/net/busy_poll.h                       |   2 +-
 include/net/request_sock.h                    |   2 +-
 include/net/sock.h                            | 107 ++++++++++++------
 net/core/dev.c                                |   6 +-
 net/core/filter.c                             |   7 +-
 net/core/sock.c                               |  10 +-
 net/ipv4/tcp_input.c                          |   2 +-
 9 files changed, 93 insertions(+), 53 deletions(-)

Comments

Willem de Bruijn Oct. 22, 2020, 5:38 p.m. UTC | #1
On Wed, Oct 21, 2020 at 3:51 PM Harshitha Ramamurthy
<harshitha.ramamurthy@intel.com> wrote:
>
> In XPS, the transmit queue selected for a packet is saved in the associated
> sock for the packet and is then used to avoid recalculating the queue
> on subsequent sends. The problem is that the corresponding device is not
> also recorded so that when the queue mapping is referenced it may
> correspond to a different device than the sending one, resulting in an
> incorrect queue being used for transmit. Particularly with xps_rxqs, this
> can lead to non-deterministic behaviour as illustrated below.
>
> Consider a case where xps_rxqs is configured and there is a difference
> in number of Tx and Rx queues. Suppose we have 2 devices A and B. Device A
> has 0-7 queues and device B has 0-15 queues. Packets are transmitted from
> Device A but packets are received on B. For packets received on queue 0-7
> of Device B, xps_rxqs will be applied for reply packets to transmit on
> Device A's queues 0-7. However, when packets are received on queues
> 8-15 of Device B, normal XPS is used to reply packets when transmitting
> from Device A. This leads to non-deterministic behaviour. The case where
> there are fewer receive queues is even more insidious. Consider Device
> A, the trasmitting device has queues 0-15 and Device B, the receiver
> has queues 0-7. With xps_rxqs enabled, the packets will be received only
> on queues 0-7 of Device B, but sent only on 0-7 queues of Device A
> thereby causing a load imbalance.

So the issue is limited to xps_rxqs with multiple NICs.

When do we need sk_tx_dev_and_queue_mapping (patch 3/3)? It is used in
netdev_pick_tx, but associations are reset on route change and
recomputed if queue_index would exceed the current device queue count.

> This patch set fixes the issue by recording both the device (via
> ifindex) and the queue in the sock mapping. The pair is set and
> retrieved atomically.

I guess this is the reason for the somewhat convoluted cast to u64
logic in patch 1/3. Is the assumption that 64-bit loads and stores are
atomic on all platforms? That is not correct.

Is atomicity even needed? For the purpose of load balancing it isn't.
Just adding a sk->rx_ifindex would be a lot simpler.

sk->sk_napi_id already uniquely identifies the device. Unfortunately,
dev_get_by_napi_id is not cheap (traverses a hashtable bucket). Though
purely for the purpose of load balancing this validation could be
sample based.

The rx ifindex is also already recorded for inet sockets in
rx_dst_ifindex, and the sk_rx_queue_get functions are limited to
those, so could conceivably use that. But it is derived from skb_iif,
which is overwritten with every reentry of __netif_receive_skb_core.