
[bpf-next,00/17] Clean up and document RCU-based object protection for XDP_REDIRECT

Message ID 20210609103326.278782-1-toke@redhat.com

Message

Toke Høiland-Jørgensen June 9, 2021, 10:33 a.m. UTC
During the discussion[0] of Hangbin's multicast patch series, Martin pointed out
that the lifetime of the RCU-protected map entries used by XDP_REDIRECT is by
no means obvious. I promised to look into cleaning this up, and Paul helpfully
provided some hints and a new unrcu_pointer() helper to aid in this.

This is mostly a documentation exercise, clearing up the description of the
lifetime expectations and adding __rcu annotations so sparse and lockdep can
help verify it.

Patches 1-2 are preparatory: patch 1 adds Paul's unrcu_pointer() helper (which
has already been added to his tree), and patch 2 is a small fix for
dev_get_by_index_rcu() so lockdep understands _bh-disabled access to it. Patch 3
is the main bit that adds the __rcu annotations and updates documentation
comments, and the rest are patches updating the drivers, with one patch per
distinct maintainer.

Unfortunately I don't have any hardware to test any of the driver patches;
Jesper helpfully verified that it doesn't break anything on i40e, but the rest
of the driver patches are only compile-tested.

[0] https://lore.kernel.org/bpf/20210415173551.7ma4slcbqeyiba2r@kafai-mbp.dhcp.thefacebook.com/

Paul E. McKenney (1):
  rcu: Create an unrcu_pointer() to remove __rcu from a pointer

Toke Høiland-Jørgensen (16):
  bpf: allow RCU-protected lookups to happen from bh context
  dev: add rcu_read_lock_bh_held() as a valid check when getting a RCU
    dev ref
  xdp: add proper __rcu annotations to redirect map entries
  ena: remove rcu_read_lock() around XDP program invocation
  bnxt: remove rcu_read_lock() around XDP program invocation
  thunderx: remove rcu_read_lock() around XDP program invocation
  freescale: remove rcu_read_lock() around XDP program invocation
  net: intel: remove rcu_read_lock() around XDP program invocation
  marvell: remove rcu_read_lock() around XDP program invocation
  mlx4: remove rcu_read_lock() around XDP program invocation
  nfp: remove rcu_read_lock() around XDP program invocation
  qede: remove rcu_read_lock() around XDP program invocation
  sfc: remove rcu_read_lock() around XDP program invocation
  netsec: remove rcu_read_lock() around XDP program invocation
  stmmac: remove rcu_read_lock() around XDP program invocation
  net: ti: remove rcu_read_lock() around XDP program invocation

 drivers/net/ethernet/amazon/ena/ena_netdev.c  |  3 --
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |  2 -
 .../net/ethernet/cavium/thunder/nicvf_main.c  |  2 -
 .../net/ethernet/freescale/dpaa/dpaa_eth.c    |  8 +--
 .../net/ethernet/freescale/dpaa2/dpaa2-eth.c  |  3 --
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  2 -
 drivers/net/ethernet/intel/i40e/i40e_xsk.c    |  6 +--
 drivers/net/ethernet/intel/ice/ice_txrx.c     |  6 +--
 drivers/net/ethernet/intel/ice/ice_xsk.c      |  6 +--
 drivers/net/ethernet/intel/igb/igb_main.c     |  2 -
 drivers/net/ethernet/intel/igc/igc_main.c     |  7 +--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  2 -
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  |  6 +--
 .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |  2 -
 drivers/net/ethernet/marvell/mvneta.c         |  2 -
 .../net/ethernet/marvell/mvpp2/mvpp2_main.c   |  4 --
 drivers/net/ethernet/mellanox/mlx4/en_rx.c    |  8 +--
 .../ethernet/netronome/nfp/nfp_net_common.c   |  2 -
 drivers/net/ethernet/qlogic/qede/qede_fp.c    |  6 ---
 drivers/net/ethernet/sfc/rx.c                 |  9 +---
 drivers/net/ethernet/socionext/netsec.c       |  3 --
 .../net/ethernet/stmicro/stmmac/stmmac_main.c | 10 +---
 drivers/net/ethernet/ti/cpsw_priv.c           | 10 +---
 include/linux/rcupdate.h                      | 14 +++++
 include/net/xdp_sock.h                        |  2 +-
 kernel/bpf/cpumap.c                           | 14 +++--
 kernel/bpf/devmap.c                           | 52 ++++++++-----------
 kernel/bpf/hashtab.c                          | 21 +++++---
 kernel/bpf/helpers.c                          |  6 +--
 kernel/bpf/lpm_trie.c                         |  6 ++-
 net/core/dev.c                                |  2 +-
 net/core/filter.c                             | 28 ++++++++++
 net/xdp/xsk.c                                 |  4 +-
 net/xdp/xsk.h                                 |  4 +-
 net/xdp/xskmap.c                              | 29 ++++++-----
 35 files changed, 134 insertions(+), 159 deletions(-)

Comments

Paul E. McKenney June 9, 2021, 1:57 p.m. UTC | #1
On Wed, Jun 09, 2021 at 12:33:14PM +0200, Toke Høiland-Jørgensen wrote:
> The ena driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP
> program invocations. However, the actual lifetime of the objects referred
> by the XDP program invocation is longer, all the way through to the call to
> xdp_do_flush(), making the scope of the rcu_read_lock() too small. This
> turns out to be harmless because it all happens in a single NAPI poll
> cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock()
> misleading.
> 
> Rather than extend the scope of the rcu_read_lock(), just get rid of it
> entirely. With the addition of RCU annotations to the XDP_REDIRECT map
> types that take bh execution into account, lockdep even understands this to
> be safe, so there's really no reason to keep it around.

It might be worth adding a comment, perhaps where the rcu_read_lock()
used to be, stating what the protection is.  Maybe something like this?

	/*
	 * This code is invoked within a single NAPI poll cycle
	 * and thus under local_bh_disable(), which provides the
	 * needed RCU protection.
	 */

							Thanx, Paul

> Cc: Guy Tzalik <gtzalik@amazon.com>
> Cc: Saeed Bishara <saeedb@amazon.com>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
>  drivers/net/ethernet/amazon/ena/ena_netdev.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> index 881f88754bf6..a4378b14af4c 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> @@ -385,7 +385,6 @@ static int ena_xdp_execute(struct ena_ring *rx_ring, struct xdp_buff *xdp)
>  	u64 *xdp_stat;
>  	int qid;
>  
> -	rcu_read_lock();
>  	xdp_prog = READ_ONCE(rx_ring->xdp_bpf_prog);
>  
>  	if (!xdp_prog)
> @@ -443,8 +442,6 @@ static int ena_xdp_execute(struct ena_ring *rx_ring, struct xdp_buff *xdp)
>  
>  	ena_increase_stat(xdp_stat, 1, &rx_ring->syncp);
>  out:
> -	rcu_read_unlock();
> -
>  	return verdict;
>  }
>  
> -- 
> 2.31.1
>
Grygorii Strashko June 9, 2021, 5:04 p.m. UTC | #2
On 09/06/2021 13:33, Toke Høiland-Jørgensen wrote:
> The cpsw driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP
> program invocations. However, the actual lifetime of the objects referred
> by the XDP program invocation is longer, all the way through to the call to
> xdp_do_flush(), making the scope of the rcu_read_lock() too small. This
> turns out to be harmless because it all happens in a single NAPI poll
> cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock()
> misleading.
> 
> Rather than extend the scope of the rcu_read_lock(), just get rid of it
> entirely. With the addition of RCU annotations to the XDP_REDIRECT map
> types that take bh execution into account, lockdep even understands this to
> be safe, so there's really no reason to keep it around.
> 
> Cc: Grygorii Strashko <grygorii.strashko@ti.com>
> Cc: linux-omap@vger.kernel.org
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
>   drivers/net/ethernet/ti/cpsw_priv.c | 10 ++--------
>   1 file changed, 2 insertions(+), 8 deletions(-)

Tested-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Yonghong Song June 10, 2021, 12:18 a.m. UTC | #3
On 6/9/21 3:33 AM, Toke Høiland-Jørgensen wrote:
> During the discussion[0] of Hangbin's multicast patch series, Martin pointed out
> that the lifetime of the RCU-protected  map entries used by XDP_REDIRECT is by
> no means obvious. I promised to look into cleaning this up, and Paul helpfully
> provided some hints and a new unrcu_pointer() helper to aid in this.
> 
> This is mostly a documentation exercise, clearing up the description of the
> lifetime expectations and adding __rcu annotations so sparse and lockdep can
> help verify it.
> 
> Patches 1-2 are prepatory: Patch 1 adds Paul's unrcu_pointer() helper (which has
> already been added to his tree) and patch 2 is a small fix for
> dev_get_by_index_rcu() so lockdep understands _bh-disabled access to it. Patch 3
> is the main bit that adds the __rcu annotations and updates documentation
> comments, and the rest are patches updating the drivers, with one patch per
> distinct maintainer.
> 
> Unfortunately I don't have any hardware to test any of the driver patches;
> Jesper helpfully verified that it doesn't break anything on i40e, but the rest
> of the driver patches are only compile-tested.
> 
> [0] https://lore.kernel.org/bpf/20210415173551.7ma4slcbqeyiba2r@kafai-mbp.dhcp.thefacebook.com/
> 
> Paul E. McKenney (1):
>    rcu: Create an unrcu_pointer() to remove __rcu from a pointer
> 
> Toke Høiland-Jørgensen (16):
>    bpf: allow RCU-protected lookups to happen from bh context
>    dev: add rcu_read_lock_bh_held() as a valid check when getting a RCU
>      dev ref
>    xdp: add proper __rcu annotations to redirect map entries
>    ena: remove rcu_read_lock() around XDP program invocation
>    bnxt: remove rcu_read_lock() around XDP program invocation
>    thunderx: remove rcu_read_lock() around XDP program invocation
>    freescale: remove rcu_read_lock() around XDP program invocation
>    net: intel: remove rcu_read_lock() around XDP program invocation
>    marvell: remove rcu_read_lock() around XDP program invocation
>    mlx4: remove rcu_read_lock() around XDP program invocation
>    nfp: remove rcu_read_lock() around XDP program invocation
>    qede: remove rcu_read_lock() around XDP program invocation
>    sfc: remove rcu_read_lock() around XDP program invocation
>    netsec: remove rcu_read_lock() around XDP program invocation
>    stmmac: remove rcu_read_lock() around XDP program invocation
>    net: ti: remove rcu_read_lock() around XDP program invocation
> 
>   drivers/net/ethernet/amazon/ena/ena_netdev.c  |  3 --
>   drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |  2 -
>   .../net/ethernet/cavium/thunder/nicvf_main.c  |  2 -
>   .../net/ethernet/freescale/dpaa/dpaa_eth.c    |  8 +--
>   .../net/ethernet/freescale/dpaa2/dpaa2-eth.c  |  3 --
>   drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  2 -
>   drivers/net/ethernet/intel/i40e/i40e_xsk.c    |  6 +--
>   drivers/net/ethernet/intel/ice/ice_txrx.c     |  6 +--
>   drivers/net/ethernet/intel/ice/ice_xsk.c      |  6 +--
>   drivers/net/ethernet/intel/igb/igb_main.c     |  2 -
>   drivers/net/ethernet/intel/igc/igc_main.c     |  7 +--
>   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  2 -
>   drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  |  6 +--
>   .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |  2 -
>   drivers/net/ethernet/marvell/mvneta.c         |  2 -
>   .../net/ethernet/marvell/mvpp2/mvpp2_main.c   |  4 --
>   drivers/net/ethernet/mellanox/mlx4/en_rx.c    |  8 +--
>   .../ethernet/netronome/nfp/nfp_net_common.c   |  2 -
>   drivers/net/ethernet/qlogic/qede/qede_fp.c    |  6 ---
>   drivers/net/ethernet/sfc/rx.c                 |  9 +---
>   drivers/net/ethernet/socionext/netsec.c       |  3 --
>   .../net/ethernet/stmicro/stmmac/stmmac_main.c | 10 +---
>   drivers/net/ethernet/ti/cpsw_priv.c           | 10 +---
>   include/linux/rcupdate.h                      | 14 +++++
>   include/net/xdp_sock.h                        |  2 +-
>   kernel/bpf/cpumap.c                           | 14 +++--
>   kernel/bpf/devmap.c                           | 52 ++++++++-----------
>   kernel/bpf/hashtab.c                          | 21 +++++---
>   kernel/bpf/helpers.c                          |  6 +--
>   kernel/bpf/lpm_trie.c                         |  6 ++-
>   net/core/dev.c                                |  2 +-
>   net/core/filter.c                             | 28 ++++++++++
>   net/xdp/xsk.c                                 |  4 +-
>   net/xdp/xsk.h                                 |  4 +-
>   net/xdp/xskmap.c                              | 29 ++++++-----
>   35 files changed, 134 insertions(+), 159 deletions(-)

Martin, could you help review this patch set? You had participated
in early discussions related to this patch. Thanks!
Ilias Apalodimas June 10, 2021, 5:30 a.m. UTC | #4
On Wed, 9 Jun 2021 at 13:33, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> The netsec driver has a rcu_read_lock()/rcu_read_unlock() pair around the
> full RX loop, covering everything up to and including xdp_do_flush(). This
> is actually the correct behaviour, but because it all happens in a single
> NAPI poll cycle (and thus under local_bh_disable()), it is also technically
> redundant.
>
> With the addition of RCU annotations to the XDP_REDIRECT map types that
> take bh execution into account, lockdep even understands this to be safe,
> so there's really no reason to keep the rcu_read_lock() around anymore, so
> let's just remove it.
>
> Cc: Jassi Brar <jaswinder.singh@linaro.org>
> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
>  drivers/net/ethernet/socionext/netsec.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c
> index dfc85cc68173..20d148c019d8 100644
> --- a/drivers/net/ethernet/socionext/netsec.c
> +++ b/drivers/net/ethernet/socionext/netsec.c
> @@ -958,7 +958,6 @@ static int netsec_process_rx(struct netsec_priv *priv, int budget)
>
>         xdp_init_buff(&xdp, PAGE_SIZE, &dring->xdp_rxq);
>
> -       rcu_read_lock();
>         xdp_prog = READ_ONCE(priv->xdp_prog);
>         dma_dir = page_pool_get_dma_dir(dring->page_pool);
>
> @@ -1069,8 +1068,6 @@ static int netsec_process_rx(struct netsec_priv *priv, int budget)
>         }
>         netsec_finalize_xdp_rx(priv, xdp_act, xdp_xmit);
>
> -       rcu_read_unlock();
> -
>         return done;
>  }
>
> --
> 2.31.1
>

Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Tariq Toukan June 10, 2021, 7:10 a.m. UTC | #5
On 6/9/2021 1:33 PM, Toke Høiland-Jørgensen wrote:
> The mlx4 driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP
> program invocations. However, the actual lifetime of the objects referred
> by the XDP program invocation is longer, all the way through to the call to
> xdp_do_flush(), making the scope of the rcu_read_lock() too small. This
> turns out to be harmless because it all happens in a single NAPI poll
> cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock()
> misleading.
>
> Rather than extend the scope of the rcu_read_lock(), just get rid of it
> entirely. With the addition of RCU annotations to the XDP_REDIRECT map
> types that take bh execution into account, lockdep even understands this to
> be safe, so there's really no reason to keep it around. Also switch the RCU
> dereferences in the driver loop itself to the _bh variants.
>
> Cc: Tariq Toukan <tariqt@nvidia.com>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---

Thanks for your patch.
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>


Regards,
Tariq
Alexei Starovoitov June 10, 2021, 6:38 p.m. UTC | #6
On Wed, Jun 9, 2021 at 7:24 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> XDP programs are called from a NAPI poll context, which means the RCU
> reference liveness is ensured by local_bh_disable(). Add
> rcu_read_lock_bh_held() as a condition to the RCU checks for map lookups so
> lockdep understands that the dereferences are safe from inside *either* an
> rcu_read_lock() section *or* a local_bh_disable() section. This is done in
> preparation for removing the redundant rcu_read_lock()s from the drivers.
>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
>  kernel/bpf/hashtab.c  | 21 ++++++++++++++-------
>  kernel/bpf/helpers.c  |  6 +++---
>  kernel/bpf/lpm_trie.c |  6 ++++--
>  3 files changed, 21 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 6f6681b07364..72c58cc516a3 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -596,7 +596,8 @@ static void *__htab_map_lookup_elem(struct bpf_map *map, void *key)
>         struct htab_elem *l;
>         u32 hash, key_size;
>
> -       WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
> +       WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
> +                    !rcu_read_lock_bh_held());

It's not clear to me whether rcu_read_lock_held() is still needed.
All comments sound like rcu_read_lock_bh_held() is a superset of rcu
that includes bh.
But reading rcu source code it looks like RCU_BH is its own rcu flavor...
which is confusing.
Martin KaFai Lau June 10, 2021, 7:33 p.m. UTC | #7
On Wed, Jun 09, 2021 at 12:33:11PM +0200, Toke Høiland-Jørgensen wrote:
> XDP programs are called from a NAPI poll context, which means the RCU
> reference liveness is ensured by local_bh_disable(). Add
> rcu_read_lock_bh_held() as a condition to the RCU checks for map lookups so
> lockdep understands that the dereferences are safe from inside *either* an
> rcu_read_lock() section *or* a local_bh_disable() section. This is done in
> preparation for removing the redundant rcu_read_lock()s from the drivers.
> 
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
>  kernel/bpf/hashtab.c  | 21 ++++++++++++++-------
>  kernel/bpf/helpers.c  |  6 +++---
>  kernel/bpf/lpm_trie.c |  6 ++++--
>  3 files changed, 21 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 6f6681b07364..72c58cc516a3 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -596,7 +596,8 @@ static void *__htab_map_lookup_elem(struct bpf_map *map, void *key)
>  	struct htab_elem *l;
>  	u32 hash, key_size;
>  
> -	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
> +		     !rcu_read_lock_bh_held());
>  
>  	key_size = map->key_size;
>  
> @@ -989,7 +990,8 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>  		/* unknown flags */
>  		return -EINVAL;
>  
> -	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
> +		     !rcu_read_lock_bh_held());
>  
>  	key_size = map->key_size;
>  
> @@ -1082,7 +1084,8 @@ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value,
>  		/* unknown flags */
>  		return -EINVAL;
>  
> -	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
> +		     !rcu_read_lock_bh_held());
>  
>  	key_size = map->key_size;
>  
> @@ -1148,7 +1151,8 @@ static int __htab_percpu_map_update_elem(struct bpf_map *map, void *key,
>  		/* unknown flags */
>  		return -EINVAL;
>  
> -	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
> +		     !rcu_read_lock_bh_held());
>  
>  	key_size = map->key_size;
>  
> @@ -1202,7 +1206,8 @@ static int __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
>  		/* unknown flags */
>  		return -EINVAL;
>  
> -	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
> +		     !rcu_read_lock_bh_held());
>  
>  	key_size = map->key_size;
>  
> @@ -1276,7 +1281,8 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key)
>  	u32 hash, key_size;
>  	int ret;
>  
> -	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
> +		     !rcu_read_lock_bh_held());
>  
>  	key_size = map->key_size;
>  
> @@ -1311,7 +1317,8 @@ static int htab_lru_map_delete_elem(struct bpf_map *map, void *key)
>  	u32 hash, key_size;
>  	int ret;
>  
> -	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
> +		     !rcu_read_lock_bh_held());
>  
>  	key_size = map->key_size;
>  
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 544773970dbc..e880f6bb6f28 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -28,7 +28,7 @@
>   */
>  BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
>  {
> -	WARN_ON_ONCE(!rcu_read_lock_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());

There is a discrepancy in rcu_read_lock_trace_held() here but
I think the patch_map_ops_generic step in the verifier has skipped
these helper calls.  It is unrelated and can be addressed later
until it is needed.

Acked-by: Martin KaFai Lau <kafai@fb.com>

>  	return (unsigned long) map->ops->map_lookup_elem(map, key);
>  }
>  
> @@ -44,7 +44,7 @@ const struct bpf_func_proto bpf_map_lookup_elem_proto = {
>  BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key,
>  	   void *, value, u64, flags)
>  {
> -	WARN_ON_ONCE(!rcu_read_lock_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
>  	return map->ops->map_update_elem(map, key, value, flags);
>  }
>  
> @@ -61,7 +61,7 @@ const struct bpf_func_proto bpf_map_update_elem_proto = {
>  
>  BPF_CALL_2(bpf_map_delete_elem, struct bpf_map *, map, void *, key)
>  {
> -	WARN_ON_ONCE(!rcu_read_lock_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
>  	return map->ops->map_delete_elem(map, key);
>  }
Martin KaFai Lau June 10, 2021, 9:09 p.m. UTC | #8
On Wed, Jun 09, 2021 at 12:33:13PM +0200, Toke Høiland-Jørgensen wrote:
[ ... ]

> @@ -551,7 +551,8 @@ static void cpu_map_free(struct bpf_map *map)
>  	for (i = 0; i < cmap->map.max_entries; i++) {
>  		struct bpf_cpu_map_entry *rcpu;
>  
> -		rcpu = READ_ONCE(cmap->cpu_map[i]);
> +		rcpu = rcu_dereference_check(cmap->cpu_map[i],
> +					     rcu_read_lock_bh_held());

Is rcu_read_lock_bh_held() true during map_free()?

[ ... ]

> @@ -149,7 +152,8 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
>  			       u64 map_flags)
>  {
>  	struct xsk_map *m = container_of(map, struct xsk_map, map);
> -	struct xdp_sock *xs, *old_xs, **map_entry;
> +	struct xdp_sock __rcu **map_entry;
> +	struct xdp_sock *xs, *old_xs;
>  	u32 i = *(u32 *)key, fd = *(u32 *)value;
>  	struct xsk_map_node *node;
>  	struct socket *sock;
> @@ -179,7 +183,7 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
>  	}
>  
>  	spin_lock_bh(&m->lock);
> -	old_xs = READ_ONCE(*map_entry);
> +	old_xs = rcu_dereference_check(*map_entry, rcu_read_lock_bh_held());

Is it actually protected by the m->lock at this point?

[ ... ]

>  void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
> -			     struct xdp_sock **map_entry)
> +			     struct xdp_sock __rcu **map_entry)
>  {
>  	spin_lock_bh(&map->lock);
> -	if (READ_ONCE(*map_entry) == xs) {
> -		WRITE_ONCE(*map_entry, NULL);
> +	if (rcu_dereference(*map_entry) == xs) {

nit. rcu_access_pointer()?

> +		rcu_assign_pointer(*map_entry, NULL);
>  		xsk_map_sock_delete(xs, map_entry);
>  	}
>  	spin_unlock_bh(&map->lock);
> -- 
> 2.31.1
>
Daniel Borkmann June 10, 2021, 9:24 p.m. UTC | #9
Hi Paul,

On 6/10/21 8:38 PM, Alexei Starovoitov wrote:
> On Wed, Jun 9, 2021 at 7:24 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> XDP programs are called from a NAPI poll context, which means the RCU
>> reference liveness is ensured by local_bh_disable(). Add
>> rcu_read_lock_bh_held() as a condition to the RCU checks for map lookups so
>> lockdep understands that the dereferences are safe from inside *either* an
>> rcu_read_lock() section *or* a local_bh_disable() section. This is done in
>> preparation for removing the redundant rcu_read_lock()s from the drivers.
>>
>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>> ---
>>   kernel/bpf/hashtab.c  | 21 ++++++++++++++-------
>>   kernel/bpf/helpers.c  |  6 +++---
>>   kernel/bpf/lpm_trie.c |  6 ++++--
>>   3 files changed, 21 insertions(+), 12 deletions(-)
>>
>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>> index 6f6681b07364..72c58cc516a3 100644
>> --- a/kernel/bpf/hashtab.c
>> +++ b/kernel/bpf/hashtab.c
>> @@ -596,7 +596,8 @@ static void *__htab_map_lookup_elem(struct bpf_map *map, void *key)
>>          struct htab_elem *l;
>>          u32 hash, key_size;
>>
>> -       WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
>> +       WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
>> +                    !rcu_read_lock_bh_held());
> 
> It's not clear to me whether rcu_read_lock_held() is still needed.
> All comments sound like rcu_read_lock_bh_held() is a superset of rcu
> that includes bh.
> But reading rcu source code it looks like RCU_BH is its own rcu flavor...
> which is confusing.

The series is a bit confusing to me as well. I recall we had a discussion with
Paul, but it was back in 2016 aka very early days of XDP to get some clarifications
about RCU vs RCU-bh flavour on this. Paul, given the series in here, I assume the
below is not true anymore, and in this case (since we're removing rcu_read_lock()
from drivers), the RCU-bh acts as a real superset?

Back then from your clarifications this was not the case:

   On Mon, Jul 25, 2016 at 11:26:02AM -0700, Alexei Starovoitov wrote:
   > On Mon, Jul 25, 2016 at 11:03 AM, Paul E. McKenney
   > <paulmck@linux.vnet.ibm.com> wrote:
   [...]
   >>> The crux of the question is whether a particular driver rx handler, when
   >>> called from __do_softirq, needs to add an additional rcu_read_lock or
   >>> whether it can rely on the mechanics of softirq.
   >>
   >> If it was rcu_read_lock_bh(), you could.
   >>
   >> But you didn't say rcu_read_lock_bh(), you instead said rcu_read_lock(),
   >> which means that you absolutely cannot rely on softirq semantics.
   >>
   >> In particular, in CONFIG_PREEMPT=y kernels, rcu_preempt_check_callbacks()
   >> will notice that there is no rcu_read_lock() in effect and report
   >> a quiescent state for that CPU.  Because rcu_preempt_check_callbacks()
   >> is invoked from the scheduling-clock interrupt, it absolutely can
   >> execute during do_softirq(), and therefore being in softirq context
   >> in no way provides rcu_read_lock()-style protection.
   >>
   >> Now, Alexei's question was for CONFIG_PREEMPT=n kernels.  However, in
   >> that case, rcu_read_lock() and rcu_read_unlock() generate no code
   >> in recent production kernels, so there is no performance penalty for
   >> using them.  (In older kernels, they implied a barrier().)
   >>
   >> So either way, with or without CONFIG_PREEMPT, you should use
   >> rcu_read_lock() to get RCU protection.
   >>
   >> One alternative might be to switch to rcu_read_lock_bh(), but that
   >> will add local_disable_bh() overhead to your read paths.
   >>
   >> Does that help, or am I missing the point of the question?
   >
   > thanks a lot for explanation.

   Glad you liked it!

   > I mistakenly assumed that _bh variants are 'stronger' and
   > act as inclusive, but sounds like they're completely orthogonal
   > especially with preempt_rcu=y.

   Yes, they are pretty much orthogonal.

   > With preempt_rcu=n and preempt=y, it would be the case, since
   > bh disables preemption and rcu_read_lock does the same as well,
   > right? Of course, the code shouldn't be relying on that, so we
   > have to fix our stuff.

   Indeed, especially given that the kernel currently won't allow you
   to configure CONFIG_PREEMPT_RCU=n and CONFIG_PREEMPT=y.  If it does,
   please let me know, as that would be a bug that needs to be fixed.
   (For one thing, I do not test that combination.)

							Thanx, Paul

And now, fast-forward again to 2021 ... :)

Thanks,
Daniel
Toke Høiland-Jørgensen June 10, 2021, 10:27 p.m. UTC | #10
Daniel Borkmann <daniel@iogearbox.net> writes:

> Hi Paul,
>
> On 6/10/21 8:38 PM, Alexei Starovoitov wrote:
>> On Wed, Jun 9, 2021 at 7:24 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>
>>> XDP programs are called from a NAPI poll context, which means the RCU
>>> reference liveness is ensured by local_bh_disable(). Add
>>> rcu_read_lock_bh_held() as a condition to the RCU checks for map lookups so
>>> lockdep understands that the dereferences are safe from inside *either* an
>>> rcu_read_lock() section *or* a local_bh_disable() section. This is done in
>>> preparation for removing the redundant rcu_read_lock()s from the drivers.
>>>
>>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>>> ---
>>>   kernel/bpf/hashtab.c  | 21 ++++++++++++++-------
>>>   kernel/bpf/helpers.c  |  6 +++---
>>>   kernel/bpf/lpm_trie.c |  6 ++++--
>>>   3 files changed, 21 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>> index 6f6681b07364..72c58cc516a3 100644
>>> --- a/kernel/bpf/hashtab.c
>>> +++ b/kernel/bpf/hashtab.c
>>> @@ -596,7 +596,8 @@ static void *__htab_map_lookup_elem(struct bpf_map *map, void *key)
>>>          struct htab_elem *l;
>>>          u32 hash, key_size;
>>>
>>> -       WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
>>> +       WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
>>> +                    !rcu_read_lock_bh_held());
>> 
>> It's not clear to me whether rcu_read_lock_held() is still needed.
>> All comments sound like rcu_read_lock_bh_held() is a superset of rcu
>> that includes bh.
>> But reading rcu source code it looks like RCU_BH is its own rcu flavor...
>> which is confusing.
>
> The series is a bit confusing to me as well. I recall we had a discussion with
> Paul, but it was back in 2016 aka very early days of XDP to get some clarifications
> about RCU vs RCU-bh flavour on this. Paul, given the series in here, I assume the
> below is not true anymore, and in this case (since we're removing rcu_read_lock()
> from drivers), the RCU-bh acts as a real superset?
>
> Back then from your clarifications this was not the case:
>
>    On Mon, Jul 25, 2016 at 11:26:02AM -0700, Alexei Starovoitov wrote:
>    > On Mon, Jul 25, 2016 at 11:03 AM, Paul E. McKenney
>    > <paulmck@linux.vnet.ibm.com> wrote:
>    [...]
>    >>> The crux of the question is whether a particular driver rx handler, when
>    >>> called from __do_softirq, needs to add an additional rcu_read_lock or
>    >>> whether it can rely on the mechanics of softirq.
>    >>
>    >> If it was rcu_read_lock_bh(), you could.
>    >>
>    >> But you didn't say rcu_read_lock_bh(), you instead said rcu_read_lock(),
>    >> which means that you absolutely cannot rely on softirq semantics.
>    >>
>    >> In particular, in CONFIG_PREEMPT=y kernels, rcu_preempt_check_callbacks()
>    >> will notice that there is no rcu_read_lock() in effect and report
>    >> a quiescent state for that CPU.  Because rcu_preempt_check_callbacks()
>    >> is invoked from the scheduling-clock interrupt, it absolutely can
>    >> execute during do_softirq(), and therefore being in softirq context
>    >> in no way provides rcu_read_lock()-style protection.
>    >>
>    >> Now, Alexei's question was for CONFIG_PREEMPT=n kernels.  However, in
>    >> that case, rcu_read_lock() and rcu_read_unlock() generate no code
>    >> in recent production kernels, so there is no performance penalty for
>    >> using them.  (In older kernels, they implied a barrier().)
>    >>
>    >> So either way, with or without CONFIG_PREEMPT, you should use
>    >> rcu_read_lock() to get RCU protection.
>    >>
>    >> One alternative might be to switch to rcu_read_lock_bh(), but that
>    >> will add local_disable_bh() overhead to your read paths.
>    >>
>    >> Does that help, or am I missing the point of the question?
>    >
>    > thanks a lot for explanation.
>
>    Glad you liked it!
>
>    > I mistakenly assumed that _bh variants are 'stronger' and
>    > act as inclusive, but sounds like they're completely orthogonal
>    > especially with preempt_rcu=y.
>
>    Yes, they are pretty much orthogonal.
>
>    > With preempt_rcu=n and preempt=y, it would be the case, since
>    > bh disables preemption and rcu_read_lock does the same as well,
>    > right? Of course, the code shouldn't be relying on that, so we
>    > have to fix our stuff.
>
>    Indeed, especially given that the kernel currently won't allow you
>    to configure CONFIG_PREEMPT_RCU=n and CONFIG_PREEMPT=y.  If it does,
>    please let me know, as that would be a bug that needs to be fixed.
>    (For one thing, I do not test that combination.)
>
> 							Thanx, Paul
>
> And now, fast-forward again to 2021 ... :)

We covered this in the thread I linked from the cover letter.
Specifically, this seems to have changed as of v4.20; see Paul's
reply here:
https://lore.kernel.org/bpf/20210417002301.GO4212@paulmck-ThinkPad-P17-Gen-1/

and the follow-up covering -rt here:
https://lore.kernel.org/bpf/20210419165837.GA975577@paulmck-ThinkPad-P17-Gen-1/

-Toke
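
The consolidated behaviour referenced above can be sketched like this. This is an illustrative kernel-style C fragment, not code from the series; the two calling contexts are assumptions based on the series description:

```c
/* Illustrative sketch: after the v4.20 RCU flavor consolidation,
 * BH-disabled (softirq/NAPI) context is itself a full RCU read-side
 * critical section, so both contexts below satisfy the updated
 * WARN_ON_ONCE() check in __htab_map_lookup_elem(). */

/* Syscall/program path: an explicit RCU read-side section. */
rcu_read_lock();
l = __htab_map_lookup_elem(map, key);	/* rcu_read_lock_held() is true */
rcu_read_unlock();

/* XDP/NAPI path: BH disabled, so drivers no longer need an extra
 * rcu_read_lock() around the XDP program invocation. */
local_bh_disable();
l = __htab_map_lookup_elem(map, key);	/* rcu_read_lock_bh_held() is true */
local_bh_enable();
```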
Toke Høiland-Jørgensen June 10, 2021, 11:19 p.m. UTC | #11
Martin KaFai Lau <kafai@fb.com> writes:

> On Wed, Jun 09, 2021 at 12:33:13PM +0200, Toke Høiland-Jørgensen wrote:
> [ ... ]
>
>> @@ -551,7 +551,8 @@ static void cpu_map_free(struct bpf_map *map)
>>  	for (i = 0; i < cmap->map.max_entries; i++) {
>>  		struct bpf_cpu_map_entry *rcpu;
>>  
>> -		rcpu = READ_ONCE(cmap->cpu_map[i]);
>> +		rcpu = rcu_dereference_check(cmap->cpu_map[i],
>> +					     rcu_read_lock_bh_held());
> Is rcu_read_lock_bh_held() true during map_free()?

Hmm, no, I guess not since that's called from a workqueue. Will fix!
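
For reference, one way the teardown path could express that no protection is needed is sketched below. This is illustrative kernel-style C under stated assumptions, not necessarily what the next revision of the series did:

```c
/* Illustrative sketch: cpu_map_free() runs from a workqueue after the
 * last map reference is gone, so no concurrent readers can exist and no
 * RCU read-side section is held. rcu_dereference_raw(), paired with a
 * comment explaining why the unprotected access is safe, is one
 * conventional way to express that to sparse and lockdep. */
for (i = 0; i < cmap->map.max_entries; i++) {
	struct bpf_cpu_map_entry *rcpu;

	/* The map is being freed; nothing else can touch cpu_map[i]. */
	rcpu = rcu_dereference_raw(cmap->cpu_map[i]);
	if (!rcpu)
		continue;

	/* ... tear down and free rcpu ... */
}
```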

>> @@ -149,7 +152,8 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
>>  			       u64 map_flags)
>>  {
>>  	struct xsk_map *m = container_of(map, struct xsk_map, map);
>> -	struct xdp_sock *xs, *old_xs, **map_entry;
>> +	struct xdp_sock __rcu **map_entry;
>> +	struct xdp_sock *xs, *old_xs;
>>  	u32 i = *(u32 *)key, fd = *(u32 *)value;
>>  	struct xsk_map_node *node;
>>  	struct socket *sock;
>> @@ -179,7 +183,7 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
>>  	}
>>  
>>  	spin_lock_bh(&m->lock);
>> -	old_xs = READ_ONCE(*map_entry);
>> +	old_xs = rcu_dereference_check(*map_entry, rcu_read_lock_bh_held());
> Is it actually protected by the m->lock at this point?

True, can just add that to the check.

>>  void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
>> -			     struct xdp_sock **map_entry)
>> +			     struct xdp_sock __rcu **map_entry)
>>  {
>>  	spin_lock_bh(&map->lock);
>> -	if (READ_ONCE(*map_entry) == xs) {
>> -		WRITE_ONCE(*map_entry, NULL);
>> +	if (rcu_dereference(*map_entry) == xs) {
> nit. rcu_access_pointer()?

Yup.
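
The rcu_access_pointer() distinction can be sketched as follows. This is an illustrative fragment assuming the shape of xsk_map_try_sock_delete() from the quoted diff, not a literal hunk:

```c
/* Illustrative sketch: rcu_access_pointer() fetches the pointer value
 * without marking an RCU dereference, which is exactly right here - the
 * pointer is only compared, never dereferenced through, and the store
 * that may follow is serialized by map->lock rather than by RCU. */
spin_lock_bh(&map->lock);
if (rcu_access_pointer(*map_entry) == xs)
	RCU_INIT_POINTER(*map_entry, NULL);	/* NULL store needs no barrier */
spin_unlock_bh(&map->lock);
```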
Martin KaFai Lau June 10, 2021, 11:32 p.m. UTC | #12
On Fri, Jun 11, 2021 at 01:19:04AM +0200, Toke Høiland-Jørgensen wrote:
> >> @@ -149,7 +152,8 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
> >>  			       u64 map_flags)
> >>  {
> >>  	struct xsk_map *m = container_of(map, struct xsk_map, map);
> >> -	struct xdp_sock *xs, *old_xs, **map_entry;
> >> +	struct xdp_sock __rcu **map_entry;
> >> +	struct xdp_sock *xs, *old_xs;
> >>  	u32 i = *(u32 *)key, fd = *(u32 *)value;
> >>  	struct xsk_map_node *node;
> >>  	struct socket *sock;
> >> @@ -179,7 +183,7 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
> >>  	}
> >>  
> >>  	spin_lock_bh(&m->lock);
> >> -	old_xs = READ_ONCE(*map_entry);
> >> +	old_xs = rcu_dereference_check(*map_entry, rcu_read_lock_bh_held());
> > Is it actually protected by the m->lock at this point?
> 
> True, can just add that to the check.

this should be enough
rcu_dereference_protected(*map_entry, lockdep_is_held(&m->lock));
Toke Høiland-Jørgensen June 10, 2021, 11:41 p.m. UTC | #13
Martin KaFai Lau <kafai@fb.com> writes:

> On Fri, Jun 11, 2021 at 01:19:04AM +0200, Toke Høiland-Jørgensen wrote:
>> >> @@ -149,7 +152,8 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
>> >>  			       u64 map_flags)
>> >>  {
>> >>  	struct xsk_map *m = container_of(map, struct xsk_map, map);
>> >> -	struct xdp_sock *xs, *old_xs, **map_entry;
>> >> +	struct xdp_sock __rcu **map_entry;
>> >> +	struct xdp_sock *xs, *old_xs;
>> >>  	u32 i = *(u32 *)key, fd = *(u32 *)value;
>> >>  	struct xsk_map_node *node;
>> >>  	struct socket *sock;
>> >> @@ -179,7 +183,7 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
>> >>  	}
>> >>  
>> >>  	spin_lock_bh(&m->lock);
>> >> -	old_xs = READ_ONCE(*map_entry);
>> >> +	old_xs = rcu_dereference_check(*map_entry, rcu_read_lock_bh_held());
>> > Is it actually protected by the m->lock at this point?
>> 
>> True, can just add that to the check.
> this should be enough
> rcu_dereference_protected(*map_entry, lockdep_is_held(&m->lock));

Right, that's what I had in mind as well :)

-Toke
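
The update-side pattern Martin suggests, sketched for context. This is an illustrative kernel-C fragment, not a literal hunk from the next revision:

```c
/* Illustrative sketch: on the update side the entry is stabilized by
 * m->lock, not by an RCU read-side section, so rcu_dereference_protected()
 * with a lockdep expression documents the real protection, lets lockdep
 * complain if the lock is ever not held here, and avoids the READ_ONCE()
 * that rcu_dereference_check() would emit. */
spin_lock_bh(&m->lock);
old_xs = rcu_dereference_protected(*map_entry, lockdep_is_held(&m->lock));
rcu_assign_pointer(*map_entry, xs);	/* publish the new entry */
spin_unlock_bh(&m->lock);
```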