
[net] lwt: disable BH too in run_lwt_bpf()

Message ID 20201201194438.37402-1-xiyou.wangcong@gmail.com
State New
Series [net] lwt: disable BH too in run_lwt_bpf()

Commit Message

Cong Wang Dec. 1, 2020, 7:44 p.m. UTC
From: Dongdong Wang <wangdongdong@bytedance.com>

The per-cpu bpf_redirect_info is shared among all skb_do_redirect()
and BPF redirect helpers. Callers on the RX path are all in BH context,
so disabling preemption is not sufficient to prevent BH interruption.

In production, we observed strange packet drops because of a race
condition between LWT xmit and TC ingress, and we verified that the issue
is fixed after we disable BH.

Although this bug has technically been present from the beginning, that
is, commit 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure"),
at that time call_rcu() had to be call_rcu_bh() to match the RCU context.
So this patch may not work well before the RCU flavor consolidation was
completed around v5.0.

Also update the comments above the code, as call_rcu() is now BH friendly.

Cc: Thomas Graf <tgraf@suug.ch>
Cc: bpf@vger.kernel.org
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Dongdong Wang <wangdongdong@bytedance.com>
---
 net/core/lwt_bpf.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
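
For readers new to the redirect machinery, here is a minimal kernel-style
sketch of the race being fixed. The struct layout is abbreviated from
include/linux/filter.h and the interleaving is illustrative, not code taken
from the patch itself:

/* One instance per CPU, shared by every BPF redirect user on that CPU. */
struct bpf_redirect_info {
	u32 tgt_index;		/* target ifindex set by bpf_redirect() */
	/* ... further fields elided ... */
};
DEFINE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);

/* LWT xmit path with only preempt_disable() held:
 *
 *   1. the LWT BPF prog calls bpf_redirect(IF_A)
 *        -> this CPU's bpf_redirect_info.tgt_index = IF_A
 *   2. a softirq fires on the same CPU (BH was never disabled)
 *   3. a TC ingress BPF prog calls bpf_redirect(IF_B)
 *        -> the same per-cpu bpf_redirect_info.tgt_index = IF_B
 *   4. back in LWT xmit, skb_do_redirect() consumes tgt_index == IF_B
 *        -> the skb is redirected to the wrong device or dropped
 */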

Comments

Jakub Kicinski Dec. 2, 2020, 12:16 a.m. UTC | #1
On Tue,  1 Dec 2020 11:44:38 -0800 Cong Wang wrote:
> From: Dongdong Wang <wangdongdong@bytedance.com>
> 
> The per-cpu bpf_redirect_info is shared among all skb_do_redirect()
> and BPF redirect helpers. Callers on the RX path are all in BH context,
> so disabling preemption is not sufficient to prevent BH interruption.
> 
> In production, we observed strange packet drops because of a race
> condition between LWT xmit and TC ingress, and we verified that the issue
> is fixed after we disable BH.
> 
> Although this bug has technically been present from the beginning, that
> is, commit 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure"),
> at that time call_rcu() had to be call_rcu_bh() to match the RCU context.
> So this patch may not work well before the RCU flavor consolidation was
> completed around v5.0.
> 
> Also update the comments above the code, as call_rcu() is now BH friendly.
> 
> Cc: Thomas Graf <tgraf@suug.ch>
> Cc: bpf@vger.kernel.org
> Reviewed-by: Cong Wang <cong.wang@bytedance.com>
> Signed-off-by: Dongdong Wang <wangdongdong@bytedance.com>

Fixes?
Jakub Kicinski Dec. 2, 2020, 12:17 a.m. UTC | #2
On Tue, 1 Dec 2020 16:16:33 -0800 Jakub Kicinski wrote:
> On Tue,  1 Dec 2020 11:44:38 -0800 Cong Wang wrote:
> > From: Dongdong Wang <wangdongdong@bytedance.com>
> > 
> > The per-cpu bpf_redirect_info is shared among all skb_do_redirect()
> > and BPF redirect helpers. Callers on the RX path are all in BH context,
> > so disabling preemption is not sufficient to prevent BH interruption.
> > 
> > In production, we observed strange packet drops because of a race
> > condition between LWT xmit and TC ingress, and we verified that the issue
> > is fixed after we disable BH.
> > 
> > Although this bug has technically been present from the beginning, that
> > is, commit 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure"),
> > at that time call_rcu() had to be call_rcu_bh() to match the RCU context.
> > So this patch may not work well before the RCU flavor consolidation was
> > completed around v5.0.
> > 
> > Also update the comments above the code, as call_rcu() is now BH friendly.
> > 
> > Cc: Thomas Graf <tgraf@suug.ch>
> > Cc: bpf@vger.kernel.org
> > Reviewed-by: Cong Wang <cong.wang@bytedance.com>
> > Signed-off-by: Dongdong Wang <wangdongdong@bytedance.com>  
> 
> Fixes?

Ah, should have read the commit message first. Okay.
Jakub Kicinski Dec. 3, 2020, 1:10 a.m. UTC | #3
On Tue,  1 Dec 2020 11:44:38 -0800 Cong Wang wrote:
> From: Dongdong Wang <wangdongdong@bytedance.com>
> 
> The per-cpu bpf_redirect_info is shared among all skb_do_redirect()
> and BPF redirect helpers. Callers on the RX path are all in BH context,
> so disabling preemption is not sufficient to prevent BH interruption.
> 
> In production, we observed strange packet drops because of a race
> condition between LWT xmit and TC ingress, and we verified that the issue
> is fixed after we disable BH.
> 
> Although this bug has technically been present from the beginning, that
> is, commit 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure"),
> at that time call_rcu() had to be call_rcu_bh() to match the RCU context.
> So this patch may not work well before the RCU flavor consolidation was
> completed around v5.0.
> 
> Also update the comments above the code, as call_rcu() is now BH friendly.
> 
> Cc: Thomas Graf <tgraf@suug.ch>
> Cc: bpf@vger.kernel.org
> Reviewed-by: Cong Wang <cong.wang@bytedance.com>
> Signed-off-by: Dongdong Wang <wangdongdong@bytedance.com>
> ---
>  net/core/lwt_bpf.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
> index 7d3438215f32..4f3cb7c15ddf 100644
> --- a/net/core/lwt_bpf.c
> +++ b/net/core/lwt_bpf.c
> @@ -39,12 +39,11 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
>  {
>  	int ret;
>  
> -	/* Preempt disable is needed to protect per-cpu redirect_info between
> -	 * BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
> -	 * access to maps strictly require a rcu_read_lock() for protection,
> -	 * mixing with BH RCU lock doesn't work.
> +	/* Preempt disable and BH disable are needed to protect per-cpu
> +	 * redirect_info between BPF prog and skb_do_redirect().
>  	 */
>  	preempt_disable();
> +	local_bh_disable();

Why not remove the preempt_disable()? Disabling BH must also disable
preemption AFAIK.

>  	bpf_compute_data_pointers(skb);
>  	ret = bpf_prog_run_save_cb(lwt->prog, skb);
>  
> @@ -78,6 +77,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
>  		break;
>  	}
>  
> +	local_bh_enable();
>  	preempt_enable();
>  
>  	return ret;
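
On Jakub's point: on non-PREEMPT_RT kernels, local_bh_disable() is
implemented by adding to the softirq part of preempt_count(), and any
nonzero preempt_count() also blocks preemption, which is why disabling BH
alone would suffice there. A simplified sketch of the mainline helper
(paraphrased from include/linux/bottom_half.h; PREEMPT_RT implements this
differently):

static inline void local_bh_disable(void)
{
	/* Bumps preempt_count() by SOFTIRQ_DISABLE_OFFSET: softirqs
	 * cannot run on this CPU, and preemption is off as a side
	 * effect on !PREEMPT_RT kernels.
	 */
	__local_bh_disable_ip(_THIS_IP_, SOFTIRQ_DISABLE_OFFSET);
}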
Cong Wang Dec. 3, 2020, 1:29 a.m. UTC | #4
On Wed, Dec 2, 2020 at 5:10 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue,  1 Dec 2020 11:44:38 -0800 Cong Wang wrote:
> > From: Dongdong Wang <wangdongdong@bytedance.com>
> >
> > The per-cpu bpf_redirect_info is shared among all skb_do_redirect()
> > and BPF redirect helpers. Callers on the RX path are all in BH context,
> > so disabling preemption is not sufficient to prevent BH interruption.
> >
> > In production, we observed strange packet drops because of a race
> > condition between LWT xmit and TC ingress, and we verified that the issue
> > is fixed after we disable BH.
> >
> > Although this bug has technically been present from the beginning, that
> > is, commit 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure"),
> > at that time call_rcu() had to be call_rcu_bh() to match the RCU context.
> > So this patch may not work well before the RCU flavor consolidation was
> > completed around v5.0.
> >
> > Also update the comments above the code, as call_rcu() is now BH friendly.
> >
> > Cc: Thomas Graf <tgraf@suug.ch>
> > Cc: bpf@vger.kernel.org
> > Reviewed-by: Cong Wang <cong.wang@bytedance.com>
> > Signed-off-by: Dongdong Wang <wangdongdong@bytedance.com>
> > ---
> >  net/core/lwt_bpf.c | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
> > index 7d3438215f32..4f3cb7c15ddf 100644
> > --- a/net/core/lwt_bpf.c
> > +++ b/net/core/lwt_bpf.c
> > @@ -39,12 +39,11 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
> >  {
> >       int ret;
> >
> > -     /* Preempt disable is needed to protect per-cpu redirect_info between
> > -      * BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
> > -      * access to maps strictly require a rcu_read_lock() for protection,
> > -      * mixing with BH RCU lock doesn't work.
> > +     /* Preempt disable and BH disable are needed to protect per-cpu
> > +      * redirect_info between BPF prog and skb_do_redirect().
> >        */
> >       preempt_disable();
> > +     local_bh_disable();
>
> Why not remove the preempt_disable()? Disabling BH must also disable
> preemption AFAIK.

It seems the RT kernel still needs preempt disable:
https://www.spinics.net/lists/kernel/msg3710124.html
but my RT knowledge is not sufficient to tell, so I just followed the
same pattern used in today's x86 FPU code:

static inline void fpregs_lock(void)
{
        preempt_disable();
        local_bh_disable();
}

static inline void fpregs_unlock(void)
{
        local_bh_enable();
        preempt_enable();
}

There are other similar patterns in the current code base, so if this
needs a cleanup, the RT people can clean them all up together.

Thanks.
Jakub Kicinski Dec. 3, 2020, 1:47 a.m. UTC | #5
On Wed, 2 Dec 2020 17:29:53 -0800 Cong Wang wrote:
> On Wed, Dec 2, 2020 at 5:10 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Tue,  1 Dec 2020 11:44:38 -0800 Cong Wang wrote:
> > > From: Dongdong Wang <wangdongdong@bytedance.com>
> > >
> > > The per-cpu bpf_redirect_info is shared among all skb_do_redirect()
> > > and BPF redirect helpers. Callers on the RX path are all in BH context,
> > > so disabling preemption is not sufficient to prevent BH interruption.
> > >
> > > In production, we observed strange packet drops because of a race
> > > condition between LWT xmit and TC ingress, and we verified that the issue
> > > is fixed after we disable BH.
> > >
> > > Although this bug has technically been present from the beginning, that
> > > is, commit 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure"),
> > > at that time call_rcu() had to be call_rcu_bh() to match the RCU context.
> > > So this patch may not work well before the RCU flavor consolidation was
> > > completed around v5.0.
> > >
> > > Also update the comments above the code, as call_rcu() is now BH friendly.
> > >
> > > Cc: Thomas Graf <tgraf@suug.ch>
> > > Cc: bpf@vger.kernel.org
> > > Reviewed-by: Cong Wang <cong.wang@bytedance.com>
> > > Signed-off-by: Dongdong Wang <wangdongdong@bytedance.com>
> > > ---
> > >  net/core/lwt_bpf.c | 8 ++++----
> > >  1 file changed, 4 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
> > > index 7d3438215f32..4f3cb7c15ddf 100644
> > > --- a/net/core/lwt_bpf.c
> > > +++ b/net/core/lwt_bpf.c
> > > @@ -39,12 +39,11 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
> > >  {
> > >       int ret;
> > >
> > > -     /* Preempt disable is needed to protect per-cpu redirect_info between
> > > -      * BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
> > > -      * access to maps strictly require a rcu_read_lock() for protection,
> > > -      * mixing with BH RCU lock doesn't work.
> > > +     /* Preempt disable and BH disable are needed to protect per-cpu
> > > +      * redirect_info between BPF prog and skb_do_redirect().
> > >        */
> > >       preempt_disable();
> > > +     local_bh_disable();
> >
> > Why not remove the preempt_disable()? Disabling BH must also disable
> > preemption AFAIK.
> 
> It seems the RT kernel still needs preempt disable:
> https://www.spinics.net/lists/kernel/msg3710124.html
> but my RT knowledge is not sufficient to tell, so I just followed the
> same pattern used in today's x86 FPU code:
> 
> static inline void fpregs_lock(void)
> {
>         preempt_disable();
>         local_bh_disable();
> }
> 
> static inline void fpregs_unlock(void)
> {
>         local_bh_enable();
>         preempt_enable();
> }
> 
> There are other similar patterns in the current code base, so if this
> needs a cleanup, the RT people can clean them all up together.

I see. GTK.

The patch seems good, but it's probably best suited to the bpf tree; let
me reassign it in patchwork.
Alexei Starovoitov Dec. 3, 2020, 1:50 a.m. UTC | #6
On Wed, Dec 2, 2020 at 5:32 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Wed, Dec 2, 2020 at 5:10 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Tue,  1 Dec 2020 11:44:38 -0800 Cong Wang wrote:
> > > From: Dongdong Wang <wangdongdong@bytedance.com>
> > >
> > > The per-cpu bpf_redirect_info is shared among all skb_do_redirect()
> > > and BPF redirect helpers. Callers on the RX path are all in BH context,
> > > so disabling preemption is not sufficient to prevent BH interruption.
> > >
> > > In production, we observed strange packet drops because of a race
> > > condition between LWT xmit and TC ingress, and we verified that the issue
> > > is fixed after we disable BH.
> > >
> > > Although this bug has technically been present from the beginning, that
> > > is, commit 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure"),
> > > at that time call_rcu() had to be call_rcu_bh() to match the RCU context.
> > > So this patch may not work well before the RCU flavor consolidation was
> > > completed around v5.0.
> > >
> > > Also update the comments above the code, as call_rcu() is now BH friendly.
> > >
> > > Cc: Thomas Graf <tgraf@suug.ch>
> > > Cc: bpf@vger.kernel.org
> > > Reviewed-by: Cong Wang <cong.wang@bytedance.com>
> > > Signed-off-by: Dongdong Wang <wangdongdong@bytedance.com>
> > > ---
> > >  net/core/lwt_bpf.c | 8 ++++----
> > >  1 file changed, 4 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
> > > index 7d3438215f32..4f3cb7c15ddf 100644
> > > --- a/net/core/lwt_bpf.c
> > > +++ b/net/core/lwt_bpf.c
> > > @@ -39,12 +39,11 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
> > >  {
> > >       int ret;
> > >
> > > -     /* Preempt disable is needed to protect per-cpu redirect_info between
> > > -      * BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
> > > -      * access to maps strictly require a rcu_read_lock() for protection,
> > > -      * mixing with BH RCU lock doesn't work.
> > > +     /* Preempt disable and BH disable are needed to protect per-cpu
> > > +      * redirect_info between BPF prog and skb_do_redirect().
> > >        */
> > >       preempt_disable();
> > > +     local_bh_disable();
> >
> > Why not remove the preempt_disable()? Disabling BH must also disable
> > preemption AFAIK.
>
> It seems the RT kernel still needs preempt disable:

No. It's the opposite. When we did the RT+bpf changes we missed this
function. It should be migrate_disable here instead of preempt_disable.
I don't know what local_bh_disable() maps to in RT.
Since it's used in many other places, it's fine to use it here to
prevent this race.
Alexei Starovoitov Dec. 3, 2020, 6:22 p.m. UTC | #7
On Wed, Dec 2, 2020 at 5:50 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Dec 2, 2020 at 5:32 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Wed, Dec 2, 2020 at 5:10 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > On Tue,  1 Dec 2020 11:44:38 -0800 Cong Wang wrote:
> > > > From: Dongdong Wang <wangdongdong@bytedance.com>
> > > >
> > > > The per-cpu bpf_redirect_info is shared among all skb_do_redirect()
> > > > and BPF redirect helpers. Callers on the RX path are all in BH context,
> > > > so disabling preemption is not sufficient to prevent BH interruption.
> > > >
> > > > In production, we observed strange packet drops because of a race
> > > > condition between LWT xmit and TC ingress, and we verified that the issue
> > > > is fixed after we disable BH.
> > > >
> > > > Although this bug has technically been present from the beginning, that
> > > > is, commit 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure"),
> > > > at that time call_rcu() had to be call_rcu_bh() to match the RCU context.
> > > > So this patch may not work well before the RCU flavor consolidation was
> > > > completed around v5.0.
> > > >
> > > > Also update the comments above the code, as call_rcu() is now BH friendly.
> > > >
> > > > Cc: Thomas Graf <tgraf@suug.ch>
> > > > Cc: bpf@vger.kernel.org
> > > > Reviewed-by: Cong Wang <cong.wang@bytedance.com>
> > > > Signed-off-by: Dongdong Wang <wangdongdong@bytedance.com>
> > > > ---
> > > >  net/core/lwt_bpf.c | 8 ++++----
> > > >  1 file changed, 4 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
> > > > index 7d3438215f32..4f3cb7c15ddf 100644
> > > > --- a/net/core/lwt_bpf.c
> > > > +++ b/net/core/lwt_bpf.c
> > > > @@ -39,12 +39,11 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
> > > >  {
> > > >       int ret;
> > > >
> > > > -     /* Preempt disable is needed to protect per-cpu redirect_info between
> > > > -      * BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
> > > > -      * access to maps strictly require a rcu_read_lock() for protection,
> > > > -      * mixing with BH RCU lock doesn't work.
> > > > +     /* Preempt disable and BH disable are needed to protect per-cpu
> > > > +      * redirect_info between BPF prog and skb_do_redirect().
> > > >        */
> > > >       preempt_disable();
> > > > +     local_bh_disable();
> > >
> > > Why not remove the preempt_disable()? Disabling BH must also disable
> > > preemption AFAIK.
> >
> > It seems the RT kernel still needs preempt disable:
>
> No. It's the opposite. When we did the RT+bpf changes we missed this
> function. It should be migrate_disable here instead of preempt_disable.
> I don't know what local_bh_disable() maps to in RT.
> Since it's used in many other places, it's fine to use it here to
> prevent this race.

I guess my previous comment could be misinterpreted.
Cong,
please respin with changing preempt_disable to migrate_disable
and adding local_bh_disable.
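
A minimal sketch of what the requested respin could look like inside
run_lwt_bpf() (hypothetical: the actual v2 hunks and comment wording are up
to the author, and migrate_disable()/migrate_enable() only exist on kernels
that have the helper, per the thread introduced in Feb 2020):

	/* BH disable protects the per-cpu redirect_info between the
	 * BPF prog and skb_do_redirect(); migrate_disable() keeps the
	 * task on this CPU without disabling preemption on RT.
	 */
	migrate_disable();
	local_bh_disable();
	bpf_compute_data_pointers(skb);
	ret = bpf_prog_run_save_cb(lwt->prog, skb);

	/* ... switch on ret (BPF_OK / BPF_REDIRECT / BPF_DROP) ... */

	local_bh_enable();
	migrate_enable();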
Cong Wang Dec. 3, 2020, 6:27 p.m. UTC | #8
On Thu, Dec 3, 2020 at 10:22 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> I guess my previous comment could be misinterpreted.
> Cong,
> please respin with changing preempt_disable to migrate_disable
> and adding local_bh_disable.

I have no objection, just want to point out that migrate_disable() may
not exist if we backport this further to -stable; this helper was
introduced in Feb 2020.

Thanks.
Alexei Starovoitov Dec. 3, 2020, 6:29 p.m. UTC | #9
On Thu, Dec 3, 2020 at 10:28 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Thu, Dec 3, 2020 at 10:22 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > I guess my previous comment could be misinterpreted.
> > Cong,
> > please respin with changing preempt_disable to migrate_disable
> > and adding local_bh_disable.
>
> I have no objection, just want to point out that migrate_disable() may
> not exist if we backport this further to -stable; this helper was
> introduced in Feb 2020.

I see. Then please split it into two patches for ease of backporting.
Cong Wang Dec. 4, 2020, 5:55 a.m. UTC | #10
On Thu, Dec 3, 2020 at 10:30 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Dec 3, 2020 at 10:28 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Thu, Dec 3, 2020 at 10:22 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > I guess my previous comment could be misinterpreted.
> > > Cong,
> > > please respin with changing preempt_disable to migrate_disable
> > > and adding local_bh_disable.
> >
> > I have no objection, just want to point out that migrate_disable() may
> > not exist if we backport this further to -stable; this helper was
> > introduced in Feb 2020.
>
> I see. Then please split it into two patches for ease of backporting.

You mean the first patch does the same as this patch and the second
patch replaces preempt_disable() with migrate_disable(), right?

Thanks.
Alexei Starovoitov Dec. 5, 2020, 2 a.m. UTC | #11
On Thu, Dec 3, 2020 at 9:55 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Thu, Dec 3, 2020 at 10:30 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Dec 3, 2020 at 10:28 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > On Thu, Dec 3, 2020 at 10:22 AM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > I guess my previous comment could be misinterpreted.
> > > > Cong,
> > > > please respin with changing preempt_disable to migrate_disable
> > > > and adding local_bh_disable.
> > >
> > > I have no objection, just want to point out that migrate_disable() may
> > > not exist if we backport this further to -stable; this helper was
> > > introduced in Feb 2020.
> >
> > I see. Then please split it into two patches for ease of backporting.
>
> You mean the first patch does the same as this patch and the second
> patch replaces preempt_disable() with migrate_disable(), right?

Right. Just be mindful of changing the comment.
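
To make the agreed split concrete, a hedged sketch of how the two patches
might stack (hypothetical hunks, not the actual v2 series):

Patch 1 (stable-friendly, no dependency on migrate_disable()):

 	preempt_disable();
+	local_bh_disable();
 	bpf_compute_data_pointers(skb);
 	...
+	local_bh_enable();
 	preempt_enable();

Patch 2 (on top, only for kernels that already have migrate_disable()):

-	preempt_disable();
+	migrate_disable();
 	local_bh_disable();
 	...
 	local_bh_enable();
-	preempt_enable();
+	migrate_enable();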

Patch

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index 7d3438215f32..4f3cb7c15ddf 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -39,12 +39,11 @@  static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 {
 	int ret;
 
-	/* Preempt disable is needed to protect per-cpu redirect_info between
-	 * BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
-	 * access to maps strictly require a rcu_read_lock() for protection,
-	 * mixing with BH RCU lock doesn't work.
+	/* Preempt disable and BH disable are needed to protect per-cpu
+	 * redirect_info between BPF prog and skb_do_redirect().
 	 */
 	preempt_disable();
+	local_bh_disable();
 	bpf_compute_data_pointers(skb);
 	ret = bpf_prog_run_save_cb(lwt->prog, skb);
 
@@ -78,6 +77,7 @@  static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 		break;
 	}
 
+	local_bh_enable();
 	preempt_enable();
 
 	return ret;