[net,v5,0/3] fix packet stuck problem for lockless qdisc

Message ID	1620266264-48109-1-git-send-email-linyunsheng@huawei.com
Headers	show Return-Path: <netdev-owner@kernel.org> From: Yunsheng Lin <linyunsheng@huawei.com> To: <davem@davemloft.net>, <kuba@kernel.org> CC: <olteanv@gmail.com>, <ast@kernel.org>, <daniel@iogearbox.net>, <andriin@fb.com>, <edumazet@google.com>, <weiwan@google.com>, <cong.wang@bytedance.com>, <ap420073@gmail.com>, <netdev@vger.kernel.org>, <linux-kernel@vger.kernel.org>, <linuxarm@openeuler.org>, <mkl@pengutronix.de>, <linux-can@vger.kernel.org>, <jhs@mojatatu.com>, <xiyou.wangcong@gmail.com>, <jiri@resnulli.us>, <andrii@kernel.org>, <kafai@fb.com>, <songliubraving@fb.com>, <yhs@fb.com>, <john.fastabend@gmail.com>, <kpsingh@kernel.org>, <bpf@vger.kernel.org>, <jonas.bonn@netrounds.com>, <pabeni@redhat.com>, <mzhivich@akamai.com>, <johunt@akamai.com>, <albcamus@gmail.com>, <kehuan.feng@gmail.com>, <a.fatoum@pengutronix.de>, <atenart@kernel.org>, <alexander.duyck@gmail.com>, <hdanton@sina.com>, <jgross@suse.com>, <JKosina@suse.com>, <mkubecek@suse.cz>, <bjorn@kernel.org> Subject: [PATCH net v5 0/3] fix packet stuck problem for lockless qdisc Date: Thu, 6 May 2021 09:57:41 +0800 Message-ID: <1620266264-48109-1-git-send-email-linyunsheng@huawei.com> MIME-Version: 1.0 Content-Type: text/plain Precedence: bulk
Series	fix packet stuck problem for lockless qdisc \| expand [net,v5,0/3] fix packet stuck problem for lockless qdisc [net,v5,3/3] net: sched: fix tx action reschedule issue with stopped queue

Message ID

1620266264-48109-1-git-send-email-linyunsheng@huawei.com

Headers

From: Yunsheng Lin <linyunsheng@huawei.com>
To: <davem@davemloft.net>, <kuba@kernel.org>
CC: <olteanv@gmail.com>, <ast@kernel.org>, <daniel@iogearbox.net>,
	<andriin@fb.com>, <edumazet@google.com>, <weiwan@google.com>,
	<cong.wang@bytedance.com>, <ap420073@gmail.com>,
	<netdev@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<linuxarm@openeuler.org>, <mkl@pengutronix.de>,
	<linux-can@vger.kernel.org>, <jhs@mojatatu.com>,
	<xiyou.wangcong@gmail.com>, <jiri@resnulli.us>,
	<andrii@kernel.org>, <kafai@fb.com>, <songliubraving@fb.com>,
	<yhs@fb.com>, <john.fastabend@gmail.com>, <kpsingh@kernel.org>,
	<bpf@vger.kernel.org>, <jonas.bonn@netrounds.com>,
	<pabeni@redhat.com>, <mzhivich@akamai.com>, <johunt@akamai.com>,
	<albcamus@gmail.com>, <kehuan.feng@gmail.com>,
	<a.fatoum@pengutronix.de>, <atenart@kernel.org>,
	<alexander.duyck@gmail.com>, <hdanton@sina.com>, <jgross@suse.com>,
	<JKosina@suse.com>, <mkubecek@suse.cz>, <bjorn@kernel.org>
Subject: [PATCH net v5 0/3] fix packet stuck problem for lockless qdisc
Date: Thu, 6 May 2021 09:57:41 +0800
Message-ID: <1620266264-48109-1-git-send-email-linyunsheng@huawei.com>
MIME-Version: 1.0
Content-Type: text/plain
Precedence: bulk

Series

fix packet stuck problem for lockless qdisc | expand

Message

Yunsheng Lin May 6, 2021, 1:57 a.m. UTC

This patchset fixes the packet stuck problem mentioned in [1].

Patch 1: Add STATE_MISSED flag to fix packet stuck problem.
Patch 2: Fix a tx_action rescheduling problem after STATE_MISSED
         flag is added in patch 1.
Patch 3: Fix the significantly higher CPU consumption problem when
         multiple threads are competing on a saturated outgoing
         device.

V5: add patch 3 to fix the problem reported by Michal Kubecek.
V4: Change STATE_NEED_RESCHEDULE to STATE_MISSED and add patch 2.

[1]. https://lkml.org/lkml/2019/10/9/42

Yunsheng Lin (3):
  net: sched: fix packet stuck problem for lockless qdisc
  net: sched: fix endless tx action reschedule during deactivation
  net: sched: fix tx action reschedule issue with stopped queue

 include/net/pkt_sched.h   |  7 +------
 include/net/sch_generic.h | 37 ++++++++++++++++++++++++++++++++++++-
 net/core/dev.c            | 30 +++++++++++++++++++++++++-----
 net/sched/sch_generic.c   | 24 ++++++++++++++++++++++--
 4 files changed, 84 insertions(+), 14 deletions(-)

Comments

Jakub Kicinski May 7, 2021, 11:57 p.m. UTC | #1

On Thu, 6 May 2021 09:57:42 +0800 Yunsheng Lin wrote:
> @@ -159,8 +160,37 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)

>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)

>  {

>  	if (qdisc->flags & TCQ_F_NOLOCK) {

> +		bool dont_retry = test_bit(__QDISC_STATE_MISSED,

> +					   &qdisc->state);

> +

> +		if (spin_trylock(&qdisc->seqlock))

> +			goto nolock_empty;

> +

> +		/* If the flag is set before doing the spin_trylock() and

> +		 * the above spin_trylock() return false, it means other cpu

> +		 * holding the lock will do dequeuing for us, or it wil see


s/wil/will/

> +		 * the flag set after releasing lock and reschedule the

> +		 * net_tx_action() to do the dequeuing.


I don't understand why MISSED is checked before the trylock.
Could you explain why it can't be tested directly here?

> +		 */

> +		if (dont_retry)

> +			return false;

> +

> +		/* We could do set_bit() before the first spin_trylock(),

> +		 * and avoid doing second spin_trylock() completely, then

> +		 * we could have multi cpus doing the set_bit(). Here use

> +		 * dont_retry to avoid doing the set_bit() and the second

> +		 * spin_trylock(), which has 5% performance improvement than

> +		 * doing the set_bit() before the first spin_trylock().

> +		 */

> +		set_bit(__QDISC_STATE_MISSED, &qdisc->state);

> +

> +		/* Retry again in case other CPU may not see the new flag

> +		 * after it releases the lock at the end of qdisc_run_end().

> +		 */

>  		if (!spin_trylock(&qdisc->seqlock))

>  			return false;

> +

> +nolock_empty:

>  		WRITE_ONCE(qdisc->empty, false);

>  	} else if (qdisc_is_running(qdisc)) {

>  		return false;

> @@ -176,8 +206,13 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)

>  static inline void qdisc_run_end(struct Qdisc *qdisc)

>  {

>  	write_seqcount_end(&qdisc->running);

> -	if (qdisc->flags & TCQ_F_NOLOCK)

> +	if (qdisc->flags & TCQ_F_NOLOCK) {

>  		spin_unlock(&qdisc->seqlock);

> +

> +		if (unlikely(test_bit(__QDISC_STATE_MISSED,

> +				      &qdisc->state)))

> +			__netif_schedule(qdisc);

> +	}

>  }

>  

>  static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)

> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c

> index 44991ea..9bc73ea 100644

> --- a/net/sched/sch_generic.c

> +++ b/net/sched/sch_generic.c

> @@ -640,8 +640,10 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)

>  {

>  	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);

>  	struct sk_buff *skb = NULL;

> +	bool need_retry = true;

>  	int band;

>  

> +retry:

>  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {

>  		struct skb_array *q = band2list(priv, band);

>  

> @@ -652,6 +654,16 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)

>  	}

>  	if (likely(skb)) {

>  		qdisc_update_stats_at_dequeue(qdisc, skb);

> +	} else if (need_retry &&

> +		   test_and_clear_bit(__QDISC_STATE_MISSED,

> +				      &qdisc->state)) {


Why test_and_clear_bit() here? AFAICT this is the only place the bit 
is cleared. So the test and clear do not have to be atomic.

To my limited understanding on x86 test_bit() is never a locked
operation, while test_and_clear_bit() is always locked. So we'd save
an atomic operation in un-contended case if we tested first and then
cleared.

> +		/* do another dequeuing after clearing the flag to

> +		 * avoid calling __netif_schedule().

> +		 */

> +		smp_mb__after_atomic();


test_and_clear_bit() which returned true implies a memory barrier,
AFAIU, so the barrier is not needed with the code as is. It will be
needed if we switch to test_bit() + clear_bit(), but please clarify
what it is paring with.

> +		need_retry = false;

> +

> +		goto retry;

>  	} else {

>  		WRITE_ONCE(qdisc->empty, true);

>  	}

Yunsheng Lin May 8, 2021, 2:55 a.m. UTC | #2

On 2021/5/8 7:57, Jakub Kicinski wrote:
> On Thu, 6 May 2021 09:57:42 +0800 Yunsheng Lin wrote:

>> @@ -159,8 +160,37 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)

>>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)

>>  {

>>  	if (qdisc->flags & TCQ_F_NOLOCK) {

>> +		bool dont_retry = test_bit(__QDISC_STATE_MISSED,

>> +					   &qdisc->state);

>> +

>> +		if (spin_trylock(&qdisc->seqlock))

>> +			goto nolock_empty;

>> +

>> +		/* If the flag is set before doing the spin_trylock() and

>> +		 * the above spin_trylock() return false, it means other cpu

>> +		 * holding the lock will do dequeuing for us, or it wil see

> 

> s/wil/will/


Thanks.

> 

>> +		 * the flag set after releasing lock and reschedule the

>> +		 * net_tx_action() to do the dequeuing.

> 

> I don't understand why MISSED is checked before the trylock.

> Could you explain why it can't be tested directly here?

The initial thinking was:
Just like the set_bit() before the second trylock, If MISSED is set
before first trylock, it means other thread has set the MISSED flag
for this thread before doing the first trylock, so that this thread
does not need to do the set_bit().

But the initial thinking seems over thinking, as thread 3' setting the
MISSED before the second trylock has ensure either thread 3' second
trylock returns ture or thread 2 holding the lock will see the MISSED
flag, so thread 1 can do the test_bit() before or after the first
trylock, as below:

    thread 1                thread 2                    thread 3
                         holding q->seqlock
first trylock failed                              first trylock failed
                         unlock q->seqlock
                                               test_bit(MISSED) return false
                   test_bit(MISSED) return false
                          and not reschedule
                                                      set_bit(MISSED)
						      trylock success
test_bit(MISSED) retun ture
and not retry second trylock

If the above is correct, it seems we could:
1. do test_bit(MISSED) before the first trylock to avoid doing the
   first trylock for contended case.
or
2. do test_bit(MISSED) after the first trylock to avoid doing the
   test_bit() for un-contended case.

Which one do you prefer?

> 

>> +		 */

>> +		if (dont_retry)

>> +			return false;

>> +

>> +		/* We could do set_bit() before the first spin_trylock(),

>> +		 * and avoid doing second spin_trylock() completely, then

>> +		 * we could have multi cpus doing the set_bit(). Here use

>> +		 * dont_retry to avoid doing the set_bit() and the second

>> +		 * spin_trylock(), which has 5% performance improvement than

>> +		 * doing the set_bit() before the first spin_trylock().

>> +		 */

>> +		set_bit(__QDISC_STATE_MISSED, &qdisc->state);

>> +

>> +		/* Retry again in case other CPU may not see the new flag

>> +		 * after it releases the lock at the end of qdisc_run_end().

>> +		 */

>>  		if (!spin_trylock(&qdisc->seqlock))

>>  			return false;

>> +

>> +nolock_empty:

>>  		WRITE_ONCE(qdisc->empty, false);

>>  	} else if (qdisc_is_running(qdisc)) {

>>  		return false;

>> @@ -176,8 +206,13 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)

>>  static inline void qdisc_run_end(struct Qdisc *qdisc)

>>  {

>>  	write_seqcount_end(&qdisc->running);

>> -	if (qdisc->flags & TCQ_F_NOLOCK)

>> +	if (qdisc->flags & TCQ_F_NOLOCK) {

>>  		spin_unlock(&qdisc->seqlock);

>> +

>> +		if (unlikely(test_bit(__QDISC_STATE_MISSED,

>> +				      &qdisc->state)))

>> +			__netif_schedule(qdisc);

>> +	}

>>  }

>>  

>>  static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)

>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c

>> index 44991ea..9bc73ea 100644

>> --- a/net/sched/sch_generic.c

>> +++ b/net/sched/sch_generic.c

>> @@ -640,8 +640,10 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)

>>  {

>>  	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);

>>  	struct sk_buff *skb = NULL;

>> +	bool need_retry = true;

>>  	int band;

>>  

>> +retry:

>>  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {

>>  		struct skb_array *q = band2list(priv, band);

>>  

>> @@ -652,6 +654,16 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)

>>  	}

>>  	if (likely(skb)) {

>>  		qdisc_update_stats_at_dequeue(qdisc, skb);

>> +	} else if (need_retry &&

>> +		   test_and_clear_bit(__QDISC_STATE_MISSED,

>> +				      &qdisc->state)) {

> 

> Why test_and_clear_bit() here? AFAICT this is the only place the bit 

> is cleared. So the test and clear do not have to be atomic.


The the bit is also cleared in other place in patch 2/3, but within the
protection of q->seqlock too, so yes, the test and clear do not have to
be atomic for performance sake.

> 

> To my limited understanding on x86 test_bit() is never a locked

> operation, while test_and_clear_bit() is always locked. So we'd save

> an atomic operation in un-contended case if we tested first and then

> cleared.

> 

>> +		/* do another dequeuing after clearing the flag to

>> +		 * avoid calling __netif_schedule().

>> +		 */

>> +		smp_mb__after_atomic();

> 

> test_and_clear_bit() which returned true implies a memory barrier,

> AFAIU, so the barrier is not needed with the code as is. It will be

> needed if we switch to test_bit() + clear_bit(), but please clarify

> what it is paring with.


Ok.
Thanks for the reviewing.

> 

>> +		need_retry = false;

>> +

>> +		goto retry;

>>  	} else {

>>  		WRITE_ONCE(qdisc->empty, true);

>>  	}

> 

> 

> .

>

Jakub Kicinski May 8, 2021, 3:05 a.m. UTC | #3

On Sat, 8 May 2021 10:55:19 +0800 Yunsheng Lin wrote:
> >> +		 * the flag set after releasing lock and reschedule the

> >> +		 * net_tx_action() to do the dequeuing.  

> > 

> > I don't understand why MISSED is checked before the trylock.

> > Could you explain why it can't be tested directly here?  

> The initial thinking was:

> Just like the set_bit() before the second trylock, If MISSED is set

> before first trylock, it means other thread has set the MISSED flag

> for this thread before doing the first trylock, so that this thread

> does not need to do the set_bit().

> 

> But the initial thinking seems over thinking, as thread 3' setting the

> MISSED before the second trylock has ensure either thread 3' second

> trylock returns ture or thread 2 holding the lock will see the MISSED

> flag, so thread 1 can do the test_bit() before or after the first

> trylock, as below:

> 

>     thread 1                thread 2                    thread 3

>                          holding q->seqlock

> first trylock failed                              first trylock failed

>                          unlock q->seqlock

>                                                test_bit(MISSED) return false

>                    test_bit(MISSED) return false

>                           and not reschedule

>                                                       set_bit(MISSED)

> 						      trylock success

> test_bit(MISSED) retun ture

> and not retry second trylock

> 

> If the above is correct, it seems we could:

> 1. do test_bit(MISSED) before the first trylock to avoid doing the

>    first trylock for contended case.

> or

> 2. do test_bit(MISSED) after the first trylock to avoid doing the

>    test_bit() for un-contended case.

> 

> Which one do you prefer?


No strong preference but testing after the trylock seems more obvious
as it saves the temporary variable.

For the contended case could we potentially move or add a MISSED test
before even the first try_lock()? I'm not good at optimizing things, 
but it could save us the atomic op, right? (at least on x86)