[net] net: tcp: don't allocate fast clones for fastopen SYN

Message ID	20210302060753.953931-1-kuba@kernel.org
State	New
Headers	show Return-Path: <netdev-owner@kernel.org> From: Jakub Kicinski <kuba@kernel.org> To: davem@davemloft.net Cc: netdev@vger.kernel.org, edumazet@google.com, kernel-team@fb.com, Jakub Kicinski <kuba@kernel.org>, Neil Spring <ntspring@fb.com> Subject: [PATCH net] net: tcp: don't allocate fast clones for fastopen SYN Date: Mon, 1 Mar 2021 22:07:53 -0800 Message-Id: <20210302060753.953931-1-kuba@kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[net] net: tcp: don't allocate fast clones for fastopen SYN \| expand [net] net: tcp: don't allocate fast clones for fastopen SYN

Jakub Kicinski March 2, 2021, 6:07 a.m. UTC

When receiver does not accept TCP Fast Open it will only ack
the SYN, and not the data. We detect this and immediately queue
the data for (re)transmission in tcp_rcv_fastopen_synack().

In DC networks with very low RTT and without RFS the SYN-ACK
may arrive before NIC driver reported Tx completion on
the original SYN. In which case skb_still_in_host_queue()
returns true and sender will need to wait for the retransmission
timer to fire milliseconds later.

Revert back to non-fast clone skbs, this way
skb_still_in_host_queue() won't prevent the recovery flow
from completing.

Suggested-by: Eric Dumazet <edumazet@google.com>
Fixes: 355a901e6cf1 ("tcp: make connect() mem charging friendly")
Signed-off-by: Neil Spring <ntspring@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 net/ipv4/tcp_output.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Jakub Kicinski March 2, 2021, 5 p.m. UTC | #1

On Tue, 2 Mar 2021 10:38:46 +0100 Eric Dumazet wrote:
> On Tue, Mar 2, 2021 at 7:08 AM Jakub Kicinski <kuba@kernel.org> wrote:

> >

> > When receiver does not accept TCP Fast Open it will only ack

> > the SYN, and not the data. We detect this and immediately queue

> > the data for (re)transmission in tcp_rcv_fastopen_synack().

> >

> > In DC networks with very low RTT and without RFS the SYN-ACK

> > may arrive before NIC driver reported Tx completion on

> > the original SYN. In which case skb_still_in_host_queue()

> > returns true and sender will need to wait for the retransmission

> > timer to fire milliseconds later.

> >

> > Revert back to non-fast clone skbs, this way

> > skb_still_in_host_queue() won't prevent the recovery flow

> > from completing.

> >

> > Suggested-by: Eric Dumazet <edumazet@google.com>

> > Fixes: 355a901e6cf1 ("tcp: make connect() mem charging friendly")  

> 

> Hmmm, not sure if this Fixes: tag makes sense.

> 

> Really, if we delay TX completions by say 10 ms, other parts of the

> stack will misbehave anyway.

> 

> Also, backporting this patch up to linux-3.19 is going to be tricky.


Indeed, the problem is minor in practical terms. Maybe it's enough if I
spell that out more in the description? Are you thinking net-next or
net without a Fixes tag?

> The real issue here is that skb_still_in_host_queue() can give a false positive.

> 

> I have mixed feelings here, as you can read my answer :/

> 

> Maybe skb_still_in_host_queue() signal should not be used when a part

> of the SKB has been received/acknowledged by the remote peer

> (in this case the SYN part).


FWIW I was pondering this, when the rtx is requested by the receiver
we are relatively sure we can ignore skb_still_in_host_queue() because
we know our system should Tx in order so if receiver saw N + 1, N can't
be in our queues.

But AFAICT generalizing the test doesn't matter much. In cases other
than TFO worst case a loss probe will chase the rtx out. And I don't
grasp enough of TCP to implement the general optimization :)

> Alternative is that drivers unable to TX complete their skbs in a

> reasonable time should call skb_orphan()

>  to avoid skb_unclone() penalties (and this skb_still_in_host_queue() issue)

> 

> If you really want to play and delay TX completions, maybe provide a

> way to disable skb_still_in_host_queue() globally,

> using a static key ?


I see the TFO issue with rx and tx completions set to 33us both, 
with two different NIC vendors, so the timing just influences the
likelihood.

> My personal WIP/hack was something like :


LGTM, are you happy with that being the fix?

> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c

> index 69a545db80d2ead47ffcf2f3819a6d066e95f35d..666f6f0a6a06fece204199e07a79e21d1faf8f92

> 100644

> --- a/net/ipv4/tcp_input.c

> +++ b/net/ipv4/tcp_input.c

> @@ -5995,7 +5995,8 @@ static bool tcp_rcv_fastopen_synack(struct sock

> *sk, struct sk_buff *synack,

>                 else

>                         tp->fastopen_client_fail = TFO_DATA_NOT_ACKED;

>                 skb_rbtree_walk_from(data) {

> -                       if (__tcp_retransmit_skb(sk, data, 1))

> +                       /* segs = -1 to bypass

> skb_still_in_host_queue() check */

> +                       if (__tcp_retransmit_skb(sk, data, -1))

>                                 break;

>                 }

>                 tcp_rearm_rto(sk);

> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c

> index fbf140a770d8e21b936369b79abbe9857537acd8..1d1489e596976e352fe7d5ccee7a6eae55fdbcce

> 100644

> --- a/net/ipv4/tcp_output.c

> +++ b/net/ipv4/tcp_output.c

> @@ -3155,8 +3155,12 @@ int __tcp_retransmit_skb(struct sock *sk,

> struct sk_buff *skb, int segs)

>                   sk->sk_sndbuf))

>                 return -EAGAIN;

> 

> -       if (skb_still_in_host_queue(sk, skb))

> -               return -EBUSY;

> +       if (segs > 0) {

> +               if (skb_still_in_host_queue(sk, skb))

> +                       return -EBUSY;

> +       } else {

> +               segs = -segs;

> +       }

> 

>         if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {

>                 if (unlikely(before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))) {

Eric Dumazet March 2, 2021, 5:02 p.m. UTC | #2

On Tue, Mar 2, 2021 at 6:00 PM Jakub Kicinski <kuba@kernel.org> wrote:
>

> On Tue, 2 Mar 2021 10:38:46 +0100 Eric Dumazet wrote:

> > On Tue, Mar 2, 2021 at 7:08 AM Jakub Kicinski <kuba@kernel.org> wrote:

> > >

> > > When receiver does not accept TCP Fast Open it will only ack

> > > the SYN, and not the data. We detect this and immediately queue

> > > the data for (re)transmission in tcp_rcv_fastopen_synack().

> > >

> > > In DC networks with very low RTT and without RFS the SYN-ACK

> > > may arrive before NIC driver reported Tx completion on

> > > the original SYN. In which case skb_still_in_host_queue()

> > > returns true and sender will need to wait for the retransmission

> > > timer to fire milliseconds later.

> > >

> > > Revert back to non-fast clone skbs, this way

> > > skb_still_in_host_queue() won't prevent the recovery flow

> > > from completing.

> > >

> > > Suggested-by: Eric Dumazet <edumazet@google.com>

> > > Fixes: 355a901e6cf1 ("tcp: make connect() mem charging friendly")

> >

> > Hmmm, not sure if this Fixes: tag makes sense.

> >

> > Really, if we delay TX completions by say 10 ms, other parts of the

> > stack will misbehave anyway.

> >

> > Also, backporting this patch up to linux-3.19 is going to be tricky.

>

> Indeed, the problem is minor in practical terms. Maybe it's enough if I

> spell that out more in the description? Are you thinking net-next or

> net without a Fixes tag?

>

> > The real issue here is that skb_still_in_host_queue() can give a false positive.

> >

> > I have mixed feelings here, as you can read my answer :/

> >

> > Maybe skb_still_in_host_queue() signal should not be used when a part

> > of the SKB has been received/acknowledged by the remote peer

> > (in this case the SYN part).

>

> FWIW I was pondering this, when the rtx is requested by the receiver

> we are relatively sure we can ignore skb_still_in_host_queue() because

> we know our system should Tx in order so if receiver saw N + 1, N can't

> be in our queues.

>

> But AFAICT generalizing the test doesn't matter much. In cases other

> than TFO worst case a loss probe will chase the rtx out. And I don't

> grasp enough of TCP to implement the general optimization :)

>

> > Alternative is that drivers unable to TX complete their skbs in a

> > reasonable time should call skb_orphan()

> >  to avoid skb_unclone() penalties (and this skb_still_in_host_queue() issue)

> >

> > If you really want to play and delay TX completions, maybe provide a

> > way to disable skb_still_in_host_queue() globally,

> > using a static key ?

>

> I see the TFO issue with rx and tx completions set to 33us both,

> with two different NIC vendors, so the timing just influences the

> likelihood.

>

> > My personal WIP/hack was something like :

>

> LGTM, are you happy with that being the fix?



Yes, this seems a bit less intrusive, and net-next should be fine
(I guess FB can backport this early if needed)

>

> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c

> > index 69a545db80d2ead47ffcf2f3819a6d066e95f35d..666f6f0a6a06fece204199e07a79e21d1faf8f92

> > 100644

> > --- a/net/ipv4/tcp_input.c

> > +++ b/net/ipv4/tcp_input.c

> > @@ -5995,7 +5995,8 @@ static bool tcp_rcv_fastopen_synack(struct sock

> > *sk, struct sk_buff *synack,

> >                 else

> >                         tp->fastopen_client_fail = TFO_DATA_NOT_ACKED;

> >                 skb_rbtree_walk_from(data) {

> > -                       if (__tcp_retransmit_skb(sk, data, 1))

> > +                       /* segs = -1 to bypass

> > skb_still_in_host_queue() check */

> > +                       if (__tcp_retransmit_skb(sk, data, -1))

> >                                 break;

> >                 }

> >                 tcp_rearm_rto(sk);

> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c

> > index fbf140a770d8e21b936369b79abbe9857537acd8..1d1489e596976e352fe7d5ccee7a6eae55fdbcce

> > 100644

> > --- a/net/ipv4/tcp_output.c

> > +++ b/net/ipv4/tcp_output.c

> > @@ -3155,8 +3155,12 @@ int __tcp_retransmit_skb(struct sock *sk,

> > struct sk_buff *skb, int segs)

> >                   sk->sk_sndbuf))

> >                 return -EAGAIN;

> >

> > -       if (skb_still_in_host_queue(sk, skb))

> > -               return -EBUSY;

> > +       if (segs > 0) {

> > +               if (skb_still_in_host_queue(sk, skb))

> > +                       return -EBUSY;

> > +       } else {

> > +               segs = -segs;

> > +       }

> >

> >         if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {

> >                 if (unlikely(before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))) {

>

Yuchung Cheng March 2, 2021, 8:52 p.m. UTC | #3

On Tue, Mar 2, 2021 at 11:58 AM Eric Dumazet <edumazet@google.com> wrote:
>

> On Tue, Mar 2, 2021 at 7:08 AM Jakub Kicinski <kuba@kernel.org> wrote:

> >

> > When receiver does not accept TCP Fast Open it will only ack

> > the SYN, and not the data. We detect this and immediately queue

> > the data for (re)transmission in tcp_rcv_fastopen_synack().

> >

> > In DC networks with very low RTT and without RFS the SYN-ACK

> > may arrive before NIC driver reported Tx completion on

> > the original SYN. In which case skb_still_in_host_queue()

> > returns true and sender will need to wait for the retransmission

> > timer to fire milliseconds later.

> >

> > Revert back to non-fast clone skbs, this way

> > skb_still_in_host_queue() won't prevent the recovery flow

> > from completing.

> >

> > Suggested-by: Eric Dumazet <edumazet@google.com>

> > Fixes: 355a901e6cf1 ("tcp: make connect() mem charging friendly")

>

> Hmmm, not sure if this Fixes: tag makes sense.

>

> Really, if we delay TX completions by say 10 ms, other parts of the

> stack will misbehave anyway.

>

> Also, backporting this patch up to linux-3.19 is going to be tricky.

>

> The real issue here is that skb_still_in_host_queue() can give a false positive.

>

> I have mixed feelings here, as you can read my answer :/

>

> Maybe skb_still_in_host_queue() signal should not be used when a part

> of the SKB has been received/acknowledged by the remote peer

> (in this case the SYN part).

Thank you Eric and Jakub for working on the TFO issue.

I like this option the most because it's more generic and easy to understand. Is
it easy to implement by checking snd_una etc?




>

> Alternative is that drivers unable to TX complete their skbs in a

> reasonable time should call skb_orphan()

>  to avoid skb_unclone() penalties (and this skb_still_in_host_queue() issue)

>

> If you really want to play and delay TX completions, maybe provide a

> way to disable skb_still_in_host_queue() globally,

> using a static key ?

>

> My personal WIP/hack was something like :

>

>

> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c

> index 69a545db80d2ead47ffcf2f3819a6d066e95f35d..666f6f0a6a06fece204199e07a79e21d1faf8f92

> 100644

> --- a/net/ipv4/tcp_input.c

> +++ b/net/ipv4/tcp_input.c

> @@ -5995,7 +5995,8 @@ static bool tcp_rcv_fastopen_synack(struct sock

> *sk, struct sk_buff *synack,

>                 else

>                         tp->fastopen_client_fail = TFO_DATA_NOT_ACKED;

>                 skb_rbtree_walk_from(data) {

> -                       if (__tcp_retransmit_skb(sk, data, 1))

> +                       /* segs = -1 to bypass

> skb_still_in_host_queue() check */

> +                       if (__tcp_retransmit_skb(sk, data, -1))

>                                 break;

>                 }

>                 tcp_rearm_rto(sk);

> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c

> index fbf140a770d8e21b936369b79abbe9857537acd8..1d1489e596976e352fe7d5ccee7a6eae55fdbcce

> 100644

> --- a/net/ipv4/tcp_output.c

> +++ b/net/ipv4/tcp_output.c

> @@ -3155,8 +3155,12 @@ int __tcp_retransmit_skb(struct sock *sk,

> struct sk_buff *skb, int segs)

>                   sk->sk_sndbuf))

>                 return -EAGAIN;

>

> -       if (skb_still_in_host_queue(sk, skb))

> -               return -EBUSY;

> +       if (segs > 0) {

> +               if (skb_still_in_host_queue(sk, skb))

> +                       return -EBUSY;

> +       } else {

> +               segs = -segs;

> +       }

>

>         if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {

>                 if (unlikely(before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))) {

>

>

> > Signed-off-by: Neil Spring <ntspring@fb.com>

> > Signed-off-by: Jakub Kicinski <kuba@kernel.org>

> > ---

> >  net/ipv4/tcp_output.c | 9 ++++++++-

> >  1 file changed, 8 insertions(+), 1 deletion(-)

> >

> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c

> > index fbf140a770d8..cd9461588539 100644

> > --- a/net/ipv4/tcp_output.c

> > +++ b/net/ipv4/tcp_output.c

> > @@ -3759,9 +3759,16 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)

> >         /* limit to order-0 allocations */

> >         space = min_t(size_t, space, SKB_MAX_HEAD(MAX_TCP_HEADER));

> >

> > -       syn_data = sk_stream_alloc_skb(sk, space, sk->sk_allocation, false);

> > +       syn_data = alloc_skb(MAX_TCP_HEADER + space, sk->sk_allocation);

> >         if (!syn_data)

> >                 goto fallback;

> > +       if (!sk_wmem_schedule(sk, syn_data->truesize)) {

> > +               __kfree_skb(syn_data);

> > +               goto fallback;

> > +       }

> > +       skb_reserve(syn_data, MAX_TCP_HEADER);

> > +       INIT_LIST_HEAD(&syn_data->tcp_tsorted_anchor);

> > +

> >         syn_data->ip_summed = CHECKSUM_PARTIAL;

> >         memcpy(syn_data->cb, syn->cb, sizeof(syn->cb));

> >         if (space) {

> > --

> > 2.26.2

> >

Jakub Kicinski March 2, 2021, 10 p.m. UTC | #4

On Tue, 2 Mar 2021 12:52:14 -0800 Yuchung Cheng wrote:
> On Tue, Mar 2, 2021 at 11:58 AM Eric Dumazet <edumazet@google.com> wrote:

> > On Tue, Mar 2, 2021 at 7:08 AM Jakub Kicinski <kuba@kernel.org> wrote:  

> > > When receiver does not accept TCP Fast Open it will only ack

> > > the SYN, and not the data. We detect this and immediately queue

> > > the data for (re)transmission in tcp_rcv_fastopen_synack().

> > >

> > > In DC networks with very low RTT and without RFS the SYN-ACK

> > > may arrive before NIC driver reported Tx completion on

> > > the original SYN. In which case skb_still_in_host_queue()

> > > returns true and sender will need to wait for the retransmission

> > > timer to fire milliseconds later.

> > >

> > > Revert back to non-fast clone skbs, this way

> > > skb_still_in_host_queue() won't prevent the recovery flow

> > > from completing.

> > >

> > > Suggested-by: Eric Dumazet <edumazet@google.com>

> > > Fixes: 355a901e6cf1 ("tcp: make connect() mem charging friendly")  

> >

> > Hmmm, not sure if this Fixes: tag makes sense.

> >

> > Really, if we delay TX completions by say 10 ms, other parts of the

> > stack will misbehave anyway.

> >

> > Also, backporting this patch up to linux-3.19 is going to be tricky.

> >

> > The real issue here is that skb_still_in_host_queue() can give a false positive.

> >

> > I have mixed feelings here, as you can read my answer :/

> >

> > Maybe skb_still_in_host_queue() signal should not be used when a part

> > of the SKB has been received/acknowledged by the remote peer

> > (in this case the SYN part).  

> 

> Thank you Eric and Jakub for working on the TFO issue.

> 

> I like this option the most because it's more generic and easy to understand. 


I'm assuming by "this" you mean v1 - as far as understanding goes we
can polish v2. I think Eric just shared a quick example, but perhaps 
we could add an explicit bool to __tcp_retransmit_skb() called expl_req
or such to signify that the rtx was explicitly requested by the
receiver and therefore we can skip the "in queue" check? Or add some
flags to the call?

If everyone is okay with targeting -next the new argument won't be an
issue.

> Is it easy to implement by checking snd_una etc?


That is to detect TFO retransmits in __tcp_retransmit_skb() by matching
the connections state, like acked == 1?

Alexander Duyck March 3, 2021, 9:35 p.m. UTC | #5

On Tue, Mar 2, 2021 at 1:37 PM Eric Dumazet <edumazet@google.com> wrote:
>

> On Tue, Mar 2, 2021 at 7:08 AM Jakub Kicinski <kuba@kernel.org> wrote:

> >

> > When receiver does not accept TCP Fast Open it will only ack

> > the SYN, and not the data. We detect this and immediately queue

> > the data for (re)transmission in tcp_rcv_fastopen_synack().

> >

> > In DC networks with very low RTT and without RFS the SYN-ACK

> > may arrive before NIC driver reported Tx completion on

> > the original SYN. In which case skb_still_in_host_queue()

> > returns true and sender will need to wait for the retransmission

> > timer to fire milliseconds later.

> >

> > Revert back to non-fast clone skbs, this way

> > skb_still_in_host_queue() won't prevent the recovery flow

> > from completing.

> >

> > Suggested-by: Eric Dumazet <edumazet@google.com>

> > Fixes: 355a901e6cf1 ("tcp: make connect() mem charging friendly")

>

> Hmmm, not sure if this Fixes: tag makes sense.

>

> Really, if we delay TX completions by say 10 ms, other parts of the

> stack will misbehave anyway.

>

> Also, backporting this patch up to linux-3.19 is going to be tricky.

>

> The real issue here is that skb_still_in_host_queue() can give a false positive.

>

> I have mixed feelings here, as you can read my answer :/

>

> Maybe skb_still_in_host_queue() signal should not be used when a part

> of the SKB has been received/acknowledged by the remote peer

> (in this case the SYN part).

>

> Alternative is that drivers unable to TX complete their skbs in a

> reasonable time should call skb_orphan()

>  to avoid skb_unclone() penalties (and this skb_still_in_host_queue() issue)

>

> If you really want to play and delay TX completions, maybe provide a

> way to disable skb_still_in_host_queue() globally,

> using a static key ?


The problem as I see it is that the original fclone isn't what we sent
out on the wire and that is confusing things. What we sent was a SYN
with data, but what we have now is just a data frame that hasn't been
put out on the wire yet.

I wonder if we couldn't get away with doing something like adding a
fourth option of SKB_FCLONE_MODIFIED that we could apply to fastopen
skbs? That would keep the skb_still_in_host queue from triggering as
we would be changing the state from SKB_FCLONE_ORIG to
SKB_FCLONE_MODIFIED for the skb we store in the retransmit queue. In
addition if we have to clone it again and the fclone reference count
is 1 we could reset it back to SKB_FCLONE_ORIG.

Jakub Kicinski March 4, 2021, 12:07 a.m. UTC | #6

On Wed, 3 Mar 2021 13:35:53 -0800 Alexander Duyck wrote:
> On Tue, Mar 2, 2021 at 1:37 PM Eric Dumazet <edumazet@google.com> wrote:

> > On Tue, Mar 2, 2021 at 7:08 AM Jakub Kicinski <kuba@kernel.org> wrote:  

> > > When receiver does not accept TCP Fast Open it will only ack

> > > the SYN, and not the data. We detect this and immediately queue

> > > the data for (re)transmission in tcp_rcv_fastopen_synack().

> > >

> > > In DC networks with very low RTT and without RFS the SYN-ACK

> > > may arrive before NIC driver reported Tx completion on

> > > the original SYN. In which case skb_still_in_host_queue()

> > > returns true and sender will need to wait for the retransmission

> > > timer to fire milliseconds later.

> > >

> > > Revert back to non-fast clone skbs, this way

> > > skb_still_in_host_queue() won't prevent the recovery flow

> > > from completing.

> > >

> > > Suggested-by: Eric Dumazet <edumazet@google.com>

> > > Fixes: 355a901e6cf1 ("tcp: make connect() mem charging friendly")  

> >

> > Hmmm, not sure if this Fixes: tag makes sense.

> >

> > Really, if we delay TX completions by say 10 ms, other parts of the

> > stack will misbehave anyway.

> >

> > Also, backporting this patch up to linux-3.19 is going to be tricky.

> >

> > The real issue here is that skb_still_in_host_queue() can give a false positive.

> >

> > I have mixed feelings here, as you can read my answer :/

> >

> > Maybe skb_still_in_host_queue() signal should not be used when a part

> > of the SKB has been received/acknowledged by the remote peer

> > (in this case the SYN part).

> >

> > Alternative is that drivers unable to TX complete their skbs in a

> > reasonable time should call skb_orphan()

> >  to avoid skb_unclone() penalties (and this skb_still_in_host_queue() issue)

> >

> > If you really want to play and delay TX completions, maybe provide a

> > way to disable skb_still_in_host_queue() globally,

> > using a static key ?  

> 

> The problem as I see it is that the original fclone isn't what we sent

> out on the wire and that is confusing things. What we sent was a SYN

> with data, but what we have now is just a data frame that hasn't been

> put out on the wire yet.


Not sure I understand why it's the key distinction here. Is it
re-transmitting part of the frame or having different flags?
Is re-transmit of half of a GSO skb also considered not the same?

To me the distinction is that the receiver has implicitly asked
us for the re-transmission. If it was requested by SACK we should 
ignore "in_queue" for the first transmission as well, even if the
skb state is identical.

> I wonder if we couldn't get away with doing something like adding a

> fourth option of SKB_FCLONE_MODIFIED that we could apply to fastopen

> skbs? That would keep the skb_still_in_host queue from triggering as

> we would be changing the state from SKB_FCLONE_ORIG to

> SKB_FCLONE_MODIFIED for the skb we store in the retransmit queue. In

> addition if we have to clone it again and the fclone reference count

> is 1 we could reset it back to SKB_FCLONE_ORIG.


The unused value of fclone was tempting me as well :)

AFAICT we have at least these options:

1 - don't use a fclone skb [v1]

2 - mark the fclone as "special" at Tx to escape the "in queue" check

3 - indicate to retansmit that we're sure initial tx is out [v2]

4 - same as above but with a bool / flag instead of negative seg

5 - use the fclone bits but mark them at Rx when we see a rtx request

6 - check the skb state in retransmit to match the TFO case (IIUC
    Yuchung's suggestion)

#5 is my favorite but I didn't know how to extend it to fast
re-transmits so I just stuck to the suggestion from the ML :)

WDYT? Eric, Yuchung?

Alexander Duyck March 4, 2021, 2:45 a.m. UTC | #7

On Wed, Mar 3, 2021 at 4:07 PM Jakub Kicinski <kuba@kernel.org> wrote:
>

> On Wed, 3 Mar 2021 13:35:53 -0800 Alexander Duyck wrote:

> > On Tue, Mar 2, 2021 at 1:37 PM Eric Dumazet <edumazet@google.com> wrote:

> > > On Tue, Mar 2, 2021 at 7:08 AM Jakub Kicinski <kuba@kernel.org> wrote:

> > > > When receiver does not accept TCP Fast Open it will only ack

> > > > the SYN, and not the data. We detect this and immediately queue

> > > > the data for (re)transmission in tcp_rcv_fastopen_synack().

> > > >

> > > > In DC networks with very low RTT and without RFS the SYN-ACK

> > > > may arrive before NIC driver reported Tx completion on

> > > > the original SYN. In which case skb_still_in_host_queue()

> > > > returns true and sender will need to wait for the retransmission

> > > > timer to fire milliseconds later.

> > > >

> > > > Revert back to non-fast clone skbs, this way

> > > > skb_still_in_host_queue() won't prevent the recovery flow

> > > > from completing.

> > > >

> > > > Suggested-by: Eric Dumazet <edumazet@google.com>

> > > > Fixes: 355a901e6cf1 ("tcp: make connect() mem charging friendly")

> > >

> > > Hmmm, not sure if this Fixes: tag makes sense.

> > >

> > > Really, if we delay TX completions by say 10 ms, other parts of the

> > > stack will misbehave anyway.

> > >

> > > Also, backporting this patch up to linux-3.19 is going to be tricky.

> > >

> > > The real issue here is that skb_still_in_host_queue() can give a false positive.

> > >

> > > I have mixed feelings here, as you can read my answer :/

> > >

> > > Maybe skb_still_in_host_queue() signal should not be used when a part

> > > of the SKB has been received/acknowledged by the remote peer

> > > (in this case the SYN part).

> > >

> > > Alternative is that drivers unable to TX complete their skbs in a

> > > reasonable time should call skb_orphan()

> > >  to avoid skb_unclone() penalties (and this skb_still_in_host_queue() issue)

> > >

> > > If you really want to play and delay TX completions, maybe provide a

> > > way to disable skb_still_in_host_queue() globally,

> > > using a static key ?

> >

> > The problem as I see it is that the original fclone isn't what we sent

> > out on the wire and that is confusing things. What we sent was a SYN

> > with data, but what we have now is just a data frame that hasn't been

> > put out on the wire yet.

>

> Not sure I understand why it's the key distinction here. Is it

> re-transmitting part of the frame or having different flags?

> Is re-transmit of half of a GSO skb also considered not the same?


The difference in my mind is the flags. So specifically the clone of
the syn_data frame in the case of the TCP fast open isn't actually a
clone of the sent frame. Instead we end up modifying the flags so that
it becomes the first data frame. We already have the SYN sitting in
the retransmit queue before we send the SYN w/ data frame. In addition
the SYN packet in the retransmit queue has a reference count of 1 so
it is not encumbered by the fclone reference count check so it could
theoretically be retransmitted immediately, it is just the data packet
that is being held.

If we replay a GSO frame we will get the same frames all over again.
In the case of a TCP fast open syn_data packet that isn't the case.
The first time out it is one packet, the second time it is two.

> To me the distinction is that the receiver has implicitly asked

> us for the re-transmission. If it was requested by SACK we should

> ignore "in_queue" for the first transmission as well, even if the

> skb state is identical.


In my mind the distinction is the fact that what we have in the
retransmit queue is 2 frames, a SYN and a data. Whereas what we have
put on the wire is SYN w/ data.

> > I wonder if we couldn't get away with doing something like adding a

> > fourth option of SKB_FCLONE_MODIFIED that we could apply to fastopen

> > skbs? That would keep the skb_still_in_host queue from triggering as

> > we would be changing the state from SKB_FCLONE_ORIG to

> > SKB_FCLONE_MODIFIED for the skb we store in the retransmit queue. In

> > addition if we have to clone it again and the fclone reference count

> > is 1 we could reset it back to SKB_FCLONE_ORIG.

>

> The unused value of fclone was tempting me as well :)

>

> AFAICT we have at least these options:

>

> 1 - don't use a fclone skb [v1]

>

> 2 - mark the fclone as "special" at Tx to escape the "in queue" check


This is what I had in mind. Basically just add a function to
transition from SKB_FCLONE_ORIG to SKB_FCLONE_MODIFIED/DIRTY when the
clone no longer represents what is actually on the wire.

The added benefit is that we can potentially restore it to
SKB_FCLONE_ORIG if we clone it again assuming the reference count fell
back to 1.

> 3 - indicate to retansmit that we're sure initial tx is out [v2]

>

> 4 - same as above but with a bool / flag instead of negative seg

>

> 5 - use the fclone bits but mark them at Rx when we see a rtx request

>

> 6 - check the skb state in retransmit to match the TFO case (IIUC

>     Yuchung's suggestion)

>

> #5 is my favorite but I didn't know how to extend it to fast

> re-transmits so I just stuck to the suggestion from the ML :)

>

> WDYT? Eric, Yuchung?

Eric Dumazet March 4, 2021, 12:51 p.m. UTC | #8

On Thu, Mar 4, 2021 at 3:45 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>

> On Wed, Mar 3, 2021 at 4:07 PM Jakub Kicinski <kuba@kernel.org> wrote:

> >

> > On Wed, 3 Mar 2021 13:35:53 -0800 Alexander Duyck wrote:

> > > On Tue, Mar 2, 2021 at 1:37 PM Eric Dumazet <edumazet@google.com> wrote:

> > > > On Tue, Mar 2, 2021 at 7:08 AM Jakub Kicinski <kuba@kernel.org> wrote:

> > > > > When receiver does not accept TCP Fast Open it will only ack

> > > > > the SYN, and not the data. We detect this and immediately queue

> > > > > the data for (re)transmission in tcp_rcv_fastopen_synack().

> > > > >

> > > > > In DC networks with very low RTT and without RFS the SYN-ACK

> > > > > may arrive before NIC driver reported Tx completion on

> > > > > the original SYN. In which case skb_still_in_host_queue()

> > > > > returns true and sender will need to wait for the retransmission

> > > > > timer to fire milliseconds later.

> > > > >

> > > > > Revert back to non-fast clone skbs, this way

> > > > > skb_still_in_host_queue() won't prevent the recovery flow

> > > > > from completing.

> > > > >

> > > > > Suggested-by: Eric Dumazet <edumazet@google.com>

> > > > > Fixes: 355a901e6cf1 ("tcp: make connect() mem charging friendly")

> > > >

> > > > Hmmm, not sure if this Fixes: tag makes sense.

> > > >

> > > > Really, if we delay TX completions by say 10 ms, other parts of the

> > > > stack will misbehave anyway.

> > > >

> > > > Also, backporting this patch up to linux-3.19 is going to be tricky.

> > > >

> > > > The real issue here is that skb_still_in_host_queue() can give a false positive.

> > > >

> > > > I have mixed feelings here, as you can read my answer :/

> > > >

> > > > Maybe skb_still_in_host_queue() signal should not be used when a part

> > > > of the SKB has been received/acknowledged by the remote peer

> > > > (in this case the SYN part).

> > > >

> > > > Alternative is that drivers unable to TX complete their skbs in a

> > > > reasonable time should call skb_orphan()

> > > >  to avoid skb_unclone() penalties (and this skb_still_in_host_queue() issue)

> > > >

> > > > If you really want to play and delay TX completions, maybe provide a

> > > > way to disable skb_still_in_host_queue() globally,

> > > > using a static key ?

> > >

> > > The problem as I see it is that the original fclone isn't what we sent

> > > out on the wire and that is confusing things. What we sent was a SYN

> > > with data, but what we have now is just a data frame that hasn't been

> > > put out on the wire yet.

> >

> > Not sure I understand why it's the key distinction here. Is it

> > re-transmitting part of the frame or having different flags?

> > Is re-transmit of half of a GSO skb also considered not the same?

>

> The difference in my mind is the flags. So specifically the clone of

> the syn_data frame in the case of the TCP fast open isn't actually a

> clone of the sent frame. Instead we end up modifying the flags so that

> it becomes the first data frame. We already have the SYN sitting in

> the retransmit queue before we send the SYN w/ data frame. In addition

> the SYN packet in the retransmit queue has a reference count of 1 so

> it is not encumbered by the fclone reference count check so it could

> theoretically be retransmitted immediately, it is just the data packet

> that is being held.

>

> If we replay a GSO frame we will get the same frames all over again.

> In the case of a TCP fast open syn_data packet that isn't the case.

> The first time out it is one packet, the second time it is two.

>

> > To me the distinction is that the receiver has implicitly asked

> > us for the re-transmission. If it was requested by SACK we should

> > ignore "in_queue" for the first transmission as well, even if the

> > skb state is identical.

>

> In my mind the distinction is the fact that what we have in the

> retransmit queue is 2 frames, a SYN and a data. Whereas what we have

> put on the wire is SYN w/ data.

>

> > > I wonder if we couldn't get away with doing something like adding a

> > > fourth option of SKB_FCLONE_MODIFIED that we could apply to fastopen

> > > skbs? That would keep the skb_still_in_host queue from triggering as

> > > we would be changing the state from SKB_FCLONE_ORIG to

> > > SKB_FCLONE_MODIFIED for the skb we store in the retransmit queue. In

> > > addition if we have to clone it again and the fclone reference count

> > > is 1 we could reset it back to SKB_FCLONE_ORIG.

> >

> > The unused value of fclone was tempting me as well :)

> >

> > AFAICT we have at least these options:

> >

> > 1 - don't use a fclone skb [v1]

> >

> > 2 - mark the fclone as "special" at Tx to escape the "in queue" check

>

> This is what I had in mind. Basically just add a function to

> transition from SKB_FCLONE_ORIG to SKB_FCLONE_MODIFIED/DIRTY when the

> clone no longer represents what is actually on the wire.

>

> The added benefit is that we can potentially restore it to

> SKB_FCLONE_ORIG if we clone it again assuming the reference count fell

> back to 1.

>



I think we are over thinking this really (especially if the fix needs
a change in core networking or drivers)

We can reuse TSQ logic to have a chance to recover when the clone is
eventually freed.
This will be more generic, not only for the SYN+data of FastOpen.

Can you please test the following patch ?

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6d0a33d1c0db6defaa005335c3e285f0f1ac686c..43317b40df54718edf9165ca2018da92ddfeab62
100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1136,7 +1136,7 @@ static inline bool skb_fclone_busy(const struct sock *sk,

        return skb->fclone == SKB_FCLONE_ORIG &&
               refcount_read(&fclones->fclone_ref) > 1 &&
-              fclones->skb2.sk == sk;
+              READ_ONCE(fclones->skb2.sk) == sk;
 }

 /**
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fbf140a770d8e21b936369b79abbe9857537acd8..0996e12e6c75121030e5db2fc8f2bd6d07037164
100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2775,13 +2775,22 @@ bool tcp_schedule_loss_probe(struct sock *sk,
bool advancing_rto)
  * a packet is still in a qdisc or driver queue.
  * In this case, there is very little point doing a retransmit !
  */
-static bool skb_still_in_host_queue(const struct sock *sk,
+static bool skb_still_in_host_queue(struct sock *sk,
                                    const struct sk_buff *skb)
 {
        if (unlikely(skb_fclone_busy(sk, skb))) {
-               NET_INC_STATS(sock_net(sk),
-                             LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES);
-               return true;
+               set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
+               /* If TX completion is really slow, we could have
received an ACK
+                * for a packet not yet freed by the driver.
+                * Setting TSQ_THROTTLED makes sure we will attempt to
retransmit
+                * later when fclone is no longer busy.
+                */
+               smp_mb__after_atomic();
+               if (skb_fclone_busy(sk, skb)) {
+                       NET_INC_STATS(sock_net(sk),
+                                     LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES);
+                       return true;
+               }
        }
        return false;
 }

Jakub Kicinski March 4, 2021, 7:06 p.m. UTC | #9

On Thu, 4 Mar 2021 13:51:15 +0100 Eric Dumazet wrote:
> I think we are over thinking this really (especially if the fix needs
> a change in core networking or drivers)
> 
> We can reuse TSQ logic to have a chance to recover when the clone is
> eventually freed.
> This will be more generic, not only for the SYN+data of FastOpen.
> 
> Can you please test the following patch ?

#7 - Eric comes up with something much better :)


But so far doesn't seem to quite do it, I'm looking but maybe you'll
know right away (FWIW testing a v5.6 backport but I don't think TSQ
changed?):

On __tcp_retransmit_skb kretprobe:

==> Hit TFO case ret:-16 ca_state:0 skb:ffff888fdb4bac00!

First hit:
        __tcp_retransmit_skb+1
        tcp_rcv_state_process+2488
        tcp_v6_do_rcv+405
        tcp_v6_rcv+2984
        ip6_protocol_deliver_rcu+180
        ip6_input_finish+17

Successful hit:
        __tcp_retransmit_skb+1
        tcp_retransmit_skb+18
        tcp_retransmit_timer+716
        tcp_write_timer_handler+136
        tcp_write_timer+141
        call_timer_fn+43

 skb:ffff888fdb4bac00 --- delay:51642us bytes_acked:1

Eric Dumazet March 4, 2021, 7:41 p.m. UTC | #10

On Thu, Mar 4, 2021 at 8:06 PM Jakub Kicinski <kuba@kernel.org> wrote:
>

> On Thu, 4 Mar 2021 13:51:15 +0100 Eric Dumazet wrote:

> > I think we are over thinking this really (especially if the fix needs

> > a change in core networking or drivers)

> >

> > We can reuse TSQ logic to have a chance to recover when the clone is

> > eventually freed.

> > This will be more generic, not only for the SYN+data of FastOpen.

> >

> > Can you please test the following patch ?

>

> #7 - Eric comes up with something much better :)

>

>

> But so far doesn't seem to quite do it, I'm looking but maybe you'll

> know right away (FWIW testing a v5.6 backport but I don't think TSQ

> changed?):

>

> On __tcp_retransmit_skb kretprobe:

>

> ==> Hit TFO case ret:-16 ca_state:0 skb:ffff888fdb4bac00!

>

> First hit:

>         __tcp_retransmit_skb+1

>         tcp_rcv_state_process+2488

>         tcp_v6_do_rcv+405

>         tcp_v6_rcv+2984

>         ip6_protocol_deliver_rcu+180

>         ip6_input_finish+17

>

> Successful hit:

>         __tcp_retransmit_skb+1

>         tcp_retransmit_skb+18

>         tcp_retransmit_timer+716

>         tcp_write_timer_handler+136

>         tcp_write_timer+141

>         call_timer_fn+43

>

>  skb:ffff888fdb4bac00 --- delay:51642us bytes_acked:1



Humm maybe one of the conditions used in tcp_tsq_write() does not hold...

if (tp->lost_out > tp->retrans_out &&
    tp->snd_cwnd > tcp_packets_in_flight(tp)) {
    tcp_mstamp_refresh(tp);
    tcp_xmit_retransmit_queue(sk);
}

Maybe FastOpen case is 'special' and tp->lost_out is wrong.

Eric Dumazet March 4, 2021, 8:18 p.m. UTC | #11

On Thu, Mar 4, 2021 at 8:41 PM Eric Dumazet <edumazet@google.com> wrote:
>

> On Thu, Mar 4, 2021 at 8:06 PM Jakub Kicinski <kuba@kernel.org> wrote:

> >

> > On Thu, 4 Mar 2021 13:51:15 +0100 Eric Dumazet wrote:

> > > I think we are over thinking this really (especially if the fix needs

> > > a change in core networking or drivers)

> > >

> > > We can reuse TSQ logic to have a chance to recover when the clone is

> > > eventually freed.

> > > This will be more generic, not only for the SYN+data of FastOpen.

> > >

> > > Can you please test the following patch ?

> >

> > #7 - Eric comes up with something much better :)

> >

> >

> > But so far doesn't seem to quite do it, I'm looking but maybe you'll

> > know right away (FWIW testing a v5.6 backport but I don't think TSQ

> > changed?):

> >

> > On __tcp_retransmit_skb kretprobe:

> >

> > ==> Hit TFO case ret:-16 ca_state:0 skb:ffff888fdb4bac00!

> >

> > First hit:

> >         __tcp_retransmit_skb+1

> >         tcp_rcv_state_process+2488

> >         tcp_v6_do_rcv+405

> >         tcp_v6_rcv+2984

> >         ip6_protocol_deliver_rcu+180

> >         ip6_input_finish+17

> >

> > Successful hit:

> >         __tcp_retransmit_skb+1

> >         tcp_retransmit_skb+18

> >         tcp_retransmit_timer+716

> >         tcp_write_timer_handler+136

> >         tcp_write_timer+141

> >         call_timer_fn+43

> >

> >  skb:ffff888fdb4bac00 --- delay:51642us bytes_acked:1

>

>

> Humm maybe one of the conditions used in tcp_tsq_write() does not hold...

>

> if (tp->lost_out > tp->retrans_out &&

>     tp->snd_cwnd > tcp_packets_in_flight(tp)) {

>     tcp_mstamp_refresh(tp);

>     tcp_xmit_retransmit_queue(sk);

> }

>

> Maybe FastOpen case is 'special' and tp->lost_out is wrong.


It would be nice if tun driver would have the ability to delay TX
completions by N usecs,
so that packetdrill tests could be used.

It is probably not too hard to add such a feature.

Jonathan Lemon March 4, 2021, 9:08 p.m. UTC | #12

On Thu, Mar 04, 2021 at 08:41:45PM +0100, Eric Dumazet wrote:
> On Thu, Mar 4, 2021 at 8:06 PM Jakub Kicinski <kuba@kernel.org> wrote:

> >

> > On Thu, 4 Mar 2021 13:51:15 +0100 Eric Dumazet wrote:

> > > I think we are over thinking this really (especially if the fix needs

> > > a change in core networking or drivers)

> > >

> > > We can reuse TSQ logic to have a chance to recover when the clone is

> > > eventually freed.

> > > This will be more generic, not only for the SYN+data of FastOpen.

> > >

> > > Can you please test the following patch ?

> >

> > #7 - Eric comes up with something much better :)

> >

> >

> > But so far doesn't seem to quite do it, I'm looking but maybe you'll

> > know right away (FWIW testing a v5.6 backport but I don't think TSQ

> > changed?):

> >

> > On __tcp_retransmit_skb kretprobe:

> >

> > ==> Hit TFO case ret:-16 ca_state:0 skb:ffff888fdb4bac00!

> >

> > First hit:

> >         __tcp_retransmit_skb+1

> >         tcp_rcv_state_process+2488

> >         tcp_v6_do_rcv+405

> >         tcp_v6_rcv+2984

> >         ip6_protocol_deliver_rcu+180

> >         ip6_input_finish+17

> >

> > Successful hit:

> >         __tcp_retransmit_skb+1

> >         tcp_retransmit_skb+18

> >         tcp_retransmit_timer+716

> >         tcp_write_timer_handler+136

> >         tcp_write_timer+141

> >         call_timer_fn+43

> >

> >  skb:ffff888fdb4bac00 --- delay:51642us bytes_acked:1

> 

> 

> Humm maybe one of the conditions used in tcp_tsq_write() does not hold...

> 

> if (tp->lost_out > tp->retrans_out &&

>     tp->snd_cwnd > tcp_packets_in_flight(tp)) {

>     tcp_mstamp_refresh(tp);

>     tcp_xmit_retransmit_queue(sk);

> }

> 

> Maybe FastOpen case is 'special' and tp->lost_out is wrong.



Something like this?  (completely untested)
-- 
Jonathan

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 69a545db80d2..92bc9b0f4955 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5995,8 +5995,10 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
                else
                        tp->fastopen_client_fail = TFO_DATA_NOT_ACKED;
                skb_rbtree_walk_from(data) {
+                       tcp_mark_skb_lost(sk, data);
                        if (__tcp_retransmit_skb(sk, data, 1))
                                break;
+                       tp->retrans_out += tcp_skb_pcount(data);
                }
                tcp_rearm_rto(sk);
                NET_INC_STATS(sock_net(sk),

Eric Dumazet March 4, 2021, 9:20 p.m. UTC | #13

On Thu, Mar 4, 2021 at 10:08 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
>

> On Thu, Mar 04, 2021 at 08:41:45PM +0100, Eric Dumazet wrote:

> > On Thu, Mar 4, 2021 at 8:06 PM Jakub Kicinski <kuba@kernel.org> wrote:

> > >

> > > On Thu, 4 Mar 2021 13:51:15 +0100 Eric Dumazet wrote:

> > > > I think we are over thinking this really (especially if the fix needs

> > > > a change in core networking or drivers)

> > > >

> > > > We can reuse TSQ logic to have a chance to recover when the clone is

> > > > eventually freed.

> > > > This will be more generic, not only for the SYN+data of FastOpen.

> > > >

> > > > Can you please test the following patch ?

> > >

> > > #7 - Eric comes up with something much better :)

> > >

> > >

> > > But so far doesn't seem to quite do it, I'm looking but maybe you'll

> > > know right away (FWIW testing a v5.6 backport but I don't think TSQ

> > > changed?):

> > >

> > > On __tcp_retransmit_skb kretprobe:

> > >

> > > ==> Hit TFO case ret:-16 ca_state:0 skb:ffff888fdb4bac00!

> > >

> > > First hit:

> > >         __tcp_retransmit_skb+1

> > >         tcp_rcv_state_process+2488

> > >         tcp_v6_do_rcv+405

> > >         tcp_v6_rcv+2984

> > >         ip6_protocol_deliver_rcu+180

> > >         ip6_input_finish+17

> > >

> > > Successful hit:

> > >         __tcp_retransmit_skb+1

> > >         tcp_retransmit_skb+18

> > >         tcp_retransmit_timer+716

> > >         tcp_write_timer_handler+136

> > >         tcp_write_timer+141

> > >         call_timer_fn+43

> > >

> > >  skb:ffff888fdb4bac00 --- delay:51642us bytes_acked:1

> >

> >

> > Humm maybe one of the conditions used in tcp_tsq_write() does not hold...

> >

> > if (tp->lost_out > tp->retrans_out &&

> >     tp->snd_cwnd > tcp_packets_in_flight(tp)) {

> >     tcp_mstamp_refresh(tp);

> >     tcp_xmit_retransmit_queue(sk);

> > }

> >

> > Maybe FastOpen case is 'special' and tp->lost_out is wrong.

>

>

> Something like this?  (completely untested)

> --

> Jonathan

>

> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c

> index 69a545db80d2..92bc9b0f4955 100644

> --- a/net/ipv4/tcp_input.c

> +++ b/net/ipv4/tcp_input.c

> @@ -5995,8 +5995,10 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,

>                 else

>                         tp->fastopen_client_fail = TFO_DATA_NOT_ACKED;

>                 skb_rbtree_walk_from(data) {

> +                       tcp_mark_skb_lost(sk, data);

>                         if (__tcp_retransmit_skb(sk, data, 1))

>                                 break;

> +                       tp->retrans_out += tcp_skb_pcount(data);

>                 }

>                 tcp_rearm_rto(sk);

>                 NET_INC_STATS(sock_net(sk),

>


Yes, but the hard part is testing this ;)

Once we properly mark the skb lost, we can call regular
tcp_xmit_retransmit_queue() to not have to care of tp->retrans_out

Not that we really need to make sure tcp_xmit_retransmit_queue() can
be called anyway from TSQ handler if TX completion is delayed.

Eric Dumazet March 4, 2021, 9:26 p.m. UTC | #14

On Thu, Mar 4, 2021 at 10:20 PM Eric Dumazet <edumazet@google.com> wrote:
>

> On Thu, Mar 4, 2021 at 10:08 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:

> >

> > On Thu, Mar 04, 2021 at 08:41:45PM +0100, Eric Dumazet wrote:

> > > On Thu, Mar 4, 2021 at 8:06 PM Jakub Kicinski <kuba@kernel.org> wrote:

> > > >

> > > > On Thu, 4 Mar 2021 13:51:15 +0100 Eric Dumazet wrote:

> > > > > I think we are over thinking this really (especially if the fix needs

> > > > > a change in core networking or drivers)

> > > > >

> > > > > We can reuse TSQ logic to have a chance to recover when the clone is

> > > > > eventually freed.

> > > > > This will be more generic, not only for the SYN+data of FastOpen.

> > > > >

> > > > > Can you please test the following patch ?

> > > >

> > > > #7 - Eric comes up with something much better :)

> > > >

> > > >

> > > > But so far doesn't seem to quite do it, I'm looking but maybe you'll

> > > > know right away (FWIW testing a v5.6 backport but I don't think TSQ

> > > > changed?):

> > > >

> > > > On __tcp_retransmit_skb kretprobe:

> > > >

> > > > ==> Hit TFO case ret:-16 ca_state:0 skb:ffff888fdb4bac00!

> > > >

> > > > First hit:

> > > >         __tcp_retransmit_skb+1

> > > >         tcp_rcv_state_process+2488

> > > >         tcp_v6_do_rcv+405

> > > >         tcp_v6_rcv+2984

> > > >         ip6_protocol_deliver_rcu+180

> > > >         ip6_input_finish+17

> > > >

> > > > Successful hit:

> > > >         __tcp_retransmit_skb+1

> > > >         tcp_retransmit_skb+18

> > > >         tcp_retransmit_timer+716

> > > >         tcp_write_timer_handler+136

> > > >         tcp_write_timer+141

> > > >         call_timer_fn+43

> > > >

> > > >  skb:ffff888fdb4bac00 --- delay:51642us bytes_acked:1

> > >

> > >

> > > Humm maybe one of the conditions used in tcp_tsq_write() does not hold...

> > >

> > > if (tp->lost_out > tp->retrans_out &&

> > >     tp->snd_cwnd > tcp_packets_in_flight(tp)) {

> > >     tcp_mstamp_refresh(tp);

> > >     tcp_xmit_retransmit_queue(sk);

> > > }

> > >

> > > Maybe FastOpen case is 'special' and tp->lost_out is wrong.

> >

> >

> > Something like this?  (completely untested)

> > --

> > Jonathan

> >

> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c

> > index 69a545db80d2..92bc9b0f4955 100644

> > --- a/net/ipv4/tcp_input.c

> > +++ b/net/ipv4/tcp_input.c

> > @@ -5995,8 +5995,10 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,

> >                 else

> >                         tp->fastopen_client_fail = TFO_DATA_NOT_ACKED;

> >                 skb_rbtree_walk_from(data) {

> > +                       tcp_mark_skb_lost(sk, data);

> >                         if (__tcp_retransmit_skb(sk, data, 1))

> >                                 break;

> > +                       tp->retrans_out += tcp_skb_pcount(data);

> >                 }

> >                 tcp_rearm_rto(sk);

> >                 NET_INC_STATS(sock_net(sk),

> >

>

> Yes, but the hard part is testing this ;)

>

> Once we properly mark the skb lost, we can call regular

> tcp_xmit_retransmit_queue() to not have to care of tp->retrans_out

>

> Not that we really need to make sure tcp_xmit_retransmit_queue() can

> be called anyway from TSQ handler if TX completion is delayed.


I was testing this (note how I also removed the tcp_rearm_rto(sk) call)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 6f450e577975c7be9537338c8a4c0673d7d36c4c..9ef92ca55e530f76ad793d7342442c4ec06165f7
100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6471,11 +6471,10 @@ static bool tcp_rcv_fastopen_synack(struct
sock *sk, struct sk_buff *synack,
        tcp_fastopen_cache_set(sk, mss, cookie, syn_drop, try_exp);

        if (data) { /* Retransmit unacked data in SYN */
-               skb_rbtree_walk_from(data) {
-                       if (__tcp_retransmit_skb(sk, data, 1))
-                               break;
-               }
-               tcp_rearm_rto(sk);
+               skb_rbtree_walk_from(data)
+                       tcp_mark_skb_lost(sk, data);
+
+               tcp_xmit_retransmit_queue(sk);
                NET_INC_STATS(sock_net(sk),
                                LINUX_MIB_TCPFASTOPENACTIVEFAIL);
                return true;

Jakub Kicinski March 4, 2021, 11:27 p.m. UTC | #15

On Thu, 4 Mar 2021 22:26:58 +0100 Eric Dumazet wrote:
> It would be nice if tun driver would have the ability to delay TX
> completions by N usecs,
> so that packetdrill tests could be used.
> 
> It is probably not too hard to add such a feature.

Add an ioctl to turn off the skb_orphan, queue the skbs in tun_do_read() 
to free them from a timer?

> I was testing this (note how I also removed the tcp_rearm_rto(sk) call)
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 6f450e577975c7be9537338c8a4c0673d7d36c4c..9ef92ca55e530f76ad793d7342442c4ec06165f7
> 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -6471,11 +6471,10 @@ static bool tcp_rcv_fastopen_synack(struct
> sock *sk, struct sk_buff *synack,
>         tcp_fastopen_cache_set(sk, mss, cookie, syn_drop, try_exp);
> 
>         if (data) { /* Retransmit unacked data in SYN */
> -               skb_rbtree_walk_from(data) {
> -                       if (__tcp_retransmit_skb(sk, data, 1))
> -                               break;
> -               }
> -               tcp_rearm_rto(sk);
> +               skb_rbtree_walk_from(data)
> +                       tcp_mark_skb_lost(sk, data);
> +
> +               tcp_xmit_retransmit_queue(sk);
>                 NET_INC_STATS(sock_net(sk),
>                                 LINUX_MIB_TCPFASTOPENACTIVEFAIL);
>                 return true;

AFAICT this works great now:

==> TFO case ret:-16 (0) ca_state:0 skb:ffff8881d3513800!
  FREED swapper/5 -- skb 0xffff8881d3513800 freed after: 1us
-----
First:
        __tcp_retransmit_skb+1
        tcp_retransmit_skb+18
        tcp_xmit_retransmit_queue.part.70+339
        tcp_rcv_state_process+2491
        tcp_v6_do_rcv+405
        tcp_v6_rcv+2984

Second:
        __tcp_retransmit_skb+1
        tcp_retransmit_skb+18
        tcp_xmit_retransmit_queue.part.70+339
        tcp_tsq_write.part.71+146
        tcp_tsq_handler+53
        tcp_tasklet_func+181

 sk:0xffff8885adc16f00 skb:ffff8881d3513800 --- 61us acked:1


The other case where we hit RTO after __tcp_retransmit_skb() fails is:

==> non-TFO case ret:-11 (0) ca_state:3 skb:ffff8883d71dd400!
-----
First:
        __tcp_retransmit_skb+1
        tcp_retransmit_skb+18
        tcp_xmit_retransmit_queue.part.70+339
        tcp_ack+2270
        tcp_rcv_established+303
        tcp_v6_do_rcv+190

Second:
        __tcp_retransmit_skb+1
        tcp_retransmit_skb+18
        tcp_retransmit_timer+716
        tcp_write_timer_handler+136
        tcp_write_timer+141
        call_timer_fn+43

 sk:0xffff88801772d340 skb:ffff8883d71dd400 --- 51738us acked:47324

Which I believe is this:

	if (refcount_read(&sk->sk_wmem_alloc) >
	    min_t(u32, sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2),
		  sk->sk_sndbuf))
		return -EAGAIN;

Because __tcp_retransmit_skb() seems to bail before
inet6_sk_rebuild_header gets called. Should we arm TSQ here as well?

Eric Dumazet March 5, 2021, 5:17 a.m. UTC | #16

On Fri, Mar 5, 2021 at 12:27 AM Jakub Kicinski <kuba@kernel.org> wrote:
>

> On Thu, 4 Mar 2021 22:26:58 +0100 Eric Dumazet wrote:

> > It would be nice if tun driver would have the ability to delay TX

> > completions by N usecs,

> > so that packetdrill tests could be used.

> >

> > It is probably not too hard to add such a feature.

>

> Add an ioctl to turn off the skb_orphan, queue the skbs in tun_do_read()

> to free them from a timer?


Yes, I  cooked a prototype like that a few hours ago before my night
shift to launch our test suite.

I yet have to add a sane limit to the number of delayed skbs so that
syzbot does not oom its hosts ;)

>

> > I was testing this (note how I also removed the tcp_rearm_rto(sk) call)

> >

> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c

> > index 6f450e577975c7be9537338c8a4c0673d7d36c4c..9ef92ca55e530f76ad793d7342442c4ec06165f7

> > 100644

> > --- a/net/ipv4/tcp_input.c

> > +++ b/net/ipv4/tcp_input.c

> > @@ -6471,11 +6471,10 @@ static bool tcp_rcv_fastopen_synack(struct

> > sock *sk, struct sk_buff *synack,

> >         tcp_fastopen_cache_set(sk, mss, cookie, syn_drop, try_exp);

> >

> >         if (data) { /* Retransmit unacked data in SYN */

> > -               skb_rbtree_walk_from(data) {

> > -                       if (__tcp_retransmit_skb(sk, data, 1))

> > -                               break;

> > -               }

> > -               tcp_rearm_rto(sk);

> > +               skb_rbtree_walk_from(data)

> > +                       tcp_mark_skb_lost(sk, data);

> > +

> > +               tcp_xmit_retransmit_queue(sk);

> >                 NET_INC_STATS(sock_net(sk),

> >                                 LINUX_MIB_TCPFASTOPENACTIVEFAIL);

> >                 return true;

>

> AFAICT this works great now:


Yes but some packetdrill tests fail, I have to check them (sponge link
for Googlers), but at first glance we have more investigations.

Ran 2003 tests: 1990 passed, 13 failed
Sponge: http://sponge2/b0d4c652-3173-4837-86ef-5e8cc59730a1

Failures in
//net/tcp/fastopen/client:ipv4-mapped-ipv6:cookie-disabled-prod-conn
//net/tcp/fastopen/client:ipv4-mapped-ipv6:syn-data-icmp-unreach-frag-needed
//net/tcp/fastopen/client:ipv4-mapped-ipv6:syn-data-icmp-unreach-frag-needed-with-seq
//net/tcp/fastopen/client:ipv4-mapped-ipv6:syn-data-only-syn-acked
//net/tcp/fastopen/client:ipv4-mapped-ipv6:syn-data-partial-or-over-ack
//net/tcp/fastopen/client:ipv4:cookie-disabled-prod-conn
//net/tcp/fastopen/client:ipv4:syn-data-icmp-unreach-frag-needed
//net/tcp/fastopen/client:ipv4:syn-data-icmp-unreach-frag-needed-with-seq
//net/tcp/fastopen/client:ipv4:syn-data-only-syn-acked
//net/tcp/fastopen/client:ipv4:syn-data-partial-or-over-ack
//net/tcp/fastopen/client:ipv6:cookie-disabled-prod-conn
//net/tcp/fastopen/client:ipv6:syn-data-only-syn-acked
//net/tcp/fastopen/client:ipv6:syn-data-partial-or-over-ack

Showing test.log for
//net/tcp/fastopen/client:ipv4:syn-data-icmp-unreach-frag-needed

syn-data-icmp-unreach-frag-needed.pkt:33: error handling packet: live
packet field ipv4_total_length: expected: 800 (0x320) vs actual: 1040
(0x410)
script packet: 0.090624 . 1:761(760) ack 1
actual packet: 0.090619 P. 1:1001(1000) ack 1 win 256


Yes, it looks like Alex patch no longer works

commit c31b70c9968fe9c4194d1b5d06d07596a3b680de
Author: Alexander Duyck <alexanderduyck@fb.com>
Date:   Sat Dec 12 12:31:24 2020 -0800

    tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit



>

> ==> TFO case ret:-16 (0) ca_state:0 skb:ffff8881d3513800!

>   FREED swapper/5 -- skb 0xffff8881d3513800 freed after: 1us

> -----

> First:

>         __tcp_retransmit_skb+1

>         tcp_retransmit_skb+18

>         tcp_xmit_retransmit_queue.part.70+339

>         tcp_rcv_state_process+2491

>         tcp_v6_do_rcv+405

>         tcp_v6_rcv+2984

>

> Second:

>         __tcp_retransmit_skb+1

>         tcp_retransmit_skb+18

>         tcp_xmit_retransmit_queue.part.70+339

>         tcp_tsq_write.part.71+146

>         tcp_tsq_handler+53

>         tcp_tasklet_func+181

>

>  sk:0xffff8885adc16f00 skb:ffff8881d3513800 --- 61us acked:1

>

>

> The other case where we hit RTO after __tcp_retransmit_skb() fails is:

>

> ==> non-TFO case ret:-11 (0) ca_state:3 skb:ffff8883d71dd400!

> -----

> First:

>         __tcp_retransmit_skb+1

>         tcp_retransmit_skb+18

>         tcp_xmit_retransmit_queue.part.70+339

>         tcp_ack+2270

>         tcp_rcv_established+303

>         tcp_v6_do_rcv+190

>

> Second:

>         __tcp_retransmit_skb+1

>         tcp_retransmit_skb+18

>         tcp_retransmit_timer+716

>         tcp_write_timer_handler+136

>         tcp_write_timer+141

>         call_timer_fn+43

>

>  sk:0xffff88801772d340 skb:ffff8883d71dd400 --- 51738us acked:47324

>

> Which I believe is this:

>

>         if (refcount_read(&sk->sk_wmem_alloc) >

>             min_t(u32, sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2),

>                   sk->sk_sndbuf))

>                 return -EAGAIN;

>

> Because __tcp_retransmit_skb() seems to bail before

> inet6_sk_rebuild_header gets called. Should we arm TSQ here as well?


Yes, or make the limit slightly bigger since we now have the fclone
more precise thing for standard drivers [1]

if ((refcount_read(&sk->sk_wmem_alloc) >> 1) >
    min_t(u32, sk->sk_wmem_queued,  sk->sk_sndbuf))
   return -EAGAIN;

[1] It is probably wise to keep this code because some drivers do call
skb_orphan() in their ndo_start_xmit()

Eric Dumazet March 5, 2021, 5:33 a.m. UTC | #17

On Fri, Mar 5, 2021 at 6:17 AM Eric Dumazet <edumazet@google.com> wrote:
>

> On Fri, Mar 5, 2021 at 12:27 AM Jakub Kicinski <kuba@kernel.org> wrote:

> >

> > On Thu, 4 Mar 2021 22:26:58 +0100 Eric Dumazet wrote:

> > > It would be nice if tun driver would have the ability to delay TX

> > > completions by N usecs,

> > > so that packetdrill tests could be used.

> > >

> > > It is probably not too hard to add such a feature.

> >

> > Add an ioctl to turn off the skb_orphan, queue the skbs in tun_do_read()

> > to free them from a timer?

>

> Yes, I  cooked a prototype like that a few hours ago before my night

> shift to launch our test suite.

>

> I yet have to add a sane limit to the number of delayed skbs so that

> syzbot does not oom its hosts ;)

>

> >

> > > I was testing this (note how I also removed the tcp_rearm_rto(sk) call)

> > >

> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c

> > > index 6f450e577975c7be9537338c8a4c0673d7d36c4c..9ef92ca55e530f76ad793d7342442c4ec06165f7

> > > 100644

> > > --- a/net/ipv4/tcp_input.c

> > > +++ b/net/ipv4/tcp_input.c

> > > @@ -6471,11 +6471,10 @@ static bool tcp_rcv_fastopen_synack(struct

> > > sock *sk, struct sk_buff *synack,

> > >         tcp_fastopen_cache_set(sk, mss, cookie, syn_drop, try_exp);

> > >

> > >         if (data) { /* Retransmit unacked data in SYN */

> > > -               skb_rbtree_walk_from(data) {

> > > -                       if (__tcp_retransmit_skb(sk, data, 1))

> > > -                               break;

> > > -               }

> > > -               tcp_rearm_rto(sk);

> > > +               skb_rbtree_walk_from(data)

> > > +                       tcp_mark_skb_lost(sk, data);

> > > +

> > > +               tcp_xmit_retransmit_queue(sk);

> > >                 NET_INC_STATS(sock_net(sk),

> > >                                 LINUX_MIB_TCPFASTOPENACTIVEFAIL);

> > >                 return true;

> >

> > AFAICT this works great now:

>

> Yes but some packetdrill tests fail, I have to check them (sponge link

> for Googlers), but at first glance we have more investigations.

>

> Ran 2003 tests: 1990 passed, 13 failed

> Sponge: http://sponge2/b0d4c652-3173-4837-86ef-5e8cc59730a1

>

> Failures in

> //net/tcp/fastopen/client:ipv4-mapped-ipv6:cookie-disabled-prod-conn

> //net/tcp/fastopen/client:ipv4-mapped-ipv6:syn-data-icmp-unreach-frag-needed

> //net/tcp/fastopen/client:ipv4-mapped-ipv6:syn-data-icmp-unreach-frag-needed-with-seq

> //net/tcp/fastopen/client:ipv4-mapped-ipv6:syn-data-only-syn-acked

> //net/tcp/fastopen/client:ipv4-mapped-ipv6:syn-data-partial-or-over-ack

> //net/tcp/fastopen/client:ipv4:cookie-disabled-prod-conn

> //net/tcp/fastopen/client:ipv4:syn-data-icmp-unreach-frag-needed

> //net/tcp/fastopen/client:ipv4:syn-data-icmp-unreach-frag-needed-with-seq

> //net/tcp/fastopen/client:ipv4:syn-data-only-syn-acked

> //net/tcp/fastopen/client:ipv4:syn-data-partial-or-over-ack

> //net/tcp/fastopen/client:ipv6:cookie-disabled-prod-conn

> //net/tcp/fastopen/client:ipv6:syn-data-only-syn-acked

> //net/tcp/fastopen/client:ipv6:syn-data-partial-or-over-ack

>

> Showing test.log for

> //net/tcp/fastopen/client:ipv4:syn-data-icmp-unreach-frag-needed

>

> syn-data-icmp-unreach-frag-needed.pkt:33: error handling packet: live

> packet field ipv4_total_length: expected: 800 (0x320) vs actual: 1040

> (0x410)

> script packet: 0.090624 . 1:761(760) ack 1

> actual packet: 0.090619 P. 1:1001(1000) ack 1 win 256

>


Actually this might be a good thing : We now resend the whole data in
a TSO packet, now
we call the standard rtx queue mechanism, instead of sending only one MSS

I will look at other test failures today (Friday) before submitting a
patch series officially.

>

> Yes, it looks like Alex patch no longer works

>

> commit c31b70c9968fe9c4194d1b5d06d07596a3b680de

> Author: Alexander Duyck <alexanderduyck@fb.com>

> Date:   Sat Dec 12 12:31:24 2020 -0800

>

>     tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit

>

>

>

> >

> > ==> TFO case ret:-16 (0) ca_state:0 skb:ffff8881d3513800!

> >   FREED swapper/5 -- skb 0xffff8881d3513800 freed after: 1us

> > -----

> > First:

> >         __tcp_retransmit_skb+1

> >         tcp_retransmit_skb+18

> >         tcp_xmit_retransmit_queue.part.70+339

> >         tcp_rcv_state_process+2491

> >         tcp_v6_do_rcv+405

> >         tcp_v6_rcv+2984

> >

> > Second:

> >         __tcp_retransmit_skb+1

> >         tcp_retransmit_skb+18

> >         tcp_xmit_retransmit_queue.part.70+339

> >         tcp_tsq_write.part.71+146

> >         tcp_tsq_handler+53

> >         tcp_tasklet_func+181

> >

> >  sk:0xffff8885adc16f00 skb:ffff8881d3513800 --- 61us acked:1

> >

> >

> > The other case where we hit RTO after __tcp_retransmit_skb() fails is:

> >

> > ==> non-TFO case ret:-11 (0) ca_state:3 skb:ffff8883d71dd400!

> > -----

> > First:

> >         __tcp_retransmit_skb+1

> >         tcp_retransmit_skb+18

> >         tcp_xmit_retransmit_queue.part.70+339

> >         tcp_ack+2270

> >         tcp_rcv_established+303

> >         tcp_v6_do_rcv+190

> >

> > Second:

> >         __tcp_retransmit_skb+1

> >         tcp_retransmit_skb+18

> >         tcp_retransmit_timer+716

> >         tcp_write_timer_handler+136

> >         tcp_write_timer+141

> >         call_timer_fn+43

> >

> >  sk:0xffff88801772d340 skb:ffff8883d71dd400 --- 51738us acked:47324

> >

> > Which I believe is this:

> >

> >         if (refcount_read(&sk->sk_wmem_alloc) >

> >             min_t(u32, sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2),

> >                   sk->sk_sndbuf))

> >                 return -EAGAIN;

> >

> > Because __tcp_retransmit_skb() seems to bail before

> > inet6_sk_rebuild_header gets called. Should we arm TSQ here as well?

>

> Yes, or make the limit slightly bigger since we now have the fclone

> more precise thing for standard drivers [1]

>

> if ((refcount_read(&sk->sk_wmem_alloc) >> 1) >

>     min_t(u32, sk->sk_wmem_queued,  sk->sk_sndbuf))

>    return -EAGAIN;

>

> [1] It is probably wise to keep this code because some drivers do call

> skb_orphan() in their ndo_start_xmit()

Eric Dumazet March 5, 2021, 6:38 a.m. UTC | #18

On Fri, Mar 5, 2021 at 6:33 AM Eric Dumazet <edumazet@google.com> wrote:

> Actually this might be a good thing : We now resend the whole data in

> a TSO packet, now

> we call the standard rtx queue mechanism, instead of sending only one MSS

>

> I will look at other test failures today (Friday) before submitting a

> patch series officially.

>


Well, there is also this part to sort out.

WARN_ON(tp->retrans_out != 0);

in tcp_fastretrans_alert()

[net] net: tcp: don't allocate fast clones for fastopen SYN

Commit Message

Comments

Patch