diff mbox series

[net-next,v2] mptcp: be careful on MPTCP-level ack.

Message ID 5370c0ae03449239e3d1674ddcfb090cf6f20abe.1606253206.git.pabeni@redhat.com
State Superseded
Headers show
Series [net-next,v2] mptcp: be careful on MPTCP-level ack. | expand

Commit Message

Paolo Abeni Nov. 24, 2020, 9:51 p.m. UTC
We can enter the main mptcp_recvmsg() loop even when
no subflows are connected. As note by Eric, that would
result in a divide by zero oops on ack generation.

Address the issue by checking the subflow status before
sending the ack.

Additionally protect mptcp_recvmsg() against invocation
with weird socket states.

v1 -> v2:
 - removed unneeded inline keyword - Jakub

Reported-and-suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 67 ++++++++++++++++++++++++++++++++------------
 1 file changed, 49 insertions(+), 18 deletions(-)

Comments

Jakub Kicinski Nov. 25, 2020, 9:37 p.m. UTC | #1
On Tue, 24 Nov 2020 22:51:24 +0100 Paolo Abeni wrote:
> We can enter the main mptcp_recvmsg() loop even when

> no subflows are connected. As note by Eric, that would

> result in a divide by zero oops on ack generation.

> 

> Address the issue by checking the subflow status before

> sending the ack.

> 

> Additionally protect mptcp_recvmsg() against invocation

> with weird socket states.

> 

> v1 -> v2:

>  - removed unneeded inline keyword - Jakub

> 

> Reported-and-suggested-by: Eric Dumazet <eric.dumazet@gmail.com>

> Fixes: ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling")

> Signed-off-by: Paolo Abeni <pabeni@redhat.com>


Applied, thanks!
Eric Dumazet Dec. 2, 2020, 1:18 p.m. UTC | #2
On 11/24/20 10:51 PM, Paolo Abeni wrote:
> We can enter the main mptcp_recvmsg() loop even when

> no subflows are connected. As note by Eric, that would

> result in a divide by zero oops on ack generation.

> 

> Address the issue by checking the subflow status before

> sending the ack.

> 

> Additionally protect mptcp_recvmsg() against invocation

> with weird socket states.

> 

> v1 -> v2:

>  - removed unneeded inline keyword - Jakub

> 

> Reported-and-suggested-by: Eric Dumazet <eric.dumazet@gmail.com>

> Fixes: ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling")

> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

> ---

>  net/mptcp/protocol.c | 67 ++++++++++++++++++++++++++++++++------------

>  1 file changed, 49 insertions(+), 18 deletions(-)

> 


Looking at mptcp recvmsg(), it seems that a read(fd, ..., 0) will
trigger an infinite loop if there is available data in receive queue ?

It seems the following is needed, commit ea4ca586b16f removed
a needed check to catch this condition.

Untested patch, I can submit it formally if you agree.

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 221f7cdd416bdb681968bf1b3ff2ed1b03cea3ce..57213ff60f784fae14c2a96f391ccdec6249c168 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1921,7 +1921,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
        len = min_t(size_t, len, INT_MAX);
        target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
 
-       for (;;) {
+       while (copied < len) {
                int bytes_read, old_space;
 
                bytes_read = __mptcp_recvmsg_mskq(msk, msg, len - copied);
Paolo Abeni Dec. 2, 2020, 3:37 p.m. UTC | #3
On Wed, 2020-12-02 at 14:18 +0100, Eric Dumazet wrote:
> 

> On 11/24/20 10:51 PM, Paolo Abeni wrote:

> > We can enter the main mptcp_recvmsg() loop even when

> > no subflows are connected. As note by Eric, that would

> > result in a divide by zero oops on ack generation.

> > 

> > Address the issue by checking the subflow status before

> > sending the ack.

> > 

> > Additionally protect mptcp_recvmsg() against invocation

> > with weird socket states.

> > 

> > v1 -> v2:

> >  - removed unneeded inline keyword - Jakub

> > 

> > Reported-and-suggested-by: Eric Dumazet <eric.dumazet@gmail.com>

> > Fixes: ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling")

> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>

> > ---

> >  net/mptcp/protocol.c | 67 ++++++++++++++++++++++++++++++++------------

> >  1 file changed, 49 insertions(+), 18 deletions(-)

> > 

> 

> Looking at mptcp recvmsg(), it seems that a read(fd, ..., 0) will

> trigger an infinite loop if there is available data in receive queue ?


Thank you for looking into this!

I can't reproduce the issue with the following packetdrill ?!?

+0.0  connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0.1   > S 0:0(0) <mss 1460,sackOK,TS val 100 ecr 0,nop,wscale 8,mpcapable v1 fflags[flag_h] nokey>
+0.1   < S. 0:0(0) ack 1 win 65535 <mss 1460,sackOK,TS val 700 ecr 100,nop,wscaale 8,mpcapable v1 flags[flag_h] key[skey=2] >
+0.1  > . 1:1(0) ack 1 <nop, nop, TS val 100 ecr 700,mpcapable v1 flags[flag_h]] key[ckey,skey]>
+0.1 fcntl(3, F_SETFL, O_RDWR) = 0
+0.1   < .  1:201(200) ack 1 win 225 <dss dack8=1 dsn8=1 ssn=1 dll=200 nocs,  nop, nop>
+0.1   > .  1:1(0) ack 201 <nop, nop, TS val 100 ecr 700, dss dack8=201 dll=00 nocs>
+0.1 read(3, ..., 0) = 0

The main recvmsg() loop is interrupted by the following check:

                if (copied >= target)
                        break;

I guess we could loop while the msk has available rcv space and some
subflow is feeding new data. If so, I think moving:

	if (skb_queue_empty(&msk->receive_queue) &&
                    __mptcp_move_skbs(msk, len - copied))
                        continue;

after the above check should address the issue, and will make the
common case faster. Let me test the above - unless I underlooked
something relevant!

Thanks,

Paolo
Eric Dumazet Dec. 2, 2020, 4:10 p.m. UTC | #4
On 12/2/20 4:37 PM, Paolo Abeni wrote:
> On Wed, 2020-12-02 at 14:18 +0100, Eric Dumazet wrote:

>>

>> On 11/24/20 10:51 PM, Paolo Abeni wrote:

>>> We can enter the main mptcp_recvmsg() loop even when

>>> no subflows are connected. As note by Eric, that would

>>> result in a divide by zero oops on ack generation.

>>>

>>> Address the issue by checking the subflow status before

>>> sending the ack.

>>>

>>> Additionally protect mptcp_recvmsg() against invocation

>>> with weird socket states.

>>>

>>> v1 -> v2:

>>>  - removed unneeded inline keyword - Jakub

>>>

>>> Reported-and-suggested-by: Eric Dumazet <eric.dumazet@gmail.com>

>>> Fixes: ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling")

>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

>>> ---

>>>  net/mptcp/protocol.c | 67 ++++++++++++++++++++++++++++++++------------

>>>  1 file changed, 49 insertions(+), 18 deletions(-)

>>>

>>

>> Looking at mptcp recvmsg(), it seems that a read(fd, ..., 0) will

>> trigger an infinite loop if there is available data in receive queue ?

> 

> Thank you for looking into this!

> 

> I can't reproduce the issue with the following packetdrill ?!?

> 

> +0.0  connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)

> +0.1   > S 0:0(0) <mss 1460,sackOK,TS val 100 ecr 0,nop,wscale 8,mpcapable v1 fflags[flag_h] nokey>

> +0.1   < S. 0:0(0) ack 1 win 65535 <mss 1460,sackOK,TS val 700 ecr 100,nop,wscaale 8,mpcapable v1 flags[flag_h] key[skey=2] >

> +0.1  > . 1:1(0) ack 1 <nop, nop, TS val 100 ecr 700,mpcapable v1 flags[flag_h]] key[ckey,skey]>

> +0.1 fcntl(3, F_SETFL, O_RDWR) = 0

> +0.1   < .  1:201(200) ack 1 win 225 <dss dack8=1 dsn8=1 ssn=1 dll=200 nocs,  nop, nop>

> +0.1   > .  1:1(0) ack 201 <nop, nop, TS val 100 ecr 700, dss dack8=201 dll=00 nocs>

> +0.1 read(3, ..., 0) = 0

> 

> The main recvmsg() loop is interrupted by the following check:

> 

>                 if (copied >= target)

>                         break;


@copied should be 0, and @target should be 1

Are you sure the above condition is triggering ?

Maybe read(fd, ..., 0) does not reach recvmsg() at all.

You could try recvmsg() or recvmmsg(), 

> 

> I guess we could loop while the msk has available rcv space and some

> subflow is feeding new data. If so, I think moving:

> 

> 	if (skb_queue_empty(&msk->receive_queue) &&

>                     __mptcp_move_skbs(msk, len - copied))

>                         continue;

> 

> after the above check should address the issue, and will make the

> common case faster. Let me test the above - unless I underlooked

> something relevant!

> 

> Thanks,

> 

> Paolo

>
Eric Dumazet Dec. 2, 2020, 4:30 p.m. UTC | #5
On 12/2/20 5:10 PM, Eric Dumazet wrote:
> 

> 

> On 12/2/20 4:37 PM, Paolo Abeni wrote:

>> On Wed, 2020-12-02 at 14:18 +0100, Eric Dumazet wrote:

>>>

>>> On 11/24/20 10:51 PM, Paolo Abeni wrote:

>>>> We can enter the main mptcp_recvmsg() loop even when

>>>> no subflows are connected. As note by Eric, that would

>>>> result in a divide by zero oops on ack generation.

>>>>

>>>> Address the issue by checking the subflow status before

>>>> sending the ack.

>>>>

>>>> Additionally protect mptcp_recvmsg() against invocation

>>>> with weird socket states.

>>>>

>>>> v1 -> v2:

>>>>  - removed unneeded inline keyword - Jakub

>>>>

>>>> Reported-and-suggested-by: Eric Dumazet <eric.dumazet@gmail.com>

>>>> Fixes: ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling")

>>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

>>>> ---

>>>>  net/mptcp/protocol.c | 67 ++++++++++++++++++++++++++++++++------------

>>>>  1 file changed, 49 insertions(+), 18 deletions(-)

>>>>

>>>

>>> Looking at mptcp recvmsg(), it seems that a read(fd, ..., 0) will

>>> trigger an infinite loop if there is available data in receive queue ?

>>

>> Thank you for looking into this!

>>

>> I can't reproduce the issue with the following packetdrill ?!?

>>

>> +0.0  connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)

>> +0.1   > S 0:0(0) <mss 1460,sackOK,TS val 100 ecr 0,nop,wscale 8,mpcapable v1 fflags[flag_h] nokey>

>> +0.1   < S. 0:0(0) ack 1 win 65535 <mss 1460,sackOK,TS val 700 ecr 100,nop,wscaale 8,mpcapable v1 flags[flag_h] key[skey=2] >

>> +0.1  > . 1:1(0) ack 1 <nop, nop, TS val 100 ecr 700,mpcapable v1 flags[flag_h]] key[ckey,skey]>

>> +0.1 fcntl(3, F_SETFL, O_RDWR) = 0

>> +0.1   < .  1:201(200) ack 1 win 225 <dss dack8=1 dsn8=1 ssn=1 dll=200 nocs,  nop, nop>

>> +0.1   > .  1:1(0) ack 201 <nop, nop, TS val 100 ecr 700, dss dack8=201 dll=00 nocs>

>> +0.1 read(3, ..., 0) = 0

>>

>> The main recvmsg() loop is interrupted by the following check:

>>

>>                 if (copied >= target)

>>                         break;

> 

> @copied should be 0, and @target should be 1

> 

> Are you sure the above condition is triggering ?

> 

> Maybe read(fd, ..., 0) does not reach recvmsg() at all.


Yes, sock_read_iter() has a shortcut :

if (!iov_iter_count(to))    /* Match SYS5 behaviour */
     res = sock_recvmsg(sock, &msg, msg.msg_flags);

but recvmsg() does not have such check, or maybe I have not looked at the right place.

> 

> You could try recvmsg() or recvmmsg(), 

> 

>>

>> I guess we could loop while the msk has available rcv space and some

>> subflow is feeding new data. If so, I think moving:

>>

>> 	if (skb_queue_empty(&msk->receive_queue) &&

>>                     __mptcp_move_skbs(msk, len - copied))

>>                         continue;

>>

>> after the above check should address the issue, and will make the

>> common case faster. Let me test the above - unless I underlooked

>> something relevant!

>>

>> Thanks,

>>

>> Paolo

>>
Eric Dumazet Dec. 2, 2020, 4:32 p.m. UTC | #6
On 12/2/20 5:30 PM, Eric Dumazet wrote:
> 

> 

> On 12/2/20 5:10 PM, Eric Dumazet wrote:

>>

>>

>> On 12/2/20 4:37 PM, Paolo Abeni wrote:

>>> On Wed, 2020-12-02 at 14:18 +0100, Eric Dumazet wrote:

>>>>

>>>> On 11/24/20 10:51 PM, Paolo Abeni wrote:

>>>>> We can enter the main mptcp_recvmsg() loop even when

>>>>> no subflows are connected. As note by Eric, that would

>>>>> result in a divide by zero oops on ack generation.

>>>>>

>>>>> Address the issue by checking the subflow status before

>>>>> sending the ack.

>>>>>

>>>>> Additionally protect mptcp_recvmsg() against invocation

>>>>> with weird socket states.

>>>>>

>>>>> v1 -> v2:

>>>>>  - removed unneeded inline keyword - Jakub

>>>>>

>>>>> Reported-and-suggested-by: Eric Dumazet <eric.dumazet@gmail.com>

>>>>> Fixes: ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling")

>>>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

>>>>> ---

>>>>>  net/mptcp/protocol.c | 67 ++++++++++++++++++++++++++++++++------------

>>>>>  1 file changed, 49 insertions(+), 18 deletions(-)

>>>>>

>>>>

>>>> Looking at mptcp recvmsg(), it seems that a read(fd, ..., 0) will

>>>> trigger an infinite loop if there is available data in receive queue ?

>>>

>>> Thank you for looking into this!

>>>

>>> I can't reproduce the issue with the following packetdrill ?!?

>>>

>>> +0.0  connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)

>>> +0.1   > S 0:0(0) <mss 1460,sackOK,TS val 100 ecr 0,nop,wscale 8,mpcapable v1 fflags[flag_h] nokey>

>>> +0.1   < S. 0:0(0) ack 1 win 65535 <mss 1460,sackOK,TS val 700 ecr 100,nop,wscaale 8,mpcapable v1 flags[flag_h] key[skey=2] >

>>> +0.1  > . 1:1(0) ack 1 <nop, nop, TS val 100 ecr 700,mpcapable v1 flags[flag_h]] key[ckey,skey]>

>>> +0.1 fcntl(3, F_SETFL, O_RDWR) = 0

>>> +0.1   < .  1:201(200) ack 1 win 225 <dss dack8=1 dsn8=1 ssn=1 dll=200 nocs,  nop, nop>

>>> +0.1   > .  1:1(0) ack 201 <nop, nop, TS val 100 ecr 700, dss dack8=201 dll=00 nocs>

>>> +0.1 read(3, ..., 0) = 0

>>>

>>> The main recvmsg() loop is interrupted by the following check:

>>>

>>>                 if (copied >= target)

>>>                         break;

>>

>> @copied should be 0, and @target should be 1

>>

>> Are you sure the above condition is triggering ?

>>

>> Maybe read(fd, ..., 0) does not reach recvmsg() at all.

> 

> Yes, sock_read_iter() has a shortcut :

> 

> if (!iov_iter_count(to))    /* Match SYS5 behaviour */

>      res = sock_recvmsg(sock, &msg, msg.msg_flags);


No idea what went wrong with my copy/paste.

The real code is more like :

if (!iov_iter_count(to))    /* Match SYS5 behaviour */
    return 0;


> 

> but recvmsg() does not have such check, or maybe I have not looked at the right place.

> 

>>

>> You could try recvmsg() or recvmmsg(), 

>>

>>>

>>> I guess we could loop while the msk has available rcv space and some

>>> subflow is feeding new data. If so, I think moving:

>>>

>>> 	if (skb_queue_empty(&msk->receive_queue) &&

>>>                     __mptcp_move_skbs(msk, len - copied))

>>>                         continue;

>>>

>>> after the above check should address the issue, and will make the

>>> common case faster. Let me test the above - unless I underlooked

>>> something relevant!

>>>

>>> Thanks,

>>>

>>> Paolo

>>>
Eric Dumazet Dec. 2, 2020, 4:45 p.m. UTC | #7
On 12/2/20 5:32 PM, Eric Dumazet wrote:
> 

> 

> On 12/2/20 5:30 PM, Eric Dumazet wrote:

>>

>>

>> On 12/2/20 5:10 PM, Eric Dumazet wrote:

>>>

>>>

>>> On 12/2/20 4:37 PM, Paolo Abeni wrote:

>>>> On Wed, 2020-12-02 at 14:18 +0100, Eric Dumazet wrote:

>>>>>

>>>>> On 11/24/20 10:51 PM, Paolo Abeni wrote:

>>>>>> We can enter the main mptcp_recvmsg() loop even when

>>>>>> no subflows are connected. As note by Eric, that would

>>>>>> result in a divide by zero oops on ack generation.

>>>>>>

>>>>>> Address the issue by checking the subflow status before

>>>>>> sending the ack.

>>>>>>

>>>>>> Additionally protect mptcp_recvmsg() against invocation

>>>>>> with weird socket states.

>>>>>>

>>>>>> v1 -> v2:

>>>>>>  - removed unneeded inline keyword - Jakub

>>>>>>

>>>>>> Reported-and-suggested-by: Eric Dumazet <eric.dumazet@gmail.com>

>>>>>> Fixes: ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling")

>>>>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

>>>>>> ---

>>>>>>  net/mptcp/protocol.c | 67 ++++++++++++++++++++++++++++++++------------

>>>>>>  1 file changed, 49 insertions(+), 18 deletions(-)

>>>>>>

>>>>>

>>>>> Looking at mptcp recvmsg(), it seems that a read(fd, ..., 0) will

>>>>> trigger an infinite loop if there is available data in receive queue ?

>>>>

>>>> Thank you for looking into this!

>>>>

>>>> I can't reproduce the issue with the following packetdrill ?!?

>>>>

>>>> +0.0  connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)

>>>> +0.1   > S 0:0(0) <mss 1460,sackOK,TS val 100 ecr 0,nop,wscale 8,mpcapable v1 fflags[flag_h] nokey>

>>>> +0.1   < S. 0:0(0) ack 1 win 65535 <mss 1460,sackOK,TS val 700 ecr 100,nop,wscaale 8,mpcapable v1 flags[flag_h] key[skey=2] >

>>>> +0.1  > . 1:1(0) ack 1 <nop, nop, TS val 100 ecr 700,mpcapable v1 flags[flag_h]] key[ckey,skey]>

>>>> +0.1 fcntl(3, F_SETFL, O_RDWR) = 0

>>>> +0.1   < .  1:201(200) ack 1 win 225 <dss dack8=1 dsn8=1 ssn=1 dll=200 nocs,  nop, nop>

>>>> +0.1   > .  1:1(0) ack 201 <nop, nop, TS val 100 ecr 700, dss dack8=201 dll=00 nocs>

>>>> +0.1 read(3, ..., 0) = 0

>>>>

>>>> The main recvmsg() loop is interrupted by the following check:

>>>>

>>>>                 if (copied >= target)

>>>>                         break;

>>>

>>> @copied should be 0, and @target should be 1

>>>

>>> Are you sure the above condition is triggering ?

>>>

>>> Maybe read(fd, ..., 0) does not reach recvmsg() at all.

>>

>> Yes, sock_read_iter() has a shortcut :

>>

>> if (!iov_iter_count(to))    /* Match SYS5 behaviour */

>>      res = sock_recvmsg(sock, &msg, msg.msg_flags);

> 

> No idea what went wrong with my copy/paste.

> 

> The real code is more like :

> 

> if (!iov_iter_count(to))    /* Match SYS5 behaviour */

>     return 0;

>


Packetdrill recvmsg syntax would be something like

   +0	recvmsg(3, {msg_name(...)=...,
		    msg_iov(1)=[{..., 0}],
		    msg_flags=0
		    }, 0) = 0
Paolo Abeni Dec. 2, 2020, 4:51 p.m. UTC | #8
On Wed, 2020-12-02 at 17:45 +0100, Eric Dumazet wrote:
> 

> On 12/2/20 5:32 PM, Eric Dumazet wrote:

> > 

> > On 12/2/20 5:30 PM, Eric Dumazet wrote:

> > > 

> > > On 12/2/20 5:10 PM, Eric Dumazet wrote:

> > > > 

> > > > On 12/2/20 4:37 PM, Paolo Abeni wrote:

> > > > > On Wed, 2020-12-02 at 14:18 +0100, Eric Dumazet wrote:

> > > > > > On 11/24/20 10:51 PM, Paolo Abeni wrote:

> > > > > > > We can enter the main mptcp_recvmsg() loop even when

> > > > > > > no subflows are connected. As note by Eric, that would

> > > > > > > result in a divide by zero oops on ack generation.

> > > > > > > 

> > > > > > > Address the issue by checking the subflow status before

> > > > > > > sending the ack.

> > > > > > > 

> > > > > > > Additionally protect mptcp_recvmsg() against invocation

> > > > > > > with weird socket states.

> > > > > > > 

> > > > > > > v1 -> v2:

> > > > > > >  - removed unneeded inline keyword - Jakub

> > > > > > > 

> > > > > > > Reported-and-suggested-by: Eric Dumazet <eric.dumazet@gmail.com>

> > > > > > > Fixes: ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling")

> > > > > > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>

> > > > > > > ---

> > > > > > >  net/mptcp/protocol.c | 67 ++++++++++++++++++++++++++++++++------------

> > > > > > >  1 file changed, 49 insertions(+), 18 deletions(-)

> > > > > > > 

> > > > > > 

> > > > > > Looking at mptcp recvmsg(), it seems that a read(fd, ..., 0) will

> > > > > > trigger an infinite loop if there is available data in receive queue ?

> > > > > 

> > > > > Thank you for looking into this!

> > > > > 

> > > > > I can't reproduce the issue with the following packetdrill ?!?

> > > > > 

> > > > > +0.0  connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)

> > > > > +0.1   > S 0:0(0) <mss 1460,sackOK,TS val 100 ecr 0,nop,wscale 8,mpcapable v1 fflags[flag_h] nokey>

> > > > > +0.1   < S. 0:0(0) ack 1 win 65535 <mss 1460,sackOK,TS val 700 ecr 100,nop,wscaale 8,mpcapable v1 flags[flag_h] key[skey=2] >

> > > > > +0.1  > . 1:1(0) ack 1 <nop, nop, TS val 100 ecr 700,mpcapable v1 flags[flag_h]] key[ckey,skey]>

> > > > > +0.1 fcntl(3, F_SETFL, O_RDWR) = 0

> > > > > +0.1   < .  1:201(200) ack 1 win 225 <dss dack8=1 dsn8=1 ssn=1 dll=200 nocs,  nop, nop>

> > > > > +0.1   > .  1:1(0) ack 201 <nop, nop, TS val 100 ecr 700, dss dack8=201 dll=00 nocs>

> > > > > +0.1 read(3, ..., 0) = 0

> > > > > 

> > > > > The main recvmsg() loop is interrupted by the following check:

> > > > > 

> > > > >                 if (copied >= target)

> > > > >                         break;

> > > > 

> > > > @copied should be 0, and @target should be 1

> > > > 

> > > > Are you sure the above condition is triggering ?

> > > > 

> > > > Maybe read(fd, ..., 0) does not reach recvmsg() at all.

> > > 

> > > Yes, sock_read_iter() has a shortcut :

> > > 

> > > if (!iov_iter_count(to))    /* Match SYS5 behaviour */

> > >      res = sock_recvmsg(sock, &msg, msg.msg_flags);

> > 

> > No idea what went wrong with my copy/paste.

> > 

> > The real code is more like :

> > 

> > if (!iov_iter_count(to))    /* Match SYS5 behaviour */

> >     return 0;

> > 

> 

> Packetdrill recvmsg syntax would be something like

> 

>    +0	recvmsg(3, {msg_name(...)=...,

> 		    msg_iov(1)=[{..., 0}],

> 		    msg_flags=0

> 		    }, 0) = 0


Thank you very much for all the effort!

Yes, with recvmsg() the packet drill hangs. I agree your proposed fix
is correct.

I can test it explicitly later today.

(and sorry for the initial confusing/confused reply)

Paolo
Paolo Abeni Dec. 2, 2020, 5:08 p.m. UTC | #9
On Wed, 2020-12-02 at 17:51 +0100, Paolo Abeni wrote:
> On Wed, 2020-12-02 at 17:45 +0100, Eric Dumazet wrote:

> > Packetdrill recvmsg syntax would be something like

> > 

> >    +0	recvmsg(3, {msg_name(...)=...,

> > 		    msg_iov(1)=[{..., 0}],

> > 		    msg_flags=0

> > 		    }, 0) = 0

> 

> Thank you very much for all the effort!

> 

> Yes, with recvmsg() the packet drill hangs. I agree your proposed fix

> is correct.

> 

> I can test it explicitly later today.


The proposed fix passes successfully the pktdrill test and there are no
regressions in the other self-tests.

Feel free to add:

Tested-by: Paolo Abeni <pabeni@redhat.com>


thanks!

Paolo
diff mbox series

Patch

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 4b7794835fea..371a5e691a9a 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -419,31 +419,57 @@  static bool mptcp_subflow_active(struct mptcp_subflow_context *subflow)
 	return ((1 << ssk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT));
 }
 
-static void mptcp_send_ack(struct mptcp_sock *msk, bool force)
+static bool tcp_can_send_ack(const struct sock *ssk)
+{
+	return !((1 << inet_sk_state_load(ssk)) &
+	       (TCPF_SYN_SENT | TCPF_SYN_RECV | TCPF_TIME_WAIT | TCPF_CLOSE));
+}
+
+static void mptcp_send_ack(struct mptcp_sock *msk)
 {
 	struct mptcp_subflow_context *subflow;
-	struct sock *pick = NULL;
 
 	mptcp_for_each_subflow(msk, subflow) {
 		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
 
-		if (force) {
-			lock_sock(ssk);
+		lock_sock(ssk);
+		if (tcp_can_send_ack(ssk))
 			tcp_send_ack(ssk);
-			release_sock(ssk);
-			continue;
-		}
-
-		/* if the hintes ssk is still active, use it */
-		pick = ssk;
-		if (ssk == msk->ack_hint)
-			break;
+		release_sock(ssk);
 	}
-	if (!force && pick) {
-		lock_sock(pick);
-		tcp_cleanup_rbuf(pick, 1);
-		release_sock(pick);
+}
+
+static bool mptcp_subflow_cleanup_rbuf(struct sock *ssk)
+{
+	int ret;
+
+	lock_sock(ssk);
+	ret = tcp_can_send_ack(ssk);
+	if (ret)
+		tcp_cleanup_rbuf(ssk, 1);
+	release_sock(ssk);
+	return ret;
+}
+
+static void mptcp_cleanup_rbuf(struct mptcp_sock *msk)
+{
+	struct mptcp_subflow_context *subflow;
+
+	/* if the hinted ssk is still active, try to use it */
+	if (likely(msk->ack_hint)) {
+		mptcp_for_each_subflow(msk, subflow) {
+			struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
+
+			if (msk->ack_hint == ssk &&
+			    mptcp_subflow_cleanup_rbuf(ssk))
+				return;
+		}
 	}
+
+	/* otherwise pick the first active subflow */
+	mptcp_for_each_subflow(msk, subflow)
+		if (mptcp_subflow_cleanup_rbuf(mptcp_subflow_tcp_sock(subflow)))
+			return;
 }
 
 static bool mptcp_check_data_fin(struct sock *sk)
@@ -494,7 +520,7 @@  static bool mptcp_check_data_fin(struct sock *sk)
 
 		ret = true;
 		mptcp_set_timeout(sk, NULL);
-		mptcp_send_ack(msk, true);
+		mptcp_send_ack(msk);
 		mptcp_close_wake_up(sk);
 	}
 	return ret;
@@ -1579,6 +1605,11 @@  static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		return -EOPNOTSUPP;
 
 	lock_sock(sk);
+	if (unlikely(sk->sk_state == TCP_LISTEN)) {
+		copied = -ENOTCONN;
+		goto out_err;
+	}
+
 	timeo = sock_rcvtimeo(sk, nonblock);
 
 	len = min_t(size_t, len, INT_MAX);
@@ -1604,7 +1635,7 @@  static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		/* be sure to advertise window change */
 		old_space = READ_ONCE(msk->old_wspace);
 		if ((tcp_space(sk) - old_space) >= old_space)
-			mptcp_send_ack(msk, false);
+			mptcp_cleanup_rbuf(msk);
 
 		/* only the master socket status is relevant here. The exit
 		 * conditions mirror closely tcp_recvmsg()