mbox series

[bpf,0/5] sockmap fixes

Message ID 160477770483.608263.6057216691957042088.stgit@john-XPS-13-9370
Headers show
Series sockmap fixes | expand

Message

John Fastabend Nov. 7, 2020, 7:37 p.m. UTC
This includes fixes for sockmap found after I started running skmsg and
verdict programs on systems that I use daily. To date with attached
series I've been running for multiple weeks without seeing any issues
on systems doing calls, mail, movies, etc.

Also I started running packetdrill and after this series last remaining
fix needed is to handle MSG_EOR correctly. This will come as a follow
up to this, but because we use sendpage to pass pages into TCP stack
we need to enable TCP side some.

---

John Fastabend (5):
      bpf, sockmap: fix partial copy_page_to_iter so progress can still be made
      bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirect
      bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self
      bpf, sockmap: Handle memory acct if skb_verdict prog redirects to self
      bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list


 net/core/skmsg.c   | 84 +++++++++++++++++++++++++++++++++++++++-------
 net/ipv4/tcp_bpf.c |  3 +-
 2 files changed, 73 insertions(+), 14 deletions(-)

--
Signature

Comments

Daniel Borkmann Nov. 12, 2020, 12:19 a.m. UTC | #1
On 11/7/20 8:37 PM, John Fastabend wrote:
> If copy_page_to_iter() fails or even partially completes, but with fewer

> bytes copied than expected we currently reset sg.start and return EFAULT.

> This proves problematic if we already copied data into the user buffer

> before we return an error. Because we leave the copied data in the user

> buffer and fail to unwind the scatterlist so kernel side believes data

> has been copied and user side believes data has _not_ been received.

> 

> Expected behavior should be to return number of bytes copied and then

> on the next read we need to return the error assuming its still there. This

> can happen if we have a copy length spanning multiple scatterlist elements

> and one or more complete before the error is hit.

> 

> The error is rare enough though that my normal testing with server side

> programs, such as nginx, httpd, envoy, etc., I have never seen this. The

> only reliable way to reproduce that I've found is to stream movies over

> my browser for a day or so and wait for it to hang. Not very scientific,

> but with a few extra WARN_ON()s in the code the bug was obvious.

> 

> When we review the errors from copy_page_to_iter() it seems we are hitting

> a page fault from copy_page_to_iter_iovec() where the code checks

> fault_in_pages_writeable(buf, copy) where buf is the user buffer. It

> also seems typical server applications don't hit this case.

> 

> The other way to try and reproduce this is run the sockmap selftest tool

> test_sockmap with data verification enabled, but it doesn't reproduce the

> fault. Perhaps we can trigger this case artificially somehow from the

> test tools. I haven't sorted out a way to do that yet though.

> 

> Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")

> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

> ---

>   net/ipv4/tcp_bpf.c |   15 ++++++++++-----

>   1 file changed, 10 insertions(+), 5 deletions(-)

> 

> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c

> index 37f4cb2bba5c..3709d679436e 100644

> --- a/net/ipv4/tcp_bpf.c

> +++ b/net/ipv4/tcp_bpf.c

> @@ -15,8 +15,8 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,

>   {

>   	struct iov_iter *iter = &msg->msg_iter;

>   	int peek = flags & MSG_PEEK;

> -	int i, ret, copied = 0;

>   	struct sk_msg *msg_rx;

> +	int i, copied = 0;

>   

>   	msg_rx = list_first_entry_or_null(&psock->ingress_msg,

>   					  struct sk_msg, list);

> @@ -37,10 +37,9 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,

>   			page = sg_page(sge);

>   			if (copied + copy > len)

>   				copy = len - copied;

> -			ret = copy_page_to_iter(page, sge->offset, copy, iter);

> -			if (ret != copy) {

> -				msg_rx->sg.start = i;

> -				return -EFAULT;

> +			copy = copy_page_to_iter(page, sge->offset, copy, iter);

> +			if (!copy) {

> +				return copied ? copied : -EFAULT;

>   			}


nit: no need for {}

>   

>   			copied += copy;

> @@ -56,6 +55,11 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,

>   						put_page(page);

>   				}

>   			} else {

> +				/* Lets not optimize peek case if copy_page_to_iter

> +				 * didn't copy the entire length lets just break.

> +				 */

> +				if (copy != sge->length)

> +					goto out;


nit: return copied;

Rest lgtm for this one.

>   				sk_msg_iter_var_next(i);

>   			}

>   

> @@ -82,6 +86,7 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,

>   						  struct sk_msg, list);

>   	}

>   

> +out:

>   	return copied;

>   }

>   EXPORT_SYMBOL_GPL(__tcp_bpf_recvmsg);

> 

>
Daniel Borkmann Nov. 12, 2020, 12:22 a.m. UTC | #2
On 11/7/20 8:38 PM, John Fastabend wrote:
> If a socket redirects to itself and it is under memory pressure it is

> possible to get a socket stuck so that recv() returns EAGAIN and the

> socket can not advance for some time. This happens because when

> redirecting a skb to the same socket we received the skb on we first

> check if it is OK to enqueue the skb on the receiving socket by checking

> memory limits. But, if the skb is itself the object holding the memory

> needed to enqueue the skb we will keep retrying from kernel side

> and always fail with EAGAIN. Then userspace will get a recv() EAGAIN

> error if there are no skbs in the psock ingress queue. This will continue

> until either some skbs get kfree'd causing the memory pressure to

> reduce far enough that we can enqueue the pending packet or the

> socket is destroyed. In some cases its possible to get a socket

> stuck for a noticable amount of time if the socket is only receiving

> skbs from sk_skb verdict programs. To reproduce I make the socket

> memory limits ridiculously low so sockets are always under memory

> pressure. More often though if under memory pressure it looks like

> a spurious EAGAIN error on user space side causing userspace to retry

> and typically enough has moved on the memory side that it works.

> 

> To fix skip memory checks and skb_orphan if receiving on the same

> sock as already assigned.

> 

> For SK_PASS cases this is easy, its always the same socket so we

> can just omit the orphan/set_owner pair.

> 

> For backlog cases we need to check skb->sk and decide if the orphan

> and set_owner pair are needed.

> 

> Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")

> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

> ---

>   net/core/skmsg.c |   72 ++++++++++++++++++++++++++++++++++++++++--------------

>   1 file changed, 53 insertions(+), 19 deletions(-)

> 

> diff --git a/net/core/skmsg.c b/net/core/skmsg.c

> index fe44280c033e..580252e532da 100644

> --- a/net/core/skmsg.c

> +++ b/net/core/skmsg.c

> @@ -399,38 +399,38 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from,

>   }

>   EXPORT_SYMBOL_GPL(sk_msg_memcopy_from_iter);

>   

> -static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb)

> +static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk,

> +						  struct sk_buff *skb)

>   {

> -	struct sock *sk = psock->sk;

> -	int copied = 0, num_sge;

>   	struct sk_msg *msg;

>   

>   	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)

> -		return -EAGAIN;

> +		return NULL;

> +

> +	if (!sk_rmem_schedule(sk, skb, skb->len))


Isn't accounting always truesize based, thus we should fix & convert all skb->len
to skb->truesize ?

> +		return NULL;

>   

>   	msg = kzalloc(sizeof(*msg), __GFP_NOWARN | GFP_ATOMIC);

>   	if (unlikely(!msg))

> -		return -EAGAIN;

> -	if (!sk_rmem_schedule(sk, skb, skb->len)) {

> -		kfree(msg);

> -		return -EAGAIN;

> -	}

> +		return NULL;

>   

>   	sk_msg_init(msg);

> -	num_sge = skb_to_sgvec(skb, msg->sg.data, 0, skb->len);

> +	return msg;

> +}

> +
John Fastabend Nov. 12, 2020, 7:52 p.m. UTC | #3
Daniel Borkmann wrote:
> On 11/7/20 8:38 PM, John Fastabend wrote:

> > If a socket redirects to itself and it is under memory pressure it is

> > possible to get a socket stuck so that recv() returns EAGAIN and the

> > socket can not advance for some time. This happens because when

> > redirecting a skb to the same socket we received the skb on we first

> > check if it is OK to enqueue the skb on the receiving socket by checking

> > memory limits. But, if the skb is itself the object holding the memory

> > needed to enqueue the skb we will keep retrying from kernel side

> > and always fail with EAGAIN. Then userspace will get a recv() EAGAIN

> > error if there are no skbs in the psock ingress queue. This will continue

> > until either some skbs get kfree'd causing the memory pressure to

> > reduce far enough that we can enqueue the pending packet or the

> > socket is destroyed. In some cases its possible to get a socket

> > stuck for a noticable amount of time if the socket is only receiving

> > skbs from sk_skb verdict programs. To reproduce I make the socket

> > memory limits ridiculously low so sockets are always under memory

> > pressure. More often though if under memory pressure it looks like

> > a spurious EAGAIN error on user space side causing userspace to retry

> > and typically enough has moved on the memory side that it works.

> > 

> > To fix skip memory checks and skb_orphan if receiving on the same

> > sock as already assigned.

> > 

> > For SK_PASS cases this is easy, its always the same socket so we

> > can just omit the orphan/set_owner pair.

> > 

> > For backlog cases we need to check skb->sk and decide if the orphan

> > and set_owner pair are needed.

> > 

> > Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")

> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>

> > ---

> >   net/core/skmsg.c |   72 ++++++++++++++++++++++++++++++++++++++++--------------

> >   1 file changed, 53 insertions(+), 19 deletions(-)

> > 

> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c

> > index fe44280c033e..580252e532da 100644

> > --- a/net/core/skmsg.c

> > +++ b/net/core/skmsg.c

> > @@ -399,38 +399,38 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from,

> >   }

> >   EXPORT_SYMBOL_GPL(sk_msg_memcopy_from_iter);

> >   

> > -static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb)

> > +static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk,

> > +						  struct sk_buff *skb)

> >   {

> > -	struct sock *sk = psock->sk;

> > -	int copied = 0, num_sge;

> >   	struct sk_msg *msg;

> >   

> >   	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)

> > -		return -EAGAIN;

> > +		return NULL;

> > +

> > +	if (!sk_rmem_schedule(sk, skb, skb->len))

> 

> Isn't accounting always truesize based, thus we should fix & convert all skb->len

> to skb->truesize ?


Right good catch, will fix in v2.
John Fastabend Nov. 12, 2020, 7:56 p.m. UTC | #4
Daniel Borkmann wrote:
> On 11/7/20 8:37 PM, John Fastabend wrote:

> > If copy_page_to_iter() fails or even partially completes, but with fewer

> > bytes copied than expected we currently reset sg.start and return EFAULT.

> > This proves problematic if we already copied data into the user buffer

> > before we return an error. Because we leave the copied data in the user

> > buffer and fail to unwind the scatterlist so kernel side believes data

> > has been copied and user side believes data has _not_ been received.


[...]

> > +			if (!copy) {

> > +				return copied ? copied : -EFAULT;

> >   			}

> 

> nit: no need for {}

> 

> >   

> >   			copied += copy;

> > @@ -56,6 +55,11 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,

> >   						put_page(page);

> >   				}

> >   			} else {

> > +				/* Lets not optimize peek case if copy_page_to_iter

> > +				 * didn't copy the entire length lets just break.

> > +				 */

> > +				if (copy != sge->length)

> > +					goto out;

> 

> nit: return copied;

> 

> Rest lgtm for this one.


Great, thanks for the review will fixup in v2.