[v6,bpf-next,00/11] Socket migration for SO_REUSEPORT.

Message ID 20210517002258.75019-1-kuniyu@amazon.co.jp

Message

Kuniyuki Iwashima May 17, 2021, 12:22 a.m. UTC
The SO_REUSEPORT option allows sockets to listen on the same port and to
accept connections evenly. However, there is a defect in the current
implementation [1]. When a SYN packet is received, the connection is tied
to a listening socket. Accordingly, when the listener is closed, in-flight
requests during the three-way handshake and child sockets in the accept
queue are dropped even if other listeners on the same port could accept
such connections.
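
For context, a minimal user-space sketch (standard sockets API, not part
of this series) of how each worker joins a reuseport group; error
handling is simplified:

---8<---
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create one listener of a reuseport group; run this in every worker.
 * All sockets in the group must set SO_REUSEPORT before bind().
 */
static int make_reuseport_listener(unsigned short port)
{
	struct sockaddr_in addr;
	int one = 1;
	int fd;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)))
		goto err;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(port);

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)))
		goto err;
	if (listen(fd, 128))
		goto err;

	return fd;
err:
	close(fd);
	return -1;
}
---8<---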

This situation can happen when various server management tools restart
server processes (such as nginx). For instance, when we change the nginx
configuration and restart it, it spins up new workers that respect the
new configuration and closes all listeners on the old workers; as a
result, the in-flight ACKs of the 3WHS are answered with RST.

To avoid such a situation, users have to know in depth how the kernel
handles SYN packets and implement connection draining with eBPF [2]
(a rough sketch follows the second list below):

  1. Stop routing SYN packets to the listener by eBPF.
  2. Wait for all timers to expire to complete the requests.
  3. Accept connections until EAGAIN, then close the listener.

  or

  1. Start counting SYN packets and accept() calls using an eBPF map.
  2. Stop routing SYN packets.
  3. Accept connections up to the count, then close the listener.
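
As a rough illustration, the drain step of the first workaround might
look like the following (the eBPF part that stops routing SYN packets is
omitted, and hand_off_to_worker() is a hypothetical helper):

---8<---
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

extern void hand_off_to_worker(int fd);	/* hypothetical */

/* Call this only after SYN packets are no longer routed to 'listener'
 * and all request timers have expired.  The listener must be
 * non-blocking so that accept() returns EAGAIN once the queue is empty.
 */
static void drain_and_close(int listener)
{
	for (;;) {
		int child = accept(listener, NULL, NULL);

		if (child < 0) {
			if (errno == EINTR)
				continue;
			break;	/* EAGAIN: accept queue is empty */
		}
		hand_off_to_worker(child);
	}
	close(listener);
}
---8<---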

Either way, we cannot close a listener immediately. Ideally, however,
the application should not need to drain the not-yet-accepted sockets
because the 3WHS and tying a connection to a listener are purely kernel
behaviour. The root cause is within the kernel, so the issue should be
addressed in kernel space and should not be visible to user space. This
patchset fixes it so that users need not care about the kernel
implementation or connection draining. With this patchset, the kernel
redistributes requests and connections from a listener to the others in
the same reuseport group at/after close() or shutdown() syscalls.
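
With the series applied, a sketch of enabling this behaviour from C via
the new sysctl introduced in the first patch (equivalent to
sysctl -w net.ipv4.tcp_migrate_req=1):

---8<---
#include <fcntl.h>
#include <unistd.h>

/* Turn on request/child-socket migration for closed listeners. */
static int enable_tcp_migrate_req(void)
{
	int fd = open("/proc/sys/net/ipv4/tcp_migrate_req", O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, "1", 1);
	close(fd);
	return n == 1 ? 0 : -1;
}
---8<---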

Although some software does connection draining, migration still has
merits. For security reasons, such as replacing TLS certificates, we may
want to apply new settings as soon as possible and/or may not be able to
wait for connection draining. The sockets in the accept queue have not
started application sessions yet, so if we do not drain such sockets,
they can be handled by the newer listeners and can have a longer
lifetime. It is difficult to drain all connections in every case, but we
can decrease the number of aborted connections by migration. In that
sense, migration is always better than draining.

Moreover, auto-migration simplifies user-space logic and also works well
in cases where we cannot modify and rebuild a server program to
implement the workaround.

Note that the source and destination listeners MUST have the same
settings at the socket API level; otherwise, applications may face
inconsistency and errors. In such a case, we have to use an eBPF program
to select a specific listener or to cancel migration.
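
For example, a condensed sketch in the spirit of the selftest in the
11th patch (illustrative only; the map layout and section name here are
assumptions based on this series): the program can distinguish the
migration path by the new migrating_sk field and either pick a specific
listener from a map or cancel the migration with SK_DROP.

---8<---
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} migrate_map SEC(".maps");

SEC("sk_reuseport/migrate")
int select_or_migrate(struct sk_reuseport_md *md)
{
	__u32 key = 0;

	/* md->migrating_sk is non-NULL only when the prog runs for
	 * socket migration rather than for normal SYN selection.
	 */
	if (md->migrating_sk) {
		/* Select the new listener stored in the map; return
		 * SK_DROP instead to cancel this migration.
		 */
		if (bpf_sk_select_reuseport(md, &migrate_map, &key, 0))
			return SK_DROP;
		return SK_PASS;
	}

	/* Normal path: SK_PASS without a selection falls back to the
	 * kernel's hash-based pick.
	 */
	return SK_PASS;
}

char _license[] SEC("license") = "GPL";
---8<---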

Special thanks to Martin KaFai Lau for bouncing ideas and exchanging code
snippets along the way.


Link:
 [1] The SO_REUSEPORT socket option
 https://lwn.net/Articles/542629/

 [2] Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
 https://lore.kernel.org/netdev/1458828813.10868.65.camel@edumazet-glaptop3.roam.corp.google.com/


Changelog:
 v6:
  * Change description in ip-sysctl.rst
  * Test IPPROTO_TCP before reading tfo_listener
  * Move reqsk_clone() to inet_connection_sock.c and rename to
    inet_reqsk_clone()
  * Pass req->rsk_listener to inet_csk_reqsk_queue_drop() and
    reqsk_queue_removed() in the migration path of receiving ACK
  * s/ARG_PTR_TO_SOCKET/PTR_TO_SOCKET/ in sk_reuseport_is_valid_access()
  * In selftest, use atomic ops to increment global vars, drop ACK by XDP,
    enable force fastopen, use "skel->bss" instead of "skel->data"

 v5:
 https://lore.kernel.org/bpf/20210510034433.52818-1-kuniyu@amazon.co.jp/
  * Move initialization of sk_node from 6th to 5th patch
  * Initialize sk_refcnt in reqsk_clone()
  * Modify some definitions in reqsk_timer_handler()
  * Validate in which path/state migration happens in selftest

 v4:
 https://lore.kernel.org/bpf/20210427034623.46528-1-kuniyu@amazon.co.jp/
  * Make some functions and variables 'static' in selftest
  * Remove 'scalability' from the cover letter

 v3:
 https://lore.kernel.org/bpf/20210420154140.80034-1-kuniyu@amazon.co.jp/
  * Add sysctl back for reuseport_grow()
  * Add helper functions to manage socks[]
  * Separate migration related logic into functions: reuseport_resurrect(),
    reuseport_stop_listen_sock(), reuseport_migrate_sock()
  * Clone request_sock to be migrated
  * Migrate request one by one
  * Pass child socket to eBPF prog

 v2:
 https://lore.kernel.org/netdev/20201207132456.65472-1-kuniyu@amazon.co.jp/
  * Do not save closed sockets in socks[]
  * Revert 607904c357c61adf20b8fd18af765e501d61a385
  * Extract inet_csk_reqsk_queue_migrate() into a single patch
  * Change the spin_lock order to avoid lockdep warning
  * Add static to __reuseport_select_sock
  * Use refcount_inc_not_zero() in reuseport_select_migrated_sock()
  * Set the default attach type in bpf_prog_load_check_attach()
  * Define new proto of BPF_FUNC_get_socket_cookie
  * Fix test to be compiled successfully
  * Update commit messages

 v1:
 https://lore.kernel.org/netdev/20201201144418.35045-1-kuniyu@amazon.co.jp/
  * Remove the sysctl option
  * Enable migration if eBPF program is not attached
  * Add expected_attach_type to check if eBPF program can migrate sockets
  * Add a field to tell migration type to eBPF program
  * Support BPF_FUNC_get_socket_cookie to get the cookie of sk
  * Allocate an empty skb if skb is NULL
  * Pass req_to_sk(req)->sk_hash because listener's hash is zero
  * Update commit messages and cover letter

 RFC:
 https://lore.kernel.org/netdev/20201117094023.3685-1-kuniyu@amazon.co.jp/


Kuniyuki Iwashima (11):
  net: Introduce net.ipv4.tcp_migrate_req.
  tcp: Add num_closed_socks to struct sock_reuseport.
  tcp: Keep TCP_CLOSE sockets in the reuseport group.
  tcp: Add reuseport_migrate_sock() to select a new listener.
  tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  tcp: Migrate TCP_NEW_SYN_RECV requests at retransmitting SYN+ACKs.
  tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK.
  bpf: Support BPF_FUNC_get_socket_cookie() for
    BPF_PROG_TYPE_SK_REUSEPORT.
  bpf: Support socket migration by eBPF.
  libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT.
  bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.

 Documentation/networking/ip-sysctl.rst        |  25 +
 include/linux/bpf.h                           |   1 +
 include/linux/filter.h                        |   2 +
 include/net/netns/ipv4.h                      |   1 +
 include/net/sock_reuseport.h                  |   9 +-
 include/uapi/linux/bpf.h                      |  16 +
 kernel/bpf/syscall.c                          |  13 +
 net/core/filter.c                             |  23 +-
 net/core/sock_reuseport.c                     | 337 +++++++++--
 net/ipv4/inet_connection_sock.c               | 190 +++++-
 net/ipv4/inet_hashtables.c                    |   2 +-
 net/ipv4/sysctl_net_ipv4.c                    |   9 +
 net/ipv4/tcp_ipv4.c                           |  20 +-
 net/ipv4/tcp_minisocks.c                      |   4 +-
 net/ipv6/tcp_ipv6.c                           |  14 +-
 tools/include/uapi/linux/bpf.h                |  16 +
 tools/lib/bpf/libbpf.c                        |   5 +-
 tools/testing/selftests/bpf/network_helpers.c |   2 +-
 tools/testing/selftests/bpf/network_helpers.h |   1 +
 .../bpf/prog_tests/migrate_reuseport.c        | 553 ++++++++++++++++++
 .../bpf/progs/test_migrate_reuseport.c        | 135 +++++
 21 files changed, 1314 insertions(+), 64 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_migrate_reuseport.c

Comments

Martin KaFai Lau May 20, 2021, 6:26 a.m. UTC | #1
On Mon, May 17, 2021 at 09:22:50AM +0900, Kuniyuki Iwashima wrote:

> +static int reuseport_resurrect(struct sock *sk, struct sock_reuseport *old_reuse,
> +			       struct sock_reuseport *reuse, bool bind_inany)
> +{
> +	if (old_reuse == reuse) {
> +		/* If sk was in the same reuseport group, just pop sk out of
> +		 * the closed section and push sk into the listening section.
> +		 */
> +		__reuseport_detach_closed_sock(sk, old_reuse);
> +		__reuseport_add_sock(sk, old_reuse);
> +		return 0;
> +	}
> +
> +	if (!reuse) {
> +		/* In bind()/listen() path, we cannot carry over the eBPF prog
> +		 * for the shutdown()ed socket. In setsockopt() path, we should
> +		 * not change the eBPF prog of listening sockets by attaching a
> +		 * prog to the shutdown()ed socket. Thus, we will allocate a new
> +		 * reuseport group and detach sk from the old group.
> +		 */

For the reuseport_attach_prog() path, I think it needs to consider
the reuse->num_closed_socks != 0 case also and that should belong
to the resurrect case.  For example, when
sk_unhashed(sk) but sk->sk_reuseport == 0.
Martin KaFai Lau May 20, 2021, 6:27 a.m. UTC | #2
On Mon, May 17, 2021 at 09:22:56AM +0900, Kuniyuki Iwashima wrote:
> This patch introduces a new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT
> to check if the attached eBPF program is capable of migrating sockets. When
> the eBPF program is attached, we run it for socket migration if the
> expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE or
> net.ipv4.tcp_migrate_req is enabled.
> 
> Ccurrently, the expected_attach_type is not enforced for the
nit: 'Currently,'
Martin KaFai Lau May 20, 2021, 6:30 a.m. UTC | #3
On Mon, May 17, 2021 at 09:22:47AM +0900, Kuniyuki Iwashima wrote:
> The SO_REUSEPORT option allows sockets to listen on the same port and to
> accept connections evenly. However, there is a defect in the current
> implementation [1].
> [...]
> Changelog:
>  v6:
>   * Change description in ip-sysctl.rst
>   [...]

Some commit messages need to be updated: s/reqsk_clone/inet_reqsk_clone/

One thing needs to be addressed in patch 3.

Others lgtm.

Acked-by: Martin KaFai Lau <kafai@fb.com>
Kuniyuki Iwashima May 20, 2021, 8:51 a.m. UTC | #4
From:   Martin KaFai Lau <kafai@fb.com>
Date:   Wed, 19 May 2021 23:26:48 -0700
> On Mon, May 17, 2021 at 09:22:50AM +0900, Kuniyuki Iwashima wrote:
> > +static int reuseport_resurrect(struct sock *sk, struct sock_reuseport *old_reuse,
> > [...]
> For the reuseport_attach_prog() path, I think it needs to consider
> the reuse->num_closed_socks != 0 case also and that should belong
> to the resurrect case.  For example, when
> sk_unhashed(sk) but sk->sk_reuseport == 0.

In the path, reuseport_resurrect() is called from reuseport_alloc() only
if reuse->num_closed_socks != 0.

> @@ -92,6 +117,14 @@ int reuseport_alloc(struct sock *sk, bool bind_inany)
>  	reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
>  					  lockdep_is_held(&reuseport_lock));
>  	if (reuse) {
> +		if (reuse->num_closed_socks) {

But, should this be

	if (sk->sk_state == TCP_CLOSE && reuse->num_closed_socks)

because we need not allocate a new group when we attach a bpf prog to
listeners?

> +			/* sk was shutdown()ed before */
> +			int err = reuseport_resurrect(sk, reuse, NULL, bind_inany);
> +
> +			spin_unlock_bh(&reuseport_lock);
> +			return err;
> +		}
> +
>  		/* Only set reuse->bind_inany if the bind_inany is true.
>  		 * Otherwise, it will overwrite the reuse->bind_inany
>  		 * which was set by the bind/hash path.
Kuniyuki Iwashima May 20, 2021, 8:54 a.m. UTC | #5
From:   Martin KaFai Lau <kafai@fb.com>
Date:   Wed, 19 May 2021 23:27:23 -0700
> On Mon, May 17, 2021 at 09:22:56AM +0900, Kuniyuki Iwashima wrote:
> > This patch introduces a new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT
> > [...]
> > Ccurrently, the expected_attach_type is not enforced for the
> nit: 'Currently,'

Thank you, I'll fix it to 'Currently' :)
Kuniyuki Iwashima May 20, 2021, 8:58 a.m. UTC | #6
From:   Martin KaFai Lau <kafai@fb.com>
Date:   Wed, 19 May 2021 23:30:29 -0700
> On Mon, May 17, 2021 at 09:22:47AM +0900, Kuniyuki Iwashima wrote:
> > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > accept connections evenly. However, there is a defect in the current
> > implementation [1].
> > [...]
> Some commit messages need to be updated: s/reqsk_clone/inet_reqsk_clone/

I'll fix them.

> One thing needs to be addressed in patch 3.
> 
> Others lgtm.
> 
> Acked-by: Martin KaFai Lau <kafai@fb.com>

Thank you!!

I'll respin after the discussion about the 3rd patch.
Martin KaFai Lau May 20, 2021, 9:22 p.m. UTC | #7
On Thu, May 20, 2021 at 05:51:17PM +0900, Kuniyuki Iwashima wrote:
> > > +static int reuseport_resurrect(struct sock *sk, struct sock_reuseport *old_reuse,
> > > [...]
> > For the reuseport_attach_prog() path, I think it needs to consider
> > the reuse->num_closed_socks != 0 case also and that should belong
> > to the resurrect case.  For example, when
> > sk_unhashed(sk) but sk->sk_reuseport == 0.
> 
> In the path, reuseport_resurrect() is called from reuseport_alloc() only
> if reuse->num_closed_socks != 0.
> 
> > @@ -92,6 +117,14 @@ int reuseport_alloc(struct sock *sk, bool bind_inany)
> >  	reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
> >  					  lockdep_is_held(&reuseport_lock));
> >  	if (reuse) {
> > +		if (reuse->num_closed_socks) {
> 
> But, should this be
> 
> 	if (sk->sk_state == TCP_CLOSE && reuse->num_closed_socks)
> 
> because we need not allocate a new group when we attach a bpf prog to
> listeners?
The reuseport_alloc() is fine as is.  No need to change.

I should have copied reuseport_attach_prog() in the last reply and
commented there instead.

I meant reuseport_attach_prog() needs a change.  In reuseport_attach_prog(),
iiuc, currently passing the "else if (!rcu_access_pointer(sk->sk_reuseport_cb))"
check implies the sk was (and still is) hashed with sk_reuseport enabled
because the current behavior would have set sk_reuseport_cb to NULL during
unhash but it is no longer true now.  For example, this will break:

1. shutdown(lsk); /* lsk was bound with sk_reuseport enabled */
2. setsockopt(lsk, ..., SO_REUSEPORT, &zero, ...); /* disable sk_reuseport */
3. setsockopt(lsk, ..., SO_ATTACH_REUSEPORT_EBPF, &prog_fd, ...);
   ^---- /* This will work now because sk_reuseport_cb is not NULL.
          * However, it shouldn't be allowed.
	  */

I am thinking something like this (uncompiled code):

int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog)
{
	struct sock_reuseport *reuse;
	struct bpf_prog *old_prog;

	if (sk_unhashed(sk)) {
		int err;

		if (!sk->sk_reuseport)
			return -EINVAL;

		err = reuseport_alloc(sk, false);
		if (err)
			return err;
	} else if (!rcu_access_pointer(sk->sk_reuseport_cb)) {
		/* The socket wasn't bound with SO_REUSEPORT */
		return -EINVAL;
	}

	/* ... */
}

WDYT?
Kuniyuki Iwashima May 20, 2021, 10:54 p.m. UTC | #8
From:   Martin KaFai Lau <kafai@fb.com>
Date:   Thu, 20 May 2021 14:22:01 -0700
> On Thu, May 20, 2021 at 05:51:17PM +0900, Kuniyuki Iwashima wrote:
> > But, should this be
> > 
> > 	if (sk->sk_state == TCP_CLOSE && reuse->num_closed_socks)
> > 
> > because we need not allocate a new group when we attach a bpf prog to
> > listeners?
> The reuseport_alloc() is fine as is.  No need to change.

I missed sk_unhashed(sk) prevents calling reuseport_alloc()
if sk_state == TCP_LISTEN. I'll keep it as is.

> I should have copied reuseport_attach_prog() in the last reply and
> commented there instead.
> 
> I meant reuseport_attach_prog() needs a change.  In reuseport_attach_prog(),
> iiuc, currently passing the "else if (!rcu_access_pointer(sk->sk_reuseport_cb))"
> check implies the sk was (and still is) hashed with sk_reuseport enabled
> because the current behavior would have set sk_reuseport_cb to NULL during
> unhash but it is no longer true now.  For example, this will break:
> 
> 1. shutdown(lsk); /* lsk was bound with sk_reuseport enabled */
> 2. setsockopt(lsk, ..., SO_REUSEPORT, &zero, ...); /* disable sk_reuseport */
> 3. setsockopt(lsk, ..., SO_ATTACH_REUSEPORT_EBPF, &prog_fd, ...);
>    ^---- /* This will work now because sk_reuseport_cb is not NULL.
>           * However, it shouldn't be allowed.
> 	  */

Thank you for explanation, I understood the case.

Exactly, I've confirmed that the case succeeded in the setsockopt() and I
could change the active listeners' prog via a shutdowned socket.

> I am thinking something like this (uncompiled code):
> 
> int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog)
> {
> 	[...]
> }
> 
> WDYT?

I tested this change worked fine. I think this change should be added in
reuseport_detach_prog() also.

---8<---
int reuseport_detach_prog(struct sock *sk)
{
        struct sock_reuseport *reuse;
        struct bpf_prog *old_prog;

        if (!rcu_access_pointer(sk->sk_reuseport_cb))
		return sk->sk_reuseport ? -ENOENT : -EINVAL;
---8<---

Another option is to add the check in sock_setsockopt():
SO_ATTACH_REUSEPORT_[CE]BPF, SO_DETACH_REUSEPORT_BPF.

Which do you think is better ?
Martin KaFai Lau May 20, 2021, 11:39 p.m. UTC | #9
On Fri, May 21, 2021 at 07:54:48AM +0900, Kuniyuki Iwashima wrote:
> > I am thinking something like this (uncompiled code):
> > [...]
> > WDYT?
> 
> I tested this change worked fine. I think this change should be added in
> reuseport_detach_prog() also.
> 
> ---8<---
> int reuseport_detach_prog(struct sock *sk)
> {
>         struct sock_reuseport *reuse;
>         struct bpf_prog *old_prog;
> 
>         if (!rcu_access_pointer(sk->sk_reuseport_cb))
> 		return sk->sk_reuseport ? -ENOENT : -EINVAL;
> ---8<---
Right, a quick thought is something like this for detach:

	spin_lock_bh(&reuseport_lock);
	reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
					  lockdep_is_held(&reuseport_lock));
	if (sk_unhashed(sk) && reuse->num_closed_socks) {
		spin_unlock_bh(&reuseport_lock);
		return -ENOENT;
	}

Although checking with reuseport_sock_index() will also work,
the above probably is simpler and faster?

> Another option is to add the check in sock_setsockopt():
> SO_ATTACH_REUSEPORT_[CE]BPF, SO_DETACH_REUSEPORT_BPF.
> 
> Which do you think is better ?
I think it is better to have this sock_reuseport specific bits
staying in sock_reuseport.c.
Kuniyuki Iwashima May 21, 2021, 12:26 a.m. UTC | #10
From:   Martin KaFai Lau <kafai@fb.com>
Date:   Thu, 20 May 2021 16:39:06 -0700
> On Fri, May 21, 2021 at 07:54:48AM +0900, Kuniyuki Iwashima wrote:
> > [...]
> Right, a quick thought is something like this for detach:
> 
> 	spin_lock_bh(&reuseport_lock);
> 	reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
> 					  lockdep_is_held(&reuseport_lock));

Is this necessary because reuseport_grow() can detach sk?

        if (!reuse) {
                spin_unlock_bh(&reuseport_lock);
                return -ENOENT;
        }

Then we can remove rcu_access_pointer() check and move sk_reuseport check
here.

> 	if (sk_unhashed(sk) && reuse->num_closed_socks) {
> 		spin_unlock_bh(&reuseport_lock);
> 		return -ENOENT;
> 	}
> 
> Although checking with reuseport_sock_index() will also work,
> the above probably is simpler and faster?

Yes, if sk is unhashed and has sk_reuseport_cb, it stays in the closed
section of socks[] and num_closed_socks is larger than 0.

> > Another option is to add the check in sock_setsockopt():
> > SO_ATTACH_REUSEPORT_[CE]BPF, SO_DETACH_REUSEPORT_BPF.
> > 
> > Which do you think is better ?
> I think it is better to have this sock_reuseport specific bits
> staying in sock_reuseport.c.

Exactly, I'll keep the change in sock_reuseport.c
Martin KaFai Lau May 21, 2021, 4:47 a.m. UTC | #11
On Fri, May 21, 2021 at 09:26:39AM +0900, Kuniyuki Iwashima wrote:
> > Right, a quick thought is something like this for detach:
> > 
> > 	spin_lock_bh(&reuseport_lock);
> > 	reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
> > 					  lockdep_is_held(&reuseport_lock));
> 
> Is this necessary because reuseport_grow() can detach sk?
> 
>         if (!reuse) {
>                 spin_unlock_bh(&reuseport_lock);
>                 return -ENOENT;
>         }
Yes, it is needed.  Please add a comment for the reuseport_grow() case also.

> Then we can remove rcu_access_pointer() check and move sk_reuseport check
> here.
Make sense.
Kuniyuki Iwashima May 21, 2021, 5:15 a.m. UTC | #12
From:   Martin KaFai Lau <kafai@fb.com>
Date:   Thu, 20 May 2021 21:47:25 -0700
> On Fri, May 21, 2021 at 09:26:39AM +0900, Kuniyuki Iwashima wrote:
> > Is this necessary because reuseport_grow() can detach sk?
> > 
> >         if (!reuse) {
> >                 spin_unlock_bh(&reuseport_lock);
> >                 return -ENOENT;
> >         }
> Yes, it is needed.  Please add a comment for the reuseport_grow() case also.

I see, I'll add this change in the next spin.
Thank you!

---8<---
@@ -608,13 +612,24 @@ int reuseport_detach_prog(struct sock *sk)
        struct sock_reuseport *reuse;
        struct bpf_prog *old_prog;
 
-       if (!rcu_access_pointer(sk->sk_reuseport_cb))
-               return sk->sk_reuseport ? -ENOENT : -EINVAL;
-
        old_prog = NULL;
        spin_lock_bh(&reuseport_lock);
        reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
                                          lockdep_is_held(&reuseport_lock));
+
+       /* reuse must be checked after acquiring the reuseport_lock
+        * because reuseport_grow() can detach a closed sk.
+        */
+       if (!reuse) {
+               spin_unlock_bh(&reuseport_lock);
+               return sk->sk_reuseport ? -ENOENT : -EINVAL;
+       }
+
+       if (sk_unhashed(sk) && reuse->num_closed_socks) {
+               spin_unlock_bh(&reuseport_lock);
+               return -ENOENT;
+       }
+
        old_prog = rcu_replace_pointer(reuse->prog, old_prog,
                                       lockdep_is_held(&reuseport_lock));
        spin_unlock_bh(&reuseport_lock);
---8<---

> > Then we can remove rcu_access_pointer() check and move sk_reuseport check
> > here.
> Make sense.