From patchwork Tue Nov 17 09:40:17 2020
X-Patchwork-Submitter: Kuniyuki Iwashima
X-Patchwork-Id: 325046
From: Kuniyuki Iwashima
To: "David S. Miller", Jakub Kicinski, Eric Dumazet, Alexei Starovoitov, Daniel Borkmann
CC: Benjamin Herrenschmidt, Kuniyuki Iwashima
Subject: [RFC PATCH bpf-next 2/8] tcp: Keep TCP_CLOSE sockets in the reuseport group.
Date: Tue, 17 Nov 2020 18:40:17 +0900
Message-ID: <20201117094023.3685-3-kuniyu@amazon.co.jp>

This patch is a preparation for migrating incoming connections in later
commits. It adds two fields, migrate_req and num_closed_socks, to struct
sock_reuseport so that TCP_CLOSE sockets can stay in the reuseport group.

If migrate_req is 1 and we close a listening socket, we can migrate its
connections to another listener in the same reuseport group. We then have
to handle two kinds of child sockets: those the listening socket still has
a reference to, and those it does not. The former are the
TCP_ESTABLISHED/TCP_SYN_RECV sockets sitting in the accept queue of their
listener, so we can pop them out and push them into another listener's
queue at close() or shutdown(). The latter, the TCP_NEW_SYN_RECV sockets,
are still in the three-way handshake and not in the accept queue, so we
cannot reach them at close() or shutdown(). Accordingly, we have to
migrate such immature sockets after their listening socket has been
closed.

Currently, once their listening socket has been closed, TCP_NEW_SYN_RECV
sockets are freed when the final ACK is received or a SYN+ACK is
retransmitted. If we could select a new listener from the same reuseport
group at that point, no connection would be aborted. However, this is
impossible because reuseport_detach_sock() sets sk_reuseport_cb to NULL
and forbids closed sockets from accessing the reuseport group.

This patch allows TCP_CLOSE sockets to remain in the reuseport group and
to keep access to it while any child socket still references them. The
point is that reuseport_detach_sock() is called twice, from inet_unhash()
and sk_destruct(). The first call moves the socket to the tail of socks[]
and increments num_closed_socks. The second call, once all migrated
connections have been accepted, removes the socket from socks[],
decrements num_closed_socks, and sets sk_reuseport_cb to NULL. With this
change, closed sockets keep sk_reuseport_cb until all of their child
requests have been freed or accepted.

Consequently, calling listen() after shutdown() can hit EADDRINUSE or
EBUSY in reuseport_add_sock() or inet_csk_bind_conflict(), which expect
such sockets not to have a reuseport group. Therefore, this patch also
loosens those validation rules so that a socket can listen again if it
shares its reuseport group with other listening sockets.
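To make the intended behaviour concrete, the scenario looks roughly like
the userspace sketch below. This is illustrative only and not part of the
patch: it assumes the per-netns knob referenced as sysctl_tcp_migrate_req
in the diff is exposed as net.ipv4.tcp_migrate_req by an earlier patch in
the series, and error handling is omitted.

/* Illustrative sketch, not part of the patch: two listeners share a
 * port via SO_REUSEPORT.  With request migration enabled (assumed
 * knob: net.ipv4.tcp_migrate_req = 1), closing lfd2 hands its pending
 * and in-flight connections over to lfd1 instead of aborting them
 * with RST.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static int make_listener(void)
{
	int one = 1;
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port   = htons(8080),
		.sin_addr   = { .s_addr = htonl(INADDR_ANY) },
	};
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
	bind(fd, (struct sockaddr *)&addr, sizeof(addr));
	listen(fd, 128);
	return fd;
}

int main(void)
{
	int lfd1 = make_listener();
	int lfd2 = make_listener();

	/* ... incoming connections are distributed over lfd1 and lfd2 ... */

	close(lfd2);	/* children queued on lfd2 migrate to lfd1 */

	for (;;) {
		int cfd = accept(lfd1, NULL, NULL);

		if (cfd < 0)
			break;
		close(cfd);
	}
	return 0;
}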
Reviewed-by: Benjamin Herrenschmidt
Signed-off-by: Kuniyuki Iwashima
---
 include/net/sock_reuseport.h    |  6 ++-
 net/core/sock_reuseport.c       | 83 +++++++++++++++++++++++++++------
 net/ipv4/inet_connection_sock.c |  7 ++-
 3 files changed, 79 insertions(+), 17 deletions(-)

diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
index 505f1e18e9bf..ade3af55c91f 100644
--- a/include/net/sock_reuseport.h
+++ b/include/net/sock_reuseport.h
@@ -13,8 +13,9 @@ extern spinlock_t reuseport_lock;
 struct sock_reuseport {
 	struct rcu_head		rcu;
 
-	u16			max_socks;	/* length of socks */
-	u16			num_socks;	/* elements in socks */
+	u16			max_socks;		/* length of socks */
+	u16			num_socks;		/* elements in socks */
+	u16			num_closed_socks;	/* closed elements in socks */
 	/* The last synq overflow event timestamp of this
 	 * reuse->socks[] group.
 	 */
@@ -23,6 +24,7 @@ struct sock_reuseport {
 	unsigned int		reuseport_id;
 	unsigned int		bind_inany:1;
 	unsigned int		has_conns:1;
+	unsigned int		migrate_req:1;
 	struct bpf_prog __rcu	*prog;		/* optional BPF sock selector */
 	struct sock		*socks[];	/* array of sock pointers */
 };
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index bbdd3c7b6cb5..01a8b4ba39d7 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -36,6 +36,7 @@ static struct sock_reuseport *__reuseport_alloc(unsigned int max_socks)
 int reuseport_alloc(struct sock *sk, bool bind_inany)
 {
 	struct sock_reuseport *reuse;
+	struct net *net = sock_net(sk);
 	int id, ret = 0;
 
 	/* bh lock used since this function call may precede hlist lock in
@@ -75,6 +76,8 @@ int reuseport_alloc(struct sock *sk, bool bind_inany)
 	reuse->socks[0] = sk;
 	reuse->num_socks = 1;
 	reuse->bind_inany = bind_inany;
+	reuse->migrate_req = sk->sk_protocol == IPPROTO_TCP ?
+			     net->ipv4.sysctl_tcp_migrate_req : 0;
 	rcu_assign_pointer(sk->sk_reuseport_cb, reuse);
 
 out:
@@ -98,16 +101,22 @@ static struct sock_reuseport *reuseport_grow(struct sock_reuseport *reuse)
 		return NULL;
 
 	more_reuse->num_socks = reuse->num_socks;
+	more_reuse->num_closed_socks = reuse->num_closed_socks;
 	more_reuse->prog = reuse->prog;
 	more_reuse->reuseport_id = reuse->reuseport_id;
 	more_reuse->bind_inany = reuse->bind_inany;
 	more_reuse->has_conns = reuse->has_conns;
+	more_reuse->migrate_req = reuse->migrate_req;
+	more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts);
 
 	memcpy(more_reuse->socks, reuse->socks,
 	       reuse->num_socks * sizeof(struct sock *));
-	more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts);
+	memcpy(more_reuse->socks +
+	       (more_reuse->max_socks - more_reuse->num_closed_socks),
+	       reuse->socks + reuse->num_socks,
+	       reuse->num_closed_socks * sizeof(struct sock *));
 
-	for (i = 0; i < reuse->num_socks; ++i)
+	for (i = 0; i < reuse->max_socks; ++i)
 		rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb,
 				   more_reuse);
 
@@ -129,6 +138,25 @@ static void reuseport_free_rcu(struct rcu_head *head)
 	kfree(reuse);
 }
 
+static int reuseport_sock_index(struct sock_reuseport *reuse, struct sock *sk,
+				bool closed)
+{
+	int left, right;
+
+	if (!closed) {
+		left = 0;
+		right = reuse->num_socks;
+	} else {
+		left = reuse->max_socks - reuse->num_closed_socks;
+		right = reuse->max_socks;
+	}
+
+	for (; left < right; left++)
+		if (reuse->socks[left] == sk)
+			return left;
+	return -1;
+}
+
 /**
  *  reuseport_add_sock - Add a socket to the reuseport group of another.
  *  @sk:  New socket to add to the group.
@@ -153,12 +181,23 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
 					  lockdep_is_held(&reuseport_lock));
 	old_reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
 					      lockdep_is_held(&reuseport_lock));
-	if (old_reuse && old_reuse->num_socks != 1) {
+
+	if (old_reuse == reuse) {
+		int i = reuseport_sock_index(reuse, sk, true);
+
+		if (i == -1) {
+			spin_unlock_bh(&reuseport_lock);
+			return -EBUSY;
+		}
+
+		reuse->socks[i] = reuse->socks[reuse->max_socks - reuse->num_closed_socks];
+		reuse->num_closed_socks--;
+	} else if (old_reuse && old_reuse->num_socks != 1) {
 		spin_unlock_bh(&reuseport_lock);
 		return -EBUSY;
 	}
 
-	if (reuse->num_socks == reuse->max_socks) {
+	if (reuse->num_socks + reuse->num_closed_socks == reuse->max_socks) {
 		reuse = reuseport_grow(reuse);
 		if (!reuse) {
 			spin_unlock_bh(&reuseport_lock);
@@ -174,8 +213,9 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
 
 	spin_unlock_bh(&reuseport_lock);
 
-	if (old_reuse)
+	if (old_reuse && old_reuse != reuse)
 		call_rcu(&old_reuse->rcu, reuseport_free_rcu);
+
 	return 0;
 }
 EXPORT_SYMBOL(reuseport_add_sock);
@@ -199,17 +239,34 @@ void reuseport_detach_sock(struct sock *sk)
 	 */
 	bpf_sk_reuseport_detach(sk);
 
-	rcu_assign_pointer(sk->sk_reuseport_cb, NULL);
+	if (!reuse->migrate_req || sk->sk_state == TCP_LISTEN) {
+		i = reuseport_sock_index(reuse, sk, false);
+		if (i == -1)
+			goto out;
+
+		reuse->num_socks--;
+		reuse->socks[i] = reuse->socks[reuse->num_socks];
 
-	for (i = 0; i < reuse->num_socks; i++) {
-		if (reuse->socks[i] == sk) {
-			reuse->socks[i] = reuse->socks[reuse->num_socks - 1];
-			reuse->num_socks--;
-			if (reuse->num_socks == 0)
-				call_rcu(&reuse->rcu, reuseport_free_rcu);
-			break;
+		if (reuse->migrate_req) {
+			reuse->num_closed_socks++;
+			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
+		} else {
+			rcu_assign_pointer(sk->sk_reuseport_cb, NULL);
 		}
+	} else {
+		i = reuseport_sock_index(reuse, sk, true);
+		if (i == -1)
+			goto out;
+
+		reuse->socks[i] = reuse->socks[reuse->max_socks - reuse->num_closed_socks];
+		reuse->num_closed_socks--;
+
+		rcu_assign_pointer(sk->sk_reuseport_cb, NULL);
 	}
+
+	if (reuse->num_socks + reuse->num_closed_socks == 0)
+		call_rcu(&reuse->rcu, reuseport_free_rcu);
+
+out:
 	spin_unlock_bh(&reuseport_lock);
 }
 EXPORT_SYMBOL(reuseport_detach_sock);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 4148f5f78f31..be8cda5b664f 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -138,6 +138,7 @@ static int inet_csk_bind_conflict(const struct sock *sk,
 	bool reuse = sk->sk_reuse;
 	bool reuseport = !!sk->sk_reuseport;
 	kuid_t uid = sock_i_uid((struct sock *)sk);
+	struct sock_reuseport *reuseport_cb = rcu_access_pointer(sk->sk_reuseport_cb);
 
 	/*
 	 * Unlike other sk lookup places we do not check
@@ -156,14 +157,16 @@ static int inet_csk_bind_conflict(const struct sock *sk,
 			if ((!relax ||
 			     (!reuseport_ok &&
 			      reuseport && sk2->sk_reuseport &&
-			      !rcu_access_pointer(sk->sk_reuseport_cb) &&
+			      (!reuseport_cb ||
+			       reuseport_cb == rcu_access_pointer(sk2->sk_reuseport_cb)) &&
 			      (sk2->sk_state == TCP_TIME_WAIT ||
 			       uid_eq(uid, sock_i_uid(sk2))))) &&
 			    inet_rcv_saddr_equal(sk, sk2, true))
 				break;
 		} else if (!reuseport_ok ||
 			   !reuseport || !sk2->sk_reuseport ||
-			   rcu_access_pointer(sk->sk_reuseport_cb) ||
+			   (reuseport_cb &&
+			    reuseport_cb != rcu_access_pointer(sk2->sk_reuseport_cb)) ||
 			   (sk2->sk_state != TCP_TIME_WAIT &&
 			    !uid_eq(uid, sock_i_uid(sk2)))) {
 			if (inet_rcv_saddr_equal(sk, sk2, true))
From patchwork Tue Nov 17 09:40:19 2020
X-Patchwork-Submitter: Kuniyuki Iwashima
X-Patchwork-Id: 325045
From: Kuniyuki Iwashima
To: "David S. Miller", Jakub Kicinski, Eric Dumazet, Alexei Starovoitov, Daniel Borkmann
CC: Benjamin Herrenschmidt, Kuniyuki Iwashima
Subject: [RFC PATCH bpf-next 4/8] tcp: Migrate TFO requests causing RST during TCP_SYN_RECV.
Date: Tue, 17 Nov 2020 18:40:19 +0900
Message-ID: <20201117094023.3685-5-kuniyu@amazon.co.jp>

A TFO request socket is freed only after BOTH the 3WHS has completed (or
been aborted) and the child socket has been accepted (or its listener has
been closed). Hence, depending on the order, there can be two kinds of
request sockets in the accept queue:

  3WHS -> accept : TCP_ESTABLISHED
  accept -> 3WHS : TCP_SYN_RECV

Unlike for a TCP_ESTABLISHED socket, accept() does not free the request
socket of a TCP_SYN_RECV socket; it is freed later in
reqsk_fastopen_remove(), which also accesses request_sock.rsk_listener.
So, to complete TFO socket migration, we have to point the request at the
current listener in accept() before reqsk_fastopen_remove() runs.

Moreover, if a TFO request caused an RST before the 3WHS completed, it is
held in the listener's TFO queue to prevent a DDoS attack. Thus, we also
have to migrate the requests in the TFO queue.
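For reference, the "accept -> 3WHS" case exists because a TFO listener
hands the child to the application before the final ACK arrives. A minimal
server-side setup that produces such TCP_SYN_RECV children might look like
the sketch below (illustrative only; the port number is arbitrary and
error handling is omitted):

/* Illustrative sketch, not part of the patch: enabling TCP Fast Open
 * on a listener.  The value passed to TCP_FASTOPEN bounds the number
 * of pending TFO requests.  accept() on this socket may return a
 * child that is still in TCP_SYN_RECV, i.e. before the 3WHS finishes.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int make_tfo_listener(void)
{
	int qlen = 128;
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port   = htons(8080),
		.sin_addr   = { .s_addr = htonl(INADDR_ANY) },
	};
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	bind(fd, (struct sockaddr *)&addr, sizeof(addr));
	setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen));
	listen(fd, 128);
	return fd;
}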
Reviewed-by: Benjamin Herrenschmidt
Signed-off-by: Kuniyuki Iwashima
---
 net/ipv4/inet_connection_sock.c | 35 ++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 583db7e2b1da..398c5c708bc5 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -500,6 +500,16 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
 	    tcp_rsk(req)->tfo_listener) {
 		spin_lock_bh(&queue->fastopenq.lock);
 		if (tcp_rsk(req)->tfo_listener) {
+			if (req->rsk_listener != sk) {
+				/* TFO request was migrated to another listener so
+				 * the new listener must be used in reqsk_fastopen_remove()
+				 * to hold requests which cause RST.
+				 */
+				sock_put(req->rsk_listener);
+				sock_hold(sk);
+				req->rsk_listener = sk;
+			}
+
 			/* We are still waiting for the final ACK from 3WHS
 			 * so can't free req now. Instead, we set req->sk to
 			 * NULL to signify that the child socket is taken
@@ -954,7 +964,6 @@ static void inet_child_forget(struct sock *sk, struct request_sock *req,
 	if (sk->sk_protocol == IPPROTO_TCP && tcp_rsk(req)->tfo_listener) {
 		BUG_ON(rcu_access_pointer(tcp_sk(child)->fastopen_rsk) != req);
-		BUG_ON(sk != req->rsk_listener);
 
 		/* Paranoid, to prevent race condition if
 		 * an inbound pkt destined for child is
@@ -995,6 +1004,7 @@ EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
 void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
 {
 	struct request_sock_queue *old_accept_queue, *new_accept_queue;
+	struct fastopen_queue *old_fastopenq, *new_fastopenq;
 
 	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
 	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
@@ -1019,6 +1029,29 @@ void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
 
 	spin_unlock(&new_accept_queue->rskq_lock);
 	spin_unlock(&old_accept_queue->rskq_lock);
+
+	old_fastopenq = &old_accept_queue->fastopenq;
+	new_fastopenq = &new_accept_queue->fastopenq;
+
+	spin_lock_bh(&old_fastopenq->lock);
+	spin_lock_bh(&new_fastopenq->lock);
+
+	new_fastopenq->qlen += old_fastopenq->qlen;
+	old_fastopenq->qlen = 0;
+
+	if (old_fastopenq->rskq_rst_head) {
+		if (new_fastopenq->rskq_rst_head)
+			old_fastopenq->rskq_rst_tail->dl_next = new_fastopenq->rskq_rst_head;
+		else
+			old_fastopenq->rskq_rst_tail = new_fastopenq->rskq_rst_tail;
+
+		new_fastopenq->rskq_rst_head = old_fastopenq->rskq_rst_head;
+		old_fastopenq->rskq_rst_head = NULL;
+		old_fastopenq->rskq_rst_tail = NULL;
+	}
+
+	spin_unlock_bh(&new_fastopenq->lock);
+	spin_unlock_bh(&old_fastopenq->lock);
 }
 EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate);

From patchwork Tue Nov 17 09:40:22 2020
X-Patchwork-Submitter: Kuniyuki Iwashima
X-Patchwork-Id: 325044
From: Kuniyuki Iwashima
To: "David S. Miller", Jakub Kicinski, Eric Dumazet, Alexei Starovoitov, Daniel Borkmann
CC: Benjamin Herrenschmidt, Kuniyuki Iwashima
Subject: [RFC PATCH bpf-next 7/8] bpf: Call bpf_run_sk_reuseport() for socket migration.
Date: Tue, 17 Nov 2020 18:40:22 +0900
Message-ID: <20201117094023.3685-8-kuniyu@amazon.co.jp>

This patch makes it possible to select a new listener for socket
migration with eBPF.

The noteworthy point is that we select a listening socket in
reuseport_detach_sock() and reuseport_select_sock(), but we do not have a
struct sk_buff in the unhash path. Since we cannot pass an skb to the eBPF
program, we run only a BPF_PROG_TYPE_SK_REUSEPORT program by calling
bpf_run_sk_reuseport() with skb set to NULL. Consequently, the fields
derived from the skb are also NULL in the eBPF program.

Moreover, we can cancel migration by returning SK_DROP. This is useful
when listeners have different settings at the socket API level or when we
want to free resources as soon as possible.
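A selection program for this path might look like the sketch below. This
is illustrative only and not part of the patch: the map name migrate_map,
its size, and the fixed key are assumptions; user space would populate the
map with the listeners allowed to receive migrated requests.

/* Illustrative BPF_PROG_TYPE_SK_REUSEPORT sketch, not part of the
 * patch.  In the migration path there is no skb, so skb-derived
 * fields such as ctx->data are NULL and must not be dereferenced;
 * the program only picks a listener from the map.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
	__uint(max_entries, 16);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u64));
} migrate_map SEC(".maps");

SEC("sk_reuseport")
int select_migrated_listener(struct sk_reuseport_md *ctx)
{
	__u32 key = 0;	/* assumption: slot 0 holds the fallback listener */

	if (bpf_sk_select_reuseport(ctx, &migrate_map, &key, 0) == 0)
		return SK_PASS;

	return SK_DROP;	/* cancel the migration */
}

char _license[] SEC("license") = "GPL";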
Reviewed-by: Benjamin Herrenschmidt
Signed-off-by: Kuniyuki Iwashima
---
 net/core/filter.c          | 26 +++++++++++++++++++++-----
 net/core/sock_reuseport.c  | 23 ++++++++++++++++++++---
 net/ipv4/inet_hashtables.c |  2 +-
 3 files changed, 42 insertions(+), 9 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 01e28f283962..ffc4591878b8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8914,6 +8914,22 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
 	SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF(S, NS, F, NF,		       \
 					     BPF_FIELD_SIZEOF(NS, NF), 0)
 
+#define SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF_OR_NULL(S, NS, F, NF, SIZE, OFF) \
+	do {								       \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(S, F), si->dst_reg,    \
+				      si->src_reg, offsetof(S, F));	       \
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1);	       \
+		*insn++ = BPF_LDX_MEM(					       \
+			SIZE, si->dst_reg, si->dst_reg,			       \
+			bpf_target_off(NS, NF, sizeof_field(NS, NF),	       \
+				       target_size)			       \
+				       + OFF);				       \
+	} while (0)
+
+#define SOCK_ADDR_LOAD_NESTED_FIELD_OR_NULL(S, NS, F, NF)		       \
+	SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF_OR_NULL(S, NS, F, NF,	       \
+						      BPF_FIELD_SIZEOF(NS, NF), 0)
+
 /* SOCK_ADDR_STORE_NESTED_FIELD_OFF() has semantic similar to
  * SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF() but for store operation.
  *
@@ -9858,7 +9874,7 @@ static void bpf_init_reuseport_kern(struct sk_reuseport_kern *reuse_kern,
 	reuse_kern->skb = skb;
 	reuse_kern->sk = sk;
 	reuse_kern->selected_sk = NULL;
-	reuse_kern->data_end = skb->data + skb_headlen(skb);
+	reuse_kern->data_end = skb ? skb->data + skb_headlen(skb) : NULL;
 	reuse_kern->hash = hash;
 	reuse_kern->reuseport_id = reuse->reuseport_id;
 	reuse_kern->bind_inany = reuse->bind_inany;
@@ -10039,10 +10055,10 @@ sk_reuseport_is_valid_access(int off, int size,
 	})
 
 #define SK_REUSEPORT_LOAD_SKB_FIELD(SKB_FIELD)				\
-	SOCK_ADDR_LOAD_NESTED_FIELD(struct sk_reuseport_kern,		\
-				    struct sk_buff,			\
-				    skb,				\
-				    SKB_FIELD)
+	SOCK_ADDR_LOAD_NESTED_FIELD_OR_NULL(struct sk_reuseport_kern,	\
+					    struct sk_buff,		\
+					    skb,			\
+					    SKB_FIELD)
 
 #define SK_REUSEPORT_LOAD_SK_FIELD(SK_FIELD)				\
 	SOCK_ADDR_LOAD_NESTED_FIELD(struct sk_reuseport_kern,		\
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index 74a46197854b..903f78ab35c3 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -224,6 +224,7 @@ struct sock *reuseport_detach_sock(struct sock *sk)
 {
 	struct sock_reuseport *reuse;
 	struct sock *nsk = NULL;
+	struct bpf_prog *prog;
 	int i;
 
 	spin_lock_bh(&reuseport_lock);
@@ -249,8 +250,16 @@ struct sock *reuseport_detach_sock(struct sock *sk)
 		reuse->socks[i] = reuse->socks[reuse->num_socks];
 
 		if (reuse->migrate_req) {
-			if (reuse->num_socks)
-				nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
+			if (reuse->num_socks) {
+				prog = rcu_dereference(reuse->prog);
+				if (prog && prog->type == BPF_PROG_TYPE_SK_REUSEPORT)
+					nsk = bpf_run_sk_reuseport(reuse, sk, prog,
+								   NULL, sk->sk_hash);
+
+				if (!nsk)
+					nsk = i == reuse->num_socks ?
+					      reuse->socks[i - 1] : reuse->socks[i];
+			}
 
 			reuse->num_closed_socks++;
 			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
@@ -340,8 +349,16 @@ struct sock *reuseport_select_sock(struct sock *sk,
 
 			/* paired with smp_wmb() in reuseport_add_sock() */
 			smp_rmb();
 
-			if (!prog || !skb)
+			if (!prog)
 				goto select_by_hash;
+
+			if (!skb) {
+				if (reuse->migrate_req &&
+				    prog->type == BPF_PROG_TYPE_SK_REUSEPORT)
+					sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash);
+
+				goto select_by_hash;
+			}
 
 			if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT)
 				sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index f35c76cf3365..d981e4876679 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -647,7 +647,7 @@ void inet_unhash(struct sock *sk)
 
 	if (rcu_access_pointer(sk->sk_reuseport_cb)) {
 		nsk = reuseport_detach_sock(sk);
-		if (nsk)
+		if (!IS_ERR_OR_NULL(nsk))
 			inet_csk_reqsk_queue_migrate(sk, nsk);
 	}