From patchwork Tue Apr 20 15:41:31 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 425041 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-18.8 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3F8C6C433B4 for ; Tue, 20 Apr 2021 15:43:32 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 0A0D861354 for ; Tue, 20 Apr 2021 15:43:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233041AbhDTPoC (ORCPT ); Tue, 20 Apr 2021 11:44:02 -0400 Received: from smtp-fw-33001.amazon.com ([207.171.190.10]:44130 "EHLO smtp-fw-33001.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232987AbhDTPn7 (ORCPT ); Tue, 20 Apr 2021 11:43:59 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1618933409; x=1650469409; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=StTHGdJH0e7a7hTsUp4VoJJabGCqsz7CPVVNUY3bueU=; b=VTc83Jmae7HjAPq0ATs9+r93MuANiZi/5MW8AI1F+1yyPBpJv2QgclGm K0scpUpzUYv192F+s/rXOS7xHqieQq49lGwBykKg3yApTzvdVYU2wO4IF k8mF03aSlVsf/hFF7eMfe6bD6CjlS2NSid86WQCBjh0HcuRYrOwx/lHpT I=; X-IronPort-AV: E=Sophos;i="5.82,237,1613433600"; d="scan'208";a="119989957" Received: from pdx4-co-svc-p1-lb2-vlan2.amazon.com (HELO email-inbound-relay-2b-55156cd4.us-west-2.amazon.com) ([10.25.36.210]) by smtp-border-fw-out-33001.sea14.amazon.com with ESMTP; 20 Apr 2021 15:43:28 +0000 Received: from EX13MTAUWB001.ant.amazon.com (pdx1-ws-svc-p6-lb9-vlan2.pdx.amazon.com [10.236.137.194]) by email-inbound-relay-2b-55156cd4.us-west-2.amazon.com (Postfix) with ESMTPS id 54C60A1D46; Tue, 20 Apr 2021 15:43:27 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 20 Apr 2021 15:43:26 +0000 Received: from 88665a182662.ant.amazon.com (10.43.161.41) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 20 Apr 2021 15:43:22 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [PATCH v3 bpf-next 02/11] tcp: Add num_closed_socks to struct sock_reuseport. Date: Wed, 21 Apr 2021 00:41:31 +0900 Message-ID: <20210420154140.80034-3-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20210420154140.80034-1-kuniyu@amazon.co.jp> References: <20210420154140.80034-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.161.41] X-ClientProxiedBy: EX13P01UWB001.ant.amazon.com (10.43.161.59) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org As noted in the following commit, a closed listener has to hold the reference to the reuseport group for socket migration. This patch adds a field (num_closed_socks) to struct sock_reuseport to manage closed sockets within the same reuseport group. Moreover, this and the following commits introduce some helper functions to split socks[] into two sections and keep TCP_LISTEN and TCP_CLOSE sockets in each section. Like a double-ended queue, we will place TCP_LISTEN sockets from the front and TCP_CLOSE sockets from the end. TCP_LISTEN----------> <-------TCP_CLOSE +---+---+ --- +---+ --- +---+ --- +---+ | 0 | 1 | ... | i | ... | j | ... | k | +---+---+ --- +---+ --- +---+ --- +---+ i = num_socks - 1 j = max_socks - num_closed_socks k = max_socks - 1 This patch also extends reuseport_add_sock() and reuseport_grow() to support num_closed_socks. Signed-off-by: Kuniyuki Iwashima --- include/net/sock_reuseport.h | 5 ++- net/core/sock_reuseport.c | 76 +++++++++++++++++++++++++++--------- 2 files changed, 60 insertions(+), 21 deletions(-) diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h index 505f1e18e9bf..0e558ca7afbf 100644 --- a/include/net/sock_reuseport.h +++ b/include/net/sock_reuseport.h @@ -13,8 +13,9 @@ extern spinlock_t reuseport_lock; struct sock_reuseport { struct rcu_head rcu; - u16 max_socks; /* length of socks */ - u16 num_socks; /* elements in socks */ + u16 max_socks; /* length of socks */ + u16 num_socks; /* elements in socks */ + u16 num_closed_socks; /* closed elements in socks */ /* The last synq overflow event timestamp of this * reuse->socks[] group. */ diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index b065f0a103ed..079bd1aca0e7 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -18,6 +18,49 @@ DEFINE_SPINLOCK(reuseport_lock); static DEFINE_IDA(reuseport_ida); +static int reuseport_sock_index(struct sock *sk, + struct sock_reuseport *reuse, + bool closed) +{ + int left, right; + + if (!closed) { + left = 0; + right = reuse->num_socks; + } else { + left = reuse->max_socks - reuse->num_closed_socks; + right = reuse->max_socks; + } + + for (; left < right; left++) + if (reuse->socks[left] == sk) + return left; + return -1; +} + +static void __reuseport_add_sock(struct sock *sk, + struct sock_reuseport *reuse) +{ + reuse->socks[reuse->num_socks] = sk; + /* paired with smp_rmb() in reuseport_select_sock() */ + smp_wmb(); + reuse->num_socks++; +} + +static bool __reuseport_detach_sock(struct sock *sk, + struct sock_reuseport *reuse) +{ + int i = reuseport_sock_index(sk, reuse, false); + + if (i == -1) + return false; + + reuse->socks[i] = reuse->socks[reuse->num_socks - 1]; + reuse->num_socks--; + + return true; +} + static struct sock_reuseport *__reuseport_alloc(unsigned int max_socks) { unsigned int size = sizeof(struct sock_reuseport) + @@ -72,9 +115,8 @@ int reuseport_alloc(struct sock *sk, bool bind_inany) } reuse->reuseport_id = id; - reuse->socks[0] = sk; - reuse->num_socks = 1; reuse->bind_inany = bind_inany; + __reuseport_add_sock(sk, reuse); rcu_assign_pointer(sk->sk_reuseport_cb, reuse); out: @@ -98,6 +140,7 @@ static struct sock_reuseport *reuseport_grow(struct sock_reuseport *reuse) return NULL; more_reuse->num_socks = reuse->num_socks; + more_reuse->num_closed_socks = reuse->num_closed_socks; more_reuse->prog = reuse->prog; more_reuse->reuseport_id = reuse->reuseport_id; more_reuse->bind_inany = reuse->bind_inany; @@ -105,9 +148,13 @@ static struct sock_reuseport *reuseport_grow(struct sock_reuseport *reuse) memcpy(more_reuse->socks, reuse->socks, reuse->num_socks * sizeof(struct sock *)); + memcpy(more_reuse->socks + + (more_reuse->max_socks - more_reuse->num_closed_socks), + reuse->socks + reuse->num_socks, + reuse->num_closed_socks * sizeof(struct sock *)); more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts); - for (i = 0; i < reuse->num_socks; ++i) + for (i = 0; i < reuse->max_socks; ++i) rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb, more_reuse); @@ -158,7 +205,7 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany) return -EBUSY; } - if (reuse->num_socks == reuse->max_socks) { + if (reuse->num_socks + reuse->num_closed_socks == reuse->max_socks) { reuse = reuseport_grow(reuse); if (!reuse) { spin_unlock_bh(&reuseport_lock); @@ -166,10 +213,7 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany) } } - reuse->socks[reuse->num_socks] = sk; - /* paired with smp_rmb() in reuseport_select_sock() */ - smp_wmb(); - reuse->num_socks++; + __reuseport_add_sock(sk, reuse); rcu_assign_pointer(sk->sk_reuseport_cb, reuse); spin_unlock_bh(&reuseport_lock); @@ -183,7 +227,6 @@ EXPORT_SYMBOL(reuseport_add_sock); void reuseport_detach_sock(struct sock *sk) { struct sock_reuseport *reuse; - int i; spin_lock_bh(&reuseport_lock); reuse = rcu_dereference_protected(sk->sk_reuseport_cb, @@ -200,16 +243,11 @@ void reuseport_detach_sock(struct sock *sk) bpf_sk_reuseport_detach(sk); rcu_assign_pointer(sk->sk_reuseport_cb, NULL); + __reuseport_detach_sock(sk, reuse); + + if (reuse->num_socks + reuse->num_closed_socks == 0) + call_rcu(&reuse->rcu, reuseport_free_rcu); - for (i = 0; i < reuse->num_socks; i++) { - if (reuse->socks[i] == sk) { - reuse->socks[i] = reuse->socks[reuse->num_socks - 1]; - reuse->num_socks--; - if (reuse->num_socks == 0) - call_rcu(&reuse->rcu, reuseport_free_rcu); - break; - } - } spin_unlock_bh(&reuseport_lock); } EXPORT_SYMBOL(reuseport_detach_sock); @@ -274,7 +312,7 @@ struct sock *reuseport_select_sock(struct sock *sk, prog = rcu_dereference(reuse->prog); socks = READ_ONCE(reuse->num_socks); if (likely(socks)) { - /* paired with smp_wmb() in reuseport_add_sock() */ + /* paired with smp_wmb() in __reuseport_add_sock() */ smp_rmb(); if (!prog || !skb) From patchwork Tue Apr 20 15:41:33 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 425040 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-18.8 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5EF36C43460 for ; Tue, 20 Apr 2021 15:44:06 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 2D0E2613C9 for ; Tue, 20 Apr 2021 15:44:04 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233092AbhDTPof (ORCPT ); Tue, 20 Apr 2021 11:44:35 -0400 Received: from smtp-fw-9102.amazon.com ([207.171.184.29]:33335 "EHLO smtp-fw-9102.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233054AbhDTPod (ORCPT ); Tue, 20 Apr 2021 11:44:33 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1618933442; x=1650469442; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=riz39euh9VJ0Ml0wQ4FZVlkKZ1mau/Z7qyONSzLLBF4=; b=ZXm3HogpmZCrq/m9kYAwp0JqJe0FIg7JSaTGGj2OMzcAR5WFcJ0QqGMm x4TIIoMsfwsgtRMnl5+HMYG5G0IEXN3m1rLHpcdWVU/1qYcOQfeOqCTeC NGlYwCOlNodSI4Ib7oEIbqpCPA1DNwtrwhCicTyTJd1Mogn+X8GnK4K8u 4=; X-IronPort-AV: E=Sophos;i="5.82,237,1613433600"; d="scan'208";a="129190333" Received: from pdx4-co-svc-p1-lb2-vlan3.amazon.com (HELO email-inbound-relay-2c-397e131e.us-west-2.amazon.com) ([10.25.36.214]) by smtp-border-fw-out-9102.sea19.amazon.com with ESMTP; 20 Apr 2021 15:44:02 +0000 Received: from EX13MTAUWB001.ant.amazon.com (pdx1-ws-svc-p6-lb9-vlan3.pdx.amazon.com [10.236.137.198]) by email-inbound-relay-2c-397e131e.us-west-2.amazon.com (Postfix) with ESMTPS id 4F61AA26E3; Tue, 20 Apr 2021 15:43:56 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 20 Apr 2021 15:43:56 +0000 Received: from 88665a182662.ant.amazon.com (10.43.161.41) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 20 Apr 2021 15:43:52 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [PATCH v3 bpf-next 04/11] tcp: Add reuseport_migrate_sock() to select a new listener. Date: Wed, 21 Apr 2021 00:41:33 +0900 Message-ID: <20210420154140.80034-5-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20210420154140.80034-1-kuniyu@amazon.co.jp> References: <20210420154140.80034-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.161.41] X-ClientProxiedBy: EX13P01UWB001.ant.amazon.com (10.43.161.59) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org reuseport_migrate_sock() does the same check done in reuseport_listen_stop_sock(). If the reuseport group is capable of migration, reuseport_migrate_sock() selects a new listener by the child socket hash and increments the listener's sk_refcnt beforehand. Thus, if we fail in the migration, we have to decrement it later. We will support migration by eBPF in the later commits. Signed-off-by: Kuniyuki Iwashima Signed-off-by: Martin KaFai Lau --- include/net/sock_reuseport.h | 3 ++ net/core/sock_reuseport.c | 78 +++++++++++++++++++++++++++++------- 2 files changed, 67 insertions(+), 14 deletions(-) diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h index 1333d0cddfbc..473b0b0fa4ab 100644 --- a/include/net/sock_reuseport.h +++ b/include/net/sock_reuseport.h @@ -37,6 +37,9 @@ extern struct sock *reuseport_select_sock(struct sock *sk, u32 hash, struct sk_buff *skb, int hdr_len); +struct sock *reuseport_migrate_sock(struct sock *sk, + struct sock *migrating_sk, + struct sk_buff *skb); extern int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog); extern int reuseport_detach_prog(struct sock *sk); diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index d5fb0ad12e87..a2bca39ec0e3 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -44,7 +44,7 @@ static void __reuseport_add_sock(struct sock *sk, struct sock_reuseport *reuse) { reuse->socks[reuse->num_socks] = sk; - /* paired with smp_rmb() in reuseport_select_sock() */ + /* paired with smp_rmb() in reuseport_(select|migrate)_sock() */ smp_wmb(); reuse->num_socks++; } @@ -435,6 +435,23 @@ static struct sock *run_bpf_filter(struct sock_reuseport *reuse, u16 socks, return reuse->socks[index]; } +static struct sock *reuseport_select_sock_by_hash(struct sock_reuseport *reuse, + u32 hash, u16 num_socks) +{ + int i, j; + + i = j = reciprocal_scale(hash, num_socks); + while (reuse->socks[i]->sk_state == TCP_ESTABLISHED) { + i++; + if (i >= num_socks) + i = 0; + if (i == j) + return NULL; + } + + return reuse->socks[i]; +} + /** * reuseport_select_sock - Select a socket from an SO_REUSEPORT group. * @sk: First socket in the group. @@ -478,19 +495,8 @@ struct sock *reuseport_select_sock(struct sock *sk, select_by_hash: /* no bpf or invalid bpf result: fall back to hash usage */ - if (!sk2) { - int i, j; - - i = j = reciprocal_scale(hash, socks); - while (reuse->socks[i]->sk_state == TCP_ESTABLISHED) { - i++; - if (i >= socks) - i = 0; - if (i == j) - goto out; - } - sk2 = reuse->socks[i]; - } + if (!sk2) + sk2 = reuseport_select_sock_by_hash(reuse, hash, socks); } out: @@ -499,6 +505,50 @@ struct sock *reuseport_select_sock(struct sock *sk, } EXPORT_SYMBOL(reuseport_select_sock); +/** + * reuseport_migrate_sock - Select a socket from an SO_REUSEPORT group. + * @sk: close()ed or shutdown()ed socket in the group. + * @migrating_sk: ESTABLISHED/SYN_RECV full socket in the accept queue or + * NEW_SYN_RECV request socket during 3WHS. + * @skb: skb to run through BPF filter. + * Returns a socket (with sk_refcnt +1) that should accept the child socket + * (or NULL on error). + */ +struct sock *reuseport_migrate_sock(struct sock *sk, + struct sock *migrating_sk, + struct sk_buff *skb) +{ + struct sock_reuseport *reuse; + struct sock *nsk = NULL; + u16 socks; + u32 hash; + + rcu_read_lock(); + + reuse = rcu_dereference(sk->sk_reuseport_cb); + if (!reuse) + goto out; + + socks = READ_ONCE(reuse->num_socks); + if (unlikely(!socks)) + goto out; + + /* paired with smp_wmb() in __reuseport_add_sock() */ + smp_rmb(); + + hash = migrating_sk->sk_hash; + if (sock_net(sk)->ipv4.sysctl_tcp_migrate_req) + nsk = reuseport_select_sock_by_hash(reuse, hash, socks); + + if (nsk && unlikely(!refcount_inc_not_zero(&nsk->sk_refcnt))) + nsk = NULL; + +out: + rcu_read_unlock(); + return nsk; +} +EXPORT_SYMBOL(reuseport_migrate_sock); + int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog) { struct sock_reuseport *reuse; From patchwork Tue Apr 20 15:41:35 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 425039 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-18.8 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5D357C43460 for ; Tue, 20 Apr 2021 15:44:38 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 26A6161264 for ; Tue, 20 Apr 2021 15:44:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233120AbhDTPpH (ORCPT ); Tue, 20 Apr 2021 11:45:07 -0400 Received: from smtp-fw-9103.amazon.com ([207.171.188.200]:53917 "EHLO smtp-fw-9103.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233105AbhDTPpF (ORCPT ); Tue, 20 Apr 2021 11:45:05 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1618933474; x=1650469474; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=V8UGnFpf8bTuF7lpZY4qc80ssPB3dSUpmiMIja9fnOs=; b=b5HSDK1Vzki1NebxqPcsn18iydN9eYOj6DEufBIpPiUJ3so/OQqiTAW2 sUthupuLKC9PuW+Az1kM6RvduxeR8xG8/Lia4wbKI+gTTArCDo8tfflpk RMV7LzsydqwNx+Q57pWNb87GfCRmZI2jO0572hwTYxMCSeHOb1z73RPxI s=; X-IronPort-AV: E=Sophos;i="5.82,237,1613433600"; d="scan'208";a="927523252" Received: from pdx4-co-svc-p1-lb2-vlan2.amazon.com (HELO email-inbound-relay-2a-d0be17ee.us-west-2.amazon.com) ([10.25.36.210]) by smtp-border-fw-out-9103.sea19.amazon.com with ESMTP; 20 Apr 2021 15:44:34 +0000 Received: from EX13MTAUWB001.ant.amazon.com (pdx1-ws-svc-p6-lb9-vlan3.pdx.amazon.com [10.236.137.198]) by email-inbound-relay-2a-d0be17ee.us-west-2.amazon.com (Postfix) with ESMTPS id BD747A04DA; Tue, 20 Apr 2021 15:44:33 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 20 Apr 2021 15:44:33 +0000 Received: from 88665a182662.ant.amazon.com (10.43.161.41) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 20 Apr 2021 15:44:22 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [PATCH v3 bpf-next 06/11] tcp: Migrate TCP_NEW_SYN_RECV requests at retransmitting SYN+ACKs. Date: Wed, 21 Apr 2021 00:41:35 +0900 Message-ID: <20210420154140.80034-7-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20210420154140.80034-1-kuniyu@amazon.co.jp> References: <20210420154140.80034-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.161.41] X-ClientProxiedBy: EX13P01UWB001.ant.amazon.com (10.43.161.59) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org As with the preceding patch, this patch changes reqsk_timer_handler() to call reuseport_migrate_sock() and reqsk_clone() to migrate in-flight requests at retransmitting SYN+ACKs. If we can select a new listener and clone the request, we resume setting the SYN+ACK timer for the new req. If we can set the timer, we call inet_ehash_insert() to unhash the old req and put the new req into ehash. The noteworthy point here is that by unhashing the old req, another CPU processing it may lose the "own_req" race in tcp_v[46]_syn_recv_sock() and drop the final ACK packet. However, the new timer will recover this situation. Signed-off-by: Kuniyuki Iwashima --- net/core/request_sock.c | 1 + net/ipv4/inet_connection_sock.c | 76 +++++++++++++++++++++++++++++---- 2 files changed, 68 insertions(+), 9 deletions(-) diff --git a/net/core/request_sock.c b/net/core/request_sock.c index 82cf9fbe2668..08c37ecd923b 100644 --- a/net/core/request_sock.c +++ b/net/core/request_sock.c @@ -151,6 +151,7 @@ struct request_sock *reqsk_clone(struct request_sock *req, struct sock *sk) memcpy(&nreq_sk->sk_dontcopy_end, &req_sk->sk_dontcopy_end, req->rsk_ops->obj_size - offsetof(struct sock, sk_dontcopy_end)); + sk_node_init(&nreq_sk->sk_node); nreq_sk->sk_tx_queue_mapping = req_sk->sk_tx_queue_mapping; #ifdef CONFIG_XPS nreq_sk->sk_rx_queue_mapping = req_sk->sk_rx_queue_mapping; diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index 851992405826..dc984d1f352e 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -695,10 +695,20 @@ int inet_rtx_syn_ack(const struct sock *parent, struct request_sock *req) } EXPORT_SYMBOL(inet_rtx_syn_ack); +static void reqsk_queue_migrated(struct request_sock_queue *queue, + const struct request_sock *req) +{ + if (req->num_timeout == 0) + atomic_inc(&queue->young); + atomic_inc(&queue->qlen); +} + static void reqsk_migrate_reset(struct request_sock *req) { + req->saved_syn = NULL; + inet_rsk(req)->ireq_opt = NULL; #if IS_ENABLED(CONFIG_IPV6) - inet_rsk(req)->ipv6_opt = NULL; + inet_rsk(req)->pktopts = NULL; #endif } @@ -741,16 +751,37 @@ EXPORT_SYMBOL(inet_csk_reqsk_queue_drop_and_put); static void reqsk_timer_handler(struct timer_list *t) { - struct request_sock *req = from_timer(req, t, rsk_timer); - struct sock *sk_listener = req->rsk_listener; - struct net *net = sock_net(sk_listener); - struct inet_connection_sock *icsk = inet_csk(sk_listener); - struct request_sock_queue *queue = &icsk->icsk_accept_queue; + struct request_sock *req = from_timer(req, t, rsk_timer), *nreq = NULL, *oreq = req; + struct sock *sk_listener = req->rsk_listener, *nsk = NULL; + struct inet_connection_sock *icsk; + struct request_sock_queue *queue; + struct net *net; int max_syn_ack_retries, qlen, expire = 0, resend = 0; - if (inet_sk_state_load(sk_listener) != TCP_LISTEN) - goto drop; + if (inet_sk_state_load(sk_listener) != TCP_LISTEN) { + nsk = reuseport_migrate_sock(sk_listener, req_to_sk(req), NULL); + if (!nsk) + goto drop; + + nreq = reqsk_clone(req, nsk); + if (!nreq) + goto drop; + + /* The new timer for the cloned req can decrease the 2 + * by calling inet_csk_reqsk_queue_drop_and_put(), so + * hold another count to prevent use-after-free and + * call reqsk_put() just before return. + */ + refcount_set(&nreq->rsk_refcnt, 2 + 1); + timer_setup(&nreq->rsk_timer, reqsk_timer_handler, TIMER_PINNED); + reqsk_queue_migrated(&inet_csk(nsk)->icsk_accept_queue, req); + + req = nreq; + sk_listener = nsk; + } + icsk = inet_csk(sk_listener); + net = sock_net(sk_listener); max_syn_ack_retries = icsk->icsk_syn_retries ? : net->ipv4.sysctl_tcp_synack_retries; /* Normally all the openreqs are young and become mature * (i.e. converted to established socket) for first timeout. @@ -769,6 +800,7 @@ static void reqsk_timer_handler(struct timer_list *t) * embrions; and abort old ones without pity, if old * ones are about to clog our table. */ + queue = &icsk->icsk_accept_queue; qlen = reqsk_queue_len(queue); if ((qlen << 1) > max(8U, READ_ONCE(sk_listener->sk_max_ack_backlog))) { int young = reqsk_queue_len_young(queue) << 1; @@ -793,10 +825,36 @@ static void reqsk_timer_handler(struct timer_list *t) atomic_dec(&queue->young); timeo = min(TCP_TIMEOUT_INIT << req->num_timeout, TCP_RTO_MAX); mod_timer(&req->rsk_timer, jiffies + timeo); + + if (!nreq) + return; + + if (!inet_ehash_insert(req_to_sk(nreq), req_to_sk(oreq), NULL)) { + /* delete timer */ + inet_csk_reqsk_queue_drop(sk_listener, nreq); + goto drop; + } + + reqsk_migrate_reset(oreq); + reqsk_queue_removed(&inet_csk(oreq->rsk_listener)->icsk_accept_queue, oreq); + reqsk_put(oreq); + + reqsk_put(nreq); return; } + drop: - inet_csk_reqsk_queue_drop_and_put(sk_listener, req); + /* Even if we can clone the req, we may need not retransmit any more + * SYN+ACKs (nreq->num_timeout > max_syn_ack_retries, etc), or another + * CPU may win the "own_req" race so that inet_ehash_insert() fails. + */ + if (nreq) { + reqsk_migrate_reset(nreq); + reqsk_queue_removed(queue, nreq); + __reqsk_free(nreq); + } + + inet_csk_reqsk_queue_drop_and_put(oreq->rsk_listener, oreq); } static void reqsk_queue_hash_req(struct request_sock *req, From patchwork Tue Apr 20 15:41:37 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 425038 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-18.8 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E770BC433ED for ; Tue, 20 Apr 2021 15:45:10 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id B438561354 for ; Tue, 20 Apr 2021 15:45:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233172AbhDTPpj (ORCPT ); Tue, 20 Apr 2021 11:45:39 -0400 Received: from smtp-fw-6002.amazon.com ([52.95.49.90]:57654 "EHLO smtp-fw-6002.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232997AbhDTPpg (ORCPT ); Tue, 20 Apr 2021 11:45:36 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1618933505; x=1650469505; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=4vdumPG+hHodAMIi+5gcUXgp5jGw2xWA/8qR5K8tSe0=; b=FsRh+WMWANCzAXETfCPNeIqLW/gC0MaLx+KbDgKOP0fuOKmGRC4ni3g2 oiDHElhvyQGwgPzOvU0Na52/rk1rrQ1Mk+/IaZtYvp5em5CYSPjJUTiTc ZQIqNkIOps+MBbvwgXnemv60GwqgeSRtwKQp4i48GaTFLyKw3OdJGeL9X k=; X-IronPort-AV: E=Sophos;i="5.82,237,1613433600"; d="scan'208";a="107202499" Received: from iad12-co-svc-p1-lb1-vlan2.amazon.com (HELO email-inbound-relay-2b-5bdc5131.us-west-2.amazon.com) ([10.43.8.2]) by smtp-border-fw-out-6002.iad6.amazon.com with ESMTP; 20 Apr 2021 15:45:03 +0000 Received: from EX13MTAUWB001.ant.amazon.com (pdx1-ws-svc-p6-lb9-vlan3.pdx.amazon.com [10.236.137.198]) by email-inbound-relay-2b-5bdc5131.us-west-2.amazon.com (Postfix) with ESMTPS id 2D216A1D90; Tue, 20 Apr 2021 15:45:02 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 20 Apr 2021 15:45:01 +0000 Received: from 88665a182662.ant.amazon.com (10.43.161.41) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 20 Apr 2021 15:44:52 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [PATCH v3 bpf-next 08/11] bpf: Support BPF_FUNC_get_socket_cookie() for BPF_PROG_TYPE_SK_REUSEPORT. Date: Wed, 21 Apr 2021 00:41:37 +0900 Message-ID: <20210420154140.80034-9-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20210420154140.80034-1-kuniyu@amazon.co.jp> References: <20210420154140.80034-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.161.41] X-ClientProxiedBy: EX13P01UWB001.ant.amazon.com (10.43.161.59) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org We will call sock_reuseport.prog for socket migration in the next commit, so the eBPF program has to know which listener is closing to select a new listener. We can currently get a unique ID of each listener in the userspace by calling bpf_map_lookup_elem() for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map. This patch makes the pointer of sk available in sk_reuseport_md so that we can get the ID by BPF_FUNC_get_socket_cookie() in the eBPF program. Link: https://lore.kernel.org/netdev/20201119001154.kapwihc2plp4f7zc@kafai-mbp.dhcp.thefacebook.com/ Suggested-by: Martin KaFai Lau Signed-off-by: Kuniyuki Iwashima --- include/uapi/linux/bpf.h | 1 + net/core/filter.c | 10 ++++++++++ tools/include/uapi/linux/bpf.h | 1 + 3 files changed, 12 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 49371eba98ba..ab8584701334 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -5299,6 +5299,7 @@ struct sk_reuseport_md { __u32 ip_protocol; /* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */ __u32 bind_inany; /* Is sock bound to an INANY address? */ __u32 hash; /* A hash of the packet 4 tuples */ + __bpf_md_ptr(struct bpf_sock *, sk); }; #define BPF_TAG_SIZE 8 diff --git a/net/core/filter.c b/net/core/filter.c index cae56d08a670..3d0f989f5d38 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -10135,6 +10135,8 @@ sk_reuseport_func_proto(enum bpf_func_id func_id, return &sk_reuseport_load_bytes_proto; case BPF_FUNC_skb_load_bytes_relative: return &sk_reuseport_load_bytes_relative_proto; + case BPF_FUNC_get_socket_cookie: + return &bpf_get_socket_ptr_cookie_proto; default: return bpf_base_func_proto(func_id); } @@ -10164,6 +10166,10 @@ sk_reuseport_is_valid_access(int off, int size, case offsetof(struct sk_reuseport_md, hash): return size == size_default; + case offsetof(struct sk_reuseport_md, sk): + info->reg_type = ARG_PTR_TO_SOCKET; + return size == sizeof(__u64); + /* Fields that allow narrowing */ case bpf_ctx_range(struct sk_reuseport_md, eth_protocol): if (size < sizeof_field(struct sk_buff, protocol)) @@ -10236,6 +10242,10 @@ static u32 sk_reuseport_convert_ctx_access(enum bpf_access_type type, case offsetof(struct sk_reuseport_md, bind_inany): SK_REUSEPORT_LOAD_FIELD(bind_inany); break; + + case offsetof(struct sk_reuseport_md, sk): + SK_REUSEPORT_LOAD_FIELD(sk); + break; } return insn - insn_buf; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 69902603012c..5fd94d632792 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -5293,6 +5293,7 @@ struct sk_reuseport_md { __u32 ip_protocol; /* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */ __u32 bind_inany; /* Is sock bound to an INANY address? */ __u32 hash; /* A hash of the packet 4 tuples */ + __bpf_md_ptr(struct bpf_sock *, sk); }; #define BPF_TAG_SIZE 8 From patchwork Tue Apr 20 15:41:38 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 425037 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-18.8 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A77ECC433B4 for ; Tue, 20 Apr 2021 15:45:33 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 5FFE061264 for ; Tue, 20 Apr 2021 15:45:33 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233207AbhDTPqB (ORCPT ); Tue, 20 Apr 2021 11:46:01 -0400 Received: from smtp-fw-6002.amazon.com ([52.95.49.90]:57942 "EHLO smtp-fw-6002.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233204AbhDTPpx (ORCPT ); Tue, 20 Apr 2021 11:45:53 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1618933523; x=1650469523; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=3be/W7iSic+4drq35svGwHrLpQvuX2k0t+uns/7wj7s=; b=Asw+5NG1yQRIvJHXl83XOv9X9NzBaDWRx0zDFaqgfXWThaoBudp7dV2Q Vi44jHRX/cWcv3lldaZl7QIZDxpyA5ho0ZOXyvHeCVpiaHtaYsUbiEgwU kiR1QclxC5rn8xL1vc5+05uB0zOy4tdbivqsUgVIdyM++Pue2bpS6DSHB w=; X-IronPort-AV: E=Sophos;i="5.82,237,1613433600"; d="scan'208";a="107202537" Received: from iad12-co-svc-p1-lb1-vlan2.amazon.com (HELO email-inbound-relay-2c-397e131e.us-west-2.amazon.com) ([10.43.8.2]) by smtp-border-fw-out-6002.iad6.amazon.com with ESMTP; 20 Apr 2021 15:45:16 +0000 Received: from EX13MTAUWB001.ant.amazon.com (pdx1-ws-svc-p6-lb9-vlan3.pdx.amazon.com [10.236.137.198]) by email-inbound-relay-2c-397e131e.us-west-2.amazon.com (Postfix) with ESMTPS id 4A84DA25BD; Tue, 20 Apr 2021 15:45:12 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 20 Apr 2021 15:45:12 +0000 Received: from 88665a182662.ant.amazon.com (10.43.161.41) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 20 Apr 2021 15:45:07 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [PATCH v3 bpf-next 09/11] bpf: Support socket migration by eBPF. Date: Wed, 21 Apr 2021 00:41:38 +0900 Message-ID: <20210420154140.80034-10-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20210420154140.80034-1-kuniyu@amazon.co.jp> References: <20210420154140.80034-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.161.41] X-ClientProxiedBy: EX13P01UWB001.ant.amazon.com (10.43.161.59) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This patch introduces a new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to check if the attached eBPF program is capable of migrating sockets. When the eBPF program is attached, we run it for socket migration if the expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE or net.ipv4.tcp_migrate_req is enabled. Ccurrently, the expected_attach_type is not enforced for the BPF_PROG_TYPE_SK_REUSEPORT type of program. Thus, this commit follows the earlier idea in the commit aac3fc320d94 ("bpf: Post-hooks for sys_bind") to fix up the zero expected_attach_type in bpf_prog_load_fixup_attach_type(). Moreover, this patch adds a new field (migrating_sk) to sk_reuseport_md to select a new listener based on the child socket. migrating_sk varies depending on if it is migrating a request in the accept queue or during 3WHS. - accept_queue : sock (ESTABLISHED/SYN_RECV) - 3WHS : request_sock (NEW_SYN_RECV) In the eBPF program, we can select a new listener by BPF_FUNC_sk_select_reuseport(). Also, we can cancel migration by returning SK_DROP. This feature is useful when listeners have different settings at the socket API level or when we want to free resources as soon as possible. - SK_PASS with selected_sk, select it as a new listener - SK_PASS with selected_sk NULL, fallbacks to the random selection - SK_DROP, cancel the migration. There is a noteworthy point. We select a listening socket in three places, but we do not have struct skb at closing a listener or retransmitting a SYN+ACK. On the other hand, some helper functions do not expect skb is NULL (e.g. skb_header_pointer() in BPF_FUNC_skb_load_bytes(), skb_tail_pointer() in BPF_FUNC_skb_load_bytes_relative()). So we allocate an empty skb temporarily before running the eBPF program. Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/netdev/20201203042402.6cskdlit5f3mw4ru@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/netdev/20201209030903.hhow5r53l6fmozjn@kafai-mbp.dhcp.thefacebook.com/ Suggested-by: Martin KaFai Lau Signed-off-by: Kuniyuki Iwashima --- include/linux/bpf.h | 1 + include/linux/filter.h | 2 ++ include/uapi/linux/bpf.h | 15 +++++++++++++++ kernel/bpf/syscall.c | 13 +++++++++++++ net/core/filter.c | 13 ++++++++++++- net/core/sock_reuseport.c | 34 ++++++++++++++++++++++++++++++---- tools/include/uapi/linux/bpf.h | 15 +++++++++++++++ 7 files changed, 88 insertions(+), 5 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index c9b7a876b0c8..40b9f074a929 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -2014,6 +2014,7 @@ struct sk_reuseport_kern { struct sk_buff *skb; struct sock *sk; struct sock *selected_sk; + struct sock *migrating_sk; void *data_end; u32 hash; u32 reuseport_id; diff --git a/include/linux/filter.h b/include/linux/filter.h index 9a09547bc7ba..226f76c0b974 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -995,11 +995,13 @@ void bpf_warn_invalid_xdp_action(u32 act); #ifdef CONFIG_INET struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk, struct bpf_prog *prog, struct sk_buff *skb, + struct sock *migrating_sk, u32 hash); #else static inline struct sock * bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk, struct bpf_prog *prog, struct sk_buff *skb, + struct sock *migrating_sk, u32 hash) { return NULL; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index ab8584701334..3888641c0d39 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -958,6 +958,8 @@ enum bpf_attach_type { BPF_SK_LOOKUP, BPF_XDP, BPF_SK_SKB_VERDICT, + BPF_SK_REUSEPORT_SELECT, + BPF_SK_REUSEPORT_SELECT_OR_MIGRATE, __MAX_BPF_ATTACH_TYPE }; @@ -5299,7 +5301,20 @@ struct sk_reuseport_md { __u32 ip_protocol; /* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */ __u32 bind_inany; /* Is sock bound to an INANY address? */ __u32 hash; /* A hash of the packet 4 tuples */ + /* When reuse->migrating_sk is NULL, it is selecting a sk for the + * new incoming connection request (e.g. selecting a listen sk for + * the received SYN in the TCP case). reuse->sk is one of the sk + * in the reuseport group. The bpf prog can use reuse->sk to learn + * the local listening ip/port without looking into the skb. + * + * When reuse->migrating_sk is not NULL, reuse->sk is closed and + * reuse->migrating_sk is the socket that needs to be migrated + * to another listening socket. migrating_sk could be a fullsock + * sk that is fully established or a reqsk that is in-the-middle + * of 3-way handshake. + */ __bpf_md_ptr(struct bpf_sock *, sk); + __bpf_md_ptr(struct bpf_sock *, migrating_sk); }; #define BPF_TAG_SIZE 8 diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 6428634da57e..b5f5dbd16f89 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1931,6 +1931,11 @@ static void bpf_prog_load_fixup_attach_type(union bpf_attr *attr) attr->expected_attach_type = BPF_CGROUP_INET_SOCK_CREATE; break; + case BPF_PROG_TYPE_SK_REUSEPORT: + if (!attr->expected_attach_type) + attr->expected_attach_type = + BPF_SK_REUSEPORT_SELECT; + break; } } @@ -2014,6 +2019,14 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type, if (expected_attach_type == BPF_SK_LOOKUP) return 0; return -EINVAL; + case BPF_PROG_TYPE_SK_REUSEPORT: + switch (expected_attach_type) { + case BPF_SK_REUSEPORT_SELECT: + case BPF_SK_REUSEPORT_SELECT_OR_MIGRATE: + return 0; + default: + return -EINVAL; + } case BPF_PROG_TYPE_EXT: if (expected_attach_type) return -EINVAL; diff --git a/net/core/filter.c b/net/core/filter.c index 3d0f989f5d38..1dfe2f820950 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -10007,11 +10007,13 @@ int sk_get_filter(struct sock *sk, struct sock_filter __user *ubuf, static void bpf_init_reuseport_kern(struct sk_reuseport_kern *reuse_kern, struct sock_reuseport *reuse, struct sock *sk, struct sk_buff *skb, + struct sock *migrating_sk, u32 hash) { reuse_kern->skb = skb; reuse_kern->sk = sk; reuse_kern->selected_sk = NULL; + reuse_kern->migrating_sk = migrating_sk; reuse_kern->data_end = skb->data + skb_headlen(skb); reuse_kern->hash = hash; reuse_kern->reuseport_id = reuse->reuseport_id; @@ -10020,12 +10022,13 @@ static void bpf_init_reuseport_kern(struct sk_reuseport_kern *reuse_kern, struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk, struct bpf_prog *prog, struct sk_buff *skb, + struct sock *migrating_sk, u32 hash) { struct sk_reuseport_kern reuse_kern; enum sk_action action; - bpf_init_reuseport_kern(&reuse_kern, reuse, sk, skb, hash); + bpf_init_reuseport_kern(&reuse_kern, reuse, sk, skb, migrating_sk, hash); action = BPF_PROG_RUN(prog, &reuse_kern); if (action == SK_PASS) @@ -10170,6 +10173,10 @@ sk_reuseport_is_valid_access(int off, int size, info->reg_type = ARG_PTR_TO_SOCKET; return size == sizeof(__u64); + case offsetof(struct sk_reuseport_md, migrating_sk): + info->reg_type = PTR_TO_SOCK_COMMON_OR_NULL; + return size == sizeof(__u64); + /* Fields that allow narrowing */ case bpf_ctx_range(struct sk_reuseport_md, eth_protocol): if (size < sizeof_field(struct sk_buff, protocol)) @@ -10246,6 +10253,10 @@ static u32 sk_reuseport_convert_ctx_access(enum bpf_access_type type, case offsetof(struct sk_reuseport_md, sk): SK_REUSEPORT_LOAD_FIELD(sk); break; + + case offsetof(struct sk_reuseport_md, migrating_sk): + SK_REUSEPORT_LOAD_FIELD(migrating_sk); + break; } return insn - insn_buf; diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index a2bca39ec0e3..d35c8f51ab72 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -378,13 +378,17 @@ void reuseport_stop_listen_sock(struct sock *sk) { if (sk->sk_protocol == IPPROTO_TCP) { struct sock_reuseport *reuse; + struct bpf_prog *prog; spin_lock_bh(&reuseport_lock); reuse = rcu_dereference_protected(sk->sk_reuseport_cb, lockdep_is_held(&reuseport_lock)); + prog = rcu_dereference_protected(reuse->prog, + lockdep_is_held(&reuseport_lock)); - if (sock_net(sk)->ipv4.sysctl_tcp_migrate_req) { + if (sock_net(sk)->ipv4.sysctl_tcp_migrate_req || + (prog && prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE)) { /* Migration capable, move sk from the listening section * to the closed section. */ @@ -489,7 +493,7 @@ struct sock *reuseport_select_sock(struct sock *sk, goto select_by_hash; if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT) - sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash); + sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, NULL, hash); else sk2 = run_bpf_filter(reuse, socks, prog, skb, hdr_len); @@ -520,6 +524,8 @@ struct sock *reuseport_migrate_sock(struct sock *sk, { struct sock_reuseport *reuse; struct sock *nsk = NULL; + bool allocated = false; + struct bpf_prog *prog; u16 socks; u32 hash; @@ -537,10 +543,30 @@ struct sock *reuseport_migrate_sock(struct sock *sk, smp_rmb(); hash = migrating_sk->sk_hash; - if (sock_net(sk)->ipv4.sysctl_tcp_migrate_req) + prog = rcu_dereference(reuse->prog); + if (!prog || prog->expected_attach_type != BPF_SK_REUSEPORT_SELECT_OR_MIGRATE) { + if (sock_net(sk)->ipv4.sysctl_tcp_migrate_req) + goto select_by_hash; + goto out; + } + + if (!skb) { + skb = alloc_skb(0, GFP_ATOMIC); + if (!skb) + goto out; + allocated = true; + } + + nsk = bpf_run_sk_reuseport(reuse, sk, prog, skb, migrating_sk, hash); + + if (allocated) + kfree_skb(skb); + +select_by_hash: + if (!nsk) nsk = reuseport_select_sock_by_hash(reuse, hash, socks); - if (nsk && unlikely(!refcount_inc_not_zero(&nsk->sk_refcnt))) + if (IS_ERR_OR_NULL(nsk) || unlikely(!refcount_inc_not_zero(&nsk->sk_refcnt))) nsk = NULL; out: diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 5fd94d632792..449ca2f4ab65 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -958,6 +958,8 @@ enum bpf_attach_type { BPF_SK_LOOKUP, BPF_XDP, BPF_SK_SKB_VERDICT, + BPF_SK_REUSEPORT_SELECT, + BPF_SK_REUSEPORT_SELECT_OR_MIGRATE, __MAX_BPF_ATTACH_TYPE }; @@ -5293,7 +5295,20 @@ struct sk_reuseport_md { __u32 ip_protocol; /* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */ __u32 bind_inany; /* Is sock bound to an INANY address? */ __u32 hash; /* A hash of the packet 4 tuples */ + /* When reuse->migrating_sk is NULL, it is selecting a sk for the + * new incoming connection request (e.g. selecting a listen sk for + * the received SYN in the TCP case). reuse->sk is one of the sk + * in the reuseport group. The bpf prog can use reuse->sk to learn + * the local listening ip/port without looking into the skb. + * + * When reuse->migrating_sk is not NULL, reuse->sk is closed and + * reuse->migrating_sk is the socket that needs to be migrated + * to another listening socket. migrating_sk could be a fullsock + * sk that is fully established or a reqsk that is in-the-middle + * of 3-way handshake. + */ __bpf_md_ptr(struct bpf_sock *, sk); + __bpf_md_ptr(struct bpf_sock *, migrating_sk); }; #define BPF_TAG_SIZE 8 From patchwork Tue Apr 20 15:41:40 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 425036 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-18.8 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3EF0DC43460 for ; Tue, 20 Apr 2021 15:45:50 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 0AC1561354 for ; Tue, 20 Apr 2021 15:45:49 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233060AbhDTPqU (ORCPT ); Tue, 20 Apr 2021 11:46:20 -0400 Received: from smtp-fw-6001.amazon.com ([52.95.48.154]:62921 "EHLO smtp-fw-6001.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232303AbhDTPqS (ORCPT ); Tue, 20 Apr 2021 11:46:18 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1618933547; x=1650469547; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=s/VLMnw6pSXfmJvQ3sVWmnNV6r3/SAjX8knXk7gWstY=; b=IGTL7wObJTD/+nLI9/O+Zt0e1dUJa5xmZNb1hKrtRoVNcxwEAunWTFoG NHs6EvywmwhpUXw9xV8CnAaODjMIj2UDKQ2/kysUGzqCD8Gl1Q6JqUseZ zXYZ6jJGP2tmVvySoZfGkljDsfR9o1jgG1oCSnV+81TTXtZjMfh2foy1f Y=; X-IronPort-AV: E=Sophos;i="5.82,237,1613433600"; d="scan'208";a="108637064" Received: from iad12-co-svc-p1-lb1-vlan2.amazon.com (HELO email-inbound-relay-2b-81e76b79.us-west-2.amazon.com) ([10.43.8.2]) by smtp-border-fw-out-6001.iad6.amazon.com with ESMTP; 20 Apr 2021 15:45:45 +0000 Received: from EX13MTAUWB001.ant.amazon.com (pdx1-ws-svc-p6-lb9-vlan3.pdx.amazon.com [10.236.137.198]) by email-inbound-relay-2b-81e76b79.us-west-2.amazon.com (Postfix) with ESMTPS id EBA0FA1941; Tue, 20 Apr 2021 15:45:42 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 20 Apr 2021 15:45:41 +0000 Received: from 88665a182662.ant.amazon.com (10.43.161.41) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 20 Apr 2021 15:45:37 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [PATCH v3 bpf-next 11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE. Date: Wed, 21 Apr 2021 00:41:40 +0900 Message-ID: <20210420154140.80034-12-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20210420154140.80034-1-kuniyu@amazon.co.jp> References: <20210420154140.80034-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.161.41] X-ClientProxiedBy: EX13P01UWB001.ant.amazon.com (10.43.161.59) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This patch adds a test for BPF_SK_REUSEPORT_SELECT_OR_MIGRATE and removes 'static' from settimeo() in network_helpers.c. Signed-off-by: Kuniyuki Iwashima --- tools/testing/selftests/bpf/network_helpers.c | 2 +- tools/testing/selftests/bpf/network_helpers.h | 1 + .../bpf/prog_tests/migrate_reuseport.c | 483 ++++++++++++++++++ .../bpf/progs/test_migrate_reuseport.c | 51 ++ 4 files changed, 536 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c create mode 100644 tools/testing/selftests/bpf/progs/test_migrate_reuseport.c diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c index 12ee40284da0..2060bc122c53 100644 --- a/tools/testing/selftests/bpf/network_helpers.c +++ b/tools/testing/selftests/bpf/network_helpers.c @@ -40,7 +40,7 @@ struct ipv6_packet pkt_v6 = { .tcp.doff = 5, }; -static int settimeo(int fd, int timeout_ms) +int settimeo(int fd, int timeout_ms) { struct timeval timeout = { .tv_sec = 3 }; diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h index 7205f8afdba1..5e0d51c07b63 100644 --- a/tools/testing/selftests/bpf/network_helpers.h +++ b/tools/testing/selftests/bpf/network_helpers.h @@ -33,6 +33,7 @@ struct ipv6_packet { } __packed; extern struct ipv6_packet pkt_v6; +int settimeo(int fd, int timeout_ms); int start_server(int family, int type, const char *addr, __u16 port, int timeout_ms); int connect_to_fd(int server_fd, int timeout_ms); diff --git a/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c b/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c new file mode 100644 index 000000000000..726f6380390a --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c @@ -0,0 +1,483 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Check if we can migrate child sockets. + * + * 1. call listen() for 5 server sockets. + * 2. update a map to migrate all child sockets + * to the last server socket (migrate_map[cookie] = 4) + * 3. call connect() for 25 client sockets. + * 4. call shutdown() for first 4 server sockets + * and migrate the requests in the accept queue + * to the last server socket. + * 5. call listen() for the second server socket. + * 6. call shutdown() for the last server + * and migrate the requests in the accept queue + * to the second server socket. + * 7. call listen() for the last server. + * 8. call shutdown() for the second server + * and migrate the requests in the accept queue + * to the last server socket. + * 9. call accept() for the last server socket. + * + * Author: Kuniyuki Iwashima + */ + +#include +#include + +#include "test_progs.h" +#include "test_migrate_reuseport.skel.h" +#include "network_helpers.h" + +#define NR_SERVERS 5 +#define NR_CLIENTS (NR_SERVERS * 5) +#define MIGRATED_TO (NR_SERVERS - 1) + +/* fastopenq->max_qlen and sk->sk_max_ack_backlog */ +#define QLEN (NR_CLIENTS * 5) + +#define MSG "Hello World" +#define MSGLEN 12 + +struct migrate_reuseport_test_case { + const char *name; + __s64 servers[NR_SERVERS]; + __s64 clients[NR_CLIENTS]; + struct sockaddr_storage addr; + socklen_t addrlen; + int family; + bool drop_ack; + bool expire_synack_timer; + bool fastopen; +} test_cases[] = { + { + .name = "IPv4 - TCP_ESTABLISHED - inet_csk_listen_stop", + .family = AF_INET, + .drop_ack = false, + .expire_synack_timer = false, + .fastopen = false, + }, + { + .name = "IPv4 - TCP_SYN_RECV - inet_csk_listen_stop", + .family = AF_INET, + .drop_ack = true, + .expire_synack_timer = false, + .fastopen = true, + }, + { + .name = "IPv4 - TCP_NEW_SYN_RECV - inet_csk_complete_hashdance", + .family = AF_INET, + .drop_ack = true, + .expire_synack_timer = false, + .fastopen = false, + }, + { + .name = "IPv4 - TCP_NEW_SYN_RECV - reqsk_timer_handler", + .family = AF_INET, + .drop_ack = true, + .expire_synack_timer = true, + .fastopen = false, + }, + { + .name = "IPv6 - TCP_ESTABLISHED - inet_csk_listen_stop", + .family = AF_INET6, + .drop_ack = false, + .expire_synack_timer = false, + .fastopen = false, + }, + { + .name = "IPv6 - TCP_SYN_RECV - inet_csk_listen_stop", + .family = AF_INET6, + .drop_ack = true, + .expire_synack_timer = false, + .fastopen = true, + }, + { + .name = "IPv6 - TCP_NEW_SYN_RECV - inet_csk_complete_hashdance", + .family = AF_INET6, + .drop_ack = true, + .expire_synack_timer = false, + .fastopen = false, + }, + { + .name = "IPv6 - TCP_NEW_SYN_RECV - reqsk_timer_handler", + .family = AF_INET6, + .drop_ack = true, + .expire_synack_timer = true, + .fastopen = false, + } +}; + +void init_fds(__s64 fds[], int len) +{ + int i; + + for (i = 0; i < len; i++) + fds[i] = -1; +} + +void close_fds(__s64 fds[], int len) +{ + int i; + + for (i = 0; i < len; i++) { + if (fds[i] != -1) { + close(fds[i]); + fds[i] = -1; + } + } +} + +int setup_fastopen(char *buf, int size, int *saved_len, bool restore) +{ + int err = 0, fd, len; + + fd = open("/proc/sys/net/ipv4/tcp_fastopen", O_RDWR); + if (!ASSERT_NEQ(fd, -1, "open")) + return -1; + + if (restore) { + len = write(fd, buf, *saved_len); + if (!ASSERT_EQ(len, *saved_len, "write - restore")) + err = -1; + } else { + *saved_len = read(fd, buf, size); + if (!ASSERT_LT(1, *saved_len, "read")) { + err = -1; + goto close; + } + + err = lseek(fd, 0, SEEK_SET); + if (!ASSERT_OK(err, "lseek")) + goto close; + + /* (TFO_CLIENT_ENABLE | TFO_SERVER_ENABLE) */ + len = write(fd, "3", 1); + if (!ASSERT_EQ(len, 1, "write - setup")) + err = -1; + } + +close: + close(fd); + + return err; +} + +int run_iptables(struct migrate_reuseport_test_case *test_case, bool add_rule) +{ + char buf[128]; + int err; + + sprintf(buf, "%s -%c OUTPUT -o lo -p tcp --dport %d --tcp-flags SYN,ACK ACK -j DROP", + test_case->family == AF_INET ? "iptables" : "ip6tables", + add_rule ? 'A' : 'D', + ntohs(test_case->family == AF_INET ? + ((struct sockaddr_in *)&test_case->addr)->sin_port : + ((struct sockaddr_in6 *)&test_case->addr)->sin6_port)); + + err = system(buf); + + return err == -1 ? err : WEXITSTATUS(err); +} + +int start_servers(struct migrate_reuseport_test_case *test_case, + struct test_migrate_reuseport *skel) +{ + int reuseport = 1, qlen = QLEN, migrated_to = MIGRATED_TO; + int i, err, prog_fd, reuseport_map_fd, migrate_map_fd; + __u64 value; + + prog_fd = bpf_program__fd(skel->progs.prog_migrate_reuseport); + reuseport_map_fd = bpf_map__fd(skel->maps.reuseport_map); + migrate_map_fd = bpf_map__fd(skel->maps.migrate_map); + + make_sockaddr(test_case->family, + test_case->family == AF_INET ? "127.0.0.1" : "::1", 0, + &test_case->addr, &test_case->addrlen); + + for (i = 0; i < NR_SERVERS; i++) { + test_case->servers[i] = socket(test_case->family, SOCK_STREAM, + IPPROTO_TCP); + if (!ASSERT_NEQ(test_case->servers[i], -1, "socket")) + return -1; + + err = setsockopt(test_case->servers[i], SOL_SOCKET, + SO_REUSEPORT, &reuseport, sizeof(reuseport)); + if (!ASSERT_OK(err, "setsockopt - SO_REUSEPORT")) + return -1; + + err = bind(test_case->servers[i], + (struct sockaddr *)&test_case->addr, + test_case->addrlen); + if (!ASSERT_OK(err, "bind")) + return -1; + + if (i == 0) { + err = setsockopt(test_case->servers[i], SOL_SOCKET, + SO_ATTACH_REUSEPORT_EBPF, + &prog_fd, sizeof(prog_fd)); + if (!ASSERT_OK(err, + "setsockopt - SO_ATTACH_REUSEPORT_EBPF")) + return -1; + + err = getsockname(test_case->servers[i], + (struct sockaddr *)&test_case->addr, + &test_case->addrlen); + if (!ASSERT_OK(err, "getsockname")) + return -1; + } + + if (test_case->fastopen) { + err = setsockopt(test_case->servers[i], + SOL_TCP, TCP_FASTOPEN, + &qlen, sizeof(qlen)); + if (!ASSERT_OK(err, "setsockopt - TCP_FASTOPEN")) + return -1; + } + + err = listen(test_case->servers[i], qlen); + if (!ASSERT_OK(err, "listen")) + return -1; + + value = (__u64)test_case->servers[i]; + err = bpf_map_update_elem(reuseport_map_fd, &i, &value, + BPF_NOEXIST); + if (!ASSERT_OK(err, "bpf_map_update_elem - reuseport_map")) + return -1; + + err = bpf_map_lookup_elem(reuseport_map_fd, &i, &value); + if (!ASSERT_OK(err, "bpf_map_lookup_elem - reuseport_map")) + return -1; + + err = bpf_map_update_elem(migrate_map_fd, &value, &migrated_to, + BPF_NOEXIST); + if (!ASSERT_OK(err, "bpf_map_update_elem - migrate_map")) + return -1; + } + + return 0; +} + +int start_clients(struct migrate_reuseport_test_case *test_case) +{ + char buf[MSGLEN] = MSG; + int i, err; + + for (i = 0; i < NR_CLIENTS; i++) { + test_case->clients[i] = socket(test_case->family, SOCK_STREAM, + IPPROTO_TCP); + if (!ASSERT_NEQ(test_case->clients[i], -1, "socket")) + return -1; + + /* iptables only drops the final ACK, so clients will + * transition to TCP_ESTABLISHED immediately. + */ + err = settimeo(test_case->clients[i], 100); + if (!ASSERT_OK(err, "settimeo")) + return -1; + + if (test_case->fastopen) { + int fastopen = 1; + + err = setsockopt(test_case->clients[i], IPPROTO_TCP, + TCP_FASTOPEN_CONNECT, &fastopen, + sizeof(fastopen)); + if (!ASSERT_OK(err, + "setsockopt - TCP_FASTOPEN_CONNECT")) + return -1; + } + + err = connect(test_case->clients[i], + (struct sockaddr *)&test_case->addr, + test_case->addrlen); + if (!ASSERT_OK(err, "connect")) + return -1; + + err = write(test_case->clients[i], buf, MSGLEN); + if (!ASSERT_EQ(err, MSGLEN, "write")) + return -1; + } + + return 0; +} + +int migrate_dance(struct migrate_reuseport_test_case *test_case) +{ + int i, err; + + /* Migrate TCP_ESTABLISHED and TCP_SYN_RECV requests + * to the last listener based on eBPF. + */ + for (i = 0; i < MIGRATED_TO; i++) { + err = shutdown(test_case->servers[i], SHUT_RDWR); + if (!ASSERT_OK(err, "shutdown")) + return -1; + } + + /* No dance for TCP_NEW_SYN_RECV to migrate based on eBPF */ + if (!test_case->fastopen && test_case->drop_ack) + return 0; + + /* Note that we use the second listener instead of the + * first one here. + * + * The fist listener is bind()ed with port 0 and, + * SOCK_BINDPORT_LOCK is not set to sk_userlocks, so + * calling listen() again will bind() the first listener + * on a new ephemeral port and detach it from the existing + * reuseport group. (See: __inet_bind(), tcp_set_state()) + * + * OTOH, the second one is bind()ed with a specific port, + * and SOCK_BINDPORT_LOCK is set. Thus, re-listen() will + * resurrect the listener on the existing reuseport group. + */ + err = listen(test_case->servers[1], QLEN); + if (!ASSERT_OK(err, "listen")) + return -1; + + /* Migrate from the last listener to the second one. + * + * All listeners were detached out of the reuseport_map, + * so migration will be done by kernel random pick from here. + */ + err = shutdown(test_case->servers[MIGRATED_TO], SHUT_RDWR); + if (!ASSERT_OK(err, "shutdown")) + return -1; + + /* Back to the existing reuseport group */ + err = listen(test_case->servers[MIGRATED_TO], QLEN); + if (!ASSERT_OK(err, "listen")) + return -1; + + /* Migrate back to the last one from the second one */ + err = shutdown(test_case->servers[1], SHUT_RDWR); + if (!ASSERT_OK(err, "shutdown")) + return -1; + + return 0; +} + +int count_requests(struct migrate_reuseport_test_case *test_case) +{ + struct sockaddr_storage addr; + socklen_t len = sizeof(addr); + char buf[MSGLEN]; + int cnt, client; + + settimeo(test_case->servers[MIGRATED_TO], 2000); + + for (cnt = 0; cnt < NR_CLIENTS; cnt++) { + client = accept(test_case->servers[MIGRATED_TO], + (struct sockaddr *)&addr, &len); + if (!ASSERT_NEQ(client, -1, "accept")) + goto out; + + memset(buf, 0, MSGLEN); + + read(client, &buf, MSGLEN); + if (!ASSERT_STREQ(buf, MSG, "read")) { + close(client); + goto out; + } + + close(client); + } + +out: + return cnt; +} + +void run_test(struct migrate_reuseport_test_case *test_case, + struct test_migrate_reuseport *skel) +{ + bool added_rule = false; + int err, saved_len; + char buf[16]; + + init_fds(test_case->servers, NR_SERVERS); + init_fds(test_case->clients, NR_CLIENTS); + + if (test_case->fastopen) { + memset(buf, 0, sizeof(buf)); + + err = setup_fastopen(buf, sizeof(buf), &saved_len, false); + if (!ASSERT_OK(err, "setup_fastopen - setup")) + return; + } + + err = start_servers(test_case, skel); + if (!ASSERT_OK(err, "start_servers")) + goto close_servers; + + if (test_case->drop_ack) { + /* Drop the final ACK of the 3-way handshake and stick the + * in-flight requests on TCP_SYN_RECV or TCP_NEW_SYN_RECV. + */ + err = run_iptables(test_case, true); + if (!ASSERT_OK(err, "run_iptables - add rule")) + goto close_servers; + + added_rule = true; + } + + err = start_clients(test_case); + if (!ASSERT_OK(err, "start_clients")) + goto close_clients; + + /* Migrate the requests in the accept queue only. + * TCP_NEW_SYN_RECV requests are not migrated at this point. + */ + err = migrate_dance(test_case); + if (!ASSERT_OK(err, "migrate_dance")) + goto close_clients; + + if (test_case->expire_synack_timer) { + /* Wait for SYN+ACK timer to expire so that + * reqsk_timer_handler() migrates TCP_NEW_SYN_RECV requests. + */ + sleep(1); + } + + if (test_case->drop_ack) { + /* Resume 3WHS and migrate TCP_NEW_SYN_RECV requests */ + err = run_iptables(test_case, false); + if (!ASSERT_OK(err, "run_iptables - delete rule")) + goto close_clients; + + added_rule = false; + } + + err = count_requests(test_case); + ASSERT_EQ(err, NR_CLIENTS, test_case->name); + +close_clients: + close_fds(test_case->clients, NR_CLIENTS); + + if (added_rule) { + err = run_iptables(test_case, false); + ASSERT_OK(err, "run_iptables - clean up rule"); + } + +close_servers: + close_fds(test_case->servers, NR_SERVERS); + + if (test_case->fastopen) { + err = setup_fastopen(buf, sizeof(buf), &saved_len, true); + ASSERT_OK(err, "setup_fastopen - restore"); + } +} + +void test_migrate_reuseport(void) +{ + struct test_migrate_reuseport *skel; + int i; + + skel = test_migrate_reuseport__open_and_load(); + if (!ASSERT_OK_PTR(skel, "open_and_load")) + return; + + for (i = 0; i < ARRAY_SIZE(test_cases); i++) + run_test(&test_cases[i], skel); + + test_migrate_reuseport__destroy(skel); +} diff --git a/tools/testing/selftests/bpf/progs/test_migrate_reuseport.c b/tools/testing/selftests/bpf/progs/test_migrate_reuseport.c new file mode 100644 index 000000000000..d7136dc29fa2 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_migrate_reuseport.c @@ -0,0 +1,51 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Check if we can migrate child sockets. + * + * 1. If reuse_md->migrating_sk is NULL (SYN packet), + * return SK_PASS without selecting a listener. + * 2. If reuse_md->migrating_sk is not NULL (socket migration), + * select a listener (reuseport_map[migrate_map[cookie]]) + * + * Author: Kuniyuki Iwashima + */ + +#include +#include +#include + +struct { + __uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY); + __uint(max_entries, 256); + __type(key, int); + __type(value, __u64); +} reuseport_map SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 256); + __type(key, __u64); + __type(value, int); +} migrate_map SEC(".maps"); + +SEC("sk_reuseport/migrate") +int prog_migrate_reuseport(struct sk_reuseport_md *reuse_md) +{ + int *key, flags = 0; + __u64 cookie; + + if (!reuse_md->migrating_sk) + return SK_PASS; + + cookie = bpf_get_socket_cookie(reuse_md->sk); + + key = bpf_map_lookup_elem(&migrate_map, &cookie); + if (!key) + return SK_DROP; + + bpf_sk_select_reuseport(reuse_md, &reuseport_map, key, flags); + + return SK_PASS; +} + +char _license[] SEC("license") = "GPL";