From patchwork Thu Jul 1 20:05:41 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Martin KaFai Lau
X-Patchwork-Id: 469417
From: Martin KaFai Lau
CC: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, Neal Cardwell, Yonghong Song, Yuchung Cheng
Subject: [PATCH v2 bpf-next 1/8] tcp: seq_file: Avoid skipping sk during tcp_seek_last_pos
Date: Thu, 1 Jul 2021 13:05:41 -0700
Message-ID: <20210701200541.1033917-1-kafai@fb.com>
X-Mailer: git-send-email 2.30.2
In-Reply-To: <20210701200535.1033513-1-kafai@fb.com>
References: <20210701200535.1033513-1-kafai@fb.com>
X-Mailing-List: netdev@vger.kernel.org

st->bucket stores the current bucket number.
st->offset stores the offset, within this bucket, of the sk to be
shown by seq_show().  Thus, st->offset only makes sense within the
same st->bucket.

These two variables are an optimization for the common no-lseek case.
When resuming the seq_file iteration (i.e. seq_start()),
tcp_seek_last_pos() tries to continue from st->offset
in bucket st->bucket.

However, it is possible that the bucket pointed to by st->bucket
has changed and st->offset may end up skipping the whole st->bucket
without finding a sk.  In this case, tcp_seek_last_pos() currently
continues to satisfy the offset condition in the next (and incorrect)
bucket.  Instead, regardless of the offset value, the first sk of the
next bucket should be returned.  Thus, a "bucket == st->bucket" check
is added to tcp_seek_last_pos().

The chance of hitting this is small and the issue is a decade old,
so this fix targets the next tree.

Fixes: a8b690f98baf ("tcp: Fix slowness in read /proc/net/tcp")
Acked-by: Yonghong Song
Signed-off-by: Martin KaFai Lau
---
 net/ipv4/tcp_ipv4.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index e66ad6bfe808..26b7b2056585 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2451,6 +2451,7 @@ static void *tcp_get_idx(struct seq_file *seq, loff_t pos)
 static void *tcp_seek_last_pos(struct seq_file *seq)
 {
 	struct tcp_iter_state *st = seq->private;
+	int bucket = st->bucket;
 	int offset = st->offset;
 	int orig_num = st->num;
 	void *rc = NULL;
@@ -2461,7 +2462,7 @@ static void *tcp_seek_last_pos(struct seq_file *seq)
 			break;
 		st->state = TCP_SEQ_STATE_LISTENING;
 		rc = listening_get_next(seq, NULL);
-		while (offset-- && rc)
+		while (offset-- && rc && bucket == st->bucket)
 			rc = listening_get_next(seq, rc);
 		if (rc)
 			break;
@@ -2472,7 +2473,7 @@ static void *tcp_seek_last_pos(struct seq_file *seq)
 		if (st->bucket > tcp_hashinfo.ehash_mask)
 			break;
 		rc = established_get_first(seq);
-		while (offset-- && rc)
+		while (offset-- && rc && bucket == st->bucket)
 			rc = established_get_next(seq, rc);
 	}

From patchwork Thu Jul 1 20:05:48 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Martin KaFai Lau
X-Patchwork-Id: 469825
From: Martin KaFai Lau
CC: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, Neal Cardwell, Yonghong Song, Yuchung Cheng
Subject: [PATCH v2 bpf-next 2/8] tcp: seq_file: Refactor net and family matching
Date: Thu, 1 Jul 2021 13:05:48 -0700
Message-ID: <20210701200548.1034629-1-kafai@fb.com>
X-Mailer: git-send-email 2.30.2
In-Reply-To: <20210701200535.1033513-1-kafai@fb.com>
References: <20210701200535.1033513-1-kafai@fb.com>
X-Mailing-List: netdev@vger.kernel.org

This patch refactors the net and family matching into
two new helpers, seq_sk_match() and seq_file_family().

seq_file_family() is defined in the later part of the file to
prepare for a change in a following patch.
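
As a rough illustration of the matching rule that seq_sk_match()
centralizes, here is a small userspace sketch.  It is not kernel code:
struct toy_sock, struct toy_seq, the integer netns_id and toy_sk_match()
are hypothetical stand-ins for struct sock, the seq_file state, the
network namespace pointer and seq_sk_match().  Only the shape of the
predicate follows the patch.

```c
#include <stdbool.h>
#include <sys/socket.h>	/* AF_UNSPEC, AF_INET, AF_INET6 */

/* Hypothetical userspace stand-in for struct sock. */
struct toy_sock {
	unsigned short family;	/* models sk->sk_family */
	int netns_id;		/* models sock_net(sk) */
};

/* Hypothetical userspace stand-in for the seq_file iteration state. */
struct toy_seq {
	unsigned short family;	/* models the result of seq_file_family() */
	int netns_id;		/* models seq_file_net(seq) */
};

/* Same predicate shape as seq_sk_match(): AF_UNSPEC matches every
 * address family, and the socket must live in the iterator's netns.
 */
static bool toy_sk_match(const struct toy_seq *seq, const struct toy_sock *sk)
{
	return (seq->family == AF_UNSPEC || seq->family == sk->family) &&
	       seq->netns_id == sk->netns_id;
}
```

With a toy_seq whose family is AF_INET, only AF_INET sockets in the
same namespace match.  With AF_UNSPEC (the value the patch returns for
the bpf_iter case), a socket of any family matches as long as the
namespace is the same.
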

Acked-by: Yonghong Song
Signed-off-by: Martin KaFai Lau
---
 net/ipv4/tcp_ipv4.c | 68 ++++++++++++++++++++-------------------------
 1 file changed, 30 insertions(+), 38 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 26b7b2056585..e4e9f73a19a6 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2277,6 +2277,17 @@ EXPORT_SYMBOL(tcp_v4_destroy_sock);
 #ifdef CONFIG_PROC_FS
 /* Proc filesystem TCP sock list dumping. */

+static unsigned short seq_file_family(const struct seq_file *seq);
+
+static bool seq_sk_match(struct seq_file *seq, const struct sock *sk)
+{
+	unsigned short family = seq_file_family(seq);
+
+	/* AF_UNSPEC is used as a match all */
+	return ((family == AF_UNSPEC || family == sk->sk_family) &&
+		net_eq(sock_net(sk), seq_file_net(seq)));
+}
+
 /*
  * Get next listener socket follow cur.  If cur is NULL, get first socket
  * starting from bucket given in st->bucket; when st->bucket is zero the
@@ -2284,18 +2295,11 @@ EXPORT_SYMBOL(tcp_v4_destroy_sock);
  */
 static void *listening_get_next(struct seq_file *seq, void *cur)
 {
-	struct tcp_seq_afinfo *afinfo;
 	struct tcp_iter_state *st = seq->private;
-	struct net *net = seq_file_net(seq);
 	struct inet_listen_hashbucket *ilb;
 	struct hlist_nulls_node *node;
 	struct sock *sk = cur;

-	if (st->bpf_seq_afinfo)
-		afinfo = st->bpf_seq_afinfo;
-	else
-		afinfo = PDE_DATA(file_inode(seq->file));
-
 	if (!sk) {
 get_head:
 		ilb = &tcp_hashinfo.listening_hash[st->bucket];
@@ -2311,10 +2315,7 @@ static void *listening_get_next(struct seq_file *seq, void *cur)
 	sk = sk_nulls_next(sk);
 get_sk:
 	sk_nulls_for_each_from(sk, node) {
-		if (!net_eq(sock_net(sk), net))
-			continue;
-		if (afinfo->family == AF_UNSPEC ||
-		    sk->sk_family == afinfo->family)
+		if (seq_sk_match(seq, sk))
 			return sk;
 	}
 	spin_unlock(&ilb->lock);
@@ -2351,15 +2352,7 @@ static inline bool empty_bucket(const struct tcp_iter_state *st)
  */
 static void *established_get_first(struct seq_file *seq)
 {
-	struct tcp_seq_afinfo *afinfo;
 	struct tcp_iter_state *st = seq->private;
-	struct net *net = seq_file_net(seq);
-	void *rc = NULL;
-
-	if (st->bpf_seq_afinfo)
-		afinfo = st->bpf_seq_afinfo;
-	else
-		afinfo = PDE_DATA(file_inode(seq->file));

 	st->offset = 0;
 	for (; st->bucket <= tcp_hashinfo.ehash_mask; ++st->bucket) {
@@ -2373,32 +2366,20 @@ static void *established_get_first(struct seq_file *seq)

 		spin_lock_bh(lock);
 		sk_nulls_for_each(sk, node, &tcp_hashinfo.ehash[st->bucket].chain) {
-			if ((afinfo->family != AF_UNSPEC &&
-			     sk->sk_family != afinfo->family) ||
-			    !net_eq(sock_net(sk), net)) {
-				continue;
-			}
-			rc = sk;
-			goto out;
+			if (seq_sk_match(seq, sk))
+				return sk;
 		}
 		spin_unlock_bh(lock);
 	}
-out:
-	return rc;
+
+	return NULL;
 }

 static void *established_get_next(struct seq_file *seq, void *cur)
 {
-	struct tcp_seq_afinfo *afinfo;
 	struct sock *sk = cur;
 	struct hlist_nulls_node *node;
 	struct tcp_iter_state *st = seq->private;
-	struct net *net = seq_file_net(seq);
-
-	if (st->bpf_seq_afinfo)
-		afinfo = st->bpf_seq_afinfo;
-	else
-		afinfo = PDE_DATA(file_inode(seq->file));

 	++st->num;
 	++st->offset;
@@ -2406,9 +2387,7 @@ static void *established_get_next(struct seq_file *seq, void *cur)
 	sk = sk_nulls_next(sk);

 	sk_nulls_for_each_from(sk, node) {
-		if ((afinfo->family == AF_UNSPEC ||
-		     sk->sk_family == afinfo->family) &&
-		    net_eq(sock_net(sk), net))
+		if (seq_sk_match(seq, sk))
 			return sk;
 	}

@@ -2754,6 +2733,19 @@ static const struct seq_operations bpf_iter_tcp_seq_ops = {
 	.stop		= bpf_iter_tcp_seq_stop,
 };
 #endif
+static unsigned short seq_file_family(const struct seq_file *seq)
+{
+	const struct tcp_iter_state *st = seq->private;
+	const struct tcp_seq_afinfo *afinfo = st->bpf_seq_afinfo;
+
+	/* Iterated from bpf_iter. Let the bpf prog to filter instead. */
+	if (afinfo)
+		return AF_UNSPEC;
+
+	/* Iterated from proc fs */
+	afinfo = PDE_DATA(file_inode(seq->file));
+	return afinfo->family;
+}

 static const struct seq_operations tcp4_seq_ops = {
 	.show		= tcp4_seq_show,

From patchwork Thu Jul 1 20:05:54 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Martin KaFai Lau
X-Patchwork-Id: 469826
From: Martin KaFai Lau
CC: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, Neal Cardwell, Yonghong Song, Yuchung Cheng
Subject: [PATCH v2 bpf-next 3/8] bpf: tcp: seq_file: Remove bpf_seq_afinfo from tcp_iter_state
Date: Thu, 1 Jul 2021 13:05:54 -0700
Message-ID: <20210701200554.1034982-1-kafai@fb.com>
X-Mailer: git-send-email 2.30.2
In-Reply-To: <20210701200535.1033513-1-kafai@fb.com>
References: <20210701200535.1033513-1-kafai@fb.com>
X-Mailing-List: netdev@vger.kernel.org

A following patch will create a separate struct to store extra
bpf_iter state and it will embed the existing tcp_iter_state like this:
struct bpf_tcp_iter_state {
	struct tcp_iter_state state;
	/* More bpf_iter specific states here ... */
}

As prep work, this patch removes the
"struct tcp_seq_afinfo *bpf_seq_afinfo" member, whose purpose is
to tell if it is iterating from bpf_iter instead of proc fs.
Currently, if "*bpf_seq_afinfo" is not NULL, it is iterating from
bpf_iter.  In that case the kernel should not filter by the addr
family and should leave this filtering decision to the bpf prog.

Instead of using a "*bpf_seq_afinfo" pointer, this patch uses the
"seq->op == &bpf_iter_tcp_seq_ops" test to tell if it is iterating
from the bpf iter.

The bpf_iter_(init|fini)_tcp() is left here to prepare for
the change of a following patch.

Acked-by: Yonghong Song
Signed-off-by: Martin KaFai Lau
---
 include/net/tcp.h   |  1 -
 net/ipv4/tcp_ipv4.c | 25 +++++--------------------
 2 files changed, 5 insertions(+), 21 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index e668f1bf780d..06ce38967890 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1955,7 +1955,6 @@ struct tcp_iter_state {
 	struct seq_net_private	p;
 	enum tcp_seq_states	state;
 	struct sock		*syn_wait_sk;
-	struct tcp_seq_afinfo	*bpf_seq_afinfo;
 	int			bucket, offset, sbucket, num;
 	loff_t			last_pos;
 };
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index e4e9f73a19a6..6071391b9c0f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2735,12 +2735,13 @@ static const struct seq_operations bpf_iter_tcp_seq_ops = {
 #endif
 static unsigned short seq_file_family(const struct seq_file *seq)
 {
-	const struct tcp_iter_state *st = seq->private;
-	const struct tcp_seq_afinfo *afinfo = st->bpf_seq_afinfo;
+	const struct tcp_seq_afinfo *afinfo;

+#ifdef CONFIG_BPF_SYSCALL
 	/* Iterated from bpf_iter. Let the bpf prog to filter instead. */
-	if (afinfo)
+	if (seq->op == &bpf_iter_tcp_seq_ops)
 		return AF_UNSPEC;
+#endif

 	/* Iterated from proc fs */
 	afinfo = PDE_DATA(file_inode(seq->file));
@@ -2998,27 +2999,11 @@ DEFINE_BPF_ITER_FUNC(tcp, struct bpf_iter_meta *meta,

 static int bpf_iter_init_tcp(void *priv_data, struct bpf_iter_aux_info *aux)
 {
-	struct tcp_iter_state *st = priv_data;
-	struct tcp_seq_afinfo *afinfo;
-	int ret;
-
-	afinfo = kmalloc(sizeof(*afinfo), GFP_USER | __GFP_NOWARN);
-	if (!afinfo)
-		return -ENOMEM;
-
-	afinfo->family = AF_UNSPEC;
-	st->bpf_seq_afinfo = afinfo;
-	ret = bpf_iter_init_seq_net(priv_data, aux);
-	if (ret)
-		kfree(afinfo);
-	return ret;
+	return bpf_iter_init_seq_net(priv_data, aux);
 }

 static void bpf_iter_fini_tcp(void *priv_data)
 {
-	struct tcp_iter_state *st = priv_data;
-
-	kfree(st->bpf_seq_afinfo);
 	bpf_iter_fini_seq_net(priv_data);
 }

From patchwork Thu Jul 1 20:06:00 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Martin KaFai Lau
X-Patchwork-Id: 469416
From: Martin KaFai Lau
CC: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, Neal Cardwell, Yonghong Song, Yuchung Cheng
Subject: [PATCH v2 bpf-next 4/8] tcp: seq_file: Add listening_get_first()
Date: Thu, 1 Jul 2021 13:06:00 -0700
Message-ID: <20210701200600.1035353-1-kafai@fb.com>
X-Mailer: git-send-email 2.30.2
In-Reply-To: <20210701200535.1033513-1-kafai@fb.com>
References: <20210701200535.1033513-1-kafai@fb.com>
X-Mailing-List: netdev@vger.kernel.org

The current listening_get_next() is overloaded: passing NULL as
the 2nd arg, as in listening_get_next(seq, NULL), means
get_first().

This patch moves some logic from listening_get_next() into
a new function, listening_get_first().  The result is equivalent
to the current established_get_first() and established_get_next()
setup.  get_first() finds a non-empty bucket and returns
the first sk.  get_next() finds the next sk of the current
bucket and resorts to get_first() if the current bucket is
exhausted.

The next patch moves the listener seq_file iteration from
listening_hash (port only) to lhash2 (port+addr).
Separating out listening_get_first() from listening_get_next()
here will make the following lhash2 changes cleaner and easier to
follow.

Acked-by: Yonghong Song
Signed-off-by: Martin KaFai Lau
---
 net/ipv4/tcp_ipv4.c | 59 ++++++++++++++++++++++++++++++---------------
 1 file changed, 39 insertions(+), 20 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 6071391b9c0f..fc2c2ecd10e1 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2288,10 +2288,38 @@ static bool seq_sk_match(struct seq_file *seq, const struct sock *sk)
 		net_eq(sock_net(sk), seq_file_net(seq)));
 }

-/*
- * Get next listener socket follow cur.  If cur is NULL, get first socket
- * starting from bucket given in st->bucket; when st->bucket is zero the
- * very first socket in the hash table is returned.
+/* Find a non empty bucket (starting from st->bucket)
+ * and return the first sk from it.
+ */
+static void *listening_get_first(struct seq_file *seq)
+{
+	struct tcp_iter_state *st = seq->private;
+
+	st->offset = 0;
+	for (; st->bucket < INET_LHTABLE_SIZE; st->bucket++) {
+		struct inet_listen_hashbucket *ilb;
+		struct hlist_nulls_node *node;
+		struct sock *sk;
+
+		ilb = &tcp_hashinfo.listening_hash[st->bucket];
+		if (hlist_nulls_empty(&ilb->nulls_head))
+			continue;
+
+		spin_lock(&ilb->lock);
+		sk_nulls_for_each(sk, node, &ilb->nulls_head) {
+			if (seq_sk_match(seq, sk))
+				return sk;
+		}
+		spin_unlock(&ilb->lock);
+	}
+
+	return NULL;
+}
+
+/* Find the next sk of "cur" within the same bucket (i.e. st->bucket).
+ * If "cur" is the last one in the st->bucket,
+ * call listening_get_first() to return the first sk of the next
+ * non empty bucket.
  */
 static void *listening_get_next(struct seq_file *seq, void *cur)
 {
@@ -2300,29 +2328,20 @@ static void *listening_get_next(struct seq_file *seq, void *cur)
 	struct hlist_nulls_node *node;
 	struct sock *sk = cur;

-	if (!sk) {
-get_head:
-		ilb = &tcp_hashinfo.listening_hash[st->bucket];
-		spin_lock(&ilb->lock);
-		sk = sk_nulls_head(&ilb->nulls_head);
-		st->offset = 0;
-		goto get_sk;
-	}
-	ilb = &tcp_hashinfo.listening_hash[st->bucket];
 	++st->num;
 	++st->offset;

 	sk = sk_nulls_next(sk);
-get_sk:
+
 	sk_nulls_for_each_from(sk, node) {
 		if (seq_sk_match(seq, sk))
 			return sk;
 	}
+
+	ilb = &tcp_hashinfo.listening_hash[st->bucket];
 	spin_unlock(&ilb->lock);
-	st->offset = 0;
-	if (++st->bucket < INET_LHTABLE_SIZE)
-		goto get_head;
-	return NULL;
+	++st->bucket;
+	return listening_get_first(seq);
 }

 static void *listening_get_idx(struct seq_file *seq, loff_t *pos)
@@ -2332,7 +2351,7 @@ static void *listening_get_idx(struct seq_file *seq, loff_t *pos)

 	st->bucket = 0;
 	st->offset = 0;
-	rc = listening_get_next(seq, NULL);
+	rc = listening_get_first(seq);

 	while (rc && *pos) {
 		rc = listening_get_next(seq, rc);
@@ -2440,7 +2459,7 @@ static void *tcp_seek_last_pos(struct seq_file *seq)
 		if (st->bucket >= INET_LHTABLE_SIZE)
 			break;
 		st->state = TCP_SEQ_STATE_LISTENING;
-		rc = listening_get_next(seq, NULL);
+		rc = listening_get_first(seq);
 		while (offset-- && rc && bucket == st->bucket)
 			rc = listening_get_next(seq, rc);
 		if (rc)

From patchwork Thu Jul 1 20:06:06 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Martin KaFai Lau
X-Patchwork-Id: 469415
From: Martin KaFai Lau
CC: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, Neal Cardwell, Yonghong Song, Yuchung Cheng
Subject: [PATCH v2 bpf-next 5/8] tcp: seq_file: Replace listening_hash with lhash2
Date: Thu, 1 Jul 2021 13:06:06 -0700
Message-ID: <20210701200606.1035783-1-kafai@fb.com>
X-Mailer: git-send-email 2.30.2
In-Reply-To: <20210701200535.1033513-1-kafai@fb.com>
References: <20210701200535.1033513-1-kafai@fb.com>
X-Mailing-List: netdev@vger.kernel.org

This patch moves the tcp seq_file iteration on listeners
from the port only listening_hash to the port+addr lhash2.

When iterating from the bpf iter, the next patch will need to
lock the socket such that the bpf iter can call setsockopt (e.g. to
change the TCP_CONGESTION).  To avoid locking the bucket and then locking
the sock, the bpf iter will first batch some sockets from the same bucket
and then unlock the bucket.  If the bucket size is small (which it
usually is), it is easier to batch the whole bucket such that it is less
likely to miss a setsockopt on a socket due to changes in the bucket.

However, the port only listening_hash could have many listeners
hashed to a bucket (e.g. many individual VIP(s):443, further
multiplied by the number of SO_REUSEPORT listeners).  We have seen
bucket sizes in the tens of thousands.  Also, the chance of having
changes in some popular port buckets (e.g. 443) is high.

The port+addr lhash2 was introduced to solve this large listener bucket
issue.  Also, the listening_hash usage has already been replaced with
lhash2 in the fast path inet[6]_lookup_listener().  This patch follows
the same direction of moving to lhash2 and iterates the lhash2
instead of listening_hash.
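
To make the bucket-size argument concrete, the following userspace
sketch hashes many listeners that share one port but have distinct
addresses, once by port only and once by port+addr.  The hash
functions, TOY_BUCKETS and toy_max_bucket() are hypothetical and far
simpler than the kernel's.  The sketch only shows that a port-only key
puts every such listener into the same bucket.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_BUCKETS 32u	/* power of two, so masking selects a bucket */

/* Hypothetical port-only hash; NOT the kernel's listening_hash function. */
static uint32_t toy_port_hash(uint16_t port)
{
	return port & (TOY_BUCKETS - 1);
}

/* Hypothetical port+addr hash; NOT the kernel's lhash2 function. */
static uint32_t toy_portaddr_hash(uint32_t addr, uint16_t port)
{
	uint32_t h = (addr ^ port) * 2654435761u;	/* multiplicative mixing */

	return (h >> 16) & (TOY_BUCKETS - 1);
}

/* Hash n listeners that all use "port" and have the addresses
 * base_addr, base_addr + 1, ...  Return the size of the fullest bucket.
 */
static size_t toy_max_bucket(size_t n, uint32_t base_addr, uint16_t port,
			     int use_addr)
{
	size_t count[TOY_BUCKETS] = { 0 };
	size_t max = 0;

	for (size_t i = 0; i < n; i++) {
		uint32_t b = use_addr ?
			toy_portaddr_hash(base_addr + (uint32_t)i, port) :
			toy_port_hash(port);

		if (++count[b] > max)
			max = count[b];
	}
	return max;
}
```

For example, toy_max_bucket(1000, 0x0a000001, 443, 0) counts all 1000
listeners in a single bucket, while passing use_addr = 1 spreads them
over more than one bucket.
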
Acked-by: Yonghong Song Signed-off-by: Martin KaFai Lau --- include/net/inet_hashtables.h | 6 ++++++ net/ipv4/tcp_ipv4.c | 35 ++++++++++++++++++----------------- 2 files changed, 24 insertions(+), 17 deletions(-) diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h index ca6a3ea9057e..f72ec113ae56 100644 --- a/include/net/inet_hashtables.h +++ b/include/net/inet_hashtables.h @@ -160,6 +160,12 @@ struct inet_hashinfo { ____cacheline_aligned_in_smp; }; +#define inet_lhash2_for_each_icsk_continue(__icsk) \ + hlist_for_each_entry_continue(__icsk, icsk_listen_portaddr_node) + +#define inet_lhash2_for_each_icsk(__icsk, list) \ + hlist_for_each_entry(__icsk, list, icsk_listen_portaddr_node) + #define inet_lhash2_for_each_icsk_rcu(__icsk, list) \ hlist_for_each_entry_rcu(__icsk, list, icsk_listen_portaddr_node) diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index fc2c2ecd10e1..6b3c26be539b 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2296,21 +2296,22 @@ static void *listening_get_first(struct seq_file *seq) struct tcp_iter_state *st = seq->private; st->offset = 0; - for (; st->bucket < INET_LHTABLE_SIZE; st->bucket++) { - struct inet_listen_hashbucket *ilb; - struct hlist_nulls_node *node; + for (; st->bucket <= tcp_hashinfo.lhash2_mask; st->bucket++) { + struct inet_listen_hashbucket *ilb2; + struct inet_connection_sock *icsk; struct sock *sk; - ilb = &tcp_hashinfo.listening_hash[st->bucket]; - if (hlist_nulls_empty(&ilb->nulls_head)) + ilb2 = &tcp_hashinfo.lhash2[st->bucket]; + if (hlist_empty(&ilb2->head)) continue; - spin_lock(&ilb->lock); - sk_nulls_for_each(sk, node, &ilb->nulls_head) { + spin_lock(&ilb2->lock); + inet_lhash2_for_each_icsk(icsk, &ilb2->head) { + sk = (struct sock *)icsk; if (seq_sk_match(seq, sk)) return sk; } - spin_unlock(&ilb->lock); + spin_unlock(&ilb2->lock); } return NULL; @@ -2324,22 +2325,22 @@ static void *listening_get_first(struct seq_file *seq) static void *listening_get_next(struct seq_file 
*seq, void *cur) { struct tcp_iter_state *st = seq->private; - struct inet_listen_hashbucket *ilb; - struct hlist_nulls_node *node; + struct inet_listen_hashbucket *ilb2; + struct inet_connection_sock *icsk; struct sock *sk = cur; ++st->num; ++st->offset; - sk = sk_nulls_next(sk); - - sk_nulls_for_each_from(sk, node) { + icsk = inet_csk(sk); + inet_lhash2_for_each_icsk_continue(icsk) { + sk = (struct sock *)icsk; if (seq_sk_match(seq, sk)) return sk; } - ilb = &tcp_hashinfo.listening_hash[st->bucket]; - spin_unlock(&ilb->lock); + ilb2 = &tcp_hashinfo.lhash2[st->bucket]; + spin_unlock(&ilb2->lock); ++st->bucket; return listening_get_first(seq); } @@ -2456,7 +2457,7 @@ static void *tcp_seek_last_pos(struct seq_file *seq) switch (st->state) { case TCP_SEQ_STATE_LISTENING: - if (st->bucket >= INET_LHTABLE_SIZE) + if (st->bucket > tcp_hashinfo.lhash2_mask) break; st->state = TCP_SEQ_STATE_LISTENING; rc = listening_get_first(seq); @@ -2541,7 +2542,7 @@ void tcp_seq_stop(struct seq_file *seq, void *v) switch (st->state) { case TCP_SEQ_STATE_LISTENING: if (v != SEQ_START_TOKEN) - spin_unlock(&tcp_hashinfo.listening_hash[st->bucket].lock); + spin_unlock(&tcp_hashinfo.lhash2[st->bucket].lock); break; case TCP_SEQ_STATE_ESTABLISHED: if (v) From patchwork Thu Jul 1 20:06:13 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 469824 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 645CDC11F64 for ; Thu, 1 
Jul 2021 20:06:21 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4D70461410 for ; Thu, 1 Jul 2021 20:06:21 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233816AbhGAUIv (ORCPT ); Thu, 1 Jul 2021 16:08:51 -0400 Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:1906 "EHLO mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233774AbhGAUIu (ORCPT ); Thu, 1 Jul 2021 16:08:50 -0400 Received: from pps.filterd (m0109332.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.43/8.16.0.43) with SMTP id 161K0xYv012632 for ; Thu, 1 Jul 2021 13:06:19 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=YimvdZpCoQS+dSCSsAjpw6+Xnaa3nOKUp/bdZ4XuaLc=; b=EGzM6JttJk7GpENK6HaCvYTMjABpklPWBTc2Z6RnyCbspVJ6qg69anZ9PprrXFinj3tw W4Ic7NJ2m3eELCkDG4AO1Lmk2jlacuqJE8qi5D5gdsBIG0LNPrqiDIf/k/qkaBvFhXKF E75/9EWXAnvD5EJGUTW6Jtli5me34Mp6q4A= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com with ESMTP id 39h84wcm8c-4 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Thu, 01 Jul 2021 13:06:19 -0700 Received: from intmgw001.25.frc3.facebook.com (2620:10d:c085:208::11) by mail.thefacebook.com (2620:10d:c085:11d::7) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2176.2; Thu, 1 Jul 2021 13:06:16 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 206532940BCC; Thu, 1 Jul 2021 13:06:13 -0700 (PDT) From: Martin KaFai Lau To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Neal Cardwell , , Yonghong Song , Yuchung Cheng Subject: [PATCH v2 bpf-next 6/8] bpf: tcp: bpf iter batching and lock_sock Date: Thu, 1 Jul 2021 13:06:13 -0700 
Message-ID: <20210701200613.1036157-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20210701200535.1033513-1-kafai@fb.com> References: <20210701200535.1033513-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-ORIG-GUID: DLcTKBJbrKOHCj6ZBohRdteztSGquIyd X-Proofpoint-GUID: DLcTKBJbrKOHCj6ZBohRdteztSGquIyd X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391, 18.0.790 definitions=2021-07-01_12:2021-07-01,2021-07-01 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 clxscore=1015 suspectscore=0 mlxscore=0 priorityscore=1501 impostorscore=0 spamscore=0 bulkscore=0 lowpriorityscore=0 malwarescore=0 mlxlogscore=999 adultscore=0 phishscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2104190000 definitions=main-2107010117 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This patch does batching and lock_sock for the bpf tcp iter. It does not affect the proc fs iteration. With bpf-tcp-cc, new algorithm rollouts happen more often. Instead of restarting the application to pick up the new tcp-cc, the next patch will allow the bpf iter to do setsockopt(TCP_CONGESTION). This requires locking the sock. Also, unlike the proc iteration (cat /proc/net/tcp[6]), the bpf iter can inspect all fields of a tcp_sock. It will be useful to have a consistent view of some of the fields (e.g. the ones reported in tcp_get_info(), which also acquires the sock lock). Double lock: locking the bucket first and then locking the sock could lead to deadlock. This patch takes a batching approach similar to inet_diag. While holding the bucket lock, it batches a number of sockets into an array first and then unlocks the bucket. Before doing show(), it then calls lock_sock_fast(). On a machine with ~400k connections, the maximum number of sk in a bucket of the established hashtable is 7. 0.02% of the established connections fall into a bucket of this size.
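The batching step described above can be sketched in user space as follows. This is only an analogy under stated assumptions: the toy_* names are hypothetical, a plain linked list stands in for a hash bucket, and the bucket lock and the socket reference counting (sock_hold()/sock_put()) are left out. The walk counts every entry even when the array is full, so that the caller can tell whether the whole bucket was captured and, if not, retry once with an array of 3/2 the expected size, as bpf_iter_tcp_batch() in this patch does.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a socket linked into one hash bucket. */
struct toy_node {
	int id;
	struct toy_node *next;
};

struct toy_batch {
	struct toy_node **items;
	unsigned int end;	/* number of entries filled */
	unsigned int max;	/* capacity of items[] */
};

/*
 * Walk the bucket once. Store as many entries as fit, but count all of
 * them. In the kernel this walk runs with the bucket lock held.
 */
static unsigned int toy_fill_batch(struct toy_batch *b, struct toy_node *head)
{
	unsigned int expected = 0;
	struct toy_node *n;

	b->end = 0;
	for (n = head; n; n = n->next) {
		if (b->end < b->max)
			b->items[b->end++] = n;
		expected++;
	}

	return expected;
}

/* Replace the array with a larger one; the old contents are dropped. */
static int toy_resize_batch(struct toy_batch *b, unsigned int new_max)
{
	struct toy_node **items;

	assert(new_max > 0);
	items = malloc(sizeof(*items) * new_max);
	if (!items)
		return -1;	/* the old, partial batch stays usable */

	free(b->items);
	b->items = items;
	b->max = new_max;
	b->end = 0;

	return 0;
}

/* Return 1 if the whole bucket was batched, 0 if only part of it was. */
static int toy_batch_bucket(struct toy_batch *b, struct toy_node *head)
{
	unsigned int expected;
	int resized = 0;

again:
	expected = toy_fill_batch(b, head);
	if (b->end == expected)
		return 1;

	/* Retry at most once with a larger array. */
	if (!resized && !toy_resize_batch(b, expected * 3 / 2)) {
		resized = 1;
		goto again;
	}

	return 0;
}
```

In the kernel, the bucket can change between the first walk and the retry because the lock is dropped in between, which is why a second partial batch is tolerated rather than retried forever.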
For the listen hash (port+addr lhash2), the bucket is also usually very small, except in the SO_REUSEPORT use case, in which userspace could have one SO_REUSEPORT socket per thread. When batching is used, it can also minimize the chance of missing a sock in the setsockopt use case if the whole bucket is batched. This patch will start with a batch array of INIT_BATCH_SZ (16), which will be enough for the most common cases. bpf_iter_tcp_batch() will try to realloc to a larger array to handle the exceptional case (e.g. the SO_REUSEPORT case in the lhash2). Acked-by: Yonghong Song Signed-off-by: Martin KaFai Lau --- net/ipv4/tcp_ipv4.c | 237 ++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 231 insertions(+), 6 deletions(-) diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 6b3c26be539b..3e1afab26381 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2687,6 +2687,15 @@ static int tcp4_seq_show(struct seq_file *seq, void *v) } #ifdef CONFIG_BPF_SYSCALL +struct bpf_tcp_iter_state { + struct tcp_iter_state state; + unsigned int cur_sk; + unsigned int end_sk; + unsigned int max_sk; + struct sock **batch; + bool st_bucket_done; +}; + struct bpf_iter__tcp { __bpf_md_ptr(struct bpf_iter_meta *, meta); __bpf_md_ptr(struct sock_common *, sk_common); @@ -2705,16 +2714,204 @@ static int tcp_prog_seq_show(struct bpf_prog *prog, struct bpf_iter_meta *meta, return bpf_iter_run_prog(prog, &ctx); } +static void bpf_iter_tcp_put_batch(struct bpf_tcp_iter_state *iter) +{ + while (iter->cur_sk < iter->end_sk) + sock_put(iter->batch[iter->cur_sk++]); +} + +static int bpf_iter_tcp_realloc_batch(struct bpf_tcp_iter_state *iter, + unsigned int new_batch_sz) +{ + struct sock **new_batch; + + new_batch = kvmalloc(sizeof(*new_batch) * new_batch_sz, + GFP_USER | __GFP_NOWARN); + if (!new_batch) + return -ENOMEM; + + bpf_iter_tcp_put_batch(iter); + kvfree(iter->batch); + iter->batch = new_batch; + iter->max_sk = new_batch_sz; + + return 0; +} + +static unsigned int 
bpf_iter_tcp_listening_batch(struct seq_file *seq, + struct sock *start_sk) +{ + struct bpf_tcp_iter_state *iter = seq->private; + struct tcp_iter_state *st = &iter->state; + struct inet_connection_sock *icsk; + unsigned int expected = 1; + struct sock *sk; + + sock_hold(start_sk); + iter->batch[iter->end_sk++] = start_sk; + + icsk = inet_csk(start_sk); + inet_lhash2_for_each_icsk_continue(icsk) { + sk = (struct sock *)icsk; + if (seq_sk_match(seq, sk)) { + if (iter->end_sk < iter->max_sk) { + sock_hold(sk); + iter->batch[iter->end_sk++] = sk; + } + expected++; + } + } + spin_unlock(&tcp_hashinfo.lhash2[st->bucket].lock); + + return expected; +} + +static unsigned int bpf_iter_tcp_established_batch(struct seq_file *seq, + struct sock *start_sk) +{ + struct bpf_tcp_iter_state *iter = seq->private; + struct tcp_iter_state *st = &iter->state; + struct hlist_nulls_node *node; + unsigned int expected = 1; + struct sock *sk; + + sock_hold(start_sk); + iter->batch[iter->end_sk++] = start_sk; + + sk = sk_nulls_next(start_sk); + sk_nulls_for_each_from(sk, node) { + if (seq_sk_match(seq, sk)) { + if (iter->end_sk < iter->max_sk) { + sock_hold(sk); + iter->batch[iter->end_sk++] = sk; + } + expected++; + } + } + spin_unlock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket)); + + return expected; +} + +static struct sock *bpf_iter_tcp_batch(struct seq_file *seq) +{ + struct bpf_tcp_iter_state *iter = seq->private; + struct tcp_iter_state *st = &iter->state; + unsigned int expected; + bool resized = false; + struct sock *sk; + + /* The st->bucket is done. Directly advance to the next + * bucket instead of having the tcp_seek_last_pos() to skip + * one by one in the current bucket and eventually find out + * it has to advance to the next bucket. 
+ */ + if (iter->st_bucket_done) { + st->offset = 0; + st->bucket++; + if (st->state == TCP_SEQ_STATE_LISTENING && + st->bucket > tcp_hashinfo.lhash2_mask) { + st->state = TCP_SEQ_STATE_ESTABLISHED; + st->bucket = 0; + } + } + +again: + /* Get a new batch */ + iter->cur_sk = 0; + iter->end_sk = 0; + iter->st_bucket_done = false; + + sk = tcp_seek_last_pos(seq); + if (!sk) + return NULL; /* Done */ + + if (st->state == TCP_SEQ_STATE_LISTENING) + expected = bpf_iter_tcp_listening_batch(seq, sk); + else + expected = bpf_iter_tcp_established_batch(seq, sk); + + if (iter->end_sk == expected) { + iter->st_bucket_done = true; + return sk; + } + + if (!resized && !bpf_iter_tcp_realloc_batch(iter, expected * 3 / 2)) { + resized = true; + goto again; + } + + return sk; +} + +static void *bpf_iter_tcp_seq_start(struct seq_file *seq, loff_t *pos) +{ + /* bpf iter does not support lseek, so it always + * continue from where it was stop()-ped. + */ + if (*pos) + return bpf_iter_tcp_batch(seq); + + return SEQ_START_TOKEN; +} + +static void *bpf_iter_tcp_seq_next(struct seq_file *seq, void *v, loff_t *pos) +{ + struct bpf_tcp_iter_state *iter = seq->private; + struct tcp_iter_state *st = &iter->state; + struct sock *sk; + + /* Whenever seq_next() is called, the iter->cur_sk is + * done with seq_show(), so advance to the next sk in + * the batch. + */ + if (iter->cur_sk < iter->end_sk) { + /* Keeping st->num consistent in tcp_iter_state. + * bpf_iter_tcp does not use st->num. + * meta.seq_num is used instead. + */ + st->num++; + /* Move st->offset to the next sk in the bucket such that + * the future start() will resume at st->offset in + * st->bucket. See tcp_seek_last_pos(). + */ + st->offset++; + sock_put(iter->batch[iter->cur_sk++]); + } + + if (iter->cur_sk < iter->end_sk) + sk = iter->batch[iter->cur_sk]; + else + sk = bpf_iter_tcp_batch(seq); + + ++*pos; + /* Keeping st->last_pos consistent in tcp_iter_state. 
+ * bpf iter does not do lseek, so st->last_pos always equals to *pos. + */ + st->last_pos = *pos; + return sk; +} + static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v) { struct bpf_iter_meta meta; struct bpf_prog *prog; struct sock *sk = v; + bool slow; uid_t uid; + int ret; if (v == SEQ_START_TOKEN) return 0; + if (sk_fullsock(sk)) + slow = lock_sock_fast(sk); + + if (unlikely(sk_unhashed(sk))) { + ret = SEQ_SKIP; + goto unlock; + } + if (sk->sk_state == TCP_TIME_WAIT) { uid = 0; } else if (sk->sk_state == TCP_NEW_SYN_RECV) { @@ -2728,11 +2925,18 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v) meta.seq = seq; prog = bpf_iter_get_info(&meta, false); - return tcp_prog_seq_show(prog, &meta, v, uid); + ret = tcp_prog_seq_show(prog, &meta, v, uid); + +unlock: + if (sk_fullsock(sk)) + unlock_sock_fast(sk, slow); + return ret; + } static void bpf_iter_tcp_seq_stop(struct seq_file *seq, void *v) { + struct bpf_tcp_iter_state *iter = seq->private; struct bpf_iter_meta meta; struct bpf_prog *prog; @@ -2743,13 +2947,16 @@ static void bpf_iter_tcp_seq_stop(struct seq_file *seq, void *v) (void)tcp_prog_seq_show(prog, &meta, v, 0); } - tcp_seq_stop(seq, v); + if (iter->cur_sk < iter->end_sk) { + bpf_iter_tcp_put_batch(iter); + iter->st_bucket_done = false; + } } static const struct seq_operations bpf_iter_tcp_seq_ops = { .show = bpf_iter_tcp_seq_show, - .start = tcp_seq_start, - .next = tcp_seq_next, + .start = bpf_iter_tcp_seq_start, + .next = bpf_iter_tcp_seq_next, .stop = bpf_iter_tcp_seq_stop, }; #endif @@ -3017,21 +3224,39 @@ static struct pernet_operations __net_initdata tcp_sk_ops = { DEFINE_BPF_ITER_FUNC(tcp, struct bpf_iter_meta *meta, struct sock_common *sk_common, uid_t uid) +#define INIT_BATCH_SZ 16 + static int bpf_iter_init_tcp(void *priv_data, struct bpf_iter_aux_info *aux) { - return bpf_iter_init_seq_net(priv_data, aux); + struct bpf_tcp_iter_state *iter = priv_data; + int err; + + err = bpf_iter_init_seq_net(priv_data, aux); + 
if (err) + return err; + + err = bpf_iter_tcp_realloc_batch(iter, INIT_BATCH_SZ); + if (err) { + bpf_iter_fini_seq_net(priv_data); + return err; + } + + return 0; } static void bpf_iter_fini_tcp(void *priv_data) { + struct bpf_tcp_iter_state *iter = priv_data; + bpf_iter_fini_seq_net(priv_data); + kvfree(iter->batch); } static const struct bpf_iter_seq_info tcp_seq_info = { .seq_ops = &bpf_iter_tcp_seq_ops, .init_seq_private = bpf_iter_init_tcp, .fini_seq_private = bpf_iter_fini_tcp, - .seq_priv_size = sizeof(struct tcp_iter_state), + .seq_priv_size = sizeof(struct bpf_tcp_iter_state), }; static struct bpf_iter_reg tcp_reg_info = { From patchwork Thu Jul 1 20:06:19 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 469414 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 493F2C11F69 for ; Thu, 1 Jul 2021 20:07:40 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 2AAA96140D for ; Thu, 1 Jul 2021 20:07:40 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233297AbhGAUKK (ORCPT ); Thu, 1 Jul 2021 16:10:10 -0400 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:35180 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232637AbhGAUKJ (ORCPT ); Thu, 1 Jul 2021 16:10:09 -0400 Received: from pps.filterd (m0109334.ppops.net [127.0.0.1]) by 
mx0a-00082601.pphosted.com (8.16.0.43/8.16.0.43) with SMTP id 161JwkGw001234 for ; Thu, 1 Jul 2021 13:07:39 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=Ap8YVgAcgu/KxsWY91n7EfLGNM9mcq/F/d875VaAW1Q=; b=dj4dTaL400B8uTC/nud9dNfW2Ur1kZdSkgcv1odjy53Kn1N9i0eWMgoczcV5Q+j+6uki OpqQoglFqlpIhTD+p5ukQag4nhq6y+QFdu9kOfmv2eRQ3rfEvX5tuTUXjgm/AjngjeU+ icMOyd8n5oXgv8TTCR+6lwrvcrtJHvIrcEc= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com with ESMTP id 39h1wyxc2d-5 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Thu, 01 Jul 2021 13:07:38 -0700 Received: from intmgw001.25.frc3.facebook.com (2620:10d:c085:108::4) by mail.thefacebook.com (2620:10d:c085:21d::5) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2176.2; Thu, 1 Jul 2021 13:06:26 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 5B43D2940BCC; Thu, 1 Jul 2021 13:06:19 -0700 (PDT) From: Martin KaFai Lau To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Neal Cardwell , , Yonghong Song , Yuchung Cheng Subject: [PATCH v2 bpf-next 7/8] bpf: tcp: Support bpf_(get|set)sockopt in bpf tcp iter Date: Thu, 1 Jul 2021 13:06:19 -0700 Message-ID: <20210701200619.1036715-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20210701200535.1033513-1-kafai@fb.com> References: <20210701200535.1033513-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-ORIG-GUID: Fx_p_3hec8WSef2sZp8tzmfV6TJC9RrF X-Proofpoint-GUID: Fx_p_3hec8WSef2sZp8tzmfV6TJC9RrF X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391, 18.0.790 definitions=2021-07-01_12:2021-07-01,2021-07-01 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 phishscore=0 impostorscore=0 
mlxscore=0 lowpriorityscore=0 malwarescore=0 bulkscore=0 clxscore=1015 priorityscore=1501 spamscore=0 mlxlogscore=999 suspectscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2104190000 definitions=main-2107010117 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This patch allows bpf tcp iter to call bpf_(get|set)sockopt. To allow a specific bpf iter (tcp here) to call a set of helpers, get_func_proto function pointer is added to bpf_iter_reg. The bpf iter is a tracing prog which currently requires CAP_PERFMON or CAP_SYS_ADMIN, so this patch does not impose other capability checks for bpf_(get|set)sockopt. Acked-by: Yonghong Song Signed-off-by: Martin KaFai Lau --- include/linux/bpf.h | 8 ++++++++ kernel/bpf/bpf_iter.c | 22 ++++++++++++++++++++++ kernel/trace/bpf_trace.c | 7 ++++++- net/core/filter.c | 34 ++++++++++++++++++++++++++++++++++ net/ipv4/tcp_ipv4.c | 15 +++++++++++++++ 5 files changed, 85 insertions(+), 1 deletion(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index f309fc1509f2..b9a62b805a99 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1397,6 +1397,9 @@ typedef void (*bpf_iter_show_fdinfo_t) (const struct bpf_iter_aux_info *aux, struct seq_file *seq); typedef int (*bpf_iter_fill_link_info_t)(const struct bpf_iter_aux_info *aux, struct bpf_link_info *info); +typedef const struct bpf_func_proto * +(*bpf_iter_get_func_proto_t)(enum bpf_func_id func_id, + const struct bpf_prog *prog); enum bpf_iter_feature { BPF_ITER_RESCHED = BIT(0), @@ -1409,6 +1412,7 @@ struct bpf_iter_reg { bpf_iter_detach_target_t detach_target; bpf_iter_show_fdinfo_t show_fdinfo; bpf_iter_fill_link_info_t fill_link_info; + bpf_iter_get_func_proto_t get_func_proto; u32 ctx_arg_info_size; u32 feature; struct bpf_ctx_arg_aux ctx_arg_info[BPF_ITER_CTX_ARG_MAX]; @@ -1431,6 +1435,8 @@ struct bpf_iter__bpf_map_elem { int bpf_iter_reg_target(const struct bpf_iter_reg *reg_info); void 
bpf_iter_unreg_target(const struct bpf_iter_reg *reg_info); bool bpf_iter_prog_supported(struct bpf_prog *prog); +const struct bpf_func_proto * +bpf_iter_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog); int bpf_iter_link_attach(const union bpf_attr *attr, bpfptr_t uattr, struct bpf_prog *prog); int bpf_iter_new_fd(struct bpf_link *link); bool bpf_link_is_iter(struct bpf_link *link); @@ -1997,6 +2003,8 @@ extern const struct bpf_func_proto bpf_task_storage_get_proto; extern const struct bpf_func_proto bpf_task_storage_delete_proto; extern const struct bpf_func_proto bpf_for_each_map_elem_proto; extern const struct bpf_func_proto bpf_btf_find_by_name_kind_proto; +extern const struct bpf_func_proto bpf_sk_setsockopt_proto; +extern const struct bpf_func_proto bpf_sk_getsockopt_proto; const struct bpf_func_proto *bpf_tracing_func_proto( enum bpf_func_id func_id, const struct bpf_prog *prog); diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c index 2d4fbdbb194e..2e9d47bb40ff 100644 --- a/kernel/bpf/bpf_iter.c +++ b/kernel/bpf/bpf_iter.c @@ -360,6 +360,28 @@ bool bpf_iter_prog_supported(struct bpf_prog *prog) return supported; } +const struct bpf_func_proto * +bpf_iter_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +{ + const struct bpf_iter_target_info *tinfo; + const struct bpf_func_proto *fn = NULL; + + mutex_lock(&targets_mutex); + list_for_each_entry(tinfo, &targets, list) { + if (tinfo->btf_id == prog->aux->attach_btf_id) { + const struct bpf_iter_reg *reg_info; + + reg_info = tinfo->reg_info; + if (reg_info->get_func_proto) + fn = reg_info->get_func_proto(func_id, prog); + break; + } + } + mutex_unlock(&targets_mutex); + + return fn; +} + static void bpf_iter_link_release(struct bpf_link *link) { struct bpf_iter_link *iter_link = diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 64bd2d84367f..a137494f2505 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1430,6 +1430,8 
@@ raw_tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) const struct bpf_func_proto * tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { + const struct bpf_func_proto *fn; + switch (func_id) { #ifdef CONFIG_NET case BPF_FUNC_skb_output: @@ -1470,7 +1472,10 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) case BPF_FUNC_d_path: return &bpf_d_path_proto; default: - return raw_tp_prog_func_proto(func_id, prog); + fn = raw_tp_prog_func_proto(func_id, prog); + if (!fn && prog->expected_attach_type == BPF_TRACE_ITER) + fn = bpf_iter_get_func_proto(func_id, prog); + return fn; } } diff --git a/net/core/filter.c b/net/core/filter.c index d70187ce851b..61f8121f205f 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5012,6 +5012,40 @@ static int _bpf_getsockopt(struct sock *sk, int level, int optname, return -EINVAL; } +BPF_CALL_5(bpf_sk_setsockopt, struct sock *, sk, int, level, + int, optname, char *, optval, int, optlen) +{ + return _bpf_setsockopt(sk, level, optname, optval, optlen); +} + +const struct bpf_func_proto bpf_sk_setsockopt_proto = { + .func = bpf_sk_setsockopt, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_BTF_ID_SOCK_COMMON, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_ANYTHING, + .arg4_type = ARG_PTR_TO_MEM, + .arg5_type = ARG_CONST_SIZE, +}; + +BPF_CALL_5(bpf_sk_getsockopt, struct sock *, sk, int, level, + int, optname, char *, optval, int, optlen) +{ + return _bpf_getsockopt(sk, level, optname, optval, optlen); +} + +const struct bpf_func_proto bpf_sk_getsockopt_proto = { + .func = bpf_sk_getsockopt, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_BTF_ID_SOCK_COMMON, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_ANYTHING, + .arg4_type = ARG_PTR_TO_UNINIT_MEM, + .arg5_type = ARG_CONST_SIZE, +}; + BPF_CALL_5(bpf_sock_addr_setsockopt, struct bpf_sock_addr_kern *, ctx, int, level, int, optname, char *, 
optval, int, optlen) { diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 3e1afab26381..6ea47850e1fa 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -3259,6 +3259,20 @@ static const struct bpf_iter_seq_info tcp_seq_info = { .seq_priv_size = sizeof(struct bpf_tcp_iter_state), }; +static const struct bpf_func_proto * +bpf_iter_tcp_get_func_proto(enum bpf_func_id func_id, + const struct bpf_prog *prog) +{ + switch (func_id) { + case BPF_FUNC_setsockopt: + return &bpf_sk_setsockopt_proto; + case BPF_FUNC_getsockopt: + return &bpf_sk_getsockopt_proto; + default: + return NULL; + } +} + static struct bpf_iter_reg tcp_reg_info = { .target = "tcp", .ctx_arg_info_size = 1, @@ -3266,6 +3280,7 @@ static struct bpf_iter_reg tcp_reg_info = { { offsetof(struct bpf_iter__tcp, sk_common), PTR_TO_BTF_ID_OR_NULL }, }, + .get_func_proto = bpf_iter_tcp_get_func_proto, .seq_info = &tcp_seq_info, }; From patchwork Thu Jul 1 20:06:25 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 469823 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 57787C11F64 for ; Thu, 1 Jul 2021 20:07:55 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 344F46140D for ; Thu, 1 Jul 2021 20:07:55 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233681AbhGAUKZ (ORCPT ); Thu, 1 Jul 2021 16:10:25 -0400 
Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:25100 "EHLO mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232151AbhGAUKY (ORCPT ); Thu, 1 Jul 2021 16:10:24 -0400 Received: from pps.filterd (m0109332.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.43/8.16.0.43) with SMTP id 161K15vn013549 for ; Thu, 1 Jul 2021 13:07:52 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=yAcd1HjSZpDkmk8UJDbX9Z0p7t9qpbvHFOFAnstSoJM=; b=bj0EFrbWmlWAuRF0DXKu3ZRE5mrIHTLsxR6XPBFXnIYCruIq/ADGeZfkLYZQomUwaiNz TjKDLEYXiZRgRcQBM2Ph04j9NM3MFQrhhQ1YJNhHk7tN9AwSZhjQJP+XoBq7jhX8tNjr FTLLwPcY2P2mEwMQQZG0lUG5kSkrgC2V82s= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com with ESMTP id 39h84wcmfm-10 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Thu, 01 Jul 2021 13:07:52 -0700 Received: from intmgw006.03.ash8.facebook.com (2620:10d:c085:108::4) by mail.thefacebook.com (2620:10d:c085:11d::6) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2176.2; Thu, 1 Jul 2021 13:06:28 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 9ACBA2940BCC; Thu, 1 Jul 2021 13:06:25 -0700 (PDT) From: Martin KaFai Lau To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Neal Cardwell , , Yonghong Song , Yuchung Cheng Subject: [PATCH v2 bpf-next 8/8] bpf: selftest: Test batching and bpf_(get|set)sockopt in bpf tcp iter Date: Thu, 1 Jul 2021 13:06:25 -0700 Message-ID: <20210701200625.1036874-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20210701200535.1033513-1-kafai@fb.com> References: <20210701200535.1033513-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-ORIG-GUID: WMcmdAgI5Rj003oOfj0hYtnU1HCIMOWq 
X-Proofpoint-GUID: WMcmdAgI5Rj003oOfj0hYtnU1HCIMOWq X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391, 18.0.790 definitions=2021-07-01_12:2021-07-01,2021-07-01 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 clxscore=1015 suspectscore=0 mlxscore=0 priorityscore=1501 impostorscore=0 spamscore=0 bulkscore=0 lowpriorityscore=0 malwarescore=0 mlxlogscore=999 adultscore=0 phishscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2104190000 definitions=main-2107010117 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This patch adds tests for the batching and bpf_(get|set)sockopt in bpf tcp iter. It first creates: a) 1 non-SO_REUSEPORT listener in lhash2. b) 256 passive and active fds connected to the listener in (a). c) 256 SO_REUSEPORT listeners in one of the lhash2 buckets. The test sets all listeners and connections to bpf_cubic before running the bpf iter. The bpf iter then calls setsockopt(TCP_CONGESTION) to switch each listener and connection from bpf_cubic to bpf_dctcp. The bpf iter has a random_retry mode such that it can return EAGAIN to userspace in the middle of a batch.
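For readers unfamiliar with the user-space side of such a check, the sketch below sets TCP_CONGESTION on a socket and reads it back. It is a minimal illustration, not the selftest itself: set_tcp_cc(), get_tcp_cc() and tcp_cc_roundtrip() are hypothetical helper names, and the usage note assumes the built-in "reno" algorithm, since bpf_cubic and bpf_dctcp are presumably only selectable once the test has loaded them.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>

/* Select the congestion control algorithm by name; 0 on success. */
static int set_tcp_cc(int fd, const char *name)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name,
			  strlen(name) + 1);
}

/* Read the current algorithm name into buf; 0 on success. */
static int get_tcp_cc(int fd, char *buf, size_t len)
{
	socklen_t optlen;

	assert(len > 1);
	memset(buf, 0, len);
	optlen = (socklen_t)(len - 1);	/* keep the last byte as a NUL */

	return getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, buf, &optlen);
}

/*
 * Create a TCP socket, set its congestion control to name and read the
 * value back into out. Returns 0 on success and -1 on any failure.
 */
static int tcp_cc_roundtrip(const char *name, char *out, size_t len)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	int err;

	if (fd < 0)
		return -1;

	err = set_tcp_cc(fd, name);
	if (!err)
		err = get_tcp_cc(fd, out, len);
	close(fd);

	return err ? -1 : 0;
}
```

On a system where the call succeeds, tcp_cc_roundtrip("reno", cc, sizeof(cc)) with a 16-byte cc buffer should leave the string "reno" in cc.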
Acked-by: Yonghong Song Signed-off-by: Martin KaFai Lau --- tools/testing/selftests/bpf/network_helpers.c | 85 ++++++- tools/testing/selftests/bpf/network_helpers.h | 4 + .../bpf/prog_tests/bpf_iter_setsockopt.c | 226 ++++++++++++++++++ .../selftests/bpf/progs/bpf_iter_setsockopt.c | 72 ++++++ .../selftests/bpf/progs/bpf_tracing_net.h | 6 + 5 files changed, 384 insertions(+), 9 deletions(-) create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_iter_setsockopt.c create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_setsockopt.c diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c index 2060bc122c53..26468a8f44f3 100644 --- a/tools/testing/selftests/bpf/network_helpers.c +++ b/tools/testing/selftests/bpf/network_helpers.c @@ -66,17 +66,13 @@ int settimeo(int fd, int timeout_ms) #define save_errno_close(fd) ({ int __save = errno; close(fd); errno = __save; }) -int start_server(int family, int type, const char *addr_str, __u16 port, - int timeout_ms) +static int __start_server(int type, const struct sockaddr *addr, + socklen_t addrlen, int timeout_ms, bool reuseport) { - struct sockaddr_storage addr = {}; - socklen_t len; + int on = 1; int fd; - if (make_sockaddr(family, addr_str, port, &addr, &len)) - return -1; - - fd = socket(family, type, 0); + fd = socket(addr->sa_family, type, 0); if (fd < 0) { log_err("Failed to create server socket"); return -1; @@ -85,7 +81,13 @@ int start_server(int family, int type, const char *addr_str, __u16 port, if (settimeo(fd, timeout_ms)) goto error_close; - if (bind(fd, (const struct sockaddr *)&addr, len) < 0) { + if (reuseport && + setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on))) { + log_err("Failed to set SO_REUSEPORT"); + goto error_close; + } + + if (bind(fd, addr, addrlen) < 0) { log_err("Failed to bind socket"); goto error_close; } @@ -104,6 +106,69 @@ int start_server(int family, int type, const char *addr_str, __u16 port, return -1; } +int 
start_server(int family, int type, const char *addr_str, __u16 port, + int timeout_ms) +{ + struct sockaddr_storage addr; + socklen_t addrlen; + + if (make_sockaddr(family, addr_str, port, &addr, &addrlen)) + return -1; + + return __start_server(type, (struct sockaddr *)&addr, + addrlen, timeout_ms, false); +} + +int *start_reuseport_server(int family, int type, const char *addr_str, + __u16 port, int timeout_ms, unsigned int nr_listens) +{ + struct sockaddr_storage addr; + unsigned int nr_fds = 0; + socklen_t addrlen; + int *fds; + + if (!nr_listens) + return NULL; + + if (make_sockaddr(family, addr_str, port, &addr, &addrlen)) + return NULL; + + fds = malloc(sizeof(*fds) * nr_listens); + if (!fds) + return NULL; + + fds[0] = __start_server(type, (struct sockaddr *)&addr, addrlen, + timeout_ms, true); + if (fds[0] == -1) + goto close_fds; + nr_fds = 1; + + if (getsockname(fds[0], (struct sockaddr *)&addr, &addrlen)) + goto close_fds; + + for (; nr_fds < nr_listens; nr_fds++) { + fds[nr_fds] = __start_server(type, (struct sockaddr *)&addr, + addrlen, timeout_ms, true); + if (fds[nr_fds] == -1) + goto close_fds; + } + + return fds; + +close_fds: + free_fds(fds, nr_fds); + return NULL; +} + +void free_fds(int *fds, unsigned int nr_close_fds) +{ + if (fds) { + while (nr_close_fds) + close(fds[--nr_close_fds]); + free(fds); + } +} + int fastopen_connect(int server_fd, const char *data, unsigned int data_len, int timeout_ms) { @@ -217,6 +282,7 @@ int make_sockaddr(int family, const char *addr_str, __u16 port, if (family == AF_INET) { struct sockaddr_in *sin = (void *)addr; + memset(addr, 0, sizeof(*sin)); sin->sin_family = AF_INET; sin->sin_port = htons(port); if (addr_str && @@ -230,6 +296,7 @@ int make_sockaddr(int family, const char *addr_str, __u16 port, } else if (family == AF_INET6) { struct sockaddr_in6 *sin6 = (void *)addr; + memset(addr, 0, sizeof(*sin6)); sin6->sin6_family = AF_INET6; sin6->sin6_port = htons(port); if (addr_str && diff --git 
a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h index 5e0d51c07b63..d60bc2897770 100644 --- a/tools/testing/selftests/bpf/network_helpers.h +++ b/tools/testing/selftests/bpf/network_helpers.h @@ -36,6 +36,10 @@ extern struct ipv6_packet pkt_v6; int settimeo(int fd, int timeout_ms); int start_server(int family, int type, const char *addr, __u16 port, int timeout_ms); +int *start_reuseport_server(int family, int type, const char *addr_str, + __u16 port, int timeout_ms, + unsigned int nr_listens); +void free_fds(int *fds, unsigned int nr_close_fds); int connect_to_fd(int server_fd, int timeout_ms); int connect_fd_to_fd(int client_fd, int server_fd, int timeout_ms); int fastopen_connect(int server_fd, const char *data, unsigned int data_len, diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter_setsockopt.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter_setsockopt.c new file mode 100644 index 000000000000..85babb0487b3 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter_setsockopt.c @@ -0,0 +1,226 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2021 Facebook */ +#define _GNU_SOURCE +#include +#include +#include "network_helpers.h" +#include "bpf_dctcp.skel.h" +#include "bpf_cubic.skel.h" +#include "bpf_iter_setsockopt.skel.h" + +static int create_netns(void) +{ + if (!ASSERT_OK(unshare(CLONE_NEWNET), "create netns")) + return -1; + + if (!ASSERT_OK(system("ip link set dev lo up"), "bring up lo")) + return -1; + + return 0; +} + +static unsigned int set_bpf_cubic(int *fds, unsigned int nr_fds) +{ + unsigned int i; + + for (i = 0; i < nr_fds; i++) { + if (setsockopt(fds[i], SOL_TCP, TCP_CONGESTION, "bpf_cubic", + sizeof("bpf_cubic"))) + return i; + } + + return nr_fds; +} + +static unsigned int check_bpf_dctcp(int *fds, unsigned int nr_fds) +{ + char tcp_cc[16]; + socklen_t optlen = sizeof(tcp_cc); + unsigned int i; + + for (i = 0; i < nr_fds; i++) { + if (getsockopt(fds[i], SOL_TCP, 
TCP_CONGESTION, + tcp_cc, &optlen) || + strcmp(tcp_cc, "bpf_dctcp")) + return i; + } + + return nr_fds; +} + +static int *make_established(int listen_fd, unsigned int nr_est, + int **paccepted_fds) +{ + int *est_fds, *accepted_fds; + unsigned int i; + + est_fds = malloc(sizeof(*est_fds) * nr_est); + if (!est_fds) + return NULL; + + accepted_fds = malloc(sizeof(*accepted_fds) * nr_est); + if (!accepted_fds) { + free(est_fds); + return NULL; + } + + for (i = 0; i < nr_est; i++) { + est_fds[i] = connect_to_fd(listen_fd, 0); + if (est_fds[i] == -1) + break; + if (set_bpf_cubic(&est_fds[i], 1) != 1) { + close(est_fds[i]); + break; + } + + accepted_fds[i] = accept(listen_fd, NULL, 0); + if (accepted_fds[i] == -1) { + close(est_fds[i]); + break; + } + } + + if (!ASSERT_EQ(i, nr_est, "create established fds")) { + free_fds(accepted_fds, i); + free_fds(est_fds, i); + return NULL; + } + + *paccepted_fds = accepted_fds; + return est_fds; +} + +static unsigned short get_local_port(int fd) +{ + struct sockaddr_in6 addr; + socklen_t addrlen = sizeof(addr); + + if (!getsockname(fd, &addr, &addrlen)) + return ntohs(addr.sin6_port); + + return 0; +} + +static void do_bpf_iter_setsockopt(struct bpf_iter_setsockopt *iter_skel, + bool random_retry) +{ + int *reuse_listen_fds = NULL, *accepted_fds = NULL, *est_fds = NULL; + unsigned int nr_reuse_listens = 256, nr_est = 256; + int err, iter_fd = -1, listen_fd = -1; + char buf; + + /* Prepare non-reuseport listen_fd */ + listen_fd = start_server(AF_INET6, SOCK_STREAM, "::1", 0, 0); + if (!ASSERT_GE(listen_fd, 0, "start_server")) + return; + if (!ASSERT_EQ(set_bpf_cubic(&listen_fd, 1), 1, + "set listen_fd to cubic")) + goto done; + iter_skel->bss->listen_hport = get_local_port(listen_fd); + if (!ASSERT_NEQ(iter_skel->bss->listen_hport, 0, + "get_local_port(listen_fd)")) + goto done; + + /* Connect to non-reuseport listen_fd */ + est_fds = make_established(listen_fd, nr_est, &accepted_fds); + if (!ASSERT_OK_PTR(est_fds, "create 
established")) + goto done; + + /* Prepare reuseport listen fds */ + reuse_listen_fds = start_reuseport_server(AF_INET6, SOCK_STREAM, + "::1", 0, 0, + nr_reuse_listens); + if (!ASSERT_OK_PTR(reuse_listen_fds, "start_reuseport_server")) + goto done; + if (!ASSERT_EQ(set_bpf_cubic(reuse_listen_fds, nr_reuse_listens), + nr_reuse_listens, "set reuse_listen_fds to cubic")) + goto done; + iter_skel->bss->reuse_listen_hport = get_local_port(reuse_listen_fds[0]); + if (!ASSERT_NEQ(iter_skel->bss->reuse_listen_hport, 0, + "get_local_port(reuse_listen_fds[0])")) + goto done; + + /* Run bpf tcp iter to switch from bpf_cubic to bpf_dctcp */ + iter_skel->bss->random_retry = random_retry; + iter_fd = bpf_iter_create(bpf_link__fd(iter_skel->links.change_tcp_cc)); + if (!ASSERT_GE(iter_fd, 0, "create iter_fd")) + goto done; + + while ((err = read(iter_fd, &buf, sizeof(buf))) == -1 && + errno == EAGAIN) + ; + if (!ASSERT_OK(err, "read iter error")) + goto done; + + /* Check reuseport listen fds for dctcp */ + ASSERT_EQ(check_bpf_dctcp(reuse_listen_fds, nr_reuse_listens), + nr_reuse_listens, + "check reuse_listen_fds dctcp"); + + /* Check non reuseport listen fd for dctcp */ + ASSERT_EQ(check_bpf_dctcp(&listen_fd, 1), 1, + "check listen_fd dctcp"); + + /* Check established fds for dctcp */ + ASSERT_EQ(check_bpf_dctcp(est_fds, nr_est), nr_est, + "check est_fds dctcp"); + + /* Check accepted fds for dctcp */ + ASSERT_EQ(check_bpf_dctcp(accepted_fds, nr_est), nr_est, + "check accepted_fds dctcp"); + +done: + if (iter_fd != -1) + close(iter_fd); + if (listen_fd != -1) + close(listen_fd); + free_fds(reuse_listen_fds, nr_reuse_listens); + free_fds(accepted_fds, nr_est); + free_fds(est_fds, nr_est); +} + +void test_bpf_iter_setsockopt(void) +{ + struct bpf_iter_setsockopt *iter_skel = NULL; + struct bpf_cubic *cubic_skel = NULL; + struct bpf_dctcp *dctcp_skel = NULL; + struct bpf_link *cubic_link = NULL; + struct bpf_link *dctcp_link = NULL; + + if (create_netns()) + return; + + /* Load 
iter_skel */ + iter_skel = bpf_iter_setsockopt__open_and_load(); + if (!ASSERT_OK_PTR(iter_skel, "iter_skel")) + return; + iter_skel->links.change_tcp_cc = bpf_program__attach_iter(iter_skel->progs.change_tcp_cc, NULL); + if (!ASSERT_OK_PTR(iter_skel->links.change_tcp_cc, "attach iter")) + goto done; + + /* Load bpf_cubic */ + cubic_skel = bpf_cubic__open_and_load(); + if (!ASSERT_OK_PTR(cubic_skel, "cubic_skel")) + goto done; + cubic_link = bpf_map__attach_struct_ops(cubic_skel->maps.cubic); + if (!ASSERT_OK_PTR(cubic_link, "cubic_link")) + goto done; + + /* Load bpf_dctcp */ + dctcp_skel = bpf_dctcp__open_and_load(); + if (!ASSERT_OK_PTR(dctcp_skel, "dctcp_skel")) + goto done; + dctcp_link = bpf_map__attach_struct_ops(dctcp_skel->maps.dctcp); + if (!ASSERT_OK_PTR(dctcp_link, "dctcp_link")) + goto done; + + do_bpf_iter_setsockopt(iter_skel, true); + do_bpf_iter_setsockopt(iter_skel, false); + +done: + bpf_link__destroy(cubic_link); + bpf_link__destroy(dctcp_link); + bpf_cubic__destroy(cubic_skel); + bpf_dctcp__destroy(dctcp_skel); + bpf_iter_setsockopt__destroy(iter_skel); +} diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_setsockopt.c b/tools/testing/selftests/bpf/progs/bpf_iter_setsockopt.c new file mode 100644 index 000000000000..b77adfd55d73 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/bpf_iter_setsockopt.c @@ -0,0 +1,72 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2021 Facebook */ +#include "bpf_iter.h" +#include "bpf_tracing_net.h" +#include +#include + +#define bpf_tcp_sk(skc) ({ \ + struct sock_common *_skc = skc; \ + sk = NULL; \ + tp = NULL; \ + if (_skc) { \ + tp = bpf_skc_to_tcp_sock(_skc); \ + sk = (struct sock *)tp; \ + } \ + tp; \ +}) + +unsigned short reuse_listen_hport = 0; +unsigned short listen_hport = 0; +char cubic_cc[TCP_CA_NAME_MAX] = "bpf_cubic"; +char dctcp_cc[TCP_CA_NAME_MAX] = "bpf_dctcp"; +bool random_retry = false; + +static bool tcp_cc_eq(const char *a, const char *b) +{ + int i; + + for (i = 0; i < 
TCP_CA_NAME_MAX; i++) { + if (a[i] != b[i]) + return false; + if (!a[i]) + break; + } + + return true; +} + +SEC("iter/tcp") +int change_tcp_cc(struct bpf_iter__tcp *ctx) +{ + char cur_cc[TCP_CA_NAME_MAX]; + struct tcp_sock *tp; + struct sock *sk; + int ret; + + if (!bpf_tcp_sk(ctx->sk_common)) + return 0; + + if (sk->sk_family != AF_INET6 || + (sk->sk_state != TCP_LISTEN && + sk->sk_state != TCP_ESTABLISHED) || + (sk->sk_num != reuse_listen_hport && + sk->sk_num != listen_hport && + bpf_ntohs(sk->sk_dport) != listen_hport)) + return 0; + + if (bpf_getsockopt(tp, SOL_TCP, TCP_CONGESTION, + cur_cc, sizeof(cur_cc))) + return 0; + + if (!tcp_cc_eq(cur_cc, cubic_cc)) + return 0; + + if (random_retry && bpf_get_prandom_u32() % 4 == 1) + return 1; + + bpf_setsockopt(tp, SOL_TCP, TCP_CONGESTION, dctcp_cc, sizeof(dctcp_cc)); + return 0; +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h index 01378911252b..3af0998a0623 100644 --- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h +++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h @@ -5,6 +5,10 @@ #define AF_INET 2 #define AF_INET6 10 +#define SOL_TCP 6 +#define TCP_CONGESTION 13 +#define TCP_CA_NAME_MAX 16 + #define ICSK_TIME_RETRANS 1 #define ICSK_TIME_PROBE0 3 #define ICSK_TIME_LOSS_PROBE 5 @@ -32,6 +36,8 @@ #define ir_v6_rmt_addr req.__req_common.skc_v6_daddr #define ir_v6_loc_addr req.__req_common.skc_v6_rcv_saddr +#define sk_num __sk_common.skc_num +#define sk_dport __sk_common.skc_dport #define sk_family __sk_common.skc_family #define sk_rmem_alloc sk_backlog.rmem_alloc #define sk_refcnt __sk_common.skc_refcnt