From patchwork Fri Jun 26 17:55:14 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Martin KaFai Lau <kafai@fb.com>
X-Patchwork-Id: 217023
Return-Path: <SRS0=3T32=AH=vger.kernel.org=netdev-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
 aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-10.1 required=3.0 tests=DKIMWL_WL_HIGH, DKIM_SIGNED, 
 DKIM_VALID, DKIM_VALID_AU, HEADER_FROM_DIFFERENT_DOMAINS,
 INCLUDES_PATCH, 
 MAILING_LIST_MULTI, SIGNED_OFF_BY, SPF_HELO_NONE, SPF_PASS,
 USER_AGENT_GIT
 autolearn=unavailable autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
 by smtp.lore.kernel.org (Postfix) with ESMTP id C2C0CC433E0
 for <netdev@archiver.kernel.org>;
 Fri, 26 Jun 2020 18:15:55 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
 by mail.kernel.org (Postfix) with ESMTP id A02AB206DD
 for <netdev@archiver.kernel.org>;
 Fri, 26 Jun 2020 18:15:55 +0000 (UTC)
Authentication-Results: mail.kernel.org;
 dkim=pass (1024-bit key) header.d=fb.com header.i=@fb.com
 header.b="jjbj+HQ1"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S1725992AbgFZSPy (ORCPT <rfc822;netdev@archiver.kernel.org>);
 Fri, 26 Jun 2020 14:15:54 -0400
Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:15724 "EHLO
 mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK)
 by vger.kernel.org with ESMTP id S1725780AbgFZSPy (ORCPT
 <rfc822;netdev@vger.kernel.org>); Fri, 26 Jun 2020 14:15:54 -0400
Received: from pps.filterd (m0148460.ppops.net [127.0.0.1])
 by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id
 05QIEjYP016997
 for <netdev@vger.kernel.org>; Fri, 26 Jun 2020 11:15:52 -0700
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com;
 h=from : to : cc : subject
 : date : message-id : in-reply-to : references : mime-version :
 content-transfer-encoding : content-type; s=facebook;
 bh=+5gX+pClJaFpnEtzx25mq9s0SmaUpVNIkxWYhJOUFrM=;
 b=jjbj+HQ1lXA9ZrMc6d9UdOLZPcE/7YXWuSgF1bJANngB32IvlmN04TbZKCoWDSTLADdp
 XATfKmuBwcI3khZHsmETzVIaBlZfYru6t15ZfdRy99GNh584znvxplXF7C0aTst4D6t0
 2HmKeOKvbquCPL3B+xC4QEMdHpFKSBpiYp0= 
Received: from mail.thefacebook.com ([163.114.132.120])
 by mx0a-00082601.pphosted.com with ESMTP id 31ux1exr8r-2
 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128
 verify=NOT)
 for <netdev@vger.kernel.org>; Fri, 26 Jun 2020 11:15:52 -0700
Received: from intmgw005.03.ash8.facebook.com (2620:10d:c085:208::11) by
 mail.thefacebook.com (2620:10d:c085:21d::5) with Microsoft SMTP
 Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id
 15.1.1979.3; Fri, 26 Jun 2020 11:15:52 -0700
Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611)
 id 59D572942E38; Fri, 26 Jun 2020 10:55:14 -0700 (PDT)
Smtp-Origin-Hostprefix: devbig
From: Martin KaFai Lau <kafai@fb.com>
Smtp-Origin-Hostname: devbig005.ftw2.facebook.com
To: <bpf@vger.kernel.org>
CC: Alexei Starovoitov <ast@kernel.org>,
 Daniel Borkmann <daniel@iogearbox.net>,
 Eric Dumazet <edumazet@google.com>, <kernel-team@fb.com>,
 Lawrence Brakmo <brakmo@fb.com>,
 Neal Cardwell <ncardwell@google.com>, <netdev@vger.kernel.org>,
 Yuchung Cheng <ycheng@google.com>
Smtp-Origin-Cluster: ftw2c04
Subject: [PATCH bpf-next 02/10] tcp: bpf: Parse BPF experimental header option
Date: Fri, 26 Jun 2020 10:55:14 -0700
Message-ID: <20200626175514.1460570-1-kafai@fb.com>
X-Mailer: git-send-email 2.24.1
In-Reply-To: <20200626175501.1459961-1-kafai@fb.com>
References: <20200626175501.1459961-1-kafai@fb.com>
MIME-Version: 1.0
X-FB-Internal: Safe
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.216, 18.0.687
 definitions=2020-06-26_10:2020-06-26,2020-06-26 signatures=0
X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0
 lowpriorityscore=0
 clxscore=1015 cotscore=-2147483648 spamscore=0 mlxlogscore=607
 bulkscore=0
 adultscore=0 phishscore=0 priorityscore=1501 malwarescore=0
 mlxscore=0
 suspectscore=13 impostorscore=0 classifier=spam adjust=0 reason=mlx
 scancount=1 engine=8.12.0-2004280000 definitions=main-2006260129
X-FB-Internal: deliver
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org

This patch adds logic to parse experimental kind 254 with 16 bit magic
0xeB9F.  The latter patch will allow bpf prog to write and parse data
under this experimental kind and magic.

A one byte bpf_hdr_opt_off is added to tcp_skb_cb by using an existing
4 byte hole.  It is only used in rx.  It stores the offset to the
bpf experimental option and will be made available to BPF prog
in a latter patch.  This offset is also stored in the saved_syn.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/request_sock.h | 1 +
 include/net/tcp.h          | 3 +++
 net/ipv4/tcp_input.c       | 6 ++++++
 net/ipv4/tcp_ipv4.c        | 1 +
 net/ipv6/tcp_ipv6.c        | 1 +
 5 files changed, 12 insertions(+)

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index d77237ec9fb4..55297286c066 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -43,6 +43,7 @@ int inet_rtx_syn_ack(const struct sock *parent, struct request_sock *req);
 
 struct saved_syn {
 	u32 network_hdrlen;
+	u32 bpf_hdr_opt_off;
 	u8 data[];
 };
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index eab1c7d0facb..07a9dfe35242 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -191,6 +191,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
  */
 #define TCPOPT_FASTOPEN_MAGIC	0xF989
 #define TCPOPT_SMC_MAGIC	0xE2D4C3D9
+#define TCPOPT_BPF_MAGIC	0xEB9F
 
 /*
  *     TCP option lengths
@@ -204,6 +205,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOLEN_FASTOPEN_BASE  2
 #define TCPOLEN_EXP_FASTOPEN_BASE  4
 #define TCPOLEN_EXP_SMC_BASE   6
+#define TCPOLEN_EXP_BPF_BASE   4
 
 /* But this is what stacks really send out. */
 #define TCPOLEN_TSTAMP_ALIGNED		12
@@ -857,6 +859,7 @@ struct tcp_skb_cb {
 			has_rxtstamp:1,	/* SKB has a RX timestamp	*/
 			unused:5;
 	__u32		ack_seq;	/* Sequence number ACK'd	*/
+	__u8            bpf_hdr_opt_off;/* offset to bpf hdr option. rx only. */
 	union {
 		struct {
 			/* There is space for up to 24 bytes */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index eb0e32b2def9..640408a80b3d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3924,6 +3924,10 @@ void tcp_parse_options(const struct net *net,
 					tcp_parse_fastopen_option(opsize -
 						TCPOLEN_EXP_FASTOPEN_BASE,
 						ptr + 2, th->syn, foc, true);
+				else if (opsize >= TCPOLEN_EXP_BPF_BASE &&
+					 get_unaligned_be16(ptr) ==
+					 TCPOPT_BPF_MAGIC)
+					TCP_SKB_CB(skb)->bpf_hdr_opt_off = (ptr - 2) - (unsigned char *)th;
 				else
 					smc_parse_options(th, opt_rx, ptr,
 							  opsize);
@@ -6562,6 +6566,8 @@ static void tcp_reqsk_record_syn(const struct sock *sk,
 		saved_syn = kmalloc(len + sizeof(*saved_syn), GFP_ATOMIC);
 		if (saved_syn) {
 			saved_syn->network_hdrlen = skb_network_header_len(skb);
+			saved_syn->bpf_hdr_opt_off =
+				TCP_SKB_CB(skb)->bpf_hdr_opt_off;
 			memcpy(saved_syn->data, skb_network_header(skb), len);
 			req->saved_syn = saved_syn;
 		}
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ea0df9fd7618..a3535b7fe002 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1864,6 +1864,7 @@ static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
 	TCP_SKB_CB(skb)->sacked	 = 0;
 	TCP_SKB_CB(skb)->has_rxtstamp =
 			skb->tstamp || skb_hwtstamps(skb)->hwtstamp;
+	TCP_SKB_CB(skb)->bpf_hdr_opt_off = 0;
 }
 
 /*
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index f67d45ff00b4..8356d0562279 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1545,6 +1545,7 @@ static void tcp_v6_fill_cb(struct sk_buff *skb, const struct ipv6hdr *hdr,
 	TCP_SKB_CB(skb)->sacked = 0;
 	TCP_SKB_CB(skb)->has_rxtstamp =
 			skb->tstamp || skb_hwtstamps(skb)->hwtstamp;
+	TCP_SKB_CB(skb)->bpf_hdr_opt_off = 0;
 }
 
 INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)

From patchwork Fri Jun 26 17:55:20 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Martin KaFai Lau <kafai@fb.com>
X-Patchwork-Id: 217028
Return-Path: <SRS0=3T32=AH=vger.kernel.org=netdev-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
 aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-10.1 required=3.0 tests=DKIMWL_WL_HIGH, DKIM_SIGNED, 
 DKIM_VALID, DKIM_VALID_AU, HEADER_FROM_DIFFERENT_DOMAINS,
 INCLUDES_PATCH, 
 MAILING_LIST_MULTI, SIGNED_OFF_BY, SPF_HELO_NONE, SPF_PASS,
 URIBL_BLOCKED, 
 USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
 by smtp.lore.kernel.org (Postfix) with ESMTP id AC8BCC433E0
 for <netdev@archiver.kernel.org>;
 Fri, 26 Jun 2020 17:55:25 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
 by mail.kernel.org (Postfix) with ESMTP id 71C2720707
 for <netdev@archiver.kernel.org>;
 Fri, 26 Jun 2020 17:55:25 +0000 (UTC)
Authentication-Results: mail.kernel.org;
 dkim=pass (1024-bit key) header.d=fb.com header.i=@fb.com
 header.b="pQKT8OsA"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S1726730AbgFZRzY (ORCPT <rfc822;netdev@archiver.kernel.org>);
 Fri, 26 Jun 2020 13:55:24 -0400
Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:19980 "EHLO
 mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK)
 by vger.kernel.org with ESMTP id S1726337AbgFZRzY (ORCPT
 <rfc822;netdev@vger.kernel.org>); Fri, 26 Jun 2020 13:55:24 -0400
Received: from pps.filterd (m0148460.ppops.net [127.0.0.1])
 by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id
 05QHsukv019446
 for <netdev@vger.kernel.org>; Fri, 26 Jun 2020 10:55:23 -0700
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com;
 h=from : to : cc : subject
 : date : message-id : in-reply-to : references : mime-version :
 content-transfer-encoding : content-type; s=facebook;
 bh=znNTPRNyXLzzPQruPP5HgKBxQ3n/j1DseZqoGcyxpv4=;
 b=pQKT8OsA9ZQUZo4TMaiSxNtXk9pnbYlkv6GXDEdHFp1ElQERSWI/zf5lBQsaQZ7WlYYZ
 QdPVXWAM/BsiGTzHFkwEk7i7x+nsFeJ+V8S059VcptlQAiiNUWEqWPKrSzThem08D/bE
 ATF2bllztCxLaGJ8MCCV2vq9Dwg5GK+Btzk= 
Received: from maileast.thefacebook.com ([163.114.130.16])
 by mx0a-00082601.pphosted.com with ESMTP id 31ux1exn06-11
 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128
 verify=NOT)
 for <netdev@vger.kernel.org>; Fri, 26 Jun 2020 10:55:23 -0700
Received: from intmgw001.03.ash8.facebook.com (2620:10d:c0a8:1b::d) by
 mail.thefacebook.com (2620:10d:c0a8:82::f) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id
 15.1.1979.3; Fri, 26 Jun 2020 10:55:21 -0700
Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611)
 id B05D92942E38; Fri, 26 Jun 2020 10:55:20 -0700 (PDT)
Smtp-Origin-Hostprefix: devbig
From: Martin KaFai Lau <kafai@fb.com>
Smtp-Origin-Hostname: devbig005.ftw2.facebook.com
To: <bpf@vger.kernel.org>
CC: Alexei Starovoitov <ast@kernel.org>,
 Daniel Borkmann <daniel@iogearbox.net>,
 Eric Dumazet <edumazet@google.com>, <kernel-team@fb.com>,
 Lawrence Brakmo <brakmo@fb.com>,
 Neal Cardwell <ncardwell@google.com>, <netdev@vger.kernel.org>,
 Yuchung Cheng <ycheng@google.com>
Smtp-Origin-Cluster: ftw2c04
Subject: [PATCH bpf-next 03/10] bpf: sock_ops: Change some members of
 sock_ops_kern from u32 to u8
Date: Fri, 26 Jun 2020 10:55:20 -0700
Message-ID: <20200626175520.1460850-1-kafai@fb.com>
X-Mailer: git-send-email 2.24.1
In-Reply-To: <20200626175501.1459961-1-kafai@fb.com>
References: <20200626175501.1459961-1-kafai@fb.com>
MIME-Version: 1.0
X-FB-Internal: Safe
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.216, 18.0.687
 definitions=2020-06-26_10:2020-06-26,2020-06-26 signatures=0
X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0
 lowpriorityscore=0
 clxscore=1015 cotscore=-2147483648 spamscore=0 mlxlogscore=737
 bulkscore=0
 adultscore=0 phishscore=0 priorityscore=1501 malwarescore=0
 mlxscore=0
 suspectscore=13 impostorscore=0 classifier=spam adjust=0 reason=mlx
 scancount=1 engine=8.12.0-2004280000 definitions=main-2006260126
X-FB-Internal: deliver
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org

A latter patch needs to add a few pointers and a few u8 to
sock_ops_kern.  Hence, this patch saves some spaces by moving
some of the existing members from u32 to u8 so that the latter
patch can still fit everything in a cacheline.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/filter.h |  4 ++--
 net/core/filter.c      | 15 ++++++++++-----
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 259377723603..aae52cbda670 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1236,13 +1236,13 @@ struct bpf_sock_addr_kern {
 
 struct bpf_sock_ops_kern {
 	struct	sock *sk;
-	u32	op;
 	union {
 		u32 args[4];
 		u32 reply;
 		u32 replylong[4];
 	};
-	u32	is_fullsock;
+	u8	op;
+	u8	is_fullsock;
 	u64	temp;			/* temp and everything after is not
 					 * initialized to 0 before calling
 					 * the BPF program. New fields that
diff --git a/net/core/filter.c b/net/core/filter.c
index 19dbcc8448d8..1dd07972c5c7 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8355,17 +8355,22 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 		return insn - insn_buf;
 
 	switch (si->off) {
-	case offsetof(struct bpf_sock_ops, op) ...
+	case offsetof(struct bpf_sock_ops, op):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
+						       op),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern, op));
+		break;
+
+	case offsetof(struct bpf_sock_ops, replylong[0]) ...
 	     offsetof(struct bpf_sock_ops, replylong[3]):
-		BUILD_BUG_ON(sizeof_field(struct bpf_sock_ops, op) !=
-			     sizeof_field(struct bpf_sock_ops_kern, op));
 		BUILD_BUG_ON(sizeof_field(struct bpf_sock_ops, reply) !=
 			     sizeof_field(struct bpf_sock_ops_kern, reply));
 		BUILD_BUG_ON(sizeof_field(struct bpf_sock_ops, replylong) !=
 			     sizeof_field(struct bpf_sock_ops_kern, replylong));
 		off = si->off;
-		off -= offsetof(struct bpf_sock_ops, op);
-		off += offsetof(struct bpf_sock_ops_kern, op);
+		off -= offsetof(struct bpf_sock_ops, replylong[0]);
+		off += offsetof(struct bpf_sock_ops_kern, replylong[0]);
 		if (type == BPF_WRITE)
 			*insn++ = BPF_STX_MEM(BPF_W, si->dst_reg, si->src_reg,
 					      off);

From patchwork Fri Jun 26 17:55:26 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Martin KaFai Lau <kafai@fb.com>
X-Patchwork-Id: 217027
Return-Path: <SRS0=3T32=AH=vger.kernel.org=netdev-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
 aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-10.1 required=3.0 tests=DKIMWL_WL_HIGH, DKIM_SIGNED, 
 DKIM_VALID, DKIM_VALID_AU, HEADER_FROM_DIFFERENT_DOMAINS,
 INCLUDES_PATCH, 
 MAILING_LIST_MULTI, SIGNED_OFF_BY, SPF_HELO_NONE, SPF_PASS,
 URIBL_BLOCKED, 
 USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
 by smtp.lore.kernel.org (Postfix) with ESMTP id 7749EC433E0
 for <netdev@archiver.kernel.org>;
 Fri, 26 Jun 2020 17:55:47 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
 by mail.kernel.org (Postfix) with ESMTP id 2BA9D206B7
 for <netdev@archiver.kernel.org>;
 Fri, 26 Jun 2020 17:55:47 +0000 (UTC)
Authentication-Results: mail.kernel.org;
 dkim=pass (1024-bit key) header.d=fb.com header.i=@fb.com
 header.b="giIKDaqf"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S1726990AbgFZRzq (ORCPT <rfc822;netdev@archiver.kernel.org>);
 Fri, 26 Jun 2020 13:55:46 -0400
Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:46146 "EHLO
 mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK)
 by vger.kernel.org with ESMTP id S1726946AbgFZRzp (ORCPT
 <rfc822;netdev@vger.kernel.org>); Fri, 26 Jun 2020 13:55:45 -0400
Received: from pps.filterd (m0109332.ppops.net [127.0.0.1])
 by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id
 05QHtAQF021101
 for <netdev@vger.kernel.org>; Fri, 26 Jun 2020 10:55:39 -0700
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com;
 h=from : to : cc : subject
 : date : message-id : in-reply-to : references : mime-version :
 content-type : content-transfer-encoding; s=facebook;
 bh=Yyg3mh2BruNtS5AB0w5EW5zC5V17U6ck8GTtkzkT4oI=;
 b=giIKDaqfLJnAzmOU+qdnZpS7eIyAOY6AtKBFEwuZnj+fuRpBRnRQxquT0JPCgp11rUO3
 V6HN7BPm9Ah32sDJ1Vp3vEnN6mhpPpKiMxZhq+V33gfrczswVVKVt8hRrFK2j5IAaTfT
 JWgJrMOWjfZ39lhc062mKbpSftawMa9UQvU= 
Received: from mail.thefacebook.com ([163.114.132.120])
 by mx0a-00082601.pphosted.com with ESMTP id 31ux0qeneq-12
 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128
 verify=NOT)
 for <netdev@vger.kernel.org>; Fri, 26 Jun 2020 10:55:38 -0700
Received: from intmgw004.08.frc2.facebook.com (2620:10d:c085:208::11) by
 mail.thefacebook.com (2620:10d:c085:11d::4) with Microsoft SMTP
 Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id
 15.1.1979.3; Fri, 26 Jun 2020 10:55:30 -0700
Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611)
 id EE11C2942E38; Fri, 26 Jun 2020 10:55:26 -0700 (PDT)
Smtp-Origin-Hostprefix: devbig
From: Martin KaFai Lau <kafai@fb.com>
Smtp-Origin-Hostname: devbig005.ftw2.facebook.com
To: <bpf@vger.kernel.org>
CC: Alexei Starovoitov <ast@kernel.org>,
 Daniel Borkmann <daniel@iogearbox.net>,
 Eric Dumazet <edumazet@google.com>, <kernel-team@fb.com>,
 Lawrence Brakmo <brakmo@fb.com>,
 Neal Cardwell <ncardwell@google.com>, <netdev@vger.kernel.org>,
 Yuchung Cheng <ycheng@google.com>
Smtp-Origin-Cluster: ftw2c04
Subject: [PATCH bpf-next 04/10] bpf: tcp: Allow bpf prog to write and parse
 BPF TCP header option
Date: Fri, 26 Jun 2020 10:55:26 -0700
Message-ID: <20200626175526.1461133-1-kafai@fb.com>
X-Mailer: git-send-email 2.24.1
In-Reply-To: <20200626175501.1459961-1-kafai@fb.com>
References: <20200626175501.1459961-1-kafai@fb.com>
MIME-Version: 1.0
X-FB-Internal: Safe
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.216, 18.0.687
 definitions=2020-06-26_09:2020-06-26,2020-06-26 signatures=0
X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0
 bulkscore=0
 lowpriorityscore=0 mlxscore=0 adultscore=0 malwarescore=0
 impostorscore=0
 priorityscore=1501 suspectscore=1 phishscore=0 clxscore=1015
 cotscore=-2147483648 mlxlogscore=999 spamscore=0 classifier=spam
 adjust=0
 reason=mlx scancount=1 engine=8.12.0-2004280000
 definitions=main-2006260126
X-FB-Internal: deliver
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org

The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
algorithm to be written in BPF.  It opens up opportunities to allow
a faster turnaround time in testing/releasing new congestion control
ideas to production environment.

The same flexibility can be extended to writing TCP header option.
It is not uncommon that people want to test new TCP header option
to improve the TCP performance.  Another use case is for data-center
that has a more controlled environment and has more flexibility in
putting header options for internal use only.

For example, we want to test the idea in putting maximum delay
ACK in TCP header option which is similar to a draft RFC proposal [1].

This patch introduces the necessary BPF API and use them in the
TCP stack to allow BPF_PROG_TYPE_SOCK_OPS program to parse
and write TCP header options.  It currently supports most of
the TCP packet except RST.

Header Option Format
────────────────────
The bpf prog will be allowed to write options under kind (254)
which is defined as the experimental TCP options in RFC 6994.
The exact format will be:

 0                  8                 16                             31
┌──────────────────┬─────────────────┬─────────────────────────────────┐
│   Kind: 254      │     length      │      magic: 0xeB9F              │
├──────────────────┴─────────────────┴─────────────────────────────────┤
│                                                                      │
│               BPF program written data                               │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

By putting it under the standard kind 254 and magic 0xeB9F, it will be
recognizable by the usual tool like tcpdump and tshark.  The kernel
can ensure the header option format is valid before sending out to
the wire and avoid the bpf program from writing options
duplicated/conflicted with what the kernel TCP stack has
already written.

A similar experimental numbering also exists in other protocols (e.g. IPv6).
Thus,  a similar idea (and API) could be extended to other layers/protocols
in the future.

As mentioned above, this patch set does not allow the bpf program to create
its own option "kind".  However, the header-writing's BPF API (mainly through
the helper "bpf_reserve_hdr_opt" and "bpf_store_hdr_opt") could be extended
in the future to allow a "raw" mode (e.g. by introducing a new helper
flag).

Sockops Callback Flags:
──────────────────────
The header parsing and writing callback can be turned on
by setting the existing callback flags "tp->bpf_sock_ops_cb_flags".
BPF_SOCK_OPS_PARSE_HDR_OPT_CB_FLAG and
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG are (newly) added.
The default is off, i.e. the bpf prog will not
be called to parse or write bpf hdr option.

3 Way HandShake
───────────────
* Passive side

When writing SYNACK, the received SYN skb will be available to the
bpf prog.  The bpf prog will also know if it is in syncookie
mode (want_cookie) or not.  The bpf prog can use the SYN skb (which
may carry the bpf hdr opt sent from the remote peer) to decide what
bpf header option should be written to the outgoing SYNACK skb.

The bpf prog can store the bpf header option of the received
SYN by using the existing "TCP_SAVE_SYN" setsockopt.
The example in a latter patch also uses TCP_SAVE_SYN.
[ Note that the fullsock here is a listen sk, bpf_sk_storage
  is not very useful here since the listen sk will be shared
  by many concurrent connection requests.

  Extending bpf_sk_storage support to request_sock will add weight
  to the minisock and it is not necessary better than storing the
  whole ~100 bytes SYN pkt. ]

When the connection is established, the bpf prog will be called
in the existing PASSIVE_ESTABLISHED_CB callback.  At that time,
the bpf prog can get the bpf header option from the saved syn and
then apply the needed operation to the newly established socket.
The latter patch will use the max delay ack specified in the SYN
packet as an example.  The received ack (that concludes the 3WHS)
will also be available to the bpf prog during PASSIVE_ESTABLISHED_CB
through the sock_ops->skb_data, which could be useful in
syncookie scenario.

There is an existing getsockopt "TCP_SAVED_SYN" to return the whole
saved syn which includes the IP[46] header and the TCP header.
A (new) BPF only "TCP_BPF_SYN_HDR_OPT" getsockopt is added to get
the bpf header option alone (without the IP and TCP header) from the
saved syn.  The kernel remembers the offset to the bpf header option (i.e.
kind:254, magic:0xeB9F) as it passes the TCP header.  It is stored in
the tcp_skb_cb and then also saved in the "struct saved_syn".  The new
"TCP_BPF_SYN_HDR_OPT" can directly return the bpf header option to the
bpf prog instead of asking the bpf prog to parse the IP[46] header and
TCP header again in order to get to the bpf header option.

In the new "TCP_BPF_SYN_HDR_OPT" getsockopt, the kernel will know
where it can get the SYN's bpf header option from:
  - the just received syn (available when the bpf prog is writing SYNACK)
  or
  - the saved syn (available in PASSIVE_ESTABLISHED_CB).

The bpf prog does not need to know where this bpf header option
can be obtained from.  The "TCP_BPF_SYN_HDR_OPT" getsockopt will
hide this details.

Fastopen should work the same as the regular non fastopen case.
This is a test in a latter patch.

For syncookie, the latter example patch asks the active
side's bpf prog to resend the header options in ACK.  Please refer
to this latter example for its details and limitation.

* Active side

The bpf prog will get a chance to write the bpf header option
in the SYN packet during WRITE_HDR_OPT_CB.  The received SYNACK
pkt will also be available to the bpf prog during the existing
ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data.

In short, regardless of active or passive ESTABLISHED_CB,
the sock_ops->skb_data is always the received skb that
completed the 3WHS.

If the bpf prog does not need to write/parse header options
beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags
to avoid being called for header options.

Established Connection
──────────────────────
The bpf prog will be called as long as the parse/write
*_HDR_OPT_CB_FLAG is enabled in bpf_sock_ops_cb_flags.
That will allow the bpf prog to parse/write header options
in the data, pure-ack, and fin packet.

Writing BPF Header Option
─────────────────────────
[ bpf prog context: sock_ops ]

When writing header, the bpf prog is first called to reserve
the needed number of bytes in skb during "HDR_OPT_LEN_CB" and
then called to write the header during "WRITE_HDR_OPT_CB".
During these two write CB, the sock_ops->skb_* is always
representing the outgoing skb.

The bpf prog is expected to use the two (new) helpers,
"bpf_reserve_hdr_opt" and "bpf_store_hdr_opt", to
reserve option space and write the option.

In cgroup MULTI mode, the max reserved space among multiple bpf progs
will be used.  e.g. prog#1 reserves 8 bytes and a latter prog#2 reserves
4 bytes.  8 bytes will be reserved.
When multiple bpf progs write the bpf header option, the last
prog's header option will be used.  The "bpf_store_hdr_opt"
helper will take care of the TCP header option's kind-length.

When writing header in "WRITE_HDR_OPT_CB", the sock_ops->skb_data
is pointing to the outgoing skb.  If there is a need, the bpf prog
can inspect what has been written to the header.
sock_ops->skb_bpf_hdr_opt_off also provides an offset to the
beginning of the bpf header option (i.e. the beginning of
kind:245, magic:0xeB9F).
However, during option space reservation in "HDR_OPT_LEN_CB",
the sock_ops->skb_data does not have the tcp header because
the header has not been written yet.

Parsing BPF Header Option
─────────────────────────

As mentioned earlier, the received SYN/SYNACK/ACK during the 3WHS
will be available to some specific CB (e.g. the *_ESTABLISHED_CB)

For established connection, if the kernel finds a bpf header
option (i.e. option with kind:254 and magic:0xeB9F) and the
the "PARSE_HDR_OPT_CB_FLAG" flag is set,  the
bpf prog will be called in the "BPF_SOCK_OPS_PARSE_HDR_OPT_CB" op.
The received skb will be available through sock_ops->skb_data
and the bpf header option offset will also be specified
in sock_ops->skb_bpf_hdr_opt_off.

[1]: draft-wang-tcpm-low-latency-opt-00
     https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/bpf-cgroup.h     |  25 ++++
 include/linux/filter.h         |   6 +
 include/net/tcp.h              |  53 ++++++++-
 include/uapi/linux/bpf.h       | 187 +++++++++++++++++++++++++++++-
 net/core/filter.c              | 202 +++++++++++++++++++++++++++++++++
 net/ipv4/tcp_fastopen.c        |   2 +-
 net/ipv4/tcp_input.c           |  79 ++++++++++++-
 net/ipv4/tcp_ipv4.c            |   3 +-
 net/ipv4/tcp_minisocks.c       |   1 +
 net/ipv4/tcp_output.c          | 186 +++++++++++++++++++++++++++++-
 net/ipv6/tcp_ipv6.c            |   3 +-
 tools/include/uapi/linux/bpf.h | 187 +++++++++++++++++++++++++++++-
 12 files changed, 914 insertions(+), 20 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index c66c545e161a..c069295564a4 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -270,6 +270,31 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
 #define BPF_CGROUP_RUN_PROG_UDP6_RECVMSG_LOCK(sk, uaddr)			\
 	BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP6_RECVMSG, NULL)
 
+/* The SOCK_OPS"_SK" macro should be used when sock_ops->sk is not a
+ * fullsock and its parent fullsock cannot be traced by
+ * sk_to_full_sk().
+ *
+ * e.g. sock_ops->sk is a request_sock and it is under syncookie mode.
+ * Its listener-sk is not attached to the rsk_listener.
+ * In this case, the caller holds the listener-sk (unlocked),
+ * set its sock_ops->sk to req_sk, and call this SOCK_OPS"_SK" with
+ * the listener-sk such that the cgroup-bpf-progs of the
+ * listener-sk will be run.
+ *
+ * Regardless of syncookie mode or not,
+ * calling bpf_setsockopt on listener-sk will not make sense anyway,
+ * so passing req_sk to bpf prog is appropriate here.
+ */
+#define BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(sock_ops, sk)			\
+({									\
+	int __ret = 0;							\
+	if (cgroup_bpf_enabled)						\
+		__ret = __cgroup_bpf_run_filter_sock_ops(sk,		\
+							 sock_ops,	\
+							 BPF_CGROUP_SOCK_OPS); \
+	__ret;								\
+})
+
 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops)				       \
 ({									       \
 	int __ret = 0;							       \
diff --git a/include/linux/filter.h b/include/linux/filter.h
index aae52cbda670..2ba008f4937c 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1241,8 +1241,14 @@ struct bpf_sock_ops_kern {
 		u32 reply;
 		u32 replylong[4];
 	};
+	void	*syn_skb;
+	void	*skb;
+	void	*skb_data_end;
 	u8	op;
 	u8	is_fullsock;
+	u8	write_hdr_opt_len;
+	u8	skb_bpf_hdr_opt_off;
+	u8	reservable_hdr_opt_len;	/* Only used in HDR_OPT_LEN */
 	u64	temp;			/* temp and everything after is not
 					 * initialized to 0 before calling
 					 * the BPF program. New fields that
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 07a9dfe35242..e93ef2d324f3 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -395,7 +395,7 @@ void tcp_metrics_init(void);
 bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst);
 void tcp_close(struct sock *sk, long timeout);
 void tcp_init_sock(struct sock *sk);
-void tcp_init_transfer(struct sock *sk, int bpf_op);
+void tcp_init_transfer(struct sock *sk, bool passive, struct sk_buff *skb);
 __poll_t tcp_poll(struct file *file, struct socket *sock,
 		      struct poll_table_struct *wait);
 int tcp_getsockopt(struct sock *sk, int level, int optname,
@@ -459,6 +459,7 @@ enum tcp_synack_type {
 };
 struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 				struct request_sock *req,
+				struct sk_buff *syn_skb,
 				struct tcp_fastopen_cookie *foc,
 				enum tcp_synack_type synack_type);
 int tcp_disconnect(struct sock *sk, int flags);
@@ -2023,6 +2024,7 @@ struct tcp_request_sock_ops {
 	u32 (*init_ts_off)(const struct net *net, const struct sk_buff *skb);
 	int (*send_synack)(const struct sock *sk, struct dst_entry *dst,
 			   struct flowi *fl, struct request_sock *req,
+			   struct sk_buff *syn_skb,
 			   struct tcp_fastopen_cookie *foc,
 			   enum tcp_synack_type synack_type);
 };
@@ -2222,6 +2224,55 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
 		      struct msghdr *msg, int len, int flags);
 #endif /* CONFIG_NET_SOCK_MSG */
 
+#ifdef CONFIG_CGROUP_BPF
+/* Copy the listen sk's HDR_OPT_CB flags to its child.
+ *
+ * During 3-Way-HandShake, the synack is usually sent from
+ * the listen sk with the HDR_OPT_CB flags set so that
+ * bpf-prog will be called to write the BPF hdr option.
+ *
+ * In fastopen, the child sk is used to send synack instead
+ * of the listen sk.  Thus, inheriting the HDR_OPT_CB flags
+ * from the listen sk gives the bpf-prog a chance to write
+ * BPF hdr option in the synack pkt during fastopen.
+ *
+ * Both fastopen and non-fastopen child will inherit the
+ * HDR_OPT_CB flags to keep the bpf-prog having a consistent
+ * behavior when deciding to clear this cb flags (or not)
+ * during the PASSIVE_ESTABLISHED_CB.
+ *
+ * In the future, other cb flags could be inherited here also.
+ */
+static inline void bpf_skops_init_child(const struct sock *sk,
+					struct sock *child)
+{
+	tcp_sk(child)->bpf_sock_ops_cb_flags =
+		tcp_sk(sk)->bpf_sock_ops_cb_flags &
+		(BPF_SOCK_OPS_PARSE_HDR_OPT_CB_FLAG |
+		 BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
+}
+
+static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
+				      struct sk_buff *skb,
+				      u8 bpf_hdr_opt_off)
+{
+	skops->skb = skb;
+	skops->skb_data_end = skb->data + skb_headlen(skb);
+	skops->skb_bpf_hdr_opt_off = bpf_hdr_opt_off;
+}
+#else
+static inline void bpf_skops_init_child(const struct sock *sk,
+					struct sock *child)
+{
+}
+
+static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
+				      struct sk_buff *skb,
+				      u8 bpf_hdr_opt_off)
+{
+}
+#endif
+
 /* Call BPF_SOCK_OPS program that returns an int. If the return value
  * is < 0, then the BPF op failed (for example if the loaded BPF
  * program does not support the chosen operation or there is no BPF
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0cb8ec948816..479b83d05811 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3285,6 +3285,110 @@ union bpf_attr {
  *		Dynamically cast a *sk* pointer to a *udp6_sock* pointer.
  *	Return
  *		*sk* if casting is valid, or NULL otherwise.
+ *
+ * long bpf_store_hdr_opt(struct bpf_sock_ops *skops, u32 offset, const void *from, u32 len, u64 flags)
+ *	Description
+ *		Store BPF header option.  The data will be copied
+ *		from buffer *from* with length *len* to the bpf header
+ *		option at *offset*.
+ *
+ *		This helper does not allow updating the header "kind",
+ *		"kind length", and the 16-bit magic.  *offset* 0
+ *		points immediately after the 16-bit magic.
+ *		This helper will automatically update the "kind length" as
+ *		the bpf header option is written by this helper.  It can
+ *		guarantee that the header option format always
+ *		conforms to the standard.
+ *
+ *		The bpf program can write everything at once
+ *		from offset 0.
+ *
+ *		The bpf program can also do incremental write by calling
+ *		this helper multiple times with incremental (but
+ *		continuous) offset.
+ *
+ *		The bpf program (e.g. the parent cgroup bpf program)
+ *		can also rewrite a few bytes that has already been
+ *		written by some ealier bpf programs.  The bpf
+ *		program can also learn what header option has been
+ *		written by the earlier bpf programs through
+ *		sock_ops->skb_data.
+ *
+ *		A bpf program can always write at offset 0 again
+ *		to overwrite all bpf header options that has been written
+ *		by the previous bpf programs.  If the bpf program
+ *		writes less than the previous bpf programs do,
+ *		the tailing bytes will be removed unless
+ *		*BP_F_STORE_HDR_OPT_NOTRIM* is used.
+ *		e.g. bpf_store_hdr_opt(skops, 0, NULL, 0, 0) will
+ *		remove all previously written option.
+ *
+ *		This helper will reject the *offset* if there is any
+ *		byte/hole that has never been written before this
+ *		*offset*.
+ *
+ *		For example, it can write at offset 0 for 2 bytes,
+ *		then write at offset 2 for 1 bytes, and then finally
+ *		write at offset 3 for 2 bytes.  Thus, the bpf prog
+ *		has totally written 5 bytes.
+ *		The "kind length" will be 9 bytes:
+ *		4 bytes (1byte kind + 1byte kind_len + 2byte magic) +
+ *		5 bytes written by the bpf program.
+ *
+ *		However, this helper does not allow writing at offset 0
+ *		for 2 bytes and then immediately write at offset 3 without
+ *		writing at offset 2 first.
+ *
+ *		This helper can only be called during
+ *		BPF_SOCK_OPS_WRITE_HDR_OPT_CB.  The previous reserved
+ *		space is passed in sock_ops->args[0].
+ *
+ *		The following flags are supported:
+ *		* **BPF_F_STORE_HDR_OPT_NOTRIM**: Usually, bpf program
+ *		  does write with incremental offset instead of
+ *		  jumping the offset back and forth.
+ *		  Thus, by default, after calling this helper,
+ *		  anything previously written after offset + len
+ *		  will be removed.
+ *		  This flag can be used to keep the tailing
+ *		  data after "offset + len".
+ *
+ *	Return
+ *		0 on success, or negative error in case of failure:
+ *
+ *		**-EINVAL** if param is invalid
+ *
+ *		**-EFAULT** *offset* + *len* is outside of the
+ *		            reserved option space or
+ *			    there is unwritten byte before *offset*.
+ *
+ *		**-EPERM** This helper cannot be used under the
+ *			   current sock_ops->op.
+ *
+ * long bpf_reserve_hdr_opt(struct bpf_sock_ops *skops, u32 len, u64 flags)
+ *	Description
+ *		Reserve *len* bytes for the bpf header option.  This
+ *		space will be available to use by bpf_store_hdr_opt()
+ *		later.
+ *
+ *		If bpf_reserve_hdr_opt() is called multiple times (e.g
+ *		by mulitple cgroup bpf programs),
+ *		the maximum *len* among them will be reserved.
+ *
+ *		This helper can only be called during
+ *		BPF_SOCK_OPS_HDR_OPT_LEN_CB.  The maximum
+ *		reservable space is passed in
+ *		sock_ops->args[0].
+ *
+ *	Return
+ *		0 on success, or negative error in case of failure:
+ *
+ *		**-EINVAL** if param is invalid
+ *
+ *		**-ENOSPC** Not enough remaining space in the header.
+ *
+ *		**-EPERM** This helper cannot be used under the
+ *			   current sock_ops->op.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3427,7 +3531,9 @@ union bpf_attr {
 	FN(skc_to_tcp_sock),		\
 	FN(skc_to_tcp_timewait_sock),	\
 	FN(skc_to_tcp_request_sock),	\
-	FN(skc_to_udp6_sock),
+	FN(skc_to_udp6_sock),		\
+	FN(store_hdr_opt),		\
+	FN(reserve_hdr_opt),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -4024,6 +4130,11 @@ struct bpf_sock_ops {
 	__u64 bytes_received;
 	__u64 bytes_acked;
 	__bpf_md_ptr(struct bpf_sock *, sk);
+	__bpf_md_ptr(void *, skb_data);
+	__bpf_md_ptr(void *, skb_data_end);
+	__u32 skb_len;
+	__u32 skb_bpf_hdr_opt_off;
+	__u32 skb_tcp_flags;
 };
 
 /* Definitions for bpf_sock_ops_cb_flags */
@@ -4032,8 +4143,14 @@ enum {
 	BPF_SOCK_OPS_RETRANS_CB_FLAG	= (1<<1),
 	BPF_SOCK_OPS_STATE_CB_FLAG	= (1<<2),
 	BPF_SOCK_OPS_RTT_CB_FLAG	= (1<<3),
+	BPF_SOCK_OPS_PARSE_HDR_OPT_CB_FLAG = (1<<4),
+	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<5),
 /* Mask of all currently supported cb flags */
-	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xF,
+	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x3F,
+};
+
+enum {
+	BPF_F_STORE_HDR_OPT_NOTRIM	= (1ULL << 0),
 };
 
 /* List of known BPF sock_ops operators.
@@ -4089,6 +4206,58 @@ enum {
 					 */
 	BPF_SOCK_OPS_RTT_CB,		/* Called on every RTT.
 					 */
+	BPF_SOCK_OPS_PARSE_HDR_OPT_CB,	/* Parse the BPF header option.
+					 * It will be called to handle
+					 * the packets received at
+					 * socket that has been established.
+					 */
+	BPF_SOCK_OPS_HDR_OPT_LEN_CB,	/* Reserve space for writing the BPF
+					 * header option later in
+					 * BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
+					 * Arg1: Max reservable bytes
+					 * Arg2: bool want_cookie. (in
+					 *       writing SYNACK only)
+					 *
+					 * sock_ops->skb_data:
+					 *	Referring to the
+					 *	outgoing skb.
+					 *	It is pointing to the
+					 *	TCP payload because
+					 *	no TCP header has
+					 *	been written yet.
+					 *
+					 * sock_ops->skb_tcp_flags:
+					 *	the tcp_flags of the
+					 *	outgoing skb. (e.g.
+					 *      SYN, ACK, FIN).
+					 *
+					 * bpf_reserve_hdr_opt() should
+					 * be used to reserve space.
+					 */
+	BPF_SOCK_OPS_WRITE_HDR_OPT_CB,	/* Write the BPF header OPTIONS
+					 * Arg1: num of bytes reserved during
+					 *       BPF_SOCK_OPS_HDR_OPT_LEN_CB
+					 * Arg2: bool want_cookie. (in
+					 *       writing SYNACK only)
+					 *
+					 * sock_ops->skb_data:
+					 *	referring to the
+					 *	outgoing skb.
+					 *	It is pointing to the
+					 *	TCP header.
+					 *
+					 * sock_ops->skb_bpf_hdr_opt_off:
+					 *	offset to the bpf
+					 *	header option.  bpf
+					 *	prog can inspect what bpf
+					 *	option has already been
+					 *	written.
+					 *
+					 * sock_ops->skb_tcp_flags:
+					 *	the tcp_flags of the
+					 *	outgoing skb. (e.g.
+					 *	SYN, ACK, FIN).
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
@@ -4116,6 +4285,20 @@ enum {
 enum {
 	TCP_BPF_IW		= 1001,	/* Set TCP initial congestion window */
 	TCP_BPF_SNDCWND_CLAMP	= 1002,	/* Set sndcwnd_clamp */
+	/* Copy the BPF header option from syn pkt to optval
+	 *
+	 * BPF_PROG_TYPE_SOCK_OPS only.  It is similar to the
+	 * TCP_SAVED_SYN but it only gets the bpf header option
+	 * from the syn packet.
+	 *
+	 *       0: Success
+	 * -ENOSPC: Not enough space in optval. Only optlen number of
+	 *          bytes is copied.
+	 * -ENOMSG: BPF TCP header option not found
+	 * -ENOENT: The SYN skb is not available now and the earlier SYN pkt
+	 *	    is not saved by TCP_SAVE_SYN.
+	 */
+	TCP_BPF_SYN_HDR_OPT	= 1003,
 };
 
 struct bpf_perf_event_value {
diff --git a/net/core/filter.c b/net/core/filter.c
index 1dd07972c5c7..5784f1bede2f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4629,6 +4629,64 @@ static const struct bpf_func_proto bpf_sock_ops_setsockopt_proto = {
 BPF_CALL_5(bpf_sock_ops_getsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 	   int, level, int, optname, char *, optval, int, optlen)
 {
+	if (IS_ENABLED(CONFIG_INET) &&
+	    level == SOL_TCP && optname == TCP_BPF_SYN_HDR_OPT) {
+		struct sk_buff *syn_skb = bpf_sock->syn_skb;
+		u8 *kind_start, copy_len = 0;
+		int ret = 0;
+
+		if (bpf_sock->syn_skb) {
+			/* sk is a request_sock here */
+
+			if (!TCP_SKB_CB(syn_skb)->bpf_hdr_opt_off) {
+				ret = -ENOMSG;
+				goto init_unused;
+			}
+
+			kind_start = syn_skb->data +
+				TCP_SKB_CB(syn_skb)->bpf_hdr_opt_off;
+		} else {
+			struct sock *sk = bpf_sock->sk;
+			struct saved_syn *saved_syn;
+
+			if (sk->sk_state == TCP_NEW_SYN_RECV)
+				/* synack retransmit. bpf_sock->syn_skb will
+				 * not be available.
+				 */
+				saved_syn = inet_reqsk(sk)->saved_syn;
+			else
+				saved_syn = tcp_sk(sk)->saved_syn;
+
+			if (!saved_syn) {
+				ret = -ENOENT;
+				goto init_unused;
+			}
+
+			if (!saved_syn->bpf_hdr_opt_off) {
+				ret = -ENOMSG;
+				goto init_unused;
+			}
+
+			kind_start = saved_syn->data +
+				saved_syn->network_hdrlen +
+				saved_syn->bpf_hdr_opt_off;
+		}
+
+		copy_len = kind_start[1];
+		if (optlen < copy_len) {
+			copy_len = optlen;
+			ret = -ENOSPC;
+		}
+
+		memcpy(optval, kind_start, copy_len);
+
+init_unused:
+		if (optlen > copy_len)
+			memset(optval + copy_len, 0, optlen - copy_len);
+
+		return ret;
+	}
+
 	return _bpf_getsockopt(bpf_sock->sk, level, optname, optval, optlen);
 }
 
@@ -4665,6 +4723,80 @@ static const struct bpf_func_proto bpf_sock_ops_cb_flags_set_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_5(bpf_sock_ops_store_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock,
+	   u32, offset, const void *, from, u32, len, u64, flags)
+{
+	u8 *kind_start, nr_written, max_len;
+	struct sk_buff *skb;
+
+	if (bpf_sock->op != BPF_SOCK_OPS_WRITE_HDR_OPT_CB)
+		return -EPERM;
+
+	if (flags > BPF_F_STORE_HDR_OPT_NOTRIM)
+		return -EINVAL;
+
+	max_len = bpf_sock->write_hdr_opt_len;
+	if (len > max_len || offset > max_len - len)
+		return -EFAULT;
+
+	skb = bpf_sock->skb;
+	kind_start = skb->data + bpf_sock->skb_bpf_hdr_opt_off;
+	nr_written = kind_start[1] - TCPOLEN_EXP_BPF_BASE;
+	if (offset > nr_written)
+		return -EFAULT;
+
+	if (len)
+		memcpy(kind_start + TCPOLEN_EXP_BPF_BASE + offset, from, len);
+
+	if (kind_start[1] < TCPOLEN_EXP_BPF_BASE + offset + len ||
+	    !(flags & BPF_F_STORE_HDR_OPT_NOTRIM))
+		/* Update kind_len */
+		kind_start[1] = TCPOLEN_EXP_BPF_BASE + offset + len;
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_sock_ops_store_hdr_opt_proto = {
+	.func		= bpf_sock_ops_store_hdr_opt,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	/* bpf_store_hdr_opt(skops, 0, NULL, 0, 0) will remove
+	 * all previously written options.
+	 */
+	.arg3_type	= ARG_PTR_TO_MEM_OR_NULL,
+	.arg4_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg5_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_3(bpf_sock_ops_reserve_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock,
+	   u32, len, u64, flags)
+{
+	if (bpf_sock->op != BPF_SOCK_OPS_HDR_OPT_LEN_CB)
+		return -EPERM;
+
+	if (flags)
+		return -EINVAL;
+
+	if (len > bpf_sock->reservable_hdr_opt_len)
+		return -ENOSPC;
+
+	if (len > bpf_sock->write_hdr_opt_len)
+		bpf_sock->write_hdr_opt_len = len;
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_sock_ops_reserve_hdr_opt_proto = {
+	.func		= bpf_sock_ops_reserve_hdr_opt,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 const struct ipv6_bpf_stub *ipv6_bpf_stub __read_mostly;
 EXPORT_SYMBOL_GPL(ipv6_bpf_stub);
 
@@ -6515,6 +6647,10 @@ sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_sk_storage_delete:
 		return &bpf_sk_storage_delete_proto;
 #ifdef CONFIG_INET
+	case BPF_FUNC_store_hdr_opt:
+		return &bpf_sock_ops_store_hdr_opt_proto;
+	case BPF_FUNC_reserve_hdr_opt:
+		return &bpf_sock_ops_reserve_hdr_opt_proto;
 	case BPF_FUNC_tcp_sock:
 		return &bpf_tcp_sock_proto;
 #endif /* CONFIG_INET */
@@ -7313,6 +7449,21 @@ static bool sock_ops_is_valid_access(int off, int size,
 				return false;
 			info->reg_type = PTR_TO_SOCKET_OR_NULL;
 			break;
+		case offsetof(struct bpf_sock_ops, skb_data):
+			if (size != sizeof(__u64))
+				return false;
+			info->reg_type = PTR_TO_PACKET;
+			break;
+		case offsetof(struct bpf_sock_ops, skb_data_end):
+			if (size != sizeof(__u64))
+				return false;
+			info->reg_type = PTR_TO_PACKET_END;
+			break;
+		case offsetof(struct bpf_sock_ops, skb_tcp_flags):
+		case offsetof(struct bpf_sock_ops, skb_bpf_hdr_opt_off):
+			bpf_ctx_record_field_size(info, size_default);
+			return bpf_ctx_narrow_access_ok(off, size,
+							size_default);
 		default:
 			if (size != size_default)
 				return false;
@@ -8601,6 +8752,57 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 				      si->dst_reg, si->src_reg,
 				      offsetof(struct bpf_sock_ops_kern, sk));
 		break;
+	case offsetof(struct bpf_sock_ops, skb_bpf_hdr_opt_off):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
+						       skb_bpf_hdr_opt_off),
+				      si->dst_reg, si->src_reg,
+				      bpf_target_off(struct bpf_sock_ops_kern,
+						     skb_bpf_hdr_opt_off, 1,
+						     target_size));
+		break;
+	case offsetof(struct bpf_sock_ops, skb_data_end):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
+						       skb_data_end),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern,
+					       skb_data_end));
+		break;
+	case offsetof(struct bpf_sock_ops, skb_data):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
+						       skb),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern,
+					       skb));
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1);
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, data),
+				      si->dst_reg, si->dst_reg,
+				      offsetof(struct sk_buff, data));
+		break;
+	case offsetof(struct bpf_sock_ops, skb_len):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
+						       skb),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern,
+					       skb));
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1);
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, len),
+				      si->dst_reg, si->dst_reg,
+				      offsetof(struct sk_buff, len));
+		break;
+	case offsetof(struct bpf_sock_ops, skb_tcp_flags):
+		off = offsetof(struct sk_buff, cb);
+		off += offsetof(struct tcp_skb_cb, tcp_flags);
+		*target_size = sizeof_field(struct tcp_skb_cb, tcp_flags);
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
+						       skb),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern,
+					       skb));
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1);
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct tcp_skb_cb,
+						       tcp_flags),
+				      si->dst_reg, si->dst_reg, off);
+		break;
 	}
 	return insn - insn_buf;
 }
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 19ad9586c720..dd70b04da52a 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -272,7 +272,7 @@ static struct sock *tcp_fastopen_create_child(struct sock *sk,
 	refcount_set(&req->rsk_refcnt, 2);
 
 	/* Now finish processing the fastopen child socket. */
-	tcp_init_transfer(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
+	tcp_init_transfer(child, true, skb);
 
 	tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 640408a80b3d..16dbf5dd46c1 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -138,6 +138,69 @@ void clean_acked_data_flush(void)
 EXPORT_SYMBOL_GPL(clean_acked_data_flush);
 #endif
 
+#ifdef CONFIG_CGROUP_BPF
+static void bpf_skops_parse_bpf_hdr(struct sock *sk, struct sk_buff *skb)
+{
+	struct bpf_sock_ops_kern sock_ops;
+
+	if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
+					   BPF_SOCK_OPS_PARSE_HDR_OPT_CB_FLAG)))
+		return;
+
+	/* The skb will be handled in the
+	 * BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB or
+	 * BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB.
+	 */
+	switch (sk->sk_state) {
+	case TCP_SYN_RECV:
+	case TCP_LISTEN:
+	case TCP_SYN_SENT:
+		return;
+	}
+
+	sock_owned_by_me(sk);
+
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
+	sock_ops.op = BPF_SOCK_OPS_PARSE_HDR_OPT_CB;
+	sock_ops.is_fullsock = 1;
+	sock_ops.sk = sk;
+	bpf_skops_init_skb(&sock_ops, skb, TCP_SKB_CB(skb)->bpf_hdr_opt_off);
+
+	BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
+}
+
+static void bpf_skops_established(struct sock *sk, bool passive,
+				  struct sk_buff *skb)
+{
+	struct bpf_sock_ops_kern sock_ops;
+
+	sock_owned_by_me(sk);
+
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
+	if (passive)
+		sock_ops.op = BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB;
+	else
+		sock_ops.op = BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB;
+	sock_ops.is_fullsock = 1;
+	sock_ops.sk = sk;
+	/* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */
+	if (skb)
+		bpf_skops_init_skb(&sock_ops, skb,
+				   TCP_SKB_CB(skb)->bpf_hdr_opt_off);
+
+	BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
+}
+#else
+static void bpf_skops_parse_bpf_hdr(struct sock *sk, struct sk_buff *skb)
+{
+}
+
+static void bpf_skops_established(struct sock *sk, bool passive,
+				  struct sk_buff *skb)
+{
+}
+#endif
+
 static void tcp_gro_dev_warn(struct sock *sk, const struct sk_buff *skb,
 			     unsigned int len)
 {
@@ -5551,6 +5614,9 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 		goto discard;
 	}
 
+	if (TCP_SKB_CB(skb)->bpf_hdr_opt_off)
+		bpf_skops_parse_bpf_hdr(sk, skb);
+
 	return true;
 
 discard:
@@ -5759,7 +5825,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 }
 EXPORT_SYMBOL(tcp_rcv_established);
 
-void tcp_init_transfer(struct sock *sk, int bpf_op)
+void tcp_init_transfer(struct sock *sk, bool passive, struct sk_buff *skb)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -5780,7 +5846,7 @@ void tcp_init_transfer(struct sock *sk, int bpf_op)
 		tp->snd_cwnd = tcp_init_cwnd(tp, __sk_dst_get(sk));
 	tp->snd_cwnd_stamp = tcp_jiffies32;
 
-	tcp_call_bpf(sk, bpf_op, 0, NULL);
+	bpf_skops_established(sk, passive, skb);
 	tcp_init_congestion_control(sk);
 	tcp_init_buffer_space(sk);
 }
@@ -5799,7 +5865,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
 		sk_mark_napi_id(sk, skb);
 	}
 
-	tcp_init_transfer(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
+	tcp_init_transfer(sk, false, skb);
 
 	/* Prevent spurious tcp_cwnd_restart() on first data
 	 * packet.
@@ -6271,7 +6337,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 		} else {
 			tcp_try_undo_spurious_syn(sk);
 			tp->retrans_stamp = 0;
-			tcp_init_transfer(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
+			tcp_init_transfer(sk, true, skb);
 			WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
 		}
 		smp_mb();
@@ -6718,7 +6784,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 		fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc, dst);
 	}
 	if (fastopen_sk) {
-		af_ops->send_synack(fastopen_sk, dst, &fl, req,
+		af_ops->send_synack(fastopen_sk, dst, &fl, req, skb,
 				    &foc, TCP_SYNACK_FASTOPEN);
 		/* Add the child socket directly into the accept queue */
 		if (!inet_csk_reqsk_queue_add(sk, req, fastopen_sk)) {
@@ -6735,7 +6801,8 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 		if (!want_cookie)
 			inet_csk_reqsk_queue_hash_add(sk, req,
 				tcp_timeout_init((struct sock *)req));
-		af_ops->send_synack(sk, dst, &fl, req, &foc,
+
+		af_ops->send_synack(sk, dst, &fl, req, skb, &foc,
 				    !want_cookie ? TCP_SYNACK_NORMAL :
 						   TCP_SYNACK_COOKIE);
 		if (want_cookie) {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a3535b7fe002..8a0b62fb4f79 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -963,6 +963,7 @@ static void tcp_v4_reqsk_send_ack(const struct sock *sk, struct sk_buff *skb,
 static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst,
 			      struct flowi *fl,
 			      struct request_sock *req,
+			      struct sk_buff *syn_skb,
 			      struct tcp_fastopen_cookie *foc,
 			      enum tcp_synack_type synack_type)
 {
@@ -975,7 +976,7 @@ static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst,
 	if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL)
 		return -1;
 
-	skb = tcp_make_synack(sk, dst, req, foc, synack_type);
+	skb = tcp_make_synack(sk, dst, req, syn_skb, foc, synack_type);
 
 	if (skb) {
 		__tcp_v4_send_check(skb, ireq->ir_loc_addr, ireq->ir_rmt_addr);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 495dda2449fe..56c306e3cd2f 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -548,6 +548,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 	newtp->fastopen_req = NULL;
 	RCU_INIT_POINTER(newtp->fastopen_rsk, NULL);
 
+	bpf_skops_init_child(sk, newsk);
 	tcp_bpf_clone(sk, newsk);
 
 	__TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a50e1990a845..a78a29980e1f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -438,6 +438,7 @@ struct tcp_out_options {
 	u8 ws;			/* window scale, 0 to disable */
 	u8 num_sack_blocks;	/* number of SACK blocks to include */
 	u8 hash_size;		/* bytes in hash_location */
+	u8 bpf_hdr_opt_len;	/* length of BPF hdr option */
 	__u8 *hash_location;	/* temporary pointer, overloaded */
 	__u32 tsval, tsecr;	/* need to include OPTION_TS */
 	struct tcp_fastopen_cookie *fastopen_cookie;	/* Fast open cookie */
@@ -452,6 +453,152 @@ static void mptcp_options_write(__be32 *ptr, struct tcp_out_options *opts)
 #endif
 }
 
+#ifdef CONFIG_CGROUP_BPF
+static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
+				  struct request_sock *req,
+				  struct sk_buff *syn_skb,
+				  bool want_cookie,
+				  struct tcp_out_options *opts,
+				  unsigned int *res_remaining)
+{
+	/* res_remaining has already been aligned to 4 bytes */
+	unsigned int remaining = *res_remaining;
+	struct bpf_sock_ops_kern sock_ops;
+	int err;
+
+	if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
+					BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)) ||
+	    remaining <= TCPOLEN_EXP_BPF_BASE)
+		return;
+
+	BUILD_BUG_ON(TCPOLEN_EXP_BPF_BASE != 4);
+	/* remaining is still rounded to 4 bytes */
+	remaining -= TCPOLEN_EXP_BPF_BASE;
+
+	/* init sock_ops */
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
+
+	sock_ops.op = BPF_SOCK_OPS_HDR_OPT_LEN_CB;
+
+	if (req) {
+		/* The listen "sk" cannot be passed here because
+		 * it is not locked.  It would not make too much
+		 * sense to do bpf_setsockopt(listen_sk) based
+		 * on individual connection request also.
+		 *
+		 * Thus, "req" is passed here and the cgroup-bpf-progs
+		 * of the listen "sk" will be run.
+		 *
+		 * "req" is also used here for fastopen even the "sk" here is
+		 * the "child" sk.  It is to keep the behavior
+		 * consistent between fastopen and non-fastopen on
+		 * the bpf programming side.
+		 */
+		sock_ops.sk = (struct sock *)req;
+		sock_ops.syn_skb = syn_skb;
+	} else {
+		sock_owned_by_me(sk);
+
+		sock_ops.is_fullsock = 1;
+		sock_ops.sk = sk;
+	}
+
+	/* tcp_current_mss() does not pass a skb */
+	if (skb)
+		bpf_skops_init_skb(&sock_ops, skb, 0);
+
+	sock_ops.args[0] = remaining;
+	sock_ops.args[1] = want_cookie;
+	sock_ops.reservable_hdr_opt_len = remaining;
+
+	err = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
+
+	if (err || !sock_ops.write_hdr_opt_len)
+		return;
+
+	opts->bpf_hdr_opt_len = sock_ops.write_hdr_opt_len +
+		TCPOLEN_EXP_BPF_BASE;
+	/* round up to 4 bytes */
+	opts->bpf_hdr_opt_len = (opts->bpf_hdr_opt_len + 3) & ~3;
+
+	*res_remaining -= opts->bpf_hdr_opt_len;
+}
+
+static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb,
+				    struct request_sock *req,
+				    struct sk_buff *syn_skb,
+				    bool want_cookie,
+				    struct tcp_out_options *opts)
+{
+	u8 *kind_start, kind_len, max_kind_len = opts->bpf_hdr_opt_len;
+	struct tcphdr *th = (struct tcphdr *)skb->data;
+	struct bpf_sock_ops_kern sock_ops;
+	int err;
+
+	if (likely(!max_kind_len))
+		return;
+
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
+
+	sock_ops.op = BPF_SOCK_OPS_WRITE_HDR_OPT_CB;
+
+	if (req) {
+		sock_ops.sk = (struct sock *)req;
+		sock_ops.syn_skb = syn_skb;
+	} else {
+		sock_owned_by_me(sk);
+
+		sock_ops.is_fullsock = 1;
+		sock_ops.sk = sk;
+	}
+
+	bpf_skops_init_skb(&sock_ops, skb, __tcp_hdrlen(th) - max_kind_len);
+
+	sock_ops.args[0] = max_kind_len - TCPOLEN_EXP_BPF_BASE;
+	sock_ops.args[1] = want_cookie;
+	sock_ops.write_hdr_opt_len = sock_ops.args[0];
+
+	kind_start = skb->data + sock_ops.skb_bpf_hdr_opt_off;
+	*(__be32 *)kind_start = htonl((TCPOPT_EXP << 24) |
+				      TCPOLEN_EXP_BPF_BASE |
+				      TCPOPT_BPF_MAGIC);
+
+	err = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
+
+	kind_len = kind_start[1];
+	if (err || kind_len == TCPOLEN_EXP_BPF_BASE) {
+		/* zero out the unwritten memory.
+		 *
+		 * bpf is the last one writing options, so it
+		 * can end with TCPOPT_EOL under abnormal cases.
+		 */
+		memset(kind_start, 0, max_kind_len);
+		kind_start[0] = TCPOPT_EOL;
+	} else if (kind_len < max_kind_len) {
+		/* Pad by NOP for normal case */
+		memset(kind_start + kind_len, TCPOPT_NOP,
+		       max_kind_len - kind_len);
+	}
+}
+#else
+static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
+				  struct request_sock *req,
+				  struct sk_buff *syn_skb,
+				  bool want_cookie,
+				  struct tcp_out_options *opts,
+				  unsigned int *res_remaining)
+{
+}
+
+static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb,
+				    struct request_sock *req,
+				    struct sk_buff *syn_skb,
+				    bool want_cookie,
+				    struct tcp_out_options *opts)
+{
+}
+#endif
+
 /* Write previously computed TCP options to the packet.
  *
  * Beware: Something in the Internet is very sensitive to the ordering of
@@ -691,12 +838,16 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 		}
 	}
 
+	bpf_skops_hdr_opt_len(sk, skb, NULL, NULL, false, opts, &remaining);
+
 	return MAX_TCP_OPTION_SPACE - remaining;
 }
 
 /* Set up TCP options for SYN-ACKs. */
 static unsigned int tcp_synack_options(const struct sock *sk,
 				       struct request_sock *req,
+				       struct sk_buff *syn_skb,
+				       bool want_cookie,
 				       unsigned int mss, struct sk_buff *skb,
 				       struct tcp_out_options *opts,
 				       const struct tcp_md5sig_key *md5,
@@ -756,15 +907,19 @@ static unsigned int tcp_synack_options(const struct sock *sk,
 
 	smc_set_option_cond(tcp_sk(sk), ireq, opts, &remaining);
 
+	bpf_skops_hdr_opt_len((struct sock *)sk, skb, req, syn_skb,
+			      want_cookie, opts, &remaining);
+
 	return MAX_TCP_OPTION_SPACE - remaining;
 }
 
 /* Compute TCP options for ESTABLISHED sockets. This is not the
  * final wire format yet.
  */
-static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb,
-					struct tcp_out_options *opts,
-					struct tcp_md5sig_key **md5)
+static unsigned int tcp_established_options(struct sock *sk,
+					    struct sk_buff *skb,
+					    struct tcp_out_options *opts,
+					    struct tcp_md5sig_key **md5)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	unsigned int size = 0;
@@ -824,6 +979,16 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
 			opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
 	}
 
+	if (unlikely(BPF_SOCK_OPS_TEST_FLAG(tp,
+					    BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG))) {
+		unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
+
+		bpf_skops_hdr_opt_len(sk, skb, NULL, NULL, false, opts,
+				      &remaining);
+
+		size = MAX_TCP_OPTION_SPACE - remaining;
+	}
+
 	return size;
 }
 
@@ -1206,6 +1371,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 					       md5, sk, skb);
 	}
 #endif
+	bpf_skops_write_hdr_opt(sk, skb, NULL, NULL, false, &opts);
 
 	icsk->icsk_af_ops->send_check(sk, skb);
 
@@ -3333,9 +3499,11 @@ int tcp_send_synack(struct sock *sk)
  */
 struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 				struct request_sock *req,
+				struct sk_buff *syn_skb,
 				struct tcp_fastopen_cookie *foc,
 				enum tcp_synack_type synack_type)
 {
+	bool want_cookie = (synack_type == TCP_SYNACK_COOKIE);
 	struct inet_request_sock *ireq = inet_rsk(req);
 	const struct tcp_sock *tp = tcp_sk(sk);
 	struct tcp_md5sig_key *md5 = NULL;
@@ -3393,8 +3561,11 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	md5 = tcp_rsk(req)->af_specific->req_md5_lookup(sk, req_to_sk(req));
 #endif
 	skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);
-	tcp_header_size = tcp_synack_options(sk, req, mss, skb, &opts, md5,
-					     foc) + sizeof(*th);
+	/* bpf program will be interested in the tcp_flags */
+	TCP_SKB_CB(skb)->tcp_flags = TCPHDR_SYN | TCPHDR_ACK;
+	tcp_header_size = tcp_synack_options(sk, req, syn_skb, want_cookie,
+					     mss, skb, &opts,
+					     md5, foc) + sizeof(*th);
 
 	skb_push(skb, tcp_header_size);
 	skb_reset_transport_header(skb);
@@ -3416,6 +3587,8 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	th->window = htons(min(req->rsk_rcv_wnd, 65535U));
 	tcp_options_write((__be32 *)(th + 1), NULL, &opts);
 	th->doff = (tcp_header_size >> 2);
+	bpf_skops_write_hdr_opt((struct sock *)sk, skb, req, syn_skb,
+				want_cookie, &opts);
 	__TCP_INC_STATS(sock_net(sk), TCP_MIB_OUTSEGS);
 
 #ifdef CONFIG_TCP_MD5SIG
@@ -3919,7 +4092,8 @@ int tcp_rtx_synack(const struct sock *sk, struct request_sock *req)
 	int res;
 
 	tcp_rsk(req)->txhash = net_tx_rndhash();
-	res = af_ops->send_synack(sk, NULL, &fl, req, NULL, TCP_SYNACK_NORMAL);
+	res = af_ops->send_synack(sk, NULL, &fl, req, NULL, NULL,
+				  TCP_SYNACK_NORMAL);
 	if (!res) {
 		__TCP_INC_STATS(sock_net(sk), TCP_MIB_RETRANSSEGS);
 		__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 8356d0562279..3036ae6532dd 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -500,6 +500,7 @@ static int tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst,
 			      struct flowi *fl,
 			      struct request_sock *req,
+			      struct sk_buff *syn_skb,
 			      struct tcp_fastopen_cookie *foc,
 			      enum tcp_synack_type synack_type)
 {
@@ -515,7 +516,7 @@ static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst,
 					       IPPROTO_TCP)) == NULL)
 		goto done;
 
-	skb = tcp_make_synack(sk, dst, req, foc, synack_type);
+	skb = tcp_make_synack(sk, dst, req, syn_skb, foc, synack_type);
 
 	if (skb) {
 		__tcp_v6_send_check(skb, &ireq->ir_v6_loc_addr,
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0cb8ec948816..479b83d05811 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3285,6 +3285,110 @@ union bpf_attr {
  *		Dynamically cast a *sk* pointer to a *udp6_sock* pointer.
  *	Return
  *		*sk* if casting is valid, or NULL otherwise.
+ *
+ * long bpf_store_hdr_opt(struct bpf_sock_ops *skops, u32 offset, const void *from, u32 len, u64 flags)
+ *	Description
+ *		Store BPF header option.  The data will be copied
+ *		from buffer *from* with length *len* to the bpf header
+ *		option at *offset*.
+ *
+ *		This helper does not allow updating the header "kind",
+ *		"kind length", and the 16-bit magic.  *offset* 0
+ *		points immediately after the 16-bit magic.
+ *		This helper will automatically update the "kind length" as
+ *		the bpf header option is written by this helper.  It can
+ *		guarantee that the header option format always
+ *		conforms to the standard.
+ *
+ *		The bpf program can write everything at once
+ *		from offset 0.
+ *
+ *		The bpf program can also do incremental write by calling
+ *		this helper multiple times with incremental (but
+ *		continuous) offset.
+ *
+ *		The bpf program (e.g. the parent cgroup bpf program)
+ *		can also rewrite a few bytes that has already been
+ *		written by some ealier bpf programs.  The bpf
+ *		program can also learn what header option has been
+ *		written by the earlier bpf programs through
+ *		sock_ops->skb_data.
+ *
+ *		A bpf program can always write at offset 0 again
+ *		to overwrite all bpf header options that has been written
+ *		by the previous bpf programs.  If the bpf program
+ *		writes less than the previous bpf programs do,
+ *		the tailing bytes will be removed unless
+ *		*BP_F_STORE_HDR_OPT_NOTRIM* is used.
+ *		e.g. bpf_store_hdr_opt(skops, 0, NULL, 0, 0) will
+ *		remove all previously written option.
+ *
+ *		This helper will reject the *offset* if there is any
+ *		byte/hole that has never been written before this
+ *		*offset*.
+ *
+ *		For example, it can write at offset 0 for 2 bytes,
+ *		then write at offset 2 for 1 bytes, and then finally
+ *		write at offset 3 for 2 bytes.  Thus, the bpf prog
+ *		has totally written 5 bytes.
+ *		The "kind length" will be 9 bytes:
+ *		4 bytes (1byte kind + 1byte kind_len + 2byte magic) +
+ *		5 bytes written by the bpf program.
+ *
+ *		However, this helper does not allow writing at offset 0
+ *		for 2 bytes and then immediately write at offset 3 without
+ *		writing at offset 2 first.
+ *
+ *		This helper can only be called during
+ *		BPF_SOCK_OPS_WRITE_HDR_OPT_CB.  The previous reserved
+ *		space is passed in sock_ops->args[0].
+ *
+ *		The following flags are supported:
+ *		* **BPF_F_STORE_HDR_OPT_NOTRIM**: Usually, bpf program
+ *		  does write with incremental offset instead of
+ *		  jumping the offset back and forth.
+ *		  Thus, by default, after calling this helper,
+ *		  anything previously written after offset + len
+ *		  will be removed.
+ *		  This flag can be used to keep the tailing
+ *		  data after "offset + len".
+ *
+ *	Return
+ *		0 on success, or negative error in case of failure:
+ *
+ *		**-EINVAL** if param is invalid
+ *
+ *		**-EFAULT** *offset* + *len* is outside of the
+ *		            reserved option space or
+ *			    there is unwritten byte before *offset*.
+ *
+ *		**-EPERM** This helper cannot be used under the
+ *			   current sock_ops->op.
+ *
+ * long bpf_reserve_hdr_opt(struct bpf_sock_ops *skops, u32 len, u64 flags)
+ *	Description
+ *		Reserve *len* bytes for the bpf header option.  This
+ *		space will be available to use by bpf_store_hdr_opt()
+ *		later.
+ *
+ *		If bpf_reserve_hdr_opt() is called multiple times (e.g
+ *		by mulitple cgroup bpf programs),
+ *		the maximum *len* among them will be reserved.
+ *
+ *		This helper can only be called during
+ *		BPF_SOCK_OPS_HDR_OPT_LEN_CB.  The maximum
+ *		reservable space is passed in
+ *		sock_ops->args[0].
+ *
+ *	Return
+ *		0 on success, or negative error in case of failure:
+ *
+ *		**-EINVAL** if param is invalid
+ *
+ *		**-ENOSPC** Not enough remaining space in the header.
+ *
+ *		**-EPERM** This helper cannot be used under the
+ *			   current sock_ops->op.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3427,7 +3531,9 @@ union bpf_attr {
 	FN(skc_to_tcp_sock),		\
 	FN(skc_to_tcp_timewait_sock),	\
 	FN(skc_to_tcp_request_sock),	\
-	FN(skc_to_udp6_sock),
+	FN(skc_to_udp6_sock),		\
+	FN(store_hdr_opt),		\
+	FN(reserve_hdr_opt),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -4024,6 +4130,11 @@ struct bpf_sock_ops {
 	__u64 bytes_received;
 	__u64 bytes_acked;
 	__bpf_md_ptr(struct bpf_sock *, sk);
+	__bpf_md_ptr(void *, skb_data);
+	__bpf_md_ptr(void *, skb_data_end);
+	__u32 skb_len;
+	__u32 skb_bpf_hdr_opt_off;
+	__u32 skb_tcp_flags;
 };
 
 /* Definitions for bpf_sock_ops_cb_flags */
@@ -4032,8 +4143,14 @@ enum {
 	BPF_SOCK_OPS_RETRANS_CB_FLAG	= (1<<1),
 	BPF_SOCK_OPS_STATE_CB_FLAG	= (1<<2),
 	BPF_SOCK_OPS_RTT_CB_FLAG	= (1<<3),
+	BPF_SOCK_OPS_PARSE_HDR_OPT_CB_FLAG = (1<<4),
+	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<5),
 /* Mask of all currently supported cb flags */
-	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xF,
+	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x3F,
+};
+
+enum {
+	BPF_F_STORE_HDR_OPT_NOTRIM	= (1ULL << 0),
 };
 
 /* List of known BPF sock_ops operators.
@@ -4089,6 +4206,58 @@ enum {
 					 */
 	BPF_SOCK_OPS_RTT_CB,		/* Called on every RTT.
 					 */
+	BPF_SOCK_OPS_PARSE_HDR_OPT_CB,	/* Parse the BPF header option.
+					 * It will be called to handle
+					 * the packets received at
+					 * socket that has been established.
+					 */
+	BPF_SOCK_OPS_HDR_OPT_LEN_CB,	/* Reserve space for writing the BPF
+					 * header option later in
+					 * BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
+					 * Arg1: Max reservable bytes
+					 * Arg2: bool want_cookie. (in
+					 *       writing SYNACK only)
+					 *
+					 * sock_ops->skb_data:
+					 *	Referring to the
+					 *	outgoing skb.
+					 *	It is pointing to the
+					 *	TCP payload because
+					 *	no TCP header has
+					 *	been written yet.
+					 *
+					 * sock_ops->skb_tcp_flags:
+					 *	the tcp_flags of the
+					 *	outgoing skb. (e.g.
+					 *      SYN, ACK, FIN).
+					 *
+					 * bpf_reserve_hdr_opt() should
+					 * be used to reserve space.
+					 */
+	BPF_SOCK_OPS_WRITE_HDR_OPT_CB,	/* Write the BPF header OPTIONS
+					 * Arg1: num of bytes reserved during
+					 *       BPF_SOCK_OPS_HDR_OPT_LEN_CB
+					 * Arg2: bool want_cookie. (in
+					 *       writing SYNACK only)
+					 *
+					 * sock_ops->skb_data:
+					 *	referring to the
+					 *	outgoing skb.
+					 *	It is pointing to the
+					 *	TCP header.
+					 *
+					 * sock_ops->skb_bpf_hdr_opt_off:
+					 *	offset to the bpf
+					 *	header option.  bpf
+					 *	prog can inspect what bpf
+					 *	option has already been
+					 *	written.
+					 *
+					 * sock_ops->skb_tcp_flags:
+					 *	the tcp_flags of the
+					 *	outgoing skb. (e.g.
+					 *	SYN, ACK, FIN).
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
@@ -4116,6 +4285,20 @@ enum {
 enum {
 	TCP_BPF_IW		= 1001,	/* Set TCP initial congestion window */
 	TCP_BPF_SNDCWND_CLAMP	= 1002,	/* Set sndcwnd_clamp */
+	/* Copy the BPF header option from syn pkt to optval
+	 *
+	 * BPF_PROG_TYPE_SOCK_OPS only.  It is similar to the
+	 * TCP_SAVED_SYN but it only gets the bpf header option
+	 * from the syn packet.
+	 *
+	 *       0: Success
+	 * -ENOSPC: Not enough space in optval. Only optlen number of
+	 *          bytes is copied.
+	 * -ENOMSG: BPF TCP header option not found
+	 * -ENOENT: The SYN skb is not available now and the earlier SYN pkt
+	 *	    is not saved by TCP_SAVE_SYN.
+	 */
+	TCP_BPF_SYN_HDR_OPT	= 1003,
 };
 
 struct bpf_perf_event_value {

From patchwork Fri Jun 26 17:55:52 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Martin KaFai Lau <kafai@fb.com>
X-Patchwork-Id: 217024
Return-Path: <SRS0=3T32=AH=vger.kernel.org=netdev-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
 aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-10.1 required=3.0 tests=DKIMWL_WL_HIGH, DKIM_SIGNED, 
 DKIM_VALID, DKIM_VALID_AU, HEADER_FROM_DIFFERENT_DOMAINS,
 INCLUDES_PATCH, 
 MAILING_LIST_MULTI, SIGNED_OFF_BY, SPF_HELO_NONE, SPF_PASS,
 URIBL_BLOCKED, 
 USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
 by smtp.lore.kernel.org (Postfix) with ESMTP id 89D38C433E1
 for <netdev@archiver.kernel.org>;
 Fri, 26 Jun 2020 17:56:35 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
 by mail.kernel.org (Postfix) with ESMTP id 5EEEB206B7
 for <netdev@archiver.kernel.org>;
 Fri, 26 Jun 2020 17:56:35 +0000 (UTC)
Authentication-Results: mail.kernel.org;
 dkim=pass (1024-bit key) header.d=fb.com header.i=@fb.com
 header.b="YZqEkMER"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S1727079AbgFZR4e (ORCPT <rfc822;netdev@archiver.kernel.org>);
 Fri, 26 Jun 2020 13:56:34 -0400
Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:62942 "EHLO
 mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK)
 by vger.kernel.org with ESMTP id S1726616AbgFZR4d (ORCPT
 <rfc822;netdev@vger.kernel.org>); Fri, 26 Jun 2020 13:56:33 -0400
Received: from pps.filterd (m0148460.ppops.net [127.0.0.1])
 by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id
 05QHsvG2019469
 for <netdev@vger.kernel.org>; Fri, 26 Jun 2020 10:56:30 -0700
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com;
 h=from : to : cc : subject
 : date : message-id : in-reply-to : references : mime-version :
 content-transfer-encoding : content-type; s=facebook;
 bh=oUUV6RasFPK2t9GAgU36nFq/Tl+bil+0+KXK/yh16/M=;
 b=YZqEkMERokuzQvxcQqFKHMcjfzHfiF8/2w71vBNqRUhM77vkwwCjtUMH50u3S5ou3O8P
 XsvhinrNRY/ZsmsRlD+GV4ODst5paqKQI7ktSpzu0E2jX/c93QhMQDlUxMkz3k43sGTf
 lhxSgWg9z7nFd6Owe0fpUUi9Mcw59gGO5CE= 
Received: from maileast.thefacebook.com ([163.114.130.16])
 by mx0a-00082601.pphosted.com with ESMTP id 31ux1exn6j-8
 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128
 verify=NOT)
 for <netdev@vger.kernel.org>; Fri, 26 Jun 2020 10:56:29 -0700
Received: from intmgw002.03.ash8.facebook.com (2620:10d:c0a8:1b::d) by
 mail.thefacebook.com (2620:10d:c0a8:82::e) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id
 15.1.1979.3; Fri, 26 Jun 2020 10:55:55 -0700
Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611)
 id 48D872942E38; Fri, 26 Jun 2020 10:55:52 -0700 (PDT)
Smtp-Origin-Hostprefix: devbig
From: Martin KaFai Lau <kafai@fb.com>
Smtp-Origin-Hostname: devbig005.ftw2.facebook.com
To: <bpf@vger.kernel.org>
CC: Alexei Starovoitov <ast@kernel.org>,
 Daniel Borkmann <daniel@iogearbox.net>,
 Eric Dumazet <edumazet@google.com>, <kernel-team@fb.com>,
 Lawrence Brakmo <brakmo@fb.com>,
 Neal Cardwell <ncardwell@google.com>, <netdev@vger.kernel.org>,
 Yuchung Cheng <ycheng@google.com>
Smtp-Origin-Cluster: ftw2c04
Subject: [PATCH bpf-next 08/10] bpf: selftests: tcp header options
Date: Fri, 26 Jun 2020 10:55:52 -0700
Message-ID: <20200626175552.1462421-1-kafai@fb.com>
X-Mailer: git-send-email 2.24.1
In-Reply-To: <20200626175501.1459961-1-kafai@fb.com>
References: <20200626175501.1459961-1-kafai@fb.com>
MIME-Version: 1.0
X-FB-Internal: Safe
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.216, 18.0.687
 definitions=2020-06-26_10:2020-06-26,2020-06-26 signatures=0
X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0
 lowpriorityscore=0
 clxscore=1015 cotscore=-2147483648 spamscore=0 mlxlogscore=999
 bulkscore=0
 adultscore=0 phishscore=0 priorityscore=1501 malwarescore=0
 mlxscore=0
 suspectscore=15 impostorscore=0 classifier=spam adjust=0 reason=mlx
 scancount=1 engine=8.12.0-2004280000 definitions=main-2006260126
X-FB-Internal: deliver
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org

This patch adds tests for the new bpf tcp header option feature.
It tests header option writing and parsing in 3WHS, normal regular
connection establishment, fastopen, and syncookie.  In syncookie,
the passive side's bpf prog is asking the active side to resend
its bpf header option by specifying a RESEND bit in the outgoing SYNACK.
handle_active_estab() and write_nodata_opt() has some details on its
limitation.
handle_passive_estab() has comments on fastopen.

It also has test for header writing and parsing in FIN packet.

The "max_delack_ms" in bpf_test_option will be used in a latter test.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 .../bpf/prog_tests/tcp_hdr_options.c          | 520 ++++++++++++++
 .../bpf/progs/test_tcp_hdr_options.c          | 674 ++++++++++++++++++
 .../selftests/bpf/test_tcp_hdr_options.h      |  34 +
 3 files changed, 1228 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c
 create mode 100644 tools/testing/selftests/bpf/test_tcp_hdr_options.h

diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
new file mode 100644
index 000000000000..f8daf36783f3
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
@@ -0,0 +1,520 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/socket.h>
+#include <linux/compiler.h>
+
+#include "test_progs.h"
+#include "cgroup_helpers.h"
+#include "network_helpers.h"
+#include "test_tcp_hdr_options.h"
+#include "test_tcp_hdr_options.skel.h"
+
+#define LO_ADDR6 "::eB9F"
+#define CG_NAME "/tcpbpf-hdr-opt-test"
+
+struct bpf_test_option exp_passive_estab_in;
+struct bpf_test_option exp_active_estab_in;
+struct bpf_test_option exp_passive_fin_in;
+struct bpf_test_option exp_active_fin_in;
+struct hdr_stg exp_passive_hdr_stg;
+struct hdr_stg exp_active_hdr_stg = { .active = true, };
+
+static struct test_tcp_hdr_options *skel;
+static int lport_linum_map_fd = -1;
+static int hdr_stg_map_fd = -1;
+static int cg_fd = -1;
+static __u32 duration;
+
+struct sk_fds {
+	int srv_fd;
+	int passive_fd;
+	int active_fd;
+	int passive_lport;
+	int active_lport;
+};
+
+static int add_lo_addr(int family, const char *addr_str)
+{
+	char ip_addr_cmd[256];
+	int cmdlen;
+
+	if (family == AF_INET6)
+		cmdlen = snprintf(ip_addr_cmd, sizeof(ip_addr_cmd),
+				  "ip -6 addr add %s/128 dev lo scope host",
+				  addr_str);
+	else
+		cmdlen = snprintf(ip_addr_cmd, sizeof(ip_addr_cmd),
+				  "ip addr add %s/32 dev lo",
+				  addr_str);
+
+	if (CHECK(cmdlen >= sizeof(ip_addr_cmd), "compile ip cmd",
+		  "failed to add host addr %s to lo. ip cmdlen is too long\n",
+		  addr_str))
+		return -1;
+
+	if (CHECK(system(ip_addr_cmd), "run ip cmd",
+		  "failed to add host addr %s to lo\n", addr_str))
+		return -1;
+
+	return 0;
+}
+
+static int create_netns(void)
+{
+	if (CHECK(unshare(CLONE_NEWNET), "create netns",
+		  "unshare(CLONE_NEWNET): %s (%d)",
+		  strerror(errno), errno))
+		return -1;
+
+	if (CHECK(system("ip link set dev lo up"), "run ip cmd",
+		  "failed to bring lo link up\n"))
+		return -1;
+
+	if (add_lo_addr(AF_INET6, LO_ADDR6))
+		return -1;
+
+	return 0;
+}
+
+static int write_sysctl(const char *sysctl, const char *value)
+{
+	int fd, err, len;
+
+	fd = open(sysctl, O_WRONLY);
+	if (CHECK(fd == -1, "open sysctl", "open(%s): %s (%d)\n",
+		  sysctl, strerror(errno), errno))
+		return -1;
+
+	len = strlen(value);
+	err = write(fd, value, len);
+	close(fd);
+	if (CHECK(err != len, "write sysctl",
+		  "write(%s, %s): err:%d %s (%d)\n",
+		  sysctl, value, err, strerror(errno), errno))
+		return -1;
+
+	return 0;
+}
+
+static void print_hdr_stg(const struct hdr_stg *hdr_stg, const char *prefix)
+{
+	fprintf(stderr, "%s{active:%u, resend_syn:%u, syncookie:%u, fastopen:%u}\n",
+		prefix ? : "", hdr_stg->active, hdr_stg->resend_syn,
+		hdr_stg->syncookie, hdr_stg->fastopen);
+}
+
+static void print_option(const struct bpf_test_option *opt, const char *prefix)
+{
+	fprintf(stderr, "%s{flags:0x%x, max_delack_ms:%u, magic:0x%x}\n",
+		prefix ? : "", opt->flags, opt->max_delack_ms, opt->magic);
+}
+
+static void sk_fds_close(struct sk_fds *sk_fds)
+{
+	close(sk_fds->srv_fd);
+	close(sk_fds->passive_fd);
+	close(sk_fds->active_fd);
+}
+
+static int sk_fds_connect(struct sk_fds *sk_fds, bool fast_open)
+{
+	const char fast[] = "FAST!!!";
+	struct sockaddr_storage addr;
+	struct sockaddr_in *addr_in;
+	socklen_t len;
+
+	sk_fds->srv_fd = start_server(AF_INET6, SOCK_STREAM, LO_ADDR6, 0, 0);
+	if (CHECK(sk_fds->srv_fd == -1, "start_server", "%s (%d)\n",
+		  strerror(errno), errno))
+		goto error;
+
+	if (fast_open)
+		sk_fds->active_fd = fastopen_connect(sk_fds->srv_fd, fast,
+						     sizeof(fast), 0);
+	else
+		sk_fds->active_fd = connect_to_fd(sk_fds->srv_fd, 0);
+
+	if (CHECK_FAIL(sk_fds->active_fd == -1)) {
+		close(sk_fds->srv_fd);
+		goto error;
+	}
+
+	addr_in = (struct sockaddr_in *)&addr;
+	len = sizeof(addr);
+	if (CHECK(getsockname(sk_fds->srv_fd, (struct sockaddr *)&addr,
+			      &len), "getsockname(srv_fd)", "%s (%d)\n",
+		  strerror(errno), errno))
+		goto error_close;
+	sk_fds->passive_lport = ntohs(addr_in->sin_port);
+
+	len = sizeof(addr);
+	if (CHECK(getsockname(sk_fds->active_fd, (struct sockaddr *)&addr,
+			      &len), "getsockname(active_fd)", "%s (%d)\n",
+		  strerror(errno), errno))
+		goto error_close;
+	sk_fds->active_lport = ntohs(addr_in->sin_port);
+
+	sk_fds->passive_fd = accept(sk_fds->srv_fd, NULL, 0);
+	if (CHECK(sk_fds->passive_fd == -1, "accept(srv_fd)", "%s (%d)\n",
+		  strerror(errno), errno))
+		goto error_close;
+
+	if (fast_open) {
+		char bytes_in[sizeof(fast)];
+		int ret;
+
+		ret = read(sk_fds->passive_fd, bytes_in, sizeof(bytes_in));
+		CHECK(ret != sizeof(fast), "read fastopen syn data",
+		      "expected=%lu actual=%d\n", sizeof(fast), ret);
+		close(sk_fds->passive_fd);
+		goto error_close;
+	}
+
+	return 0;
+
+error_close:
+	close(sk_fds->active_fd);
+	close(sk_fds->srv_fd);
+
+error:
+	memset(sk_fds, -1, sizeof(*sk_fds));
+	return -1;
+}
+
+static int check_hdr_opt(const struct bpf_test_option *exp,
+			 const struct bpf_test_option *act,
+			 const char *hdr_desc)
+{
+	if (CHECK(memcmp(exp, act, sizeof(*exp)),
+		  "expected-vs-actual", "unexpected %s\n", hdr_desc)) {
+		print_option(exp, "expected: ");
+		print_option(act, "  actual: ");
+		return -1;
+	}
+
+	return 0;
+}
+
+static int check_hdr_stg(const struct hdr_stg *exp, int fd,
+			 const char *stg_desc)
+{
+	struct hdr_stg act;
+
+	if (CHECK(bpf_map_lookup_elem(hdr_stg_map_fd, &fd, &act),
+		  "map_lookup(hdr_stg_map_fd)", "%s %s (%d)\n",
+		  stg_desc, strerror(errno), errno))
+		return -1;
+
+	if (CHECK(memcmp(exp, &act, sizeof(*exp)),
+		  "expected-vs-actual", "unexpected %s\n", stg_desc)) {
+		print_hdr_stg(exp, "expected: ");
+		print_hdr_stg(&act, "  actual: ");
+		return -1;
+	}
+
+	return 0;
+}
+
+static int check_error_linum(const struct sk_fds *sk_fds)
+{
+	unsigned int linum, nr_errors = 0;
+	int lport = -1;
+
+	if (!bpf_map_lookup_elem(lport_linum_map_fd, &lport, &linum)) {
+		fprintf(stderr,
+			"bpf prog error out at lport:<unknown> linum:%u\n",
+			linum);
+		nr_errors++;
+	}
+
+	lport = sk_fds->passive_lport;
+	if (!bpf_map_lookup_elem(lport_linum_map_fd, &lport, &linum)) {
+		fprintf(stderr,
+			"bpf prog error out at lport:passive(%d), linum:%u\n",
+			lport, linum);
+		nr_errors++;
+	}
+
+	lport = sk_fds->active_lport;
+	if (!bpf_map_lookup_elem(lport_linum_map_fd, &lport, &linum)) {
+		fprintf(stderr,
+			"bpf prog error out at lport:active(%d), linum:%u\n",
+			lport, linum);
+		nr_errors++;
+	}
+
+	return nr_errors;
+}
+
+static void check_exp_and_close_fds(struct sk_fds *sk_fds)
+{
+	int ret, abyte;
+
+	if (check_hdr_stg(&exp_passive_hdr_stg, sk_fds->passive_fd,
+			  "passive_hdr_stg"))
+		goto check_linum;
+
+	if (check_hdr_stg(&exp_active_hdr_stg, sk_fds->active_fd,
+			  "active_hdr_stg"))
+		goto check_linum;
+
+	shutdown(sk_fds->active_fd, SHUT_WR);
+	ret = read(sk_fds->passive_fd, &abyte, sizeof(abyte));
+	if (CHECK(ret != 0, "read-after-shutdown(passive_fd):",
+		  "ret:%d %s (%d)\n",
+		  ret, strerror(errno), errno))
+		goto check_linum;
+
+	shutdown(sk_fds->passive_fd, SHUT_WR);
+	ret = read(sk_fds->active_fd, &abyte, sizeof(abyte));
+	if (CHECK(ret != 0, "read-after-shutdown(active_fd):",
+		  "ret:%d %s (%d)\n",
+		  ret, strerror(errno), errno))
+		goto check_linum;
+
+	if (check_hdr_opt(&exp_passive_estab_in, &skel->bss->passive_estab_in,
+			  "passive_estab_in"))
+		goto check_linum;
+
+	if (check_hdr_opt(&exp_active_estab_in, &skel->bss->active_estab_in,
+			  "active_estab_in"))
+		goto check_linum;
+
+	if (check_hdr_opt(&exp_passive_fin_in, &skel->bss->passive_fin_in,
+			  "passive_fin_in"))
+		goto check_linum;
+
+	check_hdr_opt(&exp_active_fin_in, &skel->bss->active_fin_in,
+		      "active_fin_in");
+
+check_linum:
+	CHECK_FAIL(check_error_linum(sk_fds));
+	sk_fds_close(sk_fds);
+}
+
+static void prepare_out(void)
+{
+	skel->bss->active_syn_out = exp_passive_estab_in;
+	skel->bss->passive_synack_out = exp_active_estab_in;
+
+	skel->bss->active_fin_out = exp_passive_fin_in;
+	skel->bss->passive_fin_out = exp_active_fin_in;
+}
+
+static void reset_test(void)
+{
+	size_t optsize = sizeof(struct bpf_test_option);
+	int lport, err;
+
+	memset(&skel->bss->passive_synack_out, 0, optsize);
+	memset(&skel->bss->passive_fin_out, 0, optsize);
+
+	memset(&skel->bss->passive_estab_in, 0, optsize);
+	memset(&skel->bss->passive_fin_in, 0, optsize);
+
+	memset(&skel->bss->active_syn_out, 0, optsize);
+	memset(&skel->bss->active_fin_out, 0, optsize);
+
+	memset(&skel->bss->active_estab_in, 0, optsize);
+	memset(&skel->bss->active_fin_in, 0, optsize);
+
+	memset(&exp_passive_estab_in, 0, optsize);
+	memset(&exp_active_estab_in, 0, optsize);
+	memset(&exp_passive_fin_in, 0, optsize);
+	memset(&exp_active_fin_in, 0, optsize);
+
+	memset(&exp_passive_hdr_stg, 0, sizeof(exp_passive_hdr_stg));
+	memset(&exp_active_hdr_stg, 0, sizeof(exp_active_hdr_stg));
+	exp_active_hdr_stg.active = true;
+
+	err = bpf_map_get_next_key(lport_linum_map_fd, NULL, &lport);
+	while (!err) {
+		bpf_map_delete_elem(lport_linum_map_fd, &lport);
+		err = bpf_map_get_next_key(lport_linum_map_fd, &lport, &lport);
+	}
+}
+
+static void fastopen_estab(void)
+{
+	struct bpf_link *link;
+	struct sk_fds sk_fds;
+
+	exp_passive_estab_in.flags = OPTION_F_MAGIC;
+	exp_passive_estab_in.magic = 0xfa;
+
+	exp_active_estab_in.flags = OPTION_F_MAGIC;
+	exp_active_estab_in.magic = 0xce;
+
+	exp_passive_hdr_stg.fastopen = true;
+
+	prepare_out();
+
+	/* Allow fastopen without fastopen cookie */
+	if (write_sysctl("/proc/sys/net/ipv4/tcp_fastopen", "1543"))
+		return;
+
+	link = bpf_program__attach_cgroup(skel->progs.estab, cg_fd);
+	if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n",
+		  PTR_ERR(link)))
+		return;
+
+	if (sk_fds_connect(&sk_fds, true)) {
+		bpf_link__destroy(link);
+		return;
+	}
+
+	check_exp_and_close_fds(&sk_fds);
+	bpf_link__destroy(link);
+}
+
+static void syncookie_estab(void)
+{
+	struct bpf_link *link;
+	struct sk_fds sk_fds;
+
+	exp_passive_estab_in.flags = OPTION_F_MAGIC;
+	exp_passive_estab_in.magic = 0xfa;
+
+	exp_active_estab_in.flags = OPTION_F_MAGIC | OPTION_F_RESEND;
+	exp_active_estab_in.magic = 0xce;
+
+	exp_passive_hdr_stg.syncookie = true;
+	exp_active_hdr_stg.resend_syn = true,
+
+	prepare_out();
+
+	/* Clear the RESEND to ensure the bpf prog can learn
+	 * want_cookie and set the RESEND by itself.
+	 */
+	skel->bss->passive_synack_out.flags &= ~OPTION_F_RESEND;
+
+	/* Enforce syncookie mode */
+	if (write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "2"))
+		return;
+
+	link = bpf_program__attach_cgroup(skel->progs.estab, cg_fd);
+	if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n",
+		  PTR_ERR(link)))
+		return;
+
+	if (sk_fds_connect(&sk_fds, false)) {
+		bpf_link__destroy(link);
+		return;
+	}
+
+	check_exp_and_close_fds(&sk_fds);
+	bpf_link__destroy(link);
+}
+
+static void fin(void)
+{
+	struct bpf_link *link;
+	struct sk_fds sk_fds;
+
+	exp_passive_fin_in.flags = OPTION_F_MAGIC;
+	exp_passive_fin_in.magic = 0xfa;
+
+	exp_active_fin_in.flags = OPTION_F_MAGIC;
+	exp_active_fin_in.magic = 0xce;
+
+	prepare_out();
+
+	if (write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "1"))
+		return;
+
+	link = bpf_program__attach_cgroup(skel->progs.estab,
+					  cg_fd);
+	if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n",
+		  PTR_ERR(link)))
+		return;
+
+	if (sk_fds_connect(&sk_fds, false)) {
+		bpf_link__destroy(link);
+		return;
+	}
+
+	check_exp_and_close_fds(&sk_fds);
+	bpf_link__destroy(link);
+}
+
+static void simple_estab(void)
+{
+	struct bpf_link *link;
+	struct sk_fds sk_fds;
+
+	exp_passive_estab_in.flags = OPTION_F_MAGIC;
+	exp_passive_estab_in.magic = 0xfa;
+
+	exp_active_estab_in.flags = OPTION_F_MAGIC;
+	exp_active_estab_in.magic = 0xce;
+
+	prepare_out();
+
+	if (write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "1"))
+		return;
+
+	link = bpf_program__attach_cgroup(skel->progs.estab,
+					  cg_fd);
+	if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n",
+		  PTR_ERR(link)))
+		return;
+
+	if (sk_fds_connect(&sk_fds, false)) {
+		bpf_link__destroy(link);
+		return;
+	}
+
+	check_exp_and_close_fds(&sk_fds);
+	bpf_link__destroy(link);
+}
+
+struct test {
+	const char *desc;
+	void (*run)(void);
+};
+
+#define DEF_TEST(name) { #name, name }
+static struct test tests[] = {
+	DEF_TEST(simple_estab),
+	DEF_TEST(syncookie_estab),
+	DEF_TEST(fastopen_estab),
+	DEF_TEST(fin),
+};
+
+void test_tcp_hdr_options(void)
+{
+	int i;
+
+	skel = test_tcp_hdr_options__open_and_load();
+	if (CHECK(!skel, "open and load skel", "failed"))
+		return;
+
+	hdr_stg_map_fd = bpf_map__fd(skel->maps.hdr_stg_map);
+	lport_linum_map_fd = bpf_map__fd(skel->maps.lport_linum_map);
+
+	cg_fd = test__join_cgroup(CG_NAME);
+	if (CHECK_FAIL(cg_fd < 0)) {
+		test_tcp_hdr_options__destroy(skel);
+		return;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(tests); i++) {
+		if (!test__start_subtest(tests[i].desc))
+			continue;
+
+		if (create_netns())
+			break;
+
+		tests[i].run();
+
+		reset_test();
+	}
+
+	close(cg_fd);
+	test_tcp_hdr_options__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c b/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c
new file mode 100644
index 000000000000..631181bfb4cc
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c
@@ -0,0 +1,674 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+
+#include <stddef.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <linux/tcp.h>
+#include <linux/socket.h>
+#include <linux/bpf.h>
+#include <linux/types.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+#include "test_tcp_hdr_options.h"
+
+#define TCPHDR_FIN 0x01
+#define TCPHDR_SYN 0x02
+#define TCPHDR_RST 0x04
+#define TCPHDR_PSH 0x08
+#define TCPHDR_ACK 0x10
+#define TCPHDR_URG 0x20
+#define TCPHDR_ECE 0x40
+#define TCPHDR_CWR 0x80
+#define TCPHDR_SYNACK (TCPHDR_SYN | TCPHDR_ACK)
+
+#define CG_OK	1
+#define CG_ERR	0
+
+#ifndef SOL_TCP
+#define SOL_TCP 6
+#endif
+
+#define MAX_TCPHDR_SIZE 60
+
+#ifndef sizeof_field
+#define sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER))
+#endif
+
+const __u8 option_len[] = {
+	[OPTION_RESEND]		= 0,
+	[OPTION_MAX_DELACK_MS]	= sizeof_field(struct bpf_test_option,
+					       max_delack_ms),
+	[OPTION_MAGIC]		= sizeof_field(struct bpf_test_option, magic),
+};
+
+struct tcp_bpf_expopt {
+	__u8 kind;
+	__u8 len;
+	__u16 magic;
+	__u8 data[16];	/* <-- bpf prog can only write into this area */
+};
+
+#define TCP_BPF_EXPOPT_BASE_LEN 4
+
+struct bpf_test_option passive_synack_out = {};
+struct bpf_test_option passive_fin_out	= {};
+
+struct bpf_test_option passive_estab_in = {};
+struct bpf_test_option passive_fin_in	= {};
+
+struct bpf_test_option active_syn_out	= {};
+struct bpf_test_option active_fin_out	= {};
+
+struct bpf_test_option active_estab_in	= {};
+struct bpf_test_option active_fin_in	= {};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct hdr_stg);
+} hdr_stg_map SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 3);
+	__type(key, int);
+	__type(value, unsigned int);
+} lport_linum_map SEC(".maps");
+
+#define RET_CG_ERR(skops) ({			\
+	struct bpf_sock *tmp_sk = skops->sk;	\
+	__u32 linum = __LINE__;			\
+	int lport;				\
+						\
+	if (!skops->sk)	{			\
+		lport = -1;			\
+		bpf_map_update_elem(&lport_linum_map, &lport, &linum, BPF_NOEXIST); \
+		clear_hdr_cb_flags(skops);	\
+		return CG_ERR;			\
+	}					\
+									\
+	lport = skops->sk->src_port;					\
+	bpf_map_update_elem(&lport_linum_map, &lport, &linum, BPF_NOEXIST); \
+	clear_hdr_cb_flags(skops);					\
+	return CG_ERR;							\
+})
+
+static __always_inline void
+clear_hdr_cb_flags(struct bpf_sock_ops *skops)
+{
+	bpf_sock_ops_cb_flags_set(skops,
+				  skops->bpf_sock_ops_cb_flags &
+				  ~(BPF_SOCK_OPS_PARSE_HDR_OPT_CB_FLAG |
+				    BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG));
+}
+
+static __always_inline void
+set_hdr_cb_flags(struct bpf_sock_ops *skops)
+{
+	bpf_sock_ops_cb_flags_set(skops,
+				  skops->bpf_sock_ops_cb_flags |
+				  BPF_SOCK_OPS_PARSE_HDR_OPT_CB_FLAG |
+				  BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
+}
+
+static __always_inline __u32 skops_avail_len(const struct bpf_sock_ops *skops)
+{
+	return skops->args[0];
+}
+
+static __always_inline __u32 skops_want_cookie(const struct bpf_sock_ops *skops)
+{
+	return skops->args[1];
+}
+
+static __always_inline __u8 skops_tcp_flags(const struct bpf_sock_ops *skops)
+{
+	return skops->skb_tcp_flags;
+}
+
+static __always_inline unsigned int tcp_hdrlen(struct tcphdr *th)
+{
+	return th->doff * 4;
+}
+
+static __always_inline __u8 option_total_len(__u8 flags)
+{
+	__u8 i, len = sizeof_field(struct bpf_test_option, flags);
+
+	if (!flags)
+		return 0;
+
+	for (i = 0; i < __NR_OPTION_FLAGS; i++) {
+		if (TEST_OPTION_FLAGS(flags, i))
+			len += option_len[i];
+	}
+
+	return len;
+}
+
+static __always_inline int write_option(struct bpf_sock_ops *skops,
+					struct bpf_test_option *opt)
+{
+	__u8 flags = opt->flags;
+	__u8 offset = 0;
+	int err;
+
+	err = bpf_store_hdr_opt(skops, 0, &opt->flags, sizeof(opt->flags), 0);
+	if (err)
+		RET_CG_ERR(skops);
+	offset += sizeof(opt->flags);
+
+	if (TEST_OPTION_FLAGS(flags, OPTION_MAX_DELACK_MS)) {
+		err = bpf_store_hdr_opt(skops, offset, &opt->max_delack_ms,
+					sizeof(opt->max_delack_ms), 0);
+		if (err)
+			RET_CG_ERR(skops);
+		offset += sizeof(opt->max_delack_ms);
+	}
+
+	if (TEST_OPTION_FLAGS(flags, OPTION_MAGIC)) {
+		err = bpf_store_hdr_opt(skops, offset, &opt->magic,
+					sizeof(opt->magic), 0);
+		if (err)
+			RET_CG_ERR(skops);
+		offset += sizeof(opt->magic);
+	}
+
+	return CG_OK;
+}
+
+static __always_inline int parse_option(struct bpf_test_option *opt,
+					__u8 *start, __u8 *hdr_end)
+{
+	__u8 flags;
+
+	if (start + sizeof(opt->flags) > hdr_end)
+		return -EINVAL;
+
+	flags = *start;
+	opt->flags = flags;
+	start += sizeof(opt->flags);
+
+	if (TEST_OPTION_FLAGS(flags, OPTION_MAX_DELACK_MS)) {
+		if (start + sizeof(opt->max_delack_ms) > hdr_end)
+			return -EINVAL;
+		opt->max_delack_ms = *(typeof(opt->max_delack_ms) *)start;
+		start += sizeof(opt->max_delack_ms);
+	}
+
+	if (TEST_OPTION_FLAGS(flags, OPTION_MAGIC)) {
+		if (start + sizeof(opt->magic) > hdr_end)
+			return -EINVAL;
+		opt->magic = *(typeof(opt->magic) *)start;
+		start += sizeof(opt->magic);
+	}
+
+	return 0;
+}
+
+static __always_inline int synack_opt_len(struct bpf_sock_ops *skops)
+{
+	__u8 avail = skops_avail_len(skops);
+	__u8 exp_kind, flags, optlen;
+	int err;
+
+	if (!passive_synack_out.flags)
+		return CG_OK;
+
+	err = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN_HDR_OPT,
+			     &exp_kind, sizeof(exp_kind));
+
+	/* ENOSPC:
+	 * BPF header option is found but not enough
+	 * space to copy everything and it is because
+	 * we intentionally only get the "kind" byte.
+	 *
+	 * All we want to learn here is the SYN pkt contains
+	 * the bpf header option before we proceed to write
+	 * the options in the outgoing SYNACK pkt.
+	 */
+	if (err != -ENOSPC)
+		return CG_OK;
+
+	flags = passive_synack_out.flags;
+	if (skops_want_cookie(skops))
+		SET_OPTION_FLAGS(flags, OPTION_RESEND);
+
+	optlen = option_total_len(flags);
+	if (optlen > avail)
+		RET_CG_ERR(skops);
+
+	if (optlen && bpf_reserve_hdr_opt(skops, optlen, 0))
+		RET_CG_ERR(skops);
+
+	return CG_OK;
+}
+
+static __always_inline int write_synack_opt(struct bpf_sock_ops *skops)
+{
+	struct bpf_test_option opt;
+	__u8 exp_kind;
+	int err;
+
+	if (!passive_synack_out.flags)
+		/* We should not even be called since no header
+		 * space has been reserved.
+		 */
+		RET_CG_ERR(skops);
+
+	err = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN_HDR_OPT,
+			     &exp_kind, sizeof(exp_kind));
+
+	if (err != -ENOSPC)
+		RET_CG_ERR(skops);
+
+	opt = passive_synack_out;
+	if (skops_want_cookie(skops))
+		SET_OPTION_FLAGS(opt.flags, OPTION_RESEND);
+
+	return write_option(skops, &opt);
+}
+
+static __always_inline int syn_opt_len(struct bpf_sock_ops *skops)
+{
+	__u8 avail = skops_avail_len(skops);
+	__u8 optlen;
+
+	if (!active_syn_out.flags)
+		return CG_OK;
+
+	optlen = option_total_len(active_syn_out.flags);
+	if (optlen > avail)
+		RET_CG_ERR(skops);
+
+	if (optlen && bpf_reserve_hdr_opt(skops, optlen, 0))
+		RET_CG_ERR(skops);
+
+	return CG_OK;
+}
+
+static __always_inline int write_syn_opt(struct bpf_sock_ops *skops)
+{
+	if (!active_syn_out.flags)
+		return CG_OK;
+
+	return write_option(skops, &active_syn_out);
+}
+
+static __always_inline int fin_opt_len(struct bpf_sock_ops *skops)
+{
+	__u8 avail = skops_avail_len(skops);
+	struct bpf_test_option *opt;
+	struct hdr_stg *hdr_stg;
+	__u8 optlen;
+
+	if (!skops->sk)
+		RET_CG_ERR(skops);
+
+	hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0);
+	if (!hdr_stg)
+		RET_CG_ERR(skops);
+
+	if (hdr_stg->active)
+		opt = &active_fin_out;
+	else
+		opt = &passive_fin_out;
+
+	optlen = option_total_len(opt->flags);
+	if (optlen > avail)
+		RET_CG_ERR(skops);
+
+	if (optlen && bpf_reserve_hdr_opt(skops, optlen, 0))
+		RET_CG_ERR(skops);
+
+	return CG_OK;
+}
+
+static __always_inline int write_fin_opt(struct bpf_sock_ops *skops)
+{
+	struct bpf_test_option *opt;
+	struct hdr_stg *hdr_stg;
+
+	if (!skops->sk)
+		RET_CG_ERR(skops);
+
+	hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0);
+	if (!hdr_stg)
+		RET_CG_ERR(skops);
+
+	return write_option(skops, hdr_stg->active ?
+			    &active_fin_out : &passive_fin_out);
+}
+
+static __always_inline int nodata_opt_len(struct bpf_sock_ops *skops)
+{
+	struct hdr_stg *hdr_stg;
+
+	if (!skops->sk)
+		RET_CG_ERR(skops);
+
+	hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0);
+	if (!hdr_stg)
+		RET_CG_ERR(skops);
+
+	if (hdr_stg->resend_syn)
+		return syn_opt_len(skops);
+
+	return CG_OK;
+}
+
+static __always_inline int write_nodata_opt(struct bpf_sock_ops *skops)
+{
+	struct hdr_stg *hdr_stg;
+
+	if (!skops->sk)
+		RET_CG_ERR(skops);
+
+	hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0);
+	if (!hdr_stg)
+		RET_CG_ERR(skops);
+
+	if (hdr_stg->resend_syn) {
+		int ret = write_syn_opt(skops);
+
+		/* Passive server side is in syncookie mode and
+		 * ask us (the active connect side) to resend
+		 * the header option.
+		 *
+		 * We resend it here in the last ack of 3WHS and
+		 * then clear the cb_flags so that we will not
+		 * be called to write the options again.
+		 * This strategy is considered best effort
+		 * because this ack could be lost.
+		 *
+		 * In practice, the active side bpf prog can also
+		 * select to keep sending the options until the
+		 * first data/ack packet is received from the
+		 * passive side.
+		 */
+		clear_hdr_cb_flags(skops);
+
+		return ret;
+	}
+
+	return CG_OK;
+}
+
+static __always_inline int data_opt_len(struct bpf_sock_ops *skops)
+{
+	return CG_OK;
+}
+
+static __always_inline int write_data_opt(struct bpf_sock_ops *skops)
+{
+	return CG_OK;
+}
+
+static __always_inline int current_mss_opt_len(struct bpf_sock_ops *skops)
+{
+	/* Reserve maximum that may be needed */
+	if (bpf_reserve_hdr_opt(skops, sizeof(struct bpf_test_option), 0))
+		RET_CG_ERR(skops);
+
+	return CG_OK;
+}
+
+static __always_inline int handle_hdr_opt_len(struct bpf_sock_ops *skops)
+{
+	__u8 tcp_flags = skops_tcp_flags(skops);
+
+	if ((tcp_flags & TCPHDR_SYNACK) == TCPHDR_SYNACK)
+		return synack_opt_len(skops);
+
+	if (tcp_flags & TCPHDR_SYN)
+		return syn_opt_len(skops);
+
+	if (tcp_flags & TCPHDR_FIN)
+		return fin_opt_len(skops);
+
+	if (!skops->skb_data)
+		/* The kernel is calculating the MSS */
+		return current_mss_opt_len(skops);
+
+	if (skops->skb_len)
+		return data_opt_len(skops);
+
+	return nodata_opt_len(skops);
+}
+
+static __always_inline int handle_write_hdr_opt(struct bpf_sock_ops *skops)
+{
+	__u8 tcp_flags = skops_tcp_flags(skops);
+	struct tcphdr *th;
+	int err;
+
+	if ((tcp_flags & TCPHDR_SYNACK) == TCPHDR_SYNACK)
+		return write_synack_opt(skops);
+
+	if (tcp_flags & TCPHDR_SYN)
+		return write_syn_opt(skops);
+
+	if (tcp_flags & TCPHDR_FIN)
+		return write_fin_opt(skops);
+
+	th = skops->skb_data;
+	if (th + 1 > skops->skb_data_end)
+		RET_CG_ERR(skops);
+
+	if (skops->skb_len > tcp_hdrlen(th))
+		return write_data_opt(skops);
+
+	return write_nodata_opt(skops);
+}
+
+static __always_inline int handle_active_estab(struct bpf_sock_ops *skops)
+{
+	__u8 bpf_hdr_opt_off = skops->skb_bpf_hdr_opt_off;
+	struct hdr_stg *stg, init_stg = {
+		.active = true,
+	};
+	__u8 *start;
+	int err;
+
+	if (!bpf_hdr_opt_off)
+		goto update_sk_stg;
+
+	/* Bound the "start" reg->umax_value in the verifier */
+	if (bpf_hdr_opt_off >= MAX_TCPHDR_SIZE)
+		RET_CG_ERR(skops);
+
+	start = skops->skb_data;
+	start += bpf_hdr_opt_off;
+
+	if (start + TCP_BPF_EXPOPT_BASE_LEN > skops->skb_data_end)
+		RET_CG_ERR(skops);
+
+	start += TCP_BPF_EXPOPT_BASE_LEN;
+	err = parse_option(&active_estab_in, start, skops->skb_data_end);
+
+	if (err)
+		RET_CG_ERR(skops);
+
+update_sk_stg:
+	init_stg.resend_syn = TEST_OPTION_FLAGS(active_estab_in.flags,
+						OPTION_RESEND);
+	if (!skops->sk || !bpf_sk_storage_get(&hdr_stg_map, skops->sk,
+					      &init_stg,
+					      BPF_SK_STORAGE_GET_F_CREATE))
+		RET_CG_ERR(skops);
+
+	if (!init_stg.resend_syn && !active_fin_out.flags)
+		/* No options will be written from now */
+		clear_hdr_cb_flags(skops);
+
+	return CG_OK;
+}
+
+static __always_inline int handle_passive_estab(struct bpf_sock_ops *skops)
+{
+	struct tcp_bpf_expopt expopt;
+	struct hdr_stg *hdr_stg;
+	bool syncookie = false;
+	__u8 *start, *end, len;
+	struct tcphdr *th;
+	int err;
+
+	err = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN_HDR_OPT, &expopt,
+			     sizeof(expopt));
+	/* ENOMSG: no bpf hdr opt in the saved syn pkt
+	 * ENOENT: no saved syn (syncookie mode)
+	 */
+	if (err && err != -ENOMSG && err != -ENOENT)
+		RET_CG_ERR(skops);
+
+	if (err) {
+		__u8 bpf_hdr_opt_off = skops->skb_bpf_hdr_opt_off;
+
+		/* No bpf hdr option in SYN.  We were in syncookie
+		 * mode, so try to find it in ACK.
+		 */
+
+		/* No bpf hdr option in ACK either */
+		if (!bpf_hdr_opt_off)
+			goto update_sk_stg;
+
+		if (bpf_hdr_opt_off >= MAX_TCPHDR_SIZE)
+			RET_CG_ERR(skops);
+
+		start = skops->skb_data;
+		start += bpf_hdr_opt_off;
+
+		if (start + TCP_BPF_EXPOPT_BASE_LEN > skops->skb_data_end)
+			RET_CG_ERR(skops);
+
+		len = start[1] - TCP_BPF_EXPOPT_BASE_LEN;
+		start += TCP_BPF_EXPOPT_BASE_LEN;
+		end = skops->skb_data_end;
+		syncookie = true;
+	} else {
+		len = expopt.len - TCP_BPF_EXPOPT_BASE_LEN;
+		start = expopt.data;
+		end = expopt.data + len;
+	}
+
+	if (parse_option(&passive_estab_in, start, end))
+		RET_CG_ERR(skops);
+
+update_sk_stg:
+	if (!skops->sk)
+		RET_CG_ERR(skops);
+
+	hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL,
+				     BPF_SK_STORAGE_GET_F_CREATE);
+	if (!hdr_stg)
+		RET_CG_ERR(skops);
+	hdr_stg->syncookie = syncookie;
+
+	th = skops->skb_data;
+	if (th + 1 > skops->skb_data_end)
+		RET_CG_ERR(skops);
+
+	/* Fastopen */
+	if (skops->skb_len > tcp_hdrlen(th)) {
+		/* Cannot clear cb_flags to stop write_hdr cb.
+		 * synack is not sent yet for fast open.
+		 * Even it was, the synack may need
+		 * to be retransmitted.
+		 *
+		 * The cb_flags will be cleared in
+		 * handle_parse_hdr() later.
+		 */
+		hdr_stg->fastopen = true;
+		return CG_OK;
+	}
+
+	if (!passive_fin_out.flags)
+		/* No options will be written from now */
+		clear_hdr_cb_flags(skops);
+
+	return CG_OK;
+}
+
+static __always_inline int handle_parse_hdr(struct bpf_sock_ops *skops)
+{
+	struct hdr_stg *hdr_stg;
+	struct tcphdr *th;
+	__u32 cb_flags;
+
+	if (!skops->sk)
+		RET_CG_ERR(skops);
+
+	th = skops->skb_data;
+	if (th + 1 > skops->skb_data_end)
+		RET_CG_ERR(skops);
+
+	hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0);
+	if (!hdr_stg)
+		RET_CG_ERR(skops);
+
+	if (th->fin) {
+		__u32 skb_bpf_hdr_opt_off = skops->skb_bpf_hdr_opt_off;
+		struct bpf_test_option *fin_opt;
+		__u8 *kind_start;
+		int ret;
+
+		if (hdr_stg->active)
+			fin_opt = &active_fin_in;
+		else
+			fin_opt = &passive_fin_in;
+
+		if (!skb_bpf_hdr_opt_off)
+			return CG_OK;
+
+		if (skb_bpf_hdr_opt_off >= MAX_TCPHDR_SIZE)
+			RET_CG_ERR(skops);
+
+		kind_start = skops->skb_data;
+		kind_start += skb_bpf_hdr_opt_off;
+		if (kind_start + TCP_BPF_EXPOPT_BASE_LEN > skops->skb_data_end)
+			RET_CG_ERR(skops);
+
+		kind_start += TCP_BPF_EXPOPT_BASE_LEN;
+		if (parse_option(fin_opt, kind_start, skops->skb_data_end))
+			RET_CG_ERR(skops);
+
+		return CG_OK;
+	}
+
+	return CG_OK;
+}
+
+SEC("sockops/estab")
+int estab(struct bpf_sock_ops *skops)
+{
+	int op, optlen, true_val = 1;
+
+	switch (skops->op) {
+	case BPF_SOCK_OPS_TCP_LISTEN_CB:
+		bpf_setsockopt(skops, SOL_TCP, TCP_SAVE_SYN,
+			       &true_val, sizeof(true_val));
+		set_hdr_cb_flags(skops);
+		break;
+	case BPF_SOCK_OPS_TCP_CONNECT_CB:
+		set_hdr_cb_flags(skops);
+		break;
+	case BPF_SOCK_OPS_PARSE_HDR_OPT_CB:
+		return handle_parse_hdr(skops);
+	case BPF_SOCK_OPS_HDR_OPT_LEN_CB:
+		return handle_hdr_opt_len(skops);
+	case BPF_SOCK_OPS_WRITE_HDR_OPT_CB:
+		return handle_write_hdr_opt(skops);
+	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+		return handle_passive_estab(skops);
+	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+		return handle_active_estab(skops);
+	}
+
+	return CG_OK;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_tcp_hdr_options.h b/tools/testing/selftests/bpf/test_tcp_hdr_options.h
new file mode 100644
index 000000000000..d3a4593c81c8
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_tcp_hdr_options.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _TEST_TCP_HDR_OPTIONS_H
+#define _TEST_TCP_HDR_OPTIONS_H
+
+struct bpf_test_option {
+	__u8 flags;
+	__u8 max_delack_ms;
+	__u8 magic;
+};
+
+enum {
+	OPTION_RESEND,
+	OPTION_MAX_DELACK_MS,
+	OPTION_MAGIC,
+	__NR_OPTION_FLAGS,
+};
+
+#define OPTION_F_RESEND		(1 << OPTION_RESEND)
+#define OPTION_F_MAX_DELACK_MS	(1 << OPTION_MAX_DELACK_MS)
+#define OPTION_F_MAGIC		(1 << OPTION_MAGIC)
+
+#define TEST_OPTION_FLAGS(flags, option) (1 & ((flags) >> (option)))
+#define SET_OPTION_FLAGS(flags, option)	((flags) |= (1 << (option)))
+
+/* Store in bpf_sk_storage */
+struct hdr_stg {
+	bool active;
+	bool resend_syn; /* active side only */
+	bool syncookie;  /* passive side only */
+	bool fastopen;	/* passive side only */
+};
+
+#endif

From patchwork Fri Jun 26 17:55:58 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Martin KaFai Lau <kafai@fb.com>
X-Patchwork-Id: 217026
Return-Path: <SRS0=3T32=AH=vger.kernel.org=netdev-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
 aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-10.1 required=3.0 tests=DKIMWL_WL_HIGH, DKIM_SIGNED, 
 DKIM_VALID, DKIM_VALID_AU, HEADER_FROM_DIFFERENT_DOMAINS,
 INCLUDES_PATCH, 
 MAILING_LIST_MULTI, SIGNED_OFF_BY, SPF_HELO_NONE, SPF_PASS,
 URIBL_BLOCKED, 
 USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
 by smtp.lore.kernel.org (Postfix) with ESMTP id A63CEC433DF
 for <netdev@archiver.kernel.org>;
 Fri, 26 Jun 2020 17:56:04 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
 by mail.kernel.org (Postfix) with ESMTP id 7BD98206B7
 for <netdev@archiver.kernel.org>;
 Fri, 26 Jun 2020 17:56:04 +0000 (UTC)
Authentication-Results: mail.kernel.org;
 dkim=pass (1024-bit key) header.d=fb.com header.i=@fb.com
 header.b="GAnaiBGm"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S1726458AbgFZR4D (ORCPT <rfc822;netdev@archiver.kernel.org>);
 Fri, 26 Jun 2020 13:56:03 -0400
Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:56362 "EHLO
 mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK)
 by vger.kernel.org with ESMTP id S1726469AbgFZR4C (ORCPT
 <rfc822;netdev@vger.kernel.org>); Fri, 26 Jun 2020 13:56:02 -0400
Received: from pps.filterd (m0109334.ppops.net [127.0.0.1])
 by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id
 05QHsYat010044
 for <netdev@vger.kernel.org>; Fri, 26 Jun 2020 10:56:02 -0700
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com;
 h=from : to : cc : subject
 : date : message-id : in-reply-to : references : mime-version :
 content-transfer-encoding : content-type; s=facebook;
 bh=x9+g55k1QE3JMJ1MSBZfXQgh+V/wM36C3TVhv6OzQtM=;
 b=GAnaiBGmYGqCbBMZLH2EKsDLkqlxbQoP0UBACJ8qHzZsRgKAJNMJiIrtYZznjVIc2B9P
 tCFWyP+GWJljAME3CIZlWWVjJ22iWcxc+Wq0MdxmzUES/8dAUqKSrX9cNizUfx2tP0y2
 b9eRBGk0nzK1qnSWFEYxEln66wzFTkNlnNE= 
Received: from mail.thefacebook.com ([163.114.132.120])
 by mx0a-00082601.pphosted.com with ESMTP id 31vdptjfra-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128
 verify=NOT)
 for <netdev@vger.kernel.org>; Fri, 26 Jun 2020 10:56:02 -0700
Received: from intmgw002.08.frc2.facebook.com (2620:10d:c085:208::11) by
 mail.thefacebook.com (2620:10d:c085:11d::5) with Microsoft SMTP
 Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id
 15.1.1979.3; Fri, 26 Jun 2020 10:56:01 -0700
Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611)
 id 842D32942E38; Fri, 26 Jun 2020 10:55:58 -0700 (PDT)
Smtp-Origin-Hostprefix: devbig
From: Martin KaFai Lau <kafai@fb.com>
Smtp-Origin-Hostname: devbig005.ftw2.facebook.com
To: <bpf@vger.kernel.org>
CC: Alexei Starovoitov <ast@kernel.org>,
 Daniel Borkmann <daniel@iogearbox.net>,
 Eric Dumazet <edumazet@google.com>, <kernel-team@fb.com>,
 Lawrence Brakmo <brakmo@fb.com>,
 Neal Cardwell <ncardwell@google.com>, <netdev@vger.kernel.org>,
 Yuchung Cheng <ycheng@google.com>
Smtp-Origin-Cluster: ftw2c04
Subject: [PATCH bpf-next 09/10] tcp: bpf: Add TCP_BPF_DELACK_MAX and
 TCP_BPF_RTO_MIN to bpf_setsockopt
Date: Fri, 26 Jun 2020 10:55:58 -0700
Message-ID: <20200626175558.1462731-1-kafai@fb.com>
X-Mailer: git-send-email 2.24.1
In-Reply-To: <20200626175501.1459961-1-kafai@fb.com>
References: <20200626175501.1459961-1-kafai@fb.com>
MIME-Version: 1.0
X-FB-Internal: Safe
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.216, 18.0.687
 definitions=2020-06-26_10:2020-06-26,2020-06-26 signatures=0
X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0
 lowpriorityscore=0
 mlxlogscore=999 cotscore=-2147483648 mlxscore=0 phishscore=0
 bulkscore=0
 malwarescore=0 spamscore=0 adultscore=0 priorityscore=1501
 suspectscore=13
 clxscore=1015 impostorscore=0 classifier=spam adjust=0 reason=mlx
 scancount=1 engine=8.12.0-2004280000 definitions=main-2006260126
X-FB-Internal: deliver
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org

This change is mostly from an internal patch and adapts it from sysctl
config to the bpf_setsockopt setup.

The bpf_prog can set the max delay ack by using
bpf_setsockopt(TCP_BPF_DELACK_MAX).  This max delay ack can be communicated
to its peer through bpf header option.  The receiving peer can then use
this max delay ack and set a potentially lower rto by using
bpf_setsockopt(TCP_BPF_RTO_MIN).  A latter patch will use it
like this in a test as an example.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/inet_connection_sock.h |  2 ++
 include/net/tcp.h                  |  2 +-
 include/uapi/linux/bpf.h           |  2 ++
 net/core/filter.c                  | 15 +++++++++++++++
 net/ipv4/tcp.c                     |  4 ++++
 net/ipv4/tcp_output.c              |  2 ++
 tools/include/uapi/linux/bpf.h     |  2 ++
 7 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index e5b388f5fa20..43d45864e5f0 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -93,6 +93,8 @@ struct inet_connection_sock {
  	struct timer_list	  icsk_retransmit_timer;
  	struct timer_list	  icsk_delack_timer;
 	__u32			  icsk_rto;
+	__u32                     icsk_rto_min;
+	__u32                     icsk_delack_max;
 	__u32			  icsk_pmtu_cookie;
 	const struct tcp_congestion_ops *icsk_ca_ops;
 	const struct inet_connection_sock_af_ops *icsk_af_ops;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e93ef2d324f3..8a682d678971 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -703,7 +703,7 @@ static inline void tcp_fast_path_check(struct sock *sk)
 static inline u32 tcp_rto_min(struct sock *sk)
 {
 	const struct dst_entry *dst = __sk_dst_get(sk);
-	u32 rto_min = TCP_RTO_MIN;
+	u32 rto_min = inet_csk(sk)->icsk_rto_min;
 
 	if (dst && dst_metric_locked(dst, RTAX_RTO_MIN))
 		rto_min = dst_metric_rtt(dst, RTAX_RTO_MIN);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 479b83d05811..2ccc81548eef 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4299,6 +4299,8 @@ enum {
 	 *	    is not saved by TCP_SAVE_SYN.
 	 */
 	TCP_BPF_SYN_HDR_OPT	= 1003,
+	TCP_BPF_DELACK_MAX	= 1004, /* Max delay ack in usecs */
+	TCP_BPF_RTO_MIN		= 1005, /* Min delay ack in usecs */
 };
 
 struct bpf_perf_event_value {
diff --git a/net/core/filter.c b/net/core/filter.c
index 5784f1bede2f..4b80934e6876 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4431,6 +4431,7 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
 		} else {
 			struct inet_connection_sock *icsk = inet_csk(sk);
 			struct tcp_sock *tp = tcp_sk(sk);
+			unsigned long timeout;
 
 			if (optlen != sizeof(int))
 				return -EINVAL;
@@ -4452,6 +4453,20 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
 					tp->snd_ssthresh = val;
 				}
 				break;
+			case TCP_BPF_DELACK_MAX:
+				timeout = usecs_to_jiffies(val);
+				if (timeout > TCP_DELACK_MAX ||
+				    timeout < TCP_TIMEOUT_MIN)
+					return -EINVAL;
+				inet_csk(sk)->icsk_delack_max = timeout;
+				break;
+			case TCP_BPF_RTO_MIN:
+				timeout = usecs_to_jiffies(val);
+				if (timeout > TCP_RTO_MIN ||
+				    timeout < TCP_TIMEOUT_MIN)
+					return -EINVAL;
+				inet_csk(sk)->icsk_rto_min = timeout;
+				break;
 			case TCP_SAVE_SYN:
 				if (val < 0 || val > 1)
 					ret = -EINVAL;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 60093a211f4d..02be3e2a2fdb 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -418,6 +418,8 @@ void tcp_init_sock(struct sock *sk)
 	INIT_LIST_HEAD(&tp->tsorted_sent_queue);
 
 	icsk->icsk_rto = TCP_TIMEOUT_INIT;
+	icsk->icsk_rto_min = TCP_RTO_MIN;
+	icsk->icsk_delack_max = TCP_DELACK_MAX;
 	tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
 	minmax_reset(&tp->rtt_min, tcp_jiffies32, ~0U);
 
@@ -2685,6 +2687,8 @@ int tcp_disconnect(struct sock *sk, int flags)
 	icsk->icsk_backoff = 0;
 	icsk->icsk_probes_out = 0;
 	icsk->icsk_rto = TCP_TIMEOUT_INIT;
+	icsk->icsk_rto_min = TCP_RTO_MIN;
+	icsk->icsk_delack_max = TCP_DELACK_MAX;
 	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
 	tp->snd_cwnd = TCP_INIT_CWND;
 	tp->snd_cwnd_cnt = 0;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a78a29980e1f..db872a2a01c6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3899,6 +3899,8 @@ void tcp_send_delayed_ack(struct sock *sk)
 		ato = min(ato, max_ato);
 	}
 
+	ato = min_t(u32, ato, inet_csk(sk)->icsk_delack_max);
+
 	/* Stay within the limit we were given */
 	timeout = jiffies + ato;
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 479b83d05811..2ccc81548eef 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4299,6 +4299,8 @@ enum {
 	 *	    is not saved by TCP_SAVE_SYN.
 	 */
 	TCP_BPF_SYN_HDR_OPT	= 1003,
+	TCP_BPF_DELACK_MAX	= 1004, /* Max delay ack in usecs */
+	TCP_BPF_RTO_MIN		= 1005, /* Min delay ack in usecs */
 };
 
 struct bpf_perf_event_value {

From patchwork Fri Jun 26 17:56:04 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Martin KaFai Lau <kafai@fb.com>
X-Patchwork-Id: 217025
Return-Path: <SRS0=3T32=AH=vger.kernel.org=netdev-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
 aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-10.1 required=3.0 tests=DKIMWL_WL_HIGH, DKIM_SIGNED, 
 DKIM_VALID, DKIM_VALID_AU, HEADER_FROM_DIFFERENT_DOMAINS,
 INCLUDES_PATCH, 
 MAILING_LIST_MULTI, SIGNED_OFF_BY, SPF_HELO_NONE, SPF_PASS,
 URIBL_BLOCKED, 
 USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
 by smtp.lore.kernel.org (Postfix) with ESMTP id 3EAA8C433DF
 for <netdev@archiver.kernel.org>;
 Fri, 26 Jun 2020 17:56:20 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
 by mail.kernel.org (Postfix) with ESMTP id 1E012206B7
 for <netdev@archiver.kernel.org>;
 Fri, 26 Jun 2020 17:56:20 +0000 (UTC)
Authentication-Results: mail.kernel.org;
 dkim=pass (1024-bit key) header.d=fb.com header.i=@fb.com
 header.b="KZ8kGK6t"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S1727845AbgFZR4T (ORCPT <rfc822;netdev@archiver.kernel.org>);
 Fri, 26 Jun 2020 13:56:19 -0400
Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:1394 "EHLO
 mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK)
 by vger.kernel.org with ESMTP id S1726616AbgFZR4S (ORCPT
 <rfc822;netdev@vger.kernel.org>); Fri, 26 Jun 2020 13:56:18 -0400
Received: from pps.filterd (m0148460.ppops.net [127.0.0.1])
 by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id
 05QHsuDF019455
 for <netdev@vger.kernel.org>; Fri, 26 Jun 2020 10:56:17 -0700
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com;
 h=from : to : cc : subject
 : date : message-id : in-reply-to : references : mime-version :
 content-transfer-encoding : content-type; s=facebook;
 bh=UHFQQFqYSY8aevwSYLHh6Hobk6A4dxuODVh5qoz6O0I=;
 b=KZ8kGK6tZ2/Y2glbND15vFuJBXyR6zlf7Rs3Hq7BRSRTZVPPpECLI65lbAsaACOyuZM3
 uxcFi/BheiBqUc66daSqCV35+KPjbo+vbY+QkAyozwVAhVYaTkMPT8ycND/yQw1q9lbo
 +EXuGnq1n+jY3NfxE8gqaM/Zi2wdaQMjTSY= 
Received: from maileast.thefacebook.com ([163.114.130.16])
 by mx0a-00082601.pphosted.com with ESMTP id 31ux1exn4y-14
 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128
 verify=NOT)
 for <netdev@vger.kernel.org>; Fri, 26 Jun 2020 10:56:17 -0700
Received: from intmgw001.03.ash8.facebook.com (2620:10d:c0a8:1b::d) by
 mail.thefacebook.com (2620:10d:c0a8:82::d) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id
 15.1.1979.3; Fri, 26 Jun 2020 10:56:09 -0700
Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611)
 id CA5942942E38; Fri, 26 Jun 2020 10:56:04 -0700 (PDT)
Smtp-Origin-Hostprefix: devbig
From: Martin KaFai Lau <kafai@fb.com>
Smtp-Origin-Hostname: devbig005.ftw2.facebook.com
To: <bpf@vger.kernel.org>
CC: Alexei Starovoitov <ast@kernel.org>,
 Daniel Borkmann <daniel@iogearbox.net>,
 Eric Dumazet <edumazet@google.com>, <kernel-team@fb.com>,
 Lawrence Brakmo <brakmo@fb.com>,
 Neal Cardwell <ncardwell@google.com>, <netdev@vger.kernel.org>,
 Yuchung Cheng <ycheng@google.com>
Smtp-Origin-Cluster: ftw2c04
Subject: [PATCH bpf-next 10/10] bpf: selftest: Add test for
 TCP_BPF_DELACK_MAX and TCP_BPF_RTO_MIN
Date: Fri, 26 Jun 2020 10:56:04 -0700
Message-ID: <20200626175604.1462935-1-kafai@fb.com>
X-Mailer: git-send-email 2.24.1
In-Reply-To: <20200626175501.1459961-1-kafai@fb.com>
References: <20200626175501.1459961-1-kafai@fb.com>
MIME-Version: 1.0
X-FB-Internal: Safe
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.216, 18.0.687
 definitions=2020-06-26_10:2020-06-26,2020-06-26 signatures=0
X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0
 lowpriorityscore=0
 clxscore=1015 cotscore=-2147483648 spamscore=0 mlxlogscore=739
 bulkscore=0
 adultscore=0 phishscore=0 priorityscore=1501 malwarescore=0
 mlxscore=0
 suspectscore=13 impostorscore=0 classifier=spam adjust=0 reason=mlx
 scancount=1 engine=8.12.0-2004280000 definitions=main-2006260126
X-FB-Internal: deliver
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org

This patch tests a bpf prog that parses/writes a max_delack_ms bpf header
option and also bpf_setsockopt its TCP_BPF_DELACK_MAX/TCP_BPF_RTO_MIN
accordingly.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 .../bpf/prog_tests/tcp_hdr_options.c          |  6 ++--
 .../bpf/progs/test_tcp_hdr_options.c          | 34 +++++++++++++++++++
 2 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
index f8daf36783f3..5a58f60d2889 100644
--- a/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
+++ b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
@@ -447,11 +447,13 @@ static void simple_estab(void)
 	struct bpf_link *link;
 	struct sk_fds sk_fds;
 
-	exp_passive_estab_in.flags = OPTION_F_MAGIC;
+	exp_passive_estab_in.flags = OPTION_F_MAGIC | OPTION_F_MAX_DELACK_MS;
 	exp_passive_estab_in.magic = 0xfa;
+	exp_passive_estab_in.max_delack_ms = 11;
 
-	exp_active_estab_in.flags = OPTION_F_MAGIC;
+	exp_active_estab_in.flags = OPTION_F_MAGIC | OPTION_F_MAX_DELACK_MS;
 	exp_active_estab_in.magic = 0xce;
+	exp_active_estab_in.max_delack_ms = 22;
 
 	prepare_out();
 
diff --git a/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c b/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c
index 631181bfb4cc..eb3b3c2a21f9 100644
--- a/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c
+++ b/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c
@@ -465,6 +465,24 @@ static __always_inline int handle_write_hdr_opt(struct bpf_sock_ops *skops)
 	return write_nodata_opt(skops);
 }
 
+static __always_inline int set_delack_max(struct bpf_sock_ops *skops,
+					  __u8 max_delack_ms)
+{
+	__u32 max_delack_us = max_delack_ms * 1000;
+
+	return bpf_setsockopt(skops, SOL_TCP, TCP_BPF_DELACK_MAX,
+			      &max_delack_us, sizeof(max_delack_us));
+}
+
+static __always_inline int set_rto_min(struct bpf_sock_ops *skops,
+				       __u8 peer_max_delack_ms)
+{
+	__u32 min_rto_us = peer_max_delack_ms * 1000;
+
+	return bpf_setsockopt(skops, SOL_TCP, TCP_BPF_RTO_MIN, &min_rto_us,
+			      sizeof(min_rto_us));
+}
+
 static __always_inline int handle_active_estab(struct bpf_sock_ops *skops)
 {
 	__u8 bpf_hdr_opt_off = skops->skb_bpf_hdr_opt_off;
@@ -505,6 +523,14 @@ static __always_inline int handle_active_estab(struct bpf_sock_ops *skops)
 		/* No options will be written from now */
 		clear_hdr_cb_flags(skops);
 
+	if (active_syn_out.max_delack_ms &&
+	    set_delack_max(skops, active_syn_out.max_delack_ms))
+		RET_CG_ERR(skops);
+
+	if (active_estab_in.max_delack_ms &&
+	    set_rto_min(skops, active_estab_in.max_delack_ms))
+		RET_CG_ERR(skops);
+
 	return CG_OK;
 }
 
@@ -590,6 +616,14 @@ static __always_inline int handle_passive_estab(struct bpf_sock_ops *skops)
 		/* No options will be written from now */
 		clear_hdr_cb_flags(skops);
 
+	if (passive_synack_out.max_delack_ms &&
+	    set_delack_max(skops, passive_synack_out.max_delack_ms))
+		RET_CG_ERR(skops);
+
+	if (passive_estab_in.max_delack_ms &&
+	    set_rto_min(skops, passive_estab_in.max_delack_ms))
+		RET_CG_ERR(skops);
+
 	return CG_OK;
 }