[RFC,v4,bpf-next,09/12] bpf: tcp: Allow bpf prog to write and parse TCP header option

[ Note: The TCP changes here are mainly to implement the bpf
  pieces in the bpf_skops_*() functions introduced
  in the earlier patches. ]

The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
algorithm to be written in BPF.  It opens up opportunities for
a faster turnaround time in testing new congestion control
ideas and releasing them to a production environment.

The same flexibility can be extended to writing TCP header options.
It is not uncommon that people want to test a new TCP header option
to improve TCP performance.  Another use case is a data center,
which is a more controlled environment and has more flexibility in
adding header options for internal-only use.

For example, we want to test the idea of putting the max delay
ACK in a TCP header option, which is similar to a draft RFC proposal [1].

This patch introduces the necessary BPF API and uses it in the
TCP stack to allow a BPF_PROG_TYPE_SOCK_OPS program to parse
and write TCP header options.  It currently supports most
TCP packet types except RST.

Supported TCP header option:
───────────────────────────
This patch allows the bpf-prog to write any option kind.
Different bpf-progs can each write their own option by calling the new
helper bpf_store_hdr_opt().  The helper will ensure there is no
duplicated option in the header.
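
The duplicate check can be pictured with a small userspace model.
This is only a sketch and not the kernel implementation: struct
opt_area, opt_present(), store_opt() and the chosen error codes are
hypothetical names and assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of an outgoing option area; not a kernel structure. */
struct opt_area {
	uint8_t buf[40];	/* TCP allows at most 40 bytes of options */
	size_t used;		/* bytes already written */
};

/* Return 1 if an option of this kind is already present in the area. */
static int opt_present(const struct opt_area *a, uint8_t kind)
{
	size_t off = 0;

	while (off + 1 < a->used) {
		if (a->buf[off] == 1) {		/* NOP is a single byte */
			off++;
			continue;
		}
		if (a->buf[off] == kind)
			return 1;
		if (a->buf[off + 1] < 2)	/* malformed length, stop */
			break;
		off += a->buf[off + 1];
	}
	return 0;
}

/*
 * Append one option given as {kind, len, data...}.  The error codes
 * are an assumption of this sketch.
 */
static int store_opt(struct opt_area *a, const uint8_t *opt, size_t opt_len)
{
	/* Kinds 0 (EOL) and 1 (NOP) have no length byte and are rejected. */
	if (opt_len < 2 || opt[1] != opt_len || opt[0] <= 1)
		return -EINVAL;
	if (opt_present(a, opt[0]))
		return -EEXIST;
	if (opt_len > sizeof(a->buf) - a->used)
		return -ENOSPC;
	memcpy(a->buf + a->used, opt, opt_len);
	a->used += opt_len;
	return 0;
}
```

In this model a second store of the same kind fails and leaves the
area untouched, whether the first copy came from the "kernel" or from
an earlier caller.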

Allowing the bpf-prog to write any option kind gives it a lot of
flexibility.  Different bpf-progs can each write their
own option kind.  It could also allow a bpf-prog to support a
recently standardized option on an older kernel.

Sockops Callback Flags:
──────────────────────
The header parsing and writing callback can be turned on
by enabling a few newly added callback flags:

BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG:
	Call bpf when kernel has received a header option that
	the kernel cannot handle.  It is useful when the peer doesn't
	send bpf-options very often.

	The bpf-prog can inspect the received header through
	sock_ops->skb_data, which covers the whole header (including
	the fixed fields like ports, flags, etc.), or it can use the
	new bpf_load_hdr_opt() to search for a particular TCP
	header option.

BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG:
	Call bpf for all received TCP headers.

	It could be used at the client/active side (i.e. connect() side)
	when the server has told it that the server is in syncookie
	mode and requires the active side to resend the bpf-written
	options.  The active side can keep writing the bpf-options until
	it receives a valid packet from the server side confirming that
	the earlier packet (and its options) have been received.  The
	example in a later patch uses it this way at the active side
	when the server is in syncookie mode.

	The bpf prog will usually turn this off in the common cases.

When the above PARSE CB flags are turned on, the bpf-prog will
be called under sock_ops->op == BPF_SOCK_OPS_PARSE_HDR_OPT_CB.
These PARSE CB flags will only ask the kernel to call the bpf-prog when
the tcp packet is received at an already-established sk.
It does not include the SYN-SYNACK-ACK during the 3WHS where the connection
has not been established.  The parsing of the SYN-SYNACK-ACK will be
discussed in the "3 Way HandShake" section.
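
The rule for when the PARSE callback fires can be summarized by the
following userspace sketch.  It is not kernel code: the DEMO_* bit
values and should_call_parse_cb() are hypothetical, and the real flag
values are the ones defined in the uapi bpf.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only; the real ones live in the uapi bpf.h. */
#define DEMO_PARSE_ALL_HDR_OPT_CB_FLAG		(1u << 0)
#define DEMO_PARSE_UNKNOWN_HDR_OPT_CB_FLAG	(1u << 1)

/*
 * Decide whether a received packet should trigger the PARSE callback.
 * sk_established: the packet arrived on an already-established sk.
 * saw_unknown_opt: the kernel met an option it cannot handle.
 */
static bool should_call_parse_cb(uint32_t cb_flags, bool sk_established,
				 bool saw_unknown_opt)
{
	/* Packets of the 3WHS are not covered by the PARSE flags. */
	if (!sk_established)
		return false;
	if (cb_flags & DEMO_PARSE_ALL_HDR_OPT_CB_FLAG)
		return true;
	return (cb_flags & DEMO_PARSE_UNKNOWN_HDR_OPT_CB_FLAG) &&
	       saw_unknown_opt;
}
```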

BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG:
	Call bpf when the kernel is writing header options for the
	outgoing packet.

	The bpf-prog will first be called to reserve space in the skb during
	sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.  The kernel native options
	get their space first and the bpf-prog can only reserve from the
	space that remains.  The bpf-prog reserves space through the new
	bpf_reserve_hdr_opt() helper.

	The bpf-prog will then be called to write the option during
	sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
	Again, the kernel will write its native options first before
	calling the bpf-prog.  The bpf-prog will use the new
	bpf_store_hdr_opt() to write the option.  bpf_store_hdr_opt()
	will ensure the option being written has not already been written
	by the kernel or by an earlier bpf-prog.  This avoids
	sending duplicated options in the header.

	The bpf-prog can also learn what options have been written
	by the kernel or other bpf-progs by reading sock_ops->skb_data
	or by calling the new bpf_load_hdr_opt() helper.

The default is off for all of the above new CB flags, i.e. the bpf prog will
not be called to parse or write bpf hdr option.
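
The space accounting of the length phase described for
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG can be modelled in userspace as
below.  This is a sketch under stated assumptions: struct hdr_budget,
reserve_opt(), padded_bpf_len() and the -ENOSPC return are
hypothetical.  The 40-byte limit and the 4-byte granularity follow
from the TCP data offset field, which counts the header in 32-bit
words.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_TCP_OPTION_SPACE 40	/* 60-byte max header minus 20 fixed bytes */

/* Hypothetical bookkeeping for one outgoing packet; not a kernel structure. */
struct hdr_budget {
	size_t kernel_len;	/* bytes taken by the kernel's native options */
	size_t bpf_len;		/* bytes reserved so far by bpf-progs */
};

static size_t budget_remaining(const struct hdr_budget *b)
{
	return MAX_TCP_OPTION_SPACE - b->kernel_len - b->bpf_len;
}

/* Length phase: grant a reservation only from what the kernel left over. */
static int reserve_opt(struct hdr_budget *b, size_t len)
{
	if (len > budget_remaining(b))
		return -ENOSPC;
	b->bpf_len += len;
	return 0;
}

/* The option area must end on a 32-bit boundary, so pad to 4 bytes. */
static size_t padded_bpf_len(const struct hdr_budget *b)
{
	return (b->bpf_len + 3) & ~(size_t)3;
}
```

With 32 bytes of native options, for example, only 8 bytes are left
for all bpf-progs together, and a 7-byte reservation still consumes
the full 8 bytes once padded.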

sock_ops->skb_data and bpf_load_hdr_opt()
─────────────────────────────────────────
sock_ops->skb_data and sock_ops->skb_data_end cover the whole
TCP header and its options.

bpf_load_hdr_opt() searches for a particular option "kind"
in the TCP header.
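
For readers less familiar with the TCP option layout, a search by
"kind" amounts to the walk sketched below.  This is a userspace
illustration and not the helper's implementation; find_tcp_opt() is a
hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

#define TCPOPT_EOL 0	/* end of option list */
#define TCPOPT_NOP 1	/* one-byte padding */

/*
 * Walk the option bytes that follow the fixed 20-byte TCP header and
 * return a pointer to the first option of the given kind, or NULL if
 * it is absent or the option list is malformed.
 */
static const uint8_t *find_tcp_opt(const uint8_t *opts, size_t len,
				   uint8_t kind)
{
	size_t off = 0;

	while (off < len) {
		uint8_t k = opts[off];
		uint8_t olen;

		if (k == TCPOPT_EOL)
			return NULL;
		if (k == TCPOPT_NOP) {
			off++;
			continue;
		}
		/* Every other option has a length byte covering kind+len+data. */
		if (off + 1 >= len)
			return NULL;
		olen = opts[off + 1];
		if (olen < 2 || off + olen > len)
			return NULL;
		if (k == kind)
			return &opts[off];
		off += olen;
	}
	return NULL;
}
```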

When parsing a header (sock_ops->op == BPF_SOCK_OPS_PARSE_HDR_OPT_CB),
they refer to the received skb.

When reserving header space (sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB),
they will not be useful because the header has not been written yet.

When writing header (sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB),
they are the outgoing skb that the header option will be written into.
The bpf-prog can use them to see what header has been written by the kernel
or the earlier bpf-progs.

When concluding a 3WHS, they refer to the received skb that completes
the 3WHS:
  - In ACTIVE_ESTABLISHED_CB, it is the received SYNACK.
  - In PASSIVE_ESTABLISHED_CB, it is usually the received ACK (or the
    received SYN-DATA in fastopen).

3 Way HandShake
───────────────
The bpf-prog can learn if it is sending SYN or SYNACK by reading the
sock_ops->skb_tcp_flags.

* Passive side

When writing SYNACK (i.e. sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB),
the received SYN skb will be available to the bpf prog.  The bpf prog can
use the SYN skb (which may carry the header option sent from the remote bpf
prog) to decide what bpf header option should be written to the outgoing
SYNACK skb.

The bpf-prog can get the whole SYN TCP header by
"bpf_getsockopt(TCP_BPF_SYN)" or search for a particular header option by
"bpf_load_hdr_opt(BPF_LOAD_HDR_OPT_TCP_SYN)".

The bpf prog will also know if it is in syncookie mode
(sock_ops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE) or not.

The bpf prog can store the received SYN pkt by using the existing
bpf_setsockopt(TCP_SAVE_SYN).  The example in a later patch also
does this.
[ Note that the fullsock here is a listen sk.  bpf_sk_storage
  is not very useful here since the listen sk will be shared
  by many concurrent connection requests.

  Extending bpf_sk_storage support to request_sock would add weight
  to the minisock and is not necessarily better than storing the
  whole ~100 bytes SYN pkt. ]

When the connection is established, the bpf prog will be called
in the existing PASSIVE_ESTABLISHED_CB callback.  At that time,
the bpf prog can get the header option from the saved syn and
then apply the needed operation to the newly established socket.
The later patch will use the max delay ack specified in the SYN
packet as an example.  The received ACK (that concludes the 3WHS)
will also be available to the bpf prog during PASSIVE_ESTABLISHED_CB
through the sock_ops->skb_data and bpf_load_hdr_opt() and it
could be useful in syncookie scenario.  More on this later.

There is an existing getsockopt "TCP_SAVED_SYN" to return the whole
saved syn pkt, which includes the IP[46] header and the TCP header.
A few "TCP_BPF_SYN*" getsockopts have been added to allow specifying where
to start reading from, e.g. from the TCP header or from the IP[46] header.

In the new getsockopt(TCP_BPF_SYN*), the kernel will know
where it can get the SYN's packet from:
  - (a) the just received syn (available when the bpf prog is writing SYNACK)
        and it is the only way to get SYN in syncookie mode.
  or
  - (b) the saved syn (available in PASSIVE_ESTABLISHED_CB and also other
        existing CB).

The bpf prog does not need to know where the SYN pkt is obtained from.
The "TCP_BPF_SYN*" getsockopt hides this detail.
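
The selection between (a) and (b) can be sketched in userspace as
follows.  This only models the decision described above: struct
syn_sources, pick_syn() and the -ENOENT return are hypothetical and
are not taken from the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of where a SYN could come from; not a kernel structure. */
struct syn_sources {
	const uint8_t *rx_syn;		/* (a) just-received SYN, or NULL */
	size_t rx_syn_len;
	const uint8_t *saved_syn;	/* (b) SYN saved on the socket, or NULL */
	size_t saved_syn_len;
};

/*
 * Prefer the just-received SYN (the only source in syncookie mode),
 * otherwise fall back to the saved SYN.
 */
static int pick_syn(const struct syn_sources *s,
		    const uint8_t **out, size_t *out_len)
{
	if (s->rx_syn) {
		*out = s->rx_syn;
		*out_len = s->rx_syn_len;
		return 0;
	}
	if (s->saved_syn) {
		*out = s->saved_syn;
		*out_len = s->saved_syn_len;
		return 0;
	}
	return -ENOENT;	/* assumed error code for "no SYN available" */
}
```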

Similarly, a flag "BPF_LOAD_HDR_OPT_TCP_SYN" is also added to
bpf_load_hdr_opt() to search for a header option in the SYN packet.

* Fastopen

Fastopen should work the same as the regular non-fastopen case.
This is tested in a later patch.

* Syncookie

For syncookie, the later example patch asks the active
side's bpf prog to resend the header options in ACK.  The server
can use bpf_load_hdr_opt() to look at the options in this
received ACK during PASSIVE_ESTABLISHED_CB.
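
The active side's resend behaviour in the syncookie case can be
modelled as a tiny state machine.  This is a userspace sketch only:
struct active_state and the three functions are hypothetical, and how
the server signals syncookie mode to the client is left to the
bpf-progs and is not modelled here.

```c
#include <stdbool.h>

/* Hypothetical per-connection state kept by the active side's bpf-prog. */
struct active_state {
	bool resend_opts;	/* keep writing the bpf options after the SYN */
};

/* SYNACK received: remember whether the server asked for a resend. */
static void on_synack(struct active_state *st, bool server_in_syncookie)
{
	st->resend_opts = server_in_syncookie;
}

/* A later valid packet from the server confirms the options arrived. */
static void on_valid_server_pkt(struct active_state *st)
{
	st->resend_opts = false;
}

/* Called for each outgoing packet: should the bpf options be written? */
static bool should_write_opts(const struct active_state *st, bool is_syn)
{
	return is_syn || st->resend_opts;
}
```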

* Active side

The bpf prog will get a chance to write the bpf header option
in the SYN packet during WRITE_HDR_OPT_CB.  The received SYNACK
pkt will also be available to the bpf prog during the existing
ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data
and bpf_load_hdr_opt().

* Turn off header CB flags after 3WHS

If the bpf prog does not need to write/parse header options
beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags
to avoid being called for header options.
Or the bpf-prog can choose to leave the UNKNOWN_HDR_OPT_CB_FLAG on
so that the kernel will only call it when there is an option that
the kernel cannot handle.
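
The flag arithmetic involved is sketched below in userspace C.  The
DEMO_* bit values and flags_after_3whs() are hypothetical.  In a real
sock_ops program the resulting mask would be applied with the
existing bpf_sock_ops_cb_flags_set() helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only; the real ones live in the uapi bpf.h. */
#define DEMO_PARSE_ALL_HDR_OPT_CB_FLAG		(1u << 0)
#define DEMO_PARSE_UNKNOWN_HDR_OPT_CB_FLAG	(1u << 1)
#define DEMO_WRITE_HDR_OPT_CB_FLAG		(1u << 2)

/*
 * Compute the callback flags to keep once the 3WHS is done: drop all
 * header-option flags, optionally keeping the "unknown option" one.
 * Flags unrelated to header options are left untouched.
 */
static uint32_t flags_after_3whs(uint32_t cb_flags, bool keep_unknown)
{
	const uint32_t hdr_flags = DEMO_PARSE_ALL_HDR_OPT_CB_FLAG |
				   DEMO_PARSE_UNKNOWN_HDR_OPT_CB_FLAG |
				   DEMO_WRITE_HDR_OPT_CB_FLAG;

	cb_flags &= ~hdr_flags;
	if (keep_unknown)
		cb_flags |= DEMO_PARSE_UNKNOWN_HDR_OPT_CB_FLAG;
	return cb_flags;
}
```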

Established Connection
──────────────────────
The bpf prog will be called as long as the parse/write
*_HDR_OPT_CB_FLAG is enabled in bpf_sock_ops_cb_flags.
That allows the bpf prog to parse/write header options
in data, pure-ACK, and FIN packets.

[1]: draft-wang-tcpm-low-latency-opt-00
     https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/bpf-cgroup.h     |  25 +++
 include/linux/filter.h         |   4 +
 include/net/tcp.h              |  49 +++++
 include/uapi/linux/bpf.h       | 228 +++++++++++++++++++-
 net/core/filter.c              | 365 +++++++++++++++++++++++++++++++++
 net/ipv4/tcp_input.c           |  20 +-
 net/ipv4/tcp_minisocks.c       |   1 +
 net/ipv4/tcp_output.c          | 104 +++++++++-
 tools/include/uapi/linux/bpf.h | 228 +++++++++++++++++++-
 9 files changed, 1006 insertions(+), 18 deletions(-)

Message ID	20200803231110.2685038-1-kafai@fb.com
From	Martin KaFai Lau <kafai@fb.com>
Date	Mon, 3 Aug 2020 16:11:10 -0700