[bpf-next,04/10] bpf: tcp: Allow bpf prog to write and parse BPF TCP header option

The earlier BPF-TCP-CC effort allows a TCP congestion control
algorithm to be written in BPF.  It opens up opportunities for
a faster turnaround time in testing new congestion control ideas
and releasing them to production environments.

The same flexibility can be extended to writing TCP header options.
It is not uncommon that people want to test a new TCP header option
to improve TCP performance.  Another use case is a data center,
which has a more controlled environment and more flexibility in
adding header options for internal use only.

For example, we want to test the idea of putting the maximum delayed
ACK in a TCP header option, which is similar to a draft RFC proposal [1].

This patch introduces the necessary BPF API and uses it in the
TCP stack to allow a BPF_PROG_TYPE_SOCK_OPS program to parse
and write TCP header options.  It currently supports most
TCP packet types; RST is the exception.

Header Option Format
────────────────────
The bpf prog will be allowed to write options under kind (254)
which is defined as the experimental TCP options in RFC 6994.
The exact format will be:

 0                  8                 16                             31
┌──────────────────┬─────────────────┬─────────────────────────────────┐
│   Kind: 254      │     length      │      magic: 0xeB9F              │
├──────────────────┴─────────────────┴─────────────────────────────────┤
│                                                                      │
│               BPF program written data                               │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Because the option uses the standard kind 254 with magic 0xeB9F, the
usual tools such as tcpdump and tshark can recognize it.  The kernel
can ensure that the header option format is valid before the packet
goes out on the wire, and it can prevent the bpf program from writing
options that duplicate or conflict with what the kernel TCP stack has
already written.
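
To make the layout above concrete, here is a minimal userspace sketch
that encodes one such option into a byte buffer.  It is not kernel code:
bpf_opt_encode() and the macro names are hypothetical, and the only
facts taken from the text are kind 254, the one-byte length, and the
16-bit magic 0xeB9F (assumed here to be in network byte order, as
RFC 6994 specifies for experiment identifiers).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TCPOPT_EXP      254     /* experimental option kind (RFC 6994) */
#define BPF_OPT_MAGIC   0xeB9F  /* 16-bit magic from the layout above */
#define BPF_OPT_HDR_LEN 4       /* kind + length + 2-byte magic */

/*
 * Hypothetical helper: encode one option into buf.
 * Returns the total option length, or 0 if it does not fit.
 */
static size_t bpf_opt_encode(uint8_t *buf, size_t buf_len,
                             const uint8_t *data, size_t data_len)
{
    size_t total = BPF_OPT_HDR_LEN + data_len;

    /* The length field is a single byte, so it caps the option size. */
    if (total > UINT8_MAX || total > buf_len)
        return 0;

    buf[0] = TCPOPT_EXP;
    buf[1] = (uint8_t)total;                 /* counts kind and length too */
    buf[2] = (uint8_t)(BPF_OPT_MAGIC >> 8);  /* magic, high byte first */
    buf[3] = (uint8_t)(BPF_OPT_MAGIC & 0xff);
    if (data_len)
        memcpy(buf + BPF_OPT_HDR_LEN, data, data_len);
    return total;
}
```

With two bytes of program data the encoded option is six bytes long
and starts with fe 06 eb 9f.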

A similar experimental numbering also exists in other protocols (e.g. IPv6).
Thus, a similar idea (and API) could be extended to other layers/protocols
in the future.

As mentioned above, this patch set does not allow the bpf program to create
its own option "kind".  However, the BPF API for header writing (mainly
the helpers "bpf_reserve_hdr_opt" and "bpf_store_hdr_opt") could be extended
in the future to allow a "raw" mode (e.g. by introducing a new helper
flag).

Sockops Callback Flags
──────────────────────
The header parsing and writing callbacks can be turned on
by setting flags in the existing "tp->bpf_sock_ops_cb_flags".
Two new flags are added: BPF_SOCK_OPS_PARSE_HDR_OPT_CB_FLAG and
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG.
Both are off by default, i.e. the bpf prog will not
be called to parse or write the bpf header option.

3-Way Handshake
───────────────
* Passive side

When writing SYNACK, the received SYN skb will be available to the
bpf prog.  The bpf prog will also know if it is in syncookie
mode (want_cookie) or not.  The bpf prog can use the SYN skb (which
may carry the bpf hdr opt sent from the remote peer) to decide what
bpf header option should be written to the outgoing SYNACK skb.

The bpf prog can store the bpf header option of the received
SYN by using the existing "TCP_SAVE_SYN" setsockopt.
The example in a later patch also uses TCP_SAVE_SYN.
[ Note that the fullsock here is a listen sk.  bpf_sk_storage
  is not very useful here since the listen sk is shared
  by many concurrent connection requests.

  Extending bpf_sk_storage support to request_sock would add weight
  to the minisock, and it is not necessarily better than storing the
  whole ~100 bytes SYN pkt. ]

When the connection is established, the bpf prog will be called
in the existing PASSIVE_ESTABLISHED_CB callback.  At that time,
the bpf prog can get the bpf header option from the saved syn and
then apply the needed operation to the newly established socket.
A later patch uses the max delay ack specified in the SYN
packet as an example.  The received ack (that concludes the 3WHS)
will also be available to the bpf prog during PASSIVE_ESTABLISHED_CB
through sock_ops->skb_data, which could be useful in the
syncookie scenario.

There is an existing getsockopt "TCP_SAVED_SYN" that returns the whole
saved syn, which includes the IP[46] header and the TCP header.
A (new) BPF-only "TCP_BPF_SYN_HDR_OPT" getsockopt is added to get
the bpf header option alone (without the IP and TCP header) from the
saved syn.  The kernel remembers the offset to the bpf header option (i.e.
kind:254, magic:0xeB9F) as it parses the TCP header.  The offset is
stored in the tcp_skb_cb and then also saved in the "struct saved_syn".
The new "TCP_BPF_SYN_HDR_OPT" can therefore return the bpf header option
directly to the bpf prog instead of asking the bpf prog to parse the
IP[46] header and TCP header again in order to get to the bpf header
option.

In the new "TCP_BPF_SYN_HDR_OPT" getsockopt, the kernel will know
where it can get the SYN's bpf header option from:
  - the just received syn (available when the bpf prog is writing SYNACK)
  or
  - the saved syn (available in PASSIVE_ESTABLISHED_CB).

The bpf prog does not need to know where this bpf header option
is obtained from.  The "TCP_BPF_SYN_HDR_OPT" getsockopt
hides these details.
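
The source selection described above can be pictured with a small
userspace model.  Everything here is hypothetical (struct syn_pkt,
struct syn_sources and syn_bpf_hdr_opt() are names made up for the
sketch, and it assumes the returned bytes start at the option's kind
byte); it only illustrates how a remembered offset lets the caller get
the option without re-parsing the IP and TCP headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a SYN whose bpf option offset was recorded
 * while the TCP options were parsed. */
struct syn_pkt {
    const uint8_t *bytes;   /* IP + TCP headers of the SYN */
    size_t len;             /* number of valid bytes */
    size_t bpf_opt_off;     /* offset of kind:254, magic:0xeB9F; 0 = none */
};

struct syn_sources {
    const struct syn_pkt *rx_syn;     /* set while writing the SYNACK */
    const struct syn_pkt *saved_syn;  /* set in PASSIVE_ESTABLISHED_CB */
};

/* Copy only the bpf header option into out.  Returns its length,
 * or -1 when no option is available or out is too small. */
static int syn_bpf_hdr_opt(const struct syn_sources *src,
                           uint8_t *out, size_t out_len)
{
    /* The caller never says which source to use; it is picked here. */
    const struct syn_pkt *syn = src->rx_syn ? src->rx_syn : src->saved_syn;
    size_t opt_len;

    if (!syn || !syn->bpf_opt_off || syn->bpf_opt_off + 2 > syn->len)
        return -1;

    opt_len = syn->bytes[syn->bpf_opt_off + 1];  /* TCP option length byte */
    if (opt_len < 4 || syn->bpf_opt_off + opt_len > syn->len ||
        opt_len > out_len)
        return -1;

    memcpy(out, syn->bytes + syn->bpf_opt_off, opt_len);
    return (int)opt_len;
}
```

The same call works in both callbacks; only the field that is
populated in struct syn_sources differs.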

Fastopen should work the same as the regular non-fastopen case.
This is tested in a later patch.

For syncookie, the later example patch asks the active
side's bpf prog to resend the header options in the ACK.  Please refer
to that later example for its details and limitations.

* Active side

The bpf prog will get a chance to write the bpf header option
in the SYN packet during WRITE_HDR_OPT_CB.  The received SYNACK
pkt will also be available to the bpf prog during the existing
ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data.

In short, regardless of active or passive ESTABLISHED_CB,
the sock_ops->skb_data is always the received skb that
completed the 3WHS.

If the bpf prog does not need to write/parse header options
beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags
to avoid being called for header options.
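
As a sketch of that flag handling, the following userspace model sets
and clears the two header option bits while leaving other callback bits
alone.  The bit positions and the SKETCH_* names are hypothetical; the
real values come from include/uapi/linux/bpf.h.  In an actual sock_ops
program the resulting mask would be applied with the existing
bpf_sock_ops_cb_flags_set() helper.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this sketch. */
#define SKETCH_PARSE_HDR_OPT_CB_FLAG (1u << 0)
#define SKETCH_WRITE_HDR_OPT_CB_FLAG (1u << 1)
#define SKETCH_HDR_OPT_CB_FLAGS \
    (SKETCH_PARSE_HDR_OPT_CB_FLAG | SKETCH_WRITE_HDR_OPT_CB_FLAG)

/* Enable both header option callbacks, keeping other flags as they are. */
static uint32_t hdr_opt_cb_enable(uint32_t cb_flags)
{
    return cb_flags | SKETCH_HDR_OPT_CB_FLAGS;
}

/* Clear both once the 3WHS is done; unrelated callback bits survive. */
static uint32_t hdr_opt_cb_disable(uint32_t cb_flags)
{
    return cb_flags & ~SKETCH_HDR_OPT_CB_FLAGS;
}
```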

Established Connection
──────────────────────
The bpf prog will be called as long as the parse/write
*_HDR_OPT_CB_FLAG is enabled in bpf_sock_ops_cb_flags.
That allows the bpf prog to parse/write header options
in data, pure-ACK, and FIN packets.

Writing BPF Header Option
─────────────────────────
[ bpf prog context: sock_ops ]

When writing a header option, the bpf prog is first called to reserve
the needed number of bytes in the skb during "HDR_OPT_LEN_CB" and
is then called to write the option during "WRITE_HDR_OPT_CB".
During these two write callbacks, sock_ops->skb_* always
represents the outgoing skb.

The bpf prog is expected to use the two (new) helpers,
"bpf_reserve_hdr_opt" and "bpf_store_hdr_opt", to
reserve option space and write the option.
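
The two-phase flow can be modelled in userspace as below.  This is a
sketch under stated assumptions, not the helpers' real behaviour:
sketch_reserve() and sketch_store() are hypothetical stand-ins, and the
rules that a reservation must fit into the 40 bytes of TCP option space
and that a store may not exceed the reservation are assumptions made
for illustration.

```c
#include <assert.h>

#define TCP_MAX_OPT_SPACE 40   /* 60-byte max TCP header minus 20-byte base */

/* Hypothetical bookkeeping for one outgoing packet. */
struct out_pkt {
    unsigned int kernel_opt_len;  /* option bytes the TCP stack already uses */
    unsigned int bpf_reserved;    /* set in the length phase */
    unsigned int bpf_written;     /* set in the write phase */
};

/* Phase 1 ("HDR_OPT_LEN_CB"): ask for len bytes of option space. */
static int sketch_reserve(struct out_pkt *p, unsigned int len)
{
    assert(p->kernel_opt_len <= TCP_MAX_OPT_SPACE);
    if (len > TCP_MAX_OPT_SPACE - p->kernel_opt_len)
        return -1;              /* would not fit into the TCP header */
    p->bpf_reserved = len;
    return 0;
}

/* Phase 2 ("WRITE_HDR_OPT_CB"): write len bytes into the reserved space. */
static int sketch_store(struct out_pkt *p, unsigned int len)
{
    if (len > p->bpf_reserved)
        return -1;              /* nothing beyond the reservation */
    p->bpf_written = len;
    return 0;
}
```

In this model a store that comes before any reservation fails, which
mirrors the ordering of the two callbacks.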

In cgroup MULTI mode, the max reserved space among multiple bpf progs
will be used.  e.g. if prog#1 reserves 8 bytes and a later prog#2
reserves 4 bytes, 8 bytes will be reserved.
When multiple bpf progs write the bpf header option, the last
prog's header option will be used.  The "bpf_store_hdr_opt"
helper will take care of the TCP header option's kind-length.
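
The two MULTI mode rules can be written down as a short userspace
sketch.  The function names are hypothetical; the logic is just "take
the largest reservation" and "the last prog that wrote wins", as
described above.

```c
#include <assert.h>
#include <stddef.h>

/* Space reserved when several progs run: the largest single request. */
static unsigned int multi_reserved_len(const unsigned int *req,
                                       size_t nr_progs)
{
    unsigned int max = 0;

    for (size_t i = 0; i < nr_progs; i++)
        if (req[i] > max)
            max = req[i];
    return max;
}

/* Index of the prog whose option ends up in the packet: the last one
 * that wrote.  Returns -1 if no prog wrote an option. */
static int multi_winning_prog(const int *wrote, size_t nr_progs)
{
    int winner = -1;

    for (size_t i = 0; i < nr_progs; i++)
        if (wrote[i])
            winner = (int)i;
    return winner;
}
```

For the example in the text, requests of 8 and 4 bytes give a
reservation of 8 bytes.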

When writing the header in "WRITE_HDR_OPT_CB", sock_ops->skb_data
points to the outgoing skb.  If there is a need, the bpf prog
can inspect what has been written to the header.
sock_ops->skb_bpf_hdr_opt_off also provides an offset to the
beginning of the bpf header option (i.e. the beginning of
kind:254, magic:0xeB9F).
However, during option space reservation in "HDR_OPT_LEN_CB",
sock_ops->skb_data does not have the tcp header because
the header has not been written yet.

Parsing BPF Header Option
─────────────────────────

As mentioned earlier, the received SYN/SYNACK/ACK during the 3WHS
will be available to some specific CB (e.g. the *_ESTABLISHED_CB).

For an established connection, if the kernel finds a bpf header
option (i.e. an option with kind:254 and magic:0xeB9F) and
the "PARSE_HDR_OPT_CB_FLAG" flag is set, the
bpf prog will be called in the "BPF_SOCK_OPS_PARSE_HDR_OPT_CB" op.
The received skb will be available through sock_ops->skb_data
and the bpf header option offset will also be specified
in sock_ops->skb_bpf_hdr_opt_off.
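
To illustrate how such an offset can be found, here is a minimal
userspace sketch that walks a TCP options area and returns the offset
of the first option with kind 254 and magic 0xeB9F.  It is not the
kernel's parser: find_bpf_hdr_opt() is a hypothetical name, and the
offset it returns is relative to the start of the options area, which
is an assumption of this sketch and may differ from how
skb_bpf_hdr_opt_off is defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TCPOPT_EOL    0       /* end of option list */
#define TCPOPT_NOP    1       /* one-byte padding */
#define TCPOPT_EXP    254     /* experimental option kind */
#define BPF_OPT_MAGIC 0xeB9F

/* Returns the offset of the bpf option within opts, or -1 if absent. */
static int find_bpf_hdr_opt(const uint8_t *opts, size_t len)
{
    size_t off = 0;

    while (off < len) {
        uint8_t kind = opts[off];
        uint8_t opt_len;

        if (kind == TCPOPT_EOL)
            break;
        if (kind == TCPOPT_NOP) {
            off++;
            continue;
        }
        if (off + 1 >= len)
            break;                  /* no room for the length byte */
        opt_len = opts[off + 1];
        if (opt_len < 2 || off + opt_len > len)
            break;                  /* malformed or truncated option */
        if (kind == TCPOPT_EXP && opt_len >= 4 &&
            ((opts[off + 2] << 8) | opts[off + 3]) == BPF_OPT_MAGIC)
            return (int)off;
        off += opt_len;             /* skip to the next option */
    }
    return -1;
}
```

An experimental option that carries a different magic is skipped like
any other unknown option.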

[1]: draft-wang-tcpm-low-latency-opt-00
     https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/bpf-cgroup.h     |  25 ++++
 include/linux/filter.h         |   6 +
 include/net/tcp.h              |  53 ++++++++-
 include/uapi/linux/bpf.h       | 187 +++++++++++++++++++++++++++++-
 net/core/filter.c              | 202 +++++++++++++++++++++++++++++++++
 net/ipv4/tcp_fastopen.c        |   2 +-
 net/ipv4/tcp_input.c           |  79 ++++++++++++-
 net/ipv4/tcp_ipv4.c            |   3 +-
 net/ipv4/tcp_minisocks.c       |   1 +
 net/ipv4/tcp_output.c          | 186 +++++++++++++++++++++++++++++-
 net/ipv6/tcp_ipv6.c            |   3 +-
 tools/include/uapi/linux/bpf.h | 187 +++++++++++++++++++++++++++++-
 12 files changed, 914 insertions(+), 20 deletions(-)

Message-ID: <20200626175526.1461133-1-kafai@fb.com>
In-Reply-To: <20200626175501.1459961-1-kafai@fb.com>
From: Martin KaFai Lau <kafai@fb.com>
To: <bpf@vger.kernel.org>
Date: Fri, 26 Jun 2020 10:55:26 -0700
Subject: [PATCH bpf-next 04/10] bpf: tcp: Allow bpf prog to write and parse BPF TCP header option