From patchwork Wed Mar 18 09:43:05 2020
X-Patchwork-Submitter: Ilpo Järvinen
X-Patchwork-Id: 222318
From: Ilpo Järvinen
To: netdev@vger.kernel.org
Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans
Subject: [RFC PATCH 01/28] tcp: add tp to prepare for AccECN code
Date: Wed, 18 Mar 2020 11:43:05 +0200
Message-Id: <1584524612-24470-2-git-send-email-ilpo.jarvinen@helsinki.fi>
In-Reply-To: <1584524612-24470-1-git-send-email-ilpo.jarvinen@helsinki.fi>
References:
<1584524612-24470-1-git-send-email-ilpo.jarvinen@helsinki.fi>

From: Ilpo Järvinen

No functional changes.

Signed-off-by: Ilpo Järvinen
---
 net/ipv4/tcp_input.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 6b6b57000dad..3525ec8edd54 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -254,8 +254,10 @@ static void tcp_ecn_queue_cwr(struct tcp_sock *tp)
 
 static void tcp_ecn_accept_cwr(struct sock *sk, const struct sk_buff *skb)
 {
+	struct tcp_sock *tp = tcp_sk(sk);
+
 	if (tcp_hdr(skb)->cwr) {
-		tcp_sk(sk)->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
+		tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
 
 		/* If the sender is telling us it has entered CWR, then its
 		 * cwnd may be very low (even just 1 packet), so we should ACK

From patchwork Wed Mar 18 09:43:09 2020
X-Patchwork-Submitter: Ilpo Järvinen
X-Patchwork-Id: 222320
From: Ilpo Järvinen
To: netdev@vger.kernel.org
Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans
Subject: [RFC PATCH 05/28] tcp: extend TCP flags to allow AE bit/ACE field
Date: Wed, 18 Mar 2020 11:43:09 +0200
Message-Id: <1584524612-24470-6-git-send-email-ilpo.jarvinen@helsinki.fi>
In-Reply-To: <1584524612-24470-1-git-send-email-ilpo.jarvinen@helsinki.fi>

With AccECN, there is one additional TCP header flag to be used (AE), and an ACE field that overloads the definitions of the AE, CWR, and ECE flags. As tcp_flags was previously only one byte, byte-order handling needs to be added. Reorganize the bitfields so that the structure does not grow.
Signed-off-by: Ilpo Järvinen --- include/net/tcp.h | 13 ++++++++----- include/uapi/linux/tcp.h | 9 ++++++--- net/ipv4/tcp_input.c | 3 ++- net/ipv4/tcp_ipv4.c | 3 ++- net/ipv4/tcp_output.c | 4 ++-- net/ipv6/tcp_ipv6.c | 3 ++- net/netfilter/nf_log_common.c | 4 +++- 7 files changed, 25 insertions(+), 14 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index b97af0ff118f..6b6a1b8b3c6e 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -791,7 +791,10 @@ static inline u64 tcp_skb_timestamp_us(const struct sk_buff *skb) #define TCPHDR_URG 0x20 #define TCPHDR_ECE 0x40 #define TCPHDR_CWR 0x80 +#define TCPHDR_AE 0x100 +#define TCPHDR_FLAGS_MASK 0x1ff +#define TCPHDR_ACE (TCPHDR_ECE | TCPHDR_CWR | TCPHDR_AE) #define TCPHDR_SYN_ECN (TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR) /* This is what the send packet queuing engine uses to pass @@ -816,7 +819,11 @@ struct tcp_skb_cb { u16 tcp_gso_size; }; }; - __u8 tcp_flags; /* TCP header flags. (tcp[13]) */ + __u16 tcp_flags:9, /* TCP header flags. (tcp[12-13]) */ + unused:4, + txstamp_ack:1, /* Record TX timestamp for ack? */ + eor:1, /* Is skb MSG_EOR marked? */ + has_rxtstamp:1; /* SKB has a RX timestamp */ __u8 sacked; /* State flags for SACK. */ #define TCPCB_SACKED_ACKED 0x01 /* SKB ACK'd by a SACK block */ @@ -829,10 +836,6 @@ struct tcp_skb_cb { TCPCB_REPAIRED) __u8 ip_dsfield; /* IPv4 tos or IPv6 dsfield */ - __u8 txstamp_ack:1, /* Record TX timestamp for ack? */ - eor:1, /* Is skb MSG_EOR marked? 
*/ - has_rxtstamp:1, /* SKB has a RX timestamp */ - unused:5; __u32 ack_seq; /* Sequence number ACK'd */ union { struct { diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h index 1a7fc856e237..7a4bb72d7074 100644 --- a/include/uapi/linux/tcp.h +++ b/include/uapi/linux/tcp.h @@ -28,7 +28,8 @@ struct tcphdr { __be32 seq; __be32 ack_seq; #if defined(__LITTLE_ENDIAN_BITFIELD) - __u16 res1:4, + __u16 ae:1, + res1:3, doff:4, fin:1, syn:1, @@ -40,7 +41,8 @@ struct tcphdr { cwr:1; #elif defined(__BIG_ENDIAN_BITFIELD) __u16 doff:4, - res1:4, + res1:3, + ae:1, cwr:1, ece:1, urg:1, @@ -70,6 +72,7 @@ union tcp_word_hdr { #define tcp_flag_word(tp) ( ((union tcp_word_hdr *)(tp))->words [3]) enum { + TCP_FLAG_AE = __constant_cpu_to_be32(0x01000000), TCP_FLAG_CWR = __constant_cpu_to_be32(0x00800000), TCP_FLAG_ECE = __constant_cpu_to_be32(0x00400000), TCP_FLAG_URG = __constant_cpu_to_be32(0x00200000), @@ -78,7 +81,7 @@ enum { TCP_FLAG_RST = __constant_cpu_to_be32(0x00040000), TCP_FLAG_SYN = __constant_cpu_to_be32(0x00020000), TCP_FLAG_FIN = __constant_cpu_to_be32(0x00010000), - TCP_RESERVED_BITS = __constant_cpu_to_be32(0x0F000000), + TCP_RESERVED_BITS = __constant_cpu_to_be32(0x0E000000), TCP_DATA_OFFSET = __constant_cpu_to_be32(0xF0000000) }; diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 7c444541cefd..e49a6f7ad5ce 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -6473,7 +6473,8 @@ static void tcp_ecn_create_request(struct request_sock *req, ecn_ok_dst = dst_feature(dst, DST_FEATURE_ECN_MASK); ecn_ok = net->ipv4.sysctl_tcp_ecn || ecn_ok_dst; - if (((!ect || th->res1) && ecn_ok) || tcp_ca_needs_ecn(listen_sk) || + if (((!ect || th->res1 || th->ae) && ecn_ok) || + tcp_ca_needs_ecn(listen_sk) || (ecn_ok_dst & DST_FEATURE_ECN_CA) || tcp_bpf_ca_needs_ecn((struct sock *)req)) inet_rsk(req)->ecn_ok = 1; diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 52acf0bc2ee5..f18a2fbf2761 100644 --- a/net/ipv4/tcp_ipv4.c +++ 
b/net/ipv4/tcp_ipv4.c @@ -1850,7 +1850,8 @@ static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph, TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin + skb->len - th->doff * 4); TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq); - TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th); + TCP_SKB_CB(skb)->tcp_flags = ntohs(*(__be16 *)&tcp_flag_word(th)) & + TCPHDR_FLAGS_MASK; TCP_SKB_CB(skb)->tcp_tw_isn = 0; TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph); TCP_SKB_CB(skb)->sacked = 0; diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 306e25d743e8..dc225e616f98 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -389,7 +389,7 @@ static void tcp_ecn_send(struct sock *sk, struct sk_buff *skb, /* Constructs common control bits of non-data skb. If SYN/FIN is present, * auto increment end seqno. */ -static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags) +static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u16 flags) { skb->ip_summed = CHECKSUM_PARTIAL; @@ -1167,7 +1167,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, th->seq = htonl(tcb->seq); th->ack_seq = htonl(rcv_nxt); *(((__be16 *)th) + 6) = htons(((tcp_header_size >> 2) << 12) | - tcb->tcp_flags); + (tcb->tcp_flags & TCPHDR_FLAGS_MASK)); th->check = 0; th->urg_ptr = 0; diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index eaf09e6b7844..73032cd261ea 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1528,7 +1528,8 @@ static void tcp_v6_fill_cb(struct sk_buff *skb, const struct ipv6hdr *hdr, TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin + skb->len - th->doff*4); TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq); - TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th); + TCP_SKB_CB(skb)->tcp_flags = ntohs(*(__be16 *)&tcp_flag_word(th)) & + TCPHDR_FLAGS_MASK; TCP_SKB_CB(skb)->tcp_tw_isn = 0; TCP_SKB_CB(skb)->ip_dsfield = ipv6_get_dsfield(hdr); TCP_SKB_CB(skb)->sacked = 0; diff 
--git a/net/netfilter/nf_log_common.c b/net/netfilter/nf_log_common.c
index ae5628ddbe6d..1457ead432eb 100644
--- a/net/netfilter/nf_log_common.c
+++ b/net/netfilter/nf_log_common.c
@@ -84,7 +84,9 @@ int nf_log_dump_tcp_header(struct nf_log_buf *m, const struct sk_buff *skb,
 	/* Max length: 9 "RES=0x3C " */
 	nf_log_buf_add(m, "RES=0x%02x ", (u_int8_t)(ntohl(tcp_flag_word(th) &
 					TCP_RESERVED_BITS) >> 22));
-	/* Max length: 32 "CWR ECE URG ACK PSH RST SYN FIN " */
+	/* Max length: 35 "AE CWR ECE URG ACK PSH RST SYN FIN " */
+	if (th->ae)
+		nf_log_buf_add(m, "AE ");
 	if (th->cwr)
 		nf_log_buf_add(m, "CWR ");
 	if (th->ece)

From patchwork Wed Mar 18 09:43:14 2020
X-Patchwork-Submitter: Ilpo Järvinen
X-Patchwork-Id: 222316
From: Ilpo Järvinen
To: netdev@vger.kernel.org
Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans
Subject: [RFC PATCH 10/28] gro: prevent ACE field corruption & better AccECN handling
Date: Wed, 18 Mar 2020 11:43:14 +0200
Message-Id: <1584524612-24470-11-git-send-email-ilpo.jarvinen@helsinki.fi>
In-Reply-To: <1584524612-24470-1-git-send-email-ilpo.jarvinen@helsinki.fi>

There are important differences in how the CWR field behaves in RFC 3168 and AccECN. Thus, it is better never to let anything receive an skb with mixed CWR values. Set the Accurate ECN GSO flag to avoid corrupting the CWR bits during segmentation.
Signed-off-by: Ilpo Järvinen
---
 net/ipv4/tcp_offload.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 7a81cf438010..58ce382c793e 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -242,7 +242,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb)
 	flush = NAPI_GRO_CB(p)->flush;
 	flush |= (__force int)(flags & TCP_FLAG_CWR);
 	flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
-			       ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
+			       ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
 	flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
 	for (i = sizeof(*th); i < thlen; i += 4)
 		flush |= *(u32 *)((u8 *)th + i) ^
@@ -300,7 +300,7 @@ int tcp_gro_complete(struct sk_buff *skb)
 	skb_shinfo(skb)->gso_segs = NAPI_GRO_CB(skb)->count;
 
 	if (th->cwr)
-		skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
+		skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ACCECN;
 
 	return 0;
 }

From patchwork Wed Mar 18 09:43:17 2020
X-Patchwork-Submitter: Ilpo Järvinen
X-Patchwork-Id: 222323
From: Ilpo Järvinen
To: netdev@vger.kernel.org
Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans
Subject: [RFC PATCH 13/28] tcp: Pass flags to tcp_send_ack
Date: Wed, 18 Mar 2020 11:43:17 +0200
Message-Id: <1584524612-24470-14-git-send-email-ilpo.jarvinen@helsinki.fi>
In-Reply-To: <1584524612-24470-1-git-send-email-ilpo.jarvinen@helsinki.fi>

The Accurate ECN reflector needs to send custom TCP header flags on its ACKs in order to reflect the IP ECN field.
Signed-off-by: Ilpo Järvinen --- include/net/tcp.h | 4 ++-- net/ipv4/bpf_tcp_ca.c | 2 +- net/ipv4/tcp.c | 2 +- net/ipv4/tcp_dctcp.h | 2 +- net/ipv4/tcp_input.c | 14 +++++++------- net/ipv4/tcp_output.c | 10 +++++----- net/ipv4/tcp_timer.c | 4 ++-- 7 files changed, 19 insertions(+), 19 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index ee938066fd8c..ddeb11c01faa 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -640,8 +640,8 @@ void tcp_send_fin(struct sock *sk); void tcp_send_active_reset(struct sock *sk, gfp_t priority); int tcp_send_synack(struct sock *); void tcp_push_one(struct sock *, unsigned int mss_now); -void __tcp_send_ack(struct sock *sk, u32 rcv_nxt); -void tcp_send_ack(struct sock *sk); +void __tcp_send_ack(struct sock *sk, u32 rcv_nxt, u16 flags); +void tcp_send_ack(struct sock *sk, u16 flags); void tcp_send_delayed_ack(struct sock *sk); void tcp_send_loss_probe(struct sock *sk); bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto); diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c index 574972bc7299..55b78183fbd9 100644 --- a/net/ipv4/bpf_tcp_ca.c +++ b/net/ipv4/bpf_tcp_ca.c @@ -146,7 +146,7 @@ static int bpf_tcp_ca_btf_struct_access(struct bpf_verifier_log *log, BPF_CALL_2(bpf_tcp_send_ack, struct tcp_sock *, tp, u32, rcv_nxt) { /* bpf_tcp_ca prog cannot have NULL tp */ - __tcp_send_ack((struct sock *)tp, rcv_nxt); + __tcp_send_ack((struct sock *)tp, rcv_nxt, 0); return 0; } diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 2ee1e4794c7d..edc03a1bf704 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1572,7 +1572,7 @@ static void tcp_cleanup_rbuf(struct sock *sk, int copied) } } if (time_to_ack) - tcp_send_ack(sk); + tcp_send_ack(sk, 0); } static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off) diff --git a/net/ipv4/tcp_dctcp.h b/net/ipv4/tcp_dctcp.h index d69a77cbd0c7..4b0259111d81 100644 --- a/net/ipv4/tcp_dctcp.h +++ b/net/ipv4/tcp_dctcp.h @@ -28,7 +28,7 @@ static inline void 
dctcp_ece_ack_update(struct sock *sk, enum tcp_ca_event evt, */ if (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER) { dctcp_ece_ack_cwr(sk, *ce_state); - __tcp_send_ack(sk, *prior_rcv_nxt); + __tcp_send_ack(sk, *prior_rcv_nxt, 0); } inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW; } diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 65bbfadbee67..dbe70a114b1d 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3512,7 +3512,7 @@ static void tcp_send_challenge_ack(struct sock *sk, const struct sk_buff *skb) if (count > 0) { WRITE_ONCE(challenge_count, count - 1); NET_INC_STATS(net, LINUX_MIB_TCPCHALLENGEACK); - tcp_send_ack(sk); + tcp_send_ack(sk, 0); } } @@ -4255,12 +4255,12 @@ void tcp_fin(struct sock *sk) * happens, we must ack the received FIN and * enter the CLOSING state. */ - tcp_send_ack(sk); + tcp_send_ack(sk, 0); tcp_set_state(sk, TCP_CLOSING); break; case TCP_FIN_WAIT2: /* Received a FIN -- send ACK and enter TIME_WAIT. */ - tcp_send_ack(sk); + tcp_send_ack(sk, 0); tcp_time_wait(sk, TCP_TIME_WAIT, 0); break; default: @@ -4367,7 +4367,7 @@ static void tcp_send_dupack(struct sock *sk, const struct sk_buff *skb) } } - tcp_send_ack(sk); + tcp_send_ack(sk, 0); } /* These routines update the SACK block as out-of-order packets arrive or @@ -4427,7 +4427,7 @@ static void tcp_sack_new_ofo_skb(struct sock *sk, u32 seq, u32 end_seq) */ if (this_sack >= TCP_NUM_SACKS) { if (tp->compressed_ack > TCP_FASTRETRANS_THRESH) - tcp_send_ack(sk); + tcp_send_ack(sk, 0); this_sack--; tp->rx_opt.num_sacks--; sp--; @@ -5331,7 +5331,7 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible) /* Protocol state mandates a one-time immediate ACK */ inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOW) { send_now: - tcp_send_ack(sk); + tcp_send_ack(sk, 0); return; } @@ -6126,7 +6126,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb, tcp_drop(sk, skb); return 0; } else { - tcp_send_ack(sk); + tcp_send_ack(sk, 0); } return 
-1; } diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index a1414d1a8ef1..c8d0a7baf2d4 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3741,7 +3741,7 @@ void tcp_send_delayed_ack(struct sock *sk) */ if (icsk->icsk_ack.blocked || time_before_eq(icsk->icsk_ack.timeout, jiffies + (ato >> 2))) { - tcp_send_ack(sk); + tcp_send_ack(sk, 0); return; } @@ -3754,7 +3754,7 @@ void tcp_send_delayed_ack(struct sock *sk) } /* This routine sends an ack and also updates the window. */ -void __tcp_send_ack(struct sock *sk, u32 rcv_nxt) +void __tcp_send_ack(struct sock *sk, u32 rcv_nxt, u16 flags) { struct sk_buff *buff; @@ -3778,7 +3778,7 @@ void __tcp_send_ack(struct sock *sk, u32 rcv_nxt) /* Reserve space for headers and prepare control bits. */ skb_reserve(buff, MAX_TCP_HEADER); - tcp_init_nondata_skb(buff, tcp_acceptable_seq(sk), TCPHDR_ACK); + tcp_init_nondata_skb(buff, tcp_acceptable_seq(sk), TCPHDR_ACK | flags); /* We do not want pure acks influencing TCP Small Queues or fq/pacing * too much. 
@@ -3791,9 +3791,9 @@ void __tcp_send_ack(struct sock *sk, u32 rcv_nxt)
 }
 EXPORT_SYMBOL_GPL(__tcp_send_ack);
 
-void tcp_send_ack(struct sock *sk)
+void tcp_send_ack(struct sock *sk, u16 flags)
 {
-	__tcp_send_ack(sk, tcp_sk(sk)->rcv_nxt);
+	__tcp_send_ack(sk, tcp_sk(sk)->rcv_nxt, flags);
 }
 
 /* This routine sends a packet with an out of date sequence
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index c3f26dcd6704..f37289216d37 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -302,7 +302,7 @@ void tcp_delack_timer_handler(struct sock *sk)
 		icsk->icsk_ack.ato = TCP_ATO_MIN;
 	}
 	tcp_mstamp_refresh(tcp_sk(sk));
-	tcp_send_ack(sk);
+	tcp_send_ack(sk, 0);
 	__NET_INC_STATS(sock_net(sk), LINUX_MIB_DELAYEDACKS);
 }
 
@@ -754,7 +754,7 @@ static enum hrtimer_restart tcp_compressed_ack_kick(struct hrtimer *timer)
 	bh_lock_sock(sk);
 	if (!sock_owned_by_user(sk)) {
 		if (tp->compressed_ack > TCP_FASTRETRANS_THRESH)
-			tcp_send_ack(sk);
+			tcp_send_ack(sk, 0);
 	} else {
 		if (!test_and_set_bit(TCP_DELACK_TIMER_DEFERRED,
				      &sk->sk_tsq_flags))

From patchwork Wed Mar 18 09:43:18 2020
X-Patchwork-Submitter: Ilpo Järvinen
X-Patchwork-Id: 222321
From: Ilpo Järvinen
To: netdev@vger.kernel.org
Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans
Subject: [RFC PATCH 14/28] tcp: AccECN negotiation
Date: Wed, 18 Mar 2020 11:43:18 +0200
Message-Id: <1584524612-24470-15-git-send-email-ilpo.jarvinen@helsinki.fi>
In-Reply-To: <1584524612-24470-1-git-send-email-ilpo.jarvinen@helsinki.fi>

Signed-off-by: Olivier Tilmans
Signed-off-by: Ilpo Järvinen
---
 include/linux/tcp.h      |   6 ++
 include/net/tcp.h        |  40 ++++++++++++-
 net/ipv4/syncookies.c    |  12 ++++
 net/ipv4/tcp.c           |   1 +
 net/ipv4/tcp_input.c     | 126 +++++++++++++++++++++++++++++++++++----
 net/ipv4/tcp_ipv4.c      |   3 +-
 net/ipv4/tcp_minisocks.c |  51 ++++++++++++++--
 net/ipv4/tcp_output.c    |  68 ++++++++++++++++-----
 net/ipv6/syncookies.c    |   1 +
 net/ipv6/tcp_ipv6.c      |   1 +
 10 files changed, 272 insertions(+), 37 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 9bdf67dd0d1d..a6b1d150cb05 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -149,6 +149,9 @@ struct tcp_request_sock {
 	u64		snt_synack; /* first SYNACK sent time */
 	bool		tfo_listener;
 	bool		is_mptcp;
+	u8		accecn_ok : 1,
+
syn_ect_snt: 2,
+			syn_ect_rcv: 2;
 	u32		txhash;
 	u32		rcv_isn;
 	u32		snt_isn;
@@ -246,6 +249,9 @@ struct tcp_sock {
 	} rack;
 	u16	advmss;		/* Advertised MSS */
 	u8	compressed_ack;
+	u8	syn_ect_snt:2,	/* AccECN ECT memory, only */
+		syn_ect_rcv:2,	/* ... needed during 3WHS + first seqno */
+		ecn_fail:1;	/* ECN reflector detected path mangling */
 	u32	chrono_start;	/* Start time in jiffies of a TCP chrono */
 	u32	chrono_stat[3];	/* Time in jiffies for chrono_stat stats */
 	u8	chrono_type:2,	/* current chronograph type */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index ddeb11c01faa..e1dd06bbb472 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -409,11 +409,28 @@ static inline u8 tcp_accecn_ace(const struct tcphdr *th)
 	return (th->ae << 2) | (th->cwr << 1) | th->ece;
 }
 
+/* Infer the ECT value our SYN arrived with from the echoed ACE field */
+static inline int tcp_accecn_extract_syn_ect(u8 ace)
+{
+	if (ace & 0x1)
+		return INET_ECN_ECT_1;
+	if (!(ace & 0x2))
+		return INET_ECN_ECT_0;
+	if (ace & 0x4)
+		return INET_ECN_CE;
+	return INET_ECN_NOT_ECT;
+}
+
 static inline u32 tcp_accecn_ace_deficit(const struct tcp_sock *tp)
 {
 	return tp->received_ce - tp->received_ce_tx;
 }
 
+bool tcp_accecn_validate_syn_feedback(struct sock *sk, u8 ace, u8 sent_ect);
+void tcp_accecn_third_ack(struct sock *sk, const struct sk_buff *skb,
+			  u8 syn_ect_snt);
+void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb);
+
 enum tcp_tw_status {
	TCP_TW_SUCCESS = 0,
	TCP_TW_RST = 1,
@@ -604,6 +621,7 @@ bool cookie_timestamp_decode(const struct net *net,
			     struct tcp_options_received *opt);
 bool cookie_ecn_ok(const struct tcp_options_received *opt,
		   const struct net *net, const struct dst_entry *dst);
+bool cookie_accecn_ok(const struct tcphdr *th);
 
 /* From net/ipv6/syncookies.c */
 int __cookie_v6_check(const struct ipv6hdr *iph, const struct tcphdr *th,
@@ -842,6 +860,7 @@ static inline u64 tcp_skb_timestamp_us(const struct sk_buff *skb)
 #define TCPHDR_ACE (TCPHDR_ECE | TCPHDR_CWR |
TCPHDR_AE) #define TCPHDR_SYN_ECN (TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR) +#define TCPHDR_SYNACK_ACCECN (TCPHDR_SYN | TCPHDR_ACK | TCPHDR_CWR) #define TCP_ACCECN_CEP_ACE_MASK 0x7 #define TCP_ACCECN_ACE_MAX_DELTA 6 @@ -926,6 +945,15 @@ struct tcp_skb_cb { #define TCP_SKB_CB(__skb) ((struct tcp_skb_cb *)&((__skb)->cb[0])) +static inline u16 tcp_accecn_reflector_flags(u8 ect) +{ + u32 flags = ect + 2; + + if (ect == 3) + flags++; + return flags * TCPHDR_ECE; +} + static inline void bpf_compute_data_end_sk_skb(struct sk_buff *skb) { TCP_SKB_CB(skb)->bpf.data_end = skb->data + skb_headlen(skb); @@ -1059,7 +1087,10 @@ enum tcp_ca_ack_event_flags { #define TCP_CONG_NON_RESTRICTED 0x1 /* Requires ECN/ECT set on all packets */ #define TCP_CONG_NEEDS_ECN 0x2 -#define TCP_CONG_MASK (TCP_CONG_NON_RESTRICTED | TCP_CONG_NEEDS_ECN) +/* Require successfully negotiated AccECN capability */ +#define TCP_CONG_NEEDS_ACCECN 0x4 +#define TCP_CONG_MASK (TCP_CONG_NON_RESTRICTED | TCP_CONG_NEEDS_ECN | \ + TCP_CONG_NEEDS_ACCECN) union tcp_cc_info; @@ -1173,6 +1204,13 @@ static inline bool tcp_ca_needs_ecn(const struct sock *sk) return icsk->icsk_ca_ops->flags & TCP_CONG_NEEDS_ECN; } +static inline bool tcp_ca_needs_accecn(const struct sock *sk) +{ + const struct inet_connection_sock *icsk = inet_csk(sk); + + return icsk->icsk_ca_ops->flags & TCP_CONG_NEEDS_ACCECN; +} + static inline void tcp_set_ca_state(struct sock *sk, const u8 ca_state) { struct inet_connection_sock *icsk = inet_csk(sk); diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c index 9a4f6b16c9bc..1d1ded175042 100644 --- a/net/ipv4/syncookies.c +++ b/net/ipv4/syncookies.c @@ -276,6 +276,17 @@ bool cookie_ecn_ok(const struct tcp_options_received *tcp_opt, } EXPORT_SYMBOL(cookie_ecn_ok); +/* §4.1 [...] a server can determine that it negotiated AccECN as + * [...] if the ACK contains an ACE field with the value 0b010 to + * 0b111 (decimal 2 to 7). 
+ */ +/* We don't need to check for > 7 as ACE is on 3 bits */ +bool cookie_accecn_ok(const struct tcphdr *th) +{ + return tcp_accecn_ace(th) > 0x1; +} +EXPORT_SYMBOL(cookie_accecn_ok); + /* On input, sk is a listener. * Output is listener if incoming packet would not create a child * NULL if memory could not be allocated. @@ -398,6 +409,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb) ireq->rcv_wscale = rcv_wscale; ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk), &rt->dst); + treq->accecn_ok = ireq->ecn_ok && cookie_accecn_ok(th); ret = tcp_get_cookie_sock(sk, skb, req, &rt->dst, tsoff); /* ip_queue_xmit() depends on our flow being setup diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index edc03a1bf704..624dff543301 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2624,6 +2624,7 @@ int tcp_disconnect(struct sock *sk, int flags) tp->window_clamp = 0; tp->delivered = 0; tp->delivered_ce = 0; + tp->ecn_fail = 0; tcp_accecn_init_counters(tp); tcp_set_ca_state(sk, TCP_CA_Open); tp->is_sack_reneg = 0; diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index dbe70a114b1d..bf307be4c659 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -308,14 +308,89 @@ static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb) } } -static void tcp_ecn_rcv_synack(struct tcp_sock *tp, const struct tcphdr *th) +/* §3.1.2 If a TCP server that implements AccECN receives a SYN with the three + * TCP header flags (AE, CWR and ECE) set to any combination other than 000, + * 011 or 111, it MUST negotiate the use of AccECN as if they had been set to + * 111. 
+ */ +static inline bool tcp_accecn_syn_requested(const struct tcphdr *th) +{ + u8 ace = tcp_accecn_ace(th); + + return ace && ace != 0x3; +} + +/* Check ECN field transition to detect invalid transitions */ +static bool tcp_ect_transition_valid(u8 snt, u8 rcv) +{ + if (rcv == snt) + return true; + + /* Non-ECT altered to something or something became non-ECT */ + if ((snt == INET_ECN_NOT_ECT) || (rcv == INET_ECN_NOT_ECT)) + return false; + /* CE -> ECT(0/1)? */ + if (snt == INET_ECN_CE) + return false; + return true; +} + +bool tcp_accecn_validate_syn_feedback(struct sock *sk, u8 ace, u8 sent_ect) +{ + struct tcp_sock *tp = tcp_sk(sk); + u8 ect = tcp_accecn_extract_syn_ect(ace); + + if (!sock_net(sk)->ipv4.sysctl_tcp_ecn_fallback) + return true; + + if (!tcp_ect_transition_valid(sent_ect, ect)) { + tp->ecn_fail = 1; + return false; + } + + return true; +} + +/* See Table 2 of the AccECN draft */ +static void tcp_ecn_rcv_synack(struct sock *sk, const struct tcphdr *th, + u8 ip_dsfield) { - if (tcp_ecn_mode_rfc3168(tp) && (!th->ece || th->cwr)) + struct tcp_sock *tp = tcp_sk(sk); + u8 ace = tcp_accecn_ace(th); + + switch (ace) { + case 0x0: + case 0x7: tcp_ecn_mode_set(tp, TCP_ECN_DISABLED); + break; + case 0x1: + case 0x5: + if (tcp_ecn_mode_pending(tp)) + /* Downgrade from AccECN, or requested initially */ + tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168); + break; + default: + tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN); + tp->syn_ect_rcv = ip_dsfield & INET_ECN_MASK; + if (tcp_accecn_validate_syn_feedback(sk, ace, tp->syn_ect_snt) && + INET_ECN_is_ce(ip_dsfield)) + tp->received_ce++; + break; + } } -static void tcp_ecn_rcv_syn(struct tcp_sock *tp, const struct tcphdr *th) +static void tcp_ecn_rcv_syn(struct tcp_sock *tp, const struct tcphdr *th, + const struct sk_buff *skb) { + if (tcp_ecn_mode_pending(tp)) { + if (!tcp_accecn_syn_requested(th)) { + /* Downgrade to classic ECN feedback */ + tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168); + } else { + tp->syn_ect_rcv = 
TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK; + tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN); + } + } if (tcp_ecn_mode_rfc3168(tp) && (!th->ece || !th->cwr)) tcp_ecn_mode_set(tp, TCP_ECN_DISABLED); } @@ -3484,7 +3559,8 @@ bool tcp_oow_rate_limited(struct net *net, const struct sk_buff *skb, } /* RFC 5961 7 [ACK Throttling] */ -static void tcp_send_challenge_ack(struct sock *sk, const struct sk_buff *skb) +static void tcp_send_challenge_ack(struct sock *sk, const struct sk_buff *skb, + bool accecn_reflector) { /* unprotected vars, we dont care of overwrites */ static u32 challenge_timestamp; @@ -3512,7 +3588,8 @@ static void tcp_send_challenge_ack(struct sock *sk, const struct sk_buff *skb) if (count > 0) { WRITE_ONCE(challenge_count, count - 1); NET_INC_STATS(net, LINUX_MIB_TCPCHALLENGEACK); - tcp_send_ack(sk, 0); + tcp_send_ack(sk, !accecn_reflector ? 0 : + tcp_accecn_reflector_flags(tp->syn_ect_rcv)); } } @@ -3678,7 +3755,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) /* RFC 5961 5.2 [Blind Data Injection Attack].[Mitigation] */ if (before(ack, prior_snd_una - tp->max_window)) { if (!(flag & FLAG_NO_CHALLENGE_ACK)) - tcp_send_challenge_ack(sk, skb); + tcp_send_challenge_ack(sk, skb, false); return -1; } goto old_ack; @@ -5482,7 +5559,7 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, const struct tcphdr *t } /* Updates Accurate ECN received counters from the received IP ECN field */ -static void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb) +void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb) { struct tcp_sock *tp = tcp_sk(sk); u8 ecnfield = TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK; @@ -5521,6 +5598,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb, { struct tcp_sock *tp = tcp_sk(sk); bool rst_seq_match = false; + bool send_accecn_reflector = false; /* RFC1323: H1. Apply PAWS check first. 
*/ if (tcp_fast_parse_options(sock_net(sk), skb, th, tp) && @@ -5598,7 +5676,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb, if (tp->syn_fastopen && !tp->data_segs_in && sk->sk_state == TCP_ESTABLISHED) tcp_fastopen_active_disable(sk); - tcp_send_challenge_ack(sk, skb); + tcp_send_challenge_ack(sk, skb, false); } goto discard; } @@ -5609,11 +5687,14 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb, * RFC 5961 4.2 : Send a challenge ack */ if (th->syn) { + if (tcp_ecn_mode_accecn(tp)) { + send_accecn_reflector = true; + } syn_challenge: if (syn_inerr) TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS); NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNCHALLENGE); - tcp_send_challenge_ack(sk, skb); + tcp_send_challenge_ack(sk, skb, send_accecn_reflector); goto discard; } @@ -6049,7 +6130,8 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb, * state to ESTABLISHED..." */ - tcp_ecn_rcv_synack(tp, th); + if (tcp_ecn_mode_any(tp)) + tcp_ecn_rcv_synack(sk, th, TCP_SKB_CB(skb)->ip_dsfield); tcp_init_wl(tp, TCP_SKB_CB(skb)->seq); tcp_try_undo_spurious_syn(sk); @@ -6126,7 +6208,8 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb, tcp_drop(sk, skb); return 0; } else { - tcp_send_ack(sk, 0); + tcp_send_ack(sk, !tcp_ecn_mode_accecn(tp) ? 
0 : + tcp_accecn_reflector_flags(tp->syn_ect_rcv)); } return -1; } @@ -6175,7 +6258,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb, tp->snd_wl1 = TCP_SKB_CB(skb)->seq; tp->max_window = tp->snd_wnd; - tcp_ecn_rcv_syn(tp, th); + tcp_ecn_rcv_syn(tp, th, skb); tcp_mtup_init(sk); tcp_sync_mss(sk, icsk->icsk_pmtu_cookie); @@ -6334,7 +6417,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb) if (!acceptable) { if (sk->sk_state == TCP_SYN_RECV) return 1; /* send one RST */ - tcp_send_challenge_ack(sk, skb); + tcp_send_challenge_ack(sk, skb, false); goto discard; } switch (sk->sk_state) { @@ -6376,6 +6459,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb) tp->lsndtime = tcp_jiffies32; tcp_initialize_rcv_mss(sk); + if (tcp_ecn_mode_accecn(tp)) + tcp_accecn_third_ack(sk, skb, tp->syn_ect_snt); tcp_fast_path_on(tp); break; @@ -6539,6 +6624,18 @@ static void tcp_ecn_create_request(struct request_sock *req, bool ect, ecn_ok; u32 ecn_ok_dst; + if (tcp_accecn_syn_requested(th) && + (net->ipv4.sysctl_tcp_ecn || tcp_ca_needs_accecn(listen_sk))) { + inet_rsk(req)->ecn_ok = 1; + if ((net->ipv4.sysctl_tcp_ecn >= 2) || + tcp_ca_needs_accecn(listen_sk)) { + tcp_rsk(req)->accecn_ok = 1; + tcp_rsk(req)->syn_ect_rcv = + TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK; + } + return; + } + if (!th_ecn) return; @@ -6565,6 +6662,9 @@ static void tcp_openreq_init(struct request_sock *req, tcp_rsk(req)->rcv_nxt = TCP_SKB_CB(skb)->seq + 1; tcp_rsk(req)->snt_synack = 0; tcp_rsk(req)->last_oow_ack_time = 0; + tcp_rsk(req)->accecn_ok = 0; + tcp_rsk(req)->syn_ect_rcv = 0; + tcp_rsk(req)->syn_ect_snt = 0; req->mss = rx_opt->mss_clamp; req->ts_recent = rx_opt->saw_tstamp ? 
rx_opt->rcv_tsval : 0; ireq->tstamp_ok = rx_opt->tstamp_ok; diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index dab0c1b85e95..d3b3e4d011b1 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -958,7 +958,7 @@ static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst, struct tcp_fastopen_cookie *foc, enum tcp_synack_type synack_type) { - const struct inet_request_sock *ireq = inet_rsk(req); + struct inet_request_sock *ireq = inet_rsk(req); struct flowi4 fl4; int err = -1; struct sk_buff *skb; @@ -970,6 +970,7 @@ static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst, skb = tcp_make_synack(sk, dst, req, foc, synack_type); if (skb) { + tcp_rsk(req)->syn_ect_snt = inet_sk(sk)->tos & INET_ECN_MASK; __tcp_v4_send_check(skb, ireq->ir_loc_addr, ireq->ir_rmt_addr); rcu_read_lock(); diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index 3b5a137e416c..57b7cf4658fc 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -397,12 +397,51 @@ void tcp_openreq_init_rwin(struct request_sock *req, } EXPORT_SYMBOL(tcp_openreq_init_rwin); -static void tcp_ecn_openreq_child(struct tcp_sock *tp, - const struct request_sock *req) +void tcp_accecn_third_ack(struct sock *sk, const struct sk_buff *skb, + u8 syn_ect_snt) { - tcp_ecn_mode_set(tp, inet_rsk(req)->ecn_ok ? 
- TCP_ECN_MODE_RFC3168 : - TCP_ECN_DISABLED); + struct tcp_sock *tp = tcp_sk(sk); + u8 ace = tcp_accecn_ace(tcp_hdr(skb)); + + switch (ace) { + case 0x0: + tp->ecn_fail = 1; + break; + case 0x7: + case 0x5: + case 0x1: + /* Unused but legal values */ + break; + default: + /* Validation only applies to first non-data packet */ + if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq && + !TCP_SKB_CB(skb)->sacked && + tcp_accecn_validate_syn_feedback(sk, ace, syn_ect_snt)) { + if ((tcp_accecn_extract_syn_ect(ace) == INET_ECN_CE) && + !tp->delivered_ce) + tp->delivered_ce++; + } + break; + } +} + +static void tcp_ecn_openreq_child(struct sock *sk, + const struct request_sock *req, + const struct sk_buff *skb) +{ + struct tcp_sock *tp = tcp_sk(sk); + const struct tcp_request_sock *treq = tcp_rsk(req); + + if (treq->accecn_ok) { + tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN); + tp->syn_ect_snt = treq->syn_ect_snt; + tcp_accecn_third_ack(sk, skb, treq->syn_ect_snt); + tcp_ecn_received_counters(sk, skb); + } else { + tcp_ecn_mode_set(tp, inet_rsk(req)->ecn_ok ? 
+ TCP_ECN_MODE_RFC3168 : + TCP_ECN_DISABLED); + } } void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst) @@ -546,7 +585,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk, if (skb->len >= TCP_MSS_DEFAULT + newtp->tcp_header_len) newicsk->icsk_ack.last_seg_size = skb->len - newtp->tcp_header_len; newtp->rx_opt.mss_clamp = req->mss; - tcp_ecn_openreq_child(newtp, req); + tcp_ecn_openreq_child(newsk, req, skb); newtp->fastopen_req = NULL; RCU_INIT_POINTER(newtp->fastopen_rsk, NULL); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index c8d0a7baf2d4..adc22d0d75fd 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -308,7 +308,7 @@ static u16 tcp_select_window(struct sock *sk) /* Packet ECN state for a SYN-ACK */ static void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb) { - const struct tcp_sock *tp = tcp_sk(sk); + struct tcp_sock *tp = tcp_sk(sk); TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_CWR; if (tcp_ecn_disabled(tp)) @@ -316,6 +316,13 @@ static void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb) else if (tcp_ca_needs_ecn(sk) || tcp_bpf_ca_needs_ecn(sk)) INET_ECN_xmit(sk); + + if (tp->ecn_flags & TCP_ECN_MODE_ACCECN) { + TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_ACE; + TCP_SKB_CB(skb)->tcp_flags |= + tcp_accecn_reflector_flags(tp->syn_ect_rcv); + tp->syn_ect_snt = inet_sk(sk)->tos & INET_ECN_MASK; + } } /* Packet ECN state for a SYN. 
*/ @@ -323,8 +330,10 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb) { struct tcp_sock *tp = tcp_sk(sk); bool bpf_needs_ecn = tcp_bpf_ca_needs_ecn(sk); + bool use_accecn = sock_net(sk)->ipv4.sysctl_tcp_ecn == 3 || + tcp_ca_needs_accecn(sk); bool use_ecn = sock_net(sk)->ipv4.sysctl_tcp_ecn == 1 || - tcp_ca_needs_ecn(sk) || bpf_needs_ecn; + tcp_ca_needs_ecn(sk) || bpf_needs_ecn || use_accecn; if (!use_ecn) { const struct dst_entry *dst = __sk_dst_get(sk); @@ -340,36 +349,59 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb) INET_ECN_xmit(sk); TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR; - tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168); + if (use_accecn) { + TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_AE; + tcp_ecn_mode_set(tp, TCP_ECN_MODE_PENDING); + tp->syn_ect_snt = inet_sk(sk)->tos & INET_ECN_MASK; + } else { + tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168); + } } } static void tcp_ecn_clear_syn(struct sock *sk, struct sk_buff *skb) { - if (sock_net(sk)->ipv4.sysctl_tcp_ecn_fallback) + if (sock_net(sk)->ipv4.sysctl_tcp_ecn_fallback) { /* tp->ecn_flags are cleared at a later point in time when * SYN ACK is ultimatively being received. 
*/ - TCP_SKB_CB(skb)->tcp_flags &= ~(TCPHDR_ECE | TCPHDR_CWR); + TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_ACE; + } +} + +static void tcp_accecn_echo_syn_ect(struct tcphdr *th, u8 ect) +{ + th->ae = !!(ect & 2); + th->cwr = ect != INET_ECN_ECT_0; + th->ece = ect == INET_ECN_ECT_1; } static void tcp_ecn_make_synack(const struct request_sock *req, struct tcphdr *th) { - if (inet_rsk(req)->ecn_ok) + if (tcp_rsk(req)->accecn_ok) + tcp_accecn_echo_syn_ect(th, tcp_rsk(req)->syn_ect_rcv); + else if (inet_rsk(req)->ecn_ok) th->ece = 1; } -static void tcp_accecn_set_ace(struct tcphdr *th, struct tcp_sock *tp) +static void tcp_accecn_set_ace(struct tcp_sock *tp, struct sk_buff *skb, + struct tcphdr *th) { u32 wire_ace; - tp->received_ce_tx += min_t(u32, tcp_accecn_ace_deficit(tp), - TCP_ACCECN_ACE_MAX_DELTA); - wire_ace = tp->received_ce_tx + TCP_ACCECN_CEP_INIT_OFFSET; - th->ece = !!(wire_ace & 0x1); - th->cwr = !!(wire_ace & 0x2); - th->ae = !!(wire_ace & 0x4); + /* The final packet of the 3WHS or anything like it must reflect + * the SYN/ACK ECT instead of putting CEP into ACE field, such + * case show up in tcp_flags. + */ + if (likely(!(TCP_SKB_CB(skb)->tcp_flags & TCPHDR_ACE))) { + tp->received_ce_tx += min_t(u32, tcp_accecn_ace_deficit(tp), + TCP_ACCECN_ACE_MAX_DELTA); + wire_ace = tp->received_ce_tx + TCP_ACCECN_CEP_INIT_OFFSET; + th->ece = !!(wire_ace & 0x1); + th->cwr = !!(wire_ace & 0x2); + th->ae = !!(wire_ace & 0x4); + } } /* Set up ECN state for a packet on a ESTABLISHED socket that is about to @@ -383,9 +415,10 @@ static void tcp_ecn_send(struct sock *sk, struct sk_buff *skb, if (!tcp_ecn_mode_any(tp)) return; - INET_ECN_xmit(sk); + if (!tp->ecn_fail) + INET_ECN_xmit(sk); if (tcp_ecn_mode_accecn(tp)) { - tcp_accecn_set_ace(th, tp); + tcp_accecn_set_ace(tp, skb, th); skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ACCECN; } else { /* Not-retransmitted data segment: set ECT and inject CWR. 
*/ @@ -3034,7 +3067,10 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs) tcp_retrans_try_collapse(sk, skb, cur_mss); } - /* RFC3168, section 6.1.1.1. ECN fallback */ + /* RFC3168, section 6.1.1.1. ECN fallback + * As AccECN uses the same SYN flags (+ AE), this check covers both + * cases. + */ if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN_ECN) == TCPHDR_SYN_ECN) tcp_ecn_clear_syn(sk, skb); diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c index 13235a012388..6e859d8fa489 100644 --- a/net/ipv6/syncookies.c +++ b/net/ipv6/syncookies.c @@ -251,6 +251,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb) ireq->rcv_wscale = rcv_wscale; ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk), dst); + tcp_rsk(req)->accecn_ok = ireq->ecn_ok && cookie_accecn_ok(th); ret = tcp_get_cookie_sock(sk, skb, req, dst, tsoff); out: diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 73032cd261ea..ecbab28c98cf 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -507,6 +507,7 @@ static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst, skb = tcp_make_synack(sk, dst, req, foc, synack_type); if (skb) { + tcp_rsk(req)->syn_ect_snt = np->tclass & INET_ECN_MASK; __tcp_v6_send_check(skb, &ireq->ir_v6_loc_addr, &ireq->ir_v6_rmt_addr);

From patchwork Wed Mar 18 09:43:19 2020
From: Ilpo Järvinen
To: netdev@vger.kernel.org
Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans
Subject: [RFC PATCH 15/28] tcp: add AccECN rx byte counters
Date: Wed, 18 Mar 2020 11:43:19 +0200
Message-Id: <1584524612-24470-16-git-send-email-ilpo.jarvinen@helsinki.fi>

Signed-off-by: Ilpo Järvinen --- include/linux/tcp.h | 1 + include/net/tcp.h | 18 +++++++++++++++++- net/ipv4/tcp_input.c | 13 +++++++++---- net/ipv4/tcp_minisocks.c | 3 ++- 4 files changed, 29 insertions(+), 6 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index a6b1d150cb05..6b81d7eb0117 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -323,6 +323,7 @@ struct tcp_sock { u32 delivered_ce; /* Like the above but only ECE marked packets */ u32 received_ce; /* Like
the above but for received CE marked packets */ u32 received_ce_tx; /* Like the above but max transmitted value */ + u32 received_ecn_bytes[3]; u32 lost; /* Total data packets lost incl. rexmits */ u32 app_limited; /* limited until "delivered" reaches this val */ u64 first_tx_mstamp; /* start of window send phase */ diff --git a/include/net/tcp.h b/include/net/tcp.h index e1dd06bbb472..5824447b1fc5 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -429,7 +429,8 @@ static inline u32 tcp_accecn_ace_deficit(const struct tcp_sock *tp) bool tcp_accecn_validate_syn_feedback(struct sock *sk, u8 ace, u8 sent_ect); void tcp_accecn_third_ack(struct sock *sk, const struct sk_buff *skb, u8 syn_ect_snt); -void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb); +void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb, + u32 payload_len); enum tcp_tw_status { TCP_TW_SUCCESS = 0, @@ -869,11 +870,26 @@ static inline u64 tcp_skb_timestamp_us(const struct sk_buff *skb) * See draft-ietf-tcpm-accurate-ecn for the latest values. 
*/ #define TCP_ACCECN_CEP_INIT_OFFSET 5 +#define TCP_ACCECN_E1B_INIT_OFFSET 0 +#define TCP_ACCECN_E0B_INIT_OFFSET 1 +#define TCP_ACCECN_CEB_INIT_OFFSET 0 + +static inline void __tcp_accecn_init_bytes_counters(int *counter_array) +{ + BUILD_BUG_ON(INET_ECN_ECT_1 != 0x1); + BUILD_BUG_ON(INET_ECN_ECT_0 != 0x2); + BUILD_BUG_ON(INET_ECN_CE != 0x3); + + counter_array[INET_ECN_ECT_1 - 1] = 0; + counter_array[INET_ECN_ECT_0 - 1] = 0; + counter_array[INET_ECN_CE - 1] = 0; +} static inline void tcp_accecn_init_counters(struct tcp_sock *tp) { tp->received_ce = 0; tp->received_ce_tx = 0; + __tcp_accecn_init_bytes_counters(tp->received_ecn_bytes); } /* This is what the send packet queuing engine uses to pass diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index bf307be4c659..3109e3199906 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -5559,7 +5559,8 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, const struct tcphdr *t } /* Updates Accurate ECN received counters from the received IP ECN field */ -void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb) +void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb, + u32 payload_len) { struct tcp_sock *tp = tcp_sk(sk); u8 ecnfield = TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK; @@ -5570,6 +5571,10 @@ void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb) /* ACE counter tracks *all* segments including pure acks */ tp->received_ce += is_ce * max_t(u16, 1, skb_shinfo(skb)->gso_segs); + + if (payload_len > 0) { + tp->received_ecn_bytes[ecnfield - 1] += payload_len; + } } } @@ -5808,7 +5813,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb) tp->rcv_nxt == tp->rcv_wup) flag |= __tcp_replace_ts_recent(tp, tstamp_delta); - tcp_ecn_received_counters(sk, skb); + tcp_ecn_received_counters(sk, skb, 0); /* We know that such packets are checksummed * on entry. 
@@ -5851,7 +5856,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb) /* Bulk data transfer: receiver */ __skb_pull(skb, tcp_header_len); - tcp_ecn_received_counters(sk, skb); + tcp_ecn_received_counters(sk, skb, len - tcp_header_len); eaten = tcp_queue_rcv(sk, skb, &fragstolen); tcp_event_data_recv(sk, skb); @@ -5888,7 +5893,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb) return; step5: - tcp_ecn_received_counters(sk, skb); + tcp_ecn_received_counters(sk, skb, len - th->doff * 4); if (tcp_ack(sk, skb, FLAG_SLOWPATH | FLAG_UPDATE_TS_RECENT) < 0) goto discard; diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index 57b7cf4658fc..668edd00e377 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -433,10 +433,11 @@ static void tcp_ecn_openreq_child(struct sock *sk, const struct tcp_request_sock *treq = tcp_rsk(req); if (treq->accecn_ok) { + const struct tcphdr *th = (const struct tcphdr *)skb->data; tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN); tp->syn_ect_snt = treq->syn_ect_snt; tcp_accecn_third_ack(sk, skb, treq->syn_ect_snt); - tcp_ecn_received_counters(sk, skb); + tcp_ecn_received_counters(sk, skb, skb->len - th->doff * 4); } else { tcp_ecn_mode_set(tp, inet_rsk(req)->ecn_ok ? 
TCP_ECN_MODE_RFC3168 :

From patchwork Wed Mar 18 09:43:22 2020
From: Ilpo Järvinen
To: netdev@vger.kernel.org
Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans
Subject: [RFC PATCH 18/28] tcp: don't early return when sack doesn't fit
Date: Wed, 18 Mar 2020 11:43:22 +0200
Message-Id: <1584524612-24470-19-git-send-email-ilpo.jarvinen@helsinki.fi>

AccECN code will be placed after this fragment so no early returns please. Signed-off-by: Ilpo Järvinen --- net/ipv4/tcp_output.c | 21 ++++++++++----------- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 9ff6d14363df..084ebd2e725f 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -875,17 +875,16 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb eff_sacks = tp->rx_opt.num_sacks + tp->rx_opt.dsack; if (unlikely(eff_sacks)) { const unsigned int remaining = MAX_TCP_OPTION_SPACE - size; - if (unlikely(remaining < TCPOLEN_SACK_BASE_ALIGNED + - TCPOLEN_SACK_PERBLOCK)) - return size; - - opts->num_sack_blocks = - min_t(unsigned int, eff_sacks, - (remaining - TCPOLEN_SACK_BASE_ALIGNED) / - TCPOLEN_SACK_PERBLOCK); - - size += TCPOLEN_SACK_BASE_ALIGNED + - opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK; + if (likely(remaining >= TCPOLEN_SACK_BASE_ALIGNED + + TCPOLEN_SACK_PERBLOCK)) { + opts->num_sack_blocks = + min_t(unsigned int, eff_sacks, + (remaining - TCPOLEN_SACK_BASE_ALIGNED) / + TCPOLEN_SACK_PERBLOCK); + + size += TCPOLEN_SACK_BASE_ALIGNED + + opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK; + } } return size;

From patchwork Wed Mar 18 09:43:23 2020
From: Ilpo Järvinen
To: netdev@vger.kernel.org
Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans
Subject: [RFC PATCH 19/28] tcp: AccECN option
Date: Wed, 18 Mar 2020 11:43:23 +0200
Message-Id: <1584524612-24470-20-git-send-email-ilpo.jarvinen@helsinki.fi>

AccECN option tx & rx side without option send control related features that are added in a later change.
Signed-off-by: Ilpo Järvinen --- include/linux/tcp.h | 4 ++ include/net/tcp.h | 16 ++++++ net/ipv4/tcp_input.c | 126 +++++++++++++++++++++++++++++++++++++++--- net/ipv4/tcp_output.c | 112 ++++++++++++++++++++++++++++++++++++- 4 files changed, 246 insertions(+), 12 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 6b81d7eb0117..fd232bb7fae9 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -114,6 +114,7 @@ struct tcp_options_received { snd_wscale : 4, /* Window scaling received from sender */ rcv_wscale : 4; /* Window scaling to send to receiver */ u8 num_sacks; /* Number of SACK blocks */ + s8 accecn; /* AccECN index in header, -1=no option */ u16 user_mss; /* mss requested by user in ioctl */ u16 mss_clamp; /* Maximal mss, negotiated at connection setup */ #if IS_ENABLED(CONFIG_MPTCP) @@ -321,9 +322,12 @@ struct tcp_sock { u32 prr_out; /* Total number of pkts sent during Recovery. */ u32 delivered; /* Total data packets delivered incl. rexmits */ u32 delivered_ce; /* Like the above but only ECE marked packets */ + u32 delivered_ecn_bytes[3]; u32 received_ce; /* Like the above but for received CE marked packets */ u32 received_ce_tx; /* Like the above but max transmitted value */ u32 received_ecn_bytes[3]; + u8 accecn_minlen:2,/* Minimum length of AccECN option sent */ + estimate_ecnfield:2;/* ECN field for AccECN delivered estimates */ u32 lost; /* Total data packets lost incl. rexmits */ u32 app_limited; /* limited until "delivered" reaches this val */ u64 first_tx_mstamp; /* start of window send phase */ diff --git a/include/net/tcp.h b/include/net/tcp.h index 5824447b1fc5..54471c2dedd5 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -189,6 +189,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo); /* Magic number to be after the option value for sharing TCP * experimental options. 
See draft-ietf-tcpm-experimental-options-00.txt */ +#define TCPOPT_ACCECN_MAGIC 0xACCE #define TCPOPT_FASTOPEN_MAGIC 0xF989 #define TCPOPT_SMC_MAGIC 0xE2D4C3D9 @@ -204,6 +205,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo); #define TCPOLEN_FASTOPEN_BASE 2 #define TCPOLEN_EXP_FASTOPEN_BASE 4 #define TCPOLEN_EXP_SMC_BASE 6 +#define TCPOLEN_EXP_ACCECN_BASE 4 /* But this is what stacks really send out. */ #define TCPOLEN_TSTAMP_ALIGNED 12 @@ -215,6 +217,13 @@ void tcp_time_wait(struct sock *sk, int state, int timeo); #define TCPOLEN_MD5SIG_ALIGNED 20 #define TCPOLEN_MSS_ALIGNED 4 #define TCPOLEN_EXP_SMC_BASE_ALIGNED 8 +#define TCPOLEN_ACCECN_PERCOUNTER 3 + +/* Maximum number of byte counters in AccECN option + size */ +#define TCP_ACCECN_NUMCOUNTERS 3 +#define TCP_ACCECN_MAXSIZE (TCPOLEN_EXP_ACCECN_BASE + \ + TCPOLEN_ACCECN_PERCOUNTER * \ + TCP_ACCECN_NUMCOUNTERS) /* Flags in tp->nonagle */ #define TCP_NAGLE_OFF 1 /* Nagle's algo is disabled */ @@ -363,6 +372,10 @@ static inline void tcp_dec_quickack_mode(struct sock *sk, } } +/* sysctl_tcp_ecn value */ +#define TCP_ECN_ENABLE_MASK 0x3 +#define TCP_ACCECN_NO_OPT 0x100 + #define TCP_ECN_MODE_RFC3168 0x1 #define TCP_ECN_QUEUE_CWR 0x2 #define TCP_ECN_DEMAND_CWR 0x4 @@ -890,6 +903,9 @@ static inline void tcp_accecn_init_counters(struct tcp_sock *tp) tp->received_ce = 0; tp->received_ce_tx = 0; __tcp_accecn_init_bytes_counters(tp->received_ecn_bytes); + __tcp_accecn_init_bytes_counters(tp->delivered_ecn_bytes); + tp->accecn_minlen = 0; + tp->estimate_ecnfield = 0; } /* This is what the send packet queuing engine uses to pass diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 0a63f8a49057..d34b50f73652 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -402,9 +402,92 @@ static u32 tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr * return 0; } +/* Handles AccECN option ECT and CE 24-bit byte counters update into + * the u32 value in tcp_sock. 
As we're processing TCP options, it is + * safe to access from - 1. + */ +static s32 tcp_update_ecn_bytes(u32 *cnt, const char *from, u32 init_offset) +{ + u32 truncated = (get_unaligned_be32(from - 1) - init_offset) & 0xFFFFFFU; + u32 delta = (truncated - *cnt) & 0xFFFFFFU; + /* If delta has the highest bit set (24th bit) indicating negative, + * sign extend to correct an estimation error in the ecn_bytes + */ + delta = delta & 0x800000 ? delta | 0xFF000000 : delta; + *cnt += delta; + return (s32)delta; +} + +static u8 accecn_opt_ecnfield[3] = { + INET_ECN_ECT_0, INET_ECN_CE, INET_ECN_ECT_1, +}; + +/* Returns true if the byte counters can be used */ +static bool tcp_accecn_process_option(struct tcp_sock *tp, + const struct sk_buff *skb, + u32 delivered_bytes) +{ + unsigned char *ptr; + unsigned int optlen; + int i; + bool ambiguous_ecn_bytes_incr = false; + bool first_changed = false; + bool res; + + if (tp->rx_opt.accecn < 0) { + if (tp->estimate_ecnfield) { + tp->delivered_ecn_bytes[tp->estimate_ecnfield - 1] += + delivered_bytes; + return true; + } + return false; + } + + ptr = skb_transport_header(skb) + tp->rx_opt.accecn; + optlen = ptr[1]; + if (ptr[0] == TCPOPT_EXP) { + optlen -= 2; + ptr += 2; + } + ptr += 2; + + res = !!tp->estimate_ecnfield; + for (i = 0; i < 3; i++) { + if (optlen >= TCPOLEN_ACCECN_PERCOUNTER) { + u8 ecnfield = accecn_opt_ecnfield[i]; + u32 init_offset = i ? 
0 : TCP_ACCECN_E0B_INIT_OFFSET; + s32 delta; + + delta = tcp_update_ecn_bytes(&(tp->delivered_ecn_bytes[ecnfield - 1]), + ptr, init_offset); + if (delta) { + if (delta < 0) { + res = false; + ambiguous_ecn_bytes_incr = true; + } + if (ecnfield != tp->estimate_ecnfield) { + if (!first_changed) { + tp->estimate_ecnfield = ecnfield; + first_changed = true; + } else { + res = false; + ambiguous_ecn_bytes_incr = true; + } + } + } + + optlen -= TCPOLEN_ACCECN_PERCOUNTER; + } + } + if (ambiguous_ecn_bytes_incr) + tp->estimate_ecnfield = 0; + + return res; +} + /* Returns the ECN CE delta */ static u32 tcp_accecn_process(struct tcp_sock *tp, const struct sk_buff *skb, - u32 delivered_pkts, int flag) + u32 delivered_pkts, u32 delivered_bytes, int flag) { u32 delta, safe_delta; u32 corrected_ace; @@ -413,6 +496,8 @@ static u32 tcp_accecn_process(struct tcp_sock *tp, const struct sk_buff *skb, if (!(flag & (FLAG_FORWARD_PROGRESS|FLAG_TS_PROGRESS))) return 0; + tcp_accecn_process_option(tp, skb, delivered_bytes); + if (!(flag & FLAG_SLOWPATH)) { /* AccECN counter might overflow on large ACKs */ if (delivered_pkts <= TCP_ACCECN_CEP_ACE_MASK) @@ -3839,7 +3924,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) if (tcp_ecn_mode_accecn(tp)) { ecn_count = tcp_accecn_process(tp, skb, - tp->delivered - delivered, flag); + tp->delivered - delivered, + sack_state.delivered_bytes, flag); if (ecn_count > 0) flag |= FLAG_ECE; } @@ -3878,7 +3964,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) no_queue: if (tcp_ecn_mode_accecn(tp)) { ecn_count = tcp_accecn_process(tp, skb, - tp->delivered - delivered, flag); + tp->delivered - delivered, + sack_state.delivered_bytes, flag); if (ecn_count > 0) flag |= FLAG_ECE; } @@ -4005,6 +4092,7 @@ void tcp_parse_options(const struct net *net, ptr = (const unsigned char *)(th + 1); opt_rx->saw_tstamp = 0; + opt_rx->accecn = -1; while (length > 0) { int opcode = *ptr++; @@ -4094,12 +4182,16 @@ void 
tcp_parse_options(const struct net *net, break; case TCPOPT_EXP: + if (opsize >= TCPOLEN_EXP_ACCECN_BASE && + get_unaligned_be16(ptr) == + TCPOPT_ACCECN_MAGIC) + opt_rx->accecn = (ptr - 2) - (unsigned char *)th; /* Fast Open option shares code 254 using a * 16 bits magic number. */ - if (opsize >= TCPOLEN_EXP_FASTOPEN_BASE && - get_unaligned_be16(ptr) == - TCPOPT_FASTOPEN_MAGIC) + else if (opsize >= TCPOLEN_EXP_FASTOPEN_BASE && + get_unaligned_be16(ptr) == + TCPOPT_FASTOPEN_MAGIC) tcp_parse_fastopen_option(opsize - TCPOLEN_EXP_FASTOPEN_BASE, ptr + 2, th->syn, foc, true); @@ -5567,6 +5659,19 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, const struct tcphdr *t } } +/* Maps ECT/CE bits to minimum length of AccECN option */ +static inline unsigned int tcp_ecn_field_to_accecn_len(u8 ecnfield) +{ + unsigned int opt; + + opt = (ecnfield - 2) & INET_ECN_MASK; + /* Shift+XOR for 11 -> 10 */ + opt = (opt ^ (opt >> 1)) + 1; + + return opt; +} + + /* Updates Accurate ECN received counters from the received IP ECN field */ void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb, u32 payload_len) @@ -5582,7 +5687,9 @@ void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb, tp->received_ce += is_ce * max_t(u16, 1, skb_shinfo(skb)->gso_segs); if (payload_len > 0) { + u8 minlen = tcp_ecn_field_to_accecn_len(ecnfield); tp->received_ecn_bytes[ecnfield - 1] += payload_len; + tp->accecn_minlen = max_t(u8, tp->accecn_minlen, minlen); } } } @@ -6639,9 +6746,10 @@ static void tcp_ecn_create_request(struct request_sock *req, u32 ecn_ok_dst; if (tcp_accecn_syn_requested(th) && - (net->ipv4.sysctl_tcp_ecn || tcp_ca_needs_accecn(listen_sk))) { + ((net->ipv4.sysctl_tcp_ecn & TCP_ECN_ENABLE_MASK) || + tcp_ca_needs_accecn(listen_sk))) { inet_rsk(req)->ecn_ok = 1; - if ((net->ipv4.sysctl_tcp_ecn >= 2) || + if (((net->ipv4.sysctl_tcp_ecn & TCP_ECN_ENABLE_MASK) >= 2) || tcp_ca_needs_accecn(listen_sk)) { tcp_rsk(req)->accecn_ok = 1; 
tcp_rsk(req)->syn_ect_rcv = @@ -6655,7 +6763,7 @@ static void tcp_ecn_create_request(struct request_sock *req, ect = !INET_ECN_is_not_ect(TCP_SKB_CB(skb)->ip_dsfield); ecn_ok_dst = dst_feature(dst, DST_FEATURE_ECN_MASK); - ecn_ok = net->ipv4.sysctl_tcp_ecn || ecn_ok_dst; + ecn_ok = (net->ipv4.sysctl_tcp_ecn & TCP_ECN_ENABLE_MASK) || ecn_ok_dst; if (((!ect || th->res1 || th->ae) && ecn_ok) || tcp_ca_needs_ecn(listen_sk) || diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 084ebd2e725f..7bce1a73ac8f 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -330,9 +330,11 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb) { struct tcp_sock *tp = tcp_sk(sk); bool bpf_needs_ecn = tcp_bpf_ca_needs_ecn(sk); - bool use_accecn = sock_net(sk)->ipv4.sysctl_tcp_ecn == 3 || + bool use_accecn = + (sock_net(sk)->ipv4.sysctl_tcp_ecn & TCP_ECN_ENABLE_MASK) == 3 || tcp_ca_needs_accecn(sk); - bool use_ecn = sock_net(sk)->ipv4.sysctl_tcp_ecn == 1 || + bool use_ecn = + (sock_net(sk)->ipv4.sysctl_tcp_ecn & TCP_ECN_ENABLE_MASK) == 1 || tcp_ca_needs_ecn(sk) || bpf_needs_ecn || use_accecn; if (!use_ecn) { @@ -468,6 +470,7 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp) #define OPTION_FAST_OPEN_COOKIE (1 << 8) #define OPTION_SMC (1 << 9) #define OPTION_MPTCP (1 << 10) +#define OPTION_ACCECN (1 << 11) static void smc_options_write(__be32 *ptr, u16 *options) { @@ -488,12 +491,14 @@ struct tcp_out_options { u16 options; /* bit field of OPTION_* */ u16 mss; /* 0 to disable */ u8 ws; /* window scale, 0 to disable */ - u8 num_sack_blocks; /* number of SACK blocks to include */ + u8 num_sack_blocks:3, /* number of SACK blocks to include */ + num_ecn_bytes:2; /* number of AccECN fields needed */ u8 hash_size; /* bytes in hash_location */ __u8 *hash_location; /* temporary pointer, overloaded */ __u32 tsval, tsecr; /* need to include OPTION_TS */ struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */ struct mptcp_out_options mptcp; 
+ u32 *ecn_bytes; /* AccECN ECT/CE byte counters */ }; static void mptcp_options_write(__be32 *ptr, struct tcp_out_options *opts) @@ -557,6 +562,33 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp, *ptr++ = htonl(opts->tsecr); } + if (unlikely(OPTION_ACCECN & options)) { + u32 e0b = opts->ecn_bytes[INET_ECN_ECT_0 - 1] + TCP_ACCECN_E0B_INIT_OFFSET; + u32 ceb = opts->ecn_bytes[INET_ECN_CE - 1] + TCP_ACCECN_CEB_INIT_OFFSET; + u32 e1b = opts->ecn_bytes[INET_ECN_ECT_1 - 1] + TCP_ACCECN_E1B_INIT_OFFSET; + u8 len = TCPOLEN_EXP_ACCECN_BASE + + opts->num_ecn_bytes * TCPOLEN_ACCECN_PERCOUNTER; + + *ptr++ = htonl((TCPOPT_EXP << 24) | (len << 16) | + TCPOPT_ACCECN_MAGIC); + if (opts->num_ecn_bytes > 0) { + *ptr++ = htonl((e0b << 8) | + (opts->num_ecn_bytes > 1 ? + (ceb >> 16) & 0xff : + TCPOPT_NOP)); + if (opts->num_ecn_bytes == 2) { + leftover_bytes = (ceb >> 8) & 0xffff; + } else { + *ptr++ = htonl((ceb << 16) | + ((e1b >> 8) & 0xffff)); + leftover_bytes = ((e1b & 0xff) << 8) | + TCPOPT_NOP; + leftover_size = 1; + } + } + if (tp != NULL) + tp->accecn_minlen = 0; + } if (unlikely(OPTION_SACK_ADVERTISE & options)) { *ptr++ = htonl((leftover_bytes << 16) | (TCPOPT_SACK_PERM << 8) | @@ -677,6 +709,53 @@ static void mptcp_set_option_cond(const struct request_sock *req, } } +/* Initial values for AccECN option; the order is based on ECN field bits, + * similar to received_ecn_bytes. Used for the SYN/ACK AccECN option. + */ +u32 synack_ecn_bytes[3] = { 0, 0, 0 }; + +static u32 tcp_synack_options_combine_saving(struct tcp_out_options *opts) +{ + /* How much room is there for combining with the alignment padding?
*/ + if ((opts->options & (OPTION_SACK_ADVERTISE|OPTION_TS)) == + OPTION_SACK_ADVERTISE) + return 2; + else if (opts->options & OPTION_WSCALE) + return 1; + return 0; +} + +/* AccECN can here and there take advantage of NOPs for alignment of + * other options + */ +static int tcp_options_fit_accecn(struct tcp_out_options *opts, int required, + int remaining, int max_combine_saving) +{ + int size = TCP_ACCECN_MAXSIZE; + + opts->num_ecn_bytes = TCP_ACCECN_NUMCOUNTERS; + + while (opts->num_ecn_bytes >= required) { + int leftover_size = size & 0x3; + /* Pad to dword if cannot combine */ + if (leftover_size > max_combine_saving) + leftover_size = -((4 - leftover_size) & 0x3); + + if (remaining >= size - leftover_size) { + size -= leftover_size; + break; + } + + opts->num_ecn_bytes--; + size -= TCPOLEN_ACCECN_PERCOUNTER; + } + if (opts->num_ecn_bytes < required) + return 0; + + opts->options |= OPTION_ACCECN; + return size; +} + /* Compute TCP options for SYN packets. This is not the final * network wire format yet.
*/ @@ -755,6 +834,16 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb, } } + /* Simultaneous open SYN/ACK needs AccECN option but not SYN */ + if (unlikely((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_ACK) && + tcp_ecn_mode_accecn(tp) && + !(sock_net(sk)->ipv4.sysctl_tcp_ecn & TCP_ACCECN_NO_OPT) && + (remaining >= TCPOLEN_EXP_ACCECN_BASE))) { + opts->ecn_bytes = synack_ecn_bytes; + remaining -= tcp_options_fit_accecn(opts, 0, remaining, + tcp_synack_options_combine_saving(opts)); + } + return MAX_TCP_OPTION_SPACE - remaining; } @@ -767,6 +856,7 @@ static unsigned int tcp_synack_options(const struct sock *sk, struct tcp_fastopen_cookie *foc) { struct inet_request_sock *ireq = inet_rsk(req); + struct tcp_request_sock *treq = tcp_rsk(req); unsigned int remaining = MAX_TCP_OPTION_SPACE; #ifdef CONFIG_TCP_MD5SIG @@ -820,6 +910,14 @@ static unsigned int tcp_synack_options(const struct sock *sk, smc_set_option_cond(tcp_sk(sk), ireq, opts, &remaining); + if (treq->accecn_ok && + !(sock_net(sk)->ipv4.sysctl_tcp_ecn & TCP_ACCECN_NO_OPT) && + (remaining >= TCPOLEN_EXP_ACCECN_BASE)) { + opts->ecn_bytes = synack_ecn_bytes; + remaining -= tcp_options_fit_accecn(opts, 0, remaining, + tcp_synack_options_combine_saving(opts)); + } + return MAX_TCP_OPTION_SPACE - remaining; } @@ -887,6 +985,14 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb } } + if (tcp_ecn_mode_accecn(tp) && + !(sock_net(sk)->ipv4.sysctl_tcp_ecn & TCP_ACCECN_NO_OPT)) { + opts->ecn_bytes = tp->received_ecn_bytes; + size += tcp_options_fit_accecn(opts, tp->accecn_minlen, + MAX_TCP_OPTION_SPACE - size, + opts->num_sack_blocks > 0 ? 
2 : 0); + } + return size; } From patchwork Wed Mar 18 09:43:26 2020 X-Patchwork-Submitter: =?utf-8?q?Ilpo_J=C3=A4rvinen?= X-Patchwork-Id: 222319 From: =?iso-8859-1?q?Ilpo_J=E4rvinen?= To: netdev@vger.kernel.org Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans Subject: [RFC PATCH 22/28] tcp: AccECN option order bit & failure handling Date: Wed, 18 Mar 2020 11:43:26 +0200 Message-Id: <1584524612-24470-23-git-send-email-ilpo.jarvinen@helsinki.fi> In-Reply-To:
<1584524612-24470-1-git-send-email-ilpo.jarvinen@helsinki.fi> References: <1584524612-24470-1-git-send-email-ilpo.jarvinen@helsinki.fi> From: Ilpo Järvinen The AccECN option has two possible field orders. Collect the order bit from the first AccECN option that is long enough to contain it. The AccECN option may fail in various ways; handle these: - Remove the option from SYN/ACK rexmits to handle blackholes - If no option arrives in the SYN/ACK, assume the option is not usable - If an option arrives later, re-enable it - If the option is zeroed, disable AccECN option processing Signed-off-by: Ilpo Järvinen --- include/linux/tcp.h | 2 ++ include/net/tcp.h | 10 +++++++++ net/ipv4/tcp.c | 1 + net/ipv4/tcp_input.c | 46 ++++++++++++++++++++++++++++++++++------ net/ipv4/tcp_minisocks.c | 32 ++++++++++++++++++++++++++++ net/ipv4/tcp_output.c | 4 +++- 6 files changed, 88 insertions(+), 7 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index c381aea5c764..64db51e5d45e 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -151,6 +151,7 @@ struct tcp_request_sock { bool tfo_listener; bool is_mptcp; u8 accecn_ok : 1, + saw_accecn_opt : 3, syn_ect_snt: 2, syn_ect_rcv: 2; u32 txhash; @@ -252,6 +253,7 @@ struct tcp_sock { u8 compressed_ack; u8 syn_ect_snt:2, /* AccECN ECT memory, only */ syn_ect_rcv:2, /* ...
needed during 3WHS + first seqno */ + saw_accecn_opt:3, /* A valid AccECN option was seen */ ecn_fail:1; /* ECN reflector detected path mangling */ u32 chrono_start; /* Start time in jiffies of a TCP chrono */ u32 chrono_stat[3]; /* Time in jiffies for chrono_stat stats */ diff --git a/include/net/tcp.h b/include/net/tcp.h index 52567d8fca33..a29109fa2ce2 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -226,6 +226,14 @@ void tcp_time_wait(struct sock *sk, int state, int timeo); TCP_ACCECN_NUMCOUNTERS) #define TCP_ACCECN_BEACON_FREQ_SHIFT 2 /* Send option at least 2^2 times per RTT */ +/* tp->saw_accecn_opt states, empty seen & orderbit are overloaded */ +#define TCP_ACCECN_OPT_EMPTY_SEEN 0x1 +#define TCP_ACCECN_OPT_ORDERBIT 0x1 +#define TCP_ACCECN_OPT_COUNTER_SEEN 0x2 +#define TCP_ACCECN_OPT_SEEN (TCP_ACCECN_OPT_COUNTER_SEEN | \ + TCP_ACCECN_OPT_EMPTY_SEEN) +#define TCP_ACCECN_OPT_FAIL 0x4 + /* Flags in tp->nonagle */ #define TCP_NAGLE_OFF 1 /* Nagle's algo is disabled */ #define TCP_NAGLE_CORK 2 /* Socket is corked */ @@ -443,6 +451,7 @@ static inline u32 tcp_accecn_ace_deficit(const struct tcp_sock *tp) bool tcp_accecn_validate_syn_feedback(struct sock *sk, u8 ace, u8 sent_ect); void tcp_accecn_third_ack(struct sock *sk, const struct sk_buff *skb, u8 syn_ect_snt); +u8 tcp_accecn_option_init(const struct sk_buff *skb, u8 opt_offset); void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb, u32 payload_len); @@ -885,6 +894,7 @@ static inline u64 tcp_skb_timestamp_us(const struct sk_buff *skb) */ #define TCP_ACCECN_CEP_INIT_OFFSET 5 #define TCP_ACCECN_E1B_INIT_OFFSET 0 +#define TCP_ACCECN_E1B_FIRST_INIT_OFFSET 0x800001 #define TCP_ACCECN_E0B_INIT_OFFSET 1 #define TCP_ACCECN_CEB_INIT_OFFSET 0 diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index cfbdc1468342..09f73f81e6fa 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2624,6 +2624,7 @@ int tcp_disconnect(struct sock *sk, int flags) tp->window_clamp = 0; tp->delivered = 0;
tp->delivered_ce = 0; + tp->saw_accecn_opt = 0; tp->ecn_fail = 0; tcp_accecn_init_counters(tp); tp->prev_ecnfield = 0; diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 504309a73de2..826dfd5bf114 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -352,7 +352,8 @@ bool tcp_accecn_validate_syn_feedback(struct sock *sk, u8 ace, u8 sent_ect) } /* See Table 2 of the AccECN draft */ -static void tcp_ecn_rcv_synack(struct sock *sk, const struct tcphdr *th, +static void tcp_ecn_rcv_synack(struct sock *sk, const struct sk_buff *skb, + const struct tcphdr *th, u8 ip_dsfield) { struct tcp_sock *tp = tcp_sk(sk); @@ -372,7 +373,12 @@ static void tcp_ecn_rcv_synack(struct sock *sk, const struct tcphdr *th, default: tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN); tp->syn_ect_rcv = ip_dsfield & INET_ECN_MASK; - tp->accecn_opt_demand = 2; + if (tp->rx_opt.accecn >= 0 && + tp->saw_accecn_opt < TCP_ACCECN_OPT_COUNTER_SEEN) { + tp->saw_accecn_opt = tcp_accecn_option_init(skb, + tp->rx_opt.accecn); + tp->accecn_opt_demand = 2; + } if (tcp_accecn_validate_syn_feedback(sk, ace, tp->syn_ect_snt) && INET_ECN_is_ce(ip_dsfield)) tp->received_ce++; @@ -436,7 +442,19 @@ static bool tcp_accecn_process_option(struct tcp_sock *tp, bool first_changed = false; bool res; + if (tp->saw_accecn_opt == TCP_ACCECN_OPT_FAIL) + return false; + if (tp->rx_opt.accecn < 0) { + if (!tp->saw_accecn_opt) { + /* Too late to enable after this point due to + * potential counter wraps + */ + if (tp->bytes_sent >= (1 << 23) - 1) + tp->saw_accecn_opt = TCP_ACCECN_OPT_FAIL; + return false; + } + if (tp->estimate_ecnfield) { tp->delivered_ecn_bytes[tp->estimate_ecnfield - 1] += delivered_bytes; @@ -453,11 +471,20 @@ static bool tcp_accecn_process_option(struct tcp_sock *tp, } ptr += 2; + if (tp->saw_accecn_opt < TCP_ACCECN_OPT_COUNTER_SEEN) + tp->saw_accecn_opt = tcp_accecn_option_init(skb, + tp->rx_opt.accecn); + res = !!tp->estimate_ecnfield; for (i = 0; i < 3; i++) { if (optlen >= 
TCPOLEN_ACCECN_PERCOUNTER) { - u8 ecnfield = accecn_opt_ecnfield[i]; - u32 init_offset = i ? 0 : TCP_ACCECN_E0B_INIT_OFFSET; + u8 orderbit = tp->saw_accecn_opt & TCP_ACCECN_OPT_ORDERBIT; + int idx = orderbit ? i : 2 - i; + u8 ecnfield = accecn_opt_ecnfield[idx]; + u32 init_offset = i ? 0 : + !orderbit ? + TCP_ACCECN_E0B_INIT_OFFSET : + TCP_ACCECN_E1B_FIRST_INIT_OFFSET; s32 delta; delta = tcp_update_ecn_bytes(&(tp->delivered_ecn_bytes[ecnfield - 1]), @@ -4188,6 +4215,7 @@ void tcp_parse_options(const struct net *net, get_unaligned_be16(ptr) == TCPOPT_ACCECN_MAGIC) opt_rx->accecn = (ptr - 2) - (unsigned char *)th; + /* Fast Open option shares code 254 using a * 16 bits magic number. */ @@ -5836,7 +5864,12 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb, if (th->syn) { if (tcp_ecn_mode_accecn(tp)) { send_accecn_reflector = true; - tp->accecn_opt_demand = max_t(u8, 1, tp->accecn_opt_demand); + if (tp->rx_opt.accecn >= 0 && + tp->saw_accecn_opt < TCP_ACCECN_OPT_COUNTER_SEEN) { + tp->saw_accecn_opt = tcp_accecn_option_init(skb, + tp->rx_opt.accecn); + tp->accecn_opt_demand = max_t(u8, 1, tp->accecn_opt_demand); + } } syn_challenge: if (syn_inerr) @@ -6279,7 +6312,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb, */ if (tcp_ecn_mode_any(tp)) - tcp_ecn_rcv_synack(sk, th, TCP_SKB_CB(skb)->ip_dsfield); + tcp_ecn_rcv_synack(sk, skb, th, TCP_SKB_CB(skb)->ip_dsfield); tcp_init_wl(tp, TCP_SKB_CB(skb)->seq); tcp_try_undo_spurious_syn(sk); @@ -6812,6 +6845,7 @@ static void tcp_openreq_init(struct request_sock *req, tcp_rsk(req)->snt_synack = 0; tcp_rsk(req)->last_oow_ack_time = 0; tcp_rsk(req)->accecn_ok = 0; + tcp_rsk(req)->saw_accecn_opt = 0; tcp_rsk(req)->syn_ect_rcv = 0; tcp_rsk(req)->syn_ect_snt = 0; req->mss = rx_opt->mss_clamp; diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index 2e532758a34a..eda3d0c3af32 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -97,6 +97,7 @@ 
tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb, bool paws_reject = false; tmp_opt.saw_tstamp = 0; + tmp_opt.accecn = -1; if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) { tcp_parse_options(twsk_net(tw), skb, &tmp_opt, 0, NULL); @@ -437,6 +438,7 @@ static void tcp_ecn_openreq_child(struct sock *sk, tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN); tp->syn_ect_snt = treq->syn_ect_snt; tcp_accecn_third_ack(sk, skb, treq->syn_ect_snt); + tp->saw_accecn_opt = treq->saw_accecn_opt; tp->prev_ecnfield = treq->syn_ect_rcv; tp->accecn_opt_demand = 1; tcp_ecn_received_counters(sk, skb, skb->len - th->doff * 4); @@ -491,6 +493,32 @@ static void smc_check_reset_syn_req(struct tcp_sock *oldtp, #endif } +u8 tcp_accecn_option_init(const struct sk_buff *skb, u8 opt_offset) +{ + unsigned char *ptr = skb_transport_header(skb) + opt_offset; + unsigned int optlen = ptr[1]; + + if (ptr[0] == TCPOPT_EXP) { + optlen -= 2; + ptr += 2; + } + ptr += 2; + + if (optlen >= TCPOLEN_ACCECN_PERCOUNTER) { + u32 first_field = get_unaligned_be32(ptr - 1) & 0xFFFFFFU; + u8 orderbit = first_field >> 23; + /* Detect option zeroing. Check the first byte counter value, + * if present, it must be != 0. + */ + if (!first_field) + return TCP_ACCECN_OPT_FAIL; + + return TCP_ACCECN_OPT_COUNTER_SEEN + orderbit; + } + + return TCP_ACCECN_OPT_EMPTY_SEEN; +} + /* This is not only more efficient than what we used to do, it eliminates * a lot of code duplication between IPv4/IPv6 SYN recv processing. -DaveM * @@ -793,6 +821,10 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb, if (!(flg & TCP_FLAG_ACK)) return NULL; + if (tcp_rsk(req)->accecn_ok && tmp_opt.accecn >= 0 && + tcp_rsk(req)->saw_accecn_opt < TCP_ACCECN_OPT_COUNTER_SEEN) + tcp_rsk(req)->saw_accecn_opt = tcp_accecn_option_init(skb, tmp_opt.accecn); + /* For Fast Open no more processing is needed (sk is the * child socket). 
*/ diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index f070128b69e6..4cc590a47f43 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -841,6 +841,7 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb, /* Simultaneous open SYN/ACK needs AccECN option but not SYN */ if (unlikely((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_ACK) && tcp_ecn_mode_accecn(tp) && + inet_csk(sk)->icsk_retransmits < 2 && !(sock_net(sk)->ipv4.sysctl_tcp_ecn & TCP_ACCECN_NO_OPT) && (remaining >= TCPOLEN_EXP_ACCECN_BASE))) { opts->ecn_bytes = synack_ecn_bytes; @@ -914,7 +915,7 @@ static unsigned int tcp_synack_options(const struct sock *sk, smc_set_option_cond(tcp_sk(sk), ireq, opts, &remaining); - if (treq->accecn_ok && + if (treq->accecn_ok && req->num_timeout < 1 && !(sock_net(sk)->ipv4.sysctl_tcp_ecn & TCP_ACCECN_NO_OPT) && (remaining >= TCPOLEN_EXP_ACCECN_BASE)) { opts->ecn_bytes = synack_ecn_bytes; @@ -990,6 +991,7 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb } if (tcp_ecn_mode_accecn(tp) && + (tp->saw_accecn_opt & TCP_ACCECN_OPT_SEEN) && !(sock_net(sk)->ipv4.sysctl_tcp_ecn & TCP_ACCECN_NO_OPT)) { if (tp->accecn_opt_demand || (tcp_stamp_us_delta(tp->tcp_mstamp, tp->accecn_opt_tstamp) >= From patchwork Wed Mar 18 09:43:29 2020 X-Patchwork-Submitter: =?utf-8?q?Ilpo_J=C3=A4rvinen?= X-Patchwork-Id: 222322
From: =?iso-8859-1?q?Ilpo_J=E4rvinen?= To: netdev@vger.kernel.org Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans Subject: [RFC PATCH 25/28] tcp: try to avoid safer when ACKs are thinned Date: Wed, 18 Mar 2020 11:43:29 +0200 Message-Id: <1584524612-24470-26-git-send-email-ilpo.jarvinen@helsinki.fi> In-Reply-To: <1584524612-24470-1-git-send-email-ilpo.jarvinen@helsinki.fi> References: <1584524612-24470-1-git-send-email-ilpo.jarvinen@helsinki.fi> From: Ilpo Järvinen Add an EWMA of newly ACKed packets. When ACK thinning occurs, select between the safer and the unsafe cep delta based on it.
Signed-off-by: Ilpo Järvinen --- include/linux/tcp.h | 1 + net/ipv4/tcp.c | 1 + net/ipv4/tcp_input.c | 20 +++++++++++++++++++- 3 files changed, 21 insertions(+), 1 deletion(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 64db51e5d45e..22be7cf2e084 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -332,6 +332,7 @@ struct tcp_sock { prev_ecnfield:2,/* ECN bits from the previous segment */ accecn_opt_demand:2,/* Demand AccECN option for n next ACKs */ estimate_ecnfield:2;/* ECN field for AccECN delivered estimates */ + u16 pkts_acked_ewma;/* EWMA of packets acked for AccECN cep heuristic */ u64 accecn_opt_tstamp; /* Last AccECN option sent timestamp */ u32 lost; /* Total data packets lost incl. rexmits */ u32 app_limited; /* limited until "delivered" reaches this val */ diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 09f73f81e6fa..4a22c19fa6d5 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2629,6 +2629,7 @@ int tcp_disconnect(struct sock *sk, int flags) tcp_accecn_init_counters(tp); tp->prev_ecnfield = 0; tp->accecn_opt_tstamp = 0; + tp->pkts_acked_ewma = 0; tcp_set_ca_state(sk, TCP_CA_Open); tp->is_sack_reneg = 0; tcp_clear_retrans(tp); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 6bc9995202c8..f5476e6b1479 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -514,6 +514,10 @@ static bool tcp_accecn_process_option(struct tcp_sock *tp, return res; } +#define PKTS_ACKED_WEIGHT 6 +#define PKTS_ACKED_PREC 6 +#define ACK_COMP_THRESH 4 + /* Returns the ECN CE delta */ static u32 tcp_accecn_process(struct tcp_sock *tp, const struct sk_buff *skb, u32 delivered_pkts, u32 delivered_bytes, int flag) @@ -529,6 +533,19 @@ static u32 tcp_accecn_process(struct tcp_sock *tp, const struct sk_buff *skb, opt_deltas_valid = tcp_accecn_process_option(tp, skb, delivered_bytes); + if (delivered_pkts) { + if (!tp->pkts_acked_ewma) { + tp->pkts_acked_ewma = delivered_pkts << PKTS_ACKED_PREC; + } else { + u32 ewma = 
tp->pkts_acked_ewma; + + ewma = (((ewma << PKTS_ACKED_WEIGHT) - ewma) + + (delivered_pkts << PKTS_ACKED_PREC)) >> + PKTS_ACKED_WEIGHT; + tp->pkts_acked_ewma = min_t(u32, ewma, 0xFFFFU); + } + } + if (!(flag & FLAG_SLOWPATH)) { /* AccECN counter might overflow on large ACKs */ if (delivered_pkts <= TCP_ACCECN_CEP_ACE_MASK) @@ -555,7 +572,8 @@ static u32 tcp_accecn_process(struct tcp_sock *tp, const struct sk_buff *skb, return safe_delta; if (d_ceb < safe_delta * tp->mss_cache >> TCP_ACCECN_SAFETY_SHIFT) return delta; - } + } else if (tp->pkts_acked_ewma > (ACK_COMP_THRESH << PKTS_ACKED_PREC)) + return delta; return safe_delta; } From patchwork Wed Mar 18 09:43:32 2020 X-Patchwork-Submitter: =?utf-8?q?Ilpo_J=C3=A4rvinen?= X-Patchwork-Id: 222317
From: =?iso-8859-1?q?Ilpo_J=E4rvinen?= To: netdev@vger.kernel.org Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans Subject: [RFC PATCH 28/28] tcp: AccECN sysctl documentation Date: Wed, 18 Mar 2020 11:43:32 +0200 Message-Id: <1584524612-24470-29-git-send-email-ilpo.jarvinen@helsinki.fi> In-Reply-To: <1584524612-24470-1-git-send-email-ilpo.jarvinen@helsinki.fi> References: <1584524612-24470-1-git-send-email-ilpo.jarvinen@helsinki.fi> From: Ilpo Järvinen Signed-off-by: Ilpo Järvinen --- Documentation/networking/ip-sysctl.txt | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 5f53faff4e25..ecca6e1d6bea 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -301,15 +301,21 @@ tcp_ecn - INTEGER 0 Disable ECN. Neither initiate nor accept ECN. 1 Enable ECN when requested by incoming connections and also request ECN on outgoing connection attempts. - 2 Enable ECN when requested by incoming connections + 2 Enable ECN or AccECN when requested by incoming connections but do not request ECN on outgoing connections. + 3 Enable AccECN when requested by incoming connections and + also request AccECN on outgoing connection attempts. + 0x102 Enable AccECN in optionless mode for incoming connections. + 0x103 Enable AccECN in optionless mode for incoming and outgoing + connections. Default: 2 tcp_ecn_fallback - BOOLEAN If the kernel detects that ECN connection misbehaves, enable fall back to non-ECN.
Currently, this knob implements the fallback - from RFC3168, section 6.1.1.1., but we reserve that in future, - additional detection mechanisms could be implemented under this + from RFC3168, section 6.1.1.1., as well as the ECT codepoint mangling + detection during the Accurate ECN handshake, but we reserve that in + future, additional detection mechanisms could be implemented under this knob. The value is not used, if tcp_ecn or per route (or congestion control) ECN settings are disabled. Default: 1 (fallback enabled)