From patchwork Wed Dec 2 22:53:42 2020
X-Patchwork-Submitter: Arjun Roy
X-Patchwork-Id: 337593
From: Arjun Roy
To: davem@davemloft.net, netdev@vger.kernel.org
Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com
Subject: [net-next v3 1/8] net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.
Date: Wed, 2 Dec 2020 14:53:42 -0800
Message-Id: <20201202225349.935284-2-arjunroy.kdev@gmail.com>
In-Reply-To: <20201202225349.935284-1-arjunroy.kdev@gmail.com>
References: <20201202225349.935284-1-arjunroy.kdev@gmail.com>

From: Arjun Roy

When TCP receive zerocopy does not successfully map the entire requested
space, it outputs a 'hint' that the caller should recvmsg().

Augment zerocopy to accept a user buffer that it tries to copy this hint
into - if it is possible to copy the entire hint, it will do so. This elides
a recvmsg() call for received traffic that isn't exactly page-aligned in size.

This was tested with RPC-style traffic of arbitrary sizes. Normally, each
received message required at least one getsockopt() call, and one recvmsg()
call for the remaining unaligned data.

With this change, almost all of the recvmsg() calls are eliminated, leading
to a savings of about 25%-50% in number of system calls for RPC-style
workloads.

Signed-off-by: Arjun Roy
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
---
 include/uapi/linux/tcp.h |  2 +
 net/ipv4/tcp.c           | 84 ++++++++++++++++++++++++++++++++--------
 2 files changed, 70 insertions(+), 16 deletions(-)

diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index cfcb10b75483..62db78b9c1a0 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -349,5 +349,7 @@ struct tcp_zerocopy_receive {
 	__u32 recv_skip_hint;	/* out: amount of bytes to skip */
 	__u32 inq; /* out: amount of bytes in read queue */
 	__s32 err; /* out: socket error */
+	__u64 copybuf_address;	/* in: copybuf address (small reads) */
+	__s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */
 };
 #endif /* _UAPI_LINUX_TCP_H */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b2bc3d7fe9e8..887c6e986927 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1743,6 +1743,52 @@ int tcp_mmap(struct file *file, struct socket *sock,
 }
 EXPORT_SYMBOL(tcp_mmap);
 
+static int tcp_copy_straggler_data(struct tcp_zerocopy_receive *zc,
+				   struct sk_buff *skb, u32 copylen,
+				   u32 *offset, u32 *seq)
+{
+	unsigned long copy_address = (unsigned long)zc->copybuf_address;
+	struct msghdr msg = {};
+	struct iovec iov;
+	int err;
+
+	if (copy_address != zc->copybuf_address)
+		return -EINVAL;
+
+	err = import_single_range(READ, (void __user *)copy_address,
+				  copylen, &iov, &msg.msg_iter);
+	if (err)
+		return err;
+	err = skb_copy_datagram_msg(skb, *offset, &msg, copylen);
+	if (err)
+		return err;
+	zc->recv_skip_hint -= copylen;
+	*offset += copylen;
+	*seq += copylen;
+	return (__s32)copylen;
+}
+
+static int tcp_zerocopy_handle_leftover_data(struct tcp_zerocopy_receive *zc,
+					     struct sock *sk,
+					     struct sk_buff *skb,
+					     u32 *seq,
+					     s32 copybuf_len)
+{
+	u32 offset, copylen = min_t(u32, copybuf_len, zc->recv_skip_hint);
+
+	if (!copylen)
+		return 0;
+	/* skb is null if inq < PAGE_SIZE. */
+	if (skb)
+		offset = *seq - TCP_SKB_CB(skb)->seq;
+	else
+		skb = tcp_recv_skb(sk, *seq, &offset);
+
+	zc->copybuf_len = tcp_copy_straggler_data(zc, skb, copylen, &offset,
+						  seq);
+	return zc->copybuf_len < 0 ? 0 : copylen;
+}
+
 static int tcp_zerocopy_vm_insert_batch(struct vm_area_struct *vma,
 					struct page **pages,
 					unsigned long pages_to_map,
@@ -1776,8 +1822,10 @@ static int tcp_zerocopy_vm_insert_batch(struct vm_area_struct *vma,
 static int tcp_zerocopy_receive(struct sock *sk,
 				struct tcp_zerocopy_receive *zc)
 {
+	u32 length = 0, offset, vma_len, avail_len, aligned_len, copylen = 0;
 	unsigned long address = (unsigned long)zc->address;
-	u32 length = 0, seq, offset, zap_len;
+	s32 copybuf_len = zc->copybuf_len;
+	struct tcp_sock *tp = tcp_sk(sk);
 	#define PAGE_BATCH_SIZE 8
 	struct page *pages[PAGE_BATCH_SIZE];
 	const skb_frag_t *frags = NULL;
@@ -1785,10 +1833,12 @@ static int tcp_zerocopy_receive(struct sock *sk,
 	struct sk_buff *skb = NULL;
 	unsigned long pg_idx = 0;
 	unsigned long curr_addr;
-	struct tcp_sock *tp;
-	int inq;
+	u32 seq = tp->copied_seq;
+	int inq = tcp_inq(sk);
 	int ret;
 
+	zc->copybuf_len = 0;
+
 	if (address & (PAGE_SIZE - 1) || address != zc->address)
 		return -EINVAL;
 
@@ -1797,8 +1847,6 @@ static int tcp_zerocopy_receive(struct sock *sk,
 
 	sock_rps_record_flow(sk);
 
-	tp = tcp_sk(sk);
-
 	mmap_read_lock(current->mm);
 
 	vma = find_vma(current->mm, address);
@@ -1806,17 +1854,16 @@ static int tcp_zerocopy_receive(struct sock *sk,
 		mmap_read_unlock(current->mm);
 		return -EINVAL;
 	}
-	zc->length = min_t(unsigned long, zc->length, vma->vm_end - address);
-
-	seq = tp->copied_seq;
-	inq = tcp_inq(sk);
-	zc->length = min_t(u32, zc->length, inq);
-	zap_len = zc->length & ~(PAGE_SIZE - 1);
-	if (zap_len) {
-		zap_page_range(vma, address, zap_len);
+	vma_len = min_t(unsigned long, zc->length, vma->vm_end - address);
+	avail_len = min_t(u32, vma_len, inq);
+	aligned_len = avail_len & ~(PAGE_SIZE - 1);
+	if (aligned_len) {
+		zap_page_range(vma, address, aligned_len);
+		zc->length = aligned_len;
 		zc->recv_skip_hint = 0;
 	} else {
-		zc->recv_skip_hint = zc->length;
+		zc->length = avail_len;
+		zc->recv_skip_hint = avail_len;
 	}
 	ret = 0;
 	curr_addr = address;
@@ -1885,13 +1932,18 @@ static int tcp_zerocopy_receive(struct sock *sk,
 	}
 out:
 	mmap_read_unlock(current->mm);
-	if (length) {
+	/* Try to copy straggler data. */
+	if (!ret)
+		copylen = tcp_zerocopy_handle_leftover_data(zc, sk, skb, &seq,
+							    copybuf_len);
+
+	if (length + copylen) {
 		WRITE_ONCE(tp->copied_seq, seq);
 		tcp_rcv_space_adjust(sk);
 
 		/* Clean up data we have read: This will do ACK frames. */
 		tcp_recv_skb(sk, seq, &offset);
-		tcp_cleanup_rbuf(sk, length);
+		tcp_cleanup_rbuf(sk, length + copylen);
 		ret = 0;
 		if (length == zc->length)
 			zc->recv_skip_hint = 0;
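The new copybuf fields are consumed from userspace through the existing
TCP_ZEROCOPY_RECEIVE getsockopt(). Below is a minimal, hypothetical userspace
sketch of that usage; the wrapper function, the locally mirrored struct layout
and the buffer sizes are illustrative assumptions, not part of the patch.

```c
/* Hypothetical userspace sketch: receive via TCP_ZEROCOPY_RECEIVE and let
 * the kernel copy any unaligned tail into copybuf, eliding a recvmsg().
 * Assumes `fd` is a connected TCP socket and `map` is a page-aligned region
 * previously mmap()ed against the socket (setup not shown).
 */
#include <linux/types.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef TCP_ZEROCOPY_RECEIVE
#define TCP_ZEROCOPY_RECEIVE 35
#endif

struct tcp_zerocopy_receive_v3 {	/* mirrors the uapi struct after this patch */
	__u64 address;		/* in: address of mapping */
	__u32 length;		/* in/out: number of bytes to map/mapped */
	__u32 recv_skip_hint;	/* out: amount of bytes to skip */
	__u32 inq;		/* out: amount of bytes in read queue */
	__s32 err;		/* out: socket error */
	__u64 copybuf_address;	/* in: copybuf address (small reads) */
	__s32 copybuf_len;	/* in/out: copybuf bytes avail/used or error */
};

static ssize_t zc_receive(int fd, void *map, size_t map_len,
			  char *copybuf, int copybuf_len)
{
	struct tcp_zerocopy_receive_v3 zc;
	socklen_t zc_len = sizeof(zc);

	memset(&zc, 0, sizeof(zc));
	zc.address = (__u64)(unsigned long)map;
	zc.length = map_len;
	zc.copybuf_address = (__u64)(unsigned long)copybuf;
	zc.copybuf_len = copybuf_len;

	if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len))
		return -1;

	/* zc.length bytes are now mapped at `map`; zc.copybuf_len bytes of
	 * straggler (non page-aligned) data were copied into `copybuf`.
	 * Only if recv_skip_hint is still nonzero is a recvmsg() needed.
	 */
	return (ssize_t)zc.length + (zc.copybuf_len > 0 ? zc.copybuf_len : 0);
}
```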
From patchwork Wed Dec 2 22:53:43 2020
X-Patchwork-Submitter: Arjun Roy
X-Patchwork-Id: 337592
From: Arjun Roy
To: davem@davemloft.net, netdev@vger.kernel.org
Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com
Subject: [net-next v3 2/8] net-tcp: Introduce tcp_recvmsg_locked().
Date: Wed, 2 Dec 2020 14:53:43 -0800
Message-Id: <20201202225349.935284-3-arjunroy.kdev@gmail.com>
In-Reply-To: <20201202225349.935284-1-arjunroy.kdev@gmail.com>
References: <20201202225349.935284-1-arjunroy.kdev@gmail.com>

From: Arjun Roy

Refactor tcp_recvmsg() by splitting it into locked and unlocked portions.
Callers already holding the socket lock and not using ERRQUEUE/cmsg/busy
polling can simply call tcp_recvmsg_locked(). This is in preparation for a
short-circuit copy performed by TCP receive zerocopy for small (< PAGE_SIZE,
or otherwise requested by the user) reads.

Signed-off-by: Arjun Roy
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
---
 net/ipv4/tcp.c | 69 ++++++++++++++++++++++++++++----------------------
 1 file changed, 39 insertions(+), 30 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 887c6e986927..232cb478bacd 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2065,36 +2065,28 @@ static int tcp_inq_hint(struct sock *sk)
  *	Probably, code can be easily improved even more.
  */
 
-int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
-		int flags, int *addr_len)
+static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
+			      int nonblock, int flags,
+			      struct scm_timestamping_internal *tss,
+			      int *cmsg_flags)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int copied = 0;
 	u32 peek_seq;
 	u32 *seq;
 	unsigned long used;
-	int err, inq;
+	int err;
 	int target;		/* Read at least this many bytes */
 	long timeo;
 	struct sk_buff *skb, *last;
 	u32 urg_hole = 0;
-	struct scm_timestamping_internal tss;
-	int cmsg_flags;
-
-	if (unlikely(flags & MSG_ERRQUEUE))
-		return inet_recv_error(sk, msg, len, addr_len);
-
-	if (sk_can_busy_loop(sk) && skb_queue_empty_lockless(&sk->sk_receive_queue) &&
-	    (sk->sk_state == TCP_ESTABLISHED))
-		sk_busy_loop(sk, nonblock);
-
-	lock_sock(sk);
 
 	err = -ENOTCONN;
 	if (sk->sk_state == TCP_LISTEN)
 		goto out;
 
-	cmsg_flags = tp->recvmsg_inq ? 1 : 0;
+	if (tp->recvmsg_inq)
+		*cmsg_flags = 1;
 	timeo = sock_rcvtimeo(sk, nonblock);
 
 	/* Urgent data needs to be handled specially. */
@@ -2274,8 +2266,8 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 		}
 
 		if (TCP_SKB_CB(skb)->has_rxtstamp) {
-			tcp_update_recv_tstamps(skb, &tss);
-			cmsg_flags |= 2;
+			tcp_update_recv_tstamps(skb, tss);
+			*cmsg_flags |= 2;
 		}
 
 		if (used + offset < skb->len)
@@ -2301,22 +2293,9 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 
 	/* Clean up data we have read: This will do ACK frames. */
 	tcp_cleanup_rbuf(sk, copied);
-
-	release_sock(sk);
-
-	if (cmsg_flags) {
-		if (cmsg_flags & 2)
-			tcp_recv_timestamp(msg, sk, &tss);
-		if (cmsg_flags & 1) {
-			inq = tcp_inq_hint(sk);
-			put_cmsg(msg, SOL_TCP, TCP_CM_INQ, sizeof(inq), &inq);
-		}
-	}
-
 	return copied;
 
 out:
-	release_sock(sk);
 	return err;
 
 recv_urg:
@@ -2327,6 +2306,36 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 	err = tcp_peek_sndq(sk, msg, len);
 	goto out;
 }
+
+int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
+		int flags, int *addr_len)
+{
+	int cmsg_flags = 0, ret, inq;
+	struct scm_timestamping_internal tss;
+
+	if (unlikely(flags & MSG_ERRQUEUE))
+		return inet_recv_error(sk, msg, len, addr_len);
+
+	if (sk_can_busy_loop(sk) &&
+	    skb_queue_empty_lockless(&sk->sk_receive_queue) &&
+	    sk->sk_state == TCP_ESTABLISHED)
+		sk_busy_loop(sk, nonblock);
+
+	lock_sock(sk);
+	ret = tcp_recvmsg_locked(sk, msg, len, nonblock, flags, &tss,
+				 &cmsg_flags);
+	release_sock(sk);
+
+	if (cmsg_flags && ret >= 0) {
+		if (cmsg_flags & 2)
+			tcp_recv_timestamp(msg, sk, &tss);
+		if (cmsg_flags & 1) {
+			inq = tcp_inq_hint(sk);
+			put_cmsg(msg, SOL_TCP, TCP_CM_INQ, sizeof(inq), &inq);
+		}
+	}
+	return ret;
+}
 EXPORT_SYMBOL(tcp_recvmsg);
 
 void tcp_set_state(struct sock *sk, int state)
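For context, here is a sketch of how a caller inside net/ipv4/tcp.c that
already holds the socket lock might use the new helper. It mirrors what
receive_fallback_to_copy() does in patch 6/8 of this series and is not itself
part of this patch; the function name and the nonblock/flags choices are
assumptions made for illustration.

```c
/* Sketch only: assumed to live in net/ipv4/tcp.c, after tcp_recvmsg_locked()
 * is defined (it is static). The caller must already hold the socket lock.
 */
static int read_inline_locked(struct sock *sk, void __user *buf, int len)
{
	struct scm_timestamping_internal tss_unused;
	int cmsg_flags_unused = 0;
	struct msghdr msg = {};
	struct iovec iov;
	int err;

	err = import_single_range(READ, buf, len, &iov, &msg.msg_iter);
	if (err)
		return err;

	/* No ERRQUEUE, no busy polling, no cmsg handling: the lock-holding
	 * caller only wants the plain data copy done by the locked core.
	 */
	return tcp_recvmsg_locked(sk, &msg, len, /*nonblock=*/1, /*flags=*/0,
				  &tss_unused, &cmsg_flags_unused);
}
```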
From patchwork Wed Dec 2 22:53:44 2020
X-Patchwork-Submitter: Arjun Roy
X-Patchwork-Id: 336521
From: Arjun Roy
To: davem@davemloft.net, netdev@vger.kernel.org
Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com
Subject: [net-next v3 3/8] net-zerocopy: Refactor skb frag fast-forward op.
Date: Wed, 2 Dec 2020 14:53:44 -0800
Message-Id: <20201202225349.935284-4-arjunroy.kdev@gmail.com>
In-Reply-To: <20201202225349.935284-1-arjunroy.kdev@gmail.com>
References: <20201202225349.935284-1-arjunroy.kdev@gmail.com>

From: Arjun Roy

Refactor skb frag fast-forwarding for tcp receive zerocopy. This is part of
a patch set that introduces short-circuited hybrid copies for small receive
operations, which results in roughly 33% fewer syscalls for small RPC
scenarios.

skb_advance_to_frag(), given a skb and an offset into the skb, iterates from
the first frag for the skb until we're at the frag specified by the offset.
Assuming the offset provided refers to how many bytes in the skb are already
read, the returned frag points to the next frag we may read from, while
offset_frag is set to the number of bytes from this frag that we have already
read.

If frag is not null and offset_frag is equal to 0, then we may be able to map
this frag's page into the process address space with vm_insert_page().
However, if offset_frag is not equal to 0, then we cannot do so.
Signed-off-by: Arjun Roy
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
---
 net/ipv4/tcp.c | 35 ++++++++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 232cb478bacd..0f17b46c4c0c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1743,6 +1743,28 @@ int tcp_mmap(struct file *file, struct socket *sock,
 }
 EXPORT_SYMBOL(tcp_mmap);
 
+static skb_frag_t *skb_advance_to_frag(struct sk_buff *skb, u32 offset_skb,
+				       u32 *offset_frag)
+{
+	skb_frag_t *frag;
+
+	offset_skb -= skb_headlen(skb);
+	if ((int)offset_skb < 0 || skb_has_frag_list(skb))
+		return NULL;
+
+	frag = skb_shinfo(skb)->frags;
+	while (offset_skb) {
+		if (skb_frag_size(frag) > offset_skb) {
+			*offset_frag = offset_skb;
+			return frag;
+		}
+		offset_skb -= skb_frag_size(frag);
+		++frag;
+	}
+	*offset_frag = 0;
+	return frag;
+}
+
 static int tcp_copy_straggler_data(struct tcp_zerocopy_receive *zc,
 				   struct sk_buff *skb, u32 copylen,
 				   u32 *offset, u32 *seq)
@@ -1869,6 +1891,8 @@ static int tcp_zerocopy_receive(struct sock *sk,
 	curr_addr = address;
 	while (length + PAGE_SIZE <= zc->length) {
 		if (zc->recv_skip_hint < PAGE_SIZE) {
+			u32 offset_frag;
+
 			/* If we're here, finish the current batch. */
 			if (pg_idx) {
 				ret = tcp_zerocopy_vm_insert_batch(vma, pages,
@@ -1889,16 +1913,9 @@ static int tcp_zerocopy_receive(struct sock *sk,
 				skb = tcp_recv_skb(sk, seq, &offset);
 			}
 			zc->recv_skip_hint = skb->len - offset;
-			offset -= skb_headlen(skb);
-			if ((int)offset < 0 || skb_has_frag_list(skb))
+			frags = skb_advance_to_frag(skb, offset, &offset_frag);
+			if (!frags || offset_frag)
 				break;
-			frags = skb_shinfo(skb)->frags;
-			while (offset) {
-				if (skb_frag_size(frags) > offset)
-					goto out;
-				offset -= skb_frag_size(frags);
-				frags++;
-			}
 		}
 		if (skb_frag_size(frags) != PAGE_SIZE || skb_frag_off(frags)) {
 			int remaining = zc->recv_skip_hint;
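To make the (frag, offset_frag) semantics concrete, here is a small
standalone C program that performs the same walk over a plain array of frag
sizes. The frag sizes are made up, skb_headlen() is assumed to be zero, and
the helper name is hypothetical; it is an illustration, not kernel code.

```c
/* Standalone illustration of the skb_advance_to_frag() walk, using plain
 * arrays instead of skb/frag structures.
 */
#include <stdio.h>

/* Returns the index of the frag containing `offset`, and stores how far into
 * that frag the offset lands in *offset_frag (0 means a frag boundary).
 */
static int advance_to_frag(const unsigned int *frag_size, int nr_frags,
			   unsigned int offset, unsigned int *offset_frag)
{
	int i;

	for (i = 0; i < nr_frags; i++) {
		if (frag_size[i] > offset) {
			*offset_frag = offset;
			return i;
		}
		offset -= frag_size[i];
	}
	*offset_frag = 0;
	return i;	/* one past the last frag, like the kernel helper */
}

int main(void)
{
	const unsigned int frags[] = { 4096, 4096, 1460 };
	unsigned int off_frag;
	int idx;

	idx = advance_to_frag(frags, 3, 4096, &off_frag);
	/* Prints "frag 1, offset_frag 0": the frag starts on a boundary, so
	 * its page may be remappable with vm_insert_page().
	 */
	printf("offset 4096 -> frag %d, offset_frag %u\n", idx, off_frag);

	idx = advance_to_frag(frags, 3, 5000, &off_frag);
	/* Prints "frag 1, offset_frag 904": mid-frag, cannot be remapped. */
	printf("offset 5000 -> frag %d, offset_frag %u\n", idx, off_frag);
	return 0;
}
```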
From patchwork Wed Dec 2 22:53:45 2020
X-Patchwork-Submitter: Arjun Roy
X-Patchwork-Id: 337591
From: Arjun Roy
To: davem@davemloft.net, netdev@vger.kernel.org
Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com
Subject: [net-next v3 4/8] net-zerocopy: Refactor frag-is-remappable test.
Date: Wed, 2 Dec 2020 14:53:45 -0800
Message-Id: <20201202225349.935284-5-arjunroy.kdev@gmail.com>
In-Reply-To: <20201202225349.935284-1-arjunroy.kdev@gmail.com>
References: <20201202225349.935284-1-arjunroy.kdev@gmail.com>

From: Arjun Roy

Refactor frag-is-remappable test for tcp receive zerocopy. This is part of
a patch set that introduces short-circuited hybrid copies for small receive
operations, which results in roughly 33% fewer syscalls for small RPC
scenarios.
Signed-off-by: Arjun Roy
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
---
 net/ipv4/tcp.c | 34 ++++++++++++++++++++++++++--------
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0f17b46c4c0c..4bdd4a358588 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1765,6 +1765,26 @@ static skb_frag_t *skb_advance_to_frag(struct sk_buff *skb, u32 offset_skb,
 	return frag;
 }
 
+static bool can_map_frag(const skb_frag_t *frag)
+{
+	return skb_frag_size(frag) == PAGE_SIZE && !skb_frag_off(frag);
+}
+
+static int find_next_mappable_frag(const skb_frag_t *frag,
+				   int remaining_in_skb)
+{
+	int offset = 0;
+
+	if (likely(can_map_frag(frag)))
+		return 0;
+
+	while (offset < remaining_in_skb && !can_map_frag(frag)) {
+		offset += skb_frag_size(frag);
+		++frag;
+	}
+	return offset;
+}
+
 static int tcp_copy_straggler_data(struct tcp_zerocopy_receive *zc,
 				   struct sk_buff *skb, u32 copylen,
 				   u32 *offset, u32 *seq)
@@ -1890,6 +1910,8 @@ static int tcp_zerocopy_receive(struct sock *sk,
 	ret = 0;
 	curr_addr = address;
 	while (length + PAGE_SIZE <= zc->length) {
+		int mappable_offset;
+
 		if (zc->recv_skip_hint < PAGE_SIZE) {
 			u32 offset_frag;
 
@@ -1917,15 +1939,11 @@ static int tcp_zerocopy_receive(struct sock *sk,
 			if (!frags || offset_frag)
 				break;
 		}
-		if (skb_frag_size(frags) != PAGE_SIZE || skb_frag_off(frags)) {
-			int remaining = zc->recv_skip_hint;
 
-			while (remaining && (skb_frag_size(frags) != PAGE_SIZE ||
-					     skb_frag_off(frags))) {
-				remaining -= skb_frag_size(frags);
-				frags++;
-			}
-			zc->recv_skip_hint -= remaining;
+		mappable_offset = find_next_mappable_frag(frags,
+							  zc->recv_skip_hint);
+		if (mappable_offset) {
+			zc->recv_skip_hint = mappable_offset;
 			break;
 		}
 		pages[pg_idx] = skb_frag_page(frags);
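As a concrete illustration with assumed sizes: on a system with 4 KB pages,
a frag passes can_map_frag() only if it is exactly PAGE_SIZE long and starts
at page offset 0. Given made-up frags of 1460, 1460 and 4096 bytes with 7016
bytes remaining in the skb, find_next_mappable_frag() skips the first two and
returns 1460 + 1460 = 2920, so tcp_zerocopy_receive() sets recv_skip_hint to
2920: those bytes must be copied conventionally before the third frag's page
can be inserted.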
From patchwork Wed Dec 2 22:53:46 2020
X-Patchwork-Submitter: Arjun Roy
X-Patchwork-Id: 336520
From: Arjun Roy
To: davem@davemloft.net, netdev@vger.kernel.org
Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com
Subject: [net-next v3 5/8] net-zerocopy: Fast return if inq < PAGE_SIZE
Date: Wed, 2 Dec 2020 14:53:46 -0800
Message-Id: <20201202225349.935284-6-arjunroy.kdev@gmail.com>
In-Reply-To: <20201202225349.935284-1-arjunroy.kdev@gmail.com>
References: <20201202225349.935284-1-arjunroy.kdev@gmail.com>

From: Arjun Roy

Sometimes, we may call tcp receive zerocopy when inq is 0, or inq < PAGE_SIZE,
in which case we cannot remap pages. In this case, simply return the
appropriate hint for regular copying without taking mmap_sem.
Signed-off-by: Arjun Roy
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
---
 net/ipv4/tcp.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4bdd4a358588..b2f24a5ec230 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1889,6 +1889,14 @@ static int tcp_zerocopy_receive(struct sock *sk,
 
 	sock_rps_record_flow(sk);
 
+	if (inq < PAGE_SIZE) {
+		zc->length = 0;
+		zc->recv_skip_hint = inq;
+		if (!inq && sock_flag(sk, SOCK_DONE))
+			return -EIO;
+		return 0;
+	}
+
 	mmap_read_lock(current->mm);
 
 	vma = find_vma(current->mm, address);
From patchwork Wed Dec 2 22:53:47 2020
X-Patchwork-Submitter: Arjun Roy
X-Patchwork-Id: 337590
From: Arjun Roy
To: davem@davemloft.net, netdev@vger.kernel.org
Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com
Subject: [net-next v3 6/8] net-zerocopy: Introduce short-circuit small reads.
Date: Wed, 2 Dec 2020 14:53:47 -0800
Message-Id: <20201202225349.935284-7-arjunroy.kdev@gmail.com>
In-Reply-To: <20201202225349.935284-1-arjunroy.kdev@gmail.com>
References: <20201202225349.935284-1-arjunroy.kdev@gmail.com>

From: Arjun Roy

Sometimes, we may call tcp receive zerocopy when inq is 0, or inq < PAGE_SIZE,
or inq is generally small enough that it is cheaper to copy rather than remap
pages.

In these cases, we may want to either return early (inq=0) or attempt to use
the provided copy buffer to simply copy the received data. This allows us to
save both system call overhead and the latency of acquiring mmap_sem in read
mode for cases where it would be useless to do so.

This patchset enables this behaviour by:
1. Returning quickly if inq is 0.
2. Attempting to perform a regular copy if a hybrid copybuffer is provided
   and it is large enough to absorb all available bytes.
3. Return quickly if no such buffer was provided and there are less than
   PAGE_SIZE bytes available.

For small RPC ping-pong workloads, normally we would have 1 getsockopt(),
1 recvmsg() and 1 sendmsg() call per RPC. With this change, we remove the
recvmsg() call entirely, reducing the syscall overhead by about 33%. In
testing with small (hundreds of bytes) RPC traffic, this yields a syscall
reduction of about 33% and an efficiency gain of about 3-5% when defined as
QPS/CPU Util.
Signed-off-by: Arjun Roy
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
---
 net/ipv4/tcp.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b2f24a5ec230..f67dd732a47b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1785,6 +1785,39 @@ static int find_next_mappable_frag(const skb_frag_t *frag,
 	return offset;
 }
 
+static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
+			      int nonblock, int flags,
+			      struct scm_timestamping_internal *tss,
+			      int *cmsg_flags);
+static int receive_fallback_to_copy(struct sock *sk,
+				    struct tcp_zerocopy_receive *zc, int inq)
+{
+	unsigned long copy_address = (unsigned long)zc->copybuf_address;
+	struct scm_timestamping_internal tss_unused;
+	int err, cmsg_flags_unused;
+	struct msghdr msg = {};
+	struct iovec iov;
+
+	zc->length = 0;
+	zc->recv_skip_hint = 0;
+
+	if (copy_address != zc->copybuf_address)
+		return -EINVAL;
+
+	err = import_single_range(READ, (void __user *)copy_address,
+				  inq, &iov, &msg.msg_iter);
+	if (err)
+		return err;
+
+	err = tcp_recvmsg_locked(sk, &msg, inq, /*nonblock=*/1, /*flags=*/0,
+				 &tss_unused, &cmsg_flags_unused);
+	if (err < 0)
+		return err;
+
+	zc->copybuf_len = err;
+	return 0;
+}
+
 static int tcp_copy_straggler_data(struct tcp_zerocopy_receive *zc,
 				   struct sk_buff *skb, u32 copylen,
 				   u32 *offset, u32 *seq)
@@ -1889,6 +1922,9 @@ static int tcp_zerocopy_receive(struct sock *sk,
 
 	sock_rps_record_flow(sk);
 
+	if (inq && inq <= copybuf_len)
+		return receive_fallback_to_copy(sk, zc, inq);
+
 	if (inq < PAGE_SIZE) {
 		zc->length = 0;
 		zc->recv_skip_hint = inq;
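The following is a sketch of how a userspace receiver might interpret the
result of these short-circuit paths. The struct mirrors only the relevant
output fields of tcp_zerocopy_receive, and the helper name and sample values
are hypothetical; it simply encodes the three cases from the commit message
(all data absorbed by the copybuf, an unaligned tail left for recvmsg(), or
everything mapped).

```c
#include <linux/types.h>
#include <stdio.h>

struct zc_result {		/* subset of tcp_zerocopy_receive outputs */
	__u32 length;		/* bytes mapped at the zerocopy address */
	__u32 recv_skip_hint;	/* bytes that must still be recvmsg()ed */
	__s32 copybuf_len;	/* bytes the kernel copied into our copybuf */
};

/* Returns how many recvmsg() calls are still needed to drain the report. */
static int recvs_still_needed(const struct zc_result *zc)
{
	if (zc->length == 0 && zc->copybuf_len > 0)
		return 0;	/* small read fully absorbed by the copybuf */
	if (zc->recv_skip_hint > 0)
		return 1;	/* unaligned tail: one recvmsg() remains */
	return 0;		/* everything landed in the mapping */
}

int main(void)
{
	struct zc_result small_rpc = { .length = 0, .recv_skip_hint = 0,
				       .copybuf_len = 512 };
	struct zc_result big_read  = { .length = 16384, .recv_skip_hint = 700,
				       .copybuf_len = 0 };

	/* Prints 0: the short-circuit copy removed the recvmsg() entirely. */
	printf("small RPC: %d extra recvmsg() calls\n",
	       recvs_still_needed(&small_rpc));
	/* Prints 1: the 700-byte straggler still needs a regular read. */
	printf("big read:  %d extra recvmsg() calls\n",
	       recvs_still_needed(&big_read));
	return 0;
}
```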
From patchwork Wed Dec 2 22:53:48 2020
X-Patchwork-Submitter: Arjun Roy
X-Patchwork-Id: 336519
From: Arjun Roy
To: davem@davemloft.net, netdev@vger.kernel.org
Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com
Subject: [net-next v3 7/8] net-zerocopy: Set zerocopy hint when data is copied
Date: Wed, 2 Dec 2020 14:53:48 -0800
Message-Id: <20201202225349.935284-8-arjunroy.kdev@gmail.com>
In-Reply-To: <20201202225349.935284-1-arjunroy.kdev@gmail.com>
References: <20201202225349.935284-1-arjunroy.kdev@gmail.com>

From: Arjun Roy

Set zerocopy hint, even when falling back to copy, so that the pending data
can be efficiently received using zerocopy when possible.

Signed-off-by: Arjun Roy
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
---
 net/ipv4/tcp.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f67dd732a47b..49480ce162db 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1785,6 +1785,43 @@ static int find_next_mappable_frag(const skb_frag_t *frag,
 	return offset;
 }
 
+static void tcp_zerocopy_set_hint_for_skb(struct sock *sk,
+					  struct tcp_zerocopy_receive *zc,
+					  struct sk_buff *skb, u32 offset)
+{
+	u32 frag_offset, partial_frag_remainder = 0;
+	int mappable_offset;
+	skb_frag_t *frag;
+
+	/* worst case: skip to next skb. try to improve on this case below */
+	zc->recv_skip_hint = skb->len - offset;
+
+	/* Find the frag containing this offset (and how far into that frag) */
+	frag = skb_advance_to_frag(skb, offset, &frag_offset);
+	if (!frag)
+		return;
+
+	if (frag_offset) {
+		struct skb_shared_info *info = skb_shinfo(skb);
+
+		/* We read part of the last frag, must recvmsg() rest of skb. */
+		if (frag == &info->frags[info->nr_frags - 1])
+			return;
+
+		/* Else, we must at least read the remainder in this frag. */
+		partial_frag_remainder = skb_frag_size(frag) - frag_offset;
+		zc->recv_skip_hint -= partial_frag_remainder;
+		++frag;
+	}
+
+	/* partial_frag_remainder: If part way through a frag, must read rest.
+	 * mappable_offset: Bytes till next mappable frag, *not* counting bytes
+	 * in partial_frag_remainder.
+	 */
+	mappable_offset = find_next_mappable_frag(frag, zc->recv_skip_hint);
+	zc->recv_skip_hint = mappable_offset + partial_frag_remainder;
+}
+
 static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 			      int nonblock, int flags,
 			      struct scm_timestamping_internal *tss,
@@ -1815,6 +1852,14 @@ static int receive_fallback_to_copy(struct sock *sk,
 		return err;
 
 	zc->copybuf_len = err;
+	if (likely(zc->copybuf_len)) {
+		struct sk_buff *skb;
+		u32 offset;
+
+		skb = tcp_recv_skb(sk, tcp_sk(sk)->copied_seq, &offset);
+		if (skb)
+			tcp_zerocopy_set_hint_for_skb(sk, zc, skb, offset);
+	}
 	return 0;
 }
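A worked example of the hint arithmetic above, with assumed sizes: suppose
the skb holds two 4096-byte frags and the fallback copy consumed 1000 bytes,
so offset = 1000 and the initial hint is 8192 - 1000 = 7192. Then
frag_offset = 1000 and partial_frag_remainder = 4096 - 1000 = 3096, reducing
the hint to 4096; the next frag is a whole page at offset 0, so
find_next_mappable_frag() returns 0 and recv_skip_hint ends up as
0 + 3096 = 3096. The caller should recvmsg() those 3096 bytes, after which
zerocopy can resume on the page-aligned frag.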
From patchwork Wed Dec 2 22:53:49 2020
X-Patchwork-Submitter: Arjun Roy
X-Patchwork-Id: 336518
From: Arjun Roy
To: davem@davemloft.net, netdev@vger.kernel.org
Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com
Subject: [net-next v3 8/8] net-zerocopy: Defer vm zap unless actually needed.
Date: Wed, 2 Dec 2020 14:53:49 -0800
Message-Id: <20201202225349.935284-9-arjunroy.kdev@gmail.com>
In-Reply-To: <20201202225349.935284-1-arjunroy.kdev@gmail.com>
References: <20201202225349.935284-1-arjunroy.kdev@gmail.com>

From: Arjun Roy

Zapping pages is required only if we are calling vm_insert_page into a region
where pages had previously been mapped. Receive zerocopy allows reusing such
regions, and hitherto called zap_page_range() before calling vm_insert_page()
in that range.

zap_page_range() can also be triggered from userspace with
madvise(MADV_DONTNEED). If userspace is configured to call this before reusing
a segment, or if there was nothing mapped at this virtual address to begin
with, we can avoid calling zap_page_range() under the socket lock. That said,
if userspace does not do that, then we are still responsible for calling
zap_page_range().

This patch adds a flag that the user can use to hint to the kernel that a zap
is not required. If the flag is not set, or if an older user application does
not have a flags field at all, then the kernel calls zap_page_range as before.
Also, if the flag is set but a zap is still required, the kernel performs that
zap as necessary. Thus incorrectly indicating that a zap can be avoided does
not change the correctness of operation. It also increases the batch size for
vm_insert_pages and prefetches the page struct for the batch since we're about
to bump the refcount.

An alternative mechanism could be to not have a flag, assume by default a zap
is not needed, and fall back to zapping if needed. However, this would harm
performance for older applications for which a zap is necessary, and thus we
implement it with an explicit flag so newer applications can opt in.

When using RPC-style traffic with medium sized (tens of KB) RPCs, this change
yields an efficiency improvement of about 30% for QPS/CPU usage.
Signed-off-by: Arjun Roy
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
---
 include/uapi/linux/tcp.h |   2 +
 net/ipv4/tcp.c           | 147 ++++++++++++++++++++++++++-------------
 2 files changed, 99 insertions(+), 50 deletions(-)

diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 62db78b9c1a0..13ceeb395eb8 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -343,6 +343,7 @@ struct tcp_diag_md5sig {
 
 /* setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) */
 
+#define TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT 0x1
 struct tcp_zerocopy_receive {
 	__u64 address;		/* in: address of mapping */
 	__u32 length;		/* in/out: number of bytes to map/mapped */
@@ -351,5 +352,6 @@ struct tcp_zerocopy_receive {
 	__s32 err; /* out: socket error */
 	__u64 copybuf_address;	/* in: copybuf address (small reads) */
 	__s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */
+	__u32 flags; /* in: flags */
 };
 #endif /* _UAPI_LINUX_TCP_H */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 49480ce162db..83d16f04f464 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1909,51 +1909,101 @@ static int tcp_zerocopy_handle_leftover_data(struct tcp_zerocopy_receive *zc,
 	return zc->copybuf_len < 0 ? 0 : copylen;
 }
 
+static int tcp_zerocopy_vm_insert_batch_error(struct vm_area_struct *vma,
+					      struct page **pending_pages,
+					      unsigned long pages_remaining,
+					      unsigned long *address,
+					      u32 *length,
+					      u32 *seq,
+					      struct tcp_zerocopy_receive *zc,
+					      u32 total_bytes_to_map,
+					      int err)
+{
+	/* At least one page did not map. Try zapping if we skipped earlier. */
+	if (err == -EBUSY &&
+	    zc->flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT) {
+		u32 maybe_zap_len;
+
+		maybe_zap_len = total_bytes_to_map -  /* All bytes to map */
+				*length +	      /* Mapped or pending */
+				(pages_remaining * PAGE_SIZE);  /* Failed map. */
+		zap_page_range(vma, *address, maybe_zap_len);
+		err = 0;
+	}
+
+	if (!err) {
+		unsigned long leftover_pages = pages_remaining;
+		int bytes_mapped;
+
+		/* We called zap_page_range, try to reinsert. */
+		err = vm_insert_pages(vma, *address,
+				      pending_pages,
+				      &pages_remaining);
+		bytes_mapped = PAGE_SIZE * (leftover_pages - pages_remaining);
+		*seq += bytes_mapped;
+		*address += bytes_mapped;
+	}
+	if (err) {
+		/* Either we were unable to zap, OR we zapped, retried an
+		 * insert, and still had an issue. Either way, pages_remaining
+		 * is the number of pages we were unable to map, and we unroll
+		 * some state we speculatively touched before.
+		 */
+		const int bytes_not_mapped = PAGE_SIZE * pages_remaining;
+
+		*length -= bytes_not_mapped;
+		zc->recv_skip_hint += bytes_not_mapped;
+	}
+	return err;
+}
+
 static int tcp_zerocopy_vm_insert_batch(struct vm_area_struct *vma,
 					struct page **pages,
-					unsigned long pages_to_map,
-					unsigned long *insert_addr,
-					u32 *length_with_pending,
+					unsigned int pages_to_map,
+					unsigned long *address,
+					u32 *length,
 					u32 *seq,
-					struct tcp_zerocopy_receive *zc)
+					struct tcp_zerocopy_receive *zc,
+					u32 total_bytes_to_map)
 {
 	unsigned long pages_remaining = pages_to_map;
-	int bytes_mapped;
-	int ret;
+	unsigned int pages_mapped;
+	unsigned int bytes_mapped;
+	int err;
 
-	ret = vm_insert_pages(vma, *insert_addr, pages, &pages_remaining);
-	bytes_mapped = PAGE_SIZE * (pages_to_map - pages_remaining);
+	err = vm_insert_pages(vma, *address, pages, &pages_remaining);
+	pages_mapped = pages_to_map - (unsigned int)pages_remaining;
+	bytes_mapped = PAGE_SIZE * pages_mapped;
 	/* Even if vm_insert_pages fails, it may have partially succeeded in
 	 * mapping (some but not all of the pages).
 	 */
 	*seq += bytes_mapped;
-	*insert_addr += bytes_mapped;
-	if (ret) {
-		/* But if vm_insert_pages did fail, we have to unroll some state
-		 * we speculatively touched before.
-		 */
-		const int bytes_not_mapped = PAGE_SIZE * pages_remaining;
-
-		*length_with_pending -= bytes_not_mapped;
-		zc->recv_skip_hint += bytes_not_mapped;
-	}
-	return ret;
+	*address += bytes_mapped;
+
+	if (likely(!err))
+		return 0;
+
+	/* Error: maybe zap and retry + rollback state for failed inserts. */
+	return tcp_zerocopy_vm_insert_batch_error(vma, pages + pages_mapped,
+		pages_remaining, address, length, seq, zc, total_bytes_to_map,
+		err);
 }
 
+#define TCP_ZEROCOPY_PAGE_BATCH_SIZE 32
 static int tcp_zerocopy_receive(struct sock *sk,
 				struct tcp_zerocopy_receive *zc)
 {
-	u32 length = 0, offset, vma_len, avail_len, aligned_len, copylen = 0;
+	u32 length = 0, offset, vma_len, avail_len, copylen = 0;
 	unsigned long address = (unsigned long)zc->address;
+	struct page *pages[TCP_ZEROCOPY_PAGE_BATCH_SIZE];
 	s32 copybuf_len = zc->copybuf_len;
 	struct tcp_sock *tp = tcp_sk(sk);
-	#define PAGE_BATCH_SIZE 8
-	struct page *pages[PAGE_BATCH_SIZE];
 	const skb_frag_t *frags = NULL;
+	unsigned int pages_to_map = 0;
 	struct vm_area_struct *vma;
 	struct sk_buff *skb = NULL;
-	unsigned long pg_idx = 0;
-	unsigned long curr_addr;
 	u32 seq = tp->copied_seq;
+	u32 total_bytes_to_map;
 	int inq = tcp_inq(sk);
 	int ret;
 
@@ -1987,34 +2037,24 @@ static int tcp_zerocopy_receive(struct sock *sk,
 	}
 	vma_len = min_t(unsigned long, zc->length, vma->vm_end - address);
 	avail_len = min_t(u32, vma_len, inq);
-	aligned_len = avail_len & ~(PAGE_SIZE - 1);
-	if (aligned_len) {
-		zap_page_range(vma, address, aligned_len);
-		zc->length = aligned_len;
+	total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
+	if (total_bytes_to_map) {
+		if (!(zc->flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT))
+			zap_page_range(vma, address, total_bytes_to_map);
+		zc->length = total_bytes_to_map;
 		zc->recv_skip_hint = 0;
 	} else {
 		zc->length = avail_len;
 		zc->recv_skip_hint = avail_len;
 	}
 	ret = 0;
-	curr_addr = address;
 	while (length + PAGE_SIZE <= zc->length) {
 		int mappable_offset;
+		struct page *page;
 
 		if (zc->recv_skip_hint < PAGE_SIZE) {
 			u32 offset_frag;
 
-			/* If we're here, finish the current batch. */
-			if (pg_idx) {
-				ret = tcp_zerocopy_vm_insert_batch(vma, pages,
-								   pg_idx,
-								   &curr_addr,
-								   &length,
-								   &seq, zc);
-				if (ret)
-					goto out;
-				pg_idx = 0;
-			}
 			if (skb) {
 				if (zc->recv_skip_hint > 0)
 					break;
@@ -2035,24 +2075,31 @@ static int tcp_zerocopy_receive(struct sock *sk,
 			zc->recv_skip_hint = mappable_offset;
 			break;
 		}
-		pages[pg_idx] = skb_frag_page(frags);
-		pg_idx++;
+		page = skb_frag_page(frags);
+		prefetchw(page);
+		pages[pages_to_map++] = page;
 		length += PAGE_SIZE;
 		zc->recv_skip_hint -= PAGE_SIZE;
 		frags++;
-		if (pg_idx == PAGE_BATCH_SIZE) {
-			ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx,
-							   &curr_addr, &length,
-							   &seq, zc);
+		if (pages_to_map == TCP_ZEROCOPY_PAGE_BATCH_SIZE ||
+		    zc->recv_skip_hint < PAGE_SIZE) {
+			/* Either full batch, or we're about to go to next skb
+			 * (and we cannot unroll failed ops across skbs).
+			 */
+			ret = tcp_zerocopy_vm_insert_batch(vma, pages,
+							   pages_to_map,
+							   &address, &length,
+							   &seq, zc,
+							   total_bytes_to_map);
 			if (ret)
 				goto out;
-			pg_idx = 0;
+			pages_to_map = 0;
 		}
 	}
-	if (pg_idx) {
-		ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx,
-						   &curr_addr, &length, &seq,
-						   zc);
+	if (pages_to_map) {
+		ret = tcp_zerocopy_vm_insert_batch(vma, pages, pages_to_map,
+						   &address, &length, &seq,
+						   zc, total_bytes_to_map);
 	}
 out:
 	mmap_read_unlock(current->mm);
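Finally, a hypothetical userspace sketch of the new opt-in flag: the
application releases the previously mapped pages itself with
madvise(MADV_DONTNEED) and then tells the kernel it may skip the zap under
the socket lock. The struct layout mirrors the uapi header after this patch;
the wrapper function and the #ifndef-guarded constants are illustrative
assumptions for a self-contained example.

```c
#include <linux/types.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>

#ifndef TCP_ZEROCOPY_RECEIVE
#define TCP_ZEROCOPY_RECEIVE 35
#endif
#ifndef TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT
#define TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT 0x1
#endif

struct tcp_zerocopy_receive_v3 {	/* mirrors the patched uapi struct */
	__u64 address;
	__u32 length;
	__u32 recv_skip_hint;
	__u32 inq;
	__s32 err;
	__u64 copybuf_address;
	__s32 copybuf_len;
	__u32 flags;		/* in: flags (new in this patch) */
};

static int zc_receive_clean(int fd, void *map, size_t map_len)
{
	struct tcp_zerocopy_receive_v3 zc;
	socklen_t zc_len = sizeof(zc);

	/* Release previously mapped pages ourselves, outside the socket lock.
	 * After this the region is known-clean, so the hint below is safe.
	 */
	if (madvise(map, map_len, MADV_DONTNEED))
		return -1;

	memset(&zc, 0, sizeof(zc));
	zc.address = (__u64)(unsigned long)map;
	zc.length = map_len;
	zc.flags = TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT;

	/* Even if the hint is wrong, the kernel zaps and retries on -EBUSY
	 * from vm_insert_pages(), so correctness does not depend on it.
	 */
	return getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len);
}
```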