From patchwork Wed Dec 2 22:09:38 2020
X-Patchwork-Submitter: Arjun Roy
X-Patchwork-Id: 336526
From: Arjun Roy
To: davem@davemloft.net, netdev@vger.kernel.org
Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com
Subject: [net-next v2 1/8]
net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy. Date: Wed, 2 Dec 2020 14:09:38 -0800 Message-Id: <20201202220945.911116-2-arjunroy.kdev@gmail.com> X-Mailer: git-send-email 2.29.2.576.ga3fc446d84-goog In-Reply-To: <20201202220945.911116-1-arjunroy.kdev@gmail.com> References: <20201202220945.911116-1-arjunroy.kdev@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: Arjun Roy When TCP receive zerocopy does not successfully map the entire requested space, it outputs a 'hint' that the caller should recvmsg(). Augment zerocopy to accept a user buffer that it tries to copy this hint into - if it is possible to copy the entire hint, it will do so. This elides a recvmsg() call for received traffic that isn't exactly page-aligned in size. This was tested with RPC-style traffic of arbitrary sizes. Normally, each received message required at least one getsockopt() call, and one recvmsg() call for the remaining unaligned data. With this change, almost all of the recvmsg() calls are eliminated, leading to a savings of about 25%-50% in number of system calls for RPC-style workloads. --- include/uapi/linux/tcp.h | 2 + net/ipv4/tcp.c | 84 ++++++++++++++++++++++++++++++++-------- 2 files changed, 70 insertions(+), 16 deletions(-) diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h index cfcb10b75483..62db78b9c1a0 100644 --- a/include/uapi/linux/tcp.h +++ b/include/uapi/linux/tcp.h @@ -349,5 +349,7 @@ struct tcp_zerocopy_receive { __u32 recv_skip_hint; /* out: amount of bytes to skip */ __u32 inq; /* out: amount of bytes in read queue */ __s32 err; /* out: socket error */ + __u64 copybuf_address; /* in: copybuf address (small reads) */ + __s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */ }; #endif /* _UAPI_LINUX_TCP_H */ diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index b2bc3d7fe9e8..887c6e986927 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1743,6 +1743,52 @@ int tcp_mmap(struct file *file, struct socket *sock, } EXPORT_SYMBOL(tcp_mmap); +static int tcp_copy_straggler_data(struct tcp_zerocopy_receive *zc, + struct sk_buff *skb, u32 copylen, + u32 *offset, u32 *seq) +{ + unsigned long copy_address = (unsigned long)zc->copybuf_address; + struct msghdr msg = {}; + struct iovec iov; + int err; + + if (copy_address != zc->copybuf_address) + return -EINVAL; + + err = import_single_range(READ, (void __user *)copy_address, + copylen, &iov, &msg.msg_iter); + if (err) + return err; + err = skb_copy_datagram_msg(skb, *offset, &msg, copylen); + if (err) + return err; + zc->recv_skip_hint -= copylen; + *offset += copylen; + *seq += copylen; + return (__s32)copylen; +} + +static int tcp_zerocopy_handle_leftover_data(struct tcp_zerocopy_receive *zc, + struct sock *sk, + struct sk_buff *skb, + u32 *seq, + s32 copybuf_len) +{ + u32 offset, copylen = min_t(u32, copybuf_len, zc->recv_skip_hint); + + if (!copylen) + return 0; + /* skb is null if inq < PAGE_SIZE. */ + if (skb) + offset = *seq - TCP_SKB_CB(skb)->seq; + else + skb = tcp_recv_skb(sk, *seq, &offset); + + zc->copybuf_len = tcp_copy_straggler_data(zc, skb, copylen, &offset, + seq); + return zc->copybuf_len < 0 ? 
0 : copylen; +} + static int tcp_zerocopy_vm_insert_batch(struct vm_area_struct *vma, struct page **pages, unsigned long pages_to_map, @@ -1776,8 +1822,10 @@ static int tcp_zerocopy_vm_insert_batch(struct vm_area_struct *vma, static int tcp_zerocopy_receive(struct sock *sk, struct tcp_zerocopy_receive *zc) { + u32 length = 0, offset, vma_len, avail_len, aligned_len, copylen = 0; unsigned long address = (unsigned long)zc->address; - u32 length = 0, seq, offset, zap_len; + s32 copybuf_len = zc->copybuf_len; + struct tcp_sock *tp = tcp_sk(sk); #define PAGE_BATCH_SIZE 8 struct page *pages[PAGE_BATCH_SIZE]; const skb_frag_t *frags = NULL; @@ -1785,10 +1833,12 @@ static int tcp_zerocopy_receive(struct sock *sk, struct sk_buff *skb = NULL; unsigned long pg_idx = 0; unsigned long curr_addr; - struct tcp_sock *tp; - int inq; + u32 seq = tp->copied_seq; + int inq = tcp_inq(sk); int ret; + zc->copybuf_len = 0; + if (address & (PAGE_SIZE - 1) || address != zc->address) return -EINVAL; @@ -1797,8 +1847,6 @@ static int tcp_zerocopy_receive(struct sock *sk, sock_rps_record_flow(sk); - tp = tcp_sk(sk); - mmap_read_lock(current->mm); vma = find_vma(current->mm, address); @@ -1806,17 +1854,16 @@ static int tcp_zerocopy_receive(struct sock *sk, mmap_read_unlock(current->mm); return -EINVAL; } - zc->length = min_t(unsigned long, zc->length, vma->vm_end - address); - - seq = tp->copied_seq; - inq = tcp_inq(sk); - zc->length = min_t(u32, zc->length, inq); - zap_len = zc->length & ~(PAGE_SIZE - 1); - if (zap_len) { - zap_page_range(vma, address, zap_len); + vma_len = min_t(unsigned long, zc->length, vma->vm_end - address); + avail_len = min_t(u32, vma_len, inq); + aligned_len = avail_len & ~(PAGE_SIZE - 1); + if (aligned_len) { + zap_page_range(vma, address, aligned_len); + zc->length = aligned_len; zc->recv_skip_hint = 0; } else { - zc->recv_skip_hint = zc->length; + zc->length = avail_len; + zc->recv_skip_hint = avail_len; } ret = 0; curr_addr = address; @@ -1885,13 +1932,18 @@ static int tcp_zerocopy_receive(struct sock *sk, } out: mmap_read_unlock(current->mm); - if (length) { + /* Try to copy straggler data. */ + if (!ret) + copylen = tcp_zerocopy_handle_leftover_data(zc, sk, skb, &seq, + copybuf_len); + + if (length + copylen) { WRITE_ONCE(tp->copied_seq, seq); tcp_rcv_space_adjust(sk); /* Clean up data we have read: This will do ACK frames. 
*/ tcp_recv_skb(sk, seq, &offset); - tcp_cleanup_rbuf(sk, length); + tcp_cleanup_rbuf(sk, length + copylen); ret = 0; if (length == zc->length) zc->recv_skip_hint = 0; From patchwork Wed Dec 2 22:09:39 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Arjun Roy X-Patchwork-Id: 336524 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id CDD5DC71155 for ; Wed, 2 Dec 2020 22:11:26 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 68E7B217A0 for ; Wed, 2 Dec 2020 22:11:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387522AbgLBWLC (ORCPT ); Wed, 2 Dec 2020 17:11:02 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58070 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2387514AbgLBWLB (ORCPT ); Wed, 2 Dec 2020 17:11:01 -0500 Received: from mail-pj1-x1029.google.com (mail-pj1-x1029.google.com [IPv6:2607:f8b0:4864:20::1029]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DAB92C061A04 for ; Wed, 2 Dec 2020 14:10:15 -0800 (PST) Received: by mail-pj1-x1029.google.com with SMTP id f14so1385548pju.4 for ; Wed, 02 Dec 2020 14:10:15 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=whasLqjMPGXLpEY20tjlFFBh+irnxgmgA2f5gTPPoVE=; b=n6AUQNBq1hDjghI5BAyogXEB+6/kpj4y//6WpIjiujILlZ2YNaxX3Ms7Cid2052N4y B8GydnlEEPIPuEIQUO442jW5U7EKsXfiugXksQ9FGKwj4JT4tvD3MzXiu3gYOrgMckmr P8vBy8hhu9rbNza/kydh/dNWtKYwBhBEOAoAzxaw+sMlVQx645FrlLiAj4Nxvd13WJMP tyGK0fiuMNyQx7B9WVVjUMgJw2CWZN0U9k/bpA0+L5daOvmzdggsyX0y7RfWoeUX7SOy RpVUxc5UzR6WnXi7xgSA8M/Qa8GYlKiBiybl/intPoDUQi4TQVWHbQkZucK+b7AgN1DH wTdQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=whasLqjMPGXLpEY20tjlFFBh+irnxgmgA2f5gTPPoVE=; b=NZzbizYRmIUu5d0pmKV2f0C2z+pq7I6fDAn1XMTDZfx2lB4SWd31lV8gi6ivWgAGEM krjipWCAtWWdBQIkIsyUFhNL7VaygAS0cW+rzqlSLiA7jSBY6beBLYRwFOiCVVbtxaz2 mBVtAZXKwJvqnoOl45S9sAXqShKWLcUrTiopDiZuWvcpV2u1/nHYxg5iBuFgN3+bHWIs iCTadO3hvynlhyljbhb2Tew7ooRnKoc33DwScuqjYZ+/8P55/SGAoeI/Xjd/1yzdMMGJ Mg6jJ4vKyFXNMtqJ3axIFfuFOZkR+q/OZag/TCWFWibBZEXjMF3LUrOTLo62R8dZDhjw 5t2g== X-Gm-Message-State: AOAM530ulR2x6/NYzLzhLq8d0czzo3jCqP3owTlEcbHfi2iJzJSxN5OV CKhWSoIsN6Q5LW6lU6m5EFg= X-Google-Smtp-Source: ABdhPJwwyZt0H3dr6KsD+JLnjNW6xSz3FBDs7nwiFARLng3kX5ddUcGMFQ3g40EqA1uYTYsDZAHRqA== X-Received: by 2002:a17:902:900c:b029:da:b7a3:d83a with SMTP id a12-20020a170902900cb02900dab7a3d83amr214419plp.57.1606947015448; Wed, 02 Dec 2020 14:10:15 -0800 (PST) Received: from phantasmagoria.svl.corp.google.com ([2620:15c:2c4:201:f693:9fff:feea:f0b9]) by smtp.gmail.com with ESMTPSA id p16sm4872pju.47.2020.12.02.14.10.14 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 
bits=256/256); Wed, 02 Dec 2020 14:10:15 -0800 (PST) From: Arjun Roy To: davem@davemloft.net, netdev@vger.kernel.org Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com Subject: [net-next v2 2/8] net-tcp: Introduce tcp_recvmsg_locked(). Date: Wed, 2 Dec 2020 14:09:39 -0800 Message-Id: <20201202220945.911116-3-arjunroy.kdev@gmail.com> X-Mailer: git-send-email 2.29.2.576.ga3fc446d84-goog In-Reply-To: <20201202220945.911116-1-arjunroy.kdev@gmail.com> References: <20201202220945.911116-1-arjunroy.kdev@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: Arjun Roy Refactor tcp_recvmsg() by splitting it into locked and unlocked portions. Callers already holding the socket lock and not using ERRQUEUE/cmsg/busy polling can simply call tcp_recvmsg_locked(). This is in preparation for a short-circuit copy performed by TCP receive zerocopy for small (< PAGE_SIZE, or otherwise requested by the user) reads. --- net/ipv4/tcp.c | 69 ++++++++++++++++++++++++++++---------------------- 1 file changed, 39 insertions(+), 30 deletions(-) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 887c6e986927..232cb478bacd 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2065,36 +2065,28 @@ static int tcp_inq_hint(struct sock *sk) * Probably, code can be easily improved even more. */ -int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock, - int flags, int *addr_len) +static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len, + int nonblock, int flags, + struct scm_timestamping_internal *tss, + int *cmsg_flags) { struct tcp_sock *tp = tcp_sk(sk); int copied = 0; u32 peek_seq; u32 *seq; unsigned long used; - int err, inq; + int err; int target; /* Read at least this many bytes */ long timeo; struct sk_buff *skb, *last; u32 urg_hole = 0; - struct scm_timestamping_internal tss; - int cmsg_flags; - - if (unlikely(flags & MSG_ERRQUEUE)) - return inet_recv_error(sk, msg, len, addr_len); - - if (sk_can_busy_loop(sk) && skb_queue_empty_lockless(&sk->sk_receive_queue) && - (sk->sk_state == TCP_ESTABLISHED)) - sk_busy_loop(sk, nonblock); - - lock_sock(sk); err = -ENOTCONN; if (sk->sk_state == TCP_LISTEN) goto out; - cmsg_flags = tp->recvmsg_inq ? 1 : 0; + if (tp->recvmsg_inq) + *cmsg_flags = 1; timeo = sock_rcvtimeo(sk, nonblock); /* Urgent data needs to be handled specially. */ @@ -2274,8 +2266,8 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock, } if (TCP_SKB_CB(skb)->has_rxtstamp) { - tcp_update_recv_tstamps(skb, &tss); - cmsg_flags |= 2; + tcp_update_recv_tstamps(skb, tss); + *cmsg_flags |= 2; } if (used + offset < skb->len) @@ -2301,22 +2293,9 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock, /* Clean up data we have read: This will do ACK frames. 
*/ tcp_cleanup_rbuf(sk, copied); - - release_sock(sk); - - if (cmsg_flags) { - if (cmsg_flags & 2) - tcp_recv_timestamp(msg, sk, &tss); - if (cmsg_flags & 1) { - inq = tcp_inq_hint(sk); - put_cmsg(msg, SOL_TCP, TCP_CM_INQ, sizeof(inq), &inq); - } - } - return copied; out: - release_sock(sk); return err; recv_urg: @@ -2327,6 +2306,36 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock, err = tcp_peek_sndq(sk, msg, len); goto out; } + +int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock, + int flags, int *addr_len) +{ + int cmsg_flags = 0, ret, inq; + struct scm_timestamping_internal tss; + + if (unlikely(flags & MSG_ERRQUEUE)) + return inet_recv_error(sk, msg, len, addr_len); + + if (sk_can_busy_loop(sk) && + skb_queue_empty_lockless(&sk->sk_receive_queue) && + sk->sk_state == TCP_ESTABLISHED) + sk_busy_loop(sk, nonblock); + + lock_sock(sk); + ret = tcp_recvmsg_locked(sk, msg, len, nonblock, flags, &tss, + &cmsg_flags); + release_sock(sk); + + if (cmsg_flags && ret >= 0) { + if (cmsg_flags & 2) + tcp_recv_timestamp(msg, sk, &tss); + if (cmsg_flags & 1) { + inq = tcp_inq_hint(sk); + put_cmsg(msg, SOL_TCP, TCP_CM_INQ, sizeof(inq), &inq); + } + } + return ret; +} EXPORT_SYMBOL(tcp_recvmsg); void tcp_set_state(struct sock *sk, int state) From patchwork Wed Dec 2 22:09:40 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Arjun Roy X-Patchwork-Id: 336525 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 383A8C6369E for ; Wed, 2 Dec 2020 22:11:26 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id CB050217A0 for ; Wed, 2 Dec 2020 22:11:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387464AbgLBWK6 (ORCPT ); Wed, 2 Dec 2020 17:10:58 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58076 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2387442AbgLBWK6 (ORCPT ); Wed, 2 Dec 2020 17:10:58 -0500 Received: from mail-pj1-x1033.google.com (mail-pj1-x1033.google.com [IPv6:2607:f8b0:4864:20::1033]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4136BC061A47 for ; Wed, 2 Dec 2020 14:10:18 -0800 (PST) Received: by mail-pj1-x1033.google.com with SMTP id r9so1810977pjl.5 for ; Wed, 02 Dec 2020 14:10:18 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=kF6MpbIz7UBtOUQRndbrPYKTjWhsAlqLLMJgGNnG7ng=; b=EjwrA82jdXQg+V949MNc/Y0Gn5b+TnSJxrJzewtMbt+t1BR7b/wKUG73OJtDUXtKkq 0hSr/NKUGT51IYH3GomiPSjleFRyDbTFtt0/vIlpcZ3PU3AozGRiHg32t+4aocikeGix AnkBqMNeGWDXrknm7SWFCottES3t0H8bcMsBTiHBUAAFDQJWjoOTByHyjQlj7S62R5Wm lrN86eEkv81/mtwt8RQdjWeaLjP15I5wquoZNvzwvLuullNQ/meILSHj9PybVa7jYuH+ iZdiUO9Z1A3DeRTEc48bd4pYuqqa2sXhWi2prwljgvYrCjVJkOqLFfSHZ60yYEO2lPul 2HpA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; 
From: Arjun Roy
To: davem@davemloft.net, netdev@vger.kernel.org
Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com
Subject: [net-next v2 3/8] net-zerocopy: Refactor skb frag fast-forward op.
Date: Wed, 2 Dec 2020 14:09:40 -0800
Message-Id: <20201202220945.911116-4-arjunroy.kdev@gmail.com>
In-Reply-To: <20201202220945.911116-1-arjunroy.kdev@gmail.com>
References: <20201202220945.911116-1-arjunroy.kdev@gmail.com>

From: Arjun Roy

Refactor skb frag fast-forwarding for tcp receive zerocopy. This is part
of a patch set that introduces short-circuited hybrid copies for small
receive operations, which results in roughly 33% fewer syscalls for
small RPC scenarios.

skb_advance_to_frag(), given a skb and an offset into the skb, iterates
from the first frag for the skb until we're at the frag specified by the
offset. Assuming the offset provided refers to how many bytes in the skb
are already read, the returned frag points to the next frag we may read
from, while offset_frag is set to the number of bytes from this frag
that we have already read.

If frag is not null and offset_frag is equal to 0, then we may be able
to map this frag's page into the process address space with
vm_insert_page(). However, if offset_frag is not equal to 0, then we
cannot do so.
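To make the offset bookkeeping concrete, here is a minimal userspace model of
the fast-forward walk described above. The sizes[] array, headlen parameter
and advance_to_frag() are illustrative stand-ins for skb_shinfo(skb)->frags,
skb_headlen(skb) and skb_advance_to_frag(); this is a sketch, not the kernel
code itself.

#include <stdio.h>

/* Toy model of the frag walk: sizes[] plays the role of the skb's frag
 * array, headlen plays the linear area, and offset_skb is how many bytes
 * of the skb have already been consumed.
 */
static int advance_to_frag(const unsigned int *sizes, int nr_frags,
			   unsigned int headlen, unsigned int offset_skb,
			   unsigned int *offset_frag)
{
	int i;

	if (offset_skb < headlen)
		return -1;			/* still inside the linear area */
	offset_skb -= headlen;

	for (i = 0; i < nr_frags; i++) {
		if (sizes[i] > offset_skb) {
			*offset_frag = offset_skb;	/* partially read frag */
			return i;
		}
		offset_skb -= sizes[i];
	}
	*offset_frag = 0;
	return i;				/* one past the last frag */
}

int main(void)
{
	const unsigned int sizes[] = { 4096, 4096 };
	unsigned int offset_frag;

	/* 256 bytes of linear data plus one full frag already read: lands on
	 * frag 1 with offset_frag == 0, so that frag could be remapped.
	 */
	printf("frag=%d", advance_to_frag(sizes, 2, 256, 4352, &offset_frag));
	printf(" offset_frag=%u\n", offset_frag);

	/* Only 100 bytes of frag 0 read: offset_frag == 100, so the frag
	 * cannot be mapped with vm_insert_page() and must be copied.
	 */
	printf("frag=%d", advance_to_frag(sizes, 2, 256, 356, &offset_frag));
	printf(" offset_frag=%u\n", offset_frag);
	return 0;
}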
--- net/ipv4/tcp.c | 35 ++++++++++++++++++++++++++--------- 1 file changed, 26 insertions(+), 9 deletions(-) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 232cb478bacd..0f17b46c4c0c 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1743,6 +1743,28 @@ int tcp_mmap(struct file *file, struct socket *sock, } EXPORT_SYMBOL(tcp_mmap); +static skb_frag_t *skb_advance_to_frag(struct sk_buff *skb, u32 offset_skb, + u32 *offset_frag) +{ + skb_frag_t *frag; + + offset_skb -= skb_headlen(skb); + if ((int)offset_skb < 0 || skb_has_frag_list(skb)) + return NULL; + + frag = skb_shinfo(skb)->frags; + while (offset_skb) { + if (skb_frag_size(frag) > offset_skb) { + *offset_frag = offset_skb; + return frag; + } + offset_skb -= skb_frag_size(frag); + ++frag; + } + *offset_frag = 0; + return frag; +} + static int tcp_copy_straggler_data(struct tcp_zerocopy_receive *zc, struct sk_buff *skb, u32 copylen, u32 *offset, u32 *seq) @@ -1869,6 +1891,8 @@ static int tcp_zerocopy_receive(struct sock *sk, curr_addr = address; while (length + PAGE_SIZE <= zc->length) { if (zc->recv_skip_hint < PAGE_SIZE) { + u32 offset_frag; + /* If we're here, finish the current batch. */ if (pg_idx) { ret = tcp_zerocopy_vm_insert_batch(vma, pages, @@ -1889,16 +1913,9 @@ static int tcp_zerocopy_receive(struct sock *sk, skb = tcp_recv_skb(sk, seq, &offset); } zc->recv_skip_hint = skb->len - offset; - offset -= skb_headlen(skb); - if ((int)offset < 0 || skb_has_frag_list(skb)) + frags = skb_advance_to_frag(skb, offset, &offset_frag); + if (!frags || offset_frag) break; - frags = skb_shinfo(skb)->frags; - while (offset) { - if (skb_frag_size(frags) > offset) - goto out; - offset -= skb_frag_size(frags); - frags++; - } } if (skb_frag_size(frags) != PAGE_SIZE || skb_frag_off(frags)) { int remaining = zc->recv_skip_hint; From patchwork Wed Dec 2 22:09:41 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Arjun Roy X-Patchwork-Id: 337598 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5F4D3C64E7C for ; Wed, 2 Dec 2020 22:11:26 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 07E9221D7A for ; Wed, 2 Dec 2020 22:11:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387488AbgLBWLB (ORCPT ); Wed, 2 Dec 2020 17:11:01 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58084 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2387442AbgLBWLA (ORCPT ); Wed, 2 Dec 2020 17:11:00 -0500 Received: from mail-pl1-x62a.google.com (mail-pl1-x62a.google.com [IPv6:2607:f8b0:4864:20::62a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A5FE4C061A48 for ; Wed, 2 Dec 2020 14:10:19 -0800 (PST) Received: by mail-pl1-x62a.google.com with SMTP id 4so1908563plk.5 for ; Wed, 02 Dec 2020 14:10:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references 
From: Arjun Roy
To: davem@davemloft.net, netdev@vger.kernel.org
Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com
Subject: [net-next v2 4/8] net-zerocopy: Refactor frag-is-remappable test.
Date: Wed, 2 Dec 2020 14:09:41 -0800
Message-Id: <20201202220945.911116-5-arjunroy.kdev@gmail.com>
In-Reply-To: <20201202220945.911116-1-arjunroy.kdev@gmail.com>
References: <20201202220945.911116-1-arjunroy.kdev@gmail.com>

From: Arjun Roy

Refactor frag-is-remappable test for tcp receive zerocopy. This is part
of a patch set that introduces short-circuited hybrid copies for small
receive operations, which results in roughly 33% fewer syscalls for
small RPC scenarios.
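The condition this refactor isolates is simply "page-sized and page-aligned":
only such a frag can be handed to vm_insert_page(), everything else has to be
copied. A standalone sketch of that test, with example_frag and
EXAMPLE_PAGE_SIZE as illustrative stand-ins for skb_frag_t and PAGE_SIZE
rather than the kernel API:

#include <stdbool.h>
#include <stdio.h>

#define EXAMPLE_PAGE_SIZE 4096u		/* illustrative stand-in for PAGE_SIZE */

struct example_frag {			/* stand-in for skb_frag_t */
	unsigned int size;		/* what skb_frag_size() would report */
	unsigned int page_offset;	/* what skb_frag_off() would report */
};

/* A frag is remappable only if it covers exactly one full page starting at
 * offset 0 within that page; anything shorter or misaligned must be copied.
 */
static bool can_map_frag(const struct example_frag *frag)
{
	return frag->size == EXAMPLE_PAGE_SIZE && frag->page_offset == 0;
}

int main(void)
{
	struct example_frag full = { 4096, 0 }, partial = { 1460, 0 };

	printf("full: %d, partial: %d\n",
	       can_map_frag(&full), can_map_frag(&partial));
	return 0;
}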
--- net/ipv4/tcp.c | 34 ++++++++++++++++++++++++++-------- 1 file changed, 26 insertions(+), 8 deletions(-) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 0f17b46c4c0c..4bdd4a358588 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1765,6 +1765,26 @@ static skb_frag_t *skb_advance_to_frag(struct sk_buff *skb, u32 offset_skb, return frag; } +static bool can_map_frag(const skb_frag_t *frag) +{ + return skb_frag_size(frag) == PAGE_SIZE && !skb_frag_off(frag); +} + +static int find_next_mappable_frag(const skb_frag_t *frag, + int remaining_in_skb) +{ + int offset = 0; + + if (likely(can_map_frag(frag))) + return 0; + + while (offset < remaining_in_skb && !can_map_frag(frag)) { + offset += skb_frag_size(frag); + ++frag; + } + return offset; +} + static int tcp_copy_straggler_data(struct tcp_zerocopy_receive *zc, struct sk_buff *skb, u32 copylen, u32 *offset, u32 *seq) @@ -1890,6 +1910,8 @@ static int tcp_zerocopy_receive(struct sock *sk, ret = 0; curr_addr = address; while (length + PAGE_SIZE <= zc->length) { + int mappable_offset; + if (zc->recv_skip_hint < PAGE_SIZE) { u32 offset_frag; @@ -1917,15 +1939,11 @@ static int tcp_zerocopy_receive(struct sock *sk, if (!frags || offset_frag) break; } - if (skb_frag_size(frags) != PAGE_SIZE || skb_frag_off(frags)) { - int remaining = zc->recv_skip_hint; - while (remaining && (skb_frag_size(frags) != PAGE_SIZE || - skb_frag_off(frags))) { - remaining -= skb_frag_size(frags); - frags++; - } - zc->recv_skip_hint -= remaining; + mappable_offset = find_next_mappable_frag(frags, + zc->recv_skip_hint); + if (mappable_offset) { + zc->recv_skip_hint = mappable_offset; break; } pages[pg_idx] = skb_frag_page(frags); From patchwork Wed Dec 2 22:09:42 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Arjun Roy X-Patchwork-Id: 337597 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 88ACEC64E7B for ; Wed, 2 Dec 2020 22:11:26 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 37CAE221FD for ; Wed, 2 Dec 2020 22:11:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387510AbgLBWLB (ORCPT ); Wed, 2 Dec 2020 17:11:01 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58090 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2387442AbgLBWLB (ORCPT ); Wed, 2 Dec 2020 17:11:01 -0500 Received: from mail-pf1-x442.google.com (mail-pf1-x442.google.com [IPv6:2607:f8b0:4864:20::442]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 08CFCC0613D6 for ; Wed, 2 Dec 2020 14:10:21 -0800 (PST) Received: by mail-pf1-x442.google.com with SMTP id w6so2147585pfu.1 for ; Wed, 02 Dec 2020 14:10:21 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=774IxFvxtEcksvpFGz8hbT30bXwYiShxKYm2dingbMk=; 
From: Arjun Roy
To: davem@davemloft.net, netdev@vger.kernel.org
Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com
Subject: [net-next v2 5/8] net-zerocopy: Fast return if inq < PAGE_SIZE
Date: Wed, 2 Dec 2020 14:09:42 -0800
Message-Id: <20201202220945.911116-6-arjunroy.kdev@gmail.com>
In-Reply-To: <20201202220945.911116-1-arjunroy.kdev@gmail.com>
References: <20201202220945.911116-1-arjunroy.kdev@gmail.com>

From: Arjun Roy

Sometimes, we may call tcp receive zerocopy when inq is 0, or
inq < PAGE_SIZE, in which case we cannot remap pages. In this case,
simply return the appropriate hint for regular copying without taking
mmap_sem.
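For orientation, a hedged userspace sketch of the calling pattern this
affects. It assumes the usual getsockopt(TCP_ZEROCOPY_RECEIVE) convention and
a mapping previously established against the socket as in tcp_mmap();
zc_read_or_fallback() and its parameters are illustrative names, and error
handling is abbreviated.

#include <linux/tcp.h>		/* struct tcp_zerocopy_receive */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef TCP_ZEROCOPY_RECEIVE
#define TCP_ZEROCOPY_RECEIVE 35	/* value from include/uapi/linux/tcp.h */
#endif

/* One zerocopy receive attempt with no copy buffer supplied.  On the fast
 * path the kernel maps whole pages at "map"; when less than a page is
 * available it now returns immediately with length == 0 and recv_skip_hint
 * set, and the caller reads that remainder the ordinary way.
 */
static ssize_t zc_read_or_fallback(int fd, void *map, size_t map_len,
				   char *buf, size_t buf_len)
{
	struct tcp_zerocopy_receive zc;
	socklen_t zc_len = sizeof(zc);
	ssize_t total;

	memset(&zc, 0, sizeof(zc));
	zc.address = (uint64_t)(unsigned long)map;
	zc.length = map_len;

	if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len))
		return -1;

	total = zc.length;			/* bytes mapped at "map" */
	if (zc.recv_skip_hint) {		/* unaligned tail, copy it */
		size_t want = zc.recv_skip_hint < buf_len ?
			      (size_t)zc.recv_skip_hint : buf_len;
		ssize_t r = recv(fd, buf, want, 0);

		if (r > 0)
			total += r;
	}
	return total;
}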
---
 net/ipv4/tcp.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4bdd4a358588..b2f24a5ec230 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1889,6 +1889,14 @@ static int tcp_zerocopy_receive(struct sock *sk,
 
 	sock_rps_record_flow(sk);
 
+	if (inq < PAGE_SIZE) {
+		zc->length = 0;
+		zc->recv_skip_hint = inq;
+		if (!inq && sock_flag(sk, SOCK_DONE))
+			return -EIO;
+		return 0;
+	}
+
 	mmap_read_lock(current->mm);
 
 	vma = find_vma(current->mm, address);

From patchwork Wed Dec 2 22:09:43 2020
X-Patchwork-Submitter: Arjun Roy
X-Patchwork-Id: 337595
From: Arjun Roy
To: davem@davemloft.net, netdev@vger.kernel.org
Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com
Subject: [net-next v2 6/8] net-zerocopy: Introduce short-circuit small reads.
Date: Wed, 2 Dec 2020 14:09:43 -0800
Message-Id: <20201202220945.911116-7-arjunroy.kdev@gmail.com>
In-Reply-To: <20201202220945.911116-1-arjunroy.kdev@gmail.com>
References: <20201202220945.911116-1-arjunroy.kdev@gmail.com>

From: Arjun Roy

Sometimes, we may call tcp receive zerocopy when inq is 0, or
inq < PAGE_SIZE, or inq is generally small enough that it is cheaper to
copy rather than remap pages. In these cases, we may want to either
return early (inq = 0) or attempt to use the provided copy buffer to
simply copy the received data.

This allows us to save both system call overhead and the latency of
acquiring mmap_sem in read mode for cases where it would be useless to
do so.

This patchset enables this behaviour by:
1. Returning quickly if inq is 0.
2. Attempting to perform a regular copy if a hybrid copybuffer is
   provided and it is large enough to absorb all available bytes.
3. Returning quickly if no such buffer was provided and there are
   fewer than PAGE_SIZE bytes available.

For small RPC ping-pong workloads, we would normally have one
getsockopt(), one recvmsg() and one sendmsg() call per RPC. With this
change, we remove the recvmsg() call entirely, reducing the syscall
overhead by about 33%. In testing with small (hundreds of bytes) RPC
traffic, this yields a syscall reduction of about 33% and an efficiency
gain of about 3-5% when defined as QPS/CPU Util.
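To show how an application might opt in to this hybrid path, a userspace
sketch follows. It assumes uapi headers that already carry the
copybuf_address/copybuf_len fields introduced earlier in this series and the
usual getsockopt(TCP_ZEROCOPY_RECEIVE) convention; tcp_zc_read() and its
arguments are illustrative names, and error handling is abbreviated.

#include <linux/tcp.h>		/* struct tcp_zerocopy_receive with copybuf_* */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* One receive attempt: large, page-aligned payloads are mapped at "map",
 * small payloads (and unaligned stragglers that fit) are absorbed into
 * "small_buf" by the kernel so no follow-up recvmsg() is needed.
 * Returns bytes made available, or -1 on error.
 */
static ssize_t tcp_zc_read(int fd, void *map, size_t map_len,
			   char *small_buf, size_t small_len,
			   size_t *mapped, size_t *copied)
{
	struct tcp_zerocopy_receive zc;
	socklen_t zc_len = sizeof(zc);

	memset(&zc, 0, sizeof(zc));
	zc.address = (uint64_t)(unsigned long)map;
	zc.length = map_len;
	zc.copybuf_address = (uint64_t)(unsigned long)small_buf;
	zc.copybuf_len = (int32_t)small_len;

	if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len))
		return -1;

	*mapped = zc.length;				/* bytes mapped at "map" */
	*copied = zc.copybuf_len > 0 ? zc.copybuf_len : 0;  /* bytes copied */

	/* A non-zero recv_skip_hint means some data still has to be read
	 * with recvmsg(), e.g. a straggler that did not fit in small_buf.
	 */
	return (ssize_t)(*mapped + *copied);
}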
--- net/ipv4/tcp.c | 36 ++++++++++++++++++++++++++++++++++++ 1 file changed, 36 insertions(+) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index b2f24a5ec230..f67dd732a47b 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1785,6 +1785,39 @@ static int find_next_mappable_frag(const skb_frag_t *frag, return offset; } +static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len, + int nonblock, int flags, + struct scm_timestamping_internal *tss, + int *cmsg_flags); +static int receive_fallback_to_copy(struct sock *sk, + struct tcp_zerocopy_receive *zc, int inq) +{ + unsigned long copy_address = (unsigned long)zc->copybuf_address; + struct scm_timestamping_internal tss_unused; + int err, cmsg_flags_unused; + struct msghdr msg = {}; + struct iovec iov; + + zc->length = 0; + zc->recv_skip_hint = 0; + + if (copy_address != zc->copybuf_address) + return -EINVAL; + + err = import_single_range(READ, (void __user *)copy_address, + inq, &iov, &msg.msg_iter); + if (err) + return err; + + err = tcp_recvmsg_locked(sk, &msg, inq, /*nonblock=*/1, /*flags=*/0, + &tss_unused, &cmsg_flags_unused); + if (err < 0) + return err; + + zc->copybuf_len = err; + return 0; +} + static int tcp_copy_straggler_data(struct tcp_zerocopy_receive *zc, struct sk_buff *skb, u32 copylen, u32 *offset, u32 *seq) @@ -1889,6 +1922,9 @@ static int tcp_zerocopy_receive(struct sock *sk, sock_rps_record_flow(sk); + if (inq && inq <= copybuf_len) + return receive_fallback_to_copy(sk, zc, inq); + if (inq < PAGE_SIZE) { zc->length = 0; zc->recv_skip_hint = inq; From patchwork Wed Dec 2 22:09:44 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Arjun Roy X-Patchwork-Id: 337596 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 28DB6C8300F for ; Wed, 2 Dec 2020 22:11:30 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id CA75B21D7A for ; Wed, 2 Dec 2020 22:11:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387524AbgLBWL3 (ORCPT ); Wed, 2 Dec 2020 17:11:29 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58162 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727452AbgLBWL2 (ORCPT ); Wed, 2 Dec 2020 17:11:28 -0500 Received: from mail-pg1-x529.google.com (mail-pg1-x529.google.com [IPv6:2607:f8b0:4864:20::529]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3F2F9C061A4A for ; Wed, 2 Dec 2020 14:10:24 -0800 (PST) Received: by mail-pg1-x529.google.com with SMTP id e23so101045pgk.12 for ; Wed, 02 Dec 2020 14:10:24 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=AqjZOIuE4X85vaE2wxWXKVDHIkqqQieb92c3SlQLVn8=; b=GHs6vzsHcL9F4nZ5PaA4og5MVuN4KweozZy2IO+Ud3GCyUJclY0TefnEVDd68MJC1R 5wYx9VA7p3Dw8A4fnZlK8OHuKSAhY/qQtmaeHq4qnKJgGzFD/qxm89HEeoxX6O1jx3sn 
From: Arjun Roy
To: davem@davemloft.net, netdev@vger.kernel.org
Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com
Subject: [net-next v2 7/8] net-zerocopy: Set zerocopy hint when data is copied
Date: Wed, 2 Dec 2020 14:09:44 -0800
Message-Id: <20201202220945.911116-8-arjunroy.kdev@gmail.com>
In-Reply-To: <20201202220945.911116-1-arjunroy.kdev@gmail.com>
References: <20201202220945.911116-1-arjunroy.kdev@gmail.com>

From: Arjun Roy

Set zerocopy hint, even when falling back to copy, so that the pending
data can be efficiently received using zerocopy when possible.
---
 net/ipv4/tcp.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f67dd732a47b..49480ce162db 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1785,6 +1785,43 @@ static int find_next_mappable_frag(const skb_frag_t *frag,
 	return offset;
 }
+static void tcp_zerocopy_set_hint_for_skb(struct sock *sk,
+					  struct tcp_zerocopy_receive *zc,
+					  struct sk_buff *skb, u32 offset)
+{
+	u32 frag_offset, partial_frag_remainder = 0;
+	int mappable_offset;
+	skb_frag_t *frag;
+
+	/* worst case: skip to next skb. try to improve on this case below */
+	zc->recv_skip_hint = skb->len - offset;
+
+	/* Find the frag containing this offset (and how far into that frag) */
+	frag = skb_advance_to_frag(skb, offset, &frag_offset);
+	if (!frag)
+		return;
+
+	if (frag_offset) {
+		struct skb_shared_info *info = skb_shinfo(skb);
+
+		/* We read part of the last frag, must recvmsg() rest of skb. */
+		if (frag == &info->frags[info->nr_frags - 1])
+			return;
+
+		/* Else, we must at least read the remainder in this frag. */
+		partial_frag_remainder = skb_frag_size(frag) - frag_offset;
+		zc->recv_skip_hint -= partial_frag_remainder;
+		++frag;
+	}
+
+	/* partial_frag_remainder: If part way through a frag, must read rest.
+	 * mappable_offset: Bytes till next mappable frag, *not* counting bytes
+	 * in partial_frag_remainder.
+	 */
+	mappable_offset = find_next_mappable_frag(frag, zc->recv_skip_hint);
+	zc->recv_skip_hint = mappable_offset + partial_frag_remainder;
+}
+
 static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 			      int nonblock, int flags,
 			      struct scm_timestamping_internal *tss,
 			      int *cmsg_flags);
@@ -1815,6 +1852,14 @@ static int receive_fallback_to_copy(struct sock *sk,
 		return err;
 
 	zc->copybuf_len = err;
+	if (likely(zc->copybuf_len)) {
+		struct sk_buff *skb;
+		u32 offset;
+
+		skb = tcp_recv_skb(sk, tcp_sk(sk)->copied_seq, &offset);
+		if (skb)
+			tcp_zerocopy_set_hint_for_skb(sk, zc, skb, offset);
+	}
 	return 0;
 }

From patchwork Wed Dec 2 22:09:45 2020
X-Patchwork-Submitter: Arjun Roy
X-Patchwork-Id: 336523
From: Arjun Roy
To: davem@davemloft.net, netdev@vger.kernel.org
Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com
Subject: [net-next v2 8/8] net-zerocopy: Defer vm zap unless actually needed.
Date: Wed, 2 Dec 2020 14:09:45 -0800
Message-Id: <20201202220945.911116-9-arjunroy.kdev@gmail.com>
In-Reply-To: <20201202220945.911116-1-arjunroy.kdev@gmail.com>
References: <20201202220945.911116-1-arjunroy.kdev@gmail.com>

From: Arjun Roy

Zapping pages is required only if we are calling vm_insert_page into a
region where pages had previously been mapped. Receive zerocopy allows
reusing such regions, and hitherto called zap_page_range() before
calling vm_insert_page() in that range.

zap_page_range() can also be triggered from userspace with
madvise(MADV_DONTNEED). If userspace is configured to call this before
reusing a segment, or if there was nothing mapped at this virtual
address to begin with, we can avoid calling zap_page_range() under the
socket lock. That said, if userspace does not do that, then we are
still responsible for calling zap_page_range().

This patch adds a flag that the user can use to hint to the kernel
that a zap is not required. If the flag is not set, or if an older user
application does not have a flags field at all, then the kernel calls
zap_page_range as before. Also, if the flag is set but a zap is still
required, the kernel performs that zap as necessary. Thus incorrectly
indicating that a zap can be avoided does not change the correctness of
operation. It also increases the batch size for vm_insert_pages and
prefetches the page struct for the batch since we're about to bump the
refcount.

An alternative mechanism could be to not have a flag, assume by default
a zap is not needed, and fall back to zapping if needed. However, this
would harm performance for older applications for which a zap is
necessary, and thus we implement it with an explicit flag so newer
applications can opt in.

When using RPC-style traffic with medium-sized (tens of KB) RPCs, this
change yields an efficiency improvement of about 30% for QPS/CPU usage.
---
 include/uapi/linux/tcp.h |   2 +
 net/ipv4/tcp.c           | 147 ++++++++++++++++++++++++++-------------
 2 files changed, 99 insertions(+), 50 deletions(-)

diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 62db78b9c1a0..13ceeb395eb8 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -343,6 +343,7 @@ struct tcp_diag_md5sig {
 
 /* setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...)
*/ +#define TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT 0x1 struct tcp_zerocopy_receive { __u64 address; /* in: address of mapping */ __u32 length; /* in/out: number of bytes to map/mapped */ @@ -351,5 +352,6 @@ struct tcp_zerocopy_receive { __s32 err; /* out: socket error */ __u64 copybuf_address; /* in: copybuf address (small reads) */ __s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */ + __u32 flags; /* in: flags */ }; #endif /* _UAPI_LINUX_TCP_H */ diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 49480ce162db..83d16f04f464 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1909,51 +1909,101 @@ static int tcp_zerocopy_handle_leftover_data(struct tcp_zerocopy_receive *zc, return zc->copybuf_len < 0 ? 0 : copylen; } +static int tcp_zerocopy_vm_insert_batch_error(struct vm_area_struct *vma, + struct page **pending_pages, + unsigned long pages_remaining, + unsigned long *address, + u32 *length, + u32 *seq, + struct tcp_zerocopy_receive *zc, + u32 total_bytes_to_map, + int err) +{ + /* At least one page did not map. Try zapping if we skipped earlier. */ + if (err == -EBUSY && + zc->flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT) { + u32 maybe_zap_len; + + maybe_zap_len = total_bytes_to_map - /* All bytes to map */ + *length + /* Mapped or pending */ + (pages_remaining * PAGE_SIZE); /* Failed map. */ + zap_page_range(vma, *address, maybe_zap_len); + err = 0; + } + + if (!err) { + unsigned long leftover_pages = pages_remaining; + int bytes_mapped; + + /* We called zap_page_range, try to reinsert. */ + err = vm_insert_pages(vma, *address, + pending_pages, + &pages_remaining); + bytes_mapped = PAGE_SIZE * (leftover_pages - pages_remaining); + *seq += bytes_mapped; + *address += bytes_mapped; + } + if (err) { + /* Either we were unable to zap, OR we zapped, retried an + * insert, and still had an issue. Either ways, pages_remaining + * is the number of pages we were unable to map, and we unroll + * some state we speculatively touched before. + */ + const int bytes_not_mapped = PAGE_SIZE * pages_remaining; + + *length -= bytes_not_mapped; + zc->recv_skip_hint += bytes_not_mapped; + } + return err; +} + static int tcp_zerocopy_vm_insert_batch(struct vm_area_struct *vma, struct page **pages, - unsigned long pages_to_map, - unsigned long *insert_addr, - u32 *length_with_pending, + unsigned int pages_to_map, + unsigned long *address, + u32 *length, u32 *seq, - struct tcp_zerocopy_receive *zc) + struct tcp_zerocopy_receive *zc, + u32 total_bytes_to_map) { unsigned long pages_remaining = pages_to_map; - int bytes_mapped; - int ret; + unsigned int pages_mapped; + unsigned int bytes_mapped; + int err; - ret = vm_insert_pages(vma, *insert_addr, pages, &pages_remaining); - bytes_mapped = PAGE_SIZE * (pages_to_map - pages_remaining); + err = vm_insert_pages(vma, *address, pages, &pages_remaining); + pages_mapped = pages_to_map - (unsigned int)pages_remaining; + bytes_mapped = PAGE_SIZE * pages_mapped; /* Even if vm_insert_pages fails, it may have partially succeeded in * mapping (some but not all of the pages). */ *seq += bytes_mapped; - *insert_addr += bytes_mapped; - if (ret) { - /* But if vm_insert_pages did fail, we have to unroll some state - * we speculatively touched before. - */ - const int bytes_not_mapped = PAGE_SIZE * pages_remaining; - *length_with_pending -= bytes_not_mapped; - zc->recv_skip_hint += bytes_not_mapped; - } - return ret; + *address += bytes_mapped; + + if (likely(!err)) + return 0; + + /* Error: maybe zap and retry + rollback state for failed inserts. 
*/ + return tcp_zerocopy_vm_insert_batch_error(vma, pages + pages_mapped, + pages_remaining, address, length, seq, zc, total_bytes_to_map, + err); } +#define TCP_ZEROCOPY_PAGE_BATCH_SIZE 32 static int tcp_zerocopy_receive(struct sock *sk, struct tcp_zerocopy_receive *zc) { - u32 length = 0, offset, vma_len, avail_len, aligned_len, copylen = 0; + u32 length = 0, offset, vma_len, avail_len, copylen = 0; unsigned long address = (unsigned long)zc->address; + struct page *pages[TCP_ZEROCOPY_PAGE_BATCH_SIZE]; s32 copybuf_len = zc->copybuf_len; struct tcp_sock *tp = tcp_sk(sk); - #define PAGE_BATCH_SIZE 8 - struct page *pages[PAGE_BATCH_SIZE]; const skb_frag_t *frags = NULL; + unsigned int pages_to_map = 0; struct vm_area_struct *vma; struct sk_buff *skb = NULL; - unsigned long pg_idx = 0; - unsigned long curr_addr; u32 seq = tp->copied_seq; + u32 total_bytes_to_map; int inq = tcp_inq(sk); int ret; @@ -1987,34 +2037,24 @@ static int tcp_zerocopy_receive(struct sock *sk, } vma_len = min_t(unsigned long, zc->length, vma->vm_end - address); avail_len = min_t(u32, vma_len, inq); - aligned_len = avail_len & ~(PAGE_SIZE - 1); - if (aligned_len) { - zap_page_range(vma, address, aligned_len); - zc->length = aligned_len; + total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1); + if (total_bytes_to_map) { + if (!(zc->flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT)) + zap_page_range(vma, address, total_bytes_to_map); + zc->length = total_bytes_to_map; zc->recv_skip_hint = 0; } else { zc->length = avail_len; zc->recv_skip_hint = avail_len; } ret = 0; - curr_addr = address; while (length + PAGE_SIZE <= zc->length) { int mappable_offset; + struct page *page; if (zc->recv_skip_hint < PAGE_SIZE) { u32 offset_frag; - /* If we're here, finish the current batch. */ - if (pg_idx) { - ret = tcp_zerocopy_vm_insert_batch(vma, pages, - pg_idx, - &curr_addr, - &length, - &seq, zc); - if (ret) - goto out; - pg_idx = 0; - } if (skb) { if (zc->recv_skip_hint > 0) break; @@ -2035,24 +2075,31 @@ static int tcp_zerocopy_receive(struct sock *sk, zc->recv_skip_hint = mappable_offset; break; } - pages[pg_idx] = skb_frag_page(frags); - pg_idx++; + page = skb_frag_page(frags); + prefetchw(page); + pages[pages_to_map++] = page; length += PAGE_SIZE; zc->recv_skip_hint -= PAGE_SIZE; frags++; - if (pg_idx == PAGE_BATCH_SIZE) { - ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx, - &curr_addr, &length, - &seq, zc); + if (pages_to_map == TCP_ZEROCOPY_PAGE_BATCH_SIZE || + zc->recv_skip_hint < PAGE_SIZE) { + /* Either full batch, or we're about to go to next skb + * (and we cannot unroll failed ops across skbs). + */ + ret = tcp_zerocopy_vm_insert_batch(vma, pages, + pages_to_map, + &address, &length, + &seq, zc, + total_bytes_to_map); if (ret) goto out; - pg_idx = 0; + pages_to_map = 0; } } - if (pg_idx) { - ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx, - &curr_addr, &length, &seq, - zc); + if (pages_to_map) { + ret = tcp_zerocopy_vm_insert_batch(vma, pages, pages_to_map, + &address, &length, &seq, + zc, total_bytes_to_map); } out: mmap_read_unlock(current->mm);
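Following the series, a brief userspace sketch of how the new flag is meant to
be used, assuming uapi headers that define
TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT from this patch; the helper name is
illustrative. The application drops the old mappings itself, outside the
socket lock, then tells the kernel that its own zap may be skipped; as the
commit message notes, the kernel still zaps (or retries after zapping) if the
hint turns out to be wrong, so this is purely a performance hint.

#include <linux/tcp.h>		/* TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT */
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>

/* Receive into a reused mapping without the kernel-side zap_page_range().
 * The caller promises it has already madvise(MADV_DONTNEED)'d the region
 * (or never had anything mapped there); if that is not true, the kernel
 * falls back to zapping itself, so correctness is unaffected.
 */
static int tcp_zc_read_prezapped(int fd, void *map, size_t map_len)
{
	struct tcp_zerocopy_receive zc;
	socklen_t zc_len = sizeof(zc);

	/* Done consuming the previous payload: drop the old page mappings
	 * ourselves, outside the socket lock.
	 */
	if (madvise(map, map_len, MADV_DONTNEED))
		return -1;

	memset(&zc, 0, sizeof(zc));
	zc.address = (uint64_t)(unsigned long)map;
	zc.length = map_len;
	zc.flags = TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT;

	return getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len);
}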