From patchwork Thu Nov 5 10:28:03 2020
X-Patchwork-Submitter: Björn Töpel
X-Patchwork-Id: 318411
From: Björn Töpel
To: netdev@vger.kernel.org, bpf@vger.kernel.org
Cc: Björn Töpel, bjorn.topel@intel.com, magnus.karlsson@intel.com,
    ast@kernel.org, daniel@iogearbox.net, maciej.fijalkowski@intel.com,
    sridhar.samudrala@intel.com, jesse.brandeburg@intel.com,
    qi.z.zhang@intel.com, kuba@kernel.org, edumazet@google.com,
    intel-wired-lan@lists.osuosl.org, jonathan.lemon@gmail.com
Subject: [RFC PATCH bpf-next v2 0/9] Introduce preferred busy-polling
Date: Thu, 5 Nov 2020 11:28:03 +0100
Message-Id: <20201105102812.152836-1-bjorn.topel@gmail.com>

This series introduces three new features:

1. A new "heavy traffic" busy-polling variant that works in concert
   with the existing napi_defer_hard_irqs and gro_flush_timeout knobs.

2. A new socket option that lets a user change the busy-polling NAPI
   budget.

3. Allow busy-polling to be performed on XDP sockets.

The existing busy-polling mode, enabled by the SO_BUSY_POLL socket
option or system-wide using the /proc/sys/net/core/busy_read knob, is
opportunistic: if the NAPI context is not scheduled, it will be
polled. If, after busy-polling, the budget is exceeded, the
busy-polling logic will schedule the NAPI onto the regular softirq
handling.

One implication of the behavior above is that a busy/heavily loaded
NAPI context will never enter/allow for busy-polling. Some
applications prefer that most NAPI processing be done by busy-polling.

This series adds a new socket option, SO_PREFER_BUSY_POLL, that works
in concert with the napi_defer_hard_irqs and gro_flush_timeout knobs.
The napi_defer_hard_irqs and gro_flush_timeout knobs were introduced
in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral feature"),
and allow a user to defer interrupts from being enabled, instead
scheduling the NAPI context from a watchdog timer. When a user enables
SO_PREFER_BUSY_POLL, again with the other knobs enabled, and the NAPI
context is being processed by a softirq, the softirq NAPI processing
will exit early to allow the busy-polling to be performed. If the
application stops performing busy-polling via a system call, the
watchdog timer defined by gro_flush_timeout will expire, and regular
softirq handling will resume.

In summary: heavy-traffic applications that prefer busy-polling over
softirq processing should use this option.

Example usage:

  $ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
  $ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout

Note that the timeout should be larger than the userspace processing
window, otherwise the watchdog will time out and fall back to regular
softirq processing.

Then enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your
socket; a userspace sketch of doing so follows below.
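For illustration, a minimal userspace sketch of enabling these
options. The fallback option values, the helper name, and the
timeout/budget numbers below are assumptions (based on this series
extending asm-generic/socket.h), not part of the series itself:

/* Hedged sketch: enable (preferred) busy-polling on a socket.
 * The SO_PREFER_BUSY_POLL/SO_BUSY_POLL_BUDGET fallback values are
 * assumptions for headers that predate this series; check
 * include/uapi/asm-generic/socket.h in a patched tree.
 */
#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL		46
#endif
#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL	69	/* assumed value */
#endif
#ifndef SO_BUSY_POLL_BUDGET
#define SO_BUSY_POLL_BUDGET	70	/* assumed value */
#endif

static int enable_preferred_busy_poll(int fd)
{
	int timeout_us = 20;	/* busy-poll for up to 20 us, cf. busy_read */
	int prefer = 1;		/* prefer busy-polling over softirq */
	int budget = 64;	/* NAPI busy-polling budget, cf. xdpsock -b */

	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
		       &timeout_us, sizeof(timeout_us)) ||
	    setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
		       &prefer, sizeof(prefer)) ||
	    setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
		       &budget, sizeof(budget))) {
		perror("setsockopt");
		return -1;
	}
	return 0;
}

Note that, as discussed under the outstanding questions below, setting
busy-polling options today requires CAP_NET_ADMIN.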
Performance netperf UDP_RR:

Note that netperf UDP_RR is not a heavy-traffic test, and preferred
busy-polling is not typically something we want to use here.

  $ echo 20 | sudo tee /proc/sys/net/core/busy_read
  $ netperf -H 192.168.1.1 -l 30 -t UDP_RR -v 2 -- \
      -o min_latency,mean_latency,max_latency,stddev_latency,transaction_rate

busy-polling blocking sockets:            12,13.33,224,0.63,74731.177

I hacked netperf to use non-blocking sockets and re-ran:

busy-polling non-blocking sockets:        12,13.46,218,0.72,73991.172
prefer busy-polling non-blocking sockets: 12,13.62,221,0.59,73138.448

Using the preferred busy-polling mode does not impact performance.

Performance XDP sockets:

Today, running the XDP socket sample on the same core as the softirq
handling, performance tanks mainly because we do not yield to
user-space when the XDP socket Rx queue is full.

  # taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r
  Rx: 64Kpps

  # # biased busy-polling, budget 8
  # taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r -B -b 8
  Rx: 9.9Mpps
  # # biased busy-polling, budget 64
  # taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r -B -b 64
  Rx: 19.3Mpps
  # # biased busy-polling, budget 256
  # taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r -B -b 256
  Rx: 21.4Mpps
  # # biased busy-polling, budget 512
  # taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r -B -b 512
  Rx: 21.7Mpps

Compared to the two-core case:

  # taskset -c 4 ./xdpsock -i ens785f1 -q 20 -n 1 -r
  Rx: 20.7Mpps

We're getting better single-core performance than two-core for this
naïve drop scenario. The above tests were done for the 'ice' driver.

Thanks to Jakub for suggesting this busy-polling addition [1], and
Eric for the input on the v1 RFC.

Some outstanding questions:

* Currently busy-polling for UDP/TCP is only wired up in the
  recvmsg() path. Does it make sense to extend that to sendmsg() as
  well?

* Extending xdp_rxq_info_reg() with napi_id touches a lot of drivers,
  and I've only verified the Intel ones. Some drivers initialize NAPI
  (generating the napi_id) after the xdp_rxq_info_reg() call, which
  maybe would open up for another API. I did not send this RFC to all
  the driver authors. I'll do that for a proper patch series. A
  sketch of the per-driver change follows after this list.

* Today, enabling busy-polling requires CAP_NET_ADMIN. For a NAPI
  context that services multiple sockets, this makes sense, because
  one socket can affect the performance of other sockets. Now, for a
  *dedicated* queue for, say, an XDP socket, would it be OK to drop
  CAP_NET_ADMIN, since it cannot affect other sockets/users?
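To make the xdp_rxq_info_reg() point above concrete, a minimal sketch
of the assumed per-driver change. The extended signature is what this
series proposes; the ring structure, its field names, and the helper
are hypothetical, since every real driver has its own per-queue
layout:

/* Hedged kernel-side sketch: xdp_rxq_info_reg() is assumed to grow a
 * napi_id parameter so the XDP socket Rx path can locate the right
 * NAPI context. The hypothetical_rx_ring layout is made up.
 */
#include <linux/netdevice.h>
#include <net/xdp.h>

struct hypothetical_rx_ring {
	struct xdp_rxq_info xdp_rxq;
	struct net_device *netdev;
	struct napi_struct napi;	/* must be initialized first! */
	u32 queue_index;
};

static int hypothetical_setup_rx_ring(struct hypothetical_rx_ring *ring)
{
	/* Before this series:
	 *   xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
	 *                    ring->queue_index);
	 * After, with the napi_id propagated:
	 */
	return xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
				ring->queue_index, ring->napi.napi_id);
}

Drivers that create the NAPI context (and hence the napi_id) only
after calling xdp_rxq_info_reg() cannot follow this pattern directly,
which is the open API question above.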
Changes rfc-v1 [2] -> rfc-v2:
* Changed name from bias to prefer.
* Base the work on Eric's/Luigi's defer irq/gro timeout work.
* Proper GRO flushing.
* Fixed build issues for some XDP drivers.

[1] https://lore.kernel.org/netdev/20200925120652.10b8d7c5@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/
[2] https://lore.kernel.org/bpf/20201028133437.212503-1-bjorn.topel@gmail.com/

Björn Töpel (9):
  net: introduce preferred busy-polling
  net: add SO_BUSY_POLL_BUDGET socket option
  xsk: add support for recvmsg()
  xsk: check need wakeup flag in sendmsg()
  xsk: add busy-poll support for {recv,send}msg()
  xsk: propagate napi_id to XDP socket Rx path
  samples/bpf: use recvfrom() in xdpsock
  samples/bpf: add busy-poll support to xdpsock
  samples/bpf: add option to set the busy-poll budget

 arch/alpha/include/uapi/asm/socket.h          |  3 +
 arch/mips/include/uapi/asm/socket.h           |  3 +
 arch/parisc/include/uapi/asm/socket.h         |  3 +
 arch/sparc/include/uapi/asm/socket.h          |  3 +
 drivers/net/ethernet/amazon/ena/ena_netdev.c  |  2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     |  2 +-
 .../ethernet/cavium/thunder/nicvf_queues.c    |  2 +-
 .../net/ethernet/freescale/dpaa2/dpaa2-eth.c  |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  2 +-
 drivers/net/ethernet/intel/ice/ice_base.c     |  4 +-
 drivers/net/ethernet/intel/ice/ice_txrx.c     |  2 +-
 drivers/net/ethernet/intel/igb/igb_main.c     |  2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  2 +-
 .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |  2 +-
 drivers/net/ethernet/marvell/mvneta.c         |  2 +-
 .../net/ethernet/marvell/mvpp2/mvpp2_main.c   |  4 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c    |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  2 +-
 .../ethernet/netronome/nfp/nfp_net_common.c   |  2 +-
 drivers/net/ethernet/qlogic/qede/qede_main.c  |  2 +-
 drivers/net/ethernet/sfc/rx_common.c          |  2 +-
 drivers/net/ethernet/socionext/netsec.c       |  2 +-
 drivers/net/ethernet/ti/cpsw_priv.c           |  2 +-
 drivers/net/hyperv/netvsc.c                   |  2 +-
 drivers/net/tun.c                             |  2 +-
 drivers/net/veth.c                            |  2 +-
 drivers/net/virtio_net.c                      |  2 +-
 drivers/net/xen-netfront.c                    |  2 +-
 fs/eventpoll.c                                |  3 +-
 include/linux/netdevice.h                     | 35 +++++---
 include/net/busy_poll.h                       | 27 ++++--
 include/net/sock.h                            |  4 +
 include/net/xdp.h                             |  3 +-
 include/uapi/asm-generic/socket.h             |  3 +
 net/core/dev.c                                | 89 ++++++++++++++-----
 net/core/sock.c                               | 19 ++++
 net/core/xdp.c                                |  3 +-
 net/xdp/xsk.c                                 | 36 +++++++-
 net/xdp/xsk_buff_pool.c                       | 13 ++-
 samples/bpf/xdpsock_user.c                    | 53 ++++++++---
 40 files changed, 262 insertions(+), 90 deletions(-)

base-commit: d0b3d2d7e50de5ce121f77a16df4c17e91b09421