From patchwork Mon Feb 18 23:08:42 2019
X-Patchwork-Submitter: Ard Biesheuvel
X-Patchwork-Id: 158662
Delivered-To: patch@linaro.org
From: Ard Biesheuvel
To: linux-arm-kernel@lists.infradead.org
Cc: will.deacon@arm.com, steve.capper@arm.com, ilias.apalodimas@linaro.org,
    netdev@vger.kernel.org, Ard Biesheuvel, "huanglingyan (A)"
Subject: [PATCH] arm64: do_csum: implement accelerated scalar version
Date: Tue, 19 Feb 2019 00:08:42 +0100
Message-Id: <20190218230842.11448-1-ard.biesheuvel@linaro.org>
X-Mailer: git-send-email 2.20.1
X-Mailing-List: netdev@vger.kernel.org

It turns out that the IP checksumming code is still exercised often, even
though one might expect that modern NICs with checksum offload have no use
for it. However, as Lingyan points out, there are combinations of features
where the network stack may still fall back to software checksumming, and
so it makes sense to provide an optimized implementation in software as
well.

So provide an implementation of do_csum() in scalar assembler, which,
unlike C, gives direct access to the carry flag, making the code run
substantially faster. The routine uses overlapping 64 byte loads for all
input sizes > 64 bytes, in order to reduce the number of branches and
improve performance on cores with deep pipelines.
On Cortex-A57, this implementation is on par with Lingyan's NEON
implementation, and roughly 7x as fast as the generic C code.

Cc: "huanglingyan (A)"
Signed-off-by: Ard Biesheuvel
---
Test code after the patch.

 arch/arm64/include/asm/checksum.h |   3 +
 arch/arm64/lib/Makefile           |   2 +-
 arch/arm64/lib/csum.S             | 127 ++++++++++++++++++++
 3 files changed, 131 insertions(+), 1 deletion(-)

-- 
2.20.1

diff --git a/lib/checksum.c b/lib/checksum.c
index d3ec93f9e5f3..7711f1186f71 100644
--- a/lib/checksum.c
+++ b/lib/checksum.c
@@ -37,7 +37,7 @@
 #include <net/checksum.h>

-#ifndef do_csum
+#if 1 //ndef do_csum
 static inline unsigned short from32to16(unsigned int x)
 {
 	/* add up 16-bit and 16-bit for 16+c bit */
@@ -47,7 +47,7 @@ static inline unsigned short from32to16(unsigned int x)
 	return x;
 }

-static unsigned int do_csum(const unsigned char *buff, int len)
+static unsigned int __do_csum(const unsigned char *buff, int len)
 {
 	int odd;
 	unsigned int result = 0;
@@ -206,3 +206,23 @@ __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
 }
 EXPORT_SYMBOL(csum_tcpudp_nofold);
 #endif
+
+extern u8 crypto_ft_tab[];
+
+static int __init do_selftest(void)
+{
+	int i, j;
+	u16 c1, c2;
+
+	for (i = 0; i < 1024; i++) {
+		for (j = i + 1; j <= 1024; j++) {
+			c1 = __do_csum(crypto_ft_tab + i, j - i);
+			c2 = do_csum(crypto_ft_tab + i, j - i);
+
+			if (c1 != c2)
+				pr_err("######### %d %d %x %x\n", i, j, c1, c2);
+		}
+	}
+	return 0;
+}
+late_initcall(do_selftest);

Acked-by: Ilias Apalodimas

diff --git a/arch/arm64/include/asm/checksum.h b/arch/arm64/include/asm/checksum.h
index 0b6f5a7d4027..e906b956c1fc 100644
--- a/arch/arm64/include/asm/checksum.h
+++ b/arch/arm64/include/asm/checksum.h
@@ -46,6 +46,9 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 }
 #define ip_fast_csum ip_fast_csum

+extern unsigned int do_csum(const unsigned char *buff, int len);
+#define do_csum do_csum
+
 #include <asm-generic/checksum.h>

 #endif /* __ASM_CHECKSUM_H */
diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
index 5540a1638baf..a7606007a749 100644
--- a/arch/arm64/lib/Makefile
+++ b/arch/arm64/lib/Makefile
@@ -3,7 +3,7 @@ lib-y		:= clear_user.o delay.o copy_from_user.o	\
 		   copy_to_user.o copy_in_user.o copy_page.o	\
 		   clear_page.o memchr.o memcpy.o memmove.o memset.o	\
 		   memcmp.o strcmp.o strncmp.o strlen.o strnlen.o	\
-		   strchr.o strrchr.o tishift.o
+		   strchr.o strrchr.o tishift.o csum.o

 ifeq ($(CONFIG_KERNEL_MODE_NEON), y)
 obj-$(CONFIG_XOR_BLOCKS)	+= xor-neon.o
diff --git a/arch/arm64/lib/csum.S b/arch/arm64/lib/csum.S
new file mode 100644
index 000000000000..534e2ebdc426
--- /dev/null
+++ b/arch/arm64/lib/csum.S
@@ -0,0 +1,127 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2019 Linaro, Ltd.
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+ENTRY(do_csum)
+	adds	x2, xzr, xzr		// clear x2 and C flag
+
+	// 64 bytes at a time
+	lsr	x3, x1, #6
+	and	x1, x1, #63
+	cbz	x3, 1f
+
+	// Eight 64-bit adds per iteration
+0:	ldp	x4, x5, [x0], #64
+	ldp	x6, x7, [x0, #-48]
+	ldp	x8, x9, [x0, #-32]
+	ldp	x10, x11, [x0, #-16]
+	adcs	x2, x2, x4
+	sub	x3, x3, #1
+	adcs	x2, x2, x5
+	adcs	x2, x2, x6
+	adcs	x2, x2, x7
+	adcs	x2, x2, x8
+	adcs	x2, x2, x9
+	adcs	x2, x2, x10
+	adcs	x2, x2, x11
+	cbnz	x3, 0b
+	adc	x2, x2, xzr
+
+	cbz	x1, 7f
+	bic	x3, x1, #1
+	add	x12, x0, x1
+	add	x0, x0, x3
+	neg	x3, x3
+	add	x3, x3, #64
+	lsl	x3, x3, #3
+
+	// Handle remaining 63 bytes or less using an overlapping 64-byte load
+	// and a branchless code path to complete the calculation
+	ldp	x4, x5, [x0, #-64]
+	ldp	x6, x7, [x0, #-48]
+	ldp	x8, x9, [x0, #-32]
+	ldp	x10, x11, [x0, #-16]
+	ldrb	w12, [x12, #-1]
+
+	.irp	reg, x4, x5, x6, x7, x8, x9, x10, x11
+	cmp	x3, #64
+	csel	\reg, \reg, xzr, lt
+	ccmp	x3, xzr, #0, lt
+	csel	x13, x3, xzr, gt
+	sub	x3, x3, #64
+CPU_LE(	lsr	\reg, \reg, x13		)
+CPU_BE(	lsl	\reg, \reg, x13		)
+	.endr
+
+	adds	x2, x2, x4
+	adcs	x2, x2, x5
+	adcs	x2, x2, x6
+	adcs	x2, x2, x7
+	adcs	x2, x2, x8
+	adcs	x2, x2, x9
+	adcs	x2, x2, x10
+	adcs	x2, x2, x11
+	adc	x2, x2, xzr
+
+CPU_LE(	adds	x12, x2, x12		)
+CPU_BE(	adds	x12, x2, x12, lsl #8	)
+	adc	x12, x12, xzr
+	tst	x1, #1
+	csel	x2, x2, x12, eq
+
+7:	lsr	x1, x2, #32
+	adds	w2, w2, w1
+	adc	w2, w2, wzr
+
+	lsr	w1, w2, #16
+	uxth	w2, w2
+	add	w2, w2, w1
+
+	lsr	w1, w2, #16		// handle the carry by hand
+	add	w2, w2, w1
+
+	uxth	w0, w2
+	ret
+
+	// Handle 63 bytes or less
+1:	tbz	x1, #5, 2f
+	ldp	x4, x5, [x0], #32
+	ldp	x6, x7, [x0, #-16]
+	adds	x2, x2, x4
+	adcs	x2, x2, x5
+	adcs	x2, x2, x6
+	adcs	x2, x2, x7
+	adc	x2, x2, xzr
+
+2:	tbz	x1, #4, 3f
+	ldp	x4, x5, [x0], #16
+	adds	x2, x2, x4
+	adcs	x2, x2, x5
+	adc	x2, x2, xzr
+
+3:	tbz	x1, #3, 4f
+	ldr	x4, [x0], #8
+	adds	x2, x2, x4
+	adc	x2, x2, xzr
+
+4:	tbz	x1, #2, 5f
+	ldr	w4, [x0], #4
+	adds	x2, x2, x4
+	adc	x2, x2, xzr
+
+5:	tbz	x1, #1, 6f
+	ldrh	w4, [x0], #2
+	adds	x2, x2, x4
+	adc	x2, x2, xzr
+
+6:	tbz	x1, #0, 7b
+	ldrb	w4, [x0]
+CPU_LE(	adds	x2, x2, x4	)
+CPU_BE(	adds	x2, x2, x4, lsl #8	)
+	adc	x2, x2, xzr
+	b	7b
+ENDPROC(do_csum)