From patchwork Mon Oct 28 19:02:09 2024
Date: Mon, 28 Oct 2024 20:02:09 +0100
In-Reply-To: <20241028190207.1394367-8-ardb+git@google.com>
Message-ID: <20241028190207.1394367-9-ardb+git@google.com>
Subject: [PATCH 1/6] crypto: arm64/crct10dif - Remove obsolete chunking logic
From: Ard Biesheuvel
To: linux-crypto@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org, ebiggers@kernel.org,
 herbert@gondor.apana.org.au, keescook@chromium.org, Ard Biesheuvel

This is a partial revert of commit fc754c024a343b, which moved this
logic into C code to ensure that kernel mode NEON code does not hog the
CPU for too long. Now that kernel mode NEON no longer disables
preemption, this is no longer needed, so we can drop it.
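(For reference, the chunking pattern being removed looks roughly like
the sketch below. This is illustrative only: the actual code, visible
in the diff, compares against SZ_4K + CRC_T10DIF_PMULL_CHUNK_SIZE
rather than using min().)

    static u16 chunked_update(u16 crc, const u8 *data, unsigned int length)
    {
        do {
            /* bound the time spent with preemption disabled */
            unsigned int chunk = min(length, SZ_4K);

            kernel_neon_begin();
            crc = crc_t10dif_pmull_p64(crc, data, chunk);
            kernel_neon_end();   /* preemption point between chunks */
            data += chunk;
            length -= chunk;
        } while (length);
        return crc;
    }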
Signed-off-by: Ard Biesheuvel
---
 arch/arm64/crypto/crct10dif-ce-glue.c | 30 ++++----------------
 1 file changed, 6 insertions(+), 24 deletions(-)

diff --git a/arch/arm64/crypto/crct10dif-ce-glue.c b/arch/arm64/crypto/crct10dif-ce-glue.c
index 606d25c559ed..7b05094a0480 100644
--- a/arch/arm64/crypto/crct10dif-ce-glue.c
+++ b/arch/arm64/crypto/crct10dif-ce-glue.c
@@ -37,18 +37,9 @@ static int crct10dif_update_pmull_p8(struct shash_desc *desc, const u8 *data,
 	u16 *crc = shash_desc_ctx(desc);
 
 	if (length >= CRC_T10DIF_PMULL_CHUNK_SIZE && crypto_simd_usable()) {
-		do {
-			unsigned int chunk = length;
-
-			if (chunk > SZ_4K + CRC_T10DIF_PMULL_CHUNK_SIZE)
-				chunk = SZ_4K;
-
-			kernel_neon_begin();
-			*crc = crc_t10dif_pmull_p8(*crc, data, chunk);
-			kernel_neon_end();
-			data += chunk;
-			length -= chunk;
-		} while (length);
+		kernel_neon_begin();
+		*crc = crc_t10dif_pmull_p8(*crc, data, length);
+		kernel_neon_end();
 	} else {
 		*crc = crc_t10dif_generic(*crc, data, length);
 	}
@@ -62,18 +53,9 @@ static int crct10dif_update_pmull_p64(struct shash_desc *desc, const u8 *data,
 	u16 *crc = shash_desc_ctx(desc);
 
 	if (length >= CRC_T10DIF_PMULL_CHUNK_SIZE && crypto_simd_usable()) {
-		do {
-			unsigned int chunk = length;
-
-			if (chunk > SZ_4K + CRC_T10DIF_PMULL_CHUNK_SIZE)
-				chunk = SZ_4K;
-
-			kernel_neon_begin();
-			*crc = crc_t10dif_pmull_p64(*crc, data, chunk);
-			kernel_neon_end();
-			data += chunk;
-			length -= chunk;
-		} while (length);
+		kernel_neon_begin();
+		*crc = crc_t10dif_pmull_p64(*crc, data, length);
+		kernel_neon_end();
 	} else {
 		*crc = crc_t10dif_generic(*crc, data, length);
 	}

From patchwork Mon Oct 28 19:02:10 2024
Date: Mon, 28 Oct 2024 20:02:10 +0100
In-Reply-To: <20241028190207.1394367-8-ardb+git@google.com>
Message-ID: <20241028190207.1394367-10-ardb+git@google.com>
Subject: [PATCH 2/6] crypto: arm64/crct10dif - Use faster 16x64 bit polynomial multiply
From: Ard Biesheuvel
To: linux-crypto@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org, ebiggers@kernel.org,
 herbert@gondor.apana.org.au, keescook@chromium.org, Ard Biesheuvel

The CRC-T10DIF implementation for arm64 has a version that uses 8x8
polynomial multiplication, for cores that lack the Crypto Extensions,
which provide the 64x64 polynomial multiplication instruction that the
algorithm was built around.
This fallback version rather naively adopted the 64x64 polynomial
multiplication algorithm that I ported from ARM for the GHASH driver,
which needs 8 PMULL8 instructions to implement one PMULL64. This is
reasonable, given that each 8-bit vector element needs to be multiplied
with each element in the other vector, producing 8 vectors with partial
results that need to be combined to yield the correct result.

However, most PMULL64 invocations in the CRC-T10DIF code involve
multiplication by a pair of 16-bit folding coefficients, and so all the
partial results from the higher order bytes will be zero, and there is
no need to calculate them to begin with.

Then, the CRC-T10DIF algorithm always XORs the output values of the
PMULL64 instructions being issued in pairs, and so there is no need to
faithfully implement each individual PMULL64 instruction, as long as
XORing the results pairwise produces the expected result.

Implementing these improvements results in a speedup of 3.3x on
low-end platforms such as Raspberry Pi 4 (Cortex-A72).
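(To see why the higher-order partial products vanish, consider this
plain-C model of carryless multiplication -- illustrative only, not the
NEON code. A 16-bit coefficient has only two nonzero bytes, so a
16x64-bit product needs just 2x8 = 16 of the byte-wise partial
products, instead of the 8x8 = 64 a full 64x64 multiply would require.)

    #include <stdint.h>

    /* 8x8 -> 16 bit carryless multiply: what one lane of the NEON
     * PMULL (polynomial, 8-bit) instruction computes. */
    static uint16_t clmul8(uint8_t a, uint8_t b)
    {
        uint16_t r = 0;

        for (int i = 0; i < 8; i++)
            if (b & (1 << i))
                r ^= (uint16_t)a << i;
        return r;
    }

    /* 16x64 -> 80 bit carryless multiply from byte-wise partial
     * products: only bytes 0 and 1 of the 16-bit coefficient are
     * nonzero, so only 16 partial products contribute. */
    static __uint128_t clmul16x64(uint16_t a, uint64_t b)
    {
        __uint128_t r = 0;

        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 8; j++)
                r ^= (__uint128_t)clmul8(a >> (8 * i), b >> (8 * j))
                        << (8 * (i + j));
        return r;
    }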
Signed-off-by: Ard Biesheuvel
---
 arch/arm64/crypto/crct10dif-ce-core.S | 71 +++++++++++++++-----
 1 file changed, 54 insertions(+), 17 deletions(-)

diff --git a/arch/arm64/crypto/crct10dif-ce-core.S b/arch/arm64/crypto/crct10dif-ce-core.S
index 5604de61d06d..8d99ccf61f16 100644
--- a/arch/arm64/crypto/crct10dif-ce-core.S
+++ b/arch/arm64/crypto/crct10dif-ce-core.S
@@ -1,8 +1,11 @@
 //
 // Accelerated CRC-T10DIF using arm64 NEON and Crypto Extensions instructions
 //
-// Copyright (C) 2016 Linaro Ltd
-// Copyright (C) 2019 Google LLC
+// Copyright (C) 2016 Linaro Ltd
+// Copyright (C) 2019-2024 Google LLC
+//
+// Authors: Ard Biesheuvel
+//          Eric Biggers
 //
 // This program is free software; you can redistribute it and/or modify
 // it under the terms of the GNU General Public License version 2 as
@@ -122,6 +125,10 @@
 	sli		perm2.2d, perm1.2d, #56
 	sli		perm3.2d, perm1.2d, #48
 	sli		perm4.2d, perm1.2d, #40
+
+	mov_q		x5, 0x909010108080000
+	mov		bd1.d[0], x5
+	zip1		bd1.16b, bd1.16b, bd1.16b
 	.endm
 
 	.macro		__pmull_pre_p8, bd
@@ -196,6 +203,45 @@ SYM_FUNC_START_LOCAL(__pmull_p8_core)
 	ret
 SYM_FUNC_END(__pmull_p8_core)
 
+SYM_FUNC_START_LOCAL(__pmull_p8_16x64)
+	ext		t6.16b, t5.16b, t5.16b, #8
+
+	pmull		t3.8h, t7.8b, t5.8b
+	pmull		t4.8h, t7.8b, t6.8b
+	pmull2		t5.8h, t7.16b, t5.16b
+	pmull2		t6.8h, t7.16b, t6.16b
+
+	ext		t8.16b, t3.16b, t3.16b, #8
+	eor		t4.16b, t4.16b, t6.16b
+	ext		t7.16b, t5.16b, t5.16b, #8
+	ext		t6.16b, t4.16b, t4.16b, #8
+	eor		t8.8b, t8.8b, t3.8b
+	eor		t5.8b, t5.8b, t7.8b
+	eor		t4.8b, t4.8b, t6.8b
+	ext		t5.16b, t5.16b, t5.16b, #14
+	ret
+SYM_FUNC_END(__pmull_p8_16x64)
+
+	.macro		pmull16x64_p64, a16, b64, c64
+	pmull2		\c64\().1q, \a16\().2d, \b64\().2d
+	pmull		\b64\().1q, \a16\().1d, \b64\().1d
+	.endm
+
+	/*
+	 * NOTE: the 16x64 bit polynomial multiply below is not equivalent to
+	 * the one above, but XOR'ing the outputs together will produce the
+	 * expected result, and this is sufficient in the context of this
+	 * algorithm.
+	 */
+	.macro		pmull16x64_p8, a16, b64, c64
+	ext		t7.16b, \b64\().16b, \b64\().16b, #1
+	tbl		t5.16b, {\a16\().16b}, bd1.16b
+	uzp1		t7.16b, \b64\().16b, t7.16b
+	bl		__pmull_p8_16x64
+	ext		\b64\().16b, t4.16b, t4.16b, #15
+	eor		\c64\().16b, t8.16b, t5.16b
+	.endm
+
 	.macro		__pmull_p8, rq, ad, bd, i
 	.ifnc		\bd, fold_consts
 	.err
@@ -218,14 +264,12 @@ SYM_FUNC_END(__pmull_p8_core)
 	.macro		fold_32_bytes, p, reg1, reg2
 	ldp		q11, q12, [buf], #0x20
 
-	__pmull_\p	v8, \reg1, fold_consts, 2
-	__pmull_\p	\reg1, \reg1, fold_consts
+	pmull16x64_\p	fold_consts, \reg1, v8
 
CPU_LE(	rev64		v11.16b, v11.16b	)
CPU_LE(	rev64		v12.16b, v12.16b	)
 
-	__pmull_\p	v9, \reg2, fold_consts, 2
-	__pmull_\p	\reg2, \reg2, fold_consts
+	pmull16x64_\p	fold_consts, \reg2, v9
 
CPU_LE(	ext		v11.16b, v11.16b, v11.16b, #8	)
CPU_LE(	ext		v12.16b, v12.16b, v12.16b, #8	)
@@ -238,11 +282,9 @@ CPU_LE(	ext		v12.16b, v12.16b, v12.16b, #8	)
 
 	// Fold src_reg into dst_reg, optionally loading the next fold constants
 	.macro		fold_16_bytes, p, src_reg, dst_reg, load_next_consts
-	__pmull_\p	v8, \src_reg, fold_consts
-	__pmull_\p	\src_reg, \src_reg, fold_consts, 2
+	pmull16x64_\p	fold_consts, \src_reg, v8
 	.ifnb		\load_next_consts
 	ld1		{fold_consts.2d}, [fold_consts_ptr], #16
-	__pmull_pre_\p	fold_consts
 	.endif
 	eor		\dst_reg\().16b, \dst_reg\().16b, v8.16b
 	eor		\dst_reg\().16b, \dst_reg\().16b, \src_reg\().16b
@@ -296,7 +338,6 @@ CPU_LE(	ext		v7.16b, v7.16b, v7.16b, #8	)
 
 	// Load the constants for folding across 128 bytes.
 	ld1		{fold_consts.2d}, [fold_consts_ptr]
-	__pmull_pre_\p	fold_consts
 
 	// Subtract 128 for the 128 data bytes just consumed. Subtract another
 	// 128 to simplify the termination condition of the following loop.
@@ -318,7 +359,6 @@ CPU_LE(	ext		v7.16b, v7.16b, v7.16b, #8	)
 	// Fold across 64 bytes.
 	add		fold_consts_ptr, fold_consts_ptr, #16
 	ld1		{fold_consts.2d}, [fold_consts_ptr], #16
-	__pmull_pre_\p	fold_consts
 	fold_16_bytes	\p, v0, v4
 	fold_16_bytes	\p, v1, v5
 	fold_16_bytes	\p, v2, v6
@@ -339,8 +379,7 @@ CPU_LE(	ext		v7.16b, v7.16b, v7.16b, #8	)
 	// into them, storing the result back into v7.
 	b.lt		.Lfold_16_bytes_loop_done_\@
.Lfold_16_bytes_loop_\@:
-	__pmull_\p	v8, v7, fold_consts
-	__pmull_\p	v7, v7, fold_consts, 2
+	pmull16x64_\p	fold_consts, v7, v8
 	eor		v7.16b, v7.16b, v8.16b
 	ldr		q0, [buf], #16
CPU_LE(	rev64		v0.16b, v0.16b	)
@@ -387,9 +426,8 @@ CPU_LE(	ext		v0.16b, v0.16b, v0.16b, #8	)
 	bsl		v2.16b, v1.16b, v0.16b
 
 	// Fold the first chunk into the second chunk, storing the result in v7.
-	__pmull_\p	v0, v3, fold_consts
-	__pmull_\p	v7, v3, fold_consts, 2
-	eor		v7.16b, v7.16b, v0.16b
+	pmull16x64_\p	fold_consts, v3, v0
+	eor		v7.16b, v3.16b, v0.16b
 	eor		v7.16b, v7.16b, v2.16b
 
.Lreduce_final_16_bytes_\@:
@@ -450,7 +488,6 @@ CPU_LE(	ext		v7.16b, v7.16b, v7.16b, #8	)
 
 	// Load the fold-across-16-bytes constants.
 	ld1		{fold_consts.2d}, [fold_consts_ptr], #16
-	__pmull_pre_\p	fold_consts
 
 	cmp		len, #16
 	b.eq		.Lreduce_final_16_bytes_\@	// len == 16

From patchwork Mon Oct 28 19:02:11 2024
Date: Mon, 28 Oct 2024 20:02:11 +0100
In-Reply-To: <20241028190207.1394367-8-ardb+git@google.com>
Message-ID: <20241028190207.1394367-11-ardb+git@google.com>
Subject: [PATCH 3/6] crypto: arm64/crct10dif - Remove remaining 64x64 PMULL fallback code
From: Ard Biesheuvel
To: linux-crypto@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org, ebiggers@kernel.org,
 herbert@gondor.apana.org.au, keescook@chromium.org, Ard Biesheuvel

The only remaining user of the fallback implementation of 64x64
polynomial multiplication using 8x8 PMULL instructions is the final
reduction from a 16-byte vector to a 16-bit CRC.

The fallback code is complicated and messy, and this reduction has very
little impact on the overall performance, so instead, let's calculate
the final CRC by passing the 16-byte vector to the generic CRC-T10DIF
implementation.
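(Sketch of the resulting call flow in the glue code -- this mirrors the
diff below rather than adding anything new. The key point is the zero
seed: the folded 16-byte vector already incorporates the previous CRC
state, so reducing it from a CRC of 0 yields the correct final value.)

    u8 buf[16];

    kernel_neon_begin();
    crc_t10dif_pmull_p8(crc, data, length, buf); /* fold input to 16 bytes */
    kernel_neon_end();

    /* reduce the folded vector using the generic implementation */
    *crcp = crc_t10dif_generic(0, buf, sizeof(buf));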
Signed-off-by: Ard Biesheuvel
---
 arch/arm64/crypto/crct10dif-ce-core.S | 237 +++++---------------
 arch/arm64/crypto/crct10dif-ce-glue.c |  15 +-
 2 files changed, 64 insertions(+), 188 deletions(-)

diff --git a/arch/arm64/crypto/crct10dif-ce-core.S b/arch/arm64/crypto/crct10dif-ce-core.S
index 8d99ccf61f16..1db5d1d1e2b7 100644
--- a/arch/arm64/crypto/crct10dif-ce-core.S
+++ b/arch/arm64/crypto/crct10dif-ce-core.S
@@ -74,14 +74,12 @@
 init_crc	.req	w0
 buf		.req	x1
 len		.req	x2
-fold_consts_ptr	.req	x3
+fold_consts_ptr	.req	x5
 
 fold_consts	.req	v10
 
 ad		.req	v14
-
-k00_16		.req	v15
-k32_48		.req	v16
+bd		.req	v15
 
 t3		.req	v17
 t4		.req	v18
@@ -91,117 +89,7 @@
 t8		.req	v22
 t9		.req	v23
 
-perm1		.req	v24
-perm2		.req	v25
-perm3		.req	v26
-perm4		.req	v27
-
-bd1		.req	v28
-bd2		.req	v29
-bd3		.req	v30
-bd4		.req	v31
-
-	.macro		__pmull_init_p64
-	.endm
-
-	.macro		__pmull_pre_p64, bd
-	.endm
-
-	.macro		__pmull_init_p8
-	// k00_16 := 0x0000000000000000_000000000000ffff
-	// k32_48 := 0x00000000ffffffff_0000ffffffffffff
-	movi		k32_48.2d, #0xffffffff
-	mov		k32_48.h[2], k32_48.h[0]
-	ushr		k00_16.2d, k32_48.2d, #32
-
-	// prepare the permutation vectors
-	mov_q		x5, 0x080f0e0d0c0b0a09
-	movi		perm4.8b, #8
-	dup		perm1.2d, x5
-	eor		perm1.16b, perm1.16b, perm4.16b
-	ushr		perm2.2d, perm1.2d, #8
-	ushr		perm3.2d, perm1.2d, #16
-	ushr		perm4.2d, perm1.2d, #24
-	sli		perm2.2d, perm1.2d, #56
-	sli		perm3.2d, perm1.2d, #48
-	sli		perm4.2d, perm1.2d, #40
-
-	mov_q		x5, 0x909010108080000
-	mov		bd1.d[0], x5
-	zip1		bd1.16b, bd1.16b, bd1.16b
-	.endm
-
-	.macro		__pmull_pre_p8, bd
-	tbl		bd1.16b, {\bd\().16b}, perm1.16b
-	tbl		bd2.16b, {\bd\().16b}, perm2.16b
-	tbl		bd3.16b, {\bd\().16b}, perm3.16b
-	tbl		bd4.16b, {\bd\().16b}, perm4.16b
-	.endm
-
-SYM_FUNC_START_LOCAL(__pmull_p8_core)
-.L__pmull_p8_core:
-	ext		t4.8b, ad.8b, ad.8b, #1			// A1
-	ext		t5.8b, ad.8b, ad.8b, #2			// A2
-	ext		t6.8b, ad.8b, ad.8b, #3			// A3
-
-	pmull		t4.8h, t4.8b, fold_consts.8b		// F = A1*B
-	pmull		t8.8h, ad.8b, bd1.8b			// E = A*B1
-	pmull		t5.8h, t5.8b, fold_consts.8b		// H = A2*B
-	pmull		t7.8h, ad.8b, bd2.8b			// G = A*B2
-	pmull		t6.8h, t6.8b, fold_consts.8b		// J = A3*B
-	pmull		t9.8h, ad.8b, bd3.8b			// I = A*B3
-	pmull		t3.8h, ad.8b, bd4.8b			// K = A*B4
-	b		0f
-
-.L__pmull_p8_core2:
-	tbl		t4.16b, {ad.16b}, perm1.16b		// A1
-	tbl		t5.16b, {ad.16b}, perm2.16b		// A2
-	tbl		t6.16b, {ad.16b}, perm3.16b		// A3
-
-	pmull2		t4.8h, t4.16b, fold_consts.16b		// F = A1*B
-	pmull2		t8.8h, ad.16b, bd1.16b			// E = A*B1
-	pmull2		t5.8h, t5.16b, fold_consts.16b		// H = A2*B
-	pmull2		t7.8h, ad.16b, bd2.16b			// G = A*B2
-	pmull2		t6.8h, t6.16b, fold_consts.16b		// J = A3*B
-	pmull2		t9.8h, ad.16b, bd3.16b			// I = A*B3
-	pmull2		t3.8h, ad.16b, bd4.16b			// K = A*B4
-
-0:	eor		t4.16b, t4.16b, t8.16b			// L = E + F
-	eor		t5.16b, t5.16b, t7.16b			// M = G + H
-	eor		t6.16b, t6.16b, t9.16b			// N = I + J
-
-	uzp1		t8.2d, t4.2d, t5.2d
-	uzp2		t4.2d, t4.2d, t5.2d
-	uzp1		t7.2d, t6.2d, t3.2d
-	uzp2		t6.2d, t6.2d, t3.2d
-
-	// t4 = (L) (P0 + P1) << 8
-	// t5 = (M) (P2 + P3) << 16
-	eor		t8.16b, t8.16b, t4.16b
-	and		t4.16b, t4.16b, k32_48.16b
-
-	// t6 = (N) (P4 + P5) << 24
-	// t7 = (K) (P6 + P7) << 32
-	eor		t7.16b, t7.16b, t6.16b
-	and		t6.16b, t6.16b, k00_16.16b
-
-	eor		t8.16b, t8.16b, t4.16b
-	eor		t7.16b, t7.16b, t6.16b
-
-	zip2		t5.2d, t8.2d, t4.2d
-	zip1		t4.2d, t8.2d, t4.2d
-	zip2		t3.2d, t7.2d, t6.2d
-	zip1		t6.2d, t7.2d, t6.2d
-
-	ext		t4.16b, t4.16b, t4.16b, #15
-	ext		t5.16b, t5.16b, t5.16b, #14
-	ext		t6.16b, t6.16b, t6.16b, #13
-	ext		t3.16b, t3.16b, t3.16b, #12
-
-	eor		t4.16b, t4.16b, t5.16b
-	eor		t6.16b, t6.16b, t3.16b
-	ret
-SYM_FUNC_END(__pmull_p8_core)
+perm		.req	v27
 
 SYM_FUNC_START_LOCAL(__pmull_p8_16x64)
 	ext		t6.16b, t5.16b, t5.16b, #8
@@ -235,30 +123,13 @@ SYM_FUNC_END(__pmull_p8_16x64)
 	 */
 	.macro		pmull16x64_p8, a16, b64, c64
 	ext		t7.16b, \b64\().16b, \b64\().16b, #1
-	tbl		t5.16b, {\a16\().16b}, bd1.16b
+	tbl		t5.16b, {\a16\().16b}, perm.16b
 	uzp1		t7.16b, \b64\().16b, t7.16b
 	bl		__pmull_p8_16x64
 	ext		\b64\().16b, t4.16b, t4.16b, #15
 	eor		\c64\().16b, t8.16b, t5.16b
 	.endm
 
-	.macro		__pmull_p8, rq, ad, bd, i
-	.ifnc		\bd, fold_consts
-	.err
-	.endif
-	mov		ad.16b, \ad\().16b
-	.ifb		\i
-	pmull		\rq\().8h, \ad\().8b, \bd\().8b		// D = A*B
-	.else
-	pmull2		\rq\().8h, \ad\().16b, \bd\().16b	// D = A*B
-	.endif
-
-	bl		.L__pmull_p8_core\i
-
-	eor		\rq\().16b, \rq\().16b, t4.16b
-	eor		\rq\().16b, \rq\().16b, t6.16b
-	.endm
-
 	// Fold reg1, reg2 into the next 32 data bytes, storing the result back
 	// into reg1, reg2.
 	.macro		fold_32_bytes, p, reg1, reg2
@@ -290,16 +161,7 @@ CPU_LE(	ext		v12.16b, v12.16b, v12.16b, #8	)
 	eor		\dst_reg\().16b, \dst_reg\().16b, \src_reg\().16b
 	.endm
 
-	.macro		__pmull_p64, rd, rn, rm, n
-	.ifb		\n
-	pmull		\rd\().1q, \rn\().1d, \rm\().1d
-	.else
-	pmull2		\rd\().1q, \rn\().2d, \rm\().2d
-	.endif
-	.endm
-
 	.macro		crc_t10dif_pmull, p
-	__pmull_init_\p
 
 	// For sizes less than 256 bytes, we can't fold 128 bytes at a time.
 	cmp		len, #256
@@ -429,47 +291,7 @@ CPU_LE(	ext		v0.16b, v0.16b, v0.16b, #8	)
 	pmull16x64_\p	fold_consts, v3, v0
 	eor		v7.16b, v3.16b, v0.16b
 	eor		v7.16b, v7.16b, v2.16b
-
-.Lreduce_final_16_bytes_\@:
-	// Reduce the 128-bit value M(x), stored in v7, to the final 16-bit CRC.
-
-	movi		v2.16b, #0		// init zero register
-
-	// Load 'x^48 * (x^48 mod G(x))' and 'x^48 * (x^80 mod G(x))'.
-	ld1		{fold_consts.2d}, [fold_consts_ptr], #16
-	__pmull_pre_\p	fold_consts
-
-	// Fold the high 64 bits into the low 64 bits, while also multiplying by
-	// x^64. This produces a 128-bit value congruent to x^64 * M(x) and
-	// whose low 48 bits are 0.
-	ext		v0.16b, v2.16b, v7.16b, #8
-	__pmull_\p	v7, v7, fold_consts, 2	// high bits * x^48 * (x^80 mod G(x))
-	eor		v0.16b, v0.16b, v7.16b	// + low bits * x^64
-
-	// Fold the high 32 bits into the low 96 bits. This produces a 96-bit
-	// value congruent to x^64 * M(x) and whose low 48 bits are 0.
-	ext		v1.16b, v0.16b, v2.16b, #12	// extract high 32 bits
-	mov		v0.s[3], v2.s[0]	// zero high 32 bits
-	__pmull_\p	v1, v1, fold_consts	// high 32 bits * x^48 * (x^48 mod G(x))
-	eor		v0.16b, v0.16b, v1.16b	// + low bits
-
-	// Load G(x) and floor(x^48 / G(x)).
-	ld1		{fold_consts.2d}, [fold_consts_ptr]
-	__pmull_pre_\p	fold_consts
-
-	// Use Barrett reduction to compute the final CRC value.
-	__pmull_\p	v1, v0, fold_consts, 2	// high 32 bits * floor(x^48 / G(x))
-	ushr		v1.2d, v1.2d, #32	// /= x^32
-	__pmull_\p	v1, v1, fold_consts	// *= G(x)
-	ushr		v0.2d, v0.2d, #48
-	eor		v0.16b, v0.16b, v1.16b	// + low 16 nonzero bits
-	// Final CRC value (x^16 * M(x)) mod G(x) is in low 16 bits of v0.
-
-	umov		w0, v0.h[0]
-	.ifc		\p, p8
-	frame_pop
-	.endif
-	ret
+	b		.Lreduce_final_16_bytes_\@
 
.Lless_than_256_bytes_\@:
 	// Checksumming a buffer of length 16...255 bytes
@@ -495,6 +317,8 @@ CPU_LE(	ext		v7.16b, v7.16b, v7.16b, #8	)
 	b.ge		.Lfold_16_bytes_loop_\@	// 32 <= len <= 255
 	add		len, len, #16
 	b		.Lhandle_partial_segment_\@	// 17 <= len <= 31
+
+.Lreduce_final_16_bytes_\@:
 	.endm
 
 //
@@ -504,7 +328,19 @@ CPU_LE(	ext		v7.16b, v7.16b, v7.16b, #8	)
 //
 SYM_FUNC_START(crc_t10dif_pmull_p8)
 	frame_push	1
+
+	mov_q		x4, 0x909010108080000
+	mov		perm.d[0], x4
+	zip1		perm.16b, perm.16b, perm.16b
+
 	crc_t10dif_pmull p8
+
+CPU_LE(	rev64		v7.16b, v7.16b	)
+CPU_LE(	ext		v7.16b, v7.16b, v7.16b, #8	)
+	str		q7, [x3]
+
+	frame_pop
+	ret
 SYM_FUNC_END(crc_t10dif_pmull_p8)
 
 	.align		5
@@ -515,6 +351,41 @@ SYM_FUNC_END(crc_t10dif_pmull_p8)
 //
 SYM_FUNC_START(crc_t10dif_pmull_p64)
 	crc_t10dif_pmull	p64
+
+	// Reduce the 128-bit value M(x), stored in v7, to the final 16-bit CRC.
+
+	movi		v2.16b, #0		// init zero register
+
+	// Load 'x^48 * (x^48 mod G(x))' and 'x^48 * (x^80 mod G(x))'.
+	ld1		{fold_consts.2d}, [fold_consts_ptr], #16
+
+	// Fold the high 64 bits into the low 64 bits, while also multiplying by
+	// x^64. This produces a 128-bit value congruent to x^64 * M(x) and
+	// whose low 48 bits are 0.
+	ext		v0.16b, v2.16b, v7.16b, #8
+	pmull2		v7.1q, v7.2d, fold_consts.2d	// high bits * x^48 * (x^80 mod G(x))
+	eor		v0.16b, v0.16b, v7.16b		// + low bits * x^64
+
+	// Fold the high 32 bits into the low 96 bits. This produces a 96-bit
+	// value congruent to x^64 * M(x) and whose low 48 bits are 0.
+	ext		v1.16b, v0.16b, v2.16b, #12	// extract high 32 bits
+	mov		v0.s[3], v2.s[0]		// zero high 32 bits
+	pmull		v1.1q, v1.1d, fold_consts.1d	// high 32 bits * x^48 * (x^48 mod G(x))
+	eor		v0.16b, v0.16b, v1.16b		// + low bits
+
+	// Load G(x) and floor(x^48 / G(x)).
+	ld1		{fold_consts.2d}, [fold_consts_ptr]
+
+	// Use Barrett reduction to compute the final CRC value.
+	pmull2		v1.1q, v1.2d, fold_consts.2d	// high 32 bits * floor(x^48 / G(x))
+	ushr		v1.2d, v1.2d, #32		// /= x^32
+	pmull		v1.1q, v1.1d, fold_consts.1d	// *= G(x)
+	ushr		v0.2d, v0.2d, #48
+	eor		v0.16b, v0.16b, v1.16b		// + low 16 nonzero bits
+	// Final CRC value (x^16 * M(x)) mod G(x) is in low 16 bits of v0.
+
+	umov		w0, v0.h[0]
+	ret
 SYM_FUNC_END(crc_t10dif_pmull_p64)
 
 	.section	".rodata", "a"

diff --git a/arch/arm64/crypto/crct10dif-ce-glue.c b/arch/arm64/crypto/crct10dif-ce-glue.c
index 7b05094a0480..b6db5f5683e1 100644
--- a/arch/arm64/crypto/crct10dif-ce-glue.c
+++ b/arch/arm64/crypto/crct10dif-ce-glue.c
@@ -20,7 +20,7 @@
 
 #define CRC_T10DIF_PMULL_CHUNK_SIZE	16U
 
-asmlinkage u16 crc_t10dif_pmull_p8(u16 init_crc, const u8 *buf, size_t len);
+asmlinkage void crc_t10dif_pmull_p8(u16 init_crc, const u8 *buf, size_t len, u8 *out);
 asmlinkage u16 crc_t10dif_pmull_p64(u16 init_crc, const u8 *buf, size_t len);
 
 static int crct10dif_init(struct shash_desc *desc)
@@ -34,16 +34,21 @@ static int crct10dif_init(struct shash_desc *desc)
 static int crct10dif_update_pmull_p8(struct shash_desc *desc, const u8 *data,
 				     unsigned int length)
 {
-	u16 *crc = shash_desc_ctx(desc);
+	u16 *crcp = shash_desc_ctx(desc);
+	u16 crc = *crcp;
+	u8 buf[16];
 
 	if (length >= CRC_T10DIF_PMULL_CHUNK_SIZE && crypto_simd_usable()) {
 		kernel_neon_begin();
-		*crc = crc_t10dif_pmull_p8(*crc, data, length);
+		crc_t10dif_pmull_p8(crc, data, length, buf);
 		kernel_neon_end();
-	} else {
-		*crc = crc_t10dif_generic(*crc, data, length);
+
+		crc = 0;
+		data = buf;
+		length = sizeof(buf);
 	}
 
+	*crcp = crc_t10dif_generic(crc, data, length);
 	return 0;
 }

From patchwork Mon Oct 28 19:02:12 2024
Date: Mon, 28 Oct 2024 20:02:12 +0100
In-Reply-To: <20241028190207.1394367-8-ardb+git@google.com>
Message-ID: <20241028190207.1394367-12-ardb+git@google.com>
Subject: [PATCH 4/6] crypto: arm/crct10dif - Use existing mov_l macro instead of __adrl
From: Ard Biesheuvel
To: linux-crypto@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org, ebiggers@kernel.org,
 herbert@gondor.apana.org.au, keescook@chromium.org, Ard Biesheuvel

Signed-off-by: Ard Biesheuvel
Reviewed-by: Eric Biggers
---
 arch/arm/crypto/crct10dif-ce-core.S | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/arch/arm/crypto/crct10dif-ce-core.S b/arch/arm/crypto/crct10dif-ce-core.S
index 46c02c518a30..4dac32e020de 100644
--- a/arch/arm/crypto/crct10dif-ce-core.S
+++ b/arch/arm/crypto/crct10dif-ce-core.S
@@ -144,11 +144,6 @@ CPU_LE(	vrev64.8	q12, q12	)
 	veor.8		\dst_reg, \dst_reg, \src_reg
 	.endm
 
-	.macro		__adrl, out, sym
-	movw		\out, #:lower16:\sym
-	movt		\out, #:upper16:\sym
-	.endm
-
 //
 // u16 crc_t10dif_pmull(u16 init_crc, const u8 *buf, size_t len);
 //
@@ -160,7 +155,7 @@ ENTRY(crc_t10dif_pmull)
 	cmp		len, #256
 	blt		.Lless_than_256_bytes
 
-	__adrl		fold_consts_ptr, .Lfold_across_128_bytes_consts
+	mov_l		fold_consts_ptr, .Lfold_across_128_bytes_consts
 
 	// Load the first 128 data bytes. Byte swapping is necessary to make
 	// the bit order match the polynomial coefficient order.
@@ -262,7 +257,7 @@ CPU_LE(	vrev64.8	q0, q0	)
 	vswp		q0l, q0h
 
 	// q1 = high order part of second chunk: q7 left-shifted by 'len' bytes.
-	__adrl		r3, .Lbyteshift_table + 16
+	mov_l		r3, .Lbyteshift_table + 16
 	sub		r3, r3, len
 	vld1.8		{q2}, [r3]
 	vtbl.8		q1l, {q7l-q7h}, q2l
@@ -324,7 +319,7 @@ CPU_LE(	vrev64.8	q0, q0	)
 .Lless_than_256_bytes:
 	// Checksumming a buffer of length 16...255 bytes
 
-	__adrl		fold_consts_ptr, .Lfold_across_16_bytes_consts
+	mov_l		fold_consts_ptr, .Lfold_across_16_bytes_consts
 
 	// Load the first 16 data bytes.
 	vld1.64		{q7}, [buf]!

From patchwork Mon Oct 28 19:02:13 2024
Date: Mon, 28 Oct 2024 20:02:13 +0100
In-Reply-To: <20241028190207.1394367-8-ardb+git@google.com>
Message-ID: <20241028190207.1394367-13-ardb+git@google.com>
Subject: [PATCH 5/6] crypto: arm/crct10dif - Macroify PMULL asm code
From: Ard Biesheuvel
To: linux-crypto@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org, ebiggers@kernel.org,
 herbert@gondor.apana.org.au, keescook@chromium.org, Ard Biesheuvel

To allow an alternative version of the PMULL-based CRC-T10DIF algorithm
to be created, turn the bulk of it into a macro, except for the final
reduction, which will only be used by the existing version.
Signed-off-by: Ard Biesheuvel
---
 arch/arm/crypto/crct10dif-ce-core.S | 154 ++++++++++----------
 arch/arm/crypto/crct10dif-ce-glue.c |  10 +-
 2 files changed, 83 insertions(+), 81 deletions(-)

diff --git a/arch/arm/crypto/crct10dif-ce-core.S b/arch/arm/crypto/crct10dif-ce-core.S
index 4dac32e020de..6b72167574b2 100644
--- a/arch/arm/crypto/crct10dif-ce-core.S
+++ b/arch/arm/crypto/crct10dif-ce-core.S
@@ -112,48 +112,42 @@
 FOLD_CONST_L	.req	q10l
 FOLD_CONST_H	.req	q10h
 
+	.macro		pmull16x64_p64, v16, v64
+	vmull.p64	q11, \v64\()l, \v16\()_L
+	vmull.p64	\v64, \v64\()h, \v16\()_H
+	veor		\v64, \v64, q11
+	.endm
+
 	// Fold reg1, reg2 into the next 32 data bytes, storing the result back
 	// into reg1, reg2.
-	.macro		fold_32_bytes, reg1, reg2
-	vld1.64		{q11-q12}, [buf]!
+	.macro		fold_32_bytes, reg1, reg2, p
+	vld1.64		{q8-q9}, [buf]!
 
-	vmull.p64	q8, \reg1\()h, FOLD_CONST_H
-	vmull.p64	\reg1, \reg1\()l, FOLD_CONST_L
-	vmull.p64	q9, \reg2\()h, FOLD_CONST_H
-	vmull.p64	\reg2, \reg2\()l, FOLD_CONST_L
+	pmull16x64_\p	FOLD_CONST, \reg1
+	pmull16x64_\p	FOLD_CONST, \reg2
 
-CPU_LE(	vrev64.8	q11, q11	)
-CPU_LE(	vrev64.8	q12, q12	)
-	vswp		q11l, q11h
-	vswp		q12l, q12h
+CPU_LE(	vrev64.8	q8, q8	)
+CPU_LE(	vrev64.8	q9, q9	)
+	vswp		q8l, q8h
+	vswp		q9l, q9h
 
 	veor.8		\reg1, \reg1, q8
 	veor.8		\reg2, \reg2, q9
-	veor.8		\reg1, \reg1, q11
-	veor.8		\reg2, \reg2, q12
 	.endm
 
 	// Fold src_reg into dst_reg, optionally loading the next fold constants
-	.macro		fold_16_bytes, src_reg, dst_reg, load_next_consts
-	vmull.p64	q8, \src_reg\()l, FOLD_CONST_L
-	vmull.p64	\src_reg, \src_reg\()h, FOLD_CONST_H
+	.macro		fold_16_bytes, src_reg, dst_reg, p, load_next_consts
+	pmull16x64_\p	FOLD_CONST, \src_reg
 	.ifnb		\load_next_consts
 	vld1.64		{FOLD_CONSTS}, [fold_consts_ptr, :128]!
 	.endif
-	veor.8		\dst_reg, \dst_reg, q8
 	veor.8		\dst_reg, \dst_reg, \src_reg
 	.endm
 
-//
-// u16 crc_t10dif_pmull(u16 init_crc, const u8 *buf, size_t len);
-//
-// Assumes len >= 16.
-//
-ENTRY(crc_t10dif_pmull)
-
+	.macro		crct10dif, p
 	// For sizes less than 256 bytes, we can't fold 128 bytes at a time.
 	cmp		len, #256
-	blt		.Lless_than_256_bytes
+	blt		.Lless_than_256_bytes\@
 
 	mov_l		fold_consts_ptr, .Lfold_across_128_bytes_consts
 
@@ -194,27 +188,27 @@ CPU_LE(	vrev64.8	q7, q7	)
 
 	// While >= 128 data bytes remain (not counting q0-q7), fold the 128
 	// bytes q0-q7 into them, storing the result back into q0-q7.
-.Lfold_128_bytes_loop:
-	fold_32_bytes	q0, q1
-	fold_32_bytes	q2, q3
-	fold_32_bytes	q4, q5
-	fold_32_bytes	q6, q7
+.Lfold_128_bytes_loop\@:
+	fold_32_bytes	q0, q1, \p
+	fold_32_bytes	q2, q3, \p
+	fold_32_bytes	q4, q5, \p
+	fold_32_bytes	q6, q7, \p
 	subs		len, len, #128
-	bge		.Lfold_128_bytes_loop
+	bge		.Lfold_128_bytes_loop\@
 
 	// Now fold the 112 bytes in q0-q6 into the 16 bytes in q7.
 
 	// Fold across 64 bytes.
 	vld1.64		{FOLD_CONSTS}, [fold_consts_ptr, :128]!
-	fold_16_bytes	q0, q4
-	fold_16_bytes	q1, q5
-	fold_16_bytes	q2, q6
-	fold_16_bytes	q3, q7, 1
+	fold_16_bytes	q0, q4, \p
+	fold_16_bytes	q1, q5, \p
+	fold_16_bytes	q2, q6, \p
+	fold_16_bytes	q3, q7, \p, 1
 	// Fold across 32 bytes.
-	fold_16_bytes	q4, q6
-	fold_16_bytes	q5, q7, 1
+	fold_16_bytes	q4, q6, \p
+	fold_16_bytes	q5, q7, \p, 1
 	// Fold across 16 bytes.
-	fold_16_bytes	q6, q7
+	fold_16_bytes	q6, q7, \p
 
 	// Add 128 to get the correct number of data bytes remaining in 0...127
 	// (not counting q7), following the previous extra subtraction by 128.
@@ -224,25 +218,23 @@ CPU_LE(	vrev64.8	q7, q7	)
 
 	// While >= 16 data bytes remain (not counting q7), fold the 16 bytes q7
 	// into them, storing the result back into q7.
-	blt		.Lfold_16_bytes_loop_done
-.Lfold_16_bytes_loop:
-	vmull.p64	q8, q7l, FOLD_CONST_L
-	vmull.p64	q7, q7h, FOLD_CONST_H
-	veor.8		q7, q7, q8
+	blt		.Lfold_16_bytes_loop_done\@
+.Lfold_16_bytes_loop\@:
+	pmull16x64_\p	FOLD_CONST, q7
 	vld1.64		{q0}, [buf]!
CPU_LE(	vrev64.8	q0, q0	)
 	vswp		q0l, q0h
 	veor.8		q7, q7, q0
 	subs		len, len, #16
-	bge		.Lfold_16_bytes_loop
+	bge		.Lfold_16_bytes_loop\@
 
-.Lfold_16_bytes_loop_done:
+.Lfold_16_bytes_loop_done\@:
 	// Add 16 to get the correct number of data bytes remaining in 0...15
 	// (not counting q7), following the previous extra subtraction by 16.
 	adds		len, len, #16
-	beq		.Lreduce_final_16_bytes
+	beq		.Lreduce_final_16_bytes\@
 
-.Lhandle_partial_segment:
+.Lhandle_partial_segment\@:
 	// Reduce the last '16 + len' bytes where 1 <= len <= 15 and the first
 	// 16 bytes are in q7 and the rest are the remaining data in 'buf'. To
 	// do this without needing a fold constant for each possible 'len',
@@ -277,12 +269,46 @@ CPU_LE(	vrev64.8	q0, q0	)
 	vbsl.8		q2, q1, q0
 
 	// Fold the first chunk into the second chunk, storing the result in q7.
-	vmull.p64	q0, q3l, FOLD_CONST_L
-	vmull.p64	q7, q3h, FOLD_CONST_H
-	veor.8		q7, q7, q0
-	veor.8		q7, q7, q2
+	pmull16x64_\p	FOLD_CONST, q3
+	veor.8		q7, q3, q2
+	b		.Lreduce_final_16_bytes\@
+
+.Lless_than_256_bytes\@:
+	// Checksumming a buffer of length 16...255 bytes
+
+	mov_l		fold_consts_ptr, .Lfold_across_16_bytes_consts
+
+	// Load the first 16 data bytes.
+	vld1.64		{q7}, [buf]!
+CPU_LE(	vrev64.8	q7, q7	)
+	vswp		q7l, q7h
+
+	// XOR the first 16 data *bits* with the initial CRC value.
+	vmov.i8		q0h, #0
+	vmov.u16	q0h[3], init_crc
+	veor.8		q7h, q7h, q0h
+
+	// Load the fold-across-16-bytes constants.
+	vld1.64		{FOLD_CONSTS}, [fold_consts_ptr, :128]!
+
+	cmp		len, #16
+	beq		.Lreduce_final_16_bytes\@	// len == 16
+	subs		len, len, #32
+	addlt		len, len, #16
+	blt		.Lhandle_partial_segment\@	// 17 <= len <= 31
+	b		.Lfold_16_bytes_loop\@		// 32 <= len <= 255
+
+.Lreduce_final_16_bytes\@:
+	.endm
+
+//
+// u16 crc_t10dif_pmull(u16 init_crc, const u8 *buf, size_t len);
+//
+// Assumes len >= 16.
+//
+ENTRY(crc_t10dif_pmull64)
+	crct10dif	p64
 
-.Lreduce_final_16_bytes:
 	// Reduce the 128-bit value M(x), stored in q7, to the final 16-bit CRC.
 
 	// Load 'x^48 * (x^48 mod G(x))' and 'x^48 * (x^80 mod G(x))'.
@@ -316,31 +342,7 @@ CPU_LE(	vrev64.8	q0, q0	)
 	vmov.u16	r0, q0l[0]
 	bx		lr
 
-.Lless_than_256_bytes:
-	// Checksumming a buffer of length 16...255 bytes
-
-	mov_l		fold_consts_ptr, .Lfold_across_16_bytes_consts
-
-	// Load the first 16 data bytes.
-	vld1.64		{q7}, [buf]!
-CPU_LE(	vrev64.8	q7, q7	)
-	vswp		q7l, q7h
-
-	// XOR the first 16 data *bits* with the initial CRC value.
-	vmov.i8		q0h, #0
-	vmov.u16	q0h[3], init_crc
-	veor.8		q7h, q7h, q0h
-
-	// Load the fold-across-16-bytes constants.
-	vld1.64		{FOLD_CONSTS}, [fold_consts_ptr, :128]!
-
-	cmp		len, #16
-	beq		.Lreduce_final_16_bytes	// len == 16
-	subs		len, len, #32
-	addlt		len, len, #16
-	blt		.Lhandle_partial_segment	// 17 <= len <= 31
-	b		.Lfold_16_bytes_loop	// 32 <= len <= 255
-ENDPROC(crc_t10dif_pmull)
+ENDPROC(crc_t10dif_pmull64)

diff --git a/arch/arm/crypto/crct10dif-ce-glue.c b/arch/arm/crypto/crct10dif-ce-glue.c
index 79f3b204d8c0..60aa79c2fcdb 100644
--- a/arch/arm/crypto/crct10dif-ce-glue.c
+++ b/arch/arm/crypto/crct10dif-ce-glue.c
@@ -19,7 +19,7 @@
 
 #define CRC_T10DIF_PMULL_CHUNK_SIZE	16U
 
-asmlinkage u16 crc_t10dif_pmull(u16 init_crc, const u8 *buf, size_t len);
+asmlinkage u16 crc_t10dif_pmull64(u16 init_crc, const u8 *buf, size_t len);
 
 static int crct10dif_init(struct shash_desc *desc)
 {
@@ -29,14 +29,14 @@ static int crct10dif_init(struct shash_desc *desc)
 	return 0;
 }
 
-static int crct10dif_update(struct shash_desc *desc, const u8 *data,
-			    unsigned int length)
+static int crct10dif_update_ce(struct shash_desc *desc, const u8 *data,
+			       unsigned int length)
 {
 	u16 *crc = shash_desc_ctx(desc);
 
 	if (length >= CRC_T10DIF_PMULL_CHUNK_SIZE && crypto_simd_usable()) {
 		kernel_neon_begin();
-		*crc = crc_t10dif_pmull(*crc, data, length);
+		*crc = crc_t10dif_pmull64(*crc, data, length);
 		kernel_neon_end();
 	} else {
 		*crc = crc_t10dif_generic(*crc, data, length);
@@ -56,7 +56,7 @@ static int crct10dif_final(struct shash_desc *desc, u8 *out)
 static struct shash_alg crc_t10dif_alg = {
 	.digestsize		= CRC_T10DIF_DIGEST_SIZE,
 	.init			= crct10dif_init,
-	.update			= crct10dif_update,
+	.update			= crct10dif_update_ce,
 	.final			= crct10dif_final,
 	.descsize		= CRC_T10DIF_DIGEST_SIZE,

From patchwork Mon Oct 28 19:02:14 2024
Date: Mon, 28 Oct 2024 20:02:14 +0100
In-Reply-To: <20241028190207.1394367-8-ardb+git@google.com>
Message-ID: <20241028190207.1394367-14-ardb+git@google.com>
Subject: [PATCH 6/6] crypto: arm/crct10dif - Implement plain NEON variant
From: Ard Biesheuvel
To: linux-crypto@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org, ebiggers@kernel.org,
 herbert@gondor.apana.org.au, keescook@chromium.org, Ard Biesheuvel

The CRC-T10DIF algorithm produces a 16-bit CRC, and this is reflected
in the folding coefficients, which are also only 16 bits wide.
This means that the polynomial multiplications involving these
coefficients can be performed using 8-bit long polynomial multiplication
(8x8 -> 16) in only a few steps. That instruction is part of the base
NEON ISA, which is all that most real ARMv7 cores implement. (The 64-bit
PMULL instruction is part of the crypto extensions, which are only
implemented by 64-bit cores.)

The final reduction is a bit more involved, but we can delegate that to
the generic CRC-T10DIF implementation after folding the entire input
into a 16-byte vector.

This results in a speedup of around 6.6x on Cortex-A72 running in
32-bit mode.

Signed-off-by: Ard Biesheuvel
---
 arch/arm/crypto/crct10dif-ce-core.S | 50 ++++++++++++++++++--
 arch/arm/crypto/crct10dif-ce-glue.c | 44 +++++++++++++++--
 2 files changed, 85 insertions(+), 9 deletions(-)
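To see why byte-wide multiplies suffice here: each folding coefficient is 16
bits, so every multiplication the fold step needs is a 16x64-bit carryless
product, and that decomposes into sixteen 8x8 -> 16 partial products; an
8x8 -> 16 polynomial multiply per lane is what vmull.p8 provides. A plain-C
sketch of the decomposition follows; clmul8() and clmul16x64() are illustrative
names, and the kernel code below batches these partials across vector lanes
rather than looping.

#include <stdint.h>

/* 8x8 -> 16 carryless multiply: what one vmull.p8 lane computes. */
static uint16_t clmul8(uint8_t a, uint8_t b)
{
	uint16_t r = 0;

	for (int i = 0; i < 8; i++)
		if (b & (1u << i))
			r ^= (uint16_t)a << i;
	return r;
}

/*
 * 16 x 64 -> 79-bit carryless product assembled from byte-wise partials
 * XORed in at their bit offsets. Result: bits 0..63 in *lo, bits 64..78
 * in *hi.
 */
static void clmul16x64(uint16_t a, uint64_t b, uint64_t *lo, uint16_t *hi)
{
	*lo = 0;
	*hi = 0;
	for (int i = 0; i < 2; i++) {		/* the two bytes of a */
		for (int j = 0; j < 8; j++) {	/* the eight bytes of b */
			uint16_t p = clmul8(a >> (8 * i), b >> (8 * j));
			int sh = 8 * (i + j);	/* bit offset of this partial */

			if (sh < 64)
				*lo ^= (uint64_t)p << sh;
			if (sh + 16 > 64)	/* bits spilling past bit 63 */
				*hi ^= sh < 64 ? p >> (64 - sh) : p;
		}
	}
}

Roughly speaking, the vext.8/veor sequence in __pmull16x64_p8 below performs in
vector form the same align-and-XOR accumulation that the shifts in clmul16x64()
express one partial at a time.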
diff --git a/arch/arm/crypto/crct10dif-ce-core.S b/arch/arm/crypto/crct10dif-ce-core.S
index 6b72167574b2..5e103a9a42dd 100644
--- a/arch/arm/crypto/crct10dif-ce-core.S
+++ b/arch/arm/crypto/crct10dif-ce-core.S
@@ -112,6 +112,34 @@
 	FOLD_CONST_L	.req	q10l
 	FOLD_CONST_H	.req	q10h

+__pmull16x64_p8:
+	vmull.p8	q13, d23, d24
+	vmull.p8	q14, d23, d25
+	vmull.p8	q15, d22, d24
+	vmull.p8	q12, d22, d25
+
+	veor		q14, q14, q15
+	veor		d24, d24, d25
+	veor		d26, d26, d27
+	veor		d28, d28, d29
+	vmov.i32	d25, #0
+	vmov.i32	d29, #0
+	vext.8		q12, q12, q12, #14
+	vext.8		q14, q14, q14, #15
+	veor		d24, d24, d26
+	bx		lr
+ENDPROC(__pmull16x64_p8)
+
+	.macro		pmull16x64_p8, v16, v64
+	vext.8		q11, \v64, \v64, #1
+	vld1.64		{q12}, [r4, :128]
+	vuzp.8		q11, \v64
+	vtbl.8		d24, {\v16\()_L-\v16\()_H}, d24
+	vtbl.8		d25, {\v16\()_L-\v16\()_H}, d25
+	bl		__pmull16x64_p8
+	veor		\v64, q12, q14
+	.endm
+
 	.macro		pmull16x64_p64, v16, v64
 	vmull.p64	q11, \v64\()l, \v16\()_L
 	vmull.p64	\v64, \v64\()h, \v16\()_H
@@ -249,9 +277,9 @@ CPU_LE(	vrev64.8	q0, q0	)
 	vswp		q0l, q0h

 	// q1 = high order part of second chunk: q7 left-shifted by 'len' bytes.
-	mov_l		r3, .Lbyteshift_table + 16
-	sub		r3, r3, len
-	vld1.8		{q2}, [r3]
+	mov_l		r1, .Lbyteshift_table + 16
+	sub		r1, r1, len
+	vld1.8		{q2}, [r1]
 	vtbl.8		q1l, {q7l-q7h}, q2l
 	vtbl.8		q1h, {q7l-q7h}, q2h

@@ -341,9 +369,20 @@ ENTRY(crc_t10dif_pmull64)
 	vmov.u16	r0, q0l[0]
 	bx		lr
-
 ENDPROC(crc_t10dif_pmull64)

+ENTRY(crc_t10dif_pmull8)
+	push		{r4, lr}
+	mov_l		r4, .L16x64perm
+
+	crct10dif	p8
+
+CPU_LE(	vrev64.8	q7, q7	)
+	vswp		q7l, q7h
+	vst1.64		{q7}, [r3, :128]
+	pop		{r4, pc}
+ENDPROC(crc_t10dif_pmull8)
+
 	.section	".rodata", "a"
 	.align		4

@@ -376,3 +415,6 @@ ENDPROC(crc_t10dif_pmull64)
 	.byte		0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f
 	.byte		 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7
 	.byte		 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe , 0x0
+
+.L16x64perm:
+	.quad		0x808080800000000, 0x909090901010101
diff --git a/arch/arm/crypto/crct10dif-ce-glue.c b/arch/arm/crypto/crct10dif-ce-glue.c
index 60aa79c2fcdb..4431e4ce2dbe 100644
--- a/arch/arm/crypto/crct10dif-ce-glue.c
+++ b/arch/arm/crypto/crct10dif-ce-glue.c
@@ -20,6 +20,7 @@
 #define CRC_T10DIF_PMULL_CHUNK_SIZE	16U

 asmlinkage u16 crc_t10dif_pmull64(u16 init_crc, const u8 *buf, size_t len);
+asmlinkage void crc_t10dif_pmull8(u16 init_crc, const u8 *buf, size_t len, u8 *out);

 static int crct10dif_init(struct shash_desc *desc)
 {
@@ -45,6 +46,27 @@ static int crct10dif_update_ce(struct shash_desc *desc, const u8 *data,
 	return 0;
 }

+static int crct10dif_update_neon(struct shash_desc *desc, const u8 *data,
+				 unsigned int length)
+{
+	u16 *crcp = shash_desc_ctx(desc);
+	u8 buf[16] __aligned(16);
+	u16 crc = *crcp;
+
+	if (length >= CRC_T10DIF_PMULL_CHUNK_SIZE && crypto_simd_usable()) {
+		kernel_neon_begin();
+		crc_t10dif_pmull8(crc, data, length, buf);
+		kernel_neon_end();
+
+		crc = 0;
+		data = buf;
+		length = sizeof(buf);
+	}
+
+	*crcp = crc_t10dif_generic(crc, data, length);
+	return 0;
+}
+
 static int crct10dif_final(struct shash_desc *desc, u8 *out)
 {
 	u16 *crc = shash_desc_ctx(desc);
@@ -53,7 +75,19 @@ static int crct10dif_final(struct shash_desc *desc, u8 *out)
 	return 0;
 }

-static struct shash_alg crc_t10dif_alg = {
+static struct shash_alg algs[] = {{
+	.digestsize		= CRC_T10DIF_DIGEST_SIZE,
+	.init			= crct10dif_init,
+	.update			= crct10dif_update_neon,
+	.final			= crct10dif_final,
+	.descsize		= CRC_T10DIF_DIGEST_SIZE,
+
+	.base.cra_name		= "crct10dif",
+	.base.cra_driver_name	= "crct10dif-arm-neon",
+	.base.cra_priority	= 150,
+	.base.cra_blocksize	= CRC_T10DIF_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+}, {
 	.digestsize		= CRC_T10DIF_DIGEST_SIZE,
 	.init			= crct10dif_init,
 	.update			= crct10dif_update_ce,
@@ -65,19 +99,19 @@ static struct shash_alg crc_t10dif_alg = {
 	.base.cra_priority	= 200,
 	.base.cra_blocksize	= CRC_T10DIF_BLOCK_SIZE,
 	.base.cra_module	= THIS_MODULE,
-};
+}};

 static int __init crc_t10dif_mod_init(void)
 {
-	if (!(elf_hwcap2 & HWCAP2_PMULL))
+	if (!(elf_hwcap & HWCAP_NEON))
 		return -ENODEV;

-	return crypto_register_shash(&crc_t10dif_alg);
+	return crypto_register_shashes(algs, 1 + !!(elf_hwcap2 & HWCAP2_PMULL));
 }

 static void __exit crc_t10dif_mod_exit(void)
 {
-	crypto_unregister_shash(&crc_t10dif_alg);
+	crypto_unregister_shashes(algs, 1 + !!(elf_hwcap2 & HWCAP2_PMULL));
 }

 module_init(crc_t10dif_mod_init);
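One property of the glue code above is worth making explicit: an shash update
must produce the same digest regardless of how callers split the input, which
is what lets crct10dif_update_neon() mix the NEON path (for inputs of at least
one 16-byte chunk) with the generic fallback (for short inputs and the folded
16-byte residue). The only state carried between calls is the 16-bit CRC
itself. A quick user-space check of that resumability, reusing the bit-serial
reference sketched earlier:

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-serial CRC-T10DIF reference, same as the earlier sketch. */
static uint16_t crc_t10dif_ref(uint16_t crc, const uint8_t *buf, size_t len)
{
	while (len--) {
		crc ^= (uint16_t)(*buf++) << 8;
		for (int i = 0; i < 8; i++)
			crc = (crc & 0x8000) ? (crc << 1) ^ 0x8bb7 : crc << 1;
	}
	return crc;
}

int main(void)
{
	uint8_t msg[256];

	for (size_t i = 0; i < sizeof(msg); i++)
		msg[i] = (uint8_t)(i * 31 + 7);

	/* Splitting at any point and resuming from the 16-bit state must
	 * match the one-shot result, or mixing backends would be unsound. */
	uint16_t whole = crc_t10dif_ref(0, msg, sizeof(msg));

	for (size_t split = 0; split <= sizeof(msg); split++) {
		uint16_t crc = crc_t10dif_ref(0, msg, split);

		crc = crc_t10dif_ref(crc, msg + split, sizeof(msg) - split);
		assert(crc == whole);
	}
	return 0;
}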