From patchwork Tue May 7 00:23:37 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Eric Biggers X-Patchwork-Id: 795335 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 23DB8EDE; Tue, 7 May 2024 00:25:16 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715041517; cv=none; b=Zl07JVnNGXXb6rKBJrsVNDEjoQ7fMxrQhYmAKzTf8G4lMxFl1IIZOhPGPHxC7eAIbO4BNuZo27bkGjyFkeWV5xBrldiiw56lzQlXP28z+PJIr7xLFT51tVHUV0xj77QEek2TJiqgBcJ3TTw/2+YqmzOjTdXqRbYE1nD+x3d+62U= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715041517; c=relaxed/simple; bh=Uz9jUnVkeFhGTzlbzEDSdQYkq7NwLZLlyFt/EXJFwgU=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=Ub8kRpAD+bbYa3fyv0tkLOUic9U8Nvm8OwLtUSh3ua5T+xkKoy7048nAxk4L2Tu0aN+1MlpTxymJcWcHzV3dJcdBmd712B5XDNep3W9uM8U4jCPzqNqa2Z734blpoADwP6pCSq95Bgys0rAKXx4cXjPUX0RfnEWth3+aWuLwWEI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=fdoVgErU; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="fdoVgErU" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 74AEBC4AF68; Tue, 7 May 2024 00:25:16 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1715041516; bh=Uz9jUnVkeFhGTzlbzEDSdQYkq7NwLZLlyFt/EXJFwgU=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=fdoVgErUHlFaNHNPh7L2HqNT2WhznXXZhVrDA03ETRpynG1rucaSJefVdt2C+j4Dg 
pByuM9zrY3/oKGhLDINRgqHVGjPoU6hPpffQnAsgTPBJO8WmuwAeehgVemsROA0O35 LGulunY6Bce+vgQqqhKRnCdScqv2Px2OYsSLst/Ei+5/ko7Fa9QPG/WUAabYP6KybP tGCMblilUdPk+SI7OsBRZYKJb9uEKHZ7ybTYF4XqtWww0jXcVUM4/S5juxhzlouGx1 CFGaSIC74zn60cHGQoYvGsAoAw+x4S2kEWBwZFxbSpL/RVotxMl4bkzZsZ0TzigmzZ SOCmbWoyvaPig== From: Eric Biggers To: linux-crypto@vger.kernel.org, fsverity@lists.linux.dev, dm-devel@lists.linux.dev Cc: x86@kernel.org, linux-arm-kernel@lists.infradead.org, Ard Biesheuvel , Sami Tolvanen , Bart Van Assche Subject: [PATCH v3 2/8] crypto: testmgr - generate power-of-2 lengths more often Date: Mon, 6 May 2024 17:23:37 -0700 Message-ID: <20240507002343.239552-3-ebiggers@kernel.org> X-Mailer: git-send-email 2.45.0 In-Reply-To: <20240507002343.239552-1-ebiggers@kernel.org> References: <20240507002343.239552-1-ebiggers@kernel.org> Precedence: bulk X-Mailing-List: linux-crypto@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Eric Biggers Implementations of hash functions often have special cases when lengths are a multiple of the hash function's internal block size (e.g. 64 for SHA-256, 128 for SHA-512). Currently, when the fuzz testing code generates lengths, it doesn't prefer any length mod 64 over any other. This limits the coverage of these special cases. Therefore, this patch updates the fuzz testing code to generate power-of-2 lengths and divide messages exactly in half a bit more often. 
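For illustration only, the length-generation behaviour described above can be sketched in userspace C. This is a rough sketch under stated assumptions, not the kernel code: rand_below() and rounddown_pow2() are hypothetical stand-ins for prandom_u32_below() and rounddown_pow_of_two(), and the explicit zero-length guard is an assumption of this sketch.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for prandom_u32_below(): returns a value in
 * [0, ceil).  Built on rand(), so slightly biased; fine for a sketch.
 */
static unsigned int rand_below(unsigned int ceil)
{
	assert(ceil > 0);
	return (unsigned int)rand() % ceil;
}

/* Hypothetical stand-in for rounddown_pow_of_two(); requires n != 0. */
static unsigned int rounddown_pow2(unsigned int n)
{
	unsigned int p = 1;

	assert(n != 0);
	while (p <= n / 2)
		p *= 2;
	return p;
}

/* Pick a length in [0, max_len], biased toward small and power-of-2 values. */
static unsigned int generate_random_length(unsigned int max_len)
{
	unsigned int len = rand_below(max_len + 1);

	switch (rand_below(4)) {
	case 0:
		len %= 64;
		break;
	case 1:
		len %= 256;
		break;
	case 2:
		len %= 1024;
		break;
	default:
		break;
	}
	/* About a quarter of the time, round down to a power of 2. */
	if (len != 0 && rand_below(4) == 0)
		len = rounddown_pow2(len);
	return len;
}
```

Rounding down (rather than up) keeps the result within max_len, so no extra clamping is needed in this sketch.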
Signed-off-by: Eric Biggers --- crypto/testmgr.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/crypto/testmgr.c b/crypto/testmgr.c index 00f5a6cf341a..2c57ebcaf368 100644 --- a/crypto/testmgr.c +++ b/crypto/testmgr.c @@ -901,18 +901,24 @@ static unsigned int generate_random_length(struct rnd_state *rng, { unsigned int len = prandom_u32_below(rng, max_len + 1); switch (prandom_u32_below(rng, 4)) { case 0: - return len % 64; + len %= 64; + break; case 1: - return len % 256; + len %= 256; + break; case 2: - return len % 1024; + len %= 1024; + break; default: - return len; + break; } + if (len && prandom_u32_below(rng, 4) == 0) + len = rounddown_pow_of_two(len); + return len; } /* Flip a random bit in the given nonempty data buffer */ static void flip_random_bit(struct rnd_state *rng, u8 *buf, size_t size) { @@ -1004,10 +1010,12 @@ static char *generate_random_sgl_divisions(struct rnd_state *rng, unsigned int this_len; const char *flushtype_str; if (div == &divs[max_divs - 1] || prandom_bool(rng)) this_len = remaining; + else if (prandom_u32_below(rng, 4) == 0) + this_len = (remaining + 1) / 2; else this_len = prandom_u32_inclusive(rng, 1, remaining); div->proportion_of_total = this_len; if (prandom_u32_below(rng, 4) == 0) From patchwork Tue May 7 00:23:39 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Eric Biggers X-Patchwork-Id: 795334 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 93A7D1876; Tue, 7 May 2024 00:25:17 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715041517; cv=none; 
b=IDqeTJw6u4WAfpRVy7jOkWG8SowQMgzlt4oMmJunzj1xNP83DYgFVuyVDxibgFZub7PJizw5x+AliffkJ/AGHreGdUGi62qMssurzNme2n7O0PVbhLbiUffT8Uuu7jeYP9bGXEQYxH//Taj5vYvV0jfkRi0NHlFQHEWbUMsCTeA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715041517; c=relaxed/simple; bh=3pSr3e+ErJA6w3pt9sSErT1XgU6cwvAsvRZcun31Ruo=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=kAxrEfJoQ91pQdbHoBQPBGcoo/rOu2FWC1UGykINcmGhQILOJht0+KZ21celkUKn9TZETwqYQuI/TA2NQ4OFlgjEiAqMIpENL7LLPn5fT5q3HFAFILYk+FCllk/qqSyrPESA/zLkbO+wUK7aJVSfByznem5Mz9LI4QmeqpmA8bA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=gwq6LeR3; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="gwq6LeR3" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 3D55BC4DDE1; Tue, 7 May 2024 00:25:17 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1715041517; bh=3pSr3e+ErJA6w3pt9sSErT1XgU6cwvAsvRZcun31Ruo=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=gwq6LeR3z9Czw9JfrcYi/waPZL1c9Z73Vb6r4qf5cxY/MqBOWk9Ne6mZnRycIDIyx iXgX43BJbxb8wArVgEs4Fpue5nYFO+5qvnKzubbTDpSMOp92MSUo5y9TILMHPsbVGA VAMUZYnt+JRGiTB0zlX70Kp7hAjO8kWGWQjpJ+LvKARUeH6FiySaitJYJK0bKuJd2l Kn3yqFCpld2Sb9o2YHlgXlyOCqlXjfxZAKPoyyvdPyeoVTLr6J8TXt3Iw5wbAej6f6 oRZXWqxO5Q7e91gRoY6hF5io3/fDGi+Z2Bci+Sm+6nlI4lluHkzEP2THYcJbSNtdfL a4v8kG5H82fRA== From: Eric Biggers To: linux-crypto@vger.kernel.org, fsverity@lists.linux.dev, dm-devel@lists.linux.dev Cc: x86@kernel.org, linux-arm-kernel@lists.infradead.org, Ard Biesheuvel , Sami Tolvanen , Bart Van Assche Subject: [PATCH v3 4/8] crypto: x86/sha256-ni - add support for finup_mb Date: Mon, 6 May 2024 17:23:39 -0700 Message-ID: <20240507002343.239552-5-ebiggers@kernel.org> X-Mailer: git-send-email 2.45.0 In-Reply-To: 
<20240507002343.239552-1-ebiggers@kernel.org> References: <20240507002343.239552-1-ebiggers@kernel.org> Precedence: bulk X-Mailing-List: linux-crypto@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Eric Biggers Add an implementation of finup_mb to sha256-ni, using an interleaving factor of 2. It interleaves a finup operation for two equal-length messages that share a common prefix. dm-verity and fs-verity will take advantage of this for greatly improved performance on capable CPUs. This increases the throughput of SHA-256 hashing 4096-byte messages by the following amounts on the following CPUs: AMD Zen 1: 84% AMD Zen 4: 98% Intel Ice Lake: 4% Intel Sapphire Rapids: 20% For now, this seems to benefit AMD much more than Intel. This seems to be because current AMD CPUs support concurrent execution of the SHA-NI instructions, but unfortunately current Intel CPUs don't, except for the sha256msg2 instruction. Hopefully future Intel CPUs will support SHA-NI on more execution ports. Zen 1 supports 2 concurrent sha256rnds2, and Zen 4 supports 4 concurrent sha256rnds2, which suggests that even better performance may be achievable on Zen 4 by interleaving more than two hashes; however, doing so poses a number of trade-offs. It's been reported that the method that achieves the highest SHA-256 throughput on Intel CPUs is actually computing 16 hashes simultaneously using AVX512. That method would be quite different to the SHA-NI method used in this patch. However, such a high interleaving factor isn't practical for the use cases being targeted in the kernel. 
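To illustrate the idea of finishing two equal-length messages from one shared starting state in a single interleaved loop, here is a toy sketch in plain C. It is not SHA-256 and not the code in this patch: 32-bit FNV-1a is swapped in as a deliberately simple stand-in hash, and fnv1a_update() and fnv1a_finup2x() are hypothetical names used only for this example. The point is just that the two dependency chains are independent, so a CPU with enough execution resources can overlap them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FNV_OFFSET 2166136261u	/* 32-bit FNV-1a offset basis */
#define FNV_PRIME  16777619u	/* 32-bit FNV-1a prime */

/* Toy hash update (32-bit FNV-1a); stands in for a real block function. */
static uint32_t fnv1a_update(uint32_t state, const uint8_t *data, size_t len)
{
	for (size_t i = 0; i < len; i++)
		state = (state ^ data[i]) * FNV_PRIME;
	return state;
}

/*
 * Finish two messages of the same length, both starting from the same
 * state (i.e. sharing a common prefix).  The two chains a and b never
 * depend on each other, so their steps can be interleaved freely.
 */
static void fnv1a_finup2x(uint32_t state,
			  const uint8_t *data1, const uint8_t *data2,
			  size_t len, uint32_t *out1, uint32_t *out2)
{
	uint32_t a = state, b = state;

	assert(out1 != NULL && out2 != NULL);
	for (size_t i = 0; i < len; i++) {
		a = (a ^ data1[i]) * FNV_PRIME;	/* step for message 1 */
		b = (b ^ data2[i]) * FNV_PRIME;	/* step for message 2 */
	}
	*out1 = a;
	*out2 = b;
}
```

In this toy, a caller would hash the shared prefix (for example a salt) once with fnv1a_update(), then pass the resulting state to fnv1a_finup2x() together with the two data buffers; each output matches what hashing that message on its own from the same state would give.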
Signed-off-by: Eric Biggers --- arch/x86/crypto/sha256_ni_asm.S | 368 ++++++++++++++++++++++++++++ arch/x86/crypto/sha256_ssse3_glue.c | 39 +++ 2 files changed, 407 insertions(+) diff --git a/arch/x86/crypto/sha256_ni_asm.S b/arch/x86/crypto/sha256_ni_asm.S index d515a55a3bc1..5e97922a24e4 100644 --- a/arch/x86/crypto/sha256_ni_asm.S +++ b/arch/x86/crypto/sha256_ni_asm.S @@ -172,10 +172,378 @@ SYM_TYPED_FUNC_START(sha256_ni_transform) .Ldone_hash: RET SYM_FUNC_END(sha256_ni_transform) +#undef DIGEST_PTR +#undef DATA_PTR +#undef NUM_BLKS +#undef SHA256CONSTANTS +#undef MSG +#undef STATE0 +#undef STATE1 +#undef MSG0 +#undef MSG1 +#undef MSG2 +#undef MSG3 +#undef TMP +#undef SHUF_MASK +#undef ABEF_SAVE +#undef CDGH_SAVE + +// parameters for __sha256_ni_finup2x() +#define SCTX %rdi +#define DATA1 %rsi +#define DATA2 %rdx +#define LEN %ecx +#define LEN8 %cl +#define LEN64 %rcx +#define OUT1 %r8 +#define OUT2 %r9 + +// other scalar variables +#define SHA256CONSTANTS %rax +#define COUNT %r10 +#define COUNT32 %r10d +#define FINAL_STEP %r11d + +// rbx is used as a temporary. + +#define MSG %xmm0 // sha256rnds2 implicit operand +#define STATE0_A %xmm1 +#define STATE1_A %xmm2 +#define STATE0_B %xmm3 +#define STATE1_B %xmm4 +#define TMP_A %xmm5 +#define TMP_B %xmm6 +#define MSG0_A %xmm7 +#define MSG1_A %xmm8 +#define MSG2_A %xmm9 +#define MSG3_A %xmm10 +#define MSG0_B %xmm11 +#define MSG1_B %xmm12 +#define MSG2_B %xmm13 +#define MSG3_B %xmm14 +#define SHUF_MASK %xmm15 + +#define OFFSETOF_STATE 0 // offsetof(struct sha256_state, state) +#define OFFSETOF_COUNT 32 // offsetof(struct sha256_state, count) +#define OFFSETOF_BUF 40 // offsetof(struct sha256_state, buf) + +// Do 4 rounds of SHA-256 for each of two messages (interleaved). m0_a and m0_b +// contain the current 4 message schedule words for the first and second message +// respectively. 
+// +// If not all the message schedule words have been computed yet, then this also +// computes 4 more message schedule words for each message. m1_a-m3_a contain +// the next 3 groups of 4 message schedule words for the first message, and +// likewise m1_b-m3_b for the second. After consuming the current value of +// m0_a, this macro computes the group after m3_a and writes it to m0_a, and +// likewise for *_b. This means that the next (m0_a, m1_a, m2_a, m3_a) is the +// current (m1_a, m2_a, m3_a, m0_a), and likewise for *_b, so the caller must +// cycle through the registers accordingly. +.macro do_4rounds_2x i, m0_a, m1_a, m2_a, m3_a, m0_b, m1_b, m2_b, m3_b + movdqa (\i-32)*4(SHA256CONSTANTS), TMP_A + movdqa TMP_A, TMP_B + paddd \m0_a, TMP_A + paddd \m0_b, TMP_B +.if \i < 48 + sha256msg1 \m1_a, \m0_a + sha256msg1 \m1_b, \m0_b +.endif + movdqa TMP_A, MSG + sha256rnds2 STATE0_A, STATE1_A + movdqa TMP_B, MSG + sha256rnds2 STATE0_B, STATE1_B + pshufd $0x0E, TMP_A, MSG + sha256rnds2 STATE1_A, STATE0_A + pshufd $0x0E, TMP_B, MSG + sha256rnds2 STATE1_B, STATE0_B +.if \i < 48 + movdqa \m3_a, TMP_A + movdqa \m3_b, TMP_B + palignr $4, \m2_a, TMP_A + palignr $4, \m2_b, TMP_B + paddd TMP_A, \m0_a + paddd TMP_B, \m0_b + sha256msg2 \m3_a, \m0_a + sha256msg2 \m3_b, \m0_b +.endif +.endm + +// +// void __sha256_ni_finup2x(const struct sha256_state *sctx, +// const u8 *data1, const u8 *data2, int len, +// u8 out1[SHA256_DIGEST_SIZE], +// u8 out2[SHA256_DIGEST_SIZE]); +// +// This function computes the SHA-256 digests of two messages |data1| and +// |data2| that are both |len| bytes long, starting from the initial state +// |sctx|. |len| must be at least SHA256_BLOCK_SIZE. +// +// The instructions for the two SHA-256 operations are interleaved. On many +// CPUs, this is almost twice as fast as hashing each message individually due +// to taking better advantage of the CPU's SHA-256 and SIMD throughput. 
+// +SYM_FUNC_START(__sha256_ni_finup2x) + // Allocate 128 bytes of stack space, 16-byte aligned. + push %rbx + push %rbp + mov %rsp, %rbp + sub $128, %rsp + and $~15, %rsp + + // Load the shuffle mask for swapping the endianness of 32-bit words. + movdqa PSHUFFLE_BYTE_FLIP_MASK(%rip), SHUF_MASK + + // Set up pointer to the round constants. + lea K256+32*4(%rip), SHA256CONSTANTS + + // Initially we're not processing the final blocks. + xor FINAL_STEP, FINAL_STEP + + // Load the initial state from sctx->state. + movdqu OFFSETOF_STATE+0*16(SCTX), STATE0_A // DCBA + movdqu OFFSETOF_STATE+1*16(SCTX), STATE1_A // HGFE + movdqa STATE0_A, TMP_A + punpcklqdq STATE1_A, STATE0_A // FEBA + punpckhqdq TMP_A, STATE1_A // DCHG + pshufd $0x1B, STATE0_A, STATE0_A // ABEF + pshufd $0xB1, STATE1_A, STATE1_A // CDGH + + // Load sctx->count. Take the mod 64 of it to get the number of bytes + // that are buffered in sctx->buf. Also save it in a register with LEN + // added to it. + mov LEN, LEN + mov OFFSETOF_COUNT(SCTX), %rbx + lea (%rbx, LEN64, 1), COUNT + and $63, %ebx + jz .Lfinup2x_enter_loop // No bytes buffered? + + // %ebx bytes (1 to 63) are currently buffered in sctx->buf. Load them + // followed by the first 64 - %ebx bytes of data. Since LEN >= 64, we + // just load 64 bytes from each of sctx->buf, DATA1, and DATA2 + // unconditionally and rearrange the data as needed. 
+ + movdqu OFFSETOF_BUF+0*16(SCTX), MSG0_A + movdqu OFFSETOF_BUF+1*16(SCTX), MSG1_A + movdqu OFFSETOF_BUF+2*16(SCTX), MSG2_A + movdqu OFFSETOF_BUF+3*16(SCTX), MSG3_A + movdqa MSG0_A, 0*16(%rsp) + movdqa MSG1_A, 1*16(%rsp) + movdqa MSG2_A, 2*16(%rsp) + movdqa MSG3_A, 3*16(%rsp) + + movdqu 0*16(DATA1), MSG0_A + movdqu 1*16(DATA1), MSG1_A + movdqu 2*16(DATA1), MSG2_A + movdqu 3*16(DATA1), MSG3_A + movdqu MSG0_A, 0*16(%rsp,%rbx) + movdqu MSG1_A, 1*16(%rsp,%rbx) + movdqu MSG2_A, 2*16(%rsp,%rbx) + movdqu MSG3_A, 3*16(%rsp,%rbx) + movdqa 0*16(%rsp), MSG0_A + movdqa 1*16(%rsp), MSG1_A + movdqa 2*16(%rsp), MSG2_A + movdqa 3*16(%rsp), MSG3_A + + movdqu 0*16(DATA2), MSG0_B + movdqu 1*16(DATA2), MSG1_B + movdqu 2*16(DATA2), MSG2_B + movdqu 3*16(DATA2), MSG3_B + movdqu MSG0_B, 0*16(%rsp,%rbx) + movdqu MSG1_B, 1*16(%rsp,%rbx) + movdqu MSG2_B, 2*16(%rsp,%rbx) + movdqu MSG3_B, 3*16(%rsp,%rbx) + movdqa 0*16(%rsp), MSG0_B + movdqa 1*16(%rsp), MSG1_B + movdqa 2*16(%rsp), MSG2_B + movdqa 3*16(%rsp), MSG3_B + + sub $64, %rbx // rbx = buffered - 64 + sub %rbx, DATA1 // DATA1 += 64 - buffered + sub %rbx, DATA2 // DATA2 += 64 - buffered + add %ebx, LEN // LEN += buffered - 64 + movdqa STATE0_A, STATE0_B + movdqa STATE1_A, STATE1_B + jmp .Lfinup2x_loop_have_data + +.Lfinup2x_enter_loop: + sub $64, LEN + movdqa STATE0_A, STATE0_B + movdqa STATE1_A, STATE1_B +.Lfinup2x_loop: + // Load the next two data blocks. + movdqu 0*16(DATA1), MSG0_A + movdqu 0*16(DATA2), MSG0_B + movdqu 1*16(DATA1), MSG1_A + movdqu 1*16(DATA2), MSG1_B + movdqu 2*16(DATA1), MSG2_A + movdqu 2*16(DATA2), MSG2_B + movdqu 3*16(DATA1), MSG3_A + movdqu 3*16(DATA2), MSG3_B + add $64, DATA1 + add $64, DATA2 +.Lfinup2x_loop_have_data: + // Convert the words of the data blocks from big endian. 
+ pshufb SHUF_MASK, MSG0_A + pshufb SHUF_MASK, MSG0_B + pshufb SHUF_MASK, MSG1_A + pshufb SHUF_MASK, MSG1_B + pshufb SHUF_MASK, MSG2_A + pshufb SHUF_MASK, MSG2_B + pshufb SHUF_MASK, MSG3_A + pshufb SHUF_MASK, MSG3_B +.Lfinup2x_loop_have_bswapped_data: + + // Save the original state for each block. + movdqa STATE0_A, 0*16(%rsp) + movdqa STATE0_B, 1*16(%rsp) + movdqa STATE1_A, 2*16(%rsp) + movdqa STATE1_B, 3*16(%rsp) + + // Do the SHA-256 rounds on each block. +.irp i, 0, 16, 32, 48 + do_4rounds_2x (\i + 0), MSG0_A, MSG1_A, MSG2_A, MSG3_A, \ + MSG0_B, MSG1_B, MSG2_B, MSG3_B + do_4rounds_2x (\i + 4), MSG1_A, MSG2_A, MSG3_A, MSG0_A, \ + MSG1_B, MSG2_B, MSG3_B, MSG0_B + do_4rounds_2x (\i + 8), MSG2_A, MSG3_A, MSG0_A, MSG1_A, \ + MSG2_B, MSG3_B, MSG0_B, MSG1_B + do_4rounds_2x (\i + 12), MSG3_A, MSG0_A, MSG1_A, MSG2_A, \ + MSG3_B, MSG0_B, MSG1_B, MSG2_B +.endr + + // Add the original state for each block. + paddd 0*16(%rsp), STATE0_A + paddd 1*16(%rsp), STATE0_B + paddd 2*16(%rsp), STATE1_A + paddd 3*16(%rsp), STATE1_B + + // Update LEN and loop back if more blocks remain. + sub $64, LEN + jge .Lfinup2x_loop + + // Check if any final blocks need to be handled. + // FINAL_STEP = 2: all done + // FINAL_STEP = 1: need to do count-only padding block + // FINAL_STEP = 0: need to do the block with 0x80 padding byte + cmp $1, FINAL_STEP + jg .Lfinup2x_done + je .Lfinup2x_finalize_countonly + add $64, LEN + jz .Lfinup2x_finalize_blockaligned + + // Not block-aligned; 1 <= LEN <= 63 data bytes remain. Pad the block. + // To do this, write the padding starting with the 0x80 byte to + // &sp[64]. Then for each message, copy the last 64 data bytes to sp + // and load from &sp[64 - LEN] to get the needed padding block. This + // code relies on the data buffers being >= 64 bytes in length. 
+ mov $64, %ebx + sub LEN, %ebx // ebx = 64 - LEN + sub %rbx, DATA1 // DATA1 -= 64 - LEN + sub %rbx, DATA2 // DATA2 -= 64 - LEN + mov $0x80, FINAL_STEP // using FINAL_STEP as a temporary + movd FINAL_STEP, MSG0_A + pxor MSG1_A, MSG1_A + movdqa MSG0_A, 4*16(%rsp) + movdqa MSG1_A, 5*16(%rsp) + movdqa MSG1_A, 6*16(%rsp) + movdqa MSG1_A, 7*16(%rsp) + cmp $56, LEN + jge 1f // will COUNT spill into its own block? + shl $3, COUNT + bswap COUNT + mov COUNT, 56(%rsp,%rbx) + mov $2, FINAL_STEP // won't need count-only block + jmp 2f +1: + mov $1, FINAL_STEP // will need count-only block +2: + movdqu 0*16(DATA1), MSG0_A + movdqu 1*16(DATA1), MSG1_A + movdqu 2*16(DATA1), MSG2_A + movdqu 3*16(DATA1), MSG3_A + movdqa MSG0_A, 0*16(%rsp) + movdqa MSG1_A, 1*16(%rsp) + movdqa MSG2_A, 2*16(%rsp) + movdqa MSG3_A, 3*16(%rsp) + movdqu 0*16(%rsp,%rbx), MSG0_A + movdqu 1*16(%rsp,%rbx), MSG1_A + movdqu 2*16(%rsp,%rbx), MSG2_A + movdqu 3*16(%rsp,%rbx), MSG3_A + + movdqu 0*16(DATA2), MSG0_B + movdqu 1*16(DATA2), MSG1_B + movdqu 2*16(DATA2), MSG2_B + movdqu 3*16(DATA2), MSG3_B + movdqa MSG0_B, 0*16(%rsp) + movdqa MSG1_B, 1*16(%rsp) + movdqa MSG2_B, 2*16(%rsp) + movdqa MSG3_B, 3*16(%rsp) + movdqu 0*16(%rsp,%rbx), MSG0_B + movdqu 1*16(%rsp,%rbx), MSG1_B + movdqu 2*16(%rsp,%rbx), MSG2_B + movdqu 3*16(%rsp,%rbx), MSG3_B + jmp .Lfinup2x_loop_have_data + + // Prepare a padding block, either: + // + // {0x80, 0, 0, 0, ..., count (as __be64)} + // This is for a block aligned message. + // + // { 0, 0, 0, 0, ..., count (as __be64)} + // This is for a message whose length mod 64 is >= 56. + // + // Pre-swap the endianness of the words. 
+.Lfinup2x_finalize_countonly: + pxor MSG0_A, MSG0_A + jmp 1f + +.Lfinup2x_finalize_blockaligned: + mov $0x80000000, %ebx + movd %ebx, MSG0_A +1: + pxor MSG1_A, MSG1_A + pxor MSG2_A, MSG2_A + ror $29, COUNT + movq COUNT, MSG3_A + pslldq $8, MSG3_A + movdqa MSG0_A, MSG0_B + pxor MSG1_B, MSG1_B + pxor MSG2_B, MSG2_B + movdqa MSG3_A, MSG3_B + mov $2, FINAL_STEP + jmp .Lfinup2x_loop_have_bswapped_data + +.Lfinup2x_done: + // Write the two digests with all bytes in the correct order. + movdqa STATE0_A, TMP_A + movdqa STATE0_B, TMP_B + punpcklqdq STATE1_A, STATE0_A // GHEF + punpcklqdq STATE1_B, STATE0_B + punpckhqdq TMP_A, STATE1_A // ABCD + punpckhqdq TMP_B, STATE1_B + pshufd $0xB1, STATE0_A, STATE0_A // HGFE + pshufd $0xB1, STATE0_B, STATE0_B + pshufd $0x1B, STATE1_A, STATE1_A // DCBA + pshufd $0x1B, STATE1_B, STATE1_B + pshufb SHUF_MASK, STATE0_A + pshufb SHUF_MASK, STATE0_B + pshufb SHUF_MASK, STATE1_A + pshufb SHUF_MASK, STATE1_B + movdqu STATE0_A, 1*16(OUT1) + movdqu STATE0_B, 1*16(OUT2) + movdqu STATE1_A, 0*16(OUT1) + movdqu STATE1_B, 0*16(OUT2) + + mov %rbp, %rsp + pop %rbp + pop %rbx + RET +SYM_FUNC_END(__sha256_ni_finup2x) + .section .rodata.cst256.K256, "aM", @progbits, 256 .align 64 K256: .long 0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5 .long 0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5 diff --git a/arch/x86/crypto/sha256_ssse3_glue.c b/arch/x86/crypto/sha256_ssse3_glue.c index e04a43d9f7d5..f5e6cc7afac7 100644 --- a/arch/x86/crypto/sha256_ssse3_glue.c +++ b/arch/x86/crypto/sha256_ssse3_glue.c @@ -331,10 +331,15 @@ static void unregister_sha256_avx2(void) #ifdef CONFIG_AS_SHA256_NI asmlinkage void sha256_ni_transform(struct sha256_state *digest, const u8 *data, int rounds); +asmlinkage void __sha256_ni_finup2x(const struct sha256_state *sctx, + const u8 *data1, const u8 *data2, int len, + u8 out1[SHA256_DIGEST_SIZE], + u8 out2[SHA256_DIGEST_SIZE]); + static int sha256_ni_update(struct shash_desc *desc, const u8 *data, unsigned int len) { return 
_sha256_update(desc, data, len, sha256_ni_transform); } @@ -355,18 +360,52 @@ static int sha256_ni_digest(struct shash_desc *desc, const u8 *data, { return sha256_base_init(desc) ?: sha256_ni_finup(desc, data, len, out); } +static int sha256_ni_finup_mb(struct shash_desc *desc, + const u8 * const data[], unsigned int len, + u8 * const outs[], unsigned int num_msgs) +{ + struct sha256_state *sctx = shash_desc_ctx(desc); + + /* + * num_msgs != 2 should not happen here, since this algorithm sets + * mb_max_msgs=2, and the crypto API handles num_msgs <= 1 before + * calling into the algorithm's finup_mb method. + */ + if (WARN_ON_ONCE(num_msgs != 2)) + return -EOPNOTSUPP; + + if (unlikely(!crypto_simd_usable())) + return -EOPNOTSUPP; + + /* __sha256_ni_finup2x() assumes SHA256_BLOCK_SIZE <= len <= INT_MAX. */ + if (unlikely(len < SHA256_BLOCK_SIZE || len > INT_MAX)) + return -EOPNOTSUPP; + + /* __sha256_ni_finup2x() assumes the following offsets. */ + BUILD_BUG_ON(offsetof(struct sha256_state, state) != 0); + BUILD_BUG_ON(offsetof(struct sha256_state, count) != 32); + BUILD_BUG_ON(offsetof(struct sha256_state, buf) != 40); + + kernel_fpu_begin(); + __sha256_ni_finup2x(sctx, data[0], data[1], len, outs[0], outs[1]); + kernel_fpu_end(); + return 0; +} + static struct shash_alg sha256_ni_algs[] = { { .digestsize = SHA256_DIGEST_SIZE, .init = sha256_base_init, .update = sha256_ni_update, .final = sha256_ni_final, .finup = sha256_ni_finup, .digest = sha256_ni_digest, + .finup_mb = sha256_ni_finup_mb, .descsize = sizeof(struct sha256_state), + .mb_max_msgs = 2, .base = { .cra_name = "sha256", .cra_driver_name = "sha256-ni", .cra_priority = 250, .cra_blocksize = SHA256_BLOCK_SIZE, From patchwork Tue May 7 00:23:41 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Eric Biggers X-Patchwork-Id: 795333 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 
with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 99E033C39; Tue, 7 May 2024 00:25:18 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715041518; cv=none; b=GXoxl5f+OlfckmzftMidvonDg1pjGwUiEtpxlk46axdTjUTvrsrpTjJAwNXvrcFaQeW2r+Of2lVj4gz5dug5L21LmhPmludNg9jDKLBAZ2fXa6tZbM45mronyjlgcwQfai3Unf0oiEcyDSrBitxfU4Lp1lsK5rjAZZEGo79BzK8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715041518; c=relaxed/simple; bh=TZLgF8hgmB4ShgK1PTPmmpVgUZNDlY044WPe7DEdI1U=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=iq4197fJR+cOBnIGzWGlgCQy+vyPjw9W+2A7xHKCOJiIxWMKZReF0EMHBIOP/UMJG7sK94EVqErZpxZqW4Vm382Efd8L1sMVo3LzfdyfL742bMdne2GUEdTDPP/7RdngYkKQyqVBnL4XNdFcqbFemjbj9tLQsgm8d3EJG8P7CUo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=AzNbri2O; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="AzNbri2O" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 05A14C4AF68; Tue, 7 May 2024 00:25:17 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1715041518; bh=TZLgF8hgmB4ShgK1PTPmmpVgUZNDlY044WPe7DEdI1U=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=AzNbri2OoJ+OJ085gg3j+mfKnAjH2r+MA/VtEXiOEij0jG0KXNNJIOeIvsEOBHlYN BUMgsc8Jr+kwde4WdZT5HbrNN63KVNNZJWhdNfu0UJw8oYAHK6v/kMJEgYQUoVnUIZ RooGdCN6JsKGCrL0UDtV/HTj4ScuNw7zmMcr3wS5977tTVG5sWy3+9Ii45k9AJXa6F YWB3PHk1vVxt5vTbI+h6juVR2QKGpqGLon4pPzptFCY1jB1Ub5bWqhqOVa3BuFiu+i Ai6jZaSIsiNZEVhpw1C0ePcwHXVtpwyUgFT2w0oCsyTRmXzE78Uutr9yHcNptRRpy2 81YxM4NahJPeQ== From: Eric Biggers To: linux-crypto@vger.kernel.org, 
fsverity@lists.linux.dev, dm-devel@lists.linux.dev Cc: x86@kernel.org, linux-arm-kernel@lists.infradead.org, Ard Biesheuvel , Sami Tolvanen , Bart Van Assche Subject: [PATCH v3 6/8] fsverity: improve performance by using multibuffer hashing Date: Mon, 6 May 2024 17:23:41 -0700 Message-ID: <20240507002343.239552-7-ebiggers@kernel.org> X-Mailer: git-send-email 2.45.0 In-Reply-To: <20240507002343.239552-1-ebiggers@kernel.org> References: <20240507002343.239552-1-ebiggers@kernel.org> Precedence: bulk X-Mailing-List: linux-crypto@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Eric Biggers When supported by the hash algorithm, use crypto_shash_finup_mb() to interleave the hashing of pairs of data blocks. On some CPUs this nearly doubles hashing performance. The increase in overall throughput of cold-cache fsverity reads that I'm seeing on arm64 and x86_64 is roughly 35% (though this metric is hard to measure as it jumps around a lot). For now this is only done on the verification path, and only for data blocks, not Merkle tree blocks. We could use finup_mb on Merkle tree blocks too, but that is less important as there aren't as many Merkle tree blocks as data blocks, and that would require some additional code restructuring. We could also use finup_mb to accelerate building the Merkle tree, but verification performance is more important. Signed-off-by: Eric Biggers --- fs/verity/fsverity_private.h | 5 + fs/verity/hash_algs.c | 32 ++++++- fs/verity/open.c | 6 ++ fs/verity/verify.c | 177 +++++++++++++++++++++++++++++------ 4 files changed, 187 insertions(+), 33 deletions(-) diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h index b3506f56e180..9fe1633f15d6 100644 --- a/fs/verity/fsverity_private.h +++ b/fs/verity/fsverity_private.h @@ -27,10 +27,11 @@ struct fsverity_hash_alg { /* * The HASH_ALGO_* constant for this algorithm. This is different from * FS_VERITY_HASH_ALG_*, which uses a different numbering scheme. 
*/ enum hash_algo algo_id; + bool supports_multibuffer; /* crypto_shash_mb_max_msgs(tfm) >= 2 */ }; /* Merkle tree parameters: hash algorithm, initial hash state, and topology */ struct merkle_tree_params { const struct fsverity_hash_alg *hash_alg; /* the hash algorithm */ @@ -65,10 +66,11 @@ struct merkle_tree_params { */ struct fsverity_info { struct merkle_tree_params tree_params; u8 root_hash[FS_VERITY_MAX_DIGEST_SIZE]; u8 file_digest[FS_VERITY_MAX_DIGEST_SIZE]; + u8 zero_block_hash[FS_VERITY_MAX_DIGEST_SIZE]; const struct inode *inode; unsigned long *hash_block_verified; }; #define FS_VERITY_MAX_SIGNATURE_SIZE (FS_VERITY_MAX_DESCRIPTOR_SIZE - \ @@ -82,10 +84,13 @@ const struct fsverity_hash_alg *fsverity_get_hash_alg(const struct inode *inode, unsigned int num); const u8 *fsverity_prepare_hash_state(const struct fsverity_hash_alg *alg, const u8 *salt, size_t salt_size); int fsverity_hash_block(const struct merkle_tree_params *params, const struct inode *inode, const void *data, u8 *out); +int fsverity_hash_2_blocks(const struct merkle_tree_params *params, + const struct inode *inode, const void *data1, + const void *data2, u8 *out1, u8 *out2); int fsverity_hash_buffer(const struct fsverity_hash_alg *alg, const void *data, size_t size, u8 *out); void __init fsverity_check_hash_algs(void); /* init.c */ diff --git a/fs/verity/hash_algs.c b/fs/verity/hash_algs.c index 6b08b1d9a7d7..71b6f74aaacd 100644 --- a/fs/verity/hash_algs.c +++ b/fs/verity/hash_algs.c @@ -82,12 +82,15 @@ const struct fsverity_hash_alg *fsverity_get_hash_alg(const struct inode *inode, if (WARN_ON_ONCE(alg->digest_size != crypto_shash_digestsize(tfm))) goto err_free_tfm; if (WARN_ON_ONCE(alg->block_size != crypto_shash_blocksize(tfm))) goto err_free_tfm; - pr_info("%s using implementation \"%s\"\n", - alg->name, crypto_shash_driver_name(tfm)); + alg->supports_multibuffer = (crypto_shash_mb_max_msgs(tfm) >= 2); + + pr_info("%s using implementation \"%s\"%s\n", + alg->name, 
crypto_shash_driver_name(tfm), + alg->supports_multibuffer ? " (multibuffer)" : ""); /* pairs with smp_load_acquire() above */ smp_store_release(&alg->tfm, tfm); goto out_unlock; @@ -195,10 +198,35 @@ int fsverity_hash_block(const struct merkle_tree_params *params, if (err) fsverity_err(inode, "Error %d computing block hash", err); return err; } +int fsverity_hash_2_blocks(const struct merkle_tree_params *params, + const struct inode *inode, const void *data1, + const void *data2, u8 *out1, u8 *out2) +{ + const u8 *data[2] = { data1, data2 }; + u8 *outs[2] = { out1, out2 }; + SHASH_DESC_ON_STACK(desc, params->hash_alg->tfm); + int err; + + desc->tfm = params->hash_alg->tfm; + + if (params->hashstate) + err = crypto_shash_import(desc, params->hashstate); + else + err = crypto_shash_init(desc); + if (err) { + fsverity_err(inode, "Error %d importing hash state", err); + return err; + } + err = crypto_shash_finup_mb(desc, data, params->block_size, outs, 2); + if (err) + fsverity_err(inode, "Error %d computing block hashes", err); + return err; +} + /** * fsverity_hash_buffer() - hash some data * @alg: the hash algorithm to use * @data: the data to hash * @size: size of data to hash, in bytes diff --git a/fs/verity/open.c b/fs/verity/open.c index fdeb95eca3af..4ae07c689c56 100644 --- a/fs/verity/open.c +++ b/fs/verity/open.c @@ -206,10 +206,16 @@ struct fsverity_info *fsverity_create_info(const struct inode *inode, if (err) { fsverity_err(inode, "Error %d computing file digest", err); goto fail; } + err = fsverity_hash_block(&vi->tree_params, inode, + page_address(ZERO_PAGE(0)), + vi->zero_block_hash); + if (err) + goto fail; + err = fsverity_verify_signature(vi, desc->signature, le32_to_cpu(desc->sig_size)); if (err) goto fail; diff --git a/fs/verity/verify.c b/fs/verity/verify.c index 4fcad0825a12..38f1eb3dbd8e 100644 --- a/fs/verity/verify.c +++ b/fs/verity/verify.c @@ -77,29 +77,33 @@ static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage, 
SetPageChecked(hpage); return false; } /* - * Verify a single data block against the file's Merkle tree. + * Verify the hash of a single data block against the file's Merkle tree. + * + * @real_dblock_hash specifies the hash of the data block, and @data_pos + * specifies the byte position of the data block within the file. * * In principle, we need to verify the entire path to the root node. However, * for efficiency the filesystem may cache the hash blocks. Therefore we need * only ascend the tree until an already-verified hash block is seen, and then * verify the path to that block. * * Return: %true if the data block is valid, else %false. */ static bool verify_data_block(struct inode *inode, struct fsverity_info *vi, - const void *data, u64 data_pos, unsigned long max_ra_pages) + const u8 *real_dblock_hash, u64 data_pos, + unsigned long max_ra_pages) { const struct merkle_tree_params *params = &vi->tree_params; const unsigned int hsize = params->digest_size; int level; u8 _want_hash[FS_VERITY_MAX_DIGEST_SIZE]; const u8 *want_hash; - u8 real_hash[FS_VERITY_MAX_DIGEST_SIZE]; + u8 real_hblock_hash[FS_VERITY_MAX_DIGEST_SIZE]; /* The hash blocks that are traversed, indexed by level */ struct { /* Page containing the hash block */ struct page *page; /* Mapped address of the hash block (will be within @page) */ @@ -125,11 +129,12 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi, * doesn't cover data blocks fully past EOF. But the entire * page spanning EOF can be visible to userspace via a mmap, and * any part past EOF should be all zeroes. Therefore, we need * to verify that any data blocks fully past EOF are all zeroes. */ - if (memchr_inv(data, 0, params->block_size)) { + if (memcmp(vi->zero_block_hash, real_dblock_hash, + hsize) != 0) { fsverity_err(inode, "FILE CORRUPTED! 
Data past EOF is not zeroed"); return false; } return true; @@ -200,13 +205,14 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi, struct page *hpage = hblocks[level - 1].page; const void *haddr = hblocks[level - 1].addr; unsigned long hblock_idx = hblocks[level - 1].index; unsigned int hoffset = hblocks[level - 1].hoffset; - if (fsverity_hash_block(params, inode, haddr, real_hash) != 0) + if (fsverity_hash_block(params, inode, haddr, + real_hblock_hash) != 0) goto error; - if (memcmp(want_hash, real_hash, hsize) != 0) + if (memcmp(want_hash, real_hblock_hash, hsize) != 0) goto corrupted; /* * Mark the hash block as verified. This must be atomic and * idempotent, as the same hash block might be verified by * multiple threads concurrently. @@ -219,55 +225,145 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi, want_hash = _want_hash; kunmap_local(haddr); put_page(hpage); } - /* Finally, verify the data block. */ - if (fsverity_hash_block(params, inode, data, real_hash) != 0) - goto error; - if (memcmp(want_hash, real_hash, hsize) != 0) + /* Finally, verify the hash of the data block. */ + if (memcmp(want_hash, real_dblock_hash, hsize) != 0) goto corrupted; return true; corrupted: fsverity_err(inode, "FILE CORRUPTED! pos=%llu, level=%d, want_hash=%s:%*phN, real_hash=%s:%*phN", data_pos, level - 1, params->hash_alg->name, hsize, want_hash, - params->hash_alg->name, hsize, real_hash); + params->hash_alg->name, hsize, + level == 0 ? real_dblock_hash : real_hblock_hash); error: for (; level > 0; level--) { kunmap_local(hblocks[level - 1].addr); put_page(hblocks[level - 1].page); } return false; } +struct fsverity_verification_context { + struct inode *inode; + struct fsverity_info *vi; + unsigned long max_ra_pages; + + /* + * pending_data and pending_pos are used when the selected hash + * algorithm supports multibuffer hashing. 
They're used to temporarily + * store the virtual address and position of a mapped data block that + * needs to be verified. If we then see another data block, we hash the + * two blocks simultaneously using the fast multibuffer hashing method. + */ + const void *pending_data; + u64 pending_pos; + + /* Buffers to temporarily store the calculated data block hashes */ + u8 hash1[FS_VERITY_MAX_DIGEST_SIZE]; + u8 hash2[FS_VERITY_MAX_DIGEST_SIZE]; +}; + +static inline void +fsverity_init_verification_context(struct fsverity_verification_context *ctx, + struct inode *inode, + unsigned long max_ra_pages) +{ + ctx->inode = inode; + ctx->vi = inode->i_verity_info; + ctx->max_ra_pages = max_ra_pages; + ctx->pending_data = NULL; +} + +static bool +fsverity_finish_verification(struct fsverity_verification_context *ctx) +{ + int err; + + if (ctx->pending_data == NULL) + return true; + /* + * Multibuffer hashing is enabled but there was an odd number of data + * blocks. Hash and verify the last block by itself. 
+ */ + err = fsverity_hash_block(&ctx->vi->tree_params, ctx->inode, + ctx->pending_data, ctx->hash1); + kunmap_local(ctx->pending_data); + ctx->pending_data = NULL; + return err == 0 && + verify_data_block(ctx->inode, ctx->vi, ctx->hash1, + ctx->pending_pos, ctx->max_ra_pages); +} + +static inline void +fsverity_abort_verification(struct fsverity_verification_context *ctx) +{ + if (ctx->pending_data) { + kunmap_local(ctx->pending_data); + ctx->pending_data = NULL; + } +} + static bool -verify_data_blocks(struct folio *data_folio, size_t len, size_t offset, - unsigned long max_ra_pages) +fsverity_add_data_blocks(struct fsverity_verification_context *ctx, + struct folio *data_folio, size_t len, size_t offset) { - struct inode *inode = data_folio->mapping->host; - struct fsverity_info *vi = inode->i_verity_info; - const unsigned int block_size = vi->tree_params.block_size; - u64 pos = (u64)data_folio->index << PAGE_SHIFT; + struct inode *inode = ctx->inode; + struct fsverity_info *vi = ctx->vi; + const struct merkle_tree_params *params = &vi->tree_params; + const unsigned int block_size = params->block_size; + const bool multibuffer = params->hash_alg->supports_multibuffer; + u64 pos = ((u64)data_folio->index << PAGE_SHIFT) + offset; if (WARN_ON_ONCE(len <= 0 || !IS_ALIGNED(len | offset, block_size))) return false; if (WARN_ON_ONCE(!folio_test_locked(data_folio) || folio_test_uptodate(data_folio))) return false; do { - void *data; - bool valid; - - data = kmap_local_folio(data_folio, offset); - valid = verify_data_block(inode, vi, data, pos + offset, - max_ra_pages); - kunmap_local(data); - if (!valid) - return false; + const void *data = kmap_local_folio(data_folio, offset); + int err; + + if (multibuffer) { + if (ctx->pending_data) { + /* Hash and verify two data blocks. 
*/ + err = fsverity_hash_2_blocks(params, + inode, + ctx->pending_data, + data, + ctx->hash1, + ctx->hash2); + kunmap_local(data); + kunmap_local(ctx->pending_data); + ctx->pending_data = NULL; + if (err != 0 || + !verify_data_block(inode, vi, ctx->hash1, + ctx->pending_pos, + ctx->max_ra_pages) || + !verify_data_block(inode, vi, ctx->hash2, + pos, ctx->max_ra_pages)) + return false; + } else { + /* Wait and see if there's another block. */ + ctx->pending_data = data; + ctx->pending_pos = pos; + } + } else { + /* Hash and verify one data block. */ + err = fsverity_hash_block(params, inode, data, + ctx->hash1); + kunmap_local(data); + if (err != 0 || + !verify_data_block(inode, vi, ctx->hash1, + pos, ctx->max_ra_pages)) + return false; + } + pos += block_size; offset += block_size; len -= block_size; } while (len); return true; } @@ -284,11 +380,19 @@ verify_data_blocks(struct folio *data_folio, size_t len, size_t offset, * * Return: %true if the data is valid, else %false. */ bool fsverity_verify_blocks(struct folio *folio, size_t len, size_t offset) { - return verify_data_blocks(folio, len, offset, 0); + struct fsverity_verification_context ctx; + + fsverity_init_verification_context(&ctx, folio->mapping->host, 0); + + if (!fsverity_add_data_blocks(&ctx, folio, len, offset)) { + fsverity_abort_verification(&ctx); + return false; + } + return fsverity_finish_verification(&ctx); } EXPORT_SYMBOL_GPL(fsverity_verify_blocks); #ifdef CONFIG_BLOCK /** @@ -305,10 +409,12 @@ EXPORT_SYMBOL_GPL(fsverity_verify_blocks); * filesystems) must instead call fsverity_verify_page() directly on each page. * All filesystems must also call fsverity_verify_page() on holes. 
*/ void fsverity_verify_bio(struct bio *bio) { + struct inode *inode = bio_first_folio_all(bio)->mapping->host; + struct fsverity_verification_context ctx; struct folio_iter fi; unsigned long max_ra_pages = 0; if (bio->bi_opf & REQ_RAHEAD) { /* @@ -321,17 +427,26 @@ void fsverity_verify_bio(struct bio *bio) * reduces the number of I/O requests made to the Merkle tree. */ max_ra_pages = bio->bi_iter.bi_size >> (PAGE_SHIFT + 2); } + fsverity_init_verification_context(&ctx, inode, max_ra_pages); + bio_for_each_folio_all(fi, bio) { - if (!verify_data_blocks(fi.folio, fi.length, fi.offset, - max_ra_pages)) { - bio->bi_status = BLK_STS_IOERR; - break; + if (!fsverity_add_data_blocks(&ctx, fi.folio, fi.length, + fi.offset)) { + fsverity_abort_verification(&ctx); + goto ioerr; } } + + if (!fsverity_finish_verification(&ctx)) + goto ioerr; + return; + +ioerr: + bio->bi_status = BLK_STS_IOERR; } EXPORT_SYMBOL_GPL(fsverity_verify_bio); #endif /* CONFIG_BLOCK */ /** From patchwork Tue May 7 00:23:43 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Eric Biggers X-Patchwork-Id: 795332
From: Eric Biggers To: linux-crypto@vger.kernel.org, fsverity@lists.linux.dev, dm-devel@lists.linux.dev Cc: x86@kernel.org, linux-arm-kernel@lists.infradead.org, Ard Biesheuvel , Sami Tolvanen , Bart Van Assche Subject: [PATCH v3 8/8] dm-verity: improve performance by using multibuffer hashing Date: Mon, 6 May 2024 17:23:43 -0700 Message-ID: <20240507002343.239552-9-ebiggers@kernel.org> X-Mailer: git-send-email 2.45.0 In-Reply-To: <20240507002343.239552-1-ebiggers@kernel.org> References: <20240507002343.239552-1-ebiggers@kernel.org> Precedence: bulk X-Mailing-List: linux-crypto@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Eric Biggers When supported by the hash algorithm, use crypto_shash_finup_mb() to interleave the hashing of
pairs of data blocks. On some CPUs this nearly doubles hashing performance. The increase in overall throughput of cold-cache dm-verity reads that I'm seeing on arm64 and x86_64 is roughly 35% (though this metric is hard to measure as it jumps around a lot). For now this is only done on data blocks, not Merkle tree blocks. We could use finup_mb on Merkle tree blocks too, but that is less important as there aren't as many Merkle tree blocks as data blocks, and that would require some additional code restructuring. Signed-off-by: Eric Biggers --- drivers/md/dm-verity-fec.c | 24 +-- drivers/md/dm-verity-fec.h | 7 +- drivers/md/dm-verity-target.c | 363 +++++++++++++++++++++++----------- drivers/md/dm-verity.h | 28 +-- 4 files changed, 264 insertions(+), 158 deletions(-) diff --git a/drivers/md/dm-verity-fec.c b/drivers/md/dm-verity-fec.c index b436b8e4d750..c1677137a682 100644 --- a/drivers/md/dm-verity-fec.c +++ b/drivers/md/dm-verity-fec.c @@ -184,18 +184,18 @@ static int fec_decode_bufs(struct dm_verity *v, struct dm_verity_io *io, * Locate data block erasures using verity hashes. */ static int fec_is_erasure(struct dm_verity *v, struct dm_verity_io *io, u8 *want_digest, u8 *data) { + u8 real_digest[HASH_MAX_DIGESTSIZE]; + if (unlikely(verity_compute_hash_virt(v, io, data, 1 << v->data_dev_block_bits, - verity_io_real_digest(v, io), - true))) + real_digest, true))) return 0; - return memcmp(verity_io_real_digest(v, io), want_digest, - v->digest_size) != 0; + return memcmp(real_digest, want_digest, v->digest_size) != 0; } /* * Read data blocks that are part of the RS block and deinterleave as much as * fits into buffers. Check for erasure locations if @neras is non-NULL. @@ -362,14 +362,15 @@ static void fec_init_bufs(struct dm_verity *v, struct dm_verity_fec_io *fio) * (indicated by @offset) in fio->output. If @use_erasures is non-zero, uses * hashes to locate erasures. 
*/ static int fec_decode_rsb(struct dm_verity *v, struct dm_verity_io *io, struct dm_verity_fec_io *fio, u64 rsb, u64 offset, - bool use_erasures) + const u8 *want_digest, bool use_erasures) { int r, neras = 0; unsigned int pos; + u8 real_digest[HASH_MAX_DIGESTSIZE]; r = fec_alloc_bufs(v, fio); if (unlikely(r < 0)) return r; @@ -389,16 +390,15 @@ static int fec_decode_rsb(struct dm_verity *v, struct dm_verity_io *io, } /* Always re-validate the corrected block against the expected hash */ r = verity_compute_hash_virt(v, io, fio->output, 1 << v->data_dev_block_bits, - verity_io_real_digest(v, io), true); + real_digest, true); if (unlikely(r < 0)) return r; - if (memcmp(verity_io_real_digest(v, io), verity_io_want_digest(v, io), - v->digest_size)) { + if (memcmp(real_digest, want_digest, v->digest_size)) { DMERR_LIMIT("%s: FEC %llu: failed to correct (%d erasures)", v->data_dev->name, (unsigned long long)rsb, neras); return -EILSEQ; } @@ -419,12 +419,12 @@ static int fec_bv_copy(struct dm_verity *v, struct dm_verity_io *io, u8 *data, /* * Correct errors in a block. Copies corrected block to dest if non-NULL, * otherwise to a bio_vec starting from iter. */ int verity_fec_decode(struct dm_verity *v, struct dm_verity_io *io, - enum verity_block_type type, sector_t block, u8 *dest, - struct bvec_iter *iter) + enum verity_block_type type, sector_t block, + const u8 *want_digest, u8 *dest, struct bvec_iter *iter) { int r; struct dm_verity_fec_io *fio = fec_io(io); u64 offset, res, rsb; @@ -463,13 +463,13 @@ int verity_fec_decode(struct dm_verity *v, struct dm_verity_io *io, /* * Locating erasures is slow, so attempt to recover the block without * them first. Do a second attempt with erasures if the corruption is * bad enough. 
*/ - r = fec_decode_rsb(v, io, fio, rsb, offset, false); + r = fec_decode_rsb(v, io, fio, rsb, offset, want_digest, false); if (r < 0) { - r = fec_decode_rsb(v, io, fio, rsb, offset, true); + r = fec_decode_rsb(v, io, fio, rsb, offset, want_digest, true); if (r < 0) goto done; } if (dest) diff --git a/drivers/md/dm-verity-fec.h b/drivers/md/dm-verity-fec.h index 8454070d2824..57c3f674cae9 100644 --- a/drivers/md/dm-verity-fec.h +++ b/drivers/md/dm-verity-fec.h @@ -68,11 +68,12 @@ struct dm_verity_fec_io { extern bool verity_fec_is_enabled(struct dm_verity *v); extern int verity_fec_decode(struct dm_verity *v, struct dm_verity_io *io, enum verity_block_type type, sector_t block, - u8 *dest, struct bvec_iter *iter); + const u8 *want_digest, u8 *dest, + struct bvec_iter *iter); extern unsigned int verity_fec_status_table(struct dm_verity *v, unsigned int sz, char *result, unsigned int maxlen); extern void verity_fec_finish_io(struct dm_verity_io *io); @@ -97,12 +98,12 @@ static inline bool verity_fec_is_enabled(struct dm_verity *v) return false; } static inline int verity_fec_decode(struct dm_verity *v, struct dm_verity_io *io, - enum verity_block_type type, - sector_t block, u8 *dest, + enum verity_block_type type, sector_t block, + const u8 *want_digest, u8 *dest, struct bvec_iter *iter) { return -EOPNOTSUPP; } diff --git a/drivers/md/dm-verity-target.c b/drivers/md/dm-verity-target.c index 2dd15f5e91b7..d367198aefe7 100644 --- a/drivers/md/dm-verity-target.c +++ b/drivers/md/dm-verity-target.c @@ -300,16 +300,16 @@ static int verity_handle_err(struct dm_verity *v, enum verity_block_type type, /* * Verify hash of a metadata block pertaining to the specified data block * ("block" argument) at a specified level ("level" argument). * - * On successful return, verity_io_want_digest(v, io) contains the hash value - * for a lower tree level or for the data block (if we're at the lowest level). 
+ * On successful return, want_digest contains the hash value for a lower tree + * level or for the data block (if we're at the lowest level). * * If "skip_unverified" is true, unverified buffer is skipped and 1 is returned. * If "skip_unverified" is false, unverified buffer is hashed and verified - * against current value of verity_io_want_digest(v, io). + * against current value of want_digest. */ static int verity_verify_level(struct dm_verity *v, struct dm_verity_io *io, sector_t block, int level, bool skip_unverified, u8 *want_digest) { @@ -318,10 +318,11 @@ static int verity_verify_level(struct dm_verity *v, struct dm_verity_io *io, u8 *data; int r; sector_t hash_block; unsigned int offset; struct bio *bio = dm_bio_from_per_bio_data(io, v->ti->per_io_data_size); + u8 real_digest[HASH_MAX_DIGESTSIZE]; verity_hash_at_level(v, block, level, &hash_block, &offset); if (static_branch_unlikely(&use_bh_wq_enabled) && io->in_bh) { data = dm_bufio_get(v->bufio, hash_block, &buf); @@ -349,27 +350,26 @@ static int verity_verify_level(struct dm_verity *v, struct dm_verity_io *io, goto release_ret_r; } r = verity_compute_hash_virt(v, io, data, 1 << v->hash_dev_block_bits, - verity_io_real_digest(v, io), - !io->in_bh); + real_digest, !io->in_bh); if (unlikely(r < 0)) goto release_ret_r; - if (likely(memcmp(verity_io_real_digest(v, io), want_digest, - v->digest_size) == 0)) + if (likely(!memcmp(real_digest, want_digest, v->digest_size))) aux->hash_verified = 1; else if (static_branch_unlikely(&use_bh_wq_enabled) && io->in_bh) { /* * Error handling code (FEC included) cannot be run in a * tasklet since it may sleep, so fallback to work-queue. 
*/ r = -EAGAIN; goto release_ret_r; } else if (verity_fec_decode(v, io, DM_VERITY_BLOCK_TYPE_METADATA, - hash_block, data, NULL) == 0) + hash_block, want_digest, + data, NULL) == 0) aux->hash_verified = 1; else if (verity_handle_err(v, DM_VERITY_BLOCK_TYPE_METADATA, hash_block)) { struct bio *bio = @@ -473,71 +473,10 @@ static int verity_ahash_update_block(struct dm_verity *v, } while (todo); return 0; } -static int verity_compute_hash(struct dm_verity *v, struct dm_verity_io *io, - struct bvec_iter *iter, u8 *digest, - bool may_sleep) -{ - int r; - - if (static_branch_unlikely(&ahash_enabled) && !v->shash_tfm) { - struct ahash_request *req = verity_io_hash_req(v, io); - struct crypto_wait wait; - - r = verity_ahash_init(v, req, &wait, may_sleep); - if (unlikely(r)) - goto error; - - r = verity_ahash_update_block(v, io, iter, &wait); - if (unlikely(r)) - goto error; - - r = verity_ahash_final(v, req, digest, &wait); - if (unlikely(r)) - goto error; - } else { - struct shash_desc *desc = verity_io_hash_req(v, io); - struct bio *bio = - dm_bio_from_per_bio_data(io, v->ti->per_io_data_size); - struct bio_vec bv = bio_iter_iovec(bio, *iter); - const unsigned int len = 1 << v->data_dev_block_bits; - const void *virt; - - if (unlikely(len > bv.bv_len)) { - /* - * Data block spans pages. This should not happen, - * since this code path is not used if the data block - * size is greater than the page size, and all I/O - * should be data block aligned because dm-verity sets - * logical_block_size to the data block size. 
- */ - DMERR_LIMIT("unaligned io (data block spans pages)"); - return -EIO; - } - - desc->tfm = v->shash_tfm; - r = crypto_shash_import(desc, v->initial_hashstate); - if (unlikely(r)) - goto error; - - virt = bvec_kmap_local(&bv); - r = crypto_shash_finup(desc, virt, len, digest); - kunmap_local(virt); - if (unlikely(r)) - goto error; - - bio_advance_iter(bio, iter, len); - } - return 0; - -error: - DMERR("Error hashing block from bio iter: %d", r); - return r; -} - /* * Calls function process for 1 << v->data_dev_block_bits bytes in the bio_vec * starting from iter. */ int verity_for_bv_block(struct dm_verity *v, struct dm_verity_io *io, @@ -581,41 +520,42 @@ static int verity_recheck_copy(struct dm_verity *v, struct dm_verity_io *io, io->recheck_buffer += len; return 0; } -static noinline int verity_recheck(struct dm_verity *v, struct dm_verity_io *io, - struct bvec_iter start, sector_t cur_block) +static int verity_recheck(struct dm_verity *v, struct dm_verity_io *io, + struct bvec_iter start, sector_t blkno, + const u8 *want_digest) { struct page *page; void *buffer; int r; struct dm_io_request io_req; struct dm_io_region io_loc; + u8 real_digest[HASH_MAX_DIGESTSIZE]; page = mempool_alloc(&v->recheck_pool, GFP_NOIO); buffer = page_to_virt(page); io_req.bi_opf = REQ_OP_READ; io_req.mem.type = DM_IO_KMEM; io_req.mem.ptr.addr = buffer; io_req.notify.fn = NULL; io_req.client = v->io; io_loc.bdev = v->data_dev->bdev; - io_loc.sector = cur_block << (v->data_dev_block_bits - SECTOR_SHIFT); + io_loc.sector = blkno << (v->data_dev_block_bits - SECTOR_SHIFT); io_loc.count = 1 << (v->data_dev_block_bits - SECTOR_SHIFT); r = dm_io(&io_req, 1, &io_loc, NULL, IOPRIO_DEFAULT); if (unlikely(r)) goto free_ret; r = verity_compute_hash_virt(v, io, buffer, 1 << v->data_dev_block_bits, - verity_io_real_digest(v, io), true); + real_digest, true); if (unlikely(r)) goto free_ret; - if (memcmp(verity_io_real_digest(v, io), - verity_io_want_digest(v, io), v->digest_size)) { + if 
(memcmp(real_digest, want_digest, v->digest_size)) { r = -EIO; goto free_ret; } io->recheck_buffer = buffer; @@ -647,22 +587,84 @@ static inline void verity_bv_skip_block(struct dm_verity *v, struct bio *bio = dm_bio_from_per_bio_data(io, v->ti->per_io_data_size); bio_advance_iter(bio, iter, 1 << v->data_dev_block_bits); } +static noinline int +__verity_handle_data_hash_mismatch(struct dm_verity *v, struct dm_verity_io *io, + struct bio *bio, struct bvec_iter *start, + sector_t blkno, const u8 *want_digest) +{ + if (static_branch_unlikely(&use_bh_wq_enabled) && io->in_bh) { + /* + * Error handling code (FEC included) cannot be run in the + * BH workqueue, so fallback to a standard workqueue. + */ + return -EAGAIN; + } + if (verity_recheck(v, io, *start, blkno, want_digest) == 0) { + if (v->validated_blocks) + set_bit(blkno, v->validated_blocks); + return 0; + } +#if defined(CONFIG_DM_VERITY_FEC) + if (verity_fec_decode(v, io, DM_VERITY_BLOCK_TYPE_DATA, blkno, + want_digest, NULL, start) == 0) + return 0; +#endif + if (bio->bi_status) + return -EIO; /* Error correction failed; Just return error */ + + if (verity_handle_err(v, DM_VERITY_BLOCK_TYPE_DATA, blkno)) { + dm_audit_log_bio(DM_MSG_PREFIX, "verify-data", bio, blkno, 0); + return -EIO; + } + return 0; +} + +static __always_inline int +verity_check_data_block_hash(struct dm_verity *v, struct dm_verity_io *io, + struct bio *bio, struct bvec_iter *start, + sector_t blkno, + const u8 *real_digest, const u8 *want_digest) +{ + if (likely(memcmp(real_digest, want_digest, v->digest_size) == 0)) { + if (v->validated_blocks) + set_bit(blkno, v->validated_blocks); + return 0; + } + return __verity_handle_data_hash_mismatch(v, io, bio, start, blkno, + want_digest); +} + /* * Verify one "dm_verity_io" structure. 
*/ static int verity_verify_io(struct dm_verity_io *io) { - bool is_zero; struct dm_verity *v = io->v; + const unsigned int block_size = 1 << v->data_dev_block_bits; + struct bio *bio = dm_bio_from_per_bio_data(io, v->ti->per_io_data_size); + u8 want_digest[HASH_MAX_DIGESTSIZE]; + u8 real_digest[HASH_MAX_DIGESTSIZE]; struct bvec_iter start; struct bvec_iter iter_copy; struct bvec_iter *iter; - struct bio *bio = dm_bio_from_per_bio_data(io, v->ti->per_io_data_size); + /* + * The pending_* variables are used when the selected hash algorithm + * supports multibuffer hashing. They're used to temporarily store the + * virtual address and position of a mapped data block that needs to be + * verified. If we then see another data block, we hash the two blocks + * simultaneously using the fast multibuffer hashing method. + */ + const void *pending_data = NULL; + sector_t pending_blkno; + struct bvec_iter pending_start; + u8 pending_want_digest[HASH_MAX_DIGESTSIZE]; + u8 pending_real_digest[HASH_MAX_DIGESTSIZE]; unsigned int b; + int r; if (static_branch_unlikely(&use_bh_wq_enabled) && io->in_bh) { /* * Copy the iterator in case we need to restart * verification in a work-queue. @@ -671,82 +673,177 @@ static int verity_verify_io(struct dm_verity_io *io) iter = &iter_copy; } else iter = &io->iter; for (b = 0; b < io->n_blocks; b++) { - int r; - sector_t cur_block = io->block + b; + sector_t blkno = io->block + b; + bool is_zero; if (v->validated_blocks && bio->bi_status == BLK_STS_OK && - likely(test_bit(cur_block, v->validated_blocks))) { + likely(test_bit(blkno, v->validated_blocks))) { verity_bv_skip_block(v, io, iter); continue; } - r = verity_hash_for_block(v, io, cur_block, - verity_io_want_digest(v, io), - &is_zero); + r = verity_hash_for_block(v, io, blkno, want_digest, &is_zero); if (unlikely(r < 0)) - return r; + goto error; if (is_zero) { /* * If we expect a zero block, don't validate, just * return zeros. 
*/ r = verity_for_bv_block(v, io, iter, verity_bv_zero); if (unlikely(r < 0)) - return r; + goto error; continue; } start = *iter; - r = verity_compute_hash(v, io, iter, - verity_io_real_digest(v, io), - !io->in_bh); - if (unlikely(r < 0)) - return r; - - if (likely(memcmp(verity_io_real_digest(v, io), - verity_io_want_digest(v, io), v->digest_size) == 0)) { - if (v->validated_blocks) - set_bit(cur_block, v->validated_blocks); - continue; - } else if (static_branch_unlikely(&use_bh_wq_enabled) && io->in_bh) { - /* - * Error handling code (FEC included) cannot be run in a - * tasklet since it may sleep, so fallback to work-queue. - */ - return -EAGAIN; - } else if (verity_recheck(v, io, start, cur_block) == 0) { - if (v->validated_blocks) - set_bit(cur_block, v->validated_blocks); - continue; -#if defined(CONFIG_DM_VERITY_FEC) - } else if (verity_fec_decode(v, io, DM_VERITY_BLOCK_TYPE_DATA, - cur_block, NULL, &start) == 0) { - continue; -#endif + if (static_branch_unlikely(&ahash_enabled) && !v->shash_tfm) { + /* Hash and verify one data block using ahash. */ + struct ahash_request *req = verity_io_hash_req(v, io); + struct crypto_wait wait; + + r = verity_ahash_init(v, req, &wait, !io->in_bh); + if (unlikely(r)) + goto hash_error; + + r = verity_ahash_update_block(v, io, iter, &wait); + if (unlikely(r)) + goto hash_error; + + r = verity_ahash_final(v, req, real_digest, &wait); + if (unlikely(r)) + goto hash_error; + + r = verity_check_data_block_hash(v, io, bio, &start, + blkno, real_digest, + want_digest); + if (unlikely(r)) + goto error; } else { - if (bio->bi_status) { + struct shash_desc *desc = verity_io_hash_req(v, io); + struct bio_vec bv = bio_iter_iovec(bio, *iter); + const void *data; + + if (unlikely(bv.bv_len < block_size)) { /* - * Error correction failed; Just return error + * Data block spans pages. 
This should not + * happen, since this code path is not used if + * the data block size is greater than the page + * size, and all I/O should be data block + * aligned because dm-verity sets + * logical_block_size to the data block size. */ - return -EIO; + DMERR_LIMIT("unaligned io (data block spans pages)"); + r = -EIO; + goto error; } - if (verity_handle_err(v, DM_VERITY_BLOCK_TYPE_DATA, - cur_block)) { - dm_audit_log_bio(DM_MSG_PREFIX, "verify-data", - bio, cur_block, 0); - return -EIO; + + data = bvec_kmap_local(&bv); + + if (v->use_finup_mb) { + if (pending_data) { + const u8 *datas[2] = { pending_data, + data }; + u8 *outs[2] = { pending_real_digest, + real_digest }; + /* Hash and verify two data blocks. */ + desc->tfm = v->shash_tfm; + r = crypto_shash_import(desc, + v->initial_hashstate) ?: + crypto_shash_finup_mb(desc, + datas, + block_size, + outs, + 2); + kunmap_local(data); + kunmap_local(pending_data); + pending_data = NULL; + if (unlikely(r)) + goto hash_error; + r = verity_check_data_block_hash( + v, io, bio, + &pending_start, + pending_blkno, + pending_real_digest, + pending_want_digest); + if (unlikely(r)) + goto error; + r = verity_check_data_block_hash( + v, io, bio, + &start, + blkno, + real_digest, + want_digest); + if (unlikely(r)) + goto error; + } else { + /* Wait and see if there's another block. */ + pending_data = data; + pending_blkno = blkno; + pending_start = start; + memcpy(pending_want_digest, want_digest, + v->digest_size); + } + } else { + /* Hash and verify one data block. 
*/ + desc->tfm = v->shash_tfm; + r = crypto_shash_import(desc, + v->initial_hashstate) ?: + crypto_shash_finup(desc, data, block_size, + real_digest); + kunmap_local(data); + if (unlikely(r)) + goto hash_error; + r = verity_check_data_block_hash( + v, io, bio, &start, blkno, + real_digest, want_digest); + if (unlikely(r)) + goto error; } + + bio_advance_iter(bio, iter, block_size); } } + if (pending_data) { + /* + * Multibuffer hashing is enabled but there was an odd number of + * data blocks. Hash and verify the last block by itself. + */ + struct shash_desc *desc = verity_io_hash_req(v, io); + + desc->tfm = v->shash_tfm; + r = crypto_shash_import(desc, v->initial_hashstate) ?: + crypto_shash_finup(desc, pending_data, block_size, + pending_real_digest); + kunmap_local(pending_data); + pending_data = NULL; + if (unlikely(r)) + goto hash_error; + r = verity_check_data_block_hash(v, io, bio, + &pending_start, + pending_blkno, + pending_real_digest, + pending_want_digest); + if (unlikely(r)) + goto error; + } + return 0; + +hash_error: + DMERR("Error hashing block from bio iter: %d", r); +error: + if (pending_data) + kunmap_local(pending_data); + return r; } /* * Skip verity work in response to I/O error when system is shutting down. */ @@ -1321,10 +1418,34 @@ static int verity_setup_hash_alg(struct dm_verity *v, const char *alg_name) if (!v->alg_name) { ti->error = "Cannot allocate algorithm name"; return -ENOMEM; } + /* + * Allocate the hash transformation object that this dm-verity instance + * will use. We have a choice of two APIs: shash and ahash. Most + * dm-verity users use CPU-based hashing, and for this shash is optimal + * since it matches the underlying algorithm implementations and also + * allows the use of fast multibuffer hashing (crypto_shash_finup_mb()). + * ahash adds support for off-CPU hash offloading. It also provides + * access to shash algorithms, but does so less efficiently. 
+ * + * Meanwhile, hashing a block in dm-verity in general requires an + * init+update+final sequence with multiple updates. However, usually + * the salt is prepended to the block rather than appended, and the data + * block size is not greater than the page size. In this very common + * case, the sequence can be optimized to import+finup, where the first + * step imports the pre-computed state after init+update(salt). This + * can reduce the crypto API overhead significantly. + * + * To provide optimal performance for the vast majority of dm-verity + * users while still supporting off-CPU hash offloading and the rarer + * dm-verity settings, we therefore have two code paths: one using shash + * where we use import+finup or import+finup_mb, and one using ahash + * where we use init+update(s)+final. We use the former code path when + * it's possible to use and shash gives the same algorithm as ahash. + */ ahash = crypto_alloc_ahash(alg_name, 0, v->use_bh_wq ? CRYPTO_ALG_ASYNC : 0); if (IS_ERR(ahash)) { ti->error = "Cannot initialize hash function"; return PTR_ERR(ahash); @@ -1345,14 +1466,16 @@ static int verity_setup_hash_alg(struct dm_verity *v, const char *alg_name) } if (!IS_ERR_OR_NULL(shash)) { crypto_free_ahash(ahash); ahash = NULL; v->shash_tfm = shash; + v->use_finup_mb = crypto_shash_mb_max_msgs(shash); v->digest_size = crypto_shash_digestsize(shash); v->hash_reqsize = sizeof(struct shash_desc) + crypto_shash_descsize(shash); - DMINFO("%s using shash \"%s\"", alg_name, driver_name); + DMINFO("%s using shash \"%s\"%s", alg_name, driver_name, + v->use_finup_mb ? 
" (multibuffer)" : ""); } else { v->ahash_tfm = ahash; static_branch_inc(&ahash_enabled); v->digest_size = crypto_ahash_digestsize(ahash); v->hash_reqsize = sizeof(struct ahash_request) + diff --git a/drivers/md/dm-verity.h b/drivers/md/dm-verity.h index 15ffb0881cc9..d5f659e1bdef 100644 --- a/drivers/md/dm-verity.h +++ b/drivers/md/dm-verity.h @@ -55,10 +55,11 @@ struct dm_verity { unsigned char hash_per_block_bits; /* log2(hashes in hash block) */ unsigned char levels; /* the number of tree levels */ unsigned char version; bool hash_failed:1; /* set if hash of any block failed */ bool use_bh_wq:1; /* try to verify in BH wq before normal work-queue */ + bool use_finup_mb:1; /* use crypto_shash_finup_mb() */ unsigned int digest_size; /* digest size for the current hash algorithm */ unsigned int hash_reqsize; /* the size of temporary space for crypto */ enum verity_mode mode; /* mode for handling verification errors */ unsigned int corrupted_errs;/* Number of errors for corrupted blocks */ @@ -92,42 +93,23 @@ struct dm_verity_io { struct work_struct bh_work; char *recheck_buffer; /* - * Three variably-size fields follow this struct: - * - * u8 hash_req[v->hash_reqsize]; - * u8 real_digest[v->digest_size]; - * u8 want_digest[v->digest_size]; - * - * To access them use: verity_io_hash_req(), verity_io_real_digest() - * and verity_io_want_digest(). - * - * hash_req is either a struct ahash_request or a struct shash_desc, - * depending on whether ahash_tfm or shash_tfm is being used. + * This struct is followed by a variable-sized hash request of size + * v->hash_reqsize, either a struct ahash_request or a struct shash_desc + * (depending on whether ahash_tfm or shash_tfm is being used). To + * access it, use verity_io_hash_req(). 
*/ }; static inline void *verity_io_hash_req(struct dm_verity *v, struct dm_verity_io *io) { return io + 1; } -static inline u8 *verity_io_real_digest(struct dm_verity *v, - struct dm_verity_io *io) -{ - return (u8 *)(io + 1) + v->hash_reqsize; -} - -static inline u8 *verity_io_want_digest(struct dm_verity *v, - struct dm_verity_io *io) -{ - return (u8 *)(io + 1) + v->hash_reqsize + v->digest_size; -} - extern int verity_for_bv_block(struct dm_verity *v, struct dm_verity_io *io, struct bvec_iter *iter, int (*process)(struct dm_verity *v, struct dm_verity_io *io, u8 *data, size_t len));
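Both the fsverity and dm-verity changes above use the same "pending block" pattern: hold one mapped data block, hash it together with the next one when a second block shows up, and flush a lone leftover block at the end. The following userspace sketch illustrates only that control flow. It is not kernel code: toy_hash() (32-bit FNV-1a) stands in for the real digest, toy_hash_2_blocks() is a hypothetical stand-in for crypto_shash_finup_mb() with two messages (it hashes sequentially rather than interleaving), and struct pair_ctx, BLOCK_SIZE, and the counters are names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 16u	/* toy block size, chosen only for this sketch */

/* Toy stand-in for a real digest: 32-bit FNV-1a over one block. */
static uint32_t toy_hash(const uint8_t *data, size_t len)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < len; i++) {
		h ^= data[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Hypothetical stand-in for a two-message multibuffer hash call.  A real
 * implementation would interleave the two computations; this one simply
 * hashes the blocks one after the other.
 */
static void toy_hash_2_blocks(const uint8_t *d1, const uint8_t *d2, size_t len,
			      uint32_t *out1, uint32_t *out2)
{
	*out1 = toy_hash(d1, len);
	*out2 = toy_hash(d2, len);
}

/* Illustration-only context, loosely modelled on the pending_* state. */
struct pair_ctx {
	const uint8_t *pending_data;	/* block waiting for a partner, or NULL */
	size_t pending_idx;		/* index of the pending block */
	unsigned int pair_calls;	/* number of two-block hash calls made */
	unsigned int single_calls;	/* number of one-block hash calls made */
};

static void pair_ctx_init(struct pair_ctx *ctx)
{
	ctx->pending_data = NULL;
	ctx->pending_idx = 0;
	ctx->pair_calls = 0;
	ctx->single_calls = 0;
}

/* Add one block: either pair it with the pending block or make it pending. */
static void pair_ctx_add(struct pair_ctx *ctx, const uint8_t *data, size_t idx,
			 uint32_t *out)
{
	if (ctx->pending_data) {
		/* Hash two blocks with one call. */
		toy_hash_2_blocks(ctx->pending_data, data, BLOCK_SIZE,
				  &out[ctx->pending_idx], &out[idx]);
		ctx->pending_data = NULL;
		ctx->pair_calls++;
	} else {
		/* Wait and see if there's another block. */
		ctx->pending_data = data;
		ctx->pending_idx = idx;
	}
}

/* Flush a leftover block when the total number of blocks was odd. */
static void pair_ctx_finish(struct pair_ctx *ctx, uint32_t *out)
{
	if (!ctx->pending_data)
		return;
	out[ctx->pending_idx] = toy_hash(ctx->pending_data, BLOCK_SIZE);
	ctx->pending_data = NULL;
	ctx->single_calls++;
}

/* Hash nblocks contiguous blocks from buf, writing one digest per block. */
static void hash_blocks_paired(const uint8_t *buf, size_t nblocks,
			       uint32_t *out, struct pair_ctx *ctx)
{
	assert(buf && out && ctx);
	pair_ctx_init(ctx);
	for (size_t i = 0; i < nblocks; i++)
		pair_ctx_add(ctx, buf + i * BLOCK_SIZE, i, out);
	pair_ctx_finish(ctx, out);
}
```

Calling hash_blocks_paired() on five blocks makes two two-block calls and one single-block call, and each out[i] matches what toy_hash() returns for block i on its own. The real code additionally has to unmap the pending block on every error path, which this sketch leaves out.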