[RFC] crypto: algapi - make crypto_xor() and crypto_inc() alignment agnostic

Message ID 1485785489-5116-1-git-send-email-ard.biesheuvel@linaro.org
State New

Commit Message

Ard Biesheuvel Jan. 30, 2017, 2:11 p.m.
Instead of unconditionally forcing 4 byte alignment for all generic
chaining modes that rely on crypto_xor() or crypto_inc() (which may
result in unnecessary copying of data when the underlying hardware
can perform unaligned accesses efficiently), make those functions
deal with unaligned input explicitly, but only if the Kconfig symbol
HAVE_EFFICIENT_UNALIGNED_ACCESS is set. This will allow us to drop
the alignmasks from the CBC, CMAC, CTR, CTS, PCBC and SEQIV drivers.

For crypto_inc(), this simply involves making the 4-byte stride
conditional on HAVE_EFFICIENT_UNALIGNED_ACCESS being set, given that
it typically operates on 16 byte buffers.
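
As a rough illustration of the crypto_inc() behaviour described above, here is a minimal user-space sketch. It is not the kernel code: ctr_inc(), inc_bytes(), load_be32(), store_be32() and the EFFICIENT_UNALIGNED macro are hypothetical stand-ins, and the 32-bit accesses are written as byte loads so the sketch stays portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS. */
#ifndef EFFICIENT_UNALIGNED
#define EFFICIENT_UNALIGNED 0
#endif

/* Bytewise big-endian increment: bump the last byte and propagate the carry. */
static void inc_bytes(uint8_t *a, size_t size)
{
	uint8_t *b = a + size;

	while (size--) {
		if (++*--b)	/* stop once a byte does not wrap to 0 */
			break;
	}
}

static uint32_t load_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static void store_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/* Increment a big-endian counter, using 4-byte strides only when allowed. */
static void ctr_inc(uint8_t *a, size_t size)
{
	uint8_t *end = a + size;

	assert(a != NULL);

	if (EFFICIENT_UNALIGNED || ((uintptr_t)end % sizeof(uint32_t)) == 0) {
		for (; size >= 4; size -= 4) {
			uint32_t c;

			end -= 4;
			c = load_be32(end) + 1;
			store_be32(end, c);
			if (c)		/* no carry out of this word: done */
				return;
		}
	}

	/* Remaining leading bytes, or the whole buffer if misaligned. */
	inc_bytes(a, size);
}
```

Both paths produce the same counter value; only the access width differs.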

For crypto_xor(), an algorithm is implemented that simply runs through
the input using the largest strides possible if unaligned accesses are
allowed. If they are not, an optimal sequence of memory accesses is
emitted that takes the relative alignment of the input buffers into
account, e.g., if the relative misalignment of dst and src is 4 bytes,
the entire xor operation will be completed using 4 byte loads and stores
(modulo unaligned bits at the start and end). Note that all expressions
involving startalign and misalign are simply eliminated by the compiler
if HAVE_EFFICIENT_UNALIGNED_ACCESS is defined.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

---
 crypto/algapi.c | 102 ++++++++++++++++----
 crypto/cbc.c    |   3 -
 crypto/cmac.c   |   3 +-
 crypto/ctr.c    |   2 +-
 crypto/cts.c    |   3 -
 crypto/pcbc.c   |   3 -
 crypto/seqiv.c  |   2 -
 7 files changed, 87 insertions(+), 31 deletions(-)

-- 
2.7.4


Comments

Eric Biggers Feb. 2, 2017, 6:47 a.m. | #1
On Mon, Jan 30, 2017 at 02:11:29PM +0000, Ard Biesheuvel wrote:
> Instead of unconditionally forcing 4 byte alignment for all generic
> chaining modes that rely on crypto_xor() or crypto_inc() (which may
> result in unnecessary copying of data when the underlying hardware
> can perform unaligned accesses efficiently), make those functions
> deal with unaligned input explicitly, but only if the Kconfig symbol
> HAVE_EFFICIENT_UNALIGNED_ACCESS is set. This will allow us to drop
> the alignmasks from the CBC, CMAC, CTR, CTS, PCBC and SEQIV drivers.
>
> For crypto_inc(), this simply involves making the 4-byte stride
> conditional on HAVE_EFFICIENT_UNALIGNED_ACCESS being set, given that
> it typically operates on 16 byte buffers.
>
> For crypto_xor(), an algorithm is implemented that simply runs through
> the input using the largest strides possible if unaligned accesses are
> allowed. If they are not, an optimal sequence of memory accesses is
> emitted that takes the relative alignment of the input buffers into
> account, e.g., if the relative misalignment of dst and src is 4 bytes,
> the entire xor operation will be completed using 4 byte loads and stores
> (modulo unaligned bits at the start and end). Note that all expressions
> involving startalign and misalign are simply eliminated by the compiler
> if HAVE_EFFICIENT_UNALIGNED_ACCESS is defined.
>

Hi Ard,

This is a good idea, and I think it was error-prone to be requiring 4-byte
alignment always, and also inefficient on many architectures.

The new crypto_inc() looks fine, but the new crypto_xor() is quite complicated.
I'm wondering whether it has to be that way, especially since it seems to most
commonly be used on very small input buffers, e.g. 8 or 16-byte blocks.  There
are a couple trivial ways it could be simplified, e.g. using 'dst' and 'src'
directly instead of 'a' and 'b' (which also seems to improve code generation by
getting rid of the '+= len & ~mask' parts), or using sizeof(long) directly
instead of 'size' and 'mask'.

But also when I tried testing the proposed crypto_xor() on MIPS, it didn't work
correctly on a misaligned buffer.  With startalign=1, it did one iteration of
the following loop and then exited with startalign=0 and entered the "unsigned
long at a time" loop, which is incorrect since at that point the buffers were
not yet fully aligned:

>		do {
>			if (len < sizeof(u8))
>				break;
>
>			if (len >= size && !(startalign & 1) && !(misalign & 1))
>				break;
>
>			*dst++ ^= *src++;
>			len -= sizeof(u8);
>			startalign &= ~sizeof(u8);
>		} while (misalign & 1);

I think it would need to do instead:

		startalign += sizeof(u8);
		startalign %= sizeof(unsigned long);
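
To make the difference concrete, here is a small user-space sketch of the two bookkeeping rules for the case where dst and src share the same offset within a word. The helper names step_posted() and step_suggested() are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Bookkeeping as posted: clear bit 0 after consuming one byte. */
static unsigned int step_posted(unsigned int startalign)
{
	assert(startalign < sizeof(unsigned long));
	return startalign & ~1u;
}

/* Suggested bookkeeping: advance the offset modulo the word size. */
static unsigned int step_suggested(unsigned int startalign)
{
	assert(startalign < sizeof(unsigned long));
	return (unsigned int)((startalign + 1) % sizeof(unsigned long));
}
```

Starting from startalign == 1, one byte has been consumed and the pointers now sit at offset 2 within the word. step_posted(1) yields 0, which wrongly claims word alignment, while step_suggested(1) yields 2.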

But I am wondering whether you considered something simpler, using the
get_unaligned/put_unaligned helpers, maybe even using a switch statement for the
last (sizeof(long) - 1) bytes so it can be compiled as a jump table.  Something
like this:

#define xor_unaligned(dst, src) \
        put_unaligned(get_unaligned(dst) ^ get_unaligned(src), (dst))

void crypto_xor(u8 *dst, const u8 *src, unsigned int len)
{
	while (len >= sizeof(unsigned long)) {
		xor_unaligned((unsigned long *)dst, (unsigned long *)src);
		dst += sizeof(unsigned long);
		src += sizeof(unsigned long);
		len -= sizeof(unsigned long);
	}

	switch (len) {
#ifdef CONFIG_64BIT
	case 7:
		dst[6] ^= src[6];
		/* fall through */
	case 6:
		xor_unaligned((u16 *)&dst[4], (u16 *)&src[4]);
		goto len_4;
	case 5:
		dst[4] ^= src[4];
		/* fall through */
	case 4:
	len_4:
		xor_unaligned((u32 *)dst, (u32 *)src);
		break;
#endif
	case 3:
		dst[2] ^= src[2];
		/* fall through */
	case 2:
		xor_unaligned((u16 *)dst, (u16 *)src);
		break;
	case 1:
		dst[0] ^= src[0];
		break;
	}
}

That would seem like a better choice for small buffers, which seems to be the
more common case.  It should generate slightly faster code on architectures with
fast unaligned access like x86_64, while still being sufficient on architectures
without --- perhaps even faster, since it wouldn't have as many branches.
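
For experimenting with this approach outside the kernel, a user-space sketch along the same lines might look as follows. It assumes fixed-size memcpy() as a stand-in for get_unaligned()/put_unaligned(), and it handles the tail with bit tests on len rather than the switch shown above; XOR_UNALIGNED() and xor_bytes() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Stand-in for the kernel's unaligned accessors: a fixed-size memcpy()
 * is the portable way to express an unaligned load or store, and
 * compilers typically lower it to a single access where that is legal.
 */
#define XOR_UNALIGNED(type, dst, src) do {		\
	type d_, s_;					\
	memcpy(&d_, (dst), sizeof(d_));			\
	memcpy(&s_, (src), sizeof(s_));			\
	d_ ^= s_;					\
	memcpy((dst), &d_, sizeof(d_));			\
} while (0)

static void xor_bytes(uint8_t *dst, const uint8_t *src, size_t len)
{
	assert(dst != NULL && src != NULL);

	/* Bulk: one unsigned long per iteration, any alignment. */
	while (len >= sizeof(unsigned long)) {
		XOR_UNALIGNED(unsigned long, dst, src);
		dst += sizeof(unsigned long);
		src += sizeof(unsigned long);
		len -= sizeof(unsigned long);
	}

	/* Tail: fewer than sizeof(unsigned long) bytes remain. */
	if (len & 4) {
		XOR_UNALIGNED(uint32_t, dst, src);
		dst += 4;
		src += 4;
	}
	if (len & 2) {
		XOR_UNALIGNED(uint16_t, dst, src);
		dst += 2;
		src += 2;
	}
	if (len & 1)
		*dst ^= *src;
}
```

On a 32-bit build the `len & 4` branch can never be taken, since the bulk loop already consumes every full 4-byte word.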

Eric

Ard Biesheuvel Feb. 2, 2017, 7:52 a.m. | #2
On 2 February 2017 at 06:47, Eric Biggers <ebiggers3@gmail.com> wrote:
> On Mon, Jan 30, 2017 at 02:11:29PM +0000, Ard Biesheuvel wrote:
>> Instead of unconditionally forcing 4 byte alignment for all generic
>> chaining modes that rely on crypto_xor() or crypto_inc() (which may
>> result in unnecessary copying of data when the underlying hardware
>> can perform unaligned accesses efficiently), make those functions
>> deal with unaligned input explicitly, but only if the Kconfig symbol
>> HAVE_EFFICIENT_UNALIGNED_ACCESS is set. This will allow us to drop
>> the alignmasks from the CBC, CMAC, CTR, CTS, PCBC and SEQIV drivers.
>>
>> For crypto_inc(), this simply involves making the 4-byte stride
>> conditional on HAVE_EFFICIENT_UNALIGNED_ACCESS being set, given that
>> it typically operates on 16 byte buffers.
>>
>> For crypto_xor(), an algorithm is implemented that simply runs through
>> the input using the largest strides possible if unaligned accesses are
>> allowed. If they are not, an optimal sequence of memory accesses is
>> emitted that takes the relative alignment of the input buffers into
>> account, e.g., if the relative misalignment of dst and src is 4 bytes,
>> the entire xor operation will be completed using 4 byte loads and stores
>> (modulo unaligned bits at the start and end). Note that all expressions
>> involving startalign and misalign are simply eliminated by the compiler
>> if HAVE_EFFICIENT_UNALIGNED_ACCESS is defined.
>>
>
> Hi Ard,
>
> This is a good idea, and I think it was error-prone to be requiring 4-byte
> alignment always, and also inefficient on many architectures.
>
> The new crypto_inc() looks fine, but the new crypto_xor() is quite complicated.
> I'm wondering whether it has to be that way, especially since it seems to most
> commonly be used on very small input buffers, e.g. 8 or 16-byte blocks.  There
> are a couple trivial ways it could be simplified, e.g. using 'dst' and 'src'
> directly instead of 'a' and 'b' (which also seems to improve code generation by
> getting rid of the '+= len & ~mask' parts), or using sizeof(long) directly
> instead of 'size' and 'mask'.
>
> But also when I tried testing the proposed crypto_xor() on MIPS, it didn't work
> correctly on a misaligned buffer.  With startalign=1, it did one iteration of
> the following loop and then exited with startalign=0 and entered the "unsigned
> long at a time" loop, which is incorrect since at that point the buffers were
> not yet fully aligned:
>

Right. I knew it was convoluted but I thought that was justified by
its correctness :-)

>>               do {
>>                       if (len < sizeof(u8))
>>                               break;
>>
>>                       if (len >= size && !(startalign & 1) && !(misalign & 1))
>>                               break;
>>
>>                       *dst++ ^= *src++;
>>                       len -= sizeof(u8);
>>                       startalign &= ~sizeof(u8);
>>               } while (misalign & 1);
>
> I think it would need to do instead:
>
>                 startalign += sizeof(u8);
>                 startalign %= sizeof(unsigned long);
>
> But I am wondering whether you considered something simpler, using the
> get_unaligned/put_unaligned helpers, maybe even using a switch statement for the
> last (sizeof(long) - 1) bytes so it can be compiled as a jump table.  Something
> like this:
>
> #define xor_unaligned(dst, src) \
>         put_unaligned(get_unaligned(dst) ^ get_unaligned(src), (dst))
>
> void crypto_xor(u8 *dst, const u8 *src, unsigned int len)
> {
>         while (len >= sizeof(unsigned long)) {
>                 xor_unaligned((unsigned long *)dst, (unsigned long *)src);
>                 dst += sizeof(unsigned long);
>                 src += sizeof(unsigned long);
>                 len -= sizeof(unsigned long);
>         }
>
>         switch (len) {
> #ifdef CONFIG_64BIT
>         case 7:
>                 dst[6] ^= src[6];
>                 /* fall through */
>         case 6:
>                 xor_unaligned((u16 *)&dst[4], (u16 *)&src[4]);
>                 goto len_4;
>         case 5:
>                 dst[4] ^= src[4];
>                 /* fall through */
>         case 4:
>         len_4:
>                 xor_unaligned((u32 *)dst, (u32 *)src);
>                 break;
> #endif
>         case 3:
>                 dst[2] ^= src[2];
>                 /* fall through */
>         case 2:
>                 xor_unaligned((u16 *)dst, (u16 *)src);
>                 break;
>         case 1:
>                 dst[0] ^= src[0];
>                 break;
>         }
> }
>
> That would seem like a better choice for small buffers, which seems to be the
> more common case.  It should generate slightly faster code on architectures with
> fast unaligned access like x86_64, while still being sufficient on architectures
> without --- perhaps even faster, since it wouldn't have as many branches.
>

Well, what I tried to deal with explicitly is misaligned dst and src
by, e.g., 4 bytes. In my implementation, the idea was that it would
run through the entire input using 32-bit loads and stores, and the
standard unaligned accessors always take the hit of bytewise accesses
on architectures that don't have hardware support.
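
The stride that relative misalignment permits can be computed directly from the two addresses. The following user-space sketch is only an illustration of that idea, not code from the patch: common_stride() is a hypothetical helper, and word_size stands for sizeof(unsigned long) on the target.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest power-of-two access size, capped at word_size, for which dst
 * and src can both be naturally aligned at the same time after a short
 * bytewise prologue. The lowest set bit of (dst ^ src) is the first
 * address bit in which the two pointers differ, so no wider access can
 * ever be aligned for both.
 */
static unsigned int common_stride(uintptr_t dst, uintptr_t src,
				  unsigned int word_size)
{
	uintptr_t delta;

	/* word_size must be a non-zero power of two */
	assert(word_size && !(word_size & (word_size - 1)));

	delta = (dst ^ src) | word_size;
	return (unsigned int)(delta & (~delta + 1));	/* lowest set bit */
}
```

With 8-byte words, buffers at 0x1000 and 0x1004 give a stride of 4, so the whole operation can use 32-bit loads and stores, whereas any odd difference between the addresses forces bytewise access.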

But I have an idea how I could simplify this, stay tuned please

Jason A. Donenfeld Feb. 4, 2017, 11 p.m. | #3
Hey,

On Thu, Feb 2, 2017 at 7:47 AM, Eric Biggers <ebiggers3@gmail.com> wrote:
> I'm wondering whether it has to be that way, especially since it seems to most
> commonly be used on very small input buffers, e.g. 8 or 16-byte blocks.

Note that popular stream ciphers like chacha or salsa wind up XORing
much longer blocks -- 64 bytes. Likewise, CTR mode tends to XOR using
the block size as well. Not sure whether this is directly relevant for
the decision making here, but I thought I'd mention it just in case.
The XOR for this case should be _fast_, and preferably inlineable.

Jason

Jason A. Donenfeld Feb. 4, 2017, 11:10 p.m. | #4
Another thing that might be helpful is that you can let gcc decide on
the alignment, and then optimize appropriately. Check out what we do
with siphash:

https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/tree/include/linux/siphash.h#n76

static inline u64 siphash(const void *data, size_t len,
                          const siphash_key_t *key)
{
#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
        if (!IS_ALIGNED((unsigned long)data, SIPHASH_ALIGNMENT))
                return __siphash_unaligned(data, len, key);
#endif
        return ___siphash_aligned(data, len, key);
}

With this trick, we fall through to the fast alignment-assuming code,
if gcc can prove that the address is aligned. This is often the case
when passing structs, or when passing buffers that have
__aligned(BLOCKSIZE). It proves to be a very useful optimization on
some platforms.

Eric Biggers Feb. 5, 2017, 3:05 a.m. | #5
On Sun, Feb 05, 2017 at 12:10:53AM +0100, Jason A. Donenfeld wrote:
> Another thing that might be helpful is that you can let gcc decide on
> the alignment, and then optimize appropriately. Check out what we do
> with siphash:
>
> https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/tree/include/linux/siphash.h#n76
>
> static inline u64 siphash(const void *data, size_t len,
>                           const siphash_key_t *key)
> {
> #ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>         if (!IS_ALIGNED((unsigned long)data, SIPHASH_ALIGNMENT))
>                 return __siphash_unaligned(data, len, key);
> #endif
>         return ___siphash_aligned(data, len, key);
> }
>
> With this trick, we fall through to the fast alignment-assuming code,
> if gcc can prove that the address is aligned. This is often the case
> when passing structs, or when passing buffers that have
> __aligned(BLOCKSIZE). It proves to be a very useful optimization on
> some platforms.

Yes, this is a good idea.  Usually at least one of the two pointers passed to
crypto_xor() will have alignment unknown to the compiler, but sometimes the
length is constant, and in that case inlining can help a lot.  For example, if
someone does crypto_xor(foo, bar, 16) on x86_64 or ARM64, we'd really like it to
turn into just a few instructions like this:

	mov    (%rsi),%rax
	xor    %rax,(%rdi)
	mov    0x8(%rsi),%rax
	xor    %rax,0x8(%rdi)

So how about inlining crypto_xor() if CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is
set or the pointers are long-aligned, and otherwise calling an out-of-line
function __crypto_xor_unaligned() that handles all the cases with weird
alignment?  Something like the following patch:

(Note: exactly how __crypto_xor_unaligned() is implemented is still debatable;
it could be more similar to Ard's proposal, or it could use the unaligned access
helpers.)

diff --git a/crypto/algapi.c b/crypto/algapi.c
index df939b54b09f..a0591db3f13a 100644
--- a/crypto/algapi.c
+++ b/crypto/algapi.c
@@ -972,23 +972,69 @@ void crypto_inc(u8 *a, unsigned int size)
 }
 EXPORT_SYMBOL_GPL(crypto_inc);
 
-static inline void crypto_xor_byte(u8 *a, const u8 *b, unsigned int size)
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+void __crypto_xor_unaligned(u8 *dst, const u8 *src, unsigned int len)
 {
-	for (; size; size--)
-		*a++ ^= *b++;
-}
+	unsigned long delta = (unsigned long)dst ^ (unsigned long)src;
 
-void crypto_xor(u8 *dst, const u8 *src, unsigned int size)
-{
-	u32 *a = (u32 *)dst;
-	u32 *b = (u32 *)src;
+	/* Handle relative misalignment */
+	if (delta % sizeof(unsigned long)) {
+
+		/* 1-byte relative misalignment? */
+		if (delta & 1) {
+			while (len--)
+				*dst++ ^= *src++;
+			return;
+		}
 
-	for (; size >= 4; size -= 4)
-		*a++ ^= *b++;
+		/* 2-byte relative misalignment? */
+		if ((delta & 2) || sizeof(unsigned long) == 4) {
+			if ((unsigned long)dst % __alignof__(u16) && len) {
+				*dst++ ^= *src++;
+				len--;
+			}
+			while (len >= 2) {
+				*(u16 *)dst ^= *(u16 *)src;
+				dst += 2, src += 2, len -= 2;
+			}
+			if (len)
+				*dst ^= *src;
+			return;
+		}
+
+		/* 4-byte relative misalignment? */
+		while ((unsigned long)dst % __alignof__(u32) && len) {
+			*dst++ ^= *src++;
+			len--;
+		}
+		while (len >= 4) {
+			*(u32 *)dst ^= *(u32 *)src;
+			dst += 4, src += 4, len -= 4;
+		}
+		while (len--)
+			*dst++ ^= *src++;
+		return;
+	}
+
+	/* No relative misalignment; use word accesses */
+
+	while ((unsigned long)dst % __alignof__(unsigned long) && len) {
+		*dst++ ^= *src++;
+		len--;
+	}
+
+	while (len >= sizeof(unsigned long)) {
+		*(unsigned long *)dst ^= *(unsigned long *)src;
+		dst += sizeof(unsigned long);
+		src += sizeof(unsigned long);
+		len -= sizeof(unsigned long);
+	}
 
-	crypto_xor_byte((u8 *)a, (u8 *)b, size);
+	while (len--)
+		*dst++ ^= *src++;
 }
-EXPORT_SYMBOL_GPL(crypto_xor);
+EXPORT_SYMBOL_GPL(__crypto_xor_unaligned);
+#endif /* !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */
 
 unsigned int crypto_alg_extsize(struct crypto_alg *alg)
 {
diff --git a/include/crypto/algapi.h b/include/crypto/algapi.h
index 404e9558e879..718145c5eaca 100644
--- a/include/crypto/algapi.h
+++ b/include/crypto/algapi.h
@@ -191,9 +191,29 @@ static inline unsigned int crypto_queue_len(struct crypto_queue *queue)
 	return queue->qlen;
 }
 
-/* These functions require the input/output to be aligned as u32. */
 void crypto_inc(u8 *a, unsigned int size);
-void crypto_xor(u8 *dst, const u8 *src, unsigned int size);
+
+void __crypto_xor_unaligned(u8 *dst, const u8 *src, unsigned int len);
+
+static inline void crypto_xor(u8 *dst, const u8 *src, unsigned int len)
+{
+	if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
+	    (((unsigned long)dst | (unsigned long)src) %
+	     __alignof__(unsigned long) == 0))
+	{
+		while (len >= sizeof(unsigned long)) {
+			*(unsigned long *)dst ^= *(unsigned long *)src;
+			dst += sizeof(unsigned long);
+			src += sizeof(unsigned long);
+			len -= sizeof(unsigned long);
+		}
+
+		while (len--)
+			*dst++ ^= *src++;
+		return;
+	}
+	return __crypto_xor_unaligned(dst, src, len);
+}
 
 int blkcipher_walk_done(struct blkcipher_desc *desc,
 			struct blkcipher_walk *walk, int err);

Patch

diff --git a/crypto/algapi.c b/crypto/algapi.c
index df939b54b09f..771284473a97 100644
--- a/crypto/algapi.c
+++ b/crypto/algapi.c
@@ -961,32 +961,100 @@  void crypto_inc(u8 *a, unsigned int size)
 	__be32 *b = (__be32 *)(a + size);
 	u32 c;
 
-	for (; size >= 4; size -= 4) {
-		c = be32_to_cpu(*--b) + 1;
-		*b = cpu_to_be32(c);
-		if (c)
-			return;
-	}
+	if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
+	    !((unsigned long)b & (__alignof__(*b) - 1)))
+		for (; size >= 4; size -= 4) {
+			c = be32_to_cpu(*--b) + 1;
+			*b = cpu_to_be32(c);
+			if (c)
+				return;
+		}
 
 	crypto_inc_byte(a, size);
 }
 EXPORT_SYMBOL_GPL(crypto_inc);
 
-static inline void crypto_xor_byte(u8 *a, const u8 *b, unsigned int size)
+void crypto_xor(u8 *dst, const u8 *src, unsigned int len)
 {
-	for (; size; size--)
-		*a++ ^= *b++;
-}
+	const int size = sizeof(unsigned long);
+	const int mask = size - 1;
+	int misalign = ((unsigned long)dst ^ (unsigned long)src) & mask;
+	int startalign = ((unsigned long)dst | (unsigned long)src) & mask;
+
+	if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS))
+		misalign = startalign = 0;
+
+	while (len > 0) {
+		/*
+		 * Process as much data as we can using 4 or 8 byte strides
+		 * (depending on the size of unsigned long) if
+		 * a) we don't care about alignment, or
+		 * b) we do care about alignment, but dst and src are both
+		 *    suitably aligned
+		 */
+		if (startalign == 0) {
+			unsigned long *a = (unsigned long *)dst;
+			const unsigned long *b = (const unsigned long *)src;
+
+			dst += len & ~mask;
+			src += len & ~mask;
+
+			for (; len >= size; len -= size)
+				*a++ ^= *b++;
+		}
 
-void crypto_xor(u8 *dst, const u8 *src, unsigned int size)
-{
-	u32 *a = (u32 *)dst;
-	u32 *b = (u32 *)src;
+		if (IS_ENABLED(CONFIG_64BIT)) {
+			do {
+				u32 *a = (u32 *)dst;
+				const u32 *b = (u32 *)src;
+
+				if (len < sizeof(u32) ||
+				    (startalign & (sizeof(u32) - 1)) != 0)
+					break;
+
+				if (len >= size && misalign != sizeof(u32) &&
+				    (startalign & sizeof(u32)) == 0)
+					break;
+
+				*a ^= *b;
+				dst += sizeof(u32);
+				src += sizeof(u32);
+				len -= sizeof(u32);
+				startalign &= ~sizeof(u32);
+			} while (misalign == sizeof(u32));
+		}
 
-	for (; size >= 4; size -= 4)
-		*a++ ^= *b++;
+		do {
+			u16 *a = (u16 *)dst;
+			const u16 *b = (u16 *)src;
+
+			if (len < sizeof(u16) ||
+			    (startalign & (sizeof(u16) - 1)) != 0)
+				break;
 
-	crypto_xor_byte((u8 *)a, (u8 *)b, size);
+			if (len >= size && (startalign & sizeof(u16)) == 0 &&
+			    (misalign % sizeof(u32)) != sizeof(u16))
+				break;
+
+			*a ^= *b;
+			dst += sizeof(u16);
+			src += sizeof(u16);
+			len -= sizeof(u16);
+			startalign &= ~sizeof(u16);
+		} while ((misalign % sizeof(u32)) == sizeof(u16));
+
+		do {
+			if (len < sizeof(u8))
+				break;
+
+			if (len >= size && !(startalign & 1) && !(misalign & 1))
+				break;
+
+			*dst++ ^= *src++;
+			len -= sizeof(u8);
+			startalign &= ~sizeof(u8);
+		} while (misalign & 1);
+	}
 }
 EXPORT_SYMBOL_GPL(crypto_xor);
 
diff --git a/crypto/cbc.c b/crypto/cbc.c
index 68f751a41a84..bc160a3186dc 100644
--- a/crypto/cbc.c
+++ b/crypto/cbc.c
@@ -145,9 +145,6 @@  static int crypto_cbc_create(struct crypto_template *tmpl, struct rtattr **tb)
 	inst->alg.base.cra_blocksize = alg->cra_blocksize;
 	inst->alg.base.cra_alignmask = alg->cra_alignmask;
 
-	/* We access the data as u32s when xoring. */
-	inst->alg.base.cra_alignmask |= __alignof__(u32) - 1;
-
 	inst->alg.ivsize = alg->cra_blocksize;
 	inst->alg.min_keysize = alg->cra_cipher.cia_min_keysize;
 	inst->alg.max_keysize = alg->cra_cipher.cia_max_keysize;
diff --git a/crypto/cmac.c b/crypto/cmac.c
index 04080dca8f0c..16301f52858c 100644
--- a/crypto/cmac.c
+++ b/crypto/cmac.c
@@ -260,8 +260,7 @@  static int cmac_create(struct crypto_template *tmpl, struct rtattr **tb)
 	if (err)
 		goto out_free_inst;
 
-	/* We access the data as u32s when xoring. */
-	alignmask = alg->cra_alignmask | (__alignof__(u32) - 1);
+	alignmask = alg->cra_alignmask;
 	inst->alg.base.cra_alignmask = alignmask;
 	inst->alg.base.cra_priority = alg->cra_priority;
 	inst->alg.base.cra_blocksize = alg->cra_blocksize;
diff --git a/crypto/ctr.c b/crypto/ctr.c
index a9a7a44f2783..a4f4a8983169 100644
--- a/crypto/ctr.c
+++ b/crypto/ctr.c
@@ -209,7 +209,7 @@  static struct crypto_instance *crypto_ctr_alloc(struct rtattr **tb)
 	inst->alg.cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER;
 	inst->alg.cra_priority = alg->cra_priority;
 	inst->alg.cra_blocksize = 1;
-	inst->alg.cra_alignmask = alg->cra_alignmask | (__alignof__(u32) - 1);
+	inst->alg.cra_alignmask = alg->cra_alignmask;
 	inst->alg.cra_type = &crypto_blkcipher_type;
 
 	inst->alg.cra_blkcipher.ivsize = alg->cra_blocksize;
diff --git a/crypto/cts.c b/crypto/cts.c
index a1335d6c35fb..243f591dc409 100644
--- a/crypto/cts.c
+++ b/crypto/cts.c
@@ -374,9 +374,6 @@  static int crypto_cts_create(struct crypto_template *tmpl, struct rtattr **tb)
 	inst->alg.base.cra_blocksize = alg->base.cra_blocksize;
 	inst->alg.base.cra_alignmask = alg->base.cra_alignmask;
 
-	/* We access the data as u32s when xoring. */
-	inst->alg.base.cra_alignmask |= __alignof__(u32) - 1;
-
 	inst->alg.ivsize = alg->base.cra_blocksize;
 	inst->alg.chunksize = crypto_skcipher_alg_chunksize(alg);
 	inst->alg.min_keysize = crypto_skcipher_alg_min_keysize(alg);
diff --git a/crypto/pcbc.c b/crypto/pcbc.c
index 11d248673ad4..29dd2b4a3b85 100644
--- a/crypto/pcbc.c
+++ b/crypto/pcbc.c
@@ -260,9 +260,6 @@  static int crypto_pcbc_create(struct crypto_template *tmpl, struct rtattr **tb)
 	inst->alg.base.cra_blocksize = alg->cra_blocksize;
 	inst->alg.base.cra_alignmask = alg->cra_alignmask;
 
-	/* We access the data as u32s when xoring. */
-	inst->alg.base.cra_alignmask |= __alignof__(u32) - 1;
-
 	inst->alg.ivsize = alg->cra_blocksize;
 	inst->alg.min_keysize = alg->cra_cipher.cia_min_keysize;
 	inst->alg.max_keysize = alg->cra_cipher.cia_max_keysize;
diff --git a/crypto/seqiv.c b/crypto/seqiv.c
index c7049231861f..570b7d1aa0ca 100644
--- a/crypto/seqiv.c
+++ b/crypto/seqiv.c
@@ -153,8 +153,6 @@  static int seqiv_aead_create(struct crypto_template *tmpl, struct rtattr **tb)
 	if (IS_ERR(inst))
 		return PTR_ERR(inst);
 
-	inst->alg.base.cra_alignmask |= __alignof__(u32) - 1;
-
 	spawn = aead_instance_ctx(inst);
 	alg = crypto_spawn_aead_alg(spawn);