[2/2] crypto: aegis/generic - fix for big endian systems

Message ID 20180930085859.15038-3-ard.biesheuvel@linaro.org
State New
Series
  • crypto - fix aegis/morus for big endian systems

Commit Message

Ard Biesheuvel Sept. 30, 2018, 8:58 a.m.
Use the correct __le32 annotation and accessors to perform the
single round of AES encryption performed inside the AEGIS transform.
Otherwise, tcrypt reports:

  alg: aead: Test 1 failed on encryption for aegis128-generic
  00000000: 6c 25 25 4a 3c 10 1d 27 2b c1 d4 84 9a ef 7f 6e
  alg: aead: Test 1 failed on encryption for aegis128l-generic
  00000000: cd c6 e3 b8 a0 70 9d 8e c2 4f 6f fe 71 42 df 28
  alg: aead: Test 1 failed on encryption for aegis256-generic
  00000000: aa ed 07 b1 96 1d e9 e6 f2 ed b5 8e 1c 5f dc 1c

While at it, let's refer to the first precomputed table only, and
derive the other ones by rotation. This reduces the D-cache footprint
by 75%, and the extra rotations should be cheap or even free on
load/store architectures (and x86 has its own AES-NI based
implementation).

Fixes: f606a88e5823 ("crypto: aegis - Add generic AEGIS AEAD implementations")
Cc: <stable@vger.kernel.org> # v4.18+
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

---
 crypto/aegis.h | 23 +++++++++-----------
 1 file changed, 10 insertions(+), 13 deletions(-)

-- 
2.19.0
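
As an illustration of the table derivation described in the commit
message: the four forward tables hold one S-box output multiplied by the
four columns of the MixColumns matrix, and because that matrix is
circulant, tables 1 to 3 are table 0 rotated left by 8, 16 and 24 bits.
The userspace sketch below checks this for every possible S-box output
byte. The names mix, gmul(), ft_word() and rotation_identity_holds() are
made up for this example and are not kernel API; the byte layout is
assumed to follow the convention of crypto_ft_tab in aes_generic (first
matrix row in the least significant byte).

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32), like the kernel's rol32(). */
static uint32_t rol32(uint32_t w, unsigned int n)
{
	return (w << n) | (w >> (32 - n));
}

/* Multiply by 2 in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1. */
static uint8_t xtime(uint8_t x)
{
	return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1b : 0x00));
}

/* The AES MixColumns matrix; row r, column n. */
static const uint8_t mix[4][4] = {
	{ 2, 3, 1, 1 },
	{ 1, 2, 3, 1 },
	{ 1, 1, 2, 3 },
	{ 3, 1, 1, 2 },
};

/* Multiply s by a MixColumns coefficient c, where c is 1, 2 or 3. */
static uint8_t gmul(uint8_t c, uint8_t s)
{
	uint8_t r = 0;

	if (c & 1)
		r ^= s;
	if (c & 2)
		r ^= xtime(s);
	return r;
}

/*
 * Word that forward table n would hold for S-box output s: column n of
 * the matrix times s, with matrix row 0 in the least significant byte.
 */
static uint32_t ft_word(unsigned int n, uint8_t s)
{
	uint32_t w = 0;
	unsigned int r;

	for (r = 0; r < 4; r++)
		w |= (uint32_t)gmul(mix[r][n], s) << (8 * r);
	return w;
}

/* Returns 1 if table n equals table 0 rotated left by 8 * n bits for all s. */
static int rotation_identity_holds(void)
{
	unsigned int s, n;

	for (s = 0; s < 256; s++)
		for (n = 1; n < 4; n++)
			if (ft_word(n, (uint8_t)s) !=
			    rol32(ft_word(0, (uint8_t)s), 8 * n))
				return 0;
	return 1;
}
```

Since the S-box is a permutation of the byte values, looping over every s
covers every table entry. The footprint figure follows from the sizes:
four tables of 256 u32 entries take 4 * 256 * 4 = 4096 bytes, a single
table takes 1024 bytes, which is the 75% reduction quoted above.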

Comments

Ard Biesheuvel Sept. 30, 2018, 11:14 a.m. | #1
On 30 September 2018 at 10:58, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> -       d[0] = d0;
> -       d[1] = d1;
> -       d[2] = d2;
> -       d[3] = d3;
> +       d[0] = cpu_to_le32(d0 ^ le32_to_cpu(k[0]));
> +       d[1] = cpu_to_le32(d1 ^ le32_to_cpu(k[1]));
> +       d[2] = cpu_to_le32(d2 ^ le32_to_cpu(k[2]));
> +       d[3] = cpu_to_le32(d3 ^ le32_to_cpu(k[3]));

I suppose this

> +       d[0] = cpu_to_le32(d0) ^ k[0];
> +       d[1] = cpu_to_le32(d1) ^ k[1];
> +       d[2] = cpu_to_le32(d2) ^ k[2];
> +       d[3] = cpu_to_le32(d3) ^ k[3];

should work fine as well

>  }
>
>  #endif /* _CRYPTO_AEGIS_H */
> --
> 2.19.0
>
Ondrej Mosnacek Oct. 1, 2018, 7:50 a.m. | #2
Hi Ard,

On Sun, Sep 30, 2018 at 10:59 AM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> Use the correct __le32 annotation and accessors to perform the
> single round of AES encryption performed inside the AEGIS transform.
> Otherwise, tcrypt reports:
>
>   alg: aead: Test 1 failed on encryption for aegis128-generic
>   00000000: 6c 25 25 4a 3c 10 1d 27 2b c1 d4 84 9a ef 7f 6e
>   alg: aead: Test 1 failed on encryption for aegis128l-generic
>   00000000: cd c6 e3 b8 a0 70 9d 8e c2 4f 6f fe 71 42 df 28
>   alg: aead: Test 1 failed on encryption for aegis256-generic
>   00000000: aa ed 07 b1 96 1d e9 e6 f2 ed b5 8e 1c 5f dc 1c


Hm...  I think the reason I made a mistake here is that I first had a
version with the AES table hard-coded and I had an #ifdef <is big
endian> #else #endif there with values for little-endian and
big-endian variants.  Then I realized the aes_generic module exports
crypto_ft_tab and rewrote the code to use that.  Somewhere along
the way I forgot to check if the aes_generic table uses the same trick
and correct the code...

It would be nice to apply the same optimization to aes_generic.c, but
unfortunately the current tables are exported so changing the
convention would break external modules that use them :/

>
> While at it, let's refer to the first precomputed table only, and
> derive the other ones by rotation. This reduces the D-cache footprint
> by 75%, and the extra rotations should be cheap or even free on
> load/store architectures (and x86 has its own AES-NI based
> implementation).


Could you maybe extract this into a separate patch?  I don't think we
should mix functional and performance fixes together.



Thanks,

-- 
Ondrej Mosnacek <omosnace at redhat dot com>
Associate Software Engineer, Security Technologies
Red Hat, Inc.
Ard Biesheuvel Oct. 1, 2018, 7:53 a.m. | #3
On 1 October 2018 at 09:50, Ondrej Mosnacek <omosnace@redhat.com> wrote:
> Hi Ard,
>
> On Sun, Sep 30, 2018 at 10:59 AM Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
>> Use the correct __le32 annotation and accessors to perform the
>> single round of AES encryption performed inside the AEGIS transform.
>> Otherwise, tcrypt reports:
>>
>>   alg: aead: Test 1 failed on encryption for aegis128-generic
>>   00000000: 6c 25 25 4a 3c 10 1d 27 2b c1 d4 84 9a ef 7f 6e
>>   alg: aead: Test 1 failed on encryption for aegis128l-generic
>>   00000000: cd c6 e3 b8 a0 70 9d 8e c2 4f 6f fe 71 42 df 28
>>   alg: aead: Test 1 failed on encryption for aegis256-generic
>>   00000000: aa ed 07 b1 96 1d e9 e6 f2 ed b5 8e 1c 5f dc 1c
>
> Hm...  I think the reason I made a mistake here is that I first had a
> version with the AES table hard-coded and I had an #ifdef <is big
> endian> #else #endif there with values for little-endian and
> big-endian variants.  Then I realized the aes_generic module exports
> crypto_ft_tab and rewrote the code to use that.  Somewhere along
> the way I forgot to check if the aes_generic table uses the same trick
> and correct the code...
>
> It would be nice to apply the same optimization to aes_generic.c, but
> unfortunately the current tables are exported so changing the
> convention would break external modules that use them :/
>


Indeed. I am doing some refactoring work on the AES code, which is how
I ran into this in the first place.

https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=for-kernelci

>>
>> While at it, let's refer to the first precomputed table only, and
>> derive the other ones by rotation. This reduces the D-cache footprint
>> by 75%, and the extra rotations should be cheap or even free on
>> load/store architectures (and x86 has its own AES-NI based
>> implementation).
>
> Could you maybe extract this into a separate patch?  I don't think we
> should mix functional and performance fixes together.
>


Yeah, good point. I will do that and fold in the simplification.

Ondrej Mosnacek Oct. 1, 2018, 8 a.m. | #4
On Sun, Sep 30, 2018 at 1:14 PM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> On 30 September 2018 at 10:58, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

> > -       d[0] = d0;
> > -       d[1] = d1;
> > -       d[2] = d2;
> > -       d[3] = d3;
> > +       d[0] = cpu_to_le32(d0 ^ le32_to_cpu(k[0]));
> > +       d[1] = cpu_to_le32(d1 ^ le32_to_cpu(k[1]));
> > +       d[2] = cpu_to_le32(d2 ^ le32_to_cpu(k[2]));
> > +       d[3] = cpu_to_le32(d3 ^ le32_to_cpu(k[3]));
>
> I suppose this
>
> > +       d[0] = cpu_to_le32(d0) ^ k[0];
> > +       d[1] = cpu_to_le32(d1) ^ k[1];
> > +       d[2] = cpu_to_le32(d2) ^ k[2];
> > +       d[3] = cpu_to_le32(d3) ^ k[3];
>
> should work fine as well


Yeah, that looks nicer, but I'm not sure if it is completely OK to do
bitwise/arithmetic operations directly on the __[lb]e* types...  Maybe
yes, but the code I've seen that used them usually seemed to treat
them as opaque types.



--
Ondrej Mosnacek <omosnace at redhat dot com>
Associate Software Engineer, Security Technologies
Red Hat, Inc.
Ard Biesheuvel Oct. 1, 2018, 8:01 a.m. | #5
On 1 October 2018 at 10:00, Ondrej Mosnacek <omosnace@redhat.com> wrote:
> On Sun, Sep 30, 2018 at 1:14 PM Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
>> On 30 September 2018 at 10:58, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

>> > -       d[0] = d0;
>> > -       d[1] = d1;
>> > -       d[2] = d2;
>> > -       d[3] = d3;
>> > +       d[0] = cpu_to_le32(d0 ^ le32_to_cpu(k[0]));
>> > +       d[1] = cpu_to_le32(d1 ^ le32_to_cpu(k[1]));
>> > +       d[2] = cpu_to_le32(d2 ^ le32_to_cpu(k[2]));
>> > +       d[3] = cpu_to_le32(d3 ^ le32_to_cpu(k[3]));
>>
>> I suppose this
>>
>> > +       d[0] = cpu_to_le32(d0) ^ k[0];
>> > +       d[1] = cpu_to_le32(d1) ^ k[1];
>> > +       d[2] = cpu_to_le32(d2) ^ k[2];
>> > +       d[3] = cpu_to_le32(d3) ^ k[3];
>>
>> should work fine as well
>
> Yeah, that looks nicer, but I'm not sure if it is completely OK to do
> bitwise/arithmetic operations directly on the __[lb]e* types...  Maybe
> yes, but the code I've seen that used them usually seemed to treat
> them as opaque types.
>


No, xor is fine with __le/__be types
Ondrej Mosnacek Oct. 1, 2018, 8:06 a.m. | #6
On Mon, Oct 1, 2018 at 10:01 AM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> On 1 October 2018 at 10:00, Ondrej Mosnacek <omosnace@redhat.com> wrote:
> > On Sun, Sep 30, 2018 at 1:14 PM Ard Biesheuvel
> > <ard.biesheuvel@linaro.org> wrote:
> >> On 30 September 2018 at 10:58, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

> >> > -       d[0] = d0;
> >> > -       d[1] = d1;
> >> > -       d[2] = d2;
> >> > -       d[3] = d3;
> >> > +       d[0] = cpu_to_le32(d0 ^ le32_to_cpu(k[0]));
> >> > +       d[1] = cpu_to_le32(d1 ^ le32_to_cpu(k[1]));
> >> > +       d[2] = cpu_to_le32(d2 ^ le32_to_cpu(k[2]));
> >> > +       d[3] = cpu_to_le32(d3 ^ le32_to_cpu(k[3]));
> >>
> >> I suppose this
> >>
> >> > +       d[0] = cpu_to_le32(d0) ^ k[0];
> >> > +       d[1] = cpu_to_le32(d1) ^ k[1];
> >> > +       d[2] = cpu_to_le32(d2) ^ k[2];
> >> > +       d[3] = cpu_to_le32(d3) ^ k[3];
> >>
> >> should work fine as well
> >
> > Yeah, that looks nicer, but I'm not sure if it is completely OK to do
> > bitwise/arithmetic operations directly on the __[lb]e* types...  Maybe
> > yes, but the code I've seen that used them usually seemed to treat
> > them as opaque types.
> >
>
> No, xor is fine with __le/__be types


Ah, OK then.  Good to know :)

-- 
Ondrej Mosnacek <omosnace at redhat dot com>
Associate Software Engineer, Security Technologies
Red Hat, Inc.
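
The two store variants discussed in the comments give the same result
because a byte swap only permutes bits, and a bit permutation distributes
over XOR: swab(a ^ b) == swab(a) ^ swab(b). A minimal userspace sketch of
that argument follows. swab32_demo(), store_v1() and store_v2() are
hypothetical names for this example; swab32_demo() stands in for what
cpu_to_le32() and le32_to_cpu() do on a big-endian host (on a
little-endian host both are no-ops, so the two forms are trivially
identical).

```c
#include <stdint.h>

/* Reverse the bytes of a 32-bit word (models cpu_to_le32() on big endian). */
static uint32_t swab32_demo(uint32_t x)
{
	return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
	       ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Form used in the patch: bring the key into CPU order, XOR, convert back. */
static uint32_t store_v1(uint32_t d, uint32_t k_le)
{
	return swab32_demo(d ^ swab32_demo(k_le));
}

/* Suggested simplification: convert only the computed word, XOR as __le32. */
static uint32_t store_v2(uint32_t d, uint32_t k_le)
{
	return swab32_demo(d) ^ k_le;
}
```

The second form needs one conversion per word instead of two, which is
presumably why it reads as the nicer variant in the discussion.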

Patch

diff --git a/crypto/aegis.h b/crypto/aegis.h
index f1c6900ddb80..84d3e07a3c33 100644
--- a/crypto/aegis.h
+++ b/crypto/aegis.h
@@ -21,7 +21,7 @@ 
 
 union aegis_block {
 	__le64 words64[AEGIS_BLOCK_SIZE / sizeof(__le64)];
-	u32 words32[AEGIS_BLOCK_SIZE / sizeof(u32)];
+	__le32 words32[AEGIS_BLOCK_SIZE / sizeof(__le32)];
 	u8 bytes[AEGIS_BLOCK_SIZE];
 };
 
@@ -59,22 +59,19 @@  static void crypto_aegis_aesenc(union aegis_block *dst,
 {
 	u32 *d = dst->words32;
 	const u8  *s  = src->bytes;
-	const u32 *k  = key->words32;
+	const __le32 *k  = key->words32;
 	const u32 *t0 = crypto_ft_tab[0];
-	const u32 *t1 = crypto_ft_tab[1];
-	const u32 *t2 = crypto_ft_tab[2];
-	const u32 *t3 = crypto_ft_tab[3];
 	u32 d0, d1, d2, d3;
 
-	d0 = t0[s[ 0]] ^ t1[s[ 5]] ^ t2[s[10]] ^ t3[s[15]] ^ k[0];
-	d1 = t0[s[ 4]] ^ t1[s[ 9]] ^ t2[s[14]] ^ t3[s[ 3]] ^ k[1];
-	d2 = t0[s[ 8]] ^ t1[s[13]] ^ t2[s[ 2]] ^ t3[s[ 7]] ^ k[2];
-	d3 = t0[s[12]] ^ t1[s[ 1]] ^ t2[s[ 6]] ^ t3[s[11]] ^ k[3];
+	d0 = t0[s[ 0]] ^ rol32(t0[s[ 5]], 8) ^ rol32(t0[s[10]], 16) ^ rol32(t0[s[15]], 24);
+	d1 = t0[s[ 4]] ^ rol32(t0[s[ 9]], 8) ^ rol32(t0[s[14]], 16) ^ rol32(t0[s[ 3]], 24);
+	d2 = t0[s[ 8]] ^ rol32(t0[s[13]], 8) ^ rol32(t0[s[ 2]], 16) ^ rol32(t0[s[ 7]], 24);
+	d3 = t0[s[12]] ^ rol32(t0[s[ 1]], 8) ^ rol32(t0[s[ 6]], 16) ^ rol32(t0[s[11]], 24);
 
-	d[0] = d0;
-	d[1] = d1;
-	d[2] = d2;
-	d[3] = d3;
+	d[0] = cpu_to_le32(d0 ^ le32_to_cpu(k[0]));
+	d[1] = cpu_to_le32(d1 ^ le32_to_cpu(k[1]));
+	d[2] = cpu_to_le32(d2 ^ le32_to_cpu(k[2]));
+	d[3] = cpu_to_le32(d3 ^ le32_to_cpu(k[3]));
 }
 
 #endif /* _CRYPTO_AEGIS_H */
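
For readers who want to try the technique outside the kernel, here is a
rough userspace sketch of a single AES round that uses only the first
forward table plus rotations, and stores the result byte by byte so that
the output does not depend on host endianness. It is not the kernel code:
sbox, ft0, init_tables(), aesenc_t0() and aesenc_ref() are names invented
for this example, ft0 is generated at run time as a stand-in for
crypto_ft_tab[0], and aesenc_ref() is a plain textbook round included
only to cross-check the table version.

```c
#include <stdint.h>

static uint8_t sbox[256];
static uint32_t ft0[256];	/* stand-in for crypto_ft_tab[0] */

static uint8_t rotl8(uint8_t x, unsigned int n)
{
	return (uint8_t)((x << n) | (x >> (8 - n)));
}

static uint32_t rol32(uint32_t w, unsigned int n)
{
	return (w << n) | (w >> (32 - n));
}

/* Multiply by 2 in GF(2^8) modulo the AES polynomial. */
static uint8_t xtime(uint8_t x)
{
	return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1b : 0x00));
}

/* Generate the AES S-box and the first forward table at run time. */
static void init_tables(void)
{
	uint8_t p = 1, q = 1;
	unsigned int i;

	do {
		p ^= xtime(p);			/* p *= 3 */
		q ^= (uint8_t)(q << 1);		/* q /= 3 */
		q ^= (uint8_t)(q << 2);
		q ^= (uint8_t)(q << 4);
		if (q & 0x80)
			q ^= 0x09;
		/* q is now the inverse of p; apply the affine transform */
		sbox[p] = q ^ rotl8(q, 1) ^ rotl8(q, 2) ^ rotl8(q, 3) ^
			  rotl8(q, 4) ^ 0x63;
	} while (p != 1);
	sbox[0] = 0x63;

	for (i = 0; i < 256; i++) {
		uint8_t s = sbox[i];

		/* bytes from least to most significant: 2s, s, s, 3s */
		ft0[i] = (uint32_t)xtime(s) | ((uint32_t)s << 8) |
			 ((uint32_t)s << 16) | ((uint32_t)(xtime(s) ^ s) << 24);
	}
}

/* One AES round using ft0 only; tables 1..3 are replaced by rotations. */
static void aesenc_t0(uint8_t dst[16], const uint8_t s[16], const uint8_t key[16])
{
	uint32_t d[4];
	unsigned int c;

	for (c = 0; c < 4; c++)
		d[c] = ft0[s[4 * c]] ^
		       rol32(ft0[s[(4 * c + 5) % 16]], 8) ^
		       rol32(ft0[s[(4 * c + 10) % 16]], 16) ^
		       rol32(ft0[s[(4 * c + 15) % 16]], 24);

	/* little-endian store of each word, then XOR with the key bytes */
	for (c = 0; c < 16; c++)
		dst[c] = (uint8_t)(d[c / 4] >> (8 * (c % 4))) ^ key[c];
}

/* Textbook reference: the same round computed byte by byte, without ft0. */
static void aesenc_ref(uint8_t dst[16], const uint8_t s[16], const uint8_t key[16])
{
	uint8_t a[4], out[16];
	unsigned int c, r;

	for (c = 0; c < 4; c++) {
		for (r = 0; r < 4; r++)		/* ShiftRows + SubBytes */
			a[r] = sbox[s[(4 * c + 5 * r) % 16]];
		/* MixColumns on one column */
		out[4 * c + 0] = xtime(a[0]) ^ xtime(a[1]) ^ a[1] ^ a[2] ^ a[3];
		out[4 * c + 1] = a[0] ^ xtime(a[1]) ^ xtime(a[2]) ^ a[2] ^ a[3];
		out[4 * c + 2] = a[0] ^ a[1] ^ xtime(a[2]) ^ xtime(a[3]) ^ a[3];
		out[4 * c + 3] = xtime(a[0]) ^ a[0] ^ a[1] ^ a[2] ^ xtime(a[3]);
	}
	for (c = 0; c < 16; c++)		/* AddRoundKey */
		dst[c] = out[c] ^ key[c];
}
```

After calling init_tables() once, feeding the same block and key to
aesenc_t0() and aesenc_ref() should produce identical output on both
little-endian and big-endian hosts, since neither function reads or
writes the block through a native 32-bit pointer.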