diff mbox series

[v5.10.y] crypto: caam/jr - Fix possible caam_jr crash

Message ID 20231221093209.984929-1-Kun.Song@windriver.com
State New
Headers show
Series [v5.10.y] crypto: caam/jr - Fix possible caam_jr crash | expand

Commit Message

Kun Song Dec. 21, 2023, 9:32 a.m. UTC
Test environment:
  Linux kernel version: 5.10.y
  Architecture: ARM Cortex-A
  Processor: NXP Layerscape LS1028

Crash in reboot tests:
  Reproducibility: 1%

If a job ring is still allocated, Once caam_jr_remove() returned,
jrpriv will be freed and the registers will get unmapped.Then
caam_jr_interrupt will get error irqstate value.
So such a job ring will probably crash.Crash info is below:
--------------------------------------
RBS Sys: Restart ordered by epghd(0x1)
RBS Sys: RESTARTING
caam_jr 8030000.jr: Device is busy
caam_jr 8020000.jr: Device is busy
caam_jr 8010000.jr: Device is busy
arm-smmu 5000000.iommu: disabling translation
caam_jr 8010000.jr: job ring error: irqstate: 00000103
------------[ cut here ]------------
kernel BUG at drivers/crypto/caam/jr.c:288!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
Hardware name: freescale ls1028a/ls1028a, BIOS 2019.10+fsl+g3d542a3d22
pstate: 60000085 (nZCv daIf -PAN -UAO -TCO BTYPE=--)
pc : caam_jr_interrupt+0x128/0x130
lr : caam_jr_interrupt+0x128/0x130
sp : ffff80001144be50
x29: ffff80001144be50 x28: ffff800010f61008
x27: ffff800011228000 x26: ffff800010f61008
x25: ffff000027904800 x24: 0000000000000072
x23: ffff8000113ba140 x22: 0000000000000001
x21: ffff800011433000 x20: ffff000027904e80
x19: 0000000000000103 x18: 0000000000000030
x17: 0000000000000000 x16: 0000000000000000
x15: ffffffffffffffff x14: ffff8000113ebcb8
x13: 0000000000000008 x12: fffffffffffcac8f
x11: ffff00000038bb00 x10: ffff8000112a1e90
x9 : ffff8000100a99c0 x8 : ffff800011249e90
x7 : ffff8000112a1e90 x6 : 0000000000000000
x5 : 0000000000000000 x4 : 0000000000000000
x3 : 0000000000000000 x2 : 0000000000000000
x1 : 0000000000000000 x0 : ffff0000279ac600
Call trace:
 caam_jr_interrupt+0x128/0x130
 __handle_irq_event_percpu+0x84/0x2b0
 handle_irq_event+0x6c/0xfc
 handle_fasteoi_irq+0xc8/0x230
 __handle_domain_irq+0xb8/0x130
 gic_handle_irq+0x90/0x158
 el1_irq+0xcc/0x180
 _raw_spin_lock_irq+0x0/0x90
 caam_rng_read_one.constprop.0+0x248/0x370
 caam_read+0x8c/0xb0
 hwrng_fillfn+0xfc/0x1cc
 kthread+0x14c/0x160
 ret_from_fork+0x10/0x30
Code: 2a1303e2 d00029a1 910ee021 940b2b1d (d4210000)
---[ end trace f04d90f3ad0da5f4 ]---
Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt
Kernel Offset: disabled
CPU features: 0x28040022,21002008
Memory Limit: none
--------------------------------------

Disabling interrupts is to ensure that the device removal
operation is not interrupted.

Signed-off-by: Kun Song <Kun.Song@windriver.com>
Reviewed-by: Hen Guo <Heng.Guo@windriver.com>
---
 drivers/crypto/caam/jr.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Gaurav Jain (OSS) Dec. 22, 2023, 11:43 a.m. UTC | #1
Hi Kun

Have you seen this issue in later kernel versions > 5.10 ?

Regards
Gaurav Jain

> -----Original Message-----
> From: Kun Song <Kun.Song@windriver.com>
> Sent: Thursday, December 21, 2023 3:02 PM
> To: Horia Geanta <horia.geanta@nxp.com>; Aymen Sghaier
> <aymen.sghaier@nxp.com>; herbert@gondor.apana.org.au;
> davem@davemloft.net
> Cc: linux-crypto@vger.kernel.org; filip.pudak@windriver.com;
> heng.guo@windriver.com; kun.song@windriver.com
> Subject: [PATCH v5.10.y] crypto: caam/jr - Fix possible caam_jr crash
> 
> Test environment:
>   Linux kernel version: 5.10.y
>   Architecture: ARM Cortex-A
>   Processor: NXP Layerscape LS1028
> 
> Crash in reboot tests:
>   Reproducibility: 1%
> 
> If a job ring is still allocated, Once caam_jr_remove() returned, jrpriv will be
> freed and the registers will get unmapped.Then caam_jr_interrupt will get error
> irqstate value.
> So such a job ring will probably crash.Crash info is below:
> --------------------------------------
> RBS Sys: Restart ordered by epghd(0x1)
> RBS Sys: RESTARTING
> caam_jr 8030000.jr: Device is busy
> caam_jr 8020000.jr: Device is busy
> caam_jr 8010000.jr: Device is busy
> arm-smmu 5000000.iommu: disabling translation caam_jr 8010000.jr: job ring
> error: irqstate: 00000103 ------------[ cut here ]------------ kernel BUG at
> drivers/crypto/caam/jr.c:288!
> Internal error: Oops - BUG: 0 [#1] PREEMPT SMP Hardware name: freescale
> ls1028a/ls1028a, BIOS 2019.10+fsl+g3d542a3d22
> pstate: 60000085 (nZCv daIf -PAN -UAO -TCO BTYPE=--) pc :
> caam_jr_interrupt+0x128/0x130 lr : caam_jr_interrupt+0x128/0x130 sp :
> ffff80001144be50
> x29: ffff80001144be50 x28: ffff800010f61008
> x27: ffff800011228000 x26: ffff800010f61008
> x25: ffff000027904800 x24: 0000000000000072
> x23: ffff8000113ba140 x22: 0000000000000001
> x21: ffff800011433000 x20: ffff000027904e80
> x19: 0000000000000103 x18: 0000000000000030
> x17: 0000000000000000 x16: 0000000000000000
> x15: ffffffffffffffff x14: ffff8000113ebcb8
> x13: 0000000000000008 x12: fffffffffffcac8f
> x11: ffff00000038bb00 x10: ffff8000112a1e90
> x9 : ffff8000100a99c0 x8 : ffff800011249e90
> x7 : ffff8000112a1e90 x6 : 0000000000000000
> x5 : 0000000000000000 x4 : 0000000000000000
> x3 : 0000000000000000 x2 : 0000000000000000
> x1 : 0000000000000000 x0 : ffff0000279ac600 Call trace:
>  caam_jr_interrupt+0x128/0x130
>  __handle_irq_event_percpu+0x84/0x2b0
>  handle_irq_event+0x6c/0xfc
>  handle_fasteoi_irq+0xc8/0x230
>  __handle_domain_irq+0xb8/0x130
>  gic_handle_irq+0x90/0x158
>  el1_irq+0xcc/0x180
>  _raw_spin_lock_irq+0x0/0x90
>  caam_rng_read_one.constprop.0+0x248/0x370
>  caam_read+0x8c/0xb0
>  hwrng_fillfn+0xfc/0x1cc
>  kthread+0x14c/0x160
>  ret_from_fork+0x10/0x30
> Code: 2a1303e2 d00029a1 910ee021 940b2b1d (d4210000) ---[ end trace
> f04d90f3ad0da5f4 ]--- Kernel panic - not syncing: Oops - BUG: Fatal exception in
> interrupt Kernel Offset: disabled CPU features: 0x28040022,21002008 Memory
> Limit: none
> --------------------------------------
> 
> Disabling interrupts is to ensure that the device removal operation is not
> interrupted.
> 
> Signed-off-by: Kun Song <Kun.Song@windriver.com>
> Reviewed-by: Hen Guo <Heng.Guo@windriver.com>
> ---
>  drivers/crypto/caam/jr.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/crypto/caam/jr.c b/drivers/crypto/caam/jr.c index
> 6f669966ba2c..d191e8caa1ad 100644
> --- a/drivers/crypto/caam/jr.c
> +++ b/drivers/crypto/caam/jr.c
> @@ -135,6 +135,10 @@ static int caam_jr_remove(struct platform_device
> *pdev)
>  	jrdev = &pdev->dev;
>  	jrpriv = dev_get_drvdata(jrdev);
> 
> +	/* Disabling interrupts is ensure that the device removal operation
> +	 * is not interrupted by interrupts.
> +	 */
> +	devm_free_irq(jrdev, jrpriv->irq, jrdev);
>  	if (jrpriv->hwrng)
>  		caam_rng_exit(jrdev->parent);
> 
> --
> 2.26.1
>
Gaurav Jain (OSS) Jan. 10, 2024, 6:57 a.m. UTC | #2
Hello SK

I am submitting and replying patches using gaurav.jain@nxp.com

In Later kernel versions we have provided fixes related to job ring flush and there are other changes as well.
It would be better to add these changes on top of 5.10 tree as we also run multiple tests and not observed this issue.

Regards
Gaurav Jain


> -----Original Message-----
> From: Kun Song <Kun.Song@windriver.com>
> Sent: Wednesday, January 10, 2024 7:09 AM
> To: Gaurav Jain (OSS) <gaurav.jain@oss.nxp.com>
> Cc: Kun.Song@windriver.com; Varun Sethi <V.Sethi@nxp.com>; Aymen Sghaier
> <aymen.sghaier@nxp.com>; davem@davemloft.net; filip.pudak@windriver.com;
> heng.guo@windriver.com; herbert@gondor.apana.org.au; Horia Geanta
> <horia.geanta@nxp.com>; linux-crypto@vger.kernel.org; Meenakshi Aggarwal
> <meenakshi.aggarwal@nxp.com>
> Subject: [REMINDER] Re: [PATCH v5.10.y] crypto: caam/jr - Fix possible caam_jr
> crash
> 
> Hello Gaurav,
> 
>   I hope you receive this email. I'm following up on a patch I submitted a few
> weeks ago. There doesn't seem to be any response yet and I want to make sure
> it gets pushed forward.
> 
>   I know you're busy and thank you for taking the time to focus on this.If you
> have any concerns or feedback please let me know and I'll be happy to address it.
> 
> Best regards,
> SK
Gaurav Jain Jan. 10, 2024, 7 a.m. UTC | #3
Hello SK

I am submitting and replying patches using gaurav.jain@nxp.com

In Later kernel versions we have provided fixes related to job ring flush and there are other changes as well.
It would be better to add these changes on top of 5.10 tree as we also run multiple tests and not observed this issue.

Regards
Gaurav Jain

> -----Original Message-----
> From: Kun Song <Kun.Song@windriver.com>
> Sent: Wednesday, January 10, 2024 7:09 AM
> To: Gaurav Jain (OSS) <gaurav.jain@oss.nxp.com>
> Cc: Kun.Song@windriver.com; Varun Sethi <V.Sethi@nxp.com>; Aymen Sghaier
> <aymen.sghaier@nxp.com>; davem@davemloft.net; filip.pudak@windriver.com;
> heng.guo@windriver.com; herbert@gondor.apana.org.au; Horia Geanta
> <horia.geanta@nxp.com>; linux-crypto@vger.kernel.org; Meenakshi Aggarwal
> <meenakshi.aggarwal@nxp.com>
> Subject: [REMINDER] Re: [PATCH v5.10.y] crypto: caam/jr - Fix possible caam_jr
> crash
> 
> Hello Gaurav,
> 
>   I hope you receive this email. I'm following up on a patch I submitted a few
> weeks ago. There doesn't seem to be any response yet and I want to make sure
> it gets pushed forward.
> 
>   I know you're busy and thank you for taking the time to focus on this.If you
> have any concerns or feedback please let me know and I'll be happy to address it.
> 
> Best regards,
> SK
Horia Geanta Jan. 31, 2024, 1:34 p.m. UTC | #4
On 12/21/2023 11:32 AM, Kun Song wrote:
> Test environment:
>   Linux kernel version: 5.10.y
>   Architecture: ARM Cortex-A
>   Processor: NXP Layerscape LS1028
> 
> Crash in reboot tests:
>   Reproducibility: 1%
> 
Replying here to comment on the log.
I've added all the people from the latest reply, i.e.:
https://lore.kernel.org/linux-crypto/AM0PR04MB6004AECDD044F1E6BF3B6732E7682@AM0PR04MB6004.eurprd04.prod.outlook.com

> If a job ring is still allocated, Once caam_jr_remove() returned,
> jrpriv will be freed and the registers will get unmapped.Then
In this case, most likely the root cause is different (see below).

> caam_jr_interrupt will get error irqstate value.
> So such a job ring will probably crash.Crash info is below:
> --------------------------------------
> RBS Sys: Restart ordered by epghd(0x1)
> RBS Sys: RESTARTING
This looks like a system restart.

> caam_jr 8030000.jr: Device is busy
> caam_jr 8020000.jr: Device is busy
> caam_jr 8010000.jr: Device is busy
For some reason, there are still tfms accounted for on all these caam job rings.
Maybe the system restart is not handled correctly at some point,
hence some resource leaks (unallocated tfms).

As already discussed, exiting early from caam_jr .remove() callback will leave
the HW unquiesced (e.g. job rings not flushed), interrupts still active.

> arm-smmu 5000000.iommu: disabling translation
From here onward caam memory accesses won't be translated.

> caam_jr 8010000.jr: job ring error: irqstate: 00000103
The error code means caam was not able to write the status of the job
it just finished in the output job ring (which is allocated in memory).

Most likely this happened due to iommu no longer translating the access.

> 
> Disabling interrupts is to ensure that the device removal
> operation is not interrupted.
> 
> Signed-off-by: Kun Song <Kun.Song@windriver.com>
> Reviewed-by: Hen Guo <Heng.Guo@windriver.com>
> ---
>  drivers/crypto/caam/jr.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/crypto/caam/jr.c b/drivers/crypto/caam/jr.c
> index 6f669966ba2c..d191e8caa1ad 100644
> --- a/drivers/crypto/caam/jr.c
> +++ b/drivers/crypto/caam/jr.c
> @@ -135,6 +135,10 @@ static int caam_jr_remove(struct platform_device *pdev)
>  	jrdev = &pdev->dev;
>  	jrpriv = dev_get_drvdata(jrdev);
>  
> +	/* Disabling interrupts is ensure that the device removal operation
> +	 * is not interrupted by interrupts.
> +	 */
> +	devm_free_irq(jrdev, jrpriv->irq, jrdev);
>  	if (jrpriv->hwrng)
>  		caam_rng_exit(jrdev->parent);
>  
As pointed out by previous discussions, this is not enough.
Crashes could still occur due to crypto API users calling into caam driver,
which would be in an inconsistent state.

Whether .remove() being called is triggered by a system restart or
a manual device unbinding [1] is irrelevant, the driver mustn't crash.

I think Herbert's suggestion [2] on how to deal with HW devices going away
makes sense.

Not sure if the changes (thinking of crypto API part) could go into LTS kernels.
If not, we'll have to stick to caam driver-only changes (but IIUC
other crypto drivers are having the same issue with HW devices going away [3]).

Thanks,
Horia

[1] https://lore.kernel.org/linux-crypto/VI1PR04MB7023A7EC91599A537CB6A487EEB30@VI1PR04MB7023.eurprd04.prod.outlook.com/
[2] https://lore.kernel.org/linux-crypto/20190919134512.GA29320@gondor.apana.org.au/
[3] https://lore.kernel.org/linux-crypto/ZAqwTqw3lR+dnImO@gondor.apana.org.au/
diff mbox series

Patch

diff --git a/drivers/crypto/caam/jr.c b/drivers/crypto/caam/jr.c
index 6f669966ba2c..d191e8caa1ad 100644
--- a/drivers/crypto/caam/jr.c
+++ b/drivers/crypto/caam/jr.c
@@ -135,6 +135,10 @@  static int caam_jr_remove(struct platform_device *pdev)
 	jrdev = &pdev->dev;
 	jrpriv = dev_get_drvdata(jrdev);
 
+	/* Disabling interrupts is ensure that the device removal operation
+	 * is not interrupted by interrupts.
+	 */
+	devm_free_irq(jrdev, jrpriv->irq, jrdev);
 	if (jrpriv->hwrng)
 		caam_rng_exit(jrdev->parent);