diff mbox series

[v11,1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered

Message ID 20240204080144.7977-2-xueshuai@linux.alibaba.com
State New
Headers show
Series [v11,1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered | expand

Commit Message

Shuai Xue Feb. 4, 2024, 8:01 a.m. UTC
Synchronous error was detected as a result of user-space process accessing
a 2-bit uncorrected error. The CPU will take a synchronous error exception
such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
memory_failure() work which poisons the related page, unmaps the page, and
then sends a SIGBUS to the process, so that a system wide panic can be
avoided.

However, no memory_failure() work will be queued when abnormal synchronous
errors occur. These errors can include situations such as invalid PA,
unexpected severity, no memory failure config support, invalid GUID
section, etc. In such case, the user-space process will trigger SEA again.
This loop can potentially exceed the platform firmware threshold or even
trigger a kernel hard lockup, leading to a system reboot.

Fix it by performing a force kill if no memory_failure() work is queued
for synchronous errors.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 drivers/acpi/apei/ghes.c | 9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Borislav Petkov Feb. 19, 2024, 9:25 a.m. UTC | #1
On Sun, Feb 04, 2024 at 04:01:42PM +0800, Shuai Xue wrote:
> Synchronous error was detected as a result of user-space process accessing
> a 2-bit uncorrected error. The CPU will take a synchronous error exception
> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
> memory_failure() work which poisons the related page, unmaps the page, and
> then sends a SIGBUS to the process, so that a system wide panic can be
> avoided.
> 
> However, no memory_failure() work will be queued when abnormal synchronous
> errors occur. These errors can include situations such as invalid PA,
> unexpected severity, no memory failure config support, invalid GUID
> section, etc. In such case, the user-space process will trigger SEA again.
> This loop can potentially exceed the platform firmware threshold or even
> trigger a kernel hard lockup, leading to a system reboot.
> 
> Fix it by performing a force kill if no memory_failure() work is queued
> for synchronous errors.
> 
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> ---
>  drivers/acpi/apei/ghes.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 7b7c605166e0..0892550732d4 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -806,6 +806,15 @@ static bool ghes_do_proc(struct ghes *ghes,
>  		}
>  	}
>  
> +	/*
> +	 * If no memory failure work is queued for abnormal synchronous
> +	 * errors, do a force kill.
> +	 */
> +	if (sync && !queued) {
> +		pr_err("Sending SIGBUS to current task due to memory error not recovered");
> +		force_sig(SIGBUS);
> +	}

Except that there are a bunch of CXL GUIDs being handled there too and
this will sigbus those processes now automatically.

Lemme add the whole bunch from

  671a794c33c6 ("acpi/ghes: Process CXL Component Events")

for comment to Cc.
Shuai Xue Feb. 22, 2024, 2:07 a.m. UTC | #2
On 2024/2/19 17:25, Borislav Petkov wrote:
> On Sun, Feb 04, 2024 at 04:01:42PM +0800, Shuai Xue wrote:
>> Synchronous error was detected as a result of user-space process accessing
>> a 2-bit uncorrected error. The CPU will take a synchronous error exception
>> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
>> memory_failure() work which poisons the related page, unmaps the page, and
>> then sends a SIGBUS to the process, so that a system wide panic can be
>> avoided.
>>
>> However, no memory_failure() work will be queued when abnormal synchronous
>> errors occur. These errors can include situations such as invalid PA,
>> unexpected severity, no memory failure config support, invalid GUID
>> section, etc. In such case, the user-space process will trigger SEA again.
>> This loop can potentially exceed the platform firmware threshold or even
>> trigger a kernel hard lockup, leading to a system reboot.
>>
>> Fix it by performing a force kill if no memory_failure() work is queued
>> for synchronous errors.
>>
>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>> ---
>>  drivers/acpi/apei/ghes.c | 9 +++++++++
>>  1 file changed, 9 insertions(+)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 7b7c605166e0..0892550732d4 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -806,6 +806,15 @@ static bool ghes_do_proc(struct ghes *ghes,
>>  		}
>>  	}
>>  
>> +	/*
>> +	 * If no memory failure work is queued for abnormal synchronous
>> +	 * errors, do a force kill.
>> +	 */
>> +	if (sync && !queued) {
>> +		pr_err("Sending SIGBUS to current task due to memory error not recovered");
>> +		force_sig(SIGBUS);
>> +	}
> 
> Except that there are a bunch of CXL GUIDs being handled there too and
> this will sigbus those processes now automatically.

Before the CXL GUIDs added, @Tony confirmed that the HEST notifications are always
asynchronous on x86 platform, so only Synchronous External Abort (SEA) on ARM is
delivered as a synchronous notification.

Will the CXL component trigger synchronous events for which we need to terminate the
current process by sending sigbus to process?

> 
> Lemme add the whole bunch from
> 
>   671a794c33c6 ("acpi/ghes: Process CXL Component Events")
> 
> for comment to Cc.
> 

Thank you.

Best Regards,
Shuai
Dan Williams Feb. 23, 2024, 5:26 a.m. UTC | #3
Shuai Xue wrote:
> 
> 
> On 2024/2/19 17:25, Borislav Petkov wrote:
> > On Sun, Feb 04, 2024 at 04:01:42PM +0800, Shuai Xue wrote:
> >> Synchronous error was detected as a result of user-space process accessing
> >> a 2-bit uncorrected error. The CPU will take a synchronous error exception
> >> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
> >> memory_failure() work which poisons the related page, unmaps the page, and
> >> then sends a SIGBUS to the process, so that a system wide panic can be
> >> avoided.
> >>
> >> However, no memory_failure() work will be queued when abnormal synchronous
> >> errors occur. These errors can include situations such as invalid PA,
> >> unexpected severity, no memory failure config support, invalid GUID
> >> section, etc. In such case, the user-space process will trigger SEA again.
> >> This loop can potentially exceed the platform firmware threshold or even
> >> trigger a kernel hard lockup, leading to a system reboot.
> >>
> >> Fix it by performing a force kill if no memory_failure() work is queued
> >> for synchronous errors.
> >>
> >> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> >> ---
> >>  drivers/acpi/apei/ghes.c | 9 +++++++++
> >>  1 file changed, 9 insertions(+)
> >>
> >> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> >> index 7b7c605166e0..0892550732d4 100644
> >> --- a/drivers/acpi/apei/ghes.c
> >> +++ b/drivers/acpi/apei/ghes.c
> >> @@ -806,6 +806,15 @@ static bool ghes_do_proc(struct ghes *ghes,
> >>  		}
> >>  	}
> >>  
> >> +	/*
> >> +	 * If no memory failure work is queued for abnormal synchronous
> >> +	 * errors, do a force kill.
> >> +	 */
> >> +	if (sync && !queued) {
> >> +		pr_err("Sending SIGBUS to current task due to memory error not recovered");
> >> +		force_sig(SIGBUS);
> >> +	}
> > 
> > Except that there are a bunch of CXL GUIDs being handled there too and
> > this will sigbus those processes now automatically.
> 
> Before the CXL GUIDs added, @Tony confirmed that the HEST notifications are always
> asynchronous on x86 platform, so only Synchronous External Abort (SEA) on ARM is
> delivered as a synchronous notification.
> 
> Will the CXL component trigger synchronous events for which we need to terminate the
> current process by sending sigbus to process?

None of the CXL component errors should be handled as synchronous
events. They are either asynchronous protocol errors, or effectively
equivalent to CPER_SEC_PLATFORM_MEM notifications.
Jonathan Cameron Feb. 23, 2024, 12:08 p.m. UTC | #4
On Thu, 22 Feb 2024 21:26:43 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Shuai Xue wrote:
> > 
> > 
> > On 2024/2/19 17:25, Borislav Petkov wrote:  
> > > On Sun, Feb 04, 2024 at 04:01:42PM +0800, Shuai Xue wrote:  
> > >> Synchronous error was detected as a result of user-space process accessing
> > >> a 2-bit uncorrected error. The CPU will take a synchronous error exception
> > >> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
> > >> memory_failure() work which poisons the related page, unmaps the page, and
> > >> then sends a SIGBUS to the process, so that a system wide panic can be
> > >> avoided.
> > >>
> > >> However, no memory_failure() work will be queued when abnormal synchronous
> > >> errors occur. These errors can include situations such as invalid PA,
> > >> unexpected severity, no memory failure config support, invalid GUID
> > >> section, etc. In such case, the user-space process will trigger SEA again.
> > >> This loop can potentially exceed the platform firmware threshold or even
> > >> trigger a kernel hard lockup, leading to a system reboot.
> > >>
> > >> Fix it by performing a force kill if no memory_failure() work is queued
> > >> for synchronous errors.
> > >>
> > >> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> > >> ---
> > >>  drivers/acpi/apei/ghes.c | 9 +++++++++
> > >>  1 file changed, 9 insertions(+)
> > >>
> > >> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > >> index 7b7c605166e0..0892550732d4 100644
> > >> --- a/drivers/acpi/apei/ghes.c
> > >> +++ b/drivers/acpi/apei/ghes.c
> > >> @@ -806,6 +806,15 @@ static bool ghes_do_proc(struct ghes *ghes,
> > >>  		}
> > >>  	}
> > >>  
> > >> +	/*
> > >> +	 * If no memory failure work is queued for abnormal synchronous
> > >> +	 * errors, do a force kill.
> > >> +	 */
> > >> +	if (sync && !queued) {
> > >> +		pr_err("Sending SIGBUS to current task due to memory error not recovered");
> > >> +		force_sig(SIGBUS);
> > >> +	}  
> > > 
> > > Except that there are a bunch of CXL GUIDs being handled there too and
> > > this will sigbus those processes now automatically.  
> > 
> > Before the CXL GUIDs added, @Tony confirmed that the HEST notifications are always
> > asynchronous on x86 platform, so only Synchronous External Abort (SEA) on ARM is
> > delivered as a synchronous notification.
> > 
> > Will the CXL component trigger synchronous events for which we need to terminate the
> > current process by sending sigbus to process?  
> 
> None of the CXL component errors should be handled as synchronous
> events. They are either asynchronous protocol errors, or effectively
> equivalent to CPER_SEC_PLATFORM_MEM notifications.

Not a good example, CPER_SEC_PLATFORM_MEM is sometimes signaled via SEA.
Jonathan Cameron Feb. 23, 2024, 12:17 p.m. UTC | #5
On Fri, 23 Feb 2024 12:08:13 +0000
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:

> On Thu, 22 Feb 2024 21:26:43 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > Shuai Xue wrote:  
> > > 
> > > 
> > > On 2024/2/19 17:25, Borislav Petkov wrote:    
> > > > On Sun, Feb 04, 2024 at 04:01:42PM +0800, Shuai Xue wrote:    
> > > >> Synchronous error was detected as a result of user-space process accessing
> > > >> a 2-bit uncorrected error. The CPU will take a synchronous error exception
> > > >> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
> > > >> memory_failure() work which poisons the related page, unmaps the page, and
> > > >> then sends a SIGBUS to the process, so that a system wide panic can be
> > > >> avoided.
> > > >>
> > > >> However, no memory_failure() work will be queued when abnormal synchronous
> > > >> errors occur. These errors can include situations such as invalid PA,
> > > >> unexpected severity, no memory failure config support, invalid GUID
> > > >> section, etc. In such case, the user-space process will trigger SEA again.
> > > >> This loop can potentially exceed the platform firmware threshold or even
> > > >> trigger a kernel hard lockup, leading to a system reboot.
> > > >>
> > > >> Fix it by performing a force kill if no memory_failure() work is queued
> > > >> for synchronous errors.
> > > >>
> > > >> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> > > >> ---
> > > >>  drivers/acpi/apei/ghes.c | 9 +++++++++
> > > >>  1 file changed, 9 insertions(+)
> > > >>
> > > >> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > > >> index 7b7c605166e0..0892550732d4 100644
> > > >> --- a/drivers/acpi/apei/ghes.c
> > > >> +++ b/drivers/acpi/apei/ghes.c
> > > >> @@ -806,6 +806,15 @@ static bool ghes_do_proc(struct ghes *ghes,
> > > >>  		}
> > > >>  	}
> > > >>  
> > > >> +	/*
> > > >> +	 * If no memory failure work is queued for abnormal synchronous
> > > >> +	 * errors, do a force kill.
> > > >> +	 */
> > > >> +	if (sync && !queued) {
> > > >> +		pr_err("Sending SIGBUS to current task due to memory error not recovered");
> > > >> +		force_sig(SIGBUS);
> > > >> +	}    
> > > > 
> > > > Except that there are a bunch of CXL GUIDs being handled there too and
> > > > this will sigbus those processes now automatically.    
> > > 
> > > Before the CXL GUIDs added, @Tony confirmed that the HEST notifications are always
> > > asynchronous on x86 platform, so only Synchronous External Abort (SEA) on ARM is
> > > delivered as a synchronous notification.
> > > 
> > > Will the CXL component trigger synchronous events for which we need to terminate the
> > > current process by sending sigbus to process?    
> > 
> > None of the CXL component errors should be handled as synchronous
> > events. They are either asynchronous protocol errors, or effectively
> > equivalent to CPER_SEC_PLATFORM_MEM notifications.  
> 
> Not a good example, CPER_SEC_PLATFORM_MEM is sometimes signaled via SEA.
> 

Premature send.:(

One example I can point at is how we do signaling of memory
errors detected by the host into a VM on arm64.
https://elixir.bootlin.com/qemu/latest/source/hw/acpi/ghes.c#L391
CPER_SEC_PLATFORM_MEM via ARM Synchronous External Abort (SEA).

Right now we've only used async in QEMU for proposed CXL error
CPER records signalling but your reference to them being similar
to CPER_SEC_PLATFORM_MEM is valid so 'maybe' they will be
synchronous in some physical systems as it's one viable way to
provide rich information for synchronous reception of poison.
For the VM case my assumption today is we don't care about providing the
VM with rich data, so CPER_SEC_PLATFORM_MEM is fine as a path for
errors whether from CXL CPER records or not.

Jonathan
Shuai Xue Feb. 24, 2024, 6:08 a.m. UTC | #6
On 2024/2/23 20:17, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 12:08:13 +0000
> Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> 
>> On Thu, 22 Feb 2024 21:26:43 -0800
>> Dan Williams <dan.j.williams@intel.com> wrote:
>>
>>> Shuai Xue wrote:  
>>>>
>>>>
>>>> On 2024/2/19 17:25, Borislav Petkov wrote:    
>>>>> On Sun, Feb 04, 2024 at 04:01:42PM +0800, Shuai Xue wrote:    
>>>>>> Synchronous error was detected as a result of user-space process accessing
>>>>>> a 2-bit uncorrected error. The CPU will take a synchronous error exception
>>>>>> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
>>>>>> memory_failure() work which poisons the related page, unmaps the page, and
>>>>>> then sends a SIGBUS to the process, so that a system wide panic can be
>>>>>> avoided.
>>>>>>
>>>>>> However, no memory_failure() work will be queued when abnormal synchronous
>>>>>> errors occur. These errors can include situations such as invalid PA,
>>>>>> unexpected severity, no memory failure config support, invalid GUID
>>>>>> section, etc. In such case, the user-space process will trigger SEA again.
>>>>>> This loop can potentially exceed the platform firmware threshold or even
>>>>>> trigger a kernel hard lockup, leading to a system reboot.
>>>>>>
>>>>>> Fix it by performing a force kill if no memory_failure() work is queued
>>>>>> for synchronous errors.
>>>>>>
>>>>>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>>>>>> ---
>>>>>>  drivers/acpi/apei/ghes.c | 9 +++++++++
>>>>>>  1 file changed, 9 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>>>>> index 7b7c605166e0..0892550732d4 100644
>>>>>> --- a/drivers/acpi/apei/ghes.c
>>>>>> +++ b/drivers/acpi/apei/ghes.c
>>>>>> @@ -806,6 +806,15 @@ static bool ghes_do_proc(struct ghes *ghes,
>>>>>>  		}
>>>>>>  	}
>>>>>>  
>>>>>> +	/*
>>>>>> +	 * If no memory failure work is queued for abnormal synchronous
>>>>>> +	 * errors, do a force kill.
>>>>>> +	 */
>>>>>> +	if (sync && !queued) {
>>>>>> +		pr_err("Sending SIGBUS to current task due to memory error not recovered");
>>>>>> +		force_sig(SIGBUS);
>>>>>> +	}    
>>>>>
>>>>> Except that there are a bunch of CXL GUIDs being handled there too and
>>>>> this will sigbus those processes now automatically.    
>>>>
>>>> Before the CXL GUIDs added, @Tony confirmed that the HEST notifications are always
>>>> asynchronous on x86 platform, so only Synchronous External Abort (SEA) on ARM is
>>>> delivered as a synchronous notification.
>>>>
>>>> Will the CXL component trigger synchronous events for which we need to terminate the
>>>> current process by sending sigbus to process?    
>>>
>>> None of the CXL component errors should be handled as synchronous
>>> events. They are either asynchronous protocol errors, or effectively
>>> equivalent to CPER_SEC_PLATFORM_MEM notifications.  
>>
>> Not a good example, CPER_SEC_PLATFORM_MEM is sometimes signaled via SEA.
>>
> 
> Premature send.:(
> 
> One example I can point at is how we do signaling of memory
> errors detected by the host into a VM on arm64.
> https://elixir.bootlin.com/qemu/latest/source/hw/acpi/ghes.c#L391
> CPER_SEC_PLATFORM_MEM via ARM Synchronous External Abort (SEA).
> 
> Right now we've only used async in QEMU for proposed CXL error
> CPER records signalling but your reference to them being similar
> to CPER_SEC_PLATFORM_MEM is valid so 'maybe' they will be
> synchronous in some physical systems as it's one viable way to
> provide rich information for synchronous reception of poison.
> For the VM case my assumption today is we don't care about providing the
> VM with rich data, so CPER_SEC_PLATFORM_MEM is fine as a path for
> errors whether from CXL CPER records or not.
> 
> Jonathan

Thank you for your confirmation and explanation.

So I think the condition:

- `sync` for synchronous event,
- `!queued` for CPER_SEC_PLATFORM_MEM notifications which do not handle memory failures.

is fine.

@Borislav, do you have any other concerns?

Best Regards,
Shuai
Dan Williams Feb. 24, 2024, 7:40 p.m. UTC | #7
Borislav Petkov wrote:
> On Sun, Feb 04, 2024 at 04:01:42PM +0800, Shuai Xue wrote:
> > Synchronous error was detected as a result of user-space process accessing
> > a 2-bit uncorrected error. The CPU will take a synchronous error exception
> > such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
> > memory_failure() work which poisons the related page, unmaps the page, and
> > then sends a SIGBUS to the process, so that a system wide panic can be
> > avoided.
> > 
> > However, no memory_failure() work will be queued when abnormal synchronous
> > errors occur. These errors can include situations such as invalid PA,
> > unexpected severity, no memory failure config support, invalid GUID
> > section, etc. In such case, the user-space process will trigger SEA again.
> > This loop can potentially exceed the platform firmware threshold or even
> > trigger a kernel hard lockup, leading to a system reboot.
> > 
> > Fix it by performing a force kill if no memory_failure() work is queued
> > for synchronous errors.
> > 
> > Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> > ---
> >  drivers/acpi/apei/ghes.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> > 
> > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > index 7b7c605166e0..0892550732d4 100644
> > --- a/drivers/acpi/apei/ghes.c
> > +++ b/drivers/acpi/apei/ghes.c
> > @@ -806,6 +806,15 @@ static bool ghes_do_proc(struct ghes *ghes,
> >  		}
> >  	}
> >  
> > +	/*
> > +	 * If no memory failure work is queued for abnormal synchronous
> > +	 * errors, do a force kill.
> > +	 */
> > +	if (sync && !queued) {
> > +		pr_err("Sending SIGBUS to current task due to memory error not recovered");
> > +		force_sig(SIGBUS);
> > +	}
> 
> Except that there are a bunch of CXL GUIDs being handled there too and
> this will sigbus those processes now automatically.
> 
> Lemme add the whole bunch from
> 
>   671a794c33c6 ("acpi/ghes: Process CXL Component Events")
> 
> for comment to Cc.

BTW, I am about to revert all the CXL GUIDs for v6.8 to try again for
v6.9:

http://lore.kernel.org/r/170820177849.631006.8893584762602010898.stgit@dwillia2-xfh.jf.intel.com
Dan Williams Feb. 24, 2024, 7:42 p.m. UTC | #8
Jonathan Cameron wrote:
[..]
> > > None of the CXL component errors should be handled as synchronous
> > > events. They are either asynchronous protocol errors, or effectively
> > > equivalent to CPER_SEC_PLATFORM_MEM notifications.  
> > 
> > Not a good example, CPER_SEC_PLATFORM_MEM is sometimes signaled via SEA.
> > 
> 
> Premature send.:(
> 
> One example I can point at is how we do signaling of memory
> errors detected by the host into a VM on arm64.
> https://elixir.bootlin.com/qemu/latest/source/hw/acpi/ghes.c#L391
> CPER_SEC_PLATFORM_MEM via ARM Synchronous External Abort (SEA).
> 
> Right now we've only used async in QEMU for proposed CXL error
> CPER records signalling but your reference to them being similar
> to CPER_SEC_PLATFORM_MEM is valid so 'maybe' they will be
> synchronous in some physical systems as it's one viable way to
> provide rich information for synchronous reception of poison.
> For the VM case my assumption today is we don't care about providing the
> VM with rich data, so CPER_SEC_PLATFORM_MEM is fine as a path for
> errors whether from CXL CPER records or not.

Makes sense... and I was not precise when I mentioned the equivalency, I
was only considering x86.
Borislav Petkov Feb. 26, 2024, 10:29 a.m. UTC | #9
On Sat, Feb 24, 2024 at 02:08:42PM +0800, Shuai Xue wrote:
> @Borislav, do you have any other concerns?

Yes, this change needs to be further reviewed by an ARM person: I have
no clue what those "abnormal synchronous errors" on ARM are and how
they're supposed to be handled properly there:

- what happens if you get such an error when ghes is disabled there?

- is that even the right place to handle them?

James?
Shuai Xue Feb. 27, 2024, 1:23 a.m. UTC | #10
On 2024/2/26 18:29, Borislav Petkov wrote:
> On Sat, Feb 24, 2024 at 02:08:42PM +0800, Shuai Xue wrote:
>> @Borislav, do you have any other concerns?
> 
> Yes, this change needs to be further reviewed by an ARM person: I have
> no clue what those "abnormal synchronous errors" on ARM are 

Hi, Borislav,

May the `abnormal` is not inaccurate and misled you. I mean the preconditions
check before memory_failure_queue():

- `if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))` in ghes_handle_memory_failure()
- `if (flags == -1)` in ghes_handle_memory_failure()
- `if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))` in ghes_do_memory_failure()
- `if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) ` in ghes_do_memory_failure()

If the preconditions are not passed, the user-space process will trigger SEA again.
This loop can potentially exceed the platform firmware threshold or even
trigger a kernel hard lockup, leading to a system reboot.

> and how
> they're supposed to be handled properly there:
> 
> - what happens if you get such an error when ghes is disabled there?

If ghes_disable is set, the GHES driver will not be inited by acpi_ghes_init(),
so none of error notifications will be handled. IMHO, it is expected.

> 
> - is that even the right place to handle them?
> 
> James?
> 

Leave this to @James.

Thank you.

Best Regards,
Shuai
diff mbox series

Patch

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 7b7c605166e0..0892550732d4 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -806,6 +806,15 @@  static bool ghes_do_proc(struct ghes *ghes,
 		}
 	}
 
+	/*
+	 * If no memory failure work is queued for abnormal synchronous
+	 * errors, do a force kill.
+	 */
+	if (sync && !queued) {
+		pr_err("Sending SIGBUS to current task due to memory error not recovered");
+		force_sig(SIGBUS);
+	}
+
 	return queued;
 }