ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on action required events

Message ID 20221027042445.60108-1-xueshuai@linux.alibaba.com
State New

Commit Message

Shuai Xue Oct. 27, 2022, 4:24 a.m. UTC
There are two major types of uncorrected (UC) errors:

- Action Required: The error is detected and the processor has already
  consumed the corrupted memory. The OS must take action (for example,
  offline the failing page or kill the failing thread) to recover from this
  uncorrectable error.

- Action Optional: The error is detected outside of the processor's execution
  context. Some data in memory is corrupted, but it has not been consumed.
  The OS may optionally take action to recover from this uncorrectable error.

On x86 platforms, these two types are easy to distinguish based on the MCA
bank. On arm64 platforms, however, the memory failure flags for all UC errors
whose severity is GHES_SEV_RECOVERABLE are currently set to 0, i.e. Action
Optional.

If a UC error is detected by a background scrubber, it is clearly an Action
Optional error. All other errors should conservatively be treated as Action
Required.

cper_sec_mem_err::error_type identifies the type of error that occurred
if CPER_MEM_VALID_ERROR_TYPE is set. So, set the memory failure flags to 0
for a Scrub Uncorrected Error (type 14), and to MF_ACTION_REQUIRED otherwise.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 drivers/acpi/apei/ghes.c | 10 ++++++++--
 include/linux/cper.h     |  3 +++
 2 files changed, 11 insertions(+), 2 deletions(-)

Comments

Rafael J. Wysocki Oct. 28, 2022, 5:08 p.m. UTC | #1
On Thu, Oct 27, 2022 at 6:25 AM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
>
> There are two major types of uncorrected (UC) errors:
>
> - Action Required: The error is detected and the processor has already
>   consumed the corrupted memory. The OS must take action (for example,
>   offline the failing page or kill the failing thread) to recover from this
>   uncorrectable error.
>
> - Action Optional: The error is detected outside of the processor's execution
>   context. Some data in memory is corrupted, but it has not been consumed.
>   The OS may optionally take action to recover from this uncorrectable error.
>
> On x86 platforms, these two types are easy to distinguish based on the MCA
> bank. On arm64 platforms, however, the memory failure flags for all UC errors
> whose severity is GHES_SEV_RECOVERABLE are currently set to 0, i.e. Action
> Optional.
>
> If a UC error is detected by a background scrubber, it is clearly an Action
> Optional error. All other errors should conservatively be treated as Action
> Required.
>
> cper_sec_mem_err::error_type identifies the type of error that occurred
> if CPER_MEM_VALID_ERROR_TYPE is set. So, set the memory failure flags to 0
> for a Scrub Uncorrected Error (type 14), and to MF_ACTION_REQUIRED otherwise.
>
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>

I need input from the APEI reviewers on this.

Thanks!

> ---
>  drivers/acpi/apei/ghes.c | 10 ++++++++--
>  include/linux/cper.h     |  3 +++
>  2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 80ad530583c9..6c03059cbfc6 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -474,8 +474,14 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>         if (sec_sev == GHES_SEV_CORRECTED &&
>             (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
>                 flags = MF_SOFT_OFFLINE;
> -       if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
> -               flags = 0;
> +       if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE) {
> +               if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE)
> +                       flags = mem_err->error_type == CPER_MEM_SCRUB_UC ?
> +                                       0 :
> +                                       MF_ACTION_REQUIRED;
> +               else
> +                       flags = MF_ACTION_REQUIRED;
> +       }
>
>         if (flags != -1)
>                 return ghes_do_memory_failure(mem_err->physical_addr, flags);
> diff --git a/include/linux/cper.h b/include/linux/cper.h
> index eacb7dd7b3af..b77ab7636614 100644
> --- a/include/linux/cper.h
> +++ b/include/linux/cper.h
> @@ -235,6 +235,9 @@ enum {
>  #define CPER_MEM_VALID_BANK_ADDRESS            0x100000
>  #define CPER_MEM_VALID_CHIP_ID                 0x200000
>
> +#define CPER_MEM_SCRUB_CE                      13
> +#define CPER_MEM_SCRUB_UC                      14
> +
>  #define CPER_MEM_EXT_ROW_MASK                  0x3
>  #define CPER_MEM_EXT_ROW_SHIFT                 16
>
> --
> 2.20.1.9.gb50a0d7
>
Tony Luck Oct. 28, 2022, 5:25 p.m. UTC | #2
>> cper_sec_mem_err::error_type identifies the type of error that occurred
>> if CPER_MEM_VALID_ERROR_TYPE is set. So, set the memory failure flags to 0
>> for a Scrub Uncorrected Error (type 14), and to MF_ACTION_REQUIRED otherwise.

On x86 the "action required" cases are signaled by a synchronous machine check
that is delivered before the instruction that is attempting to consume the uncorrected
data retires. I.e., it is guaranteed that the uncorrected error has not been propagated
because it is not visible in any architectural state.

APEI signaled errors don't fall into that category on x86 ... the uncorrected data
could have been consumed and propagated long before the signaling used for
APEI can alert the OS.

Does ARM deliver APEI signals synchronously?

If not, then this patch might deliver a false sense of security to applications
about the state of uncorrected data in the system.

-Tony
Shuai Xue Nov. 2, 2022, 7:07 a.m. UTC | #3
On 2022/10/29 1:08 AM, Rafael J. Wysocki wrote:
> 
> I need input from the APEI reviewers on this.
> 
> Thanks!

Hi, Rafael,

Sorry, I missed this email. Thank you for your quick reply. Let's discuss it with
the reviewers.

Thank you.

Cheers,
Shuai


Shuai Xue Nov. 2, 2022, 11:53 a.m. UTC | #4
On 2022/10/29 1:25 AM, Luck, Tony wrote:
>>> cper_sec_mem_err::error_type identifies the type of error that occurred
>>> if CPER_MEM_VALID_ERROR_TYPE is set. So, set the memory failure flags to 0
>>> for a Scrub Uncorrected Error (type 14), and to MF_ACTION_REQUIRED otherwise.
> 
> On x86 the "action required" cases are signaled by a synchronous machine check
> that is delivered before the instruction that is attempting to consume the uncorrected
> data retires. I.e., it is guaranteed that the uncorrected error has not been propagated
> because it is not visible in any architectural state.

On arm, if a 2-bit (uncorrectable) error is detected and the memory access has been
architecturally executed, that error is considered “consumed”. The CPU will take a
synchronous error exception, signaled as a synchronous external abort (SEA), which is
analogous to an MCE.

> 
> APEI signaled errors don't fall into that category on x86 ... the uncorrected data
> could have been consumed and propagated long before the signaling used for
> APEI can alert the OS.
> 
> Does ARM deliver APEI signals synchronously?
> 
> If not, then this patch might deliver a false sense of security to applications
> about the state of uncorrected data in the system.
> 

Well, not always. There are many APEI notification types, such as SCI, GSIV, GPIO,
SDEI, SEA, etc. Not all APEI notifications are synchronous; it depends on the
hardware signal. As far as I know, if a UE is detected and consumed, a synchronous
external abort is signaled to firmware; firmware then performs a first-level triage
and synchronously notifies the OS via an SDEI or SEA notification. On the other hand,
if a CE is detected, an asynchronous interrupt is signaled and firmware can notify
the OS via GPIO or GSIV.

Best Regards,
Shuai
Shuai Xue Nov. 22, 2022, 11:40 a.m. UTC | #5


Hi, Tony,

Prefetching data with a UE triggers an asynchronous interrupt on both x86 and arm64
platforms (CMCI on x86 and SPI on arm64), yet such an error is not a scrub UE. I have
to admit that cper_sec_mem_err::error_type is not an appropriate basis for
distinguishing "action required" cases.



acpi_hest_generic_data::flags (UEFI spec section N.2.2) could be used to indicate
Action Optional (Scrub/Prefetch).

	Bit 5 – Latent error: If set this flag indicates that action has been
	taken to ensure error containment (such a poisoning data), but
	the error has not been fully corrected and the data has not been
	consumed. System software may choose to take further
	corrective action before the data is consumed.

Our hardware team has submitted a proposal to UEFI community to add a new bit:

	Bit 8 – sync flag; if set this flag indicates that
	this event record is synchronous(e.g. cpu
	core consumes poison data, then cause
	instruction/data abort); if not set, this event
	record is asynchronous.

With bit 8 set, we will know that the error is "Action Required".


I will send a new patch set to rework GHES error handling after the proposal is accepted.


Thank you.

Best Regards
Shuai
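
The flag-based classification discussed in this reply could look roughly like the following user-space sketch. Bit 5 is the Latent error bit quoted above; bit 8 is the proposed sync flag, which this thread describes only as a proposal. The macro names, the enum, the 32-bit flags width, and the choice to test the sync bit first are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SEC_FLAG_LATENT_ERROR  (1u << 5) /* bit 5, as quoted from the spec text */
#define SEC_FLAG_SYNC_PROPOSED (1u << 8) /* bit 8, proposal only (hypothetical) */

static_assert((SEC_FLAG_LATENT_ERROR & SEC_FLAG_SYNC_PROPOSED) == 0,
	      "flag bits must not overlap");

enum ucr_class {
	UCR_UNKNOWN,          /* flags give no hint either way */
	UCR_ACTION_OPTIONAL,  /* contained, not yet consumed */
	UCR_ACTION_REQUIRED,  /* reported synchronously, i.e. consumed */
};

/* Classify an uncorrected error from its section flags alone. */
static enum ucr_class classify_by_section_flags(uint32_t flags)
{
	/* Assumption of this sketch: the sync bit wins if both are set. */
	if (flags & SEC_FLAG_SYNC_PROPOSED)
		return UCR_ACTION_REQUIRED;
	if (flags & SEC_FLAG_LATENT_ERROR)
		return UCR_ACTION_OPTIONAL;
	return UCR_UNKNOWN;
}
```

The UCR_UNKNOWN case is kept separate on purpose: without either bit, the flags alone do not say whether the data has been consumed.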
Kefeng Wang April 11, 2023, 2:17 p.m. UTC | #6
Hi Shuai Xue,

On 2023/4/11 18:48, Shuai Xue wrote:
> There are two major types of uncorrected recoverable (UCR) errors:
> 
> - Action Required (AR): The error is detected and the processor has already
>    consumed the corrupted memory. The OS must take action (for example,
>    offline the failing page or kill the failing thread) to recover from
>    this uncorrectable error.
> 
> - Action Optional (AO): The error is detected outside of the processor's
>    execution context. Some data in memory is corrupted, but it has not
>    been consumed. The OS may optionally take action to recover from this
>    uncorrectable error.
> 
> The essential difference between AR and AO errors is that AR is a
> synchronous event, while AO is an asynchronous event. The hardware will
> signal a synchronous exception (Machine Check Exception on x86 and
> Synchronous External Abort on arm64) when an error is detected and the
> memory access has been architecturally executed.
> 
> When APEI firmware-first mode is enabled, a platform may describe one error
> source for handling synchronous errors (e.g. MCE or SEA notification),
> or for handling asynchronous errors (e.g. SCI or External Interrupt
> notification). In other words, synchronous errors can be distinguished by
> their APEI notification type. For AR errors, the kernel kills the current
> process accessing the poisoned page by sending SIGBUS with BUS_MCEERR_AR.
> For AO errors, the kernel notifies the process that owns the poisoned page
> by sending SIGBUS with BUS_MCEERR_AO in early kill mode. However, the GHES
> driver always sets mf_flags to 0, so all UCR errors are handled as AO
> errors in memory failure.
> 
> To this end, set the memory failure flags to MF_ACTION_REQUIRED on
> synchronous events.

As you mentioned in the cover letter, we hit the same issue and hope it can
be fixed soon. This patch looks good to me.

Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>


> 
> Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> Tested-by: Ma Wupeng <mawupeng1@huawei.com>
> ---
>   drivers/acpi/apei/ghes.c | 29 +++++++++++++++++++++++------
>   1 file changed, 23 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 34ad071a64e9..c479b85899f5 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -101,6 +101,20 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
>   	return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
>   }
>   
> +/*
> + * A platform may describe one error source for the handling of synchronous
> + * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
> + * or External Interrupt). On x86, the HEST notifications are always
> + * asynchronous, so only SEA on ARM is delivered as a synchronous
> + * notification.
> + */
> +static inline bool is_hest_sync_notify(struct ghes *ghes)
> +{
> +	u8 notify_type = ghes->generic->notify.type;
> +
> +	return notify_type == ACPI_HEST_NOTIFY_SEA;
> +}
> +
>   /*
>    * This driver isn't really modular, however for the time being,
>    * continuing to use module_param is the easiest way to remain
> @@ -477,7 +491,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>   }
>   
>   static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> -				       int sev)
> +				       int sev, bool sync)
>   {
>   	int flags = -1;
>   	int sec_sev = ghes_severity(gdata->error_severity);
> @@ -491,7 +505,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>   	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
>   		flags = MF_SOFT_OFFLINE;
>   	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
> -		flags = 0;
> +		flags = sync ? MF_ACTION_REQUIRED : 0;
>   
>   	if (flags != -1)
>   		return ghes_do_memory_failure(mem_err->physical_addr, flags);
> @@ -499,9 +513,11 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>   	return false;
>   }
>   
> -static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
> +static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
> +				       int sev, bool sync)
>   {
>   	struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
> +	int flags = sync ? MF_ACTION_REQUIRED : 0;
>   	bool queued = false;
>   	int sec_sev, i;
>   	char *p;
> @@ -526,7 +542,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
>   		 * and don't filter out 'corrected' error here.
>   		 */
>   		if (is_cache && has_pa) {
> -			queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
> +			queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
>   			p += err_info->length;
>   			continue;
>   		}
> @@ -647,6 +663,7 @@ static bool ghes_do_proc(struct ghes *ghes,
>   	const guid_t *fru_id = &guid_null;
>   	char *fru_text = "";
>   	bool queued = false;
> +	bool sync = is_hest_sync_notify(ghes);
>   
>   	sev = ghes_severity(estatus->error_severity);
>   	apei_estatus_for_each_section(estatus, gdata) {
> @@ -664,13 +681,13 @@ static bool ghes_do_proc(struct ghes *ghes,
>   			atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);
>   
>   			arch_apei_report_mem_error(sev, mem_err);
> -			queued = ghes_handle_memory_failure(gdata, sev);
> +			queued = ghes_handle_memory_failure(gdata, sev, sync);
>   		}
>   		else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
>   			ghes_handle_aer(gdata);
>   		}
>   		else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
> -			queued = ghes_handle_arm_hw_error(gdata, sev);
> +			queued = ghes_handle_arm_hw_error(gdata, sev, sync);
>   		} else {
>   			void *err = acpi_hest_get_payload(gdata);
>
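
The effect of the patch quoted above can be modeled outside the kernel as follows. This is a simplified sketch: the enum and its values, the function names, and the numeric value of MF_ACTION_REQUIRED are placeholders invented here. Only the rule itself (SEA is treated as synchronous, every other notification as asynchronous) is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define MF_ACTION_REQUIRED (1 << 1) /* assumed value, stand-in for the kernel flag */

/* Hypothetical subset of HEST notification types; values are placeholders. */
enum notify_model {
	NOTIFY_MODEL_SCI,
	NOTIFY_MODEL_EXTERNAL_IRQ,
	NOTIFY_MODEL_GPIO,
	NOTIFY_MODEL_SEA,
	NOTIFY_MODEL_COUNT
};

/* Mirrors the rule in is_hest_sync_notify(): only SEA counts as synchronous. */
static bool model_is_sync_notify(enum notify_model type)
{
	assert(type < NOTIFY_MODEL_COUNT);
	return type == NOTIFY_MODEL_SEA;
}

/* Flags handed to memory failure handling for a recoverable UC error. */
static int model_mf_flags(enum notify_model type)
{
	return model_is_sync_notify(type) ? MF_ACTION_REQUIRED : 0;
}
```

Keeping the synchronous check in one helper means a platform-specific extension would only have to touch that helper, not every caller that derives flags from it.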
Kefeng Wang April 11, 2023, 2:28 p.m. UTC | #7
On 2023/4/11 18:48, Shuai Xue wrote:
> Hardware errors can be signaled by an asynchronous interrupt, e.g. when an
> error is detected by a background scrubber, or by a synchronous
> exception, e.g. when an uncorrected error is consumed. Both synchronous and
> asynchronous errors are queued and handled by a dedicated kthread in a
> workqueue.
> 
> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
> synchronous errors") keeps track of whether memory_failure() work was
> queued, and makes task_work pending to flush out the workqueue so that the
> work for a synchronous error is processed before returning to user-space.
> The trick ensures that the corrupted page is unmapped and poisoned. After
> returning to user-space, the task resumes at the current instruction, which
> triggers a page fault in which the kernel sends SIGBUS to the current
> process due to VM_FAULT_HWPOISON.
> 
> However, memory failure recovery for hwpoison-aware mechanisms does not
> work as expected. For example, hwpoison-aware user-space processes like
> QEMU register their own SIGBUS handler and enable early kill mode by
> setting PF_MCE_EARLY at initialization. The kernel then directly notifies
> the process by sending a SIGBUS signal in memory failure, but with the
> wrong si_code: although the signaled process is the one actually accessing
> the corrupt memory location, its memory failure work is handled in a
> kthread context, so kill_proc() sends SIGBUS with the BUS_MCEERR_AO si_code
> instead of BUS_MCEERR_AR.
> 
> To this end, separate synchronous and asynchronous error handling into
> different paths, as the x86 platform does:
> 
> - valid synchronous errors: queue a task_work to synchronously send SIGBUS
>    before ret_to_user.
> - valid asynchronous errors: queue a work item into the workqueue to
>    asynchronously handle memory failure.
> - abnormal branches such as invalid PA, unexpected severity, no memory
>    failure config support, invalid GUID section, OOM, etc.
> 
> Then, for valid synchronous errors, the current context in memory failure
> belongs exactly to the task consuming the poisoned data, and it will send
> SIGBUS with the proper si_code.
> 
> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> Tested-by: Ma Wupeng <mawupeng1@huawei.com>
> ---
>   drivers/acpi/apei/ghes.c | 91 +++++++++++++++++++++++++++-------------
>   include/acpi/ghes.h      |  3 --
>   mm/memory-failure.c      | 13 ------
>   3 files changed, 61 insertions(+), 46 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index c479b85899f5..4b70955e25f9 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -452,28 +452,51 @@ static void ghes_clear_estatus(struct ghes *ghes,
>   }
>   
>   /*
> - * Called as task_work before returning to user-space.
> - * Ensure any queued work has been done before we return to the context that
> - * triggered the notification.
> + * struct sync_task_work - for synchronous RAS event
> + *
> + * @twork:                callback_head for task work
> + * @pfn:                  page frame number of corrupted page
> + * @flags:                fine tune action taken
> + *
> + * Structure to pass task work to be handled before
> + * ret_to_user via task_work_add().
>    */
> -static void ghes_kick_task_work(struct callback_head *head)
> +struct sync_task_work {
> +	struct callback_head twork;
> +	u64 pfn;
> +	int flags;
> +};
> +
> +static void memory_failure_cb(struct callback_head *twork)
>   {
> -	struct acpi_hest_generic_status *estatus;
> -	struct ghes_estatus_node *estatus_node;
> -	u32 node_len;
> +	int ret;
> +	struct sync_task_work *twcb =
> +		container_of(twork, struct sync_task_work, twork);
>   
> -	estatus_node = container_of(head, struct ghes_estatus_node, task_work);
> -	if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
> -		memory_failure_queue_kick(estatus_node->task_work_cpu);
> +	ret = memory_failure(twcb->pfn, twcb->flags);
> +	kfree(twcb);
>   
> -	estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> -	node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
> -	gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
> +	if (!ret)
> +		return;
> +
> +	/*
> +	 * -EHWPOISON from memory_failure() means that it already sent SIGBUS
> +	 * to the current process with the proper error info,

This should be part of the comment on memory_failure() itself,

> +	 * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
> +	 *

and this part is already there.

> +	 * In both cases, no further processing is required.
> +	 */

So, after that, I think we could drop this comment, along with the same
comment in x86's kill_me_maybe().

> +	if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
> +		return;
> +
> +	pr_err("Memory error not recovered");
> +	force_sig(SIGBUS);
>   }
>   
>   static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>   {
>   	unsigned long pfn;
> +	struct sync_task_work *twcb;
>   
>   	if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>   		return false;
> @@ -486,6 +509,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>   		return false;
>   	}
>   
> +	if (flags == MF_ACTION_REQUIRED && current->mm) {
> +		twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
> +		if (!twcb)
> +			return false;
> +
> +		twcb->pfn = pfn;
> +		twcb->flags = flags;
> +		init_task_work(&twcb->twork, memory_failure_cb);
> +		task_work_add(current, &twcb->twork, TWA_RESUME);
> +		return true;
> +	}
> +
>   	memory_failure_queue(pfn, flags);
>   	return true;
>   }
> @@ -1000,9 +1035,8 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>   	struct ghes_estatus_node *estatus_node;
>   	struct acpi_hest_generic *generic;
>   	struct acpi_hest_generic_status *estatus;
> -	bool task_work_pending;
> +	bool queued, sync;
>   	u32 len, node_len;
> -	int ret;
>   
>   	llnode = llist_del_all(&ghes_estatus_llist);
>   	/*
> @@ -1015,27 +1049,25 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>   		estatus_node = llist_entry(llnode, struct ghes_estatus_node,
>   					   llnode);
>   		estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> +		sync = is_hest_sync_notify(estatus_node->ghes);
>   		len = cper_estatus_len(estatus);
>   		node_len = GHES_ESTATUS_NODE_LEN(len);
> -		task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
> +
> +		queued = ghes_do_proc(estatus_node->ghes, estatus);
> +		/*
> +		 * If no memory failure work is queued for abnormal synchronous
> +		 * errors, do a force kill.
> +		 */
> +		if (sync && !queued)
> +			force_sig(SIGBUS);

It would be better to move this part into ghes_do_proc(), since that function
already calls is_hest_sync_notify(). The return value would then no longer be
needed, so ghes_do_proc() could become a void function. Apart from this,

Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>

> +
>   		if (!ghes_estatus_cached(estatus)) {
>   			generic = estatus_node->generic;
>   			if (ghes_print_estatus(NULL, generic, estatus))
>   				ghes_estatus_cache_add(generic, estatus);
>   		}
> -
> -		if (task_work_pending && current->mm) {
> -			estatus_node->task_work.func = ghes_kick_task_work;
> -			estatus_node->task_work_cpu = smp_processor_id();
> -			ret = task_work_add(current, &estatus_node->task_work,
> -					    TWA_RESUME);
> -			if (ret)
> -				estatus_node->task_work.func = NULL;
> -		}
> -
> -		if (!estatus_node->task_work.func)
> -			gen_pool_free(ghes_estatus_pool,
> -				      (unsigned long)estatus_node, node_len);
> +		gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
> +			      node_len);
>   
>   		llnode = next;
>   	}
> @@ -1096,7 +1128,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
>   
>   	estatus_node->ghes = ghes;
>   	estatus_node->generic = ghes->generic;
> -	estatus_node->task_work.func = NULL;
>   	estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>   
>   	if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> index 3c8bba9f1114..e5e0c308d27f 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -35,9 +35,6 @@ struct ghes_estatus_node {
>   	struct llist_node llnode;
>   	struct acpi_hest_generic *generic;
>   	struct ghes *ghes;
> -
> -	int task_work_cpu;
> -	struct callback_head task_work;
>   };
>   
>   struct ghes_estatus_cache {
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index fae9baf3be16..6ea8c325acb3 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2355,19 +2355,6 @@ static void memory_failure_work_func(struct work_struct *work)
>   	}
>   }
>   
> -/*
> - * Process memory_failure work queued on the specified CPU.
> - * Used to avoid return-to-userspace racing with the memory_failure workqueue.
> - */
> -void memory_failure_queue_kick(int cpu)
> -{
> -	struct memory_failure_cpu *mf_cpu;
> -
> -	mf_cpu = &per_cpu(memory_failure_cpu, cpu);
> -	cancel_work_sync(&mf_cpu->work);
> -	memory_failure_work_func(&mf_cpu->work);
> -}
> -
>   static int __init memory_failure_init(void)
>   {
>   	struct memory_failure_cpu *mf_cpu;
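
Seen from user space, the si_code distinction described in the commit message above is what a hwpoison-aware process acts on. The sketch below shows one way such a process might opt in to early kill and tell the two codes apart. It assumes Linux with a C library that exposes PR_MCE_KILL; the helper names are invented for this example, and it is taken neither from QEMU nor from the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/prctl.h>
#include <unistd.h>

/* Fallbacks for older headers; values match the Linux UAPI. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

static_assert(BUS_MCEERR_AR != BUS_MCEERR_AO, "AR and AO must be distinct codes");

enum mce_kind { MCE_NONE, MCE_ACTION_REQUIRED, MCE_ACTION_OPTIONAL };

/* Map a SIGBUS si_code to the kind of memory error it reports. */
static enum mce_kind classify_sigbus(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AR:
		return MCE_ACTION_REQUIRED; /* this thread consumed poisoned data */
	case BUS_MCEERR_AO:
		return MCE_ACTION_OPTIONAL; /* poison found, not yet consumed here */
	default:
		return MCE_NONE;            /* ordinary SIGBUS */
	}
}

static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;

	/* AO: the faulting page (info->si_addr) can be retired lazily. */
	if (classify_sigbus(info->si_code) == MCE_ACTION_OPTIONAL)
		return;

	/* AR or unrelated SIGBUS: this sketch cannot resume safely, so exit. */
	_exit(1);
}

/* Install the handler and request early kill; returns 0 on success. */
static int setup_hwpoison_handling(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGBUS, &sa, NULL) != 0)
		return -1;

	/* Ask the kernel to signal this process as soon as poison is found. */
	return prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);
}
```

A process set up this way only behaves correctly if the kernel reports consumed poison as BUS_MCEERR_AR; if every event arrives as BUS_MCEERR_AO, the handler above would return and the faulting access would simply be retried.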
Shuai Xue April 12, 2023, 2:54 a.m. UTC | #8
On 2023/4/11 PM10:17, Kefeng Wang wrote:
> Hi Shuai Xue,
> 
> As you mentioned in the cover letter, we hit the same issue and hope it can be fixed soon. This patch looks good to me.
> 
> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>

Thank you.

Cheers,
Shuai

> 
>>
>> Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")
>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>> Tested-by: Ma Wupeng <mawupeng1@huawei.com>
>> ---
>>   drivers/acpi/apei/ghes.c | 29 +++++++++++++++++++++++------
>>   1 file changed, 23 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 34ad071a64e9..c479b85899f5 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -101,6 +101,20 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
>>       return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
>>   }
>>   +/*
>> + * A platform may describe one error source for the handling of synchronous
>> + * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
>> + * or External Interrupt). On x86, the HEST notifications are always
>> + * asynchronous, so only SEA on ARM is delivered as a synchronous
>> + * notification.
>> + */
>> +static inline bool is_hest_sync_notify(struct ghes *ghes)
>> +{
>> +    u8 notify_type = ghes->generic->notify.type;
>> +
>> +    return notify_type == ACPI_HEST_NOTIFY_SEA;
>> +}
>> +
>>   /*
>>    * This driver isn't really modular, however for the time being,
>>    * continuing to use module_param is the easiest way to remain
>> @@ -477,7 +491,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>>   }
>>     static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>> -                       int sev)
>> +                       int sev, bool sync)
>>   {
>>       int flags = -1;
>>       int sec_sev = ghes_severity(gdata->error_severity);
>> @@ -491,7 +505,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>>           (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
>>           flags = MF_SOFT_OFFLINE;
>>       if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
>> -        flags = 0;
>> +        flags = sync ? MF_ACTION_REQUIRED : 0;
>>         if (flags != -1)
>>           return ghes_do_memory_failure(mem_err->physical_addr, flags);
>> @@ -499,9 +513,11 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>>       return false;
>>   }
>>   -static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
>> +static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
>> +                       int sev, bool sync)
>>   {
>>       struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
>> +    int flags = sync ? MF_ACTION_REQUIRED : 0;
>>       bool queued = false;
>>       int sec_sev, i;
>>       char *p;
>> @@ -526,7 +542,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
>>            * and don't filter out 'corrected' error here.
>>            */
>>           if (is_cache && has_pa) {
>> -            queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
>> +            queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
>>               p += err_info->length;
>>               continue;
>>           }
>> @@ -647,6 +663,7 @@ static bool ghes_do_proc(struct ghes *ghes,
>>       const guid_t *fru_id = &guid_null;
>>       char *fru_text = "";
>>       bool queued = false;
>> +    bool sync = is_hest_sync_notify(ghes);
>>         sev = ghes_severity(estatus->error_severity);
>>       apei_estatus_for_each_section(estatus, gdata) {
>> @@ -664,13 +681,13 @@ static bool ghes_do_proc(struct ghes *ghes,
>>               atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);
>>                 arch_apei_report_mem_error(sev, mem_err);
>> -            queued = ghes_handle_memory_failure(gdata, sev);
>> +            queued = ghes_handle_memory_failure(gdata, sev, sync);
>>           }
>>           else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
>>               ghes_handle_aer(gdata);
>>           }
>>           else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
>> -            queued = ghes_handle_arm_hw_error(gdata, sev);
>> +            queued = ghes_handle_arm_hw_error(gdata, sev, sync);
>>           } else {
>>               void *err = acpi_hest_get_payload(gdata);
>>
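For readers skimming the quoted hunks, the flag selection can be modelled in a few lines of user-space C. This is only a sketch of the decision, not kernel code: the enum values and the MF_ACTION_REQUIRED definition below are stand-ins chosen for the example, and is_sync_notify() and mf_flags_for() are hypothetical names that mirror is_hest_sync_notify() and the "sync ? MF_ACTION_REQUIRED : 0" expression from the patch.

```c
#include <stdbool.h>

/*
 * Stand-in values for this sketch only; the kernel's real definitions
 * live in its own headers and are not reproduced here.
 */
enum notify_type { NOTIFY_SCI = 3, NOTIFY_SEA = 8 };
#define MF_ACTION_REQUIRED (1 << 1)

/* Mirrors is_hest_sync_notify(): only SEA is treated as synchronous. */
static bool is_sync_notify(enum notify_type type)
{
	return type == NOTIFY_SEA;
}

/* Mirrors the flag choice made in the quoted hunks. */
static int mf_flags_for(enum notify_type type)
{
	return is_sync_notify(type) ? MF_ACTION_REQUIRED : 0;
}
```

With this model, an SEA notification yields MF_ACTION_REQUIRED and an SCI notification yields 0, which is the AR/AO split described in the commit message.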
Shuai Xue April 12, 2023, 2:58 a.m. UTC | #9
On 2023/4/11 PM10:28, Kefeng Wang wrote:
> 
> 
> On 2023/4/11 18:48, Shuai Xue wrote:
>> Hardware errors could be signaled by asynchronous interrupt, e.g. when an
>> error is detected by a background scrubber, or signaled by synchronous
>> exception, e.g. when an uncorrected error is consumed. Both synchronous and
>> asynchronous errors are queued and handled by a dedicated kthread in
>> workqueue.
>>
>> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
>> synchronous errors") keeps track of whether memory_failure() work was
>> queued, and makes task_work pending to flush out the workqueue so that the
>> work for synchronous error is processed before returning to user-space.
>> The trick ensures that the corrupted page is unmapped and poisoned. And
>> after returning to user-space, the task restarts at the current instruction,
>> which triggers a page fault in which the kernel will send SIGBUS to the
>> current process due to VM_FAULT_HWPOISON.
>>
>> However, the memory failure recovery for hwpoison-aware mechanisms does not
>> work as expected. For example, hwpoison-aware user-space processes like
>> QEMU register their customized SIGBUS handler and enable early kill mode by
>> setting PF_MCE_EARLY at initialization. Then the kernel will directly notify
>> the process by sending a SIGBUS signal in memory failure with the wrong
>> si_code: the actual user-space process accessed the corrupt memory
>> location, but its memory failure work is handled in a kthread context, so
>> kill_proc() sends SIGBUS with the BUS_MCEERR_AO si_code to that process
>> instead of BUS_MCEERR_AR.
>>
>> To this end, separate synchronous and asynchronous error handling into
>> different paths like X86 platform does:
>>
>> - valid synchronous errors: queue a task_work to synchronously send SIGBUS
>>    before ret_to_user.
>> - valid asynchronous errors: queue a work into workqueue to asynchronously
>>    handle memory failure.
>> - abnormal branches such as invalid PA, unexpected severity, no memory
>>    failure config support, invalid GUID section, OOM, etc.
>>
>> Then for valid synchronous errors, the current context in memory failure
>> belongs exactly to the task consuming the poisoned data, and it will send
>> SIGBUS with the proper si_code.
>>
>> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>> Tested-by: Ma Wupeng <mawupeng1@huawei.com>
>> ---
>>   drivers/acpi/apei/ghes.c | 91 +++++++++++++++++++++++++++-------------
>>   include/acpi/ghes.h      |  3 --
>>   mm/memory-failure.c      | 13 ------
>>   3 files changed, 61 insertions(+), 46 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index c479b85899f5..4b70955e25f9 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -452,28 +452,51 @@ static void ghes_clear_estatus(struct ghes *ghes,
>>   }
>>     /*
>> - * Called as task_work before returning to user-space.
>> - * Ensure any queued work has been done before we return to the context that
>> - * triggered the notification.
>> + * struct sync_task_work - for synchronous RAS event
>> + *
>> + * @twork:                callback_head for task work
>> + * @pfn:                  page frame number of corrupted page
>> + * @flags:                fine tune action taken
>> + *
>> + * Structure to pass task work to be handled before
>> + * ret_to_user via task_work_add().
>>    */
>> -static void ghes_kick_task_work(struct callback_head *head)
>> +struct sync_task_work {
>> +    struct callback_head twork;
>> +    u64 pfn;
>> +    int flags;
>> +};
>> +
>> +static void memory_failure_cb(struct callback_head *twork)
>>   {
>> -    struct acpi_hest_generic_status *estatus;
>> -    struct ghes_estatus_node *estatus_node;
>> -    u32 node_len;
>> +    int ret;
>> +    struct sync_task_work *twcb =
>> +        container_of(twork, struct sync_task_work, twork);
>>   -    estatus_node = container_of(head, struct ghes_estatus_node, task_work);
>> -    if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>> -        memory_failure_queue_kick(estatus_node->task_work_cpu);
>> +    ret = memory_failure(twcb->pfn, twcb->flags);
>> +    kfree(twcb);
>>   -    estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>> -    node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
>> -    gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
>> +    if (!ret)
>> +        return;
>> +
>> +    /*
>> +     * -EHWPOISON from memory_failure() means that it already sent SIGBUS
>> +     * to the current process with the proper error info,
> 
> This should be part of the comments of function memory_failure(),
> 
>> +     * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
>> +     *
> and this part is already there
>> +     * In both cases, no further processing is required.
>> +     */
> so, after that, I think we could drop this comment, also the same comment in x86's kill_me_maybe().

Ok, I will add comments on the return values of memory_failure() and drop both this
comment and the one in kill_me_maybe().

> 
>> +    if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
>> +        return;
>> +
>> +    pr_err("Memory error not recovered");
>> +    force_sig(SIGBUS);
>>   }
>>     static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>>   {
>>       unsigned long pfn;
>> +    struct sync_task_work *twcb;
>>         if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>>           return false;
>> @@ -486,6 +509,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>>           return false;
>>       }
>>   +    if (flags == MF_ACTION_REQUIRED && current->mm) {
>> +        twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
>> +        if (!twcb)
>> +            return false;
>> +
>> +        twcb->pfn = pfn;
>> +        twcb->flags = flags;
>> +        init_task_work(&twcb->twork, memory_failure_cb);
>> +        task_work_add(current, &twcb->twork, TWA_RESUME);
>> +        return true;
>> +    }
>> +
>>       memory_failure_queue(pfn, flags);
>>       return true;
>>   }
>> @@ -1000,9 +1035,8 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>       struct ghes_estatus_node *estatus_node;
>>       struct acpi_hest_generic *generic;
>>       struct acpi_hest_generic_status *estatus;
>> -    bool task_work_pending;
>> +    bool queued, sync;
>>       u32 len, node_len;
>> -    int ret;
>>         llnode = llist_del_all(&ghes_estatus_llist);
>>       /*
>> @@ -1015,27 +1049,25 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>           estatus_node = llist_entry(llnode, struct ghes_estatus_node,
>>                          llnode);
>>           estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>> +        sync = is_hest_sync_notify(estatus_node->ghes);
>>           len = cper_estatus_len(estatus);
>>           node_len = GHES_ESTATUS_NODE_LEN(len);
>> -        task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
>> +
>> +        queued = ghes_do_proc(estatus_node->ghes, estatus);
>> +        /*
>> +         * If no memory failure work is queued for abnormal synchronous
>> +         * errors, do a force kill.
>> +         */
>> +        if (sync && !queued)
>> +            force_sig(SIGBUS);
> 
> It's better to move this part into ghes_do_proc(), because it already calls is_hest_sync_notify() and the return value is then no longer needed,
> so ghes_do_proc() can become a void function. Apart from this,

Good idea. I will do this and send a new version.

> 
> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>

Thank you.

Cheers,
Shuai

> 
>> +
>>           if (!ghes_estatus_cached(estatus)) {
>>               generic = estatus_node->generic;
>>               if (ghes_print_estatus(NULL, generic, estatus))
>>                   ghes_estatus_cache_add(generic, estatus);
>>           }
>> -
>> -        if (task_work_pending && current->mm) {
>> -            estatus_node->task_work.func = ghes_kick_task_work;
>> -            estatus_node->task_work_cpu = smp_processor_id();
>> -            ret = task_work_add(current, &estatus_node->task_work,
>> -                        TWA_RESUME);
>> -            if (ret)
>> -                estatus_node->task_work.func = NULL;
>> -        }
>> -
>> -        if (!estatus_node->task_work.func)
>> -            gen_pool_free(ghes_estatus_pool,
>> -                      (unsigned long)estatus_node, node_len);
>> +        gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
>> +                  node_len);
>>             llnode = next;
>>       }
>> @@ -1096,7 +1128,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
>>         estatus_node->ghes = ghes;
>>       estatus_node->generic = ghes->generic;
>> -    estatus_node->task_work.func = NULL;
>>       estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>>         if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
>> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
>> index 3c8bba9f1114..e5e0c308d27f 100644
>> --- a/include/acpi/ghes.h
>> +++ b/include/acpi/ghes.h
>> @@ -35,9 +35,6 @@ struct ghes_estatus_node {
>>       struct llist_node llnode;
>>       struct acpi_hest_generic *generic;
>>       struct ghes *ghes;
>> -
>> -    int task_work_cpu;
>> -    struct callback_head task_work;
>>   };
>>     struct ghes_estatus_cache {
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index fae9baf3be16..6ea8c325acb3 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -2355,19 +2355,6 @@ static void memory_failure_work_func(struct work_struct *work)
>>       }
>>   }
>>   -/*
>> - * Process memory_failure work queued on the specified CPU.
>> - * Used to avoid return-to-userspace racing with the memory_failure workqueue.
>> - */
>> -void memory_failure_queue_kick(int cpu)
>> -{
>> -    struct memory_failure_cpu *mf_cpu;
>> -
>> -    mf_cpu = &per_cpu(memory_failure_cpu, cpu);
>> -    cancel_work_sync(&mf_cpu->work);
>> -    memory_failure_work_func(&mf_cpu->work);
>> -}
>> -
>>   static int __init memory_failure_init(void)
>>   {
>>       struct memory_failure_cpu *mf_cpu;
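Since the review above turns on what the return values of memory_failure() mean, here is a small user-space model of the check in memory_failure_cb(). It is a sketch under assumptions: needs_forced_sigbus() is a hypothetical helper name, and the fallback value for EHWPOISON is the asm-generic one, used only when the C library does not define it.

```c
#include <errno.h>
#include <stdbool.h>

#ifndef EHWPOISON
#define EHWPOISON 133	/* asm-generic value; fallback for this sketch only */
#endif

/*
 * Models the tail of memory_failure_cb() in the quoted patch:
 *   0            -> recovery succeeded, nothing more to do
 *   -EHWPOISON   -> SIGBUS was already sent with proper error info
 *   -EOPNOTSUPP  -> hwpoison_filter() filtered the event
 *   anything else-> not recovered, so the task must be force-killed
 */
static bool needs_forced_sigbus(int ret)
{
	if (ret == 0)
		return false;
	if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
		return false;
	return true;
}
```

Any other negative return, for example -EBUSY, falls through to the force_sig(SIGBUS) path in the patch.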
Xiaofei Tan April 12, 2023, 4:05 a.m. UTC | #10
On 2023/4/11 18:48, Shuai Xue wrote:
> Hardware errors could be signaled by asynchronous interrupt, e.g. when an
> error is detected by a background scrubber, or signaled by synchronous
> exception, e.g. when an uncorrected error is consumed. Both synchronous and
> asynchronous errors are queued and handled by a dedicated kthread in
> workqueue.
>
> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
> synchronous errors") keeps track of whether memory_failure() work was
> queued, and makes task_work pending to flush out the workqueue so that the
> work for synchronous error is processed before returning to user-space.
> The trick ensures that the corrupted page is unmapped and poisoned. And
> after returning to user-space, the task restarts at the current instruction,
> which triggers a page fault in which the kernel will send SIGBUS to the
> current process due to VM_FAULT_HWPOISON.
>
> However, the memory failure recovery for hwpoison-aware mechanisms does not
> work as expected. For example, hwpoison-aware user-space processes like
> QEMU register their customized SIGBUS handler and enable early kill mode by
> setting PF_MCE_EARLY at initialization. Then the kernel will directly notify
> the process by sending a SIGBUS signal in memory failure with the wrong
> si_code: the actual user-space process accessed the corrupt memory
> location, but its memory failure work is handled in a kthread context, so
> kill_proc() sends SIGBUS with the BUS_MCEERR_AO si_code to that process
> instead of BUS_MCEERR_AR.
>
> To this end, separate synchronous and asynchronous error handling into
> different paths like X86 platform does:
>
> - valid synchronous errors: queue a task_work to synchronously send SIGBUS
>    before ret_to_user.
> - valid asynchronous errors: queue a work into workqueue to asynchronously
>    handle memory failure.
> - abnormal branches such as invalid PA, unexpected severity, no memory
>    failure config support, invalid GUID section, OOM, etc.
>
> Then for valid synchronous errors, the current context in memory failure
> belongs exactly to the task consuming the poisoned data, and it will send
> SIGBUS with the proper si_code.
>
> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> Tested-by: Ma Wupeng <mawupeng1@huawei.com>
> ---
>   drivers/acpi/apei/ghes.c | 91 +++++++++++++++++++++++++++-------------
>   include/acpi/ghes.h      |  3 --
>   mm/memory-failure.c      | 13 ------
>   3 files changed, 61 insertions(+), 46 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index c479b85899f5..4b70955e25f9 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -452,28 +452,51 @@ static void ghes_clear_estatus(struct ghes *ghes,
>   }
>   
>   /*
> - * Called as task_work before returning to user-space.
> - * Ensure any queued work has been done before we return to the context that
> - * triggered the notification.
> + * struct sync_task_work - for synchronous RAS event
> + *
> + * @twork:                callback_head for task work
> + * @pfn:                  page frame number of corrupted page
> + * @flags:                fine tune action taken
> + *
> + * Structure to pass task work to be handled before
> + * ret_to_user via task_work_add().
>    */
> -static void ghes_kick_task_work(struct callback_head *head)
> +struct sync_task_work {
> +	struct callback_head twork;
> +	u64 pfn;
> +	int flags;
> +};
> +
> +static void memory_failure_cb(struct callback_head *twork)
>   {
> -	struct acpi_hest_generic_status *estatus;
> -	struct ghes_estatus_node *estatus_node;
> -	u32 node_len;
> +	int ret;
> +	struct sync_task_work *twcb =
> +		container_of(twork, struct sync_task_work, twork);
>   
> -	estatus_node = container_of(head, struct ghes_estatus_node, task_work);
> -	if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
> -		memory_failure_queue_kick(estatus_node->task_work_cpu);
> +	ret = memory_failure(twcb->pfn, twcb->flags);
> +	kfree(twcb);
>   
> -	estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> -	node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
> -	gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
> +	if (!ret)
> +		return;
> +
> +	/*
> +	 * -EHWPOISON from memory_failure() means that it already sent SIGBUS
> +	 * to the current process with the proper error info,
> +	 * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
> +	 *
> +	 * In both cases, no further processing is required.
> +	 */
> +	if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
> +		return;
> +
> +	pr_err("Memory error not recovered");

The message could also say that SIGBUS is being sent,
for example "Sending SIGBUS to current task due to memory error not recovered".

> +	force_sig(SIGBUS);
>   }
>   
>   static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>   {
>   	unsigned long pfn;
> +	struct sync_task_work *twcb;
>   
>   	if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>   		return false;
> @@ -486,6 +509,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>   		return false;
>   	}
>   
> +	if (flags == MF_ACTION_REQUIRED && current->mm) {
> +		twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
> +		if (!twcb)
> +			return false;
> +
> +		twcb->pfn = pfn;
> +		twcb->flags = flags;
> +		init_task_work(&twcb->twork, memory_failure_cb);
> +		task_work_add(current, &twcb->twork, TWA_RESUME);
> +		return true;
> +	}
> +
>   	memory_failure_queue(pfn, flags);
>   	return true;
>   }
> @@ -1000,9 +1035,8 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>   	struct ghes_estatus_node *estatus_node;
>   	struct acpi_hest_generic *generic;
>   	struct acpi_hest_generic_status *estatus;
> -	bool task_work_pending;
> +	bool queued, sync;
>   	u32 len, node_len;
> -	int ret;
>   
>   	llnode = llist_del_all(&ghes_estatus_llist);
>   	/*
> @@ -1015,27 +1049,25 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>   		estatus_node = llist_entry(llnode, struct ghes_estatus_node,
>   					   llnode);
>   		estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> +		sync = is_hest_sync_notify(estatus_node->ghes);
>   		len = cper_estatus_len(estatus);
>   		node_len = GHES_ESTATUS_NODE_LEN(len);
> -		task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
> +
> +		queued = ghes_do_proc(estatus_node->ghes, estatus);
> +		/*
> +		 * If no memory failure work is queued for abnormal synchronous
> +		 * errors, do a force kill.
> +		 */
> +		if (sync && !queued)
> +			force_sig(SIGBUS);

A similar message could also be printed here.
Apart from this,
Reviewed-by: Xiaofei Tan <tanxiaofei@huawei.com>

> +
>   		if (!ghes_estatus_cached(estatus)) {
>   			generic = estatus_node->generic;
>   			if (ghes_print_estatus(NULL, generic, estatus))
>   				ghes_estatus_cache_add(generic, estatus);
>   		}
> -
> -		if (task_work_pending && current->mm) {
> -			estatus_node->task_work.func = ghes_kick_task_work;
> -			estatus_node->task_work_cpu = smp_processor_id();
> -			ret = task_work_add(current, &estatus_node->task_work,
> -					    TWA_RESUME);
> -			if (ret)
> -				estatus_node->task_work.func = NULL;
> -		}
> -
> -		if (!estatus_node->task_work.func)
> -			gen_pool_free(ghes_estatus_pool,
> -				      (unsigned long)estatus_node, node_len);
> +		gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
> +			      node_len);
>   
>   		llnode = next;
>   	}
> @@ -1096,7 +1128,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
>   
>   	estatus_node->ghes = ghes;
>   	estatus_node->generic = ghes->generic;
> -	estatus_node->task_work.func = NULL;
>   	estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>   
>   	if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> index 3c8bba9f1114..e5e0c308d27f 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -35,9 +35,6 @@ struct ghes_estatus_node {
>   	struct llist_node llnode;
>   	struct acpi_hest_generic *generic;
>   	struct ghes *ghes;
> -
> -	int task_work_cpu;
> -	struct callback_head task_work;
>   };
>   
>   struct ghes_estatus_cache {
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index fae9baf3be16..6ea8c325acb3 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2355,19 +2355,6 @@ static void memory_failure_work_func(struct work_struct *work)
>   	}
>   }
>   
> -/*
> - * Process memory_failure work queued on the specified CPU.
> - * Used to avoid return-to-userspace racing with the memory_failure workqueue.
> - */
> -void memory_failure_queue_kick(int cpu)
> -{
> -	struct memory_failure_cpu *mf_cpu;
> -
> -	mf_cpu = &per_cpu(memory_failure_cpu, cpu);
> -	cancel_work_sync(&mf_cpu->work);
> -	memory_failure_work_func(&mf_cpu->work);
> -}
> -
>   static int __init memory_failure_init(void)
>   {
>   	struct memory_failure_cpu *mf_cpu;
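The three paths listed in the commit message (task_work, workqueue, abnormal branch) can be summarised as a small decision function. The following is a user-space sketch of that logic as I read it from the quoted ghes_do_memory_failure() and ghes_proc_in_irq() hunks; choose_path(), must_force_kill() and the enum are hypothetical names, and the MF_ACTION_REQUIRED value is a stand-in for the example.

```c
#include <stdbool.h>

#define MF_ACTION_REQUIRED (1 << 1)	/* stand-in value for this sketch */

enum mf_path {
	MF_PATH_NONE,		/* nothing queued: abnormal branch */
	MF_PATH_TASK_WORK,	/* handled in the faulting task before ret_to_user */
	MF_PATH_WORKQUEUE,	/* handled later by the memory_failure workqueue */
};

/*
 * pfn_valid: the physical address maps to a valid page frame
 * has_mm:    the interrupted context is a user task (current->mm != NULL)
 * alloc_ok:  the GFP_ATOMIC allocation of the task_work item succeeded
 */
static enum mf_path choose_path(bool pfn_valid, int flags, bool has_mm,
				bool alloc_ok)
{
	if (!pfn_valid)
		return MF_PATH_NONE;
	if (flags == MF_ACTION_REQUIRED && has_mm)
		return alloc_ok ? MF_PATH_TASK_WORK : MF_PATH_NONE;
	return MF_PATH_WORKQUEUE;
}

/* Mirrors "if (sync && !queued) force_sig(SIGBUS);" from the patch. */
static bool must_force_kill(bool sync, enum mf_path path)
{
	return sync && path == MF_PATH_NONE;
}
```

Note that an Action Required error raised without a user mm still goes to the workqueue in this model, matching the fall-through to memory_failure_queue() in the patch.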
Shuai Xue April 13, 2023, 1:49 a.m. UTC | #11
On 2023/4/12 PM12:05, Xiaofei Tan wrote:
> 
> On 2023/4/11 18:48, Shuai Xue wrote:
>> Hardware errors could be signaled by asynchronous interrupt, e.g. when an
>> error is detected by a background scrubber, or signaled by synchronous
>> exception, e.g. when an uncorrected error is consumed. Both synchronous and
>> asynchronous errors are queued and handled by a dedicated kthread in
>> workqueue.
>>
>> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
>> synchronous errors") keeps track of whether memory_failure() work was
>> queued, and makes task_work pending to flush out the workqueue so that the
>> work for synchronous error is processed before returning to user-space.
>> The trick ensures that the corrupted page is unmapped and poisoned. And
>> after returning to user-space, the task restarts at the current instruction,
>> which triggers a page fault in which the kernel will send SIGBUS to the
>> current process due to VM_FAULT_HWPOISON.
>>
>> However, the memory failure recovery for hwpoison-aware mechanisms does not
>> work as expected. For example, hwpoison-aware user-space processes like
>> QEMU register their customized SIGBUS handler and enable early kill mode by
>> setting PF_MCE_EARLY at initialization. Then the kernel will directly notify
>> the process by sending a SIGBUS signal in memory failure with the wrong
>> si_code: the actual user-space process accessed the corrupt memory
>> location, but its memory failure work is handled in a kthread context, so
>> kill_proc() sends SIGBUS with the BUS_MCEERR_AO si_code to that process
>> instead of BUS_MCEERR_AR.
>>
>> To this end, separate synchronous and asynchronous error handling into
>> different paths like X86 platform does:
>>
>> - valid synchronous errors: queue a task_work to synchronously send SIGBUS
>>    before ret_to_user.
>> - valid asynchronous errors: queue a work into workqueue to asynchronously
>>    handle memory failure.
>> - abnormal branches such as invalid PA, unexpected severity, no memory
>>    failure config support, invalid GUID section, OOM, etc.
>>
>> Then for valid synchronous errors, the current context in memory failure
>> belongs exactly to the task consuming the poisoned data, and it will send
>> SIGBUS with the proper si_code.
>>
>> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>> Tested-by: Ma Wupeng <mawupeng1@huawei.com>
>> ---
>>   drivers/acpi/apei/ghes.c | 91 +++++++++++++++++++++++++++-------------
>>   include/acpi/ghes.h      |  3 --
>>   mm/memory-failure.c      | 13 ------
>>   3 files changed, 61 insertions(+), 46 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index c479b85899f5..4b70955e25f9 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -452,28 +452,51 @@ static void ghes_clear_estatus(struct ghes *ghes,
>>   }
>>     /*
>> - * Called as task_work before returning to user-space.
>> - * Ensure any queued work has been done before we return to the context that
>> - * triggered the notification.
>> + * struct sync_task_work - for synchronous RAS event
>> + *
>> + * @twork:                callback_head for task work
>> + * @pfn:                  page frame number of corrupted page
>> + * @flags:                fine tune action taken
>> + *
>> + * Structure to pass task work to be handled before
>> + * ret_to_user via task_work_add().
>>    */
>> -static void ghes_kick_task_work(struct callback_head *head)
>> +struct sync_task_work {
>> +    struct callback_head twork;
>> +    u64 pfn;
>> +    int flags;
>> +};
>> +
>> +static void memory_failure_cb(struct callback_head *twork)
>>   {
>> -    struct acpi_hest_generic_status *estatus;
>> -    struct ghes_estatus_node *estatus_node;
>> -    u32 node_len;
>> +    int ret;
>> +    struct sync_task_work *twcb =
>> +        container_of(twork, struct sync_task_work, twork);
>>   -    estatus_node = container_of(head, struct ghes_estatus_node, task_work);
>> -    if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>> -        memory_failure_queue_kick(estatus_node->task_work_cpu);
>> +    ret = memory_failure(twcb->pfn, twcb->flags);
>> +    kfree(twcb);
>>   -    estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>> -    node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
>> -    gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
>> +    if (!ret)
>> +        return;
>> +
>> +    /*
>> +     * -EHWPOISON from memory_failure() means that it already sent SIGBUS
>> +     * to the current process with the proper error info,
>> +     * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
>> +     *
>> +     * In both cases, no further processing is required.
>> +     */
>> +    if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
>> +        return;
>> +
>> +    pr_err("Memory error not recovered");
> 
> The message could also say that SIGBUS is being sent,
> for example "Sending SIGBUS to current task due to memory error not recovered".
> 
>> +    force_sig(SIGBUS);
>>   }
>>     static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>>   {
>>       unsigned long pfn;
>> +    struct sync_task_work *twcb;
>>         if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>>           return false;
>> @@ -486,6 +509,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>>           return false;
>>       }
>>   +    if (flags == MF_ACTION_REQUIRED && current->mm) {
>> +        twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
>> +        if (!twcb)
>> +            return false;
>> +
>> +        twcb->pfn = pfn;
>> +        twcb->flags = flags;
>> +        init_task_work(&twcb->twork, memory_failure_cb);
>> +        task_work_add(current, &twcb->twork, TWA_RESUME);
>> +        return true;
>> +    }
>> +
>>       memory_failure_queue(pfn, flags);
>>       return true;
>>   }
>> @@ -1000,9 +1035,8 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>       struct ghes_estatus_node *estatus_node;
>>       struct acpi_hest_generic *generic;
>>       struct acpi_hest_generic_status *estatus;
>> -    bool task_work_pending;
>> +    bool queued, sync;
>>       u32 len, node_len;
>> -    int ret;
>>         llnode = llist_del_all(&ghes_estatus_llist);
>>       /*
>> @@ -1015,27 +1049,25 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>           estatus_node = llist_entry(llnode, struct ghes_estatus_node,
>>                          llnode);
>>           estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>> +        sync = is_hest_sync_notify(estatus_node->ghes);
>>           len = cper_estatus_len(estatus);
>>           node_len = GHES_ESTATUS_NODE_LEN(len);
>> -        task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
>> +
>> +        queued = ghes_do_proc(estatus_node->ghes, estatus);
>> +        /*
>> +         * If no memory failure work is queued for abnormal synchronous
>> +         * errors, do a force kill.
>> +         */
>> +        if (sync && !queued)
>> +            force_sig(SIGBUS);
> 
> A similar message could also be printed here.
> Apart from this,
> Reviewed-by: Xiaofei Tan <tanxiaofei@huawei.com>

Thanks :)

Sorry, I missed your reply: Thunderbird marked the email as junk and moved
it to the Junk folder.

I will add the warning message above and pick up your Reviewed-by tag.

Cheers,
Shuai



> 
>> +
>>           if (!ghes_estatus_cached(estatus)) {
>>               generic = estatus_node->generic;
>>               if (ghes_print_estatus(NULL, generic, estatus))
>>                   ghes_estatus_cache_add(generic, estatus);
>>           }
>> -
>> -        if (task_work_pending && current->mm) {
>> -            estatus_node->task_work.func = ghes_kick_task_work;
>> -            estatus_node->task_work_cpu = smp_processor_id();
>> -            ret = task_work_add(current, &estatus_node->task_work,
>> -                        TWA_RESUME);
>> -            if (ret)
>> -                estatus_node->task_work.func = NULL;
>> -        }
>> -
>> -        if (!estatus_node->task_work.func)
>> -            gen_pool_free(ghes_estatus_pool,
>> -                      (unsigned long)estatus_node, node_len);
>> +        gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
>> +                  node_len);
>>             llnode = next;
>>       }
>> @@ -1096,7 +1128,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
>>         estatus_node->ghes = ghes;
>>       estatus_node->generic = ghes->generic;
>> -    estatus_node->task_work.func = NULL;
>>       estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>>         if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
>> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
>> index 3c8bba9f1114..e5e0c308d27f 100644
>> --- a/include/acpi/ghes.h
>> +++ b/include/acpi/ghes.h
>> @@ -35,9 +35,6 @@ struct ghes_estatus_node {
>>       struct llist_node llnode;
>>       struct acpi_hest_generic *generic;
>>       struct ghes *ghes;
>> -
>> -    int task_work_cpu;
>> -    struct callback_head task_work;
>>   };
>>     struct ghes_estatus_cache {
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index fae9baf3be16..6ea8c325acb3 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -2355,19 +2355,6 @@ static void memory_failure_work_func(struct work_struct *work)
>>       }
>>   }
>>   -/*
>> - * Process memory_failure work queued on the specified CPU.
>> - * Used to avoid return-to-userspace racing with the memory_failure workqueue.
>> - */
>> -void memory_failure_queue_kick(int cpu)
>> -{
>> -    struct memory_failure_cpu *mf_cpu;
>> -
>> -    mf_cpu = &per_cpu(memory_failure_cpu, cpu);
>> -    cancel_work_sync(&mf_cpu->work);
>> -    memory_failure_work_func(&mf_cpu->work);
>> -}
>> -
>>   static int __init memory_failure_init(void)
>>   {
>>       struct memory_failure_cpu *mf_cpu;
Greg Kroah-Hartman Dec. 18, 2023, 6:54 a.m. UTC | #12
On Mon, Dec 18, 2023 at 02:45:19PM +0800, Shuai Xue wrote:
> A synchronous error is detected when a user-space process accesses memory
> containing a 2-bit uncorrected error. The CPU then takes a synchronous
> error exception, such as a Synchronous External Abort (SEA) on Arm64. The
> kernel queues memory_failure() work, which poisons the related page,
> unmaps it, and then sends a SIGBUS to the process, so that a system-wide
> panic can be avoided.
> 
> However, no memory_failure() work is queued when abnormal synchronous
> errors occur. Such errors include an invalid PA, an unexpected severity,
> no memory failure config support, an invalid GUID section, etc. In such
> cases, the user-space process triggers the SEA again. This loop can
> potentially exceed the platform firmware threshold or even trigger a
> kernel hard lockup, leading to a system reboot.
> 
> Fix it by performing a force kill if no memory_failure() work is queued
> for synchronous errors.
> 
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> ---
>  drivers/acpi/apei/ghes.c | 9 +++++++++
>  1 file changed, 9 insertions(+)

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>
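
For readers following the thread, the `sync && !queued` check quoted earlier can be modelled in user space roughly as follows. This is only a sketch of the decision described in the commit message, not the code from the patch: the `sketch_outcome` enum and `sketch_needs_force_kill()` are hypothetical names, and the real handler calls force_sig(SIGBUS) rather than returning a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of handling one error section (not a kernel type). */
enum sketch_outcome {
    SKETCH_QUEUED,     /* recovery work was queued for this section */
    SKETCH_NOT_QUEUED, /* e.g. invalid PA, unexpected severity, unknown GUID */
};

/*
 * Returns true when the consuming task should be force-killed: the error is
 * synchronous and no section resulted in queued recovery work.
 */
static bool sketch_needs_force_kill(bool sync,
                                    const enum sketch_outcome *sections,
                                    size_t count)
{
    bool queued = false;

    assert(sections != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        if (sections[i] == SKETCH_QUEUED)
            queued = true;
    }

    /* Same shape as the `sync && !queued` check quoted in the thread. */
    return sync && !queued;
}
```

Without such a check, a synchronous error whose sections all fall into the "not queued" case would return to user space, hit the same poisoned location, and raise the exception again.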
Greg Kroah-Hartman Dec. 18, 2023, 6:54 a.m. UTC | #13
On Mon, Dec 18, 2023 at 02:45:20PM +0800, Shuai Xue wrote:
> Part of return value comments for memory_failure() were originally
> documented at the call site. Move those comments to the function
> declaration to improve code readability and to provide developers with
> immediate access to function usage and return information.
> 
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> ---
>  arch/x86/kernel/cpu/mce/core.c | 9 +--------
>  mm/memory-failure.c            | 9 ++++++---
>  2 files changed, 7 insertions(+), 11 deletions(-)
> 

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>
Greg Kroah-Hartman Dec. 18, 2023, 6:54 a.m. UTC | #14
On Mon, Dec 18, 2023 at 02:45:21PM +0800, Shuai Xue wrote:
> Hardware errors can be signaled by an asynchronous interrupt, e.g. when an
> error is detected by a background scrubber, or by a synchronous
> exception, e.g. when a CPU tries to access a poisoned cache line. Both
> synchronous and asynchronous errors are queued as memory_failure() work
> and handled by a dedicated kthread in a workqueue.
> 
> However, memory failure recovery sends SIGBUS with the wrong BUS_MCEERR_AO
> si_code for synchronous errors in early kill mode, even when
> MF_ACTION_REQUIRED is set. The main problem is that the memory failure
> work is handled in kthread context rather than in the context of the
> user-space process that is accessing the corrupt memory location, so
> kill_proc() sends SIGBUS with the BUS_MCEERR_AO si_code to the user-space
> process instead of BUS_MCEERR_AR.
> 
> To this end, queue memory_failure() as a task_work so that the current
> context in memory_failure() belongs to the process consuming the
> poisoned data and SIGBUS is sent with the proper si_code.
> 
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> Tested-by: Ma Wupeng <mawupeng1@huawei.com>
> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> Reviewed-by: Xiaofei Tan <tanxiaofei@huawei.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  drivers/acpi/apei/ghes.c | 77 +++++++++++++++++++++++-----------------
>  include/acpi/ghes.h      |  3 --
>  mm/memory-failure.c      | 13 -------
>  3 files changed, 44 insertions(+), 49 deletions(-)
> 


<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>
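
The si_code problem described in the quoted commit message can be illustrated with a small user-space model. It is a sketch under stated assumptions: `sketch_pick_si_code()`, the `sketch_si_code` enum and the flag value are hypothetical stand-ins, and the only behaviour modelled is the one the message describes, namely that "action required" is reported only when the handler runs in the context of the process that consumed the poison.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit for this sketch; not copied from kernel headers. */
#define SKETCH_MF_ACTION_REQUIRED 0x1u

/* Stand-ins for the BUS_MCEERR_AR / BUS_MCEERR_AO si_code values. */
enum sketch_si_code {
    SKETCH_BUS_MCEERR_AR, /* action required */
    SKETCH_BUS_MCEERR_AO, /* action optional */
};

/*
 * handler_is_consumer is true when the recovery code runs in the context of
 * the task that touched the poisoned memory (task_work), and false when it
 * runs in a kthread (workqueue).
 */
static enum sketch_si_code sketch_pick_si_code(unsigned int flags,
                                               bool handler_is_consumer)
{
    assert((flags & ~SKETCH_MF_ACTION_REQUIRED) == 0);

    if ((flags & SKETCH_MF_ACTION_REQUIRED) && handler_is_consumer)
        return SKETCH_BUS_MCEERR_AR;

    return SKETCH_BUS_MCEERR_AO;
}
```

In this model a workqueue kthread always passes `handler_is_consumer == false`, so the flag alone never yields the AR code. That is the motivation the commit message gives for moving the work to a task_work.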

Patch

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 80ad530583c9..6c03059cbfc6 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -474,8 +474,14 @@  static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
 	if (sec_sev == GHES_SEV_CORRECTED &&
 	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
 		flags = MF_SOFT_OFFLINE;
-	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
-		flags = 0;
+	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE) {
+		if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE)
+			flags = mem_err->error_type == CPER_MEM_SCRUB_UC ?
+					0 :
+					MF_ACTION_REQUIRED;
+		else
+			flags = MF_ACTION_REQUIRED;
+	}
 
 	if (flags != -1)
 		return ghes_do_memory_failure(mem_err->physical_addr, flags);
diff --git a/include/linux/cper.h b/include/linux/cper.h
index eacb7dd7b3af..b77ab7636614 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -235,6 +235,9 @@  enum {
 #define CPER_MEM_VALID_BANK_ADDRESS		0x100000
 #define CPER_MEM_VALID_CHIP_ID			0x200000
 
+#define CPER_MEM_SCRUB_CE			13
+#define CPER_MEM_SCRUB_UC			14
+
 #define CPER_MEM_EXT_ROW_MASK			0x3
 #define CPER_MEM_EXT_ROW_SHIFT			16