From patchwork Thu Oct 27 04:24:45 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shuai Xue <xueshuai@linux.alibaba.com>
X-Patchwork-Id: 619253
Return-Path: <linux-acpi-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
 aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
 by smtp.lore.kernel.org (Postfix) with ESMTP id 6B11BC67871
 for <linux-acpi@archiver.kernel.org>; Thu, 27 Oct 2022 04:25:38 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S233548AbiJ0EZg (ORCPT <rfc822;linux-acpi@archiver.kernel.org>);
 Thu, 27 Oct 2022 00:25:36 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43104 "EHLO
 lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S229691AbiJ0EZg (ORCPT
 <rfc822;linux-acpi@vger.kernel.org>); Thu, 27 Oct 2022 00:25:36 -0400
Received: from out30-43.freemail.mail.aliyun.com
 (out30-43.freemail.mail.aliyun.com [115.124.30.43])
 by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3C5D132B9D;
 Wed, 26 Oct 2022 21:25:00 -0700 (PDT)
X-Alimail-AntiSpam: AC=PASS; BC=-1|-1; BR=01201311R131e4; CH=green; DM=||false|;
 DS=||; FP=0|-1|-1|-1|0|-1|-1|-1; HT=ay29a033018046050;
 MF=xueshuai@linux.alibaba.com; NM=1; PH=DS; RN=17; SR=0;
 TI=SMTPD_---0VT9cxpK_1666844688; 
Received: from localhost.localdomain(mailfrom:xueshuai@linux.alibaba.com
 fp:SMTPD_---0VT9cxpK_1666844688) by smtp.aliyun-inc.com;
 Thu, 27 Oct 2022 12:24:51 +0800
From: Shuai Xue <xueshuai@linux.alibaba.com>
To: rafael@kernel.org, lenb@kernel.org, james.morse@arm.com,
 tony.luck@intel.com, bp@alien8.de, dave.hansen@linux.intel.com,
 jarkko@kernel.org, naoya.horiguchi@nec.com, linmiaohe@huawei.com,
 akpm@linux-foundation.org
Cc: stable@vger.kernel.org, linux-acpi@vger.kernel.org,
 linux-kernel@vger.kernel.org, cuibixuan@linux.alibaba.com,
 baolin.wang@linux.alibaba.com, zhuo.song@linux.alibaba.com,
 xueshuai@linux.alibaba.com
Subject: [PATCH] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
 action required events
Date: Thu, 27 Oct 2022 12:24:45 +0800
Message-Id: <20221027042445.60108-1-xueshuai@linux.alibaba.com>
X-Mailer: git-send-email 2.34.1
MIME-Version: 1.0
Precedence: bulk
List-ID: <linux-acpi.vger.kernel.org>
X-Mailing-List: linux-acpi@vger.kernel.org

There are two major types of uncorrected (UC) errors:

- Action Required: The error is detected and the processor has already
  consumed the corrupted memory. The OS must take action (for example,
  offline the failing page or kill the failing thread) to recover from
  the error.

- Action Optional: The error is detected outside of the processor execution
  context. Some data in memory is corrupted, but it has not been consumed.
  The OS may optionally take action to recover from the error.

On x86 platforms, the two types can easily be distinguished based on the
MCA bank. On arm64 platforms, however, the memory failure flags for all
UC errors whose severity is GHES_SEV_RECOVERABLE are currently set to 0,
i.e. Action Optional.

If a UC error is detected by a background scrubber, it is obviously an
Action Optional error. Other errors should conservatively be treated as
Action Required.

cper_sec_mem_err::error_type identifies the type of error that occurred
if CPER_MEM_VALID_ERROR_TYPE is set. Therefore, set the memory failure
flags to 0 for Scrub Uncorrected Error (type 14), and to
MF_ACTION_REQUIRED otherwise.
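
The decision described above can be sketched as a small standalone
function. This is only an illustration under stated assumptions: the
demo_* names and the DEMO_* values are hypothetical stand-ins for the
kernel definitions, and only CPER_MEM_SCRUB_UC (14) is taken from this
patch.

```c
#include <stdint.h>

/* Placeholder values for this sketch; the real ones live in kernel headers. */
#define DEMO_VALID_ERROR_TYPE	(1u << 0)	/* stands in for CPER_MEM_VALID_ERROR_TYPE */
#define DEMO_ACTION_REQUIRED	(1 << 1)	/* stands in for MF_ACTION_REQUIRED */
#define CPER_MEM_SCRUB_UC	14		/* value introduced by this patch */

/* Hypothetical, trimmed-down stand-in for struct cper_sec_mem_err. */
struct demo_mem_err {
	uint64_t validation_bits;
	uint8_t error_type;
};

/* Pick the memory failure flags for a recoverable UC memory error. */
static int demo_mf_flags(const struct demo_mem_err *e)
{
	/* Only trust error_type when its validity bit is set. */
	if ((e->validation_bits & DEMO_VALID_ERROR_TYPE) &&
	    e->error_type == CPER_MEM_SCRUB_UC)
		return 0;		/* found by the scrubber: Action Optional */

	return DEMO_ACTION_REQUIRED;	/* any other or unknown type: be conservative */
}
```

A record without a valid error type falls through to the conservative
branch, which matches the else arm in the diff below.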

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Xiaofei Tan <tanxiaofei@huawei.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/acpi/apei/ghes.c | 10 ++++++++--
 include/linux/cper.h     |  3 +++
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 80ad530583c9..6c03059cbfc6 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -474,8 +474,14 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
 	if (sec_sev == GHES_SEV_CORRECTED &&
 	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
 		flags = MF_SOFT_OFFLINE;
-	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
-		flags = 0;
+	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE) {
+		if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE)
+			flags = mem_err->error_type == CPER_MEM_SCRUB_UC ?
+					0 :
+					MF_ACTION_REQUIRED;
+		else
+			flags = MF_ACTION_REQUIRED;
+	}
 
 	if (flags != -1)
 		return ghes_do_memory_failure(mem_err->physical_addr, flags);
diff --git a/include/linux/cper.h b/include/linux/cper.h
index eacb7dd7b3af..b77ab7636614 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -235,6 +235,9 @@ enum {
 #define CPER_MEM_VALID_BANK_ADDRESS		0x100000
 #define CPER_MEM_VALID_CHIP_ID			0x200000
 
+#define CPER_MEM_SCRUB_CE			13
+#define CPER_MEM_SCRUB_UC			14
+
 #define CPER_MEM_EXT_ROW_MASK			0x3
 #define CPER_MEM_EXT_ROW_SHIFT			16
 

From patchwork Tue Dec  6 15:33:54 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shuai Xue <xueshuai@linux.alibaba.com>
X-Patchwork-Id: 631349
Return-Path: <linux-acpi-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
 aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
 by smtp.lore.kernel.org (Postfix) with ESMTP id 9CDB5C4708C
 for <linux-acpi@archiver.kernel.org>; Tue,  6 Dec 2022 15:34:11 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S234514AbiLFPeJ (ORCPT <rfc822;linux-acpi@archiver.kernel.org>);
 Tue, 6 Dec 2022 10:34:09 -0500
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56628 "EHLO
 lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S234508AbiLFPeH (ORCPT
 <rfc822;linux-acpi@vger.kernel.org>); Tue, 6 Dec 2022 10:34:07 -0500
Received: from out199-11.us.a.mail.aliyun.com (out199-11.us.a.mail.aliyun.com
 [47.90.199.11])
 by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 58AB52B26B;
 Tue,  6 Dec 2022 07:34:05 -0800 (PST)
X-Alimail-AntiSpam: AC=PASS; BC=-1|-1; BR=01201311R101e4; CH=green; DM=||false|;
 DS=||; FP=0|-1|-1|-1|0|-1|-1|-1; HT=ay29a033018046059;
 MF=xueshuai@linux.alibaba.com; NM=0; PH=DS; RN=16; SR=0;
 TI=SMTPD_---0VWhZY-U_1670340840; 
Received: from localhost.localdomain(mailfrom:xueshuai@linux.alibaba.com
 fp:SMTPD_---0VWhZY-U_1670340840) by smtp.aliyun-inc.com;
 Tue, 06 Dec 2022 23:34:01 +0800
From: Shuai Xue <xueshuai@linux.alibaba.com>
To: rafael@kernel.org, lenb@kernel.org, james.morse@arm.com,
 tony.luck@intel.com, bp@alien8.de, dave.hansen@linux.intel.com,
 jarkko@kernel.org, naoya.horiguchi@nec.com, linmiaohe@huawei.com,
 akpm@linux-foundation.org
Cc: linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org,
 cuibixuan@linux.alibaba.com, baolin.wang@linux.alibaba.com,
 zhuo.song@linux.alibaba.com, xueshuai@linux.alibaba.com
Subject: [RFC PATCH 2/2] ACPI: APEI: separate synchronous error handling into
 task work
Date: Tue,  6 Dec 2022 23:33:54 +0800
Message-Id: <20221206153354.92394-3-xueshuai@linux.alibaba.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20221027042445.60108-1-xueshuai@linux.alibaba.com>
References: <20221027042445.60108-1-xueshuai@linux.alibaba.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <linux-acpi.vger.kernel.org>
X-Mailing-List: linux-acpi@vger.kernel.org

On arm64 platforms, errors can be signaled by an asynchronous interrupt,
e.g. when an error is detected by a background scrubber, or by a
synchronous exception, e.g. when an uncorrected error is consumed. Both
synchronous and asynchronous errors are queued and handled by a dedicated
kthread in a workqueue.

Commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
synchronous errors") keeps track of whether memory_failure() work was
queued, and makes task_work pending to flush out the workqueue so that the
work for a synchronous error is processed before returning to user-space.
This ensures that the corrupted page is unmapped and poisoned. After
returning to user-space, the task restarts at the current instruction,
which triggers a page fault, and the kernel sends SIGBUS due to
VM_FAULT_HWPOISON.

Although the task can be killed via the page fault, the memory failure is
handled in kthread context, so the hwpoison-aware mechanisms, e.g.
PF_MCE_EARLY and early kill, do not work as expected.

To this end, separate synchronous and asynchronous error handling into
different paths, as x86 does:

- task work for synchronous errors.
- workqueue for asynchronous errors.
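
The split between the two paths can be sketched as follows. This is a
minimal illustration, not the kernel code: demo_pick_path(), the enum
and DEMO_ACTION_REQUIRED are hypothetical names, and has_user_mm models
the current->mm check used in the diff below.

```c
#include <stdbool.h>

#define DEMO_ACTION_REQUIRED	(1 << 1)	/* stands in for MF_ACTION_REQUIRED */

enum demo_path {
	DEMO_TASK_WORK,		/* run in the context of the consuming task */
	DEMO_WORKQUEUE,		/* run later in a kthread */
};

/*
 * Choose where memory_failure() should run.
 * Only a user task (has_user_mm) can run task work on its way back to
 * user-space, so everything else stays on the workqueue.
 */
static enum demo_path demo_pick_path(int flags, bool has_user_mm)
{
	if (flags == DEMO_ACTION_REQUIRED && has_user_mm)
		return DEMO_TASK_WORK;

	return DEMO_WORKQUEUE;
}
```

An Action Required error raised while a kernel thread is running still
goes to the workqueue in this sketch, because there is no user context
to return to.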

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 drivers/acpi/apei/ghes.c | 118 ++++++++++++++++++++++-----------------
 1 file changed, 66 insertions(+), 52 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index a420759fce2d..f13c298f47e6 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -421,46 +421,80 @@ static void ghes_clear_estatus(struct ghes *ghes,
 		ghes_ack_error(ghes->generic_v2);
 }
 
-/*
- * Called as task_work before returning to user-space.
- * Ensure any queued work has been done before we return to the context that
- * triggered the notification.
+/**
+ * struct mce_task_work - for synchronous RAS event
+ *
+ * @twork:                callback_head for task work
+ * @pfn:                  page frame number of corrupted page
+ * @flags:                fine tune action taken
+ *
+ * Structure to pass task work to be handled before
+ * returning to userspace via task_work_add().
  */
-static void ghes_kick_task_work(struct callback_head *head)
+struct mce_task_work {
+	struct callback_head twork;
+	u64 pfn;
+	int flags;
+};
+
+static void memory_failure_cb(struct callback_head *twork)
 {
-	struct acpi_hest_generic_status *estatus;
-	struct ghes_estatus_node *estatus_node;
-	u32 node_len;
+	int ret;
+	struct mce_task_work *twcb =
+		container_of(twork, struct mce_task_work, twork);
+	ret = memory_failure(twcb->pfn, twcb->flags);
+	kfree(twcb);
 
-	estatus_node = container_of(head, struct ghes_estatus_node, task_work);
-	if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
-		memory_failure_queue_kick(estatus_node->task_work_cpu);
+	if (!ret)
+		return;
+	/*
+	 * -EHWPOISON from memory_failure() means that it already sent SIGBUS
+	 * to the current process with the proper error info,
+	 * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
+	 *
+	 * In both cases, no further processing is required.
+	 */
+	if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
+		return;
 
-	estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
-	node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
-	gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
+	pr_err("Memory error not recovered");
+	force_sig(SIGBUS);
 }
 
-static bool ghes_do_memory_failure(u64 physical_addr, int flags)
+static void ghes_do_memory_failure(u64 physical_addr, int flags)
 {
 	unsigned long pfn;
+	struct mce_task_work *twcb;
 
 	if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
-		return false;
+		return;
 
 	pfn = PHYS_PFN(physical_addr);
 	if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {
 		pr_warn_ratelimited(FW_WARN GHES_PFX
 		"Invalid address in generic error data: %#llx\n",
 		physical_addr);
-		return false;
+		return;
 	}
 
-	memory_failure_queue(pfn, flags);
-	return true;
+	if (flags == MF_ACTION_REQUIRED && current->mm) {
+		twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
+		if (!twcb)
+			return;
+
+		twcb->pfn = pfn;
+		twcb->flags = flags;
+		init_task_work(&twcb->twork, memory_failure_cb);
+		task_work_add(current, &twcb->twork, TWA_RESUME);
+		return;
+	} else {
+		memory_failure_queue(pfn, flags);
+	}
+
+	return;
 }
 
-static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
+static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
 				       int sev)
 {
 	int flags = -1;
@@ -468,7 +502,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
 	struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
 
 	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
-		return false;
+		return;
 
 	/* iff following two events can be handled properly by now */
 	if (sec_sev == GHES_SEV_CORRECTED &&
@@ -478,15 +512,12 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
 		flags = (gdata->flags & CPER_SEC_SYNC) ? MF_ACTION_REQUIRED : 0;
 
 	if (flags != -1)
-		return ghes_do_memory_failure(mem_err->physical_addr, flags);
-
-	return false;
+		ghes_do_memory_failure(mem_err->physical_addr, flags);
 }
 
-static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
+static void ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
 {
 	struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
-	bool queued = false;
 	int sec_sev, i;
 	char *p;
 
@@ -494,7 +525,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
 
 	sec_sev = ghes_severity(gdata->error_severity);
 	if (sev != GHES_SEV_RECOVERABLE || sec_sev != GHES_SEV_RECOVERABLE)
-		return false;
+		return;
 
 	p = (char *)(err + 1);
 	for (i = 0; i < err->err_info_num; i++) {
@@ -510,7 +541,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
 		 * and don't filter out 'corrected' error here.
 		 */
 		if (is_cache && has_pa) {
-			queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
+			ghes_do_memory_failure(err_info->physical_fault_addr, 0);
 			p += err_info->length;
 			continue;
 		}
@@ -524,7 +555,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
 		p += err_info->length;
 	}
 
-	return queued;
+	return;
 }
 
 /*
@@ -622,7 +653,7 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
 	schedule_work(&entry->work);
 }
 
-static bool ghes_do_proc(struct ghes *ghes,
+static void ghes_do_proc(struct ghes *ghes,
 			 const struct acpi_hest_generic_status *estatus)
 {
 	int sev, sec_sev;
@@ -630,7 +661,6 @@ static bool ghes_do_proc(struct ghes *ghes,
 	guid_t *sec_type;
 	const guid_t *fru_id = &guid_null;
 	char *fru_text = "";
-	bool queued = false;
 
 	sev = ghes_severity(estatus->error_severity);
 	apei_estatus_for_each_section(estatus, gdata) {
@@ -648,13 +678,13 @@ static bool ghes_do_proc(struct ghes *ghes,
 			ghes_edac_report_mem_error(sev, mem_err);
 
 			arch_apei_report_mem_error(sev, mem_err);
-			queued = ghes_handle_memory_failure(gdata, sev);
+			ghes_handle_memory_failure(gdata, sev);
 		}
 		else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
 			ghes_handle_aer(gdata);
 		}
 		else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
-			queued = ghes_handle_arm_hw_error(gdata, sev);
+			ghes_handle_arm_hw_error(gdata, sev);
 		} else {
 			void *err = acpi_hest_get_payload(gdata);
 
@@ -664,8 +694,6 @@ static bool ghes_do_proc(struct ghes *ghes,
 					       gdata->error_data_length);
 		}
 	}
-
-	return queued;
 }
 
 static void __ghes_print_estatus(const char *pfx,
@@ -961,9 +989,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
 	struct ghes_estatus_node *estatus_node;
 	struct acpi_hest_generic *generic;
 	struct acpi_hest_generic_status *estatus;
-	bool task_work_pending;
 	u32 len, node_len;
-	int ret;
 
 	llnode = llist_del_all(&ghes_estatus_llist);
 	/*
@@ -978,26 +1004,15 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
 		estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
 		len = cper_estatus_len(estatus);
 		node_len = GHES_ESTATUS_NODE_LEN(len);
-		task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
+		ghes_do_proc(estatus_node->ghes, estatus);
 		if (!ghes_estatus_cached(estatus)) {
 			generic = estatus_node->generic;
 			if (ghes_print_estatus(NULL, generic, estatus))
 				ghes_estatus_cache_add(generic, estatus);
 		}
 
-		if (task_work_pending && current->mm) {
-			estatus_node->task_work.func = ghes_kick_task_work;
-			estatus_node->task_work_cpu = smp_processor_id();
-			ret = task_work_add(current, &estatus_node->task_work,
-					    TWA_RESUME);
-			if (ret)
-				estatus_node->task_work.func = NULL;
-		}
-
-		if (!estatus_node->task_work.func)
-			gen_pool_free(ghes_estatus_pool,
-				      (unsigned long)estatus_node, node_len);
-
+		gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
+			      node_len);
 		llnode = next;
 	}
 }
@@ -1057,7 +1072,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
 
 	estatus_node->ghes = ghes;
 	estatus_node->generic = ghes->generic;
-	estatus_node->task_work.func = NULL;
 	estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
 
 	if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {

From patchwork Mon Dec 18 06:45:20 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shuai Xue <xueshuai@linux.alibaba.com>
X-Patchwork-Id: 756027
Received: from out30-110.freemail.mail.aliyun.com
 (out30-110.freemail.mail.aliyun.com [115.124.30.110])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6F9BA7498;
 Mon, 18 Dec 2023 06:45:37 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linux.alibaba.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=linux.alibaba.com
X-Alimail-AntiSpam: AC=PASS; BC=-1|-1; BR=01201311R101e4; CH=green; DM=||false|;
 DS=||; FP=0|-1|-1|-1|0|-1|-1|-1; HT=ay29a033018045192;
 MF=xueshuai@linux.alibaba.com; NM=1; PH=DS; RN=35; SR=0;
 TI=SMTPD_---0VygHb.J_1702881931; 
Received: from localhost.localdomain(mailfrom:xueshuai@linux.alibaba.com
 fp:SMTPD_---0VygHb.J_1702881931) by smtp.aliyun-inc.com;
 Mon, 18 Dec 2023 14:45:33 +0800
From: Shuai Xue <xueshuai@linux.alibaba.com>
To: bp@alien8.de, rafael@kernel.org, wangkefeng.wang@huawei.com,
 tanxiaofei@huawei.com, mawupeng1@huawei.com, tony.luck@intel.com,
 linmiaohe@huawei.com, naoya.horiguchi@nec.com, james.morse@arm.com,
 gregkh@linuxfoundation.org, will@kernel.org, jarkko@kernel.org
Cc: linux-acpi@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
 linux-edac@vger.kernel.org, acpica-devel@lists.linuxfoundation.org,
 stable@vger.kernel.org, x86@kernel.org, xueshuai@linux.alibaba.com,
 justin.he@arm.com, ardb@kernel.org, ying.huang@intel.com,
 ashish.kalra@amd.com, baolin.wang@linux.alibaba.com, tglx@linutronix.de,
 mingo@redhat.com, dave.hansen@linux.intel.com, lenb@kernel.org,
 hpa@zytor.com, robert.moore@intel.com, lvying6@huawei.com,
 xiexiuqi@huawei.com, zhuo.song@linux.alibaba.com
Subject: [PATCH v10 3/4] mm: memory-failure: move memory_failure() return
 value documentation to function declaration
Date: Mon, 18 Dec 2023 14:45:20 +0800
Message-Id: <20231218064521.37324-4-xueshuai@linux.alibaba.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20221027042445.60108-1-xueshuai@linux.alibaba.com>
References: <20221027042445.60108-1-xueshuai@linux.alibaba.com>
Precedence: bulk
X-Mailing-List: linux-acpi@vger.kernel.org
List-Id: <linux-acpi.vger.kernel.org>
List-Subscribe: <mailto:linux-acpi+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-acpi+unsubscribe@vger.kernel.org>
MIME-Version: 1.0

Some of the return value comments for memory_failure() were originally
documented at the call site. Move those comments to the function
declaration to improve code readability and to give developers
immediate access to function usage and return information.
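
How a caller might act on the documented return values can be sketched
as below. demo_needs_sigbus() is a hypothetical helper written for this
illustration, and the EHWPOISON fallback value is an assumption used
only when the libc headers do not provide it.

```c
#include <errno.h>
#include <stdbool.h>

#ifndef EHWPOISON
#define EHWPOISON 133	/* asm-generic value; fallback for this sketch only */
#endif

/* Return true when the caller still has to send SIGBUS itself. */
static bool demo_needs_sigbus(int ret)
{
	if (ret == 0)			/* error handled successfully */
		return false;
	if (ret == -EHWPOISON)		/* SIGBUS was already sent */
		return false;
	if (ret == -EOPNOTSUPP)		/* event was filtered out */
		return false;

	return true;			/* any other negative value: failure */
}
```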

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 arch/x86/kernel/cpu/mce/core.c | 9 +--------
 mm/memory-failure.c            | 9 ++++++---
 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7b397370b4d6..43e542f06ad5 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1324,17 +1324,10 @@ static void kill_me_maybe(struct callback_head *cb)
 		return;
 	}
 
-	/*
-	 * -EHWPOISON from memory_failure() means that it already sent SIGBUS
-	 * to the current process with the proper error info,
-	 * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
-	 *
-	 * In both cases, no further processing is required.
-	 */
 	if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
 		return;
 
-	pr_err("Memory error not recovered");
+	pr_err("Sending SIGBUS to current task due to memory error not recovered");
 	kill_me_now(cb);
 }
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 660c21859118..bd3dcafdfa4a 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2164,9 +2164,12 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
  * Must run in process context (e.g. a work queue) with interrupts
  * enabled and no spinlocks held.
  *
- * Return: 0 for successfully handled the memory error,
- *         -EOPNOTSUPP for hwpoison_filter() filtered the error event,
- *         < 0(except -EOPNOTSUPP) on failure.
+ * Return values:
+ *   0             - success
+ *   -EOPNOTSUPP   - hwpoison_filter() filtered the error event.
+ *   -EHWPOISON    - sent SIGBUS to the current process with the proper
+ *                   error info by kill_accessing_process().
+ *   other negative values - failure
  */
 int memory_failure(unsigned long pfn, int flags)
 {

From patchwork Mon Dec 18 06:45:21 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shuai Xue <xueshuai@linux.alibaba.com>
X-Patchwork-Id: 755513
Received: from out30-110.freemail.mail.aliyun.com
 (out30-110.freemail.mail.aliyun.com [115.124.30.110])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by smtp.subspace.kernel.org (Postfix) with ESMTPS id 11D0FD522;
 Mon, 18 Dec 2023 06:51:00 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linux.alibaba.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=linux.alibaba.com
X-Alimail-AntiSpam: AC=PASS; BC=-1|-1; BR=01201311R211e4; CH=green; DM=||false|;
 DS=||; FP=0|-1|-1|-1|0|-1|-1|-1; HT=ay29a033018046059;
 MF=xueshuai@linux.alibaba.com; NM=1; PH=DS; RN=35; SR=0;
 TI=SMTPD_---0VygHb04_1702881934; 
Received: from localhost.localdomain(mailfrom:xueshuai@linux.alibaba.com
 fp:SMTPD_---0VygHb04_1702881934) by smtp.aliyun-inc.com;
 Mon, 18 Dec 2023 14:45:35 +0800
From: Shuai Xue <xueshuai@linux.alibaba.com>
To: bp@alien8.de, rafael@kernel.org, wangkefeng.wang@huawei.com,
 tanxiaofei@huawei.com, mawupeng1@huawei.com, tony.luck@intel.com,
 linmiaohe@huawei.com, naoya.horiguchi@nec.com, james.morse@arm.com,
 gregkh@linuxfoundation.org, will@kernel.org, jarkko@kernel.org
Cc: linux-acpi@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
 linux-edac@vger.kernel.org, acpica-devel@lists.linuxfoundation.org,
 stable@vger.kernel.org, x86@kernel.org, xueshuai@linux.alibaba.com,
 justin.he@arm.com, ardb@kernel.org, ying.huang@intel.com,
 ashish.kalra@amd.com, baolin.wang@linux.alibaba.com, tglx@linutronix.de,
 mingo@redhat.com, dave.hansen@linux.intel.com, lenb@kernel.org,
 hpa@zytor.com, robert.moore@intel.com, lvying6@huawei.com,
 xiexiuqi@huawei.com, zhuo.song@linux.alibaba.com
Subject: [PATCH v10 4/4] ACPI: APEI: handle synchronous exceptions in task work
Date: Mon, 18 Dec 2023 14:45:21 +0800
Message-Id: <20231218064521.37324-5-xueshuai@linux.alibaba.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20221027042445.60108-1-xueshuai@linux.alibaba.com>
References: <20221027042445.60108-1-xueshuai@linux.alibaba.com>
Precedence: bulk
X-Mailing-List: linux-acpi@vger.kernel.org
List-Id: <linux-acpi.vger.kernel.org>
List-Subscribe: <mailto:linux-acpi+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-acpi+unsubscribe@vger.kernel.org>
MIME-Version: 1.0

Hardware errors can be signaled by an asynchronous interrupt, e.g. when an
error is detected by a background scrubber, or by a synchronous
exception, e.g. when a CPU tries to access a poisoned cache line. Both
synchronous and asynchronous errors are queued as memory_failure() work
and handled by a dedicated kthread in a workqueue.

However, memory failure recovery sends SIGBUS with the wrong BUS_MCEERR_AO
si_code for synchronous errors in early kill mode, even if
MF_ACTION_REQUIRED is set. The main problem is that the memory failure work
is handled in kthread context rather than in the user-space process that is
accessing the corrupt memory location, so kill_proc() sends SIGBUS with the
BUS_MCEERR_AO si_code to the user-space process instead of BUS_MCEERR_AR.

To this end, queue memory_failure() as a task_work so that the current
context in memory_failure() belongs to the process consuming the
poisoned data, and SIGBUS is sent with the proper si_code.
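
Why the si_code matters to user-space can be sketched as follows. This
is a hedged illustration of how a SIGBUS handler might classify the
signal; demo_classify_sigbus() and the enum are hypothetical, and the
fallback defines are assumptions used only when the libc headers lack
the constants.

```c
#define _GNU_SOURCE
#include <signal.h>

#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4	/* fallback for this sketch only */
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5	/* fallback for this sketch only */
#endif

enum demo_action {
	DEMO_NOT_MCE,		/* SIGBUS unrelated to memory errors */
	DEMO_ABORT_ACCESS,	/* poison was consumed: do not retry the access */
	DEMO_DROP_PAGE_LATER,	/* advisory: stop using the page at leisure */
};

/* Map the si_code of a SIGBUS to the action a handler would take. */
static enum demo_action demo_classify_sigbus(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AR:
		return DEMO_ABORT_ACCESS;
	case BUS_MCEERR_AO:
		return DEMO_DROP_PAGE_LATER;
	default:
		return DEMO_NOT_MCE;
	}
}
```

A process that consumed poisoned data but receives BUS_MCEERR_AO would
take the advisory branch here, which is the mismatch this patch is
meant to avoid.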

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Tested-by: Ma Wupeng <mawupeng1@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Xiaofei Tan <tanxiaofei@huawei.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 drivers/acpi/apei/ghes.c | 77 +++++++++++++++++++++++-----------------
 include/acpi/ghes.h      |  3 --
 mm/memory-failure.c      | 13 -------
 3 files changed, 44 insertions(+), 49 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index f832ffc5a88d..a6b4907cfe47 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -464,28 +464,41 @@ static void ghes_clear_estatus(struct ghes *ghes,
 }
 
 /*
- * Called as task_work before returning to user-space.
- * Ensure any queued work has been done before we return to the context that
- * triggered the notification.
+ * struct sync_task_work - for synchronous RAS event
+ *
+ * @twork:                callback_head for task work
+ * @pfn:                  page frame number of corrupted page
+ * @flags:                fine tune action taken
+ *
+ * Structure to pass task work to be handled before
+ * ret_to_user via task_work_add().
  */
-static void ghes_kick_task_work(struct callback_head *head)
+struct sync_task_work {
+	struct callback_head twork;
+	u64 pfn;
+	int flags;
+};
+
+static void memory_failure_cb(struct callback_head *twork)
 {
-	struct acpi_hest_generic_status *estatus;
-	struct ghes_estatus_node *estatus_node;
-	u32 node_len;
+	int ret;
+	struct sync_task_work *twcb =
+		container_of(twork, struct sync_task_work, twork);
 
-	estatus_node = container_of(head, struct ghes_estatus_node, task_work);
-	if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
-		memory_failure_queue_kick(estatus_node->task_work_cpu);
+	ret = memory_failure(twcb->pfn, twcb->flags);
+	gen_pool_free(ghes_estatus_pool, (unsigned long)twcb, sizeof(*twcb));
 
-	estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
-	node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
-	gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
+	if (!ret || ret == -EHWPOISON || ret == -EOPNOTSUPP)
+		return;
+
+	pr_err("Sending SIGBUS to current task due to memory error not recovered");
+	force_sig(SIGBUS);
 }
 
 static bool ghes_do_memory_failure(u64 physical_addr, int flags)
 {
 	unsigned long pfn;
+	struct sync_task_work *twcb;
 
 	if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
 		return false;
@@ -498,6 +511,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
 		return false;
 	}
 
+	if (flags == MF_ACTION_REQUIRED && current->mm) {
+		twcb = (void *)gen_pool_alloc(ghes_estatus_pool, sizeof(*twcb));
+		if (!twcb)
+			return false;
+
+		twcb->pfn = pfn;
+		twcb->flags = flags;
+		init_task_work(&twcb->twork, memory_failure_cb);
+		task_work_add(current, &twcb->twork, TWA_RESUME);
+		return true;
+	}
+
 	memory_failure_queue(pfn, flags);
 	return true;
 }
@@ -673,7 +698,7 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
 	schedule_work(&entry->work);
 }
 
-static bool ghes_do_proc(struct ghes *ghes,
+static void ghes_do_proc(struct ghes *ghes,
 			 const struct acpi_hest_generic_status *estatus)
 {
 	int sev, sec_sev;
@@ -725,8 +750,6 @@ static bool ghes_do_proc(struct ghes *ghes,
 		pr_err("Sending SIGBUS to current task due to memory error not recovered");
 		force_sig(SIGBUS);
 	}
-
-	return queued;
 }
 
 static void __ghes_print_estatus(const char *pfx,
@@ -1028,9 +1051,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
 	struct ghes_estatus_node *estatus_node;
 	struct acpi_hest_generic *generic;
 	struct acpi_hest_generic_status *estatus;
-	bool task_work_pending;
 	u32 len, node_len;
-	int ret;
 
 	llnode = llist_del_all(&ghes_estatus_llist);
 	/*
@@ -1045,25 +1066,16 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
 		estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
 		len = cper_estatus_len(estatus);
 		node_len = GHES_ESTATUS_NODE_LEN(len);
-		task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
+
+		ghes_do_proc(estatus_node->ghes, estatus);
+
 		if (!ghes_estatus_cached(estatus)) {
 			generic = estatus_node->generic;
 			if (ghes_print_estatus(NULL, generic, estatus))
 				ghes_estatus_cache_add(generic, estatus);
 		}
-
-		if (task_work_pending && current->mm) {
-			estatus_node->task_work.func = ghes_kick_task_work;
-			estatus_node->task_work_cpu = smp_processor_id();
-			ret = task_work_add(current, &estatus_node->task_work,
-					    TWA_RESUME);
-			if (ret)
-				estatus_node->task_work.func = NULL;
-		}
-
-		if (!estatus_node->task_work.func)
-			gen_pool_free(ghes_estatus_pool,
-				      (unsigned long)estatus_node, node_len);
+		gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
+			      node_len);
 
 		llnode = next;
 	}
@@ -1124,7 +1136,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
 
 	estatus_node->ghes = ghes;
 	estatus_node->generic = ghes->generic;
-	estatus_node->task_work.func = NULL;
 	estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
 
 	if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index be1dd4c1a917..ebd21b05fe6e 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -35,9 +35,6 @@ struct ghes_estatus_node {
 	struct llist_node llnode;
 	struct acpi_hest_generic *generic;
 	struct ghes *ghes;
-
-	int task_work_cpu;
-	struct callback_head task_work;
 };
 
 struct ghes_estatus_cache {
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index bd3dcafdfa4a..6bff57444928 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2451,19 +2451,6 @@ static void memory_failure_work_func(struct work_struct *work)
 	}
 }
 
-/*
- * Process memory_failure work queued on the specified CPU.
- * Used to avoid return-to-userspace racing with the memory_failure workqueue.
- */
-void memory_failure_queue_kick(int cpu)
-{
-	struct memory_failure_cpu *mf_cpu;
-
-	mf_cpu = &per_cpu(memory_failure_cpu, cpu);
-	cancel_work_sync(&mf_cpu->work);
-	memory_failure_work_func(&mf_cpu->work);
-}
-
 static int __init memory_failure_init(void)
 {
 	struct memory_failure_cpu *mf_cpu;