mbox series

[v2,0/4] MCE wrapper and support for new SMCA syndrome MSRs

Message ID 20240625195624.2565741-1-avadhut.naik@amd.com
Headers show
Series MCE wrapper and support for new SMCA syndrome MSRs | expand

Message

Avadhut Naik June 25, 2024, 7:56 p.m. UTC
This patchset adds a new wrapper for struct mce to prevent its bloating
and export vendor specific error information. Additionally, support is
also introduced for two new "syndrome" MSRs used in newer AMD Scalable
MCA (SMCA) systems. Also, a new "FRU Text in MCA" feature that uses these
new "syndrome" MSRs has been addded.

Patch 1 adds the new wrapper structure mce_hw_err for the struct mce
while also modifying the mce_record tracepoint to use the new wrapper.

Patch 2 adds support for the new "syndrome" registers. They are read/printed
wherever the existing MCA_SYND register is used.

Patch 3 updates the function that pulls MCA information from UEFI x86
Common Platform Error Records (CPERs) to handle systems that support the
new registers.

Patch 4 adds support to the AMD MCE decoder module to detect and use the
"FRU Text in MCA" feature which leverages the new registers.

NOTE:

This set was initially submitted as part of the larger MCA Updates set.

v1: https://lore.kernel.org/linux-edac/20231118193248.1296798-1-yazen.ghannam@amd.com/
v2: https://lore.kernel.org/linux-edac/20240404151359.47970-1-yazen.ghannam@amd.com/

However, since the MCA Updates set has been split up into smaller sets,
this set, going forward, will be submitted independently.

Having said that, this set set depends on and applies cleanly on top of
the below two sets.

[1] https://lore.kernel.org/linux-edac/20240521125434.1555845-1-yazen.ghannam@amd.com/
[2] https://lore.kernel.org/linux-edac/20240523155641.2805411-1-yazen.ghannam@amd.com/

Changes in v2:
 - Drop dependencies on sets [1] and [2] above and rebase on top of
   tip/master. (Boris)

Avadhut Naik (2):
  x86/mce: Add wrapper for struct mce to export vendor specific info
  x86/mce, EDAC/mce_amd: Add support for new MCA_SYND{1,2} registers

Yazen Ghannam (2):
  x86/mce/apei: Handle variable register array size
  EDAC/mce_amd: Add support for FRU Text in MCA

 arch/x86/include/asm/mce.h              |  20 ++-
 arch/x86/kernel/cpu/mce/amd.c           |  33 ++--
 arch/x86/kernel/cpu/mce/apei.c          | 119 ++++++++++----
 arch/x86/kernel/cpu/mce/core.c          | 201 ++++++++++++++----------
 arch/x86/kernel/cpu/mce/dev-mcelog.c    |   2 +-
 arch/x86/kernel/cpu/mce/genpool.c       |  20 +--
 arch/x86/kernel/cpu/mce/inject.c        |   4 +-
 arch/x86/kernel/cpu/mce/internal.h      |   4 +-
 drivers/acpi/acpi_extlog.c              |   2 +-
 drivers/acpi/nfit/mce.c                 |   2 +-
 drivers/edac/i7core_edac.c              |   2 +-
 drivers/edac/igen6_edac.c               |   2 +-
 drivers/edac/mce_amd.c                  |  27 +++-
 drivers/edac/pnd2_edac.c                |   2 +-
 drivers/edac/sb_edac.c                  |   2 +-
 drivers/edac/skx_common.c               |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c |   2 +-
 drivers/ras/amd/fmpm.c                  |   2 +-
 drivers/ras/cec.c                       |   2 +-
 include/trace/events/mce.h              |  51 +++---
 20 files changed, 316 insertions(+), 185 deletions(-)


base-commit: 4fe5c16f5e5e0bd1a71a5ac79b5870f91b6b8e81

Comments

Borislav Petkov June 26, 2024, 11:57 a.m. UTC | #1
On Tue, Jun 25, 2024 at 02:56:23PM -0500, Avadhut Naik wrote:
> From: Yazen Ghannam <yazen.ghannam@amd.com>
> 
> ACPI Boot Error Record Table (BERT) is being used by the kernel to
> report errors that occurred in a previous boot. On some modern AMD
> systems, these very errors within the BERT are reported through the
> x86 Common Platform Error Record (CPER) format which consists of one
> or more Processor Context Information Structures. These context
> structures provide a starting address and represent an x86 MSR range
> in which the data constitutes a contiguous set of MSRs starting from,
> and including the starting address.
> 
> It's common, for AMD systems that implement this behavior, that the
> MSR range represents the MCAX register space used for the Scalable MCA
> feature. The apei_smca_report_x86_error() function decodes and passes
> this information through the MCE notifier chain. However, this function
> assumes a fixed register size based on the original HW/FW implementation.
> 
> This assumption breaks with the addition of two new MCAX registers viz.
> MCA_SYND1 and MCA_SYND2. These registers are added at the end of the
> MCAX register space, so they won't be included when decoding the CPER
> data.
> 
> Rework apei_smca_report_x86_error() to support a variable register array
> size. This covers any case where the MSR context information starts at
> the MCAX address for MCA_STATUS and ends at any other register within
> the MCAX register space.
> 
> Add code comments indicating the MCAX register at each offset.
> 
> [Yazen: Add Avadhut as co-developer for wrapper changes.]
> 
> Co-developed-by: Avadhut Naik <avadhut.naik@amd.com>
> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>

This needs Avadhut's SOB after Yazen's.

Touchups ontop:

diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
index 7a15f0ca1bd1..6bbeb29125a9 100644
--- a/arch/x86/kernel/cpu/mce/apei.c
+++ b/arch/x86/kernel/cpu/mce/apei.c
@@ -69,7 +69,7 @@ EXPORT_SYMBOL_GPL(apei_mce_report_mem_error);
 int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
 {
 	const u64 *i_mce = ((const u64 *) (ctx_info + 1));
-	unsigned int cpu, num_registers;
+	unsigned int cpu, num_regs;
 	struct mce_hw_err err;
 	struct mce *m = &err.m;
 
@@ -93,10 +93,10 @@ int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
 	/*
 	 * The number of registers in the register array is determined by
 	 * Register Array Size/8 as defined in UEFI spec v2.8, sec N.2.4.2.2.
-	 * Ensure that the array size includes at least 1 register.
+	 * Sanity-check registers array size.
 	 */
-	num_registers = ctx_info->reg_arr_size >> 3;
-	if (!num_registers)
+	num_regs = ctx_info->reg_arr_size >> 3;
+	if (!num_regs)
 		return -EINVAL;
 
 	mce_setup(m);
@@ -118,13 +118,12 @@ int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
 	/*
 	 * The SMCA register layout is fixed and includes 16 registers.
 	 * The end of the array may be variable, but the beginning is known.
-	 * Switch on the number of registers. Cap the number of registers to
-	 * expected max (15).
+	 * Cap the number of registers to expected max (15).
 	 */
-	if (num_registers > 15)
-		num_registers = 15;
+	if (num_regs > 15)
+		num_regs = 15;
 
-	switch (num_registers) {
+	switch (num_regs) {
 	/* MCA_SYND2 */
 	case 15:
 		err.vi.amd.synd2 = *(i_mce + 14);
Naik, Avadhut June 26, 2024, 5:28 p.m. UTC | #2
On 6/26/2024 06:57, Borislav Petkov wrote:
> On Tue, Jun 25, 2024 at 02:56:23PM -0500, Avadhut Naik wrote:
>> From: Yazen Ghannam <yazen.ghannam@amd.com>
>>
>> ACPI Boot Error Record Table (BERT) is being used by the kernel to
>> report errors that occurred in a previous boot. On some modern AMD
>> systems, these very errors within the BERT are reported through the
>> x86 Common Platform Error Record (CPER) format which consists of one
>> or more Processor Context Information Structures. These context
>> structures provide a starting address and represent an x86 MSR range
>> in which the data constitutes a contiguous set of MSRs starting from,
>> and including the starting address.
>>
>> It's common, for AMD systems that implement this behavior, that the
>> MSR range represents the MCAX register space used for the Scalable MCA
>> feature. The apei_smca_report_x86_error() function decodes and passes
>> this information through the MCE notifier chain. However, this function
>> assumes a fixed register size based on the original HW/FW implementation.
>>
>> This assumption breaks with the addition of two new MCAX registers viz.
>> MCA_SYND1 and MCA_SYND2. These registers are added at the end of the
>> MCAX register space, so they won't be included when decoding the CPER
>> data.
>>
>> Rework apei_smca_report_x86_error() to support a variable register array
>> size. This covers any case where the MSR context information starts at
>> the MCAX address for MCA_STATUS and ends at any other register within
>> the MCAX register space.
>>
>> Add code comments indicating the MCAX register at each offset.
>>
>> [Yazen: Add Avadhut as co-developer for wrapper changes.]
>>
>> Co-developed-by: Avadhut Naik <avadhut.naik@amd.com>
>> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
>> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> 
> This needs Avadhut's SOB after Yazen's.
> 
Will do. Will change to the below format:

Co-developed-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>

> Touchups ontop:
> 
> diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
> index 7a15f0ca1bd1..6bbeb29125a9 100644
> --- a/arch/x86/kernel/cpu/mce/apei.c
> +++ b/arch/x86/kernel/cpu/mce/apei.c
> @@ -69,7 +69,7 @@ EXPORT_SYMBOL_GPL(apei_mce_report_mem_error);
>  int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
>  {
>  	const u64 *i_mce = ((const u64 *) (ctx_info + 1));
> -	unsigned int cpu, num_registers;
> +	unsigned int cpu, num_regs;
>  	struct mce_hw_err err;
>  	struct mce *m = &err.m;
>  
> @@ -93,10 +93,10 @@ int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
>  	/*
>  	 * The number of registers in the register array is determined by
>  	 * Register Array Size/8 as defined in UEFI spec v2.8, sec N.2.4.2.2.
> -	 * Ensure that the array size includes at least 1 register.
> +	 * Sanity-check registers array size.
>  	 */
> -	num_registers = ctx_info->reg_arr_size >> 3;
> -	if (!num_registers)
> +	num_regs = ctx_info->reg_arr_size >> 3;
> +	if (!num_regs)
>  		return -EINVAL;
>  
>  	mce_setup(m);
> @@ -118,13 +118,12 @@ int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
>  	/*
>  	 * The SMCA register layout is fixed and includes 16 registers.
>  	 * The end of the array may be variable, but the beginning is known.
> -	 * Switch on the number of registers. Cap the number of registers to
> -	 * expected max (15).
> +	 * Cap the number of registers to expected max (15).
>  	 */
> -	if (num_registers > 15)
> -		num_registers = 15;
> +	if (num_regs > 15)
> +		num_regs = 15;
>  
> -	switch (num_registers) {
> +	switch (num_regs) {
>  	/* MCA_SYND2 */
>  	case 15:
>  		err.vi.amd.synd2 = *(i_mce + 14);
> 
Will incorporate these.