mbox series

[v12,0/3] cxl, EINJ: Update EINJ for CXL error types

Message ID 20240214200709.777166-1-Benjamin.Cheatham@amd.com
Headers show
Series cxl, EINJ: Update EINJ for CXL error types | expand

Message

Ben Cheatham Feb. 14, 2024, 8:07 p.m. UTC
v12 Changes:
	- Rebase onto v6.8-rc4
	- Squash Kconfig patch into patch 2/3 (Jonathan)
	- Change CONFIG_CXL_EINJ from "depends on ACPI_APEI_EINJ >= CXL_BUS"
	  to "depends on ACPI_APEI_EINJ = CXL_BUS"
	- Drop "ACPI, APEI" part of commit message title and use just EINJ
	instead (Dan)
	- Add protocol error types to "einj_types" documentation (Jonathan)
	- Change 0xffff... constants to use GENMASK()
	- Drop param* variables and use constants instead in cxl error
	  inject functions (Jonathan)
	- Add is_cxl_error_type() helper function in einj.c (Jonathan)
	- Remove a stray function declaration in einj-cxl.h (Jonathan)
	- Comment #else/#endifs with corresponding #if/#ifdef in
	einj-cxl.h (Jonathan)

v11 Changes:
	- Drop patch 2/6 (Add CXL protocol error defines) and put the
	  defines in patch 4/6 instead (Dan)
	- Add Dan's reviewed-by

The new CXL error types will use the Memory Address field in the
SET_ERROR_TYPE_WITH_ADDRESS structure in order to target a CXL 1.1
compliant memory-mapped downstream port. The value of the memory address
will be in the port's MMIO range, and it will not represent physical
(normal or persistent) memory.

Add the functionality for injecting CXL 1.1 errors to the EINJ module,
but not through the EINJ legacy interface under /sys/kernel/debug/apei/einj.
Instead, make the error types available under /sys/kernel/debug/cxl.
This allows for validating the MMIO address for a CXL 1.1 error type
while also not making the user responsible for finding it.
Ben Cheatham (3):
  EINJ: Migrate to a platform driver
  cxl/core, EINJ: Add EINJ CXL debugfs files and EINJ helper functions
  EINJ, Documentation: Update EINJ kernel doc

 Documentation/ABI/testing/debugfs-cxl         |  30 +++
 .../firmware-guide/acpi/apei/einj.rst         |  19 ++
 MAINTAINERS                                   |   1 +
 drivers/acpi/apei/einj.c                      | 202 ++++++++++++++++--
 drivers/cxl/Kconfig                           |  12 ++
 drivers/cxl/core/port.c                       |  41 ++++
 include/linux/einj-cxl.h                      |  40 ++++
 7 files changed, 332 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/einj-cxl.h

Comments

Dan Williams Feb. 16, 2024, 12:09 a.m. UTC | #1
Luck, Tony wrote:
> > I would save that work for a clear description of why einj.ko should not
> > be resident.
> 
> Personally, it would save me having to type "modprobe einj" to run tests (and
> answer e-mails from validation folks telling they missed this step).

It would only autoload with cxl_core.ko though.

> 
> But others might feels this is unwanted. It looks like some distros build kernels
> with CONFIG_ACPI_APEI_EINJ=m.
> 
> On the other hand, EINJ should be under control of a BIOS option that
> defaults to "off". So production systems won't enable it.
> 
> But perhaps there will be a pr_warn() or pr_err() during boot. One of these will likely trip:
> 
> 	pr_warn("EINJ table not found.\n");
> 	pr_err("Failed to get EINJ table: %s\n", acpi_format_exception(status));
> 	pr_warn(FW_BUG "Invalid EINJ table.\n");
> 	pr_err("Error collecting EINJ resources.\n");

Oh, good point. However, should a debug module really be throwing
pr_err() or pr_warn()? I.e. should those all move to pr_info() or
pr_debug() since the error case is detected by the lack of debugfs files
being published.

At least that would be my preference over creating cxl_einj.ko.
Dan Williams Feb. 16, 2024, 12:11 a.m. UTC | #2
Luck, Tony wrote:
> > But perhaps there will be a pr_warn() or pr_err() during boot. One of these will likely trip:
> 
> >	pr_warn("EINJ table not found.\n");
> >	pr_err("Failed to get EINJ table: %s\n", acpi_format_exception(status));
> >	pr_warn(FW_BUG "Invalid EINJ table.\n");
> >	pr_err("Error collecting EINJ resources.\n");
> 
> Just tried on my system. The winner (for me) is:
> 
> [   27.989081] EINJ: EINJ table not found.
> 
> If you decide that it is OK to auto-load, I think that needs severity downgraded to pr_info().
> 
> Users ask questions when they see warnings.

Sounds good, I missed this before sending my last reply.
Davidlohr Bueso Feb. 20, 2024, 7:02 p.m. UTC | #3
On Wed, 14 Feb 2024, Ben Cheatham wrote:

>Update EINJ kernel document to include how to inject CXL protocol error
>types, build the kernel to include CXL error types, and give an example
>injection.
>
>Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>Signed-off-by: Ben Cheatham <Benjamin.Cheatham@amd.com>

Would vote for folding into 2/3, but otherwise looks good with a minor
suggestion.

Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>

>---
> .../firmware-guide/acpi/apei/einj.rst         | 19 +++++++++++++++++++
> 1 file changed, 19 insertions(+)
>
>diff --git a/Documentation/firmware-guide/acpi/apei/einj.rst b/Documentation/firmware-guide/acpi/apei/einj.rst
>index d6b61d22f525..f179adf7b61c 100644
>--- a/Documentation/firmware-guide/acpi/apei/einj.rst
>+++ b/Documentation/firmware-guide/acpi/apei/einj.rst
>@@ -181,6 +181,25 @@ You should see something like this in dmesg::
>   [22715.834759] EDAC sbridge MC3: PROCESSOR 0:306e7 TIME 1422553404 SOCKET 0 APIC 0
>   [22716.616173] EDAC MC3: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x12345 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
>
>+CXL error types are supported from ACPI 6.5 onwards. These error types
						     ^ and target a CXL Port

>+are not available in the legacy interface at /sys/kernel/debug/apei/einj,
>+and are instead at /sys/kernel/debug/cxl/. There is a file under debug/cxl
>+called "einj_type" that is analogous to available_error_type under debug/cxl.
>+There is also a "einj_inject" file in each $dport_dev directory under debug/cxl
>+that will inject a given error into the dport represented by $dport_dev.
>+For example, to inject a CXL.mem protocol correctable error into
>+$dport_dev=pci0000:0c::
>+
>+    # cd /sys/kernel/debug/cxl/
>+    # cat einj_type                 # See which error can be injected
>+	0x00008000  CXL.mem Protocol Correctable
>+	0x00010000  CXL.mem Protocol Uncorrectable non-fatal
>+	0x00020000  CXL.mem Protocol Uncorrectable fatal
>+    # cd 0000:e0:01.1               # Navigate to dport to inject into
>+    # echo 0x8000 > einj_inject     # Inject error
>+
>+To use CXL error types, ``CONFIG_CXL_EINJ`` will need to be enabled.
>+
> Special notes for injection into SGX enclaves:
>
> There may be a separate BIOS setup option to enable SGX injection.
>--
>2.34.1
>
Ben Cheatham Feb. 20, 2024, 7:59 p.m. UTC | #4
Thanks for taking a look David!

On 2/20/24 1:02 PM, Davidlohr Bueso wrote:
> On Wed, 14 Feb 2024, Ben Cheatham wrote:
> 
>> Update EINJ kernel document to include how to inject CXL protocol error
>> types, build the kernel to include CXL error types, and give an example
>> injection.
>>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Signed-off-by: Ben Cheatham <Benjamin.Cheatham@amd.com>
> 
> Would vote for folding into 2/3, but otherwise looks good with a minor
> suggestion.
> 

I would, but I think 2/3 is already pretty large and this is more digestible to me. I've also reworked a large portion of
that patch for v13 so it's probably better to keep it smaller.

> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
> 
>> ---
>> .../firmware-guide/acpi/apei/einj.rst         | 19 +++++++++++++++++++
>> 1 file changed, 19 insertions(+)
>>
>> diff --git a/Documentation/firmware-guide/acpi/apei/einj.rst b/Documentation/firmware-guide/acpi/apei/einj.rst
>> index d6b61d22f525..f179adf7b61c 100644
>> --- a/Documentation/firmware-guide/acpi/apei/einj.rst
>> +++ b/Documentation/firmware-guide/acpi/apei/einj.rst
>> @@ -181,6 +181,25 @@ You should see something like this in dmesg::
>>   [22715.834759] EDAC sbridge MC3: PROCESSOR 0:306e7 TIME 1422553404 SOCKET 0 APIC 0
>>   [22716.616173] EDAC MC3: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x12345 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
>>
>> +CXL error types are supported from ACPI 6.5 onwards. These error types
>                              ^ and target a CXL Port
> 

Will add.

Thanks,
Ben

>> +are not available in the legacy interface at /sys/kernel/debug/apei/einj,
>> +and are instead at /sys/kernel/debug/cxl/. There is a file under debug/cxl
>> +called "einj_type" that is analogous to available_error_type under debug/cxl.
>> +There is also a "einj_inject" file in each $dport_dev directory under debug/cxl
>> +that will inject a given error into the dport represented by $dport_dev.
>> +For example, to inject a CXL.mem protocol correctable error into
>> +$dport_dev=pci0000:0c::
>> +
>> +    # cd /sys/kernel/debug/cxl/
>> +    # cat einj_type                 # See which error can be injected
>> +    0x00008000  CXL.mem Protocol Correctable
>> +    0x00010000  CXL.mem Protocol Uncorrectable non-fatal
>> +    0x00020000  CXL.mem Protocol Uncorrectable fatal
>> +    # cd 0000:e0:01.1               # Navigate to dport to inject into
>> +    # echo 0x8000 > einj_inject     # Inject error
>> +
>> +To use CXL error types, ``CONFIG_CXL_EINJ`` will need to be enabled.
>> +
>> Special notes for injection into SGX enclaves:
>>
>> There may be a separate BIOS setup option to enable SGX injection.
>> -- 
>> 2.34.1
>>