Message ID | 20230403151849.43408-1-Benjamin.Cheatham@amd.com |
---|---|
State | New |
Headers | show |
Series | [v2] ACPI, APEI, EINJ: Remove memory range validation for CXL error types | expand |
Hi Ben, Ben Cheatham wrote: > From: Yazen Ghannam <yazen.ghannam@amd.com> > > V2 Changes: > - Added Link tags for links > - removed stray unused variable > > This patch is a follow up to the discussion at [1], and builds on Tony's > CXL error patch at [2]. > > The new CXL error types will use the Memory Address field in the > SET_ERROR_TYPE_WITH_ADDRESS structure in order to target a CXL 1.1 > compliant memory-mapped Downstream port. The value of the Memory Address > will be in the port's MMIO range, and it will not represent physical > (normal or persistent) memory. > > Allow error injection for CXL 1.1 systems by skipping memory range > validation for CXL error injection types. This just feels a bit too loose especially when the kernel has the cxl_acpi driver to perform the enumeration of CXL root ports. I know that Terry and Robert are teaching the PCI AER core how to coordinate with RCRB information [1] (I still need to go dig in deeper on that set). I would expect ACPI EINJ could benefit from similar coordination and validate these addresses. Now, is it any address in the downstream-port RCRB range that is valid, or only the base? Another minor comment below... [1]: http://lore.kernel.org/r/20230523232214.55282-1-terry.bowman@amd.com > > Output trying to inject CXL.mem error without patch: > > -bash: echo: write error: Invalid argument > > [1]: > Link: https://lore.kernel.org/linux-acpi/20221206205234.606073-1-Benjamin.Cheatham@amd.com/ > [2]: > Link: https://lore.kernel.org/linux-cxl/CAJZ5v0hNQUfWViqxbJ5B4JCGJUuHpWWSpqpCFWPNpGuagoFbsQ@mail.gmail.com/T/#t > > Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> > Signed-off-by: Ben Cheatham <benjamin.cheatham@amd.com> > --- > drivers/acpi/apei/einj.c | 12 +++++++++++- > include/acpi/actbl1.h | 6 ++++++ > 2 files changed, 17 insertions(+), 1 deletion(-) > > diff --git a/drivers/acpi/apei/einj.c b/drivers/acpi/apei/einj.c > index 013eb621dc92..68a20326ed7c 100644 > --- a/drivers/acpi/apei/einj.c > +++ b/drivers/acpi/apei/einj.c > @@ -37,6 +37,13 @@ > ACPI_EINJ_MEMORY_UNCORRECTABLE | \ > ACPI_EINJ_MEMORY_FATAL) > > +#define CXL_ERROR_MASK (ACPI_EINJ_CXL_CACHE_CORRECTABLE | \ > + ACPI_EINJ_CXL_CACHE_UNCORRECTABLE | \ > + ACPI_EINJ_CXL_CACHE_FATAL | \ > + ACPI_EINJ_CXL_MEM_CORRECTABLE | \ > + ACPI_EINJ_CXL_MEM_UNCORRECTABLE | \ > + ACPI_EINJ_CXL_MEM_FATAL) > + > /* > * ACPI version 5 provides a SET_ERROR_TYPE_WITH_ADDRESS action. > */ > @@ -537,8 +544,11 @@ static int einj_error_inject(u32 type, u32 flags, u64 param1, u64 param2, > if (type & ACPI5_VENDOR_BIT) { > if (vendor_flags != SETWA_FLAGS_MEM) > goto inject; > - } else if (!(type & MEM_ERROR_MASK) && !(flags & SETWA_FLAGS_MEM)) > + } else if (!(type & MEM_ERROR_MASK) && !(flags & SETWA_FLAGS_MEM)) { > + goto inject; > + } else if (type & CXL_ERROR_MASK) { > goto inject; > + } > > /* > * Disallow crazy address masks that give BIOS leeway to pick > diff --git a/include/acpi/actbl1.h b/include/acpi/actbl1.h > index 81b9e794424d..c39837266414 100644 > --- a/include/acpi/actbl1.h > +++ b/include/acpi/actbl1.h > @@ -1044,6 +1044,12 @@ enum acpi_einj_command_status { > #define ACPI_EINJ_PLATFORM_CORRECTABLE (1<<9) > #define ACPI_EINJ_PLATFORM_UNCORRECTABLE (1<<10) > #define ACPI_EINJ_PLATFORM_FATAL (1<<11) > +#define ACPI_EINJ_CXL_CACHE_CORRECTABLE BIT(12) > +#define ACPI_EINJ_CXL_CACHE_UNCORRECTABLE BIT(13) > +#define ACPI_EINJ_CXL_CACHE_FATAL BIT(14) > +#define ACPI_EINJ_CXL_MEM_CORRECTABLE BIT(15) > +#define ACPI_EINJ_CXL_MEM_UNCORRECTABLE BIT(16) > +#define ACPI_EINJ_CXL_MEM_FATAL BIT(17) I expect these to come from the next ACPICA update just like the other definitions. The change in style from (x<<N) to BIT(N) was a tip-off. The process is to submit a pull request to the ACPICA project, for example: https://github.com/acpica/acpica/commit/e948142526c0
Dan Williams wrote: > Hi Ben, > > Ben Cheatham wrote: > > From: Yazen Ghannam <yazen.ghannam@amd.com> > > > > V2 Changes: > > - Added Link tags for links > > - removed stray unused variable > > > > This patch is a follow up to the discussion at [1], and builds on Tony's > > CXL error patch at [2]. > > > > The new CXL error types will use the Memory Address field in the > > SET_ERROR_TYPE_WITH_ADDRESS structure in order to target a CXL 1.1 > > compliant memory-mapped Downstream port. The value of the Memory Address > > will be in the port's MMIO range, and it will not represent physical > > (normal or persistent) memory. > > > > Allow error injection for CXL 1.1 systems by skipping memory range > > validation for CXL error injection types. > > This just feels a bit too loose especially when the kernel has > the cxl_acpi driver to perform the enumeration of CXL root ports. > > I know that Terry and Robert are teaching the PCI AER core how to > coordinate with RCRB information [1] (I still need to go dig in deeper > on that set). I would expect ACPI EINJ could benefit from similar > coordination and validate these addresses. > > Now, is it any address in the downstream-port RCRB range that is valid, > or only the base? Now that I say the above, I wonder if the legacy EINJ interface is even the right interface for CXL protocol error injection. It feels like something that should be relative to the given 'struct cxl_port' instance where the error is to be injected. Userspace should not need to deal in MMIO addresses, especially in the RCH case where the RCRB is difficult for userspace to enumerate.
Thanks for the review Dan, responses inline. On 5/31/23 4:31 PM, Dan Williams wrote: > Hi Ben, > > Ben Cheatham wrote: >> From: Yazen Ghannam <yazen.ghannam@amd.com> >> >> V2 Changes: >> - Added Link tags for links >> - removed stray unused variable >> >> This patch is a follow up to the discussion at [1], and builds on Tony's >> CXL error patch at [2]. >> >> The new CXL error types will use the Memory Address field in the >> SET_ERROR_TYPE_WITH_ADDRESS structure in order to target a CXL 1.1 >> compliant memory-mapped Downstream port. The value of the Memory Address >> will be in the port's MMIO range, and it will not represent physical >> (normal or persistent) memory. >> >> Allow error injection for CXL 1.1 systems by skipping memory range >> validation for CXL error injection types. > > This just feels a bit too loose especially when the kernel has > the cxl_acpi driver to perform the enumeration of CXL root ports. > > I know that Terry and Robert are teaching the PCI AER core how to > coordinate with RCRB information [1] (I still need to go dig in deeper > on that set). I would expect ACPI EINJ could benefit from similar > coordination and validate these addresses. >> Now, is it any address in the downstream-port RCRB range that is valid, > or only the base? > Response to above in your follow up email. > Another minor comment below... > > [1]: http://lore.kernel.org/r/20230523232214.55282-1-terry.bowman@amd.com > >> >> Output trying to inject CXL.mem error without patch: >> >> -bash: echo: write error: Invalid argument >> >> [1]: >> Link: https://lore.kernel.org/linux-acpi/20221206205234.606073-1-Benjamin.Cheatham@amd.com/ >> [2]: >> Link: https://lore.kernel.org/linux-cxl/CAJZ5v0hNQUfWViqxbJ5B4JCGJUuHpWWSpqpCFWPNpGuagoFbsQ@mail.gmail.com/T/#t >> >> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> >> Signed-off-by: Ben Cheatham <benjamin.cheatham@amd.com> >> --- >> drivers/acpi/apei/einj.c | 12 +++++++++++- >> include/acpi/actbl1.h | 6 ++++++ >> 2 files changed, 17 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/acpi/apei/einj.c b/drivers/acpi/apei/einj.c >> index 013eb621dc92..68a20326ed7c 100644 >> --- a/drivers/acpi/apei/einj.c >> +++ b/drivers/acpi/apei/einj.c >> @@ -37,6 +37,13 @@ >> ACPI_EINJ_MEMORY_UNCORRECTABLE | \ >> ACPI_EINJ_MEMORY_FATAL) >> >> +#define CXL_ERROR_MASK (ACPI_EINJ_CXL_CACHE_CORRECTABLE | \ >> + ACPI_EINJ_CXL_CACHE_UNCORRECTABLE | \ >> + ACPI_EINJ_CXL_CACHE_FATAL | \ >> + ACPI_EINJ_CXL_MEM_CORRECTABLE | \ >> + ACPI_EINJ_CXL_MEM_UNCORRECTABLE | \ >> + ACPI_EINJ_CXL_MEM_FATAL) >> + >> /* >> * ACPI version 5 provides a SET_ERROR_TYPE_WITH_ADDRESS action. >> */ >> @@ -537,8 +544,11 @@ static int einj_error_inject(u32 type, u32 flags, u64 param1, u64 param2, >> if (type & ACPI5_VENDOR_BIT) { >> if (vendor_flags != SETWA_FLAGS_MEM) >> goto inject; >> - } else if (!(type & MEM_ERROR_MASK) && !(flags & SETWA_FLAGS_MEM)) >> + } else if (!(type & MEM_ERROR_MASK) && !(flags & SETWA_FLAGS_MEM)) { >> + goto inject; >> + } else if (type & CXL_ERROR_MASK) { >> goto inject; >> + } >> >> /* >> * Disallow crazy address masks that give BIOS leeway to pick >> diff --git a/include/acpi/actbl1.h b/include/acpi/actbl1.h >> index 81b9e794424d..c39837266414 100644 >> --- a/include/acpi/actbl1.h >> +++ b/include/acpi/actbl1.h >> @@ -1044,6 +1044,12 @@ enum acpi_einj_command_status { >> #define ACPI_EINJ_PLATFORM_CORRECTABLE (1<<9) >> #define ACPI_EINJ_PLATFORM_UNCORRECTABLE (1<<10) >> #define ACPI_EINJ_PLATFORM_FATAL (1<<11) >> +#define ACPI_EINJ_CXL_CACHE_CORRECTABLE BIT(12) >> +#define ACPI_EINJ_CXL_CACHE_UNCORRECTABLE BIT(13) >> +#define ACPI_EINJ_CXL_CACHE_FATAL BIT(14) >> +#define ACPI_EINJ_CXL_MEM_CORRECTABLE BIT(15) >> +#define ACPI_EINJ_CXL_MEM_UNCORRECTABLE BIT(16) >> +#define ACPI_EINJ_CXL_MEM_FATAL BIT(17) > > I expect these to come from the next ACPICA update just like the other > definitions. The change in style from (x<<N) to BIT(N) was a tip-off. > The process is to submit a pull request to the ACPICA project, for > example: > > https://github.com/acpica/acpica/commit/e948142526c0 I wasn't aware of this, I'll go ahead and do that.
On 5/31/23 10:57 PM, Dan Williams wrote: > Dan Williams wrote: >> Hi Ben, >> >> Ben Cheatham wrote: >>> From: Yazen Ghannam <yazen.ghannam@amd.com> >>> >>> V2 Changes: >>> - Added Link tags for links >>> - removed stray unused variable >>> >>> This patch is a follow up to the discussion at [1], and builds on Tony's >>> CXL error patch at [2]. >>> >>> The new CXL error types will use the Memory Address field in the >>> SET_ERROR_TYPE_WITH_ADDRESS structure in order to target a CXL 1.1 >>> compliant memory-mapped Downstream port. The value of the Memory Address >>> will be in the port's MMIO range, and it will not represent physical >>> (normal or persistent) memory. >>> >>> Allow error injection for CXL 1.1 systems by skipping memory range >>> validation for CXL error injection types. >> >> This just feels a bit too loose especially when the kernel has >> the cxl_acpi driver to perform the enumeration of CXL root ports. >> >> I know that Terry and Robert are teaching the PCI AER core how to >> coordinate with RCRB information [1] (I still need to go dig in deeper >> on that set). I would expect ACPI EINJ could benefit from similar >> coordination and validate these addresses. >> I haven't kept up with that patch set either, I'll take a look. >> Now, is it any address in the downstream-port RCRB range that is valid, >> or only the base? I took another look at the ACPI 6.5 spec since it's been a while and it doesn't specify, but when I tested this patch I think I only used the base address. I'll try out other addresses in the range and see if they work as well. > > Now that I say the above, I wonder if the legacy EINJ interface is even > the right interface for CXL protocol error injection. It feels like > something that should be relative to the given 'struct cxl_port' > instance where the error is to be injected. Userspace should not need to > deal in MMIO addresses, especially in the RCH case where the RCRB is > difficult for userspace to enumerate. I think that's a good idea. My first idea is to have sysfs files under the port that are basically a shortcut for an EINJ injection through the legacy interface. I think it might be possible to have just the sysfs files and no legacy interface support, but since I'll most likely be reusing the EINJ code I imagine the legacy interface will have to be updated anyway.
diff --git a/drivers/acpi/apei/einj.c b/drivers/acpi/apei/einj.c index 013eb621dc92..68a20326ed7c 100644 --- a/drivers/acpi/apei/einj.c +++ b/drivers/acpi/apei/einj.c @@ -37,6 +37,13 @@ ACPI_EINJ_MEMORY_UNCORRECTABLE | \ ACPI_EINJ_MEMORY_FATAL) +#define CXL_ERROR_MASK (ACPI_EINJ_CXL_CACHE_CORRECTABLE | \ + ACPI_EINJ_CXL_CACHE_UNCORRECTABLE | \ + ACPI_EINJ_CXL_CACHE_FATAL | \ + ACPI_EINJ_CXL_MEM_CORRECTABLE | \ + ACPI_EINJ_CXL_MEM_UNCORRECTABLE | \ + ACPI_EINJ_CXL_MEM_FATAL) + /* * ACPI version 5 provides a SET_ERROR_TYPE_WITH_ADDRESS action. */ @@ -537,8 +544,11 @@ static int einj_error_inject(u32 type, u32 flags, u64 param1, u64 param2, if (type & ACPI5_VENDOR_BIT) { if (vendor_flags != SETWA_FLAGS_MEM) goto inject; - } else if (!(type & MEM_ERROR_MASK) && !(flags & SETWA_FLAGS_MEM)) + } else if (!(type & MEM_ERROR_MASK) && !(flags & SETWA_FLAGS_MEM)) { + goto inject; + } else if (type & CXL_ERROR_MASK) { goto inject; + } /* * Disallow crazy address masks that give BIOS leeway to pick diff --git a/include/acpi/actbl1.h b/include/acpi/actbl1.h index 81b9e794424d..c39837266414 100644 --- a/include/acpi/actbl1.h +++ b/include/acpi/actbl1.h @@ -1044,6 +1044,12 @@ enum acpi_einj_command_status { #define ACPI_EINJ_PLATFORM_CORRECTABLE (1<<9) #define ACPI_EINJ_PLATFORM_UNCORRECTABLE (1<<10) #define ACPI_EINJ_PLATFORM_FATAL (1<<11) +#define ACPI_EINJ_CXL_CACHE_CORRECTABLE BIT(12) +#define ACPI_EINJ_CXL_CACHE_UNCORRECTABLE BIT(13) +#define ACPI_EINJ_CXL_CACHE_FATAL BIT(14) +#define ACPI_EINJ_CXL_MEM_CORRECTABLE BIT(15) +#define ACPI_EINJ_CXL_MEM_UNCORRECTABLE BIT(16) +#define ACPI_EINJ_CXL_MEM_FATAL BIT(17) #define ACPI_EINJ_VENDOR_DEFINED (1<<31) /*******************************************************************************