mbox series

[0/4] efi/cxl-cper: Report CXL CPER events through tracing

Message ID 20240228-cxl-cper3-v1-0-6aa3f1343c6c@intel.com
Headers show
Series efi/cxl-cper: Report CXL CPER events through tracing | expand

Message

Ira Weiny Feb. 29, 2024, 7:13 a.m. UTC
CXL Component Events, as defined by EFI 2.10 Section N.2.14, wrap a
mostly CXL event payload in an EFI Common Platform Error Record (CPER)
record.  If a device is configured for firmware first CXL event records
are not sent directly to the host.

The CXL sub-system uniquely has DPA to HPA translation information.  It
also already has event decoding/tracing.  Such translations are very
useful for users to determine which system issues may correspond to
specific hardware events.

The restructuring of the event data structures in 6.8 made sharing the
data between CPER/event logs more efficient.  Now re-wire the sending of
CPER records to the CXL sub-system.

In addition provide a default RAS event should the CXL module not be
loaded [ie callback not registered].

Series status/background
========================

Smita and Jonathan have been a great help with this series.  Once again
thank you.

Unfortunately, with all the churn surrounding the bug which Dan
Carpenter found the maintainers were force to revert this work.

Therefore, this is a whole new series based on what is in 6.8.

Testing
=======

I've hacked up a quick debugfs patch to facilitate easier testing.[1]

With this I have verified that the bug Dan Carpenter found is fixed.
However, the tp_printk bug Jonathan found remains.  The taking of the
device lock in the callback is required and the tp_printk issue is
unlikely to be fixed.  Fortunately, tp_printk is not widely used so it
is anticipated this will not be an issue.

No other locking issues were found with this test and locking debug
turned on.

[1] https://github.com/weiny2/linux-kernel/commit/6c540a23cb1194d67a9dcfefb702774a99afc3b1

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Ira Weiny (4):
      cxl/event: Add missing include files
      acpi/ghes: Process CXL Component Events
      cxl/pci: Register for and process CPER events
      ras/events: Trace CXL CPER events even without the CXL stack loaded

 drivers/acpi/apei/ghes.c  | 130 ++++++++++++++++++++++++++++++++++++++++++++++
 drivers/cxl/pci.c         |  69 +++++++++++++++++++++++-
 include/linux/cxl-event.h |  21 ++++++++
 include/ras/ras_event.h   |  90 ++++++++++++++++++++++++++++++++
 4 files changed, 309 insertions(+), 1 deletion(-)
---
base-commit: daeacfa75d08954e1a5b71c36a8fbfcdd0b3fec9
change-id: 20240220-cxl-cper3-30e55279f936

Best regards,

Comments

Dan Williams March 1, 2024, 8:19 p.m. UTC | #1
Ira Weiny wrote:
> Additional event testing using the cxl-event.h header revealed that it
> did not include the correct headers for the types used.  Compile errors
> such as:
> 
> 	include/linux/cxl-event.h|11 col 9| error: unknown type name ‘u8’
> 
> ... were seen.

Were seen where? Should this have the trio of Reported-by: Closes: and
Fixes tags?
Dan Williams March 1, 2024, 9:49 p.m. UTC | #2
Ira Weiny wrote:
> If the firmware has configured CXL event support to be firmware first
> the OS can process those events through CPER records.  The CXL layer has
> unique DPA to HPA knowledge and standard event trace parsing in place.
> 
> CPER records contain Bus, Device, Function information which can be used
> to identify the PCI device which is sending the event.
> 
> Add a CXL CPER callback to process events through the CXL trace
> subsystem.
> 
> Future patches will provide additional region information such as HPA.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes:
> [iweiny: Add back in after the revert in 6.8]
> ---
>  drivers/cxl/pci.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 68 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 2ff361e756d6..6cf8336d1b33 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -974,6 +974,73 @@ static struct pci_driver cxl_pci_driver = {
>  	},
>  };
>  
> -module_pci_driver(cxl_pci_driver);
> +#define CXL_EVENT_HDR_FLAGS_REC_SEVERITY GENMASK(1, 0)
> +static void cxl_cper_event_call(enum cxl_event_type ev_type,
> +				struct cxl_cper_event_rec *rec)
> +{
> +	struct cper_cxl_event_devid *device_id = &rec->hdr.device_id;
> +	struct pci_dev *pdev __free(pci_dev_put) = NULL;
> +	enum cxl_event_log_type log_type;
> +	struct cxl_dev_state *cxlds;
> +	unsigned int devfn;
> +	u32 hdr_flags;
> +
> +	pr_debug("CPER event for device %u:%u:%u.%u\n",
> +		 device_id->segment_num, device_id->bus_num,
> +		 device_id->device_num, device_id->func_num);
> +
> +	devfn = PCI_DEVFN(device_id->device_num, device_id->func_num);
> +	pdev = pci_get_domain_bus_and_slot(device_id->segment_num,
> +					   device_id->bus_num, devfn);
> +	if (!pdev) {
> +		pr_err("CPER event device %u:%u:%u.%u not found\n",
> +			device_id->segment_num, device_id->bus_num,
> +			device_id->device_num, device_id->func_num);
> +		return;
> +	}
> +
> +	dev_dbg(&pdev->dev, "Found device %u:%u.%u\n", device_id->bus_num,
> +		device_id->device_num, device_id->func_num);

These print statements are excessive. The dev_dbg() already encodes the
device BDF into the device name. The pr_err() is not actionable and
somewhat redundant with the default cper_estatus_print_section() print.

I would just delete all of them.
Ira Weiny March 1, 2024, 9:53 p.m. UTC | #3
Dan Williams wrote:
> Ira Weiny wrote:
> > Additional event testing using the cxl-event.h header revealed that it
> > did not include the correct headers for the types used.  Compile errors
> > such as:
> > 
> > 	include/linux/cxl-event.h|11 col 9| error: unknown type name ‘u8’
> > 
> > ... were seen.
> 
> Were seen where? Should this have the trio of Reported-by: Closes: and
> Fixes tags?

As I said after this in the commit message:

	"Omit the fixes tag because this does not cause any issues until
	the header is used again in other code."

I'll clarify it was seen when I used cxl-event.h in the testing code and
this happened.

Ira