mbox series

[v5,0/6] CCIX Protocol error reporting.

Message ID 20191114133919.32290-1-Jonathan.Cameron@huawei.com
Headers show
Series CCIX Protocol error reporting. | expand

Message

Jonathan Cameron Nov. 14, 2019, 1:39 p.m. UTC
Changes since V4:

Reponses to Mauro's review.
* Use explicitly sized buffers for functions using the rcd_decode_str
  global array.
* Fix buffer overflow issue with not reducing the length as we move the
  start pointer in snprintf calls.
* Tidy up some first snprintf calls to take advantage of known 0s.
* Patch description cleanup to relect current state of the series.

A few minor formatting fixups and comment clarifications of things noticed
whilst making the above changes.

Changes since V3:

Added #if defined(CONFIG_ACPI_APEI_CCIX) build protection to the
tracepoints to avoid using undefined functions.
Reported-by: 0-day.

Changes since V2:

Dropped the legal boiler plate from the cover letter. The CCIX consortium
have agreed that a simple tradmark statement is sufficient which I have
put in the cper-ccix.c file and here.

The CCIX® trademark and CCIX trade name are owned solely by
CCIX CONSORTIUM, INC. and all rights are reserved therein.

Changes since V1:

Addressed comments from James Morse
- Dropped kernel logging of vendor data. We just push it to the tracepoints.
- Tidied up this cover letter and added information to address questions
  raised. Includes removing questions where James and I agreed ;)

Note, this initial series attempts no 'handling' of errors.
That will follow later.

EFI 2.8 defines a new CPER record Appendix N for CCIX Protocol Error Records
(PER). www.uefi.org

These include Protocol Error Record logs which are defined in the
CCIX 1.0 Base Specification www.ccixconsortium.com.  A public evaluation
version is now available.

Handling of coherency protocol errors is complex and how Linux does this
will take some time to evolve.  For now, fatal errors are handled via the
usual means and everything else is reported.

There are 6 types of error defined, covering:
* Memory errors
* Cache errors
* Address translation unit errors
* CCIX port errors
* CCIX link errors
* Agent internal errors.

These errors are concerned (mostly) wth things happening in the CCIX
protocol layer.  They are parallel to AER errors which should be only
concerned with the PCIe layer (which is underneath CCIX).
The ATS errors break this rule slightly. You may get an error
occurring that results in problems at both layers of the protocol
stack and hence have to handle AER and PER errors simultaneously.

Some of these errors can 'almost' be mapped onto standard existing error
types but only at the loss of information specific to CCIX such as
where in the topology they occurred.

The set includes tracepoints to report the errors to RAS Daemon and a patch
set for RAS Daemon will follow shortly.

Several design decisions that people may disagree with.
1. Reporting of vendor data.  We have little choice but to do this via a
   dynamic array as these blocks can take arbitrary size. I had hoped
   no one would actually use these given the odd mismatch between a
   standard error structure and non standard element, but there are
   already designs out there that do use it. James suggested that
   it made sense to put these in the tracepoints, but we shouldn't spam
   the kernel log with them (done in V2).
2. The trade off between explicit tracepoint fields, on which we might
   want to filter in kernel, and the simplicity of a blob.
   I have gone for having the whole of the block specific to the PER
   error type in an opaque blob.
   The key elements that may be filtered on are the physical address
   and the source and component fields which allow you to identify
   faulty devices. Note that you have to know how the devices were
   enumerated to be able to do so.
3. Defined 6 new tracepoints rather than cramming everything into one.
   * They are all defined by the CCIX specification as independent error
     classes.
   * Many of them can only be generated by particular types of agent.
   * The handling required will vary widely depending on types.
     In the kernel some map cleanly onto existing handling. Keeping the
     whole flow separate will aide this. They vary by a similar amount
     in scope to the RAS errors found on an existing system which have
     independent tracepoints.
   * Separating them out allows for filtering on the tracepoints by
     elements that are not shared between them.
   * Muxing the lot into one record type can lead to ugly code both in
     kernel and in userspace.

Rasdaemon patches posted.
https://patchwork.kernel.org/cover/11116735/

Jonathan Cameron (6):
  efi / ras: CCIX Memory error reporting
  efi / ras: CCIX Cache error reporting
  efi / ras: CCIX Address Translation Cache error reporting
  efi / ras: CCIX Port error reporting
  efi / ras: CCIX Link error reporting
  efi / ras: CCIX Agent internal error reporting

 drivers/acpi/apei/Kconfig        |   8 +
 drivers/acpi/apei/ghes.c         |  59 ++
 drivers/firmware/efi/Kconfig     |   5 +
 drivers/firmware/efi/Makefile    |   1 +
 drivers/firmware/efi/cper-ccix.c | 923 +++++++++++++++++++++++++++++++
 drivers/firmware/efi/cper.c      |   6 +
 include/linux/cper.h             | 333 +++++++++++
 include/ras/ras_event.h          | 407 ++++++++++++++
 8 files changed, 1742 insertions(+)
 create mode 100644 drivers/firmware/efi/cper-ccix.c

-- 
2.20.1

Comments

Mauro Carvalho Chehab Dec. 13, 2019, 2:53 p.m. UTC | #1
Em Thu, 14 Nov 2019 21:39:19 +0800
Jonathan Cameron <Jonathan.Cameron@huawei.com> escreveu:

> The CCIX 1.0 Base specification defines an internal agent error,

> for which the specific data present afte the header is vendor

> defined.

> 


Patches 3 to 6 just repeats the same things I mentioned already on
patch 1.

Except for that, the series look ready for merging on my PoV.

Regards,
Mauro

> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> ---

> Changes since v4

> * none

> 

>  drivers/acpi/apei/ghes.c         |  4 ++

>  drivers/firmware/efi/cper-ccix.c | 43 +++++++++++++++++++++

>  include/linux/cper.h             | 29 +++++++++++++++

>  include/ras/ras_event.h          | 64 ++++++++++++++++++++++++++++++++

>  4 files changed, 140 insertions(+)

> 

> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c

> index 22df8c14ec13..5a75fb0374dd 100644

> --- a/drivers/acpi/apei/ghes.c

> +++ b/drivers/acpi/apei/ghes.c

> @@ -533,6 +533,10 @@ static void ghes_handle_ccix_per(struct acpi_hest_generic_data *gdata, int sev)

>  		trace_ccix_link_error_event(payload, err_seq, sev,

>  					    ccix_link_err_ven_len_get(payload));

>  		break;

> +	case CCIX_AGENT_INTERNAL_ERROR:

> +		trace_ccix_agent_error_event(payload, err_seq, sev,

> +					     ccix_agent_err_ven_len_get(payload));

> +		break;

>  	default:

>  		/* Unknown error type */

>  		pr_info("CCIX error of unknown or vendor defined type\n");

> diff --git a/drivers/firmware/efi/cper-ccix.c b/drivers/firmware/efi/cper-ccix.c

> index ea17fb9140f9..d95fddf17bd3 100644

> --- a/drivers/firmware/efi/cper-ccix.c

> +++ b/drivers/firmware/efi/cper-ccix.c

> @@ -587,6 +587,38 @@ static int cper_ccix_link_err_details(const char *pfx,

>  	return 0;

>  }

>  

> +static int cper_ccix_agent_err_details(const char *pfx,

> +				       struct acpi_hest_generic_data *gdata)

> +{

> +	struct cper_ccix_agent_err *full_agent_err;

> +	struct cper_sec_ccix_agent_err *agent_err;

> +	u16 vendor_data_len;

> +	int i;

> +

> +	if (gdata->error_data_length < sizeof(*full_agent_err))

> +		return -ENOSPC;

> +

> +	full_agent_err = acpi_hest_get_payload(gdata);

> +

> +	agent_err = &full_agent_err->agent_record;

> +

> +	if (agent_err->validation_bits & CCIX_AGENT_INTERNAL_ERR_VENDOR_DATA_VALID) {

> +		if (gdata->error_data_length < sizeof(*full_agent_err) + 4)

> +			return -ENOSPC;

> +

> +		vendor_data_len = agent_err->vendor_data[0] & GENMASK(15, 0);

> +		if (gdata->error_data_length <

> +		    sizeof(*full_agent_err) + vendor_data_len)

> +			return -ENOSPC;

> +

> +		for (i = 0; i < vendor_data_len/4 - 1; i++)

> +			printk("%s""Vendor%d: 0x%08x\n", pfx, i,

> +			       agent_err->vendor_data[i + 1]);

> +	}

> +

> +	return 0;

> +}

> +

>  int cper_print_ccix_per(const char *pfx, struct acpi_hest_generic_data *gdata)

>  {

>  	struct cper_sec_ccix_header *header = acpi_hest_get_payload(gdata);

> @@ -656,6 +688,8 @@ int cper_print_ccix_per(const char *pfx, struct acpi_hest_generic_data *gdata)

>  		return cper_ccix_port_err_details(pfx, gdata);

>  	case CCIX_LINK_ERROR:

>  		return cper_ccix_link_err_details(pfx, gdata);

> +	case CCIX_AGENT_INTERNAL_ERROR:

> +		return cper_ccix_agent_err_details(pfx, gdata);

>  	default:

>  		/* Vendor defined so no formatting be done */

>  		break;

> @@ -878,3 +912,12 @@ const char *cper_ccix_link_err_unpack(struct trace_seq *p,

>  

>  	return ret;

>  }

> +

> +void cper_ccix_agent_err_pack(const struct cper_sec_ccix_agent_err *agent_record,

> +			      struct cper_ccix_agent_err_compact *cagent_err,

> +			      const u16 vendor_data_len,

> +			      u8 *vendor_data)

> +{

> +	cagent_err->validation_bits = agent_record->validation_bits;

> +	memcpy(vendor_data, &agent_record->vendor_data[1], vendor_data_len);

> +}

> diff --git a/include/linux/cper.h b/include/linux/cper.h

> index d35be55351e3..373c1d387a70 100644

> --- a/include/linux/cper.h

> +++ b/include/linux/cper.h

> @@ -783,6 +783,30 @@ struct cper_ccix_link_err_compact {

>  	__u8	credit_type;

>  };

>  

> +struct cper_sec_ccix_agent_err {

> +	__u32	validation_bits;

> +#define CCIX_AGENT_INTERNAL_ERR_VENDOR_DATA_VALID	BIT(0)

> +	__u32	vendor_data[];

> +};

> +

> +struct cper_ccix_agent_err {

> +	struct cper_sec_ccix_header header;

> +	__u32 ccix_header[CCIX_PER_LOG_HEADER_DWS];

> +	struct cper_sec_ccix_agent_err agent_record;

> +};

> +

> +static inline u16 ccix_agent_err_ven_len_get(struct cper_ccix_agent_err *agent_err)

> +{

> +	if (agent_err->agent_record.validation_bits & CCIX_AGENT_INTERNAL_ERR_VENDOR_DATA_VALID)

> +		return agent_err->agent_record.vendor_data[0] & 0xFFFF;

> +	else

> +		return 0;

> +}

> +

> +struct cper_ccix_agent_err_compact {

> +	__u32	validation_bits;

> +};

> +

>  /* Reset to default packing */

>  #pragma pack()

>  

> @@ -835,6 +859,11 @@ void cper_ccix_link_err_pack(const struct cper_sec_ccix_link_error *link_record,

>  const char *cper_ccix_link_err_unpack(struct trace_seq *p,

>  				      struct cper_ccix_link_err_compact *clink_err);

>  

> +void cper_ccix_agent_err_pack(const struct cper_sec_ccix_agent_err *agent_record,

> +			      struct cper_ccix_agent_err_compact *cagent_err,

> +			      const u16 vendor_data_len,

> +			      u8 *vendor_data);

> +

>  struct acpi_hest_generic_data;

>  int cper_print_ccix_per(const char *pfx,

>  			struct acpi_hest_generic_data *gdata);

> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h

> index 7cecfadb0b15..59b62d5cd8cf 100644

> --- a/include/ras/ras_event.h

> +++ b/include/ras/ras_event.h

> @@ -679,6 +679,70 @@ TRACE_EVENT(ccix_link_error_event,

>  		__print_hex(__get_dynamic_array(vendor_data), __entry->vendor_data_length)

>  	)

>  );

> +

> +TRACE_EVENT(ccix_agent_error_event,

> +	TP_PROTO(struct cper_ccix_agent_err *err,

> +		 u32 err_seq,

> +		 u8 sev, u16 ven_len),

> +

> +	TP_ARGS(err, err_seq, sev, ven_len),

> +

> +	TP_STRUCT__entry(

> +		__field(u32, err_seq)

> +		__field(u8, sev)

> +		__field(u8, sevdetail)

> +		__field(u8, source)

> +		__field(u8, component)

> +		__field(u64, pa)

> +		__field(u8, pa_mask_lsb)

> +		__field(u16, vendor_data_length)

> +		__field_struct(struct cper_ccix_agent_err_compact, data)

> +		__dynamic_array(u8, vendor_data, ven_len)

> +	),

> +

> +	TP_fast_assign(

> +		__entry->err_seq = err_seq;

> +

> +		__entry->sev = sev;

> +		__entry->sevdetail = FIELD_GET(CCIX_PER_LOG_DW1_SEV_UE_M |

> +			CCIX_PER_LOG_DW1_SEV_NO_COMM_M |

> +			CCIX_PER_LOG_DW1_SEV_DEGRADED_M |

> +			CCIX_PER_LOG_DW1_SEV_DEFFERABLE_M,

> +			err->ccix_header[1]);

> +		if (err->header.validation_bits & 0x1)

> +			__entry->source = err->header.source_id;

> +		else

> +			__entry->source = ~0;

> +		__entry->component = FIELD_GET(CCIX_PER_LOG_DW1_COMP_TYPE_M,

> +						   err->ccix_header[1]);

> +		if (err->ccix_header[1] & CCIX_PER_LOG_DW1_ADDR_VAL_M) {

> +			__entry->pa = (u64)err->ccix_header[2] << 32 |

> +				(err->ccix_header[3] & 0xfffffffc);

> +			__entry->pa_mask_lsb = err->ccix_header[4] & 0xff;

> +		} else {

> +			__entry->pa = ~0ull;

> +			__entry->pa_mask_lsb = ~0;

> +		}

> +		/* Do not store the vendor data header length */

> +		__entry->vendor_data_length = ven_len ? ven_len - 4 : 0;

> +		cper_ccix_agent_err_pack(&err->agent_record, &__entry->data,

> +					__entry->vendor_data_length,

> +					__get_dynamic_array(vendor_data));

> +	),

> +

> +	TP_printk("{%d} %s CCIX PER Agent Internal Error in %s SevUE:%d SevNoComm:%d SevDegraded:%d SevDeferred:%d physical addr: %016llx (mask: %x) vendor:%s",

> +		__entry->err_seq,

> +		cper_severity_str(__entry->sev),

> +		cper_ccix_comp_type_str(__entry->component),

> +		__entry->sevdetail & BIT(0) ? 1 : 0,

> +		__entry->sevdetail & BIT(1) ? 1 : 0,

> +		__entry->sevdetail & BIT(2) ? 1 : 0,

> +		__entry->sevdetail & BIT(3) ? 1 : 0,

> +		__entry->pa,

> +		__entry->pa_mask_lsb,

> +		__print_hex(__get_dynamic_array(vendor_data), __entry->vendor_data_length)

> +	)

> +);

>  #endif

>  

>  /*





Cheers,
Mauro