[v2,13/24] EDAC, ghes: Add support for legacy API counters

Message ID 20190624150758.6695-14-rrichter@marvell.com
State New
Headers show
Series
  • EDAC, mc, ghes: Fixes and updates to improve memory error reporting
Related show

Commit Message

Robert Richter June 24, 2019, 3:09 p.m.
The ghes driver is not able yet to count legacy API counters in sysfs,
e.g.:

 /sys/devices/system/edac/mc/mc0/csrow2/ce_count
 /sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count
 /sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count

Make counting csrows/channels generic so that the ghes driver can use
it too.

Signed-off-by: Robert Richter <rrichter@marvell.com>

---
 drivers/edac/edac_mc.c   | 38 +++++++++++++++++++++-----------------
 drivers/edac/edac_mc.h   |  7 ++++++-
 drivers/edac/ghes_edac.c |  2 +-
 3 files changed, 28 insertions(+), 19 deletions(-)

-- 
2.20.1

Comments

Borislav Petkov Aug. 16, 2019, 9:55 a.m. | #1
On Mon, Jun 24, 2019 at 03:09:22PM +0000, Robert Richter wrote:
> The ghes driver is not able yet to count legacy API counters in sysfs,

> e.g.:

> 

>  /sys/devices/system/edac/mc/mc0/csrow2/ce_count

>  /sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count

>  /sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count

> 

> Make counting csrows/channels generic so that the ghes driver can use

> it too.


What for?

ghes_edac enumerates the DIMMs from SMBIOS - it doesn't need chip
selects and ranks. Those are used when you can't count the DIMMs
properly...

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Robert Richter Aug. 30, 2019, 9:35 a.m. | #2
On 16.08.19 11:55:59, Borislav Petkov wrote:
> On Mon, Jun 24, 2019 at 03:09:22PM +0000, Robert Richter wrote:

> > The ghes driver is not able yet to count legacy API counters in sysfs,

> > e.g.:

> > 

> >  /sys/devices/system/edac/mc/mc0/csrow2/ce_count

> >  /sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count

> >  /sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count

> > 

> > Make counting csrows/channels generic so that the ghes driver can use

> > it too.

> 

> What for?


Same was asked here:

 https://lore.kernel.org/patchwork/patch/1080277/

Actually it is a fix for the counters exposed by the legacy API for
the ghes driver. Counters are broken (set to zero). The ghes driver is
the only where errors are reported using edac_raw_mc_handle_error()
instead of edac_mc_handle_error().  The fix is to move the error
counting to edac_mc_handle_error() where the other counters are
incremented.

All distributions that I have checked enable the legacy API option
(CONFIG_EDAC_LEGACY_SYSFS=y) and the interface cannot be disabled for
individual drivers. As long as the counters are exposed, their values
should be correct. See all options discussed in the thread from v1.

> ghes_edac enumerates the DIMMs from SMBIOS - it doesn't need chip

> selects and ranks. Those are used when you can't count the DIMMs

> properly...


Right, but that is true also for other drivers (actually all other
drivers since DIMMs are used now). It is to support older tools that
deal with */csrow*/ch* instead of */dimm* in sysfs.

-Robert

Patch

diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index a8d4471e238c..eea09c6acd3e 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -1007,7 +1007,8 @@  static void edac_ue_error(struct mem_ctl_info *mci,
 void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
 			      struct mem_ctl_info *mci,
 			      struct dimm_info *dimm,
-			      struct edac_raw_error_desc *e)
+			      struct edac_raw_error_desc *e,
+			      int row, int chan)
 {
 	char detail[80];
 	u8 grain_bits;
@@ -1046,7 +1047,22 @@  void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
 			      e->label, detail, e->other_detail);
 	}
 
+	/* old API's counters */
+	if (dimm) {
+		row = dimm->csrow;
+		chan = dimm->cschannel;
+	}
 
+	if (mci->csrows && row >= 0) {
+		if (type == HW_EVENT_ERR_CORRECTED) {
+			mci->csrows[row]->ce_count += e->error_count;
+			if (chan >= 0)
+				mci->csrows[row]->channels[chan]->ce_count += e->error_count;
+		} else {
+			mci->csrows[row]->ue_count += e->error_count;
+		}
+		edac_dbg(4, "csrow/channel to increment: (%d,%d)\n", row, chan);
+	}
 }
 EXPORT_SYMBOL_GPL(edac_raw_mc_handle_error);
 
@@ -1177,22 +1193,10 @@  void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 		}
 	}
 
-	if (!per_layer_report) {
+	if (!per_layer_report)
 		strcpy(e->label, "any memory");
-	} else {
-		edac_dbg(4, "csrow/channel to increment: (%d,%d)\n", row, chan);
-		if (p == e->label)
-			strcpy(e->label, "unknown memory");
-		if (type == HW_EVENT_ERR_CORRECTED) {
-			if (row >= 0) {
-				mci->csrows[row]->ce_count += error_count;
-				if (chan >= 0)
-					mci->csrows[row]->channels[chan]->ce_count += error_count;
-			}
-		} else
-			if (row >= 0)
-				mci->csrows[row]->ue_count += error_count;
-	}
+	else if (!*e->label)
+		strcpy(e->label, "unknown memory");
 
 	/* Fill the RAM location data */
 	p = e->location;
@@ -1210,6 +1214,6 @@  void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 
 	dimm = edac_get_dimm(mci, top_layer, mid_layer, low_layer);
 
-	edac_raw_mc_handle_error(type, mci, dimm, e);
+	edac_raw_mc_handle_error(type, mci, dimm, e, row, chan);
 }
 EXPORT_SYMBOL_GPL(edac_mc_handle_error);
diff --git a/drivers/edac/edac_mc.h b/drivers/edac/edac_mc.h
index b816cf3caaee..c4ddd5c1e24c 100644
--- a/drivers/edac/edac_mc.h
+++ b/drivers/edac/edac_mc.h
@@ -216,6 +216,10 @@  extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
  * @mci:		a struct mem_ctl_info pointer
  * @dimm:		a struct dimm_info pointer
  * @e:			error description
+ * @row:		csrow hint if there is no dimm info (<0 if
+ *			unknown)
+ * @chan:		cschannel hint if there is no dimm info (<0 if
+ *			unknown)
  *
  * This raw function is used internally by edac_mc_handle_error(). It should
  * only be called directly when the hardware error come directly from BIOS,
@@ -224,7 +228,8 @@  extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
 void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
 			      struct mem_ctl_info *mci,
 			      struct dimm_info *dimm,
-			      struct edac_raw_error_desc *e);
+			      struct edac_raw_error_desc *e,
+			      int row, int chan);
 
 /**
  * edac_mc_handle_error() - Reports a memory event to userspace.
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 746083876b5f..8063996a311d 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -434,7 +434,7 @@  void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
 
 	dimm_info = edac_get_dimm_by_index(mci, e->top_layer);
 
-	edac_raw_mc_handle_error(type, mci, dimm_info, e);
+	edac_raw_mc_handle_error(type, mci, dimm_info, e, -1, -1);
 
 	spin_unlock_irqrestore(&ghes_lock, flags);
 }