[12/21] EDAC, ghes: Add support for legacy API counters

Message ID 20190529084344.28562-13-rrichter@marvell.com
State Superseded
Series
  • EDAC, mc, ghes: Fixes and updates to improve memory error reporting

Commit Message

Robert Richter May 29, 2019, 8:44 a.m.
The ghes driver is not able yet to count legacy API counters in sysfs,
e.g.:

 /sys/devices/system/edac/mc/mc0/csrow2/ce_count
 /sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count
 /sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count

Make counting csrows/channels generic so that the ghes driver can use
it too.

Signed-off-by: Robert Richter <rrichter@marvell.com>

---
 drivers/edac/edac_mc.c   | 39 ++++++++++++++++++++++-----------------
 drivers/edac/edac_mc.h   |  7 ++++++-
 drivers/edac/ghes_edac.c |  2 +-
 3 files changed, 29 insertions(+), 19 deletions(-)

-- 
2.20.1
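
For readers who want to check the counters named in the commit message on a running system, the following is a minimal user-space sketch in C. It assumes that each legacy counter file holds a single unsigned decimal value, which is how such sysfs attributes are normally formatted. The helper name read_counter() is hypothetical and is not part of the kernel or of edac-utils.

```c
#include <stdio.h>

/*
 * Read one unsigned counter from a sysfs-style text file, for example
 * /sys/devices/system/edac/mc/mc0/csrow2/ce_count.
 * Hypothetical helper: returns 0 on success and -1 if the file cannot
 * be opened or does not start with a decimal number.
 */
int read_counter(const char *path, unsigned long *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;

	/* Assumption: the file contains one unsigned decimal value. */
	ok = (fscanf(f, "%lu", value) == 1);
	fclose(f);

	return ok ? 0 : -1;
}
```

Calling read_counter() on the three paths from the commit message would show whether the per-csrow and per-channel values stay at zero while errors are being reported.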

Comments

James Morse May 29, 2019, 3:13 p.m. | #1
Hi Robert,

On 29/05/2019 09:44, Robert Richter wrote:
> The ghes driver is not able yet to count legacy API counters in sysfs,
> e.g.:
> 
>  /sys/devices/system/edac/mc/mc0/csrow2/ce_count
>  /sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count
>  /sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count
> 
> Make counting csrows/channels generic so that the ghes driver can use
> it too.

What for?

Is this for an arm64 system? Surely we don't have any systems that used to work with these
legacy counters. Aren't they legacy because we want new software to stop using them!


Thanks,

James
Robert Richter June 12, 2019, 6:41 p.m. | #2
On 29.05.19 16:13:02, James Morse wrote:
> On 29/05/2019 09:44, Robert Richter wrote:
> > The ghes driver is not able yet to count legacy API counters in sysfs,
> > e.g.:
> > 
> >  /sys/devices/system/edac/mc/mc0/csrow2/ce_count
> >  /sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count
> >  /sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count
> > 
> > Make counting csrows/channels generic so that the ghes driver can use
> > it too.
> 
> What for?

With EDAC_LEGACY_SYSFS enabled those counters are exposed to sysfs,
but the numbers are wrong (all zero).

> Is this for an arm64 system? Surely we don't have any systems that used to work with these
> legacy counters. Aren't they legacy because we want new software to stop using them!

The option is to support legacy userland. If we want to provide a
similar "user experience" as for x86 the counters should be correct.
Of course it is not a real mapping to csrows, but it makes that i/f
work.

In any case, this patch cleans up code as old API's counter code is
isolated and moved to common code. Making the counters work for ghes
is actually a side effect here. The cleanup is a prerequisite for
follow-on patches.

-Robert

> 
> Thanks,
> 
> James
James Morse June 19, 2019, 5:22 p.m. | #3
Hi Robert,

On 12/06/2019 19:41, Robert Richter wrote:
> On 29.05.19 16:13:02, James Morse wrote:
>> On 29/05/2019 09:44, Robert Richter wrote:
>>> The ghes driver is not able yet to count legacy API counters in sysfs,
>>> e.g.:
>>>
>>>  /sys/devices/system/edac/mc/mc0/csrow2/ce_count
>>>  /sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count
>>>  /sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count
>>>
>>> Make counting csrows/channels generic so that the ghes driver can use
>>> it too.
>>
>> What for?
> 
> With EDAC_LEGACY_SYSFS enabled those counters are exposed to sysfs,
> but the numbers are wrong (all zero).

Excellent, so it's legacy and broken.


>> Is this for an arm64 system? Surely we don't have any systems that used to work with these
>> legacy counters. Aren't they legacy because we want new software to stop using them!
> 
> The option is to support legacy userland. If we want to provide a
> similar "user experience" as for x86 the counters should be correct.

The flip-side is arm64 doesn't have the same baggage. These counters have never worked
with this driver (even on x86).

This ghes driver also probes on HPE Server platform, so the architecture isn't really
relevant. (I was curious why Marvell care).


> Of course it is not a real mapping to csrows, but it makes that i/f
> work.

(...which stinks)


> In any case, this patch cleans up code as old API's counter code is
> isolated and moved to common code. Making the counters work for ghes
> is actually a side effect here. The cleanup is a prerequisite for
> follow-on patches.

I'm all for removing it, or warning that it's broken, when ghes_edac is in use. But the
convincing argument is that Debian ships a 'current' version of edac-utils that predates
199747106934 (which deprecated all this fake csrow stuff), and Debian's popcon says ~1000
people have it installed.


If you want it fixed, please don't do it as a side effect of cleanup. Fixes need to be a
small separate series that can be backported. (unless we're confident no-one uses it, in
which case, why fix it?)



Thanks,

James
Robert Richter June 20, 2019, 6:55 a.m. | #4
On 19.06.19 18:22:32, James Morse wrote:
> > In any case, this patch cleans up code as old API's counter code is
> > isolated and moved to common code. Making the counters work for ghes
> > is actually a side effect here. The cleanup is a prerequisite for
> > follow-on patches.
> 
> I'm all for removing/warning-its-broken it when ghes_edac is in use. But the convincing
> argument is debian ships a 'current' version of edac-utils that predates 199747106934,
> (that made all this fake csrow stuff deprecated), and debian's popcon says ~1000 people
> have it installed.

All arm64 distribution kernels that I have checked come with:

CONFIG_EDAC_SUPPORT=y
CONFIG_EDAC=y
CONFIG_EDAC_LEGACY_SYSFS=y
# CONFIG_EDAC_DEBUG is not set
CONFIG_EDAC_GHES=y
CONFIG_EDAC_LAYERSCAPE=m
CONFIG_EDAC_THUNDERX=m
CONFIG_EDAC_XGENE=m

> If you want it fixed, please don't do it as a side effect of cleanup. Fixes need to be a
> small separate series that can be backported. (unless we're confident no-one uses it, in
> which case, why fix it?)

It is not that I am keen on fixing legacy edac sysfs. It just happens
while unifying the error handlers in ghes_edac and edac_mc. As I see
you are reluctant to just let it go, let's just disable
EDAC_LEGACY_SYSFS for ARM64. Though, I don't agree with it as there
still could be some userland tools that use this interface that cannot
be used any longer after a transition from x86 to arm64. I leave that
decision up to you. Please just ACK a patch with the Kconfig change
which I will add to my v2 series.

Thanks,

-Robert
James Morse June 26, 2019, 9:33 a.m. | #5
Hi Robert,

On 20/06/2019 07:55, Robert Richter wrote:
> On 19.06.19 18:22:32, James Morse wrote:
>>> In any case, this patch cleans up code as old API's counter code is
>>> isolated and moved to common code. Making the counters work for ghes
>>> is actually a side effect here. The cleanup is a prerequisite for
>>> follow-on patches.
>>
>> I'm all for removing/warning-its-broken it when ghes_edac is in use. But the convincing
>> argument is debian ships a 'current' version of edac-utils that predates 199747106934,
>> (that made all this fake csrow stuff deprecated), and debian's popcon says ~1000 people
>> have it installed.
> 
> All arm64 distribution kernels that I have checked come with:
> 
> CONFIG_EDAC_SUPPORT=y
> CONFIG_EDAC=y
> CONFIG_EDAC_LEGACY_SYSFS=y
> # CONFIG_EDAC_DEBUG is not set
> CONFIG_EDAC_GHES=y
> CONFIG_EDAC_LAYERSCAPE=m
> CONFIG_EDAC_THUNDERX=m
> CONFIG_EDAC_XGENE=m

(distros also enable drivers for hardware no-one has!)

Who uses this? edac-utils, on both arm64 and x86.


>> If you want it fixed, please don't do it as a side effect of cleanup. Fixes need to be a
>> small separate series that can be backported. (unless we're confident no-one uses it, in
>> which case, why fix it?)

> It is not that I am keen on fixing legacy edac sysfs. It just happens
> while unifying the error handlers in ghes_edac and edac_mc. As I see
> you are reluctant to just let it go, let's just disable
> EDAC_LEGACY_SYSFS for ARM64.

That would break other drivers where those legacy counters expose valid values.

You're painting me as some kind of stubborn villain here. You're right my initial reaction
was 'what for?'. Adding new support for legacy counters that have never worked with
ghes_edac looks like the wrong thing to do.

But unfortunately edac-utils is still using this legacy interface.

If we're going to fix it, could we fix it properly? (separate series that can be
backported to stable).


> Though, I don't agree with it as there
> still could be some userland tools that use this interface that cannot
> be used any longer after a transition from x86 to arm64.

I don't think this is the right thing to do. ghes_edac's behaviour should not change
between architectures.


Where we aren't agreeing is how we fix bugs:

Either it's broken and no-one cares, in which case we should remove it.
Or we should fix it, and those fixes should go to stable.

We can't mix fixes and features in a patch series, as the fixes then can't easily be
backported. If it's ever in doubt, the patches should still be posted as separate fixes so the
maintainer can decide.


Thanks,

James
Robert Richter June 26, 2019, 10:27 a.m. | #6
On 26.06.19 10:33:28, James Morse wrote:
> On 20/06/2019 07:55, Robert Richter wrote:
> > It is not that I am keen on fixing legacy edac sysfs. It just happens
> > while unifying the error handlers in ghes_edac and edac_mc. As I see
> > you are reluctant to just let it go, let's just disable
> > EDAC_LEGACY_SYSFS for ARM64.
> 
> That would break other drivers where those legacy counters expose valid values.
> 
> You're painting me as some kind of stubborn villain here. You're right my initial reaction
> was 'what for?'. Adding new support for legacy counters that have never worked with
> ghes_edac looks like the wrong thing to do.
> 
> But unfortunately edac-utils is still using this legacy interface.

I am sorry for misunderstanding you here. I haven't seen your
motivation for this which is now clear to me.

> If we're going to fix it, could we fix it properly? (separate series that can be
> backported to stable).

I see your point here. This is also the reason why I (try to) put
fixes at the beginning of a series to allow backports to stable (or
distros). Clearly, this must be better separated here.

> > Though, I don't agree with it as there
> > still could be some userland tools that use this interface that cannot
> > be used any longer after a transition from x86 to arm64.
> 
> I don't think this is the right thing to do. ghes_edac's behaviour should not change
> between architectures.
> 
> Where we aren't agreeing is how we fix bugs:
> 
> It's either broken, and no-one cares, we should remove it.
> Or, we should fix it and those fixes should go to stable.
> 
> We can't mix fixes and features in a patch series, as the fixes then can't easily be
> backported. If it's ever in doubt, the patches should still be as separate fixes so the
> maintainer can decide.

I will separate the fixes more cleanly here and update this in the next version, v3.

Thanks and sorry again,

-Robert

Patch

diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 8613a31dc86c..f7e6a751f309 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -1007,7 +1007,8 @@  static void edac_ue_error(struct mem_ctl_info *mci,
 void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
 			      struct mem_ctl_info *mci,
 			      struct dimm_info *dimm,
-			      struct edac_raw_error_desc *e)
+			      struct edac_raw_error_desc *e,
+			      int row, int chan)
 {
 	char detail[80];
 	u8 grain_bits;
@@ -1040,7 +1041,23 @@  void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
 			      e->label, detail, e->other_detail);
 	}
 
+	/* old API's counters */
+	if (dimm) {
+		row = dimm->csrow;
+		chan = dimm->cschannel;
+	}
+
+	if (row >= 0) {
+		if (type == HW_EVENT_ERR_CORRECTED) {
+			mci->csrows[row]->ce_count += e->error_count;
+			if (chan >= 0)
+				mci->csrows[row]->channels[chan]->ce_count += e->error_count;
+		} else {
+			mci->csrows[row]->ue_count += e->error_count;
+		}
+	}
 
+	edac_dbg(4, "csrow/channel to increment: (%d,%d)\n", row, chan);
 }
 EXPORT_SYMBOL_GPL(edac_raw_mc_handle_error);
 
@@ -1171,22 +1188,10 @@  void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 		}
 	}
 
-	if (!per_layer_report) {
+	if (!per_layer_report)
 		strcpy(e->label, "any memory");
-	} else {
-		edac_dbg(4, "csrow/channel to increment: (%d,%d)\n", row, chan);
-		if (p == e->label)
-			strcpy(e->label, "unknown memory");
-		if (type == HW_EVENT_ERR_CORRECTED) {
-			if (row >= 0) {
-				mci->csrows[row]->ce_count += error_count;
-				if (chan >= 0)
-					mci->csrows[row]->channels[chan]->ce_count += error_count;
-			}
-		} else
-			if (row >= 0)
-				mci->csrows[row]->ue_count += error_count;
-	}
+	else if (!*e->label)
+		strcpy(e->label, "unknown memory");
 
 	/* Fill the RAM location data */
 	p = e->location;
@@ -1204,6 +1209,6 @@  void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 
 	dimm = edac_get_dimm(mci, top_layer, mid_layer, low_layer);
 
-	edac_raw_mc_handle_error(type, mci, dimm, e);
+	edac_raw_mc_handle_error(type, mci, dimm, e, row, chan);
 }
 EXPORT_SYMBOL_GPL(edac_mc_handle_error);
diff --git a/drivers/edac/edac_mc.h b/drivers/edac/edac_mc.h
index b816cf3caaee..c4ddd5c1e24c 100644
--- a/drivers/edac/edac_mc.h
+++ b/drivers/edac/edac_mc.h
@@ -216,6 +216,10 @@  extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
  * @mci:		a struct mem_ctl_info pointer
  * @dimm:		a struct dimm_info pointer
  * @e:			error description
+ * @row:		csrow hint if there is no dimm info (<0 if
+ *			unknown)
+ * @chan:		cschannel hint if there is no dimm info (<0 if
+ *			unknown)
  *
  * This raw function is used internally by edac_mc_handle_error(). It should
  * only be called directly when the hardware error come directly from BIOS,
@@ -224,7 +228,8 @@  extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
 void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
 			      struct mem_ctl_info *mci,
 			      struct dimm_info *dimm,
-			      struct edac_raw_error_desc *e);
+			      struct edac_raw_error_desc *e,
+			      int row, int chan);
 
 /**
  * edac_mc_handle_error() - Reports a memory event to userspace.
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index f6ea4b070bfe..ea4d53043199 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -435,7 +435,7 @@  void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
 
 	dimm_info = edac_get_dimm_by_index(mci, e->top_layer);
 
-	edac_raw_mc_handle_error(type, mci, dimm_info, e);
+	edac_raw_mc_handle_error(type, mci, dimm_info, e, -1, -1);
 
 	spin_unlock_irqrestore(&ghes_lock, flags);
 }
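
The counting rule that this patch moves into edac_raw_mc_handle_error() can be modelled outside the kernel. The following is a user-space sketch in C, not kernel code. The names mci_model, csrow_model, dimm_model, err_type and count_legacy(), the array sizes, and the bounds assert are all assumptions made for illustration. Only the decision logic follows the hunk above: DIMM info, when present, overrides the row/chan hints; a negative row means nothing is counted; per-channel counters exist only for corrected errors.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes, chosen only for this sketch. */
#define NR_CSROWS   4
#define NR_CHANNELS 2

/* Stand-in for enum hw_event_mc_err_type (simplified). */
enum err_type { ERR_CORRECTED, ERR_UNCORRECTED };

/* Simplified stand-ins for the kernel's csrow and dimm structures. */
struct csrow_model {
	unsigned int ce_count;
	unsigned int ue_count;
	unsigned int ch_ce_count[NR_CHANNELS];
};

struct dimm_model {
	int csrow;
	int cschannel;
};

struct mci_model {
	struct csrow_model csrows[NR_CSROWS];
};

/* Update the legacy counters the way the patched raw handler does. */
void count_legacy(struct mci_model *mci, enum err_type type,
		  const struct dimm_model *dimm,
		  int row, int chan, unsigned int error_count)
{
	/* DIMM info, if available, takes precedence over the hints. */
	if (dimm) {
		row = dimm->csrow;
		chan = dimm->cschannel;
	}

	/* Unknown csrow: nothing to count. */
	if (row < 0)
		return;

	/* Bounds check added for this sketch; the patch has no such check. */
	assert(row < NR_CSROWS && chan < NR_CHANNELS);

	if (type == ERR_CORRECTED) {
		mci->csrows[row].ce_count += error_count;
		if (chan >= 0)
			mci->csrows[row].ch_ce_count[chan] += error_count;
	} else {
		mci->csrows[row].ue_count += error_count;
	}
}
```

In this model, a caller that passes -1 for both hints, as ghes_edac does in the last hunk, only gets counts when DIMM info is supplied, while a caller with no DIMM info relies on the row and chan hints.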