diff mbox series

[v2,5/7] EDAC/ghes: Prevent chipset-specific edac from loading after ghes_edac is registered

Message ID 20220817143458.335938-6-justin.he@arm.com
State New
Headers show
Series Modularize ghes_edac driver | expand

Commit Message

Jia He Aug. 17, 2022, 2:34 p.m. UTC
Previous, there is only one edac can be registering in the EDAC core. After
ghes_edac is modularized, the registering of ghes devices may be deferred
when ghes_edac is loaded. Prevent the chipset-specific edac drivers from
loading after ghes_edac is registered, which allow the edac drivers to
get selected in e.g. HPE platform.

Signed-off-by: Jia He <justin.he@arm.com>
---
 drivers/acpi/apei/ghes.c  | 14 ++++++++++++++
 drivers/edac/amd64_edac.c |  2 +-
 drivers/edac/ghes_edac.c  |  2 ++
 drivers/edac/pnd2_edac.c  |  2 +-
 drivers/edac/sb_edac.c    |  2 +-
 drivers/edac/skx_base.c   |  2 +-
 include/acpi/ghes.h       |  4 ++++
 7 files changed, 24 insertions(+), 4 deletions(-)

Comments

Kani, Toshi Aug. 19, 2022, 1:29 a.m. UTC | #1
On Wednesday, August 17, 2022 8:35 AM, Jia He wrote:
> Previous, there is only one edac can be registering in the EDAC core. After
> ghes_edac is modularized, the registering of ghes devices may be deferred
> when ghes_edac is loaded. Prevent the chipset-specific edac drivers from
> loading after ghes_edac is registered, which allow the edac drivers to
> get selected in e.g. HPE platform.
...
> @@ -1382,6 +1395,7 @@ static int ghes_probe(struct platform_device
> *ghes_dev)
>  	platform_set_drvdata(ghes_dev, ghes);
> 
>  	ghes->dev = &ghes_dev->dev;
> +	set_ghes_devs_registered(false);

This does not look right to me.

The condition of using ghes_edac is (ghes-present) && (ghes-preferred),
where:
 - ghes-present is latched on ghes_probe()
 - ghes-preferred is true if the platform-check passes.

ghes_get_device() introduced in the previous patch works as the 
ghes-preferred check.

We cannot use ghes_edac registration as the ghes-present check in
this scheme since it is deferred to module_init().

I'd suggest the following changes:
- Update ghes_get_device() to check a flag latched on ghes_probe().
- Remove 'force' argument from ghes_get_device().  ghes_edac_init()
is too late to set this flag.  Also, other edac drivers should not be able
to control this flag.  I think this force_load kernel option needs to
belong to ghes with this scheme.
- ghes_get_device() exposes 'ghes_devs' to chipset-specific edac drivers,
which should be internal to ghes.  ghes_get_device() may be renamed
to something like ghes_edac_preferred() which returns true / false.

Toshi
Jia He Aug. 19, 2022, 10:34 a.m. UTC | #2
Hi Toshi,

> -----Original Message-----
> From: Kani, Toshi <toshi.kani@hpe.com>
> Sent: Friday, August 19, 2022 9:30 AM
> To: Justin He <Justin.He@arm.com>; Ard Biesheuvel <ardb@kernel.org>; Len
> Brown <lenb@kernel.org>; James Morse <James.Morse@arm.com>; Tony Luck
> <tony.luck@intel.com>; Borislav Petkov <bp@alien8.de>; Mauro Carvalho
> Chehab <mchehab@kernel.org>; Robert Richter <rric@kernel.org>; Robert
> Moore <robert.moore@intel.com>; Qiuxu Zhuo <qiuxu.zhuo@intel.com>;
> Yazen Ghannam <yazen.ghannam@amd.com>
> Cc: linux-acpi@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-edac@vger.kernel.org; devel@acpica.org; Rafael J . Wysocki
> <rafael@kernel.org>; Shuai Xue <xueshuai@linux.alibaba.com>; Jarkko
> Sakkinen <jarkko@kernel.org>; linux-efi@vger.kernel.org; nd <nd@arm.com>
> Subject: RE: [PATCH v2 5/7] EDAC/ghes: Prevent chipset-specific edac from
> loading after ghes_edac is registered
> 
> On Wednesday, August 17, 2022 8:35 AM, Jia He wrote:
> > Previous, there is only one edac can be registering in the EDAC core.
> > After ghes_edac is modularized, the registering of ghes devices may be
> > deferred when ghes_edac is loaded. Prevent the chipset-specific edac
> > drivers from loading after ghes_edac is registered, which allow the
> > edac drivers to get selected in e.g. HPE platform.
> ...
> > @@ -1382,6 +1395,7 @@ static int ghes_probe(struct platform_device
> > *ghes_dev)
> >  	platform_set_drvdata(ghes_dev, ghes);
> >
> >  	ghes->dev = &ghes_dev->dev;
> > +	set_ghes_devs_registered(false);
> 
> This does not look right to me.
> 
> The condition of using ghes_edac is (ghes-present) && (ghes-preferred),
> where:
>  - ghes-present is latched on ghes_probe()
>  - ghes-preferred is true if the platform-check passes.
> 
> ghes_get_device() introduced in the previous patch works as the
> ghes-preferred check.
> 
> We cannot use ghes_edac registration as the ghes-present check in this
> scheme since it is deferred to module_init().

What is the logic for ghes-present check? In this patch, I assumed it is equal to 
"ghes_edac devices have been registered". Seems it is not correct.
But should we consider the case as follows:
What if sbridge_init () is invoked before ghes_edac_init()? i.e. Should we get
sb_edac driver selected when ghes_edac is not loaded yet (e.g. on HPE)?

> 
> I'd suggest the following changes:
> - Update ghes_get_device() to check a flag latched on ghes_probe().

Still need your elaborating about the details of the flag. i.e. When is this flag
latched? When is it unlatched?

> - Remove 'force' argument from ghes_get_device().  ghes_edac_init() is too
> late to set this flag.  Also, other edac drivers should not be able to control this
> flag.  I think this force_load kernel option needs to belong to ghes with this
> scheme.
Agree, I will remove force_load in ghes_get_device(), and put all check/update of
force_load in ghes

> - ghes_get_device() exposes 'ghes_devs' to chipset-specific edac drivers, which
> should be internal to ghes.  ghes_get_device() may be renamed to something
> like ghes_edac_preferred() which returns true / false.
> 

Okay, agree

--
Cheers,
Justin (Jia He)
Kani, Toshi Aug. 19, 2022, 5:48 p.m. UTC | #3
On Friday, August 19, 2022 4:35 AM, Justin He wrote:
> > > @@ -1382,6 +1395,7 @@ static int ghes_probe(struct platform_device
> > > *ghes_dev)
> > >  	platform_set_drvdata(ghes_dev, ghes);
> > >
> > >  	ghes->dev = &ghes_dev->dev;
> > > +	set_ghes_devs_registered(false);
> >
> > This does not look right to me.
> >
> > The condition of using ghes_edac is (ghes-present) && (ghes-preferred),
> > where:
> >  - ghes-present is latched on ghes_probe()
> >  - ghes-preferred is true if the platform-check passes.
> >
> > ghes_get_device() introduced in the previous patch works as the
> > ghes-preferred check.
> >
> > We cannot use ghes_edac registration as the ghes-present check in this
> > scheme since it is deferred to module_init().
> 
> What is the logic for ghes-present check? In this patch, I assumed it is equal to
> "ghes_edac devices have been registered". Seems it is not correct.

Using (ghes_edac-registered) is a wrong check in this scheme
since ghes_edac registration is deferred.  This check is fine in
the current scheme since ghes_edac gets registered before
any other chipset-specific edac drivers.

> But should we consider the case as follows:
> What if sbridge_init () is invoked before ghes_edac_init()? i.e. Should we get
> sb_edac driver selected when ghes_edac is not loaded yet (e.g. on HPE)?

No.  The point is that ghes_edac driver needs to be selected,
"regardless of the module ordering", on a system with GHES
present & preferred.

Note that this new scheme leads to the following condition
change: 
- Current: (ghes-present) && (ghes-preferred) && (ghes_edac registered)
- New: (ghes-present) && (ghes-preferred)

The option I suggested previously keeps the current condition,
but this new scheme does not for better modularity.

What this means is that if ghes_edac is not enabled (but ghes
is enabled) on a system with GHES present & preferred, no edac
driver gets registered.  This change is fine from my (HPE) perspective
and should be fine for other GHES systems.  GHES systems have
chipset-specific edac driver in FW.  OS-based chipset-specific edac
driver is not necessary and may lead to a conflict of chipset register
ownership.

> > I'd suggest the following changes:
> > - Update ghes_get_device() to check a flag latched on ghes_probe().
> 
> Still need your elaborating about the details of the flag. i.e. When is this flag
> latched? When is it unlatched?

This flag is a static variable, say ghes_present, which is set to false initially.
ghes_probe() sets it to true.  ghes_edac_preferred() (aka. ghes_get_device)
checks this flag at beginning and returns false if this flag is false.  It does not
get unlatched since ACPI GHES table is static.

Toshi
Borislav Petkov Aug. 19, 2022, 6:29 p.m. UTC | #4
On Fri, Aug 19, 2022 at 05:48:39PM +0000, Kani, Toshi wrote:
> This flag is a static variable, say ghes_present, which is set to
> false initially. ghes_probe() sets it to true. ghes_edac_preferred()
> (aka. ghes_get_device) checks this flag at beginning and returns false
> if this flag is false. It does not get unlatched since ACPI GHES table
> is static.

What is that flag needed for at all?

There are two possibilities:

1. ghes_probe() succeeds. ghes_edac loads properly and other drivers use
ghes_get_devices() to know when to load.

2. ghes_probe() fails and that is caught during platform testing of all
those platforms who wish to use ghes_edac. BIOS is fixed and goto 1.

No need for funky flags whatsoever.
Kani, Toshi Aug. 19, 2022, 6:46 p.m. UTC | #5
On Friday, August 19, 2022 12:30 PM, Borislav Petkov wrote:
> > This flag is a static variable, say ghes_present, which is set to
> > false initially. ghes_probe() sets it to true. ghes_edac_preferred()
> > (aka. ghes_get_device) checks this flag at beginning and returns false
> > if this flag is false. It does not get unlatched since ACPI GHES table
> > is static.
> 
> What is that flag needed for at all?

Because ghes_get_device() always returns &ghes_devs on Arm,
which is !NULL.  It does not check if GHES is present.

> There are two possibilities:
> 
> 1. ghes_probe() succeeds. ghes_edac loads properly and other drivers use
> ghes_get_devices() to know when to load.
> 
> 2. ghes_probe() fails and that is caught during platform testing of all
> those platforms who wish to use ghes_edac. BIOS is fixed and goto 1.
> 
> No need for funky flags whatsoever.

3. ghes_probe() is not called, but ghes_get_device() is called from
other edac drivers.

Toshi
Borislav Petkov Aug. 19, 2022, 6:50 p.m. UTC | #6
On Fri, Aug 19, 2022 at 06:46:34PM +0000, Kani, Toshi wrote:
> 3. ghes_probe() is not called,

When is ghes_probe() not called on ARM?
Kani, Toshi Aug. 19, 2022, 6:53 p.m. UTC | #7
On Friday, August 19, 2022 12:50 PM, Borislav Petkov wrote:
> > 3. ghes_probe() is not called,
> 
> When is ghes_probe() not called on ARM?

When the system does not implement ACPI GHES table,
which I suppose is most of the case on ARM.

Toshi
Elliott, Robert (Servers) Aug. 19, 2022, 6:57 p.m. UTC | #8
> -----Original Message-----
> From: Kani, Toshi <toshi.kani@hpe.com>
> Sent: Friday, August 19, 2022 12:49 PM
> To: Justin He <Justin.He@arm.com>
> Cc: linux-acpi@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> edac@vger.kernel.org; devel@acpica.org; Rafael J . Wysocki
> <rafael@kernel.org>; Shuai Xue <xueshuai@linux.alibaba.com>; Jarkko
> Sakkinen <jarkko@kernel.org>; linux-efi@vger.kernel.org; nd
> <nd@arm.com>; Ard Biesheuvel <ardb@kernel.org>; Len Brown
> <lenb@kernel.org>; James Morse <James.Morse@arm.com>; Tony Luck
> <tony.luck@intel.com>; Borislav Petkov <bp@alien8.de>; Mauro Carvalho
> Chehab <mchehab@kernel.org>; Robert Richter <rric@kernel.org>; Robert
> Moore <robert.moore@intel.com>; Qiuxu Zhuo <qiuxu.zhuo@intel.com>;
> Yazen Ghannam <yazen.ghannam@amd.com>
> Subject: RE: [PATCH v2 5/7] EDAC/ghes: Prevent chipset-specific edac
> from loading after ghes_edac is registered
> 
> On Friday, August 19, 2022 4:35 AM, Justin He wrote:
> > > > @@ -1382,6 +1395,7 @@ static int ghes_probe(struct
> platform_device
> > > > *ghes_dev)
> > > >  	platform_set_drvdata(ghes_dev, ghes);
> > > >
> > > >  	ghes->dev = &ghes_dev->dev;
> > > > +	set_ghes_devs_registered(false);
> > >
> > > This does not look right to me.
> > >
> > > The condition of using ghes_edac is (ghes-present) && (ghes-
> preferred),
> > > where:
> > >  - ghes-present is latched on ghes_probe()
> > >  - ghes-preferred is true if the platform-check passes.
> > >
> > > ghes_get_device() introduced in the previous patch works as the
> > > ghes-preferred check.
> > >
> > > We cannot use ghes_edac registration as the ghes-present check in
> this
> > > scheme since it is deferred to module_init().
> >
> > What is the logic for ghes-present check? In this patch, I assumed it
> is equal to
> > "ghes_edac devices have been registered". Seems it is not correct.
> 
> Using (ghes_edac-registered) is a wrong check in this scheme
> since ghes_edac registration is deferred.  This check is fine in
> the current scheme since ghes_edac gets registered before
> any other chipset-specific edac drivers.
> 
> > But should we consider the case as follows:
> > What if sbridge_init () is invoked before ghes_edac_init()? i.e.
> Should we get
> > sb_edac driver selected when ghes_edac is not loaded yet (e.g. on
> HPE)?
> 
> No.  The point is that ghes_edac driver needs to be selected,
> "regardless of the module ordering", on a system with GHES
> present & preferred.
>
> Note that this new scheme leads to the following condition
> change:
> - Current: (ghes-present) && (ghes-preferred) && (ghes_edac registered)
> - New: (ghes-present) && (ghes-preferred)
> 
> The option I suggested previously keeps the current condition,
> but this new scheme does not for better modularity.
> 
> What this means is that if ghes_edac is not enabled (but ghes
> is enabled) on a system with GHES present & preferred, no edac
> driver gets registered.  This change is fine from my (HPE) perspective
> and should be fine for other GHES systems.  GHES systems have
> chipset-specific edac driver in FW.  OS-based chipset-specific edac
> driver is not necessary and may lead to a conflict of chipset register
> ownership.

Currently, running with this on the kernel command line
	ghes.disable

causes the ACPI ghes driver to quit early in acpi_ghes_init():

  /*
   * This driver isn't really modular, however for the time being,
   * continuing to use module_param is the easiest way to remain
   * compatible with existing boot arg use cases.
   */
  bool ghes_disable;
  module_param_named(disable, ghes_disable, bool, 0);

which results in the skx_edac module assuming it should run:
  [   33.628140] calling  skx_init+0x0/0xe5a [skx_edac] @ 1444
  [   33.628531] EDAC MC0: Giving out device to module skx_edac controller Skylake Socket#0 IMC#0: DEV 0000:36:0a.0 (INTERRUPT)
  [   33.641432] EDAC MC1: Giving out device to module skx_edac controller Skylake Socket#0 IMC#1: DEV 0000:36:0c.0 (INTERRUPT)
  [   33.653256] EDAC MC2: Giving out device to module skx_edac controller Skylake Socket#1 IMC#0: DEV 0000:ae:0a.0 (INTERRUPT)
  [   33.665055] EDAC MC3: Giving out device to module skx_edac controller Skylake Socket#1 IMC#1: DEV 0000:ae:0c.0 (INTERRUPT)
  [   33.676801] initcall skx_init+0x0/0xe5a [skx_edac] returned 0 after 36343 usecs

We might need to differentiate between the system ROM really not
offering GHES vs. the ghes module not running.
Borislav Petkov Aug. 19, 2022, 7:31 p.m. UTC | #9
On Fri, Aug 19, 2022 at 06:53:41PM +0000, Kani, Toshi wrote:
> When the system does not implement ACPI GHES table,
> which I suppose is most of the case on ARM.

That should be detected in ghes_get_devices() - just like on x86.
Borislav Petkov Aug. 19, 2022, 7:32 p.m. UTC | #10
On Fri, Aug 19, 2022 at 06:57:11PM +0000, Elliott, Robert (Servers) wrote:
> We might need to differentiate between the system ROM really not
> offering GHES vs. the ghes module not running.

All detectable in ghes_get_devices().
Kani, Toshi Aug. 19, 2022, 7:37 p.m. UTC | #11
On Friday, August 19, 2022 1:31 PM, Borislav Petkov wrote:
> > When the system does not implement ACPI GHES table,
> > which I suppose is most of the case on ARM.
> 
> That should be detected in ghes_get_devices() - just like on x86.

Agreed.  And that is the check to ghes_present flag...

Toshi
diff mbox series

Patch

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 9c52183e3ad9..9272d963b57d 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -95,6 +95,7 @@ 
 #endif
 
 static ATOMIC_NOTIFIER_HEAD(ghes_report_chain);
+static bool devs_registered;
 
 static inline bool is_hest_type_generic_v2(struct ghes *ghes)
 {
@@ -1266,6 +1267,18 @@  static int apei_sdei_unregister_ghes(struct ghes *ghes)
 	return sdei_unregister_ghes(ghes);
 }
 
+void set_ghes_devs_registered(bool flag)
+{
+	devs_registered = flag;
+}
+EXPORT_SYMBOL_GPL(set_ghes_devs_registered);
+
+bool ghes_devs_registered(void)
+{
+	return devs_registered;
+}
+EXPORT_SYMBOL_GPL(ghes_devs_registered);
+
 static int ghes_probe(struct platform_device *ghes_dev)
 {
 	struct acpi_hest_generic *generic;
@@ -1382,6 +1395,7 @@  static int ghes_probe(struct platform_device *ghes_dev)
 	platform_set_drvdata(ghes_dev, ghes);
 
 	ghes->dev = &ghes_dev->dev;
+	set_ghes_devs_registered(false);
 
 	mutex_lock(&ghes_devs_mutex);
 	list_add_tail(&ghes->elist, &ghes_devs);
diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index e4eaf6668feb..f48507fa7b94 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -4329,7 +4329,7 @@  static int __init amd64_edac_init(void)
 	int err = -ENODEV;
 	int i;
 
-	if (ghes_get_devices(0))
+	if (ghes_get_devices(0) && ghes_devs_registered())
 		return -EBUSY;
 
 	owner = edac_get_owner();
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 3bdf8469882d..d26644b3bc47 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -564,6 +564,7 @@  static int __init ghes_edac_init(void)
 	list_for_each_entry_safe(g, g_tmp, ghes_devs, elist) {
 		ghes_edac_register(g->dev);
 	}
+	set_ghes_devs_registered(true);
 
 	return 0;
 }
@@ -576,6 +577,7 @@  static void __exit ghes_edac_exit(void)
 	list_for_each_entry_safe(g, g_tmp, ghes_devs, elist) {
 		ghes_edac_unregister(g);
 	}
+	set_ghes_devs_registered(false);
 }
 module_exit(ghes_edac_exit);
 
diff --git a/drivers/edac/pnd2_edac.c b/drivers/edac/pnd2_edac.c
index 73f2ba0e64e3..66d89d00c4b3 100644
--- a/drivers/edac/pnd2_edac.c
+++ b/drivers/edac/pnd2_edac.c
@@ -1528,7 +1528,7 @@  static int __init pnd2_init(void)
 
 	edac_dbg(2, "\n");
 
-	if (ghes_get_devices(0))
+	if (ghes_get_devices(0) && ghes_devs_registered())
 		return -EBUSY;
 
 	owner = edac_get_owner();
diff --git a/drivers/edac/sb_edac.c b/drivers/edac/sb_edac.c
index 1d0520a16840..a3a89e366a74 100644
--- a/drivers/edac/sb_edac.c
+++ b/drivers/edac/sb_edac.c
@@ -3506,7 +3506,7 @@  static int __init sbridge_init(void)
 
 	edac_dbg(2, "\n");
 
-	if (ghes_get_devices(0))
+	if (ghes_get_devices(0) && ghes_devs_registered())
 		return -EBUSY;
 
 	owner = edac_get_owner();
diff --git a/drivers/edac/skx_base.c b/drivers/edac/skx_base.c
index fe267f8543f5..efa1ae79c35a 100644
--- a/drivers/edac/skx_base.c
+++ b/drivers/edac/skx_base.c
@@ -653,7 +653,7 @@  static int __init skx_init(void)
 
 	edac_dbg(2, "\n");
 
-	if (ghes_get_devices(0))
+	if (ghes_get_devices(0) && ghes_devs_registered())
 		return -EBUSY;
 
 	owner = edac_get_owner();
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 150c0b9500d6..21b9d4d4c3e9 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -72,8 +72,12 @@  int ghes_register_vendor_record_notifier(struct notifier_block *nb);
  */
 void ghes_unregister_vendor_record_notifier(struct notifier_block *nb);
 struct list_head *ghes_get_devices(bool force);
+bool ghes_devs_registered(void);
+void set_ghes_devs_registered(bool flag);
 #else
 static inline struct list_head *ghes_get_devices(bool force) { return NULL; }
+static inline bool ghes_devs_registered(void) { return false; }
+static inline void set_ghes_devs_registered(bool flag) { return; }
 #endif
 
 int ghes_estatus_pool_init(int num_ghes);