mbox series

[RESEND,v3,0/9] Make ghes_edac a proper module

Message ID 20220822154048.188253-1-justin.he@arm.com
Headers show
Series Make ghes_edac a proper module | expand

Message

Jia He Aug. 22, 2022, 3:40 p.m. UTC
Commit dc4e8c07e9e2 ("ACPI: APEI: explicit init of HEST and GHES in
apci_init()") introduced a bug that ghes_edac_register() would be invoked
before edac_init(). Because at that time, the bus "edac" hasn't been even
registered, this created sysfs /devices/mc0 instead of
/sys/devices/system/edac/mc/mc0 on an Ampere eMag server.

Changelog:
v3:
 - refine the commit logs
 - introduce ghes preferred and present flag (by Toshi)
 - move force_load to setup parameter
 - add the ghes_edac_preferred() check for x86/Arm edac drivers (incompleted list)
 
v2:
 - add acked-by tag of Patch 1 from Ard
 - split the notifier patch
 - add 2 patch to get regular drivers selected when ghes edac is not loaded
 - fix an errno in igen6 driver
 - add a patch to fix the sparse warning of ghes
 - refine the commit logs

Jia He (9):
  efi/cper: export several helpers for ghes_edac to use
  EDAC/ghes: Add a notifier for reporting memory errors
  EDAC/ghes: Make ghes_edac a proper module to remove the dependency on
    ghes
  EDAC/ghes: Move ghes_edac.force_load to setup parameter
  EDAC: Don't load chipset-specific edac drivers when ghes_edac is
    preferred
  ghes: Introduce a flag ghes_present
  apei/ghes: Use unrcu_pointer for cmpxchg
  EDAC/igen6: Keep returned errno consistent when edac mc has been
    enabled
  edac: Don't load Arm specific edac drivers when ghes_edac is preferred

 .../admin-guide/kernel-parameters.txt         |   5 +
 drivers/acpi/apei/ghes.c                      |  94 ++++++++++++++--
 drivers/edac/Kconfig                          |   4 +-
 drivers/edac/amd64_edac.c                     |   3 +
 drivers/edac/armada_xp_edac.c                 |   3 +
 drivers/edac/edac_module.h                    |   1 +
 drivers/edac/ghes_edac.c                      | 102 +++++++++++-------
 drivers/edac/i10nm_base.c                     |   3 +
 drivers/edac/igen6_edac.c                     |   5 +-
 drivers/edac/layerscape_edac.c                |   3 +
 drivers/edac/pnd2_edac.c                      |   3 +
 drivers/edac/sb_edac.c                        |   3 +
 drivers/edac/skx_base.c                       |   3 +
 drivers/edac/thunderx_edac.c                  |   3 +
 drivers/edac/xgene_edac.c                     |   3 +
 drivers/firmware/efi/cper.c                   |   3 +
 include/acpi/ghes.h                           |  38 +++----
 17 files changed, 204 insertions(+), 75 deletions(-)

Comments

Jia He Aug. 23, 2022, 1:49 a.m. UTC | #1
Hi, 
Sorry for resending the v3.
There is an exceptional interrupt when I tried to post v3 at the first time.
Maybe it is caused by a comma "," inside the mail name.
E.g. 
Signed-off-by: Some, one <someone@site.com>
Looks like a git sendemail issue? 

Anyway, sorry for the inconvenience.
--
Cheers,
Justin (Jia He)
Rafael J. Wysocki Aug. 23, 2022, 5:19 p.m. UTC | #2
On Tue, Aug 23, 2022 at 3:50 AM Justin He <Justin.He@arm.com> wrote:
>
> Hi,
> Sorry for resending the v3.
> There is an exceptional interrupt when I tried to post v3 at the first time.
> Maybe it is caused by a comma "," inside the mail name.
> E.g.
> Signed-off-by: Some, one <someone@site.com>
> Looks like a git sendemail issue?
>
> Anyway, sorry for the inconvenience.

I've received it twice, but no problem.

I need Boris to tell me what to do with this series.
Borislav Petkov Aug. 24, 2022, 3:37 p.m. UTC | #3
On Mon, Aug 22, 2022 at 03:40:42PM +0000, Jia He wrote:
> Commit dc4e8c07e9e2 ("ACPI: APEI: explicit init of HEST and GHES in
> apci_init()") introduced a bug that ghes_edac_register() would be invoked
> before edac_init(). Because at that time, the bus "edac" hadn't been even
> registered, this created sysfs /devices/mc0 instead of
> /sys/devices/system/edac/mc/mc0 on an Ampere eMag server.
> 
> To remove the dependency of ghes_edac on ghes, make it a proper module. Use
> a list to save the probing devices in ghes_probe(), and defer the
> ghes_edac_register() to module_init() of the new ghes_edac module by
> iterating over the devices list.
> 
> Co-developed-by: Borislav Petkov <bp@alien8.de>
> Signed-off-by: Borislav Petkov <bp@alien8.de>
> Signed-off-by: Jia He <justin.he@arm.com>
> Fixes: dc4e8c07e9e2 ("ACPI: APEI: explicit init of HEST and GHES in apci_init()")
> Cc: stable@kernel.org

Why is this marked for stable?

The prerequisite patches are needed too. I guess this needs to be
communicated to stable folks somehow by doing

Cc: stable@kernel.org # needs commits X, Y, ...

but I guess the committer needs to do that because only at commit time
will X and Y be known...

So, is there any particular reason why this should be in stable?

> @@ -1442,7 +1449,9 @@ static int ghes_remove(struct platform_device *ghes_dev)
>  
>  	ghes_fini(ghes);
>  
> -	ghes_edac_unregister(ghes);
> +	mutex_lock(&ghes_devs_mutex);
> +	list_del_rcu(&ghes->elist);

Is that list RCU-protected?

> +	mutex_unlock(&ghes_devs_mutex);
>  
>  	kfree(ghes);

...

> @@ -566,3 +549,35 @@ void ghes_edac_unregister(struct ghes *ghes)
>  unlock:
>  	mutex_unlock(&ghes_reg_mutex);
>  }
> +
> +static int __init ghes_edac_init(void)
> +{
> +	struct ghes *g, *g_tmp;
> +
> +	if (!IS_ENABLED(CONFIG_X86))
> +		force_load = true;

No, this is not how this works.

> +	ghes_devs = ghes_get_devices(force_load);
> +	if (!ghes_devs)
> +		return -ENODEV;

You simply need to check force_load here.

> +	list_for_each_entry_safe(g, g_tmp, ghes_devs, elist) {
> +		ghes_edac_register(g->dev);
> +	}
> +
> +	return 0;
> +}
Kani, Toshi Aug. 24, 2022, 11:04 p.m. UTC | #4
On Monday, August 22, 2022 9:41 AM, Jia He wrote:
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index e17e0ee8f842..327386f3cf33 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -1537,16 +1537,25 @@ static struct acpi_platform_list plat_list[] = {
>  	{ } /* End */
>  };
> 
> -struct list_head *ghes_get_devices(void)
> +bool ghes_edac_preferred(void)
>  {
>  	int idx = -1;
> 
>  	if (IS_ENABLED(CONFIG_X86)) {
>  		idx = acpi_match_platform_list(plat_list);
>  		if (idx < 0 && !ghes_edac_force)
> -			return NULL;
> +			return false;
>  	}
> 
> +	return true;
> +}
> +EXPORT_SYMBOL_GPL(ghes_edac_preferred);
> +
> +struct list_head *ghes_get_devices(void)
> +{
> +	if (!ghes_edac_preferred())
> +		return NULL;
> +
>  	return &ghes_devs;
>  }

ghes_get_devices() changing multiple times in the series is 
confusing to me.   Can you simply introduce ghes_get_devices()
and ghes_preferred() in the right state in a patch?  Perhaps,
patch #2, #5, #6 can collapse to introduce the two funcs?

The rest of patch #5 adding the call to ghes_edac_preferred()
into other edac drivers can remain as a separate patch.

Toshi
Jia He Aug. 25, 2022, 9:45 a.m. UTC | #5
> -----Original Message-----
> From: Kani, Toshi <toshi.kani@hpe.com>
> Sent: Thursday, August 25, 2022 7:04 AM
> To: Justin He <Justin.He@arm.com>; Len Brown <lenb@kernel.org>; James
> Morse <James.Morse@arm.com>; Tony Luck <tony.luck@intel.com>; Borislav
> Petkov <bp@alien8.de>; Mauro Carvalho Chehab <mchehab@kernel.org>;
> Robert Richter <rric@kernel.org>; Robert Moore <robert.moore@intel.com>;
> Qiuxu Zhuo <qiuxu.zhuo@intel.com>; Yazen Ghannam
> <yazen.ghannam@amd.com>; Jonathan Corbet <corbet@lwn.net>; Jan
> Luebbe <jlu@pengutronix.de>; Khuong Dinh
> <khuong@os.amperecomputing.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>; linux-acpi@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux-edac@vger.kernel.org; devel@acpica.org;
> Rafael J . Wysocki <rafael@kernel.org>; Shuai Xue
> <xueshuai@linux.alibaba.com>; Jarkko Sakkinen <jarkko@kernel.org>;
> linux-efi@vger.kernel.org; nd <nd@arm.com>; Paul E. McKenney
> <paulmck@kernel.org>; Andrew Morton <akpm@linux-foundation.org>;
> Neeraj Upadhyay <quic_neeraju@quicinc.com>; Randy Dunlap
> <rdunlap@infradead.org>; Damien Le Moal
> <damien.lemoal@opensource.wdc.com>; Muchun Song
> <songmuchun@bytedance.com>; linux-doc@vger.kernel.org
> Subject: RE: [RESEND PATCH v3 5/9] EDAC: Don't load chipset-specific edac
> drivers when ghes_edac is preferred
> 
> On Monday, August 22, 2022 9:41 AM, Jia He wrote:
> > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index
> > e17e0ee8f842..327386f3cf33 100644
> > --- a/drivers/acpi/apei/ghes.c
> > +++ b/drivers/acpi/apei/ghes.c
> > @@ -1537,16 +1537,25 @@ static struct acpi_platform_list plat_list[] = {
> >  	{ } /* End */
> >  };
> >
> > -struct list_head *ghes_get_devices(void)
> > +bool ghes_edac_preferred(void)
> >  {
> >  	int idx = -1;
> >
> >  	if (IS_ENABLED(CONFIG_X86)) {
> >  		idx = acpi_match_platform_list(plat_list);
> >  		if (idx < 0 && !ghes_edac_force)
> > -			return NULL;
> > +			return false;
> >  	}
> >
> > +	return true;
> > +}
> > +EXPORT_SYMBOL_GPL(ghes_edac_preferred);
> > +
> > +struct list_head *ghes_get_devices(void) {
> > +	if (!ghes_edac_preferred())
> > +		return NULL;
> > +
> >  	return &ghes_devs;
> >  }
> 
> ghes_get_devices() changing multiple times in the series is
> confusing to me.   Can you simply introduce ghes_get_devices()
> and ghes_preferred() in the right state in a patch?  Perhaps, patch #2, #5, #6
> can collapse to introduce the two funcs?
> 

My purpose was to make it easy for review. I am ok to merge these patches into one.

> The rest of patch #5 adding the call to ghes_edac_preferred() into other edac
> drivers can remain as a separate patch.

Okay, I assume you mean to merge all of the ghes_edac_preferred() checking for intel and
Arm edac drivers into 1 separate patch, am I understanding well?


--
Cheers,
Justin (Jia He)
Jia He Aug. 25, 2022, 12:21 p.m. UTC | #6
Hi Borislav

> -----Original Message-----
> On Mon, Aug 22, 2022 at 03:40:42PM +0000, Jia He wrote:
> > Commit dc4e8c07e9e2 ("ACPI: APEI: explicit init of HEST and GHES in
> > apci_init()") introduced a bug that ghes_edac_register() would be
> > invoked before edac_init(). Because at that time, the bus "edac"
> > hadn't been even registered, this created sysfs /devices/mc0 instead
> > of
> > /sys/devices/system/edac/mc/mc0 on an Ampere eMag server.
> >
> > To remove the dependency of ghes_edac on ghes, make it a proper
> > module. Use a list to save the probing devices in ghes_probe(), and
> > defer the
> > ghes_edac_register() to module_init() of the new ghes_edac module by
> > iterating over the devices list.
> >
> > Co-developed-by: Borislav Petkov <bp@alien8.de>
> > Signed-off-by: Borislav Petkov <bp@alien8.de>
> > Signed-off-by: Jia He <justin.he@arm.com>
> > Fixes: dc4e8c07e9e2 ("ACPI: APEI: explicit init of HEST and GHES in
> > apci_init()")
> > Cc: stable@kernel.org
> 
> Why is this marked for stable?
> 
> The prerequisite patches are needed too. I guess this needs to be
> communicated to stable folks somehow by doing
> 
> Cc: stable@kernel.org # needs commits X, Y, ...
> 
> but I guess the committer needs to do that because only at commit time will X
> and Y be known...
> 
> So, is there any particular reason why this should be in stable?

Okay, I am fine with removing the stable line if dc4e8c07e9e2 will not be included in
any stable tree branch.

> 
> > @@ -1442,7 +1449,9 @@ static int ghes_remove(struct platform_device
> > *ghes_dev)
> >
> >  	ghes_fini(ghes);
> >
> > -	ghes_edac_unregister(ghes);
> > +	mutex_lock(&ghes_devs_mutex);
> > +	list_del_rcu(&ghes->elist);
> 
> Is that list RCU-protected?

No, I will remove the "rcu" suffix since I use list_add_tail.
 
> 
> > +	mutex_unlock(&ghes_devs_mutex);
> >
> >  	kfree(ghes);
> 
> ...
> 
> > @@ -566,3 +549,35 @@ void ghes_edac_unregister(struct ghes *ghes)
> >  unlock:
> >  	mutex_unlock(&ghes_reg_mutex);
> >  }
> > +
> > +static int __init ghes_edac_init(void) {
> > +	struct ghes *g, *g_tmp;
> > +
> > +	if (!IS_ENABLED(CONFIG_X86))
> > +		force_load = true;
> 
> No, this is not how this works.
> 
> > +	ghes_devs = ghes_get_devices(force_load);
> > +	if (!ghes_devs)
> > +		return -ENODEV;
> 
> You simply need to check force_load here.
> 

Okay, hence should I export the *ghes_devs* in ghes?


--
Cheers,
Justin (Jia He)
Kani, Toshi Aug. 25, 2022, 11:38 p.m. UTC | #7
On Thursday, August 25, 2022 3:46 AM, Justin He wrote:
> > ghes_get_devices() changing multiple times in the series is
> > confusing to me.   Can you simply introduce ghes_get_devices()
> > and ghes_preferred() in the right state in a patch?  Perhaps, patch #2, #5,
> > #6 can collapse to introduce the two funcs?
> 
> My purpose was to make it easy for review. I am ok to merge these patches
> into one.

This series starts with your original patchset and then has additional
patches to address the issues with the original patchset.  The number
of the patches continues to increase in this way...  You do not need to 
record the history of discussions and design changes in the series.
 
> > The rest of patch #5 adding the call to ghes_edac_preferred() into other edac
> > drivers can remain as a separate patch.
> 
> Okay, I assume you mean to merge all of the ghes_edac_preferred()
> checking for intel and
> Arm edac drivers into 1 separate patch, am I understanding well?

No, that's not what I meant.

ghes_get_devices() and ghes_edac_preferred() look good to me after
all patches are applied.  So, instead of introducing the original design of
ghes_get_devices() and then changing its design multiple times in the
series, please simply add the final version of ghes_get_devices() and 
ghes_edac_preferred() in a single patch.  They are small functions anyway.

That is, your series includes something like:
  - PATCH: Add ghes_get_devices() and ghes_edac_preferred() interfaces
  - PATCH: Add ghes_edac_preferred check to chipset-specific edac drivers

Toshi
Kani, Toshi Aug. 26, 2022, 7:30 p.m. UTC | #8
On Thursday, August 25, 2022 6:21 AM, Justin He wrote:
> > > @@ -566,3 +549,35 @@ void ghes_edac_unregister(struct ghes *ghes)
> > >  unlock:
> > >  	mutex_unlock(&ghes_reg_mutex);
> > >  }
> > > +
> > > +static int __init ghes_edac_init(void) {
> > > +	struct ghes *g, *g_tmp;
> > > +
> > > +	if (!IS_ENABLED(CONFIG_X86))
> > > +		force_load = true;
> >
> > No, this is not how this works.
> >
> > > +	ghes_devs = ghes_get_devices(force_load);
> > > +	if (!ghes_devs)
> > > +		return -ENODEV;
> >
> > You simply need to check force_load here.
> >
> 
> Okay, hence should I export the *ghes_devs* in ghes?

It does not matter.  This series then moves the force_load check
to ghes_edac_preferred().  It is confusing for reviewers...
Please divide patches based on the final design. 

Toshi