acpi: Fix hed module initialization order when it is built-in

Message ID 20241115035014.1339256-1-tanxiaofei@huawei.com
State New
Series acpi: Fix hed module initialization order when it is built-in

Commit Message

Xiaofei Tan Nov. 15, 2024, 3:50 a.m. UTC
When the hed module is built in, its init order is determined by
Makefile order, and that order violates expectations: hed is
initialized after evged. RAS records therefore cannot be handled
during the window in which evged has been initialized but hed has not.
If the number of such RAS records exceeds the number of APEI HEST
error sources, all HEST resources could be occupied, which could then
affect subsequent RAS error reporting.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
---
 drivers/acpi/Makefile | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Rafael J. Wysocki Dec. 10, 2024, 5:59 p.m. UTC | #1
On Fri, Nov 15, 2024 at 4:56 AM Xiaofei Tan <tanxiaofei@huawei.com> wrote:
>
> When the module hed is built-in, the init order is determined by
> Makefile order.

Are you sure?

> That order violates expectations. Because the module
> hed init is behind evged. RAS records can't be handled in the
> special time window that evged has initialized while hed not.
> If the number of such RAS records is more than the APEI HEST error
> source number, the HEST resources could be occupied all, and then
> could affect subsequent RAS error reporting.

Well, the problem is real, but does the change really prevent it from
happening or does it just increase the likelihood of success?

In the latter case, and generally speaking too, it would be better to
add explicit synchronization between evged and hed.
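
One possible shape for such explicit synchronization, as a userspace sketch only and not kernel code: records that arrive before the consumer has registered are parked in a small queue and replayed on registration, so nothing depends on init order. Every name here (`hed_sync`, `hed_sync_notify`, `hed_sync_register`, `MAX_PENDING`, `count_record`) is hypothetical and not taken from the hed or evged sources, and locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 8	/* hypothetical bound on parked records */

typedef void (*error_handler)(int record);

/* Hypothetical sync point between a notifier and a late consumer. */
struct hed_sync {
	error_handler handler;		/* NULL until the consumer registers */
	int pending[MAX_PENDING];	/* records seen before registration */
	size_t npending;
	size_t dropped;			/* records lost because the queue was full */
};

/* Deliver a record now if a handler exists, otherwise park it. */
static void hed_sync_notify(struct hed_sync *s, int record)
{
	assert(s != NULL);
	if (s->handler) {
		s->handler(record);
		return;
	}
	if (s->npending < MAX_PENDING)
		s->pending[s->npending++] = record;
	else
		s->dropped++;
}

/* Install the handler and replay everything that was parked, in order. */
static void hed_sync_register(struct hed_sync *s, error_handler h)
{
	size_t i;

	assert(s != NULL && h != NULL);
	s->handler = h;
	for (i = 0; i < s->npending; i++)
		h(s->pending[i]);
	s->npending = 0;
}

/* Sample handler for illustration: counts the records it receives. */
static int handled_count;

static void count_record(int record)
{
	(void)record;
	handled_count++;
}
```

With a zero-initialized `struct hed_sync`, two calls to `hed_sync_notify()` before `hed_sync_register(&s, count_record)` leave `handled_count` at 0, and registration then replays both. A real implementation would still need to decide what to do when the queue overflows.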

> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
> ---
>  drivers/acpi/Makefile | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
> index 61ca4afe83dc..54f60b7922ad 100644
> --- a/drivers/acpi/Makefile
> +++ b/drivers/acpi/Makefile
> @@ -15,6 +15,13 @@ endif
>
>  obj-$(CONFIG_ACPI)             += tables.o
>
> +#
> +# The hed.o needs to be in front of evged.o to avoid the problem that
> +# RAS errors cannot be handled in the special time window of startup
> +# phase that evged has initialized while hed not.
> +#
> +obj-$(CONFIG_ACPI_HED)         += hed.o
> +
>  #
>  # ACPI Core Subsystem (Interpreter)
>  #
> @@ -95,7 +102,6 @@ obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o
>  obj-$(CONFIG_ACPI_BATTERY)     += battery.o
>  obj-$(CONFIG_ACPI_SBS)         += sbshc.o
>  obj-$(CONFIG_ACPI_SBS)         += sbs.o
> -obj-$(CONFIG_ACPI_HED)         += hed.o
>  obj-$(CONFIG_ACPI_EC_DEBUGFS)  += ec_sys.o
>  obj-$(CONFIG_ACPI_BGRT)                += bgrt.o
>  obj-$(CONFIG_ACPI_CPPC_LIB)    += cppc_acpi.o
> --
> 2.33.0
>
Mauro Carvalho Chehab Dec. 11, 2024, 4:22 p.m. UTC | #2
On Fri, 15 Nov 2024 11:50:14 +0800
Xiaofei Tan <tanxiaofei@huawei.com> wrote:

Please always copy my @kernel.org address for upstream work.

> When the module hed is built-in, the init order is determined by
> Makefile order. That order violates expectations. Because the module
> hed init is behind evged. RAS records can't be handled in the
> special time window that evged has initialized while hed not.
> If the number of such RAS records is more than the APEI HEST error
> source number, the HEST resources could be occupied all, and then
> could affect subsequent RAS error reporting.

IMO, it is a lot better to use a late init call. Please see:
	include/linux/init.h

This would be done by, for instance, using late_initcall().
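
To make the ordering argument concrete, here is a small userspace model, not kernel code, of how built-in initcalls are ordered: lower levels run first, and within one level the link order (which follows Makefile order) decides. It assumes that both drivers currently register at the device level, and it treats moving evged to the late level as just one possible reading of the late_initcall() idea. The names `struct initcall`, `find_call` and `runs_before` are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace model only; the numbers mirror the device (6) and late (7) levels. */
enum initcall_level { LEVEL_DEVICE = 6, LEVEL_LATE = 7 };

struct initcall {
	const char *name;		/* hypothetical label for an init function */
	enum initcall_level level;	/* level the call is registered at */
};

/* Return the index of `name` in `calls`, or n if it is not present. */
static size_t find_call(const struct initcall *calls, size_t n, const char *name)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(calls[i].name, name) == 0)
			return i;
	return n;
}

/*
 * Return 1 if `first` runs before `second`. `calls` is listed in link
 * order: lower levels run first, and link order breaks ties within a level.
 */
static int runs_before(const struct initcall *calls, size_t n,
		       const char *first, const char *second)
{
	size_t a, b;

	assert(calls != NULL);
	a = find_call(calls, n, first);
	b = find_call(calls, n, second);
	if (a == n || b == n)
		return 0;
	if (calls[a].level != calls[b].level)
		return calls[a].level < calls[b].level;
	return a < b;
}
```

With `{ "evged", LEVEL_DEVICE }` listed before `{ "hed", LEVEL_DEVICE }`, `runs_before(calls, 2, "hed", "evged")` returns 0. Swapping the two entries (the Makefile change) or registering evged at `LEVEL_LATE` both make it return 1. Neither variant says anything about the case where hed is built as a module.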

Now, what we have is:

	acpi-y                          += evged.o
	obj-$(CONFIG_ACPI_HED)          += hed.o

where ACPI_HED is a tristate.

It sounds to me that, even with your patch, if you build
HED as a module, you'll still have a problem.

Shouldn't ACPI_HED be changed from tristate to bool?

Regards,
Mauro

> 
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
> ---
>  drivers/acpi/Makefile | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
> index 61ca4afe83dc..54f60b7922ad 100644
> --- a/drivers/acpi/Makefile
> +++ b/drivers/acpi/Makefile
> @@ -15,6 +15,13 @@ endif
>  
>  obj-$(CONFIG_ACPI)		+= tables.o
>  
> +#
> +# The hed.o needs to be in front of evged.o to avoid the problem that
> +# RAS errors cannot be handled in the special time window of startup
> +# phase that evged has initialized while hed not.
> +#
> +obj-$(CONFIG_ACPI_HED)		+= hed.o
> +
>  #
>  # ACPI Core Subsystem (Interpreter)
>  #
> @@ -95,7 +102,6 @@ obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o
>  obj-$(CONFIG_ACPI_BATTERY)	+= battery.o
>  obj-$(CONFIG_ACPI_SBS)		+= sbshc.o
>  obj-$(CONFIG_ACPI_SBS)		+= sbs.o
> -obj-$(CONFIG_ACPI_HED)		+= hed.o
>  obj-$(CONFIG_ACPI_EC_DEBUGFS)	+= ec_sys.o
>  obj-$(CONFIG_ACPI_BGRT)		+= bgrt.o
>  obj-$(CONFIG_ACPI_CPPC_LIB)	+= cppc_acpi.o
Xiaofei Tan Dec. 23, 2024, 9:31 a.m. UTC | #3
Hi Rafael,

On 2024/12/11 1:59, Rafael J. Wysocki wrote:
> On Fri, Nov 15, 2024 at 4:56 AM Xiaofei Tan <tanxiaofei@huawei.com> wrote:
>> When the module hed is built-in, the init order is determined by
>> Makefile order.
> Are you sure?

Yes.

>> That order violates expectations. Because the module
>> hed init is behind evged. RAS records can't be handled in the
>> special time window that evged has initialized while hed not.
>> If the number of such RAS records is more than the APEI HEST error
>> source number, the HEST resources could be occupied all, and then
>> could affect subsequent RAS error reporting.
> Well, the problem is real, but does the change really prevent it from
> happening or does it just increase the likelihood of success?

It is completely solved when the driver is built in. If HED is built as a
module, it is not solved.

>
> In the latter case, and generally speaking too, it would be better to
> add explicit synchronization between evged and hed.
>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
>> ---
>>   drivers/acpi/Makefile | 8 +++++++-
>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
>> index 61ca4afe83dc..54f60b7922ad 100644
>> --- a/drivers/acpi/Makefile
>> +++ b/drivers/acpi/Makefile
>> @@ -15,6 +15,13 @@ endif
>>
>>   obj-$(CONFIG_ACPI)             += tables.o
>>
>> +#
>> +# The hed.o needs to be in front of evged.o to avoid the problem that
>> +# RAS errors cannot be handled in the special time window of startup
>> +# phase that evged has initialized while hed not.
>> +#
>> +obj-$(CONFIG_ACPI_HED)         += hed.o
>> +
>>   #
>>   # ACPI Core Subsystem (Interpreter)
>>   #
>> @@ -95,7 +102,6 @@ obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o
>>   obj-$(CONFIG_ACPI_BATTERY)     += battery.o
>>   obj-$(CONFIG_ACPI_SBS)         += sbshc.o
>>   obj-$(CONFIG_ACPI_SBS)         += sbs.o
>> -obj-$(CONFIG_ACPI_HED)         += hed.o
>>   obj-$(CONFIG_ACPI_EC_DEBUGFS)  += ec_sys.o
>>   obj-$(CONFIG_ACPI_BGRT)                += bgrt.o
>>   obj-$(CONFIG_ACPI_CPPC_LIB)    += cppc_acpi.o
>> --
>> 2.33.0
>>
> .
Xiaofei Tan Dec. 23, 2024, 9:44 a.m. UTC | #4
Hi Mauro,

On 2024/12/12 0:22, Mauro Carvalho Chehab wrote:
> On Fri, 15 Nov 2024 11:50:14 +0800
> Xiaofei Tan <tanxiaofei@huawei.com> wrote:
>
> Please always copy my @kernel.org address for upstream work.

OK

>> When the module hed is built-in, the init order is determined by
>> Makefile order. That order violates expectations. Because the module
>> hed init is behind evged. RAS records can't be handled in the
>> special time window that evged has initialized while hed not.
>> If the number of such RAS records is more than the APEI HEST error
>> source number, the HEST resources could be occupied all, and then
>> could affect subsequent RAS error reporting.
> IMO, it is a lot better to use a late init call. Please see:
> 	include/linux/init.h
>
> This would be done by, for instance, using late_initcall().
>
> Now, what we have is:
>
> 	acpi-y                          += evged.o
> 	obj-$(CONFIG_ACPI_HED)          += hed.o
>
> Where ACPI_HED being a tri-state.
>
> It sounds to me, that even, with your patch, if you build
> HED as a module, you'll still have a problem.
Yes, and it is also affected by the loading sequence of HED and GHES. Either way, the risk remains.
>
> Shouldn't be ACPI_HED be changed from tristate to bool?

Agreed.

@Rafael

Hi Rafael, please help check whether we can make this change. Thanks.


>
> Regards,
> Mauro
>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
>> ---
>>   drivers/acpi/Makefile | 8 +++++++-
>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
>> index 61ca4afe83dc..54f60b7922ad 100644
>> --- a/drivers/acpi/Makefile
>> +++ b/drivers/acpi/Makefile
>> @@ -15,6 +15,13 @@ endif
>>   
>>   obj-$(CONFIG_ACPI)		+= tables.o
>>   
>> +#
>> +# The hed.o needs to be in front of evged.o to avoid the problem that
>> +# RAS errors cannot be handled in the special time window of startup
>> +# phase that evged has initialized while hed not.
>> +#
>> +obj-$(CONFIG_ACPI_HED)		+= hed.o
>> +
>>   #
>>   # ACPI Core Subsystem (Interpreter)
>>   #
>> @@ -95,7 +102,6 @@ obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o
>>   obj-$(CONFIG_ACPI_BATTERY)	+= battery.o
>>   obj-$(CONFIG_ACPI_SBS)		+= sbshc.o
>>   obj-$(CONFIG_ACPI_SBS)		+= sbs.o
>> -obj-$(CONFIG_ACPI_HED)		+= hed.o
>>   obj-$(CONFIG_ACPI_EC_DEBUGFS)	+= ec_sys.o
>>   obj-$(CONFIG_ACPI_BGRT)		+= bgrt.o
>>   obj-$(CONFIG_ACPI_CPPC_LIB)	+= cppc_acpi.o
> .
Jonathan Cameron Dec. 23, 2024, 7:33 p.m. UTC | #5
On Mon, 23 Dec 2024 17:31:08 +0800
Xiaofei Tan <tanxiaofei@huawei.com> wrote:

> Hi Rafael,
> 
> On 2024/12/11 1:59, Rafael J. Wysocki wrote:
> > On Fri, Nov 15, 2024 at 4:56 AM Xiaofei Tan <tanxiaofei@huawei.com> wrote:  
> >> When the module hed is built-in, the init order is determined by
> >> Makefile order.  
> > Are you sure?  
> 
> yes

We had a similar fix in CXL recently (which is why I suggested this approach
internally when tanxiaofei mentioned the problem).

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/cxl?id=6575b268157f37929948a8d1f3bafb3d7c055bc1

The related discussion for the CXL patch was the first time I'd come across this
solution to load order for built-in cases.


> 
> >> That order violates expectations. Because the module
> >> hed init is behind evged. RAS records can't be handled in the
> >> special time window that evged has initialized while hed not.
> >> If the number of such RAS records is more than the APEI HEST error
> >> source number, the HEST resources could be occupied all, and then
> >> could affect subsequent RAS error reporting.  
> > Well, the problem is real, but does the change really prevent it from
> > happening or does it just increase the likelihood of success?  
> 
> It can be completely solved if the driver used as built-in way. If build HED as a
> module, it not solved.

Can we prevent that condition from happening with appropriate Kconfig?
It's annoying to restrict build options, but if that is needed to make it
work, it is better than not working!

Jonathan


> 
> >
> > In the latter case, and generally speaking too, it would be better to
> > add explicit synchronization between evged and hed.
> >  
> >> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> >> Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
> >> ---
> >>   drivers/acpi/Makefile | 8 +++++++-
> >>   1 file changed, 7 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
> >> index 61ca4afe83dc..54f60b7922ad 100644
> >> --- a/drivers/acpi/Makefile
> >> +++ b/drivers/acpi/Makefile
> >> @@ -15,6 +15,13 @@ endif
> >>
> >>   obj-$(CONFIG_ACPI)             += tables.o
> >>
> >> +#
> >> +# The hed.o needs to be in front of evged.o to avoid the problem that
> >> +# RAS errors cannot be handled in the special time window of startup
> >> +# phase that evged has initialized while hed not.
> >> +#
> >> +obj-$(CONFIG_ACPI_HED)         += hed.o
> >> +
> >>   #
> >>   # ACPI Core Subsystem (Interpreter)
> >>   #
> >> @@ -95,7 +102,6 @@ obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o
> >>   obj-$(CONFIG_ACPI_BATTERY)     += battery.o
> >>   obj-$(CONFIG_ACPI_SBS)         += sbshc.o
> >>   obj-$(CONFIG_ACPI_SBS)         += sbs.o
> >> -obj-$(CONFIG_ACPI_HED)         += hed.o
> >>   obj-$(CONFIG_ACPI_EC_DEBUGFS)  += ec_sys.o
> >>   obj-$(CONFIG_ACPI_BGRT)                += bgrt.o
> >>   obj-$(CONFIG_ACPI_CPPC_LIB)    += cppc_acpi.o
> >> --
> >> 2.33.0
> >>  
> > .
Xiaofei Tan Dec. 28, 2024, 10:23 a.m. UTC | #6
Hi Jonathan,

On 2024/12/24 3:33, Jonathan Cameron wrote:
> On Mon, 23 Dec 2024 17:31:08 +0800
> Xiaofei Tan <tanxiaofei@huawei.com> wrote:
>
>> Hi Rafael,
>>
>> On 2024/12/11 1:59, Rafael J. Wysocki wrote:
>>> On Fri, Nov 15, 2024 at 4:56 AM Xiaofei Tan <tanxiaofei@huawei.com> wrote:
>>>> When the module hed is built-in, the init order is determined by
>>>> Makefile order.
>>> Are you sure?
>> yes
> We had a similar fix in CXL recently (which is why I suggested this approach
> internally when tanxiaofei mentioned the problem).
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/cxl?id=6575b268157f37929948a8d1f3bafb3d7c055bc1
>
> The related discussion for the CXL patch was the first time I'd come across solution
> to load order for built in cases.
>
Yes :)

>>>> That order violates expectations. Because the module
>>>> hed init is behind evged. RAS records can't be handled in the
>>>> special time window that evged has initialized while hed not.
>>>> If the number of such RAS records is more than the APEI HEST error
>>>> source number, the HEST resources could be occupied all, and then
>>>> could affect subsequent RAS error reporting.
>>> Well, the problem is real, but does the change really prevent it from
>>> happening or does it just increase the likelihood of success?
>> It can be completely solved if the driver used as built-in way. If build HED as a
>> module, it not solved.
> Can we enforce that condition not happening with appropriate Kconfig?
> It's annoying to restrict build options, but if needed to make it work
> then better than not working!

Agreed. I will change ACPI_HED from tristate to bool if there are no other comments. Thanks.

>
> Jonathan
>
>
>>> In the latter case, and generally speaking too, it would be better to
>>> add explicit synchronization between evged and hed.
>>>   
>>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>> Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
>>>> ---
>>>>    drivers/acpi/Makefile | 8 +++++++-
>>>>    1 file changed, 7 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
>>>> index 61ca4afe83dc..54f60b7922ad 100644
>>>> --- a/drivers/acpi/Makefile
>>>> +++ b/drivers/acpi/Makefile
>>>> @@ -15,6 +15,13 @@ endif
>>>>
>>>>    obj-$(CONFIG_ACPI)             += tables.o
>>>>
>>>> +#
>>>> +# The hed.o needs to be in front of evged.o to avoid the problem that
>>>> +# RAS errors cannot be handled in the special time window of startup
>>>> +# phase that evged has initialized while hed not.
>>>> +#
>>>> +obj-$(CONFIG_ACPI_HED)         += hed.o
>>>> +
>>>>    #
>>>>    # ACPI Core Subsystem (Interpreter)
>>>>    #
>>>> @@ -95,7 +102,6 @@ obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o
>>>>    obj-$(CONFIG_ACPI_BATTERY)     += battery.o
>>>>    obj-$(CONFIG_ACPI_SBS)         += sbshc.o
>>>>    obj-$(CONFIG_ACPI_SBS)         += sbs.o
>>>> -obj-$(CONFIG_ACPI_HED)         += hed.o
>>>>    obj-$(CONFIG_ACPI_EC_DEBUGFS)  += ec_sys.o
>>>>    obj-$(CONFIG_ACPI_BGRT)                += bgrt.o
>>>>    obj-$(CONFIG_ACPI_CPPC_LIB)    += cppc_acpi.o
>>>> --
>>>> 2.33.0
>>>>   
>>> .
> .
Patch

diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
index 61ca4afe83dc..54f60b7922ad 100644
--- a/drivers/acpi/Makefile
+++ b/drivers/acpi/Makefile
@@ -15,6 +15,13 @@  endif
 
 obj-$(CONFIG_ACPI)		+= tables.o
 
+#
+# The hed.o needs to be in front of evged.o to avoid the problem that
+# RAS errors cannot be handled in the special time window of startup
+# phase that evged has initialized while hed not.
+#
+obj-$(CONFIG_ACPI_HED)		+= hed.o
+
 #
 # ACPI Core Subsystem (Interpreter)
 #
@@ -95,7 +102,6 @@  obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o
 obj-$(CONFIG_ACPI_BATTERY)	+= battery.o
 obj-$(CONFIG_ACPI_SBS)		+= sbshc.o
 obj-$(CONFIG_ACPI_SBS)		+= sbs.o
-obj-$(CONFIG_ACPI_HED)		+= hed.o
 obj-$(CONFIG_ACPI_EC_DEBUGFS)	+= ec_sys.o
 obj-$(CONFIG_ACPI_BGRT)		+= bgrt.o
 obj-$(CONFIG_ACPI_CPPC_LIB)	+= cppc_acpi.o