diff mbox series

[v3,4/4] pseries/mobility: set NMI watchdog factor during LPM

Message ID 20220627135347.32624-5-ldufour@linux.ibm.com
State Superseded
Headers show
Series Extending NMI watchdog during LPM | expand

Commit Message

Laurent Dufour June 27, 2022, 1:53 p.m. UTC
During a LPM, while the memory transfer is in progress on the arrival side,
some latencies is generated when accessing not yet transferred pages on the
arrival side. Thus, the NMI watchdog may be triggered too frequently, which
increases the risk to hit a NMI interrupt in a bad place in the kernel,
leading to a kernel panic.

Disabling the Hard Lockup Watchdog until the memory transfer could be a too
strong work around, some users would want this timeout to be eventually
triggered if the system is hanging even during LPM.

Introduce a new sysctl variable nmi_watchdog_factor. It allows to apply
a factor to the NMI watchdog timeout during a LPM. Just before the CPU are
stopped for the switchover sequence, the NMI watchdog timer is set to
 watchdog_tresh + factor%

A value of 0 has no effect. The default value is 200, meaning that the NMI
watchdog is set to 30s during LPM (based on a 10s watchdog_tresh value).
Once the memory transfer is achieved, the factor is reset to 0.

Setting this value to a high number is like disabling the NMI watchdog
during a LPM.

Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
 Documentation/admin-guide/sysctl/kernel.rst | 12 ++++++
 arch/powerpc/platforms/pseries/mobility.c   | 43 +++++++++++++++++++++
 2 files changed, 55 insertions(+)

Comments

Laurent Dufour July 13, 2022, 2:22 p.m. UTC | #1
Le 12/07/2022 à 11:47, Laurent Dufour a écrit :
> Le 12/07/2022 à 03:46, Nicholas Piggin a écrit :
>> Excerpts from Laurent Dufour's message of June 27, 2022 11:53 pm:
>>> During a LPM, while the memory transfer is in progress on the arrival side,
>>> some latencies is generated when accessing not yet transferred pages on the
>>> arrival side. Thus, the NMI watchdog may be triggered too frequently, which
>>> increases the risk to hit a NMI interrupt in a bad place in the kernel,
>>> leading to a kernel panic.
>>>
>>> Disabling the Hard Lockup Watchdog until the memory transfer could be a too
>>> strong work around, some users would want this timeout to be eventually
>>> triggered if the system is hanging even during LPM.
>>>
>>> Introduce a new sysctl variable nmi_watchdog_factor. It allows to apply
>>> a factor to the NMI watchdog timeout during a LPM. Just before the CPU are
>>> stopped for the switchover sequence, the NMI watchdog timer is set to
>>>  watchdog_tresh + factor%
>>>
>>> A value of 0 has no effect. The default value is 200, meaning that the NMI
>>> watchdog is set to 30s during LPM (based on a 10s watchdog_tresh value).
>>> Once the memory transfer is achieved, the factor is reset to 0.
>>>
>>> Setting this value to a high number is like disabling the NMI watchdog
>>> during a LPM.
>>>
>>> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
>>> ---
>>>  Documentation/admin-guide/sysctl/kernel.rst | 12 ++++++
>>>  arch/powerpc/platforms/pseries/mobility.c   | 43 +++++++++++++++++++++
>>>  2 files changed, 55 insertions(+)
>>>
>>> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
>>> index ddccd1077462..0bb0b7f27e96 100644
>>> --- a/Documentation/admin-guide/sysctl/kernel.rst
>>> +++ b/Documentation/admin-guide/sysctl/kernel.rst
>>> @@ -592,6 +592,18 @@ to the guest kernel command line (see
>>>  Documentation/admin-guide/kernel-parameters.rst).
>>>  
>>>  
>>> +nmi_watchdog_factor (PPC only)
>>> +==================================
>>> +
>>> +Factor apply to to the NMI watchdog timeout (only when ``nmi_watchdog`` is
>>> +set to 1). This factor represents the percentage added to
>>> +``watchdog_thresh`` when calculating the NMI watchdog timeout during a
>>> +LPM. The soft lockup timeout is not impacted.
>>
>> Could "LPM" or "mobility" be a bit more prominent in the parameter name
>> and documentation? Something else might want to add a factor as well,
>> one day.
> 
> In the V2 version, Nathan suggested "making the user-visible
> name more generic (e.g. "nmi_watchdog_factor") in case it makes sense to
> apply this to other contexts in the future."
> 
> So I made the change to a more generic name. I think this is a good option
> since the documentation is explicit about the LPM particular case.
> If in the future this factor needs to apply during an other operation that
> name will be generic enough.
> 
> Do you agree ?

Nick and I discussed that.
Nick prefers to have LPM in the tunable names, and thinks we can add a new
tunable if a separate user came up which required it.

We agree that 'nmi_wd_lpm_factor' is a good name.
I'll send a v5 updating that name.

>>
>> Otherwise the code looks okay.
>>
>> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
>>
>>> +
>>> +A value of 0 means no change. The default value is 200 meaning the NMI
>>> +watchdog is set to 30s (based on ``watchdog_thresh`` equal to 10).
>>> +
>>> +
>>>  numa_balancing
>>>  ==============
>>>  
>>> diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
>>> index 907a779074d6..649155faafc2 100644
>>> --- a/arch/powerpc/platforms/pseries/mobility.c
>>> +++ b/arch/powerpc/platforms/pseries/mobility.c
>>> @@ -48,6 +48,39 @@ struct update_props_workarea {
>>>  #define MIGRATION_SCOPE	(1)
>>>  #define PRRN_SCOPE -2
>>>  
>>> +#ifdef CONFIG_PPC_WATCHDOG
>>> +static unsigned int nmi_wd_factor = 200;
>>> +
>>> +#ifdef CONFIG_SYSCTL
>>> +static struct ctl_table nmi_wd_factor_ctl_table[] = {
>>> +	{
>>> +		.procname	= "nmi_watchdog_factor",
>>> +		.data		= &nmi_wd_factor,
>>> +		.maxlen		= sizeof(int),
>>> +		.mode		= 0644,
>>> +		.proc_handler	= proc_douintvec_minmax,
>>> +	},
>>> +	{}
>>> +};
>>> +static struct ctl_table nmi_wd_factor_sysctl_root[] = {
>>> +	{
>>> +		.procname       = "kernel",
>>> +		.mode           = 0555,
>>> +		.child          = nmi_wd_factor_ctl_table,
>>> +	},
>>> +	{}
>>> +};
>>> +
>>> +static int __init register_nmi_wd_factor_sysctl(void)
>>> +{
>>> +	register_sysctl_table(nmi_wd_factor_sysctl_root);
>>> +
>>> +	return 0;
>>> +}
>>> +device_initcall(register_nmi_wd_factor_sysctl);
>>> +#endif /* CONFIG_SYSCTL */
>>> +#endif /* CONFIG_PPC_WATCHDOG */
>>> +
>>>  static int mobility_rtas_call(int token, char *buf, s32 scope)
>>>  {
>>>  	int rc;
>>> @@ -702,13 +735,20 @@ static int pseries_suspend(u64 handle)
>>>  static int pseries_migrate_partition(u64 handle)
>>>  {
>>>  	int ret;
>>> +	unsigned int factor = 0;
>>>  
>>> +#ifdef CONFIG_PPC_WATCHDOG
>>> +	factor = nmi_wd_factor;
>>> +#endif
>>>  	ret = wait_for_vasi_session_suspending(handle);
>>>  	if (ret)
>>>  		return ret;
>>>  
>>>  	vas_migration_handler(VAS_SUSPEND);
>>>  
>>> +	if (factor)
>>> +		watchdog_nmi_set_lpm_factor(factor);
>>> +
>>>  	ret = pseries_suspend(handle);
>>>  	if (ret == 0) {
>>>  		post_mobility_fixup();
>>> @@ -716,6 +756,9 @@ static int pseries_migrate_partition(u64 handle)
>>>  	} else
>>>  		pseries_cancel_migration(handle, ret);
>>>  
>>> +	if (factor)
>>> +		watchdog_nmi_set_lpm_factor(0);
>>> +
>>>  	vas_migration_handler(VAS_RESUME);
>>>  
>>>  	return ret;
>>> -- 
>>> 2.36.1
>>>
>>>
>
diff mbox series

Patch

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index ddccd1077462..0bb0b7f27e96 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -592,6 +592,18 @@  to the guest kernel command line (see
 Documentation/admin-guide/kernel-parameters.rst).
 
 
+nmi_watchdog_factor (PPC only)
+==================================
+
+Factor apply to to the NMI watchdog timeout (only when ``nmi_watchdog`` is
+set to 1). This factor represents the percentage added to
+``watchdog_thresh`` when calculating the NMI watchdog timeout during a
+LPM. The soft lockup timeout is not impacted.
+
+A value of 0 means no change. The default value is 200 meaning the NMI
+watchdog is set to 30s (based on ``watchdog_thresh`` equal to 10).
+
+
 numa_balancing
 ==============
 
diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
index 907a779074d6..649155faafc2 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -48,6 +48,39 @@  struct update_props_workarea {
 #define MIGRATION_SCOPE	(1)
 #define PRRN_SCOPE -2
 
+#ifdef CONFIG_PPC_WATCHDOG
+static unsigned int nmi_wd_factor = 200;
+
+#ifdef CONFIG_SYSCTL
+static struct ctl_table nmi_wd_factor_ctl_table[] = {
+	{
+		.procname	= "nmi_watchdog_factor",
+		.data		= &nmi_wd_factor,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_douintvec_minmax,
+	},
+	{}
+};
+static struct ctl_table nmi_wd_factor_sysctl_root[] = {
+	{
+		.procname       = "kernel",
+		.mode           = 0555,
+		.child          = nmi_wd_factor_ctl_table,
+	},
+	{}
+};
+
+static int __init register_nmi_wd_factor_sysctl(void)
+{
+	register_sysctl_table(nmi_wd_factor_sysctl_root);
+
+	return 0;
+}
+device_initcall(register_nmi_wd_factor_sysctl);
+#endif /* CONFIG_SYSCTL */
+#endif /* CONFIG_PPC_WATCHDOG */
+
 static int mobility_rtas_call(int token, char *buf, s32 scope)
 {
 	int rc;
@@ -702,13 +735,20 @@  static int pseries_suspend(u64 handle)
 static int pseries_migrate_partition(u64 handle)
 {
 	int ret;
+	unsigned int factor = 0;
 
+#ifdef CONFIG_PPC_WATCHDOG
+	factor = nmi_wd_factor;
+#endif
 	ret = wait_for_vasi_session_suspending(handle);
 	if (ret)
 		return ret;
 
 	vas_migration_handler(VAS_SUSPEND);
 
+	if (factor)
+		watchdog_nmi_set_lpm_factor(factor);
+
 	ret = pseries_suspend(handle);
 	if (ret == 0) {
 		post_mobility_fixup();
@@ -716,6 +756,9 @@  static int pseries_migrate_partition(u64 handle)
 	} else
 		pseries_cancel_migration(handle, ret);
 
+	if (factor)
+		watchdog_nmi_set_lpm_factor(0);
+
 	vas_migration_handler(VAS_RESUME);
 
 	return ret;