diff mbox series

[RFC] watchdog: Adding softwatchdog

Message ID 20210424102555.28203-2-peter.enderborg@sony.com
State New
Headers show
Series [RFC] watchdog: Adding softwatchdog | expand

Commit Message

Peter.Enderborg@sony.com April 24, 2021, 10:25 a.m. UTC
This is not a rebooting watchdog. It's function is to take other
actions than a hard reboot. On many complex system there is some
kind of manager that monitor and take action on slow systems.
Android has it's lowmemorykiller (lmkd), desktops has earlyoom.
This watchdog can be used to help monitor to preform some basic
action to keep the monitor running.

It can also be used standalone. This add a policy that is
killing the process with highest oom_score_adj and using
oom functions to it quickly. I think it is a good usecase
for the patch. Memory siuations can be problematic for
software that monitor system, but other prolicys can
should also be possible. Like picking tasks from a memcg, or
specific UID's or what ever is low priority.
---
 drivers/watchdog/Makefile       |   2 +
 drivers/watchdog/softwatchdog.c | 231 ++++++++++++++++++++++++++++++++
 include/uapi/linux/watchdog.h   |   6 +
 mm/oom_kill.c                   |   4 +-
 4 files changed, 241 insertions(+), 2 deletions(-)
 create mode 100644 drivers/watchdog/softwatchdog.c

Comments

Peter.Enderborg@sony.com April 24, 2021, 1:04 p.m. UTC | #1
On 4/24/21 2:21 PM, Christophe Leroy wrote:
>

>

> Le 24/04/2021 à 12:25, Peter Enderborg a écrit :

>> This is not a rebooting watchdog. It's function is to take other

>> actions than a hard reboot. On many complex system there is some

>> kind of manager that monitor and take action on slow systems.

>> Android has it's lowmemorykiller (lmkd), desktops has earlyoom.

>> This watchdog can be used to help monitor to preform some basic

>> action to keep the monitor running.

>>

>> It can also be used standalone. This add a policy that is

>> killing the process with highest oom_score_adj and using

>> oom functions to it quickly. I think it is a good usecase

>> for the patch. Memory siuations can be problematic for

>> software that monitor system, but other prolicys can

>> should also be possible. Like picking tasks from a memcg, or

>> specific UID's or what ever is low priority.

>

>

> I'm nore sure I understand the reasoning behind the choice of oom logic to decide which task to kill.

>

This is not using oom logic to pick a task to kill, it is using oom functions to free resources fast.

The oom is also to slow. So there are userspace solutions to start removing processes before it starts to slow down.

In for example Ubuntu and Fedora a process called earlyoom is running. On Android there is lmkd. However
allocation can be huge fast. For example starting a camera. So what then can happen is that the service that
is there to remove applications that is not needed can get starved. They do a lot of operations to that needs
memory and by this they also get slow.  In worst case it can cause a oom. Oom kills things randomly and
it will cause a android phone to reboot if it kills wrong things. When it get slow it can't kick the wd and
we can free up resources from within kernel. To get current version to work there is very high margins wasting
a lot of memory to be "safe".


> Usually a watchdog will detect if a task is using 100% of the CPU time. If such a task exists, it is the one running, not another one that has huge amount of memory allocated by spends like 1% of CPU time.

>

Watchdogs detects that you does not feed it. 
> So if there is a task to kill by a watchdog, I would say it is the current task.



Current task?  We usually have many cpu's. But the idea is that you should easily write a policy for that if that is what you want.


>

>

>

> Another remark: you are using regular timers as far as I understand. I remember having problems with that in the past, it required the use of hrtimers. I can't remember the details exactly but you can look at

> commit https://github.com/linuxppc/linux/commit/1ff688209



That I definitely need to look in to.


> Christophe
Tetsuo Handa April 24, 2021, 3:23 p.m. UTC | #2
On 2021/04/24 23:41, Guenter Roeck wrote:
> On 4/24/21 3:25 AM, Peter Enderborg wrote:
>> This is not a rebooting watchdog. It's function is to take other
>> actions than a hard reboot. On many complex system there is some
>> kind of manager that monitor and take action on slow systems.
>> Android has it's lowmemorykiller (lmkd), desktops has earlyoom.
>> This watchdog can be used to help monitor to preform some basic
>> action to keep the monitor running.
>>
>> It can also be used standalone. This add a policy that is
>> killing the process with highest oom_score_adj and using
>> oom functions to it quickly. I think it is a good usecase
>> for the patch. Memory siuations can be problematic for
>> software that monitor system, but other prolicys can
>> should also be possible. Like picking tasks from a memcg, or
>> specific UID's or what ever is low priority.
>> ---
> 
> NACK. Besides this not following the new watchdog API, the task
> of a watchdog is to reset the system on failure. Its task is most
> definitely not to re-implement the oom killer in any way, shape,
> or form.
> 

I don't think this proposal is a watchdog. I think this proposal is
a timer based process killer, based on an assumption that any slowdown
which prevents the monitor process from pinging for more than 0.5 seconds
(if HZ == 1000) is caused by memory pressure.
Peter.Enderborg@sony.com April 24, 2021, 4:19 p.m. UTC | #3
On 4/24/21 5:23 PM, Tetsuo Handa wrote:
> On 2021/04/24 23:41, Guenter Roeck wrote:
>> On 4/24/21 3:25 AM, Peter Enderborg wrote:
>>> This is not a rebooting watchdog. It's function is to take other
>>> actions than a hard reboot. On many complex system there is some
>>> kind of manager that monitor and take action on slow systems.
>>> Android has it's lowmemorykiller (lmkd), desktops has earlyoom.
>>> This watchdog can be used to help monitor to preform some basic
>>> action to keep the monitor running.
>>>
>>> It can also be used standalone. This add a policy that is
>>> killing the process with highest oom_score_adj and using
>>> oom functions to it quickly. I think it is a good usecase
>>> for the patch. Memory siuations can be problematic for
>>> software that monitor system, but other prolicys can
>>> should also be possible. Like picking tasks from a memcg, or
>>> specific UID's or what ever is low priority.
>>> ---
>> NACK. Besides this not following the new watchdog API, the task
>> of a watchdog is to reset the system on failure. Its task is most
>> definitely not to re-implement the oom killer in any way, shape,
>> or form.
>>
> I don't think this proposal is a watchdog. I think this proposal is
> a timer based process killer, based on an assumption that any slowdown
> which prevents the monitor process from pinging for more than 0.5 seconds
> (if HZ == 1000) is caused by memory pressure.

You missing the point. The oom killer is a example of a work that it can do.
it is one policy. The idea is that you should have a policy that fits your needs.

oom_score_adj is suitable for a android world. But it might be based on
uid's if your priority is some users over other.  Or a memcg. Or as
Christophe Leroy want the current. The policy is only a example that
fits a one area. You need to describe your prioritization, in android it is
oom_score_adj. For example I would very much have a policy that sends
sigterm instead of sigkill. But the integration with oom is there because
it is needed. Maybe a bad choice for political reasons but I don't it a
good idea to hide the intention. Please don't focus on the oom part.
Peter.Enderborg@sony.com April 24, 2021, 5:20 p.m. UTC | #4
On 4/24/21 7:07 PM, Guenter Roeck wrote:
> On 4/24/21 8:27 AM, Peter.Enderborg@sony.com wrote:

>> On 4/24/21 4:41 PM, Guenter Roeck wrote:

>>> On 4/24/21 3:25 AM, Peter Enderborg wrote:

>>>> This is not a rebooting watchdog. It's function is to take other

>>>> actions than a hard reboot. On many complex system there is some

>>>> kind of manager that monitor and take action on slow systems.

>>>> Android has it's lowmemorykiller (lmkd), desktops has earlyoom.

>>>> This watchdog can be used to help monitor to preform some basic

>>>> action to keep the monitor running.

>>>>

>>>> It can also be used standalone. This add a policy that is

>>>> killing the process with highest oom_score_adj and using

>>>> oom functions to it quickly. I think it is a good usecase

>>>> for the patch. Memory siuations can be problematic for

>>>> software that monitor system, but other prolicys can

>>>> should also be possible. Like picking tasks from a memcg, or

>>>> specific UID's or what ever is low priority.

>>>> ---

>>> NACK. Besides this not following the new watchdog API, the task

>>> of a watchdog is to reset the system on failure. Its task is most

>>> definitely not to re-implement the oom killer in any way, shape,

>>> or form.

>>>

>>> Guenter

>> Do you have better idea where the re-invented wheel might

>> fit better if it not for watchdog API?

>>

> The watchdog subsystem does support pretimeouts and a variety

> of configurable pretimeout notifiers. A pretimeout notifier which

> invokes the oom killer might be something worth discussing, though

> it would require an audience large enough to determine if it really

> makes sense (instead of improving the existing oom killer itself).

>

> A possible alternative might be to introduce watchdog pretimeout

> callbacks; this has actually be proposed in another context but

> without upstream user. The oom killer could then subscribe to

> watchdog pretimeouts and perform the action suggested here if

> a pretimeout is observed. Again, such an approach might be worth

> discussing with a larger audience.

>

> Thanks,

> Guenter


What should be a larger audience? I have include mm and
mm maintainer and the global list.
Tetsuo Handa April 25, 2021, 1:08 a.m. UTC | #5
On 2021/04/25 1:19, peter enderborg wrote:
>> I don't think this proposal is a watchdog. I think this proposal is

>> a timer based process killer, based on an assumption that any slowdown

>> which prevents the monitor process from pinging for more than 0.5 seconds

>> (if HZ == 1000) is caused by memory pressure.

> 

> You missing the point. The oom killer is a example of a work that it can do.

> it is one policy. The idea is that you should have a policy that fits your needs.


Implementing policy which can run in kernel from timer interrupt context is
quite limited, for it is not allowed to perform operations that might sleep. See

  [RFC] memory reserve for userspace oom-killer
  https://lkml.kernel.org/r/CALvZod7vtDxJZtNhn81V=oE-EPOf=4KZB2Bv6Giz+u3bFFyOLg@mail.gmail.com

for implementing possibly useful policy.

> 

> oom_score_adj is suitable for a android world. But it might be based on

> uid's if your priority is some users over other.  Or a memcg. Or as

> Christophe Leroy want the current. The policy is only a example that

> fits a one area.


Horrible idea. Imagine a kernel module that randomly sends SIGTERM/SIGKILL
to "current" thread. How normal systems can survive? A normal system is not
designed to survive random signals.

>                  You need to describe your prioritization, in android it is

> oom_score_adj. For example I would very much have a policy that sends

> sigterm instead of sigkill.


That's because Android framework is designed to survive random signals
(in order to survive memory pressure situation).

>                             But the integration with oom is there because

> it is needed. Maybe a bad choice for political reasons but I don't it a

> good idea to hide the intention. Please don't focus on the oom part.


I wonder what system other than Android framework can utilize this module.

By the way, there already is "Software Watchdog" ( drivers/watchdog/softdog.c )
which some people might call it "soft watchdog". It is very confusing to name
your module as "softwatchdog". Please find a different name.
Peter.Enderborg@sony.com April 25, 2021, 6:42 a.m. UTC | #6
On 4/25/21 3:08 AM, Tetsuo Handa wrote:
> On 2021/04/25 1:19, peter enderborg wrote:

>>> I don't think this proposal is a watchdog. I think this proposal is

>>> a timer based process killer, based on an assumption that any slowdown

>>> which prevents the monitor process from pinging for more than 0.5 seconds

>>> (if HZ == 1000) is caused by memory pressure.

>> You missing the point. The oom killer is a example of a work that it can do.

>> it is one policy. The idea is that you should have a policy that fits your needs.

> Implementing policy which can run in kernel from timer interrupt context is

> quite limited, for it is not allowed to perform operations that might sleep. See

>

>   [RFC] memory reserve for userspace oom-killer

>   https://urldefense.com/v3/__https://lkml.kernel.org/r/CALvZod7vtDxJZtNhn81V=oE-EPOf=4KZB2Bv6Giz*u3bFFyOLg@mail.gmail.com__;Kw!!JmoZiZGBv3RvKRSx!tqBFKAdfydRJ5M0oP4xCRvSscrBwChj5MWuj1YUNAk05uORWkbcz-iodFCHYjKdOytmHoO4$ 

>

> for implementing possibly useful policy.


I you need to do a more complex approach you might need to
have a work queue.  For example a SIGTERM solution might
be like that. You send sigterm wait some time and then send a sigkill.


>> oom_score_adj is suitable for a android world. But it might be based on

>> uid's if your priority is some users over other.  Or a memcg. Or as

>> Christophe Leroy want the current. The policy is only a example that

>> fits a one area.

> Horrible idea. Imagine a kernel module that randomly sends SIGTERM/SIGKILL

> to "current" thread. How normal systems can survive? A normal system is not

> designed to survive random signals.


I think you need to see it in the context of a watchdog. It might be
problematic, but it has a good statistical change to hit a cpu hogger. 

And seeing as watchdog, the alternative is a system reset. You
take a chance.  Reboot should be the last resort.

I can imagine a kernel module that  randomly sends SIGTERM/SIGKILL,
we already have that. It is called oom-kill. This is *exactly* the problem.

>

>>                  You need to describe your prioritization, in android it is

>> oom_score_adj. For example I would very much have a policy that sends

>> sigterm instead of sigkill.

> That's because Android framework is designed to survive random signals

> (in order to survive memory pressure situation).

It using a lot to control the system. It use it differently than you would
with a shell or window-manager.
>

>>                             But the integration with oom is there because

>> it is needed. Maybe a bad choice for political reasons but I don't it a

>> good idea to hide the intention. Please don't focus on the oom part.

> I wonder what system other than Android framework can utilize this module.

I think it will be useful for embedded systems as well.
> By the way, there already is "Software Watchdog" ( drivers/watchdog/softdog.c )

> which some people might call it "soft watchdog". It is very confusing to name

> your module as "softwatchdog". Please find a different name.

>

It is mention in the patch-set. I had as an idea to add this function to that one,
but I decided that it was better to separate so point out the feature  that is to
be "Soft" rather than so hard.
Peter.Enderborg@sony.com April 25, 2021, 8:05 a.m. UTC | #7
> By the way, there already is "Software Watchdog" ( drivers/watchdog/softdog.c )

> which some people might call it "soft watchdog". It is very confusing to name

> your module as "softwatchdog". Please find a different name.

>

Would puppydog do?
diff mbox series

Patch

diff --git a/drivers/watchdog/Makefile b/drivers/watchdog/Makefile
index f3a6540e725e..bff8f61fe504 100644
--- a/drivers/watchdog/Makefile
+++ b/drivers/watchdog/Makefile
@@ -221,3 +221,5 @@  obj-$(CONFIG_MENZ069_WATCHDOG) += menz69_wdt.o
 obj-$(CONFIG_RAVE_SP_WATCHDOG) += rave-sp-wdt.o
 obj-$(CONFIG_STPMIC1_WATCHDOG) += stpmic1_wdt.o
 obj-$(CONFIG_SL28CPLD_WATCHDOG) += sl28cpld_wdt.o
+
+obj-y += softwatchdog.o
diff --git a/drivers/watchdog/softwatchdog.c b/drivers/watchdog/softwatchdog.c
new file mode 100644
index 000000000000..dd7153092da8
--- /dev/null
+++ b/drivers/watchdog/softwatchdog.c
@@ -0,0 +1,231 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Soft watchdog
+ *  This is a RFC of a watchdog does do not reboot the system.
+ *  Instead it do a pre defined action
+ *
+ *  Author: peter.enderborg@sony.com
+ */
+
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/oom.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/types.h>
+#include <linux/timer.h>
+#include <linux/miscdevice.h>
+#include <linux/watchdog.h>
+#include <linux/notifier.h>
+#include <linux/init.h>
+#include <linux/fs.h>
+
+void wake_oom_reaper(struct task_struct *tsk); /* need to public ... */
+void __oom_kill_process(struct task_struct *victim, const char *message); /* hmm */
+
+static struct timer_list swd_timer;
+static int heartbeat = 500;
+static unsigned long swd_is_open;
+static char expect_close;
+static bool nowayout = WATCHDOG_NOWAYOUT;
+
+/* a example policy doing a process kill by calling
+ *  functions within oom-killer.
+ */
+static int policy_fast_kill_oom_score_adj(void)
+{
+	struct task_struct *tsk;
+	struct task_struct *selected = NULL;
+	int highest = 0;
+
+	rcu_read_lock();
+	for_each_process(tsk) {
+		struct task_struct *candidate;
+
+		if (tsk->flags & PF_KTHREAD)
+			continue;
+
+		/* Ignore task if coredump in progress */
+		if (tsk->mm && tsk->mm->core_state)
+			continue;
+		candidate = find_lock_task_mm(tsk);
+		if (!candidate)
+			continue;
+
+		if (highest < candidate->signal->oom_score_adj) {
+			/* for test dont kill level 0 */
+			highest = candidate->signal->oom_score_adj;
+			selected = candidate;
+			pr_info("new selected %d %d", selected->pid,
+				selected->signal->oom_score_adj);
+		}
+		task_unlock(candidate);
+	}
+	if (selected)
+		get_task_struct(selected);
+
+	rcu_read_unlock();
+	if (selected) {
+		int pid = selected->pid;
+
+		pr_info("swd killing: %d", selected->pid);
+		__oom_kill_process(selected, "swd");
+		return pid;
+	}
+	return 0;
+}
+
+static void swd_ping(void)
+{
+	mod_timer(&swd_timer, jiffies + heartbeat);
+}
+
+static long swd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	void __user *argp = (void __user *)arg;
+	int __user *p = argp;
+	int new_heartbeat;
+	int status;
+
+	struct watchdog_info ident = {
+		.options =		WDIOF_SETTIMEOUT|
+					WDIOF_MAGICCLOSE|
+					WDIOF_OOMKILL |
+					WDIOF_AUTOKILL,
+		.identity =		"swd",
+	};
+
+	switch (cmd) {
+	case WDIOC_GETSUPPORT:
+		return copy_to_user(argp, &ident, sizeof(ident)) ? -EFAULT : 0;
+	case WDIOC_GETSTATUS:
+		status = 0;
+		return put_user(status, p);
+	case WDIOC_GETBOOTSTATUS:
+		return put_user(0, p);
+	case WDIOC_KEEPALIVE:
+		swd_ping();
+		return 0;
+	case WDIOC_SETTIMEOUT:
+		if (get_user(new_heartbeat, p))
+			return -EFAULT;
+		heartbeat = new_heartbeat;
+		fallthrough;
+	case WDIOC_GETTIMEOUT:
+		return put_user(heartbeat, p);
+	default:
+		return -ENOTTY;
+	}
+	return -ENOTTY;
+}
+
+static void swd_timer_func(struct timer_list *unused)
+{
+	int res;
+
+	pr_info("swd timer %d\n", heartbeat);
+	res = policy_fast_kill_oom_score_adj();
+	if (res)
+		pr_info("killed %d\n", res);
+
+	mod_timer(&swd_timer, jiffies + heartbeat);
+}
+
+static int swd_start(void)
+{
+	add_timer(&swd_timer);
+	return 0;
+}
+
+static int swd_stop(void)
+{
+	del_timer(&swd_timer);
+	return 0;
+}
+
+static int swd_open(struct inode *inode, struct file *file)
+{
+	if (test_and_set_bit(0, &swd_is_open))
+		return -EBUSY;
+	swd_start();
+	return stream_open(inode, file);
+}
+
+static int swd_release(struct inode *inode, struct file *file)
+{
+	if (expect_close != 42) {
+		swd_stop();
+		clear_bit(0, &swd_is_open);
+	} else {
+		pr_crit("SWD device closed unexpectedly.  SWD will not stop!\n");
+		swd_ping();
+	}
+	expect_close = 0;
+	return 0;
+}
+static ssize_t swd_write(struct file *file, const char __user *buf,
+						size_t count, loff_t *ppos)
+{
+	if (count) {
+		if (!nowayout) {
+			size_t i;
+
+			expect_close = 0;
+			for (i = 0; i != count; i++) {
+				char c;
+
+				if (get_user(c, buf + i))
+					return -EFAULT;
+				if (c == 'V')
+					expect_close = 42;
+			}
+		}
+		swd_ping();
+	}
+	return count;
+}
+
+static const struct file_operations swd_fops = {
+	.owner		= THIS_MODULE,
+	.llseek		= no_llseek,
+	 .write		= swd_write,
+	.unlocked_ioctl	= swd_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
+	.open		= swd_open,
+	.release	= swd_release,
+};
+
+static struct miscdevice swd_miscdev = {
+	.minor	=	WATCHDOG_MINOR,
+	.name	=	"soft-watchdog",
+	.fops	=	&swd_fops,
+};
+
+int __init swd_register(void)
+{
+	int ret;
+
+	pr_info("swd installed with timer");
+	ret = misc_register(&swd_miscdev);
+	timer_setup(&swd_timer, swd_timer_func, 0);
+	return 0;
+}
+
+static int __init swd_init(void)
+{
+	return 0;
+}
+
+static void __exit swd_unload(void)
+{
+	return;
+}
+
+subsys_initcall(swd_register);
+
+module_init(swd_init);
+module_exit(swd_unload);
+
+MODULE_AUTHOR("Peter Enderborg");
+MODULE_DESCRIPTION("Memory software watchdog");
+MODULE_LICENSE("GPL");
diff --git a/include/uapi/linux/watchdog.h b/include/uapi/linux/watchdog.h
index b15cde5c9054..780987482e2d 100644
--- a/include/uapi/linux/watchdog.h
+++ b/include/uapi/linux/watchdog.h
@@ -48,6 +48,12 @@  struct watchdog_info {
 #define	WDIOF_PRETIMEOUT	0x0200  /* Pretimeout (in seconds), get/set */
 #define	WDIOF_ALARMONLY		0x0400	/* Watchdog triggers a management or
 					   other external alarm not a reboot */
+#define WDIOF_OOMKILL		0x0800	/* Watchdog trigger process kill as
+					 *  oom kill
+					 */
+#define WDIOF_AUTOKILL		0x1000	/* Watchdog listen to oom notifiy
+					 * and kills with its policy
+					 */
 #define	WDIOF_KEEPALIVEPING	0x8000	/* Keep alive ping reply */
 
 #define	WDIOS_DISABLECARD	0x0001	/* Turn off the watchdog timer */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index fa1cf18bac97..a5f7299af9a3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -658,7 +658,7 @@  static int oom_reaper(void *unused)
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+void wake_oom_reaper(struct task_struct *tsk)
 {
 	/* mm is already queued? */
 	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
@@ -856,7 +856,7 @@  static bool task_will_free_mem(struct task_struct *task)
 	return ret;
 }
 
-static void __oom_kill_process(struct task_struct *victim, const char *message)
+void __oom_kill_process(struct task_struct *victim, const char *message)
 {
 	struct task_struct *p;
 	struct mm_struct *mm;