
[V2] clocksource: register cpu notifier to remove timer from dying CPU

Message ID f13134fea7e7658ae3dab90faca1e6578b8f82e7.1397017662.git.viresh.kumar@linaro.org
State New

Commit Message

Viresh Kumar April 9, 2014, 4:34 a.m. UTC
The clocksource core uses add_timer_on() to run clocksource_watchdog() on all
CPUs one by one. But when a core is brought down, the clocksource core doesn't
remove this timer from the dying CPU, and in that case the timer core gives the
warning below. (It gives this only with unmerged code; however, current mainline
is also wrong here, as the timer core migrates a pinned timer to another CPU:
http://www.gossamer-threads.com/lists/linux/kernel/1898117)

migrate_timer_list: can't migrate pinned timer: ffffffff81f06a60,
timer->function: ffffffff810d7010,deactivating it Modules linked in:

CPU: 0 PID: 1932 Comm: 01-cpu-hotplug Not tainted 3.14.0-rc1-00088-gab3c4fd #4
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 0000000000000009 ffff88001d407c38 ffffffff817237bd ffff88001d407c80
 ffff88001d407c70 ffffffff8106a1dd 0000000000000010 ffffffff81f06a60
 ffff88001e04d040 ffffffff81e3d4c0 ffff88001e04d030 ffff88001d407cd0
Call Trace:
 [<ffffffff817237bd>] dump_stack+0x4d/0x66
 [<ffffffff8106a1dd>] warn_slowpath_common+0x7d/0xa0
 [<ffffffff8106a24c>] warn_slowpath_fmt+0x4c/0x50
 [<ffffffff810761c3>] ? __internal_add_timer+0x113/0x130
 [<ffffffff810d7010>] ? clocksource_watchdog_kthread+0x40/0x40
 [<ffffffff8107753b>] migrate_timer_list+0xdb/0xf0
 [<ffffffff810782dc>] timer_cpu_notify+0xfc/0x1f0
 [<ffffffff8173046c>] notifier_call_chain+0x4c/0x70
 [<ffffffff8109340e>] __raw_notifier_call_chain+0xe/0x10
 [<ffffffff8106a3f3>] cpu_notify+0x23/0x50
 [<ffffffff8106a44e>] cpu_notify_nofail+0xe/0x20
 [<ffffffff81712a5d>] _cpu_down+0x1ad/0x2e0
 [<ffffffff81712bc4>] cpu_down+0x34/0x50
 [<ffffffff813fec54>] cpu_subsys_offline+0x14/0x20
 [<ffffffff813f9f65>] device_offline+0x95/0xc0
 [<ffffffff813fa060>] online_store+0x40/0x90
 [<ffffffff813f75d8>] dev_attr_store+0x18/0x30
 [<ffffffff8123309d>] sysfs_kf_write+0x3d/0x50

This patch fixes this by registering a cpu notifier from the clocksource core,
but only while the clocksource watchdog is running. If the CPU_DEAD notification
reports that the dying CPU is the one the watchdog timer is queued on, the timer
is removed from that CPU and queued on the next online CPU.

Reported-and-tested-by: Jet Chen <jet.chen@intel.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
V1->V2:
- Moved 'static int timer_cpu' within #ifdef CONFIG_CLOCKSOURCE_WATCHDOG/endif
- replaced spin_lock with spin_lock_irqsave in clocksource_cpu_notify(), as Jet
  Chen reported a bug with that.
- Tested again by Jet Chen (Thanks again :))

 kernel/time/clocksource.c | 65 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 54 insertions(+), 11 deletions(-)

Comments

Thomas Gleixner May 7, 2014, 10:08 a.m. UTC | #1
On Wed, 9 Apr 2014, Viresh Kumar wrote:
> This patch fixes this by registering a cpu notifier from the clocksource core,
> but only while the clocksource watchdog is running. If the CPU_DEAD notification
> reports that the dying CPU is the one the watchdog timer is queued on, the timer
> is removed from that CPU and queued on the next online CPU.

Gah, no. We really don't want more notifier crap.
  
It's perfectly fine for the watchdog timer to be moved around on cpu
down. And the timer itself is not pinned at all. add_timer_on() does
not set the pinned bit.

Thanks,

	tglx
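
For reference, the distinction being drawn here looks roughly as follows in the
v3.14-era timer code; this is a paraphrased sketch of kernel/timer.c with
details elided, not the exact source:

/*
 * Paraphrased sketch (v3.14 era): mod_timer_pinned() passes TIMER_PINNED
 * down to __mod_timer(), and that flag only influences which CPU is picked
 * when the timer is (re)armed -- it is not stored on the timer.
 */
int mod_timer_pinned(struct timer_list *timer, unsigned long expires)
{
	if (timer->expires == expires && timer_pending(timer))
		return 1;

	return __mod_timer(timer, expires, false, TIMER_PINNED);
}

/*
 * Paraphrased sketch: add_timer_on() queues the timer directly on @cpu's
 * per-CPU base and records no "pinned" state on the timer itself, so when
 * that CPU later goes down, nothing marks this timer as one that wanted to
 * stay where it was armed.
 */
void add_timer_on(struct timer_list *timer, int cpu)
{
	struct tvec_base *base = per_cpu(tvec_bases, cpu);
	unsigned long flags;

	spin_lock_irqsave(&base->lock, flags);
	timer_set_base(timer, base);
	internal_add_timer(base, timer);
	/* (debug, statistics and nohz-wakeup details elided) */
	spin_unlock_irqrestore(&base->lock, flags);
}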


Viresh Kumar May 7, 2014, 10:36 a.m. UTC | #2
On 7 May 2014 15:38, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Wed, 9 Apr 2014, Viresh Kumar wrote:
>> This patch fixes this by registering a cpu notifier from the clocksource core,
>> but only while the clocksource watchdog is running. If the CPU_DEAD notification
>> reports that the dying CPU is the one the watchdog timer is queued on, the timer
>> is removed from that CPU and queued on the next online CPU.
>
> Gah, no. We really don't want more notifier crap.

Agreed, and could have used the generic ones, probably.

> It's perfectly fine for the watchdog timer to be moved around on cpu
> down.

Functionally? Yes. The handler doesn't have any CPU-specific work to
do here, so queuing it on any CPU is fine.

> And the timer itself is not pinned at all. add_timer_on() does
> not set the pinned bit.

The perception I had is this:
- mod_timer() is a more complicated form of add_timer(), as it also has
to handle migration and removal of timers. Otherwise they should work
in a similar way.
- There is no PINNED bit which can be set; it's just a parameter to
__mod_timer() that decides which CPU the timer should fire on.
- And by the name 'add_timer_on()', we must guarantee that the timer
fires on the CPU it's being added to, otherwise it may break things
for many users. There might be users who want to run the handler
on a particular CPU due to some CPU-specific work they need to do,
and have used add_timer_on()...

But looking at your reply, it seems that add_timer_on() doesn't
guarantee that the timer will fire on the CPU mentioned? Is that the
case for mod_timer_pinned() as well?

And if that's the case, what should we do with these timers
(i.e. the ones added with add_timer_on() or mod_timer_pinned())
when we try to quiesce a CPU using cpuset.quiesce [1]?

--
viresh

[1] https://lkml.org/lkml/2014/4/4/99
Viresh Kumar May 7, 2014, 1:20 p.m. UTC | #3
On 7 May 2014 16:06, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> And if that's the case, what should we do with these timers
> (i.e. the ones added with add_timer_on() or mod_timer_pinned())
> when we try to quiesce a CPU using cpuset.quiesce [1]?

Okay, I thought about it again and the above looks stupid :) .. During isolation
we can't migrate any pinned timers, so those will stay as they are.
But we shouldn't change the code which migrates away pinned timers
on CPU down (which I had changed in my initial patchset).

Probably just add a pr_warn() there and mention that we are migrating
a pinned timer. That's it.

--
viresh
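
A minimal sketch of that pr_warn() idea, assuming the cpuset.quiesce series
provides some way to tell that a timer asked for a specific CPU; mainline
v3.14 records no such bit (which is Thomas's point above), so timer_is_pinned()
below is a purely hypothetical helper, and the surrounding loop is paraphrased
from migrate_timer_list() in kernel/timer.c:

/* Hypothetical sketch only -- not the actual proposal or mainline code. */
static void migrate_timer_list(struct tvec_base *new_base,
			       struct list_head *head)
{
	struct timer_list *timer;

	while (!list_empty(head)) {
		timer = list_first_entry(head, struct timer_list, entry);

		/*
		 * timer_is_pinned() is a stand-in for whatever marking the
		 * quiesce series would add; it does not exist in mainline.
		 */
		if (timer_is_pinned(timer))
			pr_warn("%s: migrating pinned timer %p (%pf) away from a dying CPU\n",
				__func__, timer, timer->function);

		detach_timer(timer, false);
		timer_set_base(timer, new_base);
		internal_add_timer(new_base, timer);
	}
}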
Thomas Gleixner May 7, 2014, 7:52 p.m. UTC | #4
On Wed, 7 May 2014, Viresh Kumar wrote:
> On 7 May 2014 15:38, Thomas Gleixner <tglx@linutronix.de> wrote:
> > And the timer itself is not pinned at all. add_timer_on() does
> > not set the pinned bit.
> 
> The perception I had is this:
> - mod_timer() is a more complicated form of add_timer(), as it also has
> to handle migration and removal of timers. Otherwise they should work
> in a similar way.
> - There is no PINNED bit which can be set; it's just a parameter to
> __mod_timer() that decides which CPU the timer should fire on.
> - And by the name 'add_timer_on()', we must guarantee that the timer
> fires on the CPU it's being added to, otherwise it may break things
> for many users. There might be users who want to run the handler
> on a particular CPU due to some CPU-specific work they need to do,
> and have used add_timer_on()...
> 
> But looking at your reply, it seems that add_timer_on() doesn't
> guarantee that the timer will fire on the CPU mentioned? Is that the
> case for mod_timer_pinned() as well?
> 
> And if that's the case, what should we do with these timers
> (i.e. the ones added with add_timer_on() or mod_timer_pinned())
> when we try to quiesce a CPU using cpuset.quiesce [1]?

There is no general rule for that. The timers which are added to be per-CPU
are the critical ones. But there are lots of other use cases, like the
watchdog, which do not care on which CPU they actually fire; they just
prefer to fire on the one they were armed on.

We have no way to distinguish the two right now, and I still need to
find a few free cycles to finish the design of the timer_list
replacement. I'll keep that in mind.

Thanks,

	tglx

Patch

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index ba3e502..d288f1f 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -23,10 +23,12 @@ 
  *   o Allow clocksource drivers to be unregistered
  */
 
+#include <linux/cpu.h>
 #include <linux/device.h>
 #include <linux/clocksource.h>
 #include <linux/init.h>
 #include <linux/module.h>
+#include <linux/notifier.h>
 #include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
 #include <linux/tick.h>
 #include <linux/kthread.h>
@@ -180,6 +182,9 @@  static char override_name[CS_NAME_LEN];
 static int finished_booting;
 
 #ifdef CONFIG_CLOCKSOURCE_WATCHDOG
+/* Tracks current CPU to queue watchdog timer on */
+static int timer_cpu;
+
 static void clocksource_watchdog_work(struct work_struct *work);
 static void clocksource_select(void);
 
@@ -246,12 +251,25 @@  void clocksource_mark_unstable(struct clocksource *cs)
 	spin_unlock_irqrestore(&watchdog_lock, flags);
 }
 
+static void queue_timer_on_next_cpu(void)
+{
+	/*
+	 * Cycle through CPUs to check if the CPUs stay synchronized to each
+	 * other.
+	 */
+	timer_cpu = cpumask_next(timer_cpu, cpu_online_mask);
+	if (timer_cpu >= nr_cpu_ids)
+		timer_cpu = cpumask_first(cpu_online_mask);
+	watchdog_timer.expires = jiffies + WATCHDOG_INTERVAL;
+	add_timer_on(&watchdog_timer, timer_cpu);
+}
+
 static void clocksource_watchdog(unsigned long data)
 {
 	struct clocksource *cs;
 	cycle_t csnow, wdnow;
 	int64_t wd_nsec, cs_nsec;
-	int next_cpu, reset_pending;
+	int reset_pending;
 
 	spin_lock(&watchdog_lock);
 	if (!watchdog_running)
@@ -336,27 +354,51 @@  static void clocksource_watchdog(unsigned long data)
 	if (reset_pending)
 		atomic_dec(&watchdog_reset_pending);
 
-	/*
-	 * Cycle through CPUs to check if the CPUs stay synchronized
-	 * to each other.
-	 */
-	next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
-	if (next_cpu >= nr_cpu_ids)
-		next_cpu = cpumask_first(cpu_online_mask);
-	watchdog_timer.expires += WATCHDOG_INTERVAL;
-	add_timer_on(&watchdog_timer, next_cpu);
+	queue_timer_on_next_cpu();
 out:
 	spin_unlock(&watchdog_lock);
 }
 
+static int clocksource_cpu_notify(struct notifier_block *self,
+				unsigned long action, void *hcpu)
+{
+	long cpu = (long)hcpu;
+	unsigned long flags;
+
+	spin_lock_irqsave(&watchdog_lock, flags);
+	if (!watchdog_running)
+		goto notify_out;
+
+	switch (action) {
+	case CPU_DEAD:
+	case CPU_DEAD_FROZEN:
+		if (cpu != timer_cpu)
+			break;
+		del_timer(&watchdog_timer);
+		queue_timer_on_next_cpu();
+		break;
+	}
+
+notify_out:
+	spin_unlock_irqrestore(&watchdog_lock, flags);
+	return NOTIFY_OK;
+}
+
+static struct notifier_block clocksource_nb = {
+	.notifier_call	= clocksource_cpu_notify,
+	.priority = 1,
+};
+
 static inline void clocksource_start_watchdog(void)
 {
 	if (watchdog_running || !watchdog || list_empty(&watchdog_list))
 		return;
+	timer_cpu = cpumask_first(cpu_online_mask);
+	register_cpu_notifier(&clocksource_nb);
 	init_timer(&watchdog_timer);
 	watchdog_timer.function = clocksource_watchdog;
 	watchdog_timer.expires = jiffies + WATCHDOG_INTERVAL;
-	add_timer_on(&watchdog_timer, cpumask_first(cpu_online_mask));
+	add_timer_on(&watchdog_timer, timer_cpu);
 	watchdog_running = 1;
 }
 
@@ -365,6 +407,7 @@  static inline void clocksource_stop_watchdog(void)
 	if (!watchdog_running || (watchdog && !list_empty(&watchdog_list)))
 		return;
 	del_timer(&watchdog_timer);
+	unregister_cpu_notifier(&clocksource_nb);
 	watchdog_running = 0;
 }