diff mbox

[RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

Message ID 359c926bc85cdf79650e39f2344c2083002545bb.1427347966.git.viresh.kumar@linaro.org
State New
Headers show

Commit Message

Viresh Kumar March 26, 2015, 5:39 a.m. UTC
A delayed work to schedule vmstat_shepherd() is queued at periodic intervals for
internal working of vmstat core. This work and its timer end up waking an idle
cpu sometimes, as this always stays on CPU0.

Because we re-queue the work from its handler, idle_cpu() returns false and so
the timer (used by delayed work) never migrates to any other CPU.

This may not be the desired behavior always as waking up an idle CPU to queue
work on few other CPUs isn't good from power-consumption point of view.

In order to avoid waking up an idle core, we can replace schedule_delayed_work()
with a normal work plus a separate timer. The timer handler will then queue the
work after re-arming the timer. If the CPU was idle before the timer fired,
idle_cpu() will mostly return true and the next timer shall be migrated to a
non-idle CPU.

But the timer core has a limitation, when the timer is re-armed from its
handler, timer core disables migration of that timer to other cores. Details of
that limitation are present in kernel/time/timer.c:__mod_timer() routine.

Another simple yet effective solution can be to keep two timers with same
handler and keep toggling between them, so that the above limitation doesn't
hold true anymore.

This patch replaces schedule_delayed_work() with schedule_work() plus two
timers. After this, it was seen that the timer and its do get migrated to other
non-idle CPUs, when the local cpu is idle.

Tested-by: Vinayak Menon <vinmenon@codeaurora.org>
Tested-by: Shiraz Hashim <shashim@codeaurora.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
This patch isn't sent to say its the best way forward, but to get a discussion
started on the same.

 mm/vmstat.c | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)

Comments

Viresh Kumar March 27, 2015, 4:49 a.m. UTC | #1
On 27 March 2015 at 01:48, Andrew Morton <akpm@linux-foundation.org> wrote:
> Shouldn't this be viewed as a shortcoming of the core timer code?

Yeah, it is. Some (not so pretty) solutions were tried earlier to fix that, but
they are rejected for obviously reasons [1].

> vmstat_shepherd() is merely rescheduling itself with
> schedule_delayed_work().  That's a dead bog simple operation and if
> it's producing suboptimal behaviour then we shouldn't be fixing it with
> elaborate workarounds in the caller?

I understand that, and that's why I sent it as an RFC to get the discussion
started. Does anyone else have got another (acceptable) idea to get this
resolved ?

--
viresh

[1] http://lists.linaro.org/pipermail/linaro-kernel/2013-November/008866.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Viresh Kumar March 28, 2015, 4:18 a.m. UTC | #2
On Fri, Mar 27, 2015 at 3:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Mar 27, 2015 at 10:16:13AM +0100, Peter Zijlstra wrote:

>> So the issue seems to be that we need base->running_timer in order to
>> tell if a callback is running, right?
>>
>> We could align the base on 8 bytes to gain an extra bit in the pointer
>> and use that bit to indicate the running state. Then these sites can
>> spin on that bit while we can change the actual base pointer.
>
> Even though tvec_base has ____cacheline_aligned stuck on, most are
> allocated using kzalloc_node() which does not actually respect that but
> already guarantees a minimum u64 alignment, so I think we can use that
> third bit without too much magic.

Right. So what I tried earlier was very much similar to you are suggesting.
The only difference was that I haven't made much attempts towards
saving memory.

But Thomas didn't like it for sure and I believe that Rant will hold true for
what you are suggesting as well, isn't it ?

http://lists.linaro.org/pipermail/linaro-kernel/2013-November/008866.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Viresh Kumar March 28, 2015, 4:28 a.m. UTC | #3
On 27 March 2015 at 17:32, Peter Zijlstra <peterz@infradead.org> wrote:
> What's not clear to me is why that thing is allocated at all, AFAICT
> something like:
>
> static DEFINE_PER_CPU(struct tvec_base, tvec_bases);
>
> Should do the right thing and be much simpler.

Does this comment from timers.c answers your query ?

                        /*
                         * This is for the boot CPU - we use compile-time
                         * static initialisation because per-cpu memory isn't
                         * ready yet and because the memory allocators are not
                         * initialised either.
                         */
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Viresh Kumar March 28, 2015, 4:34 a.m. UTC | #4
On 27 March 2015 at 19:49, Michal Hocko <mhocko@suse.cz> wrote:

> Wouldn't something like I was suggesting few months back
> (http://article.gmane.org/gmane.linux.kernel.mm/127569) solve this
> problem as well? Scheduler should be idle aware, no? I mean it shouldn't
> wake up an idle CPU if the task might run on another one.

Probably yes. Lets see what others have to say about it..
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
diff mbox

Patch

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4f5cd974e11a..d45e4243a046 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1424,8 +1424,18 @@  static bool need_update(int cpu)
  * inactivity.
  */
 static void vmstat_shepherd(struct work_struct *w);
+static DECLARE_WORK(shepherd, vmstat_shepherd);
 
-static DECLARE_DELAYED_WORK(shepherd, vmstat_shepherd);
+/*
+ * Two timers are used here to avoid waking up an idle CPU. If a single timer is
+ * kept, then re-arming the timer from its handler doesn't let us migrate it.
+ */
+static struct timer_list shepherd_timer[2];
+#define toggle_timer() (shepherd_timer_index = !shepherd_timer_index,	\
+			&shepherd_timer[shepherd_timer_index])
+
+static void vmstat_shepherd_timer(unsigned long data);
+static int shepherd_timer_index;
 
 static void vmstat_shepherd(struct work_struct *w)
 {
@@ -1441,15 +1451,19 @@  static void vmstat_shepherd(struct work_struct *w)
 				&per_cpu(vmstat_work, cpu), 0);
 
 	put_online_cpus();
+}
 
-	schedule_delayed_work(&shepherd,
-		round_jiffies_relative(sysctl_stat_interval));
+static void vmstat_shepherd_timer(unsigned long data)
+{
+	mod_timer(toggle_timer(),
+		  jiffies + round_jiffies_relative(sysctl_stat_interval));
+	schedule_work(&shepherd);
 
 }
 
 static void __init start_shepherd_timer(void)
 {
-	int cpu;
+	int cpu, i = -1;
 
 	for_each_possible_cpu(cpu)
 		INIT_DELAYED_WORK(per_cpu_ptr(&vmstat_work, cpu),
@@ -1459,8 +1473,13 @@  static void __init start_shepherd_timer(void)
 		BUG();
 	cpumask_copy(cpu_stat_off, cpu_online_mask);
 
-	schedule_delayed_work(&shepherd,
-		round_jiffies_relative(sysctl_stat_interval));
+	while (++i < 2) {
+		init_timer(&shepherd_timer[i]);
+		shepherd_timer[i].function = vmstat_shepherd_timer;
+	};
+
+	mod_timer(toggle_timer(),
+		  jiffies + round_jiffies_relative(sysctl_stat_interval));
 }
 
 static void vmstat_cpu_dead(int node)