[V2,Resend,3/4] workqueue: Schedule work on non-idle cpu instead of current one

Message ID ad25c731a6ca1bd1269555245952f05c856a9759.1352196505.git.viresh.kumar@linaro.org
State Accepted
Headers show

Commit Message

Viresh Kumar Nov. 6, 2012, 10:38 a.m.
Workqueues queues work on current cpu, if the caller haven't passed a preferred
cpu. This may wake up an idle CPU, which is actually not required.

This work can be processed by any CPU and so we must select a non-idle CPU here.
This patch adds in support in workqueue framework to get preferred CPU details
from the scheduler, instead of using current CPU.

Most of the time when a work is queued, the current cpu isn't idle and so we
will choose it only. There are cases when a cpu is idle when it queues some
work. For example, consider following scenario:
- A cpu has programmed a timer and is IDLE now.
- CPU gets into interrupt handler due to timer and queues a work. As the CPU is
  currently IDLE, we can queue this work to some other CPU.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 kernel/workqueue.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Tejun Heo Nov. 26, 2012, 5:15 p.m. | #1
Hello, Viresh.

On Tue, Nov 06, 2012 at 04:08:45PM +0530, Viresh Kumar wrote:
> Workqueues queues work on current cpu, if the caller haven't passed a preferred
> cpu. This may wake up an idle CPU, which is actually not required.
> 
> This work can be processed by any CPU and so we must select a non-idle CPU here.
> This patch adds in support in workqueue framework to get preferred CPU details
> from the scheduler, instead of using current CPU.
> 
> Most of the time when a work is queued, the current cpu isn't idle and so we
> will choose it only. There are cases when a cpu is idle when it queues some
> work. For example, consider following scenario:
> - A cpu has programmed a timer and is IDLE now.
> - CPU gets into interrupt handler due to timer and queues a work. As the CPU is
>   currently IDLE, we can queue this work to some other CPU.

I'm pretty skeptical about this.  queue_work() w/o explicit CPU
assignment has always guaranteed that the work item will be executed
on the same CPU.  I don't think there are too many users depending on
that but am not sure at all that there are none.  I asked you last
time that you would at the very least need to audit most users but it
doesn't seem like there has been any effort there.  Given the
seemingly low rate of migration and subtlety of such bugs, things
could get nasty to debug.

That said, if the obtained benefit is big enough, sure, we can
definitely change the behavior, which isn't all that great to begin
with, and try to shake out the bugs quicky by e.g. forcing all work
items to execute on different CPUs, but you're presenting a few
percent of work items being migrated to a different CPU from an
already active CPU, which doesn't seem like such a big benefit to me
even if the migration target CPU is somewhat more efficient.  How much
powersaving are we talking about?

Thanks.
Viresh Kumar Nov. 27, 2012, 5:19 a.m. | #2
Hi Tejun,

On 26 November 2012 22:45, Tejun Heo <tj@kernel.org> wrote:
> On Tue, Nov 06, 2012 at 04:08:45PM +0530, Viresh Kumar wrote:

> I'm pretty skeptical about this.  queue_work() w/o explicit CPU
> assignment has always guaranteed that the work item will be executed
> on the same CPU.  I don't think there are too many users depending on
> that but am not sure at all that there are none.  I asked you last
> time that you would at the very least need to audit most users but it
> doesn't seem like there has been any effort there.

My bad. I completely missed/forgot that comment from your earlier mails.
Will do it.

> That said, if the obtained benefit is big enough, sure, we can
> definitely change the behavior, which isn't all that great to begin
> with, and try to shake out the bugs quicky by e.g. forcing all work
> items to execute on different CPUs, but you're presenting a few
> percent of work items being migrated to a different CPU from an
> already active CPU, which doesn't seem like such a big benefit to me
> even if the migration target CPU is somewhat more efficient.  How much
> powersaving are we talking about?

Hmm.. I actually implemented the problem discussed here:
(I know you have seen this earlier :) )

http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/08/lpc2012-sched-timer-workqueue.pdf

Specifically slides: 12 & 19.

I haven't done much power calculations with it and have tested it more from
functionality point of view.

@Vincent: Can you add some comments here?

--
viresh
Vincent Guittot Nov. 27, 2012, 12:54 p.m. | #3
On 27 November 2012 06:19, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> Hi Tejun,
>
> On 26 November 2012 22:45, Tejun Heo <tj@kernel.org> wrote:
>> On Tue, Nov 06, 2012 at 04:08:45PM +0530, Viresh Kumar wrote:
>
>> I'm pretty skeptical about this.  queue_work() w/o explicit CPU
>> assignment has always guaranteed that the work item will be executed
>> on the same CPU.  I don't think there are too many users depending on
>> that but am not sure at all that there are none.  I asked you last
>> time that you would at the very least need to audit most users but it
>> doesn't seem like there has been any effort there.
>
> My bad. I completely missed/forgot that comment from your earlier mails.
> Will do it.
>
>> That said, if the obtained benefit is big enough, sure, we can
>> definitely change the behavior, which isn't all that great to begin
>> with, and try to shake out the bugs quicky by e.g. forcing all work
>> items to execute on different CPUs, but you're presenting a few
>> percent of work items being migrated to a different CPU from an
>> already active CPU, which doesn't seem like such a big benefit to me
>> even if the migration target CPU is somewhat more efficient.  How much
>> powersaving are we talking about?
>
> Hmm.. I actually implemented the problem discussed here:
> (I know you have seen this earlier :) )
>
> http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/08/lpc2012-sched-timer-workqueue.pdf
>
> Specifically slides: 12 & 19.
>
> I haven't done much power calculations with it and have tested it more from
> functionality point of view.
>
> @Vincent: Can you add some comments here?

Sorry for this late reply.

We have faced some situations on TC2 (as an example) where the tasks
are running in the LITTLE cluster whereas some periodic works stay on
the big cluster so we can have one cluster that wakes up for tasks and
another one that wakes up for work. We would like to consolidate the
behaviour of the work with the tasks behaviour.

Sorry, I don't have relevant figures as the patches are used with
other ones which also impact the power consumption.

This series introduces the possibility to run a work on another CPU
which is necessary if we want a better correlation of task and work
scheduling on the system. Most of the time the queue_work is used when
a driver don't mind the CPU on which you want to run whereas it looks
like it should be used only if you want to run locally. We would like
to solve this point with the new interface that is proposed by viresh

Vincent

>
> --
> viresh
Steven Rostedt Nov. 27, 2012, 1:26 p.m. | #4
On Tue, 2012-11-06 at 16:08 +0530, Viresh Kumar wrote:
> Workqueues queues work on current cpu, if the caller haven't passed a preferred
> cpu. This may wake up an idle CPU, which is actually not required.
> 
> This work can be processed by any CPU and so we must select a non-idle CPU here.
> This patch adds in support in workqueue framework to get preferred CPU details
> from the scheduler, instead of using current CPU.
> 
> Most of the time when a work is queued, the current cpu isn't idle and so we
> will choose it only. There are cases when a cpu is idle when it queues some
> work. For example, consider following scenario:
> - A cpu has programmed a timer and is IDLE now.
> - CPU gets into interrupt handler due to timer and queues a work. As the CPU is
>   currently IDLE, we can queue this work to some other CPU.
> 
> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> ---
>  kernel/workqueue.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 042d221..d32efa2 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1238,7 +1238,7 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
>  		struct global_cwq *last_gcwq;
>  
>  		if (cpu == WORK_CPU_UNBOUND)
> -			cpu = raw_smp_processor_id();
> +			cpu = sched_select_cpu(0);

A couple of things. The sched_select_cpu() is not cheap. It has a double
loop of domains/cpus looking for a non idle cpu. If we have 1024 CPUs,
and we are CPU 1023 and all other CPUs happen to be idle, we could be
searching 1023 CPUs before we come up with our own.

__queue_work() should be fast as there is a reason that it is delaying
the work and not running it itself.

Also, I really don't like this as a default behavior. It seems that this
solution is for a very special case, and this can become very intrusive
for the normal case.

To be honest, I'm uncomfortable with this approach. It seems to be
fighting a symptom and not the disease. I'd rather find a way to keep
work from being queued on wrong CPU. If it is a timer, find a way to
move the timer. If it is something else, lets work to fix that. Doing
searches of possibly all CPUs (unlikely, but it is there), just seems
wrong to me.

-- Steve


>  
>  		/*
>  		 * It's multi cpu.  If @work was previously on a different
> @@ -1383,7 +1383,7 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
>  		if (gcwq)
>  			lcpu = gcwq->cpu;
>  		if (lcpu == WORK_CPU_UNBOUND)
> -			lcpu = raw_smp_processor_id();
> +			lcpu = sched_select_cpu(0);
>  	} else {
>  		lcpu = WORK_CPU_UNBOUND;
>  	}
Viresh Kumar Nov. 27, 2012, 1:48 p.m. | #5
On 27 November 2012 18:56, Steven Rostedt <rostedt@goodmis.org> wrote:
> A couple of things. The sched_select_cpu() is not cheap. It has a double
> loop of domains/cpus looking for a non idle cpu. If we have 1024 CPUs,
> and we are CPU 1023 and all other CPUs happen to be idle, we could be
> searching 1023 CPUs before we come up with our own.

Not sure if you missed the first check sched_select_cpu()

+int sched_select_cpu(unsigned int sd_flags)
+{
+       /* If Current cpu isn't idle, don't migrate anything */
+       if (!idle_cpu(cpu))
+               return cpu;

We aren't going to search if we aren't idle.

> Also, I really don't like this as a default behavior. It seems that this
> solution is for a very special case, and this can become very intrusive
> for the normal case.

We tried with an KCONFIG option for it, which Tejun rejected.

> To be honest, I'm uncomfortable with this approach. It seems to be
> fighting a symptom and not the disease. I'd rather find a way to keep
> work from being queued on wrong CPU. If it is a timer, find a way to
> move the timer. If it is something else, lets work to fix that. Doing
> searches of possibly all CPUs (unlikely, but it is there), just seems
> wrong to me.

As Vincent pointed out, on big LITTLE systems we just don't want to
serve works on big cores. That would be wasting too much of power.
Specially if we are going to wake up big cores.

It would be difficult to control the source driver (which queues work) to
little cores. We thought, if somebody wanted to queue work on current
cpu then they must use queue_work_on().

--
viresh
Steven Rostedt Nov. 27, 2012, 1:59 p.m. | #6
On Tue, 2012-11-27 at 19:18 +0530, Viresh Kumar wrote:
> On 27 November 2012 18:56, Steven Rostedt <rostedt@goodmis.org> wrote:
> > A couple of things. The sched_select_cpu() is not cheap. It has a double
> > loop of domains/cpus looking for a non idle cpu. If we have 1024 CPUs,
> > and we are CPU 1023 and all other CPUs happen to be idle, we could be
> > searching 1023 CPUs before we come up with our own.
> 
> Not sure if you missed the first check sched_select_cpu()
> 
> +int sched_select_cpu(unsigned int sd_flags)
> +{
> +       /* If Current cpu isn't idle, don't migrate anything */
> +       if (!idle_cpu(cpu))
> +               return cpu;
> 
> We aren't going to search if we aren't idle.

OK, we are idle, but CPU 1022 isn't. We still need a large search. But,
heh we are idle we can spin. But then why go through this in the first
place ;-)


> 
> > Also, I really don't like this as a default behavior. It seems that this
> > solution is for a very special case, and this can become very intrusive
> > for the normal case.
> 
> We tried with an KCONFIG option for it, which Tejun rejected.

Yeah, I saw that. I don't like adding KCONFIG options either. Best is to
get something working that doesn't add any regressions. If you can get
this to work without making *any* regressions in the normal case than
I'm totally fine with that. But if this adds any issues with the normal
case, then it's a show stopper.

> 
> > To be honest, I'm uncomfortable with this approach. It seems to be
> > fighting a symptom and not the disease. I'd rather find a way to keep
> > work from being queued on wrong CPU. If it is a timer, find a way to
> > move the timer. If it is something else, lets work to fix that. Doing
> > searches of possibly all CPUs (unlikely, but it is there), just seems
> > wrong to me.
> 
> As Vincent pointed out, on big LITTLE systems we just don't want to
> serve works on big cores. That would be wasting too much of power.
> Specially if we are going to wake up big cores.
> 
> It would be difficult to control the source driver (which queues work) to
> little cores. We thought, if somebody wanted to queue work on current
> cpu then they must use queue_work_on().

As Tejun has mentioned earlier, is there any assumptions anywhere that
expects an unbounded work queue to not migrate? Where per cpu variables
might be used. Tejun had a good idea of forcing this to migrate the work
*every* time. To not let a work queue run on the same CPU that it was
queued on. If it can survive that, then it is probably OK. Maybe add a
config option that forces this? That way, anyone can test that this
isn't an issue.

-- Steve
Vincent Guittot Nov. 27, 2012, 2:55 p.m. | #7
On 27 November 2012 14:59, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Tue, 2012-11-27 at 19:18 +0530, Viresh Kumar wrote:
>> On 27 November 2012 18:56, Steven Rostedt <rostedt@goodmis.org> wrote:
>> > A couple of things. The sched_select_cpu() is not cheap. It has a double
>> > loop of domains/cpus looking for a non idle cpu. If we have 1024 CPUs,
>> > and we are CPU 1023 and all other CPUs happen to be idle, we could be
>> > searching 1023 CPUs before we come up with our own.
>>
>> Not sure if you missed the first check sched_select_cpu()
>>
>> +int sched_select_cpu(unsigned int sd_flags)
>> +{
>> +       /* If Current cpu isn't idle, don't migrate anything */
>> +       if (!idle_cpu(cpu))
>> +               return cpu;
>>
>> We aren't going to search if we aren't idle.
>
> OK, we are idle, but CPU 1022 isn't. We still need a large search. But,
> heh we are idle we can spin. But then why go through this in the first
> place ;-)

By migrating it now, it will create its activity and wake up on the
right CPU next time.

If migrating on any CPUs seems a bit risky, we could restrict the
migration on a CPU on the same node. We can pass such contraints on
sched_select_cpu

>
>
>>
>> > Also, I really don't like this as a default behavior. It seems that this
>> > solution is for a very special case, and this can become very intrusive
>> > for the normal case.
>>
>> We tried with an KCONFIG option for it, which Tejun rejected.
>
> Yeah, I saw that. I don't like adding KCONFIG options either. Best is to
> get something working that doesn't add any regressions. If you can get
> this to work without making *any* regressions in the normal case than
> I'm totally fine with that. But if this adds any issues with the normal
> case, then it's a show stopper.
>
>>
>> > To be honest, I'm uncomfortable with this approach. It seems to be
>> > fighting a symptom and not the disease. I'd rather find a way to keep
>> > work from being queued on wrong CPU. If it is a timer, find a way to
>> > move the timer. If it is something else, lets work to fix that. Doing
>> > searches of possibly all CPUs (unlikely, but it is there), just seems
>> > wrong to me.
>>
>> As Vincent pointed out, on big LITTLE systems we just don't want to
>> serve works on big cores. That would be wasting too much of power.
>> Specially if we are going to wake up big cores.
>>
>> It would be difficult to control the source driver (which queues work) to
>> little cores. We thought, if somebody wanted to queue work on current
>> cpu then they must use queue_work_on().
>
> As Tejun has mentioned earlier, is there any assumptions anywhere that
> expects an unbounded work queue to not migrate? Where per cpu variables
> might be used. Tejun had a good idea of forcing this to migrate the work
> *every* time. To not let a work queue run on the same CPU that it was
> queued on. If it can survive that, then it is probably OK. Maybe add a
> config option that forces this? That way, anyone can test that this
> isn't an issue.
>
> -- Steve
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
Steven Rostedt Nov. 27, 2012, 3:04 p.m. | #8
On Tue, 2012-11-27 at 15:55 +0100, Vincent Guittot wrote:
> On 27 November 2012 14:59, Steven Rostedt <rostedt@goodmis.org> wrote:
> > On Tue, 2012-11-27 at 19:18 +0530, Viresh Kumar wrote:
> >> On 27 November 2012 18:56, Steven Rostedt <rostedt@goodmis.org> wrote:
> >> > A couple of things. The sched_select_cpu() is not cheap. It has a double
> >> > loop of domains/cpus looking for a non idle cpu. If we have 1024 CPUs,
> >> > and we are CPU 1023 and all other CPUs happen to be idle, we could be
> >> > searching 1023 CPUs before we come up with our own.
> >>
> >> Not sure if you missed the first check sched_select_cpu()
> >>
> >> +int sched_select_cpu(unsigned int sd_flags)
> >> +{
> >> +       /* If Current cpu isn't idle, don't migrate anything */
> >> +       if (!idle_cpu(cpu))
> >> +               return cpu;
> >>
> >> We aren't going to search if we aren't idle.
> >
> > OK, we are idle, but CPU 1022 isn't. We still need a large search. But,
> > heh we are idle we can spin. But then why go through this in the first
> > place ;-)
> 
> By migrating it now, it will create its activity and wake up on the
> right CPU next time.
> 
> If migrating on any CPUs seems a bit risky, we could restrict the
> migration on a CPU on the same node. We can pass such contraints on
> sched_select_cpu
> 

That's assuming that the CPUs stay idle. Now if we move the work to
another CPU and it goes idle, then it may move that again. It could end
up being a ping pong approach.

I don't think idle is a strong enough heuristic for the general case. If
interrupts are constantly going off on a CPU that happens to be idle
most of the time, it will constantly be moving work onto CPUs that are
currently doing real work, and by doing so, it will be slowing those
CPUs down.

-- Steve
Vincent Guittot Nov. 27, 2012, 3:35 p.m. | #9
On 27 November 2012 16:04, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Tue, 2012-11-27 at 15:55 +0100, Vincent Guittot wrote:
>> On 27 November 2012 14:59, Steven Rostedt <rostedt@goodmis.org> wrote:
>> > On Tue, 2012-11-27 at 19:18 +0530, Viresh Kumar wrote:
>> >> On 27 November 2012 18:56, Steven Rostedt <rostedt@goodmis.org> wrote:
>> >> > A couple of things. The sched_select_cpu() is not cheap. It has a double
>> >> > loop of domains/cpus looking for a non idle cpu. If we have 1024 CPUs,
>> >> > and we are CPU 1023 and all other CPUs happen to be idle, we could be
>> >> > searching 1023 CPUs before we come up with our own.
>> >>
>> >> Not sure if you missed the first check sched_select_cpu()
>> >>
>> >> +int sched_select_cpu(unsigned int sd_flags)
>> >> +{
>> >> +       /* If Current cpu isn't idle, don't migrate anything */
>> >> +       if (!idle_cpu(cpu))
>> >> +               return cpu;
>> >>
>> >> We aren't going to search if we aren't idle.
>> >
>> > OK, we are idle, but CPU 1022 isn't. We still need a large search. But,
>> > heh we are idle we can spin. But then why go through this in the first
>> > place ;-)
>>
>> By migrating it now, it will create its activity and wake up on the
>> right CPU next time.
>>
>> If migrating on any CPUs seems a bit risky, we could restrict the
>> migration on a CPU on the same node. We can pass such contraints on
>> sched_select_cpu
>>
>
> That's assuming that the CPUs stay idle. Now if we move the work to
> another CPU and it goes idle, then it may move that again. It could end
> up being a ping pong approach.
>
> I don't think idle is a strong enough heuristic for the general case. If
> interrupts are constantly going off on a CPU that happens to be idle
> most of the time, it will constantly be moving work onto CPUs that are
> currently doing real work, and by doing so, it will be slowing those
> CPUs down.
>

I agree that idle is probably not enough but it's the heuristic that
is currently used for selecting a CPU for a timer and the timer also
uses sched_select_cpu in this series. So in order to go step by step,
a common interface has been introduced for selecting a CPU and this
function uses the same algorithm than the timer already do.
Once we agreed on an interface, the heuristic could be updated.


> -- Steve
>
>
Viresh Kumar Jan. 4, 2013, 11:11 a.m. | #10
Hi Tejun,

On 27 November 2012 10:49, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> On 26 November 2012 22:45, Tejun Heo <tj@kernel.org> wrote:
>> On Tue, Nov 06, 2012 at 04:08:45PM +0530, Viresh Kumar wrote:
>
>> I'm pretty skeptical about this.  queue_work() w/o explicit CPU
>> assignment has always guaranteed that the work item will be executed
>> on the same CPU.  I don't think there are too many users depending on
>> that but am not sure at all that there are none.  I asked you last
>> time that you would at the very least need to audit most users but it
>> doesn't seem like there has been any effort there.
>
> My bad. I completely missed/forgot that comment from your earlier mails.
> Will do it.

And this is the first thing i did this year :)

So there are almost 1200 files which are using: queue_work, queue_delayed_work,
schedule_work, schedule_delayed_work or schedule_on_each_cpu

Obviously i can't try to understand all of them :) , and so i tried to
search two
strings in them: "cpu" or "processor_id". So, this would catch every
file of these 1200
odd files which use some variables/comment/code with cpu or smp_processor_id or
raw_processor_id, etc..

I got around 650 files with these two searches.. Then i went into
these files to see if
there is anything noticeable, which may break due to this patch...

I got a list of files where cpu/processor_id strings are found, which
may break with
this patch (still can't guarantee as i don't have knowledge of these drivers)...

- arch/powerpc/mm/numa.c
- arch/s390/appldata/appldata_base.c
- arch/s390/kernel/time.c
- arch/s390/kernel/topology.c
- arch/s390/oprofile/hwsampler.c
- arch/x86/kernel/cpu/mcheck/mce.c
- arch/x86/kernel/tsc.c
- arch/x86/kvm/x86.c
- block/blk-core.c
- block/blk-throttle.c
- block/blk-genhd.c
- crypto/cryptd.c
- drivers/base/power/domain.c
- drivers/cpufreq/cpufreq.c
- drivers/hv/vmbus_drv.c
- drivers/infiniband/hw/qib/qib_iba7322.c and
drivers/infiniband/hw/qib/qib_init.c
- drivers/md/dm-crypt.c
- drivers/md/md.c
- drivers/net/ethernet/sfc/efx.c
- drivers/net/ethernet/tile/tilepro.c
- drivers/net/team/team_mode_loadbalance.c
- drivers/oprofile/cpu_buffer.c
- drivers/s390/kvm/kvm_virtio.c
- drivers/scsi/fcoe/fcoe.c
- drivers/tty/sysrq.c
- drivers/xen/pcpu.c
- fs/file.c, file_table.c, fs/fscache/object.c
- include/linux/stop_machine.h, kernel/stop_machine.c
- mm/percpu.c
- mm/slab.c
- mm/vmstat.c
- net/core/flow.c
- net/iucv/iucv.c

I am not sure what to do now :) , can you assist ?
Tejun Heo Jan. 4, 2013, 3:09 p.m. | #11
Hello, Viresh.

On Fri, Jan 04, 2013 at 04:41:47PM +0530, Viresh Kumar wrote:
> I got a list of files where cpu/processor_id strings are found, which
> may break with
> this patch (still can't guarantee as i don't have knowledge of these drivers)...
...
> I am not sure what to do now :) , can you assist ?

I don't know either.  Changing behavior subtly like this is hard.  I
usually try to spot some problem cases and try to identify patterns
there.  Once you identify a few of them, understanding and detecting
other problem cases get a lot easier.  In this case, maybe there are
too many places to audit and the problems are too subtle, and, if we
*have* to do it, the only thing we can do is implementing a debug
option to make such problems more visible - say, always schedule to a
different cpu on queue_work().

That said, at this point, the patchset doesn't seem all that
convincing to me and if I'm comprehending responses from others
correctly that seems to be the consensus.  It might be a better
approach to identify the specific offending workqueue users and make
them handle the situation more intelligently than trying to impose the
behavior on all workqueue users.  At any rate, we need way more data
showing this actually helps and if so why.

Thanks.
Viresh Kumar Jan. 7, 2013, 9:58 a.m. | #12
Hi Tejun,

On 4 January 2013 20:39, Tejun Heo <tj@kernel.org> wrote:
> I don't know either.  Changing behavior subtly like this is hard.  I
> usually try to spot some problem cases and try to identify patterns
> there.  Once you identify a few of them, understanding and detecting
> other problem cases get a lot easier.  In this case, maybe there are
> too many places to audit and the problems are too subtle, and, if we
> *have* to do it, the only thing we can do is implementing a debug
> option to make such problems more visible - say, always schedule to a
> different cpu on queue_work().
>
> That said, at this point, the patchset doesn't seem all that
> convincing to me and if I'm comprehending responses from others
> correctly that seems to be the consensus.  It might be a better
> approach to identify the specific offending workqueue users and make
> them handle the situation more intelligently than trying to impose the
> behavior on all workqueue users.  At any rate, we need way more data
> showing this actually helps and if so why.

I understand your concerns and believe me, even i feel the same :)
I had another idea, that i wanted to share.

Firstly the root cause of this patchset.

Myself and some others in Linaro are working on ARM future cores:
big.LITTLE systems.
Here we have few very powerful, high power consuming cores (big,
currently A15's) and
few very efficient ones (LITTLE, currently A7's).

The ultimate goal is to save as much power as possible without compromising
much with performance. For, that we wanted most of the stuff to run on LITTLE
cores and some performance-demanding stuff on big Cores. There are
multiple things
going around in this direction. Now, we thought A15's or big cores
shouldn't be used
for running small tasks like timers/workqueues and hence this patch is
an attempt
towards reaching that goal.

Over that we can do some load balancing of works within multiple alive
cpus, so that
it can get done quickly. Also, we shouldn't start using an idle cpu
just for processing
work :)

I have another idea that we can try:

queue_work_on_any_cpu().

With this we would not break any existing code and can try to migrate
old users to
this new infrastructure (atleast the ones which are rearming works from their
work_handlers). What do you say?

To take care of the cache locality issue, we can pass an argument to
this routine,
that can provide
- the mask of cpus to schedule this work on
  OR
- Sched Level (SD_LEVEL) of cpus to run it.

Waiting for your view on it :)

--
viresh
Steven Rostedt Jan. 7, 2013, 1:28 p.m. | #13
On Mon, 2013-01-07 at 15:28 +0530, Viresh Kumar wrote:
> Hi Tejun,
> 
> On 4 January 2013 20:39, Tejun Heo <tj@kernel.org> wrote:
> > I don't know either.  Changing behavior subtly like this is hard.  I
> > usually try to spot some problem cases and try to identify patterns
> > there.  Once you identify a few of them, understanding and detecting
> > other problem cases get a lot easier.  In this case, maybe there are
> > too many places to audit and the problems are too subtle, and, if we
> > *have* to do it, the only thing we can do is implementing a debug
> > option to make such problems more visible - say, always schedule to a
> > different cpu on queue_work().
> >
> > That said, at this point, the patchset doesn't seem all that
> > convincing to me and if I'm comprehending responses from others
> > correctly that seems to be the consensus.  It might be a better
> > approach to identify the specific offending workqueue users and make
> > them handle the situation more intelligently than trying to impose the
> > behavior on all workqueue users.  At any rate, we need way more data
> > showing this actually helps and if so why.
> 
> I understand your concerns and believe me, even i feel the same :)
> I had another idea, that i wanted to share.
> 
> Firstly the root cause of this patchset.
> 
> Myself and some others in Linaro are working on ARM future cores:
> big.LITTLE systems.
> Here we have few very powerful, high power consuming cores (big,
> currently A15's) and
> few very efficient ones (LITTLE, currently A7's).
> 
> The ultimate goal is to save as much power as possible without compromising
> much with performance. For, that we wanted most of the stuff to run on LITTLE
> cores and some performance-demanding stuff on big Cores. There are
> multiple things
> going around in this direction. Now, we thought A15's or big cores
> shouldn't be used
> for running small tasks like timers/workqueues and hence this patch is
> an attempt
> towards reaching that goal.
> 
> Over that we can do some load balancing of works within multiple alive
> cpus, so that
> it can get done quickly. Also, we shouldn't start using an idle cpu
> just for processing
> work :)
> 
> I have another idea that we can try:
> 
> queue_work_on_any_cpu().

I think this is a good idea.

> 
> With this we would not break any existing code and can try to migrate
> old users to
> this new infrastructure (atleast the ones which are rearming works from their
> work_handlers). What do you say?
> 
> To take care of the cache locality issue, we can pass an argument to
> this routine,
> that can provide
> - the mask of cpus to schedule this work on
>   OR
> - Sched Level (SD_LEVEL) of cpus to run it.

I wouldn't give a mask. If one is needed, we could have a
queue_work_on_mask_cpus(), or something. I think the "any" in the name
should be good enough to let developers know that this will not be on
the CPU that is called. By default, I would suggest for cache locality,
that we try to keep it on the same CPU. But if there's a better CPU to
run on, it runs there. Also, we could still add a debug option that
makes it always run on other CPUs to slap developers that don't read.

-- Steve

> 
> Waiting for your view on it :)
> 
> --
> viresh
Tejun Heo Jan. 7, 2013, 3:04 p.m. | #14
Hello, Viresh.

On Mon, Jan 07, 2013 at 03:28:33PM +0530, Viresh Kumar wrote:
> Firstly the root cause of this patchset.
> 
> Myself and some others in Linaro are working on ARM future cores:
> big.LITTLE systems.
> Here we have few very powerful, high power consuming cores (big,
> currently A15's) and
> few very efficient ones (LITTLE, currently A7's).
> 
> The ultimate goal is to save as much power as possible without compromising
> much with performance. For, that we wanted most of the stuff to run on LITTLE
> cores and some performance-demanding stuff on big Cores. There are
> multiple things
> going around in this direction. Now, we thought A15's or big cores
> shouldn't be used
> for running small tasks like timers/workqueues and hence this patch is
> an attempt
> towards reaching that goal.

I see.  My understanding of big.little is very superficial so there
are very good chances that I'm utterly wrong, but to me it seems like
more of a stop-gap solution than something which will have staying
power in the sense that the complexity doesn't seem to have any
inherent reason other than the failure to make the big ones power
efficient enough while idle or under minor load.

Maybe this isn't specific to big.little and heterogeneous cores will
be the way of future with big.little being just a forefront of the
trend.  And even if that's not the case, it definitely doesn't mean
that we can't accomodate big.little, but, at this point, it does make
me a bit more reluctant to make wholesale changes specifically for
big.little.

> Over that we can do some load balancing of works within multiple alive
> cpus, so that
> it can get done quickly. Also, we shouldn't start using an idle cpu
> just for processing
> work :)

The latter part "not using idle cpu just for processing work" does
apply to homogeneous systems too but as I wrote earlier work items
don't spontaneously happen on an idle core.  Something, including
timer, needs to kick it.  So, by definition, a CPU already isn't idle
when a work item starts execution on it.  What am I missing here?

> I have another idea that we can try:
> 
> queue_work_on_any_cpu().
>
> With this we would not break any existing code and can try to migrate
> old users to
> this new infrastructure (atleast the ones which are rearming works from their
> work_handlers). What do you say?

Yeah, this could be a better solution, I think.  Plus, it's not like
finding the optimal cpu is free.

> To take care of the cache locality issue, we can pass an argument to
> this routine,
> that can provide
> - the mask of cpus to schedule this work on
>   OR
> - Sched Level (SD_LEVEL) of cpus to run it.

Let's start simple for now.  If we really need it, we can always add
more later.

Thanks.
Amit Kucheria Jan. 7, 2013, 3:40 p.m. | #15
On Mon, Jan 7, 2013 at 8:34 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Viresh.
>
> On Mon, Jan 07, 2013 at 03:28:33PM +0530, Viresh Kumar wrote:
>> Firstly the root cause of this patchset.
>>
>> Myself and some others in Linaro are working on ARM future cores:
>> big.LITTLE systems.
>> Here we have few very powerful, high power consuming cores (big,
>> currently A15's) and
>> few very efficient ones (LITTLE, currently A7's).
>>
>> The ultimate goal is to save as much power as possible without compromising
>> much with performance. For, that we wanted most of the stuff to run on LITTLE
>> cores and some performance-demanding stuff on big Cores. There are
>> multiple things
>> going around in this direction. Now, we thought A15's or big cores
>> shouldn't be used
>> for running small tasks like timers/workqueues and hence this patch is
>> an attempt
>> towards reaching that goal.
>
> I see.  My understanding of big.little is very superficial so there
> are very good chances that I'm utterly wrong, but to me it seems like
> more of a stop-gap solution than something which will have staying
> power in the sense that the complexity doesn't seem to have any
> inherent reason other than the failure to make the big ones power
> efficient enough while idle or under minor load.

The two processors use different manufacturing processes - one
optimised for performance, the other for really low power. So the
reason is physics at this point. Other architectures are known to be
playing with similar schemes. ARM's big.LITTLE is just the first one
to the market.

> Maybe this isn't specific to big.little and heterogeneous cores will
> be the way of future with big.little being just a forefront of the
> trend.  And even if that's not the case, it definitely doesn't mean
> that we can't accomodate big.little, but, at this point, it does make
> me a bit more reluctant to make wholesale changes specifically for
> big.little.

The patches aren't targeted to benefit only b.L systems though it was
definitely the trigger for our investigations. Our hope is that after
presenting more analysis results from our side we'll be able to
convince the community of the general usefulness of this feature.

Here are a few short introductions to big.LITTLE in case you're
interested.[1][2]

[1] http://www.arm.com/files/downloads/big.LITTLE_Final.pdf
[2] http://lwn.net/Articles/481055/
Viresh Kumar Jan. 7, 2013, 5:59 p.m. | #16
On 7 January 2013 18:58, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 2013-01-07 at 15:28 +0530, Viresh Kumar wrote:
>> I have another idea that we can try:
>>
>> queue_work_on_any_cpu().
>
> I think this is a good idea.

:) :)

>> - the mask of cpus to schedule this work on
>>   OR
>> - Sched Level (SD_LEVEL) of cpus to run it.
>
> I wouldn't give a mask. If one is needed, we could have a
> queue_work_on_mask_cpus(), or something. I think the "any" in the name
> should be good enough to let developers know that this will not be on
> the CPU that is called.

Fair Enough.

> By default, I would suggest for cache locality,
> that we try to keep it on the same CPU. But if there's a better CPU to
> run on, it runs there.

That would break our intention behind this routine. We should run
it on a cpu which we think is the best one for it power/performance wise.

So, i would like to call the sched_select_cpu() routine from this routine to
get the suggested cpu.

This routine would however would see more changes later to optimize it
more.

> Also, we could still add a debug option that
> makes it always run on other CPUs to slap developers that don't read.

I don't think we need it :(
It would be a new API, with zero existing users. And so, whomsoever uses
it, should know what exactly he is doing :)

Thanks for your quick feedback.
Viresh Kumar Jan. 7, 2013, 6:07 p.m. | #17
[Removed Suresh and Venki from discussion, they switched their companies
probably]

On 7 January 2013 20:34, Tejun Heo <tj@kernel.org> wrote:
> The latter part "not using idle cpu just for processing work" does
> apply to homogeneous systems too but as I wrote earlier work items
> don't spontaneously happen on an idle core.  Something, including
> timer, needs to kick it.  So, by definition, a CPU already isn't idle
> when a work item starts execution on it.  What am I missing here?

We are talking about a core being idle from schedulers perspective :)

>> I have another idea that we can try:
>>
>> queue_work_on_any_cpu().
>>
>> With this we would not break any existing code and can try to migrate
>> old users to
>> this new infrastructure (atleast the ones which are rearming works from their
>> work_handlers). What do you say?
>
> Yeah, this could be a better solution, I think.  Plus, it's not like
> finding the optimal cpu is free.

Thanks for the first part (When i shared this idea with Vincent and Amit, i
wasn't sure at all about the feedback i will get from you and others, but i
am very happy now :) ).

I couldn't understand the second part. We still need to search for a free cpu
for this new routine. And the implementation would almost be same as the
implementation of queue_work() in my initial patch

>> To take care of the cache locality issue, we can pass an argument to
>> this routine,
>> that can provide
>> - the mask of cpus to schedule this work on
>>   OR
>> - Sched Level (SD_LEVEL) of cpus to run it.
>
> Let's start simple for now.  If we really need it, we can always add
> more later.

:)
Agreed. But i liked the idea from steven, we can have two routines:
queue_work_on_any_cpu() and queue_work_on_cpus()
Steven Rostedt Jan. 7, 2013, 10:29 p.m. | #18
On Mon, 2013-01-07 at 23:29 +0530, Viresh Kumar wrote:

> > By default, I would suggest for cache locality,
> > that we try to keep it on the same CPU. But if there's a better CPU to
> > run on, it runs there.
> 
> That would break our intention behind this routine. We should run
> it on a cpu which we think is the best one for it power/performance wise.

If you are running on a big.Little box sure. But for normal (read x86 ;)
systems, it should probably stay on the current CPU.

> 
> So, i would like to call the sched_select_cpu() routine from this routine to
> get the suggested cpu.

Does the caller need to know? Or if you have a big.LITTLE setup, it
should just work automagically?

> 
> This routine would however would see more changes later to optimize it
> more.
> 
> > Also, we could still add a debug option that
> > makes it always run on other CPUs to slap developers that don't read.
> 
> I don't think we need it :(
> It would be a new API, with zero existing users. And so, whomsoever uses
> it, should know what exactly he is doing :)

Heh, you don't know kernel developers well do you? ;-)

Once a new API is added to the kernel tree, it quickly becomes
(mis)used.

-- Steve
Viresh Kumar Jan. 8, 2013, 4:03 a.m. | #19
Hi Steven,

On 8 January 2013 03:59, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 2013-01-07 at 23:29 +0530, Viresh Kumar wrote:
>
>> > By default, I would suggest for cache locality,
>> > that we try to keep it on the same CPU. But if there's a better CPU to
>> > run on, it runs there.
>>
>> That would break our intention behind this routine. We should run
>> it on a cpu which we think is the best one for it power/performance wise.
>
> If you are running on a big.Little box sure. But for normal (read x86 ;)
> systems, it should probably stay on the current CPU.

But that's not the purpose of this new call. If the caller want it to be on
local cpu, he must not use this call. It is upto the core routine
(sched_select_cpu()
in our case) to decide where to launch it. If we need something special for x86,
we can hack this routine.

Even for non bigLITTLE systems, we may want to run a work on non-idle cpu and
so we can't guarantee local cpu here.

>> So, i would like to call the sched_select_cpu() routine from this routine to
>> get the suggested cpu.
>
> Does the caller need to know? Or if you have a big.LITTLE setup, it
> should just work automagically?

Caller wouldn't know, he just needs to trust on sched_select_cpu() to give the
most optimum cpu.

Again, it is not only for big LITTLE, but for any SMP system, where we
don't want
an idle cpu to do this work.

>> I don't think we need it :(
>> It would be a new API, with zero existing users. And so, whomsoever uses
>> it, should know what exactly he is doing :)
>
> Heh, you don't know kernel developers well do you? ;-)

I agree with you, but we don't need to care for foolish new code here.

> Once a new API is added to the kernel tree, it quickly becomes
> (mis)used.

Its true with all pieces of code in kernel and we really don't need to take
care of such users here :)
Tejun Heo Jan. 9, 2013, 6:49 p.m. | #20
Hello,

On Mon, Jan 07, 2013 at 11:37:22PM +0530, Viresh Kumar wrote:
> On 7 January 2013 20:34, Tejun Heo <tj@kernel.org> wrote:
> > The latter part "not using idle cpu just for processing work" does
> > apply to homogeneous systems too but as I wrote earlier work items
> > don't spontaneously happen on an idle core.  Something, including
> > timer, needs to kick it.  So, by definition, a CPU already isn't idle
> > when a work item starts execution on it.  What am I missing here?
> 
> We are talking about a core being idle from schedulers perspective :)

But it's not like cpu doesn't consume power if scheduler considers it
idle, right?  Can you please explain in detail how this contributes to
saving power?  Is it primarily about routing work items to lower power
CPUs?  And please don't point to presentation slides.  They don't seem
to explain much better and patches and the code should be able to
describe themselves.  Here, more so, as the desired behavior change
and the resulting powersave are rather subtle.

> > Yeah, this could be a better solution, I think.  Plus, it's not like
> > finding the optimal cpu is free.
> 
> Thanks for the first part (When i shared this idea with Vincent and Amit, i
> wasn't sure at all about the feedback i will get from you and others, but i
> am very happy now :) ).
> 
> I couldn't understand the second part. We still need to search for a free cpu
> for this new routine. And the implementation would almost be same as the
> implementation of queue_work() in my initial patch

I meant that enforcing lookup for the "optimal" cpu on queue_work() by
default would add overhead to the operation, which in cases (on most
homogeneous configs) would lead to overhead without any realized
benefit.

> > Let's start simple for now.  If we really need it, we can always add
> > more later.
> 
> :)
> Agreed. But i liked the idea from steven, we can have two routines:
> queue_work_on_any_cpu() and queue_work_on_cpus()

We can talk about specifics later but please try to *justify* each new
interface.  Please note that "putting work items to cpus which the
scheduler considers idle" can't really be justification in itself.

Thanks.
Viresh Kumar Jan. 10, 2013, 5:04 a.m. | #21
On 10 January 2013 00:19, Tejun Heo <tj@kernel.org> wrote:
> On Mon, Jan 07, 2013 at 11:37:22PM +0530, Viresh Kumar wrote:
>> We are talking about a core being idle from schedulers perspective :)
>
> But it's not like cpu doesn't consume power if scheduler considers it
> idle, right?  Can you please explain in detail how this contributes to
> saving power?  Is it primarily about routing work items to lower power
> CPUs?  And please don't point to presentation slides.  They don't seem
> to explain much better and patches and the code should be able to
> describe themselves.  Here, more so, as the desired behavior change
> and the resulting powersave are rather subtle.

I got your concerns. Firstly, when cpu is idle from schedulers perspective, it
consumes a lot of power.

queue_work_on_any_cpu() would queue the work on any other cpu only
when current cpu is idle from schedulers perspective, and this can only
happen when the cpu was actually idle (in low power state), woke up due
to some interrupt/timer and is asked to queue a work..

The idea is to choose other non-idle cpu at this point, so that current cpu
can immediately go into deeper idle state. With this cpus can stay longer
at deeper idle state, rather than running works.

And in cases, where works are rearmed from the handler, this can cause
sufficient power loss, which could be easily saved by pushing this work to
non-idle cpus.

The same approach is taken for deffered timers too, they are already using
such routine. .

Patch

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 042d221..d32efa2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1238,7 +1238,7 @@  static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 		struct global_cwq *last_gcwq;
 
 		if (cpu == WORK_CPU_UNBOUND)
-			cpu = raw_smp_processor_id();
+			cpu = sched_select_cpu(0);
 
 		/*
 		 * It's multi cpu.  If @work was previously on a different
@@ -1383,7 +1383,7 @@  static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
 		if (gcwq)
 			lcpu = gcwq->cpu;
 		if (lcpu == WORK_CPU_UNBOUND)
-			lcpu = raw_smp_processor_id();
+			lcpu = sched_select_cpu(0);
 	} else {
 		lcpu = WORK_CPU_UNBOUND;
 	}