
[v2,06/13] sched: Store maximum per-cpu capacity in root domain

Message ID 1466615004-3503-7-git-send-email-morten.rasmussen@arm.com
State New

Commit Message

Morten Rasmussen June 22, 2016, 5:03 p.m. UTC
From: Dietmar Eggemann <dietmar.eggemann@arm.com>


To be able to compare the capacity of the target cpu with the highest
available cpu capacity, store the maximum per-cpu capacity in the root
domain.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>

---
 kernel/sched/core.c  | 9 +++++++++
 kernel/sched/sched.h | 2 ++
 2 files changed, 11 insertions(+)

-- 
1.9.1

Comments

Dietmar Eggemann July 11, 2016, 4:16 p.m. UTC | #1
On 11/07/16 11:18, Peter Zijlstra wrote:
> On Wed, Jun 22, 2016 at 06:03:17PM +0100, Morten Rasmussen wrote:
>> @@ -6905,11 +6906,19 @@ static int build_sched_domains(const struct cpumask *cpu_map,
>>  	/* Attach the domains */
>>  	rcu_read_lock();
>>  	for_each_cpu(i, cpu_map) {
>> +		rq = cpu_rq(i);
>>  		sd = *per_cpu_ptr(d.sd, i);
>>  		cpu_attach_domain(sd, d.rd, i);
>> +
>> +		if (rq->cpu_capacity_orig > rq->rd->max_cpu_capacity)
>> +			rq->rd->max_cpu_capacity = rq->cpu_capacity_orig;
>>  	}
>
> Should you not set that _before_ cpu_attach_domain(), such that the
> state is up-to-date when it's published?

Yes, much better.

> Also, since it's lockless, should we not use {READ,WRITE}_ONCE() with it?

You mean for rq->rd->max_cpu_capacity? IMHO, there is a data dependency
between the read and the write, and the code only runs on one cpu.

I assume here that this is related to item 2, 'Overlapping loads and
stores within a particular CPU ...', in the GUARANTEES section of
Documentation/memory-barriers.txt.

Am I missing something?

>>  	rcu_read_unlock();
>>
>> +	if (rq)
>> +		pr_info("span: %*pbl (max cpu_capacity = %lu)\n",
>> +			cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
>> +
>
> While a single statement, it is multi-line, please add brackets.

OK.

>>  	ret = 0;
>>  error:
Dietmar Eggemann July 13, 2016, 11:18 a.m. UTC | #2
On 12/07/16 12:42, Peter Zijlstra wrote:
> On Mon, Jul 11, 2016 at 05:16:06PM +0100, Dietmar Eggemann wrote:
>> On 11/07/16 11:18, Peter Zijlstra wrote:
>>> Also, since it's lockless, should we not use {READ,WRITE}_ONCE() with it?
>>
>> You mean for rq->rd->max_cpu_capacity? IMHO, there is a data dependency
>> between the read and the write, and the code only runs on one cpu.
>>
>> I assume here that this is related to item 2, 'Overlapping loads and
>> stores within a particular CPU ...', in the GUARANTEES section of
>> Documentation/memory-barriers.txt.
>>
>> Am I missing something?
>
> Well, the value 'rd->max_cpu_capacity' is read by all CPUs attached to
> the root_domain, right? So CPUs already attached can observe this change
> when we update the value; we want them to observe either the old or the
> new max value, not a random mix of bytes.
>
> {READ,WRITE}_ONCE() ensure whole-word loads/stores, iow they avoid
> load/store-tearing.

OK, thanks, will add them.

So this maps to the point '(*) For aligned memory locations whose size
allows them to be accessed with a single memory-reference instruction,
prevents "load tearing" and "store tearing," ...' under 'The READ_ONCE()
and WRITE_ONCE() functions can prevent any number of optimizations ...'
in the COMPILER BARRIER section of Documentation/memory-barriers.txt.
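The agreed-upon rework can be modelled in plain userspace C. This is a hedged sketch, not the final kernel code: `READ_ONCE()`/`WRITE_ONCE()` are simplified stand-ins for the kernel macros (the real ones also handle non-scalar sizes), and `attach_domains()` is a hypothetical harness that only illustrates the ordering Peter asked for, with the max published before the domain is attached.

```c
/*
 * Userspace stand-ins for the kernel's READ_ONCE()/WRITE_ONCE()
 * (simplified). The volatile access forces the compiler to emit a
 * single whole-word load/store for a naturally aligned word, which is
 * what prevents the load/store-tearing discussed above.
 */
#define READ_ONCE(x)		(*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

struct root_domain {
	unsigned long max_cpu_capacity;
};

struct rq {
	unsigned long cpu_capacity_orig;
	struct root_domain *rd;
};

/*
 * Model of the reworked attach loop: the new max is published with
 * WRITE_ONCE() *before* the domain is attached (modelled here by the
 * rq->rd assignment), so already-attached CPUs reading the field with
 * READ_ONCE() see either the old or the new value, never a mix of bytes.
 */
static void attach_domains(struct rq *rqs, int nr_cpus, struct root_domain *rd)
{
	int i;

	for (i = 0; i < nr_cpus; i++) {
		struct rq *rq = &rqs[i];

		if (rq->cpu_capacity_orig > READ_ONCE(rd->max_cpu_capacity))
			WRITE_ONCE(rd->max_cpu_capacity, rq->cpu_capacity_orig);

		rq->rd = rd;	/* stands in for cpu_attach_domain() */
	}
}
```

After the loop, `rd->max_cpu_capacity` holds the largest `cpu_capacity_orig` seen, matching what the patch computes for the root domain.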
Vincent Guittot July 13, 2016, 12:40 p.m. UTC | #3
On 22 June 2016 at 19:03, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
>
> To be able to compare the capacity of the target cpu with the highest
> available cpu capacity, store the maximum per-cpu capacity in the root
> domain.

I thought that the capacities of all CPUs were built so that the highest
CPU capacity in the system is 1024 for a big.LITTLE system. So this
patch doesn't seem necessary for big.LITTLE systems.

> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/core.c  | 9 +++++++++
>  kernel/sched/sched.h | 2 ++
>  2 files changed, 11 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fe39118ffdfb..5093765e9930 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6855,6 +6855,7 @@ static int build_sched_domains(const struct cpumask *cpu_map,
>         enum s_alloc alloc_state;
>         struct sched_domain *sd;
>         struct s_data d;
> +       struct rq *rq = NULL;
>         int i, ret = -ENOMEM;
>
>         alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
> @@ -6905,11 +6906,19 @@ static int build_sched_domains(const struct cpumask *cpu_map,
>         /* Attach the domains */
>         rcu_read_lock();
>         for_each_cpu(i, cpu_map) {
> +               rq = cpu_rq(i);
>                 sd = *per_cpu_ptr(d.sd, i);
>                 cpu_attach_domain(sd, d.rd, i);
> +
> +               if (rq->cpu_capacity_orig > rq->rd->max_cpu_capacity)
> +                       rq->rd->max_cpu_capacity = rq->cpu_capacity_orig;
>         }
>         rcu_read_unlock();
>
> +       if (rq)
> +               pr_info("span: %*pbl (max cpu_capacity = %lu)\n",
> +                       cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
> +
>         ret = 0;
>  error:
>         __free_domain_allocs(&d, alloc_state, cpu_map);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 72f1f3087b04..3e9904ef224f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -564,6 +564,8 @@ struct root_domain {
>          */
>         cpumask_var_t rto_mask;
>         struct cpupri cpupri;
> +
> +       unsigned long max_cpu_capacity;
>  };
>
>  extern struct root_domain def_root_domain;
> --
> 1.9.1
Dietmar Eggemann July 13, 2016, 1:48 p.m. UTC | #4
On 13/07/16 13:40, Vincent Guittot wrote:
> On 22 June 2016 at 19:03, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
>>
>> To be able to compare the capacity of the target cpu with the highest
>> available cpu capacity, store the maximum per-cpu capacity in the root
>> domain.
>
> I thought that the capacities of all CPUs were built so that the highest
> CPU capacity in the system is 1024 for a big.LITTLE system. So this
> patch doesn't seem necessary for big.LITTLE systems.

The asymmetric cpu capacity support currently only has an effect on arm
big.LITTLE (32bit) using the existing 'struct cpu_efficiency
table_efficiency[]' based approach.

So e.g. on TC2 we have 1441 for the highest capacity:

[    0.041007] SMP: Total of 5 processors activated (240.00 BogoMIPS).
[    0.041024] CPU: All CPU(s) started in SVC mode.
[    0.041103] CPU0 attaching sched-domain:
[    0.041119]  domain 0: span 0-1 level MC
[    0.041141]   groups: 0 (cpu_capacity = 1441) 1 (cpu_capacity = 1441)
[    0.041179]   domain 1: span 0-4 level DIE
[    0.041199]    groups: 0-1 (cpu_capacity = 2882) 2-4 (cpu_capacity = 1818)
[    0.041245] CPU1 attaching sched-domain:
[    0.041260]  domain 0: span 0-1 level MC
[    0.041279]   groups: 1 (cpu_capacity = 1441) 0 (cpu_capacity = 1441)
[    0.041315]   domain 1: span 0-4 level DIE
[    0.041334]    groups: 0-1 (cpu_capacity = 2882) 2-4 (cpu_capacity = 1818)
[    0.041376] CPU2 attaching sched-domain:
[    0.041391]  domain 0: span 2-4 level MC
[    0.041409]   groups: 2 (cpu_capacity = 606) 3 (cpu_capacity = 606) 4 (cpu_capacity = 606)
[    0.041460]   domain 1: span 0-4 level DIE
[    0.041479]    groups: 2-4 (cpu_capacity = 1818) 0-1 (cpu_capacity = 2882)
..

[...]
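The TC2 numbers above (1441 for A15, 606 for A7) come from the `table_efficiency[]` scheme in arch/arm/kernel/topology.c. The sketch below is illustrative only: the efficiency constants match the ones used for A15/A7, but the normalization (scaling so the midpoint of min and max raw capacity maps to 1024) is a simplification of the kernel's middle_capacity computation, so absolute results only approximate the quoted dmesg values.

```c
/*
 * Sketch of the 'struct cpu_efficiency table_efficiency[]' approach:
 * raw capacity is the DT clock-frequency times a per-core efficiency
 * constant, then scaled into the scheduler's capacity range.
 */
struct cpu_efficiency {
	const char *compatible;
	unsigned long efficiency;
};

static const struct cpu_efficiency table_efficiency[] = {
	{ "arm,cortex-a15", 3891 },
	{ "arm,cortex-a7",  2048 },
};

/* raw capacity: efficiency * clock-frequency (MHz) */
static unsigned long raw_capacity(unsigned long efficiency, unsigned long mhz)
{
	return efficiency * mhz;
}

/*
 * Scale a raw capacity so that (min_raw + max_raw) / 2 maps to 1024.
 * Simplified stand-in for the kernel's middle_capacity normalization;
 * note that the big cores end up above 1024 here, just as in the TC2
 * dmesg output, which is the situation Vincent's comment refers to.
 */
static unsigned long scale_capacity(unsigned long raw,
				    unsigned long min_raw, unsigned long max_raw)
{
	unsigned long middle = (min_raw + max_raw) / 2;

	return raw * 1024 / middle;
}
```

With TC2-like clocks (A15 at 1.2 GHz, A7 at 1.0 GHz) this yields a big-core capacity above 1024 and a LITTLE capacity below it, in the same ballpark as the 1441/606 values above.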
Morten Rasmussen July 13, 2016, 4:37 p.m. UTC | #5
On Wed, Jul 13, 2016 at 02:48:24PM +0100, Dietmar Eggemann wrote:
> On 13/07/16 13:40, Vincent Guittot wrote:
> > On 22 June 2016 at 19:03, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> >>
> >> To be able to compare the capacity of the target cpu with the highest
> >> available cpu capacity, store the maximum per-cpu capacity in the root
> >> domain.
> >
> > I thought that the capacities of all CPUs were built so that the
> > highest CPU capacity in the system is 1024 for a big.LITTLE system. So
> > this patch doesn't seem necessary for big.LITTLE systems.
>
> The asymmetric cpu capacity support currently only has an effect on arm
> big.LITTLE (32bit) using the existing 'struct cpu_efficiency
> table_efficiency[]' based approach.

True for this patch set, but longer term, and if you use the preview
branch mentioned in the cover letter, Vincent is right. The idea is that
the highest capacity anywhere should be 1024.

If we fix the arch/arm/kernel/topology.c code at the same time we could
kill this patch.

However, even further down the road we might need it (or something
similar) anyway due to the thermal framework. At some point we would
like to adjust the max capacity based on any OPP constraints imposed by
the thermal framework. In extreme cases big cpus might be capped so hard
that they effectively have smaller capacity than little. I don't think
it makes sense to re-normalize everything to the highest available
capacity to ensure that there is always a cpu with capacity = 1024 in
the system; instead we must be able to cope with scenarios where max
capacity is smaller than 1024.

Also, for SMT max capacity is less than 1024 already. No?
But we may be able to cater for this in wake_cap() somehow. I can have a
look if Vincent doesn't like this patch.

Cheers,
Morten
Vincent Guittot July 14, 2016, 1:25 p.m. UTC | #6
On 13 July 2016 at 18:37, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Wed, Jul 13, 2016 at 02:48:24PM +0100, Dietmar Eggemann wrote:
>> [...]
>> The asymmetric cpu capacity support currently only has an effect on arm
>> big.LITTLE (32bit) using the existing 'struct cpu_efficiency
>> table_efficiency[]' based approach.
>
> True for this patch set, but longer term, and if you use the preview
> branch mentioned in the cover letter, Vincent is right. The idea is that
> the highest capacity anywhere should be 1024.
>
> If we fix the arch/arm/kernel/topology.c code at the same time we could
> kill this patch.
>
> However, even further down the road we might need it (or something
> similar) anyway due to the thermal framework. At some point we would
> like to adjust the max capacity based on any OPP constraints imposed by
> the thermal framework. In extreme cases big cpus might be capped so hard
> that they effectively have smaller capacity than little. I don't think
> it makes sense to re-normalize everything to the highest available
> capacity to ensure that there is always a cpu with capacity = 1024 in
> the system; instead we must be able to cope with scenarios where max
> capacity is smaller than 1024.

Yes, we will have to find a solution for thermal mitigation, but I
don't know if rd->max_cpu_capacity would be the best solution.

> Also, for SMT max capacity is less than 1024 already. No?

Yes, it is. I haven't looked at the details, but I think that we could
use a capacity of 1024 for SMT with the changes that have been done on
how to evaluate whether a sched_group is overloaded or not.

> But we may be able to cater for this in wake_cap() somehow. I can have a
> look if Vincent doesn't like this patch.

IMO, the rd->max_cpu_capacity field doesn't seem to be required for now.

Vincent

> Cheers,
> Morten
Morten Rasmussen July 14, 2016, 3:15 p.m. UTC | #7
On Thu, Jul 14, 2016 at 03:25:36PM +0200, Vincent Guittot wrote:
> On 13 July 2016 at 18:37, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > [...]
> > However, even further down the road we might need it (or something
> > similar) anyway due to the thermal framework. [...] In extreme cases
> > big cpus might be capped so hard that they effectively have smaller
> > capacity than little. [...] we must be able to cope with scenarios
> > where max capacity is smaller than 1024.
>
> Yes, we will have to find a solution for thermal mitigation, but I
> don't know if rd->max_cpu_capacity would be the best solution.

Agreed, I'm pretty sure that the current form isn't sufficient.

> > Also, for SMT max capacity is less than 1024 already. No?
>
> Yes, it is. I haven't looked at the details, but I think that we could
> use a capacity of 1024 for SMT with the changes that have been done on
> how to evaluate whether a sched_group is overloaded or not.

Changing SMT is a bit more invasive than I had hoped for for this patch
set. I will see if we can make it work with the current SMT capacities.

> > But we may be able to cater for this in wake_cap() somehow. I can have
> > a look if Vincent doesn't like this patch.
>
> IMO, the rd->max_cpu_capacity field doesn't seem to be required for now.

No problem. I will try to get rid of it. I will drop the "arm:" patches
as well, as they would have to be extended to guarantee a max capacity
of 1024, and we will most likely have to change them again when Juri's
DT solution hopefully gets merged.

Morten
Morten Rasmussen July 15, 2016, 11:46 a.m. UTC | #8
On Thu, Jul 14, 2016 at 04:15:20PM +0100, Morten Rasmussen wrote:
> On Thu, Jul 14, 2016 at 03:25:36PM +0200, Vincent Guittot wrote:
> > IMO, the rd->max_cpu_capacity field doesn't seem to be required for now.
>
> No problem. I will try to get rid of it. [...]

I have had a closer look at wake_cap() again. Getting rid of
rd->max_cpu_capacity isn't as easy as I thought.

The fundamental problem is that all we have in wake_cap() is the waking
cpu and previous cpu ids, which isn't sufficient to determine whether we
have an asymmetric capacity system or not. A capacity <1024 can be
either a little cpu or an SMT thread. We need a third piece of
information, which can be either the highest cpu capacity available in
the system, or a flag/variable/function telling us whether we are on an
SMT system.

I see the following solutions to the problem:

1. Have a system-wide max_cpu_capacity (as proposed in this patch),
which lets us detect SMT systems, as max_cpu_capacity < 1024 implies
SMT.

2. Change SMT thread capacity to 1024 so we implicitly know that max
capacity is always 1024. As said above, this is a very invasive change,
as it would mean that we no longer distinguish between SMP and SMT.
smt_gain and SD_SHARE_CPUCAPACITY would no longer have any effect and
could be ripped out. I would prefer not to create a dependency on such a
massive change. We can do the experiment afterwards if needed.

3. Detect SMT in wake_cap(). This requires access to the sched_domain
hierarchy, as SD_SHARE_CPUCAPACITY is the only way to detect SMT,
AFAIK, apart from looping through the capacities of all cpus in the
system, basically computing max_cpu_capacity each time.
wake_cap() is currently called before the rcu_read_lock() that gives us
access to the sched_domain hierarchy. I would have to postpone the
wake_cap() call to inside the lock and introduce another lookup in the
sched_domain hierarchy, which would be executed on every wake-up on all
systems. IMHO, that is a bit ugly.

I don't really like any of the solutions, but of those three I would go
for the current solution (1), as it is very minimal both in the amount
of code touched/affected and in overhead. We can kill it later if we
have a better one, no problem for me.

Do you see any alternatives?
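The disambiguation that solution 1 provides can be sketched as a small userspace model. This is illustrative, not the final kernel code: the function name `capacity_asymmetric()` and its exact check are hypothetical; the point is only that the root-domain max is the third piece of information that tells a little cpu apart from an SMT thread.

```c
#define SCHED_CAPACITY_SCALE	1024UL

struct root_domain {
	unsigned long max_cpu_capacity;
};

/*
 * In wake_cap() only the capacities of the waking and previous cpu are
 * at hand; a value < 1024 could mean a little cpu *or* an SMT thread.
 * rd->max_cpu_capacity disambiguates:
 *  - max_cpu_capacity < 1024 implies SMT (solution 1 above), where all
 *    threads have the same reduced capacity, so capacities are
 *    effectively symmetric;
 *  - otherwise, either cpu sitting below the max means asymmetric
 *    capacities, so the fast-path wake_affine decision cannot be
 *    trusted and the slow path should be taken.
 */
static int capacity_asymmetric(unsigned long cap_cpu, unsigned long cap_prev,
			       const struct root_domain *rd)
{
	unsigned long max = rd->max_cpu_capacity;

	if (max < SCHED_CAPACITY_SCALE)	/* SMT system */
		return 0;

	return cap_cpu < max || cap_prev < max;
}
```

On a big.LITTLE system with max 1441 this flags any wake-up involving a 606-capacity cpu, while on an SMT system with per-thread capacity 589 it correctly reports symmetry.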
Vincent Guittot July 15, 2016, 1:39 p.m. UTC | #9
On 15 July 2016 at 13:46, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> I have had a closer look at wake_cap() again. Getting rid of
> rd->max_cpu_capacity isn't as easy as I thought.
>
> [...]
>
> I see the following solutions to the problem:
>
> 1. Have a system-wide max_cpu_capacity (as proposed in this patch),
> which lets us detect SMT systems, as max_cpu_capacity < 1024 implies
> SMT.
>
> 2. Change SMT thread capacity to 1024 so we implicitly know that max
> capacity is always 1024. [...]
>
> 3. Detect SMT in wake_cap(). [...]
>
> I don't really like any of the solutions, but of those three I would go
> for the current solution (1), as it is very minimal both in the amount
> of code touched/affected and in overhead. We can kill it later if we
> have a better one, no problem for me.

I had solution 2 in mind. I haven't looked deeply at the impact, but I
thought that the main remaining blocking point is in update_numa_stats,
where it uses the fact that the capacity is less than 1024 at SMT level
to compute task_capacity, and sets has_free_capacity only if we have
less than one task per core.
smt_gain would not be used anymore.

> Do you see any alternatives?
Morten Rasmussen July 15, 2016, 4:02 p.m. UTC | #10
On Fri, Jul 15, 2016 at 03:39:05PM +0200, Vincent Guittot wrote:
> On 15 July 2016 at 13:46, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > [...]
> > 2. Change SMT thread capacity to 1024 so we implicitly know that max
> > capacity is always 1024. [...]
>
> I had solution 2 in mind. I haven't looked deeply at the impact, but I
> thought that the main remaining blocking point is in update_numa_stats,
> where it uses the fact that the capacity is less than 1024 at SMT level
> to compute task_capacity, and sets has_free_capacity only if we have
> less than one task per core.
> smt_gain would not be used anymore.

Aren't group capacities also smaller then, and hence influencing
load-balancing decisions?

I was hoping that we could decouple a full audit of the load-balance
code from this relatively simple patch set by staying with solution 1
for now. I worry that changing the SMT capacity can turn into a major
task. Just proving that there are no regressions, even if we know there
shouldn't be any, is a lot of work.

I'm happy to look at the SMT stuff; it has been on my list of
outstanding issues for a very long time, but I would prefer to break it
into multiple independent patch sets to keep them focused. I haven't had
much luck with massive complicated patch sets so far ;-)
Vincent Guittot July 18, 2016, 12:48 p.m. UTC | #11
On 15 July 2016 at 18:02, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Fri, Jul 15, 2016 at 03:39:05PM +0200, Vincent Guittot wrote:

>> On 15 July 2016 at 13:46, Morten Rasmussen <morten.rasmussen@arm.com> wrote:

>> > On Thu, Jul 14, 2016 at 04:15:20PM +0100, Morten Rasmussen wrote:

>> >> On Thu, Jul 14, 2016 at 03:25:36PM +0200, Vincent Guittot wrote:

>> >> > On 13 July 2016 at 18:37, Morten Rasmussen <morten.rasmussen@arm.com> wrote:

>> >> > > Also, for SMT max capacity is less than 1024 already. No?

>> >> >

>> >> > Yes, it is. I haven't looked in details but i think that we could use

>> >> > a capacity of 1024 for SMT with changes that have been done on how to

>> >> > evaluate if a sched_group is overloaded or not.

>> >>

>> >> Changing SMT is a bit more invasive that I had hoped for for this patch

>> >> set. I will see if we can make it work with the current SMT capacities.

>> >>

>> >> >

>> >> > > But we may be able to cater for this in wake_cap() somehow. I can have a

>> >> > > look if Vincent doesn't like this patch.

>> >> >

>> >> > IMO, rd->max_cpu_capacity field doesn't seem to be required for now .

>> >>

>> >> No problem. I will try to get rid of it. I will drop the "arm:" patches

>> >> as well as they would have to be extended to guarantee a max capacity of

>> >> 1024 and we most likely will have to change it again when Juri's DT

>> >> solution hopefully gets merged.

>> >

>> > I have had a closer look at wake_cap() again. Getting rid of

>> > rd->max_cpu_capacity isn't as easy as I thought.

>> >

>> > The fundamental problem is that all we have in wake_cap() is the waking

>> > cpu and previous cpu ids which isn't sufficient to determine whether we

>> > have an asymmetric capacity system or not. A capacity <1024 can either a

>> > little cpu or an SMT thread. We need a third piece of information, which

>> > can be either the highest cpu capacity available in the cpu, or a

>> > flag/variable/function telling us whether we are on an SMT system.

>> >

>> > I see the following solutions to the problem:

>> >

>> > 1. Have a system-wide max_cpu_capacity (as proposed in this patch) which

>> > can let us detect SMT systems as max_cpu_capacity < 1024 implies SMT.

>> >

>> > 2. Change SMT thread capacity to 1024 so we implicitly know that max

>> > capacity is always 1024. As said above, this is a very invasive change

>> > as it would mean that we no longer distinguish between SMP and SMT.

>> > smt_gain and SD_SHARE_CPUCAPACITY would no longer have any effect and

>> > can be ripped out. I would prefer not create a dependency on such a

>> > massive change. We can do the experiment afterwards if needed.

>> >

>> > 3. Detect SMT in wake_cap(). This requires access to the sched_domain

>> > hierarchy as the SD_SHARE_CPUCAPACITY is the only way to detect SMT,

>> > AFAIK, apart from looping through the capacities of all cpus in the

>> > system basically computing max_cpu_capacity each time.

>> > wake_cap() is currently called before rcu_read_lock() that gives us

>> > access to the sched_domain hierarchy. I would have to postpone the

>> > wake_cap() call to being inside the lock and introduce another lookup in

>> > the sched_domain hierarchy which would be executed on every wake-up on

>> > all systems. IMHO, that is a bit ugly.

>> >

>> > I don't really like any of the solutions, but of those three I would go

>> > for the current solution (1) as it is very minimal both in the amount of

>> > code touched/affected and overhead. We can kill it later if we have a

>> > better one, no problem for me.

>>

>> I had solution 2 in mind. I haven't looked deeply the impact but I

>> thought that the main remaining blocking  point is in

>> update_numa_stats where it use the fact that the capacity is less than

>> 1024 vat SMT level to compute task_capacity and  set has_free_capacity

>> only if we have less than 1 task per core.

>> smt_gain would not be used anymore

>

> Isn't group capacities of also smaller and hence influence load

> balancing decisions?


It should not because the capacity is now only used to compare groups
together and no more with the 1024 value

>

> I was hoping that we could decouple a full audit of the load-balance

> code from this relatively simple patch set by staying with 1 for now. I

> worry that changing the SMT capacity can turn into a major task. Just

> proving that there are no regressions, even if we know there should not be

> any, is a lot of work.


Yes, you are probably right on that point

>

> I'm happy to look at the SMT stuff; it has been on my list of outstanding

> issues for a very long time, but I would prefer to break it into

> multiple independent patch sets to keep them focused. I haven't had

> much luck with massive complicated patch sets so far ;-)
Morten Rasmussen July 18, 2016, 3:11 p.m. UTC | #12
On Mon, Jul 18, 2016 at 02:48:42PM +0200, Vincent Guittot wrote:
> On 15 July 2016 at 18:02, Morten Rasmussen <morten.rasmussen@arm.com> wrote:

> > On Fri, Jul 15, 2016 at 03:39:05PM +0200, Vincent Guittot wrote:

> >> On 15 July 2016 at 13:46, Morten Rasmussen <morten.rasmussen@arm.com> wrote:

> >> > On Thu, Jul 14, 2016 at 04:15:20PM +0100, Morten Rasmussen wrote:

> >> >> On Thu, Jul 14, 2016 at 03:25:36PM +0200, Vincent Guittot wrote:

> >> >> > On 13 July 2016 at 18:37, Morten Rasmussen <morten.rasmussen@arm.com> wrote:

> >> >> > > Also, for SMT max capacity is less than 1024 already. No?

> >> >> >

> >> >> > Yes, it is. I haven't looked in details but i think that we could use

> >> >> > a capacity of 1024 for SMT with changes that have been done on how to

> >> >> > evaluate if a sched_group is overloaded or not.

> >> >>

> >> >> Changing SMT is a bit more invasive than I had hoped for this patch

> >> >> set. I will see if we can make it work with the current SMT capacities.

> >> >>

> >> >> >

> >> >> > > But we may be able to cater for this in wake_cap() somehow. I can have a

> >> >> > > look if Vincent doesn't like this patch.

> >> >> >

> >> >> > IMO, the rd->max_cpu_capacity field doesn't seem to be required for now.

> >> >>

> >> >> No problem. I will try to get rid of it. I will drop the "arm:" patches

> >> >> as well as they would have to be extended to guarantee a max capacity of

> >> >> 1024 and we most likely will have to change it again when Juri's DT

> >> >> solution hopefully gets merged.

> >> >

> >> > I have had a closer look at wake_cap() again. Getting rid of

> >> > rd->max_cpu_capacity isn't as easy as I thought.

> >> >

> >> > The fundamental problem is that all we have in wake_cap() is the waking

> >> > cpu and previous cpu ids which isn't sufficient to determine whether we

> >> > have an asymmetric capacity system or not. A capacity <1024 can either be a

> >> > little cpu or an SMT thread. We need a third piece of information, which

> >> > can be either the highest cpu capacity available in the system, or a

> >> > flag/variable/function telling us whether we are on an SMT system.

> >> >

> >> > I see the following solutions to the problem:

> >> >

> >> > 1. Have a system-wide max_cpu_capacity (as proposed in this patch) which

> >> > can let us detect SMT systems as max_cpu_capacity < 1024 implies SMT.

> >> >

> >> > 2. Change SMT thread capacity to 1024 so we implicitly know that max

> >> > capacity is always 1024. As said above, this is a very invasive change

> >> > as it would mean that we no longer distinguish between SMP and SMT.

> >> > smt_gain and SD_SHARE_CPUCAPACITY would no longer have any effect and

> >> > can be ripped out. I would prefer not to create a dependency on such a

> >> > massive change. We can do the experiment afterwards if needed.

> >> >

> >> > 3. Detect SMT in wake_cap(). This requires access to the sched_domain

> >> > hierarchy, as the SD_SHARE_CPUCAPACITY flag is the only way to detect SMT,

> >> > AFAIK, apart from looping through the capacities of all cpus in the

> >> > system basically computing max_cpu_capacity each time.

> >> > wake_cap() is currently called before rcu_read_lock() that gives us

> >> > access to the sched_domain hierarchy. I would have to postpone the

> >> > wake_cap() call to being inside the lock and introduce another lookup in

> >> > the sched_domain hierarchy which would be executed on every wake-up on

> >> > all systems. IMHO, that is a bit ugly.

> >> >

> >> > I don't really like any of the solutions, but of those three I would go

> >> > for the current solution (1) as it is very minimal both in the amount of

> >> > code touched/affected and overhead. We can kill it later if we have a

> >> > better one, no problem for me.

> >>

> >> I had solution 2 in mind. I haven't looked deeply at the impact, but I

> >> thought that the main remaining blocking point is in

> >> update_numa_stats, where it uses the fact that the capacity is less than

> >> 1024 at SMT level to compute task_capacity and set has_free_capacity

> >> only if we have less than 1 task per core.

> >> smt_gain would not be used anymore

> >

> > Aren't group capacities also smaller, and wouldn't they hence influence

> > load balancing decisions?

> 

> It should not, because the capacity is now only used to compare groups

> with each other and no longer against the 1024 value.


You may very well be right. It is definitely worth a look; a lot of code
can be ripped out if we can move SMT threads to the default capacity.
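
A rough model of the difference being discussed (plain C, not kernel code; the function names are invented for this sketch, and 1178 is the smt_gain default that sd_init() used in kernels of this era, if memory serves):

```c
#include <stddef.h>

#define SCHED_CAPACITY_SCALE 1024UL	/* default capacity of a full-size cpu */

/*
 * Status quo modelled: with SD_SHARE_CPUCAPACITY set, each SMT sibling
 * gets a slice of smt_gain, so a 2-thread core reports roughly 589 per
 * thread, i.e. below SCHED_CAPACITY_SCALE.
 */
unsigned long smt_thread_capacity(unsigned long smt_gain,
				  unsigned int nr_siblings)
{
	return smt_gain / nr_siblings;
}

/*
 * Solution 2 from the thread: drop the scaling and report the default
 * capacity for every thread, SMT or not, so smt_gain (and the capacity
 * side of SD_SHARE_CPUCAPACITY) could be ripped out.
 */
unsigned long flat_thread_capacity(void)
{
	return SCHED_CAPACITY_SCALE;
}
```

For the default smt_gain of 1178, smt_thread_capacity(1178, 2) yields 589, while the flat variant always reports 1024.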

> 

> >

> > I was hoping that we could decouple a full audit of the load-balance

> > code from this relatively simple patch set by staying with 1 for now. I

> > worry that changing the SMT capacity can turn into a major task. Just

> > proving that there are no regressions, even if we know there should not be

> > any, is a lot of work.

> 

> Yes, you are probably right on that point


I will put together v3, still containing solution 1.

Thanks,
Morten
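
For reference, solution (1) can be sketched as a userspace model (plain C, not kernel code; the helper names and the 80% capacity margin are invented for illustration — only SCHED_CAPACITY_SCALE and the idea of comparing against the root-domain maximum come from the discussion above):

```c
#include <stddef.h>

#define SCHED_CAPACITY_SCALE 1024UL

/*
 * Model of the loop the patch adds to build_sched_domains(): remember
 * the largest cpu_capacity_orig seen while attaching the domains.
 */
unsigned long compute_max_cpu_capacity(const unsigned long *cap_orig,
				       size_t nr_cpus)
{
	unsigned long max = 0;
	size_t i;

	for (i = 0; i < nr_cpus; i++)
		if (cap_orig[i] > max)
			max = cap_orig[i];
	return max;
}

/*
 * With the system-wide maximum available, SMT can be told apart from
 * asymmetric big.LITTLE: on SMT every capacity, including the maximum,
 * sits below SCHED_CAPACITY_SCALE.
 */
int system_is_smt(unsigned long max_cpu_capacity)
{
	return max_cpu_capacity < SCHED_CAPACITY_SCALE;
}

/*
 * wake_cap()-style check: does this cpu fall noticeably short of the
 * best cpu in the system? The 80% margin is purely illustrative.
 */
int cpu_capacity_insufficient(unsigned long cap_orig,
			      unsigned long max_cpu_capacity)
{
	if (system_is_smt(max_cpu_capacity))
		return 0;	/* symmetric SMT: skip the capacity path */
	return cap_orig * 5 < max_cpu_capacity * 4;
}
```

With big.LITTLE-style capacities {430, 430, 1024, 1024} the maximum is 1024 and a little cpu is flagged as capacity-limited; with 2-way SMT capacities {589, 589} the maximum stays below 1024, so the check is bypassed.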

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe39118ffdfb..5093765e9930 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6855,6 +6855,7 @@  static int build_sched_domains(const struct cpumask *cpu_map,
 	enum s_alloc alloc_state;
 	struct sched_domain *sd;
 	struct s_data d;
+	struct rq *rq = NULL;
 	int i, ret = -ENOMEM;
 
 	alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
@@ -6905,11 +6906,19 @@  static int build_sched_domains(const struct cpumask *cpu_map,
 	/* Attach the domains */
 	rcu_read_lock();
 	for_each_cpu(i, cpu_map) {
+		rq = cpu_rq(i);
 		sd = *per_cpu_ptr(d.sd, i);
 		cpu_attach_domain(sd, d.rd, i);
+
+		if (rq->cpu_capacity_orig > rq->rd->max_cpu_capacity)
+			rq->rd->max_cpu_capacity = rq->cpu_capacity_orig;
 	}
 	rcu_read_unlock();
 
+	if (rq)
+		pr_info("span: %*pbl (max cpu_capacity = %lu)\n",
+			cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
+
 	ret = 0;
 error:
 	__free_domain_allocs(&d, alloc_state, cpu_map);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 72f1f3087b04..3e9904ef224f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -564,6 +564,8 @@  struct root_domain {
 	 */
 	cpumask_var_t rto_mask;
 	struct cpupri cpupri;
+
+	unsigned long max_cpu_capacity;
 };
 
 extern struct root_domain def_root_domain;