Message ID: 1396009796-31598-4-git-send-email-daniel.lezcano@linaro.org
State: New
On Fri, 28 Mar 2014, Daniel Lezcano wrote: > As we know in which idle state the cpu is, we can investigate the following: > > 1. when did the cpu entered the idle state ? the longer the cpu is idle, the > deeper it is idle > 2. what exit latency is ? the greater the exit latency is, the deeper it is > > With both information, when all cpus are idle, we can choose the idlest cpu. > > When one cpu is not idle, the old check against weighted load applies. > > Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> There seems to be some problems with the implementation. > @@ -4336,20 +4337,53 @@ static int > find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu) > { > unsigned long load, min_load = ULONG_MAX; > - int idlest = -1; > + unsigned int min_exit_latency = UINT_MAX; > + u64 idle_stamp, min_idle_stamp = ULONG_MAX; I don't think you really meant to assign an u64 variable with ULONG_MAX. You probably want ULLONG_MAX here. And probably not in fact (more later). > + > + struct rq *rq; > + struct cpuidle_power *power; > + > + int cpu_idle = -1; > + int cpu_busy = -1; > int i; > > /* Traverse only the allowed CPUs */ > for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) { > - load = weighted_cpuload(i); > > - if (load < min_load || (load == min_load && i == this_cpu)) { > - min_load = load; > - idlest = i; > + if (idle_cpu(i)) { > + > + rq = cpu_rq(i); > + power = rq->power; > + idle_stamp = rq->idle_stamp; > + > + /* The cpu is idle since a shorter time */ > + if (idle_stamp < min_idle_stamp) { > + min_idle_stamp = idle_stamp; > + cpu_idle = i; > + continue; Don't you want the highest time stamp in order to select the most recently idled CPU? Favoring the CPU which has been idle the longest makes little sense. > + } > + > + /* The cpu is idle but the exit_latency is shorter */ > + if (power && power->exit_latency < min_exit_latency) { > + min_exit_latency = power->exit_latency; > + cpu_idle = i; > + continue; > + } I think this is wrong. This gives priority to CPUs which have been idle for a (longer... although this should have been) shorter period of time over those with a shallower idle state. I think this should rather be: if (power && power->exit_latency < min_exit_latency) { min_exit_latency = power->exit_latency; latest_idle_stamp = idle_stamp; cpu_idle = i; } else if ((!power || power->exit_latency == min_exit_latency) && idle_stamp > latest_idle_stamp) { latest_idle_stamp = idle_stamp; cpu_idle = i; } So the CPU with the shallowest idle state is selected in priority, and if many CPUs are in the same state then the time stamp is used to select the most recent one. Whenever a shallower idle state is found then the latest_idle_stamp is reset for that state even if it is further in the past. > + } else { > + > + load = weighted_cpuload(i); > + > + if (load < min_load || > + (load == min_load && i == this_cpu)) { > + min_load = load; > + cpu_busy = i; > + continue; > + } > } I think this is wrong to do an if-else based on idle_cpu() here. What if a CPU is heavily loaded, but for some reason it happens to be idle at this very moment? With your patch it could be selected as an idle CPU while it would be discarded as being too busy otherwise. It is important to determine both cpu_busy and cpu_idle for all CPUs. And cpu_busy is a bad name for this. Something like least_loaded would be more self explanatory. Same thing for cpu_idle which could be clearer if named shalloest_idle. 
> -	return idlest;
> +	/* Busy cpus are considered less idle than idle cpus ;) */
> +	return cpu_busy != -1 ? cpu_busy : cpu_idle;

And finally it is a policy decision whether or not we want to return least_loaded over shallowest_idle e.g do we pack tasks on non idle CPUs first or not. That in itself needs more investigation. To keep the existing policy unchanged for now the above condition should have its variables swapped.


Nicolas
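[ Putting Nicolas' suggestions together, the restructured scan might look roughly like the sketch below. It reuses the struct cpuidle_power and the rq->power / rq->idle_stamp fields introduced by the patch under review; it is only an illustration of the proposed selection logic, not an actual follow-up patch. ]

static int
find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
{
	unsigned long load, min_load = ULONG_MAX;
	unsigned int min_exit_latency = UINT_MAX;
	u64 latest_idle_stamp = 0;
	int least_loaded = -1;
	int shallowest_idle = -1;
	int i;

	/* Traverse only the allowed CPUs */
	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
		struct rq *rq = cpu_rq(i);
		struct cpuidle_power *power = rq->power;

		if (idle_cpu(i)) {
			u64 idle_stamp = rq->idle_stamp;

			if (power && power->exit_latency < min_exit_latency) {
				/* A shallower C-state always wins... */
				min_exit_latency = power->exit_latency;
				latest_idle_stamp = idle_stamp;
				shallowest_idle = i;
			} else if ((!power ||
				    power->exit_latency == min_exit_latency) &&
				   idle_stamp > latest_idle_stamp) {
				/* ...ties go to the most recently idled CPU */
				latest_idle_stamp = idle_stamp;
				shallowest_idle = i;
			}
		}

		/* Track the least loaded CPU for every CPU, idle or not */
		load = weighted_cpuload(i);
		if (load < min_load || (load == min_load && i == this_cpu)) {
			min_load = load;
			least_loaded = i;
		}
	}

	/* Keep the existing policy: prefer an idle CPU when one exists */
	return shallowest_idle != -1 ? shallowest_idle : least_loaded;
}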
On Tuesday, April 01, 2014 11:05:49 PM Nicolas Pitre wrote: > On Fri, 28 Mar 2014, Daniel Lezcano wrote: > > > As we know in which idle state the cpu is, we can investigate the following: > > > > 1. when did the cpu entered the idle state ? the longer the cpu is idle, the > > deeper it is idle > > 2. what exit latency is ? the greater the exit latency is, the deeper it is > > > > With both information, when all cpus are idle, we can choose the idlest cpu. > > > > When one cpu is not idle, the old check against weighted load applies. > > > > Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> > > There seems to be some problems with the implementation. > > > @@ -4336,20 +4337,53 @@ static int > > find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu) > > { > > unsigned long load, min_load = ULONG_MAX; > > - int idlest = -1; > > + unsigned int min_exit_latency = UINT_MAX; > > + u64 idle_stamp, min_idle_stamp = ULONG_MAX; > > I don't think you really meant to assign an u64 variable with ULONG_MAX. > You probably want ULLONG_MAX here. And probably not in fact (more > later). > > > + > > + struct rq *rq; > > + struct cpuidle_power *power; > > + > > + int cpu_idle = -1; > > + int cpu_busy = -1; > > int i; > > > > /* Traverse only the allowed CPUs */ > > for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) { > > - load = weighted_cpuload(i); > > > > - if (load < min_load || (load == min_load && i == this_cpu)) { > > - min_load = load; > > - idlest = i; > > + if (idle_cpu(i)) { > > + > > + rq = cpu_rq(i); > > + power = rq->power; > > + idle_stamp = rq->idle_stamp; > > + > > + /* The cpu is idle since a shorter time */ > > + if (idle_stamp < min_idle_stamp) { > > + min_idle_stamp = idle_stamp; > > + cpu_idle = i; > > + continue; > > Don't you want the highest time stamp in order to select the most > recently idled CPU? Favoring the CPU which has been idle the longest > makes little sense. It may make sense if the hardware can auto-promote CPUs to deeper C-states. Something like that happens with package C-states that are only entered when all cores have entered a particular core C-state already. In that case the probability of the core being in a deeper state grows with time. That said I would just drop this heuristics for the time being. If auto-promotion is disregarded, it doesn't really matter how much time the given CPU has been idle except for one case: When the target residency of its idle state hasn't been reached yet, waking up the CPU may be a mistake (depending on how deep the state actually is, but for the majority of drivers in the tree we don't have any measure of that). > > + } > > + > > + /* The cpu is idle but the exit_latency is shorter */ > > + if (power && power->exit_latency < min_exit_latency) { > > + min_exit_latency = power->exit_latency; > > + cpu_idle = i; > > + continue; > > + } > > I think this is wrong. This gives priority to CPUs which have been idle > for a (longer... although this should have been) shorter period of time > over those with a shallower idle state. 
I think this should rather be: > > if (power && power->exit_latency < min_exit_latency) { > min_exit_latency = power->exit_latency; > latest_idle_stamp = idle_stamp; > cpu_idle = i; > } else if ((!power || power->exit_latency == min_exit_latency) && > idle_stamp > latest_idle_stamp) { > latest_idle_stamp = idle_stamp; > cpu_idle = i; > } > > So the CPU with the shallowest idle state is selected in priority, and > if many CPUs are in the same state then the time stamp is used to > select the most recent one. Again, if auto-promotion is disregarded, it doesn't really matter which of them is woken up. > Whenever a shallower idle state is found then the latest_idle_stamp is reset for > that state even if it is further in the past. > > > + } else { > > + > > + load = weighted_cpuload(i); > > + > > + if (load < min_load || > > + (load == min_load && i == this_cpu)) { > > + min_load = load; > > + cpu_busy = i; > > + continue; > > + } > > } > > I think this is wrong to do an if-else based on idle_cpu() here. What > if a CPU is heavily loaded, but for some reason it happens to be idle at > this very moment? With your patch it could be selected as an idle CPU > while it would be discarded as being too busy otherwise. But see below -> > It is important to determine both cpu_busy and cpu_idle for all CPUs. > > And cpu_busy is a bad name for this. Something like least_loaded would > be more self explanatory. Same thing for cpu_idle which could be > clearer if named shalloest_idle. shallowest_idle? > > - return idlest; > > + /* Busy cpus are considered less idle than idle cpus ;) */ > > + return cpu_busy != -1 ? cpu_busy : cpu_idle; > > And finally it is a policy decision whether or not we want to return > least_loaded over shallowest_idle e.g do we pack tasks on non idle CPUs > first or not. That in itself needs more investigation. To keep the > existing policy unchanged for now the above condition should have its > variables swapped. Which means that once we've find the first idle CPU, it is not useful to continue computing least_loaded, because we will return the idle one anyway, right?
On Fri, 4 Apr 2014, Rafael J. Wysocki wrote: > On Tuesday, April 01, 2014 11:05:49 PM Nicolas Pitre wrote: > > On Fri, 28 Mar 2014, Daniel Lezcano wrote: > > > > > As we know in which idle state the cpu is, we can investigate the following: > > > > > > 1. when did the cpu entered the idle state ? the longer the cpu is idle, the > > > deeper it is idle > > > 2. what exit latency is ? the greater the exit latency is, the deeper it is > > > > > > With both information, when all cpus are idle, we can choose the idlest cpu. > > > > > > When one cpu is not idle, the old check against weighted load applies. > > > > > > Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> > > > > There seems to be some problems with the implementation. > > > > > @@ -4336,20 +4337,53 @@ static int > > > find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu) > > > { > > > unsigned long load, min_load = ULONG_MAX; > > > - int idlest = -1; > > > + unsigned int min_exit_latency = UINT_MAX; > > > + u64 idle_stamp, min_idle_stamp = ULONG_MAX; > > > > I don't think you really meant to assign an u64 variable with ULONG_MAX. > > You probably want ULLONG_MAX here. And probably not in fact (more > > later). > > > > > + > > > + struct rq *rq; > > > + struct cpuidle_power *power; > > > + > > > + int cpu_idle = -1; > > > + int cpu_busy = -1; > > > int i; > > > > > > /* Traverse only the allowed CPUs */ > > > for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) { > > > - load = weighted_cpuload(i); > > > > > > - if (load < min_load || (load == min_load && i == this_cpu)) { > > > - min_load = load; > > > - idlest = i; > > > + if (idle_cpu(i)) { > > > + > > > + rq = cpu_rq(i); > > > + power = rq->power; > > > + idle_stamp = rq->idle_stamp; > > > + > > > + /* The cpu is idle since a shorter time */ > > > + if (idle_stamp < min_idle_stamp) { > > > + min_idle_stamp = idle_stamp; > > > + cpu_idle = i; > > > + continue; > > > > Don't you want the highest time stamp in order to select the most > > recently idled CPU? Favoring the CPU which has been idle the longest > > makes little sense. > > It may make sense if the hardware can auto-promote CPUs to deeper C-states. If so the promotion will happen over time, no? What I'm saying here is that those CPUs which have been idle longer should not be favored when it is time to select a CPU for a task to run. More recently idled CPUs are more likely to be in a shallower C-state. > Something like that happens with package C-states that are only entered when > all cores have entered a particular core C-state already. In that case the > probability of the core being in a deeper state grows with time. Exactly what I'm saying. Also here it is worth remembering that the scheduling domains should represent those packages that share common C-states at a higher level. The scheduler can then be told not to balance across domains if it doesn't need to in order to favor the conditions for those package C-states to be used. That's what the task packing patch series is about, independently of this one. > That said I would just drop this heuristics for the time being. If auto-promotion > is disregarded, it doesn't really matter how much time the given CPU has been idle > except for one case: When the target residency of its idle state hasn't been > reached yet, waking up the CPU may be a mistake (depending on how deep the state > actually is, but for the majority of drivers in the tree we don't have any measure > of that). 
There is one reason for considering the time a CPU has been idle, assuming equivalent C-state, and that is cache snooping. The longer a CPU is idle, the more likely its cache content will have been claimed and migrated by other CPUs. Of course that doesn't make much difference for deeper C-states where the cache isn't preserved, but it is probably simpler and cheaper to apply this heuristic in all cases. > > > + } > > > + > > > + /* The cpu is idle but the exit_latency is shorter */ > > > + if (power && power->exit_latency < min_exit_latency) { > > > + min_exit_latency = power->exit_latency; > > > + cpu_idle = i; > > > + continue; > > > + } > > > > I think this is wrong. This gives priority to CPUs which have been idle > > for a (longer... although this should have been) shorter period of time > > over those with a shallower idle state. I think this should rather be: > > > > if (power && power->exit_latency < min_exit_latency) { > > min_exit_latency = power->exit_latency; > > latest_idle_stamp = idle_stamp; > > cpu_idle = i; > > } else if ((!power || power->exit_latency == min_exit_latency) && > > idle_stamp > latest_idle_stamp) { > > latest_idle_stamp = idle_stamp; > > cpu_idle = i; > > } > > > > So the CPU with the shallowest idle state is selected in priority, and > > if many CPUs are in the same state then the time stamp is used to > > select the most recent one. > > Again, if auto-promotion is disregarded, it doesn't really matter which of them > is woken up. If it doesn't matter then it doesn't hurt. But in some cases it matters. > > Whenever a shallower idle state is found then the latest_idle_stamp is reset for > > that state even if it is further in the past. > > > > > + } else { > > > + > > > + load = weighted_cpuload(i); > > > + > > > + if (load < min_load || > > > + (load == min_load && i == this_cpu)) { > > > + min_load = load; > > > + cpu_busy = i; > > > + continue; > > > + } > > > } > > > > I think this is wrong to do an if-else based on idle_cpu() here. What > > if a CPU is heavily loaded, but for some reason it happens to be idle at > > this very moment? With your patch it could be selected as an idle CPU > > while it would be discarded as being too busy otherwise. > > But see below -> > > > It is important to determine both cpu_busy and cpu_idle for all CPUs. > > > > And cpu_busy is a bad name for this. Something like least_loaded would > > be more self explanatory. Same thing for cpu_idle which could be > > clearer if named shalloest_idle. > > shallowest_idle? Something that means the CPU with the shallowest C-state. Using "cpu_idle" for this variable doesn't cut it. > > > - return idlest; > > > + /* Busy cpus are considered less idle than idle cpus ;) */ > > > + return cpu_busy != -1 ? cpu_busy : cpu_idle; > > > > And finally it is a policy decision whether or not we want to return > > least_loaded over shallowest_idle e.g do we pack tasks on non idle CPUs > > first or not. That in itself needs more investigation. To keep the > > existing policy unchanged for now the above condition should have its > > variables swapped. > > Which means that once we've find the first idle CPU, it is not useful to > continue computing least_loaded, because we will return the idle one anyway, > right? Good point. Currently, that should be the case. Eventually we'll want to put new tasks on lightly loaded CPUs instead of waking up a fully idle CPU in order to favor deeper C-states. 
But that requires a patch series of its own just to determine how loaded a CPU is and how much work it can still accommodate before being oversubscribed, etc.


Nicolas
On Friday, April 04, 2014 12:56:52 PM Nicolas Pitre wrote: > On Fri, 4 Apr 2014, Rafael J. Wysocki wrote: > > > On Tuesday, April 01, 2014 11:05:49 PM Nicolas Pitre wrote: > > > On Fri, 28 Mar 2014, Daniel Lezcano wrote: > > > > > > > As we know in which idle state the cpu is, we can investigate the following: > > > > > > > > 1. when did the cpu entered the idle state ? the longer the cpu is idle, the > > > > deeper it is idle > > > > 2. what exit latency is ? the greater the exit latency is, the deeper it is > > > > > > > > With both information, when all cpus are idle, we can choose the idlest cpu. > > > > > > > > When one cpu is not idle, the old check against weighted load applies. > > > > > > > > Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> > > > > > > There seems to be some problems with the implementation. > > > > > > > @@ -4336,20 +4337,53 @@ static int > > > > find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu) > > > > { > > > > unsigned long load, min_load = ULONG_MAX; > > > > - int idlest = -1; > > > > + unsigned int min_exit_latency = UINT_MAX; > > > > + u64 idle_stamp, min_idle_stamp = ULONG_MAX; > > > > > > I don't think you really meant to assign an u64 variable with ULONG_MAX. > > > You probably want ULLONG_MAX here. And probably not in fact (more > > > later). > > > > > > > + > > > > + struct rq *rq; > > > > + struct cpuidle_power *power; > > > > + > > > > + int cpu_idle = -1; > > > > + int cpu_busy = -1; > > > > int i; > > > > > > > > /* Traverse only the allowed CPUs */ > > > > for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) { > > > > - load = weighted_cpuload(i); > > > > > > > > - if (load < min_load || (load == min_load && i == this_cpu)) { > > > > - min_load = load; > > > > - idlest = i; > > > > + if (idle_cpu(i)) { > > > > + > > > > + rq = cpu_rq(i); > > > > + power = rq->power; > > > > + idle_stamp = rq->idle_stamp; > > > > + > > > > + /* The cpu is idle since a shorter time */ > > > > + if (idle_stamp < min_idle_stamp) { > > > > + min_idle_stamp = idle_stamp; > > > > + cpu_idle = i; > > > > + continue; > > > > > > Don't you want the highest time stamp in order to select the most > > > recently idled CPU? Favoring the CPU which has been idle the longest > > > makes little sense. > > > > It may make sense if the hardware can auto-promote CPUs to deeper C-states. > > If so the promotion will happen over time, no? What I'm saying here is > that those CPUs which have been idle longer should not be favored when > it is time to select a CPU for a task to run. More recently idled CPUs > are more likely to be in a shallower C-state. > > > Something like that happens with package C-states that are only entered when > > all cores have entered a particular core C-state already. In that case the > > probability of the core being in a deeper state grows with time. > > Exactly what I'm saying. Right, I got that the other way around by mistake. > Also here it is worth remembering that the scheduling domains should > represent those packages that share common C-states at a higher level. > The scheduler can then be told not to balance across domains if it > doesn't need to in order to favor the conditions for those package > C-states to be used. That's what the task packing patch series is > about, independently of this one. > > > That said I would just drop this heuristics for the time being. 
If auto-promotion > > is disregarded, it doesn't really matter how much time the given CPU has been idle > > except for one case: When the target residency of its idle state hasn't been > > reached yet, waking up the CPU may be a mistake (depending on how deep the state > > actually is, but for the majority of drivers in the tree we don't have any measure > > of that). > > There is one reason for considering the time a CPU has been idle, > assuming equivalent C-state, and that is cache snooping. The longer a > CPU is idle, the more likely its cache content will have been claimed > and migrated by other CPUs. Of course that doesn't make much difference > for deeper C-states where the cache isn't preserved, but it is probably > simpler and cheaper to apply this heuristic in all cases. Yes, that sounds like it might be a reason, but I'd like to see numbers confirming that to be honest. > > > > + } > > > > + > > > > + /* The cpu is idle but the exit_latency is shorter */ > > > > + if (power && power->exit_latency < min_exit_latency) { > > > > + min_exit_latency = power->exit_latency; > > > > + cpu_idle = i; > > > > + continue; > > > > + } > > > > > > I think this is wrong. This gives priority to CPUs which have been idle > > > for a (longer... although this should have been) shorter period of time > > > over those with a shallower idle state. I think this should rather be: > > > > > > if (power && power->exit_latency < min_exit_latency) { > > > min_exit_latency = power->exit_latency; > > > latest_idle_stamp = idle_stamp; > > > cpu_idle = i; > > > } else if ((!power || power->exit_latency == min_exit_latency) && > > > idle_stamp > latest_idle_stamp) { > > > latest_idle_stamp = idle_stamp; > > > cpu_idle = i; > > > } > > > > > > So the CPU with the shallowest idle state is selected in priority, and > > > if many CPUs are in the same state then the time stamp is used to > > > select the most recent one. > > > > Again, if auto-promotion is disregarded, it doesn't really matter which of them > > is woken up. > > If it doesn't matter then it doesn't hurt. But in some cases it > matters. > > > > Whenever a shallower idle state is found then the latest_idle_stamp is reset for > > > that state even if it is further in the past. > > > > > > > + } else { > > > > + > > > > + load = weighted_cpuload(i); > > > > + > > > > + if (load < min_load || > > > > + (load == min_load && i == this_cpu)) { > > > > + min_load = load; > > > > + cpu_busy = i; > > > > + continue; > > > > + } > > > > } > > > > > > I think this is wrong to do an if-else based on idle_cpu() here. What > > > if a CPU is heavily loaded, but for some reason it happens to be idle at > > > this very moment? With your patch it could be selected as an idle CPU > > > while it would be discarded as being too busy otherwise. > > > > But see below -> > > > > > It is important to determine both cpu_busy and cpu_idle for all CPUs. > > > > > > And cpu_busy is a bad name for this. Something like least_loaded would > > > be more self explanatory. Same thing for cpu_idle which could be > > > clearer if named shalloest_idle. > > > > shallowest_idle? > > Something that means the CPU with the shallowest C-state. Using > "cpu_idle" for this variable doesn't cut it. Yes, that was about the typo above only. :-) > > > > - return idlest; > > > > + /* Busy cpus are considered less idle than idle cpus ;) */ > > > > + return cpu_busy != -1 ? 
cpu_busy : cpu_idle; > > > > > > And finally it is a policy decision whether or not we want to return > > > least_loaded over shallowest_idle e.g do we pack tasks on non idle CPUs > > > first or not. That in itself needs more investigation. To keep the > > > existing policy unchanged for now the above condition should have its > > > variables swapped. > > > > Which means that once we've find the first idle CPU, it is not useful to > > continue computing least_loaded, because we will return the idle one anyway, > > right? > > Good point. Currently, that should be the case. > > Eventually we'll want to put new tasks on lightly loaded CPUs instead of > waking up a fully idle CPU in order to favor deeper C-states. But that > requires a patch series of its own just to determine how loaded a CPU is > and how much work it can still accommodate before being oversubscribed, > etc. Wouldn't we need power consumption numbers for that realistically? Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Mar 28, 2014 at 01:29:56PM +0100, Daniel Lezcano wrote:
> @@ -4336,20 +4337,53 @@ static int
> find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> {
> 	unsigned long load, min_load = ULONG_MAX;
> -	int idlest = -1;
> +	unsigned int min_exit_latency = UINT_MAX;
> +	u64 idle_stamp, min_idle_stamp = ULONG_MAX;
> +
> +	struct rq *rq;
> +	struct cpuidle_power *power;
> +
> +	int cpu_idle = -1;
> +	int cpu_busy = -1;
> 	int i;
>
> 	/* Traverse only the allowed CPUs */
> 	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
> -		load = weighted_cpuload(i);
>
> -		if (load < min_load || (load == min_load && i == this_cpu)) {
> -			min_load = load;
> -			idlest = i;
> +		if (idle_cpu(i)) {
> +
> +			rq = cpu_rq(i);
> +			power = rq->power;
> +			idle_stamp = rq->idle_stamp;
> +
> +			/* The cpu is idle since a shorter time */
> +			if (idle_stamp < min_idle_stamp) {
> +				min_idle_stamp = idle_stamp;
> +				cpu_idle = i;
> +				continue;
> +			}
> +
> +			/* The cpu is idle but the exit_latency is shorter */
> +			if (power && power->exit_latency < min_exit_latency) {
> +				min_exit_latency = power->exit_latency;
> +				cpu_idle = i;
> +				continue;
> +			}

Aside from the arguments made by Nico (which I agree with), depending on the life time rules of the power object we might need smp_read_barrier_depends() between reading and using.

If all these objects are static and never change content we do not, if there's dynamic objects involved we probably should.
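[ For the lifetime concern Peter raises, the read side would need to order the dereference after the pointer load if rq->power can change at run time (for instance on cpuidle driver unregistration). A minimal sketch under that assumption, using the primitives of that era; rcu_dereference() would provide the same guarantee on an RCU-managed pointer. ]

	struct cpuidle_power *power;

	/*
	 * Snapshot the pointer once and keep the data dependency ordered:
	 * the dereference below must not be speculated before the load of
	 * rq->power.  The updater would publish the object with the
	 * matching write barrier (e.g. rcu_assign_pointer()).
	 */
	power = ACCESS_ONCE(rq->power);
	smp_read_barrier_depends();

	if (power && power->exit_latency < min_exit_latency)
		min_exit_latency = power->exit_latency;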
On 04/02/2014 05:05 AM, Nicolas Pitre wrote: > On Fri, 28 Mar 2014, Daniel Lezcano wrote: > >> As we know in which idle state the cpu is, we can investigate the following: >> >> 1. when did the cpu entered the idle state ? the longer the cpu is idle, the >> deeper it is idle >> 2. what exit latency is ? the greater the exit latency is, the deeper it is >> >> With both information, when all cpus are idle, we can choose the idlest cpu. >> >> When one cpu is not idle, the old check against weighted load applies. >> >> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> > > There seems to be some problems with the implementation. > >> @@ -4336,20 +4337,53 @@ static int >> find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu) >> { >> unsigned long load, min_load = ULONG_MAX; >> - int idlest = -1; >> + unsigned int min_exit_latency = UINT_MAX; >> + u64 idle_stamp, min_idle_stamp = ULONG_MAX; > > I don't think you really meant to assign an u64 variable with ULONG_MAX. > You probably want ULLONG_MAX here. And probably not in fact (more > later). > >> + >> + struct rq *rq; >> + struct cpuidle_power *power; >> + >> + int cpu_idle = -1; >> + int cpu_busy = -1; >> int i; >> >> /* Traverse only the allowed CPUs */ >> for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) { >> - load = weighted_cpuload(i); >> >> - if (load < min_load || (load == min_load && i == this_cpu)) { >> - min_load = load; >> - idlest = i; >> + if (idle_cpu(i)) { >> + >> + rq = cpu_rq(i); >> + power = rq->power; >> + idle_stamp = rq->idle_stamp; >> + >> + /* The cpu is idle since a shorter time */ >> + if (idle_stamp < min_idle_stamp) { >> + min_idle_stamp = idle_stamp; >> + cpu_idle = i; >> + continue; > > Don't you want the highest time stamp in order to select the most > recently idled CPU? Favoring the CPU which has been idle the longest > makes little sense. > >> + } >> + >> + /* The cpu is idle but the exit_latency is shorter */ >> + if (power && power->exit_latency < min_exit_latency) { >> + min_exit_latency = power->exit_latency; >> + cpu_idle = i; >> + continue; >> + } > > I think this is wrong. This gives priority to CPUs which have been idle > for a (longer... although this should have been) shorter period of time > over those with a shallower idle state. I think this should rather be: > > if (power && power->exit_latency < min_exit_latency) { > min_exit_latency = power->exit_latency; > latest_idle_stamp = idle_stamp; > cpu_idle = i; > } else if ((!power || power->exit_latency == min_exit_latency) && > idle_stamp > latest_idle_stamp) { > latest_idle_stamp = idle_stamp; > cpu_idle = i; > } > > So the CPU with the shallowest idle state is selected in priority, and > if many CPUs are in the same state then the time stamp is used to > select the most recent one. Whenever > a shallower idle state is found then the latest_idle_stamp is reset for > that state even if it is further in the past. > >> + } else { >> + >> + load = weighted_cpuload(i); >> + >> + if (load < min_load || >> + (load == min_load && i == this_cpu)) { >> + min_load = load; >> + cpu_busy = i; >> + continue; >> + } >> } > > I think this is wrong to do an if-else based on idle_cpu() here. What > if a CPU is heavily loaded, but for some reason it happens to be idle at > this very moment? With your patch it could be selected as an idle CPU > while it would be discarded as being too busy otherwise. > > It is important to determine both cpu_busy and cpu_idle for all CPUs. > > And cpu_busy is a bad name for this. 
Something like least_loaded would > be more self explanatory. Same thing for cpu_idle which could be > clearer if named shalloest_idle. > >> - return idlest; >> + /* Busy cpus are considered less idle than idle cpus ;) */ >> + return cpu_busy != -1 ? cpu_busy : cpu_idle; > > And finally it is a policy decision whether or not we want to return > least_loaded over shallowest_idle e.g do we pack tasks on non idle CPUs > first or not. That in itself needs more investigation. To keep the > existing policy unchanged for now the above condition should have its > variables swapped. Ok, refreshed the patchset but before sending it out I would to discuss about the rational of the changes and the policy, and change the patchset consequently. What order to choose if the cpu is idle ? Let's assume all cpus are idle on a dual socket quad core. Also, we can reasonably do the hypothesis if the cluster is in low power mode, the cpus belonging to the same cluster are in the same idle state (putting apart the auto-promote where we don't have control on). If the policy you talk above is 'aggressive power saving', we can follow the rules with decreasing priority: 1. We want to prevent to wakeup the entire cluster => as the cpus are in the same idle state, by choosing a cpu in shallow state, we should have the guarantee we won't wakeup a cluster (except if no shallowest idle cpu are found). 2. We want to prevent to wakeup a cpu which did not reach the target residency time (will need some work to unify cpuidle idle time and idle task run time) => with the target residency and, as a first step, with the idle stamp, we can determine if the cpu slept enough 3. We want to prevent to wakeup a cpu in deep idle state => by looking for the cpu in shallowest idle state 4. We want to prevent to wakeup a cpu where the exit latency is longer than the expected run time of the task (and the time to migrate the task ?) Concerning the policy, I would suggest to create an entry in /proc/sys/kernel/sched_power, where a couple of values could be performance - power saving (0 / 1). Does it make sense ? Any ideas ? Thanks -- Daniel
On Thu, Apr 17, 2014 at 03:53:32PM +0200, Daniel Lezcano wrote:
> Concerning the policy, I would suggest to create an entry in
> /proc/sys/kernel/sched_power, where a couple of values could be performance
> - power saving (0 / 1).

Ingo wanted a sched_balance_policy file with 3 values:

  "performance, power, auto"

Where the auto thing switches between them, initially based off of having AC or not.
On 04/17/2014 04:47 PM, Peter Zijlstra wrote: > On Thu, Apr 17, 2014 at 03:53:32PM +0200, Daniel Lezcano wrote: >> Concerning the policy, I would suggest to create an entry in >> /proc/sys/kernel/sched_power, where a couple of values could be performance >> - power saving (0 / 1). > > Ingo wanted a sched_balance_policy file with 3 values: > "performance, power, auto" > > Where the auto thing switches between them, initially based off of > having AC or not. oh, good. Thanks !
On Thu, 17 Apr 2014, Daniel Lezcano wrote:

> Ok, refreshed the patchset but before sending it out I would to discuss about
> the rational of the changes and the policy, and change the patchset
> consequently.
>
> What order to choose if the cpu is idle ?
>
> Let's assume all cpus are idle on a dual socket quad core.
>
> Also, we can reasonably do the hypothesis if the cluster is in low power mode,
> the cpus belonging to the same cluster are in the same idle state (putting
> apart the auto-promote where we don't have control on).
>
> If the policy you talk above is 'aggressive power saving', we can follow the
> rules with decreasing priority:
>
> 1. We want to prevent to wakeup the entire cluster
>    => as the cpus are in the same idle state, by choosing a cpu in shallow
>    state, we should have the guarantee we won't wakeup a cluster (except if no
>    shallowest idle cpu are found).

This is unclear to me. Obviously, if an entire cluster is down, that means all the CPUs it contains have been idle for a long time. And therefore they shouldn't be subject to selection unless there is no other CPUs available. Is that what you mean?

> 2. We want to prevent to wakeup a cpu which did not reach the target residency
>    time (will need some work to unify cpuidle idle time and idle task run time)
>    => with the target residency and, as a first step, with the idle stamp,
>    we can determine if the cpu slept enough

Agreed. However, right now, the scheduler does not have any consideration for that. So this should be done as a separate patch.

> 3. We want to prevent to wakeup a cpu in deep idle state
>    => by looking for the cpu in shallowest idle state

Obvious.

> 4. We want to prevent to wakeup a cpu where the exit latency is longer than
>    the expected run time of the task (and the time to migrate the task ?)

Sure. That would be a case for using task packing even if the policy is set to performance rather than powersave whereas task packing is normally for powersave.


Nicolas
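[ Illustrative only: how rules 1 to 3 above could combine in the idle-CPU scan once the idle times are unified. This sketch assumes the per-rq power structure from this series also carries the state's target_residency, and that idle_stamp and target_residency have been converted to the same unit (today they have not: idle_stamp is in nanoseconds, cpuidle residencies in microseconds). Rule 4 is left out because the scheduler has no estimate of the expected task run time yet. ]

		if (idle_cpu(i)) {
			struct rq *rq = cpu_rq(i);
			struct cpuidle_power *power = rq->power;

			if (power) {
				u64 idled_for = rq_clock(rq) - rq->idle_stamp;

				/* Rule 2: let the CPU reach its target residency first */
				if (idled_for < power->target_residency)
					continue;

				/*
				 * Rules 1 and 3: prefer the shallowest C-state.  Since
				 * every CPU of a powered-down cluster sits in a deep
				 * state, this also avoids waking a whole cluster while
				 * a shallower idle CPU exists elsewhere.
				 */
				if (power->exit_latency < min_exit_latency) {
					min_exit_latency = power->exit_latency;
					shallowest_idle = i;
				}
			}
		}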
On 04/17/2014 05:53 PM, Nicolas Pitre wrote: > On Thu, 17 Apr 2014, Daniel Lezcano wrote: > >> Ok, refreshed the patchset but before sending it out I would to discuss about >> the rational of the changes and the policy, and change the patchset >> consequently. >> >> What order to choose if the cpu is idle ? >> >> Let's assume all cpus are idle on a dual socket quad core. >> >> Also, we can reasonably do the hypothesis if the cluster is in low power mode, >> the cpus belonging to the same cluster are in the same idle state (putting >> apart the auto-promote where we don't have control on). >> >> If the policy you talk above is 'aggressive power saving', we can follow the >> rules with decreasing priority: >> >> 1. We want to prevent to wakeup the entire cluster >> => as the cpus are in the same idle state, by choosing a cpu in >> => shallow >> state, we should have the guarantee we won't wakeup a cluster (except if no >> shallowest idle cpu are found). > > This is unclear to me. Obviously, if an entire cluster is down, that > means all the CPUs it contains have been idle for a long time. And > therefore they shouldn't be subject to selection unless there is no > other CPUs available. Is that what you mean? Yes, this is what I meant. But also what I meant is we can get rid for the moment of the cpu topology and the coupling idle state because if we do this described approach, as the idle state will be the same for the cpus belonging to the same cluster we won't select a cluster down (except if there is no other CPUs available). >> 2. We want to prevent to wakeup a cpu which did not reach the target residency >> time (will need some work to unify cpuidle idle time and idle task run time) >> => with the target residency and, as a first step, with the idle >> => stamp, >> we can determine if the cpu slept enough > > Agreed. However, right now, the scheduler does not have any > consideration for that. So this should be done as a separate patch. Yes, I thought as a very first step we can rely on the idle stamp until we unify the times with a big comment. Or I can first unify the idle times and then take into account the target residency. It is to comply with Rafael's request to have the 'big picture'. >> 3. We want to prevent to wakeup a cpu in deep idle state >> => by looking for the cpu in shallowest idle state > > Obvious. > >> 4. We want to prevent to wakeup a cpu where the exit latency is longer than >> the expected run time of the task (and the time to migrate the task ?) > > Sure. That would be a case for using task packing even if the policy is > set to performance rather than powersave whereas task packing is > normally for powersave. Yes, I agree, task packing improves also the performances and it makes really sense to prevent task migration under some circumstances for a better cache efficiency. Thanks for the comments -- Daniel
On Thu, 17 Apr 2014, Daniel Lezcano wrote: > On 04/17/2014 05:53 PM, Nicolas Pitre wrote: > > On Thu, 17 Apr 2014, Daniel Lezcano wrote: > > > > > Ok, refreshed the patchset but before sending it out I would to discuss > > > about > > > the rational of the changes and the policy, and change the patchset > > > consequently. > > > > > > What order to choose if the cpu is idle ? > > > > > > Let's assume all cpus are idle on a dual socket quad core. > > > > > > Also, we can reasonably do the hypothesis if the cluster is in low power > > > mode, > > > the cpus belonging to the same cluster are in the same idle state (putting > > > apart the auto-promote where we don't have control on). > > > > > > If the policy you talk above is 'aggressive power saving', we can follow > > > the > > > rules with decreasing priority: > > > > > > 1. We want to prevent to wakeup the entire cluster > > > => as the cpus are in the same idle state, by choosing a cpu in > > > => shallow > > > state, we should have the guarantee we won't wakeup a cluster (except if > > > no > > > shallowest idle cpu are found). > > > > This is unclear to me. Obviously, if an entire cluster is down, that > > means all the CPUs it contains have been idle for a long time. And > > therefore they shouldn't be subject to selection unless there is no > > other CPUs available. Is that what you mean? > > Yes, this is what I meant. But also what I meant is we can get rid for the > moment of the cpu topology and the coupling idle state because if we do this > described approach, as the idle state will be the same for the cpus belonging > to the same cluster we won't select a cluster down (except if there is no > other CPUs available). CPU topology is needed to properly describe scheduling domains. Whether we balance across domains or pack using as few domains as possible is a separate issue. In other words, you shouldn't have to care in this patch series. And IMHO coupled C-state is a low-level mechanism that should remain private to cpuidle which the scheduler shouldn't be aware of. > > > 2. We want to prevent to wakeup a cpu which did not reach the target > > > residency > > > time (will need some work to unify cpuidle idle time and idle task run > > > time) > > > => with the target residency and, as a first step, with the idle > > > => stamp, > > > we can determine if the cpu slept enough > > > > Agreed. However, right now, the scheduler does not have any > > consideration for that. So this should be done as a separate patch. > > Yes, I thought as a very first step we can rely on the idle stamp until we > unify the times with a big comment. Or I can first unify the idle times and > then take into account the target residency. It is to comply with Rafael's > request to have the 'big picture'. I agree, but that should be done incrementally. Even without this consideration, what you proposed is already an improvement over the current state of affairs. Nicolas -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* Daniel Lezcano <daniel.lezcano@linaro.org> wrote:

> On 04/17/2014 04:47 PM, Peter Zijlstra wrote:
> > On Thu, Apr 17, 2014 at 03:53:32PM +0200, Daniel Lezcano wrote:
> > > Concerning the policy, I would suggest to create an entry in
> > > /proc/sys/kernel/sched_power, where a couple of values could be performance
> > > - power saving (0 / 1).
> >
> > Ingo wanted a sched_balance_policy file with 3 values:
> >   "performance, power, auto"
> >
> > Where the auto thing switches between them, initially based off of
> > having AC or not.
>
> oh, good. Thanks !

Also, 'auto' should be the default, because the kernel doing TRT is really what users want.

Userspace can sill tweak it all and make it all user-space controlled, by flipping between 'performance' and 'power'. (and those modes are also helpful for development and debugging.)

Thanks,

	Ingo
On 04/18/2014 10:09 AM, Ingo Molnar wrote: > > * Daniel Lezcano <daniel.lezcano@linaro.org> wrote: > >> On 04/17/2014 04:47 PM, Peter Zijlstra wrote: >>> On Thu, Apr 17, 2014 at 03:53:32PM +0200, Daniel Lezcano wrote: >>>> Concerning the policy, I would suggest to create an entry in >>>> /proc/sys/kernel/sched_power, where a couple of values could be performance >>>> - power saving (0 / 1). >>> >>> Ingo wanted a sched_balance_policy file with 3 values: >>> "performance, power, auto" >>> >>> Where the auto thing switches between them, initially based off of >>> having AC or not. >> >> oh, good. Thanks ! > > Also, 'auto' should be the default, because the kernel doing TRT is > really what users want. > > Userspace can sill tweak it all and make it all user-space controlled, > by flipping between 'performance' and 'power'. (and those modes are > also helpful for development and debugging.) Copy that. Thanks ! -- Daniel
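[ A minimal sketch of what such a three-valued knob could look like. The enum, the variable name and the on_ac_power() helper are all made up for illustration; no such interface exists in the series under discussion. ]

enum sched_balance_policy {
	SCHED_BALANCE_PERFORMANCE,
	SCHED_BALANCE_POWER,
	SCHED_BALANCE_AUTO,		/* default, per Ingo's remark above */
};

static enum sched_balance_policy sysctl_sched_balance_policy = SCHED_BALANCE_AUTO;

/*
 * "auto" resolves to one of the two explicit modes; the initial
 * heuristic discussed above is simply whether the machine runs on AC.
 * on_ac_power() is a placeholder, not a real kernel helper.
 */
static enum sched_balance_policy sched_effective_balance_policy(void)
{
	if (sysctl_sched_balance_policy != SCHED_BALANCE_AUTO)
		return sysctl_sched_balance_policy;

	return on_ac_power() ? SCHED_BALANCE_PERFORMANCE : SCHED_BALANCE_POWER;
}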
On Thu, Apr 17, 2014 at 12:21:28PM -0400, Nicolas Pitre wrote:
> CPU topology is needed to properly describe scheduling domains. Whether
> we balance across domains or pack using as few domains as possible is a
> separate issue. In other words, you shouldn't have to care in this
> patch series.
>
> And IMHO coupled C-state is a low-level mechanism that should remain
> private to cpuidle which the scheduler shouldn't be aware of.

I'm confused.. why wouldn't you want to expose these?
On 04/18/2014 11:38 AM, Peter Zijlstra wrote: > On Thu, Apr 17, 2014 at 12:21:28PM -0400, Nicolas Pitre wrote: >> CPU topology is needed to properly describe scheduling domains. Whether >> we balance across domains or pack using as few domains as possible is a >> separate issue. In other words, you shouldn't have to care in this >> patch series. >> >> And IMHO coupled C-state is a low-level mechanism that should remain >> private to cpuidle which the scheduler shouldn't be aware of. > > I'm confused.. why wouldn't you want to expose these? The couple C-state is used as a mechanism for cpuidle to sync the cpus when entering a specific c-state. This mechanism is usually used to handle the cluster power down. It is only used for a two drivers (soon three) but it is not the only mechanism used for syncing the cpus. There are also the MCPM (tc2), the hand made sync when the hardware allows it (ux500), and an abstraction from the firmware (mwait), transparent to the kernel. Taking into account the couple c-state only does not make sense because of the other mechanisms above. This is why it should stay inside the cpuidle framework. The extension of the cpu topology will provide a generic way to describe and abstracting such dependencies. Does it answer your question ? -- Daniel
On Fri, Apr 18, 2014 at 02:13:48PM +0200, Daniel Lezcano wrote: > On 04/18/2014 11:38 AM, Peter Zijlstra wrote: > >On Thu, Apr 17, 2014 at 12:21:28PM -0400, Nicolas Pitre wrote: > >>CPU topology is needed to properly describe scheduling domains. Whether > >>we balance across domains or pack using as few domains as possible is a > >>separate issue. In other words, you shouldn't have to care in this > >>patch series. > >> > >>And IMHO coupled C-state is a low-level mechanism that should remain > >>private to cpuidle which the scheduler shouldn't be aware of. > > > >I'm confused.. why wouldn't you want to expose these? > > The couple C-state is used as a mechanism for cpuidle to sync the cpus when > entering a specific c-state. This mechanism is usually used to handle the > cluster power down. It is only used for a two drivers (soon three) but it is > not the only mechanism used for syncing the cpus. There are also the MCPM > (tc2), the hand made sync when the hardware allows it (ux500), and an > abstraction from the firmware (mwait), transparent to the kernel. > > Taking into account the couple c-state only does not make sense because of > the other mechanisms above. This is why it should stay inside the cpuidle > framework. > > The extension of the cpu topology will provide a generic way to describe and > abstracting such dependencies. > > Does it answer your question ? I suppose so; its still a bit like we won't but we will :-) So we _will_ actually expose coupled C states through the topology bits, that's good. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 04/18/2014 02:53 PM, Peter Zijlstra wrote: > On Fri, Apr 18, 2014 at 02:13:48PM +0200, Daniel Lezcano wrote: >> On 04/18/2014 11:38 AM, Peter Zijlstra wrote: >>> On Thu, Apr 17, 2014 at 12:21:28PM -0400, Nicolas Pitre wrote: >>>> CPU topology is needed to properly describe scheduling domains. Whether >>>> we balance across domains or pack using as few domains as possible is a >>>> separate issue. In other words, you shouldn't have to care in this >>>> patch series. >>>> >>>> And IMHO coupled C-state is a low-level mechanism that should remain >>>> private to cpuidle which the scheduler shouldn't be aware of. >>> >>> I'm confused.. why wouldn't you want to expose these? >> >> The couple C-state is used as a mechanism for cpuidle to sync the cpus when >> entering a specific c-state. This mechanism is usually used to handle the >> cluster power down. It is only used for a two drivers (soon three) but it is >> not the only mechanism used for syncing the cpus. There are also the MCPM >> (tc2), the hand made sync when the hardware allows it (ux500), and an >> abstraction from the firmware (mwait), transparent to the kernel. >> >> Taking into account the couple c-state only does not make sense because of >> the other mechanisms above. This is why it should stay inside the cpuidle >> framework. >> >> The extension of the cpu topology will provide a generic way to describe and >> abstracting such dependencies. >> >> Does it answer your question ? > > I suppose so; its still a bit like we won't but we will :-) > > So we _will_ actually expose coupled C states through the topology bits, > that's good. Ah, ok. I think I understood where the confusion is coming from. A couple of definitions for the same thing :) 1. Coupled C-states : *mechanism* implemented in the cpuidle framework: drivers/cpuidle/coupled.c 2. Coupled C-states : *constraint* to reach a cluster power down state, will be described through the topology and could be implemented by different mechanism (MCPM, handmade sync, cpuidle-coupled-c-state, firmware). We want to expose 2. not 1. to the scheduler.
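[ A sketch of how point 2. could be surfaced to the scheduler, modelled on the sched_domain topology rework and the SD_SHARE_POWERDOMAIN flag being discussed around the same time. The level names, masks and flags callback are illustrative, not a committed interface. ]

/*
 * "These CPUs can only power down together" expressed as a property of
 * the scheduling domain spanning one cluster.  The scheduler sees the
 * constraint without knowing whether MCPM, coupled C-states or firmware
 * enforces it.
 */
static inline int cpu_corepower_flags(void)
{
	return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
}

static struct sched_domain_topology_level arm_topology[] = {
#ifdef CONFIG_SCHED_MC
	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};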
On Fri, 18 Apr 2014, Daniel Lezcano wrote:

> On 04/18/2014 02:53 PM, Peter Zijlstra wrote:
> > I suppose so; its still a bit like we won't but we will :-)
> >
> > So we _will_ actually expose coupled C states through the topology bits,
> > that's good.
>
> Ah, ok. I think I understood where the confusion is coming from.
>
> A couple of definitions for the same thing :)
>
> 1. Coupled C-states : *mechanism* implemented in the cpuidle framework:
>    drivers/cpuidle/coupled.c
>
> 2. Coupled C-states : *constraint* to reach a cluster power down state, will
>    be described through the topology and could be implemented by different
>    mechanism (MCPM, handmade sync, cpuidle-coupled-c-state, firmware).
>
> We want to expose 2. not 1. to the scheduler.

I couldn't explain it better. Sorry for creating confusion.


Nicolas
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 16042b5..068e503 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -23,6 +23,7 @@
 #include <linux/latencytop.h>
 #include <linux/sched.h>
 #include <linux/cpumask.h>
+#include <linux/cpuidle.h>
 #include <linux/slab.h>
 #include <linux/profile.h>
 #include <linux/interrupt.h>
@@ -4336,20 +4337,53 @@ static int
 find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 {
 	unsigned long load, min_load = ULONG_MAX;
-	int idlest = -1;
+	unsigned int min_exit_latency = UINT_MAX;
+	u64 idle_stamp, min_idle_stamp = ULONG_MAX;
+
+	struct rq *rq;
+	struct cpuidle_power *power;
+
+	int cpu_idle = -1;
+	int cpu_busy = -1;
 	int i;
 
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
-		load = weighted_cpuload(i);
 
-		if (load < min_load || (load == min_load && i == this_cpu)) {
-			min_load = load;
-			idlest = i;
+		if (idle_cpu(i)) {
+
+			rq = cpu_rq(i);
+			power = rq->power;
+			idle_stamp = rq->idle_stamp;
+
+			/* The cpu is idle since a shorter time */
+			if (idle_stamp < min_idle_stamp) {
+				min_idle_stamp = idle_stamp;
+				cpu_idle = i;
+				continue;
+			}
+
+			/* The cpu is idle but the exit_latency is shorter */
+			if (power && power->exit_latency < min_exit_latency) {
+				min_exit_latency = power->exit_latency;
+				cpu_idle = i;
+				continue;
+			}
+		} else {
+
+			load = weighted_cpuload(i);
+
+			if (load < min_load ||
+			    (load == min_load && i == this_cpu)) {
+				min_load = load;
+				cpu_busy = i;
+				continue;
+			}
 		}
 	}
 
-	return idlest;
+	/* Busy cpus are considered less idle than idle cpus ;) */
+	return cpu_busy != -1 ? cpu_busy : cpu_idle;
 }
 
 /*
As we know in which idle state the cpu is, we can investigate the following:

1. when did the cpu entered the idle state ? the longer the cpu is idle, the
   deeper it is idle
2. what exit latency is ? the greater the exit latency is, the deeper it is

With both information, when all cpus are idle, we can choose the idlest cpu.

When one cpu is not idle, the old check against weighted load applies.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
 kernel/sched/fair.c | 46 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 40 insertions(+), 6 deletions(-)