cpuidle: arm_big_little: route target residency to mcpm

Message ID	1368471222-26434-2-git-send-email-sebastian.capella@linaro.org
State	New
Headers	show Return-Path: <patchwork-forward+bncBDHNDAGIUACRBS7NYSGAKGQECJ3R7GA@linaro.org> MIME-Version: 1.0 Received-SPF: neutral (google.com: 209.85.220.180 is neither permitted nor denied by best guess record for domain of patch+caf_=patchwork-forward=linaro.org@linaro.org) client-ip=209.85.220.180; Received-SPF: neutral (google.com: 2607:f8b0:400e:c00::236 is neither permitted nor denied by best guess record for domain of sebastian.capella@linaro.org) client-ip=2607:f8b0:400e:c00::236; From: Sebastian Capella <sebastian.capella@linaro.org> To: lorenzo.pieralisi@arm.com, daniel.lezcano@linaro.org Cc: linaro-dev@lists.linaro.org, patches@linaro.org, linaro-kernel@lists.linaro.org, mark.hambleton@broadcom.com, jamesk@broadcom.com, Sebastian Capella <sebastian.capella@linaro.org> Subject: [PATCH] cpuidle: arm_big_little: route target residency to mcpm Date: Mon, 13 May 2013 11:53:42 -0700 Message-Id: <1368471222-26434-2-git-send-email-sebastian.capella@linaro.org> In-Reply-To: <1368471222-26434-1-git-send-email-sebastian.capella@linaro.org> References: <1368471222-26434-1-git-send-email-sebastian.capella@linaro.org> Precedence: list Mailing-list: list patchwork-forward@linaro.org; contact patchwork-forward+owners@linaro.org

Sebastian Capella May 13, 2013, 6:53 p.m. UTC

Pass residency information to the mcpm_cpu_suspend.  The information
is taken from the target_residency of the intended C-state.

When a platform uses multiple powerdown cstates, the residency information
indicates which powerdown state is targeted.  Multiple powerdown cstate
information can be maintained in the device tree and the vendor specific
handling will then have enough information to determine what power state to
enter without needing additional changes to the big_little framework.

Signed-off-by: Sebastian Capella <sebastian.capella@linaro.org>
---
 drivers/cpuidle/arm_big_little.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Liviu Dudau May 15, 2013, 3:24 p.m. UTC | #1

Hi Sebastian,

On Mon, May 13, 2013 at 07:53:42PM +0100, Sebastian Capella wrote:
> Pass residency information to the mcpm_cpu_suspend.  The information
> is taken from the target_residency of the intended C-state.
>
> When a platform uses multiple powerdown cstates, the residency information
> indicates which powerdown state is targeted.  Multiple powerdown cstate
> information can be maintained in the device tree and the vendor specific
> handling will then have enough information to determine what power state to
> enter without needing additional changes to the big_little framework.
>
> Signed-off-by: Sebastian Capella <sebastian.capella@linaro.org>
> ---
>  drivers/cpuidle/arm_big_little.c |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cpuidle/arm_big_little.c b/drivers/cpuidle/arm_big_little.c
> index a430800..8332b05 100644
> --- a/drivers/cpuidle/arm_big_little.c
> +++ b/drivers/cpuidle/arm_big_little.c

I could not find a branch that contains this file. Which git tree and branch
are you using?

> @@ -89,7 +89,7 @@ static int notrace bl_powerdown_finisher(unsigned long arg)
>       unsigned int cpu = mpidr & 0xf;
>
>       mcpm_set_entry_vector(cpu, cluster, cpu_resume);
> -     mcpm_cpu_suspend(0);  /* 0 should be replaced with better value here */
> +     mcpm_cpu_suspend(arg);
>       return 1;
>  }
>
> @@ -107,6 +107,7 @@ static int bl_enter_powerdown(struct cpuidle_device *dev,
>  {
>       struct timespec ts_preidle, ts_postidle, ts_idle;
>       int ret;
> +     struct cpuidle_state *state = &drv->states[idx];
>
>       /* Used to keep track of the total time in idle */
>       getnstimeofday(&ts_preidle);
> @@ -117,7 +118,8 @@ static int bl_enter_powerdown(struct cpuidle_device *dev,
>
>       clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);
>
> -     ret = cpu_suspend((unsigned long) dev, bl_powerdown_finisher);
> +     ret = cpu_suspend((unsigned long) state->target_residency,
> +                                     bl_powerdown_finisher);

I don't think you should pass the target residency here but the intended
C-state. Think about what will happen when you run this in a guest kernel: is
the target_residency the same if the guest has been migrated from a big core
that might have a faster execution of the down/up path to a little core that
is slower? The intended C-state should stay the same, regardless of the actual
time it takes to get there and out, which affects the actual time spent inside
the state.

Best regards,
Liviu

Daniel Lezcano May 15, 2013, 3:47 p.m. UTC | #2

On 05/15/2013 05:24 PM, Liviu Dudau wrote:
> Hi Sebastian,
> 
> On Mon, May 13, 2013 at 07:53:42PM +0100, Sebastian Capella wrote:
>> Pass residency information to the mcpm_cpu_suspend.  The information
>> is taken from the target_residency of the intended C-state.
>>
>> When a platform uses multiple powerdown cstates, the residency information
>> indicates which powerdown state is targeted.  Multiple powerdown cstate
>> information can be maintained in the device tree and the vendor specific
>> handling will then have enough information to determine what power state to
>> enter without needing additional changes to the big_little framework.
>>
>> Signed-off-by: Sebastian Capella <sebastian.capella@linaro.org>
>> ---
>>  drivers/cpuidle/arm_big_little.c |    6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/cpuidle/arm_big_little.c b/drivers/cpuidle/arm_big_little.c
>> index a430800..8332b05 100644
>> --- a/drivers/cpuidle/arm_big_little.c
>> +++ b/drivers/cpuidle/arm_big_little.c
> 
> I could not find a branch that contains this file. Which git tree and branch
> are you using?

I believe it should apply to:

https://git.linaro.org/gitweb?p=landing-teams/working/arm/kernel.git;a=blob;f=drivers/cpuidle/arm_big_little.c;h=a430800d4a74c8c5b759cd34fb5e272d8cf8f8a3;hb=67b5adf6a402921655d337d52845d6d48c6ef2d2#l66

> 
>> @@ -89,7 +89,7 @@ static int notrace bl_powerdown_finisher(unsigned long arg)
>>       unsigned int cpu = mpidr & 0xf;
>>
>>       mcpm_set_entry_vector(cpu, cluster, cpu_resume);
>> -     mcpm_cpu_suspend(0);  /* 0 should be replaced with better value here */
>> +     mcpm_cpu_suspend(arg);
>>       return 1;
>>  }
>>
>> @@ -107,6 +107,7 @@ static int bl_enter_powerdown(struct cpuidle_device *dev,
>>  {
>>       struct timespec ts_preidle, ts_postidle, ts_idle;
>>       int ret;
>> +     struct cpuidle_state *state = &drv->states[idx];
>>
>>       /* Used to keep track of the total time in idle */
>>       getnstimeofday(&ts_preidle);
>> @@ -117,7 +118,8 @@ static int bl_enter_powerdown(struct cpuidle_device *dev,
>>
>>       clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);
>>
>> -     ret = cpu_suspend((unsigned long) dev, bl_powerdown_finisher);
>> +     ret = cpu_suspend((unsigned long) state->target_residency,
>> +                                     bl_powerdown_finisher);
> 
> I don't think you should pass the target residency here but the intended
> C-state. Think about what will happen when you run this in a guest kernel: is
> the target_residency the same if the guest has been migrated from a big core
> that might have a faster execution of the down/up path to a little core that
> is slower? The intended C-state should stay the same, regardless of the actual
> time it takes to get there and out, which affects the actual time spent inside
> the state.
> 
> Best regards,
> Liviu
>

Sebastian Capella May 15, 2013, 4:49 p.m. UTC | #3

Thanks Daniel!

Liviu,

I have been using on the linux-linaro branch in the linux-linaro-tracking
repository here:

https://git.linaro.org/gitweb?p=kernel/linux-linaro-tracking.git;a=shortlog;h=refs/heads/linux-linaro

Sorry for missing that.

Thanks!

Sebastian


On 15 May 2013 08:47, Daniel Lezcano <daniel.lezcano@linaro.org> wrote:

> On 05/15/2013 05:24 PM, Liviu Dudau wrote:
> > Hi Sebastian,
> >
> > On Mon, May 13, 2013 at 07:53:42PM +0100, Sebastian Capella wrote:
> >> Pass residency information to the mcpm_cpu_suspend.  The information
> >> is taken from the target_residency of the intended C-state.
> >>
> >> When a platform uses multiple powerdown cstates, the residency
> information
> >> indicates which powerdown state is targeted.  Multiple powerdown cstate
> >> information can be maintained in the device tree and the vendor specific
> >> handling will then have enough information to determine what power
> state to
> >> enter without needing additional changes to the big_little framework.
> >>
> >> Signed-off-by: Sebastian Capella <sebastian.capella@linaro.org>
> >> ---
> >>  drivers/cpuidle/arm_big_little.c |    6 ++++--
> >>  1 file changed, 4 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/cpuidle/arm_big_little.c
> b/drivers/cpuidle/arm_big_little.c
> >> index a430800..8332b05 100644
> >> --- a/drivers/cpuidle/arm_big_little.c
> >> +++ b/drivers/cpuidle/arm_big_little.c
> >
> > I could not find a branch that contains this file. Which git tree and
> branch
> > are you using?
>
> I believe it should apply to:
>
>
> https://git.linaro.org/gitweb?p=landing-teams/working/arm/kernel.git;a=blob;f=drivers/cpuidle/arm_big_little.c;h=a430800d4a74c8c5b759cd34fb5e272d8cf8f8a3;hb=67b5adf6a402921655d337d52845d6d48c6ef2d2#l66
>
> >
> >> @@ -89,7 +89,7 @@ static int notrace bl_powerdown_finisher(unsigned
> long arg)
> >>       unsigned int cpu = mpidr & 0xf;
> >>
> >>       mcpm_set_entry_vector(cpu, cluster, cpu_resume);
> >> -     mcpm_cpu_suspend(0);  /* 0 should be replaced with better value
> here */
> >> +     mcpm_cpu_suspend(arg);
> >>       return 1;
> >>  }
> >>
> >> @@ -107,6 +107,7 @@ static int bl_enter_powerdown(struct cpuidle_device
> *dev,
> >>  {
> >>       struct timespec ts_preidle, ts_postidle, ts_idle;
> >>       int ret;
> >> +     struct cpuidle_state *state = &drv->states[idx];
> >>
> >>       /* Used to keep track of the total time in idle */
> >>       getnstimeofday(&ts_preidle);
> >> @@ -117,7 +118,8 @@ static int bl_enter_powerdown(struct cpuidle_device
> *dev,
> >>
> >>       clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);
> >>
> >> -     ret = cpu_suspend((unsigned long) dev, bl_powerdown_finisher);
> >> +     ret = cpu_suspend((unsigned long) state->target_residency,
> >> +                                     bl_powerdown_finisher);
> >
> > I don't think you should pass the target residency here but the intended
> > C-state. Think about what will happen when you run this in a guest
> kernel: is
> > the target_residency the same if the guest has been migrated from a big
> core
> > that might have a faster execution of the down/up path to a little core
> that
> > is slower? The intended C-state should stay the same, regardless of the
> actual
> > time it takes to get there and out, which affects the actual time spent
> inside
> > the state.
> >
> > Best regards,
> > Liviu
> >
>
>
> --
>  <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
>
> Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
>
>

Jon Medhurst (Tixy) May 15, 2013, 5:07 p.m. UTC | #4

On Wed, 2013-05-15 at 09:49 -0700, Sebastian Capella wrote:
> Thanks Daniel!
> 
> Liviu,
> 
> I have been using on the linux-linaro branch in the linux-linaro-tracking
> repository here:
> 
> https://git.linaro.org/gitweb?p=kernel/linux-linaro-tracking.git;a=shortlog;h=refs/heads/linux-linaro
> 

Generally, that's the Linaro kernel tree people should use and what is
built daily and released monthly.

It's just it hasn't moved to 3.10 yet (will do in the next day or so)
but the topic branches which feed into it (that Liviu pointed out) have
already made that move.

Sebastian Capella May 15, 2013, 6:05 p.m. UTC | #5

Hi Liviu,

Regarding your comments about using the C-state instead of the residency,
we based off of the existing mcpm_suspend call which currently takes
residency (with a 0 meaning lowest power).

We used calls (including mcpm_suspend) in the hot plug/suspend path.
 However, it does not know about c-states.  I suspect others may want to do
the same.  Do you know how suspend is done on tc2?

Regarding guest kernels, I don't think I understand the implications.  If
we migrate between cores (having different parameters) in the middle of a
cstate transition, can we have correct behavior?  Wouldn't it be worse to
migrate to a lower c-state then we had intended?

Thanks,

Sebastian

On 15 May 2013 10:07, Jon Medhurst (Tixy) <tixy@linaro.org> wrote:

> On Wed, 2013-05-15 at 09:49 -0700, Sebastian Capella wrote:
> > Thanks Daniel!
> >
> > Liviu,
> >
> > I have been using on the linux-linaro branch in the linux-linaro-tracking
> > repository here:
> >
> >
> https://git.linaro.org/gitweb?p=kernel/linux-linaro-tracking.git;a=shortlog;h=refs/heads/linux-linaro
> >
>
> Generally, that's the Linaro kernel tree people should use and what is
> built daily and released monthly.
>
> It's just it hasn't moved to 3.10 yet (will do in the next day or so)
> but the topic branches which feed into it (that Liviu pointed out) have
> already made that move.
>
> --
> Tixy
>
>

Liviu Dudau May 16, 2013, 1:40 p.m. UTC | #6

On Wed, May 15, 2013 at 07:05:10PM +0100, Sebastian Capella wrote:
> Hi Liviu,
> 
> Regarding your comments about using the C-state instead of the residency, we based off of the existing mcpm_suspend call which currently takes residency (with a 0 meaning lowest power).
> 
> We used calls (including mcpm_suspend) in the hot plug/suspend path.  However, it does not know about c-states.  I suspect others may want to do the same.  Do you know how suspend is done on tc2?
> 
> Regarding guest kernels, I don't think I understand the implications.  If we migrate between cores (having different parameters) in the middle of a cstate transition, can we have correct behavior?  Wouldn't it be worse to migrate to a lower c-state then we had intended?
> 
> Thanks,
> 
> Sebastian
> 
> 
> On 15 May 2013 10:07, Jon Medhurst (Tixy) <tixy@linaro.org<mailto:tixy@linaro.org>> wrote:
> On Wed, 2013-05-15 at 09:49 -0700, Sebastian Capella wrote:
> > Thanks Daniel!
> >
> > Liviu,
> >
> > I have been using on the linux-linaro branch in the linux-linaro-tracking
> > repository here:
> >
> > https://git.linaro.org/gitweb?p=kernel/linux-linaro-tracking.git;a=shortlog;h=refs/heads/linux-linaro
> >
> 
> Generally, that's the Linaro kernel tree people should use and what is
> built daily and released monthly.
> 
> It's just it hasn't moved to 3.10 yet (will do in the next day or so)
> but the topic branches which feed into it (that Liviu pointed out) have
> already made that move.
> 
> --
> Tixy
> 
> 

Hi Sebastian,

From previous discussions between Achin, Charles and Nico I am aware that Nico has decided for the moment that target residency should be useful enough to be used by MCPM. That is because Nico is a big proponent of doing everything in the kernel and keeping the firmware dumb and (mostly) out of the way. However, the view that we have here at ARM (but I will only speak in my name here) is that in order to have alignment with AArch64 kernel and the way it is using PSCI interface, we should be moving the kernel on AArch32 and armv7a to run in non-secure mode. At that time, the kernel will make PSCI calls to do CPU_ON, CPU_SUSPEND, etc. and the aim is to provide to the firmware the deepest C-state that the core can support going to without being woken up to do any additional state management. It is then the latitude of the firmware to put the core in that state or to tally the sum of all requests in a cluster and decide to put the cores and the cluster in the lowest common C-state.

Regarding the migration of the guest kernels, it should be transparent (to a certain extent) wether on resume it is running on the same core or it has been migrated. The host OS should have a better understanding on what can be achieved and what invariants it can still hold, but it should not be limited to do that in a specific amount of time. Lets take an example: one core in the cluster says that it can go as deep as cluster shutdown but it does so in your use of the API by saying that it would like to sleep for at least amount X of time. The host however has to tally all the cores in the cluster in order to decide if the cluster can be shutdown, has to do a lot of cache maintainance and state saving, turning off clocks and devices etc, and in doing so is going to consume some compute cycles; it will then substract the time spent making a decision and doing the cleanup and then figure out if there is still time left for each of the cores to go to sleep for the specified amount of time. All this implies that the guest has to have an understanding of the time the host is spending in doing maintainance operations before asking the hypervisor for a target residency and the host still has to do the math again to validate that the guest request is still valid.

If we choose to use the target C-state, the request validation is simplified to a comparision between each core target C-state and the lowest common C-state per cluster, all done in the host.

Of course, by describing C-states in terms of target residency times both schemes can be considered equivalent. But that target residency time is not constant for all code paths and for all conditions and that makes the decision process more complicated.

Hope that provides some clarification.

Best regards,
Liviu

Liviu Dudau May 16, 2013, 3:59 p.m. UTC | #7

On Wed, May 15, 2013 at 07:05:10PM +0100, Sebastian Capella wrote:
> Hi Liviu,
> 
> Regarding your comments about using the C-state instead of the residency, we based off of the existing mcpm_suspend call which currently takes residency (with a 0 meaning lowest power).
> 
> We used calls (including mcpm_suspend) in the hot plug/suspend path.  However, it does not know about c-states.  I suspect others may want to do the same.  Do you know how suspend is done on tc2?
> 
> Regarding guest kernels, I don't think I understand the implications.  If we migrate between cores (having different parameters) in the middle of a cstate transition, can we have correct behavior?  Wouldn't it be worse to migrate to a lower c-state then we had intended?
> 
> Thanks,
> 
> Sebastian
> 
> 
> On 15 May 2013 10:07, Jon Medhurst (Tixy) <tixy@linaro.org<mailto:tixy@linaro.org>> wrote:
> On Wed, 2013-05-15 at 09:49 -0700, Sebastian Capella wrote:
> > Thanks Daniel!
> >
> > Liviu,
> >
> > I have been using on the linux-linaro branch in the linux-linaro-tracking
> > repository here:
> >
> > https://git.linaro.org/gitweb?p=kernel/linux-linaro-tracking.git;a=shortlog;h=refs/heads/linux-linaro
> >
> 
> Generally, that's the Linaro kernel tree people should use and what is
> built daily and released monthly.
> 
> It's just it hasn't moved to 3.10 yet (will do in the next day or so)
> but the topic branches which feed into it (that Liviu pointed out) have
> already made that move.
> 
> --
> Tixy
> 
> 

Hi Sebastian,

From previous discussions between Achin, Charles and Nico I am aware that Nico has decided for the moment that target residency should be useful enough to be used by MCPM. That is because Nico is a big proponent of doing everything in the kernel and keeping the firmware dumb and (mostly) out of the way. However, the view that we have here at ARM (but I will only speak in my name here) is that in order to have alignment with AArch64 kernel and the way it is using PSCI interface, we should be moving the kernel on AArch32 and armv7a to run in non-secure mode. At that time, the kernel will make PSCI calls to do CPU_ON, CPU_SUSPEND, etc. and the aim is to provide to the firmware the deepest C-state that the core can support going to without being woken up to do any additional state management. It is then the latitude of the firmware to put the core in that state or to tally the sum of all requests in a cluster and decide to put the cores and the cluster in the lowest common C-state.

Regarding the migration of the guest kernels, it should be transparent (to a certain extent) wether on resume it is running on the same core or it has been migrated. The host OS should have a better understanding on what can be achieved and what invariants it can still hold, but it should not be limited to do that in a specific amount of time. Lets take an example: one core in the cluster says that it can go as deep as cluster shutdown but it does so in your use of the API by saying that it would like to sleep for at least amount X of time. The host however has to tally all the cores in the cluster in order to decide if the cluster can be shutdown, has to do a lot of cache maintainance and state saving, turning off clocks and devices etc, and in doing so is going to consume some compute cycles; it will then substract the time spent making a decision and doing the cleanup and then figure out if there is still time left for each of the cores to go to sleep for the specified amount of time. All this implies that the guest has to have an understanding of the time the host is spending in doing maintainance operations before asking the hypervisor for a target residency and the host still has to do the math again to validate that the guest request is still valid.

If we choose to use the target C-state, the request validation is simplified to a comparision between each core target C-state and the lowest common C-state per cluster, all done in the host.

Of course, by describing C-states in terms of target residency times both schemes can be considered equivalent. But that target residency time is not constant for all code paths and for all conditions and that makes the decision process more complicated.

Hope that provides some clarification.

Best regards,
Liviu

Nicolas Pitre May 16, 2013, 6:21 p.m. UTC | #8

On Thu, 16 May 2013, Liviu Dudau wrote:

> From previous discussions between Achin, Charles and Nico I am aware 
> that Nico has decided for the moment that target residency should be 
> useful enough to be used by MCPM. That is because Nico is a big 
> proponent of doing everything in the kernel and keeping the firmware 
> dumb and (mostly) out of the way. However, the view that we have here 
> at ARM (but I will only speak in my name here) is that in order to 
> have alignment with AArch64 kernel and the way it is using PSCI 
> interface, we should be moving the kernel on AArch32 and armv7a to run 
> in non-secure mode. At that time, the kernel will make PSCI calls to 
> do CPU_ON, CPU_SUSPEND, etc. and the aim is to provide to the firmware 
> the deepest C-state that the core can support going to without being 
> woken up to do any additional state management. It is then the 
> latitude of the firmware to put the core in that state or to tally the 
> sum of all requests in a cluster and decide to put the cores and the 
> cluster in the lowest common C-state.

That's all good.

My worry is about the definition of all the different C-state on all the 
different platforms.  I think it is simpler to have the kernel tell the 
firmware what it anticipates in terms of load/quiescence periods (e.g. 
the next interrupt is likely to happen in x millisecs), and let the 
firmware and/or low-level machine specific backend translate that into 
the appropriate C-state on its own.  After all, the firmware is supposed 
to know what is the best C-state to apply given a target latency and the 
current state of the surrounding CPUs, which may also differ depending 
on the cluster type, etc.

> Regarding the migration of the guest kernels, it should be transparent 
> (to a certain extent) wether on resume it is running on the same core 
> or it has been migrated. The host OS should have a better 
> understanding on what can be achieved and what invariants it can still 
> hold, but it should not be limited to do that in a specific amount of 
> time. Lets take an example: one core in the cluster says that it can 
> go as deep as cluster shutdown but it does so in your use of the API 
> by saying that it would like to sleep for at least amount X of time. 
> The host however has to tally all the cores in the cluster in order to 
> decide if the cluster can be shutdown, has to do a lot of cache 
> maintainance and state saving, turning off clocks and devices etc, and 
> in doing so is going to consume some compute cycles; it will then 
> substract the time spent making a decision and doing the cleanup and 
> then figure out if there is still time left for each of the cores to 
> go to sleep for the specified amount of time. All this implies that 
> the guest has to have an understanding of the time the host is 
> spending in doing maintainance operations before asking the hypervisor 
> for a target residency and the host still has to do the math again to 
> validate that the guest request is still valid.

I don't follow your reasoning.  Why would the guest have to care 
about what the host can do at all and in what amount of time?  What the 
guest should tell the host is this: "I don't anticipate any need for the 
CPU during the next 500 ms so take this as a hint to perform the most 
efficient power saving given the constraints you alone have the 
knowledge of."  The host should know how long it takes to flush its 
cache, whether or not that cache is in use by other guests, etc.  But 
the guest should not care.

And in this case the math performed by the guest and the host are 
completely different.

> If we choose to use the target C-state, the request validation is 
> simplified to a comparision between each core target C-state and the 
> lowest common C-state per cluster, all done in the host.
> 
> Of course, by describing C-states in terms of target residency times 
> both schemes can be considered equivalent. But that target residency 
> time is not constant for all code paths and for all conditions and 
> that makes the decision process more complicated.

For who?

If the guest is responsible for choosing a C-state itself and pass it on 
to the host, it has to process through a set of available C-states and 
select the proper one according to the target residency time it must 
compute anyway since this is all the scheduler can tell you.  And since 
those C-states are likely to have different latency profiles on 
different clusters, the guest will have to query the type of host it is 
running on or the available C-states each time it wants to select one, 
etc.  So I don't think passing the target residency directly to the host 
is more complicated when you look at the big picture.

Nicolas

Liviu Dudau May 17, 2013, 9:43 a.m. UTC | #9

On Thu, May 16, 2013 at 07:21:55PM +0100, Nicolas Pitre wrote:
> On Thu, 16 May 2013, Liviu Dudau wrote:
> 
> > From previous discussions between Achin, Charles and Nico I am aware 
> > that Nico has decided for the moment that target residency should be 
> > useful enough to be used by MCPM. That is because Nico is a big 
> > proponent of doing everything in the kernel and keeping the firmware 
> > dumb and (mostly) out of the way. However, the view that we have here 
> > at ARM (but I will only speak in my name here) is that in order to 
> > have alignment with AArch64 kernel and the way it is using PSCI 
> > interface, we should be moving the kernel on AArch32 and armv7a to run 
> > in non-secure mode. At that time, the kernel will make PSCI calls to 
> > do CPU_ON, CPU_SUSPEND, etc. and the aim is to provide to the firmware 
> > the deepest C-state that the core can support going to without being 
> > woken up to do any additional state management. It is then the 
> > latitude of the firmware to put the core in that state or to tally the 
> > sum of all requests in a cluster and decide to put the cores and the 
> > cluster in the lowest common C-state.
> 
> That's all good.
> 
> My worry is about the definition of all the different C-state on all the 
> different platforms.  I think it is simpler to have the kernel tell the 
> firmware what it anticipates in terms of load/quiescence periods (e.g. 
> the next interrupt is likely to happen in x millisecs), and let the 
> firmware and/or low-level machine specific backend translate that into 
> the appropriate C-state on its own.  After all, the firmware is supposed 
> to know what is the best C-state to apply given a target latency and the 
> current state of the surrounding CPUs, which may also differ depending 
> on the cluster type, etc.
> 
> > Regarding the migration of the guest kernels, it should be transparent 
> > (to a certain extent) wether on resume it is running on the same core 
> > or it has been migrated. The host OS should have a better 
> > understanding on what can be achieved and what invariants it can still 
> > hold, but it should not be limited to do that in a specific amount of 
> > time. Lets take an example: one core in the cluster says that it can 
> > go as deep as cluster shutdown but it does so in your use of the API 
> > by saying that it would like to sleep for at least amount X of time. 
> > The host however has to tally all the cores in the cluster in order to 
> > decide if the cluster can be shutdown, has to do a lot of cache 
> > maintainance and state saving, turning off clocks and devices etc, and 
> > in doing so is going to consume some compute cycles; it will then 
> > substract the time spent making a decision and doing the cleanup and 
> > then figure out if there is still time left for each of the cores to 
> > go to sleep for the specified amount of time. All this implies that 
> > the guest has to have an understanding of the time the host is 
> > spending in doing maintainance operations before asking the hypervisor 
> > for a target residency and the host still has to do the math again to 
> > validate that the guest request is still valid.
> 
> I don't follow your reasoning.  Why would the guest have to care 
> about what the host can do at all and in what amount of time?  What the 
> guest should tell the host is this: "I don't anticipate any need for the 
> CPU during the next 500 ms so take this as a hint to perform the most 
> efficient power saving given the constraints you alone have the 
> knowledge of."  The host should know how long it takes to flush its 
> cache, whether or not that cache is in use by other guests, etc.  But 
> the guest should not care.

Exactly my position. What I was doing was to show in an example what the
use of target residency actually means (i.e. guest sends a hint, but
the host cannot use that hint straight away as it needs first to calculate
the common longest amount of time that all the cores in cluster can sleep,
do the maintainace operations and then recalculate the remaining time
again in order to validate that it can still go to that C-state. I was
not implying that the guest has to know the cost of the host maintainance
operations (other than that it is probably built into the target residency
values that the guests uses).

In my (possibly simplistic) view of the world, by specifying the target
C-state as a state and not as a time interval the guest frees the host
to do various operations that it doesn't care about (or it cannot know
about). If as a guest I tell the hypervisor that I can survive cluster
shutdown then the infrastructure can then migrate me to a machine in a
different corner of the world where I'm going to be warm booted from
the snapshot taken at cluster shutdown.

Remember that time can be virtualised as well, so although the scheduler
tells me that I need to wakeup in 500ms, that doesn't mean wall clock
time. The host can lie about the current time when the guest wakes up.


> 
> And in this case the math performed by the guest and the host are 
> completely different.
> 
> > If we choose to use the target C-state, the request validation is 
> > simplified to a comparision between each core target C-state and the 
> > lowest common C-state per cluster, all done in the host.
> > 
> > Of course, by describing C-states in terms of target residency times 
> > both schemes can be considered equivalent. But that target residency 
> > time is not constant for all code paths and for all conditions and 
> > that makes the decision process more complicated.
> 
> For who?

Sorry, I meant to say: "But the maintainance cost is not constant for all
code paths .... "

> 
> If the guest is responsible for choosing a C-state itself and pass it on 
> to the host, it has to process through a set of available C-states and 
> select the proper one according to the target residency time it must 
> compute anyway since this is all the scheduler can tell you.  And since 
> those C-states are likely to have different latency profiles on 
> different clusters, the guest will have to query the type of host it is 
> running on or the available C-states each time it wants to select one, 
> etc.  So I don't think passing the target residency directly to the host 
> is more complicated when you look at the big picture.

I agree. But I was trying to keep the host code small and dumb and let the
guest code do all the calculations and the translation between target
residency and C-states.

One other piece of information that target residency time does not convey
is the restrictions regarding migration. If as a guest I can sleep for
500ms because the tape device that I'm reading data from is slow and the
reading hardware has a buffer big enough that is not going to wake me up
for a while that doesn't mean that the whole cluster can be shut down
and migrated. Yet the same 500ms target residency will be used to signal
that a guest is idle and does nothing, so it can be shut down together
with all the idle cores in the cluster. So how is the host going to know?

Best regards,
Liviu

> 
> 
> Nicolas
>

Catalin Marinas May 17, 2013, 9:55 a.m. UTC | #10

Hi Nico,

On 16 May 2013 19:21, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> On Thu, 16 May 2013, Liviu Dudau wrote:
>
>> From previous discussions between Achin, Charles and Nico I am aware
>> that Nico has decided for the moment that target residency should be
>> useful enough to be used by MCPM. That is because Nico is a big
>> proponent of doing everything in the kernel and keeping the firmware
>> dumb and (mostly) out of the way. However, the view that we have here
>> at ARM (but I will only speak in my name here) is that in order to
>> have alignment with AArch64 kernel and the way it is using PSCI
>> interface, we should be moving the kernel on AArch32 and armv7a to run
>> in non-secure mode. At that time, the kernel will make PSCI calls to
>> do CPU_ON, CPU_SUSPEND, etc. and the aim is to provide to the firmware
>> the deepest C-state that the core can support going to without being
>> woken up to do any additional state management. It is then the
>> latitude of the firmware to put the core in that state or to tally the
>> sum of all requests in a cluster and decide to put the cores and the
>> cluster in the lowest common C-state.
>
> That's all good.
>
> My worry is about the definition of all the different C-state on all the
> different platforms.  I think it is simpler to have the kernel tell the
> firmware what it anticipates in terms of load/quiescence periods (e.g.
> the next interrupt is likely to happen in x millisecs), and let the
> firmware and/or low-level machine specific backend translate that into
> the appropriate C-state on its own.  After all, the firmware is supposed
> to know what is the best C-state to apply given a target latency and the
> current state of the surrounding CPUs, which may also differ depending
> on the cluster type, etc.

While I'm for abstracting platform details behind firmware like PSCI
(especially when SMCs are required), I would rather keep the firmware
simple (i.e. not too much cleverness). I think the cpuidle framework
in Linux is a better place for deciding which C-state it can target
and we only need a way to describe the states/latencies in the DT.

>> Regarding the migration of the guest kernels, it should be transparent
>> (to a certain extent) wether on resume it is running on the same core
>> or it has been migrated. The host OS should have a better
>> understanding on what can be achieved and what invariants it can still
>> hold, but it should not be limited to do that in a specific amount of
>> time. Lets take an example: one core in the cluster says that it can
>> go as deep as cluster shutdown but it does so in your use of the API
>> by saying that it would like to sleep for at least amount X of time.
>> The host however has to tally all the cores in the cluster in order to
>> decide if the cluster can be shutdown, has to do a lot of cache
>> maintainance and state saving, turning off clocks and devices etc, and
>> in doing so is going to consume some compute cycles; it will then
>> substract the time spent making a decision and doing the cleanup and
>> then figure out if there is still time left for each of the cores to
>> go to sleep for the specified amount of time. All this implies that
>> the guest has to have an understanding of the time the host is
>> spending in doing maintainance operations before asking the hypervisor
>> for a target residency and the host still has to do the math again to
>> validate that the guest request is still valid.
>
> I don't follow your reasoning.  Why would the guest have to care
> about what the host can do at all and in what amount of time?  What the
> guest should tell the host is this: "I don't anticipate any need for the
> CPU during the next 500 ms so take this as a hint to perform the most
> efficient power saving given the constraints you alone have the
> knowledge of."  The host should know how long it takes to flush its
> cache, whether or not that cache is in use by other guests, etc.  But
> the guest should not care.

I agree, the guest shouldn't know about the host C-states. The guest
most likely will be provided with virtual C-states/latencies and the
host could make a decision based the C-state of the guests.

--
Catalin

cpuidle: arm_big_little: route target residency to mcpm

Commit Message

Comments

Patch