[v8,00/26] PM / Domains: Support hierarchical CPU arrangement (PSCI/ARM)

Message ID 20180620172226.15012-1-ulf.hansson@linaro.org

Message

Ulf Hansson June 20, 2018, 5:22 p.m. UTC
Changes in v8:
 - Added some tags for reviews and acks.
 - Cleanup timer patch (patch6) according to comments from Rafael.
 - Rebased series on top of v4.18-rc1 - it applied cleanly, except for patch 5.
 - While adapting patch 5 to the new genpd changes, I took the opportunity to
   improve the new function description a bit.
 - Corrected malformed SPDX-License-Identifier in patch20.

Changes in v7:
 - Addressed comments from Mark Rutland concerning the PSCI changes: moved
   the PSCI firmware driver to a new firmware subdir and changed to force PSCI
   PC mode during boot to cope with kexec'ed kernels.
 - Added some maintainers in cc for the timer/nohz patches.
 - Minor update to the new genpd governor, taking into account the state's
   poweroff latency while validating the sleep duration time.
 - Addressed a problem pointed out by Geert Uytterhoeven, around calling
   pm_runtime_get|put() for CPUs that have not been attached to a CPU PM domain.
 - Re-based on Linus' latest master.

Some background:

Overall, this series has been discussed for years at various Linux conferences
and on LKML, so let me just give a brief introduction here; the rest can be
read in each changelog.

For ARM, the PSCI firmware interface may be managing the power to the CPUs.
Depending on the SoC, CPUs may also be arranged in a hierarchical manner, which
could add another level of complexity from a CPU idle management point of view.

PSCI v1.0+ adds support for the so-called OS initiated CPU suspend mode, which
enables a more fine-grained method, giving Linux more control with regard to
energy efficiency. This is typically useful for these kinds of complex
battery-driven platforms.

In principle, what is missing today in CPU idle management for SoCs that
arrange CPUs in a hierarchical manner is what this series intends to
address.

 - Patch 1 -> Patch 12: The first part consists of generic changes to genpd,
   cpu_pm, timers, cpuidle and DT. The solution is based on an opt-in method,
   so no existing users should be affected by any of these changes.

 - Patch 13 -> Patch 26: The second part consists of changes to PSCI and ARM64,
   which deploy the support for CPU idle management, based upon the new generic
   changes from the first part.

The series is based on v4.18-rc1 and the code has been tested on a QCOM 410c
dragonboard. You may find the code at:

git.linaro.org/people/ulf.hansson/linux-pm.git next

Kind regards
Ulf Hansson


Lina Iyer (6):
  PM / Domains: Add generic data pointer to genpd_power_state struct
  timer: Export next wakeup time of a CPU
  dt: psci: Update DT bindings to support hierarchical PSCI states
  cpuidle: dt: Support hierarchical CPU idle states
  drivers: firmware: psci: Support hierarchical CPU idle states
  arm64: dts: Convert to the hierarchical CPU topology layout for
    MSM8916

Ulf Hansson (20):
  PM / Domains: Don't treat zero found compatible idle states as an
    error
  PM / Domains: Deal with multiple states but no governor in genpd
  PM / Domains: Add support for CPU devices to genpd
  PM / Domains: Add helper functions to attach/detach CPUs to/from genpd
  PM / Domains: Add genpd governor for CPUs
  PM / Domains: Extend genpd CPU governor to cope with QoS constraints
  kernel/cpu_pm: Manage runtime PM in the idle path for CPUs
  of: base: Add of_get_cpu_state_node() to get idle states for a CPU
    node
  drivers: firmware: psci: Move psci to separate directory
  MAINTAINERS: Update files for PSCI
  drivers: firmware: psci: Split psci_dt_cpu_init_idle()
  drivers: firmware: psci: Simplify error path of psci_dt_init()
  drivers: firmware: psci: Announce support for OS initiated suspend
    mode
  drivers: firmware: psci: Prepare to use OS initiated suspend mode
  drivers: firmware: psci: Share a few internal PSCI functions
  drivers: firmware: psci: Add support for PM domains using genpd
  drivers: firmware: psci: Introduce psci_dt_topology_init()
  drivers: firmware: psci: Try to attach CPU devices to their PM domains
  drivers: firmware: psci: Deal with CPU hotplug when using OSI mode
  arm64: kernel: Respect the hierarchical CPU topology in DT for PSCI

 .../devicetree/bindings/arm/psci.txt          | 156 +++++++++++++++
 MAINTAINERS                                   |   2 +-
 arch/arm64/boot/dts/qcom/msm8916.dtsi         |  53 +++++-
 arch/arm64/kernel/setup.c                     |   3 +
 drivers/base/power/domain.c                   | 158 ++++++++++++++-
 drivers/base/power/domain_governor.c          |  67 ++++++-
 drivers/cpuidle/dt_idle_states.c              |   5 +-
 drivers/firmware/Kconfig                      |  15 +-
 drivers/firmware/Makefile                     |   3 +-
 drivers/firmware/psci/Kconfig                 |  13 ++
 drivers/firmware/psci/Makefile                |   4 +
 drivers/firmware/{ => psci}/psci.c            | 174 +++++++++++++----
 drivers/firmware/psci/psci.h                  |  19 ++
 drivers/firmware/{ => psci}/psci_checker.c    |   0
 drivers/firmware/psci/psci_pm_domain.c        | 180 ++++++++++++++++++
 drivers/of/base.c                             |  35 ++++
 include/linux/of.h                            |   8 +
 include/linux/pm_domain.h                     |  16 ++
 include/linux/psci.h                          |   2 +
 include/linux/tick.h                          |   8 +
 include/uapi/linux/psci.h                     |   5 +
 kernel/cpu_pm.c                               |  11 ++
 kernel/time/tick-sched.c                      |  10 +
 23 files changed, 877 insertions(+), 70 deletions(-)
 create mode 100644 drivers/firmware/psci/Kconfig
 create mode 100644 drivers/firmware/psci/Makefile
 rename drivers/firmware/{ => psci}/psci.c (83%)
 create mode 100644 drivers/firmware/psci/psci.h
 rename drivers/firmware/{ => psci}/psci_checker.c (100%)
 create mode 100644 drivers/firmware/psci/psci_pm_domain.c

-- 
2.17.1

Comments

Rafael J. Wysocki Sept. 14, 2018, 9:50 a.m. UTC | #1
On Thursday, August 9, 2018 5:39:25 PM CEST Lorenzo Pieralisi wrote:
> On Mon, Aug 06, 2018 at 11:20:59AM +0200, Rafael J. Wysocki wrote:
>
> [...]
>
> > >>> > @@ -245,6 +248,56 @@ static bool always_on_power_down_ok(struct dev_pm_domain *domain)
> > >>> >     return false;
> > >>> >  }
> > >>> >
> > >>> > +static bool cpu_power_down_ok(struct dev_pm_domain *pd)
> > >>> > +{
> > >>> > +   struct generic_pm_domain *genpd = pd_to_genpd(pd);
> > >>> > +   ktime_t domain_wakeup, cpu_wakeup;
> > >>> > +   s64 idle_duration_ns;
> > >>> > +   int cpu, i;
> > >>> > +
> > >>> > +   if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
> > >>> > +           return true;
> > >>> > +
> > >>> > +   /*
> > >>> > +    * Find the next wakeup for any of the online CPUs within the PM domain
> > >>> > +    * and its subdomains. Note, we only need the genpd->cpus, as it already
> > >>> > +    * contains a mask of all CPUs from subdomains.
> > >>> > +    */
> > >>> > +   domain_wakeup = ktime_set(KTIME_SEC_MAX, 0);
> > >>> > +   for_each_cpu_and(cpu, genpd->cpus, cpu_online_mask) {
> > >>> > +           cpu_wakeup = tick_nohz_get_next_wakeup(cpu);
> > >>> > +           if (ktime_before(cpu_wakeup, domain_wakeup))
> > >>> > +                   domain_wakeup = cpu_wakeup;
> > >>> > +   }
> > >>
> > >> Here's a concern I have missed before. :-/
> > >>
> > >> Say, one of the CPUs you're walking here is woken up in the meantime.
> > >
> > > Yes, that can happen - when we mispredicted "next wakeup".
> > >
> > >>
> > >> I don't think it is valid to evaluate tick_nohz_get_next_wakeup() for it then
> > >> to update domain_wakeup.  We really should just avoid the domain power off in
> > >> that case at all IMO.
> > >
> > > Correct.
> > >
> > > However, we also want to avoid locking contentions in the idle path,
> > > which is what this boils down to.
> >
> > This already is done under genpd_lock() AFAICS, so I'm not quite sure
> > what exactly you mean.
> >
> > Besides, this is not just about increased latency, which is a concern
> > by itself but maybe not so much in all environments, but also about
> > the possibility of missing a CPU wakeup, which is a major issue.
> >
> > If one of the CPUs sharing the domain with the current one is woken up
> > during cpu_power_down_ok() and the wakeup is an edge-triggered
> > interrupt and the domain is turned off regardless, the wakeup may be
> > missed entirely if I'm not mistaken.
> >
> > It looks like there needs to be a way for the hardware to prevent a
> > domain poweroff when there's a pending interrupt or I don't quite see
> > how this can be handled correctly.
> >
> > >> Sure enough, if the domain power off is already started and one of the CPUs
> > >> in the domain is woken up then, too bad, it will suffer the latency (but in
> > >> that case the hardware should be able to help somewhat), but otherwise CPU
> > >> wakeup should prevent domain power off from being carried out.
> > >
> > > The CPU is not prevented from waking up, as we rely on the FW to deal with that.
> > >
> > > Even if the above computation turns out to wrongly suggest that the
> > > cluster can be powered off, the FW shall together with the genpd
> > > backend driver prevent it.
> >
> > Fine, but then the solution depends on specific FW/HW behavior, so I'm
> > not sure how generic it really is.  At least, that expectation should
> > be clearly documented somewhere, preferably in code comments.
> >
> > > To cover this case for PSCI, we also use a per cpu variable for the
> > > CPU's power off state, as can be seen later in the series.
> >
> > Oh great, but the generic part should be independent of the underlying
> > implementation of the driver.  If it isn't, then it also is not
> > generic.
> >
> > > Hope this clarifies your concern, else tell me and I will elaborate a bit more.
> >
> > Not really.
> >
> > There also is one more problem and that is the interaction between
> > this code and the idle governor.
> >
> > Namely, the idle governor may select a shallower state for some
> > reason, for example due to an additional latency limit derived from
> > CPU utilization (like in the menu governor), and how does the code in
> > cpu_power_down_ok() know what state has been selected and how does it
> > honor the selection made by the idle governor?
>
> That's a good question and it maybe gives a path towards a solution.
>
> AFAICS the genPD governor only selects the idle state parameter that
> determines the idle state at, say, GenPD cpumask level; it does not touch
> the CPUidle decision, which works on a subset of idle states (at cpu
> level).

I've deferred responding to this as I wasn't quite sure if I followed you
at that time, but I'm afraid I'm still not following you now. :-)

The idle governor has to take the total worst-case wakeup latency into
account.  Not just from the logical CPU itself, but also from whatever
state the SoC may end up in as a result of this particular logical CPU
going idle, one way or another.

So for example, if your logical CPU has an idle state A that may trigger an
idle state X at the cluster level (if the other logical CPUs happen to be in
the right states and so on), then the worst-case exit latency for that
is the one of state X.
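
The worst-case reasoning above can be sketched in a few lines of C. This is
only an illustration: struct idle_level and both helpers are hypothetical
names, not kernel APIs or anything from this series. It assumes that a request
for a CPU-level state may end up triggering every shared state listed after
it, so the latency to budget for is the largest one in the chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one idle level (CPU, cluster, ...). */
struct idle_level {
    const char *name;
    uint32_t exit_latency_us;
};

/*
 * levels[0] is the CPU-level state being requested; the remaining entries
 * are the shared states that the request may trigger. The worst case is
 * the largest exit latency found in the whole chain.
 */
uint32_t worst_case_exit_latency_us(const struct idle_level *levels, size_t n)
{
    uint32_t worst = 0;
    size_t i;

    assert(levels != NULL && n > 0);

    for (i = 0; i < n; i++) {
        if (levels[i].exit_latency_us > worst)
            worst = levels[i].exit_latency_us;
    }
    return worst;
}

/* A state may be selected only if its worst case fits the latency limit. */
int state_allowed(const struct idle_level *levels, size_t n,
                  uint32_t latency_limit_us)
{
    return worst_case_exit_latency_us(levels, n) <= latency_limit_us;
}
```

With a made-up chain of state A at 100 us and state X at 1000 us, a latency
limit of 500 us rules out state A, even though state A alone would fit.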

> That's my understanding, which can be wrong so please correct me
> if that's the case because that's a bit confusing.
>
> Let's imagine that we flattened out the list of idle states and fed
> CPUidle with it (all of them - cpu, cluster, package, system - as it is
> in the mainline _now_). Then the GenPD governor can run through the
> CPUidle selection and _demote_ the idle state if necessary since it
> understands that some CPUs in the GenPD will wake up shortly and break
> the target residency hypothesis the CPUidle governor is expecting.
>
> The whole idea about this series is improving CPUidle decision when
> the target idle state is _shared_ among groups of cpus (again, please
> do correct me if I am wrong).
>
> It is obvious that a GenPD governor must only demote - never promote a
> CPU idle state selection given that hierarchy implies more power
> savings and higher target residencies required.

So I see a problem here, because the way patch 9 in this series is done,
the genpd governor for CPUs has no idea what states have been selected by
the idle governor, so how does it know how deep it can go with turning
off domains?

My point is that the selection made by the idle governor need not be
based only on timers which is the only thing that the genpd governor
seems to be looking at.  The genpd governor should rather look at what
idle states have been selected for each CPU in the domain by the idle
governor and work within the boundaries of those.
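
A rough sketch of what "work within the boundaries" could look like, purely as
an illustration. It assumes a hypothetical model in which the idle governor
publishes, per CPU, the deepest domain state index that its selection
tolerates (0 meaning the domain must stay on). Neither that per-CPU limit nor
the function below exists in this series.

```c
#include <assert.h>
#include <stddef.h>

/*
 * cpu_limit[i]:       deepest domain state index tolerated by the idle state
 *                     selected for CPU i (hypothetical, 0 = keep domain on).
 * timer_based_choice: the state the domain governor would pick from timers
 *                     alone.
 *
 * The result never goes deeper than any CPU's limit, so the domain governor
 * can only demote the timer-based choice, never promote it.
 */
int deepest_allowed_domain_state(const int *cpu_limit, size_t ncpus,
                                 int timer_based_choice)
{
    int limit = timer_based_choice;
    size_t i;

    assert(cpu_limit != NULL && ncpus > 0);

    for (i = 0; i < ncpus; i++) {
        if (cpu_limit[i] < limit)
            limit = cpu_limit[i];
    }
    return limit;
}
```

For example, if the timers alone would allow domain state 2 but one CPU was
given a shallower state that only tolerates domain state 1, the result is 1.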

Thanks,
Rafael

Lorenzo Pieralisi Sept. 14, 2018, 12:30 p.m. UTC | #2
On Fri, Sep 14, 2018 at 01:34:14PM +0200, Rafael J. Wysocki wrote:

[...]

> > > So for example, if your logical CPU has an idle state A that may trigger an
> > > idle state X at the cluster level (if the other logical CPUs happen to be in
> > > the right states and so on), then the worst-case exit latency for that
> > > is the one of state X.
> >
> > I will provide an example:
> >
> > IDLE STATE A (affects CPU {0,1}): exit latency 1ms, min-residency 1.5ms
> >
> > CPU 0 is about to enter IDLE state A since its "next-event" fulfills the
> > residency requirements and exit latency constraints.
> >
> > CPU 1 is in idle state A (given that CPU 0 is ON, some of the common
> > logic shared between CPU {0,1} is still ON, but, as soon as CPU 0
> > enters idle state A CPU {0,1} can enter the "full" idle state A
> > power savings mode).
> >
> > The current CPUidle governor does not check the "next-event" for CPU 1,
> > which may wake up in, say, 10us.
>
> Right.
>
> > Requesting IDLE STATE A is a waste of power (if firmware or hardware
> > does not demote it since it does peek at CPU 1 next-event and actually
> > demote CPU 0 request).
>
> OK, I see.
>
> That's because the state is "collaborative" so to speak.  But wasn't
> that supposed to be covered by the "coupled" thing?

The coupled idle states code was merged because on some early SMP
ARM platforms CPUs must enter cluster idle states in order, otherwise
the system would break; "coupled" as in "synchronized idle state entry".

Basically, the coupled idle code fixed a HW bug. The code in this series
instead applies to all arches where an idle state may span multiple CPUs
(x86 included, but as I mentioned it is probably not needed there, since
the FW/HW behind mwait is capable of detecting whether it is worthwhile
to shut down, say, a package; PSCI, whether in OSI or PC mode, can work
the same way).

Entering an idle state spanning multiple CPUs need not be synchronized,
but a sort of cpumask-aware governor may help optimize idle state
selection.
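
To make the "cpumask-aware" idea a bit more concrete, here is a minimal sketch
based on the IDLE STATE A example quoted above. The function name and the
array of per-CPU "time until next event" values are hypothetical; this is not
code from the series, just the residency check spelled out under those
assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * next_event_us[i]: time from now until the next expected event on CPU i,
 *                   for every CPU covered by the shared idle state.
 *
 * Returns 1 if the shared state is worth requesting, 0 if the request
 * should be demoted because some CPU in the mask wakes up before the
 * state's minimum residency has elapsed.
 */
int shared_state_worthwhile(const uint64_t *next_event_us, size_t ncpus,
                            uint64_t min_residency_us)
{
    uint64_t earliest = UINT64_MAX;
    size_t i;

    assert(next_event_us != NULL && ncpus > 0);

    for (i = 0; i < ncpus; i++) {
        if (next_event_us[i] < earliest)
            earliest = next_event_us[i];
    }
    return earliest >= min_residency_us;
}
```

With the numbers from the example (min-residency 1.5ms, CPU 1 waking up in
10us), the function returns 0, so CPU 0's request for state A would be
demoted regardless of how far away CPU 0's own next event is.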

I hope this makes the whole point clearer.

Cheers,
Lorenzo