diff mbox series

[3/4] powercap/drivers/dtpm: Add API for dynamic thermal power management

Message ID 20201006122024.14539-4-daniel.lezcano@linaro.org
State Superseded
Headers show
Series powercap/dtpm: Add the DTPM framework | expand

Commit Message

Daniel Lezcano Oct. 6, 2020, 12:20 p.m. UTC
On the embedded world, the complexity of the SoC leads to an
increasing number of hotspots which need to be monitored and mitigated
as a whole in order to prevent the temperature to go above the
normative and legally stated 'skin temperature'.

Another aspect is to sustain the performance for a given power budget,
for example virtual reality where the user can feel dizziness if the
GPU performance is capped while a big CPU is processing something
else. Or reduce the battery charging because the dissipated power is
too high compared with the power consumed by other devices.

The userspace is the most adequate place to dynamically act on the
different devices by limiting their power given an application
profile: it has the knowledge of the platform.

These userspace daemons are in charge of the Dynamic Thermal Power
Management (DTPM).

Nowadays, the dtpm daemons are abusing the thermal framework as they
act on the cooling device state to force a specific and arbitraty
state without taking care of the governor decisions. Given the closed
loop of some governors that can confuse the logic or directly enter in
a decision conflict.

As the number of cooling device support is limited today to the CPU
and the GPU, the dtpm daemons have little control on the power
dissipation of the system. The out of tree solutions are hacking
around here and there in the drivers, in the frameworks to have
control on the devices. The common solution is to declare them as
cooling devices.

There is no unification of the power limitation unit, opaque states
are used.

This patch provides a way to create a hierarchy of constraints using
the powercap framework. The devices which are registered as power
limit-able devices are represented in this hierarchy as a tree. They
are linked together with intermediate nodes which are just there to
propagate the constraint to the children.

The leaves of the tree are the real devices, the intermediate nodes
are virtual, aggregating the children constraints and power
characteristics.

Each node have a weight on a 2^10 basis, in order to reflect the
percentage of power distribution of the children's node. This
percentage is used to dispatch the power limit to the children.

The weight is computed against the max power of the siblings.

This simple approach allows to do a fair distribution of the power
limit.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>

---
 drivers/powercap/Kconfig          |   6 +
 drivers/powercap/Makefile         |   1 +
 drivers/powercap/dtpm.c           | 430 ++++++++++++++++++++++++++++++
 include/asm-generic/vmlinux.lds.h |  11 +
 include/linux/dtpm.h              |  73 +++++
 5 files changed, 521 insertions(+)
 create mode 100644 drivers/powercap/dtpm.c
 create mode 100644 include/linux/dtpm.h

-- 
2.17.1

Comments

kernel test robot Oct. 6, 2020, 4:42 p.m. UTC | #1
Hi Daniel,

I love your patch! Yet something to improve:

[auto build test ERROR on pm/linux-next]
[also build test ERROR on linux/master linus/master v5.9-rc8 next-20201006]
[cannot apply to asm-generic/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Daniel-Lezcano/powercap-dtpm-Add-the-DTPM-framework/20201006-202317
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
config: m68k-allmodconfig (attached as .config)
compiler: m68k-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/a3c92086239fccbe4523d59537de2a4c805561d9
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Daniel-Lezcano/powercap-dtpm-Add-the-DTPM-framework/20201006-202317
        git checkout a3c92086239fccbe4523d59537de2a4c805561d9
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=m68k 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   m68k-linux-ld: drivers/powercap/dtpm.o: in function `__dtpm_rebalance_weight':
>> dtpm.c:(.text+0x298): undefined reference to `__udivdi3'


---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
kernel test robot Oct. 6, 2020, 6:05 p.m. UTC | #2
Hi Daniel,

I love your patch! Yet something to improve:

[auto build test ERROR on pm/linux-next]
[also build test ERROR on linux/master linus/master v5.9-rc8 next-20201006]
[cannot apply to asm-generic/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Daniel-Lezcano/powercap-dtpm-Add-the-DTPM-framework/20201006-202317
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
config: mips-allyesconfig (attached as .config)
compiler: mips-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/a3c92086239fccbe4523d59537de2a4c805561d9
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Daniel-Lezcano/powercap-dtpm-Add-the-DTPM-framework/20201006-202317
        git checkout a3c92086239fccbe4523d59537de2a4c805561d9
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=mips 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   arch/mips/kernel/head.o: in function `dtb_found':
   (.ref.text+0xe0): relocation truncated to fit: R_MIPS_26 against `start_kernel'
   init/main.o: in function `set_reset_devices':
   main.c:(.init.text+0x20): relocation truncated to fit: R_MIPS_26 against `_mcount'
   main.c:(.init.text+0x30): relocation truncated to fit: R_MIPS_26 against `__sanitizer_cov_trace_pc'
   init/main.o: in function `debug_kernel':
   main.c:(.init.text+0x9c): relocation truncated to fit: R_MIPS_26 against `_mcount'
   main.c:(.init.text+0xac): relocation truncated to fit: R_MIPS_26 against `__sanitizer_cov_trace_pc'
   init/main.o: in function `quiet_kernel':
   main.c:(.init.text+0x118): relocation truncated to fit: R_MIPS_26 against `_mcount'
   main.c:(.init.text+0x128): relocation truncated to fit: R_MIPS_26 against `__sanitizer_cov_trace_pc'
   init/main.o: in function `init_setup':
   main.c:(.init.text+0x1a4): relocation truncated to fit: R_MIPS_26 against `_mcount'
   main.c:(.init.text+0x1c8): relocation truncated to fit: R_MIPS_26 against `__sanitizer_cov_trace_pc'
   main.c:(.init.text+0x1e8): relocation truncated to fit: R_MIPS_26 against `__sanitizer_cov_trace_pc'
   main.c:(.init.text+0x1fc): additional relocation overflows omitted from the output
   mips-linux-ld: drivers/powercap/dtpm.o: in function `__dtpm_rebalance_weight':
>> dtpm.c:(.text.__dtpm_rebalance_weight+0xec): undefined reference to `__udivdi3'


---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Lukasz Luba Oct. 23, 2020, 10:29 a.m. UTC | #3
Hi Daniel,


On 10/6/20 1:20 PM, Daniel Lezcano wrote:
> On the embedded world, the complexity of the SoC leads to an

> increasing number of hotspots which need to be monitored and mitigated

> as a whole in order to prevent the temperature to go above the

> normative and legally stated 'skin temperature'.

> 

> Another aspect is to sustain the performance for a given power budget,

> for example virtual reality where the user can feel dizziness if the

> GPU performance is capped while a big CPU is processing something

> else. Or reduce the battery charging because the dissipated power is

> too high compared with the power consumed by other devices.

> 

> The userspace is the most adequate place to dynamically act on the

> different devices by limiting their power given an application

> profile: it has the knowledge of the platform.

> 

> These userspace daemons are in charge of the Dynamic Thermal Power

> Management (DTPM).

> 

> Nowadays, the dtpm daemons are abusing the thermal framework as they

> act on the cooling device state to force a specific and arbitraty

> state without taking care of the governor decisions. Given the closed

> loop of some governors that can confuse the logic or directly enter in

> a decision conflict.

> 

> As the number of cooling device support is limited today to the CPU

> and the GPU, the dtpm daemons have little control on the power

> dissipation of the system. The out of tree solutions are hacking

> around here and there in the drivers, in the frameworks to have

> control on the devices. The common solution is to declare them as

> cooling devices.

> 

> There is no unification of the power limitation unit, opaque states

> are used.

> 

> This patch provides a way to create a hierarchy of constraints using

> the powercap framework. The devices which are registered as power

> limit-able devices are represented in this hierarchy as a tree. They

> are linked together with intermediate nodes which are just there to

> propagate the constraint to the children.

> 

> The leaves of the tree are the real devices, the intermediate nodes

> are virtual, aggregating the children constraints and power

> characteristics.

> 

> Each node have a weight on a 2^10 basis, in order to reflect the

> percentage of power distribution of the children's node. This

> percentage is used to dispatch the power limit to the children.

> 

> The weight is computed against the max power of the siblings.

> 

> This simple approach allows to do a fair distribution of the power

> limit.

> 

> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>

> ---

>   drivers/powercap/Kconfig          |   6 +

>   drivers/powercap/Makefile         |   1 +

>   drivers/powercap/dtpm.c           | 430 ++++++++++++++++++++++++++++++

>   include/asm-generic/vmlinux.lds.h |  11 +

>   include/linux/dtpm.h              |  73 +++++

>   5 files changed, 521 insertions(+)

>   create mode 100644 drivers/powercap/dtpm.c

>   create mode 100644 include/linux/dtpm.h

> 

> diff --git a/drivers/powercap/Kconfig b/drivers/powercap/Kconfig

> index ebc4d4578339..777cf60300b8 100644

> --- a/drivers/powercap/Kconfig

> +++ b/drivers/powercap/Kconfig

> @@ -43,4 +43,10 @@ config IDLE_INJECT

>   	  CPUs for power capping. Idle period can be injected

>   	  synchronously on a set of specified CPUs or alternatively

>   	  on a per CPU basis.

> +

> +config DTPM

> +	bool "Power capping for dynamic thermal power management"


Maybe starting with capital letters: Dynamic Thermal Power Management?

> +	help

> +	  This enables support for the power capping for the dynamic

> +	  thermal management userspace engine.

>   endif

> diff --git a/drivers/powercap/Makefile b/drivers/powercap/Makefile

> index 7255c94ec61c..6482ac52054d 100644

> --- a/drivers/powercap/Makefile

> +++ b/drivers/powercap/Makefile

> @@ -1,4 +1,5 @@

>   # SPDX-License-Identifier: GPL-2.0-only

> +obj-$(CONFIG_DTPM) += dtpm.o

>   obj-$(CONFIG_POWERCAP)	+= powercap_sys.o

>   obj-$(CONFIG_INTEL_RAPL_CORE) += intel_rapl_common.o

>   obj-$(CONFIG_INTEL_RAPL) += intel_rapl_msr.o

> diff --git a/drivers/powercap/dtpm.c b/drivers/powercap/dtpm.c

> new file mode 100644

> index 000000000000..6df1e51a2c1c

> --- /dev/null

> +++ b/drivers/powercap/dtpm.c

> @@ -0,0 +1,430 @@

> +// SPDX-License-Identifier: GPL-2.0-only

> +/*

> + * Copyright 2020 Linaro Limited

> + *

> + * Author: Daniel Lezcano <daniel.lezcano@linaro.org>

> + *

> + * The powercap based Dynamic Thermal Power Management framework

> + * provides to the userspace a consistent API to set the power limit

> + * on some devices.

> + *

> + * DTPM defines the functions to create a tree of constraints. Each

> + * parent node is a virtual description of the aggregation of the

> + * children. It propagates the constraints set at its level to its

> + * children and collect the children power infomation. The leaves of


s/infomation/information/

> + * the tree are the real devices which have the ability to get their

> + * current power consumption and set their power limit.

> + */

> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

> +

> +#include <linux/dtpm.h>

> +#include <linux/init.h>

> +#include <linux/kernel.h>

> +#include <linux/powercap.h>

> +#include <linux/slab.h>

> +#include <linux/spinlock.h>

> +

> +static const char *constraint_name[] = {

> +	"Instantaneous power constraint",

> +};

> +

> +static struct powercap_control_type *pct;

> +static struct dtpm *root;


I wonder if it safe to have the tree without a global lock for it, like
mutex tree_lock ?
I have put some comments below when the code traverses the tree.

> +

> +static int get_time_window_us(struct powercap_zone *pcz, int cid, u64 *window)

> +{

> +	return -ENOSYS;

> +}

> +

> +static int set_time_window_us(struct powercap_zone *pcz, int cid, u64 window)

> +{

> +	return -ENOSYS;

> +}

> +

> +static int get_max_power_range_uw(struct powercap_zone *pcz, u64 *max_power_uw)

> +{

> +	struct dtpm *dtpm = to_dtpm(pcz);

> +

> +	spin_lock(&dtpm->lock);

> +	*max_power_uw = dtpm->power_max - dtpm->power_min;

> +	spin_unlock(&dtpm->lock);

> +

> +	return 0;

> +}

> +

> +static int get_children_power_uw(struct powercap_zone *pcz, u64 *power_uw)

> +{

> +	struct dtpm *dtpm = to_dtpm(pcz);

> +	struct dtpm *child;

> +	u64 power;

> +	int ret = 0;

> +

> +	*power_uw = 0;

> +

> +	spin_lock(&dtpm->lock);

> +	list_for_each_entry(child, &dtpm->children, sibling) {

> +		ret = child->zone.ops->get_power_uw(&child->zone, &power);

> +		if (ret)

> +			break;

> +		*power_uw += power;

> +	}

> +	spin_unlock(&dtpm->lock);

> +

> +	return ret;

> +}

> +

> +static void __dtpm_rebalance_weight(struct dtpm *dtpm)

> +{

> +	struct dtpm *child;

> +

> +	spin_lock(&dtpm->lock);

> +	list_for_each_entry(child, &dtpm->children, sibling) {

> +

> +		pr_debug("Setting weight '%d' for '%s'\n",

> +			 child->weight, child->zone.name);

> +

> +		child->weight = DIV_ROUND_CLOSEST(child->power_max * 1024,

> +						  dtpm->power_max);

> +

> +		__dtpm_rebalance_weight(child);

> +	}

> +	spin_unlock(&dtpm->lock);

> +}

> +

> +static void dtpm_rebalance_weight(void)

> +{

> +	__dtpm_rebalance_weight(root);

> +}

> +

> +static void dtpm_sub_power(struct dtpm *dtpm)

> +{

> +	struct dtpm *parent = dtpm->parent;

> +

> +	while (parent) {


I am not sure if it is safe for a corner case when the
nodes are removing from bottom to top. We don't hold a tree
lock, so these two (above line and below) operations might
be split/preempted and 'parent' freed before taking the lock.
Is it possible? (Note: I might missed something like double
locking using this local node spinlock).

> +		spin_lock(&parent->lock);

> +		parent->power_min -= dtpm->power_min;

> +		parent->power_max -= dtpm->power_max;

> +		spin_unlock(&parent->lock);

> +		parent = parent->parent;

> +	}

> +

> +	dtpm_rebalance_weight();

> +}

> +

> +static void dtpm_add_power(struct dtpm *dtpm)

> +{

> +	struct dtpm *parent = dtpm->parent;

> +

> +	while (parent) {


Similar here?

> +		spin_lock(&parent->lock);

> +		parent->power_min += dtpm->power_min;

> +		parent->power_max += dtpm->power_max;

> +		spin_unlock(&parent->lock);

> +		parent = parent->parent;

> +	}

> +

> +	dtpm_rebalance_weight();

> +}

> +

> +/**

> + * dtpm_update_power - Update the power on the dtpm

> + * @dtpm: a pointer to a dtpm structure to update

> + * @power_min: a u64 representing the new power_min value

> + * @power_max: a u64 representing the new power_max value

> + *

> + * Function to update the power values of the dtpm node specified in

> + * parameter. These new values will be propagated to the tree.

> + *

> + * Return: zero on success, -EINVAL if the values are inconsistent

> + */

> +int dtpm_update_power(struct dtpm *dtpm, u64 power_min, u64 power_max)

> +{

> +	if (power_min == dtpm->power_min && power_max == dtpm->power_max)

> +		return 0;

> +

> +	if (power_max < power_min)

> +		return -EINVAL;

> +

> +	dtpm_sub_power(dtpm);

> +	spin_lock(&dtpm->lock);

> +	dtpm->power_min = power_min;

> +	dtpm->power_max = power_max;

> +	spin_unlock(&dtpm->lock);

> +	dtpm_add_power(dtpm);

> +

> +	return 0;

> +}

> +

> +/**

> + * dtpm_release_zone - Cleanup when the node is released

> + * @pcz: a pointer to a powercap_zone structure

> + *

> + * Do some housecleaning and update the weight on the tree. The

> + * release will be denied if the node has children. This function must

> + * be called by the specific release callback of the different

> + * backends.

> + *

> + * Return: 0 on success, -EBUSY if there are children

> + */

> +int dtpm_release_zone(struct powercap_zone *pcz)

> +{

> +	struct dtpm *dtpm = to_dtpm(pcz);

> +	struct dtpm *parent = dtpm->parent;

> +


I would lock the whole tree, just to play safe.
What do you think?

> +	if (!list_empty(&dtpm->children))

> +		return -EBUSY;

> +

> +	if (parent) {

> +		spin_lock(&parent->lock);

> +		list_del(&dtpm->sibling);

> +		spin_unlock(&parent->lock);

> +	}

> +

> +	dtpm_sub_power(dtpm);

> +

> +	kfree(dtpm);

> +

> +	return 0;

> +}

> +

> +/*

> + * Set the power limit on the nodes, the power limit is distributed

> + * given the weight of the children.

> + */

> +static int set_children_power_limit(struct powercap_zone *pcz, int cid,

> +				    u64 power_limit)

> +{

> +	struct dtpm *dtpm = to_dtpm(pcz);

> +	struct dtpm *child;

> +	u64 power;

> +	int ret = 0;

> +

> +	/*

> +	 * Don't allow values outside of the power range previously

> +	 * set when initiliazing the power numbers.

> +	 */

> +	power_limit = clamp_val(power_limit, dtpm->power_min, dtpm->power_max);

> +

> +	spin_lock(&dtpm->lock);

> +	list_for_each_entry(child, &dtpm->children, sibling) {

> +

> +		/*

> +		 * Integer division rounding will inevitably lead to a

> +		 * different max value when set several times. In

> +		 * order to restore the initial value, we force the

> +		 * child's max power every time if the constraint is

> +		 * removed by setting a value greater or equal to the

> +		 * max power.

> +		 */

> +		if (power_limit == dtpm->power_max)

> +			power = child->power_max;

> +		else

> +			power = DIV_ROUND_CLOSEST(

> +				power_limit * child->weight, 1024);

> +

> +		pr_debug("Setting power limit for '%s': %llu uW\n",

> +			 child->zone.name, power);

> +

> +		ret = child->zone.constraints->ops->set_power_limit_uw(

> +			&child->zone, cid, power);

> +		if (ret)

> +			break;

> +	}

> +	spin_unlock(&dtpm->lock);

> +

> +	return ret;

> +}

> +

> +static int get_children_power_limit(struct powercap_zone *pcz, int cid,

> +				    u64 *power_limit)

> +{

> +	struct dtpm *dtpm = to_dtpm(pcz);

> +	struct dtpm *child;

> +	u64 power;

> +	int ret = 0;

> +

> +	*power_limit = 0;

> +

> +	spin_lock(&dtpm->lock);

> +	list_for_each_entry(child, &dtpm->children, sibling) {

> +		ret = child->zone.constraints->ops->get_power_limit_uw(

> +			&child->zone, cid, &power);

> +		if (ret)

> +			break;

> +		*power_limit += power;

> +	}

> +	spin_unlock(&dtpm->lock);

> +

> +	return ret;

> +}

> +

> +static const char *get_constraint_name(struct powercap_zone *pcz, int cid)

> +{

> +	return constraint_name[cid];

> +}

> +

> +static int get_max_power_uw(struct powercap_zone *pcz, int id, u64 *max_power)

> +{

> +	struct dtpm *dtpm = to_dtpm(pcz);

> +

> +	spin_lock(&dtpm->lock);

> +	*max_power = dtpm->power_max;

> +	spin_unlock(&dtpm->lock);

> +

> +	return 0;

> +}

> +

> +static struct powercap_zone_constraint_ops constraint_ops = {

> +	.set_power_limit_uw = set_children_power_limit,

> +	.get_power_limit_uw = get_children_power_limit,

> +	.set_time_window_us = set_time_window_us,

> +	.get_time_window_us = get_time_window_us,

> +	.get_max_power_uw = get_max_power_uw,

> +	.get_name = get_constraint_name,

> +};

> +

> +static struct powercap_zone_ops zone_ops = {

> +	.get_max_power_range_uw = get_max_power_range_uw,

> +	.get_power_uw = get_children_power_uw,

> +	.release = dtpm_release_zone,

> +};

> +

> +/**

> + * dtpm_alloc - Allocate and initialize a dtpm struct

> + * @name: a string specifying the name of the node

> + *

> + * Return: a struct dtpm pointer, NULL in case of error

> + */

> +struct dtpm *dtpm_alloc(void)

> +{

> +	struct dtpm *dtpm;

> +

> +	dtpm = kzalloc(sizeof(*dtpm), GFP_KERNEL);

> +	if (dtpm) {

> +		INIT_LIST_HEAD(&dtpm->children);

> +		INIT_LIST_HEAD(&dtpm->sibling);

> +		spin_lock_init(&dtpm->lock);


Why do we use spinlock and not mutex?

> +		dtpm->weight = 1024;

> +	}

> +

> +	return dtpm;

> +}

> +

> +/**

> + * dtpm_unregister - Unregister a dtpm node from the hierarchy tree

> + * @dtpm: a pointer to a dtpm structure corresponding to the node to be removed

> + *

> + * Call the underlying powercap unregister function. That will call

> + * the release callback of the powercap zone.

> + */

> +void dtpm_unregister(struct dtpm *dtpm)

> +{

> +	powercap_unregister_zone(pct, &dtpm->zone);

> +}

> +

> +/**

> + * dtpm_register - Register a dtpm node in the hierarchy tree

> + * @name: a string specifying the name of the node

> + * @dtpm: a pointer to a dtpm structure corresponding to the new node

> + * @parent: a pointer to a dtpm structure corresponding to the parent node

> + * @ops: a pointer to a powercap_zone_ops structure

> + * @nr_constraints: a integer giving the number of constraints supported

> + * @const_ops: a pointer to a powercap_zone_constraint_ops structure

> + *

> + * Create a dtpm node in the tree. If no parent is specified, the node

> + * is the root node of the hierarchy. If the root node already exists,

> + * then the registration will fail. The powercap controller must be

> + * initialized before calling this function.

> + *

> + * The dtpm structure must be initialized with the power numbers

> + * before calling this function.

> + *

> + * Return: zero on success, a negative value in case of error:

> + *  -EAGAIN: the function is called before the framework is initialized.

> + *  -EBUSY: the root node is already inserted

> + *  -EINVAL: there is no root node yet and @parent is specified

> + *   Other negative values are reported back from the powercap framework

> + */

> +int dtpm_register(const char *name, struct dtpm *dtpm, struct dtpm *parent,

> +		  struct powercap_zone_ops *ops, int nr_constraints,

> +		  struct powercap_zone_constraint_ops *const_ops)

> +{

> +	struct powercap_zone *pcz;

> +

> +	if (!pct)

> +		return -EAGAIN;

> +

> +	if (root && !parent)

> +		return -EBUSY;

> +

> +	if (!root && parent)

> +		return -EINVAL;

> +

> +	const_ops->get_name = get_constraint_name;

> +	const_ops->get_max_power_uw = get_max_power_uw;

> +	const_ops->set_time_window_us = set_time_window_us;

> +	const_ops->get_time_window_us = get_time_window_us;

> +

> +	ops->get_max_power_range_uw = get_max_power_range_uw;

> +

> +	pcz = powercap_register_zone(&dtpm->zone, pct, name,

> +				     parent ? &parent->zone : NULL,

> +				     ops, nr_constraints, const_ops);

> +	if (IS_ERR(pcz))

> +		return PTR_ERR(pcz);

> +

> +	if (parent) {

> +		spin_lock(&parent->lock);

> +		list_add_tail(&dtpm->sibling, &parent->children);

> +		spin_unlock(&parent->lock);

> +		dtpm->parent = parent;

> +	} else {

> +		root = dtpm;

> +	}

> +

> +	dtpm_add_power(dtpm);

> +

> +	return 0;

> +}

> +

> +/**

> + * dtpm_register_parent - Register a intermediate node in the tree

> + * @name: a string specifying the name of the node

> + * @dtpm: a pointer to a dtpm structure corresponding to the new node

> + * @parent: a pointer to a dtpm structure corresponding parent's new node

> + *

> + * The function will add an intermediate node corresponding to a

> + * parent to more nodes. Its purpose is to aggregate the children

> + * characteristics and dispatch the constraints. If the @parent

> + * parameter is NULL, then this node becomes the root node of the tree

> + * if there is no root node yet.

> + *

> + * Return: zero on success, a negative value in case of error:

> + *  -EAGAIN: the function is called before the framework is initialized.

> + *  -EBUSY: the root node is already inserted

> + *  -EINVAL: there is not root node yet and @parent is specified

> + *   Other negative values are reported back from the powercap framework

> + */

> +int dtpm_register_parent(const char *name, struct dtpm *dtpm,

> +			 struct dtpm *parent)

> +{

> +	return dtpm_register(name, dtpm, parent, &zone_ops,

> +			     MAX_DTPM_CONSTRAINTS, &constraint_ops);

> +}

> +

> +static int __init dtpm_init(void)

> +{

> +	struct dtpm_descr **dtpm_descr;

> +

> +	pct = powercap_register_control_type(NULL, "dtpm", NULL);

> +	if (!pct) {

> +		pr_err("Failed to register control type\n");

> +		return -EINVAL;

> +	}

> +

> +	for_each_dtpm_table(dtpm_descr)

> +		(*dtpm_descr)->init(*dtpm_descr);


We don't check the returned value here. It is required that the
devices should already be up and running (like cpufreq).
But if for some reason the init() failed, maybe it's worth to add a
field inside the dtpm_desc or dtpm struct like 'bool ready' ?
It could be retried to init later.

> +

> +	return 0;

> +}

> +late_initcall(dtpm_init);


The framework would start operating at late boot. We don't control
the thermal/power issues in earier stages.
Although, at this late stage all other things like cpufreq should be
ready, so the ->init() on them is likely to success.

Regards,
Lukasz

> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h

> index 5430febd34be..29b30976ea02 100644

> --- a/include/asm-generic/vmlinux.lds.h

> +++ b/include/asm-generic/vmlinux.lds.h

> @@ -315,6 +315,16 @@

>   #define THERMAL_TABLE(name)

>   #endif

>   

> +#ifdef CONFIG_DTPM

> +#define DTPM_TABLE()							\

> +	. = ALIGN(8);							\

> +	__dtpm_table = .;						\

> +	KEEP(*(__dtpm_table))						\

> +	__dtpm_table_end = .;

> +#else

> +#define DTPM_TABLE()

> +#endif

> +

>   #define KERNEL_DTB()							\

>   	STRUCT_ALIGN();							\

>   	__dtb_start = .;						\

> @@ -715,6 +725,7 @@

>   	ACPI_PROBE_TABLE(irqchip)					\

>   	ACPI_PROBE_TABLE(timer)						\

>   	THERMAL_TABLE(governor)						\

> +	DTPM_TABLE()							\

>   	EARLYCON_TABLE()						\

>   	LSM_TABLE()							\

>   	EARLY_LSM_TABLE()

> diff --git a/include/linux/dtpm.h b/include/linux/dtpm.h

> new file mode 100644

> index 000000000000..6696bdcfdb87

> --- /dev/null

> +++ b/include/linux/dtpm.h

> @@ -0,0 +1,73 @@

> +/* SPDX-License-Identifier: GPL-2.0-only */

> +/*

> + * Copyright (C) 2020 Linaro Ltd

> + *

> + * Author: Daniel Lezcano <daniel.lezcano@linaro.org>

> + */

> +#ifndef ___DTPM_H__

> +#define ___DTPM_H__

> +

> +#include <linux/of.h>

> +#include <linux/powercap.h>

> +

> +#define MAX_DTPM_DESCR 8

> +#define MAX_DTPM_CONSTRAINTS 1

> +

> +struct dtpm {

> +	struct powercap_zone zone;

> +	struct dtpm *parent;

> +	struct list_head sibling;

> +	struct list_head children;

> +	spinlock_t lock;

> +	u64 power_limit;

> +	u64 power_max;

> +	u64 power_min;

> +	int weight;

> +	void *private;

> +};

> +

> +struct dtpm_descr;

> +

> +typedef int (*dtpm_init_t)(struct dtpm_descr *);

> +

> +struct dtpm_descr {

> +	struct dtpm *parent;

> +	const char *name;

> +	dtpm_init_t init;

> +};

> +

> +/* Init section thermal table */

> +extern struct dtpm_descr *__dtpm_table[];

> +extern struct dtpm_descr *__dtpm_table_end[];

> +

> +#define DTPM_TABLE_ENTRY(name)			\

> +	static typeof(name) *__dtpm_table_entry_##name	\

> +	__used __section(__dtpm_table) = &name

> +

> +#define DTPM_DECLARE(name)	DTPM_TABLE_ENTRY(name)

> +

> +#define for_each_dtpm_table(__dtpm)	\

> +	for (__dtpm = __dtpm_table;	\

> +	     __dtpm < __dtpm_table_end;	\

> +	     __dtpm++)

> +

> +static inline struct dtpm *to_dtpm(struct powercap_zone *zone)

> +{

> +	return container_of(zone, struct dtpm, zone);

> +}

> +

> +int dtpm_update_power(struct dtpm *dtpm, u64 power_min, u64 power_max);

> +

> +int dtpm_release_zone(struct powercap_zone *pcz);

> +

> +struct dtpm *dtpm_alloc(void);

> +

> +void dtpm_unregister(struct dtpm *dtpm);

> +

> +int dtpm_register_parent(const char *name, struct dtpm *dtpm,

> +			 struct dtpm *parent);

> +

> +int dtpm_register(const char *name, struct dtpm *dtpm, struct dtpm *parent,

> +		  struct powercap_zone_ops *ops, int nr_constraints,

> +		  struct powercap_zone_constraint_ops *const_ops);

> +#endif

>
Daniel Lezcano Nov. 3, 2020, 6:42 p.m. UTC | #4
Hi Lukasz,

thanks for the review and the comments.

On 23/10/2020 12:29, Lukasz Luba wrote:
> Hi Daniel,

[ ... ]

>> +
>> +config DTPM
>> +    bool "Power capping for dynamic thermal power management"
> 
> Maybe starting with capital letters: Dynamic Thermal Power Management?

Ok, noted.

[ ... ]

>> +++ b/drivers/powercap/dtpm.c
>> @@ -0,0 +1,430 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright 2020 Linaro Limited
>> + *
>> + * Author: Daniel Lezcano <daniel.lezcano@linaro.org>
>> + *
>> + * The powercap based Dynamic Thermal Power Management framework
>> + * provides to the userspace a consistent API to set the power limit
>> + * on some devices.
>> + *
>> + * DTPM defines the functions to create a tree of constraints. Each
>> + * parent node is a virtual description of the aggregation of the
>> + * children. It propagates the constraints set at its level to its
>> + * children and collect the children power infomation. The leaves of
> 
> s/infomation/information/

Ok, thanks

[ ... ]

>> +static struct powercap_control_type *pct;
>> +static struct dtpm *root;
> 
> I wonder if it safe to have the tree without a global lock for it, like
> mutex tree_lock ?
> I have put some comments below when the code traverses the tree.

The mutex is a heavy lock and the its purpose is to allow the current
process to be preempted while the spinlock is very fast without preemption.

Putting in place a single lock will simplify the code but I'm not sure
it is worth as it could be a contention. It would be simpler to switch
to a big lock than the opposite.

[ ... ]

>> +static void dtpm_rebalance_weight(void)
>> +{
>> +    __dtpm_rebalance_weight(root);
>> +}
>> +
>> +static void dtpm_sub_power(struct dtpm *dtpm)
>> +{
>> +    struct dtpm *parent = dtpm->parent;
>> +
>> +    while (parent) {
> 
> I am not sure if it is safe for a corner case when the
> nodes are removing from bottom to top. We don't hold a tree
> lock, so these two (above line and below) operations might
> be split/preempted and 'parent' freed before taking the lock.
> Is it possible? (Note: I might missed something like double
> locking using this local node spinlock).

The parent can not be freed until it has children, the check is done in
the release node function.

>> +        spin_lock(&parent->lock);
>> +        parent->power_min -= dtpm->power_min;
>> +        parent->power_max -= dtpm->power_max;
>> +        spin_unlock(&parent->lock);
>> +        parent = parent->parent;
>> +    }
>> +
>> +    dtpm_rebalance_weight();
>> +}
>> +
>> +static void dtpm_add_power(struct dtpm *dtpm)
>> +{
>> +    struct dtpm *parent = dtpm->parent;
>> +
>> +    while (parent) {
> 
> Similar here?
> 
>> +        spin_lock(&parent->lock);
>> +        parent->power_min += dtpm->power_min;
>> +        parent->power_max += dtpm->power_max;
>> +        spin_unlock(&parent->lock);
>> +        parent = parent->parent;
>> +    }
>> +
>> +    dtpm_rebalance_weight();
>> +}
>> +
>> +/**
>> + * dtpm_update_power - Update the power on the dtpm
>> + * @dtpm: a pointer to a dtpm structure to update
>> + * @power_min: a u64 representing the new power_min value
>> + * @power_max: a u64 representing the new power_max value
>> + *
>> + * Function to update the power values of the dtpm node specified in
>> + * parameter. These new values will be propagated to the tree.
>> + *
>> + * Return: zero on success, -EINVAL if the values are inconsistent
>> + */
>> +int dtpm_update_power(struct dtpm *dtpm, u64 power_min, u64 power_max)
>> +{
>> +    if (power_min == dtpm->power_min && power_max == dtpm->power_max)
>> +        return 0;
>> +
>> +    if (power_max < power_min)
>> +        return -EINVAL;
>> +
>> +    dtpm_sub_power(dtpm);
>> +    spin_lock(&dtpm->lock);
>> +    dtpm->power_min = power_min;
>> +    dtpm->power_max = power_max;
>> +    spin_unlock(&dtpm->lock);
>> +    dtpm_add_power(dtpm);
>> +
>> +    return 0;
>> +}
>> +
>> +/**
>> + * dtpm_release_zone - Cleanup when the node is released
>> + * @pcz: a pointer to a powercap_zone structure
>> + *
>> + * Do some housecleaning and update the weight on the tree. The
>> + * release will be denied if the node has children. This function must
>> + * be called by the specific release callback of the different
>> + * backends.
>> + *
>> + * Return: 0 on success, -EBUSY if there are children
>> + */
>> +int dtpm_release_zone(struct powercap_zone *pcz)
>> +{
>> +    struct dtpm *dtpm = to_dtpm(pcz);
>> +    struct dtpm *parent = dtpm->parent;
>> +
> 
> I would lock the whole tree, just to play safe.
> What do you think?

I would like to keep the fine grain locking to prevent a potential
contention. If it appears we hit a locking incorrectness or a race
putting in question the fine grain locking scheme, then we can consider
switching to a tree lock.

>> +    if (!list_empty(&dtpm->children))
>> +        return -EBUSY;
>> +
>> +    if (parent) {
>> +        spin_lock(&parent->lock);
>> +        list_del(&dtpm->sibling);
>> +        spin_unlock(&parent->lock);
>> +    }
>> +
>> +    dtpm_sub_power(dtpm);
>> +
>> +    kfree(dtpm);
>> +
>> +    return 0;
>> +}

[ ... ]

>> +struct dtpm *dtpm_alloc(void)
>> +{
>> +    struct dtpm *dtpm;
>> +
>> +    dtpm = kzalloc(sizeof(*dtpm), GFP_KERNEL);
>> +    if (dtpm) {
>> +        INIT_LIST_HEAD(&dtpm->children);
>> +        INIT_LIST_HEAD(&dtpm->sibling);
>> +        spin_lock_init(&dtpm->lock);
> 
> Why do we use spinlock and not mutex?

The mutex will force the calling process to be preempted, that is useful
when the critical sections contains blocking calls.

Here we are just changing values without blocking calls, so using the
spinlock is more adequate as they are faster.

[ ... ]

>> +static int __init dtpm_init(void)
>> +{
>> +    struct dtpm_descr **dtpm_descr;
>> +
>> +    pct = powercap_register_control_type(NULL, "dtpm", NULL);
>> +    if (!pct) {
>> +        pr_err("Failed to register control type\n");
>> +        return -EINVAL;
>> +    }
>> +
>> +    for_each_dtpm_table(dtpm_descr)
>> +        (*dtpm_descr)->init(*dtpm_descr);
> 
> We don't check the returned value here. It is required that the
> devices should already be up and running (like cpufreq).
> But if for some reason the init() failed, maybe it's worth to add a
> field inside the dtpm_desc or dtpm struct like 'bool ready' ?
> It could be retried to init later.

It would be make sense to check the init return value if we want to
rollback what we have done. Here we don't want to do that. If one
subsystem fails to insert itself in the tree, it will log an error but
the tree should continue to give access to what have been successfully
initialized.

The rollback is important in the init() ops, not in dtpm_init().

>> +
>> +    return 0;
>> +}
>> +late_initcall(dtpm_init);
> 
> The framework would start operating at late boot. We don't control
> the thermal/power issues in earier stages.
> Although, at this late stage all other things like cpufreq should be
> ready, so the ->init() on them is likely to success.

Right, the dtpm is accessible through sysfs for an userspace thermal
daemon doing the smart mitigation. So do the initcall can be really late.

[ ... ]

Thanks for the review.

  -- Daniel
Lukasz Luba Nov. 10, 2020, 9:59 a.m. UTC | #5
Hi Daniel,

I've experimented with the patch set and went through the code again.
It looks good, only a few minor comments.

On 10/6/20 1:20 PM, Daniel Lezcano wrote:
> On the embedded world, the complexity of the SoC leads to an

> increasing number of hotspots which need to be monitored and mitigated

> as a whole in order to prevent the temperature to go above the

> normative and legally stated 'skin temperature'.

> 

> Another aspect is to sustain the performance for a given power budget,

> for example virtual reality where the user can feel dizziness if the

> GPU performance is capped while a big CPU is processing something

> else. Or reduce the battery charging because the dissipated power is

> too high compared with the power consumed by other devices.

> 

> The userspace is the most adequate place to dynamically act on the

> different devices by limiting their power given an application

> profile: it has the knowledge of the platform.

> 

> These userspace daemons are in charge of the Dynamic Thermal Power

> Management (DTPM).

> 

> Nowadays, the dtpm daemons are abusing the thermal framework as they

> act on the cooling device state to force a specific and arbitraty


s/arbitraty/arbitrary/

> state without taking care of the governor decisions. Given the closed

> loop of some governors that can confuse the logic or directly enter in

> a decision conflict.

> 

> As the number of cooling device support is limited today to the CPU

> and the GPU, the dtpm daemons have little control on the power

> dissipation of the system. The out of tree solutions are hacking

> around here and there in the drivers, in the frameworks to have

> control on the devices. The common solution is to declare them as

> cooling devices.

> 

> There is no unification of the power limitation unit, opaque states

> are used.

> 

> This patch provides a way to create a hierarchy of constraints using

> the powercap framework. The devices which are registered as power

> limit-able devices are represented in this hierarchy as a tree. They

> are linked together with intermediate nodes which are just there to

> propagate the constraint to the children.

> 

> The leaves of the tree are the real devices, the intermediate nodes

> are virtual, aggregating the children constraints and power

> characteristics.

> 

> Each node have a weight on a 2^10 basis, in order to reflect the

> percentage of power distribution of the children's node. This

> percentage is used to dispatch the power limit to the children.

> 

> The weight is computed against the max power of the siblings.

> 

> This simple approach allows to do a fair distribution of the power

> limit.

> 

> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>

> ---

>   drivers/powercap/Kconfig          |   6 +

>   drivers/powercap/Makefile         |   1 +

>   drivers/powercap/dtpm.c           | 430 ++++++++++++++++++++++++++++++

>   include/asm-generic/vmlinux.lds.h |  11 +

>   include/linux/dtpm.h              |  73 +++++

>   5 files changed, 521 insertions(+)

>   create mode 100644 drivers/powercap/dtpm.c

>   create mode 100644 include/linux/dtpm.h

> 

> diff --git a/drivers/powercap/Kconfig b/drivers/powercap/Kconfig

> index ebc4d4578339..777cf60300b8 100644

> --- a/drivers/powercap/Kconfig

> +++ b/drivers/powercap/Kconfig

> @@ -43,4 +43,10 @@ config IDLE_INJECT

>   	  CPUs for power capping. Idle period can be injected

>   	  synchronously on a set of specified CPUs or alternatively

>   	  on a per CPU basis.

> +

> +config DTPM

> +	bool "Power capping for dynamic thermal power management"

> +	help

> +	  This enables support for the power capping for the dynamic

> +	  thermal management userspace engine.

>   endif

> diff --git a/drivers/powercap/Makefile b/drivers/powercap/Makefile

> index 7255c94ec61c..6482ac52054d 100644

> --- a/drivers/powercap/Makefile

> +++ b/drivers/powercap/Makefile

> @@ -1,4 +1,5 @@

>   # SPDX-License-Identifier: GPL-2.0-only

> +obj-$(CONFIG_DTPM) += dtpm.o

>   obj-$(CONFIG_POWERCAP)	+= powercap_sys.o

>   obj-$(CONFIG_INTEL_RAPL_CORE) += intel_rapl_common.o

>   obj-$(CONFIG_INTEL_RAPL) += intel_rapl_msr.o

> diff --git a/drivers/powercap/dtpm.c b/drivers/powercap/dtpm.c

> new file mode 100644

> index 000000000000..6df1e51a2c1c

> --- /dev/null

> +++ b/drivers/powercap/dtpm.c

> @@ -0,0 +1,430 @@

> +// SPDX-License-Identifier: GPL-2.0-only

> +/*

> + * Copyright 2020 Linaro Limited

> + *

> + * Author: Daniel Lezcano <daniel.lezcano@linaro.org>

> + *

> + * The powercap based Dynamic Thermal Power Management framework

> + * provides to the userspace a consistent API to set the power limit

> + * on some devices.

> + *

> + * DTPM defines the functions to create a tree of constraints. Each

> + * parent node is a virtual description of the aggregation of the

> + * children. It propagates the constraints set at its level to its

> + * children and collect the children power infomation. The leaves of

> + * the tree are the real devices which have the ability to get their

> + * current power consumption and set their power limit.

> + */

> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

> +

> +#include <linux/dtpm.h>

> +#include <linux/init.h>

> +#include <linux/kernel.h>

> +#include <linux/powercap.h>

> +#include <linux/slab.h>

> +#include <linux/spinlock.h>

> +

> +static const char *constraint_name[] = {

> +	"Instantaneous power constraint",


Quite long name, max is 30, we would lose new line [1].

> +};

> +

> +static struct powercap_control_type *pct;

> +static struct dtpm *root;

> +

> +static int get_time_window_us(struct powercap_zone *pcz, int cid, u64 *window)

> +{

> +	return -ENOSYS;

> +}

> +

> +static int set_time_window_us(struct powercap_zone *pcz, int cid, u64 window)

> +{

> +	return -ENOSYS;

> +}

> +

> +static int get_max_power_range_uw(struct powercap_zone *pcz, u64 *max_power_uw)

> +{

> +	struct dtpm *dtpm = to_dtpm(pcz);

> +

> +	spin_lock(&dtpm->lock);

> +	*max_power_uw = dtpm->power_max - dtpm->power_min;

> +	spin_unlock(&dtpm->lock);

> +

> +	return 0;

> +}

> +

> +static int get_children_power_uw(struct powercap_zone *pcz, u64 *power_uw)

> +{

> +	struct dtpm *dtpm = to_dtpm(pcz);

> +	struct dtpm *child;

> +	u64 power;

> +	int ret = 0;

> +

> +	*power_uw = 0;

> +

> +	spin_lock(&dtpm->lock);

> +	list_for_each_entry(child, &dtpm->children, sibling) {

> +		ret = child->zone.ops->get_power_uw(&child->zone, &power);

> +		if (ret)

> +			break;

> +		*power_uw += power;

> +	}

> +	spin_unlock(&dtpm->lock);

> +

> +	return ret;

> +}

> +

> +static void __dtpm_rebalance_weight(struct dtpm *dtpm)

> +{

> +	struct dtpm *child;

> +

> +	spin_lock(&dtpm->lock);

> +	list_for_each_entry(child, &dtpm->children, sibling) {

> +

> +		pr_debug("Setting weight '%d' for '%s'\n",

> +			 child->weight, child->zone.name);

> +

> +		child->weight = DIV_ROUND_CLOSEST(child->power_max * 1024,

> +						  dtpm->power_max);

> +

> +		__dtpm_rebalance_weight(child);

> +	}

> +	spin_unlock(&dtpm->lock);

> +}

> +

> +static void dtpm_rebalance_weight(void)

> +{

> +	__dtpm_rebalance_weight(root);

> +}

> +

> +static void dtpm_sub_power(struct dtpm *dtpm)

> +{

> +	struct dtpm *parent = dtpm->parent;

> +

> +	while (parent) {

> +		spin_lock(&parent->lock);

> +		parent->power_min -= dtpm->power_min;

> +		parent->power_max -= dtpm->power_max;

> +		spin_unlock(&parent->lock);

> +		parent = parent->parent;

> +	}

> +

> +	dtpm_rebalance_weight();

> +}

> +

> +static void dtpm_add_power(struct dtpm *dtpm)

> +{

> +	struct dtpm *parent = dtpm->parent;

> +

> +	while (parent) {

> +		spin_lock(&parent->lock);

> +		parent->power_min += dtpm->power_min;

> +		parent->power_max += dtpm->power_max;

> +		spin_unlock(&parent->lock);

> +		parent = parent->parent;

> +	}

> +

> +	dtpm_rebalance_weight();

> +}

> +

> +/**

> + * dtpm_update_power - Update the power on the dtpm

> + * @dtpm: a pointer to a dtpm structure to update

> + * @power_min: a u64 representing the new power_min value

> + * @power_max: a u64 representing the new power_max value

> + *

> + * Function to update the power values of the dtpm node specified in

> + * parameter. These new values will be propagated to the tree.

> + *

> + * Return: zero on success, -EINVAL if the values are inconsistent

> + */

> +int dtpm_update_power(struct dtpm *dtpm, u64 power_min, u64 power_max)

> +{

> +	if (power_min == dtpm->power_min && power_max == dtpm->power_max)

> +		return 0;

> +

> +	if (power_max < power_min)

> +		return -EINVAL;

> +

> +	dtpm_sub_power(dtpm);

> +	spin_lock(&dtpm->lock);

> +	dtpm->power_min = power_min;

> +	dtpm->power_max = power_max;

> +	spin_unlock(&dtpm->lock);

> +	dtpm_add_power(dtpm);

> +

> +	return 0;

> +}

> +

> +/**

> + * dtpm_release_zone - Cleanup when the node is released

> + * @pcz: a pointer to a powercap_zone structure

> + *

> + * Do some housecleaning and update the weight on the tree. The

> + * release will be denied if the node has children. This function must

> + * be called by the specific release callback of the different

> + * backends.

> + *

> + * Return: 0 on success, -EBUSY if there are children

> + */

> +int dtpm_release_zone(struct powercap_zone *pcz)

> +{

> +	struct dtpm *dtpm = to_dtpm(pcz);

> +	struct dtpm *parent = dtpm->parent;

> +

> +	if (!list_empty(&dtpm->children))

> +		return -EBUSY;

> +

> +	if (parent) {

> +		spin_lock(&parent->lock);

> +		list_del(&dtpm->sibling);

> +		spin_unlock(&parent->lock);

> +	}

> +

> +	dtpm_sub_power(dtpm);

> +

> +	kfree(dtpm);

> +

> +	return 0;

> +}

> +

> +/*

> + * Set the power limit on the nodes, the power limit is distributed

> + * given the weight of the children.

> + */

> +static int set_children_power_limit(struct powercap_zone *pcz, int cid,

> +				    u64 power_limit)

> +{

> +	struct dtpm *dtpm = to_dtpm(pcz);

> +	struct dtpm *child;

> +	u64 power;

> +	int ret = 0;

> +

> +	/*

> +	 * Don't allow values outside of the power range previously

> +	 * set when initiliazing the power numbers.


s/initiliazing/initializing

> +	 */

> +	power_limit = clamp_val(power_limit, dtpm->power_min, dtpm->power_max);

> +

> +	spin_lock(&dtpm->lock);

> +	list_for_each_entry(child, &dtpm->children, sibling) {

> +

> +		/*

> +		 * Integer division rounding will inevitably lead to a

> +		 * different max value when set several times. In

> +		 * order to restore the initial value, we force the

> +		 * child's max power every time if the constraint is

> +		 * removed by setting a value greater or equal to the

> +		 * max power.

> +		 */

> +		if (power_limit == dtpm->power_max)

> +			power = child->power_max;

> +		else

> +			power = DIV_ROUND_CLOSEST(

> +				power_limit * child->weight, 1024);

> +

> +		pr_debug("Setting power limit for '%s': %llu uW\n",

> +			 child->zone.name, power);

> +

> +		ret = child->zone.constraints->ops->set_power_limit_uw(

> +			&child->zone, cid, power);

> +		if (ret)

> +			break;

> +	}

> +	spin_unlock(&dtpm->lock);

> +

> +	return ret;

> +}

> +

> +static int get_children_power_limit(struct powercap_zone *pcz, int cid,

> +				    u64 *power_limit)

> +{

> +	struct dtpm *dtpm = to_dtpm(pcz);

> +	struct dtpm *child;

> +	u64 power;

> +	int ret = 0;

> +

> +	*power_limit = 0;

> +

> +	spin_lock(&dtpm->lock);

> +	list_for_each_entry(child, &dtpm->children, sibling) {

> +		ret = child->zone.constraints->ops->get_power_limit_uw(

> +			&child->zone, cid, &power);

> +		if (ret)

> +			break;

> +		*power_limit += power;

> +	}

> +	spin_unlock(&dtpm->lock);

> +

> +	return ret;

> +}

> +

> +static const char *get_constraint_name(struct powercap_zone *pcz, int cid)

> +{

> +	return constraint_name[cid];

> +}

> +

> +static int get_max_power_uw(struct powercap_zone *pcz, int id, u64 *max_power)

> +{

> +	struct dtpm *dtpm = to_dtpm(pcz);

> +

> +	spin_lock(&dtpm->lock);

> +	*max_power = dtpm->power_max;

> +	spin_unlock(&dtpm->lock);

> +

> +	return 0;

> +}

> +

> +static struct powercap_zone_constraint_ops constraint_ops = {

> +	.set_power_limit_uw = set_children_power_limit,

> +	.get_power_limit_uw = get_children_power_limit,

> +	.set_time_window_us = set_time_window_us,

> +	.get_time_window_us = get_time_window_us,

> +	.get_max_power_uw = get_max_power_uw,

> +	.get_name = get_constraint_name,

> +};

> +

> +static struct powercap_zone_ops zone_ops = {

> +	.get_max_power_range_uw = get_max_power_range_uw,

> +	.get_power_uw = get_children_power_uw,

> +	.release = dtpm_release_zone,

> +};

> +

> +/**

> + * dtpm_alloc - Allocate and initialize a dtpm struct

> + * @name: a string specifying the name of the node

> + *

> + * Return: a struct dtpm pointer, NULL in case of error

> + */

> +struct dtpm *dtpm_alloc(void)

> +{

> +	struct dtpm *dtpm;

> +

> +	dtpm = kzalloc(sizeof(*dtpm), GFP_KERNEL);

> +	if (dtpm) {

> +		INIT_LIST_HEAD(&dtpm->children);

> +		INIT_LIST_HEAD(&dtpm->sibling);

> +		spin_lock_init(&dtpm->lock);

> +		dtpm->weight = 1024;

> +	}

> +

> +	return dtpm;

> +}

> +

> +/**

> + * dtpm_unregister - Unregister a dtpm node from the hierarchy tree

> + * @dtpm: a pointer to a dtpm structure corresponding to the node to be removed

> + *

> + * Call the underlying powercap unregister function. That will call

> + * the release callback of the powercap zone.

> + */

> +void dtpm_unregister(struct dtpm *dtpm)

> +{

> +	powercap_unregister_zone(pct, &dtpm->zone);

> +}

> +

> +/**

> + * dtpm_register - Register a dtpm node in the hierarchy tree

> + * @name: a string specifying the name of the node

> + * @dtpm: a pointer to a dtpm structure corresponding to the new node

> + * @parent: a pointer to a dtpm structure corresponding to the parent node

> + * @ops: a pointer to a powercap_zone_ops structure

> + * @nr_constraints: a integer giving the number of constraints supported

> + * @const_ops: a pointer to a powercap_zone_constraint_ops structure

> + *

> + * Create a dtpm node in the tree. If no parent is specified, the node

> + * is the root node of the hierarchy. If the root node already exists,

> + * then the registration will fail. The powercap controller must be

> + * initialized before calling this function.

> + *

> + * The dtpm structure must be initialized with the power numbers

> + * before calling this function.

> + *

> + * Return: zero on success, a negative value in case of error:

> + *  -EAGAIN: the function is called before the framework is initialized.

> + *  -EBUSY: the root node is already inserted

> + *  -EINVAL: there is no root node yet and @parent is specified

> + *   Other negative values are reported back from the powercap framework

> + */

> +int dtpm_register(const char *name, struct dtpm *dtpm, struct dtpm *parent,

> +		  struct powercap_zone_ops *ops, int nr_constraints,

> +		  struct powercap_zone_constraint_ops *const_ops)

> +{

> +	struct powercap_zone *pcz;

> +

> +	if (!pct)

> +		return -EAGAIN;

> +

> +	if (root && !parent)

> +		return -EBUSY;

> +

> +	if (!root && parent)

> +		return -EINVAL;

> +

> +	const_ops->get_name = get_constraint_name;

> +	const_ops->get_max_power_uw = get_max_power_uw;

> +	const_ops->set_time_window_us = set_time_window_us;

> +	const_ops->get_time_window_us = get_time_window_us;

> +

> +	ops->get_max_power_range_uw = get_max_power_range_uw;

> +

> +	pcz = powercap_register_zone(&dtpm->zone, pct, name,

> +				     parent ? &parent->zone : NULL,

> +				     ops, nr_constraints, const_ops);

> +	if (IS_ERR(pcz))

> +		return PTR_ERR(pcz);

> +

> +	if (parent) {

> +		spin_lock(&parent->lock);

> +		list_add_tail(&dtpm->sibling, &parent->children);

> +		spin_unlock(&parent->lock);

> +		dtpm->parent = parent;

> +	} else {

> +		root = dtpm;

> +	}

> +

> +	dtpm_add_power(dtpm);

> +

> +	return 0;

> +}

> +

> +/**

> + * dtpm_register_parent - Register a intermediate node in the tree

> + * @name: a string specifying the name of the node

> + * @dtpm: a pointer to a dtpm structure corresponding to the new node

> + * @parent: a pointer to a dtpm structure corresponding parent's new node

> + *

> + * The function will add an intermediate node corresponding to a

> + * parent to more nodes. Its purpose is to aggregate the children

> + * characteristics and dispatch the constraints. If the @parent

> + * parameter is NULL, then this node becomes the root node of the tree

> + * if there is no root node yet.

> + *

> + * Return: zero on success, a negative value in case of error:

> + *  -EAGAIN: the function is called before the framework is initialized.

> + *  -EBUSY: the root node is already inserted

> + *  -EINVAL: there is not root node yet and @parent is specified

> + *   Other negative values are reported back from the powercap framework

> + */

> +int dtpm_register_parent(const char *name, struct dtpm *dtpm,

> +			 struct dtpm *parent)

> +{

> +	return dtpm_register(name, dtpm, parent, &zone_ops,

> +			     MAX_DTPM_CONSTRAINTS, &constraint_ops);

> +}

> +

> +static int __init dtpm_init(void)

> +{

> +	struct dtpm_descr **dtpm_descr;

> +

> +	pct = powercap_register_control_type(NULL, "dtpm", NULL);

> +	if (!pct) {

> +		pr_err("Failed to register control type\n");

> +		return -EINVAL;

> +	}

> +

> +	for_each_dtpm_table(dtpm_descr)

> +		(*dtpm_descr)->init(*dtpm_descr);

> +

> +	return 0;

> +}

> +late_initcall(dtpm_init);

> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h

> index 5430febd34be..29b30976ea02 100644

> --- a/include/asm-generic/vmlinux.lds.h

> +++ b/include/asm-generic/vmlinux.lds.h

> @@ -315,6 +315,16 @@

>   #define THERMAL_TABLE(name)

>   #endif

>   

> +#ifdef CONFIG_DTPM

> +#define DTPM_TABLE()							\

> +	. = ALIGN(8);							\

> +	__dtpm_table = .;						\

> +	KEEP(*(__dtpm_table))						\

> +	__dtpm_table_end = .;

> +#else

> +#define DTPM_TABLE()

> +#endif

> +

>   #define KERNEL_DTB()							\

>   	STRUCT_ALIGN();							\

>   	__dtb_start = .;						\

> @@ -715,6 +725,7 @@

>   	ACPI_PROBE_TABLE(irqchip)					\

>   	ACPI_PROBE_TABLE(timer)						\

>   	THERMAL_TABLE(governor)						\

> +	DTPM_TABLE()							\

>   	EARLYCON_TABLE()						\

>   	LSM_TABLE()							\

>   	EARLY_LSM_TABLE()

> diff --git a/include/linux/dtpm.h b/include/linux/dtpm.h

> new file mode 100644

> index 000000000000..6696bdcfdb87

> --- /dev/null

> +++ b/include/linux/dtpm.h

> @@ -0,0 +1,73 @@

> +/* SPDX-License-Identifier: GPL-2.0-only */

> +/*

> + * Copyright (C) 2020 Linaro Ltd

> + *

> + * Author: Daniel Lezcano <daniel.lezcano@linaro.org>

> + */

> +#ifndef ___DTPM_H__

> +#define ___DTPM_H__

> +

> +#include <linux/of.h>

> +#include <linux/powercap.h>

> +

> +#define MAX_DTPM_DESCR 8

> +#define MAX_DTPM_CONSTRAINTS 1

> +

> +struct dtpm {

> +	struct powercap_zone zone;

> +	struct dtpm *parent;

> +	struct list_head sibling;

> +	struct list_head children;

> +	spinlock_t lock;

> +	u64 power_limit;

> +	u64 power_max;

> +	u64 power_min;

> +	int weight;

> +	void *private;

> +};

> +

> +struct dtpm_descr;

> +

> +typedef int (*dtpm_init_t)(struct dtpm_descr *);

> +

> +struct dtpm_descr {

> +	struct dtpm *parent;

> +	const char *name;

> +	dtpm_init_t init;

> +};

> +

> +/* Init section thermal table */

> +extern struct dtpm_descr *__dtpm_table[];

> +extern struct dtpm_descr *__dtpm_table_end[];

> +

> +#define DTPM_TABLE_ENTRY(name)			\

> +	static typeof(name) *__dtpm_table_entry_##name	\

> +	__used __section(__dtpm_table) = &name


I had to change the section name to string, to pass compilation:
__used __section("__dtpm_table") = &name
I don't know if it's my compiler or configuration.

I've tried to register this DTPM in scmi-cpufreq.c with macro
proposed in patch 4/4 commit message, but I might missed some
important includes there...

> +

> +#define DTPM_DECLARE(name)	DTPM_TABLE_ENTRY(name)

> +

> +#define for_each_dtpm_table(__dtpm)	\

> +	for (__dtpm = __dtpm_table;	\

> +	     __dtpm < __dtpm_table_end;	\

> +	     __dtpm++)

> +

> +static inline struct dtpm *to_dtpm(struct powercap_zone *zone)

> +{

> +	return container_of(zone, struct dtpm, zone);

> +}

> +

> +int dtpm_update_power(struct dtpm *dtpm, u64 power_min, u64 power_max);

> +

> +int dtpm_release_zone(struct powercap_zone *pcz);

> +

> +struct dtpm *dtpm_alloc(void);

> +

> +void dtpm_unregister(struct dtpm *dtpm);

> +

> +int dtpm_register_parent(const char *name, struct dtpm *dtpm,

> +			 struct dtpm *parent);

> +

> +int dtpm_register(const char *name, struct dtpm *dtpm, struct dtpm *parent,

> +		  struct powercap_zone_ops *ops, int nr_constraints,

> +		  struct powercap_zone_constraint_ops *const_ops);

> +#endif

> 


Minor comment. This new framework deserves more debug prints, especially
in registration/unregistration paths. I had to put some, to test it.
But it can be done later as well, after it gets into mainline.

I have also run different hotplug stress tests to check this tree
locking. The userspace process constantly reading these values, while
the last CPU in the cluster was going on/off and node was detaching.
I haven't seen any problems, but the tree wasn't so deep.
Everything was calculated properly, no error, null pointers, etc.

Apart from the spelling minor issues and the long constraint name, LGTM

Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>

Tested-by: Lukasz Luba <lukasz.luba@arm.com>


Regards,
Lukasz

[1] 
https://lore.kernel.org/linux-pm/20201109172452.6923-1-lukasz.luba@arm.com/
Lukasz Luba Nov. 10, 2020, 11:05 a.m. UTC | #6
Actually I've found one issue when I have been trying to clean
my testing branch with modified scmi-cpufreq.c.


On 11/10/20 9:59 AM, Lukasz Luba wrote:
> Hi Daniel,

> 

> I've experimented with the patch set and went through the code again.

> It looks good, only a few minor comments.

> 

> On 10/6/20 1:20 PM, Daniel Lezcano wrote:

>> On the embedded world, the complexity of the SoC leads to an

>> increasing number of hotspots which need to be monitored and mitigated

>> as a whole in order to prevent the temperature to go above the

>> normative and legally stated 'skin temperature'.


[snip]

>> diff --git a/include/linux/dtpm.h b/include/linux/dtpm.h

>> new file mode 100644

>> index 000000000000..6696bdcfdb87

>> --- /dev/null

>> +++ b/include/linux/dtpm.h

>> @@ -0,0 +1,73 @@

>> +/* SPDX-License-Identifier: GPL-2.0-only */

>> +/*

>> + * Copyright (C) 2020 Linaro Ltd

>> + *

>> + * Author: Daniel Lezcano <daniel.lezcano@linaro.org>

>> + */

>> +#ifndef ___DTPM_H__

>> +#define ___DTPM_H__

>> +

>> +#include <linux/of.h>

>> +#include <linux/powercap.h>

>> +

>> +#define MAX_DTPM_DESCR 8

>> +#define MAX_DTPM_CONSTRAINTS 1

>> +

>> +struct dtpm {

>> +    struct powercap_zone zone;

>> +    struct dtpm *parent;

>> +    struct list_head sibling;

>> +    struct list_head children;

>> +    spinlock_t lock;

>> +    u64 power_limit;

>> +    u64 power_max;

>> +    u64 power_min;

>> +    int weight;

>> +    void *private;

>> +};

>> +

>> +struct dtpm_descr;

>> +

>> +typedef int (*dtpm_init_t)(struct dtpm_descr *);

>> +

>> +struct dtpm_descr {

>> +    struct dtpm *parent;

>> +    const char *name;

>> +    dtpm_init_t init;

>> +};

>> +

>> +/* Init section thermal table */

>> +extern struct dtpm_descr *__dtpm_table[];

>> +extern struct dtpm_descr *__dtpm_table_end[];

>> +

>> +#define DTPM_TABLE_ENTRY(name)            \

>> +    static typeof(name) *__dtpm_table_entry_##name    \

>> +    __used __section(__dtpm_table) = &name

> 

> I had to change the section name to string, to pass compilation:

> __used __section("__dtpm_table") = &name

> I don't know if it's my compiler or configuration.

> 

> I've tried to register this DTPM in scmi-cpufreq.c with macro

> proposed in patch 4/4 commit message, but I might missed some

> important includes there...

> 

>> +

>> +#define DTPM_DECLARE(name)    DTPM_TABLE_ENTRY(name)

>> +

>> +#define for_each_dtpm_table(__dtpm)    \

>> +    for (__dtpm = __dtpm_table;    \

>> +         __dtpm < __dtpm_table_end;    \

>> +         __dtpm++)

>> +

>> +static inline struct dtpm *to_dtpm(struct powercap_zone *zone)

>> +{

>> +    return container_of(zone, struct dtpm, zone);

>> +}

>> +

>> +int dtpm_update_power(struct dtpm *dtpm, u64 power_min, u64 power_max);

>> +

>> +int dtpm_release_zone(struct powercap_zone *pcz);

>> +

>> +struct dtpm *dtpm_alloc(void);

>> +

>> +void dtpm_unregister(struct dtpm *dtpm);

>> +

>> +int dtpm_register_parent(const char *name, struct dtpm *dtpm,

>> +             struct dtpm *parent);

>> +

>> +int dtpm_register(const char *name, struct dtpm *dtpm, struct dtpm 

>> *parent,

>> +          struct powercap_zone_ops *ops, int nr_constraints,

>> +          struct powercap_zone_constraint_ops *const_ops);


This header is missing
#ifdef CONFIG_DTPM with static inline functions and empty DTPM_DECLARE()
macro.
I got these issues, when my testing code in scmi-cpufreq.c was compiled
w/o CONFIG_DTPM and DTPM_CPU

/usr/bin/aarch64-linux-gnu-ld: warning: orphan section `__dtpm_table' 
from `drivers/cpufreq/scmi-cpufreq.o' being placed in section 
`__dtpm_table'.
/usr/bin/aarch64-linux-gnu-ld: Unexpected GOT/PLT entries detected!
/usr/bin/aarch64-linux-gnu-ld: Unexpected run-time procedure linkages 
detected!
drivers/cpufreq/scmi-cpufreq.o: In function `dtpm_register_pkg':
/data/linux/drivers/cpufreq/scmi-cpufreq.c:272: undefined reference to 
`dtpm_alloc'
/data/linux/drivers/cpufreq/scmi-cpufreq.c:276: undefined reference to 
`dtpm_register_parent'
/data/linux/drivers/cpufreq/scmi-cpufreq.c:280: undefined reference to 
`dtpm_register_cpu'
Makefile:1164: recipe for target 'vmlinux' failed


The diff bellow fixed my issues. Then I had one for patch 4/4
for static inline int dtpm_register_cpu() function. I've followed the
thermal.h scheme with -ENODEV, but you can choose different approach.
--------------------------8<---------------------------------------------
diff --git a/include/linux/dtpm.h b/include/linux/dtpm.h
index 6696bdcfdb87..0ef784ca5d0b 100644
--- a/include/linux/dtpm.h
+++ b/include/linux/dtpm.h
@@ -40,6 +40,7 @@ struct dtpm_descr {
  extern struct dtpm_descr *__dtpm_table[];
  extern struct dtpm_descr *__dtpm_table_end[];

+#ifdef CONFIG_DTPM
  #define DTPM_TABLE_ENTRY(name)			\
  	static typeof(name) *__dtpm_table_entry_##name	\
  	__used __section(__dtpm_table) = &name
@@ -70,4 +71,36 @@ int dtpm_register_parent(const char *name, struct 
dtpm *dtpm,
  int dtpm_register(const char *name, struct dtpm *dtpm, struct dtpm 
*parent,
  		  struct powercap_zone_ops *ops, int nr_constraints,
  		  struct powercap_zone_constraint_ops *const_ops);
-#endif
+#else
+#define DTPM_DECLARE(name)
+static inline
+int dtpm_update_power(struct dtpm *dtpm, u64 power_min, u64 power_max)
+{
+	return -ENODEV;
+}
+static inline int dtpm_release_zone(struct powercap_zone *pcz)
+{
+	return -ENODEV;
+}
+static inline struct dtpm *dtpm_alloc(void)
+{
+	return ERR_PTR(-ENODEV);
+}
+static inline void dtpm_unregister(struct dtpm *dtpm)
+{ }
+static inline
+int dtpm_register_parent(const char *name, struct dtpm *dtpm,
+			 struct dtpm *parent)
+{
+	return -ENODEV;
+}
+static inline
+int dtpm_register(const char *name, struct dtpm *dtpm, struct dtpm *parent,
+		  struct powercap_zone_ops *ops, int nr_constraints,
+		  struct powercap_zone_constraint_ops *const_ops)
+{
+	return -ENODEV;
+}
+#endif /* CONFIG_DTPM */
+
+#endif /* __DTPM_H__ */

----------------------------->8-------------------------------------------


>> +#endif

>>

> 

> Minor comment. This new framework deserves more debug prints, especially

> in registration/unregistration paths. I had to put some, to test it.

> But it can be done later as well, after it gets into mainline.

> 

> I have also run different hotplug stress tests to check this tree

> locking. The userspace process constantly reading these values, while

> the last CPU in the cluster was going on/off and node was detaching.

> I haven't seen any problems, but the tree wasn't so deep.

> Everything was calculated properly, no error, null pointers, etc.

> 

> Apart from the spelling minor issues and the long constraint name, LGTM

> 

> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> > Tested-by: Lukasz Luba <lukasz.luba@arm.com>


Please ignore these two for a while. But if you decide to take this diff 
above, you can add these two tags then in v2.
This is the only issue that I see.

Regards,
Lukasz
Daniel Lezcano Nov. 10, 2020, 12:55 p.m. UTC | #7
Hi Lukasz,

thanks for the review

On 10/11/2020 10:59, Lukasz Luba wrote:

[ ... ]

>> +/* Init section thermal table */

>> +extern struct dtpm_descr *__dtpm_table[];

>> +extern struct dtpm_descr *__dtpm_table_end[];

>> +

>> +#define DTPM_TABLE_ENTRY(name)            \

>> +    static typeof(name) *__dtpm_table_entry_##name    \

>> +    __used __section(__dtpm_table) = &name

> 

> I had to change the section name to string, to pass compilation:

> __used __section("__dtpm_table") = &name

> I don't know if it's my compiler or configuration.


Actually, it is:

commit 33def8498fdde180023444b08e12b72a9efed41d
Author: Joe Perches <joe@perches.com>
Date:   Wed Oct 21 19:36:07 2020 -0700

    treewide: Convert macro and uses of __section(foo) to __section("foo")

Your change is correct, I've noticed it a few days ago when rebasing the
series.

> I've tried to register this DTPM in scmi-cpufreq.c with macro

> proposed in patch 4/4 commit message, but I might missed some

> important includes there...

> 

>> +

>> +#define DTPM_DECLARE(name)    DTPM_TABLE_ENTRY(name)

>> +

>> +#define for_each_dtpm_table(__dtpm)    \

>> +    for (__dtpm = __dtpm_table;    \

>> +         __dtpm < __dtpm_table_end;    \

>> +         __dtpm++)

>> +

>> +static inline struct dtpm *to_dtpm(struct powercap_zone *zone)

>> +{

>> +    return container_of(zone, struct dtpm, zone);

>> +}

>> +

>> +int dtpm_update_power(struct dtpm *dtpm, u64 power_min, u64 power_max);

>> +

>> +int dtpm_release_zone(struct powercap_zone *pcz);

>> +

>> +struct dtpm *dtpm_alloc(void);

>> +

>> +void dtpm_unregister(struct dtpm *dtpm);

>> +

>> +int dtpm_register_parent(const char *name, struct dtpm *dtpm,

>> +             struct dtpm *parent);

>> +

>> +int dtpm_register(const char *name, struct dtpm *dtpm, struct dtpm

>> *parent,

>> +          struct powercap_zone_ops *ops, int nr_constraints,

>> +          struct powercap_zone_constraint_ops *const_ops);

>> +#endif

>>

> 

> Minor comment. This new framework deserves more debug prints, especially

> in registration/unregistration paths. I had to put some, to test it.

> But it can be done later as well, after it gets into mainline.


Ok, I will add some debug traces.

> I have also run different hotplug stress tests to check this tree

> locking. The userspace process constantly reading these values, while

> the last CPU in the cluster was going on/off and node was detaching.

> I haven't seen any problems, but the tree wasn't so deep.

> Everything was calculated properly, no error, null pointers, etc.


Great! thank you very much for this test

> Apart from the spelling minor issues and the long constraint name, LGTM

> 

> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>

> Tested-by: Lukasz Luba <lukasz.luba@arm.com>


Thanks for the review

  -- Daniel


-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
Daniel Lezcano Nov. 10, 2020, 2:59 p.m. UTC | #8
On 10/11/2020 12:05, Lukasz Luba wrote:
> 

> Actually I've found one issue when I have been trying to clean

> my testing branch with modified scmi-cpufreq.c.


IMO, those errors are not the dtpm framework fault but the scmi-cpufreq.

You should add a component in the drivers/powercap which does the glue
between the scmi-cpufreq and the dtpm. No stub will be needed in this
case as the component will depend on CONFIG_DTPM.




-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
Lukasz Luba Nov. 10, 2020, 3:04 p.m. UTC | #9
On 11/10/20 2:59 PM, Daniel Lezcano wrote:
> On 10/11/2020 12:05, Lukasz Luba wrote:

>>

>> Actually I've found one issue when I have been trying to clean

>> my testing branch with modified scmi-cpufreq.c.

> 

> IMO, those errors are not the dtpm framework fault but the scmi-cpufreq.


True, I have added this proposed macro directly into driver, but it's
not strictly the framework.

> 

> You should add a component in the drivers/powercap which does the glue

> between the scmi-cpufreq and the dtpm. No stub will be needed in this

> case as the component will depend on CONFIG_DTPM.

> 

> 

> 

> 


Make sense, the glue stick should take care in this scenario.

In this case, please keep the Reviewed-by and Tested-by and ignore
the previous email.
diff mbox series

Patch

diff --git a/drivers/powercap/Kconfig b/drivers/powercap/Kconfig
index ebc4d4578339..777cf60300b8 100644
--- a/drivers/powercap/Kconfig
+++ b/drivers/powercap/Kconfig
@@ -43,4 +43,10 @@  config IDLE_INJECT
 	  CPUs for power capping. Idle period can be injected
 	  synchronously on a set of specified CPUs or alternatively
 	  on a per CPU basis.
+
+config DTPM
+	bool "Power capping for dynamic thermal power management"
+	help
+	  This enables support for the power capping for the dynamic
+	  thermal management userspace engine.
 endif
diff --git a/drivers/powercap/Makefile b/drivers/powercap/Makefile
index 7255c94ec61c..6482ac52054d 100644
--- a/drivers/powercap/Makefile
+++ b/drivers/powercap/Makefile
@@ -1,4 +1,5 @@ 
 # SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_DTPM) += dtpm.o
 obj-$(CONFIG_POWERCAP)	+= powercap_sys.o
 obj-$(CONFIG_INTEL_RAPL_CORE) += intel_rapl_common.o
 obj-$(CONFIG_INTEL_RAPL) += intel_rapl_msr.o
diff --git a/drivers/powercap/dtpm.c b/drivers/powercap/dtpm.c
new file mode 100644
index 000000000000..6df1e51a2c1c
--- /dev/null
+++ b/drivers/powercap/dtpm.c
@@ -0,0 +1,430 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright 2020 Linaro Limited
+ *
+ * Author: Daniel Lezcano <daniel.lezcano@linaro.org>
+ *
+ * The powercap based Dynamic Thermal Power Management framework
+ * provides to the userspace a consistent API to set the power limit
+ * on some devices.
+ *
+ * DTPM defines the functions to create a tree of constraints. Each
+ * parent node is a virtual description of the aggregation of the
+ * children. It propagates the constraints set at its level to its
+ * children and collect the children power infomation. The leaves of
+ * the tree are the real devices which have the ability to get their
+ * current power consumption and set their power limit.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/dtpm.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/powercap.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+
+static const char *constraint_name[] = {
+	"Instantaneous power constraint",
+};
+
+static struct powercap_control_type *pct;
+static struct dtpm *root;
+
+static int get_time_window_us(struct powercap_zone *pcz, int cid, u64 *window)
+{
+	return -ENOSYS;
+}
+
+static int set_time_window_us(struct powercap_zone *pcz, int cid, u64 window)
+{
+	return -ENOSYS;
+}
+
+static int get_max_power_range_uw(struct powercap_zone *pcz, u64 *max_power_uw)
+{
+	struct dtpm *dtpm = to_dtpm(pcz);
+
+	spin_lock(&dtpm->lock);
+	*max_power_uw = dtpm->power_max - dtpm->power_min;
+	spin_unlock(&dtpm->lock);
+
+	return 0;
+}
+
+static int get_children_power_uw(struct powercap_zone *pcz, u64 *power_uw)
+{
+	struct dtpm *dtpm = to_dtpm(pcz);
+	struct dtpm *child;
+	u64 power;
+	int ret = 0;
+
+	*power_uw = 0;
+
+	spin_lock(&dtpm->lock);
+	list_for_each_entry(child, &dtpm->children, sibling) {
+		ret = child->zone.ops->get_power_uw(&child->zone, &power);
+		if (ret)
+			break;
+		*power_uw += power;
+	}
+	spin_unlock(&dtpm->lock);
+
+	return ret;
+}
+
+static void __dtpm_rebalance_weight(struct dtpm *dtpm)
+{
+	struct dtpm *child;
+
+	spin_lock(&dtpm->lock);
+	list_for_each_entry(child, &dtpm->children, sibling) {
+
+		pr_debug("Setting weight '%d' for '%s'\n",
+			 child->weight, child->zone.name);
+
+		child->weight = DIV_ROUND_CLOSEST(child->power_max * 1024,
+						  dtpm->power_max);
+
+		__dtpm_rebalance_weight(child);
+	}
+	spin_unlock(&dtpm->lock);
+}
+
+static void dtpm_rebalance_weight(void)
+{
+	__dtpm_rebalance_weight(root);
+}
+
+static void dtpm_sub_power(struct dtpm *dtpm)
+{
+	struct dtpm *parent = dtpm->parent;
+
+	while (parent) {
+		spin_lock(&parent->lock);
+		parent->power_min -= dtpm->power_min;
+		parent->power_max -= dtpm->power_max;
+		spin_unlock(&parent->lock);
+		parent = parent->parent;
+	}
+
+	dtpm_rebalance_weight();
+}
+
+static void dtpm_add_power(struct dtpm *dtpm)
+{
+	struct dtpm *parent = dtpm->parent;
+
+	while (parent) {
+		spin_lock(&parent->lock);
+		parent->power_min += dtpm->power_min;
+		parent->power_max += dtpm->power_max;
+		spin_unlock(&parent->lock);
+		parent = parent->parent;
+	}
+
+	dtpm_rebalance_weight();
+}
+
+/**
+ * dtpm_update_power - Update the power on the dtpm
+ * @dtpm: a pointer to a dtpm structure to update
+ * @power_min: a u64 representing the new power_min value
+ * @power_max: a u64 representing the new power_max value
+ *
+ * Function to update the power values of the dtpm node specified in
+ * parameter. These new values will be propagated to the tree.
+ *
+ * Return: zero on success, -EINVAL if the values are inconsistent
+ */
+int dtpm_update_power(struct dtpm *dtpm, u64 power_min, u64 power_max)
+{
+	if (power_min == dtpm->power_min && power_max == dtpm->power_max)
+		return 0;
+
+	if (power_max < power_min)
+		return -EINVAL;
+
+	dtpm_sub_power(dtpm);
+	spin_lock(&dtpm->lock);
+	dtpm->power_min = power_min;
+	dtpm->power_max = power_max;
+	spin_unlock(&dtpm->lock);
+	dtpm_add_power(dtpm);
+
+	return 0;
+}
+
+/**
+ * dtpm_release_zone - Cleanup when the node is released
+ * @pcz: a pointer to a powercap_zone structure
+ *
+ * Do some housecleaning and update the weight on the tree. The
+ * release will be denied if the node has children. This function must
+ * be called by the specific release callback of the different
+ * backends.
+ *
+ * Return: 0 on success, -EBUSY if there are children
+ */
+int dtpm_release_zone(struct powercap_zone *pcz)
+{
+	struct dtpm *dtpm = to_dtpm(pcz);
+	struct dtpm *parent = dtpm->parent;
+
+	if (!list_empty(&dtpm->children))
+		return -EBUSY;
+
+	if (parent) {
+		spin_lock(&parent->lock);
+		list_del(&dtpm->sibling);
+		spin_unlock(&parent->lock);
+	}
+
+	dtpm_sub_power(dtpm);
+
+	kfree(dtpm);
+
+	return 0;
+}
+
+/*
+ * Set the power limit on the nodes, the power limit is distributed
+ * given the weight of the children.
+ */
+static int set_children_power_limit(struct powercap_zone *pcz, int cid,
+				    u64 power_limit)
+{
+	struct dtpm *dtpm = to_dtpm(pcz);
+	struct dtpm *child;
+	u64 power;
+	int ret = 0;
+
+	/*
+	 * Don't allow values outside of the power range previously
+	 * set when initiliazing the power numbers.
+	 */
+	power_limit = clamp_val(power_limit, dtpm->power_min, dtpm->power_max);
+
+	spin_lock(&dtpm->lock);
+	list_for_each_entry(child, &dtpm->children, sibling) {
+
+		/*
+		 * Integer division rounding will inevitably lead to a
+		 * different max value when set several times. In
+		 * order to restore the initial value, we force the
+		 * child's max power every time if the constraint is
+		 * removed by setting a value greater or equal to the
+		 * max power.
+		 */
+		if (power_limit == dtpm->power_max)
+			power = child->power_max;
+		else
+			power = DIV_ROUND_CLOSEST(
+				power_limit * child->weight, 1024);
+
+		pr_debug("Setting power limit for '%s': %llu uW\n",
+			 child->zone.name, power);
+
+		ret = child->zone.constraints->ops->set_power_limit_uw(
+			&child->zone, cid, power);
+		if (ret)
+			break;
+	}
+	spin_unlock(&dtpm->lock);
+
+	return ret;
+}
+
+static int get_children_power_limit(struct powercap_zone *pcz, int cid,
+				    u64 *power_limit)
+{
+	struct dtpm *dtpm = to_dtpm(pcz);
+	struct dtpm *child;
+	u64 power;
+	int ret = 0;
+
+	*power_limit = 0;
+
+	spin_lock(&dtpm->lock);
+	list_for_each_entry(child, &dtpm->children, sibling) {
+		ret = child->zone.constraints->ops->get_power_limit_uw(
+			&child->zone, cid, &power);
+		if (ret)
+			break;
+		*power_limit += power;
+	}
+	spin_unlock(&dtpm->lock);
+
+	return ret;
+}
+
+static const char *get_constraint_name(struct powercap_zone *pcz, int cid)
+{
+	return constraint_name[cid];
+}
+
+static int get_max_power_uw(struct powercap_zone *pcz, int id, u64 *max_power)
+{
+	struct dtpm *dtpm = to_dtpm(pcz);
+
+	spin_lock(&dtpm->lock);
+	*max_power = dtpm->power_max;
+	spin_unlock(&dtpm->lock);
+
+	return 0;
+}
+
+static struct powercap_zone_constraint_ops constraint_ops = {
+	.set_power_limit_uw = set_children_power_limit,
+	.get_power_limit_uw = get_children_power_limit,
+	.set_time_window_us = set_time_window_us,
+	.get_time_window_us = get_time_window_us,
+	.get_max_power_uw = get_max_power_uw,
+	.get_name = get_constraint_name,
+};
+
+static struct powercap_zone_ops zone_ops = {
+	.get_max_power_range_uw = get_max_power_range_uw,
+	.get_power_uw = get_children_power_uw,
+	.release = dtpm_release_zone,
+};
+
+/**
+ * dtpm_alloc - Allocate and initialize a dtpm struct
+ * @name: a string specifying the name of the node
+ *
+ * Return: a struct dtpm pointer, NULL in case of error
+ */
+struct dtpm *dtpm_alloc(void)
+{
+	struct dtpm *dtpm;
+
+	dtpm = kzalloc(sizeof(*dtpm), GFP_KERNEL);
+	if (dtpm) {
+		INIT_LIST_HEAD(&dtpm->children);
+		INIT_LIST_HEAD(&dtpm->sibling);
+		spin_lock_init(&dtpm->lock);
+		dtpm->weight = 1024;
+	}
+
+	return dtpm;
+}
+
+/**
+ * dtpm_unregister - Unregister a dtpm node from the hierarchy tree
+ * @dtpm: a pointer to a dtpm structure corresponding to the node to be removed
+ *
+ * Call the underlying powercap unregister function. That will call
+ * the release callback of the powercap zone.
+ */
+void dtpm_unregister(struct dtpm *dtpm)
+{
+	powercap_unregister_zone(pct, &dtpm->zone);
+}
+
+/**
+ * dtpm_register - Register a dtpm node in the hierarchy tree
+ * @name: a string specifying the name of the node
+ * @dtpm: a pointer to a dtpm structure corresponding to the new node
+ * @parent: a pointer to a dtpm structure corresponding to the parent node
+ * @ops: a pointer to a powercap_zone_ops structure
+ * @nr_constraints: a integer giving the number of constraints supported
+ * @const_ops: a pointer to a powercap_zone_constraint_ops structure
+ *
+ * Create a dtpm node in the tree. If no parent is specified, the node
+ * is the root node of the hierarchy. If the root node already exists,
+ * then the registration will fail. The powercap controller must be
+ * initialized before calling this function.
+ *
+ * The dtpm structure must be initialized with the power numbers
+ * before calling this function.
+ *
+ * Return: zero on success, a negative value in case of error:
+ *  -EAGAIN: the function is called before the framework is initialized.
+ *  -EBUSY: the root node is already inserted
+ *  -EINVAL: there is no root node yet and @parent is specified
+ *   Other negative values are reported back from the powercap framework
+ */
+int dtpm_register(const char *name, struct dtpm *dtpm, struct dtpm *parent,
+		  struct powercap_zone_ops *ops, int nr_constraints,
+		  struct powercap_zone_constraint_ops *const_ops)
+{
+	struct powercap_zone *pcz;
+
+	if (!pct)
+		return -EAGAIN;
+
+	if (root && !parent)
+		return -EBUSY;
+
+	if (!root && parent)
+		return -EINVAL;
+
+	const_ops->get_name = get_constraint_name;
+	const_ops->get_max_power_uw = get_max_power_uw;
+	const_ops->set_time_window_us = set_time_window_us;
+	const_ops->get_time_window_us = get_time_window_us;
+
+	ops->get_max_power_range_uw = get_max_power_range_uw;
+
+	pcz = powercap_register_zone(&dtpm->zone, pct, name,
+				     parent ? &parent->zone : NULL,
+				     ops, nr_constraints, const_ops);
+	if (IS_ERR(pcz))
+		return PTR_ERR(pcz);
+
+	if (parent) {
+		spin_lock(&parent->lock);
+		list_add_tail(&dtpm->sibling, &parent->children);
+		spin_unlock(&parent->lock);
+		dtpm->parent = parent;
+	} else {
+		root = dtpm;
+	}
+
+	dtpm_add_power(dtpm);
+
+	return 0;
+}
+
+/**
+ * dtpm_register_parent - Register a intermediate node in the tree
+ * @name: a string specifying the name of the node
+ * @dtpm: a pointer to a dtpm structure corresponding to the new node
+ * @parent: a pointer to a dtpm structure corresponding parent's new node
+ *
+ * The function will add an intermediate node corresponding to a
+ * parent to more nodes. Its purpose is to aggregate the children
+ * characteristics and dispatch the constraints. If the @parent
+ * parameter is NULL, then this node becomes the root node of the tree
+ * if there is no root node yet.
+ *
+ * Return: zero on success, a negative value in case of error:
+ *  -EAGAIN: the function is called before the framework is initialized.
+ *  -EBUSY: the root node is already inserted
+ *  -EINVAL: there is not root node yet and @parent is specified
+ *   Other negative values are reported back from the powercap framework
+ */
+int dtpm_register_parent(const char *name, struct dtpm *dtpm,
+			 struct dtpm *parent)
+{
+	return dtpm_register(name, dtpm, parent, &zone_ops,
+			     MAX_DTPM_CONSTRAINTS, &constraint_ops);
+}
+
+static int __init dtpm_init(void)
+{
+	struct dtpm_descr **dtpm_descr;
+
+	pct = powercap_register_control_type(NULL, "dtpm", NULL);
+	if (!pct) {
+		pr_err("Failed to register control type\n");
+		return -EINVAL;
+	}
+
+	for_each_dtpm_table(dtpm_descr)
+		(*dtpm_descr)->init(*dtpm_descr);
+
+	return 0;
+}
+late_initcall(dtpm_init);
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 5430febd34be..29b30976ea02 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -315,6 +315,16 @@ 
 #define THERMAL_TABLE(name)
 #endif
 
+#ifdef CONFIG_DTPM
+#define DTPM_TABLE()							\
+	. = ALIGN(8);							\
+	__dtpm_table = .;						\
+	KEEP(*(__dtpm_table))						\
+	__dtpm_table_end = .;
+#else
+#define DTPM_TABLE()
+#endif
+
 #define KERNEL_DTB()							\
 	STRUCT_ALIGN();							\
 	__dtb_start = .;						\
@@ -715,6 +725,7 @@ 
 	ACPI_PROBE_TABLE(irqchip)					\
 	ACPI_PROBE_TABLE(timer)						\
 	THERMAL_TABLE(governor)						\
+	DTPM_TABLE()							\
 	EARLYCON_TABLE()						\
 	LSM_TABLE()							\
 	EARLY_LSM_TABLE()
diff --git a/include/linux/dtpm.h b/include/linux/dtpm.h
new file mode 100644
index 000000000000..6696bdcfdb87
--- /dev/null
+++ b/include/linux/dtpm.h
@@ -0,0 +1,73 @@ 
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2020 Linaro Ltd
+ *
+ * Author: Daniel Lezcano <daniel.lezcano@linaro.org>
+ */
+#ifndef ___DTPM_H__
+#define ___DTPM_H__
+
+#include <linux/of.h>
+#include <linux/powercap.h>
+
+#define MAX_DTPM_DESCR 8
+#define MAX_DTPM_CONSTRAINTS 1
+
+struct dtpm {
+	struct powercap_zone zone;
+	struct dtpm *parent;
+	struct list_head sibling;
+	struct list_head children;
+	spinlock_t lock;
+	u64 power_limit;
+	u64 power_max;
+	u64 power_min;
+	int weight;
+	void *private;
+};
+
+struct dtpm_descr;
+
+typedef int (*dtpm_init_t)(struct dtpm_descr *);
+
+struct dtpm_descr {
+	struct dtpm *parent;
+	const char *name;
+	dtpm_init_t init;
+};
+
+/* Init section thermal table */
+extern struct dtpm_descr *__dtpm_table[];
+extern struct dtpm_descr *__dtpm_table_end[];
+
+#define DTPM_TABLE_ENTRY(name)			\
+	static typeof(name) *__dtpm_table_entry_##name	\
+	__used __section(__dtpm_table) = &name
+
+#define DTPM_DECLARE(name)	DTPM_TABLE_ENTRY(name)
+
+#define for_each_dtpm_table(__dtpm)	\
+	for (__dtpm = __dtpm_table;	\
+	     __dtpm < __dtpm_table_end;	\
+	     __dtpm++)
+
+static inline struct dtpm *to_dtpm(struct powercap_zone *zone)
+{
+	return container_of(zone, struct dtpm, zone);
+}
+
+int dtpm_update_power(struct dtpm *dtpm, u64 power_min, u64 power_max);
+
+int dtpm_release_zone(struct powercap_zone *pcz);
+
+struct dtpm *dtpm_alloc(void);
+
+void dtpm_unregister(struct dtpm *dtpm);
+
+int dtpm_register_parent(const char *name, struct dtpm *dtpm,
+			 struct dtpm *parent);
+
+int dtpm_register(const char *name, struct dtpm *dtpm, struct dtpm *parent,
+		  struct powercap_zone_ops *ops, int nr_constraints,
+		  struct powercap_zone_constraint_ops *const_ops);
+#endif