
[v3,6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

Message ID 1522945005-7165-7-git-send-email-daniel.lezcano@linaro.org
State New
Series: CPU cooling device new strategies

Commit Message

Daniel Lezcano April 5, 2018, 4:16 p.m. UTC
The cpu idle cooling driver performs synchronized idle injection across all
cpus belonging to the same cluster and offers a new method to cool down a SoC.

Each cluster has its own idle cooling device, each core has its own idle
injection thread, and each idle injection thread uses play_idle to enter idle.
In order to reach the deepest idle state, each cooling device keeps its idle
injection threads synchronized.

It has some similarity with the Intel powerclamp driver, but it is designed
to work on the ARM architecture via the DT, with a mathematical proof based
on the power model which comes with the documentation.

The idle injection cycle is fixed while the running cycle is variable. That
allows control over the device's reactivity, which matters for the user
experience. At the mitigation point the idle threads are unparked, they play
idle for the specified amount of time and then schedule themselves out. The
last thread sets the next idle injection deadline, and when the timer expires
it wakes up all the threads, which in turn play idle again. Meanwhile, the
running cycle is changed by set_cur_state. When the mitigation ends, the
threads are parked. The algorithm is self-adaptive, so there is no need to
handle hotplugging.

As an example of the balanced point, we can use the DT for the hi6220.

The sustainable power for the SoC is 3326mW to mitigate at 75°C. Eight cores
running at full blast at the maximum OPP consume 5280mW. The first value is
given in the DT, the second is calculated from the OPP with the formula:

   Pdyn = Cdyn x Voltage^2 x Frequency

As the SoC vendors don't want to share the static leakage values, we assume
static leakage is zero, so Prun = Pdyn + Pstatic = Pdyn + 0 = Pdyn.

In order to reduce the power to 3326mW, we have to apply a ratio to the
running time.

ratio = (Prun - Ptarget) / Ptarget = (5280 - 3326) / 3326 = 0.5874

We know the idle cycle, which is fixed; let's assume 10ms. However, from this
duration we have to subtract the wakeup latency for the cluster idle state.
In our case, it is 1.5ms. So for a 10ms idle injection duration, we are
really idle for 8.5ms.

As we know the idle duration and the ratio, we can compute the running cycle.

   running_cycle = 8.5 / 0.5874 = 14.47ms

So for 8.5ms of idle, we have 14.47ms of running cycle, and that brings the
SoC to the balanced trip point of 75°C.

The driver has been tested on the hi6220 and the temperature stabilizes at
75°C with an idle injection time of 10ms (8.5ms effective) and a running
cycle of 14ms, as expected from the theory above.

Signed-off-by: Kevin Wangtao <kevin.wangtao@linaro.org>

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>

---
 drivers/thermal/Kconfig       |  10 +
 drivers/thermal/cpu_cooling.c | 479 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/cpu_cooling.h   |   6 +
 3 files changed, 495 insertions(+)

-- 
2.7.4

Comments

Viresh Kumar April 11, 2018, 8:51 a.m. UTC | #1
On 05-04-18, 18:16, Daniel Lezcano wrote:
> The cpu idle cooling driver performs synchronized idle injection across all

> cpus belonging to the same cluster and offers a new method to cool down a SoC.

> 

> Each cluster has its own idle cooling device, each core has its own idle

> injection thread, each idle injection thread uses play_idle to enter idle.  In

> order to reach the deepest idle state, each cooling device has the idle

> injection threads synchronized together.

> 

> It has some similarity with the intel power clamp driver but it is actually

> designed to work on the ARM architecture via the DT with a mathematical proof

> with the power model which comes with the Documentation.

> 

> The idle injection cycle is fixed while the running cycle is variable. That

> allows to have control on the device reactivity for the user experience. At

> the mitigation point the idle threads are unparked, they play idle the

> specified amount of time and they schedule themselves. The last thread sets

> the next idle injection deadline and when the timer expires it wakes up all

> the threads which in turn play idle again. Meanwhile the running cycle is

> changed by set_cur_state.  When the mitigation ends, the threads are parked.

> The algorithm is self adaptive, so there is no need to handle hotplugging.

> 

> If we take an example of the balanced point, we can use the DT for the hi6220.

> 

> The sustainable power for the SoC is 3326mW to mitigate at 75°C. Eight cores

> running at full blast at the maximum OPP consumes 5280mW. The first value is

> given in the DT, the second is calculated from the OPP with the formula:

> 

>    Pdyn = Cdyn x Voltage^2 x Frequency

> 

> As the SoC vendors don't want to share the static leakage values, we assume

> it is zero, so the Prun = Pdyn + Pstatic = Pdyn + 0 = Pdyn.

> 

> In order to reduce the power to 3326mW, we have to apply a ratio to the

> running time.

> 

> ratio = (Prun - Ptarget) / Ptarget = (5280 - 3326) / 3326 = 0,5874

> 

> We know the idle cycle which is fixed, let's assume 10ms. However from this

> duration we have to substract the wake up latency for the cluster idle state.

> In our case, it is 1.5ms. So for a 10ms latency for idle, we are really idle

> 8.5ms.

> 

> As we know the idle duration and the ratio, we can compute the running cycle.

> 

>    running_cycle = 8.5 / 0.5874 = 14.47ms

> 

> So for 8.5ms of idle, we have 14.47ms of running cycle, and that brings the

> SoC to the balanced trip point of 75°C.

> 

> The driver has been tested on the hi6220 and it appears the temperature

> stabilizes at 75°C with an idle injection time of 10ms (8.5ms real) and

> running cycle of 14ms as expected by the theory above.

> 

> Signed-off-by: Kevin Wangtao <kevin.wangtao@linaro.org>

> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>

> ---

>  drivers/thermal/Kconfig       |  10 +

>  drivers/thermal/cpu_cooling.c | 479 ++++++++++++++++++++++++++++++++++++++++++

>  include/linux/cpu_cooling.h   |   6 +

>  3 files changed, 495 insertions(+)

> 

> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig

> index 5aaae1b..6c34117 100644

> --- a/drivers/thermal/Kconfig

> +++ b/drivers/thermal/Kconfig

> @@ -166,6 +166,16 @@ config CPU_FREQ_THERMAL

>  	  This will be useful for platforms using the generic thermal interface

>  	  and not the ACPI interface.

>  

> +config CPU_IDLE_THERMAL

> +       bool "CPU idle cooling strategy"

> +       depends on CPU_IDLE

> +       help

> +	 This implements the generic CPU cooling mechanism through

> +	 idle injection.  This will throttle the CPU by injecting

> +	 fixed idle cycle.  All CPUs belonging to the same cluster

> +	 will enter idle synchronously to reach the deepest idle

> +	 state.

> +

>  endchoice

>  

>  config CLOCK_THERMAL

> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c

> index 5c219dc..1eec8d6 100644

> --- a/drivers/thermal/cpu_cooling.c

> +++ b/drivers/thermal/cpu_cooling.c

> @@ -10,18 +10,33 @@

>   *		Viresh Kumar <viresh.kumar@linaro.org>

>   *

>   */

> +#define pr_fmt(fmt) "CPU cooling: " fmt

> +

>  #include <linux/module.h>

>  #include <linux/thermal.h>

>  #include <linux/cpufreq.h>

> +#include <linux/cpuidle.h>

>  #include <linux/err.h>

> +#include <linux/freezer.h>

>  #include <linux/idr.h>

> +#include <linux/kthread.h>

>  #include <linux/pm_opp.h>

>  #include <linux/slab.h>

> +#include <linux/sched/prio.h>

> +#include <linux/sched/rt.h>

> +#include <linux/smpboot.h>

>  #include <linux/cpu.h>

>  #include <linux/cpu_cooling.h>

>  

> +#include <linux/ratelimit.h>


What part of the code is using this one?

> +


Why add the above 2 blank lines?

> +#include <linux/platform_device.h>

> +#include <linux/of_platform.h>

> +

>  #include <trace/events/thermal.h>

>  

> +#include <uapi/linux/sched/types.h>

> +

>  #ifdef CONFIG_CPU_FREQ_THERMAL

>  /*

>   * Cooling state <-> CPUFreq frequency

> @@ -928,3 +943,467 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)

>  }

>  EXPORT_SYMBOL_GPL(cpufreq_cooling_unregister);

>  #endif /* CONFIG_CPU_FREQ_THERMAL */

> +

> +#ifdef CONFIG_CPU_IDLE_THERMAL

> +/**

> + * struct cpuidle_cooling_device - data for the idle cooling device

> + * @cdev: a pointer to a struct thermal_cooling_device

> + * @cpumask: a cpumask containing the CPU managed by the cooling device

> + * @timer: a hrtimer giving the tempo for the idle injection cycles

> + * @kref: a kernel refcount on this structure

> + * @count: an atomic to keep track of the last task exiting the idle cycle

> + * @idle_cycle: an integer defining the duration of the idle injection

> + * @state: an normalized integer giving the state of the cooling device

> + */

> +struct cpuidle_cooling_device {

> +	struct thermal_cooling_device *cdev;

> +	struct cpumask *cpumask;

> +	struct hrtimer timer;

> +	struct kref kref;

> +	atomic_t count;

> +	unsigned int idle_cycle;

> +	unsigned long state;

> +};

> +

> +struct cpuidle_cooling_thread {

> +	struct task_struct *tsk;

> +	int should_run;

> +};

> +

> +static DEFINE_PER_CPU(struct cpuidle_cooling_thread, cpuidle_cooling_thread);

> +static DEFINE_PER_CPU(struct cpuidle_cooling_device *, cpuidle_cooling_device);

> +

> +/**

> + * cpuidle_cooling_wakeup - Wake up all idle injection threads

> + * @idle_cdev: the idle cooling device

> + *

> + * Every idle injection task belonging to the idle cooling device and

> + * running on an online cpu will be wake up by this call.

> + */

> +static void cpuidle_cooling_wakeup(struct cpuidle_cooling_device *idle_cdev)

> +{

> +	struct cpuidle_cooling_thread *cct;

> +	int cpu;

> +

> +	for_each_cpu_and(cpu, idle_cdev->cpumask, cpu_online_mask) {

> +		cct = per_cpu_ptr(&cpuidle_cooling_thread, cpu);

> +		cct->should_run = 1;

> +		wake_up_process(cct->tsk);

> +	}

> +}

> +

> +/**

> + * cpuidle_cooling_wakeup_fn - Running cycle timer callback

> + * @timer: a hrtimer structure

> + *

> + * When the mitigation is acting, the CPU is allowed to run an amount

> + * of time, then the idle injection happens for the specified delay

> + * and the idle task injection schedules itself until the timer event

> + * wakes the idle injection tasks again for a new idle injection

> + * cycle. The time between the end of the idle injection and the timer

> + * expiration is the allocated running time for the CPU.

> + *

> + * Always returns HRTIMER_NORESTART

> + */

> +static enum hrtimer_restart cpuidle_cooling_wakeup_fn(struct hrtimer *timer)

> +{

> +	struct cpuidle_cooling_device *idle_cdev =

> +		container_of(timer, struct cpuidle_cooling_device, timer);

> +

> +	cpuidle_cooling_wakeup(idle_cdev);

> +

> +	return HRTIMER_NORESTART;

> +}

> +

> +/**

> + * cpuidle_cooling_runtime - Running time computation

> + * @idle_cdev: the idle cooling device

> + *

> + * The running duration is computed from the idle injection duration

> + * which is fixed. If we reach 100% of idle injection ratio, that

> + * means the running duration is zero. If we have a 50% ratio

> + * injection, that means we have equal duration for idle and for

> + * running duration.

> + *

> + * The formula is deduced as the following:

> + *

> + *  running = idle x ((100 / ratio) - 1)

> + *

> + * For precision purpose for integer math, we use the following:

> + *

> + *  running = (idle x 100) / ratio - idle

> + *

> + * For example, if we have an injected duration of 50%, then we end up

> + * with 10ms of idle injection and 10ms of running duration.

> + *

> + * Returns a s64 nanosecond based

> + */

> +static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)

> +{

> +	s64 next_wakeup;

> +	unsigned long state = idle_cdev->state;

> +

> +	/*

> +	 * The function should not be called when there is no

> +	 * mitigation because:

> +	 * - that does not make sense

> +	 * - we end up with a division by zero

> +	 */

> +	if (!state)

> +		return 0;

> +

> +	next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -

> +		idle_cdev->idle_cycle;

> +

> +	return next_wakeup * NSEC_PER_USEC;

> +}

> +

> +/**

> + * cpuidle_cooling_injection - Idle injection mainloop thread function

> + * @cpu: an integer giving the cpu number the thread is pinned on

> + *

> + * This main function does basically two operations:

> + *

> + * - Goes idle for a specific amount of time

> + *

> + * - Sets a timer to wake up all the idle injection threads after a

> + *   running period

> + *

> + * That happens only when the mitigation is enabled, otherwise the

> + * task is scheduled out.

> + *

> + * In order to keep the tasks synchronized together, it is the last

> + * task exiting the idle period which is in charge of setting the

> + * timer.

> + *

> + * This function never returns.

> + */

> +static void cpuidle_cooling_injection(unsigned int cpu)

> +{

> +	s64 next_wakeup;

> +

> +	struct cpuidle_cooling_device *idle_cdev =

> +		per_cpu(cpuidle_cooling_device, cpu);

> +

> +	struct cpuidle_cooling_thread *cct =

> +		per_cpu_ptr(&cpuidle_cooling_thread, cpu);

> +

> +	atomic_inc(&idle_cdev->count);

> +

> +	cct->should_run = 0;

> +

> +	play_idle(idle_cdev->idle_cycle / USEC_PER_MSEC);

> +

> +	/*

> +	 * The last CPU waking up is in charge of setting the

> +	 * timer. If the CPU is hotplugged, the timer will

> +	 * move to another CPU (which may not belong to the

> +	 * same cluster) but that is not a problem as the

> +	 * timer will be set again by another CPU belonging to

> +	 * the cluster, so this mechanism is self adaptive and

> +	 * does not require any hotplugging dance.

> +	 */

> +	if (!atomic_dec_and_test(&idle_cdev->count))

> +		return;

> +

> +	next_wakeup = cpuidle_cooling_runtime(idle_cdev);

> +	if (next_wakeup)

> +		hrtimer_start(&idle_cdev->timer, ns_to_ktime(next_wakeup),

> +			      HRTIMER_MODE_REL_PINNED);

> +}

> +

> +static void cpuidle_cooling_setup(unsigned int cpu)

> +{

> +	struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 };


Doesn't checkpatch report about space requirements around '/'?

> +

> +	set_freezable();

> +

> +	sched_setscheduler(current, SCHED_FIFO, &param);

> +}

> +

> +static int cpuidle_cooling_should_run(unsigned int cpu)

> +{

> +	struct cpuidle_cooling_thread *cct =

> +		per_cpu_ptr(&cpuidle_cooling_thread, cpu);

> +

> +	return cct->should_run;

> +}

> +

> +static struct smp_hotplug_thread cpuidle_cooling_threads = {

> +	.store                  = &cpuidle_cooling_thread.tsk,

> +	.thread_fn              = cpuidle_cooling_injection,

> +	.thread_comm            = "thermal-idle/%u",

> +	.thread_should_run	= cpuidle_cooling_should_run,

> +	.setup                  = cpuidle_cooling_setup,

> +};

> +

> +/**

> + * cpuidle_cooling_get_max_state - Get the maximum state

> + * @cdev  : the thermal cooling device

> + * @state : a pointer to the state variable to be filled

> + *

> + * The function always gives 100 as the injection ratio is percentile

> + * based for consistency accros different platforms.

> + *

> + * The function can not fail, it always returns zero.

> + */

> +static int cpuidle_cooling_get_max_state(struct thermal_cooling_device *cdev,

> +					 unsigned long *state)


This isn't aligned as rest of the routines in this driver. Try running
checkpatch with --strict option.

> +{

> +	/*

> +	 * Depending on the configuration or the hardware, the running

> +	 * cycle and the idle cycle could be different. We want unify

> +	 * that to an 0..100 interval, so the set state interface will

> +	 * be the same whatever the platform is.

> +	 *

> +	 * The state 100% will make the cluster 100% ... idle. A 0%

> +	 * injection ratio means no idle injection at all and 50%

> +	 * means for 10ms of idle injection, we have 10ms of running

> +	 * time.

> +	 */

> +	*state = 100;

> +

> +	return 0;

> +}

> +

> +/**

> + * cpuidle_cooling_get_cur_state - Get the current cooling state

> + * @cdev: the thermal cooling device

> + * @state: a pointer to the state

> + *

> + * The function just copy the state value from the private thermal

> + * cooling device structure, the mapping is 1 <-> 1.

> + *

> + * The function can not fail, it always returns zero.

> + */

> +static int cpuidle_cooling_get_cur_state(struct thermal_cooling_device *cdev,

> +					 unsigned long *state)

> +{

> +	struct cpuidle_cooling_device *idle_cdev = cdev->devdata;

> +

> +	*state = idle_cdev->state;

> +

> +	return 0;

> +}

> +

> +/**

> + * cpuidle_cooling_set_cur_state - Set the current cooling state

> + * @cdev: the thermal cooling device

> + * @state: the target state

> + *

> + * The function checks first if we are initiating the mitigation which

> + * in turn wakes up all the idle injection tasks belonging to the idle

> + * cooling device. In any case, it updates the internal state for the

> + * cooling device.

> + *

> + * The function can not fail, it always returns zero.

> + */

> +static int cpuidle_cooling_set_cur_state(struct thermal_cooling_device *cdev,

> +					 unsigned long state)

> +{

> +	struct cpuidle_cooling_device *idle_cdev = cdev->devdata;

> +	unsigned long current_state = idle_cdev->state;

> +

> +	idle_cdev->state = state;

> +

> +	if (current_state == 0 && state > 0) {

> +		pr_debug("Starting cooling cpus '%*pbl'\n",

> +			 cpumask_pr_args(idle_cdev->cpumask));

> +		cpuidle_cooling_wakeup(idle_cdev);

> +	} else if (current_state > 0 && !state)  {

> +		pr_debug("Stopping cooling cpus '%*pbl'\n",

> +			 cpumask_pr_args(idle_cdev->cpumask));

> +	}

> +

> +	return 0;

> +}

> +

> +/**

> + * cpuidle_cooling_ops - thermal cooling device ops

> + */

> +static struct thermal_cooling_device_ops cpuidle_cooling_ops = {

> +	.get_max_state = cpuidle_cooling_get_max_state,

> +	.get_cur_state = cpuidle_cooling_get_cur_state,

> +	.set_cur_state = cpuidle_cooling_set_cur_state,

> +};

> +

> +/**

> + * cpuidle_cooling_release - Kref based release helper

> + * @kref: a pointer to the kref structure

> + *

> + * This function is automatically called by the kref_put function when

> + * the idle cooling device refcount reaches zero. At this point, we

> + * have the guarantee the structure is no longer in use and we can

> + * safely release all the ressources.

> + */

> +static void __init cpuidle_cooling_release(struct kref *kref)

> +{

> +	struct cpuidle_cooling_device *idle_cdev =

> +		container_of(kref, struct cpuidle_cooling_device, kref);

> +

> +	if (idle_cdev->cdev)

> +		thermal_cooling_device_unregister(idle_cdev->cdev);

> +

> +	hrtimer_cancel(&idle_cdev->timer);

> +	kfree(idle_cdev);

> +}

> +

> +/**

> + * cpuilde_cooling_unregister - Idle cooling device exit function

> + *

> + * This function unregisters the cpuidle cooling device and frees the

> + * ressources previously allocated by the init function. This function

> + * is called when the initialization fails.

> + */

> +static void __init cpuidle_cooling_unregister(void)

> +{

> +	struct cpuidle_cooling_device *idle_cdev;

> +	int cpu;

> +

> +	for_each_possible_cpu(cpu) {

> +		idle_cdev = per_cpu(cpuidle_cooling_device, cpu);

> +		if (idle_cdev)

> +			kref_put(&idle_cdev->kref, cpuidle_cooling_release);

> +	}

> +}

> +

> +

> +/**

> + * cpuidle_cooling_alloc - Allocate and initialize an idle cooling device

> + * @cpumask: a cpumask containing all the cpus handled by the cooling device

> + * 

> + * The function is called at init time only. It allocates and

> + * initializes the different fields of the cpuidle cooling device

> + *

> + * It returns a pointer to an cpuidle_cooling_device structure on

> + * success, NULL on error.

> + */

> +static struct cpuidle_cooling_device * __init cpuidle_cooling_alloc(

> +	cpumask_t *cpumask)


I would rather break the line after __init .

> +{

> +	struct cpuidle_cooling_device *idle_cdev;

> +	int cpu;

> +

> +	idle_cdev = kzalloc(sizeof(*idle_cdev), GFP_KERNEL);

> +	if (!idle_cdev)

> +		return NULL;

> +

> +	/*

> +	 * The idle duration injection. As we don't have yet a way to

> +	 * specify from the DT configuration, let's default to a tick

> +	 * duration.

> +	 */

> +	idle_cdev->idle_cycle = TICK_USEC;

> +

> +	/*

> +	 * Initialize the timer to wakeup all the idle injection tasks

> +	 */

> +	hrtimer_init(&idle_cdev->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);

> +

> +	/*

> +	 * The wakeup function callback which is in charge of waking

> +	 * up all CPUs belonging to the same cluster


Need a full-stop (.) at the end. Please check all comments for this.

> +	 */

> +	idle_cdev->timer.function = cpuidle_cooling_wakeup_fn;

> +

> +	idle_cdev->cpumask = cpumask;

> +

> +	/*

> +	 * Assign on a per cpu basis belonging to the cluster, the per

> +	 * cpu cpuidle_cooling_device pointer and increment its

> +	 * refcount on it

> +	 */

> +	for_each_cpu(cpu, cpumask) {

> +		kref_get(&idle_cdev->kref);

> +		per_cpu(cpuidle_cooling_device, cpu) = idle_cdev;

> +	}

> +

> +	return idle_cdev;

> +}

> +

> +/**

> + * cpuidle_cooling_register - Idle cooling device initialization function

> + *

> + * This function is in charge of creating a cooling device per cluster

> + * and register it to thermal framework. For this we rely on the

> + * topology as there is nothing yet describing better the idle state

> + * power domains.

> + *

> + * We create a cpuidle cooling device per cluster. For this reason we

> + * must, for each cluster, allocate and initialize the cooling device

> + * and for each cpu belonging to this cluster, do the initialization

> + * on a cpu basis.

> + *

> + * This approach for creating the cooling device is needed as we don't

> + * have the guarantee the CPU numbering is sequential.


               guarantee that the CPU

> + *

> + * Unfortunately, there is no API to browse from top to bottom the

> + * topology, cluster->cpu, only the usual for_each_possible_cpu loop.

> + * In order to solve that, we use a cpumask to flag the cluster_id we

> + * already processed. The cpumask will always have enough room for all

> + * the cluster because it is based on NR_CPUS and it is not possible

> + * to have more clusters than cpus.

> + *

> + */

> +void __init cpuidle_cooling_register(void)

> +{

> +	struct cpuidle_cooling_device *idle_cdev = NULL;

> +	struct thermal_cooling_device *cdev;

> +	struct device_node *np;

> +	cpumask_var_t cpumask;


maybe call it clustermask ?

> +	char dev_name[THERMAL_NAME_LENGTH];

> +	int ret = -ENOMEM, cpu;

> +	int cluster_id;

> +

> +	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))

> +		return;

> +

> +	for_each_possible_cpu(cpu) {

> +

> +		cluster_id = topology_physical_package_id(cpu);

> +		if (cpumask_test_cpu(cluster_id, cpumask))

> +			continue;

> +

> +		/*

> +		 * Allocate the cpuidle cooling device with the list

> +		 * of the cpus belonging to the cluster.

> +		 */

> +		idle_cdev = cpuidle_cooling_alloc(topology_core_cpumask(cpu));

> +		if (!idle_cdev)

> +			goto out;

> +

> +		/*

> +		 * The thermal cooling device name, we use the

> +		 * cluster_id as the numbering index for the idle

> +		 * cooling device.

> +		 */

> +		snprintf(dev_name, sizeof(dev_name), "thermal-idle-%d",

> +			 cluster_id);

> +

> +		np = of_cpu_device_node_get(cpu);


Check if np is valid?

> +		cdev = thermal_of_cooling_device_register(np, dev_name,

> +							  idle_cdev,

> +							  &cpuidle_cooling_ops);

> +		if (IS_ERR(cdev)) {

> +			ret = PTR_ERR(cdev);

> +			goto out;

> +		}

> +

> +		idle_cdev->cdev = cdev;

> +		cpumask_set_cpu(cluster_id, cpumask);

> +	}

> +

> +	ret = smpboot_register_percpu_thread(&cpuidle_cooling_threads);

> +	if (ret)

> +		goto out;

> +

> +	pr_info("Created cpuidle cooling device\n");

> +out:

> +	free_cpumask_var(cpumask);

> +

> +	if (ret) {

> +		cpuidle_cooling_unregister();

> +		pr_err("Failed to create idle cooling device (%d)\n", ret);

> +	}


Maybe rearrange it a bit:

+	ret = smpboot_register_percpu_thread(&cpuidle_cooling_threads);
+
+out:
+	if (ret) {
+		cpuidle_cooling_unregister();
+		pr_err("Failed to create idle cooling device (%d)\n", ret);
+	} else {
+		pr_info("Created cpuidle cooling devices\n");
+	}
+
+	free_cpumask_var(cpumask);

??

> +}

> +#endif /* CONFIG_CPU_IDLE_THERMAL */

> diff --git a/include/linux/cpu_cooling.h b/include/linux/cpu_cooling.h

> index c0accc7..af5520d 100644

> --- a/include/linux/cpu_cooling.h

> +++ b/include/linux/cpu_cooling.h

> @@ -120,4 +120,10 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)

>  }

>  #endif	/* CONFIG_CPU_FREQ_THERMAL */

>  

> +#ifdef CONFIG_CPU_IDLE_THERMAL

> +extern void __init cpuidle_cooling_register(void);

> +#else /* CONFIG_CPU_IDLE_THERMAL */

> +static inline void __init cpuidle_cooling_register(void) { }

> +#endif /* CONFIG_CPU_IDLE_THERMAL */

> +

>  #endif /* __CPU_COOLING_H__ */

> -- 

> 2.7.4


-- 
viresh
Daniel Lezcano April 11, 2018, 9:29 a.m. UTC | #2
Hi Viresh,

thanks for the review.


On 11/04/2018 10:51, Viresh Kumar wrote:

[ ... ]

>> +void __init cpuidle_cooling_register(void)

>> +{

>> +	struct cpuidle_cooling_device *idle_cdev = NULL;

>> +	struct thermal_cooling_device *cdev;

>> +	struct device_node *np;

>> +	cpumask_var_t cpumask;

> 

> maybe call it clustermask ?


Yeah, sounds better.

>> +		cdev = thermal_of_cooling_device_register(np, dev_name,

>> +							  idle_cdev,

>> +							  &cpuidle_cooling_ops);

>> +		if (IS_ERR(cdev)) {

>> +			ret = PTR_ERR(cdev);

>> +			goto out;

>> +		}

>> +

>> +		idle_cdev->cdev = cdev;

>> +		cpumask_set_cpu(cluster_id, cpumask);

>> +	}

>> +

>> +	ret = smpboot_register_percpu_thread(&cpuidle_cooling_threads);

>> +	if (ret)

>> +		goto out;

>> +

>> +	pr_info("Created cpuidle cooling device\n");

>> +out:

>> +	free_cpumask_var(cpumask);

>> +

>> +	if (ret) {

>> +		cpuidle_cooling_unregister();

>> +		pr_err("Failed to create idle cooling device (%d)\n", ret);

>> +	}

> 

> Maybe rearrange it a bit:

> 

> +	ret = smpboot_register_percpu_thread(&cpuidle_cooling_threads);

> +

> +out:

> +	if (ret) {

> +		cpuidle_cooling_unregister();

> +		pr_err("Failed to create idle cooling device (%d)\n", ret);

> +	} else {

> +	        pr_info("Created cpuidle cooling devices\n");

> +       }

> +

> +	free_cpumask_var(cpumask);

> 

> ??


I usually tend to avoid using 'else' statements when possible (a coding
style practice) but if you prefer this version I'm fine with that.



-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Sudeep Holla April 13, 2018, 11:23 a.m. UTC | #3
Hi Daniel,

On 05/04/18 17:16, Daniel Lezcano wrote:

[...]

> +/**

> + * cpuidle_cooling_register - Idle cooling device initialization function

> + *

> + * This function is in charge of creating a cooling device per cluster

> + * and register it to thermal framework. For this we rely on the

> + * topology as there is nothing yet describing better the idle state

> + * power domains.

> + *

> + * We create a cpuidle cooling device per cluster. For this reason we

> + * must, for each cluster, allocate and initialize the cooling device

> + * and for each cpu belonging to this cluster, do the initialization

> + * on a cpu basis.

> + *

> + * This approach for creating the cooling device is needed as we don't

> + * have the guarantee the CPU numbering is sequential.

> + *

> + * Unfortunately, there is no API to browse from top to bottom the

> + * topology, cluster->cpu, only the usual for_each_possible_cpu loop.

> + * In order to solve that, we use a cpumask to flag the cluster_id we

> + * already processed. The cpumask will always have enough room for all

> + * the cluster because it is based on NR_CPUS and it is not possible

> + * to have more clusters than cpus.

> + *

> + */

> +void __init cpuidle_cooling_register(void)

> +{

> +	struct cpuidle_cooling_device *idle_cdev = NULL;

> +	struct thermal_cooling_device *cdev;

> +	struct device_node *np;

> +	cpumask_var_t cpumask;

> +	char dev_name[THERMAL_NAME_LENGTH];

> +	int ret = -ENOMEM, cpu;

> +	int cluster_id;

> +

> +	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))

> +		return;

> +

> +	for_each_possible_cpu(cpu) {

> +

> +		cluster_id = topology_physical_package_id(cpu);

> +		if (cpumask_test_cpu(cluster_id, cpumask))

> +			continue;


Sorry for chiming in randomly, I haven't read the patches in detail.
But it was brought to my notice that topology_physical_package_id is
being used for cluster ID here. It's completely wrong and will
change soon with ACPI topology related changes Jeremy is working on.

You will get the physical socket number (which is mostly 0 on a single SoC
system). Make sure that this won't break with that change.

Please note that with cluster id not defined architecturally, relying on it
is simply problematic.
-- 
Regards,
Sudeep
Daniel Thompson April 13, 2018, 11:38 a.m. UTC | #4
On Thu, Apr 05, 2018 at 06:16:43PM +0200, Daniel Lezcano wrote:
> +/**

> + * cpuidle_cooling_register - Idle cooling device initialization function

> + *

> + * This function is in charge of creating a cooling device per cluster

> + * and register it to thermal framework. For this we rely on the

> + * topology as there is nothing yet describing better the idle state

> + * power domains.

> + *

> + * We create a cpuidle cooling device per cluster. For this reason we

> + * must, for each cluster, allocate and initialize the cooling device

> + * and for each cpu belonging to this cluster, do the initialization

> + * on a cpu basis.

> + *

> + * This approach for creating the cooling device is needed as we don't

> + * have the guarantee the CPU numbering is sequential.

> + *

> + * Unfortunately, there is no API to browse from top to bottom the

> + * topology, cluster->cpu, only the usual for_each_possible_cpu loop.

> + * In order to solve that, we use a cpumask to flag the cluster_id we

> + * already processed. The cpumask will always have enough room for all

> + * the cluster because it is based on NR_CPUS and it is not possible

> + * to have more clusters than cpus.

> + *

> + */

> +void __init cpuidle_cooling_register(void)

> +{

> +	struct cpuidle_cooling_device *idle_cdev = NULL;

> +	struct thermal_cooling_device *cdev;

> +	struct device_node *np;

> +	cpumask_var_t cpumask;

> +	char dev_name[THERMAL_NAME_LENGTH];

> +	int ret = -ENOMEM, cpu;

> +	int cluster_id;

> +

> +	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))

> +		return;

> +

> +	for_each_possible_cpu(cpu) {

> +

> +		cluster_id = topology_physical_package_id(cpu);

> +		if (cpumask_test_cpu(cluster_id, cpumask))

> +			continue;

> +

> +		/*

> +		 * Allocate the cpuidle cooling device with the list

> +		 * of the cpus belonging to the cluster.

> +		 */

> +		idle_cdev = cpuidle_cooling_alloc(topology_core_cpumask(cpu));

> +		if (!idle_cdev)

> +			goto out;


Perhaps nitpicking but it might be better to set ret to -ENOMEM here 
and avoid initializing it during the declarations.

There's no bug in the current form since ret is never assigned to
outside of the error paths, but the unwritten assumption that ret keeps
its original value throughout the loop seems like a bit of a landmine
w.r.t. future maintenance.


Daniel.
Daniel Lezcano April 13, 2018, 11:47 a.m. UTC | #5
On 13/04/2018 13:23, Sudeep Holla wrote:
> Hi Daniel,

> 

> On 05/04/18 17:16, Daniel Lezcano wrote:

> 

> [...]

> 

>> +/**

>> + * cpuidle_cooling_register - Idle cooling device initialization function

>> + *

>> + * This function is in charge of creating a cooling device per cluster

>> + * and register it to thermal framework. For this we rely on the

>> + * topology as there is nothing yet describing better the idle state

>> + * power domains.

>> + *

>> + * We create a cpuidle cooling device per cluster. For this reason we

>> + * must, for each cluster, allocate and initialize the cooling device

>> + * and for each cpu belonging to this cluster, do the initialization

>> + * on a cpu basis.

>> + *

>> + * This approach for creating the cooling device is needed as we don't

>> + * have the guarantee the CPU numbering is sequential.

>> + *

>> + * Unfortunately, there is no API to browse from top to bottom the

>> + * topology, cluster->cpu, only the usual for_each_possible_cpu loop.

>> + * In order to solve that, we use a cpumask to flag the cluster_id we

>> + * already processed. The cpumask will always have enough room for all

>> + * the clusters because it is based on NR_CPUS and it is not possible

>> + * to have more clusters than cpus.

>> + *

>> + */

>> +void __init cpuidle_cooling_register(void)

>> +{

>> +	struct cpuidle_cooling_device *idle_cdev = NULL;

>> +	struct thermal_cooling_device *cdev;

>> +	struct device_node *np;

>> +	cpumask_var_t cpumask;

>> +	char dev_name[THERMAL_NAME_LENGTH];

>> +	int ret = -ENOMEM, cpu;

>> +	int cluster_id;

>> +

>> +	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))

>> +		return;

>> +

>> +	for_each_possible_cpu(cpu) {

>> +

>> +		cluster_id = topology_physical_package_id(cpu);

>> +		if (cpumask_test_cpu(cluster_id, cpumask))

>> +			continue;

> 

> Sorry for chiming in randomly, I haven't read the patches in detail.

> But it was brought to my notice that topology_physical_package_id is

> being used for cluster ID here. It's completely wrong and will

> change soon with ACPI topology-related changes Jeremy is working on.

> 

> You will get the physical socket number (which is mostly 0 on single SoC

> system). Make sure that this won't break with that change.

> 

> Please note that, with the cluster id not defined architecturally, relying on
> 
> it is simply problematic.


Ok, noted. At first glance, it should not be a problem.


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
Viresh Kumar April 16, 2018, 7:37 a.m. UTC | #6
On 13-04-18, 13:47, Daniel Lezcano wrote:
> Ok, noted. At first glance, it should not be a problem.


Why do you think it wouldn't be a problem ?

-- 
viresh
Daniel Lezcano April 16, 2018, 7:44 a.m. UTC | #7
On 16/04/2018 09:37, Viresh Kumar wrote:
> On 13-04-18, 13:47, Daniel Lezcano wrote:

>> Ok, noted. At first glance, it should not be a problem.

> 

> Why do you think it wouldn't be a problem ?


Because we rely on the number to identify the cluster and flag it
'processed'. The number itself is not important.


Viresh Kumar April 16, 2018, 9:37 a.m. UTC | #8
On 16-04-18, 09:44, Daniel Lezcano wrote:
> Because we rely on the number to identify the cluster and flag it

> 'processed'. The number itself is not important.


It is, because you are creating multiple groups (like cpufreq
policies) right now based on cluster id. Which will be zero for all
the CPUs. So in a big little system we will have two cpufreq cooling
devices and one cpuidle cooling device.

-- 
viresh
Daniel Lezcano April 16, 2018, 9:45 a.m. UTC | #9
On 16/04/2018 11:37, Viresh Kumar wrote:
> On 16-04-18, 09:44, Daniel Lezcano wrote:

>> Because we rely on the number to identify the cluster and flag it

>> 'processed'. The number itself is not important.

> 

> It is, because you are creating multiple groups (like cpufreq

> policies) right now based on cluster id. Which will be zero for all

> the CPUs. So in a big little system we will have two cpufreq cooling

> devices and one cpuidle cooling device.


Can you elaborate a bit ? I'm not sure to get the point.

BTW, Am I asked to change my code to stick to something which is not
merged ?


Viresh Kumar April 16, 2018, 9:50 a.m. UTC | #10
On 16-04-18, 11:45, Daniel Lezcano wrote:
> Can you elaborate a bit ? I'm not sure to get the point.


Sure. With your current code on Hikey960 (big/LITTLE), you end up
creating two cooling devices, one for the big cluster and one for
small cluster. Which is the right thing to do, as we also have two
cpufreq cooling devices.

But with the change Sudeep is referring to, the helper you used to get
cluster id will return 0 (SoC id) for all the 8 CPUs. So your code
will end up creating a single cpuidle cooling device for all the CPUs.
Which would be wrong.

> BTW, Am I asked to change my code to stick to something which is not

> merged ?


Sudeep looked pretty confident about how the meaning of this routine is
going to change very soon. I will let him respond on what guarantees
we have that it will get merged :)

-- 
viresh
Daniel Lezcano April 16, 2018, 10:03 a.m. UTC | #11
On 16/04/2018 11:50, Viresh Kumar wrote:
> On 16-04-18, 11:45, Daniel Lezcano wrote:

>> Can you elaborate a bit ? I'm not sure to get the point.

> 

> Sure. With your current code on Hikey960 (big/LITTLE), you end up

> creating two cooling devices, one for the big cluster and one for

> small cluster. Which is the right thing to do, as we also have two

> cpufreq cooling devices.

> 

> But with the change Sudeep is referring to, the helper you used to get

> cluster id will return 0 (SoC id) for all the 8 CPUs. So your code

> will end up creating a single cpuidle cooling device for all the CPUs.

> Which would be wrong.


Is the semantic of topology_physical_package_id changing ? I don't
understand the change Sudeep is referring to.

I saw this attempt:

https://patchwork.kernel.org/patch/9959977/

>> BTW, Am I asked to change my code to stick to something which is not

>> merged ?

> 

> Sudeep looked pretty confident about how the meaning of this routine is

> going to change very soon. I will let him respond on what guarantees

> we have that it will get merged :)




Viresh Kumar April 16, 2018, 10:10 a.m. UTC | #12
On 16-04-18, 12:03, Daniel Lezcano wrote:
> On 16/04/2018 11:50, Viresh Kumar wrote:

> > On 16-04-18, 11:45, Daniel Lezcano wrote:

> >> Can you elaborate a bit ? I'm not sure to get the point.

> > 

> > Sure. With your current code on Hikey960 (big/LITTLE), you end up

> > creating two cooling devices, one for the big cluster and one for

> > small cluster. Which is the right thing to do, as we also have two

> > cpufreq cooling devices.

> > 

> > But with the change Sudeep is referring to, the helper you used to get

> > cluster id will return 0 (SoC id) for all the 8 CPUs. So your code

> > will end up creating a single cpuidle cooling device for all the CPUs.

> > Which would be wrong.

> 

> Is the semantic of topology_physical_package_id changing ?


That's what I understood from his email.

> I don't

> understand the change Sudeep is referring to.

> 

> I saw this attempt:

> 

> https://patchwork.kernel.org/patch/9959977/


@Sudeep: Is using topology_cod_id() okay in that case?

-- 
viresh
Daniel Lezcano April 16, 2018, 1:57 p.m. UTC | #13
On 16/04/2018 14:30, Lorenzo Pieralisi wrote:
> On Mon, Apr 16, 2018 at 02:10:30PM +0200, Daniel Lezcano wrote:

>> On 16/04/2018 12:10, Viresh Kumar wrote:

>>> On 16-04-18, 12:03, Daniel Lezcano wrote:

>>>> On 16/04/2018 11:50, Viresh Kumar wrote:

>>>>> On 16-04-18, 11:45, Daniel Lezcano wrote:

>>>>>> Can you elaborate a bit ? I'm not sure to get the point.

>>>>>

>>>>> Sure. With your current code on Hikey960 (big/LITTLE), you end up

>>>>> creating two cooling devices, one for the big cluster and one for

>>>>> small cluster. Which is the right thing to do, as we also have two

>>>>> cpufreq cooling devices.

>>>>>

>>>>> But with the change Sudeep is referring to, the helper you used to get

>>>>> cluster id will return 0 (SoC id) for all the 8 CPUs. So your code

>>>>> will end up creating a single cpuidle cooling device for all the CPUs.

>>>>> Which would be wrong.

>>>>

>>>> Is the semantic of topology_physical_package_id changing ?

>>>

>>> That's what I understood from his email.

>>>

>>>> I don't

>>>> understand the change Sudeep is referring to.

>>

>> Actually there is no impact with the change Sudeep is referring to. It

>> is for ACPI, we are DT based. Confirmed with Jeremy.

>>

>> So AFAICT, it is not a problem.

> 

> It is a problem - DT or ACPI alike. Sudeep was referring to the notion

> of "cluster" that has no architectural meaning whatsoever and using

> topology_physical_package_id() to detect a "cluster" was/is/will always

> be the wrong thing to do. The notion of cluster must not appear in the

> kernel at all, it has no architectural meaning. I understand you need

> to group CPUs but that has to be done in a different way, through

> cooling devices, thermal domains or power domains DT/ACPI bindings but

> not by using topology masks.


I don't get it. What is the cluster concept defined in the ARM
documentation?

ARM Cortex-A53 MPCore Processor Technical Reference Manual

4.5.2. Multiprocessor Affinity Register

I see the documentation says:

A cluster with two cores, three cores, ...

How can the kernel represent that if you kill
topology_physical_package_id()?

> You should be able to figure out this week at OSPM the reasoning behind

> what I am saying above.




Lorenzo Pieralisi April 17, 2018, 10:24 a.m. UTC | #14
On Tue, Apr 17, 2018 at 09:17:36AM +0200, Daniel Lezcano wrote:

[...]

> >>>> Actually there is no impact with the change Sudeep is referring to. It

> >>>> is for ACPI, we are DT based. Confirmed with Jeremy.

> >>>>

> >>>> So AFAICT, it is not a problem.

> >>>

> >>> It is a problem - DT or ACPI alike. Sudeep was referring to the notion

> >>> of "cluster" that has no architectural meaning whatsoever and using

> >>> topology_physical_package_id() to detect a "cluster" was/is/will always

> >>> be the wrong thing to do. The notion of cluster must not appear in the

> >>> kernel at all, it has no architectural meaning. I understand you need

> >>> to group CPUs but that has to be done in a different way, through

> >>> cooling devices, thermal domains or power domains DT/ACPI bindings but

> >>> not by using topology masks.

> >>

> >> I don't get it. What is the cluster concept defined in the ARM

> >> documentation?

> >>

> >> ARM Cortex-A53 MPCore Processor Technical Reference Manual

> >>

> >> 4.5.2. Multiprocessor Affinity Register

> >>

> >> I see the documentation says:

> >>

> >> A cluster with two cores, three cores, ...

> >>

> >> How can the kernel represent that if you kill
> 
> >> topology_physical_package_id()?

> > 

> > From an Arm ARM perspective (ARM v8 reference manual), the MPIDR_EL1 has

> > no notion of cluster which means that a cluster is not architecturally

> > defined on Arm systems.

> 

> Sorry, I'm lost :/ You say the MPIDR_EL1 has no notion of cluster but

> the documentation describing this register is all talking about cluster.

> 

> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0500g/BABHBJCI.html


I pointed you at the documentation I am referring to. You are referring
to A53 TRM, I am referring to the Arm architecture reference manual that
is the reference for all Arm cores.

> > Currently, as Morten explained today, topology_physical_package_id()

> > is supposed to represent a "cluster" and that's completely wrong

> > because a "cluster" cannot be defined from an architectural perspective.

> > 

> > It was a bodge used as a shortcut, wrongly. We should have never used

> > that API for that purpose and there must be no code in the kernel

> > relying on:

> > 

> > topology_physical_package_id()

> > 

> > to define a cluster; the information you require to group CPUs must

> > come from something else, which is firmware bindings(DT or ACPI) as

> > I mentioned.

> 

> Why not ?


I explained why not :). A cluster is not defined architecturally on Arm
- it is as simple as that and you can't rely on a given MPIDR_EL1
subfield to define what a cluster id is.

> diff --git a/arch/arm64/include/asm/topology.h

> b/arch/arm64/include/asm/topology.h

> index c4f2d50..ac0776d 100644

> --- a/arch/arm64/include/asm/topology.h

> +++ b/arch/arm64/include/asm/topology.h

> @@ -14,7 +14,8 @@ struct cpu_topology {

> 

>  extern struct cpu_topology cpu_topology[NR_CPUS];

> 

> -#define topology_physical_package_id(cpu)

> (cpu_topology[cpu].cluster_id)

> +#define topology_physical_package_id(cpu)      (0)

> +#define topology_physical_cluster_id(cpu)


There is no such a thing (and there is no architecturally defined
package id on Arm either).

> (cpu_topology[cpu].cluster_id)

>  #define topology_core_id(cpu)          (cpu_topology[cpu].core_id)

>  #define topology_core_cpumask(cpu)     (&cpu_topology[cpu].core_sibling)

>  #define topology_sibling_cpumask(cpu)  (&cpu_topology[cpu].thread_sibling)

> 

> 

> > Please speak to Sudeep who will fill you on the reasoning above.

> 

> Yes, Sudeep is next to me but I would prefer to keep the discussion on

> the mailing list so everyone can get the reasoning.


It is not a reasoning - it is the Arm architecture. There is no
architecturally defined cluster id on Arm. The affinity bits in
MPIDR_EL1 must be treated as a unique number that represents a given
core/thread, how the bits are allocated across affinity levels is not
something that you can rely on architecturally - that's why DT/ACPI
topology bindings exist to group cpus in a hierarchical topology.

HTH,
Lorenzo
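The DT-side grouping Lorenzo refers to already exists as the cpu-map topology binding, which groups CPUs hierarchically without the kernel inferring a "cluster" from MPIDR_EL1 affinity fields. A minimal sketch for a hypothetical big.LITTLE system (node names and phandles are illustrative):

```dts
cpus {
	cpu-map {
		cluster0 {
			core0 { cpu = <&A53_0>; };
			core1 { cpu = <&A53_1>; };
		};
		cluster1 {
			core0 { cpu = <&A72_0>; };
			core1 { cpu = <&A72_1>; };
		};
	};
};
```

Cooling or power-domain bindings can then reference these groups explicitly instead of relying on topology masks.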
Martin Kepplinger Aug. 5, 2019, 5:11 a.m. UTC | #15
---

On 05-04-18, 18:16, Daniel Lezcano wrote:
> The cpu idle cooling driver performs synchronized idle injection across all

> cpus belonging to the same cluster and offers a new method to cool down a SoC.

> 

> Each cluster has its own idle cooling device, each core has its own idle

> injection thread, each idle injection thread uses play_idle to enter idle.  In

> order to reach the deepest idle state, each cooling device has the idle

> injection threads synchronized together.

> 

> It has some similarity with the intel power clamp driver but it is actually

> designed to work on the ARM architecture via the DT with a mathematical proof

> with the power model which comes with the Documentation.

> 

> The idle injection cycle is fixed while the running cycle is variable. That

> allows to have control on the device reactivity for the user experience. At

> the mitigation point the idle threads are unparked, they play idle the

> specified amount of time and they schedule themselves. The last thread sets

> the next idle injection deadline and when the timer expires it wakes up all

> the threads which in turn play idle again. Meanwhile the running cycle is

> changed by set_cur_state.  When the mitigation ends, the threads are parked.

> The algorithm is self adaptive, so there is no need to handle hotplugging.

> 

> If we take an example of the balanced point, we can use the DT for the hi6220.

> 

> The sustainable power for the SoC is 3326mW to mitigate at 75°C. Eight cores

> running at full blast at the maximum OPP consumes 5280mW. The first value is

> given in the DT, the second is calculated from the OPP with the formula:

> 

>    Pdyn = Cdyn x Voltage^2 x Frequency

> 

> As the SoC vendors don't want to share the static leakage values, we assume

> it is zero, so the Prun = Pdyn + Pstatic = Pdyn + 0 = Pdyn.

> 

> In order to reduce the power to 3326mW, we have to apply a ratio to the

> running time.

> 

> ratio = (Prun - Ptarget) / Ptarget = (5280 - 3326) / 3326 = 0,5874

> 

> We know the idle cycle which is fixed, let's assume 10ms. However from this

> duration we have to substract the wake up latency for the cluster idle state.

> In our case, it is 1.5ms. So for a 10ms latency for idle, we are really idle

> 8.5ms.

> 

> As we know the idle duration and the ratio, we can compute the running cycle.

> 

>    running_cycle = 8.5 / 0.5874 = 14.47ms

> 

> So for 8.5ms of idle, we have 14.47ms of running cycle, and that brings the

> SoC to the balanced trip point of 75°C.

> 

> The driver has been tested on the hi6220 and it appears the temperature

> stabilizes at 75°C with an idle injection time of 10ms (8.5ms real) and

> running cycle of 14ms as expected by the theory above.

> 

> Signed-off-by: Kevin Wangtao <kevin.wangtao@linaro.org>

> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>

> ---

>  drivers/thermal/Kconfig       |  10 +

>  drivers/thermal/cpu_cooling.c | 479 ++++++++++++++++++++++++++++++++++++++++++

>  include/linux/cpu_cooling.h   |   6 +

>  3 files changed, 495 insertions(+)

> 

> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig

> index 5aaae1b..6c34117 100644

> --- a/drivers/thermal/Kconfig

> +++ b/drivers/thermal/Kconfig

> @@ -166,6 +166,16 @@ config CPU_FREQ_THERMAL

>  	  This will be useful for platforms using the generic thermal interface

>  	  and not the ACPI interface.

>  

> +config CPU_IDLE_THERMAL

> +       bool "CPU idle cooling strategy"

> +       depends on CPU_IDLE

> +       help

> +	 This implements the generic CPU cooling mechanism through

> +	 idle injection.  This will throttle the CPU by injecting

> +	 fixed idle cycles.  All CPUs belonging to the same cluster

> +	 will enter idle synchronously to reach the deepest idle

> +	 state.

> +

>  endchoice

>  

>  config CLOCK_THERMAL

> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c

> index 5c219dc..1eec8d6 100644

> --- a/drivers/thermal/cpu_cooling.c

> +++ b/drivers/thermal/cpu_cooling.c

> @@ -10,18 +10,33 @@

>   *		Viresh Kumar <viresh.kumar@linaro.org>

>   *

>   */

> +#define pr_fmt(fmt) "CPU cooling: " fmt

> +

>  #include <linux/module.h>

>  #include <linux/thermal.h>

>  #include <linux/cpufreq.h>

> +#include <linux/cpuidle.h>

>  #include <linux/err.h>

> +#include <linux/freezer.h>

>  #include <linux/idr.h>

> +#include <linux/kthread.h>

>  #include <linux/pm_opp.h>

>  #include <linux/slab.h>

> +#include <linux/sched/prio.h>

> +#include <linux/sched/rt.h>

> +#include <linux/smpboot.h>

>  #include <linux/cpu.h>

>  #include <linux/cpu_cooling.h>

>  

> +#include <linux/ratelimit.h>

> +

> +#include <linux/platform_device.h>

> +#include <linux/of_platform.h>

> +

>  #include <trace/events/thermal.h>

>  

> +#include <uapi/linux/sched/types.h>

> +

>  #ifdef CONFIG_CPU_FREQ_THERMAL

>  /*

>   * Cooling state <-> CPUFreq frequency

> @@ -928,3 +943,467 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)

>  }

>  EXPORT_SYMBOL_GPL(cpufreq_cooling_unregister);

>  #endif /* CONFIG_CPU_FREQ_THERMAL */

> +

> +#ifdef CONFIG_CPU_IDLE_THERMAL

> +/**

> + * struct cpuidle_cooling_device - data for the idle cooling device

> + * @cdev: a pointer to a struct thermal_cooling_device

> + * @cpumask: a cpumask containing the CPUs managed by the cooling device

> + * @timer: a hrtimer giving the tempo for the idle injection cycles

> + * @kref: a kernel refcount on this structure

> + * @count: an atomic to keep track of the last task exiting the idle cycle

> + * @idle_cycle: an integer defining the duration of the idle injection

> + * @state: an normalized integer giving the state of the cooling device

> + */

> +struct cpuidle_cooling_device {

> +	struct thermal_cooling_device *cdev;

> +	struct cpumask *cpumask;

> +	struct hrtimer timer;

> +	struct kref kref;

> +	atomic_t count;

> +	unsigned int idle_cycle;

> +	unsigned long state;

> +};

> +

> +struct cpuidle_cooling_thread {

> +	struct task_struct *tsk;

> +	int should_run;

> +};

> +

> +static DEFINE_PER_CPU(struct cpuidle_cooling_thread, cpuidle_cooling_thread);

> +static DEFINE_PER_CPU(struct cpuidle_cooling_device *, cpuidle_cooling_device);

> +

> +/**

> + * cpuidle_cooling_wakeup - Wake up all idle injection threads

> + * @idle_cdev: the idle cooling device

> + *

> + * Every idle injection task belonging to the idle cooling device and

> + * running on an online cpu will be woken up by this call.

> + */

> +static void cpuidle_cooling_wakeup(struct cpuidle_cooling_device *idle_cdev)

> +{

> +	struct cpuidle_cooling_thread *cct;

> +	int cpu;

> +

> +	for_each_cpu_and(cpu, idle_cdev->cpumask, cpu_online_mask) {

> +		cct = per_cpu_ptr(&cpuidle_cooling_thread, cpu);

> +		cct->should_run = 1;

> +		wake_up_process(cct->tsk);

> +	}

> +}

> +

> +/**

> + * cpuidle_cooling_wakeup_fn - Running cycle timer callback

> + * @timer: a hrtimer structure

> + *

> + * When the mitigation is acting, the CPU is allowed to run an amount

> + * of time, then the idle injection happens for the specified delay

> + * and the idle task injection schedules itself until the timer event

> + * wakes the idle injection tasks again for a new idle injection

> + * cycle. The time between the end of the idle injection and the timer

> + * expiration is the allocated running time for the CPU.

> + *

> + * Always returns HRTIMER_NORESTART

> + */

> +static enum hrtimer_restart cpuidle_cooling_wakeup_fn(struct hrtimer *timer)

> +{

> +	struct cpuidle_cooling_device *idle_cdev =

> +		container_of(timer, struct cpuidle_cooling_device, timer);

> +

> +	cpuidle_cooling_wakeup(idle_cdev);

> +

> +	return HRTIMER_NORESTART;

> +}

> +

> +/**

> + * cpuidle_cooling_runtime - Running time computation

> + * @idle_cdev: the idle cooling device

> + *

> + * The running duration is computed from the idle injection duration

> + * which is fixed. If we reach 100% of idle injection ratio, that

> + * means the running duration is zero. If we have a 50% ratio

> + * injection, that means we have equal duration for idle and for

> + * running duration.

> + *

> + * The formula is deduced as the following:

> + *

> + *  running = idle x ((100 / ratio) - 1)

> + *

> + * For precision purpose for integer math, we use the following:

> + *

> + *  running = (idle x 100) / ratio - idle

> + *

> + * For example, if we have an injected duration of 50%, then we end up

> + * with 10ms of idle injection and 10ms of running duration.

> + *

> + * Returns the running duration in nanoseconds as a s64

> + */

> +static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)

> +{

> +	s64 next_wakeup;

> +	unsigned long state = idle_cdev->state;

> +

> +	/*

> +	 * The function should not be called when there is no

> +	 * mitigation because:

> +	 * - that does not make sense

> +	 * - we end up with a division by zero

> +	 */

> +	if (!state)

> +		return 0;

> +

> +	next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -

> +		idle_cdev->idle_cycle;

> +

> +	return next_wakeup * NSEC_PER_USEC;

> +}

> +


There is a bug in your calculation formula here when "state" becomes 100.
You return 0 for the injection rate, which is the same as "rate" being 0,
which is dangerous. You stop cooling when it's most necessary :)

I'm not sure how much sense being 100% idle really makes, so when testing
this I just do if (state == 100) { state = 99; }. Anyways, just don't return 0.
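A minimal standalone sketch of the clamp described above (the 99% cap and the function shape are assumptions for illustration, not the driver's final code): without it, a 100% state returns 0, the hrtimer is never armed again, and injection stops.

```c
#include <assert.h>

#define NSEC_PER_USEC 1000LL

/* Clamp state so a 100% cooling request still yields a small non-zero
 * running window instead of 0, which would keep the wakeup timer from
 * ever being re-armed. */
static long long cooling_runtime_ns(unsigned int idle_cycle_us,
				    unsigned long state)
{
	if (!state)
		return 0;
	if (state > 99)
		state = 99;	/* never let the running time collapse to 0 */

	return ((long long)(idle_cycle_us * 100) / state - idle_cycle_us) *
		NSEC_PER_USEC;
}
```

With a 10ms idle cycle, a 50% state gives an equal 10ms running window, and a 100% request still leaves a ~101us window rather than disabling mitigation.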

Daniel, thanks a lot for these additions! Could you send an update of this?

btw, that's what I'm referring to:
https://lore.kernel.org/linux-pm/1522945005-7165-1-git-send-email-daniel.lezcano@linaro.org/
I know it's a little old already, but it seems like there hasn't been any
equivalent solution in the meantime, has there?

Using cpuidle for cooling is way more effective than cpufreq (which often
hardly is).

thanks again,

                                     martin
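The ratio and running-cycle arithmetic quoted from the changelog above can be sanity-checked with a couple of small helpers (hypothetical names; the constants are the hi6220 numbers: Prun = 5280mW at max OPP, Ptarget = 3326mW, 10ms idle cycle minus 1.5ms cluster wakeup latency):

```c
#include <assert.h>

/* ratio = (Prun - Ptarget) / Ptarget, e.g. (5280 - 3326) / 3326 ~ 0.5874 */
static double mitigation_ratio(double p_run, double p_target)
{
	return (p_run - p_target) / p_target;
}

/* running = idle / ratio, e.g. 8.5 / 0.5874 ~ 14.47 ms */
static double running_cycle_ms(double idle_ms, double ratio)
{
	return idle_ms / ratio;
}
```

These reproduce the ~14.47ms running cycle the changelog derives and the ~14ms the driver was measured to use on the hi6220.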
Martin Kepplinger Aug. 5, 2019, 7:40 a.m. UTC | #16
On 05.08.19 09:37, Daniel Lezcano wrote:
> On 05/08/2019 07:11, Martin Kepplinger wrote:

>> ---

> 

> [ ... ]

> 

>>> +static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)

>>> +{

>>> +	s64 next_wakeup;

>>> +	unsigned long state = idle_cdev->state;

>>> +

>>> +	/*

>>> +	 * The function should not be called when there is no

>>> +	 * mitigation because:

>>> +	 * - that does not make sense

>>> +	 * - we end up with a division by zero

>>> +	 */

>>> +	if (!state)

>>> +		return 0;

>>> +

>>> +	next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -

>>> +		idle_cdev->idle_cycle;

>>> +

>>> +	return next_wakeup * NSEC_PER_USEC;

>>> +}

>>> +

>>

>> There is a bug in your calculation formula here when "state" becomes 100.

>> You return 0 for the injection rate, which is the same as "rate" being 0,

>> which is dangerous. You stop cooling when it's most necessary :)

> 

> Right, thanks for spotting this.

> 

>> I'm not sure how much sense being 100% idle really makes, so when testing
>
>> this I just do if (state == 100) { state = 99; }. Anyways, just don't return 0.

>>

>> Daniel, thanks a lot for these additions! Could you send an update of this?

> 

> Yes, I'm working on a new version.


great.

> 

>> btw, that's what I'm referring to:

>> https://lore.kernel.org/linux-pm/1522945005-7165-1-git-send-email-daniel.lezcano@linaro.org/

>> I know it's a little old already, but it seems like there hasn't been any

>> equivalent solution in the meantime, has there?

>>

>> Using cpuidle for cooling is way more effective than cpufreq (which often

>> hardly is).

> 

> On which platform that happens?

> 

> 


I'm running this on imx8mq.

thanks,

                                    martin
diff mbox series

Patch

diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index 5aaae1b..6c34117 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -166,6 +166,16 @@  config CPU_FREQ_THERMAL
 	  This will be useful for platforms using the generic thermal interface
 	  and not the ACPI interface.
 
+config CPU_IDLE_THERMAL
+       bool "CPU idle cooling strategy"
+       depends on CPU_IDLE
+       help
+	 This implements the generic CPU cooling mechanism through
+	 idle injection.  This will throttle the CPU by injecting
+	 fixed idle cycles.  All CPUs belonging to the same cluster
+	 will enter idle synchronously to reach the deepest idle
+	 state.
+
 endchoice
 
 config CLOCK_THERMAL
diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index 5c219dc..1eec8d6 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -10,18 +10,33 @@ 
  *		Viresh Kumar <viresh.kumar@linaro.org>
  *
  */
+#define pr_fmt(fmt) "CPU cooling: " fmt
+
 #include <linux/module.h>
 #include <linux/thermal.h>
 #include <linux/cpufreq.h>
+#include <linux/cpuidle.h>
 #include <linux/err.h>
+#include <linux/freezer.h>
 #include <linux/idr.h>
+#include <linux/kthread.h>
 #include <linux/pm_opp.h>
 #include <linux/slab.h>
+#include <linux/sched/prio.h>
+#include <linux/sched/rt.h>
+#include <linux/smpboot.h>
 #include <linux/cpu.h>
 #include <linux/cpu_cooling.h>
 
+#include <linux/ratelimit.h>
+
+#include <linux/platform_device.h>
+#include <linux/of_platform.h>
+
 #include <trace/events/thermal.h>
 
+#include <uapi/linux/sched/types.h>
+
 #ifdef CONFIG_CPU_FREQ_THERMAL
 /*
  * Cooling state <-> CPUFreq frequency
@@ -928,3 +943,467 @@  void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
 }
 EXPORT_SYMBOL_GPL(cpufreq_cooling_unregister);
 #endif /* CONFIG_CPU_FREQ_THERMAL */
+
+#ifdef CONFIG_CPU_IDLE_THERMAL
+/**
+ * struct cpuidle_cooling_device - data for the idle cooling device
+ * @cdev: a pointer to a struct thermal_cooling_device
+ * @cpumask: a cpumask containing the CPUs managed by the cooling device
+ * @timer: a hrtimer giving the tempo for the idle injection cycles
+ * @kref: a kernel refcount on this structure
+ * @count: an atomic to keep track of the last task exiting the idle cycle
+ * @idle_cycle: an integer defining the duration of the idle injection
+ * @state: a normalized integer giving the state of the cooling device
+ */
+struct cpuidle_cooling_device {
+	struct thermal_cooling_device *cdev;
+	struct cpumask *cpumask;
+	struct hrtimer timer;
+	struct kref kref;
+	atomic_t count;
+	unsigned int idle_cycle;
+	unsigned long state;
+};
+
+struct cpuidle_cooling_thread {
+	struct task_struct *tsk;
+	int should_run;
+};
+
+static DEFINE_PER_CPU(struct cpuidle_cooling_thread, cpuidle_cooling_thread);
+static DEFINE_PER_CPU(struct cpuidle_cooling_device *, cpuidle_cooling_device);
+
+/**
+ * cpuidle_cooling_wakeup - Wake up all idle injection threads
+ * @idle_cdev: the idle cooling device
+ *
+ * Every idle injection task belonging to the idle cooling device and
+ * running on an online cpu will be woken up by this call.
+ */
+static void cpuidle_cooling_wakeup(struct cpuidle_cooling_device *idle_cdev)
+{
+	struct cpuidle_cooling_thread *cct;
+	int cpu;
+
+	for_each_cpu_and(cpu, idle_cdev->cpumask, cpu_online_mask) {
+		cct = per_cpu_ptr(&cpuidle_cooling_thread, cpu);
+		cct->should_run = 1;
+		wake_up_process(cct->tsk);
+	}
+}
+
+/**
+ * cpuidle_cooling_wakeup_fn - Running cycle timer callback
+ * @timer: a hrtimer structure
+ *
+ * When the mitigation is acting, the CPU is allowed to run an amount
+ * of time, then the idle injection happens for the specified delay
+ * and the idle task injection schedules itself until the timer event
+ * wakes the idle injection tasks again for a new idle injection
+ * cycle. The time between the end of the idle injection and the timer
+ * expiration is the allocated running time for the CPU.
+ *
+ * Always returns HRTIMER_NORESTART
+ */
+static enum hrtimer_restart cpuidle_cooling_wakeup_fn(struct hrtimer *timer)
+{
+	struct cpuidle_cooling_device *idle_cdev =
+		container_of(timer, struct cpuidle_cooling_device, timer);
+
+	cpuidle_cooling_wakeup(idle_cdev);
+
+	return HRTIMER_NORESTART;
+}
+
+/**
+ * cpuidle_cooling_runtime - Running time computation
+ * @idle_cdev: the idle cooling device
+ *
+ * The running duration is computed from the idle injection duration
+ * which is fixed. If we reach 100% of idle injection ratio, that
+ * means the running duration is zero. If we have a 50% ratio
+ * injection, that means we have equal duration for idle and for
+ * running duration.
+ *
+ * The formula is deduced as the following:
+ *
+ *  running = idle x ((100 / ratio) - 1)
+ *
+ * For precision purpose for integer math, we use the following:
+ *
+ *  running = (idle x 100) / ratio - idle
+ *
+ * For example, if we have an injected duration of 50%, then we end up
+ * with 10ms of idle injection and 10ms of running duration.
+ *
+ * Returns the running duration in nanoseconds as a s64
+ */
+static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
+{
+	s64 next_wakeup;
+	unsigned long state = idle_cdev->state;
+
+	/*
+	 * The function should not be called when there is no
+	 * mitigation because:
+	 * - that does not make sense
+	 * - we end up with a division by zero
+	 */
+	if (!state)
+		return 0;
+
+	next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -
+		idle_cdev->idle_cycle;
+
+	return next_wakeup * NSEC_PER_USEC;
+}
+
+/**
+ * cpuidle_cooling_injection - Idle injection mainloop thread function
+ * @cpu: an integer giving the cpu number the thread is pinned on
+ *
+ * This main function does basically two operations:
+ *
+ * - Goes idle for a specific amount of time
+ *
+ * - Sets a timer to wake up all the idle injection threads after a
+ *   running period
+ *
+ * That happens only when the mitigation is enabled, otherwise the
+ * task is scheduled out.
+ *
+ * In order to keep the tasks synchronized together, it is the last
+ * task exiting the idle period which is in charge of setting the
+ * timer.
+ *
+ * This function never returns.
+ */
+static void cpuidle_cooling_injection(unsigned int cpu)
+{
+	s64 next_wakeup;
+
+	struct cpuidle_cooling_device *idle_cdev =
+		per_cpu(cpuidle_cooling_device, cpu);
+
+	struct cpuidle_cooling_thread *cct =
+		per_cpu_ptr(&cpuidle_cooling_thread, cpu);
+
+	atomic_inc(&idle_cdev->count);
+
+	cct->should_run = 0;
+
+	play_idle(idle_cdev->idle_cycle / USEC_PER_MSEC);
+
+	/*
+	 * The last CPU waking up is in charge of setting the
+	 * timer. If the CPU is hotplugged, the timer will
+	 * move to another CPU (which may not belong to the
+	 * same cluster) but that is not a problem as the
+	 * timer will be set again by another CPU belonging to
+	 * the cluster, so this mechanism is self adaptive and
+	 * does not require any hotplugging dance.
+	 */
+	if (!atomic_dec_and_test(&idle_cdev->count))
+		return;
+
+	next_wakeup = cpuidle_cooling_runtime(idle_cdev);
+	if (next_wakeup)
+		hrtimer_start(&idle_cdev->timer, ns_to_ktime(next_wakeup),
+			      HRTIMER_MODE_REL_PINNED);
+}
+
+static void cpuidle_cooling_setup(unsigned int cpu)
+{
+	struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 };
+
+	set_freezable();
+
+	sched_setscheduler(current, SCHED_FIFO, &param);
+}
+
+static int cpuidle_cooling_should_run(unsigned int cpu)
+{
+	struct cpuidle_cooling_thread *cct =
+		per_cpu_ptr(&cpuidle_cooling_thread, cpu);
+
+	return cct->should_run;
+}
+
+static struct smp_hotplug_thread cpuidle_cooling_threads = {
+	.store                  = &cpuidle_cooling_thread.tsk,
+	.thread_fn              = cpuidle_cooling_injection,
+	.thread_comm            = "thermal-idle/%u",
+	.thread_should_run	= cpuidle_cooling_should_run,
+	.setup                  = cpuidle_cooling_setup,
+};
+
+/**
+ * cpuidle_cooling_get_max_state - Get the maximum state
+ * @cdev  : the thermal cooling device
+ * @state : a pointer to the state variable to be filled
+ *
+ * The function always returns 100 as the injection ratio is
+ * percentage based, for consistency across different platforms.
+ *
+ * The function can not fail, it always returns zero.
+ */
+static int cpuidle_cooling_get_max_state(struct thermal_cooling_device *cdev,
+					 unsigned long *state)
+{
+	/*
+	 * Depending on the configuration or the hardware, the running
+	 * cycle and the idle cycle could be different. We want to
+	 * unify that to a 0..100 interval, so the set state interface
+	 * is the same whatever the platform is.
+	 *
+	 * A state of 100% makes the cluster 100% ... idle. A 0%
+	 * injection ratio means no idle injection at all and 50%
+	 * means that for 10ms of idle injection, we have 10ms of
+	 * running time.
+	 */
+	*state = 100;
+
+	return 0;
+}
+
+/**
+ * cpuidle_cooling_get_cur_state - Get the current cooling state
+ * @cdev: the thermal cooling device
+ * @state: a pointer to the state
+ *
+ * The function just copies the state value from the private thermal
+ * cooling device structure; the mapping is 1 <-> 1.
+ *
+ * The function can not fail, it always returns zero.
+ */
+static int cpuidle_cooling_get_cur_state(struct thermal_cooling_device *cdev,
+					 unsigned long *state)
+{
+	struct cpuidle_cooling_device *idle_cdev = cdev->devdata;
+
+	*state = idle_cdev->state;
+
+	return 0;
+}
+
+/**
+ * cpuidle_cooling_set_cur_state - Set the current cooling state
+ * @cdev: the thermal cooling device
+ * @state: the target state
+ *
+ * The function first checks whether we are initiating the mitigation,
+ * in which case it wakes up all the idle injection tasks belonging to
+ * the idle cooling device. In any case, it updates the internal state
+ * of the cooling device.
+ *
+ * The function can not fail, it always returns zero.
+ */
+static int cpuidle_cooling_set_cur_state(struct thermal_cooling_device *cdev,
+					 unsigned long state)
+{
+	struct cpuidle_cooling_device *idle_cdev = cdev->devdata;
+	unsigned long current_state = idle_cdev->state;
+
+	idle_cdev->state = state;
+
+	if (current_state == 0 && state > 0) {
+		pr_debug("Starting cooling cpus '%*pbl'\n",
+			 cpumask_pr_args(idle_cdev->cpumask));
+		cpuidle_cooling_wakeup(idle_cdev);
+	} else if (current_state > 0 && !state)  {
+		pr_debug("Stopping cooling cpus '%*pbl'\n",
+			 cpumask_pr_args(idle_cdev->cpumask));
+	}
+
+	return 0;
+}
+
+/**
+ * cpuidle_cooling_ops - thermal cooling device ops
+ */
+static struct thermal_cooling_device_ops cpuidle_cooling_ops = {
+	.get_max_state = cpuidle_cooling_get_max_state,
+	.get_cur_state = cpuidle_cooling_get_cur_state,
+	.set_cur_state = cpuidle_cooling_set_cur_state,
+};
+
+/**
+ * cpuidle_cooling_release - Kref based release helper
+ * @kref: a pointer to the kref structure
+ *
+ * This function is automatically called by the kref_put function when
+ * the idle cooling device refcount reaches zero. At this point, we
+ * have the guarantee the structure is no longer in use and we can
+ * safely release all the resources.
+ */
+static void __init cpuidle_cooling_release(struct kref *kref)
+{
+	struct cpuidle_cooling_device *idle_cdev =
+		container_of(kref, struct cpuidle_cooling_device, kref);
+
+	if (idle_cdev->cdev)
+		thermal_cooling_device_unregister(idle_cdev->cdev);
+
+	hrtimer_cancel(&idle_cdev->timer);
+	kfree(idle_cdev);
+}
+
+/**
+ * cpuidle_cooling_unregister - Idle cooling device exit function
+ *
+ * This function unregisters the cpuidle cooling device and frees the
+ * resources previously allocated by the init function. It is called
+ * when the initialization fails.
+ */
+static void __init cpuidle_cooling_unregister(void)
+{
+	struct cpuidle_cooling_device *idle_cdev;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		idle_cdev = per_cpu(cpuidle_cooling_device, cpu);
+		if (idle_cdev)
+			kref_put(&idle_cdev->kref, cpuidle_cooling_release);
+	}
+}
+
+/**
+ * cpuidle_cooling_alloc - Allocate and initialize an idle cooling device
+ * @cpumask: a cpumask containing all the cpus handled by the cooling device
+ *
+ * The function is called at init time only. It allocates and
+ * initializes the different fields of the cpuidle cooling device.
+ *
+ * It returns a pointer to a cpuidle_cooling_device structure on
+ * success, NULL on error.
+ */
+static struct cpuidle_cooling_device * __init cpuidle_cooling_alloc(
+	cpumask_t *cpumask)
+{
+	struct cpuidle_cooling_device *idle_cdev;
+	int cpu;
+
+	idle_cdev = kzalloc(sizeof(*idle_cdev), GFP_KERNEL);
+	if (!idle_cdev)
+		return NULL;
+
+	/*
+	 * The idle duration injection. As we don't have yet a way to
+	 * specify from the DT configuration, let's default to a tick
+	 * duration.
+	 */
+	idle_cdev->idle_cycle = TICK_USEC;
+
+	/*
+	 * Initialize the timer to wakeup all the idle injection tasks
+	 */
+	hrtimer_init(&idle_cdev->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+
+	/*
+	 * The wakeup function callback which is in charge of waking
+	 * up all CPUs belonging to the same cluster
+	 */
+	idle_cdev->timer.function = cpuidle_cooling_wakeup_fn;
+
+	idle_cdev->cpumask = cpumask;
+
+	/*
+	 * For each cpu belonging to the cluster, assign the per-cpu
+	 * cpuidle_cooling_device pointer and take a reference on the
+	 * cooling device: kref_init() provides the first reference,
+	 * kref_get() the subsequent ones. Without the kref_init(),
+	 * the refcount would start at zero and the first kref_get()
+	 * would trigger a refcount warning.
+	 */
+	kref_init(&idle_cdev->kref);
+
+	for_each_cpu(cpu, cpumask) {
+		if (cpu != cpumask_first(cpumask))
+			kref_get(&idle_cdev->kref);
+		per_cpu(cpuidle_cooling_device, cpu) = idle_cdev;
+	}
+
+	return idle_cdev;
+}
+
+/**
+ * cpuidle_cooling_register - Idle cooling device initialization function
+ *
+ * This function is in charge of creating a cooling device per cluster
+ * and registering it with the thermal framework. For this we rely on
+ * the topology as there is nothing yet describing the idle state
+ * power domains any better.
+ *
+ * We create a cpuidle cooling device per cluster. For this reason we
+ * must, for each cluster, allocate and initialize the cooling device
+ * and, for each cpu belonging to this cluster, do the initialization
+ * on a per-cpu basis.
+ *
+ * This approach for creating the cooling device is needed as we don't
+ * have the guarantee the CPU numbering is sequential.
+ *
+ * Unfortunately, there is no API to browse the topology from top to
+ * bottom, cluster->cpu, only the usual for_each_possible_cpu loop.
+ * In order to solve that, we use a cpumask to flag the cluster ids we
+ * have already processed. The cpumask will always have enough room
+ * for all the clusters because it is based on NR_CPUS and it is not
+ * possible to have more clusters than cpus.
+ */
+void __init cpuidle_cooling_register(void)
+{
+	struct cpuidle_cooling_device *idle_cdev = NULL;
+	struct thermal_cooling_device *cdev;
+	struct device_node *np;
+	cpumask_var_t cpumask;
+	char dev_name[THERMAL_NAME_LENGTH];
+	int ret = -ENOMEM, cpu;
+	int cluster_id;
+
+	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
+		return;
+
+	for_each_possible_cpu(cpu) {
+
+		cluster_id = topology_physical_package_id(cpu);
+		if (cpumask_test_cpu(cluster_id, cpumask))
+			continue;
+
+		/*
+		 * Allocate the cpuidle cooling device with the list
+		 * of the cpus belonging to the cluster.
+		 */
+		idle_cdev = cpuidle_cooling_alloc(topology_core_cpumask(cpu));
+		if (!idle_cdev)
+			goto out;
+
+		/*
+		 * The thermal cooling device name, we use the
+		 * cluster_id as the numbering index for the idle
+		 * cooling device.
+		 */
+		snprintf(dev_name, sizeof(dev_name), "thermal-idle-%d",
+			 cluster_id);
+
+		np = of_cpu_device_node_get(cpu);
+		cdev = thermal_of_cooling_device_register(np, dev_name,
+							  idle_cdev,
+							  &cpuidle_cooling_ops);
+		if (IS_ERR(cdev)) {
+			ret = PTR_ERR(cdev);
+			goto out;
+		}
+
+		idle_cdev->cdev = cdev;
+		cpumask_set_cpu(cluster_id, cpumask);
+	}
+
+	ret = smpboot_register_percpu_thread(&cpuidle_cooling_threads);
+	if (ret)
+		goto out;
+
+	pr_info("Created cpuidle cooling device\n");
+out:
+	free_cpumask_var(cpumask);
+
+	if (ret) {
+		cpuidle_cooling_unregister();
+		pr_err("Failed to create idle cooling device (%d)\n", ret);
+	}
+}
+#endif /* CONFIG_CPU_IDLE_THERMAL */
diff --git a/include/linux/cpu_cooling.h b/include/linux/cpu_cooling.h
index c0accc7..af5520d 100644
--- a/include/linux/cpu_cooling.h
+++ b/include/linux/cpu_cooling.h
@@ -120,4 +120,10 @@  void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
 }
 #endif	/* CONFIG_CPU_FREQ_THERMAL */
 
+#ifdef CONFIG_CPU_IDLE_THERMAL
+extern void __init cpuidle_cooling_register(void);
+#else /* CONFIG_CPU_IDLE_THERMAL */
+static inline void __init cpuidle_cooling_register(void) { }
+#endif /* CONFIG_CPU_IDLE_THERMAL */
+
 #endif /* __CPU_COOLING_H__ */