[V2,0/7] CPU cooling device new strategies

Message ID 1519226968-19821-1-git-send-email-daniel.lezcano@linaro.org

Message

Daniel Lezcano Feb. 21, 2018, 3:29 p.m. UTC
Changelog:
  V2:
     - Dropped the cpu combo cooling device
     - Added the acked-by tags
     - Replaced task array by a percpu structure
     - Fixed the copyright dates
     - Fixed the extra lines
     - Fixed the compilation macros to be reused
     - Fixed the list node removal
     - Massaged a couple of function names


The following series provides a new way to cool down a SoC by reducing
the dissipated power on the CPUs. Based on the initial work from Kevin
Wangtao, the series implements a CPU cooling device based on idle
injection, relying on the cpuidle framework.

The patchset is designed so that the current DT binding for the
cpufreq cooling device remains compatible with the new cooling devices.

Different CPU cooling devices cannot co-exist on the system: the CPU
cooling device is either enabled or not, and a single cooling strategy
is selected (cpufreq or cpuidle). It is not possible to have all of them
available at the same time. The goal, however, is to eventually enable
them all and be able to switch from one to another at runtime, but that
needs a rework of the thermal framework which is orthogonal to the
feature we are providing.

This series is divided into two parts.

The first part provides trivial changes for the copyright and removes
an unused field in the cpufreq cooling device structure.

The second part provides the idle injection cooling device, allowing a SoC
without a cpufreq driver to use this cooling device as an alternative.

The preliminary benchmarks show the following changes:

On the hikey6220, dhrystone shows a throughput increase of 40% for a
latency increase of 16%, while sysbench shows a latency increase of 5%.

The first version also provided the cpuidle + cpufreq combo cooling
device, but given the latest comments there is a mismatch between how
the cpufreq cooling device behaves across suspend/resume, CPU hotplug
and module loading/unloading, and the combo cooling device, which was
designed assuming the cpufreq cooling device was always present. This
interaction is being investigated and the combo cooling device will be
posted separately after this series gets merged.

Daniel Lezcano (7):
  thermal/drivers/cpu_cooling: Fixup the header and copyright
  thermal/drivers/cpu_cooling: Add Software Package Data Exchange (SPDX)
  thermal/drivers/cpu_cooling: Remove pointless field
  thermal/drivers/Kconfig: Convert the CPU cooling device to a choice
  thermal/drivers/cpu_cooling: Add idle cooling device documentation
  thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
  cpuidle/drivers/cpuidle-arm: Register the cooling device

 Documentation/thermal/cpu-idle-cooling.txt | 165 ++++++++++
 drivers/cpuidle/cpuidle-arm.c              |   5 +
 drivers/thermal/Kconfig                    |  30 +-
 drivers/thermal/cpu_cooling.c              | 480 +++++++++++++++++++++++++++--
 include/linux/cpu_cooling.h                |  15 +-
 5 files changed, 668 insertions(+), 27 deletions(-)
 create mode 100644 Documentation/thermal/cpu-idle-cooling.txt

-- 
2.7.4

Comments

Daniel Lezcano Feb. 23, 2018, 11:28 a.m. UTC | #1
On 23/02/2018 08:34, Viresh Kumar wrote:
> On 21-02-18, 16:29, Daniel Lezcano wrote:

>> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c

>> index 5c219dc..9340216 100644

>> --- a/drivers/thermal/cpu_cooling.c

>> +++ b/drivers/thermal/cpu_cooling.c

>> @@ -10,18 +10,32 @@

>>   *		Viresh Kumar <viresh.kumar@linaro.org>

>>   *

>>   */

>> +#undef DEBUG

> 

> Why is this required ?


It is usually added so that if you build with the -DDEBUG flag, you
don't get the pr_debug() traces for every file, but just for the ones
where the #undef above has been commented out. pr_debug() is a no-op
otherwise.
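
For illustration, a minimal sketch of the pattern being described (not
code from the patch):

    /*
     * Building everything with -DDEBUG would enable pr_debug() output in
     * every file.  Keeping this #undef at the top of the file opts this
     * file out; comment it out locally when the traces are wanted here.
     */
    #undef DEBUG

    #define pr_fmt(fmt) "CPU cooling: " fmt

    #include <linux/printk.h>

    static void example(void)
    {
            /* A no-op unless DEBUG (or dynamic debug) is enabled here. */
            pr_debug("idle injection state changed\n");
    }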

>> +#define pr_fmt(fmt) "CPU cooling: " fmt

> 

> I think you can use the dev_***() routines instead, as you can

> directly get the CPU device from anywhere.


Can we postpone this change for later? The whole file uses pr_*()
(cpufreq cooling included). There is only one place where dev_err() is
used, and it is removed by patch 3/7.

[ ... ]

>> +#ifdef CONFIG_CPU_IDLE_THERMAL

>> +/*

>> + * The idle duration injection. As we don't have yet a way to specify

>> + * from the DT configuration, let's default to a tick duration.

>> + */

>> +#define DEFAULT_IDLE_TIME_US TICK_USEC

> 

> I think this macro is a bit unnecessary here. It's used only at

> initialization and so the same comment can be present there and you

> can use TICK_USEC there.

> 

> Else, Keep it as it is and remove the "idle_cycle" field from the

> below structure, as it holds a constant forever.


Yes, makes sense.

>> +/**

>> + * struct cpuidle_cooling_device - data for the idle cooling device

>> + * @cdev: a pointer to a struct thermal_cooling_device

>> + * @cpumask: a cpumask containing the CPU managed by the cooling device

> 

> Missed @node here.


Ok.

>> + * @timer: a hrtimer giving the tempo for the idle injection cycles

>> + * @kref: a kernel refcount on this structure

>> + * @count: an atomic to keep track of the last task exiting the idle cycle

>> + * @idle_cycle: an integer defining the duration of the idle injection

>> + * @state: an normalized integer giving the state of the cooling device

>> + */

>> +struct cpuidle_cooling_device {

>> +	struct thermal_cooling_device *cdev;

>> +	struct cpumask *cpumask;

>> +	struct list_head node;

>> +	struct hrtimer timer;

>> +	struct kref kref;

>> +	atomic_t count;

>> +	unsigned int idle_cycle;

>> +	unsigned int state;

>> +};

>> +

>> +/**

>> + * @tsk: an array of pointer to the idle injection tasks

>> + * @waitq: the waiq for the idle injection tasks

>> + */

>> +struct cpuidle_cooling_tsk {

>> +	struct task_struct *tsk;

>> +	wait_queue_head_t waitq;

>> +};

>> +

>> +DEFINE_PER_CPU(struct cpuidle_cooling_tsk, cpuidle_cooling_tsk);

> 

> static ?


Ok.

[ ... ]

>> +static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)

>> +{

>> +	s64 next_wakeup;

>> +	int state = idle_cdev->state;

>> +

>> +	/*

>> +	 * The function must never be called when there is no

>> +	 * mitigation because:

>> +	 * - that does not make sense

>> +	 * - we end up with a division by zero

>> +	 */

>> +	BUG_ON(!state);

> 

> As there is no locking in place, we can surely hit this case. What if

> the state changed to 0 right before this routine was called ?

> 

> I would suggest we should just return 0 in that case and get away with

> the BUG_ON().


Ok.

>> +

>> +	next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -

>> +		idle_cdev->idle_cycle;

>> +

>> +	return next_wakeup * NSEC_PER_USEC;

>> +}
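
For reference, a minimal sketch of what the helper could look like with
the BUG_ON() replaced by the early return agreed above; the numbers in
the comment are only an illustrative example (see also the later
discussion in this thread about distinguishing this 0 from the
state = 100 case):

    static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
    {
            s64 next_wakeup;
            int state = idle_cdev->state;

            /*
             * No mitigation in progress: avoid the division by zero and
             * request no run time at all.
             */
            if (!state)
                    return 0;

            /*
             * Example: idle_cycle = 10000 us and state = 20 (%) gives
             * (10000 * 100) / 20 - 10000 = 40000 us of run time between
             * two injections, i.e. 20% idle over the whole period.
             */
            next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -
                    idle_cdev->idle_cycle;

            return next_wakeup * NSEC_PER_USEC;
    }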

>> +

>> +/**

>> + * cpuidle_cooling_injection_thread - Idle injection mainloop thread function

>> + * @arg: a void pointer containing the idle cooling device address

>> + *

>> + * This main function does basically two operations:

>> + *

>> + * - Goes idle for a specific amount of time

>> + *

>> + * - Sets a timer to wake up all the idle injection threads after a

>> + *   running period

>> + *

>> + * That happens only when the mitigation is enabled, otherwise the

>> + * task is scheduled out.

>> + *

>> + * In order to keep the tasks synchronized together, it is the last

>> + * task exiting the idle period which is in charge of setting the

>> + * timer.

>> + *

>> + * This function never returns.

>> + */

>> +static int cpuidle_cooling_injection_thread(void *arg)

>> +{

>> +	struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 };

>> +	struct cpuidle_cooling_device *idle_cdev = arg;

>> +	struct cpuidle_cooling_tsk *cct = per_cpu_ptr(&cpuidle_cooling_tsk,

>> +						      smp_processor_id());

> 

> this_cpu_ptr ?


yeah, even better.

>> +	DEFINE_WAIT(wait);

>> +

>> +	set_freezable();

>> +

>> +	sched_setscheduler(current, SCHED_FIFO, &param);

>> +

>> +	while (1) {

>> +		s64 next_wakeup;

>> +

>> +		prepare_to_wait(&cct->waitq, &wait, TASK_INTERRUPTIBLE);

>> +

>> +		schedule();

>> +

>> +		atomic_inc(&idle_cdev->count);

>> +

>> +		play_idle(idle_cdev->idle_cycle / USEC_PER_MSEC);

>> +

>> +		/*

>> +		 * The last CPU waking up is in charge of setting the

>> +		 * timer. If the CPU is hotplugged, the timer will

>> +		 * move to another CPU (which may not belong to the

>> +		 * same cluster) but that is not a problem as the

>> +		 * timer will be set again by another CPU belonging to

>> +		 * the cluster, so this mechanism is self adaptive and

>> +		 * does not require any hotplugging dance.

>> +		 */

> 

> Well this depends on how CPU hotplug really happens. What happens to

> the per-cpu-tasks which are in the middle of something when hotplug

> happens?  Does hotplug wait for those per-cpu-tasks to finish ?

> 

>> +		if (!atomic_dec_and_test(&idle_cdev->count))

>> +			continue;

>> +

>> +		if (!idle_cdev->state)

>> +			continue;

>> +

>> +		next_wakeup = cpuidle_cooling_runtime(idle_cdev);

>> +

>> +		hrtimer_start(&idle_cdev->timer, ns_to_ktime(next_wakeup),

>> +			      HRTIMER_MODE_REL_PINNED);

>> +	}

>> +

>> +	finish_wait(&cct->waitq, &wait);

>> +

>> +	return 0;

>> +}

>> +

>> +/**

>> + * cpuidle_cooling_release - Kref based release helper

>> + * @kref: a pointer to the kref structure

>> + *

>> + * This function is automatically called by the kref_put function when

>> + * the idle cooling device refcount reaches zero. At this point, we

>> + * have the guarantee the structure is no longer in use and we can

>> + * safely release all the ressources.

>> + */

>> +static void __init cpuidle_cooling_release(struct kref *kref)

>> +{

>> +	struct cpuidle_cooling_device *idle_cdev =

>> +		container_of(kref, struct cpuidle_cooling_device, kref);

>> +

>> +	thermal_cooling_device_unregister(idle_cdev->cdev);

>> +	kfree(idle_cdev);

>> +}

> 

> Maybe just move it closer to the only function that calls it?


Ok.

[ ... ]

>> +/**

>> + * cpuidle_cooling_get_cur_state - Get the current cooling state

>> + * @cdev: the thermal cooling device

>> + * @state: a pointer to the state

>> + *

>> + * The function just copy the state value from the private thermal

>> + * cooling device structure, the mapping is 1 <-> 1.

>> + *

>> + * The function can not fail, it returns always zero.

>> + */

>> +static int cpuidle_cooling_get_cur_state(struct thermal_cooling_device *cdev,

>> +					 unsigned long *state)

>> +{

>> +        struct cpuidle_cooling_device *idle_cdev = cdev->devdata;

> 

> This line isn't aligned properly. Spaces are present instead of a tab.


Ok.

[ ... ]

> Same at other places as well.


Ok, I will double check it.

[ ... ]

>> +static void cpuidle_cooling_unregister(void)

>> +{

>> +	struct cpuidle_cooling_device *tmp, *idle_cdev = NULL;

>> +	struct cpuidle_cooling_tsk *cct;

>> +	int cpu;

>> +

>> +	list_for_each_entry_safe(idle_cdev, tmp, &cpuidle_cdev_list, node) {

>> +		for_each_cpu(cpu, idle_cdev->cpumask) {

>> +			cct = per_cpu_ptr(&cpuidle_cooling_tsk, cpu);

>> +			if (cct->tsk)

>> +				kthread_stop(cct->tsk);

> 

> What about hrtimer ? Shouldn't that be stopped as well ?


IMO, practically it is not possible to have a timer running at this
point, but rigorously it must be stopped in the release routine, so I
will add it.
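
A minimal sketch of what that addition could look like (an assumption,
not the final code):

    static void cpuidle_cooling_release(struct kref *kref)
    {
            struct cpuidle_cooling_device *idle_cdev =
                    container_of(kref, struct cpuidle_cooling_device, kref);

            /* Make sure no wakeup timer is still queued before freeing. */
            hrtimer_cancel(&idle_cdev->timer);

            thermal_cooling_device_unregister(idle_cdev->cdev);
            kfree(idle_cdev);
    }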

>> +			kref_put(&idle_cdev->kref, cpuidle_cooling_release);

>> +		}

>> +	}

>> +}

>> +

>> +/**

>> + * cpuidle_cooling_register - Idle cooling device initialization function

>> + *

>> + * This function is in charge of creating a cooling device per cluster

>> + * and register it to thermal framework. For this we rely on the

>> + * topology as there is nothing yet describing better the idle state

>> + * power domains.

>> + *

>> + * For each first CPU of the cluster's cpumask, we allocate the idle

>> + * cooling device, initialize the general fields and then we initialze

>> + * the rest in a per cpu basis.

>> + *

>> + * Returns zero on success, < 0 otherwise.

> 

> This wouldn't get shown in the doc properly, it should be written as:

> 

>   * Return: zero on success, < 0 otherwise.


Ok.

>> + */

>> +int cpuidle_cooling_register(void)

>> +{

>> +	struct cpuidle_cooling_device *idle_cdev = NULL;

>> +	struct thermal_cooling_device *cdev;

>> +	struct cpuidle_cooling_tsk *cct;

>> +	struct task_struct *tsk;

>> +	struct device_node *np;

>> +	cpumask_t *cpumask;

>> +	char dev_name[THERMAL_NAME_LENGTH];

>> +	int ret = -ENOMEM, cpu;

>> +	int index = 0;

>> +

>> +	for_each_possible_cpu(cpu) {

>> +		cpumask = topology_core_cpumask(cpu);

>> +

>> +		cct = per_cpu_ptr(&cpuidle_cooling_tsk, cpu);

>> +

>> +		/*

>> +		 * This condition makes the first cpu belonging to the

>> +		 * cluster to create a cooling device and allocates

>> +		 * the structure. Others CPUs belonging to the same

>> +		 * cluster will just increment the refcount on the

>> +		 * cooling device structure and initialize it.

>> +		 */

>> +		if (cpu == cpumask_first(cpumask)) {

> 

> Your function still have few assumptions of cpu numbering and it will

> break in few cases. What if the CPUs on a big Little system (4x4) are

> present in this order: B L L L L B B B  ??

> 

> This configuration can happen if CPUs in DT are marked as: 0-3 LITTLE,

> 4-7 big and a big CPU is used by the boot loader to bring up Linux.


Ok, how can I sort it out ?

>> +			np = of_cpu_device_node_get(cpu);

>> +

>> +			idle_cdev = kzalloc(sizeof(*idle_cdev), GFP_KERNEL);

>> +			if (!idle_cdev)

>> +				goto out_fail;

>> +

>> +			idle_cdev->idle_cycle = DEFAULT_IDLE_TIME_US;

>> +

>> +			atomic_set(&idle_cdev->count, 0);

> 

> This should already be 0, isn't it ?


Yes.

>> +

>> +			kref_init(&idle_cdev->kref);

>> +

>> +			/*

>> +			 * Initialize the timer to wakeup all the idle

>> +			 * injection tasks

>> +			 */

>> +			hrtimer_init(&idle_cdev->timer,

>> +				     CLOCK_MONOTONIC, HRTIMER_MODE_REL);

>> +

>> +			/*

>> +			 * The wakeup function callback which is in

>> +			 * charge of waking up all CPUs belonging to

>> +			 * the same cluster

>> +			 */

>> +			idle_cdev->timer.function = cpuidle_cooling_wakeup_fn;

>> +

>> +			/*

>> +			 * The thermal cooling device name

>> +			 */

>> +			snprintf(dev_name, sizeof(dev_name), "thermal-idle-%d", index++);

>> +			cdev = thermal_of_cooling_device_register(np, dev_name,

>> +								  idle_cdev,

>> +								  &cpuidle_cooling_ops);

> 

> You registered the cooling device, while the idle_cdev is still

> getting initialized. Ideally, we should register with the thermal core

> (or any other framework) when we are fully ready. For example, any of

> the callbacks present in cpuidle_cooling_ops() can get called by the

> core after this point and you should be ready to handle them. It will

> result in kernel crash in your case as idle_cdev isn't fully

> initialized yet. For example, with the set-state callback you will end

> up calling wake_up_task(NULL).

> 

>> +			if (IS_ERR(cdev)) {

>> +				ret = PTR_ERR(cdev);

>> +				goto out_fail;

>> +			}

>> +

>> +			idle_cdev->cdev = cdev;

>> +

>> +			idle_cdev->cpumask = cpumask;

>> +

>> +			list_add(&idle_cdev->node, &cpuidle_cdev_list);

> 

> I would have removed the above two blank lines as these can all go

> together.


Ok.

>> +

>> +			pr_info("Created idle cooling device for cpus '%*pbl'\n",

>> +				cpumask_pr_args(cpumask));

> 

> I am not sure if it makes sense to print the message right here. We

> can still fail while creating the cooling devices. Maybe a single

> print at the very end of the function is more than enough?


Ok.

>> +		}

>> +

>> +		kref_get(&idle_cdev->kref);

> 

> Did you check if the kref thing is really working here or not? I think

> your structure will never get freed on errors because kref_init()

> initializes the count to 1 and then you do an additional kref_get()

> here for the first CPU of the cluster.


Yes, you are right. I will fix that.
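
One way to keep the refcount balanced is to let kref_init() account for
the first CPU and only take an extra reference for the other CPUs of the
cluster; a rough sketch (illustrative only):

    if (cpu == cpumask_first(cpumask)) {
            idle_cdev = kzalloc(sizeof(*idle_cdev), GFP_KERNEL);
            if (!idle_cdev)
                    goto out_fail;

            /* The refcount starts at 1 on behalf of the first CPU. */
            kref_init(&idle_cdev->kref);

            /* ... rest of the per-cluster initialization ... */
    } else {
            /* One extra reference per additional CPU of the cluster. */
            kref_get(&idle_cdev->kref);
    }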

>> +

>> +		init_waitqueue_head(&cct->waitq);

>> +

>> +		tsk = kthread_create_on_cpu(cpuidle_cooling_injection_thread,

>> +					    idle_cdev, cpu, "kidle_inject/%u");

>> +		if (IS_ERR(tsk)) {

>> +			ret = PTR_ERR(tsk);

>> +			goto out_fail;

>> +		}

>> +

>> +		cct->tsk = tsk;

>> +

>> +		wake_up_process(tsk);

>> +	}

>> +

>> +	return 0;

>> +

>> +out_fail:

>> +	cpuidle_cooling_unregister();

>> +	pr_err("Failed to create idle cooling device (%d)\n", ret);

> 

> So you are already printing the error message here, just make the

> return type void as the caller of this can't do anything anyway.


Ok.

>> +

>> +	return ret;

>> +}

>> +#endif /* CONFIG_CPU_IDLE_THERMAL */



Thanks for the review.

  -- Daniel


Vincent Guittot Feb. 23, 2018, 3:26 p.m. UTC | #2
Hi Daniel,

On 21 February 2018 at 16:29, Daniel Lezcano <daniel.lezcano@linaro.org> wrote:
> +

> +/**

> + * struct cpuidle_cooling_device - data for the idle cooling device

> + * @cdev: a pointer to a struct thermal_cooling_device

> + * @cpumask: a cpumask containing the CPU managed by the cooling device

> + * @timer: a hrtimer giving the tempo for the idle injection cycles

> + * @kref: a kernel refcount on this structure

> + * @count: an atomic to keep track of the last task exiting the idle cycle

> + * @idle_cycle: an integer defining the duration of the idle injection

> + * @state: an normalized integer giving the state of the cooling device

> + */

> +struct cpuidle_cooling_device {

> +       struct thermal_cooling_device *cdev;

> +       struct cpumask *cpumask;

> +       struct list_head node;

> +       struct hrtimer timer;

> +       struct kref kref;

> +       atomic_t count;

> +       unsigned int idle_cycle;

> +       unsigned int state;

> +};

> +

> +/**

> + * @tsk: an array of pointer to the idle injection tasks

> + * @waitq: the waiq for the idle injection tasks

> + */

> +struct cpuidle_cooling_tsk {

> +       struct task_struct *tsk;

> +       wait_queue_head_t waitq;


Why are you creating one wait_queue_head_t per cpu instead of one per
cooling device, and then saving a pointer in the per cpu struct
cpuidle_cooling_tsk?
Then you could use wake_up_interruptible_all() to wake up all the threads
instead of the for_each_cpu ... wake_up_process() loop in
cpuidle_cooling_wakeup().
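
A rough sketch of the layout Vincent is suggesting, with one wait queue
per cooling device instead of one per CPU (field and helper names are
illustrative, not from the patch):

    struct cpuidle_cooling_device {
            struct thermal_cooling_device *cdev;
            struct cpumask *cpumask;
            wait_queue_head_t waitq;        /* shared by the injection threads */
            /* ... */
    };

    struct cpuidle_cooling_tsk {
            struct task_struct *tsk;
            struct cpuidle_cooling_device *idle_cdev;       /* back-pointer */
    };

    static void cpuidle_cooling_wakeup(struct cpuidle_cooling_device *idle_cdev)
    {
            /* Wakes every thread sleeping on the device's wait queue. */
            wake_up_interruptible_all(&idle_cdev->waitq);
    }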

> +};

> +

> +DEFINE_PER_CPU(struct cpuidle_cooling_tsk, cpuidle_cooling_tsk);

> +

> +static LIST_HEAD(cpuidle_cdev_list);

> +

> +/**

> + * cpuidle_cooling_wakeup - Wake up all idle injection threads

> + * @idle_cdev: the idle cooling device

> + *

> + * Every idle injection task belonging to the idle cooling device and

> + * running on an online cpu will be wake up by this call.

> + */

> +static void cpuidle_cooling_wakeup(struct cpuidle_cooling_device *idle_cdev)

> +{

> +       int cpu;

> +       struct cpuidle_cooling_tsk *cct;

> +

> +       for_each_cpu_and(cpu, idle_cdev->cpumask, cpu_online_mask) {

> +               cct = per_cpu_ptr(&cpuidle_cooling_tsk, cpu);

> +               wake_up_process(cct->tsk);

> +       }

> +}

> +

> +/**

> + * cpuidle_cooling_wakeup_fn - Running cycle timer callback

> + * @timer: a hrtimer structure

> + *

> + * When the mitigation is acting, the CPU is allowed to run an amount

> + * of time, then the idle injection happens for the specified delay

> + * and the idle task injection schedules itself until the timer event

> + * wakes the idle injection tasks again for a new idle injection

> + * cycle. The time between the end of the idle injection and the timer

> + * expiration is the allocated running time for the CPU.

> + *

> + * Returns always HRTIMER_NORESTART

> + */

> +static enum hrtimer_restart cpuidle_cooling_wakeup_fn(struct hrtimer *timer)

> +{

> +       struct cpuidle_cooling_device *idle_cdev =

> +               container_of(timer, struct cpuidle_cooling_device, timer);

> +

> +       cpuidle_cooling_wakeup(idle_cdev);

> +

> +       return HRTIMER_NORESTART;

> +}

> +

> +/**
Leo Yan March 26, 2018, 2:30 p.m. UTC | #3
On Thu, Mar 08, 2018 at 12:03:52PM +0000, Daniel Thompson wrote:
> On Wed, Mar 07, 2018 at 07:57:17PM +0100, Daniel Lezcano wrote:

> > >> The preliminary benchmarks show the following changes:

> > >>

> > >> On the hikey6220, dhrystone shows a throughput increase of 40% for an

> > >> increase of the latency of 16% while sysbench shows a latency increase

> > >> of 5%.

> > > 

> > > I don't follow these numbers. Throughput increase while injecting idle?

> > > compared to what? percentages of what? Please be more specific to better

> > > describe your work.

> > 

> > The dhrystone throughput is based on the virtual timer, when we are

> > running, it is at max opp, so the throughput increases. But regarding

> > the real time, it takes obviously more time to achieve as we are

> > artificially inserting idle cycles. With the cpufreq governor, we run at

> > a lower opp, so the throughput is less for dhrystone but it takes less

> > time to achieve.

> > 

> > Percentages are comparing cpufreq vs cpuidle cooling devices. I will

> > take care of presenting the results in a more clear way in the next version.

> 

> I think we should also note that the current hikey settings for cpufreq

> are very badly tuned for this platform. It has a single temp threshold

> and it jumps from max freq to min freq.

> 

> IIRC Leo's work on Hikey thermals correctly, it would be much better if

> it used the power-allocator thermal governor, or if it copied some of

> the Samsung 32-bit platforms by configuring the step governor with a

> graduated response and a slightly lower threshold that moves two stops back in

> the OPP table (which is still a fairly high clock speed... but is

> thermally sustainable).


I think Daniel L. is working on this patch set with the 'power-allocator'
governor, and the parameters 'sustainable-power = <3326>' and
'dynamic-power-coefficient = <311>' are profiled values for the Hikey
platform.  For now we only consider dynamic power and skip static leakage
for the 'power-allocator' governor.  All these parameters are already
merged into the mainline kernel.

Daniel L. could correct me if I misunderstand the testing conditions.

Thanks,
Leo Yan
Leo Yan March 27, 2018, 2:03 a.m. UTC | #4
Hi Daniel,

On Wed, Feb 21, 2018 at 04:29:27PM +0100, Daniel Lezcano wrote:
> The cpu idle cooling driver performs synchronized idle injection across all

> cpus belonging to the same cluster and offers a new method to cool down a SoC.

> 

> Each cluster has its own idle cooling device, each core has its own idle

> injection thread, each idle injection thread uses play_idle to enter idle.  In

> order to reach the deepest idle state, each cooling device has the idle

> injection threads synchronized together.

> 

> It has some similarity with the intel power clamp driver but it is actually

> designed to work on the ARM architecture via the DT with a mathematical proof

> with the power model which comes with the Documentation.

> 

> The idle injection cycle is fixed while the running cycle is variable. That

> allows to have control on the device reactivity for the user experience. At

> the mitigation point the idle threads are unparked, they play idle the

> specified amount of time and they schedule themselves. The last thread sets

> the next idle injection deadline and when the timer expires it wakes up all

> the threads which in turn play idle again. Meanwhile the running cycle is

> changed by set_cur_state.  When the mitigation ends, the threads are parked.

> The algorithm is self adaptive, so there is no need to handle hotplugging.


The idle injection threads are RT threads (FIFO) and I saw that
play_idle() sets/clears the PF_IDLE flag for them.  Will the idle
injection threads' utilization be accounted into the RT utilization?

If the idle injection threads' utilization is accounted as RT task
utilization, will this impact the CPUFreq governor 'schedutil' for OPP
selection?

[...]

Thanks,
Leo Yan
Daniel Lezcano March 27, 2018, 9:35 a.m. UTC | #5
On 26/03/2018 16:30, Leo Yan wrote:
> On Thu, Mar 08, 2018 at 12:03:52PM +0000, Daniel Thompson wrote:

>> On Wed, Mar 07, 2018 at 07:57:17PM +0100, Daniel Lezcano wrote:

>>>>> The preliminary benchmarks show the following changes:

>>>>>

>>>>> On the hikey6220, dhrystone shows a throughput increase of 40% for an

>>>>> increase of the latency of 16% while sysbench shows a latency increase

>>>>> of 5%.

>>>>

>>>> I don't follow these numbers. Throughput increase while injecting idle?

>>>> compared to what? percentages of what? Please be more specific to better

>>>> describe your work.

>>>

>>> The dhrystone throughput is based on the virtual timer, when we are

>>> running, it is at max opp, so the throughput increases. But regarding

>>> the real time, it takes obviously more time to achieve as we are

>>> artificially inserting idle cycles. With the cpufreq governor, we run at

>>> a lower opp, so the throughput is less for dhrystone but it takes less

>>> time to achieve.

>>>

>>> Percentages are comparing cpufreq vs cpuidle cooling devices. I will

>>> take care of presenting the results in a more clear way in the next version.

>>

>> I think we should also note that the current hikey settings for cpufreq

>> are very badly tuned for this platform. It has a single temp threshold

>> and it jumps from max freq to min freq.

>>

>> IIRC Leo's work on Hikey thermals correctly, it would be much better if

>> it used the power-allocator thermal governor, or if it copied some of

>> the Samsung 32-bit platforms by configuring the step governor with a

>> graduated response and a slightly lower threshold that moves two stops back in

>> the OPP table (which is still a fairly high clock speed... but is

>> thermally sustainable).

> 

> I think Daniel L. is working on this patch set with the 'power-allocator'

> governor, and the parameters 'sustainable-power = <3326>' and

> 'dynamic-power-coefficient = <311>' are profiled values for the Hikey

> platform.  For now we only consider dynamic power and skip static leakage

> for the 'power-allocator' governor.  All these parameters are already

> merged into the mainline kernel.

> 

> Daniel L. could correct me if I misunderstand the testing conditions.


Well, the first iteration is without the power allocator governor API. It
was tested with the step-wise governor only. But you are right that it
will use the dynamic-power-coefficient and sustainable-power and will
implement the power allocator version of the API. I'm working on the
power allocator version for the idle injection + OPP change, as we need
to compute the capacity equivalence between the idle-injection cycles +
OPP and the lower OPP in order to change the OPP for an optimized power /
performance trade-off.
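
One way to write the equivalence described above, assuming capacity
scales roughly linearly with frequency (this simple model is an
assumption of this note, not a formula taken from the thread):

    capacity(f_max) * (1 - r)  ~=  capacity(f')   =>   f' ~= f_max * (1 - r)

where 'r' is the fraction of the injection period spent idle, i.e.
idle_cycle / (idle_cycle + run time).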


Daniel Lezcano March 27, 2018, 10:56 a.m. UTC | #6
On 27/03/2018 05:35, Leo Yan wrote:
> On Wed, Feb 21, 2018 at 04:29:27PM +0100, Daniel Lezcano wrote:

> 

> [...]

> 

>> +/**

>> + * cpuidle_cooling_injection_thread - Idle injection mainloop thread function

>> + * @arg: a void pointer containing the idle cooling device address

>> + *

>> + * This main function does basically two operations:

>> + *

>> + * - Goes idle for a specific amount of time

>> + *

>> + * - Sets a timer to wake up all the idle injection threads after a

>> + *   running period

>> + *

>> + * That happens only when the mitigation is enabled, otherwise the

>> + * task is scheduled out.

>> + *

>> + * In order to keep the tasks synchronized together, it is the last

>> + * task exiting the idle period which is in charge of setting the

>> + * timer.

>> + *

>> + * This function never returns.

>> + */

>> +static int cpuidle_cooling_injection_thread(void *arg)

>> +{

>> +	struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 };

> 

> I am just wondering if we should set the priority to (MAX_RT_PRIO - 1)?

> Otherwise I am concerned it might not be able to enter a deep idle state when

> any CPU idle injection thread is preempted by other higher priority RT

> threads, so the CPUs are no longer aligned when entering/exiting the idle state.


I do believe we should consider other RT tasks more important than the
idle injection threads.

>> +	struct cpuidle_cooling_device *idle_cdev = arg;

>> +	struct cpuidle_cooling_tsk *cct = per_cpu_ptr(&cpuidle_cooling_tsk,

>> +						      smp_processor_id());

>> +	DEFINE_WAIT(wait);

>> +

>> +	set_freezable();

>> +

>> +	sched_setscheduler(current, SCHED_FIFO, &param);

>> +

>> +	while (1) {

>> +		s64 next_wakeup;

>> +

>> +		prepare_to_wait(&cct->waitq, &wait, TASK_INTERRUPTIBLE);

>> +

>> +		schedule();

>> +

>> +		atomic_inc(&idle_cdev->count);

>> +

>> +		play_idle(idle_cdev->idle_cycle / USEC_PER_MSEC);

>> +

>> +		/*

>> +		 * The last CPU waking up is in charge of setting the

>> +		 * timer. If the CPU is hotplugged, the timer will

>> +		 * move to another CPU (which may not belong to the

>> +		 * same cluster) but that is not a problem as the

>> +		 * timer will be set again by another CPU belonging to

>> +		 * the cluster, so this mechanism is self adaptive and

>> +		 * does not require any hotplugging dance.

>> +		 */

>> +		if (!atomic_dec_and_test(&idle_cdev->count))

>> +			continue;

>> +

>> +		if (!idle_cdev->state)

>> +			continue;

>> +

>> +		next_wakeup = cpuidle_cooling_runtime(idle_cdev);

>> +

>> +		hrtimer_start(&idle_cdev->timer, ns_to_ktime(next_wakeup),

>> +			      HRTIMER_MODE_REL_PINNED);

> 

> If the SoC temperature decreases below the tipping point, will the timer be

> disabled for this case?  Or will the next timer event be set with a big

> value from next_wakeup?


Another timer (the polling one) will update the 'state' variable to zero
in the set_cur_state. In the worst case, we check the idle_cdev->state
right before it is updated and we end up with an extra idle injection
cycle which is perfectly fine.





Daniel Lezcano March 27, 2018, 11:10 a.m. UTC | #7
On 27/03/2018 05:43, Leo Yan wrote:
> On Fri, Feb 23, 2018 at 12:28:51PM +0100, Daniel Lezcano wrote:

>> On 23/02/2018 08:34, Viresh Kumar wrote:

>>> On 21-02-18, 16:29, Daniel Lezcano wrote:

> 

>> [ ... ]

>>

>>>> +static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)

>>>> +{

>>>> +	s64 next_wakeup;

>>>> +	int state = idle_cdev->state;

>>>> +

>>>> +	/*

>>>> +	 * The function must never be called when there is no

>>>> +	 * mitigation because:

>>>> +	 * - that does not make sense

>>>> +	 * - we end up with a division by zero

>>>> +	 */

>>>> +	BUG_ON(!state);

>>>

>>> As there is no locking in place, we can surely hit this case. What if

>>> the state changed to 0 right before this routine was called ?

>>>

>>> I would suggest we should just return 0 in that case and get away with

>>> the BUG_ON().

> 

> Here if 'state' equals 0 and we return 0, then the return value will

> be the same as when 'state' = 100; this makes the return value confusing.

> 

> I think for 'state' = 0 we should return -1 to indicate that the hrtimer

> should not be set for this case?


Yeah, I will look at how to make this smoother.

Thanks

  -- Daniel


Daniel Lezcano April 4, 2018, 8:50 a.m. UTC | #8
On 26/02/2018 05:30, Viresh Kumar wrote:

[ ... ]

>>>> +

>>>> +	for_each_possible_cpu(cpu) {

>>>> +		cpumask = topology_core_cpumask(cpu);

>>>> +

>>>> +		cct = per_cpu_ptr(&cpuidle_cooling_tsk, cpu);

>>>> +

>>>> +		/*

>>>> +		 * This condition makes the first cpu belonging to the

>>>> +		 * cluster to create a cooling device and allocates

>>>> +		 * the structure. Others CPUs belonging to the same

>>>> +		 * cluster will just increment the refcount on the

>>>> +		 * cooling device structure and initialize it.

>>>> +		 */

>>>> +		if (cpu == cpumask_first(cpumask)) {

>>>

>>> Your function still have few assumptions of cpu numbering and it will

>>> break in few cases. What if the CPUs on a big Little system (4x4) are

>>> present in this order: B L L L L B B B  ??

>>>

>>> This configuration can happen if CPUs in DT are marked as: 0-3 LITTLE,

>>> 4-7 big and a big CPU is used by the boot loader to bring up Linux.

>>

>> Ok, how can I sort it out ?

> 

> I would do something like this:

> 

>         cpumask_copy(possible, cpu_possible_mask);

>         

>         while (!cpumask_empty(possible)) {

>                 first = cpumask_first(possible);

>                 cpumask = topology_core_cpumask(first);

>                 cpumask_andnot(possible, possible, cpumask);

>         

>                 allocate_cooling_dev(first); //This is most of this function in your patch.

>         

>                 while (!cpumask_empty(cpumask)) {

>                         temp = cpumask_first(possible);

>                         //rest init "temp"

>                         cpumask_clear_cpu(temp, cpumask);

>                 }

>         

>                 //Everything done, register cooling device for cpumask.

>         }



Mmh, that sounds very complex. Maybe it is simpler to count the number
of clusters, initialize the idle_cdev for each cluster and then go
through this loop with the cluster cpumask.
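
For reference, a rough sketch of the cluster-by-cluster walk being
discussed; the two commented-out helpers are placeholders, not code from
the patch:

    static int cpuidle_cooling_register(void)
    {
            cpumask_var_t possible;
            int cpu, first;

            if (!alloc_cpumask_var(&possible, GFP_KERNEL))
                    return -ENOMEM;

            cpumask_copy(possible, cpu_possible_mask);

            while (!cpumask_empty(possible)) {
                    const struct cpumask *cluster;

                    /*
                     * The first CPU not visited yet defines a new cluster,
                     * whatever the CPU numbering is (e.g. B L L L L B B B).
                     */
                    first = cpumask_first(possible);
                    cluster = topology_core_cpumask(first);

                    /* Per-cluster allocation + cooling device registration. */
                    /* idle_cdev = cpuidle_cooling_alloc(first); */

                    /* Per-CPU init (injection thread, waitqueue, ...). */
                    for_each_cpu(cpu, cluster) {
                            /* cpuidle_cooling_init_cpu(idle_cdev, cpu); */
                    }

                    /* Drop the whole cluster from the set left to visit. */
                    cpumask_andnot(possible, possible, cluster);
            }

            free_cpumask_var(possible);
            return 0;
    }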




Viresh Kumar April 5, 2018, 4:49 a.m. UTC | #9
On 04-04-18, 10:50, Daniel Lezcano wrote:
> Mmh, that sounds very complex. May be it is simpler to count the number

> of cluster and initialize the idle_cdev for each cluster and then go for

> this loop with the cluster cpumask.


Maybe, not sure. I have had such code in the past and it was quite
straightforward to understand :)

You can go with whichever version you like.

-- 
viresh