[V5] powercap/drivers/idle_injection: Add an idle injection framework

Message ID 1528190208-22915-1-git-send-email-daniel.lezcano@linaro.org
State New
Headers show
Series
  • [V5] powercap/drivers/idle_injection: Add an idle injection framework
Related show

Commit Message

Daniel Lezcano June 5, 2018, 9:16 a.m.
Initially, the cpu_cooling device for ARM was changed by adding a new
policy inserting idle cycles. The intel_powerclamp driver does a
similar action.

Instead of implementing idle injections privately in the cpu_cooling
device, move the idle injection code in a dedicated framework and give
the opportunity to other frameworks to make use of it.

The framework relies on the smpboot kthreads which handles via its
main loop the common code for hotplugging and [un]parking.

This code was previously tested with the cpu cooling device and went
through several iterations. It results now in split code and API
exported in the header file. It was tested with the cpu cooling device
with success.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>

Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Eduardo Valentin <edubezval@gmail.com>
Cc: Javi Merino <javi.merino@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Kevin Wangtao <kevin.wangtao@linaro.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rui Zhang <rui.zhang@intel.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
---
V5:
   - Move init_completion in the init function (Viresh Kumar)
   - Increment task count in the wakeup function (Viresh Kumar)
   - Remove rollback at init time (Viresh Kumar)

V4:
   - Wait for completion when stopping (Viresh Kumar)
   - Check init already done and rollback (Viresh Kumar)

V3:
   - Fixed typos (Viresh Kumar)
   - Removed extra blank line (Viresh Kumar)
   - Added full stop (Viresh Kumar)
   - Fixed Return kerneldoc format (Viresh Kumar)
   - Fixed multiple kthreads initialization (Viresh Kumar)
   - Fixed rollbacking the actions in the unregister function (Viresh Kumar)
   - Make idle injection kthreads name hardcoded
   - Kthreads are created in the initcall process

 V2: Fixed checkpatch warnings
---
 drivers/powercap/Kconfig          |  10 ++
 drivers/powercap/Makefile         |   1 +
 drivers/powercap/idle_injection.c | 358 ++++++++++++++++++++++++++++++++++++++
 include/linux/idle_injection.h    |  29 +++
 4 files changed, 398 insertions(+)
 create mode 100644 drivers/powercap/idle_injection.c
 create mode 100644 include/linux/idle_injection.h

-- 
2.7.4

Comments

Daniel Lezcano June 5, 2018, 2:54 p.m. | #1
On 05/06/2018 12:39, Viresh Kumar wrote:
> On 05-06-18, 11:16, Daniel Lezcano wrote:

>> diff --git a/drivers/powercap/idle_injection.c b/drivers/powercap/idle_injection.c

>> +/**

>> + * idle_injection_wakeup - Wake up all idle injection threads

>> + * @ii_dev: the idle injection device

>> + *

>> + * Every idle injection task belonging to the idle injection device

>> + * and running on an online CPU will be wake up by this call.

>> + */

>> +static void idle_injection_wakeup(struct idle_injection_device *ii_dev)

>> +{

>> +	struct idle_injection_thread *iit;

>> +	int cpu;

>> +

>> +	for_each_cpu_and(cpu, ii_dev->cpumask, cpu_online_mask) {

>> +		atomic_inc(&ii_dev->count);

> 

> This may not be sufficient enough in some extremely corner cases, as it is

> possible that the idle_injection_fn() starts running for the first cpu in the

> cpumask right after first iteration of this loop completes. And so

> atomic_dec_and_test() may return true there before idle_injection_fn() task gets

> a chance to run on all cpus in the cpumask. Normally this wouldn't be a big

> problem as the hrtimer can get reprogrammed in that case, but there is a case I

> am worried about. More on this later..

> 

>> +		iit = per_cpu_ptr(&idle_injection_thread, cpu);

>> +		iit->should_run = 1;

>> +		wake_up_process(iit->tsk);

>> +	}

>> +}

>> +

>> +/**

>> + * idle_injection_wakeup_fn - idle injection timer callback

>> + * @timer: a hrtimer structure

>> + *

>> + * This function is called when the idle injection timer expires which

>> + * will wake up the idle injection tasks and these ones, in turn, play

>> + * idle a specified amount of time.

>> + *

>> + * Return: HRTIMER_NORESTART.

>> + */

>> +static enum hrtimer_restart idle_injection_wakeup_fn(struct hrtimer *timer)

>> +{

>> +	struct idle_injection_device *ii_dev =

>> +		container_of(timer, struct idle_injection_device, timer);

>> +

>> +	idle_injection_wakeup(ii_dev);

>> +

>> +	return HRTIMER_NORESTART;

>> +}

>> +

>> +/**

>> + * idle_injection_fn - idle injection routine

>> + * @cpu: the CPU number the tasks belongs to

>> + *

>> + * The idle injection routine will stay idle the specified amount of

>> + * time

>> + */

>> +static void idle_injection_fn(unsigned int cpu)

>> +{

>> +	struct idle_injection_device *ii_dev;

>> +	struct idle_injection_thread *iit;

>> +	int run_duration_ms, idle_duration_ms;

>> +

>> +	ii_dev = per_cpu(idle_injection_device, cpu);

>> +	iit = per_cpu_ptr(&idle_injection_thread, cpu);

>> +

>> +	/*

>> +	 * Boolean used by the smpboot main loop and used as a

>> +	 * flip-flop in this function

>> +	 */

>> +	iit->should_run = 0;

>> +

>> +	idle_duration_ms = atomic_read(&ii_dev->idle_duration_ms);

>> +	if (idle_duration_ms)

>> +		play_idle(idle_duration_ms);

>> +

>> +	/*

>> +	 * The last CPU waking up is in charge of setting the timer. If

>> +	 * the CPU is hotplugged, the timer will move to another CPU

>> +	 * (which may not belong to the same cluster) but that is not a

>> +	 * problem as the timer will be set again by another CPU

>> +	 * belonging to the cluster. This mechanism is self adaptive.

>> +	 */

>> +	if (!atomic_dec_and_test(&ii_dev->count))

>> +		return;

>> +

>> +	run_duration_ms = atomic_read(&ii_dev->run_duration_ms);

>> +	if (run_duration_ms) {

>> +		hrtimer_start(&ii_dev->timer, ms_to_ktime(run_duration_ms),

>> +			      HRTIMER_MODE_REL_PINNED);

>> +		return;

>> +	}

>> +

>> +	complete(&ii_dev->stop_complete);

>> +}

> 

> Here is the corner case I was talking about. Consider the cpumask contains

> CPU0/1/2/3.

> 

> PATH A                                                  PATH B

> 

>                                                         unregister()

>                                                           -> idle_injection_stop()

> 

> idle_injection_wakeup()

>   -> Running loop for CPU0

>   -> atomic inc (count == 1)

>   -> wake_up_process(tsk)

>   -> At this point the current task stops

>   running and idle_injection_fn() starts running

>   (maybe on a different CPU)

> 

>                                                         

> 

> 

> idle_injection_fn()

>   -> atomic_dec_and_test(), returns true

>   as count == 0

> 

> 

>                                                              -> set-duration to 0

>                                                              -> wait for completion

> 

> 

>   -> atomic_read(run_duration), returns 0

>   -> complete()

> 

>                                                              -> kfree(ii_dev);   

> 

> 

> But the initial loop idle_injection_wakeup() has only run for CPU0 until now and

> will continue running for other CPUs and will crash as ii_dev is already freed.

> 

> Am I still making a mistake here ?


I don't think you are doing a mistake. Even if this can happen
theoretically, I don't think practically that is the case.

The play_idle() has 1ms minimum sleep time.

The scenario you are describing means:

1. the loop in idle_injection_wakeup() takes more than 1ms to achieve

2. at the same time, the user of the idle injection unregisters while
the idle injection is acting precisely at CPU0 and exits before another
task was wakeup by the loop in 1. more than 1ms after.

From my POV, this scenario can't happen.

Anyway, we must write rock solid code so may be we can use a refcount to
protect against that, so instead of freeing in unregister, we refput the
ii_dev pointer.


>> +static struct idle_injection_device *ii_dev_alloc(void)

>> +{

>> +	struct idle_injection_device *ii_dev;

>> +

>> +	ii_dev = kzalloc(sizeof(*ii_dev), GFP_KERNEL);

>> +	if (!ii_dev)

>> +		return NULL;

>> +

>> +	if (!alloc_cpumask_var(&ii_dev->cpumask, GFP_KERNEL)) {

>> +		kfree(ii_dev);

>> +		return NULL;

>> +	}

>> +

>> +	return ii_dev;

>> +}

>> +

>> +static void ii_dev_free(struct idle_injection_device *ii_dev)

>> +{

>> +	free_cpumask_var(ii_dev->cpumask);

>> +	kfree(ii_dev);

>> +}

>> +

>> +/**

>> + * idle_injection_register - idle injection init routine

>> + * @cpumask: the list of CPUs managed by the idle injection device

>> + *

>> + * This is the initialization function in charge of creating the

>> + * initializing of the timer and allocate the structures. It does not

>> + * starts the idle injection cycles.

>> + *

>> + * Return: NULL if an allocation fails.

>> + */

>> +struct idle_injection_device *idle_injection_register(struct cpumask *cpumask)

>> +{

>> +	struct idle_injection_device *ii_dev;

>> +	int cpu;

>> +

>> +	ii_dev = ii_dev_alloc();

>> +	if (!ii_dev)

>> +		return NULL;

>> +

>> +	cpumask_copy(ii_dev->cpumask, cpumask);

>> +	hrtimer_init(&ii_dev->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);

>> +	ii_dev->timer.function = idle_injection_wakeup_fn;

>> +	init_completion(&ii_dev->stop_complete);

>> +

>> +	for_each_cpu(cpu, ii_dev->cpumask) {

>> +

>> +		if (per_cpu(idle_injection_device, cpu)) {

>> +			pr_err("cpu%d is already registered\n", cpu);

>> +			goto out_rollback_per_cpu;

>> +		}

>> +

>> +		per_cpu(idle_injection_device, cpu) = ii_dev;

>> +	}

>> +

>> +	return ii_dev;

>> +

>> +out_rollback_per_cpu:

>> +	for_each_cpu(cpu, ii_dev->cpumask)

>> +		per_cpu(idle_injection_device, cpu) = NULL;

> 

> Maybe move above into ii_dev_free(), as I suggested earlier, as that will remove

> same code from unregister routine as well.


I don't like to mix operations when the function name define clearly
what is doing. The ii_dev_free() frees a structure,
idle_injection_device refers to a global variable.

Anyway if we are talking about using a refcount, we can add a release
which frees the structure and sets the per cpu idle_injection_device
global variable to NULL.


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
Viresh Kumar June 6, 2018, 4:27 a.m. | #2
On 05-06-18, 16:54, Daniel Lezcano wrote:
> On 05/06/2018 12:39, Viresh Kumar wrote:

> I don't think you are doing a mistake. Even if this can happen

> theoretically, I don't think practically that is the case.

> 

> The play_idle() has 1ms minimum sleep time.

> 

> The scenario you are describing means:

> 

> 1. the loop in idle_injection_wakeup() takes more than 1ms to achieve


There are many ways in which idle_injection_wakeup() can get called.

- from hrtimer handler, this happens in softirq context, right? So interrupts
  can still block the handler to run ?

- from idle_injection_start(), process context. RT or DL or IRQ activity can
  block the CPU for long durations sometimes.

> 2. at the same time, the user of the idle injection unregisters while

> the idle injection is acting precisely at CPU0 and exits before another

> task was wakeup by the loop in 1. more than 1ms after.

> 

> >From my POV, this scenario can't happen.


Maybe something else needs to be buggy as well to make this crap happen.

> Anyway, we must write rock solid code


That's my point.

> so may be we can use a refcount to

> protect against that, so instead of freeing in unregister, we refput the

> ii_dev pointer.


I think the solution can be a simple change in implementation of
idle_injection_wakeup(), something like this..

+static void idle_injection_wakeup(struct idle_injection_device *ii_dev)
+{
+	struct idle_injection_thread *iit;
+	int cpu;
+
+	for_each_cpu_and(cpu, ii_dev->cpumask, cpu_online_mask)
+		atomic_inc(&ii_dev->count);
+
+       mb(); //I am not sure but I think we need some kind of barrier here ?
+
+	for_each_cpu_and(cpu, ii_dev->cpumask, cpu_online_mask) {
+		iit = per_cpu_ptr(&idle_injection_thread, cpu);
+		iit->should_run = 1;
+		wake_up_process(iit->tsk);
+	}
+}

-- 
viresh
Viresh Kumar June 6, 2018, 10:45 a.m. | #3
On 06-06-18, 12:22, Daniel Lezcano wrote:
> (mb() are done in the atomic operations AFAICT).


AFAIU, it is required to make sure the operations are seen in a particular order
on another CPU and the compiler doesn't reorganize code to optimize it.

For example, in our case what if the compiler reorganizes the atomic-set
operation after wakeup-process ? But maybe that wouldn't happen across function
calls and we should be safe then.

> What about:

> 

> 	get_online_cpus();

> 

> 	nr_tasks = cpumask_weight(

> 		cpumask_and(ii_dev->cpumask, cpu_online_mask);

> 			

> 	atomic_set(&ii_dev->count, nr_tasks);

> 

> 	for_each_cpu_and(cpu, ii_dev->cpumask, cpu_online_mask) {

> 		iit = per_cpu_ptr(&idle_injection_thread, cpu);

> 		iit->should_run = 1;

> 		wake_up_process(iit->tsk);

> 	}

> 

> 	put_online_cpus();

> ?


Looks good this time.

> I'm wondering if we can have a CPU hotplugged right after the

> 'put_online_cpus', resulting in the 'should park' flag set and then the

> thread goes in the kthread_parkme instead of jumping back the idle

> injection function and decrease the count, leading up to the timer not

> being set again.


True. That looks like a valid problem to me as well.

What about starting the hrtimer right from this routine itself, after taking
into account running-time, idle-time, delay, etc ? That would get rid of the
count stuff, this get_online_cpus(), etc.

Even if we restart the next play-idle cycle a bit early or with some delay, sky
wouldn't fall :)

-- 
viresh
Andrea Parri June 6, 2018, 12:05 p.m. | #4
Hi Daniel, Viresh,

On Wed, Jun 06, 2018 at 04:15:28PM +0530, Viresh Kumar wrote:
> On 06-06-18, 12:22, Daniel Lezcano wrote:

> > (mb() are done in the atomic operations AFAICT).


To do my bit, not all atomic ops do/imply memory barriers; e.g.,

  [from Documentation/atomic_t.txt]

  - non-RMW operations [e.g., atomic_set()] are unordered

  - RMW operations that have no return value [e.g., atomic_inc()] are unordered


> 

> AFAIU, it is required to make sure the operations are seen in a particular order

> on another CPU and the compiler doesn't reorganize code to optimize it.

> 

> For example, in our case what if the compiler reorganizes the atomic-set

> operation after wakeup-process ? But maybe that wouldn't happen across function

> calls and we should be safe then.


IIUC, wake_up_process() implies a full memory barrier and a compiler barrier,
due to:

  raw_spin_lock_irqsave(&p->pi_lock, flags);
  smp_mb__after_spinlock();

The pattern under discussion isn't clear to me, but if you'll end up relying
on this "implicit" barrier I'd suggest documenting it with a comment.

  Andrea
Peter Zijlstra June 6, 2018, 12:29 p.m. | #5
On Wed, Jun 06, 2018 at 02:05:39PM +0200, Andrea Parri wrote:
> Hi Daniel, Viresh,

> 

> On Wed, Jun 06, 2018 at 04:15:28PM +0530, Viresh Kumar wrote:

> > On 06-06-18, 12:22, Daniel Lezcano wrote:

> > > (mb() are done in the atomic operations AFAICT).

> 

> To do my bit, not all atomic ops do/imply memory barriers; e.g.,

> 

>   [from Documentation/atomic_t.txt]

> 

>   - non-RMW operations [e.g., atomic_set()] are unordered

> 

>   - RMW operations that have no return value [e.g., atomic_inc()] are unordered


Quite so indeed.

> > AFAIU, it is required to make sure the operations are seen in a particular order

> > on another CPU and the compiler doesn't reorganize code to optimize it.

> > 

> > For example, in our case what if the compiler reorganizes the atomic-set

> > operation after wakeup-process ? But maybe that wouldn't happen across function

> > calls and we should be safe then.

> 

> IIUC, wake_up_process() implies a full memory barrier and a compiler barrier,

> due to:


Yes, the wakeup being a RELEASE (at least) is a fairly fundamental
property for causality. You expect the woken task to observe the
condition it got woken up on.
Daniel Lezcano June 7, 2018, 8:18 a.m. | #6
On 06/06/2018 17:02, Peter Zijlstra wrote:
> On Wed, Jun 06, 2018 at 03:42:08PM +0200, Daniel Lezcano wrote:

>> On 06/06/2018 14:23, Peter Zijlstra wrote:

>>> On Tue, Jun 05, 2018 at 11:16:40AM +0200, Daniel Lezcano wrote:

>>>> +	atomic_t idle_duration_ms;

>>>> +	atomic_t run_duration_ms;

>>>

>>>> +	idle_duration_ms = atomic_read(&ii_dev->idle_duration_ms);

>>>

>>>> +	run_duration_ms = atomic_read(&ii_dev->run_duration_ms);

>>>

>>>> +	atomic_set(&ii_dev->run_duration_ms, run_duration_ms);

>>>> +	atomic_set(&ii_dev->idle_duration_ms, idle_duration_ms);

>>>

>>>> +	*run_duration_ms = atomic_read(&ii_dev->run_duration_ms);

>>>> +	*idle_duration_ms = atomic_read(&ii_dev->idle_duration_ms);

>>>

>>>> +	if (!atomic_read(&ii_dev->idle_duration_ms))

>>>

>>>> +	if (!atomic_read(&ii_dev->run_duration_ms))

>>>

>>> What is the point of atomic_t here ?!

>>

>> idle_duration and run_duration can be changed from different places at

>> the same time. The atomic is here to ensure the read/write are consistent.

>>

>> Do you think it is pointless ?

> 

> Yes, atomic_read() / atomic_set() are no more atomic than READ_ONCE() /

> WRITE_ONCE().


So IIUC, neither atomic or WRITE|READ_ONCE are necessary in this code
because of the wake_up_process() barrier is enough, right ?


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
Viresh Kumar June 7, 2018, 8:42 a.m. | #7
On 07-06-18, 10:32, Peter Zijlstra wrote:
> On Thu, Jun 07, 2018 at 10:18:27AM +0200, Daniel Lezcano wrote:

> > So IIUC, neither atomic or WRITE|READ_ONCE are necessary in this code

> > because of the wake_up_process() barrier is enough, right ?

> 

> I didn't look hard enough; if there ever is a time where the loads and

> stores happen concurrently, you need READ/WRITE_ONCE(). If there is no

> concurrency on the variables, you don't need anything.

> 

> Neither atomic_read/set() nor REAd/WRITE_ONCE() will help with ordering,

> which is what the wake_up_process() would provide here, different things

> entirely.


Right and you still need the READ/WRITE_ONCE() thing as
idle_injection_set_duration() may run in parallel with the idle_injection_fn()
thread.

And I don't think the purpose of atomic_read/write was ever to take care of the
ordering issues in this code, it was always about parallel loads/stores.

-- 
viresh
Daniel Lezcano June 7, 2018, 8:46 a.m. | #8
On 07/06/2018 10:42, Viresh Kumar wrote:
> On 07-06-18, 10:32, Peter Zijlstra wrote:

>> On Thu, Jun 07, 2018 at 10:18:27AM +0200, Daniel Lezcano wrote:

>>> So IIUC, neither atomic or WRITE|READ_ONCE are necessary in this code

>>> because of the wake_up_process() barrier is enough, right ?

>>

>> I didn't look hard enough; if there ever is a time where the loads and

>> stores happen concurrently, you need READ/WRITE_ONCE(). If there is no

>> concurrency on the variables, you don't need anything.

>>

>> Neither atomic_read/set() nor REAd/WRITE_ONCE() will help with ordering,

>> which is what the wake_up_process() would provide here, different things

>> entirely.

> 

> Right and you still need the READ/WRITE_ONCE() thing as

> idle_injection_set_duration() may run in parallel with the idle_injection_fn()

> thread.

> 

> And I don't think the purpose of atomic_read/write was ever to take care of the

> ordering issues in this code, it was always about parallel loads/stores.


Yes, correct.

But if we don't care about who wins to store to value, is there a risk
of scramble variable if we just assign a value ?


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
Viresh Kumar June 7, 2018, 8:49 a.m. | #9
On 07-06-18, 10:46, Daniel Lezcano wrote:
> Yes, correct.

> 

> But if we don't care about who wins to store to value, is there a risk

> of scramble variable if we just assign a value ?


Normally no, as the compiler wouldn't screw it up badly. But there is no rule
which stops the compiler from doing this:

idle_duration_ms = 5;
idle_duration_ms = -5;
idle_duration_ms = 0;
idle_duration_ms = <real-value-we-want-to-write>;

So we *must* use READ/WRITE_ONCE() to make sure garbage values aren't seen by
readers.

-- 
viresh
Peter Zijlstra June 7, 2018, 9:09 a.m. | #10
On Thu, Jun 07, 2018 at 02:19:21PM +0530, Viresh Kumar wrote:
> On 07-06-18, 10:46, Daniel Lezcano wrote:

> > Yes, correct.

> > 

> > But if we don't care about who wins to store to value, is there a risk

> > of scramble variable if we just assign a value ?

> 

> Normally no, as the compiler wouldn't screw it up badly. But there is no rule

> which stops the compiler from doing this:

> 

> idle_duration_ms = 5;

> idle_duration_ms = -5;

> idle_duration_ms = 0;

> idle_duration_ms = <real-value-we-want-to-write>;

> 

> So we *must* use READ/WRITE_ONCE() to make sure garbage values aren't seen by

> readers.


That too, however it is far worse..

The compiler is allowed to do store/load-tearing. Basically it can emit
individual byte store/loads in any random order.

So:
	foo = bar = 0;

	P0			P1

	foo = 0x12345678;	bar = foo;

Could result in: bar == 0x12005600 or any other random combination.

Now, it generally doesn't do this, because it is really retarded to
generate code like that. But we've seen cases where it managed to do
really weird things (think constructing 64bit literals with two or more
smaller stores, which total smaller code).

The volatile in READ/WRITE_ONCE() disallows this and ensures the
variables are read / written in a single go (assuming naturally aligned
native word sizes).
Peter Zijlstra June 7, 2018, 9:39 a.m. | #11
On Thu, Jun 07, 2018 at 11:32:01AM +0200, Peter Zijlstra wrote:
> On Thu, Jun 07, 2018 at 11:09:13AM +0200, Daniel Lezcano wrote:

> > On 07/06/2018 10:49, Viresh Kumar wrote:

> > > On 07-06-18, 10:46, Daniel Lezcano wrote:

> > >> Yes, correct.

> > >>

> > >> But if we don't care about who wins to store to value, is there a risk

> > >> of scramble variable if we just assign a value ?

> > > 

> > > Normally no, as the compiler wouldn't screw it up badly. But there is no rule

> > > which stops the compiler from doing this:

> > > 

> > > idle_duration_ms = 5;

> > > idle_duration_ms = -5;

> > > idle_duration_ms = 0;

> > > idle_duration_ms = <real-value-we-want-to-write>;

> > > 

> > > So we *must* use READ/WRITE_ONCE() to make sure garbage values aren't seen by

> > > readers.

> > 

> > Ok understood. Why would a compiler do this kind of things ?

> 

> I think the above can happen when the compiler uses the variable as a

> scratch pad -- very rare I would say.

> 

> In general a compiler needs to proof that doing this makes no observable

> difference ("as-if" rule). And since it is a regular variable it can

> assume data-race-free and do the above (or something like that). Because

> if there is a data-race it is UB and it can still do whatever it

> pleases.

> 

> And here I think the point is that regular variables are considered only

> in the context of a single linear execution context. Locks are assumed

> to bound observability.

> 

> And here the "volatile" and "_atomic" type specifiers again tell the

> compiler something 'special' is going on and you should not muck with

> things.


Also, I think, more likely:

	if (cond)
		X = 5;
	else
		X = 4;

is allowed to be transformed into:

	X = 4;
	if (cond)
		X = 5;

as long as cond doesn't involve a sequence point of sorts (think
function call).

For the single execution context case, this transformation is valid, but
it is not in the threaded case. But then we go back to the assumption
that regular variables are data-race-free.
Daniel Lezcano June 7, 2018, 12:31 p.m. | #12
On 07/06/2018 11:39, Peter Zijlstra wrote:
> On Thu, Jun 07, 2018 at 11:32:01AM +0200, Peter Zijlstra wrote:

>> On Thu, Jun 07, 2018 at 11:09:13AM +0200, Daniel Lezcano wrote:

>>> On 07/06/2018 10:49, Viresh Kumar wrote:

>>>> On 07-06-18, 10:46, Daniel Lezcano wrote:

>>>>> Yes, correct.

>>>>>

>>>>> But if we don't care about who wins to store to value, is there a risk

>>>>> of scramble variable if we just assign a value ?

>>>>

>>>> Normally no, as the compiler wouldn't screw it up badly. But there is no rule

>>>> which stops the compiler from doing this:

>>>>

>>>> idle_duration_ms = 5;

>>>> idle_duration_ms = -5;

>>>> idle_duration_ms = 0;

>>>> idle_duration_ms = <real-value-we-want-to-write>;

>>>>

>>>> So we *must* use READ/WRITE_ONCE() to make sure garbage values aren't seen by

>>>> readers.

>>>

>>> Ok understood. Why would a compiler do this kind of things ?

>>

>> I think the above can happen when the compiler uses the variable as a

>> scratch pad -- very rare I would say.

>>

>> In general a compiler needs to proof that doing this makes no observable

>> difference ("as-if" rule). And since it is a regular variable it can

>> assume data-race-free and do the above (or something like that). Because

>> if there is a data-race it is UB and it can still do whatever it

>> pleases.

>>

>> And here I think the point is that regular variables are considered only

>> in the context of a single linear execution context. Locks are assumed

>> to bound observability.

>>

>> And here the "volatile" and "_atomic" type specifiers again tell the

>> compiler something 'special' is going on and you should not muck with

>> things.

> 

> Also, I think, more likely:

> 

> 	if (cond)

> 		X = 5;

> 	else

> 		X = 4;

> 

> is allowed to be transformed into:

> 

> 	X = 4;

> 	if (cond)

> 		X = 5;

> 

> as long as cond doesn't involve a sequence point of sorts (think

> function call).

> 

> For the single execution context case, this transformation is valid, but

> it is not in the threaded case. But then we go back to the assumption

> that regular variables are data-race-free.


Thank you very much for the explanations.

  -- Daniel


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
Daniel Lezcano June 7, 2018, 2:11 p.m. | #13
On 06/06/2018 12:45, Viresh Kumar wrote:
> On 06-06-18, 12:22, Daniel Lezcano wrote:

>> (mb() are done in the atomic operations AFAICT).

> 

> AFAIU, it is required to make sure the operations are seen in a particular order

> on another CPU and the compiler doesn't reorganize code to optimize it.

> 

> For example, in our case what if the compiler reorganizes the atomic-set

> operation after wakeup-process ? But maybe that wouldn't happen across function

> calls and we should be safe then.

> 

>> What about:

>>

>> 	get_online_cpus();

>>

>> 	nr_tasks = cpumask_weight(

>> 		cpumask_and(ii_dev->cpumask, cpu_online_mask);

>> 			

>> 	atomic_set(&ii_dev->count, nr_tasks);

>>

>> 	for_each_cpu_and(cpu, ii_dev->cpumask, cpu_online_mask) {

>> 		iit = per_cpu_ptr(&idle_injection_thread, cpu);

>> 		iit->should_run = 1;

>> 		wake_up_process(iit->tsk);

>> 	}

>>

>> 	put_online_cpus();

>> ?

> 

> Looks good this time.

> 

>> I'm wondering if we can have a CPU hotplugged right after the

>> 'put_online_cpus', resulting in the 'should park' flag set and then the

>> thread goes in the kthread_parkme instead of jumping back the idle

>> injection function and decrease the count, leading up to the timer not

>> being set again.

> 

> True. That looks like a valid problem to me as well.

> 

> What about starting the hrtimer right from this routine itself, after taking

> into account running-time, idle-time, delay, etc ? That would get rid of the

> count stuff, this get_online_cpus(), etc.

> 

> Even if we restart the next play-idle cycle a bit early or with some delay, sky

> wouldn't fall :)


We won't be able to call completion() in this case.



-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
Viresh Kumar June 8, 2018, 4:48 a.m. | #14
On 07-06-18, 16:11, Daniel Lezcano wrote:
> >> I'm wondering if we can have a CPU hotplugged right after the

> >> 'put_online_cpus', resulting in the 'should park' flag set and then the

> >> thread goes in the kthread_parkme instead of jumping back the idle

> >> injection function and decrease the count, leading up to the timer not

> >> being set again.

> > 

> > True. That looks like a valid problem to me as well.

> > 

> > What about starting the hrtimer right from this routine itself, after taking

> > into account running-time, idle-time, delay, etc ? That would get rid of the

> > count stuff, this get_online_cpus(), etc.

> > 

> > Even if we restart the next play-idle cycle a bit early or with some delay, sky

> > wouldn't fall :)

> 

> We won't be able to call completion() in this case.


I was just having a look at smpboot.h and saw the park/unpark callbacks. I think
you can very much use them to align things with CPU hotplug.

-- 
viresh

Patch

diff --git a/drivers/powercap/Kconfig b/drivers/powercap/Kconfig
index 85727ef..a767ef2 100644
--- a/drivers/powercap/Kconfig
+++ b/drivers/powercap/Kconfig
@@ -29,4 +29,14 @@  config INTEL_RAPL
 	  controller, CPU core (Power Plance 0), graphics uncore (Power Plane
 	  1), etc.
 
+config IDLE_INJECTION
+	bool "Idle injection framework"
+	depends on CPU_IDLE
+	default n
+	help
+	  This enables support for the idle injection framework. It
+	  provides a way to force idle periods on a set of specified
+	  CPUs for power capping. Idle period can be injected
+	  synchronously on a set of specified CPUs or alternatively
+	  on a per CPU basis.
 endif
diff --git a/drivers/powercap/Makefile b/drivers/powercap/Makefile
index 0a21ef3..c3bbfee 100644
--- a/drivers/powercap/Makefile
+++ b/drivers/powercap/Makefile
@@ -1,2 +1,3 @@ 
 obj-$(CONFIG_POWERCAP)	+= powercap_sys.o
 obj-$(CONFIG_INTEL_RAPL) += intel_rapl.o
+obj-$(CONFIG_IDLE_INJECTION) += idle_injection.o
diff --git a/drivers/powercap/idle_injection.c b/drivers/powercap/idle_injection.c
new file mode 100644
index 0000000..37c7b61
--- /dev/null
+++ b/drivers/powercap/idle_injection.c
@@ -0,0 +1,358 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2018 Linaro Limited
+ *
+ * Author: Daniel Lezcano <daniel.lezcano@linaro.org>
+ *
+ * The idle injection framework proposes a way to force a cpu to enter
+ * an idle state during a specified amount of time for a specified
+ * period.
+ *
+ * It relies on the smpboot kthreads which handles, via its main loop,
+ * the common code for hotplugging and [un]parking.
+ *
+ * At init time, all the kthreads are created and parked.
+ *
+ * A cpumask is specified as parameter for the idle injection
+ * registering function. The kthreads will be synchronized regarding
+ * this cpumask.
+ *
+ * The idle + run duration is specified via the helpers and then the
+ * idle injection can be started at this point.
+ *
+ * A kthread will call play_idle() with the specified idle duration
+ * from above and then will schedule itself. The latest CPU belonging
+ * to the group is in charge of setting the timer for the next idle
+ * injection deadline.
+ *
+ * The task handling the timer interrupt will wakeup all the kthreads
+ * belonging to the cpumask.
+ */
+#define pr_fmt(fmt) "ii_dev: " fmt
+
+#include <linux/cpu.h>
+#include <linux/freezer.h>
+#include <linux/hrtimer.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/smpboot.h>
+
+#include <uapi/linux/sched/types.h>
+
+/**
+ * struct idle_injection_thread - task on/off switch structure
+ * @tsk: a pointer to a task_struct injecting the idle cycles
+ * @should_run: a integer used as a boolean by the smpboot kthread API
+ */
+struct idle_injection_thread {
+	struct task_struct *tsk;
+	int should_run;
+};
+
+/**
+ * struct idle_injection_device - data for the idle injection
+ * @cpumask: a cpumask containing the list of CPUs managed by the device
+ * @timer: a hrtimer giving the tempo for the idle injection
+ * @count: an atomic to keep track of the last task exiting the idle cycle
+ * @idle_duration_ms: an atomic specifying the idle duration
+ * @run_duration_ms: an atomic specifying the running duration
+ */
+struct idle_injection_device {
+	cpumask_var_t cpumask;
+	struct hrtimer timer;
+	struct completion stop_complete;
+	atomic_t count;
+	atomic_t idle_duration_ms;
+	atomic_t run_duration_ms;
+};
+
+static DEFINE_PER_CPU(struct idle_injection_thread, idle_injection_thread);
+static DEFINE_PER_CPU(struct idle_injection_device *, idle_injection_device);
+
+/**
+ * idle_injection_wakeup - Wake up all idle injection threads
+ * @ii_dev: the idle injection device
+ *
+ * Every idle injection task belonging to the idle injection device
+ * and running on an online CPU will be wake up by this call.
+ */
+static void idle_injection_wakeup(struct idle_injection_device *ii_dev)
+{
+	struct idle_injection_thread *iit;
+	int cpu;
+
+	for_each_cpu_and(cpu, ii_dev->cpumask, cpu_online_mask) {
+		atomic_inc(&ii_dev->count);
+		iit = per_cpu_ptr(&idle_injection_thread, cpu);
+		iit->should_run = 1;
+		wake_up_process(iit->tsk);
+	}
+}
+
+/**
+ * idle_injection_wakeup_fn - idle injection timer callback
+ * @timer: a hrtimer structure
+ *
+ * This function is called when the idle injection timer expires which
+ * will wake up the idle injection tasks and these ones, in turn, play
+ * idle a specified amount of time.
+ *
+ * Return: HRTIMER_NORESTART.
+ */
+static enum hrtimer_restart idle_injection_wakeup_fn(struct hrtimer *timer)
+{
+	struct idle_injection_device *ii_dev =
+		container_of(timer, struct idle_injection_device, timer);
+
+	idle_injection_wakeup(ii_dev);
+
+	return HRTIMER_NORESTART;
+}
+
+/**
+ * idle_injection_fn - idle injection routine
+ * @cpu: the CPU number the tasks belongs to
+ *
+ * The idle injection routine will stay idle the specified amount of
+ * time
+ */
+static void idle_injection_fn(unsigned int cpu)
+{
+	struct idle_injection_device *ii_dev;
+	struct idle_injection_thread *iit;
+	int run_duration_ms, idle_duration_ms;
+
+	ii_dev = per_cpu(idle_injection_device, cpu);
+	iit = per_cpu_ptr(&idle_injection_thread, cpu);
+
+	/*
+	 * Boolean used by the smpboot main loop and used as a
+	 * flip-flop in this function
+	 */
+	iit->should_run = 0;
+
+	idle_duration_ms = atomic_read(&ii_dev->idle_duration_ms);
+	if (idle_duration_ms)
+		play_idle(idle_duration_ms);
+
+	/*
+	 * The last CPU waking up is in charge of setting the timer. If
+	 * the CPU is hotplugged, the timer will move to another CPU
+	 * (which may not belong to the same cluster) but that is not a
+	 * problem as the timer will be set again by another CPU
+	 * belonging to the cluster. This mechanism is self adaptive.
+	 */
+	if (!atomic_dec_and_test(&ii_dev->count))
+		return;
+
+	run_duration_ms = atomic_read(&ii_dev->run_duration_ms);
+	if (run_duration_ms) {
+		hrtimer_start(&ii_dev->timer, ms_to_ktime(run_duration_ms),
+			      HRTIMER_MODE_REL_PINNED);
+		return;
+	}
+
+	complete(&ii_dev->stop_complete);
+}
+
+/**
+ * idle_injection_set_duration - idle and run duration helper
+ * @run_duration_ms: an unsigned int giving the running time in milliseconds
+ * @idle_duration_ms: an unsigned int giving the idle time in milliseconds
+ */
+void idle_injection_set_duration(struct idle_injection_device *ii_dev,
+				 unsigned int run_duration_ms,
+				 unsigned int idle_duration_ms)
+{
+	atomic_set(&ii_dev->run_duration_ms, run_duration_ms);
+	atomic_set(&ii_dev->idle_duration_ms, idle_duration_ms);
+}
+
+/**
+ * idle_injection_get_duration - idle and run duration helper
+ * @run_duration_ms: a pointer to an unsigned int to store the running time
+ * @idle_duration_ms: a pointer to an unsigned int to store the idle time
+ */
+void idle_injection_get_duration(struct idle_injection_device *ii_dev,
+				 unsigned int *run_duration_ms,
+				 unsigned int *idle_duration_ms)
+{
+	*run_duration_ms = atomic_read(&ii_dev->run_duration_ms);
+	*idle_duration_ms = atomic_read(&ii_dev->idle_duration_ms);
+}
+
+/**
+ * idle_injection_start - starts the idle injections
+ * @ii_dev: a pointer to an idle_injection_device structure
+ *
+ * The function starts the idle injection cycles by first waking up
+ * all the tasks the ii_dev is attached to and let them handle the
+ * idle-run periods.
+ *
+ * Return: -EINVAL if the idle or the running durations are not set.
+ */
+int idle_injection_start(struct idle_injection_device *ii_dev)
+{
+	if (!atomic_read(&ii_dev->idle_duration_ms))
+		return -EINVAL;
+
+	if (!atomic_read(&ii_dev->run_duration_ms))
+		return -EINVAL;
+
+	pr_debug("Starting injecting idle cycles on CPUs '%*pbl'\n",
+		 cpumask_pr_args(ii_dev->cpumask));
+
+	idle_injection_wakeup(ii_dev);
+
+	return 0;
+}
+
+/**
+ * idle_injection_stop - stops the idle injections
+ * @ii_dev: a pointer to an idle injection_device structure
+ *
+ * The function stops the idle injection by resetting the idle and
+ * running durations and wait for the threads to complete. If we are
+ * in the process of injecting an idle cycle, then this will wait the
+ * end of the cycle.
+ */
+void idle_injection_stop(struct idle_injection_device *ii_dev)
+{
+	pr_debug("Stopping injecting idle cycles on CPUs '%*pbl'\n",
+		 cpumask_pr_args(ii_dev->cpumask));
+
+	idle_injection_set_duration(ii_dev, 0, 0);
+
+	wait_for_completion_interruptible(&ii_dev->stop_complete);
+}
+
+/**
+ * idle_injection_setup - initialize the current task as a RT task
+ * @cpu: the CPU number where the kthread is running on (not used)
+ *
+ * Called one time, this function is in charge of setting the task
+ * scheduler parameters.
+ */
+static void idle_injection_setup(unsigned int cpu)
+{
+	struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO / 2 };
+
+	set_freezable();
+
+	sched_setscheduler(current, SCHED_FIFO, &param);
+}
+
+/**
+ * idle_injection_should_run - function helper for the smpboot API
+ * @cpu: the CPU number where the kthread is running on
+ *
+ * Return: a boolean telling if the thread can run.
+ */
+static int idle_injection_should_run(unsigned int cpu)
+{
+	struct idle_injection_thread *iit =
+		per_cpu_ptr(&idle_injection_thread, cpu);
+
+	return iit->should_run;
+}
+
+static struct idle_injection_device *ii_dev_alloc(void)
+{
+	struct idle_injection_device *ii_dev;
+
+	ii_dev = kzalloc(sizeof(*ii_dev), GFP_KERNEL);
+	if (!ii_dev)
+		return NULL;
+
+	if (!alloc_cpumask_var(&ii_dev->cpumask, GFP_KERNEL)) {
+		kfree(ii_dev);
+		return NULL;
+	}
+
+	return ii_dev;
+}
+
+static void ii_dev_free(struct idle_injection_device *ii_dev)
+{
+	free_cpumask_var(ii_dev->cpumask);
+	kfree(ii_dev);
+}
+
+/**
+ * idle_injection_register - idle injection init routine
+ * @cpumask: the list of CPUs managed by the idle injection device
+ *
+ * This is the initialization function in charge of creating the
+ * initializing of the timer and allocate the structures. It does not
+ * starts the idle injection cycles.
+ *
+ * Return: NULL if an allocation fails.
+ */
+struct idle_injection_device *idle_injection_register(struct cpumask *cpumask)
+{
+	struct idle_injection_device *ii_dev;
+	int cpu;
+
+	ii_dev = ii_dev_alloc();
+	if (!ii_dev)
+		return NULL;
+
+	cpumask_copy(ii_dev->cpumask, cpumask);
+	hrtimer_init(&ii_dev->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	ii_dev->timer.function = idle_injection_wakeup_fn;
+	init_completion(&ii_dev->stop_complete);
+
+	for_each_cpu(cpu, ii_dev->cpumask) {
+
+		if (per_cpu(idle_injection_device, cpu)) {
+			pr_err("cpu%d is already registered\n", cpu);
+			goto out_rollback_per_cpu;
+		}
+
+		per_cpu(idle_injection_device, cpu) = ii_dev;
+	}
+
+	return ii_dev;
+
+out_rollback_per_cpu:
+	for_each_cpu(cpu, ii_dev->cpumask)
+		per_cpu(idle_injection_device, cpu) = NULL;
+
+	ii_dev_free(ii_dev);
+
+	return NULL;
+}
+
+/**
+ * idle_injection_unregister - Unregister the idle injection device
+ * @ii_dev: a pointer to an idle injection device
+ *
+ * The function is in charge of stopping the idle injections,
+ * unregister the kthreads and free the allocated memory in the
+ * register function.
+ */
+void idle_injection_unregister(struct idle_injection_device *ii_dev)
+{
+	int cpu;
+
+	idle_injection_stop(ii_dev);
+
+	for_each_cpu(cpu, ii_dev->cpumask)
+		per_cpu(idle_injection_device, cpu) = NULL;
+
+	ii_dev_free(ii_dev);
+}
+
+static struct smp_hotplug_thread idle_injection_threads = {
+	.store = &idle_injection_thread.tsk,
+	.setup = idle_injection_setup,
+	.thread_fn = idle_injection_fn,
+	.thread_comm = "idle_inject/%u",
+	.thread_should_run = idle_injection_should_run,
+};
+
+static int __init idle_injection_init(void)
+{
+	return smpboot_register_percpu_thread(&idle_injection_threads);
+}
+early_initcall(idle_injection_init);
diff --git a/include/linux/idle_injection.h b/include/linux/idle_injection.h
new file mode 100644
index 0000000..ffd20d7
--- /dev/null
+++ b/include/linux/idle_injection.h
@@ -0,0 +1,29 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2018 Linaro Ltd
+ *
+ * Author: Daniel Lezcano <daniel.lezcano@linaro.org>
+ *
+ */
+#ifndef __IDLE_INJECTION_H__
+#define __IDLE_INJECTION_H__
+
+/* private idle injection device structure */
+struct idle_injection_device;
+
+struct idle_injection_device *idle_injection_register(struct cpumask *cpumask);
+
+void idle_injection_unregister(struct idle_injection_device *ii_dev);
+
+int idle_injection_start(struct idle_injection_device *ii_dev);
+
+void idle_injection_stop(struct idle_injection_device *ii_dev);
+
+void idle_injection_set_duration(struct idle_injection_device *ii_dev,
+				 unsigned int run_duration_ms,
+				 unsigned int idle_duration_ms);
+
+void idle_injection_get_duration(struct idle_injection_device *ii_dev,
+				 unsigned int *run_duration_ms,
+				 unsigned int *idle_duration_ms);
+#endif /* __IDLE_INJECTION_H__ */