[v2,1/2] sched: let the scheduler see CPU idle states

Message ID 1409844730-12273-2-git-send-email-nicolas.pitre@linaro.org
State New

Commit Message

Nicolas Pitre Sept. 4, 2014, 3:32 p.m.
From: Daniel Lezcano <daniel.lezcano@linaro.org>

When the cpu enters idle, it stores the cpuidle state pointer in its
struct rq instance which in turn could be used to make a better decision
when balancing tasks.

As soon as the cpu exits its idle state, the struct rq reference is
cleared.

There are a couple of situations where the idle state pointer could be changed
while it is being consulted:

1. For x86/acpi with dynamic c-states, when a laptop switches from battery
   to AC that could result in removing the deeper idle state. The acpi driver
   triggers:
	'acpi_processor_cst_has_changed'
		'cpuidle_pause_and_lock'
			'cpuidle_uninstall_idle_handler'
				'kick_all_cpus_sync'.

All cpus will exit their idle state and the pointed object will be set to
NULL.

2. The cpuidle driver is unloaded. Logically that could happen but not
in practice because the drivers are always compiled in and 95% of them are
not coded to unregister themselves.  In any case, the unloading code must
call 'cpuidle_unregister_device', that calls 'cpuidle_pause_and_lock'
leading to 'kick_all_cpus_sync' as mentioned above.

A race can happen if we use the pointer and then one of these two scenarios
occurs at the same moment.

In order to be safe, the idle state pointer stored in the rq must be
used inside a rcu_read_lock section where we are protected with the
'rcu_barrier' in the 'cpuidle_uninstall_idle_handler' function. The
idle_get_state() and idle_put_state() accessors should be used to that
effect.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
 drivers/cpuidle/cpuidle.c |  6 ++++++
 kernel/sched/idle.c       |  6 ++++++
 kernel/sched/sched.h      | 39 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 51 insertions(+)
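
Reader-side use of the accessors described in the commit message could look
roughly like the user-space sketch below. It is an illustration only, not code
from this series: the RCU primitives are stubbed as a nesting counter,
struct rq and struct cpuidle_state are cut down to the single field used, and
shallowest_idle_cpu() is a hypothetical helper. The put accessor is spelled
idle_put_state() as in the commit message; the diff names it
cpuidle_put_state().

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-ins for the kernel types; only the fields used here. */
struct cpuidle_state {
	unsigned int exit_latency;	/* wakeup cost in microseconds */
};

struct rq {
	struct cpuidle_state *idle_state;	/* NULL while the CPU is busy */
};

/* Stubbed RCU read-side markers: they only track nesting depth. */
static int rcu_depth;
static void rcu_read_lock(void)   { rcu_depth++; }
static void rcu_read_unlock(void) { rcu_depth--; }

static void idle_set_state(struct rq *rq, struct cpuidle_state *idle_state)
{
	rq->idle_state = idle_state;
}

/* Opens the read-side section; must be paired with idle_put_state(). */
static struct cpuidle_state *idle_get_state(struct rq *rq)
{
	rcu_read_lock();
	return rq->idle_state;
}

static void idle_put_state(struct rq *rq)
{
	(void)rq;
	rcu_read_unlock();
}

/*
 * Hypothetical helper: index of the idle CPU that is cheapest to wake,
 * or -1 if no CPU in rqs[] is idle.
 */
static int shallowest_idle_cpu(struct rq *rqs, int nr)
{
	unsigned int best_latency = 0;
	int best = -1;

	for (int i = 0; i < nr; i++) {
		struct cpuidle_state *s = idle_get_state(&rqs[i]);

		/* Dereference the pointer only between get and put. */
		if (s && (best < 0 || s->exit_latency < best_latency)) {
			best_latency = s->exit_latency;
			best = i;
		}
		idle_put_state(&rqs[i]);
	}
	return best;
}
```

The point of the pairing is that the returned pointer is dereferenced only
between the get and the put, and that every get has exactly one put.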

Comments

Paul E. McKenney Sept. 18, 2014, 5:37 p.m. | #1
On Thu, Sep 04, 2014 at 11:32:09AM -0400, Nicolas Pitre wrote:
> From: Daniel Lezcano <daniel.lezcano@linaro.org>
> 
> When the cpu enters idle, it stores the cpuidle state pointer in its
> struct rq instance which in turn could be used to make a better decision
> when balancing tasks.
> 
> As soon as the cpu exits its idle state, the struct rq reference is
> cleared.
> 
> There are a couple of situations where the idle state pointer could be changed
> while it is being consulted:
> 
> 1. For x86/acpi with dynamic c-states, when a laptop switches from battery
>    to AC that could result in removing the deeper idle state. The acpi driver
>    triggers:
> 	'acpi_processor_cst_has_changed'
> 		'cpuidle_pause_and_lock'
> 			'cpuidle_uninstall_idle_handler'
> 				'kick_all_cpus_sync'.
> 
> All cpus will exit their idle state and the pointed object will be set to
> NULL.
> 
> 2. The cpuidle driver is unloaded. Logically that could happen but not
> in practice because the drivers are always compiled in and 95% of them are
> not coded to unregister themselves.  In any case, the unloading code must
> call 'cpuidle_unregister_device', that calls 'cpuidle_pause_and_lock'
> leading to 'kick_all_cpus_sync' as mentioned above.
> 
> A race can happen if we use the pointer and then one of these two scenarios
> occurs at the same moment.
> 
> In order to be safe, the idle state pointer stored in the rq must be
> used inside a rcu_read_lock section where we are protected with the
> 'rcu_barrier' in the 'cpuidle_uninstall_idle_handler' function. The
> idle_get_state() and idle_put_state() accessors should be used to that
> effect.
> 
> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> ---
>  drivers/cpuidle/cpuidle.c |  6 ++++++
>  kernel/sched/idle.c       |  6 ++++++
>  kernel/sched/sched.h      | 39 +++++++++++++++++++++++++++++++++++++++
>  3 files changed, 51 insertions(+)
> 
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index ee9df5e3f5..530e3055a2 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -225,6 +225,12 @@ void cpuidle_uninstall_idle_handler(void)
>  		initialized = 0;
>  		kick_all_cpus_sync();
>  	}
> +
> +	/*
> +	 * Make sure external observers (such as the scheduler)
> +	 * are done looking at pointed idle states.
> +	 */
> +	rcu_barrier();

Actually, all rcu_barrier() does is to make sure that all previously
queued RCU callbacks have been invoked.  And given the current
implementation, if there are no callbacks queued anywhere in the system,
rcu_barrier() is an extended no-op.  "Has CPU 0 any callbacks?" "Nope!"
"Has CPU 1 any callbacks?"  "Nope!" ... "Has CPU nr_cpu_ids-1 any
callbacks?"  "Nope!"  "OK, done!"

This is all done with the current task looking at per-CPU data structures,
with no interaction with the scheduler and with no need to actually make
those other CPUs do anything.
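
The behaviour described above can be illustrated with a toy user-space model.
It is only a sketch under simplifying assumptions: per-CPU callback queues and
read-side sections are plain arrays, and both function names are hypothetical
stand-ins rather than kernel APIs. The second function stands for a generic
grace-period wait and is included purely for contrast.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

static int pending_callbacks[NR_CPUS];	/* RCU callbacks queued per CPU */
static bool reader_active[NR_CPUS];	/* CPU is inside a read-side section */

/*
 * Toy rcu_barrier(): it has something to wait for only if some CPU has
 * callbacks queued. It never looks at reader_active[].
 */
static bool model_rcu_barrier_must_wait(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (pending_callbacks[cpu] > 0)
			return true;
	return false;
}

/*
 * Toy grace-period wait: it must wait for as long as any reader is
 * still inside its read-side section.
 */
static bool model_grace_period_must_wait(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (reader_active[cpu])
			return true;
	return false;
}
```

With a reader active on one CPU and no callbacks queued anywhere, the first
function reports nothing to wait for while the second does, which is the gap
being pointed out here.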

So what is it that you really need to do here?

A synchronize_sched() will wait for all non-idle online CPUs to pass
through the scheduler, where "idle" includes usermode execution in
CONFIG_NO_HZ_FULL=y kernels.  But it won't wait for CPUs executing
in the idle loop.

A synchronize_rcu_tasks() will wait for all non-idle tasks that are
currently on a runqueue to do a voluntary context switch.  There has
been some discussion about extending this to idle tasks, but the current
prospective users can live without this.  But if you need it, I can push
on getting it set up.  (Current plans are that synchronize_rcu_tasks()
goes into the v3.18 merge window.)  And one caveat: There is long
latency associated with synchronize_rcu_tasks() by design.  Grace
periods are measured in seconds.

A stop_cpus() will force a context switch on all CPUs, though it is
a rather big hammer.

So again, what do you really need?

							Thanx, Paul

>  }
> 
>  /**
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 11e7bc434f..c47fce75e6 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -147,6 +147,9 @@ use_default:
>  	    clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu))
>  		goto use_default;
> 
> +	/* Take note of the planned idle state. */
> +	idle_set_state(this_rq(), &drv->states[next_state]);
> +
>  	/*
>  	 * Enter the idle state previously returned by the governor decision.
>  	 * This function will block until an interrupt occurs and will take
> @@ -154,6 +157,9 @@ use_default:
>  	 */
>  	entered_state = cpuidle_enter(drv, dev, next_state);
> 
> +	/* The cpu is no longer idle or about to enter idle. */
> +	idle_set_state(this_rq(), NULL);
> +
>  	if (broadcast)
>  		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);
> 
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 579712f4e9..aea8baa7a5 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -14,6 +14,7 @@
>  #include "cpuacct.h"
> 
>  struct rq;
> +struct cpuidle_state;
> 
>  extern __read_mostly int scheduler_running;
> 
> @@ -636,6 +637,11 @@ struct rq {
>  #ifdef CONFIG_SMP
>  	struct llist_head wake_list;
>  #endif
> +
> +#ifdef CONFIG_CPU_IDLE
> +	/* Must be inspected within a rcu lock section */
> +	struct cpuidle_state *idle_state;
> +#endif
>  };
> 
>  static inline int cpu_of(struct rq *rq)
> @@ -1180,6 +1186,39 @@ static inline void idle_exit_fair(struct rq *rq) { }
> 
>  #endif
> 
> +#ifdef CONFIG_CPU_IDLE
> +static inline void idle_set_state(struct rq *rq,
> +				  struct cpuidle_state *idle_state)
> +{
> +	rq->idle_state = idle_state;
> +}
> +
> +static inline struct cpuidle_state *idle_get_state(struct rq *rq)
> +{
> +	rcu_read_lock();
> +	return rq->idle_state;
> +}
> +
> +static inline void cpuidle_put_state(struct rq *rq)
> +{
> +	rcu_read_unlock();
> +}
> +#else
> +static inline void idle_set_state(struct rq *rq,
> +				  struct cpuidle_state *idle_state)
> +{
> +}
> +
> +static inline struct cpuidle_state *idle_get_state(struct rq *rq)
> +{
> +	return NULL;
> +}
> +
> +static inline void cpuidle_put_state(struct rq *rq)
> +{
> +}
> +#endif
> +
>  extern void sysrq_sched_debug_show(void);
>  extern void sched_init_granularity(void);
>  extern void update_max_interval(void);
> -- 
> 1.8.4.108.g55ea5f6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul E. McKenney Sept. 18, 2014, 5:39 p.m. | #2
On Thu, Sep 18, 2014 at 10:37:33AM -0700, Paul E. McKenney wrote:
> On Thu, Sep 04, 2014 at 11:32:09AM -0400, Nicolas Pitre wrote:
> 
> Actually, all rcu_barrier() does is to make sure that all previously
> queued RCU callbacks have been invoked.  And given the current
> implementation, if there are no callbacks queued anywhere in the system,
> rcu_barrier() is an extended no-op.  "Has CPU 0 any callbacks?" "Nope!"
> "Has CPU 1 any callbacks?"  "Nope!" ... "Has CPU nr_cpu_ids-1 any
> callbacks?"  "Nope!"  "OK, done!"
> 
> This is all done with the current task looking at per-CPU data structures,
> with no interaction with the scheduler and with no need to actually make
> those other CPUs do anything.
> 
> So what is it that you really need to do here?
> 
> A synchronize_sched() will wait for all non-idle online CPUs to pass
> through the scheduler, where "idle" includes usermode execution in
> CONFIG_NO_HZ_FULL=y kernels.  But it won't wait for CPUs executing
> in the idle loop.
> 
> A synchronize_rcu_tasks() will wait for all non-idle tasks that are
> currently on a runqueue to do a voluntary context switch.  There has
> been some discussion about extending this to idle tasks, but the current
> prospective users can live without this.  But if you need it, I can push
> on getting it set up.  (Current plans are that synchronize_rcu_tasks()
> goes into the v3.18 merge window.)  And one caveat: There is long
> latency associated with synchronize_rcu_tasks() by design.  Grace
> periods are measured in seconds.
> 
> A stop_cpus() will force a context switch on all CPUs, though it is
> a rather big hammer.

And I was reminded by the very next email that kick_all_cpus_sync() is
another possibility -- it forces an interrupt on all online CPUs, idle
or not.

							Thanx, Paul

> So again, what do you really need?
> 
> 							Thanx, Paul
> 
> >  }
> > 
> >  /**
> > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > index 11e7bc434f..c47fce75e6 100644
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -147,6 +147,9 @@ use_default:
> >  	    clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu))
> >  		goto use_default;
> > 
> > +	/* Take note of the planned idle state. */
> > +	idle_set_state(this_rq(), &drv->states[next_state]);
> > +
> >  	/*
> >  	 * Enter the idle state previously returned by the governor decision.
> >  	 * This function will block until an interrupt occurs and will take
> > @@ -154,6 +157,9 @@ use_default:
> >  	 */
> >  	entered_state = cpuidle_enter(drv, dev, next_state);
> > 
> > +	/* The cpu is no longer idle or about to enter idle. */
> > +	idle_set_state(this_rq(), NULL);
> > +
> >  	if (broadcast)
> >  		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);
> > 
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 579712f4e9..aea8baa7a5 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -14,6 +14,7 @@
> >  #include "cpuacct.h"
> > 
> >  struct rq;
> > +struct cpuidle_state;
> > 
> >  extern __read_mostly int scheduler_running;
> > 
> > @@ -636,6 +637,11 @@ struct rq {
> >  #ifdef CONFIG_SMP
> >  	struct llist_head wake_list;
> >  #endif
> > +
> > +#ifdef CONFIG_CPU_IDLE
> > +	/* Must be inspected within a rcu lock section */
> > +	struct cpuidle_state *idle_state;
> > +#endif
> >  };
> > 
> >  static inline int cpu_of(struct rq *rq)
> > @@ -1180,6 +1186,39 @@ static inline void idle_exit_fair(struct rq *rq) { }
> > 
> >  #endif
> > 
> > +#ifdef CONFIG_CPU_IDLE
> > +static inline void idle_set_state(struct rq *rq,
> > +				  struct cpuidle_state *idle_state)
> > +{
> > +	rq->idle_state = idle_state;
> > +}
> > +
> > +static inline struct cpuidle_state *idle_get_state(struct rq *rq)
> > +{
> > +	rcu_read_lock();
> > +	return rq->idle_state;
> > +}
> > +
> > +static inline void cpuidle_put_state(struct rq *rq)
> > +{
> > +	rcu_read_unlock();
> > +}
> > +#else
> > +static inline void idle_set_state(struct rq *rq,
> > +				  struct cpuidle_state *idle_state)
> > +{
> > +}
> > +
> > +static inline struct cpuidle_state *idle_get_state(struct rq *rq)
> > +{
> > +	return NULL;
> > +}
> > +
> > +static inline void cpuidle_put_state(struct rq *rq)
> > +{
> > +}
> > +#endif
> > +
> >  extern void sysrq_sched_debug_show(void);
> >  extern void sched_init_granularity(void);
> >  extern void update_max_interval(void);
> > -- 
> > 1.8.4.108.g55ea5f6
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> > 

Nicolas Pitre Sept. 18, 2014, 6:32 p.m. | #3
On Thu, 18 Sep 2014, Paul E. McKenney wrote:

> On Thu, Sep 04, 2014 at 11:32:09AM -0400, Nicolas Pitre wrote:
> 
> Actually, all rcu_barrier() does is to make sure that all previously
> queued RCU callbacks have been invoked.  And given the current
> implementation, if there are no callbacks queued anywhere in the system,
> rcu_barrier() is an extended no-op.  "Has CPU 0 any callbacks?" "Nope!"
> "Has CPU 1 any callbacks?"  "Nope!" ... "Has CPU nr_cpu_ids-1 any
> callbacks?"  "Nope!"  "OK, done!"
> 
> This is all done with the current task looking at per-CPU data structures,
> with no interaction with the scheduler and with no need to actually make
> those other CPUs do anything.
> 
> So what is it that you really need to do here?

In short, we don't want the cpuidle data to go away (see the 2 scenarios
above) while the scheduler is looking at it.  The scheduler uses the
provided accessors (see patch 2/2) so we can put any protection
mechanism we want in them.  A simple spinlock would be good enough.


Nicolas
Peter Zijlstra Sept. 18, 2014, 11:15 p.m. | #4
On Thu, Sep 18, 2014 at 10:39:25AM -0700, Paul E. McKenney wrote:
> On Thu, Sep 18, 2014 at 10:37:33AM -0700, Paul E. McKenney wrote:

> > A stop_cpus() will force a context switch on all CPUs, though it is
> > a rather big hammer.
> 
> And I was reminded by the very next email that kick_all_cpus_sync() is
> another possibility -- it forces an interrupt on all online CPUs, idle
> or not.

I actually have a patch
http://lkml.kernel.org/r/1409815075-4180-2-git-send-email-chuansheng.liu@intel.com
that changes that, because apparently there are idle loops that don't
actually exit on interrupt :-)

But yes, something like the wake_up_all_idle_cpus() should do.
Peter Zijlstra Sept. 18, 2014, 11:17 p.m. | #5
On Thu, Sep 18, 2014 at 02:32:25PM -0400, Nicolas Pitre wrote:
> On Thu, 18 Sep 2014, Paul E. McKenney wrote:

> > So what is it that you really need to do here?
> 
> In short, we don't want the cpuidle data to go away (see the 2 scenarios
> above) while the scheduler is looking at it.  The scheduler uses the
> provided accessors (see patch 2/2) so we can put any protection
> mechanism we want in them.  A simple spinlock would be good enough.

rq->lock disables interrupts, so given that, something like
kick_all_cpus_sync() will guarantee what you need --
wake_up_all_idle_cpus() will not.
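
The reasoning can be sketched with a toy user-space model. This is an
illustration under stated assumptions, not kernel code: each CPU is reduced to
two flags, holding rq->lock is modelled purely as "interrupts disabled", and
all function names below are hypothetical stand-ins for the behaviour of the
real primitives.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

struct cpu_model {
	bool irqs_disabled;	/* true while holding the modelled rq->lock */
	bool idle;		/* true while sitting in the idle loop */
};

static struct cpu_model cpus[NR_CPUS];

/* Reader side: taking the modelled rq->lock also masks interrupts. */
static void reader_lock(int cpu)   { cpus[cpu].irqs_disabled = true; }
static void reader_unlock(int cpu) { cpus[cpu].irqs_disabled = false; }

/*
 * Toy kick_all_cpus_sync(): it can complete only once every CPU has
 * taken the IPI, which a CPU with interrupts disabled cannot do.
 */
static bool model_kick_all_cpus_sync_can_complete(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpus[cpu].irqs_disabled)
			return false;
	return true;
}

/*
 * Toy wake_up_all_idle_cpus(): pokes the idle CPUs and waits for
 * nobody. Returns how many CPUs were poked.
 */
static int model_wake_up_all_idle_cpus(void)
{
	int poked = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpus[cpu].idle)
			poked++;
	return poked;
}
```

In this model a reader holding the lock blocks completion of the kick until it
unlocks, whereas the wake-up call returns regardless of any reader.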
Nicolas Pitre Sept. 19, 2014, 6:30 p.m. | #6
On Fri, 19 Sep 2014, Peter Zijlstra wrote:

> On Fri, Sep 19, 2014 at 01:17:15AM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 18, 2014 at 02:32:25PM -0400, Nicolas Pitre wrote:
> > > On Thu, 18 Sep 2014, Paul E. McKenney wrote:
> > 
> > > > So what is it that you really need to do here?
> > > 
> > > In short, we don't want the cpuidle data to go away (see the 2 scenarios
> > > above) while the scheduler is looking at it.  The scheduler uses the
> > > provided accessors (see patch 2/2) so we can put any protection
> > > mechanism we want in them.  A simple spinlock would be good enough.
> > 
> > rq->lock disables interrupts, so given that, something like
> > kick_all_cpus_sync() will guarantee what you need --
> > wake_up_all_idle_cpus() will not.
> 
> Something like so then?

I'll trust you for anything that relates to RCU as its subtleties are 
still escaping my mind.

Still, the commit log refers to idle_put_state() which is no more, and 
that should be adjusted.

> 
> ---
> Subject: sched: let the scheduler see CPU idle states
> From: Daniel Lezcano <daniel.lezcano@linaro.org>
> Date: Thu, 04 Sep 2014 11:32:09 -0400
> 
> When the cpu enters idle, it stores the cpuidle state pointer in its
> struct rq instance which in turn could be used to make a better decision
> when balancing tasks.
> 
> As soon as the cpu exits its idle state, the struct rq reference is
> cleared.
> 
> There are a couple of situations where the idle state pointer could be changed
> while it is being consulted:
> 
> 1. For x86/acpi with dynamic c-states, when a laptop switches from battery
>    to AC that could result in removing the deeper idle state. The acpi driver
>    triggers:
> 	'acpi_processor_cst_has_changed'
> 		'cpuidle_pause_and_lock'
> 			'cpuidle_uninstall_idle_handler'
> 				'kick_all_cpus_sync'.
> 
> All cpus will exit their idle state and the pointed object will be set to
> NULL.
> 
> 2. The cpuidle driver is unloaded. Logically that could happen but not
> in practice because the drivers are always compiled in and 95% of them are
> not coded to unregister themselves.  In any case, the unloading code must
> call 'cpuidle_unregister_device', that calls 'cpuidle_pause_and_lock'
> leading to 'kick_all_cpus_sync' as mentioned above.
> 
> A race can happen if we use the pointer and then one of these two scenarios
> occurs at the same moment.
> 
> In order to be safe, the idle state pointer stored in the rq must be
> used inside a rcu_read_lock section where we are protected with the
> 'rcu_barrier' in the 'cpuidle_uninstall_idle_handler' function. The
> idle_get_state() and idle_put_state() accessors should be used to that
> effect.
> 
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Ingo Molnar <mingo@redhat.com>
> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> ---
>  drivers/cpuidle/cpuidle.c |    6 ++++++
>  kernel/sched/idle.c       |    6 ++++++
>  kernel/sched/sched.h      |   29 +++++++++++++++++++++++++++++
>  3 files changed, 41 insertions(+)
> 
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -225,6 +225,12 @@ void cpuidle_uninstall_idle_handler(void
>  		initialized = 0;
>  		wake_up_all_idle_cpus();
>  	}
> +
> +	/*
> +	 * Make sure external observers (such as the scheduler)
> +	 * are done looking at pointed idle states.
> +	 */
> +	kick_all_cpus_sync();
>  }
>  
>  /**
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -147,6 +147,9 @@ static void cpuidle_idle_call(void)
>  	    clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu))
>  		goto use_default;
>  
> +	/* Take note of the planned idle state. */
> +	idle_set_state(this_rq(), &drv->states[next_state]);
> +
>  	/*
>  	 * Enter the idle state previously returned by the governor decision.
>  	 * This function will block until an interrupt occurs and will take
> @@ -154,6 +157,9 @@ static void cpuidle_idle_call(void)
>  	 */
>  	entered_state = cpuidle_enter(drv, dev, next_state);
>  
> +	/* The cpu is no longer idle or about to enter idle. */
> +	idle_set_state(this_rq(), NULL);
> +
>  	if (broadcast)
>  		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);
>  
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -14,6 +14,7 @@
>  #include "cpuacct.h"
>  
>  struct rq;
> +struct cpuidle_state;
>  
>  /* task_struct::on_rq states: */
>  #define TASK_ON_RQ_QUEUED	1
> @@ -640,6 +641,11 @@ struct rq {
>  #ifdef CONFIG_SMP
>  	struct llist_head wake_list;
>  #endif
> +
> +#ifdef CONFIG_CPU_IDLE
> +	/* Must be inspected within a rcu lock section */
> +	struct cpuidle_state *idle_state;
> +#endif
>  };
>  
>  static inline int cpu_of(struct rq *rq)
> @@ -1193,6 +1199,29 @@ static inline void idle_exit_fair(struct
>  
>  #endif
>  
> +#ifdef CONFIG_CPU_IDLE
> +static inline void idle_set_state(struct rq *rq,
> +				  struct cpuidle_state *idle_state)
> +{
> +	rq->idle_state = idle_state;
> +}
> +
> +static inline struct cpuidle_state *idle_get_state(struct rq *rq)
> +{
> +	return rq->idle_state;
> +}
> +#else
> +static inline void idle_set_state(struct rq *rq,
> +				  struct cpuidle_state *idle_state)
> +{
> +}
> +
> +static inline struct cpuidle_state *idle_get_state(struct rq *rq)
> +{
> +	return NULL;
> +}
> +#endif
> +
>  extern void sysrq_sched_debug_show(void);
>  extern void sched_init_granularity(void);
>  extern void update_max_interval(void);
> 
> 

Patch

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index ee9df5e3f5..530e3055a2 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -225,6 +225,12 @@  void cpuidle_uninstall_idle_handler(void)
 		initialized = 0;
 		kick_all_cpus_sync();
 	}
+
+	/*
+	 * Make sure external observers (such as the scheduler)
+	 * are done looking at pointed idle states.
+	 */
+	rcu_barrier();
 }
 
 /**
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 11e7bc434f..c47fce75e6 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -147,6 +147,9 @@  use_default:
 	    clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu))
 		goto use_default;
 
+	/* Take note of the planned idle state. */
+	idle_set_state(this_rq(), &drv->states[next_state]);
+
 	/*
 	 * Enter the idle state previously returned by the governor decision.
 	 * This function will block until an interrupt occurs and will take
@@ -154,6 +157,9 @@  use_default:
 	 */
 	entered_state = cpuidle_enter(drv, dev, next_state);
 
+	/* The cpu is no longer idle or about to enter idle. */
+	idle_set_state(this_rq(), NULL);
+
 	if (broadcast)
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 579712f4e9..aea8baa7a5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -14,6 +14,7 @@ 
 #include "cpuacct.h"
 
 struct rq;
+struct cpuidle_state;
 
 extern __read_mostly int scheduler_running;
 
@@ -636,6 +637,11 @@  struct rq {
 #ifdef CONFIG_SMP
 	struct llist_head wake_list;
 #endif
+
+#ifdef CONFIG_CPU_IDLE
+	/* Must be inspected within a rcu lock section */
+	struct cpuidle_state *idle_state;
+#endif
 };
 
 static inline int cpu_of(struct rq *rq)
@@ -1180,6 +1186,39 @@  static inline void idle_exit_fair(struct rq *rq) { }
 
 #endif
 
+#ifdef CONFIG_CPU_IDLE
+static inline void idle_set_state(struct rq *rq,
+				  struct cpuidle_state *idle_state)
+{
+	rq->idle_state = idle_state;
+}
+
+static inline struct cpuidle_state *idle_get_state(struct rq *rq)
+{
+	rcu_read_lock();
+	return rq->idle_state;
+}
+
+static inline void cpuidle_put_state(struct rq *rq)
+{
+	rcu_read_unlock();
+}
+#else
+static inline void idle_set_state(struct rq *rq,
+				  struct cpuidle_state *idle_state)
+{
+}
+
+static inline struct cpuidle_state *idle_get_state(struct rq *rq)
+{
+	return NULL;
+}
+
+static inline void cpuidle_put_state(struct rq *rq)
+{
+}
+#endif
+
 extern void sysrq_sched_debug_show(void);
 extern void sched_init_granularity(void);
 extern void update_max_interval(void);