[RFC,tip/core/rcu] rcu: direct algorithmic SRCU implementation

Message ID 20120213020951.GA12138@linux.vnet.ibm.com
State New

Commit Message

Paul E. McKenney Feb. 13, 2012, 2:09 a.m.
The current implementation of synchronize_srcu_expedited() can cause
severe OS jitter due to its use of synchronize_sched(), which in turn
invokes try_stop_cpus(), which causes each CPU to be sent an IPI.
This can result in severe performance degradation for real-time workloads
and especially for short-iteration-length HPC workloads.  Furthermore,
because only one instance of try_stop_cpus() can be making forward progress
at a given time, only one instance of synchronize_srcu_expedited() can
make forward progress at a time, even if they are all operating on
distinct srcu_struct structures.

This commit, inspired by an earlier implementation by Peter Zijlstra
(https://lkml.org/lkml/2012/1/31/211) and by further offline discussions,
takes a strictly algorithmic bits-in-memory approach.  This has the
disadvantage of requiring one explicit memory-barrier instruction in
each of srcu_read_lock() and srcu_read_unlock(), but on the other hand
completely dispenses with OS jitter and furthermore allows SRCU to be
used freely by CPUs that RCU believes to be idle or offline.

The update-side implementation handles the single read-side memory
barrier by rechecking the per-CPU counters after summing them and
by running through the update-side state machine twice.

This implementation has passed moderate rcutorture testing on both 32-bit
x86 and 64-bit Power.  A call_srcu() function will be present in a later
version of this patch.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Comments

Peter Zijlstra Feb. 15, 2012, 12:59 p.m. | #1
On Sun, 2012-02-12 at 18:09 -0800, Paul E. McKenney wrote:
> The current implementation of synchronize_srcu_expedited() can cause
> severe OS jitter due to its use of synchronize_sched(), which in turn
> invokes try_stop_cpus(), which causes each CPU to be sent an IPI.
> This can result in severe performance degradation for real-time workloads
> and especially for short-iteration-length HPC workloads.  Furthermore,
> because only one instance of try_stop_cpus() can be making forward progress
> at a given time, only one instance of synchronize_srcu_expedited() can
> make forward progress at a time, even if they are all operating on
> distinct srcu_struct structures.
> 
> This commit, inspired by an earlier implementation by Peter Zijlstra
> (https://lkml.org/lkml/2012/1/31/211) and by further offline discussions,
> takes a strictly algorithmic bits-in-memory approach.  This has the
> disadvantage of requiring one explicit memory-barrier instruction in
> each of srcu_read_lock() and srcu_read_unlock(), but on the other hand
> completely dispenses with OS jitter and furthermore allows SRCU to be
> used freely by CPUs that RCU believes to be idle or offline.
> 
> The update-side implementation handles the single read-side memory
> barrier by rechecking the per-CPU counters after summing them and
> by running through the update-side state machine twice.

Yeah, getting rid of that second memory barrier in srcu_read_lock() is
pure magic :-)

> This implementation has passed moderate rcutorture testing on both 32-bit
> x86 and 64-bit Power.  A call_srcu() function will be present in a later
> version of this patch.

Goodness ;-)

> @@ -131,10 +214,11 @@ int __srcu_read_lock(struct srcu_struct *sp)
>  	int idx;
>  
>  	preempt_disable();
> -	idx = sp->completed & 0x1;
> -	barrier();  /* ensure compiler looks -once- at sp->completed. */
> -	per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
> -	srcu_barrier();  /* ensure compiler won't misorder critical section. */
> +	idx = rcu_dereference_index_check(sp->completed,
> +					  rcu_read_lock_sched_held()) & 0x1;
> +	ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]) +=
> +		SRCU_USAGE_COUNT + 1;
> +	smp_mb(); /* B */  /* Avoid leaking the critical section. */
>  	preempt_enable();
>  	return idx;
>  }

You could use __this_cpu_* muck to shorten some of that.

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Mathieu Desnoyers Feb. 15, 2012, 2:31 p.m. | #2
* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
[...]
> +/*
> + * To be called from the update side after an index flip.  Returns true
> + * if the modulo sum of the counters is stably zero, false if there is
> + * some possibility of non-zero.
> + */
> +static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
>  {
>  	int cpu;
> -	int sum;
>  
> -	sum = 0;
> +	/*
> +	 * Note that srcu_readers_active_idx() can incorrectly return
> +	 * zero even though there is a pre-existing reader throughout.
> +	 * To see this, suppose that task A is in a very long SRCU
> +	 * read-side critical section that started on CPU 0, and that
> +	 * no other reader exists, so that the modulo sum of the counters
> +	 * is equal to one.  Then suppose that task B starts executing
> +	 * srcu_readers_active_idx(), summing up to CPU 1, and then that
> +	 * task C starts reading on CPU 0, so that its increment is not
> +	 * summed, but finishes reading on CPU 2, so that its decrement
> +	 * -is- summed.  Then when task B completes its sum, it will
> +	 * incorrectly get zero, despite the fact that task A has been
> +	 * in its SRCU read-side critical section the whole time.
> +	 *
> +	 * We therefore do a validation step should srcu_readers_active_idx()
> +	 * return zero.
> +	 */
> +	if (srcu_readers_active_idx(sp, idx) != 0)
> +		return false;
> +
> +	/*
> +	 * Since the caller recently flipped ->completed, we can see at
> +	 * most one increment of each CPU's counter from this point
> +	 * forward.  The reason for this is that the reader CPU must have
> +	 * fetched the index before srcu_readers_active_idx checked
> +	 * that CPU's counter, but not yet incremented its counter.
> +	 * Its eventual counter increment will follow the read in
> +	 * srcu_readers_active_idx(), and that increment is immediately
> +	 * followed by smp_mb() B.  Because smp_mb() D is between
> +	 * the ->completed flip and srcu_readers_active_idx()'s read,
> +	 * that CPU's subsequent load of ->completed must see the new
> +	 * value, and therefore increment the counter in the other rank.
> +	 */
> +	smp_mb(); /* A */
> +
> +	/*
> +	 * Now, we check the ->snap array that srcu_readers_active_idx()
> +	 * filled in from the per-CPU counter values.  Since both
> +	 * __srcu_read_lock() and __srcu_read_unlock() increment the
> +	 * upper bits of the per-CPU counter, an increment/decrement
> +	 * pair will change the value of the counter.  Since there is
> +	 * only one possible increment, the only way to wrap the counter
> +	 * is to have a huge number of counter decrements, which requires
> +	 * a huge number of tasks and huge SRCU read-side critical-section
> +	 * nesting levels, even on 32-bit systems.
> +	 *
> +	 * All of the ways of confusing the readings require that the scan
> +	 * in srcu_readers_active_idx() see the read-side task's decrement,
> +	 * but not its increment.  However, between that decrement and
> +	 * increment are smp_mb() B and C.  Either or both of these pair
> +	 * with smp_mb() A above to ensure that the scan below will see
> +	 * the read-side task's increment, thus noting a difference in
> +	 * the counter values between the two passes.

Hi Paul,

I think the implementation is correct, but the explanation above might
be improved. Let's consider the following scenario, where a reader is
migrated between increment of the counter and issuing the memory barrier
in the read lock:

A,B,C are readers
D is synchronize_srcu (one flip'n'wait)

CPU A               CPU B           CPU C         CPU D
                                    c[1]++
                                    smp_mb(1)
                                                  read c[0] -> 0
c[0]++
(implicit smp_mb (2))
      -> migrated ->
                    (implicit smp_mb (3))
                    smp_mb (4)
                    smp_mb (5)
                    c[1]--
                                                  read c[1] -> -1
                                                  read c[2] -> 1
                                                  (false 0 sum)
                                                  smp_mb (6)
                                                  re-check each.
                                    c[1]--

re-check: because we observed c[1] == -1, thanks to the implicit memory
barriers within thread migration (2 and 3), we can assume that we _will_
observe the updated value of c[0] after smp_mb (6).

The current explanation states that memory barriers 4 and 5, along with  
6, are responsible for ensuring that the increment will be observed by 
the re-check. However, I doubt they have anything to do with it: it's 
rather the implicit memory barriers in thread migration, along with
program order guarantees on writes to the same address, that seems to be
the reason why we can do this ordering assumption.

Does it make sense, or shall I get another coffee to wake myself up ?
;)

Thanks,

Mathieu

> +	 *
> +	 * Therefore, if srcu_readers_active_idx() returned zero, and
> +	 * none of the counters changed, we know that the zero was the
> +	 * correct sum.
> +	 *
> +	 * Of course, it is possible that a task might be delayed
> +	 * for a very long time in __srcu_read_lock() after fetching
> +	 * the index but before incrementing its counter.  This
> +	 * possibility will be dealt with in __synchronize_srcu().
> +	 */
>  	for_each_possible_cpu(cpu)
> -		sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
> -	return sum;
> +		if (sp->snap[cpu] !=
> +		    ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
> +			return false;  /* False zero reading! */
> +	return true;
>  }
Mathieu Desnoyers Feb. 15, 2012, 2:51 p.m. | #3
* Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> [...]
> > +/*
> > + * To be called from the update side after an index flip.  Returns true
> > + * if the modulo sum of the counters is stably zero, false if there is
> > + * some possibility of non-zero.
> > + */
> > +static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> >  {
> >  	int cpu;
> > -	int sum;
> >  
> > -	sum = 0;
> > +	/*
> > +	 * Note that srcu_readers_active_idx() can incorrectly return
> > +	 * zero even though there is a pre-existing reader throughout.
> > +	 * To see this, suppose that task A is in a very long SRCU
> > +	 * read-side critical section that started on CPU 0, and that
> > +	 * no other reader exists, so that the modulo sum of the counters
> > +	 * is equal to one.  Then suppose that task B starts executing
> > +	 * srcu_readers_active_idx(), summing up to CPU 1, and then that
> > +	 * task C starts reading on CPU 0, so that its increment is not
> > +	 * summed, but finishes reading on CPU 2, so that its decrement
> > +	 * -is- summed.  Then when task B completes its sum, it will
> > +	 * incorrectly get zero, despite the fact that task A has been
> > +	 * in its SRCU read-side critical section the whole time.
> > +	 *
> > +	 * We therefore do a validation step should srcu_readers_active_idx()
> > +	 * return zero.
> > +	 */
> > +	if (srcu_readers_active_idx(sp, idx) != 0)
> > +		return false;
> > +
> > +	/*
> > +	 * Since the caller recently flipped ->completed, we can see at
> > +	 * most one increment of each CPU's counter from this point
> > +	 * forward.  The reason for this is that the reader CPU must have
> > +	 * fetched the index before srcu_readers_active_idx checked
> > +	 * that CPU's counter, but not yet incremented its counter.
> > +	 * Its eventual counter increment will follow the read in
> > +	 * srcu_readers_active_idx(), and that increment is immediately
> > +	 * followed by smp_mb() B.  Because smp_mb() D is between
> > +	 * the ->completed flip and srcu_readers_active_idx()'s read,
> > +	 * that CPU's subsequent load of ->completed must see the new
> > +	 * value, and therefore increment the counter in the other rank.
> > +	 */
> > +	smp_mb(); /* A */
> > +
> > +	/*
> > +	 * Now, we check the ->snap array that srcu_readers_active_idx()
> > +	 * filled in from the per-CPU counter values.  Since both
> > +	 * __srcu_read_lock() and __srcu_read_unlock() increment the
> > +	 * upper bits of the per-CPU counter, an increment/decrement
> > +	 * pair will change the value of the counter.  Since there is
> > +	 * only one possible increment, the only way to wrap the counter
> > +	 * is to have a huge number of counter decrements, which requires
> > +	 * a huge number of tasks and huge SRCU read-side critical-section
> > +	 * nesting levels, even on 32-bit systems.
> > +	 *
> > +	 * All of the ways of confusing the readings require that the scan
> > +	 * in srcu_readers_active_idx() see the read-side task's decrement,
> > +	 * but not its increment.  However, between that decrement and
> > +	 * increment are smp_mb() B and C.  Either or both of these pair
> > +	 * with smp_mb() A above to ensure that the scan below will see
> > +	 * the read-side task's increment, thus noting a difference in
> > +	 * the counter values between the two passes.
> 
> Hi Paul,
> 
> I think the implementation is correct, but the explanation above might
> be improved. Let's consider the following scenario, where a reader is
> migrated between increment of the counter and issuing the memory barrier
> in the read lock:
> 
> A,B,C are readers
> D is synchronize_srcu (one flip'n'wait)
> 
> CPU A               CPU B           CPU C         CPU D
>                                     c[1]++
>                                     smp_mb(1)
>                                                   read c[0] -> 0
> c[0]++
> (implicit smp_mb (2))
>       -> migrated ->
>                     (implicit smp_mb (3))
>                     smp_mb (4)
>                     smp_mb (5)
>                     c[1]--
>                                                   read c[1] -> -1
>                                                   read c[2] -> 1
>                                                   (false 0 sum)
>                                                   smp_mb (6)
>                                                   re-check each.
>                                     c[1]--
> 
> re-check: because we observed c[1] == -1, thanks to the implicit memory
> barriers within thread migration (2 and 3), we can assume that we _will_
> observe the updated value of c[0] after smp_mb (6).
> 
> The current explanation states that memory barriers 4 and 5, along with  
> 6, are responsible for ensuring that the increment will be observed by 
> the re-check. However, I doubt they have anything to do with it: it's 
> rather the implicit memory barriers in thread migration, along with
> program order guarantees on writes to the same address, that seems to be
> the reason why we can do this ordering assumption.

Please disregard the part about program order: CPU A writes to c[0], and
CPU B writes to c[1], which are two different memory locations. The rest
of my discussion stands though.

Simply reasoning about write to c[0], memory barriers 2-3, write to
c[1], along with c[1] read, memory barrier 6, and then c[0] read is
enough to explain the ordering guarantees you need, without invoking
program order.

Thanks,

Mathieu

> 
> Does it make sense, or shall I get another coffee to wake myself up ?
> ;)
> 
> Thanks,
> 
> Mathieu
> 
> > +	 *
> > +	 * Therefore, if srcu_readers_active_idx() returned zero, and
> > +	 * none of the counters changed, we know that the zero was the
> > +	 * correct sum.
> > +	 *
> > +	 * Of course, it is possible that a task might be delayed
> > +	 * for a very long time in __srcu_read_lock() after fetching
> > +	 * the index but before incrementing its counter.  This
> > +	 * possibility will be dealt with in __synchronize_srcu().
> > +	 */
> >  	for_each_possible_cpu(cpu)
> > -		sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
> > -	return sum;
> > +		if (sp->snap[cpu] !=
> > +		    ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
> > +			return false;  /* False zero reading! */
> > +	return true;
> >  }
> 
> 
> -- 
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com
Paul E. McKenney Feb. 16, 2012, 6:35 a.m. | #4
On Wed, Feb 15, 2012 at 01:59:23PM +0100, Peter Zijlstra wrote:
> On Sun, 2012-02-12 at 18:09 -0800, Paul E. McKenney wrote:
> > The current implementation of synchronize_srcu_expedited() can cause
> > severe OS jitter due to its use of synchronize_sched(), which in turn
> > invokes try_stop_cpus(), which causes each CPU to be sent an IPI.
> > This can result in severe performance degradation for real-time workloads
> > and especially for short-iteration-length HPC workloads.  Furthermore,
> > because only one instance of try_stop_cpus() can be making forward progress
> > at a given time, only one instance of synchronize_srcu_expedited() can
> > make forward progress at a time, even if they are all operating on
> > distinct srcu_struct structures.
> > 
> > This commit, inspired by an earlier implementation by Peter Zijlstra
> > (https://lkml.org/lkml/2012/1/31/211) and by further offline discussions,
> > takes a strictly algorithmic bits-in-memory approach.  This has the
> > disadvantage of requiring one explicit memory-barrier instruction in
> > each of srcu_read_lock() and srcu_read_unlock(), but on the other hand
> > completely dispenses with OS jitter and furthermore allows SRCU to be
> > used freely by CPUs that RCU believes to be idle or offline.
> > 
> > The update-side implementation handles the single read-side memory
> > barrier by rechecking the per-CPU counters after summing them and
> > by running through the update-side state machine twice.
> 
> Yeah, getting rid of that second memory barrier in srcu_read_lock() is
> pure magic :-)
> 
> > This implementation has passed moderate rcutorture testing on both 32-bit
> > x86 and 64-bit Power.  A call_srcu() function will be present in a later
> > version of this patch.
> 
> Goodness ;-)

Glad you like the magic and the prospect of call_srcu().  ;-)

> > @@ -131,10 +214,11 @@ int __srcu_read_lock(struct srcu_struct *sp)
> >  	int idx;
> >  
> >  	preempt_disable();
> > -	idx = sp->completed & 0x1;
> > -	barrier();  /* ensure compiler looks -once- at sp->completed. */
> > -	per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
> > -	srcu_barrier();  /* ensure compiler won't misorder critical section. */
> > +	idx = rcu_dereference_index_check(sp->completed,
> > +					  rcu_read_lock_sched_held()) & 0x1;
> > +	ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]) +=
> > +		SRCU_USAGE_COUNT + 1;
> > +	smp_mb(); /* B */  /* Avoid leaking the critical section. */
> >  	preempt_enable();
> >  	return idx;
> >  }
> 
> You could use __this_cpu_* muck to shorten some of that.

Ah, so something like this?

	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 
		SRCU_USAGE_COUNT + 1;

Now that you mention it, this does look nicer, applied here and to
srcu_read_unlock().

> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>


							Thanx, Paul
Paul E. McKenney Feb. 16, 2012, 6:38 a.m. | #5
On Wed, Feb 15, 2012 at 09:51:44AM -0500, Mathieu Desnoyers wrote:
> * Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > [...]
> > > +/*
> > > + * To be called from the update side after an index flip.  Returns true
> > > + * if the modulo sum of the counters is stably zero, false if there is
> > > + * some possibility of non-zero.
> > > + */
> > > +static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> > >  {
> > >  	int cpu;
> > > -	int sum;
> > >  
> > > -	sum = 0;
> > > +	/*
> > > +	 * Note that srcu_readers_active_idx() can incorrectly return
> > > +	 * zero even though there is a pre-existing reader throughout.
> > > +	 * To see this, suppose that task A is in a very long SRCU
> > > +	 * read-side critical section that started on CPU 0, and that
> > > +	 * no other reader exists, so that the modulo sum of the counters
> > > +	 * is equal to one.  Then suppose that task B starts executing
> > > +	 * srcu_readers_active_idx(), summing up to CPU 1, and then that
> > > +	 * task C starts reading on CPU 0, so that its increment is not
> > > +	 * summed, but finishes reading on CPU 2, so that its decrement
> > > +	 * -is- summed.  Then when task B completes its sum, it will
> > > +	 * incorrectly get zero, despite the fact that task A has been
> > > +	 * in its SRCU read-side critical section the whole time.
> > > +	 *
> > > +	 * We therefore do a validation step should srcu_readers_active_idx()
> > > +	 * return zero.
> > > +	 */
> > > +	if (srcu_readers_active_idx(sp, idx) != 0)
> > > +		return false;
> > > +
> > > +	/*
> > > +	 * Since the caller recently flipped ->completed, we can see at
> > > +	 * most one increment of each CPU's counter from this point
> > > +	 * forward.  The reason for this is that the reader CPU must have
> > > +	 * fetched the index before srcu_readers_active_idx checked
> > > +	 * that CPU's counter, but not yet incremented its counter.
> > > +	 * Its eventual counter increment will follow the read in
> > > +	 * srcu_readers_active_idx(), and that increment is immediately
> > > +	 * followed by smp_mb() B.  Because smp_mb() D is between
> > > +	 * the ->completed flip and srcu_readers_active_idx()'s read,
> > > +	 * that CPU's subsequent load of ->completed must see the new
> > > +	 * value, and therefore increment the counter in the other rank.
> > > +	 */
> > > +	smp_mb(); /* A */
> > > +
> > > +	/*
> > > +	 * Now, we check the ->snap array that srcu_readers_active_idx()
> > > +	 * filled in from the per-CPU counter values.  Since both
> > > +	 * __srcu_read_lock() and __srcu_read_unlock() increment the
> > > +	 * upper bits of the per-CPU counter, an increment/decrement
> > > +	 * pair will change the value of the counter.  Since there is
> > > +	 * only one possible increment, the only way to wrap the counter
> > > +	 * is to have a huge number of counter decrements, which requires
> > > +	 * a huge number of tasks and huge SRCU read-side critical-section
> > > +	 * nesting levels, even on 32-bit systems.
> > > +	 *
> > > +	 * All of the ways of confusing the readings require that the scan
> > > +	 * in srcu_readers_active_idx() see the read-side task's decrement,
> > > +	 * but not its increment.  However, between that decrement and
> > > +	 * increment are smp_mb() B and C.  Either or both of these pair
> > > +	 * with smp_mb() A above to ensure that the scan below will see
> > > +	 * the read-side task's increment, thus noting a difference in
> > > +	 * the counter values between the two passes.
> > 
> > Hi Paul,
> > 
> > I think the implementation is correct, but the explanation above might
> > be improved. Let's consider the following scenario, where a reader is
> > migrated between increment of the counter and issuing the memory barrier
> > in the read lock:
> > 
> > A,B,C are readers
> > D is synchronize_srcu (one flip'n'wait)
> > 
> > CPU A               CPU B           CPU C         CPU D
> >                                     c[1]++
> >                                     smp_mb(1)
> >                                                   read c[0] -> 0
> > c[0]++
> > (implicit smp_mb (2))
> >       -> migrated ->
> >                     (implicit smp_mb (3))
> >                     smp_mb (4)
> >                     smp_mb (5)
> >                     c[1]--
> >                                                   read c[1] -> -1
> >                                                   read c[2] -> 1
> >                                                   (false 0 sum)
> >                                                   smp_mb (6)
> >                                                   re-check each.
> >                                     c[1]--
> > 
> > re-check: because we observed c[1] == -1, thanks to the implicit memory
> > barriers within thread migration (2 and 3), we can assume that we _will_
> > observe the updated value of c[0] after smp_mb (6).
> > 
> > The current explanation states that memory barriers 4 and 5, along with  
> > 6, are responsible for ensuring that the increment will be observed by 
> > the re-check. However, I doubt they have anything to do with it: it's 
> > rather the implicit memory barriers in thread migration, along with
> > program order guarantees on writes to the same address, that seems to be
> > the reason why we can do this ordering assumption.
> 
> Please disregard the part about program order: CPU A writes to c[0], and
> CPU B writes to c[1], which are two different memory locations. The rest
> of my discussion stands though.
> 
> Simply reasoning about write to c[0], memory barriers 2-3, write to
> c[1], along with c[1] read, memory barrier 6, and then c[0] read is
> enough to explain the ordering guarantees you need, without invoking
> program order.

I am assuming that if the scheduler migrates a process, it applies enough
memory ordering to allow the proof to operate as if it had stayed on a
single CPU throughout.  The reasoning for this would consider the
scheduler access and memory barriers -- but there would be an arbitrarily
large number of migration patterns, so I am not convinced that it would
help...

							Thanx, Paul

> Thanks,
> 
> Mathieu
> 
> > 
> > Does it make sense, or shall I get another coffee to wake myself up ?
> > ;)
> > 
> > Thanks,
> > 
> > Mathieu
> > 
> > > +	 *
> > > +	 * Therefore, if srcu_readers_active_idx() returned zero, and
> > > +	 * none of the counters changed, we know that the zero was the
> > > +	 * correct sum.
> > > +	 *
> > > +	 * Of course, it is possible that a task might be delayed
> > > +	 * for a very long time in __srcu_read_lock() after fetching
> > > +	 * the index but before incrementing its counter.  This
> > > +	 * possibility will be dealt with in __synchronize_srcu().
> > > +	 */
> > >  	for_each_possible_cpu(cpu)
> > > -		sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
> > > -	return sum;
> > > +		if (sp->snap[cpu] !=
> > > +		    ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
> > > +			return false;  /* False zero reading! */
> > > +	return true;
> > >  }
> > 
> > 
> > -- 
> > Mathieu Desnoyers
> > Operating System Efficiency R&D Consultant
> > EfficiOS Inc.
> > http://www.efficios.com
> 
> -- 
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com
>
Mathieu Desnoyers Feb. 16, 2012, 10:50 a.m. | #6
* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Wed, Feb 15, 2012 at 01:59:23PM +0100, Peter Zijlstra wrote:
> > On Sun, 2012-02-12 at 18:09 -0800, Paul E. McKenney wrote:
> > > The current implementation of synchronize_srcu_expedited() can cause
> > > severe OS jitter due to its use of synchronize_sched(), which in turn
> > > invokes try_stop_cpus(), which causes each CPU to be sent an IPI.
> > > This can result in severe performance degradation for real-time workloads
> > > and especially for short-iteration-length HPC workloads.  Furthermore,
> > > because only one instance of try_stop_cpus() can be making forward progress
> > > at a given time, only one instance of synchronize_srcu_expedited() can
> > > make forward progress at a time, even if they are all operating on
> > > distinct srcu_struct structures.
> > > 
> > > This commit, inspired by an earlier implementation by Peter Zijlstra
> > > (https://lkml.org/lkml/2012/1/31/211) and by further offline discussions,
> > > takes a strictly algorithmic bits-in-memory approach.  This has the
> > > disadvantage of requiring one explicit memory-barrier instruction in
> > > each of srcu_read_lock() and srcu_read_unlock(), but on the other hand
> > > completely dispenses with OS jitter and furthermore allows SRCU to be
> > > used freely by CPUs that RCU believes to be idle or offline.
> > > 
> > > The update-side implementation handles the single read-side memory
> > > barrier by rechecking the per-CPU counters after summing them and
> > > by running through the update-side state machine twice.
> > 
> > Yeah, getting rid of that second memory barrier in srcu_read_lock() is
> > pure magic :-)
> > 
> > > This implementation has passed moderate rcutorture testing on both 32-bit
> > > x86 and 64-bit Power.  A call_srcu() function will be present in a later
> > > version of this patch.
> > 
> > Goodness ;-)
> 
> Glad you like the magic and the prospect of call_srcu().  ;-)
> 
> > > @@ -131,10 +214,11 @@ int __srcu_read_lock(struct srcu_struct *sp)
> > >  	int idx;
> > >  
> > >  	preempt_disable();
> > > -	idx = sp->completed & 0x1;
> > > -	barrier();  /* ensure compiler looks -once- at sp->completed. */
> > > -	per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
> > > -	srcu_barrier();  /* ensure compiler won't misorder critical section. */
> > > +	idx = rcu_dereference_index_check(sp->completed,
> > > +					  rcu_read_lock_sched_held()) & 0x1;
> > > +	ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]) +=
> > > +		SRCU_USAGE_COUNT + 1;
> > > +	smp_mb(); /* B */  /* Avoid leaking the critical section. */
> > >  	preempt_enable();
> > >  	return idx;
> > >  }
> > 
> > You could use __this_cpu_* muck to shorten some of that.
> 
> Ah, so something like this?
> 
> 	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 
> 		SRCU_USAGE_COUNT + 1;
> 
> Now that you mention it, this does look nicer, applied here and to
> srcu_read_unlock().

I think Peter refers to __this_cpu_add().

Thanks,

Mathieu

> 
> > Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> 
> 							Thanx, Paul
>
Peter Zijlstra Feb. 16, 2012, 10:52 a.m. | #7
On Thu, 2012-02-16 at 05:50 -0500, Mathieu Desnoyers wrote:
> > Ah, so something like this?
> > 
> >       ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 
> >               SRCU_USAGE_COUNT + 1;
> > 
> > Now that you mention it, this does look nicer, applied here and to
> > srcu_read_unlock().
> 
> I think Peter refers to __this_cpu_add(). 

I'm not sure that implies the ACCESS_ONCE() thing.
Mathieu Desnoyers Feb. 16, 2012, 11 a.m. | #8
* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Wed, Feb 15, 2012 at 09:51:44AM -0500, Mathieu Desnoyers wrote:
> > * Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > [...]
> > > > +/*
> > > > + * To be called from the update side after an index flip.  Returns true
> > > > + * if the modulo sum of the counters is stably zero, false if there is
> > > > + * some possibility of non-zero.
> > > > + */
> > > > +static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> > > >  {
> > > >  	int cpu;
> > > > -	int sum;
> > > >  
> > > > -	sum = 0;
> > > > +	/*
> > > > +	 * Note that srcu_readers_active_idx() can incorrectly return
> > > > +	 * zero even though there is a pre-existing reader throughout.
> > > > +	 * To see this, suppose that task A is in a very long SRCU
> > > > +	 * read-side critical section that started on CPU 0, and that
> > > > +	 * no other reader exists, so that the modulo sum of the counters
> > > > +	 * is equal to one.  Then suppose that task B starts executing
> > > > +	 * srcu_readers_active_idx(), summing up to CPU 1, and then that
> > > > +	 * task C starts reading on CPU 0, so that its increment is not
> > > > +	 * summed, but finishes reading on CPU 2, so that its decrement
> > > > +	 * -is- summed.  Then when task B completes its sum, it will
> > > > +	 * incorrectly get zero, despite the fact that task A has been
> > > > +	 * in its SRCU read-side critical section the whole time.
> > > > +	 *
> > > > +	 * We therefore do a validation step should srcu_readers_active_idx()
> > > > +	 * return zero.
> > > > +	 */
> > > > +	if (srcu_readers_active_idx(sp, idx) != 0)
> > > > +		return false;
> > > > +
> > > > +	/*
> > > > +	 * Since the caller recently flipped ->completed, we can see at
> > > > +	 * most one increment of each CPU's counter from this point
> > > > +	 * forward.  The reason for this is that the reader CPU must have
> > > > +	 * fetched the index before srcu_readers_active_idx checked
> > > > +	 * that CPU's counter, but not yet incremented its counter.
> > > > +	 * Its eventual counter increment will follow the read in
> > > > +	 * srcu_readers_active_idx(), and that increment is immediately
> > > > +	 * followed by smp_mb() B.  Because smp_mb() D is between
> > > > +	 * the ->completed flip and srcu_readers_active_idx()'s read,
> > > > +	 * that CPU's subsequent load of ->completed must see the new
> > > > +	 * value, and therefore increment the counter in the other rank.
> > > > +	 */
> > > > +	smp_mb(); /* A */
> > > > +
> > > > +	/*
> > > > +	 * Now, we check the ->snap array that srcu_readers_active_idx()
> > > > +	 * filled in from the per-CPU counter values.  Since both
> > > > +	 * __srcu_read_lock() and __srcu_read_unlock() increment the
> > > > +	 * upper bits of the per-CPU counter, an increment/decrement
> > > > +	 * pair will change the value of the counter.  Since there is
> > > > +	 * only one possible increment, the only way to wrap the counter
> > > > +	 * is to have a huge number of counter decrements, which requires
> > > > +	 * a huge number of tasks and huge SRCU read-side critical-section
> > > > +	 * nesting levels, even on 32-bit systems.
> > > > +	 *
> > > > +	 * All of the ways of confusing the readings require that the scan
> > > > +	 * in srcu_readers_active_idx() see the read-side task's decrement,
> > > > +	 * but not its increment.  However, between that decrement and
> > > > +	 * increment are smp_mb() B and C.  Either or both of these pair
> > > > +	 * with smp_mb() A above to ensure that the scan below will see
> > > > +	 * the read-side task's increment, thus noting a difference in
> > > > +	 * the counter values between the two passes.
> > > 
> > > Hi Paul,
> > > 
> > > I think the implementation is correct, but the explanation above might
> > > be improved. Let's consider the following a scenario, where a reader is
> > > migrated between increment of the counter and issuing the memory barrier
> > > in the read lock:
> > > 
> > > A,B,C are readers
> > > D is synchronize_rcu (one flip'n'wait)
> > > 
> > > CPU A               CPU B           CPU C         CPU D
> > >                                     c[1]++
> > >                                     smp_mb(1)
> > >                                                   read c[0] -> 0
> > > c[0]++
> > > (implicit smp_mb (2))
> > >       -> migrated ->
> > >                     (implicit smp_mb (3))
> > >                     smp_mb (4)
> > >                     smp_mb (5)
> > >                     c[1]--
> > >                                                   read c[1] -> -1
> > >                                                   read c[2] -> 1
> > >                                                   (false 0 sum)
> > >                                                   smp_mb (6)
> > >                                                   re-check each.
> > >                                     c[1]--
> > > 
> > > re-check: because we observed c[1] == -1, thanks to the implicit memory
> > > barriers within thread migration (2 and 3), we can assume that we _will_
> > > observe the updated value of c[0] after smp_mb (6).
> > > 
> > > The current explanation states that memory barriers 4 and 5, along with  
> > > 6, are responsible for ensuring that the increment will be observed by 
> > > the re-check. However, I doubt they have anything to do with it: it's 
> > > rather the implicit memory barriers in thread migration, along with
> > > program order guarantees on writes to the same address, that seem to be
> > > the reason why we can make this ordering assumption.
> > 
> > Please disregard the part about program order: CPU A writes to c[0], and
> > CPU B writes to c[1], which are two different memory locations. The rest
> > of my discussion stands though.
> > 
> > Simply reasoning about write to c[0], memory barriers 2-3, write to
> > c[1], along with c[1] read, memory barrier 6, and then c[0] read is
> > enough to explain the ordering guarantees you need, without invoking
> > program order.
> 
> I am assuming that if the scheduler migrates a process, it applies enough
> memory ordering to allow the proof to operate as if it had stayed on a
> single CPU throughout.

When applied to per-cpu variables, the reasoning cannot invoke program
order though, since we're touching two different memory locations.

> The reasoning for this would consider the
> scheduler access and memory barriers

indeed,

>-- but there would be an arbitrarily
> large number of migration patterns, so I am not convinced that it would
> help...

This brings up the following question, then: which memory barriers, in the
scheduler activity, act as full memory barriers for migrated threads? I
see that the rq_lock is taken, but this lock is permeable in one
direction (operations can spill into the critical section). I'm probably
missing something else, but this something else probably needs to be
documented somewhere, since we are making tons of assumptions based on
it.

Thanks,

Mathieu

> 
> 							Thanx, Paul
> 
> > Thanks,
> > 
> > Mathieu
> > 
> > > 
> > > Does it make sense, or shall I get another coffee to wake myself up ?
> > > ;)
> > > 
> > > Thanks,
> > > 
> > > Mathieu
> > > 
> > > > +	 *
> > > > +	 * Therefore, if srcu_readers_active_idx() returned zero, and
> > > > +	 * none of the counters changed, we know that the zero was the
> > > > +	 * correct sum.
> > > > +	 *
> > > > +	 * Of course, it is possible that a task might be delayed
> > > > +	 * for a very long time in __srcu_read_lock() after fetching
> > > > +	 * the index but before incrementing its counter.  This
> > > > +	 * possibility will be dealt with in __synchronize_srcu().
> > > > +	 */
> > > >  	for_each_possible_cpu(cpu)
> > > > -		sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
> > > > -	return sum;
> > > > +		if (sp->snap[cpu] !=
> > > > +		    ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
> > > > +			return false;  /* False zero reading! */
> > > > +	return true;
> > > >  }
> > > 
> > > 
> > > -- 
> > > Mathieu Desnoyers
> > > Operating System Efficiency R&D Consultant
> > > EfficiOS Inc.
> > > http://www.efficios.com
> > 
> > -- 
> > Mathieu Desnoyers
> > Operating System Efficiency R&D Consultant
> > EfficiOS Inc.
> > http://www.efficios.com
> > 
>
Mathieu Desnoyers Feb. 16, 2012, 11:14 a.m. | #9
* Peter Zijlstra (peterz@infradead.org) wrote:
> On Thu, 2012-02-16 at 05:50 -0500, Mathieu Desnoyers wrote:
> > > Ah, so something like this?
> > > 
> > >       ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 
> > >               SRCU_USAGE_COUNT + 1;
> > > 
> > > Now that you mention it, this does look nicer, applied here and to
> > > srcu_read_unlock().
> > 
> > I think Peter refers to __this_cpu_add(). 
> 
> I'm not sure that implies the ACCESS_ONCE() thing
> 

Fair point. The "generic" fallback for this_cpu_add() does not imply
ACCESS_ONCE() semantics AFAIK. I don't know if there would be a clean way
to get both without duplicating these operations needlessly.

Thanks,

Mathieu
Peter Zijlstra Feb. 16, 2012, 11:51 a.m. | #10
On Thu, 2012-02-16 at 06:00 -0500, Mathieu Desnoyers wrote:
> This brings the following question then: which memory barriers, in the
> scheduler activity, act as full memory barriers to migrated threads ? I
> see that the rq_lock is taken, but this lock is permeable in one
> direction (operations can spill into the critical section). I'm probably
> missing something else, but this something else probably needs to be
> documented somewhere, since we are doing tons of assumptions based on
> it. 

A migration consists of two context switches, one switching out the task
on the old cpu, and one switching in the task on the new cpu.

Now on x86 all the rq->lock grabbery is plenty implied memory barriers
to make anybody happy.

But I think, since there's guaranteed order (can't switch to before
switching from) you can match the UNLOCK from the switch-from to the
LOCK from the switch-to to make your complete MB.

Does that work or do we need more?
Mathieu Desnoyers Feb. 16, 2012, 12:18 p.m. | #11
* Peter Zijlstra (peterz@infradead.org) wrote:
> On Thu, 2012-02-16 at 06:00 -0500, Mathieu Desnoyers wrote:
> > This brings the following question then: which memory barriers, in the
> > scheduler activity, act as full memory barriers to migrated threads ? I
> > see that the rq_lock is taken, but this lock is permeable in one
> > direction (operations can spill into the critical section). I'm probably
> > missing something else, but this something else probably needs to be
> > documented somewhere, since we are doing tons of assumptions based on
> > it. 
> 
> A migration consists of two context switches, one switching out the task
> on the old cpu, and one switching in the task on the new cpu.

If we have memory barriers on both context switches, then we should be
good. I just fail to see them.

> Now on x86 all the rq->lock grabbery is plenty implied memory barriers
> to make anybody happy.

Indeed. Outside of x86 it is far less certain though.

> But I think, since there's guaranteed order (can't switch to before
> switching from) you can match the UNLOCK from the switch-from to the
> LOCK from the switch-to to make your complete MB.
> 
> Does that work or do we need more?

Hrm, I think we'd need a little more than just lock/unlock ordering
guarantees. Let's consider the following, where the stores would be
expected to be seen as "store A before store B" by CPU 2

CPU 0             CPU 1               CPU 2

                                      load B, smp_rmb, load A in loop,
                                      expecting that when updated A is
                                      observed, B is always observed as
                                      updated too.
store A
(lock is permeable:
outside can leak
inside)
lock(rq->lock)

      -> migration ->

                  unlock(rq->lock)
                  (lock is permeable:
                  outside can leak inside)
                  store B

As we notice, the "store A" could theoretically be still pending in
CPU 0's write buffers when store B occurs, because the memory barrier
associated with "lock" only has acquire semantics (so memory operations
prior to the lock can leak into the critical section).

Given that the unlock(rq->lock) on CPU 0 is not guaranteed to happen in
a bound time-frame, no memory barrier with release semantic can be
assumed to have happened. This could happen if we have a long critical
section holding the rq->lock on CPU 0, and a much shorter critical
section on CPU 1.

Does that make sense, or should I get my first morning coffee? :)

Thanks,

Mathieu
Peter Zijlstra Feb. 16, 2012, 12:44 p.m. | #12
On Thu, 2012-02-16 at 07:18 -0500, Mathieu Desnoyers wrote:
> 
> Hrm, I think we'd need a little more than just lock/unlock ordering
> guarantees. Let's consider the following, where the stores would be
> expected to be seen as "store A before store B" by CPU 2
> 
> CPU 0             CPU 1               CPU 2
> 
>                                       load B, smp_rmb, load A in loop,
>                                       expecting that when updated A is
>                                       observed, B is always observed as
>                                       updated too.
> store A
> (lock is permeable:
> outside can leak
> inside)
> lock(rq->lock)
> 
>       -> migration ->
> 
>                   unlock(rq->lock)
>                   (lock is permeable:
>                   outside can leak inside)
>                   store B

You got the pairing the wrong way around, I suggested:

  store A

  switch-out
    UNLOCK

  	-> migration ->

			switch-in
			  LOCK

			store B

While both LOCK and UNLOCK are semi-permeable, A won't pass the UNLOCK
and B won't pass the LOCK.

Yes, A can pass switch-out LOCK, but that doesn't matter much since the
switch-in cannot happen until we've passed UNLOCK.

And yes B can pass switch-in UNLOCK, but again, I can't see that being a
problem since the LOCK will avoid it being visible before A.

> Does that make sense, or should I get my first morning coffee ? :) 

Probably.. but that's not saying I'm not wrong ;-)
Mathieu Desnoyers Feb. 16, 2012, 2:52 p.m. | #13
* Peter Zijlstra (peterz@infradead.org) wrote:
> On Thu, 2012-02-16 at 07:18 -0500, Mathieu Desnoyers wrote:
> > 
> > Hrm, I think we'd need a little more than just lock/unlock ordering
> > guarantees. Let's consider the following, where the stores would be
> > expected to be seen as "store A before store B" by CPU 2
> > 
> > CPU 0             CPU 1               CPU 2
> > 
> >                                       load B, smp_rmb, load A in loop,
> >                                       expecting that when updated A is
> >                                       observed, B is always observed as
> >                                       updated too.
> > store A
> > (lock is permeable:
> > outside can leak
> > inside)
> > lock(rq->lock)
> > 
> >       -> migration ->
> > 
> >                   unlock(rq->lock)
> >                   (lock is permeable:
> >                   outside can leak inside)
> >                   store B
> 
> You got the pairing the wrong way around, I suggested:
> 
>   store A
> 
>   switch-out
>     UNLOCK
> 
>   	-> migration ->
> 
> 			switch-in
> 			  LOCK
> 
> 			store B
> 
> While both LOCK and UNLOCK are semi-permeable, A won't pass the UNLOCK
> and B won't pass the LOCK.
> 
> Yes, A can pass switch-out LOCK, but that doesn't matter much since the
> switch-in cannot happen until we've passed UNLOCK.
> 
> And yes B can pass switch-in UNLOCK, but again, I can't see that being a
> problem since the LOCK will avoid it being visible before A.

Ah, so this is what I missed: the context switch has its lock/unlock
pair, the following migration is performed under its own lock/unlock
pair, and the following context switch also has its lock/unlock pair. So
yes, this should be sufficient to act as a full memory barrier.

> 
> > Does that make sense, or should I get my first morning coffee ? :) 
> 
> Probably.. but that's not saying I'm not wrong ;-)

It does pass my 1st morning coffee test still, so it looks good, at
least to me. :-)

Back to the initial subject: I think it would be important for general
code understanding that when RCU performs tricks on per-cpu variables
based on scheduler-migration memory-ordering assumptions, it says so
explicitly, rather than claiming that the memory barriers match those at
RCU read lock/unlock sites, which is not quite right.

Thanks,

Mathieu
Peter Zijlstra Feb. 16, 2012, 2:58 p.m. | #14
On Thu, 2012-02-16 at 09:52 -0500, Mathieu Desnoyers wrote:
> Ah, so this is what I missed: the context switch has its lock/unlock
> pair, the following migration is performed under its own lock/unlock
> pair, and the following context switch also has its lock/unlock pair. So
> yes, this should be sufficient to act as a full memory barrier. 

In fact it typically looks like:

 LOCK(rq0->lock)
  switch-from-A
 UNLOCK(rq0->lock);

	LOCK(rq0->lock);
	LOCK(rq1->lock);
	  migrate-A
	UNLOCK(rq1->lock);
	UNLOCK(rq0->lock);

		LOCK(rq1->lock);
		  switch-to-A
		UNLOCK(rq1->lock);

the migrate taking both locks involved guarantees that the switch-to
always comes _after_ the switch_from. The migrate might be performed on
a completely unrelated cpu although typically it would be either the old
(push) or the new (pull).
Paul E. McKenney Feb. 16, 2012, 3:13 p.m. | #15
On Thu, Feb 16, 2012 at 01:44:41PM +0100, Peter Zijlstra wrote:
> On Thu, 2012-02-16 at 07:18 -0500, Mathieu Desnoyers wrote:
> > 
> > Hrm, I think we'd need a little more than just lock/unlock ordering
> > guarantees. Let's consider the following, where the stores would be
> > expected to be seen as "store A before store B" by CPU 2
> > 
> > CPU 0             CPU 1               CPU 2
> > 
> >                                       load B, smp_rmb, load A in loop,
> >                                       expecting that when updated A is
> >                                       observed, B is always observed as
> >                                       updated too.
> > store A
> > (lock is permeable:
> > outside can leak
> > inside)
> > lock(rq->lock)
> > 
> >       -> migration ->
> > 
> >                   unlock(rq->lock)
> >                   (lock is permeable:
> >                   outside can leak inside)
> >                   store B
> 
> You got the pairing the wrong way around, I suggested:
> 
>   store A
> 
>   switch-out
>     UNLOCK
> 
>   	-> migration ->
> 
> 			switch-in
> 			  LOCK
> 
> 			store B
> 
> While both LOCK and UNLOCK are semi-permeable, A won't pass the UNLOCK
> and B won't pass the LOCK.
> 
> Yes, A can pass switch-out LOCK, but that doesn't matter much since the
> switch-in cannot happen until we've passed UNLOCK.
> 
> And yes B can pass switch-in UNLOCK, but again, I can't see that being a
> problem since the LOCK will avoid it being visible before A.
> 
> > Does that make sense, or should I get my first morning coffee ? :) 
> 
> Probably.. but that's not saying I'm not wrong ;-)

It does look good to me, but given that I don't drink coffee, you should
take that with a large grain of salt.

							Thanx, Paul
Lai Jiangshan Feb. 20, 2012, 7:15 a.m. | #16
On 02/13/2012 10:09 AM, Paul E. McKenney wrote:

>  /*
>   * Helper function for synchronize_srcu() and synchronize_srcu_expedited().
>   */
> -static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void))
> +static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
>  {
>  	int idx;
>  
> @@ -178,90 +316,51 @@ static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void))
>  			   !lock_is_held(&rcu_sched_lock_map),
>  			   "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");
>  
> -	idx = sp->completed;
> +	idx = ACCESS_ONCE(sp->completed);
>  	mutex_lock(&sp->mutex);
>  
>  	/*
>  	 * Check to see if someone else did the work for us while we were
> -	 * waiting to acquire the lock.  We need -two- advances of
> +	 * waiting to acquire the lock.  We need -three- advances of
>  	 * the counter, not just one.  If there was but one, we might have
>  	 * shown up -after- our helper's first synchronize_sched(), thus
>  	 * having failed to prevent CPU-reordering races with concurrent
> -	 * srcu_read_unlock()s on other CPUs (see comment below).  So we
> -	 * either (1) wait for two or (2) supply the second ourselves.
> +	 * srcu_read_unlock()s on other CPUs (see comment below).  If there
> +	 * was only two, we are guaranteed to have waited through only one
> +	 * full index-flip phase.  So we either (1) wait for three or
> +	 * (2) supply the additional ones we need.
>  	 */
>  
> -	if ((sp->completed - idx) >= 2) {
> +	if (sp->completed == idx + 2)
> +		idx = 1;
> +	else if (sp->completed == idx + 3) {
>  		mutex_unlock(&sp->mutex);
>  		return;
> -	}
> -
> -	sync_func();  /* Force memory barrier on all CPUs. */
> +	} else
> +		idx = 0;


Hi, Paul

I don't think this check-and-return path is needed since we will introduce call_srcu().
We just need correct code to show how it works and to be used for a while;
the new call_srcu() will be implemented based on this correct code, and this path will then be removed.

And I think this unneeded check-and-return path is incorrect. See the following:

Reader			Updater						Helper thread
			old_ptr = rcu_ptr;
			/* rcu_ptr = NULL; but be reordered to (1) */
			start synchronize_srcu()
			idx = ACCESS_ONCE(sp->completed);(2)
									synchronize_srcu()
									synchronize_srcu()
srcu_read_lock();
old_ptr = rcu_ptr;
			rcu_ptr = NULL;(1)
			mutex_lock() and read sp->completed
			and return from synchronize_srcu()
			free(old_ptr);
use freed old_ptr
srcu_read_unlock();


So, we need an smp_mb() between (1) and (2) to force the order.

__synchronize_srcu() {
	smp_mb(); /* F */
	idx = ACCESS_ONCE(sp->completed); /* (2) */
	....
}

And this smp_mb() F is paired with the helper's smp_mb() D. So if the Updater sees X advances of
->completed, the Updater must have seen X *full* flip_and_wait() cycles. So we need to see only
-two- advances of ->completed from the Helper, not -three-.

        if (sp->completed == idx + 1)
                idx = 1;
        else if (sp->completed == idx + 2) {
                mutex_unlock(&sp->mutex);
                return;
        } else
                idx = 0;


Or simpler:

__synchronize_srcu() {
	unsigned int idx;   /* <-------- unsigned */

	/* comments for smp_mb() F */
	smp_mb(); /* F */
	idx = ACCESS_ONCE(sp->completed);

	mutex_lock(&sp->mutex);
	idx = sp->completed - idx;

	/* original comments */
	for (; idx < 2; idx++)
                flip_idx_and_wait(sp, expedited);
        mutex_unlock(&sp->mutex);
}

Finally, I can't understand the comments of this check-and-return path,
so maybe my reply above is totally wrong.
But the comments of this check-and-return path do not describe the code
well (especially the ordering), and they contain the old "synchronize_sched()",
which confuses me.

My conclusion: we can just remove the check-and-return path to reduce
the complexity, since we will introduce call_srcu().

This new srcu implementation is great, especially the SRCU_USAGE_COUNT
added on every lock/unlock, which forces any increment/decrement pair
to change the counter.


Thanks,
Lai
Paul E. McKenney Feb. 20, 2012, 5:44 p.m. | #17
On Mon, Feb 20, 2012 at 03:15:33PM +0800, Lai Jiangshan wrote:
> On 02/13/2012 10:09 AM, Paul E. McKenney wrote:
> 
> >  /*
> >   * Helper function for synchronize_srcu() and synchronize_srcu_expedited().
> >   */
> > -static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void))
> > +static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
> >  {
> >  	int idx;
> >  
> > @@ -178,90 +316,51 @@ static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void))
> >  			   !lock_is_held(&rcu_sched_lock_map),
> >  			   "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");
> >  
> > -	idx = sp->completed;
> > +	idx = ACCESS_ONCE(sp->completed);
> >  	mutex_lock(&sp->mutex);
> >  
> >  	/*
> >  	 * Check to see if someone else did the work for us while we were
> > -	 * waiting to acquire the lock.  We need -two- advances of
> > +	 * waiting to acquire the lock.  We need -three- advances of
> >  	 * the counter, not just one.  If there was but one, we might have
> >  	 * shown up -after- our helper's first synchronize_sched(), thus
> >  	 * having failed to prevent CPU-reordering races with concurrent
> > -	 * srcu_read_unlock()s on other CPUs (see comment below).  So we
> > -	 * either (1) wait for two or (2) supply the second ourselves.
> > +	 * srcu_read_unlock()s on other CPUs (see comment below).  If there
> > +	 * was only two, we are guaranteed to have waited through only one
> > +	 * full index-flip phase.  So we either (1) wait for three or
> > +	 * (2) supply the additional ones we need.
> >  	 */
> >  
> > -	if ((sp->completed - idx) >= 2) {
> > +	if (sp->completed == idx + 2)
> > +		idx = 1;
> > +	else if (sp->completed == idx + 3) {
> >  		mutex_unlock(&sp->mutex);
> >  		return;
> > -	}
> > -
> > -	sync_func();  /* Force memory barrier on all CPUs. */
> > +	} else
> > +		idx = 0;
> 
> 
> Hi, Paul
> 
> I don't think this check-and-return path is needed since we will introduce call_srcu().
> 
> We just need a correct code to show how it works and to be used for a while,
> and new call_srcu() will be implemented based on this correct code which will be removed.

Hello, Lai!

Yep, this code will be replaced with a state machine driven by callbacks.

> And I think this unneeded check-and-return path is incorrect. See the following:
> 
> Reader			Updater						Helper thread
> 			old_ptr = rcu_ptr;
> 			/* rcu_ptr = NULL; but be reordered to (1) */
> 			start synchronize_srcu()
> 			idx = ACCESS_ONCE(sp->completed);(2)
> 									synchronize_srcu()
> 									synchronize_srcu()
> srcu_read_lock();
> old_ptr = rcu_ptr;
> 			rcu_ptr = NULL;(1)
> 			mutex_lock() and read sp->completed
> 			and return from synchronize_srcu()
> 			free(old_ptr);
> use freed old_ptr
> srcu_read_unlock();
> 
> 
> So, we need an smp_mb() between (1) and (2) to force the order.
> 
> __synchronize_srcu() {
> 	smp_mb(); /* F */
> 	idx = ACCESS_ONCE(sp->completed); /* (2) */

And one here as well because mutex_lock() allows code to bleed in from
outside the critical section.

> 	....
> }

Good catch!  And shows the limitations of testing -- I hit this pretty
hard and didn't get a failure.  I was focused so much on the complex
part of the patch that I failed to get the simple stuff right!!!

Shows the value of the Linux community's review processes, I guess.  ;-)

> And this smp_mb() F is paired with helper's smp_mb() D. So if Updater sees X advances of
> ->completed, Updater must sees X times of *full* flip_and_wait(). So We need see -two- advances of
> ->completed from Helper only, not -three-.

Hmmm...  Let's see...  The case I was worried about is where the updater
samples ->completed just before it is incremented, then samples it again
just after it is incremented a second time.  So you are arguing that this
cannot happen because the second sample occurs after acquiring the lock,
so that the second flip-and-wait cycle has to have already completed,
correct?

So waiting for three is appropriate for mutex_trylock(), but is overly
conservative for mutex_lock().

>         if (sp->completed == idx + 1)
>                 idx = 1;
>         else if (sp->completed == idx + 2) {
>                 mutex_unlock(&sp->mutex);
>                 return;
>         } else
>                 idx = 0;
> 
> 
> Or simpler:
> 
> __synchronize_srcu() {
> 	unsigned int idx;   /* <-------- unsigned */
> 
> 	/* comments for smp_mb() F */
> 	smp_mb(); /* F */
> 	idx = ACCESS_ONCE(sp->completed);
> 
> 	mutex_lock(&sp->mutex);
> 	idx = sp->completed - idx;
> 
> 	/* original comments */
> 	for (; idx < 2; idx++)
>                 flip_idx_and_wait(sp, expedited);
>         mutex_unlock(&sp->mutex);
> }
> 
> At last, I can't understand the comments of this check-and-return path.
> So maybe the above reply and I are totally wrong.

I -think- you might be correct, but my approach is going to be to implement
call_srcu() which will eliminate this anyway.

> But the comments of this check-and-return path does not describe the code
> well(especially the order), and it contains the old "synchronize_sched()"
> which make me confuse.

The diffs are confusing -- I have to look at the actual code in this case.

> My conclusion, we can just remove the check-and-return path to reduce
> the complexity since we will introduce call_srcu().

If I actually submit the above upstream, that would be quite reasonable.
My thought is that patch remains RFC and the upstream version has
call_srcu().

> This new srcu is very great, especially the SRCU_USAGE_COUNT for every
> lock/unlock witch forces any increment/decrement pair changes the counter
> for me.

Glad you like it!  ;-)

And thank you for your review and feedback!

							Thanx, Paul

> Thanks,
> Lai
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>
Lai Jiangshan Feb. 21, 2012, 1:11 a.m. | #18
On 02/21/2012 01:44 AM, Paul E. McKenney wrote:

> 
>> My conclusion, we can just remove the check-and-return path to reduce
>> the complexity since we will introduce call_srcu().
> 
> If I actually submit the above upstream, that would be quite reasonable.
> My thought is that patch remains RFC and the upstream version has
> call_srcu().

Has the work on call_srcu() been started or drafted?

> 
>> This new srcu is very great, especially the SRCU_USAGE_COUNT for every
>> lock/unlock witch forces any increment/decrement pair changes the counter
>> for me.
> 
> Glad you like it!  ;-)
> 
> And thank you for your review and feedback!

Could you add my Reviewed-by when this patch is finally submitted?


Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>

Thanks
Lai
Paul E. McKenney Feb. 21, 2012, 1:50 a.m. | #19
On Tue, Feb 21, 2012 at 09:11:47AM +0800, Lai Jiangshan wrote:
> On 02/21/2012 01:44 AM, Paul E. McKenney wrote:
> 
> > 
> >> My conclusion, we can just remove the check-and-return path to reduce
> >> the complexity since we will introduce call_srcu().
> > 
> > If I actually submit the above upstream, that would be quite reasonable.
> > My thought is that patch remains RFC and the upstream version has
> > call_srcu().
> 
> Has the work on call_srcu() been started or drafted?

I do have a draft design, and am currently beating it into shape.
No actual code yet, though.  The general idea at the moment is as follows:

o	The state machine must be preemptible.	I recently received
	a bug report about 200-microsecond latency spikes on a system
	with more than a thousand CPUs, so the summation of the per-CPU
	counters and subsequent recheck cannot be in a preempt-disable
	region.  I am therefore currently thinking in terms of a kthread.

o	At the moment, having a per-srcu_struct kthread seems excessive.
	I am planning on a single kthread to do the counter summation
	and checking.  Further parallelism might be useful in the future,
	but I would want to see someone run into problems before adding
	more complexity.

o	There needs to be a linked list of srcu_struct structures so
	that they can be traversed by the state-machine kthread.

o	If there are expedited SRCU callbacks anywhere, the kthread
	would scan through the list of srcu_struct structures quickly
	(perhaps pausing a few microseconds between).  If there are no
	expedited SRCU callbacks, the kthread would wait a jiffy or so
	between scans.

o	If a given srcu_struct structure has been scanned too many times
	(say, more than ten times) while waiting for the counters to go
	to zero, it loses expeditedness.  It makes no sense for the kthread
	to go CPU-bound just because some SRCU reader somewhere is blocked
	in its SRCU read-side critical section.

o	Expedited SRCU callbacks cannot be delayed by normal SRCU
	callbacks, but neither can expedited callbacks be allowed to
	starve normal callbacks.  I am thinking in terms of invoking these
	from softirq context, with a pair of multi-tailed callback queues
	per CPU, stored in the same structure as the per-CPU counters.

o	There are enough srcu_struct structures in the Linux kernel that
	it does not make sense to force softirq to dig through them all
	any time any one of them has callbacks ready to invoke.  One way
	to deal with this is to have a per-CPU set of linked lists
	of srcu_struct_array structures, so that the kthread enqueues
	a given structure when it transitions to having callbacks ready
	to invoke, and softirq dequeues it.  This can be done locklessly
	given that there is only one producer and one consumer.

o	We can no longer use the trick of pushing callbacks to another
	CPU from the CPU_DYING notifier because it is likely that CPU
	hotplug will stop using stop_cpus().  I am therefore thinking
	in terms of a set of orphanages (two for normal, two more for
	expedited -- one set of each for callbacks ready to invoke,
	the other for still-waiting callbacks).

o	There will need to be an srcu_barrier() that can be called
	before cleanup_srcu_struct().  Otherwise, someone will end up
	freeing up an srcu_struct that still has callbacks outstanding.

But what did you have in mind?

> >> This new srcu is very great, especially the SRCU_USAGE_COUNT added for
> >> every lock/unlock, which forces any increment/decrement pair to change
> >> the counter.
> > 
> > Glad you like it!  ;-)
> > 
> > And thank you for your review and feedback!
> 
> Could you add my Reviewed-by when this patch is last submitted?
> 
> 
> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>

Will do, thank you!

							Thanx, Paul
Lai Jiangshan Feb. 21, 2012, 8:44 a.m. | #20
On 02/21/2012 09:50 AM, Paul E. McKenney wrote:
> On Tue, Feb 21, 2012 at 09:11:47AM +0800, Lai Jiangshan wrote:
>> On 02/21/2012 01:44 AM, Paul E. McKenney wrote:
>>
>>>
>>>> My conclusion, we can just remove the check-and-return path to reduce
>>>> the complexity since we will introduce call_srcu().
>>>
>>> If I actually submit the above upstream, that would be quite reasonable.
>>> My thought is that patch remains RFC and the upstream version has
>>> call_srcu().
>>
>> Has the work on call_srcu() been started or drafted?
> 
> I do have a draft design, and am currently beating it into shape.
> No actual code yet, though.  The general idea at the moment is as follows:

If you don't mind, I will implement it.  (It requires your new version of the SRCU implementation.)

> 
> o	The state machine must be preemptible.	I recently received
> 	a bug report about 200-microsecond latency spikes on a system
> 	with more than a thousand CPUs, so the summation of the per-CPU
> 	counters and subsequent recheck cannot be in a preempt-disable
> 	region.  I am therefore currently thinking in terms of a kthread.

sure.

Addition:
	SRCU callbacks must run in process context and be able to
	sleep (and therefore they may finish out of order).  We can't
	change the current rcu-callback scheme, so we must design a
	new one for SRCU.  This does not introduce much complexity,
	because we reuse the workqueue: every ready srcu-callback is
	delivered to the workqueue, so the state-machine thread only
	does counter summation, checking, and delivering.

	The workqueue does all the complexity for us; it will
	automatically allocate threads when needed (when a callback
	sleeps, or when there are too many ready callbacks).

	struct srcu_head is a little bigger than struct rcu_head;
	it contains a union for struct work_struct.

	(synchronize_srcu()'s callbacks are handled specially.)

> 
> o	At the moment, having a per-srcu_struct kthread seems excessive.
> 	I am planning on a single kthread to do the counter summation
> 	and checking.  Further parallelism might be useful in the future,
> 	but I would want to see someone run into problems before adding
> 	more complexity.

Simplicity is important; I vote for a single kthread to do the counter
summation and checking, while keeping it convenient to introduce
parallel threads later.

> 
> o	There needs to be a linked list of srcu_struct structures so
> 	that they can be traversed by the state-machine kthread.
> 
> o	If there are expedited SRCU callbacks anywhere, the kthread
> 	would scan through the list of srcu_struct structures quickly
> 	(perhaps pausing a few microseconds between).  If there are no
> 	expedited SRCU callbacks, the kthread would wait a jiffy or so
> 	between scans.

Sure
But I think generic expedited SRCU callbacks make no sense; we could
just allow expedited SRCU callbacks for synchronize_srcu_expedited()
only.

> 
> o	If a given srcu_struct structure has been scanned too many times
> 	(say, more than ten times) while waiting for the counters to go
> 	to zero, it loses expeditedness.  It makes no sense for the kthread
> 	to go CPU-bound just because some SRCU reader somewhere is blocked
> 	in its SRCU read-side critical section.
> 
> o	Expedited SRCU callbacks cannot be delayed by normal SRCU
> 	callbacks, but neither can expedited callbacks be allowed to
> 	starve normal callbacks.  I am thinking in terms of invoking these
> 	from softirq context, with a pair of multi-tailed callback queues
> 	per CPU, stored in the same structure as the per-CPU counters.

In my additional design, callbacks do not run in softirq context,
and the state-machine thread is not delayed by any normal callbacks.

But all of synchronize_srcu()'s callbacks are done in the state-machine
thread (each is just a wakeup), not in the workqueue.

Since we allow expedited SRCU callbacks for synchronize_srcu_expedited() only,
expedited SRCU callbacks will not be delayed by normal SRCU callbacks.

I don't think we need to use per-CPU callback queues, since SRCU callbacks
are rare (compared to RCU callbacks).

> 
> o	There are enough srcu_struct structures in the Linux kernel that
> 	it does not make sense to force softirq to dig through them all
> 	any time any one of them has callbacks ready to invoke.  One way
> 	to deal with this is to have a per-CPU set of linked lists of
> 	srcu_struct_array structures, so that the kthread enqueues
> 	a given structure when it transitions to having callbacks ready
> 	to invoke, and softirq dequeues it.  This can be done locklessly
> 	given that there is only one producer and one consumer.
> 
> o	We can no longer use the trick of pushing callbacks to another
> 	CPU from the CPU_DYING notifier because it is likely that CPU
> 	hotplug will stop using stop_cpus().  I am therefore thinking
> 	in terms of a set of orphanages (two for normal, two more for
> 	expedited -- one set of each for callbacks ready to invoke,
> 	the other for still-waiting callbacks).

No such issues arise in srcu.c if we use the workqueue.

> 
> o	There will need to be an srcu_barrier() that can be called
> 	before cleanup_srcu_struct().  Otherwise, someone will end up
> 	freeing up an srcu_struct that still has callbacks outstanding.

Sure.

When we use the workqueue, the delivery is ordered:

srcu_barrier()
{
	synchronize_srcu();
	flush_workqueue();
}


> 
> But what did you have in mind?

I agree with most of your design, but my ideas of relaxing the
constraints on callbacks and using the workqueue make things simpler.

Thanks,
Lai

> 
> >>>> This new srcu is very great, especially the SRCU_USAGE_COUNT added for
> >>>> every lock/unlock, which forces any increment/decrement pair to change
> >>>> the counter.
>>>
>>> Glad you like it!  ;-)
>>>
>>> And thank you for your review and feedback!
>>
>> Could you add my Reviewed-by when this patch is last submitted?
>>
>>
>> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> 
> Will do, thank you!
> 
> 							Thanx, Paul
> 
>
Paul E. McKenney Feb. 21, 2012, 5:24 p.m. | #21
On Tue, Feb 21, 2012 at 04:44:22PM +0800, Lai Jiangshan wrote:
> On 02/21/2012 09:50 AM, Paul E. McKenney wrote:
> > On Tue, Feb 21, 2012 at 09:11:47AM +0800, Lai Jiangshan wrote:
> >> On 02/21/2012 01:44 AM, Paul E. McKenney wrote:
> >>
> >>>
> >>>> My conclusion, we can just remove the check-and-return path to reduce
> >>>> the complexity since we will introduce call_srcu().
> >>>
> >>> If I actually submit the above upstream, that would be quite reasonable.
> >>> My thought is that patch remains RFC and the upstream version has
> >>> call_srcu().
> >>
> >> Has the work on call_srcu() been started or drafted?
> > 
> > I do have a draft design, and am currently beating it into shape.
> > No actual code yet, though.  The general idea at the moment is as follows:
> 
> If you don't mind, I will implement it.  (It requires your new version of the SRCU implementation.)

I would very much welcome a patch from you for call_srcu()!

I have an rcu/srcu branch for this purpose at:

	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git

> > o	The state machine must be preemptible.	I recently received
> > 	a bug report about 200-microsecond latency spikes on a system
> > 	with more than a thousand CPUs, so the summation of the per-CPU
> > 	counters and subsequent recheck cannot be in a preempt-disable
> > 	region.  I am therefore currently thinking in terms of a kthread.
> 
> sure.
> 
> Addition:
> 	SRCU callbacks must run in process context and be able to
> 	sleep (and therefore they may finish out of order).  We can't
> 	change the current rcu-callback scheme, so we must design a
> 	new one for SRCU.  This does not introduce much complexity,
> 	because we reuse the workqueue: every ready srcu-callback is
> 	delivered to the workqueue, so the state-machine thread only
> 	does counter summation, checking, and delivering.
> 
> 	The workqueue does all the complexity for us; it will
> 	automatically allocate threads when needed (when a callback
> 	sleeps, or when there are too many ready callbacks).
> 
> 	struct srcu_head is a little bigger than struct rcu_head;
> 	it contains a union for struct work_struct.
> 
> 	(synchronize_srcu()'s callbacks are handled specially.)

Full separation of counter summation etc. from callback invocation sounds
very good.  There does need to be something like an srcu_barrier(), so the
SRCU callbacks posted by a given CPU do need to be invoked in the order
that they were posted (unless you have some other trick in mind, in which
case please let us all know what it is).  Given the need to keep things
in order, I do not believe that we can allow the SRCU callbacks to sleep.
Note that there is no need for srcu_barrier() to wait on the expedited
SRCU callbacks -- there should not be a call_srcu_expedited(), so the
only way for the callbacks to appear is synchronize_srcu_expedited(),
which does not return until after its callback has been invoked.
This means that expedited SRCU callbacks need not be ordered -- in fact,
the expedited path need not use callbacks at all, if that works better.
(I was thinking in terms of expedited callbacks for SRCU because that
makes the state machine simpler.)

That said, the idea of invoking them from a workqueue seems reasonable,
for example, you can do local_bh_disable() around each invocation.  Doing
this also gives us the flexibility to move SRCU callback invocation into
softirq if that proves necessary for whatever reason.  (Yes, I do still
well remember the 3.0 fun and excitement with RCU and kthreads!)

But I have to ask...  Why does SRCU need more space than RCU?  Can't
the selection of workqueue be implicit based on which CPU the callback
is queued on?

> > o	At the moment, having a per-srcu_struct kthread seems excessive.
> > 	I am planning on a single kthread to do the counter summation
> > 	and checking.  Further parallelism might be useful in the future,
> > 	but I would want to see someone run into problems before adding
> > 	more complexity.
> 
> Simplicity is important; I vote for a single kthread to do the counter
> summation and checking, while keeping it convenient to introduce
> parallel threads later.

Very good!

> > o	There needs to be a linked list of srcu_struct structures so
> > 	that they can be traversed by the state-machine kthread.
> > 
> > o	If there are expedited SRCU callbacks anywhere, the kthread
> > 	would scan through the list of srcu_struct structures quickly
> > 	(perhaps pausing a few microseconds between).  If there are no
> > 	expedited SRCU callbacks, the kthread would wait a jiffy or so
> > 	between scans.
> 
> Sure
> But I think generic expedited SRCU callbacks make no sense; we could
> just allow expedited SRCU callbacks for synchronize_srcu_expedited()
> only.

Agreed -- We don't have call_rcu_expedited(), so we should definitely
not provide call_srcu_expedited().

> > o	If a given srcu_struct structure has been scanned too many times
> > 	(say, more than ten times) while waiting for the counters to go
> > 	to zero, it loses expeditedness.  It makes no sense for the kthread
> > 	to go CPU-bound just because some SRCU reader somewhere is blocked
> > 	in its SRCU read-side critical section.
> > 
> > o	Expedited SRCU callbacks cannot be delayed by normal SRCU
> > 	callbacks, but neither can expedited callbacks be allowed to
> > 	starve normal callbacks.  I am thinking in terms of invoking these
> > 	from softirq context, with a pair of multi-tailed callback queues
> > 	per CPU, stored in the same structure as the per-CPU counters.
> 
> In my additional design, callbacks do not run in softirq context,
> and the state-machine thread is not delayed by any normal callbacks.
> 
> But all of synchronize_srcu()'s callbacks are done in the state-machine
> thread (each is just a wakeup), not in the workqueue.

So the idea is that the function pointer is a task_struct pointer in this
case, and that you check for a text address to decide which?  In any case,
a per-callback wake up should be OK.

> Since we allow expedited SRCU callbacks for synchronize_srcu_expedited() only,
> expedited SRCU callbacks will not be delayed by normal SRCU callbacks.

OK, good.

> I don't think we need to use per-CPU callback queues, since SRCU callbacks
> are rare (compared to RCU callbacks).

Indeed, everything currently in the kernel would turn into a wakeup.  ;-)

> > o	There are enough srcu_struct structures in the Linux kernel that
> > 	it does not make sense to force softirq to dig through them all
> > 	any time any one of them has callbacks ready to invoke.  One way
> > 	to deal with this is to have a per-CPU set of linked lists of
> > 	srcu_struct_array structures, so that the kthread enqueues
> > 	a given structure when it transitions to having callbacks ready
> > 	to invoke, and softirq dequeues it.  This can be done locklessly
> > 	given that there is only one producer and one consumer.
> > 
> > o	We can no longer use the trick of pushing callbacks to another
> > 	CPU from the CPU_DYING notifier because it is likely that CPU
> > 	hotplug will stop using stop_cpus().  I am therefore thinking
> > 	in terms of a set of orphanages (two for normal, two more for
> > 	expedited -- one set of each for callbacks ready to invoke,
> > 	the other for still-waiting callbacks).
> 
> No such issues arise in srcu.c if we use the workqueue.

OK, sounds good.

> > o	There will need to be an srcu_barrier() that can be called
> > 	before cleanup_srcu_struct().  Otherwise, someone will end up
> > 	freeing up an srcu_struct that still has callbacks outstanding.
> 
> Sure.
> 
> When we use the workqueue, the delivery is ordered:
> 
> srcu_barrier()
> {
> 	synchronize_srcu();
> 	flush_workqueue();
> }

Would this handle the case where the SRCU kthread migrated from one CPU
to another?  My guess is that you would need to flush all the workqueues
that might have ever had SRCU-related work pending on them.

> > But what did you have in mind?
> 
> I agree with most of your design, but my ideas of relaxing the
> constraints on callbacks and using the workqueue make things simpler.

Your approach looks good in general.  My main concerns are:

1.	We should prohibit sleeping in SRCU callback functions, at least
	for the near future -- just in case something forces us to move
	to softirq.

2.	If SRCU callbacks can use rcu_head, that would be good -- I don't
	yet see why any additional space is needed.

							Thanx, Paul

> Thanks,
> Lai
> 
> > 
> >>>> This new srcu is very great, especially the SRCU_USAGE_COUNT added for
> >>>> every lock/unlock, which forces any increment/decrement pair to change
> >>>> the counter.
> >>>
> >>> Glad you like it!  ;-)
> >>>
> >>> And thank you for your review and feedback!
> >>
> >> Could you add my Reviewed-by when this patch is last submitted?
> >>
> >>
> >> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> > 
> > Will do, thank you!
> > 
> > 							Thanx, Paul
> > 
> > 
>
Lai Jiangshan March 6, 2012, 8:42 a.m. | #22
Patches 1-4 are preparatory.
Patches 1 and 2 can be merged now; maybe patch 3 can as well.

Patch 5 is the draft per-CPU state-machine call_srcu() implementation;
	it will be split up before merging.

Patch 6 adds an SRCU torture test.

Lai Jiangshan (6):
  remove unused srcu_barrier()
  Don't touch the snap in srcu_readers_active()
  use "int trycount" instead of "bool expedited"
  remove flip_idx_and_wait()
  implement call_srcu()
  add srcu torture test

 include/linux/srcu.h |   49 +++++-
 kernel/rcutorture.c  |   68 ++++++++-
 kernel/srcu.c        |  421 +++++++++++++++++++++++++++++++++++++++++---------
 3 files changed, 454 insertions(+), 84 deletions(-)

Patch

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index d3d5fa5..a478c8e 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -31,13 +31,19 @@ 
 #include <linux/rcupdate.h>
 
 struct srcu_struct_array {
-	int c[2];
+	unsigned long c[2];
 };
 
+/* Bit definitions for field ->c above and ->snap below. */
+#define SRCU_USAGE_BITS		2
+#define SRCU_REF_MASK		(ULONG_MAX >> SRCU_USAGE_BITS)
+#define SRCU_USAGE_COUNT	(SRCU_REF_MASK + 1)
+
 struct srcu_struct {
-	int completed;
+	unsigned completed;
 	struct srcu_struct_array __percpu *per_cpu_ref;
 	struct mutex mutex;
+	unsigned long snap[NR_CPUS];
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	struct lockdep_map dep_map;
 #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
index e0fe148..3d99162 100644
--- a/kernel/rcutorture.c
+++ b/kernel/rcutorture.c
@@ -620,7 +620,7 @@  static int srcu_torture_stats(char *page)
 	cnt += sprintf(&page[cnt], "%s%s per-CPU(idx=%d):",
 		       torture_type, TORTURE_FLAG, idx);
 	for_each_possible_cpu(cpu) {
-		cnt += sprintf(&page[cnt], " %d(%d,%d)", cpu,
+		cnt += sprintf(&page[cnt], " %d(%lu,%lu)", cpu,
 			       per_cpu_ptr(srcu_ctl.per_cpu_ref, cpu)->c[!idx],
 			       per_cpu_ptr(srcu_ctl.per_cpu_ref, cpu)->c[idx]);
 	}
diff --git a/kernel/srcu.c b/kernel/srcu.c
index ba35f3a..540671e 100644
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -73,19 +73,102 @@  EXPORT_SYMBOL_GPL(init_srcu_struct);
 #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
 
 /*
- * srcu_readers_active_idx -- returns approximate number of readers
- *	active on the specified rank of per-CPU counters.
+ * Returns approximate number of readers active on the specified rank
+ * of per-CPU counters.  Also snapshots each counter's value in the
+ * corresponding element of sp->snap[] for later use validating
+ * the sum.
  */
+static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
+{
+	int cpu;
+	unsigned long sum = 0;
+	unsigned long t;
 
-static int srcu_readers_active_idx(struct srcu_struct *sp, int idx)
+	for_each_possible_cpu(cpu) {
+		t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
+		sum += t;
+		sp->snap[cpu] = t;
+	}
+	return sum & SRCU_REF_MASK;
+}
+
+/*
+ * To be called from the update side after an index flip.  Returns true
+ * if the modulo sum of the counters is stably zero, false if there is
+ * some possibility of non-zero.
+ */
+static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
 {
 	int cpu;
-	int sum;
 
-	sum = 0;
+	/*
+	 * Note that srcu_readers_active_idx() can incorrectly return
+	 * zero even though there is a pre-existing reader throughout.
+	 * To see this, suppose that task A is in a very long SRCU
+	 * read-side critical section that started on CPU 0, and that
+	 * no other reader exists, so that the modulo sum of the counters
+	 * is equal to one.  Then suppose that task B starts executing
+	 * srcu_readers_active_idx(), summing up to CPU 1, and then that
+	 * task C starts reading on CPU 0, so that its increment is not
+	 * summed, but finishes reading on CPU 2, so that its decrement
+	 * -is- summed.  Then when task B completes its sum, it will
+	 * incorrectly get zero, despite the fact that task A has been
+	 * in its SRCU read-side critical section the whole time.
+	 *
+	 * We therefore do a validation step should srcu_readers_active_idx()
+	 * return zero.
+	 */
+	if (srcu_readers_active_idx(sp, idx) != 0)
+		return false;
+
+	/*
+	 * Since the caller recently flipped ->completed, we can see at
+	 * most one increment of each CPU's counter from this point
+	 * forward.  The reason for this is that the reader CPU must have
+	 * fetched the index before srcu_readers_active_idx checked
+	 * that CPU's counter, but not yet incremented its counter.
+	 * Its eventual counter increment will follow the read in
+	 * srcu_readers_active_idx(), and that increment is immediately
+	 * followed by smp_mb() B.  Because smp_mb() D is between
+	 * the ->completed flip and srcu_readers_active_idx()'s read,
+	 * that CPU's subsequent load of ->completed must see the new
+	 * value, and therefore increment the counter in the other rank.
+	 */
+	smp_mb(); /* A */
+
+	/*
+	 * Now, we check the ->snap array that srcu_readers_active_idx()
+	 * filled in from the per-CPU counter values.  Since both
+	 * __srcu_read_lock() and __srcu_read_unlock() increment the
+	 * upper bits of the per-CPU counter, an increment/decrement
+	 * pair will change the value of the counter.  Since there is
+	 * only one possible increment, the only way to wrap the counter
+	 * is to have a huge number of counter decrements, which requires
+	 * a huge number of tasks and huge SRCU read-side critical-section
+	 * nesting levels, even on 32-bit systems.
+	 *
+	 * All of the ways of confusing the readings require that the scan
+	 * in srcu_readers_active_idx() see the read-side task's decrement,
+	 * but not its increment.  However, between that decrement and
+	 * increment are smp_mb() B and C.  Either or both of these pair
+	 * with smp_mb() A above to ensure that the scan below will see
+	 * the read-side task's increment, thus noting a difference in
+	 * the counter values between the two passes.
+	 *
+	 * Therefore, if srcu_readers_active_idx() returned zero, and
+	 * none of the counters changed, we know that the zero was the
+	 * correct sum.
+	 *
+	 * Of course, it is possible that a task might be delayed
+	 * for a very long time in __srcu_read_lock() after fetching
+	 * the index but before incrementing its counter.  This
+	 * possibility will be dealt with in __synchronize_srcu().
+	 */
 	for_each_possible_cpu(cpu)
-		sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
-	return sum;
+		if (sp->snap[cpu] !=
+		    ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
+			return false;  /* False zero reading! */
+	return true;
 }
 
 /**
@@ -131,10 +214,11 @@  int __srcu_read_lock(struct srcu_struct *sp)
 	int idx;
 
 	preempt_disable();
-	idx = sp->completed & 0x1;
-	barrier();  /* ensure compiler looks -once- at sp->completed. */
-	per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
-	srcu_barrier();  /* ensure compiler won't misorder critical section. */
+	idx = rcu_dereference_index_check(sp->completed,
+					  rcu_read_lock_sched_held()) & 0x1;
+	ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]) +=
+		SRCU_USAGE_COUNT + 1;
+	smp_mb(); /* B */  /* Avoid leaking the critical section. */
 	preempt_enable();
 	return idx;
 }
@@ -149,8 +233,9 @@  EXPORT_SYMBOL_GPL(__srcu_read_lock);
 void __srcu_read_unlock(struct srcu_struct *sp, int idx)
 {
 	preempt_disable();
-	srcu_barrier();  /* ensure compiler won't misorder critical section. */
-	per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
+	smp_mb(); /* C */  /* Avoid leaking the critical section. */
+	ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]) +=
+		SRCU_USAGE_COUNT - 1;
 	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(__srcu_read_unlock);
@@ -163,12 +248,65 @@  EXPORT_SYMBOL_GPL(__srcu_read_unlock);
  * we repeatedly block for 1-millisecond time periods.  This approach
  * has done well in testing, so there is no need for a config parameter.
  */
-#define SYNCHRONIZE_SRCU_READER_DELAY 10
+#define SYNCHRONIZE_SRCU_READER_DELAY 5
+
+/*
+ * Flip the readers' index by incrementing ->completed, then wait
+ * until there are no more readers using the counters referenced by
+ * the old index value.  (Recall that the index is the bottom bit
+ * of ->completed.)
+ *
+ * Of course, it is possible that a reader might be delayed for the
+ * full duration of flip_idx_and_wait() between fetching the
+ * index and incrementing its counter.  This possibility is handled
+ * by __synchronize_srcu() invoking flip_idx_and_wait() twice.
+ */
+static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
+{
+	int idx;
+	int trycount = 0;
+
+	idx = sp->completed++ & 0x1;
+
+	/*
+	 * If a reader fetches the index before the above increment,
+	 * but increments its counter after srcu_readers_active_idx_check()
+	 * sums it, then smp_mb() D will pair with __srcu_read_lock()'s
+	 * smp_mb() B to ensure that the SRCU read-side critical section
+	 * will see any updates that the current task performed before its
+	 * call to synchronize_srcu(), or to synchronize_srcu_expedited(),
+	 * as the case may be.
+	 */
+	smp_mb(); /* D */
+
+	/*
+	 * SRCU read-side critical sections are normally short, so wait
+	 * a small amount of time before possibly blocking.
+	 */
+	if (!srcu_readers_active_idx_check(sp, idx)) {
+		udelay(SYNCHRONIZE_SRCU_READER_DELAY);
+		while (!srcu_readers_active_idx_check(sp, idx)) {
+			if (expedited && ++ trycount < 10)
+				udelay(SYNCHRONIZE_SRCU_READER_DELAY);
+			else
+				schedule_timeout_interruptible(1);
+		}
+	}
+
+	/*
+	 * The following smp_mb() E pairs with srcu_read_unlock()'s
+	 * smp_mb() C to ensure that if srcu_readers_active_idx_check()
+	 * sees srcu_read_unlock()'s counter decrement, then any
+	 * of the current task's subsequent code will happen after
+	 * that SRCU read-side critical section.
+	 */
+	smp_mb(); /* E */
+}
 
 /*
  * Helper function for synchronize_srcu() and synchronize_srcu_expedited().
  */
-static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void))
+static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
 {
 	int idx;
 
@@ -178,90 +316,51 @@  static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void))
 			   !lock_is_held(&rcu_sched_lock_map),
 			   "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");
 
-	idx = sp->completed;
+	idx = ACCESS_ONCE(sp->completed);
 	mutex_lock(&sp->mutex);
 
 	/*
 	 * Check to see if someone else did the work for us while we were
-	 * waiting to acquire the lock.  We need -two- advances of
+	 * waiting to acquire the lock.  We need -three- advances of
 	 * the counter, not just one.  If there was but one, we might have
 	 * shown up -after- our helper's first synchronize_sched(), thus
 	 * having failed to prevent CPU-reordering races with concurrent
-	 * srcu_read_unlock()s on other CPUs (see comment below).  So we
-	 * either (1) wait for two or (2) supply the second ourselves.
+	 * srcu_read_unlock()s on other CPUs (see comment below).  If there
+	 * was only two, we are guaranteed to have waited through only one
+	 * full index-flip phase.  So we either (1) wait for three or
+	 * (2) supply the additional ones we need.
 	 */
 
-	if ((sp->completed - idx) >= 2) {
+	if (sp->completed == idx + 2)
+		idx = 1;
+	else if (sp->completed == idx + 3) {
 		mutex_unlock(&sp->mutex);
 		return;
-	}
-
-	sync_func();  /* Force memory barrier on all CPUs. */
+	} else
+		idx = 0;
 
 	/*
-	 * The preceding synchronize_sched() ensures that any CPU that
-	 * sees the new value of sp->completed will also see any preceding
-	 * changes to data structures made by this CPU.  This prevents
-	 * some other CPU from reordering the accesses in its SRCU
-	 * read-side critical section to precede the corresponding
-	 * srcu_read_lock() -- ensuring that such references will in
-	 * fact be protected.
+	 * If there were no helpers, then we need to do two flips of
+	 * the index.  The first flip is required if there are any
+	 * outstanding SRCU readers even if there are no new readers
+	 * running concurrently with the first counter flip.
 	 *
-	 * So it is now safe to do the flip.
-	 */
-
-	idx = sp->completed & 0x1;
-	sp->completed++;
-
-	sync_func();  /* Force memory barrier on all CPUs. */
-
-	/*
-	 * At this point, because of the preceding synchronize_sched(),
-	 * all srcu_read_lock() calls using the old counters have completed.
-	 * Their corresponding critical sections might well be still
-	 * executing, but the srcu_read_lock() primitives themselves
-	 * will have finished executing.  We initially give readers
-	 * an arbitrarily chosen 10 microseconds to get out of their
-	 * SRCU read-side critical sections, then loop waiting 1/HZ
-	 * seconds per iteration.  The 10-microsecond value has done
-	 * very well in testing.
-	 */
-
-	if (srcu_readers_active_idx(sp, idx))
-		udelay(SYNCHRONIZE_SRCU_READER_DELAY);
-	while (srcu_readers_active_idx(sp, idx))
-		schedule_timeout_interruptible(1);
-
-	sync_func();  /* Force memory barrier on all CPUs. */
-
-	/*
-	 * The preceding synchronize_sched() forces all srcu_read_unlock()
-	 * primitives that were executing concurrently with the preceding
-	 * for_each_possible_cpu() loop to have completed by this point.
-	 * More importantly, it also forces the corresponding SRCU read-side
-	 * critical sections to have also completed, and the corresponding
-	 * references to SRCU-protected data items to be dropped.
+	 * The second flip is required when a new reader picks up
+	 * the old value of the index, but does not increment its
+	 * counter until after its counter is summed/rechecked by
+	 * srcu_readers_active_idx_check().  In this case, the current SRCU
+	 * grace period would be OK because the SRCU read-side critical
+	 * section started after this SRCU grace period started, so the
+	 * grace period is not required to wait for the reader.
 	 *
-	 * Note:
-	 *
-	 *	Despite what you might think at first glance, the
-	 *	preceding synchronize_sched() -must- be within the
-	 *	critical section ended by the following mutex_unlock().
-	 *	Otherwise, a task taking the early exit can race
-	 *	with a srcu_read_unlock(), which might have executed
-	 *	just before the preceding srcu_readers_active() check,
-	 *	and whose CPU might have reordered the srcu_read_unlock()
-	 *	with the preceding critical section.  In this case, there
-	 *	is nothing preventing the synchronize_sched() task that is
-	 *	taking the early exit from freeing a data structure that
-	 *	is still being referenced (out of order) by the task
-	 *	doing the srcu_read_unlock().
-	 *
-	 *	Alternatively, the comparison with "2" on the early exit
-	 *	could be changed to "3", but this increases synchronize_srcu()
-	 *	latency for bulk loads.  So the current code is preferred.
+	 * However, the next SRCU grace period would be waiting for the
+	 * other set of counters to go to zero, and therefore would not
+	 * wait for the reader, which would be very bad.  To avoid this
+	 * bad scenario, we flip and wait twice, clearing out both sets
+	 * of counters.
 	 */
-
+	for (; idx < 2; idx++)
+		flip_idx_and_wait(sp, expedited);
 	mutex_unlock(&sp->mutex);
 }
 
@@ -281,7 +380,7 @@  static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void))
  */
 void synchronize_srcu(struct srcu_struct *sp)
 {
-	__synchronize_srcu(sp, synchronize_sched);
+	__synchronize_srcu(sp, 0);
 }
 EXPORT_SYMBOL_GPL(synchronize_srcu);
 
@@ -289,18 +388,11 @@  EXPORT_SYMBOL_GPL(synchronize_srcu);
  * synchronize_srcu_expedited - Brute-force SRCU grace period
  * @sp: srcu_struct with which to synchronize.
  *
- * Wait for an SRCU grace period to elapse, but use a "big hammer"
- * approach to force the grace period to end quickly.  This consumes
- * significant time on all CPUs and is unfriendly to real-time workloads,
- * so is thus not recommended for any sort of common-case code.  In fact,
- * if you are using synchronize_srcu_expedited() in a loop, please
- * restructure your code to batch your updates, and then use a single
- * synchronize_srcu() instead.
+ * Wait for an SRCU grace period to elapse, but be more aggressive about
+ * spinning rather than blocking when waiting.
  *
  * Note that it is illegal to call this function while holding any lock
- * that is acquired by a CPU-hotplug notifier.  And yes, it is also illegal
- * to call this function from a CPU-hotplug notifier.  Failing to observe
- * these restriction will result in deadlock.  It is also illegal to call
+ * that is acquired by a CPU-hotplug notifier.  It is also illegal to call
  * synchronize_srcu_expedited() from the corresponding SRCU read-side
  * critical section; doing so will result in deadlock.  However, it is
  * perfectly legal to call synchronize_srcu_expedited() on one srcu_struct
@@ -309,7 +401,7 @@  EXPORT_SYMBOL_GPL(synchronize_srcu);
  */
 void synchronize_srcu_expedited(struct srcu_struct *sp)
 {
-	__synchronize_srcu(sp, synchronize_sched_expedited);
+	__synchronize_srcu(sp, 1);
 }
 EXPORT_SYMBOL_GPL(synchronize_srcu_expedited);