
[tip/core/urgent,3/7] rcu: Streamline code produced by __rcu_read_unlock()

Message ID 1311186383-24819-3-git-send-email-paulmck@linux.vnet.ibm.com
State Accepted
Commit be0e1e21ef707be4d16ea6a96ac9997463e4b8d2

Commit Message

Paul E. McKenney July 20, 2011, 6:26 p.m. UTC
Given some common flag combinations, particularly -Os, gcc will inline
rcu_read_unlock_special() despite its being in an unlikely() clause.
Use noinline to prohibit this misoptimization.

In addition, move the second barrier() in __rcu_read_unlock() so that
it is not on the common-case code path.  This will allow the compiler to
generate better code for the common-case path through __rcu_read_unlock().

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 kernel/rcutree_plugin.h |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)
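
As a quick way to see the effect the commit message describes, here is a
minimal user-space sketch (the names are simplified stand-ins, not the
kernel's code): the rarely-taken slow path is marked noinline so that -Os
cannot pull it back into the caller, and the second compiler barrier sits
only on the nesting-reaches-zero path. Compiling with gcc -Os and
disassembling should show the slow path staying out of line.

  #include <stdio.h>

  #define barrier()	__asm__ __volatile__("" ::: "memory")
  #define unlikely(x)	__builtin_expect(!!(x), 0)

  struct task {
  	int nesting;	/* stand-in for ->rcu_read_lock_nesting */
  	int special;	/* stand-in for ->rcu_read_unlock_special */
  };

  /* Slow path: noinline keeps -Os from folding it into the fast path. */
  static __attribute__((noinline)) void read_unlock_special(struct task *t)
  {
  	t->special = 0;
  	puts("slow path taken");
  }

  static void read_unlock(struct task *t)
  {
  	barrier();			/* order against the critical section */
  	if (--t->nesting == 0) {
  		barrier();		/* decrement before ->special load */
  		if (unlikely(t->special))
  			read_unlock_special(t);
  	}
  }

  int main(void)
  {
  	struct task t = { .nesting = 1, .special = 1 };

  	read_unlock(&t);
  	return 0;
  }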

Comments

Linus Torvalds July 20, 2011, 10:44 p.m. UTC | #1
On Wed, Jul 20, 2011 at 11:26 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> Given some common flag combinations, particularly -Os, gcc will inline
> rcu_read_unlock_special() despite its being in an unlikely() clause.
> Use noinline to prohibit this misoptimization.

Btw, I suspect that we should at least look at what it would mean if
we make the rcu_read_lock_nesting and the preempt counters both be
per-cpu variables instead of making them per-thread/process counters.

Then, when we switch threads, we'd just save/restore them from the
process register save area.

There are a lot of critical code sequences (spin-lock/unlock, rcu
read-lock/unlock) that currently fetch the thread/process pointer
only to then offset it and increment the count. I get the strong
feeling that code generation could be improved and we could avoid one
level of indirection by just making it a per-CPU counter.

For example, instead of __rcu_read_lock() looking like this (and being
an external function, partly because of header file dependencies on
the data structures involved):

  push   %rbp
  mov    %rsp,%rbp
  mov    %gs:0xb580,%rax
  incl   0x100(%rax)
  leaveq
  retq

it should inline to just something like

  incl %gs:0x100

instead. Same for the preempt counter.

Of course, it would need to involve making sure that we pick a good
cacheline etc that is already always dirty. But other than that, is
there any real downside?

                       Linus
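
For concreteness, a rough sketch of the per-CPU arrangement described
above, written against the generic percpu accessors; the per-CPU variable
and the context-switch helper below are invented for illustration only,
not proposed code:

  /*
   * Illustrative only: a per-CPU nesting count instead of the
   * task_struct field (which exists with preemptible RCU configs).
   */
  #include <linux/percpu.h>
  #include <linux/sched.h>

  static DEFINE_PER_CPU(int, rcu_nesting_pcpu);

  static inline void rcu_read_lock_sketch(void)
  {
  	this_cpu_inc(rcu_nesting_pcpu);	/* one incl %gs:... on x86 */
  }

  static inline void rcu_read_unlock_sketch(void)
  {
  	this_cpu_dec(rcu_nesting_pcpu);	/* one decl %gs:... on x86 */
  }

  /*
   * The context-switch path would then spill the outgoing task's count
   * and load the incoming one, roughly like this (hook placement and
   * naming are hand-waved):
   */
  static inline void rcu_switch_sketch(struct task_struct *prev,
  				     struct task_struct *next)
  {
  	prev->rcu_read_lock_nesting = this_cpu_read(rcu_nesting_pcpu);
  	this_cpu_write(rcu_nesting_pcpu, next->rcu_read_lock_nesting);
  }

On x86 the this_cpu_*() operations compile down to a single %gs-prefixed
instruction, i.e. the incl %gs:0x100 form shown above; saving and
restoring at context switch is what keeps the per-CPU value coherent with
the task that is actually running.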
Paul E. McKenney July 21, 2011, 5:09 a.m. UTC | #2
On Wed, Jul 20, 2011 at 03:44:55PM -0700, Linus Torvalds wrote:
> On Wed, Jul 20, 2011 at 11:26 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > Given some common flag combinations, particularly -Os, gcc will inline
> > rcu_read_unlock_special() despite its being in an unlikely() clause.
> > Use noinline to prohibit this misoptimization.
> 
> Btw, I suspect that we should at least look at what it would mean if
> we make the rcu_read_lock_nesting and the preempt counters both be
> per-cpu variables instead of making them per-thread/process counters.
> 
> Then, when we switch threads, we'd just save/restore them from the
> process register save area.
> 
> There are a lot of critical code sequences (spin-lock/unlock, rcu
> read-lock/unlock) that currently fetch the thread/process pointer
> only to then offset it and increment the count. I get the strong
> feeling that code generation could be improved and we could avoid one
> level of indirection by just making it a per-CPU counter.
> 
> For example, instead of __rcu_read_lock() looking like this (and being
> an external function, partly because of header file dependencies on
> the data structures involved):
> 
>   push   %rbp
>   mov    %rsp,%rbp
>   mov    %gs:0xb580,%rax
>   incl   0x100(%rax)
>   leaveq
>   retq
> 
> it should inline to just something like
> 
>   incl %gs:0x100
> 
> instead. Same for the preempt counter.
> 
> Of course, it would need to involve making sure that we pick a good
> cacheline etc that is already always dirty. But other than that, is
> there any real downside?

We would need a form of per-CPU variable access that generated
efficient code, but that didn't complain about being used when
preemption was enabled.  __this_cpu_add_4() might do the trick,
but I haven't dug fully through it yet.

						Thanx, Paul
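
For what it's worth, the this_cpu_*() family looks like the kind of
accessor being asked for here: on x86 it compiles to a single
%gs-prefixed instruction and is defined to be usable with preemption
enabled, so it avoids the CONFIG_DEBUG_PREEMPT complaints that
smp_processor_id()-based access triggers. A tiny sketch (the variable
name is made up):

  #include <linux/percpu.h>

  /* Sketch only; not a real kernel symbol. */
  static DEFINE_PER_CPU(int, nesting_demo);

  static inline void nesting_demo_inc(void)
  {
  	/*
  	 * Usable with preemption enabled: on x86 this emits a single
  	 * %gs-prefixed incl, with no task/CPU pointer load and no
  	 * smp_processor_id() debug check to trip.
  	 */
  	this_cpu_inc(nesting_demo);
  }

  /*
   * __this_cpu_add_4() is the raw 4-byte primitive underneath the
   * __this_cpu_*() wrappers; on x86 it is the same single instruction,
   * but the generic fallback skips the preemption protection that the
   * this_cpu_*() fallbacks provide.
   */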

Patch

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 3a0ae03..4d2c068 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -284,7 +284,7 @@  static struct list_head *rcu_next_node_entry(struct task_struct *t,
  * notify RCU core processing or task having blocked during the RCU
  * read-side critical section.
  */
-static void rcu_read_unlock_special(struct task_struct *t)
+static noinline void rcu_read_unlock_special(struct task_struct *t)
 {
 	int empty;
 	int empty_exp;
@@ -391,11 +391,11 @@  void __rcu_read_unlock(void)
 	struct task_struct *t = current;
 
 	barrier();  /* needed if we ever invoke rcu_read_unlock in rcutree.c */
-	--t->rcu_read_lock_nesting;
-	barrier();  /* decrement before load of ->rcu_read_unlock_special */
-	if (t->rcu_read_lock_nesting == 0 &&
-	    unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
-		rcu_read_unlock_special(t);
+	if (--t->rcu_read_lock_nesting == 0) {
+		barrier();  /* decr before ->rcu_read_unlock_special load */
+		if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
+			rcu_read_unlock_special(t);
+	}
 #ifdef CONFIG_PROVE_LOCKING
 	WARN_ON_ONCE(ACCESS_ONCE(t->rcu_read_lock_nesting) < 0);
 #endif /* #ifdef CONFIG_PROVE_LOCKING */