[RFC,tip/core/rcu,6/6] rcu: Reduce cache-miss initialization latencies for large systems

Message ID 1335199347-13926-6-git-send-email-paulmck@linux.vnet.ibm.com
State Accepted
Commit 8932a63d5edb02f714d50c26583152fe0a97a69c

Commit Message

Paul E. McKenney April 23, 2012, 4:42 p.m. UTC
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

Commit #0209f649 (rcu: limit rcu_node leaf-level fanout) set an upper
limit of 16 on the leaf-level fanout for the rcu_node tree.  This was
needed to reduce lock contention that was induced by the synchronization
of scheduling-clock interrupts, which was in turn needed to improve
energy efficiency for moderate-sized lightly loaded servers.

However, reducing the leaf-level fanout means that there are more
leaf-level rcu_node structures in the tree, which in turn means that
RCU's grace-period initialization incurs more cache misses.  This is
not a problem on moderate-sized servers with only a few tens of CPUs,
but becomes a major source of real-time latency spikes on systems with
many hundreds of CPUs.  In addition, the workloads running on these large
systems tend to be CPU-bound, which eliminates the energy-efficiency
advantages of synchronizing scheduling-clock interrupts.  Therefore,
these systems need maximal values for the rcu_node leaf-level fanout.

This commit addresses this problem by introducing a new kernel
configuration parameter named RCU_FANOUT_LEAF that directly controls
the leaf-level fanout.  This parameter defaults to 16 to handle the
common case of moderately sized, lightly loaded servers, but may be
set higher on larger systems.

Reported-by: Mike Galbraith <efault@gmx.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 init/Kconfig     |   27 +++++++++++++++++++++++++++
 kernel/rcutree.c |    2 +-
 kernel/rcutree.h |   10 +++-------
 3 files changed, 31 insertions(+), 8 deletions(-)
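
To make the cache-miss argument concrete, below is a minimal user-space
sketch (illustrative only, not part of the patch) of the tree-shape
arithmetic: grace-period initialization must visit every leaf-level
rcu_node structure, and the number of leaves scales as NR_CPUS divided
by the leaf-level fanout.

#include <stdio.h>

/* Leaves needed to cover nr_cpus when each leaf spans leaf_fanout CPUs. */
static int rcu_num_leaves(int nr_cpus, int leaf_fanout)
{
	return (nr_cpus + leaf_fanout - 1) / leaf_fanout;
}

int main(void)
{
	/* A distro-style NR_CPUS=4096 build, per the discussion below. */
	printf("fanout 16: %d leaves\n", rcu_num_leaves(4096, 16)); /* 256 */
	printf("fanout 64: %d leaves\n", rcu_num_leaves(4096, 64)); /*  64 */
	return 0;
}

The 256-versus-64 leaf count is the "factor of four" latency reduction
Paul cites later in the thread.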

Comments

Peter Zijlstra April 26, 2012, 12:51 p.m. UTC | #1
On Mon, 2012-04-23 at 09:42 -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> Commit #0209f649 (rcu: limit rcu_node leaf-level fanout) set an upper
> limit of 16 on the leaf-level fanout for the rcu_node tree.  This was
> needed to reduce lock contention that was induced by the synchronization
> of scheduling-clock interrupts, which was in turn needed to improve
> energy efficiency for moderate-sized lightly loaded servers.
> 
> However, reducing the leaf-level fanout means that there are more
> leaf-level rcu_node structures in the tree, which in turn means that
> RCU's grace-period initialization incurs more cache misses.  This is
> not a problem on moderate-sized servers with only a few tens of CPUs,
> but becomes a major source of real-time latency spikes on systems with
> many hundreds of CPUs.  In addition, the workloads running on these large
> systems tend to be CPU-bound, which eliminates the energy-efficiency
> advantages of synchronizing scheduling-clock interrupts.  Therefore,
> these systems need maximal values for the rcu_node leaf-level fanout.
> 
> This commit addresses this problem by introducing a new kernel
> configuration parameter named RCU_FANOUT_LEAF that directly controls
> the leaf-level fanout.  This parameter defaults to 16 to handle the
> common case of moderately sized, lightly loaded servers, but may be
> set higher on larger systems.

Wouldn't it be much better to match the rcu fanout tree to the physical
topology of the machine?
Paul E. McKenney April 26, 2012, 2:12 p.m. UTC | #2
On Thu, Apr 26, 2012 at 02:51:47PM +0200, Peter Zijlstra wrote:
> On Mon, 2012-04-23 at 09:42 -0700, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > 
> > Commit #0209f649 (rcu: limit rcu_node leaf-level fanout) set an upper
> > limit of 16 on the leaf-level fanout for the rcu_node tree.  This was
> > needed to reduce lock contention that was induced by the synchronization
> > of scheduling-clock interrupts, which was in turn needed to improve
> > energy efficiency for moderate-sized lightly loaded servers.
> > 
> > However, reducing the leaf-level fanout means that there are more
> > leaf-level rcu_node structures in the tree, which in turn means that
> > RCU's grace-period initialization incurs more cache misses.  This is
> > not a problem on moderate-sized servers with only a few tens of CPUs,
> > but becomes a major source of real-time latency spikes on systems with
> > many hundreds of CPUs.  In addition, the workloads running on these large
> > systems tend to be CPU-bound, which eliminates the energy-efficiency
> > advantages of synchronizing scheduling-clock interrupts.  Therefore,
> > these systems need maximal values for the rcu_node leaf-level fanout.
> > 
> > This commit addresses this problem by introducing a new kernel parameter
> > named RCU_FANOUT_LEAF that directly controls the leaf-level fanout.
> > This parameter defaults to 16 to handle the common case of a moderate
> > sized lightly loaded servers, but may be set higher on larger systems.
> 
> Wouldn't it be much better to match the rcu fanout tree to the physical
> topology of the machine?

From what I am hearing, doing so requires me to morph the rcu_node tree
at run time.  I might eventually become courageous/inspired/senile
enough to try this, but not yet.  ;-)

Actually, some of this topology shifting seems to me like a firmware
bug.  Why not arrange the Linux-visible numbering in a way to promote
locality for code sequencing through the CPUs?

							Thanx, Paul
Peter Zijlstra April 26, 2012, 3:28 p.m. UTC | #3
On Thu, 2012-04-26 at 07:12 -0700, Paul E. McKenney wrote:
> On Thu, Apr 26, 2012 at 02:51:47PM +0200, Peter Zijlstra wrote:

> > Wouldn't it be much better to match the rcu fanout tree to the physical
> > topology of the machine?
> 
> From what I am hearing, doing so requires me to morph the rcu_node tree
> at run time.  I might eventually become courageous/inspired/senile
> enough to try this, but not yet.  ;-)

Yes, boot time with possibly some hotplug hooks.

> Actually, some of this topology shifting seems to me like a firmware
> bug.  Why not arrange the Linux-visible numbering in a way to promote
> locality for code sequencing through the CPUs?

I'm not sure... but it seems well established on x86 to first enumerate
the cores (thread 0) and then the sibling threads (thread 1) -- one
'advantage' is that if you boot with max_cpus=$half you get all cores
instead of half the cores.

OTOH it does make linear iteration of the cpus 'funny' :-)
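
A toy model of that enumeration (assuming a machine with n_cores cores
of two hardware threads each; real firmware numbering varies):

/*
 * Logical CPUs 0..n_cores-1 are thread 0 of each core, and logical
 * CPUs n_cores..2*n_cores-1 are the sibling threads.  Booting with
 * max_cpus=n_cores therefore brings up every core once, but linear
 * iteration sweeps all cores before revisiting any core's second
 * thread -- the 'funny' order noted above.
 */
static void logical_to_topo(int cpu, int n_cores, int *core, int *thread)
{
	*core = cpu % n_cores;
	*thread = cpu / n_cores;
}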

Also, a fanout of 16 is nice when your machine doesn't have HT and has a
2^n core count, but some popular machines these days have 6/10 cores per
socket, resulting in your fanout splitting caches.
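
To illustrate the cache-splitting point, a hypothetical mapping for
6-core, two-thread sockets (12 logical CPUs each) under a leaf fanout
of 16 -- illustrative numbers only:

/*
 * Leaf 0 covers CPUs 0-15: all of socket 0 plus a third of socket 1,
 * so a single rcu_node leaf straddles a socket (and cache) boundary.
 */
static int cpu_to_leaf(int cpu)   { return cpu / 16; }
static int cpu_to_socket(int cpu) { return cpu / 12; }
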
Paul E. McKenney April 26, 2012, 4:15 p.m. UTC | #4
On Thu, Apr 26, 2012 at 05:28:57PM +0200, Peter Zijlstra wrote:
> On Thu, 2012-04-26 at 07:12 -0700, Paul E. McKenney wrote:
> > On Thu, Apr 26, 2012 at 02:51:47PM +0200, Peter Zijlstra wrote:
> 
> > > Wouldn't it be much better to match the rcu fanout tree to the physical
> > > topology of the machine?
> > 
> > From what I am hearing, doing so requires me to morph the rcu_node tree
> > at run time.  I might eventually become courageous/inspired/senile
> > enough to try this, but not yet.  ;-)
> 
> Yes, boot time with possibly some hotplug hooks.

Has anyone actually measured any slowdown due to the rcu_node structure
not matching the topology?  (But see also below.)

> > Actually, some of this topology shifting seems to me like a firmware
> > bug.  Why not arrange the Linux-visible numbering in a way to promote
> > locality for code sequencing through the CPUs?
> 
> I'm not sure... but it seems well established on x86 to first enumerate
> the cores (thread 0) and then the sibling threads (thread 1) -- one
> 'advantage' is that if you boot with max_cpus=$half you get all cores
> instead of half the cores.
> 
> OTOH it does make linear iteration of the cpus 'funny' :-)

Like I said, firmware bug.  Seems like the fix should be there as well.
Perhaps there needs to be two CPU numberings, one for people wanting
whole cores and another for people who want cache locality.  Yes, this
could be confusing, but keep in mind that you are asking every kernel
subsystem to keep its own version of the cache-locality numbering,
and that will be even more confusing.

> Also, a fanout of 16 is nice when your machine doesn't have HT and has a
> 2^n core count, but some popular machines these days have 6/10 cores per
> socket, resulting in your fanout splitting caches.

That is easy.  Such systems can set CONFIG_RCU_FANOUT to 6, 12, 10,
or 20, depending on preference.  With a patch intended for 3.6, they
could set the smallest reasonable value at build time and adjust to
the hardware using the boot parameter.

http://www.gossamer-threads.com/lists/linux/kernel/1524864

I expect to make other similar changes over time, but will be proceeding
cautiously.

							Thanx, Paul
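
The boot parameter Paul refers to above adds a run-time override on top
of the build-time default.  As a hedged usage sketch -- the parameter
name below is taken from later mainline kernels and is an assumption
relative to this thread:

	# appended to the kernel command line; raises the leaf-level
	# fanout without rebuilding:
	rcutree.rcu_fanout_leaf=64
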
Mike Galbraith April 27, 2012, 4:36 a.m. UTC | #5
On Mon, 2012-04-23 at 09:42 -0700, Paul E. McKenney wrote: 
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> Commit #0209f649 (rcu: limit rcu_node leaf-level fanout) set an upper
> limit of 16 on the leaf-level fanout for the rcu_node tree.  This was
> needed to reduce lock contention that was induced by the synchronization
> of scheduling-clock interrupts, which was in turn needed to improve
> energy efficiency for moderate-sized lightly loaded servers.
> 
> However, reducing the leaf-level fanout means that there are more
> leaf-level rcu_node structures in the tree, which in turn means that
> RCU's grace-period initialization incurs more cache misses.  This is
> not a problem on moderate-sized servers with only a few tens of CPUs,

With a distro config (4096 CPUs) interrupt latency is bad even on a
quad.  Traversing empty nodes taking locks and cache misses hurts.

-Mike
Paul E. McKenney April 27, 2012, 3:15 p.m. UTC | #6
On Fri, Apr 27, 2012 at 06:36:11AM +0200, Mike Galbraith wrote:
> On Mon, 2012-04-23 at 09:42 -0700, Paul E. McKenney wrote: 
> > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > 
> > Commit #0209f649 (rcu: limit rcu_node leaf-level fanout) set an upper
> > limit of 16 on the leaf-level fanout for the rcu_node tree.  This was
> > needed to reduce lock contention that was induced by the synchronization
> > of scheduling-clock interrupts, which was in turn needed to improve
> > energy efficiency for moderate-sized lightly loaded servers.
> > 
> > However, reducing the leaf-level fanout means that there are more
> > leaf-level rcu_node structures in the tree, which in turn means that
> > RCU's grace-period initialization incurs more cache misses.  This is
> > not a problem on moderate-sized servers with only a few tens of CPUs,
> 
> With a distro config (4096 CPUs) interrupt latency is bad even on a
> quad.  Traversing empty nodes taking locks and cache misses hurts.

Agreed -- and I will be working on an additional patch that makes RCU
avoid initializing its data structures for CPUs that don't exist.

That said, increasing the leaf-level fanout from 16 to 64 should reduce
the latency pain by a factor of four.  In addition, I would expect that
real-time builds of the kernel would set NR_CPUS to some value much
smaller than 4096.  ;-)

							Thanx, Paul
Mike Galbraith April 28, 2012, 4:42 a.m. UTC | #7
On Fri, 2012-04-27 at 08:15 -0700, Paul E. McKenney wrote: 
> On Fri, Apr 27, 2012 at 06:36:11AM +0200, Mike Galbraith wrote:
> > On Mon, 2012-04-23 at 09:42 -0700, Paul E. McKenney wrote: 
> > > [...]
> > 
> > With a distro config (4096 CPUs) interrupt latency is bad even on a
> > quad.  Traversing empty nodes taking locks and cache misses hurts.
> 
> Agreed -- and I will be working on an additional patch that makes RCU
> avoid initializing its data structures for CPUs that don't exist.

That's still on my todo list too; your initial patch (and my butchery
thereof to skip taking the lock) showed this helps a heap.

> That said, increasing the leaf-level fanout from 16 to 64 should reduce
> the latency pain by a factor of four.  In addition, I would expect that
> real-time builds of the kernel would set NR_CPUS to some value much
> smaller than 4096.  ;-)

Yup, else you would have heard whimpering months ago ;-)

-Mike
Paul E. McKenney April 28, 2012, 5:21 p.m. UTC | #8
On Sat, Apr 28, 2012 at 06:42:22AM +0200, Mike Galbraith wrote:
> On Fri, 2012-04-27 at 08:15 -0700, Paul E. McKenney wrote: 
> > On Fri, Apr 27, 2012 at 06:36:11AM +0200, Mike Galbraith wrote:
> > > On Mon, 2012-04-23 at 09:42 -0700, Paul E. McKenney wrote: 
> > > > [...]
> > > 
> > > With a distro config (4096 CPUs) interrupt latency is bad even on a
> > > quad.  Traversing empty nodes taking locks and cache misses hurts.
> > 
> > Agreed -- and I will be working on an additional patch that makes RCU
> > avoid initializing its data structures for CPUs that don't exist.
> 
> That's still on my todo list too; your initial patch (and my butchery
> thereof to skip taking the lock) showed this helps a heap.

Indeed, I am a bit worried about the safety of skipping the lock in
that case.  Either way, if you would rather keep working on this, I
have no problem pushing it further down my todo list.

> > That said, increasing the leaf-level fanout from 16 to 64 should reduce
> > the latency pain by a factor of four.  In addition, I would expect that
> > real-time builds of the kernel would set NR_CPUS to some value much
> > smaller than 4096.  ;-)
> 
> Yup, else you would have heard whimpering months ago ;-)

I am not sure that it would have been exactly whimpering, but yes.  ;-)

							Thanx, Paul
Mike Galbraith April 29, 2012, 3:54 a.m. UTC | #9
On Sat, 2012-04-28 at 10:21 -0700, Paul E. McKenney wrote: 
> On Sat, Apr 28, 2012 at 06:42:22AM +0200, Mike Galbraith wrote:
> > On Fri, 2012-04-27 at 08:15 -0700, Paul E. McKenney wrote: 
> > > On Fri, Apr 27, 2012 at 06:36:11AM +0200, Mike Galbraith wrote:
> > > > On Mon, 2012-04-23 at 09:42 -0700, Paul E. McKenney wrote: 
> > > > > [...]
> > > > 
> > > > With a distro config (4096 CPUs) interrupt latency is bad even on a
> > > > quad.  Traversing empty nodes taking locks and cache misses hurts.
> > > 
> > > Agreed -- and I will be working on an additional patch that makes RCU
> > > avoid initializing its data structures for CPUs that don't exist.
> > 
> > That's still on my todo list too; your initial patch (and my butchery
> > thereof to skip taking the lock) showed this helps a heap.
> 
> Indeed, I am a bit worried about the safety of skipping the lock in
> that case.  Either way, if you would rather keep working on this, I
> have no problem pushing it further down my todo list.

It's on my todo list because Paul is a busy fellow.  I'd much prefer that RCU
tweaking is done by somebody who fully understands RCU's gizzard (!me).
IOW, I'll fiddle with RCU as time permits/gripes require, but waiting to
test your implementation is the more prudent thing to do.  Heck, as yet,
I haven't even scraped together the time to test your other patches :-/ 

For the nonce, I've done the safe thing: reverted 0209f649 and enabled
tick skew.  Per my measurements, the regression is thus dead, so it's
not as pressing an issue, though some users do still want and need more.

-Mike
Patch

diff --git a/init/Kconfig b/init/Kconfig
index 85c6870..6d18ef8 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -458,6 +458,33 @@  config RCU_FANOUT
 	  Select a specific number if testing RCU itself.
 	  Take the default if unsure.
 
+config RCU_FANOUT_LEAF
+	int "Tree-based hierarchical RCU leaf-level fanout value"
+	range 2 RCU_FANOUT if 64BIT
+	range 2 RCU_FANOUT if !64BIT
+	depends on TREE_RCU || TREE_PREEMPT_RCU
+	default 16
+	help
+	  This option controls the leaf-level fanout of hierarchical
+	  implementations of RCU, and allows trading off cache misses
+	  against lock contention.  Systems that synchronize their
+	  scheduling-clock interrupts for energy-efficiency reasons will
+	  want the default because the smaller leaf-level fanout keeps
+	  lock contention levels acceptably low.  Very large systems
+	  (hundreds or thousands of CPUs) will instead want to set this
+	  value to the maximum value possible in order to reduce the
+	  number of cache misses incurred during RCU's grace-period
+	  initialization.  These systems tend to run CPU-bound, and thus
+	  are not helped by synchronized interrupts, and thus tend to
+	  skew them, which reduces lock contention enough that large
+	  leaf-level fanouts work well.
+
+	  Select a specific number if testing RCU itself.
+
+	  Select the maximum permissible value for large systems.
+
+	  Take the default if unsure.
+
 config RCU_FANOUT_EXACT
 	bool "Disable tree-based hierarchical RCU auto-balancing"
 	depends on TREE_RCU || TREE_PREEMPT_RCU
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 1050d6d..780acf8 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -2418,7 +2418,7 @@  static void __init rcu_init_levelspread(struct rcu_state *rsp)
 
 	for (i = NUM_RCU_LVLS - 1; i > 0; i--)
 		rsp->levelspread[i] = CONFIG_RCU_FANOUT;
-	rsp->levelspread[0] = RCU_FANOUT_LEAF;
+	rsp->levelspread[0] = CONFIG_RCU_FANOUT_LEAF;
 }
 #else /* #ifdef CONFIG_RCU_FANOUT_EXACT */
 static void __init rcu_init_levelspread(struct rcu_state *rsp)
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index cdd1be0..a905c20 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -29,18 +29,14 @@ 
 #include <linux/seqlock.h>
 
 /*
- * Define shape of hierarchy based on NR_CPUS and CONFIG_RCU_FANOUT.
+ * Define shape of hierarchy based on NR_CPUS, CONFIG_RCU_FANOUT, and
+ * CONFIG_RCU_FANOUT_LEAF.
  * In theory, it should be possible to add more levels straightforwardly.
  * In practice, this did work well going from three levels to four.
  * Of course, your mileage may vary.
  */
 #define MAX_RCU_LVLS 4
-#if CONFIG_RCU_FANOUT > 16
-#define RCU_FANOUT_LEAF       16
-#else /* #if CONFIG_RCU_FANOUT > 16 */
-#define RCU_FANOUT_LEAF       (CONFIG_RCU_FANOUT)
-#endif /* #else #if CONFIG_RCU_FANOUT > 16 */
-#define RCU_FANOUT_1	      (RCU_FANOUT_LEAF)
+#define RCU_FANOUT_1	      (CONFIG_RCU_FANOUT_LEAF)
 #define RCU_FANOUT_2	      (RCU_FANOUT_1 * CONFIG_RCU_FANOUT)
 #define RCU_FANOUT_3	      (RCU_FANOUT_2 * CONFIG_RCU_FANOUT)
 #define RCU_FANOUT_4	      (RCU_FANOUT_3 * CONFIG_RCU_FANOUT)
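
For reference, a user-space sketch (illustrative only) of how these
fanout products select the number of tree levels: the tree uses the
smallest depth whose cumulative fanout covers NR_CPUS, mirroring the
NR_CPUS-versus-RCU_FANOUT_n comparisons that follow these macros in
rcutree.h.

#include <stdio.h>

/* Smallest number of levels whose fanout product covers nr_cpus. */
static int rcu_num_lvls(long nr_cpus, long leaf_fanout, long fanout)
{
	long covered = leaf_fanout;		/* RCU_FANOUT_1 */
	int lvls = 1;

	while (covered < nr_cpus && lvls < 4) {	/* MAX_RCU_LVLS */
		covered *= fanout;		/* RCU_FANOUT_2, _3, _4 */
		lvls++;
	}
	return lvls;
}

int main(void)
{
	printf("%d\n", rcu_num_lvls(4096, 16, 64));	/* 3 levels */
	printf("%d\n", rcu_num_lvls(4096, 64, 64));	/* 2 levels */
	return 0;
}

A larger leaf-level fanout thus shrinks the tree in depth as well as in
leaf count.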