[0/3] Represent cluster topology and enable load balance between clusters

Message ID 20210820013008.12881-1-21cnbao@gmail.com

Message

Barry Song Aug. 20, 2021, 1:30 a.m. UTC
From: Barry Song <song.bao.hua@hisilicon.com>

ARM64 machines such as kunpeng920 and x86 machines such as Jacobsville have
a level of hardware topology in which a group of CPU cores, typically 4,
shares L3 tags or L2 cache.

That means spreading tasks between clusters brings more memory bandwidth
and decreases cache contention, while packing related tasks within a cluster
can decrease the latency of cache synchronization between them.

We have three series to bring up a cluster-level scheduler in the kernel.
This is the first.

1st series (this one): make the kernel aware of clusters, expose clusters in
the sysfs ABI, and add SCHED_CLUSTER, which load-balances among clusters to
benefit many workloads. Testing shows this can boost performance
substantially; for example, it improves SPECrate mcf by 25.1% on Jacobsville
and by 13.574% on kunpeng920. (A small userspace example of reading the new
sysfs files follows this list.)

2nd series (packing path): modify wake_affine so that the kernel selects
CPUs within a cluster first before scanning the whole LLC, so that we
benefit from the lower communication latency within a single cluster. This
series is much trickier, so we would like to send it after the 1st series
settles down. (A rough sketch of the idea also follows this list.)
Prototype here:
https://op-lists.linaro.org/pipermail/linaro-open-discussions/2021-June/000219.html

3rd series: a sysctl from Tim Chen to permit users to enable or disable the
cluster scheduler at run time. Prototype here:
Add run time sysctl to enable/disable cluster scheduling
https://op-lists.linaro.org/pipermail/linaro-open-discussions/2021-July/000258.html
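
As a quick illustration of the new ABI from the 1st series, below is a
minimal userspace sketch that reads the cluster topology files documented in
sysfs-devices-system-cpu. It assumes cpu0 exists and the kernel carries this
series; on a machine with 4-core clusters, cluster_cpus_list would be
expected to show a 4-CPU span such as 0-3:

	/* Minimal sketch: dump the cluster topology files this series
	 * documents under /sys/devices/system/cpu/cpuX/topology/. */
	#include <stdio.h>

	static void show(const char *path)
	{
		char buf[64];
		FILE *f = fopen(path, "r");

		if (!f) {
			perror(path);
			return;
		}
		if (fgets(buf, sizeof(buf), f))
			printf("%-55s %s", path, buf);
		fclose(f);
	}

	int main(void)
	{
		show("/sys/devices/system/cpu/cpu0/topology/cluster_id");
		show("/sys/devices/system/cpu/cpu0/topology/cluster_cpus");
		show("/sys/devices/system/cpu/cpu0/topology/cluster_cpus_list");
		return 0;
	}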
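
For the 2nd series, the packing idea can be sketched roughly as below. This
is an illustrative outline only, not the prototype's code; cpus_in_cluster()
and cpus_in_llc() are hypothetical helpers standing in for the real cpumask
lookups:

	/* Illustrative sketch of cluster-first idle CPU selection.
	 * cpus_in_cluster() and cpus_in_llc() are hypothetical stand-ins
	 * for the real cpumask lookups. */
	static int select_idle_cpu_cluster_first(struct task_struct *p,
						 int target)
	{
		int cpu;

		/* Prefer the target's cluster: communication there is
		 * cheapest. */
		for_each_cpu_and(cpu, cpus_in_cluster(target), p->cpus_ptr)
			if (available_idle_cpu(cpu))
				return cpu;

		/* Nothing idle nearby: fall back to scanning the LLC. */
		for_each_cpu_and(cpu, cpus_in_llc(target), p->cpus_ptr)
			if (available_idle_cpu(cpu))
				return cpu;

		return target;
	}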

This series is rebased on Greg's driver-core-next, which carries the update
to the topology sysfs ABI.

-V1:
 differences from RFC v6
 * rebased on top of the latest update in the topology sysfs ABI of Greg's
   driver-core-next
 * removed the wake_affine path modification, which will be sent separately
   as the 2nd series
 * cluster_id is obtained by detecting a valid ID before falling back to
   using an offset
 * lots of benchmark data from both x86 Jacobsville and ARM64 kunpeng920

-RFC v6:
https://lore.kernel.org/lkml/20210420001844.9116-1-song.bao.hua@hisilicon.com/

Barry Song (1):
  scheduler: Add cluster scheduler level in core and related Kconfig for
    ARM64

Jonathan Cameron (1):
  topology: Represent clusters of CPUs within a die

Tim Chen (1):
  scheduler: Add cluster scheduler level for x86

 .../ABI/stable/sysfs-devices-system-cpu       | 15 +++++
 Documentation/admin-guide/cputopology.rst     | 12 ++--
 arch/arm64/Kconfig                            |  7 ++
 arch/arm64/kernel/topology.c                  |  2 +
 arch/x86/Kconfig                              |  8 +++
 arch/x86/include/asm/smp.h                    |  7 ++
 arch/x86/include/asm/topology.h               |  3 +
 arch/x86/kernel/cpu/cacheinfo.c               |  1 +
 arch/x86/kernel/cpu/common.c                  |  3 +
 arch/x86/kernel/smpboot.c                     | 44 +++++++++++-
 drivers/acpi/pptt.c                           | 67 +++++++++++++++++++
 drivers/base/arch_topology.c                  | 14 ++++
 drivers/base/topology.c                       | 10 +++
 include/linux/acpi.h                          |  5 ++
 include/linux/arch_topology.h                 |  5 ++
 include/linux/sched/topology.h                |  7 ++
 include/linux/topology.h                      | 13 ++++
 kernel/sched/topology.c                       |  5 ++
 18 files changed, 223 insertions(+), 5 deletions(-)
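
For orientation, the scheduler-side hook is deliberately small: the new
CONFIG_SCHED_CLUSTER option (added for both arm64 and x86 above) gates one
extra level in the scheduler's topology table, sitting between SMT and MC.
A sketch of the shape this takes in kernel/sched/topology.c, with the flag
details left to the patches themselves:

	/* Sketch of the cluster level in the default topology table;
	 * the CLS entry is the one this series adds, gated by
	 * CONFIG_SCHED_CLUSTER.  cpu_clustergroup_mask() returns the
	 * CPUs sharing the cluster (e.g. an L3-tag or L2 group). */
	static struct sched_domain_topology_level default_topology[] = {
	#ifdef CONFIG_SCHED_SMT
		{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
	#endif
	#ifdef CONFIG_SCHED_CLUSTER
		{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
	#endif
	#ifdef CONFIG_SCHED_MC
		{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
	#endif
		{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
		{ NULL, },
	};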

Comments

Tim Chen Aug. 23, 2021, 5:49 p.m. UTC | #1
On 8/19/21 6:30 PM, Barry Song wrote:
> From: Tim Chen <tim.c.chen@linux.intel.com>
>
> There are x86 CPU architectures (e.g. Jacobsville) where L2 cache is
> shared among a cluster of cores instead of being exclusive to one
> single core.
> To prevent oversubscription of L2 cache, load should be balanced
> between such L2 clusters, especially for tasks with no shared data.
> On benchmarks such as the SPECrate mcf test, this change provides a
> boost to performance, especially on a medium-load system on Jacobsville.
> On a Jacobsville that has 24 Atom cores, arranged into 6 clusters
> of 4 cores each, the benchmark numbers are as follows:
>
>  Improvement over baseline kernel for mcf_r
>  copies	run time	base rate
>  1		-0.1%		-0.2%
>  6		25.1%		25.1%
>  12		18.8%		19.0%
>  24		0.3%		0.3%
>
> So this looks pretty good. In terms of the system's task distribution,
> some pretty bad clumping can be seen for the vanilla kernel without
> the L2 cluster domain in the 6 and 12 copies cases. With the extra
> cluster domain, the load does get evened out between the clusters.
>
> Note this patch isn't a universal win, as spreading isn't necessarily
> a win, particularly for those workloads which can benefit from packing.

I have another patch set to make cluster scheduling selectable at run time
and boot time. I would like to see people's feedback on this patch set
first before sending that out.

Thanks.

Tim