[RFC,v6,2/4] scheduler: add scheduler level for clusters

ARM64 chip Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
cluster has 4 cpus. All clusters share L3 cache data, but each cluster
has local L3 tag. On the other hand, each clusters will share some
internal system bus. This means cache coherence overhead inside one
cluster is much less than the overhead across clusters.

This patch adds the sched_domain for clusters. On kunpeng 920, without
this patch, domain0 of cpu0 would be MC with cpu0~cpu23 with ; with this
patch, MC becomes domain1, a new domain0 "CLS" including cpu0-cpu3.

This will help spread unrelated tasks among clusters, thus decrease the
contention and improve the throughput, for example, stream benchmark can
improve 20%+ while parallelism is 6 and improve around 5% while paralle-
lism is 12:

(1) -P <parallelism> 6
$ numactl -N 0 /usr/lib/lmbench/bin/stream -P 6 -M 1024M -N 5

w/o patch:
STREAM copy latency: 2.46 nanoseconds
STREAM copy bandwidth: 39096.28 MB/sec
STREAM scale latency: 2.46 nanoseconds
STREAM scale bandwidth: 38970.26 MB/sec
STREAM add latency: 4.45 nanoseconds
STREAM add bandwidth: 32332.04 MB/sec
STREAM triad latency: 4.07 nanoseconds
STREAM triad bandwidth: 35387.69 MB/sec

w/ patch:
STREAM copy latency: 2.02 nanoseconds
STREAM copy bandwidth: 47604.47 MB/sec   +21.7%
STREAM scale latency: 2.04 nanoseconds
STREAM scale bandwidth: 47066.84 MB/sec  +20.8%
STREAM add latency: 3.35 nanoseconds
STREAM add bandwidth: 42942.15 MB/sec    +32.8%
STREAM triad latency: 3.16 nanoseconds
STREAM triad bandwidth: 45619.18 MB/sec  +28.9%

On the other hand,stream result could change significantly during different
tests without the patch, eg:
a.
STREAM copy latency: 2.16 nanoseconds
STREAM copy bandwidth: 44448.45 MB/sec
STREAM scale latency: 2.17 nanoseconds
STREAM scale bandwidth: 44320.77 MB/sec
STREAM add latency: 3.77 nanoseconds
STREAM add bandwidth: 38230.54 MB/sec
STREAM triad latency: 3.88 nanoseconds
STREAM triad bandwidth: 37072.10 MB/sec

b.
STREAM copy latency: 2.16 nanoseconds
STREAM copy bandwidth: 44403.22 MB/sec
STREAM scale latency: 2.39 nanoseconds
STREAM scale bandwidth: 40173.69 MB/sec
STREAM add latency: 3.77 nanoseconds
STREAM add bandwidth: 38232.56 MB/sec
STREAM triad latency: 3.38 nanoseconds
STREAM triad bandwidth: 42592.04 MB/sec

Obviously it is because the 6 threads are put randomly in 6 cores. Sometimes
they are packed in clusters, sometimes they are spread widely.

(2) -P <parallelism> 12
$ numactl -N 0 /usr/lib/lmbench/bin/stream -P 12 -M 1024M -N 5

w/o patch:
STREAM copy latency: 3.37 nanoseconds
STREAM copy bandwidth: 57008.80 MB/sec
STREAM scale latency: 3.38 nanoseconds
STREAM scale bandwidth: 56848.47 MB/sec
STREAM add latency: 5.50 nanoseconds
STREAM add bandwidth: 52398.62 MB/sec
STREAM triad latency: 5.09 nanoseconds
STREAM triad bandwidth: 56591.60 MB/sec

w/ patch:
STREAM copy latency: 3.24 nanoseconds
STREAM copy bandwidth: 59338.60 MB/sec  +4.1%
STREAM scale latency: 3.25 nanoseconds
STREAM scale bandwidth: 58993.23 MB/sec +3.7%
STREAM add latency: 5.19 nanoseconds
STREAM add bandwidth: 55517.45 MB/sec   +5.9%
STREAM triad latency: 4.86 nanoseconds
STREAM triad bandwidth: 59245.34 MB/sec +4.7%

Obviously the load balance between clusters help improve the parallelism
of unrelated tasks.

To evaluate the performance impact to related tasks talking with each
other, we run the below hackbench with different -g parameter from 6
to 32 in a NUMA node with 24 cores, for each different g, we run the
command 20 times and get the average time:
$ numactl -N 0 hackbench -p -T -l 1000000 -f 1 -g $1
As -f is set to 1, this means all threads are talking with each other
monogamously.

hackbench will report the time which is needed to complete a certain number
of messages transmissions between a certain number of tasks, for example:
$ numactl -N 0 hackbench -p -T -l 1000000 -f 1 -g 6
Running in threaded mode with 6 groups using 2 file descriptors each (== 12 tasks)
Each sender will pass 1000000 messages of 100 bytes

The below is the result of hackbench w/ and w/o the patch:
g=    6       12    18      24    28     32
w/o: 1.2474 1.5635 1.5133 1.4796 1.6177 1.7898
w/ : 1.1458 1.3309 1.3416 1.4990 1.9212 2.3411

It seems this patch benefits hackbench when the load is relatively low,
while it hurts hackbench much when the load is relatively high(56 and
64 threads in 24 cores).

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 arch/arm64/Kconfig             |  7 +++++++
 include/linux/sched/cluster.h  | 19 +++++++++++++++++++
 include/linux/sched/sd_flags.h |  9 +++++++++
 include/linux/sched/topology.h |  7 +++++++
 include/linux/topology.h       |  7 +++++++
 kernel/sched/core.c            | 20 ++++++++++++++++++++
 kernel/sched/fair.c            |  4 ++++
 kernel/sched/sched.h           |  1 +
 kernel/sched/topology.c        |  6 ++++++
 9 files changed, 80 insertions(+)
 create mode 100644 include/linux/sched/cluster.h

Message ID	20210420001844.9116-3-song.bao.hua@hisilicon.com
State	New
Headers	show Return-Path: <linux-acpi-owner@kernel.org> X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id F130FC433ED for <linux-acpi@archiver.kernel.org>; Tue, 20 Apr 2021 00:26:19 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id CDCFD6135F for <linux-acpi@archiver.kernel.org>; Tue, 20 Apr 2021 00:26:19 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230033AbhDTA0s (ORCPT <rfc822;linux-acpi@archiver.kernel.org>); Mon, 19 Apr 2021 20:26:48 -0400 Received: from szxga04-in.huawei.com ([45.249.212.190]:16137 "EHLO szxga04-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229936AbhDTA0r (ORCPT <rfc822; linux-acpi@vger.kernel.org>); Mon, 19 Apr 2021 20:26:47 -0400 Received: from DGGEMS407-HUB.china.huawei.com (unknown [172.30.72.58]) by szxga04-in.huawei.com (SkyGuard) with ESMTP id 4FPPY01jmkzmdWh; Tue, 20 Apr 2021 08:23:16 +0800 (CST) Received: from SWX921481.china.huawei.com (10.126.200.79) by DGGEMS407-HUB.china.huawei.com (10.3.19.207) with Microsoft SMTP Server id 14.3.498.0; Tue, 20 Apr 2021 08:26:06 +0800 From: Barry Song <song.bao.hua@hisilicon.com> To: <tim.c.chen@linux.intel.com>, <catalin.marinas@arm.com>, <will@kernel.org>, <rjw@rjwysocki.net>, <vincent.guittot@linaro.org>, <bp@alien8.de>, <tglx@linutronix.de>, <mingo@redhat.com>, <lenb@kernel.org>, <peterz@infradead.org>, <dietmar.eggemann@arm.com>, <rostedt@goodmis.org>, <bsegall@google.com>, <mgorman@suse.de> CC: <msys.mizuma@gmail.com>, <valentin.schneider@arm.com>, <gregkh@linuxfoundation.org>, <jonathan.cameron@huawei.com>, <juri.lelli@redhat.com>, <mark.rutland@arm.com>, <sudeep.holla@arm.com>, <aubrey.li@linux.intel.com>, <linux-arm-kernel@lists.infradead.org>, <linux-kernel@vger.kernel.org>, <linux-acpi@vger.kernel.org>, <x86@kernel.org>, <xuwei5@huawei.com>, <prime.zeng@hisilicon.com>, <guodong.xu@linaro.org>, <yangyicong@huawei.com>, <liguozhu@hisilicon.com>, <linuxarm@openeuler.org>, <hpa@zytor.com>, Barry Song <song.bao.hua@hisilicon.com> Subject: [RFC PATCH v6 2/4] scheduler: add scheduler level for clusters Date: Tue, 20 Apr 2021 12:18:42 +1200 Message-ID: <20210420001844.9116-3-song.bao.hua@hisilicon.com> X-Mailer: git-send-email 2.21.0.windows.1 In-Reply-To: <20210420001844.9116-1-song.bao.hua@hisilicon.com> References: <20210420001844.9116-1-song.bao.hua@hisilicon.com> MIME-Version: 1.0 Content-Transfer-Encoding: 7BIT Content-Type: text/plain; charset=US-ASCII X-Originating-IP: [10.126.200.79] X-CFilter-Loop: Reflected Precedence: bulk List-ID: <linux-acpi.vger.kernel.org> X-Mailing-List: linux-acpi@vger.kernel.org
Series	scheduler: expose the topology of clusters and add cluster scheduler \| expand [RFC,v6,0/4] scheduler: expose the topology of clusters and add cluster scheduler [RFC,v6,1/4] topology: Represent clusters of CPUs within a die [RFC,v6,2/4] scheduler: add scheduler level for clusters [RFC,v6,4/4] scheduler: Add cluster scheduler level for x86

[RFC,v6,2/4] scheduler: add scheduler level for clusters

Commit Message

Patch