
ARM64: Improve copy_page for 128-byte cache line sizes.

Message ID 1450570278-19404-1-git-send-email-apinski@cavium.com
State New

Commit Message

Andrew Pinski Dec. 20, 2015, 12:11 a.m. UTC
Adding a check for the cache line size is not much overhead.
Special-case the 128-byte cache line size.
This improves copy_page by 85% on ThunderX compared to the
original implementation.

For LMBench, results improve by 4-10%.

Signed-off-by: Andrew Pinski <apinski@cavium.com>

---
 arch/arm64/lib/copy_page.S |   39 +++++++++++++++++++++++++++++++++++++++
 1 files changed, 39 insertions(+), 0 deletions(-)

-- 
1.7.2.5


Comments

Will Deacon Dec. 21, 2015, 12:46 p.m. UTC | #1
On Sat, Dec 19, 2015 at 04:11:18PM -0800, Andrew Pinski wrote:
> Adding a check for the cache line size is not much overhead.
> Special-case the 128-byte cache line size.
> This improves copy_page by 85% on ThunderX compared to the
> original implementation.

So this patch seems to:

  - Align the loop
  - Increase the prefetch size
  - Unroll the loop once

Do you know where your 85% boost comes from between these? I'd really
like to avoid having multiple versions of copy_page, if possible, but
maybe we could end up with something that works well enough regardless
of cacheline size. Understanding what your bottleneck is would help to
lead us in the right direction.

Also, how are you measuring the improvement? If you can share your
test somewhere, I can see how it affects the other systems I have access
to.

Will
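The test behind the numbers above is not included in the thread. As a rough, hypothetical illustration of the kind of user-space measurement that could be rerun on other systems, the sketch below times repeated 4 KiB copies between page-aligned buffers. PAGE_SZ, NPAGES and ITERS are arbitrary choices, and in-kernel copy_page() is only reached indirectly (e.g. through fork()/copy-on-write), so this approximates page-copy bandwidth rather than reproducing the author's benchmark.

/*
 * Hypothetical page-copy bandwidth sketch (not the test used for the
 * numbers quoted in the patch).  Copies NPAGES pages from src to dst,
 * ITERS times, and reports throughput.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define PAGE_SZ	4096
#define NPAGES	4096		/* 16 MiB working set */
#define ITERS	100

int main(void)
{
	char *src, *dst;
	struct timespec t0, t1;
	double secs, mib;
	int i, p;

	if (posix_memalign((void **)&src, PAGE_SZ, NPAGES * PAGE_SZ) ||
	    posix_memalign((void **)&dst, PAGE_SZ, NPAGES * PAGE_SZ))
		return 1;
	memset(src, 0x5a, NPAGES * PAGE_SZ);	/* fault source pages in */
	memset(dst, 0, NPAGES * PAGE_SZ);	/* fault destination pages in */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ITERS; i++)
		for (p = 0; p < NPAGES; p++)
			memcpy(dst + p * PAGE_SZ, src + p * PAGE_SZ, PAGE_SZ);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	/* Sanity-check one byte so the copies stay observable. */
	if (dst[123] != 0x5a)
		puts("copy mismatch");

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	mib = (double)ITERS * NPAGES * PAGE_SZ / (1024.0 * 1024.0);
	printf("%.1f MiB copied in %.3f s (%.1f MiB/s)\n", mib, secs, mib / secs);
	return 0;
}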

Patch

diff --git a/arch/arm64/lib/copy_page.S b/arch/arm64/lib/copy_page.S
index 512b9a7..4c28789 100644
--- a/arch/arm64/lib/copy_page.S
+++ b/arch/arm64/lib/copy_page.S
@@ -18,6 +18,7 @@ 
 #include <linux/const.h>
 #include <asm/assembler.h>
 #include <asm/page.h>
+#include <asm/cachetype.h>
 
 /*
  * Copy a page from src to dest (both are page aligned)
@@ -27,8 +28,17 @@ 
  *	x1 - src
  */
 ENTRY(copy_page)
+	/* Special-case cache lines of 128 bytes or more. */
+	mrs	x2, ctr_el0
+	lsr	x2, x2, CTR_CWG_SHIFT
+	and	w2, w2, CTR_CWG_MASK
+	cmp	w2, 5
+	b.ge    2f
+
 	/* Assume cache line size is 64 bytes. */
 	prfm	pldl1strm, [x1, #64]
+	/* Align the loop so it fits in one cache line. */
+	.balign 64
 1:	ldp	x2, x3, [x1]
 	ldp	x4, x5, [x1, #16]
 	ldp	x6, x7, [x1, #32]
@@ -43,4 +53,33 @@  ENTRY(copy_page)
 	tst	x1, #(PAGE_SIZE - 1)
 	b.ne	1b
 	ret
+
+2:
+	/* The cache line size is at least 128 bytes. */
+	prfm	pldl1strm, [x1, #128]
+	/* Align the loop so it fits in one cache line. */
+	.balign 128
+1:	prfm	pldl1strm, [x1, #256]
+	ldp	x2, x3, [x1]
+	ldp	x4, x5, [x1, #16]
+	ldp	x6, x7, [x1, #32]
+	ldp	x8, x9, [x1, #48]
+	stnp	x2, x3, [x0]
+	stnp	x4, x5, [x0, #16]
+	stnp	x6, x7, [x0, #32]
+	stnp	x8, x9, [x0, #48]
+
+	ldp	x2, x3, [x1, #64]
+	ldp	x4, x5, [x1, #80]
+	ldp	x6, x7, [x1, #96]
+	ldp	x8, x9, [x1, #112]
+	add	x1, x1, #128
+	stnp	x2, x3, [x0, #64]
+	stnp	x4, x5, [x0, #80]
+	stnp	x6, x7, [x0, #96]
+	stnp	x8, x9, [x0, #112]
+	add	x0, x0, #128
+	tst	x1, #(PAGE_SIZE - 1)
+	b.ne	1b
+	ret
 ENDPROC(copy_page)
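For reference, the check added at the top of copy_page decodes the CWG (cache writeback granule) field of CTR_EL0: bits [27:24] hold log2 of the granule size in 4-byte words, so the granule in bytes is 4 << CWG and a field value of 5 or more means at least 128 bytes, which is what the "cmp w2, 5; b.ge 2f" sequence tests. A minimal C sketch of the same decoding (illustration only; the CTR_CWG_SHIFT/CTR_CWG_MASK values below match the architectural field position used by the patch via asm/cachetype.h):

/*
 * Sketch of the CTR_EL0.CWG decoding performed by the new check in
 * copy_page (illustration, not kernel code).
 */
#include <stdint.h>
#include <stdio.h>

#define CTR_CWG_SHIFT	24
#define CTR_CWG_MASK	15

static unsigned int cwg_bytes(uint64_t ctr_el0)
{
	unsigned int cwg = (ctr_el0 >> CTR_CWG_SHIFT) & CTR_CWG_MASK;

	return 4U << cwg;	/* granule in bytes: 4-byte words, log2 encoded */
}

int main(void)
{
	/* Example field values: CWG = 4 (64-byte granule), CWG = 5 (128-byte). */
	printf("CWG=4 -> %u bytes\n", cwg_bytes((uint64_t)4 << CTR_CWG_SHIFT));
	printf("CWG=5 -> %u bytes\n", cwg_bytes((uint64_t)5 << CTR_CWG_SHIFT));
	return 0;
}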