[v4] ARM: Improve armv7 memcpy performance.

Message ID	5236C33A.7080802@linaro.org
State	Accepted
Headers
Return-Path: <patchwork-forward+bncBCXPZFGDUEJBBQEG3OIQKGQEGK75SNY@linaro.org>
Received-SPF: neutral (google.com: 209.85.220.177 is neither permitted nor
	denied by best guess record for domain of
	patch+caf_=patchwork-forward=linaro.org@linaro.org)
	client-ip=209.85.220.177;
Received-SPF: neutral (google.com: 209.85.214.53 is neither permitted nor
	denied by best guess record for domain of
	will.newton@linaro.org) client-ip=209.85.214.53;
Message-ID: <5236C33A.7080802@linaro.org>
Date: Mon, 16 Sep 2013 09:37:14 +0100
From: Will Newton <will.newton@linaro.org>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
	rv:17.0) Gecko/20130805 Thunderbird/17.0.8
MIME-Version: 1.0
To: libc-ports@sourceware.org
CC: patches@linaro.org
Subject: [PATCH v4] ARM: Improve armv7 memcpy performance.
Precedence: list
Mailing-list: list patchwork-forward@linaro.org;
	contact patchwork-forward+owners@linaro.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Commit Message

Will Newton Sept. 16, 2013, 8:37 a.m. UTC

Only enter the aligned copy loop with buffers that can be 8-byte
aligned. This improves performance slightly on Cortex-A9 and
Cortex-A15 cores for large copies with buffers that are 4-byte
aligned but not 8-byte aligned.

ports/ChangeLog.arm:

2013-08-30  Will Newton  <will.newton@linaro.org>

	* sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
	on entry to aligned copy loop to improve performance.
---
 ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

Changes in v4:
 - More comment fixes

The output of the cortex-strings benchmark can be found here (where "this" is the new code
and "old" is the previous version):

http://people.linaro.org/~will.newton/glibc_memcpy/
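
For readers who do not follow ARM assembly, the tightened entry check
described in the commit message can be sketched in C. This is only an
illustration of the idea, not the code in memcpy_impl.S: the function
names are hypothetical, and addresses are passed as plain integers to
keep the sketch self-contained.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Old entry condition (illustrative): SRC and DST agree in their low
   two bits, i.e. they share the same 32-bit alignment phase.  */
static bool same_phase_4(uintptr_t dst, uintptr_t src)
{
    return (dst & 3) == (src & 3);
}

/* New entry condition (illustrative): SRC and DST agree in their low
   three bits, so aligning one to 8 bytes also aligns the other.  */
static bool same_phase_8(uintptr_t dst, uintptr_t src)
{
    return (dst & 7) == (src & 7);
}

/* Number of bytes to pre-copy before ADDR reaches an 8-byte boundary.  */
static size_t bytes_to_align_8(uintptr_t addr)
{
    return (size_t)(-addr & 7);
}
```

With a destination ending in 0x4 and a source ending in 0x8, the old
condition holds but the new one does not, so under the new check such a
pair no longer enters the aligned copy loop.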

Comments

Joseph Myers Sept. 16, 2013, 3:24 p.m. UTC | #1

On Mon, 16 Sep 2013, Will Newton wrote:

> Only enter the aligned copy loop with buffers that can be 8-byte
> aligned. This improves performance slightly on Cortex-A9 and
> Cortex-A15 cores for large copies with buffers that are 4-byte
> aligned but not 8-byte aligned.
> 
> ports/ChangeLog.arm:
> 
> 2013-08-30  Will Newton  <will.newton@linaro.org>
> 
> 	* sysdeps/arm/armv7/multiarch/memcpy_impl.S: Tighten check
> 	on entry to aligned copy loop to improve performance.

OK.

diff --git a/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S b/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S
index 3decad6..ad43a3d 100644
--- a/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S
+++ b/ports/sysdeps/arm/armv7/multiarch/memcpy_impl.S
@@ -24,7 +24,6 @@ 
     ARMv6 (ARMv7-a if using Neon)
     ARM state
     Unaligned accesses
-    LDRD/STRD support unaligned word accesses

  */

@@ -369,8 +368,8 @@  ENTRY(memcpy)
 	cfi_adjust_cfa_offset (FRAME_SIZE)
 	cfi_rel_offset (tmp2, 0)
 	cfi_remember_state
-	and	tmp2, src, #3
-	and	tmp1, dst, #3
+	and	tmp2, src, #7
+	and	tmp1, dst, #7
 	cmp	tmp1, tmp2
 	bne	.Lcpy_notaligned

@@ -381,9 +380,9 @@  ENTRY(memcpy)
 	vmov.f32	s0, s0
 #endif

-	/* SRC and DST have the same mutual 32-bit alignment, but we may
+	/* SRC and DST have the same mutual 64-bit alignment, but we may
 	   still need to pre-copy some bytes to get to natural alignment.
-	   We bring DST into full 64-bit alignment.  */
+	   We bring SRC and DST into full 64-bit alignment.  */
 	lsls	tmp2, dst, #29
 	beq	1f
 	rsbs	tmp2, tmp2, #0
@@ -515,7 +514,7 @@  ENTRY(memcpy)

 .Ltail63aligned:			/* Count in tmp2.  */
 	/* Copy up to 7 d-words of data.  Similar to Ltail63unaligned, but
-	   we know that the src and dest are 32-bit aligned so we can use
+	   we know that the src and dest are 64-bit aligned so we can use
 	   LDRD/STRD to improve efficiency.  */
 	/* TMP2 is now negative, but we don't care about that.  The bottom
 	   six bits still tell us how many bytes are left to copy.  */
