[ARM] Optimised strchr and strlen

Message ID 20111219172122.GA10120@davesworkthinkpad
State New
Headers show

Commit Message

Dr. David Alan Gilbert Dec. 19, 2011, 5:21 p.m.
This is a strchr and strlen optimised for ARM v6t2 or v7. It's against svn rev r15869
with my previous memchr patch.   Tested both little & big endian.

(I've checked it still applies on svn trunk, but not done a retest on that; nothing
seems to have changed around there).

Dave

2012-12-19 Dr. David Alan Gilbert <david.gilbert@linaro.org>
	* sysdeps/arm/eabi/armv6t2/strchr.S: New file
	* sysdeps/arm/eabi/armv6t2/strlen.S: New file

Comments

Richard Henderson Dec. 20, 2011, 8:29 p.m. | #1
On 12/19/2011 09:21 AM, Dr. David Alan Gilbert wrote:
> +	@ r1 currently points to the 2nd byte of the word containing the 0
> +	tst	r2, # CHARTSTMASK(0)	@ 1st character
> +	bne	10f
> +	adds	r1,r1,#1
> +	tst	r2, # CHARTSTMASK(1)	@ 2nd character
> +	ittt	eq
> +	addeq	r1,r1,#1
> +	tsteq	r2, # (3<<15)		@ 2nd & 3rd character
> +	@ If not the 3rd must be the last one
> +	addeq	r1,r1,#1

No need to search -- clz can do that for you.

#ifdef __ARMEL__
	rbit	r2, r2
#endif
	clz	r2, r2
	lsrs	r2, r2, #3
	adds	r1, r1, r2


r~
Joseph Myers Dec. 20, 2011, 11:07 p.m. | #2
On Mon, 19 Dec 2011, Dr. David Alan Gilbert wrote:

> +@ A very simple strchr routine, from benchmarks on A9 it's a bit faster than
> +@ the current version in eglibc.
> +@ While I have a version that does 8 bytes/loop and is a lot faster on long
> +@ strings, it is slower on short strings, and short strings seem more common
> +@ in strchr usage.

That sounds like a possible case for a hybrid function, with an unrolled 
initial part testing some number of characters to cover short strings (it 
might be possible to get things aligned at the same time) and the more 
complicated version for longer strings.  What's the actual size 
distribution you see in strchr use?
Dr. David Alan Gilbert Dec. 21, 2011, 10:55 a.m. | #3
On 20 December 2011 23:07, Joseph S. Myers <joseph@codesourcery.com> wrote:
> On Mon, 19 Dec 2011, Dr. David Alan Gilbert wrote:
>
>> +@ A very simple strchr routine, from benchmarks on A9 it's a bit faster than
>> +@ the current version in eglibc.
>> +@ While I have a version that does 8 bytes/loop and is a lot faster on long
>> +@ strings, it is slower on short strings, and short strings seem more common
>> +@ in strchr usage.
>
> That sounds like a possible case for a hybrid function, with an unrolled
> initial part testing some number of characters to cover short strings (it
> might be possible to get things aligned at the same time) and the more
> complicated version for longer strings.

Of course the difficulty with strchr (compared with memchr) is that you have
no length parameter to hint at how much is left, and thus whether it's
worth making that switch; so it has to be a heuristic and it's going to cost
you something in the small case.

>  What's the actual size distribution you see in strchr use?

It varies heavily by program.  I only found one of the SPEC benchmarks using
it to a measurable amount (i.e. showed up in profile), and that was
mostly in 24byte strings
with the match at random positions within the string.

From some embedded benchmarks I found that almost all the uses of strchr
were calls with less than 8 byte strings (mostly unexpected fallout
from within libc rather
than the meat of the benchmark).
One of the things people commonly seem to do with strchr is see whether a
character is in a set, and run that strchr along a string - e.g. something like

char *mystr=....
char tmp;

for(tmp=*mystr;tmp;mystr++) {
   tmp=*(mystr++);
   if (strchr(" \t\n\r", tmp)!=NULL) { do something }
}

With the string being searched depressingly small.   There are some places where
you might hope the compiler could be smart and do something with that,
although often the string being searched is actually a variable (think
$IFS in shell).
(I fancy something like a strchr_short with the compiler calling that
when it knows
the string is less than some length - but how do you define that
length between the
compiler and library?)

The other thing I found in some examples was splitting env strings; so
searching for
the '=' in NAME=VALUE, the NAME parts tend not to be particularly long.

The worst case tends to be where you're using strchr on a long string and you
don't actually find the match.

A ltrace of gcc's cc1 shows a few of the cases - e.g.

Mostly short identifiers:

strchr("nothrow", ' ')                           = NULL
strchr("final", ' ')                             = NULL
strchr("dfinish", ' ')                           = NULL


This is a case that can get long - finding '.' in filenames; this can
be one where
you get longer strings without  a match.

strchr("/usr/local/include", '.')                = NULL
strchr("/usr/local/include", '.')                = NULL
strchr("/usr/lib/arm-linux-gnueabi/gcc/arm-linux-gnueabi/4.5.2/include",
'.') = ".5.2/include"

I suspect something like parsing a format string:
strchr("-+ #0", 's')                             = NULL

Splitting assignments:
strchr("__UINT16_C(c)=c", '=')                   = "=c"
strchr("__UINT_LEAST32_MAX__=4294967295U", '=')  = "=4294967295U"


Here's a profile graph of different strlen's on an ARM:
https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialStrchr?action=AttachFile&do=get&target=panda-01-strchr-git44154ec-strchr-abs.png

That 'simple' one is showing the benefit at the short lengths,
the 'smarter' one I have is doing 8 bytes/loop and is nice on the long
strings - but as you can see worse at the short ones.

Dave
Dr. David Alan Gilbert Dec. 21, 2011, 10:56 a.m. | #4
On 20 December 2011 20:29, Richard Henderson <rth@twiddle.net> wrote:
> On 12/19/2011 09:21 AM, Dr. David Alan Gilbert wrote:
>> +     @ r1 currently points to the 2nd byte of the word containing the 0
>> +     tst     r2, # CHARTSTMASK(0)    @ 1st character
>> +     bne     10f
>> +     adds    r1,r1,#1
>> +     tst     r2, # CHARTSTMASK(1)    @ 2nd character
>> +     ittt    eq
>> +     addeq   r1,r1,#1
>> +     tsteq   r2, # (3<<15)           @ 2nd & 3rd character
>> +     @ If not the 3rd must be the last one
>> +     addeq   r1,r1,#1
>
> No need to search -- clz can do that for you.
>
> #ifdef __ARMEL__
>        rbit    r2, r2
> #endif
>        clz     r2, r2
>        lsrs    r2, r2, #3
>        adds    r1, r1, r2

Hi Richard,
  Thanks for the suggestion - I'd seen that form in some code from another
library and decided because of that not to roll it into my code; however
since there have now been 3 separate people who've suggested it to me
independently I don't see why not!

Thanks,

Dave
Richard Henderson Dec. 21, 2011, 9:20 p.m. | #5
On 12/21/2011 02:55 AM, David Gilbert wrote:
> That 'simple' one is showing the benefit at the short lengths,
> the 'smarter' one I have is doing 8 bytes/loop and is nice on the long
> strings - but as you can see worse at the short ones.

Having not seen your "smarter" strchr, it's hard to suggest anything
concrete.  I'd have thought that there's enough slack in load delay
that one or two arithmetic operations could be done without penalty...

Something like performing a simple compare loop looking for "alignment plus":

...
	bic	r3, r0, #7
	and	r1, r1, #255
	adds	r3, r3, #32
1:
	ldrb	r2, [r0],#1
	cmp	r2, r1
	cbz	r2, .Lfound_zero
	it	ne
	cmpne	r0, r3
	bne	1b
	cmp	r2, r1
	beq	.Lfound
	@ Here, r0 is aligned.  Do something word-based.
...

or even just

	and	r3, r0, #7
	and	r1, r1, #255
	rsb	r3, r3, #32
1:
	ldrb	r2, [r0],#1
	cmp	r2, r1
	beq	.Lfound
	subs	r3, r3, #1
	cbz	r2, .Lfound_zero
	bne	1b
	@ Here, r0 is aligned.  Do something word-based.


r~

Patch hide | download patch | download mbox

diff -urN ports/sysdeps/arm/eabi/armv6t2/strchr.S src/ports/sysdeps/arm/eabi/armv6t2/strchr.S
--- ports/sysdeps/arm/eabi/armv6t2/strchr.S	1970-01-01 01:00:00.000000000 +0100
+++ ports/sysdeps/arm/eabi/armv6t2/strchr.S	2011-12-16 13:43:56.704694919 +0000
@@ -0,0 +1,71 @@ 
+/* Copyright (C) 2011 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+   Code contributed by Dave Gilbert <david.gilbert@linaro.org>
+ 
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, write to the Free
+   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
+   02111-1307 USA.  */
+
+#include <sysdep.h>
+
+@ A very simple strchr routine, from benchmarks on A9 it's a bit faster than
+@ the current version in eglibc.
+@ While I have a version that does 8 bytes/loop and is a lot faster on long
+@ strings, it is slower on short strings, and short strings seem more common
+@ in strchr usage.
+@ Note: The use of cbz/cbnz means it's Thumb only
+
+@ 2011-02-07 david.gilbert@linaro.org
+@    Extracted from local git a5b438d861
+@ 2011-12-16 david.gilbert@linaro.org
+@    Copy from Cortex strings rev 65 and change license
+
+	.syntax unified
+
+	.text
+	.thumb
+
+@ ---------------------------------------------------------------------------
+
+	.thumb_func
+	.global strchr
+	.type strchr,%function
+ENTRY(strchr)
+	@ r0 = start of string
+	@ r1 = character to match
+	@ returns NULL for no match, or a pointer to the match
+	and	r1,r1, #255
+
+1:
+	ldrb	r2,[r0],#1
+	cmp	r2,r1
+	cbz	r2,10f
+	bne	1b
+
+	@ We're here if it matched
+5:
+	subs	r0,r0,#1
+	DO_RET(lr)
+
+10:
+	@ We're here if we ran off the end
+	cmp	r1, #0	@ Corner case - you can search for the nil and get a pointer to it
+	beq	5b	@ messy, if common we should branch at the start to a special loop
+	mov	r0,#0
+	DO_RET(lr)
+
+END(strchr)
+
+weak_alias (strchr, index)
+libc_hidden_builtin_def(strchr)
diff -urN ports/sysdeps/arm/eabi/armv6t2/strlen.S src/ports/sysdeps/arm/eabi/armv6t2/strlen.S
--- ports/sysdeps/arm/eabi/armv6t2/strlen.S	1970-01-01 01:00:00.000000000 +0100
+++ ports/sysdeps/arm/eabi/armv6t2/strlen.S	2011-12-16 13:43:01.991130183 +0000
@@ -0,0 +1,118 @@ 
+/* Copyright (C) 2011 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+   Code contributed by Dave Gilbert <david.gilbert@linaro.org>
+ 
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, write to the Free
+   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
+   02111-1307 USA.  */
+
+#include <sysdep.h>
+
+@ This strlen routine is optimised on a Cortex-A9 and should work on
+@ all ARMv7 processors.   This routine is reasonably fast for short
+@ strings, but is probably slower than a simple implementation if all
+@ your strings are very short
+@ Note: The use of cbz/cbnz means it's Thumb only
+
+@ 2011-02-08 david.gilbert@linaro.org
+@    Extracted from local git 6848613a
+@ 2011-12-16 david.gilbert@linaro.org
+@    Copy from Cortex strings rev 65 and change license
+@    Add cfi magic, switch to ldrd
+
+
+@ this lets us check a flag in a 00/ff byte easily in either endianness
+#ifdef __ARMEB__
+#define CHARTSTMASK(c) 1<<(31-(c*8))
+#else
+#define CHARTSTMASK(c) 1<<(c*8)
+#endif
+
+@-----------------------------------------------------------------------------
+	.syntax unified
+
+	.text
+	.thumb
+
+	.thumb_func
+	.global strlen
+	.type strlen,%function
+ENTRY(strlen)
+	@ r0 = string
+	@ returns count of bytes in string not including terminator
+	mov	r1, r0
+	push	{ r4,r6 }
+	cfi_adjust_cfa_offset (8)
+	cfi_rel_offset (r4, 0)
+	cfi_rel_offset (r6, 4)
+
+	cfi_remember_state
+
+	mvns	r6, #0		@ all F
+	movs	r4, #0
+	tst	r0, #7
+	beq	2f
+
+1:
+	ldrb	r2, [r1], #1
+	tst	r1, #7		@ Hit alignment yet?
+	cbz	r2, 10f		@ Exit if we found the 0
+	bne	1b
+
+	@ So we're now aligned
+2:
+	ldrd	r2,r3,[r1],#8
+	uadd8	r2, r2, r6	@ Parallel add 0xff - sets the GE bits for anything that wasn't 0
+	sel	r2, r4, r6	@ bytes are 00 for none-00 bytes, or ff for 00 bytes - NOTE INVERSION
+	uadd8	r3, r3, r6	@ Parallel add 0xff - sets the GE bits for anything that wasn't 0
+	sel	r3, r2, r6	@ bytes are 00 for none-00 bytes, or ff for 00 bytes - NOTE INVERSION
+	cmp	r3, #0
+	beq	2b
+
+strlenendtmp:
+	@ One (or more) of the bytes we loaded was 0 - but which one?
+	@ r2 has the mask corresponding to the first loaded word
+	@ r3 has a combined mask of the two words - but if r2 was all-non 0 
+	@ then it's just the 2nd words
+	cmp	r2, #0
+	itte	eq
+	moveq	r2, r3		@ the end is in the 2nd word
+	subeq	r1,r1,#3
+	subne	r1,r1,#7
+
+	@ r1 currently points to the 2nd byte of the word containing the 0
+	tst	r2, # CHARTSTMASK(0)	@ 1st character
+	bne	10f
+	adds	r1,r1,#1
+	tst	r2, # CHARTSTMASK(1)	@ 2nd character
+	ittt	eq
+	addeq	r1,r1,#1
+	tsteq	r2, # (3<<15)		@ 2nd & 3rd character
+	@ If not the 3rd must be the last one
+	addeq	r1,r1,#1
+
+10:
+	@ r0 is still at the beginning, r1 is pointing 1 byte after terminator
+	sub	r0, r1, r0
+	subs	r0, r0, #1
+	pop	{ r4, r6 }
+
+	cfi_adjust_cfa_offset (-8)
+	cfi_restore (r4)
+	cfi_restore (r6)
+
+	DO_RET(lr)
+
+END(strlen)
+libc_hidden_builtin_def (strlen)