[v3,00/18] Improve generic string routines

Message ID 1515588482-15744-1-git-send-email-adhemerval.zanella@linaro.org
Headers show
Series
  • Improve generic string routines
Related show

Message

Adhemerval Zanella Jan. 10, 2018, 12:47 p.m.
It is an update of previous Richard's patchset [1] to provide generic 
string implementation for newer ports and make them only focus on 
just specific routines to get a better overall improvement.
It is done by:

  1. parametrizing the internal routines (for instance the find zero
     in a word) so each architecture can reimplement without the need
     to reimplement the whole routine.

  2. vectorizing more string implementations (for instance strcpy 
     and strcmp).

  3. Change some implementations to use already possible optimized
     ones (for instance strnlen).  It makes new ports to focus on
     only provide optimized implementation of a hardful symbols
     (for instance memchr) and make its improvement to be used in
     a larger set of routines.

For the rest of #5806 I think we can handle them later and if 
performance of generic implementation is closer I think it is better
to just remove old assembly implementations.

I also checked on x86_64-linux-gnu, i686-linux-gnu, sparc64-linux-gnu,
and sparcv9-linux-gnu by removing the arch-specific assembly 
implementation and disabling multiarch (it covers both LE and BE
for 64 and 32 bits). I also checked the string routines on alpha, hppa,
and sh.

Changes since v2:

  * Move string-fz{a,b,i} to its own patch.
 
  * Add a inline implementation for __builtin_c{l,t}z to avoid using
    compiler provided symbols.

  * Add a new header, string-maskoff.h, to handle unaligned accesses
    on some implementation.

  * Fixed strcmp on LE machines.

  * Added a unaligned strcpy variant for architecture that define
    _STRING_ARCH_unaligned.

  * Add SH string-fzb.h (which uses cmp/str instruction to find
    a zero in word).

Changes since v1:

  * Marked ChangeLog entries with [BZ #5806], as appropriate.

  * Reorganized the headers, so that armv6t2 and power6 need override
    as little as possible to use their (integer) zero detection insns.

  * Hopefully fixed all of the coding style issues.

  * Adjusted the memrchr algorithm as discussed.

  * Replaced the #ifdef STRRCHR etc that are used by the multiarch files.

  * Tested on i386, i686, x86_64 (verified this is unused), ppc64,
    ppc64le --with-cpu=power8 (to use power6 in multiarch), armv7,
    aarch64, alpha (qemu) and hppa (qemu).

PS: This patchset is aimed for 2.28.

[1] https://sourceware.org/ml/libc-alpha/2016-12/msg00830.html

Adhemerval Zanella (5):
  Add string-maskoff.h generic header
  Add string vectorized find and detection functions
  string: Improve generic strnlen
  string: Improve generic strcpy
  sh: Add string-fzb.h

Richard Henderson (13):
  Parameterize op_t from memcopy.h
  Parameterize OP_T_THRES from memcopy.h
  string: Improve generic strlen
  string: Improve generic memchr
  string: Improve generic memrchr
  string: Improve generic strchr
  string: Improve generic strchrnul
  string: Improve generic strcmp
  hppa: Add memcopy.h
  hppa: Add string-fzb.h and string-fzi.h
  alpha: Add string-fzb.h and string-fzi.h
  arm: Add string-fza.h
  powerpc: Add string-fza.h

 config.h.in                                        |   8 +
 configure                                          |  54 ++++++
 configure.ac                                       |  34 ++++
 string/memchr.c                                    | 157 ++++-----------
 string/memcmp.c                                    |   4 -
 string/memrchr.c                                   | 193 ++++--------------
 string/strchr.c                                    | 166 +++-------------
 string/strchrnul.c                                 | 146 +++-----------
 string/strcmp.c                                    |  97 +++++++++-
 string/strcpy.c                                    | 109 ++++++++++-
 string/strlen.c                                    |  83 ++------
 string/strnlen.c                                   | 139 +------------
 string/test-strcpy.c                               |  24 ++-
 sysdeps/alpha/string-fzb.h                         |  51 +++++
 sysdeps/alpha/string-fzi.h                         | 113 +++++++++++
 sysdeps/arm/armv6t2/string-fza.h                   |  69 +++++++
 sysdeps/generic/memcopy.h                          |  11 +-
 sysdeps/generic/string-extbyte.h                   |  35 ++++
 sysdeps/generic/string-fza.h                       | 117 +++++++++++
 sysdeps/generic/string-fzb.h                       |  49 +++++
 sysdeps/generic/string-fzi.h                       | 215 +++++++++++++++++++++
 sysdeps/generic/string-maskoff.h                   |  64 ++++++
 sysdeps/generic/string-opthr.h                     |  25 +++
 sysdeps/generic/string-optype.h                    |  31 +++
 sysdeps/hppa/memcopy.h                             |  44 +++++
 sysdeps/hppa/string-fzb.h                          |  69 +++++++
 sysdeps/hppa/string-fzi.h                          | 135 +++++++++++++
 sysdeps/i386/i686/multiarch/memrchr-c.c            |   2 +
 sysdeps/i386/i686/multiarch/strnlen-c.c            |  19 +-
 sysdeps/i386/memcopy.h                             |   3 -
 sysdeps/i386/string-opthr.h                        |  25 +++
 sysdeps/m68k/memcopy.h                             |   3 -
 sysdeps/powerpc/power6/string-fza.h                |  65 +++++++
 sysdeps/powerpc/powerpc32/power4/memcopy.h         |   5 -
 .../powerpc32/power4/multiarch/strnlen-ppc32.c     |  19 +-
 sysdeps/powerpc/powerpc32/power6/string-fza.h      |   1 +
 sysdeps/powerpc/powerpc64/power6/string-fza.h      |   1 +
 sysdeps/s390/multiarch/memrchr-c.c                 |   2 +
 sysdeps/s390/multiarch/strchr-c.c                  |   1 +
 sysdeps/s390/multiarch/strnlen-c.c                 |  18 +-
 sysdeps/sh/string-fzb.h                            |  53 +++++
 sysdeps/tile/memcmp.c                              |   1 -
 sysdeps/tile/memcopy.h                             |   7 -
 sysdeps/tile/tilegx32/gmp-mparam.h                 |  30 +++
 44 files changed, 1707 insertions(+), 790 deletions(-)
 create mode 100644 sysdeps/alpha/string-fzb.h
 create mode 100644 sysdeps/alpha/string-fzi.h
 create mode 100644 sysdeps/arm/armv6t2/string-fza.h
 create mode 100644 sysdeps/generic/string-extbyte.h
 create mode 100644 sysdeps/generic/string-fza.h
 create mode 100644 sysdeps/generic/string-fzb.h
 create mode 100644 sysdeps/generic/string-fzi.h
 create mode 100644 sysdeps/generic/string-maskoff.h
 create mode 100644 sysdeps/generic/string-opthr.h
 create mode 100644 sysdeps/generic/string-optype.h
 create mode 100644 sysdeps/hppa/memcopy.h
 create mode 100644 sysdeps/hppa/string-fzb.h
 create mode 100644 sysdeps/hppa/string-fzi.h
 create mode 100644 sysdeps/i386/string-opthr.h
 create mode 100644 sysdeps/powerpc/power6/string-fza.h
 create mode 100644 sysdeps/powerpc/powerpc32/power6/string-fza.h
 create mode 100644 sysdeps/powerpc/powerpc64/power6/string-fza.h
 create mode 100644 sysdeps/sh/string-fzb.h
 create mode 100644 sysdeps/tile/tilegx32/gmp-mparam.h

-- 
2.7.4

Comments

Ondřej Bílka Jan. 10, 2018, 10:30 p.m. | #1
On Wed, Jan 10, 2018 at 10:47:44AM -0200, Adhemerval Zanella wrote:
> It is an update of previous Richard's patchset [1] to provide generic 

> string implementation for newer ports and make them only focus on 

> just specific routines to get a better overall improvement.

> It is done by:

> 

This is basically poorly reinvented version of my patch to make generic
string routines. By dusting it of one could get lot better performance
than this.

This contains lot of glaring issues, main ones are:
strnlen - quite ineffective as testing vs zero is faster than generic c.
Could be trivially improved by inlining memchr. 

strcmp - this

+  /* Handle the unaligned bytes of p1 first.  */
+  n = -(uintptr_t)p1 % sizeof(op_t);
+  for (i = 0; i < n; ++i)

is a the worst way how to write memcmp as you make loop with unpredictable
branch about alignment. Also in strcmp trying to be clever usually
causes performance regression because its likely that you get difference
in starting bytes.

strcpy - this was previously changed to strlen+memcpy because it with
platform specific strlen/memcpy it is faster. One should at least check
if some platforms are affected. I wouldn't be surprised that one could
get faster function just by adding some unrolling to memcpy.
Adhemerval Zanella Jan. 11, 2018, 10:54 a.m. | #2
On 10/01/2018 20:30, Ondřej Bílka wrote:
> On Wed, Jan 10, 2018 at 10:47:44AM -0200, Adhemerval Zanella wrote:

>> It is an update of previous Richard's patchset [1] to provide generic 

>> string implementation for newer ports and make them only focus on 

>> just specific routines to get a better overall improvement.

>> It is done by:

>>

> This is basically poorly reinvented version of my patch to make generic

> string routines. By dusting it of one could get lot better performance

> than this.


Why not work together to get to get a better implementation instead of
bashing a patch proposal by just saying 'I know/did better'? I did take a 
look at your suggestion and my first impression what a more convoluted
with a lot of over-engineering. I used Richard's proposal because as a
start point mainly because I thought it is better decomposed and organized
over the internal string operations and a bit simpler for generic
implementations.

Not saying your previous proposal is out of value, and I would like if we
could improve this patch proposal with your input.

> 

> This contains lot of glaring issues, main ones are:

> strnlen - quite ineffective as testing vs zero is faster than generic c.

> Could be trivially improved by inlining memchr. 


The 08/08 is using memchr instead (just no inline it)...

> 

> strcmp - this

> 

> +  /* Handle the unaligned bytes of p1 first.  */

> +  n = -(uintptr_t)p1 % sizeof(op_t);

> +  for (i = 0; i < n; ++i)

> 

> is a the worst way how to write memcmp as you make loop with unpredictable

> branch about alignment. Also in strcmp trying to be clever usually

> causes performance regression because its likely that you get difference

> in starting bytes.


I am open to suggestions here, usually the other methods trade speed for
more complexity and hardware/abi idiosyncrasies. Also keep in mind the idea 
of implementation is to be used on a large sets of different hardware, 
where strategies like unaligned access within a page can not be guarantee
to work.


> 

> strcpy - this was previously changed to strlen+memcpy because it with

> platform specific strlen/memcpy it is faster. One should at least check

> if some platforms are affected. I wouldn't be surprised that one could

> get faster function just by adding some unrolling to memcpy. 

> 


It really depends, for instance aarch64 implementation sets a threshold
where strlen+memcpy is actually used where then it switches to load/store.
Also both strlen and memcpy is inlined, witch amortizes the function call
cost.

I though about adding such optimization, but it will need to rely if 
compiler can inline size bounded strlen/memcpy to actually make sense 
for small sizes (where it is not true for a lot of platforms).  And 
there is also the issue of which is the size to used as threshold.
Joseph Myers Jan. 11, 2018, 1:49 p.m. | #3
On Thu, 11 Jan 2018, Adhemerval Zanella wrote:

> > This contains lot of glaring issues, main ones are:

> > strnlen - quite ineffective as testing vs zero is faster than generic c.

> > Could be trivially improved by inlining memchr. 

> 

> The 08/08 is using memchr instead (just no inline it)...


Which is a good idea on the general principles of architectures being able 
to add optimized versions of a few common functions, such as memchr, and 
have those used internally by the less common functions, such as strnlen.

-- 
Joseph S. Myers
joseph@codesourcery.com