Message ID | 1515588482-15744-1-git-send-email-adhemerval.zanella@linaro.org |
---|---|
Headers | show |
Series | Improve generic string routines | expand |
On Wed, Jan 10, 2018 at 10:47:44AM -0200, Adhemerval Zanella wrote: > It is an update of previous Richard's patchset [1] to provide generic > string implementation for newer ports and make them only focus on > just specific routines to get a better overall improvement. > It is done by: > This is basically poorly reinvented version of my patch to make generic string routines. By dusting it of one could get lot better performance than this. This contains lot of glaring issues, main ones are: strnlen - quite ineffective as testing vs zero is faster than generic c. Could be trivially improved by inlining memchr. strcmp - this + /* Handle the unaligned bytes of p1 first. */ + n = -(uintptr_t)p1 % sizeof(op_t); + for (i = 0; i < n; ++i) is a the worst way how to write memcmp as you make loop with unpredictable branch about alignment. Also in strcmp trying to be clever usually causes performance regression because its likely that you get difference in starting bytes. strcpy - this was previously changed to strlen+memcpy because it with platform specific strlen/memcpy it is faster. One should at least check if some platforms are affected. I wouldn't be surprised that one could get faster function just by adding some unrolling to memcpy.
On 10/01/2018 20:30, Ondřej Bílka wrote: > On Wed, Jan 10, 2018 at 10:47:44AM -0200, Adhemerval Zanella wrote: >> It is an update of previous Richard's patchset [1] to provide generic >> string implementation for newer ports and make them only focus on >> just specific routines to get a better overall improvement. >> It is done by: >> > This is basically poorly reinvented version of my patch to make generic > string routines. By dusting it of one could get lot better performance > than this. Why not work together to get to get a better implementation instead of bashing a patch proposal by just saying 'I know/did better'? I did take a look at your suggestion and my first impression what a more convoluted with a lot of over-engineering. I used Richard's proposal because as a start point mainly because I thought it is better decomposed and organized over the internal string operations and a bit simpler for generic implementations. Not saying your previous proposal is out of value, and I would like if we could improve this patch proposal with your input. > > This contains lot of glaring issues, main ones are: > strnlen - quite ineffective as testing vs zero is faster than generic c. > Could be trivially improved by inlining memchr. The 08/08 is using memchr instead (just no inline it)... > > strcmp - this > > + /* Handle the unaligned bytes of p1 first. */ > + n = -(uintptr_t)p1 % sizeof(op_t); > + for (i = 0; i < n; ++i) > > is a the worst way how to write memcmp as you make loop with unpredictable > branch about alignment. Also in strcmp trying to be clever usually > causes performance regression because its likely that you get difference > in starting bytes. I am open to suggestions here, usually the other methods trade speed for more complexity and hardware/abi idiosyncrasies. Also keep in mind the idea of implementation is to be used on a large sets of different hardware, where strategies like unaligned access within a page can not be guarantee to work. > > strcpy - this was previously changed to strlen+memcpy because it with > platform specific strlen/memcpy it is faster. One should at least check > if some platforms are affected. I wouldn't be surprised that one could > get faster function just by adding some unrolling to memcpy. > It really depends, for instance aarch64 implementation sets a threshold where strlen+memcpy is actually used where then it switches to load/store. Also both strlen and memcpy is inlined, witch amortizes the function call cost. I though about adding such optimization, but it will need to rely if compiler can inline size bounded strlen/memcpy to actually make sense for small sizes (where it is not true for a lot of platforms). And there is also the issue of which is the size to used as threshold.
On Thu, 11 Jan 2018, Adhemerval Zanella wrote: > > This contains lot of glaring issues, main ones are: > > strnlen - quite ineffective as testing vs zero is faster than generic c. > > Could be trivially improved by inlining memchr. > > The 08/08 is using memchr instead (just no inline it)... Which is a good idea on the general principles of architectures being able to add optimized versions of a few common functions, such as memchr, and have those used internally by the less common functions, such as strnlen. -- Joseph S. Myers joseph@codesourcery.com