Message ID | 20241025182614.2022697-1-adhemerval.zanella@linaro.org |
---|---|
Series | Add more CORE-MATH on libm |
On Fri, 25 Oct 2024, Adhemerval Zanella wrote:
> The CORE-MATH implementation is correctly rounded (for any rounding mode)
> and shows slightly better performance than the generic log10pf.

This commit message should refer to log1pf, not log10pf.
On 25/10/24 16:12, Joseph Myers wrote:
> On Fri, 25 Oct 2024, Adhemerval Zanella wrote:
>
>> The CORE-MATH implementation is correctly rounded (for any rounding mode)
>> and shows slightly better performance than the generic log10pf.
>
> This commit message should refer to log1pf, not log10pf.
>

Sigh, I thought I had it fixed. Thanks for pointing it out; Paul pointed it out to me privately as well.
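For readers tracking the names: log1pf computes ln(1 + x) in single precision, while log10p1f computes log10(1 + x); there is no log10pf. The sketch below is only an illustration of what each function in the series computes. It is written in Python, works in double precision, and is not correctly rounded, so it says nothing about the CORE-MATH code itself. The helper names `log2p1`, `log10p1`, `exp2m1`, and `exp10m1` are defined here for illustration; only `math.log1p` and `math.expm1` come from the standard library.

```python
import math

LN2 = math.log(2.0)
LN10 = math.log(10.0)


def log2p1(x):
    """log2(1 + x), built on log1p so that small x keeps its precision."""
    return math.log1p(x) / LN2


def log10p1(x):
    """log10(1 + x), built on log1p."""
    return math.log1p(x) / LN10


def exp2m1(x):
    """2**x - 1, built on expm1 to avoid cancellation near x = 0."""
    return math.expm1(x * LN2)


def exp10m1(x):
    """10**x - 1, built on expm1."""
    return math.expm1(x * LN10)


if __name__ == "__main__":
    tiny = 1e-20
    # 1.0 + tiny rounds to 1.0, so the naive form loses the input entirely.
    print("log(1 + x):", math.log(1.0 + tiny))
    print("log1p(x):  ", math.log1p(tiny))
    print("log10p1(9):", log10p1(9.0))
    print("exp2m1(3): ", exp2m1(3.0))
```

The point of the separate "p1" and "m1" entry points is the behaviour near zero: forming 1 + x or subtracting 1 explicitly discards the low-order bits of a small argument.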
On Fri, Oct 25, 2024 at 1:26 PM Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
>
> Following the tgammaf implementation (392b3f0971764) and its telling
> performance improvement, I worked with Paul Zimmermann to check if we
> can integrate more routines into glibc.
>
> This patchset adds the optimized and correctly rounded exp10m1f,
> exp2m1f, expm1f, log10f, log2p1f, log1pf, and log10p1f. I also added
> a benchmark to evaluate each implementation.
>
> I tested the implementation on recent hardware (Ryzen 9 5900X for
> x86_64, Ampere/Neoverse for aarch64, and POWER10 for powerpc), and
> most of the implementations show impressive performance
> improvements. Like the implementation from ARM optimized routines,
> the CORE-MATH one takes advantage of recent ISA and platform support
> (like fma and rounding instructions, along with FP throughput).
>
> For a couple of implementations, exp10m1f and exp2m1f, CORE-MATH
> shows slightly worse performance for x86_64-v1. It is due to the glibc
> generic implementation that calls optimized exp10f/exp2f, and when a
> more recent ISA is used (x86_64-v2 or x86_64-v3) CORE-MATH shows a
> better output than the current implementation. For both cases I added
> iFUNC support to use FMA on x86_64.
>
> Adhemerval Zanella (17):
>   math: Add e_gammaf_r to glibc code and style
>   benchtests: Add exp10m1f benchmark
>   benchtests: Add exp2m1f benchmark
>   benchtests: Add expm1f benchmark
>   benchtests: Add log10f benchmark
>   benchtests: Add log2p1f benchmark
>   benchtests: Add log1p benchmark
>   benchtests: Add log10p1f benchmark
>   math: Use exp10m1f from CORE-MATH
>   math: Use exp2m1f from CORE-MATH
>   math: Use expm1f from CORE-MATH
>   math: Use log10f from CORE-MATH
>   math: Use log2p1f from CORE-MATH
>   math: Use log1pf from CORE-MATH
>   math: Use log10p1f from CORE-MATH
>   x86_64: Add exp10m1f with FMA
>   x86_64: Add exp2m1f with FMA
>
>  SHARED-FILES | 16 +
>  benchtests/Makefile | 7 +
>  benchtests/exp10m1f-inputs | 2389 ++++++++++++++
>  benchtests/exp2m1f-inputs | 2388 ++++++++++++++
>  benchtests/expm1f-inputs | 799 +++++
>  benchtests/log10f-inputs | 1005 ++++++
>  benchtests/log10p1f-inputs | 2888 +++++++++++++++++
>  benchtests/log1pf-inputs | 1005 ++++++
>  benchtests/log2p1f-inputs | 2888 +++++++++++++++++
>  sysdeps/aarch64/libm-test-ulps | 29 +-
>  sysdeps/alpha/fpu/libm-test-ulps | 12 -
>  sysdeps/arc/fpu/libm-test-ulps | 25 -
>  sysdeps/arc/nofpu/libm-test-ulps | 7 -
>  sysdeps/arm/libm-test-ulps | 31 +-
>  sysdeps/csky/fpu/libm-test-ulps | 12 -
>  sysdeps/csky/nofpu/libm-test-ulps | 12 -
>  sysdeps/hppa/fpu/libm-test-ulps | 28 -
>  sysdeps/i386/fpu/e_log10f.S | 66 -
>  sysdeps/i386/fpu/libm-test-ulps | 25 -
>  sysdeps/i386/fpu/s_expm1f.S | 112 -
>  sysdeps/i386/fpu/s_log1pf.S | 66 -
>  .../i386/i686/fpu/multiarch/libm-test-ulps | 25 -
>  sysdeps/ieee754/flt-32/e_gammaf_r.c | 178 +-
>  sysdeps/ieee754/flt-32/e_log10f.c | 196 +-
>  sysdeps/ieee754/flt-32/s_exp10m1f.c | 227 ++
>  sysdeps/ieee754/flt-32/s_exp2m1f.c | 194 ++
>  sysdeps/ieee754/flt-32/s_expm1f.c | 232 +-
>  sysdeps/ieee754/flt-32/s_log10p1f.c | 182 ++
>  sysdeps/ieee754/flt-32/s_log1pf.c | 271 +-
>  sysdeps/ieee754/flt-32/s_log2p1f.c | 248 ++
>  .../math_errf.c => ieee754/flt-32/w_log1pf.c} | 0
>  sysdeps/loongarch/lp64/libm-test-ulps | 28 -
>  sysdeps/m68k/coldfire/fpu/libm-test-ulps | 6 -
>  sysdeps/m68k/m680x0/fpu/libm-test-ulps | 12 -
>  sysdeps/m68k/m680x0/fpu/w_log1pf.c | 20 +
>  sysdeps/microblaze/libm-test-ulps | 3 -
>  sysdeps/mips/mips32/libm-test-ulps | 28 -
>  sysdeps/mips/mips64/libm-test-ulps | 28 -
>  sysdeps/nios2/libm-test-ulps | 3 -
>  sysdeps/or1k/fpu/libm-test-ulps | 4 -
>  sysdeps/or1k/nofpu/libm-test-ulps | 12 -
>  sysdeps/powerpc/fpu/libm-test-ulps | 29 +-
>  sysdeps/powerpc/nofpu/libm-test-ulps | 28 -
>  sysdeps/riscv/nofpu/libm-test-ulps | 16 -
>  sysdeps/riscv/rvd/libm-test-ulps | 28 -
>  sysdeps/s390/fpu/libm-test-ulps | 28 -
>  sysdeps/sh/libm-test-ulps | 6 -
>  sysdeps/sparc/fpu/libm-test-ulps | 28 -
>  sysdeps/x86_64/fpu/libm-test-ulps | 29 +-
>  sysdeps/x86_64/fpu/multiarch/Makefile | 4 +
>  sysdeps/x86_64/fpu/multiarch/s_exp10m1f-fma.c | 4 +
>  sysdeps/x86_64/fpu/multiarch/s_exp10m1f.c | 33 +
>  sysdeps/x86_64/fpu/multiarch/s_exp2m1f-fma.c | 4 +
>  sysdeps/x86_64/fpu/multiarch/s_exp2m1f.c | 33 +
>  54 files changed, 14873 insertions(+), 1104 deletions(-)
>  create mode 100644 benchtests/exp10m1f-inputs
>  create mode 100644 benchtests/exp2m1f-inputs
>  create mode 100644 benchtests/expm1f-inputs
>  create mode 100644 benchtests/log10f-inputs
>  create mode 100644 benchtests/log10p1f-inputs
>  create mode 100644 benchtests/log1pf-inputs
>  create mode 100644 benchtests/log2p1f-inputs
>  delete mode 100644 sysdeps/i386/fpu/e_log10f.S
>  delete mode 100644 sysdeps/i386/fpu/s_expm1f.S
>  delete mode 100644 sysdeps/i386/fpu/s_log1pf.S
>  create mode 100644 sysdeps/ieee754/flt-32/s_exp10m1f.c
>  create mode 100644 sysdeps/ieee754/flt-32/s_exp2m1f.c
>  create mode 100644 sysdeps/ieee754/flt-32/s_log10p1f.c
>  create mode 100644 sysdeps/ieee754/flt-32/s_log2p1f.c
>  rename sysdeps/{m68k/m680x0/fpu/math_errf.c => ieee754/flt-32/w_log1pf.c} (100%)
>  create mode 100644 sysdeps/m68k/m680x0/fpu/w_log1pf.c
>  create mode 100644 sysdeps/x86_64/fpu/multiarch/s_exp10m1f-fma.c
>  create mode 100644 sysdeps/x86_64/fpu/multiarch/s_exp10m1f.c
>  create mode 100644 sysdeps/x86_64/fpu/multiarch/s_exp2m1f-fma.c
>  create mode 100644 sysdeps/x86_64/fpu/multiarch/s_exp2m1f.c
>
> --
> 2.43.0
>

Whitespace issues in some of your patches:
```
Applying: math: Add e_gammaf_r to glibc code and style
Applying: benchtests: Add exp10m1f benchmark
Applying: benchtests: Add exp2m1f benchmark
Applying: benchtests: Add expm1f benchmark
Applying: benchtests: Add log10f benchmark
Applying: benchtests: Add log2p1f benchmark
Applying: benchtests: Add log1p benchmark
Applying: benchtests: Add log10p1f benchmark
Applying: math: Use exp10m1f from CORE-MATH
Applying: math: Use exp2m1f from CORE-MATH
Applying: math: Use expm1f from CORE-MATH
Applying: math: Use log2p1f from CORE-MATH
.git/rebase-apply/patch:402: space before tab in indent.
{
.git/rebase-apply/patch:456: space before tab in indent.
};
warning: 2 lines add whitespace errors.
Applying: math: Use log10p1f from CORE-MATH
.git/rebase-apply/patch:352: trailing whitespace.
{
.git/rebase-apply/patch:366: space before tab in indent.
{
warning: 2 lines add whitespace errors.
Applying: x86_64: Add exp10m1f with FMA
Applying: x86_64: Add exp2m1f with FMA
```
On 26/10/24 15:28, Noah Goldstein wrote:
> On Fri, Oct 25, 2024 at 1:26 PM Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>> [...]
>
> Whitespace issues in some of your patches:
> ```
> Applying: math: Add e_gammaf_r to glibc code and style
> Applying: benchtests: Add exp10m1f benchmark
> Applying: benchtests: Add exp2m1f benchmark
> Applying: benchtests: Add expm1f benchmark
> Applying: benchtests: Add log10f benchmark
> Applying: benchtests: Add log2p1f benchmark
> Applying: benchtests: Add log1p benchmark
> Applying: benchtests: Add log10p1f benchmark
> Applying: math: Use exp10m1f from CORE-MATH
> Applying: math: Use exp2m1f from CORE-MATH
> Applying: math: Use expm1f from CORE-MATH
> Applying: math: Use log2p1f from CORE-MATH
> .git/rebase-apply/patch:402: space before tab in indent.
> {
> .git/rebase-apply/patch:456: space before tab in indent.
> };
> warning: 2 lines add whitespace errors.
> Applying: math: Use log10p1f from CORE-MATH
> .git/rebase-apply/patch:352: trailing whitespace.
> {
> .git/rebase-apply/patch:366: space before tab in indent.
> {
> warning: 2 lines add whitespace errors.
> Applying: x86_64: Add exp10m1f with FMA
> Applying: x86_64: Add exp2m1f with FMA
> ```

Thanks, I have fixed it locally.
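The two classes of warning reported above ("trailing whitespace" and "space before tab in indent") can be caught before sending a series. git's own check is `git diff --check`. The Python sketch below is only a rough approximation of those two rules for the added lines of a unified diff. It is not git's implementation, and `whitespace_errors` is a hypothetical helper name.

```python
import re

# A space immediately followed by a tab somewhere in the leading whitespace.
SPACE_BEFORE_TAB = re.compile(r"^[ \t]* \t")
# One or more spaces or tabs at the very end of the line.
TRAILING_WS = re.compile(r"[ \t]+$")


def whitespace_errors(patch_text):
    """Return (patch_line_number, message) pairs for added lines.

    Only lines starting with '+' are inspected. Lines starting with '+++'
    are treated as file headers and skipped, so an added line whose content
    itself begins with '++' is missed by this sketch.
    """
    errors = []
    for lineno, line in enumerate(patch_text.splitlines(), start=1):
        if not line.startswith("+") or line.startswith("+++"):
            continue
        content = line[1:]
        if TRAILING_WS.search(content):
            errors.append((lineno, "trailing whitespace."))
        if SPACE_BEFORE_TAB.match(content):
            errors.append((lineno, "space before tab in indent."))
    return errors


if __name__ == "__main__":
    sample = "+++ b/example.c\n+ \t{\n+  }; \n+ok\n"
    for lineno, message in whitespace_errors(sample):
        print(f"patch:{lineno}: {message}")
```

Running a check like this, or `git diff --check` itself, over each commit before `git format-patch` avoids the warnings on the receiving side.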