diff mbox

[AArch64] Emit square root using the Newton series

Message ID 56685EBD.5040401@arm.com
State New
Headers show

Commit Message

Kyrylo Tkachov Dec. 9, 2015, 5:02 p.m. UTC
On 09/12/15 16:59, Evandro Menezes wrote:
> On 12/09/2015 10:52 AM, Kyrill Tkachov wrote:

>> Hi Evandro,

>>

>> On 08/12/15 21:35, Evandro Menezes wrote:

>>> Emit square root using the Newton series

>>>

>>>    2015-12-03  Evandro Menezes  <e.menezes@samsung.com>

>>>

>>>    gcc/

>>>             * config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):

>>>    Declare new

>>>             function.

>>>             * config/aarch64/aarch64-simd.md (sqrt<mode>2): New

>>>    expansion and

>>>             insn definitions.

>>>             * config/aarch64/aarch64-tuning-flags.def

>>>             (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.

>>>             * config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define

>>>    new function.

>>>             * config/aarch64/aarch64.md (sqrt<mode>2): New expansion

>>>    and insn

>>>             definitions.

>>>             * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):

>>>    Expand option

>>>             description.

>>>             * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.

>>>

>>> This patch extends the patch that added support for implementing x^-1/2 using the Newton series by adding support for x^1/2 as well.

>>>

>>> Is it OK at this point of stage 3?

>>>

>>

>> A comment on the patch itself from me...

>>

>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def

>> index 6f7dbce..11c6c9a 100644

>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def

>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def

>> @@ -30,4 +30,4 @@

>>

>>  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)

>>  AARCH64_EXTRA_TUNING_OPTION ("recip_sqrt", RECIP_SQRT)

>> -

>> +AARCH64_EXTRA_TUNING_OPTION ("fast_sqrt", FAST_SQRT)

>>

>> That seems like a misleading name to me.

>> If we're doing this, that means that the sqrt instruction is not faster

>> than doing the inverse sqrt estimation followed by a multiply.

>> I think a name like "synth_sqrt" or "estimate_sqrt" or something along those lines

>> is more appropriate.

>

> Unfortunately, this is the case on Exynos M1: the series is faster than the instruction. :-(  So, other targets when this is also true, using the "fast_sqrt" option might make sense.

>


Sure, but the way your patch is written, we will emit the series when "fast_sqrt" is set, rather
than the other way around, unless I'm misreading the logic in:

Comments

Kyrylo Tkachov Dec. 9, 2015, 5:16 p.m. UTC | #1
On 09/12/15 17:02, Kyrill Tkachov wrote:
>

> On 09/12/15 16:59, Evandro Menezes wrote:

>> On 12/09/2015 10:52 AM, Kyrill Tkachov wrote:

>>> Hi Evandro,

>>>

>>> On 08/12/15 21:35, Evandro Menezes wrote:

>>>> Emit square root using the Newton series

>>>>

>>>>    2015-12-03  Evandro Menezes <e.menezes@samsung.com>

>>>>

>>>>    gcc/

>>>>             * config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):

>>>>    Declare new

>>>>             function.

>>>>             * config/aarch64/aarch64-simd.md (sqrt<mode>2): New

>>>>    expansion and

>>>>             insn definitions.

>>>>             * config/aarch64/aarch64-tuning-flags.def

>>>>             (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.

>>>>             * config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define

>>>>    new function.

>>>>             * config/aarch64/aarch64.md (sqrt<mode>2): New expansion

>>>>    and insn

>>>>             definitions.

>>>>             * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):

>>>>    Expand option

>>>>             description.

>>>>             * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.

>>>>

>>>> This patch extends the patch that added support for implementing x^-1/2 using the Newton series by adding support for x^1/2 as well.

>>>>

>>>> Is it OK at this point of stage 3?

>>>>

>>>

>>> A comment on the patch itself from me...

>>>

>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def

>>> index 6f7dbce..11c6c9a 100644

>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def

>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def

>>> @@ -30,4 +30,4 @@

>>>

>>>  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)

>>>  AARCH64_EXTRA_TUNING_OPTION ("recip_sqrt", RECIP_SQRT)

>>> -

>>> +AARCH64_EXTRA_TUNING_OPTION ("fast_sqrt", FAST_SQRT)

>>>

>>> That seems like a misleading name to me.

>>> If we're doing this, that means that the sqrt instruction is not faster

>>> than doing the inverse sqrt estimation followed by a multiply.

>>> I think a name like "synth_sqrt" or "estimate_sqrt" or something along those lines

>>> is more appropriate.

>>

>> Unfortunately, this is the case on Exynos M1: the series is faster than the instruction. :-(  So, other targets when this is also true, using the "fast_sqrt" option might make sense.

>>

>

> Sure, but the way your patch is written, we will emit the series when "fast_sqrt" is set, rather

> than the other way around, unless I'm misreading the logic in:

>


Sorry, what I meant to say is it would be clearer, IMO, to describe the compiler action that is being taken
(e.g. the rename_fma_regs tuning flag), in this case it's estimating sqrt using a series.

Kyrill

> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md

> index 030a101..f6d2da4 100644

> --- a/gcc/config/aarch64/aarch64-simd.md

> +++ b/gcc/config/aarch64/aarch64-simd.md

> @@ -4280,7 +4280,23 @@

>

>  ;; sqrt

>

> -(define_insn "sqrt<mode>2"

> +(define_expand "sqrt<mode>2"

> +  [(set (match_operand:VDQF 0 "register_operand")

> +    (sqrt:VDQF (match_operand:VDQF 1 "register_operand")))]

> +  "TARGET_SIMD"

> +{

> +  if ((AARCH64_EXTRA_TUNE_FAST_SQRT & aarch64_tune_params.extra_tuning_flags)

> +      && !optimize_function_for_size_p (cfun)

> +      && flag_finite_math_only

> +      && !flag_trapping_math

> +      && flag_unsafe_math_optimizations)

> +    {

> +      aarch64_emit_swsqrt (operands[0], operands[1]);

> +      DONE;

> +    }

> +})

>

>

> Thanks,

> Kyrill

>

>
Kyrylo Tkachov Dec. 10, 2015, 10:30 a.m. UTC | #2
On 09/12/15 18:50, Evandro Menezes wrote:
> On 12/09/2015 11:16 AM, Kyrill Tkachov wrote:

>>

>> On 09/12/15 17:02, Kyrill Tkachov wrote:

>>>

>>> On 09/12/15 16:59, Evandro Menezes wrote:

>>>> On 12/09/2015 10:52 AM, Kyrill Tkachov wrote:

>>>>> Hi Evandro,

>>>>>

>>>>> On 08/12/15 21:35, Evandro Menezes wrote:

>>>>>> Emit square root using the Newton series

>>>>>>

>>>>>>    2015-12-03  Evandro Menezes <e.menezes@samsung.com>

>>>>>>

>>>>>>    gcc/

>>>>>>             * config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):

>>>>>>    Declare new

>>>>>>             function.

>>>>>>             * config/aarch64/aarch64-simd.md (sqrt<mode>2): New

>>>>>>    expansion and

>>>>>>             insn definitions.

>>>>>>             * config/aarch64/aarch64-tuning-flags.def

>>>>>>             (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.

>>>>>>             * config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define

>>>>>>    new function.

>>>>>>             * config/aarch64/aarch64.md (sqrt<mode>2): New expansion

>>>>>>    and insn

>>>>>>             definitions.

>>>>>>             * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):

>>>>>>    Expand option

>>>>>>             description.

>>>>>>             * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.

>>>>>>

>>>>>> This patch extends the patch that added support for implementing x^-1/2 using the Newton series by adding support for x^1/2 as well.

>>>>>>

>>>>>> Is it OK at this point of stage 3?

>>>>>>

>>>>>

>>>>> A comment on the patch itself from me...

>>>>>

>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def

>>>>> index 6f7dbce..11c6c9a 100644

>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def

>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def

>>>>> @@ -30,4 +30,4 @@

>>>>>

>>>>>  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)

>>>>>  AARCH64_EXTRA_TUNING_OPTION ("recip_sqrt", RECIP_SQRT)

>>>>> -

>>>>> +AARCH64_EXTRA_TUNING_OPTION ("fast_sqrt", FAST_SQRT)

>>>>>

>>>>> That seems like a misleading name to me.

>>>>> If we're doing this, that means that the sqrt instruction is not faster

>>>>> than doing the inverse sqrt estimation followed by a multiply.

>>>>> I think a name like "synth_sqrt" or "estimate_sqrt" or something along those lines

>>>>> is more appropriate.

>>>>

>>>> Unfortunately, this is the case on Exynos M1: the series is faster than the instruction. :-(  So, other targets when this is also true, using the "fast_sqrt" option might make sense.

>>>>

>>>

>>> Sure, but the way your patch is written, we will emit the series when "fast_sqrt" is set, rather

>>> than the other way around, unless I'm misreading the logic in:

>>>

>>

>> Sorry, what I meant to say is it would be clearer, IMO, to describe the compiler action that is being taken

>> (e.g. the rename_fma_regs tuning flag), in this case it's estimating sqrt using a series.

>>

>> Kyrill

>>

>>> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md

>>> index 030a101..f6d2da4 100644

>>> --- a/gcc/config/aarch64/aarch64-simd.md

>>> +++ b/gcc/config/aarch64/aarch64-simd.md

>>> @@ -4280,7 +4280,23 @@

>>>

>>>  ;; sqrt

>>>

>>> -(define_insn "sqrt<mode>2"

>>> +(define_expand "sqrt<mode>2"

>>> +  [(set (match_operand:VDQF 0 "register_operand")

>>> +    (sqrt:VDQF (match_operand:VDQF 1 "register_operand")))]

>>> +  "TARGET_SIMD"

>>> +{

>>> +  if ((AARCH64_EXTRA_TUNE_FAST_SQRT & aarch64_tune_params.extra_tuning_flags)

>>> +      && !optimize_function_for_size_p (cfun)

>>> +      && flag_finite_math_only

>>> +      && !flag_trapping_math

>>> +      && flag_unsafe_math_optimizations)

>>> +    {

>>> +      aarch64_emit_swsqrt (operands[0], operands[1]);

>>> +      DONE;

>>> +    }

>>> +})

>>>

>

> Kyrill,

>

> How about "approx_sqrt" for, you guessed it, approximate square root?  The same adjective would perhaps describe "recip_sqrt" better too.

>


Sounds good to me.
Sorry for the bikeshedding.

Kyrill

> Thanks,

>
James Greenhalgh Feb. 26, 2016, 2:59 p.m. UTC | #3
On Mon, Feb 22, 2016 at 06:50:44PM -0600, Evandro Menezes wrote:
> In preparation for the patch adding the Newton series also for

> square root, I'd like to propose this patch changing the name of the

> existing tuning flag for the reciprocal square root.


This is fine, other names like sw_rsqrt, expand_rsqrt, nr_rsqrt would also
be OK. Pick your favourite!

One comment on the replacement invoke.texi text below, otherwise this is
OK to apply now.

> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt

> index 5cbd4cd..155d2bd 100644

> --- a/gcc/config/aarch64/aarch64.opt

> +++ b/gcc/config/aarch64/aarch64.opt

> @@ -151,5 +151,5 @@ PC relative literal loads.

>  

>  mlow-precision-recip-sqrt

>  Common Var(flag_mrecip_low_precision_sqrt) Optimization

> -When calculating a sqrt approximation, run fewer steps.

> +Calculate the reciprocal square-root approximation in fewer steps.

>  This reduces precision, but can result in faster computation.

> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi

> index 490df93..eeff24d 100644

> --- a/gcc/doc/invoke.texi

> +++ b/gcc/doc/invoke.texi

> @@ -12879,12 +12879,10 @@ corresponding flag to the linker.

>  @item -mno-low-precision-recip-sqrt

>  @opindex -mlow-precision-recip-sqrt

>  @opindex -mno-low-precision-recip-sqrt

> -The square root estimate uses two steps instead of three for double-precision,

> -and one step instead of two for single-precision.

> -Thus reducing latency and precision.

> -This is only relevant if @option{-ffast-math} activates

> -reciprocal square root estimate instructions.

> -Which in turn depends on the target processor.

> +The reciprocal square root approximation uses one step less than otherwise,

> +thus reducing latency and precision.


When calculating the reciprocal square root approximation, use one less
step than otherwise, thus reducing latency and precision.

Thanks,
James
diff mbox

Patch

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 030a101..f6d2da4 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4280,7 +4280,23 @@ 
  
  ;; sqrt
  
-(define_insn "sqrt<mode>2"
+(define_expand "sqrt<mode>2"
+  [(set (match_operand:VDQF 0 "register_operand")
+	(sqrt:VDQF (match_operand:VDQF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  if ((AARCH64_EXTRA_TUNE_FAST_SQRT & aarch64_tune_params.extra_tuning_flags)
+      && !optimize_function_for_size_p (cfun)
+      && flag_finite_math_only
+      && !flag_trapping_math
+      && flag_unsafe_math_optimizations)
+    {
+      aarch64_emit_swsqrt (operands[0], operands[1]);
+      DONE;
+    }
+})


Thanks,
Kyrill