From patchwork Fri Oct 21 12:40:06 2011
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ramana Radhakrishnan
X-Patchwork-Id: 4775
Date: Fri, 21 Oct 2011 13:40:06 +0100
Subject: [RFC ARM] Use vcvt.f32/64.s32 with immediate bits to do fixed to floating point conversions better.
From: Ramana Radhakrishnan
To: gcc-patches
Cc: Patch Tracking

Hi,

Some time back Michael pointed out that the ARM backend doesn't
generate vcvt.f32.s32 with a fraction-bits immediate where you have a
conversion from fixed to floating point, as in the example below. It
should also be possible to generate the vector forms of this, which
will be the subject of a follow-up patch.

I've chosen to implement this in the backend using the
exact_real_inverse, exact_real_truncate and real_to_integer interfaces
from real.c. I have not allowed this transformation when
flag_rounding_math is true, because this instruction always rounds to
nearest rather than obeying what's in the FPSCR, and is thus not safe
for programs that want to set their rounding modes dynamically.

I have used the unified assembler syntax for this patch, and I have a
set of follow-up patches in progress that try to replace all the old
assembler mnemonics with the newer UAL ones. I think gas has matured to
the point where most of the new syntax for VFP is fully recognized, and
there's no reason why we shouldn't move forward. What is the opinion in
this regard?

The benefit is that we eliminate a load from the constant pool and a
floating point multiply, shaving the load latency and the multiply
latency off these sequences.

This instruction can only write its output into the same register as
its input, which is why I've modelled it as below by tying op1 to op0.
Also, the i32 -> f64 cases were quite impossible to model with
insn_and_splits and subreg modes, which is what Richard and I tried to
cook up.
If someone has an idea as to how this might be achieved, I'm all ears;
the current way has it all sort of tied together. Likewise, if there's
a simpler way of using the interfaces into real.c, I'm all ears.

OK for trunk?

cheers
Ramana

        * config/arm/arm.c (vfp3_const_double_for_fract_bits): Define.
        * config/arm/arm-protos.h (vfp3_const_double_for_fract_bits): Declare.
        * config/arm/constraints.md ("Dt"): New constraint.
        * config/arm/predicates.md
        (const_double_vcvt_power_of_two_reciprocal): New.
        * config/arm/vfp.md (*arm_combine_vcvt_f32_s32): New.
        (*arm_combine_vcvt_f32_u32): New.

For the following testcases I see the code as follows with
-mfloat-abi=hard -mfpu=vfpv3 and -mcpu=cortex-a9:

float foo (int i)
{
  float v = (float)i / (1 << 11);
  return v;
}

float foa_unsigned (unsigned int i)
{
  float v = (float)i / (1 << 5);
  return v;
}

After patch:

foo:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        fmsr    s0, r0  @ int
        vcvt.f32.s32    s0, s0, #11
        bx      lr
        .size   foo, .-foo
        .align  2
        .global foa_unsigned
        .type   foa_unsigned, %function
foa_unsigned:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        fmsr    s0, r0  @ int
        vcvt.f32.u32    s0, s0, #5
        bx      lr
        .size   foa_unsigned, .-foa_unsigned
        .align  2
        .global foo1
        .type   foo1, %function

rather than

        .type   foo, %function
foo:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        fmsr    s15, r0 @ int
        fsitos  s0, s15
        flds    s15, .L2
        fmuls   s0, s0, s15
        bx      lr
.L3:
        .align  2
.L2:
        .word   973078528
        .size   foo, .-foo
        .align  2
        .global foa_unsigned
        .type   foa_unsigned, %function
foa_unsigned:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        fmsr    s15, r0 @ int
        fuitos  s0, s15
        flds    s15, .L5
        fmuls   s0, s0, s15
        bx      lr
.L6:
        .align  2
.L5:
        .word   1023410176

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 23a29c6..c933704 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -242,6 +242,7 @@ struct tune_params
 };
 
 extern const struct tune_params *current_tune;
+extern int vfp3_const_double_for_fract_bits (rtx);
 #endif /* RTX_CODE */
 
 #endif /* ! GCC_ARM_PROTOS_H */
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index f1ada6f..266b757 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -17606,6 +17606,11 @@ arm_print_operand (FILE *stream, rtx x, int code)
         }
       return;
 
+    case 'v':
+        gcc_assert (GET_CODE (x) == CONST_DOUBLE);
+        fprintf (stream, "#%d", vfp3_const_double_for_fract_bits (x));
+        return;
+
     /* Register specifier for vld1.16/vst1.16.  Translate the S register
        number into a D register number and element index.  */
     case 'z':
@@ -24972,4 +24977,27 @@ arm_count_output_move_double_insns (rtx *operands)
   return count;
 }
 
+int
+vfp3_const_double_for_fract_bits (rtx operand)
+{
+  REAL_VALUE_TYPE r0;
+
+  if (GET_CODE (operand) != CONST_DOUBLE)
+    return 0;
+
+  REAL_VALUE_FROM_CONST_DOUBLE (r0, operand);
+  if (exact_real_inverse (DFmode, &r0))
+    {
+      if (exact_real_truncate (DFmode, &r0))
+        {
+          HOST_WIDE_INT value = real_to_integer (&r0);
+          value = value & 0xffffffff;
+          if ((value != 0) && ( (value & (value - 1)) == 0))
+            return int_log2 (value);
+        }
+    }
+  return 0;
+}
+
 #include "gt-arm.h"
+
diff --git a/gcc/config/arm/constraints.md b/gcc/config/arm/constraints.md
index d8ce982..7d0269a 100644
--- a/gcc/config/arm/constraints.md
+++ b/gcc/config/arm/constraints.md
@@ -29,7 +29,7 @@
 ;;                      in Thumb-1 state: I, J, K, L, M, N, O
 
 ;; The following multi-letter normal constraints have been used:
-;;                      in ARM/Thumb-2 state: Da, Db, Dc, Dn, Dl, DL, Dv, Dy, Di, Dz
+;;                      in ARM/Thumb-2 state: Da, Db, Dc, Dn, Dl, DL, Dv, Dy, Di, Dt, Dz
 ;;                      in Thumb-1 state: Pa, Pb, Pc, Pd
 ;;                      in Thumb-2 state: Pj, PJ, Ps, Pt, Pu, Pv, Pw, Px, Py
 
@@ -291,6 +291,12 @@
   (and (match_code "const_double")
        (match_test "TARGET_32BIT && TARGET_VFP_DOUBLE && vfp3_const_double_rtx (op)")))
 
+(define_constraint "Dt"
+ "@internal
+  In ARM/ Thumb2 a const_double which can be used with a vcvt.f32.s32 with fract bits operation"
+  (and (match_code "const_double")
+       (match_test "TARGET_32BIT && TARGET_VFP && vfp3_const_double_for_fract_bits (op)")))
+
 (define_memory_constraint "Ut"
  "@internal
   In ARM/Thumb-2 state an address valid for loading/storing opaque structure
diff --git a/gcc/config/arm/predicates.md b/gcc/config/arm/predicates.md
index 92eb004..b535335 100644
--- a/gcc/config/arm/predicates.md
+++ b/gcc/config/arm/predicates.md
@@ -754,6 +754,11 @@
   return true;
 })
 
+(define_predicate "const_double_vcvt_power_of_two_reciprocal"
+  (and (match_code "const_double")
+       (match_test "TARGET_32BIT && TARGET_VFP
+                    && vfp3_const_double_for_fract_bits (op)")))
+
 (define_predicate "neon_struct_operand"
   (and (match_code "mem")
        (match_test "TARGET_32BIT && neon_vector_mem_operand (op, 2)")))
diff --git a/gcc/config/arm/vfp.md b/gcc/config/arm/vfp.md
index 0c85c46..71f6d08 100644
--- a/gcc/config/arm/vfp.md
+++ b/gcc/config/arm/vfp.md
@@ -1144,9 +1144,40 @@
    (set_attr "type" "fcmpd")]
 )
 
+;; Fixed point to floating point conversions.
+(define_code_iterator FCVT [unsigned_float float])
+(define_code_attr FCVTI32typename [(unsigned_float "u32") (float "s32")])
+
+(define_insn "*combine_vcvt_f32_<FCVTI32typename>"
+  [(set (match_operand:SF 0 "s_register_operand" "=t")
+        (mult:SF (FCVT:SF (match_operand:SI 1 "s_register_operand" "0"))
+                 (match_operand 2
+                        "const_double_vcvt_power_of_two_reciprocal" "Dt")))]
+  "TARGET_32BIT && TARGET_HARD_FLOAT && TARGET_VFP3 && !flag_rounding_math"
+  "vcvt.f32.<FCVTI32typename>\\t%0, %1, %v2"
+ [(set_attr "predicable" "no")
+  (set_attr "type" "f_cvt")]
+)
 
-;; Store multiple insn used in function prologue.
+;; Not the ideal way of implementing this. Ideally we would be able to split
+;; this into a move to a DP register and then a vcvt.f64.i32
+(define_insn "*combine_vcvt_f64_<FCVTI32typename>"
+  [(set (match_operand:DF 0 "s_register_operand" "=x,x,w")
+        (mult:DF (FCVT:DF (match_operand:SI 1 "s_register_operand" "r,t,r"))
+                 (match_operand 2
+                     "const_double_vcvt_power_of_two_reciprocal" "Dt,Dt,Dt")))]
+  "TARGET_32BIT && TARGET_HARD_FLOAT && TARGET_VFP3 && !flag_rounding_math
+  && !TARGET_VFP_SINGLE"
+  "@
+  vmov.f32\\t%0, %1\;vcvt.f64.<FCVTI32typename>\\t%P0, %P0, %v2
+  vmov.f32\\t%0, %1\;vcvt.f64.<FCVTI32typename>\\t%P0, %P0, %v2
+  vmov.f64\\t%0, %1, %1\; vcvt.f64.<FCVTI32typename>\\t%P0, %P0, %v2"
+ [(set_attr "predicable" "no")
+  (set_attr "type" "f_cvt")
+  (set_attr "length" "8")]
+)
 
+;; Store multiple insn used in function prologue.
 (define_insn "*push_multi_vfp"
   [(match_parallel 2 "multi_register_push"
     [(set (match_operand:BLK 0 "memory_operand" "=m")
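
A note on the constants in the "before" listings, as a minimal sketch
that is not part of the patch. The .word values loaded by flds are
IEEE-754 single-precision bit patterns, and the transformation only
applies when such a constant is the reciprocal of a power of two. The
helper below, fract_bits_from_word, is a hypothetical name introduced
here for illustration. It works on the raw bit pattern rather than
going through real.c, and the accepted range of 1 to 31 fraction bits
is an assumption of this sketch.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the patch: if the IEEE-754
   single-precision bit pattern W encodes exactly 2^-n with
   1 <= n <= 31, return n; otherwise return 0.  */
int
fract_bits_from_word (uint32_t w)
{
  uint32_t sign = w >> 31;
  uint32_t exponent = (w >> 23) & 0xff;   /* Biased by 127.  */
  uint32_t mantissa = w & 0x7fffff;

  /* A normal power of two is positive with a zero mantissa field;
     exponent 0 (zero/denormal) and 0xff (inf/NaN) are rejected.  */
  if (sign != 0 || mantissa != 0 || exponent == 0 || exponent == 0xff)
    return 0;

  /* The value is 2^(exponent - 127), i.e. 2^-n with n as below.  */
  int n = 127 - (int) exponent;

  /* Range assumed for this sketch only.  */
  return (n >= 1 && n <= 31) ? n : 0;
}
```

For the two constants above, fract_bits_from_word (973078528) gives 11
and fract_bits_from_word (1023410176) gives 5, which matches the #11
and #5 immediates in the "after" listings.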