[v5,36/54] tcg: Introduce tcg_out_movext3

Message ID	20230515143313.734053-37-richard.henderson@linaro.org
State	Accepted
Commit	2462e30e99676c710624806febe5ce67a45f0521
Series	tcg: Improve atomicity support

Headers

Received-SPF: pass (google.com: domain of
 qemu-devel-bounces+patch=linaro.org@nongnu.org designates 209.51.188.17 as
 permitted sender) client-ip=209.51.188.17;
From: Richard Henderson <richard.henderson@linaro.org>
To: qemu-devel@nongnu.org
Cc: qemu-arm@nongnu.org,
	qemu-s390x@nongnu.org
Subject: [PATCH v5 36/54] tcg: Introduce tcg_out_movext3
Date: Mon, 15 May 2023 07:32:55 -0700
Message-Id: <20230515143313.734053-37-richard.henderson@linaro.org>
In-Reply-To: <20230515143313.734053-1-richard.henderson@linaro.org>
References: <20230515143313.734053-1-richard.henderson@linaro.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Received-SPF: pass client-ip=2607:f8b0:4864:20::62c;
 envelope-from=richard.henderson@linaro.org; helo=mail-pl1-x62c.google.com
X-Spam_score_int: -20
X-Spam_score: -2.1
X-Spam_bar: --
X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1,
 DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1,
 RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001,
 T_SCC_BODY_TEXT_LINE=-0.01 autolearn=unavailable autolearn_force=no
X-Spam_action: no action
X-BeenThere: qemu-devel@nongnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
 <mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <https://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
 <mailto:qemu-devel-request@nongnu.org?subject=subscribe>
Errors-To: qemu-devel-bounces+patch=linaro.org@nongnu.org
Sender: qemu-devel-bounces+patch=linaro.org@nongnu.org

Commit Message

Richard Henderson May 15, 2023, 2:32 p.m. UTC

With x86_64 as host, we do not have any temporaries with which to
resolve cycles, but we do have xchg.  As a side bonus, the set of
graphs that can be made with 3 nodes and all nodes conflicting is
small: two.  We can solve either cycle with a single temp.

This is required for x86_64 to handle stores of i128: 1 address
register and 2 data registers.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/tcg.c | 138 ++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 107 insertions(+), 31 deletions(-)
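
The ordering argument in the commit message can be tried out on a toy model. The following is only a sketch under stated assumptions: registers are slots of a plain array, `Move`, `mov1`, `mov2`, `mov3` and `cycle_resolved` are hypothetical names, extension is ignored, and destinations are assumed to be distinct. It is not the QEMU implementation. It shows that once no move can safely go first, only the two 3-cycles remain, and one scratch slot is enough for either.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model: a register is an array slot, a move copies a slot. */
typedef struct {
    int dst;
    int src;
} Move;

static void mov1(long *regs, Move m)
{
    regs[m.dst] = regs[m.src];
}

/* Two moves that may overlap; only a full swap needs the scratch slot. */
static void mov2(long *regs, Move a, Move b, int scratch)
{
    if (a.dst != b.src) {
        mov1(regs, a);
        mov1(regs, b);
    } else if (b.dst != a.src) {
        mov1(regs, b);
        mov1(regs, a);
    } else {
        assert(scratch >= 0);
        regs[scratch] = regs[a.src];    /* save a's source */
        mov1(regs, b);                  /* overwrites a.src, already saved */
        regs[a.dst] = regs[scratch];
    }
}

/* Three moves with distinct destinations (assumed). */
static void mov3(long *regs, Move a, Move b, Move c, int scratch)
{
    /* A destination that is nobody else's source can be written first. */
    if (a.dst != b.src && a.dst != c.src) {
        mov1(regs, a);
        mov2(regs, b, c, scratch);
        return;
    }
    if (b.dst != a.src && b.dst != c.src) {
        mov1(regs, b);
        mov2(regs, a, c, scratch);
        return;
    }
    if (c.dst != a.src && c.dst != b.src) {
        mov1(regs, c);
        mov2(regs, a, b, scratch);
        return;
    }

    /* Only the two 3-cycles remain: save a.src, then unwind the cycle. */
    assert(scratch >= 0);
    regs[scratch] = regs[a.src];
    if (a.dst == b.src) {               /* a -> b -> c -> a */
        mov1(regs, c);
        mov1(regs, b);
    } else {                            /* a -> c -> b -> a */
        mov1(regs, b);
        mov1(regs, c);
    }
    regs[a.dst] = regs[scratch];
}

/* Run one cycle direction on slots 0..2, with slot 3 as the scratch. */
static bool cycle_resolved(bool clockwise)
{
    long regs[4] = { 10, 11, 12, -1 };
    Move a = { clockwise ? 1 : 2, 0 };
    Move b = { clockwise ? 2 : 0, 1 };
    Move c = { clockwise ? 0 : 1, 2 };

    mov3(regs, a, b, c, 3);
    return clockwise
        ? (regs[1] == 10 && regs[2] == 11 && regs[0] == 12)
        : (regs[2] == 10 && regs[0] == 11 && regs[1] == 12);
}
```

Calling `cycle_resolved(true)` and `cycle_resolved(false)` exercises both directions. In each case the saved copy of the first source is the only value that would otherwise be lost.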

Comments

Peter Maydell May 16, 2023, 10:03 a.m. UTC | #1

On Mon, 15 May 2023 at 15:43, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> With x86_64 as host, we do not have any temporaries with which to
> resolve cycles, but we do have xchg.   As a side bonus, the set of
> graphs that can be made with 3 nodes and all nodes conflicting is
> small: two.  We can solve the cycle with a single temp.
>
> This is required for x86_64 to handle stores of i128: 1 address
> register and 2 data registers.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>



>  static void tcg_out_helper_load_regs(TCGContext *s,
>                                       unsigned nmov, TCGMovExtend *mov,
> -                                     unsigned ntmp, const int *tmp)
> +                                     const TCGLdstHelperParam *parm)
>  {
> +    TCGReg dst3;
> +
>      switch (nmov) {
> -    default:
> +    case 4:
>          /* The backend must have provided enough temps for the worst case. */
> -        tcg_debug_assert(ntmp + 1 >= nmov);
> +        tcg_debug_assert(parm->ntmp >= 2);
>
> -        for (unsigned i = nmov - 1; i >= 2; --i) {
> -            TCGReg dst = mov[i].dst;
> -
> -            for (unsigned j = 0; j < i; ++j) {
> -                if (dst == mov[j].src) {
> -                    /*
> -                     * Conflict.
> -                     * Copy the source to a temporary, recurse for the
> -                     * remaining moves, perform the extension from our
> -                     * scratch on the way out.
> -                     */
> -                    TCGReg scratch = tmp[--ntmp];
> -                    tcg_out_mov(s, mov[i].src_type, scratch, mov[i].src);
> -                    mov[i].src = scratch;
> -
> -                    tcg_out_helper_load_regs(s, i, mov, ntmp, tmp);
> -                    tcg_out_movext1(s, &mov[i]);
> -                    return;
> -                }
> +        dst3 = mov[3].dst;
> +        for (unsigned j = 0; j < 3; ++j) {
> +            if (dst3 == mov[j].src) {
> +                /*
> +                 * Conflict. Copy the source to a temporary, perform the
> +                 * remaining moves, then the extension from our scratch
> +                 * on the way out.
> +                 */
> +                TCGReg scratch = parm->tmp[1];
> +                tcg_out_movext3(s, mov, mov + 1, mov + 2, parm->tmp[0]);
> +                tcg_out_movext1_new_src(s, &mov[3], scratch);

Isn't this missing the "copy the source to a temporary" part?
I was expecting an initial tcg_out_mov() like the old code has.

> +                break;
>              }
> -

Otherwise
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

-- PMM

Richard Henderson May 16, 2023, 1:56 p.m. UTC | #2

On 5/16/23 03:03, Peter Maydell wrote:
> On Mon, 15 May 2023 at 15:43, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> With x86_64 as host, we do not have any temporaries with which to
>> resolve cycles, but we do have xchg.   As a side bonus, the set of
>> graphs that can be made with 3 nodes and all nodes conflicting is
>> small: two.  We can solve the cycle with a single temp.
>>
>> This is required for x86_64 to handle stores of i128: 1 address
>> register and 2 data registers.
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> 
> 
> 
>>   static void tcg_out_helper_load_regs(TCGContext *s,
>>                                        unsigned nmov, TCGMovExtend *mov,
>> -                                     unsigned ntmp, const int *tmp)
>> +                                     const TCGLdstHelperParam *parm)
>>   {
>> +    TCGReg dst3;
>> +
>>       switch (nmov) {
>> -    default:
>> +    case 4:
>>           /* The backend must have provided enough temps for the worst case. */
>> -        tcg_debug_assert(ntmp + 1 >= nmov);
>> +        tcg_debug_assert(parm->ntmp >= 2);
>>
>> -        for (unsigned i = nmov - 1; i >= 2; --i) {
>> -            TCGReg dst = mov[i].dst;
>> -
>> -            for (unsigned j = 0; j < i; ++j) {
>> -                if (dst == mov[j].src) {
>> -                    /*
>> -                     * Conflict.
>> -                     * Copy the source to a temporary, recurse for the
>> -                     * remaining moves, perform the extension from our
>> -                     * scratch on the way out.
>> -                     */
>> -                    TCGReg scratch = tmp[--ntmp];
>> -                    tcg_out_mov(s, mov[i].src_type, scratch, mov[i].src);
>> -                    mov[i].src = scratch;
>> -
>> -                    tcg_out_helper_load_regs(s, i, mov, ntmp, tmp);
>> -                    tcg_out_movext1(s, &mov[i]);
>> -                    return;
>> -                }
>> +        dst3 = mov[3].dst;
>> +        for (unsigned j = 0; j < 3; ++j) {
>> +            if (dst3 == mov[j].src) {
>> +                /*
>> +                 * Conflict. Copy the source to a temporary, perform the
>> +                 * remaining moves, then the extension from our scratch
>> +                 * on the way out.
>> +                 */
>> +                TCGReg scratch = parm->tmp[1];
>> +                tcg_out_movext3(s, mov, mov + 1, mov + 2, parm->tmp[0]);
>> +                tcg_out_movext1_new_src(s, &mov[3], scratch);
> 
> Isn't this missing the "copy the source to a temporary" part?
> I was expecting an initial tcg_out_mov() like the old code has.

It is.  Sloppy of me, and I haven't re-tested ppc32 this week.


r~
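
The point raised in the review can be illustrated with a small model. This is a sketch only: `Move`, `load4` and `fourth_move_correct` are hypothetical names, registers are array slots, and the first three moves are assumed not to conflict with one another so that they can be performed in order (the patch delegates that step to tcg_out_movext3). With the initial copy, the fourth destination receives the saved source. Without it, the final move reads whatever the scratch happened to hold.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int dst;
    int src;
} Move;

/*
 * Hypothetical model of the conflict path for four moves.  The first
 * three moves are assumed to be conflict-free among themselves.
 */
static void load4(long *regs, const Move mov[4], int scratch, bool copy_first)
{
    if (copy_first) {
        /* The step the review asks about: save mov[3]'s source first. */
        regs[scratch] = regs[mov[3].src];
    }
    for (int i = 0; i < 3; ++i) {
        regs[mov[i].dst] = regs[mov[i].src];
    }
    /* Finish the fourth move from the scratch. */
    regs[mov[3].dst] = regs[scratch];
}

static bool fourth_move_correct(bool copy_first)
{
    /* Slots 0..6 are registers; slot 7 is the scratch, initially stale. */
    long regs[8] = { 100, 101, 102, 103, 104, 105, 106, -1 };
    const Move mov[4] = {
        { 1, 3 },   /* overwrites slot 1, which is the source of mov[3] */
        { 4, 5 },
        { 6, 2 },
        { 3, 1 },   /* dst 3 is the source of mov[0]: the conflict */
    };

    load4(regs, mov, 7, copy_first);
    return regs[1] == 103 && regs[3] == 101;
}
```

In this model `fourth_move_correct(true)` holds and `fourth_move_correct(false)` does not, because slot 3 ends up with the stale scratch value.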

diff --git a/tcg/tcg.c b/tcg/tcg.c
index aa0a6c3763..8688248284 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -532,6 +532,82 @@  static void tcg_out_movext2(TCGContext *s, const TCGMovExtend *i1,
     tcg_out_movext1_new_src(s, i1, src1);
 }
 
+/**
+ * tcg_out_movext3 -- move and extend three pairs
+ * @s: tcg context
+ * @i1: first move description
+ * @i2: second move description
+ * @i3: third move description
+ * @scratch: temporary register, or -1 for none
+ *
+ * As tcg_out_movext, for all of @i1, @i2 and @i3, caring for overlap
+ * between the sources and destinations.
+ */
+
+static void tcg_out_movext3(TCGContext *s, const TCGMovExtend *i1,
+                            const TCGMovExtend *i2, const TCGMovExtend *i3,
+                            int scratch)
+{
+    TCGReg src1 = i1->src;
+    TCGReg src2 = i2->src;
+    TCGReg src3 = i3->src;
+
+    if (i1->dst != src2 && i1->dst != src3) {
+        tcg_out_movext1(s, i1);
+        tcg_out_movext2(s, i2, i3, scratch);
+        return;
+    }
+    if (i2->dst != src1 && i2->dst != src3) {
+        tcg_out_movext1(s, i2);
+        tcg_out_movext2(s, i1, i3, scratch);
+        return;
+    }
+    if (i3->dst != src1 && i3->dst != src2) {
+        tcg_out_movext1(s, i3);
+        tcg_out_movext2(s, i1, i2, scratch);
+        return;
+    }
+
+    /*
+     * There is a cycle.  Since there are only 3 nodes, the cycle is
+     * either "clockwise" or "anti-clockwise", and can be solved with
+     * a single scratch or two xchg.
+     */
+    if (i1->dst == src2 && i2->dst == src3 && i3->dst == src1) {
+        /* "Clockwise" */
+        if (tcg_out_xchg(s, MAX(i1->src_type, i2->src_type), src1, src2)) {
+            tcg_out_xchg(s, MAX(i2->src_type, i3->src_type), src2, src3);
+            /* The data is now in the correct registers, now extend. */
+            tcg_out_movext1_new_src(s, i1, i1->dst);
+            tcg_out_movext1_new_src(s, i2, i2->dst);
+            tcg_out_movext1_new_src(s, i3, i3->dst);
+        } else {
+            tcg_debug_assert(scratch >= 0);
+            tcg_out_mov(s, i1->src_type, scratch, src1);
+            tcg_out_movext1(s, i3);
+            tcg_out_movext1(s, i2);
+            tcg_out_movext1_new_src(s, i1, scratch);
+        }
+    } else if (i1->dst == src3 && i2->dst == src1 && i3->dst == src2) {
+        /* "Anti-clockwise" */
+        if (tcg_out_xchg(s, MAX(i2->src_type, i3->src_type), src2, src3)) {
+            tcg_out_xchg(s, MAX(i1->src_type, i2->src_type), src1, src2);
+            /* The data is now in the correct registers, now extend. */
+            tcg_out_movext1_new_src(s, i1, i1->dst);
+            tcg_out_movext1_new_src(s, i2, i2->dst);
+            tcg_out_movext1_new_src(s, i3, i3->dst);
+        } else {
+            tcg_debug_assert(scratch >= 0);
+            tcg_out_mov(s, i1->src_type, scratch, src1);
+            tcg_out_movext1(s, i2);
+            tcg_out_movext1(s, i3);
+            tcg_out_movext1_new_src(s, i1, scratch);
+        }
+    } else {
+        g_assert_not_reached();
+    }
+}
+
 #define C_PFX1(P, A)                    P##A
 #define C_PFX2(P, A, B)                 P##A##_##B
 #define C_PFX3(P, A, B, C)              P##A##_##B##_##C
@@ -5149,46 +5225,46 @@  static int tcg_out_helper_stk_ofs(TCGType type, unsigned slot)
 
 static void tcg_out_helper_load_regs(TCGContext *s,
                                      unsigned nmov, TCGMovExtend *mov,
-                                     unsigned ntmp, const int *tmp)
+                                     const TCGLdstHelperParam *parm)
 {
+    TCGReg dst3;
+
     switch (nmov) {
-    default:
+    case 4:
         /* The backend must have provided enough temps for the worst case. */
-        tcg_debug_assert(ntmp + 1 >= nmov);
+        tcg_debug_assert(parm->ntmp >= 2);
 
-        for (unsigned i = nmov - 1; i >= 2; --i) {
-            TCGReg dst = mov[i].dst;
-
-            for (unsigned j = 0; j < i; ++j) {
-                if (dst == mov[j].src) {
-                    /*
-                     * Conflict.
-                     * Copy the source to a temporary, recurse for the
-                     * remaining moves, perform the extension from our
-                     * scratch on the way out.
-                     */
-                    TCGReg scratch = tmp[--ntmp];
-                    tcg_out_mov(s, mov[i].src_type, scratch, mov[i].src);
-                    mov[i].src = scratch;
-
-                    tcg_out_helper_load_regs(s, i, mov, ntmp, tmp);
-                    tcg_out_movext1(s, &mov[i]);
-                    return;
-                }
+        dst3 = mov[3].dst;
+        for (unsigned j = 0; j < 3; ++j) {
+            if (dst3 == mov[j].src) {
+                /*
+                 * Conflict. Copy the source to a temporary, perform the
+                 * remaining moves, then the extension from our scratch
+                 * on the way out.
+                 */
+                TCGReg scratch = parm->tmp[1];
+                tcg_out_movext3(s, mov, mov + 1, mov + 2, parm->tmp[0]);
+                tcg_out_movext1_new_src(s, &mov[3], scratch);
+                break;
             }
-
-            /* No conflicts: perform this move and continue. */
-            tcg_out_movext1(s, &mov[i]);
         }
-        /* fall through for the final two moves */
 
+        /* No conflicts: perform this move and continue. */
+        tcg_out_movext1(s, &mov[3]);
+        /* fall through */
+
+    case 3:
+        tcg_out_movext3(s, mov, mov + 1, mov + 2,
+                        parm->ntmp ? parm->tmp[0] : -1);
+        break;
     case 2:
-        tcg_out_movext2(s, mov, mov + 1, ntmp ? tmp[0] : -1);
-        return;
+        tcg_out_movext2(s, mov, mov + 1,
+                        parm->ntmp ? parm->tmp[0] : -1);
+        break;
     case 1:
         tcg_out_movext1(s, mov);
-        return;
-    case 0:
+        break;
+    default:
         g_assert_not_reached();
     }
 }
@@ -5235,7 +5311,7 @@  static void tcg_out_helper_load_slots(TCGContext *s,
     for (i = 0; i < nmov; ++i) {
         mov[i].dst = tcg_target_call_iarg_regs[mov[i].dst];
     }
-    tcg_out_helper_load_regs(s, nmov, mov, parm->ntmp, parm->tmp);
+    tcg_out_helper_load_regs(s, nmov, mov, parm);
 }
 
 static void tcg_out_helper_load_imm(TCGContext *s, unsigned slot,
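
The xchg-based path relies on two exchanges rotating three registers. A minimal sketch of that idea follows, assuming a plain C swap as a stand-in for the host instruction; `xchg` and `forward_rotation` are hypothetical names and this is not tcg_out_xchg. The two possible orders give opposite rotations, so the order has to match the direction of the cycle. In this model the forward rotation r0 -> r1, r1 -> r2, r2 -> r0 is obtained by exchanging (r1, r2) first and (r0, r1) second, which may be worth comparing against the order used in each branch of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for a host exchange instruction: swap two array slots. */
static void xchg(long *regs, int a, int b)
{
    long t = regs[a];
    regs[a] = regs[b];
    regs[b] = t;
}

/*
 * Apply two exchanges in one of the two possible orders and report
 * whether the result equals the parallel moves r0->r1, r1->r2, r2->r0.
 */
static bool forward_rotation(bool swap_12_first)
{
    long regs[3] = { 10, 11, 12 };

    if (swap_12_first) {
        xchg(regs, 1, 2);
        xchg(regs, 0, 1);
    } else {
        xchg(regs, 0, 1);
        xchg(regs, 1, 2);
    }
    return regs[1] == 10 && regs[2] == 11 && regs[0] == 12;
}
```

Here `forward_rotation(true)` holds, while `forward_rotation(false)` yields the reverse rotation.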
