diff mbox series

[v9,03/13] accel: collecting JIT statistics

Message ID 20191007152839.30804-4-alex.bennee@linaro.org
State New
Series TCG code quality tracking and perf integration

Commit Message

Alex Bennée Oct. 7, 2019, 3:28 p.m. UTC
From: "Vanderson M. do Rosario" <vandersonmr2@gmail.com>


If a TB has a TBStatistics structure with TB_JIT_STATS enabled, we
collect statistics about its translation process and the generated
code.

To help with collection we now include the TCGProfile structure
unconditionally. It will see further alterations in future commits.

Collecting the number of host instructions is not simple, as it would
mean modifying several target source files. So, for now, we only
collect the size of the generated host code.

Signed-off-by: Vanderson M. do Rosario <vandersonmr2@gmail.com>

Message-Id: <20190829173437.5926-4-vandersonmr2@gmail.com>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>


---
AJB:
  - replace tb tracking with proper array.
  - stash tcg counts in tcg_ctx.prof until we can update later
  - update jit stats under a lock instead of lots of atomics
  - don't re-count nb_ops
---
 accel/tcg/translate-all.c | 35 ++++++++++++++++++++++++++++++++++-
 accel/tcg/translator.c    |  3 +++
 include/exec/tb-stats.h   | 23 +++++++++++++++++++++++
 tcg/tcg.c                 |  9 +++++++--
 tcg/tcg.h                 | 23 +++++++++++++++++++++--
 5 files changed, 88 insertions(+), 5 deletions(-)

-- 
2.20.1

Comments

Richard Henderson Oct. 8, 2019, 1:38 p.m. UTC | #1
On 10/7/19 11:28 AM, Alex Bennée wrote:
> @@ -1795,6 +1799,10 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>          if (flag & TB_EXEC_STATS) {
>              tb->tb_stats->stats_enabled |= TB_EXEC_STATS;
>          }
> +
> +        if (flag & TB_JIT_STATS) {
> +            tb->tb_stats->stats_enabled |= TB_JIT_STATS;
> +        }

So, assuming that you really meant this, and not replacing as I was guessing vs
patch 2, then this is

    tb->tb_stats->stats_enabled |=
        flag & (TB_EXEC_STATS | TB_JIT_STATS);

But I still think it's weird to be wanting to modify the shared structure.
What does that mean for concurrent code generation?

> +    /*
> +     * Collect JIT stats when enabled. We batch them all up here to
> +     * avoid spamming the cache with atomic accesses
> +     */
> +    if (tb_stats_enabled(tb, TB_JIT_STATS)) {
> +        TBStatistics *ts = tb->tb_stats;
> +        qemu_mutex_lock(&ts->jit_stats_lock);
> +
> +        ts->code.num_guest_inst += prof->translation.nb_guest_insns;
> +        ts->code.num_tcg_ops += prof->translation.nb_ops_pre_opt;
> +        ts->code.num_tcg_ops_opt += tcg_ctx->nb_ops;
> +        ts->code.spills += prof->translation.nb_spills;
> +        ts->code.out_len += tb->tc.size;
> +
> +        ts->translations.total++;
> +        if (phys_page2 != -1) {
> +            ts->translations.spanning++;
> +        }
> +
> +        g_ptr_array_add(ts->tbs, tb);
> +
> +        qemu_mutex_unlock(&ts->jit_stats_lock);
> +    }

Hmm.  So we're to interpret all of code.field as sums, the average of which can
be obtained by dividing by translations.total.  Ok, but that should probably be
mentioned in a comment in/near the structure definition.

What are we planning to do with the set of all tb's collected here?

> @@ -3125,6 +3126,7 @@ static void temp_sync
>          case TEMP_VAL_REG:
>              tcg_out_st(s, ts->type, ts->reg,
>                         ts->mem_base->reg, ts->mem_offset);
> +            s->prof.translation.nb_spills++;
>              break;
>
>          case TEMP_VAL_MEM:

This is not a spill in the common compiler definition.

This is "write the temp to its backing storage".  While this does happen in the
course of spilling, the vast majority of these are because we've finished
modifying a global temp and must now update memory.  Which is not nearly the
same thing as "spill".

A spill in the compiler definition happens in tcg_reg_alloc, right after the
comment "We must spill something".  ;-)


r~
Alex Bennée Dec. 13, 2019, 11:51 a.m. UTC | #2
Richard Henderson <richard.henderson@linaro.org> writes:

> On 10/7/19 11:28 AM, Alex Bennée wrote:
>> @@ -1795,6 +1799,10 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>>          if (flag & TB_EXEC_STATS) {
>>              tb->tb_stats->stats_enabled |= TB_EXEC_STATS;
>>          }
>> +
>> +        if (flag & TB_JIT_STATS) {
>> +            tb->tb_stats->stats_enabled |= TB_JIT_STATS;
>> +        }
>
> So, assuming that you really meant this, and not replacing as I was guessing vs
> patch 2, then this is
>
>     tb->tb_stats->stats_enabled |=
>         flag & (TB_EXEC_STATS | TB_JIT_STATS);
>
> But I still think it's weird to be wanting to modify the shared structure.
> What does that mean for concurrent code generation?

The idea was to have per-translation-area granularity for collecting
the stats so we don't have to burden all areas with the overhead.
Currently this only takes effect when qemu_log_in_addr_range is in
effect. However, as the run goes on, we could decide to disable some
or all stats for code that doesn't come up frequently.

That said, the current positioning doesn't work, as we keep setting
the flag, so I think we need to apply get_default_tbstats_flag()
inside tb_get_stats only when we first create the data block.

>
>> +    /*
>> +     * Collect JIT stats when enabled. We batch them all up here to
>> +     * avoid spamming the cache with atomic accesses
>> +     */
>> +    if (tb_stats_enabled(tb, TB_JIT_STATS)) {
>> +        TBStatistics *ts = tb->tb_stats;
>> +        qemu_mutex_lock(&ts->jit_stats_lock);
>> +
>> +        ts->code.num_guest_inst += prof->translation.nb_guest_insns;
>> +        ts->code.num_tcg_ops += prof->translation.nb_ops_pre_opt;
>> +        ts->code.num_tcg_ops_opt += tcg_ctx->nb_ops;
>> +        ts->code.spills += prof->translation.nb_spills;
>> +        ts->code.out_len += tb->tc.size;
>> +
>> +        ts->translations.total++;
>> +        if (phys_page2 != -1) {
>> +            ts->translations.spanning++;
>> +        }
>> +
>> +        g_ptr_array_add(ts->tbs, tb);
>> +
>> +        qemu_mutex_unlock(&ts->jit_stats_lock);
>> +    }
>
> Hmm.  So we're to interpret all of code.field as sums, the average of which can
> be obtained by dividing by translations.total.  Ok, but that should probably be
> mentioned in a comment in/near the structure definition.

OK

> What are we planning to do with the set of all tb's collected here?


Originally we kept track of them for the coverset calculation, as we
needed to know where each individual TB goes next. That code was racy
so I dropped it from the series; tracking this now is possibly
redundant, although it might be useful in the future.

>
>> @@ -3125,6 +3126,7 @@ static void temp_sync
>>          case TEMP_VAL_REG:
>>              tcg_out_st(s, ts->type, ts->reg,
>>                         ts->mem_base->reg, ts->mem_offset);
>> +            s->prof.translation.nb_spills++;
>>              break;
>>
>>          case TEMP_VAL_MEM:
>
> This is not a spill in the common compiler definition.
>
> This is "write the temp to its backing storage".  While this does happen in the
> course of spilling, the vast majority of these are because we've finished
> modifying a global temp and must now update memory.  Which is not nearly the
> same thing as "spill".
>
> A spill in the compiler definition happens in tcg_reg_alloc, right after the
> comment "We must spill something".  ;-)

OK I'll fix that.

-- 
Alex Bennée

Patch

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index b7dd1a78e5..6fa9850a3a 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1690,10 +1690,13 @@  static TBStatistics *tb_get_stats(tb_page_addr_t phys_pc, target_ulong pc,
     TBStatistics *new_stats = g_new0(TBStatistics, 1);
     uint32_t hash = tb_stats_hash_func(phys_pc, pc, flags);
     void *existing_stats = NULL;
+
     new_stats->phys_pc = phys_pc;
     new_stats->pc = pc;
     new_stats->cs_base = cs_base;
     new_stats->flags = flags;
+    new_stats->tbs = g_ptr_array_sized_new(4);
+    qemu_mutex_init(&new_stats->jit_stats_lock);
 
     /*
      * All initialisation must be complete before we insert into qht
@@ -1707,6 +1710,7 @@  static TBStatistics *tb_get_stats(tb_page_addr_t phys_pc, target_ulong pc,
          * If there is already a TBStatistic for this TB from a previous flush
          * then just make the new TB point to the older TBStatistic
          */
+        g_ptr_array_free(new_stats->tbs, true);
         g_free(new_stats);
         return existing_stats;
     } else {
@@ -1726,8 +1730,8 @@  TranslationBlock *tb_gen_code(CPUState *cpu,
     target_ulong virt_page2;
     tcg_insn_unit *gen_code_buf;
     int gen_code_size, search_size, max_insns;
-#ifdef CONFIG_PROFILER
     TCGProfile *prof = &tcg_ctx->prof;
+#ifdef CONFIG_PROFILER
     int64_t ti;
 #endif
     assert_memory_lock();
@@ -1795,6 +1799,10 @@  TranslationBlock *tb_gen_code(CPUState *cpu,
         if (flag & TB_EXEC_STATS) {
             tb->tb_stats->stats_enabled |= TB_EXEC_STATS;
         }
+
+        if (flag & TB_JIT_STATS) {
+            tb->tb_stats->stats_enabled |= TB_JIT_STATS;
+        }
     } else {
         tb->tb_stats = NULL;
     }
@@ -1930,6 +1938,31 @@  TranslationBlock *tb_gen_code(CPUState *cpu,
     if ((pc & TARGET_PAGE_MASK) != virt_page2) {
         phys_page2 = get_page_addr_code(env, virt_page2);
     }
+
+    /*
+     * Collect JIT stats when enabled. We batch them all up here to
+     * avoid spamming the cache with atomic accesses
+     */
+    if (tb_stats_enabled(tb, TB_JIT_STATS)) {
+        TBStatistics *ts = tb->tb_stats;
+        qemu_mutex_lock(&ts->jit_stats_lock);
+
+        ts->code.num_guest_inst += prof->translation.nb_guest_insns;
+        ts->code.num_tcg_ops += prof->translation.nb_ops_pre_opt;
+        ts->code.num_tcg_ops_opt += tcg_ctx->nb_ops;
+        ts->code.spills += prof->translation.nb_spills;
+        ts->code.out_len += tb->tc.size;
+
+        ts->translations.total++;
+        if (phys_page2 != -1) {
+            ts->translations.spanning++;
+        }
+
+        g_ptr_array_add(ts->tbs, tb);
+
+        qemu_mutex_unlock(&ts->jit_stats_lock);
+    }
+
     /*
      * No explicit memory barrier is required -- tb_link_page() makes the
      * TB visible in a consistent state.
diff --git a/accel/tcg/translator.c b/accel/tcg/translator.c
index ec6bd829a0..114e76db27 100644
--- a/accel/tcg/translator.c
+++ b/accel/tcg/translator.c
@@ -116,6 +116,9 @@  void translator_loop(const TranslatorOps *ops, DisasContextBase *db,
     db->tb->size = db->pc_next - db->pc_first;
     db->tb->icount = db->num_insns;
 
+    /* Save number of guest instructions for TB_JIT_STATS */
+    tcg_ctx->prof.translation.nb_guest_insns = db->num_insns;
+
 #ifdef DEBUG_DISAS
     if (qemu_loglevel_mask(CPU_LOG_TB_IN_ASM)
         && qemu_log_in_addr_range(db->pc_first)) {
diff --git a/include/exec/tb-stats.h b/include/exec/tb-stats.h
index 51aecf65e2..566ae0d2be 100644
--- a/include/exec/tb-stats.h
+++ b/include/exec/tb-stats.h
@@ -59,6 +59,28 @@  struct TBStatistics {
         unsigned long atomic;
     } executions;
 
+    /* JIT Stats - protected by lock */
+    QemuMutex jit_stats_lock;
+
+    struct {
+        unsigned num_guest_inst;
+        unsigned num_tcg_ops;
+        unsigned num_tcg_ops_opt;
+        unsigned spills;
+        unsigned out_len;
+    } code;
+
+    struct {
+        unsigned long total;
+        unsigned long uncached;
+        unsigned long spanning;
+    } translations;
+
+    /*
+     * All persistent (cached) TranslationBlocks using
+     * this TBStats structure. Has to be reset on a tb_flush.
+     */
+    GPtrArray *tbs;
 };
 
 bool tb_stats_cmp(const void *ap, const void *bp);
@@ -67,6 +89,7 @@  void init_tb_stats_htable_if_not(void);
 
 #define TB_NOTHING    (1 << 0)
 #define TB_EXEC_STATS (1 << 1)
+#define TB_JIT_STATS  (1 << 2)
 
 void enable_collect_tb_stats(void);
 void disable_collect_tb_stats(void);
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 16b2d0e0ec..fa2fef9f1a 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -1117,6 +1117,7 @@  void tcg_func_start(TCGContext *s)
     s->nb_labels = 0;
     s->current_frame_offset = s->frame_start;
 
+    s->prof.translation.nb_spills = 0;
 #ifdef CONFIG_DEBUG_TCG
     s->goto_tb_issue_mask = 0;
 #endif
@@ -3125,6 +3126,7 @@  static void temp_sync(TCGContext *s, TCGTemp *ts, TCGRegSet allocated_regs,
         case TEMP_VAL_REG:
             tcg_out_st(s, ts->type, ts->reg,
                        ts->mem_base->reg, ts->mem_offset);
+            s->prof.translation.nb_spills++;
             break;
 
         case TEMP_VAL_MEM:
@@ -3990,12 +3992,12 @@  int64_t tcg_cpu_exec_time(void)
 
 int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
 {
-#ifdef CONFIG_PROFILER
     TCGProfile *prof = &s->prof;
-#endif
     int i, num_insns;
     TCGOp *op;
 
+    s->current_tb = tb;
+
 #ifdef CONFIG_PROFILER
     {
         int n = 0;
@@ -4027,6 +4029,9 @@  int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
     }
 #endif
 
+    /* save pre-optimisation op count */
+    prof->translation.nb_ops_pre_opt = s->nb_ops;
+
 #ifdef CONFIG_DEBUG_TCG
     /* Ensure all labels referenced have been emitted.  */
     {
diff --git a/tcg/tcg.h b/tcg/tcg.h
index a37181c899..31571289cf 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -555,7 +555,26 @@  typedef struct TCGOp {
 /* Make sure operands fit in the bitfields above.  */
 QEMU_BUILD_BUG_ON(NB_OPS > (1 << 8));
 
+/*
+ * The TCGProfile structure holds data for analysing the quality of
+ * the code generation. The data is split between stuff that is valid
+ * for the lifetime of a single translation and things that are valid
+ * for the lifetime of the translator. As the former is reset for each
+ * new translation so it should be copied elsewhere if you want to
+ * keep it.
+ *
+ * The structure is safe to access within the context of translation
+ * but accessing the data from elsewhere should be done with safe
+ * work.
+ */
 typedef struct TCGProfile {
+
+    struct {
+        int nb_guest_insns;
+        int nb_spills;
+        int nb_ops_pre_opt;
+    } translation;
+
     int64_t cpu_exec_time;
     int64_t tb_count1;
     int64_t tb_count;
@@ -600,9 +619,7 @@  struct TCGContext {
 
     tcg_insn_unit *code_ptr;
 
-#ifdef CONFIG_PROFILER
     TCGProfile prof;
-#endif
 
 #ifdef CONFIG_DEBUG_TCG
     int temps_in_use;
@@ -651,6 +668,8 @@  struct TCGContext {
 
     uint16_t gen_insn_end_off[TCG_MAX_INSNS];
     target_ulong gen_insn_data[TCG_MAX_INSNS][TARGET_INSN_START_WORDS];
+
+    TranslationBlock *current_tb;
 };
 
 extern TCGContext tcg_init_ctx;