[00/12] tcg: Improve register allocation for calls

Message ID: 20181128053834.10861-1-richard.henderson@linaro.org
Series: tcg: Improve register allocation for calls

Message

Richard Henderson Nov. 28, 2018, 5:38 a.m. UTC
The intent here is to remove several move insns putting the
function arguments into the proper place.  I'm hoping that
this will solve the skylake regression with spec2006, as
seen with the ool softmmu patch set.
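
Roughly: when the allocator knows that a temp's next use is as a call
argument, it should try that argument register first, so the value is
already sitting where the call wants it.  A minimal sketch of that
flavor, with made-up names rather than the actual tcg.c interfaces:

#include <stdint.h>

typedef uint64_t RegSet;                      /* bitmask of host registers */

static int alloc_reg(RegSet free_regs, RegSet preferred)
{
    RegSet both = free_regs & preferred;
    RegSet pick = both ? both : free_regs;    /* honour the hint when possible */
    return pick ? __builtin_ctzll(pick) : -1; /* -1: caller has to spill */
}

The liveness patches below record such preferences (output_pref) so
that they are already visible to the allocator before the call is reached.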

Emilio, all of this is present on my tcg-next-for-4.0 branch.


r~


Richard Henderson (12):
  tcg: Add preferred_reg argument to tcg_reg_alloc
  tcg: Add preferred_reg argument to temp_load
  tcg: Add preferred_reg argument to temp_sync
  tcg: Add preferred_reg argument to tcg_reg_alloc_do_movi
  tcg: Add output_pref to TCGOp
  tcg: Improve register allocation for matching constraints
  tcg: Dump register preference info with liveness
  tcg: Reindent parts of liveness_pass_1
  tcg: Rename and adjust liveness_pass_1 helpers
  tcg: Split out more subroutines from liveness_pass_1
  tcg: Add TCG_OPF_BB_EXIT
  tcg: Record register preferences during liveness

 tcg/tcg-opc.h |   7 +-
 tcg/tcg.h     |  20 +-
 tcg/tcg.c     | 527 +++++++++++++++++++++++++++++++++++++-------------
 3 files changed, 405 insertions(+), 149 deletions(-)

-- 
2.17.2

Comments

Emilio Cota Nov. 28, 2018, 10:15 p.m. UTC | #1
On Tue, Nov 27, 2018 at 21:38:22 -0800, Richard Henderson wrote:
> The intent here is to remove several move insns putting the
> function arguments into the proper place.  I'm hoping that
> this will solve the skylake regression with spec2006, as
> seen with the ool softmmu patch set.
>
> Emilio, all of this is present on my tcg-next-for-4.0 branch.

Thanks for this.

Unfortunately, it doesn't seem to help, performance-wise.

I've benchmarked this on three different machines: Sandy
Bridge, Haswell and Skylake. The average slowdown vs.
the baseline is ~0%, ~5%, and ~10%, respectively.

So it seems the more modern the microarchitecture, the more
severe the slowdown (this is consistent with the assumption
that processors are getting better at caching over time).

Here are all the bar charts:

  https://imgur.com/a/k7vmjVd

- baseline: tcg-next-for-4.0's parent from master, i.e.
  4822f1e ("Merge remote-tracking branch
  'remotes/kraxel/tags/fixes-31-20181127-pull-request'
  into staging", 2018-11-27)

- ool: dc93c4a ("tcg/ppc: Use TCG_TARGET_NEED_LDST_OOL_LABELS",
  2018-11-27)

- ool-regs: a9bac58 ("tcg: Record register preferences during
  liveness", 2018-11-27)

I've also looked at hardware event counts on Skylake for
the above three commits. It seems that the indirection of
the (very) frequent ool calls/rets is what causes the large
reduction in IPC (results for bootup + hmmer):

- baseline:
   291,451,142,426      instructions              #    2.94  insn per cycle           (71.45%)
    99,050,829,190      cycles                                                        (71.49%)
     2,678,751,743      br_inst_retired.near_call                                     (71.43%)
     2,674,367,278      br_inst_retired.near_return                                   (71.42%)
    34,065,079,963      branches                                                      (57.09%)
       161,441,496      branch-misses             #    0.47% of all branches          (57.17%)
      29.916874137 seconds time elapsed

- ool:
   312,368,465,806      instructions              #    2.79  insn per cycle           (71.45%)
   111,863,014,212      cycles                                                        (71.31%)
    11,751,151,140      br_inst_retired.near_call                                     (71.30%)
    11,736,770,191      br_inst_retired.near_return                                   (71.41%)
        24,660,597      br_misp_retired.near_call                                     (71.49%)
    52,096,512,558      branches                                                      (57.28%)
       176,951,727      branch-misses             #    0.34% of all branches          (57.20%)
      33.285149773 seconds time elapsed

- ool-regs:
   309,253,149,588      instructions              #    2.71  insn per cycle           (71.47%)
   113,938,069,597      cycles                                                        (71.50%)
    11,735,199,530      br_inst_retired.near_call                                     (71.51%)
    11,725,686,909      br_inst_retired.near_return                                   (71.54%)
        24,885,204      br_misp_retired.near_call                                     (71.46%)
    52,768,150,694      branches                                                      (56.97%)
       184,421,824      branch-misses             #    0.35% of all branches          (57.03%)
      33.867122498 seconds time elapsed 

The additional branches are all from call/ret. I double-checked the generated
code and these are all well-matched (no jmp's instead of ret's), so
I don't think we can optimize anything there; it seems to me that this
is just a code size vs. speed trade-off.
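
To make that concrete, here's a rough caricature in C of how I read the
ool scheme -- real TCG emits host machine code, not C, and the helper
names below are invented -- showing why moving the softmmu path out of
line shrinks code but adds a call/ret pair per guest memory access:

#include <stdint.h>

/* Hypothetical helpers standing in for the generated TLB lookup and
   slow path; declared only so the sketch is self-contained. */
extern int      tlb_hit(void *env, uint64_t addr);
extern uint64_t load_via_tlb(void *env, uint64_t addr);
extern uint64_t helper_slow_load(void *env, uint64_t addr);
extern uint64_t shared_load_thunk(void *env, uint64_t addr);

/* Inline scheme: TLB fast/slow handling duplicated at every access
   site -- larger code, but the common case falls through with no call. */
uint64_t guest_load_inline(void *env, uint64_t addr)
{
    if (tlb_hit(env, addr)) {
        return load_via_tlb(env, addr);
    }
    return helper_slow_load(env, addr);
}

/* Out-of-line scheme: every access calls one shared thunk -- much
   smaller code, but a near_call/near_return pair per access, which is
   roughly the extra ~9 billion calls in the counters above. */
uint64_t guest_load_ool(void *env, uint64_t addr)
{
    return shared_load_thunk(env, addr);
}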

ool-regs has even lower IPC, but it also uses fewer instructions, which
mitigates the slowdown due to the lower IPC. The bottleneck in the ool
calls/rets remains, which explains why there isn't much to
be gained from the lower number of insns.

Let me know if you want me to do any other data collection.

Thanks,

		Emilio
Richard Henderson Nov. 29, 2018, 7:23 p.m. UTC | #2
On 11/28/18 2:15 PM, Emilio G. Cota wrote:
> Unfortunately, it doesn't seem to help, performance-wise.

That is really disappointing, considering the size gains are huge -- even more
dramatically for non-x86 hosts.  I will see about some more benchmarking on
this for other host/guest combinations.

Thanks!


r~
Emilio Cota Nov. 30, 2018, 12:39 a.m. UTC | #3
On Thu, Nov 29, 2018 at 11:23:09 -0800, Richard Henderson wrote:
> On 11/28/18 2:15 PM, Emilio G. Cota wrote:
> > Unfortunately, it doesn't seem to help, performance-wise.
>
> That is really disappointing, considering the size gains are huge -- even more
> dramatically for non-x86 hosts.  I will see about some more benchmarking on
> this for other host/guest combinations.

A64 and POWER9 host numbers:

  https://imgur.com/a/m6Pss99

There's quite a bit of noise in the P9 measurements, but it's
a shared machine so I can't do much about that.

I'll update the A64 results with error bars later tonight,
when I get further results.

		E.
Emilio Cota Nov. 30, 2018, 3 a.m. UTC | #4
On Thu, Nov 29, 2018 at 19:39:15 -0500, Emilio G. Cota wrote:
> A64 and POWER9 host numbers:
>
>   https://imgur.com/a/m6Pss99
>
> There's quite a bit of noise in the P9 measurements, but it's
> a shared machine so I can't do much about that.
>
> I'll update the A64 results with error bars later tonight,
> when I get further results.

Here they are:

  https://imgur.com/a/EAAapSW

The second image is the same results, but zoomed in. I could
bring the confidence intervals down by running this many times,
but each run takes 2h and I only have access to the
machine for a few hours at a time.

Those confidence intervals are generated from only 2 runs per benchmark,
which explains why they're so large.
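
(Assuming these are the usual Student-t intervals, the half-width with
n = 2 runs is t_{0.975,1} * s / sqrt(2), and t_{0.975,1} is about 12.7,
so even a small run-to-run spread produces a very wide bar.)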

		E.
Laurent Desnogues Nov. 30, 2018, 7:15 a.m. UTC | #5
On Fri, Nov 30, 2018 at 4:00 AM Emilio G. Cota <cota@braap.org> wrote:
>
> On Thu, Nov 29, 2018 at 19:39:15 -0500, Emilio G. Cota wrote:
> > A64 and POWER9 host numbers:
> >
> >   https://imgur.com/a/m6Pss99
> >
> > There's quite a bit of noise in the P9 measurements, but it's
> > a shared machine so I can't do much about that.
> >
> > I'll update the A64 results with error bars later tonight,
> > when I get further results.
>
> Here they are:
>
>   https://imgur.com/a/EAAapSW

What is a X-Gene A57? It's either X-Gene or A57 :-)

Thanks,

Laurent

> The second image is the same results, but zoomed in. I could
> bring the confidence intervals down by running this many times,
> but each run takes 2h and I only have access to the
> machine for a few hours at a time.
>
> Those confidence intervals are generated from only 2 runs per benchmark,
> which explains why they're so large.
>
>                 E.
>
Emilio Cota Nov. 30, 2018, 3:56 p.m. UTC | #6
On Fri, Nov 30, 2018 at 08:15:56 +0100, Laurent Desnogues wrote:
> On Fri, Nov 30, 2018 at 4:00 AM Emilio G. Cota <cota@braap.org> wrote:
> >
> > On Thu, Nov 29, 2018 at 19:39:15 -0500, Emilio G. Cota wrote:
> > > A64 and POWER9 host numbers:
> > >
> > >   https://imgur.com/a/m6Pss99
> > >
> > > There's quite a bit of noise in the P9 measurements, but it's
> > > a shared machine so I can't do much about that.
> > >
> > > I'll update the A64 results with error bars later tonight,
> > > when I get further results.
> >
> > Here they are:
> >
> >   https://imgur.com/a/EAAapSW
>
> What is a X-Gene A57? It's either X-Gene or A57 :-)

You're right -- this is an X-Gene (xgene 1).

The A57 reference came from here:

 https://www.cloudlab.us/hardware.php
 m400 nodes: 45 per chassis, 315 total
 Processor/Chipset: Applied Micro X-Gene system-on-chip
 Eight 64-bit ARMv8 (Atlas/A57) cores at 2.4 GHz
                           ^^^

I'm not familiar with ARMv8's commercial offerings, so I
just quoted the above--which turns out to be wrong,
since A57 is an ARM design and X-Gene is not.

Thanks,

		E.
Emilio Cota Dec. 24, 2018, 9:53 p.m. UTC | #7
On Tue, Nov 27, 2018 at 21:38:22 -0800, Richard Henderson wrote:
> The intent here is to remove several move insns putting the
> function arguments into the proper place.  I'm hoping that
> this will solve the skylake regression with spec2006, as
> seen with the ool softmmu patch set.

Reviewed-by: Emilio G. Cota <cota@braap.org>

for the series.

Thanks,

		E.