Message ID: 20230130205935.1157347-1-richard.henderson@linaro.org
Series: tcg: Simplify temporary usage
Hi Richard,

On Mon, Jan 30, 2023 at 10:59:07 -1000, Richard Henderson wrote:
(snip)
> With this, and by not recycling TEMP_LOCAL, we can get identical code
> out of the backend even when the front end translators are adjusted
> to use TEMP_LOCAL for everything.
>
> Benchmarking one test case, qemu-arm linux-test, the new liveness pass
> comes in at about 1.6% on perf, but I can't see any difference in
> wall clock time before and after the patch set.

Yesterday I ran the linux-user SPEC06 benchmarks from your tcg-life
branch. I see perf regressions for two workloads (sjeng and xalancbmk).
With perf(1) I see liveness_pass* at 0.00%, so I wonder: is it possible
that the emitted code isn't quite the same?

Happy to run more tests if helpful. Results below.

Thanks,
		Emilio

- bar chart, png: https://postimg.cc/ZCTkbYS9

- bar chart, txt:
  [ASCII bar chart omitted; see the PNG link above for a readable version.
   Title: "Speedup of tcg-life (de6361f6) over master (ae2b5d83)".
   Host: AMD Ryzen 7 PRO 5850U. Compiler: gcc12. Y axis: speedup,
   0.94 to 1.03. Benchmarks: 400.perlbench, 401.bzip2, 403.gcc, 429.mcf,
   445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref,
   471.omnetpp, 473.astar, 483.xalancbmk, geomean. Most results are
   within noise of 1.00; 458.sjeng and 483.xalancbmk show clear
   slowdowns.]

- Raw data for the bar chart:

+ baseline:
# benchmark     mean              stdev              raw
400.perlbench   94.4343747333333  0.331828752549838  94.131272,94.421923,94.34074,94.747239,94.982504,94.602928,93.743109,94.077325,94.220688,94.505739,94.598781,94.779386,94.177626,94.811701,94.37466
401.bzip2       83.0563643333333  0.270338451882521  83.603378,82.784967,82.766427,83.703505,83.018864,82.859924,83.128875,83.052816,82.921046,82.809962,83.027326,83.122502,83.099782,83.005817,82.940274
403.gcc         2.8751204         0.0183794528241263 2.872445,2.886974,2.884226,2.877824,2.871482,2.927202,2.864385,2.86503,2.855154,2.856129,2.86079,2.861818,2.887109,2.867046,2.889192
429.mcf         13.527965         0.0849965442919382 13.498908,13.494126,13.469952,13.606229,13.604864,13.513806,13.472737,13.572454,13.407602,13.70441,13.487249,13.562176,13.503575,13.39053,13.630857
445.gobmk       279.017610333333  1.91925368167126   279.808944,278.057813,278.831984,279.388752,276.801944,280.078062,278.675088,277.094009,279.452037,278.832294,278.843473,279.407613,275.879438,284.430909,279.681795
456.hmmer       103.296133533333  0.38166706019324   103.33233,102.944119,103.083766,103.001765,104.302275,103.329573,103.720265,103.537909,102.931565,103.008669,102.974703,103.5448,103.484963,102.958228,103.287073
458.sjeng       332.387649666667  0.868297133920158  331.71233,333.413204,333.367836,332.57489,331.818019,331.14369,333.848697,333.135605,332.878587,332.069454,332.003468,332.692292,331.01894,331.426129,332.711604
462.libquantum  4.12260253333333  0.00508688019554322 4.121422,4.116031,4.131564,4.113532,4.117144,4.124039,4.128896,4.118079,4.121929,4.124027,4.125302,4.124549,4.119102,4.125368,4.128054
464.h264ref     244.092639533333  13.3464074285764   239.569243,240.187437,240.760271,241.483515,240.772044,241.492141,240.530232,240.449723,240.679955,240.464527,241.3703,292.302111,240.254072,240.490477,240.583545
471.omnetpp     261.340260533333  3.7694119109844    263.463533,259.640839,260.834291,263.816131,256.877675,259.833289,258.708458,261.868763,260.75424,265.656161,257.900388,265.734187,256.747515,270.004887,258.263551
473.astar       142.966170866667  0.481395129184935  142.636087,142.675786,141.895549,143.236359,142.892086,142.325069,143.267024,143.910479,143.279771,142.666683,143.11241,143.15343,143.041394,143.391831,143.008605
483.xalancbmk   401.605619866667  3.99007996364547   401.101824,400.266261,396.474675,406.136427,404.400767,406.339383,397.442574,409.241015,399.084079,399.828507,402.585078,394.89061,404.722299,401.654323,399.916476

+ tcg-life:
# benchmark     mean              stdev              raw
400.perlbench   94.1968828666667  0.352661861692484  94.726037,94.169276,93.893696,94.224617,94.613626,94.471446,94.198829,94.616742,93.845426,93.435601,94.040449,94.574709,94.105065,94.007179,94.030545
401.bzip2       83.0027554666667  0.214192109333076  83.181646,83.299212,83.342217,82.848151,82.808142,82.888099,82.942223,82.777883,82.739787,82.770313,83.01728,83.327844,83.201232,82.905666,82.991637
403.gcc         2.87870153333333  0.0304401106926527 2.860922,2.867219,2.860457,2.888637,2.879031,2.87397,2.882131,2.896422,2.865079,2.870739,2.847357,2.864518,2.901592,2.849287,2.973162
429.mcf         13.6952006666667  0.155876459519191  13.734646,13.746608,13.528171,13.577692,13.534005,13.65201,13.947822,13.541465,13.710553,13.787918,13.521862,13.997184,13.546621,13.848357,13.753096
445.gobmk       282.1855452       1.68500895181812   281.715494,282.875207,282.073035,281.660872,281.96679,278.912804,281.078281,283.777396,283.485664,278.564193,283.900278,283.662609,282.781748,284.176339,282.152468
456.hmmer       103.3804904       0.554303069916862  103.077106,103.013059,103.247046,105.192431,103.221722,102.99502,103.787524,103.086281,103.213953,103.048905,103.042041,103.664296,103.278652,103.445109,103.394211
458.sjeng       339.3596132       3.77963378278808   341.545293,341.249426,336.87165,343.192545,338.087093,339.691087,337.29754,341.586473,336.838538,345.476397,339.196873,342.773593,337.546389,329.312139,339.729162
462.libquantum  4.1225128         0.00546800475754836 4.112292,4.119043,4.119803,4.129127,4.117612,4.122837,4.120172,4.121449,4.127452,4.113505,4.129305,4.128303,4.126079,4.127113,4.1236
464.h264ref     243.447219066667  0.924288945630674  241.71547,242.724405,242.751474,243.730945,243.889673,243.254516,244.328523,244.374465,243.447008,245.45696,243.256098,242.348791,243.440131,242.895642,244.094185
471.omnetpp     268.2971082       5.67916415832786   271.509491,273.656661,274.294363,266.501929,272.7864,267.868119,271.032049,267.085038,256.124737,270.430985,271.586944,256.427087,268.23723,264.012334,272.903256
473.astar       142.842279266667  0.482819143874435  142.820726,142.742386,143.237814,143.241978,142.761549,142.026643,143.042933,142.849644,143.035134,142.150158,142.066603,143.086841,143.701693,142.553374,143.316713
483.xalancbmk   420.324755133333  8.22679014442942   424.925688,433.128404,415.710656,423.156208,428.067657,426.100068,429.6215,412.083569,411.921022,410.749722,407.134107,414.478705,416.110115,430.104758,421.579148

I then ran perf record on xalancbmk before/after:
time for suffix in gcc12; do \
    for tag in tcg-life-baseline tcg-life; do \
      perf record -o /tmp/$tag-$suffix.perf.data -k 1 taskset -c 2 \
        ./spec06.pl --iterations=1 --size=train --config=aarch64 --show-raw \
        run ~/src/dbt-bench/out/$tag-$suffix/bin/qemu-aarch64 \
        ~/src/spec/spec06-aarch64 xalancbmk; \
    done; \
  done

483.xalancbmk (#1/1)
run_base_train_aarch64.0068.qemu-aarch64: qemu-aarch64 Xalan_base.aarch64 -v allbooks.xml xalanc.xsl: 410.191153s
# benchmark     mean        stdev  raw
483.xalancbmk   410.191153  0      410.191153
[ perf record: Woken up 251 times to write data ]
[ perf record: Captured and wrote 62.629 MB /tmp/tcg-life-baseline-gcc12.perf.data (1641030 samples) ]

483.xalancbmk (#1/1)
run_base_train_aarch64.0069.qemu-aarch64: qemu-aarch64 Xalan_base.aarch64 -v allbooks.xml xalanc.xsl: 464.428108s
# benchmark     mean        stdev  raw
483.xalancbmk   464.428108  0      464.428108
[ perf record: Woken up 284 times to write data ]
[ perf record: Captured and wrote 70.905 MB /tmp/tcg-life-gcc12.perf.data (1857959 samples) ]

real	14m35.863s
user	14m34.897s
sys	0m0.925s

- perf report (baseline):

# Total Lost Samples: 0
#
# Samples: 1M of event 'cycles'
# Event count (approx.): 1797955092780
#
# Overhead  Command       Shared Object     Symbol
# ........  ............  ................  ............................
#
    43.83%  qemu-aarch64  qemu-aarch64      [.] helper_lookup_tb_ptr
     5.56%  qemu-aarch64  qemu-aarch64      [.] cpu_get_tb_cpu_state
     2.23%  qemu-aarch64  qemu-aarch64      [.] qht_lookup_custom
     1.57%  qemu-aarch64  qemu-aarch64      [.] tb_htable_lookup
     1.29%  qemu-aarch64  qemu-aarch64      [.] tb_lookup_cmp
     0.72%  qemu-aarch64  qemu-aarch64      [.] interval_tree_iter_first
     0.28%  qemu-aarch64  qemu-aarch64      [.] helper_vfp_cmpd_a64
     0.27%  qemu-aarch64  [JIT] tid 561758  [.] 0x00007f79244b2a43
     0.24%  qemu-aarch64  [JIT] tid 561758  [.] 0x00007f792449c058
     0.20%  qemu-aarch64  qemu-aarch64      [.] page_get_flags
     0.20%  qemu-aarch64  [JIT] tid 561758  [.] 0x00007f7924349c22
     0.19%  qemu-aarch64  [JIT] tid 561758  [.] 0x00007f7924349c40
     0.18%  qemu-aarch64  [JIT] tid 561758  [.] 0x00007f7924349203
     0.17%  qemu-aarch64  [JIT] tid 561758  [.] 0x00007f7923e09b03
     0.17%  qemu-aarch64  qemu-aarch64      [.] helper_vfp_cmped_a64
     0.15%  qemu-aarch64  [JIT] tid 561758  [.] 0x00007f7923e9f965
     0.15%  qemu-aarch64  [JIT] tid 561758  [.] 0x00007f7924552f2b
     0.15%  qemu-aarch64  qemu-aarch64      [.] float64_hs_compare
     0.14%  qemu-aarch64  [JIT] tid 561758  [.] 0x00007f79244f7003
     0.14%  qemu-aarch64  [JIT] tid 561758  [.] 0x00007f7924552a03
     0.14%  qemu-aarch64  [JIT] tid 561758  [.] 0x00007f7924349243
     0.14%  qemu-aarch64  [JIT] tid 561758  [.] 0x00007f7924546df6
     0.13%  qemu-aarch64  qemu-aarch64      [.] get_page_addr_code_hostp
     0.12%  qemu-aarch64  [JIT] tid 561758  [.] 0x00007f792454de7b
     0.12%  qemu-aarch64  [JIT] tid 561758  [.] 0x00007f7924555a85
     0.12%  qemu-aarch64  [JIT] tid 561758  [.] 0x00007f792454f465
     0.12%  qemu-aarch64  qemu-aarch64      [.] float64_add
     0.12%  qemu-aarch64  [JIT] tid 561758  [.] 0x00007f792439af03
     0.11%  qemu-aarch64  [JIT] tid 561758  [.] 0x00007f7924554b43
[...]
     0.00%  qemu-aarch64  qemu-aarch64      [.] liveness_pass_1

- perf report (tcg-life):

# Total Lost Samples: 0
#
# Samples: 1M of event 'cycles'
# Event count (approx.): 2035140825489
#
# Overhead  Command       Shared Object     Symbol
# ........  ............  ................  ............................
#
    43.00%  qemu-aarch64  qemu-aarch64      [.] helper_lookup_tb_ptr
     5.73%  qemu-aarch64  qemu-aarch64      [.] cpu_get_tb_cpu_state
     2.16%  qemu-aarch64  qemu-aarch64      [.] qht_lookup_custom
     1.58%  qemu-aarch64  qemu-aarch64      [.] tb_htable_lookup
     1.10%  qemu-aarch64  qemu-aarch64      [.] tb_lookup_cmp
     0.40%  qemu-aarch64  qemu-aarch64      [.] interval_tree_iter_first
     0.26%  qemu-aarch64  [JIT] tid 562312  [.] 0x00007fdeb37d4018
     0.25%  qemu-aarch64  qemu-aarch64      [.] helper_vfp_cmpd_a64
     0.22%  qemu-aarch64  [JIT] tid 562312  [.] 0x00007fdeb387ecb6
     0.21%  qemu-aarch64  qemu-aarch64      [.] page_get_flags
     0.19%  qemu-aarch64  [JIT] tid 562312  [.] 0x00007fdeb3141b03
     0.17%  qemu-aarch64  [JIT] tid 562312  [.] 0x00007fdeb3681d62
     0.16%  qemu-aarch64  [JIT] tid 562312  [.] 0x00007fdeb388ae2b
     0.16%  qemu-aarch64  [JIT] tid 562312  [.] 0x00007fdeb37ea9c3
     0.16%  qemu-aarch64  qemu-aarch64      [.] helper_vfp_cmped_a64
     0.15%  qemu-aarch64  qemu-aarch64      [.] get_page_addr_code_hostp
     0.15%  qemu-aarch64  [JIT] tid 562312  [.] 0x00007fdeb3887325
     0.15%  qemu-aarch64  [JIT] tid 562312  [.] 0x00007fdeb389ddc3
     0.14%  qemu-aarch64  [JIT] tid 562312  [.] 0x00007fdeb3681d80
     0.14%  qemu-aarch64  qemu-aarch64      [.] float64_hs_compare
     0.13%  qemu-aarch64  [JIT] tid 562312  [.] 0x00007fdeb36d2f83
     0.12%  qemu-aarch64  [JIT] tid 562312  [.] 0x00007fdeb3885b65
     0.12%  qemu-aarch64  [JIT] tid 562312  [.] 0x00007fdeb37eab43
     0.12%  qemu-aarch64  [JIT] tid 562312  [.] 0x00007fdeb388a903
     0.12%  qemu-aarch64  [JIT] tid 562312  [.] 0x00007fdeb31d7925
     0.11%  qemu-aarch64  qemu-aarch64      [.] parts64_float_to_sint
     0.11%  qemu-aarch64  [JIT] tid 562312  [.] 0x00007fdeb3885d3b
     0.11%  qemu-aarch64  [JIT] tid 562312  [.] 0x00007fdeb3681383
     0.11%  qemu-aarch64  [JIT] tid 562312  [.] 0x00007fdeb388a683
[...]
     0.00%  qemu-aarch64  qemu-aarch64      [.] liveness_pass_1
     0.00%  qemu-aarch64  qemu-aarch64      [.] liveness_pass_0
Ping for the 9 patches lacking review.

r~

On 1/30/23 10:59, Richard Henderson wrote:
> Based-on: 20230126043824.54819-1-richard.henderson@linaro.org
> ("[PATCH v5 00/36] tcg: Support for Int128 with helpers")
>
> The biggest pitfall for new users of TCG is the fact that "normal"
> temporaries die at branches, and we must therefore use a different
> "local" temporary in that case.
>
> The following patch set changes that, so that the "normal" temporary
> is the one that lives across branches, and there is a special temporary
> that dies at the end of the extended basic block, and this special
> case is reserved for tcg internals.
>
> TEMP_LOCAL is renamed TEMP_TB, which I believe to be more explicit and
> less confusing.  TEMP_NORMAL is removed entirely.
>
> I thought about putting in a proper full-power liveness analysis pass.
> This would have eliminated the differences between all non-global
> temporaries, and would have noticed when a TEMP_LOCAL finally dies
> within a translation, avoiding any final writeback.
> But I came to the conclusion that it was too expensive in runtime,
> and so retaining some distinction in the types was required.
>
> In addition, I found that the usage of temps within plugin-gen.c
> (9 per guest memory operation) meant that we *must* have some form
> of temp that can be re-used.  (There is one x86 instruction which
> generates 62 memory operations; 62 * 9 == 558, which is larger than
> our current TCG_MAX_TEMPS.)
>
> However, I did add a new liveness pass which, with a single pass over
> the opcode stream, can see that a TEMP_LOCAL is only live within a
> single extended basic block, and thus may be transformed to TEMP_EBB.
>
> With this, and by not recycling TEMP_LOCAL, we can get identical code
> out of the backend even when the front end translators are adjusted
> to use TEMP_LOCAL for everything.
>
> Benchmarking one test case, qemu-arm linux-test, the new liveness pass
> comes in at about 1.6% on perf, but I can't see any difference in
> wall clock time before and after the patch set.
>
>
> r~
>
>
> Richard Henderson (27):
>   tcg: Adjust TCGContext.temps_in_use check
>   accel/tcg: Pass max_insn to gen_intermediate_code by pointer
>   accel/tcg: Use more accurate max_insns for tb_overflow
>   tcg: Remove branch-to-next regardless of reference count
>   tcg: Rename TEMP_LOCAL to TEMP_TB
>   tcg: Add liveness_pass_0
>   tcg: Remove TEMP_NORMAL
>   tcg: Pass TCGTempKind to tcg_temp_new_internal
>   tcg: Add tcg_temp_ebb_new_{i32,i64,ptr}
>   tcg: Add tcg_gen_movi_ptr
>   tcg: Use tcg_temp_ebb_new_* in tcg/
>   accel/tcg/plugin: Use tcg_temp_ebb_*
>   accel/tcg/plugin: Tidy plugin_gen_disable_mem_helpers
>   tcg: Don't re-use TEMP_TB temporaries
>   tcg: Change default temp lifetime to TEMP_TB
>   target/arm: Drop copies in gen_sve_{ldr,str}
>   target/arm: Don't use tcg_temp_local_new_*
>   target/cris: Don't use tcg_temp_local_new
>   target/hexagon: Don't use tcg_temp_local_new_*
>   target/hppa: Don't use tcg_temp_local_new
>   target/i386: Don't use tcg_temp_local_new
>   target/mips: Don't use tcg_temp_local_new
>   target/ppc: Don't use tcg_temp_local_new
>   target/xtensa: Don't use tcg_temp_local_new_*
>   exec/gen-icount: Don't use tcg_temp_local_new_i32
>   tcg: Remove tcg_temp_local_new_*, tcg_const_local_*
>   tcg: Update docs/devel/tcg-ops.rst for temporary changes
>
>  docs/devel/tcg-ops.rst                      | 103 ++++----
>  target/hexagon/idef-parser/README.rst       |   4 +-
>  include/exec/gen-icount.h                   |   8 +-
>  include/exec/translator.h                   |   4 +-
>  include/tcg/tcg-op.h                        |   7 +-
>  include/tcg/tcg.h                           |  64 ++---
>  target/arm/translate-a64.h                  |   1 -
>  target/hexagon/gen_tcg.h                    |   4 +-
>  accel/tcg/plugin-gen.c                      |  33 +--
>  accel/tcg/translate-all.c                   |   2 +-
>  accel/tcg/translator.c                      |   6 +-
>  target/alpha/translate.c                    |   2 +-
>  target/arm/translate-a64.c                  |   6 -
>  target/arm/translate-sve.c                  |  38 +--
>  target/arm/translate.c                      |   8 +-
>  target/avr/translate.c                      |   2 +-
>  target/cris/translate.c                     |   8 +-
>  target/hexagon/genptr.c                     |  16 +-
>  target/hexagon/idef-parser/parser-helpers.c |   4 +-
>  target/hexagon/translate.c                  |   4 +-
>  target/hppa/translate.c                     |   5 +-
>  target/i386/tcg/translate.c                 |  29 +--
>  target/loongarch/translate.c                |   2 +-
>  target/m68k/translate.c                     |   2 +-
>  target/microblaze/translate.c               |   2 +-
>  target/mips/tcg/translate.c                 |  59 ++---
>  target/nios2/translate.c                    |   2 +-
>  target/openrisc/translate.c                 |   2 +-
>  target/ppc/translate.c                      |   8 +-
>  target/riscv/translate.c                    |   2 +-
>  target/rx/translate.c                       |   2 +-
>  target/s390x/tcg/translate.c                |   2 +-
>  target/sh4/translate.c                      |   2 +-
>  target/sparc/translate.c                    |   2 +-
>  target/tricore/translate.c                  |   2 +-
>  target/xtensa/translate.c                   |  18 +-
>  tcg/optimize.c                              |   2 +-
>  tcg/tcg-op-gvec.c                           | 270 ++++++++++----------
>  tcg/tcg-op.c                                | 258 +++++++++----------
>  tcg/tcg.c                                   | 270 +++++++++++---------
>  target/cris/translate_v10.c.inc             |  10 +-
>  target/mips/tcg/nanomips_translate.c.inc    |   4 +-
>  target/ppc/translate/spe-impl.c.inc         |   8 +-
>  target/ppc/translate/vmx-impl.c.inc         |   4 +-
>  target/hexagon/README                       |   8 +-
>  target/hexagon/gen_tcg_funcs.py             |  18 +-
>
>  46 files changed, 640 insertions(+), 677 deletions(-)
On 2/10/23 02:35, Emilio Cota wrote:
> Yesterday I ran the linux-user SPEC06 benchmarks from your tcg-life
> branch. I see perf regressions for two workloads (sjeng and xalancbmk).
> With perf(1) I see liveness_pass* at 0.00%, so I wonder: is it possible
> that the emitted code isn't quite the same?

Everything that I checked by hand was the same, but it's possible.
It's a tedious process.  You'd definitely want to turn off ASLR.

My current branch has __attribute__((noinline)) added to all of the
liveness passes, so that they don't get folded into tcg_gen_code.
But I still would expect 0%.


r~
On Wed, Feb 15, 2023 at 20:15:37 -1000, Richard Henderson wrote:
> On 2/10/23 02:35, Emilio Cota wrote:
> > Yesterday I ran the linux-user SPEC06 benchmarks from your tcg-life
> > branch. I see perf regressions for two workloads (sjeng and xalancbmk).
> > With perf(1) I see liveness_pass* at 0.00%, so I wonder: is it possible
> > that the emitted code isn't quite the same?
>
> Everything that I checked by hand was the same, but it's possible.
> It's a tedious process.  You'd definitely want to turn off ASLR.

I've checked with -jitdump and perf whether there was any difference
in the generated code before vs. after for the most common TBs.
They were identical.  Benchmarking without ASLR didn't make a
difference, unfortunately.

> My current branch has __attribute__((noinline)) added to all of the
> liveness passes, so that they don't get folded into tcg_gen_code.
> But I still would expect 0%.

I'll bisect the series in the next few days to see exactly where the
perf regression begins, so that at least we know where to look.

Thanks,
		Emilio