[RFC] include/exec/cpu-defs.h: try and make SoftMMU page size match target

Message ID 20170710142850.10468-1-alex.bennee@linaro.org
State New
Headers show

Commit Message

Alex Bennée July 10, 2017, 2:28 p.m.
While the SoftMMU is not emulating the target MMU of a system, there is
still a relationship between its page size and that of the target. If the
target MMU is full-featured, the functions called to re-fill the SoftMMU
entries start moving up the perf profiles. If we can, we should try to
prevent too much thrashing by keeping the page sizes the same.

Ideally we would use TARGET_PAGE_BITS_MIN, but that potentially involves
a fair bit of #include re-jigging, so I went for 10 bits (1k pages),
which I think is the smallest page size of all our emulated systems.

Some quick numbers show a reasonable performance win on an x86_64
host:

 ./aarch64-softmmu/qemu-system-aarch64 -machine type=virt \
   -display none -m 16384 -cpu cortex-a57 -serial mon:stdio \
   -drive file=../jessie-arm64.qcow2,id=myblock,index=0,if=none \
   -device virtio-blk-device,drive=myblock \
   -append "console=ttyAMA0 root=/dev/vda1 systemd.unit=benchmark-build.service" \
   -kernel ../aarch64-current-linux-kernel-only.img -machine gic-version=3 -smp 4

8 bit TLB:
  run 1: ret=0 (PASS), time=425.202797 (1/1)
  run 2: ret=0 (PASS), time=410.421742 (2/2)
  run 3: ret=0 (PASS), time=417.666752 (3/3)
  run 4: ret=0 (PASS), time=411.158793 (4/4)
  run 5: ret=0 (PASS), time=417.133068 (5/5)
  Results summary:
  0: 5 times (100.00%), avg time 416.317 (35.70 variance/5.98 deviation)

10 bit TLB:
  run 1: ret=0 (PASS), time=359.310380 (1/1)
  run 2: ret=0 (PASS), time=387.826981 (2/2)
  run 3: ret=0 (PASS), time=381.097123 (3/3)
  run 4: ret=0 (PASS), time=393.826197 (4/4)
  run 5: ret=0 (PASS), time=384.340781 (5/5)
  Results summary:
  0: 5 times (100.00%), avg time 381.280 (173.08 variance/13.16 deviation)

CC: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>

---
 include/exec/cpu-defs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.13.0

Comments

Peter Maydell July 10, 2017, 2:35 p.m. | #1
On 10 July 2017 at 15:28, Alex Bennée <alex.bennee@linaro.org> wrote:
> While the SoftMMU is not emulating the target MMU of a system, there is
> still a relationship between its page size and that of the target. If the
> target MMU is full-featured, the functions called to re-fill the SoftMMU
> entries start moving up the perf profiles. If we can, we should try to
> prevent too much thrashing by keeping the page sizes the same.
>
> Ideally we would use TARGET_PAGE_BITS_MIN, but that potentially involves
> a fair bit of #include re-jigging, so I went for 10 bits (1k pages),
> which I think is the smallest page size of all our emulated systems.

The figures certainly show an improvement, but it's not clear
to me why this is related to the target's page size rather than
just being a "bigger is better" kind of thing?

thanks
-- PMM
Alex Bennée July 10, 2017, 3:17 p.m. | #2
Peter Maydell <peter.maydell@linaro.org> writes:

> On 10 July 2017 at 15:28, Alex Bennée <alex.bennee@linaro.org> wrote:
>> While the SoftMMU is not emulating the target MMU of a system, there is
>> still a relationship between its page size and that of the target. If the
>> target MMU is full-featured, the functions called to re-fill the SoftMMU
>> entries start moving up the perf profiles. If we can, we should try to
>> prevent too much thrashing by keeping the page sizes the same.
>>
>> Ideally we would use TARGET_PAGE_BITS_MIN, but that potentially involves
>> a fair bit of #include re-jigging, so I went for 10 bits (1k pages),
>> which I think is the smallest page size of all our emulated systems.
>
> The figures certainly show an improvement, but it's not clear
> to me why this is related to the target's page size rather than
> just being a "bigger is better" kind of thing?

Well, this was driven by a discussion with Pranith last week. In his
(admittedly memory-intensive) benchmarking he was seeing around 30% of
the overhead coming from MMU-related functions, with the hottest being
get_phys_addr_lpae() followed by address_space_do_translate(). We
theorised that, even given the high hit rate of the fast path, the slow
path was being triggered by moving over SoftMMU's effective page
boundary. A quick experiment in extending the size of the TLB made his
hot spots disappear.

I don't see quite such a hot-spot in my simple boot/build benchmark test
but after helper_lookup_tb_ptr quite a lot of hits are part of the
re-fill chain:

  16.37%  qemu-system-aar  qemu-system-aarch64      [.] helper_lookup_tb_ptr
   3.43%  qemu-system-aar  qemu-system-aarch64      [.] victim_tlb_hit
   2.73%  qemu-system-aar  qemu-system-aarch64      [.] tlb_set_page_with_attrs
   2.60%  qemu-system-aar  qemu-system-aarch64      [.] get_phys_addr_lpae
   2.36%  qemu-system-aar  qemu-system-aarch64      [.] qht_lookup
   1.53%  qemu-system-aar  qemu-system-aarch64      [.] arm_regime_tbi1
   1.37%  qemu-system-aar  qemu-system-aarch64      [.] tcg_optimize
   1.34%  qemu-system-aar  qemu-system-aarch64      [.] tcg_gen_code
   1.31%  qemu-system-aar  qemu-system-aarch64      [.] arm_regime_tbi0
   1.28%  qemu-system-aar  qemu-system-aarch64      [.] address_space_ldq_le
   1.22%  qemu-system-aar  qemu-system-aarch64      [.] object_dynamic_cast_assert
   1.11%  qemu-system-aar  qemu-system-aarch64      [.] address_space_translate_internal
   1.03%  qemu-system-aar  qemu-system-aarch64      [.] tb_htable_lookup
   0.98%  qemu-system-aar  qemu-system-aarch64      [.] get_page_addr_code
   0.98%  qemu-system-aar  qemu-system-aarch64      [.] address_space_do_translate
   0.87%  qemu-system-aar  qemu-system-aarch64      [.] object_class_dynamic_cast_assert
   0.82%  qemu-system-aar  qemu-system-aarch64      [.] get_phys_addr
   0.75%  qemu-system-aar  qemu-system-aarch64      [.] tb_cmp
   0.63%  qemu-system-aar  qemu-system-aarch64      [.] liveness_pass_1
   0.59%  qemu-system-aar  qemu-system-aarch64      [.] helper_le_ldq_mmu

--
Alex Bennée
Peter Maydell July 10, 2017, 3:23 p.m. | #3
On 10 July 2017 at 16:17, Alex Bennée <alex.bennee@linaro.org> wrote:
> Peter Maydell <peter.maydell@linaro.org> writes:
>> On 10 July 2017 at 15:28, Alex Bennée <alex.bennee@linaro.org> wrote:
>>> While the SoftMMU is not emulating the target MMU of a system, there is
>>> still a relationship between its page size and that of the target. If the
>>> target MMU is full-featured, the functions called to re-fill the SoftMMU
>>> entries start moving up the perf profiles. If we can, we should try to
>>> prevent too much thrashing by keeping the page sizes the same.
>>>
>>> Ideally we would use TARGET_PAGE_BITS_MIN, but that potentially involves
>>> a fair bit of #include re-jigging, so I went for 10 bits (1k pages),
>>> which I think is the smallest page size of all our emulated systems.
>>
>> The figures certainly show an improvement, but it's not clear
>> to me why this is related to the target's page size rather than
>> just being a "bigger is better" kind of thing?
>
> Well, this was driven by a discussion with Pranith last week. In his
> (admittedly memory-intensive) benchmarking he was seeing around 30% of
> the overhead coming from MMU-related functions, with the hottest being
> get_phys_addr_lpae() followed by address_space_do_translate(). We
> theorised that, even given the high hit rate of the fast path, the slow
> path was being triggered by moving over SoftMMU's effective page
> boundary. A quick experiment in extending the size of the TLB made his
> hot spots disappear.
>
> I don't see quite such a hot-spot in my simple boot/build benchmark test
> but after helper_lookup_tb_ptr quite a lot of hits are part of the
> re-fill chain:

Right, but why do we know that the target page size matters rather
than this just being "smaller TLB -> more TLB misses -> more calls
to the slow path -> functions called in the slow path appear more
in profiling" ?

thanks
-- PMM
Richard Henderson July 10, 2017, 4:55 p.m. | #4
On 07/10/2017 04:28 AM, Alex Bennée wrote:
> While the SoftMMU is not emulating the target MMU of a system, there is
> still a relationship between its page size and that of the target. If the
> target MMU is full-featured, the functions called to re-fill the SoftMMU
> entries start moving up the perf profiles. If we can, we should try to
> prevent too much thrashing by keeping the page sizes the same.

What you are changing has absolutely nothing to do with page size.  Mentioning 
page sizes just confuses the issue.

What you're changing is the number of entries in the TLB, and the mapping 
between pages and TLB entries.  And I'm quite certain that you saw a 
performance increase by increasing the size of the TLB.


>   #define CPU_TLB_BITS                                             \
> -    MIN(8,                                                       \
> +    MIN(10,                                                      \

You will find this has broken tcg/arm, because the operand to an armv6 AND 
(immediate) instruction is only 8 bits.

I will grant you that we could do a better job of configuring this across the 
tcg backends.

It also possibly warrants some work in the tcg/arm backend so that armv7 makes 
use of UBFX so that not-ancient arm is not so constrained.


r~
Alex Bennée July 10, 2017, 6:28 p.m. | #5
Richard Henderson <rth@twiddle.net> writes:

> On 07/10/2017 04:28 AM, Alex Bennée wrote:
>> While the SoftMMU is not emulating the target MMU of a system, there is
>> still a relationship between its page size and that of the target. If the
>> target MMU is full-featured, the functions called to re-fill the SoftMMU
>> entries start moving up the perf profiles. If we can, we should try to
>> prevent too much thrashing by keeping the page sizes the same.
>
> What you are changing has absolutely nothing to do with page size.
> Mentioning page sizes just confuses the issue.
>
> What you're changing is the number of entries in the TLB, and the
> mapping between pages and TLB entries.  And I'm quite certain that you
> saw a performance increase by increasing the size of the TLB.

You are quite right; I'm going to claim Monday brain on the commit text.

There is some relationship, though, as the index into the SoftMMU TLB is
derived from the target page number masked by the total number of
SoftMMU TLB entries we have. By increasing the SoftMMU TLB size we
reduce the number of target pages that alias to any given SoftMMU entry.

I guess that to better understand the slow-path fill patterns we would
want an idea of which entries are being re-filled most often, and
whether there is an even distribution over the TLB.

>>   #define CPU_TLB_BITS                                             \
>> -    MIN(8,                                                       \
>> +    MIN(10,                                                      \
>
> You will find this has broken tcg/arm, because the operand to an armv6
> AND (immediate) instruction is only 8 bits.
>
> I will grant you that we could do a better job of configuring this
> across the tcg backends.
>
> It also possibly warrants some work in the tcg/arm backend so that
> armv7 makes use of UBFX so that not-ancient arm is not so constrained.

I think Pranith is going to take this forward with a patch set that
expands the SoftMMU TLB to the maximum each backend can support.

--
Alex Bennée

Patch

diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index bc8e7f848d..a0f9249752 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -89,7 +89,7 @@  typedef uint64_t target_ulong;
  * of tlb_table inside env (which is non-trivial but not huge).
  */
 #define CPU_TLB_BITS                                             \
-    MIN(8,                                                       \
+    MIN(10,                                                      \
         TCG_TARGET_TLB_DISPLACEMENT_BITS - CPU_TLB_ENTRY_BITS -  \
         (NB_MMU_MODES <= 1 ? 0 :                                 \
          NB_MMU_MODES <= 2 ? 1 :                                 \