[v1,4/4] accel/tcg: increase default code gen buffer size for 64 bit

Message ID 20200226181020.19592-5-alex.bennee@linaro.org
State New
Series
  • Fix codegen translation cache size

Commit Message

Alex Bennée Feb. 26, 2020, 6:10 p.m.
While 32MiB is certainly usable, a full system boot ends up flushing the
codegen buffer nearly 100 times. Increase the default on 64-bit hosts
to take advantage of all that spare memory. After this change I can
boot my test system without any TB flushes.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>

---
 accel/tcg/translate-all.c | 4 ++++
 1 file changed, 4 insertions(+)

-- 
2.20.1

Comments

Niek Linnenbank Feb. 26, 2020, 10:45 p.m. | #1
Hi Alex,

On Wed, Feb 26, 2020 at 7:13 PM Alex Bennée <alex.bennee@linaro.org> wrote:

> While 32mb is certainly usable a full system boot ends up flushing the
> codegen buffer nearly 100 times. Increase the default on 64 bit hosts
> to take advantage of all that spare memory. After this change I can
> boot my tests system without any TB flushes.
>


That's great: with this change I'm seeing a performance improvement when
running the avocado tests for cubieboard. It runs about 4-5 seconds
faster. My host is Ubuntu 18.04 on 64-bit.

I don't know much about the internals of TCG nor how it actually uses
the cache, but it seems logical to me that increasing the cache size
would improve performance.

What I'm wondering is: will this also result in TCG translating larger
chunks in one shot, so potentially taking more time to do the
translation? If so, could it perhaps affect more latency-sensitive code?


>
> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
>

Tested-by: Niek Linnenbank <nieklinnenbank@gmail.com>

> ---
>  accel/tcg/translate-all.c | 4 ++++
>  1 file changed, 4 insertions(+)
>

> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index 4ce5d1b3931..f7baa512059 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -929,7 +929,11 @@ static void page_lock_pair(PageDesc **ret_p1, tb_page_addr_t phys1,
>  # define MAX_CODE_GEN_BUFFER_SIZE  ((size_t)-1)
>  #endif
>
> +#if TCG_TARGET_REG_BITS == 32
>  #define DEFAULT_CODE_GEN_BUFFER_SIZE_1 (32 * MiB)
> +#else
> +#define DEFAULT_CODE_GEN_BUFFER_SIZE_1 (2 * GiB)
> +#endif
>


The qemu process now takes up more virtual memory, about 2.5GiB in my
test, which can be expected with this change.

Is it very likely that the TCG cache will be filled quickly and
completely? I'm asking because I also use QEMU for automated testing
where the nodes are 64-bit but each has only 2GiB of physical RAM.

Regards,
Niek


>
>  #define DEFAULT_CODE_GEN_BUFFER_SIZE \
>    (DEFAULT_CODE_GEN_BUFFER_SIZE_1 < MAX_CODE_GEN_BUFFER_SIZE \
> --
> 2.20.1
>


-- 
Niek Linnenbank
Richard Henderson Feb. 26, 2020, 10:55 p.m. | #2
On 2/26/20 10:10 AM, Alex Bennée wrote:
> While 32mb is certainly usable a full system boot ends up flushing the
> codegen buffer nearly 100 times. Increase the default on 64 bit hosts
> to take advantage of all that spare memory. After this change I can
> boot my tests system without any TB flushes.

> +#if TCG_TARGET_REG_BITS == 32
>  #define DEFAULT_CODE_GEN_BUFFER_SIZE_1 (32 * MiB)
> +#else
> +#define DEFAULT_CODE_GEN_BUFFER_SIZE_1 (2 * GiB)
> +#endif


This particular number, I'm not so sure about.

It makes sense for a lone VM, running in system mode, on a large-ish host.
It's more questionable for a large-ish host running many system-mode VMs,
although one can tune that from the command line, so perhaps it's still ok.

It does not make sense for a linux-user chroot running make -jN, on just
about any host. For linux-user, I could be happy with a modest increase,
but not all the way out to 2GiB.

Discuss.


r~
Alex Bennée Feb. 27, 2020, 12:19 p.m. | #3
Niek Linnenbank <nieklinnenbank@gmail.com> writes:

> Hi Alex,
>
> On Wed, Feb 26, 2020 at 7:13 PM Alex Bennée <alex.bennee@linaro.org> wrote:
>
>> While 32mb is certainly usable a full system boot ends up flushing the
>> codegen buffer nearly 100 times. Increase the default on 64 bit hosts
>> to take advantage of all that spare memory. After this change I can
>> boot my tests system without any TB flushes.
>>
>
> That great, with this change I'm seeing a performance improvement when
> running the avocado tests for cubieboard.
> It runs about 4-5 seconds faster. My host is Ubuntu 18.04 on 64-bit.
>
> I don't know much about the internals of TCG nor how it actually uses the
> cache, but it seems logical to me that increasing the cache size would
> improve performance.
>
> What I'm wondering is: will this also result in TCG translating larger
> chunks in one shot, so potentially taking more time to do the translation?
> If so, could it perhaps affect more latency sensitive code?


No - the size of the translation blocks is governed by the guest code
and where it ends a basic block. In system mode we also care about
crossing guest page boundaries.

>> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
>>
> Tested-by: Niek Linnenbank <nieklinnenbank@gmail.com>
>
>> ---
>>  accel/tcg/translate-all.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
>> index 4ce5d1b3931..f7baa512059 100644
>> --- a/accel/tcg/translate-all.c
>> +++ b/accel/tcg/translate-all.c
>> @@ -929,7 +929,11 @@ static void page_lock_pair(PageDesc **ret_p1, tb_page_addr_t phys1,
>>  # define MAX_CODE_GEN_BUFFER_SIZE  ((size_t)-1)
>>  #endif
>>
>> +#if TCG_TARGET_REG_BITS == 32
>>  #define DEFAULT_CODE_GEN_BUFFER_SIZE_1 (32 * MiB)
>> +#else
>> +#define DEFAULT_CODE_GEN_BUFFER_SIZE_1 (2 * GiB)
>> +#endif
>>
>
> The qemu process now takes up more virtual memory, about ~2.5GiB in my
> test, which can be expected with this change.
>
> Is it very likely that the TCG cache will be filled quickly and completely?
> I'm asking because I also use Qemu to do automated testing
> where the nodes are 64-bit but each have only 2GiB physical RAM.


Well, this is the interesting question and, as ever, it depends.

For system emulation the buffer will just slowly fill up over time until
exhausted, at which point it will flush and reset. Each time the guest
needs to flush a page and load fresh code in, we will generate more
translated code. If the guest isn't under load and never uses all its
RAM for code, then in theory the pages of the mmap that are never filled
never need to be actualised by the host kernel.

You can view the behaviour by running "info jit" from the HMP monitor in
your tests. The "TB Flush" value shows the number of times this has
happened, along with other information about translation state.
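For anyone wanting to reproduce this, a minimal way to reach the HMP monitor is to put it on stdio; the machine and kernel arguments below are illustrative placeholders, not taken from the thread:

```shell
# Illustrative invocation: expose the human monitor (HMP) on stdio.
# Replace the machine/kernel arguments with whatever your test boots.
qemu-system-arm -M cubieboard -kernel zImage -monitor stdio

# At the "(qemu)" prompt, inspect translation statistics:
#   (qemu) info jit
# The output includes translation buffer usage and a "TB flush count"
# showing how often the codegen buffer has been flushed and reset.
```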

>
> Regards,
> Niek
>
>>
>>  #define DEFAULT_CODE_GEN_BUFFER_SIZE \
>>    (DEFAULT_CODE_GEN_BUFFER_SIZE_1 < MAX_CODE_GEN_BUFFER_SIZE \
>> --
>> 2.20.1
>>



-- 
Alex Bennée
Alex Bennée Feb. 27, 2020, 12:31 p.m. | #4
Richard Henderson <richard.henderson@linaro.org> writes:

> On 2/26/20 10:10 AM, Alex Bennée wrote:
>> While 32mb is certainly usable a full system boot ends up flushing the
>> codegen buffer nearly 100 times. Increase the default on 64 bit hosts
>> to take advantage of all that spare memory. After this change I can
>> boot my tests system without any TB flushes.
>
>> +#if TCG_TARGET_REG_BITS == 32
>>  #define DEFAULT_CODE_GEN_BUFFER_SIZE_1 (32 * MiB)
>> +#else
>> +#define DEFAULT_CODE_GEN_BUFFER_SIZE_1 (2 * GiB)
>> +#endif
>
> This particular number, I'm not so sure about.
>
> It makes sense for a lone vm, running in system mode, on a large-ish host.
> It's more questionable for a large-ish host running many system mode vm's,
> although one can tune that from the command-line, so perhaps it's still ok.
>
> It does not make sense for a linux-user chroot, running make -jN, on just about
> any host.  For linux-user, I could be happy with a modest increase, but not all
> the way out to 2GiB.
>
> Discuss.


Does it matter that much? Surely for small programs the kernel just
never pages in the used portions of the mmap?

That said, does linux-user have a better idea of the size of the problem
set before we start running? Could we defer calling tcg_exec_init until
we have mapped in the main executable and then size based on that?

>
> r~



-- 
Alex Bennée
Richard Henderson Feb. 27, 2020, 12:56 p.m. | #5
On 2/27/20 4:31 AM, Alex Bennée wrote:
>> It does not make sense for a linux-user chroot, running make -jN, on just about
>> any host.  For linux-user, I could be happy with a modest increase, but not all
>> the way out to 2GiB.
>>
>> Discuss.
>
> Does it matter that much? Surely for small programs the kernel just
> never pages in the used portions of the mmap?


That's why I used the example of a build under the chroot, because the compiler
is not a small program.

Consider when the memory *is* used, and N * 2GB implies lots of paging, where
the previous N * 32MB did not.

I'm saying that we should consider a setting more like 128MB or so, since the
value cannot be changed from the command-line, or through the environment.


r~
Igor Mammedov Feb. 27, 2020, 2:13 p.m. | #6
On Thu, 27 Feb 2020 04:56:46 -0800
Richard Henderson <richard.henderson@linaro.org> wrote:

> On 2/27/20 4:31 AM, Alex Bennée wrote:
> >> It does not make sense for a linux-user chroot, running make -jN, on just about
> >> any host.  For linux-user, I could be happy with a modest increase, but not all
> >> the way out to 2GiB.
> >>
> >> Discuss.
> >
> > Does it matter that much? Surely for small programs the kernel just
> > never pages in the used portions of the mmap?
>
> That's why I used the example of a build under the chroot, because the compiler
> is not a small program.
>
> Consider when the memory *is* used, and N * 2GB implies lots of paging, where
> the previous N * 32MB did not.
>
> I'm saying that we should consider a setting more like 128MB or so, since the
> value cannot be changed from the command-line, or through the environment.


That's the value the BSD folks force tb-size to in order to speed up system emulation.

>
> r~
>
Niek Linnenbank Feb. 27, 2020, 7:01 p.m. | #7
Hi Alex,

On Thu, Feb 27, 2020 at 1:19 PM Alex Bennée <alex.bennee@linaro.org> wrote:

>
> Niek Linnenbank <nieklinnenbank@gmail.com> writes:
>
> > Hi Alex,
> >
> > On Wed, Feb 26, 2020 at 7:13 PM Alex Bennée <alex.bennee@linaro.org> wrote:
> >
> >> While 32mb is certainly usable a full system boot ends up flushing the
> >> codegen buffer nearly 100 times. Increase the default on 64 bit hosts
> >> to take advantage of all that spare memory. After this change I can
> >> boot my tests system without any TB flushes.
> >>
> >
> > That great, with this change I'm seeing a performance improvement when
> > running the avocado tests for cubieboard.
> > It runs about 4-5 seconds faster. My host is Ubuntu 18.04 on 64-bit.
> >
> > I don't know much about the internals of TCG nor how it actually uses the
> > cache, but it seems logical to me that increasing the cache size would
> > improve performance.
> >
> > What I'm wondering is: will this also result in TCG translating larger
> > chunks in one shot, so potentially taking more time to do the translation?
> > If so, could it perhaps affect more latency sensitive code?
>
> No - the size of the translation blocks is governed by the guest code
> and where it ends a basic block. In system mode we also care about
> crossing guest page boundaries.
>
> >> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
> >>
> > Tested-by: Niek Linnenbank <nieklinnenbank@gmail.com>
> >
> >> ---
> >>  accel/tcg/translate-all.c | 4 ++++
> >>  1 file changed, 4 insertions(+)
> >>
> >> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> >> index 4ce5d1b3931..f7baa512059 100644
> >> --- a/accel/tcg/translate-all.c
> >> +++ b/accel/tcg/translate-all.c
> >> @@ -929,7 +929,11 @@ static void page_lock_pair(PageDesc **ret_p1, tb_page_addr_t phys1,
> >>  # define MAX_CODE_GEN_BUFFER_SIZE  ((size_t)-1)
> >>  #endif
> >>
> >> +#if TCG_TARGET_REG_BITS == 32
> >>  #define DEFAULT_CODE_GEN_BUFFER_SIZE_1 (32 * MiB)
> >> +#else
> >> +#define DEFAULT_CODE_GEN_BUFFER_SIZE_1 (2 * GiB)
> >> +#endif
> >>
> >
> > The qemu process now takes up more virtual memory, about ~2.5GiB in my
> > test, which can be expected with this change.
> >
> > Is it very likely that the TCG cache will be filled quickly and completely?
> > I'm asking because I also use Qemu to do automated testing
> > where the nodes are 64-bit but each have only 2GiB physical RAM.
>
> Well so this is the interesting question and as ever it depends.
>
> For system emulation the buffer will just slowly fill-up over time until
> exhausted and which point it will flush and reset. Each time the guest
> needs to flush a page and load fresh code in we will generate more
> translated code. If the guest isn't under load and never uses all it's
> RAM for code then in theory the pages of the mmap that are never filled
> never need to be actualised by the host kernel.
>
> You can view the behaviour by running "info jit" from the HMP monitor in
> your tests. The "TB Flush" value shows the number of times this has
> happened along with other information about translation state.


Thanks for clarifying this, now it all starts to make more sense to me.

Regards,
Niek


>
> >
> > Regards,
> > Niek
> >
> >>
> >>  #define DEFAULT_CODE_GEN_BUFFER_SIZE \
> >>    (DEFAULT_CODE_GEN_BUFFER_SIZE_1 < MAX_CODE_GEN_BUFFER_SIZE \
> >> --
> >> 2.20.1
> >>
>
> --
> Alex Bennée
>



-- 
Niek Linnenbank
Niek Linnenbank Feb. 27, 2020, 7:07 p.m. | #8
Hi Richard,

On Thu, Feb 27, 2020 at 1:57 PM Richard Henderson <
richard.henderson@linaro.org> wrote:

> On 2/27/20 4:31 AM, Alex Bennée wrote:
> >> It does not make sense for a linux-user chroot, running make -jN, on just about
> >> any host.  For linux-user, I could be happy with a modest increase, but not all
> >> the way out to 2GiB.
> >>
> >> Discuss.
> >
> > Does it matter that much? Surely for small programs the kernel just
> > never pages in the used portions of the mmap?
>
> That's why I used the example of a build under the chroot, because the
> compiler is not a small program.
>
> Consider when the memory *is* used, and N * 2GB implies lots of paging,
> where the previous N * 32MB did not.

I agree that a lower default value probably is safer until we have more
proof that a larger value does not give any issues.

> I'm saying that we should consider a setting more like 128MB or so, since
> the value cannot be changed from the command-line, or through the environment.

Proposal: can we then introduce a new command line parameter for this?
Maybe in a new patch?
Since the size of the code generation buffer appears to have an impact on
performance, in my opinion it would make sense to make it configurable by
the user.

Regards,
Niek


>
> r~
>


-- 
Niek Linnenbank
Igor Mammedov Feb. 28, 2020, 7:47 a.m. | #9
On Thu, 27 Feb 2020 20:07:24 +0100
Niek Linnenbank <nieklinnenbank@gmail.com> wrote:

> Hi Richard,
>
> On Thu, Feb 27, 2020 at 1:57 PM Richard Henderson
> <richard.henderson@linaro.org> wrote:
>
> > On 2/27/20 4:31 AM, Alex Bennée wrote:
> > >> It does not make sense for a linux-user chroot, running make -jN, on just about
> > >> any host.  For linux-user, I could be happy with a modest increase, but not all
> > >> the way out to 2GiB.
> > >>
> > >> Discuss.
> > >
> > > Does it matter that much? Surely for small programs the kernel just
> > > never pages in the used portions of the mmap?
> >
> > That's why I used the example of a build under the chroot, because the
> > compiler is not a small program.
> >
> > Consider when the memory *is* used, and N * 2GB implies lots of paging,
> > where the previous N * 32MB did not.
>
> I agree that a lower default value probably is safer until we have more
> proof that a larger value does not give any issues.
>
> > I'm saying that we should consider a setting more like 128MB or so, since
> > the value cannot be changed from the command-line, or through the environment.
>
> Proposal: can we then introduce a new command line parameter for this?
> Maybe in a new patch?


linux-user currently uses a 32MiB static buffer, so it is probably fine
to leave it as is or bump it to 128MiB regardless of whether the host is
32- or 64-bit.

For system emulation, we already have the tb-size option to set a
user-specified buffer size.

The issue with system emulation is that it sizes the buffer to 1/4 of
ram_size, and the dependency on ram_size is what we are trying to get
rid of. If we consider unit/acceptance tests as the main target/user,
then they mostly use the default ram_size value, which varies mostly
from 16MiB to 1GiB depending on the board, so the buffer size used is
in the 4-256MiB range. Considering that the current CI runs fine with a
256MiB maximum buffer, it might make sense to use that as the new
heuristic: it would not regress our test infrastructure and might
improve performance for boards where a smaller default buffer was used.
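As a concrete illustration of the existing knob mentioned above (the invocation is a sketch; the disk image name is a placeholder), the translation buffer can already be sized explicitly via the TCG accelerator's tb-size property, which takes a value in MiB:

```shell
# Sketch: cap the TCG translation buffer at 256 MiB instead of the
# default ram_size/4 heuristic. "disk.img" is a placeholder image.
qemu-system-x86_64 -accel tcg,tb-size=256 -m 1G -drive file=disk.img,format=raw
```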


> Since the size of the code generation buffer appears to have an impact on

> performance,

> in my opinion it would make sense to make it configurable by the user.

> 

> Regards,

> 

> 

> >

> >

> > r~

> >

> >  

>
Alex Bennée Feb. 28, 2020, 7:20 p.m. | #10
Igor Mammedov <imammedo@redhat.com> writes:

> On Thu, 27 Feb 2020 20:07:24 +0100
> Niek Linnenbank <nieklinnenbank@gmail.com> wrote:
>
>> Hi Richard,
>>
>> On Thu, Feb 27, 2020 at 1:57 PM Richard Henderson
>> <richard.henderson@linaro.org> wrote:
>>
>> > On 2/27/20 4:31 AM, Alex Bennée wrote:
>> > >> It does not make sense for a linux-user chroot, running make -jN, on just about
>> > >> any host.  For linux-user, I could be happy with a modest increase, but not all
>> > >> the way out to 2GiB.
>> > >>
>> > >> Discuss.
>> > >
>> > > Does it matter that much? Surely for small programs the kernel just
>> > > never pages in the used portions of the mmap?
>> >
>> > That's why I used the example of a build under the chroot, because the
>> > compiler is not a small program.
>> >
>> > Consider when the memory *is* used, and N * 2GB implies lots of paging,
>> > where the previous N * 32MB did not.
>>
>> I agree that a lower default value probably is safer until we have more
>> proof that a larger value does not give any issues.
>>
>> > I'm saying that we should consider a setting more like 128MB or so, since
>> > the value cannot be changed from the command-line, or through the environment.
>>
>> Proposal: can we then introduce a new command line parameter for this?
>> Maybe in a new patch?
>
> linux-user currently uses 32Mb static buffer so it probably fine to
> leave it as is or bump it to 128Mb regardless of the 32/64bit host.
>
> for system emulation, we already have tb-size option to set user
> specified buffer size.
>
> Issue is with system emulation is that it sizes buffer to 1/4 of
> ram_size and dependency on ram_size is what we are trying to get
> rid of. If we consider unit/acceptance tests as main target/user,
> then they mostly use default ram_size value which varies mostly
> from 16Mb to 1Gb depending on the board. So used buffer size is
> in 4-256Mb range.
> Considering that current CI runs fine with max 256Mb buffer,
> it might make sense to use it as new heuristic which would not
> regress our test infrastructure and might improve performance
> for boards where smaller default buffer was used.


I've dropped it from 2GiB to 1GiB for system emulation. For the
acceptance tests I doubt we'll even fill the buffer, but the mmap memory
should overcommit fine.

>
>> Since the size of the code generation buffer appears to have an impact on
>> performance, in my opinion it would make sense to make it configurable by
>> the user.
>>
>> Regards,
>>
>> > r~



-- 
Alex Bennée

Patch

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 4ce5d1b3931..f7baa512059 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -929,7 +929,11 @@  static void page_lock_pair(PageDesc **ret_p1, tb_page_addr_t phys1,
 # define MAX_CODE_GEN_BUFFER_SIZE  ((size_t)-1)
 #endif
 
+#if TCG_TARGET_REG_BITS == 32
 #define DEFAULT_CODE_GEN_BUFFER_SIZE_1 (32 * MiB)
+#else
+#define DEFAULT_CODE_GEN_BUFFER_SIZE_1 (2 * GiB)
+#endif
 
 #define DEFAULT_CODE_GEN_BUFFER_SIZE \
   (DEFAULT_CODE_GEN_BUFFER_SIZE_1 < MAX_CODE_GEN_BUFFER_SIZE \