
[RFC,1/2] zap_pte_range: update addr when forcing flush after TLB batching failure

Message ID 1414496662-25202-2-git-send-email-will.deacon@arm.com
State Accepted
Commit ce9ec37bddb633404a0c23e1acb181a264e7f7f2

Commit Message

Will Deacon Oct. 28, 2014, 11:44 a.m. UTC
When unmapping a range of pages in zap_pte_range, the page being
unmapped is added to an mmu_gather_batch structure for asynchronous
freeing. If we run out of space in the batch structure before the range
has been completely unmapped, then we break out of the loop, force a
TLB flush and free the pages that we have batched so far. If there are
further pages to unmap, then we resume the loop where we left off.

Unfortunately, we forget to update addr when we break out of the loop,
which causes us to truncate the range being invalidated as the end
address is exclusive. When we re-enter the loop at the same address, the
page has already been freed and the pte_present test will fail, meaning
that we do not reconsider the address for invalidation.

This patch fixes the problem by incrementing addr by the PAGE_SIZE
before breaking out of the loop on batch failure.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 mm/memory.c | 1 +
 1 file changed, 1 insertion(+)
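
For reference, here is a simplified sketch of the batching loop in question
(illustrative only, not the exact mm/memory.c source), showing where the
invalidation range gets truncated and why addr has to be advanced before the
early break:

	int force_flush = 0;

again:
	do {
		pte_t ptent = *pte;

		if (pte_present(ptent)) {
			struct page *page = vm_normal_page(vma, addr, ptent);

			ptent = ptep_get_and_clear_full(mm, addr, pte,
							tlb->fullmm);
			tlb_remove_tlb_entry(tlb, pte, addr);

			/* queue the page for deferred freeing */
			if (unlikely(!__tlb_remove_page(tlb, page))) {
				force_flush = 1;
				addr += PAGE_SIZE;	/* this patch */
				break;
			}
			continue;
		}
		/* ... non-present entries handled here ... */
	} while (pte++, addr += PAGE_SIZE, addr != end);

	if (force_flush) {
		unsigned long old_end = tlb->end;

		/*
		 * Flush only what has been unmapped so far, i.e. the range
		 * ending at 'addr' (exclusive).  If addr is not advanced
		 * past the page that overflowed the batch, that page falls
		 * outside this range, and the resumed walk never reconsiders
		 * it because its pte has already been cleared.
		 */
		tlb->end = addr;
		tlb_flush_mmu(tlb);
		tlb->start = addr;
		tlb->end = old_end;

		force_flush = 0;
		if (addr != end)
			goto again;
	}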

Comments

Linus Torvalds Oct. 28, 2014, 3:30 p.m. UTC | #1
On Tue, Oct 28, 2014 at 4:44 AM, Will Deacon <will.deacon@arm.com> wrote:
>
> This patch fixes the problem by incrementing addr by the PAGE_SIZE
> before breaking out of the loop on batch failure.

This patch seems harmless and right, unlike the other one.

I'd be ok with changing the *generic* code to do the "update start/end
pointers in the mmu_gather structure", but then it really has to be
done in *generic* code. Move your arm64 start/end updates to
include/asm-generic/tlb.h, and change the initialization of start/end
entirely. Right now we initialize those things to the maximal range
(see tlb_gather_mmu()), and the arm64 logic seems to be to initialize
them to TASK_SIZE/0 respectively and then update start/end as you add
pages, so that you get the minimal range.
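
Roughly something like this in include/asm-generic/tlb.h (just a sketch to
show the shape of it -- the helper names are made up here, and a real
conversion would also have to handle fullmm and the existing arch special
cases):

	/* start from an empty range; also used to reset after each flush */
	static inline void __tlb_reset_range(struct mmu_gather *tlb)
	{
		tlb->start = TASK_SIZE;
		tlb->end   = 0;
	}

	/* shrink-wrap the range as entries are queued for invalidation */
	static inline void __tlb_adjust_range(struct mmu_gather *tlb,
					      unsigned long addr,
					      unsigned long size)
	{
		tlb->start = min(tlb->start, addr);
		tlb->end   = max(tlb->end, addr + size);
	}

	#define tlb_remove_tlb_entry(tlb, ptep, address)		\
		do {							\
			__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
			__tlb_remove_tlb_entry(tlb, ptep, address);	\
		} while (0)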

But because of this arm64 confusion (the "minimal range" really is
*not* how this is designed to work), the current existing
tlb_gather_mmu() does the wrong initialization.

In other words: my argument is that right now the arm64 code is just
*wrong*. I'd be ok with making it the right thing to do, but if so, it
needs to be made generic.

Are there any actual advantages to the whole "minimal range" model? It
adds overhead and complexity, and the only case where it would seem to
be worth it is for benchmarks that do mmap/munmap in a loop and then
only map a single page. Normal loads don't tend to have those kinds of
"minimal range is very different from whole range" cases. Do you have
a real case for why it does that minimal range thing?

Because if you don't have a real case, I'd really suggest you get rid
of all the arm64 games with start/end. That still leaves this one
patch the correct thing to do (because even without the start/end
games the "need_flush" flag goes with the range), but it makes the
second patch a complete non-issue.

If you *do* have a real case, I think you need to modify your second patch to:

 - move the arm start/end updates from tlb_flush/tlb_add_flush to asm-generic

 - make tlb_gather_mmu() initialize start/end to TASK_SIZE/0 the same
way your tlb_flush does (so that the subsequent min/max games work).

so that *everybody* does the start/end games. I'd be ok with that.
What I'm not ok with is arm64 using the generic TLB gather way in odd
ways that then breaks code and results in things like your 2/2 patch
that fixes ARM64 but breaks x86.

Hmm? Is there something I'm missing?

                         Linus
Will Deacon Oct. 28, 2014, 4:07 p.m. UTC | #2
Hi Linus,

On Tue, Oct 28, 2014 at 03:30:43PM +0000, Linus Torvalds wrote:
> On Tue, Oct 28, 2014 at 4:44 AM, Will Deacon <will.deacon@arm.com> wrote:
> > This patch fixes the problem by incrementing addr by the PAGE_SIZE
> > before breaking out of the loop on batch failure.
> 
> This patch seems harmless and right, unlike the other one.

Yes; the other patch was more of a discussion point, as I was really
struggling to solve the problem without changing arch code. It was also
broken, as you discovered.

> I'd be ok with changing the *generic* code to do the "update start/end
> pointers in the mmu_gather structure", but then it really has to be
> done in *generic* code. Move your arm64 start/end updates to
> include/asm-generic/tlb.h, and change the initialization of start/end
> entirely. Right now we initialize those things to the maximal range
> (see tlb_gather_mmu()), and the arm64 logic seems to be to initialize
> them to TASK_SIZE/0 respectively and then update start/end as you add
> pages, so that you get the minimal range.
> 
> But because of this arm64 confusion (the "minimal range" really is
> *not* how this is designed to work), the current existing
> tlb_gather_mmu() does the wrong initialization.
> 
> In other words: my argument is that right now the arm64 code is just
> *wrong*. I'd be ok with making it the right thing to do, but if so, it
> needs to be made generic.

Ok, that's useful, thanks. Out of curiosity, what *is* the current intention
of __tlb_remove_tlb_entry, if start/end shouldn't be touched by
architectures? Is it just for the PPC hash thing?

> Are there any actual advantages to the whole "minimal range" model? It
> adds overhead and complexity, and the only case where it would seem to
> be worth it is for benchmarks that do mmap/munmap in a loop and then
> only map a single page. Normal loads don't tend to have those kinds of
> "minimal range is very different from whole range" cases. Do you have
> a real case for why it does that minimal range thing?

I was certainly seeing this issue trigger regularly when running firefox,
but I'll need to dig and find out the differences in range size.

> Because if you don't have a real case, I'd really suggest you get rid
> of all the arm64 games with start/end. That still leaves this one
> patch the correct thing to do (because even without the start/end
> games the "need_flush" flag goes with the range), but it makes the
> second patch a complete non-issue.

Since we have hardware broadcasting of TLB invalidations on ARM, it is
in our interest to keep the number of outstanding operations as small as
possible, particularly on large systems where we don't get the targeted
shootdown with a single message that you can perform using IPIs (i.e.
you can only broadcast to all or no CPUs, and that happens for each pte).

Now, what I'd actually like to do is to keep track of discrete ranges,
so that we can avoid sending a TLB invalidation for each PAGE_SIZE region
of a huge-page mapping. That information is lost with our current minimal
range scheme and it was whilst I was thinking about this that I noticed
we were getting passed negative ranges with the current implementation.

> If you *do* have a real case, I think you need to modify your second patch to:
> 
>  - move the arm start/end updates from tlb_flush/tlb_add_flush to asm-generic
> 
>  - make tlb_gather_mmu() initialize start/end to TASK_SIZE/0 the same
> way your tlb_flush does (so that the subsequent min/max games work).
> 
> so that *everybody* does the start/end games. I'd be ok with that.

I'll try and come up with something which addresses the above, and we can
go from there.

> What I'm not ok with is arm64 using the generic TLB gather way in odd
> ways that then breaks code and results in things like your 2/2 patch
> that fixes ARM64 but breaks x86.

Understood, that certainly wasn't my intention.

Cheers,

Will
Linus Torvalds Oct. 28, 2014, 4:25 p.m. UTC | #3
On Tue, Oct 28, 2014 at 9:07 AM, Will Deacon <will.deacon@arm.com> wrote:
>
> Ok, that's useful, thanks. Out of curiosity, what *is* the current intention
> of __tlb_remove_tlb_entry, if start/end shouldn't be touched by
> architectures? Is it just for the PPC hash thing?

I think it's both the PPC hash, and for "legacy reasons" (ie
architectures that don't use the generic code, and were converted from
the "invalidate as you walk the tables" without ever really fixing the
"you have to flush the TLB before you free the page, and do
batching").

It would be lovely if we could just drop it entirely, although
changing it to actively do the minimal range is fine too.

> I was certainly seeing this issue trigger regularly when running firefox,
> but I'll need to dig and find out the differences in range size.

I'm wondering whether that was perhaps because of the mix-up with
initialization of the range. Afaik, that would always break your
min/max thing for the first batch (and since the batches are fairly
large, "first" may be "only")

But hey, it's possible that firefox does some big mappings but only
populates the beginning. Most architectures don't tend to have
excessive glass jaws in this area: invalidating things page-by-page is
invariably so slow that at some point you just go "just do the whole
range".

> Since we have hardware broadcasting of TLB invalidations on ARM, it is
> in our interest to keep the number of outstanding operations as small as
> possible, particularly on large systems where we don't get the targetted
> shootdown with a single message that you can perform using IPIs (i.e.
> you can only broadcast to all or no CPUs, and that happens for each pte).

Do you seriously *have* to broadcast for each pte?

Because that is quite frankly moronic.  We batch things up in software
for a real good reason: doing things one entry at a time just cannot
ever scale. At some point (and that point is usually not even very far
away), it's much better to do a single invalidate over a range. The
cost of having to refill the TLB's is *much* smaller than the cost of
doing tons of cross-CPU invalidates.

That's true even for the cases where we track the CPU's involved in
that mapping, and only invalidate a small subset. With an "all CPU's
broadcast", the cross-over point must be even smaller. Doing thousands
of CPU broadcasts is just crazy, even if they are hw-accelerated.

Can't you just do a full invalidate and a SW IPI for larger ranges?

And as mentioned, true sparse mappings are actually fairly rare, so
making extra effort (and data structures) to have individual ranges
sounds crazy.

Is this some hw-enforced thing? You really can't turn off the
cross-cpu-for-each-pte braindamage?

                         Linus
Will Deacon Oct. 28, 2014, 5:07 p.m. UTC | #4
On Tue, Oct 28, 2014 at 04:25:35PM +0000, Linus Torvalds wrote:
> On Tue, Oct 28, 2014 at 9:07 AM, Will Deacon <will.deacon@arm.com> wrote:
> > I was certainly seeing this issue trigger regularly when running firefox,
> > but I'll need to dig and find out the differences in range size.
> 
> I'm wondering whether that was perhaps because of the mix-up with
> initialization of the range. Afaik, that would always break your
> min/max thing for the first batch (and since the batches are fairly
> large, "first" may be "only")
> 
> But hey, it's possible that firefox does some big mappings but only
> populates the beginning. Most architectures don't tend to have
> excessive glass jaws in this area: invalidating things page-by-page is
> invariably so slow that at some point you just go "just do the whole
> range".
> 
> > Since we have hardware broadcasting of TLB invalidations on ARM, it is
> > in our interest to keep the number of outstanding operations as small as
> possible, particularly on large systems where we don't get the targeted
> > shootdown with a single message that you can perform using IPIs (i.e.
> > you can only broadcast to all or no CPUs, and that happens for each pte).
> 
> Do you seriously *have* to broadcast for each pte?
> 
> Because that is quite frankly moronic.  We batch things up in software
> for a real good reason: doing things one entry at a time just cannot
> ever scale. At some point (and that point is usually not even very far
> away), it's much better to do a single invalidate over a range. The
> cost of having to refill the TLB's is *much* smaller than the cost of
> doing tons of cross-CPU invalidates.

I don't think that's necessarily true, at least not on the systems I'm
familiar with. A table walk can be comparatively expensive, particularly
when virtualisation is involved and the depth of the host and guest page
tables starts to grow -- we're talking >20 memory accesses per walk. By
contrast, the TLB invalidation messages are asynchronous and carried on
the interconnect (a DSB instruction is used to synchronise the updates).
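(As a rough worked example -- and the 4-level figure is an assumption here,
not a statement about any particular configuration -- with 4-level tables at
both stage 1 and stage 2, each of the four guest page-table loads needs its
own 4-level stage-2 walk plus the load itself, and the final output address
needs one more stage-2 walk: 4 * (4 + 1) + 4 = 24 table accesses for a walk
that misses at every level.)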

> That's true even for the cases where we track the CPU's involved in
> that mapping, and only invalidate a small subset. With an "all CPU's
> broadcast", the cross-over point must be even smaller. Doing thousands
> of CPU broadcasts is just crazy, even if they are hw-accelerated.
> 
> Can't you just do a full invalidate and a SW IPI for larger ranges?

We already do that, but it's mainly there to catch *really* large ranges
(like the negative ones...), which can trigger the soft lockup detector.
The cases we've seen for this so far have been bugs (e.g. this thread and
also a related issue where we try to flush the whole of vmalloc space).

> And as mentioned, true sparse mappings are actually fairly rare, so
> making extra effort (and data structures) to have individual ranges
> sounds crazy.

Sure, I'll try and get some data on this. I'd like to resolve the THP case,
at least, which means keeping track of calls to __tlb_remove_pmd_tlb_entry.

> Is this some hw-enforced thing? You really can't turn off the
> cross-cpu-for-each-pte braindamage?

We could use IPIs if we wanted to and issue local TLB invalidations on
the targeted cores, but I'd be surprised if this showed an improvement
on ARM-based systems.

Will
Linus Torvalds Oct. 28, 2014, 6:03 p.m. UTC | #5
On Tue, Oct 28, 2014 at 10:07 AM, Will Deacon <will.deacon@arm.com> wrote:
>
> I don't think that's necessarily true, at least not on the systems I'm
> familiar with. A table walk can be comparatively expensive, particularly
> when virtualisation is involved and the depth of the host and guest page
> tables starts to grow -- we're talking >20 memory accesses per walk. By
> contrast, the TLB invalidation messages are asynchronous and carried on
> the interconnect (a DSB instruction is used to synchronise the updates).

">20 memory accesses per *walk*"? Isn't the ARM a regular table? So
once you've gone down to the pte level, it's just an array, regardless
of how many levels there are.

But I guess there are no actual multi-socket ARM's around in real
life, so you probably don't see the real scaling costs. Within a die,
you're probably right that the overhead is negligible.

                   Linus
Benjamin Herrenschmidt Oct. 28, 2014, 9:16 p.m. UTC | #6
On Tue, 2014-10-28 at 09:25 -0700, Linus Torvalds wrote:

> > Since we have hardware broadcasting of TLB invalidations on ARM, it is
> > in our interest to keep the number of outstanding operations as small as
> > possible, particularly on large systems where we don't get the targeted
> > shootdown with a single message that you can perform using IPIs (i.e.
> > you can only broadcast to all or no CPUs, and that happens for each pte).
> 
> Do you seriously *have* to broadcast for each pte?

We do too, in current CPUs at least, it's sad ...

> Because that is quite frankly moronic.  We batch things up in software
> for a real good reason: doing things one entry at a time just cannot
> ever scale. At some point (and that point is usually not even very far
> away), it's much better to do a single invalidate over a range. The
> cost of having to refill the TLB's is *much* smaller than the cost of
> doing tons of cross-CPU invalidates.
> 
> That's true even for the cases where we track the CPU's involved in
> that mapping, and only invalidate a small subset. With an "all CPU's
> broadcast", the cross-over point must be even smaller. Doing thousands
> of CPU broadcasts is just crazy, even if they are hw-accelerated.
> 
> Can't you just do a full invalidate and a SW IPI for larger ranges?

For us, this would be great except ... we can potentially have other
agents with an MMU that only support snooping of the broadcasts...

> And as mentioned, true sparse mappings are actually fairly rare, so
> making extra effort (and data structures) to have individual ranges
> sounds crazy.
> 
> Is this some hw-enforced thing? You really can't turn off the
> cross-cpu-for-each-pte braindamage?

Cheers,
Ben.

>                          Linus
Linus Torvalds Oct. 28, 2014, 9:32 p.m. UTC | #7
On Tue, Oct 28, 2014 at 2:16 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>>
>> Can't you just do a full invalidate and a SW IPI for larger ranges?
>
> For us, this would be great except ... we can potentially have other
> agents with an MMU that only support snooping of the broadcasts...

Ugh. Oh well. I guess on power you need to walk all the hashed entries
individually _anyway_, so there's no way you could really use a
range-based or "invalidate all" model to avoid some of the work.

                  Linus
Linus Torvalds Oct. 28, 2014, 9:40 p.m. UTC | #8
On Tue, Oct 28, 2014 at 8:30 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Oct 28, 2014 at 4:44 AM, Will Deacon <will.deacon@arm.com> wrote:
>>
>> This patch fixes the problem by incrementing addr by the PAGE_SIZE
>> before breaking out of the loop on batch failure.
>
> This patch seems harmless and right [..]

I've applied it (commit ce9ec37bddb6), and marked it for stable.

I think that bug has been around since at least commit 2b047252d087
("Fix TLB gather virtual address range invalidation corner cases")
which went into 3.11, but that in turn was also marked for
stable, so I'm not sure just how far back this fix needs to go. I
suspect the simple answer is "as far back as it applies" ;)

I'll wait and see what you'll do about the other patch.

                 Linus

Patch

diff --git a/mm/memory.c b/mm/memory.c
index 1cc6bfbd872e..3e503831e042 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1147,6 +1147,7 @@  again:
 				print_bad_pte(vma, addr, ptent, page);
 			if (unlikely(!__tlb_remove_page(tlb, page))) {
 				force_flush = 1;
+				addr += PAGE_SIZE;
 				break;
 			}
 			continue;