diff mbox

[RFC,1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure

Message ID 20141029194738.GA29911@arm.com
State New
Headers show

Commit Message

Will Deacon Oct. 29, 2014, 7:47 p.m. UTC
On Tue, Oct 28, 2014 at 09:40:35PM +0000, Linus Torvalds wrote:
> On Tue, Oct 28, 2014 at 8:30 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Tue, Oct 28, 2014 at 4:44 AM, Will Deacon <will.deacon@arm.com> wrote:
> >>
> >> This patch fixes the problem by incrementing addr by the PAGE_SIZE
> >> before breaking out of the loop on batch failure.
> >
> > This patch seems harmless and right [..]
> 
> I've applied it (commit ce9ec37bddb6), and marked it for stable.
> 
> I think that bug has been around since at least commit 2b047252d087
> ("Fix TLB gather virtual address range invalidation corner cases")
> which went into 3.11, but that has in turn then was also marked for
> stable, so I'm not sure just how far back this fix needs to go. I
> suspect the simple answer is "as far back as it applies" ;)

Thanks for that.

> I'll wait and see what you'll do about the other patch.

I've cooked up something (see below), but it unfortunately makes a couple
of minor changes to powerpc and microblaze to address redefinitions of
some of the gather callbacks (tlb{start,end}vma, __tlb_remove_tlb_entry).

On the plus side, it tidies up the force_flush case in zap_pte_range
quite nicely (assuming I didn't screw it up again).

Cheers,

Will

--->8

commit f51dd616639dfbfe0685c82b47e0f31e4a34f16b
Author: Will Deacon <will.deacon@arm.com>
Date:   Wed Oct 29 10:03:09 2014 +0000

    mmu_gather: move minimal range calculations into generic code
    
    On architectures with hardware broadcasting of TLB invalidation messages
    , it makes sense to reduce the range of the mmu_gather structure when
    unmapping page ranges based on the dirty address information passed to
    tlb_remove_tlb_entry.
    
    arm64 already does this by directly manipulating the start/end fields
    of the gather structure, but this confuses the generic code which
    does not expect these fields to change and can end up calculating
    invalid, negative ranges when forcing a flush in zap_pte_range.
    
    This patch moves the minimal range calculation out of the arm64 code
    and into the generic implementation, simplifying zap_pte_range in the
    process (which no longer needs to care about start/end, since they will
    point to the appropriate ranges already).
    
    Signed-off-by: Will Deacon <will.deacon@arm.com>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Comments

Linus Torvalds Oct. 29, 2014, 9:11 p.m. UTC | #1
On Wed, Oct 29, 2014 at 12:47 PM, Will Deacon <will.deacon@arm.com> wrote:
>
> I've cooked up something (see below), but it unfortunately makes a couple
> of minor changes to powerpc and microblaze to address redefinitions of
> some of the gather callbacks (tlb{start,end}vma, __tlb_remove_tlb_entry).
>
> On the plus side, it tidies up the force_flush case in zap_pte_range
> quite nicely (assuming I didn't screw it up again).

Yes, this looks fine to me. Looks like a good cleanup, and moves more
code to generic headers rather than having arch-specific stuff.

Ben, can you check that this is ok on powerpc? Who else should
double-check this due to having been involved in tlb flushing? But I
think this is good to go, modulo checking other architectures for
sanity.

                    Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Benjamin Herrenschmidt Oct. 29, 2014, 9:27 p.m. UTC | #2
On Wed, 2014-10-29 at 14:11 -0700, Linus Torvalds wrote:
> On Wed, Oct 29, 2014 at 12:47 PM, Will Deacon <will.deacon@arm.com> wrote:
> >
> > I've cooked up something (see below), but it unfortunately makes a couple
> > of minor changes to powerpc and microblaze to address redefinitions of
> > some of the gather callbacks (tlb{start,end}vma, __tlb_remove_tlb_entry).
> >
> > On the plus side, it tidies up the force_flush case in zap_pte_range
> > quite nicely (assuming I didn't screw it up again).
> 
> Yes, this looks fine to me. Looks like a good cleanup, and moves more
> code to generic headers rather than having arch-specific stuff.
> 
> Ben, can you check that this is ok on powerpc? Who else should
> double-check this due to having been involved in tlb flushing? But I
> think this is good to go, modulo checking other architectures for
> sanity.

TLB flushing is only me I think, I'll engage my brain after breakfast
and see if is all good. The difficulty for us is that SW loaded TLB,
hash32 and hash64 are all very different and my brain's been good at
swapping a lot of that stuff out lately...

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Linus Torvalds Nov. 1, 2014, 5:01 p.m. UTC | #3
On Wed, Oct 29, 2014 at 2:27 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> TLB flushing is only me I think, I'll engage my brain after breakfast
> and see if is all good

Ping? Breakfast is either long over, of you're starting to look a bit
like Mr Creosote...

Anyway, Will, I assume this is not a correctness issue for you, just
an annoying performance issue. Right? Or is there actually some issue
with the actual range not being set to be sufficiently large?

Also, it strikes me that I *think* that you might be able to extend
your patch to remove the whole "need_flush" field, since as far as I
can tell, "tlb->need_flush" is now equivalent to "tlb->start <
tlb->end". Of course, as long as we still require that
"need_flush_all", that doesn't actually save us any space, so maybe
it's not worth changing.

                       Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Benjamin Herrenschmidt Nov. 1, 2014, 8:25 p.m. UTC | #4
On Sat, 2014-11-01 at 10:01 -0700, Linus Torvalds wrote:
> On Wed, Oct 29, 2014 at 2:27 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> >
> > TLB flushing is only me I think, I'll engage my brain after breakfast
> > and see if is all good
> 
> Ping? Breakfast is either long over, of you're starting to look a bit
> like Mr Creosote...

Argh... dropped that ball.

> Anyway, Will, I assume this is not a correctness issue for you, just
> an annoying performance issue. Right? Or is there actually some issue
> with the actual range not being set to be sufficiently large?

It should be fine for us in term of correctness I think. We rely on the
lazy mmu bits for batching/flushing on hash64, we use
__tlb_remove_tlb_entry() for immediate flush on hash32 and the SW loaded
TLB cases are pretty dumb here and should be generally unaffected.

> Also, it strikes me that I *think* that you might be able to extend
> your patch to remove the whole "need_flush" field, since as far as I
> can tell, "tlb->need_flush" is now equivalent to "tlb->start <
> tlb->end". Of course, as long as we still require that
> "need_flush_all", that doesn't actually save us any space, so maybe
> it's not worth changing.

Cheers,
Ben.

>                        Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Will Deacon Nov. 3, 2014, 5:56 p.m. UTC | #5
On Sat, Nov 01, 2014 at 05:01:30PM +0000, Linus Torvalds wrote:
> On Wed, Oct 29, 2014 at 2:27 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> >
> > TLB flushing is only me I think, I'll engage my brain after breakfast
> > and see if is all good
> 
> Ping? Breakfast is either long over, of you're starting to look a bit
> like Mr Creosote...

Wafer thin patch?

> Anyway, Will, I assume this is not a correctness issue for you, just
> an annoying performance issue. Right? Or is there actually some issue
> with the actual range not being set to be sufficiently large?

Yeah, it's just a performance issue. For ranges over 1k pages, we end up
flushing the whole TLB.

> Also, it strikes me that I *think* that you might be able to extend
> your patch to remove the whole "need_flush" field, since as far as I
> can tell, "tlb->need_flush" is now equivalent to "tlb->start <
> tlb->end". Of course, as long as we still require that
> "need_flush_all", that doesn't actually save us any space, so maybe
> it's not worth changing.

We use `tlb->end > 0' in the arm64 backend, so I think you're right. I'll
take a look in the name of cleanup and this can wait until 3.19.

Will
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Linus Torvalds Nov. 3, 2014, 6:05 p.m. UTC | #6
On Mon, Nov 3, 2014 at 9:56 AM, Will Deacon <will.deacon@arm.com> wrote:
>
> We use `tlb->end > 0' in the arm64 backend, so I think you're right. I'll
> take a look in the name of cleanup and this can wait until 3.19.

Ok, I'll just archive this thread for now, and expect a patch at some
future date. And as far as I'm concerned, it's ok if it just comes in
through the normal arm64 tree, with just a note in the pull request
reminding me about why it also touches some other architectures.

Thanks,

                     Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Catalin Marinas Nov. 4, 2014, 2:29 p.m. UTC | #7
(catching up with email after holiday)

On Wed, Oct 29, 2014 at 07:47:39PM +0000, Will Deacon wrote:
>     mmu_gather: move minimal range calculations into generic code
>     
>     On architectures with hardware broadcasting of TLB invalidation messages
>     , it makes sense to reduce the range of the mmu_gather structure when
>     unmapping page ranges based on the dirty address information passed to
>     tlb_remove_tlb_entry.
>     
>     arm64 already does this by directly manipulating the start/end fields
>     of the gather structure, but this confuses the generic code which
>     does not expect these fields to change and can end up calculating
>     invalid, negative ranges when forcing a flush in zap_pte_range.
>     
>     This patch moves the minimal range calculation out of the arm64 code
>     and into the generic implementation, simplifying zap_pte_range in the
>     process (which no longer needs to care about start/end, since they will
>     point to the appropriate ranges already).
>     
>     Signed-off-by: Will Deacon <will.deacon@arm.com>

Nice to see this clean-up for arm64, however I have a question below.

> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 5672d7ea1fa0..340bc5c5ca2d 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -128,6 +128,46 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
>  		tlb_flush_mmu(tlb);
>  }
>  
> +static inline void __tlb_adjust_range(struct mmu_gather *tlb,
> +				      unsigned long address)
> +{
> +	if (!tlb->fullmm) {
> +		tlb->start = min(tlb->start, address);
> +		tlb->end = max(tlb->end, address + PAGE_SIZE);
> +	}
> +}

Here __tlb_adjust_range() assumes end to be (start + PAGE_SIZE) always.

> @@ -152,12 +193,14 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
>  #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)		\
>  	do {							\
>  		tlb->need_flush = 1;				\
> +		__tlb_adjust_range(tlb, address);		\
>  		__tlb_remove_pmd_tlb_entry(tlb, pmdp, address);	\
>  	} while (0)

[...]

>  #define pmd_free_tlb(tlb, pmdp, address)			\
>  	do {							\
>  		tlb->need_flush = 1;				\
> +		__tlb_adjust_range(tlb, address);		\
>  		__pmd_free_tlb(tlb, pmdp, address);		\
>  	} while (0)

This would work on arm64 but is the PAGE_SIZE range enough for all
architectures even when we flush a huge page or a pmd/pud table entry?
The approach Peter Z took with his patches was to use pmd_addr_end(addr,
TASK_SIZE) and change __tlb_adjust_range() to take start/end arguments:

https://lkml.org/lkml/2011/3/7/302
Linus Torvalds Nov. 4, 2014, 4:08 p.m. UTC | #8
On Tue, Nov 4, 2014 at 6:29 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> This would work on arm64 but is the PAGE_SIZE range enough for all
> architectures even when we flush a huge page or a pmd/pud table entry?

It pretty much had *better* be.

For things like page tables caches (ie caching addresses "inside" the
page tables, like x86 does), for legacy reasons, flushing an
individual page had better flush the page table caches behind it. This
is definitely how x86 works, for example. And if you have an
architected non-legacy page table cache (which I'm not aware of
anybody actually doing), you're going to have some architecturally
explicit flushing for that, likely *separate* from a regular TLB entry
flush, and thus you'd need more than just some range expansion..

And the logic is very similar for things like hugepages. Either a
normal "TLB invalidate" insutrction anywhere in the hugepage will
invalidate the whole hugepage), or you would have special instructions
or rules for invalidating hugepages and you'd need more than just some
range expansion.

So in neither case does it make sense to expand the range, afaik. And
it would hurt normal architectures. So if we ever find an architecture
that would want something that odd, I think it is up to that
architecture to do its own odd thing, not cause pain for others.

                          Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Catalin Marinas Nov. 6, 2014, 1:57 p.m. UTC | #9
On Tue, Nov 04, 2014 at 04:08:27PM +0000, Linus Torvalds wrote:
> On Tue, Nov 4, 2014 at 6:29 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
> >
> > This would work on arm64 but is the PAGE_SIZE range enough for all
> > architectures even when we flush a huge page or a pmd/pud table entry?
> 
> It pretty much had *better* be.

Thanks for confirming.

> For things like page tables caches (ie caching addresses "inside" the
> page tables, like x86 does), for legacy reasons, flushing an
> individual page had better flush the page table caches behind it. This
> is definitely how x86 works, for example. And if you have an
> architected non-legacy page table cache (which I'm not aware of
> anybody actually doing), you're going to have some architecturally
> explicit flushing for that, likely *separate* from a regular TLB entry
> flush, and thus you'd need more than just some range expansion..

On arm64 we have two types of TLB invalidation instructions, the
standard one which flushes a pte entry together with the corresponding
upper level page table cache and a "leaf" operation only for the pte. We
don't use the latter in Linux (yet) but in theory it's more efficient.

Anyway, even without special "leaf" operations, it would be useful to
make the distinction between unmap_vmas() and free_pgtables() with
regards to the ranges tracked by mmu_gather. For the former, tlb_flush()
needs to flush the range in PAGE_SIZE increments (assuming a mix of
small and huge pages). For the latter, PMD_SIZE increments would be
enough.

With RCU_TABLE_FREE, I think checking tlb->local.next would do the trick
but for x86 we can keep mmu_gather.need_flush only for pte clearing
and remove need_flush = 1 from p*_free_tlb() functions. The arch
specific tlb_flush() can take need_flush into account to change the
range flushing increment or even ignore the second tlb_flush() triggered
by tlb_finish_mmu() (after free_pgtables(), the ptes have been flushed
via tlb_end_vma()).
Linus Torvalds Nov. 6, 2014, 5:53 p.m. UTC | #10
On Thu, Nov 6, 2014 at 5:57 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> Anyway, even without special "leaf" operations, it would be useful to
> make the distinction between unmap_vmas() and free_pgtables() with
> regards to the ranges tracked by mmu_gather. For the former, tlb_flush()
> needs to flush the range in PAGE_SIZE increments (assuming a mix of
> small and huge pages). For the latter, PMD_SIZE increments would be
> enough.

Why woyuld you *ever* care about the increments?

Quite frankly, I think even the PAGE_SIZE thing is just (a) stupid and
(b) misleading.

It might actually be better to instead of

    tlb->end = max(tlb->end, address + PAGE_SIZE);

it might as well be a simpler

    tlb->end = max(tlb->end, address+1)

(Even the "+1" is kind of unnecessary, but it's there to distinguish
the zero address from the initial condition).

The thing is, absolutely *no* architecture has a TLB flush that
involves giving the start and end of the page you want to flush. None.
Nada. Zilch. Trying to think of it as a "range of bytes I need to
flush" is wrong. And it's misleading, and it makes people think in
stupid ways like "I should change PAGE_SIZE to PMD_SIZE when I flush a
PMD". That's *wrong*.

Every single TLB flush is about a variation of "flush the TLB entry
that contains the mapping for this address". The whole "add page-size"
is pointless, because the *flushing* is not about giving a page-sized
range, it's about giving *one* address in that page.

Using "address+1" is actually equally descriptive of what the range is
("flush the tlb entry that maps this *byte*"), but is less amenable to
the above kind of silly PMD_SIZE confusion. Because the exact same
thing that is true for a single page is true for a page table
operation too. There is just a *single* page table directory entry
cache, it's not like you have one page table directory entry cache for
each page.

So what matters for the non-leaf operations is not size. Not AT ALL.
It's a totally different operation, and you'd need not a different
size, but a different flag entirely - the same way we already have a
different flag for the "full_mm" case. And it's actually for exactly
the same reason as "full_mm": you do the flush itself differently,
it's not that the range is different. If it was just about the range,
then "full_mm" would just initialize the range to the whole VM. But it
*isn't* about the range. It's about the fact that a full-MM tear-down
can fundamentally do different operations, knowing that there are no
other threads using that VM any more.

So I really really object to playing games with PMD_SIZE or other
idiocies, because it fundamentally mis-states the whole problem.

If ARM64 wants to make the "lead vs non-leaf" TLB operation, then you
need to add a new flag, and you just set that flag when you tear down
a page table (as opposed to just a PTE).

It doesn't affect the "range" at all in any way as far as I can tell.

> With RCU_TABLE_FREE, I think checking tlb->local.next would do the trick
> but for x86 we can keep mmu_gather.need_flush only for pte clearing
> and remove need_flush = 1 from p*_free_tlb() functions.

This is more confusion about what is going on.

I'd actually really really prefer to have the "need_flush = 1" for the
page table tear-down case even for x86. No, if you never removed any
PTE at all, it is possible that it's not actually needed because an
x86 CPU isn't supposed to cache non-present page table entries (so if
you could tear down the page tables because there were no entries,
there should be no TLB entries, and there *hopefully* should be no
caches of mid-level page tables either that need a TLB invalidate).
But in practice, I'd not take that chance. If you tear down a page
table, you should flush the TLB in that range (and note how I say *in*
that range - an invalidate anywhere in the range should be sufficient,
not "over the whole range"!), because quite frankly, from an
implementation standpoint, I really think it's the sane and safe thing
to do.

So I would suggest you think of the x86 invlpg instruction as your
"non-leaf invalidate". The same way you'd want to do non-leaf
invalidate whenever you tear down a page table, you'd do "invlpg" on
x86.

And no, we should *not* play games with "tlb->local.next". That just
sounds completely and utterly insane. That's a hack, it's unclear,
it's stupid, and it's connected to a totally irrelevant implementation
detail, namely that random RCU freeing.

Set a flag, for chrissake. Just say "when you free a pmd/pud/pgd, set
tlb->need_flush_inner to let the flusher know" (probably in *addition*
to "tlb->need_flush", just to maintain that rule). Make it explicit,
and make it obvious, and don't play games.

And don't make it be about range sizes. Because I really don't see how
the exact end/start could ever be relevant. TLB's aren't byte-based,
they are entry-based.

                         Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Catalin Marinas Nov. 6, 2014, 6:38 p.m. UTC | #11
On Thu, Nov 06, 2014 at 05:53:58PM +0000, Linus Torvalds wrote:
> On Thu, Nov 6, 2014 at 5:57 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
> >
> > Anyway, even without special "leaf" operations, it would be useful to
> > make the distinction between unmap_vmas() and free_pgtables() with
> > regards to the ranges tracked by mmu_gather. For the former, tlb_flush()
> > needs to flush the range in PAGE_SIZE increments (assuming a mix of
> > small and huge pages). For the latter, PMD_SIZE increments would be
> > enough.
> 
> Why woyuld you *ever* care about the increments?

Sorry, I wasn't clear enough about the "increments" part. I agreed with
not using end = start + PMD_SIZE/PAGE_SIZE from your previous email
already.

The flush_tlb_range() implementation on ARM(64) uses a loop that goes
over the given range in PAGE_SIZE increments. This is fine and even
optimal when we flush the PTEs. But it could be even faster when we go
over the same range again and only need to flush the page table cache
(PMD/PUD/PGD). A new flush_tlb_table_range() function could loop in
PMD_SIZE increments. That's an arm64-only implementation of the TLB
range flushing, I'm not suggesting the PAGE_SIZE/PMD_SIZE increments
when setting mmu_gather.end at all.

> Quite frankly, I think even the PAGE_SIZE thing is just (a) stupid and
> (b) misleading.
> 
> It might actually be better to instead of
> 
>     tlb->end = max(tlb->end, address + PAGE_SIZE);
> 
> it might as well be a simpler
> 
>     tlb->end = max(tlb->end, address+1)

I fully agree.

One minor drawback is that the TLB invalidation instructions on ARM work
on pfn and end >> PAGE_SHIFT would make it equal to start. It can be
fixed in the arch code though.

> So what matters for the non-leaf operations is not size. Not AT ALL.
> It's a totally different operation, and you'd need not a different
> size, but a different flag entirely - the same way we already have a
> different flag for the "full_mm" case. And it's actually for exactly
> the same reason as "full_mm": you do the flush itself differently,
> it's not that the range is different. If it was just about the range,
> then "full_mm" would just initialize the range to the whole VM. But it
> *isn't* about the range. It's about the fact that a full-MM tear-down
> can fundamentally do different operations, knowing that there are no
> other threads using that VM any more.
> 
> So I really really object to playing games with PMD_SIZE or other
> idiocies, because it fundamentally mis-states the whole problem.

That's not what I suggesting (though I agree I wasn't clear).

The use of PMD_SIZE steps in a new flush_tlb_table_range() loop is
entirely and arch-specific optimisation. Only that the arch code doesn't
currently know which tlb_flush() it should use because need_flush is set
for both PTEs and page table tear down. We just need different flags
here to be able to optimise the arch code further.

> If ARM64 wants to make the "lead vs non-leaf" TLB operation, then you
> need to add a new flag, and you just set that flag when you tear down
> a page table (as opposed to just a PTE).

Indeed. Actually we could use need_flag only for PTEs and ignore it for
page table tear-down. With Will's patch, we can already check tlb->end
for what need_flush is currently doing and use need_flush in an
arch-specific way (and we could give a new name as well).

> > With RCU_TABLE_FREE, I think checking tlb->local.next would do the trick
> > but for x86 we can keep mmu_gather.need_flush only for pte clearing
> > and remove need_flush = 1 from p*_free_tlb() functions.
> 
> This is more confusion about what is going on.

Yes, and if we do this we may no longer understand the code in few weeks
time.

> I'd actually really really prefer to have the "need_flush = 1" for the
> page table tear-down case even for x86. No, if you never removed any
> PTE at all, it is possible that it's not actually needed because an
> x86 CPU isn't supposed to cache non-present page table entries (so if
> you could tear down the page tables because there were no entries,
> there should be no TLB entries, and there *hopefully* should be no
> caches of mid-level page tables either that need a TLB invalidate).

On ARM, as long as an intermediate page table entry is valid, even
though the full translation (PTE) is not, the CPU can go and cache it.
What's worse (and we've hit it before) is that it may even end up
reading something that looks like a valid PTE (of the PTE page has been
freed before the TLB invalidation) and it will stick around as a full
translation. So we need to flush the page table cache before freeing the
page table pages.

> But in practice, I'd not take that chance. If you tear down a page
> table, you should flush the TLB in that range (and note how I say *in*
> that range - an invalidate anywhere in the range should be sufficient,
> not "over the whole range"!), because quite frankly, from an
> implementation standpoint, I really think it's the sane and safe thing
> to do.

A single TLB is indeed enough for a single page table page removed. If
we queue multiple page table pages freeing, we accumulate the range via
pmd_free_tlb() etc. and we would eventually need multiple TLB
invalidations, at most one every PMD_SIZE (that's when !fullmm).

> So I would suggest you think of the x86 invlpg instruction as your
> "non-leaf invalidate". The same way you'd want to do non-leaf
> invalidate whenever you tear down a page table, you'd do "invlpg" on
> x86.

I need to dig some more in the x86 code, I'm not familiar with it. We
could do the page table cache invalidation non-lazily every time
pmd_free_tlb() is called, though it's not as optimal as we need a heavy
DSB barrier on ARM after each TLB invalidate.

> And no, we should *not* play games with "tlb->local.next". That just
> sounds completely and utterly insane. That's a hack, it's unclear,
> it's stupid, and it's connected to a totally irrelevant implementation
> detail, namely that random RCU freeing.
> 
> Set a flag, for chrissake. Just say "when you free a pmd/pud/pgd, set
> tlb->need_flush_inner to let the flusher know" (probably in *addition*
> to "tlb->need_flush", just to maintain that rule). Make it explicit,
> and make it obvious, and don't play games.

I agree.
Linus Torvalds Nov. 6, 2014, 9:29 p.m. UTC | #12
On Thu, Nov 6, 2014 at 10:38 AM, Catalin Marinas
<catalin.marinas@arm.com> wrote:
> On Thu, Nov 06, 2014 at 05:53:58PM +0000, Linus Torvalds wrote:
>
> Sorry, I wasn't clear enough about the "increments" part. I agreed with
> not using end = start + PMD_SIZE/PAGE_SIZE from your previous email
> already.

Ahh, I misunderstood. You're really just after the granularity of tlb flushes.

That's fine. That makes sense. In fact, how about adding "granularity"
to the mmu_gather structure, and then doing:\

 - in __tlb_reset_range(), setting it to ~0ul

 - add "granularity" to __tlb_adjust_range(), and make it do something like

       if (!tlb->fullmm) {
               tlb->granularity = min(tlb->granularity, granularity);
               tlb->start = min(tlb->start, address);
               tlb->end = max(tlb->end, address+1);
       }

and then the TLB flush logic would basically do

   address = tlb->start;
   do {
        flush(address);
        if (address + tlb->granularity < address)
                break;
        address = address + tlb->granularity;
   } while (address < tlb->end);

or something like that.

Now, if you unmap mixed ranges of large-pages and regular pages, you'd
still have that granularity of one page, but quite frankly, if you do
that, you probably deserve it. The common case is almost certainly
going to be just "unmap large pages" or "unmap normal pages".

And if it turns out that I'm completely wrong, and mixed granularities
are common, maybe there could be some hack in the "tlb->granularity"
calculations that just forces a TLB flush when the granularity
changes.

Hmm?

                 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Catalin Marinas Nov. 7, 2014, 4:50 p.m. UTC | #13
On Thu, Nov 06, 2014 at 09:29:54PM +0000, Linus Torvalds wrote:
> On Thu, Nov 6, 2014 at 10:38 AM, Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > On Thu, Nov 06, 2014 at 05:53:58PM +0000, Linus Torvalds wrote:
> >
> > Sorry, I wasn't clear enough about the "increments" part. I agreed with
> > not using end = start + PMD_SIZE/PAGE_SIZE from your previous email
> > already.
> 
> Ahh, I misunderstood. You're really just after the granularity of tlb flushes.

Yes. The granularity would also help when tearing down page tables as
the granule would be PMD_SIZE.

> That's fine. That makes sense. In fact, how about adding "granularity"
> to the mmu_gather structure, and then doing:\
> 
>  - in __tlb_reset_range(), setting it to ~0ul
> 
>  - add "granularity" to __tlb_adjust_range(), and make it do something like
> 
>        if (!tlb->fullmm) {
>                tlb->granularity = min(tlb->granularity, granularity);
>                tlb->start = min(tlb->start, address);
>                tlb->end = max(tlb->end, address+1);
>        }
> 
> and then the TLB flush logic would basically do
> 
>    address = tlb->start;
>    do {
>         flush(address);
>         if (address + tlb->granularity < address)
>                 break;
>         address = address + tlb->granularity;
>    } while (address < tlb->end);
> 
> or something like that.

Indeed. We'll come up with a patch after Will's clean-up.

> Now, if you unmap mixed ranges of large-pages and regular pages, you'd
> still have that granularity of one page, but quite frankly, if you do
> that, you probably deserve it. The common case is almost certainly
> going to be just "unmap large pages" or "unmap normal pages".

I think this could only happen with transparent huge pages that replaced
small pages in an anonymous mapping. I don't think munmap'ing them
happens very often.

Thanks.
diff mbox

Patch

diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index a82c0c5c8b52..a9c9df0f60ff 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -19,10 +19,6 @@ 
 #ifndef __ASM_TLB_H
 #define __ASM_TLB_H
 
-#define  __tlb_remove_pmd_tlb_entry __tlb_remove_pmd_tlb_entry
-
-#include <asm-generic/tlb.h>
-
 #include <linux/pagemap.h>
 #include <linux/swap.h>
 
@@ -37,16 +33,8 @@  static inline void __tlb_remove_table(void *_table)
 #define tlb_remove_entry(tlb, entry)	tlb_remove_page(tlb, entry)
 #endif /* CONFIG_HAVE_RCU_TABLE_FREE */
 
-/*
- * There's three ways the TLB shootdown code is used:
- *  1. Unmapping a range of vmas.  See zap_page_range(), unmap_region().
- *     tlb->fullmm = 0, and tlb_start_vma/tlb_end_vma will be called.
- *  2. Unmapping all vmas.  See exit_mmap().
- *     tlb->fullmm = 1, and tlb_start_vma/tlb_end_vma will be called.
- *     Page tables will be freed.
- *  3. Unmapping argument pages.  See shift_arg_pages().
- *     tlb->fullmm = 0, but tlb_start_vma/tlb_end_vma will not be called.
- */
+#include <asm-generic/tlb.h>
+
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
 	if (tlb->fullmm) {
@@ -54,54 +42,13 @@  static inline void tlb_flush(struct mmu_gather *tlb)
 	} else if (tlb->end > 0) {
 		struct vm_area_struct vma = { .vm_mm = tlb->mm, };
 		flush_tlb_range(&vma, tlb->start, tlb->end);
-		tlb->start = TASK_SIZE;
-		tlb->end = 0;
-	}
-}
-
-static inline void tlb_add_flush(struct mmu_gather *tlb, unsigned long addr)
-{
-	if (!tlb->fullmm) {
-		tlb->start = min(tlb->start, addr);
-		tlb->end = max(tlb->end, addr + PAGE_SIZE);
-	}
-}
-
-/*
- * Memorize the range for the TLB flush.
- */
-static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
-					  unsigned long addr)
-{
-	tlb_add_flush(tlb, addr);
-}
-
-/*
- * In the case of tlb vma handling, we can optimise these away in the
- * case where we're doing a full MM flush.  When we're doing a munmap,
- * the vmas are adjusted to only cover the region to be torn down.
- */
-static inline void tlb_start_vma(struct mmu_gather *tlb,
-				 struct vm_area_struct *vma)
-{
-	if (!tlb->fullmm) {
-		tlb->start = TASK_SIZE;
-		tlb->end = 0;
 	}
 }
 
-static inline void tlb_end_vma(struct mmu_gather *tlb,
-			       struct vm_area_struct *vma)
-{
-	if (!tlb->fullmm)
-		tlb_flush(tlb);
-}
-
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
 				  unsigned long addr)
 {
 	pgtable_page_dtor(pte);
-	tlb_add_flush(tlb, addr);
 	tlb_remove_entry(tlb, pte);
 }
 
@@ -109,7 +56,6 @@  static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
 static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
 				  unsigned long addr)
 {
-	tlb_add_flush(tlb, addr);
 	tlb_remove_entry(tlb, virt_to_page(pmdp));
 }
 #endif
@@ -118,15 +64,8 @@  static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
 static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pudp,
 				  unsigned long addr)
 {
-	tlb_add_flush(tlb, addr);
 	tlb_remove_entry(tlb, virt_to_page(pudp));
 }
 #endif
 
-static inline void __tlb_remove_pmd_tlb_entry(struct mmu_gather *tlb, pmd_t *pmdp,
-						unsigned long address)
-{
-	tlb_add_flush(tlb, address);
-}
-
 #endif
diff --git a/arch/microblaze/include/asm/tlb.h b/arch/microblaze/include/asm/tlb.h
index 8aa97817cc8c..99b6ded54849 100644
--- a/arch/microblaze/include/asm/tlb.h
+++ b/arch/microblaze/include/asm/tlb.h
@@ -14,7 +14,6 @@ 
 #define tlb_flush(tlb)	flush_tlb_mm((tlb)->mm)
 
 #include <linux/pagemap.h>
-#include <asm-generic/tlb.h>
 
 #ifdef CONFIG_MMU
 #define tlb_start_vma(tlb, vma)		do { } while (0)
@@ -22,4 +21,6 @@ 
 #define __tlb_remove_tlb_entry(tlb, pte, address) do { } while (0)
 #endif
 
+#include <asm-generic/tlb.h>
+
 #endif /* _ASM_MICROBLAZE_TLB_H */
diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index e9a9f60e596d..fc3ee06eab87 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -3,7 +3,6 @@ 
 #ifdef __KERNEL__
 
 #include <linux/mm.h>
-#include <asm-generic/tlb.h>
 
 #ifdef CONFIG_PPC_BOOK3E
 extern void tlb_flush_pgtable(struct mmu_gather *tlb, unsigned long address);
@@ -14,6 +13,8 @@  static inline void tlb_flush_pgtable(struct mmu_gather *tlb,
 }
 #endif /* !CONFIG_PPC_BOOK3E */
 
+extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
+
 #ifdef CONFIG_PPC64
 #include <asm/pgalloc-64.h>
 #else
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index e2b428b0f7ba..20733fa518ae 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -27,6 +27,7 @@ 
 
 #define tlb_start_vma(tlb, vma)	do { } while (0)
 #define tlb_end_vma(tlb, vma)	do { } while (0)
+#define __tlb_remove_tlb_entry	__tlb_remove_tlb_entry
 
 extern void tlb_flush(struct mmu_gather *tlb);
 
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 5672d7ea1fa0..340bc5c5ca2d 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -128,6 +128,46 @@  static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 		tlb_flush_mmu(tlb);
 }
 
+static inline void __tlb_adjust_range(struct mmu_gather *tlb,
+				      unsigned long address)
+{
+	if (!tlb->fullmm) {
+		tlb->start = min(tlb->start, address);
+		tlb->end = max(tlb->end, address + PAGE_SIZE);
+	}
+}
+
+static inline void __tlb_reset_range(struct mmu_gather *tlb)
+{
+	tlb->start = TASK_SIZE;
+	tlb->end = 0;
+}
+
+/*
+ * In the case of tlb vma handling, we can optimise these away in the
+ * case where we're doing a full MM flush.  When we're doing a munmap,
+ * the vmas are adjusted to only cover the region to be torn down.
+ */
+#ifndef tlb_start_vma
+#define tlb_start_vma(tlb, vma) do { } while (0)
+#endif
+
+#define __tlb_end_vma(tlb, vma)					\
+	do {							\
+		if (!tlb->fullmm) {				\
+			tlb_flush(tlb);				\
+			__tlb_reset_range(tlb);			\
+		}						\
+	} while (0)
+
+#ifndef tlb_end_vma
+#define tlb_end_vma	__tlb_end_vma
+#endif
+
+#ifndef __tlb_remove_tlb_entry
+#define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
+#endif
+
 /**
  * tlb_remove_tlb_entry - remember a pte unmapping for later tlb invalidation.
  *
@@ -138,6 +178,7 @@  static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 #define tlb_remove_tlb_entry(tlb, ptep, address)		\
 	do {							\
 		tlb->need_flush = 1;				\
+		__tlb_adjust_range(tlb, address);		\
 		__tlb_remove_tlb_entry(tlb, ptep, address);	\
 	} while (0)
 
@@ -152,12 +193,14 @@  static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)		\
 	do {							\
 		tlb->need_flush = 1;				\
+		__tlb_adjust_range(tlb, address);		\
 		__tlb_remove_pmd_tlb_entry(tlb, pmdp, address);	\
 	} while (0)
 
 #define pte_free_tlb(tlb, ptep, address)			\
 	do {							\
 		tlb->need_flush = 1;				\
+		__tlb_adjust_range(tlb, address);		\
 		__pte_free_tlb(tlb, ptep, address);		\
 	} while (0)
 
@@ -165,6 +208,7 @@  static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 #define pud_free_tlb(tlb, pudp, address)			\
 	do {							\
 		tlb->need_flush = 1;				\
+		__tlb_adjust_range(tlb, address);		\
 		__pud_free_tlb(tlb, pudp, address);		\
 	} while (0)
 #endif
@@ -172,6 +216,7 @@  static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 #define pmd_free_tlb(tlb, pmdp, address)			\
 	do {							\
 		tlb->need_flush = 1;				\
+		__tlb_adjust_range(tlb, address);		\
 		__pmd_free_tlb(tlb, pmdp, address);		\
 	} while (0)
 
diff --git a/mm/memory.c b/mm/memory.c
index 3e503831e042..0bc940e41ec9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -220,8 +220,6 @@  void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long
 	/* Is it from 0 to ~0? */
 	tlb->fullmm     = !(start | (end+1));
 	tlb->need_flush_all = 0;
-	tlb->start	= start;
-	tlb->end	= end;
 	tlb->need_flush = 0;
 	tlb->local.next = NULL;
 	tlb->local.nr   = 0;
@@ -232,6 +230,8 @@  void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	tlb->batch = NULL;
 #endif
+
+	__tlb_reset_range(tlb);
 }
 
 static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -241,6 +241,7 @@  static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	tlb_table_flush(tlb);
 #endif
+	__tlb_reset_range(tlb);
 }
 
 static void tlb_flush_mmu_free(struct mmu_gather *tlb)
@@ -1186,20 +1187,8 @@  again:
 	arch_leave_lazy_mmu_mode();
 
 	/* Do the actual TLB flush before dropping ptl */
-	if (force_flush) {
-		unsigned long old_end;
-
-		/*
-		 * Flush the TLB just for the previous segment,
-		 * then update the range to be the remaining
-		 * TLB range.
-		 */
-		old_end = tlb->end;
-		tlb->end = addr;
+	if (force_flush)
 		tlb_flush_mmu_tlbonly(tlb);
-		tlb->start = addr;
-		tlb->end = old_end;
-	}
 	pte_unmap_unlock(start_pte, ptl);
 
 	/*