
[RFC,08/11] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather

Message ID 1535125966-7666-9-git-send-email-will.deacon@arm.com
State Superseded
Series Avoid synchronous TLB invalidation for intermediate page-table entries on arm64

Commit Message

Will Deacon Aug. 24, 2018, 3:52 p.m. UTC
From: Peter Zijlstra <peterz@infradead.org>


Some architectures require different TLB invalidation instructions
depending on whether it is only the last-level of page table being
changed, or whether there are also changes to the intermediate
(directory) entries higher up the tree.

Add a new bit to the flags bitfield in struct mmu_gather so that the
architecture code can operate accordingly if it's the intermediate
levels being invalidated.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>

---
 include/asm-generic/tlb.h | 31 +++++++++++++++++++++++--------
 1 file changed, 23 insertions(+), 8 deletions(-)

-- 
2.1.4
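
For illustration, an architecture's tlb_flush() hook could consume the
new freed_tables bit to choose between a leaf-only and a full
invalidation. This is only a sketch; the helpers
arch_flush_tlb_range_leaf()/arch_flush_tlb_range_all() are placeholders,
not functions from this series:

static inline void tlb_flush(struct mmu_gather *tlb)
{
	/* A leaf-only invalidation is safe only if no directories were freed */
	bool last_level = !tlb->freed_tables && !tlb->fullmm;

	if (last_level)
		arch_flush_tlb_range_leaf(tlb->mm, tlb->start, tlb->end);
	else
		arch_flush_tlb_range_all(tlb->mm, tlb->start, tlb->end);
}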

Comments

Nicholas Piggin Aug. 27, 2018, 4:44 a.m. UTC | #1
On Fri, 24 Aug 2018 16:52:43 +0100
Will Deacon <will.deacon@arm.com> wrote:

> From: Peter Zijlstra <peterz@infradead.org>
> 
> Some architectures require different TLB invalidation instructions
> depending on whether it is only the last-level of page table being
> changed, or whether there are also changes to the intermediate
> (directory) entries higher up the tree.
> 
> Add a new bit to the flags bitfield in struct mmu_gather so that the
> architecture code can operate accordingly if it's the intermediate
> levels being invalidated.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Will Deacon <will.deacon@arm.com>

powerpc should be able to move right over to using this rather
than keeping the bit in need_flush_all.

powerpc may be able to use the unmap granule thing to improve
its page size dependent flushes, but it might prefer to go
a different way and track start-end for different page sizes.
I wonder how much of that stuff should go into generic code,
and whether it should instead go into a struct arch_mmu_gather.

Thanks,
Nick
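
A struct arch_mmu_gather along the lines Nick suggests might look
roughly like the sketch below, assuming an arch-private member embedded
in struct mmu_gather. None of this exists in the series; the field and
hook names are purely illustrative:

struct arch_mmu_gather {
	/* e.g. powerpc could track a pending range per page size */
	unsigned long	hstart, hend;	/* 2M (PMD) range */
	unsigned long	gstart, gend;	/* 1G (PUD) range */
};

/* Reset hook that the generic __tlb_reset_range() could call */
static inline void arch_tlb_reset_range(struct mmu_gather *tlb)
{
	tlb->arch.hstart = tlb->arch.gstart = TASK_SIZE;
	tlb->arch.hend   = tlb->arch.gend   = 0;
}
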
Peter Zijlstra Aug. 28, 2018, 1:46 p.m. UTC | #2
On Mon, Aug 27, 2018 at 02:44:57PM +1000, Nicholas Piggin wrote:

> powerpc may be able to use the unmap granule thing to improve
> its page size dependent flushes, but it might prefer to go
> a different way and track start-end for different page sizes.

I don't really see how tracking multiple ranges would help much with
THP. The ranges would end up being almost the same if there is a good
mix of page sizes.

But something like:

void tlb_flush_one(struct mmu_gather *tlb, unsigned long addr)
{
	if (tlb->cleared_ptes && (addr << BITS_PER_LONG - PAGE_SHIFT))
		tlbie_pte(addr);
	if (tlb->cleared_pmds && (addr << BITS_PER_LONG - PMD_SHIFT))
		tlbie_pmd(addr);
	if (tlb->cleared_puds && (addr << BITS_PER_LONG - PUD_SHIFT))
		tlbie_pud(addr);
}

void tlb_flush_range(struct mmu_gather *tlb)
{
	unsigned long stride = 1UL << tlb_get_unmap_shift(tlb);
	unsigned long addr;

	for (addr = tlb->start; addr < tlb->end; addr += stride)
		tlb_flush_one(tlb, addr);

	ptesync();
}

That should work, I think. You'll only issue multiple TLBIEs on the
boundaries, not every stride.

And for hugetlb the above should be optimal, since stride and
tlb->cleared_* match up 1:1.
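
For reference, the tlb_get_unmap_shift() used above is the helper added
earlier in this series; roughly, it derives the flush stride from which
page-table levels were cleared:

static inline unsigned int tlb_get_unmap_shift(struct mmu_gather *tlb)
{
	if (tlb->cleared_ptes)
		return PAGE_SHIFT;
	if (tlb->cleared_pmds)
		return PMD_SHIFT;
	if (tlb->cleared_puds)
		return PUD_SHIFT;
	if (tlb->cleared_p4ds)
		return P4D_SHIFT;

	return PAGE_SHIFT;
}
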
Peter Zijlstra Aug. 28, 2018, 1:48 p.m. UTC | #3
On Tue, Aug 28, 2018 at 03:46:38PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 27, 2018 at 02:44:57PM +1000, Nicholas Piggin wrote:
> 
> > powerpc may be able to use the unmap granule thing to improve
> > its page size dependent flushes, but it might prefer to go
> > a different way and track start-end for different page sizes.
> 
> I don't really see how tracking multiple ranges would help much with
> THP. The ranges would end up being almost the same if there is a good
> mix of page sizes.
> 
> But something like:
> 
> void tlb_flush_one(struct mmu_gather *tlb, unsigned long addr)
> {
> 	if (tlb->cleared_ptes && (addr << BITS_PER_LONG - PAGE_SHIFT))
> 		tlbie_pte(addr);
> 	if (tlb->cleared_pmds && (addr << BITS_PER_LONG - PMD_SHIFT))
> 		tlbie_pmd(addr);
> 	if (tlb->cleared_puds && (addr << BITS_PER_LONG - PUD_SHIFT))
> 		tlbie_pud(addr);
> }

Sorry, those all should (of course) be !(addr << ...).

> void tlb_flush_range(struct mmu_gather *tlb)
> {
> 	unsigned long stride = 1UL << tlb_get_unmap_shift(tlb);
> 	unsigned long addr;
> 
> 	for (addr = tlb->start; addr < tlb->end; addr += stride)
> 		tlb_flush_one(tlb, addr);
> 
> 	ptesync();
> }

And one could, as x86 does, add a threshold above which you just kill
the complete mm.
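
Such a threshold could be wired up along these lines; the limit and the
flush_all_mm() helper below are placeholders (x86 uses
tlb_single_page_flush_ceiling for the same purpose):

/* Illustrative only: flush the whole mm once the range gets too large */
#define TLB_RANGE_FLUSH_CEILING	33	/* in strides */

void tlb_flush(struct mmu_gather *tlb)
{
	unsigned long shift = tlb_get_unmap_shift(tlb);
	unsigned long strides = (tlb->end - tlb->start) >> shift;

	if (tlb->fullmm || strides > TLB_RANGE_FLUSH_CEILING)
		flush_all_mm(tlb->mm);		/* one mm/PID-wide invalidation */
	else
		tlb_flush_range(tlb);
}
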
Nicholas Piggin Aug. 28, 2018, 2:12 p.m. UTC | #4
On Tue, 28 Aug 2018 15:46:38 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Aug 27, 2018 at 02:44:57PM +1000, Nicholas Piggin wrote:
> 
> > powerpc may be able to use the unmap granule thing to improve
> > its page size dependent flushes, but it might prefer to go
> > a different way and track start-end for different page sizes.
> 
> I don't really see how tracking multiple ranges would help much with
> THP. The ranges would end up being almost the same if there is a good
> mix of page sizes.

That's assuming quite large unmaps. But a lot of the time they are
going to go to a full PID flush.

> But something like:
> 
> void tlb_flush_one(struct mmu_gather *tlb, unsigned long addr)
> {
> 	if (tlb->cleared_ptes && (addr << BITS_PER_LONG - PAGE_SHIFT))
> 		tlbie_pte(addr);
> 	if (tlb->cleared_pmds && (addr << BITS_PER_LONG - PMD_SHIFT))
> 		tlbie_pmd(addr);
> 	if (tlb->cleared_puds && (addr << BITS_PER_LONG - PUD_SHIFT))
> 		tlbie_pud(addr);
> }
> 
> void tlb_flush_range(struct mmu_gather *tlb)
> {
> 	unsigned long stride = 1UL << tlb_get_unmap_shift(tlb);
> 	unsigned long addr;
> 
> 	for (addr = tlb->start; addr < tlb->end; addr += stride)
> 		tlb_flush_one(tlb, addr);
> 
> 	ptesync();
> }
> 
> That should work, I think. You'll only issue multiple TLBIEs on the
> boundaries, not every stride.

Yeah, we already do basically that today in the flush_tlb_range path,
just without the precise test for which page sizes need flushing:

                if (hflush) {
                        hstart = (start + PMD_SIZE - 1) & PMD_MASK;
                        hend = end & PMD_MASK;
                        if (hstart == hend)
                                hflush = false;
                }

                if (gflush) {
                        gstart = (start + PUD_SIZE - 1) & PUD_MASK;
                        gend = end & PUD_MASK;
                        if (gstart == gend)
                                gflush = false;
                }

                asm volatile("ptesync": : :"memory");
                if (local) {
                        __tlbiel_va_range(start, end, pid, page_size, mmu_virtual_psize);
                        if (hflush)
                                __tlbiel_va_range(hstart, hend, pid,
                                                PMD_SIZE, MMU_PAGE_2M);
                        if (gflush)
                                __tlbiel_va_range(gstart, gend, pid,
                                                PUD_SIZE, MMU_PAGE_1G);
                        asm volatile("ptesync": : :"memory");

Thing is, I think it's the smallish-range cases you want to optimize
for. And for those we'll probably do something even smarter (like
keeping a bitmap of pages to flush), because we really want to keep
tlbies off the bus, whereas that's less important for x86.
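
A bitmap-based variant could look something like the sketch below:
track small-page invalidations precisely inside a fixed window and fall
back to the range loop otherwise. All names here are made up;
tlbie_pte()/ptesync() are the same placeholders as in the sketch above:

#define FLUSH_BITMAP_PAGES	64	/* pages tracked precisely per gather */

struct flush_bitmap {
	unsigned long	base;		/* page-aligned start of the window */
	DECLARE_BITMAP(pending, FLUSH_BITMAP_PAGES);
};

static void flush_bitmap_mark(struct flush_bitmap *fb, unsigned long addr)
{
	unsigned long idx = (addr - fb->base) >> PAGE_SHIFT;

	if (idx < FLUSH_BITMAP_PAGES)
		__set_bit(idx, fb->pending);
}

static void flush_bitmap_issue(struct flush_bitmap *fb)
{
	unsigned long idx;

	/* One tlbie per pending page, then a single ptesync */
	for_each_set_bit(idx, fb->pending, FLUSH_BITMAP_PAGES)
		tlbie_pte(fb->base + (idx << PAGE_SHIFT));
	ptesync();
}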

Still not really seeing a reason not to implement a struct
arch_mmu_gather. A little bit of data confined to the arch is nothing
compared with the multitude of hooks and divergence of code.

Thanks,
Nick

Patch

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 859f897d3d53..a5caf90264e6 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -99,12 +99,22 @@  struct mmu_gather {
 #endif
 	unsigned long		start;
 	unsigned long		end;
-	/* we are in the middle of an operation to clear
-	 * a full mm and can make some optimizations */
-	unsigned int		fullmm : 1,
-	/* we have performed an operation which
-	 * requires a complete flush of the tlb */
-				need_flush_all : 1;
+	/*
+	 * we are in the middle of an operation to clear
+	 * a full mm and can make some optimizations
+	 */
+	unsigned int		fullmm : 1;
+
+	/*
+	 * we have performed an operation which
+	 * requires a complete flush of the tlb
+	 */
+	unsigned int		need_flush_all : 1;
+
+	/*
+	 * we have removed page directories
+	 */
+	unsigned int		freed_tables : 1;
 
 	struct mmu_gather_batch *active;
 	struct mmu_gather_batch	local;
@@ -139,6 +149,7 @@  static inline void __tlb_reset_range(struct mmu_gather *tlb)
 		tlb->start = TASK_SIZE;
 		tlb->end = 0;
 	}
+	tlb->freed_tables = 0;
 }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -280,6 +291,7 @@  static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define pte_free_tlb(tlb, ptep, address)			\
 	do {							\
 		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		tlb->freed_tables = 1;			\
 		__pte_free_tlb(tlb, ptep, address);		\
 	} while (0)
 #endif
@@ -287,7 +299,8 @@  static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #ifndef pmd_free_tlb
 #define pmd_free_tlb(tlb, pmdp, address)			\
 	do {							\
-		__tlb_adjust_range(tlb, address, PAGE_SIZE);		\
+		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		tlb->freed_tables = 1;			\
 		__pmd_free_tlb(tlb, pmdp, address);		\
 	} while (0)
 #endif
@@ -297,6 +310,7 @@  static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define pud_free_tlb(tlb, pudp, address)			\
 	do {							\
 		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		tlb->freed_tables = 1;			\
 		__pud_free_tlb(tlb, pudp, address);		\
 	} while (0)
 #endif
@@ -306,7 +320,8 @@  static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #ifndef p4d_free_tlb
 #define p4d_free_tlb(tlb, pudp, address)			\
 	do {							\
-		__tlb_adjust_range(tlb, address, PAGE_SIZE);		\
+		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		tlb->freed_tables = 1;			\
 		__p4d_free_tlb(tlb, pudp, address);		\
 	} while (0)
 #endif