
arm64: mm: Prevent the initial page table setup from creating larger blocks

Message ID 1448387338-27851-1-git-send-email-catalin.marinas@arm.com
State New

Commit Message

Catalin Marinas Nov. 24, 2015, 5:48 p.m. UTC
While the ARM ARM is not entirely clear, over the years we have assumed
that we can split a large block entry (pmd/pud) into smaller blocks
pointing to the same physical address with little risk of a TLB
conflict. However, remapping a range of smaller blocks as a larger one
(e.g. from pages to sections or to contiguous pages) implies a high
risk of TLB conflict. Excessive TLB flushing would make the window
smaller but it would not remove the issue.

This patch prevents larger block mappings at all levels _if_ the
existing pud/pmd entry is present _and_ not already a block mapping.
Contiguous ptes are rejected if any of the entries in the range are
non-empty and non-contiguous.

In addition, TLB flushing is added for the cases where an existing block
entry is changed.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Jeremy Linton <jeremy.linton@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
---

Hi,

We know the page table setup in Linux is currently broken (with or
without contiguous PTE support, though the latter makes it more
visible). The architects imply that the risk of TLB conflict is *much*
smaller when going from large to smaller blocks, to the point that it
may even be stated thus in the ARM ARM. In the meantime, this patch aims
to keep the kernel working under the old assumption for the 4.4 kernel
(potentially backporting it to earlier kernels for the pud/pmd mappings)
until Mark R has a proper fix for 4.5.

Thanks.

 arch/arm64/mm/mmu.c | 65 ++++++++++++++++++++++++++---------------------------
 1 file changed, 32 insertions(+), 33 deletions(-)


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

Comments

Catalin Marinas Nov. 25, 2015, 6:08 p.m. UTC | #1
On Wed, Nov 25, 2015 at 10:31:48AM -0600, Jeremy Linton wrote:
> On 11/24/2015 11:48 AM, Catalin Marinas wrote:
> > While the ARM ARM is not entirely clear, over the years we have assumed
> > that we can split a large block entry (pmd/pud) into smaller blocks
> > pointing to the same physical address with little risk of a TLB
> > conflict. However, remapping a range of smaller blocks as a larger one
> > (e.g. from pages to sections or to contiguous pages) implies a high
> > risk of TLB conflict. Excessive TLB flushing would make the window
> > smaller but it would not remove the issue.
>
> 	Is a requirement of this assumption that the kernel isn't running on
> a VM'ed host with small page mappings? AKA the hypervisor is providing
> smaller page sizes than the guest linear mapping?

The assumption is not even endorsed by the ARM ARM, so it can probably
be messed up further by stage 2.

> 	Because I can understand the idea that the CPU won't walk PTEs for
> entries it has a larger translation for, but my understanding of how
> the TLBs are fragmented when the host has a smaller page size means
> that it's potentially possible to have multiple TLB entries for
> different parts of a single cont/block range....

Another random assumption is that the TLB entry for a full VA->PA
translation covers the minimum block size between stage 1 and stage 2
(so even smaller block sizes are allowed). We risk a TLB conflict if we
switch from a smaller to a bigger block TLB entry, so under this
assumption I don't see it happening.

Anyway, too many assumptions. The proper fix is break-before-make (when
possible, though not for the kernel linear mapping) or switching to a
temporary TTBR, but that's not feasible for 4.4. In the meantime, we
have two options:

1. Avoid creating contiguous PTEs since they have shown such TLB
   conflict behaviour (it doesn't mean it's only this to blame)

2. Add a TLB flush in __populate_init_pte() and hope that we will reduce
   the potential TLB conflict window (as we do for pmd/pud)

Point 2 is not really a fix and it requires more testing on different
platforms. With this patch, I tried a partial solution for point 1. If
we deem this unsafe, I would rather revert the page table setup to what
we had before 4.4 and re-enable the feature in 4.5 together with the
complete fix. AFAICT, we haven't seen any failures in this area prior
to 4.4 (though they could just as well be silent). Unfortunately, this
means reverting some of your patches (not all though, and the good
thing is that you can merge them again in 4.5 ;)).

-- 
Catalin

Catalin Marinas Nov. 26, 2015, 4:06 p.m. UTC | #2
On Wed, Nov 25, 2015 at 04:58:52PM -0600, Jeremy Linton wrote:
> On 11/24/2015 11:48 AM, Catalin Marinas wrote:
> > This patch prevents larger block mappings at all levels _if_ the
> > existing pud/pmd entry is present _and_ not already a block mapping.
> > Contiguous ptes are rejected if any of the entries in the range are
> > non-empty and non-contiguous.
> >
> > In addition, TLB flushing is added for the cases where an existing
> > block entry is changed.
>
> Ok, it seems to boot fairly reliably on the m400/ACPI, over a few dozen
> reboots yesterday and today. So, for that:
>
> Tested-by: Jeremy Linton <jeremy.linton@arm.com>
>
> That said, it basically disables the CONT ranges for the main kernel
> text section.

Thanks for testing. However, I decided to revert commit 348a65cdcbbf
("arm64: Mark kernel page ranges contiguous") temporarily until 4.5.
While I see your point that the other cases (pte->pmd->pud) could be
equally broken, we haven't had reports about them so far. Mark is close
to finishing a proper fix for the page table manipulation and we'll
re-apply your commit once that's done. I kept the other patches in the
original series, so it's only the last one that needs merging.

-- 
Catalin


Patch

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index e379fe9f764c..cddd3536ff64 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -126,6 +126,19 @@  static void split_pmd(pmd_t *pmd, pte_t *pte)
 	} while (pte++, i++, i < PTRS_PER_PTE);
 }
 
+static bool __pte_range_none_or_cont(pte_t *pte)
+{
+	int i;
+
+	for (i = 0; i < CONT_PTES; i++) {
+		if (!pte_none(*pte) && !pte_cont(*pte))
+			return false;
+		pte++;
+	}
+
+	return true;
+}
+
 /*
  * Given a PTE with the CONT bit set, determine where the CONT range
  * starts, and clear the entire range of PTE CONT bits.
@@ -150,14 +163,20 @@  static void __populate_init_pte(pte_t *pte, unsigned long addr,
 				pgprot_t prot)
 {
 	unsigned long pfn = __phys_to_pfn(phys);
+	bool flush = false;
 
 	do {
+		if (!pte_none(*pte))
+			flush = true;
 		/* clear all the bits except the pfn, then apply the prot */
 		set_pte(pte, pfn_pte(pfn, prot));
 		pte++;
 		pfn++;
 		addr += PAGE_SIZE;
 	} while (addr != end);
+
+	if (flush)
+		flush_tlb_all();
 }
 
 static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
@@ -180,7 +199,8 @@  static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
 	pte = pte_offset_kernel(pmd, addr);
 	do {
 		next = min(end, (addr + CONT_SIZE) & CONT_MASK);
-		if (((addr | next | phys) & ~CONT_MASK) == 0) {
+		if (((addr | next | phys) & ~CONT_MASK) == 0 &&
+		    __pte_range_none_or_cont(pte)) {
 			/* a block of CONT_PTES  */
 			__populate_init_pte(pte, addr, next, phys,
 					    __pgprot(pgprot_val(prot) | PTE_CONT));
@@ -192,8 +212,9 @@  static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
 			 * ranges, then update the portion of the range we
 			 * are interrested in.
 			 */
-			 clear_cont_pte_range(pte, addr);
-			 __populate_init_pte(pte, addr, next, phys, prot);
+			if (pte_cont(*pte))
+				clear_cont_pte_range(pte, addr);
+			__populate_init_pte(pte, addr, next, phys, prot);
 		}
 
 		pte += (next - addr) >> PAGE_SHIFT;
@@ -241,24 +262,15 @@  static void alloc_init_pmd(struct mm_struct *mm, pud_t *pud,
 
 	pmd = pmd_offset(pud, addr);
 	do {
+		pmd_t old_pmd = *pmd;
 		next = pmd_addr_end(addr, end);
 		/* try section mapping first */
-		if (((addr | next | phys) & ~SECTION_MASK) == 0) {
-			pmd_t old_pmd =*pmd;
+		if (((addr | next | phys) & ~SECTION_MASK) == 0 &&
+		    (pmd_none(old_pmd) || pmd_sect(old_pmd))) {
 			set_pmd(pmd, __pmd(phys |
 					   pgprot_val(mk_sect_prot(prot))));
-			/*
-			 * Check for previous table entries created during
-			 * boot (__create_page_tables) and flush them.
-			 */
-			if (!pmd_none(old_pmd)) {
+			if (!pmd_none(old_pmd))
 				flush_tlb_all();
-				if (pmd_table(old_pmd)) {
-					phys_addr_t table = __pa(pte_offset_map(&old_pmd, 0));
-					if (!WARN_ON_ONCE(slab_is_available()))
-						memblock_free(table, PAGE_SIZE);
-				}
-			}
 		} else {
 			alloc_init_pte(pmd, addr, next, phys, prot, alloc);
 		}
@@ -294,31 +306,18 @@  static void alloc_init_pud(struct mm_struct *mm, pgd_t *pgd,
 
 	pud = pud_offset(pgd, addr);
 	do {
+		pud_t old_pud = *pud;
 		next = pud_addr_end(addr, end);
 
 		/*
 		 * For 4K granule only, attempt to put down a 1GB block
 		 */
-		if (use_1G_block(addr, next, phys)) {
-			pud_t old_pud = *pud;
+		if (use_1G_block(addr, next, phys) &&
+		    (pud_none(old_pud) || pud_sect(old_pud))) {
 			set_pud(pud, __pud(phys |
 					   pgprot_val(mk_sect_prot(prot))));
-
-			/*
-			 * If we have an old value for a pud, it will
-			 * be pointing to a pmd table that we no longer
-			 * need (from swapper_pg_dir).
-			 *
-			 * Look up the old pmd table and free it.
-			 */
-			if (!pud_none(old_pud)) {
+			if (!pud_none(old_pud))
 				flush_tlb_all();
-				if (pud_table(old_pud)) {
-					phys_addr_t table = __pa(pmd_offset(&old_pud, 0));
-					if (!WARN_ON_ONCE(slab_is_available()))
-						memblock_free(table, PAGE_SIZE);
-				}
-			}
 		} else {
 			alloc_init_pmd(mm, pud, addr, next, phys, prot, alloc);
 		}