arm64: mm: Avoid set_pte_at with HugeTLB pages

Message ID 1385739261-26689-1-git-send-email-steve.capper@linaro.org
State New

Commit Message

Steve Capper Nov. 29, 2013, 3:34 p.m.
For huge pages, given newprot a pgprot_t value for a shared writable
VMA, and ptep a pointer to a pte belonging to this VMA; the following
behaviour is assumed by core code:

   hugetlb_change_protection(vma, address, end, newprot);
   ...

   huge_pte_write(huge_ptep_get(ptep)); /* should be true! */

Unfortunately, set_huge_pte_at calls set_pte_at which includes a
side-effect that renders ptes read only if the dirty bit is unset.

If one were to allocate a read only shared huge page, fault it in, and
then mprotect it to be writeable, a subsequent write to that huge page
would result in a spurious call to hugetlb_cow, which causes
corruption. This call is optimised away prior to:
 37a2140 mm, hugetlb: do not use a page in page cache for cow
         optimization

If one runs the libhugetlbfs test suite on v3.12-rc1 upwards, then the
mprotect test will cause the aforementioned corruption and, before the
set of tests completes, the system will be left in an unresponsive
state (calls to fork fail with -ENOMEM).
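
The sequence described above can be sketched from user space roughly as
follows. This is not the libhugetlbfs mprotect test itself, only a minimal
illustration of the same steps. The function names and the 2 MiB huge page
size are assumptions made for this sketch, and it only exercises the
sequence; it does not detect the leak or the later fork failures.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for MAP_ANONYMOUS and MAP_HUGETLB */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumption: 2 MiB huge pages; adjust for the system under test. */
#define ASSUMED_HPAGE_SIZE (2UL * 1024 * 1024)

/*
 * Hypothetical helper. Returns 0 if the sequence ran and the written
 * byte read back correctly, 1 if it did not read back, and -1 if the
 * huge page mapping could not be set up (for example, no huge pages
 * are reserved or the size assumption is wrong).
 */
static int try_hugepage_mprotect_write(void)
{
	volatile unsigned char *p;
	int ret;

	/* Step 1: allocate a read only shared huge page. */
	void *map = mmap(NULL, ASSUMED_HPAGE_SIZE, PROT_READ,
			 MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (map == MAP_FAILED)
		return -1;

	/* Step 2: fault it in with a read access. */
	p = map;
	(void)p[0];

	/* Step 3: mprotect it to be writeable. */
	if (mprotect(map, ASSUMED_HPAGE_SIZE, PROT_READ | PROT_WRITE) != 0) {
		munmap(map, ASSUMED_HPAGE_SIZE);
		return -1;
	}

	/* Step 4: the subsequent write that should not need a COW. */
	p[0] = 0x5a;
	ret = (p[0] == 0x5a) ? 0 : 1;

	munmap(map, ASSUMED_HPAGE_SIZE);
	return ret;
}

/* Hypothetical caller: accept "ran fine" or "could not set up". */
static void check_hugepage_sequence(void)
{
	int r = try_hugepage_mprotect_write();

	assert(r == 0 || r == -1);
}
```

On a system without reserved huge pages the mmap call is expected to fail
and the helper returns -1 without touching any memory.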

This patch re-implements set_huge_pte_at to dereference the pte value
explicitly. hugetlb_cow is no longer called spuriously, and the unit
tests complete successfully.
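
The difference between the two store paths can be modelled in user space
as below. This is a simplified sketch, not kernel code: the model_* names
and bit positions are invented for illustration and do not match the arm64
pte layout, and the icache maintenance done by the real code is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pte model; bit positions are for this sketch only. */
typedef uint64_t model_pte_t;
#define MODEL_PTE_RDONLY (1ULL << 0) /* set: read only */
#define MODEL_PTE_DIRTY  (1ULL << 1) /* set: page has been written */

static bool model_pte_write(model_pte_t pte)
{
	return !(pte & MODEL_PTE_RDONLY);
}

static bool model_pte_dirty(model_pte_t pte)
{
	return (pte & MODEL_PTE_DIRTY) != 0;
}

/* Old path: behaves like set_pte_at, write-protecting clean ptes. */
static void model_set_huge_pte_at_old(model_pte_t *ptep, model_pte_t pte)
{
	if (!model_pte_dirty(pte))
		pte |= MODEL_PTE_RDONLY;
	*ptep = pte;
}

/* Patched path: stores the value unchanged. */
static void model_set_huge_pte_at_new(model_pte_t *ptep, model_pte_t pte)
{
	*ptep = pte;
}

/* Walks through the writable-but-clean case from the description. */
static void model_selfcheck(void)
{
	model_pte_t slot = 0;
	model_pte_t writable_clean = 0; /* neither RDONLY nor DIRTY set */

	model_set_huge_pte_at_old(&slot, writable_clean);
	assert(!model_pte_write(slot)); /* core code's assumption is broken */

	model_set_huge_pte_at_new(&slot, writable_clean);
	assert(model_pte_write(slot));  /* assumption holds */
}
```

In the model, a writable but clean pte comes back read only from the old
path, which is the condition that leads hugetlb_fault to call hugetlb_cow.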

Signed-off-by: Steve Capper <steve.capper@linaro.org>
---
I operated under the deluded notion that set_pte_at on arm64 had no
side effects when I originally sent out:
http://lists.infradead.org/pipermail/linux-arm-kernel/2013-November/212475.html

As this patch is more or less self-contained for arm64, I am sending
this out on its own rather than merging with the above series.

Apologies for not catching this sooner.
---
 arch/arm64/include/asm/hugetlb.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Catalin Marinas Nov. 29, 2013, 4:28 p.m. | #1
On Fri, Nov 29, 2013 at 03:34:21PM +0000, Steve Capper wrote:
> For huge pages, given newprot a pgprot_t value for a shared writable
> VMA, and ptep a pointer to a pte belonging to this VMA; the following
> behaviour is assumed by core code:
> 
>    hugetlb_change_protection(vma, address, end, newprot);
>    ...
> 
>    huge_pte_write(huge_ptep_get(ptep)); /* should be true! */
> 
> Unfortunately, set_huge_pte_at calls set_pte_at which includes a
> side-effect that renders ptes read only if the dirty bit is unset.

And don't you also need this side-effect for huge pages?

> If one were to allocate a read only shared huge page, then fault it in,
> and then mprotect it to be writeable. A subsequent write to that huge
> page will result in a spurious call to hugetlb_cow, which causes
> corruption.

In general making a page writable also makes it dirty but I couldn't
find this for standard page tables (sys_mprotect ... change_pte_range).

Anyway, why would a fault on huge page trigger cow while one on standard
page not?

So I think we have a different problem, which I've been thinking about
but which hasn't bitten us with standard page tables. In handle_pte_fault()
for standard pages if the fault is write and !pte_write() we call
do_wp_page(). This is smart enough not to do a COW.

hugetlb_fault() OTOH is not that smart ;) and calls hugetlb_cow() if
!huge_pte_write(). You can fix this logic not to do COW, similarly to
do_wp_page(), though I haven't looked in detail at how it decides this.

In the arch code, what we need (and it would work as an optimisation for
such faults) is to add another software bit for PTE_WRITE, independent of
!PTE_RDONLY. This way you can have clean (and hardware read-only) pages
but with a software pte_write(). handle_pte_fault() would simply call
pte_mkdirty() for standard pages.
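
A rough user-space model of that idea might look like the following. It
is only a sketch under stated assumptions: the sw_* names, the bit
positions and the fault helper are hypothetical and are not taken from
any kernel tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pte model; bit positions are for this sketch only. */
typedef uint64_t sw_pte_t;
#define SW_PTE_RDONLY (1ULL << 0) /* hardware read-only */
#define SW_PTE_DIRTY  (1ULL << 1) /* software dirty */
#define SW_PTE_WRITE  (1ULL << 2) /* software write permission */

/* pte_write() now looks at the software bit, not at !RDONLY. */
static bool sw_pte_write(sw_pte_t pte) { return (pte & SW_PTE_WRITE) != 0; }
static bool sw_pte_dirty(sw_pte_t pte) { return (pte & SW_PTE_DIRTY) != 0; }
static bool sw_pte_hw_writable(sw_pte_t pte) { return !(pte & SW_PTE_RDONLY); }

static sw_pte_t sw_pte_mkwrite(sw_pte_t pte) { return pte | SW_PTE_WRITE; }
static sw_pte_t sw_pte_mkdirty(sw_pte_t pte) { return pte | SW_PTE_DIRTY; }

/* Store: hardware-writable only when both writable and dirty. */
static void sw_set_pte_at(sw_pte_t *ptep, sw_pte_t pte)
{
	if (sw_pte_write(pte) && sw_pte_dirty(pte))
		pte &= ~SW_PTE_RDONLY;
	else
		pte |= SW_PTE_RDONLY;
	*ptep = pte;
}

/*
 * Write fault: if the pte is software-writable, marking it dirty is
 * enough. Returns false when the pte is genuinely read only, where the
 * caller would go on to its COW handling.
 */
static bool sw_handle_write_fault(sw_pte_t *ptep)
{
	if (!sw_pte_write(*ptep))
		return false;
	sw_set_pte_at(ptep, sw_pte_mkdirty(*ptep));
	return true;
}

/* Walks a clean, writable pte through a write fault. */
static void sw_selfcheck(void)
{
	sw_pte_t slot = 0;

	sw_set_pte_at(&slot, sw_pte_mkwrite(0));
	assert(sw_pte_write(slot));        /* software view: writable */
	assert(!sw_pte_hw_writable(slot)); /* hardware view: read only */

	assert(sw_handle_write_fault(&slot));
	assert(sw_pte_dirty(slot) && sw_pte_hw_writable(slot));
}
```

The point of the split is that a clean page stays read only in hardware,
so the first write still faults, while pte_write() stays true and the
fault can be resolved by marking the pte dirty rather than by a COW.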

BTW, I think we have the same issue with LPAE.

Patch

diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
index 5b7ca8a..32b042f 100644
--- a/arch/arm64/include/asm/hugetlb.h
+++ b/arch/arm64/include/asm/hugetlb.h
@@ -33,7 +33,10 @@  static inline pte_t huge_ptep_get(pte_t *ptep)
 static inline void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 				   pte_t *ptep, pte_t pte)
 {
-	set_pte_at(mm, addr, ptep, pte);
+	if (pte_exec(pte))
+		__sync_icache_dcache(pte, addr);
+
+	*ptep = pte;
 }
 
 static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,