
[RESEND] mm, oom_reaper: gather each vma to prevent leaking TLB entry

Message ID 20171107095453.179940-1-wangnan0@huawei.com
State Accepted
Commit 687cb0884a714ff484d038e9190edc874edcf146
Series [RESEND] mm, oom_reaper: gather each vma to prevent leaking TLB entry | expand

Commit Message

Wang Nan Nov. 7, 2017, 9:54 a.m. UTC
tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory
space. In this case, tlb->fullmm is true. Some archs like arm64 don't
flush the TLB when tlb->fullmm is true:

  commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1").

This leaks TLB entries.

Will clarified his patch:

> Basically, we tag each address space with an ASID (PCID on x86) which
> is resident in the TLB. This means we can elide TLB invalidation when
> pulling down a full mm because we won't ever assign that ASID to another mm
> without doing TLB invalidation elsewhere (which actually just nukes the
> whole TLB).
>
> I think that means that we could potentially not fault on a kernel uaccess,
> because we could hit in the TLB.
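
For reference, the arm64 behaviour described above boils down to an early
return in tlb_flush() when the gather covers the whole mm (see the
arch/arm64/include/asm/tlb.h hunks quoted later in this thread). A minimal
sketch, with the ranged invalidation path elided:

static inline void tlb_flush(struct mmu_gather *tlb)
{
	/*
	 * fullmm means the whole address space is being torn down. The
	 * ASID allocator will either invalidate the ASID or mark it as
	 * used, so arm64 skips the invalidation here; this is only safe
	 * if nothing can still run with the old translations.
	 */
	if (tlb->fullmm)
		return;

	/* otherwise a ranged TLB invalidation is performed here */
}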


There could be a window between complete_signal() sending an IPI to other
cores and all threads sharing this mm actually being kicked off those cores.
In this window, the oom reaper may call tlb_flush_mmu_tlbonly() to flush
the TLB and then free pages. However, due to the above problem, the TLB
entries are not really flushed on arm64, so other threads can still access
these pages through stale TLB entries. Moreover, a copy_to_user() can also
write to these pages without generating a page fault, causing use-after-free
bugs.

This patch gathers each vma instead of gathering the full vm space, so
tlb->fullmm is not true. The behavior of the oom reaper becomes similar
to munmapping before do_exit, which should be safe for all archs.

Signed-off-by: Wang Nan <wangnan0@huawei.com>

Cc: Bob Liu <liubo95@huawei.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/oom_kill.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

-- 
2.10.1

Comments

Michal Hocko Nov. 7, 2017, 10:09 a.m. UTC | #1
On Tue 07-11-17 09:54:53, Wang Nan wrote:
> tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory

> space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't

> flush TLB when tlb->fullmm is true:

> 

>   commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1").

> 

> Which makes leaking of tlb entries.

> 

> Will clarifies his patch:

> 

> > Basically, we tag each address space with an ASID (PCID on x86) which

> > is resident in the TLB. This means we can elide TLB invalidation when

> > pulling down a full mm because we won't ever assign that ASID to another mm

> > without doing TLB invalidation elsewhere (which actually just nukes the

> > whole TLB).

> >

> > I think that means that we could potentially not fault on a kernel uaccess,

> > because we could hit in the TLB.

> 

> There could be a window between complete_signal() sending IPI to other

> cores and all threads sharing this mm are really kicked off from cores.

> In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush

> TLB then frees pages. However, due to the above problem, the TLB entries

> are not really flushed on arm64. Other threads are possible to access

> these pages through TLB entries. Moreover, a copy_to_user() can also

> write to these pages without generating page fault, causes use-after-free

> bugs.

> 

> This patch gathers each vma instead of gathering full vm space.

> In this case tlb->fullmm is not true. The behavior of oom reaper become

> similar to munmapping before do_exit, which should be safe for all archs.


This goes all the way back to when the reaper was introduced.

Fixes: aac453635549 ("mm, oom: introduce oom reaper")

Maybe we should even Cc stable because a memory corruption will be quite
hard to debug.

> Signed-off-by: Wang Nan <wangnan0@huawei.com>

> Cc: Bob Liu <liubo95@huawei.com>

> Cc: Andrew Morton <akpm@linux-foundation.org>

> Cc: Michal Hocko <mhocko@suse.com>

> Cc: David Rientjes <rientjes@google.com>

> Cc: Ingo Molnar <mingo@kernel.org>

> Cc: Roman Gushchin <guro@fb.com>

> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

> Cc: Andrea Arcangeli <aarcange@redhat.com>


Acked-by: Michal Hocko <mhocko@suse.com>


Thanks!

> ---

>  mm/oom_kill.c | 7 ++++---

>  1 file changed, 4 insertions(+), 3 deletions(-)

> 

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c

> index dee0f75..18c5b35 100644

> --- a/mm/oom_kill.c

> +++ b/mm/oom_kill.c

> @@ -532,7 +532,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)

>  	 */

>  	set_bit(MMF_UNSTABLE, &mm->flags);

>  

> -	tlb_gather_mmu(&tlb, mm, 0, -1);

>  	for (vma = mm->mmap ; vma; vma = vma->vm_next) {

>  		if (!can_madv_dontneed_vma(vma))

>  			continue;

> @@ -547,11 +546,13 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)

>  		 * we do not want to block exit_mmap by keeping mm ref

>  		 * count elevated without a good reason.

>  		 */

> -		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))

> +		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {

> +			tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end);

>  			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,

>  					 NULL);

> +			tlb_finish_mmu(&tlb, vma->vm_start, vma->vm_end);

> +		}

>  	}

> -	tlb_finish_mmu(&tlb, 0, -1);

>  	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",

>  			task_pid_nr(tsk), tsk->comm,

>  			K(get_mm_counter(mm, MM_ANONPAGES)),

> -- 

> 2.10.1

> 


-- 
Michal Hocko
SUSE Labs
Minchan Kim Nov. 10, 2017, 12:19 a.m. UTC | #2
On Tue, Nov 07, 2017 at 09:54:53AM +0000, Wang Nan wrote:
> tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory

> space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't

> flush TLB when tlb->fullmm is true:

> 

>   commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1").

> 

> Which makes leaking of tlb entries.


That means soft-dirty, which uses tlb_gather_mmu with fullmm, could be
broken by losing the write-protection bit once arm64 gains support for it
in the future?

If so, it would be better to use TASK_SIZE rather than -1 in tlb_gather_mmu.
Of course, that's off-topic.

However, I want to add a big fat comment to tlb_gather_mmu warning that "TLB
flushing with (0, -1) can be skipped on some architectures" so that upcoming
users are aware of it.

Thanks.

> 

> Will clarifies his patch:

> 

> > Basically, we tag each address space with an ASID (PCID on x86) which

> > is resident in the TLB. This means we can elide TLB invalidation when

> > pulling down a full mm because we won't ever assign that ASID to another mm

> > without doing TLB invalidation elsewhere (which actually just nukes the

> > whole TLB).

> >

> > I think that means that we could potentially not fault on a kernel uaccess,

> > because we could hit in the TLB.

> 

> There could be a window between complete_signal() sending IPI to other

> cores and all threads sharing this mm are really kicked off from cores.

> In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush

> TLB then frees pages. However, due to the above problem, the TLB entries

> are not really flushed on arm64. Other threads are possible to access

> these pages through TLB entries. Moreover, a copy_to_user() can also

> write to these pages without generating page fault, causes use-after-free

> bugs.

> 

> This patch gathers each vma instead of gathering full vm space.

> In this case tlb->fullmm is not true. The behavior of oom reaper become

> similar to munmapping before do_exit, which should be safe for all archs.

> 

> Signed-off-by: Wang Nan <wangnan0@huawei.com>

> Cc: Bob Liu <liubo95@huawei.com>

> Cc: Andrew Morton <akpm@linux-foundation.org>

> Cc: Michal Hocko <mhocko@suse.com>

> Cc: David Rientjes <rientjes@google.com>

> Cc: Ingo Molnar <mingo@kernel.org>

> Cc: Roman Gushchin <guro@fb.com>

> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

> Cc: Andrea Arcangeli <aarcange@redhat.com>

> ---

>  mm/oom_kill.c | 7 ++++---

>  1 file changed, 4 insertions(+), 3 deletions(-)

> 

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c

> index dee0f75..18c5b35 100644

> --- a/mm/oom_kill.c

> +++ b/mm/oom_kill.c

> @@ -532,7 +532,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)

>  	 */

>  	set_bit(MMF_UNSTABLE, &mm->flags);

>  

> -	tlb_gather_mmu(&tlb, mm, 0, -1);

>  	for (vma = mm->mmap ; vma; vma = vma->vm_next) {

>  		if (!can_madv_dontneed_vma(vma))

>  			continue;

> @@ -547,11 +546,13 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)

>  		 * we do not want to block exit_mmap by keeping mm ref

>  		 * count elevated without a good reason.

>  		 */

> -		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))

> +		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {

> +			tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end);

>  			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,

>  					 NULL);

> +			tlb_finish_mmu(&tlb, vma->vm_start, vma->vm_end);

> +		}

>  	}

> -	tlb_finish_mmu(&tlb, 0, -1);

>  	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",

>  			task_pid_nr(tsk), tsk->comm,

>  			K(get_mm_counter(mm, MM_ANONPAGES)),

> -- 

> 2.10.1

> 

Michal Hocko Nov. 10, 2017, 10:15 a.m. UTC | #3
On Fri 10-11-17 09:19:33, Minchan Kim wrote:
> On Tue, Nov 07, 2017 at 09:54:53AM +0000, Wang Nan wrote:

> > tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory

> > space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't

> > flush TLB when tlb->fullmm is true:

> > 

> >   commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1").

> > 

> > Which makes leaking of tlb entries.

> 

> That means soft-dirty which has used tlb_gather_mmu with fullmm could be

> broken via losing write-protection bit once it supports arm64 in future?

> 

> If so, it would be better to use TASK_SIZE rather than -1 in tlb_gather_mmu.

> Of course, it's a off-topic.


I wouldn't play tricks like that. And maybe the API itself could be more
explicit. E.g. add a lazy parameter which would allow arch-specific code
to skip the flush if it is sure that nobody can actually stumble over a
missed flush. E.g. the following?

diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h
index d5562f9ce600..fe9042aee8e9 100644
--- a/arch/arm/include/asm/tlb.h
+++ b/arch/arm/include/asm/tlb.h
@@ -149,7 +149,8 @@ static inline void tlb_flush_mmu(struct mmu_gather *tlb)
 
 static inline void
 arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
-			unsigned long start, unsigned long end)
+			unsigned long start, unsigned long end,
+			bool lazy)
 {
 	tlb->mm = mm;
 	tlb->fullmm = !(start | (end+1));
diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index ffdaea7954bb..7adde19b2bcc 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -43,7 +43,7 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 	 * The ASID allocator will either invalidate the ASID or mark
 	 * it as used.
 	 */
-	if (tlb->fullmm)
+	if (tlb->lazy)
 		return;
 
 	/*
diff --git a/arch/ia64/include/asm/tlb.h b/arch/ia64/include/asm/tlb.h
index cbe5ac3699bf..50c440f5b7bc 100644
--- a/arch/ia64/include/asm/tlb.h
+++ b/arch/ia64/include/asm/tlb.h
@@ -169,7 +169,8 @@ static inline void __tlb_alloc_page(struct mmu_gather *tlb)
 
 static inline void
 arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
-			unsigned long start, unsigned long end)
+			unsigned long start, unsigned long end,
+			bool lazy)
 {
 	tlb->mm = mm;
 	tlb->max = ARRAY_SIZE(tlb->local);
diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 2eb8ff0d6fca..2310657b64c4 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -49,7 +49,8 @@ extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
 static inline void
 arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
-			unsigned long start, unsigned long end)
+			unsigned long start, unsigned long end,
+			bool lazy)
 {
 	tlb->mm = mm;
 	tlb->start = start;
diff --git a/arch/sh/include/asm/tlb.h b/arch/sh/include/asm/tlb.h
index 51a8bc967e75..ae4c50a7c1ec 100644
--- a/arch/sh/include/asm/tlb.h
+++ b/arch/sh/include/asm/tlb.h
@@ -37,7 +37,7 @@ static inline void init_tlb_gather(struct mmu_gather *tlb)
 
 static inline void
 arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
-		unsigned long start, unsigned long end)
+		unsigned long start, unsigned long end, bool lazy)
 {
 	tlb->mm = mm;
 	tlb->start = start;
diff --git a/arch/um/include/asm/tlb.h b/arch/um/include/asm/tlb.h
index 344d95619d03..f24af66d07a4 100644
--- a/arch/um/include/asm/tlb.h
+++ b/arch/um/include/asm/tlb.h
@@ -46,7 +46,7 @@ static inline void init_tlb_gather(struct mmu_gather *tlb)
 
 static inline void
 arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
-		unsigned long start, unsigned long end)
+		unsigned long start, unsigned long end, bool lazy)
 {
 	tlb->mm = mm;
 	tlb->start = start;
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index faddde44de8c..c312fcd5a953 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -102,6 +102,7 @@ struct mmu_gather {
 	/* we have performed an operation which
 	 * requires a complete flush of the tlb */
 				need_flush_all : 1;
+				lazy : 1;
 
 	struct mmu_gather_batch *active;
 	struct mmu_gather_batch	local;
@@ -113,7 +114,8 @@ struct mmu_gather {
 #define HAVE_GENERIC_MMU_GATHER
 
 void arch_tlb_gather_mmu(struct mmu_gather *tlb,
-	struct mm_struct *mm, unsigned long start, unsigned long end);
+	struct mm_struct *mm, unsigned long start, unsigned long end,
+	bool lazy);
 void tlb_flush_mmu(struct mmu_gather *tlb);
 void arch_tlb_finish_mmu(struct mmu_gather *tlb,
 			 unsigned long start, unsigned long end, bool force);
diff --git a/mm/memory.c b/mm/memory.c
index 590709e84a43..7dfdd4d8224f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -218,13 +218,15 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
 }
 
 void arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
-				unsigned long start, unsigned long end)
+				unsigned long start, unsigned long end,
+				bool lazy)
 {
 	tlb->mm = mm;
 
 	/* Is it from 0 to ~0? */
 	tlb->fullmm     = !(start | (end+1));
 	tlb->need_flush_all = 0;
+	tlb->lazy	= lazy;
 	tlb->local.next = NULL;
 	tlb->local.nr   = 0;
 	tlb->local.max  = ARRAY_SIZE(tlb->__pages);
@@ -408,7 +410,18 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
 void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 			unsigned long start, unsigned long end)
 {
-	arch_tlb_gather_mmu(tlb, mm, start, end);
+	arch_tlb_gather_mmu(tlb, mm, start, end, false);
+	inc_tlb_flush_pending(tlb->mm);
+}
+
+/* tlb_gather_mmu_lazy
+ * 	Basically same as tlb_gather_mmu except it allows architectures to
+ * 	skip tlb flushing if they can ensure that nobody will reuse tlb entries
+ */
+void tlb_gather_mmu_lazy(struct mmu_gather *tlb, struct mm_struct *mm,
+			unsigned long start, unsigned long end)
+{
+	arch_tlb_gather_mmu(tlb, mm, start, end, true);
 	inc_tlb_flush_pending(tlb->mm);
 }
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 680506faceae..43594a6a2eac 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2997,7 +2997,7 @@ void exit_mmap(struct mm_struct *mm)
 
 	lru_add_drain();
 	flush_cache_mm(mm);
-	tlb_gather_mmu(&tlb, mm, 0, -1);
+	tlb_gather_mmu_lazy(&tlb, mm, 0, -1);
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);

> However, I want to add a big fat comment in tlb_gather_mmu to warn "TLB

> flushing with (0, -1) can be skipped on some architectures" so upcoming

> users can care of.


Yes, this would be really useful if the API is not explicit.
-- 
Michal Hocko
SUSE Labs
Minchan Kim Nov. 13, 2017, 12:28 a.m. UTC | #4
On Fri, Nov 10, 2017 at 01:26:35PM +0100, Michal Hocko wrote:
> On Fri 10-11-17 11:15:29, Michal Hocko wrote:

> > On Fri 10-11-17 09:19:33, Minchan Kim wrote:

> > > On Tue, Nov 07, 2017 at 09:54:53AM +0000, Wang Nan wrote:

> > > > tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory

> > > > space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't

> > > > flush TLB when tlb->fullmm is true:

> > > > 

> > > >   commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1").

> > > > 

> > > > Which makes leaking of tlb entries.

> > > 

> > > That means soft-dirty which has used tlb_gather_mmu with fullmm could be

> > > broken via losing write-protection bit once it supports arm64 in future?

> > > 

> > > If so, it would be better to use TASK_SIZE rather than -1 in tlb_gather_mmu.

> > > Of course, it's a off-topic.

> > 

> > I wouldn't play tricks like that. And maybe the API itself could be more

> > explicit. E.g. add a lazy parameter which would allow arch specific code

> > to not flush if it is sure that nobody can actually stumble over missed

> > flush. E.g. the following?

> 

> This one has a changelog and even compiles on my crosscompile test

> ---

> From 7f0fcd2cab379ddac5611b2a520cdca8a77a235b Mon Sep 17 00:00:00 2001

> From: Michal Hocko <mhocko@suse.com>

> Date: Fri, 10 Nov 2017 11:27:17 +0100

> Subject: [PATCH] arch, mm: introduce arch_tlb_gather_mmu_lazy

> 

> 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1") has

> introduced an optimization to not flush tlb when we are tearing the

> whole address space down. Will goes on to explain

> 

> : Basically, we tag each address space with an ASID (PCID on x86) which

> : is resident in the TLB. This means we can elide TLB invalidation when

> : pulling down a full mm because we won't ever assign that ASID to

> : another mm without doing TLB invalidation elsewhere (which actually

> : just nukes the whole TLB).

> 

> This all is nice but tlb_gather users are not aware of that and this can

> actually cause some real problems. E.g. the oom_reaper tries to reap the

> whole address space but it might race with threads accessing the memory [1].

> It is possible that soft-dirty handling might suffer from the same

> problem [2].

> 

> Introduce an explicit lazy variant tlb_gather_mmu_lazy which allows the

> behavior arm64 implements for the fullmm case and replace it by an

> explicit lazy flag in the mmu_gather structure. exit_mmap path is then

> turned into the explicit lazy variant. Other architectures simply ignore

> the flag.

> 

> [1] http://lkml.kernel.org/r/20171106033651.172368-1-wangnan0@huawei.com

> [2] http://lkml.kernel.org/r/20171110001933.GA12421@bbox

> Signed-off-by: Michal Hocko <mhocko@suse.com>

> ---

>  arch/arm/include/asm/tlb.h   |  3 ++-

>  arch/arm64/include/asm/tlb.h |  2 +-

>  arch/ia64/include/asm/tlb.h  |  3 ++-

>  arch/s390/include/asm/tlb.h  |  3 ++-

>  arch/sh/include/asm/tlb.h    |  2 +-

>  arch/um/include/asm/tlb.h    |  2 +-

>  include/asm-generic/tlb.h    |  6 ++++--

>  include/linux/mm_types.h     |  2 ++

>  mm/memory.c                  | 17 +++++++++++++++--

>  mm/mmap.c                    |  2 +-

>  10 files changed, 31 insertions(+), 11 deletions(-)

> 

> diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h

> index d5562f9ce600..fe9042aee8e9 100644

> --- a/arch/arm/include/asm/tlb.h

> +++ b/arch/arm/include/asm/tlb.h

> @@ -149,7 +149,8 @@ static inline void tlb_flush_mmu(struct mmu_gather *tlb)

>  

>  static inline void

>  arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,

> -			unsigned long start, unsigned long end)

> +			unsigned long start, unsigned long end,

> +			bool lazy)

> 


Thanks for the patch, Michal.
However, it would be nice to do it transparently, without requiring
new flags from users.

When I read tlb_gather_mmu's description, fullmm is supposed to
be used only if there are no users and the whole address space is covered.

That means we could do it in the API itself, like this?

void arch_tlb_gather_mmu(...)

        tlb->fullmm = !(start | (end + 1)) && atomic_read(&mm->mm_users) == 0;
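
A rough sketch of how that check might sit in the generic
arch_tlb_gather_mmu() in mm/memory.c (abridged from the hunk in Michal's
patch above; only the fullmm assignment changes, the rest of the
initialisation is left out):

void arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
				unsigned long start, unsigned long end)
{
	tlb->mm = mm;

	/*
	 * Treat this as a full-mm teardown (and let arch code elide the
	 * TLB flush) only when the whole range is covered *and* no user
	 * of the address space remains.
	 */
	tlb->fullmm	= !(start | (end+1)) &&
			  atomic_read(&mm->mm_users) == 0;
	tlb->need_flush_all = 0;
	/* ... remaining mmu_gather initialisation as in the current code ... */
}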
Michal Hocko Nov. 13, 2017, 9:51 a.m. UTC | #5
On Mon 13-11-17 09:28:33, Minchan Kim wrote:
[...]
> Thanks for the patch, Michal.

> However, it would be nice to do it tranparently without asking

> new flags from users.

> 

> When I read tlb_gather_mmu's description, fullmm is supposed to

> be used only if there is no users and full address space.

> 

> That means we can do it API itself like this?

> 

> void arch_tlb_gather_mmu(...)

> 

>         tlb->fullmm = !(start | (end + 1)) && atomic_read(&mm->mm_users) == 0;


I do not have a strong opinion here. The optimization is quite subtle so
calling it explicitly sounds like a less surprising behavior to me
longterm. Note that I haven't checked all fullmm users.

-- 
Michal Hocko
SUSE Labs
Minchan Kim Nov. 14, 2017, 1:45 a.m. UTC | #6
On Mon, Nov 13, 2017 at 10:51:07AM +0100, Michal Hocko wrote:
> On Mon 13-11-17 09:28:33, Minchan Kim wrote:

> [...]

> > Thanks for the patch, Michal.

> > However, it would be nice to do it tranparently without asking

> > new flags from users.

> > 

> > When I read tlb_gather_mmu's description, fullmm is supposed to

> > be used only if there is no users and full address space.

> > 

> > That means we can do it API itself like this?

> > 

> > void arch_tlb_gather_mmu(...)

> > 

> >         tlb->fullmm = !(start | (end + 1)) && atomic_read(&mm->mm_users) == 0;

> 

> I do not have a strong opinion here. The optimization is quite subtle so

> calling it explicitly sounds like a less surprising behavior to me

> longterm. Note that I haven't checked all fullmm users.


Per the description of tlb_gather_mmu and 4d6ddfa9242b, setting fullmm to
true should guarantee there are *no users* of the mm_struct, so I think
my suggestion is not about optimization but about keeping the semantic
"no one should be able to access the address space when the entire
address space is destroyed".

If you want to be more explicit, we should add some description
about "where can we use lazy mode". I think it would have to explain
architecture internals for users to understand. I'm not sure that is
worth it when we can do it transparently.

I'm not strongly against your approach, either.

Anyway, I think Wang Nan's patch is already broken.
http://lkml.kernel.org/r/%3C20171107095453.179940-1-wangnan0@huawei.com%3E

Because unmap_page_range (i.e. zap_pte_range) can flush the TLB forcefully
and free pages, yet the architecture TLB flush code cannot flush at all
due to the wrong fullmm, so other threads can write to freed pages.

Thanks.
Michal Hocko Nov. 14, 2017, 7:21 a.m. UTC | #7
On Tue 14-11-17 10:45:49, Minchan Kim wrote:
[...]
> Anyway, I think Wang Nan's patch is already broken.

> http://lkml.kernel.org/r/%3C20171107095453.179940-1-wangnan0@huawei.com%3E

> 

> Because unmap_page_range(ie, zap_pte_range) can flush TLB forcefully

> and free pages. However, the architecture code for TLB flush cannot

> flush at all by wrong fullmm so other threads can write freed-page.


I am not sure I understand what you mean. How is that any different from
any other explicit partial madvise call?
-- 
Michal Hocko
SUSE Labs
Minchan Kim Nov. 15, 2017, 12:12 a.m. UTC | #8
On Tue, Nov 14, 2017 at 08:21:00AM +0100, Michal Hocko wrote:
> On Tue 14-11-17 10:45:49, Minchan Kim wrote:

> [...]

> > Anyway, I think Wang Nan's patch is already broken.

> > http://lkml.kernel.org/r/%3C20171107095453.179940-1-wangnan0@huawei.com%3E

> > 

> > Because unmap_page_range(ie, zap_pte_range) can flush TLB forcefully

> > and free pages. However, the architecture code for TLB flush cannot

> > flush at all by wrong fullmm so other threads can write freed-page.

> 

> I am not sure I understand what you mean. How is that any different from

> any other explicit partial madvise call?


Argh, I misread his code. Sorry for that.
Michal Hocko Nov. 15, 2017, 8:14 a.m. UTC | #9
On Mon 13-11-17 09:28:33, Minchan Kim wrote:
[...]
> void arch_tlb_gather_mmu(...)

> 

>         tlb->fullmm = !(start | (end + 1)) && atomic_read(&mm->mm_users) == 0;


Sorry, I should have realized sooner, but this will not work for the oom
reaper. It _can_ race with the final exit_mmap and run with mm_users == 0.
-- 
Michal Hocko
SUSE Labs
Will Deacon Nov. 15, 2017, 5:33 p.m. UTC | #10
Hi Michal,

On Fri, Nov 10, 2017 at 01:26:35PM +0100, Michal Hocko wrote:
> From 7f0fcd2cab379ddac5611b2a520cdca8a77a235b Mon Sep 17 00:00:00 2001

> From: Michal Hocko <mhocko@suse.com>

> Date: Fri, 10 Nov 2017 11:27:17 +0100

> Subject: [PATCH] arch, mm: introduce arch_tlb_gather_mmu_lazy

> 

> 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1") has

> introduced an optimization to not flush tlb when we are tearing the

> whole address space down. Will goes on to explain

> 

> : Basically, we tag each address space with an ASID (PCID on x86) which

> : is resident in the TLB. This means we can elide TLB invalidation when

> : pulling down a full mm because we won't ever assign that ASID to

> : another mm without doing TLB invalidation elsewhere (which actually

> : just nukes the whole TLB).

> 

> This all is nice but tlb_gather users are not aware of that and this can

> actually cause some real problems. E.g. the oom_reaper tries to reap the

> whole address space but it might race with threads accessing the memory [1].

> It is possible that soft-dirty handling might suffer from the same

> problem [2].

> 

> Introduce an explicit lazy variant tlb_gather_mmu_lazy which allows the

> behavior arm64 implements for the fullmm case and replace it by an

> explicit lazy flag in the mmu_gather structure. exit_mmap path is then

> turned into the explicit lazy variant. Other architectures simply ignore

> the flag.

> 

> [1] http://lkml.kernel.org/r/20171106033651.172368-1-wangnan0@huawei.com

> [2] http://lkml.kernel.org/r/20171110001933.GA12421@bbox

> Signed-off-by: Michal Hocko <mhocko@suse.com>

> ---

>  arch/arm/include/asm/tlb.h   |  3 ++-

>  arch/arm64/include/asm/tlb.h |  2 +-

>  arch/ia64/include/asm/tlb.h  |  3 ++-

>  arch/s390/include/asm/tlb.h  |  3 ++-

>  arch/sh/include/asm/tlb.h    |  2 +-

>  arch/um/include/asm/tlb.h    |  2 +-

>  include/asm-generic/tlb.h    |  6 ++++--

>  include/linux/mm_types.h     |  2 ++

>  mm/memory.c                  | 17 +++++++++++++++--

>  mm/mmap.c                    |  2 +-

>  10 files changed, 31 insertions(+), 11 deletions(-)

> 

> diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h

> index d5562f9ce600..fe9042aee8e9 100644

> --- a/arch/arm/include/asm/tlb.h

> +++ b/arch/arm/include/asm/tlb.h

> @@ -149,7 +149,8 @@ static inline void tlb_flush_mmu(struct mmu_gather *tlb)

>  

>  static inline void

>  arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,

> -			unsigned long start, unsigned long end)

> +			unsigned long start, unsigned long end,

> +			bool lazy)

>  {

>  	tlb->mm = mm;

>  	tlb->fullmm = !(start | (end+1));

> diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h

> index ffdaea7954bb..7adde19b2bcc 100644

> --- a/arch/arm64/include/asm/tlb.h

> +++ b/arch/arm64/include/asm/tlb.h

> @@ -43,7 +43,7 @@ static inline void tlb_flush(struct mmu_gather *tlb)

>  	 * The ASID allocator will either invalidate the ASID or mark

>  	 * it as used.

>  	 */

> -	if (tlb->fullmm)

> +	if (tlb->lazy)

>  		return;


This looks like the right idea, but I'd rather make this check:

	if (tlb->fullmm && tlb->lazy)

since the optimisation doesn't work for anything other than tearing down the
entire address space.

Alternatively, I could actually go check MMF_UNSTABLE in tlb->mm, which
would save you having to add an extra flag in the first place, e.g.:

	if (tlb->fullmm && !test_bit(MMF_UNSTABLE, &tlb->mm->flags))

which is a nice one-liner.

Will
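
A condensed sketch of the two alternatives Will mentions, applied to the
arm64 tlb_flush() hunk quoted above (only one of the checks would be used
in practice; MMF_UNSTABLE is the flag the oom reaper already sets on the mm):

static inline void tlb_flush(struct mmu_gather *tlb)
{
	/* (a) keep the new flag, but honour it only for a full-mm gather */
	if (tlb->fullmm && tlb->lazy)
		return;

	/*
	 * (b) alternatively, drop the flag and elide the flush only when
	 * the mm has not been marked unstable by the oom reaper:
	 *
	 *	if (tlb->fullmm && !test_bit(MMF_UNSTABLE, &tlb->mm->flags))
	 *		return;
	 */

	/* a ranged TLB invalidation follows here in the real code */
}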
Minchan Kim Nov. 16, 2017, 12:44 a.m. UTC | #11
On Wed, Nov 15, 2017 at 09:14:52AM +0100, Michal Hocko wrote:
> On Mon 13-11-17 09:28:33, Minchan Kim wrote:

> [...]

> > void arch_tlb_gather_mmu(...)

> > 

> >         tlb->fullmm = !(start | (end + 1)) && atomic_read(&mm->mm_users) == 0;

> 

> Sorry, I should have realized sooner but this will not work for the oom

> reaper. It _can_ race with the final exit_mmap and run with mm_users == 0


If someone sees mm_users is zero, it means there is no user left to access
the address space through a stale TLB. Am I missing something?
Michal Hocko Nov. 16, 2017, 9:19 a.m. UTC | #12
On Thu 16-11-17 09:44:57, Minchan Kim wrote:
> On Wed, Nov 15, 2017 at 09:14:52AM +0100, Michal Hocko wrote:

> > On Mon 13-11-17 09:28:33, Minchan Kim wrote:

> > [...]

> > > void arch_tlb_gather_mmu(...)

> > > 

> > >         tlb->fullmm = !(start | (end + 1)) && atomic_read(&mm->mm_users) == 0;

> > 

> > Sorry, I should have realized sooner but this will not work for the oom

> > reaper. It _can_ race with the final exit_mmap and run with mm_users == 0

> 

> If someone see mm_users is zero, it means there is no user to access

> address space by stale TLB. Am I missing something?


You are probably right, but changing the flushing policy in the middle of
the address space teardown makes me nervous. While this might work
right now, it is kind of tricky and has some potential to kick us
back in the future. Just note how the current arm64 optimization went
unnoticed because the oom reaper is such a rare event that nobody
has actually noticed this. And I suspect the likelihood of failure
is too low for anybody to notice in real life.

So I would very much like to make the behavior really explicit for
everybody to see what is going on there.

-- 
Michal Hocko
SUSE Labs
Michal Hocko Nov. 16, 2017, 9:20 a.m. UTC | #13
On Wed 15-11-17 17:33:32, Will Deacon wrote:
> Hi Michal,

> 

> On Fri, Nov 10, 2017 at 01:26:35PM +0100, Michal Hocko wrote:

> > From 7f0fcd2cab379ddac5611b2a520cdca8a77a235b Mon Sep 17 00:00:00 2001

> > From: Michal Hocko <mhocko@suse.com>

> > Date: Fri, 10 Nov 2017 11:27:17 +0100

> > Subject: [PATCH] arch, mm: introduce arch_tlb_gather_mmu_lazy

> > 

> > 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1") has

> > introduced an optimization to not flush tlb when we are tearing the

> > whole address space down. Will goes on to explain

> > 

> > : Basically, we tag each address space with an ASID (PCID on x86) which

> > : is resident in the TLB. This means we can elide TLB invalidation when

> > : pulling down a full mm because we won't ever assign that ASID to

> > : another mm without doing TLB invalidation elsewhere (which actually

> > : just nukes the whole TLB).

> > 

> > This all is nice but tlb_gather users are not aware of that and this can

> > actually cause some real problems. E.g. the oom_reaper tries to reap the

> > whole address space but it might race with threads accessing the memory [1].

> > It is possible that soft-dirty handling might suffer from the same

> > problem [2].

> > 

> > Introduce an explicit lazy variant tlb_gather_mmu_lazy which allows the

> > behavior arm64 implements for the fullmm case and replace it by an

> > explicit lazy flag in the mmu_gather structure. exit_mmap path is then

> > turned into the explicit lazy variant. Other architectures simply ignore

> > the flag.

> > 

> > [1] http://lkml.kernel.org/r/20171106033651.172368-1-wangnan0@huawei.com

> > [2] http://lkml.kernel.org/r/20171110001933.GA12421@bbox

> > Signed-off-by: Michal Hocko <mhocko@suse.com>

> > ---

> >  arch/arm/include/asm/tlb.h   |  3 ++-

> >  arch/arm64/include/asm/tlb.h |  2 +-

> >  arch/ia64/include/asm/tlb.h  |  3 ++-

> >  arch/s390/include/asm/tlb.h  |  3 ++-

> >  arch/sh/include/asm/tlb.h    |  2 +-

> >  arch/um/include/asm/tlb.h    |  2 +-

> >  include/asm-generic/tlb.h    |  6 ++++--

> >  include/linux/mm_types.h     |  2 ++

> >  mm/memory.c                  | 17 +++++++++++++++--

> >  mm/mmap.c                    |  2 +-

> >  10 files changed, 31 insertions(+), 11 deletions(-)

> > 

> > diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h

> > index d5562f9ce600..fe9042aee8e9 100644

> > --- a/arch/arm/include/asm/tlb.h

> > +++ b/arch/arm/include/asm/tlb.h

> > @@ -149,7 +149,8 @@ static inline void tlb_flush_mmu(struct mmu_gather *tlb)

> >  

> >  static inline void

> >  arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,

> > -			unsigned long start, unsigned long end)

> > +			unsigned long start, unsigned long end,

> > +			bool lazy)

> >  {

> >  	tlb->mm = mm;

> >  	tlb->fullmm = !(start | (end+1));

> > diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h

> > index ffdaea7954bb..7adde19b2bcc 100644

> > --- a/arch/arm64/include/asm/tlb.h

> > +++ b/arch/arm64/include/asm/tlb.h

> > @@ -43,7 +43,7 @@ static inline void tlb_flush(struct mmu_gather *tlb)

> >  	 * The ASID allocator will either invalidate the ASID or mark

> >  	 * it as used.

> >  	 */

> > -	if (tlb->fullmm)

> > +	if (tlb->lazy)

> >  		return;

> 

> This looks like the right idea, but I'd rather make this check:

> 

> 	if (tlb->fullmm && tlb->lazy)

> 

> since the optimisation doesn't work for anything than tearing down the

> entire address space.


OK, that makes sense.

> Alternatively, I could actually go check MMF_UNSTABLE in tlb->mm, which

> would save you having to add an extra flag in the first place, e.g.:

> 

> 	if (tlb->fullmm && !test_bit(MMF_UNSTABLE, &tlb->mm->flags))

> 

> which is a nice one-liner.


But that would make it oom_reaper-specific. What about the soft-dirty
case Minchan mentioned earlier?
-- 
Michal Hocko
SUSE Labs
Will Deacon Nov. 20, 2017, 2:24 p.m. UTC | #14
On Thu, Nov 16, 2017 at 10:20:42AM +0100, Michal Hocko wrote:
> On Wed 15-11-17 17:33:32, Will Deacon wrote:

> > Hi Michal,

> > 

> > On Fri, Nov 10, 2017 at 01:26:35PM +0100, Michal Hocko wrote:

> > > From 7f0fcd2cab379ddac5611b2a520cdca8a77a235b Mon Sep 17 00:00:00 2001

> > > From: Michal Hocko <mhocko@suse.com>

> > > Date: Fri, 10 Nov 2017 11:27:17 +0100

> > > Subject: [PATCH] arch, mm: introduce arch_tlb_gather_mmu_lazy

> > > 

> > > 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1") has

> > > introduced an optimization to not flush tlb when we are tearing the

> > > whole address space down. Will goes on to explain

> > > 

> > > : Basically, we tag each address space with an ASID (PCID on x86) which

> > > : is resident in the TLB. This means we can elide TLB invalidation when

> > > : pulling down a full mm because we won't ever assign that ASID to

> > > : another mm without doing TLB invalidation elsewhere (which actually

> > > : just nukes the whole TLB).

> > > 

> > > This all is nice but tlb_gather users are not aware of that and this can

> > > actually cause some real problems. E.g. the oom_reaper tries to reap the

> > > whole address space but it might race with threads accessing the memory [1].

> > > It is possible that soft-dirty handling might suffer from the same

> > > problem [2].

> > > 

> > > Introduce an explicit lazy variant tlb_gather_mmu_lazy which allows the

> > > behavior arm64 implements for the fullmm case and replace it by an

> > > explicit lazy flag in the mmu_gather structure. exit_mmap path is then

> > > turned into the explicit lazy variant. Other architectures simply ignore

> > > the flag.

> > > 

> > > [1] http://lkml.kernel.org/r/20171106033651.172368-1-wangnan0@huawei.com

> > > [2] http://lkml.kernel.org/r/20171110001933.GA12421@bbox

> > > Signed-off-by: Michal Hocko <mhocko@suse.com>

> > > ---

> > >  arch/arm/include/asm/tlb.h   |  3 ++-

> > >  arch/arm64/include/asm/tlb.h |  2 +-

> > >  arch/ia64/include/asm/tlb.h  |  3 ++-

> > >  arch/s390/include/asm/tlb.h  |  3 ++-

> > >  arch/sh/include/asm/tlb.h    |  2 +-

> > >  arch/um/include/asm/tlb.h    |  2 +-

> > >  include/asm-generic/tlb.h    |  6 ++++--

> > >  include/linux/mm_types.h     |  2 ++

> > >  mm/memory.c                  | 17 +++++++++++++++--

> > >  mm/mmap.c                    |  2 +-

> > >  10 files changed, 31 insertions(+), 11 deletions(-)

> > > 

> > > diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h

> > > index d5562f9ce600..fe9042aee8e9 100644

> > > --- a/arch/arm/include/asm/tlb.h

> > > +++ b/arch/arm/include/asm/tlb.h

> > > @@ -149,7 +149,8 @@ static inline void tlb_flush_mmu(struct mmu_gather *tlb)

> > >  

> > >  static inline void

> > >  arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,

> > > -			unsigned long start, unsigned long end)

> > > +			unsigned long start, unsigned long end,

> > > +			bool lazy)

> > >  {

> > >  	tlb->mm = mm;

> > >  	tlb->fullmm = !(start | (end+1));

> > > diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h

> > > index ffdaea7954bb..7adde19b2bcc 100644

> > > --- a/arch/arm64/include/asm/tlb.h

> > > +++ b/arch/arm64/include/asm/tlb.h

> > > @@ -43,7 +43,7 @@ static inline void tlb_flush(struct mmu_gather *tlb)

> > >  	 * The ASID allocator will either invalidate the ASID or mark

> > >  	 * it as used.

> > >  	 */

> > > -	if (tlb->fullmm)

> > > +	if (tlb->lazy)

> > >  		return;

> > 

> > This looks like the right idea, but I'd rather make this check:

> > 

> > 	if (tlb->fullmm && tlb->lazy)

> > 

> > since the optimisation doesn't work for anything than tearing down the

> > entire address space.

> 

> OK, that makes sense.

> 

> > Alternatively, I could actually go check MMF_UNSTABLE in tlb->mm, which

> > would save you having to add an extra flag in the first place, e.g.:

> > 

> > 	if (tlb->fullmm && !test_bit(MMF_UNSTABLE, &tlb->mm->flags))

> > 

> > which is a nice one-liner.

> 

> But that would make it oom_reaper specific. What about the softdirty

> case Minchan has mentioned earlier?


We don't (yet) support that on arm64, so we're ok for now. If we do grow
support for it, then I agree that we want a flag to identify the case where
the address space is going away and only elide the invalidation then.

Will
Will Deacon Nov. 22, 2017, 7:30 p.m. UTC | #15
Hi Michal,

On Mon, Nov 20, 2017 at 05:04:22PM +0100, Michal Hocko wrote:
> On Mon 20-11-17 14:24:44, Will Deacon wrote:

> > On Thu, Nov 16, 2017 at 10:20:42AM +0100, Michal Hocko wrote:

> > > On Wed 15-11-17 17:33:32, Will Deacon wrote:

> [...]

> > > > > diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h

> > > > > index ffdaea7954bb..7adde19b2bcc 100644

> > > > > --- a/arch/arm64/include/asm/tlb.h

> > > > > +++ b/arch/arm64/include/asm/tlb.h

> > > > > @@ -43,7 +43,7 @@ static inline void tlb_flush(struct mmu_gather *tlb)

> > > > >  	 * The ASID allocator will either invalidate the ASID or mark

> > > > >  	 * it as used.

> > > > >  	 */

> > > > > -	if (tlb->fullmm)

> > > > > +	if (tlb->lazy)

> > > > >  		return;

> > > > 

> > > > This looks like the right idea, but I'd rather make this check:

> > > > 

> > > > 	if (tlb->fullmm && tlb->lazy)

> > > > 

> > > > since the optimisation doesn't work for anything than tearing down the

> > > > entire address space.

> > > 

> > > OK, that makes sense.

> > > 

> > > > Alternatively, I could actually go check MMF_UNSTABLE in tlb->mm, which

> > > > would save you having to add an extra flag in the first place, e.g.:

> > > > 

> > > > 	if (tlb->fullmm && !test_bit(MMF_UNSTABLE, &tlb->mm->flags))

> > > > 

> > > > which is a nice one-liner.

> > > 

> > > But that would make it oom_reaper specific. What about the softdirty

> > > case Minchan has mentioned earlier?

> > 

> > We don't (yet) support that on arm64, so we're ok for now. If we do grow

> > support for it, then I agree that we want a flag to identify the case where

> > the address space is going away and only elide the invalidation then.

> 

> What do you think about the following patch instead? I have to confess

> I do not really understand the fullmm semantic so I might introduce some

> duplication by this flag. If you think this is a good idea, I will post

> it in a separate thread.



Please do! My only suggestion would be s/lazy/exit/, since I don't think the
optimisation works in any other situation than the address space going away
for good.

Will
Minchan Kim Nov. 23, 2017, 6:18 a.m. UTC | #16
On Wed, Nov 22, 2017 at 07:30:50PM +0000, Will Deacon wrote:
> Hi Michal,

> 

> On Mon, Nov 20, 2017 at 05:04:22PM +0100, Michal Hocko wrote:

> > On Mon 20-11-17 14:24:44, Will Deacon wrote:

> > > On Thu, Nov 16, 2017 at 10:20:42AM +0100, Michal Hocko wrote:

> > > > On Wed 15-11-17 17:33:32, Will Deacon wrote:

> > [...]

> > > > > > diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h

> > > > > > index ffdaea7954bb..7adde19b2bcc 100644

> > > > > > --- a/arch/arm64/include/asm/tlb.h

> > > > > > +++ b/arch/arm64/include/asm/tlb.h

> > > > > > @@ -43,7 +43,7 @@ static inline void tlb_flush(struct mmu_gather *tlb)

> > > > > >  	 * The ASID allocator will either invalidate the ASID or mark

> > > > > >  	 * it as used.

> > > > > >  	 */

> > > > > > -	if (tlb->fullmm)

> > > > > > +	if (tlb->lazy)

> > > > > >  		return;

> > > > > 

> > > > > This looks like the right idea, but I'd rather make this check:

> > > > > 

> > > > > 	if (tlb->fullmm && tlb->lazy)

> > > > > 

> > > > > since the optimisation doesn't work for anything than tearing down the

> > > > > entire address space.

> > > > 

> > > > OK, that makes sense.

> > > > 

> > > > > Alternatively, I could actually go check MMF_UNSTABLE in tlb->mm, which

> > > > > would save you having to add an extra flag in the first place, e.g.:

> > > > > 

> > > > > 	if (tlb->fullmm && !test_bit(MMF_UNSTABLE, &tlb->mm->flags))

> > > > > 

> > > > > which is a nice one-liner.

> > > > 

> > > > But that would make it oom_reaper specific. What about the softdirty

> > > > case Minchan has mentioned earlier?

> > > 

> > > We don't (yet) support that on arm64, so we're ok for now. If we do grow

> > > support for it, then I agree that we want a flag to identify the case where

> > > the address space is going away and only elide the invalidation then.

> > 

> > What do you think about the following patch instead? I have to confess

> > I do not really understand the fullmm semantic so I might introduce some

> > duplication by this flag. If you think this is a good idea, I will post

> > it in a separate thread.

> 

> 

> Please do! My only suggestion would be s/lazy/exit/, since I don't think the

> optimisation works in any other situation than the address space going away

> for good.


Yes, the address space going away. That's why I wanted to add an additional
check that the address space is going away, without adding new flags.

http://lkml.kernel.org/r/<20171113002833.GA18301@bbox>

However, if you guys prefer to add a new flag to distinguish the case, I
prefer "exit" to "lazy". It would also be better to add a WARN_ON to catch
future potential wrong use cases like the OOM reaper.
Anyway, I'm not strongly against it, so it's up to you, Michal.

        WARN_ON_ONCE(exit == true && atomic_read(&mm->mm_users) > 0);
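
A minimal sketch of where such a check could live, using the
tlb_gather_mmu_lazy() wrapper from Michal's patch earlier in the thread
(the placement is only illustrative; in the lazy/exit path the flag is
implicitly true, so only the mm_users part of the check remains):

void tlb_gather_mmu_lazy(struct mmu_gather *tlb, struct mm_struct *mm,
			unsigned long start, unsigned long end)
{
	/*
	 * Lazy/exit mode is only safe once nobody can use the address
	 * space any more; warn if a caller (e.g. the oom reaper) might
	 * still race with live users.
	 */
	WARN_ON_ONCE(atomic_read(&mm->mm_users) > 0);

	arch_tlb_gather_mmu(tlb, mm, start, end, true);
	inc_tlb_flush_pending(tlb->mm);
}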

Patch

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dee0f75..18c5b35 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -532,7 +532,6 @@  static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	 */
 	set_bit(MMF_UNSTABLE, &mm->flags);
 
-	tlb_gather_mmu(&tlb, mm, 0, -1);
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
 		if (!can_madv_dontneed_vma(vma))
 			continue;
@@ -547,11 +546,13 @@  static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 		 * we do not want to block exit_mmap by keeping mm ref
 		 * count elevated without a good reason.
 		 */
-		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
+			tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end);
 			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
 					 NULL);
+			tlb_finish_mmu(&tlb, vma->vm_start, vma->vm_end);
+		}
 	}
-	tlb_finish_mmu(&tlb, 0, -1);
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),