Message ID | 20171107095453.179940-1-wangnan0@huawei.com |
---|---|
State | Accepted |
Commit | 687cb0884a714ff484d038e9190edc874edcf146 |
Series | [RESEND] mm, oom_reaper: gather each vma to prevent leaking TLB entry |
On Tue 07-11-17 09:54:53, Wang Nan wrote: > tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory > space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't > flush TLB when tlb->fullmm is true: > > commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1"). > > Which makes leaking of tlb entries. > > Will clarifies his patch: > > > Basically, we tag each address space with an ASID (PCID on x86) which > > is resident in the TLB. This means we can elide TLB invalidation when > > pulling down a full mm because we won't ever assign that ASID to another mm > > without doing TLB invalidation elsewhere (which actually just nukes the > > whole TLB). > > > > I think that means that we could potentially not fault on a kernel uaccess, > > because we could hit in the TLB. > > There could be a window between complete_signal() sending IPI to other > cores and all threads sharing this mm are really kicked off from cores. > In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush > TLB then frees pages. However, due to the above problem, the TLB entries > are not really flushed on arm64. Other threads are possible to access > these pages through TLB entries. Moreover, a copy_to_user() can also > write to these pages without generating page fault, causes use-after-free > bugs. > > This patch gathers each vma instead of gathering full vm space. > In this case tlb->fullmm is not true. The behavior of oom reaper become > similar to munmapping before do_exit, which should be safe for all archs. This goes all the way down to when the reaper has been introduced. Fixes: aac453635549 ("mm, oom: introduce oom reaper") Maybe we should even Cc stable because a memory corruption will be quite hard to debug. > Signed-off-by: Wang Nan <wangnan0@huawei.com> > Cc: Bob Liu <liubo95@huawei.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Michal Hocko <mhocko@suse.com> > Cc: David Rientjes <rientjes@google.com> > Cc: Ingo Molnar <mingo@kernel.org> > Cc: Roman Gushchin <guro@fb.com> > Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> > Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Thanks! > --- > mm/oom_kill.c | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index dee0f75..18c5b35 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -532,7 +532,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > */ > set_bit(MMF_UNSTABLE, &mm->flags); > > - tlb_gather_mmu(&tlb, mm, 0, -1); > for (vma = mm->mmap ; vma; vma = vma->vm_next) { > if (!can_madv_dontneed_vma(vma)) > continue; > @@ -547,11 +546,13 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > * we do not want to block exit_mmap by keeping mm ref > * count elevated without a good reason. > */ > - if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) > + if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) { > + tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end); > unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end, > NULL); > + tlb_finish_mmu(&tlb, vma->vm_start, vma->vm_end); > + } > } > - tlb_finish_mmu(&tlb, 0, -1); > pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", > task_pid_nr(tsk), tsk->comm, > K(get_mm_counter(mm, MM_ANONPAGES)), > -- > 2.10.1 > -- Michal Hocko SUSE Labs
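For reference, the reaping path after this patch boils down to roughly the loop below. This is only a simplified sketch: reap_vmas is an illustrative name, and the locking, mmu_notifier handling and MMF_UNSTABLE logic of the real __oom_reap_task_mm() are omitted. The point is that each mmu_gather now covers a single VMA instead of (0, -1), so tlb->fullmm stays false and the invalidation is no longer skipped on arm64.

/* Sketch of the per-VMA gathering introduced by this patch. */
static void reap_vmas(struct mm_struct *mm)
{
	struct mmu_gather tlb;
	struct vm_area_struct *vma;

	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (!can_madv_dontneed_vma(vma))
			continue;

		/* Only anonymous/private mappings are reaped. */
		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
			/* One gather per VMA => tlb->fullmm == false. */
			tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end);
			unmap_page_range(&tlb, vma, vma->vm_start,
					 vma->vm_end, NULL);
			/* Flushes the TLB for this range on all archs. */
			tlb_finish_mmu(&tlb, vma->vm_start, vma->vm_end);
		}
	}
}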
On Tue, Nov 07, 2017 at 09:54:53AM +0000, Wang Nan wrote: > tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory > space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't > flush TLB when tlb->fullmm is true: > > commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1"). > > Which makes leaking of tlb entries. That means soft-dirty which has used tlb_gather_mmu with fullmm could be broken via losing write-protection bit once it supports arm64 in future? If so, it would be better to use TASK_SIZE rather than -1 in tlb_gather_mmu. Of course, it's a off-topic. However, I want to add a big fat comment in tlb_gather_mmu to warn "TLB flushing with (0, -1) can be skipped on some architectures" so upcoming users can care of. Thanks. > > Will clarifies his patch: > > > Basically, we tag each address space with an ASID (PCID on x86) which > > is resident in the TLB. This means we can elide TLB invalidation when > > pulling down a full mm because we won't ever assign that ASID to another mm > > without doing TLB invalidation elsewhere (which actually just nukes the > > whole TLB). > > > > I think that means that we could potentially not fault on a kernel uaccess, > > because we could hit in the TLB. > > There could be a window between complete_signal() sending IPI to other > cores and all threads sharing this mm are really kicked off from cores. > In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush > TLB then frees pages. However, due to the above problem, the TLB entries > are not really flushed on arm64. Other threads are possible to access > these pages through TLB entries. Moreover, a copy_to_user() can also > write to these pages without generating page fault, causes use-after-free > bugs. > > This patch gathers each vma instead of gathering full vm space. > In this case tlb->fullmm is not true. The behavior of oom reaper become > similar to munmapping before do_exit, which should be safe for all archs. > > Signed-off-by: Wang Nan <wangnan0@huawei.com> > Cc: Bob Liu <liubo95@huawei.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Michal Hocko <mhocko@suse.com> > Cc: David Rientjes <rientjes@google.com> > Cc: Ingo Molnar <mingo@kernel.org> > Cc: Roman Gushchin <guro@fb.com> > Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> > Cc: Andrea Arcangeli <aarcange@redhat.com> > --- > mm/oom_kill.c | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index dee0f75..18c5b35 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -532,7 +532,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > */ > set_bit(MMF_UNSTABLE, &mm->flags); > > - tlb_gather_mmu(&tlb, mm, 0, -1); > for (vma = mm->mmap ; vma; vma = vma->vm_next) { > if (!can_madv_dontneed_vma(vma)) > continue; > @@ -547,11 +546,13 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > * we do not want to block exit_mmap by keeping mm ref > * count elevated without a good reason. 
> */ > - if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) > + if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) { > + tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end); > unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end, > NULL); > + tlb_finish_mmu(&tlb, vma->vm_start, vma->vm_end); > + } > } > - tlb_finish_mmu(&tlb, 0, -1); > pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", > task_pid_nr(tsk), tsk->comm, > K(get_mm_counter(mm, MM_ANONPAGES)), > -- > 2.10.1
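The concern about (0, -1) follows from how the generic code derives fullmm from the range alone. A minimal illustration (the fullmm assignment is the one from arch_tlb_gather_mmu() in mm/memory.c as quoted later in this thread; the two example calls are hypothetical):

	/* Generic mmu_gather setup: fullmm depends only on the range. */
	tlb->fullmm = !(start | (end + 1));

	tlb_gather_mmu(&tlb, mm, 0, -1);	/* fullmm == 1: arm64 may elide the flush */
	tlb_gather_mmu(&tlb, mm, 0, TASK_SIZE);	/* fullmm == 0: the flush is always issued */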
On Fri 10-11-17 09:19:33, Minchan Kim wrote: > On Tue, Nov 07, 2017 at 09:54:53AM +0000, Wang Nan wrote: > > tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory > > space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't > > flush TLB when tlb->fullmm is true: > > > > commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1"). > > > > Which makes leaking of tlb entries. > > That means soft-dirty which has used tlb_gather_mmu with fullmm could be > broken via losing write-protection bit once it supports arm64 in future? > > If so, it would be better to use TASK_SIZE rather than -1 in tlb_gather_mmu. > Of course, it's a off-topic. I wouldn't play tricks like that. And maybe the API itself could be more explicit. E.g. add a lazy parameter which would allow arch specific code to not flush if it is sure that nobody can actually stumble over missed flush. E.g. the following? diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h index d5562f9ce600..fe9042aee8e9 100644 --- a/arch/arm/include/asm/tlb.h +++ b/arch/arm/include/asm/tlb.h @@ -149,7 +149,8 @@ static inline void tlb_flush_mmu(struct mmu_gather *tlb) static inline void arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, - unsigned long start, unsigned long end) + unsigned long start, unsigned long end, + bool lazy) { tlb->mm = mm; tlb->fullmm = !(start | (end+1)); diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h index ffdaea7954bb..7adde19b2bcc 100644 --- a/arch/arm64/include/asm/tlb.h +++ b/arch/arm64/include/asm/tlb.h @@ -43,7 +43,7 @@ static inline void tlb_flush(struct mmu_gather *tlb) * The ASID allocator will either invalidate the ASID or mark * it as used. */ - if (tlb->fullmm) + if (tlb->lazy) return; /* diff --git a/arch/ia64/include/asm/tlb.h b/arch/ia64/include/asm/tlb.h index cbe5ac3699bf..50c440f5b7bc 100644 --- a/arch/ia64/include/asm/tlb.h +++ b/arch/ia64/include/asm/tlb.h @@ -169,7 +169,8 @@ static inline void __tlb_alloc_page(struct mmu_gather *tlb) static inline void arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, - unsigned long start, unsigned long end) + unsigned long start, unsigned long end, + bool lazy) { tlb->mm = mm; tlb->max = ARRAY_SIZE(tlb->local); diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h index 2eb8ff0d6fca..2310657b64c4 100644 --- a/arch/s390/include/asm/tlb.h +++ b/arch/s390/include/asm/tlb.h @@ -49,7 +49,8 @@ extern void tlb_remove_table(struct mmu_gather *tlb, void *table); static inline void arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, - unsigned long start, unsigned long end) + unsigned long start, unsigned long end, + bool lazy) { tlb->mm = mm; tlb->start = start; diff --git a/arch/sh/include/asm/tlb.h b/arch/sh/include/asm/tlb.h index 51a8bc967e75..ae4c50a7c1ec 100644 --- a/arch/sh/include/asm/tlb.h +++ b/arch/sh/include/asm/tlb.h @@ -37,7 +37,7 @@ static inline void init_tlb_gather(struct mmu_gather *tlb) static inline void arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, - unsigned long start, unsigned long end) + unsigned long start, unsigned long end, bool lazy) { tlb->mm = mm; tlb->start = start; diff --git a/arch/um/include/asm/tlb.h b/arch/um/include/asm/tlb.h index 344d95619d03..f24af66d07a4 100644 --- a/arch/um/include/asm/tlb.h +++ b/arch/um/include/asm/tlb.h @@ -46,7 +46,7 @@ static inline void init_tlb_gather(struct mmu_gather *tlb) static inline void arch_tlb_gather_mmu(struct mmu_gather *tlb, struct 
mm_struct *mm, - unsigned long start, unsigned long end) + unsigned long start, unsigned long end, bool lazy) { tlb->mm = mm; tlb->start = start; diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index faddde44de8c..c312fcd5a953 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -102,6 +102,7 @@ struct mmu_gather { /* we have performed an operation which * requires a complete flush of the tlb */ need_flush_all : 1; + lazy : 1; struct mmu_gather_batch *active; struct mmu_gather_batch local; @@ -113,7 +114,8 @@ struct mmu_gather { #define HAVE_GENERIC_MMU_GATHER void arch_tlb_gather_mmu(struct mmu_gather *tlb, - struct mm_struct *mm, unsigned long start, unsigned long end); + struct mm_struct *mm, unsigned long start, unsigned long end, + bool lazy); void tlb_flush_mmu(struct mmu_gather *tlb); void arch_tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end, bool force); diff --git a/mm/memory.c b/mm/memory.c index 590709e84a43..7dfdd4d8224f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -218,13 +218,15 @@ static bool tlb_next_batch(struct mmu_gather *tlb) } void arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, - unsigned long start, unsigned long end) + unsigned long start, unsigned long end, + bool lazy) { tlb->mm = mm; /* Is it from 0 to ~0? */ tlb->fullmm = !(start | (end+1)); tlb->need_flush_all = 0; + tlb->lazy = lazy; tlb->local.next = NULL; tlb->local.nr = 0; tlb->local.max = ARRAY_SIZE(tlb->__pages); @@ -408,7 +410,18 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table) void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start, unsigned long end) { - arch_tlb_gather_mmu(tlb, mm, start, end); + arch_tlb_gather_mmu(tlb, mm, start, end, false); + inc_tlb_flush_pending(tlb->mm); +} + +/* tlb_gather_mmu_lazy + * Basically same as tlb_gather_mmu except it allows architectures to + * skip tlb flushing if they can ensure that nobody will reuse tlb entries + */ +void tlb_gather_mmu_lazy(struct mmu_gather *tlb, struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + arch_tlb_gather_mmu(tlb, mm, start, end, true); inc_tlb_flush_pending(tlb->mm); } diff --git a/mm/mmap.c b/mm/mmap.c index 680506faceae..43594a6a2eac 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2997,7 +2997,7 @@ void exit_mmap(struct mm_struct *mm) lru_add_drain(); flush_cache_mm(mm); - tlb_gather_mmu(&tlb, mm, 0, -1); + tlb_gather_mmu_lazy(&tlb, mm, 0, -1); /* update_hiwater_rss(mm) here? but nobody should be looking */ /* Use -1 here to ensure all VMAs in the mm are unmapped */ unmap_vmas(&tlb, vma, 0, -1); > However, I want to add a big fat comment in tlb_gather_mmu to warn "TLB > flushing with (0, -1) can be skipped on some architectures" so upcoming > users can care of. Yes, this would be really useful if the api is not explicit. -- Michal Hocko SUSE Labs
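Condensed, the RFC above amounts to the following pieces (a sketch of the generic and arm64 parts only; the remaining architectures just grow the extra parameter and ignore it):

/* mm/memory.c: the existing entry point keeps eager flushing ... */
void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
			unsigned long start, unsigned long end)
{
	arch_tlb_gather_mmu(tlb, mm, start, end, false);
	inc_tlb_flush_pending(tlb->mm);
}

/*
 * ... while the new lazy variant lets an architecture skip the flush
 * when it can guarantee that nobody will reuse the stale TLB entries.
 * In the RFC only exit_mmap() switches over:
 *	tlb_gather_mmu_lazy(&tlb, mm, 0, -1);
 */
void tlb_gather_mmu_lazy(struct mmu_gather *tlb, struct mm_struct *mm,
			unsigned long start, unsigned long end)
{
	arch_tlb_gather_mmu(tlb, mm, start, end, true);
	inc_tlb_flush_pending(tlb->mm);
}

/* arch/arm64/include/asm/tlb.h: the elision keys off the explicit hint. */
static inline void tlb_flush(struct mmu_gather *tlb)
{
	if (tlb->lazy)
		return;
	/* ... the actual ASID/range invalidation follows here ... */
}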
On Fri, Nov 10, 2017 at 01:26:35PM +0100, Michal Hocko wrote: > On Fri 10-11-17 11:15:29, Michal Hocko wrote: > > On Fri 10-11-17 09:19:33, Minchan Kim wrote: > > > On Tue, Nov 07, 2017 at 09:54:53AM +0000, Wang Nan wrote: > > > > tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory > > > > space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't > > > > flush TLB when tlb->fullmm is true: > > > > > > > > commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1"). > > > > > > > > Which makes leaking of tlb entries. > > > > > > That means soft-dirty which has used tlb_gather_mmu with fullmm could be > > > broken via losing write-protection bit once it supports arm64 in future? > > > > > > If so, it would be better to use TASK_SIZE rather than -1 in tlb_gather_mmu. > > > Of course, it's a off-topic. > > > > I wouldn't play tricks like that. And maybe the API itself could be more > > explicit. E.g. add a lazy parameter which would allow arch specific code > > to not flush if it is sure that nobody can actually stumble over missed > > flush. E.g. the following? > > This one has a changelog and even compiles on my crosscompile test > --- > From 7f0fcd2cab379ddac5611b2a520cdca8a77a235b Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.com> > Date: Fri, 10 Nov 2017 11:27:17 +0100 > Subject: [PATCH] arch, mm: introduce arch_tlb_gather_mmu_lazy > > 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1") has > introduced an optimization to not flush tlb when we are tearing the > whole address space down. Will goes on to explain > > : Basically, we tag each address space with an ASID (PCID on x86) which > : is resident in the TLB. This means we can elide TLB invalidation when > : pulling down a full mm because we won't ever assign that ASID to > : another mm without doing TLB invalidation elsewhere (which actually > : just nukes the whole TLB). > > This all is nice but tlb_gather users are not aware of that and this can > actually cause some real problems. E.g. the oom_reaper tries to reap the > whole address space but it might race with threads accessing the memory [1]. > It is possible that soft-dirty handling might suffer from the same > problem [2]. > > Introduce an explicit lazy variant tlb_gather_mmu_lazy which allows the > behavior arm64 implements for the fullmm case and replace it by an > explicit lazy flag in the mmu_gather structure. exit_mmap path is then > turned into the explicit lazy variant. Other architectures simply ignore > the flag. 
> > [1] http://lkml.kernel.org/r/20171106033651.172368-1-wangnan0@huawei.com > [2] http://lkml.kernel.org/r/20171110001933.GA12421@bbox > Signed-off-by: Michal Hocko <mhocko@suse.com> > --- > arch/arm/include/asm/tlb.h | 3 ++- > arch/arm64/include/asm/tlb.h | 2 +- > arch/ia64/include/asm/tlb.h | 3 ++- > arch/s390/include/asm/tlb.h | 3 ++- > arch/sh/include/asm/tlb.h | 2 +- > arch/um/include/asm/tlb.h | 2 +- > include/asm-generic/tlb.h | 6 ++++-- > include/linux/mm_types.h | 2 ++ > mm/memory.c | 17 +++++++++++++++-- > mm/mmap.c | 2 +- > 10 files changed, 31 insertions(+), 11 deletions(-) > > diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h > index d5562f9ce600..fe9042aee8e9 100644 > --- a/arch/arm/include/asm/tlb.h > +++ b/arch/arm/include/asm/tlb.h > @@ -149,7 +149,8 @@ static inline void tlb_flush_mmu(struct mmu_gather *tlb) > > static inline void > arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, > - unsigned long start, unsigned long end) > + unsigned long start, unsigned long end, > + bool lazy) > Thanks for the patch, Michal. However, it would be nice to do it tranparently without asking new flags from users. When I read tlb_gather_mmu's description, fullmm is supposed to be used only if there is no users and full address space. That means we can do it API itself like this? void arch_tlb_gather_mmu(...) tlb->fullmm = !(start | (end + 1)) && atomic_read(&mm->mm_users) == 0;
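Spelled out, the suggestion is to derive the decision inside the generic setup itself instead of taking a flag from callers; a sketch, with only the fullmm computation changed and the rest of arch_tlb_gather_mmu() left as is:

void arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
			 unsigned long start, unsigned long end)
{
	tlb->mm = mm;
	/*
	 * Treat this as a full-mm teardown only when the caller covers the
	 * whole range *and* no user of the address space is left, so a
	 * full-range gather issued while threads are still alive keeps
	 * flushing as usual.
	 */
	tlb->fullmm = !(start | (end + 1)) &&
		      atomic_read(&mm->mm_users) == 0;
	/* ... remaining mmu_gather initialisation unchanged ... */
}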
On Mon 13-11-17 09:28:33, Minchan Kim wrote: [...] > Thanks for the patch, Michal. > However, it would be nice to do it tranparently without asking > new flags from users. > > When I read tlb_gather_mmu's description, fullmm is supposed to > be used only if there is no users and full address space. > > That means we can do it API itself like this? > > void arch_tlb_gather_mmu(...) > > tlb->fullmm = !(start | (end + 1)) && atomic_read(&mm->mm_users) == 0; I do not have a strong opinion here. The optimization is quite subtle so calling it explicitly sounds like a less surprising behavior to me longterm. Note that I haven't checked all fullmm users. -- Michal Hocko SUSE Labs
On Mon, Nov 13, 2017 at 10:51:07AM +0100, Michal Hocko wrote: > On Mon 13-11-17 09:28:33, Minchan Kim wrote: > [...] > > Thanks for the patch, Michal. > > However, it would be nice to do it tranparently without asking > > new flags from users. > > > > When I read tlb_gather_mmu's description, fullmm is supposed to > > be used only if there is no users and full address space. > > > > That means we can do it API itself like this? > > > > void arch_tlb_gather_mmu(...) > > > > tlb->fullmm = !(start | (end + 1)) && atomic_read(&mm->mm_users) == 0; > > I do not have a strong opinion here. The optimization is quite subtle so > calling it explicitly sounds like a less surprising behavior to me > longterm. Note that I haven't checked all fullmm users. Going by the description of tlb_gather_mmu and 4d6ddfa9242b, setting fullmm to true should guarantee there are *no users* of the mm_struct, so I think my suggestion is not about optimization but about keeping the semantic "nobody can access the address space while the entire address space is being destroyed". If you want to be more explicit, we should add some description of where the lazy mode may be used, and I think that would have to explain the internals of some architectures for users to understand. I'm not sure it's worth it when we can do it transparently. I'm not strongly against your approach, either. Anyway, I think Wang Nan's patch is already broken. http://lkml.kernel.org/r/%3C20171107095453.179940-1-wangnan0@huawei.com%3E Because unmap_page_range(ie, zap_pte_range) can flush TLB forcefully and free pages. However, the architecture code for TLB flush cannot flush at all by wrong fullmm so other threads can write freed-page. Thanks.
On Tue 14-11-17 10:45:49, Minchan Kim wrote: [...] > Anyway, I think Wang Nan's patch is already broken. > http://lkml.kernel.org/r/%3C20171107095453.179940-1-wangnan0@huawei.com%3E > > Because unmap_page_range(ie, zap_pte_range) can flush TLB forcefully > and free pages. However, the architecture code for TLB flush cannot > flush at all by wrong fullmm so other threads can write freed-page. I am not sure I understand what you mean. How is that any different from any other explicit partial madvise call? -- Michal Hocko SUSE Labs
On Tue, Nov 14, 2017 at 08:21:00AM +0100, Michal Hocko wrote: > On Tue 14-11-17 10:45:49, Minchan Kim wrote: > [...] > > Anyway, I think Wang Nan's patch is already broken. > > http://lkml.kernel.org/r/%3C20171107095453.179940-1-wangnan0@huawei.com%3E > > > > Because unmap_page_range(ie, zap_pte_range) can flush TLB forcefully > > and free pages. However, the architecture code for TLB flush cannot > > flush at all by wrong fullmm so other threads can write freed-page. > > I am not sure I understand what you mean. How is that any different from > any other explicit partial madvise call? Argh, I misread his code. Sorry for that.
On Mon 13-11-17 09:28:33, Minchan Kim wrote: [...] > void arch_tlb_gather_mmu(...) > > tlb->fullmm = !(start | (end + 1)) && atomic_read(&mm->mm_users) == 0; Sorry, I should have realized sooner but this will not work for the oom reaper. It _can_ race with the final exit_mmap and run with mm_users == 0 -- Michal Hocko SUSE Labs
Hi Michal, On Fri, Nov 10, 2017 at 01:26:35PM +0100, Michal Hocko wrote: > From 7f0fcd2cab379ddac5611b2a520cdca8a77a235b Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.com> > Date: Fri, 10 Nov 2017 11:27:17 +0100 > Subject: [PATCH] arch, mm: introduce arch_tlb_gather_mmu_lazy > > 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1") has > introduced an optimization to not flush tlb when we are tearing the > whole address space down. Will goes on to explain > > : Basically, we tag each address space with an ASID (PCID on x86) which > : is resident in the TLB. This means we can elide TLB invalidation when > : pulling down a full mm because we won't ever assign that ASID to > : another mm without doing TLB invalidation elsewhere (which actually > : just nukes the whole TLB). > > This all is nice but tlb_gather users are not aware of that and this can > actually cause some real problems. E.g. the oom_reaper tries to reap the > whole address space but it might race with threads accessing the memory [1]. > It is possible that soft-dirty handling might suffer from the same > problem [2]. > > Introduce an explicit lazy variant tlb_gather_mmu_lazy which allows the > behavior arm64 implements for the fullmm case and replace it by an > explicit lazy flag in the mmu_gather structure. exit_mmap path is then > turned into the explicit lazy variant. Other architectures simply ignore > the flag. > > [1] http://lkml.kernel.org/r/20171106033651.172368-1-wangnan0@huawei.com > [2] http://lkml.kernel.org/r/20171110001933.GA12421@bbox > Signed-off-by: Michal Hocko <mhocko@suse.com> > --- > arch/arm/include/asm/tlb.h | 3 ++- > arch/arm64/include/asm/tlb.h | 2 +- > arch/ia64/include/asm/tlb.h | 3 ++- > arch/s390/include/asm/tlb.h | 3 ++- > arch/sh/include/asm/tlb.h | 2 +- > arch/um/include/asm/tlb.h | 2 +- > include/asm-generic/tlb.h | 6 ++++-- > include/linux/mm_types.h | 2 ++ > mm/memory.c | 17 +++++++++++++++-- > mm/mmap.c | 2 +- > 10 files changed, 31 insertions(+), 11 deletions(-) > > diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h > index d5562f9ce600..fe9042aee8e9 100644 > --- a/arch/arm/include/asm/tlb.h > +++ b/arch/arm/include/asm/tlb.h > @@ -149,7 +149,8 @@ static inline void tlb_flush_mmu(struct mmu_gather *tlb) > > static inline void > arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, > - unsigned long start, unsigned long end) > + unsigned long start, unsigned long end, > + bool lazy) > { > tlb->mm = mm; > tlb->fullmm = !(start | (end+1)); > diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h > index ffdaea7954bb..7adde19b2bcc 100644 > --- a/arch/arm64/include/asm/tlb.h > +++ b/arch/arm64/include/asm/tlb.h > @@ -43,7 +43,7 @@ static inline void tlb_flush(struct mmu_gather *tlb) > * The ASID allocator will either invalidate the ASID or mark > * it as used. > */ > - if (tlb->fullmm) > + if (tlb->lazy) > return; This looks like the right idea, but I'd rather make this check: if (tlb->fullmm && tlb->lazy) since the optimisation doesn't work for anything than tearing down the entire address space. Alternatively, I could actually go check MMF_UNSTABLE in tlb->mm, which would save you having to add an extra flag in the first place, e.g.: if (tlb->fullmm && !test_bit(MMF_UNSTABLE, &tlb->mm->flags)) which is a nice one-liner. Will
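The two alternatives would sit at the same spot in arm64's tlb_flush(); roughly as below, where only one of the two checks would actually be used:

static inline void tlb_flush(struct mmu_gather *tlb)
{
	/* Variant 1: honour the new hint, but only for a full-mm teardown. */
	if (tlb->fullmm && tlb->lazy)
		return;

	/*
	 * Variant 2 (no new flag): keep keying off fullmm, but never elide
	 * the flush once the oom reaper has marked the mm with MMF_UNSTABLE.
	 *
	 *	if (tlb->fullmm && !test_bit(MMF_UNSTABLE, &tlb->mm->flags))
	 *		return;
	 */

	/* ... the actual ASID/range invalidation follows here ... */
}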
On Wed, Nov 15, 2017 at 09:14:52AM +0100, Michal Hocko wrote: > On Mon 13-11-17 09:28:33, Minchan Kim wrote: > [...] > > void arch_tlb_gather_mmu(...) > > > > tlb->fullmm = !(start | (end + 1)) && atomic_read(&mm->mm_users) == 0; > > Sorry, I should have realized sooner but this will not work for the oom > reaper. It _can_ race with the final exit_mmap and run with mm_users == 0 If someone sees mm_users is zero, it means there is no user left to access the address space through stale TLB entries. Am I missing something?
On Thu 16-11-17 09:44:57, Minchan Kim wrote: > On Wed, Nov 15, 2017 at 09:14:52AM +0100, Michal Hocko wrote: > > On Mon 13-11-17 09:28:33, Minchan Kim wrote: > > [...] > > > void arch_tlb_gather_mmu(...) > > > > > > tlb->fullmm = !(start | (end + 1)) && atomic_read(&mm->mm_users) == 0; > > > > Sorry, I should have realized sooner but this will not work for the oom > > reaper. It _can_ race with the final exit_mmap and run with mm_users == 0 > > If someone sees mm_users is zero, it means there is no user left to access > the address space through stale TLB entries. Am I missing something? You are probably right, but changing the flushing policy in the middle of the address space teardown makes me nervous. While this might work right now, it is kind of tricky and it has some potential to kick us back in the future. Just note how the current arm64 optimization went unnoticed because the oom reaper is such a rare event that nobody has actually noticed this. And I suspect the likelihood of failure is low enough that nobody would notice it in real life even once this is applied. So I would very much like to make the behavior really explicit for everybody to see what is going on there. -- Michal Hocko SUSE Labs
On Wed 15-11-17 17:33:32, Will Deacon wrote: > Hi Michal, > > On Fri, Nov 10, 2017 at 01:26:35PM +0100, Michal Hocko wrote: > > From 7f0fcd2cab379ddac5611b2a520cdca8a77a235b Mon Sep 17 00:00:00 2001 > > From: Michal Hocko <mhocko@suse.com> > > Date: Fri, 10 Nov 2017 11:27:17 +0100 > > Subject: [PATCH] arch, mm: introduce arch_tlb_gather_mmu_lazy > > > > 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1") has > > introduced an optimization to not flush tlb when we are tearing the > > whole address space down. Will goes on to explain > > > > : Basically, we tag each address space with an ASID (PCID on x86) which > > : is resident in the TLB. This means we can elide TLB invalidation when > > : pulling down a full mm because we won't ever assign that ASID to > > : another mm without doing TLB invalidation elsewhere (which actually > > : just nukes the whole TLB). > > > > This all is nice but tlb_gather users are not aware of that and this can > > actually cause some real problems. E.g. the oom_reaper tries to reap the > > whole address space but it might race with threads accessing the memory [1]. > > It is possible that soft-dirty handling might suffer from the same > > problem [2]. > > > > Introduce an explicit lazy variant tlb_gather_mmu_lazy which allows the > > behavior arm64 implements for the fullmm case and replace it by an > > explicit lazy flag in the mmu_gather structure. exit_mmap path is then > > turned into the explicit lazy variant. Other architectures simply ignore > > the flag. > > > > [1] http://lkml.kernel.org/r/20171106033651.172368-1-wangnan0@huawei.com > > [2] http://lkml.kernel.org/r/20171110001933.GA12421@bbox > > Signed-off-by: Michal Hocko <mhocko@suse.com> > > --- > > arch/arm/include/asm/tlb.h | 3 ++- > > arch/arm64/include/asm/tlb.h | 2 +- > > arch/ia64/include/asm/tlb.h | 3 ++- > > arch/s390/include/asm/tlb.h | 3 ++- > > arch/sh/include/asm/tlb.h | 2 +- > > arch/um/include/asm/tlb.h | 2 +- > > include/asm-generic/tlb.h | 6 ++++-- > > include/linux/mm_types.h | 2 ++ > > mm/memory.c | 17 +++++++++++++++-- > > mm/mmap.c | 2 +- > > 10 files changed, 31 insertions(+), 11 deletions(-) > > > > diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h > > index d5562f9ce600..fe9042aee8e9 100644 > > --- a/arch/arm/include/asm/tlb.h > > +++ b/arch/arm/include/asm/tlb.h > > @@ -149,7 +149,8 @@ static inline void tlb_flush_mmu(struct mmu_gather *tlb) > > > > static inline void > > arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, > > - unsigned long start, unsigned long end) > > + unsigned long start, unsigned long end, > > + bool lazy) > > { > > tlb->mm = mm; > > tlb->fullmm = !(start | (end+1)); > > diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h > > index ffdaea7954bb..7adde19b2bcc 100644 > > --- a/arch/arm64/include/asm/tlb.h > > +++ b/arch/arm64/include/asm/tlb.h > > @@ -43,7 +43,7 @@ static inline void tlb_flush(struct mmu_gather *tlb) > > * The ASID allocator will either invalidate the ASID or mark > > * it as used. > > */ > > - if (tlb->fullmm) > > + if (tlb->lazy) > > return; > > This looks like the right idea, but I'd rather make this check: > > if (tlb->fullmm && tlb->lazy) > > since the optimisation doesn't work for anything than tearing down the > entire address space. OK, that makes sense. 
> Alternatively, I could actually go check MMF_UNSTABLE in tlb->mm, which > would save you having to add an extra flag in the first place, e.g.: > > if (tlb->fullmm && !test_bit(MMF_UNSTABLE, &tlb->mm->flags)) > > which is a nice one-liner. But that would make it oom_reaper specific. What about the softdirty case Minchan has mentioned earlier? -- Michal Hocko SUSE Labs
On Thu, Nov 16, 2017 at 10:20:42AM +0100, Michal Hocko wrote: > On Wed 15-11-17 17:33:32, Will Deacon wrote: > > Hi Michal, > > > > On Fri, Nov 10, 2017 at 01:26:35PM +0100, Michal Hocko wrote: > > > From 7f0fcd2cab379ddac5611b2a520cdca8a77a235b Mon Sep 17 00:00:00 2001 > > > From: Michal Hocko <mhocko@suse.com> > > > Date: Fri, 10 Nov 2017 11:27:17 +0100 > > > Subject: [PATCH] arch, mm: introduce arch_tlb_gather_mmu_lazy > > > > > > 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1") has > > > introduced an optimization to not flush tlb when we are tearing the > > > whole address space down. Will goes on to explain > > > > > > : Basically, we tag each address space with an ASID (PCID on x86) which > > > : is resident in the TLB. This means we can elide TLB invalidation when > > > : pulling down a full mm because we won't ever assign that ASID to > > > : another mm without doing TLB invalidation elsewhere (which actually > > > : just nukes the whole TLB). > > > > > > This all is nice but tlb_gather users are not aware of that and this can > > > actually cause some real problems. E.g. the oom_reaper tries to reap the > > > whole address space but it might race with threads accessing the memory [1]. > > > It is possible that soft-dirty handling might suffer from the same > > > problem [2]. > > > > > > Introduce an explicit lazy variant tlb_gather_mmu_lazy which allows the > > > behavior arm64 implements for the fullmm case and replace it by an > > > explicit lazy flag in the mmu_gather structure. exit_mmap path is then > > > turned into the explicit lazy variant. Other architectures simply ignore > > > the flag. > > > > > > [1] http://lkml.kernel.org/r/20171106033651.172368-1-wangnan0@huawei.com > > > [2] http://lkml.kernel.org/r/20171110001933.GA12421@bbox > > > Signed-off-by: Michal Hocko <mhocko@suse.com> > > > --- > > > arch/arm/include/asm/tlb.h | 3 ++- > > > arch/arm64/include/asm/tlb.h | 2 +- > > > arch/ia64/include/asm/tlb.h | 3 ++- > > > arch/s390/include/asm/tlb.h | 3 ++- > > > arch/sh/include/asm/tlb.h | 2 +- > > > arch/um/include/asm/tlb.h | 2 +- > > > include/asm-generic/tlb.h | 6 ++++-- > > > include/linux/mm_types.h | 2 ++ > > > mm/memory.c | 17 +++++++++++++++-- > > > mm/mmap.c | 2 +- > > > 10 files changed, 31 insertions(+), 11 deletions(-) > > > > > > diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h > > > index d5562f9ce600..fe9042aee8e9 100644 > > > --- a/arch/arm/include/asm/tlb.h > > > +++ b/arch/arm/include/asm/tlb.h > > > @@ -149,7 +149,8 @@ static inline void tlb_flush_mmu(struct mmu_gather *tlb) > > > > > > static inline void > > > arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, > > > - unsigned long start, unsigned long end) > > > + unsigned long start, unsigned long end, > > > + bool lazy) > > > { > > > tlb->mm = mm; > > > tlb->fullmm = !(start | (end+1)); > > > diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h > > > index ffdaea7954bb..7adde19b2bcc 100644 > > > --- a/arch/arm64/include/asm/tlb.h > > > +++ b/arch/arm64/include/asm/tlb.h > > > @@ -43,7 +43,7 @@ static inline void tlb_flush(struct mmu_gather *tlb) > > > * The ASID allocator will either invalidate the ASID or mark > > > * it as used. > > > */ > > > - if (tlb->fullmm) > > > + if (tlb->lazy) > > > return; > > > > This looks like the right idea, but I'd rather make this check: > > > > if (tlb->fullmm && tlb->lazy) > > > > since the optimisation doesn't work for anything than tearing down the > > entire address space. 
> > OK, that makes sense. > > > Alternatively, I could actually go check MMF_UNSTABLE in tlb->mm, which > > would save you having to add an extra flag in the first place, e.g.: > > > > if (tlb->fullmm && !test_bit(MMF_UNSTABLE, &tlb->mm->flags)) > > > > which is a nice one-liner. > > But that would make it oom_reaper specific. What about the softdirty > case Minchan has mentioned earlier? We don't (yet) support that on arm64, so we're ok for now. If we do grow support for it, then I agree that we want a flag to identify the case where the address space is going away and only elide the invalidation then. Will
Hi Michal, On Mon, Nov 20, 2017 at 05:04:22PM +0100, Michal Hocko wrote: > On Mon 20-11-17 14:24:44, Will Deacon wrote: > > On Thu, Nov 16, 2017 at 10:20:42AM +0100, Michal Hocko wrote: > > > On Wed 15-11-17 17:33:32, Will Deacon wrote: > [...] > > > > > diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h > > > > > index ffdaea7954bb..7adde19b2bcc 100644 > > > > > --- a/arch/arm64/include/asm/tlb.h > > > > > +++ b/arch/arm64/include/asm/tlb.h > > > > > @@ -43,7 +43,7 @@ static inline void tlb_flush(struct mmu_gather *tlb) > > > > > * The ASID allocator will either invalidate the ASID or mark > > > > > * it as used. > > > > > */ > > > > > - if (tlb->fullmm) > > > > > + if (tlb->lazy) > > > > > return; > > > > > > > > This looks like the right idea, but I'd rather make this check: > > > > > > > > if (tlb->fullmm && tlb->lazy) > > > > > > > > since the optimisation doesn't work for anything than tearing down the > > > > entire address space. > > > > > > OK, that makes sense. > > > > > > > Alternatively, I could actually go check MMF_UNSTABLE in tlb->mm, which > > > > would save you having to add an extra flag in the first place, e.g.: > > > > > > > > if (tlb->fullmm && !test_bit(MMF_UNSTABLE, &tlb->mm->flags)) > > > > > > > > which is a nice one-liner. > > > > > > But that would make it oom_reaper specific. What about the softdirty > > > case Minchan has mentioned earlier? > > > > We don't (yet) support that on arm64, so we're ok for now. If we do grow > > support for it, then I agree that we want a flag to identify the case where > > the address space is going away and only elide the invalidation then. > > What do you think about the following patch instead? I have to confess > I do not really understand the fullmm semantic so I might introduce some > duplication by this flag. If you think this is a good idea, I will post > it in a separate thread. Please do! My only suggestion would be s/lazy/exit/, since I don't think the optimisation works in any other situation than the address space going away for good. Will
On Wed, Nov 22, 2017 at 07:30:50PM +0000, Will Deacon wrote: > Hi Michal, > > On Mon, Nov 20, 2017 at 05:04:22PM +0100, Michal Hocko wrote: > > On Mon 20-11-17 14:24:44, Will Deacon wrote: > > > On Thu, Nov 16, 2017 at 10:20:42AM +0100, Michal Hocko wrote: > > > > On Wed 15-11-17 17:33:32, Will Deacon wrote: > > [...] > > > > > > diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h > > > > > > index ffdaea7954bb..7adde19b2bcc 100644 > > > > > > --- a/arch/arm64/include/asm/tlb.h > > > > > > +++ b/arch/arm64/include/asm/tlb.h > > > > > > @@ -43,7 +43,7 @@ static inline void tlb_flush(struct mmu_gather *tlb) > > > > > > * The ASID allocator will either invalidate the ASID or mark > > > > > > * it as used. > > > > > > */ > > > > > > - if (tlb->fullmm) > > > > > > + if (tlb->lazy) > > > > > > return; > > > > > > > > > > This looks like the right idea, but I'd rather make this check: > > > > > > > > > > if (tlb->fullmm && tlb->lazy) > > > > > > > > > > since the optimisation doesn't work for anything than tearing down the > > > > > entire address space. > > > > > > > > OK, that makes sense. > > > > > > > > > Alternatively, I could actually go check MMF_UNSTABLE in tlb->mm, which > > > > > would save you having to add an extra flag in the first place, e.g.: > > > > > > > > > > if (tlb->fullmm && !test_bit(MMF_UNSTABLE, &tlb->mm->flags)) > > > > > > > > > > which is a nice one-liner. > > > > > > > > But that would make it oom_reaper specific. What about the softdirty > > > > case Minchan has mentioned earlier? > > > > > > We don't (yet) support that on arm64, so we're ok for now. If we do grow > > > support for it, then I agree that we want a flag to identify the case where > > > the address space is going away and only elide the invalidation then. > > > > What do you think about the following patch instead? I have to confess > > I do not really understand the fullmm semantic so I might introduce some > > duplication by this flag. If you think this is a good idea, I will post > > it in a separate thread. > > > Please do! My only suggestion would be s/lazy/exit/, since I don't think the > optimisation works in any other situation than the address space going away > for good. Yes, the address space is going away. That's why I wanted to add an additional check that the address space really is going away, without adding new flags. http://lkml.kernel.org/r/<20171113002833.GA18301@bbox> However, if you guys prefer to add a new flag to distinguish the cases, I prefer "exit" to "lazy". It also would be better to add a WARN_ON to catch potential future misuse like the OOM reaper. Anyway, I'm not strongly against it, so it's up to you, Michal. WARN_ON_ONCE(exit == true && atomic_read(&mm->mm_users) > 0);
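Put together, the direction discussed here would look roughly like the sketch below. tlb_gather_mmu_exit is a hypothetical name following the s/lazy/exit/ suggestion, built on the RFC's extra boolean, and the WARN_ON_ONCE encodes the expectation that the exit variant only runs once no user of the address space is left.

/* Sketch only: combines the naming suggestion with the proposed WARN_ON. */
void tlb_gather_mmu_exit(struct mmu_gather *tlb, struct mm_struct *mm,
			 unsigned long start, unsigned long end)
{
	/*
	 * Eliding the flush is only safe when the address space is going
	 * away for good, i.e. nobody can reuse the stale TLB entries.
	 */
	WARN_ON_ONCE(atomic_read(&mm->mm_users) > 0);

	arch_tlb_gather_mmu(tlb, mm, start, end, true);
	inc_tlb_flush_pending(tlb->mm);
}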
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dee0f75..18c5b35 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -532,7 +532,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	 */
 	set_bit(MMF_UNSTABLE, &mm->flags);
 
-	tlb_gather_mmu(&tlb, mm, 0, -1);
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
 		if (!can_madv_dontneed_vma(vma))
 			continue;
@@ -547,11 +546,13 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 		 * we do not want to block exit_mmap by keeping mm ref
 		 * count elevated without a good reason.
 		 */
-		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
+			tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end);
 			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
 					 NULL);
+			tlb_finish_mmu(&tlb, vma->vm_start, vma->vm_end);
+		}
 	}
-	tlb_finish_mmu(&tlb, 0, -1);
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),