From patchwork Sun Dec 6 06:14:42 2020
X-Patchwork-Submitter: Andrew Morton
X-Patchwork-Id: 339678
Date: Sat, 05 Dec 2020 22:14:42 -0800
From: Andrew Morton
To: akpm@linux-foundation.org, dong.menglong@zte.com.cn, jwilk@jwilk.net, linux-mm@kvack.org, mm-commits@vger.kernel.org, nhorman@tuxdriver.com, pabs3@bonedaddy.net, stable@vger.kernel.org, torvalds@linux-foundation.org
Subject: [patch 02/12] coredump: fix core_pattern parse error
Message-ID: <20201206061442.va_wmK_ha%akpm@linux-foundation.org>
In-Reply-To: <20201205221412.67f14b9b3a5ef531c76dd452@linux-foundation.org>

From: Menglong Dong
Subject: coredump: fix core_pattern parse error

format_corename() splits 'core_pattern' on spaces when it is in pipe
mode and takes helper_argv[0] as the path to the usermode executable.
This works fine in most cases.

However, if there is a space between '|' and '/file/path', such as
'| /usr/lib/systemd/systemd-coredump %P %u %g', helper_argv[0] is
parsed as '' and users get a 'Core dump to | disabled' error.  That is
unfriendly to users, as the pattern above was valid previously.

Fix this by ignoring the spaces between '|' and '/file/path'.
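For illustration only, here is a minimal user-space sketch of the
tokenization the fix aims for; it is not the kernel implementation, and
parse_pipe_pattern(), the fixed-size argv array and the buffer sizes are
invented for this example.  The point is that whitespace immediately
after '|' is skipped, so argv[0] becomes the helper path rather than an
empty string:

#include <ctype.h>
#include <stdio.h>

/* Illustrative user-space sketch, not fs/coredump.c: split a pipe-mode
 * core_pattern on whitespace, skipping any spaces before the helper path. */
static int parse_pipe_pattern(const char *pattern, char argv_out[8][64])
{
        const char *p = pattern;
        int argc = 0;

        if (*p != '|')
                return -1;              /* not pipe mode */
        p++;

        while (*p && argc < 8) {
                if (isspace((unsigned char)*p)) {
                        p++;            /* skip separators, including spaces
                                           between '|' and the path */
                        continue;
                }
                int used = 0;
                while (*p && !isspace((unsigned char)*p) && used < 63)
                        argv_out[argc][used++] = *p++;
                argv_out[argc][used] = '\0';
                argc++;
        }
        return argc;
}

int main(void)
{
        char argv_out[8][64];
        int n = parse_pipe_pattern("| /usr/lib/systemd/systemd-coredump %P %u %g",
                                   argv_out);

        for (int i = 0; i < n; i++)
                printf("argv[%d] = \"%s\"\n", i, argv_out[i]);
        return 0;
}

With the space after '|', argv[0] still comes out as
"/usr/lib/systemd/systemd-coredump", which is the behavior the patch
restores.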
Link: https://lkml.kernel.org/r/5fb62870.1c69fb81.8ef5d.af76@mx.google.com
Fixes: 315c69261dd3 ("coredump: split pipe command whitespace before expanding template")
Signed-off-by: Menglong Dong
Cc: Paul Wise
Cc: Jakub Wilk [https://bugs.debian.org/924398]
Cc: Neil Horman
Cc:
Signed-off-by: Andrew Morton
---

 fs/coredump.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/fs/coredump.c~coredump-fix-core_pattern-parse-error
+++ a/fs/coredump.c
@@ -229,7 +229,8 @@ static int format_corename(struct core_n
 	 */
 	if (ispipe) {
 		if (isspace(*pat_ptr)) {
-			was_space = true;
+			if (cn->used != 0)
+				was_space = true;
 			pat_ptr++;
 			continue;
 		} else if (was_space) {

From patchwork Sun Dec 6 06:14:45 2020
X-Patchwork-Submitter: Andrew Morton
X-Patchwork-Id: 339676
Date: Sat, 05 Dec 2020 22:14:45 -0800
From: Andrew Morton
To: akpm@linux-foundation.org, guro@fb.com, hannes@cmpxchg.org, linux-mm@kvack.org, mhocko@kernel.org, mm-commits@vger.kernel.org, shakeelb@google.com, stable@vger.kernel.org, torvalds@linux-foundation.org
Subject: [patch 03/12] mm: memcg/slab: fix obj_cgroup_charge() return value handling
Message-ID: <20201206061445.FBPPaghTp%akpm@linux-foundation.org>
In-Reply-To: <20201205221412.67f14b9b3a5ef531c76dd452@linux-foundation.org>

From: Roman Gushchin
Subject: mm: memcg/slab: fix obj_cgroup_charge() return value handling

Commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches
for all allocations") introduced a regression into the handling of the
obj_cgroup_charge() return value.  If a non-zero value is returned
(indicating that one of the memory.max limits has been exceeded), the
allocation should fail instead of falling back to non-accounted mode.

To make the code more readable, move the memcg_slab_pre_alloc_hook()
and memcg_slab_post_alloc_hook() calling conditions into the bodies of
these hooks.
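For illustration only, a stand-alone model of the intended control flow;
this is not the kernel code, and struct objcg, pre_alloc_hook() and
accounted_alloc() are invented stand-ins.  The point is that when the
charge would exceed the limit, the hook returns false and the allocation
fails instead of proceeding unaccounted:

#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct objcg { long charged; long limit; };

/* Returns false if the allocation should fail (charge over the limit). */
static bool pre_alloc_hook(struct objcg *cg, size_t bytes, struct objcg **objcgp)
{
        if (!cg)                        /* accounting disabled: proceed unaccounted */
                return true;
        if (cg->charged + (long)bytes > cg->limit)
                return false;           /* over the limit: the allocation fails */
        cg->charged += bytes;
        *objcgp = cg;
        return true;
}

static void *accounted_alloc(struct objcg *cg, size_t bytes)
{
        struct objcg *objcg = NULL;

        if (!pre_alloc_hook(cg, bytes, &objcg))
                return NULL;            /* the fix: no fallback to unaccounted mode */
        return malloc(bytes);
}

int main(void)
{
        struct objcg cg = { .charged = 0, .limit = 4096 };
        void *ok = accounted_alloc(&cg, 1024);  /* succeeds, charge recorded */
        void *no = accounted_alloc(&cg, 8192);  /* fails: would exceed the limit */

        free(ok);
        return no == NULL ? 0 : 1;
}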
Link: https://lkml.kernel.org/r/20201127161828.GD840171@carbon.dhcp.thefacebook.com
Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations")
Signed-off-by: Roman Gushchin
Reviewed-by: Shakeel Butt
Cc: Johannes Weiner
Cc: Michal Hocko
Cc:
Signed-off-by: Andrew Morton
---

 mm/slab.h |   40 ++++++++++++++++++++++++----------------
 1 file changed, 24 insertions(+), 16 deletions(-)

--- a/mm/slab.h~mm-memcg-slab-fix-obj_cgroup_charge-return-value-handling
+++ a/mm/slab.h
@@ -274,22 +274,32 @@ static inline size_t obj_full_size(struc
 	return s->size + sizeof(struct obj_cgroup *);
 }
 
-static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
-							size_t objects,
-							gfp_t flags)
+/*
+ * Returns false if the allocation should fail.
+ */
+static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+					     struct obj_cgroup **objcgp,
+					     size_t objects, gfp_t flags)
 {
 	struct obj_cgroup *objcg;
 
+	if (!memcg_kmem_enabled())
+		return true;
+
+	if (!(flags & __GFP_ACCOUNT) && !(s->flags & SLAB_ACCOUNT))
+		return true;
+
 	objcg = get_obj_cgroup_from_current();
 	if (!objcg)
-		return NULL;
+		return true;
 
 	if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
 		obj_cgroup_put(objcg);
-		return NULL;
+		return false;
 	}
 
-	return objcg;
+	*objcgp = objcg;
+	return true;
 }
 
 static inline void mod_objcg_state(struct obj_cgroup *objcg,
@@ -315,7 +325,7 @@ static inline void memcg_slab_post_alloc
 	unsigned long off;
 	size_t i;
 
-	if (!objcg)
+	if (!memcg_kmem_enabled() || !objcg)
 		return;
 
 	flags &= ~__GFP_ACCOUNT;
@@ -400,11 +410,11 @@ static inline void memcg_free_page_obj_c
 {
 }
 
-static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
-							size_t objects,
-							gfp_t flags)
+static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+					     struct obj_cgroup **objcgp,
+					     size_t objects, gfp_t flags)
 {
-	return NULL;
+	return true;
 }
 
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
@@ -508,9 +518,8 @@ static inline struct kmem_cache *slab_pr
 	if (should_failslab(s, flags))
 		return NULL;
 
-	if (memcg_kmem_enabled() &&
-	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
-		*objcgp = memcg_slab_pre_alloc_hook(s, size, flags);
+	if (!memcg_slab_pre_alloc_hook(s, objcgp, size, flags))
+		return NULL;
 
 	return s;
 }
@@ -529,8 +538,7 @@ static inline void slab_post_alloc_hook(
 			s->flags, flags);
 	}
 
-	if (memcg_kmem_enabled())
-		memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
+	memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
 }
 
 #ifndef CONFIG_SLOB

From patchwork Sun Dec 6 06:14:48 2020
X-Patchwork-Submitter: Andrew Morton
X-Patchwork-Id: 339135
Date: Sat, 05 Dec 2020 22:14:48 -0800
From: Andrew Morton
To: akpm@linux-foundation.org, guro@fb.com, ktkhai@virtuozzo.com, linux-mm@kvack.org, mm-commits@vger.kernel.org, shakeelb@google.com, shy828301@gmail.com, stable@vger.kernel.org, torvalds@linux-foundation.org, vdavydov.dev@gmail.com
Subject: [patch 04/12] mm: list_lru: set shrinker map bit when child nr_items is not zero
Message-ID: <20201206061448.D4xiqM6MX%akpm@linux-foundation.org>
In-Reply-To: <20201205221412.67f14b9b3a5ef531c76dd452@linux-foundation.org>

From: Yang Shi
Subject: mm: list_lru: set shrinker map bit when child nr_items is not zero

While investigating a slab cache bloat problem, a significant amount of
negative dentry cache was seen, but confusingly it was neither shrunk
by the reclaimer (the host has very tight memory) nor by dropping
caches.  The vmcore shows there are over 14M negative dentry objects on
the lru, but tracing shows they were not scanned at all.  Further
investigation shows the memcg's vfs shrinker_map bit is not set, so the
reclaimer and cache dropping simply skip calling the vfs shrinker, and
we had to reboot the hosts to get the memory back.

I didn't manage to come up with a reproducer in a test environment, and
the problem can't be reproduced after rebooting.  But by code
inspection there seems to be a race between clearing the shrinker map
bit and reparenting.  The hypothesis is elaborated below.

The memcg hierarchy in our production environment looks like:

                root
               /    \
          system    user

The main workloads run under the user slice's children, which creates
and removes memcgs frequently.  So reparenting happens very often under
the user slice, but no task is under the user slice directly.  With the
frequent reparenting and tight memory pressure, the below hypothetical
race condition may happen:

       CPU A                              CPU B
reparent
    dst->nr_items == 0
                                          shrinker:
                                              total_objects == 0
    add src->nr_items to dst
    set_bit
                                              return SHRINK_EMPTY
                                              clear_bit
child memcg offline
    replace child's kmemcg_id with
    parent's (in memcg_offline_kmem())
                                          list_lru_del() between shrinker runs
                                              see parent's kmemcg_id
                                              dec dst->nr_items
reparent again
    dst->nr_items may go negative
    due to concurrent list_lru_del()
                                          The second run of shrinker:
                                              read nr_items without any
                                              synchronization, so it may
                                              see intermediate negative
                                              nr_items then total_objects
                                              may return 0 coincidently

                                              keep the bit cleared
    dst->nr_items != 0
    skip set_bit
    add src->nr_items to dst

After this point dst->nr_items may never go back to zero, so
reparenting will never set the shrinker_map bit again.  And since there
is no task under the user slice directly, no new object will be added
to its lru to set the shrinker map bit either.  The bit stays cleared
forever.

How does list_lru_del() race with reparenting?  Reparenting replaces
the children's kmemcg_id with the parent's without the protection of
nlru->lock, so list_lru_del() may see the parent's kmemcg_id while it
is actually deleting items from the child's lru, and it decrements the
parent's nr_items.  The parent's nr_items may therefore go negative, as
commit 2788cf0c401c268b4819c5407493a8769b7007aa ("memcg: reparent
list_lrus and free kmemcg_id on css offline") says.

Since it is impossible for dst->nr_items to go negative and
src->nr_items to go to zero at the same time, it seems we can set the
shrinker map bit iff src->nr_items != 0.  We could synchronize
list_lru_count_one() and reparenting with nlru->lock, but checking
src->nr_items in reparenting is the simplest way and avoids lock
contention.
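For illustration only, a simplified stand-alone model of the fixed
reparenting step; the kernel's nlru->lock, list splicing and memcg
plumbing are omitted and the names are invented.  The shrinker bit is
set whenever the child actually contributed items, independent of the
parent's possibly-negative count:

#include <stdbool.h>
#include <stdio.h>

struct lru_one { long nr_items; };

static bool shrinker_bit;       /* stands in for the per-memcg shrinker map bit */

static void reparent(struct lru_one *src, struct lru_one *dst)
{
        /* the child's list is spliced onto the parent's list here (omitted) */

        if (src->nr_items) {
                dst->nr_items += src->nr_items;
                /*
                 * Set the bit whenever the child contributed items.  The old
                 * check (!dst->nr_items && src->nr_items) could skip this when
                 * a racing list_lru_del() had driven dst->nr_items negative.
                 */
                shrinker_bit = true;
                src->nr_items = 0;
        }
}

int main(void)
{
        struct lru_one child = { .nr_items = 3 };
        struct lru_one parent = { .nr_items = -1 };     /* racing del already seen */

        reparent(&child, &parent);
        printf("parent=%ld bit=%d\n", parent.nr_items, shrinker_bit);
        return 0;
}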
Link: https://lkml.kernel.org/r/20201202171749.264354-1-shy828301@gmail.com
Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
Signed-off-by: Yang Shi
Suggested-by: Roman Gushchin
Reviewed-by: Roman Gushchin
Acked-by: Kirill Tkhai
Reviewed-by: Shakeel Butt
Cc: Vladimir Davydov
Cc: [4.19]
Signed-off-by: Andrew Morton
---

 mm/list_lru.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

--- a/mm/list_lru.c~mm-list_lru-set-shrinker-map-bit-when-child-nr_items-is-not-zero
+++ a/mm/list_lru.c
@@ -534,7 +534,6 @@ static void memcg_drain_list_lru_node(st
 	struct list_lru_node *nlru = &lru->node[nid];
 	int dst_idx = dst_memcg->kmemcg_id;
 	struct list_lru_one *src, *dst;
-	bool set;
 
 	/*
 	 * Since list_lru_{add,del} may be called under an IRQ-safe lock,
@@ -546,11 +545,12 @@ static void memcg_drain_list_lru_node(st
 	dst = list_lru_from_memcg_idx(nlru, dst_idx);
 
 	list_splice_init(&src->list, &dst->list);
-	set = (!dst->nr_items && src->nr_items);
-	dst->nr_items += src->nr_items;
-	if (set)
+
+	if (src->nr_items) {
+		dst->nr_items += src->nr_items;
 		memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
-	src->nr_items = 0;
+		src->nr_items = 0;
+	}
 
 	spin_unlock_irq(&nlru->lock);
 }

From patchwork Sun Dec 6 06:14:51 2020
X-Patchwork-Submitter: Andrew Morton
X-Patchwork-Id: 339677
Date: Sat, 05 Dec 2020 22:14:51 -0800
From: Andrew Morton
To: akpm@linux-foundation.org, harish@linux.ibm.com, hch@infradead.org, linux-mm@kvack.org, minchan@kernel.org, mm-commits@vger.kernel.org, sergey.senozhatsky@gmail.com, stable@vger.kernel.org, tony@atomide.com, torvalds@linux-foundation.org, urezki@gmail.com
Subject: [patch 05/12] mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING
Message-ID: <20201206061451.IlxWUdUQr%akpm@linux-foundation.org>
In-Reply-To: <20201205221412.67f14b9b3a5ef531c76dd452@linux-foundation.org>

From: Minchan Kim
Subject: mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING

While doing zram testing, I found that decompression sometimes failed
because the compression buffer was corrupted.  On investigation, I
found that the commit below calls cond_resched() unconditionally, which
can cause a problem in atomic context if the task is rescheduled.

[   55.109012] BUG: sleeping function called from invalid context at mm/vmalloc.c:108
[   55.110774] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog
[   55.111973] 3 locks held by memhog/946:
[   55.112807] #0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160
[   55.114151] #1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160
[   55.115848] #2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0
[   55.118947] CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316
[   55.121265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[   55.122540] Call Trace:
[   55.122974]  dump_stack+0x8b/0xb8
[   55.123588]  ___might_sleep.cold+0xb6/0xc6
[   55.124328]  unmap_kernel_range_noflush+0x2eb/0x350
[   55.125198]  unmap_kernel_range+0x14/0x30
[   55.125920]  zs_unmap_object+0xd5/0xe0
[   55.126604]  zram_bvec_rw.isra.0+0x38c/0x8e0
[   55.127462]  zram_rw_page+0x90/0x101
[   55.128199]  bdev_write_page+0x92/0xe0
[   55.128957]  ? swap_slot_free_notify+0xb0/0xb0
[   55.129841]  __swap_writepage+0x94/0x4a0
[   55.130636]  ? do_raw_spin_unlock+0x4b/0xa0
[   55.131462]  ? _raw_spin_unlock+0x1f/0x30
[   55.132261]  ? page_swapcount+0x6c/0x90
[   55.133038]  pageout+0xe3/0x3a0
[   55.133702]  shrink_page_list+0xb94/0xd60
[   55.134626]  shrink_inactive_list+0x158/0x460

We can fix this by removing the ZSMALLOC_PGTABLE_MAPPING feature (which
contains the offending calling code) from zsmalloc.

Even though this option showed some improvement (e.g., 30%) on some
arm32 platforms, it has been a headache to maintain since it abuses
APIs[1] (e.g., unmap_kernel_range in atomic context).  Since we are
moving toward deprecating 32-bit machines, the config option has been
available only for builtin builds since v5.8, and it has never been the
default in zsmalloc, it's time to drop the option for better
maintenance.
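For illustration only, a rough user-space sketch of the copy-based
mapping that zsmalloc keeps after this removal; copy_map_object() and
the buffer handling are simplified stand-ins, not the kernel code.  An
object spanning two pages is copied into a buffer rather than mapped
through page tables, so the atomic mapping path needs no sleeping
vmalloc-layer calls:

#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Copy the part of the object in each page into one contiguous buffer. */
static void copy_map_object(char *buf, const char *page0, const char *page1,
                            int off, int size)
{
        int first = PAGE_SIZE - off;    /* bytes that live in the first page */

        memcpy(buf, page0 + off, first);
        memcpy(buf + first, page1, size - first);
}

int main(void)
{
        static char page0[PAGE_SIZE], page1[PAGE_SIZE], buf[64];

        memcpy(page0 + PAGE_SIZE - 2, "he", 2);   /* object tail of first page */
        memcpy(page1, "llo", 4);                  /* rest at start of next page */
        copy_map_object(buf, page0, page1, PAGE_SIZE - 2, 6);
        printf("%s\n", buf);                      /* prints "hello" */
        return 0;
}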
[1] http://lore.kernel.org/linux-mm/20201105170249.387069-1-minchan@kernel.org

Link: https://lkml.kernel.org/r/20201117202916.GA3856507@google.com
Fixes: e47110e90584 ("mm/vunmap: add cond_resched() in vunmap_pmd_range")
Signed-off-by: Minchan Kim
Reviewed-by: Sergey Senozhatsky
Cc: Tony Lindgren
Cc: Christoph Hellwig
Cc: Harish Sriram
Cc: Uladzislau Rezki
Cc:
Signed-off-by: Andrew Morton
---

 arch/arm/configs/omap2plus_defconfig |    1 
 include/linux/zsmalloc.h             |    1 
 mm/Kconfig                           |   13 ------
 mm/zsmalloc.c                        |   54 -------------------------
 4 files changed, 69 deletions(-)

--- a/arch/arm/configs/omap2plus_defconfig~mm-zsmallocc-drop-zsmalloc_pgtable_mapping
+++ a/arch/arm/configs/omap2plus_defconfig
@@ -81,7 +81,6 @@ CONFIG_PARTITION_ADVANCED=y
 CONFIG_BINFMT_MISC=y
 CONFIG_CMA=y
 CONFIG_ZSMALLOC=m
-CONFIG_ZSMALLOC_PGTABLE_MAPPING=y
 CONFIG_NET=y
 CONFIG_PACKET=y
 CONFIG_UNIX=y
--- a/include/linux/zsmalloc.h~mm-zsmallocc-drop-zsmalloc_pgtable_mapping
+++ a/include/linux/zsmalloc.h
@@ -20,7 +20,6 @@
  * zsmalloc mapping modes
  *
  * NOTE: These only make a difference when a mapped object spans pages.
- * They also have no effect when ZSMALLOC_PGTABLE_MAPPING is selected.
  */
 enum zs_mapmode {
 	ZS_MM_RW, /* normal read-write mapping */
--- a/mm/Kconfig~mm-zsmallocc-drop-zsmalloc_pgtable_mapping
+++ a/mm/Kconfig
@@ -707,19 +707,6 @@ config ZSMALLOC
 	  returned by an alloc().  This handle must be mapped in order to
 	  access the allocated space.
 
-config ZSMALLOC_PGTABLE_MAPPING
-	bool "Use page table mapping to access object in zsmalloc"
-	depends on ZSMALLOC=y
-	help
-	  By default, zsmalloc uses a copy-based object mapping method to
-	  access allocations that span two pages. However, if a particular
-	  architecture (ex, ARM) performs VM mapping faster than copying,
-	  then you should select this. This causes zsmalloc to use page table
-	  mapping rather than copying for object mapping.
-
-	  You can check speed with zsmalloc benchmark:
-	  https://github.com/spartacus06/zsmapbench
-
 config ZSMALLOC_STAT
 	bool "Export zsmalloc statistics"
 	depends on ZSMALLOC
--- a/mm/zsmalloc.c~mm-zsmallocc-drop-zsmalloc_pgtable_mapping
+++ a/mm/zsmalloc.c
@@ -293,11 +293,7 @@ struct zspage {
 };
 
 struct mapping_area {
-#ifdef CONFIG_ZSMALLOC_PGTABLE_MAPPING
-	struct vm_struct *vm; /* vm area for mapping object that span pages */
-#else
 	char *vm_buf; /* copy buffer for objects that span pages */
-#endif
 	char *vm_addr; /* address of kmap_atomic()'ed pages */
 	enum zs_mapmode vm_mm; /* mapping mode */
 };
@@ -1113,54 +1109,6 @@ static struct zspage *find_get_zspage(st
 	return zspage;
 }
 
-#ifdef CONFIG_ZSMALLOC_PGTABLE_MAPPING
-static inline int __zs_cpu_up(struct mapping_area *area)
-{
-	/*
-	 * Make sure we don't leak memory if a cpu UP notification
-	 * and zs_init() race and both call zs_cpu_up() on the same cpu
-	 */
-	if (area->vm)
-		return 0;
-	area->vm = get_vm_area(PAGE_SIZE * 2, 0);
-	if (!area->vm)
-		return -ENOMEM;
-
-	/*
-	 * Populate ptes in advance to avoid pte allocation with GFP_KERNEL
-	 * in non-preemtible context of zs_map_object.
-	 */
-	return apply_to_page_range(&init_mm, (unsigned long)area->vm->addr,
-			PAGE_SIZE * 2, NULL, NULL);
-}
-
-static inline void __zs_cpu_down(struct mapping_area *area)
-{
-	if (area->vm)
-		free_vm_area(area->vm);
-	area->vm = NULL;
-}
-
-static inline void *__zs_map_object(struct mapping_area *area,
-				struct page *pages[2], int off, int size)
-{
-	unsigned long addr = (unsigned long)area->vm->addr;
-
-	BUG_ON(map_kernel_range(addr, PAGE_SIZE * 2, PAGE_KERNEL, pages) < 0);
-	area->vm_addr = area->vm->addr;
-	return area->vm_addr + off;
-}
-
-static inline void __zs_unmap_object(struct mapping_area *area,
-				struct page *pages[2], int off, int size)
-{
-	unsigned long addr = (unsigned long)area->vm_addr;
-
-	unmap_kernel_range(addr, PAGE_SIZE * 2);
-}
-
-#else /* CONFIG_ZSMALLOC_PGTABLE_MAPPING */
-
 static inline int __zs_cpu_up(struct mapping_area *area)
 {
 	/*
@@ -1241,8 +1189,6 @@ out:
 	pagefault_enable();
 }
 
-#endif /* CONFIG_ZSMALLOC_PGTABLE_MAPPING */
-
 static int zs_cpu_prepare(unsigned int cpu)
 {
 	struct mapping_area *area;

From patchwork Sun Dec 6 06:14:55 2020
X-Patchwork-Submitter: Andrew Morton
X-Patchwork-Id: 339134
Date: Sat, 05 Dec 2020 22:14:55 -0800
From: Andrew Morton
To: akpm@linux-foundation.org, hughd@google.com, linux-mm@kvack.org, mm-commits@vger.kernel.org, qcai@redhat.com, stable@vger.kernel.org, torvalds@linux-foundation.org
Subject: [patch 06/12] mm/swapfile: do not sleep with a spin lock held
Message-ID: <20201206061455.fGLHX_jlo%akpm@linux-foundation.org>
In-Reply-To: <20201205221412.67f14b9b3a5ef531c76dd452@linux-foundation.org>

From: Qian Cai
Subject: mm/swapfile: do not sleep with a spin lock held

We can't call kvfree() with a spin lock held, so defer it.  This fixes
a might_sleep() runtime warning.
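For illustration only, a stand-alone sketch of the deferral pattern the
fix uses; struct table, install_or_defer() and the pthread mutex
(standing in for the kernel spinlock) are invented for this example.
Memory found to be unneeded while the lock is held is stashed and freed
only after the lock is dropped, because the real free path (kvfree) may
sleep:

#include <stdlib.h>
#include <pthread.h>

struct table {
        pthread_mutex_t lock;
        void *slot;
};

static void install_or_defer(struct table *t, void *fresh)
{
        void *defer = NULL;

        pthread_mutex_lock(&t->lock);
        if (t->slot)
                defer = fresh;  /* lost the race: keep the existing entry,
                                   but do not free here (the free may sleep) */
        else
                t->slot = fresh;
        pthread_mutex_unlock(&t->lock);

        free(defer);            /* like kvfree(), free(NULL) is a no-op */
}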
Link: https://lkml.kernel.org/r/20201202151549.10350-1-qcai@redhat.com
Fixes: 873d7bcfd066 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation")
Signed-off-by: Qian Cai
Reviewed-by: Andrew Morton
Cc: Hugh Dickins
Cc:
Signed-off-by: Andrew Morton
---

 mm/swapfile.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/mm/swapfile.c~mm-swapfile-do-not-sleep-with-a-spin-lock-held
+++ a/mm/swapfile.c
@@ -2867,6 +2867,7 @@ late_initcall(max_swapfiles_check);
 static struct swap_info_struct *alloc_swap_info(void)
 {
 	struct swap_info_struct *p;
+	struct swap_info_struct *defer = NULL;
 	unsigned int type;
 	int i;
 
@@ -2895,7 +2896,7 @@ static struct swap_info_struct *alloc_sw
 		smp_wmb();
 		WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1);
 	} else {
-		kvfree(p);
+		defer = p;
 		p = swap_info[type];
 		/*
 		 * Do not memset this entry: a racing procfs swap_next()
@@ -2908,6 +2909,7 @@ static struct swap_info_struct *alloc_sw
 		plist_node_init(&p->avail_lists[i], 0);
 	p->flags = SWP_USED;
 	spin_unlock(&swap_lock);
+	kvfree(defer);
 	spin_lock_init(&p->lock);
 	spin_lock_init(&p->cont_lock);

From patchwork Sun Dec 6 06:15:12 2020
X-Patchwork-Submitter: Andrew Morton
X-Patchwork-Id: 339133
Date: Sat, 05 Dec 2020 22:15:12 -0800
From: Andrew Morton
To: akpm@linux-foundation.org, almasrymina@google.com, amorenoz@redhat.com, gthelen@google.com, linux-mm@kvack.org, mike.kravetz@oracle.com, mm-commits@vger.kernel.org, rientjes@google.com, sandipan@linux.ibm.com, shakeelb@google.com, shuah@kernel.org, stable@vger.kernel.org, torvalds@linux-foundation.org
Subject: [patch 11/12] hugetlb_cgroup: fix offline of hugetlb cgroup with reservations
Message-ID: <20201206061512.FBawJ1qq2%akpm@linux-foundation.org>
In-Reply-To: <20201205221412.67f14b9b3a5ef531c76dd452@linux-foundation.org>

From: Mike Kravetz
Subject: hugetlb_cgroup: fix offline of hugetlb cgroup with reservations

Adrian Moreno was running a kubernetes 1.19 + containerd/docker
workload using hugetlbfs.
In this environment the issue is reproduced by:

1 - Start a simple pod that uses the recently added HugePages medium
    feature (pod yaml attached).
2 - Start a DPDK app.  It doesn't need to run successfully (as in
    transfer packets) nor interact with real hardware.  It seems just
    initializing the EAL layer (which handles hugepage reservation and
    locking) is enough to trigger the issue.
3 - Delete the Pod (or let it "Complete").

This results in a kworker thread going into a tight loop (top output):

 1425 root      20   0       0      0      0 R  99.7  0.0   5:22.45 kworker/28:7+cgroup_destroy

'perf top -g' reports:

-   63.28%     0.01%  [kernel]        [k] worker_thread
   - 49.97% worker_thread
      - 52.64% process_one_work
         - 62.08% css_killed_work_fn
            - hugetlb_cgroup_css_offline
                 41.52% _raw_spin_lock
               - 2.82% _cond_resched
                    rcu_all_qs
                 2.66% PageHuge
      - 0.57% schedule
         - 0.57% __schedule

We are spinning in the do-while loop in hugetlb_cgroup_css_offline.
Worse yet, we are holding the master cgroup lock (cgroup_mutex) while
infinitely spinning.  Little else can be done on the system as the
cgroup_mutex can not be acquired.

Do note that the issue can be reproduced by simply offlining a hugetlb
cgroup containing pages with reservation counts.

The loop in hugetlb_cgroup_css_offline is moving page counts from the
cgroup being offlined to the parent cgroup.  This is done for each
hstate, and is repeated until hugetlb_cgroup_have_usage returns false.
The routine moving counts (hugetlb_cgroup_move_parent) is only moving
'usage' counts.  The routine hugetlb_cgroup_have_usage is checking for
both 'usage' and 'reservation' counts.  What to do with reservation
counts when reparenting was discussed here:

https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/

The decision was made to leave a zombie cgroup behind for cgroups with
reservation counts.  Unfortunately, the code checking reservation
counts was incorrectly added to hugetlb_cgroup_have_usage.

To fix the issue, simply remove the check for reservation counts.
While fixing this issue, a related bug in hugetlb_cgroup_css_offline
was noticed: the hstate index is not reinitialized each time through
the do-while loop.  Fix this as well.

Link: https://lkml.kernel.org/r/20201203220242.158165-1-mike.kravetz@oracle.com
Fixes: 1adc4d419aa2 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
Reported-by: Adrian Moreno
Tested-by: Adrian Moreno
Signed-off-by: Mike Kravetz
Reviewed-by: Shakeel Butt
Cc: Mina Almasry
Cc: David Rientjes
Cc: Greg Thelen
Cc: Sandipan Das
Cc: Shuah Khan
Cc:
Signed-off-by: Andrew Morton
---

 mm/hugetlb_cgroup.c |    8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

--- a/mm/hugetlb_cgroup.c~hugetlb_cgroup-fix-offline-of-hugetlb-cgroup-with-reservations
+++ a/mm/hugetlb_cgroup.c
@@ -82,11 +82,8 @@ static inline bool hugetlb_cgroup_have_u
 
 	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
 		if (page_counter_read(
-			    hugetlb_cgroup_counter_from_cgroup(h_cg, idx)) ||
-		    page_counter_read(hugetlb_cgroup_counter_from_cgroup_rsvd(
-			    h_cg, idx))) {
+			    hugetlb_cgroup_counter_from_cgroup(h_cg, idx)))
 			return true;
-		}
 	}
 	return false;
 }
@@ -202,9 +199,10 @@ static void hugetlb_cgroup_css_offline(s
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
 	struct hstate *h;
 	struct page *page;
-	int idx = 0;
+	int idx;
 
 	do {
+		idx = 0;
 		for_each_hstate(h) {
 			spin_lock(&hugetlb_lock);
 			list_for_each_entry(page, &h->hugepage_activelist, lru)