From patchwork Sun Dec 6 06:14:42 2020
X-Patchwork-Submitter: Andrew Morton
X-Patchwork-Id: 339678
Date: Sat, 05 Dec 2020 22:14:42 -0800
From: Andrew Morton
To: akpm@linux-foundation.org, dong.menglong@zte.com.cn, jwilk@jwilk.net, linux-mm@kvack.org, mm-commits@vger.kernel.org, nhorman@tuxdriver.com, pabs3@bonedaddy.net, stable@vger.kernel.org, torvalds@linux-foundation.org
Subject: [patch 02/12] coredump: fix core_pattern parse error
Message-ID: <20201206061442.va_wmK_ha%akpm@linux-foundation.org>
In-Reply-To: <20201205221412.67f14b9b3a5ef531c76dd452@linux-foundation.org>

From: Menglong Dong
Subject: coredump: fix core_pattern parse error

format_corename() splits 'core_pattern' on spaces when it is in pipe
mode and takes helper_argv[0] as the path to the usermode executable.
This works fine in most cases.

However, if there is a space between '|' and '/file/path', such as
'| /usr/lib/systemd/systemd-coredump %P %u %g', helper_argv[0] is
parsed as '' and users get a 'Core dump to | disabled' error.  That is
unfriendly to users, as the pattern above was valid previously.

Fix this by ignoring the spaces between '|' and '/file/path'.
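For illustration only, here is a minimal user-space sketch of the
tokenization the fix aims for; it is not the kernel implementation, and
parse_pipe_pattern(), the fixed-size argv array and the buffer sizes are
invented for this example.  The point is that whitespace immediately
after '|' is skipped, so argv[0] becomes the helper path rather than an
empty string:

#include <ctype.h>
#include <stdio.h>

/* Illustrative user-space sketch, not fs/coredump.c: split a pipe-mode
 * core_pattern on whitespace, skipping any spaces before the helper path. */
static int parse_pipe_pattern(const char *pattern, char argv_out[8][64])
{
        const char *p = pattern;
        int argc = 0;

        if (*p != '|')
                return -1;              /* not pipe mode */
        p++;

        while (*p && argc < 8) {
                if (isspace((unsigned char)*p)) {
                        p++;            /* skip separators, including spaces
                                           between '|' and the path */
                        continue;
                }
                int used = 0;
                while (*p && !isspace((unsigned char)*p) && used < 63)
                        argv_out[argc][used++] = *p++;
                argv_out[argc][used] = '\0';
                argc++;
        }
        return argc;
}

int main(void)
{
        char argv_out[8][64];
        int n = parse_pipe_pattern("| /usr/lib/systemd/systemd-coredump %P %u %g",
                                   argv_out);

        for (int i = 0; i < n; i++)
                printf("argv[%d] = \"%s\"\n", i, argv_out[i]);
        return 0;
}

With the space after '|', argv[0] still comes out as
"/usr/lib/systemd/systemd-coredump", which is the behavior the patch
restores.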
Link: https://lkml.kernel.org/r/5fb62870.1c69fb81.8ef5d.af76@mx.google.com
Fixes: 315c69261dd3 ("coredump: split pipe command whitespace before expanding template")
Signed-off-by: Menglong Dong
Cc: Paul Wise
Cc: Jakub Wilk [https://bugs.debian.org/924398]
Cc: Neil Horman
Cc:
Signed-off-by: Andrew Morton
---

 fs/coredump.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/fs/coredump.c~coredump-fix-core_pattern-parse-error
+++ a/fs/coredump.c
@@ -229,7 +229,8 @@ static int format_corename(struct core_n
 	 */
 	if (ispipe) {
 		if (isspace(*pat_ptr)) {
-			was_space = true;
+			if (cn->used != 0)
+				was_space = true;
 			pat_ptr++;
 			continue;
 		} else if (was_space) {

From patchwork Sun Dec 6 06:14:45 2020
X-Patchwork-Submitter: Andrew Morton
X-Patchwork-Id: 339676
Date: Sat, 05 Dec 2020 22:14:45 -0800
From: Andrew Morton
To: akpm@linux-foundation.org, guro@fb.com, hannes@cmpxchg.org, linux-mm@kvack.org, mhocko@kernel.org, mm-commits@vger.kernel.org, shakeelb@google.com, stable@vger.kernel.org, torvalds@linux-foundation.org
Subject: [patch 03/12] mm: memcg/slab: fix obj_cgroup_charge() return value handling
Message-ID: <20201206061445.FBPPaghTp%akpm@linux-foundation.org>
In-Reply-To: <20201205221412.67f14b9b3a5ef531c76dd452@linux-foundation.org>

From: Roman Gushchin
Subject: mm: memcg/slab: fix obj_cgroup_charge() return value handling

Commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches
for all allocations") introduced a regression into the handling of the
obj_cgroup_charge() return value.  If a non-zero value is returned
(indicating that one of the memory.max limits has been exceeded), the
allocation should fail instead of falling back to non-accounted mode.

To make the code more readable, move the memcg_slab_pre_alloc_hook()
and memcg_slab_post_alloc_hook() calling conditions into the bodies of
these hooks.
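For illustration only, a stand-alone model of the intended control flow;
this is not the kernel code, and struct objcg, pre_alloc_hook() and
accounted_alloc() are invented stand-ins.  The point is that when the
charge would exceed the limit, the hook returns false and the allocation
fails instead of proceeding unaccounted:

#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct objcg { long charged; long limit; };

/* Returns false if the allocation should fail (charge over the limit). */
static bool pre_alloc_hook(struct objcg *cg, size_t bytes, struct objcg **objcgp)
{
        if (!cg)                        /* accounting disabled: proceed unaccounted */
                return true;
        if (cg->charged + (long)bytes > cg->limit)
                return false;           /* over the limit: the allocation fails */
        cg->charged += bytes;
        *objcgp = cg;
        return true;
}

static void *accounted_alloc(struct objcg *cg, size_t bytes)
{
        struct objcg *objcg = NULL;

        if (!pre_alloc_hook(cg, bytes, &objcg))
                return NULL;            /* the fix: no fallback to unaccounted mode */
        return malloc(bytes);
}

int main(void)
{
        struct objcg cg = { .charged = 0, .limit = 4096 };
        void *ok = accounted_alloc(&cg, 1024);  /* succeeds, charge recorded */
        void *no = accounted_alloc(&cg, 8192);  /* fails: would exceed the limit */

        free(ok);
        return no == NULL ? 0 : 1;
}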
Link: https://lkml.kernel.org/r/20201127161828.GD840171@carbon.dhcp.thefacebook.com
Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations")
Signed-off-by: Roman Gushchin
Reviewed-by: Shakeel Butt
Cc: Johannes Weiner
Cc: Michal Hocko
Cc:
Signed-off-by: Andrew Morton
---

 mm/slab.h |   40 ++++++++++++++++++++++++----------------
 1 file changed, 24 insertions(+), 16 deletions(-)

--- a/mm/slab.h~mm-memcg-slab-fix-obj_cgroup_charge-return-value-handling
+++ a/mm/slab.h
@@ -274,22 +274,32 @@ static inline size_t obj_full_size(struc
 	return s->size + sizeof(struct obj_cgroup *);
 }
 
-static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
-							size_t objects,
-							gfp_t flags)
+/*
+ * Returns false if the allocation should fail.
+ */
+static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+					     struct obj_cgroup **objcgp,
+					     size_t objects, gfp_t flags)
 {
 	struct obj_cgroup *objcg;
 
+	if (!memcg_kmem_enabled())
+		return true;
+
+	if (!(flags & __GFP_ACCOUNT) && !(s->flags & SLAB_ACCOUNT))
+		return true;
+
 	objcg = get_obj_cgroup_from_current();
 	if (!objcg)
-		return NULL;
+		return true;
 
 	if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
 		obj_cgroup_put(objcg);
-		return NULL;
+		return false;
 	}
 
-	return objcg;
+	*objcgp = objcg;
+	return true;
 }
 
 static inline void mod_objcg_state(struct obj_cgroup *objcg,
@@ -315,7 +325,7 @@ static inline void memcg_slab_post_alloc
 	unsigned long off;
 	size_t i;
 
-	if (!objcg)
+	if (!memcg_kmem_enabled() || !objcg)
 		return;
 
 	flags &= ~__GFP_ACCOUNT;
@@ -400,11 +410,11 @@ static inline void memcg_free_page_obj_c
 {
 }
 
-static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
-							size_t objects,
-							gfp_t flags)
+static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+					     struct obj_cgroup **objcgp,
+					     size_t objects, gfp_t flags)
 {
-	return NULL;
+	return true;
 }
 
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
@@ -508,9 +518,8 @@ static inline struct kmem_cache *slab_pr
 	if (should_failslab(s, flags))
 		return NULL;
 
-	if (memcg_kmem_enabled() &&
-	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
-		*objcgp = memcg_slab_pre_alloc_hook(s, size, flags);
+	if (!memcg_slab_pre_alloc_hook(s, objcgp, size, flags))
+		return NULL;
 
 	return s;
 }
@@ -529,8 +538,7 @@ static inline void slab_post_alloc_hook(
 			s->flags, flags);
 	}
 
-	if (memcg_kmem_enabled())
-		memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
+	memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
 }
 
 #ifndef CONFIG_SLOB

From patchwork Sun Dec 6 06:14:48 2020
X-Patchwork-Submitter: Andrew Morton
X-Patchwork-Id: 339135
Date: Sat, 05 Dec 2020 22:14:48 -0800
From: Andrew Morton
To: akpm@linux-foundation.org, guro@fb.com, ktkhai@virtuozzo.com, linux-mm@kvack.org, mm-commits@vger.kernel.org, shakeelb@google.com, shy828301@gmail.com, stable@vger.kernel.org, torvalds@linux-foundation.org, vdavydov.dev@gmail.com
Subject: [patch 04/12] mm: list_lru: set shrinker map bit when child nr_items is not zero
Message-ID: <20201206061448.D4xiqM6MX%akpm@linux-foundation.org>
In-Reply-To: <20201205221412.67f14b9b3a5ef531c76dd452@linux-foundation.org>

From: Yang Shi
Subject: mm: list_lru: set shrinker map bit when child nr_items is not zero

While investigating a slab cache bloat problem, a significant amount of
negative dentry cache was seen, but confusingly it was neither shrunk
by the reclaimer (the host has very tight memory) nor by dropping
caches.  The vmcore shows there are over 14M negative dentry objects on
the lru, but tracing shows they were not scanned at all.  Further
investigation shows the memcg's vfs shrinker_map bit is not set, so the
reclaimer and cache dropping simply skip calling the vfs shrinker, and
we had to reboot the hosts to get the memory back.

I didn't manage to come up with a reproducer in a test environment, and
the problem can't be reproduced after rebooting.  But by code
inspection there seems to be a race between clearing the shrinker map
bit and reparenting.  The hypothesis is elaborated below.

The memcg hierarchy in our production environment looks like:

                root
               /    \
          system    user

The main workloads run under the user slice's children, which creates
and removes memcgs frequently.  So reparenting happens very often under
the user slice, but no task is under the user slice directly.  With the
frequent reparenting and tight memory pressure, the below hypothetical
race condition may happen:

       CPU A                              CPU B
reparent
    dst->nr_items == 0
                                          shrinker:
                                              total_objects == 0
    add src->nr_items to dst
    set_bit
                                              return SHRINK_EMPTY
                                              clear_bit
child memcg offline
    replace child's kmemcg_id with
    parent's (in memcg_offline_kmem())
                                          list_lru_del() between shrinker runs
                                              see parent's kmemcg_id
                                              dec dst->nr_items
reparent again
    dst->nr_items may go negative
    due to concurrent list_lru_del()
                                          The second run of shrinker:
                                              read nr_items without any
                                              synchronization, so it may
                                              see intermediate negative
                                              nr_items then total_objects
                                              may return 0 coincidently

                                              keep the bit cleared
    dst->nr_items != 0
    skip set_bit
    add src->nr_items to dst

After this point dst->nr_items may never go back to zero, so
reparenting will never set the shrinker_map bit again.  And since there
is no task under the user slice directly, no new object will be added
to its lru to set the shrinker map bit either.  The bit stays cleared
forever.

How does list_lru_del() race with reparenting?  Reparenting replaces
the children's kmemcg_id with the parent's without the protection of
nlru->lock, so list_lru_del() may see the parent's kmemcg_id while it
is actually deleting items from the child's lru, and it decrements the
parent's nr_items.  The parent's nr_items may therefore go negative, as
commit 2788cf0c401c268b4819c5407493a8769b7007aa ("memcg: reparent
list_lrus and free kmemcg_id on css offline") says.

Since it is impossible for dst->nr_items to go negative and
src->nr_items to go to zero at the same time, it seems we can set the
shrinker map bit iff src->nr_items != 0.  We could synchronize
list_lru_count_one() and reparenting with nlru->lock, but checking
src->nr_items in reparenting is the simplest way and avoids lock
contention.
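For illustration only, a simplified stand-alone model of the fixed
reparenting step; the kernel's nlru->lock, list splicing and memcg
plumbing are omitted and the names are invented.  The shrinker bit is
set whenever the child actually contributed items, independent of the
parent's possibly-negative count:

#include <stdbool.h>
#include <stdio.h>

struct lru_one { long nr_items; };

static bool shrinker_bit;       /* stands in for the per-memcg shrinker map bit */

static void reparent(struct lru_one *src, struct lru_one *dst)
{
        /* the child's list is spliced onto the parent's list here (omitted) */

        if (src->nr_items) {
                dst->nr_items += src->nr_items;
                /*
                 * Set the bit whenever the child contributed items.  The old
                 * check (!dst->nr_items && src->nr_items) could skip this when
                 * a racing list_lru_del() had driven dst->nr_items negative.
                 */
                shrinker_bit = true;
                src->nr_items = 0;
        }
}

int main(void)
{
        struct lru_one child = { .nr_items = 3 };
        struct lru_one parent = { .nr_items = -1 };     /* racing del already seen */

        reparent(&child, &parent);
        printf("parent=%ld bit=%d\n", parent.nr_items, shrinker_bit);
        return 0;
}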
Link: https://lkml.kernel.org/r/20201202171749.264354-1-shy828301@gmail.com
Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
Signed-off-by: Yang Shi
Suggested-by: Roman Gushchin
Reviewed-by: Roman Gushchin
Acked-by: Kirill Tkhai
Reviewed-by: Shakeel Butt
Cc: Vladimir Davydov
Cc: [4.19]
Signed-off-by: Andrew Morton
---

 mm/list_lru.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

--- a/mm/list_lru.c~mm-list_lru-set-shrinker-map-bit-when-child-nr_items-is-not-zero
+++ a/mm/list_lru.c
@@ -534,7 +534,6 @@ static void memcg_drain_list_lru_node(st
 	struct list_lru_node *nlru = &lru->node[nid];
 	int dst_idx = dst_memcg->kmemcg_id;
 	struct list_lru_one *src, *dst;
-	bool set;
 
 	/*
 	 * Since list_lru_{add,del} may be called under an IRQ-safe lock,
@@ -546,11 +545,12 @@ static void memcg_drain_list_lru_node(st
 	dst = list_lru_from_memcg_idx(nlru, dst_idx);
 
 	list_splice_init(&src->list, &dst->list);
-	set = (!dst->nr_items && src->nr_items);
-	dst->nr_items += src->nr_items;
-	if (set)
+
+	if (src->nr_items) {
+		dst->nr_items += src->nr_items;
 		memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
-	src->nr_items = 0;
+		src->nr_items = 0;
+	}
 
 	spin_unlock_irq(&nlru->lock);
 }

From patchwork Sun Dec 6 06:14:51 2020
X-Patchwork-Submitter: Andrew Morton
X-Patchwork-Id: 339677
Date: Sat, 05 Dec 2020 22:14:51 -0800
From: Andrew Morton
To: akpm@linux-foundation.org, harish@linux.ibm.com, hch@infradead.org, linux-mm@kvack.org, minchan@kernel.org, mm-commits@vger.kernel.org, sergey.senozhatsky@gmail.com, stable@vger.kernel.org, tony@atomide.com, torvalds@linux-foundation.org, urezki@gmail.com
Subject: [patch 05/12] mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING
Message-ID: <20201206061451.IlxWUdUQr%akpm@linux-foundation.org>
In-Reply-To: <20201205221412.67f14b9b3a5ef531c76dd452@linux-foundation.org>

From: Minchan Kim
Subject: mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING

While doing zram testing, I found that decompression sometimes failed
because the compression buffer was corrupted.  On investigation, I
found that the commit below calls cond_resched() unconditionally, which
can cause a problem in atomic context if the task is rescheduled.

[   55.109012] BUG: sleeping function called from invalid context at mm/vmalloc.c:108
[   55.110774] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog
[   55.111973] 3 locks held by memhog/946:
[   55.112807] #0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160
[   55.114151] #1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160
[   55.115848] #2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0
[   55.118947] CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316
[   55.121265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[   55.122540] Call Trace:
[   55.122974]  dump_stack+0x8b/0xb8
[   55.123588]  ___might_sleep.cold+0xb6/0xc6
[   55.124328]  unmap_kernel_range_noflush+0x2eb/0x350
[   55.125198]  unmap_kernel_range+0x14/0x30
[   55.125920]  zs_unmap_object+0xd5/0xe0
[   55.126604]  zram_bvec_rw.isra.0+0x38c/0x8e0
[   55.127462]  zram_rw_page+0x90/0x101
[   55.128199]  bdev_write_page+0x92/0xe0
[   55.128957]  ? swap_slot_free_notify+0xb0/0xb0
[   55.129841]  __swap_writepage+0x94/0x4a0
[   55.130636]  ? do_raw_spin_unlock+0x4b/0xa0
[   55.131462]  ? _raw_spin_unlock+0x1f/0x30
[   55.132261]  ? page_swapcount+0x6c/0x90
[   55.133038]  pageout+0xe3/0x3a0
[   55.133702]  shrink_page_list+0xb94/0xd60
[   55.134626]  shrink_inactive_list+0x158/0x460

We can fix this by removing the ZSMALLOC_PGTABLE_MAPPING feature (which
contains the offending calling code) from zsmalloc.

Even though this option showed some improvement (e.g., 30%) on some
arm32 platforms, it has been a headache to maintain since it abuses
APIs[1] (e.g., unmap_kernel_range in atomic context).  Since we are
moving toward deprecating 32-bit machines, the config option has been
available only for builtin builds since v5.8, and it has never been the
default in zsmalloc, it's time to drop the option for better
maintenance.
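For illustration only, a rough user-space sketch of the copy-based
mapping that zsmalloc keeps after this removal; copy_map_object() and
the buffer handling are simplified stand-ins, not the kernel code.  An
object spanning two pages is copied into a buffer rather than mapped
through page tables, so the atomic mapping path needs no sleeping
vmalloc-layer calls:

#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Copy the part of the object in each page into one contiguous buffer. */
static void copy_map_object(char *buf, const char *page0, const char *page1,
                            int off, int size)
{
        int first = PAGE_SIZE - off;    /* bytes that live in the first page */

        memcpy(buf, page0 + off, first);
        memcpy(buf + first, page1, size - first);
}

int main(void)
{
        static char page0[PAGE_SIZE], page1[PAGE_SIZE], buf[64];

        memcpy(page0 + PAGE_SIZE - 2, "he", 2);   /* object tail of first page */
        memcpy(page1, "llo", 4);                  /* rest at start of next page */
        copy_map_object(buf, page0, page1, PAGE_SIZE - 2, 6);
        printf("%s\n", buf);                      /* prints "hello" */
        return 0;
}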
[1] http://lore.kernel.org/linux-mm/20201105170249.387069-1-minchan@kernel.org

Link: https://lkml.kernel.org/r/20201117202916.GA3856507@google.com
Fixes: e47110e90584 ("mm/vunmap: add cond_resched() in vunmap_pmd_range")
Signed-off-by: Minchan Kim
Reviewed-by: Sergey Senozhatsky
Cc: Tony Lindgren
Cc: Christoph Hellwig
Cc: Harish Sriram
Cc: Uladzislau Rezki
Cc:
Signed-off-by: Andrew Morton
---

 arch/arm/configs/omap2plus_defconfig |    1 
 include/linux/zsmalloc.h             |    1 
 mm/Kconfig                           |   13 ------
 mm/zsmalloc.c                        |   54 -------------------------
 4 files changed, 69 deletions(-)

--- a/arch/arm/configs/omap2plus_defconfig~mm-zsmallocc-drop-zsmalloc_pgtable_mapping
+++ a/arch/arm/configs/omap2plus_defconfig
@@ -81,7 +81,6 @@ CONFIG_PARTITION_ADVANCED=y
 CONFIG_BINFMT_MISC=y
 CONFIG_CMA=y
 CONFIG_ZSMALLOC=m
-CONFIG_ZSMALLOC_PGTABLE_MAPPING=y
 CONFIG_NET=y
 CONFIG_PACKET=y
 CONFIG_UNIX=y
--- a/include/linux/zsmalloc.h~mm-zsmallocc-drop-zsmalloc_pgtable_mapping
+++ a/include/linux/zsmalloc.h
@@ -20,7 +20,6 @@
  * zsmalloc mapping modes
  *
  * NOTE: These only make a difference when a mapped object spans pages.
- * They also have no effect when ZSMALLOC_PGTABLE_MAPPING is selected.
  */
 enum zs_mapmode {
 	ZS_MM_RW, /* normal read-write mapping */
--- a/mm/Kconfig~mm-zsmallocc-drop-zsmalloc_pgtable_mapping
+++ a/mm/Kconfig
@@ -707,19 +707,6 @@ config ZSMALLOC
 	  returned by an alloc().  This handle must be mapped in order to
 	  access the allocated space.
 
-config ZSMALLOC_PGTABLE_MAPPING
-	bool "Use page table mapping to access object in zsmalloc"
-	depends on ZSMALLOC=y
-	help
-	  By default, zsmalloc uses a copy-based object mapping method to
-	  access allocations that span two pages. However, if a particular
-	  architecture (ex, ARM) performs VM mapping faster than copying,
-	  then you should select this. This causes zsmalloc to use page table
-	  mapping rather than copying for object mapping.
-
-	  You can check speed with zsmalloc benchmark:
-	  https://github.com/spartacus06/zsmapbench
-
 config ZSMALLOC_STAT
 	bool "Export zsmalloc statistics"
 	depends on ZSMALLOC
--- a/mm/zsmalloc.c~mm-zsmallocc-drop-zsmalloc_pgtable_mapping
+++ a/mm/zsmalloc.c
@@ -293,11 +293,7 @@ struct zspage {
 };
 
 struct mapping_area {
-#ifdef CONFIG_ZSMALLOC_PGTABLE_MAPPING
-	struct vm_struct *vm; /* vm area for mapping object that span pages */
-#else
 	char *vm_buf; /* copy buffer for objects that span pages */
-#endif
 	char *vm_addr; /* address of kmap_atomic()'ed pages */
 	enum zs_mapmode vm_mm; /* mapping mode */
 };
@@ -1113,54 +1109,6 @@ static struct zspage *find_get_zspage(st
 	return zspage;
 }
 
-#ifdef CONFIG_ZSMALLOC_PGTABLE_MAPPING
-static inline int __zs_cpu_up(struct mapping_area *area)
-{
-	/*
-	 * Make sure we don't leak memory if a cpu UP notification
-	 * and zs_init() race and both call zs_cpu_up() on the same cpu
-	 */
-	if (area->vm)
-		return 0;
-	area->vm = get_vm_area(PAGE_SIZE * 2, 0);
-	if (!area->vm)
-		return -ENOMEM;
-
-	/*
-	 * Populate ptes in advance to avoid pte allocation with GFP_KERNEL
-	 * in non-preemtible context of zs_map_object.
-	 */
-	return apply_to_page_range(&init_mm, (unsigned long)area->vm->addr,
-			PAGE_SIZE * 2, NULL, NULL);
-}
-
-static inline void __zs_cpu_down(struct mapping_area *area)
-{
-	if (area->vm)
-		free_vm_area(area->vm);
-	area->vm = NULL;
-}
-
-static inline void *__zs_map_object(struct mapping_area *area,
-				struct page *pages[2], int off, int size)
-{
-	unsigned long addr = (unsigned long)area->vm->addr;
-
-	BUG_ON(map_kernel_range(addr, PAGE_SIZE * 2, PAGE_KERNEL, pages) < 0);
-	area->vm_addr = area->vm->addr;
-	return area->vm_addr + off;
-}
-
-static inline void __zs_unmap_object(struct mapping_area *area,
-				struct page *pages[2], int off, int size)
-{
-	unsigned long addr = (unsigned long)area->vm_addr;
-
-	unmap_kernel_range(addr, PAGE_SIZE * 2);
-}
-
-#else /* CONFIG_ZSMALLOC_PGTABLE_MAPPING */
-
 static inline int __zs_cpu_up(struct mapping_area *area)
 {
 	/*
@@ -1241,8 +1189,6 @@ out:
 	pagefault_enable();
 }
 
-#endif /* CONFIG_ZSMALLOC_PGTABLE_MAPPING */
-
 static int zs_cpu_prepare(unsigned int cpu)
 {
 	struct mapping_area *area;

From patchwork Sun Dec 6 06:14:55 2020
X-Patchwork-Submitter: Andrew Morton
X-Patchwork-Id: 339134
Date: Sat, 05 Dec 2020 22:14:55 -0800
From: Andrew Morton
To: akpm@linux-foundation.org, hughd@google.com, linux-mm@kvack.org, mm-commits@vger.kernel.org, qcai@redhat.com, stable@vger.kernel.org, torvalds@linux-foundation.org
Subject: [patch 06/12] mm/swapfile: do not sleep with a spin lock held
Message-ID: <20201206061455.fGLHX_jlo%akpm@linux-foundation.org>
In-Reply-To: <20201205221412.67f14b9b3a5ef531c76dd452@linux-foundation.org>

From: Qian Cai
Subject: mm/swapfile: do not sleep with a spin lock held

We can't call kvfree() with a spin lock held, so defer it.  This fixes
a might_sleep() runtime warning.
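For illustration only, a stand-alone sketch of the deferral pattern the
fix uses; struct table, install_or_defer() and the pthread mutex
(standing in for the kernel spinlock) are invented for this example.
Memory found to be unneeded while the lock is held is stashed and freed
only after the lock is dropped, because the real free path (kvfree) may
sleep:

#include <stdlib.h>
#include <pthread.h>

struct table {
        pthread_mutex_t lock;
        void *slot;
};

static void install_or_defer(struct table *t, void *fresh)
{
        void *defer = NULL;

        pthread_mutex_lock(&t->lock);
        if (t->slot)
                defer = fresh;  /* lost the race: keep the existing entry,
                                   but do not free here (the free may sleep) */
        else
                t->slot = fresh;
        pthread_mutex_unlock(&t->lock);

        free(defer);            /* like kvfree(), free(NULL) is a no-op */
}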
Link: https://lkml.kernel.org/r/20201202151549.10350-1-qcai@redhat.com
Fixes: 873d7bcfd066 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation")
Signed-off-by: Qian Cai
Reviewed-by: Andrew Morton
Cc: Hugh Dickins
Cc:
Signed-off-by: Andrew Morton
---

 mm/swapfile.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/mm/swapfile.c~mm-swapfile-do-not-sleep-with-a-spin-lock-held
+++ a/mm/swapfile.c
@@ -2867,6 +2867,7 @@ late_initcall(max_swapfiles_check);
 static struct swap_info_struct *alloc_swap_info(void)
 {
 	struct swap_info_struct *p;
+	struct swap_info_struct *defer = NULL;
 	unsigned int type;
 	int i;
 
@@ -2895,7 +2896,7 @@ static struct swap_info_struct *alloc_sw
 		smp_wmb();
 		WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1);
 	} else {
-		kvfree(p);
+		defer = p;
 		p = swap_info[type];
 		/*
 		 * Do not memset this entry: a racing procfs swap_next()
@@ -2908,6 +2909,7 @@ static struct swap_info_struct *alloc_sw
 		plist_node_init(&p->avail_lists[i], 0);
 	p->flags = SWP_USED;
 	spin_unlock(&swap_lock);
+	kvfree(defer);
 	spin_lock_init(&p->lock);
 	spin_lock_init(&p->cont_lock);

From patchwork Sun Dec 6 06:15:12 2020
X-Patchwork-Submitter: Andrew Morton
X-Patchwork-Id: 339133
Date: Sat, 05 Dec 2020 22:15:12 -0800
From: Andrew Morton
To: akpm@linux-foundation.org, almasrymina@google.com, amorenoz@redhat.com, gthelen@google.com, linux-mm@kvack.org, mike.kravetz@oracle.com, mm-commits@vger.kernel.org, rientjes@google.com, sandipan@linux.ibm.com, shakeelb@google.com, shuah@kernel.org, stable@vger.kernel.org, torvalds@linux-foundation.org
Subject: [patch 11/12] hugetlb_cgroup: fix offline of hugetlb cgroup with reservations
Message-ID: <20201206061512.FBawJ1qq2%akpm@linux-foundation.org>
In-Reply-To: <20201205221412.67f14b9b3a5ef531c76dd452@linux-foundation.org>

From: Mike Kravetz
Subject: hugetlb_cgroup: fix offline of hugetlb cgroup with reservations

Adrian Moreno was running a kubernetes 1.19 + containerd/docker
workload using hugetlbfs.
In this environment the issue is reproduced by:

1 - Start a simple pod that uses the recently added HugePages medium
    feature (pod yaml attached).
2 - Start a DPDK app.  It doesn't need to run successfully (as in
    transfer packets) nor interact with real hardware.  It seems just
    initializing the EAL layer (which handles hugepage reservation and
    locking) is enough to trigger the issue.
3 - Delete the Pod (or let it "Complete").

This results in a kworker thread going into a tight loop (top output):

 1425 root      20   0       0      0      0 R  99.7  0.0   5:22.45 kworker/28:7+cgroup_destroy

'perf top -g' reports:

-   63.28%     0.01%  [kernel]        [k] worker_thread
   - 49.97% worker_thread
      - 52.64% process_one_work
         - 62.08% css_killed_work_fn
            - hugetlb_cgroup_css_offline
                 41.52% _raw_spin_lock
               - 2.82% _cond_resched
                    rcu_all_qs
                 2.66% PageHuge
      - 0.57% schedule
         - 0.57% __schedule

We are spinning in the do-while loop in hugetlb_cgroup_css_offline.
Worse yet, we are holding the master cgroup lock (cgroup_mutex) while
infinitely spinning.  Little else can be done on the system as the
cgroup_mutex can not be acquired.

Do note that the issue can be reproduced by simply offlining a hugetlb
cgroup containing pages with reservation counts.

The loop in hugetlb_cgroup_css_offline is moving page counts from the
cgroup being offlined to the parent cgroup.  This is done for each
hstate, and is repeated until hugetlb_cgroup_have_usage returns false.
The routine moving counts (hugetlb_cgroup_move_parent) is only moving
'usage' counts.  The routine hugetlb_cgroup_have_usage is checking for
both 'usage' and 'reservation' counts.  What to do with reservation
counts when reparenting was discussed here:

https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/

The decision was made to leave a zombie cgroup behind for cgroups with
reservation counts.  Unfortunately, the code checking reservation
counts was incorrectly added to hugetlb_cgroup_have_usage.

To fix the issue, simply remove the check for reservation counts.
While fixing this issue, a related bug in hugetlb_cgroup_css_offline
was noticed: the hstate index is not reinitialized each time through
the do-while loop.  Fix this as well.

Link: https://lkml.kernel.org/r/20201203220242.158165-1-mike.kravetz@oracle.com
Fixes: 1adc4d419aa2 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
Reported-by: Adrian Moreno
Tested-by: Adrian Moreno
Signed-off-by: Mike Kravetz
Reviewed-by: Shakeel Butt
Cc: Mina Almasry
Cc: David Rientjes
Cc: Greg Thelen
Cc: Sandipan Das
Cc: Shuah Khan
Cc:
Signed-off-by: Andrew Morton
---

 mm/hugetlb_cgroup.c |    8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

--- a/mm/hugetlb_cgroup.c~hugetlb_cgroup-fix-offline-of-hugetlb-cgroup-with-reservations
+++ a/mm/hugetlb_cgroup.c
@@ -82,11 +82,8 @@ static inline bool hugetlb_cgroup_have_u
 
 	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
 		if (page_counter_read(
-			    hugetlb_cgroup_counter_from_cgroup(h_cg, idx)) ||
-		    page_counter_read(hugetlb_cgroup_counter_from_cgroup_rsvd(
-			    h_cg, idx))) {
+			    hugetlb_cgroup_counter_from_cgroup(h_cg, idx)))
 			return true;
-		}
 	}
 	return false;
 }
@@ -202,9 +199,10 @@ static void hugetlb_cgroup_css_offline(s
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
 	struct hstate *h;
 	struct page *page;
-	int idx = 0;
+	int idx;
 
 	do {
+		idx = 0;
 		for_each_hstate(h) {
 			spin_lock(&hugetlb_lock);
 			list_for_each_entry(page, &h->hugepage_activelist, lru)