From patchwork Wed Apr 12 17:36:31 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Sumit Semwal <sumit.semwal@linaro.org>
X-Patchwork-Id: 97318
Delivered-To: patch@linaro.org
Received: by 10.140.109.52 with SMTP id k49csp371894qgf;
 Wed, 12 Apr 2017 10:36:56 -0700 (PDT)
X-Received: by 10.98.201.212 with SMTP id l81mr66954851pfk.13.1492018616803; 
 Wed, 12 Apr 2017 10:36:56 -0700 (PDT)
Return-Path: <stable-owner@vger.kernel.org>
Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67])
 by mx.google.com with ESMTP id
 i63si21086788pgc.265.2017.04.12.10.36.56; 
 Wed, 12 Apr 2017 10:36:56 -0700 (PDT)
Received-SPF: pass (google.com: best guess record for domain of
 stable-owner@vger.kernel.org designates 209.132.180.67 as
 permitted sender) client-ip=209.132.180.67; 
Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org;
 spf=pass (google.com: best guess record for domain of
 stable-owner@vger.kernel.org designates 209.132.180.67 as
 permitted sender) smtp.mailfrom=stable-owner@vger.kernel.org; 
 dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S1754249AbdDLRgz (ORCPT <rfc822;sumit.semwal@linaro.org>
 + 6 others); Wed, 12 Apr 2017 13:36:55 -0400
Received: from mail-pf0-f174.google.com ([209.85.192.174]:36499 "EHLO
 mail-pf0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1753876AbdDLRgy (ORCPT
 <rfc822;stable@vger.kernel.org>); Wed, 12 Apr 2017 13:36:54 -0400
Received: by mail-pf0-f174.google.com with SMTP id o126so16900778pfb.3
 for <stable@vger.kernel.org>; Wed, 12 Apr 2017 10:36:54 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; 
 h=from:to:cc:subject:date:message-id:in-reply-to:references;
 bh=TLGdhXuIEo8FhIVFkHw19F0Hfo16ig1l3XzWIZZchGA=;
 b=AE5ZRl6tS102Bvs/VXgub9GwpQ8nCHysGAPvucri5c54Gw+1UyXDFZbVeIYoPhAiGN
 e1Ltw42jw4Z31ka60SH1goGGLqLtpjsbOYDMeKvnPixyQrEmDcS6N8uiD19lSJ9Umk+X
 HZryZk/ZcvjZ0YyBJO2H5hFpOktieU1AJPkUk=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
 :references;
 bh=TLGdhXuIEo8FhIVFkHw19F0Hfo16ig1l3XzWIZZchGA=;
 b=nsICR+WWl99gtCiI0cFlmUTnumbxsSGPFjiTTRFeLHp99WLyaC5PjVNJm5wdlPSG4K
 qLmQIosZcDqa4zcYE9qvma/Qp34NLcjIdRU8tlzdXivqnbEKpXIlHb07zSmvsCyyzbaI
 ULcN0ePXcDL93wVefmusYa2PB3r1HqmgcWcmpuB1GZcjBySGcxOhNcudNpWsVlzkfNsy
 jCBl0+niBZJ10mKr0QONy1VJTqRWIiCGpHwp8EbmhjD85A/PR6lxnTybwoanzb1fb8YQ
 zhOryg7V92YzbImCYfPP9SQtH+rKe/AY3kGkq00ZmswS4fdGUkRfll0nDYIbQb5E1BBa
 kT9A==
X-Gm-Message-State: AFeK/H0GOpMYg9AlhSGzhcsN5FaBPMWdjb/caVi1K91qIMEZ+pQVzfC+KSb+FJi9eWEXujBc
X-Received: by 10.99.138.68 with SMTP id y65mr69527594pgd.73.1492018613077; 
 Wed, 12 Apr 2017 10:36:53 -0700 (PDT)
Received: from phantom.lan ([106.51.225.38]) by smtp.gmail.com with ESMTPSA id
 133sm31562648pfy.106.2017.04.12.10.36.49
 (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128);
 Wed, 12 Apr 2017 10:36:51 -0700 (PDT)
From: Sumit Semwal <sumit.semwal@linaro.org>
To: stable@vger.kernel.org
Cc: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>,
 Brian King <brking@linux.vnet.ibm.com>,
 Douglas Miller <dougmill@linux.vnet.ibm.com>,
 linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
 Jens Axboe <axboe@fb.com>, Sumit Semwal <sumit.semwal@linaro.org>
Subject: [PATCH for-4.9 1/5] blk-mq: Avoid memory reclaim when remapping queues
Date: Wed, 12 Apr 2017 23:06:31 +0530
Message-Id: <1492018595-13167-2-git-send-email-sumit.semwal@linaro.org>
X-Mailer: git-send-email 2.7.4
In-Reply-To: <1492018595-13167-1-git-send-email-sumit.semwal@linaro.org>
References: <1492018595-13167-1-git-send-email-sumit.semwal@linaro.org>
Sender: stable-owner@vger.kernel.org
Precedence: bulk
List-ID: <stable.vger.kernel.org>
X-Mailing-List: stable@vger.kernel.org

From: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>

[ Upstream commit 36e1f3d107867b25c616c2fd294f5a1c9d4e5d09 ]

While stressing memory and IO at the same time as changing SMT settings,
we were able to consistently trigger deadlocks in the mm system, which
froze the entire machine.

I think that under memory stress conditions, the large allocations
performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
waiting on the block layer remapping completion, thus deadlocking the
system.  The trace below was collected after the machine stalled,
waiting for the hotplug event completion.

The simplest fix for this is to make allocations in this path use
GFP_NOIO, so that reclaim cannot recurse into the I/O or filesystem
layers.  With this patch, we could no longer hit the issue.
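
As a rough illustration of why the flag change matters for the trace
below, here is a small userspace sketch.  It is not kernel code: the
SK_* names and bit values are made up for the example (the real
definitions live in include/linux/gfp.h and vary between versions), and
sk_fs_shrinker_may_run() is a hypothetical stand-in for the __GFP_FS
check that, as far as I understand, filesystem cache shrinkers such as
super_cache_scan() perform before touching filesystem locks.

```c
#include <stdbool.h>

/* Illustrative bit values only; not the kernel's actual encoding. */
#define SK_GFP_IO             0x01u  /* reclaim may start physical I/O */
#define SK_GFP_FS             0x02u  /* reclaim may call into filesystems */
#define SK_GFP_DIRECT_RECLAIM 0x04u  /* caller may reclaim synchronously */
#define SK_GFP_KSWAPD_RECLAIM 0x08u  /* allocation may wake kswapd */

#define SK_GFP_RECLAIM (SK_GFP_DIRECT_RECLAIM | SK_GFP_KSWAPD_RECLAIM)

/* Same composition as the kernel masks, with the sketch's bits. */
#define SK_GFP_KERNEL (SK_GFP_RECLAIM | SK_GFP_IO | SK_GFP_FS)
#define SK_GFP_NOIO   (SK_GFP_RECLAIM)

/*
 * Hypothetical model of a filesystem shrinker's entry check: it only
 * proceeds when the allocation context permits filesystem recursion.
 */
static bool sk_fs_shrinker_may_run(unsigned int gfp_mask)
{
	return (gfp_mask & SK_GFP_FS) != 0;
}
```

Under this model a GFP_KERNEL allocation lets the filesystem shrinker
run (the xfs_reclaim_inodes_* frames in the trace), while a GFP_NOIO
allocation still allows direct reclaim but skips that shrinker.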

This should apply on top of Jens's for-next branch cleanly.

Changes since v1:
  - Use GFP_NOIO instead of GFP_NOWAIT.
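
The changelog does not say why v1's GFP_NOWAIT was dropped.  One
plausible reading, sketched below with made-up SK_* names and
illustrative bit values (assumptions, not the kernel's real encoding),
is that GFP_NOWAIT forbids the caller from sleeping in direct reclaim
at all, which makes these large allocations more likely to fail,
whereas GFP_NOIO may still block and reclaim but without starting I/O.

```c
#include <stdbool.h>

/* Illustrative bit values only; not the kernel's actual encoding. */
#define SK_GFP_IO             0x01u  /* reclaim may start physical I/O */
#define SK_GFP_FS             0x02u  /* reclaim may call into filesystems */
#define SK_GFP_DIRECT_RECLAIM 0x04u  /* caller may reclaim synchronously */
#define SK_GFP_KSWAPD_RECLAIM 0x08u  /* allocation may wake kswapd */

/* Same composition as the kernel masks, with the sketch's bits. */
#define SK_GFP_NOWAIT (SK_GFP_KSWAPD_RECLAIM)
#define SK_GFP_NOIO   (SK_GFP_DIRECT_RECLAIM | SK_GFP_KSWAPD_RECLAIM)

/* Hypothetical helper: may the allocating task sleep in direct reclaim? */
static bool sk_allows_blocking(unsigned int gfp_mask)
{
	return (gfp_mask & SK_GFP_DIRECT_RECLAIM) != 0;
}

/* Hypothetical helper: may reclaim issue I/O on behalf of this allocation? */
static bool sk_allows_io(unsigned int gfp_mask)
{
	return (gfp_mask & SK_GFP_IO) != 0;
}
```

In this sketch neither mask permits I/O, but only SK_GFP_NOIO lets the
caller block, so it gives the allocator more room to satisfy the request.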

 Call Trace:
[c000000f0160aaf0] [c000000f0160ab50] 0xc000000f0160ab50 (unreliable)
[c000000f0160acc0] [c000000000016624] __switch_to+0x2e4/0x430
[c000000f0160ad20] [c000000000b1a880] __schedule+0x310/0x9b0
[c000000f0160ae00] [c000000000b1af68] schedule+0x48/0xc0
[c000000f0160ae30] [c000000000b1b4b0] schedule_preempt_disabled+0x20/0x30
[c000000f0160ae50] [c000000000b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
[c000000f0160aed0] [c000000000b1d678] mutex_lock+0x78/0xa0
[c000000f0160af00] [d000000019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
[c000000f0160b0b0] [d000000019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
[c000000f0160b0f0] [d0000000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
[c000000f0160b120] [c0000000003172c8] super_cache_scan+0x1f8/0x210
[c000000f0160b190] [c00000000026301c] shrink_slab.part.13+0x21c/0x4c0
[c000000f0160b2d0] [c000000000268088] shrink_zone+0x2d8/0x3c0
[c000000f0160b380] [c00000000026834c] do_try_to_free_pages+0x1dc/0x520
[c000000f0160b450] [c00000000026876c] try_to_free_pages+0xdc/0x250
[c000000f0160b4e0] [c000000000251978] __alloc_pages_nodemask+0x868/0x10d0
[c000000f0160b6f0] [c000000000567030] blk_mq_init_rq_map+0x160/0x380
[c000000f0160b7a0] [c00000000056758c] blk_mq_map_swqueue+0x33c/0x360
[c000000f0160b820] [c000000000567904] blk_mq_queue_reinit+0x64/0xb0
[c000000f0160b850] [c00000000056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
[c000000f0160b8a0] [c0000000000f5d38] notifier_call_chain+0x98/0x100
[c000000f0160b8f0] [c0000000000c5fb0] __cpu_notify+0x70/0xe0
[c000000f0160b930] [c0000000000c63c4] notify_prepare+0x44/0xb0
[c000000f0160b9b0] [c0000000000c52f4] cpuhp_invoke_callback+0x84/0x250
[c000000f0160ba10] [c0000000000c570c] cpuhp_up_callbacks+0x5c/0x120
[c000000f0160ba60] [c0000000000c7cb8] _cpu_up+0xf8/0x1d0
[c000000f0160bac0] [c0000000000c7eb0] do_cpu_up+0x120/0x150
[c000000f0160bb40] [c0000000006fe024] cpu_subsys_online+0x64/0xe0
[c000000f0160bb90] [c0000000006f5124] device_online+0xb4/0x120
[c000000f0160bbd0] [c0000000006f5244] online_store+0xb4/0xc0
[c000000f0160bc20] [c0000000006f0a68] dev_attr_store+0x68/0xa0
[c000000f0160bc60] [c0000000003ccc30] sysfs_kf_write+0x80/0xb0
[c000000f0160bca0] [c0000000003cbabc] kernfs_fop_write+0x17c/0x250
[c000000f0160bcf0] [c00000000030fe6c] __vfs_write+0x6c/0x1e0
[c000000f0160bd90] [c000000000311490] vfs_write+0xd0/0x270
[c000000f0160bde0] [c0000000003131fc] SyS_write+0x6c/0x110
[c000000f0160be30] [c000000000009204] system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Douglas Miller <dougmill@linux.vnet.ibm.com>
Cc: linux-block@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
---
 block/blk-mq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

-- 
2.7.4

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ee54ad0..7b597ec 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1474,7 +1474,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 	INIT_LIST_HEAD(&tags->page_list);
 
 	tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
-				 GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
+				 GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
 				 set->numa_node);
 	if (!tags->rqs) {
 		blk_mq_free_tags(tags);
@@ -1500,7 +1500,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 
 		do {
 			page = alloc_pages_node(set->numa_node,
-				GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
+				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
 				this_order);
 			if (page)
 				break;
@@ -1521,7 +1521,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 		 * Allow kmemleak to scan these pages as they contain pointers
 		 * to additional allocations like via ops->init_request().
 		 */
-		kmemleak_alloc(p, order_to_size(this_order), 1, GFP_KERNEL);
+		kmemleak_alloc(p, order_to_size(this_order), 1, GFP_NOIO);
 		entries_per_page = order_to_size(this_order) / rq_size;
 		to_do = min(entries_per_page, set->queue_depth - i);
 		left -= to_do * rq_size;