From patchwork Tue Mar 12 08:59:27 2019
X-Patchwork-Submitter: Paolo Valente
X-Patchwork-Id: 160071
From: Paolo Valente
To: Jens Axboe
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, ulf.hansson@linaro.org, linus.walleij@linaro.org, broonie@kernel.org, bfq-iosched@googlegroups.com, oleksandr@natalenko.name, fra.fra.800@gmail.com, alessio.masola@gmail.com, holger@applied-asynchrony.com, Paolo Valente
Subject: [PATCH BUGFIX IMPROVEMENT V3 1/9] block, bfq: increase idling for weight-raised queues
Date: Tue, 12 Mar 2019 09:59:27 +0100
Message-Id: <20190312085935.11340-2-paolo.valente@linaro.org>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20190312085935.11340-1-paolo.valente@linaro.org>
References: <20190312085935.11340-1-paolo.valente@linaro.org>
If a sync bfq_queue has a higher weight than some other queue, and
remains temporarily empty while in service, then, to preserve the
bandwidth share of the queue, it is necessary to plug I/O dispatching
until a new request arrives for the queue. In addition, a timeout
needs to be set, to avoid waiting forever if the process associated
with the queue has actually finished its I/O.

Even with the above timeout, however, the device is not fed with new
I/O for a while if the process has finished its I/O. If this happens
often, then throughput drops and latencies grow. For this reason, the
timeout is kept rather low: 8 ms is the current default.

Unfortunately, such a low value may cause, on the opposite end, a
violation of bandwidth guarantees for a process that happens to issue
new I/O too late. The higher the system load, the higher the
probability that this happens to some process. This is a problem in
scenarios where service guarantees matter more than throughput. One
important case is that of weight-raised queues, which need to be
granted a very high fraction of the bandwidth.

To address this issue, this commit lower-bounds the plugging timeout
for weight-raised queues to 20 ms. This simple change provides
significant benefits. For example, on a PLEXTOR PX-256M5S, with which
gnome-terminal starts in 0.6 seconds if there is no other I/O in
progress, the same application starts in
- 0.8 seconds, instead of 1.2 seconds, if ten files are being read
  sequentially in parallel
- 1 second, instead of 2 seconds, if, in parallel, five files are
  being read sequentially, and five more files are being written
  sequentially

Tested-by: Holger Hoffstätte
Tested-by: Oleksandr Natalenko
Signed-off-by: Paolo Valente
---
 block/bfq-iosched.c | 2 ++
 1 file changed, 2 insertions(+)

--
2.20.1

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 4c592496a16a..eb658de3cc40 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2545,6 +2545,8 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
 	if (BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1 &&
 	    bfq_symmetric_scenario(bfqd))
 		sl = min_t(u64, sl, BFQ_MIN_TT);
+	else if (bfqq->wr_coeff > 1)
+		sl = max_t(u32, sl, 20ULL * NSEC_PER_MSEC);
 
 	bfqd->last_idling_start = ktime_get();
 	hrtimer_start(&bfqd->idle_slice_timer, ns_to_ktime(sl),
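(Illustration, not part of the patch: the effect of the new bound is easier to
see in isolation. Below is a minimal user-space sketch of the timeout
selection performed in bfq_arm_slice_timer() after this change. The helper
name pick_idle_slice() is made up, BFQ_MIN_TT is assumed to be BFQ's 2 ms
minimum think-time bound, and the kernel's min_t/max_t macros and hrtimer
plumbing are omitted.)

#include <stdint.h>
#include <stdio.h>

#define NSEC_PER_MSEC 1000000ULL
#define BFQ_MIN_TT (2ULL * NSEC_PER_MSEC)	/* assumed kernel value: 2 ms */

/* Hypothetical condensation of the slice choice in bfq_arm_slice_timer() */
static uint64_t pick_idle_slice(uint64_t sl, int seeky, unsigned int wr_coeff,
                                int symmetric)
{
        if (seeky && wr_coeff == 1 && symmetric) {
                /* seeky, non-weight-raised queue in a symmetric scenario:
                 * keep the idling window very short */
                if (sl > BFQ_MIN_TT)
                        sl = BFQ_MIN_TT;
        } else if (wr_coeff > 1 && sl < 20ULL * NSEC_PER_MSEC) {
                /* weight-raised queue: enforce the new 20 ms lower bound */
                sl = 20ULL * NSEC_PER_MSEC;
        }
        return sl;
}

int main(void)
{
        /* with the 8 ms default slice, a weight-raised queue now idles 20 ms */
        uint64_t sl = pick_idle_slice(8 * NSEC_PER_MSEC, 0, 30, 0);

        printf("%llu ms\n", (unsigned long long)(sl / NSEC_PER_MSEC));
        return 0;
}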
From patchwork Tue Mar 12 08:59:28 2019
X-Patchwork-Submitter: Paolo Valente
X-Patchwork-Id: 160072
From: Paolo Valente
To: Jens Axboe
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, ulf.hansson@linaro.org, linus.walleij@linaro.org, broonie@kernel.org, bfq-iosched@googlegroups.com, oleksandr@natalenko.name, fra.fra.800@gmail.com, alessio.masola@gmail.com, holger@applied-asynchrony.com, Paolo Valente
Subject: [PATCH BUGFIX IMPROVEMENT V3 2/9] block, bfq: do not idle for lowest-weight queues
Date: Tue, 12 Mar 2019 09:59:28 +0100
Message-Id: <20190312085935.11340-3-paolo.valente@linaro.org>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20190312085935.11340-1-paolo.valente@linaro.org>
References: <20190312085935.11340-1-paolo.valente@linaro.org>

In most cases, it is detrimental for throughput to plug I/O dispatch
when the in-service bfq_queue becomes temporarily empty (plugging is
performed to wait for the possible arrival, soon, of new I/O from the
in-service queue). There is however a case where plugging is needed
for service guarantees. If a bfq_queue, say Q, has a higher weight
than some other active bfq_queue, and is sync, i.e., contains sync
I/O, then, to guarantee that Q does receive a higher share of the
throughput than other lower-weight queues, it is necessary to plug
I/O dispatch when Q remains temporarily empty while being served.

For this reason, BFQ performs I/O plugging when some active bfq_queue
has a higher weight than some other active bfq_queue. But this is
overkill. In fact, if the in-service bfq_queue actually has a weight
lower than or equal to the other queues, then the queue need not be
guaranteed a higher share of the throughput than the other queues.
So, not plugging I/O cannot cause any harm to the queue, and can
boost throughput.

Taking advantage of this fact, this commit does not plug I/O for sync
bfq_queues with a weight lower than or equal to the weights of the
other queues. Here is an example of the resulting throughput boost
with the dbench workload, which is particularly nasty for BFQ. With
the dbench test in the Phoronix suite, BFQ reaches its lowest total
throughput with 6 clients on a filesystem with journaling, in case
the journaling daemon has a higher weight than normal processes.
Before this commit, the total throughput was ~80 MB/sec on a PLEXTOR
PX-256M5; after this commit, it is ~100 MB/sec.

Tested-by: Holger Hoffstätte
Tested-by: Oleksandr Natalenko
Signed-off-by: Paolo Valente
---
 block/bfq-iosched.c | 204 +++++++++++++++++++++++++-------------------
 block/bfq-iosched.h |   6 +-
 block/bfq-wf2q.c    |   2 +-
 3 files changed, 118 insertions(+), 94 deletions(-)

--
2.20.1

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index eb658de3cc40..2be504f25b09 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -629,12 +629,19 @@ void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 }
 
 /*
- * The following function returns true if every queue must receive the
- * same share of the throughput (this condition is used when deciding
- * whether idling may be disabled, see the comments in the function
- * bfq_better_to_idle()).
+ * The following function returns false either if every active queue
+ * must receive the same share of the throughput (symmetric scenario),
+ * or, as a special case, if bfqq must receive a share of the
+ * throughput lower than or equal to the share that every other active
+ * queue must receive. If bfqq does sync I/O, then these are the only
+ * two cases where bfqq happens to be guaranteed its share of the
+ * throughput even if I/O dispatching is not plugged when bfqq remains
+ * temporarily empty (for more details, see the comments in the
+ * function bfq_better_to_idle()). For this reason, the return value
+ * of this function is used to check whether I/O-dispatch plugging can
+ * be avoided.
  *
- * Such a scenario occurs when:
+ * The above first case (symmetric scenario) occurs when:
  * 1) all active queues have the same weight,
  * 2) all active queues belong to the same I/O-priority class,
  * 3) all active groups at the same level in the groups tree have the same
@@ -654,30 +661,36 @@ void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
  * support or the cgroups interface are not enabled, thus no state
  * needs to be maintained in this case.
  */
-static bool bfq_symmetric_scenario(struct bfq_data *bfqd)
+static bool bfq_asymmetric_scenario(struct bfq_data *bfqd,
+                                    struct bfq_queue *bfqq)
 {
+        bool smallest_weight = bfqq &&
+                bfqq->weight_counter &&
+                bfqq->weight_counter ==
+                container_of(
+                        rb_first_cached(&bfqd->queue_weights_tree),
+                        struct bfq_weight_counter,
+                        weights_node);
+
         /*
          * For queue weights to differ, queue_weights_tree must contain
          * at least two nodes.
          */
-        bool varied_queue_weights = !RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
-                (bfqd->queue_weights_tree.rb_node->rb_left ||
-                 bfqd->queue_weights_tree.rb_node->rb_right);
+        bool varied_queue_weights = !smallest_weight &&
+                !RB_EMPTY_ROOT(&bfqd->queue_weights_tree.rb_root) &&
+                (bfqd->queue_weights_tree.rb_root.rb_node->rb_left ||
+                 bfqd->queue_weights_tree.rb_root.rb_node->rb_right);
 
         bool multiple_classes_busy =
                 (bfqd->busy_queues[0] && bfqd->busy_queues[1]) ||
                 (bfqd->busy_queues[0] && bfqd->busy_queues[2]) ||
                 (bfqd->busy_queues[1] && bfqd->busy_queues[2]);
 
-        /*
-         * For queue weights to differ, queue_weights_tree must contain
-         * at least two nodes.
-         */
-        return !(varied_queue_weights || multiple_classes_busy
+        return varied_queue_weights || multiple_classes_busy
 #ifdef BFQ_GROUP_IOSCHED_ENABLED
                || bfqd->num_groups_with_pending_reqs > 0
 #endif
-               );
+                ;
 }
 
 /*
@@ -694,10 +707,11 @@ static bool bfq_symmetric_scenario(struct bfq_data *bfqd)
  * should be low too.
  */
 void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
-                          struct rb_root *root)
+                          struct rb_root_cached *root)
 {
         struct bfq_entity *entity = &bfqq->entity;
-        struct rb_node **new = &(root->rb_node), *parent = NULL;
+        struct rb_node **new = &(root->rb_root.rb_node), *parent = NULL;
+        bool leftmost = true;
 
         /*
          * Do not insert if the queue is already associated with a
@@ -726,8 +740,10 @@ void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
                 }
                 if (entity->weight < __counter->weight)
                         new = &((*new)->rb_left);
-                else
+                else {
                         new = &((*new)->rb_right);
+                        leftmost = false;
+                }
         }
 
         bfqq->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
@@ -736,7 +752,7 @@ void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
         /*
          * In the unlucky event of an allocation failure, we just
          * exit. This will cause the weight of queue to not be
-         * considered in bfq_symmetric_scenario, which, in its turn,
+         * considered in bfq_asymmetric_scenario, which, in its turn,
          * causes the scenario to be deemed wrongly symmetric in case
         * bfqq's weight would have been the only weight making the
         * scenario asymmetric. On the bright side, no unbalance will
@@ -750,7 +766,8 @@ void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
         bfqq->weight_counter->weight = entity->weight;
         rb_link_node(&bfqq->weight_counter->weights_node, parent, new);
-        rb_insert_color(&bfqq->weight_counter->weights_node, root);
+        rb_insert_color_cached(&bfqq->weight_counter->weights_node, root,
+                               leftmost);
 
 inc_counter:
         bfqq->weight_counter->num_active++;
@@ -765,7 +782,7 @@ void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
  */
 void __bfq_weights_tree_remove(struct bfq_data *bfqd,
                                struct bfq_queue *bfqq,
-                               struct rb_root *root)
+                               struct rb_root_cached *root)
 {
         if (!bfqq->weight_counter)
                 return;
@@ -774,7 +791,7 @@ void __bfq_weights_tree_remove(struct bfq_data *bfqd,
         if (bfqq->weight_counter->num_active > 0)
                 goto reset_entity_pointer;
 
-        rb_erase(&bfqq->weight_counter->weights_node, root);
+        rb_erase_cached(&bfqq->weight_counter->weights_node, root);
         kfree(bfqq->weight_counter);
 
 reset_entity_pointer:
@@ -889,7 +906,7 @@ static unsigned long bfq_serv_to_charge(struct request *rq,
                                         struct bfq_queue *bfqq)
 {
         if (bfq_bfqq_sync(bfqq) || bfqq->wr_coeff > 1 ||
-            !bfq_symmetric_scenario(bfqq->bfqd))
+            bfq_asymmetric_scenario(bfqq->bfqd, bfqq))
                 return blk_rq_sectors(rq);
 
         return blk_rq_sectors(rq) * bfq_async_charge_factor;
@@ -2543,7 +2560,7 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
          * queue).
          */
         if (BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1 &&
-            bfq_symmetric_scenario(bfqd))
+            !bfq_asymmetric_scenario(bfqd, bfqq))
                 sl = min_t(u64, sl, BFQ_MIN_TT);
         else if (bfqq->wr_coeff > 1)
                 sl = max_t(u32, sl, 20ULL * NSEC_PER_MSEC);
@@ -3500,8 +3517,9 @@ static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
 }
 
 /*
- * There is a case where idling must be performed not for
- * throughput concerns, but to preserve service guarantees.
+ * There is a case where idling does not have to be performed for
+ * throughput concerns, but to preserve the throughput share of
+ * the process associated with bfqq.
  *
  * To introduce this case, we can note that allowing the drive
  * to enqueue more than one request at a time, and hence
@@ -3517,77 +3535,83 @@ static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
  * concern about per-process throughput distribution, and
  * makes its decisions only on a per-request basis. Therefore,
  * the service distribution enforced by the drive's internal
- * scheduler is likely to coincide with the desired
- * device-throughput distribution only in a completely
- * symmetric scenario where:
- * (i) each of these processes must get the same throughput as
- *     the others;
- * (ii) the I/O of each process has the same properties, in
- *      terms of locality (sequential or random), direction
- *      (reads or writes), request sizes, greediness
- *      (from I/O-bound to sporadic), and so on.
- * In fact, in such a scenario, the drive tends to treat
- * the requests of each of these processes in about the same
- * way as the requests of the others, and thus to provide
- * each of these processes with about the same throughput
- * (which is exactly the desired throughput distribution). In
- * contrast, in any asymmetric scenario, device idling is
- * certainly needed to guarantee that bfqq receives its
- * assigned fraction of the device throughput (see [1] for
- * details).
- * The problem is that idling may significantly reduce
- * throughput with certain combinations of types of I/O and
- * devices. An important example is sync random I/O, on flash
- * storage with command queueing. So, unless bfqq falls in the
- * above cases where idling also boosts throughput, it would
- * be important to check conditions (i) and (ii) accurately,
- * so as to avoid idling when not strictly needed for service
- * guarantees.
+ * scheduler is likely to coincide with the desired throughput
+ * distribution only in a completely symmetric, or favorably
+ * skewed scenario where:
+ * (i-a) each of these processes must get the same throughput as
+ *       the others,
+ * (i-b) in case (i-a) does not hold, it holds that the process
+ *       associated with bfqq must receive a lower or equal
+ *       throughput than any of the other processes;
+ * (ii) the I/O of each process has the same properties, in
+ *      terms of locality (sequential or random), direction
+ *      (reads or writes), request sizes, greediness
+ *      (from I/O-bound to sporadic), and so on;
+
+ * In fact, in such a scenario, the drive tends to treat the requests
+ * of each process in about the same way as the requests of the
+ * others, and thus to provide each of these processes with about the
+ * same throughput. This is exactly the desired throughput
+ * distribution if (i-a) holds, or, if (i-b) holds instead, this is an
+ * even more convenient distribution for (the process associated with)
+ * bfqq.
+ *
+ * In contrast, in any asymmetric or unfavorable scenario, device
+ * idling (I/O-dispatch plugging) is certainly needed to guarantee
+ * that bfqq receives its assigned fraction of the device throughput
+ * (see [1] for details).
+ *
+ * The problem is that idling may significantly reduce throughput with
+ * certain combinations of types of I/O and devices. An important
+ * example is sync random I/O on flash storage with command
+ * queueing. So, unless bfqq falls in cases where idling also boosts
+ * throughput, it is important to check conditions (i-a), (i-b) and
+ * (ii) accurately, so as to avoid idling when not strictly needed for
+ * service guarantees.
  *
- * Unfortunately, it is extremely difficult to thoroughly
- * check condition (ii). And, in case there are active groups,
- * it becomes very difficult to check condition (i) too. In
- * fact, if there are active groups, then, for condition (i)
- * to become false, it is enough that an active group contains
- * more active processes or sub-groups than some other active
- * group. More precisely, for condition (i) to hold because of
- * such a group, it is not even necessary that the group is
- * (still) active: it is sufficient that, even if the group
- * has become inactive, some of its descendant processes still
- * have some request already dispatched but still waiting for
- * completion. In fact, requests have still to be guaranteed
- * their share of the throughput even after being
- * dispatched. In this respect, it is easy to show that, if a
- * group frequently becomes inactive while still having
- * in-flight requests, and if, when this happens, the group is
- * not considered in the calculation of whether the scenario
- * is asymmetric, then the group may fail to be guaranteed its
- * fair share of the throughput (basically because idling may
- * not be performed for the descendant processes of the group,
- * but it had to be). We address this issue with the
- * following bi-modal behavior, implemented in the function
- * bfq_symmetric_scenario().
+ * Unfortunately, it is extremely difficult to thoroughly check
+ * condition (ii). And, in case there are active groups, it becomes
+ * very difficult to check conditions (i-a) and (i-b) too. In fact,
+ * if there are active groups, then, for conditions (i-a) or (i-b) to
+ * become false 'indirectly', it is enough that an active group
+ * contains more active processes or sub-groups than some other active
+ * group. More precisely, for conditions (i-a) or (i-b) to become
+ * false because of such a group, it is not even necessary that the
+ * group is (still) active: it is sufficient that, even if the group
+ * has become inactive, some of its descendant processes still have
+ * some request already dispatched but still waiting for
+ * completion. In fact, requests have still to be guaranteed their
+ * share of the throughput even after being dispatched. In this
+ * respect, it is easy to show that, if a group frequently becomes
+ * inactive while still having in-flight requests, and if, when this
+ * happens, the group is not considered in the calculation of whether
+ * the scenario is asymmetric, then the group may fail to be
+ * guaranteed its fair share of the throughput (basically because
+ * idling may not be performed for the descendant processes of the
+ * group, but it had to be). We address this issue with the following
+ * bi-modal behavior, implemented in the function
+ * bfq_asymmetric_scenario().
  *
  * If there are groups with requests waiting for completion
  * (as commented above, some of these groups may even be
  * already inactive), then the scenario is tagged as
  * asymmetric, conservatively, without checking any of the
- * conditions (i) and (ii). So the device is idled for bfqq.
+ * conditions (i-a), (i-b) or (ii). So the device is idled for bfqq.
  * This behavior matches also the fact that groups are created
  * exactly if controlling I/O is a primary concern (to
  * preserve bandwidth and latency guarantees).
  *
- * On the opposite end, if there are no groups with requests
- * waiting for completion, then only condition (i) is actually
- * controlled, i.e., provided that condition (i) holds, idling
- * is not performed, regardless of whether condition (ii)
- * holds. In other words, only if condition (i) does not hold,
- * then idling is allowed, and the device tends to be
- * prevented from queueing many requests, possibly of several
- * processes. Since there are no groups with requests waiting
- * for completion, then, to control condition (i) it is enough
- * to check just whether all the queues with requests waiting
- * for completion also have the same weight.
+ * On the opposite end, if there are no groups with requests waiting
+ * for completion, then only conditions (i-a) and (i-b) are actually
+ * controlled, i.e., provided that condition (i-a) or (i-b) holds,
+ * idling is not performed, regardless of whether condition (ii)
+ * holds. In other words, only if conditions (i-a) and (i-b) do not
+ * hold, then idling is allowed, and the device tends to be prevented
+ * from queueing many requests, possibly of several processes. Since
+ * there are no groups with requests waiting for completion, then, to
+ * control conditions (i-a) and (i-b) it is enough to check just
+ * whether all the queues with requests waiting for completion also
+ * have the same weight.
  *
  * Not checking condition (ii) evidently exposes bfqq to the
  * risk of getting less throughput than its fair share.
@@ -3639,7 +3663,7 @@ static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
          * compound condition that is checked below for deciding
          * whether the scenario is asymmetric. To explain this
          * compound condition, we need to add that the function
-         * bfq_symmetric_scenario checks the weights of only
+         * bfq_asymmetric_scenario checks the weights of only
          * non-weight-raised queues, for efficiency reasons (see
          * comments on bfq_weights_tree_add()). Then the fact that
          * bfqq is weight-raised is checked explicitly here. More
@@ -3667,7 +3691,7 @@ static bool idling_needed_for_service_guarantees(struct bfq_data *bfqd,
 
         return (bfqq->wr_coeff > 1 &&
                 bfqd->wr_busy_queues < bfq_tot_busy_queues(bfqd)) ||
-                !bfq_symmetric_scenario(bfqd);
+                bfq_asymmetric_scenario(bfqd, bfqq);
 }
 
 /*
@@ -5505,7 +5529,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
                      HRTIMER_MODE_REL);
         bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
 
-        bfqd->queue_weights_tree = RB_ROOT;
+        bfqd->queue_weights_tree = RB_ROOT_CACHED;
         bfqd->num_groups_with_pending_reqs = 0;
 
         INIT_LIST_HEAD(&bfqd->active_list);
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 062e1c4787f4..81cabf51a87e 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -450,7 +450,7 @@ struct bfq_data {
          * weight-raised @bfq_queue (see the comments to the functions
          * bfq_weights_tree_[add|remove] for further details).
          */
-        struct rb_root queue_weights_tree;
+        struct rb_root_cached queue_weights_tree;
 
         /*
          * Number of groups with at least one descendant process that
@@ -898,10 +898,10 @@ void bic_set_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq, bool is_sync);
 struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic);
 void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
-                          struct rb_root *root);
+                          struct rb_root_cached *root);
 void __bfq_weights_tree_remove(struct bfq_data *bfqd,
                                struct bfq_queue *bfqq,
-                               struct rb_root *root);
+                               struct rb_root_cached *root);
 void bfq_weights_tree_remove(struct bfq_data *bfqd,
                              struct bfq_queue *bfqq);
 void bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq,
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 63311d1ff1ed..0e3f344cc4d3 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -737,7 +737,7 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
         struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
         unsigned int prev_weight, new_weight;
         struct bfq_data *bfqd = NULL;
-        struct rb_root *root;
+        struct rb_root_cached *root;
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
         struct bfq_sched_data *sd;
         struct bfq_group *bfqg;
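(Illustration, not part of the patch: the new check stays cheap because
queue_weights_tree is now an rb_root_cached ordered by weight, so the
smallest-weight counter is the cached leftmost node and "does bfqq have the
smallest weight?" reduces to a pointer comparison, which
bfq_asymmetric_scenario() performs via rb_first_cached(). The user-space
sketch below conveys that idea with a sorted list standing in for the rbtree;
the type and function names are invented, not the kernel's.)

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for struct bfq_weight_counter: one node per distinct weight */
struct weight_counter {
        unsigned int weight;
        struct weight_counter *next;	/* ascending weights; head = smallest */
};

/* Toy stand-in for rb_root_cached: keeps a pointer to the leftmost node */
struct weights_tree {
        struct weight_counter *leftmost;
};

/*
 * Sketch of the fast path: a queue whose weight counter *is* the
 * leftmost (smallest-weight) counter cannot be owed a larger share of
 * the throughput than any other queue, so I/O-dispatch plugging can
 * be skipped for it.
 */
static bool queue_has_smallest_weight(const struct weights_tree *tree,
                                      const struct weight_counter *wc)
{
        return wc && wc == tree->leftmost;
}

int main(void)
{
        struct weight_counter w40 = { .weight = 40, .next = NULL };
        struct weight_counter w10 = { .weight = 10, .next = &w40 };
        struct weights_tree tree = { .leftmost = &w10 };

        printf("weight 10 is smallest: %d\n",
               queue_has_smallest_weight(&tree, &w10));
        printf("weight 40 is smallest: %d\n",
               queue_has_smallest_weight(&tree, &w40));
        return 0;
}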
From patchwork Tue Mar 12 08:59:29 2019
X-Patchwork-Submitter: Paolo Valente
X-Patchwork-Id: 160079
From: Paolo Valente
To: Jens Axboe
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, ulf.hansson@linaro.org, linus.walleij@linaro.org, broonie@kernel.org, bfq-iosched@googlegroups.com, oleksandr@natalenko.name, fra.fra.800@gmail.com, alessio.masola@gmail.com, holger@applied-asynchrony.com, Paolo Valente
Subject: [PATCH BUGFIX IMPROVEMENT V3 3/9] block, bfq: tune service injection basing on request service times
Date: Tue, 12 Mar 2019 09:59:29 +0100
Message-Id: <20190312085935.11340-4-paolo.valente@linaro.org>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20190312085935.11340-1-paolo.valente@linaro.org>
References: <20190312085935.11340-1-paolo.valente@linaro.org>

The processes associated with a bfq_queue, say Q, may happen to
generate their cumulative I/O at a lower rate than the rate at which
the device could serve the same I/O. This is rather probable, e.g.,
if only one process is associated with Q and the device is an SSD. It
results in Q often becoming empty while in service. If BFQ is not
allowed to switch to another queue when Q becomes empty, then, during
the service of Q, there will be frequent "service holes", i.e., time
intervals during which Q gets empty and the device can only consume
the I/O already queued in its hardware queues. This easily causes
considerable losses of throughput.

To counter this problem, BFQ implements a request injection
mechanism, which tries to fill the above service holes with I/O
requests taken from other bfq_queues. The hard part in this mechanism
is finding the right amount of I/O to inject, so as to both boost
throughput and not break Q's bandwidth and latency guarantees. To
this end, the current version of this mechanism measures the
bandwidth enjoyed by Q while it is being served, and tries to inject
the maximum possible amount of extra service that does not cause Q's
bandwidth to decrease too much.

This solution has an important shortcoming. For bandwidth
measurements to be stable and reliable, Q must remain in service for
a much longer time than that needed to serve a single I/O request.
Unfortunately, this does not hold with many workloads.

This commit addresses this issue by changing the way the amount of
allowed injection is dynamically computed: injection is tuned as a
function of the service times of single I/O requests of Q, instead of
Q's bandwidth. Single-request service times are evidently meaningful
even if Q gets very few I/O requests completed while it is in
service.

As a testbed for this new solution, we measured the throughput
reached by BFQ for one of the nastiest workloads and configurations
for this scheduler: the workload generated by the dbench test (in the
Phoronix suite), with 6 clients, on a filesystem with journaling, and
with the journaling daemon enjoying a higher weight than normal
processes. With this commit, the throughput grows from ~100 MB/s to
~150 MB/s on a PLEXTOR PX-256M5.
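(Illustration, not part of the patch: the limit-update rule just described can
be condensed as below. This sketch mirrors bfq_update_inject_limit() in the
diff that follows, with its 3/2-of-baseline threshold; toy_queue and
update_inject_limit() are invented names, and the kernel's additional
rq_in_driver condition for (re)computing the baseline is simplified away.)

#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for the injection state kept in struct bfq_queue */
struct toy_queue {
        uint64_t last_serv_time_ns;	/* baseline total service time */
        unsigned int inject_limit;	/* current injection allowance */
};

static void update_inject_limit(struct toy_queue *q, uint64_t tot_time_ns,
                                unsigned int max_rq_in_driver)
{
        unsigned int old_limit = q->inject_limit;

        if (q->last_serv_time_ns > 0) {
                /* tolerate total service times up to 3/2 of the baseline */
                uint64_t threshold = (q->last_serv_time_ns * 3) >> 1;

                if (tot_time_ns >= threshold && old_limit > 0)
                        q->inject_limit--;	/* injection hurts: back off */
                else if (tot_time_ns < threshold &&
                         old_limit < (max_rq_in_driver << 1))
                        q->inject_limit++;	/* no visible harm: probe higher */
        }

        /* first sample, or a lower (better) baseline: (re)set it */
        if (q->last_serv_time_ns == 0 || tot_time_ns < q->last_serv_time_ns) {
                q->last_serv_time_ns = tot_time_ns;
                if (q->inject_limit < 1)
                        q->inject_limit = 1;	/* make sure injection is tried */
        }
}

int main(void)
{
        struct toy_queue q = { 0, 0 };

        update_inject_limit(&q, 1000000, 4);	/* 1 ms baseline -> limit 1 */
        update_inject_limit(&q, 1200000, 4);	/* below threshold -> limit 2 */
        update_inject_limit(&q, 2000000, 4);	/* above threshold -> limit 1 */
        printf("limit = %u, baseline = %llu ns\n", q.inject_limit,
               (unsigned long long)q.last_serv_time_ns);
        return 0;
}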
Tested-by: Holger Hoffstätte Tested-by: Oleksandr Natalenko Tested-by: Francesco Pollicino Signed-off-by: Paolo Valente --- block/bfq-iosched.c | 417 ++++++++++++++++++++++++++++++++++++++++---- block/bfq-iosched.h | 51 +++--- 2 files changed, 409 insertions(+), 59 deletions(-) -- 2.20.1 diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index 2be504f25b09..41364c0cca8c 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -1721,6 +1721,123 @@ static void bfq_add_request(struct request *rq) bfqq->queued[rq_is_sync(rq)]++; bfqd->queued++; + if (RB_EMPTY_ROOT(&bfqq->sort_list) && bfq_bfqq_sync(bfqq)) { + /* + * Periodically reset inject limit, to make sure that + * the latter eventually drops in case workload + * changes, see step (3) in the comments on + * bfq_update_inject_limit(). + */ + if (time_is_before_eq_jiffies(bfqq->decrease_time_jif + + msecs_to_jiffies(1000))) { + /* invalidate baseline total service time */ + bfqq->last_serv_time_ns = 0; + + /* + * Reset pointer in case we are waiting for + * some request completion. + */ + bfqd->waited_rq = NULL; + + /* + * If bfqq has a short think time, then start + * by setting the inject limit to 0 + * prudentially, because the service time of + * an injected I/O request may be higher than + * the think time of bfqq, and therefore, if + * one request was injected when bfqq remains + * empty, this injected request might delay + * the service of the next I/O request for + * bfqq significantly. In case bfqq can + * actually tolerate some injection, then the + * adaptive update will however raise the + * limit soon. This lucky circumstance holds + * exactly because bfqq has a short think + * time, and thus, after remaining empty, is + * likely to get new I/O enqueued---and then + * completed---before being expired. This is + * the very pattern that gives the + * limit-update algorithm the chance to + * measure the effect of injection on request + * service times, and then to update the limit + * accordingly. + * + * On the opposite end, if bfqq has a long + * think time, then start directly by 1, + * because: + * a) on the bright side, keeping at most one + * request in service in the drive is unlikely + * to cause any harm to the latency of bfqq's + * requests, as the service time of a single + * request is likely to be lower than the + * think time of bfqq; + * b) on the downside, after becoming empty, + * bfqq is likely to expire before getting its + * next request. With this request arrival + * pattern, it is very hard to sample total + * service times and update the inject limit + * accordingly (see comments on + * bfq_update_inject_limit()). So the limit is + * likely to be never, or at least seldom, + * updated. As a consequence, by setting the + * limit to 1, we avoid that no injection ever + * occurs with bfqq. On the downside, this + * proactive step further reduces chances to + * actually compute the baseline total service + * time. Thus it reduces chances to execute the + * limit-update algorithm and possibly raise the + * limit to more than 1. 
+ */ + if (bfq_bfqq_has_short_ttime(bfqq)) + bfqq->inject_limit = 0; + else + bfqq->inject_limit = 1; + bfqq->decrease_time_jif = jiffies; + } + + /* + * The following conditions must hold to setup a new + * sampling of total service time, and then a new + * update of the inject limit: + * - bfqq is in service, because the total service + * time is evaluated only for the I/O requests of + * the queues in service; + * - this is the right occasion to compute or to + * lower the baseline total service time, because + * there are actually no requests in the drive, + * or + * the baseline total service time is available, and + * this is the right occasion to compute the other + * quantity needed to update the inject limit, i.e., + * the total service time caused by the amount of + * injection allowed by the current value of the + * limit. It is the right occasion because injection + * has actually been performed during the service + * hole, and there are still in-flight requests, + * which are very likely to be exactly the injected + * requests, or part of them; + * - the minimum interval for sampling the total + * service time and updating the inject limit has + * elapsed. + */ + if (bfqq == bfqd->in_service_queue && + (bfqd->rq_in_driver == 0 || + (bfqq->last_serv_time_ns > 0 && + bfqd->rqs_injected && bfqd->rq_in_driver > 0)) && + time_is_before_eq_jiffies(bfqq->decrease_time_jif + + msecs_to_jiffies(100))) { + bfqd->last_empty_occupied_ns = ktime_get_ns(); + /* + * Start the state machine for measuring the + * total service time of rq: setting + * wait_dispatch will cause bfqd->waited_rq to + * be set when rq will be dispatched. + */ + bfqd->wait_dispatch = true; + bfqd->rqs_injected = false; + } + } + elv_rb_add(&bfqq->sort_list, rq); /* @@ -2566,6 +2683,8 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd) sl = max_t(u32, sl, 20ULL * NSEC_PER_MSEC); bfqd->last_idling_start = ktime_get(); + bfqd->last_idling_start_jiffies = jiffies; + hrtimer_start(&bfqd->idle_slice_timer, ns_to_ktime(sl), HRTIMER_MODE_REL); bfqg_stats_set_start_idle_time(bfqq_group(bfqq)); @@ -3240,13 +3359,6 @@ static unsigned long bfq_bfqq_softrt_next_start(struct bfq_data *bfqd, jiffies + nsecs_to_jiffies(bfqq->bfqd->bfq_slice_idle) + 4); } -static bool bfq_bfqq_injectable(struct bfq_queue *bfqq) -{ - return BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1 && - blk_queue_nonrot(bfqq->bfqd->queue) && - bfqq->bfqd->hw_tag; -} - /** * bfq_bfqq_expire - expire a queue. * @bfqd: device owning the queue. @@ -3361,6 +3473,14 @@ void bfq_bfqq_expire(struct bfq_data *bfqd, "expire (%d, slow %d, num_disp %d, short_ttime %d)", reason, slow, bfqq->dispatched, bfq_bfqq_has_short_ttime(bfqq)); + /* + * bfqq expired, so no total service time needs to be computed + * any longer: reset state machine for measuring total service + * times. + */ + bfqd->rqs_injected = bfqd->wait_dispatch = false; + bfqd->waited_rq = NULL; + /* * Increase, decrease or leave budget unchanged according to * reason. 
@@ -3372,8 +3492,6 @@ void bfq_bfqq_expire(struct bfq_data *bfqd, if (ref == 1) /* bfqq is gone, no more actions on it */ return; - bfqq->injected_service = 0; - /* mark bfqq as waiting a request only if a bic still points to it */ if (!bfq_bfqq_busy(bfqq) && reason != BFQQE_BUDGET_TIMEOUT && @@ -3767,26 +3885,98 @@ static bool bfq_bfqq_must_idle(struct bfq_queue *bfqq) return RB_EMPTY_ROOT(&bfqq->sort_list) && bfq_better_to_idle(bfqq); } -static struct bfq_queue *bfq_choose_bfqq_for_injection(struct bfq_data *bfqd) +/* + * This function chooses the queue from which to pick the next extra + * I/O request to inject, if it finds a compatible queue. See the + * comments on bfq_update_inject_limit() for details on the injection + * mechanism, and for the definitions of the quantities mentioned + * below. + */ +static struct bfq_queue * +bfq_choose_bfqq_for_injection(struct bfq_data *bfqd) { - struct bfq_queue *bfqq; + struct bfq_queue *bfqq, *in_serv_bfqq = bfqd->in_service_queue; + unsigned int limit = in_serv_bfqq->inject_limit; + /* + * If + * - bfqq is not weight-raised and therefore does not carry + * time-critical I/O, + * or + * - regardless of whether bfqq is weight-raised, bfqq has + * however a long think time, during which it can absorb the + * effect of an appropriate number of extra I/O requests + * from other queues (see bfq_update_inject_limit for + * details on the computation of this number); + * then injection can be performed without restrictions. + */ + bool in_serv_always_inject = in_serv_bfqq->wr_coeff == 1 || + !bfq_bfqq_has_short_ttime(in_serv_bfqq); /* - * A linear search; but, with a high probability, very few - * steps are needed to find a candidate queue, i.e., a queue - * with enough budget left for its next request. In fact: + * If + * - the baseline total service time could not be sampled yet, + * so the inject limit happens to be still 0, and + * - a lot of time has elapsed since the plugging of I/O + * dispatching started, so drive speed is being wasted + * significantly; + * then temporarily raise inject limit to one request. + */ + if (limit == 0 && in_serv_bfqq->last_serv_time_ns == 0 && + bfq_bfqq_wait_request(in_serv_bfqq) && + time_is_before_eq_jiffies(bfqd->last_idling_start_jiffies + + bfqd->bfq_slice_idle) + ) + limit = 1; + + if (bfqd->rq_in_driver >= limit) + return NULL; + + /* + * Linear search of the source queue for injection; but, with + * a high probability, very few steps are needed to find a + * candidate queue, i.e., a queue with enough budget left for + * its next request. In fact: * - BFQ dynamically updates the budget of every queue so as * to accommodate the expected backlog of the queue; * - if a queue gets all its requests dispatched as injected * service, then the queue is removed from the active list - * (and re-added only if it gets new requests, but with - * enough budget for its new backlog). + * (and re-added only if it gets new requests, but then it + * is assigned again enough budget for its new backlog). */ list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list) if (!RB_EMPTY_ROOT(&bfqq->sort_list) && + (in_serv_always_inject || bfqq->wr_coeff > 1) && bfq_serv_to_charge(bfqq->next_rq, bfqq) <= - bfq_bfqq_budget_left(bfqq)) - return bfqq; + bfq_bfqq_budget_left(bfqq)) { + /* + * Allow for only one large in-flight request + * on non-rotational devices, for the + * following reason. On non-rotationl drives, + * large requests take much longer than + * smaller requests to be served. 
In addition, + * the drive prefers to serve large requests + * w.r.t. to small ones, if it can choose. So, + * having more than one large requests queued + * in the drive may easily make the next first + * request of the in-service queue wait for so + * long to break bfqq's service guarantees. On + * the bright side, large requests let the + * drive reach a very high throughput, even if + * there is only one in-flight large request + * at a time. + */ + if (blk_queue_nonrot(bfqd->queue) && + blk_rq_sectors(bfqq->next_rq) >= + BFQQ_SECT_THR_NONROT) + limit = min_t(unsigned int, 1, limit); + else + limit = in_serv_bfqq->inject_limit; + + if (bfqd->rq_in_driver < limit) { + bfqd->rqs_injected = true; + return bfqq; + } + } return NULL; } @@ -3873,14 +4063,32 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd) * for a new request, or has requests waiting for a completion and * may idle after their completion, then keep it anyway. * - * Yet, to boost throughput, inject service from other queues if - * possible. + * Yet, inject service from other queues if it boosts + * throughput and is possible. */ if (bfq_bfqq_wait_request(bfqq) || (bfqq->dispatched != 0 && bfq_better_to_idle(bfqq))) { - if (bfq_bfqq_injectable(bfqq) && - bfqq->injected_service * bfqq->inject_coeff < - bfqq->entity.service * 10) + struct bfq_queue *async_bfqq = + bfqq->bic && bfqq->bic->bfqq[0] && + bfq_bfqq_busy(bfqq->bic->bfqq[0]) ? + bfqq->bic->bfqq[0] : NULL; + + /* + * If the process associated with bfqq has also async + * I/O pending, then inject it + * unconditionally. Injecting I/O from the same + * process can cause no harm to the process. On the + * contrary, it can only increase bandwidth and reduce + * latency for the process. + */ + if (async_bfqq && + icq_to_bic(async_bfqq->next_rq->elv.icq) == bfqq->bic && + bfq_serv_to_charge(async_bfqq->next_rq, async_bfqq) <= + bfq_bfqq_budget_left(async_bfqq)) + bfqq = bfqq->bic->bfqq[0]; + else if (!idling_boosts_thr_without_issues(bfqd, bfqq) && + (bfqq->wr_coeff == 1 || bfqd->wr_busy_queues > 1 || + !bfq_bfqq_has_short_ttime(bfqq))) bfqq = bfq_choose_bfqq_for_injection(bfqd); else bfqq = NULL; @@ -3972,15 +4180,15 @@ static struct request *bfq_dispatch_rq_from_bfqq(struct bfq_data *bfqd, bfq_bfqq_served(bfqq, service_to_charge); - bfq_dispatch_remove(bfqd->queue, rq); + if (bfqq == bfqd->in_service_queue && bfqd->wait_dispatch) { + bfqd->wait_dispatch = false; + bfqd->waited_rq = rq; + } - if (bfqq != bfqd->in_service_queue) { - if (likely(bfqd->in_service_queue)) - bfqd->in_service_queue->injected_service += - bfq_serv_to_charge(rq, bfqq); + bfq_dispatch_remove(bfqd->queue, rq); + if (bfqq != bfqd->in_service_queue) goto return_rq; - } /* * If weight raising has to terminate for bfqq, then next @@ -4411,13 +4619,6 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq, bfq_mark_bfqq_has_short_ttime(bfqq); bfq_mark_bfqq_sync(bfqq); bfq_mark_bfqq_just_created(bfqq); - /* - * Aggressively inject a lot of service: up to 90%. - * This coefficient remains constant during bfqq life, - * but this behavior might be changed, after enough - * testing and tuning. - */ - bfqq->inject_coeff = 1; } else bfq_clear_bfqq_sync(bfqq); @@ -4976,6 +5177,147 @@ static void bfq_finish_requeue_request_body(struct bfq_queue *bfqq) bfq_put_queue(bfqq); } +/* + * The processes associated with bfqq may happen to generate their + * cumulative I/O at a lower rate than the rate at which the device + * could serve the same I/O. 
This is rather probable, e.g., if only + * one process is associated with bfqq and the device is an SSD. It + * results in bfqq becoming often empty while in service. In this + * respect, if BFQ is allowed to switch to another queue when bfqq + * remains empty, then the device goes on being fed with I/O requests, + * and the throughput is not affected. In contrast, if BFQ is not + * allowed to switch to another queue---because bfqq is sync and + * I/O-dispatch needs to be plugged while bfqq is temporarily + * empty---then, during the service of bfqq, there will be frequent + * "service holes", i.e., time intervals during which bfqq gets empty + * and the device can only consume the I/O already queued in its + * hardware queues. During service holes, the device may even get to + * remaining idle. In the end, during the service of bfqq, the device + * is driven at a lower speed than the one it can reach with the kind + * of I/O flowing through bfqq. + * + * To counter this loss of throughput, BFQ implements a "request + * injection mechanism", which tries to fill the above service holes + * with I/O requests taken from other queues. The hard part in this + * mechanism is finding the right amount of I/O to inject, so as to + * both boost throughput and not break bfqq's bandwidth and latency + * guarantees. In this respect, the mechanism maintains a per-queue + * inject limit, computed as below. While bfqq is empty, the injection + * mechanism dispatches extra I/O requests only until the total number + * of I/O requests in flight---i.e., already dispatched but not yet + * completed---remains lower than this limit. + * + * A first definition comes in handy to introduce the algorithm by + * which the inject limit is computed. We define as first request for + * bfqq, an I/O request for bfqq that arrives while bfqq is in + * service, and causes bfqq to switch from empty to non-empty. The + * algorithm updates the limit as a function of the effect of + * injection on the service times of only the first requests of + * bfqq. The reason for this restriction is that these are the + * requests whose service time is affected most, because they are the + * first to arrive after injection possibly occurred. + * + * To evaluate the effect of injection, the algorithm measures the + * "total service time" of first requests. We define as total service + * time of an I/O request, the time that elapses since when the + * request is enqueued into bfqq, to when it is completed. This + * quantity allows the whole effect of injection to be measured. It is + * easy to see why. Suppose that some requests of other queues are + * actually injected while bfqq is empty, and that a new request R + * then arrives for bfqq. If the device does start to serve all or + * part of the injected requests during the service hole, then, + * because of this extra service, it may delay the next invocation of + * the dispatch hook of BFQ. Then, even after R gets eventually + * dispatched, the device may delay the actual service of R if it is + * still busy serving the extra requests, or if it decides to serve, + * before R, some extra request still present in its queues. As a + * conclusion, the cumulative extra delay caused by injection can be + * easily evaluated by just comparing the total service time of first + * requests with and without injection. + * + * The limit-update algorithm works as follows. 
+ * On the arrival of a first request of bfqq, the algorithm measures
+ * the total service time of the request only if one of the three
+ * cases below holds, and, for each case, it updates the limit as
+ * described below:
+ *
+ * (1) If there is no in-flight request. This gives a baseline for the
+ *     total service time of the requests of bfqq. If the baseline has
+ *     not been computed yet, then, after computing it, the limit is
+ *     set to 1, to start boosting throughput, and to prepare the
+ *     ground for the next case. If the baseline has already been
+ *     computed, then it is updated, in case the new sample turns out
+ *     to be lower than the previous value.
+ *
+ * (2) If the limit is higher than 0 and there are in-flight
+ *     requests. By comparing the total service time in this case with
+ *     the above baseline, it is possible to know to what extent the
+ *     current value of the limit is inflating the total service
+ *     time. If the inflation is below a certain threshold, then bfqq
+ *     is assumed to be suffering from no perceivable loss of its
+ *     service guarantees, and the limit is even tentatively
+ *     increased. If the inflation is above the threshold, then the
+ *     limit is decreased. Due to the lack of any hysteresis, this
+ *     logic makes the limit oscillate even in steady workload
+ *     conditions. Yet we opted for it, because it is fast in reaching
+ *     the best value for the limit, as a function of the current I/O
+ *     workload. To reduce oscillations, this step is disabled for a
+ *     short time interval after the limit happens to be decreased.
+ *
+ * (3) Periodically, after resetting the limit, to make sure that the
+ *     limit eventually drops in case the workload changes. This is
+ *     needed because, after the limit has gone safely up for a
+ *     certain workload, it is impossible to guess whether the
+ *     baseline total service time may have changed, without measuring
+ *     it again without injection. A more effective version of this
+ *     step might be to just sample the baseline, by interrupting
+ *     injection only once, and then to reset/lower the limit only if
+ *     the total service time with the current limit does happen to be
+ *     too large.
+ *
+ * More details on each step are provided in the comments on the
+ * pieces of code that implement these steps: the branch handling the
+ * transition from empty to non-empty in bfq_add_request(), the branch
+ * handling injection in bfq_select_queue(), and the function
+ * bfq_choose_bfqq_for_injection(). These comments also explain some
+ * exceptions, made by the injection mechanism in some special cases.
+ */
+static void bfq_update_inject_limit(struct bfq_data *bfqd,
+				    struct bfq_queue *bfqq)
+{
+	u64 tot_time_ns = ktime_get_ns() - bfqd->last_empty_occupied_ns;
+	unsigned int old_limit = bfqq->inject_limit;
+
+	if (bfqq->last_serv_time_ns > 0) {
+		u64 threshold = (bfqq->last_serv_time_ns * 3)>>1;
+
+		if (tot_time_ns >= threshold && old_limit > 0) {
+			bfqq->inject_limit--;
+			bfqq->decrease_time_jif = jiffies;
+		} else if (tot_time_ns < threshold &&
+			   old_limit < bfqd->max_rq_in_driver<<1)
+			bfqq->inject_limit++;
+	}
+
+	/*
+	 * Either we still have to compute the base value for the
+	 * total service time, and there seem to be the right
+	 * conditions to do it, or we can lower the last base value
+	 * computed.
+	 */
+	if ((bfqq->last_serv_time_ns == 0 && bfqd->rq_in_driver == 0) ||
+	    tot_time_ns < bfqq->last_serv_time_ns) {
+		bfqq->last_serv_time_ns = tot_time_ns;
+		/*
+		 * Now we certainly have a base value: make sure we
+		 * start trying injection.
+		 */
+		bfqq->inject_limit = max_t(unsigned int, 1, old_limit);
+	}
+
+	/* update complete, not waiting for any request completion any longer */
+	bfqd->waited_rq = NULL;
+}
+
 /*
  * Handle either a requeue or a finish for rq. The things to do are
  * the same in both cases: all references to rq are to be dropped. In
@@ -5020,6 +5362,9 @@ static void bfq_finish_requeue_request(struct request *rq)

 	spin_lock_irqsave(&bfqd->lock, flags);

+	if (rq == bfqd->waited_rq)
+		bfq_update_inject_limit(bfqd, bfqq);
+
 	bfq_completed_request(bfqq, bfqd);
 	bfq_finish_requeue_request_body(bfqq);

diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 81cabf51a87e..26869cfbbfa9 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -240,6 +240,13 @@ struct bfq_queue {
 	/* next ioprio and ioprio class if a change is in progress */
 	unsigned short new_ioprio, new_ioprio_class;

+	/* last total-service-time sample, see bfq_update_inject_limit() */
+	u64 last_serv_time_ns;
+	/* limit for request injection */
+	unsigned int inject_limit;
+	/* last time the inject limit has been decreased, in jiffies */
+	unsigned long decrease_time_jif;
+
 	/*
 	 * Shared bfq_queue if queue is cooperating with one or more
 	 * other queues.
@@ -357,29 +364,6 @@ struct bfq_queue {
 	/* max service rate measured so far */
 	u32 max_service_rate;

-	/*
-	 * Ratio between the service received by bfqq while it is in
-	 * service, and the cumulative service (of requests of other
-	 * queues) that may be injected while bfqq is empty but still
-	 * in service. To increase precision, the coefficient is
-	 * measured in tenths of unit. Here are some example of (1)
-	 * ratios, (2) resulting percentages of service injected
-	 * w.r.t. to the total service dispatched while bfqq is in
-	 * service, and (3) corresponding values of the coefficient:
-	 * 1 (50%) -> 10
-	 * 2 (33%) -> 20
-	 * 10 (9%) -> 100
-	 * 9.9 (9%) -> 99
-	 * 1.5 (40%) -> 15
-	 * 0.5 (66%) -> 5
-	 * 0.1 (90%) -> 1
-	 *
-	 * So, if the coefficient is lower than 10, then
-	 * injected service is more than bfqq service.
-	 */
-	unsigned int inject_coeff;
-	/* amount of service injected in current service slot */
-	unsigned int injected_service;
 };

 /**
@@ -544,6 +528,26 @@ struct bfq_data {
 	/* time of last request completion (ns) */
 	u64 last_completion;

+	/* time of last transition from empty to non-empty (ns) */
+	u64 last_empty_occupied_ns;
+
+	/*
+	 * Flag set to activate the sampling of the total service time
+	 * of a just-arrived first I/O request (see
+	 * bfq_update_inject_limit()). This will cause the setting of
+	 * waited_rq when the request is finally dispatched.
+	 */
+	bool wait_dispatch;
+	/*
+	 * If set, then bfq_update_inject_limit() is invoked when
+	 * waited_rq is eventually completed.
+	 */
+	struct request *waited_rq;
+	/*
+	 * True if some request has been injected during the last service hole.
+	 */
+	bool rqs_injected;
+
 	/* time of first rq dispatch in current observation interval (ns) */
 	u64 first_dispatch;
 	/* time of last rq dispatch in current observation interval (ns) */
@@ -553,6 +557,7 @@ struct bfq_data {
 	ktime_t last_budget_start;
 	/* beginning of the last idle slice */
 	ktime_t last_idling_start;
+	unsigned long last_idling_start_jiffies;
 	/* number of samples in current observation interval */
 	int peak_rate_samples;
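To make the three update cases concrete, here is a small standalone C model of the policy, compilable in userspace. The struct and function names are invented for illustration; only the 3/2 inflation threshold and the 2 * max_rq_in_driver cap mirror bfq_update_inject_limit() above, and the decrease-hysteresis timer is deliberately omitted.

	/*
	 * Standalone model of the inject-limit update policy. Only the
	 * threshold and cap come from the patch; everything else is a
	 * made-up harness.
	 */
	#include <stdint.h>
	#include <stdio.h>

	struct inj_state {
		uint64_t last_serv_time_ns;	/* baseline total service time */
		unsigned int inject_limit;	/* current in-flight cap */
		unsigned int max_rq_in_driver;	/* recent peak of in-flight rqs */
	};

	static void update_inject_limit(struct inj_state *s, uint64_t tot_time_ns,
					unsigned int rq_in_driver)
	{
		unsigned int old_limit = s->inject_limit;

		if (s->last_serv_time_ns > 0) {
			/* threshold = 1.5 * baseline, computed as (x * 3) >> 1 */
			uint64_t threshold = (s->last_serv_time_ns * 3) >> 1;

			if (tot_time_ns >= threshold && old_limit > 0)
				s->inject_limit--;	/* case (2), inflated: back off */
			else if (tot_time_ns < threshold &&
				 old_limit < s->max_rq_in_driver << 1)
				s->inject_limit++;	/* case (2), cheap: probe higher */
		}

		/* case (1): take or lower the baseline, then enable injection */
		if ((s->last_serv_time_ns == 0 && rq_in_driver == 0) ||
		    tot_time_ns < s->last_serv_time_ns) {
			s->last_serv_time_ns = tot_time_ns;
			if (s->inject_limit < 1)
				s->inject_limit = 1;
		}
	}

	int main(void)
	{
		struct inj_state s = { 0, 0, 4 };

		update_inject_limit(&s, 1000, 0);	/* baseline sample: limit -> 1 */
		update_inject_limit(&s, 1200, 2);	/* below 1.5x baseline: limit -> 2 */
		update_inject_limit(&s, 1600, 2);	/* above 1.5x baseline: limit -> 1 */
		printf("baseline=%llu ns, limit=%u\n",
		       (unsigned long long)s.last_serv_time_ns, s.inject_limit);
		return 0;
	}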
From patchwork Tue Mar 12 08:59:30 2019
X-Patchwork-Submitter: Paolo Valente
X-Patchwork-Id: 160078
From: Paolo Valente
To: Jens Axboe
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, ulf.hansson@linaro.org, linus.walleij@linaro.org, broonie@kernel.org, bfq-iosched@googlegroups.com, oleksandr@natalenko.name, fra.fra.800@gmail.com, alessio.masola@gmail.com, holger@applied-asynchrony.com, Paolo Valente
Subject: [PATCH BUGFIX IMPROVEMENT V3 4/9] block, bfq: do not merge queues on flash storage with queueing
Date: Tue, 12 Mar 2019 09:59:30 +0100
Message-Id: <20190312085935.11340-5-paolo.valente@linaro.org>
In-Reply-To: <20190312085935.11340-1-paolo.valente@linaro.org>
References: <20190312085935.11340-1-paolo.valente@linaro.org>
X-Mailing-List:
linux-kernel@vger.kernel.org

To boost throughput with a set of processes doing interleaved I/O
(i.e., a set of processes whose individual I/O is random, but whose
merged cumulative I/O is sequential), BFQ merges the queues associated
with these processes, i.e., redirects the I/O of these processes into a
common, shared queue. In the shared queue, I/O requests are ordered by
their position on the medium, thus sequential I/O gets dispatched to
the device when the shared queue is served.

Queue merging costs execution time, because, to detect which queues to
merge, BFQ must maintain a list of the head I/O requests of active
queues, ordered by request positions. Measurements showed that this
costs about 10% of BFQ's total per-request processing time. Request
processing time becomes more and more critical as the speed of the
underlying storage device grows.

Yet, fortunately, queue merging is basically useless on the very
devices that are so fast as to make request processing time critical.
To reach a high throughput, these devices must have many requests
queued at the same time. But, in this configuration, the internal
scheduling algorithms of these devices also do the job of queue
merging: they reorder requests so as to obtain as sequential an I/O
pattern as possible. As a consequence, with processes doing interleaved
I/O, the throughput reached by one such device is likely to be the
same with and without queue merging.

In view of this fact, this commit disables queue merging, and all
related housekeeping, for non-rotational devices with internal
queueing. The total, single-lock-protected, per-request processing time
of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz (time
measured with simple code instrumentation, and using the
throughput-sync.sh script of the S suite [1], in performance-profiling
mode). To put this result into context, the total,
single-lock-protected, per-request execution time of the lightest I/O
scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is
~800 LOC, against ~10500 LOC for BFQ).

Disabling merging provides a further, remarkable benefit in terms of
throughput. Merging tends to make many workloads artificially more
uneven, mainly because of shared queues remaining non-empty for
incomparably more time than normal queues. So, if, e.g., one of the
queues in a set of merged queues has a higher weight than a normal
queue, then the shared queue may inherit such a high weight and, by
staying almost always active, may force BFQ to perform I/O plugging
most of the time. This evidently makes it harder for BFQ to let the
device reach a high throughput.

As a practical example of this problem, and of the benefits of this
commit, we again measured the throughput in the nasty scenario
considered in previous commit messages: the dbench test (in the
Phoronix suite), with 6 clients, on a filesystem with journaling, and
with the journaling daemon enjoying a higher weight than normal
processes. With this commit, the throughput grows from ~150 MB/s to
~200 MB/s on a PLEXTOR PX-256M5 SSD. This is the same peak throughput
reached by any of the other I/O schedulers. As such, this is also
likely to be the maximum possible throughput reachable with this
workload on this device, because I/O is mostly random, and the other
schedulers basically just pass I/O requests to the drive as fast as
possible.
[1] https://github.com/Algodev-github/S

Tested-by: Holger Hoffstätte
Tested-by: Oleksandr Natalenko
Tested-by: Francesco Pollicino
Signed-off-by: Alessio Masola
Signed-off-by: Paolo Valente
---
 block/bfq-cgroup.c  |  3 +-
 block/bfq-iosched.c | 73 +++++++++++++++++++++++++++++++++++++++++----
 block/bfq-iosched.h |  3 ++
 3 files changed, 73 insertions(+), 6 deletions(-)

--
2.20.1

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index c6113af31960..2a74a3f2a8f7 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -578,7 +578,8 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 		bfqg_and_blkg_get(bfqg);

 	if (bfq_bfqq_busy(bfqq)) {
-		bfq_pos_tree_add_move(bfqd, bfqq);
+		if (unlikely(!bfqd->nonrot_with_queueing))
+			bfq_pos_tree_add_move(bfqd, bfqq);
 		bfq_activate_bfqq(bfqd, bfqq);
 	}

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 41364c0cca8c..b96be3764b8a 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -595,7 +595,16 @@ static bool bfq_too_late_for_merging(struct bfq_queue *bfqq)
 		bfq_merge_time_limit);
 }

-void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+/*
+ * The following function is not marked as __cold, even though it is
+ * actually cold, for the same performance goal described in the
+ * comments on the likely() at the beginning of
+ * bfq_setup_cooperator(). Unexpectedly, to reach an even lower
+ * execution time for the case where this function is not invoked, we
+ * had to add an unlikely() in each involved if().
+ */
+void __cold
+bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 {
 	struct rb_node **p, *parent;
 	struct bfq_queue *__bfqq;
@@ -1849,8 +1858,9 @@ static void bfq_add_request(struct request *rq)

 	/*
 	 * Adjust priority tree position, if next_rq changes.
	 * See comments on bfq_pos_tree_add_move() for the unlikely().
 	 */
-	if (prev != bfqq->next_rq)
+	if (unlikely(!bfqd->nonrot_with_queueing && prev != bfqq->next_rq))
 		bfq_pos_tree_add_move(bfqd, bfqq);

 	if (!bfq_bfqq_busy(bfqq)) /* switching to busy ... */
@@ -1990,7 +2000,9 @@ static void bfq_remove_request(struct request_queue *q,
 			bfqq->pos_root = NULL;
 		}
 	} else {
-		bfq_pos_tree_add_move(bfqd, bfqq);
+		/* see comments on bfq_pos_tree_add_move() for the unlikely() */
+		if (unlikely(!bfqd->nonrot_with_queueing))
+			bfq_pos_tree_add_move(bfqd, bfqq);
 	}

 	if (rq->cmd_flags & REQ_META)
@@ -2075,7 +2087,12 @@ static void bfq_request_merged(struct request_queue *q, struct request *req,
 		 */
 		if (prev != bfqq->next_rq) {
 			bfq_updated_next_req(bfqd, bfqq);
-			bfq_pos_tree_add_move(bfqd, bfqq);
+			/*
+			 * See comments on bfq_pos_tree_add_move() for
+			 * the unlikely().
+			 */
+			if (unlikely(!bfqd->nonrot_with_queueing))
+				bfq_pos_tree_add_move(bfqd, bfqq);
 		}
 	}
 }
@@ -2357,6 +2374,46 @@ bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 {
 	struct bfq_queue *in_service_bfqq, *new_bfqq;

+	/*
+	 * Do not perform queue merging if the device is non
+	 * rotational and performs internal queueing. In fact, such a
+	 * device reaches a high speed through internal parallelism
+	 * and pipelining. This means that, to reach a high
+	 * throughput, it must have many requests enqueued at the same
+	 * time. But, in this configuration, the internal scheduling
+	 * algorithm of the device does exactly the job of queue
+	 * merging: it reorders requests so as to obtain as much as
+	 * possible a sequential I/O pattern.
+	 * As a consequence, with the workload generated by processes
+	 * doing interleaved I/O, the throughput reached by the device
+	 * is likely to be the same, with and without queue merging.
+	 *
+	 * Disabling merging also provides a remarkable benefit in
+	 * terms of throughput. Merging tends to make many workloads
+	 * artificially more uneven, because of shared queues
+	 * remaining non-empty for incomparably more time than
+	 * non-merged queues. This may accentuate workload
+	 * asymmetries. For example, if one of the queues in a set of
+	 * merged queues has a higher weight than a normal queue, then
+	 * the shared queue may inherit such a high weight and, by
+	 * staying almost always active, may force BFQ to perform I/O
+	 * plugging most of the time. This evidently makes it harder
+	 * for BFQ to let the device reach a high throughput.
+	 *
+	 * Finally, the likely() macro below is used not because one
+	 * of the two branches is more likely than the other, but to
+	 * have the code path after the following if() executed as
+	 * fast as possible for the case of a non rotational device
+	 * with queueing. We want it because this is the fastest kind
+	 * of device. On the opposite end, the likely() may lengthen
+	 * the execution time of BFQ for the case of slower devices
+	 * (rotational or at least without queueing). But in this case
+	 * the execution time of BFQ matters very little, if not at
+	 * all.
+	 */
+	if (likely(bfqd->nonrot_with_queueing))
+		return NULL;
+
 	/*
 	 * Prevent bfqq from being merged if it has been created too
 	 * long ago. The idea is that true cooperating processes, and
@@ -2986,8 +3043,10 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 		bfq_requeue_bfqq(bfqd, bfqq, true);
 		/*
 		 * Resort priority tree of potential close cooperators.
		 * See comments on bfq_pos_tree_add_move() for the unlikely().
 		 */
-		bfq_pos_tree_add_move(bfqd, bfqq);
+		if (unlikely(!bfqd->nonrot_with_queueing))
+			bfq_pos_tree_add_move(bfqd, bfqq);
 	}

 	/*
@@ -5051,6 +5110,9 @@ static void bfq_update_hw_tag(struct bfq_data *bfqd)

 	bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD;
 	bfqd->max_rq_in_driver = 0;
 	bfqd->hw_tag_samples = 0;
+
+	bfqd->nonrot_with_queueing =
+		blk_queue_nonrot(bfqd->queue) && bfqd->hw_tag;
 }

 static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)
@@ -5882,6 +5944,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	INIT_HLIST_HEAD(&bfqd->burst_list);

 	bfqd->hw_tag = -1;
+	bfqd->nonrot_with_queueing = blk_queue_nonrot(bfqd->queue);

 	bfqd->bfq_max_budget = bfq_default_max_budget;

diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 26869cfbbfa9..829730b96fb2 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -497,6 +497,9 @@ struct bfq_data {
 	/* number of requests dispatched and waiting for completion */
 	int rq_in_driver;

+	/* true if the device is non rotational and performs queueing */
+	bool nonrot_with_queueing;
+
 	/*
 	 * Maximum number of requests in driver in the last
 	 * @hw_tag_samples completed requests.
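As a rough illustration of the mechanism this patch relies on, the following self-contained C sketch recomputes a "non-rotational with queueing" flag from observed queue depth and uses it to skip merge bookkeeping. All identifiers are invented for the sketch; only the gating idea (non-rotational device plus internal queueing implies the merging work can be skipped) comes from the patch.

	/*
	 * Toy model: derive the flag from device feedback, then use it
	 * as a fast path that avoids position-tree maintenance.
	 */
	#include <stdbool.h>
	#include <stdio.h>

	struct dev_state {
		bool nonrot;		/* device reports non-rotational */
		bool hw_tag;		/* device keeps many requests in flight */
		bool nonrot_with_queueing;
	};

	static void update_queueing_flag(struct dev_state *d, int max_rq_in_driver)
	{
		d->hw_tag = max_rq_in_driver > 3;	/* threshold-style detection */
		d->nonrot_with_queueing = d->nonrot && d->hw_tag;
	}

	static void maybe_update_pos_tree(struct dev_state *d)
	{
		if (d->nonrot_with_queueing)
			return;	/* fast path: merging disabled, no bookkeeping */
		puts("updating position tree (merging enabled)");
	}

	int main(void)
	{
		struct dev_state ssd = { .nonrot = true };
		struct dev_state hdd = { .nonrot = false };

		update_queueing_flag(&ssd, 16);	/* deep internal queue observed */
		maybe_update_pos_tree(&ssd);	/* prints nothing: work skipped */

		update_queueing_flag(&hdd, 16);
		maybe_update_pos_tree(&hdd);	/* rotational: tree still maintained */
		return 0;
	}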
From patchwork Tue Mar 12 08:59:31 2019
X-Patchwork-Submitter: Paolo Valente
X-Patchwork-Id: 160077
From: Paolo Valente
To: Jens Axboe
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, ulf.hansson@linaro.org, linus.walleij@linaro.org, broonie@kernel.org, bfq-iosched@googlegroups.com, oleksandr@natalenko.name, fra.fra.800@gmail.com, alessio.masola@gmail.com, holger@applied-asynchrony.com, Paolo Valente
Subject: [PATCH BUGFIX IMPROVEMENT V3 5/9] block, bfq: do not tag totally seeky queues as soft rt
Date: Tue, 12 Mar 2019 09:59:31 +0100
Message-Id: <20190312085935.11340-6-paolo.valente@linaro.org>
In-Reply-To: <20190312085935.11340-1-paolo.valente@linaro.org>
References: <20190312085935.11340-1-paolo.valente@linaro.org>
X-Mailing-List:
linux-kernel@vger.kernel.org

Sync random I/O is likely to be confused with soft real-time I/O,
because it is characterized by limited throughput and an apparently
isochronous arrival pattern. To avoid false positives, this commit
prevents bfq_queues containing only random (seeky) I/O from being
tagged as soft real-time.

Tested-by: Holger Hoffstätte
Tested-by: Oleksandr Natalenko
Signed-off-by: Paolo Valente
---
 block/bfq-iosched.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

--
2.20.1

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index b96be3764b8a..d34b80e5c47d 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -242,6 +242,14 @@ static struct kmem_cache *bfq_pool;
		  blk_rq_sectors(rq) < BFQQ_SECT_THR_NONROT))
 #define BFQQ_CLOSE_THR		(sector_t)(8 * 1024)
 #define BFQQ_SEEKY(bfqq)	(hweight32(bfqq->seek_history) > 19)
+/*
+ * Sync random I/O is likely to be confused with soft real-time I/O,
+ * because it is characterized by limited throughput and apparently
+ * isochronous arrival pattern. To avoid false positives, queues
+ * containing only random (seeky) I/O are prevented from being tagged
+ * as soft real-time.
+ */
+#define BFQQ_TOTALLY_SEEKY(bfqq)	(bfqq->seek_history == -1)

 /* Min number of samples required to perform peak-rate update */
 #define BFQ_RATE_MIN_SAMPLES	32
@@ -1622,6 +1630,7 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
	 */
	in_burst = bfq_bfqq_in_large_burst(bfqq);
	soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
+		!BFQQ_TOTALLY_SEEKY(bfqq) &&
		!in_burst &&
		time_is_before_jiffies(bfqq->soft_rt_next_start) &&
		bfqq->dispatched == 0;
@@ -4816,6 +4825,11 @@ bfq_update_io_seektime(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 {
	bfqq->seek_history <<= 1;
	bfqq->seek_history |= BFQ_RQ_SEEKY(bfqd, bfqq->last_request_pos, rq);
+
+	if (bfqq->wr_coeff > 1 &&
+	    bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time &&
+	    BFQQ_TOTALLY_SEEKY(bfqq))
+		bfq_bfqq_end_wr(bfqq);
 }

 static void bfq_update_has_short_ttime(struct bfq_data *bfqd,
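The detection above boils down to a 32-bit shift register of per-request seekiness. The following standalone C model (names invented for the sketch) shows why a single non-seeky request is enough to clear the totally-seeky condition.

	/*
	 * Minimal model of the seek_history register: one bit per
	 * request (1 = seeky), shifted in at each arrival. A queue is
	 * "totally seeky" when all 32 tracked requests were seeky,
	 * i.e. the register has all bits set.
	 */
	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	struct queue_model {
		uint32_t seek_history;
	};

	static void account_request(struct queue_model *q, bool seeky)
	{
		q->seek_history = (q->seek_history << 1) | (seeky ? 1 : 0);
	}

	static bool totally_seeky(const struct queue_model *q)
	{
		return q->seek_history == (uint32_t)-1;	/* all 32 bits set */
	}

	int main(void)
	{
		struct queue_model q = { 0 };

		for (int i = 0; i < 32; i++)
			account_request(&q, true);
		printf("totally seeky: %d\n", totally_seeky(&q));	/* 1 */

		account_request(&q, false);	/* one sequential request clears it */
		printf("totally seeky: %d\n", totally_seeky(&q));	/* 0 */
		return 0;
	}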
Hx79obWLtToIbXWx1/nlG/HD0rwIx/KLiUEHcuWWgxbHOWFhT82a0PYORZKwxss0aIkG zm+w== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=iS0RXPUB; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id d16si7538316pll.236.2019.03.12.02.00.16; Tue, 12 Mar 2019 02:00:16 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=iS0RXPUB; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727720AbfCLJAO (ORCPT + 31 others); Tue, 12 Mar 2019 05:00:14 -0400 Received: from mail-wm1-f68.google.com ([209.85.128.68]:38525 "EHLO mail-wm1-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727700AbfCLJAM (ORCPT ); Tue, 12 Mar 2019 05:00:12 -0400 Received: by mail-wm1-f68.google.com with SMTP id a188so1678979wmf.3 for ; Tue, 12 Mar 2019 02:00:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=v4uFtyzA18u5b8FpMac6NfiN9aYKDmVQAs2/t+LEouQ=; b=iS0RXPUBY1WOA6OUxZAMoN+2kbxJkGL3Hjr3CPiqg/6XmRhyCkplfWh2Mdf2I/dXX7 zNmdB/ScIRo8GjCF/czzFwN4K6V+UCCeiSvxz+WyWD1az6KokzfCRSKT0NoNlCq2s53S hIEyeC8rOxD8od8zs2k6ADNGwbGJ0O1Gv7DXJ+KXeznNxcRyaixzDoFQqbFuF3HPE3ed q/L9BqpmSzzNUkifD/tpnUjuscSrCFbc6gRR2gHMPTCp4PtaslXe9pQE6TvXKzJSOx5K 2xHUswb/EIZeP5FFOezsXVgiFgIfnLkPIn7UMypsxLLbLZyKvCWiLKYix3qxt2I/XzoZ mgLA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=v4uFtyzA18u5b8FpMac6NfiN9aYKDmVQAs2/t+LEouQ=; b=KIhqQFVvJcU49sOJq2O7jJ7/79dkMVgnGvAJUcQwthz/FTRNqqZgvajtQthpWcmRd8 J8wPD/fmt1zZgGrWwbpxYMoVX6idr/XGNgV0vBANjrUDC95g+JZY32J+UtiaOtYR6j/m ECLjiMjoSyljoVJqIWBje09o5qR24RBlS8DEOy5Hz2NJKWv+aieo0D9TEkWG2NfPEAXn hwPZYIHuxRXU3xydZlzYq43JAfDcg6yo0mnoQ9jxQcvT7/NvnDTs9s+P2sbQ18uoEeS+ m9Ttw7uoQ6zWLsPz0OmGRZNRiWBOCbC9qpmBl3dcdnuLuhm0E7gJfWWRphZQZfH+pb4m N9Nw== X-Gm-Message-State: APjAAAVD27owtY86LFdjPvkXIyIr1WC7bYrmYgsPeuLVkVV2zW0NV3Xf DBMQMwI5GjK8watRJNJJP44wMw== X-Received: by 2002:a1c:a007:: with SMTP id j7mr1493311wme.25.1552381210260; Tue, 12 Mar 2019 02:00:10 -0700 (PDT) Received: from localhost.localdomain ([84.33.65.66]) by smtp.gmail.com with ESMTPSA id y20sm3953043wmi.34.2019.03.12.02.00.09 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 12 Mar 2019 02:00:09 -0700 (PDT) From: Paolo Valente To: Jens Axboe Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, ulf.hansson@linaro.org, linus.walleij@linaro.org, broonie@kernel.org, bfq-iosched@googlegroups.com, oleksandr@natalenko.name, fra.fra.800@gmail.com, alessio.masola@gmail.com, 
holger@applied-asynchrony.com, Paolo Valente
Subject: [PATCH BUGFIX IMPROVEMENT V3 6/9] block, bfq: always protect newly-created queues from existing active queues
Date: Tue, 12 Mar 2019 09:59:32 +0100
Message-Id: <20190312085935.11340-7-paolo.valente@linaro.org>
In-Reply-To: <20190312085935.11340-1-paolo.valente@linaro.org>
References: <20190312085935.11340-1-paolo.valente@linaro.org>
X-Mailing-List: linux-kernel@vger.kernel.org

If many bfq_queues belonging to the same group happen to be created
shortly after each other, then the processes associated with these
queues typically have a common goal. In particular, bursts of queue
creations are usually caused by services or applications that spawn
many parallel threads/processes. Examples are systemd during boot, or
git grep. If there are no other active queues, then, to help these
processes get their job done as soon as possible, the best thing to do
is to reach a high throughput. To this goal, it is usually better to
not grant either weight-raising or device idling to the queues
associated with these processes. And this is exactly what BFQ currently
does.

There is however a drawback: if, in contrast, some other queues are
already active, then the newly created queues must be protected from
the I/O flowing through the already existing queues. In this case, the
best thing to do is the opposite of what BFQ does in the other case: it
is much better to grant weight-raising and device idling to the
newly-created queues, if they deserve it. This commit addresses this
issue by doing so if there are already other active queues.

This change also helps eliminate false positives, which occur when the
newly-created queues do not belong to an actual large burst of
creations, but some background task (e.g., a service) happens to
trigger the creation of new queues in the middle, i.e., very close to
when the victim queues are created. These false positives may cause
total loss of control on process latencies.

Tested-by: Holger Hoffstätte
Tested-by: Oleksandr Natalenko
Signed-off-by: Paolo Valente
---
 block/bfq-iosched.c | 64 ++++++++++++++++++++++++++++++++++++---------
 1 file changed, 51 insertions(+), 13 deletions(-)

--
2.20.1

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index d34b80e5c47d..500b04df9efa 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1075,8 +1075,18 @@ static void bfq_reset_burst_list(struct bfq_data *bfqd, struct bfq_queue *bfqq)
	hlist_for_each_entry_safe(item, n, &bfqd->burst_list,
				  burst_list_node)
		hlist_del_init(&item->burst_list_node);
-	hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list);
-	bfqd->burst_size = 1;
+
+	/*
+	 * Start the creation of a new burst list only if there is no
+	 * active queue. See comments on the conditional invocation of
+	 * bfq_handle_burst().
+	 */
+	if (bfq_tot_busy_queues(bfqd) == 0) {
+		hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list);
+		bfqd->burst_size = 1;
+	} else
+		bfqd->burst_size = 0;
+
	bfqd->burst_parent_entity = bfqq->entity.parent;
 }
@@ -1132,7 +1142,8 @@ static void bfq_add_to_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 * many parallel threads/processes. Examples are systemd during boot,
 * or git grep. To help these processes get their job done as soon as
 * possible, it is usually better to not grant either weight-raising
- * or device idling to their queues.
+ * or device idling to their queues, unless these queues must be
+ * protected from the I/O flowing through other active queues.
 *
 * In this comment we describe, firstly, the reasons why this fact
 * holds, and, secondly, the next function, which implements the main
@@ -1144,7 +1155,10 @@ static void bfq_add_to_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 * cumulatively served, the sooner the target job of these queues gets
 * completed. As a consequence, weight-raising any of these queues,
 * which also implies idling the device for it, is almost always
- * counterproductive. In most cases it just lowers throughput.
+ * counterproductive, unless there are other active queues to isolate
+ * these new queues from. If there are no other active queues, then
+ * weight-raising these new queues just lowers throughput in most
+ * cases.
 *
 * On the other hand, a burst of queue creations may be caused also by
 * the start of an application that does not consist of a lot of
@@ -1178,14 +1192,16 @@ static void bfq_add_to_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 * are very rare. They typically occur if some service happens to
 * start doing I/O exactly when the interactive task starts.
 *
- * Turning back to the next function, it implements all the steps
- * needed to detect the occurrence of a large burst and to properly
- * mark all the queues belonging to it (so that they can then be
- * treated in a different way). This goal is achieved by maintaining a
- * "burst list" that holds, temporarily, the queues that belong to the
- * burst in progress. The list is then used to mark these queues as
- * belonging to a large burst if the burst does become large. The main
- * steps are the following.
+ * Turning back to the next function, it is invoked only if there are
+ * no active queues (apart from active queues that would belong to the
+ * same, possible burst bfqq would belong to), and it implements all
+ * the steps needed to detect the occurrence of a large burst and to
+ * properly mark all the queues belonging to it (so that they can then
+ * be treated in a different way). This goal is achieved by
+ * maintaining a "burst list" that holds, temporarily, the queues that
+ * belong to the burst in progress. The list is then used to mark
+ * these queues as belonging to a large burst if the burst does become
+ * large. The main steps are the following.
 *
 * . when the very first queue is created, the queue is inserted into the
 *   list (as it could be the first queue in a possible burst)
@@ -5695,7 +5711,29 @@ static struct bfq_queue *bfq_init_rq(struct request *rq)
		}
	}

-	if (unlikely(bfq_bfqq_just_created(bfqq)))
+	/*
+	 * Consider bfqq as possibly belonging to a burst of newly
+	 * created queues only if:
+	 * 1) A burst is actually happening (bfqd->burst_size > 0)
+	 * or
+	 * 2) There is no other active queue. In fact, if, in
+	 *    contrast, there are active queues not belonging to the
+	 *    possible burst bfqq may belong to, then there is no gain
+	 *    in considering bfqq as belonging to a burst, and
+	 *    therefore in not weight-raising bfqq. See comments on
+	 *    bfq_handle_burst().
+	 *
+	 * This filtering also helps eliminate false positives,
+	 * occurring when bfqq does not belong to an actual large
+	 * burst, but some background task (e.g., a service) happens
+	 * to trigger the creation of new queues very close to when
+	 * bfqq and its possible companion queues are created. See
+	 * comments on bfq_handle_burst() for further details also on
+	 * this issue.
+	 */
+	if (unlikely(bfq_bfqq_just_created(bfqq) &&
+		     (bfqd->burst_size > 0 ||
+		      bfq_tot_busy_queues(bfqd) == 0)))
		bfq_handle_burst(bfqd, bfqq);

	return bfqq;
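A compact standalone model of the new gating condition may help: a just-created queue is handed to burst handling only if a burst is already being tracked or no other queue is busy. Field and function names below are invented for this sketch; only the boolean condition mirrors the patch.

	#include <stdbool.h>
	#include <stdio.h>

	struct sched_model {
		int burst_size;		/* queues in the current burst list */
		int busy_queues;	/* currently active queues */
	};

	static bool consider_for_burst(const struct sched_model *s,
				       bool just_created)
	{
		return just_created &&
		       (s->burst_size > 0 || s->busy_queues == 0);
	}

	int main(void)
	{
		struct sched_model boot = { .burst_size = 0, .busy_queues = 0 };
		struct sched_model loaded = { .burst_size = 0, .busy_queues = 3 };

		/* boot-like scenario: nothing active, burst detection may start */
		printf("%d\n", consider_for_burst(&boot, true));	/* 1 */
		/* background service spawning next to active queues: protect them */
		printf("%d\n", consider_for_burst(&loaded, true));	/* 0 */
		return 0;
	}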
From patchwork Tue Mar 12 08:59:33 2019
X-Patchwork-Submitter: Paolo Valente
X-Patchwork-Id: 160074
From: Paolo Valente
To: Jens Axboe
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, ulf.hansson@linaro.org, linus.walleij@linaro.org, broonie@kernel.org, bfq-iosched@googlegroups.com, oleksandr@natalenko.name, fra.fra.800@gmail.com, alessio.masola@gmail.com, holger@applied-asynchrony.com, Paolo Valente
Subject: [PATCH BUGFIX IMPROVEMENT V3 7/9] block, bfq: print SHARED instead of pid for shared queues in logs
Date: Tue, 12 Mar 2019 09:59:33 +0100
Message-Id: <20190312085935.11340-8-paolo.valente@linaro.org>
In-Reply-To: <20190312085935.11340-1-paolo.valente@linaro.org>
References: <20190312085935.11340-1-paolo.valente@linaro.org>
X-Mailing-List:
linux-kernel@vger.kernel.org

From: Francesco Pollicino

The function "bfq_log_bfqq" prints the pid of the process associated
with the queue passed as input. Unfortunately, if the queue is shared,
then more than one process is associated with the queue. The pid that
gets printed in this case is the pid of one of the associated
processes; which one depends on the exact sequence of merge events the
queue underwent. So printing such a pid is useless and, above all,
often confusing, because it reports a random pid among those of the
associated processes.

This commit addresses this issue by printing SHARED instead of a pid
if the queue is shared.

Tested-by: Holger Hoffstätte
Tested-by: Oleksandr Natalenko
Signed-off-by: Francesco Pollicino
Signed-off-by: Paolo Valente
---
 block/bfq-iosched.c | 10 ++++++++++
 block/bfq-iosched.h | 23 +++++++++++++++++++----
 2 files changed, 29 insertions(+), 4 deletions(-)

--
2.20.1

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 500b04df9efa..7d95d9c01036 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2590,6 +2590,16 @@ bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic,
	 *   assignment causes no harm).
	 */
	new_bfqq->bic = NULL;
+	/*
+	 * If the queue is shared, the pid is the pid of one of the associated
+	 * processes. Which pid depends on the exact sequence of merge events
+	 * the queue underwent. So printing such a pid is useless and confusing
+	 * because it reports a random pid between those of the associated
+	 * processes.
+	 * We mark such a queue with a pid -1, and then print SHARED instead of
+	 * a pid in logging messages.
+	 */
+	new_bfqq->pid = -1;
	bfqq->bic = NULL;
	/* release process reference to bfqq */
	bfq_put_queue(bfqq);
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 829730b96fb2..67e63c276c7a 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -32,6 +32,8 @@
 #define BFQ_DEFAULT_GRP_IOPRIO	0
 #define BFQ_DEFAULT_GRP_CLASS	IOPRIO_CLASS_BE

+#define MAX_PID_STR_LENGTH	12
+
 /*
  * Soft real-time applications are extremely more latency sensitive
  * than interactive ones. Over-raise the weight of the former to
@@ -1016,13 +1018,23 @@ void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 /* --------------- end of interface of B-WF2Q+ ---------------- */

 /* Logging facilities. */
+static inline void bfq_pid_to_str(int pid, char *str, int len)
+{
+	if (pid != -1)
+		snprintf(str, len, "%d", pid);
+	else
+		snprintf(str, len, "SHARED-");
+}
+
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
 struct bfq_group *bfqq_group(struct bfq_queue *bfqq);

 #define bfq_log_bfqq(bfqd, bfqq, fmt, args...)	do {			\
+	char pid_str[MAX_PID_STR_LENGTH];				\
+	bfq_pid_to_str((bfqq)->pid, pid_str, MAX_PID_STR_LENGTH);	\
	blk_add_cgroup_trace_msg((bfqd)->queue,				\
			bfqg_to_blkg(bfqq_group(bfqq))->blkcg,		\
-			"bfq%d%c " fmt, (bfqq)->pid,			\
+			"bfq%s%c " fmt, pid_str,			\
			bfq_bfqq_sync((bfqq)) ? 'S' : 'A', ##args);	\
 } while (0)

@@ -1033,10 +1045,13 @@ struct bfq_group *bfqq_group(struct bfq_queue *bfqq);

 #else /* CONFIG_BFQ_GROUP_IOSCHED */

-#define bfq_log_bfqq(bfqd, bfqq, fmt, args...)	\
-	blk_add_trace_msg((bfqd)->queue, "bfq%d%c " fmt, (bfqq)->pid,	\
+#define bfq_log_bfqq(bfqd, bfqq, fmt, args...)	do {			\
+	char pid_str[MAX_PID_STR_LENGTH];				\
+	bfq_pid_to_str((bfqq)->pid, pid_str, MAX_PID_STR_LENGTH);	\
+	blk_add_trace_msg((bfqd)->queue, "bfq%s%c " fmt, pid_str,	\
			bfq_bfqq_sync((bfqq)) ? 'S' : 'A',	\
-				##args)
+				##args);	\
+} while (0)

 #define bfq_log_bfqg(bfqd, bfqg, fmt, args...)
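For illustration only, here is the same formatting trick as a self-contained userspace program; pid_to_str mirrors the bfq_pid_to_str() helper above, while the harness around it is invented for this sketch.

	#include <stdio.h>

	#define MAX_PID_STR_LENGTH 12

	/* format a queue tag that degrades to "SHARED-" for merged queues */
	static void pid_to_str(int pid, char *str, int len)
	{
		if (pid != -1)
			snprintf(str, len, "%d", pid);
		else
			snprintf(str, len, "SHARED-");
	}

	int main(void)
	{
		char tag[MAX_PID_STR_LENGTH];

		pid_to_str(4242, tag, sizeof(tag));
		printf("bfq%sS budget exhausted\n", tag);	/* bfq4242S ... */

		pid_to_str(-1, tag, sizeof(tag));	/* shared (merged) queue */
		printf("bfq%sS budget exhausted\n", tag);	/* bfqSHARED-S ... */
		return 0;
	}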
do {} while (0) #endif /* CONFIG_BFQ_GROUP_IOSCHED */

From patchwork Tue Mar 12 08:59:34 2019
X-Patchwork-Submitter: Paolo Valente
X-Patchwork-Id: 160076
From: Paolo Valente
To: Jens Axboe
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, ulf.hansson@linaro.org, linus.walleij@linaro.org, broonie@kernel.org, bfq-iosched@googlegroups.com, oleksandr@natalenko.name, fra.fra.800@gmail.com, alessio.masola@gmail.com, holger@applied-asynchrony.com, Paolo Valente
Subject: [PATCH BUGFIX IMPROVEMENT V3 8/9] block, bfq: save & resume weight on a queue merge/split
Date: Tue, 12 Mar 2019 09:59:34 +0100
Message-Id: <20190312085935.11340-9-paolo.valente@linaro.org>
In-Reply-To: <20190312085935.11340-1-paolo.valente@linaro.org>
References: <20190312085935.11340-1-paolo.valente@linaro.org>
X-Mailing-List:
linux-kernel@vger.kernel.org

From: Francesco Pollicino

bfq saves the state of a queue each time a merge occurs, so as to be
able to restore such a state when the queue is associated again with
its original process, on a split. Unfortunately bfq does not also save
& restore the weight of the queue. If the weight is not correctly
restored when the queue is recycled, then the weight of the recycled
queue could differ from the weight of the original queue. This commit
adds the missing save & restore of the weight.

Tested-by: Holger Hoffstätte
Tested-by: Oleksandr Natalenko
Signed-off-by: Francesco Pollicino
Signed-off-by: Paolo Valente
---
 block/bfq-iosched.c | 2 ++
 block/bfq-iosched.h | 9 +++++++++
 2 files changed, 11 insertions(+)

--
2.20.1

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 7d95d9c01036..1712d12340c0 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1028,6 +1028,7 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd,
	else
		bfq_clear_bfqq_IO_bound(bfqq);

+	bfqq->entity.new_weight = bic->saved_weight;
	bfqq->ttime = bic->saved_ttime;
	bfqq->wr_coeff = bic->saved_wr_coeff;
	bfqq->wr_start_at_switch_to_srt = bic->saved_wr_start_at_switch_to_srt;
@@ -2502,6 +2503,7 @@ static void bfq_bfqq_save_state(struct bfq_queue *bfqq)
	if (!bic)
		return;

+	bic->saved_weight = bfqq->entity.orig_weight;
	bic->saved_ttime = bfqq->ttime;
	bic->saved_has_short_ttime = bfq_bfqq_has_short_ttime(bfqq);
	bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 67e63c276c7a..60c148728cc5 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -404,6 +404,15 @@ struct bfq_io_cq {
	 */
	bool was_in_burst_list;

+	/*
+	 * Save the weight when a merge occurs, to be able
+	 * to restore it in case of split. If the weight is not
+	 * correctly restored when the queue is recycled,
+	 * then the weight of the recycled queue could differ
+	 * from the weight of the original queue.
+	 */
+	unsigned int saved_weight;
+
	/*
	 * Similar to previous fields: save wr information.
	 */
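The pattern this fix completes can be shown with a minimal standalone C sketch, assuming simplified stand-ins for bfq_io_cq and bfq_queue; only the two assignments mirror the lines added by the patch.

	#include <stdio.h>

	struct queue {			/* stand-in for bfq_queue/bfq_entity */
		unsigned int orig_weight;
		unsigned int new_weight;
	};

	struct io_context {		/* stand-in for bfq_io_cq */
		unsigned int saved_weight;
	};

	static void save_state(struct io_context *ic, const struct queue *q)
	{
		ic->saved_weight = q->orig_weight;	/* the line the fix adds */
	}

	static void resume_state(const struct io_context *ic, struct queue *q)
	{
		q->new_weight = ic->saved_weight;	/* re-applied on split */
	}

	int main(void)
	{
		struct queue q = { .orig_weight = 300, .new_weight = 300 };
		struct io_context ic = { 0 };

		save_state(&ic, &q);	/* queue gets merged away */
		q.new_weight = 100;	/* shared queue evolves differently */
		resume_state(&ic, &q);	/* split: original weight comes back */
		printf("restored weight: %u\n", q.new_weight);	/* 300 */
		return 0;
	}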
From patchwork Tue Mar 12 08:59:35 2019
X-Patchwork-Submitter: Paolo Valente
X-Patchwork-Id: 160075
From: Paolo Valente <paolo.valente@linaro.org>
To: Jens Axboe
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
    ulf.hansson@linaro.org, linus.walleij@linaro.org, broonie@kernel.org,
    bfq-iosched@googlegroups.com, oleksandr@natalenko.name,
    fra.fra.800@gmail.com, alessio.masola@gmail.com,
    holger@applied-asynchrony.com, Paolo Valente <paolo.valente@linaro.org>
Subject: [PATCH BUGFIX IMPROVEMENT V3 9/9] doc, block, bfq: add information on bfq execution time
Date: Tue, 12 Mar 2019 09:59:35 +0100
Message-Id: <20190312085935.11340-10-paolo.valente@linaro.org>
In-Reply-To: <20190312085935.11340-1-paolo.valente@linaro.org>
References: <20190312085935.11340-1-paolo.valente@linaro.org>
The execution time of BFQ has been slightly lowered. Report the new
execution time in the BFQ documentation.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
---
 Documentation/block/bfq-iosched.txt | 29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

--
2.20.1

diff --git a/Documentation/block/bfq-iosched.txt b/Documentation/block/bfq-iosched.txt
index 98a8dd5ee385..1a0f2ac02eb6 100644
--- a/Documentation/block/bfq-iosched.txt
+++ b/Documentation/block/bfq-iosched.txt
@@ -20,13 +20,26 @@ for that device, by setting low_latency to 0. See Section 3 for
 details on how to configure BFQ for the desired tradeoff between
 latency and throughput, or on how to maximize throughput.
 
-BFQ has a non-null overhead, which limits the maximum IOPS that a CPU
-can process for a device scheduled with BFQ. To give an idea of the
-limits on slow or average CPUs, here are, first, the limits of BFQ for
-three different CPUs, on, respectively, an average laptop, an old
-desktop, and a cheap embedded system, in case full hierarchical
-support is enabled (i.e., CONFIG_BFQ_GROUP_IOSCHED is set), but
-CONFIG_DEBUG_BLK_CGROUP is not set (Section 4-2):
+As every I/O scheduler, BFQ adds some overhead to per-I/O-request
+processing. To give an idea of this overhead, the total,
+single-lock-protected, per-request processing time of BFQ---i.e., the
+sum of the execution times of the request insertion, dispatch and
+completion hooks---is, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
+(dated CPU for notebooks; time measured with simple code
+instrumentation, and using the throughput-sync.sh script of the S
+suite [1], in performance-profiling mode). To put this result into
+context, the total, single-lock-protected, per-request execution time
+of the lightest I/O scheduler available in blk-mq, mq-deadline, is 0.7
+us (mq-deadline is ~800 LOC, against ~10500 LOC for BFQ).
+
+Scheduling overhead further limits the maximum IOPS that a CPU can
+process (already limited by the execution of the rest of the I/O
+stack). To give an idea of the limits with BFQ, on slow or average
+CPUs, here are, first, the limits of BFQ for three different CPUs, on,
+respectively, an average laptop, an old desktop, and a cheap embedded
+system, in case full hierarchical support is enabled (i.e.,
+CONFIG_BFQ_GROUP_IOSCHED is set), but CONFIG_DEBUG_BLK_CGROUP is not
+set (Section 4-2):
 - Intel i7-4850HQ: 400 KIOPS
 - AMD A8-3850: 250 KIOPS
 - ARM CortexTM-A53 Octa-core: 80 KIOPS
@@ -566,3 +579,5 @@ applications. Unset this tunable if you need/want to control weights.
 Slightly extended version:
 http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
 results.pdf
+
+[3] https://github.com/Algodev-github/S
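(A quick sanity check on the figures in this hunk, as a sketch: taking
the quoted per-request times at face value, the scheduler-only IOPS
ceiling is simply their reciprocal. Note that the 1.9 us figure was
measured on an i7-2760QM while the 400 KIOPS limit refers to an
i7-4850HQ, so the comparison is indicative only; everything below is
plain arithmetic on numbers from the patch.)

#include <stdio.h>

int main(void)
{
	const double bfq_us = 1.9;         /* per-request time of BFQ (from the doc) */
	const double mq_deadline_us = 0.7; /* per-request time of mq-deadline */

	/* Scheduler-only ceiling: one request every t microseconds,
	 * i.e. (1e6 / t) IOPS, printed in KIOPS. */
	printf("BFQ ceiling:         ~%.0f KIOPS\n", 1000.0 / bfq_us);         /* ~526 */
	printf("mq-deadline ceiling: ~%.0f KIOPS\n", 1000.0 / mq_deadline_us); /* ~1429 */

	/* The measured 400 KIOPS on the i7-4850HQ sits below the ~526 KIOPS
	 * scheduler-only ceiling because the rest of the I/O stack also
	 * consumes CPU time per request. */
	return 0;
}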