block: BFQ default for single queue devices

Message ID 20181002124329.21248-1-linus.walleij@linaro.org
State New

Commit Message

Linus Walleij Oct. 2, 2018, 12:43 p.m.
This sets BFQ as the default scheduler for single queue
block devices (nr_hw_queues == 1) if it is available. This
notably affects MMC/SD cards, but also UBI and the
loopback device.

I have been running it for a while without any negative
effects on my pet systems and I want some wider testing
so let's throw it out there and see what people say.
Admittedly my use cases are limited.

I talked to Pavel a bit back and it turns out he has a
use case for BFQ as well, and I bet he would also like it
as the default scheduler for that system (Pavel, tell us more,
I don't remember what it was!)

Intuitively I could understand that maybe we want to
leave the loop device (possibly others? nbd? rbd?) as
"none", as it is probably relying on a scheduler on the
device below it, so I'm open to passing in a scheduler hint
from the respective subsystem in say struct blk_mq_tag_set.
However that makes for a bit of syntactic dissonance
with the struct member ".nr_hw_queues" (I wonder how
the loop device can have 1 "hardware queue"?) so
maybe we should in that case also rename that struct
member to ".nr_queues" fair and square before we start
making adjustments for treating queues differently whether
they are in hardware or actually not.

Cc: Pavel Machek <pavel@ucw.cz>
Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

---
 block/elevator.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

-- 
2.17.1
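
For readers without the diff at hand, the selection order described
above can be modelled in plain userspace C. This is only a sketch under
stated assumptions: pick_default_sched() and its two boolean parameters
are hypothetical stand-ins for the scheduler lookups done in
elevator_init_mq(), and nothing here is kernel code.

```c
#include <stdbool.h>

/*
 * Userspace model of the proposed default (hypothetical helper, not
 * kernel code). The two flags stand in for whether a lookup of "bfq"
 * or "mq-deadline" would succeed on the running kernel.
 */
const char *pick_default_sched(unsigned int nr_hw_queues,
			       bool bfq_available,
			       bool deadline_available)
{
	if (nr_hw_queues != 1)
		return "none";		/* multiqueue: no scheduler */
	if (bfq_available)
		return "bfq";		/* preferred for single queue */
	if (deadline_available)
		return "mq-deadline";	/* fallback for single queue */
	return "none";			/* last resort */
}
```

Calling pick_default_sched(1, true, true) models an MMC/SD card on a
kernel with BFQ built in, and pick_default_sched(1, false, true) models
the same card on a kernel without BFQ, where the current mq-deadline
default is kept.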

Comments

Jens Axboe Oct. 2, 2018, 2:31 p.m. | #1
On 10/2/18 6:43 AM, Linus Walleij wrote:
> This sets BFQ as the default scheduler for single queue
> block devices (nr_hw_queues == 1) if it is available. This
> affects notably MMC/SD-cards but notably also UBI and
> the loopback device.
>
> I have been running it for a while without any negative
> effects on my pet systems and I want some wider testing
> so let's throw it out there and see what people say.
> Admittedly my use cases are limited.
>
> I talked to Pavel a bit back and it turns out he has a
> usecase for BFQ as well and I bet he also would like it
> as default scheduler for that system (Pavel tell us more,
> I don't remember what it was!)
>
> Intuitively I could understand that maybe we want to
> leave the loop device (possibly others? nbd? rbd?) as
> "none", as it is probably relying on a scheduler on the
> device below it, so I'm open to passing in a scheduler hint
> from the respective subsystem in say struct blk_mq_tag_set.
> However that makes for a bit of syntactic dissonance
> with the struct member ".nr_hw_queues" (I wonder how
> the loop device can have 1 "hardware queue"?) so
> maybe we should in that case also rename that struct
> member to ".nr_queues" fair and square before we start
> making adjustments for treating queues differently whether
> they are in hardware or actually not.

I think this should just be done with udev rules, and I'd
prefer if the distros would lead the way on this, as they
are the ones that will most likely see the most bug reports
on a change like this.

-- 
Jens Axboe
Linus Walleij Oct. 2, 2018, 2:45 p.m. | #2
On Tue, Oct 2, 2018 at 4:31 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 10/2/18 6:43 AM, Linus Walleij wrote:
> > This sets BFQ as the default scheduler for single queue
> > block devices (nr_hw_queues == 1) if it is available. This
> > affects notably MMC/SD-cards but notably also UBI and
> > the loopback device.
>
> I think this should just be done with udev rules, and I'd
> prefer if the distros would lead the way on this, as they
> are the ones that will most likely see the most bug reports
> on a change like this.

AFAICT there is no sysfs property that
states how many hw queues the device has. And what
we want to do is activate BFQ when there is one HW
queue.

Should I make a patch to add a nr_hw_queues sysfs
file for this purpose in that case?

That will be a slightly misleading file for loop or networked
devices.

udev is a way to do this on desktop/server distros that have
"standard" (as they think about it) userspace. They can even
do it from their initrd/initramfs to mount root using BFQ
I guess (quick handover from e.g. UEFI).

However this is not a very good fit for embedded systems.
They tend to be minimal and do not use udev (e.g. Android,
OpenWRT, busybox derivatives...), so they don't do udev
rules, though I guess they can in theory run other scripts.
But they will mount root before anything like that can
happen. They don't use initrd/initramfs.

What I want to achieve
is to mount my rootfs with BFQ but that is not possible
on embedded systems that do not use initramfs, e.g.
a rootfs on MMC/SD or UBI.

Yours,
Linus Walleij
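
For reference, whatever a udev rule or boot script does in the end
comes down to a write to /sys/block/<dev>/queue/scheduler. A minimal
userspace sketch of that step follows; the helper names
sched_is_listed() and sched_select() are made up for illustration, and
the code assumes the usual format of that file, where the active
scheduler is shown in square brackets. It can only run once userspace
is up, so it does not help with the rootfs mount discussed above.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Return true if @name appears in a line read from
 * /sys/block/<dev>/queue/scheduler, e.g. "mq-deadline kyber [bfq] none".
 * Hypothetical helper for illustration only.
 */
bool sched_is_listed(const char *line, const char *name)
{
	char buf[256];
	char *tok;

	/* Work on a copy, since strtok() modifies its input. */
	snprintf(buf, sizeof(buf), "%s", line);

	for (tok = strtok(buf, " \t\n"); tok; tok = strtok(NULL, " \t\n")) {
		size_t len = strlen(tok);

		/* Strip the brackets that mark the active scheduler. */
		if (len >= 2 && tok[0] == '[' && tok[len - 1] == ']') {
			tok[len - 1] = '\0';
			tok++;
		}
		if (strcmp(tok, name) == 0)
			return true;
	}
	return false;
}

/*
 * Try to switch @dev (e.g. "mmcblk0") to scheduler @name through sysfs.
 * Returns 0 on success and -1 on any failure. Needs root privileges.
 * As a simplification, names not currently listed are refused.
 */
int sched_select(const char *dev, const char *name)
{
	char path[128];
	char line[256];
	FILE *f;
	int ret = -1;

	snprintf(path, sizeof(path), "/sys/block/%s/queue/scheduler", dev);

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(line, sizeof(line), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	if (!sched_is_listed(line, name))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(name, f) >= 0)
		ret = 0;
	if (fclose(f) != 0)	/* the write may only fail at close time */
		ret = -1;
	return ret;
}
```

A boot script equivalent would call sched_select("mmcblk0", "bfq") for
each device of interest. Note that a scheduler built as a module may
not be listed until the module is loaded, in which case this sketch
reports failure rather than trying to load it.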
Richard Weinberger Oct. 2, 2018, 9:28 p.m. | #3
Linus,

On Tuesday, 2 October 2018, 14:43:29 CEST, Linus Walleij wrote:
> This sets BFQ as the default scheduler for single queue
> block devices (nr_hw_queues == 1) if it is available. This
> affects notably MMC/SD-cards but notably also UBI and
> the loopback device.

Did you notice a difference for UBI?
Strictly speaking, it affects only ubiblock, the read-only
block device on top of a UBI volume.

Thanks,
//richard
Paolo Valente Oct. 3, 2018, 6:29 a.m. | #4
> On 2 Oct 2018, at 16:31, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 10/2/18 6:43 AM, Linus Walleij wrote:
>> This sets BFQ as the default scheduler for single queue
>> block devices (nr_hw_queues == 1) if it is available. This
>> affects notably MMC/SD-cards but notably also UBI and
>> the loopback device.
>>
>> I have been running it for a while without any negative
>> effects on my pet systems and I want some wider testing
>> so let's throw it out there and see what people say.
>> Admittedly my use cases are limited.
>>
>> I talked to Pavel a bit back and it turns out he has a
>> usecase for BFQ as well and I bet he also would like it
>> as default scheduler for that system (Pavel tell us more,
>> I don't remember what it was!)
>>
>> Intuitively I could understand that maybe we want to
>> leave the loop device (possibly others? nbd? rbd?) as
>> "none", as it is probably relying on a scheduler on the
>> device below it, so I'm open to passing in a scheduler hint
>> from the respective subsystem in say struct blk_mq_tag_set.
>> However that makes for a bit of syntactic dissonance
>> with the struct member ".nr_hw_queues" (I wonder how
>> the loop device can have 1 "hardware queue"?) so
>> maybe we should in that case also rename that struct
>> member to ".nr_queues" fair and square before we start
>> making adjustments for treating queues differently whether
>> they are in hardware or actually not.
>
> I think this should just be done with udev rules, and I'd
> prefer if the distros would lead the way on this, as they
> are the ones that will most likely see the most bug reports
> on a change like this.
>

Hi Jens,
I see your point, but I doubt this is the way to go, because of the
following flaws.

As Linus Torvalds also complained [1], people feel lost among
I/O-scheduler options.  Actual differences across I/O schedulers are
basically obscure to non-experts.  In this respect, Linux-kernel
'users' are way more than a few top-level distros that can afford a
strong performance team, and that, based on the input of such a team,
might venture light-heartedly to change a critical component like an
I/O scheduler.  Plus, as Linus Walleij pointed out, some users simply
are not distros that use udev.

So, probably 99% of Linux-kernel users will just stick to the default
I/O scheduler, mq-deadline, assuming that the algorithm by which that
scheduler was chosen was not "pick the scheduler with the longest
name", but "pick the best scheduler for most cases".  The problem is
that, for single-queue devices with a speed below 400/500 KIOPS, the
default scheduler is apparently incomparably worse than bfq in terms
of responsiveness and latency for time-sensitive applications [2], and
in terms of throughput reached while controlling I/O [3].  And, in all
other tests run so far, by any entity or group I'm aware of, bfq
performs basically on par with or better than mq-deadline.

So, I do understand your need for conservativeness, but, after so much
evidence on single-queue devices, and so many years! :), what's the
point in keeping Linux worse for virtually everybody, by default?

Thanks,
Paolo

[1] https://lkml.org/lkml/2017/2/21/791
[2] http://algo.ing.unimo.it/people/paolo/disk_sched/results.php
[3] https://lwn.net/Articles/763603/



> -- 
> Jens Axboe
Artem Bityutskiy Oct. 3, 2018, 7:05 a.m. | #5
On Wed, 2018-10-03 at 08:29 +0200, Paolo Valente wrote:
> So, I do understand your need for conservativeness, but, after so much
> evidence on single-queue devices, and so many years! :), what's the
> point in keeping Linux worse for virtually everybody, by default?

Sounds like what we need is just a mechanism for the device (ubi block in
this case) to select the I/O scheduler. I doubt enhancing the default
scheduler selection logic in 'elevator.c' is the right answer. Just
give the driver authority to override the defaults.
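
One way to picture such a driver override, combined with the scheduler
hint in the tag set that Linus floated in the patch description, is the
userspace model below. Everything in it is an assumption made for
illustration: struct model_tag_set, its sched_hint member and
pick_sched_with_hint() do not exist in the kernel, and the fallback
assumes mq-deadline is available.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a tag set that carries a scheduler hint. */
struct model_tag_set {
	unsigned int nr_hw_queues;
	const char *sched_hint;	/* NULL: the driver has no preference */
};

/*
 * Pick a default scheduler name. A driver-supplied hint always wins;
 * without one, fall back to the queue-count rule from the patch.
 */
const char *pick_sched_with_hint(const struct model_tag_set *set,
				 bool bfq_available)
{
	if (set->sched_hint)
		return set->sched_hint;		/* driver override */
	if (set->nr_hw_queues != 1)
		return "none";			/* multiqueue device */
	/* Single queue: assumes mq-deadline exists as the fallback. */
	return bfq_available ? "bfq" : "mq-deadline";
}
```

In this model a loop-like driver would set sched_hint to "none" and
keep relying on the scheduler of the device below it, while a driver
that leaves the hint NULL gets whatever the core default is.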
Linus Walleij Oct. 3, 2018, 7:18 a.m. | #6
On Wed, Oct 3, 2018 at 9:05 AM Artem Bityutskiy <dedekind1@gmail.com> wrote:
> On Wed, 2018-10-03 at 08:29 +0200, Paolo Valente wrote:
> > So, I do understand your need for conservativeness, but, after so much
> > evidence on single-queue devices, and so many years! :), what's the
> > point in keeping Linux worse for virtually everybody, by default?
>
> Sounds like what we just need a mechanism for the device (ubi block in
> this case) to select the I/O scheduler. I doubt enhancing the default
> scheduler selection logic in 'elevator.c' is the right answer. Just
> give the driver authority to override the defaults.

This might be true in the wider sense (like for what scheduler to
select for an NVME device with N channels) but $SUBJECT is just
trying to select BFQ (if available) for devices with one and only one
hardware queue.

That is AFAICT the only reasonable choice for anything with just
one hardware queue as things stand right now.

I have a slight reservation for the weird outliers like loopdev, which
has "one hardware queue" (.nr_hw_queues == 1) though this
makes no sense at all. So I would like to know what people think
about that. Maybe we should have .nr_queues and .nr_hw_queues
where the former is the number of logical queues and the latter
the actual number of hardware queues.

Yours,
Linus Walleij
Christoph Hellwig Oct. 3, 2018, 12:51 p.m. | #7
On Wed, Oct 03, 2018 at 07:42:15AM +0000, Damien Le Moal wrote:
> Of note also is that host-managed like sequential zone devices are also likely
> to show up soon with the work being done in the NVMe standard on the new "Zoned
> namespace" feature proposal. These devices will also require a scheduler like
> mq-deadline guaranteeing per-zone in-order delivery of sequential write
> requests. Looking only at the number of queues of the device is not enough to
> choose the best (most reasonnable/appropriate) scheduler.

We actually have a plan to avoid the need for a non-reordering scheduler
there (including a Linux prototype for it).  Let's see if it survives the
committee.
Jan Kara Oct. 3, 2018, 1:25 p.m. | #8
On Wed 03-10-18 08:53:37, Linus Walleij wrote:
> On Wed, Oct 3, 2018 at 8:29 AM Paolo Valente <paolo.valente@linaro.org> wrote:
>
> > So, I do understand your need for conservativeness, but, after so much
> > evidence on single-queue devices, and so many years! :), what's the
> > point in keeping Linux worse for virtually everybody, by default?
>
> I understand if we need to ease things in as well, I don't intend this
> change for the current merge window or anything, since v4.19
> will notably have this patch:
>
> commit d5038a13eca72fb216c07eb717169092e92284f1
> Author: Johannes Thumshirn <jthumshirn@suse.de>
> Date:   Wed Jul 4 10:53:56 2018 +0200
>
>     scsi: core: switch to scsi-mq by default
>
>     It has been more than one year since we tried to change the default from
>     legacy to multi queue in SCSI with commit c279bd9e406 ("scsi: default to
>     scsi-mq"). But due to issues with suspend/resume and performance problems
>     it had been reverted again with commit cbe7dfa26eee ("Revert "scsi: default
>     to scsi-mq"").
>
>     In the meantime there have been a substantial amount of performance
>     improvements and suspend/resume got fixed as well, thus we can re-enable
>     scsi-mq without a significant performance penalty.
>
>     Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
>     Reviewed-by: Hannes Reinecke <hare@suse.com>
>     Reviewed-by: Ming Lei <ming.lei@redhat.com>
>     Acked-by: John Garry <john.garry@huawei.com>
>     Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
>
> I guess that patch can be a bit scary by itself. But IIUC it all went
> fine this time!
>
> But hey, if that works, that means $SUBJECT patch will enable BFQ on all
> libata devices and any SCSI that is single queue as well, not just
> "obscure" stuff like MMC/SD and UBI, and that is
> indeed a massive crowd of legacy devices. But we're talking
> v4.21 here.
>
> Johannes, you might be interested in $SUBJECT patch.
> It'd be nice to hear what SUSE people have to add, since they
> are pretty proactive in this area.

So we do have udev rules in our distro which set the IO scheduler based
on device parameters (rotational at least; with blk-mq we might start
considering number of queues as well, plus we have some exceptions like
virtio, loop, etc.). So the kernel default doesn't concern us too much as a
distro.

I personally would consider bfq a safer default for single-queue devices
(loop probably needs exception) but I don't feel too strongly about it.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Bart Van Assche Oct. 3, 2018, 2:58 p.m. | #9
On Wed, 2018-10-03 at 05:51 -0700, Christoph Hellwig wrote:
> On Wed, Oct 03, 2018 at 07:42:15AM +0000, Damien Le Moal wrote:
> > Of note also is that host-managed like sequential zone devices are also likely
> > to show up soon with the work being done in the NVMe standard on the new "Zoned
> > namespace" feature proposal. These devices will also require a scheduler like
> > mq-deadline guaranteeing per-zone in-order delivery of sequential write
> > requests. Looking only at the number of queues of the device is not enough to
> > choose the best (most reasonnable/appropriate) scheduler.
>
> We actually have a plan to avoid the need for a non-reordering scheduler
> there (including a Linux prototype for it).  Lets see if it survives the
> committee.

Has the work with the T10 committee to standardize the SCSI equivalent of anonymous
writes already started?

Thanks,

Bart.
Christoph Hellwig Oct. 3, 2018, 3:01 p.m. | #10
On Wed, Oct 03, 2018 at 07:58:52AM -0700, Bart Van Assche wrote:
> Has the work with the T10 committee to standardize the SCSI equivalent of anonymous
> writes already started?

No, and I don't know of anyone who wants to do that in the short term.
Bart Van Assche Oct. 3, 2018, 3:15 p.m. | #11
On Wed, 2018-10-03 at 08:01 -0700, Christoph Hellwig wrote:
> On Wed, Oct 03, 2018 at 07:58:52AM -0700, Bart Van Assche wrote:
> > Has the work with the T10 committee to standardize the SCSI equivalent of anonymous
> > writes already started?
>
> No, and I don't know of anyone who wants to do that in the short term.

That's unfortunate. I think having such a command available in the SCSI
command set would be a step forward.

Bart.
Paolo Valente Oct. 3, 2018, 3:51 p.m. | #12
> On 2 Oct 2018, at 14:43, Linus Walleij <linus.walleij@linaro.org> wrote:
>
> This sets BFQ as the default scheduler for single queue
> block devices (nr_hw_queues == 1) if it is available. This
> affects notably MMC/SD-cards but notably also UBI and
> the loopback device.
>
> I have been running it for a while without any negative
> effects on my pet systems and I want some wider testing
> so let's throw it out there and see what people say.
> Admittedly my use cases are limited.
>
> I talked to Pavel a bit back and it turns out he has a
> usecase for BFQ as well and I bet he also would like it
> as default scheduler for that system (Pavel tell us more,
> I don't remember what it was!)
>
> Intuitively I could understand that maybe we want to
> leave the loop device

Actually, I've tested loop devices too.  And, also with these virtual
devices, switching to bfq radically improves figures of merit such as
responsiveness and latency for soft real-time applications.

Thanks,
Paolo


> (possibly others? nbd? rbd?) as
> "none", as it is probably relying on a scheduler on the
> device below it, so I'm open to passing in a scheduler hint
> from the respective subsystem in say struct blk_mq_tag_set.
> However that makes for a bit of syntactic dissonance
> with the struct member ".nr_hw_queues" (I wonder how
> the loop device can have 1 "hardware queue"?) so
> maybe we should in that case also rename that struct
> member to ".nr_queues" fair and square before we start
> making adjustments for treating queues differently whether
> they are in hardware or actually not.
>
> Cc: Pavel Machek <pavel@ucw.cz>
> Cc: Paolo Valente <paolo.valente@linaro.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Ulf Hansson <ulf.hansson@linaro.org>
> Cc: Richard Weinberger <richard@nod.at>
> Cc: Artem Bityutskiy <dedekind1@gmail.com>
> Cc: Adrian Hunter <adrian.hunter@intel.com>
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
> ---
> block/elevator.c | 21 ++++++++++++++-------
> 1 file changed, 14 insertions(+), 7 deletions(-)
>
> diff --git a/block/elevator.c b/block/elevator.c
> index e18ac68626e3..e5a2c39eee7b 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -948,13 +948,15 @@ int elevator_switch_mq(struct request_queue *q,
> }
>
> /*
> - * For blk-mq devices, we default to using mq-deadline, if available, for single
> - * queue devices.  If deadline isn't available OR we have multiple queues,
> - * default to "none".
> + * For blk-mq devices, we default to using:
> + * - "none" for multiqueue devices (nr_hw_queues != 1)
> + * - "bfq", if available, for single queue devices
> + * - "mq-deadline" if "bfq" is not available for single queue devices
> + * - "none" for single queue devices as well as last resort
>  */
> int elevator_init_mq(struct request_queue *q)
> {
> -	struct elevator_type *e;
> +	struct elevator_type *e = NULL;
> 	int err = 0;
>
> 	if (q->nr_hw_queues != 1)
> @@ -968,9 +970,14 @@ int elevator_init_mq(struct request_queue *q)
> 	if (unlikely(q->elevator))
> 		goto out_unlock;
>
> -	e = elevator_get(q, "mq-deadline", false);
> -	if (!e)
> -		goto out_unlock;
> +	if (IS_ENABLED(CONFIG_IOSCHED_BFQ))
> +		e = elevator_get(q, "bfq", false);
> +
> +	if (!e) {
> +		e = elevator_get(q, "mq-deadline", false);
> +		if (!e)
> +			goto out_unlock;
> +	}
>
> 	err = blk_mq_init_sched(q, e);
> 	if (err)
> -- 
> 2.17.1
>
Paolo Valente Oct. 3, 2018, 3:52 p.m. | #13
> On 3 Oct 2018, at 09:42, Damien Le Moal <damien.lemoal@wdc.com> wrote:
>
> On 2018/10/03 16:18, Linus Walleij wrote:
>> On Wed, Oct 3, 2018 at 9:05 AM Artem Bityutskiy <dedekind1@gmail.com> wrote:
>>> On Wed, 2018-10-03 at 08:29 +0200, Paolo Valente wrote:
>>>> So, I do understand your need for conservativeness, but, after so much
>>>> evidence on single-queue devices, and so many years! :), what's the
>>>> point in keeping Linux worse for virtually everybody, by default?
>>>
>>> Sounds like what we just need a mechanism for the device (ubi block in
>>> this case) to select the I/O scheduler. I doubt enhancing the default
>>> scheduler selection logic in 'elevator.c' is the right answer. Just
>>> give the driver authority to override the defaults.
>>
>> This might be true in the wider sense (like for what scheduler to
>> select for an NVME device with N channels) but $SUBJECT is just
>> trying to select BFQ (if available) for devices with one and only one
>> hardware queue.
>>
>> That is AFAICT the only reasonable choice for anything with just
>> one hardware queue as things stand right now.
>>
>> I have a slight reservation for the weird outliers like loopdev, which
>> has "one hardware queue" (.nr_hw_queues == 1) though this
>> makes no sense at all. So I would like to know what people think
>> about that. Maybe we should have .nr_queues and .nr_hw_queues
>> where the former is the number of logical queues and the latter
>> the actual number of hardware queues.
>
> There is another class of outliers: host-managed SMR disks (SATA and SCSI,
> definitely single hw queue). For these, using mq-deadline is mandatory in many
> cases in order to guarantee sequential write command delivery to the device
> driver. Having the default changed to bfq, which as far as I know is not SMR
> friendly (can sequential writes within a single zone be reordered ?) is asking
> for troubles (unaligned write errors showing up).
>

Hi Damien,
actually I have followed the threads on SMR devices, and have already looked
into this.  I'm sorry for not having mentioned it in my first reply.

My plan is to simply port this feature from mq-deadline to bfq.  It
should be really straightforward, especially after the testing you did
through mq-deadline.  Even if I'm missing some less trivial hidden
issue, I guess it won't be impossible to address.

If it may be useful for the outcome of this thread, I'm willing to
raise the priority of this change to bfq.

> A while back, we already had this discussion with Jens and Christoph on the list
> to allow device drivers to set a sensible default I/O scheduler for devices with
> "special needs" (e.g. host-managed SMR). At the time, the conclusion was that
> udev (or something alike in userland) is better suited to set a correct scheduler.
>
> Of note also is that host-managed like sequential zone devices are also likely
> to show up soon with the work being done in the NVMe standard on the new "Zoned
> namespace" feature proposal. These devices will also require a scheduler like
> mq-deadline guaranteeing per-zone in-order delivery of sequential write
> requests. Looking only at the number of queues of the device is not enough to
> choose the best (most reasonnable/appropriate) scheduler.
>

Until bfq simply handles SMR devices too.

Thanks,
Paolo

> -- 
> Damien Le Moal
> Western Digital Research
Bart Van Assche Oct. 3, 2018, 3:54 p.m. | #14
On Wed, 2018-10-03 at 08:29 +0200, Paolo Valente wrote:
> [1] https://lkml.org/lkml/2017/2/21/791
> [2] http://algo.ing.unimo.it/people/paolo/disk_sched/results.php
> [3] https://lwn.net/Articles/763603/

From [2]: "BFQ loses about 18% with only random readers, because the number
of IOPS becomes so high that the execution time and parallel efficiency of
the schedulers becomes relevant." Since the number of I/O patterns for which
results are available on [2] is limited and since the number of devices for
which test results are available on [2] is limited (e.g. RAID is missing),
there might be other cases in which configuring BFQ as the default would
introduce a regression.

I agree with Jens that it's best to leave it to the Linux distributors to
select a default I/O scheduler.

Bart.
Paolo Valente Oct. 3, 2018, 3:55 p.m. | #15
> On 3 Oct 2018, at 13:49, Oleksandr Natalenko <oleksandr@natalenko.name> wrote:
>
> Hi.
>
> On 03.10.2018 08:29, Paolo Valente wrote:
>> As also Linus Torvalds complained [1], people feel lost among
>> I/O-scheduler options.  Actual differences across I/O schedulers are
>> basically obscure to non experts.  In this respect, Linux-kernel
>> 'users' are way more than a few top-level distros that can afford a
>> strong performance team, and that, basing on the input of such a team,
>> might venture light-heartedly to change a critical component like an
>> I/O scheduler.  Plus, as Linus Walleij pointed out, some users simply
>> are not distros that use udev.
>
> I feel a contradiction in this counter-argument. On one hand, there are lots of, let's call them, home users, that use major distributions with udev, so the distribution maintainers can reasonably decide which scheduler to use for which type of device based on the udev rule and common sense provided via Documentation/ by linux-block devs. Moreover, most likely, those rules should be similar or the same across all the major distros and available via some (systemd?) upstream.
>

Let me basically repeat Mark's answer here, in my own words.

Unfortunately, the facts do not match your optimistic view: after so many
years and concordant test results, only very few distributions
switched to bfq, and no major distribution did (AFAIK).  As I already
wrote, the reason is the one pointed out by Torvalds [1].  Do you want
a simple example?  Take the last sentence in Jan's email in this
thread: "I *personally would* consider bfq a safer default ...  but *I
don't feel too strongly* about it." And he is definitely a storage
expert.

The problem, in particular, is that bfq is a complex beast, fighting
against a jungle of I/O issues.  You have to be really into bfq, even
to just know all of its features!

> On another hand, the users of embedded devices, mentioned by Linus, should already know what scheduler to choose because dealing with embedded world assumes the person can decide this on their own, or with the help of abovementioned udev scripts and/or Documentation/ as a reference point.
>

Same situation for embedded devices, if not even worse.  Again for the
same reasons above.  In the end, it is hard even for a kernel expert
to be an in-depth expert of every possible complex component.

> So I see no obstacles here, and the choice to rely on udev by default sounds reasonable.
>
> The question that remain is whether it is really important to mount a root partition while already using some specific scheduler? Why it cannot be done with "none", for instance?
>
>> So, probably 99% of Linux-kernel users will just stick to the default
>> I/O scheduler, mq-deadline, assuming that the algorithm by which that
>> scheduler was chosen was not "pick the scheduler with the longest
>> name", but "pick the best scheduler for most cases".  The problem is
>> that, for single-queue devices with a speed below 400/500 KIOPS, the
>> default scheduler is apparently incomparably worse than bfq in terms
>> of responsiveness and latency for time-sensitive applications [2], and
>> in terms of throughput reached while controlling I/O [3].  And, in all
>> other tests ran so far, by any entity or group I'm aware of, bfq
>> results basically on par with or better than mq-deadline.
>
> And that's why major distributions are likely to default to BFQ via udev. No one argues with BFQ superiority here ☺.
>
>> So, I do understand your need for conservativeness, but, after so much
>> evidence on single-queue devices, and so many years! :), what's the
>> point in keeping Linux worse for virtually everybody, by default?
>
> From my point of view this is not a conservative approach at all. On contrary, offloading decisions to userspace aligns pretty well with recent trends like pressure metrics/userspace OOM killer, eBPF etc. The less unnecessary logic the kernel handles, the more flexibility it affords.
>

To not answer too seriously here, let me answer with a quote that is
still missing a clear paternity: "Everything should be made as simple
as possible, but not simpler." :)

Thanks,
Paolo

> -- 
>  Oleksandr Natalenko (post-factum)
Bart Van Assche Oct. 3, 2018, 4 p.m. | #16
On Wed, 2018-10-03 at 17:55 +0200, Paolo Valente wrote:
> The problem, in particular, is that bfq is a complex beast, fighting
> against a jungle of I/O issues.  You have to be really into bfq, even
> to just know all of its features!

This is a problem by itself. I don't know anyone who wants to have to deal
with I/O scheduler tunables.

Bart.
Paolo Valente Oct. 3, 2018, 4:02 p.m. | #17
> On 3 Oct 2018, at 17:54, Bart Van Assche <bvanassche@acm.org> wrote:
>
> On Wed, 2018-10-03 at 08:29 +0200, Paolo Valente wrote:
>> [1] https://lkml.org/lkml/2017/2/21/791
>> [2] http://algo.ing.unimo.it/people/paolo/disk_sched/results.php
>> [3] https://lwn.net/Articles/763603/
>
> From [2]: "BFQ loses about 18% with only random readers, because the number
> of IOPS becomes so high that the execution time and parallel efficiency of
> the schedulers becomes relevant." Since the number of I/O patterns for which
> results are available on [2] is limited and since the number of devices for
> which test results are available on [2] is limited (e.g. RAID is missing),
> there might be other cases in which configuring BFQ as the default would
> introduce a regression.
>

From [3]: none with throttling loses 80% of the throughput when used
to control I/O. On any drive. And this is really only one example among a ton.

In addition, the test you mention, designed by me, was meant exactly
to find and show the worst breaking point of BFQ.  If your main
workload of interest is really made only of tens of parallel threads
doing only sync random I/O, and you care only about throughput,
without any concern for your system becoming so unresponsive as to be
unusable during the test, then, yes, mq-deadline is a better option
for you.

So, are you really sure the balance is in favor of mq-deadline?

Thanks,
Paolo

> I agree with Jens that it's best to leave it to the Linux distributors to
> select a default I/O scheduler.
>
> Bart.
Paolo Valente Oct. 3, 2018, 4:04 p.m. | #18
> On 3 Oct 2018, at 18:00, Bart Van Assche <bvanassche@acm.org> wrote:
>
> On Wed, 2018-10-03 at 17:55 +0200, Paolo Valente wrote:
>> The problem, in particular, is that bfq is a complex beast, fighting
>> against a jungle of I/O issues.  You have to be really into bfq, even
>> to just know all of its features!
>
> This is a problem by itself. I don't know anyone who wants to have to deal
> with I/O scheduler tunables.
>

In fact, I designed and am constantly improving bfq, exactly so that
you don't have to touch any tunable.

Thanks,
Paolo

> Bart.
>
Paolo Valente Oct. 3, 2018, 5:22 p.m. | #19
> On 3 Oct 2018, at 18:02, Paolo Valente <paolo.valente@linaro.org> wrote:
>
>> On 3 Oct 2018, at 17:54, Bart Van Assche <bvanassche@acm.org> wrote:
>>
>> On Wed, 2018-10-03 at 08:29 +0200, Paolo Valente wrote:
>>> [1] https://lkml.org/lkml/2017/2/21/791
>>> [2] http://algo.ing.unimo.it/people/paolo/disk_sched/results.php
>>> [3] https://lwn.net/Articles/763603/
>>
>> From [2]: "BFQ loses about 18% with only random readers, because the number
>> of IOPS becomes so high that the execution time and parallel efficiency of
>> the schedulers becomes relevant." Since the number of I/O patterns for which
>> results are available on [2] is limited and since the number of devices for
>> which test results are available on [2] is limited (e.g. RAID is missing),
>> there might be other cases in which configuring BFQ as the default would
>> introduce a regression.
>>
>
> From [3]: none with throttling loses 80% of the throughput when used
> to control I/O. On any drive. And this is really only one example among a ton.
>

I forgot to add that the same 80% loss happens with mq-deadline plus
throttling, sorry.  In addition, mq-deadline suffers from much more
than an 18% loss of throughput, w.r.t. bfq, exactly in the same figure
you cited, if there are random writes too.

> In addition, the test you mention, designed by me, was meant exactly
> to find and show the worst breaking point of BFQ.  If your main
> workload of interest is really made only of tens of parallel threads
> doing only sync random I/O, and you care only about throughput,
> without any concern for your system becoming so unresponsive as to be
> unusable during the test, then, yes, mq-deadline is a better option
> for you.
>


Some more detail on this.  The fact that bfq reaches a lower
throughput than none in this test is actually still puzzling me,
because the process rate of I/O with bfq is one order of magnitude
higher than the IOPS of this device.  So, I still don't understand
why, with bfq, the queue of the device does not get as full as with
none, and thus why the throughput with bfq is not the same as with
none.

To further test this issue, I replaced sync I/O with async I/O (with a
very high depth).  And, nonsensically (for me), throughput dropped
with both bfq and none!  I already meant to report this issue,
after investigating it more.  Anyway, this is a different story w.r.t.
this thread.

Thanks,
Paolo


> So, are you really sure the balance is in favor of mq-deadline?
>
> Thanks,
> Paolo
>
>> I agree with Jens that it's best to leave it to the Linux distributors to
>> select a default I/O scheduler.
>>
>> Bart.

> 

Johannes Thumshirn Oct. 4, 2018, 7:45 a.m. | #20
On Wed, Oct 03, 2018 at 03:25:54PM +0200, Jan Kara wrote:
> On Wed 03-10-18 08:53:37, Linus Walleij wrote:

> > On Wed, Oct 3, 2018 at 8:29 AM Paolo Valente <paolo.valente@linaro.org> wrote:

> > 

> > > So, I do understand your need for conservativeness, but, after so much

> > > evidence on single-queue devices, and so many years! :), what's the

> > > point in keeping Linux worse for virtually everybody, by default?

> > 

> > I understand if we need to ease things in as well, I don't intend this

> > change for the current merge window or anything, since v4.19

> > will notably have this patch:

> > 

> > commit d5038a13eca72fb216c07eb717169092e92284f1

> > Author: Johannes Thumshirn <jthumshirn@suse.de>

> > Date:   Wed Jul 4 10:53:56 2018 +0200

> > 

> >     scsi: core: switch to scsi-mq by default

> > 

> >     It has been more than one year since we tried to change the default from

> >     legacy to multi queue in SCSI with commit c279bd9e406 ("scsi: default to

> >     scsi-mq"). But due to issues with suspend/resume and performance problems

> >     it had been reverted again with commit cbe7dfa26eee ("Revert "scsi: default

> >     to scsi-mq"").

> > 

> >     In the meantime there have been a substantial amount of performance

> >     improvements and suspend/resume got fixed as well, thus we can re-enable

> >     scsi-mq without a significant performance penalty.

> > 

> >     Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>

> >     Reviewed-by: Hannes Reinecke <hare@suse.com>

> >     Reviewed-by: Ming Lei <ming.lei@redhat.com>

> >     Acked-by: John Garry <john.garry@huawei.com>

> >     Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

> > 

> > I guess that patch can be a bit scary by itself. But IIUC it all went

> > fine this time!

> > 

> > But hey, if that works, that means $SUBJECT patch will enable BFQ on all

> > libata devices and any SCSI that is single queue as well, not just

> > "obscure" stuff like MMC/SD and UBI, and that is

> > indeed a massive crowd of legacy devices. But we're talking

> > v4.21 here.

> > 

> > Johannes, you might be interested in $SUBJECT patch.

> > It'd be nice to hear what SUSE people have to add, since they

> > are pretty proactive in this area.

> 

> So we do have udev rules in our distro which set the IO scheduler based

> on device parameters (rotational at least, with blk-mq we might start

> considering number of queues as well, plus we have some exceptions like

> virtio, loop, etc.). So the kernel default doesn't concern us too much as a

> distro.

> 

> I personally would consider bfq a safer default for single-queue devices

> (loop probably needs exception) but I don't feel too strongly about it.


[Full quote for context]

What about resurrecting CONFIG_DEFAULT_IOSCHED for MQ as well and
leave it default to mq-deadline but give bfq, kyber and none as a
choice as well?

The question is shall we only do it for single queue devices or for
native MQ devices as well if we go down that road?

I understand the embedded folks will want a different interface than
udev, but from the non-embedded point of view I'm with Jens and Jan
here, let udev do the job.

      Johannes
-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Linus Walleij Oct. 4, 2018, 8:21 a.m. | #21
On Wed, Oct 3, 2018 at 7:34 PM Bryan Gurney <bgurney@redhat.com> wrote:

> Right now, users of host-managed SMR drives should be using "deadline"

> or "mq-deadline", to avoid out-of-order writes in sequential-only

> zones.

>

> I'm running into a situation right now on a test system (Fedora 28,

> 4.18.7 kernel) where I copied test data onto an F2FS filesystem, but I

> accidentally forgot to add my "udev rule" file:


This should be fixed after
d5038a13eca7 scsi: core: switch to scsi-mq by default
right?

Since mq uses mq-deadline by default.

I'm making sure to preserve mq-deadline on zoned devices
in my v2 of this patch.
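
A minimal sketch of the selection policy described here and in the original
patch description, written as plain userspace C rather than as the actual
block/elevator.c change. The struct queue_hints, its fields, and the function
pick_default_sched() are hypothetical names used only for illustration; only
the scheduler names come from this thread.

```c
#include <stdbool.h>

/* Hypothetical summary of what the block layer knows about a device. */
struct queue_hints {
	unsigned int nr_hw_queues;	/* number of hardware dispatch queues */
	bool zoned;			/* host-managed zoned (e.g. SMR) device */
	bool bfq_available;		/* BFQ built in or loadable */
};

/* Return the name of the scheduler to use when nothing else was requested. */
const char *pick_default_sched(const struct queue_hints *h)
{
	if (h->nr_hw_queues != 1)
		return "none";		/* multi-queue devices: no scheduler */
	if (h->zoned)
		return "mq-deadline";	/* preserved for zoned devices */
	if (h->bfq_available)
		return "bfq";		/* proposed default for single queue */
	return "mq-deadline";		/* fallback when BFQ is not available */
}
```

With these assumptions a single queue SMR drive keeps mq-deadline, while an
SD card on a kernel with BFQ enabled gets bfq.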

Yours,
Linus Walleij
Andreas Herrmann Oct. 4, 2018, 8:24 a.m. | #22
On Thu, Oct 04, 2018 at 09:45:35AM +0200, Johannes Thumshirn wrote:
> On Wed, Oct 03, 2018 at 03:25:54PM +0200, Jan Kara wrote:

> > On Wed 03-10-18 08:53:37, Linus Walleij wrote:

> > > On Wed, Oct 3, 2018 at 8:29 AM Paolo Valente <paolo.valente@linaro.org> wrote:

> > > 

> > > > So, I do understand your need for conservativeness, but, after so much

> > > > evidence on single-queue devices, and so many years! :), what's the

> > > > point in keeping Linux worse for virtually everybody, by default?

> > > 

> > > I understand if we need to ease things in as well, I don't intend this

> > > change for the current merge window or anything, since v4.19

> > > will notably have this patch:

> > > 

> > > commit d5038a13eca72fb216c07eb717169092e92284f1

> > > Author: Johannes Thumshirn <jthumshirn@suse.de>

> > > Date:   Wed Jul 4 10:53:56 2018 +0200

> > > 

> > >     scsi: core: switch to scsi-mq by default

> > > 

> > >     It has been more than one year since we tried to change the default from

> > >     legacy to multi queue in SCSI with commit c279bd9e406 ("scsi: default to

> > >     scsi-mq"). But due to issues with suspend/resume and performance problems

> > >     it had been reverted again with commit cbe7dfa26eee ("Revert "scsi: default

> > >     to scsi-mq"").

> > > 

> > >     In the meantime there have been a substantial amount of performance

> > >     improvements and suspend/resume got fixed as well, thus we can re-enable

> > >     scsi-mq without a significant performance penalty.

> > > 

> > >     Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>

> > >     Reviewed-by: Hannes Reinecke <hare@suse.com>

> > >     Reviewed-by: Ming Lei <ming.lei@redhat.com>

> > >     Acked-by: John Garry <john.garry@huawei.com>

> > >     Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

> > > 

> > > I guess that patch can be a bit scary by itself. But IIUC it all went

> > > fine this time!

> > > 

> > > But hey, if that works, that means $SUBJECT patch will enable BFQ on all

> > > libata devices and any SCSI that is single queue as well, not just

> > > "obscure" stuff like MMC/SD and UBI, and that is

> > > indeed a massive crowd of legacy devices. But we're talking

> > > v4.21 here.

> > > 

> > > Johannes, you might be interested in $SUBJECT patch.

> > > It'd be nice to hear what SUSE people have to add, since they

> > > are pretty proactive in this area.

> > 

> > So we do have udev rules in our distro which set the IO scheduler based

> > on device parameters (rotational at least, with blk-mq we might start

> > considering number of queues as well, plus we have some exceptions like

> > virtio, loop, etc.). So the kernel default doesn't concern us too much as a

> > distro.

> > 

> > I personally would consider bfq a safer default for single-queue devices

> > (loop probably needs exception) but I don't feel too strongly about it.

> 

> [Full quote for context]

> 

> What about resurrecting CONFIG_DEFAULT_IOSCHED for MQ as well and

> leave it default to mq-deadline but give bfq, kyber and none as a

> choice as well?


I second this -- introduction of a CONFIG_DEFAULT_MQ_IOSCHED.
Having a default I/O scheduler kernel config option for MQ allows
building a kernel suitable for a specific use w/o userspace
dependencies.
(But it still allows reconfiguring things via userspace.)

> The question is shall we only do it for single queue devices or for

> native MQ devices as well if we go down that road?


Good question. I am not yet sure about this.
I'd start with using the default for single queue devices.
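
As a rough sketch of that idea (not an actual patch): the macro
DEFAULT_MQ_IOSCHED below stands in for the string that a hypothetical
CONFIG_DEFAULT_MQ_IOSCHED option would provide, and default_mq_sched() is an
illustrative name. The build-time default is applied to single queue devices
only, and userspace could still change the scheduler afterwards.

```c
/*
 * Stand-in for a hypothetical Kconfig string. A different default can be
 * chosen at build time, e.g. with -DDEFAULT_MQ_IOSCHED='"bfq"'.
 */
#ifndef DEFAULT_MQ_IOSCHED
#define DEFAULT_MQ_IOSCHED "mq-deadline"
#endif

/* Single queue devices get the configured default, all others get "none". */
const char *default_mq_sched(unsigned int nr_hw_queues)
{
	return nr_hw_queues == 1 ? DEFAULT_MQ_IOSCHED : "none";
}
```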

Andreas

> I understand the embedded folks will want a different interface than

> udev, but from the non-embedded point of view I'm with Jens and Jan

> here, let udev do the job.

> 

>       Johannes

> -- 

> Johannes Thumshirn                                          Storage

> jthumshirn@suse.de                                +49 911 74053 689

> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg

> GF: Felix Imendörffer, Jane Smithard, Graham Norton

> HRB 21284 (AG Nürnberg)

> Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Linus Walleij Oct. 4, 2018, 8:25 a.m. | #23
On Wed, Oct 3, 2018 at 1:49 PM Oleksandr Natalenko
<oleksandr@natalenko.name> wrote:

> On another hand, the users of embedded devices, mentioned by Linus,

> should already know what scheduler to choose because dealing with

> embedded world assumes the person can decide this on their own, or with

> the help of abovementioned udev scripts and/or Documentation/ as a

> reference point.

>

> So I see no obstacles here, and the choice to rely on udev by default

> sounds reasonable.


I am sorry but I do not agree with this.

There are several historical precedents where we have
concluded that just "have the kernel do the right thing
by default" is the way to go.

Example 1: pluggable CPU schedulers.
The reasoning was that users or distros have no clue what
scheduler they want, only scheduler developers do. We
drove it to the point where we have one and one
scheduler only, not different flavors. (Special
usecases have special scheduling classes inside the
one scheduler instead.)

Example 2: Automatic process group scheduling
The reasoning was that daemons such as systemd would
be better at placing processes/tasks into the right
control groups to manage their resources, so this would
be a userspace policy handled by the udev/systemd
complex. We did not do that. Instead the kernel does
autogrouping per-session, indeed it is a Kconfig option
but even e.g. Fedora has this enabled by default.
(commit 5091faa449ee)

As pointed out elsewhere: these defaults make it
easy for custom builds not using udev+systemd to
get a system up and running with sensible defaults.

Simple embedded systems use Busybox' mdev (I wouldn't
trust it to make any complex decisions). OpenWRT
has ubox+ubus+uci, also extremely lightweight.
Android has its own init system that I don't
manage to keep track of anymore. Instead of running
all over the map and fixing these userspaces to
do the right thing, it makes sense to make the
right thing the default.
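
For reference, this is roughly what each of those userspaces would have to do
for every block device at boot. A minimal sketch in C that writes to the sysfs
attribute /sys/block/<dev>/queue/scheduler; the function names iosched_path()
and set_iosched() are made up for illustration.

```c
#include <stdio.h>

/* Build "/sys/block/<dev>/queue/scheduler"; returns -1 if buf is too small. */
int iosched_path(char *buf, size_t len, const char *dev)
{
	int n = snprintf(buf, len, "/sys/block/%s/queue/scheduler", dev);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Select a scheduler for one device; needs root and a scheduler the kernel offers. */
int set_iosched(const char *dev, const char *sched)
{
	char path[256];
	FILE *f;
	int ok;

	if (iosched_path(path, sizeof(path), dev) != 0)
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs(sched, f) >= 0;
	if (fclose(f) != 0)	/* a rejected write shows up when the buffer is flushed */
		ok = 0;
	return ok ? 0 : -1;
}
```

A call such as set_iosched("mmcblk0", "bfq") would then be needed once per
device, on every boot, in every init system.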

And these are millions and millions of deployed
systems not using udev+systemd we are talking about,
they are not fringe hobby projects. It's not that I
personally dislike udev or anything, I kind of like
it, but these tailored distros simply don't use it
and they are huge in numbers. They need help to do
the right thing. Fixing a udev rule doesn't solve
even half the world's problems I'm afraid.

Yours,
Linus Walleij
Mark Brown Oct. 4, 2018, 10:13 a.m. | #24
On Thu, Oct 04, 2018 at 10:14:38AM +0200, Linus Walleij wrote:

> And these are millions and millions of deployed

> systems not using udev+systemd we are talking about,

> they are not fringe hobby projects. It's not that I

> personally dislike udev or anything, I kind of like

> it, but these tailored distros simply don't use it

> and they are huge in numbers. They need help to do

> the right thing. Fixing a udev rule doesn't solve

> even half the world's problems I'm afraid.


Further, even those embedded systems that do use udev (some of them do
lean heavily on modern init stuff like that and systemd, especially when
boot time is a priority) will still need to get the relevant udev
rule installed somehow.
Bart Van Assche Oct. 4, 2018, 3:10 p.m. | #25
On Thu, 2018-10-04 at 11:13 +0100, Mark Brown wrote:
> On Thu, Oct 04, 2018 at 10:14:38AM +0200, Linus Walleij wrote:

> 

> > And these are millions and millions of deployed

> > systems not using udev+systemd we are talking about,

> > they are not fringe hobby projects. It's not that I

> > personally dislike udev or anything, I kind of like

> > it, but these tailored distros simply don't use it

> > and they are huge in numbers. They need help to do

> > the right thing. Fixing a udev rule doesn't solve

> > even half the world's problems I'm afraid.

> 

> Further, even those embedded systems that do use udev (some of them do

> lean heavily on modern init stuff like that and systemd, especially when

> boot time is a priority) they'll still need to get the relevant udev

> rule installed somehow.


Hi Mark,

Are you aware that the systemd source tree includes a set of udev rules?
See also https://github.com/systemd/systemd/tree/master/rules.

Bart.
Mark Brown Oct. 4, 2018, 3:26 p.m. | #26
On Thu, Oct 04, 2018 at 08:10:57AM -0700, Bart Van Assche wrote:
> On Thu, 2018-10-04 at 11:13 +0100, Mark Brown wrote:


> > Further, even those embedded systems that do use udev (some of them do

> > lean heavily on modern init stuff like that and systemd, especially when

> > boot time is a priority) they'll still need to get the relevant udev

> > rule installed somehow.


> Are you aware that the systemd source tree includes a set of udev rules?

> See also https://github.com/systemd/systemd/tree/master/rules.


Yeah, but then you're back to the situation where someone needs to go
pick up a new version of systemd to get the new rules along with the new
kernel.  It's not insurmountable but it's an obstacle.
Bart Van Assche Oct. 4, 2018, 8:09 p.m. | #27
On Thu, 2018-10-04 at 20:25 +0100, Alan Cox wrote:
> > I agree with Jens that it's best to leave it to the Linux distributors to

> > select a default I/O scheduler.

> 

> That assumes such a thing exists. The kernel knows what devices it is

> dealing with. The kernel 'default' ought to be 'whatever is usually best

> for this device'. A distro cannot just pick a correct single default

> because NVME and USB sticks are both normal and rather different in needs.


Which I/O scheduler works best also depends on which workload the user will run.
BFQ has significant advantages for interactive workloads like video replay
with concurrent background I/O but probably slows down kernel builds. That's
why I'm not sure whether the kernel should select the default I/O scheduler.

Bart.
Paolo Valente Oct. 4, 2018, 8:19 p.m. | #28
> On 4 Oct 2018, at 21:25, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:
>
>> I agree with Jens that it's best to leave it to the Linux distributors to
>> select a default I/O scheduler.
>
> That assumes such a thing exists.


Well, as of now the default is more or less in the Schrödinger's cat state :)

Metaphors apart, I do agree with your point.  As for me, what you
point out is one of the core issues at stake here.

Thanks,
Paolo

> The kernel knows what devices it is
> dealing with. The kernel 'default' ought to be 'whatever is usually best
> for this device'. A distro cannot just pick a correct single default
> because NVME and USB sticks are both normal and rather different in needs.
>
> Alan
Paolo Valente Oct. 4, 2018, 8:39 p.m. | #29
> On 4 Oct 2018, at 22:09, Bart Van Assche <bvanassche@acm.org> wrote:
>
> On Thu, 2018-10-04 at 20:25 +0100, Alan Cox wrote:
>>> I agree with Jens that it's best to leave it to the Linux distributors to
>>> select a default I/O scheduler.
>>
>> That assumes such a thing exists. The kernel knows what devices it is
>> dealing with. The kernel 'default' ought to be 'whatever is usually best
>> for this device'. A distro cannot just pick a correct single default
>> because NVME and USB sticks are both normal and rather different in needs.
>
> Which I/O scheduler works best also depends on which workload the user will run.
> BFQ has significant advantages for interactive workloads like video replay
> with concurrent background I/O but probably slows down kernel builds.


No, kernel build is, for evident reasons, one of the workloads I cared
most about.  Actually, I tried to focus on all my main
kernel-development tasks, such as also git checkout, git merge, git
grep, ...

According to my test results, with BFQ these tasks are at least as
fast as, or, in most system configurations, much faster than with the
other schedulers.  Of course, at the same time the system also remains
responsive with BFQ.

You can repeat these tests using one of my first scripts in the S
suite: kern_dev_tasks_vs_rw.sh (usually, the older the tests, the more
hypertrophied the names I gave :) ).

I also stopped sharing my kernel-build results years ago, because I
kept obtaining the same good results for years, and I'm
aware that I tend to show and say too much stuff.

Thanks,
Paolo

> That's
> why I'm not sure whether the kernel should select the default I/O scheduler.
>
> Bart.
Bart Van Assche Oct. 4, 2018, 10:42 p.m. | #30
On Thu, 2018-10-04 at 22:39 +0200, Paolo Valente wrote:
> No, kernel build is, for evident reasons, one of the workloads I cared

> most about.  Actually, I tried to focus on all my main

> kernel-development tasks, such as also git checkout, git merge, git

> grep, ...

> 

> According to my test results, with BFQ these tasks are at least as

> fast as, or, in most system configurations, much faster than with the

> other schedulers.  Of course, at the same time the system also remains

> responsive with BFQ.

> 

> You can repeat these tests using one of my first scripts in the S

> suite: kern_dev_tasks_vs_rw.sh (usually, the older the tests, the more

> hypertrophied the names I gave :) ).

> 

> I stopped sharing also my kernel-build results years ago, because I

> went on obtaining the same, identical good results for years, and I'm

> aware that I tend to show and say too much stuff.


On my test setup building the kernel is slightly slower when using the BFQ
scheduler compared to using scheduler "none" (kernel 4.18.12, NVMe SSD,
single CPU with 6 cores, hyperthreading disabled). I am aware that the
proposal at the start of this thread was to make BFQ the default for devices
with a single hardware queue and not for devices like NVMe SSDs that support
multiple hardware queues.

What I think is missing is measurement results for BFQ on a system with
multiple CPU sockets and against a fast storage medium. Eliminating
the host lock from the SCSI core yielded a significant performance
improvement for such storage devices. Since the BFQ scheduler locks and
unlocks bfqd->lock for every dispatch operation it is very likely that BFQ
will slow down I/O for fast storage devices, even if their driver only
creates a single hardware queue.

Bart.
Christoph Hellwig Oct. 5, 2018, 6:24 a.m. | #31
On Wed, Oct 03, 2018 at 08:15:24AM -0700, Bart Van Assche wrote:
> On Wed, 2018-10-03 at 08:01 -0700, Christoph Hellwig wrote:

> > On Wed, Oct 03, 2018 at 07:58:52AM -0700, Bart Van Assche wrote:

> > > Has the work with the T10 committee to standardize the SCSI equivalent of anonymous

> > > writes already started?

> > 

> > No, and I don't know of anyone who wants to do that in the short term.

> 

> That's unfortunate. I think having such a command available in the SCSI

> command set would be a step forward.


I'm not saying it doesn't make sense, only that I don't know of any short
term plans.
Pavel Machek Oct. 5, 2018, 8:04 a.m. | #32
Hi!

> I talked to Pavel a bit back and it turns out he has a

> usecase for BFQ as well and I bet he also would like it

> as default scheduler for that system (Pavel tell us more,

> I don't remember what it was!)


I'm not sure I remember clearly, either.

IIRC I was working with ionice on spinning disks, and it had no
effect. I switched to BFQ and suddenly ionice was effective.
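
For context, ionice boils down to the ioprio_set() system call. A minimal
sketch, assuming Linux and glibc's syscall() wrapper; the constants are copied
from the kernel's ioprio definitions, and the helper names ioprio_value() and
set_idle_ioprio() are made up for this example. Whether the resulting priority
has any effect depends on the I/O scheduler in use, which matches the
observation above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/syscall.h>
#include <unistd.h>

/* Values from the kernel's ioprio definitions. */
#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_CLASS_IDLE	3
#define IOPRIO_WHO_PROCESS	1

/* Pack an I/O class and a per-class level into the value ioprio_set() expects. */
int ioprio_value(int ioclass, int level)
{
	return (ioclass << IOPRIO_CLASS_SHIFT) | level;
}

/* Put the calling process into the idle I/O class, like "ionice -c 3". */
int set_idle_ioprio(void)
{
	return (int)syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
			    ioprio_value(IOPRIO_CLASS_IDLE, 0));
}
```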

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Jan Kara Oct. 5, 2018, 9:16 a.m. | #33
On Thu 04-10-18 15:42:52, Bart Van Assche wrote:
> On Thu, 2018-10-04 at 22:39 +0200, Paolo Valente wrote:

> > No, kernel build is, for evident reasons, one of the workloads I cared

> > most about.  Actually, I tried to focus on all my main

> > kernel-development tasks, such as also git checkout, git merge, git

> > grep, ...

> > 

> > According to my test results, with BFQ these tasks are at least as

> > fast as, or, in most system configurations, much faster than with the

> > other schedulers.  Of course, at the same time the system also remains

> > responsive with BFQ.

> > 

> > You can repeat these tests using one of my first scripts in the S

> > suite: kern_dev_tasks_vs_rw.sh (usually, the older the tests, the more

> > hypertrophied the names I gave :) ).

> > 

> > I stopped sharing also my kernel-build results years ago, because I

> > went on obtaining the same, identical good results for years, and I'm

> > aware that I tend to show and say too much stuff.

> 

> On my test setup building the kernel is slightly slower when using the BFQ

> scheduler compared to using scheduler "none" (kernel 4.18.12, NVMe SSD,

> single CPU with 6 cores, hyperthreading disabled). I am aware that the

> proposal at the start of this thread was to make BFQ the default for devices

> with a single hardware queue and not for devices like NVMe SSDs that support

> multiple hardware queues.

> 

> What I think is missing is measurement results for BFQ on a system with

> multiple CPU sockets and against a fast storage medium. Eliminating

> the host lock from the SCSI core yielded a significant performance

> improvement for such storage devices. Since the BFQ scheduler locks and

> unlocks bfqd->lock for every dispatch operation it is very likely that BFQ

> will slow down I/O for fast storage devices, even if their driver only

> creates a single hardware queue.


Well, I'm not sure why that is missing. I don't think anyone proposed to
default to BFQ for such setup? Neither was anyone claiming that BFQ is
better in such situation... The proposal has been: Default to BFQ for slow
storage, leave it to mq-deadline otherwise.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Bart Van Assche Oct. 6, 2018, 3:12 a.m. | #34
On 10/5/18 2:16 AM, Jan Kara wrote:
> On Thu 04-10-18 15:42:52, Bart Van Assche wrote:

>> What I think is missing is measurement results for BFQ on a system with

>> multiple CPU sockets and against a fast storage medium. Eliminating

>> the host lock from the SCSI core yielded a significant performance

>> improvement for such storage devices. Since the BFQ scheduler locks and

>> unlocks bfqd->lock for every dispatch operation it is very likely that BFQ

>> will slow down I/O for fast storage devices, even if their driver only

>> creates a single hardware queue.

> 

> Well, I'm not sure why that is missing. I don't think anyone proposed to

> default to BFQ for such setup? Neither was anyone claiming that BFQ is

> better in such situation... The proposal has been: Default to BFQ for slow

> storage, leave it to deadline-mq otherwise.


Hi Jan,

How do you define slow storage? The proposal at the start of this thread 
was to make BFQ the default for all block devices that create a single 
hardware queue. That includes all SATA storage since scsi-mq only 
creates a single hardware queue when using the SATA protocol. The 
proposal to make BFQ the default for systems with a single hard disk 
probably makes sense but I am not sure that making BFQ the default for 
systems equipped with one or more (SATA) SSDs is also a good idea. 
Especially for multi-socket systems since BFQ reintroduces a queue-wide 
lock. As you know no queue-wide locking happens during I/O in the 
scsi-mq core nor in the blk-mq core.

Bart.
Paolo Valente Oct. 6, 2018, 6:46 a.m. | #35
> On 6 Oct 2018, at 05:12, Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 10/5/18 2:16 AM, Jan Kara wrote:
>> On Thu 04-10-18 15:42:52, Bart Van Assche wrote:
>>> What I think is missing is measurement results for BFQ on a system with
>>> multiple CPU sockets and against a fast storage medium. Eliminating
>>> the host lock from the SCSI core yielded a significant performance
>>> improvement for such storage devices. Since the BFQ scheduler locks and
>>> unlocks bfqd->lock for every dispatch operation it is very likely that BFQ
>>> will slow down I/O for fast storage devices, even if their driver only
>>> creates a single hardware queue.
>> Well, I'm not sure why that is missing. I don't think anyone proposed to
>> default to BFQ for such setup? Neither was anyone claiming that BFQ is
>> better in such situation... The proposal has been: Default to BFQ for slow
>> storage, leave it to deadline-mq otherwise.
>
> Hi Jan,
>
> How do you define slow storage? The proposal at the start of this thread
> was to make BFQ the default for all block devices that create a single
> hardware queue. That includes all SATA storage since scsi-mq only
> creates a single hardware queue when using the SATA protocol. The
> proposal to make BFQ the default for systems with a single hard disk
> probably makes sense but I am not sure that making BFQ the default for
> systems equipped with one or more (SATA) SSDs is also a good idea.
> Especially for multi-socket systems since BFQ reintroduces a queue-wide
> lock.


No, BFQ has no queue-wide lock.  The very first change made to BFQ for
porting it to blk-mq was to remove the queue lock.  Guided by Jens, I
replaced that lock with the exact same scheduler lock used in
mq-deadline.

Thanks,
Paolo

> As you know no queue-wide locking happens during I/O in the scsi-mq core nor in the blk-mq core.

> 

> Bart.
Bart Van Assche Oct. 6, 2018, 4:20 p.m. | #36
On 10/5/18 11:46 PM, Paolo Valente wrote:
>> On 6 Oct 2018, at 05:12, Bart Van Assche <bvanassche@acm.org> wrote:
>> On 10/5/18 2:16 AM, Jan Kara wrote:
>>> On Thu 04-10-18 15:42:52, Bart Van Assche wrote:
>>>> What I think is missing is measurement results for BFQ on a system with
>>>> multiple CPU sockets and against a fast storage medium. Eliminating
>>>> the host lock from the SCSI core yielded a significant performance
>>>> improvement for such storage devices. Since the BFQ scheduler locks and
>>>> unlocks bfqd->lock for every dispatch operation it is very likely that BFQ
>>>> will slow down I/O for fast storage devices, even if their driver only
>>>> creates a single hardware queue.
>>> Well, I'm not sure why that is missing. I don't think anyone proposed to
>>> default to BFQ for such setup? Neither was anyone claiming that BFQ is
>>> better in such situation... The proposal has been: Default to BFQ for slow
>>> storage, leave it to deadline-mq otherwise.
>>
>> How do you define slow storage? The proposal at the start of this thread
>> was to make BFQ the default for all block devices that create a single
>> hardware queue. That includes all SATA storage since scsi-mq only creates
>> a single hardware queue when using the SATA protocol. The proposal to make
>> BFQ the default for systems with a single hard disk probably makes sense
>> but I am not sure that making BFQ the default for systems equipped with
>> one or more (SATA) SSDs is also a good idea. Especially for multi-socket
>> systems since BFQ reintroduces a queue-wide lock.
>
> No, BFQ has no queue-wide lock.  The very first change made to BFQ for
> porting it to blk-mq was to remove the queue lock.  Guided by Jens, I
> replaced that lock with the exact, same scheduler lock used in
> mq-deadline.


It's easy to see that both mq-deadline and BFQ define a queue-wide lock. 
For mq-deadline it's deadline_data.lock. For BFQ it's bfq_data.lock. That
last lock serializes all bfq_dispatch_request() calls and hence reduces 
concurrency while processing I/O requests. From bfq_dispatch_request():

static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
	[ ... ]
	spin_lock_irq(&bfqd->lock);
	[ ... ]
}

I think the above makes it very clear that bfqd->lock is queue-wide.

It is easy to understand why both I/O schedulers need a queue-wide lock: 
the only way to avoid race conditions when considering all pending I/O 
requests for scheduling decisions is to use a lock that covers all 
pending requests and hence that is queue-wide.
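
To illustrate the point in userspace terms, here is a toy model (not kernel
code) in which every insert and every dispatch has to take the one lock that
belongs to the scheduler instance, whichever hardware queue context the call
comes from. struct sched_data and the sched_*() functions are hypothetical,
and a C11 atomic_flag spinlock stands in for spin_lock_irq().

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_PENDING 8

/* Hypothetical stand-in for bfq_data or deadline_data: one per request queue. */
struct sched_data {
	atomic_flag lock;		/* the single per-scheduler lock */
	int pending[MAX_PENDING];	/* request ids waiting for dispatch (FIFO) */
	size_t head, tail;
};

static void sched_lock(struct sched_data *sd)
{
	while (atomic_flag_test_and_set_explicit(&sd->lock, memory_order_acquire))
		;	/* spin until the current holder releases the lock */
}

static void sched_unlock(struct sched_data *sd)
{
	atomic_flag_clear_explicit(&sd->lock, memory_order_release);
}

void sched_init(struct sched_data *sd)
{
	atomic_flag_clear(&sd->lock);
	sd->head = sd->tail = 0;
}

/* Queue a request; returns 1 on success, 0 if the scheduler is full. */
int sched_insert(struct sched_data *sd, int id)
{
	int ok = 0;

	sched_lock(sd);
	if (sd->tail - sd->head < MAX_PENDING) {
		sd->pending[sd->tail % MAX_PENDING] = id;
		sd->tail++;
		ok = 1;
	}
	sched_unlock(sd);
	return ok;
}

/* Pick the next request; all callers serialize on sd->lock. Returns -1 if empty. */
int sched_dispatch(struct sched_data *sd)
{
	int id = -1;

	sched_lock(sd);
	if (sd->head != sd->tail) {
		id = sd->pending[sd->head % MAX_PENDING];
		sd->head++;
	}
	sched_unlock(sd);
	return id;
}
```

Because the lock covers all pending requests, two CPUs calling
sched_dispatch() at the same time cannot make progress in parallel.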

Bart.
Paolo Valente Oct. 6, 2018, 4:46 p.m. | #37
> On 6 Oct 2018, at 18:20, Bart Van Assche <bvanassche@acm.org> wrote:

> 

> On 10/5/18 11:46 PM, Paolo Valente wrote:

>>> On 6 Oct 2018, at 05:12, Bart Van Assche <bvanassche@acm.org> wrote:

>>> On 10/5/18 2:16 AM, Jan Kara wrote:

>>>> On Thu 04-10-18 15:42:52, Bart Van Assche wrote:

>>>>> What I think is missing is measurement results for BFQ on a system with

>>>>> multiple CPU sockets and against a fast storage medium. Eliminating

>>>>> the host lock from the SCSI core yielded a significant performance

>>>>> improvement for such storage devices. Since the BFQ scheduler locks and

>>>>> unlocks bfqd->lock for every dispatch operation it is very likely that BFQ

>>>>> will slow down I/O for fast storage devices, even if their driver only

>>>>> creates a single hardware queue.

>>>> Well, I'm not sure why that is missing. I don't think anyone proposed to

>>>> default to BFQ for such setup? Neither was anyone claiming that BFQ is

>>>> better in such situation... The proposal has been: Default to BFQ for slow

>>>> storage, leave it to deadline-mq otherwise.

>>> 

>>> How do you define slow storage? The proposal at the start of this thread

>>> was to make BFQ the default for all block devices that create a single

>>> hardware queue. That includes all SATA storage since scsi-mq only creates

>>> a single hardware queue when using the SATA protocol. The proposal to make

>>> BFQ the default for systems with a single hard disk probably makes sense

>>> but I am not sure that making BFQ the default for systems equipped with

>>> one or more (SATA) SSDs is also a good idea. Especially for multi-socket

>>> systems since BFQ reintroduces a queue-wide lock.

>> No, BFQ has no queue-wide lock.  The very first change made to BFQ for

>> porting it to blk-mq was to remove the queue lock.  Guided by Jens, I

>> replaced that lock with the exact, same scheduler lock used in

>> mq-deadline.

> 

> It's easy to see that both mq-deadline and BFQ define a queue-wide lock. For mq-deadline its deadline_data.lock. For BFQ it's bfq_data.lock. That last lock serializes all bfq_dispatch_request() calls and hence reduces concurrency while processing I/O requests. From bfq_dispatch_request():

> 

> static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)

> {

> 	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;

> 	[ ... ]

> 	spin_lock_irq(&bfqd->lock);

> 	[ ... ]

> }

> 

> I think the above makes it very clear that bfqd->lock is queue-wide.

> 

> It is easy to understand why both I/O schedulers need a queue-wide lock: the only way to avoid race conditions when considering all pending I/O requests for scheduling decisions is to use a lock that covers all pending requests and hence that is queue-wide.

> 


Absolutely true.  "Queue lock" is evidently a very general concept, and
a lock on a scheduler is, in the end, a lock on its internal queue(s).
But the queue lock removed by blk-mq is not that small per-scheduler
lock; it is the big, single request-queue lock.  The cost of the
latter is probably almost one order of magnitude higher than that of
a scheduler lock, even with a non-trivial scheduler like BFQ.

As simple, concrete evidence of this, consider the numbers that I
already gave you, and that you can reproduce in five minutes: on a
laptop, BFQ can sustain up to 400 KIOPS.  Probably, even with just noop
as I/O scheduler, the same PC cannot process that many IOPS with legacy
blk (because of the single request-queue lock).

To sum up, in your argument you mixed two different locks.

Anyway, you are going very deep into this issue.  This takes you very
close to what I'm currently working on (still in a design phase):
increasing the parallel efficiency of BFQ, mainly by reducing the
duration of the pieces of BFQ executed under its scheduler lock.
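
As a generic illustration only: the sketch below is not BFQ code, and it is not the design mentioned above, which this thread does not describe. It shows the usual way to shorten a critical section, which is to do any work that needs no shared state before taking the lock. All toy_* names and the cost formula are made up.

```c
#include <stdatomic.h>

/* Hypothetical shared accounting state protected by one spinlock. */
struct toy_stats {
	atomic_flag lock;
	unsigned long total_cost;
	unsigned long nr_requests;
};

static void toy_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* spin until the holder releases the lock */
}

static void toy_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

void toy_stats_init(struct toy_stats *st)
{
	atomic_flag_clear(&st->lock);
	st->total_cost = 0;
	st->nr_requests = 0;
}

/* Stand-in for a costly computation that depends only on its argument. */
static unsigned long toy_cost(unsigned long sectors)
{
	return 8 + sectors;
}

/* Long critical section: the cost is computed while holding the lock. */
void toy_account_long(struct toy_stats *st, unsigned long sectors)
{
	toy_lock(&st->lock);
	st->total_cost += toy_cost(sectors);
	st->nr_requests++;
	toy_unlock(&st->lock);
}

/* Short critical section: only the shared updates run under the lock. */
void toy_account_short(struct toy_stats *st, unsigned long sectors)
{
	unsigned long cost = toy_cost(sectors);	/* no shared state touched */

	toy_lock(&st->lock);
	st->total_cost += cost;
	st->nr_requests++;
	toy_unlock(&st->lock);
}
```

Both variants leave the shared state identical; the second one just holds the lock for less time, which is what matters once several CPUs contend for it.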

But the goal of such a non-trivial improvement is to go from the
current 400 KIOPS to more than one million IOPS.  This is an
improvement that will most likely provide no benefit for probably 99%
of the systems with single-queue devices.  Those systems simply do not go
beyond 300 KIOPS.

So, I'm trying to first devote my limited single-person bandwidth
(sorry, I couldn't resist the temptation to joke about this growing
discussion on single-something issues :) ) to improvements that make
BFQ better within its current hardware scope.

Thanks,
Paolo

> Bart.

Patch

diff --git a/block/elevator.c b/block/elevator.c
index e18ac68626e3..e5a2c39eee7b 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -948,13 +948,15 @@  int elevator_switch_mq(struct request_queue *q,
 }
 
 /*
- * For blk-mq devices, we default to using mq-deadline, if available, for single
- * queue devices.  If deadline isn't available OR we have multiple queues,
- * default to "none".
+ * For blk-mq devices, we default to using:
+ * - "none" for multiqueue devices (nr_hw_queues != 1)
+ * - "bfq", if available, for single queue devices
+ * - "mq-deadline" if "bfq" is not available for single queue devices
+ * - "none" for single queue devices as well as last resort
  */
 int elevator_init_mq(struct request_queue *q)
 {
-	struct elevator_type *e;
+	struct elevator_type *e = NULL;
 	int err = 0;
 
 	if (q->nr_hw_queues != 1)
@@ -968,9 +970,14 @@  int elevator_init_mq(struct request_queue *q)
 	if (unlikely(q->elevator))
 		goto out_unlock;
 
-	e = elevator_get(q, "mq-deadline", false);
-	if (!e)
-		goto out_unlock;
+	if (IS_ENABLED(CONFIG_IOSCHED_BFQ))
+		e = elevator_get(q, "bfq", false);
+
+	if (!e) {
+		e = elevator_get(q, "mq-deadline", false);
+		if (!e)
+			goto out_unlock;
+	}
 
 	err = blk_mq_init_sched(q, e);
 	if (err)