Message ID: 20240529050507.1392041-3-hch@lst.de
State:      Superseded
Series:     [01/12] ubd: untagle discard vs write zeroes not support handling
On 5/29/24 14:04, Christoph Hellwig wrote:
> The soft max_sectors limit is normally capped by the hardware limits and
> an arbitrary upper limit enforced by the kernel, but can be modified by
> the user. A few drivers want to increase this limit (nbd, rbd) or
> adjust it up or down based on hardware capabilities (sd).
>
> Change blk_validate_limits to default max_sectors to the optimal I/O
> size, or upgrade it to the preferred minimal I/O size if that is
> larger than the kernel default if no optimal I/O size is provided based
> on the logic in the SD driver.
>
> This keeps the existing kernel default for drivers that do not provide
> an io_opt or very big io_min value, but picks a much more useful
> default for those who provide these hints, and allows to remove the
> hacks to set the user max_sectors limit in nbd, rbd and sd.
>
> Note that rd picks a different value for the optimal I/O size vs the
> user max_sectors value, so this is a bit of a behavior change that
> could use careful review from people familiar with rbd.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks OK to me.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
On 5/28/24 22:04, Christoph Hellwig wrote:
> The soft max_sectors limit is normally capped by the hardware limits and
> an arbitrary upper limit enforced by the kernel, but can be modified by
> the user. A few drivers want to increase this limit (nbd, rbd) or
> adjust it up or down based on hardware capabilities (sd).
>
> Change blk_validate_limits to default max_sectors to the optimal I/O
> size, or upgrade it to the preferred minimal I/O size if that is
> larger than the kernel default if no optimal I/O size is provided based
> on the logic in the SD driver.
>
> This keeps the existing kernel default for drivers that do not provide
> an io_opt or very big io_min value, but picks a much more useful
> default for those who provide these hints, and allows to remove the
> hacks to set the user max_sectors limit in nbd, rbd and sd.
>
> Note that rd picks a different value for the optimal I/O size vs the
            ^^ rbd?
> user max_sectors value, so this is a bit of a behavior change that
> could use careful review from people familiar with rbd.

Anyway:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
On Wed, May 29, 2024 at 7:05 AM Christoph Hellwig <hch@lst.de> wrote:
>
> The soft max_sectors limit is normally capped by the hardware limits and
> an arbitrary upper limit enforced by the kernel, but can be modified by
> the user. A few drivers want to increase this limit (nbd, rbd) or
> adjust it up or down based on hardware capabilities (sd).
>
> Change blk_validate_limits to default max_sectors to the optimal I/O
> size, or upgrade it to the preferred minimal I/O size if that is
> larger than the kernel default if no optimal I/O size is provided based
> on the logic in the SD driver.
>
> This keeps the existing kernel default for drivers that do not provide
> an io_opt or very big io_min value, but picks a much more useful
> default for those who provide these hints, and allows to remove the
> hacks to set the user max_sectors limit in nbd, rbd and sd.
>
> Note that rd picks a different value for the optimal I/O size vs the
> user max_sectors value, so this is a bit of a behavior change that
> could use careful review from people familiar with rbd.

Hi Christoph,

For rbd, this change effectively lowers max_sectors from 4M to 64K or
less and that is definitely not desirable. From previous interactions
with users we want max_sectors to match max_hw_sectors -- this has come
up a quite a few times over the years. Some people just aren't aware
of the soft cap and the fact that it's adjustable and get frustrated
over the time poured into debugging their iostat numbers for workloads
that can send object (set) size I/Os.

Looking at the git history, we lowered io_opt from objset_bytes to
opts->alloc_size in commit [1], but I guess io_opt was lowered just
along for the ride. What that commit was concerned with is really
discard_granularity and to a smaller extent io_min.

How much difference does io_opt make in the real world? If what rbd
does stands in the way of a tree-wide cleanup, I would much rather bump
io_opt back to objset_bytes (i.e. what max_user_sectors is currently
set to).

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=16d80c54ad42c573a897ae7bcf5a9816be54e6fe

Thanks,

                Ilya
On Thu, May 30, 2024 at 09:48:06PM +0200, Ilya Dryomov wrote:
> For rbd, this change effectively lowers max_sectors from 4M to 64K or
> less and that is definitely not desirable. From previous interactions
> with users we want max_sectors to match max_hw_sectors -- this has come
> up a quite a few times over the years. Some people just aren't aware
> of the soft cap and the fact that it's adjustable and get frustrated
> over the time poured into debugging their iostat numbers for workloads
> that can send object (set) size I/Os.
>
> Looking at the git history, we lowered io_opt from objset_bytes to
> opts->alloc_size in commit [1], but I guess io_opt was lowered just
> along for the ride. What that commit was concerned with is really
> discard_granularity and to a smaller extent io_min.
>
> How much difference does io_opt make in the real world? If what rbd
> does stands in the way of a tree-wide cleanup, I would much rather bump
> io_opt back to objset_bytes (i.e. what max_user_sectors is currently
> set to).

The only existing in-kernel usage is to set the readahead size.
Based on your comments I seems like we should revert io_opt to
objset to ->alloc_size in a prep patch?
On Fri, May 31, 2024 at 7:54 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Thu, May 30, 2024 at 09:48:06PM +0200, Ilya Dryomov wrote:
> > For rbd, this change effectively lowers max_sectors from 4M to 64K or
> > less and that is definitely not desirable. From previous interactions
> > with users we want max_sectors to match max_hw_sectors -- this has come
> > up a quite a few times over the years. Some people just aren't aware
> > of the soft cap and the fact that it's adjustable and get frustrated
> > over the time poured into debugging their iostat numbers for workloads
> > that can send object (set) size I/Os.
> >
> > Looking at the git history, we lowered io_opt from objset_bytes to
> > opts->alloc_size in commit [1], but I guess io_opt was lowered just
> > along for the ride. What that commit was concerned with is really
> > discard_granularity and to a smaller extent io_min.
> >
> > How much difference does io_opt make in the real world? If what rbd
> > does stands in the way of a tree-wide cleanup, I would much rather bump
> > io_opt back to objset_bytes (i.e. what max_user_sectors is currently
> > set to).
>
> The only existing in-kernel usage is to set the readahead size.
> Based on your comments I seems like we should revert io_opt to
> objset to ->alloc_size in a prep patch?

We should revert io_opt from opts->alloc_size to objset_bytes (I think
it's what you meant to say but typoed).

How do you want to handle it? I can put together a patch, send it to
ceph-devel and it will be picked by linux-next sometime next week. Then
this patch would grow a contextual conflict and the description would
need to be updated to not mention a behavior change for rbd anymore.

Thanks,

                Ilya
On Fri, May 31, 2024 at 08:48:12AM +0200, Ilya Dryomov wrote:
> We should revert io_opt from opts->alloc_size to objset_bytes (I think
> it's what you meant to say but typoed).

Yes, sorry.

> How do you want to handle it? I can put together a patch, send it to
> ceph-devel and it will be picked by linux-next sometime next week. Then
> this patch would grow a contextual conflict and the description would
> need to be updated to not mention a behavior change for rbd anymore.

If that works for you I'd much prefer to include it with this series.
I'd be happy to write it myself.
diff --git a/block/blk-settings.c b/block/blk-settings.c
index effeb9a639bb45..a49abdb3554834 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -153,6 +153,13 @@ static int blk_validate_limits(struct queue_limits *lim)
 		if (lim->max_user_sectors < PAGE_SIZE / SECTOR_SIZE)
 			return -EINVAL;
 		lim->max_sectors = min(max_hw_sectors, lim->max_user_sectors);
+	} else if (lim->io_opt) {
+		lim->max_sectors =
+			min(max_hw_sectors, lim->io_opt >> SECTOR_SHIFT);
+	} else if (lim->io_min &&
+		   lim->io_min > (BLK_DEF_MAX_SECTORS_CAP << SECTOR_SHIFT)) {
+		lim->max_sectors =
+			min(max_hw_sectors, lim->io_min >> SECTOR_SHIFT);
 	} else {
 		lim->max_sectors = min(max_hw_sectors, BLK_DEF_MAX_SECTORS_CAP);
 	}
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 22a79a62cc4eab..ad887d614d5b3f 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -1808,7 +1808,7 @@ static struct nbd_device *nbd_dev_add(int index, unsigned int refs)
 {
 	struct queue_limits lim = {
 		.max_hw_sectors = 65536,
-		.max_user_sectors = 256,
+		.io_opt = 256 << SECTOR_SHIFT,
 		.max_segments = USHRT_MAX,
 		.max_segment_size = UINT_MAX,
 	};
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 26ff5cd2bf0abc..05096172f334a3 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -4954,7 +4954,6 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
 		rbd_dev->layout.object_size * rbd_dev->layout.stripe_count;
 	struct queue_limits lim = {
 		.max_hw_sectors = objset_bytes >> SECTOR_SHIFT,
-		.max_user_sectors = objset_bytes >> SECTOR_SHIFT,
 		.io_min = rbd_dev->opts->alloc_size,
 		.io_opt = rbd_dev->opts->alloc_size,
 		.max_segments = USHRT_MAX,
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index f6c822c9cbd2d3..3dff9150ce11e2 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3593,7 +3593,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
 	struct request_queue *q = sdkp->disk->queue;
 	sector_t old_capacity = sdkp->capacity;
 	unsigned char *buffer;
-	unsigned int dev_max, rw_max;
+	unsigned int dev_max;
 
 	SCSI_LOG_HLQUEUE(3, sd_printk(KERN_INFO, sdkp,
 				      "sd_revalidate_disk\n"));
@@ -3675,34 +3675,15 @@ static int sd_revalidate_disk(struct gendisk *disk)
 	else
 		blk_queue_io_min(sdkp->disk->queue, 0);
 
-	if (sd_validate_opt_xfer_size(sdkp, dev_max)) {
-		q->limits.io_opt = logical_to_bytes(sdp, sdkp->opt_xfer_blocks);
-		rw_max = logical_to_sectors(sdp, sdkp->opt_xfer_blocks);
-	} else {
-		q->limits.io_opt = 0;
-		rw_max = min_not_zero(logical_to_sectors(sdp, dev_max),
-				      (sector_t)BLK_DEF_MAX_SECTORS_CAP);
-	}
-
 	/*
 	 * Limit default to SCSI host optimal sector limit if set. There may be
 	 * an impact on performance for when the size of a request exceeds this
 	 * host limit.
 	 */
-	rw_max = min_not_zero(rw_max, sdp->host->opt_sectors);
-
-	/* Do not exceed controller limit */
-	rw_max = min(rw_max, queue_max_hw_sectors(q));
-
-	/*
-	 * Only update max_sectors if previously unset or if the current value
-	 * exceeds the capabilities of the hardware.
-	 */
-	if (sdkp->first_scan ||
-	    q->limits.max_sectors > q->limits.max_dev_sectors ||
-	    q->limits.max_sectors > q->limits.max_hw_sectors) {
-		q->limits.max_sectors = rw_max;
-		q->limits.max_user_sectors = rw_max;
+	q->limits.io_opt = sdp->host->opt_sectors << SECTOR_SHIFT;
+	if (sd_validate_opt_xfer_size(sdkp, dev_max)) {
+		q->limits.io_opt = min_not_zero(q->limits.io_opt,
+				logical_to_bytes(sdp, sdkp->opt_xfer_blocks));
 	}
 
 	sdkp->first_scan = 0;
The soft max_sectors limit is normally capped by the hardware limits and
an arbitrary upper limit enforced by the kernel, but can be modified by
the user. A few drivers want to increase this limit (nbd, rbd) or
adjust it up or down based on hardware capabilities (sd).

Change blk_validate_limits to default max_sectors to the optimal I/O
size, or upgrade it to the preferred minimal I/O size if that is
larger than the kernel default if no optimal I/O size is provided based
on the logic in the SD driver.

This keeps the existing kernel default for drivers that do not provide
an io_opt or very big io_min value, but picks a much more useful
default for those who provide these hints, and allows to remove the
hacks to set the user max_sectors limit in nbd, rbd and sd.

Note that rbd picks a different value for the optimal I/O size vs the
user max_sectors value, so this is a bit of a behavior change that
could use careful review from people familiar with rbd.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-settings.c |  7 +++++++
 drivers/block/nbd.c  |  2 +-
 drivers/block/rbd.c  |  1 -
 drivers/scsi/sd.c    | 29 +++++------------------------
 4 files changed, 13 insertions(+), 26 deletions(-)