mbox series

[v12,0/9] Implement copy offload support

Message ID 20230605121732.28468-1-nj.shetty@samsung.com
Headers show
Series Implement copy offload support | expand

Message

Nitesh Shetty June 5, 2023, 12:17 p.m. UTC
The patch series covers the points discussed in past and most recently
in LSFMM'23[0].
We have covered the initial agreed requirements in this patchset and
further additional features suggested by community.
Patchset borrows Mikulas's token based approach for 2 bdev implementation.

This is next iteration of our previous patchset v11[1].

Overall series supports:
========================
	1. Driver
		- NVMe Copy command (single NS, TP 4065), including support
		in nvme-target (for block and file backend).

	2. Block layer
		- Block-generic copy (REQ_COPY flag), with interface
		accommodating two block-devs
		- Emulation, for in-kernel user when offload is natively 
                absent
		- dm-linear support (for cases not requiring split)

	3. User-interface
		- copy_file_range

Testing
=======
	Copy offload can be tested on:
	a. QEMU: NVME simple copy (TP 4065). By setting nvme-ns
		parameters mssrl,mcl, msrc. For more info [2].
	b. Null block device
        c. NVMe Fabrics loopback.
	d. blktests[3] (tests block/034-037, nvme/050-053)

	Emulation can be tested on any device.

	fio[4].

Infra and plumbing:
===================
        We populate copy_file_range callback in def_blk_fops. 
        For devices that support copy-offload, use blkdev_copy_offload to
        achieve in-device copy.
        However for cases, where device doesn't support offload,
        fallback to generic_copy_file_range.
        For in-kernel users (like NVMe fabrics), we use blkdev_issue_copy
        which implements its own emulation, as fd is not available.
        Modify checks in generic_copy_file_range to support block-device.

Performance:
============
        The major benefit of this copy-offload/emulation framework is
        observed in fabrics setup, for copy workloads across the network.
        The host will send offload command over the network and actual copy
        can be achieved using emulation on the target.
        This results in higher performance and lower network consumption,
        as compared to read and write travelling across the network.
        With async-design of copy-offload/emulation we are able to see the
        following improvements as compared to userspace read + write on a
        NVMeOF TCP setup:

        Setup1: Network Speed: 1000Mb/s
        Host PC: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
        Target PC: AMD Ryzen 9 5900X 12-Core Processor
        block size 8k:
        710% improvement in IO BW (108 MiB/s to 876 MiB/s).
        Network utilisation drops from  97% to 15%.
        block-size 1M:
        2532% improvement in IO BW (101 MiB/s to 2659 MiB/s).
        Network utilisation drops from 89% to 0.62%.

        Setup2: Network Speed: 100Gb/s
        Server: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz, 72 cores
        (host and target have the same configuration)
        block-size 8k:
        17.5% improvement in IO BW (794 MiB/s to 933 MiB/s).
        Network utilisation drops from  6.75% to 0.16%.

Blktests[3]
======================
	tests/block/034,035: Runs copy offload and emulation on block
                              device.
	tests/block/035,036: Runs copy offload and emulation on null
                              block device.
        tests/nvme/050-053: Create a loop backed fabrics device and
                              run copy offload and emulation.

Future Work
===========
	- loopback device copy offload support
	- upstream fio to use copy offload
	- upstream blktest to test copy offload

	These are to be taken up after this minimal series is agreed upon.

Additional links:
=================
	[0] https://lore.kernel.org/linux-nvme/CA+1E3rJ7BZ7LjQXXTdX+-0Edz=zT14mmPGMiVCzUgB33C60tbQ@mail.gmail.com/
            https://lore.kernel.org/linux-nvme/f0e19ae4-b37a-e9a3-2be7-a5afb334a5c3@nvidia.com/
            https://lore.kernel.org/linux-nvme/20230113094648.15614-1-nj.shetty@samsung.com/
	[1] https://lore.kernel.org/all/20230522104146.2856-1-nj.shetty@samsung.com/
	[2] https://qemu-project.gitlab.io/qemu/system/devices/nvme.html#simple-copy
	[3] https://github.com/nitesh-shetty/blktests/tree/feat/copy_offload/v12
	[4] https://github.com/vincentkfu/fio/tree/copyoffload-3.35-v12

Changes since v11:
=================
        - Documentation: Improved documentation (Damien Le Moal)
        - block,nvme: ssize_t return values (Darrick J. Wong)
        - block: token is allocated to SECTOR_SIZE (Matthew Wilcox)
        - block: mem leak fix (Maurizio Lombardi)

Changes since v10:
=================
        - NVMeOF: optimization in NVMe fabrics (Chaitanya Kulkarni)
        - NVMeOF: sparse warnings (kernel test robot)

Changes since v9:
=================
        - null_blk, improved documentation, minor fixes(Chaitanya Kulkarni)
        - fio, expanded testing and minor fixes (Vincent Fu)

Changes since v8:
=================
        - null_blk, copy_max_bytes_hw is made config fs parameter
          (Damien Le Moal)
        - Negative error handling in copy_file_range (Christian Brauner)
        - minor fixes, better documentation (Damien Le Moal)
        - fio upgraded to 3.34 (Vincent Fu)

Changes since v7:
=================
        - null block copy offload support for testing (Damien Le Moal)
        - adding direct flag check for copy offload to block device,
	  as we are using generic_copy_file_range for cached cases.
        - Minor fixes

Changes since v6:
=================
        - copy_file_range instead of ioctl for direct block device
        - Remove support for multi range (vectored) copy
        - Remove ioctl interface for copy.
        - Remove offload support in dm kcopyd.

Changes since v5:
=================
	- Addition of blktests (Chaitanya Kulkarni)
        - Minor fix for fabrics file backed path
        - Remove buggy zonefs copy file range implementation.

Changes since v4:
=================
	- make the offload and emulation design asynchronous (Hannes
	  Reinecke)
	- fabrics loopback support
	- sysfs naming improvements (Damien Le Moal)
	- use kfree() instead of kvfree() in cio_await_completion
	  (Damien Le Moal)
	- use ranges instead of rlist to represent range_entry (Damien
	  Le Moal)
	- change argument ordering in blk_copy_offload suggested (Damien
	  Le Moal)
	- removed multiple copy limit and merged into only one limit
	  (Damien Le Moal)
	- wrap overly long lines (Damien Le Moal)
	- other naming improvements and cleanups (Damien Le Moal)
	- correctly format the code example in description (Damien Le
	  Moal)
	- mark blk_copy_offload as static (kernel test robot)
	
Changes since v3:
=================
	- added copy_file_range support for zonefs
	- added documentation about new sysfs entries
	- incorporated review comments on v3
	- minor fixes

Changes since v2:
=================
	- fixed possible race condition reported by Damien Le Moal
	- new sysfs controls as suggested by Damien Le Moal
	- fixed possible memory leak reported by Dan Carpenter, lkp
	- minor fixes

Changes since v1:
=================
	- sysfs documentation (Greg KH)
        - 2 bios for copy operation (Bart Van Assche, Mikulas Patocka,
          Martin K. Petersen, Douglas Gilbert)
        - better payload design (Darrick J. Wong)

Nitesh Shetty (9):
  block: Introduce queue limits for copy-offload support
  block: Add copy offload support infrastructure
  block: add emulation for copy
  fs, block: copy_file_range for def_blk_ops for direct block device
  nvme: add copy offload support
  nvmet: add copy command support for bdev and file ns
  dm: Add support for copy offload
  dm: Enable copy offload for dm-linear target
  null_blk: add support for copy offload

 Documentation/ABI/stable/sysfs-block |  33 ++
 Documentation/block/null_blk.rst     |   5 +
 block/blk-lib.c                      | 445 +++++++++++++++++++++++++++
 block/blk-map.c                      |   4 +-
 block/blk-settings.c                 |  24 ++
 block/blk-sysfs.c                    |  63 ++++
 block/blk.h                          |   2 +
 block/fops.c                         |  20 ++
 drivers/block/null_blk/main.c        | 108 ++++++-
 drivers/block/null_blk/null_blk.h    |   8 +
 drivers/md/dm-linear.c               |   1 +
 drivers/md/dm-table.c                |  41 +++
 drivers/md/dm.c                      |   7 +
 drivers/nvme/host/constants.c        |   1 +
 drivers/nvme/host/core.c             | 103 ++++++-
 drivers/nvme/host/fc.c               |   5 +
 drivers/nvme/host/nvme.h             |   7 +
 drivers/nvme/host/pci.c              |  27 +-
 drivers/nvme/host/rdma.c             |   7 +
 drivers/nvme/host/tcp.c              |  16 +
 drivers/nvme/host/trace.c            |  19 ++
 drivers/nvme/target/admin-cmd.c      |   9 +-
 drivers/nvme/target/io-cmd-bdev.c    |  62 ++++
 drivers/nvme/target/io-cmd-file.c    |  52 ++++
 drivers/nvme/target/loop.c           |   6 +
 drivers/nvme/target/nvmet.h          |   1 +
 fs/read_write.c                      |   7 +-
 include/linux/blk_types.h            |  25 ++
 include/linux/blkdev.h               |  23 ++
 include/linux/device-mapper.h        |   5 +
 include/linux/nvme.h                 |  43 ++-
 include/uapi/linux/fs.h              |   6 +
 32 files changed, 1166 insertions(+), 19 deletions(-)


base-commit: 9ca10bfb8aa8fbf19ee22e702c8cf9b66ea73a54

Comments

Christoph Hellwig June 5, 2023, 1:43 p.m. UTC | #1
>  		break;
>  	case REQ_OP_READ:
> -		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
> +		if (unlikely(req->cmd_flags & REQ_COPY))
> +			nvme_setup_copy_read(ns, req);
> +		else
> +			ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
>  		break;
>  	case REQ_OP_WRITE:
> -		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
> +		if (unlikely(req->cmd_flags & REQ_COPY))
> +			ret = nvme_setup_copy_write(ns, req, cmd);
> +		else
> +			ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);

Yikes.  Overloading REQ_OP_READ and REQ_OP_WRITE with something entirely
different brings us back the horrors of the block layer 15 years ago.
Don't do that.  Please add separate REQ_COPY_IN/OUT (or maybe
SEND/RECEIVE or whatever) methods.

> +	/* setting copy limits */
> +	if (blk_queue_flag_test_and_set(QUEUE_FLAG_COPY, q))

I don't understand this comment.

> +struct nvme_copy_token {
> +	char *subsys;
> +	struct nvme_ns *ns;
> +	sector_t src_sector;
> +	sector_t sectors;
> +};

Why do we need a subsys token?  Inter-namespace copy is pretty crazy,
and not really anything we should aim for.  But this whole token design
is pretty odd anyway.  The only thing we'd need is a sequence number /
idr / etc to find an input and output side match up, as long as we
stick to the proper namespace scope.

> +	if (unlikely((req->cmd_flags & REQ_COPY) &&
> +				(req_op(req) == REQ_OP_READ))) {
> +		blk_mq_start_request(req);
> +		return BLK_STS_OK;
> +	}

This really needs to be hiden inside of nvme_setup_cmd.  And given
that other drivers might need similar handling the best way is probably
to have a new magic BLK_STS_* value for request started but we're
not actually sending it to hardware.
Christoph Hellwig June 7, 2023, 7:12 a.m. UTC | #2
On Tue, Jun 06, 2023 at 05:05:35PM +0530, Nitesh Shetty wrote:
> Downside will be duplicating checks which are present for read, write in
> block layer, device-mapper and zoned devices.
> But we can do this, shouldn't be an issue.

Yes.  Please never overload operations, this is just causing problems
everywhere, and that why I split the operations from the flag a few
years ago.

> The idea behind subsys is to prevent copy across different subsystem.
> For example, copy across nvme subsystem and the scsi subsystem. [1]
> At present, we don't support inter-namespace(copy across NVMe namespace),
> but after community feedback for previous series we left scope for it.

Never leave scope for something that isn't actually added.  That just
creates a giant maintainance nightmare.  Cross-device copies are giant
nightmare in general, and in the case of NVMe completely unusable
as currently done in the working group.  Messing up something that
is entirely reasonable (local copy) for something like that is a sure
way to never get this series in.
Martin K. Petersen June 8, 2023, 1:36 a.m. UTC | #3
Christoph,

> Yikes. Overloading REQ_OP_READ and REQ_OP_WRITE with something
> entirely different brings us back the horrors of the block layer 15
> years ago. Don't do that. Please add separate REQ_COPY_IN/OUT (or
> maybe SEND/RECEIVE or whatever) methods.

I agree, I used REQ_COPY_IN and REQ_COPY_OUT in my original series.

>> +	/* setting copy limits */
>> +	if (blk_queue_flag_test_and_set(QUEUE_FLAG_COPY, q))
>
> I don't understand this comment.
>
>> +struct nvme_copy_token {
>> +	char *subsys;
>> +	struct nvme_ns *ns;
>> +	sector_t src_sector;
>> +	sector_t sectors;
>> +};
>
> Why do we need a subsys token? Inter-namespace copy is pretty crazy,
> and not really anything we should aim for. But this whole token design
> is pretty odd anyway. The only thing we'd need is a sequence number /
> idr / etc to find an input and output side match up, as long as we
> stick to the proper namespace scope.

Yeah, I don't think we need to carry this in a token. Doing the sanity
check up front in blkdev_copy_offload() should be fine. For NVMe it's
not currently possible to copy across and for SCSI we'd just make sure
the copy scope is the same for the two block devices before we even
issue the operations.
Christoph Hellwig June 9, 2023, 4:24 a.m. UTC | #4
On Thu, Jun 08, 2023 at 05:38:17PM +0530, Nitesh Shetty wrote:
> Sure, we can do away with subsys and realign more on single namespace copy.
> We are planning to use token to store source info, such as src sector,
> len and namespace. Something like below,
> 
> struct nvme_copy_token {
> 	struct nvme_ns *ns; // to make sure we are copying within same namespace
> /* store source info during *IN operation, will be used by *OUT operation */
> 	sector_t src_sector;
> 	sector_t sectors;
> };
> Do you have any better way to handle this in mind ?

In general every time we tried to come up with a request payload that is
not just data passed to the device it has been a nightmare.

So my gut feeling would be that bi_sector and bi_iter.bi_size are the
ranges, with multiple bios being allowed to form the input data, similar
to how we implement discard merging.

The interesting part is how we'd match up these bios.  One idea would
be that since copy by definition doesn't need integrity data we just
add a copy_id that unions it, and use a simple per-gendisk copy I/D
allocator, but I'm not entirely sure how well that interacts stacking
drivers.
Nitesh Shetty July 10, 2023, 6:14 a.m. UTC | #5
On 23/06/08 09:24PM, Christoph Hellwig wrote:
>On Thu, Jun 08, 2023 at 05:38:17PM +0530, Nitesh Shetty wrote:
>> Sure, we can do away with subsys and realign more on single namespace copy.
>> We are planning to use token to store source info, such as src sector,
>> len and namespace. Something like below,
>>
>> struct nvme_copy_token {
>> 	struct nvme_ns *ns; // to make sure we are copying within same namespace
>> /* store source info during *IN operation, will be used by *OUT operation */
>> 	sector_t src_sector;
>> 	sector_t sectors;
>> };
>> Do you have any better way to handle this in mind ?
>
>In general every time we tried to come up with a request payload that is
>not just data passed to the device it has been a nightmare.
>
>So my gut feeling would be that bi_sector and bi_iter.bi_size are the
>ranges, with multiple bios being allowed to form the input data, similar
>to how we implement discard merging.
>
>The interesting part is how we'd match up these bios.  One idea would
>be that since copy by definition doesn't need integrity data we just
>add a copy_id that unions it, and use a simple per-gendisk copy I/D
>allocator, but I'm not entirely sure how well that interacts stacking
>drivers.

V13[1] implements that route. Please see if that matches with what you had
in mind?

[1] https://lore.kernel.org/linux-nvme/20230627183629.26571-1-nj.shetty@samsung.com/

Thank you, 
Nitesh Shetty