[RFC,00/53] netfs, afs, cifs: Delegate high-level I/O to netfslib


Message ID

20231013160423.2218093-1-dhowells@redhat.com

Headers

From: David Howells <dhowells@redhat.com>
To: Jeff Layton <jlayton@kernel.org>, Steve French <smfrench@gmail.com>
Cc: David Howells <dhowells@redhat.com>,
        Matthew Wilcox <willy@infradead.org>,
        Marc Dionne <marc.dionne@auristor.com>,
        Paulo Alcantara <pc@manguebit.com>,
        Shyam Prasad N <sprasad@microsoft.com>,
        Tom Talpey <tom@talpey.com>,
        Dominique Martinet <asmadeus@codewreck.org>,
        Ilya Dryomov <idryomov@gmail.com>,
        Christian Brauner <christian@brauner.io>,
        linux-afs@lists.infradead.org, linux-cifs@vger.kernel.org,
        linux-nfs@vger.kernel.org, ceph-devel@vger.kernel.org,
        v9fs@lists.linux.dev, linux-fsdevel@vger.kernel.org,
        linux-mm@kvack.org, netdev@vger.kernel.org,
        linux-kernel@vger.kernel.org
Subject: [RFC PATCH 00/53] netfs, afs,
 cifs: Delegate high-level I/O to netfslib
Date: Fri, 13 Oct 2023 17:03:29 +0100
Message-ID: <20231013160423.2218093-1-dhowells@redhat.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

netfs, afs, cifs: Delegate high-level I/O to netfslib

Message

David Howells Oct. 13, 2023, 4:03 p.m. UTC

Hi Jeff, Steve,

I have been working on my netfslib helpers to the point that I can run
xfstests on AFS to completion (both with write-back buffering and, with a
small patch, write-through buffering in the pagecache).  I can also run a
certain amount of xfstests on CIFS, though that requires some more
debugging.  However, this seems like a good time to post a preview of the
patches.

The patches remove a little over 800 lines from AFS and over 2000 from
CIFS, albeit with around 3000 lines added to netfs.  Hopefully, I will be
able to remove a bunch of lines from 9P and Ceph too.

The main aims of these patches are to get high-level I/O and knowledge of
the pagecache out of the filesystem drivers as much as possible and to get
rid, as much as possible, of the knowledge that pages/folios exist.

Further, I would like to see ->write_begin, ->write_end and ->launder_folio
go away.

These patches add the following features to what netfslib already
provides:

 (1) NFS-style (and Ceph-style) locking around DIO vs buffered I/O calls to
     prevent these from happening at the same time.  mmap'd I/O can, of
     necessity, happen at any time ignoring these locks.

 (2) Support for unbuffered I/O.  The data is kept in the bounce buffer and
     the pagecache is not used.  This can be turned on with an inode flag.

 (3) Support for direct I/O.  This is basically unbuffered I/O with some
     extra restrictions and no RMW.

 (4) Support for using a bounce buffer in an operation.  The bounce buffer
     may be bigger than the target data/buffer, allowing for crypto
     rounding.

 (5) Support for content encryption.  This isn't supported yet by AFS/CIFS
     but is aimed initially at Ceph.

 (6) ->write_begin() and ->write_end() are ignored in favour of merging all
     of that into one function, netfs_perform_write(), thereby avoiding the
     function pointer traversals.

 (7) Support for write-through caching in the pagecache.
     netfs_perform_write() adds the pages it modifies to an I/O operation
     as it goes and directly marks them writeback rather than dirty.  When
     writing back from write-through, it limits the range written back.
     This should allow CIFS to deal with byte-range mandatory locks
     correctly.

 (8) O_*SYNC and RWF_*SYNC writes use write-through rather than writing to
     the pagecache and then flushing afterwards.  An AIO O_*SYNC write will
     notify of completion when the sub-writes all complete.

 (9) Support for write-streaming where modified data is held in !uptodate
     folios, with a private struct attached indicating the range that is
     valid.

(10) Support for write grouping, multiplexing a pointer to a group in the
     folio private data with the write-streaming data.  The writepages
     algorithm only writes stuff back that's in the nominated group.  This
     is intended for use by Ceph to write its snaps in order.

(11) Skipping reads for which we know the server could only supply zeros or
     EOF (for instance if we've done a local write that leaves a hole in
     the file and extends the local inode size).
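
Item (11) can be illustrated with a small user-space sketch.  It assumes a
single zero_point offset, like the netfs_inode field of that name in this
series, at or above which the server is assumed to hold no data.  The
function split_read_at_zero_point() and struct read_plan are hypothetical
names used only for illustration; they are not part of netfslib, and the
sketch does not model i_size or EOF handling.

```c
#include <stddef.h>

/* Hypothetical result type: how a read of [pos, pos + len) is satisfied. */
struct read_plan {
	size_t from_server;	/* Bytes that must be fetched from the server */
	size_t zero_fill;	/* Bytes that can be cleared locally instead */
};

/*
 * Split a read at zero_point.  Everything at or above zero_point is assumed
 * to be absent on the server, so it is zero-filled without a network read.
 */
static struct read_plan split_read_at_zero_point(unsigned long long pos,
						 size_t len,
						 unsigned long long zero_point)
{
	struct read_plan plan = { 0, 0 };

	if (pos >= zero_point) {
		/* Entirely above the zero point: no server I/O needed. */
		plan.zero_fill = len;
	} else if (pos + len <= zero_point) {
		/* Entirely below the zero point: read it all. */
		plan.from_server = len;
	} else {
		/* Straddles the zero point: read the head, clear the tail. */
		plan.from_server = (size_t)(zero_point - pos);
		plan.zero_fill = len - plan.from_server;
	}
	return plan;
}
```

A read that lies wholly above zero_point therefore needs no round trip to
the server at all, which is the case a local hole-creating write produces.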


General notes:

 (1) netfslib now makes use of folio->private, which means the filesystem
     can't use it.

 (2) Use of fscache is not yet tested.  I'm not sure whether to allow a
     cache to be used with a write-through write.

 (3) The filesystem provides wrappers to call the write helpers, allowing
     it to do pre-validation, oplock/capability fetching and the passing in
     of write group info.

 (4) I want to try flushing the data when tearing down an inode, before
     invalidating it, to render launder_folio unnecessary.

 (5) Write-through caching will generate and dispatch write subrequests as
     it gathers enough data to hit wsize and has whole pages that at least
     span that size.  This needs to be a bit more flexible, allowing for a
     filesystem such as CIFS to have a variable wsize.

 (6) The filesystem driver is just given read and write calls with an
     iov_iter describing the data/buffer to use.  Ideally, they don't see
     pages or folios at all.  A function, extract_iter_to_sg(), is already
     available to decant part of an iterator into a scatterlist for crypto
     purposes.
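
Note (5) can be sketched as a simple accumulator.  This is only a model
under stated assumptions: wsize is fixed, the caller only reports data that
already covers whole pages, and any remainder is flushed as a final short
subrequest at the end of the write.  The names struct wt_accum, wt_gather()
and wt_flush() are hypothetical and do not appear in the patches.

```c
#include <stddef.h>

/* Hypothetical accumulator for one write-through operation. */
struct wt_accum {
	size_t gathered;	 /* Whole-page bytes not yet dispatched */
	size_t wsize;		 /* Maximum size of one write subrequest */
	unsigned int nr_subreqs; /* Subrequests dispatched so far */
};

/*
 * Account for another len bytes of whole pages and dispatch as many
 * wsize-sized subrequests as are now covered.  Returns the number of
 * subrequests dispatched by this call.
 */
static unsigned int wt_gather(struct wt_accum *acc, size_t len)
{
	unsigned int n = 0;

	acc->gathered += len;
	while (acc->wsize && acc->gathered >= acc->wsize) {
		acc->gathered -= acc->wsize;
		acc->nr_subreqs++;
		n++;
	}
	return n;
}

/* At the end of the write, send whatever is left as a short subrequest. */
static unsigned int wt_flush(struct wt_accum *acc)
{
	if (!acc->gathered)
		return 0;
	acc->gathered = 0;
	acc->nr_subreqs++;
	return 1;
}
```

A variable wsize, as CIFS would need, would mean re-evaluating the limit
for each subrequest rather than holding it in the accumulator.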


CIFS notes:

 (1) CIFS is made to use unbuffered I/O for unbuffered caching modes and
     write-through caching for cache=strict.

 (2) cifs_init_request() occasionally throws an error that it can't get a
     writable file when trying to do writeback.

 (3) Apparent file corruption frequently appears in the target file when
     cifs_copy_file_range() is used, even though it doesn't use any netfslib
     helpers and even if it doesn't overlap with any pages in the
     pagecache.

 (4) I should be able to turn multipage folio support on in CIFS now.

 (5) The then-unused CIFS code is removed in three patches, not one, to
     stop the git patch generator from producing confusing patches in
     which it thinks code is being moved around rather than just being
     removed.

David

David Howells (53):
  netfs: Add a procfile to list in-progress requests
  netfs: Track the fpos above which the server has no data
  netfs: Note nonblockingness in the netfs_io_request struct
  netfs: Allow the netfs to make the io (sub)request alloc larger
  netfs: Add a ->free_subrequest() op
  afs: Don't use folio->private to record partial modification
  netfs: Provide invalidate_folio and release_folio calls
  netfs: Add rsize to netfs_io_request
  netfs: Implement unbuffered/DIO vs buffered I/O locking
  netfs: Add iov_iters to (sub)requests to describe various buffers
  netfs: Add support for DIO buffering
  netfs: Provide tools to create a buffer in an xarray
  netfs: Add bounce buffering support
  netfs: Add func to calculate pagecount/size-limited span of an
    iterator
  netfs: Limit subrequest by size or number of segments
  netfs: Export netfs_put_subrequest() and some tracepoints
  netfs: Extend the netfs_io_*request structs to handle writes
  netfs: Add a hook to allow tell the netfs to update its i_size
  netfs: Make netfs_put_request() handle a NULL pointer
  fscache: Add a function to begin an cache op from a netfslib request
  netfs: Make the refcounting of netfs_begin_read() easier to use
  netfs: Prep to use folio->private for write grouping and streaming
    write
  netfs: Dispatch write requests to process a writeback slice
  netfs: Provide func to copy data to pagecache for buffered write
  netfs: Make netfs_read_folio() handle streaming-write pages
  netfs: Allocate multipage folios in the writepath
  netfs: Implement support for unbuffered/DIO read
  netfs: Implement unbuffered/DIO write support
  netfs: Implement buffered write API
  netfs: Allow buffered shared-writeable mmap through
    netfs_page_mkwrite()
  netfs: Provide netfs_file_read_iter()
  netfs: Provide a writepages implementation
  netfs: Provide minimum blocksize parameter
  netfs: Make netfs_skip_folio_read() take account of blocksize
  netfs: Perform content encryption
  netfs: Decrypt encrypted content
  netfs: Support decryption on ubuffered/DIO read
  netfs: Support encryption on Unbuffered/DIO write
  netfs: Provide a launder_folio implementation
  netfs: Implement a write-through caching option
  netfs: Rearrange netfs_io_subrequest to put request pointer first
  afs: Use the netfs write helpers
  cifs: Replace cifs_readdata with a wrapper around netfs_io_subrequest
  cifs: Share server EOF pos with netfslib
  cifs: Replace cifs_writedata with a wrapper around netfs_io_subrequest
  cifs: Use more fields from netfs_io_subrequest
  cifs: Make wait_mtu_credits take size_t args
  cifs: Implement netfslib hooks
  cifs: Move cifs_loose_read_iter() and cifs_file_write_iter() to file.c
  cifs: Cut over to using netfslib
  cifs: Remove some code that's no longer used, part 1
  cifs: Remove some code that's no longer used, part 2
  cifs: Remove some code that's no longer used, part 3

 fs/9p/vfs_addr.c             |   51 +-
 fs/afs/file.c                |  206 +--
 fs/afs/inode.c               |   15 +-
 fs/afs/internal.h            |   66 +-
 fs/afs/write.c               |  816 +---------
 fs/ceph/addr.c               |   28 +-
 fs/ceph/cache.h              |   12 -
 fs/fscache/io.c              |   42 +
 fs/netfs/Makefile            |    9 +-
 fs/netfs/buffered_read.c     |  245 ++-
 fs/netfs/buffered_write.c    | 1223 ++++++++++++++
 fs/netfs/crypto.c            |  148 ++
 fs/netfs/direct_read.c       |  263 +++
 fs/netfs/direct_write.c      |  359 +++++
 fs/netfs/internal.h          |  121 ++
 fs/netfs/io.c                |  325 +++-
 fs/netfs/iterator.c          |   97 ++
 fs/netfs/locking.c           |  209 +++
 fs/netfs/main.c              |  101 ++
 fs/netfs/misc.c              |  237 +++
 fs/netfs/objects.c           |   64 +-
 fs/netfs/output.c            |  485 ++++++
 fs/netfs/stats.c             |   22 +-
 fs/smb/client/Kconfig        |    1 +
 fs/smb/client/cifsfs.c       |   65 +-
 fs/smb/client/cifsfs.h       |   10 +-
 fs/smb/client/cifsglob.h     |   59 +-
 fs/smb/client/cifsproto.h    |   10 +-
 fs/smb/client/cifssmb.c      |  111 +-
 fs/smb/client/file.c         | 2905 ++++++----------------------------
 fs/smb/client/fscache.c      |  109 --
 fs/smb/client/fscache.h      |   54 -
 fs/smb/client/inode.c        |   25 +-
 fs/smb/client/smb2ops.c      |   20 +-
 fs/smb/client/smb2pdu.c      |  168 +-
 fs/smb/client/smb2proto.h    |    5 +-
 fs/smb/client/trace.h        |  144 +-
 fs/smb/client/transport.c    |   17 +-
 include/linux/fscache.h      |    6 +
 include/linux/netfs.h        |  173 +-
 include/trace/events/afs.h   |   31 -
 include/trace/events/netfs.h |  158 +-
 42 files changed, 5136 insertions(+), 4079 deletions(-)
 create mode 100644 fs/netfs/buffered_write.c
 create mode 100644 fs/netfs/crypto.c
 create mode 100644 fs/netfs/direct_read.c
 create mode 100644 fs/netfs/direct_write.c
 create mode 100644 fs/netfs/locking.c
 create mode 100644 fs/netfs/misc.c
 create mode 100644 fs/netfs/output.c

Comments

Jeff Layton Oct. 16, 2023, 3:54 p.m. UTC | #1

On Fri, 2023-10-13 at 17:03 +0100, David Howells wrote:
> Add an rsize parameter to netfs_io_request to be filled in by the network
> filesystem when the request is initialised.  This indicates the maximum
> size of a read request that the netfs will honour in that region.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Jeff Layton <jlayton@kernel.org>
> cc: linux-cachefs@redhat.com
> cc: linux-fsdevel@vger.kernel.org
> cc: linux-mm@kvack.org
> ---
>  fs/afs/file.c         | 1 +
>  fs/ceph/addr.c        | 2 ++
>  include/linux/netfs.h | 1 +
>  3 files changed, 4 insertions(+)
> 
> diff --git a/fs/afs/file.c b/fs/afs/file.c
> index 3fea5cd8ef13..3d2e1913ea27 100644
> --- a/fs/afs/file.c
> +++ b/fs/afs/file.c
> @@ -360,6 +360,7 @@ static int afs_symlink_read_folio(struct file *file, struct folio *folio)
>  static int afs_init_request(struct netfs_io_request *rreq, struct file *file)
>  {
>  	rreq->netfs_priv = key_get(afs_file_key(file));
> +	rreq->rsize = 4 * 1024 * 1024;
>  	return 0;
>  }
>  
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index ced19ff08988..92a5ddcd9a76 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -419,6 +419,8 @@ static int ceph_init_request(struct netfs_io_request *rreq, struct file *file)
>  	struct ceph_netfs_request_data *priv;
>  	int ret = 0;
>  
> +	rreq->rsize = 1024 * 1024;
> +

Holy magic numbers, batman! I think this deserves a comment that
explains how you came up with these values.

Also, do 9p and cifs not need this for some reason?

>  	if (rreq->origin != NETFS_READAHEAD)
>  		return 0;
>  
> diff --git a/include/linux/netfs.h b/include/linux/netfs.h
> index daa431c4148d..02e888c170da 100644
> --- a/include/linux/netfs.h
> +++ b/include/linux/netfs.h
> @@ -188,6 +188,7 @@ struct netfs_io_request {
>  	struct list_head	subrequests;	/* Contributory I/O operations */
>  	void			*netfs_priv;	/* Private data for the netfs */
>  	unsigned int		debug_id;
> +	unsigned int		rsize;		/* Maximum read size (0 for none) */
>  	atomic_t		nr_outstanding;	/* Number of ops in progress */
>  	atomic_t		nr_copy_ops;	/* Number of copy-to-cache ops in progress */
>  	size_t			submitted;	/* Amount submitted for I/O so far */
>
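
As a rough illustration of what the rsize field in the patch quoted above
implies, the following user-space sketch limits each read subrequest to
rsize and counts how many pieces a read would be split into.  It follows the
field comment in treating 0 as "no limit".  The helper names
clamp_to_rsize() and count_subrequests() are hypothetical, and netfslib's
actual slicing may differ.

```c
#include <stddef.h>

/*
 * Limit the length of one read subrequest to rsize.  An rsize of 0 means
 * "no limit", following the field comment in the patch.
 */
static size_t clamp_to_rsize(size_t remaining, unsigned int rsize)
{
	if (rsize != 0 && remaining > rsize)
		return rsize;
	return remaining;
}

/* Count how many subrequests a read of total bytes would be split into. */
static size_t count_subrequests(size_t total, unsigned int rsize)
{
	size_t n = 0;

	while (total > 0) {
		total -= clamp_to_rsize(total, rsize);
		n++;
	}
	return n;
}
```

With the values in the patch, a 10MiB read would be issued as three
subrequests on afs (4MiB limit) and ten on ceph (1MiB limit).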

Jeff Layton Oct. 16, 2023, 3:56 p.m. UTC | #2

On Fri, 2023-10-13 at 17:03 +0100, David Howells wrote:
> Borrow NFS's direct-vs-buffered I/O locking into netfslib.  Similar code is
> also used in ceph.
> 
> Modify it to have the correct checker annotations for i_rwsem lock
> acquisition/release and to return -ERESTARTSYS if waits are interrupted.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Jeff Layton <jlayton@kernel.org>
> cc: linux-cachefs@redhat.com
> cc: linux-fsdevel@vger.kernel.org
> cc: linux-mm@kvack.org
> ---
>  fs/netfs/Makefile     |   1 +
>  fs/netfs/locking.c    | 209 ++++++++++++++++++++++++++++++++++++++++++
>  include/linux/netfs.h |  10 ++
>  3 files changed, 220 insertions(+)
>  create mode 100644 fs/netfs/locking.c
> 
> diff --git a/fs/netfs/Makefile b/fs/netfs/Makefile
> index cd22554d9048..647ce1935674 100644
> --- a/fs/netfs/Makefile
> +++ b/fs/netfs/Makefile
> @@ -4,6 +4,7 @@ netfs-y := \
>  	buffered_read.o \
>  	io.o \
>  	iterator.o \
> +	locking.o \
>  	main.o \
>  	misc.o \
>  	objects.o
> diff --git a/fs/netfs/locking.c b/fs/netfs/locking.c
> new file mode 100644
> index 000000000000..fecca8ea6322
> --- /dev/null
> +++ b/fs/netfs/locking.c
> @@ -0,0 +1,209 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * I/O and data path helper functionality.
> + *
> + * Borrowed from NFS Copyright (c) 2016 Trond Myklebust
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/netfs.h>
> +
> +/*
> + * inode_dio_wait_interruptible - wait for outstanding DIO requests to finish
> + * @inode: inode to wait for
> + *
> + * Waits for all pending direct I/O requests to finish so that we can
> + * proceed with a truncate or equivalent operation.
> + *
> + * Must be called under a lock that serializes taking new references
> + * to i_dio_count, usually by inode->i_mutex.
> + */
> +static int inode_dio_wait_interruptible(struct inode *inode)
> +{
> +	if (!atomic_read(&inode->i_dio_count))
> +		return 0;
> +
> +	wait_queue_head_t *wq = bit_waitqueue(&inode->i_state, __I_DIO_WAKEUP);
> +	DEFINE_WAIT_BIT(q, &inode->i_state, __I_DIO_WAKEUP);
> +
> +	for (;;) {
> +		prepare_to_wait(wq, &q.wq_entry, TASK_INTERRUPTIBLE);
> +		if (!atomic_read(&inode->i_dio_count))
> +			break;
> +		if (signal_pending(current))
> +			break;
> +		schedule();
> +	}
> +	finish_wait(wq, &q.wq_entry);
> +
> +	return atomic_read(&inode->i_dio_count) ? -ERESTARTSYS : 0;
> +}
> +
> +/* Call with exclusively locked inode->i_rwsem */
> +static int netfs_block_o_direct(struct netfs_inode *ictx)
> +{
> +	if (!test_bit(NETFS_ICTX_ODIRECT, &ictx->flags))
> +		return 0;
> +	clear_bit(NETFS_ICTX_ODIRECT, &ictx->flags);
> +	return inode_dio_wait_interruptible(&ictx->inode);
> +}
> +
> +/**
> + * netfs_start_io_read - declare the file is being used for buffered reads
> + * @inode: file inode
> + *
> + * Declare that a buffered read operation is about to start, and ensure
> + * that we block all direct I/O.
> + * On exit, the function ensures that the NETFS_ICTX_ODIRECT flag is unset,
> + * and holds a shared lock on inode->i_rwsem to ensure that the flag
> + * cannot be changed.
> + * In practice, this means that buffered read operations are allowed to
> + * execute in parallel, thanks to the shared lock, whereas direct I/O
> + * operations need to wait to grab an exclusive lock in order to set
> + * NETFS_ICTX_ODIRECT.
> + * Note that buffered writes and truncates both take a write lock on
> + * inode->i_rwsem, meaning that those are serialised w.r.t. the reads.
> + */
> +int netfs_start_io_read(struct inode *inode)
> +	__acquires(inode->i_rwsem)
> +{
> +	struct netfs_inode *ictx = netfs_inode(inode);
> +
> +	/* Be an optimist! */
> +	if (down_read_interruptible(&inode->i_rwsem) < 0)
> +		return -ERESTARTSYS;
> +	if (test_bit(NETFS_ICTX_ODIRECT, &ictx->flags) == 0)
> +		return 0;
> +	up_read(&inode->i_rwsem);
> +
> +	/* Slow path.... */
> +	if (down_write_killable(&inode->i_rwsem) < 0)
> +		return -ERESTARTSYS;
> +	if (netfs_block_o_direct(ictx) < 0) {
> +		up_write(&inode->i_rwsem);
> +		return -ERESTARTSYS;
> +	}
> +	downgrade_write(&inode->i_rwsem);
> +	return 0;
> +}
> +
> +/**
> + * netfs_end_io_read - declare that the buffered read operation is done
> + * @inode: file inode
> + *
> + * Declare that a buffered read operation is done, and release the shared
> + * lock on inode->i_rwsem.
> + */
> +void netfs_end_io_read(struct inode *inode)
> +	__releases(inode->i_rwsem)
> +{
> +	up_read(&inode->i_rwsem);
> +}
> +
> +/**
> + * netfs_start_io_write - declare the file is being used for buffered writes
> + * @inode: file inode
> + *
> + * Declare that a buffered write operation is about to start, and ensure
> + * that we block all direct I/O.
> + */
> +int netfs_start_io_write(struct inode *inode)
> +	__acquires(inode->i_rwsem)
> +{
> +	struct netfs_inode *ictx = netfs_inode(inode);
> +
> +	if (down_write_killable(&inode->i_rwsem) < 0)
> +		return -ERESTARTSYS;
> +	if (netfs_block_o_direct(ictx) < 0) {
> +		up_write(&inode->i_rwsem);
> +		return -ERESTARTSYS;
> +	}
> +	return 0;
> +}
> +
> +/**
> + * netfs_end_io_write - declare that the buffered write operation is done
> + * @inode: file inode
> + *
> + * Declare that a buffered write operation is done, and release the
> + * lock on inode->i_rwsem.
> + */
> +void netfs_end_io_write(struct inode *inode)
> +	__releases(inode->i_rwsem)
> +{
> +	up_write(&inode->i_rwsem);
> +}
> +
> +/* Call with exclusively locked inode->i_rwsem */
> +static int netfs_block_buffered(struct inode *inode)
> +{
> +	struct netfs_inode *ictx = netfs_inode(inode);
> +	int ret;
> +
> +	if (!test_bit(NETFS_ICTX_ODIRECT, &ictx->flags)) {
> +		set_bit(NETFS_ICTX_ODIRECT, &ictx->flags);
> +		if (inode->i_mapping->nrpages != 0) {
> +			unmap_mapping_range(inode->i_mapping, 0, 0, 0);
> +			ret = filemap_fdatawait(inode->i_mapping);
> +			if (ret < 0) {
> +				clear_bit(NETFS_ICTX_ODIRECT, &ictx->flags);
> +				return ret;
> +			}
> +		}
> +	}
> +	return 0;
> +}
> +
> +/**
> + * netfs_start_io_direct - declare the file is being used for direct i/o
> + * @inode: file inode
> + *
> + * Declare that a direct I/O operation is about to start, and ensure
> + * that we block all buffered I/O.
> + * On exit, the function ensures that the NETFS_ICTX_ODIRECT flag is set,
> + * and holds a shared lock on inode->i_rwsem to ensure that the flag
> + * cannot be changed.
> + * In practice, this means that direct I/O operations are allowed to
> + * execute in parallel, thanks to the shared lock, whereas buffered I/O
> + * operations need to wait to grab an exclusive lock in order to clear
> + * NETFS_ICTX_ODIRECT.
> + * Note that buffered writes and truncates both take a write lock on
> + * inode->i_rwsem, meaning that those are serialised w.r.t. O_DIRECT.
> + */
> +int netfs_start_io_direct(struct inode *inode)
> +	__acquires(inode->i_rwsem)
> +{
> +	struct netfs_inode *ictx = netfs_inode(inode);
> +	int ret;
> +
> +	/* Be an optimist! */
> +	if (down_read_interruptible(&inode->i_rwsem) < 0)
> +		return -ERESTARTSYS;
> +	if (test_bit(NETFS_ICTX_ODIRECT, &ictx->flags) != 0)
> +		return 0;
> +	up_read(&inode->i_rwsem);
> +
> +	/* Slow path.... */
> +	if (down_write_killable(&inode->i_rwsem) < 0)
> +		return -ERESTARTSYS;
> +	ret = netfs_block_buffered(inode);
> +	if (ret < 0) {
> +		up_write(&inode->i_rwsem);
> +		return ret;
> +	}
> +	downgrade_write(&inode->i_rwsem);
> +	return 0;
> +}
> +
> +/**
> + * netfs_end_io_direct - declare that the direct i/o operation is done
> + * @inode: file inode
> + *
> + * Declare that a direct I/O operation is done, and release the shared
> + * lock on inode->i_rwsem.
> + */
> +void netfs_end_io_direct(struct inode *inode)
> +	__releases(inode->i_rwsem)
> +{
> +	up_read(&inode->i_rwsem);
> +}
> diff --git a/include/linux/netfs.h b/include/linux/netfs.h
> index 02e888c170da..33d4487a91e9 100644
> --- a/include/linux/netfs.h
> +++ b/include/linux/netfs.h
> @@ -131,6 +131,8 @@ struct netfs_inode {
>  	loff_t			remote_i_size;	/* Size of the remote file */
>  	loff_t			zero_point;	/* Size after which we assume there's no data
>  						 * on the server */
> +	unsigned long		flags;
> +#define NETFS_ICTX_ODIRECT	0		/* The file has DIO in progress */
>  };
>  
>  /*
> @@ -315,6 +317,13 @@ ssize_t netfs_extract_user_iter(struct iov_iter *orig, size_t orig_len,
>  				struct iov_iter *new,
>  				iov_iter_extraction_t extraction_flags);
>  
> +int netfs_start_io_read(struct inode *inode);
> +void netfs_end_io_read(struct inode *inode);
> +int netfs_start_io_write(struct inode *inode);
> +void netfs_end_io_write(struct inode *inode);
> +int netfs_start_io_direct(struct inode *inode);
> +void netfs_end_io_direct(struct inode *inode);
> +
>  /**
>   * netfs_inode - Get the netfs inode context from the inode
>   * @inode: The inode to query
> @@ -341,6 +350,7 @@ static inline void netfs_inode_init(struct netfs_inode *ctx,
>  	ctx->ops = ops;
>  	ctx->remote_i_size = i_size_read(&ctx->inode);
>  	ctx->zero_point = ctx->remote_i_size;
> +	ctx->flags = 0;
>  #if IS_ENABLED(CONFIG_FSCACHE)
>  	ctx->cache = NULL;
>  #endif
> 

It's nice to see this go into common code, but why not go ahead and
convert ceph (and possibly NFS) to use this? Is there any reason not to?
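
The fast-path/slow-path shape of netfs_start_io_read() and
netfs_start_io_direct() in the patch quoted above can be modelled in a few
lines of single-threaded user-space C.  This is only a toy: struct toy_inode
and its counters stand in for i_rwsem and NETFS_ICTX_ODIRECT, nothing here
actually blocks, and the draining that the real slow path performs
(inode_dio_wait_interruptible() or filemap_fdatawait()) is merely counted.
All names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the inode: a mode flag plus a toy rwsem. */
struct toy_inode {
	bool odirect;		/* Models NETFS_ICTX_ODIRECT */
	int readers;		/* Shared holders of the toy lock */
	bool writer;		/* Exclusive holder of the toy lock */
	int slow_paths;		/* How often the exclusive slow path ran */
};

/*
 * Enter I/O in the wanted mode (true = direct, false = buffered read).
 * On return the flag matches the mode and one shared hold has been taken.
 */
static void toy_start_io(struct toy_inode *inode, bool want_direct)
{
	/* Fast path: take the lock shared and check the flag. */
	inode->readers++;
	if (inode->odirect == want_direct)
		return;
	inode->readers--;

	/* Slow path: take it exclusively, flip the flag, then downgrade. */
	inode->writer = true;
	inode->odirect = want_direct;
	inode->slow_paths++;
	inode->writer = false;
	inode->readers++;
}

/* Leave I/O: drop the shared hold. */
static void toy_end_io(struct toy_inode *inode)
{
	inode->readers--;
}
```

Repeated operations of the same kind stay on the fast path and share the
lock; only a switch between buffered and direct I/O pays for the exclusive
section.  Buffered writes are not modelled, as they hold the lock
exclusively throughout.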

David Howells Oct. 16, 2023, 4:09 p.m. UTC | #3

Jeff Layton <jlayton@kernel.org> wrote:

> It's nice to see this go into common code, but why not go ahead and
> convert ceph (and possibly NFS) to use this? Is there any reason not to?

I'm converting ceph on a follow-on branch, and this will be dealt with
there.

I could do NFS round about here, I suppose.

David

David Howells Oct. 16, 2023, 4:19 p.m. UTC | #4

Jeff Layton <jlayton@kernel.org> wrote:

> > +	rreq->rsize = 4 * 1024 * 1024;
> >  	return 0;
> ...
> > +	rreq->rsize = 1024 * 1024;
> > +
> 
> Holy magic numbers, batman! I think this deserves a comment that
> explains how you came up with these values.

Actually, that should be set to something like the object size for ceph.

> Also, do 9p and cifs not need this for some reason?

At this point, cifs doesn't use netfslib, so that's implemented in a later
patch in this series.

9p does need setting, but I haven't tested that yet.  It probably needs
setting to 1MiB as I think that's the maximum the 9p transport can handle.

But in the case of cifs, this is actually dynamic, depending on how many
credits we can obtain.  The same may be true of ceph, though I'm not entirely
clear on that as yet.

For afs, the maximum [rw]size the protocol supports is actually something like
281350422593565 (i.e. (65535-28) * (2^32-1)) minus a few bytes, but using that
is probably not a good idea.  It might be best to set it to something like
256KiB, as that's what OpenAFS uses.

David
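
The afs figure quoted in the last message can be checked with a few lines of
C.  The two factors are taken from the message as written; what each one
represents in the protocol is not stated there, and the function name
afs_quoted_max_size() is hypothetical.

```c
#include <stdint.h>

/*
 * Upper bound quoted in the thread: (65535 - 28) * (2^32 - 1).
 * 64-bit arithmetic is needed, as the product does not fit in 32 bits.
 */
static uint64_t afs_quoted_max_size(void)
{
	const uint64_t a = 65535 - 28;		/* 65507 */
	const uint64_t b = UINT32_MAX;		/* 2^32 - 1 */

	return a * b;
}
```

The product is 281350422593565, a little under 256TiB, which is far larger
than the 256KiB suggested as a practical setting.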