[v6,0/9] vfs/nfsd: clean up handling of i_version counter

Message ID

20220930111840.10695-1-jlayton@kernel.org

Headers

From: Jeff Layton <jlayton@kernel.org>
To: tytso@mit.edu, adilger.kernel@dilger.ca, djwong@kernel.org,
        david@fromorbit.com, trondmy@hammerspace.com, neilb@suse.de,
        viro@zeniv.linux.org.uk, zohar@linux.ibm.com, xiubli@redhat.com,
        chuck.lever@oracle.com, lczerner@redhat.com, jack@suse.cz,
        bfields@fieldses.org, brauner@kernel.org, fweimer@redhat.com
Cc: linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
        linux-kernel@vger.kernel.org, ceph-devel@vger.kernel.org,
        linux-ext4@vger.kernel.org, linux-nfs@vger.kernel.org,
        linux-xfs@vger.kernel.org
Subject: [PATCH v6 0/9] vfs/nfsd: clean up handling of i_version counter
Date: Fri, 30 Sep 2022 07:18:31 -0400
Message-Id: <20220930111840.10695-1-jlayton@kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

vfs/nfsd: clean up handling of i_version counter

Message

Jeff Layton Sept. 30, 2022, 11:18 a.m. UTC

v6: add support for STATX_ATTR_VERSION_MONOTONIC
    add patch to expose i_version counter to userland
    patches to update times/version after copying write data

An updated set of i_version handling changes! I've dropped the earlier
ext4 patches since Ted has picked up the relevant ext4 ones.

This set is based on linux-next, to make sure we don't collide with the
statx DIO alignment patches, and some other i_version cleanups that are
in flight.  I'm hoping those land in 6.1.

There are a few changes since v5, mostly centered around adding
STATX_ATTR_VERSION_MONOTONIC. I've also re-added the patch to expose
STATX_VERSION to userland via statx. What I'm proposing should now
(mostly) conform to the semantics I laid out in the manpage patch I
sent recently [1].

Finally, I've added two patches to make __generic_file_write_iter and
ext4 update the c/mtime after copying file data instead of before, which
Neil pointed out makes for better cache-coherency handling. Those should
take care of ext4 and tmpfs. xfs and btrfs will need to make the same
changes.
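
To make the ordering problem concrete, here is a rough userspace model of
it, not kernel code. The names struct model_inode, struct client_cache and
write_with_racing_reader() are hypothetical and exist only for this
illustration. The reader is forced to land between the version bump and
the data copy, so nothing here actually races.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of an inode: file contents plus a change counter. */
struct model_inode {
	char data[16];
	uint64_t version;
};

/* What a client keeps after READ+GETATTR: the data and version it saw. */
struct client_cache {
	char data[16];
	uint64_t version;
};

static void client_read_getattr(const struct model_inode *ino,
				struct client_cache *c)
{
	memcpy(c->data, ino->data, sizeof(c->data));
	c->version = ino->version;
}

/* The client revalidates by comparing versions only. */
static bool client_cache_looks_valid(const struct model_inode *ino,
				     const struct client_cache *c)
{
	return c->version == ino->version;
}

static bool client_cache_is_stale(const struct model_inode *ino,
				  const struct client_cache *c)
{
	return memcmp(c->data, ino->data, sizeof(c->data)) != 0;
}

/*
 * Perform one write with a reader landing just before the data copy.
 * bump_after == false models bumping the version before the copy;
 * bump_after == true models bumping it after the copy.
 * Returns true if the client ends up holding stale data that it
 * nevertheless believes is current.
 */
static bool write_with_racing_reader(bool bump_after)
{
	struct model_inode ino = { .data = "old", .version = 1 };
	struct client_cache cache;

	assert(sizeof("new") <= sizeof(ino.data));

	if (!bump_after)
		ino.version++;
	client_read_getattr(&ino, &cache);	/* reader lands here */
	strcpy(ino.data, "new");		/* the data copy */
	if (bump_after)
		ino.version++;

	return client_cache_looks_valid(&ino, &cache) &&
	       client_cache_is_stale(&ino, &cache);
}
```

In this model, write_with_racing_reader(false) leaves the client with old
data tagged with the new version, and nothing later invalidates it.
write_with_racing_reader(true) leaves the client with an old version that
no longer matches, so it would refetch.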

One thing I'm not sure of is what we should do if update_times fails
after an otherwise successful write. Should we just ignore that and move
on (and maybe WARN)? Return an error? Set a writeback error? What's the
right recourse there?

I'd like to go ahead and get the first 6 patches from this series into
linux-next fairly soon, so if anyone has objections, please speak up!

[1]: https://lore.kernel.org/linux-nfs/20220928134200.28741-1-jlayton@kernel.org/T/#u

Jeff Layton (9):
  iversion: move inode_query_iversion to libfs.c
  iversion: clarify when the i_version counter must be updated
  vfs: plumb i_version handling into struct kstat
  nfs: report the inode version in getattr if requested
  ceph: report the inode version in getattr if requested
  nfsd: use the getattr operation to fetch i_version
  vfs: expose STATX_VERSION to userland
  vfs: update times after copying data in __generic_file_write_iter
  ext4: update times after I/O in write codepaths

 fs/ceph/inode.c           | 16 +++++++++----
 fs/ext4/file.c            | 20 +++++++++++++---
 fs/libfs.c                | 36 +++++++++++++++++++++++++++++
 fs/nfs/export.c           |  7 ------
 fs/nfs/inode.c            | 10 ++++++--
 fs/nfsd/nfs4xdr.c         |  4 +++-
 fs/nfsd/nfsfh.c           | 40 ++++++++++++++++++++++++++++++++
 fs/nfsd/nfsfh.h           | 29 +----------------------
 fs/nfsd/vfs.h             |  7 +++++-
 fs/stat.c                 |  7 ++++++
 include/linux/exportfs.h  |  1 -
 include/linux/iversion.h  | 48 ++++++++-------------------------------
 include/linux/stat.h      |  2 +-
 include/uapi/linux/stat.h |  6 +++--
 mm/filemap.c              | 17 ++++++++++----
 samples/vfs/test-statx.c  |  8 +++++--
 16 files changed, 163 insertions(+), 95 deletions(-)

Comments

Amir Goldstein Oct. 2, 2022, 7:08 a.m. UTC | #1

On Fri, Sep 30, 2022 at 2:30 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> The c/mtime and i_version currently get updated before the data is
> copied (or a DIO write is issued), which is problematic for NFS.
>
> READ+GETATTR can race with a write (even a local one) in such a way as
> to make the client associate the state of the file with the wrong change
> attribute. That association can persist indefinitely if the file sees no
> further changes.
>
> Move the setting of times to the bottom of the function in
> __generic_file_write_iter and only update it if something was
> successfully written.
>

This solution is wrong for several reasons:

1. There is still file_update_time() in ->page_mkwrite() so you haven't
    solved the problem completely
2. The other side of the coin is that post-crash state is more likely to end
    up with data changes without an mtime/ctime change

If I read the problem description correctly, then a solution that invalidates
the NFS cache before AND after the write would be acceptable. Right?
Would an extra i_version bump after the write solve the race?

> If the time update fails, log a warning once, but don't fail the write.
> All of the existing callers use update_time functions that don't fail,
> so we should never trip this.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  mm/filemap.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 15800334147b..72c0ceb75176 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3812,10 +3812,6 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>         if (err)
>                 goto out;
>
> -       err = file_update_time(file);
> -       if (err)
> -               goto out;
> -
>         if (iocb->ki_flags & IOCB_DIRECT) {
>                 loff_t pos, endbyte;
>
> @@ -3868,6 +3864,19 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>                         iocb->ki_pos += written;
>         }
>  out:
> +       if (written > 0) {
> +               err = file_update_time(file);
> +               /*
> +                * There isn't much we can do at this point if updating the
> +                * times fails after a successful write. The times and i_version
> +                * should still be updated in the inode, and it should still be
> +                * marked dirty, so hopefully the next inode update will catch it.
> +                * Log a warning once so we have a record that something untoward
> +                * has occurred.
> +                */
> +               WARN_ONCE(err, "Failed to update m/ctime after write: %ld\n", err);

pr_warn_once() please - this is not a programming assertion.

Thanks,
Amir.

Jeff Layton Oct. 3, 2022, 1:01 p.m. UTC | #2

On Sun, 2022-10-02 at 10:08 +0300, Amir Goldstein wrote:
> On Fri, Sep 30, 2022 at 2:30 PM Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > The c/mtime and i_version currently get updated before the data is
> > copied (or a DIO write is issued), which is problematic for NFS.
> > 
> > READ+GETATTR can race with a write (even a local one) in such a way as
> > to make the client associate the state of the file with the wrong change
> > attribute. That association can persist indefinitely if the file sees no
> > further changes.
> > 
> > Move the setting of times to the bottom of the function in
> > __generic_file_write_iter and only update it if something was
> > successfully written.
> > 
> 
> This solution is wrong for several reasons:
> 
> 1. There is still file_update_time() in ->page_mkwrite() so you haven't
>     solved the problem completely

Right. I don't think there is a way to solve the problem vs. mmap.
Userland can write to a writeable mmap'ed page at any time and we'd
never know. We have to specifically carve out mmap as an exception here.
I'll plan to add something to the manpage patch for this.

> 2. The other side of the coin is that post-crash state is more likely to end
>     up with data changes without an mtime/ctime change
> 

Is this really something filesystems rely on? I suppose the danger is
that some cached data gets written to disk before the write returns and
the inode on disk never gets updated.

But...isn't that a danger now? Some of the cached data could get written
out and the updated inode just never makes it to disk before a crash
(AFAIU). I'm not sure that this increases our exposure to that problem.

> If I read the problem description correctly, then a solution that invalidates
> the NFS cache before AND after the write would be acceptable. Right?
> Would an extra i_version bump after the write solve the race?
> 

I based this patch on Neil's assertion that updating the time before an
operation was pointless if we were going to do it afterward. The NFS
client only really cares about seeing it change after a write.

Doing both would be fine from a correctness standpoint, and in most
cases, the second would be a no-op anyway since a query would have to
race in between the two for that to happen.

FWIW, I think we should update the m/ctime and version at the same time.
If the version changes, then there is always the potential that a timer
tick has occurred. So, that would translate to a second call to
file_update_time in here.

The downside of bumping the times/version both before and after is that
these are hot codepaths, and we'd be adding extra operations there. Even
in the case where nothing has changed, we'd have to call
inode_needs_update_time a second time for every write. Is that worth the
cost?

> > If the time update fails, log a warning once, but don't fail the write.
> > All of the existing callers use update_time functions that don't fail,
> > so we should never trip this.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  mm/filemap.c | 17 +++++++++++++----
> >  1 file changed, 13 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 15800334147b..72c0ceb75176 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -3812,10 +3812,6 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >         if (err)
> >                 goto out;
> > 
> > -       err = file_update_time(file);
> > -       if (err)
> > -               goto out;
> > -
> >         if (iocb->ki_flags & IOCB_DIRECT) {
> >                 loff_t pos, endbyte;
> > 
> > @@ -3868,6 +3864,19 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >                         iocb->ki_pos += written;
> >         }
> >  out:
> > +       if (written > 0) {
> > +               err = file_update_time(file);
> > +               /*
> > +                * There isn't much we can do at this point if updating the
> > +                * times fails after a successful write. The times and i_version
> > +                * should still be updated in the inode, and it should still be
> > +                * marked dirty, so hopefully the next inode update will catch it.
> > +                * Log a warning once so we have a record that something untoward
> > +                * has occurred.
> > +                */
> > +               WARN_ONCE(err, "Failed to update m/ctime after write: %ld\n", err);
> 
> pr_warn_once() please - this is not a programming assertion.
> 

ACK. I'll change that.

Jeff Layton Oct. 5, 2022, 4:40 p.m. UTC | #3

On Tue, 2022-10-04 at 09:56 +1100, NeilBrown wrote:
> On Tue, 04 Oct 2022, Amir Goldstein wrote:
> > On Mon, Oct 3, 2022 at 4:01 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > 
> > > On Sun, 2022-10-02 at 10:08 +0300, Amir Goldstein wrote:
> > > > On Fri, Sep 30, 2022 at 2:30 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > > > 
> > > > > The c/mtime and i_version currently get updated before the data is
> > > > > copied (or a DIO write is issued), which is problematic for NFS.
> > > > > 
> > > > > READ+GETATTR can race with a write (even a local one) in such a way as
> > > > > to make the client associate the state of the file with the wrong change
> > > > > attribute. That association can persist indefinitely if the file sees no
> > > > > further changes.
> > > > > 
> > > > > Move the setting of times to the bottom of the function in
> > > > > __generic_file_write_iter and only update it if something was
> > > > > successfully written.
> > > > > 
> > > > 
> > > > This solution is wrong for several reasons:
> > > > 
> > > > 1. There is still file_update_time() in ->page_mkwrite() so you haven't
> > > >     solved the problem completely
> > > 
> > > Right. I don't think there is a way to solve the problem vs. mmap.
> > > Userland can write to a writeable mmap'ed page at any time and we'd
> > > never know. We have to specifically carve out mmap as an exception here.
> > > I'll plan to add something to the manpage patch for this.
> > > 
> > > > 2. The other side of the coin is that post-crash state is more likely to end
> > > >     up with data changes without an mtime/ctime change
> > > > 
> > > 
> > > Is this really something filesystems rely on? I suppose the danger is
> > > that some cached data gets written to disk before the write returns and
> > > the inode on disk never gets updated.
> > > 
> > > But...isn't that a danger now? Some of the cached data could get written
> > > out and the updated inode just never makes it to disk before a crash
> > > (AFAIU). I'm not sure that this increases our exposure to that problem.
> > > 
> > > 
> > 
> > You are correct that that danger exists, but it only exists for overwriting
> > to allocated blocks.
> > 
> > For writing to new blocks, mtime change is recorded in transaction
> > before the block mapping is recorded in transaction so there is no
> > danger in this case (before your patch).
> > 
> > Also, observing size change without observing mtime change
> > after crash seems like a very bad outcome that may be possible
> > after your change.
> > 
> > These are just a few cases that I could think of, they may be filesystem
> > dependent, but my gut feeling is that if you remove the time update before
> > the operation, that has been like that forever, a lot of s#!t is going to float
> > for various filesystems and applications.
> > 
> > And it is not one of those things that are discovered during rc or even
> > stable kernel testing - they are discovered much later when users start to
> > realize their applications got bogged up after a crash, so it feels to me
> > like playing with fire.
> > 
> > > > If I read the problem description correctly, then a solution that invalidates
> > > > the NFS cache before AND after the write would be acceptable. Right?
> > > > Would an extra i_version bump after the write solve the race?
> > > > 
> > > 
> > > I based this patch on Neil's assertion that updating the time before an
> > > operation was pointless if we were going to do it afterward. The NFS
> > > client only really cares about seeing it change after a write.
> > > 
> > 
> > Pointless to NFS client maybe.
> > That this does not change user behavior for other applications
> > is up to you to prove and I doubt that you can prove it because I doubt
> > that it is true.
> > 
> > > Doing both would be fine from a correctness standpoint, and in most
> > > cases, the second would be a no-op anyway since a query would have to
> > > race in between the two for that to happen.
> > > 
> > > FWIW, I think we should update the m/ctime and version at the same time.
> > > If the version changes, then there is always the potential that a timer
> > > tick has occurred. So, that would translate to a second call to
> > > file_update_time in here.
> > > 
> > > The downside of bumping the times/version both before and after is that
> > > these are hot codepaths, and we'd be adding extra operations there. Even
> > > in the case where nothing has changed, we'd have to call
> > > inode_needs_update_time a second time for every write. Is that worth the
> > > cost?
> > 
> > Is there a practical cost for iversion bump AFTER write as I suggested?
> > If you NEED m/ctime update AFTER write and iversion update is not enough
> > then I did not understand from your commit message why that is.
> > 
> > Thanks,
> > Amir.
> > 
> 
> Maybe we should split i_version updates from ctime updates.
> 
> While it isn't true that ctime updates have happened before the write
> "forever" it has been true since 2.3.43[1] which is close to forever.
> 
> For ctime there doesn't appear to be a strong specification of when the
> change happens, so history provides a good case for leaving it before.
> For i_version we want to provide clear and unambiguous semantics.
> Performing 2 updates makes the specification muddy.
> 
> So I would prefer a single update for i_version, performed after the
> change becomes visible.  If that means it has to be separate from ctime,
> then so be it.
> 
> NeilBrown
> 
> 
> [1]:  https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=636b38438001a00b25f23e38747a91cb8428af29


Not necessarily. We can document it in such a way that bumping it twice
is allowed, but not required.

My main concern with splitting them up is that we'd have to dirty the
inode twice if both the times and the i_version need updating. If the
inode gets written out in between, then we end up doing twice the I/O.
The interim on-disk metadata would be in sort of a weird state too --
the ctime would have changed but the version would still be old.

It might be worthwhile to just go ahead and continue bumping it in
file_update_time, and then we'd just attempt to bump the i_version again
afterward. The second bump will almost always be a no-op anyway.
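
The reason a second bump is usually free is that the i_version counter only
advances when its current value has been queried since the last increment.
Below is a simplified, single-threaded userspace sketch of that behavior.
The real helpers in include/linux/iversion.h use atomic compare-and-exchange.
The model_* names are hypothetical, and model_write() only marks where the
data copy would sit between the two bumps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_QUERIED	1ULL	/* low bit: the current value has been seen */
#define MODEL_INCREMENT	2ULL	/* the counter lives in the upper 63 bits */

struct model_iversion {
	uint64_t raw;
};

/*
 * Bump the counter only if forced, or if someone has queried the
 * current value. Returns true if the counter actually changed.
 */
static bool model_maybe_inc(struct model_iversion *v, bool force)
{
	assert(v);

	if (!force && !(v->raw & MODEL_QUERIED))
		return false;
	v->raw = (v->raw & ~MODEL_QUERIED) + MODEL_INCREMENT;
	return true;
}

/* Read the counter and record that the value has been observed. */
static uint64_t model_query(struct model_iversion *v)
{
	assert(v);

	v->raw |= MODEL_QUERIED;
	return v->raw >> 1;
}

/*
 * Model one write that bumps both before and after the data copy.
 * If reader_races is true, a query lands between the two bumps.
 * Returns how many real increments happened.
 */
static int model_write(struct model_iversion *v, bool reader_races)
{
	int bumps = 0;

	bumps += model_maybe_inc(v, false);	/* before the copy */
	if (reader_races)
		(void)model_query(v);
	/* ... the data copy would happen here ... */
	bumps += model_maybe_inc(v, false);	/* after the copy */
	return bumps;
}
```

With a previously queried counter, model_write(&v, false) increments once
and the trailing bump is a no-op. Only when a query lands in between, as in
model_write(&v, true), does the second bump do any work, which is exactly
the case where the reader needs to see a new value.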