mbox series

[v7,00/13] fs: implement multigrain timestamps

Message ID 20230807-mgctime-v7-0-d1dec143a704@kernel.org
Headers show
Series fs: implement multigrain timestamps | expand

Message

Jeff Layton Aug. 7, 2023, 7:38 p.m. UTC
The VFS always uses coarse-grained timestamps when updating the
ctime and mtime after a change. This has the benefit of allowing
filesystems to optimize away a lot metadata updates, down to around 1
per jiffy, even when a file is under heavy writes.

Unfortunately, this coarseness has always been an issue when we're
exporting via NFSv3, which relies on timestamps to validate caches. A
lot of changes can happen in a jiffy, so timestamps aren't sufficient to
help the client decide to invalidate the cache.

Even with NFSv4, a lot of exported filesystems don't properly support a
change attribute and are subject to the same problems with timestamp
granularity. Other applications have similar issues with timestamps (e.g
backup applications).

If we were to always use fine-grained timestamps, that would improve the
situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.

What we need is a way to only use fine-grained timestamps when they are
being actively queried. The idea is to use an unused bit in the ctime's
tv_nsec field to mark when the mtime or ctime has been queried via
getattr. Once that has been marked, the next m/ctime update will use a
fine-grained timestamp.

Credit goes to Dave Chinner for the original idea, and to Ben Coddington
for the catchy name. This series should apply cleanly onto Christian's
vfs.ctime branch, once the v6 mgtime patches have been dropped. That
should be everything above this commit:

    525deaeb2fbf gfs2: fix timestamp handling on quota inodes

base-commit: cf22d118b89a09a0160586412160d89098f7c4c7
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Changes in v7:
- change update_time operation to fetch the current time itself
- don't modify current_time operation. Leave it always returning coarse timestamp
- rework inode_set_ctime_current for better atomicity and ensure that
  all mgtime filesystems use it
- reorder arguments to fill_mg_cmtime

Changes in v6:
- drop the patch that removed XFS_ICHGTIME_CHG
- change WARN_ON_ONCE to ASSERT in xfs conversion patch

---
Jeff Layton (13):
      fs: remove silly warning from current_time
      fs: pass the request_mask to generic_fillattr
      fs: drop the timespec64 arg from generic_update_time
      btrfs: have it use inode_update_timestamps
      fat: make fat_update_time get its own timestamp
      ubifs: have ubifs_update_time use inode_update_timestamps
      xfs: have xfs_vn_update_time gets its own timestamp
      fs: drop the timespec64 argument from update_time
      fs: add infrastructure for multigrain timestamps
      tmpfs: add support for multigrain timestamps
      xfs: switch to multigrain timestamps
      ext4: switch to multigrain timestamps
      btrfs: convert to multigrain timestamps

 fs/9p/vfs_inode.c               |   4 +-
 fs/9p/vfs_inode_dotl.c          |   4 +-
 fs/afs/inode.c                  |   2 +-
 fs/bad_inode.c                  |   3 +-
 fs/btrfs/file.c                 |  24 +----
 fs/btrfs/inode.c                |  14 +--
 fs/btrfs/super.c                |   5 +-
 fs/btrfs/volumes.c              |   4 +-
 fs/ceph/inode.c                 |   2 +-
 fs/coda/inode.c                 |   3 +-
 fs/ecryptfs/inode.c             |   5 +-
 fs/erofs/inode.c                |   2 +-
 fs/exfat/file.c                 |   2 +-
 fs/ext2/inode.c                 |   2 +-
 fs/ext4/inode.c                 |   2 +-
 fs/ext4/super.c                 |   2 +-
 fs/f2fs/file.c                  |   2 +-
 fs/fat/fat.h                    |   3 +-
 fs/fat/file.c                   |   2 +-
 fs/fat/misc.c                   |   6 +-
 fs/fuse/dir.c                   |   2 +-
 fs/gfs2/inode.c                 |   8 +-
 fs/hfsplus/inode.c              |   2 +-
 fs/inode.c                      | 200 +++++++++++++++++++++++++++++++---------
 fs/kernfs/inode.c               |   2 +-
 fs/libfs.c                      |   4 +-
 fs/minix/inode.c                |   2 +-
 fs/nfs/inode.c                  |   2 +-
 fs/nfs/namespace.c              |   3 +-
 fs/ntfs3/file.c                 |   2 +-
 fs/ocfs2/file.c                 |   2 +-
 fs/orangefs/inode.c             |   5 +-
 fs/overlayfs/inode.c            |   2 +-
 fs/overlayfs/overlayfs.h        |   2 +-
 fs/proc/base.c                  |   4 +-
 fs/proc/fd.c                    |   2 +-
 fs/proc/generic.c               |   2 +-
 fs/proc/proc_net.c              |   2 +-
 fs/proc/proc_sysctl.c           |   2 +-
 fs/proc/root.c                  |   3 +-
 fs/smb/client/inode.c           |   2 +-
 fs/smb/server/smb2pdu.c         |  22 ++---
 fs/smb/server/vfs.c             |   3 +-
 fs/stat.c                       |  65 ++++++++++---
 fs/sysv/itree.c                 |   3 +-
 fs/ubifs/dir.c                  |   2 +-
 fs/ubifs/file.c                 |  19 ++--
 fs/ubifs/ubifs.h                |   2 +-
 fs/udf/symlink.c                |   2 +-
 fs/vboxsf/utils.c               |   2 +-
 fs/xfs/libxfs/xfs_trans_inode.c |   6 +-
 fs/xfs/xfs_iops.c               |  25 +++--
 fs/xfs/xfs_super.c              |   2 +-
 include/linux/fs.h              |  55 +++++++++--
 mm/shmem.c                      |   4 +-
 55 files changed, 368 insertions(+), 192 deletions(-)
---
base-commit: 525deaeb2fbf634222f4231608c72190c551c935
change-id: 20230713-mgctime-f2a9fc324918

Best regards,

Comments

Jan Kara Aug. 8, 2023, 9:05 a.m. UTC | #1
On Mon 07-08-23 15:38:32, Jeff Layton wrote:
> An inode with no superblock? Unpossible!
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/inode.c | 6 ------
>  1 file changed, 6 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index d4ab92233062..3fc251bfaf73 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -2495,12 +2495,6 @@ struct timespec64 current_time(struct inode *inode)
>  	struct timespec64 now;
>  
>  	ktime_get_coarse_real_ts64(&now);
> -
> -	if (unlikely(!inode->i_sb)) {
> -		WARN(1, "current_time() called with uninitialized super_block in the inode");
> -		return now;
> -	}
> -
>  	return timestamp_truncate(now, inode);
>  }
>  EXPORT_SYMBOL(current_time);
> 
> -- 
> 2.41.0
>
Jan Kara Aug. 8, 2023, 9:37 a.m. UTC | #2
On Mon 07-08-23 15:38:37, Jeff Layton wrote:
> In later patches, we're going to drop the "now" parameter from the
> update_time operation. Prepare ubifs for this, by having it use the new
> inode_update_timestamps helper.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

One comment below:

> diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
> index df9086b19cd0..2d0178922e19 100644
> --- a/fs/ubifs/file.c
> +++ b/fs/ubifs/file.c
> @@ -1397,15 +1397,9 @@ int ubifs_update_time(struct inode *inode, struct timespec64 *time,
>  		return err;
>  
>  	mutex_lock(&ui->ui_mutex);
> -	if (flags & S_ATIME)
> -		inode->i_atime = *time;
> -	if (flags & S_CTIME)
> -		inode_set_ctime_to_ts(inode, *time);
> -	if (flags & S_MTIME)
> -		inode->i_mtime = *time;
> -
> -	release = ui->dirty;
> +	inode_update_timestamps(inode, flags);
>  	__mark_inode_dirty(inode, I_DIRTY_SYNC);
> +	release = ui->dirty;
>  	mutex_unlock(&ui->ui_mutex);

I think this is wrong. You need to keep sampling ui->dirty before calling
__mark_inode_dirty(). Otherwise you could release budget for inode update
you really need...

>  	if (release)
>  		ubifs_release_budget(c, &req);

									Honza
Jan Kara Aug. 8, 2023, 9:45 a.m. UTC | #3
On Mon 07-08-23 15:38:39, Jeff Layton wrote:
> Now that all of the update_time operations are prepared for it, we can
> drop the timespec64 argument from the update_time operation. Do that and
> remove it from some associated functions like inode_update_time and
> inode_needs_update_time.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Looks good to me. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
Jan Kara Aug. 8, 2023, 10:02 a.m. UTC | #4
On Mon 07-08-23 15:38:40, Jeff Layton wrote:
> The VFS always uses coarse-grained timestamps when updating the ctime
> and mtime after a change. This has the benefit of allowing filesystems
> to optimize away a lot metadata updates, down to around 1 per jiffy,
> even when a file is under heavy writes.
> 
> Unfortunately, this has always been an issue when we're exporting via
> NFSv3, which relies on timestamps to validate caches. A lot of changes
> can happen in a jiffy, so timestamps aren't sufficient to help the
> client decide to invalidate the cache. Even with NFSv4, a lot of
> exported filesystems don't properly support a change attribute and are
> subject to the same problems with timestamp granularity. Other
> applications have similar issues with timestamps (e.g backup
> applications).
> 
> If we were to always use fine-grained timestamps, that would improve the
> situation, but that becomes rather expensive, as the underlying
> filesystem would have to log a lot more metadata updates.
> 
> What we need is a way to only use fine-grained timestamps when they are
> being actively queried.
> 
> POSIX generally mandates that when the the mtime changes, the ctime must
> also change. The kernel always stores normalized ctime values, so only
> the first 30 bits of the tv_nsec field are ever used.
> 
> Use the 31st bit of the ctime tv_nsec field to indicate that something
> has queried the inode for the mtime or ctime. When this flag is set,
> on the next mtime or ctime update, the kernel will fetch a fine-grained
> timestamp instead of the usual coarse-grained one.
> 
> Filesytems can opt into this behavior by setting the FS_MGTIME flag in
> the fstype. Filesystems that don't set this flag will continue to use
> coarse-grained timestamps.
> 
> Later patches will convert individual filesystems to use the new
> infrastructure.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Looks good to me. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/inode.c         | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  fs/stat.c          | 41 +++++++++++++++++++++++++--
>  include/linux/fs.h | 46 ++++++++++++++++++++++++++++--
>  3 files changed, 162 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index e50d94a136fe..f55957ac80e6 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -2118,10 +2118,52 @@ int file_remove_privs(struct file *file)
>  }
>  EXPORT_SYMBOL(file_remove_privs);
>  
> +/**
> + * current_mgtime - Return FS time (possibly fine-grained)
> + * @inode: inode.
> + *
> + * Return the current time truncated to the time granularity supported by
> + * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
> + * as having been QUERIED, get a fine-grained timestamp.
> + */
> +struct timespec64 current_mgtime(struct inode *inode)
> +{
> +	struct timespec64 now, ctime;
> +	atomic_long_t *pnsec = (atomic_long_t *)&inode->__i_ctime.tv_nsec;
> +	long nsec = atomic_long_read(pnsec);
> +
> +	if (nsec & I_CTIME_QUERIED) {
> +		ktime_get_real_ts64(&now);
> +		return timestamp_truncate(now, inode);
> +	}
> +
> +	ktime_get_coarse_real_ts64(&now);
> +	now = timestamp_truncate(now, inode);
> +
> +	/*
> +	 * If we've recently fetched a fine-grained timestamp
> +	 * then the coarse-grained one may still be earlier than the
> +	 * existing ctime. Just keep the existing value if so.
> +	 */
> +	ctime = inode_get_ctime(inode);
> +	if (timespec64_compare(&ctime, &now) > 0)
> +		now = ctime;
> +
> +	return now;
> +}
> +EXPORT_SYMBOL(current_mgtime);
> +
> +static struct timespec64 current_ctime(struct inode *inode)
> +{
> +	if (is_mgtime(inode))
> +		return current_mgtime(inode);
> +	return current_time(inode);
> +}
> +
>  static int inode_needs_update_time(struct inode *inode)
>  {
>  	int sync_it = 0;
> -	struct timespec64 now = current_time(inode);
> +	struct timespec64 now = current_ctime(inode);
>  	struct timespec64 ctime;
>  
>  	/* First try to exhaust all avenues to not sync */
> @@ -2552,9 +2594,43 @@ EXPORT_SYMBOL(current_time);
>   */
>  struct timespec64 inode_set_ctime_current(struct inode *inode)
>  {
> -	struct timespec64 now = current_time(inode);
> +	struct timespec64 now;
> +	struct timespec64 ctime;
> +
> +	ctime.tv_nsec = READ_ONCE(inode->__i_ctime.tv_nsec);
> +	if (!(ctime.tv_nsec & I_CTIME_QUERIED)) {
> +		now = current_time(inode);
>  
> -	inode_set_ctime(inode, now.tv_sec, now.tv_nsec);
> +		/* Just copy it into place if it's not multigrain */
> +		if (!is_mgtime(inode)) {
> +			inode_set_ctime_to_ts(inode, now);
> +			return now;
> +		}
> +
> +		/*
> +		 * If we've recently updated with a fine-grained timestamp,
> +		 * then the coarse-grained one may still be earlier than the
> +		 * existing ctime. Just keep the existing value if so.
> +		 */
> +		ctime.tv_sec = inode->__i_ctime.tv_sec;
> +		if (timespec64_compare(&ctime, &now) > 0)
> +			return ctime;
> +
> +		/*
> +		 * Ctime updates are usually protected by the inode_lock, but
> +		 * we can still race with someone setting the QUERIED flag.
> +		 * Try to swap the new nsec value into place. If it's changed
> +		 * in the interim, then just go with a fine-grained timestamp.
> +		 */
> +		if (cmpxchg(&inode->__i_ctime.tv_nsec, ctime.tv_nsec,
> +			    now.tv_nsec) != ctime.tv_nsec)
> +			goto fine_grained;
> +		inode->__i_ctime.tv_sec = now.tv_sec;
> +		return now;
> +	}
> +fine_grained:
> +	ktime_get_real_ts64(&now);
> +	inode_set_ctime_to_ts(inode, timestamp_truncate(now, inode));
>  	return now;
>  }
>  EXPORT_SYMBOL(inode_set_ctime_current);
> diff --git a/fs/stat.c b/fs/stat.c
> index 7644e5997035..136711ae72fb 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -26,6 +26,37 @@
>  #include "internal.h"
>  #include "mount.h"
>  
> +/**
> + * fill_mg_cmtime - Fill in the mtime and ctime and flag ctime as QUERIED
> + * @stat: where to store the resulting values
> + * @request_mask: STATX_* values requested
> + * @inode: inode from which to grab the c/mtime
> + *
> + * Given @inode, grab the ctime and mtime out if it and store the result
> + * in @stat. When fetching the value, flag it as queried so the next write
> + * will use a fine-grained timestamp.
> + */
> +void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode)
> +{
> +	atomic_long_t *pnsec = (atomic_long_t *)&inode->__i_ctime.tv_nsec;
> +
> +	/* If neither time was requested, then don't report them */
> +	if (!(request_mask & (STATX_CTIME|STATX_MTIME))) {
> +		stat->result_mask &= ~(STATX_CTIME|STATX_MTIME);
> +		return;
> +	}
> +
> +	stat->mtime = inode->i_mtime;
> +	stat->ctime.tv_sec = inode->__i_ctime.tv_sec;
> +	/*
> +	 * Atomically set the QUERIED flag and fetch the new value with
> +	 * the flag masked off.
> +	 */
> +	stat->ctime.tv_nsec = atomic_long_fetch_or(I_CTIME_QUERIED, pnsec) &
> +					~I_CTIME_QUERIED;
> +}
> +EXPORT_SYMBOL(fill_mg_cmtime);
> +
>  /**
>   * generic_fillattr - Fill in the basic attributes from the inode struct
>   * @idmap:		idmap of the mount the inode was found from
> @@ -58,8 +89,14 @@ void generic_fillattr(struct mnt_idmap *idmap, u32 request_mask,
>  	stat->rdev = inode->i_rdev;
>  	stat->size = i_size_read(inode);
>  	stat->atime = inode->i_atime;
> -	stat->mtime = inode->i_mtime;
> -	stat->ctime = inode_get_ctime(inode);
> +
> +	if (is_mgtime(inode)) {
> +		fill_mg_cmtime(stat, request_mask, inode);
> +	} else {
> +		stat->mtime = inode->i_mtime;
> +		stat->ctime = inode_get_ctime(inode);
> +	}
> +
>  	stat->blksize = i_blocksize(inode);
>  	stat->blocks = inode->i_blocks;
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index a83313f90fe3..455835d0e963 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1474,18 +1474,47 @@ static inline bool fsuidgid_has_mapping(struct super_block *sb,
>  	       kgid_has_mapping(fs_userns, kgid);
>  }
>  
> +struct timespec64 current_mgtime(struct inode *inode);
>  struct timespec64 current_time(struct inode *inode);
>  struct timespec64 inode_set_ctime_current(struct inode *inode);
>  
> +/*
> + * Multigrain timestamps
> + *
> + * Conditionally use fine-grained ctime and mtime timestamps when there
> + * are users actively observing them via getattr. The primary use-case
> + * for this is NFS clients that use the ctime to distinguish between
> + * different states of the file, and that are often fooled by multiple
> + * operations that occur in the same coarse-grained timer tick.
> + *
> + * The kernel always keeps normalized struct timespec64 values in the ctime,
> + * which means that only the first 30 bits of the value are used. Use the
> + * 31st bit of the ctime's tv_nsec field as a flag to indicate that the value
> + * has been queried since it was last updated.
> + */
> +#define I_CTIME_QUERIED		(1L<<30)
> +
>  /**
>   * inode_get_ctime - fetch the current ctime from the inode
>   * @inode: inode from which to fetch ctime
>   *
> - * Grab the current ctime from the inode and return it.
> + * Grab the current ctime tv_nsec field from the inode, mask off the
> + * I_CTIME_QUERIED flag and return it. This is mostly intended for use by
> + * internal consumers of the ctime that aren't concerned with ensuring a
> + * fine-grained update on the next change (e.g. when preparing to store
> + * the value in the backing store for later retrieval).
> + *
> + * This is safe to call regardless of whether the underlying filesystem
> + * is using multigrain timestamps.
>   */
>  static inline struct timespec64 inode_get_ctime(const struct inode *inode)
>  {
> -	return inode->__i_ctime;
> +	struct timespec64 ctime;
> +
> +	ctime.tv_sec = inode->__i_ctime.tv_sec;
> +	ctime.tv_nsec = inode->__i_ctime.tv_nsec & ~I_CTIME_QUERIED;
> +
> +	return ctime;
>  }
>  
>  /**
> @@ -2259,6 +2288,7 @@ struct file_system_type {
>  #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
>  #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
>  #define FS_ALLOW_IDMAP         32      /* FS has been updated to handle vfs idmappings. */
> +#define FS_MGTIME		64	/* FS uses multigrain timestamps */
>  #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
>  	int (*init_fs_context)(struct fs_context *);
>  	const struct fs_parameter_spec *parameters;
> @@ -2282,6 +2312,17 @@ struct file_system_type {
>  
>  #define MODULE_ALIAS_FS(NAME) MODULE_ALIAS("fs-" NAME)
>  
> +/**
> + * is_mgtime: is this inode using multigrain timestamps
> + * @inode: inode to test for multigrain timestamps
> + *
> + * Return true if the inode uses multigrain timestamps, false otherwise.
> + */
> +static inline bool is_mgtime(const struct inode *inode)
> +{
> +	return inode->i_sb->s_type->fs_flags & FS_MGTIME;
> +}
> +
>  extern struct dentry *mount_bdev(struct file_system_type *fs_type,
>  	int flags, const char *dev_name, void *data,
>  	int (*fill_super)(struct super_block *, void *, int));
> @@ -2918,6 +2959,7 @@ extern void page_put_link(void *);
>  extern int page_symlink(struct inode *inode, const char *symname, int len);
>  extern const struct inode_operations page_symlink_inode_operations;
>  extern void kfree_link(void *);
> +void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode);
>  void generic_fillattr(struct mnt_idmap *, u32, struct inode *, struct kstat *);
>  void generic_fill_statx_attr(struct inode *inode, struct kstat *stat);
>  extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int);
> 
> -- 
> 2.41.0
>
Jan Kara Aug. 8, 2023, 10:05 a.m. UTC | #5
On Mon 07-08-23 15:38:44, Jeff Layton wrote:
> Enable multigrain timestamps, which should ensure that there is an
> apparent change to the timestamp whenever it has been written after
> being actively observed via getattr.
> 
> Beyond enabling the FS_MGTIME flag, this patch eliminates
> update_time_for_write, which goes to great pains to avoid in-memory
> stores. Just have it overwrite the timestamps unconditionally.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> Acked-by: David Sterba <dsterba@suse.com>

Looks good to me. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/btrfs/file.c  | 24 ++++--------------------
>  fs/btrfs/super.c |  5 +++--
>  2 files changed, 7 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index d7a9ece7a40b..b9e75c9f95ac 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1106,25 +1106,6 @@ void btrfs_check_nocow_unlock(struct btrfs_inode *inode)
>  	btrfs_drew_write_unlock(&inode->root->snapshot_lock);
>  }
>  
> -static void update_time_for_write(struct inode *inode)
> -{
> -	struct timespec64 now, ctime;
> -
> -	if (IS_NOCMTIME(inode))
> -		return;
> -
> -	now = current_time(inode);
> -	if (!timespec64_equal(&inode->i_mtime, &now))
> -		inode->i_mtime = now;
> -
> -	ctime = inode_get_ctime(inode);
> -	if (!timespec64_equal(&ctime, &now))
> -		inode_set_ctime_to_ts(inode, now);
> -
> -	if (IS_I_VERSION(inode))
> -		inode_inc_iversion(inode);
> -}
> -
>  static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
>  			     size_t count)
>  {
> @@ -1156,7 +1137,10 @@ static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
>  	 * need to start yet another transaction to update the inode as we will
>  	 * update the inode when we finish writing whatever data we write.
>  	 */
> -	update_time_for_write(inode);
> +	if (!IS_NOCMTIME(inode)) {
> +		inode->i_mtime = inode_set_ctime_current(inode);
> +		inode_inc_iversion(inode);
> +	}
>  
>  	start_pos = round_down(pos, fs_info->sectorsize);
>  	oldsize = i_size_read(inode);
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index f1dd172d8d5b..8eda51b095c9 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -2144,7 +2144,7 @@ static struct file_system_type btrfs_fs_type = {
>  	.name		= "btrfs",
>  	.mount		= btrfs_mount,
>  	.kill_sb	= btrfs_kill_super,
> -	.fs_flags	= FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA,
> +	.fs_flags	= FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA | FS_MGTIME,
>  };
>  
>  static struct file_system_type btrfs_root_fs_type = {
> @@ -2152,7 +2152,8 @@ static struct file_system_type btrfs_root_fs_type = {
>  	.name		= "btrfs",
>  	.mount		= btrfs_mount_root,
>  	.kill_sb	= btrfs_kill_super,
> -	.fs_flags	= FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA | FS_ALLOW_IDMAP,
> +	.fs_flags	= FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA |
> +			  FS_ALLOW_IDMAP | FS_MGTIME,
>  };
>  
>  MODULE_ALIAS_FS("btrfs");
> 
> -- 
> 2.41.0
>
Christian Brauner Aug. 9, 2023, 7:06 a.m. UTC | #6
On Tue, Aug 08, 2023 at 11:37:01AM +0200, Jan Kara wrote:
> On Mon 07-08-23 15:38:37, Jeff Layton wrote:
> > In later patches, we're going to drop the "now" parameter from the
> > update_time operation. Prepare ubifs for this, by having it use the new
> > inode_update_timestamps helper.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> 
> One comment below:
> 
> > diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
> > index df9086b19cd0..2d0178922e19 100644
> > --- a/fs/ubifs/file.c
> > +++ b/fs/ubifs/file.c
> > @@ -1397,15 +1397,9 @@ int ubifs_update_time(struct inode *inode, struct timespec64 *time,
> >  		return err;
> >  
> >  	mutex_lock(&ui->ui_mutex);
> > -	if (flags & S_ATIME)
> > -		inode->i_atime = *time;
> > -	if (flags & S_CTIME)
> > -		inode_set_ctime_to_ts(inode, *time);
> > -	if (flags & S_MTIME)
> > -		inode->i_mtime = *time;
> > -
> > -	release = ui->dirty;
> > +	inode_update_timestamps(inode, flags);
> >  	__mark_inode_dirty(inode, I_DIRTY_SYNC);
> > +	release = ui->dirty;
> >  	mutex_unlock(&ui->ui_mutex);
> 
> I think this is wrong. You need to keep sampling ui->dirty before calling
> __mark_inode_dirty(). Otherwise you could release budget for inode update
> you really need...

Fixed in-tree.
Christian Brauner Aug. 9, 2023, 7:09 a.m. UTC | #7
On Mon, 07 Aug 2023 15:38:31 -0400, Jeff Layton wrote:
> The VFS always uses coarse-grained timestamps when updating the
> ctime and mtime after a change. This has the benefit of allowing
> filesystems to optimize away a lot metadata updates, down to around 1
> per jiffy, even when a file is under heavy writes.
> 
> Unfortunately, this coarseness has always been an issue when we're
> exporting via NFSv3, which relies on timestamps to validate caches. A
> lot of changes can happen in a jiffy, so timestamps aren't sufficient to
> help the client decide to invalidate the cache.
> 
> [...]

Applied to the vfs.ctime branch of the vfs/vfs.git tree.
Patches in the vfs.ctime branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.ctime

[01/13] fs: remove silly warning from current_time
        https://git.kernel.org/vfs/vfs/c/562bcab11bf4
[02/13] fs: pass the request_mask to generic_fillattr
        https://git.kernel.org/vfs/vfs/c/3592165f4378
[03/13] fs: drop the timespec64 arg from generic_update_time
        https://git.kernel.org/vfs/vfs/c/32586bb50943
[04/13] btrfs: have it use inode_update_timestamps
        https://git.kernel.org/vfs/vfs/c/51fc38e7c7d1
[05/13] fat: make fat_update_time get its own timestamp
        https://git.kernel.org/vfs/vfs/c/d6e7faae82dc
[06/13] ubifs: have ubifs_update_time use inode_update_timestamps
        https://git.kernel.org/vfs/vfs/c/853ff59b434a
[07/13] xfs: have xfs_vn_update_time gets its own timestamp
        https://git.kernel.org/vfs/vfs/c/7ad056c2cf36
[08/13] fs: drop the timespec64 argument from update_time
        https://git.kernel.org/vfs/vfs/c/3beae086b71f
[09/13] fs: add infrastructure for multigrain timestamps
        https://git.kernel.org/vfs/vfs/c/b16956ed812f
[10/13] tmpfs: add support for multigrain timestamps
        https://git.kernel.org/vfs/vfs/c/bd21ec574f16
[11/13] xfs: switch to multigrain timestamps
        https://git.kernel.org/vfs/vfs/c/fb9812d2c39e
[12/13] ext4: switch to multigrain timestamps
        https://git.kernel.org/vfs/vfs/c/7fdf02299f5d
[13/13] btrfs: convert to multigrain timestamps
        https://git.kernel.org/vfs/vfs/c/2ebbf04988b9
Jan Kara Aug. 9, 2023, 8:23 a.m. UTC | #8
On Wed 09-08-23 09:06:34, Christian Brauner wrote:
> On Tue, Aug 08, 2023 at 11:37:01AM +0200, Jan Kara wrote:
> > On Mon 07-08-23 15:38:37, Jeff Layton wrote:
> > > In later patches, we're going to drop the "now" parameter from the
> > > update_time operation. Prepare ubifs for this, by having it use the new
> > > inode_update_timestamps helper.
> > > 
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > 
> > One comment below:
> > 
> > > diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
> > > index df9086b19cd0..2d0178922e19 100644
> > > --- a/fs/ubifs/file.c
> > > +++ b/fs/ubifs/file.c
> > > @@ -1397,15 +1397,9 @@ int ubifs_update_time(struct inode *inode, struct timespec64 *time,
> > >  		return err;
> > >  
> > >  	mutex_lock(&ui->ui_mutex);
> > > -	if (flags & S_ATIME)
> > > -		inode->i_atime = *time;
> > > -	if (flags & S_CTIME)
> > > -		inode_set_ctime_to_ts(inode, *time);
> > > -	if (flags & S_MTIME)
> > > -		inode->i_mtime = *time;
> > > -
> > > -	release = ui->dirty;
> > > +	inode_update_timestamps(inode, flags);
> > >  	__mark_inode_dirty(inode, I_DIRTY_SYNC);
> > > +	release = ui->dirty;
> > >  	mutex_unlock(&ui->ui_mutex);
> > 
> > I think this is wrong. You need to keep sampling ui->dirty before calling
> > __mark_inode_dirty(). Otherwise you could release budget for inode update
> > you really need...
> 
> Fixed in-tree.

Thanks. With that feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
Christian Brauner Aug. 9, 2023, 12:31 p.m. UTC | #9
On Mon, Aug 07, 2023 at 03:38:39PM -0400, Jeff Layton wrote:
> Now that all of the update_time operations are prepared for it, we can
> drop the timespec64 argument from the update_time operation. Do that and
> remove it from some associated functions like inode_update_time and
> inode_needs_update_time.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/bad_inode.c           |  3 +--
>  fs/btrfs/inode.c         |  3 +--
>  fs/btrfs/volumes.c       |  4 +---
>  fs/fat/fat.h             |  3 +--
>  fs/fat/misc.c            |  2 +-
>  fs/gfs2/inode.c          |  3 +--
>  fs/inode.c               | 30 +++++++++++++-----------------
>  fs/overlayfs/inode.c     |  2 +-
>  fs/overlayfs/overlayfs.h |  2 +-
>  fs/ubifs/file.c          |  3 +--
>  fs/ubifs/ubifs.h         |  2 +-
>  fs/xfs/xfs_iops.c        |  1 -
>  include/linux/fs.h       |  4 ++--

This was missing the conversion of fs/orangefs orangefs_update_time()
causing the build to fail. So at some point kbuild will yell here.
Fwiw, I've fixed that up in-tree.
Mike Marshall Aug. 9, 2023, 6:38 p.m. UTC | #10
I've been following this patch on fsdevel... is there a
remote I could fetch with a branch that has this in it?

-Mike

On Wed, Aug 9, 2023 at 8:32 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Mon, Aug 07, 2023 at 03:38:39PM -0400, Jeff Layton wrote:
> > Now that all of the update_time operations are prepared for it, we can
> > drop the timespec64 argument from the update_time operation. Do that and
> > remove it from some associated functions like inode_update_time and
> > inode_needs_update_time.
> >
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/bad_inode.c           |  3 +--
> >  fs/btrfs/inode.c         |  3 +--
> >  fs/btrfs/volumes.c       |  4 +---
> >  fs/fat/fat.h             |  3 +--
> >  fs/fat/misc.c            |  2 +-
> >  fs/gfs2/inode.c          |  3 +--
> >  fs/inode.c               | 30 +++++++++++++-----------------
> >  fs/overlayfs/inode.c     |  2 +-
> >  fs/overlayfs/overlayfs.h |  2 +-
> >  fs/ubifs/file.c          |  3 +--
> >  fs/ubifs/ubifs.h         |  2 +-
> >  fs/xfs/xfs_iops.c        |  1 -
> >  include/linux/fs.h       |  4 ++--
>
> This was missing the conversion of fs/orangefs orangefs_update_time()
> causing the build to fail. So at some point kbuild will yell here.
> Fwiw, I've fixed that up in-tree.
Jeff Layton Aug. 9, 2023, 7:05 p.m. UTC | #11
Yes. It's in Christian's vfs.ctime branch:

https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.ctime

On Wed, 2023-08-09 at 14:38 -0400, Mike Marshall wrote:
> I've been following this patch on fsdevel... is there a
> remote I could fetch with a branch that has this in it?
> 
> -Mike
> 
> On Wed, Aug 9, 2023 at 8:32 AM Christian Brauner <brauner@kernel.org> wrote:
> > 
> > On Mon, Aug 07, 2023 at 03:38:39PM -0400, Jeff Layton wrote:
> > > Now that all of the update_time operations are prepared for it, we can
> > > drop the timespec64 argument from the update_time operation. Do that and
> > > remove it from some associated functions like inode_update_time and
> > > inode_needs_update_time.
> > > 
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  fs/bad_inode.c           |  3 +--
> > >  fs/btrfs/inode.c         |  3 +--
> > >  fs/btrfs/volumes.c       |  4 +---
> > >  fs/fat/fat.h             |  3 +--
> > >  fs/fat/misc.c            |  2 +-
> > >  fs/gfs2/inode.c          |  3 +--
> > >  fs/inode.c               | 30 +++++++++++++-----------------
> > >  fs/overlayfs/inode.c     |  2 +-
> > >  fs/overlayfs/overlayfs.h |  2 +-
> > >  fs/ubifs/file.c          |  3 +--
> > >  fs/ubifs/ubifs.h         |  2 +-
> > >  fs/xfs/xfs_iops.c        |  1 -
> > >  include/linux/fs.h       |  4 ++--
> > 
> > This was missing the conversion of fs/orangefs orangefs_update_time()
> > causing the build to fail. So at some point kbuild will yell here.
> > Fwiw, I've fixed that up in-tree.

Cheers,
patchwork-bot+f2fs@kernel.org Sept. 4, 2023, 6:11 p.m. UTC | #12
Hello:

This series was applied to jaegeuk/f2fs.git (dev)
by Christian Brauner <brauner@kernel.org>:

On Mon, 07 Aug 2023 15:38:31 -0400 you wrote:
> The VFS always uses coarse-grained timestamps when updating the
> ctime and mtime after a change. This has the benefit of allowing
> filesystems to optimize away a lot metadata updates, down to around 1
> per jiffy, even when a file is under heavy writes.
> 
> Unfortunately, this coarseness has always been an issue when we're
> exporting via NFSv3, which relies on timestamps to validate caches. A
> lot of changes can happen in a jiffy, so timestamps aren't sufficient to
> help the client decide to invalidate the cache.
> 
> [...]

Here is the summary with links:
  - [f2fs-dev,v7,01/13] fs: remove silly warning from current_time
    https://git.kernel.org/jaegeuk/f2fs/c/b3030e4f2344
  - [f2fs-dev,v7,02/13] fs: pass the request_mask to generic_fillattr
    https://git.kernel.org/jaegeuk/f2fs/c/0d72b92883c6
  - [f2fs-dev,v7,03/13] fs: drop the timespec64 arg from generic_update_time
    https://git.kernel.org/jaegeuk/f2fs/c/541d4c798a59
  - [f2fs-dev,v7,04/13] btrfs: have it use inode_update_timestamps
    https://git.kernel.org/jaegeuk/f2fs/c/bb7cc0a62e47
  - [f2fs-dev,v7,05/13] fat: make fat_update_time get its own timestamp
    (no matching commit)
  - [f2fs-dev,v7,06/13] ubifs: have ubifs_update_time use inode_update_timestamps
    (no matching commit)
  - [f2fs-dev,v7,07/13] xfs: have xfs_vn_update_time gets its own timestamp
    (no matching commit)
  - [f2fs-dev,v7,08/13] fs: drop the timespec64 argument from update_time
    (no matching commit)
  - [f2fs-dev,v7,09/13] fs: add infrastructure for multigrain timestamps
    https://git.kernel.org/jaegeuk/f2fs/c/ffb6cf19e063
  - [f2fs-dev,v7,10/13] tmpfs: add support for multigrain timestamps
    https://git.kernel.org/jaegeuk/f2fs/c/d48c33972916
  - [f2fs-dev,v7,11/13] xfs: switch to multigrain timestamps
    (no matching commit)
  - [f2fs-dev,v7,12/13] ext4: switch to multigrain timestamps
    https://git.kernel.org/jaegeuk/f2fs/c/0269b585868e
  - [f2fs-dev,v7,13/13] btrfs: convert to multigrain timestamps
    https://git.kernel.org/jaegeuk/f2fs/c/50e9ceef1d4f

You are awesome, thank you!
Jeff Layton Sept. 19, 2023, 4:31 p.m. UTC | #13
On Tue, 2023-09-19 at 16:52 +0200, Bruno Haible wrote:
> Jeff Layton wrote:
> > I'm not sure what we can do for this test. The nap() function is making
> > an assumption that the timestamp granularity will be constant, and that
> > isn't necessarily the case now.
> 
> This is only of secondary importance, because the scenario by Jan Kara
> shows a much more fundamental breakage:
> 
> > > The ultimate problem is that a sequence like:
> > > 
> > > write(f1)
> > > stat(f2)
> > > write(f2)
> > > stat(f2)
> > > write(f1)
> > > stat(f1)
> > > 
> > > can result in f1 timestamp to be (slightly) lower than the final f2
> > > timestamp because the second write to f1 didn't bother updating the
> > > timestamp. That can indeed be a bit confusing to programs if they compare
> > > timestamps between two files. Jeff?
> > > 
> > 
> > Basically yes.
> 
> f1 was last written to *after* f2 was last written to. If the timestamp of f1
> is then lower than the timestamp of f2, timestamps are fundamentally broken.
> 
> Many things in user-space depend on timestamps, such as build system
> centered around 'make', but also 'find ... -newer ...'.
> 


What does breakage with make look like in this situation? The "fuzz"
here is going to be on the order of a jiffy. The typical case for make
timestamp comparisons is comparing source files vs. a build target. If
those are being written nearly simultaneously, then that could be an
issue, but is that a typical behavior? It seems like it would be hard to
rely on that anyway, esp. given filesystems like NFS that can do lazy
writeback.

One of the operating principles with this series is that timestamps can
be of varying granularity between different files. Note that Linux
already violates this assumption when you're working across filesystems
of different types.

As to potential fixes if this is a real problem:

I don't really want to put this behind a mount or mkfs option (a'la
relatime, etc.), but that is one possibility.

I wonder if it would be feasible to just advance the coarse-grained
current_time whenever we end up updating a ctime with a fine-grained
timestamp? It might produce some inode write amplification. Files that
were written within the same jiffy could see more inode transactions
logged, but that still might not be _too_ awful.

I'll keep thinking about it for now.
Paul Eggert Sept. 19, 2023, 8:10 p.m. UTC | #14
On 2023-09-19 09:31, Jeff Layton wrote:
> The typical case for make
> timestamp comparisons is comparing source files vs. a build target. If
> those are being written nearly simultaneously, then that could be an
> issue, but is that a typical behavior?

I vaguely remember running into problems with 'make' a while ago 
(perhaps with a BSDish system) when filesystem timestamps were 
arbitrarily truncated in some cases but not others. These files would 
look older than they really were, so 'make' would think they were 
up-to-date when they weren't, and 'make' would omit actions that it 
should have done, thus screwing up the build.

File timestamps can be close together with 'make -j' on fast hosts. 
Sometimes a shell script (or 'make' itself) will run 'make', then modify 
a file F, then immediately run 'make' again; the latter 'make' won't 
work if F's timestamp is mistakenly older than targets that depend on it.

Although 'make'-like apps are the biggest canaries in this coal mine, 
the issue also affects 'find -newer' (as Bruno mentioned), 'rsync -u', 
'mv -u', 'tar -u', Emacs file-newer-than-file-p, and surely many other 
places. For example, any app that creates a timestamp file, then backs 
up all files newer than that file, would be at risk.


> I wonder if it would be feasible to just advance the coarse-grained
> current_time whenever we end up updating a ctime with a fine-grained
> timestamp?

Wouldn't this need to be done globally, that is, not just on a per-file 
or per-filesystem basis? If so, I don't see how we'd avoid locking 
performance issues.


PS. Although I'm no expert in the Linux inode code I hope you don't mind 
my asking a question about this part of inode_set_ctime_current:

	/*
	 * If we've recently updated with a fine-grained timestamp,
	 * then the coarse-grained one may still be earlier than the
	 * existing ctime. Just keep the existing value if so.
	 */
	ctime.tv_sec = inode->__i_ctime.tv_sec;
	if (timespec64_compare(&ctime, &now) > 0)
		return ctime;

Suppose root used clock_settime to set the clock backwards. Won't this 
code incorrectly refuse to update the file's timestamp afterwards? That 
is, shouldn't the last line be "goto fine_grained;" rather than "return 
ctime;", with the comment changed from "keep the existing value" to "use 
a fine-grained value"?
Jeff Layton Sept. 19, 2023, 8:46 p.m. UTC | #15
On Tue, 2023-09-19 at 13:10 -0700, Paul Eggert wrote:
> On 2023-09-19 09:31, Jeff Layton wrote:
> > The typical case for make
> > timestamp comparisons is comparing source files vs. a build target. If
> > those are being written nearly simultaneously, then that could be an
> > issue, but is that a typical behavior?
> 
> I vaguely remember running into problems with 'make' a while ago 
> (perhaps with a BSDish system) when filesystem timestamps were 
> arbitrarily truncated in some cases but not others. These files would 
> look older than they really were, so 'make' would think they were 
> up-to-date when they weren't, and 'make' would omit actions that it 
> should have done, thus screwing up the build.
> 
> File timestamps can be close together with 'make -j' on fast hosts. 
> Sometimes a shell script (or 'make' itself) will run 'make', then modify 
> a file F, then immediately run 'make' again; the latter 'make' won't 
> work if F's timestamp is mistakenly older than targets that depend on it.
> 
> Although 'make'-like apps are the biggest canaries in this coal mine, 
> the issue also affects 'find -newer' (as Bruno mentioned), 'rsync -u', 
> 'mv -u', 'tar -u', Emacs file-newer-than-file-p, and surely many other 
> places. For example, any app that creates a timestamp file, then backs 
> up all files newer than that file, would be at risk.
> 
> 
> > I wonder if it would be feasible to just advance the coarse-grained
> > current_time whenever we end up updating a ctime with a fine-grained
> > timestamp?
> 
> Wouldn't this need to be done globally, that is, not just on a per-file 
> or per-filesystem basis? If so, I don't see how we'd avoid locking 
> performance issues.
> 

Maybe. Another idea might be to introduce a new timekeeper for
multigrain filesystems, but all of those would likely have to share the
same coarse-grained clock source.

So yeah, if you stat an inode and then update it, any inode written on a
multigrain filesystem within the same jiffy-sized window would have to
log an extra transaction to write out the inode. That's what I meant
when I was talking about write amplification.

> 
> PS. Although I'm no expert in the Linux inode code I hope you don't mind 
> my asking a question about this part of inode_set_ctime_current:
> 
> 	/*
> 	 * If we've recently updated with a fine-grained timestamp,
> 	 * then the coarse-grained one may still be earlier than the
> 	 * existing ctime. Just keep the existing value if so.
> 	 */
> 	ctime.tv_sec = inode->__i_ctime.tv_sec;
> 	if (timespec64_compare(&ctime, &now) > 0)
> 		return ctime;
> 
> Suppose root used clock_settime to set the clock backwards. Won't this 
> code incorrectly refuse to update the file's timestamp afterwards? That 
> is, shouldn't the last line be "goto fine_grained;" rather than "return 
> ctime;", with the comment changed from "keep the existing value" to "use 
> a fine-grained value"?

It is a problem, and Linus pointed that out yesterday, which is why I
sent this earlier today:

https://lore.kernel.org/linux-fsdevel/20230919-ctime-v1-1-97b3da92f504@kernel.org/T/#u

Bear in mind that we're not dealing with a situation where the value has
not been queried since its last update, so we don't need to use a fine
grained timestamp there (and really, it's preferable not to do so). A
coarse one should be fine in this case.
Christian Brauner Sept. 20, 2023, 8:41 a.m. UTC | #16
> > f1 was last written to *after* f2 was last written to. If the timestamp of f1
> > is then lower than the timestamp of f2, timestamps are fundamentally broken.
> > 
> > Many things in user-space depend on timestamps, such as build system
> > centered around 'make', but also 'find ... -newer ...'.
> > 
> 
> 
> What does breakage with make look like in this situation? The "fuzz"
> here is going to be on the order of a jiffy. The typical case for make
> timestamp comparisons is comparing source files vs. a build target. If
> those are being written nearly simultaneously, then that could be an
> issue, but is that a typical behavior? It seems like it would be hard to
> rely on that anyway, esp. given filesystems like NFS that can do lazy
> writeback.
> 
> One of the operating principles with this series is that timestamps can
> be of varying granularity between different files. Note that Linux
> already violates this assumption when you're working across filesystems
> of different types.
> 
> As to potential fixes if this is a real problem:
> 
> I don't really want to put this behind a mount or mkfs option (a'la
> relatime, etc.), but that is one possibility.
> 
> I wonder if it would be feasible to just advance the coarse-grained
> current_time whenever we end up updating a ctime with a fine-grained
> timestamp? It might produce some inode write amplification. Files that

Less than ideal imho.

If this risks breaking existing workloads by enabling it unconditionally
and there isn't a clear way to detect and handle these situations
without risk of regression then we should move this behind a mount
option.

So how about the following:
Xi Ruoyao Sept. 20, 2023, 8:50 a.m. UTC | #17
On Wed, 2023-09-20 at 10:41 +0200, Christian Brauner wrote:
> > > f1 was last written to *after* f2 was last written to. If the timestamp of f1
> > > is then lower than the timestamp of f2, timestamps are fundamentally broken.
> > > 
> > > Many things in user-space depend on timestamps, such as build system
> > > centered around 'make', but also 'find ... -newer ...'.
> > > 
> > 
> > 
> > What does breakage with make look like in this situation? The "fuzz"
> > here is going to be on the order of a jiffy. The typical case for make
> > timestamp comparisons is comparing source files vs. a build target. If
> > those are being written nearly simultaneously, then that could be an
> > issue, but is that a typical behavior? It seems like it would be hard to
> > rely on that anyway, esp. given filesystems like NFS that can do lazy
> > writeback.
> > 
> > One of the operating principles with this series is that timestamps can
> > be of varying granularity between different files. Note that Linux
> > already violates this assumption when you're working across filesystems
> > of different types.
> > 
> > As to potential fixes if this is a real problem:
> > 
> > I don't really want to put this behind a mount or mkfs option (a'la
> > relatime, etc.), but that is one possibility.
> > 
> > I wonder if it would be feasible to just advance the coarse-grained
> > current_time whenever we end up updating a ctime with a fine-grained
> > timestamp? It might produce some inode write amplification. Files that
> 
> Less than ideal imho.
> 
> If this risks breaking existing workloads by enabling it unconditionally
> and there isn't a clear way to detect and handle these situations
> without risk of regression then we should move this behind a mount
> option.
> 
> So how about the following:
> 
> From cb14add421967f6e374eb77c36cc4a0526b10d17 Mon Sep 17 00:00:00 2001
> From: Christian Brauner <brauner@kernel.org>
> Date: Wed, 20 Sep 2023 10:00:08 +0200
> Subject: [PATCH] vfs: move multi-grain timestamps behind a mount option
> 
> While we initially thought we can do this unconditionally it turns out
> that this might break existing workloads that rely on timestamps in very
> specific ways and we always knew this was a possibility. Move
> multi-grain timestamps behind a vfs mount option.

I agree with this solution.

You can add some metainfo:

Reported-by: Ken Moffat <ken@linuxfromscratch.org>
Bisected-by: Xi Ruoyao <xry111@linuxfromscratch.org>
Link: https://lists.linuxfromscratch.org/sympa/arc/lfs-dev/2023-09/msg00036.html

> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/fs_context.c     | 18 ++++++++++++++++++
>  fs/inode.c          |  4 ++--
>  fs/proc_namespace.c |  1 +
>  fs/stat.c           |  2 +-
>  include/linux/fs.h  |  4 +++-
>  5 files changed, 25 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/fs_context.c b/fs/fs_context.c
> index a0ad7a0c4680..dd4dade0bb9e 100644
> --- a/fs/fs_context.c
> +++ b/fs/fs_context.c
> @@ -44,6 +44,7 @@ static const struct constant_table common_set_sb_flag[] = {
>  	{ "mand",	SB_MANDLOCK },
>  	{ "ro",		SB_RDONLY },
>  	{ "sync",	SB_SYNCHRONOUS },
> +	{ "mgtime",	SB_MGTIME },
>  	{ },
>  };
>  
> @@ -52,18 +53,32 @@ static const struct constant_table common_clear_sb_flag[] = {
>  	{ "nolazytime",	SB_LAZYTIME },
>  	{ "nomand",	SB_MANDLOCK },
>  	{ "rw",		SB_RDONLY },
> +	{ "nomgtime",	SB_MGTIME },
>  	{ },
>  };
>  
> +static inline int check_mgtime(unsigned int token, const struct fs_context *fc)
> +{
> +	if (token != SB_MGTIME)
> +		return 0;
> +	if (!(fc->fs_type->fs_flags & FS_MGTIME))
> +		return invalf(fc, "Filesystem doesn't support multi-grain timestamps");
> +	return 0;
> +}
> +
>  /*
>   * Check for a common mount option that manipulates s_flags.
>   */
>  static int vfs_parse_sb_flag(struct fs_context *fc, const char *key)
>  {
>  	unsigned int token;
> +	int ret;
>  
>  	token = lookup_constant(common_set_sb_flag, key, 0);
>  	if (token) {
> +		ret = check_mgtime(token, fc);
> +		if (ret)
> +			return ret;
>  		fc->sb_flags |= token;
>  		fc->sb_flags_mask |= token;
>  		return 0;
> @@ -71,6 +86,9 @@ static int vfs_parse_sb_flag(struct fs_context *fc, const char *key)
>  
>  	token = lookup_constant(common_clear_sb_flag, key, 0);
>  	if (token) {
> +		ret = check_mgtime(token, fc);
> +		if (ret)
> +			return ret;
>  		fc->sb_flags &= ~token;
>  		fc->sb_flags_mask |= token;
>  		return 0;
> diff --git a/fs/inode.c b/fs/inode.c
> index 54237f4242ff..fd1a2390aaa3 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -2141,7 +2141,7 @@ EXPORT_SYMBOL(current_mgtime);
>  
>  static struct timespec64 current_ctime(struct inode *inode)
>  {
> -	if (is_mgtime(inode))
> +	if (IS_MGTIME(inode))
>  		return current_mgtime(inode);
>  	return current_time(inode);
>  }
> @@ -2588,7 +2588,7 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
>  		now = current_time(inode);
>  
>  		/* Just copy it into place if it's not multigrain */
> -		if (!is_mgtime(inode)) {
> +		if (!IS_MGTIME(inode)) {
>  			inode_set_ctime_to_ts(inode, now);
>  			return now;
>  		}
> diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
> index 250eb5bf7b52..08f5bf4d2c6c 100644
> --- a/fs/proc_namespace.c
> +++ b/fs/proc_namespace.c
> @@ -49,6 +49,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
>  		{ SB_DIRSYNC, ",dirsync" },
>  		{ SB_MANDLOCK, ",mand" },
>  		{ SB_LAZYTIME, ",lazytime" },
> +		{ SB_MGTIME, ",mgtime" },
>  		{ 0, NULL }
>  	};
>  	const struct proc_fs_opts *fs_infop;
> diff --git a/fs/stat.c b/fs/stat.c
> index 6e60389d6a15..2f18dd5de18b 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -90,7 +90,7 @@ void generic_fillattr(struct mnt_idmap *idmap, u32 request_mask,
>  	stat->size = i_size_read(inode);
>  	stat->atime = inode->i_atime;
>  
> -	if (is_mgtime(inode)) {
> +	if (IS_MGTIME(inode)) {
>  		fill_mg_cmtime(stat, request_mask, inode);
>  	} else {
>  		stat->mtime = inode->i_mtime;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 4aeb3fa11927..03e415fb3a7c 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1114,6 +1114,7 @@ extern int send_sigurg(struct fown_struct *fown);
>  #define SB_NODEV        BIT(2)	/* Disallow access to device special files */
>  #define SB_NOEXEC       BIT(3)	/* Disallow program execution */
>  #define SB_SYNCHRONOUS  BIT(4)	/* Writes are synced at once */
> +#define SB_MGTIME	BIT(5)	/* Use multi-grain timestamps */
>  #define SB_MANDLOCK     BIT(6)	/* Allow mandatory locks on an FS */
>  #define SB_DIRSYNC      BIT(7)	/* Directory modifications are synchronous */
>  #define SB_NOATIME      BIT(10)	/* Do not update access times. */
> @@ -2105,6 +2106,7 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags
>  					((inode)->i_flags & (S_SYNC|S_DIRSYNC)))
>  #define IS_MANDLOCK(inode)	__IS_FLG(inode, SB_MANDLOCK)
>  #define IS_NOATIME(inode)	__IS_FLG(inode, SB_RDONLY|SB_NOATIME)
> +#define IS_MGTIME(inode)	__IS_FLG(inode, SB_MGTIME)
>  #define IS_I_VERSION(inode)	__IS_FLG(inode, SB_I_VERSION)
>  
>  #define IS_NOQUOTA(inode)	((inode)->i_flags & S_NOQUOTA)
> @@ -2366,7 +2368,7 @@ struct file_system_type {
>   */
>  static inline bool is_mgtime(const struct inode *inode)
>  {
> -	return inode->i_sb->s_type->fs_flags & FS_MGTIME;
> +	return inode->i_sb->s_flags & SB_MGTIME;
>  }
>  
>  extern struct dentry *mount_bdev(struct file_system_type *fs_type,
Jeff Layton Sept. 20, 2023, 9:56 a.m. UTC | #18
On Wed, 2023-09-20 at 10:41 +0200, Christian Brauner wrote:
> > > f1 was last written to *after* f2 was last written to. If the timestamp of f1
> > > is then lower than the timestamp of f2, timestamps are fundamentally broken.
> > > 
> > > Many things in user-space depend on timestamps, such as build system
> > > centered around 'make', but also 'find ... -newer ...'.
> > > 
> > 
> > 
> > What does breakage with make look like in this situation? The "fuzz"
> > here is going to be on the order of a jiffy. The typical case for make
> > timestamp comparisons is comparing source files vs. a build target. If
> > those are being written nearly simultaneously, then that could be an
> > issue, but is that a typical behavior? It seems like it would be hard to
> > rely on that anyway, esp. given filesystems like NFS that can do lazy
> > writeback.
> > 
> > One of the operating principles with this series is that timestamps can
> > be of varying granularity between different files. Note that Linux
> > already violates this assumption when you're working across filesystems
> > of different types.
> > 
> > As to potential fixes if this is a real problem:
> > 
> > I don't really want to put this behind a mount or mkfs option (a'la
> > relatime, etc.), but that is one possibility.
> > 
> > I wonder if it would be feasible to just advance the coarse-grained
> > current_time whenever we end up updating a ctime with a fine-grained
> > timestamp? It might produce some inode write amplification. Files that
> 
> Less than ideal imho.
> 
> If this risks breaking existing workloads by enabling it unconditionally
> and there isn't a clear way to detect and handle these situations
> without risk of regression then we should move this behind a mount
> option.
> 
> So how about the following:
> 
> From cb14add421967f6e374eb77c36cc4a0526b10d17 Mon Sep 17 00:00:00 2001
> From: Christian Brauner <brauner@kernel.org>
> Date: Wed, 20 Sep 2023 10:00:08 +0200
> Subject: [PATCH] vfs: move multi-grain timestamps behind a mount option
> 
> While we initially thought we can do this unconditionally it turns out
> that this might break existing workloads that rely on timestamps in very
> specific ways and we always knew this was a possibility. Move
> multi-grain timestamps behind a vfs mount option.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/fs_context.c     | 18 ++++++++++++++++++
>  fs/inode.c          |  4 ++--
>  fs/proc_namespace.c |  1 +
>  fs/stat.c           |  2 +-
>  include/linux/fs.h  |  4 +++-
>  5 files changed, 25 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/fs_context.c b/fs/fs_context.c
> index a0ad7a0c4680..dd4dade0bb9e 100644
> --- a/fs/fs_context.c
> +++ b/fs/fs_context.c
> @@ -44,6 +44,7 @@ static const struct constant_table common_set_sb_flag[] = {
>  	{ "mand",	SB_MANDLOCK },
>  	{ "ro",		SB_RDONLY },
>  	{ "sync",	SB_SYNCHRONOUS },
> +	{ "mgtime",	SB_MGTIME },
>  	{ },
>  };
>  
> 
> @@ -52,18 +53,32 @@ static const struct constant_table common_clear_sb_flag[] = {
>  	{ "nolazytime",	SB_LAZYTIME },
>  	{ "nomand",	SB_MANDLOCK },
>  	{ "rw",		SB_RDONLY },
> +	{ "nomgtime",	SB_MGTIME },
>  	{ },
>  };
>  
> 
> +static inline int check_mgtime(unsigned int token, const struct fs_context *fc)
> +{
> +	if (token != SB_MGTIME)
> +		return 0;
> +	if (!(fc->fs_type->fs_flags & FS_MGTIME))
> +		return invalf(fc, "Filesystem doesn't support multi-grain timestamps");
> +	return 0;
> +}
> +
>  /*
>   * Check for a common mount option that manipulates s_flags.
>   */
>  static int vfs_parse_sb_flag(struct fs_context *fc, const char *key)
>  {
>  	unsigned int token;
> +	int ret;
>  
> 
>  	token = lookup_constant(common_set_sb_flag, key, 0);
>  	if (token) {
> +		ret = check_mgtime(token, fc);
> +		if (ret)
> +			return ret;
>  		fc->sb_flags |= token;
>  		fc->sb_flags_mask |= token;
>  		return 0;
> @@ -71,6 +86,9 @@ static int vfs_parse_sb_flag(struct fs_context *fc, const char *key)
>  
> 
>  	token = lookup_constant(common_clear_sb_flag, key, 0);
>  	if (token) {
> +		ret = check_mgtime(token, fc);
> +		if (ret)
> +			return ret;
>  		fc->sb_flags &= ~token;
>  		fc->sb_flags_mask |= token;
>  		return 0;
> diff --git a/fs/inode.c b/fs/inode.c
> index 54237f4242ff..fd1a2390aaa3 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -2141,7 +2141,7 @@ EXPORT_SYMBOL(current_mgtime);
>  
> 
>  static struct timespec64 current_ctime(struct inode *inode)
>  {
> -	if (is_mgtime(inode))
> +	if (IS_MGTIME(inode))
>  		return current_mgtime(inode);
>  	return current_time(inode);
>  }
> @@ -2588,7 +2588,7 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
>  		now = current_time(inode);
>  
> 
>  		/* Just copy it into place if it's not multigrain */
> -		if (!is_mgtime(inode)) {
> +		if (!IS_MGTIME(inode)) {
>  			inode_set_ctime_to_ts(inode, now);
>  			return now;
>  		}
> diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
> index 250eb5bf7b52..08f5bf4d2c6c 100644
> --- a/fs/proc_namespace.c
> +++ b/fs/proc_namespace.c
> @@ -49,6 +49,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
>  		{ SB_DIRSYNC, ",dirsync" },
>  		{ SB_MANDLOCK, ",mand" },
>  		{ SB_LAZYTIME, ",lazytime" },
> +		{ SB_MGTIME, ",mgtime" },
>  		{ 0, NULL }
>  	};
>  	const struct proc_fs_opts *fs_infop;
> diff --git a/fs/stat.c b/fs/stat.c
> index 6e60389d6a15..2f18dd5de18b 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -90,7 +90,7 @@ void generic_fillattr(struct mnt_idmap *idmap, u32 request_mask,
>  	stat->size = i_size_read(inode);
>  	stat->atime = inode->i_atime;
>  
> 
> -	if (is_mgtime(inode)) {
> +	if (IS_MGTIME(inode)) {
>  		fill_mg_cmtime(stat, request_mask, inode);
>  	} else {
>  		stat->mtime = inode->i_mtime;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 4aeb3fa11927..03e415fb3a7c 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1114,6 +1114,7 @@ extern int send_sigurg(struct fown_struct *fown);
>  #define SB_NODEV        BIT(2)	/* Disallow access to device special files */
>  #define SB_NOEXEC       BIT(3)	/* Disallow program execution */
>  #define SB_SYNCHRONOUS  BIT(4)	/* Writes are synced at once */
> +#define SB_MGTIME	BIT(5)	/* Use multi-grain timestamps */
>  #define SB_MANDLOCK     BIT(6)	/* Allow mandatory locks on an FS */
>  #define SB_DIRSYNC      BIT(7)	/* Directory modifications are synchronous */
>  #define SB_NOATIME      BIT(10)	/* Do not update access times. */
> @@ -2105,6 +2106,7 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags
>  					((inode)->i_flags & (S_SYNC|S_DIRSYNC)))
>  #define IS_MANDLOCK(inode)	__IS_FLG(inode, SB_MANDLOCK)
>  #define IS_NOATIME(inode)	__IS_FLG(inode, SB_RDONLY|SB_NOATIME)
> +#define IS_MGTIME(inode)	__IS_FLG(inode, SB_MGTIME)
>  #define IS_I_VERSION(inode)	__IS_FLG(inode, SB_I_VERSION)
>  
> 
>  #define IS_NOQUOTA(inode)	((inode)->i_flags & S_NOQUOTA)
> @@ -2366,7 +2368,7 @@ struct file_system_type {
>   */
>  static inline bool is_mgtime(const struct inode *inode)
>  {
> -	return inode->i_sb->s_type->fs_flags & FS_MGTIME;
> +	return inode->i_sb->s_flags & SB_MGTIME;
>  }
>  
> 
>  extern struct dentry *mount_bdev(struct file_system_type *fs_type,

The mount option looks reasonable. Thanks for throwing together the
patch. Maybe in the future we can come up with a way to mitigate the
problems and do this unconditionally?

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Jan Kara Sept. 20, 2023, 9:58 a.m. UTC | #19
On Tue 19-09-23 12:31:08, Jeff Layton wrote:
> On Tue, 2023-09-19 at 16:52 +0200, Bruno Haible wrote:
> > Jeff Layton wrote:
> > > I'm not sure what we can do for this test. The nap() function is making
> > > an assumption that the timestamp granularity will be constant, and that
> > > isn't necessarily the case now.
> > 
> > This is only of secondary importance, because the scenario by Jan Kara
> > shows a much more fundamental breakage:
> > 
> > > > The ultimate problem is that a sequence like:
> > > > 
> > > > write(f1)
> > > > stat(f2)
> > > > write(f2)
> > > > stat(f2)
> > > > write(f1)
> > > > stat(f1)
> > > > 
> > > > can result in f1 timestamp to be (slightly) lower than the final f2
> > > > timestamp because the second write to f1 didn't bother updating the
> > > > timestamp. That can indeed be a bit confusing to programs if they compare
> > > > timestamps between two files. Jeff?
> > > > 
> > > 
> > > Basically yes.
> > 
> > f1 was last written to *after* f2 was last written to. If the timestamp of f1
> > is then lower than the timestamp of f2, timestamps are fundamentally broken.
> > 
> > Many things in user-space depend on timestamps, such as build system
> > centered around 'make', but also 'find ... -newer ...'.
> > 
> 
> 
> What does breakage with make look like in this situation? The "fuzz"
> here is going to be on the order of a jiffy. The typical case for make
> timestamp comparisons is comparing source files vs. a build target. If
> those are being written nearly simultaneously, then that could be an
> issue, but is that a typical behavior? It seems like it would be hard to
> rely on that anyway, esp. given filesystems like NFS that can do lazy
> writeback.

TL;DR I don't think we can just wave away the change as "the problem has
always been there".

Firstly, the fact that something is not quite reliable on NFS doesn't mean
people don't rely on the behavior on local filesystems. NFS has a
historical reputation of being a bit weird ;). Secondly, I agree that the
same problems can manifest currently for files on two filesystems with
different timestamp granularity. But again that is something that is rare -
widely used filesystems have a granularity of a jiffy and in most cases
build and source files are on the same filesystem anyway. So yes, in
principle the problems could happen even before multigrain timestamps but
having different granularity per inode just made them manifest in much much
more setups and that matters because setups that were perfectly fine before
are not anymore.

> One of the operating principles with this series is that timestamps can
> be of varying granularity between different files. Note that Linux
> already violates this assumption when you're working across filesystems
> of different types.
> 
> As to potential fixes if this is a real problem:

Regarding whether the problem is real: I wouldn't worry too much about the
particular test that started this thread. That seems like something very
special. But the build system issues could be real - as you wrote in your
motivation for the series - a lot can happen in a jiffy on contemporary
computers. I can imagine build product having newer timestamp than build
source because the modification of source managed to squeeze into the same
jiffy and still use a coarse-grained timestamp. Or some other
producer-consumer type of setup... Sure usually there would be enough
stat(2) calls on both sides to force finegrained timestamps on both files
but if there are not in some corner case, debugging the problem is really
tough.

> I don't really want to put this behind a mount or mkfs option (a'la
> relatime, etc.), but that is one possibility.
> 
> I wonder if it would be feasible to just advance the coarse-grained
> current_time whenever we end up updating a ctime with a fine-grained
> timestamp? It might produce some inode write amplification. Files that
> were written within the same jiffy could see more inode transactions
> logged, but that still might not be _too_ awful.
Jan Kara Sept. 20, 2023, 10:17 a.m. UTC | #20
On Wed 20-09-23 10:41:30, Christian Brauner wrote:
> > > f1 was last written to *after* f2 was last written to. If the timestamp of f1
> > > is then lower than the timestamp of f2, timestamps are fundamentally broken.
> > > 
> > > Many things in user-space depend on timestamps, such as build system
> > > centered around 'make', but also 'find ... -newer ...'.
> > > 
> > 
> > 
> > What does breakage with make look like in this situation? The "fuzz"
> > here is going to be on the order of a jiffy. The typical case for make
> > timestamp comparisons is comparing source files vs. a build target. If
> > those are being written nearly simultaneously, then that could be an
> > issue, but is that a typical behavior? It seems like it would be hard to
> > rely on that anyway, esp. given filesystems like NFS that can do lazy
> > writeback.
> > 
> > One of the operating principles with this series is that timestamps can
> > be of varying granularity between different files. Note that Linux
> > already violates this assumption when you're working across filesystems
> > of different types.
> > 
> > As to potential fixes if this is a real problem:
> > 
> > I don't really want to put this behind a mount or mkfs option (a'la
> > relatime, etc.), but that is one possibility.
> > 
> > I wonder if it would be feasible to just advance the coarse-grained
> > current_time whenever we end up updating a ctime with a fine-grained
> > timestamp? It might produce some inode write amplification. Files that
> 
> Less than ideal imho.
> 
> If this risks breaking existing workloads by enabling it unconditionally
> and there isn't a clear way to detect and handle these situations
> without risk of regression then we should move this behind a mount
> option.
> 
> So how about the following:
> 
> From cb14add421967f6e374eb77c36cc4a0526b10d17 Mon Sep 17 00:00:00 2001
> From: Christian Brauner <brauner@kernel.org>
> Date: Wed, 20 Sep 2023 10:00:08 +0200
> Subject: [PATCH] vfs: move multi-grain timestamps behind a mount option
> 
> While we initially thought we can do this unconditionally it turns out
> that this might break existing workloads that rely on timestamps in very
> specific ways and we always knew this was a possibility. Move
> multi-grain timestamps behind a vfs mount option.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Surely this is a safe choice as it moves the responsibility to the sysadmin
and the cases where finegrained timestamps are required. But I kind of
wonder how is the sysadmin going to decide whether mgtime is safe for his
system or not? Because the possible breakage needn't be obvious at the
first sight... If I were a sysadmin, I'd rather opt for something like
finegrained timestamps + lazytime (if I needed the finegrained timestamps
functionality). That should avoid the IO overhead of finegrained timestamps
as well and I'd know I can have problems with timestamps only after a
system crash.

I've just got another idea how we could solve the problem: Couldn't we
always just report coarsegrained timestamp to userspace and provide access
to finegrained value only to NFS which should know what it's doing?

								Honza

> ---
>  fs/fs_context.c     | 18 ++++++++++++++++++
>  fs/inode.c          |  4 ++--
>  fs/proc_namespace.c |  1 +
>  fs/stat.c           |  2 +-
>  include/linux/fs.h  |  4 +++-
>  5 files changed, 25 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/fs_context.c b/fs/fs_context.c
> index a0ad7a0c4680..dd4dade0bb9e 100644
> --- a/fs/fs_context.c
> +++ b/fs/fs_context.c
> @@ -44,6 +44,7 @@ static const struct constant_table common_set_sb_flag[] = {
>  	{ "mand",	SB_MANDLOCK },
>  	{ "ro",		SB_RDONLY },
>  	{ "sync",	SB_SYNCHRONOUS },
> +	{ "mgtime",	SB_MGTIME },
>  	{ },
>  };
>  
> @@ -52,18 +53,32 @@ static const struct constant_table common_clear_sb_flag[] = {
>  	{ "nolazytime",	SB_LAZYTIME },
>  	{ "nomand",	SB_MANDLOCK },
>  	{ "rw",		SB_RDONLY },
> +	{ "nomgtime",	SB_MGTIME },
>  	{ },
>  };
>  
> +static inline int check_mgtime(unsigned int token, const struct fs_context *fc)
> +{
> +	if (token != SB_MGTIME)
> +		return 0;
> +	if (!(fc->fs_type->fs_flags & FS_MGTIME))
> +		return invalf(fc, "Filesystem doesn't support multi-grain timestamps");
> +	return 0;
> +}
> +
>  /*
>   * Check for a common mount option that manipulates s_flags.
>   */
>  static int vfs_parse_sb_flag(struct fs_context *fc, const char *key)
>  {
>  	unsigned int token;
> +	int ret;
>  
>  	token = lookup_constant(common_set_sb_flag, key, 0);
>  	if (token) {
> +		ret = check_mgtime(token, fc);
> +		if (ret)
> +			return ret;
>  		fc->sb_flags |= token;
>  		fc->sb_flags_mask |= token;
>  		return 0;
> @@ -71,6 +86,9 @@ static int vfs_parse_sb_flag(struct fs_context *fc, const char *key)
>  
>  	token = lookup_constant(common_clear_sb_flag, key, 0);
>  	if (token) {
> +		ret = check_mgtime(token, fc);
> +		if (ret)
> +			return ret;
>  		fc->sb_flags &= ~token;
>  		fc->sb_flags_mask |= token;
>  		return 0;
> diff --git a/fs/inode.c b/fs/inode.c
> index 54237f4242ff..fd1a2390aaa3 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -2141,7 +2141,7 @@ EXPORT_SYMBOL(current_mgtime);
>  
>  static struct timespec64 current_ctime(struct inode *inode)
>  {
> -	if (is_mgtime(inode))
> +	if (IS_MGTIME(inode))
>  		return current_mgtime(inode);
>  	return current_time(inode);
>  }
> @@ -2588,7 +2588,7 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
>  		now = current_time(inode);
>  
>  		/* Just copy it into place if it's not multigrain */
> -		if (!is_mgtime(inode)) {
> +		if (!IS_MGTIME(inode)) {
>  			inode_set_ctime_to_ts(inode, now);
>  			return now;
>  		}
> diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
> index 250eb5bf7b52..08f5bf4d2c6c 100644
> --- a/fs/proc_namespace.c
> +++ b/fs/proc_namespace.c
> @@ -49,6 +49,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
>  		{ SB_DIRSYNC, ",dirsync" },
>  		{ SB_MANDLOCK, ",mand" },
>  		{ SB_LAZYTIME, ",lazytime" },
> +		{ SB_MGTIME, ",mgtime" },
>  		{ 0, NULL }
>  	};
>  	const struct proc_fs_opts *fs_infop;
> diff --git a/fs/stat.c b/fs/stat.c
> index 6e60389d6a15..2f18dd5de18b 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -90,7 +90,7 @@ void generic_fillattr(struct mnt_idmap *idmap, u32 request_mask,
>  	stat->size = i_size_read(inode);
>  	stat->atime = inode->i_atime;
>  
> -	if (is_mgtime(inode)) {
> +	if (IS_MGTIME(inode)) {
>  		fill_mg_cmtime(stat, request_mask, inode);
>  	} else {
>  		stat->mtime = inode->i_mtime;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 4aeb3fa11927..03e415fb3a7c 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1114,6 +1114,7 @@ extern int send_sigurg(struct fown_struct *fown);
>  #define SB_NODEV        BIT(2)	/* Disallow access to device special files */
>  #define SB_NOEXEC       BIT(3)	/* Disallow program execution */
>  #define SB_SYNCHRONOUS  BIT(4)	/* Writes are synced at once */
> +#define SB_MGTIME	BIT(5)	/* Use multi-grain timestamps */
>  #define SB_MANDLOCK     BIT(6)	/* Allow mandatory locks on an FS */
>  #define SB_DIRSYNC      BIT(7)	/* Directory modifications are synchronous */
>  #define SB_NOATIME      BIT(10)	/* Do not update access times. */
> @@ -2105,6 +2106,7 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags
>  					((inode)->i_flags & (S_SYNC|S_DIRSYNC)))
>  #define IS_MANDLOCK(inode)	__IS_FLG(inode, SB_MANDLOCK)
>  #define IS_NOATIME(inode)	__IS_FLG(inode, SB_RDONLY|SB_NOATIME)
> +#define IS_MGTIME(inode)	__IS_FLG(inode, SB_MGTIME)
>  #define IS_I_VERSION(inode)	__IS_FLG(inode, SB_I_VERSION)
>  
>  #define IS_NOQUOTA(inode)	((inode)->i_flags & S_NOQUOTA)
> @@ -2366,7 +2368,7 @@ struct file_system_type {
>   */
>  static inline bool is_mgtime(const struct inode *inode)
>  {
> -	return inode->i_sb->s_type->fs_flags & FS_MGTIME;
> +	return inode->i_sb->s_flags & SB_MGTIME;
>  }
>  
>  extern struct dentry *mount_bdev(struct file_system_type *fs_type,
> -- 
> 2.34.1
>
Christian Brauner Sept. 20, 2023, 10:30 a.m. UTC | #21
On Wed, Sep 20, 2023 at 12:17:31PM +0200, Jan Kara wrote:
> On Wed 20-09-23 10:41:30, Christian Brauner wrote:
> > > > f1 was last written to *after* f2 was last written to. If the timestamp of f1
> > > > is then lower than the timestamp of f2, timestamps are fundamentally broken.
> > > > 
> > > > Many things in user-space depend on timestamps, such as build system
> > > > centered around 'make', but also 'find ... -newer ...'.
> > > > 
> > > 
> > > 
> > > What does breakage with make look like in this situation? The "fuzz"
> > > here is going to be on the order of a jiffy. The typical case for make
> > > timestamp comparisons is comparing source files vs. a build target. If
> > > those are being written nearly simultaneously, then that could be an
> > > issue, but is that a typical behavior? It seems like it would be hard to
> > > rely on that anyway, esp. given filesystems like NFS that can do lazy
> > > writeback.
> > > 
> > > One of the operating principles with this series is that timestamps can
> > > be of varying granularity between different files. Note that Linux
> > > already violates this assumption when you're working across filesystems
> > > of different types.
> > > 
> > > As to potential fixes if this is a real problem:
> > > 
> > > I don't really want to put this behind a mount or mkfs option (a'la
> > > relatime, etc.), but that is one possibility.
> > > 
> > > I wonder if it would be feasible to just advance the coarse-grained
> > > current_time whenever we end up updating a ctime with a fine-grained
> > > timestamp? It might produce some inode write amplification. Files that
> > 
> > Less than ideal imho.
> > 
> > If this risks breaking existing workloads by enabling it unconditionally
> > and there isn't a clear way to detect and handle these situations
> > without risk of regression then we should move this behind a mount
> > option.
> > 
> > So how about the following:
> > 
> > From cb14add421967f6e374eb77c36cc4a0526b10d17 Mon Sep 17 00:00:00 2001
> > From: Christian Brauner <brauner@kernel.org>
> > Date: Wed, 20 Sep 2023 10:00:08 +0200
> > Subject: [PATCH] vfs: move multi-grain timestamps behind a mount option
> > 
> > While we initially thought we can do this unconditionally it turns out
> > that this might break existing workloads that rely on timestamps in very
> > specific ways and we always knew this was a possibility. Move
> > multi-grain timestamps behind a vfs mount option.
> > 
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> 
> Surely this is a safe choice as it moves the responsibility to the sysadmin
> and the cases where finegrained timestamps are required. But I kind of
> wonder how is the sysadmin going to decide whether mgtime is safe for his
> system or not? Because the possible breakage needn't be obvious at the
> first sight... If I were a sysadmin, I'd rather opt for something like

I think you'll basically enable this because you want to export a
filesystem via NFS.

> finegrained timestamps + lazytime (if I needed the finegrained timestamps
> functionality). That should avoid the IO overhead of finegrained timestamps

That would work with this patch, no? Or are you saying it would need
something else?

> as well and I'd know I can have problems with timestamps only after a
> system crash.
> 
> I've just got another idea how we could solve the problem: Couldn't we
> always just report coarsegrained timestamp to userspace and provide access
> to finegrained value only to NFS which should know what it's doing?

What would changes would be involved for that?

If this is invasive work and we decide this is something that we want to
do then we should remove FS_MGTIME from btrfs, xfs, ext4, and tmpfs for
v6.6.
Jeff Layton Sept. 20, 2023, 10:35 a.m. UTC | #22
On Wed, 2023-09-20 at 12:17 +0200, Jan Kara wrote:
> On Wed 20-09-23 10:41:30, Christian Brauner wrote:
> > > > f1 was last written to *after* f2 was last written to. If the timestamp of f1
> > > > is then lower than the timestamp of f2, timestamps are fundamentally broken.
> > > > 
> > > > Many things in user-space depend on timestamps, such as build system
> > > > centered around 'make', but also 'find ... -newer ...'.
> > > > 
> > > 
> > > 
> > > What does breakage with make look like in this situation? The "fuzz"
> > > here is going to be on the order of a jiffy. The typical case for make
> > > timestamp comparisons is comparing source files vs. a build target. If
> > > those are being written nearly simultaneously, then that could be an
> > > issue, but is that a typical behavior? It seems like it would be hard to
> > > rely on that anyway, esp. given filesystems like NFS that can do lazy
> > > writeback.
> > > 
> > > One of the operating principles with this series is that timestamps can
> > > be of varying granularity between different files. Note that Linux
> > > already violates this assumption when you're working across filesystems
> > > of different types.
> > > 
> > > As to potential fixes if this is a real problem:
> > > 
> > > I don't really want to put this behind a mount or mkfs option (a'la
> > > relatime, etc.), but that is one possibility.
> > > 
> > > I wonder if it would be feasible to just advance the coarse-grained
> > > current_time whenever we end up updating a ctime with a fine-grained
> > > timestamp? It might produce some inode write amplification. Files that
> > 
> > Less than ideal imho.
> > 
> > If this risks breaking existing workloads by enabling it unconditionally
> > and there isn't a clear way to detect and handle these situations
> > without risk of regression then we should move this behind a mount
> > option.
> > 
> > So how about the following:
> > 
> > From cb14add421967f6e374eb77c36cc4a0526b10d17 Mon Sep 17 00:00:00 2001
> > From: Christian Brauner <brauner@kernel.org>
> > Date: Wed, 20 Sep 2023 10:00:08 +0200
> > Subject: [PATCH] vfs: move multi-grain timestamps behind a mount option
> > 
> > While we initially thought we can do this unconditionally it turns out
> > that this might break existing workloads that rely on timestamps in very
> > specific ways and we always knew this was a possibility. Move
> > multi-grain timestamps behind a vfs mount option.
> > 
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> 
> Surely this is a safe choice as it moves the responsibility to the sysadmin
> and the cases where finegrained timestamps are required. But I kind of
> wonder how is the sysadmin going to decide whether mgtime is safe for his
> system or not? Because the possible breakage needn't be obvious at the
> first sight...
> 

That's the main reason I really didn't want to go with a mount option.
Documenting that may be difficult. While there is some pessimism around
it, I may still take a stab at just advancing the coarse clock whenever
we fetch a fine-grained timestamp. It'd be nice to remove this option in
the future if that turns out to be feasible.

> If I were a sysadmin, I'd rather opt for something like
> finegrained timestamps + lazytime (if I needed the finegrained timestamps
> functionality). That should avoid the IO overhead of finegrained timestamps
> as well and I'd know I can have problems with timestamps only after a
> system crash.

> I've just got another idea how we could solve the problem: Couldn't we
> always just report coarsegrained timestamp to userspace and provide access
> to finegrained value only to NFS which should know what it's doing?
> 

I think that'd be hard. First of all, where would we store the second
timestamp? We can't just truncate the fine-grained ones to come up with
a coarse-grained one. It might also be confusing having nfsd and local
filesystems present different attributes.


> > ---
> >  fs/fs_context.c     | 18 ++++++++++++++++++
> >  fs/inode.c          |  4 ++--
> >  fs/proc_namespace.c |  1 +
> >  fs/stat.c           |  2 +-
> >  include/linux/fs.h  |  4 +++-
> >  5 files changed, 25 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/fs_context.c b/fs/fs_context.c
> > index a0ad7a0c4680..dd4dade0bb9e 100644
> > --- a/fs/fs_context.c
> > +++ b/fs/fs_context.c
> > @@ -44,6 +44,7 @@ static const struct constant_table common_set_sb_flag[] = {
> >  	{ "mand",	SB_MANDLOCK },
> >  	{ "ro",		SB_RDONLY },
> >  	{ "sync",	SB_SYNCHRONOUS },
> > +	{ "mgtime",	SB_MGTIME },
> >  	{ },
> >  };
> >  
> > @@ -52,18 +53,32 @@ static const struct constant_table common_clear_sb_flag[] = {
> >  	{ "nolazytime",	SB_LAZYTIME },
> >  	{ "nomand",	SB_MANDLOCK },
> >  	{ "rw",		SB_RDONLY },
> > +	{ "nomgtime",	SB_MGTIME },
> >  	{ },
> >  };
> >  
> > +static inline int check_mgtime(unsigned int token, const struct fs_context *fc)
> > +{
> > +	if (token != SB_MGTIME)
> > +		return 0;
> > +	if (!(fc->fs_type->fs_flags & FS_MGTIME))
> > +		return invalf(fc, "Filesystem doesn't support multi-grain timestamps");
> > +	return 0;
> > +}
> > +
> >  /*
> >   * Check for a common mount option that manipulates s_flags.
> >   */
> >  static int vfs_parse_sb_flag(struct fs_context *fc, const char *key)
> >  {
> >  	unsigned int token;
> > +	int ret;
> >  
> >  	token = lookup_constant(common_set_sb_flag, key, 0);
> >  	if (token) {
> > +		ret = check_mgtime(token, fc);
> > +		if (ret)
> > +			return ret;
> >  		fc->sb_flags |= token;
> >  		fc->sb_flags_mask |= token;
> >  		return 0;
> > @@ -71,6 +86,9 @@ static int vfs_parse_sb_flag(struct fs_context *fc, const char *key)
> >  
> >  	token = lookup_constant(common_clear_sb_flag, key, 0);
> >  	if (token) {
> > +		ret = check_mgtime(token, fc);
> > +		if (ret)
> > +			return ret;
> >  		fc->sb_flags &= ~token;
> >  		fc->sb_flags_mask |= token;
> >  		return 0;
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 54237f4242ff..fd1a2390aaa3 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -2141,7 +2141,7 @@ EXPORT_SYMBOL(current_mgtime);
> >  
> >  static struct timespec64 current_ctime(struct inode *inode)
> >  {
> > -	if (is_mgtime(inode))
> > +	if (IS_MGTIME(inode))
> >  		return current_mgtime(inode);
> >  	return current_time(inode);
> >  }
> > @@ -2588,7 +2588,7 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
> >  		now = current_time(inode);
> >  
> >  		/* Just copy it into place if it's not multigrain */
> > -		if (!is_mgtime(inode)) {
> > +		if (!IS_MGTIME(inode)) {
> >  			inode_set_ctime_to_ts(inode, now);
> >  			return now;
> >  		}
> > diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
> > index 250eb5bf7b52..08f5bf4d2c6c 100644
> > --- a/fs/proc_namespace.c
> > +++ b/fs/proc_namespace.c
> > @@ -49,6 +49,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
> >  		{ SB_DIRSYNC, ",dirsync" },
> >  		{ SB_MANDLOCK, ",mand" },
> >  		{ SB_LAZYTIME, ",lazytime" },
> > +		{ SB_MGTIME, ",mgtime" },
> >  		{ 0, NULL }
> >  	};
> >  	const struct proc_fs_opts *fs_infop;
> > diff --git a/fs/stat.c b/fs/stat.c
> > index 6e60389d6a15..2f18dd5de18b 100644
> > --- a/fs/stat.c
> > +++ b/fs/stat.c
> > @@ -90,7 +90,7 @@ void generic_fillattr(struct mnt_idmap *idmap, u32 request_mask,
> >  	stat->size = i_size_read(inode);
> >  	stat->atime = inode->i_atime;
> >  
> > -	if (is_mgtime(inode)) {
> > +	if (IS_MGTIME(inode)) {
> >  		fill_mg_cmtime(stat, request_mask, inode);
> >  	} else {
> >  		stat->mtime = inode->i_mtime;
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 4aeb3fa11927..03e415fb3a7c 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1114,6 +1114,7 @@ extern int send_sigurg(struct fown_struct *fown);
> >  #define SB_NODEV        BIT(2)	/* Disallow access to device special files */
> >  #define SB_NOEXEC       BIT(3)	/* Disallow program execution */
> >  #define SB_SYNCHRONOUS  BIT(4)	/* Writes are synced at once */
> > +#define SB_MGTIME	BIT(5)	/* Use multi-grain timestamps */
> >  #define SB_MANDLOCK     BIT(6)	/* Allow mandatory locks on an FS */
> >  #define SB_DIRSYNC      BIT(7)	/* Directory modifications are synchronous */
> >  #define SB_NOATIME      BIT(10)	/* Do not update access times. */
> > @@ -2105,6 +2106,7 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags
> >  					((inode)->i_flags & (S_SYNC|S_DIRSYNC)))
> >  #define IS_MANDLOCK(inode)	__IS_FLG(inode, SB_MANDLOCK)
> >  #define IS_NOATIME(inode)	__IS_FLG(inode, SB_RDONLY|SB_NOATIME)
> > +#define IS_MGTIME(inode)	__IS_FLG(inode, SB_MGTIME)
> >  #define IS_I_VERSION(inode)	__IS_FLG(inode, SB_I_VERSION)
> >  
> >  #define IS_NOQUOTA(inode)	((inode)->i_flags & S_NOQUOTA)
> > @@ -2366,7 +2368,7 @@ struct file_system_type {
> >   */
> >  static inline bool is_mgtime(const struct inode *inode)
> >  {
> > -	return inode->i_sb->s_type->fs_flags & FS_MGTIME;
> > +	return inode->i_sb->s_flags & SB_MGTIME;
> >  }
> >  
> >  extern struct dentry *mount_bdev(struct file_system_type *fs_type,
> > -- 
> > 2.34.1
> >
Christian Brauner Sept. 20, 2023, 11:48 a.m. UTC | #23
> > > While we initially thought we can do this unconditionally it turns out
> > > that this might break existing workloads that rely on timestamps in very
> > > specific ways and we always knew this was a possibility. Move
> > > multi-grain timestamps behind a vfs mount option.
> > 
> > Surely this is a safe choice as it moves the responsibility to the sysadmin
> > and the cases where finegrained timestamps are required. But I kind of
> > wonder how is the sysadmin going to decide whether mgtime is safe for his
> > system or not? Because the possible breakage needn't be obvious at the
> > first sight...
> > 
> 
> That's the main reason I really didn't want to go with a mount option.
> Documenting that may be difficult. While there is some pessimism around
> it, I may still take a stab at just advancing the coarse clock whenever
> we fetch a fine-grained timestamp. It'd be nice to remove this option in
> the future if that turns out to be feasible.
> 
> > If I were a sysadmin, I'd rather opt for something like
> > finegrained timestamps + lazytime (if I needed the finegrained timestamps
> > functionality). That should avoid the IO overhead of finegrained timestamps
> > as well and I'd know I can have problems with timestamps only after a
> > system crash.
> 
> > I've just got another idea how we could solve the problem: Couldn't we
> > always just report coarsegrained timestamp to userspace and provide access
> > to finegrained value only to NFS which should know what it's doing?
> > 
> 
> I think that'd be hard. First of all, where would we store the second
> timestamp? We can't just truncate the fine-grained ones to come up with
> a coarse-grained one. It might also be confusing having nfsd and local
> filesystems present different attributes.

As far as I can tell we have two options. The first one is to make this
into a mount option which I really think isn't a big deal and lets us
avoid this whole problem while allowing filesytems exposed via NFS to
make use of this feature for change tracking.

The second option is that we turn off fine-grained finestamps for v6.6
and you get to explore other options.

It isn't a big deal regressions like this were always to be expected but
v6.6 needs to stabilize so anything that requires more significant work
is not an option.
Jeff Layton Sept. 20, 2023, 11:56 a.m. UTC | #24
On Wed, 2023-09-20 at 13:48 +0200, Christian Brauner wrote:
> > > > While we initially thought we can do this unconditionally it turns out
> > > > that this might break existing workloads that rely on timestamps in very
> > > > specific ways and we always knew this was a possibility. Move
> > > > multi-grain timestamps behind a vfs mount option.
> > > 
> > > Surely this is a safe choice as it moves the responsibility to the sysadmin
> > > and the cases where finegrained timestamps are required. But I kind of
> > > wonder how is the sysadmin going to decide whether mgtime is safe for his
> > > system or not? Because the possible breakage needn't be obvious at the
> > > first sight...
> > > 
> > 
> > That's the main reason I really didn't want to go with a mount option.
> > Documenting that may be difficult. While there is some pessimism around
> > it, I may still take a stab at just advancing the coarse clock whenever
> > we fetch a fine-grained timestamp. It'd be nice to remove this option in
> > the future if that turns out to be feasible.
> > 
> > > If I were a sysadmin, I'd rather opt for something like
> > > finegrained timestamps + lazytime (if I needed the finegrained timestamps
> > > functionality). That should avoid the IO overhead of finegrained timestamps
> > > as well and I'd know I can have problems with timestamps only after a
> > > system crash.
> > 
> > > I've just got another idea how we could solve the problem: Couldn't we
> > > always just report coarsegrained timestamp to userspace and provide access
> > > to finegrained value only to NFS which should know what it's doing?
> > > 
> > 
> > I think that'd be hard. First of all, where would we store the second
> > timestamp? We can't just truncate the fine-grained ones to come up with
> > a coarse-grained one. It might also be confusing having nfsd and local
> > filesystems present different attributes.
> 
> As far as I can tell we have two options. The first one is to make this
> into a mount option which I really think isn't a big deal and lets us
> avoid this whole problem while allowing filesytems exposed via NFS to
> make use of this feature for change tracking.
> 
> The second option is that we turn off fine-grained finestamps for v6.6
> and you get to explore other options.
> 
> It isn't a big deal regressions like this were always to be expected but
> v6.6 needs to stabilize so anything that requires more significant work
> is not an option.

Oh, absolutely.

I wasn't proposing to do that work for v6.6. For that, we absolutely
either need the mount option or to just revert the mgtime conversions.

My plan was to take a stab at doing this for a later kernel release.
This is very much a "back to the drawing board" idea. It may not pan out
after all, but if it does then we could consider removing the mount
option at that point.
Christian Brauner Sept. 20, 2023, 12:08 p.m. UTC | #25
> I wasn't proposing to do that work for v6.6. For that, we absolutely
> either need the mount option or to just revert the mgtime conversions.

This sounds like you want me to do a full-on revert of your series but
why? The conversion and changes support an actual use-case and are fine.
It's a matter of whether we unconditionally expose it to users or not.

@Jan, what do you think?

> My plan was to take a stab at doing this for a later kernel release.

Ok.
Jeff Layton Sept. 20, 2023, 12:26 p.m. UTC | #26
On Wed, 2023-09-20 at 14:08 +0200, Christian Brauner wrote:
> > I wasn't proposing to do that work for v6.6. For that, we absolutely
> > either need the mount option or to just revert the mgtime conversions.
> 
> This sounds like you want me to do a full-on revert of your series but
> why? The conversion and changes support an actual use-case and are fine.
> It's a matter of whether we unconditionally expose it to users or not.
> 

I don't, actually. I'm just mentioning that it's possible if we find the
mount option to be unpalatable.

> @Jan, what do you think?
> 
> > My plan was to take a stab at doing this for a later kernel release.
> 
> Ok.

If it works out, then we may be able to eventually remove the mount
option, but that is a separate project altogether.
Christian Brauner Sept. 20, 2023, 12:30 p.m. UTC | #27
> I don't, actually. I'm just mentioning that it's possible if we find the
> mount option to be unpalatable.

Ok.

> 
> > @Jan, what do you think?
> > 
> > > My plan was to take a stab at doing this for a later kernel release.
> > 
> > Ok.
> 
> If it works out, then we may be able to eventually remove the mount
> option, but that is a separate project altogether.

It would just become a nop for anyone setting it which is fine by me.
Jan Kara Sept. 20, 2023, 12:48 p.m. UTC | #28
On Wed 20-09-23 06:35:18, Jeff Layton wrote:
> On Wed, 2023-09-20 at 12:17 +0200, Jan Kara wrote:
> > If I were a sysadmin, I'd rather opt for something like
> > finegrained timestamps + lazytime (if I needed the finegrained timestamps
> > functionality). That should avoid the IO overhead of finegrained timestamps
> > as well and I'd know I can have problems with timestamps only after a
> > system crash.
> 
> > I've just got another idea how we could solve the problem: Couldn't we
> > always just report coarsegrained timestamp to userspace and provide access
> > to finegrained value only to NFS which should know what it's doing?
> > 
> 
> I think that'd be hard. First of all, where would we store the second
> timestamp? We can't just truncate the fine-grained ones to come up with
> a coarse-grained one. It might also be confusing having nfsd and local
> filesystems present different attributes.

So what I had in mind (and I definitely miss all the NFS intricacies so the
idea may be bogus) was that inode->i_ctime would be maintained exactly as
is now. There will be new (kernel internal at least for now) STATX flag
STATX_MULTIGRAIN_TS. fill_mg_cmtime() will return timestamp truncated to
sb->s_time_gran unless STATX_MULTIGRAIN_TS is set. Hence unless you set
STATX_MULTIGRAIN_TS, there is no difference in the returned timestamps
compared to the state before multigrain timestamps were introduced. With
STATX_MULTIGRAIN_TS we return full precision timestamp as stored in the
inode. Then NFS in fh_fill_pre_attrs() and fh_fill_post_attrs() needs to
make sure STATX_MULTIGRAIN_TS is set when calling vfs_getattr() to get
multigrain time.

I agree nfsd may now be presenting slightly different timestamps than user
is able to see with stat(2) directly on the filesystem. But is that a
problem? Essentially it is a similar solution as the mgtime mount option
but now sysadmin doesn't have to decide on filesystem mount how to report
timestamps but the stat caller knowingly opts into possibly inconsistent
(among files) but high precision timestamps. And in the particular NFS
usecase where stat is called all the time anyway, timestamps will likely
even be consistent among files.

								Honza
Jan Kara Sept. 20, 2023, 1:03 p.m. UTC | #29
On Wed 20-09-23 12:30:52, Christian Brauner wrote:
> On Wed, Sep 20, 2023 at 12:17:31PM +0200, Jan Kara wrote:
> > On Wed 20-09-23 10:41:30, Christian Brauner wrote:
> > > > > f1 was last written to *after* f2 was last written to. If the timestamp of f1
> > > > > is then lower than the timestamp of f2, timestamps are fundamentally broken.
> > > > > 
> > > > > Many things in user-space depend on timestamps, such as build system
> > > > > centered around 'make', but also 'find ... -newer ...'.
> > > > > 
> > > > 
> > > > 
> > > > What does breakage with make look like in this situation? The "fuzz"
> > > > here is going to be on the order of a jiffy. The typical case for make
> > > > timestamp comparisons is comparing source files vs. a build target. If
> > > > those are being written nearly simultaneously, then that could be an
> > > > issue, but is that a typical behavior? It seems like it would be hard to
> > > > rely on that anyway, esp. given filesystems like NFS that can do lazy
> > > > writeback.
> > > > 
> > > > One of the operating principles with this series is that timestamps can
> > > > be of varying granularity between different files. Note that Linux
> > > > already violates this assumption when you're working across filesystems
> > > > of different types.
> > > > 
> > > > As to potential fixes if this is a real problem:
> > > > 
> > > > I don't really want to put this behind a mount or mkfs option (a'la
> > > > relatime, etc.), but that is one possibility.
> > > > 
> > > > I wonder if it would be feasible to just advance the coarse-grained
> > > > current_time whenever we end up updating a ctime with a fine-grained
> > > > timestamp? It might produce some inode write amplification. Files that
> > > 
> > > Less than ideal imho.
> > > 
> > > If this risks breaking existing workloads by enabling it unconditionally
> > > and there isn't a clear way to detect and handle these situations
> > > without risk of regression then we should move this behind a mount
> > > option.
> > > 
> > > So how about the following:
> > > 
> > > From cb14add421967f6e374eb77c36cc4a0526b10d17 Mon Sep 17 00:00:00 2001
> > > From: Christian Brauner <brauner@kernel.org>
> > > Date: Wed, 20 Sep 2023 10:00:08 +0200
> > > Subject: [PATCH] vfs: move multi-grain timestamps behind a mount option
> > > 
> > > While we initially thought we can do this unconditionally it turns out
> > > that this might break existing workloads that rely on timestamps in very
> > > specific ways and we always knew this was a possibility. Move
> > > multi-grain timestamps behind a vfs mount option.
> > > 
> > > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > 
> > Surely this is a safe choice as it moves the responsibility to the sysadmin
> > and the cases where finegrained timestamps are required. But I kind of
> > wonder how is the sysadmin going to decide whether mgtime is safe for his
> > system or not? Because the possible breakage needn't be obvious at the
> > first sight... If I were a sysadmin, I'd rather opt for something like
> 
> I think you'll basically enable this because you want to export a
> filesystem via NFS.

OK, that's what I thought but then you have to make a tough choice between:

1) Possibly inconsistent NFS caches on frequent changes.
2) Possibly broken builds on NFS.

Pick your poison ;)

> > finegrained timestamps + lazytime (if I needed the finegrained timestamps
> > functionality). That should avoid the IO overhead of finegrained timestamps
> 
> That would work with this patch, no? Or are you saying it would need
> something else?

Sorry, I was not really precise here. What I meant was that instead of
having multigrain timestamps, I (as a sysadmin) would want the filesystem
to set sb->s_time_gran to 1 ns and use lazytime to remove the IO overhead
of the frequent timestamp updates. But that is just me brainstorming
possible solutions of the original NFS problem.

> > as well and I'd know I can have problems with timestamps only after a
> > system crash.
> > 
> > I've just got another idea how we could solve the problem: Couldn't we
> > always just report coarsegrained timestamp to userspace and provide access
> > to finegrained value only to NFS which should know what it's doing?
> 
> What would changes would be involved for that?

See my other email. It should be fairly small...

> If this is invasive work and we decide this is something that we want to
> do then we should remove FS_MGTIME from btrfs, xfs, ext4, and tmpfs for
> v6.6.

.. but let's see what Jeff thinks. I can miss some problem with the
solution.

								Honza
Chuck Lever Sept. 20, 2023, 1:57 p.m. UTC | #30
> On Sep 20, 2023, at 7:48 AM, Christian Brauner <brauner@kernel.org> wrote:
> 
>>>> While we initially thought we can do this unconditionally it turns out
>>>> that this might break existing workloads that rely on timestamps in very
>>>> specific ways and we always knew this was a possibility. Move
>>>> multi-grain timestamps behind a vfs mount option.
>>> 
>>> Surely this is a safe choice as it moves the responsibility to the sysadmin
>>> and the cases where finegrained timestamps are required. But I kind of
>>> wonder how is the sysadmin going to decide whether mgtime is safe for his
>>> system or not? Because the possible breakage needn't be obvious at the
>>> first sight...
>>> 
>> 
>> That's the main reason I really didn't want to go with a mount option.
>> Documenting that may be difficult. While there is some pessimism around
>> it, I may still take a stab at just advancing the coarse clock whenever
>> we fetch a fine-grained timestamp. It'd be nice to remove this option in
>> the future if that turns out to be feasible.
>> 
>>> If I were a sysadmin, I'd rather opt for something like
>>> finegrained timestamps + lazytime (if I needed the finegrained timestamps
>>> functionality). That should avoid the IO overhead of finegrained timestamps
>>> as well and I'd know I can have problems with timestamps only after a
>>> system crash.
>> 
>>> I've just got another idea how we could solve the problem: Couldn't we
>>> always just report coarsegrained timestamp to userspace and provide access
>>> to finegrained value only to NFS which should know what it's doing?
>>> 
>> 
>> I think that'd be hard. First of all, where would we store the second
>> timestamp? We can't just truncate the fine-grained ones to come up with
>> a coarse-grained one. It might also be confusing having nfsd and local
>> filesystems present different attributes.
> 
> As far as I can tell we have two options. The first one is to make this
> into a mount option which I really think isn't a big deal and lets us
> avoid this whole problem while allowing filesytems exposed via NFS to
> make use of this feature for change tracking.

A mount option isn't hard to implement, but I think it would be a
mistake.

As Jan pointed out, the two alternative compromises are often very
difficult to choose between. Tossing this decision to administrators
doesn't seem like a responsible way to handle a question that might
result in, at the least, unexpected behavior, and at worst, data
corruption.

Plus, on Linux, often times files are accessed locally on NFS servers
as well as remotely -- how does the server's administrator pick the
correct setting in that case?


> The second option is that we turn off fine-grained finestamps for v6.6
> and you get to explore other options.

You could put it behind an EXPERIMENTAL Kconfig option so that the
code stays in and can be used by the brave or foolish while it is
still being refined.


> It isn't a big deal regressions like this were always to be expected but
> v6.6 needs to stabilize so anything that requires more significant work
> is not an option.


--
Chuck Lever
Jeff Layton Sept. 20, 2023, 2:12 p.m. UTC | #31
On Wed, 2023-09-20 at 14:48 +0200, Jan Kara wrote:
> On Wed 20-09-23 06:35:18, Jeff Layton wrote:
> > On Wed, 2023-09-20 at 12:17 +0200, Jan Kara wrote:
> > > If I were a sysadmin, I'd rather opt for something like
> > > finegrained timestamps + lazytime (if I needed the finegrained timestamps
> > > functionality). That should avoid the IO overhead of finegrained timestamps
> > > as well and I'd know I can have problems with timestamps only after a
> > > system crash.
> > 
> > > I've just got another idea how we could solve the problem: Couldn't we
> > > always just report coarsegrained timestamp to userspace and provide access
> > > to finegrained value only to NFS which should know what it's doing?
> > > 
> > 
> > I think that'd be hard. First of all, where would we store the second
> > timestamp? We can't just truncate the fine-grained ones to come up with
> > a coarse-grained one. It might also be confusing having nfsd and local
> > filesystems present different attributes.
> 
> So what I had in mind (and I definitely miss all the NFS intricacies so the
> idea may be bogus) was that inode->i_ctime would be maintained exactly as
> is now. There will be new (kernel internal at least for now) STATX flag
> STATX_MULTIGRAIN_TS. fill_mg_cmtime() will return timestamp truncated to
> sb->s_time_gran unless STATX_MULTIGRAIN_TS is set. Hence unless you set
> STATX_MULTIGRAIN_TS, there is no difference in the returned timestamps
> compared to the state before multigrain timestamps were introduced. With
> STATX_MULTIGRAIN_TS we return full precision timestamp as stored in the
> inode. Then NFS in fh_fill_pre_attrs() and fh_fill_post_attrs() needs to
> make sure STATX_MULTIGRAIN_TS is set when calling vfs_getattr() to get
> multigrain time.

> I agree nfsd may now be presenting slightly different timestamps than user
> is able to see with stat(2) directly on the filesystem. But is that a
> problem? Essentially it is a similar solution as the mgtime mount option
> but now sysadmin doesn't have to decide on filesystem mount how to report
> timestamps but the stat caller knowingly opts into possibly inconsistent
> (among files) but high precision timestamps. And in the particular NFS
> usecase where stat is called all the time anyway, timestamps will likely
> even be consistent among files.
> 

I like this idea...

Would we also need to raise sb->s_time_gran to something corresponding
to HZ on these filesystems? If we truncate the timestamps at a
granularity corresponding to HZ before presenting them via statx and the
like then that should work around the problem with programs that compare
timestamps between inodes.

With NFSv4, when a filesystem doesn't report a STATX_CHANGE_COOKIE, nfsd
will fake one up using the ctime. It's fine for that to use a full fine-
grained timestamp since we don't expect to be able to compare that value
with one of a different inode.

I think we'd want nfsd to present the mtime/ctime values as truncated,
just like we would with a local fs. We could hit the same problem of an
earlier-looking timestamp with NFS if we try to present the actual fine-
grained values to the clients. IOW, I'm convinced that we need to avoid
this behavior in most situations.

If we do this, then we technically don't need the mount option either.
We could still add it though, and have it govern whether fill_mg_cmtime
truncates the timestamps before storing them in the kstat.
Christian Brauner Sept. 20, 2023, 2:53 p.m. UTC | #32
> You could put it behind an EXPERIMENTAL Kconfig option so that the
> code stays in and can be used by the brave or foolish while it is
> still being refined.

Given that the discussion has now fully gone back to the drawing board
and this is a regression the honest thing to do is to revert the five
patches that introduce the infrastructure:

ffb6cf19e063 ("fs: add infrastructure for multigrain timestamps")
d48c33972916 ("tmpfs: add support for multigrain timestamps")
e44df2664746 ("xfs: switch to multigrain timestamps")
0269b585868e ("ext4: switch to multigrain timestamps")
50e9ceef1d4f ("btrfs: convert to multigrain timestamps")

The conversion to helpers and cleanups are sane and should stay and can
be used for any solution that gets built on top of it.

I'd appreciate a look at the branch here:
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.ctime.revert

survives xfstests.
Jeff Layton Sept. 20, 2023, 3:29 p.m. UTC | #33
On Wed, 2023-09-20 at 16:53 +0200, Christian Brauner wrote:
> > You could put it behind an EXPERIMENTAL Kconfig option so that the
> > code stays in and can be used by the brave or foolish while it is
> > still being refined.
> 
> Given that the discussion has now fully gone back to the drawing board
> and this is a regression the honest thing to do is to revert the five
> patches that introduce the infrastructure:
> 
> ffb6cf19e063 ("fs: add infrastructure for multigrain timestamps")
> d48c33972916 ("tmpfs: add support for multigrain timestamps")
> e44df2664746 ("xfs: switch to multigrain timestamps")
> 0269b585868e ("ext4: switch to multigrain timestamps")
> 50e9ceef1d4f ("btrfs: convert to multigrain timestamps")
> 
> The conversion to helpers and cleanups are sane and should stay and can
> be used for any solution that gets built on top of it.
> 
> I'd appreciate a look at the branch here:
> git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.ctime.revert
> 
> survives xfstests.

I think that's probably the wisest course of action. I need some time to
ponder the options for this series anyway, and another cycle in next
wouldn't hurt.

The branch itself looks fine, but you might want to reverse the order of
the patches in case someone lands there in the middle of a bisect. IOW,
I think you want to revert the "convert to multigrain" patches before
you revert the infrastructure.
Jan Kara Sept. 20, 2023, 3:30 p.m. UTC | #34
On Wed 20-09-23 16:53:26, Christian Brauner wrote:
> > You could put it behind an EXPERIMENTAL Kconfig option so that the
> > code stays in and can be used by the brave or foolish while it is
> > still being refined.
> 
> Given that the discussion has now fully gone back to the drawing board
> and this is a regression the honest thing to do is to revert the five
> patches that introduce the infrastructure:
> 
> ffb6cf19e063 ("fs: add infrastructure for multigrain timestamps")
> d48c33972916 ("tmpfs: add support for multigrain timestamps")
> e44df2664746 ("xfs: switch to multigrain timestamps")
> 0269b585868e ("ext4: switch to multigrain timestamps")
> 50e9ceef1d4f ("btrfs: convert to multigrain timestamps")
> 
> The conversion to helpers and cleanups are sane and should stay and can
> be used for any solution that gets built on top of it.
> 
> I'd appreciate a look at the branch here:
> git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.ctime.revert
> 
> survives xfstests.

Agreed. I think most of ffb6cf19e063 ("fs: add infrastructure for
multigrain timestamps") will be needed anyway but there's no problem in
reintroducing it in the new solution. I've checked the branch and the
reverts look good to me. Feel free to add:

Acked-by: Jan Kara <jack@suse.cz>

								Honza
Jan Kara Sept. 20, 2023, 3:45 p.m. UTC | #35
On Wed 20-09-23 10:12:03, Jeff Layton wrote:
> On Wed, 2023-09-20 at 14:48 +0200, Jan Kara wrote:
> > On Wed 20-09-23 06:35:18, Jeff Layton wrote:
> > > On Wed, 2023-09-20 at 12:17 +0200, Jan Kara wrote:
> > > > If I were a sysadmin, I'd rather opt for something like
> > > > finegrained timestamps + lazytime (if I needed the finegrained timestamps
> > > > functionality). That should avoid the IO overhead of finegrained timestamps
> > > > as well and I'd know I can have problems with timestamps only after a
> > > > system crash.
> > > 
> > > > I've just got another idea how we could solve the problem: Couldn't we
> > > > always just report coarsegrained timestamp to userspace and provide access
> > > > to finegrained value only to NFS which should know what it's doing?
> > > > 
> > > 
> > > I think that'd be hard. First of all, where would we store the second
> > > timestamp? We can't just truncate the fine-grained ones to come up with
> > > a coarse-grained one. It might also be confusing having nfsd and local
> > > filesystems present different attributes.
> > 
> > So what I had in mind (and I definitely miss all the NFS intricacies so the
> > idea may be bogus) was that inode->i_ctime would be maintained exactly as
> > is now. There will be new (kernel internal at least for now) STATX flag
> > STATX_MULTIGRAIN_TS. fill_mg_cmtime() will return timestamp truncated to
> > sb->s_time_gran unless STATX_MULTIGRAIN_TS is set. Hence unless you set
> > STATX_MULTIGRAIN_TS, there is no difference in the returned timestamps
> > compared to the state before multigrain timestamps were introduced. With
> > STATX_MULTIGRAIN_TS we return full precision timestamp as stored in the
> > inode. Then NFS in fh_fill_pre_attrs() and fh_fill_post_attrs() needs to
> > make sure STATX_MULTIGRAIN_TS is set when calling vfs_getattr() to get
> > multigrain time.
> 
> > I agree nfsd may now be presenting slightly different timestamps than user
> > is able to see with stat(2) directly on the filesystem. But is that a
> > problem? Essentially it is a similar solution as the mgtime mount option
> > but now sysadmin doesn't have to decide on filesystem mount how to report
> > timestamps but the stat caller knowingly opts into possibly inconsistent
> > (among files) but high precision timestamps. And in the particular NFS
> > usecase where stat is called all the time anyway, timestamps will likely
> > even be consistent among files.
> > 
> 
> I like this idea...
> 
> Would we also need to raise sb->s_time_gran to something corresponding
> to HZ on these filesystems?

I was actually confused a bit about how timestamp_truncate() works. The
jiffie granularity is just direct consequence of current_time() using
ktime_get_coarse_real_ts64() and not of timestamp_truncate().
sb->s_time_gran seems to be more about the on-disk format so it doesn't
seem like a great idea to touch it. So probably we can just truncate
timestamps in generic_fillattr() to HZ granularity unconditionally.

> If we truncate the timestamps at a granularity corresponding to HZ before
> presenting them via statx and the like then that should work around the
> problem with programs that compare timestamps between inodes.

Exactly.

> With NFSv4, when a filesystem doesn't report a STATX_CHANGE_COOKIE, nfsd
> will fake one up using the ctime. It's fine for that to use a full fine-
> grained timestamp since we don't expect to be able to compare that value
> with one of a different inode.

Yes.

> I think we'd want nfsd to present the mtime/ctime values as truncated,
> just like we would with a local fs. We could hit the same problem of an
> earlier-looking timestamp with NFS if we try to present the actual fine-
> grained values to the clients. IOW, I'm convinced that we need to avoid
> this behavior in most situations.

I wasn't sure if there's a way to do this within NFS - i.e., if the value
communicated via NFSv3 protocol (I know v4 has a special change cookie
field for it) that gets used for detecting need to revalidate file contents
isn't the one presented to client's userspace as ctime. If there's a way to
do this then great, I'm all for presenting truncated timestamps even for
NFS.

> If we do this, then we technically don't need the mount option either.

Yes, that was my hope.

> We could still add it though, and have it govern whether fill_mg_cmtime
> truncates the timestamps before storing them in the kstat.

Well, if we decide these timestamps are useful for userspace as well, I'd
rather make that a userspace visible STATX flag than a mount option. So
applications aware of the pitfalls can get high precision timestamps
without possibly breaking unaware applications.

								Honza