mbox series

[0/6] power: wire-up filesystem freeze/thaw with suspend/resume

Message ID 20250401-work-freeze-v1-0-d000611d4ab0@kernel.org
Headers show
Series power: wire-up filesystem freeze/thaw with suspend/resume | expand

Message

Christian Brauner April 1, 2025, 12:32 a.m. UTC
The whole shebang can also be found at:
https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze

I know nothing about power or hibernation. I've tested it as best as I
could. Works for me (TM).

I need to catch some actual sleep now...

---

Now all the pieces are in place to actually allow the power subsystem to
freeze/thaw filesystems during suspend/resume. Filesystems are only
frozen and thawed if the power subsystem does actually own the freeze.

Othwerwise it risks thawing filesystems it didn't own. This could be
done differently be e.g., keeping the filesystems that were actually
frozen on a list and then unfreezing them from that list. This is
disgustingly unclean though and reeks of an ugly hack.

If the filesystem is already frozen by the time we've frozen all
userspace processes we don't care to freeze it again. That's userspace's
job once the process resumes. We only actually freeze filesystems if we
absolutely have to and we ignore other failures to freeze.

We could bubble up errors and fail suspend/resume if the error isn't
EBUSY (aka it's already frozen) but I don't think that this is worth it.
Filesystem freezing during suspend/resume is best-effort. If the user
has 500 ext4 filesystems mounted and 4 fail to freeze for whatever
reason then we simply skip them.

What we have now is already a big improvement and let's see how we fare
with it before making our lives even harder (and uglier) than we have
to.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (3):
      fs: add owner of freeze/thaw
      fs: allow pagefault based writers to be frozen
      power: freeze filesystems during suspend/resume

Luis Chamberlain (3):
      ext4: replace kthread freezing with auto fs freezing
      btrfs: replace kthread freezing with auto fs freezing
      xfs: replace kthread freezing with auto fs freezing

 fs/btrfs/disk-io.c          |  4 +--
 fs/btrfs/scrub.c            |  2 +-
 fs/ext4/mballoc.c           |  2 +-
 fs/ext4/super.c             |  3 --
 fs/f2fs/gc.c                |  6 ++--
 fs/gfs2/super.c             | 20 ++++++-----
 fs/gfs2/sys.c               |  4 +--
 fs/ioctl.c                  |  8 ++---
 fs/super.c                  | 82 ++++++++++++++++++++++++++++++++++++---------
 fs/xfs/scrub/fscounters.c   |  4 +--
 fs/xfs/xfs_discard.c        |  2 +-
 fs/xfs/xfs_log.c            |  3 +-
 fs/xfs/xfs_log_cil.c        |  2 +-
 fs/xfs/xfs_mru_cache.c      |  2 +-
 fs/xfs/xfs_notify_failure.c |  6 ++--
 fs/xfs/xfs_pwork.c          |  2 +-
 fs/xfs/xfs_super.c          | 14 ++++----
 fs/xfs/xfs_trans_ail.c      |  3 --
 fs/xfs/xfs_zone_gc.c        |  2 --
 include/linux/fs.h          | 16 ++++++---
 kernel/power/hibernate.c    | 13 ++++++-
 kernel/power/suspend.c      |  8 +++++
 22 files changed, 139 insertions(+), 69 deletions(-)
---
base-commit: a68c99192db8060f383a2680333866c0be688ece
change-id: 20250401-work-freeze-693b5b5a78e0

Comments

Dave Chinner April 1, 2025, 1:11 a.m. UTC | #1
On Tue, Apr 01, 2025 at 02:32:48AM +0200, Christian Brauner wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> The kernel power management now supports allowing the VFS
> to handle filesystem freezing freezes and thawing. Take advantage
> of that and remove the kthread freezing. This is needed so that we
> properly really stop IO in flight without races after userspace
> has been frozen. Without this we rely on kthread freezing and
> its semantics are loose and error prone.
> 
> The filesystem therefore is in charge of properly dealing with
> quiescing of the filesystem through its callbacks if it thinks
> it knows better than how the VFS handles it.
> 
.....

> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 0fcb1828e598..ad8183db0780 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -636,7 +636,6 @@ xfsaild(
>  	unsigned int	noreclaim_flag;
>  
>  	noreclaim_flag = memalloc_noreclaim_save();
> -	set_freezable();
>  
>  	while (1) {
>  		/*
> @@ -695,8 +694,6 @@ xfsaild(
>  
>  		__set_current_state(TASK_RUNNING);
>  
> -		try_to_freeze();
> -
>  		tout = xfsaild_push(ailp);
>  	}
>  

So what about the TASK_FREEZABLE flag that is set in this code
before sleeping?

i.e. this code before we schedule():

                if (tout && tout <= 20)
                        set_current_state(TASK_KILLABLE|TASK_FREEZABLE);
                else
                        set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);

Shouldn't TASK_FREEZABLE go away, too?

> diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
> index c5136ea9bb1d..1875b6551ab0 100644
> --- a/fs/xfs/xfs_zone_gc.c
> +++ b/fs/xfs/xfs_zone_gc.c
> @@ -993,7 +993,6 @@ xfs_zone_gc_handle_work(
>  	}
>  
>  	__set_current_state(TASK_RUNNING);
> -	try_to_freeze();
>  
>  	if (reset_list)
>  		xfs_zone_gc_reset_zones(data, reset_list);
> @@ -1041,7 +1040,6 @@ xfs_zoned_gcd(
>  	unsigned int		nofs_flag;
>  
>  	nofs_flag = memalloc_nofs_save();
> -	set_freezable();
>  
>  	for (;;) {
>  		set_current_state(TASK_INTERRUPTIBLE | TASK_FREEZABLE);

Same question here for this newly merged code, too...

-Dave.
Christian Brauner April 1, 2025, 7:17 a.m. UTC | #2
On Tue, Apr 01, 2025 at 12:11:04PM +1100, Dave Chinner wrote:
> On Tue, Apr 01, 2025 at 02:32:48AM +0200, Christian Brauner wrote:
> > From: Luis Chamberlain <mcgrof@kernel.org>
> > 
> > The kernel power management now supports allowing the VFS
> > to handle filesystem freezing freezes and thawing. Take advantage
> > of that and remove the kthread freezing. This is needed so that we
> > properly really stop IO in flight without races after userspace
> > has been frozen. Without this we rely on kthread freezing and
> > its semantics are loose and error prone.
> > 
> > The filesystem therefore is in charge of properly dealing with
> > quiescing of the filesystem through its callbacks if it thinks
> > it knows better than how the VFS handles it.
> > 
> .....
> 
> > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > index 0fcb1828e598..ad8183db0780 100644
> > --- a/fs/xfs/xfs_trans_ail.c
> > +++ b/fs/xfs/xfs_trans_ail.c
> > @@ -636,7 +636,6 @@ xfsaild(
> >  	unsigned int	noreclaim_flag;
> >  
> >  	noreclaim_flag = memalloc_noreclaim_save();
> > -	set_freezable();
> >  
> >  	while (1) {
> >  		/*
> > @@ -695,8 +694,6 @@ xfsaild(
> >  
> >  		__set_current_state(TASK_RUNNING);
> >  
> > -		try_to_freeze();
> > -
> >  		tout = xfsaild_push(ailp);
> >  	}
> >  
> 
> So what about the TASK_FREEZABLE flag that is set in this code
> before sleeping?
> 
> i.e. this code before we schedule():
> 
>                 if (tout && tout <= 20)
>                         set_current_state(TASK_KILLABLE|TASK_FREEZABLE);
>                 else
>                         set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
> 
> Shouldn't TASK_FREEZABLE go away, too?

Thanks for spotting! Yes, yesterday late at night I just took Luis
patches as they are and had only gotten around to testing btrfs. The
coccinelle scripts seemed to have missed those. I'll wait for comments
and will do another pass and send out v2.

> > diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
> > index c5136ea9bb1d..1875b6551ab0 100644
> > --- a/fs/xfs/xfs_zone_gc.c
> > +++ b/fs/xfs/xfs_zone_gc.c
> > @@ -993,7 +993,6 @@ xfs_zone_gc_handle_work(
> >  	}
> >  
> >  	__set_current_state(TASK_RUNNING);
> > -	try_to_freeze();
> >  
> >  	if (reset_list)
> >  		xfs_zone_gc_reset_zones(data, reset_list);
> > @@ -1041,7 +1040,6 @@ xfs_zoned_gcd(
> >  	unsigned int		nofs_flag;
> >  
> >  	nofs_flag = memalloc_nofs_save();
> > -	set_freezable();
> >  
> >  	for (;;) {
> >  		set_current_state(TASK_INTERRUPTIBLE | TASK_FREEZABLE);
> 
> Same question here for this newly merged code, too...

I'm not sure if this is supposed to be a snipe or not but just in case
this is a hidden question: This isn't merged. Per the cover letter this
is in a work.* branch. Anything that is considered mergable is in
vfs-6.16.* branches. But since we're pre -rc1 even those branches are
not yet showing up in -next.
Jan Kara April 1, 2025, 9:16 a.m. UTC | #3
On Tue 01-04-25 02:32:46, Christian Brauner wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> The kernel power management now supports allowing the VFS
> to handle filesystem freezing freezes and thawing. Take advantage
> of that and remove the kthread freezing. This is needed so that we
> properly really stop IO in flight without races after userspace
> has been frozen. Without this we rely on kthread freezing and
> its semantics are loose and error prone.
> 
> The filesystem therefore is in charge of properly dealing with
> quiescing of the filesystem through its callbacks if it thinks
> it knows better than how the VFS handles it.
> 
> The following Coccinelle rule was used as to remove the now superfluous
> freezer calls:
> 
> make coccicheck MODE=patch SPFLAGS="--in-place --no-show-diff" COCCI=./fs-freeze-cleanup.cocci M=fs/ext4
> 
> virtual patch
> 
> @ remove_set_freezable @
> expression time;
> statement S, S2;
> expression task, current;
> @@
> 
> (
> -       set_freezable();
> |
> -       if (try_to_freeze())
> -               continue;
> |
> -       try_to_freeze();
> |
> -       freezable_schedule();
> +       schedule();
> |
> -       freezable_schedule_timeout(time);
> +       schedule_timeout(time);
> |
> -       if (freezing(task)) { S }
> |
> -       if (freezing(task)) { S }
> -       else
> 	    { S2 }
> |
> -       freezing(current)
> )
> 
> @ remove_wq_freezable @
> expression WQ_E, WQ_ARG1, WQ_ARG2, WQ_ARG3, WQ_ARG4;
> identifier fs_wq_fn;
> @@
> 
> (
>     WQ_E = alloc_workqueue(WQ_ARG1,
> -                              WQ_ARG2 | WQ_FREEZABLE,
> +                              WQ_ARG2,
> 			   ...);
> |
>     WQ_E = alloc_workqueue(WQ_ARG1,
> -                              WQ_ARG2 | WQ_FREEZABLE | WQ_ARG3,
> +                              WQ_ARG2 | WQ_ARG3,
> 			   ...);
> |
>     WQ_E = alloc_workqueue(WQ_ARG1,
> -                              WQ_ARG2 | WQ_ARG3 | WQ_FREEZABLE,
> +                              WQ_ARG2 | WQ_ARG3,
> 			   ...);
> |
>     WQ_E = alloc_workqueue(WQ_ARG1,
> -                              WQ_ARG2 | WQ_ARG3 | WQ_FREEZABLE | WQ_ARG4,
> +                              WQ_ARG2 | WQ_ARG3 | WQ_ARG4,
> 			   ...);
> |
> 	    WQ_E =
> -               WQ_ARG1 | WQ_FREEZABLE
> +               WQ_ARG1
> |
> 	    WQ_E =
> -               WQ_ARG1 | WQ_FREEZABLE | WQ_ARG3
> +               WQ_ARG1 | WQ_ARG3
> |
>     fs_wq_fn(
> -               WQ_FREEZABLE | WQ_ARG2 | WQ_ARG3
> +               WQ_ARG2 | WQ_ARG3
>     )
> |
>     fs_wq_fn(
> -               WQ_FREEZABLE | WQ_ARG2
> +               WQ_ARG2
>     )
> |
>     fs_wq_fn(
> -               WQ_FREEZABLE
> +               0
>     )
> )
> 
> @ add_auto_flag @
> expression E1;
> identifier fs_type;
> @@
> 
> struct file_system_type fs_type = {
> 	.fs_flags = E1
> +                   | FS_AUTOFREEZE
> 	,
> };
> 
> Generated-by: Coccinelle SmPL
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Link: https://lore.kernel.org/r/20250326112220.1988619-5-mcgrof@kernel.org
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/ext4/mballoc.c | 2 +-
>  fs/ext4/super.c   | 3 ---
>  2 files changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 0d523e9fb3d5..ae235ec5ff3a 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -6782,7 +6782,7 @@ static ext4_grpblk_t ext4_last_grp_cluster(struct super_block *sb,
>  
>  static bool ext4_trim_interrupted(void)
>  {
> -	return fatal_signal_pending(current) || freezing(current);
> +	return fatal_signal_pending(current);
>  }

This change should not happen. ext4_trim_interrupted() makes sure FITRIM
ioctl doesn't cause hibernation failures and has nothing to do with kthread
freezing...

Otherwise the patch looks good.

								Honza
Christian Brauner April 1, 2025, 9:35 a.m. UTC | #4
On Tue, Apr 01, 2025 at 11:16:18AM +0200, Jan Kara wrote:
> On Tue 01-04-25 02:32:46, Christian Brauner wrote:
> > From: Luis Chamberlain <mcgrof@kernel.org>
> > 
> > The kernel power management now supports allowing the VFS
> > to handle filesystem freezing freezes and thawing. Take advantage
> > of that and remove the kthread freezing. This is needed so that we
> > properly really stop IO in flight without races after userspace
> > has been frozen. Without this we rely on kthread freezing and
> > its semantics are loose and error prone.
> > 
> > The filesystem therefore is in charge of properly dealing with
> > quiescing of the filesystem through its callbacks if it thinks
> > it knows better than how the VFS handles it.
> > 
> > The following Coccinelle rule was used as to remove the now superfluous
> > freezer calls:
> > 
> > make coccicheck MODE=patch SPFLAGS="--in-place --no-show-diff" COCCI=./fs-freeze-cleanup.cocci M=fs/ext4
> > 
> > virtual patch
> > 
> > @ remove_set_freezable @
> > expression time;
> > statement S, S2;
> > expression task, current;
> > @@
> > 
> > (
> > -       set_freezable();
> > |
> > -       if (try_to_freeze())
> > -               continue;
> > |
> > -       try_to_freeze();
> > |
> > -       freezable_schedule();
> > +       schedule();
> > |
> > -       freezable_schedule_timeout(time);
> > +       schedule_timeout(time);
> > |
> > -       if (freezing(task)) { S }
> > |
> > -       if (freezing(task)) { S }
> > -       else
> > 	    { S2 }
> > |
> > -       freezing(current)
> > )
> > 
> > @ remove_wq_freezable @
> > expression WQ_E, WQ_ARG1, WQ_ARG2, WQ_ARG3, WQ_ARG4;
> > identifier fs_wq_fn;
> > @@
> > 
> > (
> >     WQ_E = alloc_workqueue(WQ_ARG1,
> > -                              WQ_ARG2 | WQ_FREEZABLE,
> > +                              WQ_ARG2,
> > 			   ...);
> > |
> >     WQ_E = alloc_workqueue(WQ_ARG1,
> > -                              WQ_ARG2 | WQ_FREEZABLE | WQ_ARG3,
> > +                              WQ_ARG2 | WQ_ARG3,
> > 			   ...);
> > |
> >     WQ_E = alloc_workqueue(WQ_ARG1,
> > -                              WQ_ARG2 | WQ_ARG3 | WQ_FREEZABLE,
> > +                              WQ_ARG2 | WQ_ARG3,
> > 			   ...);
> > |
> >     WQ_E = alloc_workqueue(WQ_ARG1,
> > -                              WQ_ARG2 | WQ_ARG3 | WQ_FREEZABLE | WQ_ARG4,
> > +                              WQ_ARG2 | WQ_ARG3 | WQ_ARG4,
> > 			   ...);
> > |
> > 	    WQ_E =
> > -               WQ_ARG1 | WQ_FREEZABLE
> > +               WQ_ARG1
> > |
> > 	    WQ_E =
> > -               WQ_ARG1 | WQ_FREEZABLE | WQ_ARG3
> > +               WQ_ARG1 | WQ_ARG3
> > |
> >     fs_wq_fn(
> > -               WQ_FREEZABLE | WQ_ARG2 | WQ_ARG3
> > +               WQ_ARG2 | WQ_ARG3
> >     )
> > |
> >     fs_wq_fn(
> > -               WQ_FREEZABLE | WQ_ARG2
> > +               WQ_ARG2
> >     )
> > |
> >     fs_wq_fn(
> > -               WQ_FREEZABLE
> > +               0
> >     )
> > )
> > 
> > @ add_auto_flag @
> > expression E1;
> > identifier fs_type;
> > @@
> > 
> > struct file_system_type fs_type = {
> > 	.fs_flags = E1
> > +                   | FS_AUTOFREEZE
> > 	,
> > };
> > 
> > Generated-by: Coccinelle SmPL
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > Link: https://lore.kernel.org/r/20250326112220.1988619-5-mcgrof@kernel.org
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> >  fs/ext4/mballoc.c | 2 +-
> >  fs/ext4/super.c   | 3 ---
> >  2 files changed, 1 insertion(+), 4 deletions(-)
> > 
> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > index 0d523e9fb3d5..ae235ec5ff3a 100644
> > --- a/fs/ext4/mballoc.c
> > +++ b/fs/ext4/mballoc.c
> > @@ -6782,7 +6782,7 @@ static ext4_grpblk_t ext4_last_grp_cluster(struct super_block *sb,
> >  
> >  static bool ext4_trim_interrupted(void)
> >  {
> > -	return fatal_signal_pending(current) || freezing(current);
> > +	return fatal_signal_pending(current);
> >  }
> 
> This change should not happen. ext4_trim_interrupted() makes sure FITRIM
> ioctl doesn't cause hibernation failures and has nothing to do with kthread
> freezing...
> 
> Otherwise the patch looks good.

Afaict, we don't have to do these changes now. Yes, once fsfreeze
reliably works in the suspend/resume codepaths then we can switch all
that off and remove the old freezer. But we should only do that once we
have some experience with the new filesystem freezing during
suspend/hibernate. So we should place this under a
/sys/power/freeze_filesystems knob and wait a few kernel releases to see
whether we see significant problems. How does that sound to you?
Dave Chinner April 1, 2025, 11:35 a.m. UTC | #5
On Tue, Apr 01, 2025 at 09:17:12AM +0200, Christian Brauner wrote:
> On Tue, Apr 01, 2025 at 12:11:04PM +1100, Dave Chinner wrote:
> > On Tue, Apr 01, 2025 at 02:32:48AM +0200, Christian Brauner wrote:
> > > diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
> > > index c5136ea9bb1d..1875b6551ab0 100644
> > > --- a/fs/xfs/xfs_zone_gc.c
> > > +++ b/fs/xfs/xfs_zone_gc.c
> > > @@ -993,7 +993,6 @@ xfs_zone_gc_handle_work(
> > >  	}
> > >  
> > >  	__set_current_state(TASK_RUNNING);
> > > -	try_to_freeze();
> > >  
> > >  	if (reset_list)
> > >  		xfs_zone_gc_reset_zones(data, reset_list);
> > > @@ -1041,7 +1040,6 @@ xfs_zoned_gcd(
> > >  	unsigned int		nofs_flag;
> > >  
> > >  	nofs_flag = memalloc_nofs_save();
> > > -	set_freezable();
> > >  
> > >  	for (;;) {
> > >  		set_current_state(TASK_INTERRUPTIBLE | TASK_FREEZABLE);
> > 
> > Same question here for this newly merged code, too...
>
> I'm not sure if this is supposed to be a snipe or not but just in case
> this is a hidden question:

No, I meant that this is changing shiny new just-merged XFS code
(part of zone device support). It only just arrived this merge
window and is largely just doing the same thing as the older aild
code. It is probably safe to assume that this new code has never
been tested against hibernate...

-Dave.
Peter Zijlstra April 1, 2025, 2:14 p.m. UTC | #6
On Tue, Apr 01, 2025 at 02:32:45AM +0200, Christian Brauner wrote:
> The whole shebang can also be found at:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> 
> I know nothing about power or hibernation. I've tested it as best as I
> could. Works for me (TM).
> 
> I need to catch some actual sleep now...
> 
> ---
> 
> Now all the pieces are in place to actually allow the power subsystem to
> freeze/thaw filesystems during suspend/resume. Filesystems are only
> frozen and thawed if the power subsystem does actually own the freeze.

Urgh, I was relying on all kthreads to be freezable for live-patching:

  https://lkml.kernel.org/r/20250324134909.GA14718@noisy.programming.kicks-ass.net

So I understand the problem with freezing filesystems, but can't we
leave the TASK_FREEZABLE in the kthreads? The way I understand it, the
power subsystem will first freeze the filesystems before it goes freeze
threads anyway. So them remaining freezable should not affect anything,
right?