mbox series

[v2,0/2] ceph: fix fscrypt_destroy_keyring use-after-free bug

Message ID 20230606033212.1068823-1-xiubli@redhat.com
Headers show
Series ceph: fix fscrypt_destroy_keyring use-after-free bug | expand

Message

Xiubo Li June 6, 2023, 3:32 a.m. UTC
From: Xiubo Li <xiubli@redhat.com>

V2:
- Improve the code by switching to wait_for_completion_killable_timeout()
  when umounting, at the same add one umount_timeout option.
- Improve the inc/dec helpers for metadata/IO cases.


Xiubo Li (2):
  ceph: drop the messages from MDS when unmounting
  ceph: just wait the osd requests' callbacks to finish when unmounting

 fs/ceph/addr.c       | 10 +++++
 fs/ceph/caps.c       |  6 ++-
 fs/ceph/mds_client.c | 14 +++++--
 fs/ceph/mds_client.h | 11 ++++-
 fs/ceph/quota.c      | 14 +++----
 fs/ceph/snap.c       | 10 +++--
 fs/ceph/super.c      | 99 +++++++++++++++++++++++++++++++++++++++++++-
 fs/ceph/super.h      |  7 ++++
 8 files changed, 154 insertions(+), 17 deletions(-)

Comments

Milind Changire June 6, 2023, 7:47 a.m. UTC | #1
nits in commit message:
* s/When unmounting and all/When unmounting and ???/
Is something missing after *and* or has been dropped ?
Please remove *and* if something has been dropped or
add it after *and* if it was forgotten to be added

* s/ihode/ihold/

-----
Otherwise, it looks good to me.

Reviewed-by: Milind Changire <mchangir@redhat.com>

On Tue, Jun 6, 2023 at 9:04 AM <xiubli@redhat.com> wrote:
>
> From: Xiubo Li <xiubli@redhat.com>
>
> When unmounting and all the dirty buffer will be flushed and after
> the last osd request is finished the last reference of the i_count
> will be released. Then it will flush the dirty cap/snap to MDSs,
> and the unmounting won't wait the possible acks, which will ihode
> the inodes when updating the metadata locally but makes no sense
> any more, of this. This will make the evict_inodes() to skip these
> inodes.
>
> If encrypt is enabled the kernel generate a warning when removing
> the encrypt keys when the skipped inodes still hold the keyring:
>
> WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
> CPU: 4 PID: 168846 Comm: umount Tainted: G S  6.1.0-rc5-ceph-g72ead199864c #1
> Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
> RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
> RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
> RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
> RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
> RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
> R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
> R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
> FS:  00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> <TASK>
> generic_shutdown_super+0x47/0x120
> kill_anon_super+0x14/0x30
> ceph_kill_sb+0x36/0x90 [ceph]
> deactivate_locked_super+0x29/0x60
> cleanup_mnt+0xb8/0x140
> task_work_run+0x67/0xb0
> exit_to_user_mode_prepare+0x23d/0x240
> syscall_exit_to_user_mode+0x25/0x60
> do_syscall_64+0x40/0x80
> entry_SYSCALL_64_after_hwframe+0x63/0xcd
> RIP: 0033:0x7fd83dc39e9b
>
> Later the kernel will crash when iput() the inodes and dereferencing
> the "sb->s_master_keys", which has been released by the
> generic_shutdown_super().
>
> URL: https://tracker.ceph.com/issues/58126
> URL: https://tracker.ceph.com/issues/59162
> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> ---
>  fs/ceph/caps.c       |  6 ++-
>  fs/ceph/mds_client.c | 14 +++++--
>  fs/ceph/mds_client.h | 11 +++++-
>  fs/ceph/quota.c      | 14 +++----
>  fs/ceph/snap.c       | 10 +++--
>  fs/ceph/super.c      | 88 +++++++++++++++++++++++++++++++++++++++++++-
>  fs/ceph/super.h      |  5 +++
>  7 files changed, 131 insertions(+), 17 deletions(-)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index ccd4791a5e71..74ba335af25d 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -4235,6 +4235,9 @@ void ceph_handle_caps(struct ceph_mds_session *session,
>
>         dout("handle_caps from mds%d\n", session->s_mds);
>
> +       if (!ceph_inc_mds_stopping_blocker(mdsc, session))
> +               return;
> +
>         /* decode */
>         end = msg->front.iov_base + msg->front.iov_len;
>         if (msg->front.iov_len < sizeof(*h))
> @@ -4336,7 +4339,6 @@ void ceph_handle_caps(struct ceph_mds_session *session,
>              vino.snap, inode);
>
>         mutex_lock(&session->s_mutex);
> -       inc_session_sequence(session);
>         dout(" mds%d seq %lld cap seq %u\n", session->s_mds, session->s_seq,
>              (unsigned)seq);
>
> @@ -4448,6 +4450,8 @@ void ceph_handle_caps(struct ceph_mds_session *session,
>  done_unlocked:
>         iput(inode);
>  out:
> +       ceph_dec_mds_stopping_blocker(mdsc);
> +
>         ceph_put_string(extra_info.pool_ns);
>
>         /* Defer closing the sessions after s_mutex lock being released */
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 6a82b0708343..0a70a2438cb2 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -4904,6 +4904,9 @@ static void handle_lease(struct ceph_mds_client *mdsc,
>
>         dout("handle_lease from mds%d\n", mds);
>
> +       if (!ceph_inc_mds_stopping_blocker(mdsc, session))
> +               return;
> +
>         /* decode */
>         if (msg->front.iov_len < sizeof(*h) + sizeof(u32))
>                 goto bad;
> @@ -4922,8 +4925,6 @@ static void handle_lease(struct ceph_mds_client *mdsc,
>              dname.len, dname.name);
>
>         mutex_lock(&session->s_mutex);
> -       inc_session_sequence(session);
> -
>         if (!inode) {
>                 dout("handle_lease no inode %llx\n", vino.ino);
>                 goto release;
> @@ -4985,9 +4986,13 @@ static void handle_lease(struct ceph_mds_client *mdsc,
>  out:
>         mutex_unlock(&session->s_mutex);
>         iput(inode);
> +
> +       ceph_dec_mds_stopping_blocker(mdsc);
>         return;
>
>  bad:
> +       ceph_dec_mds_stopping_blocker(mdsc);
> +
>         pr_err("corrupt lease message\n");
>         ceph_msg_dump(msg);
>  }
> @@ -5183,6 +5188,9 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
>         }
>
>         init_completion(&mdsc->safe_umount_waiters);
> +       spin_lock_init(&mdsc->stopping_lock);
> +       atomic_set(&mdsc->stopping_blockers, 0);
> +       init_completion(&mdsc->stopping_waiter);
>         init_waitqueue_head(&mdsc->session_close_wq);
>         INIT_LIST_HEAD(&mdsc->waiting_for_map);
>         mdsc->quotarealms_inodes = RB_ROOT;
> @@ -5297,7 +5305,7 @@ void send_flush_mdlog(struct ceph_mds_session *s)
>  void ceph_mdsc_pre_umount(struct ceph_mds_client *mdsc)
>  {
>         dout("pre_umount\n");
> -       mdsc->stopping = 1;
> +       mdsc->stopping = CEPH_MDSC_STOPPING_BEGIN;
>
>         ceph_mdsc_iterate_sessions(mdsc, send_flush_mdlog, true);
>         ceph_mdsc_iterate_sessions(mdsc, lock_unlock_session, false);
> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> index 82165f09a516..351d92f7fc4f 100644
> --- a/fs/ceph/mds_client.h
> +++ b/fs/ceph/mds_client.h
> @@ -398,6 +398,11 @@ struct cap_wait {
>         int                     want;
>  };
>
> +enum {
> +       CEPH_MDSC_STOPPING_BEGIN = 1,
> +       CEPH_MDSC_STOPPING_FLUSHED = 2,
> +};
> +
>  /*
>   * mds client state
>   */
> @@ -414,7 +419,11 @@ struct ceph_mds_client {
>         struct ceph_mds_session **sessions;    /* NULL for mds if no session */
>         atomic_t                num_sessions;
>         int                     max_sessions;  /* len of sessions array */
> -       int                     stopping;      /* true if shutting down */
> +
> +       spinlock_t              stopping_lock;  /* protect snap_empty */
> +       int                     stopping;      /* the stage of shutting down */
> +       atomic_t                stopping_blockers;
> +       struct completion       stopping_waiter;
>
>         atomic64_t              quotarealms_count; /* # realms with quota */
>         /*
> diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c
> index 64592adfe48f..f7fcf7f08ec6 100644
> --- a/fs/ceph/quota.c
> +++ b/fs/ceph/quota.c
> @@ -47,25 +47,23 @@ void ceph_handle_quota(struct ceph_mds_client *mdsc,
>         struct inode *inode;
>         struct ceph_inode_info *ci;
>
> +       if (!ceph_inc_mds_stopping_blocker(mdsc, session))
> +               return;
> +
>         if (msg->front.iov_len < sizeof(*h)) {
>                 pr_err("%s corrupt message mds%d len %d\n", __func__,
>                        session->s_mds, (int)msg->front.iov_len);
>                 ceph_msg_dump(msg);
> -               return;
> +               goto out;
>         }
>
> -       /* increment msg sequence number */
> -       mutex_lock(&session->s_mutex);
> -       inc_session_sequence(session);
> -       mutex_unlock(&session->s_mutex);
> -
>         /* lookup inode */
>         vino.ino = le64_to_cpu(h->ino);
>         vino.snap = CEPH_NOSNAP;
>         inode = ceph_find_inode(sb, vino);
>         if (!inode) {
>                 pr_warn("Failed to find inode %llu\n", vino.ino);
> -               return;
> +               goto out;
>         }
>         ci = ceph_inode(inode);
>
> @@ -78,6 +76,8 @@ void ceph_handle_quota(struct ceph_mds_client *mdsc,
>         spin_unlock(&ci->i_ceph_lock);
>
>         iput(inode);
> +out:
> +       ceph_dec_mds_stopping_blocker(mdsc);
>  }
>
>  static struct ceph_quotarealm_inode *
> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> index a6af991116b9..e03b020d87d7 100644
> --- a/fs/ceph/snap.c
> +++ b/fs/ceph/snap.c
> @@ -1016,6 +1016,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>         int locked_rwsem = 0;
>         bool close_sessions = false;
>
> +       if (!ceph_inc_mds_stopping_blocker(mdsc, session))
> +               return;
> +
>         /* decode */
>         if (msg->front.iov_len < sizeof(*h))
>                 goto bad;
> @@ -1031,10 +1034,6 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>         dout("%s from mds%d op %s split %llx tracelen %d\n", __func__,
>              mds, ceph_snap_op_name(op), split, trace_len);
>
> -       mutex_lock(&session->s_mutex);
> -       inc_session_sequence(session);
> -       mutex_unlock(&session->s_mutex);
> -
>         down_write(&mdsc->snap_rwsem);
>         locked_rwsem = 1;
>
> @@ -1152,12 +1151,15 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>         up_write(&mdsc->snap_rwsem);
>
>         flush_snaps(mdsc);
> +       ceph_dec_mds_stopping_blocker(mdsc);
>         return;
>
>  bad:
>         pr_err("%s corrupt snap message from mds%d\n", __func__, mds);
>         ceph_msg_dump(msg);
>  out:
> +       ceph_dec_mds_stopping_blocker(mdsc);
> +
>         if (locked_rwsem)
>                 up_write(&mdsc->snap_rwsem);
>
> diff --git a/fs/ceph/super.c b/fs/ceph/super.c
> index b606b33a253b..d3f54f3d7b17 100644
> --- a/fs/ceph/super.c
> +++ b/fs/ceph/super.c
> @@ -169,6 +169,7 @@ enum {
>         Opt_wsync,
>         Opt_pagecache,
>         Opt_sparseread,
> +       Opt_umount_timeout,
>  };
>
>  enum ceph_recover_session_mode {
> @@ -214,6 +215,7 @@ static const struct fs_parameter_spec ceph_mount_parameters[] = {
>         fsparam_flag_no ("wsync",                       Opt_wsync),
>         fsparam_flag_no ("pagecache",                   Opt_pagecache),
>         fsparam_flag_no ("sparseread",                  Opt_sparseread),
> +       fsparam_u32     ("umount_timeout",              Opt_umount_timeout),
>         {}
>  };
>
> @@ -600,6 +602,12 @@ static int ceph_parse_mount_param(struct fs_context *fc,
>                 else
>                         fsopt->flags |= CEPH_MOUNT_OPT_SPARSEREAD;
>                 break;
> +       case Opt_umount_timeout:
> +               /* 0 is "wait forever" (i.e. infinite timeout) */
> +               if (result.uint_32 > INT_MAX / 1000)
> +                       goto out_of_range;
> +               fsopt->umount_timeout = msecs_to_jiffies(result.uint_32 * 1000);
> +               break;
>         case Opt_test_dummy_encryption:
>  #ifdef CONFIG_FS_ENCRYPTION
>                 fscrypt_free_dummy_policy(&fsopt->dummy_enc_policy);
> @@ -779,6 +787,8 @@ static int ceph_show_options(struct seq_file *m, struct dentry *root)
>                 seq_printf(m, ",readdir_max_entries=%u", fsopt->max_readdir);
>         if (fsopt->max_readdir_bytes != CEPH_MAX_READDIR_BYTES_DEFAULT)
>                 seq_printf(m, ",readdir_max_bytes=%u", fsopt->max_readdir_bytes);
> +       if (fsopt->umount_timeout != CEPH_UMOUNT_TIMEOUT_DEFAULT)
> +               seq_printf(m, ",umount_timeout=%u", fsopt->umount_timeout);
>         if (strcmp(fsopt->snapdir_name, CEPH_SNAPDIRNAME_DEFAULT))
>                 seq_show_option(m, "snapdirname", fsopt->snapdir_name);
>
> @@ -1456,6 +1466,7 @@ static int ceph_init_fs_context(struct fs_context *fc)
>         fsopt->max_readdir = CEPH_MAX_READDIR_DEFAULT;
>         fsopt->max_readdir_bytes = CEPH_MAX_READDIR_BYTES_DEFAULT;
>         fsopt->congestion_kb = default_congestion_kb();
> +       fsopt->umount_timeout = CEPH_UMOUNT_TIMEOUT_DEFAULT;
>
>  #ifdef CONFIG_CEPH_FS_POSIX_ACL
>         fc->sb_flags |= SB_POSIXACL;
> @@ -1472,15 +1483,90 @@ static int ceph_init_fs_context(struct fs_context *fc)
>         return -ENOMEM;
>  }
>
> +/*
> + * Return true if it successfully increases the blocker counter,
> + * or false if the mdsc is in stopping and flushed state.
> + */
> +static bool __inc_stopping_blocker(struct ceph_mds_client *mdsc)
> +{
> +       spin_lock(&mdsc->stopping_lock);
> +       if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED) {
> +               spin_unlock(&mdsc->stopping_lock);
> +               return false;
> +       }
> +       atomic_inc(&mdsc->stopping_blockers);
> +       spin_unlock(&mdsc->stopping_lock);
> +       return true;
> +}
> +
> +static void __dec_stopping_blocker(struct ceph_mds_client *mdsc)
> +{
> +       spin_lock(&mdsc->stopping_lock);
> +       if (!atomic_dec_return(&mdsc->stopping_blockers) &&
> +           mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
> +               complete_all(&mdsc->stopping_waiter);
> +       spin_unlock(&mdsc->stopping_lock);
> +}
> +
> +/* For metadata IO requests */
> +bool ceph_inc_mds_stopping_blocker(struct ceph_mds_client *mdsc,
> +                                  struct ceph_mds_session *session)
> +{
> +       mutex_lock(&session->s_mutex);
> +       inc_session_sequence(session);
> +       mutex_unlock(&session->s_mutex);
> +
> +       return __inc_stopping_blocker(mdsc);
> +}
> +
> +void ceph_dec_mds_stopping_blocker(struct ceph_mds_client *mdsc)
> +{
> +       __dec_stopping_blocker(mdsc);
> +}
> +
>  static void ceph_kill_sb(struct super_block *s)
>  {
>         struct ceph_fs_client *fsc = ceph_sb_to_client(s);
> +       struct ceph_mds_client *mdsc = fsc->mdsc;
> +       struct ceph_mount_options *opt = fsc->mount_options;
> +       bool wait;
>
>         dout("kill_sb %p\n", s);
>
> -       ceph_mdsc_pre_umount(fsc->mdsc);
> +       ceph_mdsc_pre_umount(mdsc);
>         flush_fs_workqueues(fsc);
>
> +       /*
> +        * Though the kill_anon_super() will finally trigger the
> +        * sync_filesystem() anyway, we still need to do it here and
> +        * then bump the stage of shutdown. This will allow us to
> +        * drop any further message, which will increase the inodes'
> +        * i_count reference counters but makes no sense any more,
> +        * from MDSs.
> +        *
> +        * Without this when evicting the inodes it may fail in the
> +        * kill_anon_super(), which will trigger a warning when
> +        * destroying the fscrypt keyring and then possibly trigger
> +        * a further crash in ceph module when the iput() tries to
> +        * evict the inodes later.
> +        */
> +       sync_filesystem(s);
> +
> +       spin_lock(&mdsc->stopping_lock);
> +       mdsc->stopping = CEPH_MDSC_STOPPING_FLUSHED;
> +       wait = !!atomic_read(&mdsc->stopping_blockers);
> +       spin_unlock(&mdsc->stopping_lock);
> +
> +       if (wait && atomic_read(&mdsc->stopping_blockers)) {
> +               long timeleft = wait_for_completion_killable_timeout(
> +                                       &mdsc->stopping_waiter,
> +                                       opt->umount_timeout);
> +               if (!timeleft) /* timed out */
> +                       pr_warn("umount timed out, %ld\n", timeleft);
> +               else if (timeleft < 0) /* killed */
> +                       pr_warn("umount was killed, %ld\n", timeleft);
> +       }
> +
>         kill_anon_super(s);
>
>         fsc->client->extra_mon_dispatch = NULL;
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index a0688dc63fa0..cd5b88d819ca 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -66,6 +66,7 @@
>  #define CEPH_MAX_READDIR_DEFAULT        1024
>  #define CEPH_MAX_READDIR_BYTES_DEFAULT  (512*1024)
>  #define CEPH_SNAPDIRNAME_DEFAULT        ".snap"
> +#define CEPH_UMOUNT_TIMEOUT_DEFAULT     msecs_to_jiffies(60 * 1000)
>
>  /*
>   * Delay telling the MDS we no longer want caps, in case we reopen
> @@ -87,6 +88,7 @@ struct ceph_mount_options {
>         int caps_max;
>         unsigned int max_readdir;       /* max readdir result (entries) */
>         unsigned int max_readdir_bytes; /* max readdir result (bytes) */
> +       unsigned int umount_timeout;    /* jiffies */
>
>         bool new_dev_syntax;
>
> @@ -1413,4 +1415,7 @@ extern bool ceph_quota_update_statfs(struct ceph_fs_client *fsc,
>                                      struct kstatfs *buf);
>  extern void ceph_cleanup_quotarealms_inodes(struct ceph_mds_client *mdsc);
>
> +bool ceph_inc_mds_stopping_blocker(struct ceph_mds_client *mdsc,
> +                              struct ceph_mds_session *session);
> +void ceph_dec_mds_stopping_blocker(struct ceph_mds_client *mdsc);
>  #endif /* _FS_CEPH_SUPER_H */
> --
> 2.40.1
>
Xiubo Li June 6, 2023, 8:50 a.m. UTC | #2
On 6/6/23 15:47, Milind Changire wrote:
> nits in commit message:
> * s/When unmounting and all/When unmounting and ???/
> Is something missing after *and* or has been dropped ?
> Please remove *and* if something has been dropped or
> add it after *and* if it was forgotten to be added
>
> * s/ihode/ihold/

Good catch, will fix it.

Thanks

- Xiubo


> -----
> Otherwise, it looks good to me.
>
> Reviewed-by: Milind Changire <mchangir@redhat.com>
>
> On Tue, Jun 6, 2023 at 9:04 AM <xiubli@redhat.com> wrote:
>> From: Xiubo Li <xiubli@redhat.com>
>>
>> When unmounting and all the dirty buffer will be flushed and after
>> the last osd request is finished the last reference of the i_count
>> will be released. Then it will flush the dirty cap/snap to MDSs,
>> and the unmounting won't wait the possible acks, which will ihode
>> the inodes when updating the metadata locally but makes no sense
>> any more, of this. This will make the evict_inodes() to skip these
>> inodes.
>>
>> If encrypt is enabled the kernel generate a warning when removing
>> the encrypt keys when the skipped inodes still hold the keyring:
>>
>> WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
>> CPU: 4 PID: 168846 Comm: umount Tainted: G S  6.1.0-rc5-ceph-g72ead199864c #1
>> Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
>> RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
>> RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
>> RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
>> RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
>> RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
>> R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
>> R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
>> FS:  00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Call Trace:
>> <TASK>
>> generic_shutdown_super+0x47/0x120
>> kill_anon_super+0x14/0x30
>> ceph_kill_sb+0x36/0x90 [ceph]
>> deactivate_locked_super+0x29/0x60
>> cleanup_mnt+0xb8/0x140
>> task_work_run+0x67/0xb0
>> exit_to_user_mode_prepare+0x23d/0x240
>> syscall_exit_to_user_mode+0x25/0x60
>> do_syscall_64+0x40/0x80
>> entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> RIP: 0033:0x7fd83dc39e9b
>>
>> Later the kernel will crash when iput() the inodes and dereferencing
>> the "sb->s_master_keys", which has been released by the
>> generic_shutdown_super().
>>
>> URL: https://tracker.ceph.com/issues/58126
>> URL: https://tracker.ceph.com/issues/59162
>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>> ---
>>   fs/ceph/caps.c       |  6 ++-
>>   fs/ceph/mds_client.c | 14 +++++--
>>   fs/ceph/mds_client.h | 11 +++++-
>>   fs/ceph/quota.c      | 14 +++----
>>   fs/ceph/snap.c       | 10 +++--
>>   fs/ceph/super.c      | 88 +++++++++++++++++++++++++++++++++++++++++++-
>>   fs/ceph/super.h      |  5 +++
>>   7 files changed, 131 insertions(+), 17 deletions(-)
>>
>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>> index ccd4791a5e71..74ba335af25d 100644
>> --- a/fs/ceph/caps.c
>> +++ b/fs/ceph/caps.c
>> @@ -4235,6 +4235,9 @@ void ceph_handle_caps(struct ceph_mds_session *session,
>>
>>          dout("handle_caps from mds%d\n", session->s_mds);
>>
>> +       if (!ceph_inc_mds_stopping_blocker(mdsc, session))
>> +               return;
>> +
>>          /* decode */
>>          end = msg->front.iov_base + msg->front.iov_len;
>>          if (msg->front.iov_len < sizeof(*h))
>> @@ -4336,7 +4339,6 @@ void ceph_handle_caps(struct ceph_mds_session *session,
>>               vino.snap, inode);
>>
>>          mutex_lock(&session->s_mutex);
>> -       inc_session_sequence(session);
>>          dout(" mds%d seq %lld cap seq %u\n", session->s_mds, session->s_seq,
>>               (unsigned)seq);
>>
>> @@ -4448,6 +4450,8 @@ void ceph_handle_caps(struct ceph_mds_session *session,
>>   done_unlocked:
>>          iput(inode);
>>   out:
>> +       ceph_dec_mds_stopping_blocker(mdsc);
>> +
>>          ceph_put_string(extra_info.pool_ns);
>>
>>          /* Defer closing the sessions after s_mutex lock being released */
>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>> index 6a82b0708343..0a70a2438cb2 100644
>> --- a/fs/ceph/mds_client.c
>> +++ b/fs/ceph/mds_client.c
>> @@ -4904,6 +4904,9 @@ static void handle_lease(struct ceph_mds_client *mdsc,
>>
>>          dout("handle_lease from mds%d\n", mds);
>>
>> +       if (!ceph_inc_mds_stopping_blocker(mdsc, session))
>> +               return;
>> +
>>          /* decode */
>>          if (msg->front.iov_len < sizeof(*h) + sizeof(u32))
>>                  goto bad;
>> @@ -4922,8 +4925,6 @@ static void handle_lease(struct ceph_mds_client *mdsc,
>>               dname.len, dname.name);
>>
>>          mutex_lock(&session->s_mutex);
>> -       inc_session_sequence(session);
>> -
>>          if (!inode) {
>>                  dout("handle_lease no inode %llx\n", vino.ino);
>>                  goto release;
>> @@ -4985,9 +4986,13 @@ static void handle_lease(struct ceph_mds_client *mdsc,
>>   out:
>>          mutex_unlock(&session->s_mutex);
>>          iput(inode);
>> +
>> +       ceph_dec_mds_stopping_blocker(mdsc);
>>          return;
>>
>>   bad:
>> +       ceph_dec_mds_stopping_blocker(mdsc);
>> +
>>          pr_err("corrupt lease message\n");
>>          ceph_msg_dump(msg);
>>   }
>> @@ -5183,6 +5188,9 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
>>          }
>>
>>          init_completion(&mdsc->safe_umount_waiters);
>> +       spin_lock_init(&mdsc->stopping_lock);
>> +       atomic_set(&mdsc->stopping_blockers, 0);
>> +       init_completion(&mdsc->stopping_waiter);
>>          init_waitqueue_head(&mdsc->session_close_wq);
>>          INIT_LIST_HEAD(&mdsc->waiting_for_map);
>>          mdsc->quotarealms_inodes = RB_ROOT;
>> @@ -5297,7 +5305,7 @@ void send_flush_mdlog(struct ceph_mds_session *s)
>>   void ceph_mdsc_pre_umount(struct ceph_mds_client *mdsc)
>>   {
>>          dout("pre_umount\n");
>> -       mdsc->stopping = 1;
>> +       mdsc->stopping = CEPH_MDSC_STOPPING_BEGIN;
>>
>>          ceph_mdsc_iterate_sessions(mdsc, send_flush_mdlog, true);
>>          ceph_mdsc_iterate_sessions(mdsc, lock_unlock_session, false);
>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
>> index 82165f09a516..351d92f7fc4f 100644
>> --- a/fs/ceph/mds_client.h
>> +++ b/fs/ceph/mds_client.h
>> @@ -398,6 +398,11 @@ struct cap_wait {
>>          int                     want;
>>   };
>>
>> +enum {
>> +       CEPH_MDSC_STOPPING_BEGIN = 1,
>> +       CEPH_MDSC_STOPPING_FLUSHED = 2,
>> +};
>> +
>>   /*
>>    * mds client state
>>    */
>> @@ -414,7 +419,11 @@ struct ceph_mds_client {
>>          struct ceph_mds_session **sessions;    /* NULL for mds if no session */
>>          atomic_t                num_sessions;
>>          int                     max_sessions;  /* len of sessions array */
>> -       int                     stopping;      /* true if shutting down */
>> +
>> +       spinlock_t              stopping_lock;  /* protect snap_empty */
>> +       int                     stopping;      /* the stage of shutting down */
>> +       atomic_t                stopping_blockers;
>> +       struct completion       stopping_waiter;
>>
>>          atomic64_t              quotarealms_count; /* # realms with quota */
>>          /*
>> diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c
>> index 64592adfe48f..f7fcf7f08ec6 100644
>> --- a/fs/ceph/quota.c
>> +++ b/fs/ceph/quota.c
>> @@ -47,25 +47,23 @@ void ceph_handle_quota(struct ceph_mds_client *mdsc,
>>          struct inode *inode;
>>          struct ceph_inode_info *ci;
>>
>> +       if (!ceph_inc_mds_stopping_blocker(mdsc, session))
>> +               return;
>> +
>>          if (msg->front.iov_len < sizeof(*h)) {
>>                  pr_err("%s corrupt message mds%d len %d\n", __func__,
>>                         session->s_mds, (int)msg->front.iov_len);
>>                  ceph_msg_dump(msg);
>> -               return;
>> +               goto out;
>>          }
>>
>> -       /* increment msg sequence number */
>> -       mutex_lock(&session->s_mutex);
>> -       inc_session_sequence(session);
>> -       mutex_unlock(&session->s_mutex);
>> -
>>          /* lookup inode */
>>          vino.ino = le64_to_cpu(h->ino);
>>          vino.snap = CEPH_NOSNAP;
>>          inode = ceph_find_inode(sb, vino);
>>          if (!inode) {
>>                  pr_warn("Failed to find inode %llu\n", vino.ino);
>> -               return;
>> +               goto out;
>>          }
>>          ci = ceph_inode(inode);
>>
>> @@ -78,6 +76,8 @@ void ceph_handle_quota(struct ceph_mds_client *mdsc,
>>          spin_unlock(&ci->i_ceph_lock);
>>
>>          iput(inode);
>> +out:
>> +       ceph_dec_mds_stopping_blocker(mdsc);
>>   }
>>
>>   static struct ceph_quotarealm_inode *
>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>> index a6af991116b9..e03b020d87d7 100644
>> --- a/fs/ceph/snap.c
>> +++ b/fs/ceph/snap.c
>> @@ -1016,6 +1016,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>>          int locked_rwsem = 0;
>>          bool close_sessions = false;
>>
>> +       if (!ceph_inc_mds_stopping_blocker(mdsc, session))
>> +               return;
>> +
>>          /* decode */
>>          if (msg->front.iov_len < sizeof(*h))
>>                  goto bad;
>> @@ -1031,10 +1034,6 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>>          dout("%s from mds%d op %s split %llx tracelen %d\n", __func__,
>>               mds, ceph_snap_op_name(op), split, trace_len);
>>
>> -       mutex_lock(&session->s_mutex);
>> -       inc_session_sequence(session);
>> -       mutex_unlock(&session->s_mutex);
>> -
>>          down_write(&mdsc->snap_rwsem);
>>          locked_rwsem = 1;
>>
>> @@ -1152,12 +1151,15 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>>          up_write(&mdsc->snap_rwsem);
>>
>>          flush_snaps(mdsc);
>> +       ceph_dec_mds_stopping_blocker(mdsc);
>>          return;
>>
>>   bad:
>>          pr_err("%s corrupt snap message from mds%d\n", __func__, mds);
>>          ceph_msg_dump(msg);
>>   out:
>> +       ceph_dec_mds_stopping_blocker(mdsc);
>> +
>>          if (locked_rwsem)
>>                  up_write(&mdsc->snap_rwsem);
>>
>> diff --git a/fs/ceph/super.c b/fs/ceph/super.c
>> index b606b33a253b..d3f54f3d7b17 100644
>> --- a/fs/ceph/super.c
>> +++ b/fs/ceph/super.c
>> @@ -169,6 +169,7 @@ enum {
>>          Opt_wsync,
>>          Opt_pagecache,
>>          Opt_sparseread,
>> +       Opt_umount_timeout,
>>   };
>>
>>   enum ceph_recover_session_mode {
>> @@ -214,6 +215,7 @@ static const struct fs_parameter_spec ceph_mount_parameters[] = {
>>          fsparam_flag_no ("wsync",                       Opt_wsync),
>>          fsparam_flag_no ("pagecache",                   Opt_pagecache),
>>          fsparam_flag_no ("sparseread",                  Opt_sparseread),
>> +       fsparam_u32     ("umount_timeout",              Opt_umount_timeout),
>>          {}
>>   };
>>
>> @@ -600,6 +602,12 @@ static int ceph_parse_mount_param(struct fs_context *fc,
>>                  else
>>                          fsopt->flags |= CEPH_MOUNT_OPT_SPARSEREAD;
>>                  break;
>> +       case Opt_umount_timeout:
>> +               /* 0 is "wait forever" (i.e. infinite timeout) */
>> +               if (result.uint_32 > INT_MAX / 1000)
>> +                       goto out_of_range;
>> +               fsopt->umount_timeout = msecs_to_jiffies(result.uint_32 * 1000);
>> +               break;
>>          case Opt_test_dummy_encryption:
>>   #ifdef CONFIG_FS_ENCRYPTION
>>                  fscrypt_free_dummy_policy(&fsopt->dummy_enc_policy);
>> @@ -779,6 +787,8 @@ static int ceph_show_options(struct seq_file *m, struct dentry *root)
>>                  seq_printf(m, ",readdir_max_entries=%u", fsopt->max_readdir);
>>          if (fsopt->max_readdir_bytes != CEPH_MAX_READDIR_BYTES_DEFAULT)
>>                  seq_printf(m, ",readdir_max_bytes=%u", fsopt->max_readdir_bytes);
>> +       if (fsopt->umount_timeout != CEPH_UMOUNT_TIMEOUT_DEFAULT)
>> +               seq_printf(m, ",umount_timeout=%u", fsopt->umount_timeout);
>>          if (strcmp(fsopt->snapdir_name, CEPH_SNAPDIRNAME_DEFAULT))
>>                  seq_show_option(m, "snapdirname", fsopt->snapdir_name);
>>
>> @@ -1456,6 +1466,7 @@ static int ceph_init_fs_context(struct fs_context *fc)
>>          fsopt->max_readdir = CEPH_MAX_READDIR_DEFAULT;
>>          fsopt->max_readdir_bytes = CEPH_MAX_READDIR_BYTES_DEFAULT;
>>          fsopt->congestion_kb = default_congestion_kb();
>> +       fsopt->umount_timeout = CEPH_UMOUNT_TIMEOUT_DEFAULT;
>>
>>   #ifdef CONFIG_CEPH_FS_POSIX_ACL
>>          fc->sb_flags |= SB_POSIXACL;
>> @@ -1472,15 +1483,90 @@ static int ceph_init_fs_context(struct fs_context *fc)
>>          return -ENOMEM;
>>   }
>>
>> +/*
>> + * Return true if it successfully increases the blocker counter,
>> + * or false if the mdsc is in stopping and flushed state.
>> + */
>> +static bool __inc_stopping_blocker(struct ceph_mds_client *mdsc)
>> +{
>> +       spin_lock(&mdsc->stopping_lock);
>> +       if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED) {
>> +               spin_unlock(&mdsc->stopping_lock);
>> +               return false;
>> +       }
>> +       atomic_inc(&mdsc->stopping_blockers);
>> +       spin_unlock(&mdsc->stopping_lock);
>> +       return true;
>> +}
>> +
>> +static void __dec_stopping_blocker(struct ceph_mds_client *mdsc)
>> +{
>> +       spin_lock(&mdsc->stopping_lock);
>> +       if (!atomic_dec_return(&mdsc->stopping_blockers) &&
>> +           mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
>> +               complete_all(&mdsc->stopping_waiter);
>> +       spin_unlock(&mdsc->stopping_lock);
>> +}
>> +
>> +/* For metadata IO requests */
>> +bool ceph_inc_mds_stopping_blocker(struct ceph_mds_client *mdsc,
>> +                                  struct ceph_mds_session *session)
>> +{
>> +       mutex_lock(&session->s_mutex);
>> +       inc_session_sequence(session);
>> +       mutex_unlock(&session->s_mutex);
>> +
>> +       return __inc_stopping_blocker(mdsc);
>> +}
>> +
>> +void ceph_dec_mds_stopping_blocker(struct ceph_mds_client *mdsc)
>> +{
>> +       __dec_stopping_blocker(mdsc);
>> +}
>> +
>>   static void ceph_kill_sb(struct super_block *s)
>>   {
>>          struct ceph_fs_client *fsc = ceph_sb_to_client(s);
>> +       struct ceph_mds_client *mdsc = fsc->mdsc;
>> +       struct ceph_mount_options *opt = fsc->mount_options;
>> +       bool wait;
>>
>>          dout("kill_sb %p\n", s);
>>
>> -       ceph_mdsc_pre_umount(fsc->mdsc);
>> +       ceph_mdsc_pre_umount(mdsc);
>>          flush_fs_workqueues(fsc);
>>
>> +       /*
>> +        * Though the kill_anon_super() will finally trigger the
>> +        * sync_filesystem() anyway, we still need to do it here and
>> +        * then bump the stage of shutdown. This will allow us to
>> +        * drop any further message, which will increase the inodes'
>> +        * i_count reference counters but makes no sense any more,
>> +        * from MDSs.
>> +        *
>> +        * Without this when evicting the inodes it may fail in the
>> +        * kill_anon_super(), which will trigger a warning when
>> +        * destroying the fscrypt keyring and then possibly trigger
>> +        * a further crash in ceph module when the iput() tries to
>> +        * evict the inodes later.
>> +        */
>> +       sync_filesystem(s);
>> +
>> +       spin_lock(&mdsc->stopping_lock);
>> +       mdsc->stopping = CEPH_MDSC_STOPPING_FLUSHED;
>> +       wait = !!atomic_read(&mdsc->stopping_blockers);
>> +       spin_unlock(&mdsc->stopping_lock);
>> +
>> +       if (wait && atomic_read(&mdsc->stopping_blockers)) {
>> +               long timeleft = wait_for_completion_killable_timeout(
>> +                                       &mdsc->stopping_waiter,
>> +                                       opt->umount_timeout);
>> +               if (!timeleft) /* timed out */
>> +                       pr_warn("umount timed out, %ld\n", timeleft);
>> +               else if (timeleft < 0) /* killed */
>> +                       pr_warn("umount was killed, %ld\n", timeleft);
>> +       }
>> +
>>          kill_anon_super(s);
>>
>>          fsc->client->extra_mon_dispatch = NULL;
>> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
>> index a0688dc63fa0..cd5b88d819ca 100644
>> --- a/fs/ceph/super.h
>> +++ b/fs/ceph/super.h
>> @@ -66,6 +66,7 @@
>>   #define CEPH_MAX_READDIR_DEFAULT        1024
>>   #define CEPH_MAX_READDIR_BYTES_DEFAULT  (512*1024)
>>   #define CEPH_SNAPDIRNAME_DEFAULT        ".snap"
>> +#define CEPH_UMOUNT_TIMEOUT_DEFAULT     msecs_to_jiffies(60 * 1000)
>>
>>   /*
>>    * Delay telling the MDS we no longer want caps, in case we reopen
>> @@ -87,6 +88,7 @@ struct ceph_mount_options {
>>          int caps_max;
>>          unsigned int max_readdir;       /* max readdir result (entries) */
>>          unsigned int max_readdir_bytes; /* max readdir result (bytes) */
>> +       unsigned int umount_timeout;    /* jiffies */
>>
>>          bool new_dev_syntax;
>>
>> @@ -1413,4 +1415,7 @@ extern bool ceph_quota_update_statfs(struct ceph_fs_client *fsc,
>>                                       struct kstatfs *buf);
>>   extern void ceph_cleanup_quotarealms_inodes(struct ceph_mds_client *mdsc);
>>
>> +bool ceph_inc_mds_stopping_blocker(struct ceph_mds_client *mdsc,
>> +                              struct ceph_mds_session *session);
>> +void ceph_dec_mds_stopping_blocker(struct ceph_mds_client *mdsc);
>>   #endif /* _FS_CEPH_SUPER_H */
>> --
>> 2.40.1
>>
>
Luis Henriques June 6, 2023, 9:53 a.m. UTC | #3
xiubli@redhat.com writes:

> From: Xiubo Li <xiubli@redhat.com>
>
> V2:
> - Improve the code by switching to wait_for_completion_killable_timeout()
>   when umounting, at the same add one umount_timeout option.

Instead of adding yet another (undocumented!) mount option, why not re-use
the already existent 'mount_timeout' instead?  It's already defined and
kept in 'struct ceph_options', and the default value is defined with the
same value you're using, in CEPH_MOUNT_TIMEOUT_DEFAULT.

Cheers,
Xiubo Li June 6, 2023, 12:29 p.m. UTC | #4
On 6/6/23 17:53, Luís Henriques wrote:
> xiubli@redhat.com writes:
>
>> From: Xiubo Li <xiubli@redhat.com>
>>
>> V2:
>> - Improve the code by switching to wait_for_completion_killable_timeout()
>>    when umounting, at the same add one umount_timeout option.
> Instead of adding yet another (undocumented!) mount option, why not re-use
> the already existent 'mount_timeout' instead?  It's already defined and
> kept in 'struct ceph_options', and the default value is defined with the
> same value you're using, in CEPH_MOUNT_TIMEOUT_DEFAULT.

This is for mount purpose. Is that okay to use the in umount case ?

Thanks

- Xiubo

> Cheers,
Luis Henriques June 7, 2023, 9:18 a.m. UTC | #5
Xiubo Li <xiubli@redhat.com> writes:

> On 6/6/23 17:53, Luís Henriques wrote:
>> xiubli@redhat.com writes:
>>
>>> From: Xiubo Li <xiubli@redhat.com>
>>>
>>> V2:
>>> - Improve the code by switching to wait_for_completion_killable_timeout()
>>>    when umounting, at the same add one umount_timeout option.
>> Instead of adding yet another (undocumented!) mount option, why not re-use
>> the already existent 'mount_timeout' instead?  It's already defined and
>> kept in 'struct ceph_options', and the default value is defined with the
>> same value you're using, in CEPH_MOUNT_TIMEOUT_DEFAULT.
>
> This is for mount purpose. Is that okay to use the in umount case ?

Yeah, you're probably right.  It's just that adding yet another knob for a
corner case that probably will never be used and very few people will know
about is never a good thing (IMO).  Anyway, I think that at least this new
mount option needs to be mentioned in 'Documentation/filesystems/ceph.rst'.

Cheers,
Ilya Dryomov June 7, 2023, 10 a.m. UTC | #6
On Wed, Jun 7, 2023 at 11:18 AM Luís Henriques <lhenriques@suse.de> wrote:
>
> Xiubo Li <xiubli@redhat.com> writes:
>
> > On 6/6/23 17:53, Luís Henriques wrote:
> >> xiubli@redhat.com writes:
> >>
> >>> From: Xiubo Li <xiubli@redhat.com>
> >>>
> >>> V2:
> >>> - Improve the code by switching to wait_for_completion_killable_timeout()
> >>>    when umounting, at the same add one umount_timeout option.
> >> Instead of adding yet another (undocumented!) mount option, why not re-use
> >> the already existent 'mount_timeout' instead?  It's already defined and
> >> kept in 'struct ceph_options', and the default value is defined with the
> >> same value you're using, in CEPH_MOUNT_TIMEOUT_DEFAULT.
> >
> > This is for mount purpose. Is that okay to use the in umount case ?
>
> Yeah, you're probably right.  It's just that adding yet another knob for a
> corner case that probably will never be used and very few people will know
> about is never a good thing (IMO).  Anyway, I think that at least this new
> mount option needs to be mentioned in 'Documentation/filesystems/ceph.rst'.

It's OK and in fact preferrable to stick to mount_timeout to avoid new
knobs.  There is even a precedent for this: RBD uses mount_timeout both
when waiting for the latest osdmap (on "rbd map") and when tearing down
a watch (on "rbd unmap").

Thanks,

                Ilya