
[v2] ceph: drop the messages from MDS when unmounting

Message ID 20221221093031.132792-1-xiubli@redhat.com
State New
Series [v2] ceph: drop the messages from MDS when unmounting

Commit Message

Xiubo Li Dec. 21, 2022, 9:30 a.m. UTC
From: Xiubo Li <xiubli@redhat.com>

When unmounting, the client only waits for the in-flight requests to
be finished, but just before the sessions are closed the kclient can
still receive caps/snaps/lease/quota messages from the MDS. Handling
these messages requires taking references on some inodes, which will
cause ceph_kill_sb() to fail to evict those inodes in time.

If encryption is enabled, the kernel generates a warning when removing
the encryption keys while the skipped inodes still hold the keyring:

WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
CPU: 4 PID: 168846 Comm: umount Tainted: G S  6.1.0-rc5-ceph-g72ead199864c #1
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
FS:  00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
generic_shutdown_super+0x47/0x120
kill_anon_super+0x14/0x30
ceph_kill_sb+0x36/0x90 [ceph]
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x67/0xb0
exit_to_user_mode_prepare+0x23d/0x240
syscall_exit_to_user_mode+0x25/0x60
do_syscall_64+0x40/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fd83dc39e9b

URL: https://tracker.ceph.com/issues/58126
Signed-off-by: Xiubo Li <xiubli@redhat.com>
---

V2:
- Fix it in the ceph layer.


 fs/ceph/caps.c       |  3 +++
 fs/ceph/mds_client.c |  5 ++++-
 fs/ceph/mds_client.h |  7 ++++++-
 fs/ceph/quota.c      |  3 +++
 fs/ceph/snap.c       |  3 +++
 fs/ceph/super.c      | 14 ++++++++++++++
 6 files changed, 33 insertions(+), 2 deletions(-)
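
The crux of the change (condensed from the patch as quoted in the
replies below; see those quotes for the full diff) is to introduce an
extra shutdown stage, entered right after the filesystem has been
synced, and to have the MDS message handlers drop anything that
arrives once that stage has been reached:

  /* fs/ceph/mds_client.h: shutdown stages (names as in the posted patch) */
  enum {
          CEPH_MDSC_STOPPING_BEGAIN = 1,   /* ceph_mdsc_pre_umount() has started */
          CEPH_MDSC_STOPPING_FLUSHED = 2,  /* filesystem synced, sessions about to close */
  };

  /* fs/ceph/super.c: ceph_kill_sb() syncs, then bumps the stage */
          sync_filesystem(s);
          fsc->mdsc->stopping = CEPH_MDSC_STOPPING_FLUSHED;
          kill_anon_super(s);

  /* fs/ceph/caps.c, mds_client.c, quota.c, snap.c: each handler bails out early */
          if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
                  return;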

Comments

Xiubo Li Jan. 22, 2023, 1:57 p.m. UTC | #1
Hi Luis,

On 20/01/2023 18:36, Luís Henriques wrote:
> On Thu, Jan 19, 2023 at 04:27:09PM +0000, Luís Henriques wrote:
>> On Wed, Dec 21, 2022 at 05:30:31PM +0800, xiubli@redhat.com wrote:
>>> From: Xiubo Li <xiubli@redhat.com>
>>>
>>> When unmounting it will just wait for the inflight requests to be
>>> finished, but just before the sessions are closed the kclient still
>>> could receive the caps/snaps/lease/quota msgs from MDS. All these
>>> msgs need to hold some inodes, which will cause ceph_kill_sb() failing
>>> to evict the inodes in time.
>>>
>>> If encrypt is enabled the kernel generate a warning when removing
>>> the encrypt keys when the skipped inodes still hold the keyring:
>> Finally (sorry for the delay!) I managed to look into the 6.1 rebase.  It
>> does look good, but I started hitting the WARNING added by patch:
>>
>>    [DO NOT MERGE] ceph: make sure all the files successfully put before unmounting
>>
>> This patch seems to be working but I'm not sure we really need the extra
> OK, looks like I jumped the gun here: I still see the warning with your
> patch.
>
> I've done a quick hack and the patch below sees fix it.  But again, it
> will impact performance.  I'll see if I can figure out something else.
>
> Cheers,
> --
> Luís
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 2cd134ad02a9..bdb4efa0f9f7 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -2988,6 +2988,21 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
>   	return ret;
>   }
>   
> +static int ceph_flush(struct file *file, fl_owner_t id)
> +{
> +	struct inode *inode = file_inode(file);
> +	int ret;
> +
> +	if ((file->f_mode & FMODE_WRITE) == 0)
> +		return 0;
> +
> +	ret = filemap_write_and_wait(inode->i_mapping);
> +	if (ret)
> +		ret = filemap_check_wb_err(file->f_mapping, 0);
> +
> +	return ret;
> +}
> +
>   const struct file_operations ceph_file_fops = {
>   	.open = ceph_open,
>   	.release = ceph_release,
> @@ -3005,4 +3020,5 @@ const struct file_operations ceph_file_fops = {
>   	.compat_ioctl = compat_ptr_ioctl,
>   	.fallocate	= ceph_fallocate,
>   	.copy_file_range = ceph_copy_file_range,
> +	.flush = ceph_flush,
>   };
>
This will only fix the second crash case in 
https://tracker.ceph.com/issues/58126#note-6, but not the case in 
https://tracker.ceph.com/issues/58126#note-5.

This issue can be triggered with "test_dummy_encryption" and with
xfstests-dev's generic/124. You could give it a try.

Locally I am still reading the code to understand why the
sync_filesystem(s) call doesn't achieve the same thing as the
ceph_flush() above.
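
For reference, a simplified sketch of what the VFS side of
sync_filesystem() does (based on mainline fs/sync.c around v6.1 and
trimmed of the block-device steps, so treat it as an approximation
rather than the exact code):

  int sync_filesystem(struct super_block *sb)
  {
          int ret;

          if (sb_rdonly(sb))
                  return 0;

          /* kick off async writeback, then a non-blocking sync_fs pass */
          writeback_inodes_sb(sb, WB_REASON_SYNC);
          if (sb->s_op->sync_fs) {
                  ret = sb->s_op->sync_fs(sb, 0);
                  if (ret)
                          return ret;
          }

          /* wait for all inode writeback, then a blocking sync_fs pass */
          sync_inodes_sb(sb);
          if (sb->s_op->sync_fs)
                  return sb->s_op->sync_fs(sb, 1);
          return 0;
  }

For ceph the blocking sb->s_op->sync_fs(sb, 1) call lands in
ceph_sync_fs(), so one would expect all dirty data to have been
written back by the time sync_filesystem() returns.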

I am still on holiday these days and I will test this after I am back.

Thanks


>> 'stopping' state.  Looking at the code, we've flushed all the workqueues
>> and done all the waits, so I *think* the sync_filesystem() call should be
>> enough.
>>
>> The other alternative I see would be to add a ->flush() to ceph_file_fops,
>> where we'd do a filemap_write_and_wait().  But that would probably have a
>> negative performance impact -- my understand is that it basically means
>> we'll have sync file closes.
>>
>> Cheers,
>> --
>> Luís
>>
>>> WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
>>> CPU: 4 PID: 168846 Comm: umount Tainted: G S  6.1.0-rc5-ceph-g72ead199864c #1
>>> Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
>>> RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
>>> RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
>>> RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
>>> RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
>>> RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
>>> R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
>>> R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
>>> FS:  00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> Call Trace:
>>> <TASK>
>>> generic_shutdown_super+0x47/0x120
>>> kill_anon_super+0x14/0x30
>>> ceph_kill_sb+0x36/0x90 [ceph]
>>> deactivate_locked_super+0x29/0x60
>>> cleanup_mnt+0xb8/0x140
>>> task_work_run+0x67/0xb0
>>> exit_to_user_mode_prepare+0x23d/0x240
>>> syscall_exit_to_user_mode+0x25/0x60
>>> do_syscall_64+0x40/0x80
>>> entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>> RIP: 0033:0x7fd83dc39e9b
>>>
>>> URL: https://tracker.ceph.com/issues/58126
>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>> ---
>>>
>>> V2:
>>> - Fix it in ceph layer.
>>>
>>>
>>>   fs/ceph/caps.c       |  3 +++
>>>   fs/ceph/mds_client.c |  5 ++++-
>>>   fs/ceph/mds_client.h |  7 ++++++-
>>>   fs/ceph/quota.c      |  3 +++
>>>   fs/ceph/snap.c       |  3 +++
>>>   fs/ceph/super.c      | 14 ++++++++++++++
>>>   6 files changed, 33 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>>> index 15d9e0f0d65a..e8a53aeb2a8c 100644
>>> --- a/fs/ceph/caps.c
>>> +++ b/fs/ceph/caps.c
>>> @@ -4222,6 +4222,9 @@ void ceph_handle_caps(struct ceph_mds_session *session,
>>>   
>>>   	dout("handle_caps from mds%d\n", session->s_mds);
>>>   
>>> +	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
>>> +		return;
>>> +
>>>   	/* decode */
>>>   	end = msg->front.iov_base + msg->front.iov_len;
>>>   	if (msg->front.iov_len < sizeof(*h))
>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>> index d41ab68f0130..1ad85af49b45 100644
>>> --- a/fs/ceph/mds_client.c
>>> +++ b/fs/ceph/mds_client.c
>>> @@ -4869,6 +4869,9 @@ static void handle_lease(struct ceph_mds_client *mdsc,
>>>   
>>>   	dout("handle_lease from mds%d\n", mds);
>>>   
>>> +	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
>>> +		return;
>>> +
>>>   	/* decode */
>>>   	if (msg->front.iov_len < sizeof(*h) + sizeof(u32))
>>>   		goto bad;
>>> @@ -5262,7 +5265,7 @@ void send_flush_mdlog(struct ceph_mds_session *s)
>>>   void ceph_mdsc_pre_umount(struct ceph_mds_client *mdsc)
>>>   {
>>>   	dout("pre_umount\n");
>>> -	mdsc->stopping = 1;
>>> +	mdsc->stopping = CEPH_MDSC_STOPPING_BEGAIN;
>>>   
>>>   	ceph_mdsc_iterate_sessions(mdsc, send_flush_mdlog, true);
>>>   	ceph_mdsc_iterate_sessions(mdsc, lock_unlock_session, false);
>>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
>>> index 81a1f9a4ac3b..56f9d8077068 100644
>>> --- a/fs/ceph/mds_client.h
>>> +++ b/fs/ceph/mds_client.h
>>> @@ -398,6 +398,11 @@ struct cap_wait {
>>>   	int			want;
>>>   };
>>>   
>>> +enum {
>>> +	CEPH_MDSC_STOPPING_BEGAIN = 1,
>>> +	CEPH_MDSC_STOPPING_FLUSHED = 2,
>>> +};
>>> +
>>>   /*
>>>    * mds client state
>>>    */
>>> @@ -414,7 +419,7 @@ struct ceph_mds_client {
>>>   	struct ceph_mds_session **sessions;    /* NULL for mds if no session */
>>>   	atomic_t		num_sessions;
>>>   	int                     max_sessions;  /* len of sessions array */
>>> -	int                     stopping;      /* true if shutting down */
>>> +	int                     stopping;      /* the stage of shutting down */
>>>   
>>>   	atomic64_t		quotarealms_count; /* # realms with quota */
>>>   	/*
>>> diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c
>>> index 64592adfe48f..f5819fc31d28 100644
>>> --- a/fs/ceph/quota.c
>>> +++ b/fs/ceph/quota.c
>>> @@ -47,6 +47,9 @@ void ceph_handle_quota(struct ceph_mds_client *mdsc,
>>>   	struct inode *inode;
>>>   	struct ceph_inode_info *ci;
>>>   
>>> +	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
>>> +		return;
>>> +
>>>   	if (msg->front.iov_len < sizeof(*h)) {
>>>   		pr_err("%s corrupt message mds%d len %d\n", __func__,
>>>   		       session->s_mds, (int)msg->front.iov_len);
>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>> index a73943e51a77..eeabdd0211d8 100644
>>> --- a/fs/ceph/snap.c
>>> +++ b/fs/ceph/snap.c
>>> @@ -1010,6 +1010,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>>>   	int locked_rwsem = 0;
>>>   	bool close_sessions = false;
>>>   
>>> +	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
>>> +		return;
>>> +
>>>   	/* decode */
>>>   	if (msg->front.iov_len < sizeof(*h))
>>>   		goto bad;
>>> diff --git a/fs/ceph/super.c b/fs/ceph/super.c
>>> index f10a076f47e5..012b35be41a9 100644
>>> --- a/fs/ceph/super.c
>>> +++ b/fs/ceph/super.c
>>> @@ -1483,6 +1483,20 @@ static void ceph_kill_sb(struct super_block *s)
>>>   	ceph_mdsc_pre_umount(fsc->mdsc);
>>>   	flush_fs_workqueues(fsc);
>>>   
>>> +	/*
>>> +	 * Though the kill_anon_super() will finally trigger the
>>> +	 * sync_filesystem() anyway, we still need to do it here and
>>> +	 * then bump the stage of shutdown. This will drop any further
>>> +	 * message, which makes no sense any more, from MDSs.
>>> +	 *
>>> +	 * Without this when evicting the inodes it may fail in the
>>> +	 * kill_anon_super(), which will trigger a warning when
>>> +	 * destroying the fscrypt keyring and then possibly trigger
>>> +	 * a further crash in ceph module when iput() the inodes.
>>> +	 */
>>> +	sync_filesystem(s);
>>> +	fsc->mdsc->stopping = CEPH_MDSC_STOPPING_FLUSHED;
>>> +
>>>   	kill_anon_super(s);
>>>   
>>>   	fsc->client->extra_mon_dispatch = NULL;
>>> -- 
>>> 2.31.1
>>>
Luis Henriques Jan. 23, 2023, 10:15 a.m. UTC | #2
On Sun, Jan 22, 2023 at 09:57:46PM +0800, Xiubo Li wrote:
> Hi Luis,
> 
> On 20/01/2023 18:36, Luís Henriques wrote:
> > On Thu, Jan 19, 2023 at 04:27:09PM +0000, Luís Henriques wrote:
> > > On Wed, Dec 21, 2022 at 05:30:31PM +0800, xiubli@redhat.com wrote:
> > > > From: Xiubo Li <xiubli@redhat.com>
> > > > 
> > > > When unmounting it will just wait for the inflight requests to be
> > > > finished, but just before the sessions are closed the kclient still
> > > > could receive the caps/snaps/lease/quota msgs from MDS. All these
> > > > msgs need to hold some inodes, which will cause ceph_kill_sb() failing
> > > > to evict the inodes in time.
> > > > 
> > > > If encrypt is enabled the kernel generate a warning when removing
> > > > the encrypt keys when the skipped inodes still hold the keyring:
> > > Finally (sorry for the delay!) I managed to look into the 6.1 rebase.  It
> > > does look good, but I started hitting the WARNING added by patch:
> > > 
> > >    [DO NOT MERGE] ceph: make sure all the files successfully put before unmounting
> > > 
> > > This patch seems to be working but I'm not sure we really need the extra
> > OK, looks like I jumped the gun here: I still see the warning with your
> > patch.
> > 
> > I've done a quick hack and the patch below sees fix it.  But again, it
> > will impact performance.  I'll see if I can figure out something else.
> > 
> > Cheers,
> > --
> > Luís
> > 
> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > index 2cd134ad02a9..bdb4efa0f9f7 100644
> > --- a/fs/ceph/file.c
> > +++ b/fs/ceph/file.c
> > @@ -2988,6 +2988,21 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >   	return ret;
> >   }
> > +static int ceph_flush(struct file *file, fl_owner_t id)
> > +{
> > +	struct inode *inode = file_inode(file);
> > +	int ret;
> > +
> > +	if ((file->f_mode & FMODE_WRITE) == 0)
> > +		return 0;
> > +
> > +	ret = filemap_write_and_wait(inode->i_mapping);
> > +	if (ret)
> > +		ret = filemap_check_wb_err(file->f_mapping, 0);
> > +
> > +	return ret;
> > +}
> > +
> >   const struct file_operations ceph_file_fops = {
> >   	.open = ceph_open,
> >   	.release = ceph_release,
> > @@ -3005,4 +3020,5 @@ const struct file_operations ceph_file_fops = {
> >   	.compat_ioctl = compat_ptr_ioctl,
> >   	.fallocate	= ceph_fallocate,
> >   	.copy_file_range = ceph_copy_file_range,
> > +	.flush = ceph_flush,
> >   };
> > 
> This will only fix the second crash case in
> https://tracker.ceph.com/issues/58126#note-6, but not the case in
> https://tracker.ceph.com/issues/58126#note-5.
> 
> This issue could be triggered with "test_dummy_encryption" and with
> xfstests-dev's generic/124. You can have a try.

OK, thanks.  I'll give it a try.  BTW, my local reproducer was
generic/132, not generic/124.  I'll let you know if I find something.
 
> Locally I am still reading the code to check why "sync_filesystem(s);"
> couldn't do the same with "ceph_flush" as above.
> 
> I am still on holiday these days and I will test this after my back.

Sure, no problem.  Enjoy your break!

Cheers,
--
Luís

> 
> Thanks
> 
> 
> > > 'stopping' state.  Looking at the code, we've flushed all the workqueues
> > > and done all the waits, so I *think* the sync_filesystem() call should be
> > > enough.
> > > 
> > > The other alternative I see would be to add a ->flush() to ceph_file_fops,
> > > where we'd do a filemap_write_and_wait().  But that would probably have a
> > > negative performance impact -- my understand is that it basically means
> > > we'll have sync file closes.
> > > 
> > > Cheers,
> > > --
> > > Luís
> > > 
> > > > WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
> > > > CPU: 4 PID: 168846 Comm: umount Tainted: G S  6.1.0-rc5-ceph-g72ead199864c #1
> > > > Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
> > > > RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
> > > > RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
> > > > RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
> > > > RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
> > > > RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
> > > > R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
> > > > R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
> > > > FS:  00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
> > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
> > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > Call Trace:
> > > > <TASK>
> > > > generic_shutdown_super+0x47/0x120
> > > > kill_anon_super+0x14/0x30
> > > > ceph_kill_sb+0x36/0x90 [ceph]
> > > > deactivate_locked_super+0x29/0x60
> > > > cleanup_mnt+0xb8/0x140
> > > > task_work_run+0x67/0xb0
> > > > exit_to_user_mode_prepare+0x23d/0x240
> > > > syscall_exit_to_user_mode+0x25/0x60
> > > > do_syscall_64+0x40/0x80
> > > > entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > > > RIP: 0033:0x7fd83dc39e9b
> > > > 
> > > > URL: https://tracker.ceph.com/issues/58126
> > > > Signed-off-by: Xiubo Li <xiubli@redhat.com>
> > > > ---
> > > > 
> > > > V2:
> > > > - Fix it in ceph layer.
> > > > 
> > > > 
> > > >   fs/ceph/caps.c       |  3 +++
> > > >   fs/ceph/mds_client.c |  5 ++++-
> > > >   fs/ceph/mds_client.h |  7 ++++++-
> > > >   fs/ceph/quota.c      |  3 +++
> > > >   fs/ceph/snap.c       |  3 +++
> > > >   fs/ceph/super.c      | 14 ++++++++++++++
> > > >   6 files changed, 33 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > > index 15d9e0f0d65a..e8a53aeb2a8c 100644
> > > > --- a/fs/ceph/caps.c
> > > > +++ b/fs/ceph/caps.c
> > > > @@ -4222,6 +4222,9 @@ void ceph_handle_caps(struct ceph_mds_session *session,
> > > >   	dout("handle_caps from mds%d\n", session->s_mds);
> > > > +	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
> > > > +		return;
> > > > +
> > > >   	/* decode */
> > > >   	end = msg->front.iov_base + msg->front.iov_len;
> > > >   	if (msg->front.iov_len < sizeof(*h))
> > > > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > > > index d41ab68f0130..1ad85af49b45 100644
> > > > --- a/fs/ceph/mds_client.c
> > > > +++ b/fs/ceph/mds_client.c
> > > > @@ -4869,6 +4869,9 @@ static void handle_lease(struct ceph_mds_client *mdsc,
> > > >   	dout("handle_lease from mds%d\n", mds);
> > > > +	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
> > > > +		return;
> > > > +
> > > >   	/* decode */
> > > >   	if (msg->front.iov_len < sizeof(*h) + sizeof(u32))
> > > >   		goto bad;
> > > > @@ -5262,7 +5265,7 @@ void send_flush_mdlog(struct ceph_mds_session *s)
> > > >   void ceph_mdsc_pre_umount(struct ceph_mds_client *mdsc)
> > > >   {
> > > >   	dout("pre_umount\n");
> > > > -	mdsc->stopping = 1;
> > > > +	mdsc->stopping = CEPH_MDSC_STOPPING_BEGAIN;
> > > >   	ceph_mdsc_iterate_sessions(mdsc, send_flush_mdlog, true);
> > > >   	ceph_mdsc_iterate_sessions(mdsc, lock_unlock_session, false);
> > > > diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> > > > index 81a1f9a4ac3b..56f9d8077068 100644
> > > > --- a/fs/ceph/mds_client.h
> > > > +++ b/fs/ceph/mds_client.h
> > > > @@ -398,6 +398,11 @@ struct cap_wait {
> > > >   	int			want;
> > > >   };
> > > > +enum {
> > > > +	CEPH_MDSC_STOPPING_BEGAIN = 1,
> > > > +	CEPH_MDSC_STOPPING_FLUSHED = 2,
> > > > +};
> > > > +
> > > >   /*
> > > >    * mds client state
> > > >    */
> > > > @@ -414,7 +419,7 @@ struct ceph_mds_client {
> > > >   	struct ceph_mds_session **sessions;    /* NULL for mds if no session */
> > > >   	atomic_t		num_sessions;
> > > >   	int                     max_sessions;  /* len of sessions array */
> > > > -	int                     stopping;      /* true if shutting down */
> > > > +	int                     stopping;      /* the stage of shutting down */
> > > >   	atomic64_t		quotarealms_count; /* # realms with quota */
> > > >   	/*
> > > > diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c
> > > > index 64592adfe48f..f5819fc31d28 100644
> > > > --- a/fs/ceph/quota.c
> > > > +++ b/fs/ceph/quota.c
> > > > @@ -47,6 +47,9 @@ void ceph_handle_quota(struct ceph_mds_client *mdsc,
> > > >   	struct inode *inode;
> > > >   	struct ceph_inode_info *ci;
> > > > +	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
> > > > +		return;
> > > > +
> > > >   	if (msg->front.iov_len < sizeof(*h)) {
> > > >   		pr_err("%s corrupt message mds%d len %d\n", __func__,
> > > >   		       session->s_mds, (int)msg->front.iov_len);
> > > > diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> > > > index a73943e51a77..eeabdd0211d8 100644
> > > > --- a/fs/ceph/snap.c
> > > > +++ b/fs/ceph/snap.c
> > > > @@ -1010,6 +1010,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
> > > >   	int locked_rwsem = 0;
> > > >   	bool close_sessions = false;
> > > > +	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
> > > > +		return;
> > > > +
> > > >   	/* decode */
> > > >   	if (msg->front.iov_len < sizeof(*h))
> > > >   		goto bad;
> > > > diff --git a/fs/ceph/super.c b/fs/ceph/super.c
> > > > index f10a076f47e5..012b35be41a9 100644
> > > > --- a/fs/ceph/super.c
> > > > +++ b/fs/ceph/super.c
> > > > @@ -1483,6 +1483,20 @@ static void ceph_kill_sb(struct super_block *s)
> > > >   	ceph_mdsc_pre_umount(fsc->mdsc);
> > > >   	flush_fs_workqueues(fsc);
> > > > +	/*
> > > > +	 * Though the kill_anon_super() will finally trigger the
> > > > +	 * sync_filesystem() anyway, we still need to do it here and
> > > > +	 * then bump the stage of shutdown. This will drop any further
> > > > +	 * message, which makes no sense any more, from MDSs.
> > > > +	 *
> > > > +	 * Without this when evicting the inodes it may fail in the
> > > > +	 * kill_anon_super(), which will trigger a warning when
> > > > +	 * destroying the fscrypt keyring and then possibly trigger
> > > > +	 * a further crash in ceph module when iput() the inodes.
> > > > +	 */
> > > > +	sync_filesystem(s);
> > > > +	fsc->mdsc->stopping = CEPH_MDSC_STOPPING_FLUSHED;
> > > > +
> > > >   	kill_anon_super(s);
> > > >   	fsc->client->extra_mon_dispatch = NULL;
> > > > -- 
> > > > 2.31.1
> > > > 
> -- 
> Best Regards,
> 
> Xiubo Li (李秀波)
> 
> Email: xiubli@redhat.com/xiubli@ibm.com
> Slack: @Xiubo Li
>
Xiubo Li Jan. 24, 2023, 10:26 a.m. UTC | #3
On 23/01/2023 18:15, Luís Henriques wrote:
> On Sun, Jan 22, 2023 at 09:57:46PM +0800, Xiubo Li wrote:
>> Hi Luis,
>>
>> On 20/01/2023 18:36, Luís Henriques wrote:
>>> On Thu, Jan 19, 2023 at 04:27:09PM +0000, Luís Henriques wrote:
>>>> On Wed, Dec 21, 2022 at 05:30:31PM +0800, xiubli@redhat.com wrote:
>>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>>
>>>>> When unmounting it will just wait for the inflight requests to be
>>>>> finished, but just before the sessions are closed the kclient still
>>>>> could receive the caps/snaps/lease/quota msgs from MDS. All these
>>>>> msgs need to hold some inodes, which will cause ceph_kill_sb() failing
>>>>> to evict the inodes in time.
>>>>>
>>>>> If encrypt is enabled the kernel generate a warning when removing
>>>>> the encrypt keys when the skipped inodes still hold the keyring:
>>>> Finally (sorry for the delay!) I managed to look into the 6.1 rebase.  It
>>>> does look good, but I started hitting the WARNING added by patch:
>>>>
>>>>     [DO NOT MERGE] ceph: make sure all the files successfully put before unmounting
>>>>
>>>> This patch seems to be working but I'm not sure we really need the extra
>>> OK, looks like I jumped the gun here: I still see the warning with your
>>> patch.
>>>
>>> I've done a quick hack and the patch below sees fix it.  But again, it
>>> will impact performance.  I'll see if I can figure out something else.
>>>
>>> Cheers,
>>> --
>>> Luís
>>>
>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>> index 2cd134ad02a9..bdb4efa0f9f7 100644
>>> --- a/fs/ceph/file.c
>>> +++ b/fs/ceph/file.c
>>> @@ -2988,6 +2988,21 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
>>>    	return ret;
>>>    }
>>> +static int ceph_flush(struct file *file, fl_owner_t id)
>>> +{
>>> +	struct inode *inode = file_inode(file);
>>> +	int ret;
>>> +
>>> +	if ((file->f_mode & FMODE_WRITE) == 0)
>>> +		return 0;
>>> +
>>> +	ret = filemap_write_and_wait(inode->i_mapping);
>>> +	if (ret)
>>> +		ret = filemap_check_wb_err(file->f_mapping, 0);
>>> +
>>> +	return ret;
>>> +}
>>> +
>>>    const struct file_operations ceph_file_fops = {
>>>    	.open = ceph_open,
>>>    	.release = ceph_release,
>>> @@ -3005,4 +3020,5 @@ const struct file_operations ceph_file_fops = {
>>>    	.compat_ioctl = compat_ptr_ioctl,
>>>    	.fallocate	= ceph_fallocate,
>>>    	.copy_file_range = ceph_copy_file_range,
>>> +	.flush = ceph_flush,
>>>    };
>>>
>> This will only fix the second crash case in
>> https://tracker.ceph.com/issues/58126#note-6, but not the case in
>> https://tracker.ceph.com/issues/58126#note-5.
>>
>> This issue could be triggered with "test_dummy_encryption" and with
>> xfstests-dev's generic/124. You can have a try.
> OK, thanks.  I'll give it a try.  BTW, my local reproducer was
> generic/132, not generic/124.  I'll let you know if I find something.

Hi Luis,

I added some logging and found that an aio_write will be split into
many aio requests, and when the last request finishes it calls
writepages_finish(), which will iput() the inode and release its last
refcount.

But it seems that complete_all() is called before req->r_callback()
has fully finished:

<4>[500400.268200] writepages_finish 0000000060940222 rc 0
<4>[500400.268476] writepages_finish 0000000060940222 rc 0   <===== the last osd req->r_callback()
<4>[500400.268515] sync_fs (blocking)   <===== unmounting begins
<4>[500400.268526] sync_fs (blocking) done
<4>[500400.268530] kill_sb after sync_filesystem 00000000a01a1cf4   <=== sync_filesystem() is called here; I added this call in the ceph code, but the VFS will call it again in kill_anon_super()
<4>[500400.268539] ceph_evict_inode:682, dec inode 0000000044f12aa7
<4>[500400.268626] sync_fs (blocking)
<4>[500400.268631] sync_fs (blocking) done
<4>[500400.268634] evict_inodes inode 0000000060940222, i_count = 1, was skipped!   <=== skipped
<4>[500400.268642] fscrypt_destroy_keyring: mk 00000000baf04977 mk_active_refs = 2
<4>[500400.268651] ceph_evict_inode:682, dec inode 0000000060940222   <==== the inode is evicted in the req->r_callback()

Locally my VM is not working and I can't run the test for now. Could
you help test the following patch?

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 78b622178a3d..5cdaba0d3003 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -2539,6 +2539,7 @@ static void __complete_request(struct ceph_osd_request *req)

         if (req->r_callback)
                 req->r_callback(req);
+       barrier();
         complete_all(&req->r_completion);
         ceph_osdc_put_request(req);
  }

Thanks

- Xiubo


>   
>> Locally I am still reading the code to check why "sync_filesystem(s);"
>> couldn't do the same with "ceph_flush" as above.
>>
>> I am still on holiday these days and I will test this after my back.
> Sure, no problem.  Enjoy your break!
>
> Cheers,
> --
> Luís
>
>> Thanks
>>
>>
>>>> 'stopping' state.  Looking at the code, we've flushed all the workqueues
>>>> and done all the waits, so I *think* the sync_filesystem() call should be
>>>> enough.
>>>>
>>>> The other alternative I see would be to add a ->flush() to ceph_file_fops,
>>>> where we'd do a filemap_write_and_wait().  But that would probably have a
>>>> negative performance impact -- my understand is that it basically means
>>>> we'll have sync file closes.
>>>>
>>>> Cheers,
>>>> --
>>>> Luís
>>>>
>>>>> WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
>>>>> CPU: 4 PID: 168846 Comm: umount Tainted: G S  6.1.0-rc5-ceph-g72ead199864c #1
>>>>> Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
>>>>> RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
>>>>> RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
>>>>> RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
>>>>> RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
>>>>> RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
>>>>> R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
>>>>> R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
>>>>> FS:  00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
>>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
>>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>> Call Trace:
>>>>> <TASK>
>>>>> generic_shutdown_super+0x47/0x120
>>>>> kill_anon_super+0x14/0x30
>>>>> ceph_kill_sb+0x36/0x90 [ceph]
>>>>> deactivate_locked_super+0x29/0x60
>>>>> cleanup_mnt+0xb8/0x140
>>>>> task_work_run+0x67/0xb0
>>>>> exit_to_user_mode_prepare+0x23d/0x240
>>>>> syscall_exit_to_user_mode+0x25/0x60
>>>>> do_syscall_64+0x40/0x80
>>>>> entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>>>> RIP: 0033:0x7fd83dc39e9b
>>>>>
>>>>> URL: https://tracker.ceph.com/issues/58126
>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>>> ---
>>>>>
>>>>> V2:
>>>>> - Fix it in ceph layer.
>>>>>
>>>>>
>>>>>    fs/ceph/caps.c       |  3 +++
>>>>>    fs/ceph/mds_client.c |  5 ++++-
>>>>>    fs/ceph/mds_client.h |  7 ++++++-
>>>>>    fs/ceph/quota.c      |  3 +++
>>>>>    fs/ceph/snap.c       |  3 +++
>>>>>    fs/ceph/super.c      | 14 ++++++++++++++
>>>>>    6 files changed, 33 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>>>>> index 15d9e0f0d65a..e8a53aeb2a8c 100644
>>>>> --- a/fs/ceph/caps.c
>>>>> +++ b/fs/ceph/caps.c
>>>>> @@ -4222,6 +4222,9 @@ void ceph_handle_caps(struct ceph_mds_session *session,
>>>>>    	dout("handle_caps from mds%d\n", session->s_mds);
>>>>> +	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
>>>>> +		return;
>>>>> +
>>>>>    	/* decode */
>>>>>    	end = msg->front.iov_base + msg->front.iov_len;
>>>>>    	if (msg->front.iov_len < sizeof(*h))
>>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>>>> index d41ab68f0130..1ad85af49b45 100644
>>>>> --- a/fs/ceph/mds_client.c
>>>>> +++ b/fs/ceph/mds_client.c
>>>>> @@ -4869,6 +4869,9 @@ static void handle_lease(struct ceph_mds_client *mdsc,
>>>>>    	dout("handle_lease from mds%d\n", mds);
>>>>> +	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
>>>>> +		return;
>>>>> +
>>>>>    	/* decode */
>>>>>    	if (msg->front.iov_len < sizeof(*h) + sizeof(u32))
>>>>>    		goto bad;
>>>>> @@ -5262,7 +5265,7 @@ void send_flush_mdlog(struct ceph_mds_session *s)
>>>>>    void ceph_mdsc_pre_umount(struct ceph_mds_client *mdsc)
>>>>>    {
>>>>>    	dout("pre_umount\n");
>>>>> -	mdsc->stopping = 1;
>>>>> +	mdsc->stopping = CEPH_MDSC_STOPPING_BEGAIN;
>>>>>    	ceph_mdsc_iterate_sessions(mdsc, send_flush_mdlog, true);
>>>>>    	ceph_mdsc_iterate_sessions(mdsc, lock_unlock_session, false);
>>>>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
>>>>> index 81a1f9a4ac3b..56f9d8077068 100644
>>>>> --- a/fs/ceph/mds_client.h
>>>>> +++ b/fs/ceph/mds_client.h
>>>>> @@ -398,6 +398,11 @@ struct cap_wait {
>>>>>    	int			want;
>>>>>    };
>>>>> +enum {
>>>>> +	CEPH_MDSC_STOPPING_BEGAIN = 1,
>>>>> +	CEPH_MDSC_STOPPING_FLUSHED = 2,
>>>>> +};
>>>>> +
>>>>>    /*
>>>>>     * mds client state
>>>>>     */
>>>>> @@ -414,7 +419,7 @@ struct ceph_mds_client {
>>>>>    	struct ceph_mds_session **sessions;    /* NULL for mds if no session */
>>>>>    	atomic_t		num_sessions;
>>>>>    	int                     max_sessions;  /* len of sessions array */
>>>>> -	int                     stopping;      /* true if shutting down */
>>>>> +	int                     stopping;      /* the stage of shutting down */
>>>>>    	atomic64_t		quotarealms_count; /* # realms with quota */
>>>>>    	/*
>>>>> diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c
>>>>> index 64592adfe48f..f5819fc31d28 100644
>>>>> --- a/fs/ceph/quota.c
>>>>> +++ b/fs/ceph/quota.c
>>>>> @@ -47,6 +47,9 @@ void ceph_handle_quota(struct ceph_mds_client *mdsc,
>>>>>    	struct inode *inode;
>>>>>    	struct ceph_inode_info *ci;
>>>>> +	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
>>>>> +		return;
>>>>> +
>>>>>    	if (msg->front.iov_len < sizeof(*h)) {
>>>>>    		pr_err("%s corrupt message mds%d len %d\n", __func__,
>>>>>    		       session->s_mds, (int)msg->front.iov_len);
>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>>> index a73943e51a77..eeabdd0211d8 100644
>>>>> --- a/fs/ceph/snap.c
>>>>> +++ b/fs/ceph/snap.c
>>>>> @@ -1010,6 +1010,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>>>>>    	int locked_rwsem = 0;
>>>>>    	bool close_sessions = false;
>>>>> +	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
>>>>> +		return;
>>>>> +
>>>>>    	/* decode */
>>>>>    	if (msg->front.iov_len < sizeof(*h))
>>>>>    		goto bad;
>>>>> diff --git a/fs/ceph/super.c b/fs/ceph/super.c
>>>>> index f10a076f47e5..012b35be41a9 100644
>>>>> --- a/fs/ceph/super.c
>>>>> +++ b/fs/ceph/super.c
>>>>> @@ -1483,6 +1483,20 @@ static void ceph_kill_sb(struct super_block *s)
>>>>>    	ceph_mdsc_pre_umount(fsc->mdsc);
>>>>>    	flush_fs_workqueues(fsc);
>>>>> +	/*
>>>>> +	 * Though the kill_anon_super() will finally trigger the
>>>>> +	 * sync_filesystem() anyway, we still need to do it here and
>>>>> +	 * then bump the stage of shutdown. This will drop any further
>>>>> +	 * message, which makes no sense any more, from MDSs.
>>>>> +	 *
>>>>> +	 * Without this when evicting the inodes it may fail in the
>>>>> +	 * kill_anon_super(), which will trigger a warning when
>>>>> +	 * destroying the fscrypt keyring and then possibly trigger
>>>>> +	 * a further crash in ceph module when iput() the inodes.
>>>>> +	 */
>>>>> +	sync_filesystem(s);
>>>>> +	fsc->mdsc->stopping = CEPH_MDSC_STOPPING_FLUSHED;
>>>>> +
>>>>>    	kill_anon_super(s);
>>>>>    	fsc->client->extra_mon_dispatch = NULL;
>>>>> -- 
>>>>> 2.31.1
>>>>>
>> -- 
>> Best Regards,
>>
>> Xiubo Li (李秀波)
>>
>> Email: xiubli@redhat.com/xiubli@ibm.com
>> Slack: @Xiubo Li
>>
Luis Henriques Jan. 24, 2023, 12:32 p.m. UTC | #4
On Tue, Jan 24, 2023 at 06:26:46PM +0800, Xiubo Li wrote:
> 
> On 23/01/2023 18:15, Luís Henriques wrote:
> > On Sun, Jan 22, 2023 at 09:57:46PM +0800, Xiubo Li wrote:
> > > Hi Luis,
> > > 
> > > On 20/01/2023 18:36, Luís Henriques wrote:
> > > > On Thu, Jan 19, 2023 at 04:27:09PM +0000, Luís Henriques wrote:
> > > > > On Wed, Dec 21, 2022 at 05:30:31PM +0800, xiubli@redhat.com wrote:
> > > > > > From: Xiubo Li <xiubli@redhat.com>
> > > > > > 
> > > > > > When unmounting it will just wait for the inflight requests to be
> > > > > > finished, but just before the sessions are closed the kclient still
> > > > > > could receive the caps/snaps/lease/quota msgs from MDS. All these
> > > > > > msgs need to hold some inodes, which will cause ceph_kill_sb() failing
> > > > > > to evict the inodes in time.
> > > > > > 
> > > > > > If encrypt is enabled the kernel generate a warning when removing
> > > > > > the encrypt keys when the skipped inodes still hold the keyring:
> > > > > Finally (sorry for the delay!) I managed to look into the 6.1 rebase.  It
> > > > > does look good, but I started hitting the WARNING added by patch:
> > > > > 
> > > > >     [DO NOT MERGE] ceph: make sure all the files successfully put before unmounting
> > > > > 
> > > > > This patch seems to be working but I'm not sure we really need the extra
> > > > OK, looks like I jumped the gun here: I still see the warning with your
> > > > patch.
> > > > 
> > > > I've done a quick hack and the patch below sees fix it.  But again, it
> > > > will impact performance.  I'll see if I can figure out something else.
> > > > 
> > > > Cheers,
> > > > --
> > > > Luís
> > > > 
> > > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > > index 2cd134ad02a9..bdb4efa0f9f7 100644
> > > > --- a/fs/ceph/file.c
> > > > +++ b/fs/ceph/file.c
> > > > @@ -2988,6 +2988,21 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > >    	return ret;
> > > >    }
> > > > +static int ceph_flush(struct file *file, fl_owner_t id)
> > > > +{
> > > > +	struct inode *inode = file_inode(file);
> > > > +	int ret;
> > > > +
> > > > +	if ((file->f_mode & FMODE_WRITE) == 0)
> > > > +		return 0;
> > > > +
> > > > +	ret = filemap_write_and_wait(inode->i_mapping);
> > > > +	if (ret)
> > > > +		ret = filemap_check_wb_err(file->f_mapping, 0);
> > > > +
> > > > +	return ret;
> > > > +}
> > > > +
> > > >    const struct file_operations ceph_file_fops = {
> > > >    	.open = ceph_open,
> > > >    	.release = ceph_release,
> > > > @@ -3005,4 +3020,5 @@ const struct file_operations ceph_file_fops = {
> > > >    	.compat_ioctl = compat_ptr_ioctl,
> > > >    	.fallocate	= ceph_fallocate,
> > > >    	.copy_file_range = ceph_copy_file_range,
> > > > +	.flush = ceph_flush,
> > > >    };
> > > > 
> > > This will only fix the second crash case in
> > > https://tracker.ceph.com/issues/58126#note-6, but not the case in
> > > https://tracker.ceph.com/issues/58126#note-5.
> > > 
> > > This issue could be triggered with "test_dummy_encryption" and with
> > > xfstests-dev's generic/124. You can have a try.
> > OK, thanks.  I'll give it a try.  BTW, my local reproducer was
> > generic/132, not generic/124.  I'll let you know if I find something.
> 
> Hi Luis,
> 
> I added some logs and found that when doing the aio_write, it will split to
> many aio requests and when the last req finished it will call the
> "writepages_finish()", which will iput() the inode and release the last
> refcount of the inode.
> 
> But it seems the complete_all() is called without the req->r_callback() is
> totally finished:
> 
> <4>[500400.268200] writepages_finish 0000000060940222 rc 0
> <4>[500400.268476] writepages_finish 0000000060940222 rc 0 <===== the last
> osd req->r_callback()
> <4>[500400.268515] sync_fs (blocking)  <===== unmounting begin
> <4>[500400.268526] sync_fs (blocking) done
> <4>[500400.268530] kill_sb after sync_filesystem 00000000a01a1cf4   <=== the
> sync_filesystem() will be called, I just added it in ceph code but the VFS
> will call it again in "kill_anon_super()"
> <4>[500400.268539] ceph_evict_inode:682, dec inode 0000000044f12aa7
> <4>[500400.268626] sync_fs (blocking)
> <4>[500400.268631] sync_fs (blocking) done
> <4>[500400.268634] evict_inodes inode 0000000060940222, i_count = 1, was
> skipped!    <=== skipped
> <4>[500400.268642] fscrypt_destroy_keyring: mk 00000000baf04977
> mk_active_refs = 2
> <4>[500400.268651] ceph_evict_inode:682, dec inode 0000000060940222   <====
> evict the inode in the req->r_callback()
> 
> Locally my VM is not working and I couldn't run the test for now. Could you
> help test the following patch ?

So, running generic/132 in my test environment with your patch applied
to the 'testing' branch, I still see the WARNING (backtrace pasted
below).  I'll try to help dig a bit more into this issue in the next
few days.

Cheers,
--
Luís


[  102.713299] ceph: test_dummy_encryption mode enabled
[  121.807203] evict_inodes inode 000000000d85998d, i_count = 1, was skipped!
[  121.809055] fscrypt_destroy_keyring: mk_active_refs = 2
[  121.810439] ------------[ cut here ]------------
[  121.811937] WARNING: CPU: 1 PID: 2671 at fs/crypto/keyring.c:244 fscrypt_destroy_keyring+0x109/0x110
[  121.814243] CPU: 1 PID: 2671 Comm: umount Not tainted 6.2.0-rc2+ #23
[  121.815810] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
[  121.818588] RIP: 0010:fscrypt_destroy_keyring+0x109/0x110
[  121.819925] Code: 00 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 8b 77 38 48 c7 c7 18 40 c0 81 e8 e2 b2 51 00 41 8b 47 38 83 f8 01 74 9e <0f> 0b eb 9a 0f 1f 00 0f0
[  121.824469] RSP: 0018:ffffc900039f3e20 EFLAGS: 00010202
[  121.825908] RAX: 0000000000000002 RBX: 0000000000000000 RCX: 0000000000000001
[  121.827660] RDX: 0000000000000000 RSI: ffff88842fc9b180 RDI: 00000000ffffffff
[  121.829425] RBP: ffff888102a99000 R08: 0000000000000000 R09: c0000000ffffefff
[  121.831188] R10: ffffc90000093e00 R11: ffffc900039f3cd8 R12: 0000000000000000
[  121.833408] R13: ffff888102a9a948 R14: ffffffff823a66b0 R15: ffff8881048f6c00
[  121.835186] FS:  00007f2442206840(0000) GS:ffff88842fc80000(0000) knlGS:0000000000000000
[  121.838205] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  121.840386] CR2: 00007f0a289daf70 CR3: 000000011446a000 CR4: 00000000000006a0
[  121.842190] Call Trace:
[  121.843125]  <TASK>
[  121.843945]  generic_shutdown_super+0x42/0x130
[  121.845656]  kill_anon_super+0xe/0x30
[  121.847081]  ceph_kill_sb+0x3d/0xa0
[  121.848418]  deactivate_locked_super+0x34/0x60
[  121.850166]  cleanup_mnt+0xb8/0x150
[  121.851498]  task_work_run+0x61/0x90
[  121.852863]  exit_to_user_mode_prepare+0x147/0x170
[  121.854741]  syscall_exit_to_user_mode+0x20/0x40
[  121.856489]  do_syscall_64+0x48/0x80
[  121.857887]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  121.859659] RIP: 0033:0x7f2442444a67
[  121.860922] Code: 24 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 018
[  121.866942] RSP: 002b:00007ffc56533fc8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[  121.869305] RAX: 0000000000000000 RBX: 00007f2442579264 RCX: 00007f2442444a67
[  121.871400] RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 00005612dcd02b10
[  121.873512] RBP: 00005612dccfd960 R08: 0000000000000000 R09: 00007f2442517be0
[  121.875492] R10: 00005612dcd03ea0 R11: 0000000000000246 R12: 0000000000000000
[  121.877448] R13: 00005612dcd02b10 R14: 00005612dccfda70 R15: 00005612dccfdb90
[  121.879418]  </TASK>
[  121.880039] ---[ end trace 0000000000000000 ]---


> 
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 78b622178a3d..5cdaba0d3003 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -2539,6 +2539,7 @@ static void __complete_request(struct ceph_osd_request
> *req)
> 
>         if (req->r_callback)
>                 req->r_callback(req);
> +       barrier();
>         complete_all(&req->r_completion);
>         ceph_osdc_put_request(req);
>  }
> 
> Thanks
> 
> - Xiubo
> 
> 
> > > Locally I am still reading the code to check why "sync_filesystem(s);"
> > > couldn't do the same with "ceph_flush" as above.
> > > 
> > > I am still on holiday these days and I will test this after my back.
> > Sure, no problem.  Enjoy your break!
> > 
> > Cheers,
> > --
> > Luís
> > 
> > > Thanks
> > > 
> > > 
> > > > > 'stopping' state.  Looking at the code, we've flushed all the workqueues
> > > > > and done all the waits, so I *think* the sync_filesystem() call should be
> > > > > enough.
> > > > > 
> > > > > The other alternative I see would be to add a ->flush() to ceph_file_fops,
> > > > > where we'd do a filemap_write_and_wait().  But that would probably have a
> > > > > negative performance impact -- my understand is that it basically means
> > > > > we'll have sync file closes.
> > > > > 
> > > > > Cheers,
> > > > > --
> > > > > Luís
> > > > > 
> > > > > > WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
> > > > > > CPU: 4 PID: 168846 Comm: umount Tainted: G S  6.1.0-rc5-ceph-g72ead199864c #1
> > > > > > Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
> > > > > > RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
> > > > > > RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
> > > > > > RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
> > > > > > RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
> > > > > > RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
> > > > > > R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
> > > > > > R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
> > > > > > FS:  00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
> > > > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
> > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > > > Call Trace:
> > > > > > <TASK>
> > > > > > generic_shutdown_super+0x47/0x120
> > > > > > kill_anon_super+0x14/0x30
> > > > > > ceph_kill_sb+0x36/0x90 [ceph]
> > > > > > deactivate_locked_super+0x29/0x60
> > > > > > cleanup_mnt+0xb8/0x140
> > > > > > task_work_run+0x67/0xb0
> > > > > > exit_to_user_mode_prepare+0x23d/0x240
> > > > > > syscall_exit_to_user_mode+0x25/0x60
> > > > > > do_syscall_64+0x40/0x80
> > > > > > entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > > > > > RIP: 0033:0x7fd83dc39e9b
> > > > > > 
> > > > > > URL: https://tracker.ceph.com/issues/58126
> > > > > > Signed-off-by: Xiubo Li <xiubli@redhat.com>
> > > > > > ---
> > > > > > 
> > > > > > V2:
> > > > > > - Fix it in ceph layer.
> > > > > > 
> > > > > > 
> > > > > >    fs/ceph/caps.c       |  3 +++
> > > > > >    fs/ceph/mds_client.c |  5 ++++-
> > > > > >    fs/ceph/mds_client.h |  7 ++++++-
> > > > > >    fs/ceph/quota.c      |  3 +++
> > > > > >    fs/ceph/snap.c       |  3 +++
> > > > > >    fs/ceph/super.c      | 14 ++++++++++++++
> > > > > >    6 files changed, 33 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > > > > index 15d9e0f0d65a..e8a53aeb2a8c 100644
> > > > > > --- a/fs/ceph/caps.c
> > > > > > +++ b/fs/ceph/caps.c
> > > > > > @@ -4222,6 +4222,9 @@ void ceph_handle_caps(struct ceph_mds_session *session,
> > > > > >    	dout("handle_caps from mds%d\n", session->s_mds);
> > > > > > +	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
> > > > > > +		return;
> > > > > > +
> > > > > >    	/* decode */
> > > > > >    	end = msg->front.iov_base + msg->front.iov_len;
> > > > > >    	if (msg->front.iov_len < sizeof(*h))
> > > > > > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > > > > > index d41ab68f0130..1ad85af49b45 100644
> > > > > > --- a/fs/ceph/mds_client.c
> > > > > > +++ b/fs/ceph/mds_client.c
> > > > > > @@ -4869,6 +4869,9 @@ static void handle_lease(struct ceph_mds_client *mdsc,
> > > > > >    	dout("handle_lease from mds%d\n", mds);
> > > > > > +	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
> > > > > > +		return;
> > > > > > +
> > > > > >    	/* decode */
> > > > > >    	if (msg->front.iov_len < sizeof(*h) + sizeof(u32))
> > > > > >    		goto bad;
> > > > > > @@ -5262,7 +5265,7 @@ void send_flush_mdlog(struct ceph_mds_session *s)
> > > > > >    void ceph_mdsc_pre_umount(struct ceph_mds_client *mdsc)
> > > > > >    {
> > > > > >    	dout("pre_umount\n");
> > > > > > -	mdsc->stopping = 1;
> > > > > > +	mdsc->stopping = CEPH_MDSC_STOPPING_BEGAIN;
> > > > > >    	ceph_mdsc_iterate_sessions(mdsc, send_flush_mdlog, true);
> > > > > >    	ceph_mdsc_iterate_sessions(mdsc, lock_unlock_session, false);
> > > > > > diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> > > > > > index 81a1f9a4ac3b..56f9d8077068 100644
> > > > > > --- a/fs/ceph/mds_client.h
> > > > > > +++ b/fs/ceph/mds_client.h
> > > > > > @@ -398,6 +398,11 @@ struct cap_wait {
> > > > > >    	int			want;
> > > > > >    };
> > > > > > +enum {
> > > > > > +	CEPH_MDSC_STOPPING_BEGAIN = 1,
> > > > > > +	CEPH_MDSC_STOPPING_FLUSHED = 2,
> > > > > > +};
> > > > > > +
> > > > > >    /*
> > > > > >     * mds client state
> > > > > >     */
> > > > > > @@ -414,7 +419,7 @@ struct ceph_mds_client {
> > > > > >    	struct ceph_mds_session **sessions;    /* NULL for mds if no session */
> > > > > >    	atomic_t		num_sessions;
> > > > > >    	int                     max_sessions;  /* len of sessions array */
> > > > > > -	int                     stopping;      /* true if shutting down */
> > > > > > +	int                     stopping;      /* the stage of shutting down */
> > > > > >    	atomic64_t		quotarealms_count; /* # realms with quota */
> > > > > >    	/*
> > > > > > diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c
> > > > > > index 64592adfe48f..f5819fc31d28 100644
> > > > > > --- a/fs/ceph/quota.c
> > > > > > +++ b/fs/ceph/quota.c
> > > > > > @@ -47,6 +47,9 @@ void ceph_handle_quota(struct ceph_mds_client *mdsc,
> > > > > >    	struct inode *inode;
> > > > > >    	struct ceph_inode_info *ci;
> > > > > > +	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
> > > > > > +		return;
> > > > > > +
> > > > > >    	if (msg->front.iov_len < sizeof(*h)) {
> > > > > >    		pr_err("%s corrupt message mds%d len %d\n", __func__,
> > > > > >    		       session->s_mds, (int)msg->front.iov_len);
> > > > > > diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> > > > > > index a73943e51a77..eeabdd0211d8 100644
> > > > > > --- a/fs/ceph/snap.c
> > > > > > +++ b/fs/ceph/snap.c
> > > > > > @@ -1010,6 +1010,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
> > > > > >    	int locked_rwsem = 0;
> > > > > >    	bool close_sessions = false;
> > > > > > +	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
> > > > > > +		return;
> > > > > > +
> > > > > >    	/* decode */
> > > > > >    	if (msg->front.iov_len < sizeof(*h))
> > > > > >    		goto bad;
> > > > > > diff --git a/fs/ceph/super.c b/fs/ceph/super.c
> > > > > > index f10a076f47e5..012b35be41a9 100644
> > > > > > --- a/fs/ceph/super.c
> > > > > > +++ b/fs/ceph/super.c
> > > > > > @@ -1483,6 +1483,20 @@ static void ceph_kill_sb(struct super_block *s)
> > > > > >    	ceph_mdsc_pre_umount(fsc->mdsc);
> > > > > >    	flush_fs_workqueues(fsc);
> > > > > > +	/*
> > > > > > +	 * Though the kill_anon_super() will finally trigger the
> > > > > > +	 * sync_filesystem() anyway, we still need to do it here and
> > > > > > +	 * then bump the stage of shutdown. This will drop any further
> > > > > > +	 * message, which makes no sense any more, from MDSs.
> > > > > > +	 *
> > > > > > +	 * Without this when evicting the inodes it may fail in the
> > > > > > +	 * kill_anon_super(), which will trigger a warning when
> > > > > > +	 * destroying the fscrypt keyring and then possibly trigger
> > > > > > +	 * a further crash in ceph module when iput() the inodes.
> > > > > > +	 */
> > > > > > +	sync_filesystem(s);
> > > > > > +	fsc->mdsc->stopping = CEPH_MDSC_STOPPING_FLUSHED;
> > > > > > +
> > > > > >    	kill_anon_super(s);
> > > > > >    	fsc->client->extra_mon_dispatch = NULL;
> > > > > > -- 
> > > > > > 2.31.1
> > > > > > 
> > > -- 
> > > Best Regards,
> > > 
> > > Xiubo Li (李秀波)
> > > 
> > > Email: xiubli@redhat.com/xiubli@ibm.com
> > > Slack: @Xiubo Li
> > > 
> -- 
> Best Regards,
> 
> Xiubo Li (李秀波)
> 
> Email: xiubli@redhat.com/xiubli@ibm.com
> Slack: @Xiubo Li
>
Xiubo Li Jan. 26, 2023, 1:03 p.m. UTC | #5
On 24/01/2023 20:32, Luís Henriques wrote:
> On Tue, Jan 24, 2023 at 06:26:46PM +0800, Xiubo Li wrote:
>> On 23/01/2023 18:15, Luís Henriques wrote:
>>> On Sun, Jan 22, 2023 at 09:57:46PM +0800, Xiubo Li wrote:
>>>> Hi Luis,
>>>>
>>>> On 20/01/2023 18:36, Luís Henriques wrote:
>>>>> On Thu, Jan 19, 2023 at 04:27:09PM +0000, Luís Henriques wrote:
>>>>>> On Wed, Dec 21, 2022 at 05:30:31PM +0800, xiubli@redhat.com wrote:
>>>>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>>>>
>>>>>>> When unmounting it will just wait for the inflight requests to be
>>>>>>> finished, but just before the sessions are closed the kclient still
>>>>>>> could receive the caps/snaps/lease/quota msgs from MDS. All these
>>>>>>> msgs need to hold some inodes, which will cause ceph_kill_sb() failing
>>>>>>> to evict the inodes in time.
>>>>>>>
>>>>>>> If encrypt is enabled the kernel generate a warning when removing
>>>>>>> the encrypt keys when the skipped inodes still hold the keyring:
>>>>>> Finally (sorry for the delay!) I managed to look into the 6.1 rebase.  It
>>>>>> does look good, but I started hitting the WARNING added by patch:
>>>>>>
>>>>>>      [DO NOT MERGE] ceph: make sure all the files successfully put before unmounting
>>>>>>
>>>>>> This patch seems to be working but I'm not sure we really need the extra
>>>>> OK, looks like I jumped the gun here: I still see the warning with your
>>>>> patch.
>>>>>
>>>>> I've done a quick hack and the patch below sees fix it.  But again, it
>>>>> will impact performance.  I'll see if I can figure out something else.
>>>>>
>>>>> Cheers,
>>>>> --
>>>>> Luís
>>>>>
>>>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>>>> index 2cd134ad02a9..bdb4efa0f9f7 100644
>>>>> --- a/fs/ceph/file.c
>>>>> +++ b/fs/ceph/file.c
>>>>> @@ -2988,6 +2988,21 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
>>>>>     	return ret;
>>>>>     }
>>>>> +static int ceph_flush(struct file *file, fl_owner_t id)
>>>>> +{
>>>>> +	struct inode *inode = file_inode(file);
>>>>> +	int ret;
>>>>> +
>>>>> +	if ((file->f_mode & FMODE_WRITE) == 0)
>>>>> +		return 0;
>>>>> +
>>>>> +	ret = filemap_write_and_wait(inode->i_mapping);
>>>>> +	if (ret)
>>>>> +		ret = filemap_check_wb_err(file->f_mapping, 0);
>>>>> +
>>>>> +	return ret;
>>>>> +}
>>>>> +
>>>>>     const struct file_operations ceph_file_fops = {
>>>>>     	.open = ceph_open,
>>>>>     	.release = ceph_release,
>>>>> @@ -3005,4 +3020,5 @@ const struct file_operations ceph_file_fops = {
>>>>>     	.compat_ioctl = compat_ptr_ioctl,
>>>>>     	.fallocate	= ceph_fallocate,
>>>>>     	.copy_file_range = ceph_copy_file_range,
>>>>> +	.flush = ceph_flush,
>>>>>     };
>>>>>
>>>> This will only fix the second crash case in
>>>> https://tracker.ceph.com/issues/58126#note-6, but not the case in
>>>> https://tracker.ceph.com/issues/58126#note-5.
>>>>
>>>> This issue could be triggered with "test_dummy_encryption" and with
>>>> xfstests-dev's generic/124. You can have a try.
>>> OK, thanks.  I'll give it a try.  BTW, my local reproducer was
>>> generic/132, not generic/124.  I'll let you know if I find something.
>> Hi Luis,
>>
>> I added some logs and found that when doing the aio_write, it will split to
>> many aio requests and when the last req finished it will call the
>> "writepages_finish()", which will iput() the inode and release the last
>> refcount of the inode.
>>
>> But it seems the complete_all() is called without the req->r_callback() is
>> totally finished:
>>
>> <4>[500400.268200] writepages_finish 0000000060940222 rc 0
>> <4>[500400.268476] writepages_finish 0000000060940222 rc 0 <===== the last
>> osd req->r_callback()
>> <4>[500400.268515] sync_fs (blocking)  <===== unmounting begin
>> <4>[500400.268526] sync_fs (blocking) done
>> <4>[500400.268530] kill_sb after sync_filesystem 00000000a01a1cf4   <=== the
>> sync_filesystem() will be called, I just added it in ceph code but the VFS
>> will call it again in "kill_anon_super()"
>> <4>[500400.268539] ceph_evict_inode:682, dec inode 0000000044f12aa7
>> <4>[500400.268626] sync_fs (blocking)
>> <4>[500400.268631] sync_fs (blocking) done
>> <4>[500400.268634] evict_inodes inode 0000000060940222, i_count = 1, was
>> skipped!    <=== skipped
>> <4>[500400.268642] fscrypt_destroy_keyring: mk 00000000baf04977
>> mk_active_refs = 2
>> <4>[500400.268651] ceph_evict_inode:682, dec inode 0000000060940222   <====
>> evict the inode in the req->r_callback()
>>
>> Locally my VM is not working and I couldn't run the test for now. Could you
>> help test the following patch ?
> So, running generic/132 in my test environment with your patch applied to
> the 'testing' branch I still see the WARNING (pasted the backtrace
> bellow).  I'll try help to dig a bit more on this issue in the next few
> days.
>
> Cheers,
> --
> Luís
>
>
> [  102.713299] ceph: test_dummy_encryption mode enabled
> [  121.807203] evict_inodes inode 000000000d85998d, i_count = 1, was skipped!
> [  121.809055] fscrypt_destroy_keyring: mk_active_refs = 2
> [  121.810439] ------------[ cut here ]------------
> [  121.811937] WARNING: CPU: 1 PID: 2671 at fs/crypto/keyring.c:244 fscrypt_destroy_keyring+0x109/0x110
> [  121.814243] CPU: 1 PID: 2671 Comm: umount Not tainted 6.2.0-rc2+ #23
> [  121.815810] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
> [  121.818588] RIP: 0010:fscrypt_destroy_keyring+0x109/0x110
> [  121.819925] Code: 00 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 8b 77 38 48 c7 c7 18 40 c0 81 e8 e2 b2 51 00 41 8b 47 38 83 f8 01 74 9e <0f> 0b eb 9a 0f 1f 00 0f0
> [  121.824469] RSP: 0018:ffffc900039f3e20 EFLAGS: 00010202
> [  121.825908] RAX: 0000000000000002 RBX: 0000000000000000 RCX: 0000000000000001
> [  121.827660] RDX: 0000000000000000 RSI: ffff88842fc9b180 RDI: 00000000ffffffff
> [  121.829425] RBP: ffff888102a99000 R08: 0000000000000000 R09: c0000000ffffefff
> [  121.831188] R10: ffffc90000093e00 R11: ffffc900039f3cd8 R12: 0000000000000000
> [  121.833408] R13: ffff888102a9a948 R14: ffffffff823a66b0 R15: ffff8881048f6c00
> [  121.835186] FS:  00007f2442206840(0000) GS:ffff88842fc80000(0000) knlGS:0000000000000000
> [  121.838205] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  121.840386] CR2: 00007f0a289daf70 CR3: 000000011446a000 CR4: 00000000000006a0
> [  121.842190] Call Trace:
> [  121.843125]  <TASK>
> [  121.843945]  generic_shutdown_super+0x42/0x130
> [  121.845656]  kill_anon_super+0xe/0x30
> [  121.847081]  ceph_kill_sb+0x3d/0xa0
> [  121.848418]  deactivate_locked_super+0x34/0x60
> [  121.850166]  cleanup_mnt+0xb8/0x150
> [  121.851498]  task_work_run+0x61/0x90
> [  121.852863]  exit_to_user_mode_prepare+0x147/0x170
> [  121.854741]  syscall_exit_to_user_mode+0x20/0x40
> [  121.856489]  do_syscall_64+0x48/0x80
> [  121.857887]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> [  121.859659] RIP: 0033:0x7f2442444a67
> [  121.860922] Code: 24 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 018
> [  121.866942] RSP: 002b:00007ffc56533fc8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
> [  121.869305] RAX: 0000000000000000 RBX: 00007f2442579264 RCX: 00007f2442444a67
> [  121.871400] RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 00005612dcd02b10
> [  121.873512] RBP: 00005612dccfd960 R08: 0000000000000000 R09: 00007f2442517be0
> [  121.875492] R10: 00005612dcd03ea0 R11: 0000000000000246 R12: 0000000000000000
> [  121.877448] R13: 00005612dcd02b10 R14: 00005612dccfda70 R15: 00005612dccfdb90
> [  121.879418]  </TASK>
> [  121.880039] ---[ end trace 0000000000000000 ]---
>
Okay.

I found that sync_filesystem() will wait for the last osd request, but
the sequence of deleting the osd req from the osdc and then calling
req->r_callback() seems incorrect.

The following patch should fix it:


diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 78b622178a3d..a3b4c5cabb80 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -2532,12 +2532,12 @@ static void finish_request(struct ceph_osd_request *req)
         ceph_msg_revoke_incoming(req->r_reply);
  }

-static void __complete_request(struct ceph_osd_request *req)
+static void __complete_request(struct ceph_osd_request *req, bool callback)
  {
         dout("%s req %p tid %llu cb %ps result %d\n", __func__, req,
              req->r_tid, req->r_callback, req->r_result);

-       if (req->r_callback)
+       if (callback && req->r_callback)
                 req->r_callback(req);
         complete_all(&req->r_completion);
         ceph_osdc_put_request(req);
@@ -2548,7 +2548,7 @@ static void complete_request_workfn(struct work_struct *work)
         struct ceph_osd_request *req =
             container_of(work, struct ceph_osd_request, r_complete_work);

-       __complete_request(req);
+       __complete_request(req, true);
  }

  /*
@@ -3873,11 +3873,13 @@ static void handle_reply(struct ceph_osd *osd, struct ceph_msg *msg)
         WARN_ON(!(m.flags & CEPH_OSD_FLAG_ONDISK));
         req->r_version = m.user_version;
         req->r_result = m.result ?: data_len;
+       if (req->r_callback)
+               req->r_callback(req);
         finish_request(req);
         mutex_unlock(&osd->lock);
         up_read(&osdc->lock);

-       __complete_request(req);
+       __complete_request(req, false);
         return;

  fail_request:

Thanks

- Xiubo


>> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
>> index 78b622178a3d..5cdaba0d3003 100644
>> --- a/net/ceph/osd_client.c
>> +++ b/net/ceph/osd_client.c
>> @@ -2539,6 +2539,7 @@ static void __complete_request(struct ceph_osd_request
>> *req)
>>
>>          if (req->r_callback)
>>                  req->r_callback(req);
>> +       barrier();
>>          complete_all(&req->r_completion);
>>          ceph_osdc_put_request(req);
>>   }
>>
>> Thanks
>>
>> - Xiubo
>>
>>
>>>> Locally I am still reading the code to check why "sync_filesystem(s);"
>>>> couldn't do the same with "ceph_flush" as above.
>>>>
>>>> I am still on holiday these days and I will test this after my back.
>>> Sure, no problem.  Enjoy your break!
>>>
>>> Cheers,
>>> --
>>> Luís
>>>
>>>> Thanks
>>>>
>>>>
>>>>>> 'stopping' state.  Looking at the code, we've flushed all the workqueues
>>>>>> and done all the waits, so I *think* the sync_filesystem() call should be
>>>>>> enough.
>>>>>>
>>>>>> The other alternative I see would be to add a ->flush() to ceph_file_fops,
>>>>>> where we'd do a filemap_write_and_wait().  But that would probably have a
>>>>>> negative performance impact -- my understand is that it basically means
>>>>>> we'll have sync file closes.
>>>>>>
>>>>>> Cheers,
>>>>>> --
>>>>>> Luís
>>>>>>
>>>>>>> WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
>>>>>>> CPU: 4 PID: 168846 Comm: umount Tainted: G S  6.1.0-rc5-ceph-g72ead199864c #1
>>>>>>> Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
>>>>>>> RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
>>>>>>> RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
>>>>>>> RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
>>>>>>> RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
>>>>>>> RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
>>>>>>> R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
>>>>>>> R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
>>>>>>> FS:  00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
>>>>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>> CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
>>>>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>> Call Trace:
>>>>>>> <TASK>
>>>>>>> generic_shutdown_super+0x47/0x120
>>>>>>> kill_anon_super+0x14/0x30
>>>>>>> ceph_kill_sb+0x36/0x90 [ceph]
>>>>>>> deactivate_locked_super+0x29/0x60
>>>>>>> cleanup_mnt+0xb8/0x140
>>>>>>> task_work_run+0x67/0xb0
>>>>>>> exit_to_user_mode_prepare+0x23d/0x240
>>>>>>> syscall_exit_to_user_mode+0x25/0x60
>>>>>>> do_syscall_64+0x40/0x80
>>>>>>> entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>>>>>> RIP: 0033:0x7fd83dc39e9b
>>>>>>>
>>>>>>> URL: https://tracker.ceph.com/issues/58126
>>>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>>>>> ---
>>>>>>>
>>>>>>> V2:
>>>>>>> - Fix it in ceph layer.
>>>>>>>
>>>>>>>
>>>>>>>     fs/ceph/caps.c       |  3 +++
>>>>>>>     fs/ceph/mds_client.c |  5 ++++-
>>>>>>>     fs/ceph/mds_client.h |  7 ++++++-
>>>>>>>     fs/ceph/quota.c      |  3 +++
>>>>>>>     fs/ceph/snap.c       |  3 +++
>>>>>>>     fs/ceph/super.c      | 14 ++++++++++++++
>>>>>>>     6 files changed, 33 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>>>>>>> index 15d9e0f0d65a..e8a53aeb2a8c 100644
>>>>>>> --- a/fs/ceph/caps.c
>>>>>>> +++ b/fs/ceph/caps.c
>>>>>>> @@ -4222,6 +4222,9 @@ void ceph_handle_caps(struct ceph_mds_session *session,
>>>>>>>     	dout("handle_caps from mds%d\n", session->s_mds);
>>>>>>> +	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
>>>>>>> +		return;
>>>>>>> +
>>>>>>>     	/* decode */
>>>>>>>     	end = msg->front.iov_base + msg->front.iov_len;
>>>>>>>     	if (msg->front.iov_len < sizeof(*h))
>>>>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>>>>>> index d41ab68f0130..1ad85af49b45 100644
>>>>>>> --- a/fs/ceph/mds_client.c
>>>>>>> +++ b/fs/ceph/mds_client.c
>>>>>>> @@ -4869,6 +4869,9 @@ static void handle_lease(struct ceph_mds_client *mdsc,
>>>>>>>     	dout("handle_lease from mds%d\n", mds);
>>>>>>> +	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
>>>>>>> +		return;
>>>>>>> +
>>>>>>>     	/* decode */
>>>>>>>     	if (msg->front.iov_len < sizeof(*h) + sizeof(u32))
>>>>>>>     		goto bad;
>>>>>>> @@ -5262,7 +5265,7 @@ void send_flush_mdlog(struct ceph_mds_session *s)
>>>>>>>     void ceph_mdsc_pre_umount(struct ceph_mds_client *mdsc)
>>>>>>>     {
>>>>>>>     	dout("pre_umount\n");
>>>>>>> -	mdsc->stopping = 1;
>>>>>>> +	mdsc->stopping = CEPH_MDSC_STOPPING_BEGAIN;
>>>>>>>     	ceph_mdsc_iterate_sessions(mdsc, send_flush_mdlog, true);
>>>>>>>     	ceph_mdsc_iterate_sessions(mdsc, lock_unlock_session, false);
>>>>>>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
>>>>>>> index 81a1f9a4ac3b..56f9d8077068 100644
>>>>>>> --- a/fs/ceph/mds_client.h
>>>>>>> +++ b/fs/ceph/mds_client.h
>>>>>>> @@ -398,6 +398,11 @@ struct cap_wait {
>>>>>>>     	int			want;
>>>>>>>     };
>>>>>>> +enum {
>>>>>>> +	CEPH_MDSC_STOPPING_BEGAIN = 1,
>>>>>>> +	CEPH_MDSC_STOPPING_FLUSHED = 2,
>>>>>>> +};
>>>>>>> +
>>>>>>>     /*
>>>>>>>      * mds client state
>>>>>>>      */
>>>>>>> @@ -414,7 +419,7 @@ struct ceph_mds_client {
>>>>>>>     	struct ceph_mds_session **sessions;    /* NULL for mds if no session */
>>>>>>>     	atomic_t		num_sessions;
>>>>>>>     	int                     max_sessions;  /* len of sessions array */
>>>>>>> -	int                     stopping;      /* true if shutting down */
>>>>>>> +	int                     stopping;      /* the stage of shutting down */
>>>>>>>     	atomic64_t		quotarealms_count; /* # realms with quota */
>>>>>>>     	/*
>>>>>>> diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c
>>>>>>> index 64592adfe48f..f5819fc31d28 100644
>>>>>>> --- a/fs/ceph/quota.c
>>>>>>> +++ b/fs/ceph/quota.c
>>>>>>> @@ -47,6 +47,9 @@ void ceph_handle_quota(struct ceph_mds_client *mdsc,
>>>>>>>     	struct inode *inode;
>>>>>>>     	struct ceph_inode_info *ci;
>>>>>>> +	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
>>>>>>> +		return;
>>>>>>> +
>>>>>>>     	if (msg->front.iov_len < sizeof(*h)) {
>>>>>>>     		pr_err("%s corrupt message mds%d len %d\n", __func__,
>>>>>>>     		       session->s_mds, (int)msg->front.iov_len);
>>>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>>>>> index a73943e51a77..eeabdd0211d8 100644
>>>>>>> --- a/fs/ceph/snap.c
>>>>>>> +++ b/fs/ceph/snap.c
>>>>>>> @@ -1010,6 +1010,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>>>>>>>     	int locked_rwsem = 0;
>>>>>>>     	bool close_sessions = false;
>>>>>>> +	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
>>>>>>> +		return;
>>>>>>> +
>>>>>>>     	/* decode */
>>>>>>>     	if (msg->front.iov_len < sizeof(*h))
>>>>>>>     		goto bad;
>>>>>>> diff --git a/fs/ceph/super.c b/fs/ceph/super.c
>>>>>>> index f10a076f47e5..012b35be41a9 100644
>>>>>>> --- a/fs/ceph/super.c
>>>>>>> +++ b/fs/ceph/super.c
>>>>>>> @@ -1483,6 +1483,20 @@ static void ceph_kill_sb(struct super_block *s)
>>>>>>>     	ceph_mdsc_pre_umount(fsc->mdsc);
>>>>>>>     	flush_fs_workqueues(fsc);
>>>>>>> +	/*
>>>>>>> +	 * Though the kill_anon_super() will finally trigger the
>>>>>>> +	 * sync_filesystem() anyway, we still need to do it here and
>>>>>>> +	 * then bump the stage of shutdown. This will drop any further
>>>>>>> +	 * message, which makes no sense any more, from MDSs.
>>>>>>> +	 *
>>>>>>> +	 * Without this when evicting the inodes it may fail in the
>>>>>>> +	 * kill_anon_super(), which will trigger a warning when
>>>>>>> +	 * destroying the fscrypt keyring and then possibly trigger
>>>>>>> +	 * a further crash in ceph module when iput() the inodes.
>>>>>>> +	 */
>>>>>>> +	sync_filesystem(s);
>>>>>>> +	fsc->mdsc->stopping = CEPH_MDSC_STOPPING_FLUSHED;
>>>>>>> +
>>>>>>>     	kill_anon_super(s);
>>>>>>>     	fsc->client->extra_mon_dispatch = NULL;
>>>>>>> -- 
>>>>>>> 2.31.1
>>>>>>>
>>>> -- 
>>>> Best Regards,
>>>>
>>>> Xiubo Li (李秀波)
>>>>
>>>> Email: xiubli@redhat.com/xiubli@ibm.com
>>>> Slack: @Xiubo Li
>>>>
>> -- 
>> Best Regards,
>>
>> Xiubo Li (李秀波)
>>
>> Email: xiubli@redhat.com/xiubli@ibm.com
>> Slack: @Xiubo Li
>>
Ilya Dryomov Jan. 26, 2023, 2:04 p.m. UTC | #6
On Thu, Jan 26, 2023 at 2:03 PM Xiubo Li <xiubli@redhat.com> wrote:
[...]
> Okay.
>
> I found the sync_filesystem() will wait for the last the osd request,
> but the sequence of deleting the osd req from osdc and calling
> req->r_callback seems incorrect.
>
> The following patch should fix it:
>
>
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 78b622178a3d..a3b4c5cabb80 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -2532,12 +2532,12 @@ static void finish_request(struct ceph_osd_request *req)
>          ceph_msg_revoke_incoming(req->r_reply);
>   }
>
> -static void __complete_request(struct ceph_osd_request *req)
> +static void __complete_request(struct ceph_osd_request *req, bool callback)
>   {
>          dout("%s req %p tid %llu cb %ps result %d\n", __func__, req,
>               req->r_tid, req->r_callback, req->r_result);
>
> -       if (req->r_callback)
> +       if (callback && req->r_callback)
>                  req->r_callback(req);
>          complete_all(&req->r_completion);
>          ceph_osdc_put_request(req);
> @@ -2548,7 +2548,7 @@ static void complete_request_workfn(struct work_struct *work)
>          struct ceph_osd_request *req =
>              container_of(work, struct ceph_osd_request, r_complete_work);
>
> -       __complete_request(req);
> +       __complete_request(req, true);
>   }
>
>   /*
> @@ -3873,11 +3873,13 @@ static void handle_reply(struct ceph_osd *osd, struct ceph_msg *msg)
>          WARN_ON(!(m.flags & CEPH_OSD_FLAG_ONDISK));
>          req->r_version = m.user_version;
>          req->r_result = m.result ?: data_len;
> +       if (req->r_callback)
> +               req->r_callback(req);
>          finish_request(req);
>          mutex_unlock(&osd->lock);
>          up_read(&osdc->lock);
>
> -       __complete_request(req);
> +       __complete_request(req, false);

Hi Xiubo,

I haven't looked into the actual problem that you're trying to fix here
but this patch seems wrong and very unlikely to fly.  The messenger
invokes the OSD request callback outside of osd->lock and osdc->lock
critical sections on purpose to avoid deadlocks.  This goes way back
and is also consistent with how Objecter behaves in userspace.
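
To see why that matters, here is the generic shape of the problem (this
is not the actual ceph locking, and every name below is made up for the
illustration): if a client invokes the completion callback while still
holding its own lock, any callback that re-enters the client deadlocks.

#include <linux/mutex.h>

/* hypothetical client, only to show the pattern */
static DEFINE_MUTEX(client_lock);

struct demo_request {
	void (*callback)(struct demo_request *req);
};

static void demo_submit(struct demo_request *req)
{
	mutex_lock(&client_lock);
	/* ... link req into the client's tracking structure ... */
	mutex_unlock(&client_lock);
}

/* BAD: invokes the callback inside the client_lock critical section */
static void demo_handle_reply(struct demo_request *req)
{
	mutex_lock(&client_lock);
	/* ... unlink req ... */
	req->callback(req);		/* may call back into demo_submit() */
	mutex_unlock(&client_lock);
}

/* a callback that resubmits will deadlock on client_lock above */
static void demo_resubmit(struct demo_request *req)
{
	demo_submit(req);
}

Deferring the callback until the locks have been dropped avoids that
whole class of problem.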

Thanks,

                Ilya
Xiubo Li Jan. 28, 2023, 3:11 a.m. UTC | #7
On 26/01/2023 22:04, Ilya Dryomov wrote:
> On Thu, Jan 26, 2023 at 2:03 PM Xiubo Li <xiubli@redhat.com> wrote:
[...]
> Hi Xiubo,
>
> I haven't looked into the actual problem that you trying to fix here
> but this patch seems wrong and very unlikely to fly.  The messenger
> invokes the OSD request callback outside of osd->lock and osdc->lock
> critical sections on purpose to avoid deadlocks.  This goes way back
> and also consistent with how Objecter behaves in userspace.

Hi Ilya

This is just a draft patch here.

I haven't tested other cases yet; I only tested this cephfs issue and it
succeeded.

The root cause is a sequencing issue that makes sync_filesystem() fail to
wait for the last osd request: the waiter is woken up before the last
req->r_callback() has finished.  For more detail please see my comment in
https://tracker.ceph.com/issues/58126?issue_count=405&issue_position=3&next_issue_id=58082&prev_issue_id=58564#note-7.

I will figure out the best approach to fix it later.

Thanks

- Xiubo

> Thanks,
>
>                  Ilya
>
Ilya Dryomov Jan. 28, 2023, 5:41 p.m. UTC | #8
On Sat, Jan 28, 2023 at 4:12 AM Xiubo Li <xiubli@redhat.com> wrote:
[...]
>
> Hi Ilya
>
> This is just a draft patch here.
>
> I didn't test other cases yet and only tested the issue here in cephfs
> and it succeeded.
>
> The root cause is the sequence issue to make the sync_filesystem()
> failing to wait the last osd request, which is the last
> req->r_callback() isn't finished yet the waiter is woke up. And more
> detail please see my comment in
> https://tracker.ceph.com/issues/58126?issue_count=405&issue_position=3&next_issue_id=58082&prev_issue_id=58564#note-7.

Hi Xiubo,

This seems to boil down to an expectation mismatch.  The diagram in the
tracker comment seems to point the finger at ceph_osdc_sync() because
it appears to return while req->r_callback is still running.  This can
indeed be the case but note that ceph_osdc_sync() has never waited for
any higher-level callback -- req->r_callback could schedule some
delayed work which the OSD client would know nothing about, after all.
ceph_osdc_sync() just waits for a safe/commit reply from an OSD (and
nowadays all replies are safe -- unsafe/ack replies aren't really
a thing anymore).
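
In other words, if the filesystem needs "every r_callback() has
finished" semantics at umount time, it has to provide them itself.
A minimal sketch of that idea (the wb_sync helpers below are
hypothetical, not existing fs/ceph code):

#include <linux/atomic.h>
#include <linux/wait.h>

/* hypothetical bookkeeping, not existing fs/ceph code */
struct wb_sync {
	atomic_t		inflight;	/* requests whose callbacks
						 * haven't finished yet */
	wait_queue_head_t	wq;
};

static void wb_sync_init(struct wb_sync *s)
{
	atomic_set(&s->inflight, 0);
	init_waitqueue_head(&s->wq);
}

/* before submitting a writeback OSD request */
static void wb_sync_get(struct wb_sync *s)
{
	atomic_inc(&s->inflight);
}

/* the very last thing the r_callback does, after its final iput() */
static void wb_sync_put(struct wb_sync *s)
{
	if (atomic_dec_and_test(&s->inflight))
		wake_up(&s->wq);
}

/* umount path: wait until every callback has actually finished */
static void wb_sync_wait(struct wb_sync *s)
{
	wait_event(s->wq, atomic_read(&s->inflight) == 0);
}

Waiting on something like that after sync_filesystem() would close the
window without changing the OSD client's completion order or locking.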

Thanks,

                Ilya
Xiubo Li Jan. 29, 2023, 1:58 a.m. UTC | #9
On 29/01/2023 01:41, Ilya Dryomov wrote:
> On Sat, Jan 28, 2023 at 4:12 AM Xiubo Li <xiubli@redhat.com> wrote:
[...]
>>> Hi Xiubo,
>>>
>>> I haven't looked into the actual problem that you trying to fix here
>>> but this patch seems wrong and very unlikely to fly.  The messenger
>>> invokes the OSD request callback outside of osd->lock and osdc->lock
>>> critical sections on purpose to avoid deadlocks.  This goes way back
>>> and also consistent with how Objecter behaves in userspace.
>> Hi Ilya
>>
>> This is just a draft patch here.
>>
>> I didn't test other cases yet and only tested the issue here in cephfs
>> and it succeeded.
>>
>> The root cause is the sequence issue to make the sync_filesystem()
>> failing to wait the last osd request, which is the last
>> req->r_callback() isn't finished yet the waiter is woke up. And more
>> detail please see my comment in
>> https://tracker.ceph.com/issues/58126?issue_count=405&issue_position=3&next_issue_id=58082&prev_issue_id=58564#note-7.
> Hi Xiubo,
>
> This seems to boil down to an expectation mismatch.  The diagram in the
> tracker comment seems to point the finger at ceph_osdc_sync() because
> it appears to return while req->r_callback is still running.  This can
> indeed be the case but note that ceph_osdc_sync() has never waited for
> any higher-level callback -- req->r_callback could schedule some
> delayed work which the OSD client would know nothing about, after all.
> ceph_osdc_sync() just waits for a safe/commit reply from an OSD (and
> nowadays all replies are safe -- unsafe/ack replies aren't really
> a thing anymore).

Hi Ilya,

Yeah, for cephfs the req->r_callback() will directly release all the
resources the unmounting thread is waiting for, and that's enough.

As for the higher-level callback you mentioned above: in cephfs the last
req->r_callback() will at the same time flush the dirty caps/snaps to the
MDSs, which may send some ack replies back later; handling those will
ihold() the inodes, and this causes another crash in
https://tracker.ceph.com/issues/58126?issue_count=405&issue_position=3&next_issue_id=58082&prev_issue_id=58564#note-5.
That is another story: if the unmounting finishes just before the acks
arrive, they will be dropped directly; if the acks arrive while the
unmounting is not finished yet, they will update the inodes and the
inodes will then be evicted.  So the acks make no sense any more, and
this is why I need this patch to drop the ack replies.

Thanks
diff mbox series

Patch

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 15d9e0f0d65a..e8a53aeb2a8c 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -4222,6 +4222,9 @@  void ceph_handle_caps(struct ceph_mds_session *session,
 
 	dout("handle_caps from mds%d\n", session->s_mds);
 
+	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
+		return;
+
 	/* decode */
 	end = msg->front.iov_base + msg->front.iov_len;
 	if (msg->front.iov_len < sizeof(*h))
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index d41ab68f0130..1ad85af49b45 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -4869,6 +4869,9 @@  static void handle_lease(struct ceph_mds_client *mdsc,
 
 	dout("handle_lease from mds%d\n", mds);
 
+	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
+		return;
+
 	/* decode */
 	if (msg->front.iov_len < sizeof(*h) + sizeof(u32))
 		goto bad;
@@ -5262,7 +5265,7 @@  void send_flush_mdlog(struct ceph_mds_session *s)
 void ceph_mdsc_pre_umount(struct ceph_mds_client *mdsc)
 {
 	dout("pre_umount\n");
-	mdsc->stopping = 1;
+	mdsc->stopping = CEPH_MDSC_STOPPING_BEGAIN;
 
 	ceph_mdsc_iterate_sessions(mdsc, send_flush_mdlog, true);
 	ceph_mdsc_iterate_sessions(mdsc, lock_unlock_session, false);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 81a1f9a4ac3b..56f9d8077068 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -398,6 +398,11 @@  struct cap_wait {
 	int			want;
 };
 
+enum {
+	CEPH_MDSC_STOPPING_BEGAIN = 1,
+	CEPH_MDSC_STOPPING_FLUSHED = 2,
+};
+
 /*
  * mds client state
  */
@@ -414,7 +419,7 @@  struct ceph_mds_client {
 	struct ceph_mds_session **sessions;    /* NULL for mds if no session */
 	atomic_t		num_sessions;
 	int                     max_sessions;  /* len of sessions array */
-	int                     stopping;      /* true if shutting down */
+	int                     stopping;      /* the stage of shutting down */
 
 	atomic64_t		quotarealms_count; /* # realms with quota */
 	/*
diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c
index 64592adfe48f..f5819fc31d28 100644
--- a/fs/ceph/quota.c
+++ b/fs/ceph/quota.c
@@ -47,6 +47,9 @@  void ceph_handle_quota(struct ceph_mds_client *mdsc,
 	struct inode *inode;
 	struct ceph_inode_info *ci;
 
+	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
+		return;
+
 	if (msg->front.iov_len < sizeof(*h)) {
 		pr_err("%s corrupt message mds%d len %d\n", __func__,
 		       session->s_mds, (int)msg->front.iov_len);
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index a73943e51a77..eeabdd0211d8 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -1010,6 +1010,9 @@  void ceph_handle_snap(struct ceph_mds_client *mdsc,
 	int locked_rwsem = 0;
 	bool close_sessions = false;
 
+	if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED)
+		return;
+
 	/* decode */
 	if (msg->front.iov_len < sizeof(*h))
 		goto bad;
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index f10a076f47e5..012b35be41a9 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -1483,6 +1483,20 @@  static void ceph_kill_sb(struct super_block *s)
 	ceph_mdsc_pre_umount(fsc->mdsc);
 	flush_fs_workqueues(fsc);
 
+	/*
+	 * Though the kill_anon_super() will finally trigger the
+	 * sync_filesystem() anyway, we still need to do it here and
+	 * then bump the stage of shutdown. This will drop any further
+	 * message, which makes no sense any more, from MDSs.
+	 *
+	 * Without this when evicting the inodes it may fail in the
+	 * kill_anon_super(), which will trigger a warning when
+	 * destroying the fscrypt keyring and then possibly trigger
+	 * a further crash in ceph module when iput() the inodes.
+	 */
+	sync_filesystem(s);
+	fsc->mdsc->stopping = CEPH_MDSC_STOPPING_FLUSHED;
+
 	kill_anon_super(s);
 
 	fsc->client->extra_mon_dispatch = NULL;