
ceph: fix use-after-free bug for inodes when flushing capsnaps

Message ID 20230525024438.507082-1-xiubli@redhat.com
State New
Series ceph: fix use-after-free bug for inodes when flushing capsnaps

Commit Message

Xiubo Li May 25, 2023, 2:44 a.m. UTC
From: Xiubo Li <xiubli@redhat.com>

There is a race between flushing capsnaps and removing the inode from
the 'mdsc->snap_flush_list' list:

   Thread A                            Thread B
ceph_queue_cap_snap()
 -> allocate 'capsnapA'
 ->ihold('&ci->vfs_inode')
 ->add 'capsnapA' to 'ci->i_cap_snaps'
 ->add 'ci' to 'mdsc->snap_flush_list'
    ...
ceph_flush_snaps()
 ->__ceph_flush_snaps()
  ->__send_flush_snap()
                                handle_cap_flushsnap_ack()
                                 ->iput('&ci->vfs_inode')
                                   this also will release 'ci'
                                    ...
                                ceph_handle_snap()
                                 ->flush_snaps()
                                  ->iterate 'mdsc->snap_flush_list'
                                   ->get the stale 'ci'
 ->remove 'ci' from                ->ihold(&ci->vfs_inode) this
   'mdsc->snap_flush_list'           will WARNING

To fix this, remove the 'ci' from the 'mdsc->snap_flush_list' list
just before calling '__send_flush_snap()', to make sure the flushsnap
'ack' always arrives after the 'ci' has been removed from the
'snap_flush_list'.
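
For context, the flushsnap 'ack' side is what drops the inode
reference taken when the capsnap was queued. Roughly, the tail of
handle_cap_flushsnap_ack() looks like this (abridged sketch, not a
verbatim quote of fs/ceph/caps.c):

    if (flushed) {
            ceph_put_snap_context(capsnap->context);
            ceph_put_cap_snap(capsnap);
            /* drops the ref taken in ceph_queue_cap_snap(); if it is
             * the last one, the inode, and with it 'ci', is freed */
            iput(inode);
    }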

Cc: stable@vger.kernel.org
URL: https://bugzilla.redhat.com/show_bug.cgi?id=2209299
Signed-off-by: Xiubo Li <xiubli@redhat.com>
---
 fs/ceph/caps.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

Comments

Ilya Dryomov May 31, 2023, 11:09 a.m. UTC | #1
On Thu, May 25, 2023 at 4:45 AM <xiubli@redhat.com> wrote:
>
> From: Xiubo Li <xiubli@redhat.com>
>
> There is a race between flushing capsnaps and removing the inode
> from the 'mdsc->snap_flush_list' list:
>
>    Thread A                            Thread B
> ceph_queue_cap_snap()
>  -> allocate 'capsnapA'
>  ->ihold('&ci->vfs_inode')
>  ->add 'capsnapA' to 'ci->i_cap_snaps'
>  ->add 'ci' to 'mdsc->snap_flush_list'
>     ...
> ceph_flush_snaps()
>  ->__ceph_flush_snaps()
>   ->__send_flush_snap()
>                                 handle_cap_flushsnap_ack()
>                                  ->iput('&ci->vfs_inode')
>                                    this also will release 'ci'
>                                     ...
>                                 ceph_handle_snap()
>                                  ->flush_snaps()
>                                   ->iterate 'mdsc->snap_flush_list'
>                                    ->get the stale 'ci'
>  ->remove 'ci' from                ->ihold(&ci->vfs_inode) this
>    'mdsc->snap_flush_list'           will WARNING
>
> To fix this, remove the 'ci' from the 'mdsc->snap_flush_list' list
> just before calling '__send_flush_snap()', to make sure the flushsnap
> 'ack' always arrives after the 'ci' has been removed from the
> 'snap_flush_list'.

Hi Xiubo,

I'm not sure I'm following the logic here.  If the issue is that the
inode can be released by handle_cap_flushsnap_ack(), meaning that ci is
unsafe to dereference after the ack is received, what makes e.g. the
following snippet in __ceph_flush_snaps() work:

    ret = __send_flush_snap(inode, session, capsnap, cap->mseq,
                            oldest_flush_tid);
    if (ret < 0) {
            pr_err("__flush_snaps: error sending cap flushsnap, "
                   "ino (%llx.%llx) tid %llu follows %llu\n",
                    ceph_vinop(inode), cf->tid, capsnap->follows);
    }

    ceph_put_cap_snap(capsnap);
    spin_lock(&ci->i_ceph_lock);

If the ack is handled after capsnap is put but before ci->i_ceph_lock
is reacquired, could use-after-free occur inside spin_lock()?

Thanks,

                Ilya

>
> Cc: stable@vger.kernel.org
> URL: https://bugzilla.redhat.com/show_bug.cgi?id=2209299
> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> ---
>  fs/ceph/caps.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index feabf4cc0c4f..a8f890b3bb9a 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -1595,6 +1595,11 @@ static void __ceph_flush_snaps(struct ceph_inode_info *ci,
>
>         dout("__flush_snaps %p session %p\n", inode, session);
>
> +       /* we will flush them all; remove this inode from the queue */
> +       spin_lock(&mdsc->snap_flush_lock);
> +       list_del_init(&ci->i_snap_flush_item);
> +       spin_unlock(&mdsc->snap_flush_lock);
> +
>         list_for_each_entry(capsnap, &ci->i_cap_snaps, ci_item) {
>                 /*
>                  * we need to wait for sync writes to complete and for dirty
> @@ -1726,10 +1731,6 @@ void ceph_flush_snaps(struct ceph_inode_info *ci,
>                 *psession = session;
>         else
>                 ceph_put_mds_session(session);
> -       /* we flushed them all; remove this inode from the queue */
> -       spin_lock(&mdsc->snap_flush_lock);
> -       list_del_init(&ci->i_snap_flush_item);
> -       spin_unlock(&mdsc->snap_flush_lock);
>  }
>
>  /*
> --
> 2.40.1
>
Xiubo Li May 31, 2023, 11:33 a.m. UTC | #2
On 5/31/23 19:09, Ilya Dryomov wrote:
> On Thu, May 25, 2023 at 4:45 AM <xiubli@redhat.com> wrote:
>> From: Xiubo Li <xiubli@redhat.com>
>>
>> There is a race between flushing capsnaps and removing the inode
>> from the 'mdsc->snap_flush_list' list:
>>
>>     Thread A                            Thread B
>> ceph_queue_cap_snap()
>>   -> allocate 'capsnapA'
>>   ->ihold('&ci->vfs_inode')
>>   ->add 'capsnapA' to 'ci->i_cap_snaps'
>>   ->add 'ci' to 'mdsc->snap_flush_list'
>>      ...
>> ceph_flush_snaps()
>>   ->__ceph_flush_snaps()
>>    ->__send_flush_snap()
>>                                  handle_cap_flushsnap_ack()
>>                                   ->iput('&ci->vfs_inode')
>>                                     this also will release 'ci'
>>                                      ...
>>                                  ceph_handle_snap()
>>                                   ->flush_snaps()
>>                                    ->iterate 'mdsc->snap_flush_list'
>>                                     ->get the stale 'ci'
>>   ->remove 'ci' from                ->ihold(&ci->vfs_inode) this
>>     'mdsc->snap_flush_list'           will WARNING
>>
>> To fix this, remove the 'ci' from the 'mdsc->snap_flush_list' list
>> just before calling '__send_flush_snap()', to make sure the flushsnap
>> 'ack' always arrives after the 'ci' has been removed from the
>> 'snap_flush_list'.
> Hi Xiubo,
>
> I'm not sure I'm following the logic here.  If the issue is that the
> inode can be released by handle_cap_flushsnap_ack(), meaning that ci is
> unsafe to dereference after the ack is received, what makes e.g. the
> following snippet in __ceph_flush_snaps() work:
>
>      ret = __send_flush_snap(inode, session, capsnap, cap->mseq,
>                              oldest_flush_tid);
>      if (ret < 0) {
>              pr_err("__flush_snaps: error sending cap flushsnap, "
>                     "ino (%llx.%llx) tid %llu follows %llu\n",
>                      ceph_vinop(inode), cf->tid, capsnap->follows);
>      }
>
>      ceph_put_cap_snap(capsnap);
>      spin_lock(&ci->i_ceph_lock);
>
> If the ack is handled after capsnap is put but before ci->i_ceph_lock
> is reacquired, could use-after-free occur inside spin_lock()?

Yeah, certainly this could happen.

After 'ci' is freed, the memory may still be cached in the
'ceph_inode_cachep' slab cache or reused by others, so dereferencing
the stale 'ci' mostly won't crash. But just before 'ci' is freed,
'ci->vfs_inode.i_count' will already be 0, which is why the
'WARN_ON(atomic_inc_return(&inode->i_count) < 2)' call trace is much
easier to hit in the use-after-free case.
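
For reference, ihold() in fs/inode.c is essentially just:

    void ihold(struct inode *inode)
    {
            /* a stale ihold() sees i_count == 0, so the increment
             * returns 1 and this WARN_ON fires */
            WARN_ON(atomic_inc_return(&inode->i_count) < 2);
    }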

Thanks

- Xiubo

> Thanks,
>
>                  Ilya
>
>> Cc: stable@vger.kernel.org
>> URL: https://bugzilla.redhat.com/show_bug.cgi?id=2209299
>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>> ---
>>   fs/ceph/caps.c | 9 +++++----
>>   1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>> index feabf4cc0c4f..a8f890b3bb9a 100644
>> --- a/fs/ceph/caps.c
>> +++ b/fs/ceph/caps.c
>> @@ -1595,6 +1595,11 @@ static void __ceph_flush_snaps(struct ceph_inode_info *ci,
>>
>>          dout("__flush_snaps %p session %p\n", inode, session);
>>
>> +       /* we will flush them all; remove this inode from the queue */
>> +       spin_lock(&mdsc->snap_flush_lock);
>> +       list_del_init(&ci->i_snap_flush_item);
>> +       spin_unlock(&mdsc->snap_flush_lock);
>> +
>>          list_for_each_entry(capsnap, &ci->i_cap_snaps, ci_item) {
>>                  /*
>>                   * we need to wait for sync writes to complete and for dirty
>> @@ -1726,10 +1731,6 @@ void ceph_flush_snaps(struct ceph_inode_info *ci,
>>                  *psession = session;
>>          else
>>                  ceph_put_mds_session(session);
>> -       /* we flushed them all; remove this inode from the queue */
>> -       spin_lock(&mdsc->snap_flush_lock);
>> -       list_del_init(&ci->i_snap_flush_item);
>> -       spin_unlock(&mdsc->snap_flush_lock);
>>   }
>>
>>   /*
>> --
>> 2.40.1
>>
Ilya Dryomov May 31, 2023, 12:01 p.m. UTC | #3
On Wed, May 31, 2023 at 1:33 PM Xiubo Li <xiubli@redhat.com> wrote:
>
>
> On 5/31/23 19:09, Ilya Dryomov wrote:
> > On Thu, May 25, 2023 at 4:45 AM <xiubli@redhat.com> wrote:
> >> From: Xiubo Li <xiubli@redhat.com>
> >>
> >> There is a race between flushing capsnaps and removing the inode
> >> from the 'mdsc->snap_flush_list' list:
> >>
> >>     Thread A                            Thread B
> >> ceph_queue_cap_snap()
> >>   -> allocate 'capsnapA'
> >>   ->ihold('&ci->vfs_inode')
> >>   ->add 'capsnapA' to 'ci->i_cap_snaps'
> >>   ->add 'ci' to 'mdsc->snap_flush_list'
> >>      ...
> >> ceph_flush_snaps()
> >>   ->__ceph_flush_snaps()
> >>    ->__send_flush_snap()
> >>                                  handle_cap_flushsnap_ack()
> >>                                   ->iput('&ci->vfs_inode')
> >>                                     this also will release 'ci'
> >>                                      ...
> >>                                  ceph_handle_snap()
> >>                                   ->flush_snaps()
> >>                                    ->iterate 'mdsc->snap_flush_list'
> >>                                     ->get the stale 'ci'
> >>   ->remove 'ci' from                ->ihold(&ci->vfs_inode) this
> >>     'mdsc->snap_flush_list'           will WARNING
> >>
> >> To fix this, remove the 'ci' from the 'mdsc->snap_flush_list' list
> >> just before calling '__send_flush_snap()', to make sure the flushsnap
> >> 'ack' always arrives after the 'ci' has been removed from the
> >> 'snap_flush_list'.
> > Hi Xiubo,
> >
> > I'm not sure I'm following the logic here.  If the issue is that the
> > inode can be released by handle_cap_flushsnap_ack(), meaning that ci is
> > unsafe to dereference after the ack is received, what makes e.g. the
> > following snippet in __ceph_flush_snaps() work:
> >
> >      ret = __send_flush_snap(inode, session, capsnap, cap->mseq,
> >                              oldest_flush_tid);
> >      if (ret < 0) {
> >              pr_err("__flush_snaps: error sending cap flushsnap, "
> >                     "ino (%llx.%llx) tid %llu follows %llu\n",
> >                      ceph_vinop(inode), cf->tid, capsnap->follows);
> >      }
> >
> >      ceph_put_cap_snap(capsnap);
> >      spin_lock(&ci->i_ceph_lock);
> >
> > If the ack is handled after capsnap is put but before ci->i_ceph_lock
> > is reacquired, could use-after-free occur inside spin_lock()?
>
> Yeah, certainly this could happen.
>
> After 'ci' is freed, the memory may still be cached in the
> 'ceph_inode_cachep' slab cache or reused by others, so dereferencing
> the stale 'ci' mostly won't crash. But just before 'ci' is freed

Dereferencing an invalid pointer is a major bug regardless of whether
it leads to a crash.  Crashing fast is actually a pretty good case as
far as potential outcomes go.

This needs to be addressed: just removing ci from mdsc->snap_flush_list
earlier obviously doesn't cut it.

Thanks,

                Ilya
Xiubo Li May 31, 2023, 12:30 p.m. UTC | #4
On 5/31/23 20:01, Ilya Dryomov wrote:
> On Wed, May 31, 2023 at 1:33 PM Xiubo Li <xiubli@redhat.com> wrote:
>>
>> On 5/31/23 19:09, Ilya Dryomov wrote:
>>> On Thu, May 25, 2023 at 4:45 AM <xiubli@redhat.com> wrote:
>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>
>>>> There is a race between flushing capsnaps and removing the inode
>>>> from the 'mdsc->snap_flush_list' list:
>>>>
>>>>      Thread A                            Thread B
>>>> ceph_queue_cap_snap()
>>>>    -> allocate 'capsnapA'
>>>>    ->ihold('&ci->vfs_inode')
>>>>    ->add 'capsnapA' to 'ci->i_cap_snaps'
>>>>    ->add 'ci' to 'mdsc->snap_flush_list'
>>>>       ...
>>>> ceph_flush_snaps()
>>>>    ->__ceph_flush_snaps()
>>>>     ->__send_flush_snap()
>>>>                                   handle_cap_flushsnap_ack()
>>>>                                    ->iput('&ci->vfs_inode')
>>>>                                      this also will release 'ci'
>>>>                                       ...
>>>>                                   ceph_handle_snap()
>>>>                                    ->flush_snaps()
>>>>                                     ->iterate 'mdsc->snap_flush_list'
>>>>                                      ->get the stale 'ci'
>>>>    ->remove 'ci' from                ->ihold(&ci->vfs_inode) this
>>>>      'mdsc->snap_flush_list'           will WARNING
>>>>
>>>> To fix this, remove the 'ci' from the 'mdsc->snap_flush_list' list
>>>> just before calling '__send_flush_snap()', to make sure the flushsnap
>>>> 'ack' always arrives after the 'ci' has been removed from the
>>>> 'snap_flush_list'.
>>> Hi Xiubo,
>>>
>>> I'm not sure I'm following the logic here.  If the issue is that the
>>> inode can be released by handle_cap_flushsnap_ack(), meaning that ci is
>>> unsafe to dereference after the ack is received, what makes e.g. the
>>> following snippet in __ceph_flush_snaps() work:
>>>
>>>       ret = __send_flush_snap(inode, session, capsnap, cap->mseq,
>>>                               oldest_flush_tid);
>>>       if (ret < 0) {
>>>               pr_err("__flush_snaps: error sending cap flushsnap, "
>>>                      "ino (%llx.%llx) tid %llu follows %llu\n",
>>>                       ceph_vinop(inode), cf->tid, capsnap->follows);
>>>       }
>>>
>>>       ceph_put_cap_snap(capsnap);
>>>       spin_lock(&ci->i_ceph_lock);
>>>
>>> If the ack is handled after capsnap is put but before ci->i_ceph_lock
>>> is reacquired, could use-after-free occur inside spin_lock()?
>> Yeah, certainly this could happen.
>>
>> After the 'ci' being freed it's possible that the 'ci' is still cached
>> in the 'ceph_inode_cachep' slab caches or reused by others, so
>> dereferring the 'ci' mostly won't crash. But just before freeing 'ci'
> Dereferencing an invalid pointer is a major bug regardless of whether
> it leads to a crash.  Crashing fast is actually a pretty good case as
> far as potential outcomes go.
>
> This needs to be addressed: just removing ci from mdsc->snap_flush_list
> earlier obviously doesn't cut it.

This makes sure that the 'ci' is removed from the
'mdsc->snap_flush_list' list before the ack arrives.

There is another way to fix this, though: take an extra 'i_count'
reference when adding the 'ci' to the 'mdsc->snap_flush_list' list,
as in the sketch below. Would that be safer?
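
Something like this (untested sketch, not a posted patch; 'inode'
stands for '&ci->vfs_inode'):

    /* queue side, e.g. in ceph_queue_cap_snap(): pin the inode for
     * as long as 'ci' sits on mdsc->snap_flush_list */
    spin_lock(&mdsc->snap_flush_lock);
    if (list_empty(&ci->i_snap_flush_item)) {
            ihold(inode);   /* ref owned by the list entry */
            list_add_tail(&ci->i_snap_flush_item,
                          &mdsc->snap_flush_list);
    }
    spin_unlock(&mdsc->snap_flush_lock);

    /* dequeue side: drop the list's ref outside the spinlock,
     * since iput() may sleep */
    bool drop = false;

    spin_lock(&mdsc->snap_flush_lock);
    if (!list_empty(&ci->i_snap_flush_item)) {
            list_del_init(&ci->i_snap_flush_item);
            drop = true;
    }
    spin_unlock(&mdsc->snap_flush_lock);
    if (drop)
            iput(inode);

With that the ack's iput() could never drop the last reference while
'ci' is still reachable from the 'snap_flush_list'.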

Thanks

- Xiubo


> Thanks,
>
>                  Ilya
>

Patch

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index feabf4cc0c4f..a8f890b3bb9a 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1595,6 +1595,11 @@  static void __ceph_flush_snaps(struct ceph_inode_info *ci,
 
 	dout("__flush_snaps %p session %p\n", inode, session);
 
+	/* we will flush them all; remove this inode from the queue */
+	spin_lock(&mdsc->snap_flush_lock);
+	list_del_init(&ci->i_snap_flush_item);
+	spin_unlock(&mdsc->snap_flush_lock);
+
 	list_for_each_entry(capsnap, &ci->i_cap_snaps, ci_item) {
 		/*
 		 * we need to wait for sync writes to complete and for dirty
@@ -1726,10 +1731,6 @@  void ceph_flush_snaps(struct ceph_inode_info *ci,
 		*psession = session;
 	else
 		ceph_put_mds_session(session);
-	/* we flushed them all; remove this inode from the queue */
-	spin_lock(&mdsc->snap_flush_lock);
-	list_del_init(&ci->i_snap_flush_item);
-	spin_unlock(&mdsc->snap_flush_lock);
 }
 
 /*