
[v3] ceph: blocklist the kclient when receiving corrupted snap trace

Message ID 20221206125915.37404-1-xiubli@redhat.com
State New

Commit Message

Xiubo Li Dec. 6, 2022, 12:59 p.m. UTC
From: Xiubo Li <xiubli@redhat.com>

When a corrupted snap trace is received, we don't know what exactly has
happened on the MDS side, and we shouldn't continue writing to the OSDs,
as that may corrupt the snapshot contents.

Try to blocklist this client, and if that fails, crash the client
instead of leaving it able to write to the OSDs.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/57686
Signed-off-by: Xiubo Li <xiubli@redhat.com>
---

Thanks to Aaron for the feedback.

V3:
- Fixed ERROR: spaces required around that ':' (ctx:VxW)

V2:
- Switched to WARN() to taint the Linux kernel.

 fs/ceph/mds_client.c |  3 ++-
 fs/ceph/mds_client.h |  1 +
 fs/ceph/snap.c       | 25 +++++++++++++++++++++++++
 3 files changed, 28 insertions(+), 1 deletion(-)

Comments

Ilya Dryomov Dec. 7, 2022, 10:59 a.m. UTC | #1
On Tue, Dec 6, 2022 at 1:59 PM <xiubli@redhat.com> wrote:
>
> From: Xiubo Li <xiubli@redhat.com>
>
> When received corrupted snap trace we don't know what exactly has
> happened in MDS side. And we shouldn't continue writing to OSD,
> which may corrupt the snapshot contents.
>
> Just try to blocklist this client and If fails we need to crash the
> client instead of leaving it writeable to OSDs.
>
> Cc: stable@vger.kernel.org
> URL: https://tracker.ceph.com/issues/57686
> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> ---
>
> Thanks Aaron's feedback.
>
> V3:
> - Fixed ERROR: spaces required around that ':' (ctx:VxW)
>
> V2:
> - Switched to WARN() to taint the Linux kernel.
>
>  fs/ceph/mds_client.c |  3 ++-
>  fs/ceph/mds_client.h |  1 +
>  fs/ceph/snap.c       | 25 +++++++++++++++++++++++++
>  3 files changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index cbbaf334b6b8..59094944af28 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -5648,7 +5648,8 @@ static void mds_peer_reset(struct ceph_connection *con)
>         struct ceph_mds_client *mdsc = s->s_mdsc;
>
>         pr_warn("mds%d closed our session\n", s->s_mds);
> -       send_mds_reconnect(mdsc, s);
> +       if (!mdsc->no_reconnect)
> +               send_mds_reconnect(mdsc, s);
>  }
>
>  static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> index 728b7d72bf76..8e8f0447c0ad 100644
> --- a/fs/ceph/mds_client.h
> +++ b/fs/ceph/mds_client.h
> @@ -413,6 +413,7 @@ struct ceph_mds_client {
>         atomic_t                num_sessions;
>         int                     max_sessions;  /* len of sessions array */
>         int                     stopping;      /* true if shutting down */
> +       int                     no_reconnect;  /* true if snap trace is corrupted */
>
>         atomic64_t              quotarealms_count; /* # realms with quota */
>         /*
> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> index c1c452afa84d..023852b7c527 100644
> --- a/fs/ceph/snap.c
> +++ b/fs/ceph/snap.c
> @@ -767,8 +767,10 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
>         struct ceph_snap_realm *realm;
>         struct ceph_snap_realm *first_realm = NULL;
>         struct ceph_snap_realm *realm_to_rebuild = NULL;
> +       struct ceph_client *client = mdsc->fsc->client;
>         int rebuild_snapcs;
>         int err = -ENOMEM;
> +       int ret;
>         LIST_HEAD(dirty_realms);
>
>         lockdep_assert_held_write(&mdsc->snap_rwsem);
> @@ -885,6 +887,29 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
>         if (first_realm)
>                 ceph_put_snap_realm(mdsc, first_realm);
>         pr_err("%s error %d\n", __func__, err);
> +
> +       /*
> +        * When receiving a corrupted snap trace we don't know what
> +        * exactly has happened in MDS side. And we shouldn't continue
> +        * writing to OSD, which may corrupt the snapshot contents.
> +        *
> +        * Just try to blocklist this kclient and if it fails we need
> +        * to crash the kclient instead of leaving it writeable.

Hi Xiubo,

I'm not sure I understand this "let's blocklist ourselves" concept.
If the kernel client shouldn't continue writing to OSDs in this case,
why not just stop issuing writes -- perhaps initiating some equivalent
of a read-only remount like many local filesystems would do on I/O
errors (e.g. errors=remount-ro mode)?

Or, perhaps, all in-memory snap contexts could somehow be invalidated
in this case, making writes fail naturally -- on the client side,
without actually being sent to OSDs just to be nixed by the blocklist
hammer.

But further, what makes a failure to decode a snap trace special?
AFAIK we don't do anything close to this for any other decoding
failure.  Wouldn't "when received corrupted XYZ we don't know what
exactly has happened in MDS side" argument apply to pretty much all
decoding failures?

> +        *
> +        * Then this kclient must be remounted to continue after the
> +        * corrupted metadata fixed in the MDS side.
> +        */
> +       mdsc->no_reconnect = 1;
> +       ret = ceph_monc_blocklist_add(&client->monc, &client->msgr.inst.addr);
> +       if (ret) {
> +               pr_err("%s blocklist of %s failed: %d", __func__,
> +                      ceph_pr_addr(&client->msgr.inst.addr), ret);
> +               BUG();

... and this is a rough equivalent of errors=panic mode.

Is there a corresponding userspace client PR that can be referenced?
This needs additional background and justification IMO.

Thanks,

                Ilya
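
The comparison with errors=remount-ro and errors=panic can be made concrete with a small userspace sketch. It only illustrates the policy choice being discussed and is not code from fs/ceph. The names toy_fs, err_policy and toy_handle_error are hypothetical, and abort() stands in for BUG().

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical policy names, modelled on the errors= mount option of
 * local filesystems. Nothing here exists in fs/ceph.
 */
enum err_policy {
	ERRORS_CONTINUE,	/* log the error and carry on */
	ERRORS_REMOUNT_RO,	/* stop accepting writes */
	ERRORS_PANIC,		/* halt immediately */
};

/* Hypothetical stand-in for a mounted filesystem. */
struct toy_fs {
	int read_only;
};

/* Apply the configured policy after a metadata error was detected. */
static void toy_handle_error(struct toy_fs *fs, enum err_policy policy)
{
	fprintf(stderr, "toy_fs: metadata error detected\n");

	switch (policy) {
	case ERRORS_CONTINUE:
		break;
	case ERRORS_REMOUNT_RO:
		fs->read_only = 1;
		break;
	case ERRORS_PANIC:
		abort();	/* userspace stand-in for BUG() */
	}
}
```

In these terms, the BUG() on a failed blocklist request in the v3 patch corresponds to the ERRORS_PANIC branch, while the alternative suggested in the review corresponds to ERRORS_REMOUNT_RO.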
Xiubo Li Dec. 7, 2022, 12:35 p.m. UTC | #2
On 07/12/2022 18:59, Ilya Dryomov wrote:
> On Tue, Dec 6, 2022 at 1:59 PM <xiubli@redhat.com> wrote:
> Hi Xiubo,
>
> I'm not sure I understand this "let's blocklist ourselves" concept.
> If the kernel client shouldn't continue writing to OSDs in this case,
> why not just stop issuing writes -- perhaps initiating some equivalent
> of a read-only remount like many local filesystems would do on I/O
> errors (e.g. errors=remount-ro mode)?

I still haven't found a way to handle it like this from the ceph layer.
From what I saw, they just mark the inodes with EIO when this happens.

>
> Or, perhaps, all in-memory snap contexts could somehow be invalidated
> in this case, making writes fail naturally -- on the client side,
> without actually being sent to OSDs just to be nixed by the blocklist
> hammer.
>
> But further, what makes a failure to decode a snap trace special?

In the known tracker, the snapid was corrupted in one inode in the MDS,
and building the snap trace with that corrupted snapid then produces a
corrupted snap trace.

There may be other cases as well.

> AFAIK we don't do anything close to this for any other decoding
> failure.  Wouldn't "when received corrupted XYZ we don't know what
> exactly has happened in MDS side" argument apply to pretty much all
> decoding failures?

The snap trace is different from the other cases. A corrupted snap trace
affects the whole snap realm hierarchy, which in the worst case affects
all the inodes in the mount.

This is why I was trying to evict the mount to prevent further I/O.

>
>> +        *
>> +        * Then this kclient must be remounted to continue after the
>> +        * corrupted metadata fixed in the MDS side.
>> +        */
>> +       mdsc->no_reconnect = 1;
>> +       ret = ceph_monc_blocklist_add(&client->monc, &client->msgr.inst.addr);
>> +       if (ret) {
>> +               pr_err("%s blocklist of %s failed: %d", __func__,
>> +                      ceph_pr_addr(&client->msgr.inst.addr), ret);
>> +               BUG();
> ... and this is a rough equivalent of errors=panic mode.
>
> Is there a corresponding userspace client PR that can be referenced?
> This needs additional background and justification IMO.

Not yet. Anyway, we shouldn't let it continue doing I/O if adding it to
the blocklist fails.

- Xiubo

>
> Thanks,
>
>                  Ilya
>
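
The point about the snap realm hierarchy can be illustrated with a simplified userspace model. This is a sketch only, assuming that a realm's snap context is derived from its own snapshots plus those inherited from its ancestors. struct toy_realm and toy_realm_usable() are hypothetical names, and the real struct ceph_snap_realm carries far more state.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified snap realm. */
struct toy_realm {
	struct toy_realm *parent;	/* NULL for the root realm */
	bool corrupted;			/* this realm's own snap info is bad */
};

/*
 * Under the inheritance assumption above, a realm's snap context can
 * only be trusted if the realm itself and every ancestor up to the
 * root are intact.
 */
static bool toy_realm_usable(const struct toy_realm *realm)
{
	for (; realm; realm = realm->parent)
		if (realm->corrupted)
			return false;
	return true;
}
```

With a three-level chain of realms, marking only the root as corrupted makes toy_realm_usable() return false for every descendant, which is the worst case described above where all the inodes in the mount are affected.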
Xiubo Li Dec. 7, 2022, 1:19 p.m. UTC | #3
On 07/12/2022 18:59, Ilya Dryomov wrote:
> On Tue, Dec 6, 2022 at 1:59 PM <xiubli@redhat.com> wrote:
> Hi Xiubo,
>
> I'm not sure I understand this "let's blocklist ourselves" concept.
> If the kernel client shouldn't continue writing to OSDs in this case,
> why not just stop issuing writes -- perhaps initiating some equivalent
> of a read-only remount like many local filesystems would do on I/O
> errors (e.g. errors=remount-ro mode)?

The following patch seems to work. Let me do more testing to make sure
there are no further crashes.

diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index c1c452afa84d..cd487f8a4cb5 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -767,6 +767,7 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
         struct ceph_snap_realm *realm;
         struct ceph_snap_realm *first_realm = NULL;
         struct ceph_snap_realm *realm_to_rebuild = NULL;
+       struct super_block *sb = mdsc->fsc->sb;
         int rebuild_snapcs;
         int err = -ENOMEM;
         LIST_HEAD(dirty_realms);
@@ -885,6 +886,9 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
         if (first_realm)
                 ceph_put_snap_realm(mdsc, first_realm);
         pr_err("%s error %d\n", __func__, err);
+       pr_err("Remounting filesystem read-only\n");
+       sb->s_flags |= SB_RDONLY;
+
         return err;
  }




>
> Or, perhaps, all in-memory snap contexts could somehow be invalidated
> in this case, making writes fail naturally -- on the client side,
> without actually being sent to OSDs just to be nixed by the blocklist
> hammer.
>
> But further, what makes a failure to decode a snap trace special?
> AFAIK we don't do anything close to this for any other decoding
> failure.  Wouldn't "when received corrupted XYZ we don't know what
> exactly has happened in MDS side" argument apply to pretty much all
> decoding failures?
>
>> +        *
>> +        * Then this kclient must be remounted to continue after the
>> +        * corrupted metadata fixed in the MDS side.
>> +        */
>> +       mdsc->no_reconnect = 1;
>> +       ret = ceph_monc_blocklist_add(&client->monc, &client->msgr.inst.addr);
>> +       if (ret) {
>> +               pr_err("%s blocklist of %s failed: %d", __func__,
>> +                      ceph_pr_addr(&client->msgr.inst.addr), ret);
>> +               BUG();
> ... and this is a rough equivalent of errors=panic mode.
>
> Is there a corresponding userspace client PR that can be referenced?
> This needs additional background and justification IMO.
>
> Thanks,
>
>                  Ilya
>
Xiubo Li Dec. 7, 2022, 1:30 p.m. UTC | #4
On 07/12/2022 21:19, Xiubo Li wrote:
>
> On 07/12/2022 18:59, Ilya Dryomov wrote:
>> On Tue, Dec 6, 2022 at 1:59 PM <xiubli@redhat.com> wrote:
>> Hi Xiubo,
>>
>> I'm not sure I understand this "let's blocklist ourselves" concept.
>> If the kernel client shouldn't continue writing to OSDs in this case,
>> why not just stop issuing writes -- perhaps initiating some equivalent
>> of a read-only remount like many local filesystems would do on I/O
>> errors (e.g. errors=remount-ro mode)?
>
> The following patch seems working. Let me do more test to make sure 
> there is not further crash.
>
> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> index c1c452afa84d..cd487f8a4cb5 100644
> --- a/fs/ceph/snap.c
> +++ b/fs/ceph/snap.c
> @@ -767,6 +767,7 @@ int ceph_update_snap_trace(struct ceph_mds_client 
> *mdsc,
>         struct ceph_snap_realm *realm;
>         struct ceph_snap_realm *first_realm = NULL;
>         struct ceph_snap_realm *realm_to_rebuild = NULL;
> +       struct super_block *sb = mdsc->fsc->sb;
>         int rebuild_snapcs;
>         int err = -ENOMEM;
>         LIST_HEAD(dirty_realms);
> @@ -885,6 +886,9 @@ int ceph_update_snap_trace(struct ceph_mds_client 
> *mdsc,
>         if (first_realm)
>                 ceph_put_snap_realm(mdsc, first_realm);
>         pr_err("%s error %d\n", __func__, err);
> +       pr_err("Remounting filesystem read-only\n");
> +       sb->s_flags |= SB_RDONLY;
> +
>         return err;
>  }
>
>
The read-only approach was also my first thought, but I was not sure
whether it would be the best approach.

By evicting the kclient we can prevent buffered data from being written
to the OSDs, but it seems the read-only approach won't?

- Xiubo

>
>
>>
>> Or, perhaps, all in-memory snap contexts could somehow be invalidated
>> in this case, making writes fail naturally -- on the client side,
>> without actually being sent to OSDs just to be nixed by the blocklist
>> hammer.
>>
>> But further, what makes a failure to decode a snap trace special?
>> AFAIK we don't do anything close to this for any other decoding
>> failure.  Wouldn't "when received corrupted XYZ we don't know what
>> exactly has happened in MDS side" argument apply to pretty much all
>> decoding failures?
>>
>>> +        *
>>> +        * Then this kclient must be remounted to continue after the
>>> +        * corrupted metadata fixed in the MDS side.
>>> +        */
>>> +       mdsc->no_reconnect = 1;
>>> +       ret = ceph_monc_blocklist_add(&client->monc, 
>>> &client->msgr.inst.addr);
>>> +       if (ret) {
>>> +               pr_err("%s blocklist of %s failed: %d", __func__,
>>> + ceph_pr_addr(&client->msgr.inst.addr), ret);
>>> +               BUG();
>> ... and this is a rough equivalent of errors=panic mode.
>>
>> Is there a corresponding userspace client PR that can be referenced?
>> This needs additional background and justification IMO.
>>
>> Thanks,
>>
>>                  Ilya
>>
Ilya Dryomov Dec. 7, 2022, 2:20 p.m. UTC | #5
On Wed, Dec 7, 2022 at 2:31 PM Xiubo Li <xiubli@redhat.com> wrote:
>
>
> On 07/12/2022 21:19, Xiubo Li wrote:
> >
> > On 07/12/2022 18:59, Ilya Dryomov wrote:
> >> On Tue, Dec 6, 2022 at 1:59 PM <xiubli@redhat.com> wrote:
> >> Hi Xiubo,
> >>
> >> I'm not sure I understand this "let's blocklist ourselves" concept.
> >> If the kernel client shouldn't continue writing to OSDs in this case,
> >> why not just stop issuing writes -- perhaps initiating some equivalent
> >> of a read-only remount like many local filesystems would do on I/O
> >> errors (e.g. errors=remount-ro mode)?
> >
> > The following patch seems working. Let me do more test to make sure
> > there is not further crash.
> >
> > diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> > index c1c452afa84d..cd487f8a4cb5 100644
> > --- a/fs/ceph/snap.c
> > +++ b/fs/ceph/snap.c
> > @@ -767,6 +767,7 @@ int ceph_update_snap_trace(struct ceph_mds_client
> > *mdsc,
> >         struct ceph_snap_realm *realm;
> >         struct ceph_snap_realm *first_realm = NULL;
> >         struct ceph_snap_realm *realm_to_rebuild = NULL;
> > +       struct super_block *sb = mdsc->fsc->sb;
> >         int rebuild_snapcs;
> >         int err = -ENOMEM;
> >         LIST_HEAD(dirty_realms);
> > @@ -885,6 +886,9 @@ int ceph_update_snap_trace(struct ceph_mds_client
> > *mdsc,
> >         if (first_realm)
> >                 ceph_put_snap_realm(mdsc, first_realm);
> >         pr_err("%s error %d\n", __func__, err);
> > +       pr_err("Remounting filesystem read-only\n");
> > +       sb->s_flags |= SB_RDONLY;
> > +
> >         return err;
> >  }
> >
> >
> For readonly approach is also my first thought it should be, but I was
> just not very sure whether it would be the best approach.
>
> Because by evicting the kclient we could prevent the buffer to be wrote
> to OSDs. But the readonly one seems won't ?

The read-only setting is more for the VFS and the user.  Technically,
the kernel client could just stop issuing writes (i.e. OSD requests
containing a write op) and not set SB_RDONLY.  That should cover any
buffered data as well.

By employing self-blocklisting, you are shifting the responsibility
of rejecting OSD requests to the OSDs.  I'm saying that not issuing
OSD requests from a potentially busted client in the first place is
probably a better idea.  At the very least you wouldn't need to BUG
on ceph_monc_blocklist_add() errors.

Thanks,

                Ilya
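
The difference between setting SB_RDONLY and refusing to issue write requests can be sketched in userspace. This is a toy model under stated assumptions, not the kernel client: struct toy_client, its fenced flag and all toy_* functions are hypothetical, and the model assumes that writeback of already-buffered data does not consult the read-only flag.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these names exist in fs/ceph or net/ceph. */
struct toy_client {
	bool sb_rdonly;		/* VFS-level flag: rejects new write syscalls */
	bool fenced;		/* client-level flag: rejects OSD write requests */
	int sent_writes;	/* write requests that reached the "OSDs" */
};

/* Lowest layer: every OSD request passes through here. */
static int toy_submit_osd_request(struct toy_client *c, bool has_write_op)
{
	if (has_write_op && c->fenced)
		return -EIO;	/* fail locally, nothing goes on the wire */
	if (has_write_op)
		c->sent_writes++;
	return 0;
}

/* Syscall path: a new write coming from userspace. */
static int toy_write(struct toy_client *c)
{
	if (c->sb_rdonly)
		return -EROFS;
	return toy_submit_osd_request(c, true);
}

/*
 * Writeback path: flushes data that was buffered earlier. In this
 * model it does not look at sb_rdonly, so only the gate in the submit
 * path can stop it.
 */
static int toy_writeback(struct toy_client *c)
{
	return toy_submit_osd_request(c, true);
}
```

In this model, setting only sb_rdonly makes toy_write() fail with -EROFS while toy_writeback() still sends its request. Setting fenced stops both paths on the client side and leaves read-only requests unaffected, with no blocklist request that could fail.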
Ilya Dryomov Dec. 7, 2022, 2:28 p.m. UTC | #6
On Wed, Dec 7, 2022 at 1:35 PM Xiubo Li <xiubli@redhat.com> wrote:
>
>
> On 07/12/2022 18:59, Ilya Dryomov wrote:
> > On Tue, Dec 6, 2022 at 1:59 PM <xiubli@redhat.com> wrote:
> > Hi Xiubo,
> >
> > I'm not sure I understand this "let's blocklist ourselves" concept.
> > If the kernel client shouldn't continue writing to OSDs in this case,
> > why not just stop issuing writes -- perhaps initiating some equivalent
> > of a read-only remount like many local filesystems would do on I/O
> > errors (e.g. errors=remount-ro mode)?
>
> I still haven't found how could I handle it this way from ceph layer. I
> saw they are just marking the inodes as EIO when this happens.
>
> >
> > Or, perhaps, all in-memory snap contexts could somehow be invalidated
> > in this case, making writes fail naturally -- on the client side,
> > without actually being sent to OSDs just to be nixed by the blocklist
> > hammer.
> >
> > But further, what makes a failure to decode a snap trace special?
>
>  From the known tracker the snapid was corrupted in one inode in MDS and
> then when trying to build the snap trace with the corrupted snapid it
> will corrupt.
>
> And also there maybe other cases.
>
> > AFAIK we don't do anything close to this for any other decoding
> > failure.  Wouldn't "when received corrupted XYZ we don't know what
> > exactly has happened in MDS side" argument apply to pretty much all
> > decoding failures?
>
> The snap trace is different from other cases. The corrupted snap trace
> will affect the whole snap realm hierarchy, which will affect the whole
> inodes in the mount in worst case.
>
> This is why I was trying to evict the mount to prevent further IOs.

I suspected as much and my other suggestion was to look at somehow
invalidating snap contexts/realms.  Perhaps decode out-of-place and on
any error set a flag indicating that the snap context can't be trusted
anymore?  The OSD client could then check whether this flag is set
before admitting the snap context blob into the request message and
return an error, effectively rejecting the write.
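
For illustration only, the flag idea above could look roughly like the following user-space sketch. Nothing here is the real kclient code: struct demo_client, demo_update_snap_trace() and demo_admit_write() are hypothetical names, and the "snap trace" is reduced to a single native-endian u64 so the example stays self-contained.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for per-mount client state (not struct ceph_mds_client). */
struct demo_client {
	bool snapc_untrusted;	/* sticky: set when a snap trace fails to decode */
	uint64_t snap_seq;	/* last committed snap sequence number */
};

/*
 * Decode a toy "snap trace" (assumed layout: one native-endian u64 seq)
 * into a local first, and commit it to the client only on success.
 * On a decode error the live state is left untouched and the sticky
 * flag is set.
 */
int demo_update_snap_trace(struct demo_client *c, const void *buf, size_t len)
{
	uint64_t seq;

	assert(c != NULL && buf != NULL);
	if (len < sizeof(seq)) {
		c->snapc_untrusted = true;
		return -EIO;
	}
	memcpy(&seq, buf, sizeof(seq));	/* out-of-place decode */
	c->snap_seq = seq;		/* commit */
	return 0;
}

/* Gate applied before a snap context would be attached to a write request. */
int demo_admit_write(const struct demo_client *c)
{
	return c->snapc_untrusted ? -EIO : 0;
}
```

The point of the sketch is that the rejection happens on the client side, before any request is built, so nothing has to be bounced by the OSDs.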

Thanks,

                Ilya
Xiubo Li Dec. 8, 2022, 12:36 a.m. UTC | #7
On 07/12/2022 22:20, Ilya Dryomov wrote:
> On Wed, Dec 7, 2022 at 2:31 PM Xiubo Li <xiubli@redhat.com> wrote:
>>
>> On 07/12/2022 21:19, Xiubo Li wrote:
>>> On 07/12/2022 18:59, Ilya Dryomov wrote:
>>>> On Tue, Dec 6, 2022 at 1:59 PM <xiubli@redhat.com> wrote:
>>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>>
>>>>> When received corrupted snap trace we don't know what exactly has
>>>>> happened in MDS side. And we shouldn't continue writing to OSD,
>>>>> which may corrupt the snapshot contents.
>>>>>
>>>>> Just try to blocklist this client and If fails we need to crash the
>>>>> client instead of leaving it writeable to OSDs.
>>>>>
>>>>> Cc: stable@vger.kernel.org
>>>>> URL: https://tracker.ceph.com/issues/57686
>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>>> ---
>>>>>
>>>>> Thanks Aaron's feedback.
>>>>>
>>>>> V3:
>>>>> - Fixed ERROR: spaces required around that ':' (ctx:VxW)
>>>>>
>>>>> V2:
>>>>> - Switched to WARN() to taint the Linux kernel.
>>>>>
>>>>>    fs/ceph/mds_client.c |  3 ++-
>>>>>    fs/ceph/mds_client.h |  1 +
>>>>>    fs/ceph/snap.c       | 25 +++++++++++++++++++++++++
>>>>>    3 files changed, 28 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>>>> index cbbaf334b6b8..59094944af28 100644
>>>>> --- a/fs/ceph/mds_client.c
>>>>> +++ b/fs/ceph/mds_client.c
>>>>> @@ -5648,7 +5648,8 @@ static void mds_peer_reset(struct
>>>>> ceph_connection *con)
>>>>>           struct ceph_mds_client *mdsc = s->s_mdsc;
>>>>>
>>>>>           pr_warn("mds%d closed our session\n", s->s_mds);
>>>>> -       send_mds_reconnect(mdsc, s);
>>>>> +       if (!mdsc->no_reconnect)
>>>>> +               send_mds_reconnect(mdsc, s);
>>>>>    }
>>>>>
>>>>>    static void mds_dispatch(struct ceph_connection *con, struct
>>>>> ceph_msg *msg)
>>>>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
>>>>> index 728b7d72bf76..8e8f0447c0ad 100644
>>>>> --- a/fs/ceph/mds_client.h
>>>>> +++ b/fs/ceph/mds_client.h
>>>>> @@ -413,6 +413,7 @@ struct ceph_mds_client {
>>>>>           atomic_t                num_sessions;
>>>>>           int                     max_sessions;  /* len of sessions
>>>>> array */
>>>>>           int                     stopping;      /* true if shutting
>>>>> down */
>>>>> +       int                     no_reconnect;  /* true if snap trace
>>>>> is corrupted */
>>>>>
>>>>>           atomic64_t              quotarealms_count; /* # realms with
>>>>> quota */
>>>>>           /*
>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>>> index c1c452afa84d..023852b7c527 100644
>>>>> --- a/fs/ceph/snap.c
>>>>> +++ b/fs/ceph/snap.c
>>>>> @@ -767,8 +767,10 @@ int ceph_update_snap_trace(struct
>>>>> ceph_mds_client *mdsc,
>>>>>           struct ceph_snap_realm *realm;
>>>>>           struct ceph_snap_realm *first_realm = NULL;
>>>>>           struct ceph_snap_realm *realm_to_rebuild = NULL;
>>>>> +       struct ceph_client *client = mdsc->fsc->client;
>>>>>           int rebuild_snapcs;
>>>>>           int err = -ENOMEM;
>>>>> +       int ret;
>>>>>           LIST_HEAD(dirty_realms);
>>>>>
>>>>>           lockdep_assert_held_write(&mdsc->snap_rwsem);
>>>>> @@ -885,6 +887,29 @@ int ceph_update_snap_trace(struct
>>>>> ceph_mds_client *mdsc,
>>>>>           if (first_realm)
>>>>>                   ceph_put_snap_realm(mdsc, first_realm);
>>>>>           pr_err("%s error %d\n", __func__, err);
>>>>> +
>>>>> +       /*
>>>>> +        * When receiving a corrupted snap trace we don't know what
>>>>> +        * exactly has happened in MDS side. And we shouldn't continue
>>>>> +        * writing to OSD, which may corrupt the snapshot contents.
>>>>> +        *
>>>>> +        * Just try to blocklist this kclient and if it fails we need
>>>>> +        * to crash the kclient instead of leaving it writeable.
>>>> Hi Xiubo,
>>>>
>>>> I'm not sure I understand this "let's blocklist ourselves" concept.
>>>> If the kernel client shouldn't continue writing to OSDs in this case,
>>>> why not just stop issuing writes -- perhaps initiating some equivalent
>>>> of a read-only remount like many local filesystems would do on I/O
>>>> errors (e.g. errors=remount-ro mode)?
>>> The following patch seems working. Let me do more test to make sure
>>> there is not further crash.
>>>
>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>> index c1c452afa84d..cd487f8a4cb5 100644
>>> --- a/fs/ceph/snap.c
>>> +++ b/fs/ceph/snap.c
>>> @@ -767,6 +767,7 @@ int ceph_update_snap_trace(struct ceph_mds_client
>>> *mdsc,
>>>          struct ceph_snap_realm *realm;
>>>          struct ceph_snap_realm *first_realm = NULL;
>>>          struct ceph_snap_realm *realm_to_rebuild = NULL;
>>> +       struct super_block *sb = mdsc->fsc->sb;
>>>          int rebuild_snapcs;
>>>          int err = -ENOMEM;
>>>          LIST_HEAD(dirty_realms);
>>> @@ -885,6 +886,9 @@ int ceph_update_snap_trace(struct ceph_mds_client
>>> *mdsc,
>>>          if (first_realm)
>>>                  ceph_put_snap_realm(mdsc, first_realm);
>>>          pr_err("%s error %d\n", __func__, err);
>>> +       pr_err("Remounting filesystem read-only\n");
>>> +       sb->s_flags |= SB_RDONLY;
>>> +
>>>          return err;
>>>   }
>>>
>>>
>> For readonly approach is also my first thought it should be, but I was
>> just not very sure whether it would be the best approach.
>>
>> Because by evicting the kclient we could prevent the buffer to be wrote
>> to OSDs. But the readonly one seems won't ?
> The read-only setting is more for the VFS and the user.  Technically,
> the kernel client could just stop issuing writes (i.e. OSD requests
> containing a write op) and not set SB_RDONLY.  That should cover any
> buffered data as well.

From reading ext4 and other local filesystems, they all do it this way
and the VFS helps stop further writes. I tested the above patch and
it worked as expected.

I think that to stop subsequent OSD requests we can just check the
SB_RDONLY flag and skip the buffered writeback.
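
A minimal user-space sketch of that check, assuming a single flag bit modelled on SB_RDONLY. DEMO_SB_RDONLY, struct demo_sb and both helpers are made-up names for illustration, not kernel APIs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_SB_RDONLY 0x1UL	/* hypothetical bit, modelled on SB_RDONLY */

/* Hypothetical stand-in for the superblock; only the flags word matters here. */
struct demo_sb {
	unsigned long s_flags;
};

/* What the error path of the snap trace update would do in this model. */
void demo_mark_readonly(struct demo_sb *sb)
{
	assert(sb != NULL);
	sb->s_flags |= DEMO_SB_RDONLY;
}

/*
 * Decide whether an OSD request may be issued. Reads always pass;
 * writes (including writeback of already-buffered data) are refused
 * with -EROFS once the read-only bit is set.
 */
int demo_admit_osd_request(const struct demo_sb *sb, bool is_write)
{
	if (is_write && (sb->s_flags & DEMO_SB_RDONLY))
		return -EROFS;
	return 0;
}
```

Note that reads still pass through in this model, which is the limitation raised further down in this mail.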

> By employing self-blocklisting, you are shifting the responsibility
> of rejecting OSD requests to the OSDs.  I'm saying that not issuing
> OSD requests from a potentially busted client in the first place is
> probably a better idea.  At the very least you wouldn't need to BUG
> on ceph_monc_blocklist_add() errors.

I found an issue with the read-only approach:

In read-only mode the client can still access the MDSs and OSDs, so when
reading it will keep trying to update the snap realms with the corrupted
snap trace as before. What if users try to read or back up the snapshots
using the corrupted snap realms?

Isn't that a problem?

Thanks

- Xiubo

> Thanks,
>
>                  Ilya
>
Xiubo Li Dec. 8, 2022, 1:04 a.m. UTC | #8
On 07/12/2022 22:28, Ilya Dryomov wrote:
> On Wed, Dec 7, 2022 at 1:35 PM Xiubo Li <xiubli@redhat.com> wrote:
>>
>> On 07/12/2022 18:59, Ilya Dryomov wrote:
>>> On Tue, Dec 6, 2022 at 1:59 PM <xiubli@redhat.com> wrote:
>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>
>>>> When received corrupted snap trace we don't know what exactly has
>>>> happened in MDS side. And we shouldn't continue writing to OSD,
>>>> which may corrupt the snapshot contents.
>>>>
>>>> Just try to blocklist this client and If fails we need to crash the
>>>> client instead of leaving it writeable to OSDs.
>>>>
>>>> Cc: stable@vger.kernel.org
>>>> URL: https://tracker.ceph.com/issues/57686
>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>> ---
>>>>
>>>> Thanks Aaron's feedback.
>>>>
>>>> V3:
>>>> - Fixed ERROR: spaces required around that ':' (ctx:VxW)
>>>>
>>>> V2:
>>>> - Switched to WARN() to taint the Linux kernel.
>>>>
>>>>    fs/ceph/mds_client.c |  3 ++-
>>>>    fs/ceph/mds_client.h |  1 +
>>>>    fs/ceph/snap.c       | 25 +++++++++++++++++++++++++
>>>>    3 files changed, 28 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>>> index cbbaf334b6b8..59094944af28 100644
>>>> --- a/fs/ceph/mds_client.c
>>>> +++ b/fs/ceph/mds_client.c
>>>> @@ -5648,7 +5648,8 @@ static void mds_peer_reset(struct ceph_connection *con)
>>>>           struct ceph_mds_client *mdsc = s->s_mdsc;
>>>>
>>>>           pr_warn("mds%d closed our session\n", s->s_mds);
>>>> -       send_mds_reconnect(mdsc, s);
>>>> +       if (!mdsc->no_reconnect)
>>>> +               send_mds_reconnect(mdsc, s);
>>>>    }
>>>>
>>>>    static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
>>>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
>>>> index 728b7d72bf76..8e8f0447c0ad 100644
>>>> --- a/fs/ceph/mds_client.h
>>>> +++ b/fs/ceph/mds_client.h
>>>> @@ -413,6 +413,7 @@ struct ceph_mds_client {
>>>>           atomic_t                num_sessions;
>>>>           int                     max_sessions;  /* len of sessions array */
>>>>           int                     stopping;      /* true if shutting down */
>>>> +       int                     no_reconnect;  /* true if snap trace is corrupted */
>>>>
>>>>           atomic64_t              quotarealms_count; /* # realms with quota */
>>>>           /*
>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>> index c1c452afa84d..023852b7c527 100644
>>>> --- a/fs/ceph/snap.c
>>>> +++ b/fs/ceph/snap.c
>>>> @@ -767,8 +767,10 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
>>>>           struct ceph_snap_realm *realm;
>>>>           struct ceph_snap_realm *first_realm = NULL;
>>>>           struct ceph_snap_realm *realm_to_rebuild = NULL;
>>>> +       struct ceph_client *client = mdsc->fsc->client;
>>>>           int rebuild_snapcs;
>>>>           int err = -ENOMEM;
>>>> +       int ret;
>>>>           LIST_HEAD(dirty_realms);
>>>>
>>>>           lockdep_assert_held_write(&mdsc->snap_rwsem);
>>>> @@ -885,6 +887,29 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
>>>>           if (first_realm)
>>>>                   ceph_put_snap_realm(mdsc, first_realm);
>>>>           pr_err("%s error %d\n", __func__, err);
>>>> +
>>>> +       /*
>>>> +        * When receiving a corrupted snap trace we don't know what
>>>> +        * exactly has happened in MDS side. And we shouldn't continue
>>>> +        * writing to OSD, which may corrupt the snapshot contents.
>>>> +        *
>>>> +        * Just try to blocklist this kclient and if it fails we need
>>>> +        * to crash the kclient instead of leaving it writeable.
>>> Hi Xiubo,
>>>
>>> I'm not sure I understand this "let's blocklist ourselves" concept.
>>> If the kernel client shouldn't continue writing to OSDs in this case,
>>> why not just stop issuing writes -- perhaps initiating some equivalent
>>> of a read-only remount like many local filesystems would do on I/O
>>> errors (e.g. errors=remount-ro mode)?
>> I still haven't found how could I handle it this way from ceph layer. I
>> saw they are just marking the inodes as EIO when this happens.
>>
>>> Or, perhaps, all in-memory snap contexts could somehow be invalidated
>>> in this case, making writes fail naturally -- on the client side,
>>> without actually being sent to OSDs just to be nixed by the blocklist
>>> hammer.
>>>
>>> But further, what makes a failure to decode a snap trace special?
>>   From the known tracker the snapid was corrupted in one inode in MDS and
>> then when trying to build the snap trace with the corrupted snapid it
>> will corrupt.
>>
>> And also there maybe other cases.
>>
>>> AFAIK we don't do anything close to this for any other decoding
>>> failure.  Wouldn't "when received corrupted XYZ we don't know what
>>> exactly has happened in MDS side" argument apply to pretty much all
>>> decoding failures?
>> The snap trace is different from other cases. The corrupted snap trace
>> will affect the whole snap realm hierarchy, which will affect the whole
>> inodes in the mount in worst case.
>>
>> This is why I was trying to evict the mount to prevent further IOs.
> I suspected as much and my other suggestion was to look at somehow
> invalidating snap contexts/realms.  Perhaps decode out-of-place and on
> any error set a flag indicating that the snap context can't be trusted
> anymore?  The OSD client could then check whether this flag is set
> before admitting the snap context blob into the request message and
> return an error, effectively rejecting the write.

The snap realms are organized as a tree-like hierarchy. When the snap trace
is corrupted, maybe only one of the snap realms is affected, or maybe
several or all of them. The problem is that when decoding the corrupted
snap trace we can't know exactly which realms are affected. If one realm is
marked as invalid, all of its child realms are affected too.

So I don't think this is a better approach than the read-only or evicting ones.
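
To make the propagation concrete, here is a small user-space sketch, assuming each realm only keeps a pointer to its parent. struct demo_realm and demo_realm_usable() are hypothetical and much simpler than struct ceph_snap_realm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified realm node: just a parent link and a flag. */
struct demo_realm {
	struct demo_realm *parent;	/* NULL for the root realm */
	bool invalid;			/* set if this realm can't be trusted */
};

/*
 * A realm is usable only if neither it nor any ancestor is invalid,
 * so marking one realm implicitly taints its whole subtree.
 */
bool demo_realm_usable(const struct demo_realm *r)
{
	assert(r != NULL);
	for (; r != NULL; r = r->parent)
		if (r->invalid)
			return false;
	return true;
}
```

The sketch only works if the bad realm is known. With a corrupted trace it isn't, so the only safe choice in this model would be to mark the root invalid, which amounts to blocking the whole mount.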

Thanks,

- Xiubo

>
> Thanks,
>
>                  Ilya
>
Venky Shankar Dec. 9, 2022, 6:14 a.m. UTC | #9
On Thu, Dec 8, 2022 at 6:10 AM Xiubo Li <xiubli@redhat.com> wrote:
>
>
> On 07/12/2022 22:20, Ilya Dryomov wrote:
> > On Wed, Dec 7, 2022 at 2:31 PM Xiubo Li <xiubli@redhat.com> wrote:
> >>
> >> On 07/12/2022 21:19, Xiubo Li wrote:
> >>> On 07/12/2022 18:59, Ilya Dryomov wrote:
> >>>> On Tue, Dec 6, 2022 at 1:59 PM <xiubli@redhat.com> wrote:
> >>>>> From: Xiubo Li <xiubli@redhat.com>
> >>>>>
> >>>>> When received corrupted snap trace we don't know what exactly has
> >>>>> happened in MDS side. And we shouldn't continue writing to OSD,
> >>>>> which may corrupt the snapshot contents.
> >>>>>
> >>>>> Just try to blocklist this client and If fails we need to crash the
> >>>>> client instead of leaving it writeable to OSDs.
> >>>>>
> >>>>> Cc: stable@vger.kernel.org
> >>>>> URL: https://tracker.ceph.com/issues/57686
> >>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> >>>>> ---
> >>>>>
> >>>>> Thanks Aaron's feedback.
> >>>>>
> >>>>> V3:
> >>>>> - Fixed ERROR: spaces required around that ':' (ctx:VxW)
> >>>>>
> >>>>> V2:
> >>>>> - Switched to WARN() to taint the Linux kernel.
> >>>>>
> >>>>>    fs/ceph/mds_client.c |  3 ++-
> >>>>>    fs/ceph/mds_client.h |  1 +
> >>>>>    fs/ceph/snap.c       | 25 +++++++++++++++++++++++++
> >>>>>    3 files changed, 28 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> >>>>> index cbbaf334b6b8..59094944af28 100644
> >>>>> --- a/fs/ceph/mds_client.c
> >>>>> +++ b/fs/ceph/mds_client.c
> >>>>> @@ -5648,7 +5648,8 @@ static void mds_peer_reset(struct
> >>>>> ceph_connection *con)
> >>>>>           struct ceph_mds_client *mdsc = s->s_mdsc;
> >>>>>
> >>>>>           pr_warn("mds%d closed our session\n", s->s_mds);
> >>>>> -       send_mds_reconnect(mdsc, s);
> >>>>> +       if (!mdsc->no_reconnect)
> >>>>> +               send_mds_reconnect(mdsc, s);
> >>>>>    }
> >>>>>
> >>>>>    static void mds_dispatch(struct ceph_connection *con, struct
> >>>>> ceph_msg *msg)
> >>>>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> >>>>> index 728b7d72bf76..8e8f0447c0ad 100644
> >>>>> --- a/fs/ceph/mds_client.h
> >>>>> +++ b/fs/ceph/mds_client.h
> >>>>> @@ -413,6 +413,7 @@ struct ceph_mds_client {
> >>>>>           atomic_t                num_sessions;
> >>>>>           int                     max_sessions;  /* len of sessions
> >>>>> array */
> >>>>>           int                     stopping;      /* true if shutting
> >>>>> down */
> >>>>> +       int                     no_reconnect;  /* true if snap trace
> >>>>> is corrupted */
> >>>>>
> >>>>>           atomic64_t              quotarealms_count; /* # realms with
> >>>>> quota */
> >>>>>           /*
> >>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> >>>>> index c1c452afa84d..023852b7c527 100644
> >>>>> --- a/fs/ceph/snap.c
> >>>>> +++ b/fs/ceph/snap.c
> >>>>> @@ -767,8 +767,10 @@ int ceph_update_snap_trace(struct
> >>>>> ceph_mds_client *mdsc,
> >>>>>           struct ceph_snap_realm *realm;
> >>>>>           struct ceph_snap_realm *first_realm = NULL;
> >>>>>           struct ceph_snap_realm *realm_to_rebuild = NULL;
> >>>>> +       struct ceph_client *client = mdsc->fsc->client;
> >>>>>           int rebuild_snapcs;
> >>>>>           int err = -ENOMEM;
> >>>>> +       int ret;
> >>>>>           LIST_HEAD(dirty_realms);
> >>>>>
> >>>>>           lockdep_assert_held_write(&mdsc->snap_rwsem);
> >>>>> @@ -885,6 +887,29 @@ int ceph_update_snap_trace(struct
> >>>>> ceph_mds_client *mdsc,
> >>>>>           if (first_realm)
> >>>>>                   ceph_put_snap_realm(mdsc, first_realm);
> >>>>>           pr_err("%s error %d\n", __func__, err);
> >>>>> +
> >>>>> +       /*
> >>>>> +        * When receiving a corrupted snap trace we don't know what
> >>>>> +        * exactly has happened in MDS side. And we shouldn't continue
> >>>>> +        * writing to OSD, which may corrupt the snapshot contents.
> >>>>> +        *
> >>>>> +        * Just try to blocklist this kclient and if it fails we need
> >>>>> +        * to crash the kclient instead of leaving it writeable.
> >>>> Hi Xiubo,
> >>>>
> >>>> I'm not sure I understand this "let's blocklist ourselves" concept.
> >>>> If the kernel client shouldn't continue writing to OSDs in this case,
> >>>> why not just stop issuing writes -- perhaps initiating some equivalent
> >>>> of a read-only remount like many local filesystems would do on I/O
> >>>> errors (e.g. errors=remount-ro mode)?
> >>> The following patch seems working. Let me do more test to make sure
> >>> there is not further crash.
> >>>
> >>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> >>> index c1c452afa84d..cd487f8a4cb5 100644
> >>> --- a/fs/ceph/snap.c
> >>> +++ b/fs/ceph/snap.c
> >>> @@ -767,6 +767,7 @@ int ceph_update_snap_trace(struct ceph_mds_client
> >>> *mdsc,
> >>>          struct ceph_snap_realm *realm;
> >>>          struct ceph_snap_realm *first_realm = NULL;
> >>>          struct ceph_snap_realm *realm_to_rebuild = NULL;
> >>> +       struct super_block *sb = mdsc->fsc->sb;
> >>>          int rebuild_snapcs;
> >>>          int err = -ENOMEM;
> >>>          LIST_HEAD(dirty_realms);
> >>> @@ -885,6 +886,9 @@ int ceph_update_snap_trace(struct ceph_mds_client
> >>> *mdsc,
> >>>          if (first_realm)
> >>>                  ceph_put_snap_realm(mdsc, first_realm);
> >>>          pr_err("%s error %d\n", __func__, err);
> >>> +       pr_err("Remounting filesystem read-only\n");
> >>> +       sb->s_flags |= SB_RDONLY;
> >>> +
> >>>          return err;
> >>>   }
> >>>
> >>>
> >> For readonly approach is also my first thought it should be, but I was
> >> just not very sure whether it would be the best approach.
> >>
> >> Because by evicting the kclient we could prevent the buffer to be wrote
> >> to OSDs. But the readonly one seems won't ?
> > The read-only setting is more for the VFS and the user.  Technically,
> > the kernel client could just stop issuing writes (i.e. OSD requests
> > containing a write op) and not set SB_RDONLY.  That should cover any
> > buffered data as well.
>
> From reading ext4 and other local filesystems, they all do it this way
> and the VFS will help stop further writing. Tested the above patch and
> it worked as expected.
>
> I think to stop the following OSD requests we can just check the
> SB_RDONLY flag to prevent the buffer writeback.
>
> > By employing self-blocklisting, you are shifting the responsibility
> > of rejecting OSD requests to the OSDs.  I'm saying that not issuing
> > OSD requests from a potentially busted client in the first place is
> > probably a better idea.  At the very least you wouldn't need to BUG
> > on ceph_monc_blocklist_add() errors.
>
> I found an issue for the read-only approach:
>
> In read-only mode it still can access to the MDSs and OSDs, which will
> continue trying to update the snap realms with the corrupted snap trace
> as before when reading. What if users try to read or backup the
> snapshots by using the corrupted snap realms ?
>
> Isn't that a problem ?

Yeah, this might cause more problems in various places than this
change is supposed to handle.

Maybe we could track affected realms (although not that granularly) and
disallow reads to them (and their children), but I think it might not be
worth the effort.

>
> Thanks
>
> - Xiubo
>
> > Thanks,
> >
> >                  Ilya
> >
>
Xiubo Li Dec. 9, 2022, 6:59 a.m. UTC | #10
On 09/12/2022 14:14, Venky Shankar wrote:
> On Thu, Dec 8, 2022 at 6:10 AM Xiubo Li <xiubli@redhat.com> wrote:
>>
>> On 07/12/2022 22:20, Ilya Dryomov wrote:
>>> On Wed, Dec 7, 2022 at 2:31 PM Xiubo Li <xiubli@redhat.com> wrote:
>>>> On 07/12/2022 21:19, Xiubo Li wrote:
>>>>> On 07/12/2022 18:59, Ilya Dryomov wrote:
>>>>>> On Tue, Dec 6, 2022 at 1:59 PM <xiubli@redhat.com> wrote:
>>>>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>>>>
>>>>>>> When received corrupted snap trace we don't know what exactly has
>>>>>>> happened in MDS side. And we shouldn't continue writing to OSD,
>>>>>>> which may corrupt the snapshot contents.
>>>>>>>
>>>>>>> Just try to blocklist this client and If fails we need to crash the
>>>>>>> client instead of leaving it writeable to OSDs.
>>>>>>>
>>>>>>> Cc: stable@vger.kernel.org
>>>>>>> URL: https://tracker.ceph.com/issues/57686
>>>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>>>>> ---
>>>>>>>
>>>>>>> Thanks Aaron's feedback.
>>>>>>>
>>>>>>> V3:
>>>>>>> - Fixed ERROR: spaces required around that ':' (ctx:VxW)
>>>>>>>
>>>>>>> V2:
>>>>>>> - Switched to WARN() to taint the Linux kernel.
>>>>>>>
>>>>>>>     fs/ceph/mds_client.c |  3 ++-
>>>>>>>     fs/ceph/mds_client.h |  1 +
>>>>>>>     fs/ceph/snap.c       | 25 +++++++++++++++++++++++++
>>>>>>>     3 files changed, 28 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>>>>>> index cbbaf334b6b8..59094944af28 100644
>>>>>>> --- a/fs/ceph/mds_client.c
>>>>>>> +++ b/fs/ceph/mds_client.c
>>>>>>> @@ -5648,7 +5648,8 @@ static void mds_peer_reset(struct
>>>>>>> ceph_connection *con)
>>>>>>>            struct ceph_mds_client *mdsc = s->s_mdsc;
>>>>>>>
>>>>>>>            pr_warn("mds%d closed our session\n", s->s_mds);
>>>>>>> -       send_mds_reconnect(mdsc, s);
>>>>>>> +       if (!mdsc->no_reconnect)
>>>>>>> +               send_mds_reconnect(mdsc, s);
>>>>>>>     }
>>>>>>>
>>>>>>>     static void mds_dispatch(struct ceph_connection *con, struct
>>>>>>> ceph_msg *msg)
>>>>>>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
>>>>>>> index 728b7d72bf76..8e8f0447c0ad 100644
>>>>>>> --- a/fs/ceph/mds_client.h
>>>>>>> +++ b/fs/ceph/mds_client.h
>>>>>>> @@ -413,6 +413,7 @@ struct ceph_mds_client {
>>>>>>>            atomic_t                num_sessions;
>>>>>>>            int                     max_sessions;  /* len of sessions
>>>>>>> array */
>>>>>>>            int                     stopping;      /* true if shutting
>>>>>>> down */
>>>>>>> +       int                     no_reconnect;  /* true if snap trace
>>>>>>> is corrupted */
>>>>>>>
>>>>>>>            atomic64_t              quotarealms_count; /* # realms with
>>>>>>> quota */
>>>>>>>            /*
>>>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>>>>> index c1c452afa84d..023852b7c527 100644
>>>>>>> --- a/fs/ceph/snap.c
>>>>>>> +++ b/fs/ceph/snap.c
>>>>>>> @@ -767,8 +767,10 @@ int ceph_update_snap_trace(struct
>>>>>>> ceph_mds_client *mdsc,
>>>>>>>            struct ceph_snap_realm *realm;
>>>>>>>            struct ceph_snap_realm *first_realm = NULL;
>>>>>>>            struct ceph_snap_realm *realm_to_rebuild = NULL;
>>>>>>> +       struct ceph_client *client = mdsc->fsc->client;
>>>>>>>            int rebuild_snapcs;
>>>>>>>            int err = -ENOMEM;
>>>>>>> +       int ret;
>>>>>>>            LIST_HEAD(dirty_realms);
>>>>>>>
>>>>>>>            lockdep_assert_held_write(&mdsc->snap_rwsem);
>>>>>>> @@ -885,6 +887,29 @@ int ceph_update_snap_trace(struct
>>>>>>> ceph_mds_client *mdsc,
>>>>>>>            if (first_realm)
>>>>>>>                    ceph_put_snap_realm(mdsc, first_realm);
>>>>>>>            pr_err("%s error %d\n", __func__, err);
>>>>>>> +
>>>>>>> +       /*
>>>>>>> +        * When receiving a corrupted snap trace we don't know what
>>>>>>> +        * exactly has happened in MDS side. And we shouldn't continue
>>>>>>> +        * writing to OSD, which may corrupt the snapshot contents.
>>>>>>> +        *
>>>>>>> +        * Just try to blocklist this kclient and if it fails we need
>>>>>>> +        * to crash the kclient instead of leaving it writeable.
>>>>>> Hi Xiubo,
>>>>>>
>>>>>> I'm not sure I understand this "let's blocklist ourselves" concept.
>>>>>> If the kernel client shouldn't continue writing to OSDs in this case,
>>>>>> why not just stop issuing writes -- perhaps initiating some equivalent
>>>>>> of a read-only remount like many local filesystems would do on I/O
>>>>>> errors (e.g. errors=remount-ro mode)?
>>>>> The following patch seems working. Let me do more test to make sure
>>>>> there is not further crash.
>>>>>
>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>>> index c1c452afa84d..cd487f8a4cb5 100644
>>>>> --- a/fs/ceph/snap.c
>>>>> +++ b/fs/ceph/snap.c
>>>>> @@ -767,6 +767,7 @@ int ceph_update_snap_trace(struct ceph_mds_client
>>>>> *mdsc,
>>>>>           struct ceph_snap_realm *realm;
>>>>>           struct ceph_snap_realm *first_realm = NULL;
>>>>>           struct ceph_snap_realm *realm_to_rebuild = NULL;
>>>>> +       struct super_block *sb = mdsc->fsc->sb;
>>>>>           int rebuild_snapcs;
>>>>>           int err = -ENOMEM;
>>>>>           LIST_HEAD(dirty_realms);
>>>>> @@ -885,6 +886,9 @@ int ceph_update_snap_trace(struct ceph_mds_client
>>>>> *mdsc,
>>>>>           if (first_realm)
>>>>>                   ceph_put_snap_realm(mdsc, first_realm);
>>>>>           pr_err("%s error %d\n", __func__, err);
>>>>> +       pr_err("Remounting filesystem read-only\n");
>>>>> +       sb->s_flags |= SB_RDONLY;
>>>>> +
>>>>>           return err;
>>>>>    }
>>>>>
>>>>>
>>>> For readonly approach is also my first thought it should be, but I was
>>>> just not very sure whether it would be the best approach.
>>>>
>>>> Because by evicting the kclient we could prevent the buffer to be wrote
>>>> to OSDs. But the readonly one seems won't ?
>>> The read-only setting is more for the VFS and the user.  Technically,
>>> the kernel client could just stop issuing writes (i.e. OSD requests
>>> containing a write op) and not set SB_RDONLY.  That should cover any
>>> buffered data as well.
>> From reading ext4 and other local filesystems, they all do it this way
>> and the VFS will help stop further writing. Tested the above patch and
>> it worked as expected.
>>
>> I think to stop the following OSD requests we can just check the
>> SB_RDONLY flag to prevent the buffer writeback.
>>
>>> By employing self-blocklisting, you are shifting the responsibility
>>> of rejecting OSD requests to the OSDs.  I'm saying that not issuing
>>> OSD requests from a potentially busted client in the first place is
>>> probably a better idea.  At the very least you wouldn't need to BUG
>>> on ceph_monc_blocklist_add() errors.
>> I found an issue for the read-only approach:
>>
>> In read-only mode it still can access to the MDSs and OSDs, which will
>> continue trying to update the snap realms with the corrupted snap trace
>> as before when reading. What if users try to read or backup the
>> snapshots by using the corrupted snap realms ?
>>
>> Isn't that a problem ?
> Yeh - this might end up in more problems in various places than what
> this change is supposed to handle.
>
> Maybe we could track affected realms (although, not that granular) and
> disallow reads to them (and its children), but I think it might not be
> worth putting in the effort.

IMO this doesn't make much sense.

When reading we need to use the inodes to get the corresponding realms.
Once the metadata is corrupted and the update is aborted here, the inodes'
corresponding realms could be incorrect. That's because when updating
the snap realms here it's possible that the snap realm hierarchy is
adjusted and some inodes' realms are changed.

Thanks

- Xiubo

>> Thanks
>>
>> - Xiubo
>>
>>> Thanks,
>>>
>>>                   Ilya
>>>
>
Venky Shankar Dec. 9, 2022, 7:15 a.m. UTC | #11
On Fri, Dec 9, 2022 at 12:29 PM Xiubo Li <xiubli@redhat.com> wrote:
>
>
> On 09/12/2022 14:14, Venky Shankar wrote:
> > On Thu, Dec 8, 2022 at 6:10 AM Xiubo Li <xiubli@redhat.com> wrote:
> >>
> >> On 07/12/2022 22:20, Ilya Dryomov wrote:
> >>> On Wed, Dec 7, 2022 at 2:31 PM Xiubo Li <xiubli@redhat.com> wrote:
> >>>> On 07/12/2022 21:19, Xiubo Li wrote:
> >>>>> On 07/12/2022 18:59, Ilya Dryomov wrote:
> >>>>>> On Tue, Dec 6, 2022 at 1:59 PM <xiubli@redhat.com> wrote:
> >>>>>>> From: Xiubo Li <xiubli@redhat.com>
> >>>>>>>
> >>>>>>> When received corrupted snap trace we don't know what exactly has
> >>>>>>> happened in MDS side. And we shouldn't continue writing to OSD,
> >>>>>>> which may corrupt the snapshot contents.
> >>>>>>>
> >>>>>>> Just try to blocklist this client and If fails we need to crash the
> >>>>>>> client instead of leaving it writeable to OSDs.
> >>>>>>>
> >>>>>>> Cc: stable@vger.kernel.org
> >>>>>>> URL: https://tracker.ceph.com/issues/57686
> >>>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> >>>>>>> ---
> >>>>>>>
> >>>>>>> Thanks Aaron's feedback.
> >>>>>>>
> >>>>>>> V3:
> >>>>>>> - Fixed ERROR: spaces required around that ':' (ctx:VxW)
> >>>>>>>
> >>>>>>> V2:
> >>>>>>> - Switched to WARN() to taint the Linux kernel.
> >>>>>>>
> >>>>>>>     fs/ceph/mds_client.c |  3 ++-
> >>>>>>>     fs/ceph/mds_client.h |  1 +
> >>>>>>>     fs/ceph/snap.c       | 25 +++++++++++++++++++++++++
> >>>>>>>     3 files changed, 28 insertions(+), 1 deletion(-)
> >>>>>>>
> >>>>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> >>>>>>> index cbbaf334b6b8..59094944af28 100644
> >>>>>>> --- a/fs/ceph/mds_client.c
> >>>>>>> +++ b/fs/ceph/mds_client.c
> >>>>>>> @@ -5648,7 +5648,8 @@ static void mds_peer_reset(struct
> >>>>>>> ceph_connection *con)
> >>>>>>>            struct ceph_mds_client *mdsc = s->s_mdsc;
> >>>>>>>
> >>>>>>>            pr_warn("mds%d closed our session\n", s->s_mds);
> >>>>>>> -       send_mds_reconnect(mdsc, s);
> >>>>>>> +       if (!mdsc->no_reconnect)
> >>>>>>> +               send_mds_reconnect(mdsc, s);
> >>>>>>>     }
> >>>>>>>
> >>>>>>>     static void mds_dispatch(struct ceph_connection *con, struct
> >>>>>>> ceph_msg *msg)
> >>>>>>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> >>>>>>> index 728b7d72bf76..8e8f0447c0ad 100644
> >>>>>>> --- a/fs/ceph/mds_client.h
> >>>>>>> +++ b/fs/ceph/mds_client.h
> >>>>>>> @@ -413,6 +413,7 @@ struct ceph_mds_client {
> >>>>>>>            atomic_t                num_sessions;
> >>>>>>>            int                     max_sessions;  /* len of sessions
> >>>>>>> array */
> >>>>>>>            int                     stopping;      /* true if shutting
> >>>>>>> down */
> >>>>>>> +       int                     no_reconnect;  /* true if snap trace
> >>>>>>> is corrupted */
> >>>>>>>
> >>>>>>>            atomic64_t              quotarealms_count; /* # realms with
> >>>>>>> quota */
> >>>>>>>            /*
> >>>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> >>>>>>> index c1c452afa84d..023852b7c527 100644
> >>>>>>> --- a/fs/ceph/snap.c
> >>>>>>> +++ b/fs/ceph/snap.c
> >>>>>>> @@ -767,8 +767,10 @@ int ceph_update_snap_trace(struct
> >>>>>>> ceph_mds_client *mdsc,
> >>>>>>>            struct ceph_snap_realm *realm;
> >>>>>>>            struct ceph_snap_realm *first_realm = NULL;
> >>>>>>>            struct ceph_snap_realm *realm_to_rebuild = NULL;
> >>>>>>> +       struct ceph_client *client = mdsc->fsc->client;
> >>>>>>>            int rebuild_snapcs;
> >>>>>>>            int err = -ENOMEM;
> >>>>>>> +       int ret;
> >>>>>>>            LIST_HEAD(dirty_realms);
> >>>>>>>
> >>>>>>>            lockdep_assert_held_write(&mdsc->snap_rwsem);
> >>>>>>> @@ -885,6 +887,29 @@ int ceph_update_snap_trace(struct
> >>>>>>> ceph_mds_client *mdsc,
> >>>>>>>            if (first_realm)
> >>>>>>>                    ceph_put_snap_realm(mdsc, first_realm);
> >>>>>>>            pr_err("%s error %d\n", __func__, err);
> >>>>>>> +
> >>>>>>> +       /*
> >>>>>>> +        * When receiving a corrupted snap trace we don't know what
> >>>>>>> +        * exactly has happened in MDS side. And we shouldn't continue
> >>>>>>> +        * writing to OSD, which may corrupt the snapshot contents.
> >>>>>>> +        *
> >>>>>>> +        * Just try to blocklist this kclient and if it fails we need
> >>>>>>> +        * to crash the kclient instead of leaving it writeable.
> >>>>>> Hi Xiubo,
> >>>>>>
> >>>>>> I'm not sure I understand this "let's blocklist ourselves" concept.
> >>>>>> If the kernel client shouldn't continue writing to OSDs in this case,
> >>>>>> why not just stop issuing writes -- perhaps initiating some equivalent
> >>>>>> of a read-only remount like many local filesystems would do on I/O
> >>>>>> errors (e.g. errors=remount-ro mode)?
> >>>>> The following patch seems working. Let me do more test to make sure
> >>>>> there is not further crash.
> >>>>>
> >>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> >>>>> index c1c452afa84d..cd487f8a4cb5 100644
> >>>>> --- a/fs/ceph/snap.c
> >>>>> +++ b/fs/ceph/snap.c
> >>>>> @@ -767,6 +767,7 @@ int ceph_update_snap_trace(struct ceph_mds_client
> >>>>> *mdsc,
> >>>>>           struct ceph_snap_realm *realm;
> >>>>>           struct ceph_snap_realm *first_realm = NULL;
> >>>>>           struct ceph_snap_realm *realm_to_rebuild = NULL;
> >>>>> +       struct super_block *sb = mdsc->fsc->sb;
> >>>>>           int rebuild_snapcs;
> >>>>>           int err = -ENOMEM;
> >>>>>           LIST_HEAD(dirty_realms);
> >>>>> @@ -885,6 +886,9 @@ int ceph_update_snap_trace(struct ceph_mds_client
> >>>>> *mdsc,
> >>>>>           if (first_realm)
> >>>>>                   ceph_put_snap_realm(mdsc, first_realm);
> >>>>>           pr_err("%s error %d\n", __func__, err);
> >>>>> +       pr_err("Remounting filesystem read-only\n");
> >>>>> +       sb->s_flags |= SB_RDONLY;
> >>>>> +
> >>>>>           return err;
> >>>>>    }
> >>>>>
> >>>>>
> >>>> For readonly approach is also my first thought it should be, but I was
> >>>> just not very sure whether it would be the best approach.
> >>>>
> >>>> Because by evicting the kclient we could prevent the buffer to be wrote
> >>>> to OSDs. But the readonly one seems won't ?
> >>> The read-only setting is more for the VFS and the user.  Technically,
> >>> the kernel client could just stop issuing writes (i.e. OSD requests
> >>> containing a write op) and not set SB_RDONLY.  That should cover any
> >>> buffered data as well.
> >>   From reading the local ext4 and other fs, they are all doing it this way
> >> and the VFS will help stop further writing. Tested the above patch and
> >> it worked as expected.
> >>
> >> I think to stop the following OSD requests we can just check the
> >> SB_RDONLY flag to prevent the buffer writeback.
> >>
> >>> By employing self-blocklisting, you are shifting the responsibility
> >>> of rejecting OSD requests to the OSDs.  I'm saying that not issuing
> >>> OSD requests from a potentially busted client in the first place is
> >>> probably a better idea.  At the very least you wouldn't need to BUG
> >>> on ceph_monc_blocklist_add() errors.
> >> I found an issue for the read-only approach:
> >>
> >> In read-only mode it still can access to the MDSs and OSDs, which will
> >> continue trying to update the snap realms with the corrupted snap trace
> >> as before when reading. What if users try to read or backup the
> >> snapshots by using the corrupted snap realms ?
> >>
> >> Isn't that a problem ?
> > Yeh - this might end up in more problems in various places than what
> > this change is supposed to handle.
> >
> > Maybe we could track affected realms (although, not that granular) and
> > disallow reads to them (and its children), but I think it might not be
> > worth putting in the effort.
>
> IMO this doesn't make much sense.
>
> When reading we need to use the inodes to get the corresponding realms,
> once the metadatas are corrupted and aborted here the inodes'
> corresponding realms could be incorrect. That's because when updating
> the snap realms here it's possible that the snap realm hierarchy will be
> adjusted and some inodes' realms will be changed.

My point was that this only works if we can identify the affected realms
correctly, which seems risky given the corrupted info received. It seems
like this needs to be an all-or-none approach.
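
To illustrate the per-realm tracking discussed above, here is a minimal
userspace sketch, not kernel code. The struct realm type and the
realm_usable() helper are hypothetical stand-ins and are not the kernel's
struct ceph_snap_realm API. It assumes that marking one realm invalid
implicitly covers every realm below it in the hierarchy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified snap realm: only what the sketch needs. */
struct realm {
	unsigned long long ino;	/* realm id */
	struct realm *parent;	/* NULL for the root realm */
	bool invalid;		/* set when the realm cannot be trusted */
};

/*
 * A realm is usable only if neither it nor any of its ancestors has been
 * marked invalid, so one mark covers the whole subtree below it.
 */
bool realm_usable(const struct realm *r)
{
	assert(r != NULL);
	for (; r != NULL; r = r->parent) {
		if (r->invalid)
			return false;
	}
	return true;
}
```

With a chain root -> a -> b and a sibling c under root, marking a invalid
makes a and b unusable while root and c stay usable. This is only safe if
the mark lands on the right realm, which is exactly what a corrupted trace
cannot guarantee.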

>
> Thanks
>
> - Xiubo
>
> >> Thanks
> >>
> >> - Xiubo
> >>
> >>> Thanks,
> >>>
> >>>                   Ilya
> >>>
> >
>
Xiubo Li Dec. 9, 2022, 8:07 a.m. UTC | #12
On 09/12/2022 15:15, Venky Shankar wrote:
> On Fri, Dec 9, 2022 at 12:29 PM Xiubo Li <xiubli@redhat.com> wrote:
>>
>> On 09/12/2022 14:14, Venky Shankar wrote:
>>> On Thu, Dec 8, 2022 at 6:10 AM Xiubo Li <xiubli@redhat.com> wrote:
>>>> On 07/12/2022 22:20, Ilya Dryomov wrote:
>>>>> On Wed, Dec 7, 2022 at 2:31 PM Xiubo Li <xiubli@redhat.com> wrote:
>>>>>> On 07/12/2022 21:19, Xiubo Li wrote:
>>>>>>> On 07/12/2022 18:59, Ilya Dryomov wrote:
>>>>>>>> On Tue, Dec 6, 2022 at 1:59 PM <xiubli@redhat.com> wrote:
>>>>>>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>>>>>>
>>>>>>>>> When received corrupted snap trace we don't know what exactly has
>>>>>>>>> happened in MDS side. And we shouldn't continue writing to OSD,
>>>>>>>>> which may corrupt the snapshot contents.
>>>>>>>>>
>>>>>>>>> Just try to blocklist this client and If fails we need to crash the
>>>>>>>>> client instead of leaving it writeable to OSDs.
>>>>>>>>>
>>>>>>>>> Cc: stable@vger.kernel.org
>>>>>>>>> URL: https://tracker.ceph.com/issues/57686
>>>>>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> Thanks Aaron's feedback.
>>>>>>>>>
>>>>>>>>> V3:
>>>>>>>>> - Fixed ERROR: spaces required around that ':' (ctx:VxW)
>>>>>>>>>
>>>>>>>>> V2:
>>>>>>>>> - Switched to WARN() to taint the Linux kernel.
>>>>>>>>>
>>>>>>>>>      fs/ceph/mds_client.c |  3 ++-
>>>>>>>>>      fs/ceph/mds_client.h |  1 +
>>>>>>>>>      fs/ceph/snap.c       | 25 +++++++++++++++++++++++++
>>>>>>>>>      3 files changed, 28 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>>>>>>>> index cbbaf334b6b8..59094944af28 100644
>>>>>>>>> --- a/fs/ceph/mds_client.c
>>>>>>>>> +++ b/fs/ceph/mds_client.c
>>>>>>>>> @@ -5648,7 +5648,8 @@ static void mds_peer_reset(struct
>>>>>>>>> ceph_connection *con)
>>>>>>>>>             struct ceph_mds_client *mdsc = s->s_mdsc;
>>>>>>>>>
>>>>>>>>>             pr_warn("mds%d closed our session\n", s->s_mds);
>>>>>>>>> -       send_mds_reconnect(mdsc, s);
>>>>>>>>> +       if (!mdsc->no_reconnect)
>>>>>>>>> +               send_mds_reconnect(mdsc, s);
>>>>>>>>>      }
>>>>>>>>>
>>>>>>>>>      static void mds_dispatch(struct ceph_connection *con, struct
>>>>>>>>> ceph_msg *msg)
>>>>>>>>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
>>>>>>>>> index 728b7d72bf76..8e8f0447c0ad 100644
>>>>>>>>> --- a/fs/ceph/mds_client.h
>>>>>>>>> +++ b/fs/ceph/mds_client.h
>>>>>>>>> @@ -413,6 +413,7 @@ struct ceph_mds_client {
>>>>>>>>>             atomic_t                num_sessions;
>>>>>>>>>             int                     max_sessions;  /* len of sessions
>>>>>>>>> array */
>>>>>>>>>             int                     stopping;      /* true if shutting
>>>>>>>>> down */
>>>>>>>>> +       int                     no_reconnect;  /* true if snap trace
>>>>>>>>> is corrupted */
>>>>>>>>>
>>>>>>>>>             atomic64_t              quotarealms_count; /* # realms with
>>>>>>>>> quota */
>>>>>>>>>             /*
>>>>>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>>>>>>> index c1c452afa84d..023852b7c527 100644
>>>>>>>>> --- a/fs/ceph/snap.c
>>>>>>>>> +++ b/fs/ceph/snap.c
>>>>>>>>> @@ -767,8 +767,10 @@ int ceph_update_snap_trace(struct
>>>>>>>>> ceph_mds_client *mdsc,
>>>>>>>>>             struct ceph_snap_realm *realm;
>>>>>>>>>             struct ceph_snap_realm *first_realm = NULL;
>>>>>>>>>             struct ceph_snap_realm *realm_to_rebuild = NULL;
>>>>>>>>> +       struct ceph_client *client = mdsc->fsc->client;
>>>>>>>>>             int rebuild_snapcs;
>>>>>>>>>             int err = -ENOMEM;
>>>>>>>>> +       int ret;
>>>>>>>>>             LIST_HEAD(dirty_realms);
>>>>>>>>>
>>>>>>>>>             lockdep_assert_held_write(&mdsc->snap_rwsem);
>>>>>>>>> @@ -885,6 +887,29 @@ int ceph_update_snap_trace(struct
>>>>>>>>> ceph_mds_client *mdsc,
>>>>>>>>>             if (first_realm)
>>>>>>>>>                     ceph_put_snap_realm(mdsc, first_realm);
>>>>>>>>>             pr_err("%s error %d\n", __func__, err);
>>>>>>>>> +
>>>>>>>>> +       /*
>>>>>>>>> +        * When receiving a corrupted snap trace we don't know what
>>>>>>>>> +        * exactly has happened in MDS side. And we shouldn't continue
>>>>>>>>> +        * writing to OSD, which may corrupt the snapshot contents.
>>>>>>>>> +        *
>>>>>>>>> +        * Just try to blocklist this kclient and if it fails we need
>>>>>>>>> +        * to crash the kclient instead of leaving it writeable.
>>>>>>>> Hi Xiubo,
>>>>>>>>
>>>>>>>> I'm not sure I understand this "let's blocklist ourselves" concept.
>>>>>>>> If the kernel client shouldn't continue writing to OSDs in this case,
>>>>>>>> why not just stop issuing writes -- perhaps initiating some equivalent
>>>>>>>> of a read-only remount like many local filesystems would do on I/O
>>>>>>>> errors (e.g. errors=remount-ro mode)?
>>>>>>> The following patch seems working. Let me do more test to make sure
>>>>>>> there is not further crash.
>>>>>>>
>>>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>>>>> index c1c452afa84d..cd487f8a4cb5 100644
>>>>>>> --- a/fs/ceph/snap.c
>>>>>>> +++ b/fs/ceph/snap.c
>>>>>>> @@ -767,6 +767,7 @@ int ceph_update_snap_trace(struct ceph_mds_client
>>>>>>> *mdsc,
>>>>>>>            struct ceph_snap_realm *realm;
>>>>>>>            struct ceph_snap_realm *first_realm = NULL;
>>>>>>>            struct ceph_snap_realm *realm_to_rebuild = NULL;
>>>>>>> +       struct super_block *sb = mdsc->fsc->sb;
>>>>>>>            int rebuild_snapcs;
>>>>>>>            int err = -ENOMEM;
>>>>>>>            LIST_HEAD(dirty_realms);
>>>>>>> @@ -885,6 +886,9 @@ int ceph_update_snap_trace(struct ceph_mds_client
>>>>>>> *mdsc,
>>>>>>>            if (first_realm)
>>>>>>>                    ceph_put_snap_realm(mdsc, first_realm);
>>>>>>>            pr_err("%s error %d\n", __func__, err);
>>>>>>> +       pr_err("Remounting filesystem read-only\n");
>>>>>>> +       sb->s_flags |= SB_RDONLY;
>>>>>>> +
>>>>>>>            return err;
>>>>>>>     }
>>>>>>>
>>>>>>>
>>>>>> For readonly approach is also my first thought it should be, but I was
>>>>>> just not very sure whether it would be the best approach.
>>>>>>
>>>>>> Because by evicting the kclient we could prevent the buffer to be wrote
>>>>>> to OSDs. But the readonly one seems won't ?
>>>>> The read-only setting is more for the VFS and the user.  Technically,
>>>>> the kernel client could just stop issuing writes (i.e. OSD requests
>>>>> containing a write op) and not set SB_RDONLY.  That should cover any
>>>>> buffered data as well.
>>>>    From reading the local ext4 and other fs, they are all doing it this way
>>>> and the VFS will help stop further writing. Tested the above patch and
>>>> it worked as expected.
>>>>
>>>> I think to stop the following OSD requests we can just check the
>>>> SB_RDONLY flag to prevent the buffer writeback.
>>>>
>>>>> By employing self-blocklisting, you are shifting the responsibility
>>>>> of rejecting OSD requests to the OSDs.  I'm saying that not issuing
>>>>> OSD requests from a potentially busted client in the first place is
>>>>> probably a better idea.  At the very least you wouldn't need to BUG
>>>>> on ceph_monc_blocklist_add() errors.
>>>> I found an issue for the read-only approach:
>>>>
>>>> In read-only mode it still can access to the MDSs and OSDs, which will
>>>> continue trying to update the snap realms with the corrupted snap trace
>>>> as before when reading. What if users try to read or backup the
>>>> snapshots by using the corrupted snap realms ?
>>>>
>>>> Isn't that a problem ?
>>> Yeh - this might end up in more problems in various places than what
>>> this change is supposed to handle.
>>>
>>> Maybe we could track affected realms (although, not that granular) and
>>> disallow reads to them (and its children), but I think it might not be
>>> worth putting in the effort.
>> IMO this doesn't make much sense.
>>
>> When reading we need to use the inodes to get the corresponding realms,
>> once the metadatas are corrupted and aborted here the inodes'
>> corresponding realms could be incorrect. That's because when updating
>> the snap realms here it's possible that the snap realm hierarchy will be
>> adjusted and some inodes' realms will be changed.
> My point was if at all we could identify the realms correctly, which
> seems risky with the corrupted info received. Seems like this needs to
> be an all or none approach.

Yeah. Agree.

>
>> Thanks
>>
>> - Xiubo
>>
>>>> Thanks
>>>>
>>>> - Xiubo
>>>>
>>>>> Thanks,
>>>>>
>>>>>                    Ilya
>>>>>
>
Ilya Dryomov Dec. 9, 2022, 11:28 a.m. UTC | #13
On Thu, Dec 8, 2022 at 2:05 AM Xiubo Li <xiubli@redhat.com> wrote:
>
>
> On 07/12/2022 22:28, Ilya Dryomov wrote:
> > On Wed, Dec 7, 2022 at 1:35 PM Xiubo Li <xiubli@redhat.com> wrote:
> >>
> >> On 07/12/2022 18:59, Ilya Dryomov wrote:
> >>> On Tue, Dec 6, 2022 at 1:59 PM <xiubli@redhat.com> wrote:
> >>>> From: Xiubo Li <xiubli@redhat.com>
> >>>>
> >>>> When received corrupted snap trace we don't know what exactly has
> >>>> happened in MDS side. And we shouldn't continue writing to OSD,
> >>>> which may corrupt the snapshot contents.
> >>>>
> >>>> Just try to blocklist this client and If fails we need to crash the
> >>>> client instead of leaving it writeable to OSDs.
> >>>>
> >>>> Cc: stable@vger.kernel.org
> >>>> URL: https://tracker.ceph.com/issues/57686
> >>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> >>>> ---
> >>>>
> >>>> Thanks Aaron's feedback.
> >>>>
> >>>> V3:
> >>>> - Fixed ERROR: spaces required around that ':' (ctx:VxW)
> >>>>
> >>>> V2:
> >>>> - Switched to WARN() to taint the Linux kernel.
> >>>>
> >>>>    fs/ceph/mds_client.c |  3 ++-
> >>>>    fs/ceph/mds_client.h |  1 +
> >>>>    fs/ceph/snap.c       | 25 +++++++++++++++++++++++++
> >>>>    3 files changed, 28 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> >>>> index cbbaf334b6b8..59094944af28 100644
> >>>> --- a/fs/ceph/mds_client.c
> >>>> +++ b/fs/ceph/mds_client.c
> >>>> @@ -5648,7 +5648,8 @@ static void mds_peer_reset(struct ceph_connection *con)
> >>>>           struct ceph_mds_client *mdsc = s->s_mdsc;
> >>>>
> >>>>           pr_warn("mds%d closed our session\n", s->s_mds);
> >>>> -       send_mds_reconnect(mdsc, s);
> >>>> +       if (!mdsc->no_reconnect)
> >>>> +               send_mds_reconnect(mdsc, s);
> >>>>    }
> >>>>
> >>>>    static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
> >>>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> >>>> index 728b7d72bf76..8e8f0447c0ad 100644
> >>>> --- a/fs/ceph/mds_client.h
> >>>> +++ b/fs/ceph/mds_client.h
> >>>> @@ -413,6 +413,7 @@ struct ceph_mds_client {
> >>>>           atomic_t                num_sessions;
> >>>>           int                     max_sessions;  /* len of sessions array */
> >>>>           int                     stopping;      /* true if shutting down */
> >>>> +       int                     no_reconnect;  /* true if snap trace is corrupted */
> >>>>
> >>>>           atomic64_t              quotarealms_count; /* # realms with quota */
> >>>>           /*
> >>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> >>>> index c1c452afa84d..023852b7c527 100644
> >>>> --- a/fs/ceph/snap.c
> >>>> +++ b/fs/ceph/snap.c
> >>>> @@ -767,8 +767,10 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
> >>>>           struct ceph_snap_realm *realm;
> >>>>           struct ceph_snap_realm *first_realm = NULL;
> >>>>           struct ceph_snap_realm *realm_to_rebuild = NULL;
> >>>> +       struct ceph_client *client = mdsc->fsc->client;
> >>>>           int rebuild_snapcs;
> >>>>           int err = -ENOMEM;
> >>>> +       int ret;
> >>>>           LIST_HEAD(dirty_realms);
> >>>>
> >>>>           lockdep_assert_held_write(&mdsc->snap_rwsem);
> >>>> @@ -885,6 +887,29 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
> >>>>           if (first_realm)
> >>>>                   ceph_put_snap_realm(mdsc, first_realm);
> >>>>           pr_err("%s error %d\n", __func__, err);
> >>>> +
> >>>> +       /*
> >>>> +        * When receiving a corrupted snap trace we don't know what
> >>>> +        * exactly has happened in MDS side. And we shouldn't continue
> >>>> +        * writing to OSD, which may corrupt the snapshot contents.
> >>>> +        *
> >>>> +        * Just try to blocklist this kclient and if it fails we need
> >>>> +        * to crash the kclient instead of leaving it writeable.
> >>> Hi Xiubo,
> >>>
> >>> I'm not sure I understand this "let's blocklist ourselves" concept.
> >>> If the kernel client shouldn't continue writing to OSDs in this case,
> >>> why not just stop issuing writes -- perhaps initiating some equivalent
> >>> of a read-only remount like many local filesystems would do on I/O
> >>> errors (e.g. errors=remount-ro mode)?
> >> I still haven't found how could I handle it this way from ceph layer. I
> >> saw they are just marking the inodes as EIO when this happens.
> >>
> >>> Or, perhaps, all in-memory snap contexts could somehow be invalidated
> >>> in this case, making writes fail naturally -- on the client side,
> >>> without actually being sent to OSDs just to be nixed by the blocklist
> >>> hammer.
> >>>
> >>> But further, what makes a failure to decode a snap trace special?
> >>   From the known tracker the snapid was corrupted in one inode in MDS and
> >> then when trying to build the snap trace with the corrupted snapid it
> >> will corrupt.
> >>
> >> And also there maybe other cases.
> >>
> >>> AFAIK we don't do anything close to this for any other decoding
> >>> failure.  Wouldn't "when received corrupted XYZ we don't know what
> >>> exactly has happened in MDS side" argument apply to pretty much all
> >>> decoding failures?
> >> The snap trace is different from other cases. The corrupted snap trace
> >> will affect the whole snap realm hierarchy, which will affect the whole
> >> inodes in the mount in worst case.
> >>
> >> This is why I was trying to evict the mount to prevent further IOs.
> > I suspected as much and my other suggestion was to look at somehow
> > invalidating snap contexts/realms.  Perhaps decode out-of-place and on
> > any error set a flag indicating that the snap context can't be trusted
> > anymore?  The OSD client could then check whether this flag is set
> > before admitting the snap context blob into the request message and
> > return an error, effectively rejecting the write.
>
> The snap realms are organized as a tree-like hierarchy. When the snap trace
> is corrupted, maybe only one of the snap realms is affected, and maybe
> several or all. The problem is that when decoding the corrupted snap trace
> we can't know exactly which realms will be affected. If one realm is
> marked as invalid, all of its child realms are affected too.

I realize that ;)  My suggestion (quoted above) was to look into
invalidating all snap realms, not a particular one:

    Or, perhaps, all in-memory snap contexts could somehow be invalidated
    in this case, making writes fail naturally -- on the client side,
    without actually being sent to OSDs just to be nixed by the blocklist
    hammer.

Thanks,

                Ilya
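
The suggestion above, invalidating all in-memory snap contexts so that
writes fail on the client side, could look roughly like the following
userspace sketch. It is an illustration under stated assumptions, not the
kernel implementation: struct snap_state, snap_state_poison() and
admit_request() are hypothetical names, and the choice of -EIO as the error
is also an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-mount state; the field name is an assumption. */
struct snap_state {
	bool snapc_untrusted;	/* set once a snap trace fails to decode */
};

enum op_type { OP_READ, OP_WRITE };

/* Would be called from the snap trace decode error path. */
void snap_state_poison(struct snap_state *st)
{
	assert(st != NULL);
	st->snapc_untrusted = true;
}

/*
 * Gate for outgoing requests. Returns 0 if the request may be sent and
 * -EIO if it must be rejected locally. Only writes carry a snap context,
 * so only writes are refused once the state is poisoned.
 */
int admit_request(const struct snap_state *st, enum op_type op)
{
	assert(st != NULL);
	if (op == OP_WRITE && st->snapc_untrusted)
		return -EIO;
	return 0;
}
```

The flag is mount-wide rather than per-realm, so no decision depends on
which realms the corrupted trace actually touched. Reads still pass in this
sketch; whether they should is the open question raised earlier in the
thread.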

Patch

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index cbbaf334b6b8..59094944af28 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -5648,7 +5648,8 @@  static void mds_peer_reset(struct ceph_connection *con)
 	struct ceph_mds_client *mdsc = s->s_mdsc;
 
 	pr_warn("mds%d closed our session\n", s->s_mds);
-	send_mds_reconnect(mdsc, s);
+	if (!mdsc->no_reconnect)
+		send_mds_reconnect(mdsc, s);
 }
 
 static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 728b7d72bf76..8e8f0447c0ad 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -413,6 +413,7 @@  struct ceph_mds_client {
 	atomic_t		num_sessions;
 	int                     max_sessions;  /* len of sessions array */
 	int                     stopping;      /* true if shutting down */
+	int                     no_reconnect;  /* true if snap trace is corrupted */
 
 	atomic64_t		quotarealms_count; /* # realms with quota */
 	/*
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index c1c452afa84d..023852b7c527 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -767,8 +767,10 @@  int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
 	struct ceph_snap_realm *realm;
 	struct ceph_snap_realm *first_realm = NULL;
 	struct ceph_snap_realm *realm_to_rebuild = NULL;
+	struct ceph_client *client = mdsc->fsc->client;
 	int rebuild_snapcs;
 	int err = -ENOMEM;
+	int ret;
 	LIST_HEAD(dirty_realms);
 
 	lockdep_assert_held_write(&mdsc->snap_rwsem);
@@ -885,6 +887,29 @@  int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
 	if (first_realm)
 		ceph_put_snap_realm(mdsc, first_realm);
 	pr_err("%s error %d\n", __func__, err);
+
+	/*
+	 * When receiving a corrupted snap trace we don't know what
+	 * exactly has happened on the MDS side. And we shouldn't continue
+	 * writing to the OSDs, which may corrupt the snapshot contents.
+	 *
+	 * Just try to blocklist this kclient and if that fails we need
+	 * to crash the kclient instead of leaving it writeable.
+	 *
+	 * Then this kclient must be remounted to continue after the
+	 * corrupted metadata is fixed on the MDS side.
+	 */
+	mdsc->no_reconnect = 1;
+	ret = ceph_monc_blocklist_add(&client->monc, &client->msgr.inst.addr);
+	if (ret) {
+		pr_err("%s blocklist of %s failed: %d\n", __func__,
+		       ceph_pr_addr(&client->msgr.inst.addr), ret);
+		BUG();
+	}
+	WARN(1, "%s %s was blocklisted, do remount to continue%s\n",
+	     __func__, ceph_pr_addr(&client->msgr.inst.addr),
+	     err == -EIO ? " after corrupted snaptrace fixed" : "");
+
 	return err;
 }
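
The control flow this patch adds can be modelled in userspace as follows.
This is a simplified sketch for reasoning about the ordering, not the kernel
code: struct mds_client_model, the blocklist_fn callback and
blocklist_always_ok() are hypothetical stand-ins for struct ceph_mds_client
and ceph_monc_blocklist_add(), and abort() stands in for BUG().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical model of the two pieces of state the patch touches. */
struct mds_client_model {
	bool no_reconnect;	/* mirrors mdsc->no_reconnect */
	int reconnects_sent;	/* counts modelled send_mds_reconnect() calls */
};

/* Stand-in for the blocklist call: returns 0 on success. */
typedef int (*blocklist_fn)(void);

/* Stub that always succeeds, used to exercise the non-crashing path. */
int blocklist_always_ok(void)
{
	return 0;
}

/* Models the error path added to ceph_update_snap_trace(). */
void handle_corrupted_snap_trace(struct mds_client_model *mdsc,
				 blocklist_fn blocklist_self)
{
	int ret;

	/* Set before blocklisting so a session reset cannot reconnect. */
	mdsc->no_reconnect = true;

	ret = blocklist_self();
	if (ret) {
		fprintf(stderr, "blocklist failed: %d\n", ret);
		abort();	/* models BUG() */
	}
	fprintf(stderr, "client was blocklisted, remount to continue\n");
}

/* Models mds_peer_reset(): reconnect only while the client is trusted. */
void peer_reset(struct mds_client_model *mdsc)
{
	if (!mdsc->no_reconnect)
		mdsc->reconnects_sent++;
}
```

In this model a peer reset before the corrupted trace triggers a reconnect,
while one after it does not, which is the behaviour the no_reconnect flag is
meant to provide once the MDS drops the blocklisted session.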