mbox series

[0/2] block: scsi: introduce and use BLK_STS_OFFLINE

Message ID 20220203064009.1795344-1-song@kernel.org
Headers show
Series block: scsi: introduce and use BLK_STS_OFFLINE | expand

Message

Song Liu Feb. 3, 2022, 6:40 a.m. UTC
We have a use case where HDDs are regularly power on/off to perserve power.
When a drive is being removed, we often see errors like

   [  172.803279] I/O error, dev sda, sector 3137184

These messages are confusing for automations that grep dmesg, as they look
very similar to real HDD error.

Solve this issue with a new block state BLK_STS_OFFLINE. After the change,
the error message looks like

   [  172.803279] device offline error, dev sda, sector 3137184

so that the automations won't confuse them with real I/O error.

Song Liu (2):
  block: introduce BLK_STS_OFFLINE
  scsi: use BLK_STS_OFFLINE for not fully online devices

 block/blk-core.c          | 1 +
 drivers/scsi/scsi_lib.c   | 2 +-
 include/linux/blk_types.h | 7 +++++++
 3 files changed, 9 insertions(+), 1 deletion(-)

--
2.30.2

Comments

Song Liu Feb. 3, 2022, 6:52 a.m. UTC | #1
CC linux-block (it was a typo in the original email)

On Wed, Feb 2, 2022 at 10:40 PM Song Liu <song@kernel.org> wrote:
>
> We have a use case where HDDs are regularly power on/off to perserve power.
> When a drive is being removed, we often see errors like
>
>    [  172.803279] I/O error, dev sda, sector 3137184
>
> These messages are confusing for automations that grep dmesg, as they look
> very similar to real HDD error.
>
> Solve this issue with a new block state BLK_STS_OFFLINE. After the change,
> the error message looks like
>
>    [  172.803279] device offline error, dev sda, sector 3137184
>
> so that the automations won't confuse them with real I/O error.
>
> Song Liu (2):
>   block: introduce BLK_STS_OFFLINE
>   scsi: use BLK_STS_OFFLINE for not fully online devices
>
>  block/blk-core.c          | 1 +
>  drivers/scsi/scsi_lib.c   | 2 +-
>  include/linux/blk_types.h | 7 +++++++
>  3 files changed, 9 insertions(+), 1 deletion(-)
>
> --
> 2.30.2
Jens Axboe Feb. 3, 2022, 1:47 p.m. UTC | #2
On 2/3/22 12:24 AM, Hannes Reinecke wrote:
> On 2/3/22 07:52, Song Liu wrote:
>> CC linux-block (it was a typo in the original email)
>>
>> On Wed, Feb 2, 2022 at 10:40 PM Song Liu <song@kernel.org> wrote:
>>>
>>> Currently, drivers reports BLK_STS_IOERR for devices that are not full
>>> online or being removed. This behavior could cause confusion for users,
>>> as they are not really I/O errors from the device.
>>>
>>> Solve this issue with a new state BLK_STS_OFFLINE, which reports "device
>>> offline error" in dmesg instead of "I/O error".
>>>
>>> Signed-off-by: Song Liu <song@kernel.org>
>>> ---
>>>   block/blk-core.c          | 1 +
>>>   include/linux/blk_types.h | 7 +++++++
>>>   2 files changed, 8 insertions(+)
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 61f6a0dc4511..24035dd2eef1 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -164,6 +164,7 @@ static const struct {
>>>          [BLK_STS_RESOURCE]      = { -ENOMEM,    "kernel resource" },
>>>          [BLK_STS_DEV_RESOURCE]  = { -EBUSY,     "device resource" },
>>>          [BLK_STS_AGAIN]         = { -EAGAIN,    "nonblocking retry" },
>>> +       [BLK_STS_OFFLINE]       = { -EIO,       "device offline" },
>>>
>>>          /* device mapper special case, should not leak out: */
>>>          [BLK_STS_DM_REQUEUE]    = { -EREMCHG, "dm internal retry" },
>>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>>> index fe065c394fff..5561e58d158a 100644
>>> --- a/include/linux/blk_types.h
>>> +++ b/include/linux/blk_types.h
>>> @@ -153,6 +153,13 @@ typedef u8 __bitwise blk_status_t;
>>>    */
>>>   #define BLK_STS_ZONE_ACTIVE_RESOURCE   ((__force blk_status_t)16)
>>>
>>> +/*
>>> + * BLK_STS_OFFLINE is returned from the driver when the target device is offline
>>> + * or is being taken offline. This could help differentiate the case where a
>>> + * device is intentionally being shut down from a real I/O error.
>>> + */
>>> +#define BLK_STS_OFFLINE                ((__force blk_status_t)17)
>>> +
>>>   /**
>>>    * blk_path_error - returns true if error may be path related
>>>    * @error: status the request was completed with
>>> --
>>> 2.30.2
>>>
> Please do not overload EIO here.
> EIO already is a catch-all error if we don't know any better, but for 
> the 'device offline' case we do (or rather should).
> Please map it onto 'ENODEV' or 'ENXIO'.

It's deliberately EIO as not to force a change in behavior. I don't mind
using something else, but that should be a separate change then.
Song Liu Feb. 3, 2022, 5:23 p.m. UTC | #3
Hi Hannes and Jens,

> On Feb 3, 2022, at 5:47 AM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 2/3/22 12:24 AM, Hannes Reinecke wrote:
>> On 2/3/22 07:52, Song Liu wrote:
>>> CC linux-block (it was a typo in the original email)
>>> 
>>> On Wed, Feb 2, 2022 at 10:40 PM Song Liu <song@kernel.org> wrote:
>>>> 
>>>> Currently, drivers reports BLK_STS_IOERR for devices that are not full
>>>> online or being removed. This behavior could cause confusion for users,
>>>> as they are not really I/O errors from the device.
>>>> 
>>>> Solve this issue with a new state BLK_STS_OFFLINE, which reports "device
>>>> offline error" in dmesg instead of "I/O error".
>>>> 
>>>> Signed-off-by: Song Liu <song@kernel.org>
>>>> ---
>>>>  block/blk-core.c          | 1 +
>>>>  include/linux/blk_types.h | 7 +++++++
>>>>  2 files changed, 8 insertions(+)
>>>> 
>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>> index 61f6a0dc4511..24035dd2eef1 100644
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -164,6 +164,7 @@ static const struct {
>>>>         [BLK_STS_RESOURCE]      = { -ENOMEM,    "kernel resource" },
>>>>         [BLK_STS_DEV_RESOURCE]  = { -EBUSY,     "device resource" },
>>>>         [BLK_STS_AGAIN]         = { -EAGAIN,    "nonblocking retry" },
>>>> +       [BLK_STS_OFFLINE]       = { -EIO,       "device offline" },
>>>> 
>>>>         /* device mapper special case, should not leak out: */
>>>>         [BLK_STS_DM_REQUEUE]    = { -EREMCHG, "dm internal retry" },
>>>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>>>> index fe065c394fff..5561e58d158a 100644
>>>> --- a/include/linux/blk_types.h
>>>> +++ b/include/linux/blk_types.h
>>>> @@ -153,6 +153,13 @@ typedef u8 __bitwise blk_status_t;
>>>>   */
>>>>  #define BLK_STS_ZONE_ACTIVE_RESOURCE   ((__force blk_status_t)16)
>>>> 
>>>> +/*
>>>> + * BLK_STS_OFFLINE is returned from the driver when the target device is offline
>>>> + * or is being taken offline. This could help differentiate the case where a
>>>> + * device is intentionally being shut down from a real I/O error.
>>>> + */
>>>> +#define BLK_STS_OFFLINE                ((__force blk_status_t)17)
>>>> +
>>>>  /**
>>>>   * blk_path_error - returns true if error may be path related
>>>>   * @error: status the request was completed with
>>>> --
>>>> 2.30.2
>>>> 
>> Please do not overload EIO here.
>> EIO already is a catch-all error if we don't know any better, but for 
>> the 'device offline' case we do (or rather should).
>> Please map it onto 'ENODEV' or 'ENXIO'.
> 
> It's deliberately EIO as not to force a change in behavior. I don't mind
> using something else, but that should be a separate change then.

Thanks for these feedbacks. Shall I send v2 with an extra patch that 
changes EIO to ENODEV/ENXIO? Or shall we do that in a follow up patch? 
Also, any preference between ENODEV and ENXIO? 

Thanks,
Song
Jens Axboe Feb. 3, 2022, 6:51 p.m. UTC | #4
On 2/3/22 10:23 AM, Song Liu wrote:
> Hi Hannes and Jens,
> 
>> On Feb 3, 2022, at 5:47 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 2/3/22 12:24 AM, Hannes Reinecke wrote:
>>> On 2/3/22 07:52, Song Liu wrote:
>>>> CC linux-block (it was a typo in the original email)
>>>>
>>>> On Wed, Feb 2, 2022 at 10:40 PM Song Liu <song@kernel.org> wrote:
>>>>>
>>>>> Currently, drivers reports BLK_STS_IOERR for devices that are not full
>>>>> online or being removed. This behavior could cause confusion for users,
>>>>> as they are not really I/O errors from the device.
>>>>>
>>>>> Solve this issue with a new state BLK_STS_OFFLINE, which reports "device
>>>>> offline error" in dmesg instead of "I/O error".
>>>>>
>>>>> Signed-off-by: Song Liu <song@kernel.org>
>>>>> ---
>>>>>  block/blk-core.c          | 1 +
>>>>>  include/linux/blk_types.h | 7 +++++++
>>>>>  2 files changed, 8 insertions(+)
>>>>>
>>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>>> index 61f6a0dc4511..24035dd2eef1 100644
>>>>> --- a/block/blk-core.c
>>>>> +++ b/block/blk-core.c
>>>>> @@ -164,6 +164,7 @@ static const struct {
>>>>>         [BLK_STS_RESOURCE]      = { -ENOMEM,    "kernel resource" },
>>>>>         [BLK_STS_DEV_RESOURCE]  = { -EBUSY,     "device resource" },
>>>>>         [BLK_STS_AGAIN]         = { -EAGAIN,    "nonblocking retry" },
>>>>> +       [BLK_STS_OFFLINE]       = { -EIO,       "device offline" },
>>>>>
>>>>>         /* device mapper special case, should not leak out: */
>>>>>         [BLK_STS_DM_REQUEUE]    = { -EREMCHG, "dm internal retry" },
>>>>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>>>>> index fe065c394fff..5561e58d158a 100644
>>>>> --- a/include/linux/blk_types.h
>>>>> +++ b/include/linux/blk_types.h
>>>>> @@ -153,6 +153,13 @@ typedef u8 __bitwise blk_status_t;
>>>>>   */
>>>>>  #define BLK_STS_ZONE_ACTIVE_RESOURCE   ((__force blk_status_t)16)
>>>>>
>>>>> +/*
>>>>> + * BLK_STS_OFFLINE is returned from the driver when the target device is offline
>>>>> + * or is being taken offline. This could help differentiate the case where a
>>>>> + * device is intentionally being shut down from a real I/O error.
>>>>> + */
>>>>> +#define BLK_STS_OFFLINE                ((__force blk_status_t)17)
>>>>> +
>>>>>  /**
>>>>>   * blk_path_error - returns true if error may be path related
>>>>>   * @error: status the request was completed with
>>>>> --
>>>>> 2.30.2
>>>>>
>>> Please do not overload EIO here.
>>> EIO already is a catch-all error if we don't know any better, but for 
>>> the 'device offline' case we do (or rather should).
>>> Please map it onto 'ENODEV' or 'ENXIO'.
>>
>> It's deliberately EIO as not to force a change in behavior. I don't mind
>> using something else, but that should be a separate change then.
> 
> Thanks for these feedbacks. Shall I send v2 with an extra patch that 
> changes EIO to ENODEV/ENXIO? Or shall we do that in a follow up patch? 
> Also, any preference between ENODEV and ENXIO? 

Yeah I think so, and perhaps put a mention in this patch on why EIO is
chosen to not change the user visible return value.
Hannes Reinecke Feb. 4, 2022, 7:14 a.m. UTC | #5
On 2/3/22 18:23, Song Liu wrote:
> Hi Hannes and Jens,
> 
>> On Feb 3, 2022, at 5:47 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 2/3/22 12:24 AM, Hannes Reinecke wrote:
>>> On 2/3/22 07:52, Song Liu wrote:
>>>> CC linux-block (it was a typo in the original email)
>>>>
>>>> On Wed, Feb 2, 2022 at 10:40 PM Song Liu <song@kernel.org> wrote:
>>>>>
>>>>> Currently, drivers reports BLK_STS_IOERR for devices that are not full
>>>>> online or being removed. This behavior could cause confusion for users,
>>>>> as they are not really I/O errors from the device.
>>>>>
>>>>> Solve this issue with a new state BLK_STS_OFFLINE, which reports "device
>>>>> offline error" in dmesg instead of "I/O error".
>>>>>
>>>>> Signed-off-by: Song Liu <song@kernel.org>
>>>>> ---
>>>>>   block/blk-core.c          | 1 +
>>>>>   include/linux/blk_types.h | 7 +++++++
>>>>>   2 files changed, 8 insertions(+)
>>>>>
>>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>>> index 61f6a0dc4511..24035dd2eef1 100644
>>>>> --- a/block/blk-core.c
>>>>> +++ b/block/blk-core.c
>>>>> @@ -164,6 +164,7 @@ static const struct {
>>>>>          [BLK_STS_RESOURCE]      = { -ENOMEM,    "kernel resource" },
>>>>>          [BLK_STS_DEV_RESOURCE]  = { -EBUSY,     "device resource" },
>>>>>          [BLK_STS_AGAIN]         = { -EAGAIN,    "nonblocking retry" },
>>>>> +       [BLK_STS_OFFLINE]       = { -EIO,       "device offline" },
>>>>>
>>>>>          /* device mapper special case, should not leak out: */
>>>>>          [BLK_STS_DM_REQUEUE]    = { -EREMCHG, "dm internal retry" },
>>>>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>>>>> index fe065c394fff..5561e58d158a 100644
>>>>> --- a/include/linux/blk_types.h
>>>>> +++ b/include/linux/blk_types.h
>>>>> @@ -153,6 +153,13 @@ typedef u8 __bitwise blk_status_t;
>>>>>    */
>>>>>   #define BLK_STS_ZONE_ACTIVE_RESOURCE   ((__force blk_status_t)16)
>>>>>
>>>>> +/*
>>>>> + * BLK_STS_OFFLINE is returned from the driver when the target device is offline
>>>>> + * or is being taken offline. This could help differentiate the case where a
>>>>> + * device is intentionally being shut down from a real I/O error.
>>>>> + */
>>>>> +#define BLK_STS_OFFLINE                ((__force blk_status_t)17)
>>>>> +
>>>>>   /**
>>>>>    * blk_path_error - returns true if error may be path related
>>>>>    * @error: status the request was completed with
>>>>> --
>>>>> 2.30.2
>>>>>
>>> Please do not overload EIO here.
>>> EIO already is a catch-all error if we don't know any better, but for
>>> the 'device offline' case we do (or rather should).
>>> Please map it onto 'ENODEV' or 'ENXIO'.
>>
>> It's deliberately EIO as not to force a change in behavior. I don't mind
>> using something else, but that should be a separate change then.
> 
> Thanks for these feedbacks. Shall I send v2 with an extra patch that
> changes EIO to ENODEV/ENXIO? Or shall we do that in a follow up patch?
> Also, any preference between ENODEV and ENXIO?
> 
Please make it an addtional patch, and use ENODEV as a return value.
For this patch you can add:

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes