[1/1] : scsi dm-mpath do not fail paths which are in ALUA state transitioning

Message ID CAHZQxyKJ1qFatzhR-k19PXjAPo7eC0ZgwgaGKwfndB=jEO8mRQ@mail.gmail.com
State New
Series [1/1] : scsi dm-mpath do not fail paths which are in ALUA state transitioning

Commit Message

Brian Bunker July 8, 2021, 8:42 p.m. UTC
In a controller failover, do not fail paths that are transitioning, or
an unexpected I/O error will be returned when accessing a multipath device.

Consider this case: a two-controller array with paths coming from a
primary and a secondary controller. During any upgrade there will be a
transition from the secondary to the primary state.

1. In the beginning, all paths are active/optimized:
3624a9370602220bd9986439b00012c1e dm-12 PURE,FlashArray
size=3.0T features='0' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 13:0:0:11 sdch 69:80  active ready running
  |- 14:0:0:11 sdce 69:32  active ready running
  |- 15:0:0:11 sdcf 69:48  active ready running
  |- 1:0:0:11  sdci 69:96  active ready running
  |- 9:0:0:11  sdck 69:128 active ready running
  |- 10:0:0:11 sdcg 69:64  active ready running
  |- 11:0:0:11 sdcd 69:16  active ready running
  `- 12:0:0:11 sdcj 69:112 active ready running

CT0 paths - sdce, sdcf, sdcg, sdcd
CT1 paths - sdch, sdci, sdck, sdcj

2. Run I/O to the multipath device:
[root@init115-15 ~]# /opt/Purity/bin/bb/pureload -m initthreads=32 /dev/dm-12
Thu Jul  8 13:33:47 2021: /opt/Purity/bin/bb/pureload num_cpus = 64
Thu Jul  8 13:33:47 2021: /opt/Purity/bin/bb/pureload num numa nodes 2
Thu Jul  8 13:33:47 2021: /opt/Purity/bin/bb/pureload Starting test
with 32 threads

3. In an upgrade, the primary controller is failed and the secondary
controller transitions to primary. From an ALUA perspective, the
paths to the previous primary go to the ALUA unavailable state, while
the paths to the promoting primary move to the ALUA transitioning
state.

It is expected that 4 paths will fail:
Jul  8 13:33:58 init115-15 kernel: sd 14:0:0:11: [sdce] tag#1178 Add.
Sense: Logical unit not accessible, target port in unavailable state
Jul  8 13:33:58 init115-15 kernel: sd 15:0:0:11: [sdcf] tag#1374 Add.
Sense: Logical unit not accessible, target port in unavailable state
Jul  8 13:33:58 init115-15 kernel: sd 10:0:0:11: [sdcg] tag#600 Add.
Sense: Logical unit not accessible, target port in unavailable state
Jul  8 13:33:58 init115-15 kernel: sd 11:0:0:11: [sdcd] tag#1460 Add.
Sense: Logical unit not accessible, target port in unavailable state

Jul  8 13:33:58 init115-15 kernel: device-mapper: multipath: 253:12:
Failing path 69:64.
Jul  8 13:33:58 init115-15 kernel: device-mapper: multipath: 253:12:
Failing path 69:48.
Jul  8 13:33:58 init115-15 kernel: device-mapper: multipath: 253:12:
Failing path 69:16.
Jul  8 13:33:58 init115-15 kernel: device-mapper: multipath: 253:12:
Failing path 69:32.

Jul  8 13:33:58 init115-15 multipathd[46030]:
3624a9370602220bd9986439b00012c1e: remaining active paths: 7
Jul  8 13:33:58 init115-15 multipathd[46030]:
3624a9370602220bd9986439b00012c1e: remaining active paths: 6
Jul  8 13:33:59 init115-15 multipathd[46030]:
3624a9370602220bd9986439b00012c1e: remaining active paths: 5
Jul  8 13:33:59 init115-15 multipathd[46030]:
3624a9370602220bd9986439b00012c1e: remaining active paths: 4

4. It is not expected that the remaining 4 paths will also fail. This
did not happen until the change which introduced BLK_STS_AGAIN into
the SCSI ALUA device handler. With that change, new I/O which reaches
that handler on paths that are in the ALUA transitioning state will
result in those paths failing. Previous Linux versions, before that
change, do not return an I/O error back to the client application.
Similarly, this problem does not happen in other operating systems,
e.g. ESXi, Windows, AIX, etc.
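
For reference, a simplified sketch of the handler logic in question,
i.e. the shape of alua_prep_fn() after that change (abbreviated for
illustration; see drivers/scsi/device_handler/scsi_dh_alua.c for the
exact code):

	/* Sketch: new I/O submitted on a path whose port group is in the
	 * transitioning state is completed with BLK_STS_AGAIN instead of
	 * being dispatched to the device.
	 */
	switch (state) {
	case SCSI_ACCESS_STATE_OPTIMAL:
	case SCSI_ACCESS_STATE_ACTIVE:
	case SCSI_ACCESS_STATE_LBA:
		return BLK_STS_OK;
	case SCSI_ACCESS_STATE_TRANSITIONING:
		return BLK_STS_AGAIN;
	default:
		req->rq_flags |= RQF_QUIET;
		return BLK_STS_IOERR;
	}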

5. It is not expected that the paths to the promoting primary fail, yet they do:
Jul  8 13:33:59 init115-15 kernel: device-mapper: multipath: 253:12:
Failing path 69:96.
Jul  8 13:33:59 init115-15 kernel: device-mapper: multipath: 253:12:
Failing path 69:112.
Jul  8 13:33:59 init115-15 kernel: device-mapper: multipath: 253:12:
Failing path 69:80.
Jul  8 13:33:59 init115-15 kernel: device-mapper: multipath: 253:12:
Failing path 69:128.
Jul  8 13:33:59 init115-15 multipath[53813]: dm-12: no usable paths found
Jul  8 13:33:59 init115-15 multipath[53833]: dm-12: no usable paths found
Jul  8 13:33:59 init115-15 multipath[53853]: dm-12: no usable paths found

Jul  8 13:33:59 init115-15 multipathd[46030]:
3624a9370602220bd9986439b00012c1e: remaining active paths: 3
Jul  8 13:33:59 init115-15 multipathd[46030]:
3624a9370602220bd9986439b00012c1e: remaining active paths: 2
Jul  8 13:33:59 init115-15 multipathd[46030]:
3624a9370602220bd9986439b00012c1e: remaining active paths: 1
Jul  8 13:33:59 init115-15 multipathd[46030]:
3624a9370602220bd9986439b00012c1e: remaining active paths: 0

6. The error gets back to the user of the multipath device unexpectedly:
Thu Jul  8 13:33:59 2021: /opt/Purity/bin/bb/pureload I/O Error: io
43047 fd 36  op read  offset 00000028ef7a7000  size 4096  errno 11
rsize -1

The earlier patch I made for this was not desirable, so I am proposing
this much smaller patch, which similarly does not allow transitioning
paths to result in immediate failure.

Signed-off-by: Brian Bunker <brian@purestorage.com>
Acked-by: Krishna Kant <krishna.kant@purestorage.com>
Acked-by: Seamus Connor <sconnor@purestorage.com>

Comments

Martin Wilck July 12, 2021, 9:19 a.m. UTC | #1
Hello Brian,

On Do, 2021-07-08 at 13:42 -0700, Brian Bunker wrote:
> In a controller failover, do not fail paths that are transitioning, or
> an unexpected I/O error will be returned when accessing a multipath device.
>
> Consider this case: a two-controller array with paths coming from a
> primary and a secondary controller. During any upgrade there will be a
> transition from the secondary to the primary state.
>
> [...]
>
> 4. It is not expected that the remaining 4 paths will also fail. This
> did not happen until the change which introduced BLK_STS_AGAIN into
> the SCSI ALUA device handler. With that change, new I/O which reaches
> that handler on paths that are in the ALUA transitioning state will
> result in those paths failing. Previous Linux versions, before that
> change, do not return an I/O error back to the client application.
> Similarly, this problem does not happen in other operating systems,
> e.g. ESXi, Windows, AIX, etc.


Please confirm that your kernel included ee8868c5c78f ("scsi:
scsi_dh_alua: Retry RTPG on a different path after failure").
That commit should cause the RTPG to be retried on other map members
which are not in failed state, thus avoiding this phenomenon.


> [...]
>
> 6. The error gets back to the user of the multipath device
> unexpectedly:
> Thu Jul  8 13:33:59 2021: /opt/Purity/bin/bb/pureload I/O Error: io
> 43047 fd 36  op read  offset 00000028ef7a7000  size 4096  errno 11
> rsize -1
>
> The earlier patch I made for this was not desirable, so I am proposing
> this much smaller patch, which similarly does not allow transitioning
> paths to result in immediate failure.
>
> Signed-off-by: Brian Bunker <brian@purestorage.com>
> Acked-by: Krishna Kant <krishna.kant@purestorage.com>
> Acked-by: Seamus Connor <sconnor@purestorage.com>
>
> ____
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index bced42f082b0..d5d6be96068d 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -1657,7 +1657,7 @@ static int multipath_end_io(struct dm_target
> *ti, struct request *clone,
>                 else
>                         r = DM_ENDIO_REQUEUE;
>
> -               if (pgpath)
> +               if (pgpath && (error != BLK_STS_AGAIN))
>                         fail_path(pgpath);
>
>                 if (!atomic_read(&m->nr_valid_paths) &&

I doubt that this is correct. If you look at the commit msg of
268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA
transitioning state"):


 "When the ALUA state indicates transitioning we should not retry the command
 immediately, but rather complete the command with BLK_STS_AGAIN to signal
 the completion handler that it might be retried.  This allows multipathing
 to redirect the command to another path if possible, and avoid stalls
 during lengthy transitioning times."

The purpose of that patch was to set the state of the transitioning
path to failed in order to make sure IO is retried on a different path.
Your patch would undermine this purpose.

Regards
Martin
Brian Bunker July 12, 2021, 9:38 p.m. UTC | #2
Martin,

> Please confirm that your kernel included ee8868c5c78f ("scsi:
> scsi_dh_alua: Retry RTPG on a different path after failure").
> That commit should cause the RTPG to be retried on other map members
> which are not in failed state, thus avoiding this phenomenon.

In my case, there are no other map members that are not in the failed
state. One set of paths goes to the ALUA unavailable state when the
primary fails, and the second set of paths moves to ALUA state
transitioning as the previous secondary becomes the primary. If the
paths are failed which are transitioning, an all paths down state
happens which is not expected. There should be a time for which
transitioning is a transient state until the next state is entered.
Failing a path assuming there would be non-failed paths seems wrong.

> The purpose of that patch was to set the state of the transitioning
> path to failed in order to make sure IO is retried on a different path.
> Your patch would undermine this purpose.

I agree this is what happens, but those transitioning paths might be
the only non-failed paths available. I don't think it is reasonable to
fail them. This is the same as treating transitioning as standby or
unavailable. As you point out, this behavior arrived with the commit you
mention. Before that commit, what I am doing does not result in an all
paths down error, and likewise it does not in earlier Linux versions
or other OSes under the same conditions. I see this as a regression.

Thanks,
Brian

-- 
Brian Bunker
PURE Storage, Inc.
brian@purestorage.com
Martin Wilck July 13, 2021, 9:13 a.m. UTC | #3
Hello Brian,

On Mo, 2021-07-12 at 14:38 -0700, Brian Bunker wrote:
> Martin,
>
> > Please confirm that your kernel included ee8868c5c78f ("scsi:
> > scsi_dh_alua: Retry RTPG on a different path after failure").
> > That commit should cause the RTPG to be retried on other map members
> > which are not in failed state, thus avoiding this phenomenon.
>
> In my case, there are no other map members that are not in the failed
> state. One set of paths goes to the ALUA unavailable state when the
> primary fails, and the second set of paths moves to ALUA state
> transitioning as the previous secondary becomes the primary.


IMO this is the problem. How does your array respond to SCSI commands
while ports are transitioning? 

SPC5 (§5.16.2.6) says that the server should either fail all commands
with BUSY or CHECK CONDITION/NOT READY/LOGICAL UNIT NOT
ACCESSIBLE/ASYMMETRIC ACCESS STATE TRANSITION (a), or should support
all TMFs and a subset of SCSI commands, while responding with
CC/NR/AAST to all other commands (b). SPC6 (§5.18.2.6) is no different.

No matter how you read that paragraph, it's pretty clear that
"transitioning" is generally not a healthy state to attempt I/O.

Are you saying that on your server, the transitioning ports are able to
process regular I/O commands like READ and WRITE? If that's the case,
why do you pretend to be "transitioning" at all, rather than in an
active state? If it's not the case, why would it make sense for the
host to retry I/O on the transitioning path?

>  If the
> paths are failed which are transitioning, an all paths down state
> happens which is not expected.

IMO it _is_ expected if in fact no path is able to process SCSI
commands at the given point in time.

>  There should be a time for which
> transitioning is a transient state until the next state is entered.
> Failing a path assuming there would be non-failed paths seems wrong.


This is a misunderstanding. The path isn't failed because of
assumptions about other paths. It is failed because we know that it's
non-functional, and thus we must try other paths, if there are any.

Before 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA
transitioning state"), I/O was indeed retried on transitioning paths,
possibly forever. This posed a serious problem when a transitioning
path was about to be removed (e.g. dev_loss_tmo expired). And I'm still
quite convinced that it was wrong in general, because by all reasonable
means a "transitioning" path isn't usable for the host.

If we find a transitioning path, it might make sense to retry on other
paths first and eventually switch back to the transitioning path, when
all others have failed hard (e.g. "unavailable" state). However, this
logic doesn't exist in the kernel. In theory, it could be mapped to a
"transitioning" priority group in device-mapper multipath. But prio
groups are managed in user space (multipathd), which treats
transitioning paths as "almost failed" (priority 0) anyway. We can
discuss enhancing multipathd such that it re-checks transitioning paths
more frequently, in order to be able to reinstate them ASAP.

According to what you said above, the "transitioning" ports in the
problem situation ("second set") are those that were in "unavailable"
state before, which means "failed" as far as device mapper is concerned
- IOW, the paths in question would be unused anyway, until they got
reinstated, which wouldn't happen before they are fully up. With this
in mind, I have to say I don't understand why your proposed patch would
help at all. Please explain.

> > The purpose of that patch was to set the state of the transitioning
> > path to failed in order to make sure IO is retried on a different
> > path.
> > Your patch would undermine this purpose.

(Additional indentation added by me) Can you please use proper quoting?
You were mixing my statements and your own.

> I agree this is what happens, but those transitioning paths might be
> the only non-failed paths available. I don't think it is reasonable to
> fail them. This is the same as treating transitioning as standby or
> unavailable.


Right, and according to the SPC spec (see above), that makes more sense
than treating it as "active".

Storage vendors seem to interpret "transitioning" very differently,
both in terms of commands supported and in terms of time required to
reach the target state. That makes it hard to deal with it correctly on
the host side.

>  As you point out, this behavior arrived with the commit you
> mention. Before that commit, what I am doing does not result in an all
> paths down error, and likewise it does not in earlier Linux versions
> or other OSes under the same conditions. I see this as a regression.


If you use a suitable "no_path_retry" setting in multipathd, you should
be able to handle the situation you describe just fine by queueing the
I/O until the transitioning paths are fully up. IIUC, on your server
"transitioning" is a transient state that ends quickly, so queueing
shouldn't be an issue. E.g. if you are certain that "transitioning"
won't last longer than 10s, you could set "no_path_retry 2".
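
For illustration, a minimal multipath.conf stanza along those lines
(vendor/product strings taken from the map at the top of the thread;
the exact value is an assumption about how long "transitioning" can
last on your array):

    devices {
        device {
            vendor        "PURE"
            product       "FlashArray"
            no_path_retry 2
        }
    }

With the default polling_interval of 5 seconds, "no_path_retry 2"
keeps I/O queued for roughly 10 seconds after the last path fails
before erroring it out.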

Regards,
Martin
Brian Bunker July 14, 2021, 12:32 a.m. UTC | #4
On Tue, Jul 13, 2021 at 2:13 AM Martin Wilck <mwilck@suse.com> wrote:
>
> Hello Brian,
>
> On Mo, 2021-07-12 at 14:38 -0700, Brian Bunker wrote:
> > Martin,
> >
> > > Please confirm that your kernel included ee8868c5c78f ("scsi:
> > > scsi_dh_alua: Retry RTPG on a different path after failure").
> > > That commit should cause the RTPG to be retried on other map members
> > > which are not in failed state, thus avoiding this phenomenon.
> >
> > In my case, there are no other map members that are not in the failed
> > state. One set of paths goes to the ALUA unavailable state when the
> > primary fails, and the second set of paths moves to ALUA state
> > transitioning as the previous secondary becomes the primary.
>
> IMO this is the problem. How does your array respond to SCSI commands
> while ports are transitioning?
>
> SPC5 (§5.16.2.6) says that the server should either fail all commands
> with BUSY or CHECK CONDITION/NOT READY/LOGICAL UNIT NOT
> ACCESSIBLE/ASYMMETRIC ACCESS STATE TRANSITION (a), or should support
> all TMFs and a subset of SCSI commands, while responding with
> CC/NR/AAST to all other commands (b). SPC6 (§5.18.2.6) is no different.
>
> No matter how you read that paragraph, it's pretty clear that
> "transitioning" is generally not a healthy state to attempt I/O.
>
> Are you saying that on your server, the transitioning ports are able to
> process regular I/O commands like READ and WRITE? If that's the case,
> why do you pretend to be "transitioning" at all, rather than in an
> active state? If it's not the case, why would it make sense for the
> host to retry I/O on the transitioning path?


In the ALUA transitioning state, we cannot process READ or WRITE and
will return with the sense data as you mentioned above. We expect
retries down that transitioning path until it transitions to another
ALUA state (at least for some reasonable period of time for the
transition). The spec defines the state as the transition between
target asymmetric states. The current implementation requires
coordination on the target not to return a state transition down all
paths at the same time or risk all paths being failed. Using the ALUA
transitioning state allows us to respond to initiator READ and WRITE
requests even if we can't serve them when our internal target state is
transitioning (secondary to primary). The alternative is to queue them
which presents a different set of problems.
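
For reference, the sense data in question is the standard SPC encoding
already visible in the step 3 logs of the patch description; a sketch
with purely illustrative field names (the numeric values are the
SPC-defined additional sense code):

	sense.sense_key = 0x02;	/* NOT READY */
	sense.asc       = 0x04;	/* LOGICAL UNIT NOT ACCESSIBLE */
	sense.ascq      = 0x0a;	/* ASYMMETRIC ACCESS STATE TRANSITION */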

> >  If the
> > paths are failed which are transitioning, an all paths down state
> > happens which is not expected.
>
> IMO it _is_ expected if in fact no path is able to process SCSI
> commands at the given point in time.


In this case it would seem having all paths move to transitioning
would lead to all paths lost. It is possible to imagine
implementations where for a brief period of time all paths are in a
transitioning state. What would be the point of returning a transient
state if the result is a permanent failure?

> >  There should be a time for which
> > transitioning is a transient state until the next state is entered.
> > Failing a path assuming there would be non-failed paths seems wrong.
>
> This is a misunderstanding. The path isn't failed because of
> assumptions about other paths. It is failed because we know that it's
> non-functional, and thus we must try other paths, if there are any.
>
> Before 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA
> transitioning state"), I/O was indeed retried on transitioning paths,
> possibly forever. This posed a serious problem when a transitioning
> path was about to be removed (e.g. dev_loss_tmo expired). And I'm still
> quite convinced that it was wrong in general, because by all reasonable
> means a "transitioning" path isn't usable for the host.
>
> If we find a transitioning path, it might make sense to retry on other
> paths first and eventually switch back to the transitioning path, when
> all others have failed hard (e.g. "unavailable" state). However, this
> logic doesn't exist in the kernel. In theory, it could be mapped to a
> "transitioning" priority group in device-mapper multipath. But prio
> groups are managed in user space (multipathd), which treats
> transitioning paths as "almost failed" (priority 0) anyway. We can
> discuss enhancing multipathd such that it re-checks transitioning paths
> more frequently, in order to be able to reinstate them ASAP.
>
> According to what you said above, the "transitioning" ports in the
> problem situation ("second set") are those that were in "unavailable"
> state before, which means "failed" as far as device mapper is concerned
> - IOW, the paths in question would be unused anyway, until they got
> reinstated, which wouldn't happen before they are fully up. With this
> in mind, I have to say I don't understand why your proposed patch would
> help at all. Please explain.


My proposed patch would not fail the paths in the case of
BLK_STS_AGAIN. This seems to result in the requests being retried
until the path transitions to either a failed state, standby or
unavailable, or an online state.
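
For context, a sketch of the surrounding completion path with the
proposed change applied (reconstructed from the hunks quoted in this
thread; not the complete function):

	if (error && blk_path_error(error)) {
		struct multipath *m = ti->private;

		if (error == BLK_STS_RESOURCE)
			r = DM_ENDIO_DELAY_REQUEUE;
		else
			r = DM_ENDIO_REQUEUE;	/* BLK_STS_AGAIN takes this branch */

		/* Proposed: requeue, but do not fail, a path that
		 * returned BLK_STS_AGAIN (ALUA transitioning).
		 */
		if (pgpath && (error != BLK_STS_AGAIN))
			fail_path(pgpath);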

> > > The purpose of that patch was to set the state of the transitioning
> > > path to failed in order to make sure IO is retried on a different
> > > path.
> > > Your patch would undermine this purpose.
>
> (Additional indentation added by me) Can you please use proper quoting?
> You were mixing my statements and your own.
>
> > I agree this is what happens, but those transitioning paths might be
> > the only non-failed paths available. I don't think it is reasonable to
> > fail them. This is the same as treating transitioning as standby or
> > unavailable.
>
> Right, and according to the SPC spec (see above), that makes more sense
> than treating it as "active".
>
> Storage vendors seem to interpret "transitioning" very differently,
> both in terms of commands supported and in terms of time required to
> reach the target state. That makes it hard to deal with it correctly on
> the host side.
>
> >  As you point out, this behavior arrived with the commit you
> > mention. Before that commit, what I am doing does not result in an all
> > paths down error, and likewise it does not in earlier Linux versions
> > or other OSes under the same conditions. I see this as a regression.
>
> If you use a suitable "no_path_retry" setting in multipathd, you should
> be able to handle the situation you describe just fine by queueing the
> I/O until the transitioning paths are fully up. IIUC, on your server
> "transitioning" is a transient state that ends quickly, so queueing
> shouldn't be an issue. E.g. if you are certain that "transitioning"
> won't last longer than 10s, you could set "no_path_retry 2".
>
> Regards,
> Martin

I have tested using no_path_retry, and you are correct that it does
work around the issue that I am seeing. The problem with that is there
are times we want to convey all paths down to the initiator as quickly
as possible, and doing this will delay that.

Thanks,
Brian

Brian Bunker
PURE Storage, Inc.
brian@purestorage.com
Martin Wilck July 14, 2021, 8:38 a.m. UTC | #6
On Di, 2021-07-13 at 17:37 -0700, Brian Bunker wrote:
> On Tue, Jul 13, 2021 at 2:13 AM Martin Wilck <mwilck@suse.com> wrote:
> > Are you saying that on your server, the transitioning ports are able to
> > process regular I/O commands like READ and WRITE? If that's the case,
> > why do you pretend to be "transitioning" at all, rather than in an
> > active state? If it's not the case, why would it make sense for the
> > host to retry I/O on the transitioning path?
>
> In the ALUA transitioning state, we cannot process READ or WRITE and
> will return with the sense data as you mentioned above. We expect
> retries down that transitioning path until it transitions to another
> ALUA state (at least for some reasonable period of time for the
> transition). The spec defines the state as the transition between
> target asymmetric states. The current implementation requires
> coordination on the target not to return a state transition down all
> paths at the same time or risk all paths being failed. Using the ALUA
> transitioning state allows us to respond to initiator READ and WRITE
> requests even if we can't serve them when our internal target state is
> transitioning (secondary to primary). The alternative is to queue them
> which presents a different set of problems.


Indeed, it would be less prone to host-side errors if the "new"
pathgroup went to a usable state before the "old" pathgroup got
unavailable. Granted, this may be difficult to guarantee on the storage
side.

> > >  If the
> > > paths are failed which are transitioning, an all paths down state
> > > happens which is not expected.
> >
> > IMO it _is_ expected if in fact no path is able to process SCSI
> > commands at the given point in time.
>
> In this case it would seem having all paths move to transitioning
> would lead to all paths lost. It is possible to imagine
> implementations where for a brief period of time all paths are in a
> transitioning state. What would be the point of returning a transient
> state if the result is a permanent failure?


When a command fails with ALUA TRANSITIONING status, we must make sure
that:

 1) The command itself is not retried on the path at hand, neither on
the SCSI layer nor on the blk-mq layer. The former was the case anyway,
the latter is guaranteed by 0d88232010d5 ("scsi: core: Return
BLK_STS_AGAIN for ALUA transitioning").

 2) No new commands are sent down this path until it reaches a usable
final state. This is achieved on the SCSI layer by alua_prep_fn(), with
268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA
transitioning state").

These two items would still be true with your patch applied. However,
the problem is that if the path isn't failed, dm-multipath would
continue sending I/O down this path. If the path isn't set to failed
state, the path selector algorithm may or may not choose a different
path next time. In the worst case, dm-multipath would busy-loop
retrying the I/O on the same path. Please consider the following:

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 86518aec32b4..3f3a89fc2b3b 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1654,12 +1654,12 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
        if (error && blk_path_error(error)) {
                struct multipath *m = ti->private;
 
-               if (error == BLK_STS_RESOURCE)
+               if (error == BLK_STS_RESOURCE || error == BLK_STS_AGAIN)
                        r = DM_ENDIO_DELAY_REQUEUE;
                else
                        r = DM_ENDIO_REQUEUE;
 
-               if (pgpath)
+               if (pgpath && (error != BLK_STS_AGAIN))
                        fail_path(pgpath);

This way we'd avoid busy-looping by delaying the retry. This would
cause I/O delay in the case where some healthy paths are still in the
same dm-multipath priority group as the transitioning path. I suppose
this is a minor problem, because in the default case for ALUA
(group_by_prio in multipathd), all paths in the PG would have switched
to "transitioning" simultaneously.
 
Note that multipathd assigns priority 0 (the same prio as
"unavailable") if it happens to notice a "transitioning" path. This is
something we may want to change eventually. In your specific case, it
would cause the paths to be temporarily re-grouped, all paths being
moved to the same non-functional PG. The way you describe it, for your
storage at least, "transitioning" should be assigned a higher priority.
> > If you use a suitable "no_path_retry" setting in multipathd, you
> > should be able to handle the situation you describe just fine by
> > queueing the I/O until the transitioning paths are fully up. IIUC,
> > on your server "transitioning" is a transient state that ends
> > quickly, so queueing shouldn't be an issue. E.g. if you are certain
> > that "transitioning" won't last longer than 10s, you could set
> > "no_path_retry 2".
>
> I have tested using no_path_retry, and you are correct that it does
> work around the issue that I am seeing. The problem with that is there
> are times we want to convey all paths down to the initiator as quickly
> as possible and doing this will delay that.


Ok, that makes sense e.g. for cluster configurations. So, the purpose
is to distinguish between two cases where no path can process SCSI
commands: a) all paths are really down on the storage, and b) some
paths are down and some are "coming up".

Thanks,
Martin
Brian Bunker July 14, 2021, 6:13 p.m. UTC | #7
On Wed, Jul 14, 2021 at 1:39 AM Martin Wilck <mwilck@suse.com> wrote:
>
> On Di, 2021-07-13 at 17:37 -0700, Brian Bunker wrote:
> > On Tue, Jul 13, 2021 at 2:13 AM Martin Wilck <mwilck@suse.com> wrote:
> > > Are you saying that on your server, the transitioning ports are able to
> > > process regular I/O commands like READ and WRITE? If that's the case,
> > > why do you pretend to be "transitioning" at all, rather than in an
> > > active state? If it's not the case, why would it make sense for the
> > > host to retry I/O on the transitioning path?
> >
> > In the ALUA transitioning state, we cannot process READ or WRITE and
> > will return with the sense data as you mentioned above. We expect
> > retries down that transitioning path until it transitions to another
> > ALUA state (at least for some reasonable period of time for the
> > transition). The spec defines the state as the transition between
> > target asymmetric states. The current implementation requires
> > coordination on the target not to return a state transition down all
> > paths at the same time or risk all paths being failed. Using the ALUA
> > transitioning state allows us to respond to initiator READ and WRITE
> > requests even if we can't serve them when our internal target state is
> > transitioning (secondary to primary). The alternative is to queue them
> > which presents a different set of problems.
>
> Indeed, it would be less prone to host-side errors if the "new"
> pathgroup went to a usable state before the "old" pathgroup got
> unavailable. Granted, this may be difficult to guarantee on the storage
> side.


For us, this is not easily doable. We are transitioning from a
secondary to a primary, so for a brief time we have no paths which can
serve the I/O. The transition looks like this (a bit oversimplified,
but the general idea):

1. primary and secondary are active optimized
2. primary fails and secondary starts promoting
3. both previous primary and promoting secondary return transitioning
4. once the promoting primary gets far enough along, the previous
primary returns unavailable
5. the promoting primary continues to return transitioning
6. the promoting primary is done and returns active optimized
7. the previous primary becomes secondary and returns active optimized

So there are windows when we can return AO for all paths,
transitioning for all paths, or transitioning for half the paths and
unavailable for the other half. Even though these windows can be very
small, they are possible.
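
As a rough timeline of what each path group reports during the steps
above (a reconstruction; AO = active/optimized):

    step   previous primary    promoting secondary
    1      AO                  AO
    3      transitioning       transitioning
    4-5    unavailable         transitioning
    6      unavailable         AO
    7      AO                  AO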

> > > >  If the
> > > > paths are failed which are transitioning, an all paths down state
> > > > happens which is not expected.
> > >
> > > IMO it _is_ expected if in fact no path is able to process SCSI
> > > commands at the given point in time.
> >
> > In this case it would seem having all paths move to transitioning
> > would lead to all paths lost. It is possible to imagine
> > implementations where for a brief period of time all paths are in a
> > transitioning state. What would be the point of returning a transient
> > state if the result is a permanent failure?
>
> When a command fails with ALUA TRANSITIONING status, we must make sure
> that:
>
>  1) The command itself is not retried on the path at hand, neither on
> the SCSI layer nor on the blk-mq layer. The former was the case anyway,
> the latter is guaranteed by 0d88232010d5 ("scsi: core: Return
> BLK_STS_AGAIN for ALUA transitioning").
>
>  2) No new commands are sent down this path until it reaches a usable
> final state. This is achieved on the SCSI layer by alua_prep_fn(), with
> 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA
> transitioning state").
>
> These two items would still be true with your patch applied. However,
> the problem is that if the path isn't failed, dm-multipath would
> continue sending I/O down this path. If the path isn't set to failed
> state, the path selector algorithm may or may not choose a different
> path next time. In the worst case, dm-multipath would busy-loop
> retrying the I/O on the same path. Please consider the following:
>
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index 86518aec32b4..3f3a89fc2b3b 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -1654,12 +1654,12 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
>         if (error && blk_path_error(error)) {
>                 struct multipath *m = ti->private;
>
> -               if (error == BLK_STS_RESOURCE)
> +               if (error == BLK_STS_RESOURCE || error == BLK_STS_AGAIN)
>                         r = DM_ENDIO_DELAY_REQUEUE;
>                 else
>                         r = DM_ENDIO_REQUEUE;
>
> -               if (pgpath)
> +               if (pgpath && (error != BLK_STS_AGAIN))
>                         fail_path(pgpath);

> This way we'd avoid busy-looping by delaying the retry. This would
> cause I/O delay in the case where some healthy paths are still in the
> same dm-multipath priority group as the transitioning path. I suppose
> this is a minor problem, because in the default case for ALUA
> (group_by_prio in multipathd), all paths in the PG would have switched
> to "transitioning" simultaneously.
>
> Note that multipathd assigns priority 0 (the same prio as
> "unavailable") if it happens to notice a "transitioning" path. This is
> something we may want to change eventually. In your specific case, it
> would cause the paths to be temporarily re-grouped, all paths being
> moved to the same non-functional PG. The way you describe it, for your
> storage at least, "transitioning" should be assigned a higher priority.


Yes. I tried it with this change and it works. Should I re-submit this
modified version, or do you want to do it? Either way works for me. I
was flying a bit in the dark with my initial patch: I simply found the
fail_path() call and prevented it from running, without much thought
about what happens after that.

> > > If you use a suitable "no_path_retry" setting in multipathd, you
> > > should be able to handle the situation you describe just fine by
> > > queueing the I/O until the transitioning paths are fully up. IIUC,
> > > on your server "transitioning" is a transient state that ends
> > > quickly, so queueing shouldn't be an issue. E.g. if you are certain
> > > that "transitioning" won't last longer than 10s, you could set
> > > "no_path_retry 2".
> >
> > I have tested using no_path_retry, and you are correct that it does
> > work around the issue that I am seeing. The problem with that is there
> > are times we want to convey all paths down to the initiator as quickly
> > as possible and doing this will delay that.
>
> Ok, that makes sense e.g. for cluster configurations. So, the purpose
> is to distinguish between two cases where no path can process SCSI
> commands: a) all paths are really down on the storage, and b) some
> paths are down and some are "coming up".
>
> Thanks,
> Martin

Thanks,
Brian

-- 
Brian Bunker
PURE Storage, Inc.
brian@purestorage.com
Martin Wilck July 14, 2021, 9:06 p.m. UTC | #8
On Mi, 2021-07-14 at 11:13 -0700, Brian Bunker wrote:
> On Wed, Jul 14, 2021 at 1:39 AM Martin Wilck <mwilck@suse.com> wrote:
>
> > When a command fails with ALUA TRANSITIONING status, we must make
> > sure that:
> >
> >  1) The command itself is not retried on the path at hand, neither
> > on the SCSI layer nor on the blk-mq layer. The former was the case
> > anyway, the latter is guaranteed by 0d88232010d5 ("scsi: core: Return
> > BLK_STS_AGAIN for ALUA transitioning").
> >
> >  2) No new commands are sent down this path until it reaches a
> > usable final state. This is achieved on the SCSI layer by
> > alua_prep_fn(), with 268940b80fa4 ("scsi: scsi_dh_alua: Return
> > BLK_STS_AGAIN for ALUA transitioning state").
> >
> > These two items would still be true with your patch applied.
> > However, the problem is that if the path isn't failed, dm-multipath
> > would continue sending I/O down this path. If the path isn't set to
> > failed state, the path selector algorithm may or may not choose a
> > different path next time. In the worst case, dm-multipath would
> > busy-loop retrying the I/O on the same path. Please consider the
> > following:
> >
> > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> > index 86518aec32b4..3f3a89fc2b3b 100644
> > --- a/drivers/md/dm-mpath.c
> > +++ b/drivers/md/dm-mpath.c
> > @@ -1654,12 +1654,12 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
> >         if (error && blk_path_error(error)) {
> >                 struct multipath *m = ti->private;
> >
> > -               if (error == BLK_STS_RESOURCE)
> > +               if (error == BLK_STS_RESOURCE || error == BLK_STS_AGAIN)
> >                         r = DM_ENDIO_DELAY_REQUEUE;
> >                 else
> >                         r = DM_ENDIO_REQUEUE;
> >
> > -               if (pgpath)
> > +               if (pgpath && (error != BLK_STS_AGAIN))
> >                         fail_path(pgpath);
> >
> > This way we'd avoid busy-looping by delaying the retry. This would
> > cause I/O delay in the case where some healthy paths are still in
> > the same dm-multipath priority group as the transitioning path. I
> > suppose this is a minor problem, because in the default case for ALUA
> > (group_by_prio in multipathd), all paths in the PG would have
> > switched to "transitioning" simultaneously.
> >
> > Note that multipathd assigns priority 0 (the same prio as
> > "unavailable") if it happens to notice a "transitioning" path. This
> > is something we may want to change eventually. In your specific case,
> > it would cause the paths to be temporarily re-grouped, all paths
> > being moved to the same non-functional PG. The way you describe it,
> > for your storage at least, "transitioning" should be assigned a
> > higher priority.
>
> Yes. I tried it with this change and it works. Should I re-submit this
> modified version, or do you want to do it?

Yes, please.

Regards,
Martin

Patch

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index bced42f082b0..d5d6be96068d 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1657,7 +1657,7 @@  static int multipath_end_io(struct dm_target
*ti, struct request *clone,
                else
                        r = DM_ENDIO_REQUEUE;

-               if (pgpath)
+               if (pgpath && (error != BLK_STS_AGAIN))
                        fail_path(pgpath);