Message ID | CAHZQxy+4sTPz9+pY3=7VJH+CLUJsDct81KtnR2be8ycN5mhqTg@mail.gmail.com
---|---
State | New
Series | [1/1] scsi_dh_alua: properly handling the ALUA transitioning state
Brian, Martin, sorry, I've overlooked this patch previously. I have to say I think it's wrong and shouldn't have been applied. At least I need more in-depth explanation. On Mon, 2022-05-02 at 20:50 -0400, Martin K. Petersen wrote: > On Mon, 2 May 2022 08:09:17 -0700, Brian Bunker wrote: > > > The handling of the ALUA transitioning state is currently broken. > > When > > a target goes into this state, it is expected that the target is > > allowed to stay in this state for the implicit transition timeout > > without a path failure. Can you please show me a quote from the specs on which this expectation ("without a path failure") is based? AFAIK the SCSI specs don't say anything about device-mapper multipath semantics. > > The handler has this logic, but it gets > > skipped currently. > > > > When the target transitions, there is in-flight I/O from the > > initiator. The first of these responses from the target will be a > > unit > > attention letting the initiator know that the ALUA state has > > changed. > > The remaining in-flight I/Os, before the initiator finds out that > > the > > portal state has changed, will return not ready, ALUA state is > > transitioning. The portal state will change to > > SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all new I/O > > immediately failing the path unexpectedly. The path failure happens > > in > > less than a second instead of the expected successes until the > > transition timer is exceeded. dm multipath has no concept of "transitioning" state. Path state can be either active or inactive. As Brian wrote, commands sent to the transitioning device will return NOT READY, TRANSITIONING, and require retries on the SCSI layer. If we know this in advance, why should we continue sending I/O down this semi-broken path? If other, healthy paths are available, why would it not be the right thing to switch I/O to them ASAP? I suppose the problem you want to solve here is a transient situation in which all paths are transitioning (some up, some down), which would lead to a failure on the dm level (at least with no_path_retry=0). IMO this has to be avoided at the firmware level, and if that is impossible, multipath-tools' (no_path_retry * polling_interval) must be set to a value that is higher than the time for which this transient degraded situation would persist. Am I missing something? The way I see it, this is a problem that affects only storage from one vendor, and would cause suboptimal behavior on most others. If you really need this, I would suggest a new devinfo flag, e.g. BLIST_DONT_FAIL_TRANSITIONING. Regards, Martin > > > > [...] > > Applied to 5.18/scsi-fixes, thanks! > > [1/1] scsi_dh_alua: properly handling the ALUA transitioning state > https://git.kernel.org/mkp/scsi/c/6056a92ceb2a >
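For illustration, a minimal sketch of the devinfo-flag approach suggested above. The BLIST_DONT_FAIL_TRANSITIONING bit and the helper below are hypothetical and do not exist in the kernel; only sdev->sdev_bflags and the BLK_STS_* codes are existing symbols.

/*
 * Hypothetical sketch: a per-device opt-out via a new BLIST_* bit, as
 * suggested above. BLIST_DONT_FAIL_TRANSITIONING does not exist today.
 */
static blk_status_t transitioning_disposition(struct scsi_device *sdev)
{
	if (sdev->sdev_bflags & BLIST_DONT_FAIL_TRANSITIONING)
		return BLK_STS_OK;	/* keep dispatching during the transition */
	return BLK_STS_AGAIN;		/* default: report the path as unusable for now */
}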
On 5/20/22 12:57, Martin Wilck wrote: > Brian, Martin, > > sorry, I've overlooked this patch previously. I have to say I think > it's wrong and shouldn't have been applied. At least I need more in- > depth explanation. > > On Mon, 2022-05-02 at 20:50 -0400, Martin K. Petersen wrote: >> On Mon, 2 May 2022 08:09:17 -0700, Brian Bunker wrote: >> >>> The handling of the ALUA transitioning state is currently broken. >>> When >>> a target goes into this state, it is expected that the target is >>> allowed to stay in this state for the implicit transition timeout >>> without a path failure. > > Can you please show me a quote from the specs on which this expectation > ("without a path failure") is based? AFAIK the SCSI specs don't say > anything about device-mapper multipath semantics. > >>> The handler has this logic, but it gets >>> skipped currently. >>> >>> When the target transitions, there is in-flight I/O from the >>> initiator. The first of these responses from the target will be a >>> unit >>> attention letting the initiator know that the ALUA state has >>> changed. >>> The remaining in-flight I/Os, before the initiator finds out that >>> the >>> portal state has changed, will return not ready, ALUA state is >>> transitioning. The portal state will change to >>> SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all new I/O >>> immediately failing the path unexpectedly. The path failure happens >>> in >>> less than a second instead of the expected successes until the >>> transition timer is exceeded. > > dm multipath has no concept of "transitioning" state. Path state can be > either active or inactive. As Brian wrote, commands sent to the > transitioning device will return NOT READY, TRANSITIONING, and require > retries on the SCSI layer. If we know this in advance, why should we > continue sending I/O down this semi-broken path? If other, healthy > paths are available, why it would it not be the right thing to switch > I/O to them ASAP? > But we do, don't we? Commands are being returned with the appropriate status, and dm-multipath should make the corresponding decisions here. This patch just modifies the check when _sending_ commands; ie multipath had decided that the path is still usable. Question rather would be why multipath did that; however that logic isn't modified here. Cheers, Hannes
On Fri, 2022-05-20 at 14:06 +0200, Hannes Reinecke wrote: > On 5/20/22 12:57, Martin Wilck wrote: > > Brian, Martin, > > > > sorry, I've overlooked this patch previously. I have to say I think > > it's wrong and shouldn't have been applied. At least I need more > > in- > > depth explanation. > > > > On Mon, 2022-05-02 at 20:50 -0400, Martin K. Petersen wrote: > > > On Mon, 2 May 2022 08:09:17 -0700, Brian Bunker wrote: > > > > > > > The handling of the ALUA transitioning state is currently > > > > broken. > > > > When > > > > a target goes into this state, it is expected that the target > > > > is > > > > allowed to stay in this state for the implicit transition > > > > timeout > > > > without a path failure. > > > > Can you please show me a quote from the specs on which this > > expectation > > ("without a path failure") is based? AFAIK the SCSI specs don't say > > anything about device-mapper multipath semantics. > > > > > > The handler has this logic, but it gets > > > > skipped currently. > > > > > > > > When the target transitions, there is in-flight I/O from the > > > > initiator. The first of these responses from the target will be > > > > a > > > > unit > > > > attention letting the initiator know that the ALUA state has > > > > changed. > > > > The remaining in-flight I/Os, before the initiator finds out > > > > that > > > > the > > > > portal state has changed, will return not ready, ALUA state is > > > > transitioning. The portal state will change to > > > > SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all new I/O > > > > immediately failing the path unexpectedly. The path failure > > > > happens > > > > in > > > > less than a second instead of the expected successes until the > > > > transition timer is exceeded. > > > > dm multipath has no concept of "transitioning" state. Path state > > can be > > either active or inactive. As Brian wrote, commands sent to the > > transitioning device will return NOT READY, TRANSITIONING, and > > require > > retries on the SCSI layer. If we know this in advance, why should > > we > > continue sending I/O down this semi-broken path? If other, healthy > > paths are available, why it would it not be the right thing to > > switch > > I/O to them ASAP? > > > But we do, don't we? > Commands are being returned with the appropriate status, and > dm-multipath should make the corresponding decisions here. > This patch just modifies the check when _sending_ commands; ie > multipath > had decided that the path is still usable. > Question rather would be why multipath did that; If alua_prep_fn() got called, the path was considered usable at the given point in time by dm-multipath. Most probably the reason was simply that no error condition had occured on this path before ALUA state switched to transitioning. I suppose this can happen if storage switches a PG consisting of multiple paths to TRANSITIONING. We get an error on one path (sda, say), issue an RTPG, and receive the new ALUA state for all paths of the PG. For all paths except sda, we'd just see a switch to TRANSITIONING without a previous SCSI error. With this patch, we'll dispatch I/O (usually an entire bunch) to these paths despite seeing them in TRANSITIONING state. Eventually, when the SCSI responses are received, this leads to path failures. If I/O latencies are small, this happens after a few ms. In that case, the goal of Brian's patch is not reached, because the time until path failure would still be on the order of milliseconds. 
OTOH, if latencies are high, it takes substantially longer for the kernel to realize that the path is non-functional, while other, good paths may be idle. I fail to see the benefit. Regards, Martin > however that logic > isn't modified here. > > Cheers, > > Hannes
On 5/20/22 9:03 AM, Martin Wilck wrote: > On Fri, 2022-05-20 at 14:06 +0200, Hannes Reinecke wrote: >> On 5/20/22 12:57, Martin Wilck wrote: >>> Brian, Martin, >>> >>> sorry, I've overlooked this patch previously. I have to say I think >>> it's wrong and shouldn't have been applied. At least I need more >>> in- >>> depth explanation. >>> >>> On Mon, 2022-05-02 at 20:50 -0400, Martin K. Petersen wrote: >>>> On Mon, 2 May 2022 08:09:17 -0700, Brian Bunker wrote: >>>> >>>>> The handling of the ALUA transitioning state is currently >>>>> broken. >>>>> When >>>>> a target goes into this state, it is expected that the target >>>>> is >>>>> allowed to stay in this state for the implicit transition >>>>> timeout >>>>> without a path failure. >>> >>> Can you please show me a quote from the specs on which this >>> expectation >>> ("without a path failure") is based? AFAIK the SCSI specs don't say >>> anything about device-mapper multipath semantics. >>> >>>>> The handler has this logic, but it gets >>>>> skipped currently. >>>>> >>>>> When the target transitions, there is in-flight I/O from the >>>>> initiator. The first of these responses from the target will be >>>>> a >>>>> unit >>>>> attention letting the initiator know that the ALUA state has >>>>> changed. >>>>> The remaining in-flight I/Os, before the initiator finds out >>>>> that >>>>> the >>>>> portal state has changed, will return not ready, ALUA state is >>>>> transitioning. The portal state will change to >>>>> SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all new I/O >>>>> immediately failing the path unexpectedly. The path failure >>>>> happens >>>>> in >>>>> less than a second instead of the expected successes until the >>>>> transition timer is exceeded. >>> >>> dm multipath has no concept of "transitioning" state. Path state >>> can be >>> either active or inactive. As Brian wrote, commands sent to the >>> transitioning device will return NOT READY, TRANSITIONING, and >>> require >>> retries on the SCSI layer. If we know this in advance, why should >>> we >>> continue sending I/O down this semi-broken path? If other, healthy >>> paths are available, why it would it not be the right thing to >>> switch >>> I/O to them ASAP? >>> >> But we do, don't we? >> Commands are being returned with the appropriate status, and >> dm-multipath should make the corresponding decisions here. >> This patch just modifies the check when _sending_ commands; ie >> multipath >> had decided that the path is still usable. >> Question rather would be why multipath did that; > > If alua_prep_fn() got called, the path was considered usable at the > given point in time by dm-multipath. Most probably the reason was > simply that no error condition had occured on this path before ALUA > state switched to transitioning. I suppose this can happen if storage > switches a PG consisting of multiple paths to TRANSITIONING. We get an > error on one path (sda, say), issue an RTPG, and receive the new ALUA > state for all paths of the PG. For all paths except sda, we'd just see > a switch to TRANSITIONING without a previous SCSI error. > > With this patch, we'll dispatch I/O (usually an entire bunch) to these > paths despite seeing them in TRANSITIONING state. Eventually, when the > SCSI responses are received, this leads to path failures. If I/O > latencies are small, this happens after a few ms. In that case, the > goal of Brian's patch is not reached, because the time until path > failure would still be on the order of milliseconds. 
OTOH, if latencies > are high, it takes substantially longer for the kernel to realize that > the path is non-functional, while other, good paths may be idle. I fail > to see the benefit. > I'm not sure everyone agrees with you on the meaning of transitioning. If we go from non-optimized to optimized or standby to non-opt/optimized we don't want to try other paths because it can cause thrashing. We just need to transition resources before we can fully use the path. It could be a local cache operation or for distributed targets it could be a really expensive operation. For both though, it can take longer than the retries we get from scsi-ml. For example this patch: commit 2b35865e7a290d313c3d156c0c2074b4c4ffaf52 Author: Hannes Reinecke <hare@suse.de> Date: Fri Feb 19 09:17:13 2016 +0100 scsi_dh_alua: Recheck state on unit attention caused us issues because the retries were used up quickly. We just changed the target to return BUSY status and we don't use the transitioning state. The spec does mention using either return value in "5.15.2.5 Transitions between target port asymmetric access states": ------ if during the transition the logical unit is inaccessible, then the transition is performed as a single indivisible event and the device server shall respond by either returning BUSY status, or returning CHECK CONDITION status, with the sense key set to NOT READY, and the sense code set to LOGICAL UNIT NOT ACCESSIBLE, ASYMMETRIC ACCESS STATE TRANSITION; ------ So Brian's patch works if you return BUSY instead of 02/04/0a and are setting the state to transitioning during the time it's transitioning. I do partially agree with you and it's kind of a messy mix and match. However, I think we should change alua_check_sense to handle 02/04/0a the same way we handle it in alua_prep_fn. And then we should add a new flag for devices that have a bug and return transitioning forever.
On Fri, 2022-05-20 at 14:08 -0500, Mike Christie wrote: > On 5/20/22 9:03 AM, Martin Wilck wrote: > > On Fri, 2022-05-20 at 14:06 +0200, Hannes Reinecke wrote: > > > On 5/20/22 12:57, Martin Wilck wrote: > > > > Brian, Martin, > > > > > > > > sorry, I've overlooked this patch previously. I have to say I > > > > think > > > > it's wrong and shouldn't have been applied. At least I need > > > > more > > > > in- > > > > depth explanation. > > > > > > > > On Mon, 2022-05-02 at 20:50 -0400, Martin K. Petersen wrote: > > > > > On Mon, 2 May 2022 08:09:17 -0700, Brian Bunker wrote: > > > > > > > > > > > The handling of the ALUA transitioning state is currently > > > > > > broken. > > > > > > When > > > > > > a target goes into this state, it is expected that the > > > > > > target > > > > > > is > > > > > > allowed to stay in this state for the implicit transition > > > > > > timeout > > > > > > without a path failure. > > > > > > > > Can you please show me a quote from the specs on which this > > > > expectation > > > > ("without a path failure") is based? AFAIK the SCSI specs don't > > > > say > > > > anything about device-mapper multipath semantics. > > > > > > > > > > The handler has this logic, but it gets > > > > > > skipped currently. > > > > > > > > > > > > When the target transitions, there is in-flight I/O from > > > > > > the > > > > > > initiator. The first of these responses from the target > > > > > > will be > > > > > > a > > > > > > unit > > > > > > attention letting the initiator know that the ALUA state > > > > > > has > > > > > > changed. > > > > > > The remaining in-flight I/Os, before the initiator finds > > > > > > out > > > > > > that > > > > > > the > > > > > > portal state has changed, will return not ready, ALUA state > > > > > > is > > > > > > transitioning. The portal state will change to > > > > > > SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all new > > > > > > I/O > > > > > > immediately failing the path unexpectedly. The path failure > > > > > > happens > > > > > > in > > > > > > less than a second instead of the expected successes until > > > > > > the > > > > > > transition timer is exceeded. > > > > > > > > dm multipath has no concept of "transitioning" state. Path > > > > state > > > > can be > > > > either active or inactive. As Brian wrote, commands sent to the > > > > transitioning device will return NOT READY, TRANSITIONING, and > > > > require > > > > retries on the SCSI layer. If we know this in advance, why > > > > should > > > > we > > > > continue sending I/O down this semi-broken path? If other, > > > > healthy > > > > paths are available, why it would it not be the right thing to > > > > switch > > > > I/O to them ASAP? > > > > > > > But we do, don't we? > > > Commands are being returned with the appropriate status, and > > > dm-multipath should make the corresponding decisions here. > > > This patch just modifies the check when _sending_ commands; ie > > > multipath > > > had decided that the path is still usable. > > > Question rather would be why multipath did that; > > > > If alua_prep_fn() got called, the path was considered usable at the > > given point in time by dm-multipath. Most probably the reason was > > simply that no error condition had occured on this path before ALUA > > state switched to transitioning. I suppose this can happen > > if storage > > switches a PG consisting of multiple paths to TRANSITIONING. 
We get > > an > > error on one path (sda, say), issue an RTPG, and receive the new > > ALUA > > state for all paths of the PG. For all paths except sda, we'd just > > see > > a switch to TRANSITIONING without a previous SCSI error. > > > > With this patch, we'll dispatch I/O (usually an entire bunch) to > > these > > paths despite seeing them in TRANSITIONING state. Eventually, when > > the > > SCSI responses are received, this leads to path failures. If I/O > > latencies are small, this happens after a few ms. In that case, the > > goal of Brian's patch is not reached, because the time until path > > failure would still be on the order of milliseconds. OTOH, if > > latencies > > are high, it takes substantially longer for the kernel to realize > > that > > the path is non-functional, while other, good paths may be idle. I > > fail > > to see the benefit. > > > > I'm not sure everyone agrees with you on the meaning of > transitioning. > > If we go from non-optimized to optimized or standby to non- > opt/optimized > we don't want to try other paths because it can cause thrashing. But only with explicit ALUA, or am I missing something? I agree that the host shouldn't initiate a PG switch if it encounters transitioning state. I also agree that for transitioning towards a "better" state, e.g. standby to (non)-optimized, failing the path would be questionable. Unfortunately we don't know in which "direction" the path is transitioning - it could be for 'better' or 'worse'. I suppose that in the case of a PG switch, it can happen that we dispatch I/O to a device that used to be in Standby and is now transitioning. Would it make sense to remember the previous state and "guess" what we're going to transition to? I.e. if the previous state was "Standby", it's probably going to be (non)optimized after the transition, and vice- versa? > We just > need to transition resources before we can fully use the path. It > could > be a local cache operation or for distributed targets it could be a > really > expensive operation. > > For both though, it can take longer than the retries we get from > scsi-ml. So if we want to do "the right thing", we'd continue dispatching to the device until either the state changes or the device-reported transition timeout has expired? Martin > For example this patch: > > commit 2b35865e7a290d313c3d156c0c2074b4c4ffaf52 > Author: Hannes Reinecke <hare@suse.de> > Date: Fri Feb 19 09:17:13 2016 +0100 > > scsi_dh_alua: Recheck state on unit attention > > > caused us issues because the retries were used up quickly. We just > changed > the target to return BUSY status and we don't use the transitioning > state. > The spec does mention using either return value in "5.15.2.5 > Transitions > between target port asymmetric access states": > > ------ > if during the transition the logical unit is inaccessible, then the > transition > is performed as a single indivisible event and the device server > shall respond > by either returning BUSY status, or returning CHECK CONDITION status, > with the > sense key set to NOT READY, and the sense code set to LOGICAL UNIT > NOT ACCESSIBLE, > ASYMMETRIC ACCESS STATE TRANSITION; > > ------ > > So Brian's patch works if you return BUSY instead of 02/04/0a and are > setting > the state to transitioning during the time it's transitioning. > > I do partially agree with you and it's kind of a messy mix and match. > However, > I think we should change alua_check_sense to handle 02/04/0a the same > way we > handle it in alua_prep_fn. 
And then we should add a new flag for > devices that > have a bug and return transitioning forever. >
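To make the "remember the previous state and guess the direction" idea above concrete, a hedged sketch follows; the helper and the heuristic are illustrative only, while the SCSI_ACCESS_STATE_* constants are the kernel's.

/*
 * Illustrative only: guess whether an implicit transition is probably
 * heading towards an active state, based on the last stable ALUA state.
 */
static bool transition_probably_to_active(unsigned char prev_state)
{
	switch (prev_state) {
	case SCSI_ACCESS_STATE_STANDBY:
	case SCSI_ACCESS_STATE_UNAVAILABLE:
		return true;	/* likely moving towards (non-)optimized */
	default:
		return false;	/* could just as well be moving to a worse state */
	}
}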
On 5/20/22 13:03, Martin Wilck wrote: > On Fri, 2022-05-20 at 14:08 -0500, Mike Christie wrote: >> On 5/20/22 9:03 AM, Martin Wilck wrote: >>> On Fri, 2022-05-20 at 14:06 +0200, Hannes Reinecke wrote: >>>> On 5/20/22 12:57, Martin Wilck wrote: >>>>> Brian, Martin, >>>>> >>>>> sorry, I've overlooked this patch previously. I have to say I >>>>> think >>>>> it's wrong and shouldn't have been applied. At least I need >>>>> more >>>>> in- >>>>> depth explanation. >>>>> >>>>> On Mon, 2022-05-02 at 20:50 -0400, Martin K. Petersen wrote: >>>>>> On Mon, 2 May 2022 08:09:17 -0700, Brian Bunker wrote: >>>>>> >>>>>>> The handling of the ALUA transitioning state is currently >>>>>>> broken. >>>>>>> When >>>>>>> a target goes into this state, it is expected that the >>>>>>> target >>>>>>> is >>>>>>> allowed to stay in this state for the implicit transition >>>>>>> timeout >>>>>>> without a path failure. >>>>> >>>>> Can you please show me a quote from the specs on which this >>>>> expectation >>>>> ("without a path failure") is based? AFAIK the SCSI specs don't >>>>> say >>>>> anything about device-mapper multipath semantics. >>>>> >>>>>>> The handler has this logic, but it gets >>>>>>> skipped currently. >>>>>>> >>>>>>> When the target transitions, there is in-flight I/O from >>>>>>> the >>>>>>> initiator. The first of these responses from the target >>>>>>> will be >>>>>>> a >>>>>>> unit >>>>>>> attention letting the initiator know that the ALUA state >>>>>>> has >>>>>>> changed. >>>>>>> The remaining in-flight I/Os, before the initiator finds >>>>>>> out >>>>>>> that >>>>>>> the >>>>>>> portal state has changed, will return not ready, ALUA state >>>>>>> is >>>>>>> transitioning. The portal state will change to >>>>>>> SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all new >>>>>>> I/O >>>>>>> immediately failing the path unexpectedly. The path failure >>>>>>> happens >>>>>>> in >>>>>>> less than a second instead of the expected successes until >>>>>>> the >>>>>>> transition timer is exceeded. >>>>> >>>>> dm multipath has no concept of "transitioning" state. Path >>>>> state >>>>> can be >>>>> either active or inactive. As Brian wrote, commands sent to the >>>>> transitioning device will return NOT READY, TRANSITIONING, and >>>>> require >>>>> retries on the SCSI layer. If we know this in advance, why >>>>> should >>>>> we >>>>> continue sending I/O down this semi-broken path? If other, >>>>> healthy >>>>> paths are available, why it would it not be the right thing to >>>>> switch >>>>> I/O to them ASAP? >>>>> >>>> But we do, don't we? >>>> Commands are being returned with the appropriate status, and >>>> dm-multipath should make the corresponding decisions here. >>>> This patch just modifies the check when _sending_ commands; ie >>>> multipath >>>> had decided that the path is still usable. >>>> Question rather would be why multipath did that; >>> >>> If alua_prep_fn() got called, the path was considered usable at the >>> given point in time by dm-multipath. Most probably the reason was >>> simply that no error condition had occured on this path before ALUA >>> state switched to transitioning. I suppose this can happen >>> if storage >>> switches a PG consisting of multiple paths to TRANSITIONING. We get >>> an >>> error on one path (sda, say), issue an RTPG, and receive the new >>> ALUA >>> state for all paths of the PG. For all paths except sda, we'd just >>> see >>> a switch to TRANSITIONING without a previous SCSI error. 
>>> >>> With this patch, we'll dispatch I/O (usually an entire bunch) to >>> these >>> paths despite seeing them in TRANSITIONING state. Eventually, when >>> the >>> SCSI responses are received, this leads to path failures. If I/O >>> latencies are small, this happens after a few ms. In that case, the >>> goal of Brian's patch is not reached, because the time until path >>> failure would still be on the order of milliseconds. OTOH, if >>> latencies >>> are high, it takes substantially longer for the kernel to realize >>> that >>> the path is non-functional, while other, good paths may be idle. I >>> fail >>> to see the benefit. >>> >> >> I'm not sure everyone agrees with you on the meaning of >> transitioning. >> >> If we go from non-optimized to optimized or standby to non- >> opt/optimized >> we don't want to try other paths because it can cause thrashing. > > But only with explicit ALUA, or am I missing something? I agree that > the host shouldn't initiate a PG switch if it encounters transitioning > state. I also agree that for transitioning towards a "better" state, > e.g. standby to (non)-optimized, failing the path would be > questionable. Unfortunately we don't know in which "direction" the path > is transitioning - it could be for 'better' or 'worse'. I suppose that > in the case of a PG switch, it can happen that we dispatch I/O to a > device that used to be in Standby and is now transitioning. Would it > make sense to remember the previous state and "guess" what we're going > to transition to? I.e. if the previous state was "Standby", it's > probably going to be (non)optimized after the transition, and vice- > versa? > >> We just >> need to transition resources before we can fully use the path. It >> could >> be a local cache operation or for distributed targets it could be a >> really >> expensive operation. >> >> For both though, it can take longer than the retries we get from >> scsi-ml. > > So if we want to do "the right thing", we'd continue dispatching to the > device until either the state changes or the device-reported transition > timeout has expired? > Not quite. I think multipath should make the decision, and we shouldn't try to out-guess multipathing from the device driver. We should only make the information available (to wit: the path state) for multipath to make the decision. But if multipath decides to send I/O via a path which we _think_ it's in transitioning we shouldn't be arguing with that (ie we shouldn't preempt is from prep_fn) but rather let it rip; who knows, maybe there was a reason for that. The only thing this patch changes (for any current setup) is that we'll incur additional round-trip times for I/O being sent in between the first I/O returning with 'transitioning' and multipath picking up the new path state. This I/O will be returned more-or-less immediately by the target, so really we're only having an additional round-trip latency for each I/O during that time. This is at most in the milliseconds range, so I guess we can live with that. If, however, multipath fails to pick up the 'transitioning' state then this won't work, and we indeed will incur a regression. But that arguably is an issue with multipath ... Cheers, Hannes
On 5/20/22 3:03 PM, Martin Wilck wrote: > On Fri, 2022-05-20 at 14:08 -0500, Mike Christie wrote: >> On 5/20/22 9:03 AM, Martin Wilck wrote: >>> On Fri, 2022-05-20 at 14:06 +0200, Hannes Reinecke wrote: >>>> On 5/20/22 12:57, Martin Wilck wrote: >>>>> Brian, Martin, >>>>> >>>>> sorry, I've overlooked this patch previously. I have to say I >>>>> think >>>>> it's wrong and shouldn't have been applied. At least I need >>>>> more >>>>> in- >>>>> depth explanation. >>>>> >>>>> On Mon, 2022-05-02 at 20:50 -0400, Martin K. Petersen wrote: >>>>>> On Mon, 2 May 2022 08:09:17 -0700, Brian Bunker wrote: >>>>>> >>>>>>> The handling of the ALUA transitioning state is currently >>>>>>> broken. >>>>>>> When >>>>>>> a target goes into this state, it is expected that the >>>>>>> target >>>>>>> is >>>>>>> allowed to stay in this state for the implicit transition >>>>>>> timeout >>>>>>> without a path failure. >>>>> >>>>> Can you please show me a quote from the specs on which this >>>>> expectation >>>>> ("without a path failure") is based? AFAIK the SCSI specs don't >>>>> say >>>>> anything about device-mapper multipath semantics. >>>>> >>>>>>> The handler has this logic, but it gets >>>>>>> skipped currently. >>>>>>> >>>>>>> When the target transitions, there is in-flight I/O from >>>>>>> the >>>>>>> initiator. The first of these responses from the target >>>>>>> will be >>>>>>> a >>>>>>> unit >>>>>>> attention letting the initiator know that the ALUA state >>>>>>> has >>>>>>> changed. >>>>>>> The remaining in-flight I/Os, before the initiator finds >>>>>>> out >>>>>>> that >>>>>>> the >>>>>>> portal state has changed, will return not ready, ALUA state >>>>>>> is >>>>>>> transitioning. The portal state will change to >>>>>>> SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all new >>>>>>> I/O >>>>>>> immediately failing the path unexpectedly. The path failure >>>>>>> happens >>>>>>> in >>>>>>> less than a second instead of the expected successes until >>>>>>> the >>>>>>> transition timer is exceeded. >>>>> >>>>> dm multipath has no concept of "transitioning" state. Path >>>>> state >>>>> can be >>>>> either active or inactive. As Brian wrote, commands sent to the >>>>> transitioning device will return NOT READY, TRANSITIONING, and >>>>> require >>>>> retries on the SCSI layer. If we know this in advance, why >>>>> should >>>>> we >>>>> continue sending I/O down this semi-broken path? If other, >>>>> healthy >>>>> paths are available, why it would it not be the right thing to >>>>> switch >>>>> I/O to them ASAP? >>>>> >>>> But we do, don't we? >>>> Commands are being returned with the appropriate status, and >>>> dm-multipath should make the corresponding decisions here. >>>> This patch just modifies the check when _sending_ commands; ie >>>> multipath >>>> had decided that the path is still usable. >>>> Question rather would be why multipath did that; >>> >>> If alua_prep_fn() got called, the path was considered usable at the >>> given point in time by dm-multipath. Most probably the reason was >>> simply that no error condition had occured on this path before ALUA >>> state switched to transitioning. I suppose this can happen >>> if storage >>> switches a PG consisting of multiple paths to TRANSITIONING. We get >>> an >>> error on one path (sda, say), issue an RTPG, and receive the new >>> ALUA >>> state for all paths of the PG. For all paths except sda, we'd just >>> see >>> a switch to TRANSITIONING without a previous SCSI error. 
>>> >>> With this patch, we'll dispatch I/O (usually an entire bunch) to >>> these >>> paths despite seeing them in TRANSITIONING state. Eventually, when >>> the >>> SCSI responses are received, this leads to path failures. If I/O >>> latencies are small, this happens after a few ms. In that case, the >>> goal of Brian's patch is not reached, because the time until path >>> failure would still be on the order of milliseconds. OTOH, if >>> latencies >>> are high, it takes substantially longer for the kernel to realize >>> that >>> the path is non-functional, while other, good paths may be idle. I >>> fail >>> to see the benefit. >>> >> >> I'm not sure everyone agrees with you on the meaning of >> transitioning. >> >> If we go from non-optimized to optimized or standby to non- >> opt/optimized >> we don't want to try other paths because it can cause thrashing. > > But only with explicit ALUA, or am I missing something? I agree that That section of the spec mentions both implicit and explicit. For implicit, the target can want to rebalance resources for things like a resource is down permanently, or you add more ports, or we bring up/down resources dynamically based on usage or maintenance. > the host shouldn't initiate a PG switch if it encounters transitioning > state. I also agree that for transitioning towards a "better" state, > e.g. standby to (non)-optimized, failing the path would be > questionable. Unfortunately we don't know in which "direction" the path > is transitioning - it could be for 'better' or 'worse'. I suppose that For implicit, the target knows. It's initiating the transition based on whatever metrics or resources it has. We want the initiator to let us complete what we are doing. For explicit, then again the target knows what it wants to do when it gets the STPG and we only it to set the paths to optimized. So if it goes that route where it completes the STPG before the transition completes, then goes into transitioning, then we can just let the device do it's transitions. > in the case of a PG switch, it can happen that we dispatch I/O to a > device that used to be in Standby and is now transitioning. Would it > make sense to remember the previous state and "guess" what we're going > to transition to? I.e. if the previous state was "Standby", it's > probably going to be (non)optimized after the transition, and vice- > versa? You are referring to the issue Hannes mentions where multipath can pick up the transitioning state and it might get confused, right? I'm not sure what to do. > >> We just >> need to transition resources before we can fully use the path. It >> could >> be a local cache operation or for distributed targets it could be a >> really >> expensive operation. >> >> For both though, it can take longer than the retries we get from >> scsi-ml. > > So if we want to do "the right thing", we'd continue dispatching to the > device until either the state changes or the device-reported transition > timeout has expired? Sort of. Ideally I think it would be nice if we blocked the device/queue for normal IO, then just sent a RTPG every N secs or msecs until we changed state or until the timer expired. We then unblock and either fail upwards or dispatch. I think this is a lot of work though. The problem with constant dispatching is that on low latency systems we retry too quickly. I had to add a little sleep on the target side for this or we hammer the target/initiator too hard and we got warnings (I can't remember the exact warn/err).
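A rough sketch of the "block the device and poll RTPG" idea described above. This is not existing kernel code: block_normal_io(), send_rtpg() and unblock_normal_io() are placeholders, pg->state and pg->transition_tmo mirror fields in scsi_dh_alua.c, and locking and error handling are omitted.

/*
 * Hedged sketch: stop normal I/O, re-read the ALUA state periodically
 * until it changes or the transition timeout expires, then resume
 * (or fail upwards).
 */
static void poll_until_transition_done(struct alua_port_group *pg)
{
	unsigned long deadline = jiffies + pg->transition_tmo * HZ;

	block_normal_io(pg);			/* placeholder */
	while (pg->state == SCSI_ACCESS_STATE_TRANSITIONING &&
	       time_before(jiffies, deadline)) {
		send_rtpg(pg);			/* placeholder: issue RTPG, update pg->state */
		msleep(1000);			/* poll roughly every second */
	}
	unblock_normal_io(pg);			/* placeholder: dispatch again or fail up */
}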
On Sat, 2022-05-21 at 12:17 +0200, Hannes Reinecke wrote: > On 5/20/22 13:03, Martin Wilck wrote: > > On Fri, 2022-05-20 at 14:08 -0500, Mike Christie wrote: > > > On 5/20/22 9:03 AM, Martin Wilck wrote: > > > > On Fri, 2022-05-20 at 14:06 +0200, Hannes Reinecke wrote: > > > > > On 5/20/22 12:57, Martin Wilck wrote: > > > > > > Brian, Martin, > > > > > > > > > > > > sorry, I've overlooked this patch previously. I have to say > > > > > > I > > > > > > think > > > > > > it's wrong and shouldn't have been applied. At least I need > > > > > > more > > > > > > in- > > > > > > depth explanation. > > > > > > > > > > > > On Mon, 2022-05-02 at 20:50 -0400, Martin K. Petersen > > > > > > wrote: > > > > > > > On Mon, 2 May 2022 08:09:17 -0700, Brian Bunker wrote: > > > > > > > > > > > > > > > The handling of the ALUA transitioning state is > > > > > > > > currently > > > > > > > > broken. > > > > > > > > When > > > > > > > > a target goes into this state, it is expected that the > > > > > > > > target > > > > > > > > is > > > > > > > > allowed to stay in this state for the implicit > > > > > > > > transition > > > > > > > > timeout > > > > > > > > without a path failure. > > > > > > > > > > > > Can you please show me a quote from the specs on which this > > > > > > expectation > > > > > > ("without a path failure") is based? AFAIK the SCSI specs > > > > > > don't > > > > > > say > > > > > > anything about device-mapper multipath semantics. > > > > > > > > > > > > > > The handler has this logic, but it gets > > > > > > > > skipped currently. > > > > > > > > > > > > > > > > When the target transitions, there is in-flight I/O > > > > > > > > from > > > > > > > > the > > > > > > > > initiator. The first of these responses from the target > > > > > > > > will be > > > > > > > > a > > > > > > > > unit > > > > > > > > attention letting the initiator know that the ALUA > > > > > > > > state > > > > > > > > has > > > > > > > > changed. > > > > > > > > The remaining in-flight I/Os, before the initiator > > > > > > > > finds > > > > > > > > out > > > > > > > > that > > > > > > > > the > > > > > > > > portal state has changed, will return not ready, ALUA > > > > > > > > state > > > > > > > > is > > > > > > > > transitioning. The portal state will change to > > > > > > > > SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all > > > > > > > > new > > > > > > > > I/O > > > > > > > > immediately failing the path unexpectedly. The path > > > > > > > > failure > > > > > > > > happens > > > > > > > > in > > > > > > > > less than a second instead of the expected successes > > > > > > > > until > > > > > > > > the > > > > > > > > transition timer is exceeded. > > > > > > > > > > > > dm multipath has no concept of "transitioning" state. Path > > > > > > state > > > > > > can be > > > > > > either active or inactive. As Brian wrote, commands sent to > > > > > > the > > > > > > transitioning device will return NOT READY, TRANSITIONING, > > > > > > and > > > > > > require > > > > > > retries on the SCSI layer. If we know this in advance, why > > > > > > should > > > > > > we > > > > > > continue sending I/O down this semi-broken path? If other, > > > > > > healthy > > > > > > paths are available, why it would it not be the right thing > > > > > > to > > > > > > switch > > > > > > I/O to them ASAP? > > > > > > > > > > > But we do, don't we? > > > > > Commands are being returned with the appropriate status, and > > > > > dm-multipath should make the corresponding decisions here. 
> > > > > This patch just modifies the check when _sending_ commands; > > > > > ie > > > > > multipath Depending on timing, multipathd may or may not > > > > > already have done a PG switch because of 3.) > > > > > had decided that the path is still usable. > > > > > Question rather would be why multipath did that; > > > > > > > > If alua_prep_fn() got called, the path was considered usable at > > > > the > > > > given point in time by dm-multipath. Most probably the reason > > > > was > > > > simply that no error condition had occured on this path before > > > > ALUA > > > > state switched to transitioning. I suppose this can happen > > > > if storage > > > > switches a PG consisting of multiple paths to TRANSITIONING. We > > > > get > > > > an > > > > error on one path (sda, say), issue an RTPG, and receive the > > > > new > > > > ALUA > > > > state for all paths of the PG. For all paths except sda, we'd > > > > just > > > > see > > > > a switch to TRANSITIONING without a previous SCSI error. > > > > > > > > With this patch, we'll dispatch I/O (usually an entire bunch) > > > > to > > > > these > > > > paths despite seeing them in TRANSITIONING state. Eventually, > > > > when > > > > the > > > > SCSI responses are received, this leads to path failures. If > > > > I/O > > > > latencies are small, this happens after a few ms. In that case, > > > > the > > > > goal of Brian's patch is not reached, because the time until > > > > path > > > > failure would still be on the order of milliseconds. OTOH, if > > > > latencies > > > > are high, it takes substantially longer for the kernel to > > > > realize > > > > that > > > > the path is non-functional, while other, good paths may be > > > > idle. I > > > > fail > > > > to see the benefit. > > > > > > > > > > I'm not sure everyone agrees with you on the meaning of > > > transitioning. > > > > > > If we go from non-optimized to optimized or standby to non- > > > opt/optimized > > > we don't want to try other paths because it can cause thrashing. > > > > But only with explicit ALUA, or am I missing something? I agree > > that > > the host shouldn't initiate a PG switch if it encounters > > transitioning > > state. I also agree that for transitioning towards a "better" > > state, > > e.g. standby to (non)-optimized, failing the path would be > > questionable. Unfortunately we don't know in which "direction" the > > path > > is transitioning - it could be for 'better' or 'worse'. I suppose > > that > > in the case of a PG switch, it can happen that we dispatch I/O to a > > device that used to be in Standby and is now transitioning. Would > > it > > make sense to remember the previous state and "guess" what we're > > going > > to transition to? I.e. if the previous state was "Standby", it's > > probably going to be (non)optimized after the transition, and vice- > > versa? > > > > > We just > > > need to transition resources before we can fully use the path. It > > > could > > > be a local cache operation or for distributed targets it could be > > > a > > > really > > > expensive operation. > > > > > > For both though, it can take longer than the retries we get from > > > scsi-ml. > > > > So if we want to do "the right thing", we'd continue dispatching to > > the > > device until either the state changes or the device-reported > > transition > > timeout has expired? > > > Not quite. > I think multipath should make the decision, and we shouldn't try to > out-guess multipathing from the device driver. 
> We should only make the information available (to wit: the path > state) > for multipath to make the decision. The only information that dm-multipath currently has is the block-level error code. The SCSI layer uses BLK_STS_AGAIN for "transitioning". Can dm-multipath assume that BLK_STS_AGAIN *always* has these semantics, regadless of the low-level driver? It isn't clearly documented, AFAICS. Doing this properly requires additional logic in dm-mpath.c to treat BLK_STS_AGAIN differently from other path errors. IIUC, what Brian and Mike say is that dm-multipath may do path failover, but not PG failover for BLK_STS_AGAIN. We currently don't have this logic, but we could work on it. IMO this means that Brian's patch would be a step in the wrong direction: by returning BLK_STS_OK instead of BLK_STS_AGAIN, dm- multipath gets *less* information about path state than before. dm-multipath has 3 ways to get notified of transitioning state (correct me if I'm missing anything): 1. I/O queuing fails with BLK_STS_AGAIN in scsi_prep_fn(). This won't happen any more with Brian's patch. 2. I/O returns from target with CHECK CONDITION / NOT READY / 0x4:0xa (LU not accessible - asymmetric state change transition). In this case the SCSI layer applies the "maybe_retry" logic, i.e. it retries the command. When retries are exhausted, it fails the command with BLK_STS_AGAIN. 3. multipathd in user space sees a TRANSITIONING status during routine path checking. Currently multipathd uses prio=0 for TRANSITIONING state, which is even less than the prio used for STANDBY (1). Thus multipathd would initiate a PG switch sooner or later. I believe these prio assignments by multipathd are questionable, but that's a different discussion. dm-multipath interprets BLK_STS_AGAIN in 2.) as path error, and will thus do a path failover in the PG. In a case like Brian's, the other paths in the PG will likely also be in TRANSITIONING state, so procedure 2.) will repeat until all paths in the PG have failed. In that situation, dm-multipath will initiate a PG switch, which is what Brian wanted to avoid in the first place. > But if multipath decides to send I/O via a path which we _think_ it's > in > transitioning we shouldn't be arguing with that (ie we shouldn't > preempt > is from prep_fn) but rather let it rip; who knows, maybe there was a > reason for that. You seem to assume that multipath knows more about the device then the SCSI layer. Isn't the opposite true? dm-multipath would never know that the path is "transitioning", and it has zero knowledge about the transition timeout. Only the ALUA code knows that certain path devices will change their status simultaneously, always. The only additional "knowledge" that multipath has is knowledge about path siblings and PGs. By returning BLK_STS_OK from scsi_prep_fn(), we hold back information that the multipath layer could use for an informed decision. > The only thing this patch changes (for any current setup) is that > we'll > incur additional round-trip times for I/O being sent in between the > first I/O returning with 'transitioning' and multipath picking up the > new path state. This I/O will be returned more-or-less immediately by > the target, so really we're only having an additional round-trip > latency > for each I/O during that time. This is at most in the milliseconds > range, so I guess we can live with that. Side note: AFAICS it'll be several round-trips (until SCSI retries are exhausted, as this is not a "fast io fail" case). 
Anyway, you're right, a PG switch will happen after a few milliseconds. That's what Brian wanted to avoid. IOW, the patch won't achieve what it's supposed to achieve. *But* it will hold back information that dm- multipath could use for handling transitions smarter than it currently does. > If, however, multipath fails to pick up the 'transitioning' state > then > this won't work, and we indeed will incur a regression. But that > arguably is an issue with multipath ... dm-multipath currently doesn't. Requesting that it should is a valid feature request, no less and no more. I'm not against working on this feature. But the generic device model in dm-multipath isn't quite powerful enough for this. We could try to handle it more cleverly in user space, but yet I don't see how it could be done. Cheers, Martin
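A hedged sketch of the "path failover but no PG failover" distinction discussed above. blk_path_error() and BLK_STS_AGAIN are real; the helper names and the struct pgpath handling are illustrative, not the actual dm-mpath code.

/*
 * Illustrative only: how an end_io handler could single out
 * BLK_STS_AGAIN from other path errors.
 */
static void handle_path_error(blk_status_t error, struct pgpath *pgpath)
{
	if (!blk_path_error(error))
		return;					/* target error: surface it to the caller */

	if (error == BLK_STS_AGAIN)
		fail_path_within_pg(pgpath);		/* placeholder: retry on sibling paths, keep the PG */
	else
		fail_path_allow_pg_switch(pgpath);	/* placeholder: normal failover, PG switch allowed */
}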
On Fri, 2022-05-20 at 19:52 -0700, Brian Bunker wrote: > From my perspective, the ALUA transitioning state is a temporary > state > where the target is saying that it does not have a permanent state > yet. Having the initiator try another pg to me doesn't seem like the > right thing for it to do. I agree. Unfortunately, there's no logic in dm-multipath saying "I may switch paths inside a PG, but I may not do PG failover". > If the target wanted the initiator to use a > different pg, it should use an ALUA state which would make that > clear, > standby, unavailable, etc. The target would only return an error > state > if it was aware that some other path is in an active state. When > transitioning is returned, I don't think the initiator should assume > that any other pg would be a better choice. I think it should assume > that the target will make its intention clear for that path with a > permanent state within a transition timeout. For me the question is still whether trying to send I/O to the path that is known not to be able to process it makes sense. As noted elsewhere, your patch just delays the BLK_STS_AGAIN by a few milliseconds. You want to avoid a PG switch, and I second that, but IMO that needs a different approach. > From my perspective the right thing to do is to let the ALUA handler > do what it is trying to do. If the pg state is transitioning and > within the transition timeout it should continue to retry that > request > checking each time the transition timeout. But this means that we should modify the logic not only in alua_prep_fn() but also for handling of NOT READY conditions, either in alua_check_sense() or in scsi_io_completion_action(). I agree that this would make a lot of sense, perhaps more than trying to implement a cleverer logic in dm-multipath as discussed between Hannes and myself. This is what we need to figure out first: Do we want to change the logic in the multipath layer, making it somehow aware of the special nature of "transitioning" state, or should we tune the retry logic in the SCSI layer such that dm-multipath will "do the right thing" automatically? Regards Martin
> On May 23, 2022, at 9:03 AM, Martin Wilck <mwilck@suse.com> wrote: > > On Fri, 2022-05-20 at 19:52 -0700, Brian Bunker wrote: >> From my perspective, the ALUA transitioning state is a temporary >> state >> where the target is saying that it does not have a permanent state >> yet. Having the initiator try another pg to me doesn't seem like the >> right thing for it to do. > > I agree. Unfortunately, there's no logic in dm-multipath saying "I may > switch paths inside a PG, but I may not do PG failover". > >> If the target wanted the initiator to use a >> different pg, it should use an ALUA state which would make that >> clear, >> standby, unavailable, etc. The target would only return an error >> state >> if it was aware that some other path is in an active state.When >> transitioning is returned, I don't think the initiator should assume >> that any other pg would be a better choice. I think it should assume >> that the target will make its intention clear for that path with a >> permanent state within a transition timeout. > > For me the question is still whether trying to send I/O to the path > that is known not to be able to process it makes sense. As noted > elsewhere, you patch just delays the BLK_STS_AGAIN by a few > milliseconds. You want to avoid a PG switch, and I second that, but IMO > that needs a different approach. > >> From my perspective the right thing to do is to let the ALUA handler >> do what it is trying to do. If the pg state is transitioning and >> within the transition timeout it should continue to retry that >> request >> checking each time the transition timeout. > > But this means that we should modify the logic not only in > alua_prep_fn() but also for handling of NOT READY conditions, either in > alua_check_sense() or in scsi_io_completion_action(). > I agree that this would make a lot of sense, perhaps more than trying > to implement a cleverer logic in dm-multipath as discussed between > Hannes and myself. > > This is what we need to figure out first: Do we want to change the > logic in the multipath layer, making it somehow aware of the special > nature of "transitioning" state, or should we tune the retry logic in > the SCSI layer such that dm-multipath will "do the right thing" > automatically? > > Regards > Martin > I think that a couple of things are needed to get to where my expectation of how it should work would be. First this code should come out of the not ready sense check. The reason is that it will continually override alua_check’s attempt to change the pg state as it exceeds the allowed time and the path state is changed to standby to handle a misbehaving target which stays in transitioning past the timer. 
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -409,20 +409,12 @@ static char print_alua_state(unsigned char state)
 static enum scsi_disposition alua_check_sense(struct scsi_device *sdev,
					      struct scsi_sense_hdr *sense_hdr)
 {
-	struct alua_dh_data *h = sdev->handler_data;
-	struct alua_port_group *pg;
-
 	switch (sense_hdr->sense_key) {
 	case NOT_READY:
 		if (sense_hdr->asc == 0x04 && sense_hdr->ascq == 0x0a) {
 			/*
 			 * LUN Not Accessible - ALUA state transition
 			 */
-			rcu_read_lock();
-			pg = rcu_dereference(h->pg);
-			if (pg)
-				pg->state = SCSI_ACCESS_STATE_TRANSITIONING;
-			rcu_read_unlock();

Second, for this to work how it used to work in Linux and how it works for other OS’s where an ALUA state transition is not a fail-path response:

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 8d18cc7e510e..33828aa44347 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -776,11 +776,11 @@ static void scsi_io_completion_action(struct scsi_cmnd *cmd, int result)
 			case 0x1b: /* sanitize in progress */
 			case 0x1d: /* configuration in progress */
 			case 0x24: /* depopulation in progress */
 				action = ACTION_DELAYED_RETRY;
 				break;
 			case 0x0a: /* ALUA state transition */
-				blk_stat = BLK_STS_AGAIN;
-				fallthrough;
+				action = ACTION_REPREP;
+				break;

Because, as you said, the expiration check for whether to continually allow new I/O on the pg is in the prep function of the ALUA handler. I think that this does introduce a lot of error processing for a target that will respond quickly. Maybe there is some way to use the equivalent of ACTION_DELAYED_RETRY so that the target is not retried as aggressively in the transitioning state.

It is possible, maybe likely, that in other OS’s this retry is done at the multipath level. The DSM from Microsoft on GitHub would seem to show that Windows does it that way. The multipath-tools in Linux, though, don’t seem to use sense data to make decisions, so getting this logic in would seem like a decent set of changes plus getting the buy-in to go that route.

Thanks,
Brian
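As a hedged variation of the scsi_lib.c hunk above, the 0x0a case could instead take the same branch as the other "in progress" ASCQs; this is untested and only illustrates the ACTION_DELAYED_RETRY remark in the mail above.

			case 0x0a: /* ALUA state transition */
				action = ACTION_DELAYED_RETRY;	/* pace retries like the other "in progress" cases */
				break;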
On Mon, May 02, 2022 at 08:09:17AM -0700, Brian Bunker wrote:
> 	case SCSI_ACCESS_STATE_ACTIVE:
> 	case SCSI_ACCESS_STATE_LBA:
> -		return BLK_STS_OK;
> 	case SCSI_ACCESS_STATE_TRANSITIONING:
> -		return BLK_STS_AGAIN;
> +		return BLK_STS_OK;

As there is a lot of discussion on BLK_STS_AGAIN in the thread: independent of the actual outcome here, BLK_STS_AGAIN is fundamentally the wrong thing to return here. BLK_STS_AGAIN must only be returned for REQ_NOWAIT requests that would have to block.
On 5/24/22 07:25, Christoph Hellwig wrote: > On Mon, May 02, 2022 at 08:09:17AM -0700, Brian Bunker wrote: >> case SCSI_ACCESS_STATE_ACTIVE: >> case SCSI_ACCESS_STATE_LBA: >> - return BLK_STS_OK; >> case SCSI_ACCESS_STATE_TRANSITIONING: >> - return BLK_STS_AGAIN; >> + return BLK_STS_OK; > > As there is a lot of discussion on BLK_STS_AGAIN in the thread: > Independent of the actual outcome here, BLK_STS_AGAIN is fundamentally > the wrong thing to return here. BLK_STS_AGAIN must only be returned > for REQ_NOWAIT requests that would have to block. Guess we should document that. It wasn't clear to me that this is a requirement. Cheers, Hannes
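For reference, the convention described above usually shows up in callers as a pattern like the following illustrative fragment; it is not a proposed change to alua_prep_fn().

	/* BLK_STS_AGAIN is reserved for submitters that asked not to block. */
	if (req->cmd_flags & REQ_NOWAIT)
		return BLK_STS_AGAIN;
	/* otherwise the request has to be completed by blocking or requeueing */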
On Mon, 2022-05-23 at 09:52 -0700, Brian Bunker wrote: > > > On May 23, 2022, at 9:03 AM, Martin Wilck <mwilck@suse.com> wrote: > > > > On Fri, 2022-05-20 at 19:52 -0700, Brian Bunker wrote: > > > From my perspective, the ALUA transitioning state is a temporary > > > state > > > where the target is saying that it does not have a permanent > > > state > > > yet. Having the initiator try another pg to me doesn't seem like > > > the > > > right thing for it to do. > > > > I agree. Unfortunately, there's no logic in dm-multipath saying "I > > may > > switch paths inside a PG, but I may not do PG failover". > > > > > If the target wanted the initiator to use a > > > different pg, it should use an ALUA state which would make that > > > clear, > > > standby, unavailable, etc. The target would only return an error > > > state > > > if it was aware that some other path is in an active state.When > > > transitioning is returned, I don't think the initiator should > > > assume > > > that any other pg would be a better choice. I think it should > > > assume > > > that the target will make its intention clear for that path with > > > a > > > permanent state within a transition timeout. > > > > For me the question is still whether trying to send I/O to the path > > that is known not to be able to process it makes sense. As noted > > elsewhere, you patch just delays the BLK_STS_AGAIN by a few > > milliseconds. You want to avoid a PG switch, and I second that, but > > IMO > > that needs a different approach. > > > > > From my perspective the right thing to do is to let the ALUA > > > handler > > > do what it is trying to do. If the pg state is transitioning and > > > within the transition timeout it should continue to retry that > > > request > > > checking each time the transition timeout. > > > > But this means that we should modify the logic not only in > > alua_prep_fn() but also for handling of NOT READY conditions, > > either in > > alua_check_sense() or in scsi_io_completion_action(). > > I agree that this would make a lot of sense, perhaps more than > > trying > > to implement a cleverer logic in dm-multipath as discussed between > > Hannes and myself. > > > > This is what we need to figure out first: Do we want to change the > > logic in the multipath layer, making it somehow aware of the > > special > > nature of "transitioning" state, or should we tune the retry logic > > in > > the SCSI layer such that dm-multipath will "do the right thing" > > automatically? > > > > Regards > > Martin > > > I think that a couple of things are needed to get to where my > expectation of how it should work would be. First this code should > come out of the not ready sense check. The reason is that it will > continually override alua_check’s attempt to change the pg state as > it exceeds the allowed time and the path state is changed to standby > to handle a misbehaving target which stays in transitioning past the > timer. 
> > --- a/drivers/scsi/device_handler/scsi_dh_alua.c > +++ b/drivers/scsi/device_handler/scsi_dh_alua.c > @@ -409,20 +409,12 @@ static char print_alua_state(unsigned char > state) > static enum scsi_disposition alua_check_sense(struct scsi_device > *sdev, > struct scsi_sense_hdr > *sense_hdr) > { > - struct alua_dh_data *h = sdev->handler_data; > - struct alua_port_group *pg; > - > switch (sense_hdr->sense_key) { > case NOT_READY: > if (sense_hdr->asc == 0x04 && sense_hdr->ascq == > 0x0a) { > /* > * LUN Not Accessible - ALUA state transition > */ > - rcu_read_lock(); > - pg = rcu_dereference(h->pg); > - if (pg) > - pg->state = > SCSI_ACCESS_STATE_TRANSITIONING; > - rcu_read_unlock(); > Good point. But I'd say that rather than removing it, this code should be made aware of the timer. > Second for this to work how it used to work in Linux and how it works > for other OS’s where ALUA state transition is not a fail path > response: > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c > index 8d18cc7e510e..33828aa44347 100644 > --- a/drivers/scsi/scsi_lib.c > +++ b/drivers/scsi/scsi_lib.c > @@ -776,11 +776,11 @@ static void scsi_io_completion_action(struct > scsi_cmnd *cmd, int result) > case 0x1b: /* sanitize in progress */ > case 0x1d: /* configuration in > progress */ > case 0x24: /* depopulation in > progress */ > action = ACTION_DELAYED_RETRY; > break; > case 0x0a: /* ALUA state transition > */ > - blk_stat = BLK_STS_AGAIN; > - fallthrough; > + action = ACTION_REPREP; > + break; > > Because, as you said the expiation check for whether to continually > allow new I/O on the pg is in the prep function of the ALUA handler. Not any more with your patch, or what am I missing? > I think that this does introduce a lot error processing for a target > that will respond quickly. Maybe there is some way to use the > equivalent of ACTION_DELAYED_RETRY so that the target is not as > aggressively retried in the transitioning state. IMHO ACTION_DELAYED_RETRY would be a good option. Note that it blocks the device; thus it prevents dispatching IO to the device according to the scsi_device_block() logic. This way it automatically introduces some delay. > > It is possible, maybe likely, in other OS’s that this retry is done > at the multipath level. The DSM from Microsoft in GitHub would seem > to show that Windows does it that way. The multipath-tools in Linux > though don’t seem to use sense data to make decisions so getting this > logic in would seem like a decent set of changes and getting the buy- > in to go that route. IMO we're closer to a good solution on the SCSI side. I've no idea how Microsoft's DSM works, but it probably has a more detailed concept of path states than Linux' dm-multipath, which operates on abstract block devices. Neither dm-multipath nor multipath-tools have a concept of transitioning state. While it'd be possible to add this logic in multipath-tools (user space), I see a couple of hard-to-solve problems: - multipathd has no concept of ALUA. The ALUA functionality is split into the device handler part (handled in the kernel, responsible for explicit PG switching and error handling), and the "prioritizer" handled in user space. ALUA is just one out of several path prioritizers. The generic prioritizer concept doesn't define a "transitioning" property. We'd need to try to generalize the concept, and possibly figure out how to implement it with other prioritizers. - the dm-multipath path group concept is abstract and, in general, independent of ALUA's port groups. 
The core property of ALUA (all ports in a group always have the same primary port state) doesn't generally apply to multipath PGs. Nothing prevents users from configuring different path grouping policies, like multibus or grouping by latency, even if ALUA is supported. The idea to prevent dm-multipath from switching PG away from a transitioning path group is only valid if dm-multipath PGs map 1:1 to ALUA port groups. - ALUA itself is more generic than the scenario you describe. You say that if a path is transitioning, trying to switch PGs is wrong. This is true if there's just one other PG, and that other PG is unavailable. But there may be setups with more than 2 PGs, where one PG is perfectly healthy while another is transitioning. In such a case, it'd be reasonable to switch PGs towards the healthy one. - multipathd doesn't get notified of ALUA state changes. It polls in rather long time intervals (5s). If this logic is implemented in multipathd, we'll see considerable latency in PG switching. - In general, multipathd doesn't have full control over dm-multipath PG switching. Some logic is in the kernel and some in user space. It's not easy for multipathd to prevent the kernel from switching to a certain PG. Basically the only way is to set all paths in that PG to failed status. Here's one dumb question: ALUA defines "transitioning" in an abstract fashion, as a "movement from one target port asymmetric access state to another". This doesn't say anything about the state the port is transitioning towards. IOW, it could be transitioning towards "STANDBY" or "UNAVAILABLE", perhaps even "OFFLINE". I am unsure if this makes sense in practice; I _guess_ that most of the time "transitioning" means something like "booting". Your suggestions seem to confirm that: at least for PURE hardware, "transitioning" always means "will be usable soon". Is this correct? And if yes, to what extent could this be generalized to other arrays, past, present, and future? Regards Martin
diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c
index 37d06f993b76..1d9be771f3ee 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -1172,9 +1172,8 @@ static blk_status_t alua_prep_fn(struct scsi_device *sdev, struct request *req)
 	case SCSI_ACCESS_STATE_OPTIMAL:
 	case SCSI_ACCESS_STATE_ACTIVE:
 	case SCSI_ACCESS_STATE_LBA:
-		return BLK_STS_OK;
 	case SCSI_ACCESS_STATE_TRANSITIONING:
-		return BLK_STS_AGAIN;
+		return BLK_STS_OK;
 	default:
 		req->rq_flags |= RQF_QUIET;