
scsi: iscsi: prefer xmit of DataOut before new cmd

Message ID 20220607131953.11584-1-d.bogdanov@yadro.com
State New
Series scsi: iscsi: prefer xmit of DataOut before new cmd

Commit Message

Dmitry Bogdanov June 7, 2022, 1:19 p.m. UTC
In iscsi_data_xmit() (the TX worker) the queue of new SCSI commands, which
is replenished in parallel, is walked first. Only after that queue has been
emptied does the function start sending pending DataOut PDUs. That leads to
DataOut timer timeouts on the target side and to connection reinstatement.

This patch swaps the walk through the new commands queue and the pending
DataOut queue, to give preference to ongoing commands over new ones.

Reviewed-by: Konstantin Shelekhin <k.shelekhin@yadro.com>
Signed-off-by: Dmitry Bogdanov <d.bogdanov@yadro.com>
---
 drivers/scsi/libiscsi.c | 44 ++++++++++++++++++++++-------------------
 1 file changed, 24 insertions(+), 20 deletions(-)

Comments

Mike Christie June 7, 2022, 4:06 p.m. UTC | #1
On 6/7/22 10:55 AM, Mike Christie wrote:
> On 6/7/22 8:19 AM, Dmitry Bogdanov wrote:
>> In function iscsi_data_xmit (TX worker) there is walking through the
>> queue of new SCSI commands that is replenished in parallell. And only
>> after that queue got emptied the function will start sending pending
>> DataOut PDUs. That lead to DataOut timer time out on target side and
>> to connection reinstatment.
>>
>> This patch swaps walking through the new commands queue and the pending
>> DataOut queue. To make a preference to ongoing commands over new ones.
>>
> 
> ...
> 
>>  		task = list_entry(conn->cmdqueue.next, struct iscsi_task,
>> @@ -1594,28 +1616,10 @@ static int iscsi_data_xmit(struct iscsi_conn *conn)
>>  		 */
>>  		if (!list_empty(&conn->mgmtqueue))
>>  			goto check_mgmt;
>> +		if (!list_empty(&conn->requeue))
>> +			goto check_requeue;
> 
> 
> 
> Hey, I've been posting a similar patch:
> 
> https://www.spinics.net/lists/linux-scsi/msg156939.html
> 
> A problem I hit is a possible perf regression so I tried to allow
> us to start up a burst of cmds in parallel. It's pretty simple where
> we allow up to a queue's worth of cmds to start. It doesn't try to
> check that all cmds are from the same queue or anything fancy to try
> and keep the code simple. Mostly just assuming users might try to bunch
> cmds together during submission or they might hit the queue plugging
> code.
> 
> What do you think?

Oh yeah, what about a modparam batch_limit? It's between 0 and cmd_per_lun.
0 would check after every transmission like above.
Dmitry Bogdanov June 8, 2022, 2:16 p.m. UTC | #2
Hi Mike,

>On 6/7/22 10:55 AM, Mike Christie wrote:
>> On 6/7/22 8:19 AM, Dmitry Bogdanov wrote:
>>> In function iscsi_data_xmit (TX worker) there is walking through the
>>> queue of new SCSI commands that is replenished in parallell. And only
>>> after that queue got emptied the function will start sending pending
>>> DataOut PDUs. That lead to DataOut timer time out on target side and
>>> to connection reinstatment.
>>>
>>> This patch swaps walking through the new commands queue and the pending
>>> DataOut queue. To make a preference to ongoing commands over new ones.
>>>
>>
>> ...
>>
>>>              task = list_entry(conn->cmdqueue.next, struct iscsi_task,
>>> @@ -1594,28 +1616,10 @@ static int iscsi_data_xmit(struct iscsi_conn *conn)
>>>               */
>>>              if (!list_empty(&conn->mgmtqueue))
>>>                      goto check_mgmt;
>>> +            if (!list_empty(&conn->requeue))
>>> +                    goto check_requeue;
>>
>>
>>
>> Hey, I've been posting a similar patch:
>>
>> https://www.spinics.net/lists/linux-scsi/msg156939.html
>>
>> A problem I hit is a possible pref regression so I tried to allow
>> us to start up a burst of cmds in parallel. It's pretty simple where
>> we allow up to a queue's worth of cmds to start. It doesn't try to
>> check that all cmds are from the same queue or anything fancy to try
>> and keep the code simple. Mostly just assuming users might try to bunch
>> cmds together during submission or they might hit the queue plugging
>> code.
>>
>> What do you think?
>
>Oh yeah, what about a modparam batch_limit? It's between 0 and cmd_per_lun.
>0 would check after every transmission like above.

 Did you really hit a perf regression? I cannot imagine how that would be
possible.
A DataOut PDU contains data too, so throughput performance cannot be
decreased by sending DataOut PDUs.

 The only concern is latency performance. But that is not an easy question.
IMHO, a system should strive to reduce the maximum latency almost without
impacting the minimum (prefer current commands), instead of reducing the
minimum latency to the detriment of the maximum (prefer new commands).

 Any preference of new commands over current ones looks like IO scheduler
functionality, but at an underlying layer - a BUS layer, so to say.
I think that is a matter of future investigation/development.

BR,
 Dmitry
Mike Christie June 8, 2022, 3:36 p.m. UTC | #3
On 6/8/22 9:16 AM, Dmitriy Bogdanov wrote:
> Hi Mike,
> 
>> On 6/7/22 10:55 AM, Mike Christie wrote:
>>> On 6/7/22 8:19 AM, Dmitry Bogdanov wrote:
>>>> In function iscsi_data_xmit (TX worker) there is walking through the
>>>> queue of new SCSI commands that is replenished in parallell. And only
>>>> after that queue got emptied the function will start sending pending
>>>> DataOut PDUs. That lead to DataOut timer time out on target side and
>>>> to connection reinstatment.
>>>>
>>>> This patch swaps walking through the new commands queue and the pending
>>>> DataOut queue. To make a preference to ongoing commands over new ones.
>>>>
>>>
>>> ...
>>>
>>>>              task = list_entry(conn->cmdqueue.next, struct iscsi_task,
>>>> @@ -1594,28 +1616,10 @@ static int iscsi_data_xmit(struct iscsi_conn *conn)
>>>>               */
>>>>              if (!list_empty(&conn->mgmtqueue))
>>>>                      goto check_mgmt;
>>>> +            if (!list_empty(&conn->requeue))
>>>> +                    goto check_requeue;
>>>
>>>
>>>
>>> Hey, I've been posting a similar patch:
>>>
>>> https://www.spinics.net/lists/linux-scsi/msg156939.html
>>>
>>> A problem I hit is a possible pref regression so I tried to allow
>>> us to start up a burst of cmds in parallel. It's pretty simple where
>>> we allow up to a queue's worth of cmds to start. It doesn't try to
>>> check that all cmds are from the same queue or anything fancy to try
>>> and keep the code simple. Mostly just assuming users might try to bunch
>>> cmds together during submission or they might hit the queue plugging
>>> code.
>>>
>>> What do you think?
>>
>> Oh yeah, what about a modparam batch_limit? It's between 0 and cmd_per_lun.
>> 0 would check after every transmission like above.
> 
>  Did you really face with a perf regression? I could not imagine how it is
> possible.
> DataOut PDU contains a data too, so a throughput performance cannot be
> decreased by sending DataOut PDUs.


We can agree that queue plugging and batching improves throughput right?
The app or block layer may try to batch commands. It could be with something
like fio's batch args or you hit the block layer queue plugging.

With the current code we can end up sending all cmds to the target in a way
the target can send them to the real device batched. For example, we send off
the initial N scsi command PDUs in one run of iscsi_data_xmit. The target reads
them in, and sends off N R2Ts. We are able to read N R2Ts in the same call.
And again we are able to send the needed data for them in one call of
iscsi_data_xmit. The target is able to read in the data and send off the
WRITEs to the physical device in a batch.

With your patch, we can end up not batching them like the app/block layer
intended. For example, we now call iscsi_data_xmit and enter the cmdqueue
loop. We've sent N - M scsi cmd PDUs, then see that we've got an incoming R2T
to handle. So we goto check_requeue and send the needed data. The target then
starts to send the cmd to the physical device. If we have read in multiple
R2Ts then we will continue the requeue loop, and so we might be able to send
the data fast enough that the target can then send those commands to the
physical device. But we've now broken up the batching that the upper layers
sent to us and that we were preserving before.

> 
>  The only thing is a latency performance. But that is not an easy question.

Agree latency is important and that's why I was saying we can make it config
option. Users can continue to try and batch their cmds and we don't break
them. We also fix the bug in that we don't get stuck in the cmdqueue loop
always taking in new cmds.

> IMHO, a system should strive to reduce a maximum value of the latency almost
> without impacting of a minimum value (prefer current commands) instead of
> to reduce a minimum value of the latency to the detriment of maximum value
> (prefer new commands).
> 
>  Any preference of new commands over current ones looks like an io scheduler

I can see your point of view where you see it as preferring new cmds
vs existing. It's probably due to my patch not hooking into commit_rqs
and trying to figure out the batching exactly. It's more of a simple
estimate.

However, that's not what I'm talking about. I'm talking about the block
layer / iosched has sent us these commands as a batch. We are now more
likely to break that up.

> functionality, but on underlying layer, so to say a BUS layer.
> I think is a matter of future investigation/development.
> 
> BR,
>  Dmitry
Ulrich Windl June 9, 2022, 6:56 a.m. UTC | #4
>>> Mike Christie <michael.christie@oracle.com> wrote on 08.06.2022 at 17:36 in
message <48af6f5f-c3b6-ac65-836d-518153ab2dd5@oracle.com>:
> On 6/8/22 9:16 AM, Dmitriy Bogdanov wrote:
>> Hi Mike,
>> 
>>> On 6/7/22 10:55 AM, Mike Christie wrote:
>>>> On 6/7/22 8:19 AM, Dmitry Bogdanov wrote:
>>>>> In function iscsi_data_xmit (TX worker) there is walking through the
>>>>> queue of new SCSI commands that is replenished in parallell. And only
>>>>> after that queue got emptied the function will start sending pending
>>>>> DataOut PDUs. That lead to DataOut timer time out on target side and
>>>>> to connection reinstatment.
>>>>>
>>>>> This patch swaps walking through the new commands queue and the pending
>>>>> DataOut queue. To make a preference to ongoing commands over new ones.
>>>>>
>>>>
>>>> ...
>>>>
>>>>>              task = list_entry(conn->cmdqueue.next, struct iscsi_task,
>>>>> @@ -1594,28 +1616,10 @@ static int iscsi_data_xmit(struct iscsi_conn *conn)
>>>>>               */
>>>>>              if (!list_empty(&conn->mgmtqueue))
>>>>>                      goto check_mgmt;
>>>>> +            if (!list_empty(&conn->requeue))
>>>>> +                    goto check_requeue;
>>>>
>>>>
>>>>
>>>> Hey, I've been posting a similar patch:
>>>>
>>>> https://www.spinics.net/lists/linux-scsi/msg156939.html
>>>>
>>>> A problem I hit is a possible pref regression so I tried to allow
>>>> us to start up a burst of cmds in parallel. It's pretty simple where
>>>> we allow up to a queue's worth of cmds to start. It doesn't try to
>>>> check that all cmds are from the same queue or anything fancy to try
>>>> and keep the code simple. Mostly just assuming users might try to bunch
>>>> cmds together during submission or they might hit the queue plugging
>>>> code.
>>>>
>>>> What do you think?
>>>
>>> Oh yeah, what about a modparam batch_limit? It's between 0 and cmd_per_lun.
>>> 0 would check after every transmission like above.
>> 
>>  Did you really face with a perf regression? I could not imagine how it is
>> possible.
>> DataOut PDU contains a data too, so a throughput performance cannot be
>> decreased by sending DataOut PDUs.
> 
> 
> We can agree that queue plugging and batching improves throughput right?

Hi!

Isn't that the classic "throughput vs. response time"? I think you cannot optimize one without affecting the other.
(I can remember discussions like "You are sending one ethernet packet for each key pressed; are you crazy?" when network admins felt worried about throughput)

> The app or block layer may try to batch commands. It could be with something
> like fio's batch args or you hit the block layer queue plugging.
> 
> With the current code we can end up sending all cmds to the target in a way
> the target can send them to the real device batched. For example, we send off
> the initial N scsi command PDUs in one run of iscsi_data_xmit. The target reads
> them in, and sends off N R2Ts. We are able to read N R2Ts in the same call.
> And again we are able to send the needed data for them in one call of
> iscsi_data_xmit. The target is able to read in the data and send off the
> WRITEs to the physical device in a batch.
> 
> With your patch, we can end up not batching them like the app/block layer
> intended. For example, we now call iscsi_data_xmit and in the cmdqueue loop.
> We've sent N - M scsi cmd PDUs, then see that we've got an incoming R2T to
> handle. So we goto check_requeue. We send the needed data. The target then
> starts to send the cmd to the physical device. If we have read in multiple
> R2Ts then we will continue the requeue loop. And so we might be able to send
> the data fast enough that the target can then send those commands to the
> physical device. But we've now broken up the batching the upper layers sent
> to us and we were doing before.
> 
>> 
>>  The only thing is a latency performance. But that is not an easy question.
> 
> Agree latency is important and that's why I was saying we can make it config
> option. Users can continue to try and batch their cmds and we don't break
> them. We also fix the bug in that we don't get stuck in the cmdqueue loop
> always taking in new cmds.
> 
>> IMHO, a system should strive to reduce a maximum value of the latency almost
>> without impacting of a minimum value (prefer current commands) instead of
>> to reduce a minimum value of the latency to the detriment of maximum value
>> (prefer new commands).
>> 
>>  Any preference of new commands over current ones looks like an io scheduler
> 
> I can see your point of view where you see it as preferring new cmds
> vs existing. It's probably due to my patch not hooking into commit_rqs
> and trying to figure out the batching exactly. It's more of a simple
> estimate.

Is it also about the classic "reads stall when all buffers are dirty" (reads to a fast device may time-out while writing to a slow device)?
There the solution was to limit the amount of dirty buffers, effectively leaving room for the other direction (i.e. read).
This is NOT about which SCSI commands are scheduled first; it's about lack of buffers causing a request to wait.

> 
> However, that's not what I'm talking about. I'm talking about the block
> layer / iosched has sent us these commands as a batch. We are now more
> likely to break that up.

Isn't the block-layer the correct place to "tune" that then?
E.g. Limiting "read_ahead_kb" (which can be rather large, depending on the filesystem)
Some storage systems "break up" large requests to smaller ones internally, causing an extra delay for such large requests, specifically if there is a bottleneck in the storage system.

So are we "tuning" at the right spot here?

> 
>> functionality, but on underlying layer, so to say a BUS layer.
>> I think is a matter of future investigation/development.

Regards,
Ulrich
Dmitry Bogdanov June 9, 2022, 9:02 a.m. UTC | #5
Hi Mike,

>>On 6/8/22 9:16 AM, Dmitriy Bogdanov wrote:
>>> On 6/7/22 10:55 AM, Mike Christie wrote:
>>>> On 6/7/22 8:19 AM, Dmitry Bogdanov wrote:
>>>>> In function iscsi_data_xmit (TX worker) there is walking through the
>>>>> queue of new SCSI commands that is replenished in parallell. And only
>>>>> after that queue got emptied the function will start sending pending
>>>>> DataOut PDUs. That lead to DataOut timer time out on target side and
>>>>> to connection reinstatment.
>>>>>
>>>>> This patch swaps walking through the new commands queue and the pending
>>>>> DataOut queue. To make a preference to ongoing commands over new ones.
>>>>>
>>>>
>>>> ...
>>>>
>>>>>              task = list_entry(conn->cmdqueue.next, struct iscsi_task,
>>>>> @@ -1594,28 +1616,10 @@ static int iscsi_data_xmit(struct iscsi_conn *conn)
>>>>>               */
>>>>>              if (!list_empty(&conn->mgmtqueue))
>>>>>                      goto check_mgmt;
>>>>> +            if (!list_empty(&conn->requeue))
>>>>> +                    goto check_requeue;
>>>>
>>>>
>>>>
>>>> Hey, I've been posting a similar patch:
>>>>
>>>> https://www.spinics.net/lists/linux-scsi/msg156939.html
>>>>
>>>> A problem I hit is a possible pref regression so I tried to allow
>>>> us to start up a burst of cmds in parallel. It's pretty simple where
>>>> we allow up to a queue's worth of cmds to start. It doesn't try to
>>>> check that all cmds are from the same queue or anything fancy to try
>>>> and keep the code simple. Mostly just assuming users might try to bunch
>>>> cmds together during submission or they might hit the queue plugging
>>>> code.
>>>>
>>>> What do you think?
>>>
>>> Oh yeah, what about a modparam batch_limit? It's between 0 and cmd_per_lun.
>>> 0 would check after every transmission like above.
>>
>>  Did you really face with a perf regression? I could not imagine how it is
>> possible.
>> DataOut PDU contains a data too, so a throughput performance cannot be
>> decreased by sending DataOut PDUs.
>
>
>We can agree that queue plugging and batching improves throughput right?
>The app or block layer may try to batch commands. It could be with something
>like fio's batch args or you hit the block layer queue plugging.

I agree that those features definitely give a throughput improvement on
local devices on serial interfaces like SATA1. Since SATA2 (Native Command
Queuing), devices can reorder incoming commands to provide the best
throughput; SCSI, I guess, has had a similar feature from the very beginning.
But on remote devices (iSCSI and FC) it is not 100% - the initiator's command
order may be reordered by the nature of the network protocol itself. I mean
1 PDU vs R2T+DataOut PDUs, PDU resends due to CRC errors, and so on.

>With the current code we can end up sending all cmds to the target in a way
>the target can send them to the real device batched. For example, we send off
>the initial N scsi command PDUs in one run of iscsi_data_xmit. The target reads
>them in, and sends off N R2Ts. We are able to read N R2Ts in the same call.
>And again we are able to send the needed data for them in one call of
>iscsi_data_xmit. The target is able to read in the data and send off the
>WRITEs to the physical device in a batch.
>
>With your patch, we can end up not batching them like the app/block layer
>intended. For example, we now call iscsi_data_xmit and in the cmdqueue loop.
>We've sent N - M scsi cmd PDUs, then see that we've got an incoming R2T to
>handle. So we goto check_requeue. We send the needed data. The target then
>starts to send the cmd to the physical device. If we have read in multiple
>R2Ts then we will continue the requeue loop. And so we might be able to send
>the data fast enough that the target can then send those commands to the
>physical device. But we've now broken up the batching the upper layers sent
>to us and we were doing before.

In my head everything is vice-versa :)
The current code breaks batching by sending new commands instead of completing
the transmission of current commands that could form their own batch.
With my patch the batches will be received (in a best-effort manner) on the
target port in the same order as the block layer sends them to the iSCSI
layer. BTW, other target ports may receive other commands in the meantime,
so the real device will receive a broken batch anyway. The initiator side
cannot guarantee batching on the real device.

Let's imagine that the block layer submits two batches of big commands and
then submits single small commands, so the command queue will be
"ABCD EFGH" + "S" + "S" + "S" + "S"
and this is what will be sent out:
current code (worst case):    A1B1C1D1 E1F1G1H1 A2 S B2 S C2D2 E2 S F2 S G2H2 (breaks batch)
current code (best case):     A1B1C1D1 E1F1G1H1 SSSS A2B2C2D2 E2F2G2H2 (reorder)
current code (bug addressed): A1B1C1D1 E1F1G1H1 SS...S (connection fail)
current code (!impossible):   A1B1C1D1 E1F1G1H1 A2B2C2D2 E2F2G2H2 SSSS (inorder)
with my patch (best case):    A1B1C1D1 E1F1G1H1 A2B2C2D2 E2F2G2H2 SSSS (inorder)
with my patch (your case):    A1B1C1D1 E1 A2 F1 B2 G1H1 C2 E2 D2 F2G2H2 SSSS (still inorder)
with my patch (worst case):   A1B1C1D1 E1F1G1H1 A2 S B2 S C2D2 E2 S F2 S G2H2 (breaks batch)

My better best case (command order as the block layer submits it) will never
happen with the current code.
If the "S" commands come in parallel, the worst cases are the same: both may
break a batch with new commands. But if the "S" commands are already in the
queue, my patch will (most likely) produce the best case instead of the
worst case.


>>  The only thing is a latency performance. But that is not an easy question.
>
>Agree latency is important and that's why I was saying we can make it config
>option. Users can continue to try and batch their cmds and we don't break
>them. We also fix the bug in that we don't get stuck in the cmdqueue loop
>always taking in new cmds.

>> IMHO, a system should strive to reduce a maximum value of the latency almost
>> without impacting of a minimum value (prefer current commands) instead of
>> to reduce a minimum value of the latency to the detriment of maximum value
>> (prefer new commands).
>>
>>  Any preference of new commands over current ones looks like an io scheduler
>
>I can see your point of view where you see it as preferring new cmds
>vs existing. It's probably due to my patch not hooking into commit_rqs
>and trying to figure out the batching exactly. It's more of a simple
>estimate.
>
>However, that's not what I'm talking about. I'm talking about the block
>layer / iosched has sent us these commands as a batch. We are now more
>likely to break that up.

No, my patch does not break batching any more than the current code does;
instead it decreases the probability of breaking the currently transmitting
batch, by trying to transmit in the same order as commands come from the
block layer.

It's a very complicated question - how to schedule "new vs old" commands.
It's not just a counter. Even with a counter there are a lot of questions -
should it begin from the cmdqueue loop or from iscsi_data_xmit, why does it
have a static limit, is it per LUN or not, and so on.
That is effectively a new IO scheduler at the bus layer.

>> I think is a matter of future investigation/development.
I am not against it at all. But it should not delay a simple bugfix patch.

BR,
 Dmitry
Mike Christie June 9, 2022, 8:58 p.m. UTC | #6
On 6/9/22 4:02 AM, Dmitriy Bogdanov wrote:
> Hi Mike,
> 
>>> On 6/8/22 9:16 AM, Dmitriy Bogdanov wrote:
>>>> On 6/7/22 10:55 AM, Mike Christie wrote:
>>>>> On 6/7/22 8:19 AM, Dmitry Bogdanov wrote:
>>>>>> In function iscsi_data_xmit (TX worker) there is walking through the
>>>>>> queue of new SCSI commands that is replenished in parallell. And only
>>>>>> after that queue got emptied the function will start sending pending
>>>>>> DataOut PDUs. That lead to DataOut timer time out on target side and
>>>>>> to connection reinstatment.
>>>>>>
>>>>>> This patch swaps walking through the new commands queue and the pending
>>>>>> DataOut queue. To make a preference to ongoing commands over new ones.
>>>>>>
>>>>>
>>>>> ...
>>>>>
>>>>>>              task = list_entry(conn->cmdqueue.next, struct iscsi_task,
>>>>>> @@ -1594,28 +1616,10 @@ static int iscsi_data_xmit(struct iscsi_conn *conn)
>>>>>>               */
>>>>>>              if (!list_empty(&conn->mgmtqueue))
>>>>>>                      goto check_mgmt;
>>>>>> +            if (!list_empty(&conn->requeue))
>>>>>> +                    goto check_requeue;
>>>>>
>>>>>
>>>>>
>>>>> Hey, I've been posting a similar patch:
>>>>>
>>>>> https://www.spinics.net/lists/linux-scsi/msg156939.html
>>>>>
>>>>> A problem I hit is a possible pref regression so I tried to allow
>>>>> us to start up a burst of cmds in parallel. It's pretty simple where
>>>>> we allow up to a queue's worth of cmds to start. It doesn't try to
>>>>> check that all cmds are from the same queue or anything fancy to try
>>>>> and keep the code simple. Mostly just assuming users might try to bunch
>>>>> cmds together during submission or they might hit the queue plugging
>>>>> code.
>>>>>
>>>>> What do you think?
>>>>
>>>> Oh yeah, what about a modparam batch_limit? It's between 0 and cmd_per_lun.
>>>> 0 would check after every transmission like above.
>>>
>>>  Did you really face with a perf regression? I could not imagine how it is
>>> possible.
>>> DataOut PDU contains a data too, so a throughput performance cannot be
>>> decreased by sending DataOut PDUs.
>>
>>
>> We can agree that queue plugging and batching improves throughput right?
>> The app or block layer may try to batch commands. It could be with something
>> like fio's batch args or you hit the block layer queue plugging.
> 
> I agree that those features 100% gives an improvement of a throughput on local
> devices on serial interfaces like SATA1. Since SATA2 (Native Command Queuing)
> devices can reorder incoming commmands to provide the best thoughput.
> SCSI I guess has similar feature from the very beginning.
> But on remote devices (iSCSI and FC) it is not 100% - initiators's command
> order may be reordered by the network protocol nature itself. I mean 1PDU vs
> R2T+DataOut PDUs, PDU resending due to crc errors or something like that.

I think we are talking about slightly different things. You are coming up
with different possible scenarios where it doesn't work. I get them. You are
correct for those cases. But I'm not talking about those cases; I'm talking
about the specific case I described.

I'm saying we have targets where we use backends that get improved performance
when they get batched cmds. When the network is ok, and the user's app is
batching cmds, they come from the app down to the target's backend device as
a batch. My concern is that with your patch we will no longer get that behavior.

The reason is that the target and initiator can do:

1. initiator sends scsi cmd pdu1
2. target sends R2T
3. initiator sees R2T and hits the goto. Sends data
4. target reads in data. Sends cmd to block layer.
5. initiator sends scsi cmd pdu2
6. target sends R2T
7. initiator reads in R2T sends data.
8. target reads in data and sends cmd to block layer.

The problem here is that between 4 and 8 the block layer could have run the
queue and sent that one cmd to the real device already, because we now have
that extra delay. So when I implemented the fix in the same way as you did
and ran iostat, I would see a lower aqu-sz, for example.

With the current code we might not have that extra delay between 4 - 8 so
I would see:

1. initiator sends scsi cmd pdu1
2. initiator sends scsi cmd pdu2
3. initiator sends scsi cmd pdu3
4. target reads in all those cmd pdus
5. target sends R2Ts for each cmd.
6. initiator reads in all the R2Ts
7. initiator sends data for cmd 1 - 3.
8. target reads in data and sends cmd1 to block layer
9. target reads in data and sends cmd2 to block layer
10. target reads in data and sends cmd3 to block layer
11. block layer/iosched hits unplug limit and then sends
those cmds to the real device.

In this case the time between 8 and 9 is small since we are just reading
from the socket, building structs and passing the cmds down to the block
layer. Here, when I'm watching the backend device with iostat, I see higher
aqu-sz values.

Note that you might not be seeing this with LIO for example because we
plug/unplug on every cmd.

When the network is congested, the lun is shared, we are doing retries, the
machine gets bogged down, etc, then yeah, we might not get the above
behavior. The user knows this and it's expected. It's not expected that they
just update the kernel and get a perf drop in the normal case.

How do we not give these users a regression while fixing the bug?

> 
>> With the current code we can end up sending all cmds to the target in a way
>> the target can send them to the real device batched. For example, we send off
>> the initial N scsi command PDUs in one run of iscsi_data_xmit. The target reads
>> them in, and sends off N R2Ts. We are able to read N R2Ts in the same call.
>> And again we are able to send the needed data for them in one call of
>> iscsi_data_xmit. The target is able to read in the data and send off the
>> WRITEs to the physical device in a batch.
>>
>> With your patch, we can end up not batching them like the app/block layer
>> intended. For example, we now call iscsi_data_xmit and in the cmdqueue loop.
>> We've sent N - M scsi cmd PDUs, then see that we've got an incoming R2T to
>> handle. So we goto check_requeue. We send the needed data. The target then
>> starts to send the cmd to the physical device. If we have read in multiple
>> R2Ts then we will continue the requeue loop. And so we might be able to send
>> the data fast enough that the target can then send those commands to the
>> physical device. But we've now broken up the batching the upper layers sent
>> to us and we were doing before.
> 
> In my head everything is vice-versa :)

:)

> Current code breaks a batching by sending new commands instead of completing
> the transmission of current commands that could be in its own batch.

We are not talking about the same thing. I'm talking about specifically what
a user would do today to boost perf for their app where they have a single app
using the LUN. Single LUN per session. App uses something like fio's batch
args. You have it send down 32 cmds. When those complete, it sends down 32 again.
We are not getting a constant stream or mix of cmds from different sources.

We are not talking about multiple users using the same LUN or even same session.


> With my patch the batches will be received (in best effort manner) on target
> port in the same order as block layer sends to iSCSI layer. BTW, another target
> ports may receive other commands in meantime and the real device will receive
> broken batch anyway. The initiator side cannot guarantee batching on the real
> device.

The user does this. The user has set things up so they are using the device for
their one app. The app is sending down these commands as a batch. No other 
initiators are using the device.

It's *not*, say, a shared LUN example where we have N VMs on a LUN all doing
their own IO. The user does not expect the same perf for that type of case as
they do when they have specifically tuned their app and device usage like I
mentioned.

> 
> Lets imagine, that block layer submits two batches of big commands and then
> submits single small commands, the command queue will be 

That's not what I'm talking about. We are not getting a mix like this.


> "ABCD EFGH" + "S" + "S" + "S" + "S" 
> and that what will be sent out:
> current code (worst case):    A1B1C1D1 E1F1G1H1 A2 S B2 S C2D2 E2 S F2 S G2H2 (breaks batch)
> current code (best case):     A1B1C1D1 E1F1G1H1 SSSS A2B2C2D2 E2F2G2H2 (reorder)
> current code (bug addressed): A1B1C1D1 E1F1G1H1 SS...S (connection fail)
> current code (!impossible):   A1B1C1D1 E1F1G1H1 A2B2C2D2 E2F2G2H2 SSSS (inorder)
> with my patch (best case):    A1B1C1D1 E1F1G1H1 A2B2C2D2 E2F2G2H2 SSSS (inorder)
> with my patch (your case):    A1B1C1D1 E1 A2 F1 B2 G1H1 C2 E2 D2 F2G2H2 SSSS (still inorder)
> with my patch (worst case):   A1B1C1D1 E1F1G1H1 A2 S B2 S C2D2 E2 S F2 S G2H2 (breaks batch)
> 
> My better best case (command order as the block layer submits) will never
> happen in the current code.
> If "S" comes in parallel, the worst cases are the same: both may break a
> batch with new commands. But if the "S" commands are already in the queue,
> my patch will produce (most likely) the best case instead of the worst case.
> 
> 
>>>  The only thing is a latency performance. But that is not an easy question.
>>
>> Agree latency is important and that's why I was saying we can make it config
>> option. Users can continue to try and batch their cmds and we don't break
>> them. We also fix the bug in that we don't get stuck in the cmdqueue loop
>> always taking in new cmds.
> 
>>> IMHO, a system should strive to reduce the maximum value of the latency
>>> almost without impacting the minimum value (prefer current commands),
>>> instead of reducing the minimum value of the latency to the detriment of
>>> the maximum value (prefer new commands).
>>>
>>>  Any preference of new commands over current ones looks like an io scheduler
>>
>> I can see your point of view where you see it as preferring new cmds
>> vs existing. It's probably due to my patch not hooking into commit_rqs
>> and trying to figure out the batching exactly. It's more of a simple
>> estimate.
>>
>> However, that's not what I'm talking about. I'm talking about the block
>> layer / iosched has sent us these commands as a batch. We are now more
>> likely to break that up.
> 
> No, my patch does not break batching more than the current code does;
> instead it decreases the probability of breaking the currently sending
> batch, by trying to transmit in the same order as commands come from the
> block layer.
> 
> It's a very complicated question - how to schedule "new vs old" commands.
> It's not just a counter. Even with a counter there are a lot of questions:
> should it begin from the cmdqueue loop or from iscsi_data_xmit, why does it
> have a static limit, is it per LUN or not, and so on.

What about a compromise? Instead of doing while (!list_empty), we switch it
to list_for_each and move the requeue handling in front of the cmdqueue
handling.

We will know that the cmds on the cmdqueue at the time iscsi_data_xmit is run
were sent to us from the block layer at the same time, so I get what I need.
The bug is fixed.


> That is a new IO scheduler on bus layer. 
> 
>>> I think it is a matter of future investigation/development.
> I am not against it at all. But it should not delay a simple bugfix patch.
> 
> BR,
>  Dmitry
>
Ulrich Windl June 10, 2022, 6:10 a.m. UTC | #7
Hi!

From my primitive point of view iSCSI is just "another type of cable", which makes me wonder:
Is iSCSI allowed to reorder the requests at all? Shouldn't the block layer or initiator do so, or the target do out-of-order processing (tagged queueing)?

I mean: If there is a problem that occurs even without using iSCSI, should iSCSI try to fix it?

Regards,
Ulrich

>>> Mike Christie <michael.christie@oracle.com> schrieb am 09.06.2022 um 22:58 in
Nachricht <d3277470-9ef0-9a1a-974d-e80250bd35ac@oracle.com>:
> [...]
Dmitry Bogdanov June 10, 2022, 10:12 a.m. UTC | #8
Hi Ulrich,

> In my primitive point of view iSCSI is just "another type of cable", making me wonder:
> Is iSCSI allowed to reorder the requests at all? Shouldn't the block layer or initiator do
> so, or the target doing out-of order processing (tagged queueing)?

The iSCSI RFC does not require serializing the command flow. It's just an "iSCSI user" feature:
to send some set of SCSI commands in an unbreakable batch to a device.
But, as far as I understood, the problem Mike described is not reordering but an increase
of the time between the full data transmissions of the commands in a batch.

> I mean: If there is a problem that occurs even without using iSCSI, should iSCSI try to fix it?
Since that is a software-iSCSI-specific issue, it could be fixed/improved in software. How it's
handled in a HW-offloaded implementation is unknown to me.

BR,
 Dmitry
Dmitry Bogdanov June 10, 2022, 11:52 a.m. UTC | #9
Hi Mike,

>>>> On 6/8/22 9:16 AM, Dmitriy Bogdanov wrote:
>>>>> On 6/7/22 10:55 AM, Mike Christie wrote:
>>>>>> On 6/7/22 8:19 AM, Dmitry Bogdanov wrote:
>>>>>>> In function iscsi_data_xmit (TX worker) there is walking through the
>>>>>>> queue of new SCSI commands that is replenished in parallell. And only
>>>>>>> after that queue got emptied the function will start sending pending
>>>>>>> DataOut PDUs. That lead to DataOut timer time out on target side and
>>>>>>> to connection reinstatment.
>>>>>>>
>>>>>>> This patch swaps walking through the new commands queue and the pending
>>>>>>> DataOut queue. To make a preference to ongoing commands over new ones.
>>>>>>>
>>>>>>
>>>>>> ...
>>>>>>
>>>>>>>              task = list_entry(conn->cmdqueue.next, struct iscsi_task,
>>>>>>> @@ -1594,28 +1616,10 @@ static int iscsi_data_xmit(struct iscsi_conn *conn)
>>>>>>>               */
>>>>>>>              if (!list_empty(&conn->mgmtqueue))
>>>>>>>                      goto check_mgmt;
>>>>>>> +            if (!list_empty(&conn->requeue))
>>>>>>> +                    goto check_requeue;
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hey, I've been posting a similar patch:
>>>>>>
>>>>>> https://www.spinics.net/lists/linux-scsi/msg156939.html
>>>>>>
>>>>>> A problem I hit is a possible pref regression so I tried to allow
>>>>>> us to start up a burst of cmds in parallel. It's pretty simple where
>>>>>> we allow up to a queue's worth of cmds to start. It doesn't try to
>>>>>> check that all cmds are from the same queue or anything fancy to try
>>>>>> and keep the code simple. Mostly just assuming users might try to bunch
>>>>>> cmds together during submission or they might hit the queue plugging
>>>>>> code.
>>>>>>
>>>>>> What do you think?
>>>>>
>>>>> Oh yeah, what about a modparam batch_limit? It's between 0 and cmd_per_lun.
>>>>> 0 would check after every transmission like above.
>>>>
>>>>  Did you really face with a perf regression? I could not imagine how it is
>>>> possible.
>>>> DataOut PDU contains a data too, so a throughput performance cannot be
>>>> decreased by sending DataOut PDUs.
>>>
>>>
>>> We can agree that queue plugging and batching improves throughput right?
>>> The app or block layer may try to batch commands. It could be with something
>>> like fio's batch args or you hit the block layer queue plugging.
>>
>> I agree that those features 100% gives an improvement of a throughput on local
>> devices on serial interfaces like SATA1. Since SATA2 (Native Command Queuing)
>> devices can reorder incoming commmands to provide the best thoughput.
>> SCSI I guess has similar feature from the very beginning.
>> But on remote devices (iSCSI and FC) it is not 100% - initiators's command
>> order may be reordered by the network protocol nature itself. I mean 1PDU vs
>> R2T+DataOut PDUs, PDU resending due to crc errors or something like that.
>I think we are talking about slightly different things. You are coming up with
>different possible scenarios where it doesn't work. I get them. You are correct
>for those cases. I'm not talking about those cases. I'm talking about the specific
>case I described.
>
>I'm saying we have targets where we use backends that get improved performance
>when they get batched cmds. When the network is ok, and the user's app is
>batching cmds, they come from the app down to the target's backend device as
>a batch. My concern is that with your patch we will no longer get that behavior.
>
>The reason is that the target and initiator can do:
>
>1. initiator sends scsi cmd pdu1
>2. target sends R2T
>3. initiator sees R2T and hits the goto. Sends data
>4. target reads in data. Sends cmd to block layer.
>5. initiator sends scsi cmd pdu2
>6. target sends R2T
>7. initiator reads in R2T sends data.
>8. target reads in data and sends cmd to block layer.
>
>The problem here could be between 4 and 8 the block layer has run the queue
>and sent that one cmd to the real device already because we have that extra
>delay now. So when I implemented the fix in the same way as you and I run
>iostat I would see lower aqu-sz for example.
>
>With the current code we might not have that extra delay between 4 - 8 so
>I would see:
>
>1. initiator sends scsi cmd pdu1
>2. initiator sends scsi cmd pdu2
>3. initiator sends scsi cmd pdu3
>4. target reads in all those cmd pdus
>5. target sends R2Ts for each cmd.
>6. initiator reads in all the R2Ts
>7. initiator sends data for cmd 1 - 3.
>8. target reads in data and sends cmd1 to block layer
>9. target reads in data and sends cmd2 to block layer
>10. target reads in data and sends cmd3 to block layer
>11. block layer/iosched hits unplug limit and then sends
>those cmds to the real device.
>
>In this case the time between 8 - 9 is small since we are just reading
>from the socket, building structs and passing the cmds down to the block
>layer. Here when I'm watching the backend device with iostat I see higher
>qu-sz values.
>
>Note that you might not be seeing this with LIO for example because we
>plug/unplug on every cmd.
>
>When the network is congested, the lun is shared, we are doing retries,
>machine gets bogged donw, etc then yeah, we might not get the above behavior.
>The user knows this is and it's expected. It's not expected that they just
>update the kernel and they get a perf drop in the normal case.
>
>How do we not give these users a regression while fixing the bug?

Now I've got your concern. Due to the changed pattern of DataOuts, in some
specific cases there will be a throughput regression.
I agree that this should be addressed in my patch.

>>
>>> With the current code we can end up sending all cmds to the target in a way
>>> the target can send them to the real device batched. For example, we send off
>>> the initial N scsi command PDUs in one run of iscsi_data_xmit. The target reads
>>> them in, and sends off N R2Ts. We are able to read N R2Ts in the same call.
>>> And again we are able to send the needed data for them in one call of
>>> iscsi_data_xmit. The target is able to read in the data and send off the
>>> WRITEs to the physical device in a batch.
>>>
>>> With your patch, we can end up not batching them like the app/block layer
>>> intended. For example, we now call iscsi_data_xmit and in the cmdqueue loop.
>>> We've sent N - M scsi cmd PDUs, then see that we've got an incoming R2T to
>>> handle. So we goto check_requeue. We send the needed data. The target then
>>> starts to send the cmd to the physical device. If we have read in multiple
>>> R2Ts then we will continue the requeue loop. And so we might be able to send
>>> the data fast enough that the target can then send those commands to the
>>> physical device. But we've now broken up the batching the upper layers sent
>>> to us and we were doing before.
>>
>> In my head everything is vice-versa :)
>
>:)
>
>> Current code breaks a batching by sending new commands instead of completing
>> the transmission of current commands that could be in its own batch.
>
>We are not talking about the same thing. I'm talking about specifically what
>a user would do today to boost perf for their app where they have a single app
>using the LUN. Single LUN per session. App uses something like fio's batch
>args. You have it send down 32 cmds. When those complete, it sends down 32 again.
>We are not getting a constant stream or mix of cmds from different sources.
>
>We are not talking about multiple users using the same LUN or even same session.
>
>
>> With my patch the batches will be received (in best effort manner) on target
>> port in the same order as block layer sends to iSCSI layer. BTW, another target
>> ports may receive other commands in meantime and the real device will receive
>> broken batch anyway. The initiator side cannot guarantee batching on the real
>> device.
>
>The user does this. The user has set things up so they are using the device for
>their one app. The app is sending down these commands as a batch. No other
>initiators are using the device.
>
>It's *not* say a shared LUN example like we have N VMs on a LUN all doing their
>own IO. The user does not expect the same perf for that type of case we they do
>when they have specifically tuned their app and device use like I mention.
>
>>
>> Lets imagine, that block layer submits two batches of big commands and then
>> submits single small commands, the command queue will be
>
>That's not what I'm talking about. We are not getting a mix like this.
>
>
>> "ABCD EFGH" + "S" + "S" + "S" + "S"
>> and that what will be sent out:
>> current code (worst case):    A1B1C1D1 E1F1G1H1 A2 S B2 S C2D2 E2 S F2 S G2H2 (breaks batch)
>> current code (best case):     A1B1C1D1 E1F1G1H1 SSSS A2B2C2D2 E2F2G2H2 (reorder)
>> current code (bug addressed): A1B1C1D1 E1F1G1H1 SS...S (connection fail)
>> current code (!impossible):   A1B1C1D1 E1F1G1H1 A2B2C2D2 E2F2G2H2 SSSS (inorder)
>> with my patch (best case):    A1B1C1D1 E1F1G1H1 A2B2C2D2 E2F2G2H2 SSSS (inorder)
>> with my patch (your case):    A1B1C1D1 E1 A2 F1 B2 G1H1 C2 E2 D2 F2G2H2 SSSS (still inorder)
>> with my patch (worst case):   A1B1C1D1 E1F1G1H1 A2 S B2 S C2D2 E2 S F2 S G2H2 (breaks batch)
>>
>> My better best case (command order as block layer submits) will never happen in
>> the current code.
>> If "S" comes in parrallel, the worst cases are the same, both may break a batch
>> by new commands. But if "S" are already in the queue, my patch will produce
>> (most likely) the best case instead of the worst case.
>>
>>
>>>>  The only thing is a latency performance. But that is not an easy question.
>>>
>>> Agree latency is important and that's why I was saying we can make it config
>>> option. Users can continue to try and batch their cmds and we don't break
>>> them. We also fix the bug in that we don't get stuck in the cmdqueue loop
>>> always taking in new cmds.
>>
>>>> IMHO, a system should strive to reduce a maximum value of the latency almost
>>>> without impacting of a minimum value (prefer current commands) instead of
>>>> to reduce a minimum value of the latency to the detriment of maximum value
>>>> (prefer new commands).
>>>>
>>>>  Any preference of new commands over current ones looks like an io scheduler
>>>
>>> I can see your point of view where you see it as preferring new cmds
>>> vs existing. It's probably due to my patch not hooking into commit_rqs
>>> and trying to figure out the batching exactly. It's more of a simple
>>> estimate.
>>>
>>> However, that's not what I'm talking about. I'm talking about the block
>>> layer / iosched has sent us these commands as a batch. We are now more
>>> likely to break that up.
>>
>> No, my patch does not break a batching more than current code, instead it
>> decreases the probability of a break the currently sending batch by trying to
>> transmit in the same order as they come from block layer.
>>
>> It's very complicated question - how to schedule "new vs old" commands. It's
>> not just a counter. Even with a counter there are a lot of questions - should
>> it begin from the cmdqueue loop or from iscsi_data_xmit, why it has static
>> limit, per LUN or not, and so on and so on.
>
>What about a compromise? Instead of doing while (!list_empty). We switch it
>to list_for_each and move the requeue handling to infront of the cmdqueue
>handling.

That will not help, because libiscsi is not adapted for batching: it calls
iscsi_data_xmit for each queued command. So each of the first few commands of a
batch will be the only command in the cmdqueue list and will be followed by its
DataOuts, just because DataOuts are sent first in the next iscsi_data_xmit call.
Probably you saw a regression exactly because of that.

>We will know that the cmds on the cmdqueue at the time iscsi_data_xmit is run
>were sent to us from the block layer at the same time so I get what I need. The
>bug is fixed.

I think the better solution would be to support the commit_rqs callback and call
iscsi_data_xmit there. Then the cmdqueue list will contain all the commands of a
batch. Then I should add a check whether the current command is from the same
batch as the previous one, to decide whether to go to sending DataOuts.

Sounds good?

BR,
 Dmitry.
Mike Christie June 15, 2022, 3:37 p.m. UTC | #10
On 6/7/22 8:19 AM, Dmitry Bogdanov wrote:
> In function iscsi_data_xmit (TX worker) there is walking through the
> queue of new SCSI commands that is replenished in parallell. And only
> after that queue got emptied the function will start sending pending
> DataOut PDUs. That lead to DataOut timer time out on target side and
> to connection reinstatment.
> 
> This patch swaps walking through the new commands queue and the pending
> DataOut queue. To make a preference to ongoing commands over new ones.
> 
> Reviewed-by: Konstantin Shelekhin <k.shelekhin@yadro.com>
> Signed-off-by: Dmitry Bogdanov <d.bogdanov@yadro.com>

Let's do this patch. I've tried so many combos of implementations and
they all have different perf gains or losses with different workloads.
I've already been going back and forth with myself for over a year
(the link for my patch in the other mail was version N) and I don't
think a common solution is going to happen.

Your patch fixes the bug, and I've found a workaround for my issue
where I tweak the queue depth, so I think we will be ok.

Reviewed-by: Mike Christie <michael.christie@oracle.com>
Adam Hutchinson June 15, 2022, 6:57 p.m. UTC | #11
Is there any reason not to use time as an indicator that pending R2Ts
need to be processed?  Could R2Ts be tagged with a timestamp when
received, and only given priority over new commands if the age of the
R2T at the head exceeds some configurable limit? This would guarantee
that R2Ts will eventually be serviced even if the block layer doesn't
reduce the submission rate of new commands; it wouldn't remove the
performance benefits of the current algorithm, which gives priority to
new commands; and it would be a relatively simple solution.  A
threshold of 0 could indicate that R2Ts should always be given
priority over new commands. Just a thought.

Regards,
Adam

On Wed, Jun 15, 2022 at 11:37 AM Mike Christie
<michael.christie@oracle.com> wrote:
> [...]
Ulrich Windl June 17, 2022, 5:54 a.m. UTC | #12
>>> Adam Hutchinson <ajhutchin@gmail.com> wrote on 15.06.2022 at 20:57 in
message
<CAFU8FUgwMX_d85OG+qC+qTX-NpFiSVkwBtradzAmeJW-3PCmEQ@mail.gmail.com>:
> Is there any reason not to use time as an indicator that pending R2Ts
> need to be processed? Could R2Ts be tagged with a timestamp when
> received and only be given priority over new commands once the age of
> the R2T at the head of the queue exceeds some configurable limit? This
> would guarantee that R2Ts are eventually serviced even if the block
> layer doesn't reduce the submission rate of new commands, it wouldn't
> remove the performance benefit of the current algorithm (which gives
> priority to new commands), and it would be a relatively simple
> solution. A threshold of 0 could indicate that R2Ts should always be
> given priority over new commands. Just a thought...

I had a similar thought comparing SCSI command scheduling with process
scheduling: real-time scheduling can cause starvation when newer requests
are postponed indefinitely, while the classic scheduler increases the
chance that longer-waiting tasks are scheduled next.
In any case, that would require some sorting of the queue (or searching
for a maximum/minimum among the requests, which is equivalent).
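For a strictly FIFO requeue list no search is needed, since the head is
always the oldest entry; but if requests could age out of order, the
minimum search mentioned above might look like the sketch below. This is a
user-space illustration with hypothetical names (`pending_req`,
`recv_time`, `oldest_pending`), not code from the driver:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: find the longest-waiting request without
 * keeping the queue sorted.  A single linear scan for the minimum
 * timestamp is equivalent to sorting for the purpose of picking the
 * next request to service. */
struct pending_req {
	unsigned long recv_time;	/* timestamp taken on arrival */
};

static struct pending_req *oldest_pending(struct pending_req *reqs,
					  size_t n)
{
	struct pending_req *oldest = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!oldest || reqs[i].recv_time < oldest->recv_time)
			oldest = &reqs[i];
	}
	return oldest;	/* NULL when the queue is empty */
}
```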

Regards,
Ulrich


> 
> Regards,
> Adam
> 
> On Wed, Jun 15, 2022 at 11:37 AM Mike Christie
> <michael.christie@oracle.com> wrote:
>>
>> On 6/7/22 8:19 AM, Dmitry Bogdanov wrote:
>> > In function iscsi_data_xmit (TX worker) there is walking through the
>> > queue of new SCSI commands that is replenished in parallell. And only
>> > after that queue got emptied the function will start sending pending
>> > DataOut PDUs. That lead to DataOut timer time out on target side and
>> > to connection reinstatment.
>> >
>> > This patch swaps walking through the new commands queue and the pending
>> > DataOut queue. To make a preference to ongoing commands over new ones.
>> >
>> > Reviewed-by: Konstantin Shelekhin <k.shelekhin@yadro.com>
>> > Signed-off-by: Dmitry Bogdanov <d.bogdanov@yadro.com>
>>
>> Let's do this patch. I've tried so many combos of implementations and
>> they all have different perf gains or losses with different workloads.
>> I've already been going back and forth with myself for over a year
>> (the link for my patch in the other mail was version N) and I don't
>> think a common solution is going to happen.
>>
>> Your patch fixes the bug, and I've found a workaround for my issue
>> where I tweak the queue depth, so I think we will be ok.
>>
>> Reviewed-by: Mike Christie <michael.christie@oracle.com>
>>
> 
> 
> 
> -- 
> "Things turn out best for the people who make the best out of the way
> things turn out." - Art Linkletter
> 
Martin K. Petersen June 22, 2022, 2:10 a.m. UTC | #13
On Tue, 7 Jun 2022 16:19:53 +0300, Dmitry Bogdanov wrote:

> In function iscsi_data_xmit (TX worker) there is walking through the
> queue of new SCSI commands that is replenished in parallell. And only
> after that queue got emptied the function will start sending pending
> DataOut PDUs. That lead to DataOut timer time out on target side and
> to connection reinstatment.
> 
> This patch swaps walking through the new commands queue and the pending
> DataOut queue. To make a preference to ongoing commands over new ones.
> 
> [...]

Applied to 5.20/scsi-queue, thanks!

[1/1] scsi: iscsi: prefer xmit of DataOut before new cmd
      https://git.kernel.org/mkp/scsi/c/65080c51fde4
Patch

diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index 797abf4f5399..8d78559ae94a 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -1567,6 +1567,28 @@  static int iscsi_data_xmit(struct iscsi_conn *conn)
 			goto done;
 	}
 
+check_requeue:
+	while (!list_empty(&conn->requeue)) {
+		/*
+		 * we always do fastlogout - conn stop code will clean up.
+		 */
+		if (conn->session->state == ISCSI_STATE_LOGGING_OUT)
+			break;
+
+		task = list_entry(conn->requeue.next, struct iscsi_task,
+				  running);
+
+		if (iscsi_check_tmf_restrictions(task, ISCSI_OP_SCSI_DATA_OUT))
+			break;
+
+		list_del_init(&task->running);
+		rc = iscsi_xmit_task(conn, task, true);
+		if (rc)
+			goto done;
+		if (!list_empty(&conn->mgmtqueue))
+			goto check_mgmt;
+	}
+
 	/* process pending command queue */
 	while (!list_empty(&conn->cmdqueue)) {
 		task = list_entry(conn->cmdqueue.next, struct iscsi_task,
@@ -1594,28 +1616,10 @@  static int iscsi_data_xmit(struct iscsi_conn *conn)
 		 */
 		if (!list_empty(&conn->mgmtqueue))
 			goto check_mgmt;
+		if (!list_empty(&conn->requeue))
+			goto check_requeue;
 	}
 
-	while (!list_empty(&conn->requeue)) {
-		/*
-		 * we always do fastlogout - conn stop code will clean up.
-		 */
-		if (conn->session->state == ISCSI_STATE_LOGGING_OUT)
-			break;
-
-		task = list_entry(conn->requeue.next, struct iscsi_task,
-				  running);
-
-		if (iscsi_check_tmf_restrictions(task, ISCSI_OP_SCSI_DATA_OUT))
-			break;
-
-		list_del_init(&task->running);
-		rc = iscsi_xmit_task(conn, task, true);
-		if (rc)
-			goto done;
-		if (!list_empty(&conn->mgmtqueue))
-			goto check_mgmt;
-	}
 	spin_unlock_bh(&conn->session->frwd_lock);
 	return -ENODATA;