
[00/12,V2] target: fix cmd plugging and completion

Message ID 20210209123845.4856-1-michael.christie@oracle.com

Message

Mike Christie Feb. 9, 2021, 12:38 p.m. UTC
The following patches, made over Martin's 5.12 branches, fix two
issues:

1. target_core_iblock plugs and unplugs the queue for every
command. To fix that, and to handle an issue that vhost-scsi and
loop were working around by adding their own workqueues, I added
a new submission workqueue to LIO. Drivers can pass cmds to it,
and we can then submit batches of cmds.

2. vhost-scsi and loop were doing a work per cmd on the submission
side, and the LIO completion side was also doing a work per cmd.
The cap on running works is 512 (max_active), so we can end up
using a lot of threads when submissions start blocking because
they hit the block tag limit or the completion side blocks trying
to send the cmd. In this patchset I just use a cmd list per
session to avoid abusing the workqueue layer.
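
Roughly, each queue is just a lockless list plus a work item that
drains it (simplified sketch; other fields are trimmed):

        struct se_cmd_queue {
                struct llist_head       cmd_list;       /* cmds get llist_add()ed here */
                struct work_struct      work;           /* one work drains the whole list */
        };

Submitters add the cmd and kick the queue's work:

        llist_add(&se_cmd->se_cmd_list, &q->cmd_list);
        queue_work_on(cpu, wq, &q->work);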

The combined patchset fixes a major perf issue we've been hitting
where IOPs are stuck at 230K when running:

    fio --filename=/dev/sda  --direct=1 --rw=randrw --bs=4k
    --ioengine=libaio --iodepth=128  --numjobs=8 --time_based
    --group_reporting --runtime=60

The patches in this set get me to 350K when using devices that
have native IOPs of around 400-500K.

Note that 5.12 has some interrupt changes that my patches
collide with. Martin's 5.12 branches already have those changes,
so I based my patches on them.

V2:
- Fix up container_of use coding style.
- Handle an offlist review comment from Laurence: with the original
code and my patches we can hit a bug where the cmd times out and
LIO starts up the TMR code, but the TMR code misses the cmd because
it's still on the workqueue.
- Make the work per device instead of per session to handle the
previous issue, and so that if one dev hits an issue it sleeps on,
it won't block other devices.

Comments

Bodo Stroesser Feb. 9, 2021, 5:01 p.m. UTC | #1
On 09.02.21 13:38, Mike Christie wrote:
>   
> +void target_queued_compl_work(struct work_struct *work)
> +{
> +	struct se_cmd_queue *cq = container_of(work, struct se_cmd_queue,
> +					       work);
> +	struct se_cmd *se_cmd, *next_cmd;
> +	struct llist_node *cmd_list;
> +
> +	cmd_list = llist_del_all(&cq->cmd_list);

Probably nit-picking: I'd like to reverse the list before processing like you did during submission.

> +	llist_for_each_entry_safe(se_cmd, next_cmd, cmd_list, se_cmd_list)
> +		target_complete_cmd_work(se_cmd);
> +}
> +
Mike Christie Feb. 9, 2021, 6:50 p.m. UTC | #2
On 2/9/21 11:01 AM, Bodo Stroesser wrote:
> On 09.02.21 13:38, Mike Christie wrote:
>>   
>> +void target_queued_compl_work(struct work_struct *work)
>> +{
>> +	struct se_cmd_queue *cq = container_of(work, struct se_cmd_queue,
>> +					       work);
>> +	struct se_cmd *se_cmd, *next_cmd;
>> +	struct llist_node *cmd_list;
>> +
>> +	cmd_list = llist_del_all(&cq->cmd_list);
> 
> Probably nit-picking: I'd like to reverse the list before processing like you did during submission.
> 

Yeah, I flip-flopped a lot on both. I'll just do it. I didn't notice
any perf differences.
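
Something like this then (untested sketch):

	cmd_list = llist_del_all(&cq->cmd_list);
	/* run cmds in the order they were queued, like the submission side */
	cmd_list = llist_reverse_order(cmd_list);
	llist_for_each_entry_safe(se_cmd, next_cmd, cmd_list, se_cmd_list)
		target_complete_cmd_work(se_cmd);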
Mike Christie Feb. 9, 2021, 6:59 p.m. UTC | #3
On 2/9/21 10:32 AM, Bodo Stroesser wrote:
> On 09.02.21 13:38, Mike Christie wrote:
>> This patch adds plug/unplug callouts for tcmu, so we can reduce the
>> number of times we switch to userspace. Using this driver with tcm
>> loop is a common config, and depending on the nr_hw_queues
>> (nr_hw_queues=1 performs much better) and fio jobs (lower numjobs,
>> around 4) this patch can increase IOPs by only around 5-10% because
>> we hit other issues like the big per tcmu device mutex.
>>
>> Signed-off-by: Mike Christie <michael.christie@oracle.com>
>> ---
>>   drivers/target/target_core_user.c | 27 +++++++++++++++++++++++++--
>>   1 file changed, 25 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c
>> index a5991df23581..a030ca6f0f4c 100644
>> --- a/drivers/target/target_core_user.c
>> +++ b/drivers/target/target_core_user.c
>> @@ -111,6 +111,7 @@ struct tcmu_dev {
>>   	struct kref kref;
>>   
>>   	struct se_device se_dev;
>> +	struct se_dev_plug se_plug;
>>   
>>   	char *name;
>>   	struct se_hba *hba;
>> @@ -119,6 +120,7 @@ struct tcmu_dev {
>>   #define TCMU_DEV_BIT_BROKEN 1
>>   #define TCMU_DEV_BIT_BLOCKED 2
>>   #define TCMU_DEV_BIT_TMR_NOTIFY 3
>> +#define TCM_DEV_BIT_PLUGGED 4
>>   	unsigned long flags;
>>   
>>   	struct uio_info uio_info;
>> @@ -959,6 +961,25 @@ static uint32_t ring_insert_padding(struct tcmu_dev *udev, size_t cmd_size)
>>   	return cmd_head;
>>   }
>>   
>> +static void tcmu_unplug_device(struct se_dev_plug *se_plug)
>> +{
>> +	struct se_device *se_dev = se_plug->se_dev;
>> +	struct tcmu_dev *udev = TCMU_DEV(se_dev);
>> +
>> +	uio_event_notify(&udev->uio_info);
> 
> Don't we have a race here?
> 
> Let's assume that
>  - just here the thread is interrupted
>  - userspace starts, empties the ring and sleeps again
>  - another cpu queues a new CDB in the ring
> In that - of course very rare condition - userspace will not wake up for the freshly queued CDB.
> 
> I think, first clearing the bit, then doing the uio_event_notify would work (without need to take the big tcmu mutex).

You're right. Will fix. I have the same issue in iblock, and there
I made a mistake where it's per cpu when it should be per task.
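
For the tcmu side it would be something like this (untested sketch;
iblock will get the equivalent fix):

	static void tcmu_unplug_device(struct se_dev_plug *se_plug)
	{
		struct se_device *se_dev = se_plug->se_dev;
		struct tcmu_dev *udev = TCMU_DEV(se_dev);

		/*
		 * Clear the plugged bit before waking userspace, so a CDB
		 * queued by another cpu right after the notify can't be
		 * left in the ring with userspace asleep again.
		 */
		clear_bit(TCM_DEV_BIT_PLUGGED, &udev->flags);
		uio_event_notify(&udev->uio_info);
	}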
Christoph Hellwig Feb. 10, 2021, 8:35 a.m. UTC | #4
On Tue, Feb 09, 2021 at 06:38:33AM -0600, Mike Christie wrote:
> The next patch splits target_submit_cmd_map_sgls so the initialization
> and submission part can be called at different times. If the init part
> fails we can reference the t_task_cdb early in some of the logging
> and tracing code. This moves it to transport_init_se_cmd so we don't
> hit NULL pointer crashes.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig Feb. 10, 2021, 8:37 a.m. UTC | #5
On Tue, Feb 09, 2021 at 06:38:37AM -0600, Mike Christie wrote:
> This just has tcm loop use the block layer cmd allocator for se_cmds
> instead of using the tcm_loop_cmd_cache. In future patches when we
> can use the host tags for internal requests like TMFs we can completely
> kill the tcm_loop_cmd_cache.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig Feb. 10, 2021, 8:42 a.m. UTC | #6
On Tue, Feb 09, 2021 at 06:38:43AM -0600, Mike Christie wrote:
> Doing a work per cmd can lead to lots of threads being created.
> This patch just replaces the completion work per cmd with a per cpu
> list. Combined with the first patches this allows tcm loop on top of
> initiators like iser to go from around 700K IOPs to 1000K and reduces
> the number of threads that get created when the system is under heavy
> load and hitting the initiator drivers tagging limits.

OTOH it does increase completion latency, which might be the preference
for some workloads.  Do we need a tunable here?

> +static void target_queue_cmd_work(struct se_cmd_queue *q, struct se_cmd *se_cmd,
> +				  int cpu, struct workqueue_struct *wq)
> +{
> -	struct se_cmd *cmd = container_of(work, struct se_cmd, work);
> +	llist_add(&se_cmd->se_cmd_list, &q->cmd_list);
> +	queue_work_on(cpu, wq, &q->work);
> +}

Do we need this helper at all?  Having it open coded in the two callers
would seem easier to follow to me.
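
E.g. at the completion side call site that would just be (sketch):

	llist_add(&se_cmd->se_cmd_list, &se_dev->queues[cpu].cq.cmd_list);
	queue_work_on(cpu, target_completion_wq, &se_dev->queues[cpu].cq.work);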
Christoph Hellwig Feb. 10, 2021, 8:44 a.m. UTC | #7
>  	struct se_device *se_dev = se_cmd->se_dev;
> -	int cpu = se_cmd->cpuid;
> +	int cpu;
> +
> +	if (se_cmd->se_cmd_flags & SCF_IGNORE_CPUID_COMPL)
> +		cpu = smp_processor_id();
> +	else
> +		cpu = se_cmd->cpuid;
>  
>  	target_queue_cmd_work(&se_dev->queues[cpu].cq, se_cmd, cpu,
>  			      target_completion_wq);

Can't we just use se_cmd->cpuid == WORK_CPU_UNBOUND as the no affinity
indicator, which would remove all branches here.
Mike Christie Feb. 10, 2021, 6:33 p.m. UTC | #8
On 2/10/21 2:42 AM, Christoph Hellwig wrote:
> On Tue, Feb 09, 2021 at 06:38:43AM -0600, Mike Christie wrote:
>> Doing a work per cmd can lead to lots of threads being created.
>> This patch just replaces the completion work per cmd with a per cpu
>> list. Combined with the first patches this allows tcm loop on top of
>> initiators like iser to go from around 700K IOPs to 1000K and reduces
>> the number of threads that get created when the system is under heavy
>> load and hitting the initiator drivers tagging limits.
> 
> OTOH it does increase completion latency, which might be the preference
> for some workloads.  Do we need a tunable here?
> 

I didn't see an increase with higher perf setups like 100G (really 60
something) iser/srp, and even when using ram disks. I think it's because
either way we run one cmd on each cpu at a time, or if we do kick off
cmds on different threads they still run on the same cpu, so latency
didn't change overall.

1. Without my patch each cmd has a work struct. target_complete_cmd does
queue_work_on, which puts the cmd's work on the worker pool for that cpu.
With faster drivers, workqueue.c does the equivalent of a loop over each
work/cmd, running one work on the cpu until it completes, then running
the next.

If we get to the point where the workqueue code runs cmds in different
threads on the same cpu, they end up sharing the cpu, so latency doesn't
change (I didn't debug this very much).

I can't tell how much the rescuer thread comes into this though. Maybe it
could come in and run cmds on a different cpu, and that would help.

2. With the patch, we add the cmd to the device's per cpu list, then do
queue_work_on on the list's per cpu work. queue_work_on then either sees
the work is already pending and returns, or we run like in #1 but on a
per device basis instead of per cmd.

However,

1. While I was checking this I noticed that for commands that set
transport_complete_callback, I need to keep the per cmd work, since
they can be slow. I'll fix this in the next posting.

2. For the latency issue, did you want the cmd to be able to run on any cpu?
I think this won't happen now in normal cases, because the workqueue does
not set WQ_UNBOUND. Did we want a tunable for that?



>> +static void target_queue_cmd_work(struct se_cmd_queue *q, struct se_cmd *se_cmd,
>> +				  int cpu, struct workqueue_struct *wq)
>> +{
>> -	struct se_cmd *cmd = container_of(work, struct se_cmd, work);
>> +	llist_add(&se_cmd->se_cmd_list, &q->cmd_list);
>> +	queue_work_on(cpu, wq, &q->work);
>> +}
> 
> Do we need this helper at all?  Having it open coded in the two callers
> would seem easier to follow to me.

I can do that.
Mike Christie Feb. 10, 2021, 6:43 p.m. UTC | #9
On 2/10/21 2:44 AM, Christoph Hellwig wrote:
>>  	struct se_device *se_dev = se_cmd->se_dev;
>> -	int cpu = se_cmd->cpuid;
>> +	int cpu;
>> +
>> +	if (se_cmd->se_cmd_flags & SCF_IGNORE_CPUID_COMPL)
>> +		cpu = smp_processor_id();
>> +	else
>> +		cpu = se_cmd->cpuid;
>>  
>>  	target_queue_cmd_work(&se_dev->queues[cpu].cq, se_cmd, cpu,
>>  			      target_completion_wq);
> 
> Can't we just use se_cmd->cpuid == WORK_CPU_UNBOUND as the no affinity
> indicator, which would remove all branches here.

We can't right now because the workqueue_struct does
not have the WQ_UNBOUND bit set. __queue_work will then do:

        /* pwq which will be used unless @work is executing elsewhere */
        if (wq->flags & WQ_UNBOUND) {
                if (req_cpu == WORK_CPU_UNBOUND)
                        cpu = wq_select_unbound_cpu(raw_smp_processor_id());
                pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
        } else {
                if (req_cpu == WORK_CPU_UNBOUND)
                        cpu = raw_smp_processor_id();
                pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
        }

so even if you pass in WORK_CPU_UNBOUND, it will do raw_smp_processor_id
and add the work to the cpu's worker pool.

I think if I add a new tunable to make the workqueue bound or unbound,
like I mentioned in the other thread, it will do what you want for both
review comments.
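
As a rough sketch (the tunable name below is just a placeholder, not
something in this patchset), the completion workqueue setup would
become something like:

	/* "completion_wq_unbound" is a made up name for the new tunable */
	static bool completion_wq_unbound;

	target_completion_wq = alloc_workqueue("target_completion",
					       completion_wq_unbound ?
					       (WQ_UNBOUND | WQ_MEM_RECLAIM) :
					       WQ_MEM_RECLAIM, 0);

With WQ_UNBOUND set, queueing with WORK_CPU_UNBOUND then goes through
wq_select_unbound_cpu() instead of being pinned to the submitting cpu.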
Mike Christie Feb. 10, 2021, 6:45 p.m. UTC | #10
On 2/10/21 12:43 PM, Mike Christie wrote:
> On 2/10/21 2:44 AM, Christoph Hellwig wrote:
>>>  	struct se_device *se_dev = se_cmd->se_dev;
>>> -	int cpu = se_cmd->cpuid;
>>> +	int cpu;
>>> +
>>> +	if (se_cmd->se_cmd_flags & SCF_IGNORE_CPUID_COMPL)
>>> +		cpu = smp_processor_id();
>>> +	else
>>> +		cpu = se_cmd->cpuid;
>>>  
>>>  	target_queue_cmd_work(&se_dev->queues[cpu].cq, se_cmd, cpu,
>>>  			      target_completion_wq);
>>
>> Can't we just use se_cmd->cpuid == WORK_CPU_UNBOUND as the no affinity
>> indicator, which would remove all branches here.
> 
> We can't right now because the workqueue_struct does
> not have the WQ_UNBOUND bit set. __queue_work will then do:
> 
>         /* pwq which will be used unless @work is executing elsewhere */
>         if (wq->flags & WQ_UNBOUND) {
>                 if (req_cpu == WORK_CPU_UNBOUND)
>                         cpu = wq_select_unbound_cpu(raw_smp_processor_id());
>                 pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
>         } else {
>                 if (req_cpu == WORK_CPU_UNBOUND)
>                         cpu = raw_smp_processor_id();
>                 pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
>         }
> 
> so even if you pass in WORK_CPU_UNBOUND, it will do raw_smp_processor_id
> and add the work to the cpu's worker pool.

Nevermind. What I wrote is exactly what you asked for :) I can do that.