[v2,1/3] iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction

Message ID 1505221238-9428-2-git-send-email-thunder.leizhen@huawei.com
State New
Series
  • arm-smmu: performance optimization

Commit Message

Leizhen (ThunderTown) Sept. 12, 2017, 1 p.m.
Every TLBI command must be followed by a SYNC command to make sure that
it has completed. So we can just add the TLBI commands to the queue and
put off their execution until a SYNC or another command is queued. To
keep the following SYNC command from waiting a long time because too many
commands have been delayed, limit the number of delayed commands.

In my tests, this gives the same performance as replacing writel with
writel_relaxed in queue_inc_prod.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>

---
 drivers/iommu/arm-smmu-v3.c | 42 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 37 insertions(+), 5 deletions(-)

-- 
2.5.0
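For illustration only, the deferral described in the commit message can be
modelled in user space. This is a minimal sketch, not the driver code:
model_queue, model_insert and model_publish are hypothetical names, the
256-entry depth is an assumption, and a counter stands in for the writel()
to the producer register. The queue-full check is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_QUEUE_ENTRIES 256u  /* hypothetical queue depth */
#define MODEL_MAX_DELAYED   32u   /* mirrors CMDQ_MAX_DELAYED in the patch */

struct model_queue {
	uint32_t sw_prod;    /* producer index known only to software */
	uint32_t hw_prod;    /* last index published to the modelled PROD register */
	uint32_t nr_delay;   /* deferrable commands queued since the last publish */
	uint32_t doorbells;  /* number of modelled writel() calls */
};

/* Publish the software index; stands in for writel(q->prod, q->prod_reg). */
static void model_publish(struct model_queue *q)
{
	q->hw_prod = q->sw_prod;
	q->nr_delay = 0;
	q->doorbells++;
}

/*
 * Queue one command. Deferrable commands (TLBI in the patch) only advance
 * the software index until MODEL_MAX_DELAYED of them are pending; any
 * other command publishes everything queued so far.
 */
static void model_insert(struct model_queue *q, bool deferrable)
{
	/* Invariant: the pending count never reaches the limit on entry. */
	assert(q->nr_delay < MODEL_MAX_DELAYED);

	q->sw_prod = (q->sw_prod + 1) % MODEL_QUEUE_ENTRIES;

	if (deferrable && ++q->nr_delay < MODEL_MAX_DELAYED)
		return;

	model_publish(q);
}
```

In this model, queuing three deferrable commands followed by one
non-deferrable command results in a single publish covering all four
entries.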

Comments

Will Deacon Oct. 18, 2017, 12:58 p.m. | #1
Hi Thunder,

On Tue, Sep 12, 2017 at 09:00:36PM +0800, Zhen Lei wrote:
> Because all TLBI commands should be followed by a SYNC command, to make
> sure that it has been completely finished. So we can just add the TLBI
> commands into the queue, and put off the execution until meet SYNC or
> other commands. To prevent the followed SYNC command waiting for a long
> time because of too many commands have been delayed, restrict the max
> delayed number.
>
> According to my test, I got the same performance data as I replaced writel
> with writel_relaxed in queue_inc_prod.
>
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
>  drivers/iommu/arm-smmu-v3.c | 42 +++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 37 insertions(+), 5 deletions(-)


If we want to go down the route of explicit command batching, I'd much
rather do it by implementing the iotlb_range_add callback in the driver,
and have a fixed-length array of batched ranges on the domain. We could
potentially toggle this function pointer based on the compatible string too,
if it shows only to benefit some systems.

Will
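
The batching suggested above might look roughly like the following
user-space sketch. It is an assumption about the shape of such an
implementation, not code from the driver: the sketch_* names and the batch
size of 8 are hypothetical, counters stand in for queued commands, and
draining early when the array is full is only one possible policy.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_RANGES 8u  /* hypothetical fixed batch size */

struct sketch_range {
	unsigned long iova;
	size_t size;
};

struct sketch_domain {
	struct sketch_range ranges[SKETCH_MAX_RANGES];
	unsigned int nr_ranges;
	unsigned int tlbi_issued;  /* stands in for TLBI commands queued */
	unsigned int sync_issued;  /* stands in for CMD_SYNC commands queued */
};

/* Drain the batch: one TLBI per recorded range, then a single SYNC. */
static void sketch_iotlb_sync(struct sketch_domain *d)
{
	if (!d->nr_ranges)
		return;  /* assumption: nothing batched, nothing to wait for */

	d->tlbi_issued += d->nr_ranges;
	d->sync_issued++;
	d->nr_ranges = 0;
}

/* Record a range; if the array is already full, drain it first. */
static void sketch_iotlb_range_add(struct sketch_domain *d,
				   unsigned long iova, size_t size)
{
	if (d->nr_ranges == SKETCH_MAX_RANGES)
		sketch_iotlb_sync(d);

	assert(d->nr_ranges < SKETCH_MAX_RANGES);
	d->ranges[d->nr_ranges].iova = iova;
	d->ranges[d->nr_ranges].size = size;
	d->nr_ranges++;
}
```

With this shape, the batch size is a single constant, which is the value
that could be chosen per platform.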

Leizhen (ThunderTown) Oct. 19, 2017, 3 a.m. | #2
On 2017/10/18 20:58, Will Deacon wrote:
> Hi Thunder,
>
> On Tue, Sep 12, 2017 at 09:00:36PM +0800, Zhen Lei wrote:
>> Because all TLBI commands should be followed by a SYNC command, to make
>> sure that it has been completely finished. So we can just add the TLBI
>> commands into the queue, and put off the execution until meet SYNC or
>> other commands. To prevent the followed SYNC command waiting for a long
>> time because of too many commands have been delayed, restrict the max
>> delayed number.
>>
>> According to my test, I got the same performance data as I replaced writel
>> with writel_relaxed in queue_inc_prod.
>>
>> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
>> ---
>>  drivers/iommu/arm-smmu-v3.c | 42 +++++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 37 insertions(+), 5 deletions(-)
>
> If we want to go down the route of explicit command batching, I'd much
> rather do it by implementing the iotlb_range_add callback in the driver,
> and have a fixed-length array of batched ranges on the domain. We could

I think this patch is still valuable even if the iotlb_range_add callback is implemented. Its main
purpose is to reduce the number of dsb operations. Consider the scenario with iotlb_range_add implemented:
.iotlb_range_add:
spin_lock_irqsave(&smmu->cmdq.lock, flags);
...
add tlbi range-1 to cmdq
...
add tlbi range-n to cmdq			// n ranges in total
dsb
...
spin_unlock_irqrestore(&smmu->cmdq.lock, flags);

.iotlb_sync:
spin_lock_irqsave(&smmu->cmdq.lock, flags);
...
add cmd_sync to cmdq
dsb
...
spin_unlock_irqrestore(&smmu->cmdq.lock, flags);

Although iotlb_range_add can eliminate n-1 dsb operations, one still remains. If n is not large enough,
this patch is helpful.
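
As a rough illustration of the counts above, the sketch below tallies
producer-register writes (one dsb each, assuming every writel() implies
one) for n TLBI commands plus a final SYNC. The function names are
hypothetical; the third case assumes the deferral from this patch with its
limit of 32.

```c
#define SKETCH_MAX_DELAYED 32u  /* mirrors CMDQ_MAX_DELAYED in the patch */

/* No batching: every TLBI and the final SYNC each write PROD. */
static unsigned int dsb_unbatched(unsigned int n)
{
	return n + 1;
}

/* iotlb_range_add batching: one write after the n TLBIs, one after SYNC. */
static unsigned int dsb_range_add(unsigned int n)
{
	return n ? 2 : 1;
}

/* With deferral: one write per SKETCH_MAX_DELAYED TLBIs, plus one for SYNC. */
static unsigned int dsb_deferred(unsigned int n)
{
	return n / SKETCH_MAX_DELAYED + 1;
}
```

For n = 4 these give 5, 2 and 1 respectively, which is the single
remaining dsb that the paragraph above refers to.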


> potentially toggle this function pointer based on the compatible string too,
> if it shows only to benefit some systems.

[
On 2017/9/19 12:31, Nate Watterson wrote:
I tested these (2) patches on QDF2400 hardware and saw performance
improvements in line with those I reported when testing the original
series.
]

I'm not sure whether this patch alone improves performance on QDF2400, because the two patches were tested
together. But at least it seems harmless, and the same may hold for other hardware platforms.

>
> Will
>


-- 
Thanks!
Best Regards

Will Deacon Oct. 19, 2017, 9:12 a.m. | #3
On Thu, Oct 19, 2017 at 11:00:45AM +0800, Leizhen (ThunderTown) wrote:
> On 2017/10/18 20:58, Will Deacon wrote:
> > Hi Thunder,
> >
> > On Tue, Sep 12, 2017 at 09:00:36PM +0800, Zhen Lei wrote:
> >> Because all TLBI commands should be followed by a SYNC command, to make
> >> sure that it has been completely finished. So we can just add the TLBI
> >> commands into the queue, and put off the execution until meet SYNC or
> >> other commands. To prevent the followed SYNC command waiting for a long
> >> time because of too many commands have been delayed, restrict the max
> >> delayed number.
> >>
> >> According to my test, I got the same performance data as I replaced writel
> >> with writel_relaxed in queue_inc_prod.
> >>
> >> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> >> ---
> >>  drivers/iommu/arm-smmu-v3.c | 42 +++++++++++++++++++++++++++++++++++++-----
> >>  1 file changed, 37 insertions(+), 5 deletions(-)
> >
> > If we want to go down the route of explicit command batching, I'd much
> > rather do it by implementing the iotlb_range_add callback in the driver,
> > and have a fixed-length array of batched ranges on the domain. We could
> I think even if iotlb_range_add callback is implemented, this patch is still valuable. The main purpose
> of this patch is to reduce dsb operation. So in the scenario with iotlb_range_add implemented:
> .iotlb_range_add:
> spin_lock_irqsave(&smmu->cmdq.lock, flags);
> ...
> add tlbi range-1 to cmq-queue
> ...
> add tlbi range-n to cmq-queue			//n
> dsb
> ...
> spin_unlock_irqrestore(&smmu->cmdq.lock, flags);
>
> .iotlb_sync
> spin_lock_irqsave(&smmu->cmdq.lock, flags);
> ...
> add cmd_sync to cmq-queue
> dsb
> ...
> spin_unlock_irqrestore(&smmu->cmdq.lock, flags);
>
> Although iotlb_range_add can reduce n-1 dsb operations, but there are
> still 1 left. If n is not large enough, this patch is helpful.


Then pick an n that is large enough, based on the compatible string.

Will

Patch

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index e67ba6c..ef42c4b 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -337,6 +337,7 @@ 
 /* Command queue */
 #define CMDQ_ENT_DWORDS			2
 #define CMDQ_MAX_SZ_SHIFT		8
+#define CMDQ_MAX_DELAYED		32
 
 #define CMDQ_ERR_SHIFT			24
 #define CMDQ_ERR_MASK			0x7f
@@ -482,6 +483,7 @@  struct arm_smmu_cmdq_ent {
 			};
 		} cfgi;
 
+		#define CMDQ_OP_TLBI_NH_ALL	0x10
 		#define CMDQ_OP_TLBI_NH_ASID	0x11
 		#define CMDQ_OP_TLBI_NH_VA	0x12
 		#define CMDQ_OP_TLBI_EL2_ALL	0x20
@@ -509,6 +511,7 @@  struct arm_smmu_cmdq_ent {
 
 struct arm_smmu_queue {
 	int				irq; /* Wired interrupt */
+	u32				nr_delay;
 
 	__le64				*base;
 	dma_addr_t			base_dma;
@@ -745,11 +748,16 @@  static int queue_sync_prod(struct arm_smmu_queue *q)
 	return ret;
 }
 
-static void queue_inc_prod(struct arm_smmu_queue *q)
+static void queue_inc_swprod(struct arm_smmu_queue *q)
 {
-	u32 prod = (Q_WRP(q, q->prod) | Q_IDX(q, q->prod)) + 1;
+	u32 prod = q->prod + 1;
 
 	q->prod = Q_OVF(q, q->prod) | Q_WRP(q, prod) | Q_IDX(q, prod);
+}
+
+static void queue_inc_prod(struct arm_smmu_queue *q)
+{
+	queue_inc_swprod(q);
 	writel(q->prod, q->prod_reg);
 }
 
@@ -791,13 +799,24 @@  static void queue_write(__le64 *dst, u64 *src, size_t n_dwords)
 		*dst++ = cpu_to_le64(*src++);
 }
 
-static int queue_insert_raw(struct arm_smmu_queue *q, u64 *ent)
+static int queue_insert_raw(struct arm_smmu_queue *q, u64 *ent, int optimize)
 {
 	if (queue_full(q))
 		return -ENOSPC;
 
 	queue_write(Q_ENT(q, q->prod), ent, q->ent_dwords);
-	queue_inc_prod(q);
+
+	/*
+	 * Don't let too many commands be delayed, otherwise the following
+	 * sync command may have to wait for a long time.
+	 */
+	if (optimize && (++q->nr_delay < CMDQ_MAX_DELAYED)) {
+		queue_inc_swprod(q);
+	} else {
+		queue_inc_prod(q);
+		q->nr_delay = 0;
+	}
+
 	return 0;
 }
 
@@ -939,6 +958,7 @@  static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
 static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
 				    struct arm_smmu_cmdq_ent *ent)
 {
+	int optimize = 0;
 	u64 cmd[CMDQ_ENT_DWORDS];
 	unsigned long flags;
 	bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
@@ -950,8 +970,17 @@  static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
 		return;
 	}
 
+	/*
+	 * All TLBI commands must be followed by a sync command later.
+	 * The same is true of CFGI commands, but they are rarely issued,
+	 * so only TLBI commands are delayed, to keep the check cheap.
+	 */
+	if ((ent->opcode >= CMDQ_OP_TLBI_NH_ALL) &&
+	    (ent->opcode <= CMDQ_OP_TLBI_NSNH_ALL))
+		optimize = 1;
+
 	spin_lock_irqsave(&smmu->cmdq.lock, flags);
-	while (queue_insert_raw(q, cmd) == -ENOSPC) {
+	while (queue_insert_raw(q, cmd, optimize) == -ENOSPC) {
 		if (queue_poll_cons(q, false, wfe))
 			dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
 	}
@@ -2001,6 +2030,8 @@  static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
 		     << Q_BASE_LOG2SIZE_SHIFT;
 
 	q->prod = q->cons = 0;
+	q->nr_delay = 0;
+
 	return 0;
 }
 
@@ -2584,6 +2615,7 @@  static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 		dev_err(smmu->dev, "unit-length command queue not supported\n");
 		return -ENXIO;
 	}
+	BUILD_BUG_ON(CMDQ_MAX_DELAYED >= (1 << CMDQ_MAX_SZ_SHIFT));
 
 	smmu->evtq.q.max_n_shift = min((u32)EVTQ_MAX_SZ_SHIFT,
 				       reg >> IDR1_EVTQ_SHIFT & IDR1_EVTQ_MASK);