[v3] iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages()

Message ID 1542812051-178935-1-git-send-email-john.garry@huawei.com
State Superseded

Commit Message

John Garry Nov. 21, 2018, 2:54 p.m. UTC
From: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>


Change function __iommu_dma_alloc_pages() to allocate pages for DMA from
the device's NUMA node. The ternary operator which would otherwise be
needed for alloc_pages_node() is tidied up along with this.

We also include a change to use kvzalloc() in place of the open-coded
kzalloc()/vzalloc() combination.

Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>

[JPG: Added kvzalloc(), drop pages ** being device local, tidied ternary operator]
Signed-off-by: John Garry <john.garry@huawei.com>


-- 
1.9.1

Comments

Will Deacon Nov. 21, 2018, 4:07 p.m. UTC | #1
On Wed, Nov 21, 2018 at 10:54:10PM +0800, John Garry wrote:
> From: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
>
> Change function __iommu_dma_alloc_pages() to allocate pages for DMA from
> respective device NUMA node. The ternary operator which would be for
> alloc_pages_node() is tidied along with this.
>
> We also include a change to use kvzalloc() for kzalloc()/vzalloc()
> combination.
>
> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
> [JPG: Added kvzalloc(), drop pages ** being device local, tidied ternary operator]
> Signed-off-by: John Garry <john.garry@huawei.com>

Weird, you're missing a diffstat here.

Anyway, the patch looks fine to me, but it would be nice if you could
justify the change with some numbers. Do you actually see an improvement
from this change?

Will
John Garry Nov. 21, 2018, 4:47 p.m. UTC | #2
On 21/11/2018 16:07, Will Deacon wrote:
> On Wed, Nov 21, 2018 at 10:54:10PM +0800, John Garry wrote:
>> From: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
>>
>> Change function __iommu_dma_alloc_pages() to allocate pages for DMA from
>> respective device NUMA node. The ternary operator which would be for
>> alloc_pages_node() is tidied along with this.
>>
>> We also include a change to use kvzalloc() for kzalloc()/vzalloc()
>> combination.
>>
>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
>> [JPG: Added kvzalloc(), drop pages ** being device local, tidied ternary operator]
>> Signed-off-by: John Garry <john.garry@huawei.com>
>
> Weird, you're missing a diffstat here.
>
> Anyway, the patch looks fine to me, but it would be nice if you could
> justify the change with some numbers. Do you actually see an improvement
> from this change?

Hi Will,

Ah, I missed adding my comments explaining the motivation. It would be 
better in the commit log. Anyway, here's the snippet:

" ... as mentioned in [3], dma_alloc_coherent() uses the locality
information from the device - as in direct DMA - so this patch is just
applying this same policy.

[3] 
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html"

I did have some numbers to show improvement in some scenarios when I 
tested this a while back which I'll dig out.

However I would say that some scenarios will improve and the opposite 
for others with this change, considering different conditions in which 
DMA memory may be used.

Cheers,
John

> Will
>
> .
>
Will Deacon Nov. 21, 2018, 4:57 p.m. UTC | #3
On Wed, Nov 21, 2018 at 04:47:48PM +0000, John Garry wrote:
> On 21/11/2018 16:07, Will Deacon wrote:
> >On Wed, Nov 21, 2018 at 10:54:10PM +0800, John Garry wrote:
> >>From: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
> >>
> >>Change function __iommu_dma_alloc_pages() to allocate pages for DMA from
> >>respective device NUMA node. The ternary operator which would be for
> >>alloc_pages_node() is tidied along with this.
> >>
> >>We also include a change to use kvzalloc() for kzalloc()/vzalloc()
> >>combination.
> >>
> >>Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
> >>[JPG: Added kvzalloc(), drop pages ** being device local, tidied ternary operator]
> >>Signed-off-by: John Garry <john.garry@huawei.com>
> >
> >Weird, you're missing a diffstat here.
> >
> >Anyway, the patch looks fine to me, but it would be nice if you could
> >justify the change with some numbers. Do you actually see an improvement
> >from this change?
> >
> 
> Hi Will,
> 
> Ah, I missed adding my comments explaining the motivation. It would be
> better in the commit log. Anyway, here's the snippet:
> 
> " ... as mentioned in [3], dma_alloc_coherent() uses the locality
> information from the device - as in direct DMA - so this patch is just
> applying this same policy.
> 
> [3]
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html"

Yes, please add this to the commit log.

> I did have some numbers to show improvement in some scenarios when I tested
> this a while back which I'll dig out.
> 
> However I would say that some scenarios will improve and the opposite for
> others with this change, considering different conditions in which DMA
> memory may be used.

Well, if you can show that it's useful in some cases and not catastrophic in
others, then I think shooting for parity with direct DMA is a reasonable
justification for the change.

Will
John Garry Nov. 23, 2018, 4:40 p.m. UTC | #4
On 21/11/2018 16:57, Will Deacon wrote:
> On Wed, Nov 21, 2018 at 04:47:48PM +0000, John Garry wrote:
>> On 21/11/2018 16:07, Will Deacon wrote:
>>> On Wed, Nov 21, 2018 at 10:54:10PM +0800, John Garry wrote:
>>>> From: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
>>>>
>>>> Change function __iommu_dma_alloc_pages() to allocate pages for DMA from
>>>> respective device NUMA node. The ternary operator which would be for
>>>> alloc_pages_node() is tidied along with this.
>>>>
>>>> We also include a change to use kvzalloc() for kzalloc()/vzalloc()
>>>> combination.
>>>>
>>>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
>>>> [JPG: Added kvzalloc(), drop pages ** being device local, tidied ternary operator]
>>>> Signed-off-by: John Garry <john.garry@huawei.com>
>>>
>>> Weird, you're missing a diffstat here.
>>>
>>> Anyway, the patch looks fine to me, but it would be nice if you could
>>> justify the change with some numbers. Do you actually see an improvement
>>> from this change?
>>>
>>
>> Hi Will,
>>
>> Ah, I missed adding my comments explaining the motivation. It would be
>> better in the commit log. Anyway, here's the snippet:
>>
>> " ... as mentioned in [3], dma_alloc_coherent() uses the locality
>> information from the device - as in direct DMA - so this patch is just
>> applying this same policy.
>>
>> [3]
>> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html"
>
> Yes, please add this to the commit log.
>

Sure,

>> I did have some numbers to show improvement in some scenarios when I tested
>> this a while back which I'll dig out.
>>
>> However I would say that some scenarios will improve and the opposite for
>> others with this change, considering different conditions in which DMA
>> memory may be used.
>
> Well, if you can show that it's useful in some cases and not catastrophic in
> others, then I think shooting for parity with direct DMA is a reasonable
> justification for the change.
>

So I have done some more testing with our SoC crypto engine, using 
tcrypt module. The reason I used this device was because we can utilise 
a local device per socket in the system, unlike other DMAing devices, 
which generally only exist on a single socket (for us, anyway).

Overall the results aren't brilliant - as expected - but show a general 
marginal improvement. Here's some figures:

Summary:
Average diff		+0.9%
Max diff		+1.5%
Min diff		+0.2%

Test (ops/second)			before	after	diff %

async ecb(aes) encrypt
test 0 (128 bit key, 16 byte blocks)	68717	69057	0.5
test 1 (128 bit key, 64 byte blocks):	72633	73163	0.7
test 2 (128 bit key, 256 byte blocks):	71475	72076	0.8
test 3 (128 bit key, 1024 byte blocks):	66819	67467	1.0
test 4 (128 bit key, 8192 byte blocks):	38237	38495	0.7
test 5 (192 bit key, 16 byte blocks): 	70273	71079	1.2
test 6 (192 bit key, 64 byte blocks):	72455	73292	1.2
test 7 (192 bit key, 256 byte blocks):	71085	71876	1.1
test 8 (192 bit key, 1024 byte blocks):	65891	66576	1.0
test 9 (192 bit key, 8192 byte blocks): 34846	35061	0.6
test 10 (256 bit key, 16 byte blocks):	72927	73762	1.2
test 11 (256 bit key, 64 byte blocks):	72361	73207	1.2
test 12 (256 bit key, 256 byte blocks):	70907	71602	1.0
test 13 (256 bit key, 1024 byte blocks):65035	65653	1.0
test 14 (256 bit key, 8192 byte blocks):32835	32998	0.5
async ecb(aes) decrypt
test 0 (128 bit key, 16 byte blocks)	68384	69130	1.1
test 1 (128 bit key, 64 byte blocks):	72645	73133	0.7
test 2 (128 bit key, 256 byte blocks):	71523	71912	0.5
test 3 (128 bit key, 1024 byte blocks):	66902	67258	0.5
test 4 (128 bit key, 8192 byte blocks):	38260	38434	0.5
test 5 (192 bit key, 16 byte blocks): 	70301	70816	0.7
test 6 (192 bit key, 64 byte blocks):	72473	73064	0.8
test 7 (192 bit key, 256 byte blocks):	71106	71663	0.8
test 8 (192 bit key, 1024 byte blocks):	65915	66363	0.7
test 9 (192 bit key, 8192 byte blocks): 34876	35006	0.4
test 10 (256 bit key, 16 byte blocks):	72969	73519	0.8
test 11 (256 bit key, 64 byte blocks):	72404	72925	0.7
test 12 (256 bit key, 256 byte blocks):	70861	71350	0.7
test 13 (256 bit key, 1024 byte blocks):65074	65485	0.6
test 14 (256 bit key, 8192 byte blocks):32861	32974	0.3
async cbc(aes) encrypt
test 0 (128 bit key, 16 byte blocks)	58306	59131	1.4
test 1 (128 bit key, 64 byte blocks):	61647	62565	1.5
test 2 (128 bit key, 256 byte blocks):	60841	61666	1.4
test 3 (128 bit key, 1024 byte blocks):	57503	58204	1.2
test 4 (128 bit key, 8192 byte blocks):	34760	35055	0.9
test 5 (192 bit key, 16 byte blocks): 	59684	60452	1.3
test 6 (192 bit key, 64 byte blocks):	61705	62516	1.3
test 7 (192 bit key, 256 byte blocks):	60678	61426	1.2
test 8 (192 bit key, 1024 byte blocks):	56805	57487	1.2
test 9 (192 bit key, 8192 byte blocks): 31836	32093	0.8
test 10 (256 bit key, 16 byte blocks):	61961	62785	1.3
test 11 (256 bit key, 64 byte blocks):	61584	62427	1.4
test 12 (256 bit key, 256 byte blocks):	60407	61246	1.4
test 13 (256 bit key, 1024 byte blocks):56135	56868	1.3
test 14 (256 bit key, 8192 byte blocks):30128	30380	0.8
async cbc(aes) decrypt
test 0 (128 bit key, 16 byte blocks)	58555	59044	0.8
test 1 (128 bit key, 64 byte blocks):	61853	62589	1.2
test 2 (128 bit key, 256 byte blocks):	60992	61728	1.2
test 3 (128 bit key, 1024 byte blocks):	57591	58250	1.1
test 4 (128 bit key, 8192 byte blocks):	34796	35064	0.8
test 5 (192 bit key, 16 byte blocks): 	59843	60506	1.1
test 6 (192 bit key, 64 byte blocks):	61808	62521	1.2
test 7 (192 bit key, 256 byte blocks):	60800	61445	1.1
test 8 (192 bit key, 1024 byte blocks):	56949	57513	1.0
test 9 (192 bit key, 8192 byte blocks): 31890	32107	0.7
test 10 (256 bit key, 16 byte blocks):	62109	62778	1.1
test 11 (256 bit key, 64 byte blocks):	61748	62418	1.1
test 12 (256 bit key, 256 byte blocks):	60604	61226	1.0
test 13 (256 bit key, 1024 byte blocks):56277	56845	1.0
test 14 (256 bit key, 8192 byte blocks):30165	30340	0.6
sync xts(aes) encrypt
test 0 (256 bit key, 16 byte blocks):	59501	59742	0.4
test 1 (256 bit key, 64 byte blocks):	62894	63249	0.6
test 2 (256 bit key, 256 byte blocks):	62074	62476	0.7
test 3 (256 bit key, 1024 byte blocks):	58569	58894	0.6
test 4 (256 bit key, 8192 byte blocks):	35271	35462	0.5
test 5 (512 bit key, 16 byte blocks):	60749	61174	0.7
test 6 (512 bit key, 64 byte blocks):	62760	63148	0.6
test 7 (512 bit key, 256 byte blocks):	61625	61928	0.5
test 8 (512 bit key, 1024 byte blocks):	57162	57452	0.5
test 9 (512 bit key, 8192 byte blocks):	30582	30718	0.4
async xts(aes) decrypt
test 0 (256 bit key, 16 byte blocks):	59423	59587	0.3
test 1 (256 bit key, 64 byte blocks):	62849	62974	0.2
test 2 (256 bit key, 256 byte blocks):	61884	62231	0.6
test 3 (256 bit key, 1024 byte blocks):	58402	58699	0.5
test 4 (256 bit key, 8192 byte blocks):	35244	35353	0.3
test 5 (512 bit key, 16 byte blocks):	60693	60882	0.3
test 6 (512 bit key, 64 byte blocks):	62671	62896	0.4
test 7 (512 bit key, 256 byte blocks):	61488	61722	0.4
test 8 (512 bit key, 1024 byte blocks):	57092	57261	0.3
test 9 (512 bit key, 8192 byte blocks):	30572	30655	0.3
sync ctr(aes) encrypt	
test 0 (128 bit key, 16 byte blocks):	59673	59926	0.4
test 1 (128 bit key, 64 byte blocks):	62738	63024	0.5
test 2 (128 bit key, 256 byte blocks):	61841	62171	0.5
test 3 (128 bit key, 1024 byte blocks):	58374	58633	0.4
test 4 (128 bit key, 8192 byte blocks):	35137	35273	0.4
test 5 (192 bit key, 16 byte blocks):	60736	61049	0.5
test 6 (192 bit key, 64 byte blocks):	62718	63081	0.6
test 7 (192 bit key, 256 byte blocks):	61652	61973	0.5
test 8 (192 bit key, 1024 byte blocks):	57709	58002	0.5
test 9 (192 bit key, 8192 byte blocks):	32171	32306	0.4
test 10 (256 bit key, 16 byte blocks):	63000	63428	0.7
test 11 (256 bit key, 64 byte blocks):	62594	63027	0.7
test 12 (256 bit key, 256 byte blocks):	61452	61845	0.6
test 13 (256 bit key, 1024 byte blocks):57019	57346	0.6
test 14 (256 bit key, 8192 byte blocks):30422	30544	0.4
async ctr(aes) decrypt
test 0 (128 bit key, 16 byte blocks):	59488	59994	0.9
test 1 (128 bit key, 64 byte blocks):	62651	63173	0.8
test 2 (128 bit key, 256 byte blocks):	61778	62464	1.1
test 3 (128 bit key, 1024 byte blocks):	58290	58909	1.1
test 4 (128 bit key, 8192 byte blocks):	35115	35358	0.7
test 5 (192 bit key, 16 byte blocks):	60590	61244	1.1
test 6 (192 bit key, 64 byte blocks):	62604	63296	1.1
test 7 (192 bit key, 256 byte blocks):	61540	62199	1.1
test 8 (192 bit key, 1024 byte blocks):	57577	58149	1.0
test 9 (192 bit key, 8192 byte blocks):	32146	32368	0.7
test 10 (256 bit key, 16 byte blocks):	62852	63610	1.2
test 11 (256 bit key, 64 byte blocks):	62523	63276	1.2
test 12 (256 bit key, 256 byte blocks):	61356	62049	1.1
test 13 (256 bit key, 1024 byte blocks):56943	57546	1.1
test 14 (256 bit key, 8192 byte blocks):30399	30593	0.6
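For reference, the diff column is just the relative change in ops/second between the before and after runs. Taking a few rows from the async ecb(aes) encrypt results above (illustrative Python, not part of the test setup):

```python
# Recompute the diff % column from the before/after ops-per-second figures.
# Rows taken from the "async ecb(aes) encrypt" results above.
rows = [
    ("test 0 (128 bit key, 16 byte blocks)", 68717, 69057),
    ("test 1 (128 bit key, 64 byte blocks)", 72633, 73163),
    ("test 3 (128 bit key, 1024 byte blocks)", 66819, 67467),
]

for name, before, after in rows:
    diff_pct = (after - before) / before * 100.0
    print(f"{name}: {diff_pct:.1f}%")   # 0.5, 0.7 and 1.0, matching the table
```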

Let me know if any more questions or comments.

Thanks,
John

> Will
>
> .
>

Patch

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d1b0475..4afb1a8 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -449,20 +449,17 @@  static void __iommu_dma_free_pages(struct page **pages, int count)
 	kvfree(pages);
 }
 
-static struct page **__iommu_dma_alloc_pages(unsigned int count,
-		unsigned long order_mask, gfp_t gfp)
+static struct page **__iommu_dma_alloc_pages(struct device *dev,
+		unsigned int count, unsigned long order_mask, gfp_t gfp)
 {
 	struct page **pages;
-	unsigned int i = 0, array_size = count * sizeof(*pages);
+	unsigned int i = 0, nid = dev_to_node(dev);
 
 	order_mask &= (2U << MAX_ORDER) - 1;
 	if (!order_mask)
 		return NULL;
 
-	if (array_size <= PAGE_SIZE)
-		pages = kzalloc(array_size, GFP_KERNEL);
-	else
-		pages = vzalloc(array_size);
+	pages = kvzalloc(count * sizeof(*pages), GFP_KERNEL);
 	if (!pages)
 		return NULL;
 
@@ -481,10 +478,12 @@  static struct page **__iommu_dma_alloc_pages(unsigned int count,
 		for (order_mask &= (2U << __fls(count)) - 1;
 		     order_mask; order_mask &= ~order_size) {
 			unsigned int order = __fls(order_mask);
+			gfp_t alloc_flags = gfp;
 
 			order_size = 1U << order;
-			page = alloc_pages((order_mask - order_size) ?
-					   gfp | __GFP_NORETRY : gfp, order);
+			if (order_mask > order_size)
+				alloc_flags |= __GFP_NORETRY;
+			page = alloc_pages_node(nid, alloc_flags, order);
 			if (!page)
 				continue;
 			if (!order)
@@ -569,7 +568,8 @@  struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
 		alloc_sizes = min_size;
 
 	count = PAGE_ALIGN(size) >> PAGE_SHIFT;
-	pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp);
+	pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT,
+					gfp);
 	if (!pages)
 		return NULL;
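For anyone following the allocation loop in the second hunk: it repeatedly takes the highest order still set in order_mask, attempts the allocation, and on failure clears that order and retries with a smaller one. A rough Python model of that order-selection logic (simulating success/failure with a predicate instead of the kernel allocator; the names here are illustrative, not kernel API):

```python
def pick_orders(count, order_mask, can_alloc):
    """Model of the order-selection loop in __iommu_dma_alloc_pages():
    cover `count` pages using the largest orders permitted by order_mask,
    falling back to smaller orders when an allocation "fails".
    can_alloc(order) simulates whether a 2**order contiguous block is
    available. Returns the list of chosen orders, or None on failure."""
    result = []
    while count:
        # Mirror `order_mask &= (2U << __fls(count)) - 1`: never pick an
        # order larger than the remaining page count.
        mask = order_mask & ((2 << (count.bit_length() - 1)) - 1)
        chosen = None
        while mask:
            order = mask.bit_length() - 1   # __fls(): highest set bit
            if can_alloc(order):
                chosen = order
                break
            mask &= ~(1 << order)           # drop this order, try smaller
        if chosen is None:
            return None                     # even order 0 failed
        result.append(chosen)
        count -= 1 << chosen
    return result

# With orders 0..2 allowed and every allocation succeeding, 8 pages are
# covered as two order-2 (4-page) blocks.
print(pick_orders(8, 0b111, lambda order: True))   # → [2, 2]
```

When higher orders fail, the model degrades gracefully to single pages, which is the behaviour the `continue` in the kernel loop provides.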