Revert "arm64: Increase the max granular size"

Message ID 20160321171403.GE25466@e104818-lin.cambridge.arm.com
State New
Headers show

Commit Message

Catalin Marinas March 21, 2016, 5:14 p.m.
On Fri, Mar 18, 2016 at 09:05:37PM +0000, Chalamarla, Tirumalesh wrote:
> On 3/16/16, 2:32 AM, "linux-arm-kernel on behalf of Ganesh Mahendran" <linux-arm-kernel-bounces@lists.infradead.org on behalf of opensource.ganesh@gmail.com> wrote:

> >Reverts commit 97303480753e ("arm64: Increase the max granular size").

> >

> >The commit 97303480753e ("arm64: Increase the max granular size") will

> >degrade system performente in some cpus.

> >

> >We test wifi network throughput with iperf on Qualcomm msm8996 CPU:

> >----------------

> >run on host:

> >  # iperf -s

> >run on device:

> >  # iperf -c <device-ip-addr> -t 100 -i 1

> >----------------

> >

> >Test result:

> >----------------

> >with commit 97303480753e ("arm64: Increase the max granular size"):

> >    172MBits/sec

> >

> >without commit 97303480753e ("arm64: Increase the max granular size"):

> >    230MBits/sec

> >----------------

> >

> >Some module like slab/net will use the L1_CACHE_SHIFT, so if we do not

> >set the parameter correctly, it may affect the system performance.

> >

> >So revert the commit.

> 

> Is there any explanation why is this so? May be there is an

> alternative to this, apart from reverting the commit.


I agree we need an explanation but in the meantime, this patch has
caused a regression on certain systems.

> Until now it seems L1_CACHE_SHIFT is the max of supported chips. But

> now we are making it 64byte, is there any reason why not 32. 


We may have to revisit this logic and consider L1_CACHE_BYTES the
_minimum_ of cache line sizes in arm64 systems supported by the kernel.
Do you have any benchmarks on Cavium boards that would show significant
degradation with 64-byte L1_CACHE_BYTES vs 128?

For non-coherent DMA, the simplest is to make ARCH_DMA_MINALIGN the
_maximum_ of the supported systems:

--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Will Deacon March 21, 2016, 5:23 p.m. | #1
On Mon, Mar 21, 2016 at 05:14:03PM +0000, Catalin Marinas wrote:
> On Fri, Mar 18, 2016 at 09:05:37PM +0000, Chalamarla, Tirumalesh wrote:

> > On 3/16/16, 2:32 AM, "linux-arm-kernel on behalf of Ganesh Mahendran" <linux-arm-kernel-bounces@lists.infradead.org on behalf of opensource.ganesh@gmail.com> wrote:

> > >Reverts commit 97303480753e ("arm64: Increase the max granular size").

> > >

> > >The commit 97303480753e ("arm64: Increase the max granular size") will

> > >degrade system performente in some cpus.

> > >

> > >We test wifi network throughput with iperf on Qualcomm msm8996 CPU:

> > >----------------

> > >run on host:

> > >  # iperf -s

> > >run on device:

> > >  # iperf -c <device-ip-addr> -t 100 -i 1

> > >----------------

> > >

> > >Test result:

> > >----------------

> > >with commit 97303480753e ("arm64: Increase the max granular size"):

> > >    172MBits/sec

> > >

> > >without commit 97303480753e ("arm64: Increase the max granular size"):

> > >    230MBits/sec

> > >----------------

> > >

> > >Some module like slab/net will use the L1_CACHE_SHIFT, so if we do not

> > >set the parameter correctly, it may affect the system performance.

> > >

> > >So revert the commit.

> > 

> > Is there any explanation why is this so? May be there is an

> > alternative to this, apart from reverting the commit.

> 

> I agree we need an explanation but in the meantime, this patch has

> caused a regression on certain systems.

> 

> > Until now it seems L1_CACHE_SHIFT is the max of supported chips. But

> > now we are making it 64byte, is there any reason why not 32. 

> 

> We may have to revisit this logic and consider L1_CACHE_BYTES the

> _minimum_ of cache line sizes in arm64 systems supported by the kernel.

> Do you have any benchmarks on Cavium boards that would show significant

> degradation with 64-byte L1_CACHE_BYTES vs 128?

> 

> For non-coherent DMA, the simplest is to make ARCH_DMA_MINALIGN the

> _maximum_ of the supported systems:

> 

> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h

> index 5082b30bc2c0..4b5d7b27edaf 100644

> --- a/arch/arm64/include/asm/cache.h

> +++ b/arch/arm64/include/asm/cache.h

> @@ -18,17 +18,17 @@

>  

>  #include <asm/cachetype.h>

>  

> -#define L1_CACHE_SHIFT		7

> +#define L1_CACHE_SHIFT		6

>  #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)

>  

>  /*

>   * Memory returned by kmalloc() may be used for DMA, so we must make

> - * sure that all such allocations are cache aligned. Otherwise,

> - * unrelated code may cause parts of the buffer to be read into the

> - * cache before the transfer is done, causing old data to be seen by

> - * the CPU.

> + * sure that all such allocations are aligned to the maximum *known*

> + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause

> + * parts of the buffer to be read into the cache before the transfer is

> + * done, causing old data to be seen by the CPU.

>   */

> -#define ARCH_DMA_MINALIGN	L1_CACHE_BYTES

> +#define ARCH_DMA_MINALIGN	(128)


Does this actually fix the reported iperf regression? My assumption was
that ARCH_DMA_MINALIGN is the problem, but I could be wrong.

Will
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Catalin Marinas March 21, 2016, 5:33 p.m. | #2
On Mon, Mar 21, 2016 at 05:23:01PM +0000, Will Deacon wrote:
> On Mon, Mar 21, 2016 at 05:14:03PM +0000, Catalin Marinas wrote:

> > diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h

> > index 5082b30bc2c0..4b5d7b27edaf 100644

> > --- a/arch/arm64/include/asm/cache.h

> > +++ b/arch/arm64/include/asm/cache.h

> > @@ -18,17 +18,17 @@

> >  

> >  #include <asm/cachetype.h>

> >  

> > -#define L1_CACHE_SHIFT		7

> > +#define L1_CACHE_SHIFT		6

> >  #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)

> >  

> >  /*

> >   * Memory returned by kmalloc() may be used for DMA, so we must make

> > - * sure that all such allocations are cache aligned. Otherwise,

> > - * unrelated code may cause parts of the buffer to be read into the

> > - * cache before the transfer is done, causing old data to be seen by

> > - * the CPU.

> > + * sure that all such allocations are aligned to the maximum *known*

> > + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause

> > + * parts of the buffer to be read into the cache before the transfer is

> > + * done, causing old data to be seen by the CPU.

> >   */

> > -#define ARCH_DMA_MINALIGN	L1_CACHE_BYTES

> > +#define ARCH_DMA_MINALIGN	(128)

> 

> Does this actually fix the reported iperf regression? My assumption was

> that ARCH_DMA_MINALIGN is the problem, but I could be wrong.


I can't tell. But since I haven't seen any better explanation in this
thread yet, I hope that at least someone would try this patch and come
back with numbers.

For networking, SKB_DATA_ALIGN() uses SMP_CACHE_BYTES (== L1_CACHE_BYTES).
I think (hope) this alignment is not meant for non-coherent DMA,
otherwise using SMP_CACHE_BYTES wouldn't make sense.

-- 
Catalin
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
index 5082b30bc2c0..4b5d7b27edaf 100644
--- a/arch/arm64/include/asm/cache.h
+++ b/arch/arm64/include/asm/cache.h
@@ -18,17 +18,17 @@ 
 
 #include <asm/cachetype.h>
 
-#define L1_CACHE_SHIFT		7
+#define L1_CACHE_SHIFT		6
 #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
 
 /*
  * Memory returned by kmalloc() may be used for DMA, so we must make
- * sure that all such allocations are cache aligned. Otherwise,
- * unrelated code may cause parts of the buffer to be read into the
- * cache before the transfer is done, causing old data to be seen by
- * the CPU.
+ * sure that all such allocations are aligned to the maximum *known*
+ * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
+ * parts of the buffer to be read into the cache before the transfer is
+ * done, causing old data to be seen by the CPU.
  */
-#define ARCH_DMA_MINALIGN	L1_CACHE_BYTES
+#define ARCH_DMA_MINALIGN	(128)
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 392c67eb9fa6..30bafca1aebf 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -976,9 +976,9 @@  void __init setup_cpu_features(void)
 	if (!cwg)
 		pr_warn("No Cache Writeback Granule information, assuming cache line size %d\n",
 			cls);
-	if (L1_CACHE_BYTES < cls)
-		pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback Granule (%d < %d)\n",
-			L1_CACHE_BYTES, cls);
+	if (ARCH_DMA_MINALIGN < cls)
+		pr_warn("ARCH_DMA_MINALIGN smaller than the Cache Writeback Granule (%d < %d)\n",
+			ARCH_DMA_MINALIGN, cls);
 }
 
 static bool __maybe_unused