
[RFC] x86/tlb: just do tlb flush on one of siblings of SMT

Message ID 1459912457-5630-1-git-send-email-alex.shi@linaro.org
State New

Commit Message

Alex Shi April 6, 2016, 3:14 a.m. UTC
It seems Intel cores still share the TLB pool between SMT siblings, so
flushing both threads' TLBs just causes an extra, useless IPI and an
extra flush. The extra flush also throws away TLB entries that the
other thread has just brought in. That's a double waste.

Micro-benchmarking shows memory access time improves by about 25% on my
Haswell i7 desktop.
The munmap benchmark source code is here: https://lkml.org/lkml/2012/5/17/59
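
For context, below is a minimal sketch of the kind of map/touch/unmap
timing loop such a munmap micro-benchmark runs. It is not the benchmark
linked above (which is multi-threaded and takes -n/-t arguments); the
mapping size, iteration count, and output format here are only
illustrative assumptions.

/*
 * Illustrative map/touch/unmap timing loop. The real test is at the
 * LKML URL above; sizes and iteration counts here are assumptions.
 */
#include <stdio.h>
#include <time.h>
#include <sys/mman.h>

#define MAP_SIZE	(64UL << 20)	/* 64 MB per iteration, assumed */
#define ITERATIONS	64

static long long ns_since(const struct timespec *start)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return (now.tv_sec - start->tv_sec) * 1000000000LL +
	       (now.tv_nsec - start->tv_nsec);
}

int main(void)
{
	struct timespec start;
	long long access_ns = 0, unmap_ns = 0;

	for (int i = 0; i < ITERATIONS; i++) {
		char *buf = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		/* Touch every page so munmap has TLB entries to tear down. */
		clock_gettime(CLOCK_MONOTONIC, &start);
		for (size_t off = 0; off < MAP_SIZE; off += 4096)
			buf[off] = 1;
		access_ns += ns_since(&start);

		clock_gettime(CLOCK_MONOTONIC, &start);
		munmap(buf, MAP_SIZE);
		unmap_ns += ns_since(&start);
	}

	printf("avg access %lld ns/iter, avg munmap %lld ns/iter\n",
	       access_ns / ITERATIONS, unmap_ns / ITERATIONS);
	return 0;
}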

test result on Kernel v4.5.0:
$/home/alexs/bin/perf stat -e dTLB-load-misses,dTLB-loads,dTLB-store-misses,dTLB-stores,iTLB-load-misses,iTLB-loads -e tlb:tlb_flush munmap -n 64 -t 16
munmap use 57ms 14072ns/time, memory access uses 48356 times/thread/ms, cost 20ns/time

 Performance counter stats for '/home/alexs/backups/exec-laptop/tlb/munmap -n 64 -t 16':

        18,739,808      dTLB-load-misses          #    2.47% of all dTLB cache hits   (43.05%)
       757,380,911      dTLB-loads                                                    (34.34%)
         2,125,275      dTLB-store-misses                                             (32.23%)
       318,307,759      dTLB-stores                                                   (46.32%)
            32,765      iTLB-load-misses          #    2.03% of all iTLB cache hits   (56.90%)
         1,616,237      iTLB-loads                                                    (44.47%)
            41,476      tlb:tlb_flush

       1.443484546 seconds time elapsed

/proc/vmstat/nr_tlb_remote_flush increased: 4616
/proc/vmstat/nr_tlb_remote_flush_received increased: 32262

test result on Kernel v4.5.0 + this patch:
$/home/alexs/bin/perf stat -e dTLB-load-misses,dTLB-loads,dTLB-store-misses,dTLB-stores,iTLB-load-misses,iTLB-loads -e tlb:tlb_flush munmap -n 64 -t 16
munmap use 48ms 11933ns/time, memory access uses 59966 times/thread/ms, cost 16ns/time

 Performance counter stats for '/home/alexs/backups/exec-laptop/tlb/munmap -n 64 -t 16':

        15,984,772      dTLB-load-misses          #    1.89% of all dTLB cache hits   (41.72%)
       844,099,241      dTLB-loads                                                    (33.30%)
         1,328,102      dTLB-store-misses                                             (52.13%)
       280,902,875      dTLB-stores                                                   (52.03%)
            27,678      iTLB-load-misses          #    1.67% of all iTLB cache hits   (35.35%)
         1,659,550      iTLB-loads                                                    (38.38%)
            25,137      tlb:tlb_flush

       1.428880301 seconds time elapsed

/proc/vmstat/nr_tlb_remote_flush increased: 4616
/proc/vmstat/nr_tlb_remote_flush_received increased: 15912

BTW, this change isn't architecturally guaranteed.

Signed-off-by: Alex Shi <alex.shi@linaro.org>

Cc: Andrew Morton <akpm@linux-foundation.org>
To: linux-kernel@vger.kernel.org
To: Mel Gorman <mgorman@suse.de>
To: x86@kernel.org
To: "H. Peter Anvin" <hpa@zytor.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Alex Shi <alex.shi@linaro.org>
---
 arch/x86/mm/tlb.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

-- 
2.7.2.333.g70bd996

Comments

Alex Shi April 6, 2016, 5:15 a.m. UTC | #1
On 04/06/2016 12:47 PM, Andy Lutomirski wrote:
> On Apr 5, 2016 8:17 PM, "Alex Shi" <alex.shi@linaro.org> wrote:
>>
>> It seems Intel cores still share the TLB pool between SMT siblings, so
>> flushing both threads' TLBs just causes an extra, useless IPI and an
>> extra flush. The extra flush also throws away TLB entries that the
>> other thread has just brought in. That's a double waste.
>
> Do you have a reference in both the SDM and the APM for this?

No. As I said at the end of the commit log, there is no official
guarantee for this behavior, but it seems to work widely on Intel CPUs.

And the performance benefit is so tempting...
Would any Intel folks like to dig into this further? :)

> Do we have a guarantee that this serialized the front end such that
> the non-targetted sibling won't execute an instruction that it decoded
> from a stale translation?

Isn't your worry itself evidence for my guess? Otherwise a stale
instruction could already be executed before the IPI comes in... :)

> This will conflict rather deeply with my PCID series, too.
>
> --Andy

Patch

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 8f4cc3d..6510316 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -134,7 +134,10 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 				 struct mm_struct *mm, unsigned long start,
 				 unsigned long end)
 {
+	int cpu;
 	struct flush_tlb_info info;
+	cpumask_t flush_mask, *sblmask;
+
 	info.flush_mm = mm;
 	info.flush_start = start;
 	info.flush_end = end;
@@ -151,7 +154,23 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 								&info, 1);
 		return;
 	}
-	smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
+
+	if (unlikely(smp_num_siblings <= 1)) {
+		smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
+		return;
+	}
+
+	/* Only one flush needed on both siblings of SMT */
+	cpumask_copy(&flush_mask, cpumask);
+	for_each_cpu(cpu, &flush_mask) {
+		sblmask = topology_sibling_cpumask(cpu);
+		if (!cpumask_subset(sblmask, &flush_mask))
+			continue;
+
+		cpumask_clear_cpu(cpumask_next(cpu, sblmask), &flush_mask);
+	}
+
+	smp_call_function_many(&flush_mask, flush_tlb_func, &info, 1);
 }
 
 void flush_tlb_current_task(void)
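
To illustrate what the new loop in the patch above does, here is a small
userspace sketch of the same sibling-deduplication idea. It assumes plain
2-way SMT where CPUs 2k and 2k+1 are siblings; the kernel code instead
walks the real cpumask_t and consults topology_sibling_cpumask(), so the
names and layout below are simplifying assumptions, not kernel API.

/*
 * Userspace illustration of the patch's idea: when both SMT siblings of
 * a core are in the flush mask, drop one of them so only a single IPI
 * per physical core is sent. Assumes 2-way SMT with CPUs 2k and 2k+1 as
 * siblings; one bit per CPU, up to 64 CPUs for this sketch.
 */
#include <stdio.h>
#include <stdint.h>

static uint64_t drop_one_sibling(uint64_t flush_mask)
{
	for (int cpu = 0; cpu < 64; cpu += 2) {
		uint64_t pair = 3ULL << cpu;	/* bits for cpu and cpu+1 */

		/* Only skip a sibling if both siblings need the flush. */
		if ((flush_mask & pair) == pair)
			flush_mask &= ~(1ULL << (cpu + 1));
	}
	return flush_mask;
}

int main(void)
{
	uint64_t mask = 0xff;	/* CPUs 0-7 all want a flush */

	printf("before: %#llx\n", (unsigned long long)mask);
	printf("after:  %#llx\n", (unsigned long long)drop_one_sibling(mask));
	/* Expected: before 0xff, after 0x55 -> one sibling left per core. */
	return 0;
}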