Message ID: 1459912457-5630-1-git-send-email-alex.shi@linaro.org
State: New
On 04/06/2016 12:47 PM, Andy Lutomirski wrote:
> On Apr 5, 2016 8:17 PM, "Alex Shi" <alex.shi@linaro.org> wrote:
>>
>> It seems Intel cores still share the TLB pool between SMT siblings;
>> flushing both threads' TLBs just causes an extra useless IPI and an
>> extra flush. The extra flush evicts TLB entries that the other thread
>> just brought in. That's a double waste.
>
> Do you have a reference in both the SDM and the APM for this?

No, as I said at the end of the commit log. There is no official
guarantee for this behavior, but it seems to work widely on Intel CPUs,
and the performance benefit is very tempting. Would anyone from Intel
like to dig into it more? :)

> Do we have a guarantee that this serializes the front end such that
> the non-targeted sibling won't execute an instruction that it decoded
> from a stale translation?

Is your worry itself evidence for my guess? Otherwise the stale
instruction would have to happen before the IPI comes in anyway... :)

> This will conflict rather deeply with my PCID series, too.
>
> --Andy
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 8f4cc3d..6510316 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -134,7 +134,10 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 				 struct mm_struct *mm, unsigned long start,
 				 unsigned long end)
 {
+	int cpu;
 	struct flush_tlb_info info;
+	cpumask_t flush_mask, *sblmask;
+
 	info.flush_mm = mm;
 	info.flush_start = start;
 	info.flush_end = end;
@@ -151,7 +154,23 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 					&info, 1);
 		return;
 	}
-	smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
+
+	if (unlikely(smp_num_siblings <= 1)) {
+		smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
+		return;
+	}
+
+	/* Only one flush needed on both siblings of SMT */
+	cpumask_copy(&flush_mask, cpumask);
+	for_each_cpu(cpu, &flush_mask) {
+		sblmask = topology_sibling_cpumask(cpu);
+		if (!cpumask_subset(sblmask, &flush_mask))
+			continue;
+
+		cpumask_clear_cpu(cpumask_next(cpu, sblmask), &flush_mask);
+	}
+
+	smp_call_function_many(&flush_mask, flush_tlb_func, &info, 1);
 }
 
 void flush_tlb_current_task(void)
It seems Intel cores still share the TLB pool between SMT siblings;
flushing both threads' TLBs just causes an extra useless IPI and an
extra flush. The extra flush evicts TLB entries that the other thread
just brought in. That's a double waste.

The micro benchmark shows memory access can save about 25% time on my
Haswell i7 desktop. munmap source code is here:
https://lkml.org/lkml/2012/5/17/59

test result on Kernel v4.5.0:

$ /home/alexs/bin/perf stat -e dTLB-load-misses,dTLB-loads,dTLB-store-misses,dTLB-stores,iTLB-load-misses,iTLB-loads -e tlb:tlb_flush munmap -n 64 -t 16
munmap use 57ms 14072ns/time, memory access uses 48356 times/thread/ms, cost 20ns/time

 Performance counter stats for '/home/alexs/backups/exec-laptop/tlb/munmap -n 64 -t 16':

        18,739,808      dTLB-load-misses     #    2.47% of all dTLB cache hits   (43.05%)
       757,380,911      dTLB-loads                                               (34.34%)
         2,125,275      dTLB-store-misses                                        (32.23%)
       318,307,759      dTLB-stores                                              (46.32%)
            32,765      iTLB-load-misses     #    2.03% of all iTLB cache hits   (56.90%)
         1,616,237      iTLB-loads                                               (44.47%)
            41,476      tlb:tlb_flush

       1.443484546 seconds time elapsed

/proc/vmstat/nr_tlb_remote_flush increased: 4616
/proc/vmstat/nr_tlb_remote_flush_received increased: 32262

test result on Kernel v4.5.0 + this patch:

$ /home/alexs/bin/perf stat -e dTLB-load-misses,dTLB-loads,dTLB-store-misses,dTLB-stores,iTLB-load-misses,iTLB-loads -e tlb:tlb_flush munmap -n 64 -t 16
munmap use 48ms 11933ns/time, memory access uses 59966 times/thread/ms, cost 16ns/time

 Performance counter stats for '/home/alexs/backups/exec-laptop/tlb/munmap -n 64 -t 16':

        15,984,772      dTLB-load-misses     #    1.89% of all dTLB cache hits   (41.72%)
       844,099,241      dTLB-loads                                               (33.30%)
         1,328,102      dTLB-store-misses                                        (52.13%)
       280,902,875      dTLB-stores                                              (52.03%)
            27,678      iTLB-load-misses     #    1.67% of all iTLB cache hits   (35.35%)
         1,659,550      iTLB-loads                                               (38.38%)
            25,137      tlb:tlb_flush

       1.428880301 seconds time elapsed

/proc/vmstat/nr_tlb_remote_flush increased: 4616
/proc/vmstat/nr_tlb_remote_flush_received increased: 32262

BTW, this change isn't architecturally
guaranteed.

Signed-off-by: Alex Shi <alex.shi@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
To: linux-kernel@vger.kernel.org
To: Mel Gorman <mgorman@suse.de>
To: x86@kernel.org
To: "H. Peter Anvin" <hpa@zytor.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Alex Shi <alex.shi@linaro.org>
---
 arch/x86/mm/tlb.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

-- 
2.7.2.333.g70bd996