Message ID: 20210329120648.19040-3-mgorman@techsingularity.net
State: New
On Mon, Mar 29 2021 at 13:06, Mel Gorman wrote:
> There is a lack of clarity of what exactly local_irq_save/local_irq_restore
> protects in page_alloc.c . It conflates the protection of per-cpu page
> allocation structures with per-cpu vmstat deltas.
>
> This patch protects the PCP structure using local_lock which
> for most configurations is identical to IRQ enabling/disabling.
> The scope of the lock is still wider than it should be but this is
> decreased in later patches. The per-cpu vmstat deltas are protected by
> preempt_disable/preempt_enable where necessary instead of relying on
> IRQ disable/enable.

Yes, this goes into the right direction and I really appreciate the
scoped protection for clarity sake.

> #ifdef CONFIG_MEMORY_HOTREMOVE
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 8a8f1a26b231..01b74ff73549 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -887,6 +887,7 @@ void cpu_vm_stats_fold(int cpu)
>
>          pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
>
> +        preempt_disable();

What's the reason for the preempt_disable() here? A comment would be
appreciated.

>          for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
>              if (pzstats->vm_stat_diff[i]) {
>                  int v;
> @@ -908,6 +909,7 @@ void cpu_vm_stats_fold(int cpu)
>                      global_numa_diff[i] += v;
>                  }
>  #endif
> +        preempt_enable();
>      }
>
>      for_each_online_pgdat(pgdat) {
> @@ -941,6 +943,7 @@ void drain_zonestat(struct zone *zone, struct per_cpu_zonestat *pzstats)
>  {
>      int i;
>
> +    preempt_disable();

Same here.

>      for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
>          if (pzstats->vm_stat_diff[i]) {
>              int v = pzstats->vm_stat_diff[i];
> @@ -959,6 +962,7 @@ void drain_zonestat(struct zone *zone, struct per_cpu_zonestat *pzstats)
>              atomic_long_add(v, &vm_numa_stat[i]);
>          }
>  #endif
> +    preempt_enable();

Thanks,

        tglx
On Wed, Mar 31, 2021 at 11:55:56AM +0200, Thomas Gleixner wrote:
> On Mon, Mar 29 2021 at 13:06, Mel Gorman wrote:
> > There is a lack of clarity of what exactly local_irq_save/local_irq_restore
> > protects in page_alloc.c . It conflates the protection of per-cpu page
> > allocation structures with per-cpu vmstat deltas.
> >
> > This patch protects the PCP structure using local_lock which
> > for most configurations is identical to IRQ enabling/disabling.
> > The scope of the lock is still wider than it should be but this is
> > decreased in later patches. The per-cpu vmstat deltas are protected by
> > preempt_disable/preempt_enable where necessary instead of relying on
> > IRQ disable/enable.
>
> Yes, this goes into the right direction and I really appreciate the
> scoped protection for clarity sake.
>

Thanks.

> > #ifdef CONFIG_MEMORY_HOTREMOVE
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 8a8f1a26b231..01b74ff73549 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -887,6 +887,7 @@ void cpu_vm_stats_fold(int cpu)
> >
> >          pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
> >
> > +        preempt_disable();
>
> What's the reason for the preempt_disable() here? A comment would be
> appreciated.
>

Very good question because it's protecting vm_stat_diff and
vm_numa_stat_diff in different contexts and not quite correctly at this
point of the series.

By the end of the series vm_numa_stat_diff is a simple counter and does
not need special protection.

Right now, it's protecting against a read and clear of vm_stat_diff in
two contexts -- cpu_vm_stats_fold and drain_zonestats but it's only
defensive. cpu_vm_stats_fold is only called when a CPU is going dead and
drain_zonestats is called from memory hotplug context. The protection is
necessary only if a new drain_zonestats caller was added without taking
the RMW of vm_stat_diff into account which may never happen.

This whole problem with preemption could be avoided altogether if
this_cpu_xchg was used similar to what is done elsewhere in vmstat
so.... this?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 64429ca4957f..9528304ce24d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8969,8 +8969,9 @@ void zone_pcp_reset(struct zone *zone)
     struct per_cpu_zonestat *pzstats;
 
     /*
-     * No race with drain_pages. drain_zonestat disables preemption
-     * and drain_pages relies on the pcp local_lock.
+     * No race with drain_pages. drain_zonestat is only concerned with
+     * vm_*_stat_diff which is updated with this_cpu_xchg and drain_pages
+     * only cares about the PCP lists protected by local_lock.
      */
     if (zone->per_cpu_pageset != &boot_pageset) {
         for_each_online_cpu(cpu) {
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 01b74ff73549..34ff61a145d2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -887,13 +887,11 @@ void cpu_vm_stats_fold(int cpu)
 
         pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
 
-        preempt_disable();
         for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
             if (pzstats->vm_stat_diff[i]) {
                 int v;
 
-                v = pzstats->vm_stat_diff[i];
-                pzstats->vm_stat_diff[i] = 0;
+                v = this_cpu_xchg(pzstats->vm_stat_diff[i], 0);
                 atomic_long_add(v, &zone->vm_stat[i]);
                 global_zone_diff[i] += v;
             }
@@ -903,13 +901,11 @@ void cpu_vm_stats_fold(int cpu)
             if (pzstats->vm_numa_stat_diff[i]) {
                 int v;
 
-                v = pzstats->vm_numa_stat_diff[i];
-                pzstats->vm_numa_stat_diff[i] = 0;
+                v = this_cpu_xchg(pzstats->vm_numa_stat_diff[i], 0);
                 atomic_long_add(v, &zone->vm_numa_stat[i]);
                 global_numa_diff[i] += v;
             }
 #endif
-        preempt_enable();
     }
 
     for_each_online_pgdat(pgdat) {
@@ -943,10 +939,9 @@ void drain_zonestat(struct zone *zone, struct per_cpu_zonestat *pzstats)
 {
     int i;
 
-    preempt_disable();
     for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
         if (pzstats->vm_stat_diff[i]) {
-            int v = pzstats->vm_stat_diff[i];
+            int v = this_cpu_xchg(pzstats->vm_stat_diff[i], 0);
             pzstats->vm_stat_diff[i] = 0;
             atomic_long_add(v, &zone->vm_stat[i]);
             atomic_long_add(v, &vm_zone_stat[i]);
         }
@@ -955,14 +950,12 @@ void drain_zonestat(struct zone *zone, struct per_cpu_zonestat *pzstats)
 #ifdef CONFIG_NUMA
     for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
         if (pzstats->vm_numa_stat_diff[i]) {
-            int v = pzstats->vm_numa_stat_diff[i];
+            int v = this_cpu_xchg(pzstats->vm_numa_stat_diff[i], 0);
 
-            pzstats->vm_numa_stat_diff[i] = 0;
             atomic_long_add(v, &zone->vm_numa_stat[i]);
             atomic_long_add(v, &vm_numa_stat[i]);
         }
 #endif
-    preempt_enable();
 }
 #endif

-- 
Mel Gorman
SUSE Labs
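For reference, a minimal sketch of how this_cpu_xchg() is normally used: it
operates on the running CPU's instance of a per-cpu variable, which is what
makes the read-and-clear safe without a preempt_disable()/preempt_enable()
pair. The names below (demo_evt_diff, demo_evt_total and the helpers) are
hypothetical and not part of the patches in this thread; as the follow-ups
point out, the hunks above instead apply this_cpu_xchg() to a pointer
obtained via per_cpu_ptr() for a possibly different CPU.

/*
 * Hypothetical example, not taken from the thread: a per-cpu event
 * counter folded into a global atomic. this_cpu_xchg() reads and clears
 * the *local* CPU's copy in one operation.
 */
#include <linux/percpu.h>
#include <linux/atomic.h>

static DEFINE_PER_CPU(int, demo_evt_diff);	/* hypothetical per-cpu delta */
static atomic_long_t demo_evt_total = ATOMIC_LONG_INIT(0);	/* hypothetical global total */

static inline void demo_evt_account(int nr)
{
	/* Fast path: update only the local CPU's delta */
	this_cpu_add(demo_evt_diff, nr);
}

static void demo_evt_fold_local(void)
{
	/* Atomically fetch the local delta and reset it to zero */
	int v = this_cpu_xchg(demo_evt_diff, 0);

	if (v)
		atomic_long_add(v, &demo_evt_total);
}

Because the this_cpu_*() helpers always act on the current CPU, folding a
dead CPU's deltas from another CPU, as cpu_vm_stats_fold() does, cannot use
them; that is the point raised in the replies that follow.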
On Wed, Mar 31 2021 at 12:01, Mel Gorman wrote:
> On Wed, Mar 31, 2021 at 11:55:56AM +0200, Thomas Gleixner wrote:
> @@ -887,13 +887,11 @@ void cpu_vm_stats_fold(int cpu)
>
>          pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
>
> -        preempt_disable();
>          for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
>              if (pzstats->vm_stat_diff[i]) {
>                  int v;
>
> -                v = pzstats->vm_stat_diff[i];
> -                pzstats->vm_stat_diff[i] = 0;
> +                v = this_cpu_xchg(pzstats->vm_stat_diff[i], 0);

Confused. pzstats is not a percpu pointer. zone->per_cpu_zonestats is.

But @cpu is not necessarily the current CPU.

Thanks,

        tglx
On Wed, Mar 31 2021 at 19:42, Thomas Gleixner wrote:
> On Wed, Mar 31 2021 at 12:01, Mel Gorman wrote:
>> On Wed, Mar 31, 2021 at 11:55:56AM +0200, Thomas Gleixner wrote:
>> @@ -887,13 +887,11 @@ void cpu_vm_stats_fold(int cpu)
>>
>>          pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
>>
>> -        preempt_disable();
>>          for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
>>              if (pzstats->vm_stat_diff[i]) {
>>                  int v;
>>
>> -                v = pzstats->vm_stat_diff[i];
>> -                pzstats->vm_stat_diff[i] = 0;
>> +                v = this_cpu_xchg(pzstats->vm_stat_diff[i], 0);
>
> Confused. pzstats is not a percpu pointer. zone->per_cpu_zonestats is.
>
> But @cpu is not necessarily the current CPU.

@cpu _is_ definitely NOT the current CPU as this is invoked from the
hotplug callback _after_ @cpu went dead. @cpu is dead and won't update
these things concurrently. So I'm even more confused :)

Thanks,

        tglx
On Wed, Mar 31, 2021 at 07:42:42PM +0200, Thomas Gleixner wrote:
> On Wed, Mar 31 2021 at 12:01, Mel Gorman wrote:
> > On Wed, Mar 31, 2021 at 11:55:56AM +0200, Thomas Gleixner wrote:
> > @@ -887,13 +887,11 @@ void cpu_vm_stats_fold(int cpu)
> >
> >          pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
> >
> > -        preempt_disable();
> >          for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
> >              if (pzstats->vm_stat_diff[i]) {
> >                  int v;
> >
> > -                v = pzstats->vm_stat_diff[i];
> > -                pzstats->vm_stat_diff[i] = 0;
> > +                v = this_cpu_xchg(pzstats->vm_stat_diff[i], 0);
>
> Confused. pzstats is not a percpu pointer. zone->per_cpu_zonestats is.
>
> But @cpu is not necessarily the current CPU.
>

I was drinking drain cleaner instead of coffee. The code was also broken
to begin with.

drain_pages() is draining pagesets of a local or dead CPU. For a local
CPU, disabling IRQs prevents an IRQ arriving during the drain, trying to
allocate a page and potentially corrupting the local pageset -- ok.

zone_pcp_reset is accessing a remote CPU's pageset, freeing the percpu
pointer and resetting it to boot_pageset. zone_pcp_reset calling
local_irq_save() does not offer any special protection against
drain_pages because there are two separate IRQs involved.

This particular patch may have no reason to touch zone_pcp_reset,
cpu_vm_stats_fold or drain_zonestat at all but I need to think about it
more tomorrow.

-- 
Mel Gorman
SUSE Labs
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a4393ac27336..106da8fbc72a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -20,6 +20,7 @@
 #include <linux/atomic.h>
 #include <linux/mm_types.h>
 #include <linux/page-flags.h>
+#include <linux/local_lock.h>
 #include <asm/page.h>
 
 /* Free memory management - zoned buddy allocator.  */
@@ -337,6 +338,7 @@ enum zone_watermarks {
 #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
 #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
 
+/* Fields and list protected by pagesets local_lock in page_alloc.c */
 struct per_cpu_pages {
     int count;		/* number of pages in the list */
     int high;		/* high watermark, emptying needed */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 32006e66564a..7f8c73020688 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -112,6 +112,13 @@ typedef int __bitwise fpi_t;
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_FRACTION	(8)
 
+struct pagesets {
+    local_lock_t lock;
+};
+static DEFINE_PER_CPU(struct pagesets, pagesets) = {
+    .lock = INIT_LOCAL_LOCK(lock),
+};
+
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 DEFINE_PER_CPU(int, numa_node);
 EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -2962,12 +2969,12 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
     unsigned long flags;
     int to_drain, batch;
 
-    local_irq_save(flags);
+    local_lock_irqsave(&pagesets.lock, flags);
     batch = READ_ONCE(pcp->batch);
     to_drain = min(pcp->count, batch);
     if (to_drain > 0)
         free_pcppages_bulk(zone, to_drain, pcp);
-    local_irq_restore(flags);
+    local_unlock_irqrestore(&pagesets.lock, flags);
 }
 #endif
 
@@ -2983,13 +2990,13 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
     unsigned long flags;
     struct per_cpu_pages *pcp;
 
-    local_irq_save(flags);
+    local_lock_irqsave(&pagesets.lock, flags);
 
     pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
     if (pcp->count)
         free_pcppages_bulk(zone, pcp->count, pcp);
 
-    local_irq_restore(flags);
+    local_unlock_irqrestore(&pagesets.lock, flags);
 }
 
 /*
@@ -3252,9 +3259,9 @@ void free_unref_page(struct page *page)
     if (!free_unref_page_prepare(page, pfn))
         return;
 
-    local_irq_save(flags);
+    local_lock_irqsave(&pagesets.lock, flags);
     free_unref_page_commit(page, pfn);
-    local_irq_restore(flags);
+    local_unlock_irqrestore(&pagesets.lock, flags);
 }
 
 /*
@@ -3274,7 +3281,7 @@ void free_unref_page_list(struct list_head *list)
         set_page_private(page, pfn);
     }
 
-    local_irq_save(flags);
+    local_lock_irqsave(&pagesets.lock, flags);
     list_for_each_entry_safe(page, next, list, lru) {
         unsigned long pfn = page_private(page);
 
@@ -3287,12 +3294,12 @@ void free_unref_page_list(struct list_head *list)
          * a large list of pages to free.
          */
         if (++batch_count == SWAP_CLUSTER_MAX) {
-            local_irq_restore(flags);
+            local_unlock_irqrestore(&pagesets.lock, flags);
             batch_count = 0;
-            local_irq_save(flags);
+            local_lock_irqsave(&pagesets.lock, flags);
         }
     }
-    local_irq_restore(flags);
+    local_unlock_irqrestore(&pagesets.lock, flags);
 }
 
 /*
@@ -3449,7 +3456,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
     struct page *page;
     unsigned long flags;
 
-    local_irq_save(flags);
+    local_lock_irqsave(&pagesets.lock, flags);
     pcp = this_cpu_ptr(zone->per_cpu_pageset);
     list = &pcp->lists[migratetype];
     page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
@@ -3457,7 +3464,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
         __count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
         zone_statistics(preferred_zone, zone);
     }
-    local_irq_restore(flags);
+    local_unlock_irqrestore(&pagesets.lock, flags);
     return page;
 }
 
@@ -5052,7 +5059,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
         goto failed;
 
     /* Attempt the batch allocation */
-    local_irq_save(flags);
+    local_lock_irqsave(&pagesets.lock, flags);
     pcp = this_cpu_ptr(zone->per_cpu_pageset);
     pcp_list = &pcp->lists[ac.migratetype];
 
@@ -5090,7 +5097,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
         nr_populated++;
     }
 
-    local_irq_restore(flags);
+    local_unlock_irqrestore(&pagesets.lock, flags);
 
     return nr_populated;
 
@@ -8958,12 +8965,13 @@ void zone_pcp_enable(struct zone *zone)
 
 void zone_pcp_reset(struct zone *zone)
 {
-    unsigned long flags;
     int cpu;
     struct per_cpu_zonestat *pzstats;
 
-    /* avoid races with drain_pages()  */
-    local_irq_save(flags);
+    /*
+     * No race with drain_pages. drain_zonestat disables preemption
+     * and drain_pages relies on the pcp local_lock.
+     */
     if (zone->per_cpu_pageset != &boot_pageset) {
         for_each_online_cpu(cpu) {
             pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
@@ -8974,7 +8982,6 @@ void zone_pcp_reset(struct zone *zone)
         zone->per_cpu_pageset = &boot_pageset;
         zone->per_cpu_zonestats = &boot_zonestats;
     }
-    local_irq_restore(flags);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8a8f1a26b231..01b74ff73549 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -887,6 +887,7 @@ void cpu_vm_stats_fold(int cpu)
 
         pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
 
+        preempt_disable();
         for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
             if (pzstats->vm_stat_diff[i]) {
                 int v;
@@ -908,6 +909,7 @@ void cpu_vm_stats_fold(int cpu)
                     global_numa_diff[i] += v;
                 }
 #endif
+        preempt_enable();
     }
 
     for_each_online_pgdat(pgdat) {
@@ -941,6 +943,7 @@ void drain_zonestat(struct zone *zone, struct per_cpu_zonestat *pzstats)
 {
     int i;
 
+    preempt_disable();
     for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
         if (pzstats->vm_stat_diff[i]) {
             int v = pzstats->vm_stat_diff[i];
@@ -959,6 +962,7 @@ void drain_zonestat(struct zone *zone, struct per_cpu_zonestat *pzstats)
             atomic_long_add(v, &vm_numa_stat[i]);
         }
 #endif
+    preempt_enable();
 }
 #endif
There is a lack of clarity of what exactly local_irq_save/local_irq_restore
protects in page_alloc.c . It conflates the protection of per-cpu page
allocation structures with per-cpu vmstat deltas.

This patch protects the PCP structure using local_lock which
for most configurations is identical to IRQ enabling/disabling.
The scope of the lock is still wider than it should be but this is
decreased in later patches. The per-cpu vmstat deltas are protected by
preempt_disable/preempt_enable where necessary instead of relying on
IRQ disable/enable.

[lkp@intel.com: Make pagesets static]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h |  2 ++
 mm/page_alloc.c        | 43 ++++++++++++++++++++++++-------------------
 mm/vmstat.c            |  4 ++++
 3 files changed, 31 insertions(+), 18 deletions(-)
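For readers unfamiliar with the API, below is a minimal sketch of the
local_lock pattern the patch introduces; the struct and function names
(demo_pcplists, demo_pcp_add) are hypothetical and not taken from the patch.
On non-PREEMPT_RT configurations local_lock_irqsave() behaves like
local_irq_save() plus lockdep annotations, while on PREEMPT_RT it becomes a
per-CPU lock acquired by the task, which is why scoping the protection to
the PCP structure matters.

/*
 * Hypothetical sketch of guarding per-cpu lists with a local_lock_t,
 * mirroring the pagesets.lock usage in the patch. Names are made up.
 */
#include <linux/local_lock.h>
#include <linux/list.h>
#include <linux/mm_types.h>
#include <linux/percpu.h>

struct demo_pcplists {
	local_lock_t lock;		/* protects the fields below */
	struct list_head pages;		/* must be initialised at boot */
	int count;
};

static DEFINE_PER_CPU(struct demo_pcplists, demo_pcplists) = {
	.lock = INIT_LOCAL_LOCK(lock),
};

static void demo_pcp_add(struct page *page)
{
	struct demo_pcplists *pcp;
	unsigned long flags;

	/* Serialises against this CPU only; disables IRQs on !PREEMPT_RT */
	local_lock_irqsave(&demo_pcplists.lock, flags);
	pcp = this_cpu_ptr(&demo_pcplists);
	list_add(&page->lru, &pcp->pages);
	pcp->count++;
	local_unlock_irqrestore(&demo_pcplists.lock, flags);
}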