From patchwork Wed Nov 27 02:57:20 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yuanchu Xie
X-Patchwork-Id: 845824
Date: Tue, 26 Nov 2024 18:57:20 -0800
In-Reply-To: <20241127025728.3689245-1-yuanchu@google.com>
References: <20241127025728.3689245-1-yuanchu@google.com>
X-Mailing-List: linux-kselftest@vger.kernel.org
X-Mailer: git-send-email 2.47.0.338.g60cca15819-goog
Message-ID: <20241127025728.3689245-2-yuanchu@google.com>
Subject: [PATCH v4 1/9] mm: aggregate workingset information into histograms
From: Yuanchu Xie
To: Andrew Morton, David Hildenbrand, "Aneesh Kumar K.V", Khalid Aziz,
 Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying, Lance Yang,
 Randy Dunlap, Muhammad Usama Anjum
Cc: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
 Greg Kroah-Hartman, "Rafael J. Wysocki", "Michael S. Tsirkin", Jason Wang,
 Xuan Zhuo, Eugenio Pérez, Michal Hocko, Roman Gushchin, Shakeel Butt,
 Muchun Song, Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
 Yuanchu Xie, cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-kernel@vger.kernel.org, virtualization@lists.linux.dev,
 linux-mm@kvack.org, linux-kselftest@vger.kernel.org

Hierarchically aggregate all memcgs' MGLRU generations and their
page counts into working set page age histograms.
The histograms break down the system's workingset per-node,
per-anon/file.

The sysfs interfaces are as follows:

/sys/devices/system/node/nodeX/workingset_report/page_age
	A per-node page age histogram, showing an aggregate of the
	node's lruvecs. The information is extracted from MGLRU's
	per-generation page counters. Reading this file causes a
	hierarchical aging of all lruvecs, scanning pages and creating
	a new generation in each lruvec.
	For example:
		1000 anon=0 file=0
		2000 anon=0 file=0
		100000 anon=5533696 file=5566464
		18446744073709551615 anon=0 file=0

/sys/devices/system/node/nodeX/workingset_report/page_age_intervals
	A comma-separated list of times, in milliseconds, that sets the
	bin boundaries the page age histogram uses for aggregation.

Signed-off-by: Yuanchu Xie
---
 drivers/base/node.c               |   6 +
 include/linux/mmzone.h            |   9 +
 include/linux/workingset_report.h |  79 ++++++
 mm/Kconfig                        |   9 +
 mm/Makefile                       |   1 +
 mm/internal.h                     |   5 +
 mm/memcontrol.c                   |   2 +
 mm/mm_init.c                      |   2 +
 mm/mmzone.c                       |   2 +
 mm/vmscan.c                       |  10 +-
 mm/workingset_report.c            | 451 ++++++++++++++++++++++++++++++
 11 files changed, 572 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/workingset_report.h
 create mode 100644 mm/workingset_report.c

diff --git a/drivers/base/node.c b/drivers/base/node.c
index eb72580288e6..ba5b8720dbfa 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -20,6 +20,8 @@
 #include
 #include
 #include
+#include
+#include

 static const struct bus_type node_subsys = {
 	.name = "node",
@@ -626,6 +628,7 @@ static int register_node(struct node *node, int num)
 	} else {
 		hugetlb_register_node(node);
 		compaction_register_node(node);
+		wsr_init_sysfs(node);
 	}

 	return error;
@@ -642,6 +645,9 @@ void unregister_node(struct node *node)
 {
 	hugetlb_unregister_node(node);
 	compaction_unregister_node(node);
+	wsr_remove_sysfs(node);
+	wsr_destroy_lruvec(mem_cgroup_lruvec(NULL, NODE_DATA(node->dev.id)));
+	wsr_destroy_pgdat(NODE_DATA(node->dev.id));
 	node_remove_accesses(node);
 	node_remove_caches(node);
 	device_unregister(&node->dev);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 80bc5640bb60..ee728c0c5a3b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include

 /* Free memory management - zoned buddy allocator.  */
 #ifndef CONFIG_ARCH_FORCE_MAX_ORDER
@@ -630,6 +631,9 @@ struct lruvec {
 	struct lru_gen_mm_state mm_state;
 #endif
 #endif /* CONFIG_LRU_GEN */
+#ifdef CONFIG_WORKINGSET_REPORT
+	struct wsr_state wsr;
+#endif /* CONFIG_WORKINGSET_REPORT */
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
 #endif
@@ -1424,6 +1428,11 @@ typedef struct pglist_data {
 	struct lru_gen_memcg memcg_lru;
 #endif

+#ifdef CONFIG_WORKINGSET_REPORT
+	struct mutex wsr_update_mutex;
+	struct wsr_report_bins __rcu *wsr_page_age_bins;
+#endif
+
 	CACHELINE_PADDING(_pad2_);

 	/* Per-node vmstats */
diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h
new file mode 100644
index 000000000000..d7c2ee14ec87
--- /dev/null
+++ b/include/linux/workingset_report.h
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_WORKINGSET_REPORT_H
+#define _LINUX_WORKINGSET_REPORT_H
+
+#include
+#include
+
+struct mem_cgroup;
+struct pglist_data;
+struct node;
+struct lruvec;
+
+#ifdef CONFIG_WORKINGSET_REPORT
+
+#define WORKINGSET_REPORT_MIN_NR_BINS 2
+#define WORKINGSET_REPORT_MAX_NR_BINS 32
+
+#define WORKINGSET_INTERVAL_MAX ((unsigned long)-1)
+#define ANON_AND_FILE 2
+
+struct wsr_report_bin {
+	unsigned long idle_age;
+	unsigned long nr_pages[ANON_AND_FILE];
+};
+
+struct wsr_report_bins {
+	/* excludes the WORKINGSET_INTERVAL_MAX bin */
+	unsigned long nr_bins;
+	/* last bin contains WORKINGSET_INTERVAL_MAX */
+	unsigned long idle_age[WORKINGSET_REPORT_MAX_NR_BINS];
+	struct rcu_head rcu;
+};
+
+struct wsr_page_age_histo {
+	unsigned long timestamp;
+	struct wsr_report_bin bins[WORKINGSET_REPORT_MAX_NR_BINS];
+};
+
+struct wsr_state {
+	/* breakdown of workingset by page age */
+	struct mutex page_age_lock;
+	struct wsr_page_age_histo *page_age;
+};
+
+void wsr_init_lruvec(struct lruvec *lruvec);
+void wsr_destroy_lruvec(struct lruvec *lruvec);
+void wsr_init_pgdat(struct pglist_data *pgdat);
+void wsr_destroy_pgdat(struct pglist_data *pgdat);
+void wsr_init_sysfs(struct node *node);
+void wsr_remove_sysfs(struct node *node);
+
+/*
+ * Returns true if the wsr is configured to be refreshed.
+ * The next refresh time is stored in refresh_time.
+ */
+bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
+			struct pglist_data *pgdat);
+#else
+static inline void wsr_init_lruvec(struct lruvec *lruvec)
+{
+}
+static inline void wsr_destroy_lruvec(struct lruvec *lruvec)
+{
+}
+static inline void wsr_init_pgdat(struct pglist_data *pgdat)
+{
+}
+static inline void wsr_destroy_pgdat(struct pglist_data *pgdat)
+{
+}
+static inline void wsr_init_sysfs(struct node *node)
+{
+}
+static inline void wsr_remove_sysfs(struct node *node)
+{
+}
+#endif /* CONFIG_WORKINGSET_REPORT */
+
+#endif /* _LINUX_WORKINGSET_REPORT_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 84000b016808..be949786796d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1301,6 +1301,15 @@ config ARCH_HAS_USER_SHADOW_STACK
 	  The architecture has hardware support for userspace shadow call
 	  stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).

+config WORKINGSET_REPORT
+	bool "Working set reporting"
+	depends on LRU_GEN && SYSFS
+	help
+	  Report system and per-memcg working set to userspace.
+
+	  This option exports stats and events giving the user more insight
+	  into its memory working set.
+
 source "mm/damon/Kconfig"

 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index d5639b036166..f5ef0768253a 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -98,6 +98,7 @@ obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
+obj-$(CONFIG_WORKINGSET_REPORT) += workingset_report.o
 ifdef CONFIG_SWAP
 obj-$(CONFIG_MEMCG) += swap_cgroup.o
 endif
diff --git a/mm/internal.h b/mm/internal.h
index 64c2eb0b160e..bbd3c1501bac 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -470,9 +470,14 @@ extern unsigned long highest_memmap_pfn;
 /*
  * in mm/vmscan.c:
  */
+struct scan_control;
+bool isolate_lru_page(struct page *page);
 bool folio_isolate_lru(struct folio *folio);
 void folio_putback_lru(struct folio *folio);
 extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason);
+bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq, bool can_swap,
+			bool force_scan);
+void set_task_reclaim_state(struct task_struct *task, struct reclaim_state *rs);

 /*
  * in mm/rmap.c:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 53db98d2c4a1..096856b35fbc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,6 +63,7 @@
 #include
 #include
 #include
+#include
 #include "internal.h"
 #include
 #include
@@ -3453,6 +3454,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 	if (!pn)
 		return;

+	wsr_destroy_lruvec(&pn->lruvec);
 	free_percpu(pn->lruvec_stats_percpu);
 	kfree(pn->lruvec_stats);
 	kfree(pn);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 4ba5607aaf19..b4f7c904ce33 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -30,6 +30,7 @@
 #include
 #include
 #include
+#include
 #include "internal.h"
 #include "slab.h"
 #include "shuffle.h"
@@ -1378,6 +1379,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)

 	pgdat_page_ext_init(pgdat);
 	lruvec_init(&pgdat->__lruvec);
+	wsr_init_pgdat(pgdat);
 }

 static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
diff --git a/mm/mmzone.c b/mm/mmzone.c
index f9baa8882fbf..0352a2018067 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -90,6 +90,8 @@ void lruvec_init(struct lruvec *lruvec)
 	 */
 	list_del(&lruvec->lists[LRU_UNEVICTABLE]);

+	wsr_init_lruvec(lruvec);
+
 	lru_gen_init_lruvec(lruvec);
 }

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 28ba2b06fc7d..89da4d8dfb5f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -57,6 +57,7 @@
 #include
 #include
 #include
+#include

 #include
 #include
@@ -271,8 +272,7 @@ static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
 }
 #endif

-static void set_task_reclaim_state(struct task_struct *task,
-				   struct reclaim_state *rs)
+void set_task_reclaim_state(struct task_struct *task, struct reclaim_state *rs)
 {
 	/* Check for an overwrite */
 	WARN_ON_ONCE(rs && task->reclaim_state);
@@ -3861,8 +3861,8 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq,
 	return success;
 }

-static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
-			       bool can_swap, bool force_scan)
+bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq, bool can_swap,
+			bool force_scan)
 {
 	bool success;
 	struct lru_gen_mm_walk *walk;
@@ -5640,6 +5640,8 @@ static int __init init_lru_gen(void)
 	if (sysfs_create_group(mm_kobj, &lru_gen_attr_group))
 		pr_err("lru_gen: failed to create sysfs group\n");

+	wsr_init_sysfs(NULL);
+
 	debugfs_create_file("lru_gen", 0644, NULL, NULL, &lru_gen_rw_fops);
 	debugfs_create_file("lru_gen_full", 0444, NULL, NULL, &lru_gen_ro_fops);

diff --git a/mm/workingset_report.c b/mm/workingset_report.c
new file mode 100644
index 000000000000..a4dcf62fcd96
--- /dev/null
+++ b/mm/workingset_report.c
@@ -0,0 +1,451 @@
+// SPDX-License-Identifier: GPL-2.0
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "internal.h"
+
+void wsr_init_pgdat(struct pglist_data *pgdat)
+{
+	mutex_init(&pgdat->wsr_update_mutex);
+	RCU_INIT_POINTER(pgdat->wsr_page_age_bins, NULL);
+}
+
+void wsr_destroy_pgdat(struct pglist_data *pgdat)
+{
+	struct wsr_report_bins __rcu *bins;
+
+	mutex_lock(&pgdat->wsr_update_mutex);
+	bins = rcu_replace_pointer(pgdat->wsr_page_age_bins, NULL,
+			lockdep_is_held(&pgdat->wsr_update_mutex));
+	kfree_rcu(bins, rcu);
+	mutex_unlock(&pgdat->wsr_update_mutex);
+	mutex_destroy(&pgdat->wsr_update_mutex);
+}
+
+void wsr_init_lruvec(struct lruvec *lruvec)
+{
+	struct wsr_state *wsr = &lruvec->wsr;
+
+	memset(wsr, 0, sizeof(*wsr));
+	mutex_init(&wsr->page_age_lock);
+}
+
+void wsr_destroy_lruvec(struct lruvec *lruvec)
+{
+	struct wsr_state *wsr = &lruvec->wsr;
+
+	mutex_destroy(&wsr->page_age_lock);
+	kfree(wsr->page_age);
+	memset(wsr, 0, sizeof(*wsr));
+}
+
+static int workingset_report_intervals_parse(char *src,
+					     struct wsr_report_bins *bins)
+{
+	int err = 0, i = 0;
+	char *cur, *next = strim(src);
+
+	if (*next == '\0')
+		return 0;
+
+	while ((cur = strsep(&next, ","))) {
+		unsigned int interval;
+
+		err = kstrtouint(cur, 0, &interval);
+		if (err)
+			goto out;
+
+		bins->idle_age[i] = msecs_to_jiffies(interval);
+		if (i > 0 && bins->idle_age[i] <= bins->idle_age[i - 1]) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		if (++i == WORKINGSET_REPORT_MAX_NR_BINS) {
+			err = -ERANGE;
+			goto out;
+		}
+	}
+
+	if (i && i < WORKINGSET_REPORT_MIN_NR_BINS - 1) {
+		err = -ERANGE;
+		goto out;
+	}
+
+	bins->nr_bins = i;
+	bins->idle_age[i] = WORKINGSET_INTERVAL_MAX;
+out:
+	return err ?: i;
+}
+
+static unsigned long get_gen_start_time(const struct lru_gen_folio *lrugen,
+					unsigned long seq,
+					unsigned long max_seq,
+					unsigned long curr_timestamp)
+{
+	int younger_gen;
+
+	if (seq == max_seq)
+		return curr_timestamp;
+	younger_gen = lru_gen_from_seq(seq + 1);
+	return READ_ONCE(lrugen->timestamps[younger_gen]);
+}
+
+static void collect_page_age_type(const struct lru_gen_folio *lrugen,
+				  struct wsr_report_bin *bin,
+				  unsigned long max_seq, unsigned long min_seq,
+				  unsigned long curr_timestamp, int type)
+{
+	unsigned long seq;
+
+	for (seq = max_seq; seq + 1 > min_seq; seq--) {
+		int gen, zone;
+		unsigned long gen_end, gen_start, size = 0;
+
+		gen = lru_gen_from_seq(seq);
+
+		for (zone = 0; zone < MAX_NR_ZONES; zone++)
+			size += max(
+				READ_ONCE(lrugen->nr_pages[gen][type][zone]),
+				0L);
+
+		gen_start = get_gen_start_time(lrugen, seq, max_seq,
+					       curr_timestamp);
+		gen_end = READ_ONCE(lrugen->timestamps[gen]);
+
+		while (bin->idle_age != WORKINGSET_INTERVAL_MAX &&
+		       time_before(gen_end + bin->idle_age, curr_timestamp)) {
+			unsigned long gen_in_bin = (long)gen_start -
+						   (long)curr_timestamp +
+						   (long)bin->idle_age;
+			unsigned long gen_len = (long)gen_start - (long)gen_end;
+
+			if (!gen_len)
+				break;
+			if (gen_in_bin) {
+				unsigned long split_bin =
+					size / gen_len * gen_in_bin;
+
+				bin->nr_pages[type] += split_bin;
+				size -= split_bin;
+			}
+			gen_start = curr_timestamp - bin->idle_age;
+			bin++;
+		}
+		bin->nr_pages[type] += size;
+	}
+}
+
+/*
+ * proportionally aggregate Multi-gen LRU bins into a working set report
+ * MGLRU generations:
+ * current time
+ * |         max_seq timestamp
+ * |         |     max_seq - 1 timestamp
+ * |         |     |               unbounded
+ * |         |     |               |
+ * --------------------------------
+ * | max_seq | ... | ... | min_seq
+ * --------------------------------
+ *
+ * Bins:
+ *
+ * current time
+ * |       current - idle_age[0]
+ * |       |     current - idle_age[1]
+ * |       |     |               unbounded
+ * |       |     |               |
+ * ------------------------------
+ * | bin 0 | ... | ... | bin n-1
+ * ------------------------------
+ *
+ * Assume the heuristic that pages are in the MGLRU generation
+ * through uniform accesses, so we can aggregate them
+ * proportionally into bins.
+ */
+static void collect_page_age(struct wsr_page_age_histo *page_age,
+			     const struct lruvec *lruvec)
+{
+	int type;
+	const struct lru_gen_folio *lrugen = &lruvec->lrugen;
+	unsigned long curr_timestamp = jiffies;
+	unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
+	unsigned long min_seq[ANON_AND_FILE] = {
+		READ_ONCE(lruvec->lrugen.min_seq[LRU_GEN_ANON]),
+		READ_ONCE(lruvec->lrugen.min_seq[LRU_GEN_FILE]),
+	};
+	struct wsr_report_bin *bin = &page_age->bins[0];
+
+	for (type = 0; type < ANON_AND_FILE; type++)
+		collect_page_age_type(lrugen, bin, max_seq, min_seq[type],
+				      curr_timestamp, type);
+}
+
+/* First step: hierarchically scan child memcgs. */
+static void refresh_scan(struct wsr_state *wsr, struct mem_cgroup *root,
+			 struct pglist_data *pgdat)
+{
+	struct mem_cgroup *memcg;
+	unsigned int flags;
+	struct reclaim_state rs = { 0 };
+
+	set_task_reclaim_state(current, &rs);
+	flags = memalloc_noreclaim_save();
+
+	memcg = mem_cgroup_iter(root, NULL, NULL);
+	do {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+		unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
+
+		/*
+		 * setting can_swap=true and force_scan=true ensures
+		 * proper workingset stats when the system cannot swap.
+		 */
+		try_to_inc_max_seq(lruvec, max_seq, true, true);
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
+
+	memalloc_noreclaim_restore(flags);
+	set_task_reclaim_state(current, NULL);
+}
+
+/* Second step: aggregate child memcgs into the page age histogram. */
+static void refresh_aggregate(struct wsr_page_age_histo *page_age,
+			      struct mem_cgroup *root,
+			      struct pglist_data *pgdat)
+{
+	struct mem_cgroup *memcg;
+	struct wsr_report_bin *bin;
+
+	for (bin = page_age->bins;
+	     bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++) {
+		bin->nr_pages[0] = 0;
+		bin->nr_pages[1] = 0;
+	}
+	/* the last used bin has idle_age == WORKINGSET_INTERVAL_MAX. */
+	bin->nr_pages[0] = 0;
+	bin->nr_pages[1] = 0;
+
+	memcg = mem_cgroup_iter(root, NULL, NULL);
+	do {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+		collect_page_age(page_age, lruvec);
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
+	WRITE_ONCE(page_age->timestamp, jiffies);
+}
+
+static void copy_node_bins(struct pglist_data *pgdat,
+			   struct wsr_page_age_histo *page_age)
+{
+	struct wsr_report_bins *node_page_age_bins;
+	int i = 0;
+
+	rcu_read_lock();
+	node_page_age_bins = rcu_dereference(pgdat->wsr_page_age_bins);
+	if (!node_page_age_bins)
+		goto nocopy;
+	for (i = 0; i < node_page_age_bins->nr_bins; ++i)
+		page_age->bins[i].idle_age = node_page_age_bins->idle_age[i];
+
+nocopy:
+	page_age->bins[i].idle_age = WORKINGSET_INTERVAL_MAX;
+	rcu_read_unlock();
+}
+
+bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
+			struct pglist_data *pgdat)
+{
+	struct wsr_page_age_histo *page_age;
+
+	if (!READ_ONCE(wsr->page_age))
+		return false;
+
+	refresh_scan(wsr, root, pgdat);
+	mutex_lock(&wsr->page_age_lock);
+	page_age = READ_ONCE(wsr->page_age);
+	if (page_age) {
+		copy_node_bins(pgdat, page_age);
+		refresh_aggregate(page_age, root, pgdat);
+	}
+	mutex_unlock(&wsr->page_age_lock);
+	return !!page_age;
+}
+EXPORT_SYMBOL_GPL(wsr_refresh_report);
+
+static struct pglist_data *kobj_to_pgdat(struct kobject *kobj)
+{
+	int nid = IS_ENABLED(CONFIG_NUMA) ? kobj_to_dev(kobj)->id :
+					    first_memory_node;
+
+	return NODE_DATA(nid);
+}
+
+static struct wsr_state *kobj_to_wsr(struct kobject *kobj)
+{
+	return &mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj))->wsr;
+}
+
+static ssize_t page_age_intervals_show(struct kobject *kobj,
+				       struct kobj_attribute *attr, char *buf)
+{
+	struct wsr_report_bins *bins;
+	int len = 0;
+	struct pglist_data *pgdat = kobj_to_pgdat(kobj);
+
+	rcu_read_lock();
+	bins = rcu_dereference(pgdat->wsr_page_age_bins);
+	if (bins) {
+		int i;
+		int nr_bins = bins->nr_bins;
+
+		for (i = 0; i < bins->nr_bins; ++i) {
+			len += sysfs_emit_at(
+				buf, len, "%u",
+				jiffies_to_msecs(bins->idle_age[i]));
+			if (i + 1 < nr_bins)
+				len += sysfs_emit_at(buf, len, ",");
+		}
+	}
+	len += sysfs_emit_at(buf, len, "\n");
+	rcu_read_unlock();
+
+	return len;
+}
+
+static ssize_t page_age_intervals_store(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					const char *src, size_t len)
+{
+	struct wsr_report_bins *bins = NULL, __rcu *old;
+	char *buf = NULL;
+	int err = 0;
+	struct pglist_data *pgdat = kobj_to_pgdat(kobj);
+
+	buf = kstrdup(src, GFP_KERNEL);
+	if (!buf) {
+		err = -ENOMEM;
+		goto failed;
+	}
+
+	bins =
+		kzalloc(sizeof(struct wsr_report_bins), GFP_KERNEL);
+
+	if (!bins) {
+		err = -ENOMEM;
+		goto failed;
+	}
+
+	err = workingset_report_intervals_parse(buf, bins);
+	if (err < 0)
+		goto failed;
+
+	if (err == 0) {
+		kfree(bins);
+		bins = NULL;
+	}
+
+	mutex_lock(&pgdat->wsr_update_mutex);
+	old = rcu_replace_pointer(pgdat->wsr_page_age_bins, bins,
+				  lockdep_is_held(&pgdat->wsr_update_mutex));
+	mutex_unlock(&pgdat->wsr_update_mutex);
+	kfree_rcu(old, rcu);
+	kfree(buf);
+	return len;
+failed:
+	kfree(bins);
+	kfree(buf);
+
+	return err;
+}
+
+static struct kobj_attribute page_age_intervals_attr =
+	__ATTR_RW(page_age_intervals);
+
+static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr,
+			     char *buf)
+{
+	struct wsr_report_bin *bin;
+	int ret = 0;
+	struct wsr_state *wsr = kobj_to_wsr(kobj);
+
+
+	mutex_lock(&wsr->page_age_lock);
+	if (!wsr->page_age)
+		wsr->page_age =
+			kzalloc(sizeof(struct wsr_page_age_histo), GFP_KERNEL);
+	mutex_unlock(&wsr->page_age_lock);
+
+	wsr_refresh_report(wsr, NULL, kobj_to_pgdat(kobj));
+
+	mutex_lock(&wsr->page_age_lock);
+	if (!wsr->page_age)
+		goto unlock;
+	for (bin = wsr->page_age->bins;
+	     bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++)
+		ret += sysfs_emit_at(buf, ret, "%u anon=%lu file=%lu\n",
+				     jiffies_to_msecs(bin->idle_age),
+				     bin->nr_pages[0] * PAGE_SIZE,
+				     bin->nr_pages[1] * PAGE_SIZE);
+
+	ret += sysfs_emit_at(buf, ret, "%lu anon=%lu file=%lu\n",
+			     WORKINGSET_INTERVAL_MAX,
+			     bin->nr_pages[0] * PAGE_SIZE,
+			     bin->nr_pages[1] * PAGE_SIZE);
+
+unlock:
+	mutex_unlock(&wsr->page_age_lock);
+	return ret;
+}
+
+static struct kobj_attribute page_age_attr = __ATTR_RO(page_age);
+
+static struct attribute *workingset_report_attrs[] = {
+	&page_age_intervals_attr.attr, &page_age_attr.attr, NULL
+};
+
+static const struct attribute_group workingset_report_attr_group = {
+	.name = "workingset_report",
+	.attrs = workingset_report_attrs,
+};
+
+void wsr_init_sysfs(struct node *node)
+{
+	struct kobject *kobj = node ? &node->dev.kobj : mm_kobj;
+	struct wsr_state *wsr;
+
+	if (IS_ENABLED(CONFIG_NUMA) && !node)
+		return;
+
+	wsr = kobj_to_wsr(kobj);
+
+	if (sysfs_create_group(kobj, &workingset_report_attr_group))
+		pr_warn("Workingset report failed to create sysfs files\n");
+}
+EXPORT_SYMBOL_GPL(wsr_init_sysfs);
+
+void wsr_remove_sysfs(struct node *node)
+{
+	struct kobject *kobj = &node->dev.kobj;
+	struct wsr_state *wsr;
+
+	if (IS_ENABLED(CONFIG_NUMA) && !node)
+		return;
+
+	wsr = kobj_to_wsr(kobj);
+	sysfs_remove_group(kobj, &workingset_report_attr_group);
+}
+EXPORT_SYMBOL_GPL(wsr_remove_sysfs);

From patchwork Wed Nov 27 02:57:22 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yuanchu Xie
X-Patchwork-Id: 845823
Date: Tue, 26 Nov 2024 18:57:22 -0800
In-Reply-To: <20241127025728.3689245-1-yuanchu@google.com>
References: <20241127025728.3689245-1-yuanchu@google.com>
X-Mailing-List: linux-kselftest@vger.kernel.org
X-Mailer: git-send-email 2.47.0.338.g60cca15819-goog
Message-ID: <20241127025728.3689245-4-yuanchu@google.com>
Subject: [PATCH v4 3/9] mm: report workingset during memory pressure driven scanning
From: Yuanchu Xie
To: Andrew Morton, David Hildenbrand, "Aneesh Kumar K.V", Khalid Aziz,
 Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying, Lance Yang,
 Randy Dunlap, Muhammad Usama Anjum
Cc: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
 Greg Kroah-Hartman, "Rafael J. Wysocki", "Michael S. Tsirkin", Jason Wang,
 Xuan Zhuo, Eugenio Pérez, Michal Hocko, Roman Gushchin, Shakeel Butt,
 Muchun Song, Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
 Yuanchu Xie, cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-kernel@vger.kernel.org, virtualization@lists.linux.dev,
 linux-mm@kvack.org, linux-kselftest@vger.kernel.org

When a node reaches its low watermarks and wakes up kswapd, notify all
userspace programs waiting on the workingset page age histogram of the
memory pressure, so a userspace agent can read the workingset report in
time and make policy decisions, such as logging, oom-killing, or
migration.

Sysfs interface:

/sys/devices/system/node/nodeX/workingset_report/report_threshold
	Time in milliseconds that limits how often the userspace agent
	can be notified of node memory pressure.

Signed-off-by: Yuanchu Xie
---
 include/linux/workingset_report.h |  4 +++
 mm/internal.h                     | 12 ++++++++
 mm/vmscan.c                       | 46 +++++++++++++++++++++++++++++++
 mm/workingset_report.c            | 43 ++++++++++++++++++++++++++++-
 4 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h
index 8bae6a600410..2ec8b927b200 100644
--- a/include/linux/workingset_report.h
+++ b/include/linux/workingset_report.h
@@ -37,7 +37,11 @@ struct wsr_page_age_histo {
 };

 struct wsr_state {
+	unsigned long report_threshold;
 	unsigned long refresh_interval;
+
+	struct kernfs_node *page_age_sys_file;
+
 	/* breakdown of workingset by page age */
 	struct mutex page_age_lock;
 	struct wsr_page_age_histo *page_age;
diff --git a/mm/internal.h b/mm/internal.h
index bbd3c1501bac..508b7d9937d6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -479,6 +479,18 @@ bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq, bool can_swap,
 			bool force_scan);
 void set_task_reclaim_state(struct task_struct *task, struct reclaim_state *rs);

+#ifdef CONFIG_WORKINGSET_REPORT
+/*
+ * in mm/workingset_report.c
+ */
+void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat);
+#else
+static inline void notify_workingset(struct mem_cgroup *memcg,
+				     struct pglist_data *pgdat)
+{
+}
+#endif
+
 /*
  * in mm/rmap.c:
  */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 89da4d8dfb5f..2bca81271d15 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2578,6 +2578,15 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
 	return can_demote(pgdat->node_id, sc);
 }

+#ifdef CONFIG_WORKINGSET_REPORT
+static void try_to_report_workingset(struct pglist_data *pgdat, struct scan_control *sc);
+#else
+static inline void try_to_report_workingset(struct pglist_data *pgdat,
+					    struct scan_control *sc)
+{
+}
+#endif
+
 #ifdef CONFIG_LRU_GEN

 #ifdef CONFIG_LRU_GEN_ENABLED
@@ -4004,6 +4013,8 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)

 	set_initial_priority(pgdat, sc);

+	try_to_report_workingset(pgdat, sc);
+
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
 		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
@@ -5649,6 +5660,38 @@ static int __init init_lru_gen(void)
 };
 late_initcall(init_lru_gen);

+#ifdef CONFIG_WORKINGSET_REPORT
+static void try_to_report_workingset(struct pglist_data *pgdat,
+				     struct scan_control *sc)
+{
+	struct mem_cgroup *memcg = sc->target_mem_cgroup;
+	struct wsr_state *wsr = &mem_cgroup_lruvec(memcg, pgdat)->wsr;
+	unsigned long threshold = READ_ONCE(wsr->report_threshold);
+
+	if (sc->priority == DEF_PRIORITY)
+		return;
+
+	if (!threshold)
+		return;
+
+	if (!mutex_trylock(&wsr->page_age_lock))
+		return;
+
+	if (!wsr->page_age) {
+		mutex_unlock(&wsr->page_age_lock);
+		return;
+	}
+
+	if (time_is_after_jiffies(wsr->page_age->timestamp + threshold)) {
+		mutex_unlock(&wsr->page_age_lock);
+		return;
+	}
+
+	mutex_unlock(&wsr->page_age_lock);
+	notify_workingset(memcg, pgdat);
+}
+#endif /* CONFIG_WORKINGSET_REPORT */
+
 #else /* !CONFIG_LRU_GEN */

 static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
@@ -6200,6 +6243,9 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			if (zone->zone_pgdat == last_pgdat)
 				continue;
 			last_pgdat = zone->zone_pgdat;
+
+			if (!sc->proactive)
+				try_to_report_workingset(zone->zone_pgdat, sc);
 			shrink_node(zone->zone_pgdat, sc);
 		}

diff --git a/mm/workingset_report.c b/mm/workingset_report.c
index 8678536ccfc7..bbefb0046669 100644
--- a/mm/workingset_report.c
+++ b/mm/workingset_report.c
@@ -320,6 +320,33 @@ static struct wsr_state *kobj_to_wsr(struct kobject *kobj)
 	return &mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj))->wsr;
 }

+static ssize_t report_threshold_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	struct wsr_state *wsr = kobj_to_wsr(kobj);
+	unsigned int threshold = READ_ONCE(wsr->report_threshold);
+
+	return sysfs_emit(buf, "%u\n", jiffies_to_msecs(threshold));
+}
+
+static ssize_t report_threshold_store(struct kobject *kobj,
+				      struct kobj_attribute *attr,
+				      const char *buf, size_t len)
+{
+	unsigned int threshold;
+	struct wsr_state *wsr = kobj_to_wsr(kobj);
+
+	if (kstrtouint(buf, 0, &threshold))
+		return -EINVAL;
+
+	WRITE_ONCE(wsr->report_threshold, msecs_to_jiffies(threshold));
+
+	return len;
+}
+
+static struct kobj_attribute report_threshold_attr =
+	__ATTR_RW(report_threshold);
+
 static ssize_t refresh_interval_show(struct kobject *kobj,
 				     struct kobj_attribute *attr, char *buf)
 {
@@ -474,6 +501,7 @@ static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr,
 static struct kobj_attribute page_age_attr = __ATTR_RO(page_age);

 static struct attribute *workingset_report_attrs[] = {
+	&report_threshold_attr.attr,
 	&refresh_interval_attr.attr,
 	&page_age_intervals_attr.attr,
 	&page_age_attr.attr,
@@ -495,8 +523,13 @@ void wsr_init_sysfs(struct node *node)

 	wsr = kobj_to_wsr(kobj);

-	if (sysfs_create_group(kobj, &workingset_report_attr_group))
+	if (sysfs_create_group(kobj, &workingset_report_attr_group)) {
 		pr_warn("Workingset report failed to create sysfs files\n");
+
return; + } + + wsr->page_age_sys_file = + kernfs_walk_and_get(kobj->sd, "workingset_report/page_age"); } EXPORT_SYMBOL_GPL(wsr_init_sysfs); @@ -509,6 +542,14 @@ void wsr_remove_sysfs(struct node *node) return; wsr = kobj_to_wsr(kobj); + kernfs_put(wsr->page_age_sys_file); sysfs_remove_group(kobj, &workingset_report_attr_group); } EXPORT_SYMBOL_GPL(wsr_remove_sysfs); + +void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat) +{ + struct wsr_state *wsr = &mem_cgroup_lruvec(memcg, pgdat)->wsr; + + kernfs_notify(wsr->page_age_sys_file); +} From patchwork Wed Nov 27 02:57:24 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yuanchu Xie X-Patchwork-Id: 845822
Date: Tue, 26 Nov 2024 18:57:24 -0800 In-Reply-To: <20241127025728.3689245-1-yuanchu@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org Mime-Version: 1.0 References: <20241127025728.3689245-1-yuanchu@google.com> X-Mailer: git-send-email 2.47.0.338.g60cca15819-goog Message-ID: <20241127025728.3689245-6-yuanchu@google.com> Subject: [PATCH v4 5/9] mm: add kernel aging thread for workingset reporting From: Yuanchu Xie To: Andrew Morton , David Hildenbrand , "Aneesh Kumar K.V" , Khalid Aziz , Henry Huang , Yu Zhao , Dan Williams , Gregory Price , Huang Ying , Lance Yang , Randy Dunlap , Muhammad Usama Anjum Cc: Tejun Heo , Johannes Weiner , " =?utf-8?q?Michal_Koutn=C3=BD?= " , Jonathan Corbet , Greg Kroah-Hartman , "Rafael J. Wysocki" , "Michael S. Tsirkin" , Jason Wang , Xuan Zhuo , " =?utf-8?q?Eugenio_P=C3=A9rez?= " , Michal Hocko , Roman Gushchin , Shakeel Butt , Muchun Song , Mike Rapoport , Shuah Khan , Christian Brauner , Daniel Watson , Yuanchu Xie , cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, virtualization@lists.linux.dev, linux-mm@kvack.org, linux-kselftest@vger.kernel.org For reliable and timely aging on memcgs, one has to read the page age histograms on time.
A kernel thread makes this easier by aging memcgs with a valid refresh_interval when they can be refreshed, and also reduces the latency of any userspace consumers of the page age histogram. The kernel aging thread is gated behind CONFIG_WORKINGSET_REPORT_AGING. Debugging stats may be added in the future for when aging cannot keep up with the configured refresh_interval. Signed-off-by: Yuanchu Xie --- include/linux/workingset_report.h | 10 ++- mm/Kconfig | 6 ++ mm/Makefile | 1 + mm/memcontrol.c | 2 +- mm/workingset_report.c | 13 ++- mm/workingset_report_aging.c | 127 ++++++++++++++++++++++++++++++ 6 files changed, 154 insertions(+), 5 deletions(-) create mode 100644 mm/workingset_report_aging.c diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h index 616be6469768..f6bbde2a04c3 100644 --- a/include/linux/workingset_report.h +++ b/include/linux/workingset_report.h @@ -64,7 +64,15 @@ void wsr_remove_sysfs(struct node *node); * The next refresh time is stored in refresh_time. */ bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, - struct pglist_data *pgdat); + struct pglist_data *pgdat, unsigned long *refresh_time); + +#ifdef CONFIG_WORKINGSET_REPORT_AGING +void wsr_wakeup_aging_thread(void); +#else /* CONFIG_WORKINGSET_REPORT_AGING */ +static inline void wsr_wakeup_aging_thread(void) +{ +} +#endif /* CONFIG_WORKINGSET_REPORT_AGING */ int wsr_set_refresh_interval(struct wsr_state *wsr, unsigned long refresh_interval); diff --git a/mm/Kconfig b/mm/Kconfig index be949786796d..a8def8c65610 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1310,6 +1310,12 @@ config WORKINGSET_REPORT This option exports stats and events giving the user more insight into its memory working set. +config WORKINGSET_REPORT_AGING + bool "Workingset report kernel aging thread" + depends on WORKINGSET_REPORT + help + Performs aging on memcgs with their configured refresh intervals.
+ source "mm/damon/Kconfig" endmenu diff --git a/mm/Makefile b/mm/Makefile index f5ef0768253a..3a282510f960 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -99,6 +99,7 @@ obj-$(CONFIG_PAGE_COUNTER) += page_counter.o obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o obj-$(CONFIG_WORKINGSET_REPORT) += workingset_report.o +obj-$(CONFIG_WORKINGSET_REPORT_AGING) += workingset_report_aging.o ifdef CONFIG_SWAP obj-$(CONFIG_MEMCG) += swap_cgroup.o endif diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d1032c6efc66..ea83f10b22a1 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4462,7 +4462,7 @@ static int memory_ws_page_age_show(struct seq_file *m, void *v) if (!READ_ONCE(wsr->page_age)) continue; - wsr_refresh_report(wsr, memcg, NODE_DATA(nid)); + wsr_refresh_report(wsr, memcg, NODE_DATA(nid), NULL); mutex_lock(&wsr->page_age_lock); if (!wsr->page_age) goto unlock; diff --git a/mm/workingset_report.c b/mm/workingset_report.c index 1e1bdb8bf75b..dad539e602bb 100644 --- a/mm/workingset_report.c +++ b/mm/workingset_report.c @@ -283,7 +283,7 @@ static void copy_node_bins(struct pglist_data *pgdat, } bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, - struct pglist_data *pgdat) + struct pglist_data *pgdat, unsigned long *refresh_time) { struct wsr_page_age_histo *page_age; unsigned long refresh_interval = READ_ONCE(wsr->refresh_interval); @@ -300,10 +300,14 @@ bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, goto unlock; if (page_age->timestamp && time_is_after_jiffies(page_age->timestamp + refresh_interval)) - goto unlock; + goto time; refresh_scan(wsr, root, pgdat, refresh_interval); copy_node_bins(pgdat, page_age); refresh_aggregate(page_age, root, pgdat); + +time: + if (refresh_time) + *refresh_time = page_age->timestamp + refresh_interval; unlock: mutex_unlock(&wsr->page_age_lock); return !!page_age; @@ -386,6 +390,9 @@ int wsr_set_refresh_interval(struct wsr_state *wsr, 
WRITE_ONCE(wsr->refresh_interval, msecs_to_jiffies(refresh_interval)); unlock: mutex_unlock(&wsr->page_age_lock); + if (!err && refresh_interval && + (!old_interval || jiffies_to_msecs(old_interval) > refresh_interval)) + wsr_wakeup_aging_thread(); return err; } @@ -491,7 +498,7 @@ static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr, int ret = 0; struct wsr_state *wsr = kobj_to_wsr(kobj); - wsr_refresh_report(wsr, NULL, kobj_to_pgdat(kobj)); + wsr_refresh_report(wsr, NULL, kobj_to_pgdat(kobj), NULL); mutex_lock(&wsr->page_age_lock); if (!wsr->page_age) diff --git a/mm/workingset_report_aging.c b/mm/workingset_report_aging.c new file mode 100644 index 000000000000..91ad5020778a --- /dev/null +++ b/mm/workingset_report_aging.c @@ -0,0 +1,127 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Workingset report kernel aging thread + * + * Performs aging on behalf of memcgs with their configured refresh interval. + * While a userspace program can periodically read the page age breakdown + * per-memcg and trigger aging, the kernel performing aging is less overhead, + * more consistent, and more reliable for the use case where every memcg should + * be aged according to their refresh interval. 
+ */ +#define pr_fmt(fmt) "workingset report aging: " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static DECLARE_WAIT_QUEUE_HEAD(aging_wait); +static bool refresh_pending; + +static bool do_aging_node(int nid, unsigned long *next_wake_time) +{ + struct mem_cgroup *memcg; + bool should_wait = true; + struct pglist_data *pgdat = NODE_DATA(nid); + + memcg = mem_cgroup_iter(NULL, NULL, NULL); + do { + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); + struct wsr_state *wsr = &lruvec->wsr; + unsigned long refresh_time; + + /* use returned time to decide when to wake up next */ + if (wsr_refresh_report(wsr, memcg, pgdat, &refresh_time)) { + if (should_wait) { + should_wait = false; + *next_wake_time = refresh_time; + } else if (time_before(refresh_time, *next_wake_time)) { + *next_wake_time = refresh_time; + } + } + + cond_resched(); + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); + + return should_wait; +} + +static int do_aging(void *unused) +{ + while (!kthread_should_stop()) { + int nid; + long timeout_ticks; + unsigned long next_wake_time; + bool should_wait = true; + + WRITE_ONCE(refresh_pending, false); + for_each_node_state(nid, N_MEMORY) { + unsigned long node_next_wake_time; + + if (do_aging_node(nid, &node_next_wake_time)) + continue; + if (should_wait) { + should_wait = false; + next_wake_time = node_next_wake_time; + } else if (time_before(node_next_wake_time, + next_wake_time)) { + next_wake_time = node_next_wake_time; + } + } + + if (should_wait) { + wait_event_interruptible(aging_wait, refresh_pending); + continue; + } + + /* sleep until next aging */ + timeout_ticks = next_wake_time - jiffies; + if (timeout_ticks > 0 && + timeout_ticks != MAX_SCHEDULE_TIMEOUT) { + schedule_timeout_idle(timeout_ticks); + continue; + } + } + return 0; +} + +/* Invoked when refresh_interval shortens or changes to a non-zero value. 
*/ +void wsr_wakeup_aging_thread(void) +{ + WRITE_ONCE(refresh_pending, true); + wake_up_interruptible(&aging_wait); +} + +static struct task_struct *aging_thread; + +static int aging_init(void) +{ + struct task_struct *task; + + task = kthread_run(do_aging, NULL, "kagingd"); + + if (IS_ERR(task)) { + pr_err("Failed to create aging kthread\n"); + return PTR_ERR(task); + } + + aging_thread = task; + pr_info("module loaded\n"); + return 0; +} + +static void aging_exit(void) +{ + kthread_stop(aging_thread); + aging_thread = NULL; + pr_info("module unloaded\n"); +} + +module_init(aging_init); +module_exit(aging_exit); From patchwork Wed Nov 27 02:57:26 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yuanchu Xie X-Patchwork-Id: 845821
Date: Tue, 26 Nov 2024 18:57:26 -0800 In-Reply-To: <20241127025728.3689245-1-yuanchu@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org Mime-Version: 1.0 References: <20241127025728.3689245-1-yuanchu@google.com> X-Mailer: git-send-email 2.47.0.338.g60cca15819-goog Message-ID: <20241127025728.3689245-8-yuanchu@google.com> Subject: [PATCH v4 7/9] Docs/admin-guide/mm/workingset_report: document sysfs and memcg interfaces From: Yuanchu Xie To: Andrew Morton , David Hildenbrand , "Aneesh Kumar K.V" , Khalid Aziz , Henry Huang , Yu Zhao , Dan Williams , Gregory Price , Huang Ying , Lance Yang , Randy Dunlap , Muhammad Usama Anjum Cc: Tejun Heo , Johannes Weiner , " =?utf-8?q?Michal_Koutn=C3=BD?= " , Jonathan Corbet , Greg Kroah-Hartman , "Rafael J. Wysocki" , "Michael S. Tsirkin" , Jason Wang , Xuan Zhuo , " =?utf-8?q?Eugenio_P=C3=A9rez?= " , Michal Hocko , Roman Gushchin , Shakeel Butt , Muchun Song , Mike Rapoport , Shuah Khan , Christian Brauner , Daniel Watson , Yuanchu Xie , cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, virtualization@lists.linux.dev, linux-mm@kvack.org, linux-kselftest@vger.kernel.org Add workingset reporting documentation for better discoverability of its sysfs and memcg interfaces.
Also document the required kernel config to enable workingset reporting. Signed-off-by: Yuanchu Xie --- Documentation/admin-guide/mm/index.rst | 1 + .../admin-guide/mm/workingset_report.rst | 105 ++++++++++++++++++ 2 files changed, 106 insertions(+) create mode 100644 Documentation/admin-guide/mm/workingset_report.rst diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index 8b35795b664b..61a2a347fc91 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -41,4 +41,5 @@ the Linux memory management. swap_numa transhuge userfaultfd + workingset_report zswap diff --git a/Documentation/admin-guide/mm/workingset_report.rst b/Documentation/admin-guide/mm/workingset_report.rst new file mode 100644 index 000000000000..0969513705c4 --- /dev/null +++ b/Documentation/admin-guide/mm/workingset_report.rst @@ -0,0 +1,105 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +Workingset Report +================= +Workingset report provides a view of memory coldness in user-defined +time intervals, e.g. X bytes are Y milliseconds cold. It breaks down +the user pages in the system per-NUMA node, per-memcg, for both +anonymous and file pages into histograms that look like: +:: + + 1000 anon=137368 file=24530 + 20000 anon=34342 file=0 + 30000 anon=353232 file=333608 + 40000 anon=407198 file=206052 + 9223372036854775807 anon=4925624 file=892892 + +The workingset reports can be used to drive proactive reclaim, by +identifying the number of cold bytes in a memcg, then writing to +``memory.reclaim``. + +Quick start +=========== +Build the kernel with the following configurations. The report relies +on Multi-gen LRU for page coldness. + +* ``CONFIG_LRU_GEN=y`` +* ``CONFIG_LRU_GEN_ENABLED=y`` +* ``CONFIG_WORKINGSET_REPORT=y`` + +Optionally, the aging kernel daemon can be enabled with the following +configuration. 
+* ``CONFIG_WORKINGSET_REPORT_AGING=y`` + +Sysfs interfaces +================ +``/sys/devices/system/node/nodeX/workingset_report/page_age`` provides +a per-node page age histogram, showing an aggregate of the node's lruvecs. +Reading this file causes a hierarchical aging of all lruvecs, scanning +pages and creating a new Multi-gen LRU generation in each lruvec. +For example: +:: + + 1000 anon=0 file=0 + 2000 anon=0 file=0 + 100000 anon=5533696 file=5566464 + 18446744073709551615 anon=0 file=0 + +``/sys/devices/system/node/nodeX/workingset_report/page_age_intervals`` +is a comma-separated list of times in milliseconds that configures what +the page age histogram uses for aggregation. For the above histogram, +the intervals are:: + + 1000,2000,100000 + +``/sys/devices/system/node/nodeX/workingset_report/refresh_interval`` +defines the amount of time the report is valid for in milliseconds. +When a report is still valid, reading the ``page_age`` file shows +the existing valid report, instead of generating a new one. + +``/sys/devices/system/node/nodeX/workingset_report/report_threshold`` +specifies how often the userspace agent can be notified for node +memory pressure, in milliseconds. When a node reaches its low +watermarks and wakes up kswapd, programs waiting on ``page_age`` are +woken up so they can read the histogram and make policy decisions. + +Memcg interface +=============== +While ``page_age_intervals`` is defined per-node in sysfs, ``page_age``, +``refresh_interval`` and ``report_threshold`` are available per-memcg. + +``/sys/fs/cgroup/.../memory.workingset.page_age`` +The memcg equivalent of the sysfs workingset page age histogram +breaks down the workingset of this memcg and its children into +page age intervals. Each node is prefixed with a node header and +a newline. Non-proactive direct reclaim on this memcg can also +wake up userspace agents that are waiting on this file. +E.g.
+:: + + N0 + 1000 anon=0 file=0 + 2000 anon=0 file=0 + 3000 anon=0 file=0 + 4000 anon=0 file=0 + 5000 anon=0 file=0 + 18446744073709551615 anon=0 file=0 + +``/sys/fs/cgroup/.../memory.workingset.refresh_interval`` +The memcg equivalent of the sysfs refresh interval. A per-node +number of how much time a page age histogram is valid for, in +milliseconds. +E.g. +:: + + echo N0=2000 > memory.workingset.refresh_interval + +``/sys/fs/cgroup/.../memory.workingset.report_threshold`` +The memcg equivalent of the sysfs report threshold. A per-node +number of how often a userspace agent waiting on the page age +histogram can be woken up, in milliseconds. +E.g. +:: + + echo N0=1000 > memory.workingset.report_threshold From patchwork Wed Nov 27 02:57:28 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yuanchu Xie X-Patchwork-Id: 845820
Date: Tue, 26 Nov 2024 18:57:28 -0800 In-Reply-To: <20241127025728.3689245-1-yuanchu@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org Mime-Version: 1.0 References: <20241127025728.3689245-1-yuanchu@google.com> X-Mailer: git-send-email 2.47.0.338.g60cca15819-goog Message-ID: <20241127025728.3689245-10-yuanchu@google.com> Subject: [PATCH v4 9/9] virtio-balloon: add workingset reporting From: Yuanchu Xie To: Andrew Morton , David Hildenbrand , "Aneesh Kumar K.V" , Khalid Aziz , Henry Huang , Yu Zhao , Dan Williams , Gregory Price , Huang Ying , Lance Yang , Randy Dunlap , Muhammad Usama Anjum Cc: Tejun Heo , Johannes Weiner , " =?utf-8?q?Michal_Koutn=C3=BD?= " , Jonathan Corbet , Greg Kroah-Hartman , "Rafael J. Wysocki" , "Michael S.
Tsirkin", Jason Wang, Xuan Zhuo, Eugenio Pérez, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson, Yuanchu Xie, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, virtualization@lists.linux.dev, linux-mm@kvack.org, linux-kselftest@vger.kernel.org

Ballooning is a way to dynamically size a VM, and it requires guest collaboration. The amount to balloon without adversely affecting guest performance is hard to compute without clear metrics from the guest. Workingset reporting can provide guidance to the host to allow better collaborative ballooning, such that the host balloon controller can properly gauge the amount of memory the guest is actively using, i.e., the working set.

A draft QEMU series [1] is being worked on. Currently it is able to configure the workingset reporting bins, refresh_interval, and report threshold. Through QMP or HMP, a balloon controller can request a workingset report. There is also a script [2] exercising the QMP interface with a visual breakdown of the guest's workingset size.

According to the OASIS VIRTIO v1.3 specification, a new balloon device is in the works, and the one I'm extending here is the "traditional" balloon. If the existing balloon device is not the right place for new features, I'm more than happy to add this to the new one as well.

For technical details, this patch adds a generic mechanism to the workingset reporting infrastructure that allows other parts of the kernel to receive workingset reports. Two virtqueues are added to the virtio-balloon device: notification_vq and the report virtqueue (working_set_vq in the code). The notification virtqueue allows the host to configure the guest workingset reporting parameters and request a report. The report virtqueue sends a working set report to the host when one is requested or when one is triggered by memory pressure.
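As an illustration of the notification path described above, here is a minimal userspace sketch of how a host-side tool might fill a CONFIG notification that passes the checks in virtio_balloon_register_working_set_receiver(). The struct mirrors virtio_balloon_working_set_notify from this patch; the names ws_notify, ws_build_config, to_le16, to_le32 and WS_IDLE_AGE_END are hypothetical and do not come from the patch or from QEMU. It assumes the terminator is the all-ones 32-bit value, since the driver compares against (u32)WORKINGSET_INTERVAL_MAX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Constants copied from the patch. */
#define WORKINGSET_REPORT_MAX_NR_BINS 32
#define VIRTIO_BALLOON_WS_OP_CONFIG 2

/* Assumption: the terminator is (u32)WORKINGSET_INTERVAL_MAX, i.e. all ones. */
#define WS_IDLE_AGE_END UINT32_MAX

/* Hypothetical userspace mirror of struct virtio_balloon_working_set_notify. */
struct ws_notify {
	uint16_t op;               /* little-endian */
	uint16_t node_id;          /* little-endian */
	uint32_t report_threshold; /* little-endian, milliseconds */
	uint32_t refresh_interval; /* little-endian, milliseconds */
	uint32_t idle_age[WORKINGSET_REPORT_MAX_NR_BINS]; /* little-endian, ms */
};
static_assert(sizeof(struct ws_notify) == 12 + 4 * WORKINGSET_REPORT_MAX_NR_BINS,
	      "layout must not contain padding");

/* Store a value so that its in-memory bytes are little-endian on any host. */
static uint16_t to_le16(uint16_t v)
{
	uint8_t b[2] = { (uint8_t)(v & 0xff), (uint8_t)(v >> 8) };
	uint16_t out;

	memcpy(&out, b, sizeof(out));
	return out;
}

static uint32_t to_le32(uint32_t v)
{
	uint8_t b[4] = { (uint8_t)(v & 0xff), (uint8_t)((v >> 8) & 0xff),
			 (uint8_t)((v >> 16) & 0xff), (uint8_t)(v >> 24) };
	uint32_t out;

	memcpy(&out, b, sizeof(out));
	return out;
}

/*
 * Fill a CONFIG notification. ages_ms holds n finite bin boundaries in
 * milliseconds, without the terminator. Returns 0 on success, or -1 if
 * the guest-side validation in the patch would reject the array.
 */
static int ws_build_config(struct ws_notify *msg, uint16_t node_id,
			   uint32_t threshold_ms, uint32_t refresh_ms,
			   const uint32_t *ages_ms, size_t n)
{
	size_t i;

	/* At least one finite boundary, and room left for the terminator. */
	if (n < 1 || n >= WORKINGSET_REPORT_MAX_NR_BINS)
		return -1;
	for (i = 0; i < n; i++) {
		if (ages_ms[i] == WS_IDLE_AGE_END)
			return -1; /* reserved as the terminator */
		if (i > 0 && ages_ms[i] <= ages_ms[i - 1])
			return -1; /* boundaries must strictly increase */
	}

	memset(msg, 0, sizeof(*msg));
	msg->op = to_le16(VIRTIO_BALLOON_WS_OP_CONFIG);
	msg->node_id = to_le16(node_id);
	msg->report_threshold = to_le32(threshold_ms);
	msg->refresh_interval = to_le32(refresh_ms);
	for (i = 0; i < n; i++)
		msg->idle_age[i] = to_le32(ages_ms[i]);
	msg->idle_age[n] = to_le32(WS_IDLE_AGE_END);
	return 0;
}
```

A caller would pass, for example, the boundaries {1000, 5000, 30000} with n = 3 and then place the resulting buffer on the notification virtqueue. Note that the driver compares boundaries after msecs_to_jiffies(), so two millisecond values that round to the same jiffies count may still be rejected by the guest even though this sketch accepts them.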
The workingset reporting feature is gated by the compilation flag CONFIG_WORKINGSET_REPORT and the balloon feature flag VIRTIO_BALLOON_F_WS_REPORTING. [1] https://github.com/Dummyc0m/qemu/tree/wsr [2] https://gist.github.com/Dummyc0m/d45b4e1b0dda8f2bc6cd8cfb37cc7e34 Signed-off-by: Yuanchu Xie --- drivers/virtio/virtio_balloon.c | 390 +++++++++++++++++++++++++++- include/linux/balloon_compaction.h | 1 + include/linux/mmzone.h | 4 + include/linux/workingset_report.h | 66 ++++- include/uapi/linux/virtio_balloon.h | 30 +++ mm/workingset_report.c | 89 ++++++- 6 files changed, 566 insertions(+), 14 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index b36d2803674e..8eb300653dd8 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -18,6 +18,7 @@ #include #include #include +#include /* * Balloon device works in 4K page units. So each page is pointed to by @@ -45,6 +46,8 @@ enum virtio_balloon_vq { VIRTIO_BALLOON_VQ_STATS, VIRTIO_BALLOON_VQ_FREE_PAGE, VIRTIO_BALLOON_VQ_REPORTING, + VIRTIO_BALLOON_VQ_WORKING_SET, + VIRTIO_BALLOON_VQ_NOTIFY, VIRTIO_BALLOON_VQ_MAX }; @@ -124,6 +127,23 @@ struct virtio_balloon { spinlock_t wakeup_lock; bool processing_wakeup_event; u32 wakeup_signal_mask; + +#ifdef CONFIG_WORKINGSET_REPORT + struct virtqueue *working_set_vq, *notification_vq; + + /* Protects node_id, wsr_receiver, and report_buf */ + struct mutex wsr_report_lock; + int wsr_node_id; + struct wsr_receiver wsr_receiver; + /* Buffer to report to host */ + struct virtio_balloon_working_set_report *report_buf; + + /* Buffer to hold incoming notification from the host. 
*/ + struct virtio_balloon_working_set_notify *notify_buf; + + struct work_struct update_balloon_working_set_work; + struct work_struct update_balloon_notification_work; +#endif }; #define VIRTIO_BALLOON_WAKEUP_SIGNAL_ADJUST (1 << 0) @@ -339,8 +359,352 @@ static unsigned int leak_balloon(struct virtio_balloon *vb, size_t num) return num_freed_pages; } -static inline void update_stat(struct virtio_balloon *vb, int idx, - u16 tag, u64 val) +#ifdef CONFIG_WORKINGSET_REPORT +static bool wsr_is_configured(struct virtio_balloon *vb) +{ + if (node_online(READ_ONCE(vb->wsr_node_id)) && + READ_ONCE(vb->wsr_receiver.wsr.refresh_interval) > 0 && + READ_ONCE(vb->wsr_receiver.wsr.page_age) != NULL) + return true; + return false; +} + +/* wsr_receiver callback */ +static void wsr_receiver_notify(struct wsr_receiver *receiver) +{ + int bin; + struct virtio_balloon *vb = + container_of(receiver, struct virtio_balloon, wsr_receiver); + + /* if we fail to acquire the locks, send stale report */ + if (!mutex_trylock(&vb->wsr_report_lock)) + goto out; + if (!mutex_trylock(&receiver->wsr.page_age_lock)) + goto out_unlock_report_buf; + if (!READ_ONCE(receiver->wsr.page_age)) + goto out_unlock_page_age; + + vb->report_buf->error = cpu_to_le32(0); + vb->report_buf->node_id = cpu_to_le32(vb->wsr_node_id); + for (bin = 0; bin < WORKINGSET_REPORT_MAX_NR_BINS; ++bin) { + struct virtio_balloon_working_set_report_bin *dest = + &vb->report_buf->bins[bin]; + struct wsr_report_bin *src = &receiver->wsr.page_age->bins[bin]; + + dest->anon_bytes = + cpu_to_le64(src->nr_pages[LRU_GEN_ANON] * PAGE_SIZE); + dest->file_bytes = + cpu_to_le64(src->nr_pages[LRU_GEN_FILE] * PAGE_SIZE); + if (src->idle_age == WORKINGSET_INTERVAL_MAX) { + dest->idle_age = cpu_to_le64(WORKINGSET_INTERVAL_MAX); + break; + } + dest->idle_age = cpu_to_le64(jiffies_to_msecs(src->idle_age)); + } + +out_unlock_page_age: + mutex_unlock(&receiver->wsr.page_age_lock); +out_unlock_report_buf: + mutex_unlock(&vb->wsr_report_lock); +out: 
+ /* Send the working set report to the device. */ + spin_lock(&vb->stop_update_lock); + if (!vb->stop_update) + queue_work(system_freezable_wq, &vb->update_balloon_working_set_work); + spin_unlock(&vb->stop_update_lock); +} + +static void virtio_balloon_working_set_request(struct virtio_balloon *vb, + int nid) +{ + int err = 0; + + if (!node_online(nid)) { + err = -EINVAL; + goto error; + } + + err = wsr_refresh_receiver_report(NODE_DATA(nid)); + if (err) + goto error; + + return; +error: + mutex_lock(&vb->wsr_report_lock); + vb->report_buf->error = cpu_to_le32(err); + vb->report_buf->node_id = cpu_to_le32(nid); + mutex_unlock(&vb->wsr_report_lock); + spin_lock(&vb->stop_update_lock); + if (!vb->stop_update) + queue_work(system_freezable_wq, + &vb->update_balloon_working_set_work); + spin_unlock(&vb->stop_update_lock); +} + +static void notification_receive(struct virtqueue *vq) +{ + struct virtio_balloon *vb = vq->vdev->priv; + + spin_lock(&vb->stop_update_lock); + if (!vb->stop_update) + queue_work(system_freezable_wq, &vb->update_balloon_notification_work); + spin_unlock(&vb->stop_update_lock); +} + +static int +virtio_balloon_register_working_set_receiver(struct virtio_balloon *vb) +{ + struct pglist_data *pgdat; + struct wsr_report_bins *bins = NULL, __rcu *old; + int nid, bin, err = 0, old_nid = vb->wsr_node_id; + struct virtio_balloon_working_set_notify *notify = vb->notify_buf; + + nid = le16_to_cpu(notify->node_id); + if (!node_online(nid)) { + dev_warn(&vb->vdev->dev, "node not online %d\n", nid); + return -EINVAL; + } + + pgdat = NODE_DATA(nid); + bins = kzalloc(sizeof(struct wsr_report_bins), GFP_KERNEL); + + if (!bins) + return -ENOMEM; + + for (bin = 0; bin < WORKINGSET_REPORT_MAX_NR_BINS; ++bin) { + u32 age_msecs = le32_to_cpu(notify->idle_age[bin]); + unsigned long age = msecs_to_jiffies(age_msecs); + + /* + * A correct idle_age array should end in + * WORKINGSET_INTERVAL_MAX.
+ */ + if (age_msecs == (u32)WORKINGSET_INTERVAL_MAX) { + bins->idle_age[bin] = WORKINGSET_INTERVAL_MAX; + break; + } + bins->idle_age[bin] = age; + if (bin > 0 && bins->idle_age[bin] <= bins->idle_age[bin - 1]) { + dev_warn(&vb->vdev->dev, "bins not increasing\n"); + err = -EINVAL; + goto error; + } + } + if (bin < WORKINGSET_REPORT_MIN_NR_BINS - 1 || + bin == WORKINGSET_REPORT_MAX_NR_BINS) { + err = -ERANGE; + goto error; + } + bins->nr_bins = bin; + + mutex_lock(&vb->wsr_report_lock); + err = wsr_set_refresh_interval( + &vb->wsr_receiver.wsr, + le32_to_cpu(notify->refresh_interval)); + if (err) { + mutex_unlock(&vb->wsr_report_lock); + goto error; + } + if (old_nid != NUMA_NO_NODE) + wsr_remove_receiver(&vb->wsr_receiver, NODE_DATA(old_nid)); + WRITE_ONCE(vb->wsr_node_id, nid); + WRITE_ONCE(vb->wsr_receiver.wsr.report_threshold, + msecs_to_jiffies(le32_to_cpu(notify->report_threshold))); + WRITE_ONCE(vb->wsr_receiver.notify, wsr_receiver_notify); + mutex_unlock(&vb->wsr_report_lock); + + /* update the bins for target node */ + mutex_lock(&pgdat->wsr_update_mutex); + old = rcu_replace_pointer(pgdat->wsr_page_age_bins, bins, + lockdep_is_held(&pgdat->wsr_update_mutex)); + mutex_unlock(&pgdat->wsr_update_mutex); + kfree_rcu(old, rcu); + + wsr_register_receiver(&vb->wsr_receiver, pgdat); + + return 0; +error: + kfree(bins); + return err; +} + +static void update_balloon_notification_func(struct work_struct *work) +{ + unsigned int len, op; + int err; + struct virtio_balloon *vb; + struct scatterlist sg_in; + + vb = container_of(work, struct virtio_balloon, + update_balloon_notification_work); + op = le16_to_cpu(vb->notify_buf->op); + + switch (op) { + case VIRTIO_BALLOON_WS_OP_REQUEST: + virtio_balloon_working_set_request(vb, + READ_ONCE(vb->wsr_node_id)); + break; + case VIRTIO_BALLOON_WS_OP_CONFIG: + err = virtio_balloon_register_working_set_receiver(vb); + if (err) + dev_warn(&vb->vdev->dev, + "Error configuring working set, %d\n", err); + break; + default: + 
dev_warn(&vb->vdev->dev, "Received invalid notification, %u\n", + op); + break; + } + + /* Detach all the used buffers from the vq */ + while (virtqueue_get_buf(vb->notification_vq, &len)) + ; + /* Add a new notification buffer for device to fill. */ + sg_init_one(&sg_in, vb->notify_buf, sizeof(*vb->notify_buf)); + virtqueue_add_inbuf(vb->notification_vq, &sg_in, 1, vb, GFP_KERNEL); + virtqueue_kick(vb->notification_vq); +} + +static void update_balloon_ws_func(struct work_struct *work) +{ + struct virtio_balloon *vb; + + vb = container_of(work, struct virtio_balloon, + update_balloon_working_set_work); + + if (wsr_is_configured(vb)) { + struct scatterlist sg_out; + int unused; + int err; + + /* Detach all the used buffers from the vq */ + while (virtqueue_get_buf(vb->working_set_vq, &unused)) + ; + sg_init_one(&sg_out, vb->report_buf, sizeof(*vb->report_buf)); + err = virtqueue_add_outbuf(vb->working_set_vq, &sg_out, 1, vb, GFP_KERNEL); + if (unlikely(err)) + dev_err(&vb->vdev->dev, + "Failed to send working set report err = %d\n", + err); + else + virtqueue_kick(vb->working_set_vq); + + } else { + dev_warn(&vb->vdev->dev, "Working Set not initialized."); + } +} + +static void wsr_init_vqs_info(struct virtio_balloon *vb, + struct virtqueue_info vqs_info[]) +{ + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_WS_REPORTING)) { + vqs_info[VIRTIO_BALLOON_VQ_WORKING_SET].name = "ws"; + vqs_info[VIRTIO_BALLOON_VQ_WORKING_SET].callback = NULL; + vqs_info[VIRTIO_BALLOON_VQ_NOTIFY].name = "notify"; + vqs_info[VIRTIO_BALLOON_VQ_NOTIFY].callback = notification_receive; + } +} + +static int wsr_init_vq(struct virtio_balloon *vb, struct virtqueue *vqs[]) +{ + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_WS_REPORTING)) { + struct scatterlist sg; + int err; + + vb->working_set_vq = vqs[VIRTIO_BALLOON_VQ_WORKING_SET]; + vb->notification_vq = vqs[VIRTIO_BALLOON_VQ_NOTIFY]; + + /* Prime the notification virtqueue for the device to fill. 
*/ + sg_init_one(&sg, vb->notify_buf, sizeof(*vb->notify_buf)); + err = virtqueue_add_inbuf(vb->notification_vq, &sg, 1, vb, GFP_KERNEL); + if (unlikely(err)) { + dev_err(&vb->vdev->dev, + "Failed to prepare notifications, err = %d\n", err); + return err; + } + virtqueue_kick(vb->notification_vq); + } + return 0; +} + +static void wsr_init_work(struct virtio_balloon *vb) +{ + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_WS_REPORTING)) { + INIT_WORK(&vb->update_balloon_working_set_work, + update_balloon_ws_func); + INIT_WORK(&vb->update_balloon_notification_work, + update_balloon_notification_func); + } +} + +static int wsr_init(struct virtio_balloon *vb) +{ + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_WS_REPORTING)) { + vb->report_buf = kzalloc(sizeof(*vb->report_buf), GFP_KERNEL); + if (!vb->report_buf) + return -ENOMEM; + + vb->notify_buf = kzalloc(sizeof(*vb->notify_buf), GFP_KERNEL); + if (!vb->notify_buf) { + kfree(vb->report_buf); + vb->report_buf = NULL; + return -ENOMEM; + } + + wsr_init_state(&vb->wsr_receiver.wsr); + vb->wsr_node_id = NUMA_NO_NODE; + vb->report_buf->bins[0].idle_age = WORKINGSET_INTERVAL_MAX; + } + return 0; +} + +static void wsr_remove(struct virtio_balloon *vb) +{ + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_WS_REPORTING) && + vb->wsr_node_id != NUMA_NO_NODE) { + wsr_remove_receiver(&vb->wsr_receiver, NODE_DATA(vb->wsr_node_id)); + wsr_destroy_state(&vb->wsr_receiver.wsr); + } + + kfree(vb->report_buf); + kfree(vb->notify_buf); + mutex_destroy(&vb->wsr_report_lock); +} + +static void wsr_cancel_work(struct virtio_balloon *vb) +{ + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_WS_REPORTING)) { + cancel_work_sync(&vb->update_balloon_working_set_work); + cancel_work_sync(&vb->update_balloon_notification_work); + } +} +#else +static inline void wsr_init_vqs_info(struct virtio_balloon *vb, + struct virtqueue_info vqs_info[]) +{ +} +static inline int wsr_init_vq(struct virtio_balloon *vb, + struct virtqueue *vqs[]) +{ + 
return 0; +} +static inline void wsr_init_work(struct virtio_balloon *vb) +{ +} +static inline int wsr_init(struct virtio_balloon *vb) +{ + return 0; +} +static inline void wsr_remove(struct virtio_balloon *vb) +{ +} +static inline void wsr_cancel_work(struct virtio_balloon *vb) +{ +} +#endif + +static inline void update_stat(struct virtio_balloon *vb, int idx, u16 tag, + u64 val) { BUG_ON(idx >= VIRTIO_BALLOON_S_NR); vb->stats[idx].tag = cpu_to_virtio16(vb->vdev, tag); @@ -605,6 +969,8 @@ static int init_vqs(struct virtio_balloon *vb) vqs_info[VIRTIO_BALLOON_VQ_REPORTING].callback = balloon_ack; } + wsr_init_vqs_info(vb, vqs_info); + err = virtio_find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX, vqs, vqs_info, NULL); if (err) @@ -615,6 +981,7 @@ static int init_vqs(struct virtio_balloon *vb) if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) { struct scatterlist sg; unsigned int num_stats; + vb->stats_vq = vqs[VIRTIO_BALLOON_VQ_STATS]; /* @@ -640,6 +1007,11 @@ static int init_vqs(struct virtio_balloon *vb) if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING]; + err = wsr_init_vq(vb, vqs); + + if (err) + return err; + return 0; } @@ -961,15 +1333,21 @@ static int virtballoon_probe(struct virtio_device *vdev) goto out; } + vb->vdev = vdev; + INIT_WORK(&vb->update_balloon_stats_work, update_balloon_stats_func); INIT_WORK(&vb->update_balloon_size_work, update_balloon_size_func); + wsr_init_work(vb); spin_lock_init(&vb->stop_update_lock); mutex_init(&vb->balloon_lock); init_waitqueue_head(&vb->acked); - vb->vdev = vdev; balloon_devinfo_init(&vb->vb_dev_info); + err = wsr_init(vb); + if (err) + goto out_remove_wsr; + err = init_vqs(vb); if (err) goto out_free_vb; @@ -1085,7 +1463,6 @@ static int virtballoon_probe(struct virtio_device *vdev) if (towards_target(vb)) virtballoon_changed(vdev); return 0; - out_unregister_oom: if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) 
unregister_oom_notifier(&vb->oom_nb); @@ -1099,6 +1476,8 @@ static int virtballoon_probe(struct virtio_device *vdev) vdev->config->del_vqs(vdev); out_free_vb: kfree(vb); +out_remove_wsr: + wsr_remove(vb); out: return err; } @@ -1130,11 +1509,13 @@ static void virtballoon_remove(struct virtio_device *vdev) unregister_oom_notifier(&vb->oom_nb); if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) virtio_balloon_unregister_shrinker(vb); + wsr_remove(vb); spin_lock_irq(&vb->stop_update_lock); vb->stop_update = true; spin_unlock_irq(&vb->stop_update_lock); cancel_work_sync(&vb->update_balloon_size_work); cancel_work_sync(&vb->update_balloon_stats_work); + wsr_cancel_work(vb); if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) { cancel_work_sync(&vb->report_free_page_work); @@ -1200,6 +1581,7 @@ static unsigned int features[] = { VIRTIO_BALLOON_F_FREE_PAGE_HINT, VIRTIO_BALLOON_F_PAGE_POISON, VIRTIO_BALLOON_F_REPORTING, + VIRTIO_BALLOON_F_WS_REPORTING, }; static struct virtio_driver virtio_balloon_driver = { diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index 5ca2d5699620..d92b8337dbcf 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -43,6 +43,7 @@ #include #include #include +#include /* * Balloon device information descriptor. 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index ee728c0c5a3b..9a2dc506779d 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1429,8 +1429,12 @@ typedef struct pglist_data { #endif #ifdef CONFIG_WORKINGSET_REPORT + /* protects wsr_page_age_bins */ struct mutex wsr_update_mutex; struct wsr_report_bins __rcu *wsr_page_age_bins; + /* protects wsr_receiver_list */ + struct mutex wsr_receiver_mutex; + struct list_head wsr_receiver_list; #endif CACHELINE_PADDING(_pad2_); diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h index f6bbde2a04c3..1074b89035e9 100644 --- a/include/linux/workingset_report.h +++ b/include/linux/workingset_report.h @@ -11,13 +11,14 @@ struct node; struct lruvec; struct cgroup_file; struct wsr_state; - -#ifdef CONFIG_WORKINGSET_REPORT +struct wsr_receiver; #define WORKINGSET_REPORT_MIN_NR_BINS 2 #define WORKINGSET_REPORT_MAX_NR_BINS 32 #define WORKINGSET_INTERVAL_MAX ((unsigned long)-1) + +#ifdef CONFIG_WORKINGSET_REPORT #define ANON_AND_FILE 2 struct wsr_report_bin { @@ -52,6 +53,8 @@ struct wsr_state { struct wsr_page_age_histo *page_age; }; +void wsr_init_state(struct wsr_state *wsr); +void wsr_destroy_state(struct wsr_state *wsr); void wsr_init_lruvec(struct lruvec *lruvec); void wsr_destroy_lruvec(struct lruvec *lruvec); void wsr_init_pgdat(struct pglist_data *pgdat); @@ -66,6 +69,47 @@ void wsr_remove_sysfs(struct node *node); bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, struct pglist_data *pgdat, unsigned long *refresh_time); +/* + * If refresh_interval > 0, enable working set reporting and kick + * the aging thread (if configured). + * If refresh_interval = 0, disable working set reporting and free + * the bookkeeping resources. + * + * @param refresh_interval milliseconds.
+ */ +int wsr_set_refresh_interval(struct wsr_state *wsr, + unsigned long refresh_interval); + +struct wsr_receiver { + /* + * Working set reporting ensures that two notify calls to + * the same receiver cannot interleave with one another. + * + * Must be set before calling wsr_register_receiver. + */ + void (*notify)(struct wsr_receiver *receiver); + struct wsr_state wsr; + struct list_head list; +}; + +/* + * Register a per-node receiver + * report_threshold and refresh_interval are configured + * by the caller in struct wsr_state and contain valid values. + * page_age is allocated. + */ +void wsr_register_receiver(struct wsr_receiver *receiver, + struct pglist_data *pgdat); + +void wsr_remove_receiver(struct wsr_receiver *receiver, + struct pglist_data *pgdat); + +/* + * Refresh the report for the specified node, unless a refresh is already + * in progress or the parameters are being updated. + */ +int wsr_refresh_receiver_report(struct pglist_data *pgdat); + #ifdef CONFIG_WORKINGSET_REPORT_AGING void wsr_wakeup_aging_thread(void); #else /* CONFIG_WORKINGSET_REPORT_AGING */ @@ -77,6 +121,12 @@ static inline void wsr_wakeup_aging_thread(void) int wsr_set_refresh_interval(struct wsr_state *wsr, unsigned long refresh_interval); #else +static inline void wsr_init_state(struct wsr_state *wsr) +{ +} +static inline void wsr_destroy_state(struct wsr_state *wsr) +{ +} static inline void wsr_init_lruvec(struct lruvec *lruvec) { } @@ -100,6 +150,18 @@ static inline int wsr_set_refresh_interval(struct wsr_state *wsr, { return 0; } +static inline int wsr_register_receiver(struct wsr_receiver *receiver, + struct pglist_data *pgdat) +{ + return -ENODEV; +} +static inline void wsr_remove_receiver(struct wsr_receiver *receiver, + struct pglist_data *pgdat) +{ +} +static inline void wsr_refresh_receiver_report(struct pglist_data *pgdat) +{ +} #endif /* CONFIG_WORKINGSET_REPORT */ #endif /* _LINUX_WORKINGSET_REPORT_H */ diff --git a/include/uapi/linux/virtio_balloon.h
b/include/uapi/linux/virtio_balloon.h index ee35a372805d..668eaa39c85b 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -25,6 +25,7 @@ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ +#include "linux/workingset_report.h" #include #include #include @@ -37,6 +38,7 @@ #define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */ #define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */ #define VIRTIO_BALLOON_F_REPORTING 5 /* Page reporting virtqueue */ +#define VIRTIO_BALLOON_F_WS_REPORTING 6 /* Working Set Size reporting */ /* Size of a PFN in the balloon interface. */ #define VIRTIO_BALLOON_PFN_SHIFT 12 @@ -128,4 +130,32 @@ struct virtio_balloon_stat { __virtio64 val; } __attribute__((packed)); +/* Operations from the device */ +#define VIRTIO_BALLOON_WS_OP_REQUEST 1 +#define VIRTIO_BALLOON_WS_OP_CONFIG 2 + +struct virtio_balloon_working_set_notify { + /* REQUEST or CONFIG */ + __le16 op; + __le16 node_id; + /* the following fields valid iff op=CONFIG */ + __le32 report_threshold; + __le32 refresh_interval; + __le32 idle_age[WORKINGSET_REPORT_MAX_NR_BINS]; +}; + +struct virtio_balloon_working_set_report_bin { + __le64 idle_age; + /* bytes in this bucket for anon and file */ + __le64 anon_bytes; + __le64 file_bytes; +}; + +struct virtio_balloon_working_set_report { + __le32 error; + __le32 node_id; + struct virtio_balloon_working_set_report_bin + bins[WORKINGSET_REPORT_MAX_NR_BINS]; +}; + #endif /* _LINUX_VIRTIO_BALLOON_H */ diff --git a/mm/workingset_report.c b/mm/workingset_report.c index dad539e602bb..4b3397ebdbd0 100644 --- a/mm/workingset_report.c +++ b/mm/workingset_report.c @@ -20,27 +20,51 @@ void wsr_init_pgdat(struct pglist_data *pgdat) { mutex_init(&pgdat->wsr_update_mutex); RCU_INIT_POINTER(pgdat->wsr_page_age_bins, NULL); + INIT_LIST_HEAD(&pgdat->wsr_receiver_list); 
} void wsr_destroy_pgdat(struct pglist_data *pgdat) { struct wsr_report_bins __rcu *bins; + struct list_head *cursor, *next; mutex_lock(&pgdat->wsr_update_mutex); bins = rcu_replace_pointer(pgdat->wsr_page_age_bins, NULL, lockdep_is_held(&pgdat->wsr_update_mutex)); - kfree_rcu(bins, rcu); mutex_unlock(&pgdat->wsr_update_mutex); + kfree_rcu(bins, rcu); + mutex_lock(&pgdat->wsr_receiver_mutex); + list_for_each_safe(cursor, next, &pgdat->wsr_receiver_list) { + /* pgdat does not own the receiver, so it's not free'd here */ + list_del(cursor); + } + mutex_unlock(&pgdat->wsr_receiver_mutex); + mutex_destroy(&pgdat->wsr_update_mutex); + mutex_destroy(&pgdat->wsr_receiver_mutex); +} + +void wsr_init_state(struct wsr_state *wsr) +{ + memset(wsr, 0, sizeof(*wsr)); + mutex_init(&wsr->page_age_lock); +} +EXPORT_SYMBOL_GPL(wsr_init_state); + +void wsr_destroy_state(struct wsr_state *wsr) +{ + kfree(wsr->page_age); + mutex_destroy(&wsr->page_age_lock); + memset(wsr, 0, sizeof(*wsr)); } +EXPORT_SYMBOL_GPL(wsr_destroy_state); void wsr_init_lruvec(struct lruvec *lruvec) { struct wsr_state *wsr = &lruvec->wsr; struct mem_cgroup *memcg = lruvec_memcg(lruvec); - memset(wsr, 0, sizeof(*wsr)); - mutex_init(&wsr->page_age_lock); + wsr_init_state(wsr); if (memcg && !mem_cgroup_is_root(memcg)) wsr->page_age_cgroup_file = mem_cgroup_page_age_file(memcg); } @@ -49,9 +73,7 @@ void wsr_destroy_lruvec(struct lruvec *lruvec) { struct wsr_state *wsr = &lruvec->wsr; - mutex_destroy(&wsr->page_age_lock); - kfree(wsr->page_age); - memset(wsr, 0, sizeof(*wsr)); + wsr_destroy_state(wsr); } int workingset_report_intervals_parse(char *src, @@ -395,6 +417,7 @@ int wsr_set_refresh_interval(struct wsr_state *wsr, wsr_wakeup_aging_thread(); return err; } +EXPORT_SYMBOL_GPL(wsr_set_refresh_interval); static ssize_t refresh_interval_store(struct kobject *kobj, struct kobj_attribute *attr, @@ -569,12 +592,62 @@ void wsr_remove_sysfs(struct node *node) } EXPORT_SYMBOL_GPL(wsr_remove_sysfs); +/* wsr belongs to 
the root memcg or memcg is disabled */ +static int notify_receiver(struct wsr_state *wsr, struct pglist_data *pgdat) +{ + struct list_head *cursor; + + if (!mutex_trylock(&pgdat->wsr_receiver_mutex)) + return -EAGAIN; + list_for_each(cursor, &pgdat->wsr_receiver_list) { + struct wsr_receiver *entry = + list_entry(cursor, struct wsr_receiver, list); + + wsr_refresh_report(&entry->wsr, NULL, pgdat, NULL); + entry->notify(entry); + } + mutex_unlock(&pgdat->wsr_receiver_mutex); + return 0; +} + +int wsr_refresh_receiver_report(struct pglist_data *pgdat) +{ + struct wsr_state *wsr = &mem_cgroup_lruvec(NULL, pgdat)->wsr; + + return notify_receiver(wsr, pgdat); +} +EXPORT_SYMBOL_GPL(wsr_refresh_receiver_report); + void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat) { struct wsr_state *wsr = &mem_cgroup_lruvec(memcg, pgdat)->wsr; - if (mem_cgroup_is_root(memcg)) + if (mem_cgroup_is_root(memcg)) { kernfs_notify(wsr->page_age_sys_file); - else + notify_receiver(wsr, pgdat); + } else cgroup_file_notify(wsr->page_age_cgroup_file); } + +void wsr_register_receiver(struct wsr_receiver *receiver, + struct pglist_data *pgdat) +{ + struct wsr_state *wsr = &receiver->wsr; + + mutex_lock(&pgdat->wsr_receiver_mutex); + list_add_tail(&receiver->list, &pgdat->wsr_receiver_list); + mutex_unlock(&pgdat->wsr_receiver_mutex); + + if (!!wsr->page_age && READ_ONCE(wsr->refresh_interval)) + wsr_wakeup_aging_thread(); +} +EXPORT_SYMBOL(wsr_register_receiver); + +void wsr_remove_receiver(struct wsr_receiver *receiver, + struct pglist_data *pgdat) +{ + mutex_lock(&pgdat->wsr_receiver_mutex); + list_del(&receiver->list); + mutex_unlock(&pgdat->wsr_receiver_mutex); +} +EXPORT_SYMBOL(wsr_remove_receiver);
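
For readers following the report path, the sketch below shows how a host-side consumer might decode the struct virtio_balloon_working_set_report that wsr_receiver_notify() fills in. It is a userspace illustration only, not part of the patch or of the draft QEMU series. The names ws_report, ws_report_bin, ws_report_bytes_within, le32_to_host, le64_to_host and WS_REPORT_AGE_END are hypothetical. Two assumptions are made: each bin's idle_age is treated as the upper bound of that bin, and the last bin carries an all-ones 64-bit idle_age, which holds for a 64-bit guest where WORKINGSET_INTERVAL_MAX is (unsigned long)-1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Constant copied from the patch. */
#define WORKINGSET_REPORT_MAX_NR_BINS 32

/* Assumption: a 64-bit guest, so the last bin's idle_age is all ones. */
#define WS_REPORT_AGE_END UINT64_MAX

/* Hypothetical mirrors of the uapi report structs; fields are little-endian. */
struct ws_report_bin {
	uint64_t idle_age;   /* milliseconds, or WS_REPORT_AGE_END */
	uint64_t anon_bytes;
	uint64_t file_bytes;
};

struct ws_report {
	uint32_t error;
	uint32_t node_id;
	struct ws_report_bin bins[WORKINGSET_REPORT_MAX_NR_BINS];
};
static_assert(sizeof(struct ws_report) == 8 + 24 * WORKINGSET_REPORT_MAX_NR_BINS,
	      "layout must not contain padding");

/* Read a little-endian field on a host of any endianness. */
static uint32_t le32_to_host(uint32_t v)
{
	uint8_t b[4];

	memcpy(b, &v, sizeof(b));
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

static uint64_t le64_to_host(uint64_t v)
{
	uint8_t b[8];
	uint64_t out = 0;
	int i;

	memcpy(b, &v, sizeof(b));
	for (i = 7; i >= 0; i--)
		out = (out << 8) | b[i];
	return out;
}

/*
 * Sum anon and file bytes over all bins whose idle_age boundary is at
 * most max_idle_ms. Returns 0 and stores the sum in *bytes on success,
 * or -1 if the report carries an error or has no terminating bin.
 */
static int ws_report_bytes_within(const struct ws_report *r,
				  uint64_t max_idle_ms, uint64_t *bytes)
{
	uint64_t sum = 0;
	size_t i;

	if (le32_to_host(r->error) != 0)
		return -1;

	for (i = 0; i < WORKINGSET_REPORT_MAX_NR_BINS; i++) {
		uint64_t age = le64_to_host(r->bins[i].idle_age);

		if (age == WS_REPORT_AGE_END) {
			/* Last bin: pages idle longer than every boundary. */
			*bytes = sum;
			return 0;
		}
		if (age <= max_idle_ms)
			sum += le64_to_host(r->bins[i].anon_bytes) +
			       le64_to_host(r->bins[i].file_bytes);
	}
	return -1;
}
```

A balloon controller could call ws_report_bytes_within() with a few candidate idle thresholds to estimate how much guest memory has been touched recently, and size the balloon from the remainder. Whether that policy is appropriate depends on the workload; the sketch only demonstrates the decoding.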