From patchwork Tue Apr 29 23:38:29 2025
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 885989
From: Nhat Pham
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com,
	yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev,
	shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com,
	chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org,
	huang.ying.caritas@gmail.com, ryan.roberts@arm.com,
	viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de,
	lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu,
	pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org, linux-pm@vger.kernel.org, peterx@redhat.com
Subject: [RFC PATCH v2 01/18] swap: rearrange the swap header file
Date: Tue, 29 Apr 2025 16:38:29 -0700
Message-ID: <20250429233848.3093350-2-nphamcs@gmail.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20250429233848.3093350-1-nphamcs@gmail.com>
References: <20250429233848.3093350-1-nphamcs@gmail.com>

In the swap header file (include/linux/swap.h), group the swap API into
the following categories:

1. Lifetime swap functions (i.e. the functions that change the reference
   count of the swap entry).
2. Swap cache API.
3. Physical swapfile allocator and swap device API.

Also remove the extern keyword from the function declarations that are
rearranged.

This is purely a cleanup. No functional change intended.
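For orientation, the reorganized CONFIG_SWAP block of include/linux/swap.h
ends up laid out roughly as in the condensed sketch below. Only a few
representative declarations per category are shown; the full change follows
after the sign-off.

#ifdef CONFIG_SWAP

/* Lifetime swap API (mm/swapfile.c) */
swp_entry_t folio_alloc_swap(struct folio *folio);
int swap_duplicate(swp_entry_t);
void swap_free_nr(swp_entry_t entry, int nr_pages);

/* Swap cache API (mm/swap_state.c) */
void free_swap_cache(struct folio *folio);
int init_swap_address_space(unsigned int type, unsigned long nr_pages);

/* Physical swap allocator and swap device API (mm/swapfile.c) */
int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
		    unsigned long nr_pages, sector_t start_block);
struct swap_info_struct *get_swap_device(swp_entry_t entry);

#endif /* CONFIG_SWAP */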
Signed-off-by: Nhat Pham --- include/linux/swap.h | 63 +++++++++++++++++++++++--------------------- 1 file changed, 33 insertions(+), 30 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index b13b72645db3..8b8c10356a5c 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -453,24 +453,40 @@ extern void __meminit kswapd_stop(int nid); #ifdef CONFIG_SWAP -int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page, - unsigned long nr_pages, sector_t start_block); -int generic_swapfile_activate(struct swap_info_struct *, struct file *, - sector_t *); - +/* Lifetime swap API (mm/swapfile.c) */ +swp_entry_t folio_alloc_swap(struct folio *folio); +bool folio_free_swap(struct folio *folio); +void put_swap_folio(struct folio *folio, swp_entry_t entry); +void swap_shmem_alloc(swp_entry_t, int); +int swap_duplicate(swp_entry_t); +int swapcache_prepare(swp_entry_t entry, int nr); +void swap_free_nr(swp_entry_t entry, int nr_pages); +void free_swap_and_cache_nr(swp_entry_t entry, int nr); +int __swap_count(swp_entry_t entry); +int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry); +int swp_swapcount(swp_entry_t entry); + +/* Swap cache API (mm/swap_state.c) */ static inline unsigned long total_swapcache_pages(void) { return global_node_page_state(NR_SWAPCACHE); } - -void free_swap_cache(struct folio *folio); void free_page_and_swap_cache(struct page *); void free_pages_and_swap_cache(struct encoded_page **, int); -/* linux/mm/swapfile.c */ +void free_swap_cache(struct folio *folio); +int init_swap_address_space(unsigned int type, unsigned long nr_pages); +void exit_swap_address_space(unsigned int type); + +/* Physical swap allocator and swap device API (mm/swapfile.c) */ +int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page, + unsigned long nr_pages, sector_t start_block); +int generic_swapfile_activate(struct swap_info_struct *, struct file *, + sector_t *); + extern atomic_long_t nr_swap_pages; extern long total_swap_pages; extern atomic_t nr_rotate_swap; -extern bool has_usable_swap(void); +bool has_usable_swap(void); /* Swap 50% full? Release swapcache more aggressively.. 
*/ static inline bool vm_swap_full(void) @@ -483,31 +499,18 @@ static inline long get_nr_swap_pages(void) return atomic_long_read(&nr_swap_pages); } -extern void si_swapinfo(struct sysinfo *); -swp_entry_t folio_alloc_swap(struct folio *folio); -bool folio_free_swap(struct folio *folio); -void put_swap_folio(struct folio *folio, swp_entry_t entry); -extern swp_entry_t get_swap_page_of_type(int); -extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order); -extern int add_swap_count_continuation(swp_entry_t, gfp_t); -extern void swap_shmem_alloc(swp_entry_t, int); -extern int swap_duplicate(swp_entry_t); -extern int swapcache_prepare(swp_entry_t entry, int nr); -extern void swap_free_nr(swp_entry_t entry, int nr_pages); -extern void swapcache_free_entries(swp_entry_t *entries, int n); -extern void free_swap_and_cache_nr(swp_entry_t entry, int nr); +void si_swapinfo(struct sysinfo *); +swp_entry_t get_swap_page_of_type(int); +int get_swap_pages(int n, swp_entry_t swp_entries[], int order); +int add_swap_count_continuation(swp_entry_t, gfp_t); +void swapcache_free_entries(swp_entry_t *entries, int n); int swap_type_of(dev_t device, sector_t offset); int find_first_swap(dev_t *device); -extern unsigned int count_swap_pages(int, int); -extern sector_t swapdev_block(int, pgoff_t); -extern int __swap_count(swp_entry_t entry); -extern int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry); -extern int swp_swapcount(swp_entry_t entry); +unsigned int count_swap_pages(int, int); +sector_t swapdev_block(int, pgoff_t); struct swap_info_struct *swp_swap_info(swp_entry_t entry); struct backing_dev_info; -extern int init_swap_address_space(unsigned int type, unsigned long nr_pages); -extern void exit_swap_address_space(unsigned int type); -extern struct swap_info_struct *get_swap_device(swp_entry_t entry); +struct swap_info_struct *get_swap_device(swp_entry_t entry); sector_t swap_folio_sector(struct folio *folio); static inline void put_swap_device(struct swap_info_struct *si) From patchwork Tue Apr 29 23:38:30 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Nhat Pham X-Patchwork-Id: 886261 Received: from mail-yw1-f179.google.com (mail-yw1-f179.google.com [209.85.128.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id AE5EF2DCB41; Tue, 29 Apr 2025 23:38:52 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.179 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745969934; cv=none; b=XZYsdunsvXFpLClEk50Y/7c2uoh6JaTVNBe/O9d4qsjHIxYDs6MVZyLRcvv5erXvuxNKIJTswa1He8laOrLJ9OQ1mQCs2JA3tNsWpn+EtJKy7OXHVhcXWEuiO6hzasRLR3S1EXyOdRA9gI/RMJkdcGCakbIi4xIzaoLz/PZ+aik= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745969934; c=relaxed/simple; bh=K+fE1tEOojfiAdRs+JtbTz/Ixc47rZypB0NeTTCr0xk=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=XrRdMb5nj1TYPJ7IGQNQ+ZYz23iaO53wZPC/P7XhPMLa/CTFuwOo+2jtn+Bcu8hdprdLKZkZb3uebXB6pxokB+agj3fuFXKb3ITVwl3HPzmG1zoFGkMsiRmjT5MNRmHx4OMzs35eds4bDF6wCvYHg40aJe5yaZn4H3ijiH6TCf8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=J0sg9pwf; arc=none smtp.client-ip=209.85.128.179 
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="J0sg9pwf" Received: by mail-yw1-f179.google.com with SMTP id 00721157ae682-6fead015247so61557837b3.2; Tue, 29 Apr 2025 16:38:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1745969931; x=1746574731; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=oKPYMi9z2u+pylu9elX+HeHi2VBZ+HlSHJvjFWP/QJA=; b=J0sg9pwfRsJ7csNyKGKLuey8bbgtZ6jtTBZGdBUhYWHS5sJl/AKcfQzwxCYFOM4pdD zP+vVIK6q6p93paoXTAbmHfX0R9GiIcNViBGSYjq3kxbPNkoTkqWZqphW0wvFLL8v9Tv SMaDVnQfNxFxCaKFwqN6Z66NuoqZhRC7y9ogaa36UTn9+CsVxhjyenq5QmgWC0BqTo4A 5+hH7ZmtukBVRErBWBGBplA2I1t++0YV24aJjJUQSU+qetuRulh8H2jtp2FypKnMlEmr Yb9vCuqLHyt53WGzbQdBGM+cer8Jkmuw4F1L+WeIuSTmK5eNae0k9SwjosrddGu1plr9 VaJg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1745969931; x=1746574731; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=oKPYMi9z2u+pylu9elX+HeHi2VBZ+HlSHJvjFWP/QJA=; b=m4AQgoQP0pxtrFMj6XlBoqo8d+CyA8IVkvdjAT9VDKsn40xOQNvuaJZ/7ZZcWXQ1Rc 0bS3zMXpwkJ3xT2QZt0WSlcm/PUpk1Wdr23JrEg4gZPz4Zb5x4L2h6NXZ0ocJwaPOh2u jppYvNgNOixB2fTUlb6dgHCBP4ERV/39dqTCtWi8drut183gN8cDZQXUZ5Ptff0VciaA Ubf2W0mZyeYrbprobf0R5FcF4XLhq/HZm2IWqdnLgLuCttyxFIh5SgT3R+Rxb1e+ma/K IZINmjHpVLtzsDapfsZORKgeULxSga2DyuDQoY3KvgHhVvdhhA408P3SFhN/241Q9LnL 7CGQ== X-Forwarded-Encrypted: i=1; AJvYcCUKmkISwj0vGt2IftnRFGlFTCtBeBsVIy1hMSQofqP4JKlVlYjqg04KuT5a8FWpDNP1CEhZBWHu@vger.kernel.org, AJvYcCUzGTIFZkmL2Rm70imC8XYRmf/Zh080j5wFqO7Kk+rhn8mxa7MOR1bvsmJJDcy7uIbwetuOsfjovl0=@vger.kernel.org, AJvYcCVMuDRd5fjyWPtRvE9lsfdcX/C4NNZplJgjbrISFXY9aTdm5KGMuLSwsV/4odbxy3W+0Ul+o7iWUr0BTOCv@vger.kernel.org X-Gm-Message-State: AOJu0YwiM3wWzxeeaNKU8TeRynAu897CV7SWGC2cWwP6OqjYvMhrF+NX GDi1w00HHYgE/LWgM83AvDg6xB0CwE6yMCJNCDCsMRkUEBGuFWj6 X-Gm-Gg: ASbGnculB5siLtFTscp57um30AOLdWtidaC/t/3qVK1plOjQnp5e8hfCG2pGWFmQ5D6 3lEtLSOJmLLlCu9u7OM2QYiP5Yd46KHLDHVfqDoZpwi1wO1tMM6GRnWBVSV7iKfak8NiqZvHTiD hDN0dI4z/k931yoxXdvlGbhq0ANFxOJlsyiGtM9zJ7fOE1BWrfpR0tLS26FRCun95bQ5JDoGbEX UoOwYJ8jLWGNMDjjSALAAdC1oAvR2KS1YqLQlBL28YW4c0lcZIMswTsRnJOy0vMztyqEM1w+7u9 Fzp9yOLv+ZTAdWnTBSMojV0ArVtwKRpX X-Google-Smtp-Source: AGHT+IEbVsUwWm1vJ8cPgyEKm/U3mnCYlKZjFZzkCwn0NZV9arn5fKzWNi2sdzUswsF0UpfObKyaeA== X-Received: by 2002:a05:690c:3509:b0:6fb:9429:83c5 with SMTP id 00721157ae682-708ad623882mr10923457b3.19.1745969931462; Tue, 29 Apr 2025 16:38:51 -0700 (PDT) Received: from localhost ([2a03:2880:25ff:74::]) by smtp.gmail.com with ESMTPSA id 00721157ae682-708ae06befesm761997b3.55.2025.04.29.16.38.50 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 29 Apr 2025 16:38:51 -0700 (PDT) From: Nhat Pham To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, 
lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org, peterx@redhat.com Subject: [RFC PATCH v2 02/18] swapfile: rearrange functions Date: Tue, 29 Apr 2025 16:38:30 -0700 Message-ID: <20250429233848.3093350-3-nphamcs@gmail.com> X-Mailer: git-send-email 2.47.1 In-Reply-To: <20250429233848.3093350-1-nphamcs@gmail.com> References: <20250429233848.3093350-1-nphamcs@gmail.com> Precedence: bulk X-Mailing-List: linux-pm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Rearrange some functions in preparation for the rest of the series. No functional change intended. Signed-off-by: Nhat Pham --- mm/swapfile.c | 332 +++++++++++++++++++++++++------------------------- 1 file changed, 166 insertions(+), 166 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index df7c4e8b089c..426674d35983 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -124,11 +124,6 @@ static struct swap_info_struct *swap_type_to_swap_info(int type) return READ_ONCE(swap_info[type]); /* rcu_dereference() */ } -static inline unsigned char swap_count(unsigned char ent) -{ - return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */ -} - /* * Use the second highest bit of inuse_pages counter as the indicator * if one swap device is on the available plist, so the atomic can @@ -161,6 +156,11 @@ static long swap_usage_in_pages(struct swap_info_struct *si) /* Reclaim directly, bypass the slot cache and don't touch device lock */ #define TTRS_DIRECT 0x8 +static inline unsigned char swap_count(unsigned char ent) +{ + return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */ +} + static bool swap_is_has_cache(struct swap_info_struct *si, unsigned long offset, int nr_pages) { @@ -1326,46 +1326,6 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry) return NULL; } -static unsigned char __swap_entry_free_locked(struct swap_info_struct *si, - unsigned long offset, - unsigned char usage) -{ - unsigned char count; - unsigned char has_cache; - - count = si->swap_map[offset]; - - has_cache = count & SWAP_HAS_CACHE; - count &= ~SWAP_HAS_CACHE; - - if (usage == SWAP_HAS_CACHE) { - VM_BUG_ON(!has_cache); - has_cache = 0; - } else if (count == SWAP_MAP_SHMEM) { - /* - * Or we could insist on shmem.c using a special - * swap_shmem_free() and free_shmem_swap_and_cache()... 
- */ - count = 0; - } else if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) { - if (count == COUNT_CONTINUED) { - if (swap_count_continued(si, offset, count)) - count = SWAP_MAP_MAX | COUNT_CONTINUED; - else - count = SWAP_MAP_MAX; - } else - count--; - } - - usage = count | has_cache; - if (usage) - WRITE_ONCE(si->swap_map[offset], usage); - else - WRITE_ONCE(si->swap_map[offset], SWAP_HAS_CACHE); - - return usage; -} - /* * When we get a swap entry, if there aren't some other ways to * prevent swapoff, such as the folio in swap cache is locked, RCU @@ -1432,6 +1392,46 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry) return NULL; } +static unsigned char __swap_entry_free_locked(struct swap_info_struct *si, + unsigned long offset, + unsigned char usage) +{ + unsigned char count; + unsigned char has_cache; + + count = si->swap_map[offset]; + + has_cache = count & SWAP_HAS_CACHE; + count &= ~SWAP_HAS_CACHE; + + if (usage == SWAP_HAS_CACHE) { + VM_BUG_ON(!has_cache); + has_cache = 0; + } else if (count == SWAP_MAP_SHMEM) { + /* + * Or we could insist on shmem.c using a special + * swap_shmem_free() and free_shmem_swap_and_cache()... + */ + count = 0; + } else if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) { + if (count == COUNT_CONTINUED) { + if (swap_count_continued(si, offset, count)) + count = SWAP_MAP_MAX | COUNT_CONTINUED; + else + count = SWAP_MAP_MAX; + } else + count--; + } + + usage = count | has_cache; + if (usage) + WRITE_ONCE(si->swap_map[offset], usage); + else + WRITE_ONCE(si->swap_map[offset], SWAP_HAS_CACHE); + + return usage; +} + static unsigned char __swap_entry_free(struct swap_info_struct *si, swp_entry_t entry) { @@ -1585,25 +1585,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry) unlock_cluster(ci); } -void swapcache_free_entries(swp_entry_t *entries, int n) -{ - int i; - struct swap_cluster_info *ci; - struct swap_info_struct *si = NULL; - - if (n <= 0) - return; - - for (i = 0; i < n; ++i) { - si = _swap_info_get(entries[i]); - if (si) { - ci = lock_cluster(si, swp_offset(entries[i])); - swap_entry_range_free(si, ci, entries[i], 1); - unlock_cluster(ci); - } - } -} - int __swap_count(swp_entry_t entry) { struct swap_info_struct *si = swp_swap_info(entry); @@ -1717,57 +1698,6 @@ static bool folio_swapped(struct folio *folio) return swap_page_trans_huge_swapped(si, entry, folio_order(folio)); } -static bool folio_swapcache_freeable(struct folio *folio) -{ - VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); - - if (!folio_test_swapcache(folio)) - return false; - if (folio_test_writeback(folio)) - return false; - - /* - * Once hibernation has begun to create its image of memory, - * there's a danger that one of the calls to folio_free_swap() - * - most probably a call from __try_to_reclaim_swap() while - * hibernation is allocating its own swap pages for the image, - * but conceivably even a call from memory reclaim - will free - * the swap from a folio which has already been recorded in the - * image as a clean swapcache folio, and then reuse its swap for - * another page of the image. On waking from hibernation, the - * original folio might be freed under memory pressure, then - * later read back in from swap, now with the wrong data. - * - * Hibernation suspends storage while it is writing the image - * to disk so check that here. - */ - if (pm_suspended_storage()) - return false; - - return true; -} - -/** - * folio_free_swap() - Free the swap space used for this folio. - * @folio: The folio to remove. 
- * - * If swap is getting full, or if there are no more mappings of this folio, - * then call folio_free_swap to free its swap space. - * - * Return: true if we were able to release the swap space. - */ -bool folio_free_swap(struct folio *folio) -{ - if (!folio_swapcache_freeable(folio)) - return false; - if (folio_swapped(folio)) - return false; - - delete_from_swap_cache(folio); - folio_set_dirty(folio); - return true; -} - /** * free_swap_and_cache_nr() - Release reference on range of swap entries and * reclaim their cache if no more references remain. @@ -1842,6 +1772,76 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) put_swap_device(si); } +void swapcache_free_entries(swp_entry_t *entries, int n) +{ + int i; + struct swap_cluster_info *ci; + struct swap_info_struct *si = NULL; + + if (n <= 0) + return; + + for (i = 0; i < n; ++i) { + si = _swap_info_get(entries[i]); + if (si) { + ci = lock_cluster(si, swp_offset(entries[i])); + swap_entry_range_free(si, ci, entries[i], 1); + unlock_cluster(ci); + } + } +} + +static bool folio_swapcache_freeable(struct folio *folio) +{ + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); + + if (!folio_test_swapcache(folio)) + return false; + if (folio_test_writeback(folio)) + return false; + + /* + * Once hibernation has begun to create its image of memory, + * there's a danger that one of the calls to folio_free_swap() + * - most probably a call from __try_to_reclaim_swap() while + * hibernation is allocating its own swap pages for the image, + * but conceivably even a call from memory reclaim - will free + * the swap from a folio which has already been recorded in the + * image as a clean swapcache folio, and then reuse its swap for + * another page of the image. On waking from hibernation, the + * original folio might be freed under memory pressure, then + * later read back in from swap, now with the wrong data. + * + * Hibernation suspends storage while it is writing the image + * to disk so check that here. + */ + if (pm_suspended_storage()) + return false; + + return true; +} + +/** + * folio_free_swap() - Free the swap space used for this folio. + * @folio: The folio to remove. + * + * If swap is getting full, or if there are no more mappings of this folio, + * then call folio_free_swap to free its swap space. + * + * Return: true if we were able to release the swap space. + */ +bool folio_free_swap(struct folio *folio) +{ + if (!folio_swapcache_freeable(folio)) + return false; + if (folio_swapped(folio)) + return false; + + delete_from_swap_cache(folio); + folio_set_dirty(folio); + return true; +} + #ifdef CONFIG_HIBERNATION swp_entry_t get_swap_page_of_type(int type) @@ -1957,6 +1957,37 @@ unsigned int count_swap_pages(int type, int free) } #endif /* CONFIG_HIBERNATION */ +/* + * Scan swap_map from current position to next entry still in use. + * Return 0 if there are no inuse entries after prev till end of + * the map. + */ +static unsigned int find_next_to_unuse(struct swap_info_struct *si, + unsigned int prev) +{ + unsigned int i; + unsigned char count; + + /* + * No need for swap_lock here: we're just looking + * for whether an entry is in use, not modifying it; false + * hits are okay, and sys_swapoff() has already prevented new + * allocations from this area (while holding swap_lock). 
+ */ + for (i = prev + 1; i < si->max; i++) { + count = READ_ONCE(si->swap_map[i]); + if (count && swap_count(count) != SWAP_MAP_BAD) + break; + if ((i % LATENCY_LIMIT) == 0) + cond_resched(); + } + + if (i == si->max) + i = 0; + + return i; +} + static inline int pte_same_as_swp(pte_t pte, pte_t swp_pte) { return pte_same(pte_swp_clear_flags(pte), swp_pte); @@ -2241,37 +2272,6 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type) return ret; } -/* - * Scan swap_map from current position to next entry still in use. - * Return 0 if there are no inuse entries after prev till end of - * the map. - */ -static unsigned int find_next_to_unuse(struct swap_info_struct *si, - unsigned int prev) -{ - unsigned int i; - unsigned char count; - - /* - * No need for swap_lock here: we're just looking - * for whether an entry is in use, not modifying it; false - * hits are okay, and sys_swapoff() has already prevented new - * allocations from this area (while holding swap_lock). - */ - for (i = prev + 1; i < si->max; i++) { - count = READ_ONCE(si->swap_map[i]); - if (count && swap_count(count) != SWAP_MAP_BAD) - break; - if ((i % LATENCY_LIMIT) == 0) - cond_resched(); - } - - if (i == si->max) - i = 0; - - return i; -} - static int try_to_unuse(unsigned int type) { struct mm_struct *prev_mm; @@ -3525,6 +3525,26 @@ void si_swapinfo(struct sysinfo *val) spin_unlock(&swap_lock); } +struct swap_info_struct *swp_swap_info(swp_entry_t entry) +{ + return swap_type_to_swap_info(swp_type(entry)); +} + +/* + * out-of-line methods to avoid include hell. + */ +struct address_space *swapcache_mapping(struct folio *folio) +{ + return swp_swap_info(folio->swap)->swap_file->f_mapping; +} +EXPORT_SYMBOL_GPL(swapcache_mapping); + +pgoff_t __folio_swap_cache_index(struct folio *folio) +{ + return swap_cache_index(folio->swap); +} +EXPORT_SYMBOL_GPL(__folio_swap_cache_index); + /* * Verify that nr swap entries are valid and increment their swap map counts. * @@ -3658,26 +3678,6 @@ void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr) cluster_swap_free_nr(si, offset, nr, SWAP_HAS_CACHE); } -struct swap_info_struct *swp_swap_info(swp_entry_t entry) -{ - return swap_type_to_swap_info(swp_type(entry)); -} - -/* - * out-of-line methods to avoid include hell. 
- */ -struct address_space *swapcache_mapping(struct folio *folio) -{ - return swp_swap_info(folio->swap)->swap_file->f_mapping; -} -EXPORT_SYMBOL_GPL(swapcache_mapping); - -pgoff_t __folio_swap_cache_index(struct folio *folio) -{ - return swap_cache_index(folio->swap); -} -EXPORT_SYMBOL_GPL(__folio_swap_cache_index); - /* * add_swap_count_continuation - called when a swap count is duplicated * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's From patchwork Tue Apr 29 23:38:31 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Nhat Pham X-Patchwork-Id: 885988 Received: from mail-yw1-f177.google.com (mail-yw1-f177.google.com [209.85.128.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5D6E72DCB5D; Tue, 29 Apr 2025 23:38:53 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.177 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745969934; cv=none; b=lxP/pXsJQTdm3xHBUBzFEoOubB+T43rAR3vKNUcSLg/6M2P9nis0C5Fcu/HUx9zYX9BKwUxsw5U/8KbXuZHT5kjQ59FGvq04zbM+tcnHqGVwUR3d0oWsxpoLBhlhoo6JFBLUOtoFZrPGohQ8xsCoKrpHh9UPFjP/3v9q/Q8LZD0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745969934; c=relaxed/simple; bh=Q4DB29dSrYsP+LDqQSDdApEPVzxvdtk+efseOmHmDYE=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=mIyaQmF1SoSlWBHz7tlziEnAHcMQHwnQuy/HSnJc3QsQEeepNWM8wcvWgEpb1h+q8gMmiC5e+aBDv6n5A0ueIXKRz06dhli+sNCfxWBbSm+IQZSvBW5y0xN9JQacGC6wJ+TGKLz8dCn4CQ+7V/mfMjIdANDkwZq1buk2lZFxdHo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=HN8sTkWu; arc=none smtp.client-ip=209.85.128.177 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="HN8sTkWu" Received: by mail-yw1-f177.google.com with SMTP id 00721157ae682-7082ad1355bso56756487b3.1; Tue, 29 Apr 2025 16:38:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1745969932; x=1746574732; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=14KDt7UXK6qD0EIwc8xBiPaJIpnP0mlTkZz3fHF5ExI=; b=HN8sTkWukVlEQwesuSEq5TeksGULB9/iNGatOS7MMyFkHq6gySjlj2Ac/w81ImOoHd xamNYk+O0M7e5lt/mJS4h0N0CDiXx4DpndOrB6y9aJz1DuOTrwIOMrvPsvfIJVjBF8VG lcXo7+xbXoFLOM816GhM7Vhy4S10kooz2LN28Lo/NZkSFW9yd9p0XKQ/ejidM7dleIFl 6XElCOfrQj+R5sYjH2qK461kdvNyJ2y8PlfyFUImIkXv5IV2zMutZvs4zUfjngQshXtJ hH55p3IaTGF71+zGHbeFxpaiFQAvZnHLXLys7g7OlXHgiIUEB90OFO9yv6MN+5iBoKVO NGtA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1745969932; x=1746574732; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=14KDt7UXK6qD0EIwc8xBiPaJIpnP0mlTkZz3fHF5ExI=; b=cYN30HzlxAbRtV2iDhB59rV2/HdrvPlbCZEWf+vKMX7zh+KHR8izNBkJNFDxYvu8p5 
AzdkHq/jmLBIPU5Zbi8he8w5iUjLBlyAbUlklhtaQGkpezPlANHF/2Y6DyX5zmcJ6xrV kzKO0HvvWA7ZMoUvxag0/roXP2YheS9yJ4ZlJlSUk3HR1WkqKj1ksfAQDSqSeTm0ZqfC 6lO8zOPVJ3GQUPNLBbfdc3g7l+IZ8H8VoSfOWZ/5w+FpG4pTzWaF5H1b8qRO0xDuwvqq x66EdQNtFvQykpcR+eA0WzV4J5Y1tdjOzEZcuoLcX/vLD1+WFvhu0wHZHGo+dQxexEQb 5kag== X-Forwarded-Encrypted: i=1; AJvYcCUPguLr7eImgbrMrjN7VUMcggTdufOTpEcNAtjMV4NW566GgQKfpiMxCkKIKaomjDFJZ81eNoWu@vger.kernel.org, AJvYcCWOUN09ZVBKQswmM/2y2n7ScG6KHl9yY8w0dzOd2tCqEvGOstYAPPQNyEtxPXU8xoh1B1DZv+pm1SI=@vger.kernel.org, AJvYcCXSn3yUj7pFVKaAmi8aVyINwzBIoZZupkLqd9owEdhgvuN4S/6LpGA12Xc5hkvSO1gOGLY6D+//t/FbUrIK@vger.kernel.org X-Gm-Message-State: AOJu0YxVmvudE5ANZuXF2u8YzMgGB/waBTQbXWUgT8W5Qouf+EsqXp17 zZHHTinTL5x6qL5TWMqPsNvex3qkpOAXlPNSJk5EaRQfmC5qxZYg X-Gm-Gg: ASbGnctvq+KeN6ppL7IDB8kgPlbu2Z2WtwmPhtkAy61ea1UQes8D0ojCQj7zFBckOHD F0DDx+C9d7mRAHrn6RGvYrFwol93ngUX3QHPR4KcO+6+VRiTXIYUqkCLfC0W+kxH8Yn9qfvgkqG 6fcdAHkEzHzZany4YZyZTdPIPxf3xMXk2s2KaxOmLbx4SMzzBkCpSuzEpQ3tDYX6UfQCfHYG4V+ LtuJ96COFozVk8G+FRGeSBKQNeypz/8UGZuZk9BOBk4CaqZjrZw6mlSh1pkDRxJBT7K94KkXktT 6M2mOYl2B+sOonhp0UTIPlgwZw+dEBI= X-Google-Smtp-Source: AGHT+IFL1V6HRFg2K9+CUxVQ7bHrWgmJ7ZUqXAol+JA4ERhhSJiPwEl2QI+sts6PXu02TZGnUD2BvA== X-Received: by 2002:a05:690c:1b:b0:6fe:bfb7:68bd with SMTP id 00721157ae682-708abd46579mr21773787b3.1.1745969932210; Tue, 29 Apr 2025 16:38:52 -0700 (PDT) Received: from localhost ([2a03:2880:25ff:2::]) by smtp.gmail.com with ESMTPSA id 00721157ae682-708ae068ee9sm753227b3.53.2025.04.29.16.38.51 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 29 Apr 2025 16:38:51 -0700 (PDT) From: Nhat Pham To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org, peterx@redhat.com Subject: [RFC PATCH v2 03/18] swapfile: rearrange freeing steps Date: Tue, 29 Apr 2025 16:38:31 -0700 Message-ID: <20250429233848.3093350-4-nphamcs@gmail.com> X-Mailer: git-send-email 2.47.1 In-Reply-To: <20250429233848.3093350-1-nphamcs@gmail.com> References: <20250429233848.3093350-1-nphamcs@gmail.com> Precedence: bulk X-Mailing-List: linux-pm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 In the swap free path, certain steps (cgroup uncharging and shadow clearing) will be handled at the virtual layer eventually. To facilitate this change, rearrange these functions a bit in their caller. There should not be any functional change. Signed-off-by: Nhat Pham --- mm/swapfile.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 426674d35983..e717d0e7ae6b 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1129,6 +1129,8 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, void (*swap_slot_free_notify)(struct block_device *, unsigned long); unsigned int i; + clear_shadow_from_swap_cache(si->type, begin, end); + /* * Use atomic clear_bit operations only on zeromap instead of non-atomic * bitmap_clear to prevent adjacent bits corruption due to simultaneous writes. 
@@ -1149,7 +1151,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, swap_slot_free_notify(si->bdev, offset); offset++; } - clear_shadow_from_swap_cache(si->type, begin, end); /* * Make sure that try_to_unuse() observes si->inuse_pages reaching 0 @@ -1502,6 +1503,8 @@ static void swap_entry_range_free(struct swap_info_struct *si, unsigned char *map = si->swap_map + offset; unsigned char *map_end = map + nr_pages; + mem_cgroup_uncharge_swap(entry, nr_pages); + /* It should never free entries across different clusters */ VM_BUG_ON(ci != offset_to_cluster(si, offset + nr_pages - 1)); VM_BUG_ON(cluster_is_empty(ci)); @@ -1513,7 +1516,6 @@ static void swap_entry_range_free(struct swap_info_struct *si, *map = 0; } while (++map < map_end); - mem_cgroup_uncharge_swap(entry, nr_pages); swap_range_free(si, offset, nr_pages); if (!ci->count) From patchwork Tue Apr 29 23:38:32 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Nhat Pham X-Patchwork-Id: 886260 Received: from mail-yw1-f171.google.com (mail-yw1-f171.google.com [209.85.128.171]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 52E7E2DCB7A; Tue, 29 Apr 2025 23:38:54 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.171 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745969936; cv=none; b=i0nrcqlnH6cW1UmL2f/nSGuEefq2bfHIHqhx2Dwl32MRf7odirWTNCTQfEssTRUARiNdViJmBUjdQIrQGXtEKlgM81l8oYp9IThV+TxbrNu1/HHnc6kyln2ts1YKtoEeaC8NMn2fT8gLGD7aLbpwC4Ncs/paTsZGexYxu1C4zSI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745969936; c=relaxed/simple; bh=6lu4oF39SEcXcDQcSnvnULT5GeUPSdpjuo9d07J7rEE=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=O/4liRd12lXc3crmj363JLSaMaCFu3SolskvKZ+mX/USnuhBTO07gVtgwQQMBGtdQ9zDYl6i2OOSFwobNvRDtlPJD/529X0RVKNhvpuqlTmoExCsjEEMEK1848aQaimngYktOJyAtDW3qrMpgNc/CvEFK0RLKuZXxZ+zv9gx9z4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=VZHfA0mO; arc=none smtp.client-ip=209.85.128.171 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="VZHfA0mO" Received: by mail-yw1-f171.google.com with SMTP id 00721157ae682-70427fb838cso55256907b3.2; Tue, 29 Apr 2025 16:38:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1745969933; x=1746574733; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=fKE37L6euvcRArGqI2WCdwnU49QgVd0+WJwqbTjSGfU=; b=VZHfA0mOypSbyP/7KjOi+WCRZg2L1pgSEC68x+apyOGN8qCxYma2DlR4vaW1cOo4Tw y3AVg8bNWecMQBnD8gUbs2eASJMF3thfJ117cVZwaljxevHY/zxwtB7f+i+hv1qcseL0 oQSko+Ypgeo2h4Mu9yPJR1khyoH0ZUljfadvSWOqOsevEr0tnhAJ9EaMkiOVYx7d0CJt H354rcmlx4SPFxABVCmLSK1INv6g/D4I/DYIPlN4BLRlhuemLe1oBRc9rhspyOlbsMs2 U+i7cge7YRRnphA9aCE748WjX0I2Dx9cwpnX9WgrK1v8MnOaTGXV8Noa+smZDltpGqPZ 2gLg== X-Google-DKIM-Signature: v=1; 
a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1745969933; x=1746574733; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=fKE37L6euvcRArGqI2WCdwnU49QgVd0+WJwqbTjSGfU=; b=m0nqRcXiV6hzMehGRCl8lgd//E9uTHziDFPf+G/epv5/o6iD74rDbp/8Ps+QKEuCLo +FzrA4krZ/ZvKpnDWI1PNE4wZpHHs/6w429alU1eZjXUvkyqKhACHLFJEcI7XmzOe+mt Igc5nfcgsjTFDSB75QvpLE2q2rG26l7VEa08RvqYAgz42BQxoLoE+HDNnsym4BUXF/gT vI0lJNb3gL/CD4SZl713BScuJQNFQuJCvdTnwr1h3DP0+CSLgPKBzJPZhawlKXVv0tgn 2zgYlzvSSEMvxphFP0CzgdfLFvDmBxW1jVBexbRTO5YpstkGQUkfVvjuQ7Fywopbsbu+ wt1w== X-Forwarded-Encrypted: i=1; AJvYcCU01+Cb9IQvFbE37Mnuw+1AJP9S3ax/ZJ0l/jPvERqy1CG+2U/tpuuAaSptYh9+q1HgJN693VGz@vger.kernel.org, AJvYcCVKNWf/x7xHv7RXHSLeo5nNCYU0rqkpOq3AXppLfUvOMuaoZMb9FJ9VDsjc7VI/OSCJwVTJsd0t6I/4fIVQ@vger.kernel.org, AJvYcCXd3vIfXP6OD0FK8AHAYXku5noTWa2GyaZ6mKzMwIaMCDMKzAiLRbYy00qWr+/2pScn5iaUrevr+V8=@vger.kernel.org X-Gm-Message-State: AOJu0YwwNMWSOmCy2ibGw4yobslEa+QiOpwKb3IYJOdcBxzMFnCu2Hgi WaBLwWDfNa+iAxzpRqH8+u+b3zNSE2XnVmL/tZiPbD8dkV4wiw1K X-Gm-Gg: ASbGncuRtXcR5ZT2AKJ1PJOUCU0ag8bPVEgeAzJT+JT/uLkC9z1H3igZwvXFB+jO0zL 6h4dBj6YK7W4gwu5tE/wEeDTamsT64YqxVY3gTrlGNs1UcFpfhP0oAzSJ9h4po2CMWf8QJWOUOk 3oTCVXUCzCRmJeXa1RsJM+xG29sSo6XKAo8boqKKvdWThyWkLeh2G0VrBGH5YZYs3nOV0Dkjlgl 91ewYKiHgR8afIoyyr1sDrYF7KhUR5Q+hiiw7CH2hp8OITw42bSHv5BM2EC0MLYcn/WK+mJy6OF m0Ze1OuTWTLQf2UMIgLu0JMDTe+mzyU= X-Google-Smtp-Source: AGHT+IFO87ET7RNKrk3UF1PEUh15ufHAa0Xuh+MHK1ieSEIT7Rz22UvQkjilBKaStSwIpkQPF5GxvQ== X-Received: by 2002:a05:690c:7304:b0:6f9:7a3c:1fe with SMTP id 00721157ae682-708abe2046emr20914727b3.23.1745969933102; Tue, 29 Apr 2025 16:38:53 -0700 (PDT) Received: from localhost ([2a03:2880:25ff:4::]) by smtp.gmail.com with ESMTPSA id 00721157ae682-708ae1edd01sm700967b3.115.2025.04.29.16.38.52 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 29 Apr 2025 16:38:52 -0700 (PDT) From: Nhat Pham To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org, peterx@redhat.com Subject: [RFC PATCH v2 04/18] mm: swap: add an abstract API for locking out swapoff Date: Tue, 29 Apr 2025 16:38:32 -0700 Message-ID: <20250429233848.3093350-5-nphamcs@gmail.com> X-Mailer: git-send-email 2.47.1 In-Reply-To: <20250429233848.3093350-1-nphamcs@gmail.com> References: <20250429233848.3093350-1-nphamcs@gmail.com> Precedence: bulk X-Mailing-List: linux-pm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Currently, we get a reference to the backing swap device in order to lock out swapoff and ensure its validity. This does not make sense in the new virtual swap design, especially after the swap backends are decoupled - a swap entry might not have any backing swap device at all, and its backend might change at any time during its lifetime. In preparation for this, abstract away the swapoff locking out behavior into a generic API. 
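As a rough usage sketch, this is the pattern the diff below converts the
callers to (surrounding error handling elided, return value and local
variable names illustrative):

	struct swap_info_struct *si;

	/* Prevent swapoff from happening to us. */
	if (!trylock_swapoff(entry, &si))
		return -EAGAIN;		/* raced with swapoff */

	/* ... safely operate on the swap entry / swap cache ... */

	unlock_swapoff(entry, si);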
Signed-off-by: Nhat Pham --- include/linux/swap.h | 12 ++++++++++++ mm/memory.c | 13 +++++++------ mm/shmem.c | 7 +++---- mm/swap_state.c | 10 ++++------ mm/userfaultfd.c | 11 ++++++----- 5 files changed, 32 insertions(+), 21 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 8b8c10356a5c..23eaf44791d4 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -709,5 +709,17 @@ static inline bool mem_cgroup_swap_full(struct folio *folio) } #endif +static inline bool trylock_swapoff(swp_entry_t entry, + struct swap_info_struct **si) +{ + return get_swap_device(entry); +} + +static inline void unlock_swapoff(swp_entry_t entry, + struct swap_info_struct *si) +{ + put_swap_device(si); +} + #endif /* __KERNEL__*/ #endif /* _LINUX_SWAP_H */ diff --git a/mm/memory.c b/mm/memory.c index fb7b8dc75167..e92914df5ca7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4305,6 +4305,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) struct swap_info_struct *si = NULL; rmap_t rmap_flags = RMAP_NONE; bool need_clear_cache = false; + bool swapoff_locked = false; bool exclusive = false; swp_entry_t entry; pte_t pte; @@ -4365,8 +4366,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) } /* Prevent swapoff from happening to us. */ - si = get_swap_device(entry); - if (unlikely(!si)) + swapoff_locked = trylock_swapoff(entry, &si); + if (unlikely(!swapoff_locked)) goto out; folio = swap_cache_get_folio(entry, vma, vmf->address); @@ -4713,8 +4714,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (waitqueue_active(&swapcache_wq)) wake_up(&swapcache_wq); } - if (si) - put_swap_device(si); + if (swapoff_locked) + unlock_swapoff(entry, si); return ret; out_nomap: if (vmf->pte) @@ -4732,8 +4733,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (waitqueue_active(&swapcache_wq)) wake_up(&swapcache_wq); } - if (si) - put_swap_device(si); + if (swapoff_locked) + unlock_swapoff(entry, si); return ret; } diff --git a/mm/shmem.c b/mm/shmem.c index 1ede0800e846..8ef72dcc592e 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2262,8 +2262,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, if (is_poisoned_swp_entry(swap)) return -EIO; - si = get_swap_device(swap); - if (!si) { + if (!trylock_swapoff(swap, &si)) { if (!shmem_confirm_swap(mapping, index, swap)) return -EEXIST; else @@ -2411,7 +2410,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, } folio_mark_dirty(folio); swap_free_nr(swap, nr_pages); - put_swap_device(si); + unlock_swapoff(swap, si); *foliop = folio; return 0; @@ -2428,7 +2427,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, folio_unlock(folio); folio_put(folio); } - put_swap_device(si); + unlock_swapoff(swap, si); return error; } diff --git a/mm/swap_state.c b/mm/swap_state.c index ca42b2be64d9..81f69b2df550 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -419,12 +419,11 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping, if (non_swap_entry(swp)) return ERR_PTR(-ENOENT); /* Prevent swapoff from happening to us */ - si = get_swap_device(swp); - if (!si) + if (!trylock_swapoff(swp, &si)) return ERR_PTR(-ENOENT); index = swap_cache_index(swp); folio = filemap_get_folio(swap_address_space(swp), index); - put_swap_device(si); + unlock_swapoff(swp, si); return folio; } @@ -439,8 +438,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, void *shadow = NULL; *new_page_allocated = false; - si = get_swap_device(entry); - if (!si) + if (!trylock_swapoff(entry, &si)) return NULL; 
for (;;) { @@ -538,7 +536,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, put_swap_folio(new_folio, entry); folio_unlock(new_folio); put_and_return: - put_swap_device(si); + unlock_swapoff(entry, si); if (!(*new_page_allocated) && new_folio) folio_put(new_folio); return result; diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index d06453fa8aba..f40bbfd09fd5 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -1161,6 +1161,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, struct folio *src_folio = NULL; struct anon_vma *src_anon_vma = NULL; struct mmu_notifier_range range; + bool swapoff_locked = false; int err = 0; flush_cache_range(src_vma, src_addr, src_addr + PAGE_SIZE); @@ -1367,8 +1368,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, goto out; } - si = get_swap_device(entry); - if (unlikely(!si)) { + swapoff_locked = trylock_swapoff(entry, &si); + if (unlikely(!swapoff_locked)) { err = -EAGAIN; goto out; } @@ -1399,7 +1400,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pte_unmap(src_pte); pte_unmap(dst_pte); src_pte = dst_pte = NULL; - put_swap_device(si); + unlock_swapoff(entry, si); si = NULL; /* now we can block and wait */ folio_lock(src_folio); @@ -1425,8 +1426,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, if (src_pte) pte_unmap(src_pte); mmu_notifier_invalidate_range_end(&range); - if (si) - put_swap_device(si); + if (swapoff_locked) + unlock_swapoff(entry, si); return err; } From patchwork Tue Apr 29 23:38:33 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Nhat Pham X-Patchwork-Id: 885986 Received: from mail-yw1-f170.google.com (mail-yw1-f170.google.com [209.85.128.170]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B53382DDD0C; Tue, 29 Apr 2025 23:38:55 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.170 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745969939; cv=none; b=Qsi7Xp9lw9lZxKjyz88sNH1zGjf4Tri4NVVd1gqOZNLTk7gjOgGevSbpUxowkKmAjMXbH/GRlv+Gmrw8iERcfeM7D+xu8trKvGZrwHZTxikqRJh3+8jHCnf/UfK0hjACW7ufuxK+Rzkh8M02mhZxKTKEE58f/33yQFEni8kRjjA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745969939; c=relaxed/simple; bh=4HyKWITlGYt8qr7JDNo6yxXyPt4Ozc+jc00C/mDZVmc=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=bPpUBD5Md8ZM1SYwjbsMfFXx+R+BbGh2I1ksaHioID8qT5wxuuELjg2I5hZyTQScqM2PpYcN00fpeRe8F5WPyjvf1Z+gizphp+4JyAUvfvjNW01KCXQL3WPJldfwyYAMpnpdesEGtaV3bQ6Qz7uh56oncQJ4iyiFDQoqr2cuucQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=LihfACIi; arc=none smtp.client-ip=209.85.128.170 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="LihfACIi" Received: by mail-yw1-f170.google.com with SMTP id 00721157ae682-702628e34f2so4274847b3.0; Tue, 29 Apr 2025 16:38:55 -0700 (PDT) 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1745969934; x=1746574734; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=7SIdCJDJPYJXeNx7PuyzmnTxAetdDtE+G8IdMd89kY0=; b=LihfACIids2tbWejvdJVKaZtE6M19GxMBQl7bFUZxcNxTV9TTXCEj+g2hLU5EIurcD FUa1qQEvHXZEJ8k+ROB0bu+sgqs+/tD+eTS96QhSwTM5dq6Un/jEI3hsyJ9HtLC4LakV x6ktPjdmprmiSXrtV2zq8Igg/NEXjNVPpGBw87NmC2Rq64LkvDTUb68gV/pziZLOgkkv PPu17KLU7tc8ObMjuwkgu57kUhm1tuf3Xy3eYZR02usvpXpzLTa1lvyjbM+mcB4+f6uU NMm6RxEyWFpjDZp+QTsNVAr/PLEv3y7gAgTwsSmEvPHgsyfLejaOypmH/VkCZR0XLf85 Nhzw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1745969934; x=1746574734; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=7SIdCJDJPYJXeNx7PuyzmnTxAetdDtE+G8IdMd89kY0=; b=X0+KOKab0XlWa3ouvs7Z726BOkIQvPNyDJbYNe8myGiQWCqr1DmpB0D/uddypYW/VM cVOalSK+0SNBkhxBng1hfIBv8gIYRDO2A9z+6xKAtxerfrh4opWD5BWuREMl+yash4dA FhqhB3vLlVQ5iVr+boIUyWZayPOYe+ZFW2BHrvYhDEqip+UoeYVbzbTSczg5iH35vFct M1E6KHQUf2TfvMwLtc/6KxQ8jK+Zf+wHIQ4HcEbO/vUf+RKMKw1U7NiskVddmQT7YNHc sK+dEHZTro1ugQci894Sy8wUfdIXg4QwmI+B4wgDrgoUWtZa3Q+TtB2YSSPoIRswABeg Ts3w== X-Forwarded-Encrypted: i=1; AJvYcCVbq9rafqj2o4rUi01G84rF1uGU5aUol1SKCekQCRFVnzWCXAa0A3ufn4G8mTeoK4+MDEdx8aFB5bQ=@vger.kernel.org, AJvYcCVvf3IfG4BeVGcwakYw7Y//luWCRFKP/LxYwmmJCpkdHLv4sf83gpf6XYdse/Q3NAnxZTtja6OI@vger.kernel.org, AJvYcCWH7BoIxFwQHj2PY1/VL8rWBNXlHuwllBb3tzYxk9po9kmqMO6Z9IhUtc81CZI7oTAWjVXbyWZ+lV4JDfPt@vger.kernel.org X-Gm-Message-State: AOJu0Yy2z2oLCFZ55PXwkybz9ECFnwXu4wR9AVKL+qIMoqecBEUUl+iK h+379A+ngUQABdMX5Da5xqFkO4qCGDtDzDCh0ubaBS+HILPMjcgM X-Gm-Gg: ASbGnct/D02y6AE/8jvyP26l+Cdk4vLi6DKlYvdary8Ndtg8E8QdkzGLoByr2nulgPL yu8ZQiFqKDzV1+T8Fa9PSS/Yj/q88pMp4pipddZFIXWbITKlLk4cziq/63mVTVpFO3ep1wBJJxN CS+Bewz9nkUZVUZtHBC1fgiiVyfDjSFt5hUkZV41as+/maD7KFzJt1vn9KweYOvkfOTxgqCsUXC hkJZKtnGGF6VH9oHj084WxE7dxXHMVsSfSRu6Hls271tKjEjANKLPRzZKJjNXCRmDGG9ZwsmChu DyMePnAF7tJPCD0/ydlXA+etwci76yk= X-Google-Smtp-Source: AGHT+IEWDDTWRmKfx8ivSPB7FsCq0qlw42CTNFJ6Dwf7KJwjFOdoPs52vwTrf07sDZ2inZNrSTv9lA== X-Received: by 2002:a05:690c:6e0b:b0:703:b296:7897 with SMTP id 00721157ae682-708ad0943d8mr11509767b3.6.1745969934320; Tue, 29 Apr 2025 16:38:54 -0700 (PDT) Received: from localhost ([2a03:2880:25ff:5::]) by smtp.gmail.com with ESMTPSA id 00721157ae682-708ae0695f4sm755497b3.61.2025.04.29.16.38.53 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 29 Apr 2025 16:38:53 -0700 (PDT) From: Nhat Pham To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org, peterx@redhat.com Subject: [RFC PATCH v2 05/18] mm: swap: add a separate type for physical swap slots Date: Tue, 29 Apr 2025 16:38:33 -0700 Message-ID: <20250429233848.3093350-6-nphamcs@gmail.com> X-Mailer: git-send-email 2.47.1 In-Reply-To: <20250429233848.3093350-1-nphamcs@gmail.com> 
References: <20250429233848.3093350-1-nphamcs@gmail.com>

In preparation for swap virtualization, add a new type to represent the
physical swap slots of a swapfile. This allows us to separate:

1. The logical view of the swap entry (i.e. what is stored in page table
   entries and used to index into the swap cache), represented by the old
   swp_entry_t type.

from:

2. Its physical backing state (i.e. the actual backing slot on the swap
   device), represented by the new swp_slot_t type.

The functions that operate at the physical level (i.e. on swp_slot_t
values) are also renamed where appropriate (e.g. prefixed with
swp_slot_*).

Note that there is no behavioral change yet - the mapping between the two
types is the identity mapping.

In later patches, we shall dynamically allocate a virtual swap slot (of
type swp_entry_t) for each swapped-out page to store in the page table
entry, and associate it with a backing store. A physical swap slot (i.e.
a slot on a physical swap device) is one of the backing options.

Signed-off-by: Nhat Pham
---
 include/linux/mm_types.h   |   7 ++
 include/linux/swap.h       |  70 +++++++++--
 include/linux/swap_slots.h |   2 +-
 include/linux/swapops.h    |  25 ++++
 kernel/power/swap.c        |   6 +-
 mm/internal.h              |  10 +-
 mm/memory.c                |   7 +-
 mm/page_io.c               |  33 +++--
 mm/shmem.c                 |  21 +++-
 mm/swap.h                  |  17 +--
 mm/swap_cgroup.c           |  10 +-
 mm/swap_slots.c            |  28 ++---
 mm/swap_state.c            |  28 +++--
 mm/swapfile.c              | 243 ++++++++++++++++++++-----------------
 14 files changed, 324 insertions(+), 183 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0234f14f2aa6..7d93bb2c3dae 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -283,6 +283,13 @@ typedef struct {
 	unsigned long val;
 } swp_entry_t;
 
+/*
+ * Physical (i.e disk-based) swap slot handle.
+ */
+typedef struct {
+	unsigned long val;
+} swp_slot_t;
+
 /**
  * struct folio - Represents a contiguous set of bytes.
  * @flags: Identical to the page flags.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 23eaf44791d4..567fd2ebb0d3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -277,7 +277,7 @@ enum swap_cluster_flags {
  * cluster to which it belongs being marked free. Therefore 0 is safe to use as
  * a sentinel to indicate an entry is not valid.
*/ -#define SWAP_ENTRY_INVALID 0 +#define SWAP_SLOT_INVALID 0 #ifdef CONFIG_THP_SWAP #define SWAP_NR_ORDERS (PMD_ORDER + 1) @@ -471,12 +471,16 @@ static inline unsigned long total_swapcache_pages(void) { return global_node_page_state(NR_SWAPCACHE); } + void free_page_and_swap_cache(struct page *); void free_pages_and_swap_cache(struct encoded_page **, int); void free_swap_cache(struct folio *folio); int init_swap_address_space(unsigned int type, unsigned long nr_pages); void exit_swap_address_space(unsigned int type); +/* Swap slot cache API (mm/swap_slot.c) */ +swp_slot_t folio_alloc_swap_slot(struct folio *folio); + /* Physical swap allocator and swap device API (mm/swapfile.c) */ int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page, unsigned long nr_pages, sector_t start_block); @@ -500,36 +504,37 @@ static inline long get_nr_swap_pages(void) } void si_swapinfo(struct sysinfo *); -swp_entry_t get_swap_page_of_type(int); -int get_swap_pages(int n, swp_entry_t swp_entries[], int order); +swp_slot_t swap_slot_alloc_of_type(int); +int swap_slot_alloc(int n, swp_slot_t swp_slots[], int order); +void swap_slot_free_nr(swp_slot_t slot, int nr_pages); int add_swap_count_continuation(swp_entry_t, gfp_t); -void swapcache_free_entries(swp_entry_t *entries, int n); +void swap_slot_cache_free_slots(swp_slot_t *slots, int n); int swap_type_of(dev_t device, sector_t offset); int find_first_swap(dev_t *device); unsigned int count_swap_pages(int, int); sector_t swapdev_block(int, pgoff_t); -struct swap_info_struct *swp_swap_info(swp_entry_t entry); +struct swap_info_struct *swap_slot_swap_info(swp_slot_t slot); struct backing_dev_info; -struct swap_info_struct *get_swap_device(swp_entry_t entry); +struct swap_info_struct *swap_slot_tryget_swap_info(swp_slot_t slot); sector_t swap_folio_sector(struct folio *folio); -static inline void put_swap_device(struct swap_info_struct *si) +static inline void swap_slot_put_swap_info(struct swap_info_struct *si) { percpu_ref_put(&si->users); } #else /* CONFIG_SWAP */ -static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry) +static inline struct swap_info_struct *swap_slot_swap_info(swp_slot_t slot) { return NULL; } -static inline struct swap_info_struct *get_swap_device(swp_entry_t entry) +static inline struct swap_info_struct *swap_slot_tryget_swap_info(swp_slot_t slot) { return NULL; } -static inline void put_swap_device(struct swap_info_struct *si) +static inline void swap_slot_put_swap_info(struct swap_info_struct *si) { } @@ -578,7 +583,7 @@ static inline void swap_free_nr(swp_entry_t entry, int nr_pages) { } -static inline void put_swap_folio(struct folio *folio, swp_entry_t swp) +static inline void put_swap_folio(struct folio *folio, swp_entry_t entry) { } @@ -609,12 +614,24 @@ static inline bool folio_free_swap(struct folio *folio) return false; } +static inline swp_slot_t folio_alloc_swap_slot(struct folio *folio) +{ + swp_slot_t slot; + + slot.val = 0; + return slot; +} + static inline int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page, unsigned long nr_pages, sector_t start_block) { return -EINVAL; } + +static inline void swap_slot_free_nr(swp_slot_t slot, int nr_pages) +{ +} #endif /* CONFIG_SWAP */ static inline void free_swap_and_cache(swp_entry_t entry) @@ -709,16 +726,43 @@ static inline bool mem_cgroup_swap_full(struct folio *folio) } #endif +/** + * swp_entry_to_swp_slot - look up the physical swap slot corresponding to a + * virtual swap slot. + * @entry: the virtual swap slot. 
+ * + * Return: the physical swap slot corresponding to the virtual swap slot. + */ +static inline swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry) +{ + return (swp_slot_t) { entry.val }; +} + +/** + * swp_slot_to_swp_entry - look up the virtual swap slot corresponding to a + * physical swap slot. + * @slot: the physical swap slot. + * + * Return: the virtual swap slot corresponding to the physical swap slot. + */ +static inline swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot) +{ + return (swp_entry_t) { slot.val }; +} + static inline bool trylock_swapoff(swp_entry_t entry, struct swap_info_struct **si) { - return get_swap_device(entry); + swp_slot_t slot = swp_entry_to_swp_slot(entry); + + *si = swap_slot_tryget_swap_info(slot); + return *si; } static inline void unlock_swapoff(swp_entry_t entry, struct swap_info_struct *si) { - put_swap_device(si); + swap_slot_put_swap_info(si); } #endif /* __KERNEL__*/ diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h index 840aec3523b2..1ac926d46389 100644 --- a/include/linux/swap_slots.h +++ b/include/linux/swap_slots.h @@ -13,7 +13,7 @@ struct swap_slots_cache { bool lock_initialized; struct mutex alloc_lock; /* protects slots, nr, cur */ - swp_entry_t *slots; + swp_slot_t *slots; int nr; int cur; int n_ret; diff --git a/include/linux/swapops.h b/include/linux/swapops.h index 96f26e29fefe..2a4101c9bba4 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -618,5 +618,30 @@ static inline int non_swap_entry(swp_entry_t entry) return swp_type(entry) >= MAX_SWAPFILES; } +/* Physical swap slots operations */ + +/* + * Store a swap device type + offset into a swp_slot_t handle. + */ +static inline swp_slot_t swp_slot(unsigned long type, pgoff_t offset) +{ + swp_slot_t ret; + + ret.val = (type << SWP_TYPE_SHIFT) | (offset & SWP_OFFSET_MASK); + return ret; +} + +/* Extract the `type' field from a swp_slot_t. */ +static inline unsigned swp_slot_type(swp_slot_t slot) +{ + return (slot.val >> SWP_TYPE_SHIFT); +} + +/* Extract the `offset' field from a swp_slot_t. 
*/ +static inline pgoff_t swp_slot_offset(swp_slot_t slot) +{ + return slot.val & SWP_OFFSET_MASK; +} + #endif /* CONFIG_MMU */ #endif /* _LINUX_SWAPOPS_H */ diff --git a/kernel/power/swap.c b/kernel/power/swap.c index 82b884b67152..32b236a81dbb 100644 --- a/kernel/power/swap.c +++ b/kernel/power/swap.c @@ -178,10 +178,10 @@ sector_t alloc_swapdev_block(int swap) { unsigned long offset; - offset = swp_offset(get_swap_page_of_type(swap)); + offset = swp_slot_offset(swap_slot_alloc_of_type(swap)); if (offset) { if (swsusp_extents_insert(offset)) - swap_free(swp_entry(swap, offset)); + swap_slot_free_nr(swp_slot(swap, offset), 1); else return swapdev_block(swap, offset); } @@ -203,7 +203,7 @@ void free_all_swap_pages(int swap) ext = rb_entry(node, struct swsusp_extent, node); rb_erase(node, &swsusp_extents); - swap_free_nr(swp_entry(swap, ext->start), + swap_slot_free_nr(swp_slot(swap, ext->start), ext->end - ext->start + 1); kfree(ext); diff --git a/mm/internal.h b/mm/internal.h index 20b3535935a3..2d63f6537e35 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -275,9 +275,13 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, */ static inline pte_t pte_move_swp_offset(pte_t pte, long delta) { - swp_entry_t entry = pte_to_swp_entry(pte); - pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry), - (swp_offset(entry) + delta))); + swp_entry_t entry = pte_to_swp_entry(pte), new_entry; + swp_slot_t slot = swp_entry_to_swp_slot(entry); + pte_t new; + + new_entry = swp_slot_to_swp_entry(swp_slot(swp_slot_type(slot), + swp_slot_offset(slot) + delta)); + new = swp_entry_to_pte(new_entry); if (pte_swp_soft_dirty(pte)) new = pte_swp_mksoft_dirty(new); diff --git a/mm/memory.c b/mm/memory.c index e92914df5ca7..c44e845b5320 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4125,8 +4125,9 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf) #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline int non_swapcache_batch(swp_entry_t entry, int max_nr) { - struct swap_info_struct *si = swp_swap_info(entry); - pgoff_t offset = swp_offset(entry); + swp_slot_t slot = swp_entry_to_swp_slot(entry); + struct swap_info_struct *si = swap_slot_swap_info(slot); + pgoff_t offset = swp_slot_offset(slot); int i; /* @@ -4308,6 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) bool swapoff_locked = false; bool exclusive = false; swp_entry_t entry; + swp_slot_t slot; pte_t pte; vm_fault_t ret = 0; void *shadow = NULL; @@ -4369,6 +4371,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) swapoff_locked = trylock_swapoff(entry, &si); if (unlikely(!swapoff_locked)) goto out; + slot = swp_entry_to_swp_slot(entry); folio = swap_cache_get_folio(entry, vma, vmf->address); if (folio) diff --git a/mm/page_io.c b/mm/page_io.c index 9b983de351f9..182851c47f43 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -204,14 +204,17 @@ static bool is_folio_zero_filled(struct folio *folio) static void swap_zeromap_folio_set(struct folio *folio) { struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio); - struct swap_info_struct *sis = swp_swap_info(folio->swap); + struct swap_info_struct *sis = + swap_slot_swap_info(swp_entry_to_swp_slot(folio->swap)); int nr_pages = folio_nr_pages(folio); swp_entry_t entry; + swp_slot_t slot; unsigned int i; for (i = 0; i < folio_nr_pages(folio); i++) { entry = page_swap_entry(folio_page(folio, i)); - set_bit(swp_offset(entry), sis->zeromap); + slot = swp_entry_to_swp_slot(entry); + set_bit(swp_slot_offset(slot), sis->zeromap); } count_vm_events(SWPOUT_ZERO, nr_pages); @@ 
-223,13 +226,16 @@ static void swap_zeromap_folio_set(struct folio *folio) static void swap_zeromap_folio_clear(struct folio *folio) { - struct swap_info_struct *sis = swp_swap_info(folio->swap); + struct swap_info_struct *sis = + swap_slot_swap_info(swp_entry_to_swp_slot(folio->swap)); swp_entry_t entry; + swp_slot_t slot; unsigned int i; for (i = 0; i < folio_nr_pages(folio); i++) { entry = page_swap_entry(folio_page(folio, i)); - clear_bit(swp_offset(entry), sis->zeromap); + slot = swp_entry_to_swp_slot(entry); + clear_bit(swp_slot_offset(slot), sis->zeromap); } } @@ -358,7 +364,8 @@ static void sio_write_complete(struct kiocb *iocb, long ret) * messages. */ pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n", - ret, swap_dev_pos(page_swap_entry(page))); + ret, + swap_slot_pos(swp_entry_to_swp_slot(page_swap_entry(page)))); for (p = 0; p < sio->pages; p++) { page = sio->bvec[p].bv_page; set_page_dirty(page); @@ -374,10 +381,11 @@ static void sio_write_complete(struct kiocb *iocb, long ret) static void swap_writepage_fs(struct folio *folio, struct writeback_control *wbc) { + swp_slot_t slot = swp_entry_to_swp_slot(folio->swap); struct swap_iocb *sio = NULL; - struct swap_info_struct *sis = swp_swap_info(folio->swap); + struct swap_info_struct *sis = swap_slot_swap_info(slot); struct file *swap_file = sis->swap_file; - loff_t pos = swap_dev_pos(folio->swap); + loff_t pos = swap_slot_pos(slot); count_swpout_vm_event(folio); folio_start_writeback(folio); @@ -452,7 +460,8 @@ static void swap_writepage_bdev_async(struct folio *folio, void __swap_writepage(struct folio *folio, struct writeback_control *wbc) { - struct swap_info_struct *sis = swp_swap_info(folio->swap); + struct swap_info_struct *sis = + swap_slot_swap_info(swp_entry_to_swp_slot(folio->swap)); VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio); /* @@ -543,9 +552,10 @@ static bool swap_read_folio_zeromap(struct folio *folio) static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug) { - struct swap_info_struct *sis = swp_swap_info(folio->swap); + swp_slot_t slot = swp_entry_to_swp_slot(folio->swap); + struct swap_info_struct *sis = swap_slot_swap_info(slot); struct swap_iocb *sio = NULL; - loff_t pos = swap_dev_pos(folio->swap); + loff_t pos = swap_slot_pos(slot); if (plug) sio = *plug; @@ -614,7 +624,8 @@ static void swap_read_folio_bdev_async(struct folio *folio, void swap_read_folio(struct folio *folio, struct swap_iocb **plug) { - struct swap_info_struct *sis = swp_swap_info(folio->swap); + struct swap_info_struct *sis = + swap_slot_swap_info(swp_entry_to_swp_slot(folio->swap)); bool synchronous = sis->flags & SWP_SYNCHRONOUS_IO; bool workingset = folio_test_workingset(folio); unsigned long pflags; diff --git a/mm/shmem.c b/mm/shmem.c index 8ef72dcc592e..f8efa49eb499 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1387,6 +1387,7 @@ static int shmem_find_swap_entries(struct address_space *mapping, XA_STATE(xas, &mapping->i_pages, start); struct folio *folio; swp_entry_t entry; + swp_slot_t slot; rcu_read_lock(); xas_for_each(&xas, folio, ULONG_MAX) { @@ -1397,11 +1398,13 @@ static int shmem_find_swap_entries(struct address_space *mapping, continue; entry = radix_to_swp_entry(folio); + slot = swp_entry_to_swp_slot(entry); + /* * swapin error entries can be found in the mapping. But they're * deliberately ignored here as we've done everything we can do. 
*/ - if (swp_type(entry) != type) + if (swp_slot_type(slot) != type) continue; indices[folio_batch_count(fbatch)] = xas.xa_index; @@ -1619,7 +1622,6 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) if (!swap.val) { if (nr_pages > 1) goto try_split; - goto redirty; } @@ -2164,6 +2166,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index, XA_STATE_ORDER(xas, &mapping->i_pages, index, 0); void *alloced_shadow = NULL; int alloced_order = 0, i; + swp_slot_t slot = swp_entry_to_swp_slot(swap); /* Convert user data gfp flags to xarray node gfp flags */ gfp &= GFP_RECLAIM_MASK; @@ -2202,11 +2205,14 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index, */ for (i = 0; i < 1 << order; i++) { pgoff_t aligned_index = round_down(index, 1 << order); - swp_entry_t tmp; + swp_entry_t tmp_entry; + swp_slot_t tmp_slot; - tmp = swp_entry(swp_type(swap), swp_offset(swap) + i); + tmp_slot = + swp_slot(swp_slot_type(slot), swp_slot_offset(slot) + i); + tmp_entry = swp_slot_to_swp_entry(tmp_slot); __xa_store(&mapping->i_pages, aligned_index + i, - swp_to_radix_entry(tmp), 0); + swp_to_radix_entry(tmp_entry), 0); } } @@ -2253,10 +2259,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, struct folio *folio = NULL; bool skip_swapcache = false; swp_entry_t swap; + swp_slot_t slot; int error, nr_pages, order, split_order; VM_BUG_ON(!*foliop || !xa_is_value(*foliop)); swap = radix_to_swp_entry(*foliop); + slot = swp_entry_to_swp_slot(swap); *foliop = NULL; if (is_poisoned_swp_entry(swap)) @@ -2328,7 +2336,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, if (split_order > 0) { pgoff_t offset = index - round_down(index, 1 << split_order); - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset); + swap = swp_slot_to_swp_entry(swp_slot( + swp_slot_type(slot), swp_slot_offset(slot) + offset)); } /* Here we actually start the io */ diff --git a/mm/swap.h b/mm/swap.h index ad2f121de970..d5f8effa8015 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -32,12 +32,10 @@ extern struct address_space *swapper_spaces[]; (&swapper_spaces[swp_type(entry)][swp_offset(entry) \ >> SWAP_ADDRESS_SPACE_SHIFT]) -/* - * Return the swap device position of the swap entry. - */ -static inline loff_t swap_dev_pos(swp_entry_t entry) +/* Return the swap device position of the swap slot. 
*/ +static inline loff_t swap_slot_pos(swp_slot_t slot) { - return ((loff_t)swp_offset(entry)) << PAGE_SHIFT; + return ((loff_t)swp_slot_offset(slot)) << PAGE_SHIFT; } /* @@ -78,7 +76,9 @@ struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag, static inline unsigned int folio_swap_flags(struct folio *folio) { - return swp_swap_info(folio->swap)->flags; + swp_slot_t swp_slot = swp_entry_to_swp_slot(folio->swap); + + return swap_slot_swap_info(swp_slot)->flags; } /* @@ -89,8 +89,9 @@ static inline unsigned int folio_swap_flags(struct folio *folio) static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr, bool *is_zeromap) { - struct swap_info_struct *sis = swp_swap_info(entry); - unsigned long start = swp_offset(entry); + swp_slot_t slot = swp_entry_to_swp_slot(entry); + struct swap_info_struct *sis = swap_slot_swap_info(slot); + unsigned long start = swp_slot_offset(slot); unsigned long end = start + max_nr; bool first_bit; diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c index 1007c30f12e2..5e4c91d694a0 100644 --- a/mm/swap_cgroup.c +++ b/mm/swap_cgroup.c @@ -65,11 +65,12 @@ void swap_cgroup_record(struct folio *folio, unsigned short id, swp_entry_t ent) { unsigned int nr_ents = folio_nr_pages(folio); + swp_slot_t slot = swp_entry_to_swp_slot(ent); struct swap_cgroup *map; pgoff_t offset, end; unsigned short old; - offset = swp_offset(ent); + offset = swp_slot_offset(slot); end = offset + nr_ents; map = swap_cgroup_ctrl[swp_type(ent)].map; @@ -92,12 +93,12 @@ void swap_cgroup_record(struct folio *folio, unsigned short id, */ unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents) { - pgoff_t offset = swp_offset(ent); + swp_slot_t slot = swp_entry_to_swp_slot(ent); + pgoff_t offset = swp_slot_offset(slot); pgoff_t end = offset + nr_ents; struct swap_cgroup *map; unsigned short old, iter = 0; - offset = swp_offset(ent); end = offset + nr_ents; map = swap_cgroup_ctrl[swp_type(ent)].map; @@ -120,12 +121,13 @@ unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents) unsigned short lookup_swap_cgroup_id(swp_entry_t ent) { struct swap_cgroup_ctrl *ctrl; + swp_slot_t slot = swp_entry_to_swp_slot(ent); if (mem_cgroup_disabled()) return 0; ctrl = &swap_cgroup_ctrl[swp_type(ent)]; - return __swap_cgroup_id_lookup(ctrl->map, swp_offset(ent)); + return __swap_cgroup_id_lookup(ctrl->map, swp_slot_offset(slot)); } int swap_cgroup_swapon(int type, unsigned long max_pages) diff --git a/mm/swap_slots.c b/mm/swap_slots.c index 9c7c171df7ba..4ec2de0c2756 100644 --- a/mm/swap_slots.c +++ b/mm/swap_slots.c @@ -111,14 +111,14 @@ static bool check_cache_active(void) static int alloc_swap_slot_cache(unsigned int cpu) { struct swap_slots_cache *cache; - swp_entry_t *slots; + swp_slot_t *slots; /* * Do allocation outside swap_slots_cache_mutex * as kvzalloc could trigger reclaim and folio_alloc_swap, * which can lock swap_slots_cache_mutex. 
*/ - slots = kvcalloc(SWAP_SLOTS_CACHE_SIZE, sizeof(swp_entry_t), + slots = kvcalloc(SWAP_SLOTS_CACHE_SIZE, sizeof(swp_slot_t), GFP_KERNEL); if (!slots) return -ENOMEM; @@ -160,7 +160,7 @@ static void drain_slots_cache_cpu(unsigned int cpu, bool free_slots) cache = &per_cpu(swp_slots, cpu); if (cache->slots) { mutex_lock(&cache->alloc_lock); - swapcache_free_entries(cache->slots + cache->cur, cache->nr); + swap_slot_cache_free_slots(cache->slots + cache->cur, cache->nr); cache->cur = 0; cache->nr = 0; if (free_slots && cache->slots) { @@ -238,22 +238,22 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache) cache->cur = 0; if (swap_slot_cache_active) - cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE, + cache->nr = swap_slot_alloc(SWAP_SLOTS_CACHE_SIZE, cache->slots, 0); return cache->nr; } -swp_entry_t folio_alloc_swap(struct folio *folio) +swp_slot_t folio_alloc_swap_slot(struct folio *folio) { - swp_entry_t entry; + swp_slot_t slot; struct swap_slots_cache *cache; - entry.val = 0; + slot.val = 0; if (folio_test_large(folio)) { if (IS_ENABLED(CONFIG_THP_SWAP)) - get_swap_pages(1, &entry, folio_order(folio)); + swap_slot_alloc(1, &slot, folio_order(folio)); goto out; } @@ -273,7 +273,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio) if (cache->slots) { repeat: if (cache->nr) { - entry = cache->slots[cache->cur]; + slot = cache->slots[cache->cur]; cache->slots[cache->cur++].val = 0; cache->nr--; } else if (refill_swap_slots_cache(cache)) { @@ -281,15 +281,11 @@ swp_entry_t folio_alloc_swap(struct folio *folio) } } mutex_unlock(&cache->alloc_lock); - if (entry.val) + if (slot.val) goto out; } - get_swap_pages(1, &entry, 0); + swap_slot_alloc(1, &slot, 0); out: - if (mem_cgroup_try_charge_swap(folio, entry)) { - put_swap_folio(folio, entry); - entry.val = 0; - } - return entry; + return slot; } diff --git a/mm/swap_state.c b/mm/swap_state.c index 81f69b2df550..cbd1532b6b24 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -167,6 +167,19 @@ void __delete_from_swap_cache(struct folio *folio, __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr); } +swp_entry_t folio_alloc_swap(struct folio *folio) +{ + swp_slot_t slot = folio_alloc_swap_slot(folio); + swp_entry_t entry = swp_slot_to_swp_entry(slot); + + if (entry.val && mem_cgroup_try_charge_swap(folio, entry)) { + put_swap_folio(folio, entry); + entry.val = 0; + } + + return entry; +} + /** * add_to_swap - allocate swap space for a folio * @folio: folio we want to move to swap @@ -548,8 +561,8 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, * A failure return means that either the page allocation failed or that * the swap entry is no longer in use. * - * get/put_swap_device() aren't needed to call this function, because - * __read_swap_cache_async() call them and swap_read_folio() holds the + * swap_slot_(tryget|put)_swap_info() aren't needed to call this function, + * because __read_swap_cache_async() call them and swap_read_folio() holds the * swap cache folio lock. 
*/ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, @@ -654,11 +667,12 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, struct mempolicy *mpol, pgoff_t ilx) { struct folio *folio; - unsigned long entry_offset = swp_offset(entry); - unsigned long offset = entry_offset; + swp_slot_t slot = swp_entry_to_swp_slot(entry); + unsigned long slot_offset = swp_slot_offset(slot); + unsigned long offset = slot_offset; unsigned long start_offset, end_offset; unsigned long mask; - struct swap_info_struct *si = swp_swap_info(entry); + struct swap_info_struct *si = swap_slot_swap_info(slot); struct blk_plug plug; struct swap_iocb *splug = NULL; bool page_allocated; @@ -679,13 +693,13 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, for (offset = start_offset; offset <= end_offset ; offset++) { /* Ok, do the async read-ahead now */ folio = __read_swap_cache_async( - swp_entry(swp_type(entry), offset), + swp_slot_to_swp_entry(swp_slot(swp_slot_type(slot), offset)), gfp_mask, mpol, ilx, &page_allocated, false); if (!folio) continue; if (page_allocated) { swap_read_folio(folio, &splug); - if (offset != entry_offset) { + if (offset != slot_offset) { folio_set_readahead(folio); count_vm_event(SWAP_RA); } diff --git a/mm/swapfile.c b/mm/swapfile.c index e717d0e7ae6b..17cbf14bac72 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -53,9 +53,9 @@ static bool swap_count_continued(struct swap_info_struct *, pgoff_t, unsigned char); static void free_swap_count_continuations(struct swap_info_struct *); -static void swap_entry_range_free(struct swap_info_struct *si, +static void swap_slot_range_free(struct swap_info_struct *si, struct swap_cluster_info *ci, - swp_entry_t entry, unsigned int nr_pages); + swp_slot_t slot, unsigned int nr_pages); static void swap_range_alloc(struct swap_info_struct *si, unsigned int nr_entries); static bool folio_swapcache_freeable(struct folio *folio); @@ -203,7 +203,8 @@ static bool swap_is_last_map(struct swap_info_struct *si, static int __try_to_reclaim_swap(struct swap_info_struct *si, unsigned long offset, unsigned long flags) { - swp_entry_t entry = swp_entry(si->type, offset); + swp_entry_t entry = swp_slot_to_swp_entry(swp_slot(si->type, offset)); + swp_slot_t slot; struct address_space *address_space = swap_address_space(entry); struct swap_cluster_info *ci; struct folio *folio; @@ -229,7 +230,8 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si, /* offset could point to the middle of a large folio */ entry = folio->swap; - offset = swp_offset(entry); + slot = swp_entry_to_swp_slot(entry); + offset = swp_slot_offset(slot); need_reclaim = ((flags & TTRS_ANYWAY) || ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) || @@ -263,7 +265,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si, folio_set_dirty(folio); ci = lock_cluster(si, offset); - swap_entry_range_free(si, ci, entry, nr_pages); + swap_slot_range_free(si, ci, slot, nr_pages); unlock_cluster(ci); ret = nr_pages; out_unlock: @@ -344,12 +346,12 @@ offset_to_swap_extent(struct swap_info_struct *sis, unsigned long offset) sector_t swap_folio_sector(struct folio *folio) { - struct swap_info_struct *sis = swp_swap_info(folio->swap); + swp_slot_t slot = swp_entry_to_swp_slot(folio->swap); + struct swap_info_struct *sis = swap_slot_swap_info(slot); struct swap_extent *se; sector_t sector; - pgoff_t offset; + pgoff_t offset = swp_slot_offset(slot); - offset = swp_offset(folio->swap); se = offset_to_swap_extent(sis, offset); sector = 
se->start_block + (offset - se->start_page); return sector << (PAGE_SHIFT - 9); @@ -387,15 +389,15 @@ static void discard_swap_cluster(struct swap_info_struct *si, #ifdef CONFIG_THP_SWAP #define SWAPFILE_CLUSTER HPAGE_PMD_NR -#define swap_entry_order(order) (order) +#define swap_slot_order(order) (order) #else #define SWAPFILE_CLUSTER 256 /* - * Define swap_entry_order() as constant to let compiler to optimize + * Define swap_slot_order() as constant to let compiler to optimize * out some code if !CONFIG_THP_SWAP */ -#define swap_entry_order(order) 0 +#define swap_slot_order(order) 0 #endif #define LATENCY_LIMIT 256 @@ -779,7 +781,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigned int order, unsigned char usage) { - unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID; + unsigned int next = SWAP_SLOT_INVALID, found = SWAP_SLOT_INVALID; unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER); unsigned long end = min(start + SWAPFILE_CLUSTER, si->max); unsigned int nr_pages = 1 << order; @@ -883,7 +885,7 @@ static void swap_reclaim_work(struct work_struct *work) * pool (a cluster). This might involve allocating a new cluster for current CPU * too. */ -static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order, +static unsigned long cluster_alloc_swap_slot(struct swap_info_struct *si, int order, unsigned char usage) { struct swap_cluster_info *ci; @@ -1137,7 +1139,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, */ for (i = 0; i < nr_entries; i++) { clear_bit(offset + i, si->zeromap); - zswap_invalidate(swp_entry(si->type, offset + i)); + zswap_invalidate(swp_slot_to_swp_entry(swp_slot(si->type, offset + i))); } if (si->flags & SWP_BLKDEV) @@ -1163,16 +1165,16 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, static int cluster_alloc_swap(struct swap_info_struct *si, unsigned char usage, int nr, - swp_entry_t slots[], int order) + swp_slot_t slots[], int order) { int n_ret = 0; while (n_ret < nr) { - unsigned long offset = cluster_alloc_swap_entry(si, order, usage); + unsigned long offset = cluster_alloc_swap_slot(si, order, usage); if (!offset) break; - slots[n_ret++] = swp_entry(si->type, offset); + slots[n_ret++] = swp_slot(si->type, offset); } return n_ret; @@ -1180,7 +1182,7 @@ static int cluster_alloc_swap(struct swap_info_struct *si, static int scan_swap_map_slots(struct swap_info_struct *si, unsigned char usage, int nr, - swp_entry_t slots[], int order) + swp_slot_t slots[], int order) { unsigned int nr_pages = 1 << order; @@ -1232,9 +1234,9 @@ static bool get_swap_device_info(struct swap_info_struct *si) return true; } -int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order) +int swap_slot_alloc(int n_goal, swp_slot_t swp_slots[], int entry_order) { - int order = swap_entry_order(entry_order); + int order = swap_slot_order(entry_order); unsigned long size = 1 << order; struct swap_info_struct *si, *next; long avail_pgs; @@ -1261,8 +1263,8 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order) spin_unlock(&swap_avail_lock); if (get_swap_device_info(si)) { n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE, - n_goal, swp_entries, order); - put_swap_device(si); + n_goal, swp_slots, order); + swap_slot_put_swap_info(si); if (n_ret || size > 1) goto check_out; } @@ -1293,36 +1295,36 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order) return n_ret; } -static struct swap_info_struct 
*_swap_info_get(swp_entry_t entry) +static struct swap_info_struct *_swap_info_get(swp_slot_t slot) { struct swap_info_struct *si; unsigned long offset; - if (!entry.val) + if (!slot.val) goto out; - si = swp_swap_info(entry); + si = swap_slot_swap_info(slot); if (!si) goto bad_nofile; if (data_race(!(si->flags & SWP_USED))) goto bad_device; - offset = swp_offset(entry); + offset = swp_slot_offset(slot); if (offset >= si->max) goto bad_offset; - if (data_race(!si->swap_map[swp_offset(entry)])) + if (data_race(!si->swap_map[swp_slot_offset(slot)])) goto bad_free; return si; bad_free: - pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val); + pr_err("%s: %s%08lx\n", __func__, Unused_offset, slot.val); goto out; bad_offset: - pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val); + pr_err("%s: %s%08lx\n", __func__, Bad_offset, slot.val); goto out; bad_device: - pr_err("%s: %s%08lx\n", __func__, Unused_file, entry.val); + pr_err("%s: %s%08lx\n", __func__, Unused_file, slot.val); goto out; bad_nofile: - pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val); + pr_err("%s: %s%08lx\n", __func__, Bad_file, slot.val); out: return NULL; } @@ -1332,8 +1334,9 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry) * prevent swapoff, such as the folio in swap cache is locked, RCU * reader side is locked, etc., the swap entry may become invalid * because of swapoff. Then, we need to enclose all swap related - * functions with get_swap_device() and put_swap_device(), unless the - * swap functions call get/put_swap_device() by themselves. + * functions with swap_slot_tryget_swap_info() and + * swap_slot_put_swap_info(), unless the swap functions call + * swap_slot_(tryget|put)_swap_info by themselves. * * RCU reader side lock (including any spinlock) is sufficient to * prevent swapoff, because synchronize_rcu() is called in swapoff() @@ -1342,11 +1345,11 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry) * Check whether swap entry is valid in the swap device. If so, * return pointer to swap_info_struct, and keep the swap entry valid * via preventing the swap device from being swapoff, until - * put_swap_device() is called. Otherwise return NULL. + * swap_slot_put_swap_info() is called. Otherwise return NULL. * * Notice that swapoff or swapoff+swapon can still happen before the - * percpu_ref_tryget_live() in get_swap_device() or after the - * percpu_ref_put() in put_swap_device() if there isn't any other way + * percpu_ref_tryget_live() in swap_slot_tryget_swap_info() or after the + * percpu_ref_put() in swap_slot_put_swap_info() if there isn't any other way * to prevent swapoff. The caller must be prepared for that. For * example, the following situation is possible. * @@ -1366,34 +1369,34 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry) * changed with the page table locked to check whether the swap device * has been swapoff or swapoff+swapon. 
*/ -struct swap_info_struct *get_swap_device(swp_entry_t entry) +struct swap_info_struct *swap_slot_tryget_swap_info(swp_slot_t slot) { struct swap_info_struct *si; unsigned long offset; - if (!entry.val) + if (!slot.val) goto out; - si = swp_swap_info(entry); + si = swap_slot_swap_info(slot); if (!si) goto bad_nofile; if (!get_swap_device_info(si)) goto out; - offset = swp_offset(entry); + offset = swp_slot_offset(slot); if (offset >= si->max) goto put_out; return si; bad_nofile: - pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val); + pr_err("%s: %s%08lx\n", __func__, Bad_file, slot.val); out: return NULL; put_out: - pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val); + pr_err("%s: %s%08lx\n", __func__, Bad_offset, slot.val); percpu_ref_put(&si->users); return NULL; } -static unsigned char __swap_entry_free_locked(struct swap_info_struct *si, +static unsigned char __swap_slot_free_locked(struct swap_info_struct *si, unsigned long offset, unsigned char usage) { @@ -1433,27 +1436,27 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *si, return usage; } -static unsigned char __swap_entry_free(struct swap_info_struct *si, - swp_entry_t entry) +static unsigned char __swap_slot_free(struct swap_info_struct *si, + swp_slot_t slot) { struct swap_cluster_info *ci; - unsigned long offset = swp_offset(entry); + unsigned long offset = swp_slot_offset(slot); unsigned char usage; ci = lock_cluster(si, offset); - usage = __swap_entry_free_locked(si, offset, 1); + usage = __swap_slot_free_locked(si, offset, 1); if (!usage) - swap_entry_range_free(si, ci, swp_entry(si->type, offset), 1); + swap_slot_range_free(si, ci, swp_slot(si->type, offset), 1); unlock_cluster(ci); return usage; } -static bool __swap_entries_free(struct swap_info_struct *si, - swp_entry_t entry, int nr) +static bool __swap_slots_free(struct swap_info_struct *si, + swp_slot_t slot, int nr) { - unsigned long offset = swp_offset(entry); - unsigned int type = swp_type(entry); + unsigned long offset = swp_slot_offset(slot); + unsigned int type = swp_slot_type(slot); struct swap_cluster_info *ci; bool has_cache = false; unsigned char count; @@ -1473,7 +1476,7 @@ static bool __swap_entries_free(struct swap_info_struct *si, for (i = 0; i < nr; i++) WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE); if (!has_cache) - swap_entry_range_free(si, ci, entry, nr); + swap_slot_range_free(si, ci, slot, nr); unlock_cluster(ci); return has_cache; @@ -1481,7 +1484,7 @@ static bool __swap_entries_free(struct swap_info_struct *si, fallback: for (i = 0; i < nr; i++) { if (data_race(si->swap_map[offset + i])) { - count = __swap_entry_free(si, swp_entry(type, offset + i)); + count = __swap_slot_free(si, swp_slot(type, offset + i)); if (count == SWAP_HAS_CACHE) has_cache = true; } else { @@ -1495,13 +1498,14 @@ static bool __swap_entries_free(struct swap_info_struct *si, * Drop the last HAS_CACHE flag of swap entries, caller have to * ensure all entries belong to the same cgroup. 
*/ -static void swap_entry_range_free(struct swap_info_struct *si, +static void swap_slot_range_free(struct swap_info_struct *si, struct swap_cluster_info *ci, - swp_entry_t entry, unsigned int nr_pages) + swp_slot_t slot, unsigned int nr_pages) { - unsigned long offset = swp_offset(entry); + unsigned long offset = swp_slot_offset(slot); unsigned char *map = si->swap_map + offset; unsigned char *map_end = map + nr_pages; + swp_entry_t entry = swp_slot_to_swp_entry(slot); mem_cgroup_uncharge_swap(entry, nr_pages); @@ -1533,23 +1537,19 @@ static void cluster_swap_free_nr(struct swap_info_struct *si, ci = lock_cluster(si, offset); do { - if (!__swap_entry_free_locked(si, offset, usage)) - swap_entry_range_free(si, ci, swp_entry(si->type, offset), 1); + if (!__swap_slot_free_locked(si, offset, usage)) + swap_slot_range_free(si, ci, swp_slot(si->type, offset), 1); } while (++offset < end); unlock_cluster(ci); } -/* - * Caller has made sure that the swap device corresponding to entry - * is still around or has not been recycled. - */ -void swap_free_nr(swp_entry_t entry, int nr_pages) +void swap_slot_free_nr(swp_slot_t slot, int nr_pages) { int nr; struct swap_info_struct *sis; - unsigned long offset = swp_offset(entry); + unsigned long offset = swp_slot_offset(slot); - sis = _swap_info_get(entry); + sis = _swap_info_get(slot); if (!sis) return; @@ -1561,27 +1561,37 @@ void swap_free_nr(swp_entry_t entry, int nr_pages) } } +/* + * Caller has made sure that the swap device corresponding to entry + * is still around or has not been recycled. + */ +void swap_free_nr(swp_entry_t entry, int nr_pages) +{ + swap_slot_free_nr(swp_entry_to_swp_slot(entry), nr_pages); +} + /* * Called after dropping swapcache to decrease refcnt to swap entries. */ void put_swap_folio(struct folio *folio, swp_entry_t entry) { - unsigned long offset = swp_offset(entry); + swp_slot_t slot = swp_entry_to_swp_slot(entry); + unsigned long offset = swp_slot_offset(slot); struct swap_cluster_info *ci; struct swap_info_struct *si; - int size = 1 << swap_entry_order(folio_order(folio)); + int size = 1 << swap_slot_order(folio_order(folio)); - si = _swap_info_get(entry); + si = _swap_info_get(slot); if (!si) return; ci = lock_cluster(si, offset); if (swap_is_has_cache(si, offset, size)) - swap_entry_range_free(si, ci, entry, size); + swap_slot_range_free(si, ci, slot, size); else { - for (int i = 0; i < size; i++, entry.val++) { - if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) - swap_entry_range_free(si, ci, entry, 1); + for (int i = 0; i < size; i++, slot.val++) { + if (!__swap_slot_free_locked(si, offset + i, SWAP_HAS_CACHE)) + swap_slot_range_free(si, ci, slot, 1); } } unlock_cluster(ci); @@ -1589,8 +1599,9 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry) int __swap_count(swp_entry_t entry) { - struct swap_info_struct *si = swp_swap_info(entry); - pgoff_t offset = swp_offset(entry); + swp_slot_t slot = swp_entry_to_swp_slot(entry); + struct swap_info_struct *si = swap_slot_swap_info(slot); + pgoff_t offset = swp_slot_offset(slot); return swap_count(si->swap_map[offset]); } @@ -1602,7 +1613,8 @@ int __swap_count(swp_entry_t entry) */ int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry) { - pgoff_t offset = swp_offset(entry); + swp_slot_t slot = swp_entry_to_swp_slot(entry); + pgoff_t offset = swp_slot_offset(slot); struct swap_cluster_info *ci; int count; @@ -1618,6 +1630,7 @@ int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry) */ int swp_swapcount(swp_entry_t entry) { + 
swp_slot_t slot = swp_entry_to_swp_slot(entry); int count, tmp_count, n; struct swap_info_struct *si; struct swap_cluster_info *ci; @@ -1625,11 +1638,11 @@ int swp_swapcount(swp_entry_t entry) pgoff_t offset; unsigned char *map; - si = _swap_info_get(entry); + si = _swap_info_get(slot); if (!si) return 0; - offset = swp_offset(entry); + offset = swp_slot_offset(slot); ci = lock_cluster(si, offset); @@ -1661,10 +1674,11 @@ int swp_swapcount(swp_entry_t entry) static bool swap_page_trans_huge_swapped(struct swap_info_struct *si, swp_entry_t entry, int order) { + swp_slot_t slot = swp_entry_to_swp_slot(entry); struct swap_cluster_info *ci; unsigned char *map = si->swap_map; unsigned int nr_pages = 1 << order; - unsigned long roffset = swp_offset(entry); + unsigned long roffset = swp_slot_offset(slot); unsigned long offset = round_down(roffset, nr_pages); int i; bool ret = false; @@ -1689,7 +1703,8 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si, static bool folio_swapped(struct folio *folio) { swp_entry_t entry = folio->swap; - struct swap_info_struct *si = _swap_info_get(entry); + swp_slot_t slot = swp_entry_to_swp_slot(entry); + struct swap_info_struct *si = _swap_info_get(slot); if (!si) return false; @@ -1712,7 +1727,8 @@ static bool folio_swapped(struct folio *folio) */ void free_swap_and_cache_nr(swp_entry_t entry, int nr) { - const unsigned long start_offset = swp_offset(entry); + swp_slot_t slot = swp_entry_to_swp_slot(entry); + const unsigned long start_offset = swp_slot_offset(slot); const unsigned long end_offset = start_offset + nr; struct swap_info_struct *si; bool any_only_cache = false; @@ -1721,7 +1737,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) if (non_swap_entry(entry)) return; - si = get_swap_device(entry); + si = swap_slot_tryget_swap_info(slot); if (!si) return; @@ -1731,7 +1747,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) /* * First free all entries in the range. */ - any_only_cache = __swap_entries_free(si, entry, nr); + any_only_cache = __swap_slots_free(si, slot, nr); /* * Short-circuit the below loop if none of the entries had their @@ -1744,7 +1760,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) * Now go back over the range trying to reclaim the swap cache. This is * more efficient for large folios because we will only try to reclaim * the swap once per folio in the common case. If we do - * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the + * __swap_slot_free() and __try_to_reclaim_swap() in the same loop, the * latter will get a reference and lock the folio for every individual * page but will only succeed once the swap slot for every subpage is * zero. 
@@ -1771,10 +1787,10 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) } out: - put_swap_device(si); + swap_slot_put_swap_info(si); } -void swapcache_free_entries(swp_entry_t *entries, int n) +void swap_slot_cache_free_slots(swp_slot_t *slots, int n) { int i; struct swap_cluster_info *ci; @@ -1784,10 +1800,10 @@ void swapcache_free_entries(swp_entry_t *entries, int n) return; for (i = 0; i < n; ++i) { - si = _swap_info_get(entries[i]); + si = _swap_info_get(slots[i]); if (si) { - ci = lock_cluster(si, swp_offset(entries[i])); - swap_entry_range_free(si, ci, entries[i], 1); + ci = lock_cluster(si, swp_slot_offset(slots[i])); + swap_slot_range_free(si, ci, slots[i], 1); unlock_cluster(ci); } } @@ -1846,22 +1862,22 @@ bool folio_free_swap(struct folio *folio) #ifdef CONFIG_HIBERNATION -swp_entry_t get_swap_page_of_type(int type) +swp_slot_t swap_slot_alloc_of_type(int type) { struct swap_info_struct *si = swap_type_to_swap_info(type); - swp_entry_t entry = {0}; + swp_slot_t slot = {0}; if (!si) goto fail; /* This is called for allocating swap entry, not cache */ if (get_swap_device_info(si)) { - if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0)) + if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &slot, 0)) atomic_long_dec(&nr_swap_pages); - put_swap_device(si); + swap_slot_put_swap_info(si); } fail: - return entry; + return slot; } /* @@ -2114,6 +2130,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd, unsigned long offset; unsigned char swp_count; swp_entry_t entry; + swp_slot_t slot; int ret; pte_t ptent; @@ -2129,10 +2146,12 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd, continue; entry = pte_to_swp_entry(ptent); - if (swp_type(entry) != type) + slot = swp_entry_to_swp_slot(entry); + + if (swp_slot_type(slot) != type) continue; - offset = swp_offset(entry); + offset = swp_slot_offset(slot); pte_unmap(pte); pte = NULL; @@ -2283,6 +2302,7 @@ static int try_to_unuse(unsigned int type) struct swap_info_struct *si = swap_info[type]; struct folio *folio; swp_entry_t entry; + swp_slot_t slot; unsigned int i; if (!swap_usage_in_pages(si)) @@ -2330,7 +2350,8 @@ static int try_to_unuse(unsigned int type) !signal_pending(current) && (i = find_next_to_unuse(si, i)) != 0) { - entry = swp_entry(type, i); + slot = swp_slot(type, i); + entry = swp_slot_to_swp_entry(slot); folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry)); if (IS_ERR(folio)) continue; @@ -2739,7 +2760,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) reenable_swap_slots_cache_unlock(); /* - * Wait for swap operations protected by get/put_swap_device() + * Wait for swap operations protected by swap_slot_(tryget|put)_swap_info() * to complete. Because of synchronize_rcu() here, all swap * operations protected by RCU reader side lock (including any * spinlock) will be waited too. 
This makes it easy to @@ -3198,7 +3219,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si, cluster = per_cpu_ptr(si->percpu_cluster, cpu); for (i = 0; i < SWAP_NR_ORDERS; i++) - cluster->next[i] = SWAP_ENTRY_INVALID; + cluster->next[i] = SWAP_SLOT_INVALID; local_lock_init(&cluster->lock); } } else { @@ -3207,7 +3228,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si, if (!si->global_cluster) goto err_free; for (i = 0; i < SWAP_NR_ORDERS; i++) - si->global_cluster->next[i] = SWAP_ENTRY_INVALID; + si->global_cluster->next[i] = SWAP_SLOT_INVALID; spin_lock_init(&si->global_cluster_lock); } @@ -3527,9 +3548,9 @@ void si_swapinfo(struct sysinfo *val) spin_unlock(&swap_lock); } -struct swap_info_struct *swp_swap_info(swp_entry_t entry) +struct swap_info_struct *swap_slot_swap_info(swp_slot_t slot) { - return swap_type_to_swap_info(swp_type(entry)); + return swap_type_to_swap_info(swp_slot_type(slot)); } /* @@ -3537,7 +3558,8 @@ struct swap_info_struct *swp_swap_info(swp_entry_t entry) */ struct address_space *swapcache_mapping(struct folio *folio) { - return swp_swap_info(folio->swap)->swap_file->f_mapping; + return swap_slot_swap_info(swp_entry_to_swp_slot(folio->swap)) + ->swap_file->f_mapping; } EXPORT_SYMBOL_GPL(swapcache_mapping); @@ -3560,6 +3582,7 @@ EXPORT_SYMBOL_GPL(__folio_swap_cache_index); */ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr) { + swp_slot_t slot = swp_entry_to_swp_slot(entry); struct swap_info_struct *si; struct swap_cluster_info *ci; unsigned long offset; @@ -3567,13 +3590,13 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr) unsigned char has_cache; int err, i; - si = swp_swap_info(entry); + si = swap_slot_swap_info(slot); if (WARN_ON_ONCE(!si)) { pr_err("%s%08lx\n", Bad_file, entry.val); return -EINVAL; } - offset = swp_offset(entry); + offset = swp_slot_offset(slot); VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); VM_WARN_ON(usage == 1 && nr > 1); ci = lock_cluster(si, offset); @@ -3675,7 +3698,8 @@ int swapcache_prepare(swp_entry_t entry, int nr) void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr) { - unsigned long offset = swp_offset(entry); + swp_slot_t slot = swp_entry_to_swp_slot(entry); + unsigned long offset = swp_slot_offset(slot); cluster_swap_free_nr(si, offset, nr, SWAP_HAS_CACHE); } @@ -3704,6 +3728,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask) struct page *list_page; pgoff_t offset; unsigned char count; + swp_slot_t slot = swp_entry_to_swp_slot(entry); int ret = 0; /* @@ -3712,7 +3737,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask) */ page = alloc_page(gfp_mask | __GFP_HIGHMEM); - si = get_swap_device(entry); + si = swap_slot_tryget_swap_info(slot); if (!si) { /* * An acceptable race has occurred since the failing @@ -3721,7 +3746,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask) goto outer; } - offset = swp_offset(entry); + offset = swp_slot_offset(slot); ci = lock_cluster(si, offset); @@ -3784,7 +3809,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask) spin_unlock(&si->cont_lock); out: unlock_cluster(ci); - put_swap_device(si); + swap_slot_put_swap_info(si); outer: if (page) __free_page(page); From patchwork Tue Apr 29 23:38:34 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Nhat Pham X-Patchwork-Id: 885987 Received: from 
mail-yw1-f174.google.com (mail-yw1-f174.google.com [209.85.128.174]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6F9402DEB86; Tue, 29 Apr 2025 23:38:56 +0000 (UTC) From: Nhat Pham To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org, peterx@redhat.com Subject: [RFC PATCH v2 06/18] mm: create scaffolds for the new virtual swap implementation Date: Tue, 29 Apr 2025 16:38:34 -0700 Message-ID: <20250429233848.3093350-7-nphamcs@gmail.com> X-Mailer: git-send-email 2.47.1 In-Reply-To: <20250429233848.3093350-1-nphamcs@gmail.com> References: <20250429233848.3093350-1-nphamcs@gmail.com> Precedence: bulk X-Mailing-List: linux-pm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0

In preparation for the implementation of swap virtualization, add new scaffolds for the new code:

1. Add a new mm/vswap.c source file, which currently only holds the logic to set up the (for now, empty) vswap debugfs directory. Hook this up in the swap setup step in mm/swap_state.c. Add a new maintainer entry for the new source file.

2. Add a new config option (CONFIG_VIRTUAL_SWAP). We will only get new behavior when the kernel is built with this config option. The entry for the config option in mm/Kconfig summarizes the pros and cons of the new virtual swap design, which the remainder of the patch series will implement.

3. Set up vswap compilation in the Makefile.

Other than the debugfs directory, no behavioral change intended.
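In brief, the scaffold boils down to a single Kconfig-gated init hook. The following is a condensed sketch of the mm/vswap.c that this patch introduces (the diff further below is authoritative; the header includes shown here are filled in by hand, since they are mangled in the archived diff). It only creates an empty debugfs directory, and swap_init_sysfs() calls it once during boot:

/* Condensed sketch of the mm/vswap.c scaffold added by this patch. */
#include <linux/debugfs.h>
#include <linux/printk.h>

#ifdef CONFIG_DEBUG_FS
static struct dentry *vswap_debugfs_root;

static int vswap_debug_fs_init(void)
{
	if (!debugfs_initialized())
		return -ENODEV;

	/* Empty for now; later patches in the series populate it. */
	vswap_debugfs_root = debugfs_create_dir("vswap", NULL);
	return 0;
}
#else
static int vswap_debug_fs_init(void)
{
	return 0;
}
#endif

int vswap_init(void)
{
	/* Failure to set up debugfs is not fatal: warn and continue. */
	if (vswap_debug_fs_init())
		pr_warn("Failed to initialize vswap debugfs\n");

	return 0;
}

With CONFIG_VIRTUAL_SWAP=y and CONFIG_DEBUG_FS=y, the only user-visible effect at this point is an empty vswap directory under the debugfs mount point (normally /sys/kernel/debug/vswap). With CONFIG_VIRTUAL_SWAP unset, vswap_init() is a static inline stub that returns 0 and nothing changes.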
Signed-off-by: Nhat Pham --- MAINTAINERS | 7 +++++++ include/linux/swap.h | 9 +++++++++ mm/Kconfig | 25 +++++++++++++++++++++++++ mm/Makefile | 1 + mm/swap_state.c | 6 ++++++ mm/vswap.c | 35 +++++++++++++++++++++++++++++++++++ 6 files changed, 83 insertions(+) create mode 100644 mm/vswap.c diff --git a/MAINTAINERS b/MAINTAINERS index 00e94bec401e..65108bf2a5f1 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -25290,6 +25290,13 @@ S: Maintained F: Documentation/devicetree/bindings/iio/light/vishay,veml6030.yaml F: drivers/iio/light/veml6030.c +VIRTUAL SWAP SPACE +M: Nhat Pham +M: Johannes Weiner +L: linux-mm@kvack.org +S: Maintained +F: mm/vswap.c + VISHAY VEML6075 UVA AND UVB LIGHT SENSOR DRIVER M: Javier Carrasco S: Maintained diff --git a/include/linux/swap.h b/include/linux/swap.h index 567fd2ebb0d3..328f6aec9313 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -726,6 +726,15 @@ static inline bool mem_cgroup_swap_full(struct folio *folio) } #endif +#ifdef CONFIG_VIRTUAL_SWAP +int vswap_init(void); +#else /* CONFIG_VIRTUAL_SWAP */ +static inline int vswap_init(void) +{ + return 0; +} +#endif /* CONFIG_VIRTUAL_SWAP */ + /** * swp_entry_to_swp_slot - look up the physical swap slot corresponding to a * virtual swap slot. diff --git a/mm/Kconfig b/mm/Kconfig index 1b501db06417..2e8eb66c5888 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -22,6 +22,31 @@ menuconfig SWAP used to provide more virtual memory than the actual RAM present in your computer. If unsure say Y. +config VIRTUAL_SWAP + bool "Swap space virtualization" + depends on SWAP + default n + help + When this is selected, the kernel is built with the new swap + design, where each swap entry is associated with a virtual swap + slot that is decoupled from a specific physical backing storage + location. As a result, swap entries that are: + + 1. Zero-filled + + 2. Stored in the zswap pool. + + 3. Rejected by zswap/zram but cannot be written back to a + backing swap device. + + no longer take up any disk storage (i.e they do not occupy any + slot in the backing swap device). + + Swapoff is also more efficient. + + There might be more lock contentions with heavy swap use, since + the swap cache is no longer range partitioned. 
+ config ZSWAP bool "Compressed cache for swap pages" depends on SWAP diff --git a/mm/Makefile b/mm/Makefile index 850386a67b3e..b7216c714fa1 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -76,6 +76,7 @@ ifdef CONFIG_MMU endif obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o swap_slots.o +obj-$(CONFIG_VIRTUAL_SWAP) += vswap.o obj-$(CONFIG_ZSWAP) += zswap.o obj-$(CONFIG_HAS_DMA) += dmapool.o obj-$(CONFIG_HUGETLBFS) += hugetlb.o diff --git a/mm/swap_state.c b/mm/swap_state.c index cbd1532b6b24..1607d23a3d7b 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -930,6 +930,12 @@ static int __init swap_init_sysfs(void) int err; struct kobject *swap_kobj; + err = vswap_init(); + if (err) { + pr_err("failed to initialize virtual swap space\n"); + return err; + } + swap_kobj = kobject_create_and_add("swap", mm_kobj); if (!swap_kobj) { pr_err("failed to create swap kobject\n"); diff --git a/mm/vswap.c b/mm/vswap.c new file mode 100644 index 000000000000..b9c28e819cca --- /dev/null +++ b/mm/vswap.c @@ -0,0 +1,35 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Virtual swap space + * + * Copyright (C) 2024 Meta Platforms, Inc., Nhat Pham + */ + #include + +#ifdef CONFIG_DEBUG_FS +#include + +static struct dentry *vswap_debugfs_root; + +static int vswap_debug_fs_init(void) +{ + if (!debugfs_initialized()) + return -ENODEV; + + vswap_debugfs_root = debugfs_create_dir("vswap", NULL); + return 0; +} +#else +static int vswap_debug_fs_init(void) +{ + return 0; +} +#endif + +int vswap_init(void) +{ + if (vswap_debug_fs_init()) + pr_warn("Failed to initialize vswap debugfs\n"); + + return 0; +} From patchwork Tue Apr 29 23:38:35 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Nhat Pham X-Patchwork-Id: 886259 Received: from mail-yw1-f182.google.com (mail-yw1-f182.google.com [209.85.128.182]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7B5902DEBA8; Tue, 29 Apr 2025 23:38:57 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.182 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745969939; cv=none; b=cdMYoUGKilO6xTNjp6kZc4SiIGF6Vg7gjX4hdxwrXBXqamubnofLEGsVRgB/2UAs7m95nsrVwe24s314WgB4O4YLpBILcWMRaPClewAhzq7ognj8Njgqv2qzO87nG06UuRNOmtZp2qHh1wCZW0xy70eM2x1nDQBYPOqiKwr0BhM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745969939; c=relaxed/simple; bh=196lU1jIJriKLiqmzv1Vy3qpM+eNgjeCkiA+fGipeqw=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=RVM7oZm5dD1NTP+AB/9SyeOTbp2kb1WdWQ4Xa3eBxvhmKDiJaDGxMYtcCn7WZ6yzqYn7g7qFWDQhFD6jJd0c6TQ5A4e5jh4L/NNOD406/mTSg3De37vpbFFRQPwEZOqWBBtY5dEsdpb07N76BG7ElLti4Mewuy7f66eQyxZ4NNs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=akWuXJXT; arc=none smtp.client-ip=209.85.128.182 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="akWuXJXT" Received: by mail-yw1-f182.google.com with SMTP id 00721157ae682-70811611315so57367017b3.1; Tue, 29 Apr 2025 
16:38:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1745969936; x=1746574736; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=44dQEBk34Lg+twP2t8lNoyYgFTs2iNjxhzJ6aTJ3tVA=; b=akWuXJXT023ucQQLpYOgmCCXrORzgtGLN4vEd7skv9SM6NcWwuZwkyCz5A1UznBm/G GnTmSXJQfVU/Zg/CBugWsLa4HSCXpT5xYb6QrTX6rs5/HjtQR1WvfudCtdAae1fc4hcT vVOGo0T/0TTV8X8/kNKGQ/2S8rTKAu6P84triLyjGClpJLZicPMvbZ39IMvKPNvdaBz3 GzQ+MMlWX24VN/amejdiD+FQn+O2SuXWBjbGfJRxuWTT2YvJ9Nn1aGeHcr5m5wNescUH 41nvwdf2VbNllvIfc+CLNKMHgvTucoy3GuwKqRwUwfOWDSCTCCrLFoc4sxEMfIg/59FR 0Jeg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1745969936; x=1746574736; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=44dQEBk34Lg+twP2t8lNoyYgFTs2iNjxhzJ6aTJ3tVA=; b=iG70vX4RywuDsbbPqSqKxsMFC2zgMFsy/FyBql/g6tizXskDFfczxFGxGWgVfFwprn KOWmct0FRO6voMTiUPLQamjHN397lzZ9AnuXnV6N93Nj53cvIm5E9bQtiu/H6GWx0Q0P fRyORTqtHgj+koikyolkcsKC5tVqBbvHXbJqyZ6B6lK945Ah/YC4yPxEDw3V9sn/aP3H 8oCnOKxWz3OWkW+FCvU83yexUd2HTmfjkBuvwEJhGTkyhQB+9O6enWq+Y/oxjVRmdJCh 1DjQWR+5jDySoReAST+FKauSQAkTuzRagh5uuVWyQs7fRZ4AzsQ3E8og5I5VGRgNWKdi n3yg== X-Forwarded-Encrypted: i=1; AJvYcCVSnCrGdY8m/k7bNNN0TBSRT3mQNJUhc29L7UHvfZcjZdTf7Qqmi/Zji47ddy5/jvjdmyhVpY0/@vger.kernel.org, AJvYcCVdxUklhEuQ89MgiYAFdz0FD5bKGKpyj4aI0uTSNvaAjXsYyxh0pCxpGY+/jndMYJyAGQz/iTt4f4c=@vger.kernel.org, AJvYcCWpi7j+XTrdpbBrlumxE4xvp139q9/gjQFRSR9VNMLWnTFZyjFIG3/2HUZSZKW4AnuiEueUswwuaQYRJZhU@vger.kernel.org X-Gm-Message-State: AOJu0YzpQWQhwYMainBDfSsnkku52NedeU9737Yn6j+BA6VCAQfUggun mB6LO2q2UPdu3bQgSdxYwGVMAzzs8mCkJy15GH2YEnmaaJxvO+0G X-Gm-Gg: ASbGnctjeA+RIVOmEBxLGH2lCKwbMMHYCcciA3AtBJ0g3c1Zaz1/JMifjZCmrku8FDI e9TfVDZKigh0xzVeR5a5jkopx9njjpdZydT8UXivzx9R/wa/3I9vxc+iaR9LEmtvbLmC7mFPiG8 Zu+SjtdCgU3CdKtNlWYVxQytA9tstW+uYm3vVBKRpM5WeLee9MI5MIvbBKNoP1+Pct6x+XJkw2i vNLoIoRkgV8hmQxLQuUFl8lg8AhMcBPLKuJLauQQ6lkhx41rpIG3B3PpXzDtDwttJevLb30UAgX G4OsszSJyMcQ/wukyNMZEVtXdjyxZ1DrZ6cYJyYDBw== X-Google-Smtp-Source: AGHT+IF0Gq0OKMjMbPFewLBiKCryoCgbkMLTWTjJKXA6zaQErPJ0Qq2QeEf8yHjPw1HgNf/wZrnNZg== X-Received: by 2002:a05:690c:6c85:b0:6f7:55a2:4cd8 with SMTP id 00721157ae682-708ad5c575dmr9535387b3.5.1745969936353; Tue, 29 Apr 2025 16:38:56 -0700 (PDT) Received: from localhost ([2a03:2880:25ff:2::]) by smtp.gmail.com with ESMTPSA id 00721157ae682-708ae1e9ae7sm701547b3.102.2025.04.29.16.38.55 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 29 Apr 2025 16:38:55 -0700 (PDT) From: Nhat Pham To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org, peterx@redhat.com Subject: [RFC PATCH v2 07/18] mm: swap: zswap: swap cache and zswap support for virtualized swap Date: Tue, 29 Apr 2025 16:38:35 -0700 Message-ID: <20250429233848.3093350-8-nphamcs@gmail.com> X-Mailer: git-send-email 2.47.1 In-Reply-To: 
<20250429233848.3093350-1-nphamcs@gmail.com> References: <20250429233848.3093350-1-nphamcs@gmail.com> Precedence: bulk X-Mailing-List: linux-pm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Currently, the swap cache code assumes that the swap space is of a fixed size. The virtual swap space is dynamically sized, so the existing partitioning code cannot be easily reused. A dynamic partitioning is planned, but for now keep the design simple and just use a flat swapcache for vswap. Similar to swap cache, the zswap tree code, specifically the range partition logic, can no longer easily be reused for the new virtual swap space design. Use a simple unified zswap tree in the new implementation for now. As in the case of swap cache, range partitioning is planned as a follow up work. Since the vswap's implementation has begun to diverge from the old implementation, we also introduce a new build config (CONFIG_VIRTUAL_SWAP). Users who do not select this config will get the old implementation, with no behavioral change. Signed-off-by: Nhat Pham --- mm/swap.h | 22 ++++++++++++++-------- mm/swap_state.c | 44 +++++++++++++++++++++++++++++++++++--------- mm/zswap.c | 38 ++++++++++++++++++++++++++++++++------ 3 files changed, 81 insertions(+), 23 deletions(-) diff --git a/mm/swap.h b/mm/swap.h index d5f8effa8015..06e20b1d79c4 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -22,22 +22,27 @@ void swap_write_unplug(struct swap_iocb *sio); int swap_writepage(struct page *page, struct writeback_control *wbc); void __swap_writepage(struct folio *folio, struct writeback_control *wbc); -/* linux/mm/swap_state.c */ -/* One swap address space for each 64M swap space */ +/* Return the swap device position of the swap slot. */ +static inline loff_t swap_slot_pos(swp_slot_t slot) +{ + return ((loff_t)swp_slot_offset(slot)) << PAGE_SHIFT; +} + #define SWAP_ADDRESS_SPACE_SHIFT 14 #define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT) #define SWAP_ADDRESS_SPACE_MASK (SWAP_ADDRESS_SPACE_PAGES - 1) + +/* linux/mm/swap_state.c */ +#ifdef CONFIG_VIRTUAL_SWAP +extern struct address_space *swap_address_space(swp_entry_t entry); +#define swap_cache_index(entry) entry.val +#else +/* One swap address space for each 64M swap space */ extern struct address_space *swapper_spaces[]; #define swap_address_space(entry) \ (&swapper_spaces[swp_type(entry)][swp_offset(entry) \ >> SWAP_ADDRESS_SPACE_SHIFT]) -/* Return the swap device position of the swap slot. */ -static inline loff_t swap_slot_pos(swp_slot_t slot) -{ - return ((loff_t)swp_slot_offset(slot)) << PAGE_SHIFT; -} - /* * Return the swap cache index of the swap entry. 
*/ @@ -46,6 +51,7 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry) BUILD_BUG_ON((SWP_OFFSET_MASK | SWAP_ADDRESS_SPACE_MASK) != SWP_OFFSET_MASK); return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK; } +#endif void show_swap_cache_info(void); bool add_to_swap(struct folio *folio); diff --git a/mm/swap_state.c b/mm/swap_state.c index 1607d23a3d7b..f677ebf9c5d0 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -38,8 +38,18 @@ static const struct address_space_operations swap_aops = { #endif }; +#ifdef CONFIG_VIRTUAL_SWAP +static struct address_space swapper_space __read_mostly; + +struct address_space *swap_address_space(swp_entry_t entry) +{ + return &swapper_space; +} +#else struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly; static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly; +#endif + static bool enable_vma_readahead __read_mostly = true; #define SWAP_RA_ORDER_CEILING 5 @@ -718,23 +728,34 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, return folio; } +static void init_swapper_space(struct address_space *space) +{ + xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ); + atomic_set(&space->i_mmap_writable, 0); + space->a_ops = &swap_aops; + /* swap cache doesn't use writeback related tags */ + mapping_set_no_writeback_tags(space); +} + +#ifdef CONFIG_VIRTUAL_SWAP +int init_swap_address_space(unsigned int type, unsigned long nr_pages) +{ + return 0; +} + +void exit_swap_address_space(unsigned int type) {} +#else int init_swap_address_space(unsigned int type, unsigned long nr_pages) { - struct address_space *spaces, *space; + struct address_space *spaces; unsigned int i, nr; nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES); spaces = kvcalloc(nr, sizeof(struct address_space), GFP_KERNEL); if (!spaces) return -ENOMEM; - for (i = 0; i < nr; i++) { - space = spaces + i; - xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ); - atomic_set(&space->i_mmap_writable, 0); - space->a_ops = &swap_aops; - /* swap cache doesn't use writeback related tags */ - mapping_set_no_writeback_tags(space); - } + for (i = 0; i < nr; i++) + init_swapper_space(spaces + i); nr_swapper_spaces[type] = nr; swapper_spaces[type] = spaces; @@ -752,6 +773,7 @@ void exit_swap_address_space(unsigned int type) nr_swapper_spaces[type] = 0; swapper_spaces[type] = NULL; } +#endif static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start, unsigned long *end) @@ -930,6 +952,10 @@ static int __init swap_init_sysfs(void) int err; struct kobject *swap_kobj; +#ifdef CONFIG_VIRTUAL_SWAP + init_swapper_space(&swapper_space); +#endif + err = vswap_init(); if (err) { pr_err("failed to initialize virtual swap space\n"); diff --git a/mm/zswap.c b/mm/zswap.c index 23365e76a3ce..c1327569ce80 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -203,8 +203,6 @@ struct zswap_entry { struct list_head lru; }; -static struct xarray *zswap_trees[MAX_SWAPFILES]; -static unsigned int nr_zswap_trees[MAX_SWAPFILES]; /* RCU-protected iteration */ static LIST_HEAD(zswap_pools); @@ -231,12 +229,28 @@ static bool zswap_has_pool; * helpers and fwd declarations **********************************/ +#ifdef CONFIG_VIRTUAL_SWAP +static DEFINE_XARRAY(zswap_tree); + +static inline struct xarray *swap_zswap_tree(swp_entry_t swp) +{ + return &zswap_tree; +} + +#define zswap_tree_index(entry) entry.val +#else +static struct xarray *zswap_trees[MAX_SWAPFILES]; +static unsigned int nr_zswap_trees[MAX_SWAPFILES]; + static inline struct xarray *swap_zswap_tree(swp_entry_t swp) { return 
&zswap_trees[swp_type(swp)][swp_offset(swp) >> SWAP_ADDRESS_SPACE_SHIFT];
 }
 
+#define zswap_tree_index(entry)	swp_offset(entry)
+#endif
+
 #define zswap_pool_debug(msg, p)				\
 	pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name,		\
 		 zpool_get_type((p)->zpool))
@@ -1047,7 +1061,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 				 swp_entry_t swpentry)
 {
 	struct xarray *tree;
-	pgoff_t offset = swp_offset(swpentry);
+	pgoff_t offset = zswap_tree_index(swpentry);
 	struct folio *folio;
 	struct mempolicy *mpol;
 	bool folio_was_allocated;
@@ -1463,7 +1477,7 @@ static bool zswap_store_page(struct page *page,
 		goto compress_failed;
 
 	old = xa_store(swap_zswap_tree(page_swpentry),
-		       swp_offset(page_swpentry),
+		       zswap_tree_index(page_swpentry),
 		       entry, GFP_KERNEL);
 	if (xa_is_err(old)) {
 		int err = xa_err(old);
@@ -1612,7 +1626,7 @@ bool zswap_store(struct folio *folio)
 bool zswap_load(struct folio *folio)
 {
 	swp_entry_t swp = folio->swap;
-	pgoff_t offset = swp_offset(swp);
+	pgoff_t offset = zswap_tree_index(swp);
 	bool swapcache = folio_test_swapcache(folio);
 	struct xarray *tree = swap_zswap_tree(swp);
 	struct zswap_entry *entry;
@@ -1670,7 +1684,7 @@ bool zswap_load(struct folio *folio)
 
 void zswap_invalidate(swp_entry_t swp)
 {
-	pgoff_t offset = swp_offset(swp);
+	pgoff_t offset = zswap_tree_index(swp);
 	struct xarray *tree = swap_zswap_tree(swp);
 	struct zswap_entry *entry;
 
@@ -1682,6 +1696,16 @@ void zswap_invalidate(swp_entry_t swp)
 	zswap_entry_free(entry);
 }
 
+#ifdef CONFIG_VIRTUAL_SWAP
+int zswap_swapon(int type, unsigned long nr_pages)
+{
+	return 0;
+}
+
+void zswap_swapoff(int type)
+{
+}
+#else
 int zswap_swapon(int type, unsigned long nr_pages)
 {
 	struct xarray *trees, *tree;
@@ -1718,6 +1742,8 @@ void zswap_swapoff(int type)
 	nr_zswap_trees[type] = 0;
 	zswap_trees[type] = NULL;
 }
+#endif /* CONFIG_VIRTUAL_SWAP */
+
 
 /*********************************
 * debugfs functions

From patchwork Tue Apr 29 23:38:36 2025
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 885985
From: Nhat Pham
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com,
 yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev,
 shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com,
 chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org,
 huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk,
 baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com,
 christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com,
 linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
linux-pm@vger.kernel.org, peterx@redhat.com Subject: [RFC PATCH v2 08/18] mm: swap: allocate a virtual swap slot for each swapped out page Date: Tue, 29 Apr 2025 16:38:36 -0700 Message-ID: <20250429233848.3093350-9-nphamcs@gmail.com> X-Mailer: git-send-email 2.47.1 In-Reply-To: <20250429233848.3093350-1-nphamcs@gmail.com> References: <20250429233848.3093350-1-nphamcs@gmail.com> Precedence: bulk X-Mailing-List: linux-pm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 For the new virtual swap space design, dynamically allocate a virtual slot (as well as an associated metadata structure) for each swapped out page, and associate it to the (physical) swap slot on the swapfile/swap partition. For now, there is always a physical slot in the swapfile associated for each virtual swap slot (except those about to be freed). The virtual swap slot's lifetime is still tied to the lifetime of its physical swap slot. We also maintain a backward map to look up the virtual swap slot from its associated physical swap slot on swapfile. This is used in cluster readahead, as well as several swapfile operations, such as the swap slot reclamation that happens when the swapfile is almost full. It will also be used in a future patch that simplifies swapoff. Signed-off-by: Nhat Pham --- include/linux/swap.h | 17 +- include/linux/swapops.h | 12 ++ mm/internal.h | 43 ++++- mm/shmem.c | 10 +- mm/swap.h | 2 + mm/swap_state.c | 29 +++- mm/swapfile.c | 24 ++- mm/vswap.c | 342 +++++++++++++++++++++++++++++++++++++++- 8 files changed, 457 insertions(+), 22 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 328f6aec9313..0f1337431e27 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -456,7 +456,6 @@ extern void __meminit kswapd_stop(int nid); /* Lifetime swap API (mm/swapfile.c) */ swp_entry_t folio_alloc_swap(struct folio *folio); bool folio_free_swap(struct folio *folio); -void put_swap_folio(struct folio *folio, swp_entry_t entry); void swap_shmem_alloc(swp_entry_t, int); int swap_duplicate(swp_entry_t); int swapcache_prepare(swp_entry_t entry, int nr); @@ -504,6 +503,7 @@ static inline long get_nr_swap_pages(void) } void si_swapinfo(struct sysinfo *); +void swap_slot_put_folio(swp_slot_t slot, struct folio *folio); swp_slot_t swap_slot_alloc_of_type(int); int swap_slot_alloc(int n, swp_slot_t swp_slots[], int order); void swap_slot_free_nr(swp_slot_t slot, int nr_pages); @@ -728,12 +728,19 @@ static inline bool mem_cgroup_swap_full(struct folio *folio) #ifdef CONFIG_VIRTUAL_SWAP int vswap_init(void); +void vswap_exit(void); +void vswap_free(swp_entry_t entry); +swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry); +swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot); #else /* CONFIG_VIRTUAL_SWAP */ static inline int vswap_init(void) { return 0; } -#endif /* CONFIG_VIRTUAL_SWAP */ + +static inline void vswap_exit(void) +{ +} /** * swp_entry_to_swp_slot - look up the physical swap slot corresponding to a @@ -758,6 +765,12 @@ static inline swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot) { return (swp_entry_t) { slot.val }; } +#endif /* CONFIG_VIRTUAL_SWAP */ + +static inline void put_swap_folio(struct folio *folio, swp_entry_t entry) +{ + swap_slot_put_folio(swp_entry_to_swp_slot(entry), folio); +} static inline bool trylock_swapoff(swp_entry_t entry, struct swap_info_struct **si) diff --git a/include/linux/swapops.h b/include/linux/swapops.h index 2a4101c9bba4..ba7364e1400a 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -27,6 +27,18 
@@ #define SWP_TYPE_SHIFT (BITS_PER_XA_VALUE - MAX_SWAPFILES_SHIFT) #define SWP_OFFSET_MASK ((1UL << SWP_TYPE_SHIFT) - 1) +#ifdef CONFIG_VIRTUAL_SWAP +#if SWP_TYPE_SHIFT > 32 +#define MAX_VSWAP U32_MAX +#else +/* + * The range of virtual swap slots is the same as the range of physical swap + * slots. + */ +#define MAX_VSWAP (((MAX_SWAPFILES - 1) << SWP_TYPE_SHIFT) | SWP_OFFSET_MASK) +#endif +#endif + /* * Definitions only for PFN swap entries (see is_pfn_swap_entry()). To * store PFN, we only need SWP_PFN_BITS bits. Each of the pfn swap entries diff --git a/mm/internal.h b/mm/internal.h index 2d63f6537e35..ca28729f822a 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -262,6 +262,40 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, return min(ptep - start_ptep, max_nr); } +#ifdef CONFIG_VIRTUAL_SWAP +static inline swp_entry_t swap_nth(swp_entry_t entry, long n) +{ + return (swp_entry_t) { entry.val + n }; +} + +/* similar to swap_nth, but check the backing physical slots as well. */ +static inline swp_entry_t swap_move(swp_entry_t entry, long delta) +{ + swp_slot_t slot = swp_entry_to_swp_slot(entry), next_slot; + swp_entry_t next_entry = swap_nth(entry, delta); + + next_slot = swp_entry_to_swp_slot(next_entry); + if (swp_slot_type(slot) != swp_slot_type(next_slot) || + swp_slot_offset(slot) + delta != swp_slot_offset(next_slot)) + next_entry.val = 0; + + return next_entry; +} +#else +static inline swp_entry_t swap_nth(swp_entry_t entry, long n) +{ + swp_slot_t slot = swp_entry_to_swp_slot(entry); + + return swp_slot_to_swp_entry(swp_slot(swp_slot_type(slot), + swp_slot_offset(slot) + n)); +} + +static inline swp_entry_t swap_move(swp_entry_t entry, long delta) +{ + return swap_nth(entry, delta); +} +#endif + /** * pte_move_swp_offset - Move the swap entry offset field of a swap pte * forward or backward by delta @@ -275,13 +309,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, */ static inline pte_t pte_move_swp_offset(pte_t pte, long delta) { - swp_entry_t entry = pte_to_swp_entry(pte), new_entry; - swp_slot_t slot = swp_entry_to_swp_slot(entry); - pte_t new; - - new_entry = swp_slot_to_swp_entry(swp_slot(swp_slot_type(slot), - swp_slot_offset(slot) + delta)); - new = swp_entry_to_pte(new_entry); + swp_entry_t entry = pte_to_swp_entry(pte); + pte_t new = swp_entry_to_pte(swap_move(entry, delta)); if (pte_swp_soft_dirty(pte)) new = pte_swp_mksoft_dirty(new); diff --git a/mm/shmem.c b/mm/shmem.c index f8efa49eb499..4c00b4673468 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2166,7 +2166,6 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index, XA_STATE_ORDER(xas, &mapping->i_pages, index, 0); void *alloced_shadow = NULL; int alloced_order = 0, i; - swp_slot_t slot = swp_entry_to_swp_slot(swap); /* Convert user data gfp flags to xarray node gfp flags */ gfp &= GFP_RECLAIM_MASK; @@ -2205,12 +2204,8 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index, */ for (i = 0; i < 1 << order; i++) { pgoff_t aligned_index = round_down(index, 1 << order); - swp_entry_t tmp_entry; - swp_slot_t tmp_slot; + swp_entry_t tmp_entry = swap_nth(swap, i); - tmp_slot = - swp_slot(swp_slot_type(slot), swp_slot_offset(slot) + i); - tmp_entry = swp_slot_to_swp_entry(tmp_slot); __xa_store(&mapping->i_pages, aligned_index + i, swp_to_radix_entry(tmp_entry), 0); } @@ -2336,8 +2331,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, if (split_order > 0) { pgoff_t offset = index - round_down(index, 1 << 
split_order); - swap = swp_slot_to_swp_entry(swp_slot( - swp_slot_type(slot), swp_slot_offset(slot) + offset)); + swap = swap_nth(swap, offset); } /* Here we actually start the io */ diff --git a/mm/swap.h b/mm/swap.h index 06e20b1d79c4..31c94671cb44 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -36,6 +36,8 @@ static inline loff_t swap_slot_pos(swp_slot_t slot) #ifdef CONFIG_VIRTUAL_SWAP extern struct address_space *swap_address_space(swp_entry_t entry); #define swap_cache_index(entry) entry.val + +void virt_clear_shadow_from_swap_cache(swp_entry_t entry); #else /* One swap address space for each 64M swap space */ extern struct address_space *swapper_spaces[]; diff --git a/mm/swap_state.c b/mm/swap_state.c index f677ebf9c5d0..16abdb5ce07a 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -177,6 +177,7 @@ void __delete_from_swap_cache(struct folio *folio, __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr); } +#ifndef CONFIG_VIRTUAL_SWAP swp_entry_t folio_alloc_swap(struct folio *folio) { swp_slot_t slot = folio_alloc_swap_slot(folio); @@ -189,6 +190,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio) return entry; } +#endif /** * add_to_swap - allocate swap space for a folio @@ -270,6 +272,27 @@ void delete_from_swap_cache(struct folio *folio) folio_ref_sub(folio, folio_nr_pages(folio)); } +#ifdef CONFIG_VIRTUAL_SWAP +/* + * In the virtual swap implementation, we index the swap cache by virtual swap + * slots rather than physical ones. As a result, we only clear the shadow when + * the virtual swap slot is freed (via virt_clear_shadow_from_swap_cache()), + * not when the physical swap slot is freed (via clear_shadow_from_swap_cache() + * in the old implementation). + */ +void virt_clear_shadow_from_swap_cache(swp_entry_t entry) +{ + struct address_space *address_space = swap_address_space(entry); + pgoff_t index = swap_cache_index(entry); + XA_STATE(xas, &address_space->i_pages, index); + + xas_set_update(&xas, workingset_update_node); + xa_lock_irq(&address_space->i_pages); + if (xa_is_value(xas_load(&xas))) + xas_store(&xas, NULL); + xa_unlock_irq(&address_space->i_pages); +} +#else void clear_shadow_from_swap_cache(int type, unsigned long begin, unsigned long end) { @@ -300,6 +323,7 @@ void clear_shadow_from_swap_cache(int type, unsigned long begin, break; } } +#endif /* * If we are the only user, then try to free up the swap cache. @@ -965,7 +989,8 @@ static int __init swap_init_sysfs(void) swap_kobj = kobject_create_and_add("swap", mm_kobj); if (!swap_kobj) { pr_err("failed to create swap kobject\n"); - return -ENOMEM; + err = -ENOMEM; + goto vswap_exit; } err = sysfs_create_group(swap_kobj, &swap_attr_group); if (err) { @@ -976,6 +1001,8 @@ static int __init swap_init_sysfs(void) delete_obj: kobject_put(swap_kobj); +vswap_exit: + vswap_exit(); return err; } subsys_initcall(swap_init_sysfs); diff --git a/mm/swapfile.c b/mm/swapfile.c index 17cbf14bac72..849525810bbe 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1126,12 +1126,18 @@ static void swap_range_alloc(struct swap_info_struct *si, static void swap_range_free(struct swap_info_struct *si, unsigned long offset, unsigned int nr_entries) { - unsigned long begin = offset; unsigned long end = offset + nr_entries - 1; void (*swap_slot_free_notify)(struct block_device *, unsigned long); unsigned int i; +#ifndef CONFIG_VIRTUAL_SWAP + unsigned long begin = offset; + /* + * In the virtual swap design, the swap cache is indexed by virtual swap + * slots. We will clear the shadow when the virtual swap slots are freed. 
+ */ clear_shadow_from_swap_cache(si->type, begin, end); +#endif /* * Use atomic clear_bit operations only on zeromap instead of non-atomic @@ -1506,8 +1512,21 @@ static void swap_slot_range_free(struct swap_info_struct *si, unsigned char *map = si->swap_map + offset; unsigned char *map_end = map + nr_pages; swp_entry_t entry = swp_slot_to_swp_entry(slot); +#ifdef CONFIG_VIRTUAL_SWAP + int i; + /* release all the associated (virtual) swap slots */ + for (i = 0; i < nr_pages; i++) { + vswap_free(entry); + entry.val++; + } +#else + /* + * In the new (i.e virtual swap) implementation, we will let the virtual + * swap layer handle the cgroup swap accounting and charging. + */ mem_cgroup_uncharge_swap(entry, nr_pages); +#endif /* It should never free entries across different clusters */ VM_BUG_ON(ci != offset_to_cluster(si, offset + nr_pages - 1)); @@ -1573,9 +1592,8 @@ void swap_free_nr(swp_entry_t entry, int nr_pages) /* * Called after dropping swapcache to decrease refcnt to swap entries. */ -void put_swap_folio(struct folio *folio, swp_entry_t entry) +void swap_slot_put_folio(swp_slot_t slot, struct folio *folio) { - swp_slot_t slot = swp_entry_to_swp_slot(entry); unsigned long offset = swp_slot_offset(slot); struct swap_cluster_info *ci; struct swap_info_struct *si; diff --git a/mm/vswap.c b/mm/vswap.c index b9c28e819cca..23a05c3393d8 100644 --- a/mm/vswap.c +++ b/mm/vswap.c @@ -4,7 +4,75 @@ * * Copyright (C) 2024 Meta Platforms, Inc., Nhat Pham */ - #include +#include +#include +#include +#include +#include +#include "swap.h" + +/* + * Virtual Swap Space + * + * We associate with each swapped out page a virtual swap slot. This will allow + * us to change the backing state of a swapped out page without having to + * update every single page table entries referring to it. + * + * For now, there is a one-to-one correspondence between a virtual swap slot + * and its associated physical swap slot. + */ + +/** + * Swap descriptor - metadata of a swapped out page. + * + * @slot: The handle to the physical swap slot backing this page. + * @rcu: The RCU head to free the descriptor with an RCU grace period. + */ +struct swp_desc { + swp_slot_t slot; + struct rcu_head rcu; +}; + +/* Virtual swap space - swp_entry_t -> struct swp_desc */ +static DEFINE_XARRAY_FLAGS(vswap_map, XA_FLAGS_TRACK_FREE); + +static const struct xa_limit vswap_map_limit = { + .max = MAX_VSWAP, + /* reserve the 0 virtual swap slot to indicate errors */ + .min = 1, +}; + +/* Physical (swp_slot_t) to virtual (swp_entry_t) swap slots. */ +static DEFINE_XARRAY(vswap_rmap); + +/* + * For swapping large folio of size n, we reserve an empty PMD-sized cluster + * of contiguous and aligned virtual swap slots, then allocate the first n + * virtual swap slots from the cluster. + */ +#define VSWAP_CLUSTER_SHIFT HPAGE_PMD_ORDER +#define VSWAP_CLUSTER_SIZE (1UL << VSWAP_CLUSTER_SHIFT) + +/* + * Map from a cluster id to the number of allocated virtual swap slots in the + * (PMD-sized) cluster. This allows us to quickly allocate an empty cluster + * for a large folio being swapped out. + */ +static DEFINE_XARRAY_FLAGS(vswap_cluster_map, XA_FLAGS_TRACK_FREE); + +static const struct xa_limit vswap_cluster_map_limit = { + /* Do not allocate from the last cluster if it does not have enough slots. */ + .max = (((MAX_VSWAP + 1) >> (VSWAP_CLUSTER_SHIFT)) - 1), + /* + * First cluster is never handed out for large folios, since the 0 virtual + * swap slot is reserved for errors. 
+ */ + .min = 1, +}; + +static struct kmem_cache *swp_desc_cache; +static atomic_t vswap_alloc_reject; +static atomic_t vswap_used; #ifdef CONFIG_DEBUG_FS #include @@ -17,6 +85,10 @@ static int vswap_debug_fs_init(void) return -ENODEV; vswap_debugfs_root = debugfs_create_dir("vswap", NULL); + debugfs_create_atomic_t("alloc_reject", 0444, + vswap_debugfs_root, &vswap_alloc_reject); + debugfs_create_atomic_t("used", 0444, vswap_debugfs_root, &vswap_used); + return 0; } #else @@ -26,10 +98,278 @@ static int vswap_debug_fs_init(void) } #endif +/* Allolcate a contiguous range of virtual swap slots */ +static swp_entry_t vswap_alloc(int nr) +{ + struct swp_desc **descs; + swp_entry_t entry; + u32 index, cluster_id; + void *cluster_entry; + unsigned long cluster_count; + int i, err; + + entry.val = 0; + descs = kcalloc(nr, sizeof(*descs), GFP_KERNEL); + if (!descs) { + atomic_add(nr, &vswap_alloc_reject); + return (swp_entry_t){0}; + } + + if (unlikely(!kmem_cache_alloc_bulk( + swp_desc_cache, GFP_KERNEL, nr, (void **)descs))) { + atomic_add(nr, &vswap_alloc_reject); + kfree(descs); + return (swp_entry_t){0}; + } + + for (i = 0; i < nr; i++) + descs[i]->slot.val = 0; + + xa_lock(&vswap_map); + if (nr == 1) { + if (__xa_alloc(&vswap_map, &index, descs[0], vswap_map_limit, + GFP_KERNEL)) + goto unlock; + else { + /* + * Increment the allocation count of the cluster which the + * allocated virtual swap slot belongs to. + */ + cluster_id = index >> VSWAP_CLUSTER_SHIFT; + cluster_entry = xa_load(&vswap_cluster_map, cluster_id); + cluster_count = cluster_entry ? xa_to_value(cluster_entry) : 0; + cluster_count++; + VM_WARN_ON(cluster_count > VSWAP_CLUSTER_SIZE); + + if (xa_err(xa_store(&vswap_cluster_map, cluster_id, + xa_mk_value(cluster_count), GFP_KERNEL))) { + __xa_erase(&vswap_map, index); + goto unlock; + } + } + } else { + /* allocate an unused cluster */ + cluster_entry = xa_mk_value(nr); + if (xa_alloc(&vswap_cluster_map, &cluster_id, cluster_entry, + vswap_cluster_map_limit, GFP_KERNEL)) + goto unlock; + + index = cluster_id << VSWAP_CLUSTER_SHIFT; + + for (i = 0; i < nr; i++) { + err = __xa_insert(&vswap_map, index + i, descs[i], GFP_KERNEL); + VM_WARN_ON(err == -EBUSY); + if (err) { + while (--i >= 0) + __xa_erase(&vswap_map, index + i); + xa_erase(&vswap_cluster_map, cluster_id); + goto unlock; + } + } + } + + VM_WARN_ON(!index); + VM_WARN_ON(index + nr - 1 > MAX_VSWAP); + entry.val = index; + atomic_add(nr, &vswap_used); +unlock: + xa_unlock(&vswap_map); + if (!entry.val) { + atomic_add(nr, &vswap_alloc_reject); + kmem_cache_free_bulk(swp_desc_cache, nr, (void **)descs); + } + kfree(descs); + return entry; +} + +static inline void release_vswap_slot(unsigned long index) +{ + unsigned long cluster_id = index >> VSWAP_CLUSTER_SHIFT, cluster_count; + void *cluster_entry; + + xa_lock(&vswap_map); + __xa_erase(&vswap_map, index); + cluster_entry = xa_load(&vswap_cluster_map, cluster_id); + VM_WARN_ON(!cluster_entry); + cluster_count = xa_to_value(cluster_entry); + cluster_count--; + + VM_WARN_ON(cluster_count < 0); + + if (cluster_count) + xa_store(&vswap_cluster_map, cluster_id, + xa_mk_value(cluster_count), GFP_KERNEL); + else + xa_erase(&vswap_cluster_map, cluster_id); + xa_unlock(&vswap_map); + atomic_dec(&vswap_used); +} + +/** + * vswap_free - free a virtual swap slot. 
+ * @id: the virtual swap slot to free + */ +void vswap_free(swp_entry_t entry) +{ + struct swp_desc *desc; + + if (!entry.val) + return; + + /* do not immediately erase the virtual slot to prevent its reuse */ + desc = xa_load(&vswap_map, entry.val); + if (!desc) + return; + + virt_clear_shadow_from_swap_cache(entry); + + if (desc->slot.val) { + /* we only charge after linkage was established */ + mem_cgroup_uncharge_swap(entry, 1); + xa_erase(&vswap_rmap, desc->slot.val); + } + + /* erase forward mapping and release the virtual slot for reallocation */ + release_vswap_slot(entry.val); + kfree_rcu(desc, rcu); +} + +/** + * folio_alloc_swap - allocate virtual swap slots for a folio. + * @folio: the folio. + * + * Return: the first allocated slot if success, or the zero virtuals swap slot + * on failure. + */ +swp_entry_t folio_alloc_swap(struct folio *folio) +{ + int i, err, nr = folio_nr_pages(folio); + bool manual_freeing = true; + struct swp_desc *desc; + swp_entry_t entry; + swp_slot_t slot; + + entry = vswap_alloc(nr); + if (!entry.val) + return entry; + + /* + * XXX: for now, we always allocate a physical swap slot for each virtual + * swap slot, and their lifetime are coupled. This will change once we + * decouple virtual swap slots from their backing states, and only allocate + * physical swap slots for them on demand (i.e on zswap writeback, or + * fallback from zswap store failure). + */ + slot = folio_alloc_swap_slot(folio); + if (!slot.val) + goto vswap_free; + + /* establish the vrtual <-> physical swap slots linkages. */ + for (i = 0; i < nr; i++) { + err = xa_insert(&vswap_rmap, slot.val + i, + xa_mk_value(entry.val + i), GFP_KERNEL); + VM_WARN_ON(err == -EBUSY); + if (err) { + while (--i >= 0) + xa_erase(&vswap_rmap, slot.val + i); + goto put_physical_swap; + } + } + + i = 0; + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + desc->slot.val = slot.val + i; + i++; + } + rcu_read_unlock(); + + manual_freeing = false; + /* + * XXX: for now, we charge towards the memory cgroup's swap limit on virtual + * swap slots allocation. This is acceptable because as noted above, each + * virtual swap slot corresponds to a physical swap slot. Once we have + * decoupled virtual and physical swap slots, we will only charge when we + * actually allocate a physical swap slot. + */ + if (!mem_cgroup_try_charge_swap(folio, entry)) + return entry; + +put_physical_swap: + /* + * There is no any linkage between virtual and physical swap slots yet. We + * have to manually and separately free the allocated virtual and physical + * swap slots. + */ + swap_slot_put_folio(slot, folio); +vswap_free: + if (manual_freeing) { + for (i = 0; i < nr; i++) + vswap_free((swp_entry_t){entry.val + i}); + } + entry.val = 0; + return entry; +} + +/** + * swp_entry_to_swp_slot - look up the physical swap slot corresponding to a + * virtual swap slot. + * @entry: the virtual swap slot. + * + * Return: the physical swap slot corresponding to the virtual swap slot. + */ +swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry) +{ + struct swp_desc *desc; + + if (!entry.val) + return (swp_slot_t){0}; + + desc = xa_load(&vswap_map, entry.val); + return desc ? desc->slot : (swp_slot_t){0}; +} + +/** + * swp_slot_to_swp_entry - look up the virtual swap slot corresponding to a + * physical swap slot. + * @slot: the physical swap slot. + * + * Return: the virtual swap slot corresponding to the physical swap slot. 
+ */
+swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot)
+{
+	void *entry = xa_load(&vswap_rmap, slot.val);
+
+	/*
+	 * entry can be NULL if we fail to link the virtual and physical swap slot
+	 * during the swap slot allocation process.
+	 */
+	return entry ? (swp_entry_t){xa_to_value(entry)} : (swp_entry_t){0};
+}
+
 int vswap_init(void)
 {
+	swp_desc_cache = KMEM_CACHE(swp_desc, 0);
+	if (!swp_desc_cache)
+		return -ENOMEM;
+
+	if (xa_insert(&vswap_cluster_map, 0, xa_mk_value(1), GFP_KERNEL)) {
+		kmem_cache_destroy(swp_desc_cache);
+		return -ENOMEM;
+	}
+
 	if (vswap_debug_fs_init())
 		pr_warn("Failed to initialize vswap debugfs\n");
 
 	return 0;
 }
+
+void vswap_exit(void)
+{
+	kmem_cache_destroy(swp_desc_cache);
+}

From patchwork Tue Apr 29 23:38:37 2025
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 886258
From: Nhat Pham
To: linux-mm@kvack.org
Subject: [RFC PATCH v2 09/18] swap: implement the swap_cgroup API using virtual swap
Date: Tue, 29 Apr 2025 16:38:37 -0700
Message-ID: <20250429233848.3093350-10-nphamcs@gmail.com>
In-Reply-To: <20250429233848.3093350-1-nphamcs@gmail.com>

Once we decouple a swap entry from its backing store via the virtual swap, we
can no longer statically allocate an array to store the swap entries' cgroup
information. Move it to the swap descriptor.
Signed-off-by: Nhat Pham --- mm/Makefile | 2 ++ mm/vswap.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 73 insertions(+), 1 deletion(-) diff --git a/mm/Makefile b/mm/Makefile index b7216c714fa1..35f2f282c8da 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -101,8 +101,10 @@ obj-$(CONFIG_PAGE_COUNTER) += page_counter.o obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o ifdef CONFIG_SWAP +ifndef CONFIG_VIRTUAL_SWAP obj-$(CONFIG_MEMCG) += swap_cgroup.o endif +endif obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o obj-$(CONFIG_GUP_TEST) += gup_test.o obj-$(CONFIG_DMAPOOL_TEST) += dmapool_test.o diff --git a/mm/vswap.c b/mm/vswap.c index 23a05c3393d8..3792fa7f766b 100644 --- a/mm/vswap.c +++ b/mm/vswap.c @@ -27,10 +27,14 @@ * * @slot: The handle to the physical swap slot backing this page. * @rcu: The RCU head to free the descriptor with an RCU grace period. + * @memcgid: The memcg id of the owning memcg, if any. */ struct swp_desc { swp_slot_t slot; struct rcu_head rcu; +#ifdef CONFIG_MEMCG + atomic_t memcgid; +#endif }; /* Virtual swap space - swp_entry_t -> struct swp_desc */ @@ -122,8 +126,10 @@ static swp_entry_t vswap_alloc(int nr) return (swp_entry_t){0}; } - for (i = 0; i < nr; i++) + for (i = 0; i < nr; i++) { descs[i]->slot.val = 0; + atomic_set(&descs[i]->memcgid, 0); + } xa_lock(&vswap_map); if (nr == 1) { @@ -352,6 +358,70 @@ swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot) return entry ? (swp_entry_t){xa_to_value(entry)} : (swp_entry_t){0}; } +#ifdef CONFIG_MEMCG +static unsigned short vswap_cgroup_record(swp_entry_t entry, + unsigned short memcgid, unsigned int nr_ents) +{ + struct swp_desc *desc; + unsigned short oldid, iter = 0; + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr_ents - 1) { + if (xas_retry(&xas, desc)) + continue; + + oldid = atomic_xchg(&desc->memcgid, memcgid); + if (!iter) + iter = oldid; + VM_WARN_ON(iter != oldid); + } + rcu_read_unlock(); + + return oldid; +} + +void swap_cgroup_record(struct folio *folio, unsigned short memcgid, + swp_entry_t entry) +{ + unsigned short oldid = + vswap_cgroup_record(entry, memcgid, folio_nr_pages(folio)); + + VM_WARN_ON(oldid); +} + +unsigned short swap_cgroup_clear(swp_entry_t entry, unsigned int nr_ents) +{ + return vswap_cgroup_record(entry, 0, nr_ents); +} + +unsigned short lookup_swap_cgroup_id(swp_entry_t entry) +{ + struct swp_desc *desc; + unsigned short ret; + + /* + * Note that the virtual swap slot can be freed under us, for instance in + * the invocation of mem_cgroup_swapin_charge_folio. We need to wrap the + * entire lookup in RCU read-side critical section, and double check the + * existence of the swap descriptor. + */ + rcu_read_lock(); + desc = xa_load(&vswap_map, entry.val); + ret = desc ? 
atomic_read(&desc->memcgid) : 0;
+	rcu_read_unlock();
+	return ret;
+}
+
+int swap_cgroup_swapon(int type, unsigned long max_pages)
+{
+	return 0;
+}
+
+void swap_cgroup_swapoff(int type) {}
+#endif /* CONFIG_MEMCG */
+
 int vswap_init(void)
 {
 	swp_desc_cache = KMEM_CACHE(swp_desc, 0);

From patchwork Tue Apr 29 23:38:38 2025
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 885984
From: Nhat Pham
To: linux-mm@kvack.org
Subject: [RFC PATCH v2 10/18] swap: manage swap entry lifetime at the virtual swap layer
Date: Tue, 29 Apr 2025 16:38:38 -0700
Message-ID: <20250429233848.3093350-11-nphamcs@gmail.com>
In-Reply-To: <20250429233848.3093350-1-nphamcs@gmail.com>

This patch moves the swap entry lifetime to the virtual swap layer (if we
enable swap virtualization), by adding to the swap descriptor an atomic field
named "swap_refs" that takes into account:

1. Whether the swap entry is in swap cache (or about to be added). This is
   indicated by the last bit of the field.

2. The swap count of the swap entry, which counts the number of page table
   entries at which the swap entry is inserted. This is given by the remaining
   bits of the field.

We also re-implement all of the swap entry lifetime API (swap_duplicate(),
swap_free_nr(), swapcache_prepare(), etc.) in the virtual swap layer.

For now, we do not implement swap count continuation - the swap_count field in
the swap descriptor is big enough to hold the maximum number of swap counts.
This vastly simplifies the logic.

Note that the swapfile's swap map can now be reduced under the virtual swap
implementation, as each slot can now only have 3 states: unallocated,
allocated, and bad slot.
However, I leave this simplification to future work, to minimize the amount of code change for review here. Signed-off-by: Nhat Pham --- include/linux/swap.h | 40 ++++- mm/memory.c | 6 + mm/swapfile.c | 124 +++++++++++--- mm/vswap.c | 400 ++++++++++++++++++++++++++++++++++++++++++- 4 files changed, 536 insertions(+), 34 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 0f1337431e27..798adfbd43cb 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -225,6 +225,11 @@ enum { #define SWAP_CLUSTER_MAX_SKIPPED (SWAP_CLUSTER_MAX << 10) #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX +#ifdef CONFIG_VIRTUAL_SWAP +/* Swapfile's swap map state*/ +#define SWAP_MAP_ALLOCATED 0x01 /* Page is allocated */ +#define SWAP_MAP_BAD 0x02 /* Page is bad */ +#else /* Bit flag in swap_map */ #define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */ #define COUNT_CONTINUED 0x80 /* Flag swap_map continuation for full count */ @@ -236,6 +241,7 @@ enum { /* Special value in each swap_map continuation */ #define SWAP_CONT_MAX 0x7f /* Max count */ +#endif /* CONFIG_VIRTUAL_SWAP */ /* * We use this to track usage of a cluster. A cluster is a block of swap disk @@ -453,7 +459,7 @@ extern void __meminit kswapd_stop(int nid); #ifdef CONFIG_SWAP -/* Lifetime swap API (mm/swapfile.c) */ +/* Lifetime swap API (mm/swapfile.c or mm/vswap.c) */ swp_entry_t folio_alloc_swap(struct folio *folio); bool folio_free_swap(struct folio *folio); void swap_shmem_alloc(swp_entry_t, int); @@ -507,7 +513,9 @@ void swap_slot_put_folio(swp_slot_t slot, struct folio *folio); swp_slot_t swap_slot_alloc_of_type(int); int swap_slot_alloc(int n, swp_slot_t swp_slots[], int order); void swap_slot_free_nr(swp_slot_t slot, int nr_pages); +#ifndef CONFIG_VIRTUAL_SWAP int add_swap_count_continuation(swp_entry_t, gfp_t); +#endif void swap_slot_cache_free_slots(swp_slot_t *slots, int n); int swap_type_of(dev_t device, sector_t offset); int find_first_swap(dev_t *device); @@ -560,10 +568,12 @@ static inline void free_swap_cache(struct folio *folio) { } +#ifndef CONFIG_VIRTUAL_SWAP static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask) { return 0; } +#endif static inline void swap_shmem_alloc(swp_entry_t swp, int nr) { @@ -729,9 +739,14 @@ static inline bool mem_cgroup_swap_full(struct folio *folio) #ifdef CONFIG_VIRTUAL_SWAP int vswap_init(void); void vswap_exit(void); -void vswap_free(swp_entry_t entry); swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry); swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot); +bool vswap_tryget(swp_entry_t entry); +void vswap_put(swp_entry_t entry); +bool folio_swapped(struct folio *folio); +bool vswap_swapcache_only(swp_entry_t entry, int nr); +int non_swapcache_batch(swp_entry_t entry, int nr); +void put_swap_folio(struct folio *folio, swp_entry_t entry); #else /* CONFIG_VIRTUAL_SWAP */ static inline int vswap_init(void) { @@ -765,26 +780,41 @@ static inline swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot) { return (swp_entry_t) { slot.val }; } -#endif /* CONFIG_VIRTUAL_SWAP */ static inline void put_swap_folio(struct folio *folio, swp_entry_t entry) { swap_slot_put_folio(swp_entry_to_swp_slot(entry), folio); } +#endif /* CONFIG_VIRTUAL_SWAP */ static inline bool trylock_swapoff(swp_entry_t entry, struct swap_info_struct **si) { swp_slot_t slot = swp_entry_to_swp_slot(entry); + /* + * Note that in the virtual swap implementation, we do not need to do anything + * to guard against concurrent swapoff for the swap entry's metadata: + * + * 1. 
The swap descriptor (struct swp_desc) has its existence guaranteed by + * RCU + its reference count. + * + * 2. Swap cache, zswap trees, etc. are all statically declared, and never + * freed. + * + * We do, however, need a reference to the swap device itself, because we + * need swap device's metadata in certain scenarios, for example when we + * need to inspect the swap device flag in do_swap_page(). + */ *si = swap_slot_tryget_swap_info(slot); - return *si; + return IS_ENABLED(CONFIG_VIRTUAL_SWAP) || *si; } static inline void unlock_swapoff(swp_entry_t entry, struct swap_info_struct *si) { - swap_slot_put_swap_info(si); + if (si) + swap_slot_put_swap_info(si); } #endif /* __KERNEL__*/ diff --git a/mm/memory.c b/mm/memory.c index c44e845b5320..a8c418104f28 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1202,10 +1202,14 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, if (ret == -EIO) { VM_WARN_ON_ONCE(!entry.val); + /* virtual swap implementation does not have swap count continuation */ + VM_WARN_ON_ONCE(IS_ENABLED(CONFIG_VIRTUAL_SWAP)); +#ifndef CONFIG_VIRTUAL_SWAP if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) { ret = -ENOMEM; goto out; } +#endif entry.val = 0; } else if (ret == -EBUSY || unlikely(ret == -EHWPOISON)) { goto out; @@ -4123,6 +4127,7 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf) } #ifdef CONFIG_TRANSPARENT_HUGEPAGE +#ifndef CONFIG_VIRTUAL_SWAP static inline int non_swapcache_batch(swp_entry_t entry, int max_nr) { swp_slot_t slot = swp_entry_to_swp_slot(entry); @@ -4143,6 +4148,7 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr) return i; } +#endif /* * Check if the PTEs within a range are contiguous swap entries diff --git a/mm/swapfile.c b/mm/swapfile.c index 849525810bbe..c09011867263 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -50,8 +50,10 @@ #include "internal.h" #include "swap.h" +#ifndef CONFIG_VIRTUAL_SWAP static bool swap_count_continued(struct swap_info_struct *, pgoff_t, unsigned char); +#endif static void free_swap_count_continuations(struct swap_info_struct *); static void swap_slot_range_free(struct swap_info_struct *si, struct swap_cluster_info *ci, @@ -156,6 +158,25 @@ static long swap_usage_in_pages(struct swap_info_struct *si) /* Reclaim directly, bypass the slot cache and don't touch device lock */ #define TTRS_DIRECT 0x8 +#ifdef CONFIG_VIRTUAL_SWAP +static inline unsigned char swap_count(unsigned char ent) +{ + return ent; +} + +static bool swap_is_has_cache(struct swap_info_struct *si, + unsigned long offset, int nr_pages) +{ + swp_entry_t entry = swp_slot_to_swp_entry(swp_slot(si->type, offset)); + + return vswap_swapcache_only(entry, nr_pages); +} + +static bool swap_cache_only(struct swap_info_struct *si, unsigned long offset) +{ + return swap_is_has_cache(si, offset, 1); +} +#else static inline unsigned char swap_count(unsigned char ent) { return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */ @@ -176,6 +197,11 @@ static bool swap_is_has_cache(struct swap_info_struct *si, return true; } +static bool swap_cache_only(struct swap_info_struct *si, unsigned long offset) +{ + return READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE; +} + static bool swap_is_last_map(struct swap_info_struct *si, unsigned long offset, int nr_pages, bool *has_cache) { @@ -194,6 +220,7 @@ static bool swap_is_last_map(struct swap_info_struct *si, *has_cache = !!(count & SWAP_HAS_CACHE); return true; } +#endif /* * returns number of pages in the folio that backs the swap 
entry. If positive, @@ -250,7 +277,11 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si, if (!need_reclaim) goto out_unlock; - if (!(flags & TTRS_DIRECT)) { + /* + * For now, virtual swap implementation only supports freeing through the + * swap slot cache... + */ + if (!(flags & TTRS_DIRECT) || IS_ENABLED(CONFIG_VIRTUAL_SWAP)) { /* Free through slot cache */ delete_from_swap_cache(folio); folio_set_dirty(folio); @@ -700,7 +731,12 @@ static bool cluster_reclaim_range(struct swap_info_struct *si, case 0: offset++; break; +#ifdef CONFIG_VIRTUAL_SWAP + /* __try_to_reclaim_swap() checks if the slot is in-cache only */ + case SWAP_MAP_ALLOCATED: +#else case SWAP_HAS_CACHE: +#endif nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT); if (nr_reclaim > 0) offset += nr_reclaim; @@ -731,19 +767,20 @@ static bool cluster_scan_range(struct swap_info_struct *si, { unsigned long offset, end = start + nr_pages; unsigned char *map = si->swap_map; + unsigned char count; for (offset = start; offset < end; offset++) { - switch (READ_ONCE(map[offset])) { - case 0: + count = READ_ONCE(map[offset]); + if (!count) continue; - case SWAP_HAS_CACHE: + + if (swap_cache_only(si, offset)) { if (!vm_swap_full()) return false; *need_reclaim = true; continue; - default: - return false; } + return false; } return true; @@ -836,7 +873,6 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force) long to_scan = 1; unsigned long offset, end; struct swap_cluster_info *ci; - unsigned char *map = si->swap_map; int nr_reclaim; if (force) @@ -848,7 +884,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force) to_scan--; while (offset < end) { - if (READ_ONCE(map[offset]) == SWAP_HAS_CACHE) { + if (swap_cache_only(si, offset)) { spin_unlock(&ci->lock); nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT); @@ -1175,6 +1211,10 @@ static int cluster_alloc_swap(struct swap_info_struct *si, { int n_ret = 0; +#ifdef CONFIG_VIRTUAL_SWAP + VM_WARN_ON(usage != SWAP_MAP_ALLOCATED); +#endif + while (n_ret < nr) { unsigned long offset = cluster_alloc_swap_slot(si, order, usage); @@ -1192,6 +1232,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si, { unsigned int nr_pages = 1 << order; +#ifdef CONFIG_VIRTUAL_SWAP + VM_WARN_ON(usage != SWAP_MAP_ALLOCATED); +#endif + /* * We try to cluster swap pages by allocating them sequentially * in swap. 
Once we've allocated SWAPFILE_CLUSTER pages this @@ -1248,7 +1292,13 @@ int swap_slot_alloc(int n_goal, swp_slot_t swp_slots[], int entry_order) long avail_pgs; int n_ret = 0; int node; + unsigned char usage; +#ifdef CONFIG_VIRTUAL_SWAP + usage = SWAP_MAP_ALLOCATED; +#else + usage = SWAP_HAS_CACHE; +#endif spin_lock(&swap_avail_lock); avail_pgs = atomic_long_read(&nr_swap_pages) / size; @@ -1268,8 +1318,7 @@ int swap_slot_alloc(int n_goal, swp_slot_t swp_slots[], int entry_order) plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]); spin_unlock(&swap_avail_lock); if (get_swap_device_info(si)) { - n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE, - n_goal, swp_slots, order); + n_ret = scan_swap_map_slots(si, usage, n_goal, swp_slots, order); swap_slot_put_swap_info(si); if (n_ret || size > 1) goto check_out; @@ -1402,6 +1451,17 @@ struct swap_info_struct *swap_slot_tryget_swap_info(swp_slot_t slot) return NULL; } +#ifdef CONFIG_VIRTUAL_SWAP +static unsigned char __swap_slot_free_locked(struct swap_info_struct *si, + unsigned long offset, + unsigned char usage) +{ + VM_WARN_ON(usage != 1); + VM_WARN_ON(si->swap_map[offset] != SWAP_MAP_ALLOCATED); + + return 0; +} +#else static unsigned char __swap_slot_free_locked(struct swap_info_struct *si, unsigned long offset, unsigned char usage) @@ -1499,6 +1559,7 @@ static bool __swap_slots_free(struct swap_info_struct *si, } return has_cache; } +#endif /* CONFIG_VIRTUAL_SWAP */ /* * Drop the last HAS_CACHE flag of swap entries, caller have to @@ -1511,21 +1572,17 @@ static void swap_slot_range_free(struct swap_info_struct *si, unsigned long offset = swp_slot_offset(slot); unsigned char *map = si->swap_map + offset; unsigned char *map_end = map + nr_pages; - swp_entry_t entry = swp_slot_to_swp_entry(slot); -#ifdef CONFIG_VIRTUAL_SWAP - int i; + unsigned char usage; - /* release all the associated (virtual) swap slots */ - for (i = 0; i < nr_pages; i++) { - vswap_free(entry); - entry.val++; - } +#ifdef CONFIG_VIRTUAL_SWAP + usage = SWAP_MAP_ALLOCATED; #else /* * In the new (i.e virtual swap) implementation, we will let the virtual * swap layer handle the cgroup swap accounting and charging. */ - mem_cgroup_uncharge_swap(entry, nr_pages); + mem_cgroup_uncharge_swap(swp_slot_to_swp_entry(slot), nr_pages); + usage = SWAP_HAS_CACHE; #endif /* It should never free entries across different clusters */ @@ -1535,7 +1592,7 @@ static void swap_slot_range_free(struct swap_info_struct *si, ci->count -= nr_pages; do { - VM_BUG_ON(*map != SWAP_HAS_CACHE); + VM_BUG_ON(*map != usage); *map = 0; } while (++map < map_end); @@ -1580,6 +1637,7 @@ void swap_slot_free_nr(swp_slot_t slot, int nr_pages) } } +#ifndef CONFIG_VIRTUAL_SWAP /* * Caller has made sure that the swap device corresponding to entry * is still around or has not been recycled. @@ -1588,9 +1646,11 @@ void swap_free_nr(swp_entry_t entry, int nr_pages) { swap_slot_free_nr(swp_entry_to_swp_slot(entry), nr_pages); } +#endif /* - * Called after dropping swapcache to decrease refcnt to swap entries. + * This should only be called in contexts in which the slot has + * been allocated but not associated with any swap entries. 
*/ void swap_slot_put_folio(swp_slot_t slot, struct folio *folio) { @@ -1598,23 +1658,31 @@ void swap_slot_put_folio(swp_slot_t slot, struct folio *folio) struct swap_cluster_info *ci; struct swap_info_struct *si; int size = 1 << swap_slot_order(folio_order(folio)); + unsigned char usage; si = _swap_info_get(slot); if (!si) return; +#ifdef CONFIG_VIRTUAL_SWAP + usage = SWAP_MAP_ALLOCATED; +#else + usage = SWAP_HAS_CACHE; +#endif + ci = lock_cluster(si, offset); if (swap_is_has_cache(si, offset, size)) swap_slot_range_free(si, ci, slot, size); else { for (int i = 0; i < size; i++, slot.val++) { - if (!__swap_slot_free_locked(si, offset + i, SWAP_HAS_CACHE)) + if (!__swap_slot_free_locked(si, offset + i, usage)) swap_slot_range_free(si, ci, slot, 1); } } unlock_cluster(ci); } +#ifndef CONFIG_VIRTUAL_SWAP int __swap_count(swp_entry_t entry) { swp_slot_t slot = swp_entry_to_swp_slot(entry); @@ -1785,7 +1853,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) */ for (offset = start_offset; offset < end_offset; offset += nr) { nr = 1; - if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) { + if (swap_cache_only(si, offset)) { /* * Folios are always naturally aligned in swap so * advance forward to the next boundary. Zero means no @@ -1807,6 +1875,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) out: swap_slot_put_swap_info(si); } +#endif /* CONFIG_VIRTUAL_SWAP */ void swap_slot_cache_free_slots(swp_slot_t *slots, int n) { @@ -3587,6 +3656,14 @@ pgoff_t __folio_swap_cache_index(struct folio *folio) } EXPORT_SYMBOL_GPL(__folio_swap_cache_index); +#ifdef CONFIG_VIRTUAL_SWAP +/* + * We do not use continuation in virtual swap implementation. + */ +static void free_swap_count_continuations(struct swap_info_struct *si) +{ +} +#else /* CONFIG_VIRTUAL_SWAP */ /* * Verify that nr swap entries are valid and increment their swap map counts. * @@ -3944,6 +4021,7 @@ static void free_swap_count_continuations(struct swap_info_struct *si) } } } +#endif /* CONFIG_VIRTUAL_SWAP */ #if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP) void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp) diff --git a/mm/vswap.c b/mm/vswap.c index 3792fa7f766b..513d000a134c 100644 --- a/mm/vswap.c +++ b/mm/vswap.c @@ -18,8 +18,23 @@ * us to change the backing state of a swapped out page without having to * update every single page table entries referring to it. * - * For now, there is a one-to-one correspondence between a virtual swap slot - * and its associated physical swap slot. + * + * I. Swap Entry Lifetime + * + * The swap entry's lifetime is now managed at the virtual swap layer. We + * assign each virtual swap slot a reference count, which includes: + * + * 1. The number of page table entries that refer to the virtual swap slot, i.e + * its swap count. + * + * 2. Whether the virtual swap slot has been added to the swap cache - if so, + * its reference count is incremented by 1. + * + * Each virtual swap slot starts out with a reference count of 1 (since it is + * about to be added to the swap cache). Its reference count is incremented or + * decremented every time it is mapped to or unmapped from a PTE, as well as + * when it is added to or removed from the swap cache. Finally, when its + * reference count reaches 0, the virtual swap slot is freed. */ /** @@ -27,14 +42,24 @@ * * @slot: The handle to the physical swap slot backing this page. * @rcu: The RCU head to free the descriptor with an RCU grace period. + * @lock: The lock protecting the swap slot backing field. 
* @memcgid: The memcg id of the owning memcg, if any. + * @swap_refs: This field stores all the references to the swap entry. The + * least significant bit indicates whether the swap entry is (about + * to be) pinned in swap cache. The remaining bits tell us the + * number of page table entries that refer to the swap entry. */ struct swp_desc { swp_slot_t slot; struct rcu_head rcu; + + rwlock_t lock; + #ifdef CONFIG_MEMCG atomic_t memcgid; #endif + + atomic_t swap_refs; }; /* Virtual swap space - swp_entry_t -> struct swp_desc */ @@ -78,6 +103,11 @@ static struct kmem_cache *swp_desc_cache; static atomic_t vswap_alloc_reject; static atomic_t vswap_used; +/* least significant bit is for swap cache pin, the rest is for swap count. */ +#define SWAP_CACHE_SHIFT 1 +#define SWAP_CACHE_INC 1 +#define SWAP_COUNT_INC 2 + #ifdef CONFIG_DEBUG_FS #include @@ -129,6 +159,9 @@ static swp_entry_t vswap_alloc(int nr) for (i = 0; i < nr; i++) { descs[i]->slot.val = 0; atomic_set(&descs[i]->memcgid, 0); + /* swap entry is about to be added to the swap cache */ + atomic_set(&descs[i]->swap_refs, 1); + rwlock_init(&descs[i]->lock); } xa_lock(&vswap_map); @@ -215,7 +248,7 @@ static inline void release_vswap_slot(unsigned long index) * vswap_free - free a virtual swap slot. * @id: the virtual swap slot to free */ -void vswap_free(swp_entry_t entry) +static void vswap_free(swp_entry_t entry) { struct swp_desc *desc; @@ -233,6 +266,7 @@ void vswap_free(swp_entry_t entry) /* we only charge after linkage was established */ mem_cgroup_uncharge_swap(entry, 1); xa_erase(&vswap_rmap, desc->slot.val); + swap_slot_free_nr(desc->slot, 1); } /* erase forward mapping and release the virtual slot for reallocation */ @@ -332,12 +366,24 @@ swp_entry_t folio_alloc_swap(struct folio *folio) swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry) { struct swp_desc *desc; + swp_slot_t slot; if (!entry.val) return (swp_slot_t){0}; + rcu_read_lock(); desc = xa_load(&vswap_map, entry.val); - return desc ? desc->slot : (swp_slot_t){0}; + if (!desc) { + rcu_read_unlock(); + return (swp_slot_t){0}; + } + + read_lock(&desc->lock); + slot = desc->slot; + read_unlock(&desc->lock); + rcu_read_unlock(); + + return slot; } /** @@ -349,13 +395,355 @@ swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry) */ swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot) { - void *entry = xa_load(&vswap_rmap, slot.val); + swp_entry_t ret; + void *entry; + rcu_read_lock(); /* * entry can be NULL if we fail to link the virtual and physical swap slot * during the swap slot allocation process. */ - return entry ? (swp_entry_t){xa_to_value(entry)} : (swp_entry_t){0}; + entry = xa_load(&vswap_rmap, slot.val); + if (!entry) + ret.val = 0; + else + ret = (swp_entry_t){xa_to_value(entry)}; + rcu_read_unlock(); + return ret; +} + +/* + * Decrease the swap count of nr contiguous swap entries by 1 (when the swap + * entries are removed from a range of PTEs), and check if any of the swap + * entries are in swap cache only after its swap count is decreased. + * + * The check is racy, but it is OK because free_swap_and_cache_nr() only use + * the result as a hint. 
+ */ +static bool vswap_free_nr_any_cache_only(swp_entry_t entry, int nr) +{ + struct swp_desc *desc; + bool ret = false; + int end = entry.val + nr - 1; + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, end) { + if (xas_retry(&xas, desc)) + continue; + + /* 1 page table entry ref + 1 swap cache ref == 11 (binary) */ + ret |= (atomic_read(&desc->swap_refs) == 3); + if (atomic_sub_and_test(SWAP_COUNT_INC, &desc->swap_refs)) + vswap_free(entry); + entry.val++; + } + rcu_read_unlock(); + return ret; +} + +/** + * swap_free_nr - decrease the swap count of nr contiguous swap entries by 1 + * (when the swap entries are removed from a range of PTEs). + * @entry: the first entry in the range. + * @nr: the number of entries in the range. + */ +void swap_free_nr(swp_entry_t entry, int nr) +{ + vswap_free_nr_any_cache_only(entry, nr); +} + +static int swap_duplicate_nr(swp_entry_t entry, int nr) +{ + struct swp_desc *desc; + int i = 0; + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + if (!desc || !atomic_add_unless(&desc->swap_refs, SWAP_COUNT_INC, 0)) + goto done; + i++; + } +done: + rcu_read_unlock(); + if (i && i < nr) + swap_free_nr(entry, i); + + return i == nr ? 0 : -ENOENT; +} + +/** + * swap_duplicate - increase the swap count of the swap entry by 1 (i.e when + * the swap entry is stored at a new PTE). + * @entry: the swap entry. + * + * Return: 0 (always). + * + * Note that according to the existing API, we ALWAYS returns 0 unless a swap + * continuation is required (which is no longer the case in the new design). + */ +int swap_duplicate(swp_entry_t entry) +{ + swap_duplicate_nr(entry, 1); + return 0; +} + +static int vswap_swap_count(atomic_t *swap_refs) +{ + return atomic_read(swap_refs) >> SWAP_CACHE_SHIFT; +} + +bool folio_swapped(struct folio *folio) +{ + swp_entry_t entry = folio->swap; + int nr = folio_nr_pages(folio); + struct swp_desc *desc; + bool swapped = false; + + if (!entry.val) + return false; + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + if (desc && vswap_swap_count(&desc->swap_refs)) { + swapped = true; + break; + } + } + rcu_read_unlock(); + return swapped; +} + +/** + * swp_swapcount - return the swap count of the swap entry. + * @id: the swap entry. + * + * Note that all the swap count functions are identical in the new design, + * since we no longer need swap count continuation. + * + * Return: the swap count of the swap entry. + */ +int swp_swapcount(swp_entry_t entry) +{ + struct swp_desc *desc; + unsigned int ret; + + rcu_read_lock(); + desc = xa_load(&vswap_map, entry.val); + ret = desc ? 
vswap_swap_count(&desc->swap_refs) : 0; + rcu_read_unlock(); + + return ret; +} + +int __swap_count(swp_entry_t entry) +{ + return swp_swapcount(entry); +} + +int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry) +{ + return swp_swapcount(entry); +} + +void swap_shmem_alloc(swp_entry_t entry, int nr) +{ + swap_duplicate_nr(entry, nr); +} + +void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr) +{ + struct swp_desc *desc; + int end = entry.val + nr - 1; + + if (!nr) + return; + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, end) { + if (xas_retry(&xas, desc)) + continue; + + if (atomic_dec_and_test(&desc->swap_refs)) + vswap_free(entry); + entry.val++; + } + rcu_read_unlock(); +} + +int swapcache_prepare(swp_entry_t entry, int nr) +{ + struct swp_desc *desc; + int old, new, i = 0, ret = 0; + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + if (!desc) { + ret = -ENOENT; + goto done; + } + + old = atomic_read(&desc->swap_refs); + do { + new = old; + ret = 0; + + if (!old) + ret = -ENOENT; + else if (old & SWAP_CACHE_INC) + ret = -EEXIST; + else + new += SWAP_CACHE_INC; + } while (!atomic_try_cmpxchg(&desc->swap_refs, &old, new)); + + if (ret) + goto done; + + i++; + } +done: + rcu_read_unlock(); + if (i && i < nr) + swapcache_clear(NULL, entry, i); + if (i < nr && !ret) + ret = -ENOENT; + return ret; +} + +/** + * vswap_swapcache_only - check if all the slots in the range are still valid, + * and are in swap cache only (i.e not stored in any + * PTEs). + * @entry: the first slot in the range. + * @nr: the number of slots in the range. + * + * Return: true if all the slots in the range are still valid, and are in swap + * cache only, or false otherwise. + */ +bool vswap_swapcache_only(swp_entry_t entry, int nr) +{ + struct swp_desc *desc; + int i = 0; + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + if (!desc || atomic_read(&desc->swap_refs) != SWAP_CACHE_INC) + goto done; + i++; + } +done: + rcu_read_unlock(); + return i == nr; +} + +/** + * non_swapcache_batch - count the longest range starting from a particular + * swap slot that are stil valid, but not in swap cache. + * @entry: the first slot to check. + * @max_nr: the maximum number of slots to check. + * + * Return: the number of slots in the longest range that are still valid, but + * not in swap cache. + */ +int non_swapcache_batch(swp_entry_t entry, int max_nr) +{ + struct swp_desc *desc; + int swap_refs, i = 0; + + if (!entry.val) + return 0; + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + max_nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + swap_refs = atomic_read(&desc->swap_refs); + if (!(swap_refs & SWAP_CACHE_INC) && (swap_refs >> SWAP_CACHE_SHIFT)) + goto done; + i++; + } +done: + rcu_read_unlock(); + return i; +} + +/** + * free_swap_and_cache_nr() - Release a swap count on range of swap entries and + * reclaim their cache if no more references remain. + * @entry: First entry of range. + * @nr: Number of entries in range. + * + * For each swap entry in the contiguous range, release a swap count. If any + * swap entries have their swap count decremented to zero, try to reclaim their + * associated swap cache pages. 
+ */ void free_swap_and_cache_nr(swp_entry_t entry, int nr) { + int i = 0, incr = 1; + struct folio *folio; + + if (non_swap_entry(entry)) + return; + + if (vswap_free_nr_any_cache_only(entry, nr)) { + while (i < nr) { + incr = 1; + if (vswap_swapcache_only(entry, 1)) { + folio = filemap_get_folio(swap_address_space(entry), + swap_cache_index(entry)); + if (IS_ERR(folio)) + goto next; + if (!folio_trylock(folio)) { + folio_put(folio); + goto next; + } + incr = folio_nr_pages(folio); + folio_free_swap(folio); + folio_unlock(folio); + folio_put(folio); + } +next: + i += incr; + entry.val += incr; + } + } +} + +/* + * Called after dropping swapcache to decrease refcnt to swap entries. + */ +void put_swap_folio(struct folio *folio, swp_entry_t entry) +{ + int nr = folio_nr_pages(folio); + + VM_WARN_ON(!folio_test_locked(folio)); + swapcache_clear(NULL, entry, nr); } #ifdef CONFIG_MEMCG
From patchwork Tue Apr 29 23:38:39 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 886257
From: Nhat Pham
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org, peterx@redhat.com
Subject: [RFC PATCH v2 11/18] mm: swap: temporarily disable THP swapin and batched freeing swap
Date: Tue, 29 Apr 2025 16:38:39 -0700
Message-ID: <20250429233848.3093350-12-nphamcs@gmail.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20250429233848.3093350-1-nphamcs@gmail.com>
References: <20250429233848.3093350-1-nphamcs@gmail.com>
Precedence: bulk
X-Mailing-List: linux-pm@vger.kernel.org
MIME-Version: 1.0

Disable THP swapin on the virtual swap implementation, for now. Similarly, only operate on one swap entry at a time when we zap a PTE range.

There is no real reason why we cannot build support for this in the new design. It is simply to make the following patch, which decouples swap backends, smaller and more manageable for reviewers - these capabilities will be restored in a later patch.
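To see why this is only a performance regression and not a correctness issue, consider a rough, hypothetical zap-style caller of swap_pte_batch() (a sketch for illustration, not code from this series): with the batch size pinned to 1, the loop below simply degrades to freeing one swap entry per iteration.

/*
 * Illustrative sketch only, not part of this patch: a simplified,
 * hypothetical zap-style loop. It assumes every PTE in the range is a
 * plain swap entry. With swap_pte_batch() temporarily returning 1
 * under CONFIG_VIRTUAL_SWAP, nr is always 1, so swap entries are
 * freed one at a time: still correct, just without batching.
 */
static void zap_swap_ptes_sketch(pte_t *ptep, int max_nr)
{
	while (max_nr) {
		pte_t pte = ptep_get(ptep);
		int nr = swap_pte_batch(ptep, max_nr, pte);	/* 1 for now */

		free_swap_and_cache_nr(pte_to_swp_entry(pte), nr);
		ptep += nr;
		max_nr -= nr;
	}
}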
Signed-off-by: Nhat Pham --- mm/internal.h | 16 ++++++++-------- mm/memory.c | 4 +++- 2 files changed, 11 insertions(+), 9 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index ca28729f822a..51061691a731 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -268,17 +268,12 @@ static inline swp_entry_t swap_nth(swp_entry_t entry, long n) return (swp_entry_t) { entry.val + n }; } -/* similar to swap_nth, but check the backing physical slots as well. */ +/* temporarily disallow batched swap operations */ static inline swp_entry_t swap_move(swp_entry_t entry, long delta) { - swp_slot_t slot = swp_entry_to_swp_slot(entry), next_slot; - swp_entry_t next_entry = swap_nth(entry, delta); - - next_slot = swp_entry_to_swp_slot(next_entry); - if (swp_slot_type(slot) != swp_slot_type(next_slot) || - swp_slot_offset(slot) + delta != swp_slot_offset(next_slot)) - next_entry.val = 0; + swp_entry_t next_entry; + next_entry.val = 0; return next_entry; } #else @@ -349,6 +344,8 @@ static inline pte_t pte_next_swp_offset(pte_t pte) * max_nr must be at least one and must be limited by the caller so scanning * cannot exceed a single page table. * + * Note that for virtual swap space, we will not batch anything for now. + * * Return: the number of table entries in the batch. */ static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte) @@ -363,6 +360,9 @@ static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte) VM_WARN_ON(!is_swap_pte(pte)); VM_WARN_ON(non_swap_entry(entry)); + if (IS_ENABLED(CONFIG_VIRTUAL_SWAP)) + return 1; + cgroup_id = lookup_swap_cgroup_id(entry); while (ptep < end_ptep) { pte = ptep_get(ptep); diff --git a/mm/memory.c b/mm/memory.c index a8c418104f28..2a8fd26fb31d 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4230,8 +4230,10 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) * A large swapped out folio could be partially or fully in zswap. We * lack handling for such cases, so fallback to swapping in order-0 * folio. + * + * We also disable THP swapin on the virtual swap implementation, for now.
*/ - if (!zswap_never_enabled()) + if (!zswap_never_enabled() || IS_ENABLED(CONFIG_VIRTUAL_SWAP)) goto fallback; entry = pte_to_swp_entry(vmf->orig_pte);
From patchwork Tue Apr 29 23:38:40 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 886255
From: Nhat Pham
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org, peterx@redhat.com
Subject: [RFC PATCH v2 12/18] mm: swap: decouple virtual swap slot from backing store
Date: Tue, 29 Apr 2025 16:38:40 -0700
Message-ID: <20250429233848.3093350-13-nphamcs@gmail.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20250429233848.3093350-1-nphamcs@gmail.com>
References: <20250429233848.3093350-1-nphamcs@gmail.com>
Precedence: bulk
X-Mailing-List: linux-pm@vger.kernel.org
MIME-Version: 1.0

This patch presents the first real use case of the new virtual swap design. It leverages the virtualization of the swap space to decouple a swap entry from its backing storage. A swap entry can now be backed by one of the following options:

1. A slot on a physical swapfile/swap partition.
2. A "zero swap page".
3. A compressed object in the zswap pool.
4. An in-memory page. This can happen when a page is loaded (exclusively) from the zswap pool, or if the page is rejected by zswap and zswap writeback is disabled.

This allows us to use zswap and the zero swap page optimization without having to reserve a slot on a swapfile, or even have a swapfile at all. This translates to tens to hundreds of GBs of disk savings on hosts and workloads with high memory usage, and removes a spurious limit on the use of these optimizations.

For now, we still charge virtual swap slots towards the memcg's swap usage. In a following patch, we will change this behavior and only charge physical (i.e., on-swapfile) swap slots towards the memcg's swap usage.
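To make the decoupling concrete, the sketch below shows how a swap-in path can serve a virtual slot from whichever backend currently holds the data. This is conceptual only: vswap_backing_type() is a hypothetical helper and the switch is not the literal structure of this patch; it just names the four backing options above.

/*
 * Conceptual sketch, not the literal code of this patch:
 * vswap_backing_type() is a hypothetical helper. Only the
 * VSWAP_SWAPFILE case requires a slot on a physical swap device.
 */
static void vswap_read_sketch(struct folio *folio, swp_entry_t entry)
{
	switch (vswap_backing_type(entry)) {
	case VSWAP_ZERO:
		/* zero-filled page: no I/O needed at all */
		folio_zero_range(folio, 0, folio_size(folio));
		break;
	case VSWAP_ZSWAP:
		/* decompress from the zswap pool (errors ignored in this sketch) */
		zswap_load(folio);
		break;
	case VSWAP_SWAPFILE:
		/* read from the backing slot on the physical swap device */
		swap_read_folio(folio, NULL);
		break;
	case VSWAP_FOLIO:
		/* the data never left memory; nothing to read */
		break;
	}
}

Because only the VSWAP_SWAPFILE case touches a physical swap device, zswap and the zero swap page optimization no longer need a reserved swapfile slot, which is where the disk savings above come from.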
Signed-off-by: Nhat Pham --- include/linux/swap.h | 66 +++++- mm/huge_memory.c | 5 +- mm/memcontrol.c | 70 ++++-- mm/memory.c | 69 ++++-- mm/migrate.c | 1 + mm/page_io.c | 31 ++- mm/shmem.c | 7 +- mm/swap.h | 10 + mm/swap_state.c | 23 +- mm/swapfile.c | 22 +- mm/vmscan.c | 26 ++- mm/vswap.c | 528 ++++++++++++++++++++++++++++++++++++++----- mm/zswap.c | 34 ++- 13 files changed, 743 insertions(+), 149 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 798adfbd43cb..9c92a982d546 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -462,6 +462,7 @@ extern void __meminit kswapd_stop(int nid); /* Lifetime swap API (mm/swapfile.c or mm/vswap.c) */ swp_entry_t folio_alloc_swap(struct folio *folio); bool folio_free_swap(struct folio *folio); +void put_swap_folio(struct folio *folio, swp_entry_t entry); void swap_shmem_alloc(swp_entry_t, int); int swap_duplicate(swp_entry_t); int swapcache_prepare(swp_entry_t entry, int nr); @@ -509,7 +510,6 @@ static inline long get_nr_swap_pages(void) } void si_swapinfo(struct sysinfo *); -void swap_slot_put_folio(swp_slot_t slot, struct folio *folio); swp_slot_t swap_slot_alloc_of_type(int); int swap_slot_alloc(int n, swp_slot_t swp_slots[], int order); void swap_slot_free_nr(swp_slot_t slot, int nr_pages); @@ -736,9 +736,12 @@ static inline bool mem_cgroup_swap_full(struct folio *folio) } #endif +struct zswap_entry; + #ifdef CONFIG_VIRTUAL_SWAP int vswap_init(void); void vswap_exit(void); +swp_slot_t vswap_alloc_swap_slot(struct folio *folio); swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry); swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot); bool vswap_tryget(swp_entry_t entry); @@ -746,7 +749,13 @@ void vswap_put(swp_entry_t entry); bool folio_swapped(struct folio *folio); bool vswap_swapcache_only(swp_entry_t entry, int nr); int non_swapcache_batch(swp_entry_t entry, int nr); -void put_swap_folio(struct folio *folio, swp_entry_t entry); +void vswap_split_huge_page(struct folio *head, struct folio *subpage); +void vswap_migrate(struct folio *src, struct folio *dst); +bool vswap_disk_backed(swp_entry_t entry, int nr); +bool vswap_folio_backed(swp_entry_t entry, int nr); +void vswap_store_folio(swp_entry_t entry, struct folio *folio); +void swap_zeromap_folio_set(struct folio *folio); +void vswap_assoc_zswap(swp_entry_t entry, struct zswap_entry *zswap_entry); #else /* CONFIG_VIRTUAL_SWAP */ static inline int vswap_init(void) { @@ -781,9 +790,37 @@ static inline swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot) return (swp_entry_t) { slot.val }; } -static inline void put_swap_folio(struct folio *folio, swp_entry_t entry) +static inline swp_slot_t vswap_alloc_swap_slot(struct folio *folio) +{ + return swp_entry_to_swp_slot(folio->swap); +} + +static inline void vswap_split_huge_page(struct folio *head, + struct folio *subpage) +{ +} + +static inline void vswap_migrate(struct folio *src, struct folio *dst) +{ +} + +static inline bool vswap_disk_backed(swp_entry_t entry, int nr) +{ + return true; +} + +static inline bool vswap_folio_backed(swp_entry_t entry, int nr) +{ + return false; +} + +static inline void vswap_store_folio(swp_entry_t entry, struct folio *folio) +{ +} + +static inline void vswap_assoc_zswap(swp_entry_t entry, + struct zswap_entry *zswap_entry) { - swap_slot_put_folio(swp_entry_to_swp_slot(entry), folio); } #endif /* CONFIG_VIRTUAL_SWAP */ @@ -802,11 +839,22 @@ static inline bool trylock_swapoff(swp_entry_t entry, * 2. Swap cache, zswap trees, etc. are all statically declared, and never * freed. 
* - * We do, however, need a reference to the swap device itself, because we + * However, this function does not provide any guarantee that the virtual + * swap slot's backing state will be stable. This has several implications: + * + * 1. We have to obtain a reference to the swap device itself, because we * need swap device's metadata in certain scenarios, for example when we * need to inspect the swap device flag in do_swap_page(). + * + * 2. The swap device we are looking up here might be outdated by the time we + * return to the caller. It is perfectly OK, if the swap_info_struct is only + * used in a best-effort manner (i.e optimization). If we need the precise + * backing state, we need to re-check after the entry is pinned in swapcache. */ - *si = swap_slot_tryget_swap_info(slot); + if (vswap_disk_backed(entry, 1)) + *si = swap_slot_tryget_swap_info(slot); + else + *si = NULL; return IS_ENABLED(CONFIG_VIRTUAL_SWAP) || *si; } @@ -817,5 +865,11 @@ static inline void unlock_swapoff(swp_entry_t entry, swap_slot_put_swap_info(si); } +static inline struct swap_info_struct *vswap_get_device(swp_entry_t entry) +{ + swp_slot_t slot = swp_entry_to_swp_slot(entry); + + return slot.val ? swap_slot_tryget_swap_info(slot) : NULL; +} #endif /* __KERNEL__*/ #endif /* _LINUX_SWAP_H */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 373781b21e5c..e6832ec2b07a 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3172,6 +3172,7 @@ static void __split_huge_page_tail(struct folio *folio, int tail, { struct page *head = &folio->page; struct page *page_tail = head + tail; + /* * Careful: new_folio is not a "real" folio before we cleared PageTail. * Don't pass it around before clear_compound_head(). @@ -3227,8 +3228,10 @@ static void __split_huge_page_tail(struct folio *folio, int tail, VM_WARN_ON_ONCE_PAGE(true, page_tail); page_tail->private = 0; } - if (folio_test_swapcache(folio)) + if (folio_test_swapcache(folio)) { new_folio->swap.val = folio->swap.val + tail; + vswap_split_huge_page(folio, new_folio); + } /* Page flags must be visible before we make the page non-compound. */ smp_wmb(); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index a037ec92881d..126b2d0e6aaa 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5095,10 +5095,23 @@ void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages) rcu_read_unlock(); } +static bool mem_cgroup_may_zswap(struct mem_cgroup *original_memcg); + long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg) { - long nr_swap_pages = get_nr_swap_pages(); + long nr_swap_pages, nr_zswap_pages = 0; + + /* + * If swap is virtualized and zswap is enabled, we can still use zswap even + * if there is no space left in any swap file/partition. 
+ */ + if (IS_ENABLED(CONFIG_VIRTUAL_SWAP) && zswap_is_enabled() && + (mem_cgroup_disabled() || do_memsw_account() || + mem_cgroup_may_zswap(memcg))) { + nr_zswap_pages = PAGE_COUNTER_MAX; + } + nr_swap_pages = max_t(long, nr_zswap_pages, get_nr_swap_pages()); if (mem_cgroup_disabled() || do_memsw_account()) return nr_swap_pages; for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) @@ -5267,6 +5280,29 @@ static struct cftype swap_files[] = { }; #ifdef CONFIG_ZSWAP +static bool mem_cgroup_may_zswap(struct mem_cgroup *original_memcg) +{ + struct mem_cgroup *memcg; + + for (memcg = original_memcg; !mem_cgroup_is_root(memcg); + memcg = parent_mem_cgroup(memcg)) { + unsigned long max = READ_ONCE(memcg->zswap_max); + unsigned long pages; + + if (max == PAGE_COUNTER_MAX) + continue; + if (max == 0) + return false; + + /* Force flush to get accurate stats for charging */ + __mem_cgroup_flush_stats(memcg, true); + pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE; + if (pages >= max) + return false; + } + return true; +} + /** * obj_cgroup_may_zswap - check if this cgroup can zswap * @objcg: the object cgroup @@ -5281,34 +5317,15 @@ static struct cftype swap_files[] = { */ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg) { - struct mem_cgroup *memcg, *original_memcg; + struct mem_cgroup *memcg; bool ret = true; if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) return true; - original_memcg = get_mem_cgroup_from_objcg(objcg); - for (memcg = original_memcg; !mem_cgroup_is_root(memcg); - memcg = parent_mem_cgroup(memcg)) { - unsigned long max = READ_ONCE(memcg->zswap_max); - unsigned long pages; - - if (max == PAGE_COUNTER_MAX) - continue; - if (max == 0) { - ret = false; - break; - } - - /* Force flush to get accurate stats for charging */ - __mem_cgroup_flush_stats(memcg, true); - pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE; - if (pages < max) - continue; - ret = false; - break; - } - mem_cgroup_put(original_memcg); + memcg = get_mem_cgroup_from_objcg(objcg); + ret = mem_cgroup_may_zswap(memcg); + mem_cgroup_put(memcg); return ret; } @@ -5452,6 +5469,11 @@ static struct cftype zswap_files[] = { }, { } /* terminate */ }; +#else +static inline bool mem_cgroup_may_zswap(struct mem_cgroup *original_memcg) +{ + return false; +} #endif /* CONFIG_ZSWAP */ static int __init mem_cgroup_swap_init(void) diff --git a/mm/memory.c b/mm/memory.c index 2a8fd26fb31d..d9c382a5e157 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4311,12 +4311,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) struct folio *swapcache, *folio = NULL; DECLARE_WAITQUEUE(wait, current); struct page *page; - struct swap_info_struct *si = NULL; + struct swap_info_struct *si = NULL, *stable_si; rmap_t rmap_flags = RMAP_NONE; bool need_clear_cache = false; bool swapoff_locked = false; bool exclusive = false; - swp_entry_t entry; + swp_entry_t orig_entry, entry; swp_slot_t slot; pte_t pte; vm_fault_t ret = 0; @@ -4330,6 +4330,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out; entry = pte_to_swp_entry(vmf->orig_pte); + /* + * entry might change if we get a large folio - remember the original entry + * for unlocking swapoff etc. 
+ */ + orig_entry = entry; if (unlikely(non_swap_entry(entry))) { if (is_migration_entry(entry)) { migration_entry_wait(vma->vm_mm, vmf->pmd, @@ -4387,7 +4392,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) swapcache = folio; if (!folio) { - if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && + if (si && data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1) { /* skip swapcache */ folio = alloc_swap_folio(vmf); @@ -4597,27 +4602,43 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * swapcache -> certainly exclusive. */ exclusive = true; - } else if (exclusive && folio_test_writeback(folio) && - data_race(si->flags & SWP_STABLE_WRITES)) { + } else if (exclusive && folio_test_writeback(folio)) { /* - * This is tricky: not all swap backends support - * concurrent page modifications while under writeback. - * - * So if we stumble over such a page in the swapcache - * we must not set the page exclusive, otherwise we can - * map it writable without further checks and modify it - * while still under writeback. - * - * For these problematic swap backends, simply drop the - * exclusive marker: this is perfectly fine as we start - * writeback only if we fully unmapped the page and - * there are no unexpected references on the page after - * unmapping succeeded. After fully unmapped, no - * further GUP references (FOLL_GET and FOLL_PIN) can - * appear, so dropping the exclusive marker and mapping - * it only R/O is fine. + * We need to look up the swap device again here, for the virtual + * swap case. The si we got from trylock_swapoff() is not + * guaranteed to be stable, as at that time we have not pinned + * the virtual swap slot's backing storage. With the folio locked + * and loaded into the swap cache, we can now guarantee a stable + * backing state. */ - exclusive = false; + if (IS_ENABLED(CONFIG_VIRTUAL_SWAP)) + stable_si = vswap_get_device(entry); + else + stable_si = si; + if (stable_si && data_race(stable_si->flags & SWP_STABLE_WRITES)) { + /* + * This is tricky: not all swap backends support + * concurrent page modifications while under writeback. + * + * So if we stumble over such a page in the swapcache + * we must not set the page exclusive, otherwise we can + * map it writable without further checks and modify it + * while still under writeback. + * + * For these problematic swap backends, simply drop the + * exclusive marker: this is perfectly fine as we start + * writeback only if we fully unmapped the page and + * there are no unexpected references on the page after + * unmapping succeeded. After fully unmapped, no + * further GUP references (FOLL_GET and FOLL_PIN) can + * appear, so dropping the exclusive marker and mapping + * it only R/O is fine. 
+ */ + exclusive = false; + } + + if (IS_ENABLED(CONFIG_VIRTUAL_SWAP) && stable_si) + swap_slot_put_swap_info(stable_si); } } @@ -4726,7 +4747,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) wake_up(&swapcache_wq); } if (swapoff_locked) - unlock_swapoff(entry, si); + unlock_swapoff(orig_entry, si); return ret; out_nomap: if (vmf->pte) @@ -4745,7 +4766,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) wake_up(&swapcache_wq); } if (swapoff_locked) - unlock_swapoff(entry, si); + unlock_swapoff(orig_entry, si); return ret; } diff --git a/mm/migrate.c b/mm/migrate.c index 97f0edf0c032..3a2cf62f47ea 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -523,6 +523,7 @@ static int __folio_migrate_mapping(struct address_space *mapping, if (folio_test_swapcache(folio)) { folio_set_swapcache(newfolio); newfolio->private = folio_get_private(folio); + vswap_migrate(folio, newfolio); entries = nr; } else { entries = 1; diff --git a/mm/page_io.c b/mm/page_io.c index 182851c47f43..83fc4a466db8 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -201,6 +201,12 @@ static bool is_folio_zero_filled(struct folio *folio) return true; } +#ifdef CONFIG_VIRTUAL_SWAP +static void swap_zeromap_folio_clear(struct folio *folio) +{ + vswap_store_folio(folio->swap, folio); +} +#else static void swap_zeromap_folio_set(struct folio *folio) { struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio); @@ -238,6 +244,7 @@ static void swap_zeromap_folio_clear(struct folio *folio) clear_bit(swp_slot_offset(slot), sis->zeromap); } } +#endif /* CONFIG_VIRTUAL_SWAP */ /* * We may have stale swap cache pages in memory: notice @@ -246,6 +253,7 @@ static void swap_zeromap_folio_clear(struct folio *folio) int swap_writepage(struct page *page, struct writeback_control *wbc) { struct folio *folio = page_folio(page); + swp_slot_t slot; int ret; if (folio_free_swap(folio)) { @@ -275,9 +283,8 @@ int swap_writepage(struct page *page, struct writeback_control *wbc) return 0; } else { /* - * Clear bits this folio occupies in the zeromap to prevent - * zero data being read in from any previous zero writes that - * occupied the same swap entries. + * Clear the zeromap state to prevent zero data being read in from any + * previous zero writes that occupied the same swap entries. */ swap_zeromap_folio_clear(folio); } @@ -291,6 +298,13 @@ int swap_writepage(struct page *page, struct writeback_control *wbc) return AOP_WRITEPAGE_ACTIVATE; } + /* fall back to physical swap device */ + slot = vswap_alloc_swap_slot(folio); + if (!slot.val) { + folio_mark_dirty(folio); + return AOP_WRITEPAGE_ACTIVATE; + } + __swap_writepage(folio, wbc); return 0; } @@ -624,14 +638,11 @@ static void swap_read_folio_bdev_async(struct folio *folio, void swap_read_folio(struct folio *folio, struct swap_iocb **plug) { - struct swap_info_struct *sis = - swap_slot_swap_info(swp_entry_to_swp_slot(folio->swap)); - bool synchronous = sis->flags & SWP_SYNCHRONOUS_IO; - bool workingset = folio_test_workingset(folio); + struct swap_info_struct *sis; + bool synchronous, workingset = folio_test_workingset(folio); unsigned long pflags; bool in_thrashing; - VM_BUG_ON_FOLIO(!folio_test_swapcache(folio) && !synchronous, folio); VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_FOLIO(folio_test_uptodate(folio), folio); @@ -657,6 +668,10 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug) /* We have to read from slower devices. Increase zswap protection. 
*/ zswap_folio_swapin(folio); + sis = swap_slot_swap_info(swp_entry_to_swp_slot(folio->swap)); + synchronous = sis->flags & SWP_SYNCHRONOUS_IO; + VM_BUG_ON_FOLIO(!folio_test_swapcache(folio) && !synchronous, folio); + if (data_race(sis->flags & SWP_FS_OPS)) { swap_read_folio_fs(folio, plug); } else if (synchronous) { diff --git a/mm/shmem.c b/mm/shmem.c index 4c00b4673468..609971a2b365 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1404,7 +1404,7 @@ static int shmem_find_swap_entries(struct address_space *mapping, * swapin error entries can be found in the mapping. But they're * deliberately ignored here as we've done everything we can do. */ - if (swp_slot_type(slot) != type) + if (!slot.val || swp_slot_type(slot) != type) continue; indices[folio_batch_count(fbatch)] = xas.xa_index; @@ -1554,7 +1554,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) if ((info->flags & VM_LOCKED) || sbinfo->noswap) goto redirty; - if (!total_swap_pages) + if (!IS_ENABLED(CONFIG_VIRTUAL_SWAP) && !total_swap_pages) goto redirty; /* @@ -2295,7 +2295,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, fallback_order0 = true; /* Skip swapcache for synchronous device. */ - if (!fallback_order0 && data_race(si->flags & SWP_SYNCHRONOUS_IO)) { + if (!fallback_order0 && si && + data_race(si->flags & SWP_SYNCHRONOUS_IO)) { folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp); if (!IS_ERR(folio)) { skip_swapcache = true; diff --git a/mm/swap.h b/mm/swap.h index 31c94671cb44..411282d08a15 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -86,9 +86,18 @@ static inline unsigned int folio_swap_flags(struct folio *folio) { swp_slot_t swp_slot = swp_entry_to_swp_slot(folio->swap); + /* + * In the virtual swap implementation, the folio might not be backed by any + * physical swap slots (for e.g zswap-backed only). + */ + if (!swp_slot.val) + return 0; return swap_slot_swap_info(swp_slot)->flags; } +#ifdef CONFIG_VIRTUAL_SWAP +int swap_zeromap_batch(swp_entry_t entry, int max_nr, bool *is_zeromap); +#else /* * Return the count of contiguous swap entries that share the same * zeromap status as the starting entry. If is_zeromap is not NULL, @@ -114,6 +123,7 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr, else return find_next_bit(sis->zeromap, end, start) - start; } +#endif #else /* CONFIG_SWAP */ struct swap_iocb; diff --git a/mm/swap_state.c b/mm/swap_state.c index 16abdb5ce07a..19c0c01f3c6b 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -490,6 +490,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, for (;;) { int err; + /* * First check the swap cache. Since this is normally * called after swap_cache_get_folio() failed, re-calling @@ -527,8 +528,20 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, * Swap entry may have been freed since our caller observed it. */ err = swapcache_prepare(entry, 1); - if (!err) + if (!err) { + /* This might be invoked by swap_cluster_readahead(), which can + * race with shmem_swapin_folio(). The latter might have already + * called delete_from_swap_cache(), allowing swapcache_prepare() + * to succeed here. This can lead to reading bogus data to populate + * the page. To prevent this, skip folio-backed virtual swap slots, + * and let caller retry if necessary. 
+ */ + if (vswap_folio_backed(entry, 1)) { + swapcache_clear(si, entry, 1); + goto put_and_return; + } break; + } else if (err != -EEXIST) goto put_and_return; @@ -711,6 +724,14 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, struct swap_iocb *splug = NULL; bool page_allocated; + /* + * If swap is virtualized, the swap entry might not be backed by any + * physical swap slot. In that case, just skip readahead and bring in the + * target entry. + */ + if (!slot.val) + goto skip; + mask = swapin_nr_pages(offset) - 1; if (!mask) goto skip; diff --git a/mm/swapfile.c b/mm/swapfile.c index c09011867263..83016d86eb1c 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1164,8 +1164,8 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, { unsigned long end = offset + nr_entries - 1; void (*swap_slot_free_notify)(struct block_device *, unsigned long); - unsigned int i; #ifndef CONFIG_VIRTUAL_SWAP + unsigned int i; unsigned long begin = offset; /* @@ -1173,16 +1173,20 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, * slots. We will clear the shadow when the virtual swap slots are freed. */ clear_shadow_from_swap_cache(si->type, begin, end); -#endif /* * Use atomic clear_bit operations only on zeromap instead of non-atomic * bitmap_clear to prevent adjacent bits corruption due to simultaneous writes. + * + * Note that in the virtual swap implementation, we do not need to perform + * these operations, since zswap and zero-filled pages are not backed by + * physical swapfile. */ for (i = 0; i < nr_entries; i++) { clear_bit(offset + i, si->zeromap); zswap_invalidate(swp_slot_to_swp_entry(swp_slot(si->type, offset + i))); } +#endif if (si->flags & SWP_BLKDEV) swap_slot_free_notify = @@ -1646,43 +1650,35 @@ void swap_free_nr(swp_entry_t entry, int nr_pages) { swap_slot_free_nr(swp_entry_to_swp_slot(entry), nr_pages); } -#endif /* * This should only be called in contexts in which the slot has * been allocated but not associated with any swap entries. */ -void swap_slot_put_folio(swp_slot_t slot, struct folio *folio) +void put_swap_folio(struct folio *folio, swp_entry_t entry) { + swp_slot_t slot = swp_entry_to_swp_slot(entry); unsigned long offset = swp_slot_offset(slot); struct swap_cluster_info *ci; struct swap_info_struct *si; int size = 1 << swap_slot_order(folio_order(folio)); - unsigned char usage; si = _swap_info_get(slot); if (!si) return; -#ifdef CONFIG_VIRTUAL_SWAP - usage = SWAP_MAP_ALLOCATED; -#else - usage = SWAP_HAS_CACHE; -#endif - ci = lock_cluster(si, offset); if (swap_is_has_cache(si, offset, size)) swap_slot_range_free(si, ci, slot, size); else { for (int i = 0; i < size; i++, slot.val++) { - if (!__swap_slot_free_locked(si, offset + i, usage)) + if (!__swap_slot_free_locked(si, offset + i, SWAP_HAS_CACHE)) swap_slot_range_free(si, ci, slot, 1); } } unlock_cluster(ci); } -#ifndef CONFIG_VIRTUAL_SWAP int __swap_count(swp_entry_t entry) { swp_slot_t slot = swp_entry_to_swp_slot(entry); diff --git a/mm/vmscan.c b/mm/vmscan.c index c767d71c43d7..db4178bf5f6f 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -341,10 +341,15 @@ static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg, { if (memcg == NULL) { /* - * For non-memcg reclaim, is there - * space in any swap device? + * For non-memcg reclaim: + * + * If swap is virtualized, we can still use zswap even if there is no + * space left in any swap file/partition. + * + * Otherwise, check if there is space in any swap device? 
*/ - if (get_nr_swap_pages() > 0) + if ((IS_ENABLED(CONFIG_VIRTUAL_SWAP) && zswap_is_enabled()) || + get_nr_swap_pages() > 0) return true; } else { /* Is the memcg below its swap limit? */ @@ -2611,12 +2616,15 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, static bool can_age_anon_pages(struct pglist_data *pgdat, struct scan_control *sc) { - /* Aging the anon LRU is valuable if swap is present: */ - if (total_swap_pages > 0) - return true; - - /* Also valuable if anon pages can be demoted: */ - return can_demote(pgdat->node_id, sc); + /* + * Aging the anon LRU is valuable if: + * 1. Swap is virtualized and zswap is enabled. + * 2. There are physical swap slots available. + * 3. Anon pages can be demoted. + */ + return (IS_ENABLED(CONFIG_VIRTUAL_SWAP) && zswap_is_enabled()) || + total_swap_pages > 0 || + can_demote(pgdat->node_id, sc); } #ifdef CONFIG_LRU_GEN diff --git a/mm/vswap.c b/mm/vswap.c index 513d000a134c..a42d346b7e93 100644 --- a/mm/vswap.c +++ b/mm/vswap.c @@ -34,26 +34,59 @@ * about to be added to the swap cache). Its reference count is incremented or * decremented every time it is mapped to or unmapped from a PTE, as well as * when it is added to or removed from the swap cache. Finally, when its - * reference count reaches 0, the virtual swap slot is freed. + * reference count reaches 0, the virtual swap slot is freed, and its backing + * store is released. + * + * + * II. Backing State + * + * Each virtual swap slot can be backed by: + * + * 1. A slot on a physical swap device (i.e., a swapfile or a swap partition). + * 2. A swapped out zero-filled page. + * 3. A compressed object in zswap. + * 4. An in-memory folio that is backed by neither a physical swap device + * nor zswap (i.e., it is only in the swap cache). This is used for pages that are + * rejected by zswap but not (yet) backed by a physical swap device + * (e.g., due to zswap.writeback = 0), or for pages that were previously + * stored in zswap but have since been loaded back into memory (and have + * their zswap copies invalidated). */ +/* The backing state options of a virtual swap slot */ +enum swap_type { + VSWAP_SWAPFILE, + VSWAP_ZERO, + VSWAP_ZSWAP, + VSWAP_FOLIO +}; + /** * Swap descriptor - metadata of a swapped out page. * * @slot: The handle to the physical swap slot backing this page. * @rcu: The RCU head to free the descriptor with an RCU grace period. * @lock: The lock protecting the swap slot backing field. + * @folio: The folio that backs the virtual swap slot. + * @zswap_entry: The zswap entry that backs the virtual swap slot. + * @lock: The lock protecting the swap slot backing fields. * @memcgid: The memcg id of the owning memcg, if any. + * @type: The backing store type of the swap entry. * @swap_refs: This field stores all the references to the swap entry. The * least significant bit indicates whether the swap entry is (about * to be) pinned in swap cache. The remaining bits tell us the * number of page table entries that refer to the swap entry.
*/ struct swp_desc { - swp_slot_t slot; + union { + swp_slot_t slot; + struct folio *folio; + struct zswap_entry *zswap_entry; + }; struct rcu_head rcu; rwlock_t lock; + enum swap_type type; #ifdef CONFIG_MEMCG atomic_t memcgid; @@ -157,6 +190,7 @@ static swp_entry_t vswap_alloc(int nr) } for (i = 0; i < nr; i++) { + descs[i]->type = VSWAP_SWAPFILE; descs[i]->slot.val = 0; atomic_set(&descs[i]->memcgid, 0); /* swap entry is about to be added to the swap cache */ @@ -244,6 +278,72 @@ static inline void release_vswap_slot(unsigned long index) atomic_dec(&vswap_used); } +/* + * Caller needs to handle races with other operations themselves. + * + * For instance, this function is safe to be called in contexts where the swap + * entry has been added to the swap cache and the associated folio is locked. + * We cannot race with other accessors, and the swap entry is guaranteed to be + * valid the whole time (since swap cache implies one refcount). + * + * We also need to make sure the backing state of the entire range matches. + * This is usually already checked by upstream callers. + */ +static inline void release_backing(swp_entry_t entry, int nr) +{ + swp_slot_t slot = (swp_slot_t){0}; + struct swap_info_struct *si; + struct folio *folio = NULL; + enum swap_type type; + struct swp_desc *desc; + int i = 0; + + VM_WARN_ON(!entry.val); + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + VM_WARN_ON(!desc); + write_lock(&desc->lock); + if (!i) { + type = desc->type; + if (type == VSWAP_FOLIO) + folio = desc->folio; + else if (type == VSWAP_SWAPFILE) + slot = desc->slot; + } else { + VM_WARN_ON(type != desc->type); + VM_WARN_ON(type == VSWAP_FOLIO && desc->folio != folio); + VM_WARN_ON(type == VSWAP_SWAPFILE && slot.val && + desc->slot.val != slot.val + i); + } + + if (desc->type == VSWAP_ZSWAP) + zswap_invalidate((swp_entry_t){entry.val + i}); + else if (desc->type == VSWAP_SWAPFILE) { + if (desc->slot.val) { + xa_erase(&vswap_rmap, desc->slot.val); + desc->slot.val = 0; + } + } + write_unlock(&desc->lock); + i++; + } + rcu_read_unlock(); + + if (slot.val) { + si = swap_slot_tryget_swap_info(slot); + if (si) { + swap_slot_free_nr(slot, nr); + swap_slot_put_swap_info(si); + } + } +} + /** * vswap_free - free a virtual swap slot. * @id: the virtual swap slot to free @@ -257,52 +357,88 @@ static void vswap_free(swp_entry_t entry) /* do not immediately erase the virtual slot to prevent its reuse */ desc = xa_load(&vswap_map, entry.val); - if (!desc) - return; virt_clear_shadow_from_swap_cache(entry); - - if (desc->slot.val) { - /* we only charge after linkage was established */ - mem_cgroup_uncharge_swap(entry, 1); - xa_erase(&vswap_rmap, desc->slot.val); - swap_slot_free_nr(desc->slot, 1); - } - + release_backing(entry, 1); + mem_cgroup_uncharge_swap(entry, 1); /* erase forward mapping and release the virtual slot for reallocation */ release_vswap_slot(entry.val); kfree_rcu(desc, rcu); } /** - * folio_alloc_swap - allocate virtual swap slots for a folio. - * @folio: the folio. + * folio_alloc_swap - allocate virtual swap slots for a folio, and + * set their backing store to the folio. + * @folio: the folio to allocate virtual swap slots for. * * Return: the first allocated slot if success, or the zero virtuals swap slot * on failure. 
*/ swp_entry_t folio_alloc_swap(struct folio *folio) { - int i, err, nr = folio_nr_pages(folio); - bool manual_freeing = true; - struct swp_desc *desc; swp_entry_t entry; - swp_slot_t slot; + struct swp_desc *desc; + int i, nr = folio_nr_pages(folio); entry = vswap_alloc(nr); if (!entry.val) return entry; /* - * XXX: for now, we always allocate a physical swap slot for each virtual - * swap slot, and their lifetime are coupled. This will change once we - * decouple virtual swap slots from their backing states, and only allocate - * physical swap slots for them on demand (i.e on zswap writeback, or - * fallback from zswap store failure). + * XXX: for now, we charge towards the memory cgroup's swap limit on virtual + * swap slots allocation. This will be changed soon - we will only charge on + * physical swap slots allocation. + */ + if (mem_cgroup_try_charge_swap(folio, entry)) { + for (i = 0; i < nr; i++) { + vswap_free(entry); + entry.val++; + } + atomic_add(nr, &vswap_alloc_reject); + entry.val = 0; + return entry; + } + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + desc->folio = folio; + desc->type = VSWAP_FOLIO; + } + rcu_read_unlock(); + return entry; +} + +/** + * vswap_alloc_swap_slot - allocate physical swap space for a folio that is + * already associated with virtual swap slots. + * @folio: folio we want to allocate physical swap space for. + * + * Return: the first allocated physical swap slot, if succeeds. + */ +swp_slot_t vswap_alloc_swap_slot(struct folio *folio) +{ + int i, err, nr = folio_nr_pages(folio); + swp_slot_t slot = { .val = 0 }; + swp_entry_t entry = folio->swap; + struct swp_desc *desc; + bool fallback = false; + + /* + * We might have already allocated a backing physical swap slot in past + * attempts (for instance, when we disable zswap). */ + slot = swp_entry_to_swp_slot(entry); + if (slot.val) + return slot; + slot = folio_alloc_swap_slot(folio); if (!slot.val) - goto vswap_free; + return slot; /* establish the vrtual <-> physical swap slots linkages. */ for (i = 0; i < nr; i++) { @@ -312,7 +448,13 @@ swp_entry_t folio_alloc_swap(struct folio *folio) if (err) { while (--i >= 0) xa_erase(&vswap_rmap, slot.val + i); - goto put_physical_swap; + /* + * We have not updated the backing type of the virtual swap slot. + * Simply free up the physical swap slots here! + */ + swap_slot_free_nr(slot, nr); + slot.val = 0; + return slot; } } @@ -324,36 +466,31 @@ swp_entry_t folio_alloc_swap(struct folio *folio) if (xas_retry(&xas, desc)) continue; + write_lock(&desc->lock); + if (desc->type == VSWAP_FOLIO) { + /* case 1: fallback from zswap store failure */ + fallback = true; + if (!folio) + folio = desc->folio; + else + VM_WARN_ON(folio != desc->folio); + } else { + /* + * Case 2: zswap writeback. + * + * No need to free zswap entry here - it will be freed once zswap + * writeback suceeds. + */ + VM_WARN_ON(desc->type != VSWAP_ZSWAP); + VM_WARN_ON(fallback); + } + desc->type = VSWAP_SWAPFILE; desc->slot.val = slot.val + i; + write_unlock(&desc->lock); i++; } rcu_read_unlock(); - - manual_freeing = false; - /* - * XXX: for now, we charge towards the memory cgroup's swap limit on virtual - * swap slots allocation. This is acceptable because as noted above, each - * virtual swap slot corresponds to a physical swap slot. Once we have - * decoupled virtual and physical swap slots, we will only charge when we - * actually allocate a physical swap slot. 
- */ - if (!mem_cgroup_try_charge_swap(folio, entry)) - return entry; - -put_physical_swap: - /* - * There is no any linkage between virtual and physical swap slots yet. We - * have to manually and separately free the allocated virtual and physical - * swap slots. - */ - swap_slot_put_folio(slot, folio); -vswap_free: - if (manual_freeing) { - for (i = 0; i < nr; i++) - vswap_free((swp_entry_t){entry.val + i}); - } - entry.val = 0; - return entry; + return slot; } /** @@ -361,7 +498,9 @@ swp_entry_t folio_alloc_swap(struct folio *folio) * virtual swap slot. * @entry: the virtual swap slot. * - * Return: the physical swap slot corresponding to the virtual swap slot. + * Return: the physical swap slot corresponding to the virtual swap slot, if + * exists, or the zero physical swap slot if the virtual swap slot is not + * backed by any physical slot on a swapfile. */ swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry) { @@ -379,7 +518,10 @@ swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry) } read_lock(&desc->lock); - slot = desc->slot; + if (desc->type != VSWAP_SWAPFILE) + slot = (swp_slot_t){0}; + else + slot = desc->slot; read_unlock(&desc->lock); rcu_read_unlock(); @@ -693,6 +835,286 @@ int non_swapcache_batch(swp_entry_t entry, int max_nr) return i; } +/** + * vswap_split_huge_page - update a subpage's swap descriptor to point to the + * recently split out subpage folio descriptor. + * @head: the original head's folio descriptor. + * @subpage: the subpage's folio descriptor. + */ +void vswap_split_huge_page(struct folio *head, struct folio *subpage) +{ + struct swp_desc *desc = xa_load(&vswap_map, subpage->swap.val); + + write_lock(&desc->lock); + if (desc->type == VSWAP_FOLIO) { + VM_WARN_ON(desc->folio != head); + desc->folio = subpage; + } + write_unlock(&desc->lock); +} + +/** + * vswap_migrate - update the swap entries of the original folio to refer to + * the new folio for migration. + * @old: the old folio. + * @new: the new folio. + */ +void vswap_migrate(struct folio *src, struct folio *dst) +{ + long nr = folio_nr_pages(src), nr_folio_backed = 0; + struct swp_desc *desc; + + VM_WARN_ON(!folio_test_locked(src)); + VM_WARN_ON(!folio_test_swapcache(src)); + + XA_STATE(xas, &vswap_map, src->swap.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, src->swap.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + write_lock(&desc->lock); + if (desc->type == VSWAP_FOLIO) { + VM_WARN_ON(desc->folio != src); + desc->folio = dst; + nr_folio_backed++; + } + write_unlock(&desc->lock); + } + rcu_read_unlock(); + + /* we should not see mixed backing states for swap entries in swap cache */ + VM_WARN_ON(nr_folio_backed && nr_folio_backed != nr); +} + +/** + * vswap_store_folio - set a folio as the backing of a range of virtual swap + * slots. + * @entry: the first virtual swap slot in the range. + * @folio: the folio. + */ +void vswap_store_folio(swp_entry_t entry, struct folio *folio) +{ + int nr = folio_nr_pages(folio); + struct swp_desc *desc; + + VM_BUG_ON(!folio_test_locked(folio)); + VM_BUG_ON(folio->swap.val != entry.val); + + release_backing(entry, nr); + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + write_lock(&desc->lock); + desc->type = VSWAP_FOLIO; + desc->folio = folio; + write_unlock(&desc->lock); + } + rcu_read_unlock(); +} + +/** + * vswap_assoc_zswap - associate a virtual swap slot to a zswap entry. + * @entry: the virtual swap slot. 
+ * @zswap_entry: the zswap entry. + */ +void vswap_assoc_zswap(swp_entry_t entry, struct zswap_entry *zswap_entry) +{ + struct swp_desc *desc; + + release_backing(entry, 1); + + desc = xa_load(&vswap_map, entry.val); + write_lock(&desc->lock); + desc->type = VSWAP_ZSWAP; + desc->zswap_entry = zswap_entry; + write_unlock(&desc->lock); +} + +/** + * swap_zeromap_folio_set - mark a range of virtual swap slots corresponding to + * a folio as zero-filled. + * @folio: the folio + */ +void swap_zeromap_folio_set(struct folio *folio) +{ + struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio); + swp_entry_t entry = folio->swap; + int nr = folio_nr_pages(folio); + struct swp_desc *desc; + + VM_BUG_ON(!folio_test_locked(folio)); + VM_BUG_ON(!entry.val); + + release_backing(entry, nr); + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + write_lock(&desc->lock); + desc->type = VSWAP_ZERO; + write_unlock(&desc->lock); + } + rcu_read_unlock(); + + count_vm_events(SWPOUT_ZERO, nr); + if (objcg) { + count_objcg_events(objcg, SWPOUT_ZERO, nr); + obj_cgroup_put(objcg); + } +} + +/* + * Iterate through the entire range of virtual swap slots, returning the + * longest contiguous range of slots starting from the first slot that satisfies: + * + * 1. If the first slot is zero-mapped, the entire range should be + * zero-mapped. + * 2. If the first slot is backed by a swapfile, the entire range should + * be backed by a range of contiguous swap slots on the same swapfile. + * 3. If the first slot is zswap-backed, the entire range should be + * zswap-backed. + * 4. If the first slot is backed by a folio, the entire range should + * be backed by the same folio. + * + * Note that this check is racy unless we can ensure that the entire range + * has their backing state stable - for instance, if the caller was the one + * who set the in_swapcache flag of the entire field. + */ +static int vswap_check_backing(swp_entry_t entry, enum swap_type *type, int nr) +{ + unsigned int swapfile_type; + enum swap_type first_type; + struct swp_desc *desc; + pgoff_t first_offset; + struct folio *folio; + int i = 0; + + if (!entry.val || non_swap_entry(entry)) + return 0; + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + if (!desc) + goto done; + + read_lock(&desc->lock); + if (!i) { + first_type = desc->type; + if (first_type == VSWAP_SWAPFILE) { + swapfile_type = swp_slot_type(desc->slot); + first_offset = swp_slot_offset(desc->slot); + } else if (first_type == VSWAP_FOLIO) { + folio = desc->folio; + } + } else if (desc->type != first_type) { + read_unlock(&desc->lock); + goto done; + } else if (first_type == VSWAP_SWAPFILE && + (swp_slot_type(desc->slot) != swapfile_type || + swp_slot_offset(desc->slot) != first_offset + i)) { + read_unlock(&desc->lock); + goto done; + } else if (first_type == VSWAP_FOLIO && desc->folio != folio) { + read_unlock(&desc->lock); + goto done; + } + read_unlock(&desc->lock); + i++; + } +done: + rcu_read_unlock(); + if (type) + *type = first_type; + return i; +} + +/** + * vswap_disk_backed - check if the virtual swap slots are backed by physical + * swap slots. + * @entry: the first entry in the range. + * @nr: the number of entries in the range. 
+ */ +bool vswap_disk_backed(swp_entry_t entry, int nr) +{ + enum swap_type type; + + return vswap_check_backing(entry, &type, nr) == nr + && type == VSWAP_SWAPFILE; +} + +/** + * vswap_folio_backed - check if the virtual swap slots are backed by in-memory + * pages. + * @entry: the first virtual swap slot in the range. + * @nr: the number of slots in the range. + */ +bool vswap_folio_backed(swp_entry_t entry, int nr) +{ + enum swap_type type; + + return vswap_check_backing(entry, &type, nr) == nr + && type == VSWAP_FOLIO; +} + +/* + * Return the count of contiguous swap entries that share the same + * VSWAP_ZERO status as the starting entry. If is_zeromap is not NULL, + * it will return the VSWAP_ZERO status of the starting entry. + */ +int swap_zeromap_batch(swp_entry_t entry, int max_nr, bool *is_zeromap) +{ + struct swp_desc *desc; + int i = 0; + bool is_zero = false; + + VM_WARN_ON(!entry.val || non_swap_entry(entry)); + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + max_nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + if (!desc) + goto done; + + read_lock(&desc->lock); + if (!i) { + is_zero = (desc->type == VSWAP_ZERO); + } else { + if ((desc->type == VSWAP_ZERO) != is_zero) { + read_unlock(&desc->lock); + goto done; + } + } + read_unlock(&desc->lock); + i++; + } +done: + rcu_read_unlock(); + if (i && is_zeromap) + *is_zeromap = is_zero; + + return i; +} + /** * free_swap_and_cache_nr() - Release a swap count on range of swap entries and * reclaim their cache if no more references remain. diff --git a/mm/zswap.c b/mm/zswap.c index c1327569ce80..15429825d667 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -1068,6 +1068,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry, struct writeback_control wbc = { .sync_mode = WB_SYNC_NONE, }; + struct zswap_entry *new_entry; + swp_slot_t slot; /* try to allocate swap cache folio */ mpol = get_task_policy(current); @@ -1088,6 +1090,10 @@ static int zswap_writeback_entry(struct zswap_entry *entry, return -EEXIST; } + slot = vswap_alloc_swap_slot(folio); + if (!slot.val) + goto release_folio; + /* * folio is locked, and the swapcache is now secured against * concurrent swapping to and from the slot, and concurrent @@ -1098,12 +1104,9 @@ static int zswap_writeback_entry(struct zswap_entry *entry, * be dereferenced. 
*/ tree = swap_zswap_tree(swpentry); - if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) { - delete_from_swap_cache(folio); - folio_unlock(folio); - folio_put(folio); - return -ENOMEM; - } + new_entry = xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL); + if (entry != new_entry) + goto fail; zswap_decompress(entry, folio); @@ -1124,6 +1127,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry, folio_put(folio); return 0; + +fail: + vswap_assoc_zswap(swpentry, new_entry); +release_folio: + delete_from_swap_cache(folio); + folio_unlock(folio); + folio_put(folio); + return -ENOMEM; } /********************************* @@ -1487,6 +1498,8 @@ static bool zswap_store_page(struct page *page, goto store_failed; } + vswap_assoc_zswap(page_swpentry, entry); + /* * We may have had an existing entry that became stale when * the folio was redirtied and now the new version is being @@ -1608,7 +1621,7 @@ bool zswap_store(struct folio *folio) */ if (!ret) { unsigned type = swp_type(swp); - pgoff_t offset = swp_offset(swp); + pgoff_t offset = zswap_tree_index(swp); struct zswap_entry *entry; struct xarray *tree; @@ -1618,6 +1631,12 @@ bool zswap_store(struct folio *folio) if (entry) zswap_entry_free(entry); } + + /* + * We might have also partially associated some virtual swap slots with + * zswap entries. Undo this. + */ + vswap_store_folio(swp, folio); } return ret; @@ -1674,6 +1693,7 @@ bool zswap_load(struct folio *folio) count_objcg_events(entry->objcg, ZSWPIN, 1); if (swapcache) { + vswap_store_folio(swp, folio); zswap_entry_free(entry); folio_mark_dirty(folio); } From patchwork Tue Apr 29 23:38:41 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Nhat Pham X-Patchwork-Id: 886256 Received: from mail-yw1-f170.google.com (mail-yw1-f170.google.com [209.85.128.170]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 58ABB2E3398; Tue, 29 Apr 2025 23:39:02 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.170 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745969944; cv=none; b=D/Noi2lJqKjPPotc3+DjWFAVbYKCMlPBakgFUP+xQQUyRnwY2h3nofc/q0dR0khSxmCSbgRG5kD5UcvUrdFtpYoK9o3Gi/PFgLDslUhMj1J4XTrjpFHoCR8F4F0nxSEBDXb+V7L9+Sr7FiNFdr2JA+5xiclxDlCLn/mnUsNXQtg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745969944; c=relaxed/simple; bh=Er3hegLgwCFhC/Ruck5UgO4+9V5jnr65pYLNSjYW1MQ=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=prldLTghbHaYzahO1FvhiaBKts+R65FAHnkTCk5VjMSxaFw9h7vhi4bLfESGpCyAscpJabTzwVC2d6mzhoIOwGk7NsvIKEFXhKg6z8SgmZXRSxekJ5YbUj0fOIJ+H1RkEuXj3JUJ8BW7yBVJO02RuhMOAfQdHjL49faSzom+Pos= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=U95xvnwe; arc=none smtp.client-ip=209.85.128.170 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="U95xvnwe" Received: by mail-yw1-f170.google.com with SMTP id 00721157ae682-702628e34f2so4275497b3.0; Tue, 29 Apr 2025 16:39:02 
-0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1745969941; x=1746574741; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=rEZyehd4zLYWxImC6V7ghpPr92k+M9DC2r20TkQaMas=; b=U95xvnweWNESnXY6xWsevHatngHnCFVZdL5I+aolh9+rZ/jvm/9feaj3saLEjXzTnl 4GEclSNVviEZqFC5jXKAt7MUsVIFapPqy25wJ+phZqvvXiGCC2Rn0Khv6+W21OD8hQzc k+avANWqL3M3X3xoJoDIs5RZhyLT+xAuPmTqpxShdS9/WEd1ccyET8KAk8GAQxpBCtbH 8KJdNj3yuakmlyUJize20XPcuHL9W+3cU365aMUWgyXUHXxLcmkUq2xk7lI1fU3IDkRV KLKfYw3j6ZbseI4axe7phPPylRDFxNAFlA7f/kPQCGfCajlgRkLN+JipW5+zhlRLDRme fyew== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1745969941; x=1746574741; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=rEZyehd4zLYWxImC6V7ghpPr92k+M9DC2r20TkQaMas=; b=A5fD4GW8X51OnFEYwPC8ZLGkxmjOmDYurSgrq1kjVmS8wAKSrAC5wUFUrubZiYIkbs 92TjRdqPPs3wCDbg0LUHOW7UpN/Wjjw/Er/YBL7x3TIgATq04LdWzwtx/mrLZ5epDOIz pBJnCBmwFLX3JIbmWpeYAPt8YlhyvZ7FQBjwW7wsikchUOSbvsATlNe019kTD4wjnfQJ Knt1mNMk0Q7oWxm1x0hwqdj148dyz+t1oTFi4skUZYhQtNAlfCOps+uXKMg0RqUDtp/f gMiWsav5Rsg5WpO5GQpVPkonxtxgJ8ENGg3ckXnX62g8e35+ubqGpOCTc020K5yWzaii yQPw== X-Forwarded-Encrypted: i=1; AJvYcCU9qht3Ugt39MPQEUSoEfW0S9H6BWu4ysC7hQZzbYc0Eg0HSnrGUtOxIEYORjQhriQmsAMWxd6SrMs=@vger.kernel.org, AJvYcCVGmUP5RhSmmRkgmFR/VervJq4AnWsLn9oqNXpl2HcIyPU4RT3uzSnW5VGdTAzt08iRb9P5EKKA@vger.kernel.org, AJvYcCXno4ida7KPkJzhmZlGaffy0iYePXWCtw3fzNUxkzIngodv57LGfvcgDKpGWDvOzBdbCLZQOPeNK9eN4poF@vger.kernel.org X-Gm-Message-State: AOJu0Ywtd0J+Z42mAIGSs9r0awBsm+7oFXQxLYjHLVjw59fnjPmlKrCH XfXfONto+CHhAzuwAdKFz62LrfkgbIEFqCIINx91EMT2VoaPOyhw X-Gm-Gg: ASbGnctcNaPhMs0YOSLhziFSzoo20JRfNn0VYCHER1kmaMhfPD8GBFP9lOFN2C0wwlc qJPQabm16UT0to4QMM4NlE5fBrk96KW84e3iqrJ2ip2nwxnb23LR3Zu7kAg8xEu0UtZYHQ4yKVU 42q5RFPnbLLJ+roBXsfpehjKbpeVH10I4xZb6xGUH2LOLTGMTlC1Z8+HcXwRODLeuch3jFf1b+p VZLw7zPJXwXk+DKkIVwFSxLeA7cbygs73r43mWtkG8jLGnwC2dtqdUKb+pAROMfS7yF12pxKiQB 9Vb30tJh8Fj5F2ba1B6eqcl/Fv8yy2sDr8Ug20NEwA== X-Google-Smtp-Source: AGHT+IHLBOt1mg0UkEf7TqpYRdSt7F/Em6EVYpt0sNHt2NR3UzEsMCk2MhPOS7GmiEg/ttkp85F6Dg== X-Received: by 2002:a05:690c:998a:b0:6fb:9445:d28e with SMTP id 00721157ae682-708ad0cc1bdmr11913687b3.10.1745969941309; Tue, 29 Apr 2025 16:39:01 -0700 (PDT) Received: from localhost ([2a03:2880:25ff:8::]) by smtp.gmail.com with ESMTPSA id 00721157ae682-708ae035c65sm760327b3.42.2025.04.29.16.39.00 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 29 Apr 2025 16:39:00 -0700 (PDT) From: Nhat Pham To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org, peterx@redhat.com Subject: [RFC PATCH v2 13/18] zswap: do not start zswap shrinker if there is no physical swap slots Date: Tue, 29 Apr 2025 16:38:41 -0700 Message-ID: <20250429233848.3093350-14-nphamcs@gmail.com> X-Mailer: git-send-email 2.47.1 In-Reply-To: 
<20250429233848.3093350-1-nphamcs@gmail.com> References: <20250429233848.3093350-1-nphamcs@gmail.com> Precedence: bulk X-Mailing-List: linux-pm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 When swap is virtualized, we no longer pre-allocate a slot on swapfile for each zswap entry. Do not start the zswap shrinker if there is no physical swap slots available. Signed-off-by: Nhat Pham --- mm/zswap.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/mm/zswap.c b/mm/zswap.c index 15429825d667..f2f412cc1911 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -1277,6 +1277,14 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker, if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(memcg)) return 0; + /* + * When swap is virtualized, we do not have any swap slots on swapfile + * preallocated for zswap objects. If there is no slot available, we + * cannot writeback and should just bail out here. + */ + if (IS_ENABLED(CONFIG_VIRTUAL_SWAP) && !get_nr_swap_pages()) + return 0; + /* * The shrinker resumes swap writeback, which will enter block * and may enter fs. XXX: Harmonize with vmscan.c __GFP_FS From patchwork Tue Apr 29 23:38:42 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Nhat Pham X-Patchwork-Id: 885983 Received: from mail-yb1-f174.google.com (mail-yb1-f174.google.com [209.85.219.174]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 564FD2E62C2; Tue, 29 Apr 2025 23:39:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.174 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745969945; cv=none; b=n78HJc9tSPcYxYVP22EAVCuYm363GGICrqxlY7D7DbD+LCMRxQBoek2IYanpM3I3Em9bY1Y6h5lxP0ulfXhRbUx4vMVeezU/LhlwsgqhMqnJJlU922wUOmsDmQ3hyw0ly9KRD/PoTb4O2PPGjUi4oKzIg37LNwDV2wyWjZz69FQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745969945; c=relaxed/simple; bh=jbqrUtYtzNa6LKzb9EuCsr+nQwCoCI6+ju1wQS/Q8fw=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=BY9L+BC8TiQOJS2LVnsnwxtD5op2jZi529NQH0IZqTGSnKjRW55PGjmt2LakcZ0Or3AXU0vGt9w2qEH4hnuLxaAfI64yr5o6cujL+5i4u+K6Mqdye02NLolu8yBIcnNUaHBLPuzXekOYJ1ZJCwugBnQG6EjLwii+eA6WhkTX+Gc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=UEWjdksb; arc=none smtp.client-ip=209.85.219.174 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="UEWjdksb" Received: by mail-yb1-f174.google.com with SMTP id 3f1490d57ef6-e6e1cd3f451so5410218276.2; Tue, 29 Apr 2025 16:39:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1745969942; x=1746574742; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=U2S3fikh6wnFZd7wLpLwS9wxRlqN0WemSs8LsRqS7Hc=; b=UEWjdksbTZ4BDqDnlX4+ZtB5LYMSR0MYSefYxQOwx95jXPqPvGB0NxJlP086mKHWAr 
rs/g0VH0zsTQw1GhxA7jA2CRxQAMx04K9kXI8AUf/aTGqEzkBXo1YuRXivMZcSq5OoFY rK5INbn5rKpZieNZS37WTPnOgbK4S/wubMyO0lcmbVR7KlijdiIuafh+TAwoUmiRUtlL p74CA0Ps0+zpBqZ/B1KtdU/unTjJQyTSMpF049j9ctbapofKyhGwpGzyjw9rVp3boZRM jq64BeRZpi6ZVuR9DUusor9lib8xUUEUWPlix57c2vZpLXp+TRTX/PV+lD7qEGDmsOCO uvPg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1745969942; x=1746574742; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=U2S3fikh6wnFZd7wLpLwS9wxRlqN0WemSs8LsRqS7Hc=; b=UshczbXPUyKx3hwmOClCslgZzckEIGijTBRwWVVGiYRwKtycpLQ2VP30tH38+jzG4e tRqXghg00B1amrlGqYGhVbmX2SdVzOrFqcuV+MRowNGvUwdhxh8Xu1VjyZls4oDMpCR7 1oWDuexlq/wnEPejxueevc8JtwrywP2BATC3c9BhZX3yNgM7eoi02+FKqYfHgWoN61wT d8q89H4JjExMGQRNUSI1EGXihI7KfdIoVp76b7+tAo4V8wM5XOWhsK6hBFLiJ+y1PljQ XdaeVVwoCz1PEB7+xX2b6P7pWbGYg9E+AxlatbTYuWJe5WebGzKvBjN7wm2QUWmNZpf1 NU/A== X-Forwarded-Encrypted: i=1; AJvYcCULAHZ0wh2xqOiwjcMhElpvbXzBLZOqffyIqPss+sXh0YPj2jmzVLv5b1dKh+U9E32oswWhChl+/nM=@vger.kernel.org, AJvYcCUjDOfQv0T5qtnyr7p7f198hg4zVEqReDGXea7bAYgLSbvieDz27OsoSXpbhCRlPhwCJchPApxG@vger.kernel.org, AJvYcCXZkEkBYoA93K/85kKC5rcNyYRBd8EzKPM0i/di0g/ecPWaEXi6fnkSTEc9zN9CmzrL4VXgJCpyRsEcvDOW@vger.kernel.org X-Gm-Message-State: AOJu0YzBDnOhrt4vkUC92abAaq1zbP7Dw6FG56x6wslOmQGgh+sX7Ykt N7cofr+Ye9qaiBBbPrPHvSrJapFFcGVHc/WXmhl6+988w+2nWXRt X-Gm-Gg: ASbGncuHWMhe7VORFH/YxHOeOjc9O7b9AHv6TCsgqPfX9wMQKGLCJFFDwOq8wM51tKY jCEp7tc+FfOm4fCkoRNjagib/Y6nZ+v8nTD69Oe6AFPYc1JH/dx6mGpfV7NjHYDah08rnmE7gFP +KHE8+Sz7VkmHyP5cpCWTLm1LUNR7BQ4gjldLVJ9T2SI/qdTz3YhLlZwqdPI5VR049yLK7A57/p F9wVPrMDb+7TIA7QsNvF31NrN6GHxuKZlkUS9SfNx1oURnaSv0FfGRpf2Uv5cU1H9hrLQ3g+X2G Sre/GNi0Q6Mb2CnggwnPFTGjbOgMcWt+ X-Google-Smtp-Source: AGHT+IFU9ZEHTZiihNXiiX7ZqSu0t5gcjbUJ3ndqu9lphvRCw+DJM5mi2qAtWYKs7OOSvIaNJnNGuw== X-Received: by 2002:a05:6902:e04:b0:e73:192b:2963 with SMTP id 3f1490d57ef6-e73eaadb162mr1580318276.14.1745969942215; Tue, 29 Apr 2025 16:39:02 -0700 (PDT) Received: from localhost ([2a03:2880:25ff:74::]) by smtp.gmail.com with ESMTPSA id 3f1490d57ef6-e7412a39697sm64520276.8.2025.04.29.16.39.01 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 29 Apr 2025 16:39:01 -0700 (PDT) From: Nhat Pham To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org, peterx@redhat.com Subject: [RFC PATCH v2 14/18] memcg: swap: only charge physical swap slots Date: Tue, 29 Apr 2025 16:38:42 -0700 Message-ID: <20250429233848.3093350-15-nphamcs@gmail.com> X-Mailer: git-send-email 2.47.1 In-Reply-To: <20250429233848.3093350-1-nphamcs@gmail.com> References: <20250429233848.3093350-1-nphamcs@gmail.com> Precedence: bulk X-Mailing-List: linux-pm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Now that zswap and the zero-filled swap page optimization no longer takes up any physical swap space, we should not charge towards the swap usage and limits of the memcg in these case. 
We will only record the memcg id on virtual swap slot allocation, and defer physical swap charging (i.e towards memory.swap.current) until the virtual swap slot is backed by an actual physical swap slot (on zswap store failure fallback or zswap writeback). Signed-off-by: Nhat Pham --- include/linux/swap.h | 17 ++++++++ mm/memcontrol.c | 102 ++++++++++++++++++++++++++++++++++--------- mm/vswap.c | 43 ++++++++---------- 3 files changed, 118 insertions(+), 44 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 9c92a982d546..a65b22de4cdd 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -690,6 +690,23 @@ static inline void folio_throttle_swaprate(struct folio *folio, gfp_t gfp) #if defined(CONFIG_MEMCG) && defined(CONFIG_SWAP) void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry); + +void __mem_cgroup_record_swap(struct folio *folio, swp_entry_t entry); +static inline void mem_cgroup_record_swap(struct folio *folio, + swp_entry_t entry) +{ + if (!mem_cgroup_disabled()) + __mem_cgroup_record_swap(folio, entry); +} + +void __mem_cgroup_unrecord_swap(swp_entry_t entry, unsigned int nr_pages); +static inline void mem_cgroup_unrecord_swap(swp_entry_t entry, + unsigned int nr_pages) +{ + if (!mem_cgroup_disabled()) + __mem_cgroup_unrecord_swap(entry, nr_pages); +} + int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry); static inline int mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 126b2d0e6aaa..c6bee12f2016 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5020,6 +5020,46 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry) css_put(&memcg->css); } +/** + * __mem_cgroup_record_swap - record the folio's cgroup for the swap entries. + * @folio: folio being swapped out. + * @entry: the first swap entry in the range. + * + * In the virtual swap implementation, we only record the folio's cgroup + * for the virtual swap slots on their allocation. We will only charge + * physical swap slots towards the cgroup's swap usage, i.e when physical swap + * slots are allocated for zswap writeback or fallback from zswap store + * failure. + */ +void __mem_cgroup_record_swap(struct folio *folio, swp_entry_t entry) +{ + unsigned int nr_pages = folio_nr_pages(folio); + struct mem_cgroup *memcg; + + memcg = folio_memcg(folio); + + VM_WARN_ON_ONCE_FOLIO(!memcg, folio); + if (!memcg) + return; + + memcg = mem_cgroup_id_get_online(memcg); + if (nr_pages > 1) + mem_cgroup_id_get_many(memcg, nr_pages - 1); + swap_cgroup_record(folio, mem_cgroup_id(memcg), entry); +} + +void __mem_cgroup_unrecord_swap(swp_entry_t entry, unsigned int nr_pages) +{ + unsigned short id = swap_cgroup_clear(entry, nr_pages); + struct mem_cgroup *memcg; + + rcu_read_lock(); + memcg = mem_cgroup_from_id(id); + if (memcg) + mem_cgroup_id_put_many(memcg, nr_pages); + rcu_read_unlock(); +} + /** * __mem_cgroup_try_charge_swap - try charging swap space for a folio * @folio: folio being added to swap @@ -5038,34 +5078,47 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry) if (do_memsw_account()) return 0; - memcg = folio_memcg(folio); + if (IS_ENABLED(CONFIG_VIRTUAL_SWAP)) { + /* + * In the virtual swap implementation, we already record the cgroup + * on virtual swap allocation. Note that the virtual swap slot holds + * a reference to memcg, so this lookup should be safe. 
+ */ + rcu_read_lock(); + memcg = mem_cgroup_from_id(lookup_swap_cgroup_id(entry)); + rcu_read_unlock(); + } else { + memcg = folio_memcg(folio); - VM_WARN_ON_ONCE_FOLIO(!memcg, folio); - if (!memcg) - return 0; + VM_WARN_ON_ONCE_FOLIO(!memcg, folio); + if (!memcg) + return 0; - if (!entry.val) { - memcg_memory_event(memcg, MEMCG_SWAP_FAIL); - return 0; - } + if (!entry.val) { + memcg_memory_event(memcg, MEMCG_SWAP_FAIL); + return 0; + } - memcg = mem_cgroup_id_get_online(memcg); + memcg = mem_cgroup_id_get_online(memcg); + } if (!mem_cgroup_is_root(memcg) && !page_counter_try_charge(&memcg->swap, nr_pages, &counter)) { memcg_memory_event(memcg, MEMCG_SWAP_MAX); memcg_memory_event(memcg, MEMCG_SWAP_FAIL); - mem_cgroup_id_put(memcg); + if (!IS_ENABLED(CONFIG_VIRTUAL_SWAP)) + mem_cgroup_id_put(memcg); return -ENOMEM; } - /* Get references for the tail pages, too */ - if (nr_pages > 1) - mem_cgroup_id_get_many(memcg, nr_pages - 1); + if (!IS_ENABLED(CONFIG_VIRTUAL_SWAP)) { + /* Get references for the tail pages, too */ + if (nr_pages > 1) + mem_cgroup_id_get_many(memcg, nr_pages - 1); + swap_cgroup_record(folio, mem_cgroup_id(memcg), entry); + } mod_memcg_state(memcg, MEMCG_SWAP, nr_pages); - swap_cgroup_record(folio, mem_cgroup_id(memcg), entry); - return 0; } @@ -5079,7 +5132,11 @@ void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages) struct mem_cgroup *memcg; unsigned short id; - id = swap_cgroup_clear(entry, nr_pages); + if (IS_ENABLED(CONFIG_VIRTUAL_SWAP)) + id = lookup_swap_cgroup_id(entry); + else + id = swap_cgroup_clear(entry, nr_pages); + rcu_read_lock(); memcg = mem_cgroup_from_id(id); if (memcg) { @@ -5090,7 +5147,8 @@ void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages) page_counter_uncharge(&memcg->swap, nr_pages); } mod_memcg_state(memcg, MEMCG_SWAP, -nr_pages); - mem_cgroup_id_put_many(memcg, nr_pages); + if (!IS_ENABLED(CONFIG_VIRTUAL_SWAP)) + mem_cgroup_id_put_many(memcg, nr_pages); } rcu_read_unlock(); } @@ -5099,7 +5157,7 @@ static bool mem_cgroup_may_zswap(struct mem_cgroup *original_memcg); long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg) { - long nr_swap_pages, nr_zswap_pages = 0; + long nr_swap_pages; /* * If swap is virtualized and zswap is enabled, we can still use zswap even @@ -5108,10 +5166,14 @@ long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg) if (IS_ENABLED(CONFIG_VIRTUAL_SWAP) && zswap_is_enabled() && (mem_cgroup_disabled() || do_memsw_account() || mem_cgroup_may_zswap(memcg))) { - nr_zswap_pages = PAGE_COUNTER_MAX; + /* + * No need to check swap cgroup limits, since zswap is not charged + * towards swap consumption. 
+ */ + return PAGE_COUNTER_MAX; } - nr_swap_pages = max_t(long, nr_zswap_pages, get_nr_swap_pages()); + nr_swap_pages = get_nr_swap_pages(); if (mem_cgroup_disabled() || do_memsw_account()) return nr_swap_pages; for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) diff --git a/mm/vswap.c b/mm/vswap.c index a42d346b7e93..c51ff5c54480 100644 --- a/mm/vswap.c +++ b/mm/vswap.c @@ -341,6 +341,7 @@ static inline void release_backing(swp_entry_t entry, int nr) swap_slot_free_nr(slot, nr); swap_slot_put_swap_info(si); } + mem_cgroup_uncharge_swap(entry, nr); } } @@ -360,7 +361,7 @@ static void vswap_free(swp_entry_t entry) virt_clear_shadow_from_swap_cache(entry); release_backing(entry, 1); - mem_cgroup_uncharge_swap(entry, 1); + mem_cgroup_unrecord_swap(entry, 1); /* erase forward mapping and release the virtual slot for reallocation */ release_vswap_slot(entry.val); kfree_rcu(desc, rcu); @@ -378,27 +379,13 @@ swp_entry_t folio_alloc_swap(struct folio *folio) { swp_entry_t entry; struct swp_desc *desc; - int i, nr = folio_nr_pages(folio); + int nr = folio_nr_pages(folio); entry = vswap_alloc(nr); if (!entry.val) return entry; - /* - * XXX: for now, we charge towards the memory cgroup's swap limit on virtual - * swap slots allocation. This will be changed soon - we will only charge on - * physical swap slots allocation. - */ - if (mem_cgroup_try_charge_swap(folio, entry)) { - for (i = 0; i < nr; i++) { - vswap_free(entry); - entry.val++; - } - atomic_add(nr, &vswap_alloc_reject); - entry.val = 0; - return entry; - } - + mem_cgroup_record_swap(folio, entry); XA_STATE(xas, &vswap_map, entry.val); rcu_read_lock(); @@ -440,6 +427,9 @@ swp_slot_t vswap_alloc_swap_slot(struct folio *folio) if (!slot.val) return slot; + if (mem_cgroup_try_charge_swap(folio, entry)) + goto free_phys_swap; + /* establish the vrtual <-> physical swap slots linkages. */ for (i = 0; i < nr; i++) { err = xa_insert(&vswap_rmap, slot.val + i, @@ -448,13 +438,7 @@ swp_slot_t vswap_alloc_swap_slot(struct folio *folio) if (err) { while (--i >= 0) xa_erase(&vswap_rmap, slot.val + i); - /* - * We have not updated the backing type of the virtual swap slot. - * Simply free up the physical swap slots here! - */ - swap_slot_free_nr(slot, nr); - slot.val = 0; - return slot; + goto uncharge; } } @@ -491,6 +475,17 @@ swp_slot_t vswap_alloc_swap_slot(struct folio *folio) } rcu_read_unlock(); return slot; + +uncharge: + mem_cgroup_uncharge_swap(entry, nr); +free_phys_swap: + /* + * We have not updated the backing type of the virtual swap slot. + * Simply free up the physical swap slots here! 
+ */ + swap_slot_free_nr(slot, nr); + slot.val = 0; + return slot; } /** From patchwork Tue Apr 29 23:38:43 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Nhat Pham X-Patchwork-Id: 885982 Received: from mail-yw1-f179.google.com (mail-yw1-f179.google.com [209.85.128.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1347F2E62D8; Tue, 29 Apr 2025 23:39:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.179 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745969946; cv=none; b=Ix8ldi58j6UFZUwfxzzxQUV6iSOny9VTRDEy8GRGvAn7kR0buU6qJNrFT3rotoop9Tcsh4NichW4m5gzKvRMFQKMM+HG8FSssKHltd0JI1vELlWVJzF/3NwoRsja4x9J3RhujFL3co587aug3YQQxNhi6kmmjpPKa5Y2bWgOB68= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745969946; c=relaxed/simple; bh=PrhZu9UGS1Zo3iBepaZyF3RiCrxHnnORdIJ+ZCLi0eQ=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=LmKPk8FKiiwzaFMWfxj18V/LY8cuVYalYZ9oCQHYKUGjba4Xdz8Zb65CTxTBXMAjmu9TEA32cD5O4VBbWytD5yGYr8jAJ8JB7s/F0n31WmVR4/7sQrSDQlFnMzKWJHUpyabb69KDDlExSLZkftv3Rw0uvKUUqD99VDxwDCSu3PM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=k26FpIr2; arc=none smtp.client-ip=209.85.128.179 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="k26FpIr2" Received: by mail-yw1-f179.google.com with SMTP id 00721157ae682-7082ad1355bso56757557b3.1; Tue, 29 Apr 2025 16:39:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1745969943; x=1746574743; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=tMZjxyYXXDk1gidKnzSq/7a4QhnBooZEU1T9UUeNwgw=; b=k26FpIr2TTHTrppxgZmi5qXBcqFOUorYpWlkhI73rC86i+KyfGYf9V8W4RDU4lt24b kpXnZ9A8djTGo0wzhCuJ5deWlgky/ov7/xZT+bWm9q/yp+Ef0UVb+iE+3L4k073itHlq bTs1seApQp5i20W54TYsh0XX8fatwCeNaS1VukT4eMQVmKDF7KvIeNYSi6auBfEuBAm4 JzJaFC6Lv2CqLbPVbw4qxU3uU/iWU6LB60b8Ff0pxEIoiGaMuJQkHbLV+dqnEqyxOOTm xebB6549DCx/iFskyfAykTiw4Q9LN8vnrE3vzV2OY5zJH6kM6mkw9brlaZ+pwuXWTBMA tA8w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1745969943; x=1746574743; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=tMZjxyYXXDk1gidKnzSq/7a4QhnBooZEU1T9UUeNwgw=; b=TyLpqJKVihY7FmK2oXW8IwLaQWVh0NkejnYhc5vlTiL6VZ4D8Pj8nIem0JfFR8ag8X 9eSyett2vENpNfs98LcU/2QO16MI8fLS6dBABS7ZmEgR7p666hgXR2G7KzI6LHAosSuc NB5chvBqWzGgqrm7t2P65pZCGZnXt9nzAjjhASR9oZcrOT2ZFghcKEz9zxwvNi87JQED 7SdzUC5VZkrAk/RUoQg+MHZ2WCCWScqjBXbplGRegSVH6BnzffiSJ/STDiXzS+b0tyZC 1I8sroUCnpn2xO+9LPzydKL3xKyZw9Ra/QAWFL/IerBEXLiO58cSeQ03QbtUMuwayggY DqZQ== X-Forwarded-Encrypted: i=1; AJvYcCVCjt/ks4LnGd+tFGw+Dd135pagiqS8lRKxDolmaK0pyjhIjIzekYEKRrXzcx7XtN6B54MwEOvWJUA=@vger.kernel.org, 
AJvYcCWHyk5aZ9DjdlHP+R5rCPQb3Gj18R9TFgnreou+1ngzePZAE/Y9XOx1zvCGzm4JObqnnqjs8zYnZ/sjK5X3@vger.kernel.org, AJvYcCWr5dafU8Cnj5yoDE1jgd2XvWLmMea5qwdNxfUr3RV36YYKYYtTG9Z1QMsLiE0uTmkNb6YK6vsx@vger.kernel.org X-Gm-Message-State: AOJu0YzNhqGVt658j6EvuWLM2Vj9v1JthtCqHXTUHTuCJzN0s9i/V3rW cs5ukmW36RoGyRmQVpUtiNYGeBcUh87mzQdUik+bdAqXLcy7WxXx X-Gm-Gg: ASbGnculfxJiA/twUQICiU0GMLhY+bYBv7At04/zDDRSi3JwA4gI1+YRCyx1/NnQ4oA crGyU92mBPqEqpdebkp2UxRBhBXsY5glIdQ+Zyk+suBBvFiCGSWpB12T4IwaSK9YumpStY8863z DHW3lcb4D5fLPHm3uTduRHSN/L1DOnfV8hDuL3NW/A9jtkfpPEt0qzQK9w0hL+Z8koP63JwuWmt kvwOA5vrcfZgR86FoWrNmXAtgjv1nau+RovV1jP66+59QiVfso7zwl0GjgatiG7811Lia5PBEjC 1DZbjPR7iWABfCyYdd5F8tkNu0sGUGI= X-Google-Smtp-Source: AGHT+IGELX37yPB7pWIFLWi40woXsWK753X8DNcmjzLObYxT76mYAq69pGlTXHByb9H+htTw1h1vuw== X-Received: by 2002:a05:690c:6e01:b0:6fb:a4e6:7d52 with SMTP id 00721157ae682-708abe47dfdmr21530177b3.35.1745969942983; Tue, 29 Apr 2025 16:39:02 -0700 (PDT) Received: from localhost ([2a03:2880:25ff:7::]) by smtp.gmail.com with ESMTPSA id 00721157ae682-708ae1e9ea3sm701667b3.105.2025.04.29.16.39.02 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 29 Apr 2025 16:39:02 -0700 (PDT) From: Nhat Pham To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org, peterx@redhat.com Subject: [RFC PATCH v2 15/18] vswap: support THP swapin and batch free_swap_and_cache Date: Tue, 29 Apr 2025 16:38:43 -0700 Message-ID: <20250429233848.3093350-16-nphamcs@gmail.com> X-Mailer: git-send-email 2.47.1 In-Reply-To: <20250429233848.3093350-1-nphamcs@gmail.com> References: <20250429233848.3093350-1-nphamcs@gmail.com> Precedence: bulk X-Mailing-List: linux-pm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 This patch implements the required functionalities for THP swapin and batched free_swap_and_cache() in the virtual swap space design. The central requirement is the range of entries we are working with must have no mixed backing states: 1. For now, zswap-backed entries are not supported for these batched operations. 2. All the entries must be backed by the same type. 3. If the swap entries in the batch are backed by in-memory folio, it must be the same folio (i.e they correspond to the subpages of that folio). 4. If the swap entries in the batch are backed by slots on swapfiles, it must be the same swapfile, and these physical swap slots must also be contiguous. 
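To make these requirements concrete: conceptually, the batching code walks the swap descriptors of the range and stops at the first entry whose backing state diverges from the first one. The sketch below is a simplified, userspace-only model of that check; the struct and helper names (demo_desc, demo_batchable) are invented for illustration, and the real kernel-side logic is vswap_check_backing() together with the swap_move()/vswap_can_swapin_thp() helpers added in this patch.

#include <stddef.h>
#include <stdio.h>

enum demo_backing { DEMO_SWAPFILE, DEMO_ZERO, DEMO_ZSWAP, DEMO_FOLIO };

struct demo_desc {
	enum demo_backing type;
	unsigned long slot_offset;	/* used when type == DEMO_SWAPFILE */
	const void *folio;		/* used when type == DEMO_FOLIO */
};

/* Return how many descriptors, starting at descs[0], can be batched. */
static int demo_batchable(const struct demo_desc *descs, int nr)
{
	int i;

	/* zswap-backed entries are not batched for now */
	if (nr <= 0 || descs[0].type == DEMO_ZSWAP)
		return 0;

	for (i = 1; i < nr; i++) {
		if (descs[i].type != descs[0].type)
			break;
		/* physical slots must be contiguous (the real check also
		 * compares the swap device) */
		if (descs[0].type == DEMO_SWAPFILE &&
		    descs[i].slot_offset != descs[0].slot_offset + i)
			break;
		/* folio-backed entries must all point at the same folio */
		if (descs[0].type == DEMO_FOLIO && descs[i].folio != descs[0].folio)
			break;
	}
	return i;
}

int main(void)
{
	struct demo_desc run[3] = {
		{ DEMO_SWAPFILE, 100, NULL },
		{ DEMO_SWAPFILE, 101, NULL },
		{ DEMO_ZSWAP,      0, NULL },
	};

	/* prints 2: the zswap-backed third entry ends the batch */
	printf("batchable entries: %d\n", demo_batchable(run, 3));
	return 0;
}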
Signed-off-by: Nhat Pham --- include/linux/swap.h | 6 +++ mm/internal.h | 14 +------ mm/memory.c | 16 ++++++-- mm/vswap.c | 91 +++++++++++++++++++++++++++++++++++++++++++- 4 files changed, 110 insertions(+), 17 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index a65b22de4cdd..c5a16f1ca376 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -773,6 +773,7 @@ bool vswap_folio_backed(swp_entry_t entry, int nr); void vswap_store_folio(swp_entry_t entry, struct folio *folio); void swap_zeromap_folio_set(struct folio *folio); void vswap_assoc_zswap(swp_entry_t entry, struct zswap_entry *zswap_entry); +bool vswap_can_swapin_thp(swp_entry_t entry, int nr); #else /* CONFIG_VIRTUAL_SWAP */ static inline int vswap_init(void) { @@ -839,6 +840,11 @@ static inline void vswap_assoc_zswap(swp_entry_t entry, struct zswap_entry *zswap_entry) { } + +static inline bool vswap_can_swapin_thp(swp_entry_t entry, int nr) +{ + return true; +} #endif /* CONFIG_VIRTUAL_SWAP */ static inline bool trylock_swapoff(swp_entry_t entry, diff --git a/mm/internal.h b/mm/internal.h index 51061691a731..6694e7a14745 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -268,14 +268,7 @@ static inline swp_entry_t swap_nth(swp_entry_t entry, long n) return (swp_entry_t) { entry.val + n }; } -/* temporary disallow batched swap operations */ -static inline swp_entry_t swap_move(swp_entry_t entry, long delta) -{ - swp_entry_t next_entry; - - next_entry.val = 0; - return next_entry; -} +swp_entry_t swap_move(swp_entry_t entry, long delta); #else static inline swp_entry_t swap_nth(swp_entry_t entry, long n) { @@ -344,8 +337,6 @@ static inline pte_t pte_next_swp_offset(pte_t pte) * max_nr must be at least one and must be limited by the caller so scanning * cannot exceed a single page table. * - * Note that for virtual swap space, we will not batch anything for now. - * * Return: the number of table entries in the batch. */ static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte) @@ -360,9 +351,6 @@ static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte) VM_WARN_ON(!is_swap_pte(pte)); VM_WARN_ON(non_swap_entry(entry)); - if (IS_ENABLED(CONFIG_VIRTUAL_SWAP)) - return 1; - cgroup_id = lookup_swap_cgroup_id(entry); while (ptep < end_ptep) { pte = ptep_get(ptep); diff --git a/mm/memory.c b/mm/memory.c index d9c382a5e157..b0b23348d9be 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4230,10 +4230,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) * A large swapped out folio could be partially or fully in zswap. We * lack handling for such cases, so fallback to swapping in order-0 * folio. - * - * We also disable THP swapin on the virtual swap implementation, for now. */ - if (!zswap_never_enabled() || IS_ENABLED(CONFIG_VIRTUAL_SWAP)) + if (!zswap_never_enabled()) goto fallback; entry = pte_to_swp_entry(vmf->orig_pte); @@ -4423,6 +4421,18 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) } need_clear_cache = true; + /* + * Recheck to make sure the entire range is still + * THP-swapin-able. Note that before we call + * swapcache_prepare(), entries in the range can + * still have their backing status changed. 
+ */ + if (IS_ENABLED(CONFIG_VIRTUAL_SWAP) && + !vswap_can_swapin_thp(entry, nr_pages)) { + schedule_timeout_uninterruptible(1); + goto out_page; + } + mem_cgroup_swapin_uncharge_swap(entry, nr_pages); shadow = get_shadow_from_swap_cache(entry); diff --git a/mm/vswap.c b/mm/vswap.c index c51ff5c54480..4aeb144921b8 100644 --- a/mm/vswap.c +++ b/mm/vswap.c @@ -9,6 +9,7 @@ #include #include #include +#include "internal.h" #include "swap.h" /* @@ -984,7 +985,7 @@ void swap_zeromap_folio_set(struct folio *folio) * * Note that this check is racy unless we can ensure that the entire range * has their backing state stable - for instance, if the caller was the one - * who set the in_swapcache flag of the entire field. + * who set the swap cache pin. */ static int vswap_check_backing(swp_entry_t entry, enum swap_type *type, int nr) { @@ -1067,6 +1068,94 @@ bool vswap_folio_backed(swp_entry_t entry, int nr) && type == VSWAP_FOLIO; } +/** + * vswap_can_swapin_thp - check if the swap entries can be swapped in as a THP. + * @entry: the first virtual swap slot in the range. + * @nr: the number of slots in the range. + * + * For now, we can only swap in a THP if the entire range is zero-filled, or if + * the entire range is backed by a contiguous range of physical swap slots on a + * swapfile. + */ +bool vswap_can_swapin_thp(swp_entry_t entry, int nr) +{ + enum swap_type type; + + return vswap_check_backing(entry, &type, nr) == nr && + (type == VSWAP_ZERO || type == VSWAP_SWAPFILE); +} + +/** + * swap_move - increment the swap slot by delta, checking the backing state and + * return 0 if the backing state does not match (i.e wrong backing + * state type, or wrong offset on the backing stores). + * @entry: the original virtual swap slot. + * @delta: the offset to increment the original slot. + * + * Note that this function is racy unless we can pin the backing state of these + * swap slots down with swapcache_prepare(). + * + * Caller should only rely on this function as a best-effort hint otherwise, + * and should double-check after ensuring the whole range is pinned down. + * + * Return: the incremented virtual swap slot if the backing state matches, or + * 0 if the backing state does not match. + */ +swp_entry_t swap_move(swp_entry_t entry, long delta) +{ + struct swp_desc *desc, *next_desc; + swp_entry_t next_entry; + bool invalid = true; + struct folio *folio; + enum swap_type type; + swp_slot_t slot; + + next_entry.val = entry.val + delta; + + rcu_read_lock(); + desc = xa_load(&vswap_map, entry.val); + next_desc = xa_load(&vswap_map, next_entry.val); + + if (!desc || !next_desc) { + rcu_read_unlock(); + return (swp_entry_t){0}; + } + + read_lock(&desc->lock); + if (desc->type == VSWAP_ZSWAP) { + read_unlock(&desc->lock); + goto rcu_unlock; + } + + type = desc->type; + if (type == VSWAP_FOLIO) + folio = desc->folio; + + if (type == VSWAP_SWAPFILE) + slot = desc->slot; + read_unlock(&desc->lock); + + read_lock(&next_desc->lock); + if (next_desc->type != type) + goto next_unlock; + + if (type == VSWAP_SWAPFILE && + (swp_slot_type(next_desc->slot) != swp_slot_type(slot) || + swp_slot_offset(next_desc->slot) != + swp_slot_offset(slot) + delta)) + goto next_unlock; + + if (type == VSWAP_FOLIO && next_desc->folio != folio) + goto next_unlock; + + invalid = false; +next_unlock: + read_unlock(&next_desc->lock); +rcu_unlock: + rcu_read_unlock(); + return invalid ? (swp_entry_t){0} : next_entry; +} + /* * Return the count of contiguous swap entries that share the same * VSWAP_ZERO status as the starting entry. 
If is_zeromap is not NULL,

From patchwork Tue Apr 29 23:38:44 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 886254
From: Nhat Pham
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org, peterx@redhat.com
Subject: [RFC PATCH v2 16/18] swap: simplify swapoff using virtual swap
Date: Tue, 29 Apr 2025 16:38:44 -0700
Message-ID: <20250429233848.3093350-17-nphamcs@gmail.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20250429233848.3093350-1-nphamcs@gmail.com>
References: <20250429233848.3093350-1-nphamcs@gmail.com>
Precedence: bulk
X-Mailing-List: linux-pm@vger.kernel.org
List-Id: List-Subscribe: List-Unsubscribe:
MIME-Version: 1.0

This patch presents the second application of the virtual swap design: simplifying and optimizing swapoff.

With virtual swap slots stored in the page table entries and used as indices into the various swap-related data structures, we no longer have to perform a page table walk in swapoff. Simply iterate through all the allocated swap slots on the swapfile, invoke the backward map, and fault them in. This is significantly cleaner, as well as slightly more performant, especially when there are a lot of unrelated VMAs (all of which the old swapoff code would have to traverse).

In a simple benchmark, in which we swapoff a 32 GB swapfile that is 50% full, and in which a process maps a 128 GB file into memory:

Baseline:
real: 25.54s
user: 0.00s
sys: 11.48s

New Design:
real: 11.69s
user: 0.00s
sys: 9.96s

Disregarding the real time reduction (which is mostly due to more IO asynchrony), the new design reduces the kernel CPU time by about 13%.
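For orientation, the new try_to_unuse() in the diff below reduces to roughly the following condensed sketch. All identifiers are taken from the patch; locking, I/O plugging, error handling and the retry-on-race logic of the second pass are omitted here:

	/* Condensed sketch of the virtual-swap swapoff path (see the full diff below). */
	static int try_to_unuse(unsigned int type)
	{
		struct swap_info_struct *si = swap_info[type];
		struct swap_iocb *splug = NULL;
		struct mempolicy *mpol = get_task_policy(current);
		unsigned long offset = 0;
		struct folio *folio;
		swp_entry_t entry;

		/* Pass 1: submit swap-in reads for every allocated physical slot. */
		for_each_allocated_offset(si, offset) {
			/* backward map: physical slot -> virtual swap slot */
			entry = swp_slot_to_swp_entry(swp_slot(type, offset));
			if (!entry.val)
				continue;
			folio = pagein(entry, &splug, mpol);
			if (folio)
				folio_put(folio);
		}

		/*
		 * Pass 2 (see the full version below): look each folio up in the
		 * swap cache again, wait for writeback, and call vswap_swapoff()
		 * to re-point the virtual slots from the physical slots to the
		 * swapped-in folio.
		 */
		return 0;
	}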
Signed-off-by: Nhat Pham --- include/linux/shmem_fs.h | 3 + include/linux/swap.h | 1 + mm/shmem.c | 2 + mm/swapfile.c | 127 +++++++++++++++++++++++++++++++++++++++ mm/vswap.c | 61 +++++++++++++++++++ 5 files changed, 194 insertions(+) diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 0b273a7b9f01..668b6add3b8f 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -108,7 +108,10 @@ extern void shmem_unlock_mapping(struct address_space *mapping); extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping, pgoff_t index, gfp_t gfp_mask); extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end); + +#ifndef CONFIG_VIRTUAL_SWAP int shmem_unuse(unsigned int type); +#endif #ifdef CONFIG_TRANSPARENT_HUGEPAGE unsigned long shmem_allowable_huge_orders(struct inode *inode, diff --git a/include/linux/swap.h b/include/linux/swap.h index c5a16f1ca376..0c585103d228 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -774,6 +774,7 @@ void vswap_store_folio(swp_entry_t entry, struct folio *folio); void swap_zeromap_folio_set(struct folio *folio); void vswap_assoc_zswap(swp_entry_t entry, struct zswap_entry *zswap_entry); bool vswap_can_swapin_thp(swp_entry_t entry, int nr); +void vswap_swapoff(swp_entry_t entry, struct folio *folio, swp_slot_t slot); #else /* CONFIG_VIRTUAL_SWAP */ static inline int vswap_init(void) { diff --git a/mm/shmem.c b/mm/shmem.c index 609971a2b365..fa792769e422 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1380,6 +1380,7 @@ static void shmem_evict_inode(struct inode *inode) #endif } +#ifndef CONFIG_VIRTUAL_SWAP static int shmem_find_swap_entries(struct address_space *mapping, pgoff_t start, struct folio_batch *fbatch, pgoff_t *indices, unsigned int type) @@ -1525,6 +1526,7 @@ int shmem_unuse(unsigned int type) return error; } +#endif /* CONFIG_VIRTUAL_SWAP */ /* * Move the page from the page cache to the swap cache. 
diff --git a/mm/swapfile.c b/mm/swapfile.c index 83016d86eb1c..3aa3df10c3be 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2089,6 +2089,132 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si, return i; } +#ifdef CONFIG_VIRTUAL_SWAP +#define for_each_allocated_offset(si, offset) \ + while (swap_usage_in_pages(si) && \ + !signal_pending(current) && \ + (offset = find_next_to_unuse(si, offset)) != 0) + +static struct folio *pagein(swp_entry_t entry, struct swap_iocb **splug, + struct mempolicy *mpol) +{ + bool folio_was_allocated; + struct folio *folio = __read_swap_cache_async(entry, GFP_KERNEL, mpol, + NO_INTERLEAVE_INDEX, &folio_was_allocated, false); + + if (folio_was_allocated) + swap_read_folio(folio, splug); + return folio; +} + +static int try_to_unuse(unsigned int type) +{ + struct swap_info_struct *si = swap_info[type]; + struct swap_iocb *splug = NULL; + struct mempolicy *mpol; + struct blk_plug plug; + unsigned long offset; + struct folio *folio; + swp_entry_t entry; + swp_slot_t slot; + int ret = 0; + + if (!atomic_long_read(&si->inuse_pages)) + goto success; + + mpol = get_task_policy(current); + blk_start_plug(&plug); + + /* first round - submit the reads */ + offset = 0; + for_each_allocated_offset(si, offset) { + slot = swp_slot(type, offset); + entry = swp_slot_to_swp_entry(slot); + if (!entry.val) + continue; + + folio = pagein(entry, &splug, mpol); + if (folio) + folio_put(folio); + } + blk_finish_plug(&plug); + swap_read_unplug(splug); + lru_add_drain(); + + /* second round - updating the virtual swap slots' backing state */ + offset = 0; + for_each_allocated_offset(si, offset) { + slot = swp_slot(type, offset); +retry: + entry = swp_slot_to_swp_entry(slot); + if (!entry.val) + continue; + + /* try to allocate swap cache folio */ + folio = pagein(entry, &splug, mpol); + if (!folio) { + if (!swp_slot_to_swp_entry(swp_slot(type, offset)).val) + continue; + + ret = -ENOMEM; + pr_err("swapoff: unable to allocate swap cache folio for %lu\n", + entry.val); + goto finish; + } + + folio_lock(folio); + /* + * We need to check if the folio is still in swap cache. We can, for + * instance, race with zswap writeback, obtaining the temporary folio + * it allocated for decompression and writeback, which would be + * promptly deleted from swap cache. By the time we lock that folio, + * it might already contain stale data. + * + * Concurrent swap operations might have also come in before we + * reobtain the lock, deleting the folio from swap cache, invalidating + * the virtual swap slot, then swapping out the folio again. + * + * In all of these cases, we must retry the physical -> virtual lookup. + * + * Note that if everything is still valid, then the virtual swap slot must + * correspond to the head page (since all previous swap slots are + * freed). + */ + if (!folio_test_swapcache(folio) || folio->swap.val != entry.val) { + folio_unlock(folio); + folio_put(folio); + if (signal_pending(current)) + break; + schedule_timeout_uninterruptible(1); + goto retry; + } + + folio_wait_writeback(folio); + vswap_swapoff(entry, folio, slot); + folio_unlock(folio); + folio_put(folio); + } + +finish: + if (ret == -ENOMEM) + return ret; + + /* concurrent swappers might still be releasing physical swap slots...
*/ + while (swap_usage_in_pages(si)) { + if (signal_pending(current)) + return -EINTR; + schedule_timeout_uninterruptible(1); + } + +success: + /* + * Make sure that further cleanups after try_to_unuse() returns happen + * after swap_range_free() reduces si->inuse_pages to 0. + */ + smp_mb(); + return 0; +} +#else static inline int pte_same_as_swp(pte_t pte, pte_t swp_pte) { return pte_same(pte_swp_clear_flags(pte), swp_pte); @@ -2479,6 +2605,7 @@ static int try_to_unuse(unsigned int type) smp_mb(); return 0; } +#endif /* CONFIG_VIRTUAL_SWAP */ /* * After a successful try_to_unuse, if no swap is now in use, we know diff --git a/mm/vswap.c b/mm/vswap.c index 4aeb144921b8..35261b5664ee 100644 --- a/mm/vswap.c +++ b/mm/vswap.c @@ -1252,6 +1252,67 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry) swapcache_clear(NULL, entry, nr); } +/** + * vswap_swapoff - unlink a range of virtual swap slots from their backing + * physical swap slots on a swapfile that is being swapped off, + * and associate them with the swapped in folio. + * @entry: the first virtual swap slot in the range. + * @folio: the folio swapped in and loaded into swap cache. + * @slot: the first physical swap slot in the range. + */ +void vswap_swapoff(swp_entry_t entry, struct folio *folio, swp_slot_t slot) +{ + int i = 0, nr = folio_nr_pages(folio); + struct swp_desc *desc; + unsigned int type = swp_slot_type(slot); + unsigned int offset = swp_slot_offset(slot); + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + write_lock(&desc->lock); + /* + * There might be concurrent swap operations that might invalidate the + * originally obtained virtual swap slot, allowing it to be + * re-allocated, or change its backing state. + * + * We must re-check here to make sure we are not performing bogus backing + * store changes. + */ + if (desc->type != VSWAP_SWAPFILE || + swp_slot_type(desc->slot) != type) { + /* there should not be mixed backing states among the subpages */ + VM_WARN_ON(i); + write_unlock(&desc->lock); + break; + } + + VM_WARN_ON(swp_slot_offset(desc->slot) != offset + i); + + xa_erase(&vswap_rmap, desc->slot.val); + desc->type = VSWAP_FOLIO; + desc->folio = folio; + write_unlock(&desc->lock); + i++; + } + rcu_read_unlock(); + + if (i) { + /* + * If we update the virtual swap slots' backing, mark the folio as + * dirty so that reclaimers will try to page it out again. 
+ */ + folio_mark_dirty(folio); + swap_slot_free_nr(slot, nr); + /* folio is in swap cache, so entries are guaranteed to be valid */ + mem_cgroup_uncharge_swap(entry, nr); + } +} + #ifdef CONFIG_MEMCG static unsigned short vswap_cgroup_record(swp_entry_t entry, unsigned short memcgid, unsigned int nr_ents)

From patchwork Tue Apr 29 23:38:45 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 885981
From: Nhat Pham
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org, peterx@redhat.com
Subject: [RFC PATCH v2 17/18] swapfile: move zeromap setup out of enable_swap_info
Date: Tue, 29 Apr 2025 16:38:45 -0700
Message-ID: <20250429233848.3093350-18-nphamcs@gmail.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20250429233848.3093350-1-nphamcs@gmail.com>
References: <20250429233848.3093350-1-nphamcs@gmail.com>
Precedence: bulk
X-Mailing-List: linux-pm@vger.kernel.org
List-Id: List-Subscribe: List-Unsubscribe:
MIME-Version: 1.0

In preparation for removing the zeromap in the virtual swap implementation, move the zeromap setup step out of enable_swap_info() to its callers, where necessary.
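To make the refactor concrete, the resulting code at the enable_swap_info() call site is, condensed from the diff below:

	spin_lock(&swap_lock);
	spin_lock(&si->lock);
	setup_swap_info(si, prio, swap_map, cluster_info);
	si->zeromap = zeromap;	/* now assigned by the caller, not by setup_swap_info() */
	spin_unlock(&si->lock);
	spin_unlock(&swap_lock);

reinsert_swap_info() keeps the already-populated si->zeromap, so it simply stops passing it down.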
Signed-off-by: Nhat Pham --- mm/swapfile.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 3aa3df10c3be..3ed7edc800fe 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2767,8 +2767,7 @@ static int swap_node(struct swap_info_struct *si) static void setup_swap_info(struct swap_info_struct *si, int prio, unsigned char *swap_map, - struct swap_cluster_info *cluster_info, - unsigned long *zeromap) + struct swap_cluster_info *cluster_info) { int i; @@ -2793,7 +2792,6 @@ static void setup_swap_info(struct swap_info_struct *si, int prio, } si->swap_map = swap_map; si->cluster_info = cluster_info; - si->zeromap = zeromap; } static void _enable_swap_info(struct swap_info_struct *si) @@ -2825,7 +2823,8 @@ static void enable_swap_info(struct swap_info_struct *si, int prio, { spin_lock(&swap_lock); spin_lock(&si->lock); - setup_swap_info(si, prio, swap_map, cluster_info, zeromap); + setup_swap_info(si, prio, swap_map, cluster_info); + si->zeromap = zeromap; spin_unlock(&si->lock); spin_unlock(&swap_lock); /* @@ -2843,7 +2842,7 @@ static void reinsert_swap_info(struct swap_info_struct *si) { spin_lock(&swap_lock); spin_lock(&si->lock); - setup_swap_info(si, si->prio, si->swap_map, si->cluster_info, si->zeromap); + setup_swap_info(si, si->prio, si->swap_map, si->cluster_info); _enable_swap_info(si); spin_unlock(&si->lock); spin_unlock(&swap_lock);

From patchwork Tue Apr 29 23:38:46 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 886253
From: Nhat Pham
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org, peterx@redhat.com
Subject: [RFC PATCH v2 18/18] swapfile: remove zeromap in virtual swap implementation
Date: Tue, 29 Apr 2025 16:38:46 -0700
Message-ID: <20250429233848.3093350-19-nphamcs@gmail.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20250429233848.3093350-1-nphamcs@gmail.com>
References: <20250429233848.3093350-1-nphamcs@gmail.com>
Precedence: bulk
X-Mailing-List: linux-pm@vger.kernel.org
List-Id:
List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 We are not using the zeromap for swapped out zero-filled pages in the virtual swap implementation. Remove it. This saves about 1 bit per physical swap slot. Signed-off-by: Nhat Pham --- include/linux/swap.h | 2 ++ mm/swapfile.c | 12 ++++++++++++ 2 files changed, 14 insertions(+) diff --git a/include/linux/swap.h b/include/linux/swap.h index 0c585103d228..408368d56dfb 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -312,7 +312,9 @@ struct swap_info_struct { signed char type; /* strange name for an index */ unsigned int max; /* extent of the swap_map */ unsigned char *swap_map; /* vmalloc'ed array of usage counts */ +#ifndef CONFIG_VIRTUAL_SWAP unsigned long *zeromap; /* kvmalloc'ed bitmap to track zero pages */ +#endif struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ struct list_head free_clusters; /* free clusters list */ struct list_head full_clusters; /* full clusters list */ diff --git a/mm/swapfile.c b/mm/swapfile.c index 3ed7edc800fe..3d99bd02ede9 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2824,7 +2824,9 @@ static void enable_swap_info(struct swap_info_struct *si, int prio, spin_lock(&swap_lock); spin_lock(&si->lock); setup_swap_info(si, prio, swap_map, cluster_info); +#ifndef CONFIG_VIRTUAL_SWAP si->zeromap = zeromap; +#endif spin_unlock(&si->lock); spin_unlock(&swap_lock); /* @@ -2885,7 +2887,9 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) { struct swap_info_struct *p = NULL; unsigned char *swap_map; +#ifndef CONFIG_VIRTUAL_SWAP unsigned long *zeromap; +#endif struct swap_cluster_info *cluster_info; struct file *swap_file, *victim; struct address_space *mapping; @@ -3000,8 +3004,10 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) p->max = 0; swap_map = p->swap_map; p->swap_map = NULL; +#ifndef CONFIG_VIRTUAL_SWAP zeromap = p->zeromap; p->zeromap = NULL; +#endif cluster_info = p->cluster_info; p->cluster_info = NULL; spin_unlock(&p->lock); @@ -3014,7 +3020,9 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) kfree(p->global_cluster); p->global_cluster = NULL; vfree(swap_map); +#ifndef CONFIG_VIRTUAL_SWAP kvfree(zeromap); +#endif kvfree(cluster_info); /* Destroy swap account information */ swap_cgroup_swapoff(p->type); @@ -3601,6 +3609,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) goto bad_swap_unlock_inode; } +#ifndef CONFIG_VIRTUAL_SWAP /* * Use kvmalloc_array instead of bitmap_zalloc as the allocation order might * be above MAX_PAGE_ORDER incase of a large swap file. @@ -3611,6 +3620,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) error = -ENOMEM; goto bad_swap_unlock_inode; } +#endif if (si->bdev && bdev_stable_writes(si->bdev)) si->flags |= SWP_STABLE_WRITES; @@ -3722,7 +3732,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) si->flags = 0; spin_unlock(&swap_lock); vfree(swap_map); +#ifndef CONFIG_VIRTUAL_SWAP kvfree(zeromap); +#endif kvfree(cluster_info); if (inced_nr_rotate_swap) atomic_dec(&nr_rotate_swap);