Message ID | 20250303084724.6490-13-kanchana.p.sridhar@intel.com |
---|---|
State | New |
Headers | show |
Series | zswap IAA compress batching | expand |
On Mon, Mar 03, 2025 at 12:47:22AM -0800, Kanchana P Sridhar wrote: > This patch modifies the acomp_ctx resources' lifetime to be from pool > creation to deletion. A "bool __online" and "u8 nr_reqs" are added to > "struct crypto_acomp_ctx" which simplify a few things: > > 1) zswap_pool_create() will initialize all members of each percpu acomp_ctx > to 0 or NULL and only then initialize the mutex. > 2) CPU hotplug will set nr_reqs to 1, allocate resources and set __online > to true, without locking the mutex. > 3) CPU hotunplug will lock the mutex before setting __online to false. It > will not delete any resources. > 4) acomp_ctx_get_cpu_lock() will lock the mutex, then check if __online > is true, and if so, return the mutex for use in zswap compress and > decompress ops. > 5) CPU onlining after offlining will simply check if either __online or > nr_reqs are non-0, and return 0 if so, without re-allocating the > resources. > 6) zswap_pool_destroy() will call a newly added zswap_cpu_comp_dealloc() to > delete the acomp_ctx resources. > 7) Common resource deletion code in case of zswap_cpu_comp_prepare() > errors, and for use in zswap_cpu_comp_dealloc(), is factored into a new > acomp_ctx_dealloc(). > > The CPU hot[un]plug callback functions are moved to "pool functions" > accordingly. > > The per-cpu memory cost of not deleting the acomp_ctx resources upon CPU > offlining, and only deleting them when the pool is destroyed, is as follows: > > IAA with batching: 64.8 KB > Software compressors: 8.2 KB > > I would appreciate code review comments on whether this memory cost is > acceptable, for the latency improvement that it provides due to a faster > reclaim restart after a CPU hotunplug-hotplug sequence - all that the > hotplug code needs to do is to check if acomp_ctx->nr_reqs is non-0, and > if so, set __online to true and return, and reclaim can proceed. I like the idea of allocating the resources on memory hotplug but leaving them allocated until the pool is torn down. It avoids allocating unnecessary memory if some CPUs are never onlined, but it simplifies things because we don't have to synchronize against the resources being freed in CPU offline. The only case that would suffer from this AFAICT is if someone onlines many CPUs, uses them once, and then offline them and not use them again. I am not familiar with CPU hotplug use cases so I can't tell if that's something people do, but I am inclined to agree with this simplification. > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > --- > mm/zswap.c | 273 +++++++++++++++++++++++++++++++++++------------------ > 1 file changed, 182 insertions(+), 91 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index 10f2a16e7586..cff96df1df8b 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -144,10 +144,12 @@ bool zswap_never_enabled(void) > struct crypto_acomp_ctx { > struct crypto_acomp *acomp; > struct acomp_req *req; > - struct crypto_wait wait; Is there a reason for moving this? If not please avoid unrelated changes. > u8 *buffer; > + u8 nr_reqs; > + struct crypto_wait wait; > struct mutex mutex; > bool is_sleepable; > + bool __online; I don't believe we need this. If we are not freeing resources during CPU offlining, then we do not need a CPU offline callback and acomp_ctx->__online serves no purpose. The whole point of synchronizing between offlining and compress/decompress operations is to avoid UAF. If offlining does not free resources, then we can hold the mutex directly in the compress/decompress path and drop the hotunplug callback completely. I also believe nr_reqs can be dropped from this patch, as it seems like it's only used know when to set __online. > }; > > /* > @@ -246,6 +248,122 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp) > **********************************/ > static void __zswap_pool_empty(struct percpu_ref *ref); > > +static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx) > +{ > + if (!IS_ERR_OR_NULL(acomp_ctx) && acomp_ctx->nr_reqs) { > + > + if (!IS_ERR_OR_NULL(acomp_ctx->req)) > + acomp_request_free(acomp_ctx->req); > + acomp_ctx->req = NULL; > + > + kfree(acomp_ctx->buffer); > + acomp_ctx->buffer = NULL; > + > + if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) > + crypto_free_acomp(acomp_ctx->acomp); > + > + acomp_ctx->nr_reqs = 0; > + } > +} Please split the pure refactoring into a separate patch to make it easier to review. > + > +static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node) Why is the function moved while being changed? It's really hard to see the diff this way. If the function needs to be moved please do that separately as well. I also see some ordering changes inside the function (e.g. we now allocate the request before the buffer). Not sure if these are intentional. If not, please keep the diff to the required changes only. > +{ > + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); > + int ret = -ENOMEM; > + > + /* > + * Just to be even more fail-safe against changes in assumptions and/or > + * implementation of the CPU hotplug code. > + */ > + if (acomp_ctx->__online) > + return 0; > + > + if (acomp_ctx->nr_reqs) { > + acomp_ctx->__online = true; > + return 0; > + } > + > + acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu)); > + if (IS_ERR(acomp_ctx->acomp)) { > + pr_err("could not alloc crypto acomp %s : %ld\n", > + pool->tfm_name, PTR_ERR(acomp_ctx->acomp)); > + ret = PTR_ERR(acomp_ctx->acomp); > + goto fail; > + } > + > + acomp_ctx->nr_reqs = 1; > + > + acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp); > + if (!acomp_ctx->req) { > + pr_err("could not alloc crypto acomp_request %s\n", > + pool->tfm_name); > + ret = -ENOMEM; > + goto fail; > + } > + > + acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu)); > + if (!acomp_ctx->buffer) { > + ret = -ENOMEM; > + goto fail; > + } > + > + crypto_init_wait(&acomp_ctx->wait); > + > + /* > + * if the backend of acomp is async zip, crypto_req_done() will wakeup > + * crypto_wait_req(); if the backend of acomp is scomp, the callback > + * won't be called, crypto_wait_req() will return without blocking. > + */ > + acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG, > + crypto_req_done, &acomp_ctx->wait); > + > + acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp); > + > + acomp_ctx->__online = true; > + > + return 0; > + > +fail: > + acomp_ctx_dealloc(acomp_ctx); > + > + return ret; > +} > + > +static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node) > +{ > + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); > + > + mutex_lock(&acomp_ctx->mutex); > + acomp_ctx->__online = false; > + mutex_unlock(&acomp_ctx->mutex); > + > + return 0; > +} > + > +static void zswap_cpu_comp_dealloc(unsigned int cpu, struct hlist_node *node) > +{ > + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); > + > + /* > + * The lifetime of acomp_ctx resources is from pool creation to > + * pool deletion. > + * > + * Reclaims should not be happening because, we get to this routine only > + * in two scenarios: > + * > + * 1) pool creation failures before/during the pool ref initialization. > + * 2) we are in the process of releasing the pool, it is off the > + * zswap_pools list and has no references. > + * > + * Hence, there is no need for locks. > + */ > + acomp_ctx->__online = false; > + acomp_ctx_dealloc(acomp_ctx); Since __online can be dropped, we can probably drop zswap_cpu_comp_dealloc() and call acomp_ctx_dealloc() directly? > +} > + > static struct zswap_pool *zswap_pool_create(char *type, char *compressor) > { > struct zswap_pool *pool; > @@ -285,13 +403,21 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor) > goto error; > } > > - for_each_possible_cpu(cpu) > - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex); > + for_each_possible_cpu(cpu) { > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); > + > + acomp_ctx->acomp = NULL; > + acomp_ctx->req = NULL; > + acomp_ctx->buffer = NULL; > + acomp_ctx->__online = false; > + acomp_ctx->nr_reqs = 0; Why is this needed? Wouldn't zswap_cpu_comp_prepare() initialize them right away? If it is in fact needed we should probably just use __GFP_ZERO. > + mutex_init(&acomp_ctx->mutex); > + } > > ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE, > &pool->node); > if (ret) > - goto error; > + goto ref_fail; > > /* being the current pool takes 1 ref; this func expects the > * caller to always add the new pool as the current pool > @@ -307,6 +433,9 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor) > return pool; > > ref_fail: > + for_each_possible_cpu(cpu) > + zswap_cpu_comp_dealloc(cpu, &pool->node); > + > cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node); I am wondering if we can guard these by hlist_empty(&pool->node) instead of having separate labels. If we do that we can probably make all the cleanup calls conditional and merge this cleanup code with zswap_pool_destroy(). Although I am not too sure about whether or not we should rely on hlist_empty() for this. I am just thinking out loud, no need to do anything here. If you decide to pursue this tho please make it a separate refactoring patch. > error: > if (pool->acomp_ctx) > @@ -361,8 +490,13 @@ static struct zswap_pool *__zswap_pool_create_fallback(void) > > static void zswap_pool_destroy(struct zswap_pool *pool) > { > + int cpu; > + > zswap_pool_debug("destroying", pool); > > + for_each_possible_cpu(cpu) > + zswap_cpu_comp_dealloc(cpu, &pool->node); > + > cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node); > free_percpu(pool->acomp_ctx); > > @@ -816,85 +950,6 @@ static void zswap_entry_free(struct zswap_entry *entry) > /********************************* > * compressed storage functions > **********************************/ > -static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node) > -{ > - struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); > - struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); > - struct crypto_acomp *acomp = NULL; > - struct acomp_req *req = NULL; > - u8 *buffer = NULL; > - int ret; > - > - buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu)); > - if (!buffer) { > - ret = -ENOMEM; > - goto fail; > - } > - > - acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu)); > - if (IS_ERR(acomp)) { > - pr_err("could not alloc crypto acomp %s : %ld\n", > - pool->tfm_name, PTR_ERR(acomp)); > - ret = PTR_ERR(acomp); > - goto fail; > - } > - > - req = acomp_request_alloc(acomp); > - if (!req) { > - pr_err("could not alloc crypto acomp_request %s\n", > - pool->tfm_name); > - ret = -ENOMEM; > - goto fail; > - } > - > - /* > - * Only hold the mutex after completing allocations, otherwise we may > - * recurse into zswap through reclaim and attempt to hold the mutex > - * again resulting in a deadlock. > - */ > - mutex_lock(&acomp_ctx->mutex); > - crypto_init_wait(&acomp_ctx->wait); > - > - /* > - * if the backend of acomp is async zip, crypto_req_done() will wakeup > - * crypto_wait_req(); if the backend of acomp is scomp, the callback > - * won't be called, crypto_wait_req() will return without blocking. > - */ > - acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG, > - crypto_req_done, &acomp_ctx->wait); > - > - acomp_ctx->buffer = buffer; > - acomp_ctx->acomp = acomp; > - acomp_ctx->is_sleepable = acomp_is_async(acomp); > - acomp_ctx->req = req; > - mutex_unlock(&acomp_ctx->mutex); > - return 0; > - > -fail: > - if (acomp) > - crypto_free_acomp(acomp); > - kfree(buffer); > - return ret; > -} > - > -static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node) > -{ > - struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); > - struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); > - > - mutex_lock(&acomp_ctx->mutex); > - if (!IS_ERR_OR_NULL(acomp_ctx)) { > - if (!IS_ERR_OR_NULL(acomp_ctx->req)) > - acomp_request_free(acomp_ctx->req); > - acomp_ctx->req = NULL; > - if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) > - crypto_free_acomp(acomp_ctx->acomp); > - kfree(acomp_ctx->buffer); > - } > - mutex_unlock(&acomp_ctx->mutex); > - > - return 0; > -} > > static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool) > { > @@ -902,16 +957,52 @@ static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool) > > for (;;) { > acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); > - mutex_lock(&acomp_ctx->mutex); > - if (likely(acomp_ctx->req)) > - return acomp_ctx; > /* > - * It is possible that we were migrated to a different CPU after > - * getting the per-CPU ctx but before the mutex was acquired. If > - * the old CPU got offlined, zswap_cpu_comp_dead() could have > - * already freed ctx->req (among other things) and set it to > - * NULL. Just try again on the new CPU that we ended up on. > + * If the CPU onlining code successfully allocates acomp_ctx resources, > + * it sets acomp_ctx->__online to true. Until this happens, we have > + * two options: > + * > + * 1. Return NULL and fail all stores on this CPU. > + * 2. Retry, until onlining has finished allocating resources. > + * > + * In theory, option 1 could be more appropriate, because it > + * allows the calling procedure to decide how it wants to handle > + * reclaim racing with CPU hotplug. For instance, it might be Ok > + * for compress to return an error for the backing swap device > + * to store the folio. Decompress could wait until we get a > + * valid and locked mutex after onlining has completed. For now, > + * we go with option 2 because adding a do-while in > + * zswap_decompress() adds latency for software compressors. > + * > + * Once initialized, the resources will be de-allocated only > + * when the pool is destroyed. The acomp_ctx will hold on to the > + * resources through CPU offlining/onlining at any time until > + * the pool is destroyed. > + * > + * This prevents races/deadlocks between reclaim and CPU acomp_ctx > + * resource allocation that are a dependency for reclaim. > + * It further simplifies the interaction with CPU onlining and > + * offlining: > + * > + * - CPU onlining does not take the mutex. It only allocates > + * resources and sets __online to true. > + * - CPU offlining acquires the mutex before setting > + * __online to false. If reclaim has acquired the mutex, > + * offlining will have to wait for reclaim to complete before > + * hotunplug can proceed. Further, hotplug merely sets > + * __online to false. It does not delete the acomp_ctx > + * resources. > + * > + * Option 1 is better than potentially not exiting the earlier > + * for (;;) loop because the system is running low on memory > + * and/or CPUs are getting offlined for whatever reason. At > + * least failing this store will prevent data loss by failing > + * zswap_store(), and saving the data in the backing swap device. > */ I believe we can dropped. I don't think we can have any store/load operations on a CPU before it's fully onlined, and we should always have a reference on the pool here, so the resources cannot go away. So unless I missed something we can drop this completely now and just hold the mutex directly in the load/store paths. > + mutex_lock(&acomp_ctx->mutex); > + if (likely(acomp_ctx->__online)) > + return acomp_ctx; > + > mutex_unlock(&acomp_ctx->mutex); > } > } > -- > 2.27.0 >
On Mon, Mar 03, 2025 at 12:47:22AM -0800, Kanchana P Sridhar wrote: > This patch modifies the acomp_ctx resources' lifetime to be from pool > creation to deletion. A "bool __online" and "u8 nr_reqs" are added to > "struct crypto_acomp_ctx" which simplify a few things: > > 1) zswap_pool_create() will initialize all members of each percpu acomp_ctx > to 0 or NULL and only then initialize the mutex. > 2) CPU hotplug will set nr_reqs to 1, allocate resources and set __online > to true, without locking the mutex. > 3) CPU hotunplug will lock the mutex before setting __online to false. It > will not delete any resources. > 4) acomp_ctx_get_cpu_lock() will lock the mutex, then check if __online > is true, and if so, return the mutex for use in zswap compress and > decompress ops. > 5) CPU onlining after offlining will simply check if either __online or > nr_reqs are non-0, and return 0 if so, without re-allocating the > resources. > 6) zswap_pool_destroy() will call a newly added zswap_cpu_comp_dealloc() to > delete the acomp_ctx resources. > 7) Common resource deletion code in case of zswap_cpu_comp_prepare() > errors, and for use in zswap_cpu_comp_dealloc(), is factored into a new > acomp_ctx_dealloc(). > > The CPU hot[un]plug callback functions are moved to "pool functions" > accordingly. > > The per-cpu memory cost of not deleting the acomp_ctx resources upon CPU > offlining, and only deleting them when the pool is destroyed, is as follows: > > IAA with batching: 64.8 KB > Software compressors: 8.2 KB I am assuming this is specifically on x86_64, so let's call that out.
On Thu, Mar 06, 2025 at 07:35:36PM +0000, Yosry Ahmed wrote: > On Mon, Mar 03, 2025 at 12:47:22AM -0800, Kanchana P Sridhar wrote: > > This patch modifies the acomp_ctx resources' lifetime to be from pool > > creation to deletion. A "bool __online" and "u8 nr_reqs" are added to > > "struct crypto_acomp_ctx" which simplify a few things: > > > > 1) zswap_pool_create() will initialize all members of each percpu acomp_ctx > > to 0 or NULL and only then initialize the mutex. > > 2) CPU hotplug will set nr_reqs to 1, allocate resources and set __online > > to true, without locking the mutex. > > 3) CPU hotunplug will lock the mutex before setting __online to false. It > > will not delete any resources. > > 4) acomp_ctx_get_cpu_lock() will lock the mutex, then check if __online > > is true, and if so, return the mutex for use in zswap compress and > > decompress ops. > > 5) CPU onlining after offlining will simply check if either __online or > > nr_reqs are non-0, and return 0 if so, without re-allocating the > > resources. > > 6) zswap_pool_destroy() will call a newly added zswap_cpu_comp_dealloc() to > > delete the acomp_ctx resources. > > 7) Common resource deletion code in case of zswap_cpu_comp_prepare() > > errors, and for use in zswap_cpu_comp_dealloc(), is factored into a new > > acomp_ctx_dealloc(). > > > > The CPU hot[un]plug callback functions are moved to "pool functions" > > accordingly. > > > > The per-cpu memory cost of not deleting the acomp_ctx resources upon CPU > > offlining, and only deleting them when the pool is destroyed, is as follows: > > > > IAA with batching: 64.8 KB > > Software compressors: 8.2 KB > > > > I would appreciate code review comments on whether this memory cost is > > acceptable, for the latency improvement that it provides due to a faster > > reclaim restart after a CPU hotunplug-hotplug sequence - all that the > > hotplug code needs to do is to check if acomp_ctx->nr_reqs is non-0, and > > if so, set __online to true and return, and reclaim can proceed. > > I like the idea of allocating the resources on memory hotplug but > leaving them allocated until the pool is torn down. It avoids allocating > unnecessary memory if some CPUs are never onlined, but it simplifies > things because we don't have to synchronize against the resources being > freed in CPU offline. > > The only case that would suffer from this AFAICT is if someone onlines > many CPUs, uses them once, and then offline them and not use them again. > I am not familiar with CPU hotplug use cases so I can't tell if that's > something people do, but I am inclined to agree with this > simplification. > > > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > --- > > mm/zswap.c | 273 +++++++++++++++++++++++++++++++++++------------------ > > 1 file changed, 182 insertions(+), 91 deletions(-) > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > index 10f2a16e7586..cff96df1df8b 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -144,10 +144,12 @@ bool zswap_never_enabled(void) > > struct crypto_acomp_ctx { > > struct crypto_acomp *acomp; > > struct acomp_req *req; > > - struct crypto_wait wait; > > Is there a reason for moving this? If not please avoid unrelated changes. > > > u8 *buffer; > > + u8 nr_reqs; > > + struct crypto_wait wait; > > struct mutex mutex; > > bool is_sleepable; > > + bool __online; > > I don't believe we need this. > > If we are not freeing resources during CPU offlining, then we do not > need a CPU offline callback and acomp_ctx->__online serves no purpose. > > The whole point of synchronizing between offlining and > compress/decompress operations is to avoid UAF. If offlining does not > free resources, then we can hold the mutex directly in the > compress/decompress path and drop the hotunplug callback completely. > > I also believe nr_reqs can be dropped from this patch, as it seems like > it's only used know when to set __online. > > > }; > > > > /* > > @@ -246,6 +248,122 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp) > > **********************************/ > > static void __zswap_pool_empty(struct percpu_ref *ref); > > > > +static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx) > > +{ > > + if (!IS_ERR_OR_NULL(acomp_ctx) && acomp_ctx->nr_reqs) { Also, we can just return early here to save an indentation level: if (IS_ERR_OR_NULL(acomp_ctx) || !acomp_ctx->nr_reqs) return; > > + > > + if (!IS_ERR_OR_NULL(acomp_ctx->req)) > > + acomp_request_free(acomp_ctx->req); > > + acomp_ctx->req = NULL; > > + > > + kfree(acomp_ctx->buffer); > > + acomp_ctx->buffer = NULL; > > + > > + if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) > > + crypto_free_acomp(acomp_ctx->acomp); > > + > > + acomp_ctx->nr_reqs = 0; > > + } > > +} > > Please split the pure refactoring into a separate patch to make it > easier to review.
> -----Original Message----- > From: Yosry Ahmed <yosry.ahmed@linux.dev> > Sent: Thursday, March 6, 2025 11:36 AM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com; > ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux- > crypto@vger.kernel.org; herbert@gondor.apana.org.au; > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org; > ebiggers@google.com; surenb@google.com; Accardi, Kristen C > <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; > Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource > allocation/deletion and mutex lock usage. > > On Mon, Mar 03, 2025 at 12:47:22AM -0800, Kanchana P Sridhar wrote: > > This patch modifies the acomp_ctx resources' lifetime to be from pool > > creation to deletion. A "bool __online" and "u8 nr_reqs" are added to > > "struct crypto_acomp_ctx" which simplify a few things: > > > > 1) zswap_pool_create() will initialize all members of each percpu > acomp_ctx > > to 0 or NULL and only then initialize the mutex. > > 2) CPU hotplug will set nr_reqs to 1, allocate resources and set __online > > to true, without locking the mutex. > > 3) CPU hotunplug will lock the mutex before setting __online to false. It > > will not delete any resources. > > 4) acomp_ctx_get_cpu_lock() will lock the mutex, then check if __online > > is true, and if so, return the mutex for use in zswap compress and > > decompress ops. > > 5) CPU onlining after offlining will simply check if either __online or > > nr_reqs are non-0, and return 0 if so, without re-allocating the > > resources. > > 6) zswap_pool_destroy() will call a newly added zswap_cpu_comp_dealloc() > to > > delete the acomp_ctx resources. > > 7) Common resource deletion code in case of zswap_cpu_comp_prepare() > > errors, and for use in zswap_cpu_comp_dealloc(), is factored into a new > > acomp_ctx_dealloc(). > > > > The CPU hot[un]plug callback functions are moved to "pool functions" > > accordingly. > > > > The per-cpu memory cost of not deleting the acomp_ctx resources upon > CPU > > offlining, and only deleting them when the pool is destroyed, is as follows: > > > > IAA with batching: 64.8 KB > > Software compressors: 8.2 KB > > > > I would appreciate code review comments on whether this memory cost is > > acceptable, for the latency improvement that it provides due to a faster > > reclaim restart after a CPU hotunplug-hotplug sequence - all that the > > hotplug code needs to do is to check if acomp_ctx->nr_reqs is non-0, and > > if so, set __online to true and return, and reclaim can proceed. > > I like the idea of allocating the resources on memory hotplug but > leaving them allocated until the pool is torn down. It avoids allocating > unnecessary memory if some CPUs are never onlined, but it simplifies > things because we don't have to synchronize against the resources being > freed in CPU offline. > > The only case that would suffer from this AFAICT is if someone onlines > many CPUs, uses them once, and then offline them and not use them again. > I am not familiar with CPU hotplug use cases so I can't tell if that's > something people do, but I am inclined to agree with this > simplification. Thanks Yosry, for your code review comments! Good to know that this simplification is acceptable. > > > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > --- > > mm/zswap.c | 273 +++++++++++++++++++++++++++++++++++-------------- > ---- > > 1 file changed, 182 insertions(+), 91 deletions(-) > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > index 10f2a16e7586..cff96df1df8b 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -144,10 +144,12 @@ bool zswap_never_enabled(void) > > struct crypto_acomp_ctx { > > struct crypto_acomp *acomp; > > struct acomp_req *req; > > - struct crypto_wait wait; > > Is there a reason for moving this? If not please avoid unrelated changes. The reason is so that req/buffer, and reqs/buffers with batching, go together logically, hence I found this easier to understand. I can restore this to the original order, if that's preferable. > > > u8 *buffer; > > + u8 nr_reqs; > > + struct crypto_wait wait; > > struct mutex mutex; > > bool is_sleepable; > > + bool __online; > > I don't believe we need this. > > If we are not freeing resources during CPU offlining, then we do not > need a CPU offline callback and acomp_ctx->__online serves no purpose. > > The whole point of synchronizing between offlining and > compress/decompress operations is to avoid UAF. If offlining does not > free resources, then we can hold the mutex directly in the > compress/decompress path and drop the hotunplug callback completely. > > I also believe nr_reqs can be dropped from this patch, as it seems like > it's only used know when to set __online. All great points! In fact, that was the original solution I had implemented (not having an offline callback). But then, I spent some time understanding the v6.13 hotfix for synchronizing freeing of resources, and this comment in zswap_cpu_comp_prepare(): /* * Only hold the mutex after completing allocations, otherwise we may * recurse into zswap through reclaim and attempt to hold the mutex * again resulting in a deadlock. */ Hence, I figured the constraint of "recurse into zswap through reclaim" was something to comprehend in the simplification (even though I had a tough time imagining how this could happen). Hence, I added the "bool __online" because zswap_cpu_comp_prepare() does not acquire the mutex lock while allocating resources. We have already initialized the mutex, so in theory, it is possible for compress/decompress to acquire the mutex lock. The __online acts as a way to indicate whether compress/decompress can proceed reliably to use the resources. The "nr_reqs" was needed as a way to distinguish between initial and subsequent calls into zswap_cpu_comp_prepare(), for e.g., on a CPU that goes through an online-offline-online sequence. In the initial onlining, we need to allocate resources because nr_reqs=0. If resources are to be allocated, we set acomp_ctx->nr_reqs and proceed to allocate reqs/buffers/etc. In the subsequent onlining, we can quickly inspect nr_reqs as being greater than 0 and return, thus avoiding any latency delays before reclaim/page-faults can be handled on that CPU. Please let me know if this rationale seems reasonable for why __online and nr_reqs were introduced. > > > }; > > > > /* > > @@ -246,6 +248,122 @@ static inline struct xarray > *swap_zswap_tree(swp_entry_t swp) > > **********************************/ > > static void __zswap_pool_empty(struct percpu_ref *ref); > > > > +static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx) > > +{ > > + if (!IS_ERR_OR_NULL(acomp_ctx) && acomp_ctx->nr_reqs) { > > + > > + if (!IS_ERR_OR_NULL(acomp_ctx->req)) > > + acomp_request_free(acomp_ctx->req); > > + acomp_ctx->req = NULL; > > + > > + kfree(acomp_ctx->buffer); > > + acomp_ctx->buffer = NULL; > > + > > + if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) > > + crypto_free_acomp(acomp_ctx->acomp); > > + > > + acomp_ctx->nr_reqs = 0; > > + } > > +} > > Please split the pure refactoring into a separate patch to make it > easier to review. Sure, will do. > > > + > > +static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node > *node) > > Why is the function moved while being changed? It's really hard to see > the diff this way. If the function needs to be moved please do that > separately as well. Sure, will do. > > I also see some ordering changes inside the function (e.g. we now > allocate the request before the buffer). Not sure if these are > intentional. If not, please keep the diff to the required changes only. The reason for this was, I am trying to organize the allocations based on dependencies. Unless requests are allocated, there is no point in allocating buffers. Please let me know if this is Ok. > > > +{ > > + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, > node); > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > >acomp_ctx, cpu); > > + int ret = -ENOMEM; > > + > > + /* > > + * Just to be even more fail-safe against changes in assumptions > and/or > > + * implementation of the CPU hotplug code. > > + */ > > + if (acomp_ctx->__online) > > + return 0; > > + > > + if (acomp_ctx->nr_reqs) { > > + acomp_ctx->__online = true; > > + return 0; > > + } > > + > > + acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, > 0, cpu_to_node(cpu)); > > + if (IS_ERR(acomp_ctx->acomp)) { > > + pr_err("could not alloc crypto acomp %s : %ld\n", > > + pool->tfm_name, PTR_ERR(acomp_ctx->acomp)); > > + ret = PTR_ERR(acomp_ctx->acomp); > > + goto fail; > > + } > > + > > + acomp_ctx->nr_reqs = 1; > > + > > + acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp); > > + if (!acomp_ctx->req) { > > + pr_err("could not alloc crypto acomp_request %s\n", > > + pool->tfm_name); > > + ret = -ENOMEM; > > + goto fail; > > + } > > + > > + acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, > cpu_to_node(cpu)); > > + if (!acomp_ctx->buffer) { > > + ret = -ENOMEM; > > + goto fail; > > + } > > + > > + crypto_init_wait(&acomp_ctx->wait); > > + > > + /* > > + * if the backend of acomp is async zip, crypto_req_done() will > wakeup > > + * crypto_wait_req(); if the backend of acomp is scomp, the callback > > + * won't be called, crypto_wait_req() will return without blocking. > > + */ > > + acomp_request_set_callback(acomp_ctx->req, > CRYPTO_TFM_REQ_MAY_BACKLOG, > > + crypto_req_done, &acomp_ctx->wait); > > + > > + acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp); > > + > > + acomp_ctx->__online = true; > > + > > + return 0; > > + > > +fail: > > + acomp_ctx_dealloc(acomp_ctx); > > + > > + return ret; > > +} > > + > > +static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node > *node) > > +{ > > + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, > node); > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > >acomp_ctx, cpu); > > + > > + mutex_lock(&acomp_ctx->mutex); > > + acomp_ctx->__online = false; > > + mutex_unlock(&acomp_ctx->mutex); > > + > > + return 0; > > +} > > + > > +static void zswap_cpu_comp_dealloc(unsigned int cpu, struct hlist_node > *node) > > +{ > > + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, > node); > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > >acomp_ctx, cpu); > > + > > + /* > > + * The lifetime of acomp_ctx resources is from pool creation to > > + * pool deletion. > > + * > > + * Reclaims should not be happening because, we get to this routine > only > > + * in two scenarios: > > + * > > + * 1) pool creation failures before/during the pool ref initialization. > > + * 2) we are in the process of releasing the pool, it is off the > > + * zswap_pools list and has no references. > > + * > > + * Hence, there is no need for locks. > > + */ > > + acomp_ctx->__online = false; > > + acomp_ctx_dealloc(acomp_ctx); > > Since __online can be dropped, we can probably drop > zswap_cpu_comp_dealloc() and call acomp_ctx_dealloc() directly? I suppose there is value in having a way in zswap to know for sure, that resource allocation has completed, and it is safe for compress/decompress to proceed. Especially because the mutex has been initialized before we get to resource allocation. Would you agree? > > > +} > > + > > static struct zswap_pool *zswap_pool_create(char *type, char > *compressor) > > { > > struct zswap_pool *pool; > > @@ -285,13 +403,21 @@ static struct zswap_pool > *zswap_pool_create(char *type, char *compressor) > > goto error; > > } > > > > - for_each_possible_cpu(cpu) > > - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex); > > + for_each_possible_cpu(cpu) { > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > >acomp_ctx, cpu); > > + > > + acomp_ctx->acomp = NULL; > > + acomp_ctx->req = NULL; > > + acomp_ctx->buffer = NULL; > > + acomp_ctx->__online = false; > > + acomp_ctx->nr_reqs = 0; > > Why is this needed? Wouldn't zswap_cpu_comp_prepare() initialize them > right away? Yes, I figured this is needed for two reasons: 1) For the error handling in zswap_cpu_comp_prepare() and calls into zswap_cpu_comp_dealloc() to be handled by the common procedure "acomp_ctx_dealloc()" unambiguously. 2) The second scenario I thought of that would need this, is let's say the zswap compressor is switched immediately after setting the compressor. Some cores have executed the onlining code and some haven't. Because there are no pool refs held, zswap_cpu_comp_dealloc() would be called per-CPU. Hence, I figured it would help to initialize these acomp_ctx members before the hand-off to "cpuhp_state_add_instance()" in zswap_pool_create(). Please let me know if these are valid considerations. > > If it is in fact needed we should probably just use __GFP_ZERO. Sure. Are you suggesting I use "alloc_percpu_gfp()" instead of "alloc_percpu()" for the acomp_ctx? > > > + mutex_init(&acomp_ctx->mutex); > > + } > > > > ret = > cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE, > > &pool->node); > > if (ret) > > - goto error; > > + goto ref_fail; > > > > /* being the current pool takes 1 ref; this func expects the > > * caller to always add the new pool as the current pool > > @@ -307,6 +433,9 @@ static struct zswap_pool *zswap_pool_create(char > *type, char *compressor) > > return pool; > > > > ref_fail: > > + for_each_possible_cpu(cpu) > > + zswap_cpu_comp_dealloc(cpu, &pool->node); > > + > > cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, > &pool->node); > > I am wondering if we can guard these by hlist_empty(&pool->node) instead > of having separate labels. If we do that we can probably make all the > cleanup calls conditional and merge this cleanup code with > zswap_pool_destroy(). > > Although I am not too sure about whether or not we should rely on > hlist_empty() for this. I am just thinking out loud, no need to do > anything here. If you decide to pursue this tho please make it a > separate refactoring patch. Sure, makes sense. > > > error: > > if (pool->acomp_ctx) > > @@ -361,8 +490,13 @@ static struct zswap_pool > *__zswap_pool_create_fallback(void) > > > > static void zswap_pool_destroy(struct zswap_pool *pool) > > { > > + int cpu; > > + > > zswap_pool_debug("destroying", pool); > > > > + for_each_possible_cpu(cpu) > > + zswap_cpu_comp_dealloc(cpu, &pool->node); > > + > > cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, > &pool->node); > > free_percpu(pool->acomp_ctx); > > > > @@ -816,85 +950,6 @@ static void zswap_entry_free(struct zswap_entry > *entry) > > /********************************* > > * compressed storage functions > > **********************************/ > > -static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node > *node) > > -{ > > - struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, > node); > > - struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > >acomp_ctx, cpu); > > - struct crypto_acomp *acomp = NULL; > > - struct acomp_req *req = NULL; > > - u8 *buffer = NULL; > > - int ret; > > - > > - buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, > cpu_to_node(cpu)); > > - if (!buffer) { > > - ret = -ENOMEM; > > - goto fail; > > - } > > - > > - acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, > cpu_to_node(cpu)); > > - if (IS_ERR(acomp)) { > > - pr_err("could not alloc crypto acomp %s : %ld\n", > > - pool->tfm_name, PTR_ERR(acomp)); > > - ret = PTR_ERR(acomp); > > - goto fail; > > - } > > - > > - req = acomp_request_alloc(acomp); > > - if (!req) { > > - pr_err("could not alloc crypto acomp_request %s\n", > > - pool->tfm_name); > > - ret = -ENOMEM; > > - goto fail; > > - } > > - > > - /* > > - * Only hold the mutex after completing allocations, otherwise we > may > > - * recurse into zswap through reclaim and attempt to hold the mutex > > - * again resulting in a deadlock. > > - */ > > - mutex_lock(&acomp_ctx->mutex); > > - crypto_init_wait(&acomp_ctx->wait); > > - > > - /* > > - * if the backend of acomp is async zip, crypto_req_done() will > wakeup > > - * crypto_wait_req(); if the backend of acomp is scomp, the callback > > - * won't be called, crypto_wait_req() will return without blocking. > > - */ > > - acomp_request_set_callback(req, > CRYPTO_TFM_REQ_MAY_BACKLOG, > > - crypto_req_done, &acomp_ctx->wait); > > - > > - acomp_ctx->buffer = buffer; > > - acomp_ctx->acomp = acomp; > > - acomp_ctx->is_sleepable = acomp_is_async(acomp); > > - acomp_ctx->req = req; > > - mutex_unlock(&acomp_ctx->mutex); > > - return 0; > > - > > -fail: > > - if (acomp) > > - crypto_free_acomp(acomp); > > - kfree(buffer); > > - return ret; > > -} > > - > > -static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node > *node) > > -{ > > - struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, > node); > > - struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > >acomp_ctx, cpu); > > - > > - mutex_lock(&acomp_ctx->mutex); > > - if (!IS_ERR_OR_NULL(acomp_ctx)) { > > - if (!IS_ERR_OR_NULL(acomp_ctx->req)) > > - acomp_request_free(acomp_ctx->req); > > - acomp_ctx->req = NULL; > > - if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) > > - crypto_free_acomp(acomp_ctx->acomp); > > - kfree(acomp_ctx->buffer); > > - } > > - mutex_unlock(&acomp_ctx->mutex); > > - > > - return 0; > > -} > > > > static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct > zswap_pool *pool) > > { > > @@ -902,16 +957,52 @@ static struct crypto_acomp_ctx > *acomp_ctx_get_cpu_lock(struct zswap_pool *pool) > > > > for (;;) { > > acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); > > - mutex_lock(&acomp_ctx->mutex); > > - if (likely(acomp_ctx->req)) > > - return acomp_ctx; > > /* > > - * It is possible that we were migrated to a different CPU > after > > - * getting the per-CPU ctx but before the mutex was > acquired. If > > - * the old CPU got offlined, zswap_cpu_comp_dead() could > have > > - * already freed ctx->req (among other things) and set it to > > - * NULL. Just try again on the new CPU that we ended up on. > > + * If the CPU onlining code successfully allocates acomp_ctx > resources, > > + * it sets acomp_ctx->__online to true. Until this happens, we > have > > + * two options: > > + * > > + * 1. Return NULL and fail all stores on this CPU. > > + * 2. Retry, until onlining has finished allocating resources. > > + * > > + * In theory, option 1 could be more appropriate, because it > > + * allows the calling procedure to decide how it wants to > handle > > + * reclaim racing with CPU hotplug. For instance, it might be > Ok > > + * for compress to return an error for the backing swap device > > + * to store the folio. Decompress could wait until we get a > > + * valid and locked mutex after onlining has completed. For > now, > > + * we go with option 2 because adding a do-while in > > + * zswap_decompress() adds latency for software > compressors. > > + * > > + * Once initialized, the resources will be de-allocated only > > + * when the pool is destroyed. The acomp_ctx will hold on to > the > > + * resources through CPU offlining/onlining at any time until > > + * the pool is destroyed. > > + * > > + * This prevents races/deadlocks between reclaim and CPU > acomp_ctx > > + * resource allocation that are a dependency for reclaim. > > + * It further simplifies the interaction with CPU onlining and > > + * offlining: > > + * > > + * - CPU onlining does not take the mutex. It only allocates > > + * resources and sets __online to true. > > + * - CPU offlining acquires the mutex before setting > > + * __online to false. If reclaim has acquired the mutex, > > + * offlining will have to wait for reclaim to complete before > > + * hotunplug can proceed. Further, hotplug merely sets > > + * __online to false. It does not delete the acomp_ctx > > + * resources. > > + * > > + * Option 1 is better than potentially not exiting the earlier > > + * for (;;) loop because the system is running low on memory > > + * and/or CPUs are getting offlined for whatever reason. At > > + * least failing this store will prevent data loss by failing > > + * zswap_store(), and saving the data in the backing swap > device. > > */ > > I believe we can dropped. I don't think we can have any store/load > operations on a CPU before it's fully onlined, and we should always have > a reference on the pool here, so the resources cannot go away. > > So unless I missed something we can drop this completely now and just > hold the mutex directly in the load/store paths. Based on the above explanations, please let me know if it is a good idea to keep the __online, or if you think further simplification is possible. Thanks, Kanchana > > > + mutex_lock(&acomp_ctx->mutex); > > + if (likely(acomp_ctx->__online)) > > + return acomp_ctx; > > + > > mutex_unlock(&acomp_ctx->mutex); > > } > > } > > -- > > 2.27.0 > >
On Fri, Mar 07, 2025 at 12:01:14AM +0000, Sridhar, Kanchana P wrote: > > > -----Original Message----- > > From: Yosry Ahmed <yosry.ahmed@linux.dev> > > Sent: Thursday, March 6, 2025 11:36 AM > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com; > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux- > > crypto@vger.kernel.org; herbert@gondor.apana.org.au; > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org; > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C > > <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; > > Gopal, Vinodh <vinodh.gopal@intel.com> > > Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource > > allocation/deletion and mutex lock usage. > > > > On Mon, Mar 03, 2025 at 12:47:22AM -0800, Kanchana P Sridhar wrote: > > > This patch modifies the acomp_ctx resources' lifetime to be from pool > > > creation to deletion. A "bool __online" and "u8 nr_reqs" are added to > > > "struct crypto_acomp_ctx" which simplify a few things: > > > > > > 1) zswap_pool_create() will initialize all members of each percpu > > acomp_ctx > > > to 0 or NULL and only then initialize the mutex. > > > 2) CPU hotplug will set nr_reqs to 1, allocate resources and set __online > > > to true, without locking the mutex. > > > 3) CPU hotunplug will lock the mutex before setting __online to false. It > > > will not delete any resources. > > > 4) acomp_ctx_get_cpu_lock() will lock the mutex, then check if __online > > > is true, and if so, return the mutex for use in zswap compress and > > > decompress ops. > > > 5) CPU onlining after offlining will simply check if either __online or > > > nr_reqs are non-0, and return 0 if so, without re-allocating the > > > resources. > > > 6) zswap_pool_destroy() will call a newly added zswap_cpu_comp_dealloc() > > to > > > delete the acomp_ctx resources. > > > 7) Common resource deletion code in case of zswap_cpu_comp_prepare() > > > errors, and for use in zswap_cpu_comp_dealloc(), is factored into a new > > > acomp_ctx_dealloc(). > > > > > > The CPU hot[un]plug callback functions are moved to "pool functions" > > > accordingly. > > > > > > The per-cpu memory cost of not deleting the acomp_ctx resources upon > > CPU > > > offlining, and only deleting them when the pool is destroyed, is as follows: > > > > > > IAA with batching: 64.8 KB > > > Software compressors: 8.2 KB > > > > > > I would appreciate code review comments on whether this memory cost is > > > acceptable, for the latency improvement that it provides due to a faster > > > reclaim restart after a CPU hotunplug-hotplug sequence - all that the > > > hotplug code needs to do is to check if acomp_ctx->nr_reqs is non-0, and > > > if so, set __online to true and return, and reclaim can proceed. > > > > I like the idea of allocating the resources on memory hotplug but > > leaving them allocated until the pool is torn down. It avoids allocating > > unnecessary memory if some CPUs are never onlined, but it simplifies > > things because we don't have to synchronize against the resources being > > freed in CPU offline. > > > > The only case that would suffer from this AFAICT is if someone onlines > > many CPUs, uses them once, and then offline them and not use them again. > > I am not familiar with CPU hotplug use cases so I can't tell if that's > > something people do, but I am inclined to agree with this > > simplification. > > Thanks Yosry, for your code review comments! Good to know that this > simplification is acceptable. > > > > > > > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > > --- > > > mm/zswap.c | 273 +++++++++++++++++++++++++++++++++++-------------- > > ---- > > > 1 file changed, 182 insertions(+), 91 deletions(-) > > > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > index 10f2a16e7586..cff96df1df8b 100644 > > > --- a/mm/zswap.c > > > +++ b/mm/zswap.c > > > @@ -144,10 +144,12 @@ bool zswap_never_enabled(void) > > > struct crypto_acomp_ctx { > > > struct crypto_acomp *acomp; > > > struct acomp_req *req; > > > - struct crypto_wait wait; > > > > Is there a reason for moving this? If not please avoid unrelated changes. > > The reason is so that req/buffer, and reqs/buffers with batching, go together > logically, hence I found this easier to understand. I can restore this to the > original order, if that's preferable. I see. In that case, this fits better in the patch that actually adds support for having multiple requests and buffers, and please call it out explicitly in the commit message. > > > > > > u8 *buffer; > > > + u8 nr_reqs; > > > + struct crypto_wait wait; > > > struct mutex mutex; > > > bool is_sleepable; > > > + bool __online; > > > > I don't believe we need this. > > > > If we are not freeing resources during CPU offlining, then we do not > > need a CPU offline callback and acomp_ctx->__online serves no purpose. > > > > The whole point of synchronizing between offlining and > > compress/decompress operations is to avoid UAF. If offlining does not > > free resources, then we can hold the mutex directly in the > > compress/decompress path and drop the hotunplug callback completely. > > > > I also believe nr_reqs can be dropped from this patch, as it seems like > > it's only used know when to set __online. > > All great points! In fact, that was the original solution I had implemented > (not having an offline callback). But then, I spent some time understanding > the v6.13 hotfix for synchronizing freeing of resources, and this comment > in zswap_cpu_comp_prepare(): > > /* > * Only hold the mutex after completing allocations, otherwise we may > * recurse into zswap through reclaim and attempt to hold the mutex > * again resulting in a deadlock. > */ > > Hence, I figured the constraint of "recurse into zswap through reclaim" was > something to comprehend in the simplification (even though I had a tough > time imagining how this could happen). The constraint here is about zswap_cpu_comp_prepare() holding the mutex, making an allocation which internally triggers reclaim, then recursing into zswap and trying to hold the same mutex again causing a deadlock. If zswap_cpu_comp_prepare() does not need to hold the mutex to begin with, the constraint naturally goes away. > > Hence, I added the "bool __online" because zswap_cpu_comp_prepare() > does not acquire the mutex lock while allocating resources. We have already > initialized the mutex, so in theory, it is possible for compress/decompress > to acquire the mutex lock. The __online acts as a way to indicate whether > compress/decompress can proceed reliably to use the resources. For compress/decompress to acquire the mutex they need to run on that CPU, and I don't think that's possible before onlining completes, so zswap_cpu_comp_prepare() must have already completed before compress/decompress can use that CPU IIUC. > > The "nr_reqs" was needed as a way to distinguish between initial and > subsequent calls into zswap_cpu_comp_prepare(), for e.g., on a CPU that > goes through an online-offline-online sequence. In the initial onlining, > we need to allocate resources because nr_reqs=0. If resources are to > be allocated, we set acomp_ctx->nr_reqs and proceed to allocate > reqs/buffers/etc. In the subsequent onlining, we can quickly inspect > nr_reqs as being greater than 0 and return, thus avoiding any latency > delays before reclaim/page-faults can be handled on that CPU. > > Please let me know if this rationale seems reasonable for why > __online and nr_reqs were introduced. Based on what I said, I still don't believe they are needed, but please correct me if I am wrong. [..] > > I also see some ordering changes inside the function (e.g. we now > > allocate the request before the buffer). Not sure if these are > > intentional. If not, please keep the diff to the required changes only. > > The reason for this was, I am trying to organize the allocations based > on dependencies. Unless requests are allocated, there is no point in > allocating buffers. Please let me know if this is Ok. Please separate refactoring changes in general from functional changes because it makes code review harder. In this specific instance, I think moving the code is probably not worth it, as there's also no point in allocating requests if we cannot allocate buffers. In fact, since the buffers are larger, in theory their allocation is more likely to fail, so it makes since to do it first. Anyway, please propose such refactoring changes separately and they can be discussed as such. [..] > > > +static void zswap_cpu_comp_dealloc(unsigned int cpu, struct hlist_node > > *node) > > > +{ > > > + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, > > node); > > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > > >acomp_ctx, cpu); > > > + > > > + /* > > > + * The lifetime of acomp_ctx resources is from pool creation to > > > + * pool deletion. > > > + * > > > + * Reclaims should not be happening because, we get to this routine > > only > > > + * in two scenarios: > > > + * > > > + * 1) pool creation failures before/during the pool ref initialization. > > > + * 2) we are in the process of releasing the pool, it is off the > > > + * zswap_pools list and has no references. > > > + * > > > + * Hence, there is no need for locks. > > > + */ > > > + acomp_ctx->__online = false; > > > + acomp_ctx_dealloc(acomp_ctx); > > > > Since __online can be dropped, we can probably drop > > zswap_cpu_comp_dealloc() and call acomp_ctx_dealloc() directly? > > I suppose there is value in having a way in zswap to know for sure, that > resource allocation has completed, and it is safe for compress/decompress > to proceed. Especially because the mutex has been initialized before we > get to resource allocation. Would you agree? As I mentioned above, I believe compress/decompress cannot run on a CPU before the onlining completes. Please correct me if I am wrong. > > > > > > +} > > > + > > > static struct zswap_pool *zswap_pool_create(char *type, char > > *compressor) > > > { > > > struct zswap_pool *pool; > > > @@ -285,13 +403,21 @@ static struct zswap_pool > > *zswap_pool_create(char *type, char *compressor) > > > goto error; > > > } > > > > > > - for_each_possible_cpu(cpu) > > > - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex); > > > + for_each_possible_cpu(cpu) { > > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > > >acomp_ctx, cpu); > > > + > > > + acomp_ctx->acomp = NULL; > > > + acomp_ctx->req = NULL; > > > + acomp_ctx->buffer = NULL; > > > + acomp_ctx->__online = false; > > > + acomp_ctx->nr_reqs = 0; > > > > Why is this needed? Wouldn't zswap_cpu_comp_prepare() initialize them > > right away? > > Yes, I figured this is needed for two reasons: > > 1) For the error handling in zswap_cpu_comp_prepare() and calls into > zswap_cpu_comp_dealloc() to be handled by the common procedure > "acomp_ctx_dealloc()" unambiguously. This makes sense. When you move the refactoring to create acomp_ctx_dealloc() to a separate patch, please include this change in it and call it out explicitly in the commit message. > 2) The second scenario I thought of that would need this, is let's say > the zswap compressor is switched immediately after setting the > compressor. Some cores have executed the onlining code and > some haven't. Because there are no pool refs held, > zswap_cpu_comp_dealloc() would be called per-CPU. Hence, I figured > it would help to initialize these acomp_ctx members before the > hand-off to "cpuhp_state_add_instance()" in zswap_pool_create(). I believe cpuhp_state_add_instance() calls the onlining function synchronously on all present CPUs, so I don't think it's possible to end up in a state where the pool is being destroyed and some CPU executed zswap_cpu_comp_prepare() while others haven't. That being said, this made me think of a different problem. If pool destruction races with CPU onlining, there could be a race between zswap_cpu_comp_prepare() allocating resources and zswap_cpu_comp_dealloc() (or acomp_ctx_dealloc()) freeing them. I believe we must always call cpuhp_state_remove_instance() *before* freeing the resources to prevent this race from happening. This needs to be documented with a comment. Let me know if I missed something. > > Please let me know if these are valid considerations. > > > > > If it is in fact needed we should probably just use __GFP_ZERO. > > Sure. Are you suggesting I use "alloc_percpu_gfp()" instead of "alloc_percpu()" > for the acomp_ctx? Yeah if we need to initialize all/most fields to 0 let's use alloc_percpu_gfp() and pass GFP_KERNEL | __GFP_ZERO. [..] > > > @@ -902,16 +957,52 @@ static struct crypto_acomp_ctx > > *acomp_ctx_get_cpu_lock(struct zswap_pool *pool) > > > > > > for (;;) { > > > acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); > > > - mutex_lock(&acomp_ctx->mutex); > > > - if (likely(acomp_ctx->req)) > > > - return acomp_ctx; > > > /* > > > - * It is possible that we were migrated to a different CPU > > after > > > - * getting the per-CPU ctx but before the mutex was > > acquired. If > > > - * the old CPU got offlined, zswap_cpu_comp_dead() could > > have > > > - * already freed ctx->req (among other things) and set it to > > > - * NULL. Just try again on the new CPU that we ended up on. > > > + * If the CPU onlining code successfully allocates acomp_ctx > > resources, > > > + * it sets acomp_ctx->__online to true. Until this happens, we > > have > > > + * two options: > > > + * > > > + * 1. Return NULL and fail all stores on this CPU. > > > + * 2. Retry, until onlining has finished allocating resources. > > > + * > > > + * In theory, option 1 could be more appropriate, because it > > > + * allows the calling procedure to decide how it wants to > > handle > > > + * reclaim racing with CPU hotplug. For instance, it might be > > Ok > > > + * for compress to return an error for the backing swap device > > > + * to store the folio. Decompress could wait until we get a > > > + * valid and locked mutex after onlining has completed. For > > now, > > > + * we go with option 2 because adding a do-while in > > > + * zswap_decompress() adds latency for software > > compressors. > > > + * > > > + * Once initialized, the resources will be de-allocated only > > > + * when the pool is destroyed. The acomp_ctx will hold on to > > the > > > + * resources through CPU offlining/onlining at any time until > > > + * the pool is destroyed. > > > + * > > > + * This prevents races/deadlocks between reclaim and CPU > > acomp_ctx > > > + * resource allocation that are a dependency for reclaim. > > > + * It further simplifies the interaction with CPU onlining and > > > + * offlining: > > > + * > > > + * - CPU onlining does not take the mutex. It only allocates > > > + * resources and sets __online to true. > > > + * - CPU offlining acquires the mutex before setting > > > + * __online to false. If reclaim has acquired the mutex, > > > + * offlining will have to wait for reclaim to complete before > > > + * hotunplug can proceed. Further, hotplug merely sets > > > + * __online to false. It does not delete the acomp_ctx > > > + * resources. > > > + * > > > + * Option 1 is better than potentially not exiting the earlier > > > + * for (;;) loop because the system is running low on memory > > > + * and/or CPUs are getting offlined for whatever reason. At > > > + * least failing this store will prevent data loss by failing > > > + * zswap_store(), and saving the data in the backing swap > > device. > > > */ > > > > I believe we can dropped. I don't think we can have any store/load > > operations on a CPU before it's fully onlined, and we should always have > > a reference on the pool here, so the resources cannot go away. > > > > So unless I missed something we can drop this completely now and just > > hold the mutex directly in the load/store paths. > > Based on the above explanations, please let me know if it is a good idea > to keep the __online, or if you think further simplification is possible. I still think it's not needed. Let me know if I missed anything.
> -----Original Message----- > From: Yosry Ahmed <yosry.ahmed@linux.dev> > Sent: Friday, March 7, 2025 11:30 AM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com; > ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux- > crypto@vger.kernel.org; herbert@gondor.apana.org.au; > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org; > ebiggers@google.com; surenb@google.com; Accardi, Kristen C > <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; > Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource > allocation/deletion and mutex lock usage. > > On Fri, Mar 07, 2025 at 12:01:14AM +0000, Sridhar, Kanchana P wrote: > > > > > -----Original Message----- > > > From: Yosry Ahmed <yosry.ahmed@linux.dev> > > > Sent: Thursday, March 6, 2025 11:36 AM > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > > > hannes@cmpxchg.org; nphamcs@gmail.com; > chengming.zhou@linux.dev; > > > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com; > > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux- > > > crypto@vger.kernel.org; herbert@gondor.apana.org.au; > > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org; > > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C > > > <kristen.c.accardi@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; > > > Gopal, Vinodh <vinodh.gopal@intel.com> > > > Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource > > > allocation/deletion and mutex lock usage. > > > > > > On Mon, Mar 03, 2025 at 12:47:22AM -0800, Kanchana P Sridhar wrote: > > > > This patch modifies the acomp_ctx resources' lifetime to be from pool > > > > creation to deletion. A "bool __online" and "u8 nr_reqs" are added to > > > > "struct crypto_acomp_ctx" which simplify a few things: > > > > > > > > 1) zswap_pool_create() will initialize all members of each percpu > > > acomp_ctx > > > > to 0 or NULL and only then initialize the mutex. > > > > 2) CPU hotplug will set nr_reqs to 1, allocate resources and set __online > > > > to true, without locking the mutex. > > > > 3) CPU hotunplug will lock the mutex before setting __online to false. It > > > > will not delete any resources. > > > > 4) acomp_ctx_get_cpu_lock() will lock the mutex, then check if __online > > > > is true, and if so, return the mutex for use in zswap compress and > > > > decompress ops. > > > > 5) CPU onlining after offlining will simply check if either __online or > > > > nr_reqs are non-0, and return 0 if so, without re-allocating the > > > > resources. > > > > 6) zswap_pool_destroy() will call a newly added > zswap_cpu_comp_dealloc() > > > to > > > > delete the acomp_ctx resources. > > > > 7) Common resource deletion code in case of > zswap_cpu_comp_prepare() > > > > errors, and for use in zswap_cpu_comp_dealloc(), is factored into a > new > > > > acomp_ctx_dealloc(). > > > > > > > > The CPU hot[un]plug callback functions are moved to "pool functions" > > > > accordingly. > > > > > > > > The per-cpu memory cost of not deleting the acomp_ctx resources upon > > > CPU > > > > offlining, and only deleting them when the pool is destroyed, is as > follows: > > > > > > > > IAA with batching: 64.8 KB > > > > Software compressors: 8.2 KB > > > > > > > > I would appreciate code review comments on whether this memory cost > is > > > > acceptable, for the latency improvement that it provides due to a faster > > > > reclaim restart after a CPU hotunplug-hotplug sequence - all that the > > > > hotplug code needs to do is to check if acomp_ctx->nr_reqs is non-0, > and > > > > if so, set __online to true and return, and reclaim can proceed. > > > > > > I like the idea of allocating the resources on memory hotplug but > > > leaving them allocated until the pool is torn down. It avoids allocating > > > unnecessary memory if some CPUs are never onlined, but it simplifies > > > things because we don't have to synchronize against the resources being > > > freed in CPU offline. > > > > > > The only case that would suffer from this AFAICT is if someone onlines > > > many CPUs, uses them once, and then offline them and not use them > again. > > > I am not familiar with CPU hotplug use cases so I can't tell if that's > > > something people do, but I am inclined to agree with this > > > simplification. > > > > Thanks Yosry, for your code review comments! Good to know that this > > simplification is acceptable. > > > > > > > > > > > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > > > --- > > > > mm/zswap.c | 273 +++++++++++++++++++++++++++++++++++---------- > ---- > > > ---- > > > > 1 file changed, 182 insertions(+), 91 deletions(-) > > > > > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > > index 10f2a16e7586..cff96df1df8b 100644 > > > > --- a/mm/zswap.c > > > > +++ b/mm/zswap.c > > > > @@ -144,10 +144,12 @@ bool zswap_never_enabled(void) > > > > struct crypto_acomp_ctx { > > > > struct crypto_acomp *acomp; > > > > struct acomp_req *req; > > > > - struct crypto_wait wait; > > > > > > Is there a reason for moving this? If not please avoid unrelated changes. > > > > The reason is so that req/buffer, and reqs/buffers with batching, go together > > logically, hence I found this easier to understand. I can restore this to the > > original order, if that's preferable. > > I see. In that case, this fits better in the patch that actually adds > support for having multiple requests and buffers, and please call it out > explicitly in the commit message. Thanks Yosry, for the follow up comments. Sure, this makes sense. > > > > > > > > > > u8 *buffer; > > > > + u8 nr_reqs; > > > > + struct crypto_wait wait; > > > > struct mutex mutex; > > > > bool is_sleepable; > > > > + bool __online; > > > > > > I don't believe we need this. > > > > > > If we are not freeing resources during CPU offlining, then we do not > > > need a CPU offline callback and acomp_ctx->__online serves no purpose. > > > > > > The whole point of synchronizing between offlining and > > > compress/decompress operations is to avoid UAF. If offlining does not > > > free resources, then we can hold the mutex directly in the > > > compress/decompress path and drop the hotunplug callback completely. > > > > > > I also believe nr_reqs can be dropped from this patch, as it seems like > > > it's only used know when to set __online. > > > > All great points! In fact, that was the original solution I had implemented > > (not having an offline callback). But then, I spent some time understanding > > the v6.13 hotfix for synchronizing freeing of resources, and this comment > > in zswap_cpu_comp_prepare(): > > > > /* > > * Only hold the mutex after completing allocations, otherwise we > may > > * recurse into zswap through reclaim and attempt to hold the mutex > > * again resulting in a deadlock. > > */ > > > > Hence, I figured the constraint of "recurse into zswap through reclaim" was > > something to comprehend in the simplification (even though I had a tough > > time imagining how this could happen). > > The constraint here is about zswap_cpu_comp_prepare() holding the mutex, > making an allocation which internally triggers reclaim, then recursing > into zswap and trying to hold the same mutex again causing a deadlock. > > If zswap_cpu_comp_prepare() does not need to hold the mutex to begin > with, the constraint naturally goes away. Actually, if it is possible for the allocations in zswap_cpu_comp_prepare() to trigger reclaim, then I believe we need some way for reclaim to know if the acomp_ctx resources are available. Hence, this seems like a potential for deadlock regardless of the mutex. I verified that all the zswap_cpu_comp_prepare() allocations are done with GFP_KERNEL, which implicitly allows direct reclaim. So this appears to be a risk for deadlock between zswap_compress() and zswap_cpu_comp_prepare() in general, i.e., aside from this patchset. I can think of the following options to resolve this, and would welcome other suggestions: 1) Less intrusive: acomp_ctx_get_cpu_lock() should get the mutex, check if acomp_ctx->__online is true, and if so, return the mutex. If acomp_ctx->__online is false, then it returns NULL. In other words, we don't have the for loop. - This will cause recursions into direct reclaim from zswap_cpu_comp_prepare() to fail, cpuhotplug to fail. However, there is no deadlock. - zswap_compress() will need to detect NULL returned by acomp_ctx_get_cpu_lock(), and return an error. - zswap_decompress() will need a BUG_ON(!acomp_ctx) after calling acomp_ctx_get_cpu_lock(). - We won't be migrated to a different CPU because we hold the mutex, hence zswap_cpu_comp_dead() will wait on the mutex. 2) More intrusive: We would need to use a gfp_t that prevents direct reclaim and kswapd, i.e., something similar to GFP_TRANSHUGE_LIGHT in gfp_types.h, but for non-THP allocations. If we decide to adopt this approach, we would need changes in include/crypto/acompress.h, crypto/api.c, and crypto/acompress.c to allow crypto_create_tfm_node() to call crypto_alloc_tfmmem() with this new gfp_t, in lieu of GFP_KERNEL. > > > > > Hence, I added the "bool __online" because zswap_cpu_comp_prepare() > > does not acquire the mutex lock while allocating resources. We have > already > > initialized the mutex, so in theory, it is possible for compress/decompress > > to acquire the mutex lock. The __online acts as a way to indicate whether > > compress/decompress can proceed reliably to use the resources. > > For compress/decompress to acquire the mutex they need to run on that > CPU, and I don't think that's possible before onlining completes, so > zswap_cpu_comp_prepare() must have already completed before > compress/decompress can use that CPU IIUC. If we can make this assumption, that would be great! However, I am not totally sure because of the GFP_KERNEL allocations in zswap_cpu_comp_prepare(). > > > > > The "nr_reqs" was needed as a way to distinguish between initial and > > subsequent calls into zswap_cpu_comp_prepare(), for e.g., on a CPU that > > goes through an online-offline-online sequence. In the initial onlining, > > we need to allocate resources because nr_reqs=0. If resources are to > > be allocated, we set acomp_ctx->nr_reqs and proceed to allocate > > reqs/buffers/etc. In the subsequent onlining, we can quickly inspect > > nr_reqs as being greater than 0 and return, thus avoiding any latency > > delays before reclaim/page-faults can be handled on that CPU. > > > > Please let me know if this rationale seems reasonable for why > > __online and nr_reqs were introduced. > > Based on what I said, I still don't believe they are needed, but please > correct me if I am wrong. Same comments as above. > > [..] > > > I also see some ordering changes inside the function (e.g. we now > > > allocate the request before the buffer). Not sure if these are > > > intentional. If not, please keep the diff to the required changes only. > > > > The reason for this was, I am trying to organize the allocations based > > on dependencies. Unless requests are allocated, there is no point in > > allocating buffers. Please let me know if this is Ok. > > Please separate refactoring changes in general from functional changes > because it makes code review harder. Sure, I will do so. > > In this specific instance, I think moving the code is probably not worth > it, as there's also no point in allocating requests if we cannot > allocate buffers. In fact, since the buffers are larger, in theory their > allocation is more likely to fail, so it makes since to do it first. Understood, makes better sense than allocating the requests first. > > Anyway, please propose such refactoring changes separately and they can > be discussed as such. Ok. > > [..] > > > > +static void zswap_cpu_comp_dealloc(unsigned int cpu, struct > hlist_node > > > *node) > > > > +{ > > > > + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, > > > node); > > > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > > > >acomp_ctx, cpu); > > > > + > > > > + /* > > > > + * The lifetime of acomp_ctx resources is from pool creation to > > > > + * pool deletion. > > > > + * > > > > + * Reclaims should not be happening because, we get to this routine > > > only > > > > + * in two scenarios: > > > > + * > > > > + * 1) pool creation failures before/during the pool ref initialization. > > > > + * 2) we are in the process of releasing the pool, it is off the > > > > + * zswap_pools list and has no references. > > > > + * > > > > + * Hence, there is no need for locks. > > > > + */ > > > > + acomp_ctx->__online = false; > > > > + acomp_ctx_dealloc(acomp_ctx); > > > > > > Since __online can be dropped, we can probably drop > > > zswap_cpu_comp_dealloc() and call acomp_ctx_dealloc() directly? > > > > I suppose there is value in having a way in zswap to know for sure, that > > resource allocation has completed, and it is safe for compress/decompress > > to proceed. Especially because the mutex has been initialized before we > > get to resource allocation. Would you agree? > > As I mentioned above, I believe compress/decompress cannot run on a CPU > before the onlining completes. Please correct me if I am wrong. > > > > > > > > > > +} > > > > + > > > > static struct zswap_pool *zswap_pool_create(char *type, char > > > *compressor) > > > > { > > > > struct zswap_pool *pool; > > > > @@ -285,13 +403,21 @@ static struct zswap_pool > > > *zswap_pool_create(char *type, char *compressor) > > > > goto error; > > > > } > > > > > > > > - for_each_possible_cpu(cpu) > > > > - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex); > > > > + for_each_possible_cpu(cpu) { > > > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > > > >acomp_ctx, cpu); > > > > + > > > > + acomp_ctx->acomp = NULL; > > > > + acomp_ctx->req = NULL; > > > > + acomp_ctx->buffer = NULL; > > > > + acomp_ctx->__online = false; > > > > + acomp_ctx->nr_reqs = 0; > > > > > > Why is this needed? Wouldn't zswap_cpu_comp_prepare() initialize them > > > right away? > > > > Yes, I figured this is needed for two reasons: > > > > 1) For the error handling in zswap_cpu_comp_prepare() and calls into > > zswap_cpu_comp_dealloc() to be handled by the common procedure > > "acomp_ctx_dealloc()" unambiguously. > > This makes sense. When you move the refactoring to create > acomp_ctx_dealloc() to a separate patch, please include this change in > it and call it out explicitly in the commit message. Sure. > > > 2) The second scenario I thought of that would need this, is let's say > > the zswap compressor is switched immediately after setting the > > compressor. Some cores have executed the onlining code and > > some haven't. Because there are no pool refs held, > > zswap_cpu_comp_dealloc() would be called per-CPU. Hence, I figured > > it would help to initialize these acomp_ctx members before the > > hand-off to "cpuhp_state_add_instance()" in zswap_pool_create(). > > I believe cpuhp_state_add_instance() calls the onlining function > synchronously on all present CPUs, so I don't think it's possible to end > up in a state where the pool is being destroyed and some CPU executed > zswap_cpu_comp_prepare() while others haven't. I looked at the cpuhotplug code some more. The startup callback is invoked sequentially for_each_present_cpu(). If an error occurs for any one of them, it calls the teardown callback only on the earlier cores that have already finished running the startup callback. However, zswap_cpu_comp_dealloc() will be called for all cores, even the ones for which the startup callback was not run. Hence, I believe the zero initialization is useful, albeit using alloc_percpu_gfp(__GFP_ZERO) to allocate the acomp_ctx. > > That being said, this made me think of a different problem. If pool > destruction races with CPU onlining, there could be a race between > zswap_cpu_comp_prepare() allocating resources and > zswap_cpu_comp_dealloc() (or acomp_ctx_dealloc()) freeing them. > > I believe we must always call cpuhp_state_remove_instance() *before* > freeing the resources to prevent this race from happening. This needs to > be documented with a comment. Yes, this race condition is possible, thanks for catching this! The problem with calling cpuhp_state_remove_instance() before freeing the resources is that cpuhp_state_add_instance() and cpuhp_state_remove_instance() both acquire a "mutex_lock(&cpuhp_state_mutex);" at the beginning; and hence are serialized. For the reasons motivating why acomp_ctx->__online is set to false in zswap_cpu_comp_dead(), I cannot call cpuhp_state_remove_instance() before calling acomp_ctx_dealloc() because the latter could wait until acomp_ctx->__online to be true before deleting the resources. I will think about this some more. Another possibility is to not rely on cpuhotplug in zswap, and instead manage the per-cpu acomp_ctx resource allocation entirely in zswap_pool_create(), and deletion entirely in zswap_pool_destroy(), along with the necessary error handling. Let me think about this some more as well. > > Let me know if I missed something. > > > > > Please let me know if these are valid considerations. > > > > > > > > If it is in fact needed we should probably just use __GFP_ZERO. > > > > Sure. Are you suggesting I use "alloc_percpu_gfp()" instead of > "alloc_percpu()" > > for the acomp_ctx? > > Yeah if we need to initialize all/most fields to 0 let's use > alloc_percpu_gfp() and pass GFP_KERNEL | __GFP_ZERO. Sounds good. > > [..] > > > > @@ -902,16 +957,52 @@ static struct crypto_acomp_ctx > > > *acomp_ctx_get_cpu_lock(struct zswap_pool *pool) > > > > > > > > for (;;) { > > > > acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); > > > > - mutex_lock(&acomp_ctx->mutex); > > > > - if (likely(acomp_ctx->req)) > > > > - return acomp_ctx; > > > > /* > > > > - * It is possible that we were migrated to a different CPU > > > after > > > > - * getting the per-CPU ctx but before the mutex was > > > acquired. If > > > > - * the old CPU got offlined, zswap_cpu_comp_dead() could > > > have > > > > - * already freed ctx->req (among other things) and set it to > > > > - * NULL. Just try again on the new CPU that we ended up on. > > > > + * If the CPU onlining code successfully allocates acomp_ctx > > > resources, > > > > + * it sets acomp_ctx->__online to true. Until this happens, we > > > have > > > > + * two options: > > > > + * > > > > + * 1. Return NULL and fail all stores on this CPU. > > > > + * 2. Retry, until onlining has finished allocating resources. > > > > + * > > > > + * In theory, option 1 could be more appropriate, because it > > > > + * allows the calling procedure to decide how it wants to > > > handle > > > > + * reclaim racing with CPU hotplug. For instance, it might be > > > Ok > > > > + * for compress to return an error for the backing swap device > > > > + * to store the folio. Decompress could wait until we get a > > > > + * valid and locked mutex after onlining has completed. For > > > now, > > > > + * we go with option 2 because adding a do-while in > > > > + * zswap_decompress() adds latency for software > > > compressors. > > > > + * > > > > + * Once initialized, the resources will be de-allocated only > > > > + * when the pool is destroyed. The acomp_ctx will hold on to > > > the > > > > + * resources through CPU offlining/onlining at any time until > > > > + * the pool is destroyed. > > > > + * > > > > + * This prevents races/deadlocks between reclaim and CPU > > > acomp_ctx > > > > + * resource allocation that are a dependency for reclaim. > > > > + * It further simplifies the interaction with CPU onlining and > > > > + * offlining: > > > > + * > > > > + * - CPU onlining does not take the mutex. It only allocates > > > > + * resources and sets __online to true. > > > > + * - CPU offlining acquires the mutex before setting > > > > + * __online to false. If reclaim has acquired the mutex, > > > > + * offlining will have to wait for reclaim to complete before > > > > + * hotunplug can proceed. Further, hotplug merely sets > > > > + * __online to false. It does not delete the acomp_ctx > > > > + * resources. > > > > + * > > > > + * Option 1 is better than potentially not exiting the earlier > > > > + * for (;;) loop because the system is running low on memory > > > > + * and/or CPUs are getting offlined for whatever reason. At > > > > + * least failing this store will prevent data loss by failing > > > > + * zswap_store(), and saving the data in the backing swap > > > device. > > > > */ > > > > > > I believe we can dropped. I don't think we can have any store/load > > > operations on a CPU before it's fully onlined, and we should always have > > > a reference on the pool here, so the resources cannot go away. > > > > > > So unless I missed something we can drop this completely now and just > > > hold the mutex directly in the load/store paths. > > > > Based on the above explanations, please let me know if it is a good idea > > to keep the __online, or if you think further simplification is possible. > > I still think it's not needed. Let me know if I missed anything. Let me think some more about whether it is feasible to not have cpuhotplug manage the acomp_ctx resource allocation, and instead have this be done through the pool creation/deletion routines. Thanks, Kanchana
On Sat, Mar 08, 2025 at 02:47:15AM +0000, Sridhar, Kanchana P wrote: > [..] > > > > > u8 *buffer; > > > > > + u8 nr_reqs; > > > > > + struct crypto_wait wait; > > > > > struct mutex mutex; > > > > > bool is_sleepable; > > > > > + bool __online; > > > > > > > > I don't believe we need this. > > > > > > > > If we are not freeing resources during CPU offlining, then we do not > > > > need a CPU offline callback and acomp_ctx->__online serves no purpose. > > > > > > > > The whole point of synchronizing between offlining and > > > > compress/decompress operations is to avoid UAF. If offlining does not > > > > free resources, then we can hold the mutex directly in the > > > > compress/decompress path and drop the hotunplug callback completely. > > > > > > > > I also believe nr_reqs can be dropped from this patch, as it seems like > > > > it's only used know when to set __online. > > > > > > All great points! In fact, that was the original solution I had implemented > > > (not having an offline callback). But then, I spent some time understanding > > > the v6.13 hotfix for synchronizing freeing of resources, and this comment > > > in zswap_cpu_comp_prepare(): > > > > > > /* > > > * Only hold the mutex after completing allocations, otherwise we > > may > > > * recurse into zswap through reclaim and attempt to hold the mutex > > > * again resulting in a deadlock. > > > */ > > > > > > Hence, I figured the constraint of "recurse into zswap through reclaim" was > > > something to comprehend in the simplification (even though I had a tough > > > time imagining how this could happen). > > > > The constraint here is about zswap_cpu_comp_prepare() holding the mutex, > > making an allocation which internally triggers reclaim, then recursing > > into zswap and trying to hold the same mutex again causing a deadlock. > > > > If zswap_cpu_comp_prepare() does not need to hold the mutex to begin > > with, the constraint naturally goes away. > > Actually, if it is possible for the allocations in zswap_cpu_comp_prepare() > to trigger reclaim, then I believe we need some way for reclaim to know if > the acomp_ctx resources are available. Hence, this seems like a potential > for deadlock regardless of the mutex. I took a closer look and I believe my hotfix was actually unnecessary. I sent it out in response to a syzbot report, but upon closer look it seems like it was not an actual problem. Sorry if my patch confused you. Looking at enum cpuhp_state in include/linux/cpuhotplug.h, it seems like CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section. The comment above says: * PREPARE: The callbacks are invoked on a control CPU before the * hotplugged CPU is started up or after the hotplugged CPU has died. So even if we go into reclaim during zswap_cpu_comp_prepare(), it will never be on the CPU that we are allocating resources for. The other case where zswap_cpu_comp_prepare() could race with compression/decompression is when a pool is being created. In this case, reclaim from zswap_cpu_comp_prepare() can recurse into zswap on the same CPU AFAICT. However, because the pool is still under creation, it will not be used (i.e. zswap_pool_current_get() won't find it). So I think we don't need to worry about zswap_cpu_comp_prepare() racing with compression or decompression for the same pool and CPU. > > I verified that all the zswap_cpu_comp_prepare() allocations are done with > GFP_KERNEL, which implicitly allows direct reclaim. So this appears to be a > risk for deadlock between zswap_compress() and zswap_cpu_comp_prepare() > in general, i.e., aside from this patchset. > > I can think of the following options to resolve this, and would welcome > other suggestions: > > 1) Less intrusive: acomp_ctx_get_cpu_lock() should get the mutex, check > if acomp_ctx->__online is true, and if so, return the mutex. If > acomp_ctx->__online is false, then it returns NULL. In other words, we > don't have the for loop. > - This will cause recursions into direct reclaim from zswap_cpu_comp_prepare() > to fail, cpuhotplug to fail. However, there is no deadlock. > - zswap_compress() will need to detect NULL returned by > acomp_ctx_get_cpu_lock(), and return an error. > - zswap_decompress() will need a BUG_ON(!acomp_ctx) after calling > acomp_ctx_get_cpu_lock(). > - We won't be migrated to a different CPU because we hold the mutex, hence > zswap_cpu_comp_dead() will wait on the mutex. > > 2) More intrusive: We would need to use a gfp_t that prevents direct reclaim > and kswapd, i.e., something similar to GFP_TRANSHUGE_LIGHT in gfp_types.h, > but for non-THP allocations. If we decide to adopt this approach, we would > need changes in include/crypto/acompress.h, crypto/api.c, and crypto/acompress.c > to allow crypto_create_tfm_node() to call crypto_alloc_tfmmem() with this > new gfp_t, in lieu of GFP_KERNEL. > > > > > > > > > Hence, I added the "bool __online" because zswap_cpu_comp_prepare() > > > does not acquire the mutex lock while allocating resources. We have > > already > > > initialized the mutex, so in theory, it is possible for compress/decompress > > > to acquire the mutex lock. The __online acts as a way to indicate whether > > > compress/decompress can proceed reliably to use the resources. > > > > For compress/decompress to acquire the mutex they need to run on that > > CPU, and I don't think that's possible before onlining completes, so > > zswap_cpu_comp_prepare() must have already completed before > > compress/decompress can use that CPU IIUC. > > If we can make this assumption, that would be great! However, I am not > totally sure because of the GFP_KERNEL allocations in > zswap_cpu_comp_prepare(). As I mentioned above, when zswap_cpu_comp_prepare() is run we are in one of two situations: - The pool is under creation, so we cannot race with stores/loads from that same pool. - The CPU is being onlined, in which case zswap_cpu_comp_prepare() is called from a control CPU before tasks start running on the CPU being onlined. Please correct me if I am wrong. [..] > > > > > @@ -285,13 +403,21 @@ static struct zswap_pool > > > > *zswap_pool_create(char *type, char *compressor) > > > > > goto error; > > > > > } > > > > > > > > > > - for_each_possible_cpu(cpu) > > > > > - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex); > > > > > + for_each_possible_cpu(cpu) { > > > > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool- > > > > >acomp_ctx, cpu); > > > > > + > > > > > + acomp_ctx->acomp = NULL; > > > > > + acomp_ctx->req = NULL; > > > > > + acomp_ctx->buffer = NULL; > > > > > + acomp_ctx->__online = false; > > > > > + acomp_ctx->nr_reqs = 0; > > > > > > > > Why is this needed? Wouldn't zswap_cpu_comp_prepare() initialize them > > > > right away? > > > > > > Yes, I figured this is needed for two reasons: > > > > > > 1) For the error handling in zswap_cpu_comp_prepare() and calls into > > > zswap_cpu_comp_dealloc() to be handled by the common procedure > > > "acomp_ctx_dealloc()" unambiguously. > > > > This makes sense. When you move the refactoring to create > > acomp_ctx_dealloc() to a separate patch, please include this change in > > it and call it out explicitly in the commit message. > > Sure. > > > > > > 2) The second scenario I thought of that would need this, is let's say > > > the zswap compressor is switched immediately after setting the > > > compressor. Some cores have executed the onlining code and > > > some haven't. Because there are no pool refs held, > > > zswap_cpu_comp_dealloc() would be called per-CPU. Hence, I figured > > > it would help to initialize these acomp_ctx members before the > > > hand-off to "cpuhp_state_add_instance()" in zswap_pool_create(). > > > > I believe cpuhp_state_add_instance() calls the onlining function > > synchronously on all present CPUs, so I don't think it's possible to end > > up in a state where the pool is being destroyed and some CPU executed > > zswap_cpu_comp_prepare() while others haven't. > > I looked at the cpuhotplug code some more. The startup callback is > invoked sequentially for_each_present_cpu(). If an error occurs for any > one of them, it calls the teardown callback only on the earlier cores that > have already finished running the startup callback. However, > zswap_cpu_comp_dealloc() will be called for all cores, even the ones > for which the startup callback was not run. Hence, I believe the > zero initialization is useful, albeit using alloc_percpu_gfp(__GFP_ZERO) > to allocate the acomp_ctx. Yeah this is point (1) above IIUC, and I agree about zero initialization for that. > > > > > That being said, this made me think of a different problem. If pool > > destruction races with CPU onlining, there could be a race between > > zswap_cpu_comp_prepare() allocating resources and > > zswap_cpu_comp_dealloc() (or acomp_ctx_dealloc()) freeing them. > > > > I believe we must always call cpuhp_state_remove_instance() *before* > > freeing the resources to prevent this race from happening. This needs to > > be documented with a comment. > > Yes, this race condition is possible, thanks for catching this! The problem with > calling cpuhp_state_remove_instance() before freeing the resources is that > cpuhp_state_add_instance() and cpuhp_state_remove_instance() both > acquire a "mutex_lock(&cpuhp_state_mutex);" at the beginning; and hence > are serialized. > > For the reasons motivating why acomp_ctx->__online is set to false in > zswap_cpu_comp_dead(), I cannot call cpuhp_state_remove_instance() > before calling acomp_ctx_dealloc() because the latter could wait until > acomp_ctx->__online to be true before deleting the resources. I will > think about this some more. I believe this problem goes away with acomp_ctx->__online going away, right? > > Another possibility is to not rely on cpuhotplug in zswap, and instead > manage the per-cpu acomp_ctx resource allocation entirely in > zswap_pool_create(), and deletion entirely in zswap_pool_destroy(), > along with the necessary error handling. Let me think about this some > more as well. > > > > > Let me know if I missed something. > > > > > > > > Please let me know if these are valid considerations. > > > > > > > > > > > If it is in fact needed we should probably just use __GFP_ZERO. > > > > > > Sure. Are you suggesting I use "alloc_percpu_gfp()" instead of > > "alloc_percpu()" > > > for the acomp_ctx? > > > > Yeah if we need to initialize all/most fields to 0 let's use > > alloc_percpu_gfp() and pass GFP_KERNEL | __GFP_ZERO. > > Sounds good. > > > > > [..] > > > > > @@ -902,16 +957,52 @@ static struct crypto_acomp_ctx > > > > *acomp_ctx_get_cpu_lock(struct zswap_pool *pool) > > > > > > > > > > for (;;) { > > > > > acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); > > > > > - mutex_lock(&acomp_ctx->mutex); > > > > > - if (likely(acomp_ctx->req)) > > > > > - return acomp_ctx; > > > > > /* > > > > > - * It is possible that we were migrated to a different CPU > > > > after > > > > > - * getting the per-CPU ctx but before the mutex was > > > > acquired. If > > > > > - * the old CPU got offlined, zswap_cpu_comp_dead() could > > > > have > > > > > - * already freed ctx->req (among other things) and set it to > > > > > - * NULL. Just try again on the new CPU that we ended up on. > > > > > + * If the CPU onlining code successfully allocates acomp_ctx > > > > resources, > > > > > + * it sets acomp_ctx->__online to true. Until this happens, we > > > > have > > > > > + * two options: > > > > > + * > > > > > + * 1. Return NULL and fail all stores on this CPU. > > > > > + * 2. Retry, until onlining has finished allocating resources. > > > > > + * > > > > > + * In theory, option 1 could be more appropriate, because it > > > > > + * allows the calling procedure to decide how it wants to > > > > handle > > > > > + * reclaim racing with CPU hotplug. For instance, it might be > > > > Ok > > > > > + * for compress to return an error for the backing swap device > > > > > + * to store the folio. Decompress could wait until we get a > > > > > + * valid and locked mutex after onlining has completed. For > > > > now, > > > > > + * we go with option 2 because adding a do-while in > > > > > + * zswap_decompress() adds latency for software > > > > compressors. > > > > > + * > > > > > + * Once initialized, the resources will be de-allocated only > > > > > + * when the pool is destroyed. The acomp_ctx will hold on to > > > > the > > > > > + * resources through CPU offlining/onlining at any time until > > > > > + * the pool is destroyed. > > > > > + * > > > > > + * This prevents races/deadlocks between reclaim and CPU > > > > acomp_ctx > > > > > + * resource allocation that are a dependency for reclaim. > > > > > + * It further simplifies the interaction with CPU onlining and > > > > > + * offlining: > > > > > + * > > > > > + * - CPU onlining does not take the mutex. It only allocates > > > > > + * resources and sets __online to true. > > > > > + * - CPU offlining acquires the mutex before setting > > > > > + * __online to false. If reclaim has acquired the mutex, > > > > > + * offlining will have to wait for reclaim to complete before > > > > > + * hotunplug can proceed. Further, hotplug merely sets > > > > > + * __online to false. It does not delete the acomp_ctx > > > > > + * resources. > > > > > + * > > > > > + * Option 1 is better than potentially not exiting the earlier > > > > > + * for (;;) loop because the system is running low on memory > > > > > + * and/or CPUs are getting offlined for whatever reason. At > > > > > + * least failing this store will prevent data loss by failing > > > > > + * zswap_store(), and saving the data in the backing swap > > > > device. > > > > > */ > > > > > > > > I believe we can dropped. I don't think we can have any store/load > > > > operations on a CPU before it's fully onlined, and we should always have > > > > a reference on the pool here, so the resources cannot go away. > > > > > > > > So unless I missed something we can drop this completely now and just > > > > hold the mutex directly in the load/store paths. > > > > > > Based on the above explanations, please let me know if it is a good idea > > > to keep the __online, or if you think further simplification is possible. > > > > I still think it's not needed. Let me know if I missed anything. > > Let me think some more about whether it is feasible to not have cpuhotplug > manage the acomp_ctx resource allocation, and instead have this be done > through the pool creation/deletion routines. I don't think this is necessary, see my comments above.
On Mon, Mar 17, 2025 at 09:15:09PM +0000, Sridhar, Kanchana P wrote: > > The problem however is that, in the current architecture, CPU onlining/ > zswap_pool creation, and CPU offlining/zswap_pool deletion have the > same semantics as far as these resources are concerned. Hence, although > zswap_cpu_comp_prepare() is run on a control CPU, the CPU for which > the "hotplug" code is called is in fact online. It is possible for the memory > allocation calls in zswap_cpu_comp_prepare() to recurse into > zswap_compress(), which now needs to be handled by the current pool, > since the new pool has not yet been added to the zswap_pools, as you > pointed out. Please hold onto your patch-set for a while because I intend to get rid of the per-cpu pool in zswap after conversion to acomp. There is no reason to have a per-cpu pool at all, except for the stream memory used by the algorithm. Since we've already moved that into the Crypto API for acomp, this means zswap no longer needs to have any per-cpu pools. In fact, with LZO decompression it could go all the way through with no per-cpu resources at all. Cheers,
> -----Original Message----- > From: Yosry Ahmed <yosry.ahmed@linux.dev> > Sent: Tuesday, March 18, 2025 7:24 AM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com; > ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux- > crypto@vger.kernel.org; herbert@gondor.apana.org.au; > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org; > ebiggers@google.com; surenb@google.com; Accardi, Kristen C > <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; > Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource > allocation/deletion and mutex lock usage. > > On Mon, Mar 17, 2025 at 09:15:09PM +0000, Sridhar, Kanchana P wrote: > > > > > -----Original Message----- > > > From: Yosry Ahmed <yosry.ahmed@linux.dev> > > > Sent: Monday, March 10, 2025 10:31 AM > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > > > hannes@cmpxchg.org; nphamcs@gmail.com; > chengming.zhou@linux.dev; > > > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com; > > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux- > > > crypto@vger.kernel.org; herbert@gondor.apana.org.au; > > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org; > > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C > > > <kristen.c.accardi@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; > > > Gopal, Vinodh <vinodh.gopal@intel.com> > > > Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource > > > allocation/deletion and mutex lock usage. > > > > > > On Sat, Mar 08, 2025 at 02:47:15AM +0000, Sridhar, Kanchana P wrote: > > > > > > > [..] > > > > > > > > u8 *buffer; > > > > > > > > + u8 nr_reqs; > > > > > > > > + struct crypto_wait wait; > > > > > > > > struct mutex mutex; > > > > > > > > bool is_sleepable; > > > > > > > > + bool __online; > > > > > > > > > > > > > > I don't believe we need this. > > > > > > > > > > > > > > If we are not freeing resources during CPU offlining, then we do > not > > > > > > > need a CPU offline callback and acomp_ctx->__online serves no > > > purpose. > > > > > > > > > > > > > > The whole point of synchronizing between offlining and > > > > > > > compress/decompress operations is to avoid UAF. If offlining does > not > > > > > > > free resources, then we can hold the mutex directly in the > > > > > > > compress/decompress path and drop the hotunplug callback > > > completely. > > > > > > > > > > > > > > I also believe nr_reqs can be dropped from this patch, as it seems > like > > > > > > > it's only used know when to set __online. > > > > > > > > > > > > All great points! In fact, that was the original solution I had > implemented > > > > > > (not having an offline callback). But then, I spent some time > > > understanding > > > > > > the v6.13 hotfix for synchronizing freeing of resources, and this > comment > > > > > > in zswap_cpu_comp_prepare(): > > > > > > > > > > > > /* > > > > > > * Only hold the mutex after completing allocations, > otherwise we > > > > > may > > > > > > * recurse into zswap through reclaim and attempt to hold the > mutex > > > > > > * again resulting in a deadlock. > > > > > > */ > > > > > > > > > > > > Hence, I figured the constraint of "recurse into zswap through > reclaim" > > > was > > > > > > something to comprehend in the simplification (even though I had a > > > tough > > > > > > time imagining how this could happen). > > > > > > > > > > The constraint here is about zswap_cpu_comp_prepare() holding the > > > mutex, > > > > > making an allocation which internally triggers reclaim, then recursing > > > > > into zswap and trying to hold the same mutex again causing a > deadlock. > > > > > > > > > > If zswap_cpu_comp_prepare() does not need to hold the mutex to > begin > > > > > with, the constraint naturally goes away. > > > > > > > > Actually, if it is possible for the allocations in > zswap_cpu_comp_prepare() > > > > to trigger reclaim, then I believe we need some way for reclaim to know > if > > > > the acomp_ctx resources are available. Hence, this seems like a > potential > > > > for deadlock regardless of the mutex. > > > > > > I took a closer look and I believe my hotfix was actually unnecessary. I > > > sent it out in response to a syzbot report, but upon closer look it > > > seems like it was not an actual problem. Sorry if my patch confused you. > > > > > > Looking at enum cpuhp_state in include/linux/cpuhotplug.h, it seems like > > > CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section. The > comment > > > above > > > says: > > > > > > * PREPARE: The callbacks are invoked on a control CPU before the > > > * hotplugged CPU is started up or after the hotplugged CPU has died. > > > > > > So even if we go into reclaim during zswap_cpu_comp_prepare(), it will > > > never be on the CPU that we are allocating resources for. > > > > > > The other case where zswap_cpu_comp_prepare() could race with > > > compression/decompression is when a pool is being created. In this case, > > > reclaim from zswap_cpu_comp_prepare() can recurse into zswap on the > > > same > > > CPU AFAICT. However, because the pool is still under creation, it will > > > not be used (i.e. zswap_pool_current_get() won't find it). > > > > > > So I think we don't need to worry about zswap_cpu_comp_prepare() > racing > > > with compression or decompression for the same pool and CPU. > > > > Thanks Yosry, for this observation! You are right, when considered purely > > from a CPU hotplug perspective, zswap_cpu_comp_prepare() and > > zswap_cpu_comp_dead() in fact run on a control CPU, because the state is > > registered in the PREPARE section of "enum cpuhp_state" in cpuhotplug.h. > > > > The problem however is that, in the current architecture, CPU onlining/ > > zswap_pool creation, and CPU offlining/zswap_pool deletion have the > > same semantics as far as these resources are concerned. Hence, although > > zswap_cpu_comp_prepare() is run on a control CPU, the CPU for which > > the "hotplug" code is called is in fact online. It is possible for the memory > > allocation calls in zswap_cpu_comp_prepare() to recurse into > > zswap_compress(), which now needs to be handled by the current pool, > > since the new pool has not yet been added to the zswap_pools, as you > > pointed out. > > > > The ref on the current pool has not yet been dropped. Could there be > > a potential for a deadlock at pool transition time: the new pool is blocked > > from allocating acomp_ctx resources, triggering reclaim, which the old > > pool needs to handle? > > I am not sure how this could lead to a deadlock. The compression will be > happening in a different pool with a different acomp_ctx. I was thinking about this from the perspective of comparing the trade-offs between these two approaches: a) Allocating acomp_ctx resources for a pool when a CPU is functional, vs. b) Allocating acomp_ctx resources once upfront. With (a), when the user switches zswap to use a new compressor, it is possible that the system is already in a low memory situation and the CPU could be doing a lot of swapouts. It occurred to me that in theory, the call to switch the compressor through the sysfs interface could never return if the acomp_ctx allocations trigger direct reclaim in this scenario. This was in the context of exploring if a better design is possible, while acknowledging that this could still happen today. With (b), this situation is avoided by design, and we can switch to a new pool without triggering additional reclaim. Sorry, I should have articulated this better. > > > > > I see other places in the kernel that use CPU hotplug for resource allocation, > > outside of the context of CPU onlining. IIUC, it is difficult to guarantee that > > the startup/teardown callbacks are modifying acomp_ctx resources for a > > dysfunctional CPU. > > IIUC, outside the context of CPU onlining, CPU hotplug callbacks get > called when they are added. In this case, only the newly added callbacks > will be executed. IOW, zswap's hotplug callback should not be randomly > getting called when irrelevant code adds hotplug callbacks. It should > only happen during zswap pool initialization or CPU onlining. > > Please correct me if I am wrong. Yes, this is my understanding as well. > > > > > Now that I think about it, the only real constraint is that the acomp_ctx > > resources are guaranteed to exist for a functional CPU which can run zswap > > compress/decompress. > > I believe this is already the case as I previously described, because > the hotplug callback can only be called in two scenarios: > - Zswap pool initialization, in which case compress/decompress > operations cannot run on the pool we are initializing. > - CPU onlining, in which case compress/decompress operations cannot run > on the CPU we are onlining. > > Please correct me if I am wrong. Agreed, this is my understanding too. > > > > > I think we can simplify this as follows, and would welcome suggestions > > to improve the proposed solution: > > > > 1) We dis-associate the acomp_ctx from the pool, and instead, have this > > be a global percpu zswap resource that gets allocated once in > zswap_setup(), > > just like the zswap_entry_cache. > > 2) The acomp_ctx resources will get allocated during zswap_setup(), using > > the cpuhp_setup_state_multi callback() in zswap_setup(), that registers > > zswap_cpu_comp_prepare(), but no teardown callback. > > 3) We call cpuhp_state_add_instance() for_each_possible_cpu(cpu) in > > zswap_setup(). > > 4) The acomp_ctx resources persist through subsequent "real CPU > offline/online > > state transitions". > > 5) zswap_[de]compress() can go ahead and lock the mutex, and use the > > reqs/buffers without worrying about whether these resources are > > initialized or still exist/are being deleted. > > 6) "struct zswap_pool" is now de-coupled from this global percpu zswap > > acomp_ctx. > > 7) To address the issue of how many reqs/buffers to allocate, there could > > potentially be a memory cost for non-batching compressors, if we want > > to always allocate ZSWAP_MAX_BATCH_SIZE acomp_reqs and buffers. > > This would allow the acomp_ctx to seamlessly handle batching > > compressors, non-batching compressors, and transitions among the > > two compressor types in a pretty general manner, that relies only on > > the ZSWAP_MAX_BATCH_SIZE, which we define anyway. > > > > I believe we can maximize the chances of success for the allocation of > > the acomp_ctx resources if this is done in zswap_setup(), but please > > correct me if I am wrong. > > > > The added memory cost for platforms without IAA would be > > ~57 KB per cpu, on x86. Would this be acceptable? > > I think that's a lot of memory to waste per-CPU, and I don't see a good > reason for it. Yes, it appears so. Towards trying to see if a better design is possible by de-coupling the acomp_ctx from zswap_pool: Would this cost be acceptable if it is incurred based on a build config option, saying CONFIG_ALLOC_ZSWAP_BATCHING_RESOURCES (default OFF)? If this is set, we go ahead and allocate ZSWAP_MAX_BATCH_SIZE acomp_ctx resources once, during zswap_setup(). If not, we allocate only one req/buffer in the global percpu acomp_ctx? Since we are comprehending that other compressors may want to do batching in future, I thought a more general config option name would be more appropriate. > > > If not, I don't believe this simplification would be worth it, because > > allocating for one req/buffer, then dynamically adding more resources > > if a newly selected compressor requires more resources, would run > > into the same race conditions and added checks as in > > acomp_ctx_get_cpu_lock(), which I believe, seem to be necessary > because > > CPU onlining/zswap_pool creation and CPU offlining/zswap_pool > > deletion have the same semantics for these resources. > > Agree that using a single acomp_ctx per-CPU but making the resources > resizable is not a win. Yes, this makes sense: resizing is not the way to go. > > > > > The only other fallback solution in lieu of the proposed simplification that > > I can think of is to keep the lifespan of these resources from pool creation > > to deletion, using the CPU hotplug code. Although, it is not totally clear > > to me if there is potential for deadlock during pool transitions, as noted > above. > > I am not sure what's the deadlock scenario you're worried about, please > clarify. My apologies: I was referring to triggering direct reclaim during pool creation, which could, in theory, run into a scenario where the pool switching would have to wait for reclaim to free up enough memory for the acomp_ctx resources allocation: this could cause the system to hang, but not a deadlock. This can happen even today, hence trying to see if a better design is possible. Thanks, Kanchana
> -----Original Message----- > From: Yosry Ahmed <yosry.ahmed@linux.dev> > Sent: Tuesday, March 18, 2025 12:06 PM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com; > ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux- > crypto@vger.kernel.org; herbert@gondor.apana.org.au; > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org; > ebiggers@google.com; surenb@google.com; Accardi, Kristen C > <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; > Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource > allocation/deletion and mutex lock usage. > > On Tue, Mar 18, 2025 at 05:38:49PM +0000, Sridhar, Kanchana P wrote: > [..] > > > > Thanks Yosry, for this observation! You are right, when considered purely > > > > from a CPU hotplug perspective, zswap_cpu_comp_prepare() and > > > > zswap_cpu_comp_dead() in fact run on a control CPU, because the state > is > > > > registered in the PREPARE section of "enum cpuhp_state" in > cpuhotplug.h. > > > > > > > > The problem however is that, in the current architecture, CPU onlining/ > > > > zswap_pool creation, and CPU offlining/zswap_pool deletion have the > > > > same semantics as far as these resources are concerned. Hence, > although > > > > zswap_cpu_comp_prepare() is run on a control CPU, the CPU for which > > > > the "hotplug" code is called is in fact online. It is possible for the memory > > > > allocation calls in zswap_cpu_comp_prepare() to recurse into > > > > zswap_compress(), which now needs to be handled by the current pool, > > > > since the new pool has not yet been added to the zswap_pools, as you > > > > pointed out. > > > > > > > > The ref on the current pool has not yet been dropped. Could there be > > > > a potential for a deadlock at pool transition time: the new pool is > blocked > > > > from allocating acomp_ctx resources, triggering reclaim, which the old > > > > pool needs to handle? > > > > > > I am not sure how this could lead to a deadlock. The compression will be > > > happening in a different pool with a different acomp_ctx. > > > > I was thinking about this from the perspective of comparing the trade-offs > > between these two approaches: > > a) Allocating acomp_ctx resources for a pool when a CPU is functional, vs. > > b) Allocating acomp_ctx resources once upfront. > > > > With (a), when the user switches zswap to use a new compressor, it is > possible > > that the system is already in a low memory situation and the CPU could be > doing > > a lot of swapouts. It occurred to me that in theory, the call to switch the > > compressor through the sysfs interface could never return if the acomp_ctx > > allocations trigger direct reclaim in this scenario. This was in the context of > > exploring if a better design is possible, while acknowledging that this could > still > > happen today. > > If the system is already in a low memory situation a lot of things will > hang. Switching the compressor is not a common operation at all and we > shouldn't really worry about that. Even if we remove the acomp_ctx > allocation, we still need to make some allocations in that path anyway. Ok, these are good points. > > > > > With (b), this situation is avoided by design, and we can switch to a new pool > > without triggering additional reclaim. Sorry, I should have articulated this > better. > > But we have to either allocate more memory unnecessarily or add config > options and make batching a build-time decision. This is unwarranted > imo. > > FWIW, the mutexes and buffers used to be per-CPU not per-acomp_ctx, but > they were changed in commit 8ba2f844f050 ("mm/zswap: change per-cpu > mutex and buffer to per-acomp_ctx"). What you're suggesting is not quite > the same as what we had before that commit, it's moving the acomp_ctx > itself to be per-CPU but not per-pool, including the mtuex and buffer. > But I thought the context may be useful. Thanks for sharing the context. > > [..] > > > > 7) To address the issue of how many reqs/buffers to allocate, there could > > > > potentially be a memory cost for non-batching compressors, if we > want > > > > to always allocate ZSWAP_MAX_BATCH_SIZE acomp_reqs and > buffers. > > > > This would allow the acomp_ctx to seamlessly handle batching > > > > compressors, non-batching compressors, and transitions among the > > > > two compressor types in a pretty general manner, that relies only on > > > > the ZSWAP_MAX_BATCH_SIZE, which we define anyway. > > > > > > > > I believe we can maximize the chances of success for the allocation of > > > > the acomp_ctx resources if this is done in zswap_setup(), but please > > > > correct me if I am wrong. > > > > > > > > The added memory cost for platforms without IAA would be > > > > ~57 KB per cpu, on x86. Would this be acceptable? > > > > > > I think that's a lot of memory to waste per-CPU, and I don't see a good > > > reason for it. > > > > Yes, it appears so. Towards trying to see if a better design is possible > > by de-coupling the acomp_ctx from zswap_pool: > > Would this cost be acceptable if it is incurred based on a build config > > option, saying CONFIG_ALLOC_ZSWAP_BATCHING_RESOURCES (default > OFF)? > > If this is set, we go ahead and allocate ZSWAP_MAX_BATCH_SIZE > > acomp_ctx resources once, during zswap_setup(). If not, we allocate > > only one req/buffer in the global percpu acomp_ctx? > > We should avoid making batching a build time decision if we can help it. > A lot of kernels are shipped to different hardware that may or may not > support batching, so users will have to either decide to turn off > batching completely or eat the overhead even for hardware that does not > support batching (or for users that use SW compression). > > [..] > > > > > > > > The only other fallback solution in lieu of the proposed simplification > that > > > > I can think of is to keep the lifespan of these resources from pool > creation > > > > to deletion, using the CPU hotplug code. Although, it is not totally > clear > > > > to me if there is potential for deadlock during pool transitions, as > noted > > > above. > > > > > > I am not sure what's the deadlock scenario you're worried about, please > > > clarify. > > > > My apologies: I was referring to triggering direct reclaim during pool > creation, > > which could, in theory, run into a scenario where the pool switching would > > have to wait for reclaim to free up enough memory for the acomp_ctx > > resources allocation: this could cause the system to hang, but not a > deadlock. > > This can happen even today, hence trying to see if a better design is > possible. > > I think the concern here is unfounded. We shouldn't care about the > performance of zswap compressor switching, especially under memory > pressure. A lot of things will slow down under memory pressure, and > compressor switching should be the least of our concerns. Sounds good. It then appears that making the per-cpu acomp_ctx resources' lifetime track that of the zswap_pool, is the way to go. These resources will be allocated as per the requirements of the compressor, i.e., zswap_pool, and will persist through CPU online/offline transitions through the hotplug interface. If this plan is acceptable, it appears that acomp_ctx_get_cpu_lock() and acomp_ctx_put_unlock() can be replaced with mutex_lock()/unlock() in zswap_[de]compress()? I will incorporate these changes in v9 if this sounds Ok. Thanks, Kanchana
> -----Original Message----- > From: Yosry Ahmed <yosry.ahmed@linux.dev> > Sent: Tuesday, March 18, 2025 4:14 PM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com; > ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux- > crypto@vger.kernel.org; herbert@gondor.apana.org.au; > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org; > ebiggers@google.com; surenb@google.com; Accardi, Kristen C > <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; > Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource > allocation/deletion and mutex lock usage. > > On Tue, Mar 18, 2025 at 09:09:05PM +0000, Sridhar, Kanchana P wrote: > [..] > > > > > > > > > > > > The only other fallback solution in lieu of the proposed > simplification > > > that > > > > > > I can think of is to keep the lifespan of these resources from pool > > > creation > > > > > > to deletion, using the CPU hotplug code. Although, it is not totally > > > clear > > > > > > to me if there is potential for deadlock during pool transitions, as > > > noted > > > > > above. > > > > > > > > > > I am not sure what's the deadlock scenario you're worried about, > please > > > > > clarify. > > > > > > > > My apologies: I was referring to triggering direct reclaim during pool > > > creation, > > > > which could, in theory, run into a scenario where the pool switching > would > > > > have to wait for reclaim to free up enough memory for the acomp_ctx > > > > resources allocation: this could cause the system to hang, but not a > > > deadlock. > > > > This can happen even today, hence trying to see if a better design is > > > possible. > > > > > > I think the concern here is unfounded. We shouldn't care about the > > > performance of zswap compressor switching, especially under memory > > > pressure. A lot of things will slow down under memory pressure, and > > > compressor switching should be the least of our concerns. > > > > Sounds good. It then appears that making the per-cpu acomp_ctx resources' > > lifetime track that of the zswap_pool, is the way to go. These resources > > will be allocated as per the requirements of the compressor, i.e., > zswap_pool, > > and will persist through CPU online/offline transitions through the hotplug > > interface. If this plan is acceptable, it appears that > acomp_ctx_get_cpu_lock() > > and acomp_ctx_put_unlock() can be replaced with mutex_lock()/unlock() in > > zswap_[de]compress()? I will incorporate these changes in v9 if this sounds > Ok. > > Sounds good to me. Thanks! Thanks Yosry! Will proceed accordingly. Thanks, Kanchana
diff --git a/mm/zswap.c b/mm/zswap.c index 10f2a16e7586..cff96df1df8b 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -144,10 +144,12 @@ bool zswap_never_enabled(void) struct crypto_acomp_ctx { struct crypto_acomp *acomp; struct acomp_req *req; - struct crypto_wait wait; u8 *buffer; + u8 nr_reqs; + struct crypto_wait wait; struct mutex mutex; bool is_sleepable; + bool __online; }; /* @@ -246,6 +248,122 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp) **********************************/ static void __zswap_pool_empty(struct percpu_ref *ref); +static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx) +{ + if (!IS_ERR_OR_NULL(acomp_ctx) && acomp_ctx->nr_reqs) { + + if (!IS_ERR_OR_NULL(acomp_ctx->req)) + acomp_request_free(acomp_ctx->req); + acomp_ctx->req = NULL; + + kfree(acomp_ctx->buffer); + acomp_ctx->buffer = NULL; + + if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) + crypto_free_acomp(acomp_ctx->acomp); + + acomp_ctx->nr_reqs = 0; + } +} + +static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node) +{ + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); + int ret = -ENOMEM; + + /* + * Just to be even more fail-safe against changes in assumptions and/or + * implementation of the CPU hotplug code. + */ + if (acomp_ctx->__online) + return 0; + + if (acomp_ctx->nr_reqs) { + acomp_ctx->__online = true; + return 0; + } + + acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu)); + if (IS_ERR(acomp_ctx->acomp)) { + pr_err("could not alloc crypto acomp %s : %ld\n", + pool->tfm_name, PTR_ERR(acomp_ctx->acomp)); + ret = PTR_ERR(acomp_ctx->acomp); + goto fail; + } + + acomp_ctx->nr_reqs = 1; + + acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp); + if (!acomp_ctx->req) { + pr_err("could not alloc crypto acomp_request %s\n", + pool->tfm_name); + ret = -ENOMEM; + goto fail; + } + + acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu)); + if (!acomp_ctx->buffer) { + ret = -ENOMEM; + goto fail; + } + + crypto_init_wait(&acomp_ctx->wait); + + /* + * if the backend of acomp is async zip, crypto_req_done() will wakeup + * crypto_wait_req(); if the backend of acomp is scomp, the callback + * won't be called, crypto_wait_req() will return without blocking. + */ + acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG, + crypto_req_done, &acomp_ctx->wait); + + acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp); + + acomp_ctx->__online = true; + + return 0; + +fail: + acomp_ctx_dealloc(acomp_ctx); + + return ret; +} + +static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node) +{ + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); + + mutex_lock(&acomp_ctx->mutex); + acomp_ctx->__online = false; + mutex_unlock(&acomp_ctx->mutex); + + return 0; +} + +static void zswap_cpu_comp_dealloc(unsigned int cpu, struct hlist_node *node) +{ + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); + + /* + * The lifetime of acomp_ctx resources is from pool creation to + * pool deletion. + * + * Reclaims should not be happening because, we get to this routine only + * in two scenarios: + * + * 1) pool creation failures before/during the pool ref initialization. + * 2) we are in the process of releasing the pool, it is off the + * zswap_pools list and has no references. + * + * Hence, there is no need for locks. + */ + acomp_ctx->__online = false; + acomp_ctx_dealloc(acomp_ctx); +} + static struct zswap_pool *zswap_pool_create(char *type, char *compressor) { struct zswap_pool *pool; @@ -285,13 +403,21 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor) goto error; } - for_each_possible_cpu(cpu) - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex); + for_each_possible_cpu(cpu) { + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); + + acomp_ctx->acomp = NULL; + acomp_ctx->req = NULL; + acomp_ctx->buffer = NULL; + acomp_ctx->__online = false; + acomp_ctx->nr_reqs = 0; + mutex_init(&acomp_ctx->mutex); + } ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node); if (ret) - goto error; + goto ref_fail; /* being the current pool takes 1 ref; this func expects the * caller to always add the new pool as the current pool @@ -307,6 +433,9 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor) return pool; ref_fail: + for_each_possible_cpu(cpu) + zswap_cpu_comp_dealloc(cpu, &pool->node); + cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node); error: if (pool->acomp_ctx) @@ -361,8 +490,13 @@ static struct zswap_pool *__zswap_pool_create_fallback(void) static void zswap_pool_destroy(struct zswap_pool *pool) { + int cpu; + zswap_pool_debug("destroying", pool); + for_each_possible_cpu(cpu) + zswap_cpu_comp_dealloc(cpu, &pool->node); + cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node); free_percpu(pool->acomp_ctx); @@ -816,85 +950,6 @@ static void zswap_entry_free(struct zswap_entry *entry) /********************************* * compressed storage functions **********************************/ -static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node) -{ - struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); - struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); - struct crypto_acomp *acomp = NULL; - struct acomp_req *req = NULL; - u8 *buffer = NULL; - int ret; - - buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu)); - if (!buffer) { - ret = -ENOMEM; - goto fail; - } - - acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu)); - if (IS_ERR(acomp)) { - pr_err("could not alloc crypto acomp %s : %ld\n", - pool->tfm_name, PTR_ERR(acomp)); - ret = PTR_ERR(acomp); - goto fail; - } - - req = acomp_request_alloc(acomp); - if (!req) { - pr_err("could not alloc crypto acomp_request %s\n", - pool->tfm_name); - ret = -ENOMEM; - goto fail; - } - - /* - * Only hold the mutex after completing allocations, otherwise we may - * recurse into zswap through reclaim and attempt to hold the mutex - * again resulting in a deadlock. - */ - mutex_lock(&acomp_ctx->mutex); - crypto_init_wait(&acomp_ctx->wait); - - /* - * if the backend of acomp is async zip, crypto_req_done() will wakeup - * crypto_wait_req(); if the backend of acomp is scomp, the callback - * won't be called, crypto_wait_req() will return without blocking. - */ - acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG, - crypto_req_done, &acomp_ctx->wait); - - acomp_ctx->buffer = buffer; - acomp_ctx->acomp = acomp; - acomp_ctx->is_sleepable = acomp_is_async(acomp); - acomp_ctx->req = req; - mutex_unlock(&acomp_ctx->mutex); - return 0; - -fail: - if (acomp) - crypto_free_acomp(acomp); - kfree(buffer); - return ret; -} - -static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node) -{ - struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); - struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); - - mutex_lock(&acomp_ctx->mutex); - if (!IS_ERR_OR_NULL(acomp_ctx)) { - if (!IS_ERR_OR_NULL(acomp_ctx->req)) - acomp_request_free(acomp_ctx->req); - acomp_ctx->req = NULL; - if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) - crypto_free_acomp(acomp_ctx->acomp); - kfree(acomp_ctx->buffer); - } - mutex_unlock(&acomp_ctx->mutex); - - return 0; -} static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool) { @@ -902,16 +957,52 @@ static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool) for (;;) { acomp_ctx = raw_cpu_ptr(pool->acomp_ctx); - mutex_lock(&acomp_ctx->mutex); - if (likely(acomp_ctx->req)) - return acomp_ctx; /* - * It is possible that we were migrated to a different CPU after - * getting the per-CPU ctx but before the mutex was acquired. If - * the old CPU got offlined, zswap_cpu_comp_dead() could have - * already freed ctx->req (among other things) and set it to - * NULL. Just try again on the new CPU that we ended up on. + * If the CPU onlining code successfully allocates acomp_ctx resources, + * it sets acomp_ctx->__online to true. Until this happens, we have + * two options: + * + * 1. Return NULL and fail all stores on this CPU. + * 2. Retry, until onlining has finished allocating resources. + * + * In theory, option 1 could be more appropriate, because it + * allows the calling procedure to decide how it wants to handle + * reclaim racing with CPU hotplug. For instance, it might be Ok + * for compress to return an error for the backing swap device + * to store the folio. Decompress could wait until we get a + * valid and locked mutex after onlining has completed. For now, + * we go with option 2 because adding a do-while in + * zswap_decompress() adds latency for software compressors. + * + * Once initialized, the resources will be de-allocated only + * when the pool is destroyed. The acomp_ctx will hold on to the + * resources through CPU offlining/onlining at any time until + * the pool is destroyed. + * + * This prevents races/deadlocks between reclaim and CPU acomp_ctx + * resource allocation that are a dependency for reclaim. + * It further simplifies the interaction with CPU onlining and + * offlining: + * + * - CPU onlining does not take the mutex. It only allocates + * resources and sets __online to true. + * - CPU offlining acquires the mutex before setting + * __online to false. If reclaim has acquired the mutex, + * offlining will have to wait for reclaim to complete before + * hotunplug can proceed. Further, hotplug merely sets + * __online to false. It does not delete the acomp_ctx + * resources. + * + * Option 1 is better than potentially not exiting the earlier + * for (;;) loop because the system is running low on memory + * and/or CPUs are getting offlined for whatever reason. At + * least failing this store will prevent data loss by failing + * zswap_store(), and saving the data in the backing swap device. */ + mutex_lock(&acomp_ctx->mutex); + if (likely(acomp_ctx->__online)) + return acomp_ctx; + mutex_unlock(&acomp_ctx->mutex); } }
This patch modifies the acomp_ctx resources' lifetime to be from pool creation to deletion. A "bool __online" and "u8 nr_reqs" are added to "struct crypto_acomp_ctx" which simplify a few things: 1) zswap_pool_create() will initialize all members of each percpu acomp_ctx to 0 or NULL and only then initialize the mutex. 2) CPU hotplug will set nr_reqs to 1, allocate resources and set __online to true, without locking the mutex. 3) CPU hotunplug will lock the mutex before setting __online to false. It will not delete any resources. 4) acomp_ctx_get_cpu_lock() will lock the mutex, then check if __online is true, and if so, return the mutex for use in zswap compress and decompress ops. 5) CPU onlining after offlining will simply check if either __online or nr_reqs are non-0, and return 0 if so, without re-allocating the resources. 6) zswap_pool_destroy() will call a newly added zswap_cpu_comp_dealloc() to delete the acomp_ctx resources. 7) Common resource deletion code in case of zswap_cpu_comp_prepare() errors, and for use in zswap_cpu_comp_dealloc(), is factored into a new acomp_ctx_dealloc(). The CPU hot[un]plug callback functions are moved to "pool functions" accordingly. The per-cpu memory cost of not deleting the acomp_ctx resources upon CPU offlining, and only deleting them when the pool is destroyed, is as follows: IAA with batching: 64.8 KB Software compressors: 8.2 KB I would appreciate code review comments on whether this memory cost is acceptable, for the latency improvement that it provides due to a faster reclaim restart after a CPU hotunplug-hotplug sequence - all that the hotplug code needs to do is to check if acomp_ctx->nr_reqs is non-0, and if so, set __online to true and return, and reclaim can proceed. Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> --- mm/zswap.c | 273 +++++++++++++++++++++++++++++++++++------------------ 1 file changed, 182 insertions(+), 91 deletions(-)