Message ID: 20241018064101.336232-1-kanchana.p.sridhar@intel.com
Series: zswap IAA compress batching
This patch enables the use of Intel IAA hardware compression acceleration to reclaim a batch of folios in shrink_folio_list(), improving reclaim throughput and workload/sys performance. The earlier patches on compress batching deployed multiple IAA compress engines to compress up to SWAP_CRYPTO_SUB_BATCH_SIZE pages of a large folio being stored in zswap_store(). This patch propagates the efficiency improvements demonstrated with IAA "batching within folios" to vmscan "batching of folios", which also uses batching within folios via the extensible architecture of the __zswap_store_batch_core() procedure added earlier, which accepts an array of folios.

A plug mechanism is introduced in swap_writepage() to aggregate a batch of up to vm.compress-batchsize ([1, 32]) folios before processing the plug. The plug will be processed if any of the following is true:

1) The plug has vm.compress-batchsize folios. If the system has Intel IAA,
   "sysctl vm.compress-batchsize" can be configured to be in [1, 32]. On
   systems without IAA, or if CONFIG_ZSWAP_STORE_BATCHING_ENABLED is not
   set, "sysctl vm.compress-batchsize" can only be 1.

2) A folio with a different swap type or folio_nid than the folios
   currently in the plug needs to be added to the plug.

3) A pmd-mappable folio needs to be swapped out. In this case, the
   existing folios in the plug are processed first, and the pmd-mappable
   folio is swapped out in a batch of its own (zswap_store() will
   batch-compress SWAP_CRYPTO_SUB_BATCH_SIZE pages of the pmd-mappable
   folio if the system has IAA).
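The flush conditions above can be modeled with a small sketch. This is a hypothetical user-space model for illustration, not kernel code: swap_plug, folio_info and swap_plug_must_flush() are made-up names standing in for the plug logic described in conditions 1)-3), not actual symbols from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define COMPRESS_BATCHSIZE_MAX 32

struct folio_info {
	int swap_type;      /* swap device the folio maps to */
	int nid;            /* NUMA node id (folio_nid())    */
	bool pmd_mappable;  /* PMD-sized (e.g. 2M) folio     */
};

struct swap_plug {
	struct folio_info folios[COMPRESS_BATCHSIZE_MAX];
	int nr;         /* folios currently plugged                  */
	int batchsize;  /* sysctl vm.compress-batchsize, in [1, 32]  */
};

/*
 * Returns true if the existing plug must be processed (flushed) before
 * folio 'f' can be handled. Mirrors conditions 1)-3) above: a full plug,
 * a mismatched swap type or NUMA node, or a pmd-mappable folio (which is
 * batched on its own after the current plug is processed).
 */
static bool swap_plug_must_flush(const struct swap_plug *plug,
				 const struct folio_info *f)
{
	if (f->pmd_mappable)
		return plug->nr > 0;	/* condition 3 */
	if (plug->nr == 0)
		return false;		/* empty plug: just add the folio */
	if (plug->nr >= plug->batchsize)
		return true;		/* condition 1: plug is full */
	/* condition 2: different swap type or folio_nid */
	return plug->folios[0].swap_type != f->swap_type ||
	       plug->folios[0].nid != f->nid;
}
```

With batchsize 2, for example, a second folio of a different swap type forces a flush, while a matching folio is simply added until the plug fills.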
On Thu, Oct 17, 2024 at 11:40:49PM -0700, Kanchana P Sridhar wrote:
> For async compress/decompress, provide a way for the caller to poll
> for compress/decompress completion, rather than wait for an interrupt
> to signal completion.
>
> Callers can submit a compress/decompress using crypto_acomp_compress
> and decompress and rather than wait on a completion, call
> crypto_acomp_poll() to check for completion.
>
> This is useful for hardware accelerators where the overhead of
> interrupts and waiting for completions is too expensive. Typically
> the compress/decompress hw operations complete very quickly and in the
> vast majority of cases, adding the overhead of interrupt handling and
> waiting for completions simply adds unnecessary delays and cancels the
> gains of using the hw acceleration.
>
> Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  crypto/acompress.c                  |  1 +
>  include/crypto/acompress.h          | 18 ++++++++++++++++++
>  include/crypto/internal/acompress.h |  1 +
>  3 files changed, 20 insertions(+)

How about just adding a request flag that tells the driver to make the
request synchronous if possible?  Something like

#define CRYPTO_ACOMP_REQ_POLL 0x00000001

Cheers,
On Fri, Oct 18, 2024 at 11:01:10PM +0000, Sridhar, Kanchana P wrote:
>
> Thanks for your code review comments. Are you referring to how the
> async/poll interface is enabled at the level of say zswap (by setting a
> flag in the acomp_req), followed by the iaa_crypto driver testing for
> the flag and submitting the request and returning -EINPROGRESS.
> Wouldn't we still need a separate API to do the polling?

Correct me if I'm wrong, but I think what you want to do is this:

	crypto_acomp_compress(req)
	crypto_acomp_poll(req)

So instead of adding this interface, where the poll essentially
turns the request synchronous, just move this logic into the driver,
based on a flag bit in req.

Cheers,
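The flag-based alternative Herbert suggests can be sketched as follows. This is an illustrative user-space mock, not real crypto API code: fake_acomp_req, fake_hw_poll() and fake_driver_compress() are stand-in names, and only CRYPTO_ACOMP_REQ_POLL comes from the proposal in the thread. The point is that the caller sets a flag and the driver busy-polls the hardware internally, returning a final status, instead of exposing a separate crypto_acomp_poll() entry point.

```c
#include <assert.h>

#define CRYPTO_ACOMP_REQ_POLL 0x00000001
#define EINPROGRESS_ERR 115   /* stand-in for the kernel's EINPROGRESS */

struct fake_acomp_req {
	unsigned int flags;
	int hw_cycles_left;   /* mock: poll steps until the accelerator is done */
};

/* Mock of one hardware poll step: returns 1 once the job has completed. */
static int fake_hw_poll(struct fake_acomp_req *req)
{
	if (req->hw_cycles_left > 0)
		req->hw_cycles_left--;
	return req->hw_cycles_left == 0;
}

/*
 * Driver-side submit. With CRYPTO_ACOMP_REQ_POLL set, the driver
 * busy-polls to completion and returns the final status directly;
 * otherwise it returns -EINPROGRESS and completion would be signalled
 * asynchronously (interrupt path, elided here).
 */
static int fake_driver_compress(struct fake_acomp_req *req)
{
	if (!(req->flags & CRYPTO_ACOMP_REQ_POLL))
		return -EINPROGRESS_ERR;
	while (!fake_hw_poll(req))
		;  /* busy-wait; cheap when hw completes in microseconds */
	return 0;
}
```

The design choice here is that polling makes the request effectively synchronous, so hiding the loop inside the driver keeps the public acomp API unchanged.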
> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Friday, October 18, 2024 5:20 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Subject: Re: [RFC PATCH v1 01/13] crypto: acomp - Add a poll() operation
> to acomp_alg and acomp_req
>
> On Fri, Oct 18, 2024 at 11:01:10PM +0000, Sridhar, Kanchana P wrote:
> >
> > Thanks for your code review comments. Are you referring to how the
> > async/poll interface is enabled at the level of say zswap (by setting a
> > flag in the acomp_req), followed by the iaa_crypto driver testing for
> > the flag and submitting the request and returning -EINPROGRESS.
> > Wouldn't we still need a separate API to do the polling?
>
> Correct me if I'm wrong, but I think what you want to do is this:
>
> 	crypto_acomp_compress(req)
> 	crypto_acomp_poll(req)
>
> So instead of adding this interface, where the poll essentially
> turns the request synchronous, just move this logic into the driver,
> based on a flag bit in req.

Thanks Herbert, for this suggestion. I understand this better now, and
will work with Kristen on addressing this in v2.
Thanks,
Kanchana

> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, October 23, 2024 11:16 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Subject: Re: [RFC PATCH v1 00/13] zswap IAA compress batching
>
> On Tue, Oct 22, 2024 at 7:53 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi Yosry,
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosryahmed@google.com>
> > > Sent: Tuesday, October 22, 2024 5:57 PM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Subject: Re: [RFC PATCH v1 00/13] zswap IAA compress batching
> > >
> > > On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
> > > <kanchana.p.sridhar@intel.com> wrote:
> > > >
> > > > IAA Compression Batching:
> > > > =========================
> > > >
> > > > This RFC patch-series introduces the use of the Intel Analytics
> > > > Accelerator (IAA) for parallel compression of pages in a folio, and
> > > > for batched reclaim of hybrid any-order batches of folios in
> > > > shrink_folio_list().
> > > >
> > > > The patch-series is organized as follows:
> > > >
> > > > 1) iaa_crypto driver enablers for batching: Relevant patches are
> > > >    tagged with "crypto:" in the subject:
> > > >
> > > >    a) async poll crypto_acomp interface without interrupts.
> > > >    b) crypto testmgr acomp poll support.
> > > >    c) Modifying the default sync_mode to "async" and disabling
> > > >       verify_compress by default, to facilitate users to run IAA
> > > >       easily for comparison with software compressors.
> > > >    d) Changing the cpu-to-iaa mappings to more evenly balance cores
> > > >       to IAA devices.
> > > >    e) Addition of a "global_wq" per IAA, which can be used as a global
> > > >       resource for the socket. If the user configures 2 WQs per IAA
> > > >       device, the driver will distribute compress jobs from all cores
> > > >       on the socket to the "global_wqs" of all the IAA devices on that
> > > >       socket, in a round-robin manner. This can be used to improve
> > > >       compression throughput for workloads that see a lot of swapout
> > > >       activity.
> > > >
> > > > 2) Migrating zswap to use async poll in zswap_compress()/decompress().
> > > > 3) A centralized batch compression API that can be used by swap modules.
> > > > 4) IAA compress batching within large folio zswap stores.
> > > > 5) IAA compress batching of any-order hybrid folios in
> > > >    shrink_folio_list(). The newly added "sysctl vm.compress-batchsize"
> > > >    parameter can be used to configure the number of folios in [1, 32]
> > > >    to be reclaimed using compress batching.
> > >
> > > I am still digesting this series but I have some high level questions
> > > that I left on some patches. My intuition though is that we should
> > > drop (5) from the initial proposal as it's most controversial.
> > > Batching reclaim of unrelated folios through zswap *might* make sense,
> > > but it needs a broader conversation and it needs justification on its
> > > own merit, without the rest of the series.
> >
> > Thanks for these suggestions! Sure, I can drop (5) from the initial
> > patch-set. Agree also, this needs a broader discussion.
> >
> > I believe the 4K folios usemem30 data in this patchset does bring across
> > the batching reclaim benefits to provide justification on its own merit.
> > I added the data on batching reclaim with kernel compilation as part of
> > the 4K folios experiments in the IAA decompression batching patch-series
> > [1]. Listing it here as well. I will make sure to add this data in
> > subsequent revs.
> >
> > --------------------------------------------------------------------------
> > Kernel compilation in tmpfs/allmodconfig, 2G max memory:
> >
> > No large folios          mm-unstable-10-16-2024      shrink_folio_list()
> >                                                      batching of folios
> > --------------------------------------------------------------------------
> > zswap compressor              zstd     deflate-iaa          deflate-iaa
> > vm.compress-batchsize          n/a             n/a                   32
> > vm.page-cluster                  3               3                    3
> > --------------------------------------------------------------------------
> > real_sec                    783.87          761.69               747.32
> > user_sec                 15,750.07       15,716.69            15,728.39
> > sys_sec                   6,522.32        5,725.28             5,399.44
> > Max_RSS_KB               1,872,640       1,870,848            1,874,432
> >
> > zswpout                 82,364,991      97,739,600          102,780,612
> > zswpin                  21,303,393      27,684,166           29,016,252
> > pswpout                         13             222                  213
> > pswpin                          12             209                  202
> > pgmajfault              17,114,339      22,421,211           23,378,161
> > swap_ra                  4,596,035       5,840,082            6,231,646
> > swap_ra_hit              2,903,249       3,682,444            3,940,420
> > --------------------------------------------------------------------------
> >
> > The performance improvements seen do depend on compression batching in
> > the swap modules (zswap). The implementation in patch 12 in the compress
> > batching series sets up this zswap compression pipeline, which takes an
> > array of folios and processes them in batches of 8 pages compressed in
> > parallel in hardware. That being said, we do see latency improvements
> > even with reclaim batching combined with zswap compress batching with
> > zstd/lzo-rle/etc. I haven't done a lot of analysis of this, but I am
> > guessing fewer calls from the swap layer (swap_writepage()) into zswap
> > could have something to do with this. If we believe that batching can be
> > the right thing to do even for the software compressors, I can gather
> > batching data with zstd for v2.

Thanks for sharing the data. What I meant is, I think we should focus
on supporting large folio compression batching for this series, and
only present figures for this support to avoid confusion.

Once this lands, we can discuss support for batching the compression
of different unrelated folios separately, as it spans areas beyond
just zswap and will need broader discussion.

Absolutely, this makes sense, thanks Yosry! I will address this in v2.

Thanks,
Kanchana
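As a closing note on what "batching within folios" amounts to arithmetically: a large folio's pages are compressed in sub-batches of up to SWAP_CRYPTO_SUB_BATCH_SIZE parallel hardware jobs. A minimal sketch follows, assuming a sub-batch size of 8 (the per-wave parallelism the thread mentions for IAA); SUB_BATCH_SIZE and nr_sub_batches() are illustrative names, not symbols from the patches.

```c
#include <assert.h>

/* Assumed per-wave hardware parallelism (8 parallel compress jobs). */
#define SUB_BATCH_SIZE 8

/*
 * Number of hardware "waves" needed to compress nr_pages, i.e. how many
 * times a sub-batch of up to SUB_BATCH_SIZE parallel compress jobs must
 * be submitted and polled to completion.
 */
static int nr_sub_batches(int nr_pages)
{
	return (nr_pages + SUB_BATCH_SIZE - 1) / SUB_BATCH_SIZE;
}
```

For instance, a pmd-mappable 2M folio of 512 4K pages needs 64 waves of 8 parallel compressions, rather than 512 serial ones.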