Message ID: 20250430205305.22844-1-kanchana.p.sridhar@intel.com
Series: zswap compression batching
On Wed, Apr 30, 2025 at 01:52:56PM -0700, Kanchana P Sridhar wrote:
>
> @@ -127,6 +131,22 @@ struct acomp_req {
>  struct crypto_acomp {
>  	int (*compress)(struct acomp_req *req);
>  	int (*decompress)(struct acomp_req *req);
> +	unsigned int (*get_batch_size)(void);
> +	bool (*batch_compress)(
> +		struct acomp_req *reqs[],
> +		struct page *pages[],
> +		u8 *dsts[],
> +		unsigned int dlens[],
> +		int errors[],
> +		int nr_reqs);
> +	bool (*batch_decompress)(
> +		struct acomp_req *reqs[],
> +		u8 *srcs[],
> +		struct page *pages[],
> +		unsigned int slens[],
> +		unsigned int dlens[],
> +		int errors[],
> +		int nr_reqs);

I shelved request chaining because allocating one request per page
is actively harmful to performance. So we should not add any
interface that is based on one request per page.

My plan is to supply a whole folio through acomp_request_set_src_folio
and mark it as a batch request with a data unit size of 4K, e.g.:

	acomp_request_set_src_folio(req, folio, 0, len);
	acomp_request_set_data_unit(req, 4096);

Then the algorithm can dice it up in whatever way it sees fit. For
algorithms that don't support batching, the acompress API should dice
it up and feed it to the algorithm piece-meal.

IOW, the folio loop in zswap_store would be moved into the Crypto API.

This is contingent on one API change: bringing back NULL dst support
to acompress. This way zswap does not need to worry about allocating
memory that might not even be needed (when pages compress well).

This won't look like the useless NULL dst we had before, which simply
pre-allocated memory rather than allocating it on demand. What
acompress should do is allocate one dst page at a time; once that is
filled up, allocate one more. They should be chained up in an SG list.
Pages that do not compress can be marked as zero-length entries in the
SG list.

If the allocation fails at any point in time, simply stop the batching
at that point and return the SG list of what has been compressed so
far. After processing the returned pages, zswap can then call
acompress again with an offset into the folio to continue compression.

To prevent pathological cases of zero progress, zswap can provide one
pre-allocated page to seed the process. For iaa, it should just
allocate as many pages as it needs for batching, and if that fails,
simply fall back to no batching and do things one page at a time (or
however many pages you manage to allocate).

I'll whip up a quick POC and we can work on top of it.

Cheers,
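To make the proposed flow concrete, here is a minimal sketch of what
the zswap side could look like under this scheme. This is not a
working implementation: acomp_request_set_data_unit() is only proposed
in this thread and does not exist in the tree, and the NULL-dst /
SG-list return contract is inferred from the description in the mail
above.

	#include <crypto/acompress.h>
	#include <linux/scatterlist.h>

	/*
	 * Sketch only. acomp_request_set_data_unit() is the interface
	 * proposed above, not an existing API; the on-demand NULL-dst
	 * SG-list behavior is an assumption based on this mail.
	 */
	static int zswap_compress_folio_sketch(struct acomp_req *req,
					       struct folio *folio)
	{
		size_t done = 0, len = folio_size(folio);
		struct scatterlist *sg;
		int err;

		while (done < len) {
			/* Hand the remainder of the folio to acompress. */
			acomp_request_set_src_folio(req, folio, done, len - done);
			acomp_request_set_data_unit(req, PAGE_SIZE); /* proposed */

			/*
			 * dst is left NULL: acompress allocates dst pages on
			 * demand and chains them into an SG list; a page that
			 * did not compress shows up as a zero-length entry.
			 */
			err = crypto_acomp_compress(req);
			if (err)
				return err;

			/*
			 * Store what came back. If an allocation failure cut
			 * the batch short, the outer loop resubmits from the
			 * new offset, as described above.
			 */
			for (sg = req->dst; sg; sg = sg_next(sg)) {
				/* sg->length == 0: store the page uncompressed;
				 * else store sg_virt(sg), sg->length bytes. */
				done += PAGE_SIZE;
			}
		}
		return 0;
	}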
> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Wednesday, April 30, 2025 6:41 PM
> Subject: Re: [PATCH v9 10/19] crypto: acomp - New interfaces to
> facilitate batching support in acomp & drivers.
>
> On Wed, Apr 30, 2025 at 01:52:56PM -0700, Kanchana P Sridhar wrote:
> > @@ -127,6 +131,22 @@ struct acomp_req {
> >  struct crypto_acomp {
> >  	int (*compress)(struct acomp_req *req);
> >  	int (*decompress)(struct acomp_req *req);
> > +	unsigned int (*get_batch_size)(void);
> > +	bool (*batch_compress)(
> > +		struct acomp_req *reqs[],
> > +		struct page *pages[],
> > +		u8 *dsts[],
> > +		unsigned int dlens[],
> > +		int errors[],
> > +		int nr_reqs);
> > +	bool (*batch_decompress)(
> > +		struct acomp_req *reqs[],
> > +		u8 *srcs[],
> > +		struct page *pages[],
> > +		unsigned int slens[],
> > +		unsigned int dlens[],
> > +		int errors[],
> > +		int nr_reqs);
>
> I shelved request chaining because allocating one request per page
> is actively harmful to performance. So we should not add any
> interface that is based on one request per page.

Hi Herbert,

My cover letter presents data indicating significant performance
improvements with the crypto_acomp_batch_compress() interface, which
allocates one request per page.

In addition, I would like to share the p50/p99 latency of just the
calls to crypto_acomp_compress() and crypto_acomp_batch_compress(),
gathered using the silesia.tar dataset
(http://wanos.co/assets/silesia.tar) and a simple madvise test that
reads the dataset into memory, then swaps out all pages and swaps them
back in. This data is on Sapphire Rapids, with the core frequency
fixed at 2500 MHz and 64K folios enabled.

"count" refers to the number of calls to the crypto_acomp API. As
expected, the deflate-iaa counts in v9 are much lower because
zswap_compress() in v9 uses compression batching, i.e., it invokes
crypto_acomp_batch_compress() with batches of 8 pages while storing
the 64K folios.

 64K folios: Normalized per-page latency of crypto_acomp calls
 in zswap_compress() (ns)
 ------------+------------------------------+----------------------------
             |    mm-unstable-4-21-2025     |             v9
 ------------+------------------------------+----------------------------
             |  count     p50       p99     |  count     p50      p99
 ------------+------------------------------+----------------------------
 deflate-iaa |  50,459    3,396     3,663   |   6,379     717      774
 zstd        |  50,631   27,610    33,006   |  50,631  27,253   32,516
 ------------+------------------------------+----------------------------

There is no indication that sending one acomp request per page harms
performance with either deflate-iaa or zstd. We see a 4.7X speedup
with IAA using the crypto_acomp_batch_compress() interface in
question.
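The madvise test itself was not posted in this thread; a minimal
userspace sketch of that kind of test could look like the following
(the dataset path and the use of MADV_PAGEOUT to force swap-out are
assumptions, not details from the mail):

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <unistd.h>

	int main(void)
	{
		struct stat st;
		int fd = open("silesia.tar", O_RDONLY);

		if (fd < 0 || fstat(fd, &st) < 0)
			return 1;

		/* Read the dataset into anonymous memory so it is swap-backed. */
		char *buf = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED)
			return 1;

		for (off_t total = 0; total < st.st_size; ) {
			ssize_t n = read(fd, buf + total, st.st_size - total);
			if (n <= 0)
				return 1;
			total += n;
		}

		/* Swap all pages out: drives the zswap store/compress path. */
		if (madvise(buf, st.st_size, MADV_PAGEOUT))
			perror("madvise");

		/* Touch each page to swap it back in: the decompress path. */
		volatile char sum = 0;
		for (off_t off = 0; off < st.st_size; off += 4096)
			sum += buf[off];

		close(fd);
		return 0;
	}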
> My plan is to supply a whole folio through acomp_request_set_src_folio
> and mark it as a batch request with a data unit size of 4K, e.g.:
>
> 	acomp_request_set_src_folio(req, folio, 0, len);
> 	acomp_request_set_data_unit(req, 4096);
>
> Then the algorithm can dice it up in whatever way it sees fit. For
> algorithms that don't support batching, the acompress API should dice
> it up and feed it to the algorithm piece-meal.
>
> IOW the folio loop in zswap_store would be moved into the Crypto API.
>
> This is contingent on one API change, bringing back NULL dst support
> to acompress. This way zswap does not need to worry about allocating
> memory that might not even be needed (when pages compress well).
>
> This won't look like the useless NULL dst we had before which simply
> pre-allocated memory rather than allocating them on demand.
>
> What acompress should do is allocate one dst page at a time, once that
> is filled up, then allocate one more. They should be chained up in an
> SG list. Pages that do not compress can be marked as a zero-length
> entry in the SG list.
>
> If the allocation fails at any point in time, simply stop the
> batching at that point and return the SG list of what has been
> compressed so far. After processing the returned pages, zswap
> can then call acompress again with an offset into the folio to
> continue compression.

I am not sure this would be feasible, because zswap_store()
incrementally does other book-keeping necessary for mm/swap
consistency as pages get compressed, such as allocating entries,
storing compressed buffers in the zpool, updating the xarray of swap
offsets stored in zswap, adding entries to the zswap memcg LRU list,
etc. As soon as an error is encountered in zswap_compress(),
zswap_store() has to clean up any prior zpool stores for the batch and
delete any swap offsets for the folio from the xarray. IMO, handing
the folio store loop over to crypto could make maintaining this
consistency of mm/swap incrementally, with each page
compressed/stored, somewhat messy. However, I would like to request
the zswap maintainers to weigh in with their insights on the pros and
cons of what you are proposing.

> To prevent pathological cases of zero progress, zswap can provide
> one pre-allocated page to seed the process. For iaa, it should
> just allocate as many pages as it needs for batching, and if that
> fails, simply fall back to no batching and do things one page at
> a time (or however many pages you manage to allocate).

I'm somewhat concerned about a) allocating memory and b) adding
computes in the zswap_store() path. It is not clear what the
motivating factor for doing so is, especially because the solution and
data presented in v9 indicate only performance upside with the
crypto_acomp_batch_compress() API and its implementation in
iaa_crypto, and modest performance gains with zstd using the existing
crypto_acomp_compress() API to compress one page at a time in a large
folio. Please let me know if I am missing something.

Thanks,
Kanchana

> I'll whip up a quick POC and we can work on top of it.
>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
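For context, the incremental book-keeping being described has roughly
the following shape. This is a simplified sketch of the zswap_store()
structure; the helper names (zswap_entry_alloc, zswap_tree_store,
zswap_lru_add, zswap_tree_erase_and_free) are abbreviations invented
for illustration, not the exact upstream functions.

	#include <linux/mm.h>

	/*
	 * Simplified sketch of zswap_store()'s per-page book-keeping;
	 * the helpers are invented abbreviations of the upstream code,
	 * shown only to illustrate the unwind problem described above.
	 */
	static bool zswap_store_folio_sketch(struct folio *folio)
	{
		long nr = folio_nr_pages(folio), i;

		for (i = 0; i < nr; i++) {
			struct zswap_entry *entry = zswap_entry_alloc();

			/* Compress this page and write it into the zpool. */
			if (!entry || !zswap_compress(folio_page(folio, i), entry))
				goto unwind;

			/* Publish: swap-offset xarray slot, memcg LRU, stats. */
			if (zswap_tree_store(folio, i, entry))
				goto unwind;
			zswap_lru_add(entry);
		}
		return true;

	unwind:
		/*
		 * A failure mid-folio forces cleanup of everything already
		 * stored: free the zpool buffers and erase this folio's
		 * swap offsets from the xarray.
		 */
		while (i--)
			zswap_tree_erase_and_free(folio, i);
		return false;
	}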
Adding Sergey.
> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Sunday, May 11, 2025 1:53 PM
> Subject: Re: [PATCH v9 00/19] zswap compression batching
>
> On Wed, Apr 30, 2025 at 4:53 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Changes since v8:
> > =================
> > 1) Rebased to mm-unstable as of 4-21-2025, commit 2c01d9f3c611.
> > 2) Backported commits for reverting request chaining, since these
> >    are in cryptodev-2.6 but not yet in mm-unstable; without these
> >    backports, deflate-iaa is non-functional in mm-unstable:
> >      commit 64929fe8c0a4 ("crypto: acomp - Remove request chaining")
> >      commit 5976fe19e240 ("Revert "crypto: testmgr - Add multibuffer
> >        acomp testing"")
> >    Backported this hotfix as well:
> >      commit 002ba346e3d7 ("crypto: scomp - Fix off-by-one bug when
> >        calculating last page")
> > 3) crypto_acomp_[de]compress() restored to non-request-chained
> >    implementations, since request chaining has been removed from
> >    acomp in commit 64929fe8c0a4 ("crypto: acomp - Remove request
> >    chaining").
>
> I'm a bit confused. Which patches on top of mm-unstable do I need from
> the crypto tree as prerequisites for this patch series? And in which
> order?

Hi Nhat,

As of today's mm-unstable (commit "c68cfbc5048e"), you can apply the
patches in this order:

1) commit 64929fe8c0a4 ("crypto: acomp - Remove request chaining") from
   the cryptodev-2.6 tree, or Patch 0001 in this series, which does the
   same.
2) Patch 0002 in this series.
3) Skip Patches 0003 and 0004 in this series, since they were not in
   mm-unstable when I created this patch series; they are now in
   mm-unstable.
4) Apply Patches 0005-0019 in this series.

Thanks,
Kanchana