Message ID: 20241221063119.29140-3-kanchana.p.sridhar@intel.com
State: Superseded
Series: zswap IAA compress batching
On Fri, Dec 20, 2024 at 10:31:09PM -0800, Kanchana P Sridhar wrote:
> This commit adds get_batch_size(), batch_compress() and batch_decompress()
> interfaces to:

First of all we don't need a batch compress/decompress interface
because the whole point of request chaining is to supply the data
in batches.

I'm also against having a get_batch_size because the user should
be supplying as much data as they're comfortable with. In other
words if the user is happy to give us 8 requests for iaa then it
should be happy to give us 8 requests for every implementation.

The request chaining interface should be such that processing
8 requests is always better than doing 1 request at a time as
the cost is amortised.

Cheers,
Hi Herbert,

> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Saturday, December 28, 2024 3:46 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com; akpm@linux-foundation.org;
> linux-crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for
> compress/decompress batching.
>
> On Fri, Dec 20, 2024 at 10:31:09PM -0800, Kanchana P Sridhar wrote:
> > This commit adds get_batch_size(), batch_compress() and batch_decompress()
> > interfaces to:
>
> First of all we don't need a batch compress/decompress interface
> because the whole point of request chaining is to supply the data
> in batches.
>
> I'm also against having a get_batch_size because the user should
> be supplying as much data as they're comfortable with. In other
> words if the user is happy to give us 8 requests for iaa then it
> should be happy to give us 8 requests for every implementation.
>
> The request chaining interface should be such that processing
> 8 requests is always better than doing 1 request at a time as
> the cost is amortised.

Thanks for your comments. Can you please elaborate on how request
chaining would enable cost amortization for software compressors?

With the current implementation, a module like zswap would need to do
the following to invoke request chaining for software compressors (in
addition to pushing the chaining to the user layer for IAA, as per your
suggestion on not needing a batch compress/decompress interface):

zswap_batch_compress():

	for (i = 0; i < nr_pages_in_batch; ++i) {
		/* set up the acomp_req "reqs[i]". */
		[ ... ]

		if (i)
			acomp_request_chain(reqs[i], reqs[0]);
		else
			acomp_reqchain_init(reqs[0], 0, crypto_req_done,
					    crypto_wait);
	}

	/* Process the request chain in series. */
	err = crypto_wait_req(acomp_do_req_chain(reqs[0],
						 crypto_acomp_compress),
			      crypto_wait);

Internally, acomp_do_req_chain() would sequentially process the request
chain by:

1) adding all requests to a list "state"
2) call "crypto_acomp_compress()" for the next list element
3) when this request completes, dequeue it from the list "state"
4) repeat for all requests in "state"
5) When the last request in "state" completes, call
   "reqs[0]->base.complete()", which notifies crypto_wait.
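In other words, for a software compressor the chain effectively degenerates
into a serial loop along the lines of the sketch below (illustrative only;
the helper name is made up and this is not the actual acomp_do_req_chain()
implementation), which is why it is not obvious where the cost amortization
comes from:

	static int acomp_serially_compress_chain(struct acomp_req *reqs[],
						 int nr,
						 struct crypto_wait *wait)
	{
		int i, err;

		for (i = 0; i < nr; i++) {
			/* Each request is issued and waited on, one at a time. */
			err = crypto_wait_req(crypto_acomp_compress(reqs[i]), wait);
			if (err)
				return err;
		}

		return 0;
	}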
On Mon, Jan 06, 2025 at 07:10:53PM -0800, Yosry Ahmed wrote:
>
> The main problem is memory usage. Zswap needs a PAGE_SIZE*2-sized
> buffer for each request on each CPU. We preallocate these buffers to
> avoid trying to allocate this much memory in the reclaim path (i.e.
> potentially allocating two pages to reclaim one).

What if we allowed each acomp request to take a whole folio?
That would mean you'd only need to allocate one request per
folio, regardless of how big it is.

Eric, we could do something similar with ahash. Allow the
user to supply a folio (or scatterlist entry) instead of a
single page, and then cut it up based on a unit-size supplied
by the user (e.g., 512 bytes for sector-based users). That
would mean just a single request object as long as your input
is a folio or something similar.

Is this something that you could use in fs/verity? You'd still
need to allocate enough memory to store the output hashes.

Cheers,
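To make that concrete, a purely hypothetical sketch of such a submission
(the folio and unit-size setters below do not exist today; they only
illustrate the shape of the proposal):

	struct acomp_req *req = acomp_request_alloc(tfm);

	/* Hypothetical: one request describes the whole folio ... */
	acomp_request_set_src_folio(req, folio, 0, folio_size(folio));
	/* ... and the API slices it internally into unit-size pieces. */
	acomp_request_set_unit_size(req, PAGE_SIZE);	/* or 512 for sector users */
	acomp_request_set_dst_sg(req, dst_sgl, dst_len);

	err = crypto_wait_req(crypto_acomp_compress(req), &wait);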
On Tue, Jan 7, 2025 at 5:39 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Mon, Jan 06, 2025 at 07:10:53PM -0800, Yosry Ahmed wrote:
> >
> > The main problem is memory usage. Zswap needs a PAGE_SIZE*2-sized
> > buffer for each request on each CPU. We preallocate these buffers to
> > avoid trying to allocate this much memory in the reclaim path (i.e.
> > potentially allocating two pages to reclaim one).
>
> What if we allowed each acomp request to take a whole folio?
> That would mean you'd only need to allocate one request per
> folio, regardless of how big it is.

Hmm this means we need to allocate a single request instead of N
requests, but the source of overhead is the output buffers not the
requests. We need PAGE_SIZE*2 for each page in the folio in the output
buffer on each CPU. Preallocating this unnecessarily adds up to a lot
of memory. Did I miss something?

> Eric, we could do something similar with ahash. Allow the
> user to supply a folio (or scatterlist entry) instead of a
> single page, and then cut it up based on a unit-size supplied
> by the user (e.g., 512 bytes for sector-based users). That
> would mean just a single request object as long as your input
> is a folio or something similar.
>
> Is this something that you could use in fs/verity? You'd still
> need to allocate enough memory to store the output hashes.
>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Mon, Jan 06, 2025 at 07:10:53PM -0800, Yosry Ahmed wrote:
>
> The main problem is memory usage. Zswap needs a PAGE_SIZE*2-sized
> buffer for each request on each CPU. We preallocate these buffers to
> avoid trying to allocate this much memory in the reclaim path (i.e.
> potentially allocating two pages to reclaim one).

Actually this PAGE_SIZE * 2 thing baffles me. Why would you
allocate more memory than the input? The comment says that it's
because certain hardware accelerators will disregard the output
buffer length, but surely that's just a bug in the driver?

Which driver does this? We should fix it or remove it if it's
writing output with no regard to the maximum length.

You should only ever need PAGE_SIZE for the output buffer, if
the output exceeds that then just fail the compression.

Cheers,
On Sun, Feb 16, 2025 at 01:17:59PM +0800, Herbert Xu wrote:
> On Mon, Jan 06, 2025 at 07:10:53PM -0800, Yosry Ahmed wrote:
> >
> > The main problem is memory usage. Zswap needs a PAGE_SIZE*2-sized
> > buffer for each request on each CPU. We preallocate these buffers to
> > avoid trying to allocate this much memory in the reclaim path (i.e.
> > potentially allocating two pages to reclaim one).
>
> Actually this PAGE_SIZE * 2 thing baffles me. Why would you
> allocate more memory than the input? The comment says that it's
> because certain hardware accelerators will disregard the output
> buffer length, but surely that's just a bug in the driver?
>
> Which driver does this? We should fix it or remove it if it's
> writing output with no regard to the maximum length.
>
> You should only ever need PAGE_SIZE for the output buffer, if
> the output exceeds that then just fail the compression.

I agree this should be fixed if it can be. This was discussed before
here:
https://lore.kernel.org/lkml/CAGsJ_4wuTZcGurby9h4PU2DwFaiEKB4bxuycaeyz3bPw3jSX3A@mail.gmail.com/

Barry is the one who brought up why we need PAGE_SIZE*2. Barry, could
you please chime in here?

> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Fri, Feb 21, 2025 at 6:32 AM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
>
> On Sun, Feb 16, 2025 at 01:17:59PM +0800, Herbert Xu wrote:
> > On Mon, Jan 06, 2025 at 07:10:53PM -0800, Yosry Ahmed wrote:
> > >
> > > The main problem is memory usage. Zswap needs a PAGE_SIZE*2-sized
> > > buffer for each request on each CPU. We preallocate these buffers to
> > > avoid trying to allocate this much memory in the reclaim path (i.e.
> > > potentially allocating two pages to reclaim one).
> >
> > Actually this PAGE_SIZE * 2 thing baffles me. Why would you
> > allocate more memory than the input? The comment says that it's
> > because certain hardware accelerators will disregard the output
> > buffer length, but surely that's just a bug in the driver?
> >
> > Which driver does this? We should fix it or remove it if it's
> > writing output with no regard to the maximum length.
> >
> > You should only ever need PAGE_SIZE for the output buffer, if
> > the output exceeds that then just fail the compression.
>
> I agree this should be fixed if it can be. This was discussed before
> here:
> https://lore.kernel.org/lkml/CAGsJ_4wuTZcGurby9h4PU2DwFaiEKB4bxuycaeyz3bPw3jSX3A@mail.gmail.com/
>
> Barry is the one who brought up why we need PAGE_SIZE*2. Barry, could
> you please chime in here?

I'm not sure if any real hardware driver fails to return -ERRNO, but
there could be another reason why zRAM doesn't want -ERRNO, from the
previous code comment: "when we receive -ERRNO from the compressing
backend we can't help it anymore":

int zcomp_compress(struct zcomp_strm *zstrm, const void *src,
		   unsigned int *dst_len)
{
	/*
	 * Our dst memory (zstrm->buffer) is always `2 * PAGE_SIZE' sized
	 * because sometimes we can endup having a bigger compressed data
	 * due to various reasons: for example compression algorithms tend
	 * to add some padding to the compressed buffer. Speaking of padding,
	 * comp algorithm `842' pads the compressed length to multiple of 8
	 * and returns -ENOSP when the dst memory is not big enough, which
	 * is not something that ZRAM wants to see. We can handle the
	 * `compressed_size > PAGE_SIZE' case easily in ZRAM, but when we
	 * receive -ERRNO from the compressing backend we can't help it
	 * anymore. To make `842' happy we need to tell the exact size of
	 * the dst buffer, zram_drv will take care of the fact that
	 * compressed buffer is too big.
	 */
	*dst_len = PAGE_SIZE * 2;

	return crypto_comp_compress(zstrm->tfm, src, PAGE_SIZE,
				    zstrm->buffer, dst_len);
}

After reviewing the zRAM code, I don't see why zram_write_page() needs
to rely on comp_len to call write_incompressible_page().

zram_write_page()
{
	ret = zcomp_compress(zram->comps[ZRAM_PRIMARY_COMP], zstrm,
			     mem, &comp_len);
	kunmap_local(mem);

	if (unlikely(ret)) {
		zcomp_stream_put(zstrm);
		pr_err("Compression failed! err=%d\n", ret);
		return ret;
	}

	if (comp_len >= huge_class_size) {
		zcomp_stream_put(zstrm);
		return write_incompressible_page(zram, page, index);
	}
}

I mean, why can't we make it as below:

zram_write_page()
{
	ret = zcomp_compress(zram->comps[ZRAM_PRIMARY_COMP], zstrm,
			     mem, &comp_len);
	kunmap_local(mem);

	if (unlikely(ret && ret != -ENOSPC)) {
		zcomp_stream_put(zstrm);
		pr_err("Compression failed! err=%d\n", ret);
		return ret;
	}

	if (comp_len >= huge_class_size || ret) {
		zcomp_stream_put(zstrm);
		return write_incompressible_page(zram, page, index);
	}
}

As long as crypto drivers consistently return -ENOSPC or a specific
error code for dst_buf overflow, we should be able to eliminate the
2*PAGE_SIZE buffer.

My point is:
1. All drivers must be capable of handling dst_buf overflow.
2. All drivers must return a consistent and dedicated error code for
   dst_buf overflow.

+Minchan, Sergey,
Do you think we can implement this change in zRAM by using PAGE_SIZE
instead of 2 * PAGE_SIZE?

> > Cheers,
> > --
> > Email: Herbert Xu <herbert@gondor.apana.org.au>
> > Home Page: http://gondor.apana.org.au/~herbert/
> > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Thanks
Barry
On Sat, Feb 22, 2025 at 07:26:43PM +1300, Barry Song wrote:
>
> After reviewing the zRAM code, I don't see why zram_write_page() needs
> to rely on comp_len to call write_incompressible_page().
>
> zram_write_page()
> {
>	ret = zcomp_compress(zram->comps[ZRAM_PRIMARY_COMP], zstrm,
>			     mem, &comp_len);
>	kunmap_local(mem);
>
>	if (unlikely(ret)) {
>		zcomp_stream_put(zstrm);
>		pr_err("Compression failed! err=%d\n", ret);
>		return ret;
>	}
>
>	if (comp_len >= huge_class_size) {
>		zcomp_stream_put(zstrm);
>		return write_incompressible_page(zram, page, index);
>	}
> }

Surely any compression error should just be treated as an
incompressible page?

I mean we might wish to report unusual errors in case the
admin or developer can do something about it, but for the
system as a whole it should still continue as if the page
was simply incompressible.

> As long as crypto drivers consistently return -ENOSPC or a specific
> error code for dst_buf overflow, we should be able to eliminate the
> 2*PAGE_SIZE buffer.

Yes we could certainly do that.

Cheers,
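Roughly speaking, something like the below on top of Barry's snippet
(an untested sketch, not a proposed patch):

	ret = zcomp_compress(zram->comps[ZRAM_PRIMARY_COMP], zstrm,
			     mem, &comp_len);
	kunmap_local(mem);

	/* Report unusual errors so admins/developers can act on them ... */
	if (unlikely(ret))
		pr_warn_ratelimited("Compression failed! err=%d\n", ret);

	/* ... but otherwise carry on as if the page were incompressible. */
	if (ret || comp_len >= huge_class_size) {
		zcomp_stream_put(zstrm);
		return write_incompressible_page(zram, page, index);
	}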
On Sat, Feb 22, 2025 at 7:34 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Sat, Feb 22, 2025 at 07:26:43PM +1300, Barry Song wrote:
> >
> > After reviewing the zRAM code, I don't see why zram_write_page() needs
> > to rely on comp_len to call write_incompressible_page().
> >
> > zram_write_page()
> > {
> >	ret = zcomp_compress(zram->comps[ZRAM_PRIMARY_COMP], zstrm,
> >			     mem, &comp_len);
> >	kunmap_local(mem);
> >
> >	if (unlikely(ret)) {
> >		zcomp_stream_put(zstrm);
> >		pr_err("Compression failed! err=%d\n", ret);
> >		return ret;
> >	}
> >
> >	if (comp_len >= huge_class_size) {
> >		zcomp_stream_put(zstrm);
> >		return write_incompressible_page(zram, page, index);
> >	}
> > }
>
> Surely any compression error should just be treated as an
> incompressible page?

probably no, as an incompressible page might become compressible
after changing an algorithm. This is possible, users may switch an
algorithm to compress an incompressible page in the background.

Errors other than dst_buf overflow are a completely different matter
though :-)

> I mean we might wish to report unusual errors in case the
> admin or developer can do something about it, but for the
> system as a whole it should still continue as if the page
> was simply incompressible.
>
> > As long as crypto drivers consistently return -ENOSPC or a specific
> > error code for dst_buf overflow, we should be able to eliminate the
> > 2*PAGE_SIZE buffer.
>
> Yes we could certainly do that.
>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Thanks
barry
On Sat, Feb 22, 2025 at 07:41:54PM +1300, Barry Song wrote:
>
> probably no, as an incompressible page might become compressible
> after changing an algorithm. This is possible, users may switch an
> algorithm to compress an incompressible page in the background.

I don't understand the difference. If something is wrong with
the system causing the compression algorithm to fail, shouldn't
zswap just hobble along as if the page was incompressible?

In fact it would be quite reasonable to try to recompress it if
the admin did change the algorithm later because the error may
have been specific to the previous algorithm implementation.

Of course I totally agree that there should be a reporting
mechanism to catch errors that admins/developers should know
about. But apart from reporting that error there should be
no difference between an inherently incompressible page vs.
buggy algorithm/broken hardware failing to compress the page.

Cheers,
On Sat, Feb 22, 2025 at 7:52 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Sat, Feb 22, 2025 at 07:41:54PM +1300, Barry Song wrote:
> >
> > probably no, as an incompressible page might become compressible
> > after changing an algorithm. This is possible, users may switch an
> > algorithm to compress an incompressible page in the background.
>
> I don't understand the difference. If something is wrong with
> the system causing the compression algorithm to fail, shouldn't
> zswap just hobble along as if the page was incompressible?
>
> In fact it would be quite reasonable to try to recompress it if
> the admin did change the algorithm later because the error may
> have been specific to the previous algorithm implementation.

Somehow, I find your comment reasonable. Another point I want
to mention is the semantic difference. For example, in a system
with only one algorithm, a dst_buf overflow still means a successful
swap-out. However, other errors actually indicate an I/O failure.
In such cases, vmscan.c will log the relevant error in pageout() to
notify the user.

Anyway, I'm not an authority on this, so I’d like to see comments
from Minchan, Sergey, and Yosry.

> Of course I totally agree that there should be a reporting
> mechanism to catch errors that admins/developers should know
> about. But apart from reporting that error there should be
> no difference between an inherently incompressible page vs.
> buggy algorithm/broken hardware failing to compress the page.
>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Thanks
Barry
On Sat, Feb 22, 2025 at 08:13:13PM +1300, Barry Song wrote:
>
> Somehow, I find your comment reasonable. Another point I want
> to mention is the semantic difference. For example, in a system
> with only one algorithm, a dst_buf overflow still means a successful
> swap-out. However, other errors actually indicate an I/O failure.
> In such cases, vmscan.c will log the relevant error in pageout() to
> notify the user.

I'm talking specifically about the error from the Crypto API,
not any other error. So if you were using some sort of an
offload device to do the compression, that could indeed fail
due to an IO error (perhaps the PCI bus is on fire :)

But because that's reported through the Crypto API, it should
not be treated any differently than an incompressible page,
except for reporting purposes.

Cheers,
On Sat, Feb 22, 2025 at 8:23 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Sat, Feb 22, 2025 at 08:13:13PM +1300, Barry Song wrote:
> >
> > Somehow, I find your comment reasonable. Another point I want
> > to mention is the semantic difference. For example, in a system
> > with only one algorithm, a dst_buf overflow still means a successful
> > swap-out. However, other errors actually indicate an I/O failure.
> > In such cases, vmscan.c will log the relevant error in pageout() to
> > notify the user.
>
> I'm talking specifically about the error from the Crypto API,
> not any other error. So if you were using some sort of an
> offload device to do the compression, that could indeed fail
> due to an IO error (perhaps the PCI bus is on fire :)
>
> But because that's reported through the Crypto API, it should
> not be treated any differently than an incompressible page,
> except for reporting purposes.

I'm referring more to the mm subsystem :-)

Let me provide a concrete example. Below is a small program that will
swap out 16MB of memory to zRAM:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MB (1024 * 1024)
#define SIZE (16 * MB)

int main()
{
	void *addr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			  MAP_ANON | MAP_PRIVATE, -1, 0);

	if (addr == MAP_FAILED) {
		perror("mmap failed");
		return 1;
	}

	for (size_t i = 0; i < SIZE / sizeof(int); i++) {
		((int*)addr)[i] = rand();
	}

	if (madvise(addr, SIZE, MADV_PAGEOUT) != 0) {
		perror("madvise failed");
		return 1;
	}

	while (1);
	return 0;
}

For errors other than dst_buf overflow, we receive:

/ # ./a.out &
/ # free
              total        used        free      shared  buff/cache   available
Mem:         341228       77036      251872           0       20600      264192
Swap:       2703356           0     2703356
[1]+  Done                       ./a.out
/ # cat /proc/vmstat | grep swp
pswpin 0
pswpout 0
...

No memory has been swapped out, the swap-out counter is zero, and the
swap file is not used at all.

If this is an incompressible page (I mean dst_buf overflow error),
there is no actual issue, and we get the following:

/ #
/ # free
              total        used        free      shared  buff/cache   available
Mem:         341228       92948      236248           0       20372      248280
Swap:       2703356       16384     2686972
/ # cat /proc/vmstat | grep swp
pswpin 0
pswpout 4096
...

> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Thanks
Barry
On (25/02/22 19:26), Barry Song wrote:
> After reviewing the zRAM code, I don't see why zram_write_page() needs
> to rely on comp_len to call write_incompressible_page().

[..]

> zram_write_page()
> {
>	ret = zcomp_compress(zram->comps[ZRAM_PRIMARY_COMP], zstrm,
>			     mem, &comp_len);
>	kunmap_local(mem);
>
>	if (unlikely(ret && ret != -ENOSPC)) {
>		zcomp_stream_put(zstrm);
>		pr_err("Compression failed! err=%d\n", ret);
>		return ret;
>	}
>
>	if (comp_len >= huge_class_size || ret) {
>		zcomp_stream_put(zstrm);
>		return write_incompressible_page(zram, page, index);
>	}
> }

Sorry, I'm slower than usual now, but why should we? Shouldn't compression
algorithms just never fail, even on 3D videos, because otherwise they won't
be able to validate their Weissman score or something :)

On a serious note - what is the use-case here? Is the failure here due to
some random "cosmic rays" that taint the compression H/W? If so then what
makes us believe that it's uni-directional? What if it's decompression
that gets busted and then you can't decompress anything previously stored
compressed and stored in zsmalloc. Wouldn't it be better in this case
to turn the computer off and on again?

The idea behind zram's code is that incompressible pages are not unusual,
they are quite usual, in fact. It's not necessarily that the data grew
in size after compression, the data is incompressible from zsmalloc PoV.
That is, the algorithm wasn't able to compress a PAGE_SIZE buffer to an
object smaller than zsmalloc's huge-class-watermark (around 3600 bytes,
depending on zspage chain size). That's why we look at the comp-len.
Anything else is an error, perhaps a pretty catastrophic error.

> As long as crypto drivers consistently return -ENOSPC or a specific
> error code for dst_buf overflow, we should be able to eliminate the
> 2*PAGE_SIZE buffer.
>
> My point is:
> 1. All drivers must be capable of handling dst_buf overflow.
> 2. All drivers must return a consistent and dedicated error code for
>    dst_buf overflow.

Sorry, where do these rules come from?

> +Minchan, Sergey,
> Do you think we can implement this change in zRAM by using PAGE_SIZE
> instead of 2 * PAGE_SIZE?

Sorry again, what problem are you solving?
On (25/02/22 21:31), Sergey Senozhatsky wrote:
> > As long as crypto drivers consistently return -ENOSPC or a specific
> > error code for dst_buf overflow, we should be able to eliminate the
> > 2*PAGE_SIZE buffer.
> >
> > My point is:
> > 1. All drivers must be capable of handling dst_buf overflow.
> > 2. All drivers must return a consistent and dedicated error code for
> >    dst_buf overflow.

So I didn't look at all of them, but at least S/W lzo1 doesn't even
have a notion of max-output-len. lzo1x_1_compress() accepts a pointer
to out_len which tells the size of output stream (the algorithm is free
to produce any), so there is no dst_buf overflow as far as lzo1 is
concerned. Unless I'm missing something or misunderstanding your points.
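For reference, the lzo1 prototype (as I read include/linux/lzo.h) only
reports how much output was produced; there is no parameter through which
the caller can bound the destination:

	/* out_len is output-only: the caller cannot cap the dst buffer. */
	int lzo1x_1_compress(const unsigned char *src, size_t src_len,
			     unsigned char *dst, size_t *dst_len,
			     void *wrkmem);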
On Sun, Feb 23, 2025 at 1:31 AM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> On (25/02/22 19:26), Barry Song wrote:
> > After reviewing the zRAM code, I don't see why zram_write_page() needs
> > to rely on comp_len to call write_incompressible_page().
>
> [..]
>
> > zram_write_page()
> > {
> >	ret = zcomp_compress(zram->comps[ZRAM_PRIMARY_COMP], zstrm,
> >			     mem, &comp_len);
> >	kunmap_local(mem);
> >
> >	if (unlikely(ret && ret != -ENOSPC)) {
> >		zcomp_stream_put(zstrm);
> >		pr_err("Compression failed! err=%d\n", ret);
> >		return ret;
> >	}
> >
> >	if (comp_len >= huge_class_size || ret) {
> >		zcomp_stream_put(zstrm);
> >		return write_incompressible_page(zram, page, index);
> >	}
> > }
>
> Sorry, I'm slower than usual now, but why should we? Shouldn't compression
> algorithms just never fail, even on 3D videos, because otherwise they won't
> be able to validate their Weissman score or something :)
>
> On a serious note - what is the use-case here? Is the failure here due to
> some random "cosmic rays" that taint the compression H/W? If so then what
> makes us believe that it's uni-directional? What if it's decompression
> that gets busted and then you can't decompress anything previously stored
> compressed and stored in zsmalloc. Wouldn't it be better in this case
> to turn the computer off and on again?
>
> The idea behind zram's code is that incompressible pages are not unusual,
> they are quite usual, in fact. It's not necessarily that the data grew
> in size after compression, the data is incompressible from zsmalloc PoV.
> That is, the algorithm wasn't able to compress a PAGE_SIZE buffer to an
> object smaller than zsmalloc's huge-class-watermark (around 3600 bytes,
> depending on zspage chain size). That's why we look at the comp-len.
> Anything else is an error, perhaps a pretty catastrophic error.
>
> > As long as crypto drivers consistently return -ENOSPC or a specific
> > error code for dst_buf overflow, we should be able to eliminate the
> > 2*PAGE_SIZE buffer.
> >
> > My point is:
> > 1. All drivers must be capable of handling dst_buf overflow.
> > 2. All drivers must return a consistent and dedicated error code for
> >    dst_buf overflow.
>
> Sorry, where do these rules come from?
>
> > +Minchan, Sergey,
> > Do you think we can implement this change in zRAM by using PAGE_SIZE
> > instead of 2 * PAGE_SIZE?
>
> Sorry again, what problem are you solving?

The context is that both zswap and zRAM currently use a destination
buffer of 2 * PAGE_SIZE instead of just PAGE_SIZE. Herbert, Chengming,
and Yosry are questioning why it hasn't been reduced to a single
PAGE_SIZE, and some attempts have been made to do so [1][2].

The rules are based on my thoughts on feasibility if we aim to reduce
it to a single PAGE_SIZE.

[1] https://lore.kernel.org/linux-mm/Z7F1B_blIbByYBzz@gondor.apana.org.au/
[2] https://lore.kernel.org/lkml/20231213-zswap-dstmem-v4-1-f228b059dd89@bytedance.com/

Thanks
Barry
On Sat, Feb 22, 2025 at 11:27:49PM +0900, Sergey Senozhatsky wrote:
>
> So I didn't look at all of them, but at least S/W lzo1 doesn't even
> have a notion of max-output-len. lzo1x_1_compress() accepts a pointer
> to out_len which tells the size of output stream (the algorithm is free
> to produce any), so there is no dst_buf overflow as far as lzo1 is
> concerned. Unless I'm missing something or misunderstanding your points.

I just looked at deflate/zstd and they seem to be doing the right
things.

But yes lzo is a gaping security hole on the compression side. The
API has always specified a maximum output length and it needs to be
respected for both compression and decompression.

Cheers,
On Sat, Feb 22, 2025 at 09:31:41PM +0900, Sergey Senozhatsky wrote:
>
> The idea behind zram's code is that incompressible pages are not unusual,
> they are quite usual, in fact. It's not necessarily that the data grew
> in size after compression, the data is incompressible from zsmalloc PoV.
> That is, the algorithm wasn't able to compress a PAGE_SIZE buffer to an
> object smaller than zsmalloc's huge-class-watermark (around 3600 bytes,
> depending on zspage chain size). That's why we look at the comp-len.
> Anything else is an error, perhaps a pretty catastrophic error.

If you're rejecting everything above the watermark then you should
simply pass the watermark as the output length to the algorithm so
that it can stop doing useless work once it gets past that point.

> > +Minchan, Sergey,
> > Do you think we can implement this change in zRAM by using PAGE_SIZE
> > instead of 2 * PAGE_SIZE?
>
> Sorry again, what problem are you solving?

For compression, there is no point in allocating a destination buffer
that is bigger than the original.

Cheers,
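In zcomp terms that suggestion might look roughly like the below (a sketch
only; huge_class_size lives in zram_drv.c today, so assume it gets plumbed
into zcomp somehow):

int zcomp_compress(struct zcomp_strm *zstrm, const void *src,
		   unsigned int *dst_len)
{
	/*
	 * Cap the destination at zsmalloc's huge-class watermark so the
	 * algorithm can bail out early; a dst_buf overflow then simply
	 * means "store this page as incompressible".
	 */
	*dst_len = huge_class_size;

	return crypto_comp_compress(zstrm->tfm, src, PAGE_SIZE,
				    zstrm->buffer, dst_len);
}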
On (25/02/23 08:24), Herbert Xu wrote:
> On Sat, Feb 22, 2025 at 09:31:41PM +0900, Sergey Senozhatsky wrote:
> >
> > The idea behind zram's code is that incompressible pages are not unusual,
> > they are quite usual, in fact. It's not necessarily that the data grew
> > in size after compression, the data is incompressible from zsmalloc PoV.
> > That is, the algorithm wasn't able to compress a PAGE_SIZE buffer to an
> > object smaller than zsmalloc's huge-class-watermark (around 3600 bytes,
> > depending on zspage chain size). That's why we look at the comp-len.
> > Anything else is an error, perhaps a pretty catastrophic error.
>
> If you're rejecting everything above the watermark then you should
> simply pass the watermark as the output length to the algorithm so
> that it can stop doing useless work once it gets past that point.

Makes sense.
On (25/02/23 08:14), Herbert Xu wrote:
> On Sat, Feb 22, 2025 at 11:27:49PM +0900, Sergey Senozhatsky wrote:
> >
> > So I didn't look at all of them, but at least S/W lzo1 doesn't even
> > have a notion of max-output-len. lzo1x_1_compress() accepts a pointer
> > to out_len which tells the size of output stream (the algorithm is free
> > to produce any), so there is no dst_buf overflow as far as lzo1 is
> > concerned. Unless I'm missing something or misunderstanding your points.
>
> I just looked at deflate/zstd and they seem to be doing the right
> things.
>
> But yes lzo is a gaping security hole on the compression side.

Right, for lzo/lzo-rle we need a safety page.

It also seems that there is no common way of reporting dst_buf overflow.
Some algos return -ENOSPC immediately, some don't return anything at all,
and deflate does its own thing - there are these places where they see
they are out of output space but they Z_OK it

	if (s->pending != 0) {
		flush_pending(strm);
		if (strm->avail_out == 0) {
			/* Since avail_out is 0, deflate will be called again with
			 * more output space, but possibly with both pending and
			 * avail_in equal to zero. There won't be anything to do,
			 * but this is not an error situation so make sure we
			 * return OK instead of BUF_ERROR at next call of deflate:
			 */
			s->last_flush = -1;
			return Z_OK;
		}
	}
On Sun, Feb 23, 2025 at 11:09:32AM +0900, Sergey Senozhatsky wrote:
>
> Right, for lzo/lzo-rle we need a safety page.

We should fix it because it's a security hole for anyone who calls
it through the Crypto API.

> It also seems that there is no common way of reporting dst_buf overflow.
> Some algos return -ENOSPC immediately, some don't return anything at all,
> and deflate does its own thing - there are these places where they see
> they are out of output space but they Z_OK it
>
>	if (s->pending != 0) {
>		flush_pending(strm);
>		if (strm->avail_out == 0) {
>			/* Since avail_out is 0, deflate will be called again with
>			 * more output space, but possibly with both pending and
>			 * avail_in equal to zero. There won't be anything to do,
>			 * but this is not an error situation so make sure we
>			 * return OK instead of BUF_ERROR at next call of deflate:
>			 */
>			s->last_flush = -1;
>			return Z_OK;
>		}
>	}

Z_OK is actually an error, see crypto/deflate.c:

	ret = zlib_deflate(stream, Z_FINISH);
	if (ret != Z_STREAM_END) {
		ret = -EINVAL;
		goto out;
	}

We could change this to ENOSPC for consistency.

If you do find anything that returns 0 through the Crypto API please
let me know.

Cheers,
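Presumably something along these lines (an untested sketch of the
consistency change, not an actual patch):

	ret = zlib_deflate(stream, Z_FINISH);
	if (ret != Z_STREAM_END) {
		/* distinguish "ran out of output space" from other failures */
		ret = stream->avail_out ? -EINVAL : -ENOSPC;
		goto out;
	}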
On (25/02/23 10:52), Herbert Xu wrote:
> On Sun, Feb 23, 2025 at 11:09:32AM +0900, Sergey Senozhatsky wrote:
> >
> > Right, for lzo/lzo-rle we need a safety page.
>
> We should fix it because it's a security hole for anyone who calls
> it through the Crypto API.

Yeah, I don't disagree.

> > It also seems that there is no common way of reporting dst_buf overflow.
> > Some algos return -ENOSPC immediately, some don't return anything at all,
> > and deflate does its own thing - there are these places where they see
> > they are out of output space but they Z_OK it
> >
> >	if (s->pending != 0) {
> >		flush_pending(strm);
> >		if (strm->avail_out == 0) {
> >			/* Since avail_out is 0, deflate will be called again with
> >			 * more output space, but possibly with both pending and
> >			 * avail_in equal to zero. There won't be anything to do,
> >			 * but this is not an error situation so make sure we
> >			 * return OK instead of BUF_ERROR at next call of deflate:
> >			 */
> >			s->last_flush = -1;
> >			return Z_OK;
> >		}
> >	}
>
> Z_OK is actually an error, see crypto/deflate.c:

I saw Z_STREAM_END, but deflate states "this is not an error" and
there are more places like this.

>	ret = zlib_deflate(stream, Z_FINISH);
>	if (ret != Z_STREAM_END) {
>		ret = -EINVAL;
>		goto out;
>	}
>
> We could change this to ENOSPC for consistency.

So it will ENOSPC all errors, not sure how good that is. We also
have lz4/lz4hc that return the number of bytes "(((char *)op) - dest)"
if successful and 0 otherwise. So any error is 0. dst_buf overrun
is also 0, impossible to tell the difference, again not sure if we
can just ENOSPC.
On Sun, Feb 23, 2025 at 12:12:47PM +0900, Sergey Senozhatsky wrote:
>
> > > It also seems that there is no common way of reporting dst_buf overflow.
> > > Some algos return -ENOSPC immediately, some don't return anything at all,
> > > and deflate does its own thing - there are these places where they see
> > > they are out of output space but they Z_OK it
> > >
> > >	if (s->pending != 0) {
> > >		flush_pending(strm);
> > >		if (strm->avail_out == 0) {
> > >			/* Since avail_out is 0, deflate will be called again with
> > >			 * more output space, but possibly with both pending and
> > >			 * avail_in equal to zero. There won't be anything to do,
> > >			 * but this is not an error situation so make sure we
> > >			 * return OK instead of BUF_ERROR at next call of deflate:
> > >			 */
> > >			s->last_flush = -1;
> > >			return Z_OK;
> > >		}
> > >	}
> >
> > Z_OK is actually an error, see crypto/deflate.c:
>
> I saw Z_STREAM_END, but deflate states "this is not an error" and
> there are more places like this.

That would be a serious bug in deflate. Where did you see it
return Z_STREAM_END in case of an overrun or error?

> So it will ENOSPC all errors, not sure how good that is. We also
> have lz4/lz4hc that return the number of bytes "(((char *)op) - dest)"
> if successful and 0 otherwise. So any error is 0. dst_buf overrun
> is also 0, impossible to tell the difference, again not sure if we
> can just ENOSPC.

I'm talking about the Crypto API calling convention. Individual
compression libraries obviously have vastly different calling
conventions.

In the Crypto API, lz4 will return -EINVAL:

	int out_len = LZ4_compress_default(src, dst,
					   slen, *dlen, ctx);

	if (!out_len)
		return -EINVAL;

Cheers,
On (25/02/23 11:38), Herbert Xu wrote:
> On Sun, Feb 23, 2025 at 12:12:47PM +0900, Sergey Senozhatsky wrote:
> >
> > > > It also seems that there is no common way of reporting dst_buf overflow.
> > > > Some algos return -ENOSPC immediately, some don't return anything at all,
> > > > and deflate does its own thing - there are these places where they see
> > > > they are out of output space but they Z_OK it
> > > >
> > > >	if (s->pending != 0) {
> > > >		flush_pending(strm);
> > > >		if (strm->avail_out == 0) {
> > > >			/* Since avail_out is 0, deflate will be called again with
> > > >			 * more output space, but possibly with both pending and
> > > >			 * avail_in equal to zero. There won't be anything to do,
> > > >			 * but this is not an error situation so make sure we
> > > >			 * return OK instead of BUF_ERROR at next call of deflate:
> > > >			 */
> > > >			s->last_flush = -1;
> > > >			return Z_OK;
> > > >		}
> > > >	}
> > >
> > > Z_OK is actually an error, see crypto/deflate.c:
> >
> > I saw Z_STREAM_END, but deflate states "this is not an error" and
> > there are more places like this.
>
> That would be a serious bug in deflate. Where did you see it
> return Z_STREAM_END in case of an overrun or error?

Oh, sorry for the confusion, I was talking about Z_OK for overruns.

> > So it will ENOSPC all errors, not sure how good that is. We also
> > have lz4/lz4hc that return the number of bytes "(((char *)op) - dest)"
> > if successful and 0 otherwise. So any error is 0. dst_buf overrun
> > is also 0, impossible to tell the difference, again not sure if we
> > can just ENOSPC.
>
> I'm talking about the Crypto API calling convention. Individual
> compression libraries obviously have vastly different calling
> conventions.
>
> In the Crypto API, lz4 will return -EINVAL:
>
>	int out_len = LZ4_compress_default(src, dst,
>					   slen, *dlen, ctx);
>
>	if (!out_len)
>		return -EINVAL;

Right, so you said that for deflate it could be

	ret = zlib_deflate(stream, Z_FINISH);
	if (ret != Z_STREAM_END) {
		ret = -ENOSPC;	// and not -EINVAL
		goto out;
	}

if I understood it correctly. Which would make it: return 0 on success
or -ENOSPC otherwise. So if crypto API wants consistency and return
-ENOSPC for buffer overruns, then for lz4/lz4hc it also becomes binary:
either 0 or -ENOSPC. Current -EINVAL return looks better to me, both
for deflate and for lz4/lz4hc. -ENOSPC is an actionable error code, a
user can double the dst_out size and retry compression etc., while in
reality it could be some SW/HW issue that is misreported as -ENOSPC.

So re-iterating Barry's points:

> My point is:
> 1. All drivers must be capable of handling dst_buf overflow.

Not the case.

> 2. All drivers must return a consistent and dedicated error code for
>    dst_buf overflow.

Not the case.
On Sun, Feb 23, 2025 at 01:02:41PM +0900, Sergey Senozhatsky wrote:
>
> if I understood it correctly. Which would make it: return 0 on success
> or -ENOSPC otherwise. So if crypto API wants consistency and return
> -ENOSPC for buffer overruns, then for lz4/lz4hc it also becomes binary:
> either 0 or -ENOSPC. Current -EINVAL return looks better to me, both
> for deflate and for lz4/lz4hc. -ENOSPC is an actionable error code, a
> user can double the dst_out size and retry compression etc., while in
> reality it could be some SW/HW issue that is misreported as -ENOSPC.

When you're compressing you're trying to make it smaller. It's always
better to not compress something rather than doubling the buffer on
ENOSPC.

In any case, no software compression algorithm should ever fail for a
reason other than ENOSPC. Hardware offload devices can fail of course.

Cheers,
diff --git a/crypto/acompress.c b/crypto/acompress.c
index cb6444d09dd7..165559a8b9bd 100644
--- a/crypto/acompress.c
+++ b/crypto/acompress.c
@@ -84,6 +84,9 @@ static int crypto_acomp_init_tfm(struct crypto_tfm *tfm)
 
 	acomp->compress = alg->compress;
 	acomp->decompress = alg->decompress;
+	acomp->get_batch_size = alg->get_batch_size;
+	acomp->batch_compress = alg->batch_compress;
+	acomp->batch_decompress = alg->batch_decompress;
 	acomp->dst_free = alg->dst_free;
 	acomp->reqsize = alg->reqsize;
 
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index eadc24514056..8451ade70fd8 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -43,6 +43,10 @@ struct acomp_req {
  *
  * @compress:		Function performs a compress operation
  * @decompress:		Function performs a de-compress operation
+ * @get_batch_size:	Maximum batch-size for batching compress/decompress
+ *			operations.
+ * @batch_compress:	Function performs a batch compress operation
+ * @batch_decompress:	Function performs a batch decompress operation
  * @dst_free:		Frees destination buffer if allocated inside the
  *			algorithm
  * @reqsize:		Context size for (de)compression requests
@@ -51,6 +55,21 @@ struct acomp_req {
 struct crypto_acomp {
 	int (*compress)(struct acomp_req *req);
 	int (*decompress)(struct acomp_req *req);
+	unsigned int (*get_batch_size)(void);
+	bool (*batch_compress)(struct acomp_req *reqs[],
+			       struct crypto_wait *wait,
+			       struct page *pages[],
+			       u8 *dsts[],
+			       unsigned int dlens[],
+			       int errors[],
+			       int nr_pages);
+	bool (*batch_decompress)(struct acomp_req *reqs[],
+				 struct crypto_wait *wait,
+				 u8 *srcs[],
+				 struct page *pages[],
+				 unsigned int slens[],
+				 int errors[],
+				 int nr_pages);
 	void (*dst_free)(struct scatterlist *dst);
 	unsigned int reqsize;
 	struct crypto_tfm base;
@@ -142,6 +161,13 @@ static inline bool acomp_is_async(struct crypto_acomp *tfm)
 		CRYPTO_ALG_ASYNC;
 }
 
+static inline bool acomp_has_async_batching(struct crypto_acomp *tfm)
+{
+	return (acomp_is_async(tfm) &&
+		(crypto_comp_alg_common(tfm)->base.cra_flags & CRYPTO_ALG_TYPE_ACOMPRESS) &&
+		tfm->get_batch_size && tfm->batch_compress && tfm->batch_decompress);
+}
+
 static inline struct crypto_acomp *crypto_acomp_reqtfm(struct acomp_req *req)
 {
 	return __crypto_acomp_tfm(req->base.tfm);
@@ -306,4 +332,89 @@ static inline int crypto_acomp_decompress(struct acomp_req *req)
 	return crypto_acomp_reqtfm(req)->decompress(req);
 }
 
+/**
+ * crypto_acomp_batch_size() -- Get the algorithm's batch size
+ *
+ * Function returns the algorithm's batch size for batching operations
+ *
+ * @tfm: ACOMPRESS tfm handle allocated with crypto_alloc_acomp()
+ *
+ * Return: crypto_acomp's batch size.
+ */
+static inline unsigned int crypto_acomp_batch_size(struct crypto_acomp *tfm)
+{
+	if (acomp_has_async_batching(tfm))
+		return tfm->get_batch_size();
+
+	return 1;
+}
+
+/**
+ * crypto_acomp_batch_compress() -- Invoke asynchronous compress of
+ *                                  a batch of requests
+ *
+ * Function invokes the asynchronous batch compress operation
+ *
+ * @reqs:     @nr_pages asynchronous compress requests.
+ * @wait:     crypto_wait for acomp batch compress with synchronous/asynchronous
+ *            request chaining. If NULL, the driver must provide a way to process
+ *            request completions asynchronously.
+ * @pages:    Pages to be compressed.
+ * @dsts:     Pre-allocated destination buffers to store results of compression.
+ * @dlens:    Will contain the compressed lengths.
+ * @errors:   zero on successful compression of the corresponding
+ *            req, or error code in case of error.
+ * @nr_pages: The number of pages to be compressed.
+ *
+ * Returns true if all compress requests complete successfully,
+ * false otherwise.
+ */
+static inline bool crypto_acomp_batch_compress(struct acomp_req *reqs[],
+					       struct crypto_wait *wait,
+					       struct page *pages[],
+					       u8 *dsts[],
+					       unsigned int dlens[],
+					       int errors[],
+					       int nr_pages)
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(reqs[0]);
+
+	return tfm->batch_compress(reqs, wait, pages, dsts,
+				   dlens, errors, nr_pages);
+}
+
+/**
+ * crypto_acomp_batch_decompress() -- Invoke asynchronous decompress of
+ *                                    a batch of requests
+ *
+ * Function invokes the asynchronous batch decompress operation
+ *
+ * @reqs:     @nr_pages asynchronous decompress requests.
+ * @wait:     crypto_wait for acomp batch decompress with synchronous/asynchronous
+ *            request chaining. If NULL, the driver must provide a way to process
+ *            request completions asynchronously.
+ * @srcs:     The src buffers to be decompressed.
+ * @pages:    The pages to store the decompressed buffers.
+ * @slens:    Compressed lengths of @srcs.
+ * @errors:   zero on successful compression of the corresponding
+ *            req, or error code in case of error.
+ * @nr_pages: The number of pages to be decompressed.
+ *
+ * Returns true if all decompress requests complete successfully,
+ * false otherwise.
+ */
+static inline bool crypto_acomp_batch_decompress(struct acomp_req *reqs[],
+						 struct crypto_wait *wait,
+						 u8 *srcs[],
+						 struct page *pages[],
+						 unsigned int slens[],
+						 int errors[],
+						 int nr_pages)
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(reqs[0]);
+
+	return tfm->batch_decompress(reqs, wait, srcs, pages,
+				     slens, errors, nr_pages);
+}
+
 #endif
diff --git a/include/crypto/internal/acompress.h b/include/crypto/internal/acompress.h
index 53b4ef59b48c..df0e192801ff 100644
--- a/include/crypto/internal/acompress.h
+++ b/include/crypto/internal/acompress.h
@@ -17,6 +17,10 @@
  *
  * @compress:	Function performs a compress operation
  * @decompress:	Function performs a de-compress operation
+ * @get_batch_size: Maximum batch-size for batching compress/decompress
+ *		    operations.
+ * @batch_compress: Function performs a batch compress operation
+ * @batch_decompress: Function performs a batch decompress operation
  * @dst_free:	Frees destination buffer if allocated inside the algorithm
  * @init:	Initialize the cryptographic transformation object.
  *		This function is used to initialize the cryptographic
@@ -37,6 +41,21 @@ struct acomp_alg {
 
 	int (*compress)(struct acomp_req *req);
 	int (*decompress)(struct acomp_req *req);
+	unsigned int (*get_batch_size)(void);
+	bool (*batch_compress)(struct acomp_req *reqs[],
+			       struct crypto_wait *wait,
+			       struct page *pages[],
+			       u8 *dsts[],
+			       unsigned int dlens[],
+			       int errors[],
+			       int nr_pages);
+	bool (*batch_decompress)(struct acomp_req *reqs[],
+				 struct crypto_wait *wait,
+				 u8 *srcs[],
+				 struct page *pages[],
+				 unsigned int slens[],
+				 int errors[],
+				 int nr_pages);
 	void (*dst_free)(struct scatterlist *dst);
 	int (*init)(struct crypto_acomp *tfm);
 	void (*exit)(struct crypto_acomp *tfm);
This commit adds get_batch_size(), batch_compress() and batch_decompress()
interfaces to:

  struct acomp_alg
  struct crypto_acomp

A crypto_acomp compression algorithm that supports batching of compressions
and decompressions must provide implementations for these API.

A new helper function acomp_has_async_batching() can be invoked to query if
a crypto_acomp has registered these batching interfaces.

A higher level module like zswap can call acomp_has_async_batching() to
detect if the compressor supports batching, and if so, it can call the new
crypto_acomp_batch_size() to detect the maximum batch-size supported by the
compressor. Based on this, zswap can use the minimum of any zswap-specific
upper limits for batch-size and the compressor's max batch-size, to allocate
batching resources.

This allows the iaa_crypto Intel IAA driver to register implementations for
the get_batch_size(), batch_compress() and batch_decompress() acomp_alg
interfaces, that can subsequently be invoked from the kernel zswap/zram
modules to compress/decompress pages in parallel in the IAA hardware
accelerator to improve swapout/swapin performance through these newly added
corresponding crypto_acomp API:

  crypto_acomp_batch_size()
  crypto_acomp_batch_compress()
  crypto_acomp_batch_decompress()

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 crypto/acompress.c                  |   3 +
 include/crypto/acompress.h          | 111 ++++++++++++++++++++++++++++
 include/crypto/internal/acompress.h |  19 +++++
 3 files changed, 133 insertions(+)
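For illustration, a rough sketch of how a caller such as zswap might use the
interfaces above (the zswap-side names, e.g. ZSWAP_MAX_BATCH_SIZE, reqs[],
pages[], dsts[], are hypothetical and not part of this patch):

	/* At pool setup: pick a batch size once. */
	unsigned int nr_reqs = 1;

	if (acomp_has_async_batching(acomp))
		nr_reqs = min(ZSWAP_MAX_BATCH_SIZE,
			      crypto_acomp_batch_size(acomp));

	/* At reclaim time: compress up to nr_reqs pages in one call. */
	if (!crypto_acomp_batch_compress(reqs, &wait, pages, dsts,
					 dlens, errors, nr_reqs)) {
		/* errors[i] identifies the pages that failed to compress. */
	}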