mbox series

[00/22] Don't use kmalloc() with GFP_DMA

Message ID 20220219005221.634-1-bhe@redhat.com
Headers show
Series Don't use kmalloc() with GFP_DMA | expand

Message

Baoquan He Feb. 19, 2022, 12:51 a.m. UTC
Let's replace it with other ways. This is the first step towards
removing dma-kmalloc support in kernel (Means that if everyting
is going well, we can't use kmalloc(GFP_DMA) to allocate buffer in the
future).

This series includes below changes which are easier to recognise and
make. 

1) Remove GFP_DMA from dma_alloc_wc/noncoherent(), dma_pool_alloc(),
   and dmam_alloc_coherent() which are redundant to specify GFP_DMA when
   calling.
2) Replace kmalloc(GFP_DMA)/dma_map_xxxx() pair with dma_alloc_noncoherent().

Next, plan to investigate how we should handle places as below. We
firstly need figure out whether they really need buffer from ZONE_DMA.
If yes, how to change them with other ways. This need help from
maintainers, experts from sub-components and code contributors or anyone
knowing them well. E.g s390 and crypyto, we need guidance and help.

1) Kmalloc(GFP_DMA) in s390 platform, under arch/s390 and drivers/s390;
2) Kmalloc(GFP_DMA) in drivers/crypto;
3) Kmalloc(GFP_DMA) in network drivers under drivers/net, e.g skb
   buffer requested from DMA zone.
4) Kmalloc(GFP_DMA) in device register control, e.g using regmap, devres  
   to read/write register, while memory from ZONE_DMA is required, e.g
   i2c, spi. 

For this first patch series, thanks to Hyeonggon for helping
reviewing and great suggestions on patch improving. We will work
together to continue the next steps of work.

Any comment, thought, or suggestoin is welcome and appreciated,
including but not limited to:
1) whether we should remove dma-kmalloc support in kernel();
3) why kmalloc(GFP_DMA) is needed in a certain place. why memory from
   ZONE_DMA has to be requested in the case.
2) how to replace it with other ways in any place which you are familiar
   with;

===========================Background information=======================
Prelusion:
Earlier, allocation failure was observed when calling kmalloc() with
GFP_DMA. It requests to allocate slab page from DMA zone while no managed
pages at all in there. Because in the current kernel, dma-kmalloc will
be created as long as CONFIG_ZONE_DMA is enabled. However, kdump kernel
of x86_64 doesn't have managed pages on DMA zone since below commit. The
details of this kdump issue can be found in reference link (a).

	commit 6f599d84231f ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")

To make clear the root cause and fix, many reviewers contributed their
thoughts and suggestions in the thread of the patchset v3 (a). Finally
Hyeonggon concluded what we can do to fix the kdump issue for now as a
workaround, and further action to get rid of dma-kmalloc which is not
a reasonable existence. (Please see Hyeonggon's reply in refernce (b)).
Quote Hyeonggon's words here:
~~~~
What about one of those?:

    1) Do not call warn_alloc in page allocator if will always fail
    to allocate ZONE_DMA pages.

    2) let's check all callers of kmalloc with GFP_DMA
    if they really need GFP_DMA flag and replace those by DMA API or
    just remove GFP_DMA from kmalloc()

    3) Drop support for allocating DMA memory from slab allocator
    (as Christoph Hellwig said) and convert them to use DMA32
    and see what happens
~~~~

Then Christoph acked Hyeonggon's conclusion, and said "This is the right
thing to do, but it will take a while." (See reference link (c))


==========Reference links=======
(a) v4 post including the details of kdump issue:
https://lore.kernel.org/all/20211223094435.248523-1-bhe@redhat.com/T/#u

(b) v3 post including many reviewers' comments:
https://lore.kernel.org/all/20211213122712.23805-1-bhe@redhat.com/T/#u

(c) Hyeonggon's mail concluding the solution:
https://lore.kernel.org/all/20211215044818.GB1097530@odroid/T/#u

(d) Christoph acked the plan in this mail:
https://lore.kernel.org/all/20211215072710.GA3010@lst.de/T/#u

Baoquan He (21):
  parisc: pci-dma: remove stale code and comment
  gpu: ipu-v3: Don't use GFP_DMA when calling dma_alloc_coherent()
  drm/sti: Don't use GFP_DMA when calling dma_alloc_wc()
  sound: n64: Don't use GFP_DMA when calling dma_alloc_coherent()
  fbdev: da8xx: Don't use GFP_DMA when calling dma_alloc_coherent()
  fbdev: mx3fb: Don't use GFP_DMA when calling dma_alloc_wc()
  usb: gadget: lpc32xx_udc: Don't use GFP_DMA when calling
    dma_alloc_coherent()
  usb: cdns3: Don't use GFP_DMA when calling dma_alloc_coherent()
  uio: pruss: Don't use GFP_DMA when calling dma_alloc_coherent()
  staging: emxx_udc: Don't use GFP_DMA when calling dma_alloc_coherent()
  staging: emxx_udc: Don't use GFP_DMA when calling dma_alloc_coherent()
  spi: atmel: Don't use GFP_DMA when calling dma_alloc_coherent()
  spi: spi-ti-qspi: Don't use GFP_DMA when calling dma_alloc_coherent()
  usb: cdns3: Don't use GFP_DMA when calling dma_pool_alloc()
  usb: udc: lpc32xx: Don't use GFP_DMA when calling dma_pool_alloc()
  net: marvell: prestera: Don't use GFP_DMA when calling
    dma_pool_alloc()
  net: ethernet: mtk-star-emac: Don't use GFP_DMA when calling
    dmam_alloc_coherent()
  ethernet: rocker: Use dma_alloc_noncoherent() for dma buffer
  HID: intel-ish-hid: Use dma_alloc_noncoherent() for dma buffer
  mmc: wbsd: Use dma_alloc_noncoherent() for dma buffer
  mtd: rawnand: Use dma_alloc_noncoherent() for dma buffer

Hyeonggon Yoo (1):
  net: moxa: Don't use GFP_DMA when calling dma_alloc_coherent()

 arch/parisc/kernel/pci-dma.c                  |  8 ---
 drivers/gpu/drm/sti/sti_cursor.c              |  4 +-
 drivers/gpu/drm/sti/sti_hqvdp.c               |  2 +-
 drivers/gpu/ipu-v3/ipu-image-convert.c        |  2 +-
 drivers/hid/intel-ish-hid/ishtp-fw-loader.c   | 23 +++-----
 drivers/mmc/host/wbsd.c                       | 45 +++-----------
 drivers/mtd/nand/raw/marvell_nand.c           | 55 ++++++++++-------
 .../ethernet/marvell/prestera/prestera_rxtx.c |  2 +-
 drivers/net/ethernet/mediatek/mtk_star_emac.c |  2 +-
 drivers/net/ethernet/moxa/moxart_ether.c      |  4 +-
 drivers/net/ethernet/rocker/rocker_main.c     | 59 ++++++++-----------
 drivers/spi/spi-atmel.c                       |  4 +-
 drivers/spi/spi-ti-qspi.c                     |  2 +-
 drivers/staging/emxx_udc/emxx_udc.c           |  2 +-
 drivers/staging/media/imx/imx-media-utils.c   |  2 +-
 drivers/uio/uio_pruss.c                       |  2 +-
 drivers/usb/cdns3/cdns3-gadget.c              |  4 +-
 drivers/usb/gadget/udc/lpc32xx_udc.c          |  4 +-
 drivers/video/fbdev/da8xx-fb.c                |  4 +-
 drivers/video/fbdev/fsl-diu-fb.c              |  2 +-
 drivers/video/fbdev/mx3fb.c                   |  2 +-
 sound/mips/snd-n64.c                          |  2 +-
 22 files changed, 97 insertions(+), 139 deletions(-)

Comments

Jakub Kicinski Feb. 19, 2022, 4:54 a.m. UTC | #1
On Sat, 19 Feb 2022 08:52:16 +0800 Baoquan He wrote:
> dma_pool_alloc() uses dma_alloc_coherent() to pre-allocate DMA buffer,
> so it's redundent to specify GFP_DMA when calling.
> 
> Signed-off-by: Baoquan He <bhe@redhat.com>

This and the other two netdev patches in the series are perfectly
cleanups reasonable even outside of the larger context.

Please repost those separately and make sure you CC the maintainers
of the drivers.
Christoph Hellwig Feb. 19, 2022, 7:07 a.m. UTC | #2
On Sat, Feb 19, 2022 at 08:52:00AM +0800, Baoquan He wrote:
> The gfp assignment has been commented out in ancient times, combined with
> the code comment, obviously it's not needed since then. Let's remove the
> whole ifdeffery block so that GFP_DMA searching won't point to this.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig Feb. 19, 2022, 7:07 a.m. UTC | #3
On Sat, Feb 19, 2022 at 08:52:02AM +0800, Baoquan He wrote:
> dma_alloc_coherent() allocates dma buffer with device's addressing
> limitation in mind. It's redundent to specify GFP_DMA when calling
> dma_alloc_coherent().

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig Feb. 19, 2022, 7:08 a.m. UTC | #4
On Sat, Feb 19, 2022 at 08:52:04AM +0800, Baoquan He wrote:
> dma_alloc_coherent() allocates dma buffer with device's addressing
> limitation in mind. It's redundent to specify GFP_DMA when calling
> dma_alloc_coherent().
> 
> [ 42.hyeyoo@gmail.com: Update changelog ]
> 
> Signed-off-by: Baoquan He <bhe@redhat.com>
> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig Feb. 19, 2022, 7:08 a.m. UTC | #5
On Sat, Feb 19, 2022 at 08:52:06AM +0800, Baoquan He wrote:
> dma_alloc_coherent() allocates dma buffer with device's addressing
> limitation in mind. It's redundent to specify GFP_DMA when calling
> dma_alloc_coherent().

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig Feb. 19, 2022, 7:09 a.m. UTC | #6
On Sat, Feb 19, 2022 at 08:52:08AM +0800, Baoquan He wrote:
> dma_alloc_coherent() allocates dma buffer with device's addressing
> limitation in mind. It's redundent to specify GFP_DMA when calling
> dma_alloc_coherent(). replace it with GFP_KERNEL.

Plase avoid the overly long line. The rest looks good.
Christoph Hellwig Feb. 19, 2022, 7:09 a.m. UTC | #7
On Sat, Feb 19, 2022 at 08:52:10AM +0800, Baoquan He wrote:
> dma_alloc_coherent() allocates dma buffer with device's addressing
> limitation in mind. It's redundent to specify GFP_DMA when calling
> dma_alloc_coherent().

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig Feb. 19, 2022, 7:10 a.m. UTC | #8
On Sat, Feb 19, 2022 at 08:52:12AM +0800, Baoquan He wrote:
> dma_alloc_coherent() allocates dma buffer with device's addressing
> limitation in mind. It's redundent to specify GFP_DMA when calling
> dma_alloc_coherent().

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig Feb. 19, 2022, 7:13 a.m. UTC | #9
On Sat, Feb 19, 2022 at 08:52:14AM +0800, Baoquan He wrote:
> dma_pool_alloc() uses dma_alloc_coherent() to pre-allocate DMA buffer,
> so it's redundent to specify GFP_DMA32 when calling.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig Feb. 19, 2022, 7:13 a.m. UTC | #10
On Sat, Feb 19, 2022 at 08:52:16AM +0800, Baoquan He wrote:
> dma_pool_alloc() uses dma_alloc_coherent() to pre-allocate DMA buffer,
> so it's redundent to specify GFP_DMA when calling.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig Feb. 19, 2022, 7:14 a.m. UTC | #11
On Sat, Feb 19, 2022 at 08:52:18AM +0800, Baoquan He wrote:
> Use dma_alloc_noncoherent() instead to get the DMA buffer.
> 
> [ 42.hyeyoo@gmail.com: Use dma_alloc_noncoherent() instead of
>   __get_free_pages.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig Feb. 19, 2022, 7:17 a.m. UTC | #12
On Sat, Feb 19, 2022 at 08:52:20AM +0800, Baoquan He wrote:
>  	if (request_dma(dma, DRIVER_NAME))
>  		goto err;
>  
> +	dma_set_mask_and_coherent(mmc_dev(host->mmc), DMA_BIT_MASK(24));

This also sets the streaming mask, but the driver doesn't seem to make
use of that.  Please document it in the commit log.

Also setting smaller than 32 bit masks can fail, so this should have
error handling.
Baoquan He Feb. 20, 2022, 1:55 a.m. UTC | #13
On 02/19/22 at 07:51am, Wolfram Sang wrote:
> 
> > --- a/drivers/staging/media/imx/imx-media-utils.c
> 
> $subject says 'emxx_udc' instead of 'imx: media-utils'.

Ah, good catch. It should be wrongly copied from the patch 12, will fix
it, thanks.
Baoquan He Feb. 20, 2022, 8:40 a.m. UTC | #14
On 02/19/22 at 08:17am, Christoph Hellwig wrote:
> On Sat, Feb 19, 2022 at 08:52:20AM +0800, Baoquan He wrote:
> >  	if (request_dma(dma, DRIVER_NAME))
> >  		goto err;
> >  
> > +	dma_set_mask_and_coherent(mmc_dev(host->mmc), DMA_BIT_MASK(24));
> 
> This also sets the streaming mask, but the driver doesn't seem to make
> use of that.  Please document it in the commit log.

Thanks for reviewing. I will change it to dma_set_mask(), and describe
this change in patch log.

> 
> Also setting smaller than 32 bit masks can fail, so this should have
> error handling.

OK, will check and add error handling.
Heiko Carstens Feb. 23, 2022, 7:18 p.m. UTC | #15
On Tue, Feb 22, 2022 at 09:44:22AM +0100, Christoph Hellwig wrote:
> On Mon, Feb 21, 2022 at 02:57:34PM +0100, Heiko Carstens wrote:
> > > 1) Kmalloc(GFP_DMA) in s390 platform, under arch/s390 and drivers/s390;
> > 
> > So, s390 partially requires GFP_DMA allocations for memory areas which
> > are required by the hardware to be below 2GB. There is not necessarily
> > a device associated when this is required. E.g. some legacy "diagnose"
> > calls require buffers to be below 2GB.
> > 
> > How should something like this be handled? I'd guess that the
> > dma_alloc API is not the right thing to use in such cases. Of course
> > we could say, let's waste memory and use full pages instead, however
> > I'm not sure this is a good idea.
> 
> Yeah, I don't think the DMA API is the right thing for that.  This
> is one of the very rare cases where a raw allocation makes sense.
> 
> That being said being able to drop kmalloc support for GFP_DMA would
> be really useful. How much memory would we waste if switching to the
> page allocator?

At a first glance this would not waste much memory, since most callers
seem to allocate such memory pieces only temporarily.

> > The question is: what would this buy us? As stated above I'd assume
> > this comes with quite some code churn, so there should be a good
> > reason to do this.
> 
> There is two steps here.  One is to remove GFP_DMA support from
> kmalloc, which would help to cleanup the slab allocator(s) very nicely,
> as at that point it can stop to be zone aware entirely.

Well, looking at slub.c it looks like there is only a very minimal
maintenance burden for GPF_DMA/GFP_DMA32 support.

> The long term goal is to remove ZONE_DMA entirely at least for
> architectures that only use the small 16MB ISA-style one.  It can
> then be replaced with for example a CMA area and fall into a movable
> zone.  I'd have to prototype this first and see how it applies to the
> s390 case.  It might not be worth it and maybe we should replace
> ZONE_DMA and ZONE_DMA32 with a ZONE_LIMITED for those use cases as
> the amount covered tends to not be totally out of line for what we
> built the zone infrastructure.

So probably I'm missing something; but for small systems where we
would only have ZONE_DMA, how would a CMA area within this zone
improve things?

If I'm not mistaken then the page allocator will not fallback to any
CMA area for GFP_KERNEL allocations. That is: we would somehow need to
find "the right size" for the CMA area, depending on memory size. This
looks like a new problem class which currently does not exist.

Besides that we would also not have all the debugging options provided
by the slab allocator anymore.

Anyway, maybe it would make more sense if you would send your patch
and then we can see where we would end up.
Christoph Hellwig Feb. 24, 2022, 6:33 a.m. UTC | #16
On Wed, Feb 23, 2022 at 08:18:08PM +0100, Heiko Carstens wrote:
> > The long term goal is to remove ZONE_DMA entirely at least for
> > architectures that only use the small 16MB ISA-style one.  It can
> > then be replaced with for example a CMA area and fall into a movable
> > zone.  I'd have to prototype this first and see how it applies to the
> > s390 case.  It might not be worth it and maybe we should replace
> > ZONE_DMA and ZONE_DMA32 with a ZONE_LIMITED for those use cases as
> > the amount covered tends to not be totally out of line for what we
> > built the zone infrastructure.
> 
> So probably I'm missing something; but for small systems where we
> would only have ZONE_DMA, how would a CMA area within this zone
> improve things?

It would not, but more importantly we would not need it at all.  The
thinking here is really about the nasty 16MB ISA-style zone DMA.
a 31-bit something rather different.
Baoquan He Feb. 24, 2022, 2:11 p.m. UTC | #17
On 02/23/22 at 03:25pm, Christoph Hellwig wrote:
> On Wed, Feb 23, 2022 at 08:28:13AM +0800, Baoquan He wrote:
> > Could you tell more why this is wrong? According to
> > Documentation/core-api/dma-api.rst and DMA code, __dma_alloc_pages() is
> > the core function of dma_alloc_pages()/dma_alloc_noncoherent() which are
> > obviously streaming mapping,
> 
> Why are they "obviously" streaming mappings?

Because they are obviously not coherent mapping?

With my understanding, there are two kinds of DMA mapping, coherent
mapping (which is also persistent mapping), and streaming mapping. The
coherent mapping will be handled during driver init, and released during
driver de-init. While streaming mapping will be done when needed at any
time, and released after usage.

Are we going to add another kind of mapping? It's not streaming mapping,
but use dev->coherent_dma_mask, just because it uses dma_alloc_xxx()
api.

> 
> > why do we need to check
> > dev->coherent_dma_mask here? Because dev->coherent_dma_mask is the subset
> > of dev->dma_mask, it's safer to use dev->coherent_dma_mask in these
> > places? This is confusing, I talked to Hyeonggon in private mail, he has
> > the same feeling.
> 
> Think of th coherent_dma_mask as dma_alloc_mask.  It is the mask for the
> DMA memory allocator.  dma_mask is the mask for the dma_map_* routines.

I will check code further. While this may need be noted in doc, e.g
dma_api.rst or dma-api-howto.rst.

If you have guide, I can try to add some words to make clear this. Or
leave this to people who knows this clearly. I believe it will be very
helpful to understand DMA api.
David Laight Feb. 24, 2022, 2:27 p.m. UTC | #18
From: Baoquan He
> Sent: 24 February 2022 14:11
...
> With my understanding, there are two kinds of DMA mapping, coherent
> mapping (which is also persistent mapping), and streaming mapping. The
> coherent mapping will be handled during driver init, and released during
> driver de-init. While streaming mapping will be done when needed at any
> time, and released after usage.

The lifetime has absolutely nothing to do with it.

It is all about how the DMA cycles (from the device) interact with
(or more don't interact with) the cpu memory cache.

For coherent mapping the cpu and device can write to (different)
words in the same cache line at the same time, and both will see
both updates.
On some systems this can only be achieved by making the memory
uncached - which significantly slows down cpu access.

For non-coherent (streaming) mapping the cpu writes back and/or
invalidates the data cache so that the dma read cycles from memory
read the correct data and the cpu re-reads the cache line after
the dma has completed.
They are only really suitable for data buffers.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Baoquan He Feb. 25, 2022, 3:39 p.m. UTC | #19
On 02/24/22 at 02:27pm, David Laight wrote:
> From: Baoquan He
> > Sent: 24 February 2022 14:11
> ...
> > With my understanding, there are two kinds of DMA mapping, coherent
> > mapping (which is also persistent mapping), and streaming mapping. The
> > coherent mapping will be handled during driver init, and released during
> > driver de-init. While streaming mapping will be done when needed at any
> > time, and released after usage.
> 
> The lifetime has absolutely nothing to do with it.
> 
> It is all about how the DMA cycles (from the device) interact with
> (or more don't interact with) the cpu memory cache.
> 
> For coherent mapping the cpu and device can write to (different)
> words in the same cache line at the same time, and both will see
> both updates.
> On some systems this can only be achieved by making the memory
> uncached - which significantly slows down cpu access.
> 
> For non-coherent (streaming) mapping the cpu writes back and/or
> invalidates the data cache so that the dma read cycles from memory
> read the correct data and the cpu re-reads the cache line after
> the dma has completed.
> They are only really suitable for data buffers.

Thanks for valuable input, I agree the lifetime is not stuff we can rely
on to judge. But how do we explain dma_alloc_noncoherent() is not streaming
mapping? Then which kind of dma mapping is it?

I could miss something important to understand this which is obvious to
other people, I will make time to check.