[RFC,00/12] dma: Enable dmem cgroup tracking

Message ID 20250310-dmem-cgroups-v1-0-2984c1bc9312@kernel.org

Message

Maxime Ripard March 10, 2025, 12:06 p.m. UTC
Hi,

Here's preliminary work to enable dmem tracking for heavy users of DMA
allocations on behalf of userspace: v4l2, DRM, and dma-buf heaps.

It's not really meant for inclusion at the moment, because I really
don't like it that much, and would like to discuss solutions on how to
make it nicer.

In particular, the dma dmem region accessors don't feel that great to
me. They duplicate the allocator selection logic of dma_alloc_attrs(),
which looks fragile and potentially buggy.
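
To give an idea of its shape, the selection boils down to something
like this (a simplified sketch rather than the exact code, with the
per-backend helper names and checks approximated):

struct dmem_cgroup_region *dma_get_dmem_cgroup_region(struct device *dev)
{
	/*
	 * Mirror the backend selection that dma_alloc_attrs() already
	 * performs: per-device coherent pool first, then CMA, then the
	 * default region covering plain dma-direct allocations.
	 */
	if (dev->dma_mem)
		return dma_coherent_get_dmem_cgroup_region(dev);

	if (dev_get_cma_area(dev))
		return dma_contiguous_get_dmem_cgroup_region(dev);

	return dma_direct_get_dmem_cgroup_region(dev);
}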

One solution I tried is to do the accounting in dma_alloc_attrs()
directly, depending on a flag being set, similar to what __GFP_ACCOUNT
is doing.

It didn't work because dmem initialises a state pointer when charging an
allocation to a region, and expects that state pointer to be passed back
when uncharging. Since dma_alloc_attrs() returns a void pointer to the
allocated buffer, we need to put that state into a higher-level
structure, such as drm_gem_object or dma_buf.
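
To illustrate the constraint, here's a minimal sketch of the pairing
(the wrapper struct and helpers are made up for the example; only the
dmem_cgroup_*() and dma_*_attrs() calls are real):

#include <linux/cgroup_dmem.h>
#include <linux/dma-mapping.h>

struct my_dma_buffer {
	void *vaddr;
	dma_addr_t dma_addr;
	size_t size;
	/* Has to be kept around somewhere until free time. */
	struct dmem_cgroup_pool_state *pool;
};

static int my_dma_buffer_alloc(struct device *dev, struct my_dma_buffer *buf,
			       size_t size)
{
	int ret;

	/* Charging hands back a pool state pointer... */
	ret = dmem_cgroup_try_charge(dma_get_dmem_cgroup_region(dev), size,
				     &buf->pool, NULL);
	if (ret)
		return ret;

	/* ...but dma_alloc_attrs() only returns the buffer, so the pool
	 * state can't be recovered from the allocation itself later on. */
	buf->vaddr = dma_alloc_attrs(dev, size, &buf->dma_addr, GFP_KERNEL, 0);
	if (!buf->vaddr) {
		dmem_cgroup_uncharge(buf->pool, size);
		return -ENOMEM;
	}

	buf->size = size;
	return 0;
}

static void my_dma_buffer_free(struct device *dev, struct my_dma_buffer *buf)
{
	dma_free_attrs(dev, buf->size, buf->vaddr, buf->dma_addr, 0);
	/* Uncharging needs the very same pool state pointer back. */
	dmem_cgroup_uncharge(buf->pool, buf->size);
}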

Since we can't share the region selection logic, we need to get the
region through some other means. Another thing I considered was to
return the region as part of the allocated buffer (through struct page
or folio), but those are lost across the calls, and dma_alloc_attrs()
only returns a void pointer anyway. So that's not doable without some
heavy rework, if it's a good idea at all.

So yeah, I went for the dumbest possible solution with the accessors,
hoping you could suggest a much smarter idea :)

Thanks,
Maxime

Signed-off-by: Maxime Ripard <mripard@kernel.org>
---
Maxime Ripard (12):
      cma: Register dmem region for each cma region
      cma: Provide accessor to cma dmem region
      dma: coherent: Register dmem region for each coherent region
      dma: coherent: Provide accessor to dmem region
      dma: contiguous: Provide accessor to dmem region
      dma: direct: Provide accessor to dmem region
      dma: Create default dmem region for DMA allocations
      dma: Provide accessor to dmem region
      dma-buf: Clear cgroup accounting on release
      dma-buf: cma: Account for allocations in dmem cgroup
      drm/gem: Add cgroup memory accounting
      media: videobuf2: Track buffer allocations through the dmem cgroup

 drivers/dma-buf/dma-buf.c                          |  7 ++++
 drivers/dma-buf/heaps/cma_heap.c                   | 18 ++++++++--
 drivers/gpu/drm/drm_gem.c                          |  5 +++
 drivers/gpu/drm/drm_gem_dma_helper.c               |  6 ++++
 .../media/common/videobuf2/videobuf2-dma-contig.c  | 19 +++++++++++
 include/drm/drm_device.h                           |  1 +
 include/drm/drm_gem.h                              |  2 ++
 include/linux/cma.h                                |  9 +++++
 include/linux/dma-buf.h                            |  5 +++
 include/linux/dma-direct.h                         |  2 ++
 include/linux/dma-map-ops.h                        | 32 ++++++++++++++++++
 include/linux/dma-mapping.h                        | 11 ++++++
 kernel/dma/coherent.c                              | 26 +++++++++++++++
 kernel/dma/direct.c                                |  8 +++++
 kernel/dma/mapping.c                               | 39 ++++++++++++++++++++++
 mm/cma.c                                           | 21 +++++++++++-
 mm/cma.h                                           |  3 ++
 17 files changed, 211 insertions(+), 3 deletions(-)
---
base-commit: 55a2aa61ba59c138bd956afe0376ec412a7004cf
change-id: 20250307-dmem-cgroups-73febced0989

Best regards,

Comments

Maxime Ripard March 10, 2025, 2:26 p.m. UTC | #1
Hi,

On Mon, Mar 10, 2025 at 03:16:53PM +0100, Christian König wrote:
> [Adding Ben since we are currently in the middle of a discussion
> regarding exactly that problem]
>
> Just for my understanding before I deep dive into the code: This uses
> a separate dmem cgroup and does not account against memcg, doesn't it?

Yes. The main rationale being that it doesn't always make sense to
register against memcg: a lot of devices are going to allocate from
dedicated chunks of memory that are either carved out from the main
memory allocator, or not under Linux supervision at all.

And if there's no way to make it consistent across drivers, it's not the
right tool.

Maxime
Robin Murphy March 10, 2025, 3:06 p.m. UTC | #2
On 2025-03-10 12:06 pm, Maxime Ripard wrote:
> In order to support any device using the GEM support, let's charge any
> GEM DMA allocation into the dmem cgroup.
> 
> Signed-off-by: Maxime Ripard <mripard@kernel.org>
> ---
>   drivers/gpu/drm/drm_gem.c            | 5 +++++
>   drivers/gpu/drm/drm_gem_dma_helper.c | 6 ++++++
>   include/drm/drm_device.h             | 1 +
>   include/drm/drm_gem.h                | 2 ++
>   4 files changed, 14 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
> index ee811764c3df4b4e9c377a66afd4967512ba2001..e04733cb49353cf3ff9672d883b106a083f80d86 100644
> --- a/drivers/gpu/drm/drm_gem.c
> +++ b/drivers/gpu/drm/drm_gem.c
> @@ -108,10 +108,11 @@ drm_gem_init(struct drm_device *dev)
>   	dev->vma_offset_manager = vma_offset_manager;
>   	drm_vma_offset_manager_init(vma_offset_manager,
>   				    DRM_FILE_PAGE_OFFSET_START,
>   				    DRM_FILE_PAGE_OFFSET_SIZE);
>   
> +
>   	return drmm_add_action(dev, drm_gem_init_release, NULL);
>   }
>   
>   /**
>    * drm_gem_object_init_with_mnt - initialize an allocated shmem-backed GEM
> @@ -973,10 +974,14 @@ drm_gem_release(struct drm_device *dev, struct drm_file *file_private)
>    * drm_gem_object_init().
>    */
>   void
>   drm_gem_object_release(struct drm_gem_object *obj)
>   {
> +
> +	if (obj->cgroup_pool_state)
> +		dmem_cgroup_uncharge(obj->cgroup_pool_state, obj->size);
> +
>   	if (obj->filp)
>   		fput(obj->filp);
>   
>   	drm_gem_private_object_fini(obj);
>   
> diff --git a/drivers/gpu/drm/drm_gem_dma_helper.c b/drivers/gpu/drm/drm_gem_dma_helper.c
> index 16988d316a6dc702310fa44c15c92dc67b82802b..6236feb67ddd6338f0f597a0606377e0352ca6ed 100644
> --- a/drivers/gpu/drm/drm_gem_dma_helper.c
> +++ b/drivers/gpu/drm/drm_gem_dma_helper.c
> @@ -104,10 +104,16 @@ __drm_gem_dma_create(struct drm_device *drm, size_t size, bool private)
>   	if (ret) {
>   		drm_gem_object_release(gem_obj);
>   		goto error;
>   	}
>   
> +	ret = dmem_cgroup_try_charge(dma_get_dmem_cgroup_region(drm->dev),
> +				     size,
> +				     &dma_obj->base.cgroup_pool_state, NULL);
> +	if (ret)
> +		goto error;

Doesn't that miss cleaning up gem_obj? However, surely you want the 
accounting before the allocation anyway, like in the other cases. 
Otherwise userspace is still able to allocate massive amounts of memory 
and incur some of the associated side-effects of that, it just doesn't 
get to keep said memory for very long :)
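
i.e. something along these lines (untested sketch; the helper name and
labels are invented rather than being the actual __drm_gem_dma_create()
body):

static struct drm_gem_dma_object *
example_gem_dma_create(struct drm_device *drm, size_t size)
{
	struct drm_gem_dma_object *dma_obj;
	int ret;

	dma_obj = kzalloc(sizeof(*dma_obj), GFP_KERNEL);
	if (!dma_obj)
		return ERR_PTR(-ENOMEM);

	/* Charge before any backing memory exists, so an over-limit
	 * process never gets to touch the allocator at all. If the
	 * charge fails there is nothing to release yet, which also
	 * avoids the missing drm_gem_object_release() in the error
	 * path above. */
	ret = dmem_cgroup_try_charge(dma_get_dmem_cgroup_region(drm->dev),
				     size, &dma_obj->base.cgroup_pool_state,
				     NULL);
	if (ret)
		goto err_free;

	ret = drm_gem_object_init(drm, &dma_obj->base, size);
	if (ret)
		goto err_uncharge;

	return dma_obj;

err_uncharge:
	dmem_cgroup_uncharge(dma_obj->base.cgroup_pool_state, size);
err_free:
	kfree(dma_obj);
	return ERR_PTR(ret);
}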

Thanks,
Robin.

> +
>   	return dma_obj;
>   
>   error:
>   	kfree(dma_obj);
>   	return ERR_PTR(ret);
> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> index c91f87b5242d7a499917eb4aeb6ca8350f856eb3..58987f39ba8718eb768f6261fb0a1fbf16b38549 100644
> --- a/include/drm/drm_device.h
> +++ b/include/drm/drm_device.h
> @@ -1,8 +1,9 @@
>   #ifndef _DRM_DEVICE_H_
>   #define _DRM_DEVICE_H_
>   
> +#include <linux/cgroup_dmem.h>
>   #include <linux/list.h>
>   #include <linux/kref.h>
>   #include <linux/mutex.h>
>   #include <linux/idr.h>
>   
> diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
> index fdae947682cd0b7b06db5e35e120f049a0f30179..95fe8ed48a26204020bb47d6074689829c410465 100644
> --- a/include/drm/drm_gem.h
> +++ b/include/drm/drm_gem.h
> @@ -430,10 +430,12 @@ struct drm_gem_object {
>   	 * @lru:
>   	 *
>   	 * The current LRU list that the GEM object is on.
>   	 */
>   	struct drm_gem_lru *lru;
> +
> +	struct dmem_cgroup_pool_state *cgroup_pool_state;
>   };
>   
>   /**
>    * DRM_GEM_FOPS - Default drm GEM file operations
>    *
>
Dave Airlie March 31, 2025, 8:43 p.m. UTC | #3
On Tue, 11 Mar 2025 at 00:26, Maxime Ripard <mripard@kernel.org> wrote:
>
> Hi,
>
> On Mon, Mar 10, 2025 at 03:16:53PM +0100, Christian König wrote:
> > [Adding Ben since we are currently in the middle of a discussion
> > regarding exactly that problem]
> >
> > Just for my understanding before I deep dive into the code: This uses
> > a separate dmem cgroup and does not account against memcg, doesn't it?
>
> Yes. The main rationale being that it doesn't always make sense to
> register against memcg: a lot of devices are going to allocate from
> dedicated chunks of memory that are either carved out from the main
> memory allocator, or not under Linux supervision at all.
>
> And if there's no way to make it consistent across drivers, it's not the
> right tool.
>

While I agree on that, if a user can cause a device driver to allocate
memory that memcg would otherwise account, then we have to interface
with memcg to account that memory.

The pathological case would be a single application wanting to use 90%
of RAM for device allocations, freeing it all, then using 90% of RAM
for normal usage. Creating a policy that allows that with both dmem
and memcg is difficult: if you say you can do 90% on both, then the
user can easily OOM the system.

Dave.
> Maxime
Christian König April 1, 2025, 11:03 a.m. UTC | #4
Am 31.03.25 um 22:43 schrieb Dave Airlie:
> On Tue, 11 Mar 2025 at 00:26, Maxime Ripard <mripard@kernel.org> wrote:
>> Hi,
>>
>> On Mon, Mar 10, 2025 at 03:16:53PM +0100, Christian König wrote:
>>> [Adding Ben since we are currently in the middle of a discussion
>>> regarding exactly that problem]
>>>
>>> Just for my understanding before I deep dive into the code: This uses
>>> a separate dmem cgroup and does not account against memcg, doesn't it?
>> Yes. The main rationale being that it doesn't always make sense to
>> register against memcg: a lot of devices are going to allocate from
>> dedicated chunks of memory that are either carved out from the main
>> memory allocator, or not under Linux supervision at all.
>>
>> And if there's no way to make it consistent across drivers, it's not the
>> right tool.
>>
> While I agree on that, if a user can cause a device driver to allocate
> memory that is also memory that memcg accounts, then we have to
> interface with memcg to account that memory.

This assumes that memcg should be in control of device-driver-allocated memory, which in some cases is intentionally not the case.

E.g. a server application which allocates buffers on behalf of clients gets a nice denial-of-service problem if we suddenly start to account those buffers to it.

That was one of the reasons why my OOM killer improvement patches never landed (e.g. you could trivially kill X/Wayland or systemd with that).

> The pathological case would be a single application wanting to use 90%
> of RAM for device allocations, freeing it all, then using 90% of RAM
> for normal usage. How to create a policy that would allow that with
> dmem and memcg is difficult, since if you say you can do 90% on both
> then the user can easily OOM the system.

Yeah, completely agree.

That's why the per-device GTT size limit we already have and the global 50% TTM limit don't work as expected. People also didn't like those limits, and because of that we even have flags to circumvent them, see AMDGPU_GEM_CREATE_PREEMPTIBLE and TTM_TT_FLAG_EXTERNAL.

Another problem is: when, and to which process, do we account things when eviction happens? For example, process A wants to use VRAM that process B currently occupies. In this case we would give both processes a mix of VRAM and system memory, but how do we account for that?

If we account it to process B, then process A can fail because of process B's memcg limit. This creates a situation which is absolutely not traceable for a system administrator.

But process A never asked for system memory in the first place, so we can't account the memory to it either, or we would make the process responsible for something it didn't do.

There are good arguments for all of the solutions, and there are a couple of blockers which rule out one solution or another for a certain use case. To summarize, I think the whole situation is a complete mess.

Maybe there is no single solution, and we need to make it somehow configurable?

Regards,
Christian.

>
> Dave.
>> Maxime
Maxime Ripard April 3, 2025, 3:47 p.m. UTC | #5
On Thu, Apr 03, 2025 at 09:39:52AM +0200, Christian König wrote:
> > For the UMA GPU case where there is no device memory or eviction
> > problem, perhaps a configurable option to just say account memory in
> > memcg for all allocations done by this process, and state yes you can
> > work around it with allocation servers or whatever but the behaviour
> > for well behaved things is at least somewhat defined.
> 
> We can have that as a workaround, but I think we should approach that
> differently.
> 
> With upcoming CXL even coherent device memory is exposed to the core
> OS as NUMA memory with just a high latency.
> 
> So both in the CXL and UMA case it actually doesn't make sense to
> allocate the memory through the driver interfaces any more. With
> AMDGPU for example we are just replicating mbind()/madvise() within
> the driver.
> 
> Instead what the DRM subsystem should aim for is to allocate memory
> using the normal core OS functionality and then import it into the
> driver.
> 
> AMD, NVidia and Intel have HMM working for quite a while now but it
> has some limitations, especially on the performance side.
> 
> So for AMDGPU we are currently evaluating udmabuf as an alternative. That
> seems to be working fine with different NUMA nodes, is perfectly memcg
> accounted and gives you a DMA-buf which can be imported everywhere.
> 
> The only show stopper might be the allocation performance, but even if
> that's the case I think the ongoing folio work will properly resolve
> that.

I mean, no, the showstopper is that udmabuf assumes you have an IOMMU
in front of every device doing DMA, which is absolutely not true on
!x86 platforms.

It might be true for all GPUs, but it certainly isn't for display
controllers, and it isn't for codecs, ISPs, or cameras either.

And then there's the other assumption that all memory is under the
memory allocator's control, which isn't the case on most recent
platforms either.

We *need* to take into account CMA, all the carved-out, device-specific
memory regions, and the memory regions that aren't even under Linux
supervision at all, like protected memory that is typically handled by
the firmware, where all you get is a dma-buf.

Saying that this is how you want to work around it on AMD is
absolutely fine, but DRM as a whole should certainly not aim for that,
because it can't.

Maxime
Christian König April 4, 2025, 8:47 a.m. UTC | #6
Hi Maxime,

Am 03.04.25 um 17:47 schrieb Maxime Ripard:
> On Thu, Apr 03, 2025 at 09:39:52AM +0200, Christian König wrote:
>>> For the UMA GPU case where there is no device memory or eviction
>>> problem, perhaps a configurable option to just say account memory in
>>> memcg for all allocations done by this process, and state yes you can
>>> work around it with allocation servers or whatever but the behaviour
>>> for well behaved things is at least somewhat defined.
>> We can have that as a workaround, but I think we should approach that
>> differently.
>>
>> With upcoming CXL even coherent device memory is exposed to the core
>> OS as NUMA memory with just a high latency.
>>
>> So both in the CXL and UMA case it actually doesn't make sense to
>> allocate the memory through the driver interfaces any more. With
>> AMDGPU for example we are just replicating mbind()/madvise() within
>> the driver.
>>
>> Instead what the DRM subsystem should aim for is to allocate memory
>> using the normal core OS functionality and then import it into the
>> driver.
>>
>> AMD, NVidia and Intel have HMM working for quite a while now but it
>> has some limitations, especially on the performance side.
>>
>> So for AMDGPU we are currently evaluating udmabuf as an alternative. That
>> seems to be working fine with different NUMA nodes, is perfectly memcg
>> accounted and gives you a DMA-buf which can be imported everywhere.
>>
>> The only show stopper might be the allocation performance, but even if
>> that's the case I think the ongoing folio work will properly resolve
>> that.
> I mean, no, the showstopper to that is that using udmabuf has the
> assumption that you have an IOMMU for every device doing DMA, which is
> absolutely not true on !x86 platforms.
>
> It might be true for all GPUs, but it certainly isn't for display
> controllers, and it's not either for codecs, ISPs, and cameras.
>
> And then there's the other assumption that all memory is under the
> memory allocator control, which isn't the case on most recent platforms
> either.
>
> We *need* to take CMA into account there, all the carved-out, device
> specific memory regions, and the memory regions that aren't even under
> Linux supervision like protected memory that is typically handled by the
> firmware and all you get is a dma-buf.
>
> Saying that it's how you want to workaround it on AMD is absolutely
> fine, but DRM as a whole should certainly not aim for that, because it
> can't.

You bring up a bunch of good points here, but it sounds like you misunderstood me a bit.

I'm certainly *not* saying that we should push for udmabuf for everything; that is clearly use-case specific.

For use cases like CMA or protected carve-outs, the question of what to do doesn't even arise in the first place.

When you have CMA, which dynamically steals memory from the core OS, then of course it should be accounted to memcg.

When you have a carve-out which the core OS memory management doesn't even know about, then it should certainly be handled by dmem.

The problematic use cases are the ones where a buffer can sometimes be backed by system memory and sometimes by something special. For those we don't have a good approach yet, since every option seems to have a drawback for some use case.

Regards,
Christian.

>
> Maxime