mbox series

[RFC,v3,0/8] Proposal for a GPU cgroup controller

Message ID 20220309165222.2843651-1-tjmercier@google.com
Headers show
Series Proposal for a GPU cgroup controller | expand

Message

T.J. Mercier March 9, 2022, 4:52 p.m. UTC
This patch series revisits the proposal for a GPU cgroup controller to
track and limit memory allocations by various device/allocator
subsystems. The patch series also contains a simple prototype to
illustrate how Android intends to implement DMA-BUF allocator
attribution using the GPU cgroup controller. The prototype does not
include resource limit enforcements.

Changelog:
v3:
Remove Upstreaming Plan from gpu-cgroup.rst per John Stultz

Use more common dual author commit message format per John Stultz

Remove android from binder changes title per Todd Kjos

Add a kselftest for this new behavior per Greg Kroah-Hartman

Include details on behavior for all combinations of kernel/userspace
versions in changelog (thanks Suren Baghdasaryan) per Greg Kroah-Hartman.

Fix pid and uid types in binder UAPI header

v2:
See the previous revision of this change submitted by Hridya Valsaraju
at: https://lore.kernel.org/all/20220115010622.3185921-1-hridya@google.com/

Move dma-buf cgroup charge transfer from a dma_buf_op defined by every
heap to a single dma-buf function for all heaps per Daniel Vetter and
Christian König. Pointers to struct gpucg and struct gpucg_device
tracking the current associations were added to the dma_buf struct to
achieve this.

Fix incorrect Kconfig help section indentation per Randy Dunlap.

History of the GPU cgroup controller
====================================
The GPU/DRM cgroup controller came into being when a consensus[1]
was reached that the resources it tracked were unsuitable to be integrated
into memcg. Originally, the proposed controller was specific to the DRM
subsystem and was intended to track GEM buffers and GPU-specific
resources[2]. In order to help establish a unified memory accounting model
for all GPU and all related subsystems, Daniel Vetter put forth a
suggestion to move it out of the DRM subsystem so that it can be used by
other DMA-BUF exporters as well[3]. This RFC proposes an interface that
does the same.

[1]: https://patchwork.kernel.org/project/dri-devel/cover/20190501140438.9506-1-brian.welty@intel.com/#22624705
[2]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.com/
[3]: https://lore.kernel.org/amd-gfx/YCVOl8%2F87bqRSQei@phenom.ffwll.local/

Hridya Valsaraju (5):
  gpu: rfc: Proposal for a GPU cgroup controller
  cgroup: gpu: Add a cgroup controller for allocator attribution of GPU
    memory
  dmabuf: heaps: export system_heap buffers with GPU cgroup charging
  dmabuf: Add gpu cgroup charge transfer function
  binder: Add a buffer flag to relinquish ownership of fds

T.J. Mercier (3):
  dmabuf: Use the GPU cgroup charge/uncharge APIs
  binder: use __kernel_pid_t and __kernel_uid_t for userspace
  selftests: Add binder cgroup gpu memory transfer test

 Documentation/gpu/rfc/gpu-cgroup.rst          | 183 +++++++
 Documentation/gpu/rfc/index.rst               |   4 +
 drivers/android/binder.c                      |  26 +
 drivers/dma-buf/dma-buf.c                     | 100 ++++
 drivers/dma-buf/dma-heap.c                    |  27 +
 drivers/dma-buf/heaps/system_heap.c           |   3 +
 include/linux/cgroup_gpu.h                    | 127 +++++
 include/linux/cgroup_subsys.h                 |   4 +
 include/linux/dma-buf.h                       |  22 +-
 include/linux/dma-heap.h                      |  11 +
 include/uapi/linux/android/binder.h           |   5 +-
 init/Kconfig                                  |   7 +
 kernel/cgroup/Makefile                        |   1 +
 kernel/cgroup/gpu.c                           | 304 +++++++++++
 .../selftests/drivers/android/binder/Makefile |   8 +
 .../drivers/android/binder/binder_util.c      | 254 +++++++++
 .../drivers/android/binder/binder_util.h      |  32 ++
 .../selftests/drivers/android/binder/config   |   4 +
 .../binder/test_dmabuf_cgroup_transfer.c      | 480 ++++++++++++++++++
 19 files changed, 1598 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/gpu/rfc/gpu-cgroup.rst
 create mode 100644 include/linux/cgroup_gpu.h
 create mode 100644 kernel/cgroup/gpu.c
 create mode 100644 tools/testing/selftests/drivers/android/binder/Makefile
 create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.c
 create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.h
 create mode 100644 tools/testing/selftests/drivers/android/binder/config
 create mode 100644 tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_transfer.c

Comments

Todd Kjos March 10, 2022, 7:33 p.m. UTC | #1
On Wed, Mar 9, 2022 at 8:52 AM T.J. Mercier <tjmercier@google.com> wrote:
>
> The kernel interface should use types that the kernel defines instead of
> pid_t and uid_t, whose definiton is owned by libc. This fixes the header
> so that it can be included without first including sys/types.h.
>
> Signed-off-by: T.J. Mercier <tjmercier@google.com>
> ---
>  include/uapi/linux/android/binder.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/android/binder.h b/include/uapi/linux/android/binder.h
> index 169fd5069a1a..aa28454dbca3 100644
> --- a/include/uapi/linux/android/binder.h
> +++ b/include/uapi/linux/android/binder.h
> @@ -289,8 +289,8 @@ struct binder_transaction_data {
>
>         /* General information about the transaction. */
>         __u32           flags;
> -       pid_t           sender_pid;
> -       uid_t           sender_euid;
> +       __kernel_pid_t  sender_pid;
> +       __kernel_uid_t  sender_euid;

Are we guaranteed that this does not affect the UAPI at all? Userspace
code using this definition will have to run with kernels using the old
definition and visa-versa.

>         binder_size_t   data_size;      /* number of bytes of data */
>         binder_size_t   offsets_size;   /* number of bytes of offsets */
>
> --
> 2.35.1.616.g0bdcbb4464-goog
>
T.J. Mercier March 14, 2022, 11:45 p.m. UTC | #2
On Thu, Mar 10, 2022 at 11:33 AM Todd Kjos <tkjos@google.com> wrote:
>
> On Wed, Mar 9, 2022 at 8:52 AM T.J. Mercier <tjmercier@google.com> wrote:
> >
> > The kernel interface should use types that the kernel defines instead of
> > pid_t and uid_t, whose definiton is owned by libc. This fixes the header
> > so that it can be included without first including sys/types.h.
> >
> > Signed-off-by: T.J. Mercier <tjmercier@google.com>
> > ---
> >  include/uapi/linux/android/binder.h | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/uapi/linux/android/binder.h b/include/uapi/linux/android/binder.h
> > index 169fd5069a1a..aa28454dbca3 100644
> > --- a/include/uapi/linux/android/binder.h
> > +++ b/include/uapi/linux/android/binder.h
> > @@ -289,8 +289,8 @@ struct binder_transaction_data {
> >
> >         /* General information about the transaction. */
> >         __u32           flags;
> > -       pid_t           sender_pid;
> > -       uid_t           sender_euid;
> > +       __kernel_pid_t  sender_pid;
> > +       __kernel_uid_t  sender_euid;
>
> Are we guaranteed that this does not affect the UAPI at all? Userspace
> code using this definition will have to run with kernels using the old
> definition and visa-versa.

A standards compliant userspace should be expecting a signed integer
type here. So the only way I can think userspace would be affected is
if:
1) pid_t is a long AND
2) sizeof(long) > sizeof(int) AND
3) Consumers of the pid_t definition actually attempt to mutate the
result to make use of extra bits in the variable (which are not there)

This seems extremely unlikely. For instance just on the topic of the
first item, all of the C library implementations with pid_t
definitions linked here use an int, except for Bionic which typdefs
pid_t to __kernel_pid_t and Sortix which uses long.
https://wiki.osdev.org/C_Library

However I would argue this is already broken and should count as a bug
fix since I can't do this:

$ cat binder_include.c ; gcc binder_include.c
#include <linux/android/binder.h>
int main() {}
In file included from binder_include.c:1:
/usr/include/linux/android/binder.h:291:9: error: unknown type name ‘pid_t’
  291 |         pid_t           sender_pid;
      |         ^~~~~
/usr/include/linux/android/binder.h:292:9: error: unknown type name ‘uid_t’
  292 |         uid_t           sender_euid;
      |         ^~~~~

This is also the only occurrence of pid_t in all of
include/uapi/linux. All 40+ other uses are __kernel_pid_t, and I don't
see why the binder header should be different.


>
> >         binder_size_t   data_size;      /* number of bytes of data */
> >         binder_size_t   offsets_size;   /* number of bytes of offsets */
> >
> > --
> > 2.35.1.616.g0bdcbb4464-goog
> >
Todd Kjos March 15, 2022, 12:11 a.m. UTC | #3
On Mon, Mar 14, 2022 at 4:45 PM T.J. Mercier <tjmercier@google.com> wrote:
>
> On Thu, Mar 10, 2022 at 11:33 AM Todd Kjos <tkjos@google.com> wrote:
> >
> > On Wed, Mar 9, 2022 at 8:52 AM T.J. Mercier <tjmercier@google.com> wrote:
> > >
> > > The kernel interface should use types that the kernel defines instead of
> > > pid_t and uid_t, whose definiton is owned by libc. This fixes the header
> > > so that it can be included without first including sys/types.h.
> > >
> > > Signed-off-by: T.J. Mercier <tjmercier@google.com>
> > > ---
> > >  include/uapi/linux/android/binder.h | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/uapi/linux/android/binder.h b/include/uapi/linux/android/binder.h
> > > index 169fd5069a1a..aa28454dbca3 100644
> > > --- a/include/uapi/linux/android/binder.h
> > > +++ b/include/uapi/linux/android/binder.h
> > > @@ -289,8 +289,8 @@ struct binder_transaction_data {
> > >
> > >         /* General information about the transaction. */
> > >         __u32           flags;
> > > -       pid_t           sender_pid;
> > > -       uid_t           sender_euid;
> > > +       __kernel_pid_t  sender_pid;
> > > +       __kernel_uid_t  sender_euid;
> >
> > Are we guaranteed that this does not affect the UAPI at all? Userspace
> > code using this definition will have to run with kernels using the old
> > definition and visa-versa.
>
> A standards compliant userspace should be expecting a signed integer
> type here. So the only way I can think userspace would be affected is
> if:
> 1) pid_t is a long AND
> 2) sizeof(long) > sizeof(int) AND
> 3) Consumers of the pid_t definition actually attempt to mutate the
> result to make use of extra bits in the variable (which are not there)
>
> This seems extremely unlikely. For instance just on the topic of the
> first item, all of the C library implementations with pid_t
> definitions linked here use an int, except for Bionic which typdefs
> pid_t to __kernel_pid_t and Sortix which uses long.
> https://wiki.osdev.org/C_Library
>
> However I would argue this is already broken and should count as a bug
> fix since I can't do this:
>
> $ cat binder_include.c ; gcc binder_include.c
> #include <linux/android/binder.h>
> int main() {}
> In file included from binder_include.c:1:
> /usr/include/linux/android/binder.h:291:9: error: unknown type name ‘pid_t’
>   291 |         pid_t           sender_pid;
>       |         ^~~~~
> /usr/include/linux/android/binder.h:292:9: error: unknown type name ‘uid_t’
>   292 |         uid_t           sender_euid;
>       |         ^~~~~
>
> This is also the only occurrence of pid_t in all of
> include/uapi/linux. All 40+ other uses are __kernel_pid_t, and I don't
> see why the binder header should be different.

It looks like those other cases used to be pid_t, but were changed to
__kernel_pid_t.

Acked-by: Todd Kjos <tkjos@google.com>

>
>
> >
> > >         binder_size_t   data_size;      /* number of bytes of data */
> > >         binder_size_t   offsets_size;   /* number of bytes of offsets */
> > >
> > > --
> > > 2.35.1.616.g0bdcbb4464-goog
> > >
T.J. Mercier March 15, 2022, 12:11 a.m. UTC | #4
On Wed, Mar 9, 2022 at 1:31 PM Shuah Khan <skhan@linuxfoundation.org> wrote:
>
> On 3/9/22 9:52 AM, T.J. Mercier wrote:
> > This test verifies that the cgroup GPU memory charge is transferred
> > correctly when a dmabuf is passed between processes in two different
> > cgroups and the sender specifies BINDER_BUFFER_FLAG_SENDER_NO_NEED in the
> > binder transaction data containing the dmabuf file descriptor.
> >
> > Signed-off-by: T.J. Mercier <tjmercier@google.com>
> > ---
> >   .../selftests/drivers/android/binder/Makefile |   8 +
> >   .../drivers/android/binder/binder_util.c      | 254 +++++++++
> >   .../drivers/android/binder/binder_util.h      |  32 ++
> >   .../selftests/drivers/android/binder/config   |   4 +
> >   .../binder/test_dmabuf_cgroup_transfer.c      | 480 ++++++++++++++++++
> >   5 files changed, 778 insertions(+)
> >   create mode 100644 tools/testing/selftests/drivers/android/binder/Makefile
> >   create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.c
> >   create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.h
> >   create mode 100644 tools/testing/selftests/drivers/android/binder/config
> >   create mode 100644 tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_transfer.c
> >
> > diff --git a/tools/testing/selftests/drivers/android/binder/Makefile b/tools/testing/selftests/drivers/android/binder/Makefile
> > new file mode 100644
> > index 000000000000..726439d10675
> > --- /dev/null
> > +++ b/tools/testing/selftests/drivers/android/binder/Makefile
> > @@ -0,0 +1,8 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +CFLAGS += -Wall
> > +
>
> Does this test inteded to be built on all architectures? Is arch
> check necessary here?

I think this test should be runnable on any architecture, so long as
the correct kernel configs (for binder and cgroups) are enabled. I
have tested it on arm64 and x86-64.

>
> Also does this test require root previleges - I see mount and
> unmount operations in the test. If so add root check and skip
> if non-root user runs the test.

Yes, currently it does. Thanks, I will add this check at the location
you have pointed out in the TEST function.

>
> > +TEST_GEN_PROGS = test_dmabuf_cgroup_transfer
> > +
> > +include ../../../lib.mk
> > +
> > +$(OUTPUT)/test_dmabuf_cgroup_transfer: ../../../cgroup/cgroup_util.c binder_util.c
> > diff --git a/tools/testing/selftests/drivers/android/binder/binder_util.c b/tools/testing/selftests/drivers/android/binder/binder_util.c
> > new file mode 100644
> > index 000000000000..c9dcf5b9d42b
> > --- /dev/null
> > +++ b/tools/testing/selftests/drivers/android/binder/binder_util.c
> > @@ -0,0 +1,254 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#include "binder_util.h"
> > +
> > +#include <errno.h>
> > +#include <fcntl.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <unistd.h>
> > +#include <sys/ioctl.h>
> > +#include <sys/mman.h>
> > +#include <sys/mount.h>
> > +
> > +#include <linux/limits.h>
> > +#include <linux/android/binder.h>
> > +#include <linux/android/binderfs.h>
> > +
> > +static const size_t BINDER_MMAP_SIZE = 64 * 1024;
> > +
> > +static void binderfs_unmount(const char *mountpoint)
> > +{
> > +     if (umount2(mountpoint, MNT_DETACH))
> > +             fprintf(stderr, "Failed to unmount binderfs at %s: %s\n",
> > +                     mountpoint, strerror(errno));
> > +     else
> > +             fprintf(stderr, "Binderfs unmounted: %s\n", mountpoint);
> > +
> > +     if (rmdir(mountpoint))
> > +             fprintf(stderr, "Failed to remove binderfs mount %s: %s\n",
> > +                     mountpoint, strerror(errno));
> > +     else
> > +             fprintf(stderr, "Binderfs mountpoint destroyed: %s\n", mountpoint);
>
> Does umount require root previleges? Same commment as above about
> non-root user running test.
>
>
> > +}
> > +
> > +struct binderfs_ctx create_binderfs(const char *name)
> > +{
> > +     int fd, ret, saved_errno;
> > +     struct binderfs_device device = { 0 };
> > +     struct binderfs_ctx ctx = { 0 };
> > +
> > +     /*
> > +      * P_tmpdir is set to "/tmp/" on Android platforms where Binder is most
> > +      * commonly used, but this path does not actually exist on Android. We
> > +      * will first try using "/data/local/tmp" and fallback to P_tmpdir if
> > +      * that fails for non-Android platforms.
> > +      */
> > +     static const char tmpdir[] = "/data/local/tmp";
> > +     static const size_t MAX_TMPDIR_SIZE =
> > +             sizeof(tmpdir) > sizeof(P_tmpdir) ?
> > +             sizeof(tmpdir) : sizeof(P_tmpdir);
> > +     static const char template[] = "/binderfs_XXXXXX";
> > +
> > +     char *mkdtemp_result;
> > +     char binderfs_mntpt[MAX_TMPDIR_SIZE + sizeof(template)];
> > +     char device_path[MAX_TMPDIR_SIZE + sizeof(template) + BINDERFS_MAX_NAME];
> > +
> > +     snprintf(binderfs_mntpt, sizeof(binderfs_mntpt), "%s%s", tmpdir, template);
> > +
> > +     mkdtemp_result = mkdtemp(binderfs_mntpt);
> > +     if (mkdtemp_result == NULL) {
> > +             fprintf(stderr, "Failed to create binderfs mountpoint at %s: %s.\n",
> > +                     binderfs_mntpt, strerror(errno));
> > +             fprintf(stderr, "Trying fallback mountpoint...\n");
> > +             snprintf(binderfs_mntpt, sizeof(binderfs_mntpt), "%s%s", P_tmpdir, template);
> > +             if (mkdtemp(binderfs_mntpt) == NULL) {
> > +                     fprintf(stderr, "Failed to create binderfs mountpoint at %s: %s\n",
> > +                     binderfs_mntpt, strerror(errno));
> > +                     return ctx;
> > +             }
> > +     }
> > +     fprintf(stderr, "Binderfs mountpoint created at %s\n", binderfs_mntpt);
>
> Does mount require root previleges? Same commment as above about
> non-root user running test.
>
> > +
> > +     if (mount(NULL, binderfs_mntpt, "binder", 0, 0)) {
> > +             perror("Could not mount binderfs");
> > +             rmdir(binderfs_mntpt);
> > +             return ctx;
> > +     }
> > +     fprintf(stderr, "Binderfs mounted at %s\n", binderfs_mntpt);
> > +
> > +     strncpy(device.name, name, sizeof(device.name));
> > +     snprintf(device_path, sizeof(device_path), "%s/binder-control", binderfs_mntpt);
> > +     fd = open(device_path, O_RDONLY | O_CLOEXEC);
> > +     if (!fd) {
> > +             perror("Failed to open binder-control device");
> > +             binderfs_unmount(binderfs_mntpt);
> > +             return ctx;
> > +     }
> > +
> > +     ret = ioctl(fd, BINDER_CTL_ADD, &device);
> > +     saved_errno = errno;
> > +     close(fd);
> > +     errno = saved_errno;
> > +     if (ret) {
> > +             perror("Failed to allocate new binder device");
> > +             binderfs_unmount(binderfs_mntpt);
> > +             return ctx;
> > +     }
> > +
> > +     fprintf(stderr, "Allocated new binder device with major %d, minor %d, and name %s at %s\n",
> > +             device.major, device.minor, device.name, binderfs_mntpt);
> > +
> > +     ctx.name = strdup(name);
> > +     ctx.mountpoint = strdup(binderfs_mntpt);
> > +     return ctx;
> > +}
> > +
> > +void destroy_binderfs(struct binderfs_ctx *ctx)
> > +{
> > +     char path[PATH_MAX];
> > +
> > +     snprintf(path, sizeof(path), "%s/%s", ctx->mountpoint, ctx->name);
> > +
> > +     if (unlink(path))
> > +             fprintf(stderr, "Failed to unlink binder device %s: %s\n", path, strerror(errno));
> > +     else
> > +             fprintf(stderr, "Destroyed binder %s at %s\n", ctx->name, ctx->mountpoint);
> > +
> > +     binderfs_unmount(ctx->mountpoint);
> > +
> > +     free(ctx->name);
> > +     free(ctx->mountpoint);
> > +}
> > +
> > +struct binder_ctx open_binder(struct binderfs_ctx *bfs_ctx)
> > +{
> > +     struct binder_ctx ctx = {.fd = -1, .memory = NULL};
> > +     char path[PATH_MAX];
> > +
> > +     snprintf(path, sizeof(path), "%s/%s", bfs_ctx->mountpoint, bfs_ctx->name);
> > +     ctx.fd = open(path, O_RDWR | O_NONBLOCK | O_CLOEXEC);
> > +     if (ctx.fd < 0) {
> > +             fprintf(stderr, "Error opening binder device %s: %s\n", path, strerror(errno));
>
> Does this require root previleges?
>
> > +             return ctx;
> > +     }
> > +
> > +     ctx.memory = mmap(NULL, BINDER_MMAP_SIZE, PROT_READ, MAP_SHARED, ctx.fd, 0);
> > +     if (ctx.memory == NULL) {
> > +             perror("Error mapping binder memory");
> > +             close(ctx.fd);
> > +             ctx.fd = -1;
> > +     }
> > +
> > +     return ctx;
> > +}
> > +
> > +void close_binder(struct binder_ctx *ctx)
> > +{
> > +     if (munmap(ctx->memory, BINDER_MMAP_SIZE))
> > +             perror("Failed to unmap binder memory");
> > +     ctx->memory = NULL;
> > +
> > +     if (close(ctx->fd))
> > +             perror("Failed to close binder");
> > +     ctx->fd = -1;
> > +}
> > +
> > +int become_binder_context_manager(int binder_fd)
> > +{
> > +     return ioctl(binder_fd, BINDER_SET_CONTEXT_MGR, 0);
> > +}
> > +
> > +int do_binder_write_read(int binder_fd, void *writebuf, binder_size_t writesize,
> > +                      void *readbuf, binder_size_t readsize)
> > +{
> > +     int err;
> > +     struct binder_write_read bwr = {
> > +             .write_buffer = (binder_uintptr_t)writebuf,
> > +             .write_size = writesize,
> > +             .read_buffer = (binder_uintptr_t)readbuf,
> > +             .read_size = readsize
> > +     };
> > +
> > +     do {
> > +             if (ioctl(binder_fd, BINDER_WRITE_READ, &bwr) >= 0)
> > +                     err = 0;
> > +             else
> > +                     err = -errno;
> > +     } while (err == -EINTR);
> > +
> > +     if (err < 0) {
> > +             perror("BINDER_WRITE_READ");
> > +             return -1;
> > +     }
> > +
> > +     if (bwr.write_consumed < writesize) {
> > +             fprintf(stderr, "Binder did not consume full write buffer %llu %llu\n",
> > +                     bwr.write_consumed, writesize);
> > +             return -1;
> > +     }
> > +
> > +     return bwr.read_consumed;
> > +}
> > +
> > +static const char *reply_string(int cmd)
> > +{
> > +     switch (cmd) {
> > +     case BR_ERROR:
> > +             return("BR_ERROR");
> > +     case BR_OK:
> > +             return("BR_OK");
> > +     case BR_TRANSACTION_SEC_CTX:
> > +             return("BR_TRANSACTION_SEC_CTX");
> > +     case BR_TRANSACTION:
> > +             return("BR_TRANSACTION");
> > +     case BR_REPLY:
> > +             return("BR_REPLY");
> > +     case BR_ACQUIRE_RESULT:
> > +             return("BR_ACQUIRE_RESULT");
> > +     case BR_DEAD_REPLY:
> > +             return("BR_DEAD_REPLY");
> > +     case BR_TRANSACTION_COMPLETE:
> > +             return("BR_TRANSACTION_COMPLETE");
> > +     case BR_INCREFS:
> > +             return("BR_INCREFS");
> > +     case BR_ACQUIRE:
> > +             return("BR_ACQUIRE");
> > +     case BR_RELEASE:
> > +             return("BR_RELEASE");
> > +     case BR_DECREFS:
> > +             return("BR_DECREFS");
> > +     case BR_ATTEMPT_ACQUIRE:
> > +             return("BR_ATTEMPT_ACQUIRE");
> > +     case BR_NOOP:
> > +             return("BR_NOOP");
> > +     case BR_SPAWN_LOOPER:
> > +             return("BR_SPAWN_LOOPER");
> > +     case BR_FINISHED:
> > +             return("BR_FINISHED");
> > +     case BR_DEAD_BINDER:
> > +             return("BR_DEAD_BINDER");
> > +     case BR_CLEAR_DEATH_NOTIFICATION_DONE:
> > +             return("BR_CLEAR_DEATH_NOTIFICATION_DONE");
> > +     case BR_FAILED_REPLY:
> > +             return("BR_FAILED_REPLY");
> > +     case BR_FROZEN_REPLY:
> > +             return("BR_FROZEN_REPLY");
> > +     case BR_ONEWAY_SPAM_SUSPECT:
> > +             return("BR_ONEWAY_SPAM_SUSPECT");
> > +     default:
> > +             return("Unknown");
> > +     };
> > +}
> > +
> > +int expect_binder_reply(int32_t actual, int32_t expected)
> > +{
> > +     if (actual != expected) {
> > +             fprintf(stderr, "Expected %s but received %s\n",
> > +                     reply_string(expected), reply_string(actual));
> > +             return -1;
> > +     }
> > +     return 0;
> > +}
> > +
> > diff --git a/tools/testing/selftests/drivers/android/binder/binder_util.h b/tools/testing/selftests/drivers/android/binder/binder_util.h
> > new file mode 100644
> > index 000000000000..807f5abe987e
> > --- /dev/null
> > +++ b/tools/testing/selftests/drivers/android/binder/binder_util.h
> > @@ -0,0 +1,32 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +#ifndef SELFTEST_BINDER_UTIL_H
> > +#define SELFTEST_BINDER_UTIL_H
> > +
> > +#include <stdint.h>
> > +
> > +#include <linux/android/binder.h>
> > +
> > +struct binderfs_ctx {
> > +     char *name;
> > +     char *mountpoint;
> > +};
> > +
> > +struct binder_ctx {
> > +     int fd;
> > +     void *memory;
> > +};
> > +
> > +struct binderfs_ctx create_binderfs(const char *name);
> > +void destroy_binderfs(struct binderfs_ctx *ctx);
> > +
> > +struct binder_ctx open_binder(struct binderfs_ctx *bfs_ctx);
> > +void close_binder(struct binder_ctx *ctx);
> > +
> > +int become_binder_context_manager(int binder_fd);
> > +
> > +int do_binder_write_read(int binder_fd, void *writebuf, binder_size_t writesize,
> > +                      void *readbuf, binder_size_t readsize);
> > +
> > +int expect_binder_reply(int32_t actual, int32_t expected);
> > +#endif
> > diff --git a/tools/testing/selftests/drivers/android/binder/config b/tools/testing/selftests/drivers/android/binder/config
> > new file mode 100644
> > index 000000000000..fcc5f8f693b3
> > --- /dev/null
> > +++ b/tools/testing/selftests/drivers/android/binder/config
> > @@ -0,0 +1,4 @@
> > +CONFIG_CGROUP_GPU=y
> > +CONFIG_ANDROID=y
> > +CONFIG_ANDROID_BINDERFS=y
> > +CONFIG_ANDROID_BINDER_IPC=y
> > diff --git a/tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_transfer.c b/tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_transfer.c
> > new file mode 100644
> > index 000000000000..9b952ab401cc
> > --- /dev/null
> > +++ b/tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_transfer.c
> > @@ -0,0 +1,480 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * This test verifies that the cgroup GPU memory charge is transferred correctly
> > + * when a dmabuf is passed between processes in two different cgroups and the
> > + * sender specifies BINDER_BUFFER_FLAG_SENDER_NO_NEED in the binder transaction
> > + * data containing the dmabuf file descriptor.
> > + *
> > + * The gpu_cgroup_dmabuf_transfer test function becomes the binder context
> > + * manager, then forks a child who initiates a transaction with the context
> > + * manager by specifying a target of 0. The context manager reply contains a
> > + * dmabuf file descriptor which was allocated by the gpu_cgroup_dmabuf_transfer
> > + * test function, but should be charged to the child cgroup after the binder
> > + * transaction.
> > + */
> > +
> > +#include <errno.h>
> > +#include <fcntl.h>
> > +#include <stddef.h>
> > +#include <stdint.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <sys/epoll.h>
> > +#include <sys/ioctl.h>
> > +#include <sys/types.h>
> > +#include <sys/wait.h>
> > +
> > +#include "binder_util.h"
> > +#include "../../../cgroup/cgroup_util.h"
> > +#include "../../../kselftest.h"
> > +#include "../../../kselftest_harness.h"
> > +
> > +#include <linux/limits.h>
> > +#include <linux/dma-heap.h>
> > +#include <linux/android/binder.h>
> > +
> > +#define UNUSED(x) ((void)(x))
> > +
> > +static const unsigned int BINDER_CODE = 8675309; /* Any number will work here */
> > +
> > +struct cgroup_ctx {
> > +     char *root;
> > +     char *source;
> > +     char *dest;
> > +};
> > +
> > +void destroy_cgroups(struct __test_metadata *_metadata, struct cgroup_ctx *ctx)
> > +{
> > +     if (ctx->source != NULL) {
> > +             TH_LOG("Destroying cgroup: %s", ctx->source);
> > +             rmdir(ctx->source);
> > +             free(ctx->source);
> > +     }
> > +
> > +     if (ctx->dest != NULL) {
> > +             TH_LOG("Destroying cgroup: %s", ctx->dest);
> > +             rmdir(ctx->dest);
> > +             free(ctx->dest);
> > +     }
> > +
> > +     free(ctx->root);
> > +     ctx->root = ctx->source = ctx->dest = NULL;
> > +}
> > +
> > +struct cgroup_ctx create_cgroups(struct __test_metadata *_metadata)
> > +{
> > +     struct cgroup_ctx ctx = {0};
> > +     char root[PATH_MAX], *tmp;
> > +     static const char template[] = "/gpucg_XXXXXX";
> > +
> > +     if (cg_find_unified_root(root, sizeof(root))) {
> > +             TH_LOG("Could not find cgroups root");
> > +             return ctx;
> > +     }
> > +
> > +     if (cg_read_strstr(root, "cgroup.controllers", "gpu")) {
> > +             TH_LOG("Could not find GPU controller");
> > +             return ctx;
> > +     }
> > +
> > +     if (cg_write(root, "cgroup.subtree_control", "+gpu")) {
> > +             TH_LOG("Could not enable GPU controller");
> > +             return ctx;
> > +     }
> > +
> > +     ctx.root = strdup(root);
> > +
> > +     snprintf(root, sizeof(root), "%s/%s", ctx.root, template);
> > +     tmp = mkdtemp(root);
> > +     if (tmp == NULL) {
> > +             TH_LOG("%s - Could not create source cgroup", strerror(errno));
> > +             destroy_cgroups(_metadata, &ctx);
> > +             return ctx;
> > +     }
> > +     ctx.source = strdup(tmp);
> > +
> > +     snprintf(root, sizeof(root), "%s/%s", ctx.root, template);
> > +     tmp = mkdtemp(root);
> > +     if (tmp == NULL) {
> > +             TH_LOG("%s - Could not create destination cgroup", strerror(errno));
> > +             destroy_cgroups(_metadata, &ctx);
> > +             return ctx;
> > +     }
> > +     ctx.dest = strdup(tmp);
> > +
> > +     TH_LOG("Created cgroups: %s %s", ctx.source, ctx.dest);
> > +
> > +     return ctx;
> > +}
> > +
> > +int dmabuf_heap_alloc(int fd, size_t len, int *dmabuf_fd)
> > +{
> > +     struct dma_heap_allocation_data data = {
> > +             .len = len,
> > +             .fd = 0,
> > +             .fd_flags = O_RDONLY | O_CLOEXEC,
> > +             .heap_flags = 0,
> > +     };
> > +     int ret;
> > +
> > +     if (!dmabuf_fd)
> > +             return -EINVAL;
> > +
> > +     ret = ioctl(fd, DMA_HEAP_IOCTL_ALLOC, &data);
> > +     if (ret < 0)
> > +             return ret;
> > +     *dmabuf_fd = (int)data.fd;
> > +     return ret;
> > +}
> > +
> > +/* The system heap is known to export dmabufs with support for cgroup tracking */
> > +int alloc_dmabuf_from_system_heap(struct __test_metadata *_metadata, size_t bytes)
> > +{
> > +     int heap_fd = -1, dmabuf_fd = -1;
> > +     static const char * const heap_path = "/dev/dma_heap/system";
> > +
> > +     heap_fd = open(heap_path, O_RDONLY);
> > +     if (heap_fd < 0) {
> > +             TH_LOG("%s - open %s failed!\n", strerror(errno), heap_path);
> > +             return -1;
> > +     }
>
> Same question about root preveliges?
>
> > +
> > +     if (dmabuf_heap_alloc(heap_fd, bytes, &dmabuf_fd))
> > +             TH_LOG("dmabuf allocation failed! - %s", strerror(errno));
> > +     close(heap_fd);
> > +
> > +     return dmabuf_fd;
> > +}
> > +
> > +int binder_request_dmabuf(int binder_fd)
> > +{
> > +     int ret;
> > +
> > +     /*
> > +      * We just send an empty binder_buffer_object to initiate a transaction
> > +      * with the context manager, who should respond with a single dmabuf
> > +      * inside a binder_fd_array_object.
> > +      */
> > +
> > +     struct binder_buffer_object bbo = {
> > +             .hdr.type = BINDER_TYPE_PTR,
> > +             .flags = 0,
> > +             .buffer = 0,
> > +             .length = 0,
> > +             .parent = 0, /* No parent */
> > +             .parent_offset = 0 /* No parent */
> > +     };
> > +
> > +     binder_size_t offsets[] = {0};
> > +
> > +     struct {
> > +             int32_t cmd;
> > +             struct binder_transaction_data btd;
> > +     } __attribute__((packed)) bc = {
> > +             .cmd = BC_TRANSACTION,
> > +             .btd = {
> > +                     .target = { 0 },
> > +                     .cookie = 0,
> > +                     .code = BINDER_CODE,
> > +                     .flags = TF_ACCEPT_FDS, /* We expect a FDA in the reply */
> > +                     .data_size = sizeof(bbo),
> > +                     .offsets_size = sizeof(offsets),
> > +                     .data.ptr = {
> > +                             (binder_uintptr_t)&bbo,
> > +                             (binder_uintptr_t)offsets
> > +                     }
> > +             },
> > +     };
> > +
> > +     struct {
> > +             int32_t reply_noop;
> > +     } __attribute__((packed)) br;
> > +
> > +     ret = do_binder_write_read(binder_fd, &bc, sizeof(bc), &br, sizeof(br));
> > +     if (ret >= sizeof(br) && expect_binder_reply(br.reply_noop, BR_NOOP)) {
> > +             return -1;
> > +     } else if (ret < sizeof(br)) {
> > +             fprintf(stderr, "Not enough bytes in binder reply %d\n", ret);
> > +             return -1;
> > +     }
> > +     return 0;
> > +}
> > +
> > +int send_dmabuf_reply(int binder_fd, struct binder_transaction_data *tr, int dmabuf_fd)
> > +{
> > +     int ret;
> > +     /*
> > +      * The trailing 0 is to achieve the necessary alignment for the binder
> > +      * buffer_size.
> > +      */
> > +     int fdarray[] = { dmabuf_fd, 0 };
> > +
> > +     struct binder_buffer_object bbo = {
> > +             .hdr.type = BINDER_TYPE_PTR,
> > +             .flags = BINDER_BUFFER_FLAG_SENDER_NO_NEED,
> > +             .buffer = (binder_uintptr_t)fdarray,
> > +             .length = sizeof(fdarray),
> > +             .parent = 0, /* No parent */
> > +             .parent_offset = 0 /* No parent */
> > +     };
> > +
> > +     struct binder_fd_array_object bfdao = {
> > +             .hdr.type = BINDER_TYPE_FDA,
> > +             .num_fds = 1,
> > +             .parent = 0, /* The binder_buffer_object */
> > +             .parent_offset = 0 /* FDs follow immediately */
> > +     };
> > +
> > +     uint64_t sz = sizeof(fdarray);
> > +     uint8_t data[sizeof(sz) + sizeof(bbo) + sizeof(bfdao)];
> > +     binder_size_t offsets[] = {sizeof(sz), sizeof(sz)+sizeof(bbo)};
> > +
> > +     memcpy(data,                            &sz, sizeof(sz));
> > +     memcpy(data + sizeof(sz),               &bbo, sizeof(bbo));
> > +     memcpy(data + sizeof(sz) + sizeof(bbo), &bfdao, sizeof(bfdao));
> > +
> > +     struct {
> > +             int32_t cmd;
> > +             struct binder_transaction_data_sg btd;
> > +     } __attribute__((packed)) bc = {
> > +             .cmd = BC_REPLY_SG,
> > +             .btd.transaction_data = {
> > +                     .target = { tr->target.handle },
> > +                     .cookie = tr->cookie,
> > +                     .code = BINDER_CODE,
> > +                     .flags = 0,
> > +                     .data_size = sizeof(data),
> > +                     .offsets_size = sizeof(offsets),
> > +                     .data.ptr = {
> > +                             (binder_uintptr_t)data,
> > +                             (binder_uintptr_t)offsets
> > +                     }
> > +             },
> > +             .btd.buffers_size = sizeof(fdarray)
> > +     };
> > +
> > +     struct {
> > +             int32_t reply_noop;
> > +     } __attribute__((packed)) br;
> > +
> > +     ret = do_binder_write_read(binder_fd, &bc, sizeof(bc), &br, sizeof(br));
> > +     if (ret >= sizeof(br) && expect_binder_reply(br.reply_noop, BR_NOOP)) {
> > +             return -1;
> > +     } else if (ret < sizeof(br)) {
> > +             fprintf(stderr, "Not enough bytes in binder reply %d\n", ret);
> > +             return -1;
> > +     }
> > +     return 0;
> > +}
> > +
> > +struct binder_transaction_data *binder_wait_for_transaction(int binder_fd,
> > +                                                         uint32_t *readbuf,
> > +                                                         size_t readsize)
> > +{
> > +     static const int MAX_EVENTS = 1, EPOLL_WAIT_TIME_MS = 3 * 1000;
> > +     struct binder_reply {
> > +             int32_t reply0;
> > +             int32_t reply1;
> > +             struct binder_transaction_data btd;
> > +     } *br;
> > +     struct binder_transaction_data *ret = NULL;
> > +     struct epoll_event events[MAX_EVENTS];
> > +     int epoll_fd, num_events, readcount;
> > +     uint32_t bc[] = { BC_ENTER_LOOPER };
> > +
> > +     do_binder_write_read(binder_fd, &bc, sizeof(bc), NULL, 0);
> > +
> > +     epoll_fd = epoll_create1(EPOLL_CLOEXEC);
> > +     if (epoll_fd == -1) {
> > +             perror("epoll_create");
> > +             return NULL;
> > +     }
> > +
> > +     events[0].events = EPOLLIN;
> > +     if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, binder_fd, &events[0])) {
> > +             perror("epoll_ctl add");
> > +             goto err_close;
> > +     }
> > +
> > +     num_events = epoll_wait(epoll_fd, events, MAX_EVENTS, EPOLL_WAIT_TIME_MS);
> > +     if (num_events < 0) {
> > +             perror("epoll_wait");
> > +             goto err_ctl;
> > +     } else if (num_events == 0) {
> > +             fprintf(stderr, "No events\n");
> > +             goto err_ctl;
> > +     }
> > +
> > +     readcount = do_binder_write_read(binder_fd, NULL, 0, readbuf, readsize);
> > +     fprintf(stderr, "Read %d bytes from binder\n", readcount);
> > +
> > +     if (readcount < (int)sizeof(struct binder_reply)) {
> > +             fprintf(stderr, "read_consumed not large enough\n");
> > +             goto err_ctl;
> > +     }
> > +
> > +     br = (struct binder_reply *)readbuf;
> > +     if (expect_binder_reply(br->reply0, BR_NOOP))
> > +             goto err_ctl;
> > +
> > +     if (br->reply1 == BR_TRANSACTION) {
> > +             if (br->btd.code == BINDER_CODE)
> > +                     ret = &br->btd;
> > +             else
> > +                     fprintf(stderr, "Received transaction with unexpected code: %u\n",
> > +                             br->btd.code);
> > +     } else {
> > +             expect_binder_reply(br->reply1, BR_TRANSACTION_COMPLETE);
> > +     }
> > +
> > +err_ctl:
> > +     if (epoll_ctl(epoll_fd, EPOLL_CTL_DEL, binder_fd, NULL))
> > +             perror("epoll_ctl del");
> > +err_close:
> > +     close(epoll_fd);
> > +     return ret;
> > +}
> > +
> > +static int child_request_dmabuf_transfer(const char *cgroup, void *arg)
> > +{
> > +     UNUSED(cgroup);
> > +     int ret = -1;
> > +     uint32_t readbuf[32];
> > +     struct binderfs_ctx bfs_ctx = *(struct binderfs_ctx *)arg;
> > +     struct binder_ctx b_ctx;
> > +
> > +     fprintf(stderr, "Child PID: %d\n", getpid());
> > +
> > +     b_ctx = open_binder(&bfs_ctx);
> > +     if (b_ctx.fd < 0) {
> > +             fprintf(stderr, "Child unable to open binder\n");
> > +             return -1;
> > +     }
> > +
> > +     if (binder_request_dmabuf(b_ctx.fd))
> > +             goto err;
> > +
> > +     /* The child must stay alive until the binder reply is received */
> > +     if (binder_wait_for_transaction(b_ctx.fd, readbuf, sizeof(readbuf)) == NULL)
> > +             ret = 0;
> > +
> > +     /*
> > +      * We don't close the received dmabuf here so that the parent can
> > +      * inspect the cgroup gpu memory charges to verify the charge transfer
> > +      * completed successfully.
> > +      */
> > +err:
> > +     close_binder(&b_ctx);
> > +     fprintf(stderr, "Child done\n");
> > +     return ret;
> > +}
> > +
> > +TEST(gpu_cgroup_dmabuf_transfer)
> > +{
> > +     static const char * const GPUMEM_FILENAME = "gpu.memory.current";
> > +     static const size_t ONE_MiB = 1024 * 1024;
> > +
> > +     int ret, dmabuf_fd;
> > +     uint32_t readbuf[32];
> > +     long memsize;
> > +     pid_t child_pid;
> > +     struct binderfs_ctx bfs_ctx;
> > +     struct binder_ctx b_ctx;
> > +     struct cgroup_ctx cg_ctx;
> > +     struct binder_transaction_data *tr;
> > +     struct flat_binder_object *fbo;
> > +     struct binder_buffer_object *bbo;
> > +
>
> If root previges is necessary - pls add check here and skip.
>
> > +     bfs_ctx = create_binderfs("testbinder");
> > +     if (bfs_ctx.name == NULL)
> > +             ksft_exit_skip("The Android binderfs filesystem is not available\n");
> > +
> > +     cg_ctx = create_cgroups(_metadata);
> > +     if (cg_ctx.root == NULL) {
> > +             destroy_binderfs(&bfs_ctx);
> > +             ksft_exit_skip("cgroup v2 isn't mounted\n");
> > +     }
> > +
> > +     ASSERT_EQ(cg_enter_current(cg_ctx.source), 0) {
> > +             TH_LOG("Could not move parent to cgroup: %s", cg_ctx.source);
> > +             goto err_cg;
> > +     }
> > +
> > +     dmabuf_fd = alloc_dmabuf_from_system_heap(_metadata, ONE_MiB);
> > +     ASSERT_GE(dmabuf_fd, 0) {
> > +             goto err_cg;
> > +     }
> > +     TH_LOG("Allocated dmabuf");
> > +
> > +     memsize = cg_read_key_long(cg_ctx.source, GPUMEM_FILENAME, "system");
> > +     ASSERT_EQ(memsize, ONE_MiB) {
> > +             TH_LOG("GPU memory used after allocation: %ld but it should be %lu",
> > +                    memsize, (unsigned long)ONE_MiB);
> > +             goto err_dmabuf;
> > +     }
> > +
> > +     b_ctx = open_binder(&bfs_ctx);
> > +     ASSERT_GE(b_ctx.fd, 0) {
> > +             TH_LOG("Parent unable to open binder");
> > +             goto err_dmabuf;
> > +     }
> > +     TH_LOG("Opened binder at %s/%s", bfs_ctx.mountpoint, bfs_ctx.name);
> > +
> > +     ASSERT_EQ(become_binder_context_manager(b_ctx.fd), 0) {
> > +             TH_LOG("Cannot become context manager: %s", strerror(errno));
> > +             goto err_binder;
> > +     }
> > +
> > +     child_pid = cg_run_nowait(cg_ctx.dest, child_request_dmabuf_transfer, &bfs_ctx);
> > +     ASSERT_GT(child_pid, 0) {
> > +             TH_LOG("Error forking: %s", strerror(errno));
> > +             goto err_binder;
> > +     }
> > +
> > +     tr = binder_wait_for_transaction(b_ctx.fd, readbuf, sizeof(readbuf));
> > +     ASSERT_NE(tr, NULL) {
> > +             TH_LOG("Error receiving transaction request from child");
> > +             goto err_child;
> > +     }
> > +     fbo = (struct flat_binder_object *)tr->data.ptr.buffer;
> > +     ASSERT_EQ(fbo->hdr.type, BINDER_TYPE_PTR) {
> > +             TH_LOG("Did not receive a buffer object from child");
> > +             goto err_child;
> > +     }
> > +     bbo = (struct binder_buffer_object *)fbo;
> > +     ASSERT_EQ(bbo->length, 0) {
> > +             TH_LOG("Did not receive an empty buffer object from child");
> > +             goto err_child;
> > +     }
> > +
> > +     TH_LOG("Received transaction from child");
> > +     send_dmabuf_reply(b_ctx.fd, tr, dmabuf_fd);
> > +
> > +     ASSERT_EQ(cg_read_key_long(cg_ctx.dest, GPUMEM_FILENAME, "system"), ONE_MiB) {
> > +             TH_LOG("Destination cgroup does not have system charge!");
> > +             goto err_child;
> > +     }
> > +     ASSERT_EQ(cg_read_key_long(cg_ctx.source, GPUMEM_FILENAME, "system"), 0) {
> > +             TH_LOG("Source cgroup still has system charge!");
> > +             goto err_child;
> > +     }
> > +     TH_LOG("Charge transfer succeeded!");
> > +
> > +err_child:
> > +     waitpid(child_pid, &ret, 0);
> > +     if (WIFEXITED(ret))
> > +             TH_LOG("Child %d terminated with %d", child_pid, WEXITSTATUS(ret));
> > +     else
> > +             TH_LOG("Child terminated abnormally");
>
> What does this mean? What are the conditions that could cause this?
> Pls include more info. in the message.

This would be very unusual, but possible if the child is
(accidentally?) killed by a user or the low/out of memory killer. It
looks like I can add more information with the WIFSIGNALED and
WTERMSIG macros, so I will do that.

>
> > +err_binder:
> > +     close_binder(&b_ctx);
> > +err_dmabuf:
> > +     close(dmabuf_fd);
> > +err_cg:
> > +     destroy_cgroups(_metadata, &cg_ctx);
> > +     destroy_binderfs(&bfs_ctx);
> > +}
> > +
> > +TEST_HARNESS_MAIN
> >
>
> thanks,
> -- Shuah

Thanks for taking a look!
Todd Kjos March 15, 2022, 12:43 a.m. UTC | #5
On Wed, Mar 9, 2022 at 8:53 AM T.J. Mercier <tjmercier@google.com> wrote:
>
> This test verifies that the cgroup GPU memory charge is transferred
> correctly when a dmabuf is passed between processes in two different
> cgroups and the sender specifies BINDER_BUFFER_FLAG_SENDER_NO_NEED in the
> binder transaction data containing the dmabuf file descriptor.
>
> Signed-off-by: T.J. Mercier <tjmercier@google.com>

Reviewed-by: Todd Kjos <tkjos@google.com>
for the binder driver interactions. Need Christian to take a look at
the binderfs interactions.

> ---
>  .../selftests/drivers/android/binder/Makefile |   8 +
>  .../drivers/android/binder/binder_util.c      | 254 +++++++++
>  .../drivers/android/binder/binder_util.h      |  32 ++
>  .../selftests/drivers/android/binder/config   |   4 +
>  .../binder/test_dmabuf_cgroup_transfer.c      | 480 ++++++++++++++++++
>  5 files changed, 778 insertions(+)
>  create mode 100644 tools/testing/selftests/drivers/android/binder/Makefile
>  create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.c
>  create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.h
>  create mode 100644 tools/testing/selftests/drivers/android/binder/config
>  create mode 100644 tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_transfer.c
>
> diff --git a/tools/testing/selftests/drivers/android/binder/Makefile b/tools/testing/selftests/drivers/android/binder/Makefile
> new file mode 100644
> index 000000000000..726439d10675
> --- /dev/null
> +++ b/tools/testing/selftests/drivers/android/binder/Makefile
> @@ -0,0 +1,8 @@
> +# SPDX-License-Identifier: GPL-2.0
> +CFLAGS += -Wall
> +
> +TEST_GEN_PROGS = test_dmabuf_cgroup_transfer
> +
> +include ../../../lib.mk
> +
> +$(OUTPUT)/test_dmabuf_cgroup_transfer: ../../../cgroup/cgroup_util.c binder_util.c
> diff --git a/tools/testing/selftests/drivers/android/binder/binder_util.c b/tools/testing/selftests/drivers/android/binder/binder_util.c
> new file mode 100644
> index 000000000000..c9dcf5b9d42b
> --- /dev/null
> +++ b/tools/testing/selftests/drivers/android/binder/binder_util.c
> @@ -0,0 +1,254 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include "binder_util.h"
> +
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <sys/mount.h>
> +
> +#include <linux/limits.h>
> +#include <linux/android/binder.h>
> +#include <linux/android/binderfs.h>
> +
> +static const size_t BINDER_MMAP_SIZE = 64 * 1024;
> +
> +static void binderfs_unmount(const char *mountpoint)
> +{
> +       if (umount2(mountpoint, MNT_DETACH))
> +               fprintf(stderr, "Failed to unmount binderfs at %s: %s\n",
> +                       mountpoint, strerror(errno));
> +       else
> +               fprintf(stderr, "Binderfs unmounted: %s\n", mountpoint);
> +
> +       if (rmdir(mountpoint))
> +               fprintf(stderr, "Failed to remove binderfs mount %s: %s\n",
> +                       mountpoint, strerror(errno));
> +       else
> +               fprintf(stderr, "Binderfs mountpoint destroyed: %s\n", mountpoint);
> +}
> +
> +struct binderfs_ctx create_binderfs(const char *name)
> +{
> +       int fd, ret, saved_errno;
> +       struct binderfs_device device = { 0 };
> +       struct binderfs_ctx ctx = { 0 };
> +
> +       /*
> +        * P_tmpdir is set to "/tmp/" on Android platforms where Binder is most
> +        * commonly used, but this path does not actually exist on Android. We
> +        * will first try using "/data/local/tmp" and fallback to P_tmpdir if
> +        * that fails for non-Android platforms.
> +        */
> +       static const char tmpdir[] = "/data/local/tmp";
> +       static const size_t MAX_TMPDIR_SIZE =
> +               sizeof(tmpdir) > sizeof(P_tmpdir) ?
> +               sizeof(tmpdir) : sizeof(P_tmpdir);
> +       static const char template[] = "/binderfs_XXXXXX";
> +
> +       char *mkdtemp_result;
> +       char binderfs_mntpt[MAX_TMPDIR_SIZE + sizeof(template)];
> +       char device_path[MAX_TMPDIR_SIZE + sizeof(template) + BINDERFS_MAX_NAME];
> +
> +       snprintf(binderfs_mntpt, sizeof(binderfs_mntpt), "%s%s", tmpdir, template);
> +
> +       mkdtemp_result = mkdtemp(binderfs_mntpt);
> +       if (mkdtemp_result == NULL) {
> +               fprintf(stderr, "Failed to create binderfs mountpoint at %s: %s.\n",
> +                       binderfs_mntpt, strerror(errno));
> +               fprintf(stderr, "Trying fallback mountpoint...\n");
> +               snprintf(binderfs_mntpt, sizeof(binderfs_mntpt), "%s%s", P_tmpdir, template);
> +               if (mkdtemp(binderfs_mntpt) == NULL) {
> +                       fprintf(stderr, "Failed to create binderfs mountpoint at %s: %s\n",
> +                       binderfs_mntpt, strerror(errno));
> +                       return ctx;
> +               }
> +       }
> +       fprintf(stderr, "Binderfs mountpoint created at %s\n", binderfs_mntpt);
> +
> +       if (mount(NULL, binderfs_mntpt, "binder", 0, 0)) {
> +               perror("Could not mount binderfs");
> +               rmdir(binderfs_mntpt);
> +               return ctx;
> +       }
> +       fprintf(stderr, "Binderfs mounted at %s\n", binderfs_mntpt);
> +
> +       strncpy(device.name, name, sizeof(device.name));
> +       snprintf(device_path, sizeof(device_path), "%s/binder-control", binderfs_mntpt);
> +       fd = open(device_path, O_RDONLY | O_CLOEXEC);
> +       if (!fd) {
> +               perror("Failed to open binder-control device");
> +               binderfs_unmount(binderfs_mntpt);
> +               return ctx;
> +       }
> +
> +       ret = ioctl(fd, BINDER_CTL_ADD, &device);
> +       saved_errno = errno;
> +       close(fd);
> +       errno = saved_errno;
> +       if (ret) {
> +               perror("Failed to allocate new binder device");
> +               binderfs_unmount(binderfs_mntpt);
> +               return ctx;
> +       }
> +
> +       fprintf(stderr, "Allocated new binder device with major %d, minor %d, and name %s at %s\n",
> +               device.major, device.minor, device.name, binderfs_mntpt);
> +
> +       ctx.name = strdup(name);
> +       ctx.mountpoint = strdup(binderfs_mntpt);
> +       return ctx;
> +}
> +
> +void destroy_binderfs(struct binderfs_ctx *ctx)
> +{
> +       char path[PATH_MAX];
> +
> +       snprintf(path, sizeof(path), "%s/%s", ctx->mountpoint, ctx->name);
> +
> +       if (unlink(path))
> +               fprintf(stderr, "Failed to unlink binder device %s: %s\n", path, strerror(errno));
> +       else
> +               fprintf(stderr, "Destroyed binder %s at %s\n", ctx->name, ctx->mountpoint);
> +
> +       binderfs_unmount(ctx->mountpoint);
> +
> +       free(ctx->name);
> +       free(ctx->mountpoint);
> +}
> +
> +struct binder_ctx open_binder(struct binderfs_ctx *bfs_ctx)
> +{
> +       struct binder_ctx ctx = {.fd = -1, .memory = NULL};
> +       char path[PATH_MAX];
> +
> +       snprintf(path, sizeof(path), "%s/%s", bfs_ctx->mountpoint, bfs_ctx->name);
> +       ctx.fd = open(path, O_RDWR | O_NONBLOCK | O_CLOEXEC);
> +       if (ctx.fd < 0) {
> +               fprintf(stderr, "Error opening binder device %s: %s\n", path, strerror(errno));
> +               return ctx;
> +       }
> +
> +       ctx.memory = mmap(NULL, BINDER_MMAP_SIZE, PROT_READ, MAP_SHARED, ctx.fd, 0);
> +       if (ctx.memory == NULL) {
> +               perror("Error mapping binder memory");
> +               close(ctx.fd);
> +               ctx.fd = -1;
> +       }
> +
> +       return ctx;
> +}
> +
> +void close_binder(struct binder_ctx *ctx)
> +{
> +       if (munmap(ctx->memory, BINDER_MMAP_SIZE))
> +               perror("Failed to unmap binder memory");
> +       ctx->memory = NULL;
> +
> +       if (close(ctx->fd))
> +               perror("Failed to close binder");
> +       ctx->fd = -1;
> +}
> +
> +int become_binder_context_manager(int binder_fd)
> +{
> +       return ioctl(binder_fd, BINDER_SET_CONTEXT_MGR, 0);
> +}
> +
> +int do_binder_write_read(int binder_fd, void *writebuf, binder_size_t writesize,
> +                        void *readbuf, binder_size_t readsize)
> +{
> +       int err;
> +       struct binder_write_read bwr = {
> +               .write_buffer = (binder_uintptr_t)writebuf,
> +               .write_size = writesize,
> +               .read_buffer = (binder_uintptr_t)readbuf,
> +               .read_size = readsize
> +       };
> +
> +       do {
> +               if (ioctl(binder_fd, BINDER_WRITE_READ, &bwr) >= 0)
> +                       err = 0;
> +               else
> +                       err = -errno;
> +       } while (err == -EINTR);
> +
> +       if (err < 0) {
> +               perror("BINDER_WRITE_READ");
> +               return -1;
> +       }
> +
> +       if (bwr.write_consumed < writesize) {
> +               fprintf(stderr, "Binder did not consume full write buffer %llu %llu\n",
> +                       bwr.write_consumed, writesize);
> +               return -1;
> +       }
> +
> +       return bwr.read_consumed;
> +}
> +
> +static const char *reply_string(int cmd)
> +{
> +       switch (cmd) {
> +       case BR_ERROR:
> +               return("BR_ERROR");
> +       case BR_OK:
> +               return("BR_OK");
> +       case BR_TRANSACTION_SEC_CTX:
> +               return("BR_TRANSACTION_SEC_CTX");
> +       case BR_TRANSACTION:
> +               return("BR_TRANSACTION");
> +       case BR_REPLY:
> +               return("BR_REPLY");
> +       case BR_ACQUIRE_RESULT:
> +               return("BR_ACQUIRE_RESULT");
> +       case BR_DEAD_REPLY:
> +               return("BR_DEAD_REPLY");
> +       case BR_TRANSACTION_COMPLETE:
> +               return("BR_TRANSACTION_COMPLETE");
> +       case BR_INCREFS:
> +               return("BR_INCREFS");
> +       case BR_ACQUIRE:
> +               return("BR_ACQUIRE");
> +       case BR_RELEASE:
> +               return("BR_RELEASE");
> +       case BR_DECREFS:
> +               return("BR_DECREFS");
> +       case BR_ATTEMPT_ACQUIRE:
> +               return("BR_ATTEMPT_ACQUIRE");
> +       case BR_NOOP:
> +               return("BR_NOOP");
> +       case BR_SPAWN_LOOPER:
> +               return("BR_SPAWN_LOOPER");
> +       case BR_FINISHED:
> +               return("BR_FINISHED");
> +       case BR_DEAD_BINDER:
> +               return("BR_DEAD_BINDER");
> +       case BR_CLEAR_DEATH_NOTIFICATION_DONE:
> +               return("BR_CLEAR_DEATH_NOTIFICATION_DONE");
> +       case BR_FAILED_REPLY:
> +               return("BR_FAILED_REPLY");
> +       case BR_FROZEN_REPLY:
> +               return("BR_FROZEN_REPLY");
> +       case BR_ONEWAY_SPAM_SUSPECT:
> +               return("BR_ONEWAY_SPAM_SUSPECT");
> +       default:
> +               return("Unknown");
> +       };
> +}
> +
> +int expect_binder_reply(int32_t actual, int32_t expected)
> +{
> +       if (actual != expected) {
> +               fprintf(stderr, "Expected %s but received %s\n",
> +                       reply_string(expected), reply_string(actual));
> +               return -1;
> +       }
> +       return 0;
> +}
> +
> diff --git a/tools/testing/selftests/drivers/android/binder/binder_util.h b/tools/testing/selftests/drivers/android/binder/binder_util.h
> new file mode 100644
> index 000000000000..807f5abe987e
> --- /dev/null
> +++ b/tools/testing/selftests/drivers/android/binder/binder_util.h
> @@ -0,0 +1,32 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef SELFTEST_BINDER_UTIL_H
> +#define SELFTEST_BINDER_UTIL_H
> +
> +#include <stdint.h>
> +
> +#include <linux/android/binder.h>
> +
> +struct binderfs_ctx {
> +       char *name;
> +       char *mountpoint;
> +};
> +
> +struct binder_ctx {
> +       int fd;
> +       void *memory;
> +};
> +
> +struct binderfs_ctx create_binderfs(const char *name);
> +void destroy_binderfs(struct binderfs_ctx *ctx);
> +
> +struct binder_ctx open_binder(struct binderfs_ctx *bfs_ctx);
> +void close_binder(struct binder_ctx *ctx);
> +
> +int become_binder_context_manager(int binder_fd);
> +
> +int do_binder_write_read(int binder_fd, void *writebuf, binder_size_t writesize,
> +                        void *readbuf, binder_size_t readsize);
> +
> +int expect_binder_reply(int32_t actual, int32_t expected);
> +#endif
> diff --git a/tools/testing/selftests/drivers/android/binder/config b/tools/testing/selftests/drivers/android/binder/config
> new file mode 100644
> index 000000000000..fcc5f8f693b3
> --- /dev/null
> +++ b/tools/testing/selftests/drivers/android/binder/config
> @@ -0,0 +1,4 @@
> +CONFIG_CGROUP_GPU=y
> +CONFIG_ANDROID=y
> +CONFIG_ANDROID_BINDERFS=y
> +CONFIG_ANDROID_BINDER_IPC=y
> diff --git a/tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_transfer.c b/tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_transfer.c
> new file mode 100644
> index 000000000000..9b952ab401cc
> --- /dev/null
> +++ b/tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_transfer.c
> @@ -0,0 +1,480 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * This test verifies that the cgroup GPU memory charge is transferred correctly
> + * when a dmabuf is passed between processes in two different cgroups and the
> + * sender specifies BINDER_BUFFER_FLAG_SENDER_NO_NEED in the binder transaction
> + * data containing the dmabuf file descriptor.
> + *
> + * The gpu_cgroup_dmabuf_transfer test function becomes the binder context
> + * manager, then forks a child who initiates a transaction with the context
> + * manager by specifying a target of 0. The context manager reply contains a
> + * dmabuf file descriptor which was allocated by the gpu_cgroup_dmabuf_transfer
> + * test function, but should be charged to the child cgroup after the binder
> + * transaction.
> + */
> +
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <stddef.h>
> +#include <stdint.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/epoll.h>
> +#include <sys/ioctl.h>
> +#include <sys/types.h>
> +#include <sys/wait.h>
> +
> +#include "binder_util.h"
> +#include "../../../cgroup/cgroup_util.h"
> +#include "../../../kselftest.h"
> +#include "../../../kselftest_harness.h"
> +
> +#include <linux/limits.h>
> +#include <linux/dma-heap.h>
> +#include <linux/android/binder.h>
> +
> +#define UNUSED(x) ((void)(x))
> +
> +static const unsigned int BINDER_CODE = 8675309; /* Any number will work here */
> +
> +struct cgroup_ctx {
> +       char *root;
> +       char *source;
> +       char *dest;
> +};
> +
> +void destroy_cgroups(struct __test_metadata *_metadata, struct cgroup_ctx *ctx)
> +{
> +       if (ctx->source != NULL) {
> +               TH_LOG("Destroying cgroup: %s", ctx->source);
> +               rmdir(ctx->source);
> +               free(ctx->source);
> +       }
> +
> +       if (ctx->dest != NULL) {
> +               TH_LOG("Destroying cgroup: %s", ctx->dest);
> +               rmdir(ctx->dest);
> +               free(ctx->dest);
> +       }
> +
> +       free(ctx->root);
> +       ctx->root = ctx->source = ctx->dest = NULL;
> +}
> +
> +struct cgroup_ctx create_cgroups(struct __test_metadata *_metadata)
> +{
> +       struct cgroup_ctx ctx = {0};
> +       char root[PATH_MAX], *tmp;
> +       static const char template[] = "/gpucg_XXXXXX";
> +
> +       if (cg_find_unified_root(root, sizeof(root))) {
> +               TH_LOG("Could not find cgroups root");
> +               return ctx;
> +       }
> +
> +       if (cg_read_strstr(root, "cgroup.controllers", "gpu")) {
> +               TH_LOG("Could not find GPU controller");
> +               return ctx;
> +       }
> +
> +       if (cg_write(root, "cgroup.subtree_control", "+gpu")) {
> +               TH_LOG("Could not enable GPU controller");
> +               return ctx;
> +       }
> +
> +       ctx.root = strdup(root);
> +
> +       snprintf(root, sizeof(root), "%s/%s", ctx.root, template);
> +       tmp = mkdtemp(root);
> +       if (tmp == NULL) {
> +               TH_LOG("%s - Could not create source cgroup", strerror(errno));
> +               destroy_cgroups(_metadata, &ctx);
> +               return ctx;
> +       }
> +       ctx.source = strdup(tmp);
> +
> +       snprintf(root, sizeof(root), "%s/%s", ctx.root, template);
> +       tmp = mkdtemp(root);
> +       if (tmp == NULL) {
> +               TH_LOG("%s - Could not create destination cgroup", strerror(errno));
> +               destroy_cgroups(_metadata, &ctx);
> +               return ctx;
> +       }
> +       ctx.dest = strdup(tmp);
> +
> +       TH_LOG("Created cgroups: %s %s", ctx.source, ctx.dest);
> +
> +       return ctx;
> +}
> +
> +int dmabuf_heap_alloc(int fd, size_t len, int *dmabuf_fd)
> +{
> +       struct dma_heap_allocation_data data = {
> +               .len = len,
> +               .fd = 0,
> +               .fd_flags = O_RDONLY | O_CLOEXEC,
> +               .heap_flags = 0,
> +       };
> +       int ret;
> +
> +       if (!dmabuf_fd)
> +               return -EINVAL;
> +
> +       ret = ioctl(fd, DMA_HEAP_IOCTL_ALLOC, &data);
> +       if (ret < 0)
> +               return ret;
> +       *dmabuf_fd = (int)data.fd;
> +       return ret;
> +}
> +
> +/* The system heap is known to export dmabufs with support for cgroup tracking */
> +int alloc_dmabuf_from_system_heap(struct __test_metadata *_metadata, size_t bytes)
> +{
> +       int heap_fd = -1, dmabuf_fd = -1;
> +       static const char * const heap_path = "/dev/dma_heap/system";
> +
> +       heap_fd = open(heap_path, O_RDONLY);
> +       if (heap_fd < 0) {
> +               TH_LOG("%s - open %s failed!\n", strerror(errno), heap_path);
> +               return -1;
> +       }
> +
> +       if (dmabuf_heap_alloc(heap_fd, bytes, &dmabuf_fd))
> +               TH_LOG("dmabuf allocation failed! - %s", strerror(errno));
> +       close(heap_fd);
> +
> +       return dmabuf_fd;
> +}
> +
> +int binder_request_dmabuf(int binder_fd)
> +{
> +       int ret;
> +
> +       /*
> +        * We just send an empty binder_buffer_object to initiate a transaction
> +        * with the context manager, who should respond with a single dmabuf
> +        * inside a binder_fd_array_object.
> +        */
> +
> +       struct binder_buffer_object bbo = {
> +               .hdr.type = BINDER_TYPE_PTR,
> +               .flags = 0,
> +               .buffer = 0,
> +               .length = 0,
> +               .parent = 0, /* No parent */
> +               .parent_offset = 0 /* No parent */
> +       };
> +
> +       binder_size_t offsets[] = {0};
> +
> +       struct {
> +               int32_t cmd;
> +               struct binder_transaction_data btd;
> +       } __attribute__((packed)) bc = {
> +               .cmd = BC_TRANSACTION,
> +               .btd = {
> +                       .target = { 0 },
> +                       .cookie = 0,
> +                       .code = BINDER_CODE,
> +                       .flags = TF_ACCEPT_FDS, /* We expect a FDA in the reply */
> +                       .data_size = sizeof(bbo),
> +                       .offsets_size = sizeof(offsets),
> +                       .data.ptr = {
> +                               (binder_uintptr_t)&bbo,
> +                               (binder_uintptr_t)offsets
> +                       }
> +               },
> +       };
> +
> +       struct {
> +               int32_t reply_noop;
> +       } __attribute__((packed)) br;
> +
> +       ret = do_binder_write_read(binder_fd, &bc, sizeof(bc), &br, sizeof(br));
> +       if (ret >= sizeof(br) && expect_binder_reply(br.reply_noop, BR_NOOP)) {
> +               return -1;
> +       } else if (ret < sizeof(br)) {
> +               fprintf(stderr, "Not enough bytes in binder reply %d\n", ret);
> +               return -1;
> +       }
> +       return 0;
> +}
> +
> +int send_dmabuf_reply(int binder_fd, struct binder_transaction_data *tr, int dmabuf_fd)
> +{
> +       int ret;
> +       /*
> +        * The trailing 0 is to achieve the necessary alignment for the binder
> +        * buffer_size.
> +        */
> +       int fdarray[] = { dmabuf_fd, 0 };
> +
> +       struct binder_buffer_object bbo = {
> +               .hdr.type = BINDER_TYPE_PTR,
> +               .flags = BINDER_BUFFER_FLAG_SENDER_NO_NEED,
> +               .buffer = (binder_uintptr_t)fdarray,
> +               .length = sizeof(fdarray),
> +               .parent = 0, /* No parent */
> +               .parent_offset = 0 /* No parent */
> +       };
> +
> +       struct binder_fd_array_object bfdao = {
> +               .hdr.type = BINDER_TYPE_FDA,
> +               .num_fds = 1,
> +               .parent = 0, /* The binder_buffer_object */
> +               .parent_offset = 0 /* FDs follow immediately */
> +       };
> +
> +       uint64_t sz = sizeof(fdarray);
> +       uint8_t data[sizeof(sz) + sizeof(bbo) + sizeof(bfdao)];
> +       binder_size_t offsets[] = {sizeof(sz), sizeof(sz)+sizeof(bbo)};
> +
> +       memcpy(data,                            &sz, sizeof(sz));
> +       memcpy(data + sizeof(sz),               &bbo, sizeof(bbo));
> +       memcpy(data + sizeof(sz) + sizeof(bbo), &bfdao, sizeof(bfdao));
> +
> +       struct {
> +               int32_t cmd;
> +               struct binder_transaction_data_sg btd;
> +       } __attribute__((packed)) bc = {
> +               .cmd = BC_REPLY_SG,
> +               .btd.transaction_data = {
> +                       .target = { tr->target.handle },
> +                       .cookie = tr->cookie,
> +                       .code = BINDER_CODE,
> +                       .flags = 0,
> +                       .data_size = sizeof(data),
> +                       .offsets_size = sizeof(offsets),
> +                       .data.ptr = {
> +                               (binder_uintptr_t)data,
> +                               (binder_uintptr_t)offsets
> +                       }
> +               },
> +               .btd.buffers_size = sizeof(fdarray)
> +       };
> +
> +       struct {
> +               int32_t reply_noop;
> +       } __attribute__((packed)) br;
> +
> +       ret = do_binder_write_read(binder_fd, &bc, sizeof(bc), &br, sizeof(br));
> +       if (ret >= sizeof(br) && expect_binder_reply(br.reply_noop, BR_NOOP)) {
> +               return -1;
> +       } else if (ret < sizeof(br)) {
> +               fprintf(stderr, "Not enough bytes in binder reply %d\n", ret);
> +               return -1;
> +       }
> +       return 0;
> +}
> +
> +struct binder_transaction_data *binder_wait_for_transaction(int binder_fd,
> +                                                           uint32_t *readbuf,
> +                                                           size_t readsize)
> +{
> +       static const int MAX_EVENTS = 1, EPOLL_WAIT_TIME_MS = 3 * 1000;
> +       struct binder_reply {
> +               int32_t reply0;
> +               int32_t reply1;
> +               struct binder_transaction_data btd;
> +       } *br;
> +       struct binder_transaction_data *ret = NULL;
> +       struct epoll_event events[MAX_EVENTS];
> +       int epoll_fd, num_events, readcount;
> +       uint32_t bc[] = { BC_ENTER_LOOPER };
> +
> +       do_binder_write_read(binder_fd, &bc, sizeof(bc), NULL, 0);
> +
> +       epoll_fd = epoll_create1(EPOLL_CLOEXEC);
> +       if (epoll_fd == -1) {
> +               perror("epoll_create");
> +               return NULL;
> +       }
> +
> +       events[0].events = EPOLLIN;
> +       if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, binder_fd, &events[0])) {
> +               perror("epoll_ctl add");
> +               goto err_close;
> +       }
> +
> +       num_events = epoll_wait(epoll_fd, events, MAX_EVENTS, EPOLL_WAIT_TIME_MS);
> +       if (num_events < 0) {
> +               perror("epoll_wait");
> +               goto err_ctl;
> +       } else if (num_events == 0) {
> +               fprintf(stderr, "No events\n");
> +               goto err_ctl;
> +       }
> +
> +       readcount = do_binder_write_read(binder_fd, NULL, 0, readbuf, readsize);
> +       fprintf(stderr, "Read %d bytes from binder\n", readcount);
> +
> +       if (readcount < (int)sizeof(struct binder_reply)) {
> +               fprintf(stderr, "read_consumed not large enough\n");
> +               goto err_ctl;
> +       }
> +
> +       br = (struct binder_reply *)readbuf;
> +       if (expect_binder_reply(br->reply0, BR_NOOP))
> +               goto err_ctl;
> +
> +       if (br->reply1 == BR_TRANSACTION) {
> +               if (br->btd.code == BINDER_CODE)
> +                       ret = &br->btd;
> +               else
> +                       fprintf(stderr, "Received transaction with unexpected code: %u\n",
> +                               br->btd.code);
> +       } else {
> +               expect_binder_reply(br->reply1, BR_TRANSACTION_COMPLETE);
> +       }
> +
> +err_ctl:
> +       if (epoll_ctl(epoll_fd, EPOLL_CTL_DEL, binder_fd, NULL))
> +               perror("epoll_ctl del");
> +err_close:
> +       close(epoll_fd);
> +       return ret;
> +}
> +
> +static int child_request_dmabuf_transfer(const char *cgroup, void *arg)
> +{
> +       UNUSED(cgroup);
> +       int ret = -1;
> +       uint32_t readbuf[32];
> +       struct binderfs_ctx bfs_ctx = *(struct binderfs_ctx *)arg;
> +       struct binder_ctx b_ctx;
> +
> +       fprintf(stderr, "Child PID: %d\n", getpid());
> +
> +       b_ctx = open_binder(&bfs_ctx);
> +       if (b_ctx.fd < 0) {
> +               fprintf(stderr, "Child unable to open binder\n");
> +               return -1;
> +       }
> +
> +       if (binder_request_dmabuf(b_ctx.fd))
> +               goto err;
> +
> +       /* The child must stay alive until the binder reply is received */
> +       if (binder_wait_for_transaction(b_ctx.fd, readbuf, sizeof(readbuf)) == NULL)
> +               ret = 0;
> +
> +       /*
> +        * We don't close the received dmabuf here so that the parent can
> +        * inspect the cgroup gpu memory charges to verify the charge transfer
> +        * completed successfully.
> +        */
> +err:
> +       close_binder(&b_ctx);
> +       fprintf(stderr, "Child done\n");
> +       return ret;
> +}
> +
> +TEST(gpu_cgroup_dmabuf_transfer)
> +{
> +       static const char * const GPUMEM_FILENAME = "gpu.memory.current";
> +       static const size_t ONE_MiB = 1024 * 1024;
> +
> +       int ret, dmabuf_fd;
> +       uint32_t readbuf[32];
> +       long memsize;
> +       pid_t child_pid;
> +       struct binderfs_ctx bfs_ctx;
> +       struct binder_ctx b_ctx;
> +       struct cgroup_ctx cg_ctx;
> +       struct binder_transaction_data *tr;
> +       struct flat_binder_object *fbo;
> +       struct binder_buffer_object *bbo;
> +
> +       bfs_ctx = create_binderfs("testbinder");
> +       if (bfs_ctx.name == NULL)
> +               ksft_exit_skip("The Android binderfs filesystem is not available\n");
> +
> +       cg_ctx = create_cgroups(_metadata);
> +       if (cg_ctx.root == NULL) {
> +               destroy_binderfs(&bfs_ctx);
> +               ksft_exit_skip("cgroup v2 isn't mounted\n");
> +       }
> +
> +       ASSERT_EQ(cg_enter_current(cg_ctx.source), 0) {
> +               TH_LOG("Could not move parent to cgroup: %s", cg_ctx.source);
> +               goto err_cg;
> +       }
> +
> +       dmabuf_fd = alloc_dmabuf_from_system_heap(_metadata, ONE_MiB);
> +       ASSERT_GE(dmabuf_fd, 0) {
> +               goto err_cg;
> +       }
> +       TH_LOG("Allocated dmabuf");
> +
> +       memsize = cg_read_key_long(cg_ctx.source, GPUMEM_FILENAME, "system");
> +       ASSERT_EQ(memsize, ONE_MiB) {
> +               TH_LOG("GPU memory used after allocation: %ld but it should be %lu",
> +                      memsize, (unsigned long)ONE_MiB);
> +               goto err_dmabuf;
> +       }
> +
> +       b_ctx = open_binder(&bfs_ctx);
> +       ASSERT_GE(b_ctx.fd, 0) {
> +               TH_LOG("Parent unable to open binder");
> +               goto err_dmabuf;
> +       }
> +       TH_LOG("Opened binder at %s/%s", bfs_ctx.mountpoint, bfs_ctx.name);
> +
> +       ASSERT_EQ(become_binder_context_manager(b_ctx.fd), 0) {
> +               TH_LOG("Cannot become context manager: %s", strerror(errno));
> +               goto err_binder;
> +       }
> +
> +       child_pid = cg_run_nowait(cg_ctx.dest, child_request_dmabuf_transfer, &bfs_ctx);
> +       ASSERT_GT(child_pid, 0) {
> +               TH_LOG("Error forking: %s", strerror(errno));
> +               goto err_binder;
> +       }
> +
> +       tr = binder_wait_for_transaction(b_ctx.fd, readbuf, sizeof(readbuf));
> +       ASSERT_NE(tr, NULL) {
> +               TH_LOG("Error receiving transaction request from child");
> +               goto err_child;
> +       }
> +       fbo = (struct flat_binder_object *)tr->data.ptr.buffer;
> +       ASSERT_EQ(fbo->hdr.type, BINDER_TYPE_PTR) {
> +               TH_LOG("Did not receive a buffer object from child");
> +               goto err_child;
> +       }
> +       bbo = (struct binder_buffer_object *)fbo;
> +       ASSERT_EQ(bbo->length, 0) {
> +               TH_LOG("Did not receive an empty buffer object from child");
> +               goto err_child;
> +       }
> +
> +       TH_LOG("Received transaction from child");
> +       send_dmabuf_reply(b_ctx.fd, tr, dmabuf_fd);
> +
> +       ASSERT_EQ(cg_read_key_long(cg_ctx.dest, GPUMEM_FILENAME, "system"), ONE_MiB) {
> +               TH_LOG("Destination cgroup does not have system charge!");
> +               goto err_child;
> +       }
> +       ASSERT_EQ(cg_read_key_long(cg_ctx.source, GPUMEM_FILENAME, "system"), 0) {
> +               TH_LOG("Source cgroup still has system charge!");
> +               goto err_child;
> +       }
> +       TH_LOG("Charge transfer succeeded!");
> +
> +err_child:
> +       waitpid(child_pid, &ret, 0);
> +       if (WIFEXITED(ret))
> +               TH_LOG("Child %d terminated with %d", child_pid, WEXITSTATUS(ret));
> +       else
> +               TH_LOG("Child terminated abnormally");
> +err_binder:
> +       close_binder(&b_ctx);
> +err_dmabuf:
> +       close(dmabuf_fd);
> +err_cg:
> +       destroy_cgroups(_metadata, &cg_ctx);
> +       destroy_binderfs(&bfs_ctx);
> +}
> +
> +TEST_HARNESS_MAIN
> --
> 2.35.1.616.g0bdcbb4464-goog
>
Todd Kjos March 15, 2022, 12:50 a.m. UTC | #6
On Wed, Mar 9, 2022 at 8:52 AM T.J. Mercier <tjmercier@google.com> wrote:
>
> From: Hridya Valsaraju <hridya@google.com>
>
> This patch introduces a buffer flag BINDER_BUFFER_FLAG_SENDER_NO_NEED
> that a process sending an fd array to another process over binder IPC
> can set to relinquish ownership of the fds being sent for memory
> accounting purposes. If the flag is found to be set during the fd array
> translation and the fd is for a DMA-BUF, the buffer is uncharged from
> the sender's cgroup and charged to the receiving process's cgroup
> instead.
>
> It is up to the sending process to ensure that it closes the fds
> regardless of whether the transfer failed or succeeded.
>
> Most graphics shared memory allocations in Android are done by the
> graphics allocator HAL process. On requests from clients, the HAL process
> allocates memory and sends the fds to the clients over binder IPC.
> The graphics allocator HAL will not retain any references to the
> buffers. When the HAL sets the BINDER_BUFFER_FLAG_SENDER_NO_NEED for fd
> arrays holding DMA-BUF fds, the gpu cgroup controller will be able to
> correctly charge the buffers to the client processes instead of the
> graphics allocator HAL.
>
> Since this is a new feature exposed to userspace, the kernel and userspace
> must be compatible for the accounting to work for transfers. In all cases
> the allocation and transport of DMA buffers via binder will succeed, but
> only when both the kernel supports, and userspace depends on this feature
> will the transfer accounting work. The possible scenarios are detailed
> below:
>
> 1. new kernel + old userspace
> The kernel supports the feature but userspace does not use it. The old
> userspace won't mount the new cgroup controller, accounting is not
> performed, charge is not transferred.
>
> 2. old kernel + new userspace
> The new cgroup controller is not supported by the kernel, accounting is
> not performed, charge is not transferred.
>
> 3. old kernel + old userspace
> Same as #2
>
> 4. new kernel + new userspace
> Cgroup is mounted, feature is supported and used.
>
> Signed-off-by: Hridya Valsaraju <hridya@google.com>
> Signed-off-by: T.J. Mercier <tjmercier@google.com>

Acked-by: Todd Kjos <tkjos@google.com>

>
> ---
> v3 changes
> Remove android from title per Todd Kjos.
>
> Use more common dual author commit message format per John Stultz.
>
> Include details on behavior for all combinations of kernel/userspace
> versions in changelog (thanks Suren Baghdasaryan) per Greg Kroah-Hartman.
>
> v2 changes
> Move dma-buf cgroup charge transfer from a dma_buf_op defined by every
> heap to a single dma-buf function for all heaps per Daniel Vetter and
> Christian König.
> ---
>  drivers/android/binder.c            | 26 ++++++++++++++++++++++++++
>  include/uapi/linux/android/binder.h |  1 +
>  2 files changed, 27 insertions(+)
>
> diff --git a/drivers/android/binder.c b/drivers/android/binder.c
> index 8351c5638880..f50d88ded188 100644
> --- a/drivers/android/binder.c
> +++ b/drivers/android/binder.c
> @@ -42,6 +42,7 @@
>
>  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>
> +#include <linux/dma-buf.h>
>  #include <linux/fdtable.h>
>  #include <linux/file.h>
>  #include <linux/freezer.h>
> @@ -2482,8 +2483,10 @@ static int binder_translate_fd_array(struct list_head *pf_head,
>  {
>         binder_size_t fdi, fd_buf_size;
>         binder_size_t fda_offset;
> +       bool transfer_gpu_charge = false;
>         const void __user *sender_ufda_base;
>         struct binder_proc *proc = thread->proc;
> +       struct binder_proc *target_proc = t->to_proc;
>         int ret;
>
>         fd_buf_size = sizeof(u32) * fda->num_fds;
> @@ -2521,8 +2524,15 @@ static int binder_translate_fd_array(struct list_head *pf_head,
>         if (ret)
>                 return ret;
>
> +       if (IS_ENABLED(CONFIG_CGROUP_GPU) &&
> +               parent->flags & BINDER_BUFFER_FLAG_SENDER_NO_NEED)
> +               transfer_gpu_charge = true;
> +
>         for (fdi = 0; fdi < fda->num_fds; fdi++) {
>                 u32 fd;
> +               struct dma_buf *dmabuf;
> +               struct gpucg *gpucg;
> +
>                 binder_size_t offset = fda_offset + fdi * sizeof(fd);
>                 binder_size_t sender_uoffset = fdi * sizeof(fd);
>
> @@ -2532,6 +2542,22 @@ static int binder_translate_fd_array(struct list_head *pf_head,
>                                                   in_reply_to);
>                 if (ret)
>                         return ret > 0 ? -EINVAL : ret;
> +
> +               if (!transfer_gpu_charge)
> +                       continue;
> +
> +               dmabuf = dma_buf_get(fd);
> +               if (IS_ERR(dmabuf))
> +                       continue;
> +
> +               gpucg = gpucg_get(target_proc->tsk);
> +               ret = dma_buf_charge_transfer(dmabuf, gpucg);
> +               if (ret) {
> +                       pr_warn("%d:%d Unable to transfer DMA-BUF fd charge to %d",
> +                               proc->pid, thread->pid, target_proc->pid);
> +                       gpucg_put(gpucg);
> +               }
> +               dma_buf_put(dmabuf);
>         }
>         return 0;
>  }
> diff --git a/include/uapi/linux/android/binder.h b/include/uapi/linux/android/binder.h
> index 3246f2c74696..169fd5069a1a 100644
> --- a/include/uapi/linux/android/binder.h
> +++ b/include/uapi/linux/android/binder.h
> @@ -137,6 +137,7 @@ struct binder_buffer_object {
>
>  enum {
>         BINDER_BUFFER_FLAG_HAS_PARENT = 0x01,
> +       BINDER_BUFFER_FLAG_SENDER_NO_NEED = 0x02,
>  };
>
>  /* struct binder_fd_array_object - object describing an array of fds in a buffer
> --
> 2.35.1.616.g0bdcbb4464-goog
>
David Laight March 15, 2022, 7:56 a.m. UTC | #7
From: T.J. Mercier
> Sent: 14 March 2022 23:45
> 
> On Thu, Mar 10, 2022 at 11:33 AM Todd Kjos <tkjos@google.com> wrote:
> >
> > On Wed, Mar 9, 2022 at 8:52 AM T.J. Mercier <tjmercier@google.com> wrote:
> > >
> > > The kernel interface should use types that the kernel defines instead of
> > > pid_t and uid_t, whose definiton is owned by libc. This fixes the header
> > > so that it can be included without first including sys/types.h.
> > >
> > > Signed-off-by: T.J. Mercier <tjmercier@google.com>
> > > ---
> > >  include/uapi/linux/android/binder.h | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/uapi/linux/android/binder.h b/include/uapi/linux/android/binder.h
> > > index 169fd5069a1a..aa28454dbca3 100644
> > > --- a/include/uapi/linux/android/binder.h
> > > +++ b/include/uapi/linux/android/binder.h
> > > @@ -289,8 +289,8 @@ struct binder_transaction_data {
> > >
> > >         /* General information about the transaction. */
> > >         __u32           flags;
> > > -       pid_t           sender_pid;
> > > -       uid_t           sender_euid;
> > > +       __kernel_pid_t  sender_pid;
> > > +       __kernel_uid_t  sender_euid;
> >
> > Are we guaranteed that this does not affect the UAPI at all? Userspace
> > code using this definition will have to run with kernels using the old
> > definition and visa-versa.
> 
> A standards compliant userspace should be expecting a signed integer
> type here. So the only way I can think userspace would be affected is
> if:
> 1) pid_t is a long AND
> 2) sizeof(long) > sizeof(int) AND
> 3) Consumers of the pid_t definition actually attempt to mutate the
> result to make use of extra bits in the variable (which are not there)

Or the userspace headers have a 16bit pid_t.

I can't help feeling that uapi headers should only use explicit
fixed sized types.
There is no point indirecting the type names - the sizes still
can't be changes.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
T.J. Mercier March 15, 2022, 7:02 p.m. UTC | #8
On Tue, Mar 15, 2022 at 12:56 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: T.J. Mercier
> > Sent: 14 March 2022 23:45
> >
> > On Thu, Mar 10, 2022 at 11:33 AM Todd Kjos <tkjos@google.com> wrote:
> > >
> > > On Wed, Mar 9, 2022 at 8:52 AM T.J. Mercier <tjmercier@google.com> wrote:
> > > >
> > > > The kernel interface should use types that the kernel defines instead of
> > > > pid_t and uid_t, whose definiton is owned by libc. This fixes the header
> > > > so that it can be included without first including sys/types.h.
> > > >
> > > > Signed-off-by: T.J. Mercier <tjmercier@google.com>
> > > > ---
> > > >  include/uapi/linux/android/binder.h | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/include/uapi/linux/android/binder.h b/include/uapi/linux/android/binder.h
> > > > index 169fd5069a1a..aa28454dbca3 100644
> > > > --- a/include/uapi/linux/android/binder.h
> > > > +++ b/include/uapi/linux/android/binder.h
> > > > @@ -289,8 +289,8 @@ struct binder_transaction_data {
> > > >
> > > >         /* General information about the transaction. */
> > > >         __u32           flags;
> > > > -       pid_t           sender_pid;
> > > > -       uid_t           sender_euid;
> > > > +       __kernel_pid_t  sender_pid;
> > > > +       __kernel_uid_t  sender_euid;
> > >
> > > Are we guaranteed that this does not affect the UAPI at all? Userspace
> > > code using this definition will have to run with kernels using the old
> > > definition and visa-versa.
> >
> > A standards compliant userspace should be expecting a signed integer
> > type here. So the only way I can think userspace would be affected is
> > if:
> > 1) pid_t is a long AND
> > 2) sizeof(long) > sizeof(int) AND
> > 3) Consumers of the pid_t definition actually attempt to mutate the
> > result to make use of extra bits in the variable (which are not there)
>
> Or the userspace headers have a 16bit pid_t.

Since the kernel uses an int for PIDs, wouldn't a 16 bit pid_t already
be potentially broken (overflow) on systems where int is not 16 bits?
On systems where int is 16 bits, there is no change here except to
achieve uniform use of __kernel_pid_t in the kernel headers and fix
the include problem.

>
> I can't help feeling that uapi headers should only use explicit
> fixed sized types.
> There is no point indirecting the type names - the sizes still
> can't be changes.

I think it's still unlikely to be an actual problem. For example there
are other occasions where a switch like this was made:
https://github.com/torvalds/linux/commit/694a58e29ef27c4c26f103a9decfd053f94dd34c
https://github.com/torvalds/linux/commit/269b8fd5d058f2c0da01a42b20315ffc2640d99b

And also since Binder's only known user is Android through Bionic
which already expects the type of pid_t to be __kernel_pid_t.


>
>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
T.J. Mercier March 22, 2022, 3:41 p.m. UTC | #9
On Mon, Mar 21, 2022 at 10:37 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello.
>
> On Wed, Mar 09, 2022 at 04:52:11PM +0000, "T.J. Mercier" <tjmercier@google.com> wrote:
> > +The new cgroup controller would:
> > +
> > +* Allow setting per-cgroup limits on the total size of buffers charged to it.
>
> What is the meaning of the total? (I only have very naïve
> understanding of the device buffers.)

So "total" is used twice here in two different contexts.
The first one is the global "GPU" cgroup context. As in any buffer
that any exporter claims is a GPU buffer, regardless of where/how it
is allocated. So this refers to the sum of all gpu buffers of any
type/source. An exporter contributes to this total by registering a
corresponding gpucg_device and making charges against that device when
it exports.
The second one is in a per device context. This allows us to make a
distinction between different types of GPU memory based on who
exported the buffer. A single process can make use of several
different types of dma buffers (for example cached and uncached
versions of the same type of memory), and it would be useful to have
different limits for each. These are distinguished by the device name
string chosen when the gpucg_device is first registered.

>
> Is it like a) there's global pool of memory that is partitioned among
> individual devices or b) each device has its own specific type of memory
> and adding across two devices is adding apples and oranges or c) there
> can be various devices both of a) and b) type?

So I guess the most correct answer to this question is c.


>
> (Apologies not replying to previous versions and possibly missing
> anything.)
>
> Thanks,
> Michal
>
Christian Brauner March 22, 2022, 4:07 p.m. UTC | #10
On Mon, Mar 14, 2022 at 05:43:40PM -0700, Todd Kjos wrote:
> On Wed, Mar 9, 2022 at 8:53 AM T.J. Mercier <tjmercier@google.com> wrote:
> >
> > This test verifies that the cgroup GPU memory charge is transferred
> > correctly when a dmabuf is passed between processes in two different
> > cgroups and the sender specifies BINDER_BUFFER_FLAG_SENDER_NO_NEED in the
> > binder transaction data containing the dmabuf file descriptor.
> >
> > Signed-off-by: T.J. Mercier <tjmercier@google.com>
> 
> Reviewed-by: Todd Kjos <tkjos@google.com>
> for the binder driver interactions. Need Christian to take a look at
> the binderfs interactions.

Sorry, just saw this now. I'll take a look tomorrow!