[v2,bpf-next,1/4] bpf: introduce task_vma bpf_iter

Message ID	20201215233702.3301881-2-songliubraving@fb.com
State	New
Headers	show Return-Path: <netdev-owner@kernel.org> From: Song Liu <songliubraving@fb.com> To: <bpf@vger.kernel.org>, <netdev@vger.kernel.org> CC: <ast@kernel.org>, <daniel@iogearbox.net>, <andrii@kernel.org>, <john.fastabend@gmail.com>, <kpsingh@chromium.org>, <kernel-team@fb.com>, Song Liu <songliubraving@fb.com> Subject: [PATCH v2 bpf-next 1/4] bpf: introduce task_vma bpf_iter Date: Tue, 15 Dec 2020 15:36:59 -0800 Message-ID: <20201215233702.3301881-2-songliubraving@fb.com> In-Reply-To: <20201215233702.3301881-1-songliubraving@fb.com> References: <20201215233702.3301881-1-songliubraving@fb.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain Precedence: bulk
Series	introduce bpf_iter for task_vma \| expand [v2,bpf-next,0/4] introduce bpf_iter for task_vma [v2,bpf-next,1/4] bpf: introduce task_vma bpf_iter [v2,bpf-next,2/4] bpf: allow bpf_d_path in sleepable bpf_iter program [v2,bpf-next,3/4] libbpf: introduce section "iter.s/" for sleepable bpf_iter program [v2,bpf-next,4/4] selftests/bpf: add test for bpf_iter_task_vma

Song Liu Dec. 15, 2020, 11:36 p.m. UTC

Introduce task_vma bpf_iter to print memory information of a process. It
can be used to print customized information similar to /proc/<pid>/maps.

task_vma iterator releases mmap_lock before calling the BPF program.
Therefore, we cannot pass vm_area_struct directly to the BPF program. A
new __vm_area_struct is introduced to keep key information of a vma. On
each iteration, task_vma gathers information in __vm_area_struct and
passes it to the BPF program.

If the vma maps to a file, task_vma also holds a reference to the file
while calling the BPF program.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/bpf.h    |   2 +-
 kernel/bpf/task_iter.c | 205 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 205 insertions(+), 2 deletions(-)

Yonghong Song Dec. 16, 2020, 5:36 p.m. UTC | #1

On 12/15/20 3:36 PM, Song Liu wrote:
> Introduce task_vma bpf_iter to print memory information of a process. It

> can be used to print customized information similar to /proc/<pid>/maps.

> 

> task_vma iterator releases mmap_lock before calling the BPF program.

> Therefore, we cannot pass vm_area_struct directly to the BPF program. A

> new __vm_area_struct is introduced to keep key information of a vma. On

> each iteration, task_vma gathers information in __vm_area_struct and

> passes it to the BPF program.

> 

> If the vma maps to a file, task_vma also holds a reference to the file

> while calling the BPF program.

> 

> Signed-off-by: Song Liu <songliubraving@fb.com>

> ---

>   include/linux/bpf.h    |   2 +-

>   kernel/bpf/task_iter.c | 205 ++++++++++++++++++++++++++++++++++++++++-

>   2 files changed, 205 insertions(+), 2 deletions(-)

> 

> diff --git a/include/linux/bpf.h b/include/linux/bpf.h

> index 07cb5d15e7439..49dd1e29c8118 100644

> --- a/include/linux/bpf.h

> +++ b/include/linux/bpf.h

> @@ -1325,7 +1325,7 @@ enum bpf_iter_feature {

>   	BPF_ITER_RESCHED	= BIT(0),

>   };

>   

> -#define BPF_ITER_CTX_ARG_MAX 2

> +#define BPF_ITER_CTX_ARG_MAX 3

>   struct bpf_iter_reg {

>   	const char *target;

>   	bpf_iter_attach_target_t attach_target;

> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c

> index 0458a40edf10a..15a066b442f75 100644

> --- a/kernel/bpf/task_iter.c

> +++ b/kernel/bpf/task_iter.c

> @@ -304,9 +304,183 @@ static const struct seq_operations task_file_seq_ops = {

>   	.show	= task_file_seq_show,

>   };

>   

> +/*

> + * Key information from vm_area_struct. We need this because we cannot

> + * assume the vm_area_struct is still valid after each iteration.

> + */

> +struct __vm_area_struct {

> +	__u64 start;

> +	__u64 end;

> +	__u64 flags;

> +	__u64 pgoff;

> +};

> +

> +struct bpf_iter_seq_task_vma_info {

> +	/* The first field must be struct bpf_iter_seq_task_common.

> +	 * this is assumed by {init, fini}_seq_pidns() callback functions.

> +	 */

> +	struct bpf_iter_seq_task_common common;

> +	struct task_struct *task;

> +	struct __vm_area_struct vma;

> +	struct file *file;

> +	u32 tid;

> +};

> +

> +static struct __vm_area_struct *

> +task_vma_seq_get_next(struct bpf_iter_seq_task_vma_info *info)

> +{

> +	struct pid_namespace *ns = info->common.ns;

> +	struct task_struct *curr_task;

> +	struct vm_area_struct *vma;

> +	u32 curr_tid = info->tid;

> +	bool new_task = false;

> +

> +	/* If this function returns a non-NULL __vm_area_struct, it held

> +	 * a reference to the task_struct. If info->file is non-NULL, it

> +	 * also holds a reference to the file. If this function returns

> +	 * NULL, it does not hold any reference.

> +	 */

> +again:

> +	if (info->task) {

> +		curr_task = info->task;

> +	} else {

> +		curr_task = task_seq_get_next(ns, &curr_tid, true);

> +		if (!curr_task) {

> +			info->task = NULL;

> +			info->tid++;


Here, info->tid should be info->tid = curr_tid + 1.
For exmaple, suppose initial curr_tid = info->tid = 10, and the
above task_seq_get_next(...) returns NULL with curr_tid = 100
which means tid = 100 has been visited. So we would like
to set info->tid = 101 to avoid future potential redundant work.
Returning NULL here will signal end of iteration but user
space can still call read()...

> +			return NULL;

> +		}

> +

> +		if (curr_tid != info->tid) {

> +			info->tid = curr_tid;

> +			new_task = true;

> +		}

> +

> +		if (!curr_task->mm)

> +			goto next_task;

> +		info->task = curr_task;

> +	}

> +

> +	mmap_read_lock(curr_task->mm);

> +	if (new_task) {

> +		vma = curr_task->mm->mmap;

> +	} else {

> +		/* We drop the lock between each iteration, so it is

> +		 * necessary to use find_vma() to find the next vma. This

> +		 * is similar to the mechanism in show_smaps_rollup().

> +		 */

> +		vma = find_vma(curr_task->mm, info->vma.end - 1);

> +		/* same vma as previous iteration, use vma->next */

> +		if (vma && (vma->vm_start == info->vma.start))

> +			vma = vma->vm_next;


We may have some issues here if control is returned to user space
in the middle of iterations. For example,
    - seq_ops->next() sets info->vma properly (say corresponds to vma1 
of tid1)
    - control returns to user space
    - control backs to kernel and this is not a new task since
      tid is the same
    - but we skipped this vma for show().

I think the above skipping should be guarded. If the function
is called from seq_ops->next(), yes it can be skipped.
If the function is called from seq_ops->start(), it should not
be skipped.

Could you double check such a scenario with a smaller buffer
size for read() in user space?

> +	}

> +	if (!vma) {

> +		mmap_read_unlock(curr_task->mm);

> +		goto next_task;

> +	}

> +	info->task = curr_task;

> +	info->vma.start = vma->vm_start;

> +	info->vma.end = vma->vm_end;

> +	info->vma.pgoff = vma->vm_pgoff;

> +	info->vma.flags = vma->vm_flags;

> +	if (vma->vm_file)

> +		info->file = get_file(vma->vm_file);

> +	mmap_read_unlock(curr_task->mm);

> +	return &info->vma;

> +

> +next_task:

> +	put_task_struct(curr_task);

> +	info->task = NULL;

> +	curr_tid = ++(info->tid);

> +	goto again;

> +}

> +

> +static void *task_vma_seq_start(struct seq_file *seq, loff_t *pos)

> +{

> +	struct bpf_iter_seq_task_vma_info *info = seq->private;

> +	struct __vm_area_struct *vma;

> +

> +	vma = task_vma_seq_get_next(info);

> +	if (vma && *pos == 0)

> +		++*pos;

> +

> +	return vma;

> +}

> +

> +static void *task_vma_seq_next(struct seq_file *seq, void *v, loff_t *pos)

> +{

> +	struct bpf_iter_seq_task_vma_info *info = seq->private;

> +

> +	++*pos;

> +	if (info->file) {

> +		fput(info->file);

> +		info->file = NULL;

> +	}

> +	return task_vma_seq_get_next(info);

> +}

> +

> +struct bpf_iter__task_vma {

> +	__bpf_md_ptr(struct bpf_iter_meta *, meta);

> +	__bpf_md_ptr(struct task_struct *, task);

> +	__bpf_md_ptr(struct __vm_area_struct *, vma);

> +	__bpf_md_ptr(struct file *, file);

> +};

> +

> +DEFINE_BPF_ITER_FUNC(task_vma, struct bpf_iter_meta *meta,

> +		     struct task_struct *task, struct __vm_area_struct *vma,

> +		     struct file *file)

> +

> +static int __task_vma_seq_show(struct seq_file *seq, bool in_stop)

> +{

> +	struct bpf_iter_seq_task_vma_info *info = seq->private;

> +	struct bpf_iter__task_vma ctx;

> +	struct bpf_iter_meta meta;

> +	struct bpf_prog *prog;

> +

> +	meta.seq = seq;

> +	prog = bpf_iter_get_info(&meta, in_stop);

> +	if (!prog)

> +		return 0;

> +

> +	ctx.meta = &meta;

> +	ctx.task = info->task;

> +	ctx.vma = &info->vma;

> +	ctx.file = info->file;

> +	return bpf_iter_run_prog(prog, &ctx);

> +}

> +

> +static int task_vma_seq_show(struct seq_file *seq, void *v)

> +{

> +	return __task_vma_seq_show(seq, false);

> +}

> +

> +static void task_vma_seq_stop(struct seq_file *seq, void *v)

> +{

> +	struct bpf_iter_seq_task_vma_info *info = seq->private;

> +

> +	if (!v) {

> +		(void)__task_vma_seq_show(seq, true);

> +	} else {

> +		put_task_struct(info->task);

> +		if (info->file) {

> +			fput(info->file);

> +			info->file = NULL;

> +		}

> +		info->task = NULL;

> +	}

> +}

> +

[...]

Song Liu Dec. 16, 2020, 7:41 p.m. UTC | #2

> On Dec 16, 2020, at 9:36 AM, Yonghong Song <yhs@fb.com> wrote:

> 

[...]
>> +	if (info->task) {

>> +		curr_task = info->task;

>> +	} else {

>> +		curr_task = task_seq_get_next(ns, &curr_tid, true);

>> +		if (!curr_task) {

>> +			info->task = NULL;

>> +			info->tid++;

> 

> Here, info->tid should be info->tid = curr_tid + 1.

> For exmaple, suppose initial curr_tid = info->tid = 10, and the

> above task_seq_get_next(...) returns NULL with curr_tid = 100

> which means tid = 100 has been visited. So we would like

> to set info->tid = 101 to avoid future potential redundant work.

> Returning NULL here will signal end of iteration but user

> space can still call read()...


Make sense. Let me fix. 

> 

>> +			return NULL;

>> +		}

>> +

>> +		if (curr_tid != info->tid) {

>> +			info->tid = curr_tid;

>> +			new_task = true;

>> +		}

>> +

>> +		if (!curr_task->mm)

>> +			goto next_task;

>> +		info->task = curr_task;

>> +	}

>> +

>> +	mmap_read_lock(curr_task->mm);

>> +	if (new_task) {

>> +		vma = curr_task->mm->mmap;

>> +	} else {

>> +		/* We drop the lock between each iteration, so it is

>> +		 * necessary to use find_vma() to find the next vma. This

>> +		 * is similar to the mechanism in show_smaps_rollup().

>> +		 */

>> +		vma = find_vma(curr_task->mm, info->vma.end - 1);

>> +		/* same vma as previous iteration, use vma->next */

>> +		if (vma && (vma->vm_start == info->vma.start))

>> +			vma = vma->vm_next;

> 

> We may have some issues here if control is returned to user space

> in the middle of iterations. For example,

>   - seq_ops->next() sets info->vma properly (say corresponds to vma1 of tid1)

>   - control returns to user space

>   - control backs to kernel and this is not a new task since

>     tid is the same

>   - but we skipped this vma for show().

> 

> I think the above skipping should be guarded. If the function

> is called from seq_ops->next(), yes it can be skipped.

> If the function is called from seq_ops->start(), it should not

> be skipped.

> 

> Could you double check such a scenario with a smaller buffer

> size for read() in user space?


Yeah, this appeared to be a problem... Thanks for catching it! But I 
am not sure (yet) how to fix it. Let me think more about it. 

Thanks,
Song

Andrii Nakryiko Dec. 17, 2020, 12:34 a.m. UTC | #3

On Tue, Dec 15, 2020 at 3:37 PM Song Liu <songliubraving@fb.com> wrote:
>

> Introduce task_vma bpf_iter to print memory information of a process. It

> can be used to print customized information similar to /proc/<pid>/maps.

>

> task_vma iterator releases mmap_lock before calling the BPF program.

> Therefore, we cannot pass vm_area_struct directly to the BPF program. A

> new __vm_area_struct is introduced to keep key information of a vma. On

> each iteration, task_vma gathers information in __vm_area_struct and

> passes it to the BPF program.

>

> If the vma maps to a file, task_vma also holds a reference to the file

> while calling the BPF program.

>

> Signed-off-by: Song Liu <songliubraving@fb.com>

> ---

>  include/linux/bpf.h    |   2 +-

>  kernel/bpf/task_iter.c | 205 ++++++++++++++++++++++++++++++++++++++++-

>  2 files changed, 205 insertions(+), 2 deletions(-)

>

> diff --git a/include/linux/bpf.h b/include/linux/bpf.h

> index 07cb5d15e7439..49dd1e29c8118 100644

> --- a/include/linux/bpf.h

> +++ b/include/linux/bpf.h

> @@ -1325,7 +1325,7 @@ enum bpf_iter_feature {

>         BPF_ITER_RESCHED        = BIT(0),

>  };

>

> -#define BPF_ITER_CTX_ARG_MAX 2

> +#define BPF_ITER_CTX_ARG_MAX 3

>  struct bpf_iter_reg {

>         const char *target;

>         bpf_iter_attach_target_t attach_target;

> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c

> index 0458a40edf10a..15a066b442f75 100644

> --- a/kernel/bpf/task_iter.c

> +++ b/kernel/bpf/task_iter.c

> @@ -304,9 +304,183 @@ static const struct seq_operations task_file_seq_ops = {

>         .show   = task_file_seq_show,

>  };

>

> +/*

> + * Key information from vm_area_struct. We need this because we cannot

> + * assume the vm_area_struct is still valid after each iteration.

> + */

> +struct __vm_area_struct {

> +       __u64 start;

> +       __u64 end;

> +       __u64 flags;

> +       __u64 pgoff;


I'd keep the original names of the fields (vm_start, vm_end, etc). But
there are some more fields which seem useful, like vm_page_prot,
vm_mm, etc.

It's quite unfortunate, actually, that bpf_iter program doesn't get
access to the real vm_area_struct, because it won't be able to do much
beyond using fields that we pre-defined here. E.g., there could be
interesting things to do with vm_mm, but unfortunately it won't be
possible.

Is there any way to still provide access to the original
vm_area_struct and let BPF programs use BTF magic to follow all those
pointers (like vm_mm) safely?

> +};

> +


[...]

Song Liu Dec. 17, 2020, 1:51 a.m. UTC | #4

> On Dec 16, 2020, at 4:34 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> 

> On Tue, Dec 15, 2020 at 3:37 PM Song Liu <songliubraving@fb.com> wrote:

>> 

>> Introduce task_vma bpf_iter to print memory information of a process. It

>> can be used to print customized information similar to /proc/<pid>/maps.

>> 

>> task_vma iterator releases mmap_lock before calling the BPF program.

>> Therefore, we cannot pass vm_area_struct directly to the BPF program. A

>> new __vm_area_struct is introduced to keep key information of a vma. On

>> each iteration, task_vma gathers information in __vm_area_struct and

>> passes it to the BPF program.

>> 

>> If the vma maps to a file, task_vma also holds a reference to the file

>> while calling the BPF program.

>> 

>> Signed-off-by: Song Liu <songliubraving@fb.com>

>> ---

>> include/linux/bpf.h    |   2 +-

>> kernel/bpf/task_iter.c | 205 ++++++++++++++++++++++++++++++++++++++++-

>> 2 files changed, 205 insertions(+), 2 deletions(-)

>> 

>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h

>> index 07cb5d15e7439..49dd1e29c8118 100644

>> --- a/include/linux/bpf.h

>> +++ b/include/linux/bpf.h

>> @@ -1325,7 +1325,7 @@ enum bpf_iter_feature {

>>        BPF_ITER_RESCHED        = BIT(0),

>> };

>> 

>> -#define BPF_ITER_CTX_ARG_MAX 2

>> +#define BPF_ITER_CTX_ARG_MAX 3

>> struct bpf_iter_reg {

>>        const char *target;

>>        bpf_iter_attach_target_t attach_target;

>> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c

>> index 0458a40edf10a..15a066b442f75 100644

>> --- a/kernel/bpf/task_iter.c

>> +++ b/kernel/bpf/task_iter.c

>> @@ -304,9 +304,183 @@ static const struct seq_operations task_file_seq_ops = {

>>        .show   = task_file_seq_show,

>> };

>> 

>> +/*

>> + * Key information from vm_area_struct. We need this because we cannot

>> + * assume the vm_area_struct is still valid after each iteration.

>> + */

>> +struct __vm_area_struct {

>> +       __u64 start;

>> +       __u64 end;

>> +       __u64 flags;

>> +       __u64 pgoff;

> 

> I'd keep the original names of the fields (vm_start, vm_end, etc).


I thought about the names. Unlike the kernel fs/mm code, where there
are many different start/end/offset/flags, the prefix doesn't seem to 
be helpful in the BPF programs. In fact, it is probably easier for 
the developers to differentiate __vm_area_struct->start and 
vm_area_struct->vm_start.

Also, we have bpf_iter__task_vma->file, which is the same as 
vm_area_struct->vm_file. If we prefix __vm_area_struct members, it 
will be a little weird to name it either "vm_file" or "file".

Given these reasons, I decided to not have vm_ prefix. Does this make
sense? 

> But

> there are some more fields which seem useful, like vm_page_prot,

> vm_mm, etc.


vm_page_prot doesn't really provide extra information than vm_flags. 
Please refer to mm/mmap.c vm_get_page_prot(). 

We have the vm_mm in task->mm, so no need to add it to __vm_area_struct.

> 

> It's quite unfortunate, actually, that bpf_iter program doesn't get

> access to the real vm_area_struct, because it won't be able to do much

> beyond using fields that we pre-defined here. E.g., there could be

> interesting things to do with vm_mm, but unfortunately it won't be

> possible.

> 

> Is there any way to still provide access to the original

> vm_area_struct and let BPF programs use BTF magic to follow all those

> pointers (like vm_mm) safely?


We hold a reference to task, and the task holds a reference to task->mm,
so we can safely probe_read information in mm_struct, like the page 
table. 

Thanks,
Song

Alexei Starovoitov Dec. 17, 2020, 7:03 p.m. UTC | #5

On Tue, Dec 15, 2020 at 03:36:59PM -0800, Song Liu wrote:
> +/*

> + * Key information from vm_area_struct. We need this because we cannot

> + * assume the vm_area_struct is still valid after each iteration.

> + */

> +struct __vm_area_struct {

> +	__u64 start;

> +	__u64 end;

> +	__u64 flags;

> +	__u64 pgoff;

> +};


Where it's inside .c or exposed in uapi/bpf.h it will become uapi
if it's used this way. Let's switch to arbitrary BTF-based access instead.

> +static struct __vm_area_struct *

> +task_vma_seq_get_next(struct bpf_iter_seq_task_vma_info *info)

> +{

> +	struct pid_namespace *ns = info->common.ns;

> +	struct task_struct *curr_task;

> +	struct vm_area_struct *vma;

> +	u32 curr_tid = info->tid;

> +	bool new_task = false;

> +

> +	/* If this function returns a non-NULL __vm_area_struct, it held

> +	 * a reference to the task_struct. If info->file is non-NULL, it

> +	 * also holds a reference to the file. If this function returns

> +	 * NULL, it does not hold any reference.

> +	 */

> +again:

> +	if (info->task) {

> +		curr_task = info->task;

> +	} else {

> +		curr_task = task_seq_get_next(ns, &curr_tid, true);

> +		if (!curr_task) {

> +			info->task = NULL;

> +			info->tid++;

> +			return NULL;

> +		}

> +

> +		if (curr_tid != info->tid) {

> +			info->tid = curr_tid;

> +			new_task = true;

> +		}

> +

> +		if (!curr_task->mm)

> +			goto next_task;

> +		info->task = curr_task;

> +	}

> +

> +	mmap_read_lock(curr_task->mm);


That will hurt. /proc readers do that and it causes all sorts
of production issues. We cannot take this lock.
There is no need to take it.
Switch the whole thing to probe_read style walking.
And reimplement find_vma with probe_read while omitting vmacache.
It will be short rbtree walk.
bpf prog doesn't need to see a stable struct. It will read it through ptr_to_btf_id
which will use probe_read equivalent underneath.

Song Liu Dec. 17, 2020, 10:08 p.m. UTC | #6

> On Dec 17, 2020, at 11:03 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> 

> On Tue, Dec 15, 2020 at 03:36:59PM -0800, Song Liu wrote:

>> +/*

>> + * Key information from vm_area_struct. We need this because we cannot

>> + * assume the vm_area_struct is still valid after each iteration.

>> + */

>> +struct __vm_area_struct {

>> +	__u64 start;

>> +	__u64 end;

>> +	__u64 flags;

>> +	__u64 pgoff;

>> +};

> 

> Where it's inside .c or exposed in uapi/bpf.h it will become uapi

> if it's used this way. Let's switch to arbitrary BTF-based access instead.

> 

>> +static struct __vm_area_struct *

>> +task_vma_seq_get_next(struct bpf_iter_seq_task_vma_info *info)

>> +{

>> +	struct pid_namespace *ns = info->common.ns;

>> +	struct task_struct *curr_task;

>> +	struct vm_area_struct *vma;

>> +	u32 curr_tid = info->tid;

>> +	bool new_task = false;

>> +

>> +	/* If this function returns a non-NULL __vm_area_struct, it held

>> +	 * a reference to the task_struct. If info->file is non-NULL, it

>> +	 * also holds a reference to the file. If this function returns

>> +	 * NULL, it does not hold any reference.

>> +	 */

>> +again:

>> +	if (info->task) {

>> +		curr_task = info->task;

>> +	} else {

>> +		curr_task = task_seq_get_next(ns, &curr_tid, true);

>> +		if (!curr_task) {

>> +			info->task = NULL;

>> +			info->tid++;

>> +			return NULL;

>> +		}

>> +

>> +		if (curr_tid != info->tid) {

>> +			info->tid = curr_tid;

>> +			new_task = true;

>> +		}

>> +

>> +		if (!curr_task->mm)

>> +			goto next_task;

>> +		info->task = curr_task;

>> +	}

>> +

>> +	mmap_read_lock(curr_task->mm);

> 

> That will hurt. /proc readers do that and it causes all sorts

> of production issues. We cannot take this lock.

> There is no need to take it.

> Switch the whole thing to probe_read style walking.

> And reimplement find_vma with probe_read while omitting vmacache.

> It will be short rbtree walk.

> bpf prog doesn't need to see a stable struct. It will read it through ptr_to_btf_id

> which will use probe_read equivalent underneath.


rw_semaphore is designed to avoid write starvation, so read_lock should not cause
problem unless the lock was taken for extended period. [1] was a recent fix that 
avoids /proc issue by releasing mmap_lock between iterations. We are using similar
mechanism here. BTW: I changed this to mmap_read_lock_killable() in the next version. 

On the other hand, we need a valid vm_file pointer for bpf_d_path. So walking the 
rbtree without taking any lock would not work. We can avoid taking the lock when 
some SPF like mechanism merged (hopefully soonish). 

Did I miss anything? 

We can improve bpf_iter with some mechanism to specify which task to iterate, so 
that we don't have to iterate through all tasks when the user only want to inspect 
vmas in one task. 

Thanks,
Song

[1] ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock")

Alexei Starovoitov Dec. 18, 2020, 2:34 a.m. UTC | #7

On Thu, Dec 17, 2020 at 10:08:31PM +0000, Song Liu wrote:
> 

> 

> > On Dec 17, 2020, at 11:03 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> > 

> > On Tue, Dec 15, 2020 at 03:36:59PM -0800, Song Liu wrote:

> >> +/*

> >> + * Key information from vm_area_struct. We need this because we cannot

> >> + * assume the vm_area_struct is still valid after each iteration.

> >> + */

> >> +struct __vm_area_struct {

> >> +	__u64 start;

> >> +	__u64 end;

> >> +	__u64 flags;

> >> +	__u64 pgoff;

> >> +};

> > 

> > Where it's inside .c or exposed in uapi/bpf.h it will become uapi

> > if it's used this way. Let's switch to arbitrary BTF-based access instead.

> > 

> >> +static struct __vm_area_struct *

> >> +task_vma_seq_get_next(struct bpf_iter_seq_task_vma_info *info)

> >> +{

> >> +	struct pid_namespace *ns = info->common.ns;

> >> +	struct task_struct *curr_task;

> >> +	struct vm_area_struct *vma;

> >> +	u32 curr_tid = info->tid;

> >> +	bool new_task = false;

> >> +

> >> +	/* If this function returns a non-NULL __vm_area_struct, it held

> >> +	 * a reference to the task_struct. If info->file is non-NULL, it

> >> +	 * also holds a reference to the file. If this function returns

> >> +	 * NULL, it does not hold any reference.

> >> +	 */

> >> +again:

> >> +	if (info->task) {

> >> +		curr_task = info->task;

> >> +	} else {

> >> +		curr_task = task_seq_get_next(ns, &curr_tid, true);

> >> +		if (!curr_task) {

> >> +			info->task = NULL;

> >> +			info->tid++;

> >> +			return NULL;

> >> +		}

> >> +

> >> +		if (curr_tid != info->tid) {

> >> +			info->tid = curr_tid;

> >> +			new_task = true;

> >> +		}

> >> +

> >> +		if (!curr_task->mm)

> >> +			goto next_task;

> >> +		info->task = curr_task;

> >> +	}

> >> +

> >> +	mmap_read_lock(curr_task->mm);

> > 

> > That will hurt. /proc readers do that and it causes all sorts

> > of production issues. We cannot take this lock.

> > There is no need to take it.

> > Switch the whole thing to probe_read style walking.

> > And reimplement find_vma with probe_read while omitting vmacache.

> > It will be short rbtree walk.

> > bpf prog doesn't need to see a stable struct. It will read it through ptr_to_btf_id

> > which will use probe_read equivalent underneath.

> 

> rw_semaphore is designed to avoid write starvation, so read_lock should not cause

> problem unless the lock was taken for extended period. [1] was a recent fix that 

> avoids /proc issue by releasing mmap_lock between iterations. We are using similar

> mechanism here. BTW: I changed this to mmap_read_lock_killable() in the next version. 

> 

> On the other hand, we need a valid vm_file pointer for bpf_d_path. So walking the 


ahh. I missed that. Makes sense.
vm_file needs to be accurate, but vm_area_struct should be accessed as ptr_to_btf_id.

> rbtree without taking any lock would not work. We can avoid taking the lock when 

> some SPF like mechanism merged (hopefully soonish). 

> 

> Did I miss anything? 

> 

> We can improve bpf_iter with some mechanism to specify which task to iterate, so 

> that we don't have to iterate through all tasks when the user only want to inspect 

> vmas in one task. 


yes. let's figure out how to make it parametrizable.
map_iter runs only for given map_fd.
Maybe vma_iter should run only for given pidfd?
I think all_task_all_vmas iter is nice to have, but we don't really need it?

> Thanks,

> Song

> 

> [1] ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock")


Thanks for this link. With "if (mmap_lock_is_contended())" check it should work indeed.

Yonghong Song Dec. 18, 2020, 3:15 a.m. UTC | #8

On 12/17/20 6:34 PM, Alexei Starovoitov wrote:
> On Thu, Dec 17, 2020 at 10:08:31PM +0000, Song Liu wrote:

>>

>>

>>> On Dec 17, 2020, at 11:03 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

>>>

>>> On Tue, Dec 15, 2020 at 03:36:59PM -0800, Song Liu wrote:

>>>> +/*

>>>> + * Key information from vm_area_struct. We need this because we cannot

>>>> + * assume the vm_area_struct is still valid after each iteration.

>>>> + */

>>>> +struct __vm_area_struct {

>>>> +	__u64 start;

>>>> +	__u64 end;

>>>> +	__u64 flags;

>>>> +	__u64 pgoff;

>>>> +};

>>>

>>> Where it's inside .c or exposed in uapi/bpf.h it will become uapi

>>> if it's used this way. Let's switch to arbitrary BTF-based access instead.

>>>

>>>> +static struct __vm_area_struct *

>>>> +task_vma_seq_get_next(struct bpf_iter_seq_task_vma_info *info)

>>>> +{

>>>> +	struct pid_namespace *ns = info->common.ns;

>>>> +	struct task_struct *curr_task;

>>>> +	struct vm_area_struct *vma;

>>>> +	u32 curr_tid = info->tid;

>>>> +	bool new_task = false;

>>>> +

>>>> +	/* If this function returns a non-NULL __vm_area_struct, it held

>>>> +	 * a reference to the task_struct. If info->file is non-NULL, it

>>>> +	 * also holds a reference to the file. If this function returns

>>>> +	 * NULL, it does not hold any reference.

>>>> +	 */

>>>> +again:

>>>> +	if (info->task) {

>>>> +		curr_task = info->task;

>>>> +	} else {

>>>> +		curr_task = task_seq_get_next(ns, &curr_tid, true);

>>>> +		if (!curr_task) {

>>>> +			info->task = NULL;

>>>> +			info->tid++;

>>>> +			return NULL;

>>>> +		}

>>>> +

>>>> +		if (curr_tid != info->tid) {

>>>> +			info->tid = curr_tid;

>>>> +			new_task = true;

>>>> +		}

>>>> +

>>>> +		if (!curr_task->mm)

>>>> +			goto next_task;

>>>> +		info->task = curr_task;

>>>> +	}

>>>> +

>>>> +	mmap_read_lock(curr_task->mm);

>>>

>>> That will hurt. /proc readers do that and it causes all sorts

>>> of production issues. We cannot take this lock.

>>> There is no need to take it.

>>> Switch the whole thing to probe_read style walking.

>>> And reimplement find_vma with probe_read while omitting vmacache.

>>> It will be short rbtree walk.

>>> bpf prog doesn't need to see a stable struct. It will read it through ptr_to_btf_id

>>> which will use probe_read equivalent underneath.

>>

>> rw_semaphore is designed to avoid write starvation, so read_lock should not cause

>> problem unless the lock was taken for extended period. [1] was a recent fix that

>> avoids /proc issue by releasing mmap_lock between iterations. We are using similar

>> mechanism here. BTW: I changed this to mmap_read_lock_killable() in the next version.

>>

>> On the other hand, we need a valid vm_file pointer for bpf_d_path. So walking the

> 

> ahh. I missed that. Makes sense.

> vm_file needs to be accurate, but vm_area_struct should be accessed as ptr_to_btf_id.

> 

>> rbtree without taking any lock would not work. We can avoid taking the lock when

>> some SPF like mechanism merged (hopefully soonish).

>>

>> Did I miss anything?

>>

>> We can improve bpf_iter with some mechanism to specify which task to iterate, so

>> that we don't have to iterate through all tasks when the user only want to inspect

>> vmas in one task.

> 

> yes. let's figure out how to make it parametrizable.

> map_iter runs only for given map_fd.

> Maybe vma_iter should run only for given pidfd?

> I think all_task_all_vmas iter is nice to have, but we don't really need it?


We could make pidfd optional? If pidfd == 0, all vmas of all tasks will 
be traversed. Otherwise, pidfd > 0, only vmas of that pidfd will be 
traversed.

> 

>> Thanks,

>> Song

>>

>> [1] ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock")

> 

> Thanks for this link. With "if (mmap_lock_is_contended())" check it should work indeed.

>

Song Liu Dec. 18, 2020, 4:33 a.m. UTC | #9

> On Dec 17, 2020, at 6:34 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> 

> On Thu, Dec 17, 2020 at 10:08:31PM +0000, Song Liu wrote:

>> 

>> 

>>> On Dec 17, 2020, at 11:03 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

>>> 

>>> On Tue, Dec 15, 2020 at 03:36:59PM -0800, Song Liu wrote:

>>>> +/*

>>>> + * Key information from vm_area_struct. We need this because we cannot

>>>> + * assume the vm_area_struct is still valid after each iteration.

>>>> + */

>>>> +struct __vm_area_struct {

>>>> +	__u64 start;

>>>> +	__u64 end;

>>>> +	__u64 flags;

>>>> +	__u64 pgoff;

>>>> +};

>>> 

>>> Where it's inside .c or exposed in uapi/bpf.h it will become uapi

>>> if it's used this way. Let's switch to arbitrary BTF-based access instead.

>>> 

>>>> +static struct __vm_area_struct *

>>>> +task_vma_seq_get_next(struct bpf_iter_seq_task_vma_info *info)

>>>> +{

>>>> +	struct pid_namespace *ns = info->common.ns;

>>>> +	struct task_struct *curr_task;

>>>> +	struct vm_area_struct *vma;

>>>> +	u32 curr_tid = info->tid;

>>>> +	bool new_task = false;

>>>> +

>>>> +	/* If this function returns a non-NULL __vm_area_struct, it held

>>>> +	 * a reference to the task_struct. If info->file is non-NULL, it

>>>> +	 * also holds a reference to the file. If this function returns

>>>> +	 * NULL, it does not hold any reference.

>>>> +	 */

>>>> +again:

>>>> +	if (info->task) {

>>>> +		curr_task = info->task;

>>>> +	} else {

>>>> +		curr_task = task_seq_get_next(ns, &curr_tid, true);

>>>> +		if (!curr_task) {

>>>> +			info->task = NULL;

>>>> +			info->tid++;

>>>> +			return NULL;

>>>> +		}

>>>> +

>>>> +		if (curr_tid != info->tid) {

>>>> +			info->tid = curr_tid;

>>>> +			new_task = true;

>>>> +		}

>>>> +

>>>> +		if (!curr_task->mm)

>>>> +			goto next_task;

>>>> +		info->task = curr_task;

>>>> +	}

>>>> +

>>>> +	mmap_read_lock(curr_task->mm);

>>> 

>>> That will hurt. /proc readers do that and it causes all sorts

>>> of production issues. We cannot take this lock.

>>> There is no need to take it.

>>> Switch the whole thing to probe_read style walking.

>>> And reimplement find_vma with probe_read while omitting vmacache.

>>> It will be short rbtree walk.

>>> bpf prog doesn't need to see a stable struct. It will read it through ptr_to_btf_id

>>> which will use probe_read equivalent underneath.

>> 

>> rw_semaphore is designed to avoid write starvation, so read_lock should not cause

>> problem unless the lock was taken for extended period. [1] was a recent fix that 

>> avoids /proc issue by releasing mmap_lock between iterations. We are using similar

>> mechanism here. BTW: I changed this to mmap_read_lock_killable() in the next version. 

>> 

>> On the other hand, we need a valid vm_file pointer for bpf_d_path. So walking the 

> 

> ahh. I missed that. Makes sense.

> vm_file needs to be accurate, but vm_area_struct should be accessed as ptr_to_btf_id.


Passing pointer of vm_area_struct into BPF will be tricky. For example, shall we 
allow the user to access vma->vm_file? IIUC, with ptr_to_btf_id the verifier will
allow access of vma->vm_file as a valid pointer to struct file. However, since the
vma might be freed, vma->vm_file could point to random data. 

> 

>> rbtree without taking any lock would not work. We can avoid taking the lock when 

>> some SPF like mechanism merged (hopefully soonish). 

>> 

>> Did I miss anything? 

>> 

>> We can improve bpf_iter with some mechanism to specify which task to iterate, so 

>> that we don't have to iterate through all tasks when the user only want to inspect 

>> vmas in one task. 

> 

> yes. let's figure out how to make it parametrizable.

> map_iter runs only for given map_fd.

> Maybe vma_iter should run only for given pidfd?

> I think all_task_all_vmas iter is nice to have, but we don't really need it?

> 

>> Thanks,

>> Song

>> 

>> [1] ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock")

> 

> Thanks for this link. With "if (mmap_lock_is_contended())" check it should work indeed.


To make sure we are on the same page: I am using slightly different mechanism in 
task_vma_iter, which doesn't require checking mmap_lock_is_contended(). In the 
smaps_rollup case, the code only unlock mmap_sem when the lock is contended. In 
task_iter, we always unlock mmap_sem between two iterations. This is because we 
don't want to hold mmap_sem while calling the BPF program, which may sleep (calling
bpf_d_path). 

Thanks,
Song

Alexei Starovoitov Dec. 18, 2020, 5:23 a.m. UTC | #10

On Thu, Dec 17, 2020 at 8:33 PM Song Liu <songliubraving@fb.com> wrote:
> >

> > ahh. I missed that. Makes sense.

> > vm_file needs to be accurate, but vm_area_struct should be accessed as ptr_to_btf_id.

>

> Passing pointer of vm_area_struct into BPF will be tricky. For example, shall we

> allow the user to access vma->vm_file? IIUC, with ptr_to_btf_id the verifier will

> allow access of vma->vm_file as a valid pointer to struct file. However, since the

> vma might be freed, vma->vm_file could point to random data.


I don't think so. The proposed patch will do get_file() on it.
There is actually no need to assign it into a different variable.
Accessing it via vma->vm_file is safe and cleaner.

> >> [1] ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock")

> >

> > Thanks for this link. With "if (mmap_lock_is_contended())" check it should work indeed.

>

> To make sure we are on the same page: I am using slightly different mechanism in

> task_vma_iter, which doesn't require checking mmap_lock_is_contended(). In the

> smaps_rollup case, the code only unlock mmap_sem when the lock is contended. In

> task_iter, we always unlock mmap_sem between two iterations. This is because we

> don't want to hold mmap_sem while calling the BPF program, which may sleep (calling

> bpf_d_path).


That part is clear. I had to look into mmap_read_lock_killable() implementation
to realize that it's checking for lock_is_contended after acquiring
and releasing
if there is a contention. So it's the same behavior at the end.

Yonghong Song Dec. 18, 2020, 4:38 p.m. UTC | #11

On 12/17/20 9:23 PM, Alexei Starovoitov wrote:
> On Thu, Dec 17, 2020 at 8:33 PM Song Liu <songliubraving@fb.com> wrote:

>>>

>>> ahh. I missed that. Makes sense.

>>> vm_file needs to be accurate, but vm_area_struct should be accessed as ptr_to_btf_id.

>>

>> Passing pointer of vm_area_struct into BPF will be tricky. For example, shall we

>> allow the user to access vma->vm_file? IIUC, with ptr_to_btf_id the verifier will

>> allow access of vma->vm_file as a valid pointer to struct file. However, since the

>> vma might be freed, vma->vm_file could point to random data.

> 

> I don't think so. The proposed patch will do get_file() on it.

> There is actually no need to assign it into a different variable.

> Accessing it via vma->vm_file is safe and cleaner.


I did not check the code but do you have scenarios where vma is freed 
but old vma->vm_file is not freed due to reference counting, but
freed vma area is reused so vma->vm_file could be garbage?

> 

>>>> [1] ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock")

>>>

>>> Thanks for this link. With "if (mmap_lock_is_contended())" check it should work indeed.

>>

>> To make sure we are on the same page: I am using slightly different mechanism in

>> task_vma_iter, which doesn't require checking mmap_lock_is_contended(). In the

>> smaps_rollup case, the code only unlock mmap_sem when the lock is contended. In

>> task_iter, we always unlock mmap_sem between two iterations. This is because we

>> don't want to hold mmap_sem while calling the BPF program, which may sleep (calling

>> bpf_d_path).

> 

> That part is clear. I had to look into mmap_read_lock_killable() implementation

> to realize that it's checking for lock_is_contended after acquiring

> and releasing

> if there is a contention. So it's the same behavior at the end.

>

Song Liu Dec. 18, 2020, 5:23 p.m. UTC | #12

> On Dec 18, 2020, at 8:38 AM, Yonghong Song <yhs@fb.com> wrote:

> 

> 

> 

> On 12/17/20 9:23 PM, Alexei Starovoitov wrote:

>> On Thu, Dec 17, 2020 at 8:33 PM Song Liu <songliubraving@fb.com> wrote:

>>>> 

>>>> ahh. I missed that. Makes sense.

>>>> vm_file needs to be accurate, but vm_area_struct should be accessed as ptr_to_btf_id.

>>> 

>>> Passing pointer of vm_area_struct into BPF will be tricky. For example, shall we

>>> allow the user to access vma->vm_file? IIUC, with ptr_to_btf_id the verifier will

>>> allow access of vma->vm_file as a valid pointer to struct file. However, since the

>>> vma might be freed, vma->vm_file could point to random data.

>> I don't think so. The proposed patch will do get_file() on it.

>> There is actually no need to assign it into a different variable.

>> Accessing it via vma->vm_file is safe and cleaner.

> 

> I did not check the code but do you have scenarios where vma is freed but old vma->vm_file is not freed due to reference counting, but

> freed vma area is reused so vma->vm_file could be garbage?


AFAIK, once we unlock mmap_sem, the vma could be freed and reused. I guess ptr_to_btf_id
or probe_read would not help with this?

Thanks,
Song

> 

>>>>> [1] ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock")

>>>> 

>>>> Thanks for this link. With "if (mmap_lock_is_contended())" check it should work indeed.

>>> 

>>> To make sure we are on the same page: I am using slightly different mechanism in

>>> task_vma_iter, which doesn't require checking mmap_lock_is_contended(). In the

>>> smaps_rollup case, the code only unlock mmap_sem when the lock is contended. In

>>> task_iter, we always unlock mmap_sem between two iterations. This is because we

>>> don't want to hold mmap_sem while calling the BPF program, which may sleep (calling

>>> bpf_d_path).

>> That part is clear. I had to look into mmap_read_lock_killable() implementation

>> to realize that it's checking for lock_is_contended after acquiring

>> and releasing

>> if there is a contention. So it's the same behavior at the end.

Alexei Starovoitov Jan. 5, 2021, 1:46 a.m. UTC | #13

On Fri, Dec 18, 2020 at 05:23:25PM +0000, Song Liu wrote:
> 

> 

> > On Dec 18, 2020, at 8:38 AM, Yonghong Song <yhs@fb.com> wrote:

> > 

> > 

> > 

> > On 12/17/20 9:23 PM, Alexei Starovoitov wrote:

> >> On Thu, Dec 17, 2020 at 8:33 PM Song Liu <songliubraving@fb.com> wrote:

> >>>> 

> >>>> ahh. I missed that. Makes sense.

> >>>> vm_file needs to be accurate, but vm_area_struct should be accessed as ptr_to_btf_id.

> >>> 

> >>> Passing pointer of vm_area_struct into BPF will be tricky. For example, shall we

> >>> allow the user to access vma->vm_file? IIUC, with ptr_to_btf_id the verifier will

> >>> allow access of vma->vm_file as a valid pointer to struct file. However, since the

> >>> vma might be freed, vma->vm_file could point to random data.

> >> I don't think so. The proposed patch will do get_file() on it.

> >> There is actually no need to assign it into a different variable.

> >> Accessing it via vma->vm_file is safe and cleaner.

> > 

> > I did not check the code but do you have scenarios where vma is freed but old vma->vm_file is not freed due to reference counting, but

> > freed vma area is reused so vma->vm_file could be garbage?

> 

> AFAIK, once we unlock mmap_sem, the vma could be freed and reused. I guess ptr_to_btf_id

> or probe_read would not help with this?


Theoretically we can hack the verifier to treat some ptr_to_btf_id as "less
valid" than the other ptr_to_btf_id, but the user experience will not be great.
Reading such bpf prog will not be obvious. I think it's better to run bpf prog
in mmap_lock then and let it access vma->vm_file. After prog finishes the iter
bit can do if (mmap_lock_is_contended()) before iterating. That will deliver
better performance too. Instead of task_vma_seq_get_next() doing
mmap_lock/unlock at every vma. No need for get_file() either. And no
__vm_area_struct exposure.

Song Liu Jan. 5, 2021, 5:47 a.m. UTC | #14

> On Jan 4, 2021, at 5:46 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> 

> On Fri, Dec 18, 2020 at 05:23:25PM +0000, Song Liu wrote:

>> 

>> 

>>> On Dec 18, 2020, at 8:38 AM, Yonghong Song <yhs@fb.com> wrote:

>>> 

>>> 

>>> 

>>> On 12/17/20 9:23 PM, Alexei Starovoitov wrote:

>>>> On Thu, Dec 17, 2020 at 8:33 PM Song Liu <songliubraving@fb.com> wrote:

>>>>>> 

>>>>>> ahh. I missed that. Makes sense.

>>>>>> vm_file needs to be accurate, but vm_area_struct should be accessed as ptr_to_btf_id.

>>>>> 

>>>>> Passing pointer of vm_area_struct into BPF will be tricky. For example, shall we

>>>>> allow the user to access vma->vm_file? IIUC, with ptr_to_btf_id the verifier will

>>>>> allow access of vma->vm_file as a valid pointer to struct file. However, since the

>>>>> vma might be freed, vma->vm_file could point to random data.

>>>> I don't think so. The proposed patch will do get_file() on it.

>>>> There is actually no need to assign it into a different variable.

>>>> Accessing it via vma->vm_file is safe and cleaner.

>>> 

>>> I did not check the code but do you have scenarios where vma is freed but old vma->vm_file is not freed due to reference counting, but

>>> freed vma area is reused so vma->vm_file could be garbage?

>> 

>> AFAIK, once we unlock mmap_sem, the vma could be freed and reused. I guess ptr_to_btf_id

>> or probe_read would not help with this?

> 

> Theoretically we can hack the verifier to treat some ptr_to_btf_id as "less

> valid" than the other ptr_to_btf_id, but the user experience will not be great.

> Reading such bpf prog will not be obvious. I think it's better to run bpf prog

> in mmap_lock then and let it access vma->vm_file. After prog finishes the iter

> bit can do if (mmap_lock_is_contended()) before iterating. That will deliver

> better performance too. Instead of task_vma_seq_get_next() doing

> mmap_lock/unlock at every vma. No need for get_file() either. And no

> __vm_area_struct exposure.


I think there might be concern calling BPF program with mmap_lock, especially that 
the program is sleepable (for bpf_d_path). It shouldn't be a problem for common 
cases, but I am not 100% sure for corner cases (many instructions in BPF + sleep). 
Current version is designed to be very safe for the workload, which might be too
conservative. 

Thanks,
Song

Alexei Starovoitov Jan. 5, 2021, 4:27 p.m. UTC | #15

On Mon, Jan 4, 2021 at 9:47 PM Song Liu <songliubraving@fb.com> wrote:
>

>

>

> > On Jan 4, 2021, at 5:46 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> >

> > On Fri, Dec 18, 2020 at 05:23:25PM +0000, Song Liu wrote:

> >>

> >>

> >>> On Dec 18, 2020, at 8:38 AM, Yonghong Song <yhs@fb.com> wrote:

> >>>

> >>>

> >>>

> >>> On 12/17/20 9:23 PM, Alexei Starovoitov wrote:

> >>>> On Thu, Dec 17, 2020 at 8:33 PM Song Liu <songliubraving@fb.com> wrote:

> >>>>>>

> >>>>>> ahh. I missed that. Makes sense.

> >>>>>> vm_file needs to be accurate, but vm_area_struct should be accessed as ptr_to_btf_id.

> >>>>>

> >>>>> Passing pointer of vm_area_struct into BPF will be tricky. For example, shall we

> >>>>> allow the user to access vma->vm_file? IIUC, with ptr_to_btf_id the verifier will

> >>>>> allow access of vma->vm_file as a valid pointer to struct file. However, since the

> >>>>> vma might be freed, vma->vm_file could point to random data.

> >>>> I don't think so. The proposed patch will do get_file() on it.

> >>>> There is actually no need to assign it into a different variable.

> >>>> Accessing it via vma->vm_file is safe and cleaner.

> >>>

> >>> I did not check the code but do you have scenarios where vma is freed but old vma->vm_file is not freed due to reference counting, but

> >>> freed vma area is reused so vma->vm_file could be garbage?

> >>

> >> AFAIK, once we unlock mmap_sem, the vma could be freed and reused. I guess ptr_to_btf_id

> >> or probe_read would not help with this?

> >

> > Theoretically we can hack the verifier to treat some ptr_to_btf_id as "less

> > valid" than the other ptr_to_btf_id, but the user experience will not be great.

> > Reading such bpf prog will not be obvious. I think it's better to run bpf prog

> > in mmap_lock then and let it access vma->vm_file. After prog finishes the iter

> > bit can do if (mmap_lock_is_contended()) before iterating. That will deliver

> > better performance too. Instead of task_vma_seq_get_next() doing

> > mmap_lock/unlock at every vma. No need for get_file() either. And no

> > __vm_area_struct exposure.

>

> I think there might be concern calling BPF program with mmap_lock, especially that

> the program is sleepable (for bpf_d_path). It shouldn't be a problem for common

> cases, but I am not 100% sure for corner cases (many instructions in BPF + sleep).

> Current version is designed to be very safe for the workload, which might be too

> conservative.


I know and I agree with all that, but how do you propose to fix the
vm_file concern
without holding the lock while prog is running? I couldn't come up with a way.

Song Liu Jan. 5, 2021, 5:10 p.m. UTC | #16

> On Jan 5, 2021, at 8:27 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> 

> On Mon, Jan 4, 2021 at 9:47 PM Song Liu <songliubraving@fb.com> wrote:

>> 

>> 

>> 

>>> On Jan 4, 2021, at 5:46 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

>>> 

>>> On Fri, Dec 18, 2020 at 05:23:25PM +0000, Song Liu wrote:

>>>> 

>>>> 

>>>>> On Dec 18, 2020, at 8:38 AM, Yonghong Song <yhs@fb.com> wrote:

>>>>> 

>>>>> 

>>>>> 

>>>>> On 12/17/20 9:23 PM, Alexei Starovoitov wrote:

>>>>>> On Thu, Dec 17, 2020 at 8:33 PM Song Liu <songliubraving@fb.com> wrote:

>>>>>>>> 

>>>>>>>> ahh. I missed that. Makes sense.

>>>>>>>> vm_file needs to be accurate, but vm_area_struct should be accessed as ptr_to_btf_id.

>>>>>>> 

>>>>>>> Passing pointer of vm_area_struct into BPF will be tricky. For example, shall we

>>>>>>> allow the user to access vma->vm_file? IIUC, with ptr_to_btf_id the verifier will

>>>>>>> allow access of vma->vm_file as a valid pointer to struct file. However, since the

>>>>>>> vma might be freed, vma->vm_file could point to random data.

>>>>>> I don't think so. The proposed patch will do get_file() on it.

>>>>>> There is actually no need to assign it into a different variable.

>>>>>> Accessing it via vma->vm_file is safe and cleaner.

>>>>> 

>>>>> I did not check the code but do you have scenarios where vma is freed but old vma->vm_file is not freed due to reference counting, but

>>>>> freed vma area is reused so vma->vm_file could be garbage?

>>>> 

>>>> AFAIK, once we unlock mmap_sem, the vma could be freed and reused. I guess ptr_to_btf_id

>>>> or probe_read would not help with this?

>>> 

>>> Theoretically we can hack the verifier to treat some ptr_to_btf_id as "less

>>> valid" than the other ptr_to_btf_id, but the user experience will not be great.

>>> Reading such bpf prog will not be obvious. I think it's better to run bpf prog

>>> in mmap_lock then and let it access vma->vm_file. After prog finishes the iter

>>> bit can do if (mmap_lock_is_contended()) before iterating. That will deliver

>>> better performance too. Instead of task_vma_seq_get_next() doing

>>> mmap_lock/unlock at every vma. No need for get_file() either. And no

>>> __vm_area_struct exposure.

>> 

>> I think there might be concern calling BPF program with mmap_lock, especially that

>> the program is sleepable (for bpf_d_path). It shouldn't be a problem for common

>> cases, but I am not 100% sure for corner cases (many instructions in BPF + sleep).

>> Current version is designed to be very safe for the workload, which might be too

>> conservative.

> 

> I know and I agree with all that, but how do you propose to fix the

> vm_file concern

> without holding the lock while prog is running? I couldn't come up with a way.


I guess the gap here is that I don't see why __vm_area_struct exposure is an issue. 
It is similar to __sk_buff, and simpler (though we had more reasons to introduce
__sk_buff back when there wasn't BTF). 

If we can accept __vm_area_struct, current version should work, as it doesn't have 
problem with vm_file. 

Thanks,
Song

Alexei Starovoitov Jan. 5, 2021, 5:27 p.m. UTC | #17

On Tue, Jan 5, 2021 at 9:11 AM Song Liu <songliubraving@fb.com> wrote:
>

>

>

> > On Jan 5, 2021, at 8:27 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> >

> > On Mon, Jan 4, 2021 at 9:47 PM Song Liu <songliubraving@fb.com> wrote:

> >>

> >>

> >>

> >>> On Jan 4, 2021, at 5:46 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> >>>

> >>> On Fri, Dec 18, 2020 at 05:23:25PM +0000, Song Liu wrote:

> >>>>

> >>>>

> >>>>> On Dec 18, 2020, at 8:38 AM, Yonghong Song <yhs@fb.com> wrote:

> >>>>>

> >>>>>

> >>>>>

> >>>>> On 12/17/20 9:23 PM, Alexei Starovoitov wrote:

> >>>>>> On Thu, Dec 17, 2020 at 8:33 PM Song Liu <songliubraving@fb.com> wrote:

> >>>>>>>>

> >>>>>>>> ahh. I missed that. Makes sense.

> >>>>>>>> vm_file needs to be accurate, but vm_area_struct should be accessed as ptr_to_btf_id.

> >>>>>>>

> >>>>>>> Passing pointer of vm_area_struct into BPF will be tricky. For example, shall we

> >>>>>>> allow the user to access vma->vm_file? IIUC, with ptr_to_btf_id the verifier will

> >>>>>>> allow access of vma->vm_file as a valid pointer to struct file. However, since the

> >>>>>>> vma might be freed, vma->vm_file could point to random data.

> >>>>>> I don't think so. The proposed patch will do get_file() on it.

> >>>>>> There is actually no need to assign it into a different variable.

> >>>>>> Accessing it via vma->vm_file is safe and cleaner.

> >>>>>

> >>>>> I did not check the code but do you have scenarios where vma is freed but old vma->vm_file is not freed due to reference counting, but

> >>>>> freed vma area is reused so vma->vm_file could be garbage?

> >>>>

> >>>> AFAIK, once we unlock mmap_sem, the vma could be freed and reused. I guess ptr_to_btf_id

> >>>> or probe_read would not help with this?

> >>>

> >>> Theoretically we can hack the verifier to treat some ptr_to_btf_id as "less

> >>> valid" than the other ptr_to_btf_id, but the user experience will not be great.

> >>> Reading such bpf prog will not be obvious. I think it's better to run bpf prog

> >>> in mmap_lock then and let it access vma->vm_file. After prog finishes the iter

> >>> bit can do if (mmap_lock_is_contended()) before iterating. That will deliver

> >>> better performance too. Instead of task_vma_seq_get_next() doing

> >>> mmap_lock/unlock at every vma. No need for get_file() either. And no

> >>> __vm_area_struct exposure.

> >>

> >> I think there might be concern calling BPF program with mmap_lock, especially that

> >> the program is sleepable (for bpf_d_path). It shouldn't be a problem for common

> >> cases, but I am not 100% sure for corner cases (many instructions in BPF + sleep).

> >> Current version is designed to be very safe for the workload, which might be too

> >> conservative.

> >

> > I know and I agree with all that, but how do you propose to fix the

> > vm_file concern

> > without holding the lock while prog is running? I couldn't come up with a way.

>

> I guess the gap here is that I don't see why __vm_area_struct exposure is an issue.

> It is similar to __sk_buff, and simpler (though we had more reasons to introduce

> __sk_buff back when there wasn't BTF).

>

> If we can accept __vm_area_struct, current version should work, as it doesn't have

> problem with vm_file


True. The problem with __vm_area_struct is that it is a hard coded
uapi with little to none
extensibility. In this form vma iterator is not really useful beyond
the example in selftest.

Song Liu Jan. 5, 2021, 7:38 p.m. UTC | #18

> On Jan 5, 2021, at 9:27 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> 

> On Tue, Jan 5, 2021 at 9:11 AM Song Liu <songliubraving@fb.com> wrote:

>> 

>> 

>> 

>>> On Jan 5, 2021, at 8:27 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

>>> 

>>> On Mon, Jan 4, 2021 at 9:47 PM Song Liu <songliubraving@fb.com> wrote:

>>>> 

>>>> 

>>>> 

>>>>> On Jan 4, 2021, at 5:46 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

>>>>> 

>>>>> On Fri, Dec 18, 2020 at 05:23:25PM +0000, Song Liu wrote:

>>>>>> 

>>>>>> 

>>>>>>> On Dec 18, 2020, at 8:38 AM, Yonghong Song <yhs@fb.com> wrote:

>>>>>>> 

>>>>>>> 

>>>>>>> 

>>>>>>> On 12/17/20 9:23 PM, Alexei Starovoitov wrote:

>>>>>>>> On Thu, Dec 17, 2020 at 8:33 PM Song Liu <songliubraving@fb.com> wrote:

>>>>>>>>>> 

>>>>>>>>>> ahh. I missed that. Makes sense.

>>>>>>>>>> vm_file needs to be accurate, but vm_area_struct should be accessed as ptr_to_btf_id.

>>>>>>>>> 

>>>>>>>>> Passing pointer of vm_area_struct into BPF will be tricky. For example, shall we

>>>>>>>>> allow the user to access vma->vm_file? IIUC, with ptr_to_btf_id the verifier will

>>>>>>>>> allow access of vma->vm_file as a valid pointer to struct file. However, since the

>>>>>>>>> vma might be freed, vma->vm_file could point to random data.

>>>>>>>> I don't think so. The proposed patch will do get_file() on it.

>>>>>>>> There is actually no need to assign it into a different variable.

>>>>>>>> Accessing it via vma->vm_file is safe and cleaner.

>>>>>>> 

>>>>>>> I did not check the code but do you have scenarios where vma is freed but old vma->vm_file is not freed due to reference counting, but

>>>>>>> freed vma area is reused so vma->vm_file could be garbage?

>>>>>> 

>>>>>> AFAIK, once we unlock mmap_sem, the vma could be freed and reused. I guess ptr_to_btf_id

>>>>>> or probe_read would not help with this?

>>>>> 

>>>>> Theoretically we can hack the verifier to treat some ptr_to_btf_id as "less

>>>>> valid" than the other ptr_to_btf_id, but the user experience will not be great.

>>>>> Reading such bpf prog will not be obvious. I think it's better to run bpf prog

>>>>> in mmap_lock then and let it access vma->vm_file. After prog finishes the iter

>>>>> bit can do if (mmap_lock_is_contended()) before iterating. That will deliver

>>>>> better performance too. Instead of task_vma_seq_get_next() doing

>>>>> mmap_lock/unlock at every vma. No need for get_file() either. And no

>>>>> __vm_area_struct exposure.

>>>> 

>>>> I think there might be concern calling BPF program with mmap_lock, especially that

>>>> the program is sleepable (for bpf_d_path). It shouldn't be a problem for common

>>>> cases, but I am not 100% sure for corner cases (many instructions in BPF + sleep).

>>>> Current version is designed to be very safe for the workload, which might be too

>>>> conservative.

>>> 

>>> I know and I agree with all that, but how do you propose to fix the

>>> vm_file concern

>>> without holding the lock while prog is running? I couldn't come up with a way.

>> 

>> I guess the gap here is that I don't see why __vm_area_struct exposure is an issue.

>> It is similar to __sk_buff, and simpler (though we had more reasons to introduce

>> __sk_buff back when there wasn't BTF).

>> 

>> If we can accept __vm_area_struct, current version should work, as it doesn't have

>> problem with vm_file

> 

> True. The problem with __vm_area_struct is that it is a hard coded

> uapi with little to none

> extensibility. In this form vma iterator is not really useful beyond

> the example in selftest.


With __vm_area_struct, we can still probe_read the page table, so we can 
cover more use cases than the selftest. But I agree that it is not as
extensible as feeding real vm_area_struct with BTF to the BPF program. 
Let me try calling BPF program with mmap_lock. 

Thanks,
Song

Alexei Starovoitov Jan. 5, 2021, 7:46 p.m. UTC | #19

On Tue, Jan 05, 2021 at 07:38:08PM +0000, Song Liu wrote:
> 

> 

> > On Jan 5, 2021, at 9:27 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> > 

> > On Tue, Jan 5, 2021 at 9:11 AM Song Liu <songliubraving@fb.com> wrote:

> >> 

> >> 

> >> 

> >>> On Jan 5, 2021, at 8:27 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> >>> 

> >>> On Mon, Jan 4, 2021 at 9:47 PM Song Liu <songliubraving@fb.com> wrote:

> >>>> 

> >>>> 

> >>>> 

> >>>>> On Jan 4, 2021, at 5:46 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> >>>>> 

> >>>>> On Fri, Dec 18, 2020 at 05:23:25PM +0000, Song Liu wrote:

> >>>>>> 

> >>>>>> 

> >>>>>>> On Dec 18, 2020, at 8:38 AM, Yonghong Song <yhs@fb.com> wrote:

> >>>>>>> 

> >>>>>>> 

> >>>>>>> 

> >>>>>>> On 12/17/20 9:23 PM, Alexei Starovoitov wrote:

> >>>>>>>> On Thu, Dec 17, 2020 at 8:33 PM Song Liu <songliubraving@fb.com> wrote:

> >>>>>>>>>> 

> >>>>>>>>>> ahh. I missed that. Makes sense.

> >>>>>>>>>> vm_file needs to be accurate, but vm_area_struct should be accessed as ptr_to_btf_id.

> >>>>>>>>> 

> >>>>>>>>> Passing pointer of vm_area_struct into BPF will be tricky. For example, shall we

> >>>>>>>>> allow the user to access vma->vm_file? IIUC, with ptr_to_btf_id the verifier will

> >>>>>>>>> allow access of vma->vm_file as a valid pointer to struct file. However, since the

> >>>>>>>>> vma might be freed, vma->vm_file could point to random data.

> >>>>>>>> I don't think so. The proposed patch will do get_file() on it.

> >>>>>>>> There is actually no need to assign it into a different variable.

> >>>>>>>> Accessing it via vma->vm_file is safe and cleaner.

> >>>>>>> 

> >>>>>>> I did not check the code but do you have scenarios where vma is freed but old vma->vm_file is not freed due to reference counting, but

> >>>>>>> freed vma area is reused so vma->vm_file could be garbage?

> >>>>>> 

> >>>>>> AFAIK, once we unlock mmap_sem, the vma could be freed and reused. I guess ptr_to_btf_id

> >>>>>> or probe_read would not help with this?

> >>>>> 

> >>>>> Theoretically we can hack the verifier to treat some ptr_to_btf_id as "less

> >>>>> valid" than the other ptr_to_btf_id, but the user experience will not be great.

> >>>>> Reading such bpf prog will not be obvious. I think it's better to run bpf prog

> >>>>> in mmap_lock then and let it access vma->vm_file. After prog finishes the iter

> >>>>> bit can do if (mmap_lock_is_contended()) before iterating. That will deliver

> >>>>> better performance too. Instead of task_vma_seq_get_next() doing

> >>>>> mmap_lock/unlock at every vma. No need for get_file() either. And no

> >>>>> __vm_area_struct exposure.

> >>>> 

> >>>> I think there might be concern calling BPF program with mmap_lock, especially that

> >>>> the program is sleepable (for bpf_d_path). It shouldn't be a problem for common

> >>>> cases, but I am not 100% sure for corner cases (many instructions in BPF + sleep).

> >>>> Current version is designed to be very safe for the workload, which might be too

> >>>> conservative.

> >>> 

> >>> I know and I agree with all that, but how do you propose to fix the

> >>> vm_file concern

> >>> without holding the lock while prog is running? I couldn't come up with a way.

> >> 

> >> I guess the gap here is that I don't see why __vm_area_struct exposure is an issue.

> >> It is similar to __sk_buff, and simpler (though we had more reasons to introduce

> >> __sk_buff back when there wasn't BTF).

> >> 

> >> If we can accept __vm_area_struct, current version should work, as it doesn't have

> >> problem with vm_file

> > 

> > True. The problem with __vm_area_struct is that it is a hard coded

> > uapi with little to none

> > extensibility. In this form vma iterator is not really useful beyond

> > the example in selftest.

> 

> With __vm_area_struct, we can still probe_read the page table, so we can 

> cover more use cases than the selftest. But I agree that it is not as

> extensible as feeding real vm_area_struct with BTF to the BPF program. 


Another confusing thing with __vm_area_struct is vm_flags field.
It's copied directly. As __vm_area_struct->flags this field is uapi field,
but its contents are kernel internal. We avoided such corner cases in the past.
When flags field is copied into uapi the kernel internal flags are encoded
and exposed as separate uapi flags. That was the case with
BPF_TCP_* flags. If we have to do this with vm_flags (VM_EXEC, etc) that
might kill the patchset due to abi concerns.

Song Liu Jan. 5, 2021, 7:51 p.m. UTC | #20

> On Jan 5, 2021, at 11:46 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> 

> On Tue, Jan 05, 2021 at 07:38:08PM +0000, Song Liu wrote:

>> 

>> 

>>> On Jan 5, 2021, at 9:27 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

>>> 

>>> On Tue, Jan 5, 2021 at 9:11 AM Song Liu <songliubraving@fb.com> wrote:

>>>> 

>>>> 

>>>> 

>>>>> On Jan 5, 2021, at 8:27 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

>>>>> 

>>>>> On Mon, Jan 4, 2021 at 9:47 PM Song Liu <songliubraving@fb.com> wrote:

>>>>>> 

>>>>>> 

>>>>>> 

>>>>>>> On Jan 4, 2021, at 5:46 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

>>>>>>> 

>>>>>>> On Fri, Dec 18, 2020 at 05:23:25PM +0000, Song Liu wrote:

>>>>>>>> 

>>>>>>>> 

>>>>>>>>> On Dec 18, 2020, at 8:38 AM, Yonghong Song <yhs@fb.com> wrote:

>>>>>>>>> 

>>>>>>>>> 

>>>>>>>>> 

>>>>>>>>> On 12/17/20 9:23 PM, Alexei Starovoitov wrote:

>>>>>>>>>> On Thu, Dec 17, 2020 at 8:33 PM Song Liu <songliubraving@fb.com> wrote:

>>>>>>>>>>>> 

>>>>>>>>>>>> ahh. I missed that. Makes sense.

>>>>>>>>>>>> vm_file needs to be accurate, but vm_area_struct should be accessed as ptr_to_btf_id.

>>>>>>>>>>> 

>>>>>>>>>>> Passing pointer of vm_area_struct into BPF will be tricky. For example, shall we

>>>>>>>>>>> allow the user to access vma->vm_file? IIUC, with ptr_to_btf_id the verifier will

>>>>>>>>>>> allow access of vma->vm_file as a valid pointer to struct file. However, since the

>>>>>>>>>>> vma might be freed, vma->vm_file could point to random data.

>>>>>>>>>> I don't think so. The proposed patch will do get_file() on it.

>>>>>>>>>> There is actually no need to assign it into a different variable.

>>>>>>>>>> Accessing it via vma->vm_file is safe and cleaner.

>>>>>>>>> 

>>>>>>>>> I did not check the code but do you have scenarios where vma is freed but old vma->vm_file is not freed due to reference counting, but

>>>>>>>>> freed vma area is reused so vma->vm_file could be garbage?

>>>>>>>> 

>>>>>>>> AFAIK, once we unlock mmap_sem, the vma could be freed and reused. I guess ptr_to_btf_id

>>>>>>>> or probe_read would not help with this?

>>>>>>> 

>>>>>>> Theoretically we can hack the verifier to treat some ptr_to_btf_id as "less

>>>>>>> valid" than the other ptr_to_btf_id, but the user experience will not be great.

>>>>>>> Reading such bpf prog will not be obvious. I think it's better to run bpf prog

>>>>>>> in mmap_lock then and let it access vma->vm_file. After prog finishes the iter

>>>>>>> bit can do if (mmap_lock_is_contended()) before iterating. That will deliver

>>>>>>> better performance too. Instead of task_vma_seq_get_next() doing

>>>>>>> mmap_lock/unlock at every vma. No need for get_file() either. And no

>>>>>>> __vm_area_struct exposure.

>>>>>> 

>>>>>> I think there might be concern calling BPF program with mmap_lock, especially that

>>>>>> the program is sleepable (for bpf_d_path). It shouldn't be a problem for common

>>>>>> cases, but I am not 100% sure for corner cases (many instructions in BPF + sleep).

>>>>>> Current version is designed to be very safe for the workload, which might be too

>>>>>> conservative.

>>>>> 

>>>>> I know and I agree with all that, but how do you propose to fix the

>>>>> vm_file concern

>>>>> without holding the lock while prog is running? I couldn't come up with a way.

>>>> 

>>>> I guess the gap here is that I don't see why __vm_area_struct exposure is an issue.

>>>> It is similar to __sk_buff, and simpler (though we had more reasons to introduce

>>>> __sk_buff back when there wasn't BTF).

>>>> 

>>>> If we can accept __vm_area_struct, current version should work, as it doesn't have

>>>> problem with vm_file

>>> 

>>> True. The problem with __vm_area_struct is that it is a hard coded

>>> uapi with little to none

>>> extensibility. In this form vma iterator is not really useful beyond

>>> the example in selftest.

>> 

>> With __vm_area_struct, we can still probe_read the page table, so we can 

>> cover more use cases than the selftest. But I agree that it is not as

>> extensible as feeding real vm_area_struct with BTF to the BPF program. 

> 

> Another confusing thing with __vm_area_struct is vm_flags field.

> It's copied directly. As __vm_area_struct->flags this field is uapi field,

> but its contents are kernel internal. We avoided such corner cases in the past.

> When flags field is copied into uapi the kernel internal flags are encoded

> and exposed as separate uapi flags. That was the case with

> BPF_TCP_* flags. If we have to do this with vm_flags (VM_EXEC, etc) that

> might kill the patchset due to abi concerns.


This makes sense. It shouldn't be uapi without extra encoding. 

Song

[v2,bpf-next,1/4] bpf: introduce task_vma bpf_iter

Commit Message

Comments

Patch