
[1/3] vrange: Add vrange syscall and handle splitting/merging and marking vmas

Message ID 1394822013-23804-2-git-send-email-john.stultz@linaro.org
State Superseded

Commit Message

John Stultz March 14, 2014, 6:33 p.m. UTC
This patch introduces the vrange() syscall, which allows applications to
specify ranges of memory as volatile, so that they may be discarded by the
system.

This initial patch simply adds the syscall and the vma handling:
splitting and merging the vmas as needed and marking them with
VM_VOLATILE.

No purging or discarding of volatile ranges is done at this point.

Example man page:

NAME
	vrange - mark or unmark a range of memory as volatile

SYNOPSIS
	int vrange(unsigned long start, size_t length, int mode,
			 int *purged);

DESCRIPTION
	Applications can use vrange(2) to advise the kernel how it should
	handle paging I/O in this VM area.  The idea is to help the kernel
	discard pages in the vrange instead of reclaiming them when memory
	pressure occurs; the kernel does not discard any pages in the vrange
	if there is no memory pressure.

	mode:
	VRANGE_VOLATILE
		hint to the kernel so the VM can discard pages in the vrange
		when memory pressure happens.
	VRANGE_NONVOLATILE
		hint to the kernel so the VM doesn't discard pages in the
		vrange any more.

	If the user tries to access purged memory without a VRANGE_NONVOLATILE
	call, they can encounter SIGBUS if the page was discarded by the kernel.

	purged: Pointer to an integer which will be set to 1 if
	mode == VRANGE_NONVOLATILE and any page in the affected range
	was purged. If purged is zero after a mode ==
	VRANGE_NONVOLATILE call, all of the pages in the range
	are intact.

RETURN VALUE
	On success vrange returns the number of bytes marked or unmarked.
	Similar to write(), it may return fewer bytes than specified
	if it ran into a problem.

	If an error is returned, no changes were made.

ERRORS
	EINVAL This error can occur for the following reasons:
		* length is negative or not a multiple of the page size.
		* start is not page-aligned.
		* mode is not a valid value.

	ENOMEM Not enough memory

	EFAULT purged pointer is invalid
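
EXAMPLE
	The following is an illustrative sketch only; it assumes the x86_64
	syscall number 316 added by this patch and calls syscall(2) directly,
	since there is no libc wrapper. It marks an easily regenerated buffer
	volatile while idle, unmarks it before reuse, and regenerates the
	contents if anything was purged.

	#include <stdio.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#define __NR_vrange		316	/* x86_64, from this patch */
	#define VRANGE_NONVOLATILE	0
	#define VRANGE_VOLATILE		1

	int main(void)
	{
		size_t len = 16 * 4096;		/* easily regenerated cache */
		int purged = 0;
		long ret;
		char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return 1;

		/* Idle: allow the kernel to discard these pages under
		 * memory pressure instead of swapping them. */
		syscall(__NR_vrange, buf, len, VRANGE_VOLATILE, &purged);

		/* Before reusing the data, make it non-volatile again and
		 * check whether any of it was discarded in the meantime. */
		ret = syscall(__NR_vrange, buf, len, VRANGE_NONVOLATILE,
			      &purged);
		if (ret < 0)
			perror("vrange");
		else if ((size_t)ret < len || purged)
			printf("regenerate contents (marked %ld, purged %d)\n",
			       ret, purged);
		return 0;
	}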

This is a simplified implementation which reuses some of the logic
from Minchan's earlier efforts. So credit to Minchan for his work.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Dhaval Giani <dgiani@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 arch/x86/syscalls/syscall_64.tbl |   1 +
 include/linux/mm.h               |   1 +
 include/linux/vrange.h           |   7 ++
 mm/Makefile                      |   2 +-
 mm/vrange.c                      | 150 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 160 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/vrange.h
 create mode 100644 mm/vrange.c

Comments

Jan Kara March 17, 2014, 9:21 a.m. UTC | #1
On Fri 14-03-14 11:33:31, John Stultz wrote:
> This patch introduces the vrange() syscall, which allows for specifying
> ranges of memory as volatile, and able to be discarded by the system.
> 
> This initial patch simply adds the syscall, and the vma handling,
> splitting and merging the vmas as needed, and marking them with
> VM_VOLATILE.
> 
> No purging or discarding of volatile ranges is done at this point.
> 
> Example man page:
> 
> NAME
> 	vrange - Mark or unmark range of memory as volatile
> 
> SYNOPSIS
> 	int vrange(unsigned_long start, size_t length, int mode,
> 			 int *purged);
> 
> DESCRIPTION
> 	Applications can use vrange(2) to advise the kernel how it should
> 	handle paging I/O in this VM area.  The idea is to help the kernel
> 	discard pages of vrange instead of reclaiming when memory pressure
> 	happens. It means kernel doesn't discard any pages of vrange if
> 	there is no memory pressure.
  I'd say that the advantage is kernel doesn't have to swap volatile pages,
it can just directly discard them on memory pressure. You should also
mention somewhere vrange() is currently supported only for anonymous pages.
So maybe we can have the description like:
Applications can use vrange(2) to advise kernel that pages of anonymous
mapping in the given VM area can be reclaimed without swapping (or can no
longer be reclaimed without swapping). The idea is that application can
help kernel with page reclaim under memory pressure by specifying data
it can easily regenerate and thus kernel can discard the data if needed.

> 	mode:
> 	VRANGE_VOLATILE
> 		hint to kernel so VM can discard in vrange pages when
> 		memory pressure happens.
> 	VRANGE_NONVOLATILE
> 		hint to kernel so VM doesn't discard vrange pages
> 		any more.
> 
> 	If user try to access purged memory without VRANGE_NONVOLATILE call,
                ^^^ tries

> 	he can encounter SIGBUS if the page was discarded by kernel.
> 
> 	purged: Pointer to an integer which will return 1 if
> 	mode == VRANGE_NONVOLATILE and any page in the affected range
> 	was purged. If purged returns zero during a mode ==
> 	VRANGE_NONVOLATILE call, it means all of the pages in the range
> 	are intact.
> 
> RETURN VALUE
> 	On success vrange returns the number of bytes marked or unmarked.
> 	Similar to write(), it may return fewer bytes then specified
> 	if it ran into a problem.
  I believe you may need to better explain what is 'purged' argument good
for. Because in my naive understanding *purged == 1 iff return value !=
length.  I recall your discussion with Johannes about error conditions and
the need to return error but also the state of the range, is that right?
But that should be really explained somewhere so that poor application
programmer is aware of those corner cases as well.

> 
> 	If an error is returned, no changes were made.
> 
> ERRORS
> 	EINVAL This error can occur for the following reasons:
> 		* The value length is negative or not page size units.
> 		* addr is not page-aligned
> 		* mode not a valid value.
> 
> 	ENOMEM Not enough memory
> 
> 	EFAULT purged pointer is invalid
> 
> This a simplified implementation which reuses some of the logic
> from Minchan's earlier efforts. So credit to Minchan for his work.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Robert Love <rlove@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@sr71.net>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mike Hommey <mh@glandium.org>
> Cc: Taras Glek <tglek@mozilla.com>
> Cc: Dhaval Giani <dgiani@mozilla.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: linux-mm@kvack.org <linux-mm@kvack.org>
> Signed-off-by: John Stultz <john.stultz@linaro.org>
  Some minor comments in the patch below...

> ---
>  arch/x86/syscalls/syscall_64.tbl |   1 +
>  include/linux/mm.h               |   1 +
>  include/linux/vrange.h           |   7 ++
>  mm/Makefile                      |   2 +-
>  mm/vrange.c                      | 150 +++++++++++++++++++++++++++++++++++++++
>  5 files changed, 160 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/vrange.h
>  create mode 100644 mm/vrange.c
> 
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index a12bddc..7ae3940 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -322,6 +322,7 @@
>  313	common	finit_module		sys_finit_module
>  314	common	sched_setattr		sys_sched_setattr
>  315	common	sched_getattr		sys_sched_getattr
> +316	common	vrange			sys_vrange
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c1b7414..a1f11da 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -117,6 +117,7 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
>  
>  					/* Used by sys_madvise() */
> +#define VM_VOLATILE	0x00001000	/* VMA is volatile */
>  #define VM_SEQ_READ	0x00008000	/* App will access data sequentially */
>  #define VM_RAND_READ	0x00010000	/* App will not benefit from clustered reads */
>  
> diff --git a/include/linux/vrange.h b/include/linux/vrange.h
> new file mode 100644
> index 0000000..652396b
> --- /dev/null
> +++ b/include/linux/vrange.h
> @@ -0,0 +1,7 @@
> +#ifndef _LINUX_VRANGE_H
> +#define _LINUX_VRANGE_H
> +
> +#define VRANGE_NONVOLATILE 0
> +#define VRANGE_VOLATILE 1
> +
> +#endif /* _LINUX_VRANGE_H */
> diff --git a/mm/Makefile b/mm/Makefile
> index 310c90a..20229e2 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -16,7 +16,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
>  			   readahead.o swap.o truncate.o vmscan.o shmem.o \
>  			   util.o mmzone.o vmstat.o backing-dev.o \
>  			   mm_init.o mmu_context.o percpu.o slab_common.o \
> -			   compaction.o balloon_compaction.o \
> +			   compaction.o balloon_compaction.o vrange.o \
>  			   interval_tree.o list_lru.o $(mmu-y)
>  
>  obj-y += init-mm.o
> diff --git a/mm/vrange.c b/mm/vrange.c
> new file mode 100644
> index 0000000..acb4356
> --- /dev/null
> +++ b/mm/vrange.c
> @@ -0,0 +1,150 @@
> +#include <linux/syscalls.h>
> +#include <linux/vrange.h>
> +#include <linux/mm_inline.h>
> +#include <linux/pagemap.h>
> +#include <linux/rmap.h>
> +#include <linux/hugetlb.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/mm_inline.h>
> +#include "internal.h"
> +
> +static ssize_t do_vrange(struct mm_struct *mm, unsigned long start,
> +				unsigned long end, int mode, int *purged)
> +{
> +	struct vm_area_struct *vma, *prev;
> +	unsigned long orig_start = start;
> +	ssize_t count = 0, ret = 0;
> +	int lpurged = 0;
> +
> +	down_read(&mm->mmap_sem);
> +
> +	vma = find_vma_prev(mm, start, &prev);
> +	if (vma && start > vma->vm_start)
> +		prev = vma;
> +
> +	for (;;) {
> +		unsigned long new_flags;
> +		pgoff_t pgoff;
> +		unsigned long tmp;
> +
> +		if (!vma)
> +			goto out;
> +
> +		if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
> +					VM_HUGETLB))
> +			goto out;
> +
> +		/* We don't support volatility on files for now */
> +		if (vma->vm_file) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		new_flags = vma->vm_flags;
> +
> +		if (start < vma->vm_start) {
> +			start = vma->vm_start;
> +			if (start >= end)
> +				goto out;
> +		}
> +		tmp = vma->vm_end;
> +		if (end < tmp)
> +			tmp = end;
> +
> +		switch (mode) {
> +		case VRANGE_VOLATILE:
> +			new_flags |= VM_VOLATILE;
> +			break;
> +		case VRANGE_NONVOLATILE:
> +			new_flags &= ~VM_VOLATILE;
> +		}
> +
> +		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> +		prev = vma_merge(mm, prev, start, tmp, new_flags,
> +					vma->anon_vma, vma->vm_file, pgoff,
> +					vma_policy(vma));
> +		if (prev)
> +			goto success;
> +
> +		if (start != vma->vm_start) {
> +			ret = split_vma(mm, vma, start, 1);
> +			if (ret)
> +				goto out;
> +		}
> +
> +		if (tmp != vma->vm_end) {
> +			ret = split_vma(mm, vma, tmp, 0);
> +			if (ret)
> +				goto out;
> +		}
> +
> +		prev = vma;
> +success:
> +		vma->vm_flags = new_flags;
> +		*purged = lpurged;
> +
> +		/* update count to distance covered so far*/
> +		count = tmp - orig_start;
> +
> +		if (prev && start < prev->vm_end)
  In which case 'prev' can be NULL? And when start >= prev->vm_end? In all
the cases I can come up with this condition seems to be true...

> +			start = prev->vm_end;
> +		if (start >= end)
> +			goto out;
> +		if (prev)
  Ditto regarding 'prev'...

> +			vma = prev->vm_next;
> +		else	/* madvise_remove dropped mmap_sem */
> +			vma = find_vma(mm, start);
  The comment regarding madvise_remove() looks bogus...

> +	}
> +out:
> +	up_read(&mm->mmap_sem);
> +
> +	/* report bytes successfully marked, even if we're exiting on error */
> +	if (count)
> +		return count;
> +
> +	return ret;
> +}
> +
> +SYSCALL_DEFINE4(vrange, unsigned long, start,
> +		size_t, len, int, mode, int __user *, purged)
> +{
> +	unsigned long end;
> +	struct mm_struct *mm = current->mm;
> +	ssize_t ret = -EINVAL;
> +	int p = 0;
> +
> +	if (start & ~PAGE_MASK)
> +		goto out;
> +
> +	len &= PAGE_MASK;
> +	if (!len)
> +		goto out;
> +
> +	end = start + len;
> +	if (end < start)
> +		goto out;
> +
> +	if (start >= TASK_SIZE)
> +		goto out;
> +
> +	if (purged) {
> +		/* Test pointer is valid before making any changes */
> +		if (put_user(p, purged))
> +			return -EFAULT;
> +	}
> +
> +	ret = do_vrange(mm, start, end, mode, &p);
> +
> +	if (purged) {
> +		if (put_user(p, purged)) {
> +			/*
> +			 * This would be bad, since we've modified volatilty
> +			 * and the change in purged state would be lost.
> +			 */
> +			WARN_ONCE(1, "vrange: purge state possibly lost\n");
> +		}
> +	}
> +
> +out:
> +	return ret;
> +}
								Honza
Jan Kara March 17, 2014, 9:43 a.m. UTC | #2
On Mon 17-03-14 10:21:18, Jan Kara wrote:
> On Fri 14-03-14 11:33:31, John Stultz wrote:
> > +	for (;;) {
> > +		unsigned long new_flags;
> > +		pgoff_t pgoff;
> > +		unsigned long tmp;
> > +
> > +		if (!vma)
> > +			goto out;
> > +
> > +		if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
> > +					VM_HUGETLB))
> > +			goto out;
> > +
> > +		/* We don't support volatility on files for now */
> > +		if (vma->vm_file) {
> > +			ret = -EINVAL;
> > +			goto out;
> > +		}
> > +
> > +		new_flags = vma->vm_flags;
> > +
> > +		if (start < vma->vm_start) {
> > +			start = vma->vm_start;
> > +			if (start >= end)
> > +				goto out;
> > +		}
  One more question: This seems to silently skip any holes between VMAs. Is
that really intended? I'd expect that marking unmapped range as volatile /
non-volatile should return error... In any case what happens should be
defined in the description.

								Honza

> > +		tmp = vma->vm_end;
> > +		if (end < tmp)
> > +			tmp = end;
> > +
> > +		switch (mode) {
> > +		case VRANGE_VOLATILE:
> > +			new_flags |= VM_VOLATILE;
> > +			break;
> > +		case VRANGE_NONVOLATILE:
> > +			new_flags &= ~VM_VOLATILE;
> > +		}
> > +
> > +		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> > +		prev = vma_merge(mm, prev, start, tmp, new_flags,
> > +					vma->anon_vma, vma->vm_file, pgoff,
> > +					vma_policy(vma));
> > +		if (prev)
> > +			goto success;
> > +
> > +		if (start != vma->vm_start) {
> > +			ret = split_vma(mm, vma, start, 1);
> > +			if (ret)
> > +				goto out;
> > +		}
> > +
> > +		if (tmp != vma->vm_end) {
> > +			ret = split_vma(mm, vma, tmp, 0);
> > +			if (ret)
> > +				goto out;
> > +		}
> > +
> > +		prev = vma;
> > +success:
> > +		vma->vm_flags = new_flags;
> > +		*purged = lpurged;
> > +
> > +		/* update count to distance covered so far*/
> > +		count = tmp - orig_start;
> > +
> > +		if (prev && start < prev->vm_end)
>   In which case 'prev' can be NULL? And when start >= prev->vm_end? In all
> the cases I can come up with this condition seems to be true...
> 
> > +			start = prev->vm_end;
> > +		if (start >= end)
> > +			goto out;
> > +		if (prev)
>   Ditto regarding 'prev'...
> 
> > +			vma = prev->vm_next;
> > +		else	/* madvise_remove dropped mmap_sem */
> > +			vma = find_vma(mm, start);
>   The comment regarding madvise_remove() looks bogus...
> 
> > +	}
> > +out:
> > +	up_read(&mm->mmap_sem);
> > +
> > +	/* report bytes successfully marked, even if we're exiting on error */
> > +	if (count)
> > +		return count;
> > +
> > +	return ret;
> > +}
> > +
> > +SYSCALL_DEFINE4(vrange, unsigned long, start,
> > +		size_t, len, int, mode, int __user *, purged)
> > +{
> > +	unsigned long end;
> > +	struct mm_struct *mm = current->mm;
> > +	ssize_t ret = -EINVAL;
> > +	int p = 0;
> > +
> > +	if (start & ~PAGE_MASK)
> > +		goto out;
> > +
> > +	len &= PAGE_MASK;
> > +	if (!len)
> > +		goto out;
> > +
> > +	end = start + len;
> > +	if (end < start)
> > +		goto out;
> > +
> > +	if (start >= TASK_SIZE)
> > +		goto out;
> > +
> > +	if (purged) {
> > +		/* Test pointer is valid before making any changes */
> > +		if (put_user(p, purged))
> > +			return -EFAULT;
> > +	}
> > +
> > +	ret = do_vrange(mm, start, end, mode, &p);
> > +
> > +	if (purged) {
> > +		if (put_user(p, purged)) {
> > +			/*
> > +			 * This would be bad, since we've modified volatilty
> > +			 * and the change in purged state would be lost.
> > +			 */
> > +			WARN_ONCE(1, "vrange: purge state possibly lost\n");
> > +		}
> > +	}
> > +
> > +out:
> > +	return ret;
> > +}
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
John Stultz March 17, 2014, 10:19 p.m. UTC | #3
On 03/17/2014 02:21 AM, Jan Kara wrote:
> On Fri 14-03-14 11:33:31, John Stultz wrote:
>> This patch introduces the vrange() syscall, which allows for specifying
>> ranges of memory as volatile, and able to be discarded by the system.
>>
>> This initial patch simply adds the syscall, and the vma handling,
>> splitting and merging the vmas as needed, and marking them with
>> VM_VOLATILE.
>>
>> No purging or discarding of volatile ranges is done at this point.
>>
>> Example man page:
>>
>> NAME
>> 	vrange - Mark or unmark range of memory as volatile
>>
>> SYNOPSIS
>> 	int vrange(unsigned_long start, size_t length, int mode,
>> 			 int *purged);
>>
>> DESCRIPTION
>> 	Applications can use vrange(2) to advise the kernel how it should
>> 	handle paging I/O in this VM area.  The idea is to help the kernel
>> 	discard pages of vrange instead of reclaiming when memory pressure
>> 	happens. It means kernel doesn't discard any pages of vrange if
>> 	there is no memory pressure.
>   I'd say that the advantage is kernel doesn't have to swap volatile pages,
> it can just directly discard them on memory pressure. You should also
> mention somewhere vrange() is currently supported only for anonymous pages.
> So maybe we can have the description like:
> Applications can use vrange(2) to advise kernel that pages of anonymous
> mapping in the given VM area can be reclaimed without swapping (or can no
> longer be reclaimed without swapping). The idea is that application can
> help kernel with page reclaim under memory pressure by specifying data
> it can easily regenerate and thus kernel can discard the data if needed.

Good point. This man page description originated from previous patches
where we did handle file paging, so I'll try to update it and make it
more clear we currently don't (although I very much want to re-add that
functionality eventually).


>
>> 	mode:
>> 	VRANGE_VOLATILE
>> 		hint to kernel so VM can discard in vrange pages when
>> 		memory pressure happens.
>> 	VRANGE_NONVOLATILE
>> 		hint to kernel so VM doesn't discard vrange pages
>> 		any more.
>>
>> 	If user try to access purged memory without VRANGE_NONVOLATILE call,
>                 ^^^ tries
>
>> 	he can encounter SIGBUS if the page was discarded by kernel.
>>
>> 	purged: Pointer to an integer which will return 1 if
>> 	mode == VRANGE_NONVOLATILE and any page in the affected range
>> 	was purged. If purged returns zero during a mode ==
>> 	VRANGE_NONVOLATILE call, it means all of the pages in the range
>> 	are intact.
>>
>> RETURN VALUE
>> 	On success vrange returns the number of bytes marked or unmarked.
>> 	Similar to write(), it may return fewer bytes then specified
>> 	if it ran into a problem.
>   I believe you may need to better explain what is 'purged' argument good
> for. Because in my naive understanding *purged == 1 iff return value !=
> length.  I recall your discussion with Johannes about error conditions and
> the need to return error but also the state of the range, is that right?
> But that should be really explained somewhere so that poor application
> programmer is aware of those corner cases as well.

Right. So the purged flag is separate/independent from the byte/error
return. Basically we want to describe how much memory has been changed
from volatile to non-volatile state (or vice versa), as well as
indicating whether any of those pages were purged while they were volatile.
One could mark 1meg of previously volatile memory non-volatile, and find
purged == 0 if there was no memory pressure, or if there was pressure,
find purged == 1.

The error case is that should we run out of memory (or hit some other
error condition that prevents us from successfully marking all of the
specified memory as non-volatile) half way through, we need to return to
the user the purged state for the pages that we did change.

Now, it's true that if we ran out of memory, it's likely that the memory
pressure caused pages to be purged before being marked, but one could
imagine a situation where we got half way through marking non-purged
pages and memory pressure suddenly went up, causing an allocation to
fail. Thus in that case, you would see the return value != length, and
purged == 0.

But thank you for the feedback, I'll try to rework that part to be more
clear. Any suggestions would also be welcome, as I worry my head is a
bit too steeped in this to see what would make it more clear to fresh eyes.
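
For what it's worth, here is a rough sketch of how I'd picture a caller
handling the unmark path (illustrative only, not from the patch; vrange()
below just stands in for a direct syscall(__NR_vrange, ...) call since
there's no libc wrapper yet):

	static int make_nonvolatile(void *buf, size_t len)
	{
		int purged = 0;
		ssize_t ret = vrange((unsigned long)buf, len,
				     VRANGE_NONVOLATILE, &purged);

		if (ret < 0)
			return -1;	/* error: nothing was changed */
		if ((size_t)ret < len)
			return -1;	/* partial: only the first 'ret' bytes
					 * were unmarked, and 'purged' covers
					 * just those bytes */
		if (purged)
			return 1;	/* unmarked, but data was discarded
					 * while volatile: regenerate it */
		return 0;		/* unmarked and contents are intact */
	}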


[snip]
>> diff --git a/mm/vrange.c b/mm/vrange.c
>> new file mode 100644
>> index 0000000..acb4356
>> --- /dev/null
>> +++ b/mm/vrange.c
>> @@ -0,0 +1,150 @@
>> +#include <linux/syscalls.h>
>> +#include <linux/vrange.h>
>> +#include <linux/mm_inline.h>
>> +#include <linux/pagemap.h>
>> +#include <linux/rmap.h>
>> +#include <linux/hugetlb.h>
>> +#include <linux/mmu_notifier.h>
>> +#include <linux/mm_inline.h>
>> +#include "internal.h"
>> +
>> +static ssize_t do_vrange(struct mm_struct *mm, unsigned long start,
>> +				unsigned long end, int mode, int *purged)
>> +{
>> +	struct vm_area_struct *vma, *prev;
>> +	unsigned long orig_start = start;
>> +	ssize_t count = 0, ret = 0;
>> +	int lpurged = 0;
>> +
>> +	down_read(&mm->mmap_sem);
>> +
>> +	vma = find_vma_prev(mm, start, &prev);
>> +	if (vma && start > vma->vm_start)
>> +		prev = vma;
>> +
>> +	for (;;) {
>> +		unsigned long new_flags;
>> +		pgoff_t pgoff;
>> +		unsigned long tmp;
>> +
>> +		if (!vma)
>> +			goto out;
>> +
>> +		if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
>> +					VM_HUGETLB))
>> +			goto out;
>> +
>> +		/* We don't support volatility on files for now */
>> +		if (vma->vm_file) {
>> +			ret = -EINVAL;
>> +			goto out;
>> +		}
>> +
>> +		new_flags = vma->vm_flags;
>> +
>> +		if (start < vma->vm_start) {
>> +			start = vma->vm_start;
>> +			if (start >= end)
>> +				goto out;
>> +		}
>> +		tmp = vma->vm_end;
>> +		if (end < tmp)
>> +			tmp = end;
>> +
>> +		switch (mode) {
>> +		case VRANGE_VOLATILE:
>> +			new_flags |= VM_VOLATILE;
>> +			break;
>> +		case VRANGE_NONVOLATILE:
>> +			new_flags &= ~VM_VOLATILE;
>> +		}
>> +
>> +		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
>> +		prev = vma_merge(mm, prev, start, tmp, new_flags,
>> +					vma->anon_vma, vma->vm_file, pgoff,
>> +					vma_policy(vma));
>> +		if (prev)
>> +			goto success;
>> +
>> +		if (start != vma->vm_start) {
>> +			ret = split_vma(mm, vma, start, 1);
>> +			if (ret)
>> +				goto out;
>> +		}
>> +
>> +		if (tmp != vma->vm_end) {
>> +			ret = split_vma(mm, vma, tmp, 0);
>> +			if (ret)
>> +				goto out;
>> +		}
>> +
>> +		prev = vma;
>> +success:
>> +		vma->vm_flags = new_flags;
>> +		*purged = lpurged;
>> +
>> +		/* update count to distance covered so far*/
>> +		count = tmp - orig_start;
>> +
>> +		if (prev && start < prev->vm_end)
>   In which case 'prev' can be NULL? And when start >= prev->vm_end? In all
> the cases I can come up with this condition seems to be true...
>
>> +			start = prev->vm_end;
>> +		if (start >= end)
>> +			goto out;
>> +		if (prev)
>   Ditto regarding 'prev'...

I haven't had the chance to look closely here today, but I'll double
check on these two.


>> +			vma = prev->vm_next;
>> +		else	/* madvise_remove dropped mmap_sem */
>> +			vma = find_vma(mm, start);
>   The comment regarding madvise_remove() looks bogus...

Yep. Thanks for pointing it out, that's from my starting with the
madvise logic and reworking it.


Thanks so much for the feedback here! I really appreciate it, and will
rework things appropriately.

thanks again!
-john
John Stultz March 18, 2014, 12:36 a.m. UTC | #4
On 03/17/2014 02:43 AM, Jan Kara wrote:
> On Mon 17-03-14 10:21:18, Jan Kara wrote:
>> On Fri 14-03-14 11:33:31, John Stultz wrote:
>>> +	for (;;) {
>>> +		unsigned long new_flags;
>>> +		pgoff_t pgoff;
>>> +		unsigned long tmp;
>>> +
>>> +		if (!vma)
>>> +			goto out;
>>> +
>>> +		if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
>>> +					VM_HUGETLB))
>>> +			goto out;
>>> +
>>> +		/* We don't support volatility on files for now */
>>> +		if (vma->vm_file) {
>>> +			ret = -EINVAL;
>>> +			goto out;
>>> +		}
>>> +
>>> +		new_flags = vma->vm_flags;
>>> +
>>> +		if (start < vma->vm_start) {
>>> +			start = vma->vm_start;
>>> +			if (start >= end)
>>> +				goto out;
>>> +		}
>   One more question: This seems to silently skip any holes between VMAs. Is
> that really intended? I'd expect that marking unmapped range as volatile /
> non-volatile should return error... In any case what happens should be
> defined in the description.

So.. initially it was by design, but as I look at madvise and think
about it further, it does make more sense to throw errors if memory in
the range is not mapped.

I'll try to rework things to adapt to this.
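
For reference, here's a quick probe of the current behaviour as I read the
loop (illustrative only; __NR_vrange and VRANGE_VOLATILE are the x86_64
number and mode constant this patch adds). With the code as posted I'd
expect it to succeed and report all three pages even though the middle one
is unmapped:

	#include <stdio.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#define __NR_vrange	316
	#define VRANGE_VOLATILE	1

	int main(void)
	{
		size_t pg = sysconf(_SC_PAGESIZE);
		int purged = 0;
		char *p = mmap(NULL, 3 * pg, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;

		munmap(p + pg, pg);	/* punch a hole in the middle */
		printf("ret=%ld\n", syscall(__NR_vrange, p, 3 * pg,
					    VRANGE_VOLATILE, &purged));
		return 0;		/* expect ret == 3 * pg, hole skipped */
	}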

thanks
-john

Patch

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index a12bddc..7ae3940 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -322,6 +322,7 @@ 
 313	common	finit_module		sys_finit_module
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
+316	common	vrange			sys_vrange
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c1b7414..a1f11da 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -117,6 +117,7 @@  extern unsigned int kobjsize(const void *objp);
 #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
 
 					/* Used by sys_madvise() */
+#define VM_VOLATILE	0x00001000	/* VMA is volatile */
 #define VM_SEQ_READ	0x00008000	/* App will access data sequentially */
 #define VM_RAND_READ	0x00010000	/* App will not benefit from clustered reads */
 
diff --git a/include/linux/vrange.h b/include/linux/vrange.h
new file mode 100644
index 0000000..652396b
--- /dev/null
+++ b/include/linux/vrange.h
@@ -0,0 +1,7 @@ 
+#ifndef _LINUX_VRANGE_H
+#define _LINUX_VRANGE_H
+
+#define VRANGE_NONVOLATILE 0
+#define VRANGE_VOLATILE 1
+
+#endif /* _LINUX_VRANGE_H */
diff --git a/mm/Makefile b/mm/Makefile
index 310c90a..20229e2 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -16,7 +16,7 @@  obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
-			   compaction.o balloon_compaction.o \
+			   compaction.o balloon_compaction.o vrange.o \
 			   interval_tree.o list_lru.o $(mmu-y)
 
 obj-y += init-mm.o
diff --git a/mm/vrange.c b/mm/vrange.c
new file mode 100644
index 0000000..acb4356
--- /dev/null
+++ b/mm/vrange.c
@@ -0,0 +1,150 @@ 
+#include <linux/syscalls.h>
+#include <linux/vrange.h>
+#include <linux/mm_inline.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/hugetlb.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mm_inline.h>
+#include "internal.h"
+
+static ssize_t do_vrange(struct mm_struct *mm, unsigned long start,
+				unsigned long end, int mode, int *purged)
+{
+	struct vm_area_struct *vma, *prev;
+	unsigned long orig_start = start;
+	ssize_t count = 0, ret = 0;
+	int lpurged = 0;
+
+	down_read(&mm->mmap_sem);
+
+	vma = find_vma_prev(mm, start, &prev);
+	if (vma && start > vma->vm_start)
+		prev = vma;
+
+	for (;;) {
+		unsigned long new_flags;
+		pgoff_t pgoff;
+		unsigned long tmp;
+
+		if (!vma)
+			goto out;
+
+		if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
+					VM_HUGETLB))
+			goto out;
+
+		/* We don't support volatility on files for now */
+		if (vma->vm_file) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		new_flags = vma->vm_flags;
+
+		if (start < vma->vm_start) {
+			start = vma->vm_start;
+			if (start >= end)
+				goto out;
+		}
+		tmp = vma->vm_end;
+		if (end < tmp)
+			tmp = end;
+
+		switch (mode) {
+		case VRANGE_VOLATILE:
+			new_flags |= VM_VOLATILE;
+			break;
+		case VRANGE_NONVOLATILE:
+			new_flags &= ~VM_VOLATILE;
+		}
+
+		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
+		prev = vma_merge(mm, prev, start, tmp, new_flags,
+					vma->anon_vma, vma->vm_file, pgoff,
+					vma_policy(vma));
+		if (prev)
+			goto success;
+
+		if (start != vma->vm_start) {
+			ret = split_vma(mm, vma, start, 1);
+			if (ret)
+				goto out;
+		}
+
+		if (tmp != vma->vm_end) {
+			ret = split_vma(mm, vma, tmp, 0);
+			if (ret)
+				goto out;
+		}
+
+		prev = vma;
+success:
+		vma->vm_flags = new_flags;
+		*purged = lpurged;
+
+		/* update count to distance covered so far*/
+		count = tmp - orig_start;
+
+		if (prev && start < prev->vm_end)
+			start = prev->vm_end;
+		if (start >= end)
+			goto out;
+		if (prev)
+			vma = prev->vm_next;
+		else	/* madvise_remove dropped mmap_sem */
+			vma = find_vma(mm, start);
+	}
+out:
+	up_read(&mm->mmap_sem);
+
+	/* report bytes successfully marked, even if we're exiting on error */
+	if (count)
+		return count;
+
+	return ret;
+}
+
+SYSCALL_DEFINE4(vrange, unsigned long, start,
+		size_t, len, int, mode, int __user *, purged)
+{
+	unsigned long end;
+	struct mm_struct *mm = current->mm;
+	ssize_t ret = -EINVAL;
+	int p = 0;
+
+	if (start & ~PAGE_MASK)
+		goto out;
+
+	len &= PAGE_MASK;
+	if (!len)
+		goto out;
+
+	end = start + len;
+	if (end < start)
+		goto out;
+
+	if (start >= TASK_SIZE)
+		goto out;
+
+	if (purged) {
+		/* Test pointer is valid before making any changes */
+		if (put_user(p, purged))
+			return -EFAULT;
+	}
+
+	ret = do_vrange(mm, start, end, mode, &p);
+
+	if (purged) {
+		if (put_user(p, purged)) {
+			/*
+			 * This would be bad, since we've modified volatilty
+			 * and the change in purged state would be lost.
+			 */
+			WARN_ONCE(1, "vrange: purge state possibly lost\n");
+		}
+	}
+
+out:
+	return ret;
+}