From patchwork Fri Nov 18 01:49:12 2011
X-Patchwork-Submitter: John Stultz
X-Patchwork-Id: 5190
From: John Stultz
To: Dave Hansen
Cc: John Stultz
Subject: [PATCH] madvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
Date: Thu, 17 Nov 2011 17:49:12 -0800
Message-Id: <1321580952-14298-1-git-send-email-john.stultz@linaro.org>
X-Mailer: git-send-email 1.7.3.2.146.gca209

Got most of your feedback integrated. Let me know what else catches your eye.

Regarding locking: I'm hoping to lean on whatever protects the file
address_space to serialize changes to the range list. Let me know if I need
to take something specific there.

thanks
-john

This patch provides new madvise flags that can be used to mark memory as
volatile, which allows it to be discarded if the kernel wants to reclaim
memory.

Right now, we can simply throw away pages if they are clean (backed by a
current on-disk copy). That only happens for anonymous/tmpfs/shmfs pages
when they are swapped out. This patch lets userspace select dirty pages
which can simply be thrown away instead of being written to disk first.
See the mm/shmem.c portion of this patch for that bit of code. It differs
from MADV_DONTNEED in that the pages are not discarded immediately; they
are only discarded under memory pressure.

This is very much influenced by the Android ashmem interface by Robert
Love, so credit to him and the Android developers. In many cases the code
and logic come directly from the ashmem patch. The intent of this patch is
to allow for ashmem-like behavior, but to embed the idea a little deeper
in the VM code rather than isolating it in a specific driver.

Note that this only provides half of the ashmem functionality: ashmem
works on files as well, so similar fadvise calls will be needed to provide
full ashmem coverage.

Also many thanks to Dave Hansen, who helped design and develop the initial
version of this patch and has provided continued review and mentoring in
the VM code.
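To make the intended usage concrete, here is a rough, untested userspace
sketch (not part of the patch itself). The MADV_* values simply mirror the
mman-common.h hunk below and are defined locally because libc headers do
not know about them yet; the cache_unpin()/cache_pin() names are just for
illustration, and the return-value convention follows the
madvise_nonvolatile() comment in this patch.

#include <sys/mman.h>
#include <stdio.h>

#ifndef MADV_VOLATILE
#define MADV_VOLATILE    16
#define MADV_ISVOLATILE  17
#define MADV_NONVOLATILE 18
#endif

/* Done with a cached buffer for now: let the kernel purge it if it must. */
static int cache_unpin(void *addr, size_t len)
{
        return madvise(addr, len, MADV_VOLATILE);
}

/*
 * About to reuse the buffer: re-pin it.  madvise() returns 1 here if any
 * pages in the range were purged while volatile, meaning the cached
 * contents must be regenerated; 0 means the data is still intact.
 */
static int cache_pin(void *addr, size_t len)
{
        int purged = madvise(addr, len, MADV_NONVOLATILE);

        if (purged < 0)
                perror("madvise(MADV_NONVOLATILE)");
        return purged;
}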
Signed-off-by: John Stultz
---
 fs/inode.c                        |    1 +
 include/asm-generic/mman-common.h |    3 +
 include/linux/fs.h                |   52 +++++++++
 mm/madvise.c                      |  206 +++++++++++++++++++++++++++++++++++++
 mm/shmem.c                        |   13 +++
 5 files changed, 275 insertions(+), 0 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index ee4e66b..c1f55f4 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -278,6 +278,7 @@ void address_space_init_once(struct address_space *mapping)
         spin_lock_init(&mapping->private_lock);
         INIT_RAW_PRIO_TREE_ROOT(&mapping->i_mmap);
         INIT_LIST_HEAD(&mapping->i_mmap_nonlinear);
+        INIT_LIST_HEAD(&mapping->unpinned_list);
 }
 EXPORT_SYMBOL(address_space_init_once);

diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
index 787abbb..adf1565 100644
--- a/include/asm-generic/mman-common.h
+++ b/include/asm-generic/mman-common.h
@@ -47,6 +47,9 @@
 #define MADV_HUGEPAGE    14        /* Worth backing with hugepages */
 #define MADV_NOHUGEPAGE  15        /* Not worth backing with hugepages */
 
+#define MADV_VOLATILE    16        /* _can_ toss, but don't toss now */
+#define MADV_ISVOLATILE  17        /* Returns volatile flag for region */
+#define MADV_NONVOLATILE 18        /* Remove VOLATILE flag */
 
 /* compatibility flags */
 #define MAP_FILE         0

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0c4df26..6d184a1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -632,6 +632,57 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
                                 loff_t pos, unsigned len, unsigned copied,
                                 struct page *page, void *fsdata);
 
+
+/* unpinned_mem_range & range helpers from Robert Love's Ashmem patch */
+struct unpinned_mem_range {
+        /*
+         * List is sorted, and no two ranges
+         * on the same list should overlap.
+         */
+        struct list_head unpinned;
+        size_t start_byte;
+        size_t end_byte;
+        unsigned int purged;
+};
+
+static inline bool mem_range_subsumes_range(struct unpinned_mem_range *range,
+                                size_t start_addr, size_t end_addr)
+{
+        return (range->start_byte >= start_addr)
+                && (range->end_byte <= end_addr);
+}
+
+static inline bool mem_range_subsumed_by_range(
+                                struct unpinned_mem_range *range,
+                                size_t start_addr, size_t end_addr)
+{
+        return (range->start_byte <= start_addr)
+                && (range->end_byte >= end_addr);
+}
+
+static inline bool address_in_range(struct unpinned_mem_range *range,
+                                size_t addr)
+{
+        return (range->start_byte <= addr) && (range->end_byte > addr);
+}
+
+static inline bool mem_range_in_range(struct unpinned_mem_range *range,
+                                size_t start_addr, size_t end_addr)
+{
+        return address_in_range(range, start_addr) ||
+                address_in_range(range, end_addr) ||
+                mem_range_subsumes_range(range, start_addr, end_addr);
+}
+
+static inline bool range_before_address(struct unpinned_mem_range *range,
+                                size_t addr)
+{
+        return range->end_byte < addr;
+}
+
+
 struct backing_dev_info;
 struct address_space {
         struct inode            *host;          /* owner: inode, block_device */
@@ -650,6 +701,7 @@ struct address_space {
         spinlock_t              private_lock;   /* for use by the address_space */
         struct list_head        private_list;   /* ditto */
         struct address_space    *assoc_mapping; /* ditto */
+        struct list_head        unpinned_list;  /* unpinned range list */
 } __attribute__((aligned(sizeof(long))));
         /*
          * On most architectures that alignment is already the case; but

diff --git a/mm/madvise.c b/mm/madvise.c
index 74bf193..443c00b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -225,6 +225,203 @@ static long madvise_remove(struct vm_area_struct *vma,
         return error;
 }
 
+
+/*
+ * Allocates an unpinned_mem_range, and adds it to the address_space's
+ * unpinned list
+ */
+static int unpinned_range_alloc(struct unpinned_mem_range *prev_range,
+                                unsigned int purged, size_t start, size_t end)
+{
+        struct unpinned_mem_range *range;
+
+        range = kzalloc(sizeof(struct unpinned_mem_range), GFP_KERNEL);
+        if (!range)
+                return -ENOMEM;
+
+        range->start_byte = start;
+        range->end_byte = end;
+        range->purged = purged;
+
+        list_add_tail(&range->unpinned, &prev_range->unpinned);
+
+        return 0;
+}
+
+/*
+ * Deletes an unpinned_mem_range, removing it from the address_space's
+ * unpinned list
+ */
+static void unpinned_range_del(struct unpinned_mem_range *range)
+{
+        list_del(&range->unpinned);
+        kfree(range);
+}
+
+/*
+ * Resizes an unpinned_mem_range
+ */
+static inline void unpinned_range_shrink(struct unpinned_mem_range *range,
+                                size_t start, size_t end)
+{
+        range->start_byte = start;
+        range->end_byte = end;
+}
+
+/*
+ * Mark a region as volatile, allowing dirty pages to be purged
+ * under memory pressure
+ */
+static long madvise_volatile(struct vm_area_struct *vma,
+                                unsigned long start_vaddr,
+                                unsigned long end_vaddr)
+{
+        struct unpinned_mem_range *range, *next;
+        unsigned long start, end;
+        unsigned int purged = 0;
+        int ret;
+        struct address_space *mapping;
+
+        if (!vma->vm_file)
+                return -EINVAL;
+        mapping = vma->vm_file->f_mapping;
+
+        /* remove the vma offset */
+        start = start_vaddr - vma->vm_start;
+        end = end_vaddr - vma->vm_start;
+
+restart:
+        /* Iterate through the sorted range list */
+        list_for_each_entry_safe(range, next, &mapping->unpinned_list,
+                                        unpinned) {
+                /*
+                 * If the current existing range is before the start
+                 * of the new range, then we're done, since the list is
+                 * sorted.
+                 */
+                if (range_before_address(range, start))
+                        break;
+                /*
+                 * If the new range is already covered by the existing
+                 * range, then there is nothing we need to do.
+                 */
+                if (mem_range_subsumed_by_range(range, start, end))
+                        return 0;
+                /*
+                 * Coalesce if the new range overlaps the existing range,
+                 * by growing the new range to cover the existing range,
+                 * deleting the existing range, and starting over.
+                 * Starting over is necessary to make sure we also coalesce
+                 * any other ranges we overlap with.
+                 */
+                if (mem_range_in_range(range, start, end)) {
+                        start = min_t(size_t, range->start_byte, start);
+                        end = max_t(size_t, range->end_byte, end);
+                        purged |= range->purged;
+                        unpinned_range_del(range);
+                        goto restart;
+                }
+        }
+        /* Allocate the new range and add it to the list */
+        ret = unpinned_range_alloc(range, purged, start, end);
+        return ret;
+}
+
+/*
+ * Mark a region as nonvolatile; returns 1 if any pages in the region
+ * were purged.
+ */
+static long madvise_nonvolatile(struct vm_area_struct *vma,
+                                unsigned long start_vaddr,
+                                unsigned long end_vaddr)
+{
+        struct unpinned_mem_range *range, *next;
+        struct address_space *mapping;
+        unsigned long start, end;
+        int ret = 0;
+
+        if (!vma->vm_file)
+                return -EINVAL;
+        mapping = vma->vm_file->f_mapping;
+
+        /* remove the vma offset */
+        start = start_vaddr - vma->vm_start;
+        end = end_vaddr - vma->vm_start;
+
+        list_for_each_entry_safe(range, next, &mapping->unpinned_list,
+                                        unpinned) {
+                if (range_before_address(range, start))
+                        break;
+
+                if (mem_range_in_range(range, start, end)) {
+                        ret |= range->purged;
+                        /* Case #1: Easy. Just nuke the whole thing. */
+                        if (mem_range_subsumes_range(range, start, end)) {
+                                unpinned_range_del(range);
+                                continue;
+                        }
+
+                        /* Case #2: We overlap from the start, so adjust it */
+                        if (range->start_byte >= start) {
+                                unpinned_range_shrink(range, end + 1,
+                                                        range->end_byte);
+                                continue;
+                        }
+
+                        /* Case #3: We overlap from the rear, so adjust it */
+                        if (range->end_byte <= end) {
+                                unpinned_range_shrink(range, range->start_byte,
+                                                        start - 1);
+                                continue;
+                        }
+
+                        /*
+                         * Case #4: We eat a chunk out of the middle. A bit
+                         * more complicated, we allocate a new range for the
+                         * second half and adjust the first chunk's endpoint.
+                         */
+                        unpinned_range_alloc(range,
+                                        range->purged, end + 1,
+                                        range->end_byte);
+                        unpinned_range_shrink(range, range->start_byte,
+                                        start - 1);
+                }
+        }
+        return ret;
+}
+
+/*
+ * Returns whether a region has been marked volatile.
+ * Does not report whether the region has been purged.
+ */
+static long madvise_isvolatile(struct vm_area_struct *vma,
+                                unsigned long start_vaddr,
+                                unsigned long end_vaddr)
+{
+        struct unpinned_mem_range *range;
+        struct address_space *mapping;
+        unsigned long start, end;
+        long ret = 0;
+
+        if (!vma->vm_file)
+                return -EINVAL;
+        mapping = vma->vm_file->f_mapping;
+
+        /* remove the vma offset */
+        start = start_vaddr - vma->vm_start;
+        end = end_vaddr - vma->vm_start;
+
+        list_for_each_entry(range, &mapping->unpinned_list, unpinned) {
+                if (range_before_address(range, start))
+                        break;
+                if (mem_range_in_range(range, start, end)) {
+                        ret = 1;
+                        break;
+                }
+        }
+        return ret;
+}
+
 #ifdef CONFIG_MEMORY_FAILURE
 /*
  * Error injection support for memory error handling.
@@ -268,6 +465,12 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
                 return madvise_willneed(vma, prev, start, end);
         case MADV_DONTNEED:
                 return madvise_dontneed(vma, prev, start, end);
+        case MADV_VOLATILE:
+                return madvise_volatile(vma, start, end);
+        case MADV_ISVOLATILE:
+                return madvise_isvolatile(vma, start, end);
+        case MADV_NONVOLATILE:
+                return madvise_nonvolatile(vma, start, end);
         default:
                 return madvise_behavior(vma, prev, start, end, behavior);
         }
@@ -293,6 +496,9 @@ madvise_behavior_valid(int behavior)
         case MADV_HUGEPAGE:
         case MADV_NOHUGEPAGE:
 #endif
+        case MADV_VOLATILE:
+        case MADV_ISVOLATILE:
+        case MADV_NONVOLATILE:
                 return 1;
 
         default:

diff --git a/mm/shmem.c b/mm/shmem.c
index d672250..f340c11 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -679,6 +679,19 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
         index = page->index;
         inode = mapping->host;
         info = SHMEM_I(inode);
+
+        if (!list_empty(&mapping->unpinned_list)) {
+                struct unpinned_mem_range *range, *next;
+                list_for_each_entry_safe(range, next, &mapping->unpinned_list,
+                                        unpinned) {
+                        if (address_in_range(range, index << PAGE_SHIFT)) {
+                                range->purged = 1;
+                                unlock_page(page);
+                                return 0;
+                        }
+                }
+        }
+
         if (info->flags & VM_LOCKED)
                 goto redirty;
         if (!total_swap_pages)