From patchwork Wed Sep 27 23:32:23 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Nicolas Pitre <nicolas.pitre@linaro.org>
X-Patchwork-Id: 114383
Delivered-To: patch@linaro.org
Received: by 10.140.106.117 with SMTP id d108csp61457qgf;
 Wed, 27 Sep 2017 16:33:52 -0700 (PDT)
X-Google-Smtp-Source: AOwi7QDCz7RsbCPYjKRwjqyPieL1nBXUAC6lCLbIhQ66fL5NNuSxKcKQjrWeLAvVOjNgEISBmqS6
X-Received: by 10.99.3.216 with SMTP id 207mr2586413pgd.15.1506555232220;
 Wed, 27 Sep 2017 16:33:52 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1506555232; cv=none;
 d=google.com; s=arc-20160816;
 b=c9H6cdajSN0V4jzjwcYoE2GpLIuKBCJn0dlKik6WzUzlcZ4xH/M3/TWnChW40d5+Oj
 YLeljcoZ0m5Pfynm1qVMtPRhua7T/T6ZSkFuJBbnWmd1GZItSofSAmsTiPzp890bax9T
 OaxdY2+xpd6uVlDb6NZn+mc9ZJYvqEzy0TwtePi/NNI2AHKWAR1VVv0yVvoY/Jj+3x4t
 lh40wcVYlY/uapFmoJg8hmUdfPlnjMCM6p6BomGHoq342REOsX7z1QS6I0ZD227wqvNw
 J6XuupRAuBaXCbqLK5wq74/ES/XJU/Yly2ClaXriFgpHmDTyZPBy39stb4Qava01t7y/
 0T0Q==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816; 
 h=list-id:precedence:sender:references:in-reply-to:message-id:date
 :subject:cc:to:from:dkim-signature:arc-authentication-results;
 bh=9CxxbiRE2o1ixBbspkW6d1ORYdnNghmAeO9+IsOoWc4=;
 b=zSKbNj532WunK5vdfYe4SJjDV9LsLDKOWonzqqLqPJKQyuJexF+FA032bbv6KzHZic
 4XPtXZxhp+QQiA4hA8yxyT1IEn8upX45OTOExs5MMLdPBtNQIyxMEz7dD5S3eYKnvnVo
 zYHucF7ReIohBOZZErnnCl4Yb6s6oGkitkHOOYOyAvze7Y/JBlK+NjA2c6QhgkMiyC6U
 nMIAW/gBHMSQmlSeT2RdJ+iZ6jhMTvHIO4KpUlSYwOZAaByBMwwK1FKIbgWdeKeiBJ8l
 a19iN5LMxNbn6o4Wdl1Dz8TceBw/KvOmppeQKswp2NHguR9ve1Gxj+Qu4S16o8uNxub7
 O3eg==
ARC-Authentication-Results: i=1; mx.google.com;
 dkim=pass header.i=@pobox.com header.s=sasl header.b=Gd+NL0Z8;
 spf=pass (google.com: best guess record for domain of
 linux-kernel-owner@vger.kernel.org designates 209.132.180.67
 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org; 
 dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=linaro.org
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67])
 by mx.google.com with ESMTP id
 y187si104444pgb.132.2017.09.27.16.33.51; 
 Wed, 27 Sep 2017 16:33:52 -0700 (PDT)
Received-SPF: pass (google.com: best guess record for domain of
 linux-kernel-owner@vger.kernel.org designates 209.132.180.67
 as permitted sender) client-ip=209.132.180.67; 
Authentication-Results: mx.google.com;
 dkim=pass header.i=@pobox.com header.s=sasl header.b=Gd+NL0Z8;
 spf=pass (google.com: best guess record for domain of
 linux-kernel-owner@vger.kernel.org designates 209.132.180.67
 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org; 
 dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=linaro.org
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S1752603AbdI0XdX (ORCPT <rfc822;daniel.diaz@linaro.org>
 + 26 others); Wed, 27 Sep 2017 19:33:23 -0400
Received: from pb-smtp2.pobox.com ([64.147.108.71]:63030 "EHLO
 sasl.smtp.pobox.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org
 with ESMTP id S1752357AbdI0Xcb (ORCPT
 <rfc822;linux-kernel@vger.kernel.org>);
 Wed, 27 Sep 2017 19:32:31 -0400
Received: from sasl.smtp.pobox.com (unknown [127.0.0.1])
 by pb-smtp2.pobox.com (Postfix) with ESMTP id 09E07AD2F1;
 Wed, 27 Sep 2017 19:32:31 -0400 (EDT)
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=pobox.com; h=from:to:cc
 :subject:date:message-id:in-reply-to:references; s=sasl; bh=dt4a
 6Iq/z2MDXTSnyJBy6MMLnoE=; b=Gd+NL0Z8DuxiI4sYvJupRgO/tU4mw41Uz0Yf
 o5WtxmHXjlqKimrNRZ+NEVOupFXtBKLXgPde85Sa9SPpdwchAsRAVRfWNLNxcLsh
 33RtrbiN8ly4Aw4Xp2EV8HYPo4pV1+Z0NUi7zJDTo7/jexobyJ45S40jffbrvEHq
 Ysh2wM4=
Received: from pb-smtp2.nyi.icgroup.com (unknown [127.0.0.1])
 by pb-smtp2.pobox.com (Postfix) with ESMTP id 00498AD2F0;
 Wed, 27 Sep 2017 19:32:31 -0400 (EDT)
Received: from yoda.home (unknown [137.175.234.45])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256
 bits)) (No client certificate requested)
 by pb-smtp2.pobox.com (Postfix) with ESMTPSA id 44B45AD2EC;
 Wed, 27 Sep 2017 19:32:30 -0400 (EDT)
Received: from xanadu.home (xanadu.home [192.168.2.2])
 by yoda.home (Postfix) with ESMTP id 2C4952DA0679;
 Wed, 27 Sep 2017 19:32:29 -0400 (EDT)
From: Nicolas Pitre <nicolas.pitre@linaro.org>
To: Alexander Viro <viro@zeniv.linux.org.uk>, linux-mm@kvack.org
Cc: linux-fsdevel@vger.kernel.org, linux-embedded@vger.kernel.org,
 linux-kernel@vger.kernel.org, Chris Brandt <Chris.Brandt@renesas.com>
Subject: [PATCH v4 4/5] cramfs: add mmap support
Date: Wed, 27 Sep 2017 19:32:23 -0400
Message-Id: <20170927233224.31676-5-nicolas.pitre@linaro.org>
X-Mailer: git-send-email 2.9.5
In-Reply-To: <20170927233224.31676-1-nicolas.pitre@linaro.org>
References: <20170927233224.31676-1-nicolas.pitre@linaro.org>
X-Pobox-Relay-ID: 212FA3AC-A3DC-11E7-9B7F-9D2B0D78B957-78420484!pb-smtp2.pobox.com
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

When cramfs_physmem is used then we have the opportunity to map files
directly from ROM, directly into user space, saving on RAM usage.
This gives us Execute-In-Place (XIP) support.

For a file to be mmap()-able, the map area has to correspond to a range
of uncompressed and contiguous blocks, and in the MMU case it also has
to be page aligned. A version of mkcramfs with appropriate support is
necessary to create such a filesystem image.

In the MMU case it may happen for a vma structure to extend beyond the
actual file size. This is notably the case in binfmt_elf.c:elf_map().
Or the file's last block is shared with other files and cannot be mapped
as is. Rather than refusing to mmap it, we do a partial map and set up
a special vm_ops fault handler that splits the vma in two: the direct
mapping vma and the memory-backed vma populated by the readpage method.
In practice the unmapped area is seldom accessed so the split might never
occur before this area is discarded.

In the non-MMU case it is the get_unmapped_area method that is responsible
for providing the address where the actual data can be found. No mapping
is necessary of course.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Tested-by: Chris Brandt <chris.brandt@renesas.com>
---
 fs/cramfs/Kconfig |   2 +-
 fs/cramfs/inode.c | 295 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 296 insertions(+), 1 deletion(-)

-- 
2.9.5
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Tested-by: Chris Brandt <chris.brandt@renesas.com>

diff --git a/fs/cramfs/Kconfig b/fs/cramfs/Kconfig
index 5b4e0b7e13..306549be25 100644
--- a/fs/cramfs/Kconfig
+++ b/fs/cramfs/Kconfig
@@ -30,7 +30,7 @@ config CRAMFS_BLOCKDEV
 
 config CRAMFS_PHYSMEM
 	bool "Support CramFs image directly mapped in physical memory"
-	depends on CRAMFS
+	depends on CRAMFS = y
 	default y if !CRAMFS_BLOCKDEV
 	help
 	  This option allows the CramFs driver to load data directly from
diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index 2fc886092b..1d7d61354b 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -15,7 +15,9 @@
 
 #include <linux/module.h>
 #include <linux/fs.h>
+#include <linux/file.h>
 #include <linux/pagemap.h>
+#include <linux/ramfs.h>
 #include <linux/init.h>
 #include <linux/string.h>
 #include <linux/blkdev.h>
@@ -49,6 +51,7 @@ static inline struct cramfs_sb_info *CRAMFS_SB(struct super_block *sb)
 static const struct super_operations cramfs_ops;
 static const struct inode_operations cramfs_dir_inode_operations;
 static const struct file_operations cramfs_directory_operations;
+static const struct file_operations cramfs_physmem_fops;
 static const struct address_space_operations cramfs_aops;
 
 static DEFINE_MUTEX(read_mutex);
@@ -96,6 +99,10 @@ static struct inode *get_cramfs_inode(struct super_block *sb,
 	case S_IFREG:
 		inode->i_fop = &generic_ro_fops;
 		inode->i_data.a_ops = &cramfs_aops;
+		if (IS_ENABLED(CONFIG_CRAMFS_PHYSMEM) &&
+		    CRAMFS_SB(sb)->flags & CRAMFS_FLAG_EXT_BLOCK_POINTERS &&
+		    CRAMFS_SB(sb)->linear_phys_addr)
+			inode->i_fop = &cramfs_physmem_fops;
 		break;
 	case S_IFDIR:
 		inode->i_op = &cramfs_dir_inode_operations;
@@ -277,6 +284,294 @@ static void *cramfs_read(struct super_block *sb, unsigned int offset,
 		return NULL;
 }
 
+/*
+ * For a mapping to be possible, we need a range of uncompressed and
+ * contiguous blocks. Return the offset for the first block and number of
+ * valid blocks for which that is true, or zero otherwise.
+ */
+static u32 cramfs_get_block_range(struct inode *inode, u32 pgoff, u32 *pages)
+{
+	struct super_block *sb = inode->i_sb;
+	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+	int i;
+	u32 *blockptrs, blockaddr;
+
+	/*
+	 * We can dereference memory directly here as this code may be
+	 * reached only when there is a direct filesystem image mapping
+	 * available in memory.
+	 */
+	blockptrs = (u32 *)(sbi->linear_virt_addr + OFFSET(inode) + pgoff*4);
+	blockaddr = blockptrs[0] & ~CRAMFS_BLK_FLAGS;
+	i = 0;
+	do {
+		u32 expect = blockaddr + i * (PAGE_SIZE >> 2);
+		expect |= CRAMFS_BLK_FLAG_DIRECT_PTR|CRAMFS_BLK_FLAG_UNCOMPRESSED;
+		if (blockptrs[i] != expect) {
+			pr_debug("range: block %d/%d got %#x expects %#x\n",
+				 pgoff+i, pgoff+*pages-1, blockptrs[i], expect);
+			if (i == 0)
+				return 0;
+			break;
+		}
+	} while (++i < *pages);
+
+	*pages = i;
+
+	/* stored "direct" block ptrs are shifted down by 2 bits */
+	return blockaddr << 2;
+}
+
+/*
+ * It is possible for cramfs_physmem_mmap() to partially populate the mapping
+ * causing page faults in the unmapped area. When that happens, we need to
+ * split the vma so that the unmapped area gets its own vma that can be backed
+ * with actual memory pages and loaded normally. This is necessary because
+ * remap_pfn_range() overwrites vma->vm_pgoff with the pfn and filemap_fault()
+ * no longer works with it. Furthermore this makes /proc/x/maps right.
+ * Q: is there a way to do split vma at mmap() time?
+ */
+static const struct vm_operations_struct cramfs_vmasplit_ops;
+static int cramfs_vmasplit_fault(struct vm_fault *vmf)
+{
+	struct mm_struct *mm = vmf->vma->vm_mm;
+	struct vm_area_struct *vma, *new_vma;
+	struct file *vma_file = get_file(vmf->vma->vm_file);
+	unsigned long split_val, split_addr;
+	unsigned int split_pgoff;
+	int ret;
+
+	/* We have some vma surgery to do and need the write lock. */
+	up_read(&mm->mmap_sem);
+	if (down_write_killable(&mm->mmap_sem)) {
+		fput(vma_file);
+		return VM_FAULT_RETRY;
+	}
+
+	/* Make sure the vma didn't change between the locks */
+	ret = VM_FAULT_SIGSEGV;
+	vma = find_vma(mm, vmf->address);
+	if (!vma)
+		goto out_fput;
+
+	/*
+	 * Someone else might have raced with us and handled the fault,
+	 * changed the vma, etc. If so let it go back to user space and
+	 * fault again if necessary.
+	 */
+	ret = VM_FAULT_NOPAGE;
+	if (vma->vm_ops != &cramfs_vmasplit_ops || vma->vm_file != vma_file)
+		goto out_fput;
+	fput(vma_file);
+
+	/* Retrieve the vma split address and validate it */
+	split_val = (unsigned long)vma->vm_private_data;
+	split_pgoff = split_val & 0xfff;
+	split_addr = (split_val >> 12) << PAGE_SHIFT;
+	if (split_addr < vma->vm_start) {
+		/* bottom of vma was unmapped */
+		split_pgoff += (vma->vm_start - split_addr) >> PAGE_SHIFT;
+		split_addr = vma->vm_start;
+	}
+	pr_debug("fault: addr=%#lx vma=%#lx-%#lx split=%#lx\n",
+		 vmf->address, vma->vm_start, vma->vm_end, split_addr);
+	ret = VM_FAULT_SIGSEGV;
+	if (!split_val || split_addr > vmf->address || vma->vm_end <= vmf->address)
+		goto out;
+
+	if (unlikely(vma->vm_start == split_addr)) {
+		/* nothing to split */
+		new_vma = vma;
+	} else {
+		/* Split away the directly mapped area */
+		ret = VM_FAULT_OOM;
+		if (split_vma(mm, vma, split_addr, 0) != 0)
+			goto out;
+
+		/* The direct vma should no longer ever fault */
+		vma->vm_ops = NULL;
+
+		/* Retrieve the new vma covering the unmapped area */
+		new_vma = find_vma(mm, split_addr);
+		BUG_ON(new_vma == vma);
+		ret = VM_FAULT_SIGSEGV;
+		if (!new_vma)
+			goto out;
+	}
+
+	/*
+	 * Readjust the new vma with the actual file based pgoff and
+	 * process the fault normally on it.
+	 */
+	new_vma->vm_pgoff = split_pgoff;
+	new_vma->vm_ops = &generic_file_vm_ops;
+	new_vma->vm_flags &= ~(VM_IO | VM_PFNMAP | VM_DONTEXPAND);
+	vmf->vma = new_vma;
+	vmf->pgoff = split_pgoff;
+	vmf->pgoff += (vmf->address - new_vma->vm_start) >> PAGE_SHIFT;
+	downgrade_write(&mm->mmap_sem);
+	return filemap_fault(vmf);
+
+out_fput:
+	fput(vma_file);
+out:
+	downgrade_write(&mm->mmap_sem);
+	return ret;
+}
+
+static const struct vm_operations_struct cramfs_vmasplit_ops = {
+	.fault	= cramfs_vmasplit_fault,
+};
+
+static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(file);
+	struct super_block *sb = inode->i_sb;
+	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+	unsigned int pages, vma_pages, max_pages, offset;
+	unsigned long address;
+	char *fail_reason;
+	int ret;
+
+	if (!IS_ENABLED(CONFIG_MMU))
+		return vma->vm_flags & (VM_SHARED | VM_MAYSHARE) ? 0 : -ENOSYS;
+
+	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
+		return -EINVAL;
+
+	/* Could COW work here? */
+	fail_reason = "vma is writable";
+	if (vma->vm_flags & VM_WRITE)
+		goto fail;
+
+	vma_pages = (vma->vm_end - vma->vm_start + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	max_pages = (inode->i_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	fail_reason = "beyond file limit";
+	if (vma->vm_pgoff >= max_pages)
+		goto fail;
+	pages = vma_pages;
+	if (pages > max_pages - vma->vm_pgoff)
+		pages = max_pages - vma->vm_pgoff;
+
+	offset = cramfs_get_block_range(inode, vma->vm_pgoff, &pages);
+	fail_reason = "unsuitable block layout";
+	if (!offset)
+		goto fail;
+	address = sbi->linear_phys_addr + offset;
+	fail_reason = "data is not page aligned";
+	if (!PAGE_ALIGNED(address))
+		goto fail;
+
+	/* Don't map the last page if it contains some other data */
+	if (unlikely(vma->vm_pgoff + pages == max_pages)) {
+		unsigned int partial = offset_in_page(inode->i_size);
+		if (partial) {
+			char *data = sbi->linear_virt_addr + offset;
+			data += (max_pages - 1) * PAGE_SIZE + partial;
+			while ((unsigned long)data & 7)
+				if (*data++ != 0)
+					goto nonzero;
+			while (offset_in_page(data)) {
+				if (*(u64 *)data != 0) {
+					nonzero:
+					pr_debug("mmap: %s: last page is shared\n",
+						 file_dentry(file)->d_name.name);
+					pages--;
+					break;
+				}
+				data += 8;
+			}
+		}
+	}
+
+	if (pages) {
+		/*
+		 * If we can't map it all, page faults will occur if the
+		 * unmapped area is accessed. Let's handle them to split the
+		 * vma and let the normal paging machinery take care of the
+		 * rest through cramfs_readpage(). Because remap_pfn_range()
+		 * repurposes vma->vm_pgoff, we have to save it somewhere.
+		 * Let's use vma->vm_private_data to hold both the pgoff and
+		 * the actual address split point. Maximum file size is 16MB
+		 * (12 bits pgoff) and max 20 bits pfn where a long is 32 bits
+		 * so we can pack both together.
+		 */
+		if (pages != vma_pages) {
+			unsigned int split_pgoff = vma->vm_pgoff + pages;
+			unsigned long split_pfn = (vma->vm_start >> PAGE_SHIFT) + pages;
+			unsigned long split_val = split_pgoff | (split_pfn << 12);
+			vma->vm_private_data = (void *)split_val;
+			vma->vm_ops = &cramfs_vmasplit_ops;
+			/* to keep remap_pfn_range() happy */
+			vma->vm_end = vma->vm_start + pages * PAGE_SIZE;
+		}
+
+		ret = remap_pfn_range(vma, vma->vm_start, address >> PAGE_SHIFT,
+				      pages * PAGE_SIZE, vma->vm_page_prot);
+		/* restore vm_end in case we cheated it above */
+		vma->vm_end = vma->vm_start + vma_pages * PAGE_SIZE;
+		if (ret)
+			return ret;
+
+		pr_debug("mapped %s at 0x%08lx (%u/%u pages) to vma 0x%08lx, "
+			 "page_prot 0x%llx\n", file_dentry(file)->d_name.name,
+			 address, pages, vma_pages, vma->vm_start,
+			 (unsigned long long)pgprot_val(vma->vm_page_prot));
+		return 0;
+	}
+	fail_reason = "no suitable block remaining";
+
+fail:
+	pr_debug("%s: direct mmap failed: %s\n",
+		 file_dentry(file)->d_name.name, fail_reason);
+
+	/* We failed to do a direct map, but normal paging will do it */
+	vma->vm_ops = &generic_file_vm_ops;
+	return 0;
+}
+
+#ifndef CONFIG_MMU
+
+static unsigned long cramfs_physmem_get_unmapped_area(struct file *file,
+			unsigned long addr, unsigned long len,
+			unsigned long pgoff, unsigned long flags)
+{
+	struct inode *inode = file_inode(file);
+	struct super_block *sb = inode->i_sb;
+	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+	unsigned int pages, block_pages, max_pages, offset;
+
+	pages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	max_pages = (inode->i_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (pgoff >= max_pages || pages > max_pages - pgoff)
+		return -EINVAL;
+	block_pages = pages;
+	offset = cramfs_get_block_range(inode, pgoff, &block_pages);
+	if (!offset || block_pages != pages)
+		return -ENOSYS;
+	addr = sbi->linear_phys_addr + offset;
+	pr_debug("get_unmapped for %s ofs %#lx siz %lu at 0x%08lx\n",
+		 file_dentry(file)->d_name.name, pgoff*PAGE_SIZE, len, addr);
+	return addr;
+}
+
+static unsigned cramfs_physmem_mmap_capabilities(struct file *file)
+{
+	return NOMMU_MAP_COPY | NOMMU_MAP_DIRECT | NOMMU_MAP_READ | NOMMU_MAP_EXEC;
+}
+#endif
+
+static const struct file_operations cramfs_physmem_fops = {
+	.llseek			= generic_file_llseek,
+	.read_iter		= generic_file_read_iter,
+	.splice_read		= generic_file_splice_read,
+	.mmap			= cramfs_physmem_mmap,
+#ifndef CONFIG_MMU
+	.get_unmapped_area	= cramfs_physmem_get_unmapped_area,
+	.mmap_capabilities	= cramfs_physmem_mmap_capabilities,
+#endif
+};
+
 static void cramfs_blkdev_kill_sb(struct super_block *sb)
 {
 	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);