mbox series

[0/5] cramfs refresh for embedded usage

Message ID 20170811192252.19062-1-nicolas.pitre@linaro.org
Headers show
Series cramfs refresh for embedded usage | expand

Message

Nicolas Pitre Aug. 11, 2017, 7:22 p.m. UTC
This series brings a nice refresh to the cramfs filesystem, adding the
following capabilities:

- Direct memory access, bypassing the block and/or MTD layers entirely.

- Ability to store individual data blocks uncompressed.

- Ability to locate individual data blocks anywhere in the filesystem.

The end result is a very tight filesystem that can be accessed directly
from ROM without any other subsystem underneath. Also this allows for
user space XIP which is a very important feature for tiny embedded
systems.

Why cramfs?

Because cramfs is very simple and small. With CONFIG_CRAMFS_BLOCK=n and
CONFIG_CRAMFS_PHYSMEM=y the cramfs driver may use only 3704 bytes of code.
That's many times smaller than squashfs. And the runtime memory usage is
also much less with cramfs than squashfs. It packs very tightly already
compared to romfs which has no compression support. And the cramfs format
was simple to extend, allowing for both compressed and uncompressed blocks
within the same file.

Why not accessing ROM via MTD?

The MTD layer is nice and flexible. It also represents a huge overhead
considering its core with no other enabled options weights 19KB.
That's many times the size of the cramfs code for something that
essentially boils down to a glorified argument parser and a call to
memremap().  And if someone still wants to use cramfs via MTD then
it is already possible with mtdblock.

Of course, while this cramfs remains backward compatible with existing
filesystem images, a newer mkcramfs version is necessary to take advantage
of the extended data layout. I created a version of mkcramfs that
detects ELF files and marks text+rodata segments for XIP and compresses the
rest automatically.

So here it is. I'm also willing to step up as cramfs maintainer given
that no sign of any maintenance activities appeared for years.

This series is also available based on v4.13-rc4 via git here:

http://git.linaro.org/people/nicolas.pitre/linux xipcramfs

diffstat:

 Documentation/filesystems/cramfs.txt |  35 ++
 MAINTAINERS                          |   4 +-
 fs/cramfs/Kconfig                    |  39 ++-
 fs/cramfs/README                     |  31 +-
 fs/cramfs/inode.c                    | 500 +++++++++++++++++++++++++----
 include/uapi/linux/cramfs_fs.h       |  20 +-
 init/do_mounts.c                     |   8 +
 7 files changed, 560 insertions(+), 77 deletions(-)

Comments

Chris Brandt Aug. 14, 2017, 5:11 p.m. UTC | #1
On Friday, August 11, 2017, Nicolas Pitre wrote:
> This series brings a nice refresh to the cramfs filesystem, adding the

> following capabilities:

> 

> - Direct memory access, bypassing the block and/or MTD layers entirely.

> 

> - Ability to store individual data blocks uncompressed.

> 

> - Ability to locate individual data blocks anywhere in the filesystem.

> 

> The end result is a very tight filesystem that can be accessed directly

> from ROM without any other subsystem underneath. Also this allows for

> user space XIP which is a very important feature for tiny embedded

> systems.




I just applied the patches tried this simple test:
 - tested with a Renesas RZ/A1 (Cortex-A9...so it has an MMU).
 - I set the sticky bit for busybox before using mkcramfs
 - booted (with squashfs) and mounted the cramfs image
 - confirmed that the sticky bit was still set on busybox
 - was able to execute busybox in the cramfs image


However, at this point I'm not sure how I can confirm that the XIP 
busybox actually executed as XIP or not.


Chris
Nicolas Pitre Aug. 14, 2017, 5:31 p.m. UTC | #2
On Mon, 14 Aug 2017, Chris Brandt wrote:

> On Friday, August 11, 2017, Nicolas Pitre wrote:

> > This series brings a nice refresh to the cramfs filesystem, adding the

> > following capabilities:

> > 

> > - Direct memory access, bypassing the block and/or MTD layers entirely.

> > 

> > - Ability to store individual data blocks uncompressed.

> > 

> > - Ability to locate individual data blocks anywhere in the filesystem.

> > 

> > The end result is a very tight filesystem that can be accessed directly

> > from ROM without any other subsystem underneath. Also this allows for

> > user space XIP which is a very important feature for tiny embedded

> > systems.

> 

> 

> 

> I just applied the patches tried this simple test:

>  - tested with a Renesas RZ/A1 (Cortex-A9...so it has an MMU).

>  - I set the sticky bit for busybox before using mkcramfs


You need the newer mkcramfs I linked to in the documentation. With it 
you don't need to play tricks with the sticky bit anymore. However you 
need to specify -X twice (or just once for no-MMU targets) and it will 
make every ELF files XIPable automatically.

>  - booted (with squashfs) and mounted the cramfs image

>  - confirmed that the sticky bit was still set on busybox

>  - was able to execute busybox in the cramfs image

> 

> 

> However, at this point I'm not sure how I can confirm that the XIP 

> busybox actually executed as XIP or not.


Just use busybox's built-in cat command and dump the content of 
/proc/self/maps. You should see an offset that refers to a physical 
address within your cramfs image for those segments marked read-only and 
executable.


Nicolas
Chris Brandt Aug. 14, 2017, 6:01 p.m. UTC | #3
On Monday, August 14, 2017, Nicolas Pitre wrote:
> > I just applied the patches tried this simple test:

> >  - tested with a Renesas RZ/A1 (Cortex-A9...so it has an MMU).

> >  - I set the sticky bit for busybox before using mkcramfs

> 

> You need the newer mkcramfs I linked to in the documentation. With it

> you don't need to play tricks with the sticky bit anymore. However you

> need to specify -X twice (or just once for no-MMU targets) and it will

> make every ELF files XIPable automatically.


OK. Now I am getting bigger images that makes me think all the ELF files
are uncompressed.


> > However, at this point I'm not sure how I can confirm that the XIP

> > busybox actually executed as XIP or not.

> 

> Just use busybox's built-in cat command and dump the content of

> /proc/self/maps. You should see an offset that refers to a physical

> address within your cramfs image for those segments marked read-only and

> executable.


It works! Pretty cool.

$ /mnt/bin/busybox cat /proc/self/maps
00008000-000a1000 r-xp 1b005000 00:10 18192      /mnt/bin/busybox

  (my cramfs flash image is at physical address 0x1B000000)




However, now with your mkcramfs tool, I can no longer mount my cramfs 
image as the rootfs on boot. I was able to do that before (ie, 30 minutes 
ago) when using the community mkcramfs (ie, 30 minutes ago).

I get this:

[    1.712425] cramfs: checking physical address 0x1b000000 for linear cramfs image
[    1.720531] cramfs: linear cramfs image appears to be 15744 KB in size
[    1.728656] VFS: Mounted root (cramfs_physmem filesystem) readonly on device 0:12.
[    1.737062] devtmpfs: mounted
[    1.741139] Freeing unused kernel memory: 48K
[    1.745545] This architecture does not have kernel memory protection.
[    1.760381] Starting init: /sbin/init exists but couldn't execute it (error -22)
[    1.769685] Starting init: /bin/sh exists but couldn't execute it (error -14)
[    1.776956] Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
[    1.791192] CPU: 0 PID: 1 Comm: init Not tainted 4.13.0-rc1-00014-g53182a0b7245 #667
[    1.798959] Hardware name: Generic R7S72100 (Flattened Device Tree)
[    1.805519] [<bf809261>] (unwind_backtrace) from [<bf807aa3>] (show_stack+0xb/0xc)
[    1.813228] [<bf807aa3>] (show_stack) from [<bf810a1b>] (panic+0x6f/0x18c)
[    1.820163] [<bf810a1b>] (panic) from [<bfa9f067>] (kernel_init+0x6b/0x98)
[    1.827078] [<bfa9f067>] (kernel_init) from [<bf805011>] (ret_from_fork+0x11/0x20)
[    1.834747] ---[ end Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.


Chris
Nicolas Pitre Aug. 14, 2017, 6:17 p.m. UTC | #4
On Mon, 14 Aug 2017, Chris Brandt wrote:

> On Monday, August 14, 2017, Nicolas Pitre wrote:

> > > I just applied the patches tried this simple test:

> > >  - tested with a Renesas RZ/A1 (Cortex-A9...so it has an MMU).

> > >  - I set the sticky bit for busybox before using mkcramfs

> > 

> > You need the newer mkcramfs I linked to in the documentation. With it

> > you don't need to play tricks with the sticky bit anymore. However you

> > need to specify -X twice (or just once for no-MMU targets) and it will

> > make every ELF files XIPable automatically.

> 

> OK. Now I am getting bigger images that makes me think all the ELF files

> are uncompressed.


Yeah. No way around that of course. I listed a few TODO items to 
mitigate the alignment losses if you have many executables.

> > > However, at this point I'm not sure how I can confirm that the XIP

> > > busybox actually executed as XIP or not.

> > 

> > Just use busybox's built-in cat command and dump the content of

> > /proc/self/maps. You should see an offset that refers to a physical

> > address within your cramfs image for those segments marked read-only and

> > executable.

> 

> It works! Pretty cool.

> 

> $ /mnt/bin/busybox cat /proc/self/maps

> 00008000-000a1000 r-xp 1b005000 00:10 18192      /mnt/bin/busybox

> 

>   (my cramfs flash image is at physical address 0x1B000000)


Good!  Independent validation is always nice.

> However, now with your mkcramfs tool, I can no longer mount my cramfs 

> image as the rootfs on boot. I was able to do that before (ie, 30 minutes 

> ago) when using the community mkcramfs (ie, 30 minutes ago).

> 

> I get this:

> 

> [    1.712425] cramfs: checking physical address 0x1b000000 for linear cramfs image

> [    1.720531] cramfs: linear cramfs image appears to be 15744 KB in size

> [    1.728656] VFS: Mounted root (cramfs_physmem filesystem) readonly on device 0:12.

> [    1.737062] devtmpfs: mounted

> [    1.741139] Freeing unused kernel memory: 48K

> [    1.745545] This architecture does not have kernel memory protection.

> [    1.760381] Starting init: /sbin/init exists but couldn't execute it (error -22)

> [    1.769685] Starting init: /bin/sh exists but couldn't execute it (error -14)


Is /sbin/init a link to busybox?
I suppose it just boots if you do mkcramfs without -X?
If so could you share your non-working cramfs image with me?


Nicolas
Chris Brandt Aug. 14, 2017, 6:37 p.m. UTC | #5
On Monday, August 14, 2017, Nicolas Pitre wrote:
> > However, now with your mkcramfs tool, I can no longer mount my cramfs

> > image as the rootfs on boot. I was able to do that before (ie, 30

> minutes

> > ago) when using the community mkcramfs (ie, 30 minutes ago).

> >

> > I get this:

> >

> > [    1.712425] cramfs: checking physical address 0x1b000000 for linear

> cramfs image

> > [    1.720531] cramfs: linear cramfs image appears to be 15744 KB in

> size

> > [    1.728656] VFS: Mounted root (cramfs_physmem filesystem) readonly on

> device 0:12.

> > [    1.737062] devtmpfs: mounted

> > [    1.741139] Freeing unused kernel memory: 48K

> > [    1.745545] This architecture does not have kernel memory protection.

> > [    1.760381] Starting init: /sbin/init exists but couldn't execute it

> (error -22)

> > [    1.769685] Starting init: /bin/sh exists but couldn't execute it

> (error -14)

> 

> Is /sbin/init a link to busybox?


Yes.


> I suppose it just boots if you do mkcramfs without -X?


Correct. I just created another image and removed the "-X -X" when 
creating it. Now I can boot that image as my rootfs.
  (I'm using -X -X because I'm using a Cortex-A9 with MMU).


> If so could you share your non-working cramfs image with me?


I will send it (in a separate email)



Chris
Nicolas Pitre Aug. 15, 2017, 4:10 a.m. UTC | #6
On Mon, 14 Aug 2017, Chris Brandt wrote:

> On Monday, August 14, 2017, Nicolas Pitre wrote:

> > > However, now with your mkcramfs tool, I can no longer mount my cramfs

> > > image as the rootfs on boot. I was able to do that before (ie, 30

> > minutes

> > > ago) when using the community mkcramfs (ie, 30 minutes ago).

> > >

> > > I get this:

> > >

> > > [    1.712425] cramfs: checking physical address 0x1b000000 for linear

> > cramfs image

> > > [    1.720531] cramfs: linear cramfs image appears to be 15744 KB in

> > size

> > > [    1.728656] VFS: Mounted root (cramfs_physmem filesystem) readonly on

> > device 0:12.

> > > [    1.737062] devtmpfs: mounted

> > > [    1.741139] Freeing unused kernel memory: 48K

> > > [    1.745545] This architecture does not have kernel memory protection.

> > > [    1.760381] Starting init: /sbin/init exists but couldn't execute it

> > (error -22)

> > > [    1.769685] Starting init: /bin/sh exists but couldn't execute it

> > (error -14)

> > 

> > Is /sbin/init a link to busybox?

> 

> Yes.

> 

> 

> > I suppose it just boots if you do mkcramfs without -X?

> 

> Correct. I just created another image and removed the "-X -X" when 

> creating it. Now I can boot that image as my rootfs.

>   (I'm using -X -X because I'm using a Cortex-A9 with MMU).

> 

> 

> > If so could you share your non-working cramfs image with me?

> 

> I will send it (in a separate email)


I was able to reproduce. The following patch on top should partially fix 
it.  I'm trying to figure out how to split a vma and link it properly in 
the case the vma cannot be mapped entirely. In the mean time shared libs 
won't be XIP.diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index 5aedbd224e..4c7f01fcd2 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -285,10 +285,10 @@ static void *cramfs_read(struct super_block *sb, unsigned int offset,
 
 /*
  * For a mapping to be possible, we need a range of uncompressed and
- * contiguous blocks. Return the offset for the first block if that
- * verifies, or zero otherwise.
+ * contiguous blocks. Return the offset for the first block and number of
+ * valid blocks for which that is true, or zero otherwise.
  */
-static u32 cramfs_get_block_range(struct inode *inode, u32 pgoff, u32 pages)
+static u32 cramfs_get_block_range(struct inode *inode, u32 pgoff, u32 *pages)
 {
 	struct super_block *sb = inode->i_sb;
 	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
@@ -306,11 +306,16 @@ static u32 cramfs_get_block_range(struct inode *inode, u32 pgoff, u32 pages)
 	do {
 		u32 expect = blockaddr + i * (PAGE_SIZE >> 2);
 		expect |= CRAMFS_BLK_FLAG_DIRECT_PTR|CRAMFS_BLK_FLAG_UNCOMPRESSED;
-		pr_debug("range: block %d/%d got %#x expects %#x\n",
-			 pgoff+i, pgoff+pages-1, blockptrs[i], expect);
-		if (blockptrs[i] != expect)
-			return 0;
-	} while (++i < pages);
+		if (blockptrs[i] != expect) {
+			pr_debug("range: block %d/%d got %#x expects %#x\n",
+				 pgoff+i, pgoff+*pages-1, blockptrs[i], expect);
+			if (i == 0)
+				return 0;
+			break;
+		}
+	} while (++i < *pages);
+
+	*pages = i;
 
 	/* stored "direct" block ptrs are shifted down by 2 bits */
 	return blockaddr << 2;
@@ -321,8 +326,8 @@ static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
 	struct inode *inode = file_inode(file);
 	struct super_block *sb = inode->i_sb;
 	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
-	unsigned int pages, max_pages, offset;
-	unsigned long length, address;
+	unsigned int pages, vma_pages, max_pages, offset;
+	unsigned long address;
 	char *fail_reason;
 	int ret;
 
@@ -332,17 +337,20 @@ static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
 	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
 		return -EINVAL;
 
-	vma->vm_ops = &generic_file_vm_ops;
+	fail_reason = "vma is writable";
 	if (vma->vm_flags & VM_WRITE)
-		return 0;
+		goto fail;
 
-	length = vma->vm_end - vma->vm_start;
-	pages = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	vma_pages = (vma->vm_end - vma->vm_start + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	max_pages = (inode->i_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (vma->vm_pgoff >= max_pages || pages > max_pages - vma->vm_pgoff)
-		return -EINVAL;
+	fail_reason = "beyond file limit";
+	if (vma->vm_pgoff >= max_pages)
+		goto fail;
+	pages = vma_pages;
+	if (pages > max_pages - vma->vm_pgoff)
+		pages = max_pages - vma->vm_pgoff;
 
-	offset = cramfs_get_block_range(inode, vma->vm_pgoff, pages);
+	offset = cramfs_get_block_range(inode, vma->vm_pgoff, &pages);
 	fail_reason = "unsuitable block layout";
 	if (!offset)
 		goto fail;
@@ -351,37 +359,60 @@ static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!PAGE_ALIGNED(address))
 		goto fail;
 
-	/* Don't map a partial page if it contains some other data */
+	/* Don't map the last page if it contains some other data */
 	if (unlikely(vma->vm_pgoff + pages == max_pages)) {
 		unsigned int partial = offset_in_page(inode->i_size);
 		if (partial) {
 			char *data = sbi->linear_virt_addr + offset;
 			data += (pages - 1) * PAGE_SIZE + partial;
-			fail_reason = "last partial page is shared";
 			while ((unsigned long)data & 7)
 				if (*data++ != 0)
-					goto fail;
+					goto nonzero;
 			while (offset_in_page(data)) {
-				if (*(u64 *)data != 0)
-					goto fail;
+				if (*(u64 *)data != 0) {
+					nonzero:
+					pr_debug("mmap: %s: last page is shared\n",
+						 file_dentry(file)->d_name.name);
+					pages--;
+					break;
+				}
 				data += 8;
 			}
 		}
 	}
-	
-	ret = remap_pfn_range(vma, vma->vm_start, address >> PAGE_SHIFT,
-			      length, vma->vm_page_prot);
-	if (ret)
-		return ret;
-	pr_debug("mapped %s at 0x%08lx, length %lu to vma 0x%08lx, "
+
+	if (pages) {
+		/*
+		 * Split the vma if we can't map it all so normal paging
+		 * will take care of the rest through cramfs_readpage().
+		 */
+		if (pages != vma_pages) {
+			if (1) {
+				fail_reason = "fix me";
+				goto fail;
+			}
+			ret = split_vma(vma->vm_mm, vma,
+					vma->vm_start + pages * PAGE_SIZE, 0);
+			if (ret)
+				return ret;
+		}
+
+		ret = remap_pfn_range(vma, vma->vm_start, address >> PAGE_SHIFT,
+			      	      pages * PAGE_SIZE, vma->vm_page_prot);
+		if (ret)
+			return ret;
+	}
+
+	pr_debug("mapped %s at 0x%08lx, %u/%u pages to vma 0x%08lx, "
 		 "page_prot 0x%llx\n", file_dentry(file)->d_name.name,
-		 address, length, vma->vm_start,
+		 address, pages, vma_pages, vma->vm_start,
 		 (unsigned long long)pgprot_val(vma->vm_page_prot));
 	return 0;
 
 fail:
 	pr_debug("%s: direct mmap failed: %s\n",
 		 file_dentry(file)->d_name.name, fail_reason);
+	vma->vm_ops = &generic_file_vm_ops;
 	return 0;
 }
 
@@ -394,14 +425,15 @@ static unsigned long cramfs_physmem_get_unmapped_area(struct file *file,
 	struct inode *inode = file_inode(file);
 	struct super_block *sb = inode->i_sb;
 	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
-	unsigned int pages, max_pages, offset;
+	unsigned int pages, block_pages, max_pages, offset;
 
 	pages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	max_pages = (inode->i_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (pgoff >= max_pages || pages > max_pages - pgoff)
 		return -EINVAL;
-	offset = cramfs_get_block_range(inode, pgoff, pages);
-	if (!offset)
+	block_pages = pages;
+	offset = cramfs_get_block_range(inode, pgoff, &block_pages);
+	if (!offset || block_pages != pages)
 		return -ENOSYS;
 	addr = sbi->linear_phys_addr + offset;
 	pr_debug("get_unmapped for %s ofs %#lx siz %lu at 0x%08lx\n",

Chris Brandt Aug. 15, 2017, 11 a.m. UTC | #7
On Tuesday, August 15, 2017 1, Nicolas Pitre wrote:
> I was able to reproduce. The following patch on top should partially fix

> it.  I'm trying to figure out how to split a vma and link it properly in

> the case the vma cannot be mapped entirely. In the mean time shared libs

> won't be XIP.

> 

> 

> diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c

> index 5aedbd224e..4c7f01fcd2 100644

> --- a/fs/cramfs/inode.c

> +++ b/fs/cramfs/inode.c



Yes, now I can boot with my rootfs being a XIP cramfs.

However, like you said, libc is not XIP.

$ cat /proc/self/maps
00008000-000a1000 r-xp 1b005000 00:0c 18192      /bin/busybox
000a9000-000aa000 rw-p 00099000 00:0c 18192      /bin/busybox
000aa000-000ac000 rw-p 00000000 00:00 0          [heap]
b6ed8000-b6fb1000 r-xp 00000000 00:0c 766540     /lib/libc-2.18-2013.10.so
b6fb1000-b6fb9000 ---p 000d9000 00:0c 766540     /lib/libc-2.18-2013.10.so
b6fb9000-b6fbb000 r--p 000d9000 00:0c 766540     /lib/libc-2.18-2013.10.so
b6fbb000-b6fbc000 rw-p 000db000 00:0c 766540     /lib/libc-2.18-2013.10.so
b6fbc000-b6fbf000 rw-p 00000000 00:00 0
b6fbf000-b6fd6000 r-xp 00000000 00:0c 670372     /lib/ld-2.18-2013.10.so
b6fd9000-b6fda000 rw-p 00000000 00:00 0
b6fdb000-b6fdd000 rw-p 00000000 00:00 0
b6fdd000-b6fde000 r--p 00016000 00:0c 670372     /lib/ld-2.18-2013.10.so
b6fde000-b6fdf000 rw-p 00017000 00:0c 670372     /lib/ld-2.18-2013.10.so
be81f000-be840000 rw-p 00000000 00:00 0          [stack]
beb19000-beb1a000 r-xp 00000000 00:00 0          [sigpage]
ffff0000-ffff1000 r-xp 00000000 00:00 0          [vectors]


Chris
Nicolas Pitre Aug. 16, 2017, 5:10 a.m. UTC | #8
On Tue, 15 Aug 2017, Chris Brandt wrote:

> On Tuesday, August 15, 2017 1, Nicolas Pitre wrote:

> > I was able to reproduce. The following patch on top should partially fix

> > it.  I'm trying to figure out how to split a vma and link it properly in

> > the case the vma cannot be mapped entirely. In the mean time shared libs

> > won't be XIP.

> > 

> > 

> > diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c

> > index 5aedbd224e..4c7f01fcd2 100644

> > --- a/fs/cramfs/inode.c

> > +++ b/fs/cramfs/inode.c

> 

> 

> Yes, now I can boot with my rootfs being a XIP cramfs.

> 

> However, like you said, libc is not XIP.


I think I have it working now. Probably learned more about the memory 
management internals than I ever wanted to know. Please try the patch 
below on top of all the previous ones. If it works for you as well then 
I'll rebase and repost the whole thing.diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index 4c7f01fcd2..0b651f985c 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -321,6 +321,86 @@ static u32 cramfs_get_block_range(struct inode *inode, u32 pgoff, u32 *pages)
 	return blockaddr << 2;
 }
 
+/*
+ * It is possible for cramfs_physmem_mmap() to partially populate the mapping
+ * causing page faults in the unmapped area. When that happens, we need to
+ * split the vma so that the unmapped area gets its own vma that can be backed
+ * with actual memory pages and loaded normally. This is necessary because
+ * remap_pfn_range() overwrites vma->vm_pgoff with the pfn and filemap_fault()
+ * no longer works with it. Furthermore this makes /proc/x/maps right.
+ * Q: is there a way to do split vma at mmap() time?
+ */
+static const struct vm_operations_struct cramfs_vmasplit_ops;
+static int cramfs_vmasplit_fault(struct vm_fault *vmf)
+{
+	struct mm_struct *mm = vmf->vma->vm_mm;
+	struct vm_area_struct *vma, *new_vma;
+	unsigned long split_val, split_addr;
+	unsigned int split_pgoff, split_page;
+	int ret;
+
+	/* Retrieve the vma split address and validate it */
+	vma = vmf->vma;
+	split_val = (unsigned long)vma->vm_private_data;
+	split_pgoff = split_val & 0xffff;
+	split_page = split_val >> 16;
+	split_addr = vma->vm_start + split_page * PAGE_SIZE;
+	pr_debug("fault: addr=%#lx vma=%#lx-%#lx split=%#lx\n",
+		 vmf->address, vma->vm_start, vma->vm_end, split_addr);
+	if (!split_val || split_addr >= vma->vm_end || vmf->address < split_addr)
+		return VM_FAULT_SIGSEGV;
+
+	/* We have some vma surgery to do and need the write lock. */
+	up_read(&mm->mmap_sem);
+	if (down_write_killable(&mm->mmap_sem))
+		return VM_FAULT_RETRY;
+
+	/* Make sure the vma didn't change between the locks */
+	vma = find_vma(mm, vmf->address);
+	if (vma->vm_ops != &cramfs_vmasplit_ops) {
+		/*
+		 * Someone else raced with us and could have handled the fault.
+		 * Let it go back to user space and fault again if necessary.
+		 */
+		downgrade_write(&mm->mmap_sem);
+		return VM_FAULT_NOPAGE;
+	}
+
+	/* Split the vma between the directly mapped area and the rest */
+	ret = split_vma(mm, vma, split_addr, 0);
+	if (ret) {
+		downgrade_write(&mm->mmap_sem);
+		return VM_FAULT_OOM;
+	}
+
+	/* The direct vma should no longer ever fault */
+	vma->vm_ops = NULL;
+
+	/* Retrieve the new vma covering the unmapped area */
+	new_vma = find_vma(mm, split_addr);
+	BUG_ON(new_vma == vma);
+	if (!new_vma) {
+		downgrade_write(&mm->mmap_sem);
+		return VM_FAULT_SIGSEGV;
+	}
+
+	/*
+	 * Readjust the new vma with the actual file based pgoff and
+	 * process the fault normally on it.
+	 */
+	new_vma->vm_pgoff = split_pgoff;
+	new_vma->vm_ops = &generic_file_vm_ops;
+	vmf->vma = new_vma;
+	vmf->pgoff = split_pgoff;
+	vmf->pgoff += (vmf->address - new_vma->vm_start) >> PAGE_SHIFT;
+	downgrade_write(&mm->mmap_sem);
+	return filemap_fault(vmf);
+}
+
+static const struct vm_operations_struct cramfs_vmasplit_ops = {
+	.fault	= cramfs_vmasplit_fault,
+};
+
 static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct inode *inode = file_inode(file);
@@ -337,6 +417,7 @@ static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
 	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
 		return -EINVAL;
 
+	/* Could COW work here? */
 	fail_reason = "vma is writable";
 	if (vma->vm_flags & VM_WRITE)
 		goto fail;
@@ -364,7 +445,7 @@ static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
 		unsigned int partial = offset_in_page(inode->i_size);
 		if (partial) {
 			char *data = sbi->linear_virt_addr + offset;
-			data += (pages - 1) * PAGE_SIZE + partial;
+			data += (max_pages - 1) * PAGE_SIZE + partial;
 			while ((unsigned long)data & 7)
 				if (*data++ != 0)
 					goto nonzero;
@@ -383,35 +464,42 @@ static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
 
 	if (pages) {
 		/*
-		 * Split the vma if we can't map it all so normal paging
-		 * will take care of the rest through cramfs_readpage().
+		 * If we can't map it all, page faults will occur if the
+		 * unmapped area is accessed. Let's handle them to split the
+		 * vma and let the normal paging machinery take care of the
+		 * rest through cramfs_readpage(). Because remap_pfn_range()
+		 * repurposes vma->vm_pgoff, we have to save it somewhere.
+		 * Let's use vma->vm_private_data to hold both the pgoff and the actual address split point.
+		 * Maximum file size is 16MB so we can pack both together.
 		 */
 		if (pages != vma_pages) {
-			if (1) {
-				fail_reason = "fix me";
-				goto fail;
-			}
-			ret = split_vma(vma->vm_mm, vma,
-					vma->vm_start + pages * PAGE_SIZE, 0);
-			if (ret)
-				return ret;
+			unsigned int split_pgoff = vma->vm_pgoff + pages;
+			unsigned long split_val = split_pgoff + (pages << 16);
+			vma->vm_private_data = (void *)split_val;
+			vma->vm_ops = &cramfs_vmasplit_ops;
+			/* to keep remap_pfn_range() happy */
+			vma->vm_end = vma->vm_start + pages * PAGE_SIZE;
 		}
 
 		ret = remap_pfn_range(vma, vma->vm_start, address >> PAGE_SHIFT,
 			      	      pages * PAGE_SIZE, vma->vm_page_prot);
+		/* restore vm_end in case we cheated it above */
+		vma->vm_end = vma->vm_start + vma_pages * PAGE_SIZE;
 		if (ret)
 			return ret;
+		pr_debug("mapped %s at 0x%08lx, %u/%u pages to vma 0x%08lx, "
+			 "page_prot 0x%llx\n", file_dentry(file)->d_name.name,
+			 address, pages, vma_pages, vma->vm_start,
+			 (unsigned long long)pgprot_val(vma->vm_page_prot));
+		return 0;
 	}
-
-	pr_debug("mapped %s at 0x%08lx, %u/%u pages to vma 0x%08lx, "
-		 "page_prot 0x%llx\n", file_dentry(file)->d_name.name,
-		 address, pages, vma_pages, vma->vm_start,
-		 (unsigned long long)pgprot_val(vma->vm_page_prot));
-	return 0;
+	fail_reason = "no suitable block remaining";
 
 fail:
 	pr_debug("%s: direct mmap failed: %s\n",
 		 file_dentry(file)->d_name.name, fail_reason);
+
+	/* We failed to do a direct map, but normal paging will do it */
 	vma->vm_ops = &generic_file_vm_ops;
 	return 0;
 }

Chris Brandt Aug. 16, 2017, 11:08 a.m. UTC | #9
On Wednesday, August 16, 2017, Nicolas Pitre wrote:
> > Yes, now I can boot with my rootfs being a XIP cramfs.

> >

> > However, like you said, libc is not XIP.

> 

> I think I have it working now. Probably learned more about the memory

> management internals than I ever wanted to know. Please try the patch

> below on top of all the previous ones. If it works for you as well then

> I'll rebase and repost the whole thing.

> 

> diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c

> index 4c7f01fcd2..0b651f985c 100644

> --- a/fs/cramfs/inode.c

> +++ b/fs/cramfs/inode.c



Yes, that worked. Very nice!

$ cat /proc/self/maps
00008000-000a1000 r-xp 1b005000 00:0c 18192      /bin/busybox
000a9000-000aa000 rw-p 00099000 00:0c 18192      /bin/busybox
000aa000-000ac000 rw-p 00000000 00:00 0          [heap]
b6e23000-b6efc000 r-xp 1b0bc000 00:0c 766540     /lib/libc-2.18-2013.10.so
b6efc000-b6f04000 ---p 1b195000 00:0c 766540     /lib/libc-2.18-2013.10.so
b6f04000-b6f06000 r--p 000d9000 00:0c 766540     /lib/libc-2.18-2013.10.so
b6f06000-b6f07000 rw-p 000db000 00:0c 766540     /lib/libc-2.18-2013.10.so
b6f07000-b6f0a000 rw-p 00000000 00:00 0
b6f0a000-b6f21000 r-xp 1b0a4000 00:0c 670372     /lib/ld-2.18-2013.10.so
b6f24000-b6f25000 rw-p 00000000 00:00 0
b6f26000-b6f28000 rw-p 00000000 00:00 0
b6f28000-b6f29000 r--p 00016000 00:0c 670372     /lib/ld-2.18-2013.10.so
b6f29000-b6f2a000 rw-p 00017000 00:0c 670372     /lib/ld-2.18-2013.10.so
be877000-be898000 rw-p 00000000 00:00 0          [stack]
beba9000-bebaa000 r-xp 00000000 00:00 0          [sigpage]
ffff0000-ffff1000 r-xp 00000000 00:00 0          [vectors]


Just FYI,
I'm running an xipImage with all the RZ/A1 upstream drivers enabled and 
only using about 4.5MB of total system RAM.
That's pretty good. Of course for a real application, you would trim off
the drivers and subsystems you don't plan on using, thus lowering your
RAM usage.


> +/*

> + * It is possible for cramfs_physmem_mmap() to partially populate the

> mapping

> + * causing page faults in the unmapped area. When that happens, we need

> to

> + * split the vma so that the unmapped area gets its own vma that can be

> backed

> + * with actual memory pages and loaded normally. This is necessary

> because

> + * remap_pfn_range() overwrites vma->vm_pgoff with the pfn and

> filemap_fault()

> + * no longer works with it. Furthermore this makes /proc/x/maps right.

> + * Q: is there a way to do split vma at mmap() time?

> + */



So if I understand correctly, the issue is that sometimes you only have 
a partial PAGE worth that you need to map. Correct?

For the AXFS file system, XIP page mapping was done on a per page 
decision, not per file. So the mkfs.axfs tool would only mark a page as XIP if
the entire section would fit in a complete PAGE. If for example you had
a partial page at the end of a multi page code segment, it would put 
that partial page in a separate portion of the AXFS image and be marked as
'copy to RAM' instead of being marked as 'map as XIP'.
So in the AXFS case, it was a combination of the creation tool and file 
system driver features to fix the partial page issue.
Not sure if any of this info is relevant, but I thought I would mention 
anyway.


Thank you for your efforts on adding XIP to cramfs!


Chris
Nicolas Pitre Aug. 16, 2017, 2:29 p.m. UTC | #10
On Wed, 16 Aug 2017, Chris Brandt wrote:

> On Wednesday, August 16, 2017, Nicolas Pitre wrote:

> > > Yes, now I can boot with my rootfs being a XIP cramfs.

> > >

> > > However, like you said, libc is not XIP.

> > 

> > I think I have it working now. Probably learned more about the memory

> > management internals than I ever wanted to know. Please try the patch

> > below on top of all the previous ones. If it works for you as well then

> > I'll rebase and repost the whole thing.

> > 

> > diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c

> > index 4c7f01fcd2..0b651f985c 100644

> > --- a/fs/cramfs/inode.c

> > +++ b/fs/cramfs/inode.c

> 

> 

> Yes, that worked. Very nice!


Good.

> Just FYI,

> I'm running an xipImage with all the RZ/A1 upstream drivers enabled and 

> only using about 4.5MB of total system RAM.

> That's pretty good. Of course for a real application, you would trim off

> the drivers and subsystems you don't plan on using, thus lowering your

> RAM usage.


On my MMU-less test target I'm going under the 1MB mark now.

> > +/*

> > + * It is possible for cramfs_physmem_mmap() to partially populate the mapping

> > + * causing page faults in the unmapped area. When that happens, we need to

> > + * split the vma so that the unmapped area gets its own vma that can be backed

> > + * with actual memory pages and loaded normally. This is necessary because

> > + * remap_pfn_range() overwrites vma->vm_pgoff with the pfn and filemap_fault()

> > + * no longer works with it. Furthermore this makes /proc/x/maps right.

> > + * Q: is there a way to do split vma at mmap() time?

> > + */

> 

> So if I understand correctly, the issue is that sometimes you only have 

> a partial PAGE worth that you need to map. Correct?


Yes, or the page is stored in its compressed form in the filesystem, or 
it is misaligned, or any combination of those.

> For the AXFS file system, XIP page mapping was done on a per page 

> decision, not per file. So the mkfs.axfs tool would only mark a page as XIP if

> the entire section would fit in a complete PAGE. If for example you had

> a partial page at the end of a multi page code segment, it would put 

> that partial page in a separate portion of the AXFS image and be marked as

> 'copy to RAM' instead of being marked as 'map as XIP'.

> So in the AXFS case, it was a combination of the creation tool and file 

> system driver features to fix the partial page issue.

> Not sure if any of this info is relevant, but I thought I would mention 

> anyway.


Same applies here.  The XIP decision is no longer a per file thing. This 
is why mkcramfs puts loadable and read-only ELF segments into 
uncompressed and aligned blocks while still packing the remaining of the 
file. The partial page issue can be "fixed" within mkcramfs if 
considered worth it. To incure the page alignment overhead only once, 
all the uncompressed blocks can be located together away from their file 
block tables, etc.  The extended format implemented in this seris allows 
for all this layout flexibility the fs creation tool may exploit.

The current restriction in the fs driver at the moment is that XIP 
blocks must be contiguous in the filesystem. That is a hard requirement 
in the non-mmu case anyway.

Given that I also applied the device table patch to mkcramfs (that 
allows for the creation of device nodes and arbitrary 
user/group/permission without being root) it would be possible to extend 
this mechanism to implement other XIP patterns such as for 
uncompressible media files for example.

> Thank you for your efforts on adding XIP to cramfs!


Thank you for testing.


Nicolas
Chris Brandt Aug. 16, 2017, 3:12 p.m. UTC | #11
On Wednesday, August 16, 2017 1, Nicolas Pitre wrote:
> > Just FYI,

> > I'm running an xipImage with all the RZ/A1 upstream drivers enabled and

> > only using about 4.5MB of total system RAM.

> > That's pretty good. Of course for a real application, you would trim off

> > the drivers and subsystems you don't plan on using, thus lowering your

> > RAM usage.

> 

> On my MMU-less test target I'm going under the 1MB mark now.


Show off  ;)


> Given that I also applied the device table patch to mkcramfs (that

> allows for the creation of device nodes and arbitrary

> user/group/permission without being root) it would be possible to extend

> this mechanism to implement other XIP patterns such as for

> uncompressible media files for example.


Good, I was going to ask about that.

I made an example once were all the graphics were RAW and uncompressed 
and marked as XIP in AXFS. The result was a large saving of RAM because 
as the graphics framework (DirectFB) would copy directly from Flash 
whenever it needed to do a background erase or image re-draw (button press 
animations).

Same went for playing MP3 files. The MP3 files were XIP in flash, so 
mpg123 pulled from flash directly.


Chris
Nicolas Pitre Aug. 17, 2017, 1:17 a.m. UTC | #12
On Wed, 16 Aug 2017, Chris Brandt wrote:

> I made an example once were all the graphics were RAW and uncompressed 

> and marked as XIP in AXFS. The result was a large saving of RAM because 

> as the graphics framework (DirectFB) would copy directly from Flash 

> whenever it needed to do a background erase or image re-draw (button press 

> animations).

> 

> Same went for playing MP3 files. The MP3 files were XIP in flash, so 

> mpg123 pulled from flash directly.


I wouldn't have expected mpg123 to mmap() its input files though.
If you use read() and not mmap() then you don't get the full benefit.


Nicolas