diff mbox

[RFC,3/3] kexec: extend kexec_file_load system call

Message ID 20160712014201.11456-4-takahiro.akashi@linaro.org
State New
Headers show

Commit Message

AKASHI Takahiro July 12, 2016, 1:42 a.m. UTC
Device tree blob must be passed to a second kernel on DTB-capable
archs, like powerpc and arm64, but the current kernel interface
lacks this support.

This patch extends kexec_file_load system call by adding an extra
argument to this syscall so that an arbitrary number of file descriptors
can be handed out from user space to the kernel.

	long sys_kexec_file_load(int kernel_fd, int initrd_fd,
				 unsigned long cmdline_len,
				 const char __user *cmdline_ptr,
				 unsigned long flags,
				 const struct kexec_fdset __user *ufdset);

If KEXEC_FILE_EXTRA_FDS is set to the "flags" argument, the "ufdset"
argument points to the following struct buffer:

	struct kexec_fdset {
		int nr_fds;
		struct kexec_file_fd fds[0];
	}

Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>

---
 include/linux/fs.h         |  1 +
 include/linux/kexec.h      |  2 +-
 include/linux/syscalls.h   |  4 +++-
 include/uapi/linux/kexec.h | 17 ++++++++++++++
 kernel/kexec_file.c        | 57 ++++++++++++++++++++++++++++++++++++++++------
 5 files changed, 72 insertions(+), 9 deletions(-)

-- 
2.9.0

Comments

Mark Rutland July 15, 2016, 1:19 p.m. UTC | #1
On Fri, Jul 15, 2016 at 09:09:55AM -0400, Vivek Goyal wrote:
> On Tue, Jul 12, 2016 at 10:42:01AM +0900, AKASHI Takahiro wrote:

> 

> [..]

> > -SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,

> > +SYSCALL_DEFINE6(kexec_file_load, int, kernel_fd, int, initrd_fd,

> >  		unsigned long, cmdline_len, const char __user *, cmdline_ptr,

> > -		unsigned long, flags)

> > +		unsigned long, flags, const struct kexec_fdset __user *, ufdset)

> 

> Can one add more parameters to existing syscall. Can it break existing

> programs with new kernel? I was of the impression that one can't do that.

> But may be I am missing something.


I think the idea was that we would only look at the new params if a new
flags was set, and otherwise it would behave as the old syscall.

Regardless, I think it makes far more sense to add a kexec_file_load2
syscall if we're going to modify the prototype at all. It's a rather
different proposition to the existing syscall, and needs to be treated
as such.

Thanks,
Mark.
Mark Rutland July 18, 2016, 10:07 a.m. UTC | #2
On Mon, Jul 18, 2016 at 10:30:24AM +0800, Dave Young wrote:
> On 07/15/16 at 02:19pm, Mark Rutland wrote:

> > On Fri, Jul 15, 2016 at 09:09:55AM -0400, Vivek Goyal wrote:

> > > On Tue, Jul 12, 2016 at 10:42:01AM +0900, AKASHI Takahiro wrote:

> > > 

> > > [..]

> > > > -SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,

> > > > +SYSCALL_DEFINE6(kexec_file_load, int, kernel_fd, int, initrd_fd,

> > > >  		unsigned long, cmdline_len, const char __user *, cmdline_ptr,

> > > > -		unsigned long, flags)

> > > > +		unsigned long, flags, const struct kexec_fdset __user *, ufdset)

> > > 

> > > Can one add more parameters to existing syscall. Can it break existing

> > > programs with new kernel? I was of the impression that one can't do that.

> > > But may be I am missing something.

> > 

> > I think the idea was that we would only look at the new params if a new

> > flags was set, and otherwise it would behave as the old syscall.

> > 

> > Regardless, I think it makes far more sense to add a kexec_file_load2

> > syscall if we're going to modify the prototype at all. It's a rather

> > different proposition to the existing syscall, and needs to be treated

> > as such.

> 

> I do not think it is worth to add another syscall for extra fds.

> We have open(2) as an example for different numbers of arguments

> already.


Did we change the syscall interface for that?

I was under the impression that there was always one underlying syscall,
and the C library did the right thing to pass the expected information
to the underlying syscall.

That's rather different to changing the underlying syscall.

Regardless of how this is wrapped in userspace, I do not think modifying
the existing prototype is a good idea, and I think this kind of
extension needs to be a new syscall.

Thanks,
Mark.
Mark Rutland July 19, 2016, 10:52 a.m. UTC | #3
On Tue, Jul 19, 2016 at 08:55:56AM +0800, Dave Young wrote:
> On 07/18/16 at 11:07am, Mark Rutland wrote:

> > On Mon, Jul 18, 2016 at 10:30:24AM +0800, Dave Young wrote:

> > > I do not think it is worth to add another syscall for extra fds.

> > > We have open(2) as an example for different numbers of arguments

> > > already.

> > 

> > Did we change the syscall interface for that?

> > 

> > I was under the impression that there was always one underlying syscall,

> > and the C library did the right thing to pass the expected information

> > to the underlying syscall.

> 

> I'm not sure kexec_load and kexec_file_load were included in glibc, we use

> syscall directly in kexec-tools.

> 

> kexec_load man pages says there are no wrappers for both kexec_load and

> kexec_file_load in glibc.


For the above, I was talking about how open() was handled.

If there are no userspace wrappers, then the two cases aren't comparable
in the first place...

> > That's rather different to changing the underlying syscall.

> > 

> > Regardless of how this is wrapped in userspace, I do not think modifying

> > the existing prototype is a good idea, and I think this kind of

> > extension needs to be a new syscall.

> 

> Hmm, as I replied to Vivek, there is one case about the flags, previously

> the new flag will be regarded as invalid, but not we extend it it will be

> valid, this maybe the only potential bad case.


It's true that adding suport for new flags will change the behaviour of
what used to be error cases. We generally expect real users to not be
making pointless calls for which they rely on an error being returned in
all cases.

Regardless, this extended syscall changes some underlying assumptions
made with the development of kexec_file_load, and I think treating this
as an extension is not a great idea. From a user's perspective there is
little difference between passing an additional flag or using a
different syscall number, so I don't think that we gain much by altering
the existing prototype relative to allocating a new syscall number.

Thus, I think that if this is necessary it should be treated as a new
syscall.

Thanks,
Mark.
Mark Rutland July 19, 2016, 12:47 p.m. UTC | #4
On Tue, Jul 19, 2016 at 08:24:06AM -0400, Vivek Goyal wrote:
> On Tue, Jul 19, 2016 at 11:52:00AM +0100, Mark Rutland wrote:

> > Regardless, this extended syscall changes some underlying assumptions

> > made with the development of kexec_file_load, and I think treating this

> > as an extension is not a great idea. From a user's perspective there is

> > little difference between passing an additional flag or using a

> > different syscall number, so I don't think that we gain much by altering

> > the existing prototype relative to allocating a new syscall number.

> 

> If we are providing/opening up additional flags, I can't think what will

> it break. Same flag was invalid in old kernel but new kernel supports 

> it and will accept it. So it sounds reasonable to me to add new flags.

> 

> If existing users are not broken, then I think it might be a good idea

> to extend existing syscall. Otherwise userspace will have to be modified

> to understand a 3rd syscall also and an additional option will show up

> which asks users to specify which syscall to use. So extending existing

> syscall might keep it little simple for users.


I don't follow.

To use the new feature, you have to modify userspace anyway, as you
require userspace to pass information which it did not previously pass
(in the new arguments added to the syscall).

The presence of a new syscall does not imply the absence of the old
syscall, so you can always use that be default unless the user asks for
asomething only the new syscall provides. Regardless of the
syscall/flags difference, you still have to detect whether the new
functionality is present somehow.

> BTW, does kexec_load() needs to be modified too to handle DT?


No, at least for arm64. In the kexec_load case userspace provides the
DTB as a raw segment, and the user-provided purgatory sets up registers
to pass that to the new kernel.

Thanks,
Mark.
diff mbox

Patch

diff --git a/include/linux/fs.h b/include/linux/fs.h
index dd28814..6dd6fdf 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2634,6 +2634,7 @@  extern int do_pipe_flags(int *, int);
 	id(MODULE, kernel-module)		\
 	id(KEXEC_IMAGE, kexec-image)		\
 	id(KEXEC_INITRAMFS, kexec-initramfs)	\
+	id(KEXEC_DTB, kexec-dtb)		\
 	id(POLICY, security-policy)		\
 	id(MAX_ID, )
 
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 554c848..5f11bd5 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -277,7 +277,7 @@  extern int kexec_load_disabled;
 
 /* List of defined/legal kexec file flags */
 #define KEXEC_FILE_FLAGS	(KEXEC_FILE_UNLOAD | KEXEC_FILE_ON_CRASH | \
-				 KEXEC_FILE_NO_INITRAMFS)
+				 KEXEC_FILE_NO_INITRAMFS | KEXEC_FILE_EXTRA_FDS)
 
 #define VMCOREINFO_BYTES           (4096)
 #define VMCOREINFO_NOTE_NAME       "VMCOREINFO"
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index d022390..fc072bd 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -66,6 +66,7 @@  struct perf_event_attr;
 struct file_handle;
 struct sigaltstack;
 union bpf_attr;
+struct kexec_fdset;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -321,7 +322,8 @@  asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments,
 asmlinkage long sys_kexec_file_load(int kernel_fd, int initrd_fd,
 				    unsigned long cmdline_len,
 				    const char __user *cmdline_ptr,
-				    unsigned long flags);
+				    unsigned long flags,
+				    const struct kexec_fdset __user *ufdset);
 
 asmlinkage long sys_exit(int error_code);
 asmlinkage long sys_exit_group(int error_code);
diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
index aae5ebf..adf53b6 100644
--- a/include/uapi/linux/kexec.h
+++ b/include/uapi/linux/kexec.h
@@ -23,6 +23,23 @@ 
 #define KEXEC_FILE_UNLOAD	0x00000001
 #define KEXEC_FILE_ON_CRASH	0x00000002
 #define KEXEC_FILE_NO_INITRAMFS	0x00000004
+#define KEXEC_FILE_EXTRA_FDS	0x00000008
+
+enum kexec_file_type {
+	KEXEC_FILE_TYPE_KERNEL,
+	KEXEC_FILE_TYPE_INITRAMFS,
+	KEXEC_FILE_TYPE_DTB,
+};
+
+struct kexec_file_fd {
+	enum kexec_file_type type;
+	int fd;
+};
+
+struct kexec_fdset {
+	int nr_fds;
+	struct kexec_file_fd fds[0];
+};
 
 /* These values match the ELF architecture values.
  * Unless there is a good reason that should continue to be the case.
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 7278329..451b4b0 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -137,11 +137,14 @@  void kimage_file_post_load_cleanup(struct kimage *image)
 static int
 kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd,
 			     const char __user *cmdline_ptr,
-			     unsigned long cmdline_len, unsigned flags)
+			     unsigned long cmdline_len, unsigned long flags,
+			     const struct kexec_fdset __user *ufdset)
 {
-	int ret = 0;
+	int ret = 0, nr_fds, i;
 	void *ldata;
 	loff_t size;
+	struct kexec_fdset *fdset = NULL;
+	size_t fdset_size;
 
 	ret = kernel_read_file_from_fd(kernel_fd, &image->kernel_buf,
 				       &size, INT_MAX, READING_KEXEC_IMAGE);
@@ -174,6 +177,42 @@  kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd,
 		image->initrd_buf_len = size;
 	}
 
+	if (flags & KEXEC_FILE_EXTRA_FDS) {
+		ret = copy_from_user(&nr_fds, ufdset, sizeof(int));
+		if (ret) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		fdset_size = sizeof(struct kexec_fdset)
+				+ nr_fds * sizeof(struct kexec_file_fd);
+		fdset = kmalloc(fdset_size, GFP_KERNEL);
+		if (!fdset) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		ret = copy_from_user(fdset, ufdset, fdset_size);
+		if (ret) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		for (i = 0; i < fdset->nr_fds; i++) {
+			if (fdset->fds[i].type == KEXEC_FILE_TYPE_DTB) {
+				ret = kernel_read_file_from_fd(fdset->fds[i].fd,
+						&image->dtb_buf, &size, INT_MAX,
+						READING_KEXEC_DTB);
+				if (ret)
+					goto out;
+				image->dtb_buf_len = size;
+			} else {
+				pr_debug("unknown file type %d failed.\n",
+						fdset->fds[i].type);
+			}
+		}
+	}
+
 	if (cmdline_len) {
 		image->cmdline_buf = kzalloc(cmdline_len, GFP_KERNEL);
 		if (!image->cmdline_buf) {
@@ -208,6 +247,8 @@  kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd,
 	image->image_loader_data = ldata;
 out:
 	/* In case of error, free up all allocated memory in this function */
+	kfree(fdset);
+
 	if (ret)
 		kimage_file_post_load_cleanup(image);
 	return ret;
@@ -216,7 +257,8 @@  out:
 static int
 kimage_file_alloc_init(struct kimage **rimage, int kernel_fd,
 		       int initrd_fd, const char __user *cmdline_ptr,
-		       unsigned long cmdline_len, unsigned long flags)
+		       unsigned long cmdline_len, unsigned long flags,
+		       const struct kexec_fdset __user *ufdset)
 {
 	int ret;
 	struct kimage *image;
@@ -235,7 +277,8 @@  kimage_file_alloc_init(struct kimage **rimage, int kernel_fd,
 	}
 
 	ret = kimage_file_prepare_segments(image, kernel_fd, initrd_fd,
-					   cmdline_ptr, cmdline_len, flags);
+					   cmdline_ptr, cmdline_len, flags,
+					   ufdset);
 	if (ret)
 		goto out_free_image;
 
@@ -270,9 +313,9 @@  out_free_image:
 	return ret;
 }
 
-SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
+SYSCALL_DEFINE6(kexec_file_load, int, kernel_fd, int, initrd_fd,
 		unsigned long, cmdline_len, const char __user *, cmdline_ptr,
-		unsigned long, flags)
+		unsigned long, flags, const struct kexec_fdset __user *, ufdset)
 {
 	int ret = 0, i;
 	struct kimage **dest_image, *image;
@@ -309,7 +352,7 @@  SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
 		kimage_free(xchg(&kexec_crash_image, NULL));
 
 	ret = kimage_file_alloc_init(&image, kernel_fd, initrd_fd, cmdline_ptr,
-				     cmdline_len, flags);
+				     cmdline_len, flags, ufdset);
 	if (ret)
 		goto out;