io_uring: avoid page allocation warnings

Message ID 20190430132405.8268-1-mark.rutland@arm.com
State New
Series io_uring: avoid page allocation warnings

Commit Message

Mark Rutland April 30, 2019, 1:24 p.m. UTC
In io_sqe_buffer_register() we allocate a number of arrays based on the
iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
we can still attempt to allocate arrays exceeding MAX_ORDER.

On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and
iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to
allocate a corresponding array of (16-byte) bio_vecs, we need 4194320
bytes, which is greater than 4MiB. This results in SLUB warning that
we're trying to allocate more than MAX_ORDER allows, and failing the
allocation.
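
A rough userspace sketch of that arithmetic (illustrative only; it assumes
4KiB pages and is not the kernel's code):

	#include <stdint.h>
	#include <stdio.h>

	#define PAGE_SIZE 4096UL
	#define SZ_1G (1UL << 30)

	int main(void)
	{
		uintptr_t iov_base = 0x10;	/* offset into the first page */
		size_t iov_len = SZ_1G;

		/* First and one-past-last page indices spanned by the iov. */
		size_t first = iov_base / PAGE_SIZE;
		size_t last = (iov_base + iov_len + PAGE_SIZE - 1) / PAGE_SIZE;
		size_t nr_pages = last - first;

		printf("nr_pages = %zu\n", nr_pages);		/* 262145 */
		printf("bvec bytes = %zu\n", nr_pages * 16);	/* 4194320 */
		return 0;
	}

The 0x10 offset makes the region straddle one extra page, which is what
pushes the bio_vec array just past 4MiB.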

Avoid this by passing __GFP_NOWARN when allocating arrays for the
user-provided iov_len. We'll gracefully handle the failed allocation,
returning -ENOMEM to userspace.

We should probably consider lowering the limit below SZ_1G, or reworking
the array allocations.

Full splat from before this patch:

WARNING: CPU: 1 PID: 2314 at mm/page_alloc.c:4595 __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 2314 Comm: syz-executor326 Not tainted 5.1.0-rc7-dirty #4
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace+0x0/0x2f0 include/linux/compiler.h:193
 show_stack+0x20/0x30 arch/arm64/kernel/traps.c:158
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x110/0x190 lib/dump_stack.c:113
 panic+0x384/0x68c kernel/panic.c:214
 __warn+0x2bc/0x2c0 kernel/panic.c:571
 report_bug+0x228/0x2d8 lib/bug.c:186
 bug_handler+0xa0/0x1a0 arch/arm64/kernel/traps.c:956
 call_break_hook arch/arm64/kernel/debug-monitors.c:301 [inline]
 brk_handler+0x1d4/0x388 arch/arm64/kernel/debug-monitors.c:316
 do_debug_exception+0x1a0/0x468 arch/arm64/mm/fault.c:831
 el1_dbg+0x18/0x8c
 __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595
 alloc_pages_current+0x164/0x278 mm/mempolicy.c:2132
 alloc_pages include/linux/gfp.h:509 [inline]
 kmalloc_order+0x20/0x50 mm/slab_common.c:1231
 kmalloc_order_trace+0x30/0x2b0 mm/slab_common.c:1243
 kmalloc_large include/linux/slab.h:480 [inline]
 __kmalloc+0x3dc/0x4f0 mm/slub.c:3791
 kmalloc_array include/linux/slab.h:670 [inline]
 io_sqe_buffer_register fs/io_uring.c:2472 [inline]
 __io_uring_register fs/io_uring.c:2962 [inline]
 __do_sys_io_uring_register fs/io_uring.c:3008 [inline]
 __se_sys_io_uring_register fs/io_uring.c:2990 [inline]
 __arm64_sys_io_uring_register+0x9e0/0x1bc8 fs/io_uring.c:2990
 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
 invoke_syscall arch/arm64/kernel/syscall.c:47 [inline]
 el0_svc_common.constprop.0+0x148/0x2e0 arch/arm64/kernel/syscall.c:83
 el0_svc_handler+0xdc/0x100 arch/arm64/kernel/syscall.c:129
 el0_svc+0x8/0xc arch/arm64/kernel/entry.S:948
SMP: stopping secondary CPUs
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled
CPU features: 0x002,23000438
Memory Limit: none
Rebooting in 1 seconds..

Signed-off-by: Mark Rutland <mark.rutland@arm.com>

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/io_uring.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

-- 
2.11.0

Comments

Matthew Wilcox April 30, 2019, 2:18 p.m. UTC | #1
On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
> In io_sqe_buffer_register() we allocate a number of arrays based on the
> iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
> we can still attempt to allocate arrays exceeding MAX_ORDER.
>
> On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and
> iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to
> allocate a corresponding array of (16-byte) bio_vecs, we need 4194320
> bytes, which is greater than 4MiB. This results in SLUB warning that
> we're trying to allocate more than MAX_ORDER allows, and failing the
> allocation.
>
> Avoid this by passing __GFP_NOWARN when allocating arrays for the
> user-provided iov_len. We'll gracefully handle the failed allocation,
> returning -ENOMEM to userspace.
>
> We should probably consider lowering the limit below SZ_1G, or reworking
> the array allocations.

I'd suggest that kvmalloc is probably our friend here ... we don't really
want to return -ENOMEM to userspace for this case, I don't think.
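
For readers unfamiliar with the API: kvmalloc_array() first attempts a
physically contiguous kmalloc() and falls back to vmalloc() when that
fails, so the result may only be virtually contiguous and must be freed
with kvfree(). A minimal sketch of the pairing rule (kernel context
assumed; illustrative, not the actual io_uring code):

	struct bio_vec *bvec;

	/* May be slab- or vmalloc-backed; usable as an ordinary array. */
	bvec = kvmalloc_array(nr_pages, sizeof(struct bio_vec), GFP_KERNEL);
	if (!bvec)
		return -ENOMEM;

	/* ... */

	kvfree(bvec);	/* kfree() would be wrong for vmalloc-backed memory */

This pairing is exactly what the follow-up patches below have to get right.
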
Mark Rutland April 30, 2019, 2:59 p.m. UTC | #2
On Tue, Apr 30, 2019 at 07:18:10AM -0700, Matthew Wilcox wrote:
> On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
> > In io_sqe_buffer_register() we allocate a number of arrays based on the
> > iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
> > we can still attempt to allocate arrays exceeding MAX_ORDER.
> >
> > On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and
> > iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to
> > allocate a corresponding array of (16-byte) bio_vecs, we need 4194320
> > bytes, which is greater than 4MiB. This results in SLUB warning that
> > we're trying to allocate more than MAX_ORDER allows, and failing the
> > allocation.
> >
> > Avoid this by passing __GFP_NOWARN when allocating arrays for the
> > user-provided iov_len. We'll gracefully handle the failed allocation,
> > returning -ENOMEM to userspace.
> >
> > We should probably consider lowering the limit below SZ_1G, or reworking
> > the array allocations.
>
> I'd suggest that kvmalloc is probably our friend here ... we don't really
> want to return -ENOMEM to userspace for this case, I don't think.

Sure. I'll go verify that the uring code doesn't assume this memory is
physically contiguous.

I also guess we should be passing GFP_KERNEL_ACCOUNT rather than a plain
GFP_KERNEL.

Thanks,
Mark.
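
For context, GFP_KERNEL_ACCOUNT is GFP_KERNEL | __GFP_ACCOUNT, which charges
the allocation to the caller's memory cgroup; that matters here because
nr_pages is user-controlled. A hypothetical sketch of the suggestion (no
such patch was posted in this thread):

	/* Charge the user-sized array to the caller's memcg. */
	vmas = kvmalloc_array(nr_pages, sizeof(struct vm_area_struct *),
			      GFP_KERNEL_ACCOUNT);
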
Jens Axboe April 30, 2019, 4:21 p.m. UTC | #3
On 4/30/19 8:59 AM, Mark Rutland wrote:
> On Tue, Apr 30, 2019 at 07:18:10AM -0700, Matthew Wilcox wrote:
>> On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
>>> In io_sqe_buffer_register() we allocate a number of arrays based on the
>>> iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
>>> we can still attempt to allocate arrays exceeding MAX_ORDER.
>>>
>>> On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and
>>> iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to
>>> allocate a corresponding array of (16-byte) bio_vecs, we need 4194320
>>> bytes, which is greater than 4MiB. This results in SLUB warning that
>>> we're trying to allocate more than MAX_ORDER allows, and failing the
>>> allocation.
>>>
>>> Avoid this by passing __GFP_NOWARN when allocating arrays for the
>>> user-provided iov_len. We'll gracefully handle the failed allocation,
>>> returning -ENOMEM to userspace.
>>>
>>> We should probably consider lowering the limit below SZ_1G, or reworking
>>> the array allocations.
>>
>> I'd suggest that kvmalloc is probably our friend here ... we don't really
>> want to return -ENOMEM to userspace for this case, I don't think.
>
> Sure. I'll go verify that the uring code doesn't assume this memory is
> physically contiguous.
>
> I also guess we should be passing GFP_KERNEL_ACCOUNT rather than a plain
> GFP_KERNEL.

kvmalloc() is fine, the io_uring code doesn't care about the layout of
the memory, it just uses it as an index.

-- 
Jens Axboe
Mark Rutland April 30, 2019, 5:03 p.m. UTC | #4
On Tue, Apr 30, 2019 at 10:21:03AM -0600, Jens Axboe wrote:
> On 4/30/19 8:59 AM, Mark Rutland wrote:
> > On Tue, Apr 30, 2019 at 07:18:10AM -0700, Matthew Wilcox wrote:
> >> On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
> >>> In io_sqe_buffer_register() we allocate a number of arrays based on the
> >>> iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
> >>> we can still attempt to allocate arrays exceeding MAX_ORDER.
> >>>
> >>> On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and
> >>> iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to
> >>> allocate a corresponding array of (16-byte) bio_vecs, we need 4194320
> >>> bytes, which is greater than 4MiB. This results in SLUB warning that
> >>> we're trying to allocate more than MAX_ORDER allows, and failing the
> >>> allocation.
> >>>
> >>> Avoid this by passing __GFP_NOWARN when allocating arrays for the
> >>> user-provided iov_len. We'll gracefully handle the failed allocation,
> >>> returning -ENOMEM to userspace.
> >>>
> >>> We should probably consider lowering the limit below SZ_1G, or reworking
> >>> the array allocations.
> >>
> >> I'd suggest that kvmalloc is probably our friend here ... we don't really
> >> want to return -ENOMEM to userspace for this case, I don't think.
> >
> > Sure. I'll go verify that the uring code doesn't assume this memory is
> > physically contiguous.
> >
> > I also guess we should be passing GFP_KERNEL_ACCOUNT rather than a plain
> > GFP_KERNEL.
>
> kvmalloc() is fine, the io_uring code doesn't care about the layout of
> the memory, it just uses it as an index.

I've just had a go at that, but when using kvmalloc() with or without
GFP_KERNEL_ACCOUNT I hit OOM and my system hangs within a few seconds with the
syzkaller prog below:

----
Syzkaller reproducer:
# {Threaded:false Collide:false Repeat:false RepeatTimes:0 Procs:1 Sandbox: Fault:false FaultCall:-1 FaultNth:0 EnableTun:false EnableNetDev:false EnableNetReset:false EnableCgroups:false EnableBinfmtMisc:false EnableCloseFds:false UseTmpDir:false HandleSegv:false Repro:false Trace:false}
r0 = io_uring_setup(0x378, &(0x7f00000000c0))
sendmsg$SEG6_CMD_SET_TUNSRC(0xffffffffffffffff, &(0x7f0000000240)={&(0x7f0000000000)={0x10, 0x0, 0x0, 0x40000000}, 0xc, 0x0, 0x1, 0x0, 0x0, 0x10}, 0x800)
io_uring_register$IORING_REGISTER_BUFFERS(r0, 0x0, &(0x7f0000000000), 0x1)
----

... I'm a bit worried that this opens up a trivial DoS.

Thoughts?

Thanks,
Mark.
Jens Axboe April 30, 2019, 6:11 p.m. UTC | #5
On 4/30/19 11:03 AM, Mark Rutland wrote:
> On Tue, Apr 30, 2019 at 10:21:03AM -0600, Jens Axboe wrote:
>> On 4/30/19 8:59 AM, Mark Rutland wrote:
>>> On Tue, Apr 30, 2019 at 07:18:10AM -0700, Matthew Wilcox wrote:
>>>> On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
>>>>> In io_sqe_buffer_register() we allocate a number of arrays based on the
>>>>> iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
>>>>> we can still attempt to allocate arrays exceeding MAX_ORDER.
>>>>>
>>>>> On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and
>>>>> iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to
>>>>> allocate a corresponding array of (16-byte) bio_vecs, we need 4194320
>>>>> bytes, which is greater than 4MiB. This results in SLUB warning that
>>>>> we're trying to allocate more than MAX_ORDER allows, and failing the
>>>>> allocation.
>>>>>
>>>>> Avoid this by passing __GFP_NOWARN when allocating arrays for the
>>>>> user-provided iov_len. We'll gracefully handle the failed allocation,
>>>>> returning -ENOMEM to userspace.
>>>>>
>>>>> We should probably consider lowering the limit below SZ_1G, or reworking
>>>>> the array allocations.
>>>>
>>>> I'd suggest that kvmalloc is probably our friend here ... we don't really
>>>> want to return -ENOMEM to userspace for this case, I don't think.
>>>
>>> Sure. I'll go verify that the uring code doesn't assume this memory is
>>> physically contiguous.
>>>
>>> I also guess we should be passing GFP_KERNEL_ACCOUNT rather than a plain
>>> GFP_KERNEL.
>>
>> kvmalloc() is fine, the io_uring code doesn't care about the layout of
>> the memory, it just uses it as an index.
>
> I've just had a go at that, but when using kvmalloc() with or without
> GFP_KERNEL_ACCOUNT I hit OOM and my system hangs within a few seconds with the
> syzkaller prog below:
>
> ----
> Syzkaller reproducer:
> # {Threaded:false Collide:false Repeat:false RepeatTimes:0 Procs:1 Sandbox: Fault:false FaultCall:-1 FaultNth:0 EnableTun:false EnableNetDev:false EnableNetReset:false EnableCgroups:false EnableBinfmtMisc:false EnableCloseFds:false UseTmpDir:false HandleSegv:false Repro:false Trace:false}
> r0 = io_uring_setup(0x378, &(0x7f00000000c0))
> sendmsg$SEG6_CMD_SET_TUNSRC(0xffffffffffffffff, &(0x7f0000000240)={&(0x7f0000000000)={0x10, 0x0, 0x0, 0x40000000}, 0xc, 0x0, 0x1, 0x0, 0x0, 0x10}, 0x800)
> io_uring_register$IORING_REGISTER_BUFFERS(r0, 0x0, &(0x7f0000000000), 0x1)
> ----
>
> ... I'm a bit worried that this opens up a trivial DoS.
>
> Thoughts?

Can you post the patch you used?

-- 
Jens Axboe
Mark Rutland May 1, 2019, 10:30 a.m. UTC | #6
On Tue, Apr 30, 2019 at 12:11:59PM -0600, Jens Axboe wrote:
> On 4/30/19 11:03 AM, Mark Rutland wrote:
> > On Tue, Apr 30, 2019 at 10:21:03AM -0600, Jens Axboe wrote:
> >> On 4/30/19 8:59 AM, Mark Rutland wrote:
> >>> On Tue, Apr 30, 2019 at 07:18:10AM -0700, Matthew Wilcox wrote:
> >>>> On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
> >>>>> In io_sqe_buffer_register() we allocate a number of arrays based on the
> >>>>> iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
> >>>>> we can still attempt to allocate arrays exceeding MAX_ORDER.
> >>>>>
> >>>>> On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and
> >>>>> iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to
> >>>>> allocate a corresponding array of (16-byte) bio_vecs, we need 4194320
> >>>>> bytes, which is greater than 4MiB. This results in SLUB warning that
> >>>>> we're trying to allocate more than MAX_ORDER allows, and failing the
> >>>>> allocation.
> >>>>>
> >>>>> Avoid this by passing __GFP_NOWARN when allocating arrays for the
> >>>>> user-provided iov_len. We'll gracefully handle the failed allocation,
> >>>>> returning -ENOMEM to userspace.
> >>>>>
> >>>>> We should probably consider lowering the limit below SZ_1G, or reworking
> >>>>> the array allocations.
> >>>>
> >>>> I'd suggest that kvmalloc is probably our friend here ... we don't really
> >>>> want to return -ENOMEM to userspace for this case, I don't think.
> >>>
> >>> Sure. I'll go verify that the uring code doesn't assume this memory is
> >>> physically contiguous.
> >>>
> >>> I also guess we should be passing GFP_KERNEL_ACCOUNT rather than a plain
> >>> GFP_KERNEL.
> >>
> >> kvmalloc() is fine, the io_uring code doesn't care about the layout of
> >> the memory, it just uses it as an index.
> >
> > I've just had a go at that, but when using kvmalloc() with or without
> > GFP_KERNEL_ACCOUNT I hit OOM and my system hangs within a few seconds with the
> > syzkaller prog below:
> >
> > ----
> > Syzkaller reproducer:
> > # {Threaded:false Collide:false Repeat:false RepeatTimes:0 Procs:1 Sandbox: Fault:false FaultCall:-1 FaultNth:0 EnableTun:false EnableNetDev:false EnableNetReset:false EnableCgroups:false EnableBinfmtMisc:false EnableCloseFds:false UseTmpDir:false HandleSegv:false Repro:false Trace:false}
> > r0 = io_uring_setup(0x378, &(0x7f00000000c0))
> > sendmsg$SEG6_CMD_SET_TUNSRC(0xffffffffffffffff, &(0x7f0000000240)={&(0x7f0000000000)={0x10, 0x0, 0x0, 0x40000000}, 0xc, 0x0, 0x1, 0x0, 0x0, 0x10}, 0x800)
> > io_uring_register$IORING_REGISTER_BUFFERS(r0, 0x0, &(0x7f0000000000), 0x1)
> > ----
> >
> > ... I'm a bit worried that this opens up a trivial DoS.
> >
> > Thoughts?
>
> Can you post the patch you used?

Diff below.

Mark.

---->8----
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 578b5494f9e5..c6dad90fab14 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2364,7 +2364,7 @@ static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
 
 		if (ctx->account_mem)
 			io_unaccount_mem(ctx->user, imu->nr_bvecs);
-		kfree(imu->bvec);
+		kvfree(imu->bvec);
 		imu->nr_bvecs = 0;
 	}
 
@@ -2456,9 +2456,9 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 		if (!pages || nr_pages > got_pages) {
 			kfree(vmas);
 			kfree(pages);
-			pages = kmalloc_array(nr_pages, sizeof(struct page *),
+			pages = kvmalloc_array(nr_pages, sizeof(struct page *),
 						GFP_KERNEL);
-			vmas = kmalloc_array(nr_pages,
+			vmas = kvmalloc_array(nr_pages,
 					sizeof(struct vm_area_struct *),
 					GFP_KERNEL);
 			if (!pages || !vmas) {
@@ -2470,7 +2470,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 			got_pages = nr_pages;
 		}
 
-		imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
+		imu->bvec = kvmalloc_array(nr_pages, sizeof(struct bio_vec),
 						GFP_KERNEL);
 		ret = -ENOMEM;
 		if (!imu->bvec) {
@@ -2531,12 +2531,12 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 
 		ctx->nr_user_bufs++;
 	}
-	kfree(pages);
-	kfree(vmas);
+	kvfree(pages);
+	kvfree(vmas);
 	return 0;
 err:
-	kfree(pages);
-	kfree(vmas);
+	kvfree(pages);
+	kvfree(vmas);
 	io_sqe_buffer_unregister(ctx);
 	return ret;
 }
-- 
2.11.0
Jens Axboe May 1, 2019, 12:41 p.m. UTC | #7
On 5/1/19 4:30 AM, Mark Rutland wrote:
> On Tue, Apr 30, 2019 at 12:11:59PM -0600, Jens Axboe wrote:
>> On 4/30/19 11:03 AM, Mark Rutland wrote:
>>> On Tue, Apr 30, 2019 at 10:21:03AM -0600, Jens Axboe wrote:
>>>> On 4/30/19 8:59 AM, Mark Rutland wrote:
>>>>> On Tue, Apr 30, 2019 at 07:18:10AM -0700, Matthew Wilcox wrote:
>>>>>> On Tue, Apr 30, 2019 at 02:24:05PM +0100, Mark Rutland wrote:
>>>>>>> In io_sqe_buffer_register() we allocate a number of arrays based on the
>>>>>>> iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
>>>>>>> we can still attempt to allocate arrays exceeding MAX_ORDER.
>>>>>>>
>>>>>>> On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and
>>>>>>> iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to
>>>>>>> allocate a corresponding array of (16-byte) bio_vecs, we need 4194320
>>>>>>> bytes, which is greater than 4MiB. This results in SLUB warning that
>>>>>>> we're trying to allocate more than MAX_ORDER allows, and failing the
>>>>>>> allocation.
>>>>>>>
>>>>>>> Avoid this by passing __GFP_NOWARN when allocating arrays for the
>>>>>>> user-provided iov_len. We'll gracefully handle the failed allocation,
>>>>>>> returning -ENOMEM to userspace.
>>>>>>>
>>>>>>> We should probably consider lowering the limit below SZ_1G, or reworking
>>>>>>> the array allocations.
>>>>>>
>>>>>> I'd suggest that kvmalloc is probably our friend here ... we don't really
>>>>>> want to return -ENOMEM to userspace for this case, I don't think.
>>>>>
>>>>> Sure. I'll go verify that the uring code doesn't assume this memory is
>>>>> physically contiguous.
>>>>>
>>>>> I also guess we should be passing GFP_KERNEL_ACCOUNT rather than a plain
>>>>> GFP_KERNEL.
>>>>
>>>> kvmalloc() is fine, the io_uring code doesn't care about the layout of
>>>> the memory, it just uses it as an index.
>>>
>>> I've just had a go at that, but when using kvmalloc() with or without
>>> GFP_KERNEL_ACCOUNT I hit OOM and my system hangs within a few seconds with the
>>> syzkaller prog below:
>>>
>>> ----
>>> Syzkaller reproducer:
>>> # {Threaded:false Collide:false Repeat:false RepeatTimes:0 Procs:1 Sandbox: Fault:false FaultCall:-1 FaultNth:0 EnableTun:false EnableNetDev:false EnableNetReset:false EnableCgroups:false EnableBinfmtMisc:false EnableCloseFds:false UseTmpDir:false HandleSegv:false Repro:false Trace:false}
>>> r0 = io_uring_setup(0x378, &(0x7f00000000c0))
>>> sendmsg$SEG6_CMD_SET_TUNSRC(0xffffffffffffffff, &(0x7f0000000240)={&(0x7f0000000000)={0x10, 0x0, 0x0, 0x40000000}, 0xc, 0x0, 0x1, 0x0, 0x0, 0x10}, 0x800)
>>> io_uring_register$IORING_REGISTER_BUFFERS(r0, 0x0, &(0x7f0000000000), 0x1)
>>> ----
>>>
>>> ... I'm a bit worried that this opens up a trivial DoS.
>>>
>>> Thoughts?
>>
>> Can you post the patch you used?
>
> Diff below.

And the reproducer; that was never posted. Patch looks fine to me. Note
that buffer registration is under the protection of RLIMIT_MEMLOCK.
That's usually very limited for non-root; as root you can of course
consume as much as you want and OOM the system.
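
For context, a standalone userspace check of that limit (a sketch for
illustration; not part of any patch in this thread):

	#include <stdio.h>
	#include <sys/resource.h>

	int main(void)
	{
		struct rlimit rl;

		/* io_uring charges registered buffers against RLIMIT_MEMLOCK. */
		if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
			return 1;
		printf("memlock soft=%llu hard=%llu\n",
		       (unsigned long long)rl.rlim_cur,
		       (unsigned long long)rl.rlim_max);
		return 0;
	}

The non-root soft limit is typically small (tens of KiB to a few MiB,
depending on the distribution), far below the sizes involved here.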

-- 
Jens Axboe
Mark Rutland May 1, 2019, 3:09 p.m. UTC | #8
On Wed, May 01, 2019 at 06:41:43AM -0600, Jens Axboe wrote:
> On 5/1/19 4:30 AM, Mark Rutland wrote:
> > On Tue, Apr 30, 2019 at 12:11:59PM -0600, Jens Axboe wrote:
> >> On 4/30/19 11:03 AM, Mark Rutland wrote:
> >>> I've just had a go at that, but when using kvmalloc() with or without
> >>> GFP_KERNEL_ACCOUNT I hit OOM and my system hangs within a few seconds with the
> >>> syzkaller prog below:
> >>>
> >>> ----
> >>> Syzkaller reproducer:
> >>> # {Threaded:false Collide:false Repeat:false RepeatTimes:0 Procs:1 Sandbox: Fault:false FaultCall:-1 FaultNth:0 EnableTun:false EnableNetDev:false EnableNetReset:false EnableCgroups:false EnableBinfmtMisc:false EnableCloseFds:false UseTmpDir:false HandleSegv:false Repro:false Trace:false}
> >>> r0 = io_uring_setup(0x378, &(0x7f00000000c0))
> >>> sendmsg$SEG6_CMD_SET_TUNSRC(0xffffffffffffffff, &(0x7f0000000240)={&(0x7f0000000000)={0x10, 0x0, 0x0, 0x40000000}, 0xc, 0x0, 0x1, 0x0, 0x0, 0x10}, 0x800)
> >>> io_uring_register$IORING_REGISTER_BUFFERS(r0, 0x0, &(0x7f0000000000), 0x1)
> >>> ----
> >>>
> >>> ... I'm a bit worried that this opens up a trivial DoS.
> >>>
> >>> Thoughts?
> >>
> >> Can you post the patch you used?
> >
> > Diff below.
>
> And the reproducer; that was never posted.

It was; the "Syzkaller reproducer" above is the reproducer I used with
syz-repro.

I've manually minimized that to C below. AFAICT, that hits a leak, which
is what's triggering the OOM after the program is run a number of times
with the previously posted kvmalloc patch.

Per /proc/meminfo, that memory isn't accounted anywhere.

> Patch looks fine to me. Note
> that buffer registration is under the protection of RLIMIT_MEMLOCK.
> That's usually very limited for non-root; as root you can of course
> consume as much as you want and OOM the system.

Sure.

As above, it looks like there's a leak, regardless.

Thanks,
Mark.

---->8----
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/uio.h>

// NOTE: arm64 syscall numbers
#ifndef __NR_io_uring_register
#define __NR_io_uring_register 427
#endif
#ifndef __NR_io_uring_setup
#define __NR_io_uring_setup 425
#endif

#define __IORING_REGISTER_BUFFERS         0
	
struct __io_sqring_offsets {
	uint32_t head;
	uint32_t tail;
	uint32_t ring_mask;
	uint32_t ring_entries;
	uint32_t flags;
	uint32_t dropped;
	uint32_t array;
	uint32_t resv1;
	uint64_t resv2;
};

struct __io_uring_params {
	uint32_t sq_entries;
	uint32_t cq_entries;
	uint32_t flags;
	uint32_t sq_thread_cpu;
	uint32_t sq_thread_idle;
	uint32_t resv[5];
	struct __io_sqring_offsets sq_off;
	struct __io_sqring_offsets cq_off;

};

static struct __io_uring_params params; 

static struct iovec iov = {
	.iov_base = (void *)0x10,
	.iov_len  = 1024 * 1024 * 1024,
};

int main(void)
{
	int fd;

	fd = syscall(__NR_io_uring_setup, 0x1, &params);
	syscall(__NR_io_uring_register, fd, __IORING_REGISTER_BUFFERS, &iov, 1);

	return 0;
}
Jens Axboe May 1, 2019, 3:29 p.m. UTC | #9
On 5/1/19 9:09 AM, Mark Rutland wrote:
> On Wed, May 01, 2019 at 06:41:43AM -0600, Jens Axboe wrote:
>> On 5/1/19 4:30 AM, Mark Rutland wrote:
>>> On Tue, Apr 30, 2019 at 12:11:59PM -0600, Jens Axboe wrote:
>>>> On 4/30/19 11:03 AM, Mark Rutland wrote:
>>>>> I've just had a go at that, but when using kvmalloc() with or without
>>>>> GFP_KERNEL_ACCOUNT I hit OOM and my system hangs within a few seconds with the
>>>>> syzkaller prog below:
>>>>>
>>>>> ----
>>>>> Syzkaller reproducer:
>>>>> # {Threaded:false Collide:false Repeat:false RepeatTimes:0 Procs:1 Sandbox: Fault:false FaultCall:-1 FaultNth:0 EnableTun:false EnableNetDev:false EnableNetReset:false EnableCgroups:false EnableBinfmtMisc:false EnableCloseFds:false UseTmpDir:false HandleSegv:false Repro:false Trace:false}
>>>>> r0 = io_uring_setup(0x378, &(0x7f00000000c0))
>>>>> sendmsg$SEG6_CMD_SET_TUNSRC(0xffffffffffffffff, &(0x7f0000000240)={&(0x7f0000000000)={0x10, 0x0, 0x0, 0x40000000}, 0xc, 0x0, 0x1, 0x0, 0x0, 0x10}, 0x800)
>>>>> io_uring_register$IORING_REGISTER_BUFFERS(r0, 0x0, &(0x7f0000000000), 0x1)
>>>>> ----
>>>>>
>>>>> ... I'm a bit worried that this opens up a trivial DoS.
>>>>>
>>>>> Thoughts?
>>>>
>>>> Can you post the patch you used?
>>>
>>> Diff below.
>>
>> And the reproducer; that was never posted.
>
> It was; the "Syzkaller reproducer" above is the reproducer I used with
> syz-repro.
>
> I've manually minimized that to C below. AFAICT, that hits a leak, which
> is what's triggering the OOM after the program is run a number of times
> with the previously posted kvmalloc patch.
>
> Per /proc/meminfo, that memory isn't accounted anywhere.
>
>> Patch looks fine to me. Note
>> that buffer registration is under the protection of RLIMIT_MEMLOCK.
>> That's usually very limited for non-root; as root you can of course
>> consume as much as you want and OOM the system.
>
> Sure.
>
> As above, it looks like there's a leak, regardless.

The leak is that we're not releasing imu->bvec in the error case. I also
fixed a missing kfree -> kvfree in your patch; with this rolled-up
version it works for me.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 18cecb6a0151..3e817d40fb96 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2443,7 +2443,7 @@ static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
 
 		if (ctx->account_mem)
 			io_unaccount_mem(ctx->user, imu->nr_bvecs);
-		kfree(imu->bvec);
+		kvfree(imu->bvec);
 		imu->nr_bvecs = 0;
 	}
 
@@ -2533,11 +2533,11 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 
 		ret = 0;
 		if (!pages || nr_pages > got_pages) {
-			kfree(vmas);
-			kfree(pages);
-			pages = kmalloc_array(nr_pages, sizeof(struct page *),
+			kvfree(vmas);
+			kvfree(pages);
+			pages = kvmalloc_array(nr_pages, sizeof(struct page *),
 						GFP_KERNEL);
-			vmas = kmalloc_array(nr_pages,
+			vmas = kvmalloc_array(nr_pages,
 					sizeof(struct vm_area_struct *),
 					GFP_KERNEL);
 			if (!pages || !vmas) {
@@ -2549,7 +2549,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 			got_pages = nr_pages;
 		}
 
-		imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
+		imu->bvec = kvmalloc_array(nr_pages, sizeof(struct bio_vec),
 						GFP_KERNEL);
 		ret = -ENOMEM;
 		if (!imu->bvec) {
@@ -2588,6 +2588,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 			}
 			if (ctx->account_mem)
 				io_unaccount_mem(ctx->user, nr_pages);
+			kvfree(imu->bvec);
 			goto err;
 		}
 
@@ -2610,12 +2611,12 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 
 		ctx->nr_user_bufs++;
 	}
-	kfree(pages);
-	kfree(vmas);
+	kvfree(pages);
+	kvfree(vmas);
 	return 0;
 err:
-	kfree(pages);
-	kfree(vmas);
+	kvfree(pages);
+	kvfree(vmas);
 	io_sqe_buffer_unregister(ctx);
 	return ret;
 }

-- 
Jens Axboe
Mark Rutland May 1, 2019, 3:55 p.m. UTC | #10
On Wed, May 01, 2019 at 09:29:25AM -0600, Jens Axboe wrote:
> On 5/1/19 9:09 AM, Mark Rutland wrote:
> > I've manually minimized that to C below. AFAICT, that hits a leak, which
> > is what's triggering the OOM after the program is run a number of times
> > with the previously posted kvmalloc patch.
> >
> > Per /proc/meminfo, that memory isn't accounted anywhere.
> >
> >> Patch looks fine to me. Note
> >> that buffer registration is under the protection of RLIMIT_MEMLOCK.
> >> That's usually very limited for non-root; as root you can of course
> >> consume as much as you want and OOM the system.
> >
> > Sure.
> >
> > As above, it looks like there's a leak, regardless.
>
> The leak is that we're not releasing imu->bvec in the error case. I also
> fixed a missing kfree -> kvfree in your patch; with this rolled-up
> version it works for me.

That works for me too.

I'll fold that into v2, and send that out momentarily.

Thanks,
Mark.

Patch

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 48fa6e86bfd6..25fc8cb56fc5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2453,10 +2453,10 @@  static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 			kfree(vmas);
 			kfree(pages);
 			pages = kmalloc_array(nr_pages, sizeof(struct page *),
-						GFP_KERNEL);
+						GFP_KERNEL | __GFP_NOWARN);
 			vmas = kmalloc_array(nr_pages,
 					sizeof(struct vm_area_struct *),
-					GFP_KERNEL);
+					GFP_KERNEL | __GFP_NOWARN);
 			if (!pages || !vmas) {
 				ret = -ENOMEM;
 				if (ctx->account_mem)
@@ -2467,7 +2467,7 @@  static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 		}
 
 		imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
-						GFP_KERNEL);
+						GFP_KERNEL | __GFP_NOWARN);
 		ret = -ENOMEM;
 		if (!imu->bvec) {
 			if (ctx->account_mem)