
[v30,05/11] arm64: kdump: protect crash dump kernel memory

Message ID 20170124085004.3892-4-takahiro.akashi@linaro.org
Series: arm64: add kdump support

Commit Message

AKASHI Takahiro Jan. 24, 2017, 8:49 a.m. UTC
To protect the memory reserved for the crash dump kernel once it is loaded,
arch_kexec_protect_crashkres/unprotect_crashkres() are meant to deal with
the permissions of the corresponding kernel mappings.

We also have to
- put the region in an isolated mapping, and
- move copying kexec's control_code_page to machine_kexec_prepare()
so that the region will be completely read-only after loading.

Note that the region must reside in the linear mapping and have corresponding
page structures in order to be potentially freed by shrinking it through
/sys/kernel/kexec_crash_size.

Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>

---
 arch/arm64/kernel/machine_kexec.c | 68 +++++++++++++++++++++++++--------------
 arch/arm64/mm/mmu.c               | 34 ++++++++++++++++++++
 2 files changed, 77 insertions(+), 25 deletions(-)

-- 
2.11.0



Comments

James Morse Jan. 25, 2017, 5:37 p.m. UTC | #1
Hi Akashi,

On 24/01/17 08:49, AKASHI Takahiro wrote:
> To protect the memory reserved for the crash dump kernel once it is loaded,
> arch_kexec_protect_crashkres/unprotect_crashkres() are meant to deal with
> the permissions of the corresponding kernel mappings.
> 
> We also have to
> - put the region in an isolated mapping, and
> - move copying kexec's control_code_page to machine_kexec_prepare()
> so that the region will be completely read-only after loading.

> Note that the region must reside in the linear mapping and have corresponding
> page structures in order to be potentially freed by shrinking it through
> /sys/kernel/kexec_crash_size.


Nasty! Presumably you have to build the crash region out of individual page
mappings so that they can be returned to the slab-allocator one page at a time,
and still be able to set/clear the valid bits on the remaining chunk.
(I don't see how that happens in this patch)

debug_pagealloc has to do this too so it can flip the valid bits one page at a
time. You could change the debug_pagealloc_enabled() value passed in at the top
__create_pgd_mapping() level to be a needs_per_page_mapping(addr, size) test
that happens as we build the linear map. (This would save the 3 extra calls to
__create_pgd_mapping() in __map_memblock())
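
A minimal sketch of what such a helper could look like (needs_per_page_mapping()
is only a name proposed in this thread, not an existing kernel function):

/*
 * Hypothetical helper, following the suggestion above: decide while
 * building the linear map whether a range must be mapped with page-size
 * entries only, so individual pages can later be unmapped or freed.
 */
static bool needs_per_page_mapping(phys_addr_t addr, phys_addr_t size)
{
	if (debug_pagealloc_enabled())
		return true;

	/* crashk_res.end is inclusive; any overlap forces page mappings. */
	if (crashk_res.end && addr <= crashk_res.end &&
	    addr + size > crashk_res.start)
		return true;

	return false;
}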

I'm glad to see you can't resize the region if a crash kernel is loaded!

This secretly-unmapped is the sort of thing that breaks hibernate, it blindly
assumes pfn_valid() means it can access the page if it wants to. Setting
PG_Reserved is a quick way to trick it out of doing this, but that would leave
the crash kernel region un-initialised after resume, while kexec_crash_image
still has a value.
I think the best fix for this is to forbid hibernate if kexec_crash_loaded()
arguing these are mutually-exclusive features, and the protect crash-dump
feature exists to prevent things like hibernate corrupting the crash region.
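
A rough sketch of that check, reusing the existing kexec_crash_loaded()
helper (where exactly it would be called from is left open here):

	/* Sketch: refuse to enter hibernation while a kdump kernel is
	 * loaded, treating the two features as mutually exclusive. */
	if (kexec_crash_loaded()) {
		pr_err("hibernation: kdump kernel loaded, aborting\n");
		return -EBUSY;
	}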


> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
> index bc96c8a7fc79..f7938fecf3ff 100644
> --- a/arch/arm64/kernel/machine_kexec.c
> +++ b/arch/arm64/kernel/machine_kexec.c
> @@ -159,32 +171,20 @@ void machine_kexec(struct kimage *kimage)
>  		kimage->control_code_page);
>  	pr_debug("%s:%d: reboot_code_buffer_phys:  %pa\n", __func__, __LINE__,
>  		&reboot_code_buffer_phys);
> -	pr_debug("%s:%d: reboot_code_buffer:       %p\n", __func__, __LINE__,
> -		reboot_code_buffer);
>  	pr_debug("%s:%d: relocate_new_kernel:      %p\n", __func__, __LINE__,
>  		arm64_relocate_new_kernel);
>  	pr_debug("%s:%d: relocate_new_kernel_size: 0x%lx(%lu) bytes\n",
>  		__func__, __LINE__, arm64_relocate_new_kernel_size,
>  		arm64_relocate_new_kernel_size);
>  
> -	/*
> -	 * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
> -	 * after the kernel is shut down.
> -	 */
> -	memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
> -		arm64_relocate_new_kernel_size);
> -
> -	/* Flush the reboot_code_buffer in preparation for its execution. */
> -	__flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
> -	flush_icache_range((uintptr_t)reboot_code_buffer,
> -		arm64_relocate_new_kernel_size);

> -	/* Flush the kimage list and its buffers. */
> -	kexec_list_flush(kimage);
> +	if (kimage != kexec_crash_image) {
> +		/* Flush the kimage list and its buffers. */
> +		kexec_list_flush(kimage);
>  
> -	/* Flush the new image if already in place. */
> -	if (kimage->head & IND_DONE)
> -		kexec_segment_flush(kimage);
> +		/* Flush the new image if already in place. */
> +		if (kimage->head & IND_DONE)
> +			kexec_segment_flush(kimage);
> +	}


So for kdump we cleaned the kimage->segment[i].mem regions in
arch_kexec_protect_crashkres(), so we don't need to do it here.

What about the kimage->head[i] array of list entries that were cleaned by
kexec_list_flush()? Now we don't clean that for kdump either, but we do pass it
to arm64_relocate_new_kernel() at the end of this function:
> cpu_soft_restart(1, reboot_code_buffer_phys, kimage->head, kimage_start, 0);


Can we test the IND_DONE_BIT of kimage->head, so that we know that
arm64_relocate_new_kernel() won't try to walk the unclean list?
Alternatively we could call kexec_list_flush() in arch_kexec_protect_crashkres()
too.
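
A sketch of the first option, reusing this patch's kexec_list_flush() helper
(machine_kexec() context assumed):

	/*
	 * If the head entry already has IND_DONE set,
	 * arm64_relocate_new_kernel() stops before dereferencing any list
	 * entry, so the uncleaned list is never walked with the MMU off.
	 */
	if (!(kimage->head & IND_DONE))
		kexec_list_flush(kimage);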



Thanks,

James


AKASHI Takahiro Jan. 26, 2017, 11:28 a.m. UTC | #2
James,

I will revisit your comments in detail later, but here are some quick replies now.

On Wed, Jan 25, 2017 at 05:37:38PM +0000, James Morse wrote:
> Hi Akashi,
> 
> On 24/01/17 08:49, AKASHI Takahiro wrote:
> > To protect the memory reserved for the crash dump kernel once it is loaded,
> > arch_kexec_protect_crashkres/unprotect_crashkres() are meant to deal with
> > the permissions of the corresponding kernel mappings.
> > 
> > We also have to
> > - put the region in an isolated mapping, and
> > - move copying kexec's control_code_page to machine_kexec_prepare()
> > so that the region will be completely read-only after loading.
> 
> > Note that the region must reside in the linear mapping and have corresponding
> > page structures in order to be potentially freed by shrinking it through
> > /sys/kernel/kexec_crash_size.
> 
> Nasty! Presumably you have to build the crash region out of individual page
> mappings,

This might be an alternative, but

> so that they can be returned to the slab-allocator one page at a time,
> and still be able to set/clear the valid bits on the remaining chunk.
> (I don't see how that happens in this patch)


As far as the shrinking feature is concerned, I believe crash_shrink_memory(),
which eventually calls free_reserved_page(), will take care of all the things
to do. I can see an increased number of "MemFree" in /proc/meminfo.
(Please note that the region is memblock_reserve()'d at boot time.)

> debug_pagealloc has to do this too so it can flip the valid bits one page at a
> time. You could change the debug_pagealloc_enabled() value passed in at the top
> __create_pgd_mapping() level to be a needs_per_page_mapping(addr, size) test
> that happens as we build the linear map. (This would save the 3 extra calls to
> __create_pgd_mapping() in __map_memblock())
> 
> I'm glad to see you can't resize the region if a crash kernel is loaded!
> 
> This secretly-unmapped is the sort of thing that breaks hibernate, it blindly
> assumes pfn_valid() means it can access the page if it wants to. Setting
> PG_Reserved is a quick way to trick it out of doing this, but that would leave
> the crash kernel region un-initialised after resume, while kexec_crash_image
> still has a value.

Ouch, I didn't notice this issue.

> I think the best fix for this is to forbid hibernate if kexec_crash_loaded()
> arguing these are mutually-exclusive features, and the protect crash-dump
> feature exists to prevent things like hibernate corrupting the crash region.


This restriction is really painful.
Is there any hibernation hook that will be invoked before suspending and
after resuming? If so, arch_kexec_unprotect_crashkres()/protect_crashkres()
could be called there.

Or if a "read-only (without unmapping)" approach were acceptable,
those two features might no longer be mutually exclusive.

> > diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
> > index bc96c8a7fc79..f7938fecf3ff 100644
> > --- a/arch/arm64/kernel/machine_kexec.c
> > +++ b/arch/arm64/kernel/machine_kexec.c
> > @@ -159,32 +171,20 @@ void machine_kexec(struct kimage *kimage)
> >  		kimage->control_code_page);
> >  	pr_debug("%s:%d: reboot_code_buffer_phys:  %pa\n", __func__, __LINE__,
> >  		&reboot_code_buffer_phys);
> > -	pr_debug("%s:%d: reboot_code_buffer:       %p\n", __func__, __LINE__,
> > -		reboot_code_buffer);
> >  	pr_debug("%s:%d: relocate_new_kernel:      %p\n", __func__, __LINE__,
> >  		arm64_relocate_new_kernel);
> >  	pr_debug("%s:%d: relocate_new_kernel_size: 0x%lx(%lu) bytes\n",
> >  		__func__, __LINE__, arm64_relocate_new_kernel_size,
> >  		arm64_relocate_new_kernel_size);
> >  
> > -	/*
> > -	 * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
> > -	 * after the kernel is shut down.
> > -	 */
> > -	memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
> > -		arm64_relocate_new_kernel_size);
> > -
> > -	/* Flush the reboot_code_buffer in preparation for its execution. */
> > -	__flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
> > -	flush_icache_range((uintptr_t)reboot_code_buffer,
> > -		arm64_relocate_new_kernel_size);
> 
> > -	/* Flush the kimage list and its buffers. */
> > -	kexec_list_flush(kimage);
> > +	if (kimage != kexec_crash_image) {
> > +		/* Flush the kimage list and its buffers. */
> > +		kexec_list_flush(kimage);
> >  
> > -	/* Flush the new image if already in place. */
> > -	if (kimage->head & IND_DONE)
> > -		kexec_segment_flush(kimage);
> > +		/* Flush the new image if already in place. */
> > +		if (kimage->head & IND_DONE)
> > +			kexec_segment_flush(kimage);
> > +	}

> So for kdump we cleaned the kimage->segment[i].mem regions in
> arch_kexec_protect_crashkres(), so we don't need to do it here.

Correct.

> What about the kimage->head[i] array of list entries that were cleaned by
> kexec_list_flush()? Now we don't clean that for kdump either, but we do pass it
> to arm64_relocate_new_kernel() at the end of this function:
> > cpu_soft_restart(1, reboot_code_buffer_phys, kimage->head, kimage_start, 0);


kimage->head holds a list of memory regions that overlap
between the primary kernel and the secondary kernel, but in the kdump case,
the whole memory is isolated and the list should be empty.

That is why kexec_list_flush() is skipped here, but yes,
"kimage->head" might better be cleaned anyway.
(I believe, from the past discussions, that cache coherency is still
maintained in the kdump case though.)

> Can we test the IND_DONE_BIT of kimage->head, so that we know that
> arm64_relocate_new_kernel() won't try to walk the unclean list?
> Alternatively we could call kexec_list_flush() in arch_kexec_protect_crashkres()
> too.


So we should call kexec_list_flush() in machine_kexec() in both the kexec and kdump cases.

Thanks,
-Takahiro AKASHI

> Thanks,
> 
> James


James Morse Jan. 27, 2017, 11:19 a.m. UTC | #3
Hi Akashi,

On 26/01/17 11:28, AKASHI Takahiro wrote:
> On Wed, Jan 25, 2017 at 05:37:38PM +0000, James Morse wrote:
>> On 24/01/17 08:49, AKASHI Takahiro wrote:
>>> To protect the memory reserved for the crash dump kernel once it is loaded,
>>> arch_kexec_protect_crashkres/unprotect_crashkres() are meant to deal with
>>> the permissions of the corresponding kernel mappings.
>>>
>>> We also have to
>>> - put the region in an isolated mapping, and
>>> - move copying kexec's control_code_page to machine_kexec_prepare()
>>> so that the region will be completely read-only after loading.
>>
>>> Note that the region must reside in the linear mapping and have corresponding
>>> page structures in order to be potentially freed by shrinking it through
>>> /sys/kernel/kexec_crash_size.
>>
>> Nasty! Presumably you have to build the crash region out of individual page
>> mappings,
> 
> This might be an alternative, but
> 
>> so that they can be returned to the slab-allocator one page at a time,
>> and still be able to set/clear the valid bits on the remaining chunk.
>> (I don't see how that happens in this patch)
> 
> As far as the shrinking feature is concerned, I believe crash_shrink_memory(),
> which eventually calls free_reserved_page(), will take care of all the things
> to do. I can see an increased number of "MemFree" in /proc/meminfo.


Except for arch specific stuff like reformatting the page tables. Maybe I've
overlooked the way out of this. What happens with this scenario:

We boot with crashkernel=1G on the command line.
memblock_reserve() allocates a naturally aligned 1GB block of memory for the
crash region.
Your code in __map_memblock() calls __create_pgd_mapping() ->
alloc_init_pud() which decides use_1G_block() looks like a good idea.

Some time later, the user decides to free half of this region;
free_reserved_page() does its thing and half of those struct pages now belong
to the memory allocator.

Now we load a kdump kernel, which causes arch_kexec_protect_crashkres() to be
called for the 512MB region that was left.

create_mapping_late() needs to split the 1GB mapping it originally made into a
smaller table, with the first half using PAGE_KERNEL_INVALID, and the second
half using PAGE_KERNEL. It can't do break-before-make because these pages may
be in use by another CPU, since we gave them back to the memory allocator. (In
the worst possible world, that second half contains our stack!)


Making this behave more like debug_pagealloc where the region is only built of
page-size mappings should avoid this. The smallest change to what you have is to
always pass page_mappings_only for the kdump region.

Ideally we just disable this resize feature for ARM64 and support it with some
later kernel version, but I can't see a way of doing this without adding Kconfig
symbols to other architectures.


> (Please note that the region is memblock_reserve()'d at boot time.)


And free_reserved_page() does nothing to update memblock, so
memblock_is_reserved() says these pages are reserved, but in reality they
are in use by the memory allocator. This doesn't feel right.

(Fortunately we can override crash_free_reserved_phys_range() so this can
 probably be fixed)
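
A sketch of such an override (the __weak default lives in kernel/kexec_core.c;
the memblock_free() call here is the assumed addition):

void crash_free_reserved_phys_range(unsigned long begin, unsigned long end)
{
	unsigned long addr;

	/* Keep memblock's view in sync with the pages we give back. */
	memblock_free(begin, end - begin);

	for (addr = begin; addr < end; addr += PAGE_SIZE)
		free_reserved_page(pfn_to_page(addr >> PAGE_SHIFT));
}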


>> This secretly-unmapped is the sort of thing that breaks hibernate, it blindly
>> assumes pfn_valid() means it can access the page if it wants to. Setting
>> PG_Reserved is a quick way to trick it out of doing this, but that would leave
>> the crash kernel region un-initialised after resume, while kexec_crash_image
>> still has a value.
> 
> Ouch, I didn't notice this issue.
> 
>> I think the best fix for this is to forbid hibernate if kexec_crash_loaded()
>> arguing these are mutually-exclusive features, and the protect crash-dump
>> feature exists to prevent things like hibernate corrupting the crash region.
> 
> This restriction is really painful.
> Is there any hibernation hook that will be invoked before suspending and
> after resuming? If so, arch_kexec_unprotect_crashkres()/protect_crashkres()
> could be called there.


Those calls could go in swsusp_arch_suspend() in /arch/arm64/kernel/hibernate.c,
but isn't this protect feature supposed to stop things like hibernate from
meddling with the region?

(I haven't tested what hibernate does with the crash region as it's only just
occurred to me)

I think to avoid holding kdump up we should disable any possible interaction,
(forbid hibernate if a kdump kernel is loaded), and sort it out later!


>>> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
>>> index bc96c8a7fc79..f7938fecf3ff 100644
>>> --- a/arch/arm64/kernel/machine_kexec.c
>>> +++ b/arch/arm64/kernel/machine_kexec.c
>>> @@ -159,32 +171,20 @@ void machine_kexec(struct kimage *kimage)
>>> -	/* Flush the kimage list and its buffers. */
>>> -	kexec_list_flush(kimage);
>>> +	if (kimage != kexec_crash_image) {
>>> +		/* Flush the kimage list and its buffers. */
>>> +		kexec_list_flush(kimage);
>>>  
>>> -	/* Flush the new image if already in place. */
>>> -	if (kimage->head & IND_DONE)
>>> -		kexec_segment_flush(kimage);
>>> +		/* Flush the new image if already in place. */
>>> +		if (kimage->head & IND_DONE)
>>> +			kexec_segment_flush(kimage);
>>> +	}

>> So for kdump we cleaned the kimage->segment[i].mem regions in
>> arch_kexec_protect_crashkres(), so we don't need to do it here.
> 
> Correct.
> 
>> What about the kimage->head[i] array of list entries that were cleaned by
>> kexec_list_flush()? Now we don't clean that for kdump either, but we do pass it
>> to arm64_relocate_new_kernel() at the end of this function:
>>> cpu_soft_restart(1, reboot_code_buffer_phys, kimage->head, kimage_start, 0);
> 
> kimage->head holds a list of memory regions that overlap
> between the primary kernel and the secondary kernel, but in the kdump case,
> the whole memory is isolated and the list should be empty.


The asm code will still try to walk the list with MMU and caches turned off, so
even its "I'm empty" values need cleaning to the PoC.

(it looks like the first value is passed by value, so we could try and be clever
by testing for that DONE flag in the first value, but I don't think it's worth
the effort)


Thanks,

James


James Morse Jan. 27, 2017, 1:59 p.m. UTC | #4
Hi Akashi,

On 24/01/17 08:49, AKASHI Takahiro wrote:
> To protect the memory reserved for the crash dump kernel once it is loaded,
> arch_kexec_protect_crashkres/unprotect_crashkres() are meant to deal with
> the permissions of the corresponding kernel mappings.
> 
> We also have to
> - put the region in an isolated mapping, and


> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 9c7adcce8e4e..2d4a0b68a852 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -22,6 +22,7 @@
>  #include <linux/kernel.h>
>  #include <linux/errno.h>
>  #include <linux/init.h>
> +#include <linux/kexec.h>
>  #include <linux/libfdt.h>
>  #include <linux/mman.h>
>  #include <linux/nodemask.h>
> @@ -367,6 +368,39 @@ static void __init __map_memblock(pgd_t *pgd, phys_addr_t start, phys_addr_t end
>  	unsigned long kernel_start = __pa(_text);
>  	unsigned long kernel_end = __pa(__init_begin);
>  
> +#ifdef CONFIG_KEXEC_CORE
> +	/*
> +	 * While crash dump kernel memory is contained in a single memblock
> +	 * for now, it should appear in an isolated mapping so that we can
> +	 * independently unmap the region later.
> +	 */
> +	if (crashk_res.end && crashk_res.start >= start &&
> +	    crashk_res.end <= end) {
> +		if (crashk_res.start != start)
> +			__create_pgd_mapping(pgd, start, __phys_to_virt(start),
> +					     crashk_res.start - start,
> +					     PAGE_KERNEL,
> +					     early_pgtable_alloc,
> +					     debug_pagealloc_enabled());
> +
> +		/* before kexec_load(), the region can be read-writable. */
> +		__create_pgd_mapping(pgd, crashk_res.start,
> +				     __phys_to_virt(crashk_res.start),
> +				     crashk_res.end - crashk_res.start + 1,
> +				     PAGE_KERNEL, early_pgtable_alloc,
> +				     debug_pagealloc_enabled());
> +
> +		if (crashk_res.end != end)
> +			__create_pgd_mapping(pgd, crashk_res.end + 1,
> +					     __phys_to_virt(crashk_res.end + 1),
> +					     end - crashk_res.end - 1,
> +					     PAGE_KERNEL,
> +					     early_pgtable_alloc,
> +					     debug_pagealloc_enabled());

> +		return;


Doesn't this mean we skip all the 'does this overlap with the kernel text' tests
that happen further down in this file?

(I think this can be fixed by replacing page_mappings_only with something
 like needs_per_page_mapping(addr, size) called when we try to place a block
 mapping. needs_per_page_mapping() can then take kdump and debug_pagealloc into
 account.)


I see boot failures on v4.10-rc5, with this series (and your fixup diff for
patch 4). I'm using defconfig with 64K pages and 42bit VA on Juno. I pass
'crashkernel=1G':
[    0.000000] efi: Getting EFI parameters from FDT:
[    0.000000] efi: EFI v2.50 by ARM Juno EFI Nov 24 2015 12:36:35
[    0.000000] efi:  ACPI=0xf95b0000  ACPI 2.0=0xf95b0014  PROP=0xfe8db4d8
[    0.000000] Reserving 1024MB of memory at 2560MB for crashkernel
[    0.000000] cma: Failed to reserve 512 MiB
[    0.000000] ------------[ cut here ]------------
[    0.000000] kernel BUG at ../arch/arm64/mm/mmu.c:118!
[    0.000000] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.10.0-rc5-00012-gdd6fe390c85d #6873
[    0.000000] Hardware name: ARM Juno development board (r1) (DT)
[    0.000000] task: fffffc0008e2c900 task.stack: fffffc0008df0000
[    0.000000] PC is at __create_pgd_mapping+0x2c8/0x2e8
[    0.000000] LR is at paging_init+0x2b0/0x6fc
[    0.000000] pc : [<fffffc0008096b20>] lr : [<fffffc0008ce5a60>] pstate: 600000c5
[    0.000000] sp : fffffc0008df3e20
[    0.000000] x29: fffffc0008df3e20 x28: 00e8000000000713
[    0.000000] x27: 0000000000000001 x26: 00000000e00f0000
[    0.000000] x25: fffffe00794a0000 x24: fffffe00794a0000
[    0.000000] x23: fffffe00600f0000 x22: 00e80000e0000711
[    0.000000] x21: 0000000000000000 x20: fffffdff7e5e8018
[    0.000000] x19: 000000000000e00f x18: 0000000000000000
[    0.000000] x17: fffffc0008f61cd0 x16: 0000000000000005
[    0.000000] x15: 0000000880000000 x14: 00000000001fffff
[    0.000000] x13: 00f8000000000713 x12: fffffc0008e3d000
[    0.000000] x11: 0000000000000801 x10: 00e8000000000713
[    0.000000] x9 : 0000000000000000 x8 : 0000000000001003
[    0.000000] x7 : 0000000000000000 x6 : 0000000000000000
[    0.000000] x5 : fffffc0008ce5744 x4 : 00e8000000000713
[    0.000000] x3 : fffffe007949ffff x2 : fffffe00e00f0000
[    0.000000] x1 : 00e8000000000713 x0 : 0000000000000001
[    0.000000]
[    0.000000] Process swapper (pid: 0, stack limit = 0xfffffc0008df0000)
[    0.000000] Stack: (0xfffffc0008df3e20 to 0xfffffc0008df4000)

[    0.000000] Call trace:

[    0.000000] [<fffffc0008096b20>] __create_pgd_mapping+0x2c8/0x2e8
[    0.000000] [<fffffc0008ce5a60>] paging_init+0x2b0/0x6fc
[    0.000000] [<fffffc0008ce27d0>] setup_arch+0x1c0/0x5ac
[    0.000000] [<fffffc0008ce0838>] start_kernel+0x70/0x394
[    0.000000] [<fffffc0008ce01e8>] __primary_switched+0x64/0x6c
[    0.000000] Code: f29fefe1 ea01001f 54ffff00 d4210000 (d4210000)
[    0.000000] ---[ end trace 0000000000000000 ]---
[    0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
[    0.000000] ---[ end Kernel panic - not syncing: Attempted to kill the idle task!


Adding some debug shows the crash region was allocated as 0xa0000000:0xdfffffff,
and the first memblock to hit __map_memblock() was 0x80000000:0xe0000000.

This causes the end of the crash region to be padded, as 0xdfffffff != 0xe0000000,
but your 'crashk_res.end + 1' actually mapped 0xe0000000 to 0xe0000000 with 0
size. This causes __create_pgd_mapping() to choke when it next uses 0xe0000000
as a start address, as it evidently mapped something when given a 0 size.

You need to round up crashk_res.end to a page boundary if it isn't already
aligned.


> +	}
> +#endif
> +
>  	/*
>  	 * Take care not to create a writable alias for the
>  	 * read-only text and rodata sections of the kernel image.


Thanks,

James

AKASHI Takahiro Jan. 27, 2017, 3:42 p.m. UTC | #5
James,

On Fri, Jan 27, 2017 at 01:59:05PM +0000, James Morse wrote:
> Hi Akashi,
> 
> On 24/01/17 08:49, AKASHI Takahiro wrote:
> > To protect the memory reserved for the crash dump kernel once it is loaded,
> > arch_kexec_protect_crashkres/unprotect_crashkres() are meant to deal with
> > the permissions of the corresponding kernel mappings.
> > 
> > We also have to
> > - put the region in an isolated mapping, and
> 
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index 9c7adcce8e4e..2d4a0b68a852 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -22,6 +22,7 @@
> >  #include <linux/kernel.h>
> >  #include <linux/errno.h>
> >  #include <linux/init.h>
> > +#include <linux/kexec.h>
> >  #include <linux/libfdt.h>
> >  #include <linux/mman.h>
> >  #include <linux/nodemask.h>
> > @@ -367,6 +368,39 @@ static void __init __map_memblock(pgd_t *pgd, phys_addr_t start, phys_addr_t end
> >  	unsigned long kernel_start = __pa(_text);
> >  	unsigned long kernel_end = __pa(__init_begin);
> >  
> > +#ifdef CONFIG_KEXEC_CORE
> > +	/*
> > +	 * While crash dump kernel memory is contained in a single memblock
> > +	 * for now, it should appear in an isolated mapping so that we can
> > +	 * independently unmap the region later.
> > +	 */
> > +	if (crashk_res.end && crashk_res.start >= start &&
> > +	    crashk_res.end <= end) {
> > +		if (crashk_res.start != start)
> > +			__create_pgd_mapping(pgd, start, __phys_to_virt(start),
> > +					     crashk_res.start - start,
> > +					     PAGE_KERNEL,
> > +					     early_pgtable_alloc,
> > +					     debug_pagealloc_enabled());
> > +
> > +		/* before kexec_load(), the region can be read-writable. */
> > +		__create_pgd_mapping(pgd, crashk_res.start,
> > +				     __phys_to_virt(crashk_res.start),
> > +				     crashk_res.end - crashk_res.start + 1,
> > +				     PAGE_KERNEL, early_pgtable_alloc,
> > +				     debug_pagealloc_enabled());
> > +
> > +		if (crashk_res.end != end)
> > +			__create_pgd_mapping(pgd, crashk_res.end + 1,
> > +					     __phys_to_virt(crashk_res.end + 1),
> > +					     end - crashk_res.end - 1,
> > +					     PAGE_KERNEL,
> > +					     early_pgtable_alloc,
> > +					     debug_pagealloc_enabled());
> 
> > +		return;
> 
> Doesn't this mean we skip all the 'does this overlap with the kernel text' tests
> that happen further down in this file?

You're right. We should still check the overlap against
[start..crashk_res.start] and [crashk_res.end+1..end].

(Using memblock_isolate_range() could simplify the code.)

> (I think this can be fixed by replacing page_mappings_only with something
>  like needs_per_page_mapping(addr, size) called when we try to place a block
>  mapping. needs_per_page_mapping() can then take kdump and debug_pagealloc into
>  account.)
> 
> I see boot failures on v4.10-rc5, with this series (and your fixup diff for
> patch 4). I'm using defconfig with 64K pages and 42bit VA on Juno. I pass
> 'crashkernel=1G':
> [    0.000000] efi: Getting EFI parameters from FDT:
> [    0.000000] efi: EFI v2.50 by ARM Juno EFI Nov 24 2015 12:36:35
> [    0.000000] efi:  ACPI=0xf95b0000  ACPI 2.0=0xf95b0014  PROP=0xfe8db4d8
> [    0.000000] Reserving 1024MB of memory at 2560MB for crashkernel
> [    0.000000] cma: Failed to reserve 512 MiB
> [    0.000000] ------------[ cut here ]------------
> [    0.000000] kernel BUG at ../arch/arm64/mm/mmu.c:118!
> [    0.000000] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> [    0.000000] Modules linked in:
> [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.10.0-rc5-00012-gdd6fe390c85d #6873
> [    0.000000] Hardware name: ARM Juno development board (r1) (DT)
> [    0.000000] task: fffffc0008e2c900 task.stack: fffffc0008df0000
> [    0.000000] PC is at __create_pgd_mapping+0x2c8/0x2e8
> [    0.000000] LR is at paging_init+0x2b0/0x6fc
> [    0.000000] pc : [<fffffc0008096b20>] lr : [<fffffc0008ce5a60>] pstate: 600000c5
> [    0.000000] sp : fffffc0008df3e20
> [    0.000000] x29: fffffc0008df3e20 x28: 00e8000000000713
> [    0.000000] x27: 0000000000000001 x26: 00000000e00f0000
> [    0.000000] x25: fffffe00794a0000 x24: fffffe00794a0000
> [    0.000000] x23: fffffe00600f0000 x22: 00e80000e0000711
> [    0.000000] x21: 0000000000000000 x20: fffffdff7e5e8018
> [    0.000000] x19: 000000000000e00f x18: 0000000000000000
> [    0.000000] x17: fffffc0008f61cd0 x16: 0000000000000005
> [    0.000000] x15: 0000000880000000 x14: 00000000001fffff
> [    0.000000] x13: 00f8000000000713 x12: fffffc0008e3d000
> [    0.000000] x11: 0000000000000801 x10: 00e8000000000713
> [    0.000000] x9 : 0000000000000000 x8 : 0000000000001003
> [    0.000000] x7 : 0000000000000000 x6 : 0000000000000000
> [    0.000000] x5 : fffffc0008ce5744 x4 : 00e8000000000713
> [    0.000000] x3 : fffffe007949ffff x2 : fffffe00e00f0000
> [    0.000000] x1 : 00e8000000000713 x0 : 0000000000000001
> [    0.000000] Process swapper (pid: 0, stack limit = 0xfffffc0008df0000)
> [    0.000000] Stack: (0xfffffc0008df3e20 to 0xfffffc0008df4000)
> [    0.000000] Call trace:
> [    0.000000] [<fffffc0008096b20>] __create_pgd_mapping+0x2c8/0x2e8
> [    0.000000] [<fffffc0008ce5a60>] paging_init+0x2b0/0x6fc
> [    0.000000] [<fffffc0008ce27d0>] setup_arch+0x1c0/0x5ac
> [    0.000000] [<fffffc0008ce0838>] start_kernel+0x70/0x394
> [    0.000000] [<fffffc0008ce01e8>] __primary_switched+0x64/0x6c
> [    0.000000] Code: f29fefe1 ea01001f 54ffff00 d4210000 (d4210000)
> [    0.000000] ---[ end trace 0000000000000000 ]---
> [    0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
> [    0.000000] ---[ end Kernel panic - not syncing: Attempted to kill the idle task!
> 
> Adding some debug shows the crash region was allocated as 0xa0000000:0xdfffffff,
> and the first memblock to hit __map_memblock() was 0x80000000:0xe0000000.
> 
> This causes the end of the crash region to be padded, as 0xdfffffff != 0xe0000000,
> but your 'crashk_res.end + 1' actually mapped 0xe0000000 to 0xe0000000 with 0
> size. This causes __create_pgd_mapping() to choke when it next uses 0xe0000000
> as a start address, as it evidently mapped something when given a 0 size.
> 
> You need to round up crashk_res.end to a page boundary if it isn't already
> aligned.


The start address is already enforced to be 2MB aligned and the size of the
region (hence start + size) is also page-size aligned. (See patch #3.)
Since crashk_res.end is 'inclusive,' the fix would be changing

> > +		if (crashk_res.end != end)
> > +			__create_pgd_mapping(pgd, crashk_res.end + 1,

to

		if ((crashk_res.end + 1) < end)
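
Spelled out against the quoted hunk, the corrected mapping call would be
(a sketch of that one-line change):

		/* crashk_res.end is inclusive, so a trailing chunk exists
		 * only when crashk_res.end + 1 lies strictly below end. */
		if ((crashk_res.end + 1) < end)
			__create_pgd_mapping(pgd, crashk_res.end + 1,
					     __phys_to_virt(crashk_res.end + 1),
					     end - crashk_res.end - 1,
					     PAGE_KERNEL,
					     early_pgtable_alloc,
					     debug_pagealloc_enabled());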

Thanks,
-Takahiro AKASHI

> > +	}
> > +#endif
> > +
> >  	/*
> >  	 * Take care not to create a writable alias for the
> >  	 * read-only text and rodata sections of the kernel image.
> 
> Thanks,
> 
> James


AKASHI Takahiro Jan. 27, 2017, 5:15 p.m. UTC | #6
James,

On Fri, Jan 27, 2017 at 11:19:32AM +0000, James Morse wrote:
> Hi Akashi,
> 
> On 26/01/17 11:28, AKASHI Takahiro wrote:
> > On Wed, Jan 25, 2017 at 05:37:38PM +0000, James Morse wrote:
> >> On 24/01/17 08:49, AKASHI Takahiro wrote:
> >>> To protect the memory reserved for the crash dump kernel once it is loaded,
> >>> arch_kexec_protect_crashkres/unprotect_crashkres() are meant to deal with
> >>> the permissions of the corresponding kernel mappings.
> >>>
> >>> We also have to
> >>> - put the region in an isolated mapping, and
> >>> - move copying kexec's control_code_page to machine_kexec_prepare()
> >>> so that the region will be completely read-only after loading.
> >>
> >>> Note that the region must reside in the linear mapping and have corresponding
> >>> page structures in order to be potentially freed by shrinking it through
> >>> /sys/kernel/kexec_crash_size.
> >>
> >> Nasty! Presumably you have to build the crash region out of individual page
> >> mappings,
> > 
> > This might be an alternative, but
> > 
> >> so that they can be returned to the slab-allocator one page at a time,
> >> and still be able to set/clear the valid bits on the remaining chunk.
> >> (I don't see how that happens in this patch)
> > 
> > As far as the shrinking feature is concerned, I believe crash_shrink_memory(),
> > which eventually calls free_reserved_page(), will take care of all the things
> > to do. I can see an increased number of "MemFree" in /proc/meminfo.
> 
> Except for arch specific stuff like reformatting the page tables. Maybe I've
> overlooked the way out of this. What happens with this scenario:
> 
> We boot with crashkernel=1G on the command line.
> memblock_reserve() allocates a naturally aligned 1GB block of memory for the
> crash region.
> Your code in __map_memblock() calls __create_pgd_mapping() ->
> alloc_init_pud() which decides use_1G_block() looks like a good idea.
> 
> Some time later, the user decides to free half of this region;
> free_reserved_page() does its thing and half of those struct pages now belong
> to the memory allocator.
> 
> Now we load a kdump kernel, which causes arch_kexec_protect_crashkres() to be
> called for the 512MB region that was left.
> 
> create_mapping_late() needs to split the 1GB mapping it originally made into a
> smaller table, with the first half using PAGE_KERNEL_INVALID, and the second
> half using PAGE_KERNEL. It can't do break-before-make because these pages may
> be in use by another CPU, since we gave them back to the memory allocator. (In
> the worst possible world, that second half contains our stack!)


Yeah, this is a horrible case.
Now I understand why we should stick with the page_mappings_only option.

> Making this behave more like debug_pagealloc where the region is only built of
> page-size mappings should avoid this. The smallest change to what you have is to
> always pass page_mappings_only for the kdump region.
> 
> Ideally we just disable this resize feature for ARM64 and support it with some
> later kernel version, but I can't see a way of doing this without adding Kconfig
> symbols to other architectures.
> 
> > (Please note that the region is memblock_reserve()'d at boot time.)
> 
> And free_reserved_page() does nothing to update memblock, so
> memblock_is_reserved() says these pages are reserved, but in reality they
> are in use by the memory allocator. This doesn't feel right.


Just FYI, no other architectures take care of this issue.

(and I don't know whether the memblock's being reserved or not has
any impact after booting.)

> (Fortunately we can override crash_free_reserved_phys_range() so this can
>  probably be fixed)
> 
> >> This secretly-unmapped is the sort of thing that breaks hibernate, it blindly
> >> assumes pfn_valid() means it can access the page if it wants to. Setting
> >> PG_Reserved is a quick way to trick it out of doing this, but that would leave
> >> the crash kernel region un-initialised after resume, while kexec_crash_image
> >> still has a value.
> > 
> > Ouch, I didn't notice this issue.
> > 
> >> I think the best fix for this is to forbid hibernate if kexec_crash_loaded()
> >> arguing these are mutually-exclusive features, and the protect crash-dump
> >> feature exists to prevent things like hibernate corrupting the crash region.
> > 
> > This restriction is really painful.
> > Is there any hibernation hook that will be invoked before suspending and
> > after resuming? If so, arch_kexec_unprotect_crashkres()/protect_crashkres()
> > could be called there.
> 
> Those calls could go in swsusp_arch_suspend() in /arch/arm64/kernel/hibernate.c,


I will give it a try next week.

> but isn't this protect feature supposed to stop things like hibernate from
> meddling with the region?


It seems that kexec code never expects that the crash kernel memory
is actually unmapped (as my current patch does).
Moreover, whether kdump or not, it is quite fragile to unmap some part of the
linear mapping dynamically. I think we probably need to implement a kind of
"memory hotplug" in order to perform such an unmapping without affecting
other kernel components.

> (I haven't tested what hibernate does with the crash region as it's only just
> occurred to me)
> 
> I think to avoid holding kdump up we should disable any possible interaction,
> (forbid hibernate if a kdump kernel is loaded), and sort it out later!


There are several options:
- hibernate and kdump need to be exclusively configured, or
- once kdump is loaded, hibernate will fail, or
- after resuming from hibernate, kdump won't work

The simplest way is to force users to re-load kdump after resuming,
but it sounds somewhat weird.

> >>> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
> >>> index bc96c8a7fc79..f7938fecf3ff 100644
> >>> --- a/arch/arm64/kernel/machine_kexec.c
> >>> +++ b/arch/arm64/kernel/machine_kexec.c
> >>> @@ -159,32 +171,20 @@ void machine_kexec(struct kimage *kimage)
> 
> >>> -	/* Flush the kimage list and its buffers. */
> >>> -	kexec_list_flush(kimage);
> >>> +	if (kimage != kexec_crash_image) {
> >>> +		/* Flush the kimage list and its buffers. */
> >>> +		kexec_list_flush(kimage);
> >>>  
> >>> -	/* Flush the new image if already in place. */
> >>> -	if (kimage->head & IND_DONE)
> >>> -		kexec_segment_flush(kimage);
> >>> +		/* Flush the new image if already in place. */
> >>> +		if (kimage->head & IND_DONE)
> >>> +			kexec_segment_flush(kimage);
> >>> +	}
> >>
> >> So for kdump we cleaned the kimage->segment[i].mem regions in
> >> arch_kexec_protect_crashkres(), so we don't need to do it here.
> > 
> > Correct.
> > 
> >> What about the kimage->head[i] array of list entries that were cleaned by
> >> kexec_list_flush()? Now we don't clean that for kdump either, but we do pass it
> >> to arm64_relocate_new_kernel() at the end of this function:
> >>> cpu_soft_restart(1, reboot_code_buffer_phys, kimage->head, kimage_start, 0);
> > 
> > kimage->head holds a list of memory regions that overlap
> > between the primary kernel and the secondary kernel, but in the kdump case,
> > the whole memory is isolated and the list should be empty.
> 
> The asm code will still try to walk the list with MMU and caches turned off, so
> even its "I'm empty" values need cleaning to the PoC.
> 
> (it looks like the first value is passed by value, so we could try and be clever
> by testing for that DONE flag in the first value, but I don't think it's worth
> the effort)


Surely not.

-Takahiro AKASHI

> Thanks,
> 
> James


Mark Rutland Jan. 27, 2017, 6:56 p.m. UTC | #7
On Sat, Jan 28, 2017 at 02:15:16AM +0900, AKASHI Takahiro wrote:
> On Fri, Jan 27, 2017 at 11:19:32AM +0000, James Morse wrote:
> > Hi Akashi,
> >
> > On 26/01/17 11:28, AKASHI Takahiro wrote:
> > > On Wed, Jan 25, 2017 at 05:37:38PM +0000, James Morse wrote:
> > >> On 24/01/17 08:49, AKASHI Takahiro wrote:
> > >>> To protect the memory reserved for the crash dump kernel once it is loaded,
> > >>> arch_kexec_protect_crashkres/unprotect_crashkres() are meant to deal with
> > >>> the permissions of the corresponding kernel mappings.
> > >>>
> > >>> We also have to
> > >>> - put the region in an isolated mapping, and
> > >>> - move copying kexec's control_code_page to machine_kexec_prepare()
> > >>> so that the region will be completely read-only after loading.
> > >>
> > >>> Note that the region must reside in the linear mapping and have corresponding
> > >>> page structures in order to be potentially freed by shrinking it through
> > >>> /sys/kernel/kexec_crash_size.


Ah; I did not realise that this was a possibility.

> Now I understand why we should stick with the page_mappings_only option.

Likewise, I now agree.

Apologies for guiding you down the wrong path here.

Thanks,
Mark.

Mark Rutland Jan. 27, 2017, 7:41 p.m. UTC | #8
On Sat, Jan 28, 2017 at 12:42:20AM +0900, AKASHI Takahiro wrote:
> On Fri, Jan 27, 2017 at 01:59:05PM +0000, James Morse wrote:
> > On 24/01/17 08:49, AKASHI Takahiro wrote:
> > > +	/*
> > > +	 * While crash dump kernel memory is contained in a single memblock
> > > +	 * for now, it should appear in an isolated mapping so that we can
> > > +	 * independently unmap the region later.
> > > +	 */
> > > +	if (crashk_res.end && crashk_res.start >= start &&
> > > +	    crashk_res.end <= end) {
> > > +		if (crashk_res.start != start)
> > > +			__create_pgd_mapping(pgd, start, __phys_to_virt(start),
> > > +					     crashk_res.start - start,
> > > +					     PAGE_KERNEL,
> > > +					     early_pgtable_alloc,
> > > +					     debug_pagealloc_enabled());
> > > +
> > > +		/* before kexec_load(), the region can be read-writable. */
> > > +		__create_pgd_mapping(pgd, crashk_res.start,
> > > +				     __phys_to_virt(crashk_res.start),
> > > +				     crashk_res.end - crashk_res.start + 1,
> > > +				     PAGE_KERNEL, early_pgtable_alloc,
> > > +				     debug_pagealloc_enabled());
> > > +
> > > +		if (crashk_res.end != end)
> > > +			__create_pgd_mapping(pgd, crashk_res.end + 1,
> > > +					     __phys_to_virt(crashk_res.end + 1),
> > > +					     end - crashk_res.end - 1,
> > > +					     PAGE_KERNEL,
> > > +					     early_pgtable_alloc,
> > > +					     debug_pagealloc_enabled());
> > 
> > > +		return;
> > 
> > Doesn't this mean we skip all the 'does this overlap with the kernel text' tests
> > that happen further down in this file?
> 
> You're right. We should still check the overlap against
> [start..crashk_res.start] and [crashk_res.end+1..end].
> 
> (Using memblock_isolate_range() could simplify the code.)


My key concern here was that we handle both of these in the same way, so
using memblock_isolate_range() for both generally sounds fine to me.

One concern I had with using memblock_isolate_range() is that it does
not guarantee that the region remains isolated. So if there was a
subsequent memblock_add() call, memblock_merge_regions() might end up
merging the region with an adjacent region.

If we isolate the regions at the start of map_mem(), and have a comment
explaining that we must avoid subsequent memblock merging, then I think
this would be ok.
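
A sketch of that idea, assuming memblock_isolate_range() were made available
to arch code (it is currently static in mm/memblock.c):

	/*
	 * Isolate the crash kernel region into its own memblock regions at
	 * the start of map_mem().  Note: any later memblock_add() could
	 * merge adjacent regions again, so no memory may be added after
	 * this point.
	 */
	int start_rgn, end_rgn;

	if (crashk_res.end)
		memblock_isolate_range(&memblock.memory, crashk_res.start,
				       resource_size(&crashk_res),
				       &start_rgn, &end_rgn);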

Thanks,
Mark.

AKASHI Takahiro Jan. 30, 2017, 8:27 a.m. UTC | #9
James,

On Sat, Jan 28, 2017 at 02:15:16AM +0900, AKASHI Takahiro wrote:
> James,
> 
> On Fri, Jan 27, 2017 at 11:19:32AM +0000, James Morse wrote:
> > Hi Akashi,
> > 
> > On 26/01/17 11:28, AKASHI Takahiro wrote:
> > > On Wed, Jan 25, 2017 at 05:37:38PM +0000, James Morse wrote:
> > >> On 24/01/17 08:49, AKASHI Takahiro wrote:
> > >>> To protect the memory reserved for the crash dump kernel once it is loaded,
> > >>> arch_kexec_protect_crashkres/unprotect_crashkres() are meant to deal with
> > >>> the permissions of the corresponding kernel mappings.
> > >>>
> > >>> We also have to
> > >>> - put the region in an isolated mapping, and
> > >>> - move copying kexec's control_code_page to machine_kexec_prepare()
> > >>> so that the region will be completely read-only after loading.
> > >>
> > >>> Note that the region must reside in the linear mapping and have corresponding
> > >>> page structures in order to be potentially freed by shrinking it through
> > >>> /sys/kernel/kexec_crash_size.
> > >>
> > >> Nasty! Presumably you have to build the crash region out of individual page
> > >> mappings,
> > > 
> > > This might be an alternative, but
> > > 
> > >> so that they can be returned to the slab-allocator one page at a time,
> > >> and still be able to set/clear the valid bits on the remaining chunk.
> > >> (I don't see how that happens in this patch)
> > > 
> > > As far as the shrinking feature is concerned, I believe crash_shrink_memory(),
> > > which eventually calls free_reserved_page(), will take care of all the things
> > > to do. I can see an increased number of "MemFree" in /proc/meminfo.
> > 
> > Except for arch specific stuff like reformatting the page tables. Maybe I've
> > overlooked the way out of this. What happens with this scenario:
> > 
> > We boot with crashkernel=1G on the command line.
> > memblock_reserve() allocates a naturally aligned 1GB block of memory for the
> > crash region.
> > Your code in __map_memblock() calls __create_pgd_mapping() ->
> > alloc_init_pud() which decides use_1G_block() looks like a good idea.
> > 
> > Some time later, the user decides to free half of this region;
> > free_reserved_page() does its thing and half of those struct pages now belong
> > to the memory allocator.
> > 
> > Now we load a kdump kernel, which causes arch_kexec_protect_crashkres() to be
> > called for the 512MB region that was left.
> > 
> > create_mapping_late() needs to split the 1GB mapping it originally made into a
> > smaller table, with the first half using PAGE_KERNEL_INVALID, and the second
> > half using PAGE_KERNEL. It can't do break-before-make because these pages may
> > be in use by another CPU, since we gave them back to the memory allocator. (In
> > the worst possible world, that second half contains our stack!)
> 
> Yeah, this is a horrible case.
> Now I understand why we should stick with the page_mappings_only option.
> 
> > Making this behave more like debug_pagealloc where the region is only built of
> > page-size mappings should avoid this. The smallest change to what you have is to
> > always pass page_mappings_only for the kdump region.
> > 
> > Ideally we just disable this resize feature for ARM64 and support it with some
> > later kernel version, but I can't see a way of doing this without adding Kconfig
> > symbols to other architectures.
> > 
> > > (Please note that the region is memblock_reserve()'d at boot time.)
> > 
> > And free_reserved_page() does nothing to update memblock, so
> > memblock_is_reserved() says these pages are reserved, but in reality they
> > are in use by the memory allocator. This doesn't feel right.
> 
> Just FYI, no other architectures take care of this issue.
> 
> (and I don't know whether the memblock's being reserved or not has
> any impact after booting.)
> 
> > (Fortunately we can override crash_free_reserved_phys_range() so this can
> >  probably be fixed)
> > 
> > >> This secretly-unmapped is the sort of thing that breaks hibernate, it blindly
> > >> assumes pfn_valid() means it can access the page if it wants to. Setting
> > >> PG_Reserved is a quick way to trick it out of doing this, but that would leave
> > >> the crash kernel region un-initialised after resume, while kexec_crash_image
> > >> still has a value.
> > > 
> > > Ouch, I didn't notice this issue.
> > > 
> > >> I think the best fix for this is to forbid hibernate if kexec_crash_loaded()
> > >> arguing these are mutually-exclusive features, and the protect crash-dump
> > >> feature exists to prevent things like hibernate corrupting the crash region.
> > > 
> > > This restriction is really painful.
> > > Is there any hibernation hook that will be invoked before suspending and
> > > after resuming? If so, arch_kexec_unprotect_crashkres()/protect_crashkres()
> > > could be called there.
> > 
> > Those calls could go in swsusp_arch_suspend() in /arch/arm64/kernel/hibernate.c,
> 
> I will give it a try next week.


I took the following test scenario:
 - load the crash dump kernel
 - echo shutdown > /sys/power/disk
 - echo disk > /sys/power/state
 - power off and on the board
 - reboot with resume=...
 - after hibernate is done, echo c > /proc/sysrq-trigger
 - after reboot, check /proc/vmcore

and everything appears to work fine.

If you think I'd better try more test cases, please let me know.

Thanks,
-Takahiro AKASHI

===8<===
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index fe301cbcb442..111a849333ee 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -16,6 +16,7 @@
  */
 #define pr_fmt(x) "hibernate: " x
 #include <linux/cpu.h>
+#include <linux/kexec.h>
 #include <linux/kvm_host.h>
 #include <linux/mm.h>
 #include <linux/pm.h>
@@ -289,6 +290,12 @@ int swsusp_arch_suspend(void)
 	local_dbg_save(flags);
 
 	if (__cpu_suspend_enter(&state)) {
+#ifdef CONFIG_KEXEC_CORE
+		/* make the crash dump kernel region mapped */
+		if (kexec_crash_image)
+			arch_kexec_unprotect_crashkres();
+#endif
+
 		sleep_cpu = smp_processor_id();
 		ret = swsusp_save();
 	} else {
@@ -300,6 +307,12 @@ int swsusp_arch_suspend(void)
 		if (el2_reset_needed())
 			dcache_clean_range(__hyp_idmap_text_start, __hyp_idmap_text_end);
 
+#ifdef CONFIG_KEXEC_CORE
+		/* make the crash dump kernel region unmapped */
+		if (kexec_crash_image)
+			arch_kexec_protect_crashkres();
+#endif
+
 		/*
 		 * Tell the hibernation core that we've just restored
 		 * the memory

AKASHI Takahiro Jan. 30, 2017, 8:42 a.m. UTC | #10
Mark,

On Fri, Jan 27, 2017 at 06:56:13PM +0000, Mark Rutland wrote:
> On Sat, Jan 28, 2017 at 02:15:16AM +0900, AKASHI Takahiro wrote:
> > On Fri, Jan 27, 2017 at 11:19:32AM +0000, James Morse wrote:
> > > Hi Akashi,
> > >
> > > On 26/01/17 11:28, AKASHI Takahiro wrote:
> > > > On Wed, Jan 25, 2017 at 05:37:38PM +0000, James Morse wrote:
> > > >> On 24/01/17 08:49, AKASHI Takahiro wrote:
> > > >>> To protect the memory reserved for the crash dump kernel once it is loaded,
> > > >>> arch_kexec_protect_crashkres/unprotect_crashkres() are meant to deal with
> > > >>> the permissions of the corresponding kernel mappings.
> > > >>>
> > > >>> We also have to
> > > >>> - put the region in an isolated mapping, and
> > > >>> - move copying kexec's control_code_page to machine_kexec_prepare()
> > > >>> so that the region will be completely read-only after loading.
> > > >>
> > > >>> Note that the region must reside in the linear mapping and have corresponding
> > > >>> page structures in order to be potentially freed by shrinking it through
> > > >>> /sys/kernel/kexec_crash_size.
> 
> Ah; I did not realise that this was a possibility.
> 
> > Now I understand why we should stick with the page_mappings_only option.
> 
> Likewise, I now agree.
> 
> Apologies for guiding you down the wrong path here.


Your comments are always welcome.

Anyhow, I think we'd better have a dedicated function for unmapping.
Can you please take a look at the following hack?

(We need to use this function carefully outside of kdump usage since
it doesn't care whether the region to be unmapped is used somewhere else.)

Thanks,
-Takahiro AKASHI

===>8===


diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index 2142c7726e76..945d84cd5df7 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -54,6 +54,7 @@
 #define PAGE_KERNEL_ROX		__pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_RDONLY)
 #define PAGE_KERNEL_EXEC	__pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_WRITE)
 #define PAGE_KERNEL_EXEC_CONT	__pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_WRITE | PTE_CONT)
+#define PAGE_KERNEL_INVALID	__pgprot(0)
 
 #define PAGE_HYP		__pgprot(_PAGE_DEFAULT | PTE_HYP | PTE_HYP_XN)
 #define PAGE_HYP_EXEC		__pgprot(_PAGE_DEFAULT | PTE_HYP | PTE_RDONLY)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 17243e43184e..81173b594195 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -307,6 +307,101 @@ static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
 	} while (pgd++, addr = next, addr != end);
 }
 
+static void free_pte(pmd_t *pmd, unsigned long addr, unsigned long end,
+			bool dealloc_table)
+{
+	pte_t *pte;
+	bool do_free = (dealloc_table && ((end - addr) == PMD_SIZE));
+
+	BUG_ON(pmd_none(*pmd) || pmd_bad(*pmd));
+
+	pte = pte_set_fixmap_offset(pmd, addr);
+	do {
+		pte_clear(NULL, NULL, pte);
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+
+	pte_clear_fixmap();
+
+	if (do_free) {
+		__free_page(pmd_page(*pmd));
+		pmd_clear(pmd);
+	}
+}
+
+static void free_pmd(pud_t *pud, unsigned long addr, unsigned long end,
+			bool dealloc_table)
+{
+	pmd_t *pmd;
+	unsigned long next;
+	bool do_free = (dealloc_table && ((end - addr) == PUD_SIZE));
+
+	BUG_ON(pud_none(*pud) || pud_bad(*pud));
+
+	pmd = pmd_set_fixmap_offset(pud, addr);
+
+	do {
+		next = pmd_addr_end(addr, end);
+
+		if (pmd_table(*pmd)) {
+			free_pte(pmd, addr, next, dealloc_table);
+		} else {
+			pmd_clear(pmd);
+		}
+	} while (pmd++, addr = next, addr != end);
+
+	pmd_clear_fixmap();
+
+	if (do_free) {
+		__free_page(pud_page(*pud));
+		pud_clear(pud);
+	}
+}
+
+static void free_pud(pgd_t *pgd, unsigned long addr, unsigned long end,
+			bool dealloc_table)
+{
+	pud_t *pud;
+	unsigned long next;
+	bool do_free = (dealloc_table && ((end - addr) == PGDIR_SIZE));
+
+	BUG_ON(pgd_none(*pgd) || pgd_bad(*pgd));
+
+	pud = pud_set_fixmap_offset(pgd, addr);
+
+	do {
+		next = pud_addr_end(addr, end);
+
+		if (pud_table(*pud)) {
+			free_pmd(pud, addr, next, dealloc_table);
+		} else {
+			pud_clear(pud);
+		}
+	} while (pud++, addr = next, addr != end);
+
+	pud_clear_fixmap();
+
+	if (do_free) {
+		__free_page(pgd_page(*pgd));
+		pgd_clear(pgd);
+	}
+}
+
+static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long virt,
+				 phys_addr_t size, bool dealloc_table)
+{
+	unsigned long addr, length, end, next;
+	pgd_t *pgd = pgd_offset_raw(pgdir, virt);
+
+	addr = virt & PAGE_MASK;
+	length = PAGE_ALIGN(size + (virt & ~PAGE_MASK));
+
+	end = addr + length;
+	do {
+		next = pgd_addr_end(addr, end);
+		free_pud(pgd, addr, next, dealloc_table);
+	} while (pgd++, addr = next, addr != end);
+}
+
 static phys_addr_t pgd_pgtable_alloc(void)
 {
 	void *ptr = (void *)__get_free_page(PGALLOC_GFP);
@@ -334,14 +429,15 @@ static void __init create_mapping_noalloc(phys_addr_t phys, unsigned long virt,
 	__create_pgd_mapping(init_mm.pgd, phys, virt, size, prot, NULL, false);
 }
 
-void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
-			       unsigned long virt, phys_addr_t size,
-			       pgprot_t prot, bool page_mappings_only)
+void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
+			unsigned long virt, phys_addr_t size,
+			pgprot_t prot, bool page_mappings_only)
 {
-	BUG_ON(mm == &init_mm);
-
-	__create_pgd_mapping(mm->pgd, phys, virt, size, prot,
-			     pgd_pgtable_alloc, page_mappings_only);
+	if (pgprot_val(prot))
+		__create_pgd_mapping(mm->pgd, phys, virt, size, prot,
+				     pgd_pgtable_alloc, page_mappings_only);
+	else
+		__remove_pgd_mapping(mm->pgd, virt, size, true);
 }
 
 static void create_mapping_late(phys_addr_t phys, unsigned long virt,


Patch

diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index bc96c8a7fc79..f7938fecf3ff 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -14,6 +14,7 @@ 
 
 #include <asm/cacheflush.h>
 #include <asm/cpu_ops.h>
+#include <asm/mmu.h>
 #include <asm/mmu_context.h>
 
 #include "cpu-reset.h"
@@ -22,8 +23,6 @@ 
 extern const unsigned char arm64_relocate_new_kernel[];
 extern const unsigned long arm64_relocate_new_kernel_size;
 
-static unsigned long kimage_start;
-
 /**
  * kexec_image_info - For debugging output.
  */
@@ -64,7 +63,7 @@  void machine_kexec_cleanup(struct kimage *kimage)
  */
 int machine_kexec_prepare(struct kimage *kimage)
 {
-	kimage_start = kimage->start;
+	void *reboot_code_buffer;
 
 	kexec_image_info(kimage);
 
@@ -73,6 +72,21 @@  int machine_kexec_prepare(struct kimage *kimage)
 		return -EBUSY;
 	}
 
+	reboot_code_buffer =
+			phys_to_virt(page_to_phys(kimage->control_code_page));
+
+	/*
+	 * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
+	 * after the kernel is shut down.
+	 */
+	memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
+		arm64_relocate_new_kernel_size);
+
+	/* Flush the reboot_code_buffer in preparation for its execution. */
+	__flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
+	flush_icache_range((uintptr_t)reboot_code_buffer,
+		arm64_relocate_new_kernel_size);
+
 	return 0;
 }
 
@@ -143,7 +157,6 @@  static void kexec_segment_flush(const struct kimage *kimage)
 void machine_kexec(struct kimage *kimage)
 {
 	phys_addr_t reboot_code_buffer_phys;
-	void *reboot_code_buffer;
 
 	/*
 	 * New cpus may have become stuck_in_kernel after we loaded the image.
@@ -151,7 +164,6 @@  void machine_kexec(struct kimage *kimage)
 	BUG_ON(cpus_are_stuck_in_kernel() || (num_online_cpus() > 1));
 
 	reboot_code_buffer_phys = page_to_phys(kimage->control_code_page);
-	reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys);
 
 	kexec_image_info(kimage);
 
@@ -159,32 +171,20 @@  void machine_kexec(struct kimage *kimage)
 		kimage->control_code_page);
 	pr_debug("%s:%d: reboot_code_buffer_phys:  %pa\n", __func__, __LINE__,
 		&reboot_code_buffer_phys);
-	pr_debug("%s:%d: reboot_code_buffer:       %p\n", __func__, __LINE__,
-		reboot_code_buffer);
 	pr_debug("%s:%d: relocate_new_kernel:      %p\n", __func__, __LINE__,
 		arm64_relocate_new_kernel);
 	pr_debug("%s:%d: relocate_new_kernel_size: 0x%lx(%lu) bytes\n",
 		__func__, __LINE__, arm64_relocate_new_kernel_size,
 		arm64_relocate_new_kernel_size);
 
-	/*
-	 * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
-	 * after the kernel is shut down.
-	 */
-	memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
-		arm64_relocate_new_kernel_size);
-
-	/* Flush the reboot_code_buffer in preparation for its execution. */
-	__flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
-	flush_icache_range((uintptr_t)reboot_code_buffer,
-		arm64_relocate_new_kernel_size);
-
-	/* Flush the kimage list and its buffers. */
-	kexec_list_flush(kimage);
+	if (kimage != kexec_crash_image) {
+		/* Flush the kimage list and its buffers. */
+		kexec_list_flush(kimage);
 
-	/* Flush the new image if already in place. */
-	if (kimage->head & IND_DONE)
-		kexec_segment_flush(kimage);
+		/* Flush the new image if already in place. */
+		if (kimage->head & IND_DONE)
+			kexec_segment_flush(kimage);
+	}
 
 	pr_info("Bye!\n");
 
@@ -201,7 +201,7 @@  void machine_kexec(struct kimage *kimage)
 	 */
 
 	cpu_soft_restart(1, reboot_code_buffer_phys, kimage->head,
-		kimage_start, 0);
+		kimage->start, 0);
 
 	BUG(); /* Should never get here. */
 }
@@ -210,3 +210,21 @@  void machine_crash_shutdown(struct pt_regs *regs)
 {
 	/* Empty routine needed to avoid build errors. */
 }
+
+void arch_kexec_protect_crashkres(void)
+{
+	kexec_segment_flush(kexec_crash_image);
+
+	create_mapping_late(crashk_res.start, __phys_to_virt(crashk_res.start),
+			resource_size(&crashk_res), PAGE_KERNEL_INVALID);
+
+	flush_tlb_all();
+}
+
+void arch_kexec_unprotect_crashkres(void)
+{
+	create_mapping_late(crashk_res.start, __phys_to_virt(crashk_res.start),
+			resource_size(&crashk_res), PAGE_KERNEL);
+
+	flush_tlb_all();
+}
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 9c7adcce8e4e..2d4a0b68a852 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -22,6 +22,7 @@ 
 #include <linux/kernel.h>
 #include <linux/errno.h>
 #include <linux/init.h>
+#include <linux/kexec.h>
 #include <linux/libfdt.h>
 #include <linux/mman.h>
 #include <linux/nodemask.h>
@@ -367,6 +368,39 @@  static void __init __map_memblock(pgd_t *pgd, phys_addr_t start, phys_addr_t end
 	unsigned long kernel_start = __pa(_text);
 	unsigned long kernel_end = __pa(__init_begin);
 
+#ifdef CONFIG_KEXEC_CORE
+	/*
+	 * While crash dump kernel memory is contained in a single memblock
+	 * for now, it should appear in an isolated mapping so that we can
+	 * independently unmap the region later.
+	 */
+	if (crashk_res.end && crashk_res.start >= start &&
+	    crashk_res.end <= end) {
+		if (crashk_res.start != start)
+			__create_pgd_mapping(pgd, start, __phys_to_virt(start),
+					     crashk_res.start - start,
+					     PAGE_KERNEL,
+					     early_pgtable_alloc,
+					     debug_pagealloc_enabled());
+
+		/* before kexec_load(), the region can be read-writable. */
+		__create_pgd_mapping(pgd, crashk_res.start,
+				     __phys_to_virt(crashk_res.start),
+				     crashk_res.end - crashk_res.start + 1,
+				     PAGE_KERNEL, early_pgtable_alloc,
+				     debug_pagealloc_enabled());
+
+		if (crashk_res.end != end)
+			__create_pgd_mapping(pgd, crashk_res.end + 1,
+					     __phys_to_virt(crashk_res.end + 1),
+					     end - crashk_res.end - 1,
+					     PAGE_KERNEL,
+					     early_pgtable_alloc,
+					     debug_pagealloc_enabled());
+		return;
+	}
+#endif
+
 	/*
 	 * Take care not to create a writable alias for the
 	 * read-only text and rodata sections of the kernel image.