[v2,11/13] arm64: allow kernel Image to be loaded anywhere in physical memory

Message ID 1451489172-17420-12-git-send-email-ard.biesheuvel@linaro.org
State New

Commit Message

Ard Biesheuvel Dec. 30, 2015, 3:26 p.m. UTC
This relaxes the kernel Image placement requirements, so that it
may be placed at any 2 MB aligned offset in physical memory.

This is accomplished by ignoring PHYS_OFFSET when installing
memblocks, and accounting for the apparent virtual offset of
the kernel Image. As a result, virtual address references
below PAGE_OFFSET are correctly mapped onto physical references
into the kernel Image regardless of where it sits in memory.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

---
 Documentation/arm64/booting.txt  | 12 ++---
 arch/arm64/include/asm/boot.h    |  5 ++
 arch/arm64/include/asm/kvm_mmu.h |  2 +-
 arch/arm64/include/asm/memory.h  | 15 +++---
 arch/arm64/kernel/head.S         |  6 ++-
 arch/arm64/mm/init.c             | 50 +++++++++++++++++++-
 arch/arm64/mm/mmu.c              | 12 +++++
 7 files changed, 86 insertions(+), 16 deletions(-)

-- 
2.5.0
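
For illustration, the two translation paths described in the commit message can be sketched in user space as follows. This is only a sketch under stated assumptions: the PAGE_OFFSET and KIMAGE_VADDR values are made up for the example (the real ones depend on the kernel configuration), and the helper record_kimage_voffset() is a hypothetical stand-in for the head.S hunk in this patch. Only the arithmetic follows the patched __virt_to_phys() and __phys_to_kimg().

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative values only; the real ones depend on the kernel config. */
#define PAGE_OFFSET	0xffffffc000000000ULL	/* start of the linear mapping (assumed) */
#define KIMAGE_VADDR	0xffffff8008000000ULL	/* link-time image base, below PAGE_OFFSET (assumed) */
#define MIN_KIMG_ALIGN	0x200000ULL		/* SZ_2M, as in the patch */

static uint64_t memstart_addr;	/* PHYS_OFFSET: physical base of the linear mapping */
static uint64_t kimage_voffset;	/* image virtual address minus image physical address */

/* Stand-in for the head.S hunk: voffset = KIMAGE_VADDR - physical load base. */
static void record_kimage_voffset(uint64_t kimage_phys_base)
{
	assert((kimage_phys_base & (MIN_KIMG_ALIGN - 1)) == 0);
	kimage_voffset = KIMAGE_VADDR - kimage_phys_base;
}

/*
 * Addresses at or above PAGE_OFFSET belong to the linear mapping and are
 * translated via PHYS_OFFSET; anything below is assumed to be a kernel
 * image address and is translated via kimage_voffset.
 */
static uint64_t virt_to_phys(uint64_t va)
{
	return va >= PAGE_OFFSET ? va - PAGE_OFFSET + memstart_addr
				 : va - kimage_voffset;
}

static uint64_t phys_to_kimg(uint64_t pa)
{
	return pa + kimage_voffset;
}
```

With memstart_addr set to 0x80000000 and the image loaded at 0x900000000, an image address KIMAGE_VADDR + 0x1000 resolves to 0x900001000 no matter where the linear mapping starts, which is the property the patch relies on.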


Comments

Mark Rutland Jan. 8, 2016, 11:26 a.m. UTC | #1
Hi,

On Wed, Dec 30, 2015 at 04:26:10PM +0100, Ard Biesheuvel wrote:
> This relaxes the kernel Image placement requirements, so that it
> may be placed at any 2 MB aligned offset in physical memory.
> 
> This is accomplished by ignoring PHYS_OFFSET when installing
> memblocks, and accounting for the apparent virtual offset of
> the kernel Image. As a result, virtual address references
> below PAGE_OFFSET are correctly mapped onto physical references
> into the kernel Image regardless of where it sits in memory.
> 
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>  Documentation/arm64/booting.txt  | 12 ++---
>  arch/arm64/include/asm/boot.h    |  5 ++
>  arch/arm64/include/asm/kvm_mmu.h |  2 +-
>  arch/arm64/include/asm/memory.h  | 15 +++---
>  arch/arm64/kernel/head.S         |  6 ++-
>  arch/arm64/mm/init.c             | 50 +++++++++++++++++++-
>  arch/arm64/mm/mmu.c              | 12 +++++
>  7 files changed, 86 insertions(+), 16 deletions(-)
> 
> diff --git a/Documentation/arm64/booting.txt b/Documentation/arm64/booting.txt
> index 701d39d3171a..03e02ebc1b0c 100644
> --- a/Documentation/arm64/booting.txt
> +++ b/Documentation/arm64/booting.txt
> @@ -117,14 +117,14 @@ Header notes:
>    depending on selected features, and is effectively unbound.
>  
>  The Image must be placed text_offset bytes from a 2MB aligned base
> -address near the start of usable system RAM and called there. Memory
> -below that base address is currently unusable by Linux, and therefore it
> -is strongly recommended that this location is the start of system RAM.
> -The region between the 2 MB aligned base address and the start of the
> -image has no special significance to the kernel, and may be used for
> -other purposes.
> +address anywhere in usable system RAM and called there. The region
> +between the 2 MB aligned base address and the start of the image has no
> +special significance to the kernel, and may be used for other purposes.
>  At least image_size bytes from the start of the image must be free for
>  use by the kernel.
> +NOTE: versions prior to v4.6 cannot make use of memory below the
> +physical offset of the Image so it is recommended that the Image be
> +placed as close as possible to the start of system RAM.

We need a head flag for this so that a bootloader can determine whether
it can load the kernel anywhere or should try for the lowest possible
address. Then the note would describe the recommended behaviour in the
absence of the flag.

The flag for KASLR isn't sufficient as you can build without it (and it
only tells the bootloader that the kernel accepts entropy in x1).

We might also want to consider if we need to determine whether or not
the bootloader actually provided entropy (and if we need a more general
handshake between the bootloader and kernel to determine that kind of
thing).

>  Any memory described to the kernel (even that below the start of the
>  image) which is not marked as reserved from the kernel (e.g., with a
> diff --git a/arch/arm64/include/asm/boot.h b/arch/arm64/include/asm/boot.h
> index 81151b67b26b..984cb0fa61ce 100644
> --- a/arch/arm64/include/asm/boot.h
> +++ b/arch/arm64/include/asm/boot.h
> @@ -11,4 +11,9 @@
>  #define MIN_FDT_ALIGN		8
>  #define MAX_FDT_SIZE		SZ_2M
>  
> +/*
> + * arm64 requires the kernel image to be 2 MB aligned

Nit: The image is TEXT_OFFSET from that 2M-aligned base.
s/image/mapping/? 

[...]

> +static void __init enforce_memory_limit(void)
> +{
> +	const phys_addr_t kbase = round_down(__pa(_text), MIN_KIMG_ALIGN);
> +	u64 to_remove = memblock_phys_mem_size() - memory_limit;
> +	phys_addr_t max_addr = 0;
> +	struct memblock_region *r;
> +
> +	if (memory_limit == (phys_addr_t)ULLONG_MAX)
> +		return;
> +
> +	/*
> +	 * The kernel may be high up in physical memory, so try to apply the
> +	 * limit below the kernel first, and only let the generic handling
> +	 * take over if it turns out we haven't clipped enough memory yet.
> +	 */

We might want to preserve the low 4GB if possible, for those IOMMU-less
devices which can only do 32-bit addressing.

Otherwise this looks good to me!

Thanks,
Mark.
Ard Biesheuvel Jan. 8, 2016, 11:34 a.m. UTC | #2
On 8 January 2016 at 12:26, Mark Rutland <mark.rutland@arm.com> wrote:
> Hi,
>
> On Wed, Dec 30, 2015 at 04:26:10PM +0100, Ard Biesheuvel wrote:
>> This relaxes the kernel Image placement requirements, so that it
>> may be placed at any 2 MB aligned offset in physical memory.
>>
>> This is accomplished by ignoring PHYS_OFFSET when installing
>> memblocks, and accounting for the apparent virtual offset of
>> the kernel Image. As a result, virtual address references
>> below PAGE_OFFSET are correctly mapped onto physical references
>> into the kernel Image regardless of where it sits in memory.
>>
>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>> ---
>>  Documentation/arm64/booting.txt  | 12 ++---
>>  arch/arm64/include/asm/boot.h    |  5 ++
>>  arch/arm64/include/asm/kvm_mmu.h |  2 +-
>>  arch/arm64/include/asm/memory.h  | 15 +++---
>>  arch/arm64/kernel/head.S         |  6 ++-
>>  arch/arm64/mm/init.c             | 50 +++++++++++++++++++-
>>  arch/arm64/mm/mmu.c              | 12 +++++
>>  7 files changed, 86 insertions(+), 16 deletions(-)
>>
>> diff --git a/Documentation/arm64/booting.txt b/Documentation/arm64/booting.txt
>> index 701d39d3171a..03e02ebc1b0c 100644
>> --- a/Documentation/arm64/booting.txt
>> +++ b/Documentation/arm64/booting.txt
>> @@ -117,14 +117,14 @@ Header notes:
>>    depending on selected features, and is effectively unbound.
>>
>>  The Image must be placed text_offset bytes from a 2MB aligned base
>> -address near the start of usable system RAM and called there. Memory
>> -below that base address is currently unusable by Linux, and therefore it
>> -is strongly recommended that this location is the start of system RAM.
>> -The region between the 2 MB aligned base address and the start of the
>> -image has no special significance to the kernel, and may be used for
>> -other purposes.
>> +address anywhere in usable system RAM and called there. The region
>> +between the 2 MB aligned base address and the start of the image has no
>> +special significance to the kernel, and may be used for other purposes.
>>  At least image_size bytes from the start of the image must be free for
>>  use by the kernel.
>> +NOTE: versions prior to v4.6 cannot make use of memory below the
>> +physical offset of the Image so it is recommended that the Image be
>> +placed as close as possible to the start of system RAM.
>
> We need a head flag for this so that a bootloader can determine whether
> it can load the kernel anywhere or should try for the lowest possible
> address. Then the note would describe the recommended behaviour in the
> absence of the flag.
>
> The flag for KASLR isn't sufficient as you can build without it (and it
> only tells the bootloader that the kernel accepts entropy in x1).
>

Indeed, I will change that.

> We might also want to consider if we need to determine whether or not
> the bootloader actually provided entropy, (and if we need a more general
> handshake between the bootloader and kernel to determine that kind of
> thing).
>

Yes, that is interesting. We should also think about how to handle
'nokaslr' if it appears on the command line, since in the !EFI case,
we will be way too late to parse this, and a capable kernel will
already be running from a randomized offset. That means it is the
bootloader's responsibility to ensure that the presence of 'nokaslr'
and the entropy in x1 are consistent with each other.

>>  Any memory described to the kernel (even that below the start of the
>>  image) which is not marked as reserved from the kernel (e.g., with a
>> diff --git a/arch/arm64/include/asm/boot.h b/arch/arm64/include/asm/boot.h
>> index 81151b67b26b..984cb0fa61ce 100644
>> --- a/arch/arm64/include/asm/boot.h
>> +++ b/arch/arm64/include/asm/boot.h
>> @@ -11,4 +11,9 @@
>>  #define MIN_FDT_ALIGN                8
>>  #define MAX_FDT_SIZE         SZ_2M
>>
>> +/*
>> + * arm64 requires the kernel image to be 2 MB aligned
>
> Nit: The image is TEXT_OFFSET from that 2M-aligned base.
> s/image/mapping/?
>
> [...]
>

Yep. I hate TEXT_OFFSET, did I mention that?

>> +static void __init enforce_memory_limit(void)
>> +{
>> +     const phys_addr_t kbase = round_down(__pa(_text), MIN_KIMG_ALIGN);
>> +     u64 to_remove = memblock_phys_mem_size() - memory_limit;
>> +     phys_addr_t max_addr = 0;
>> +     struct memblock_region *r;
>> +
>> +     if (memory_limit == (phys_addr_t)ULLONG_MAX)
>> +             return;
>> +
>> +     /*
>> +      * The kernel may be high up in physical memory, so try to apply the
>> +      * limit below the kernel first, and only let the generic handling
>> +      * take over if it turns out we haven't clipped enough memory yet.
>> +      */
>
> We might want to preserve the low 4GB if possible, for those IOMMU-less
> devices which can only do 32-bit addressing.
>
> Otherwise this looks good to me!
>

Thanks,
Ard.
Mark Rutland Jan. 8, 2016, 11:43 a.m. UTC | #3
On Fri, Jan 08, 2016 at 12:34:18PM +0100, Ard Biesheuvel wrote:
> On 8 January 2016 at 12:26, Mark Rutland <mark.rutland@arm.com> wrote:
> > We might also want to consider if we need to determine whether or not
> > the bootloader actually provided entropy, (and if we need a more general
> > handshake between the bootloader and kernel to determine that kind of
> > thing).
> 
> Yes, that is interesting. We should also think about how to handle
> 'nokaslr' if it appears on the command line, since in the !EFI case,
> we will be way too late to parse this, and a capable kernel will
> already be running from a randomized offset. That means it is the
> bootloader's responsibility to ensure that the presence of 'nokaslr'
> and the entropy in x1 are consistent with each other.

Argh, I hadn't considered that. :(

In the absence of a pre-kernel environment, the best thing we can do is
probably to print a giant warning if 'nokaslr' is present but there was
entropy (where that's determined based on some handshake/magic/flag).

> >>  Any memory described to the kernel (even that below the start of the
> >>  image) which is not marked as reserved from the kernel (e.g., with a
> >> diff --git a/arch/arm64/include/asm/boot.h b/arch/arm64/include/asm/boot.h
> >> index 81151b67b26b..984cb0fa61ce 100644
> >> --- a/arch/arm64/include/asm/boot.h
> >> +++ b/arch/arm64/include/asm/boot.h
> >> @@ -11,4 +11,9 @@
> >>  #define MIN_FDT_ALIGN                8
> >>  #define MAX_FDT_SIZE         SZ_2M
> >>
> >> +/*
> >> + * arm64 requires the kernel image to be 2 MB aligned
> >
> > Nit: The image is TEXT_OFFSET from that 2M-aligned base.
> > s/image/mapping/?
> >
> > [...]
> >
> 
> Yep. I hate TEXT_OFFSET, did I mention that?

I would also love to remove it, but I believe it's simply too late. :(

Thanks,
Mark.
Catalin Marinas Jan. 8, 2016, 3:27 p.m. UTC | #4
On Wed, Dec 30, 2015 at 04:26:10PM +0100, Ard Biesheuvel wrote:
> +static void __init enforce_memory_limit(void)
> +{
> +	const phys_addr_t kbase = round_down(__pa(_text), MIN_KIMG_ALIGN);
> +	u64 to_remove = memblock_phys_mem_size() - memory_limit;
> +	phys_addr_t max_addr = 0;
> +	struct memblock_region *r;
> +
> +	if (memory_limit == (phys_addr_t)ULLONG_MAX)
> +		return;
> +
> +	/*
> +	 * The kernel may be high up in physical memory, so try to apply the
> +	 * limit below the kernel first, and only let the generic handling
> +	 * take over if it turns out we haven't clipped enough memory yet.
> +	 */
> +	for_each_memblock(memory, r) {
> +		if (r->base + r->size > kbase) {
> +			u64 rem = min(to_remove, kbase - r->base);
> +
> +			max_addr = r->base + rem;
> +			to_remove -= rem;
> +			break;
> +		}
> +		if (to_remove <= r->size) {
> +			max_addr = r->base + to_remove;
> +			to_remove = 0;
> +			break;
> +		}
> +		to_remove -= r->size;
> +	}
> +
> +	memblock_remove(0, max_addr);
> +
> +	if (to_remove)
> +		memblock_enforce_memory_limit(memory_limit);
> +}

IIUC, this is changing the user expectations a bit. There are people
using the mem= limit to hijack some top of the RAM for other needs
(though they could do it in a saner way like changing the DT memory
nodes). Your patch first tries to remove the memory below the kernel
image and only remove the top if additional limitation is necessary.

Can you not remove memory from the top and block the limit if it goes
below the end of the kernel image, with some warning that memory limit
was not entirely fulfilled?
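
To make the suggestion concrete, here is a rough user-space sketch of clipping from the top only and stopping at the end of the kernel image. It is not kernel code: struct region, clip_from_top() and its parameters are hypothetical names standing in for the memblock region list, and the caller is assumed to print the warning when the return value is non-zero.

```c
#include <stdint.h>

/* Toy stand-in for one memblock memory region. */
struct region {
	uint64_t base;
	uint64_t size;
};

/*
 * Trim r[0..*n) (sorted by base, non-overlapping) from the top until only
 * 'limit' bytes remain, but never cut below kernel_end. Returns the number
 * of bytes that could NOT be removed, so a non-zero result means the mem=
 * limit was only partially honoured.
 */
static uint64_t clip_from_top(struct region *r, int *n, uint64_t limit,
			      uint64_t kernel_end)
{
	uint64_t total = 0, to_remove;

	for (int i = 0; i < *n; i++)
		total += r[i].size;
	if (limit >= total)
		return 0;	/* nothing to clip */
	to_remove = total - limit;

	while (to_remove && *n > 0) {
		struct region *top = &r[*n - 1];
		uint64_t end = top->base + top->size;
		uint64_t floor = top->base > kernel_end ? top->base : kernel_end;
		uint64_t avail = end > floor ? end - floor : 0;
		uint64_t cut = avail < to_remove ? avail : to_remove;

		top->size -= cut;
		to_remove -= cut;
		if (top->size)
			break;	/* limit met, or we ran into the kernel image */
		(*n)--;		/* region fully removed, look at the next one down */
	}
	return to_remove;
}
```

For example, with RAM at 2 GB..4 GB and 4 GB..8 GB, the image ending at 0x182000000 and a 3 GB limit, only the memory above the image is removed and 0x42000000 bytes are reported as unmet.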

-- 
Catalin
Ard Biesheuvel Jan. 8, 2016, 3:30 p.m. UTC | #5
On 8 January 2016 at 16:27, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Wed, Dec 30, 2015 at 04:26:10PM +0100, Ard Biesheuvel wrote:
>> +static void __init enforce_memory_limit(void)
>> +{
>> +     const phys_addr_t kbase = round_down(__pa(_text), MIN_KIMG_ALIGN);
>> +     u64 to_remove = memblock_phys_mem_size() - memory_limit;
>> +     phys_addr_t max_addr = 0;
>> +     struct memblock_region *r;
>> +
>> +     if (memory_limit == (phys_addr_t)ULLONG_MAX)
>> +             return;
>> +
>> +     /*
>> +      * The kernel may be high up in physical memory, so try to apply the
>> +      * limit below the kernel first, and only let the generic handling
>> +      * take over if it turns out we haven't clipped enough memory yet.
>> +      */
>> +     for_each_memblock(memory, r) {
>> +             if (r->base + r->size > kbase) {
>> +                     u64 rem = min(to_remove, kbase - r->base);
>> +
>> +                     max_addr = r->base + rem;
>> +                     to_remove -= rem;
>> +                     break;
>> +             }
>> +             if (to_remove <= r->size) {
>> +                     max_addr = r->base + to_remove;
>> +                     to_remove = 0;
>> +                     break;
>> +             }
>> +             to_remove -= r->size;
>> +     }
>> +
>> +     memblock_remove(0, max_addr);
>> +
>> +     if (to_remove)
>> +             memblock_enforce_memory_limit(memory_limit);
>> +}
>
> IIUC, this is changing the user expectations a bit. There are people
> using the mem= limit to hijack some top of the RAM for other needs
> (though they could do it in a saner way like changing the DT memory
> nodes). Your patch first tries to remove the memory below the kernel
> image and only remove the top if additional limitation is necessary.
>
> Can you not remove memory from the top and block the limit if it goes
> below the end of the kernel image, with some warning that memory limit
> was not entirely fulfilled?
>

I'm in the middle of rewriting this code from scratch. The general idea is

static void __init clip_mem_range(u64 min, u64 max);

/*
* Clip memory in order of preference:
* - above the kernel and above 4 GB
* - between 4 GB and the start of the kernel
* - below 4 GB
* Note that tho
*/
clip_mem_range(max(sz_4g, PAGE_ALIGN(__pa(_end))), ULLONG_MAX);
clip_mem_range(sz_4g, round_down(__pa(_text), MIN_KIMG_ALIGN));
clip_mem_range(0, sz_4g);

where clip_mem_range() iterates over the memblocks to remove memory
between min and max iff min < max and the limit has not been met yet.
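
One possible shape for such a helper, as a user-space sketch rather than the eventual kernel code: the message above gives only the prototype and the one-line description, so the regions[] array, the global to_remove counter, the remove_from_region() helper and the choice to clip each range from its highest address downwards are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_REGIONS 16

struct region {
	uint64_t base;
	uint64_t size;
};

/* Toy stand-in for memblock.memory: sorted, non-overlapping regions. */
static struct region regions[MAX_REGIONS];
static int nr_regions;
static uint64_t to_remove;	/* bytes still to clip to honour mem= */

static uint64_t total_size(void)
{
	uint64_t sum = 0;

	for (int i = 0; i < nr_regions; i++)
		sum += regions[i].size;
	return sum;
}

/* Remove [start, end) from region i; the range must lie inside it. */
static void remove_from_region(int i, uint64_t start, uint64_t end)
{
	uint64_t rbase = regions[i].base;
	uint64_t rend = rbase + regions[i].size;
	size_t tail = (size_t)(nr_regions - i - 1) * sizeof(regions[0]);

	if (start > rbase && end < rend) {
		/* hole in the middle: keep the head, insert the remainder */
		assert(nr_regions < MAX_REGIONS);
		memmove(&regions[i + 2], &regions[i + 1], tail);
		regions[i].size = start - rbase;
		regions[i + 1] = (struct region){ end, rend - end };
		nr_regions++;
	} else if (start > rbase) {
		regions[i].size = start - rbase;		/* trim the top */
	} else if (end < rend) {
		regions[i] = (struct region){ end, rend - end };	/* trim the bottom */
	} else {
		memmove(&regions[i], &regions[i + 1], tail);	/* drop it all */
		nr_regions--;
	}
}

/* Clip memory inside [min, max), highest addresses first, until to_remove is 0. */
static void clip_mem_range(uint64_t min, uint64_t max)
{
	for (int i = nr_regions - 1; i >= 0 && to_remove && min < max; i--) {
		uint64_t lo = regions[i].base;
		uint64_t hi = lo + regions[i].size;
		uint64_t n;

		if (lo < min)
			lo = min;
		if (hi > max)
			hi = max;
		if (lo >= hi)
			continue;	/* region does not overlap the range */

		n = hi - lo < to_remove ? hi - lo : to_remove;
		remove_from_region(i, hi - n, hi);
		to_remove -= n;
	}
}
```

Calling it three times in the order given above (above the kernel and 4 GB, then between 4 GB and the kernel, then below 4 GB) leaves the image itself and as much low memory as possible in place.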
Mark Rutland Jan. 8, 2016, 3:36 p.m. UTC | #6
On Fri, Jan 08, 2016 at 03:27:38PM +0000, Catalin Marinas wrote:
> On Wed, Dec 30, 2015 at 04:26:10PM +0100, Ard Biesheuvel wrote:
> > +static void __init enforce_memory_limit(void)
> > +{
> > +	const phys_addr_t kbase = round_down(__pa(_text), MIN_KIMG_ALIGN);
> > +	u64 to_remove = memblock_phys_mem_size() - memory_limit;
> > +	phys_addr_t max_addr = 0;
> > +	struct memblock_region *r;
> > +
> > +	if (memory_limit == (phys_addr_t)ULLONG_MAX)
> > +		return;
> > +
> > +	/*
> > +	 * The kernel may be high up in physical memory, so try to apply the
> > +	 * limit below the kernel first, and only let the generic handling
> > +	 * take over if it turns out we haven't clipped enough memory yet.
> > +	 */
> > +	for_each_memblock(memory, r) {
> > +		if (r->base + r->size > kbase) {
> > +			u64 rem = min(to_remove, kbase - r->base);
> > +
> > +			max_addr = r->base + rem;
> > +			to_remove -= rem;
> > +			break;
> > +		}
> > +		if (to_remove <= r->size) {
> > +			max_addr = r->base + to_remove;
> > +			to_remove = 0;
> > +			break;
> > +		}
> > +		to_remove -= r->size;
> > +	}
> > +
> > +	memblock_remove(0, max_addr);
> > +
> > +	if (to_remove)
> > +		memblock_enforce_memory_limit(memory_limit);
> > +}
> 
> IIUC, this is changing the user expectations a bit. There are people
> using the mem= limit to hijack some top of the RAM for other needs
> (though they could do it in a saner way like changing the DT memory
> nodes).

Which will be hopelessly broken in the presence of KASLR, the kernel
being loaded at a different address, pages getting reserved differently
due to page size, etc.

I hope that no-one uses this for anything other than testing low-memory
conditions. If they want to steal memory they need to carve it out
explicitly.

We can behave as we used to, but we shouldn't give the impression that
such usage is supported.

Thanks,
Mark.
Catalin Marinas Jan. 8, 2016, 3:48 p.m. UTC | #7
On Fri, Jan 08, 2016 at 03:36:54PM +0000, Mark Rutland wrote:
> On Fri, Jan 08, 2016 at 03:27:38PM +0000, Catalin Marinas wrote:
> > On Wed, Dec 30, 2015 at 04:26:10PM +0100, Ard Biesheuvel wrote:
> > > +static void __init enforce_memory_limit(void)
> > > +{
> > > +	const phys_addr_t kbase = round_down(__pa(_text), MIN_KIMG_ALIGN);
> > > +	u64 to_remove = memblock_phys_mem_size() - memory_limit;
> > > +	phys_addr_t max_addr = 0;
> > > +	struct memblock_region *r;
> > > +
> > > +	if (memory_limit == (phys_addr_t)ULLONG_MAX)
> > > +		return;
> > > +
> > > +	/*
> > > +	 * The kernel may be high up in physical memory, so try to apply the
> > > +	 * limit below the kernel first, and only let the generic handling
> > > +	 * take over if it turns out we haven't clipped enough memory yet.
> > > +	 */
> > > +	for_each_memblock(memory, r) {
> > > +		if (r->base + r->size > kbase) {
> > > +			u64 rem = min(to_remove, kbase - r->base);
> > > +
> > > +			max_addr = r->base + rem;
> > > +			to_remove -= rem;
> > > +			break;
> > > +		}
> > > +		if (to_remove <= r->size) {
> > > +			max_addr = r->base + to_remove;
> > > +			to_remove = 0;
> > > +			break;
> > > +		}
> > > +		to_remove -= r->size;
> > > +	}
> > > +
> > > +	memblock_remove(0, max_addr);
> > > +
> > > +	if (to_remove)
> > > +		memblock_enforce_memory_limit(memory_limit);
> > > +}
> > 
> > IIUC, this is changing the user expectations a bit. There are people
> > using the mem= limit to hijack some top of the RAM for other needs
> > (though they could do it in a saner way like changing the DT memory
> > nodes).
> 
> Which will be hopelessly broken in the presence of KASLR, the kernel
> being loaded at a different address, pages getting reserved differently
> due to page size, etc.

With KASLR disabled, I think we should aim for the existing behaviour as
much as possible. The original aim of these patches was to relax the
kernel image placement rules, to make it easier for boot loaders rather
than completely randomising it.

With KASLR enabled, I agree it's hard to make any assumptions about what
memory is available. But removing memory only from the top would also
help with the point you already raised - keeping lower memory for
devices with narrower DMA mask.

-- 
Catalin
Mark Rutland Jan. 8, 2016, 4:14 p.m. UTC | #8
Hi Catalin,

I think we agree w.r.t. the code you suggest. I just disagree with the
suggestion that using mem= for carveouts is something we must, or even
could support -- it's already fragile.

More on that below.

On Fri, Jan 08, 2016 at 03:48:15PM +0000, Catalin Marinas wrote:
> On Fri, Jan 08, 2016 at 03:36:54PM +0000, Mark Rutland wrote:
> > On Fri, Jan 08, 2016 at 03:27:38PM +0000, Catalin Marinas wrote:
> > > On Wed, Dec 30, 2015 at 04:26:10PM +0100, Ard Biesheuvel wrote:
> > > > +static void __init enforce_memory_limit(void)
> > > > +{
> > > > +	const phys_addr_t kbase = round_down(__pa(_text), MIN_KIMG_ALIGN);
> > > > +	u64 to_remove = memblock_phys_mem_size() - memory_limit;
> > > > +	phys_addr_t max_addr = 0;
> > > > +	struct memblock_region *r;
> > > > +
> > > > +	if (memory_limit == (phys_addr_t)ULLONG_MAX)
> > > > +		return;
> > > > +
> > > > +	/*
> > > > +	 * The kernel may be high up in physical memory, so try to apply the
> > > > +	 * limit below the kernel first, and only let the generic handling
> > > > +	 * take over if it turns out we haven't clipped enough memory yet.
> > > > +	 */
> > > > +	for_each_memblock(memory, r) {
> > > > +		if (r->base + r->size > kbase) {
> > > > +			u64 rem = min(to_remove, kbase - r->base);
> > > > +
> > > > +			max_addr = r->base + rem;
> > > > +			to_remove -= rem;
> > > > +			break;
> > > > +		}
> > > > +		if (to_remove <= r->size) {
> > > > +			max_addr = r->base + to_remove;
> > > > +			to_remove = 0;
> > > > +			break;
> > > > +		}
> > > > +		to_remove -= r->size;
> > > > +	}
> > > > +
> > > > +	memblock_remove(0, max_addr);
> > > > +
> > > > +	if (to_remove)
> > > > +		memblock_enforce_memory_limit(memory_limit);
> > > > +}
> > > 
> > > IIUC, this is changing the user expectations a bit. There are people
> > > using the mem= limit to hijack some top of the RAM for other needs
> > > (though they could do it in a saner way like changing the DT memory
> > > nodes).
> > 
> > Which will be hopelessly broken in the presence of KASLR, the kernel
> > being loaded at a different address, pages getting reserved differently
> > due to page size, etc.
> 
> With KASLR disabled, I think we should aim for the existing behaviour as
> much as possible. The original aim of these patches was to relax the
> kernel image placement rules, to make it easier for boot loaders rather
> than completely randomising it.

Sure. My point was there were other reasons this is extremely fragile
currently, regardless of KASLR. For example, due to reservations
occurring differently.

Consider that when we add memory we may shave off portions of memory due
to page size, as we do in early_init_dt_add_memory_arch. Regions may be
fused or split for other reasons which may change over time, leading to
a different amount of memory being shaved off.

Afterwards memblock_enforce_memory_limit figures out the max address to keep
with:

        /* find out max address */
        for_each_memblock(memory, r) { 
                if (limit <= r->size) {
                        max_addr = r->base + limit;
                        break;
                }    
                limit -= r->size;
        }

Given all that, you cannot use mem= to prevent use of some memory, except for a
specific kernel binary with some value found by experimentation.

I think we need to make it clear that this is completely and hopelessly broken,
and should not pretend to support that.
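
As a worked illustration of that fragility, the walk quoted above can be run in user space over two hypothetical descriptions of the same RAM. The layouts, the 64 KB difference and the find_max_addr() wrapper are invented for this example; returning UINT64_MAX when the limit is never reached is likewise just a convention of the sketch.

```c
#include <stddef.h>
#include <stdint.h>

struct region {
	uint64_t base;
	uint64_t size;
};

/* Same walk as the loop quoted above, over a plain array of regions. */
static uint64_t find_max_addr(const struct region *r, size_t n, uint64_t limit)
{
	for (size_t i = 0; i < n; i++) {
		if (limit <= r[i].size)
			return r[i].base + limit;
		limit -= r[i].size;
	}
	return UINT64_MAX;	/* limit exceeds total memory: keep everything */
}

/* Hypothetical layout A: the first bank is registered in full. */
static const struct region layout_a[] = {
	{ 0x80000000ULL,  0x40000000ULL },
	{ 0x100000000ULL, 0x80000000ULL },
};

/* Hypothetical layout B: same RAM, but 64 KB shaved off the first bank. */
static const struct region layout_b[] = {
	{ 0x80010000ULL,  0x3fff0000ULL },
	{ 0x100000000ULL, 0x80000000ULL },
};
```

With a limit of 0x60000000 (mem=1536M), layout A yields a cut-off at 0x120000000 while layout B yields 0x120010000, so anything assuming the kernel stops at 0x120000000 would find its first 64 KB owned by the kernel in the second case.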

> With KASLR enabled, I agree it's hard to make any assumptions about what
> memory is available.

As above, I do not think this is safe at all across kernel binaries.

> But removing memory only from the top would also help with the point
> you already raised - keeping lower memory for devices with narrower
> DMA mask.

I'm happy with the logic you suggest for the purpose of keeping low DMA
memory.

I think we must make it clear that mem= cannot be used to protect or
carve out memory -- it's a best effort tool for test purposes.

Thanks,
Mark.

Patch

diff --git a/Documentation/arm64/booting.txt b/Documentation/arm64/booting.txt
index 701d39d3171a..03e02ebc1b0c 100644
--- a/Documentation/arm64/booting.txt
+++ b/Documentation/arm64/booting.txt
@@ -117,14 +117,14 @@  Header notes:
   depending on selected features, and is effectively unbound.
 
 The Image must be placed text_offset bytes from a 2MB aligned base
-address near the start of usable system RAM and called there. Memory
-below that base address is currently unusable by Linux, and therefore it
-is strongly recommended that this location is the start of system RAM.
-The region between the 2 MB aligned base address and the start of the
-image has no special significance to the kernel, and may be used for
-other purposes.
+address anywhere in usable system RAM and called there. The region
+between the 2 MB aligned base address and the start of the image has no
+special significance to the kernel, and may be used for other purposes.
 At least image_size bytes from the start of the image must be free for
 use by the kernel.
+NOTE: versions prior to v4.6 cannot make use of memory below the
+physical offset of the Image so it is recommended that the Image be
+placed as close as possible to the start of system RAM.
 
 Any memory described to the kernel (even that below the start of the
 image) which is not marked as reserved from the kernel (e.g., with a
diff --git a/arch/arm64/include/asm/boot.h b/arch/arm64/include/asm/boot.h
index 81151b67b26b..984cb0fa61ce 100644
--- a/arch/arm64/include/asm/boot.h
+++ b/arch/arm64/include/asm/boot.h
@@ -11,4 +11,9 @@ 
 #define MIN_FDT_ALIGN		8
 #define MAX_FDT_SIZE		SZ_2M
 
+/*
+ * arm64 requires the kernel image to be 2 MB aligned
+ */
+#define MIN_KIMG_ALIGN         SZ_2M
+
 #endif
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 0899026a2821..7e9516365b76 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -73,7 +73,7 @@ 
 
 #define KERN_TO_HYP(kva)	((unsigned long)kva - PAGE_OFFSET + HYP_PAGE_OFFSET)
 
-#define kvm_ksym_ref(sym)	((void *)&sym - KIMAGE_VADDR + PAGE_OFFSET)
+#define kvm_ksym_ref(sym)	phys_to_virt((u64)&sym - kimage_voffset)
 
 /*
  * We currently only support a 40bit IPA.
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 1dcbf142d36c..557228658666 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -88,10 +88,10 @@ 
 #define __virt_to_phys(x) ({						\
 	phys_addr_t __x = (phys_addr_t)(x);				\
 	__x >= PAGE_OFFSET ? (__x - PAGE_OFFSET + PHYS_OFFSET) :	\
-			     (__x - KIMAGE_VADDR + PHYS_OFFSET); })
+			     (__x - kimage_voffset); })
 
 #define __phys_to_virt(x)	((unsigned long)((x) - PHYS_OFFSET + PAGE_OFFSET))
-#define __phys_to_kimg(x)	((unsigned long)((x) - PHYS_OFFSET + KIMAGE_VADDR))
+#define __phys_to_kimg(x)	((unsigned long)((x) + kimage_voffset))
 
 /*
  * Convert a page to/from a physical address
@@ -121,13 +121,14 @@  extern phys_addr_t		memstart_addr;
 /* PHYS_OFFSET - the physical address of the start of memory. */
 #define PHYS_OFFSET		({ memstart_addr; })
 
+/* the offset between the kernel virtual and physical mappings */
+extern u64			kimage_voffset;
+
 /*
- * The maximum physical address that the linear direct mapping
- * of system RAM can cover. (PAGE_OFFSET can be interpreted as
- * a 2's complement signed quantity and negated to derive the
- * maximum size of the linear mapping.)
+ * Allow all memory at the discovery stage. We will clip it later.
  */
-#define MAX_MEMBLOCK_ADDR	({ memstart_addr - PAGE_OFFSET - 1; })
+#define MIN_MEMBLOCK_ADDR	0
+#define MAX_MEMBLOCK_ADDR	U64_MAX
 
 /*
  * PFNs are used to describe any physical page; this means
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 1230fa93fd8c..01a33e42ed70 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -445,7 +445,11 @@  __mmap_switched:
 2:
 	adr_l	sp, initial_sp, x4
 	str_l	x21, __fdt_pointer, x5		// Save FDT pointer
-	str_l	x24, memstart_addr, x6		// Save PHYS_OFFSET
+
+	ldr	x0, =KIMAGE_VADDR		// Save the offset between
+	sub	x24, x0, x24			// the kernel virtual and
+	str_l	x24, kimage_voffset, x0		// physical mappings
+
 	mov	x29, #0
 #ifdef CONFIG_KASAN
 	bl	kasan_early_init
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 2cfc9c54bf51..6aafe15c7754 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -35,6 +35,7 @@ 
 #include <linux/efi.h>
 #include <linux/swiotlb.h>
 
+#include <asm/boot.h>
 #include <asm/fixmap.h>
 #include <asm/kernel-pgtable.h>
 #include <asm/memory.h>
@@ -158,9 +159,56 @@  static int __init early_mem(char *p)
 }
 early_param("mem", early_mem);
 
+static void __init enforce_memory_limit(void)
+{
+	const phys_addr_t kbase = round_down(__pa(_text), MIN_KIMG_ALIGN);
+	u64 to_remove = memblock_phys_mem_size() - memory_limit;
+	phys_addr_t max_addr = 0;
+	struct memblock_region *r;
+
+	if (memory_limit == (phys_addr_t)ULLONG_MAX)
+		return;
+
+	/*
+	 * The kernel may be high up in physical memory, so try to apply the
+	 * limit below the kernel first, and only let the generic handling
+	 * take over if it turns out we haven't clipped enough memory yet.
+	 */
+	for_each_memblock(memory, r) {
+		if (r->base + r->size > kbase) {
+			u64 rem = min(to_remove, kbase - r->base);
+
+			max_addr = r->base + rem;
+			to_remove -= rem;
+			break;
+		}
+		if (to_remove <= r->size) {
+			max_addr = r->base + to_remove;
+			to_remove = 0;
+			break;
+		}
+		to_remove -= r->size;
+	}
+
+	memblock_remove(0, max_addr);
+
+	if (to_remove)
+		memblock_enforce_memory_limit(memory_limit);
+}
+
 void __init arm64_memblock_init(void)
 {
-	memblock_enforce_memory_limit(memory_limit);
+	/*
+	 * Remove the memory that we will not be able to cover
+	 * with the linear mapping.
+	 */
+	const s64 linear_region_size = -(s64)PAGE_OFFSET;
+
+	memblock_remove(round_down(memblock_start_of_DRAM(),
+				   1 << SWAPPER_TABLE_SHIFT) +
+			linear_region_size, ULLONG_MAX);
+
+	enforce_memory_limit();
 
 	/*
 	 * Register the kernel text, kernel data, initrd, and initial
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 6275d183c005..10067385e40f 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -44,6 +44,9 @@ 
 
 u64 idmap_t0sz = TCR_T0SZ(VA_BITS);
 
+u64 kimage_voffset __read_mostly;
+EXPORT_SYMBOL(kimage_voffset);
+
 /*
  * Empty_zero_page is a special page that is used for zero-initialized data
  * and COW.
@@ -326,6 +329,15 @@  static void __init map_mem(pgd_t *pgd)
 {
 	struct memblock_region *reg;
 
+	/*
+	 * Select a suitable value for the base of physical memory.
+	 * This should be equal to or below the lowest usable physical
+	 * memory address, and aligned to PUD/PMD size so that we can map
+	 * it efficiently.
+	 */
+	memstart_addr = round_down(memblock_start_of_DRAM(),
+				   1 << SWAPPER_TABLE_SHIFT);
+
 	/* map all the memory banks */
 	for_each_memblock(memory, reg) {
 		phys_addr_t start = reg->base;