
[v2,09/10] arm64/efi: ignore unusable regions instead of reserving them

Message ID 1415283206-14713-10-git-send-email-ard.biesheuvel@linaro.org
State New

Commit Message

Ard Biesheuvel Nov. 6, 2014, 2:13 p.m. UTC
This changes the way memblocks are installed based on the contents of
the UEFI memory map. Formerly, all regions would be memblock_add()'ed,
after which unusable regions would be memblock_reserve()'d as well.
To simplify things, but also to allow access to the unusable regions
through mmap(/dev/mem), even with CONFIG_STRICT_DEVMEM set, change
this so that only usable regions are memblock_add()'ed in the first
place.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/kernel/efi.c | 69 +++++++++++++++++++------------------------------
 1 file changed, 26 insertions(+), 43 deletions(-)
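
In outline, the patch replaces the old add-then-reserve pattern with a
single usability filter. A condensed sketch of the two control flows (the
loop bodies are abridged from the patch below, not the exact kernel code;
efi_mem_is_usable_region() is the helper the patch relies on, defined
elsewhere in the series):

/* Old flow: add every EFI_MEMORY_WB region, then reserve unusable ones. */
for_each_efi_memory_desc(&memmap, md) {
	u64 size = md->num_pages << EFI_PAGE_SHIFT;

	if (md->attribute & EFI_MEMORY_WB)
		early_init_dt_add_memory_arch(md->phys_addr, size);
	if (is_reserve_region(md))
		memblock_reserve(md->phys_addr, size);
}

/* New flow: unusable regions never enter memblock at all. */
for_each_efi_memory_desc(&memmap, md) {
	if (!efi_mem_is_usable_region(md))
		continue;
	early_init_dt_add_memory_arch(md->phys_addr,
				      md->num_pages << EFI_PAGE_SHIFT);
}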

Comments

Ard Biesheuvel Nov. 10, 2014, 7:31 a.m. UTC | #1
On 10 November 2014 05:11, Mark Salter <msalter@redhat.com> wrote:
> On Thu, 2014-11-06 at 15:13 +0100, Ard Biesheuvel wrote:
>> This changes the way memblocks are installed based on the contents of
>> the UEFI memory map. Formerly, all regions would be memblock_add()'ed,
>> after which unusable regions would be memblock_reserve()'d as well.
>> To simplify things, but also to allow access to the unusable regions
>> through mmap(/dev/mem), even with CONFIG_STRICT_DEVMEM set, change
>> this so that only usable regions are memblock_add()'ed in the first
>> place.
>
> This patch is crashing 64K pagesize kernels during boot. I'm not exactly
> sure why, though. Here is what I get on an APM Mustang box:
>

Ah, yes, I meant to mention this patch

https://git.kernel.org/cgit/linux/kernel/git/glikely/linux.git/commit/?id=8cccffc52694938fc88f3d90bc7fed8460e27191

in the cover letter, which addresses this issue at least for the DT case.
Mark Salter Nov. 11, 2014, 3:42 p.m. UTC | #2
On Mon, 2014-11-10 at 08:31 +0100, Ard Biesheuvel wrote:
> On 10 November 2014 05:11, Mark Salter <msalter@redhat.com> wrote:
> > On Thu, 2014-11-06 at 15:13 +0100, Ard Biesheuvel wrote:
> >> This changes the way memblocks are installed based on the contents of
> >> the UEFI memory map. Formerly, all regions would be memblock_add()'ed,
> >> after which unusable regions would be memblock_reserve()'d as well.
> >> To simplify things, but also to allow access to the unusable regions
> >> through mmap(/dev/mem), even with CONFIG_STRICT_DEVMEM set, change
> >> this so that only usable regions are memblock_add()'ed in the first
> >> place.
> >
> > This patch is crashing 64K pagesize kernels during boot. I'm not exactly
> > sure why, though. Here is what I get on an APM Mustang box:
> >
> 
> Ah, yes, I meant to mention this patch
> 
> https://git.kernel.org/cgit/linux/kernel/git/glikely/linux.git/commit/?id=8cccffc52694938fc88f3d90bc7fed8460e27191
> 
> in the cover letter, which addresses this issue at least for the DT
> case.
> 

That isn't the problem. In general, with 64K kernel pages, you can't be
sure you haven't left something you need out of the kernel linear mapping.
If you have Loader Code/Data regions begin and/or end at something other
than a 64K boundary and that region is adjacent to a region not being
added, then you end up leaving out the unaligned bits from the linear
mapping. This could be bits of the initramfs or devicetree.

What I don't get with this failure is that it is an alignment fault
which should be masked at EL1 for the kernel. The same unaligned
access happens without this patch and it doesn't generate a fault.
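
To make the clipping concrete, here is a sketch with invented addresses
(early_init_dt_add_memory_arch() rounds an unaligned region inward to
PAGE_SIZE, so a region aligned only to EFI's 4K pages loses its edges
under a 64K-page kernel):

/* Invented example: region aligned to 4K (EFI pages) but not to 64K. */
u64 paddr = 0x40002000;				/* region base */
u64 size  = 0x8e000;				/* region size */
u64 start = round_up(paddr, SZ_64K);		/* -> 0x40010000 */
u64 end   = round_down(paddr + size, SZ_64K);	/* -> 0x40090000 */
/*
 * [0x40002000, 0x40010000) never reaches the linear mapping; if the
 * initramfs or devicetree starts in that window, accessing it faults.
 */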
Mark Salter Nov. 11, 2014, 5:12 p.m. UTC | #3
On Tue, 2014-11-11 at 10:42 -0500, Mark Salter wrote:
> On Mon, 2014-11-10 at 08:31 +0100, Ard Biesheuvel wrote:
> > On 10 November 2014 05:11, Mark Salter <msalter@redhat.com> wrote:
> > > On Thu, 2014-11-06 at 15:13 +0100, Ard Biesheuvel wrote:
> > >> This changes the way memblocks are installed based on the contents of
> > >> the UEFI memory map. Formerly, all regions would be memblock_add()'ed,
> > >> after which unusable regions would be memblock_reserve()'d as well.
> > >> To simplify things, but also to allow access to the unusable regions
> > >> through mmap(/dev/mem), even with CONFIG_STRICT_DEVMEM set, change
> > >> this so that only usable regions are memblock_add()'ed in the first
> > >> place.
> > >
> > > This patch is crashing 64K pagesize kernels during boot. I'm not exactly
> > > sure why, though. Here is what I get on an APM Mustang box:
> > >
> > 
> > Ah, yes, I meant to mention this patch
> > 
> > https://git.kernel.org/cgit/linux/kernel/git/glikely/linux.git/commit/?id=8cccffc52694938fc88f3d90bc7fed8460e27191
> > 
> > in the cover letter, which addresses this issue at least for the DT
> > case.
> > 
> 
> That isn't the problem. In general, with 64K kernel pages, you can't be
> sure you haven't left something you need out of the kernel linear mapping.
> If you have Loader Code/Data regions begin and/or end at something other
> than a 64K boundary and that region is adjacent to a region not being
> added, then you end up leaving out the unaligned bits from the linear
> mapping. This could be bits of the initramfs or devicetree.
> 
> What I don't get with this failure is that it is an alignment fault
> which should be masked at EL1 for the kernel. The same unaligned
> access happens without this patch and it doesn't generate a fault.
> 

Ah, but unaligned accesses are not ignored for device memory.
I have this in include/acpi/acpi_io.h:

static inline void __iomem *acpi_os_ioremap(acpi_physical_address phys,
					    acpi_size size)
{
#ifdef CONFIG_ARM64
	if (!page_is_ram(phys >> PAGE_SHIFT))
		return ioremap(phys, size);
#endif

	return ioremap_cache(phys, size);
}

Because the table isn't in the linear mapping, it fails the
page_is_ram() test and it gets mapped with ioremap() leading to
the alignment fault.

If I take out the code inside the #ifdef, I get a different
fault:

[    0.350057] Unhandled fault: synchronous external abort (0x96000010) at 0xfffffe0000fae6f4
[    0.358704] pgd = fffffe0001160000
[    0.362276] [fffffe0000fae6f4] *pgd=0000004001370003, *pud=0000004001370003, *pmd=0000004001370003, *pte=02c00040011a0713
[    0.373746] Internal error: : 96000010 [#1] SMP
[    0.378484] Modules linked in:
[    0.381601] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 3.18.0-rc4+ #15
[    0.388248] Hardware name: APM X-Gene Mustang board (DT)
[    0.393738] task: fffffe03dbe10000 ti: fffffe03dbf00000 task.ti: fffffe03dbf00000
[    0.401503] PC is at acpi_ex_system_memory_space_handler+0x238/0x2e0
[    0.408160] LR is at acpi_ex_system_memory_space_handler+0x130/0x2e0

That happens because AML is trying to access a hardware register
which has been mapped as normal memory.

So, we need a way to tell a table in ram from an io address in AML.
And page_is_ram() no longer cuts it if the tables are not in the
linear mapping.
Mark Rutland Nov. 11, 2014, 5:44 p.m. UTC | #4
On Tue, Nov 11, 2014 at 05:12:09PM +0000, Mark Salter wrote:
> On Tue, 2014-11-11 at 10:42 -0500, Mark Salter wrote:
> > On Mon, 2014-11-10 at 08:31 +0100, Ard Biesheuvel wrote:
> > > On 10 November 2014 05:11, Mark Salter <msalter@redhat.com> wrote:
> > > > On Thu, 2014-11-06 at 15:13 +0100, Ard Biesheuvel wrote:
> > > >> This changes the way memblocks are installed based on the contents of
> > > >> the UEFI memory map. Formerly, all regions would be memblock_add()'ed,
> > > >> after which unusable regions would be memblock_reserve()'d as well.
> > > >> To simplify things, but also to allow access to the unusable regions
> > > >> through mmap(/dev/mem), even with CONFIG_STRICT_DEVMEM set, change
> > > >> this so that only usable regions are memblock_add()'ed in the first
> > > >> place.
> > > >
> > > > This patch is crashing 64K pagesize kernels during boot. I'm not exactly
> > > > sure why, though. Here is what I get on an APM Mustang box:
> > > >
> > > 
> > > Ah, yes, I meant to mention this patch
> > > 
> > > https://git.kernel.org/cgit/linux/kernel/git/glikely/linux.git/commit/?id=8cccffc52694938fc88f3d90bc7fed8460e27191
> > > 
> > > in the cover letter, which addresses this issue at least for the DT
> > > case.
> > > 
> > 
> > That isn't the problem. In general, with 64K kernel pages, you can't be
> > sure you haven't left something you need out of the kernel linear mapping.

Regardless of 64k pages you can never assume that anything will be in
the linear mapping due to the current relationship between the start of
the linear map and the address the kernel was loaded at.

> > If you have Loader Code/Data regions begin and/or end at something other
> > than a 64K boundary and that region is adjacent to a region not being
> > added, then you end up leaving out the unaligned bits from the linear
> > mapping. This could be bits of the initramfs or devicetree.
> > 
> > What I don't get with this failure is that it is an alignment fault
> > which should be masked at EL1 for the kernel. The same unaligned
> > access happens without this patch and it doesn't generate a fault.
> > 
> 
> Ah, but unaligned accesses are not ignored for device memory.
> I have this in include/acpi/acpi_io.h:
> 
> static inline void __iomem *acpi_os_ioremap(acpi_physical_address phys,
> 					    acpi_size size)
> {
> #ifdef CONFIG_ARM64
> 	if (!page_is_ram(phys >> PAGE_SHIFT))
> 		return ioremap(phys, size);
> #endif
> 
>        return ioremap_cache(phys, size);
> }
> 
> Because the table isn't in the linear mapping, it fails the
> page_is_ram() test and it gets mapped with ioremap() leading to
> the alignment fault.
> 
> If I take out the code inside the #ifdef, I get a different
> fault:
> 
> [    0.350057] Unhandled fault: synchronous external abort (0x96000010) at 0xfffffe0000fae6f4
> [    0.358704] pgd = fffffe0001160000
> [    0.362276] [fffffe0000fae6f4] *pgd=0000004001370003, *pud=0000004001370003, *pmd=0000004001370003, *pte=02c00040011a0713
> [    0.373746] Internal error: : 96000010 [#1] SMP
> [    0.378484] Modules linked in:
> [    0.381601] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 3.18.0-rc4+ #15
> [    0.388248] Hardware name: APM X-Gene Mustang board (DT)
> [    0.393738] task: fffffe03dbe10000 ti: fffffe03dbf00000 task.ti: fffffe03dbf00000
> [    0.401503] PC is at acpi_ex_system_memory_space_handler+0x238/0x2e0
> [    0.408160] LR is at acpi_ex_system_memory_space_handler+0x130/0x2e0
> 
> That happens because AML is trying to access a hardware register
> which has been mapped as normal memory.

Which is why removing the check in the ifdef is completely nonsensical.
We already knew we can't map everything cacheable -- regardless of what
AML does the CPU can prefetch from anything mapped cacheable (or simply
executable) at any time.

> So, we need a way to tell a table in ram from an io address in AML.
> And page_is_ram() no longer cuts it if the tables are not in the
> linear mapping.

If the kernel were loaded at an address above the tables, it would fail
similarly. So page_is_ram() was never sufficient to ensure tables
would be mapped cacheable.

As we haven't decoupled the kernel text mapping from the linear mapping,
and that doesn't look to be happening any time soon, we can't fix up
page_is_ram -- we need something else entirely.

Thanks,
Mark.
Ard Biesheuvel Nov. 11, 2014, 5:55 p.m. UTC | #5
On 11 November 2014 18:44, Mark Rutland <mark.rutland@arm.com> wrote:
> On Tue, Nov 11, 2014 at 05:12:09PM +0000, Mark Salter wrote:
>> On Tue, 2014-11-11 at 10:42 -0500, Mark Salter wrote:
>> > On Mon, 2014-11-10 at 08:31 +0100, Ard Biesheuvel wrote:
>> > > On 10 November 2014 05:11, Mark Salter <msalter@redhat.com> wrote:
>> > > > On Thu, 2014-11-06 at 15:13 +0100, Ard Biesheuvel wrote:
>> > > >> This changes the way memblocks are installed based on the contents of
>> > > >> the UEFI memory map. Formerly, all regions would be memblock_add()'ed,
>> > > >> after which unusable regions would be memblock_reserve()'d as well.
>> > > >> To simplify things, but also to allow access to the unusable regions
>> > > >> through mmap(/dev/mem), even with CONFIG_STRICT_DEVMEM set, change
>> > > >> this so that only usable regions are memblock_add()'ed in the first
>> > > >> place.
>> > > >
>> > > > This patch is crashing 64K pagesize kernels during boot. I'm not exactly
>> > > > sure why, though. Here is what I get on an APM Mustang box:
>> > > >
>> > >
>> > > Ah, yes, I meant to mention this patch
>> > >
>> > > https://git.kernel.org/cgit/linux/kernel/git/glikely/linux.git/commit/?id=8cccffc52694938fc88f3d90bc7fed8460e27191
>> > >
>> > > in the cover letter, which addresses this issue at least for the DT
>> > > case.
>> > >
>> >
>> > That isn't the problem. In general, with 64K kernel pages, you can't be
>> > sure you haven't left something you need out of the kernel linear mapping.
>
> Regardless of 64k pages you can never assume that anything will be in
> the linear mapping due to the current relationship between the start of
> the linear map and the address the kernel was loaded at.
>
>> > If you have Loader Code/Data regions begin and/or end at something other
>> > than a 64K boundary and that region is adjacent to a region not being
>> > added, then you end up leaving out the unaligned bits from the linear
>> > mapping. This could be bits of the initramfs or devicetree.
>> >
>> > What I don't get with this failure is that it is an alignment fault
>> > which should be masked at EL1 for the kernel. The same unaligned
>> > access happens without this patch and it doesn't generate a fault.
>> >
>>
>> Ah, but unaligned accesses are not ignored for device memory.
>> I have this in include/acpi/acpi_io.h:
>>
>> static inline void __iomem *acpi_os_ioremap(acpi_physical_address phys,
>>                                           acpi_size size)
>> {
>> #ifdef CONFIG_ARM64
>>       if (!page_is_ram(phys >> PAGE_SHIFT))
>>               return ioremap(phys, size);
>> #endif
>>
>>        return ioremap_cache(phys, size);
>> }
>>
>> Because the table isn't in the linear mapping, it fails the
>> page_is_ram() test and it gets mapped with ioremap() leading to
>> the alignment fault.
>>
>> If I take out the code inside the #ifdef, I get a different
>> fault:
>>
>> [    0.350057] Unhandled fault: synchronous external abort (0x96000010) at 0xfffffe0000fae6f4
>> [    0.358704] pgd = fffffe0001160000
>> [    0.362276] [fffffe0000fae6f4] *pgd=0000004001370003, *pud=0000004001370003, *pmd=0000004001370003, *pte=02c00040011a0713
>> [    0.373746] Internal error: : 96000010 [#1] SMP
>> [    0.378484] Modules linked in:
>> [    0.381601] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 3.18.0-rc4+ #15
>> [    0.388248] Hardware name: APM X-Gene Mustang board (DT)
>> [    0.393738] task: fffffe03dbe10000 ti: fffffe03dbf00000 task.ti: fffffe03dbf00000
>> [    0.401503] PC is at acpi_ex_system_memory_space_handler+0x238/0x2e0
>> [    0.408160] LR is at acpi_ex_system_memory_space_handler+0x130/0x2e0
>>
>> That happens because AML is trying to access a hardware register
>> which has been mapped as normal memory.
>
> Which is why removing the check in the ifdef is completely nonsensical.

I am sure we are all in agreement on this part.

> We already knew we can't map everything cacheable -- regardless of what
> AML does the CPU can prefetch from anything mapped cacheable (or simply
> executable) at any time.
>
>> So, we need a way to tell a table in ram from an io address in AML.
>> And page_is_ram() no longer cuts it if the tables are not in the
>> linear mapping.
>
> If the kernel were loaded at an address above the tables, it would fail
> similarly. So page_is_ram() was never sufficient to ensure tables
> would be mapped cacheable.
>
> As we haven't decoupled the kernel text mapping from the linear mapping,
> and that doesn't look to be happening any time soon, we can't fix up
> page_is_ram -- we need something else entirely.
>

Well, if you look at the series, and in particular at the /dev/mem
handling, there is already some code that classifies physical
addresses based on whether they appear in the UEFI memory map, and
with which attributes. I suppose reimplementing page_is_ram() [whose
default implementation is conveniently __weak] to return true for
EFI_MEMORY_WB ranges and false for everything else would do the trick
here, and would arguably be more elegant than matching the string
"System RAM" in the resource table. And, putting the UEFI memory map
central in a range of RAM remapping related functions ensures that
they are always in agreement (or so the theory goes)
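
A minimal sketch of that idea, assuming the UEFI memory map is kept
around after boot (this is illustrative, not existing kernel code):

/* Override the __weak generic implementation with an EFI-map lookup. */
int page_is_ram(unsigned long pfn)
{
	phys_addr_t addr = PFN_PHYS(pfn);
	efi_memory_desc_t *md;

	for_each_efi_memory_desc(&memmap, md) {
		u64 size = md->num_pages << EFI_PAGE_SHIFT;

		if (addr >= md->phys_addr && addr < md->phys_addr + size)
			return !!(md->attribute & EFI_MEMORY_WB);
	}
	return 0;	/* not described by the map: treat as not RAM */
}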
Mark Salter Nov. 11, 2014, 6:23 p.m. UTC | #6
On Tue, 2014-11-11 at 17:44 +0000, Mark Rutland wrote:
> On Tue, Nov 11, 2014 at 05:12:09PM +0000, Mark Salter wrote:
> > On Tue, 2014-11-11 at 10:42 -0500, Mark Salter wrote:
> > > On Mon, 2014-11-10 at 08:31 +0100, Ard Biesheuvel wrote:
> > > > On 10 November 2014 05:11, Mark Salter <msalter@redhat.com> wrote:
> > > > > On Thu, 2014-11-06 at 15:13 +0100, Ard Biesheuvel wrote:
> > > > >> This changes the way memblocks are installed based on the contents of
> > > > >> the UEFI memory map. Formerly, all regions would be memblock_add()'ed,
> > > > >> after which unusable regions would be memblock_reserve()'d as well.
> > > > >> To simplify things, but also to allow access to the unusable regions
> > > > >> through mmap(/dev/mem), even with CONFIG_STRICT_DEVMEM set, change
> > > > >> this so that only usable regions are memblock_add()'ed in the first
> > > > >> place.
> > > > >
> > > > > This patch is crashing 64K pagesize kernels during boot. I'm not exactly
> > > > > sure why, though. Here is what I get on an APM Mustang box:
> > > > >
> > > > 
> > > > Ah, yes, I meant to mention this patch
> > > > 
> > > > https://git.kernel.org/cgit/linux/kernel/git/glikely/linux.git/commit/?id=8cccffc52694938fc88f3d90bc7fed8460e27191
> > > > 
> > > > in the cover letter, which addresses this issue at least for the DT
> > > > case.
> > > > 
> > > 
> > > That isn't the problem. In general, with 64K kernel pages, you can't be
> > > sure you haven't left something you need out of the kernel linear mapping.
> 
> Regardless of 64k pages you can never assume that anything will be in
> the linear mapping due to the current relationship between the start of
> the linear map and the address the kernel was loaded at.

We know that the kernel, initramfs, and devicetree must be in the linear
mapping or we have already lost. I don't think we care about anything
else in Loader Data/Code.

If we fix the kernel so that initramfs and devicetree can exist outside
the linear mapping, then fine. Until then, if they do fall above the
kernel and can potentially be included in the linear mapping, then
you had better take care of the unaligned bits that get clipped by
early_init_dt_add_memory_arch().

> 
> > > If you have Loader Code/Data regions begin and/or end at something other
> > > than a 64K boundary and that region is adjacent to a region not being
> > > added, then you end up leaving out the unaligned bits from the linear
> > > mapping. This could be bits of the initramfs or devicetree.
> > > 
> > > What I don't get with this failure is that it is an alignment fault
> > > which should be masked at EL1 for the kernel. The same unaligned
> > > access happens without this patch and it doesn't generate a fault.
> > > 
> > 
> > Ah, but unaligned accesses are not ignored for device memory.
> > I have this in include/acpi/acpi_io.h:
> > 
> > static inline void __iomem *acpi_os_ioremap(acpi_physical_address phys,
> > 					    acpi_size size)
> > {
> > #ifdef CONFIG_ARM64
> > 	if (!page_is_ram(phys >> PAGE_SHIFT))
> > 		return ioremap(phys, size);
> > #endif
> > 
> >        return ioremap_cache(phys, size);
> > }
> > 
> > Because the table isn't in the linear mapping, it fails the
> > page_is_ram() test and it gets mapped with ioremap() leading to
> > the alignment fault.
> > 
> > If I take out the code inside the #ifdef, I get a different
> > fault:
> > 
> > [    0.350057] Unhandled fault: synchronous external abort (0x96000010) at 0xfffffe0000fae6f4
> > [    0.358704] pgd = fffffe0001160000
> > [    0.362276] [fffffe0000fae6f4] *pgd=0000004001370003, *pud=0000004001370003, *pmd=0000004001370003, *pte=02c00040011a0713
> > [    0.373746] Internal error: : 96000010 [#1] SMP
> > [    0.378484] Modules linked in:
> > [    0.381601] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 3.18.0-rc4+ #15
> > [    0.388248] Hardware name: APM X-Gene Mustang board (DT)
> > [    0.393738] task: fffffe03dbe10000 ti: fffffe03dbf00000 task.ti: fffffe03dbf00000
> > [    0.401503] PC is at acpi_ex_system_memory_space_handler+0x238/0x2e0
> > [    0.408160] LR is at acpi_ex_system_memory_space_handler+0x130/0x2e0
> > 
> > That happens because AML is trying to access a hardware register
> > which has been mapped as normal memory.
> 
> Which is why removing the check in the ifdef is completely nonsensical.
> We already knew we can't map everything cacheable -- regardless of what
> AML does the CPU can prefetch from anything mapped cacheable (or simply
> executable) at any time.
> 
> > So, we need a way to tell a table in ram from an io address in AML.
> > And page_is_ram() no longer cuts it if the tables are not in the
> > linear mapping.
> 
> If the kernel were loaded at an address above the tables, it would fail
> similarly. So page_is_ram() was never sufficient to ensure tables
> would be mapped cacheable.
> 
> As we haven't decoupled the kernel text mapping from the linear mapping,
> and that doesn't look to be happening any time soon, we can't fix up
> page_is_ram -- we need something else entirely.

Agreed.
Mark Rutland Nov. 11, 2014, 6:39 p.m. UTC | #7
On Tue, Nov 11, 2014 at 05:55:24PM +0000, Ard Biesheuvel wrote:
> On 11 November 2014 18:44, Mark Rutland <mark.rutland@arm.com> wrote:
> > On Tue, Nov 11, 2014 at 05:12:09PM +0000, Mark Salter wrote:
> >> On Tue, 2014-11-11 at 10:42 -0500, Mark Salter wrote:
> >> > On Mon, 2014-11-10 at 08:31 +0100, Ard Biesheuvel wrote:
> >> > > On 10 November 2014 05:11, Mark Salter <msalter@redhat.com> wrote:
> >> > > > On Thu, 2014-11-06 at 15:13 +0100, Ard Biesheuvel wrote:
> >> > > >> This changes the way memblocks are installed based on the contents of
> >> > > >> the UEFI memory map. Formerly, all regions would be memblock_add()'ed,
> >> > > >> after which unusable regions would be memblock_reserve()'d as well.
> >> > > >> To simplify things, but also to allow access to the unusable regions
> >> > > >> through mmap(/dev/mem), even with CONFIG_STRICT_DEVMEM set, change
> >> > > >> this so that only usable regions are memblock_add()'ed in the first
> >> > > >> place.
> >> > > >
> >> > > > This patch is crashing 64K pagesize kernels during boot. I'm not exactly
> >> > > > sure why, though. Here is what I get on an APM Mustang box:
> >> > > >
> >> > >
> >> > > Ah, yes, I meant to mention this patch
> >> > >
> >> > > https://git.kernel.org/cgit/linux/kernel/git/glikely/linux.git/commit/?id=8cccffc52694938fc88f3d90bc7fed8460e27191
> >> > >
> >> > > in the cover letter, which addresses this issue at least for the DT
> >> > > case.
> >> > >
> >> >
> >> > That isn't the problem. In general, with 64K kernel pages, you can't be
> >> > sure you haven't left something you need out of the kernel linear mapping.
> >
> > Regardless of 64k pages you can never assume that anything will be in
> > the linear mapping due to the current relationship between the start of
> > the linear map and the address the kernel was loaded at.
> >
> >> > If you have Loader Code/Data regions begin and/or end at something other
> >> > than a 64K boundary and that region is adjacent to a region not being
> >> > added, then you end up leaving out the unaligned bits from the linear
> >> > mapping. This could be bits of the initramfs or devicetree.
> >> >
> >> > What I don't get with this failure is that it is an alignment fault
> >> > which should be masked at EL1 for the kernel. The same unaligned
> >> > access happens without this patch and it doesn't generate a fault.
> >> >
> >>
> >> Ah, but unaligned accesses are not ignored for device memory.
> >> I have this in include/acpi/acpi_io.h:
> >>
> >> static inline void __iomem *acpi_os_ioremap(acpi_physical_address phys,
> >>                                           acpi_size size)
> >> {
> >> #ifdef CONFIG_ARM64
> >>       if (!page_is_ram(phys >> PAGE_SHIFT))
> >>               return ioremap(phys, size);
> >> #endif
> >>
> >>        return ioremap_cache(phys, size);
> >> }
> >>
> >> Because the table isn't in the linear mapping, it fails the
> >> page_is_ram() test and it gets mapped with ioremap() leading to
> >> the alignment fault.
> >>
> >> If I take out the code inside the #ifdef, I get a different
> >> fault:
> >>
> >> [    0.350057] Unhandled fault: synchronous external abort (0x96000010) at 0xfffffe0000fae6f4
> >> [    0.358704] pgd = fffffe0001160000
> >> [    0.362276] [fffffe0000fae6f4] *pgd=0000004001370003, *pud=0000004001370003, *pmd=0000004001370003, *pte=02c00040011a0713
> >> [    0.373746] Internal error: : 96000010 [#1] SMP
> >> [    0.378484] Modules linked in:
> >> [    0.381601] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 3.18.0-rc4+ #15
> >> [    0.388248] Hardware name: APM X-Gene Mustang board (DT)
> >> [    0.393738] task: fffffe03dbe10000 ti: fffffe03dbf00000 task.ti: fffffe03dbf00000
> >> [    0.401503] PC is at acpi_ex_system_memory_space_handler+0x238/0x2e0
> >> [    0.408160] LR is at acpi_ex_system_memory_space_handler+0x130/0x2e0
> >>
> >> That happens because AML is trying to access a hardware register
> >> which has been mapped as normal memory.
> >
> > Which is why removing the check in the ifdef is completely nonsensical.
> 
> I am sure we are all in agreement on this part.
> 
> > We already knew we can't map everything cacheable -- regardless of what
> > AML does the CPU can prefetch from anything mapped cacheable (or simply
> > executable) at any time.
> >
> >> So, we need a way to tell a table in ram from an io address in AML.
> >> And page_is_ram() no longer cuts it if the tables are not in the
> >> linear mapping.
> >
> > If the kernel were loaded at an address above the tables, it would fail
> > similarly. So page_is_ram() was never sufficient to ensure tables
> > would be mapped cacheable.
> >
> > As we haven't decoupled the kernel text mapping from the linear mapping,
> > and that doesn't look to be happening any time soon, we can't fix up
> > page_is_ram -- we need something else entirely.
> >
> 
> Well, if you look at the series, and in particular at the /dev/mem
> handling, there is already some code that classifies physical
> addresses based on whether they appear in the UEFI memory map, and
> with which attributes. I suppose reimplementing page_is_ram() [whose
> default implementation is conveniently __weak] to return true for
> EFI_MEMORY_WB ranges and false for everything else would do the trick
> here, and would arguably be more elegant than matching the string
> "System RAM" in the resource table. And, putting the UEFI memory map
> central in a range of RAM remapping related functions ensures that
> they are always in agreement (or so the theory goes)

Sure. We'll have to be careful to fall back to the current behaviour for
!UEFI systems, though.

Thanks,
Mark.
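
A sketch of the fallback guard Mark mentions (generic_page_is_ram() and
efi_page_is_wb() are invented names standing in for the generic
"System RAM" resource walk and the EFI-map lookup sketched above):

int page_is_ram(unsigned long pfn)
{
	/* No UEFI memory map: keep today's behaviour. */
	if (!efi_enabled(EFI_MEMMAP))
		return generic_page_is_ram(pfn);	/* invented name */

	return efi_page_is_wb(pfn);			/* invented name */
}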

Patch

diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index 3009c22e2620..af2214c692d3 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -40,13 +40,6 @@  static int __init uefi_debug_setup(char *str)
 }
 early_param("uefi_debug", uefi_debug_setup);
 
-static int __init is_normal_ram(efi_memory_desc_t *md)
-{
-	if (md->attribute & EFI_MEMORY_WB)
-		return 1;
-	return 0;
-}
-
 static int __init uefi_init(void)
 {
 	efi_char16_t *c16;
@@ -105,28 +98,11 @@  out:
 	return retval;
 }
 
-/*
- * Return true for RAM regions we want to permanently reserve.
- */
-static __init int is_reserve_region(efi_memory_desc_t *md)
-{
-	switch (md->type) {
-	case EFI_LOADER_CODE:
-	case EFI_LOADER_DATA:
-	case EFI_BOOT_SERVICES_CODE:
-	case EFI_BOOT_SERVICES_DATA:
-	case EFI_CONVENTIONAL_MEMORY:
-		return 0;
-	default:
-		break;
-	}
-	return is_normal_ram(md);
-}
-
-static __init void reserve_regions(void)
+static __init void process_memory_map(void)
 {
 	efi_memory_desc_t *md;
 	u64 paddr, npages, size;
+	u32 lost = 0;
 
 	if (uefi_debug)
 		pr_info("Processing EFI memory map:\n");
@@ -134,31 +110,38 @@  static __init void reserve_regions(void)
 	for_each_efi_memory_desc(&memmap, md) {
 		paddr = md->phys_addr;
 		npages = md->num_pages;
+		size = npages << EFI_PAGE_SHIFT;
 
 		if (uefi_debug) {
 			char buf[64];
 
-			pr_info("  0x%012llx-0x%012llx %s",
-				paddr, paddr + (npages << EFI_PAGE_SHIFT) - 1,
+			pr_info("  0x%012llx-0x%012llx %s\n",
+				paddr, paddr + size - 1,
 				efi_md_typeattr_format(buf, sizeof(buf), md));
 		}
 
-		memrange_efi_to_native(&paddr, &npages);
-		size = npages << PAGE_SHIFT;
-
-		if (is_normal_ram(md))
-			early_init_dt_add_memory_arch(paddr, size);
-
-		if (is_reserve_region(md)) {
-			memblock_reserve(paddr, size);
-			if (uefi_debug)
-				pr_cont("*");
-		}
-
-		if (uefi_debug)
-			pr_cont("\n");
+		if (!efi_mem_is_usable_region(md))
+			continue;
+
+		early_init_dt_add_memory_arch(paddr, size);
+
+		/*
+		 * Keep a tally of how much memory we are losing due to
+		 * rounding of regions that are not aligned to the page
+		 * size. We cannot easily recover this memory without
+		 * sorting the memory map and attempting to merge adjacent
+		 * usable regions.
+		 */
+		if (PAGE_SHIFT != EFI_PAGE_SHIFT)
+			lost += (npages << EFI_PAGE_SHIFT) -
+				round_down(max_t(s64, size - PAGE_ALIGN(paddr) +
+						 md->phys_addr, 0),
+					   PAGE_SIZE);
 	}
 
+	if (lost > SZ_1K)
+		pr_warn("efi: lost %u KB of RAM to rounding\n", lost / SZ_1K);
+
 	set_bit(EFI_MEMMAP, &efi.flags);
 }
 
@@ -182,7 +165,7 @@  void __init efi_init(void)
 
 	WARN_ON(uefi_init() < 0);
 
-	reserve_regions();
+	process_memory_map();
 }
 
 static int __init arm64_enter_virtual_mode(void)
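
A worked example of the rounding tally in process_memory_map(), with
invented numbers and PAGE_SIZE = 64 KB:

	paddr = 0x40002000, npages = 0x8e    ->  size = 0x8e000
	PAGE_ALIGN(paddr)                     =  0x40010000
	size - (PAGE_ALIGN(paddr) - paddr)    =  0x80000
	round_down(0x80000, PAGE_SIZE)        =  0x80000
	lost += 0x8e000 - 0x80000             =  0xe000 (56 KB)

That is, the tally counts the bytes trimmed from each usable region when
its base is rounded up and its end rounded down to the kernel page size.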