
arm64: Allow vmalloc regions to be set with set_memory_*

Message ID: CAKv+Gu-73QKAsFQCa=yO5+=DihLXp_AKYyY6QRTVM4nuDH=-8g@mail.gmail.com
State: New

Commit Message

Ard Biesheuvel Jan. 13, 2016, 4:10 p.m. UTC
On 13 January 2016 at 15:03, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> On 12 January 2016 at 22:46, Laura Abbott <labbott@fedoraproject.org> wrote:
>>
>> The range of set_memory_* is currently restricted to the module address
>> range because of difficulties in breaking down larger block sizes.
>> vmalloc maps PAGE_SIZE pages so it is safe to use as well. Update the
>> function ranges and add a comment explaining why the range is restricted
>> the way it is.
>>
>> Signed-off-by: Laura Abbott <labbott@fedoraproject.org>
>> ---
>> This should let the protections for the eBPF work as expected, I don't
>> know if there is some sort of self test for that.
>
>
> This is going to conflict with my KASLR implementation, since it puts
> the kernel image right in the middle of the vmalloc area, and the
> kernel is obviously mapped with block mappings. In fact, I am
> proposing enabling huge-vmap for arm64 as well, since it seems an
> improvement generally, but also specifically allows me to unmap the
> __init section using the generic vunmap code (remove_vm_area). But in
> general, I think the assumption that the whole vmalloc area is mapped
> using pages is not tenable.
>
> AFAICT, vmalloc still uses pages exclusively even with huge-vmap (but
> ioremap does not). So perhaps it would make sense to check for the
> VM_ALLOC bit in the VMA flags (which I will not set for the kernel
> regions either)
>

Something along these lines, perhaps?

>
>> ---
>>  arch/arm64/mm/pageattr.c | 25 +++++++++++++++++++++----
>>  1 file changed, 21 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
>> index 3571c73..274208e 100644
>> --- a/arch/arm64/mm/pageattr.c
>> +++ b/arch/arm64/mm/pageattr.c
>> @@ -36,6 +36,26 @@ static int change_page_range(pte_t *ptep, pgtable_t token, unsigned long addr,
>>         return 0;
>>  }
>>
>> +static bool validate_addr(unsigned long start, unsigned long end)
>> +{
>> +       /*
>> +        * This check explicitly excludes most kernel memory. Most kernel
>> +        * memory is mapped with a larger page size and breaking down the
>> +        * larger page size without causing TLB conflicts is very difficult.
>> +        *
>> +        * If you need to call set_memory_* on a range, the recommendation is
>> +        * to use vmalloc since that range is mapped with pages.
>> +        */
>> +       if (start >= MODULES_VADDR && start < MODULES_END &&
>> +           end >= MODULES_VADDR && end < MODULES_END)
>> +               return true;
>> +
>> +       if (is_vmalloc_addr(start) && is_vmalloc_addr(end))
>> +               return true;
>> +
>> +       return false;
>> +}
>> +
>>  static int change_memory_common(unsigned long addr, int numpages,
>>                                 pgprot_t set_mask, pgprot_t clear_mask)
>>  {
>> @@ -51,10 +71,7 @@ static int change_memory_common(unsigned long addr, int numpages,
>>                 WARN_ON_ONCE(1);
>>         }
>>
>> -       if (start < MODULES_VADDR || start >= MODULES_END)
>> -               return -EINVAL;
>> -
>> -       if (end < MODULES_VADDR || end >= MODULES_END)
>> +       if (!validate_addr(start, end))
>>                 return -EINVAL;
>>
>>         data.set_mask = set_mask;
>> --
>> 2.5.0
>>
>>
>> _______________________________________________
>> linux-arm-kernel mailing list
>> linux-arm-kernel@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

Comments

Mark Rutland Jan. 18, 2016, 11:56 a.m. UTC | #1
On Wed, Jan 13, 2016 at 05:10:31PM +0100, Ard Biesheuvel wrote:
> On 13 January 2016 at 15:03, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> > On 12 January 2016 at 22:46, Laura Abbott <labbott@fedoraproject.org> wrote:
> >>
> >> The range of set_memory_* is currently restricted to the module address
> >> range because of difficulties in breaking down larger block sizes.
> >> vmalloc maps PAGE_SIZE pages so it is safe to use as well. Update the
> >> function ranges and add a comment explaining why the range is restricted
> >> the way it is.
> >>
> >> Signed-off-by: Laura Abbott <labbott@fedoraproject.org>
> >> ---
> >> This should let the protections for the eBPF work as expected, I don't
> >> know if there is some sort of self test for that.
> >
> >
> > This is going to conflict with my KASLR implementation, since it puts
> > the kernel image right in the middle of the vmalloc area, and the
> > kernel is obviously mapped with block mappings. In fact, I am
> > proposing enabling huge-vmap for arm64 as well, since it seems an
> > improvement generally, but also specifically allows me to unmap the
> > __init section using the generic vunmap code (remove_vm_area). But in
> > general, I think the assumption that the whole vmalloc area is mapped
> > using pages is not tenable.
> >
> > AFAICT, vmalloc still uses pages exclusively even with huge-vmap (but
> > ioremap does not). So perhaps it would make sense to check for the
> > VM_ALLOC bit in the VMA flags (which I will not set for the kernel
> > regions either)
> >
>
> Something along these lines, perhaps?
>
> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> index 3571c7309c5e..bda0a776c58e 100644
> --- a/arch/arm64/mm/pageattr.c
> +++ b/arch/arm64/mm/pageattr.c
> @@ -13,6 +13,7 @@
>  #include <linux/kernel.h>
>  #include <linux/mm.h>
>  #include <linux/module.h>
> +#include <linux/vmalloc.h>
>  #include <linux/sched.h>
>
>  #include <asm/pgtable.h>
> @@ -44,6 +45,7 @@ static int change_memory_common(unsigned long addr
>         unsigned long end = start + size;
>         int ret;
>         struct page_change_data data;
> +       struct vm_struct *area;
>
>         if (!PAGE_ALIGNED(addr)) {
>                 start &= PAGE_MASK;
> @@ -51,10 +53,14 @@ static int change_memory_common(unsigned long addr,
>                 WARN_ON_ONCE(1);
>         }
>
> -       if (start < MODULES_VADDR || start >= MODULES_END)
> -               return -EINVAL;
> -
> -       if (end < MODULES_VADDR || end >= MODULES_END)
> +       /*
> +        * Check whether the [addr, addr + size) interval is entirely
> +        * covered by precisely one VM area that has the VM_ALLOC flag set
> +        */
> +       area = find_vm_area((void *)addr);
> +       if (!area ||
> +           end > (unsigned long)area->addr + area->size ||
> +           !(area->flags & VM_ALLOC))
>                 return -EINVAL;
>
>         data.set_mask = set_mask;

Neat. That fixes the fencepost bug too.

Looks good to me, though as Laura suggested we should have a comment as
to why we limit changes to such regions. Fancy taking her wording below
and spinning this as a patch?

> >> +       /*
> >> +        * This check explicitly excludes most kernel memory. Most kernel
> >> +        * memory is mapped with a larger page size and breaking down the
> >> +        * larger page size without causing TLB conflicts is very difficult.
> >> +        *
> >> +        * If you need to call set_memory_* on a range, the recommendation is
> >> +        * to use vmalloc since that range is mapped with pages.
> >> +        */

Thanks,
Mark.
Mark Rutland Jan. 28, 2016, 10:51 a.m. UTC | #2
On Thu, Jan 28, 2016 at 09:47:20AM +0800, Xishi Qiu wrote:
> On 2016/1/18 19:56, Mark Rutland wrote:
> > On Wed, Jan 13, 2016 at 05:10:31PM +0100, Ard Biesheuvel wrote:
> >> Something along these lines, perhaps?
> >>
> >> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> >> index 3571c7309c5e..bda0a776c58e 100644
> >> --- a/arch/arm64/mm/pageattr.c
> >> +++ b/arch/arm64/mm/pageattr.c
> >> @@ -13,6 +13,7 @@
> >>  #include <linux/kernel.h>
> >>  #include <linux/mm.h>
> >>  #include <linux/module.h>
> >> +#include <linux/vmalloc.h>
> >>  #include <linux/sched.h>
> >>
> >>  #include <asm/pgtable.h>
> >> @@ -44,6 +45,7 @@ static int change_memory_common(unsigned long addr
> >>         unsigned long end = start + size;
> >>         int ret;
> >>         struct page_change_data data;
> >> +       struct vm_struct *area;
> >>
> >>         if (!PAGE_ALIGNED(addr)) {
> >>                 start &= PAGE_MASK;
> >> @@ -51,10 +53,14 @@ static int change_memory_common(unsigned long addr,
> >>                 WARN_ON_ONCE(1);
> >>         }
> >>
> >> -       if (start < MODULES_VADDR || start >= MODULES_END)
> >> -               return -EINVAL;
> >> -
> >> -       if (end < MODULES_VADDR || end >= MODULES_END)
> >> +       /*
> >> +        * Check whether the [addr, addr + size) interval is entirely
> >> +        * covered by precisely one VM area that has the VM_ALLOC flag set
> >> +        */
> >> +       area = find_vm_area((void *)addr);
> >> +       if (!area ||
> >> +           end > (unsigned long)area->addr + area->size ||
> >> +           !(area->flags & VM_ALLOC))
> >>                 return -EINVAL;
> >>
> >>         data.set_mask = set_mask;
> >
> > Neat. That fixes the fencepost bug too.
> >
> > Looks good to me, though as Laura suggested we should have a comment as
> > to why we limit changes to such regions. Fancy taking her wording below
> > and spinning this as a patch?
> >
> >>>> +       /*
> >>>> +        * This check explicitly excludes most kernel memory. Most kernel
> >>>> +        * memory is mapped with a larger page size and breaking down the
> >>>> +        * larger page size without causing TLB conflicts is very difficult.
> >>>> +        *
> >>>> +        * If you need to call set_memory_* on a range, the recommendation is
> >>>> +        * to use vmalloc since that range is mapped with pages.
> >>>> +        */
> >
> > Thanks,
> > Mark.
> >
>
> Hi Mark,
>
> After changing the flag, it calls only flush_tlb_kernel_range(), so why not use
> cpu_replace_ttbr1(swapper_pg_dir)?

We cannot use cpu_replace_ttbr1 here. Other CPUs may be online, and we
have no mechanism to place them in a safe set of page tables while
swapping TTBR1; we'd have to perform a deep copy of the tables, and this
would be horrendously expensive.

Using flush_tlb_kernel_range() is sufficient. As modules don't share a
page or section mapping with other users, we can follow a
Break-Before-Make approach. Additionally, they're mapped at page
granularity so we never split or fuse sections anyway. We only modify
the permission bits.
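
For context, the mechanism Mark describes is visible in arch/arm64/mm/pageattr.c: the existing page-granular PTEs are rewritten in place and the affected virtual range is then invalidated. Below is a condensed sketch of that flow (simplified, not verbatim; exact signatures and names vary between kernel versions, and the outer helper here mirrors change_memory_common rather than reproducing it):

#include <linux/mm.h>           /* apply_to_page_range() */
#include <asm/pgtable.h>
#include <asm/tlbflush.h>       /* flush_tlb_kernel_range() */

/*
 * Condensed sketch: each existing page-granular PTE has permission bits
 * cleared/set in place. No entry ever changes size, so no block mapping
 * has to be broken up, and a ranged TLB flush is sufficient.
 */
static int change_page_range(pte_t *ptep, pgtable_t token, unsigned long addr,
                             void *data)
{
        struct page_change_data *cdata = data;
        pte_t pte = *ptep;

        pte = clear_pte_bit(pte, cdata->clear_mask);
        pte = set_pte_bit(pte, cdata->set_mask);

        set_pte(ptep, pte);
        return 0;
}

static int change_memory_range(unsigned long start, unsigned long size,
                               pgprot_t set_mask, pgprot_t clear_mask)
{
        struct page_change_data data = { .set_mask = set_mask,
                                         .clear_mask = clear_mask };
        int ret;

        /* rewrite every PTE covering [start, start + size) */
        ret = apply_to_page_range(&init_mm, start, size,
                                  change_page_range, &data);

        /* invalidate only the affected virtual range on all CPUs */
        flush_tlb_kernel_range(start, start + size);
        return ret;
}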

> One more question, does TLB conflict only affect the kernel page table?

It's harder to solve for the text/linear map as we can't do
Break-Before-Make without potentially unmapping something in active use
(e.g. the code used to implement Break-Before-Make).

> There is no problem when splitting the transparent hugepage, right?

There was a potential problem with huge pages causing TLB conflicts,
which didn't always seem to follow a Break-Before-Make approach.

I believe that Kirill Shutemov's recent THP rework means that
section->table and table->section conversions always go via an invalid
entry, with appropriate TLB invalidation, making that safe. I have not
yet had the chance to verify that, however.
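
To make the Break-Before-Make requirement concrete: an entry that changes size or output address must first be made invalid, the stale TLB entries must be invalidated, and only then can the replacement entry be written. A purely illustrative sketch of a section-to-table split follows; this is not an existing kernel helper, and the table-population details are hypothetical:

#include <asm/barrier.h>
#include <asm/pgalloc.h>        /* __pmd_populate() */
#include <asm/pgtable.h>
#include <asm/tlbflush.h>

/*
 * Illustrative Break-Before-Make sequence for turning a PMD block
 * (section) mapping into a table of PTEs. While the entry is invalid,
 * accesses to the covered range fault, which is exactly why live kernel
 * text or linear map regions cannot be split this way without special care.
 */
static void split_pmd_bbm(pmd_t *pmdp, pte_t *new_ptes, unsigned long addr)
{
        /* 1. Break: replace the old block entry with an invalid entry */
        pmd_clear(pmdp);
        dsb(ishst);                             /* make the change visible */

        /* 2. Invalidate any TLB entries derived from the old block entry */
        flush_tlb_kernel_range(addr, addr + PMD_SIZE);

        /* 3. Make: install the new table entry pointing at the PTE level */
        __pmd_populate(pmdp, __pa(new_ptes), PMD_TYPE_TABLE);
        dsb(ishst);
}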

Thanks,
Mark.
Mark Rutland Jan. 28, 2016, 2:27 p.m. UTC | #3
On Thu, Jan 28, 2016 at 07:47:09PM +0800, Xishi Qiu wrote:
> On 2016/1/28 18:51, Mark Rutland wrote:
>
> > On Thu, Jan 28, 2016 at 09:47:20AM +0800, Xishi Qiu wrote:
> >> On 2016/1/18 19:56, Mark Rutland wrote:
> >>> On Wed, Jan 13, 2016 at 05:10:31PM +0100, Ard Biesheuvel wrote:
> >>>> Something along these lines, perhaps?
> >>>>
> >>>> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> >>>> index 3571c7309c5e..bda0a776c58e 100644
> >>>> --- a/arch/arm64/mm/pageattr.c
> >>>> +++ b/arch/arm64/mm/pageattr.c
> >>>> @@ -13,6 +13,7 @@
> >>>>  #include <linux/kernel.h>
> >>>>  #include <linux/mm.h>
> >>>>  #include <linux/module.h>
> >>>> +#include <linux/vmalloc.h>
> >>>>  #include <linux/sched.h>
> >>>>
> >>>>  #include <asm/pgtable.h>
> >>>> @@ -44,6 +45,7 @@ static int change_memory_common(unsigned long addr
> >>>>         unsigned long end = start + size;
> >>>>         int ret;
> >>>>         struct page_change_data data;
> >>>> +       struct vm_struct *area;
> >>>>
> >>>>         if (!PAGE_ALIGNED(addr)) {
> >>>>                 start &= PAGE_MASK;
> >>>> @@ -51,10 +53,14 @@ static int change_memory_common(unsigned long addr,
> >>>>                 WARN_ON_ONCE(1);
> >>>>         }
> >>>>
> >>>> -       if (start < MODULES_VADDR || start >= MODULES_END)
> >>>> -               return -EINVAL;
> >>>> -
> >>>> -       if (end < MODULES_VADDR || end >= MODULES_END)
> >>>> +       /*
> >>>> +        * Check whether the [addr, addr + size) interval is entirely
> >>>> +        * covered by precisely one VM area that has the VM_ALLOC flag set
> >>>> +        */
> >>>> +       area = find_vm_area((void *)addr);
> >>>> +       if (!area ||
> >>>> +           end > (unsigned long)area->addr + area->size ||
> >>>> +           !(area->flags & VM_ALLOC))
> >>>>                 return -EINVAL;
> >>>>
> >>>>         data.set_mask = set_mask;
> >>>
> >>> Neat. That fixes the fencepost bug too.
> >>>
> >>> Looks good to me, though as Laura suggested we should have a comment as
> >>> to why we limit changes to such regions. Fancy taking her wording below
> >>> and spinning this as a patch?
> >>>
> >>>>>> +       /*
> >>>>>> +        * This check explicitly excludes most kernel memory. Most kernel
> >>>>>> +        * memory is mapped with a larger page size and breaking down the
> >>>>>> +        * larger page size without causing TLB conflicts is very difficult.
> >>>>>> +        *
> >>>>>> +        * If you need to call set_memory_* on a range, the recommendation is
> >>>>>> +        * to use vmalloc since that range is mapped with pages.
> >>>>>> +        */
> >>>
> >>> Thanks,
> >>> Mark.
> >>>
> >>
> >> Hi Mark,
> >>
> >> After changing the flag, it calls only flush_tlb_kernel_range(), so why not use
> >> cpu_replace_ttbr1(swapper_pg_dir)?
> >
> > We cannot use cpu_replace_ttbr1 here. Other CPUs may be online, and we
> > have no mechanism to place them in a safe set of page tables while
> > swapping TTBR1; we'd have to perform a deep copy of the tables, and this
> > would be horrendously expensive.
> >
> > Using flush_tlb_kernel_range() is sufficient. As modules don't share a
> > page or section mapping with other users, we can follow a
> > Break-Before-Make approach. Additionally, they're mapped at page
> > granularity so we never split or fuse sections anyway. We only modify
> > the permission bits.
> >
>
> Hi Mark,
>
> Is it safe in the following path?
>
> alloc the whole linear map section

I don't understand what you mean by this; you will need to elaborate.
The terms "alloc" and "section" can mean a number of different things in
this context.

> cpu A write something on it
> cpu B write something on it
> cpu C set read only flag and call flush_tlb_kernel_range()

If you want to modify a portion of the linear map, this will not work.
Modifying the linear map in this manner is not safe.

If you want an alias of the linear map which was mapped using pages, and
you wanted to change that alias, that could work.

It depends on what precisely you are trying to do.

Thanks,
Mark.
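
To make the "alias of the linear map which was mapped using pages" option concrete, here is a hedged sketch of creating such an alias with vmap() (the helper is hypothetical and not part of the patch discussed in this thread). Note that vmap() marks its area VM_MAP rather than VM_ALLOC, so the check proposed above would still reject set_memory_* on such an alias unless the flag check were widened:

#include <linux/mm.h>
#include <linux/vmalloc.h>

/*
 * Hedged sketch: build a page-granular, PTE-mapped alias of specific
 * struct pages in the vmalloc region. The alias can be modified without
 * touching the block mappings that cover the same memory in the linear
 * map, which is the scenario Mark describes above.
 */
static void *make_page_alias(struct page **pages, unsigned int nr)
{
        return vmap(pages, nr, VM_MAP, PAGE_KERNEL);
}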
Mark Rutland Jan. 29, 2016, 11:02 a.m. UTC | #4
On Fri, Jan 29, 2016 at 09:21:40AM +0800, Xishi Qiu wrote:
> On 2016/1/28 22:27, Mark Rutland wrote:
> > On Thu, Jan 28, 2016 at 07:47:09PM +0800, Xishi Qiu wrote:
> >> Hi Mark,
> >>
> >> Is it safe in the following path?
> >>
> >> alloc the whole linear map section
> >
> > I don't understand what you mean by this; you will need to elaborate.
> > The terms "alloc" and "section" can mean a number of different things in
> > this context.
> >
> >> cpu A write something on it
> >> cpu B write something on it
> >> cpu C set read only flag and call flush_tlb_kernel_range()
> >
> > If you want to modify a portion of the linear map, this will not work.
> > Modifying the linear map in this manner is not safe.
> >
> > If you want an alias of the linear map which was mapped using pages, and
> > you wanted to change that alias, that could work.
> >
>
> Hi Mark,
>
> I mean I change the whole section (maybe 1G?) in the linear map.

If you mean something that was mapped with a section (i.e. a block entry
in some level of page table), then no. The linear map is not open to
this kind of change, as portions of the region may be in use elsewhere
within Linux.

> In our software, the kernel creates a mapping (linear map) on special memory,
> and it is separated from the buddy system; the service manages the special
> memory itself.

This is not what the linear map is for. What exactly is this "special
memory"?

Is it some persistent memory?

Is it normal memory that you wish to use for some communication with
other agents and/or DMA?

Is it normal memory that you simply have a special use-case for?

> And the service allocates/frees the memory based on the physical address, so
> if the service wants to change the prot dynamically, vmalloc doesn't work, and
> fixmap is a little complex.

Without further explanation of your use-case, this doesn't make sense to
me. I don't understand why the physical address matters -- that implies
you have other agents accessing this memory. If that's the case, I don't
see what changing the permissions buys you.

Regardless, it sounds like either we're missing some infrastructure, or
you are mis-using existing APIs.

> I think if I create the special memory in 4KB pages, then the service could
> use set_memory_ro() directly, right?

Perhaps. If it's a vmalloc'd area, then yes (once Ard's patch to allow
that is in). I have more general concerns with your approach, in that I
still do not understand the problem you are trying to solve.

Thanks,
Mark.
Mark Rutland Feb. 3, 2016, 1:43 p.m. UTC | #5
On Sat, Jan 30, 2016 at 10:48:02AM +0800, Xishi Qiu wrote:
> Hi Mark,
>
> Thanks for your reply. Maybe I didn't express it clearly, sorry for it.
>
> The abstract process is the following:
> 1. do not create a large block, use 4KB pages for all of the memory (regardless of performance).
> setup_arch->paging_init()->map_mem()->__map_memblock()->create_mapping()
> 2. alloc a page and get the linear mapping address.
> 3. modify the judgment condition of set_memory_ro(), so it could handle the linear mapping range.
> 4. use set_memory_ro() to change the prot flag of the page which we get in step 2.
>
> Is it safe?

This will be free from the problems associated with splitting/fusing
sections.

My concern was with what exactly this "special memory" was and how it
was being used. It sounds like either some infrastructure is missing, or
you are using the wrong abstractions.

Thanks,
Mark.
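
The flow Xishi describes would look roughly like the sketch below. The assumptions are the ones spelled out in the exchange above, and neither holds for a stock kernel: the linear map must have been created entirely with 4KB pages, and set_memory_ro()'s range check must have been relaxed to accept linear map addresses.

#include <linux/gfp.h>
#include <linux/mm.h>
#include <asm/cacheflush.h>     /* set_memory_ro() on arm64 */

/*
 * Rough sketch of the four steps described above. This is only safe if
 * step 1 (mapping all memory with 4KB pages) and step 3 (relaxing the
 * range check in set_memory_ro()) have been done; otherwise the linear
 * map may be covered by block mappings that cannot be split safely.
 */
static int protect_one_page(void)
{
        struct page *page = alloc_page(GFP_KERNEL);     /* step 2 */
        unsigned long vaddr;

        if (!page)
                return -ENOMEM;

        vaddr = (unsigned long)page_address(page);      /* linear map address */

        return set_memory_ro(vaddr, 1);                 /* step 4 */
}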

Patch

diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 3571c7309c5e..bda0a776c58e 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -13,6 +13,7 @@ 
 #include <linux/kernel.h>
 #include <linux/mm.h>
 #include <linux/module.h>
+#include <linux/vmalloc.h>
 #include <linux/sched.h>

 #include <asm/pgtable.h>
@@ -44,6 +45,7 @@  static int change_memory_common(unsigned long addr
        unsigned long end = start + size;
        int ret;
        struct page_change_data data;
+       struct vm_struct *area;

        if (!PAGE_ALIGNED(addr)) {
                start &= PAGE_MASK;
@@ -51,10 +53,14 @@  static int change_memory_common(unsigned long addr,
                WARN_ON_ONCE(1);
        }

-       if (start < MODULES_VADDR || start >= MODULES_END)
-               return -EINVAL;
-
-       if (end < MODULES_VADDR || end >= MODULES_END)
+       /*
+        * Check whether the [addr, addr + size) interval is entirely
+        * covered by precisely one VM area that has the VM_ALLOC flag set
+        */
+       area = find_vm_area((void *)addr);
+       if (!area ||
+           end > (unsigned long)area->addr + area->size ||
+           !(area->flags & VM_ALLOC))
                return -EINVAL;

        data.set_mask = set_mask;
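
For reference, a hedged usage sketch of the kind of pattern this change enables (the helper names are made up for illustration; eBPF's actual image handling differs in detail). Because vmalloc memory is always mapped with 4KB pages, and the buffer lies entirely within one VM_ALLOC area, the find_vm_area() check above accepts it:

#include <linux/mm.h>
#include <linux/string.h>
#include <linux/vmalloc.h>
#include <asm/cacheflush.h>     /* set_memory_ro()/set_memory_rw() on arm64 */

/*
 * Hedged usage sketch: fill a vmalloc'd buffer, then make it read-only so
 * later stray writes fault instead of silently corrupting it.
 */
static void *alloc_ro_blob(const void *src, size_t len)
{
        unsigned int npages = PAGE_ALIGN(len) >> PAGE_SHIFT;
        void *buf = vmalloc(PAGE_ALIGN(len));

        if (!buf)
                return NULL;

        memcpy(buf, src, len);

        /* flip the PTE permissions for the whole allocation */
        if (set_memory_ro((unsigned long)buf, npages)) {
                vfree(buf);
                return NULL;
        }
        return buf;
}

static void free_ro_blob(void *buf, size_t len)
{
        /* make it writable again before handing it back to vmalloc */
        set_memory_rw((unsigned long)buf, PAGE_ALIGN(len) >> PAGE_SHIFT);
        vfree(buf);
}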