[00/18] arm64: Unmap the kernel whilst running in userspace (KAISER)

Message ID 1510942921-12564-1-git-send-email-will.deacon@arm.com

Message

Will Deacon Nov. 17, 2017, 6:21 p.m. UTC
Hi all,

This patch series implements something along the lines of KAISER for arm64:

  https://gruss.cc/files/kaiser.pdf

although I wrote this from scratch because the paper has some funny
assumptions about how the architecture works. There is a patch series
in review for x86, which follows a similar approach:

  http://lkml.kernel.org/r/<20171110193058.BECA7D88@viggo.jf.intel.com>

and the topic was recently covered by LWN (currently subscriber-only):

  https://lwn.net/Articles/738975/

The basic idea is that transitions to and from userspace are proxied
through a trampoline page which is mapped into a separate page table and
can switch the full kernel mapping in and out on exception entry and
exit respectively. This is a valuable defence against various KASLR and
timing attacks, particularly as the trampoline page is at a fixed virtual
address and therefore the kernel text can be randomized independently.

The major consequences of the trampoline are:

  * We can no longer make use of global mappings for kernel space, so
    each task is assigned two ASIDs: one for user mappings and one for
    kernel mappings

  * Our ASID moves into TTBR1 so that we can quickly switch between the
    trampoline and kernel page tables

  * Switching TTBR0 always requires use of the zero page, so we can
    dispense with some of our errata workaround code.

  * entry.S gets more complicated to read
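
To make the two-ASID scheme in the first bullet above concrete, here is a
minimal user-space sketch of one way the pairs could be encoded, assuming a
single flag bit (in the spirit of the USER_ASID_FLAG that appears in the diff
later in this thread) distinguishes the user and kernel halves; the real
allocator in the series may differ in detail:

  #include <stdint.h>
  #include <stdio.h>

  #define USER_ASID_FLAG 0x1UL  /* assumption: bit 0 selects the user half */

  struct asid_pair {
      uint64_t kernel_asid;     /* even value, tags kernel-side translations */
      uint64_t user_asid;       /* odd sibling, tags the user-only view */
  };

  static uint64_t next_base = 2;  /* hand ASIDs out two at a time */

  static struct asid_pair alloc_asid_pair(void)
  {
      struct asid_pair p;

      p.kernel_asid = next_base;
      p.user_asid = next_base | USER_ASID_FLAG;
      next_base += 2;

      return p;
  }

  int main(void)
  {
      struct asid_pair a = alloc_asid_pair();
      struct asid_pair b = alloc_asid_pair();

      printf("task A: kernel %llu, user %llu\n",
             (unsigned long long)a.kernel_asid, (unsigned long long)a.user_asid);
      printf("task B: kernel %llu, user %llu\n",
             (unsigned long long)b.kernel_asid, (unsigned long long)b.user_asid);
      return 0;
  }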

The performance hit from this series isn't as bad as I feared: things
like cyclictest and kernbench seem to be largely unaffected, although
syscall micro-benchmarks appear to show that syscall overhead is roughly
doubled, and this has an impact on things like hackbench which exhibits
a ~10% hit due to its heavy context-switching.
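
(For a rough sense of what "syscall overhead" means here, a micro-benchmark
can be as small as the sketch below; this is illustrative only and not the
tool used for the numbers above.)

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  #define ITERATIONS 1000000UL

  int main(void)
  {
      struct timespec start, end;
      unsigned long i;
      double ns;

      clock_gettime(CLOCK_MONOTONIC, &start);
      for (i = 0; i < ITERATIONS; i++)
          syscall(SYS_getpid);   /* cheap syscall, dominated by entry/exit cost */
      clock_gettime(CLOCK_MONOTONIC, &end);

      ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
      printf("getpid: %.1f ns per call\n", ns / ITERATIONS);
      return 0;
  }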

Patches based on 4.14 and also pushed here:

  git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git kaiser

Feedback welcome,

Will

--->8

Will Deacon (18):
  arm64: mm: Use non-global mappings for kernel space
  arm64: mm: Temporarily disable ARM64_SW_TTBR0_PAN
  arm64: mm: Move ASID from TTBR0 to TTBR1
  arm64: mm: Remove pre_ttbr0_update_workaround for Falkor erratum
    #E1003
  arm64: mm: Rename post_ttbr0_update_workaround
  arm64: mm: Fix and re-enable ARM64_SW_TTBR0_PAN
  arm64: mm: Allocate ASIDs in pairs
  arm64: mm: Add arm64_kernel_mapped_at_el0 helper using static key
  arm64: mm: Invalidate both kernel and user ASIDs when performing TLBI
  arm64: entry: Add exception trampoline page for exceptions from EL0
  arm64: mm: Map entry trampoline into trampoline and kernel page tables
  arm64: entry: Explicitly pass exception level to kernel_ventry macro
  arm64: entry: Hook up entry trampoline to exception vectors
  arm64: erratum: Work around Falkor erratum #E1003 in trampoline code
  arm64: tls: Avoid unconditional zeroing of tpidrro_el0 for native
    tasks
  arm64: entry: Add fake CPU feature for mapping the kernel at EL0
  arm64: makefile: Ensure TEXT_OFFSET doesn't overlap with trampoline
  arm64: Kconfig: Add CONFIG_UNMAP_KERNEL_AT_EL0

 arch/arm64/Kconfig                      |  30 +++--
 arch/arm64/Makefile                     |  18 ++-
 arch/arm64/include/asm/asm-uaccess.h    |  25 ++--
 arch/arm64/include/asm/assembler.h      |  27 +----
 arch/arm64/include/asm/cpucaps.h        |   3 +-
 arch/arm64/include/asm/kernel-pgtable.h |  12 +-
 arch/arm64/include/asm/memory.h         |   1 +
 arch/arm64/include/asm/mmu.h            |  12 ++
 arch/arm64/include/asm/mmu_context.h    |   9 +-
 arch/arm64/include/asm/pgtable-hwdef.h  |   1 +
 arch/arm64/include/asm/pgtable-prot.h   |  21 +++-
 arch/arm64/include/asm/pgtable.h        |   1 +
 arch/arm64/include/asm/proc-fns.h       |   6 -
 arch/arm64/include/asm/tlbflush.h       |  16 ++-
 arch/arm64/include/asm/uaccess.h        |  21 +++-
 arch/arm64/kernel/cpufeature.c          |  11 ++
 arch/arm64/kernel/entry.S               | 195 ++++++++++++++++++++++++++------
 arch/arm64/kernel/process.c             |  12 +-
 arch/arm64/kernel/vmlinux.lds.S         |  17 +++
 arch/arm64/lib/clear_user.S             |   2 +-
 arch/arm64/lib/copy_from_user.S         |   2 +-
 arch/arm64/lib/copy_in_user.S           |   2 +-
 arch/arm64/lib/copy_to_user.S           |   2 +-
 arch/arm64/mm/cache.S                   |   2 +-
 arch/arm64/mm/context.c                 |  36 +++---
 arch/arm64/mm/mmu.c                     |  60 ++++++++++
 arch/arm64/mm/proc.S                    |  12 +-
 arch/arm64/xen/hypercall.S              |   2 +-
 28 files changed, 418 insertions(+), 140 deletions(-)

-- 
2.1.4

Comments

Stephen Boyd Nov. 18, 2017, 12:19 a.m. UTC | #1
On 11/17, Will Deacon wrote:
> Hi all,
>
> This patch series implements something along the lines of KAISER for arm64:
>
>   https://gruss.cc/files/kaiser.pdf
>
[...]
> The performance hit from this series isn't as bad as I feared: things
> like cyclictest and kernbench seem to be largely unaffected, although
> syscall micro-benchmarks appear to show that syscall overhead is roughly
> doubled, and this has an impact on things like hackbench which exhibits
> a ~10% hit due to its heavy context-switching.

Do you have performance benchmark numbers on CPUs with the Falkor
errata? I'm interested to see how much the TLB invalidate hurts
heavy context-switching workloads on these CPUs.

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Ard Biesheuvel Nov. 18, 2017, 3:25 p.m. UTC | #2
On 17 November 2017 at 18:21, Will Deacon <will.deacon@arm.com> wrote:
> Hi all,
>
> This patch series implements something along the lines of KAISER for arm64:
>
>   https://gruss.cc/files/kaiser.pdf
>
[...]

Very nice! I am quite pleased, because this makes KASLR much more
useful than it is now.

My main question is why we need a separate trampoline vector table: it
seems to me that with some minor surgery (as proposed below), we can
make the kernel_ventry macro instantiations tolerant for being loaded
somewhere in the fixmap (which I think is a better place for this than
at the base of the VMALLOC space), removing the need to change
vbar_el1 back and forth. The only downside is that exceptions taken
from EL1 will also use absolute addressing, but I don't think that is
a huge price to pay.

-------------->8------------------
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index f8ce4cdd3bb5..7f89ebc690b1 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -71,6 +71,20 @@

  .macro kernel_ventry, el, label, regsize = 64
  .align 7
+alternative_if_not ARM64_MAP_KERNEL_AT_EL0
+ .if \regsize == 64
+ msr tpidrro_el0, x30 // preserve x30
+ .endif
+ .if \el == 0
+ mrs x30, ttbr1_el1
+ sub x30, x30, #(SWAPPER_DIR_SIZE + RESERVED_TTBR0_SIZE)
+ bic x30, x30, #USER_ASID_FLAG
+ msr ttbr1_el1, x30
+ isb
+ .endif
+ ldr x30, =el\()\el\()_\label
+alternative_else_nop_endif
+
  sub sp, sp, #S_FRAME_SIZE
 #ifdef CONFIG_VMAP_STACK
  /*
@@ -82,7 +96,11 @@
  tbnz x0, #THREAD_SHIFT, 0f
  sub x0, sp, x0 // x0'' = sp' - x0' = (sp + x0) - sp = x0
  sub sp, sp, x0 // sp'' = sp' - x0 = (sp + x0) - x0 = sp
+alternative_if_not ARM64_MAP_KERNEL_AT_EL0
+ br x30
+alternative_else
  b el\()\el\()_\label
+alternative_endif

 0:
  /*
@@ -91,6 +109,10 @@
  * userspace, and can clobber EL0 registers to free up GPRs.
  */

+alternative_if_not ARM64_MAP_KERNEL_AT_EL0
+ mrs x30, tpidrro_el0 // restore x30
+alternative_else_nop_endif
+
  /* Stash the original SP (minus S_FRAME_SIZE) in tpidr_el0. */
  msr tpidr_el0, x0

@@ -98,8 +120,11 @@
  sub x0, sp, x0
  msr tpidrro_el0, x0

- /* Switch to the overflow stack */
- adr_this_cpu sp, overflow_stack + OVERFLOW_STACK_SIZE, x0
+ /* Switch to the overflow stack of this CPU */
+ ldr x0, =overflow_stack + OVERFLOW_STACK_SIZE
+ mov sp, x0
+ mrs x0, tpidr_el1
+ add sp, sp, x0

  /*
  * Check whether we were already on the overflow stack. This may happen
@@ -108,19 +133,30 @@
  mrs x0, tpidr_el0 // sp of interrupted context
  sub x0, sp, x0 // delta with top of overflow stack
  tst x0, #~(OVERFLOW_STACK_SIZE - 1) // within range?
- b.ne __bad_stack // no? -> bad stack pointer
+ b.eq 1f
+ ldr x0, =__bad_stack // no? -> bad stack pointer
+ br x0

  /* We were already on the overflow stack. Restore sp/x0 and carry on. */
- sub sp, sp, x0
+1: sub sp, sp, x0
  mrs x0, tpidrro_el0
 #endif
+alternative_if_not ARM64_MAP_KERNEL_AT_EL0
+ br x30
+alternative_else
  b el\()\el\()_\label
+alternative_endif
  .endm

- .macro kernel_entry, el, regsize = 64
+ .macro kernel_entry, el, regsize = 64, restore_x30 = 1
  .if \regsize == 32
  mov w0, w0 // zero upper 32 bits of x0
  .endif
+ .if \restore_x30
+alternative_if_not ARM64_MAP_KERNEL_AT_EL0
+ mrs x30, tpidrro_el0 // restore x30
+alternative_else_nop_endif
+ .endif
  stp x0, x1, [sp, #16 * 0]
  stp x2, x3, [sp, #16 * 1]
  stp x4, x5, [sp, #16 * 2]
@@ -363,7 +399,7 @@ tsk .req x28 // current thread_info
  */
  .pushsection ".entry.text", "ax"

- .align 11
+ .align PAGE_SHIFT
 ENTRY(vectors)
  kernel_ventry 1, sync_invalid // Synchronous EL1t
  kernel_ventry 1, irq_invalid // IRQ EL1t
@@ -391,6 +427,8 @@ ENTRY(vectors)
  kernel_ventry 0, fiq_invalid, 32 // FIQ 32-bit EL0
  kernel_ventry 0, error_invalid, 32 // Error 32-bit EL0
 #endif
+ .ltorg
+ .align PAGE_SHIFT
 END(vectors)

 #ifdef CONFIG_VMAP_STACK
@@ -408,7 +446,7 @@ __bad_stack:
  * S_FRAME_SIZE) was stashed in tpidr_el0 by kernel_ventry.
  */
  sub sp, sp, #S_FRAME_SIZE
- kernel_entry 1
+ kernel_entry 1, restore_x30=0
  mrs x0, tpidr_el0
  add x0, x0, #S_FRAME_SIZE
  str x0, [sp, #S_SP]
Will Deacon Nov. 20, 2017, 6:03 p.m. UTC | #3
On Fri, Nov 17, 2017 at 04:19:35PM -0800, Stephen Boyd wrote:
> On 11/17, Will Deacon wrote:
> > This patch series implements something along the lines of KAISER for arm64:
[...]
>
> Do you have performance benchmark numbers on CPUs with the Falkor
> errata? I'm interested to see how much the TLB invalidate hurts
> heavy context-switching workloads on these CPUs.

I don't, but I'm also not sure what I can do about it if it's an issue.

Will
Will Deacon Nov. 20, 2017, 6:06 p.m. UTC | #4
Hi Ard,

Cheers for having a look.

On Sat, Nov 18, 2017 at 03:25:06PM +0000, Ard Biesheuvel wrote:
> On 17 November 2017 at 18:21, Will Deacon <will.deacon@arm.com> wrote:
> > This patch series implements something along the lines of KAISER for arm64:
>
> Very nice! I am quite pleased, because this makes KASLR much more
> useful than it is now.

Agreed. I might actually start enabling now ;)

> My main question is why we need a separate trampoline vector table: it
> seems to me that with some minor surgery (as proposed below), we can
> make the kernel_ventry macro instantiations tolerant for being loaded
> somewhere in the fixmap (which I think is a better place for this than
> at the base of the VMALLOC space), removing the need to change
> vbar_el1 back and forth. The only downside is that exceptions taken
> from EL1 will also use absolute addressing, but I don't think that is
> a huge price to pay.

I think there are two aspects to this:

1. Moving the vectors to the fixmap
2. Avoiding the vbar toggle

I think (1) is a good idea, so I'll hack that up for v2. The vbar toggle
isn't as obvious: avoiding it adds some overhead to EL1 irq entry because
we're writing tpidrro_el0 as well as loading from the literal pool. I think
that it also makes the code more difficult to reason about because we'd have
to make sure we don't try to use the fixmap mapping before it's actually
mapped, which I think would mean we'd need a set of early vectors that we
then switch away from in a CPU hotplug notifier or something.

I'll see if I can measure the cost of the current vbar switching to get
an idea of the potential performance available.

Will
Ard Biesheuvel Nov. 20, 2017, 6:20 p.m. UTC | #5
On 20 November 2017 at 18:06, Will Deacon <will.deacon@arm.com> wrote:
> Hi Ard,
>
> Cheers for having a look.
>
> On Sat, Nov 18, 2017 at 03:25:06PM +0000, Ard Biesheuvel wrote:
>> On 17 November 2017 at 18:21, Will Deacon <will.deacon@arm.com> wrote:
>> > This patch series implements something along the lines of KAISER for arm64:
>>
>> Very nice! I am quite pleased, because this makes KASLR much more
>> useful than it is now.
>
> Agreed. I might actually start enabling now ;)

I think it makes more sense to have it enabled on your phone than on the
devboard on your desk.

>> My main question is why we need a separate trampoline vector table: it
>> seems to me that with some minor surgery (as proposed below), we can
>> make the kernel_ventry macro instantiations tolerant for being loaded
>> somewhere in the fixmap (which I think is a better place for this than
>> at the base of the VMALLOC space), removing the need to change
>> vbar_el1 back and forth. The only downside is that exceptions taken
>> from EL1 will also use absolute addressing, but I don't think that is
>> a huge price to pay.
>
> I think there are two aspects to this:
>
> 1. Moving the vectors to the fixmap
> 2. Avoiding the vbar toggle
>
> I think (1) is a good idea, so I'll hack that up for v2. The vbar toggle
> isn't as obvious: avoiding it adds some overhead to EL1 irq entry because
> we're writing tpidrro_el0 as well as loading from the literal pool.

Yeah, but in what workloads are interrupts taken while running in the
kernel a dominant factor?

> I think
> that it also makes the code more difficult to reason about because we'd have
> to make sure we don't try to use the fixmap mapping before it's actually
> mapped, which I think would mean we'd need a set of early vectors that we
> then switch away from in a CPU hotplug notifier or something.

I don't think this is necessary. The vector page with absolute
addressing would tolerate being accessed via its natural mapping
inside the kernel image as well as via the mapping in the fixmap
region.

> I'll see if I can measure the cost of the current vbar switching to get
> an idea of the potential performance available.

Yeah, makes sense. If the bulk of the performance hit is elsewhere,
there's no point in focusing on this bit.
Laura Abbott Nov. 20, 2017, 10:50 p.m. UTC | #6
On 11/17/2017 10:21 AM, Will Deacon wrote:
> Hi all,
> 
> This patch series implements something along the lines of KAISER for arm64:
> 
>    https://gruss.cc/files/kaiser.pdf
> 
[...]

Passed some basic tests on Hikey Android and my Mustang box. I'll
leave the Mustang building kernels for a few days. You're welcome
to add Tested-by or I can re-test on v2.

Thanks,
Laura
Pavel Machek Nov. 22, 2017, 4:19 p.m. UTC | #7
Hi!

> This patch series implements something along the lines of KAISER for arm64:
> 
>   https://gruss.cc/files/kaiser.pdf
> 
[...]
> The basic idea is that transitions to and from userspace are proxied
> through a trampoline page which is mapped into a separate page table and
> can switch the full kernel mapping in and out on exception entry and
> exit respectively. This is a valuable defence against various KASLR and
> timing attacks, particularly as the trampoline page is at a fixed virtual
> address and therefore the kernel text can be randomized
> independently.

If I'm willing to do timing attacks to defeat KASLR... what prevents
me from using CPU caches to do that?

There was blackhat talk about exactly that IIRC...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Will Deacon Nov. 22, 2017, 7:37 p.m. UTC | #8
On Mon, Nov 20, 2017 at 02:50:58PM -0800, Laura Abbott wrote:
> On 11/17/2017 10:21 AM, Will Deacon wrote:
> >This patch series implements something along the lines of KAISER for arm64:
> 
> Passed some basic tests on Hikey Android and my Mustang box. I'll
> leave the Mustang building kernels for a few days. You're welcome
> to add Tested-by or I can re-test on v2.

Cheers, Laura. I've got a few changes for v2 based on Ard's feedback, so if
you could retest that when I post it then it would be much appreciated.

Will
Will Deacon Nov. 22, 2017, 7:37 p.m. UTC | #9
On Wed, Nov 22, 2017 at 05:19:14PM +0100, Pavel Machek wrote:
> > This patch series implements something along the lines of KAISER for arm64:
[...]
> > The basic idea is that transitions to and from userspace are proxied
> > through a trampoline page which is mapped into a separate page table and
> > can switch the full kernel mapping in and out on exception entry and
> > exit respectively. This is a valuable defence against various KASLR and
> > timing attacks, particularly as the trampoline page is at a fixed virtual
> > address and therefore the kernel text can be randomized
> > independently.
> 
> If I'm willing to do timing attacks to defeat KASLR... what prevents
> me from using CPU caches to do that?

Is that a rhetorical question? If not, then I'm probably not the best person
to answer it. All I'm doing here is protecting against a class of attacks on
kaslr that make use of the TLB/page-table walker to determine where the
kernel is mapped.

> There was blackhat talk about exactly that IIRC...


Got a link? I'd be interested to see how the idea works in case there's an
orthogonal defence against it.

Will
Will Deacon Nov. 22, 2017, 7:37 p.m. UTC | #10
On Mon, Nov 20, 2017 at 06:20:39PM +0000, Ard Biesheuvel wrote:
> On 20 November 2017 at 18:06, Will Deacon <will.deacon@arm.com> wrote:
> > I'll see if I can measure the cost of the current vbar switching to get
> > an idea of the potential performance available.
> 
> Yeah, makes sense. If the bulk of the performance hit is elsewhere,
> there's no point in focusing on this bit.

I had a go at implementing a variant on your suggestion where we avoid
swizzling the vbar on exception entry/exit but I couldn't reliably measure a
difference in performance. It appears that the ISB needed by the TTBR change
is dominant, so the vbar write is insignificant.

Will
Ard Biesheuvel Nov. 22, 2017, 9:19 p.m. UTC | #11
On 22 November 2017 at 16:19, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!
>
>> This patch series implements something along the lines of KAISER for arm64:
>>
>>   https://gruss.cc/files/kaiser.pdf
>>
[...]
>> The basic idea is that transitions to and from userspace are proxied
>> through a trampoline page which is mapped into a separate page table and
>> can switch the full kernel mapping in and out on exception entry and
>> exit respectively. This is a valuable defence against various KASLR and
>> timing attacks, particularly as the trampoline page is at a fixed virtual
>> address and therefore the kernel text can be randomized
>> independently.
>
> If I'm willing to do timing attacks to defeat KASLR... what prevents
> me from using CPU caches to do that?

Because it is impossible to get a cache hit on an access to an unmapped address?

> There was blackhat talk about exactly that IIRC...
Pavel Machek Nov. 22, 2017, 10:33 p.m. UTC | #12
On Wed 2017-11-22 21:19:28, Ard Biesheuvel wrote:
> > If I'm willing to do timing attacks to defeat KASLR... what prevents
> > me from using CPU caches to do that?
> 
> Because it is impossible to get a cache hit on an access to an
> unmapped address?

Um, no, I don't need to be able to directly access kernel addresses. I
just put some data in _same place in cache where kernel data would
go_, then do syscall and look if my data are still cached. Caches
don't have infinite associativity.
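
A rough user-space sketch of that prime+probe measurement, for readers
following along (illustrative only; a real attack would target individual
cache sets and average many samples rather than timing one coarse sweep):

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  #define BUF_SIZE  (64 * 1024)   /* assumed to cover the L1 data cache */
  #define LINE_SIZE 64

  static volatile unsigned char buf[BUF_SIZE];

  static unsigned long probe_ns(void)
  {
      struct timespec a, b;
      unsigned long i;

      clock_gettime(CLOCK_MONOTONIC, &a);
      for (i = 0; i < BUF_SIZE; i += LINE_SIZE)
          (void)buf[i];           /* touch one byte per cache line */
      clock_gettime(CLOCK_MONOTONIC, &b);

      return (b.tv_sec - a.tv_sec) * 1000000000UL + (b.tv_nsec - a.tv_nsec);
  }

  int main(void)
  {
      unsigned long cached, after;

      probe_ns();                 /* prime: pull the buffer into the cache */
      cached = probe_ns();        /* baseline: everything should still be resident */

      syscall(SYS_getpid);        /* let the kernel run and evict some of our lines */
      after = probe_ns();

      printf("probe while cached: %lu ns, after syscall: %lu ns\n", cached, after);
      return 0;
  }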

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Pavel Machek Nov. 22, 2017, 10:36 p.m. UTC | #13
On Wed 2017-11-22 19:37:14, Will Deacon wrote:
> On Wed, Nov 22, 2017 at 05:19:14PM +0100, Pavel Machek wrote:
> > If I'm willing to do timing attacks to defeat KASLR... what prevents
> > me from using CPU caches to do that?
> 
> Is that a rhetorical question? If not, then I'm probably not the best person
> to answer it. All I'm doing here is protecting against a class of attacks on
> kaslr that make use of the TLB/page-table walker to determine where the
> kernel is mapped.

Yeah. What I'm saying is that I can use cache effects to probe where
the kernel is mapped (and what it is doing).

> > There was blackhat talk about exactly that IIRC...
> 
> Got a link? I'd be interested to see how the idea works in case there's an
> orthogonal defence against it.

https://www.youtube.com/watch?v=9KsnFWejpQg

(Tell me if it is not the right one).

As for defenses... yes. "maxcpus=1" and flush caches on switch to
usermode will do the trick :-).

Ok, so that was sarcastic. I'm not sure a good defense exists. ARM is
better than i386 because reading time and cache flushing is
privileged, but...

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Ard Biesheuvel Nov. 22, 2017, 11:19 p.m. UTC | #14
> On 22 Nov 2017, at 22:33, Pavel Machek <pavel@ucw.cz> wrote:
>> On Wed 2017-11-22 21:19:28, Ard Biesheuvel wrote:
>>> If I'm willing to do timing attacks to defeat KASLR... what prevents
>>> me from using CPU caches to do that?
>>
>> Because it is impossible to get a cache hit on an access to an
>> unmapped address?
>
> Um, no, I don't need to be able to directly access kernel addresses. I
> just put some data in _same place in cache where kernel data would
> go_, then do syscall and look if my data are still cached. Caches
> don't have infinite associativity.

Ah ok. Interesting.

But how does that leak address bits that are covered by the tag?
Pavel Machek Nov. 22, 2017, 11:37 p.m. UTC | #15
Hi!

> >>> If I'm willing to do timing attacks to defeat KASLR... what prevents
> >>> me from using CPU caches to do that?
> >>
> >> Because it is impossible to get a cache hit on an access to an
> >> unmapped address?
> >
> > Um, no, I don't need to be able to directly access kernel addresses. I
> > just put some data in _same place in cache where kernel data would
> > go_, then do syscall and look if my data are still cached. Caches
> > don't have infinite associativity.
> 
> Ah ok. Interesting.
> 
> But how does that leak address bits that are covered by the tag?

Same as leaking any other address bits? Caches are "virtually
indexed", and tag does not come into play...

Maybe this explains it?

https://www.youtube.com/watch?v=9KsnFWejpQg
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Ard Biesheuvel Nov. 23, 2017, 6:51 a.m. UTC | #16
> On 22 Nov 2017, at 23:37, Pavel Machek <pavel@ucw.cz> wrote:
>>> Um, no, I don't need to be able to directly access kernel addresses. I
>>> just put some data in _same place in cache where kernel data would
>>> go_, then do syscall and look if my data are still cached. Caches
>>> don't have infinite associativity.
>>
>> Ah ok. Interesting.
>>
>> But how does that leak address bits that are covered by the tag?
>
> Same as leaking any other address bits? Caches are "virtually
> indexed",

Not on arm64, although I don’t see how that is relevant if you are trying to defeat kaslr.

> and tag does not come into play...

Well, I must be missing something then, because I don’t see how knowledge about which userland address shares a cache way with a kernel address can leak anything beyond the bits that make up the index (i.e., which cache way is being shared)

> Maybe this explains it?

No not really. It explains how cache timing can be used as a side channel, not how it defeats kaslr.

Thanks,
Ard.
Pavel Machek Nov. 23, 2017, 9:07 a.m. UTC | #17
Hi!

> > Same as leaking any other address bits? Caches are "virtually
> > indexed",
> 
> Not on arm64, although I don’t see how that is relevant if you are trying to defeat kaslr.
> 
> > and tag does not come into play...
> 
> Well, I must be missing something then, because I don’t see how knowledge about which userland address shares a cache way with a kernel address can leak anything beyond the bits that make up the index (i.e., which cache way is being shared)

Well, KASLR is about keeping bits of kernel virtual address secret
from userland. Leaking them through cache sidechannel means KASLR is
defeated.


> > Maybe this explains it?
> 
> No not really. It explains how cache timing can be used as a side channel, not how it defeats kaslr.

Ok, look at this one:

https://www.blackhat.com/docs/us-16/materials/us-16-Jang-Breaking-Kernel-Address-Space-Layout-Randomization-KASLR-With-Intel-TSX-wp.pdf

You can use timing instead of TSX, right?
     	 	    	     	       	      	     	       	    Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Ard Biesheuvel Nov. 23, 2017, 9:23 a.m. UTC | #18
On 23 November 2017 at 09:07, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!
>
>> Well, I must be missing something then, because I don’t see how knowledge about which userland address shares a cache way with a kernel address can leak anything beyond the bits that make up the index (i.e., which cache way is being shared)
>
> Well, KASLR is about keeping bits of kernel virtual address secret
> from userland. Leaking them through cache sidechannel means KASLR is
> defeated.

Yes, that is what you claim. But you are not explaining how any of the
bits that we do want to keep secret can be discovered by making
inferences from which lines in a primed cache were evicted during a
syscall.

The cache index maps to low order bits. You can use this, e.g., to
attack table based AES, because there is only ~4 KB worth of tables,
and you are interested in finding out which exact entries of the table
were read by the process under attack.

You are saying the same approach will help you discover 30 high order
bits of a virtual kernel address, by observing the cache evictions in
a physically indexed physically tagged cache. How?
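
As a back-of-the-envelope illustration of that point, with made-up cache
geometry rather than any particular CPU: the set index is a pure function of
low-order address bits, so addresses that differ only in the high-order,
KASLR-relevant bits collide in the same set.

  #include <stdio.h>
  #include <stdint.h>

  #define CACHE_SIZE (64 * 1024)  /* assumed geometry: 64 KB, 4-way, 64-byte lines */
  #define WAYS       4
  #define LINE_SIZE  64
  #define NUM_SETS   (CACHE_SIZE / (WAYS * LINE_SIZE))  /* 256 sets */

  static unsigned int set_index(uint64_t addr)
  {
      return (addr / LINE_SIZE) % NUM_SETS;  /* only address bits [13:6] matter */
  }

  int main(void)
  {
      uint64_t a = 0xffff000008081234ULL;    /* arbitrary kernel-looking address */
      uint64_t b = a + (1ULL << 30);         /* differs only in a high-order bit */

      printf("set(a) = %u, set(b) = %u\n", set_index(a), set_index(b));
      return 0;
  }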

>

>> > Maybe this explains it?

>> >

>>

>> No not really. It explains how cache timing can be used as a side channel, not how it defeats kaslr.

>

> Ok, look at this one:

>

> https://www.blackhat.com/docs/us-16/materials/us-16-Jang-Breaking-Kernel-Address-Space-Layout-Randomization-KASLR-With-Intel-TSX-wp.pdf

>

> You can use timing instead of TSX, right?


The TSX attack is TLB based not cache based.
Pavel Machek Nov. 23, 2017, 10:46 a.m. UTC | #19
On Thu 2017-11-23 09:23:02, Ard Biesheuvel wrote:
> Yes, that is what you claim. But you are not explaining how any of the
> bits that we do want to keep secret can be discovered by making
> inferences from which lines in a primed cache were evicted during a
> syscall.
> 
> The cache index maps to low order bits. You can use this, e.g., to
> attack table based AES, because there is only ~4 KB worth of tables,
> and you are interested in finding out which exact entries of the table
> were read by the process under attack.
> 
> You are saying the same approach will help you discover 30 high order
> bits of a virtual kernel address, by observing the cache evictions in
> a physically indexed physically tagged cache. How?

I assumed high bits are hashed into cache index. I might have been
wrong. Anyway, page tables are about same size as AES tables. So...:

http://cve.circl.lu/cve/CVE-2017-5927

									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Ard Biesheuvel Nov. 23, 2017, 11:38 a.m. UTC | #20
On 23 November 2017 at 10:46, Pavel Machek <pavel@ucw.cz> wrote:
> I assumed high bits are hashed into cache index. I might have been
> wrong. Anyway, page tables are about same size as AES tables. So...:
>
> http://cve.circl.lu/cve/CVE-2017-5927

Very interesting paper. Can you explain why you think its findings can
be extrapolated to apply to attacks across address spaces? Because
that is what would be required for it to be able to defeat KASLR.
Pavel Machek Nov. 23, 2017, 5:54 p.m. UTC | #21
On Thu 2017-11-23 11:38:52, Ard Biesheuvel wrote:
> > I assumed high bits are hashed into cache index. I might have been
> > wrong. Anyway, page tables are about same size as AES tables. So...:
> >
> > http://cve.circl.lu/cve/CVE-2017-5927
> 
> Very interesting paper. Can you explain why you think its findings can
> be extrapolated to apply to attacks across address spaces? Because
> that is what would be required for it to be able to defeat KASLR.

Can you explain why not?

You clearly understand AES tables can be attacked cross-address-space,
and there's no reason page tables could not be attacked same way. I'm
not saying that's the best way to launch the attack, but it certainly
looks possible to me.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Ard Biesheuvel Nov. 23, 2017, 6:17 p.m. UTC | #22
On 23 November 2017 at 17:54, Pavel Machek <pavel@ucw.cz> wrote:
>> Very interesting paper. Can you explain why you think its findings can
>> be extrapolated to apply to attacks across address spaces? Because
>> that is what would be required for it to be able to defeat KASLR.
>
> Can you explain why not?
>
> You clearly understand AES tables can be attacked cross-address-space,
> and there's no reason page tables could not be attacked same way. I'm
> not saying that's the best way to launch the attack, but it certainly
> looks possible to me.

There are two sides to this:
- on the one hand, a round trip into the kernel is quite likely to
result in many more cache evictions than the ones from which you will
be able to infer what address was being resolved by the page table
walker, adding noise to the signal,
- on the other hand, the kernel mappings are deliberately coarse
grained so that they can be cached in the TLB with literally only a
handful of entries, so it is not guaranteed that a TLB miss will occur
that results in a page table walk that you are interested in.

Given the statistical approach, it may simply mean taking more
samples, but how many more? 10x 100000x? Given that the current attack
takes 10s of seconds to mount, that is a significant limitation. For
the TLB side, it may help to mount an additional attack to prime the
TLB, but that itself is likely to add noise to the cache state
measurements.