diff mbox

ARM: Add pointer guard support.

Message ID 5242A79D.1030709@linaro.org
State Superseded
Headers show

Commit Message

Will Newton Sept. 25, 2013, 9:06 a.m. UTC
Add support for pointer mangling in glibc internal structures in C
and assembler code.

Tested on armv7 with hard and soft thread pointers.

ports/ChangeLog.arm:

2013-09-24  Will Newton  <will.newton@linaro.org>

	* sysdeps/arm/__longjmp.S (__longjmp): Demangle fp, sp
	and lr when restoring register values.
	* sysdeps/arm/include/bits/setjmp.h (JMP_BUF_REGLIST): Remove
	sp and lr from list and replace fp with a4.
	* sysdeps/arm/jmpbuf-unwind.h (_jmpbuf_sp): New function.
	(_JMPBUF_UNWINDS_ADJ): Call _jmpbuf_sp.
	* sysdeps/arm/nptl/tcb-offsets.sym: Add POINTER_GUARD.
	* sysdeps/arm/nptl/tls.h (tcbhead_t): Remove private and add
	pointer_guard. (THREAD_GET_POINTER_GUARD): New macro.
	(THREAD_SET_POINTER_GUARD): New macro.
	(THREAD_COPY_POINTER_GUARD): New macro.
	* sysdeps/arm/setjmp.S (__sigsetjmp): Mangle fp, sp and lr
	before storing register values.
	* sysdeps/unix/sysv/linux/arm/sysdep.h (PTR_MANGLE): New macro.
	(PTR_DEMANGLE): Likewise. (PTR_MANGLE2): Likewise.
	(PTR_DEMANGLE2): Likewise.
---
 ports/sysdeps/arm/__longjmp.S              | 14 ++++++++++++++
 ports/sysdeps/arm/include/bits/setjmp.h    |  5 +++--
 ports/sysdeps/arm/jmpbuf-unwind.h          | 13 ++++++++++++-
 ports/sysdeps/arm/nptl/tcb-offsets.sym     |  2 ++
 ports/sysdeps/arm/nptl/tls.h               | 10 +++++++++-
 ports/sysdeps/arm/setjmp.S                 | 14 ++++++++++++++
 ports/sysdeps/unix/sysv/linux/arm/sysdep.h | 22 +++++++++++++++++++---
 7 files changed, 73 insertions(+), 7 deletions(-)

Comments

Carlos O'Donell Sept. 25, 2013, 4:09 p.m. UTC | #1
On 09/25/2013 05:06 AM, Will Newton wrote:
> 
> Add support for pointer mangling in glibc internal structures in C
> and assembler code.
> 
> Tested on armv7 with hard and soft thread pointers.

Have you measured the performance versus using the existing
global variable? 

TLS access on ARM is quite slow and it looks to me like it 
may be faster to use the global variable. Keep in mind that
the pointer guard and stack guard do not vary by thread.

32-bit ARM is currently using a global variable e.g.
__pointer_chk_guard, all you need to do to make it work
is adjust the definitions of PTR_MANGLE and PTR_DEMANGLE
to reference the global symbol.

This is the second proposal for ARM (first was [1] for
AArch64) to support storing the a guard in the TCB, but
nobody has responded yet to my question about performance.

Cheers,
Carlos.

[1] https://sourceware.org/ml/libc-ports/2013-08/msg00052.html
Andrew Pinski Sept. 25, 2013, 4:20 p.m. UTC | #2
> On Sep 25, 2013, at 9:09 AM, "Carlos O'Donell" <carlos@redhat.com> wrote:
> 
>> On 09/25/2013 05:06 AM, Will Newton wrote:
>> 
>> Add support for pointer mangling in glibc internal structures in C
>> and assembler code.
>> 
>> Tested on armv7 with hard and soft thread pointers.
> 
> Have you measured the performance versus using the existing
> global variable? 
> 
> TLS access on ARM is quite slow and it looks to me like it 
> may be faster to use the global variable. Keep in mind that
> the pointer guard and stack guard do not vary by thread.
> 
> 32-bit ARM is currently using a global variable e.g.
> __pointer_chk_guard, all you need to do to make it work
> is adjust the definitions of PTR_MANGLE and PTR_DEMANGLE
> to reference the global symbol.
> 
> This is the second proposal for ARM (first was [1] for
> AArch64) to support storing the a guard in the TCB, but
> nobody has responded yet to my question about performance.
T 
I wonder the same question.  Why move to using to tcb in the first place.  The only answer I can figure out is that is what x86 does but that is not always the correct answer.

> 
> Cheers,
> Carlos.
> 
> [1] https://sourceware.org/ml/libc-ports/2013-08/msg00052.html
>
Will Newton Sept. 25, 2013, 4:23 p.m. UTC | #3
On 25 September 2013 17:09, Carlos O'Donell <carlos@redhat.com> wrote:
> On 09/25/2013 05:06 AM, Will Newton wrote:
>>
>> Add support for pointer mangling in glibc internal structures in C
>> and assembler code.
>>
>> Tested on armv7 with hard and soft thread pointers.
>
> Have you measured the performance versus using the existing
> global variable?

No, but I'll put together a patch for that approach and see how it looks.

> TLS access on ARM is quite slow and it looks to me like it
> may be faster to use the global variable. Keep in mind that
> the pointer guard and stack guard do not vary by thread.

From a back of the envelope calculation the cost of accessing TLS is
one cycle faster than accessing a global in best case (e.g.
Cortex-A15), considerably slower in common case (50-60 cycles,
Cortex-A9) and slower still in worst case (function call to libgcc and
the kernel, ARMv4/ARMv5).

Pointer guard looks to be on slower code paths anyway as compared to
stack guard, but you may be right that the global variable solution is
the best way to go.

> 32-bit ARM is currently using a global variable e.g.
> __pointer_chk_guard, all you need to do to make it work
> is adjust the definitions of PTR_MANGLE and PTR_DEMANGLE
> to reference the global symbol.
>
> This is the second proposal for ARM (first was [1] for
> AArch64) to support storing the a guard in the TCB, but
> nobody has responded yet to my question about performance.

AArch64 the equation is different - all AArch64 cores have a TLS
register, and while it is not general purpose I suspect accessing it
will be much faster than on the worst performing 32bit cores. I don't
have any numbers though.
Carlos O'Donell Sept. 25, 2013, 4:25 p.m. UTC | #4
On 09/25/2013 12:23 PM, Will Newton wrote:
> On 25 September 2013 17:09, Carlos O'Donell <carlos@redhat.com> wrote:
>> On 09/25/2013 05:06 AM, Will Newton wrote:
>>>
>>> Add support for pointer mangling in glibc internal structures in C
>>> and assembler code.
>>>
>>> Tested on armv7 with hard and soft thread pointers.
>>
>> Have you measured the performance versus using the existing
>> global variable?
> 
> No, but I'll put together a patch for that approach and see how it looks.
> 
>> TLS access on ARM is quite slow and it looks to me like it
>> may be faster to use the global variable. Keep in mind that
>> the pointer guard and stack guard do not vary by thread.
> 
> From a back of the envelope calculation the cost of accessing TLS is
> one cycle faster than accessing a global in best case (e.g.
> Cortex-A15), considerably slower in common case (50-60 cycles,
> Cortex-A9) and slower still in worst case (function call to libgcc and
> the kernel, ARMv4/ARMv5).
> 
> Pointer guard looks to be on slower code paths anyway as compared to
> stack guard, but you may be right that the global variable solution is
> the best way to go.

Thanks for exploring this solution.

>> 32-bit ARM is currently using a global variable e.g.
>> __pointer_chk_guard, all you need to do to make it work
>> is adjust the definitions of PTR_MANGLE and PTR_DEMANGLE
>> to reference the global symbol.
>>
>> This is the second proposal for ARM (first was [1] for
>> AArch64) to support storing the a guard in the TCB, but
>> nobody has responded yet to my question about performance.
> 
> AArch64 the equation is different - all AArch64 cores have a TLS
> register, and while it is not general purpose I suspect accessing it
> will be much faster than on the worst performing 32bit cores. I don't
> have any numbers though.

I don't disagree with you, but I'd like to see some due-diligence
in testing out the two alternatives and reporting back the performance
numbers. You need not implement both, just test two access methods
using a small test program and report the data.

Cheers,
Carlos.
Andrew Haley Sept. 25, 2013, 4:59 p.m. UTC | #5
On 09/25/2013 05:23 PM, Will Newton wrote:
> From a back of the envelope calculation the cost of accessing TLS is
> one cycle faster than accessing a global in best case (e.g.
> Cortex-A15)

Why is TLS so fast on Cortex-A15?

Andrew.
Will Newton Sept. 25, 2013, 7:32 p.m. UTC | #6
On 25 September 2013 17:59, Andrew Haley <aph@redhat.com> wrote:
> On 09/25/2013 05:23 PM, Will Newton wrote:
>> From a back of the envelope calculation the cost of accessing TLS is
>> one cycle faster than accessing a global in best case (e.g.
>> Cortex-A15)
>
> Why is TLS so fast on Cortex-A15?

I don't have exact numbers but the mrc instruction used to load the
thread pointer into a general purpose register has very variable
timings depending on which core you have. A15 is the fastest of the
Cortex-A cores in this regard.
Carlos O'Donell Sept. 26, 2013, 3:02 p.m. UTC | #7
On 09/25/2013 05:06 AM, Will Newton wrote:
> 
> Add support for pointer mangling in glibc internal structures in C
> and assembler code.
> 
> Tested on armv7 with hard and soft thread pointers.
> 
> ports/ChangeLog.arm:
> 
> 2013-09-24  Will Newton  <will.newton@linaro.org>
> 
> 	* sysdeps/arm/__longjmp.S (__longjmp): Demangle fp, sp
> 	and lr when restoring register values.
> 	* sysdeps/arm/include/bits/setjmp.h (JMP_BUF_REGLIST): Remove
> 	sp and lr from list and replace fp with a4.
> 	* sysdeps/arm/jmpbuf-unwind.h (_jmpbuf_sp): New function.
> 	(_JMPBUF_UNWINDS_ADJ): Call _jmpbuf_sp.
> 	* sysdeps/arm/nptl/tcb-offsets.sym: Add POINTER_GUARD.
> 	* sysdeps/arm/nptl/tls.h (tcbhead_t): Remove private and add
> 	pointer_guard. (THREAD_GET_POINTER_GUARD): New macro.
> 	(THREAD_SET_POINTER_GUARD): New macro.
> 	(THREAD_COPY_POINTER_GUARD): New macro.
> 	* sysdeps/arm/setjmp.S (__sigsetjmp): Mangle fp, sp and lr
> 	before storing register values.
> 	* sysdeps/unix/sysv/linux/arm/sysdep.h (PTR_MANGLE): New macro.
> 	(PTR_DEMANGLE): Likewise. (PTR_MANGLE2): Likewise.
> 	(PTR_DEMANGLE2): Likewise.

As of the fix for CVE-2013-4788 (bug 15754) there is now a
regression test that ensures the pointer guard varies with
each process and is indeed somewhat variable.

You will need to provide your own stackguard-macros.h file
with the appropriate macros including POINTER_CHK_GUARD
to allow tst-ptrguard1 and tst-ptrguard1-static to pass.

If these tests don't pass then you've got something wrong.

Cheers,
Carlos.
diff mbox

Patch

diff --git a/ports/sysdeps/arm/__longjmp.S b/ports/sysdeps/arm/__longjmp.S
index a5edede..2b1f7f4 100644
--- a/ports/sysdeps/arm/__longjmp.S
+++ b/ports/sysdeps/arm/__longjmp.S
@@ -34,10 +34,24 @@  ENTRY (__longjmp)
 	sfi_breg ip, \
 	ldr	r4, [\B, #32]	/* jmpbuf's sp */
 	cfi_undefined (r4)
+#ifdef PTR_DEMANGLE
+	PTR_DEMANGLE (r4, r4, a3, a4)
+#endif
 	CHECK_SP (r4)
 #endif
 	sfi_sp sfi_breg ip, \
 	ldmia	\B!, JMP_BUF_REGLIST
+#ifdef PTR_DEMANGLE
+	PTR_DEMANGLE (fp, a4, a3, a2)
+	ldr	a4, [ip], #4
+	PTR_DEMANGLE2 (sp, a4, a3)
+	ldr	a4, [ip], #4
+	PTR_DEMANGLE2 (lr, a4, a3)
+#else
+	mov	fp, a4
+	ldr	sp, [ip], #4
+	ldr	lr, [ip], #4
+#endif
 	cfi_restore (v1)
 	cfi_restore (v2)
 	cfi_restore (v3)
diff --git a/ports/sysdeps/arm/include/bits/setjmp.h b/ports/sysdeps/arm/include/bits/setjmp.h
index 1559d7b..64505dc 100644
--- a/ports/sysdeps/arm/include/bits/setjmp.h
+++ b/ports/sysdeps/arm/include/bits/setjmp.h
@@ -26,8 +26,9 @@ 

 #ifndef _ISOMAC
 /* Register list for a ldm/stm instruction to load/store
-   the general registers from a __jmp_buf.  */
-# define JMP_BUF_REGLIST	{v1-v6, sl, fp, sp, lr}
+   the general registers from a __jmp_buf. The a4 register
+   contains fp at this point.  */
+# define JMP_BUF_REGLIST	{a4, v1-v6, sl}

 /* Index of __jmp_buf where the sp register resides.  */
 # define __JMP_BUF_SP		8
diff --git a/ports/sysdeps/arm/jmpbuf-unwind.h b/ports/sysdeps/arm/jmpbuf-unwind.h
index 0863540..1b0d020 100644
--- a/ports/sysdeps/arm/jmpbuf-unwind.h
+++ b/ports/sysdeps/arm/jmpbuf-unwind.h
@@ -17,6 +17,7 @@ 

 #include <setjmp.h>
 #include <stdint.h>
+#include <sysdep.h>
 #include <unwind.h>

 /* Test if longjmp to JMPBUF would unwind the frame
@@ -27,8 +28,18 @@ 
 #define _JMPBUF_CFA_UNWINDS_ADJ(_jmpbuf, _context, _adj) \
   _JMPBUF_UNWINDS_ADJ (_jmpbuf, (void *) _Unwind_GetCFA (_context), _adj)

+static inline uintptr_t __attribute__ ((unused))
+_jmpbuf_sp (__jmp_buf regs)
+{
+  uintptr_t sp = regs[__JMP_BUF_SP];
+#ifdef PTR_DEMANGLE
+  PTR_DEMANGLE (sp);
+#endif
+  return sp;
+}
+
 #define _JMPBUF_UNWINDS_ADJ(_jmpbuf, _address, _adj) \
-  ((uintptr_t) (_address) - (_adj) < (uintptr_t) (_jmpbuf)[__JMP_BUF_SP] - (_adj))
+  ((uintptr_t) (_address) - (_adj) < _jmpbuf_sp (_jmpbuf) - (_adj))

 /* We use the normal longjmp for unwinding.  */
 #define __libc_unwind_longjmp(buf, val) __libc_longjmp (buf, val)
diff --git a/ports/sysdeps/arm/nptl/tcb-offsets.sym b/ports/sysdeps/arm/nptl/tcb-offsets.sym
index 92cc441..06d792f 100644
--- a/ports/sysdeps/arm/nptl/tcb-offsets.sym
+++ b/ports/sysdeps/arm/nptl/tcb-offsets.sym
@@ -9,3 +9,5 @@ 
 MULTIPLE_THREADS_OFFSET		thread_offsetof (header.multiple_threads)
 PID_OFFSET			thread_offsetof (pid)
 TID_OFFSET			thread_offsetof (tid)
+
+POINTER_GUARD			offsetof (tcbhead_t, pointer_guard)
diff --git a/ports/sysdeps/arm/nptl/tls.h b/ports/sysdeps/arm/nptl/tls.h
index da15027..e855507 100644
--- a/ports/sysdeps/arm/nptl/tls.h
+++ b/ports/sysdeps/arm/nptl/tls.h
@@ -56,7 +56,7 @@  typedef union dtv
 typedef struct
 {
   dtv_t *dtv;
-  void *private;
+  uintptr_t pointer_guard;
 } tcbhead_t;

 /* This is the size of the initial TCB.  */
@@ -119,6 +119,14 @@  typedef struct
 #define THREAD_SETMEM_NC(descr, member, idx, value) \
   descr->member[idx] = (value)

+/* Get/set the pointer guard field in TCB head.  */
+#define THREAD_GET_POINTER_GUARD() \
+  (((tcbhead_t *) __builtin_thread_pointer ())->pointer_guard)
+#define THREAD_SET_POINTER_GUARD(value) \
+  (((tcbhead_t *) __builtin_thread_pointer ())->pointer_guard = (value))
+# define THREAD_COPY_POINTER_GUARD(descr) \
+  (((tcbhead_t *) (descr + 1))->pointer_guard = THREAD_GET_POINTER_GUARD ())
+
 /* Get and set the global scope generation counter in struct pthread.  */
 #define THREAD_GSCOPE_FLAG_UNUSED 0
 #define THREAD_GSCOPE_FLAG_USED   1
diff --git a/ports/sysdeps/arm/setjmp.S b/ports/sysdeps/arm/setjmp.S
index a6c161d..b38b919 100644
--- a/ports/sysdeps/arm/setjmp.S
+++ b/ports/sysdeps/arm/setjmp.S
@@ -24,11 +24,25 @@ 
 #include <arm-features.h>

 ENTRY (__sigsetjmp)
+#ifdef PTR_MANGLE
+	PTR_MANGLE (a4, fp, a3, ip)
+#else
+	mov	a4, fp
+#endif
 	mov	ip, r0

 	/* Save registers */
 	sfi_breg ip, \
 	stmia	\B!, JMP_BUF_REGLIST
+#ifdef PTR_MANGLE
+	PTR_MANGLE2 (a4, sp, a3)
+	str	a4, [ip], #4
+	PTR_MANGLE2 (a4, lr, a3)
+	str	a4, [ip], #4
+#else
+	str	sp, [ip], #4
+	str	lr, [ip], #4
+#endif

 #if !defined ARM_ASSUME_NO_IWMMXT || defined __SOFTFP__
 # define NEED_HWCAP 1
diff --git a/ports/sysdeps/unix/sysv/linux/arm/sysdep.h b/ports/sysdeps/unix/sysv/linux/arm/sysdep.h
index b195d8e..5991641 100644
--- a/ports/sysdeps/unix/sysv/linux/arm/sysdep.h
+++ b/ports/sysdeps/unix/sysv/linux/arm/sysdep.h
@@ -435,8 +435,24 @@  __local_syscall_error:						\

 #endif	/* __ASSEMBLER__ */

-/* Pointer mangling is not yet supported for ARM.  */
-#define PTR_MANGLE(var) (void) (var)
-#define PTR_DEMANGLE(var) (void) (var)
+/* Pointer mangling support.  */
+#if defined NOT_IN_libc && defined IS_IN_rtld
+/* We cannot use the thread descriptor because in ld.so we use setjmp
+   earlier than the descriptor is initialized.  */
+#else
+# ifdef __ASSEMBLER__
+#  define PTR_MANGLE(dst, src, guard, tmp)				\
+  mov tmp, r0; GET_TLS (guard); ldr guard, [r0, POINTER_GUARD];		\
+  eor dst, src, guard; mov r0, tmp
+#  define PTR_MANGLE2(dst, src, guard)	eor dst, src, guard
+#  define PTR_DEMANGLE(dst, src, guard, tmp)	\
+  PTR_MANGLE (dst, src, guard, tmp)
+#  define PTR_DEMANGLE2(dst, src, guard)	PTR_MANGLE2 (dst, src, guard)
+# else
+#  define PTR_MANGLE(var) \
+  (var) = (__typeof (var)) ((uintptr_t) (var) ^ THREAD_GET_POINTER_GUARD ())
+#  define PTR_DEMANGLE(var)     PTR_MANGLE (var)
+# endif
+#endif

 #endif /* linux/arm/sysdep.h */