mbox series

[v8,00/38] arm64/gcs: Provide support for GCS in userspace

Message ID 20240203-arm64-gcs-v8-0-c9fec77673ef@kernel.org
Headers show
Series arm64/gcs: Provide support for GCS in userspace | expand

Message

Mark Brown Feb. 3, 2024, 12:25 p.m. UTC
The arm64 Guarded Control Stack (GCS) feature provides support for
hardware protected stacks of return addresses, intended to provide
hardening against return oriented programming (ROP) attacks and to make
it easier to gather call stacks for applications such as profiling.

When GCS is active a secondary stack called the Guarded Control Stack is
maintained, protected with a memory attribute which means that it can
only be written with specific GCS operations.  The current GCS pointer
can not be directly written to by userspace.  When a BL is executed the
value stored in LR is also pushed onto the GCS, and when a RET is
executed the top of the GCS is popped and compared to LR with a fault
being raised if the values do not match.  GCS operations may only be
performed on GCS pages, a data abort is generated if they are not.

The combination of hardware enforcement and lack of extra instructions
in the function entry and exit paths should result in something which
has less overhead and is more difficult to attack than a purely software
implementation like clang's shadow stacks.

This series implements support for use of GCS by userspace, along with
support for use of GCS within KVM guests.  It does not enable use of GCS
by either EL1 or EL2, this will be implemented separately.  Executables
are started without GCS and must use a prctl() to enable it, it is
expected that this will be done very early in application execution by
the dynamic linker or other startup code.  For dynamic linking this will
be done by checking that everything in the executable is marked as GCS
compatible.

x86 has an equivalent feature called shadow stacks, this series depends
on the x86 patches for generic memory management support for the new
guarded/shadow stack page type and shares APIs as much as possible.  As
there has been extensive discussion with the wider community around the
ABI for shadow stacks I have as far as practical kept implementation
decisions close to those for x86, anticipating that review would lead to
similar conclusions in the absence of strong reasoning for divergence.

The main divergence I am concious of is that x86 allows shadow stack to
be enabled and disabled repeatedly, freeing the shadow stack for the
thread whenever disabled, while this implementation keeps the GCS
allocated after disable but refuses to reenable it.  This is to avoid
races with things actively walking the GCS during a disable, we do
anticipate that some systems will wish to disable GCS at runtime but are
not aware of any demand for subsequently reenabling it.

x86 uses an arch_prctl() to manage enable and disable, since only x86
and S/390 use arch_prctl() a generic prctl() was proposed[1] as part of a
patch set for the equivalent RISC-V Zicfiss feature which I initially
adopted fairly directly but following review feedback has been revised
quite a bit.

We currently maintain the x86 pattern of implicitly allocating a shadow
stack for threads started with shadow stack enabled, there has been some
discussion of removing this support and requiring the use of clone3()
with explicit allocation of shadow stacks instead.  I have no strong
feelings either way, implicit allocation is not really consistent with
anything else we do and creates the potential for errors around thread
exit but on the other hand it is existing ABI on x86 and minimises the
changes needed in userspace code.

There is an open issue with support for CRIU, on x86 this required the
ability to set the GCS mode via ptrace.  This series supports
configuring mode bits other than enable/disable via ptrace but it needs
to be confirmed if this is sufficient.

The series depends on support for shadow stacks in clone3(), that series
includes the addition of ARCH_HAS_USER_SHADOW_STACK.

   https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@kernel.org

It also depends on the addition of more waitpid() flags to nolibc:

   https://lore.kernel.org/r/20231023-nolibc-waitpid-flags-v2-1-b09d096f091f@kernel.org

You can see a branch with the full set of dependencies against Linus'
tree at:

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/misc.git arm64-gcs

[1] https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/

Signed-off-by: Mark Brown <broonie@kernel.org>
---
Changes in v8:
- Invalidate signal cap token on stack when consuming.
- Typo and other trivial fixes.
- Don't try to use process_vm_write() on GCS, it intentionally does not
  work.
- Fix leak of thread GCSs.
- Rebase onto latest clone3() series.
- Link to v7: https://lore.kernel.org/r/20231122-arm64-gcs-v7-0-201c483bd775@kernel.org

Changes in v7:
- Rebase onto v6.7-rc2 via the clone3() patch series.
- Change the token used to cap the stack during signal handling to be
  compatible with GCSPOPM.
- Fix flags for new page types.
- Fold in support for clone3().
- Replace copy_to_user_gcs() with put_user_gcs().
- Link to v6: https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org

Changes in v6:
- Rebase onto v6.6-rc3.
- Add some more gcsb_dsync() barriers following spec clarifications.
- Due to ongoing discussion around clone()/clone3() I've not updated
  anything there, the behaviour is the same as on previous versions.
- Link to v5: https://lore.kernel.org/r/20230822-arm64-gcs-v5-0-9ef181dd6324@kernel.org

Changes in v5:
- Don't map any permissions for user GCSs, we always use EL0 accessors
  or use a separate mapping of the page.
- Reduce the standard size of the GCS to RLIMIT_STACK/2.
- Enforce a PAGE_SIZE alignment requirement on map_shadow_stack().
- Clarifications and fixes to documentation.
- More tests.
- Link to v4: https://lore.kernel.org/r/20230807-arm64-gcs-v4-0-68cfa37f9069@kernel.org

Changes in v4:
- Implement flags for map_shadow_stack() allowing the cap and end of
  stack marker to be enabled independently or not at all.
- Relax size and alignment requirements for map_shadow_stack().
- Add more blurb explaining the advantages of hardware enforcement.
- Link to v3: https://lore.kernel.org/r/20230731-arm64-gcs-v3-0-cddf9f980d98@kernel.org

Changes in v3:
- Rebase onto v6.5-rc4.
- Add a GCS barrier on context switch.
- Add a GCS stress test.
- Link to v2: https://lore.kernel.org/r/20230724-arm64-gcs-v2-0-dc2c1d44c2eb@kernel.org

Changes in v2:
- Rebase onto v6.5-rc3.
- Rework prctl() interface to allow each bit to be locked independently.
- map_shadow_stack() now places the cap token based on the size
  requested by the caller not the actual space allocated.
- Mode changes other than enable via ptrace are now supported.
- Expand test coverage.
- Various smaller fixes and adjustments.
- Link to v1: https://lore.kernel.org/r/20230716-arm64-gcs-v1-0-bf567f93bba6@kernel.org

---
Mark Brown (38):
      arm64/mm: Restructure arch_validate_flags() for extensibility
      prctl: arch-agnostic prctl for shadow stack
      mman: Add map_shadow_stack() flags
      arm64: Document boot requirements for Guarded Control Stacks
      arm64/gcs: Document the ABI for Guarded Control Stacks
      arm64/sysreg: Add definitions for architected GCS caps
      arm64/gcs: Add manual encodings of GCS instructions
      arm64/gcs: Provide put_user_gcs()
      arm64/cpufeature: Runtime detection of Guarded Control Stack (GCS)
      arm64/mm: Allocate PIE slots for EL0 guarded control stack
      mm: Define VM_SHADOW_STACK for arm64 when we support GCS
      arm64/mm: Map pages for guarded control stack
      KVM: arm64: Manage GCS registers for guests
      arm64/gcs: Allow GCS usage at EL0 and EL1
      arm64/idreg: Add overrride for GCS
      arm64/hwcap: Add hwcap for GCS
      arm64/traps: Handle GCS exceptions
      arm64/mm: Handle GCS data aborts
      arm64/gcs: Context switch GCS state for EL0
      arm64/gcs: Ensure that new threads have a GCS
      arm64/gcs: Implement shadow stack prctl() interface
      arm64/mm: Implement map_shadow_stack()
      arm64/signal: Set up and restore the GCS context for signal handlers
      arm64/signal: Expose GCS state in signal frames
      arm64/ptrace: Expose GCS via ptrace and core files
      arm64: Add Kconfig for Guarded Control Stack (GCS)
      kselftest/arm64: Verify the GCS hwcap
      kselftest/arm64: Add GCS as a detected feature in the signal tests
      kselftest/arm64: Add framework support for GCS to signal handling tests
      kselftest/arm64: Allow signals tests to specify an expected si_code
      kselftest/arm64: Always run signals tests with GCS enabled
      kselftest/arm64: Add very basic GCS test program
      kselftest/arm64: Add a GCS test program built with the system libc
      kselftest/arm64: Add test coverage for GCS mode locking
      selftests/arm64: Add GCS signal tests
      kselftest/arm64: Add a GCS stress test
      kselftest/arm64: Enable GCS for the FP stress tests
      kselftest: Provide shadow stack enable helpers for arm64

 Documentation/admin-guide/kernel-parameters.txt    |   6 +
 Documentation/arch/arm64/booting.rst               |  22 +
 Documentation/arch/arm64/elf_hwcaps.rst            |   3 +
 Documentation/arch/arm64/gcs.rst                   | 233 +++++++
 Documentation/arch/arm64/index.rst                 |   1 +
 Documentation/filesystems/proc.rst                 |   2 +-
 arch/arm64/Kconfig                                 |  20 +
 arch/arm64/include/asm/cpufeature.h                |   6 +
 arch/arm64/include/asm/el2_setup.h                 |  17 +
 arch/arm64/include/asm/esr.h                       |  28 +-
 arch/arm64/include/asm/exception.h                 |   2 +
 arch/arm64/include/asm/gcs.h                       | 107 +++
 arch/arm64/include/asm/hwcap.h                     |   1 +
 arch/arm64/include/asm/kvm_arm.h                   |   4 +-
 arch/arm64/include/asm/kvm_host.h                  |  12 +
 arch/arm64/include/asm/mman.h                      |  23 +-
 arch/arm64/include/asm/pgtable-prot.h              |  14 +-
 arch/arm64/include/asm/processor.h                 |   7 +
 arch/arm64/include/asm/sysreg.h                    |  20 +
 arch/arm64/include/asm/uaccess.h                   |  40 ++
 arch/arm64/include/uapi/asm/hwcap.h                |   1 +
 arch/arm64/include/uapi/asm/ptrace.h               |   8 +
 arch/arm64/include/uapi/asm/sigcontext.h           |   9 +
 arch/arm64/kernel/cpufeature.c                     |  19 +
 arch/arm64/kernel/cpuinfo.c                        |   1 +
 arch/arm64/kernel/entry-common.c                   |  23 +
 arch/arm64/kernel/idreg-override.c                 |   2 +
 arch/arm64/kernel/process.c                        |  85 +++
 arch/arm64/kernel/ptrace.c                         |  59 ++
 arch/arm64/kernel/signal.c                         | 242 ++++++-
 arch/arm64/kernel/traps.c                          |  11 +
 arch/arm64/kvm/emulate-nested.c                    |   4 +
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h         |  17 +
 arch/arm64/kvm/sys_regs.c                          |  22 +
 arch/arm64/mm/Makefile                             |   1 +
 arch/arm64/mm/fault.c                              |  79 ++-
 arch/arm64/mm/gcs.c                                | 300 +++++++++
 arch/arm64/mm/mmap.c                               |  13 +-
 arch/arm64/tools/cpucaps                           |   1 +
 arch/x86/include/uapi/asm/mman.h                   |   3 -
 fs/proc/task_mmu.c                                 |   3 +
 include/linux/mm.h                                 |  16 +-
 include/uapi/asm-generic/mman.h                    |   4 +
 include/uapi/linux/elf.h                           |   1 +
 include/uapi/linux/prctl.h                         |  22 +
 kernel/sys.c                                       |  30 +
 tools/testing/selftests/arm64/Makefile             |   2 +-
 tools/testing/selftests/arm64/abi/hwcap.c          |  19 +
 tools/testing/selftests/arm64/fp/assembler.h       |  15 +
 tools/testing/selftests/arm64/fp/fpsimd-test.S     |   2 +
 tools/testing/selftests/arm64/fp/sve-test.S        |   2 +
 tools/testing/selftests/arm64/fp/za-test.S         |   2 +
 tools/testing/selftests/arm64/fp/zt-test.S         |   2 +
 tools/testing/selftests/arm64/gcs/.gitignore       |   5 +
 tools/testing/selftests/arm64/gcs/Makefile         |  24 +
 tools/testing/selftests/arm64/gcs/asm-offsets.h    |   0
 tools/testing/selftests/arm64/gcs/basic-gcs.c      | 428 ++++++++++++
 tools/testing/selftests/arm64/gcs/gcs-locking.c    | 200 ++++++
 .../selftests/arm64/gcs/gcs-stress-thread.S        | 311 +++++++++
 tools/testing/selftests/arm64/gcs/gcs-stress.c     | 532 +++++++++++++++
 tools/testing/selftests/arm64/gcs/gcs-util.h       | 100 +++
 tools/testing/selftests/arm64/gcs/libc-gcs.c       | 736 +++++++++++++++++++++
 tools/testing/selftests/arm64/signal/.gitignore    |   1 +
 .../testing/selftests/arm64/signal/test_signals.c  |  17 +-
 .../testing/selftests/arm64/signal/test_signals.h  |   6 +
 .../selftests/arm64/signal/test_signals_utils.c    |  32 +-
 .../selftests/arm64/signal/test_signals_utils.h    |  39 ++
 .../arm64/signal/testcases/gcs_exception_fault.c   |  62 ++
 .../selftests/arm64/signal/testcases/gcs_frame.c   |  88 +++
 .../arm64/signal/testcases/gcs_write_fault.c       |  67 ++
 .../selftests/arm64/signal/testcases/testcases.c   |   7 +
 .../selftests/arm64/signal/testcases/testcases.h   |   1 +
 tools/testing/selftests/ksft_shstk.h               |  37 ++
 73 files changed, 4241 insertions(+), 40 deletions(-)
---
base-commit: 50abefbf1bc07f5c4e403fd28f71dcee855100f7
change-id: 20230303-arm64-gcs-e311ab0d8729

Best regards,

Comments

Stefan O'Rear Feb. 20, 2024, 4:36 p.m. UTC | #1
On Sat, Feb 3, 2024, at 7:25 AM, Mark Brown wrote:
> The arm64 Guarded Control Stack (GCS) feature provides support for
> hardware protected stacks of return addresses, intended to provide
> hardening against return oriented programming (ROP) attacks and to make
> it easier to gather call stacks for applications such as profiling.
>
> When GCS is active a secondary stack called the Guarded Control Stack is
> maintained, protected with a memory attribute which means that it can
> only be written with specific GCS operations.  The current GCS pointer
> can not be directly written to by userspace.  When a BL is executed the
> value stored in LR is also pushed onto the GCS, and when a RET is
> executed the top of the GCS is popped and compared to LR with a fault
> being raised if the values do not match.  GCS operations may only be
> performed on GCS pages, a data abort is generated if they are not.
>
> The combination of hardware enforcement and lack of extra instructions
> in the function entry and exit paths should result in something which
> has less overhead and is more difficult to attack than a purely software
> implementation like clang's shadow stacks.
>
> This series implements support for use of GCS by userspace, along with
> support for use of GCS within KVM guests.  It does not enable use of GCS
> by either EL1 or EL2, this will be implemented separately.  Executables
> are started without GCS and must use a prctl() to enable it, it is
> expected that this will be done very early in application execution by
> the dynamic linker or other startup code.  For dynamic linking this will
> be done by checking that everything in the executable is marked as GCS
> compatible.
>
> x86 has an equivalent feature called shadow stacks, this series depends
> on the x86 patches for generic memory management support for the new
> guarded/shadow stack page type and shares APIs as much as possible.  As
> there has been extensive discussion with the wider community around the
> ABI for shadow stacks I have as far as practical kept implementation
> decisions close to those for x86, anticipating that review would lead to
> similar conclusions in the absence of strong reasoning for divergence.
>
> The main divergence I am concious of is that x86 allows shadow stack to
> be enabled and disabled repeatedly, freeing the shadow stack for the
> thread whenever disabled, while this implementation keeps the GCS
> allocated after disable but refuses to reenable it.  This is to avoid
> races with things actively walking the GCS during a disable, we do
> anticipate that some systems will wish to disable GCS at runtime but are
> not aware of any demand for subsequently reenabling it.
>
> x86 uses an arch_prctl() to manage enable and disable, since only x86
> and S/390 use arch_prctl() a generic prctl() was proposed[1] as part of a
> patch set for the equivalent RISC-V Zicfiss feature which I initially
> adopted fairly directly but following review feedback has been revised
> quite a bit.

While discussing the ABI implications of shadow stacks in the context of
Zicfiss and musl a few days ago, I had the following idea for how to solve
the source compatibility problems with shadow stacks in POSIX.1-2004 and
POSIX.1-2017:

1. Introduce a "flexible shadow stack handling" option.  For what follows,
   it doesn't matter if this is system-wide, per-mm, or per-vma.

2. Shadow stack faults on non-shadow stack pages, if flexible shadow stack
   handling is in effect, cause the affected page to become a shadow stack
   page.  When this happens, the page filled with invalid address tokens.

   Faults from non-shadow-stack accesses to a shadow-stack page which was
   created by the previous paragraph will cause the page to revert to
   non-shadow-stack usage, with or without clearing.

   Important: a shadow stack operation can only load a valid address from
   a page if that page has been in continuous shadow stack use since the
   address was written by another shadow stack operation; the flexibility
   delays error reporting in cases of stray writes but it never allows for
   corruption of shadow stack operation.

3. Standards-defined operations which use a user-provided stack
   (makecontext, sigaltstack, pthread_attr_setstack) use a subrange of the
   provided stack for shadow stack storage.  I propose to use a shadow
   stack size of 1/32 of the provided stack size, rounded up to a positive
   integer number of pages, and place the shadow stack allocation at the
   lowest page-aligned address inside the provided stack region.

   Since page usage is flexible, no change in page permissions is
   immediately needed; this merely sets the initial shadow stack pointer for
   the new context.

   If the shadow stack grew in the opposite direction to the architectural
   stack, it would not be necessary to pick a fixed direction.

4. SIGSTKSZ and MINSIGSTKSZ are increased by 2 pages to provide sufficient
   space for a minimum-sized shadow stack region and worst case alignment.

_Without_ doing this, sigaltstack cannot be used to recover from stack
overflows if the shadow stack limit is reached first, and makecontext
cannot be supported without memory leaks and unreportable error conditions.

Kernel-allocated shadow stacks with a unique VM type are still useful since
they allows stray writes to crash at the time the stray write is performed,
rather than delaying the crash until the next shadow stack read.

The pthread and makecontext changes could be purely libc side, but we would
need kernel support for sigaltstack and page usage changes.

Luckily, there is no need to support stacks which are simultaneously used
from more than one process, so "is this a shadow stack page" can be tracked
purely at the vma/pte level without any need to involve the inode.  POSIX
explicitly allows using mmap to obtain stack memory and does not forbid
MAP_SHARED; I consider this sufficiently perverse application behavior that
it is not necessary to ensure exclusive use of the underlying pages while
a shadow stack pte exists.  (Applications that use MAP_SHARED for stacks
do not get the full benefit of the shadow stack but they keep POSIX.1-2004
conformance, applications that allocate stacks exclusively in MAP_PRIVATE
memory lose no security.)

The largest complication of this scheme is likely to be that the shadow
stack usage property of a page needs to survive the page being swapped out
and back in, which likely means that it must be present in the swap PTE.

I am substantially less familiar with GCS and SHSTK than with Zicfiss.
It is likely that a syscall or other mechanism is needed to initialize the
shadow stack in flexible memory for makecontext.

Is there interest on the kernel side on having mechanisms to fully support
POSIX.1-2004 with GCS or Zicfiss enabled?

-s

> We currently maintain the x86 pattern of implicitly allocating a shadow
> stack for threads started with shadow stack enabled, there has been some
> discussion of removing this support and requiring the use of clone3()
> with explicit allocation of shadow stacks instead.  I have no strong
> feelings either way, implicit allocation is not really consistent with
> anything else we do and creates the potential for errors around thread
> exit but on the other hand it is existing ABI on x86 and minimises the
> changes needed in userspace code.
>
> There is an open issue with support for CRIU, on x86 this required the
> ability to set the GCS mode via ptrace.  This series supports
> configuring mode bits other than enable/disable via ptrace but it needs
> to be confirmed if this is sufficient.
>
> The series depends on support for shadow stacks in clone3(), that series
> includes the addition of ARCH_HAS_USER_SHADOW_STACK.
>
>    
> https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@kernel.org
>
> It also depends on the addition of more waitpid() flags to nolibc:
>
>    
> https://lore.kernel.org/r/20231023-nolibc-waitpid-flags-v2-1-b09d096f091f@kernel.org
>
> You can see a branch with the full set of dependencies against Linus'
> tree at:
>
>    https://git.kernel.org/pub/scm/linux/kernel/git/broonie/misc.git arm64-gcs
>
> [1] https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/
>
> Signed-off-by: Mark Brown <broonie@kernel.org>
> ---
> Changes in v8:
> - Invalidate signal cap token on stack when consuming.
> - Typo and other trivial fixes.
> - Don't try to use process_vm_write() on GCS, it intentionally does not
>   work.
> - Fix leak of thread GCSs.
> - Rebase onto latest clone3() series.
> - Link to v7: 
> https://lore.kernel.org/r/20231122-arm64-gcs-v7-0-201c483bd775@kernel.org
>
> Changes in v7:
> - Rebase onto v6.7-rc2 via the clone3() patch series.
> - Change the token used to cap the stack during signal handling to be
>   compatible with GCSPOPM.
> - Fix flags for new page types.
> - Fold in support for clone3().
> - Replace copy_to_user_gcs() with put_user_gcs().
> - Link to v6: 
> https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org
>
> Changes in v6:
> - Rebase onto v6.6-rc3.
> - Add some more gcsb_dsync() barriers following spec clarifications.
> - Due to ongoing discussion around clone()/clone3() I've not updated
>   anything there, the behaviour is the same as on previous versions.
> - Link to v5: 
> https://lore.kernel.org/r/20230822-arm64-gcs-v5-0-9ef181dd6324@kernel.org
>
> Changes in v5:
> - Don't map any permissions for user GCSs, we always use EL0 accessors
>   or use a separate mapping of the page.
> - Reduce the standard size of the GCS to RLIMIT_STACK/2.
> - Enforce a PAGE_SIZE alignment requirement on map_shadow_stack().
> - Clarifications and fixes to documentation.
> - More tests.
> - Link to v4: 
> https://lore.kernel.org/r/20230807-arm64-gcs-v4-0-68cfa37f9069@kernel.org
>
> Changes in v4:
> - Implement flags for map_shadow_stack() allowing the cap and end of
>   stack marker to be enabled independently or not at all.
> - Relax size and alignment requirements for map_shadow_stack().
> - Add more blurb explaining the advantages of hardware enforcement.
> - Link to v3: 
> https://lore.kernel.org/r/20230731-arm64-gcs-v3-0-cddf9f980d98@kernel.org
>
> Changes in v3:
> - Rebase onto v6.5-rc4.
> - Add a GCS barrier on context switch.
> - Add a GCS stress test.
> - Link to v2: 
> https://lore.kernel.org/r/20230724-arm64-gcs-v2-0-dc2c1d44c2eb@kernel.org
>
> Changes in v2:
> - Rebase onto v6.5-rc3.
> - Rework prctl() interface to allow each bit to be locked independently.
> - map_shadow_stack() now places the cap token based on the size
>   requested by the caller not the actual space allocated.
> - Mode changes other than enable via ptrace are now supported.
> - Expand test coverage.
> - Various smaller fixes and adjustments.
> - Link to v1: 
> https://lore.kernel.org/r/20230716-arm64-gcs-v1-0-bf567f93bba6@kernel.org
>
> ---
> Mark Brown (38):
>       arm64/mm: Restructure arch_validate_flags() for extensibility
>       prctl: arch-agnostic prctl for shadow stack
>       mman: Add map_shadow_stack() flags
>       arm64: Document boot requirements for Guarded Control Stacks
>       arm64/gcs: Document the ABI for Guarded Control Stacks
>       arm64/sysreg: Add definitions for architected GCS caps
>       arm64/gcs: Add manual encodings of GCS instructions
>       arm64/gcs: Provide put_user_gcs()
>       arm64/cpufeature: Runtime detection of Guarded Control Stack (GCS)
>       arm64/mm: Allocate PIE slots for EL0 guarded control stack
>       mm: Define VM_SHADOW_STACK for arm64 when we support GCS
>       arm64/mm: Map pages for guarded control stack
>       KVM: arm64: Manage GCS registers for guests
>       arm64/gcs: Allow GCS usage at EL0 and EL1
>       arm64/idreg: Add overrride for GCS
>       arm64/hwcap: Add hwcap for GCS
>       arm64/traps: Handle GCS exceptions
>       arm64/mm: Handle GCS data aborts
>       arm64/gcs: Context switch GCS state for EL0
>       arm64/gcs: Ensure that new threads have a GCS
>       arm64/gcs: Implement shadow stack prctl() interface
>       arm64/mm: Implement map_shadow_stack()
>       arm64/signal: Set up and restore the GCS context for signal handlers
>       arm64/signal: Expose GCS state in signal frames
>       arm64/ptrace: Expose GCS via ptrace and core files
>       arm64: Add Kconfig for Guarded Control Stack (GCS)
>       kselftest/arm64: Verify the GCS hwcap
>       kselftest/arm64: Add GCS as a detected feature in the signal tests
>       kselftest/arm64: Add framework support for GCS to signal handling tests
>       kselftest/arm64: Allow signals tests to specify an expected si_code
>       kselftest/arm64: Always run signals tests with GCS enabled
>       kselftest/arm64: Add very basic GCS test program
>       kselftest/arm64: Add a GCS test program built with the system libc
>       kselftest/arm64: Add test coverage for GCS mode locking
>       selftests/arm64: Add GCS signal tests
>       kselftest/arm64: Add a GCS stress test
>       kselftest/arm64: Enable GCS for the FP stress tests
>       kselftest: Provide shadow stack enable helpers for arm64
>
>  Documentation/admin-guide/kernel-parameters.txt    |   6 +
>  Documentation/arch/arm64/booting.rst               |  22 +
>  Documentation/arch/arm64/elf_hwcaps.rst            |   3 +
>  Documentation/arch/arm64/gcs.rst                   | 233 +++++++
>  Documentation/arch/arm64/index.rst                 |   1 +
>  Documentation/filesystems/proc.rst                 |   2 +-
>  arch/arm64/Kconfig                                 |  20 +
>  arch/arm64/include/asm/cpufeature.h                |   6 +
>  arch/arm64/include/asm/el2_setup.h                 |  17 +
>  arch/arm64/include/asm/esr.h                       |  28 +-
>  arch/arm64/include/asm/exception.h                 |   2 +
>  arch/arm64/include/asm/gcs.h                       | 107 +++
>  arch/arm64/include/asm/hwcap.h                     |   1 +
>  arch/arm64/include/asm/kvm_arm.h                   |   4 +-
>  arch/arm64/include/asm/kvm_host.h                  |  12 +
>  arch/arm64/include/asm/mman.h                      |  23 +-
>  arch/arm64/include/asm/pgtable-prot.h              |  14 +-
>  arch/arm64/include/asm/processor.h                 |   7 +
>  arch/arm64/include/asm/sysreg.h                    |  20 +
>  arch/arm64/include/asm/uaccess.h                   |  40 ++
>  arch/arm64/include/uapi/asm/hwcap.h                |   1 +
>  arch/arm64/include/uapi/asm/ptrace.h               |   8 +
>  arch/arm64/include/uapi/asm/sigcontext.h           |   9 +
>  arch/arm64/kernel/cpufeature.c                     |  19 +
>  arch/arm64/kernel/cpuinfo.c                        |   1 +
>  arch/arm64/kernel/entry-common.c                   |  23 +
>  arch/arm64/kernel/idreg-override.c                 |   2 +
>  arch/arm64/kernel/process.c                        |  85 +++
>  arch/arm64/kernel/ptrace.c                         |  59 ++
>  arch/arm64/kernel/signal.c                         | 242 ++++++-
>  arch/arm64/kernel/traps.c                          |  11 +
>  arch/arm64/kvm/emulate-nested.c                    |   4 +
>  arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h         |  17 +
>  arch/arm64/kvm/sys_regs.c                          |  22 +
>  arch/arm64/mm/Makefile                             |   1 +
>  arch/arm64/mm/fault.c                              |  79 ++-
>  arch/arm64/mm/gcs.c                                | 300 +++++++++
>  arch/arm64/mm/mmap.c                               |  13 +-
>  arch/arm64/tools/cpucaps                           |   1 +
>  arch/x86/include/uapi/asm/mman.h                   |   3 -
>  fs/proc/task_mmu.c                                 |   3 +
>  include/linux/mm.h                                 |  16 +-
>  include/uapi/asm-generic/mman.h                    |   4 +
>  include/uapi/linux/elf.h                           |   1 +
>  include/uapi/linux/prctl.h                         |  22 +
>  kernel/sys.c                                       |  30 +
>  tools/testing/selftests/arm64/Makefile             |   2 +-
>  tools/testing/selftests/arm64/abi/hwcap.c          |  19 +
>  tools/testing/selftests/arm64/fp/assembler.h       |  15 +
>  tools/testing/selftests/arm64/fp/fpsimd-test.S     |   2 +
>  tools/testing/selftests/arm64/fp/sve-test.S        |   2 +
>  tools/testing/selftests/arm64/fp/za-test.S         |   2 +
>  tools/testing/selftests/arm64/fp/zt-test.S         |   2 +
>  tools/testing/selftests/arm64/gcs/.gitignore       |   5 +
>  tools/testing/selftests/arm64/gcs/Makefile         |  24 +
>  tools/testing/selftests/arm64/gcs/asm-offsets.h    |   0
>  tools/testing/selftests/arm64/gcs/basic-gcs.c      | 428 ++++++++++++
>  tools/testing/selftests/arm64/gcs/gcs-locking.c    | 200 ++++++
>  .../selftests/arm64/gcs/gcs-stress-thread.S        | 311 +++++++++
>  tools/testing/selftests/arm64/gcs/gcs-stress.c     | 532 +++++++++++++++
>  tools/testing/selftests/arm64/gcs/gcs-util.h       | 100 +++
>  tools/testing/selftests/arm64/gcs/libc-gcs.c       | 736 +++++++++++++++++++++
>  tools/testing/selftests/arm64/signal/.gitignore    |   1 +
>  .../testing/selftests/arm64/signal/test_signals.c  |  17 +-
>  .../testing/selftests/arm64/signal/test_signals.h  |   6 +
>  .../selftests/arm64/signal/test_signals_utils.c    |  32 +-
>  .../selftests/arm64/signal/test_signals_utils.h    |  39 ++
>  .../arm64/signal/testcases/gcs_exception_fault.c   |  62 ++
>  .../selftests/arm64/signal/testcases/gcs_frame.c   |  88 +++
>  .../arm64/signal/testcases/gcs_write_fault.c       |  67 ++
>  .../selftests/arm64/signal/testcases/testcases.c   |   7 +
>  .../selftests/arm64/signal/testcases/testcases.h   |   1 +
>  tools/testing/selftests/ksft_shstk.h               |  37 ++
>  73 files changed, 4241 insertions(+), 40 deletions(-)
> ---
> base-commit: 50abefbf1bc07f5c4e403fd28f71dcee855100f7
> change-id: 20230303-arm64-gcs-e311ab0d8729
>
> Best regards,
> -- 
> Mark Brown <broonie@kernel.org>
>
>
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv
dalias@libc.org Feb. 20, 2024, 6:57 p.m. UTC | #2
On Tue, Feb 20, 2024 at 06:41:05PM +0000, Edgecombe, Rick P wrote:
> Hi,
> 
> I worked on the x86 kernel shadow stack support. I think it is an
> interesting suggestion. Some questions below, and I will think more on
> it.
> 
> On Tue, 2024-02-20 at 11:36 -0500, Stefan O'Rear wrote:
> > While discussing the ABI implications of shadow stacks in the context
> > of
> > Zicfiss and musl a few days ago, I had the following idea for how to
> > solve
> > the source compatibility problems with shadow stacks in POSIX.1-2004
> > and
> > POSIX.1-2017:
> > 
> > 1. Introduce a "flexible shadow stack handling" option.  For what
> > follows,
> >    it doesn't matter if this is system-wide, per-mm, or per-vma.
> > 
> > 2. Shadow stack faults on non-shadow stack pages, if flexible shadow
> > stack
> >    handling is in effect, cause the affected page to become a shadow
> > stack
> >    page.  When this happens, the page filled with invalid address
> > tokens.
> 
> Hmm, could the shadow stack underflow onto the real stack then? Not
> sure how bad that is. INCSSP (incrementing the SSP register on x86)
> loops are not rare so it seems like something that could happen.

Shadow stack underflow should fault on attempt to access
non-shadow-stack memory as shadow-stack, no?

> >    Faults from non-shadow-stack accesses to a shadow-stack page which
> > was
> >    created by the previous paragraph will cause the page to revert to
> >    non-shadow-stack usage, with or without clearing.
> 
> Won't this prevent catching stack overflows when they happen? An
> overflow will just turn the shadow stack into normal stack and only get
> detected when the shadow stack unwinds?

I don't think that's as big a problem as it sounds like. It might make
pinpointing the spot at which things went wrong take a little bit more
work, but it should not admit any wrong-execution.

> A related question would be how to handle the expanding nature of the
> initial stack. I guess the initial stack could be special and have a
> separate shadow stack.

That seems fine.

> >    Important: a shadow stack operation can only load a valid address
> > from
> >    a page if that page has been in continuous shadow stack use since
> > the
> >    address was written by another shadow stack operation; the
> > flexibility
> >    delays error reporting in cases of stray writes but it never
> > allows for
> >    corruption of shadow stack operation.
> 
> Shadow stacks currently have automatic guard gaps to try to prevent one
> thread from overflowing onto another thread's shadow stack. This would
> somewhat opens that up, as the stack guard gaps are usually maintained
> by userspace for new threads. It would have to be thought through if
> these could still be enforced with checking at additional spots.

I would think the existing guard pages would already do that if a
thread's shadow stack is contiguous with its own data stack.

> > 3. Standards-defined operations which use a user-provided stack
> >    (makecontext, sigaltstack, pthread_attr_setstack) use a subrange
> > of the
> >    provided stack for shadow stack storage.  I propose to use a
> > shadow
> >    stack size of 1/32 of the provided stack size, rounded up to a
> > positive
> >    integer number of pages, and place the shadow stack allocation at
> > the
> >    lowest page-aligned address inside the provided stack region.
> > 
> >    Since page usage is flexible, no change in page permissions is
> >    immediately needed; this merely sets the initial shadow stack
> > pointer for
> >    the new context.
> > 
> >    If the shadow stack grew in the opposite direction to the
> > architectural
> >    stack, it would not be necessary to pick a fixed direction.
> > 
> > 4. SIGSTKSZ and MINSIGSTKSZ are increased by 2 pages to provide
> > sufficient
> >    space for a minimum-sized shadow stack region and worst case
> > alignment.
> 
> Do all makecontext() callers ensure the size is greater than this?
> 
> I guess glibc's makecontext() could do this scheme to prevent leaking
> without any changes to the kernel. Basically steal a little of the
> stack address range and overwrite it with a shadow stack mapping. But
> only if the apps leave enough room. If they need to be updated, then
> they could be updated to manage their own shadow stacks too I think.
Edgecombe, Rick P Feb. 20, 2024, 11:30 p.m. UTC | #3
On Tue, 2024-02-20 at 20:14 +0000, Mark Brown wrote:
> > Hmm, could the shadow stack underflow onto the real stack then? Not
> > sure how bad that is. INCSSP (incrementing the SSP register on x86)
> > loops are not rare so it seems like something that could happen.
> 
> Yes, they'd trash any pages of normal stack they touch as they do so
> but
> otherwise seems similar to overflow.

I was thinking in the normal buffer overflow case there is a guard gap
at the end of the stack, but in this case the shadow stack is directly
adjacent to the regular stack. It's probably a minor point.
Edgecombe, Rick P Feb. 21, 2024, 12:35 a.m. UTC | #4
On Tue, 2024-02-20 at 18:54 -0500, dalias@libc.org wrote:
> On Tue, Feb 20, 2024 at 11:30:22PM +0000, Edgecombe, Rick P wrote:
> > On Tue, 2024-02-20 at 13:57 -0500, Rich Felker wrote:
> > > On Tue, Feb 20, 2024 at 06:41:05PM +0000, Edgecombe, Rick P
> > > wrote:
> > > > Hmm, could the shadow stack underflow onto the real stack then?
> > > > Not
> > > > sure how bad that is. INCSSP (incrementing the SSP register on
> > > > x86)
> > > > loops are not rare so it seems like something that could
> > > > happen.
> > > 
> > > Shadow stack underflow should fault on attempt to access
> > > non-shadow-stack memory as shadow-stack, no?
> > 
> > Maybe I'm misunderstanding. I thought the proposal included
> > allowing
> > shadow stack access to convert normal address ranges to shadow
> > stack,
> > and normal writes to convert shadow stack to normal.
> 
> As I understood the original discussion of the proposal on IRC, it
> was
> only one-way (from shadow to normal). Unless I'm missing something,
> making it one-way is necessary to catch situations where the shadow
> stack would become compromised.

The original post here:
https://lore.kernel.org/lkml/22a53b78-10d7-4a5a-a01e-b2f3a8c22e94@app.fastmail.com/

...has:
"Shadow stack faults on non-shadow stack pages, if flexible shadow
stack handling is in effect, cause the affected page to become a shadow
stack page.  When this happens, the page filled with invalid address
tokens."

...and:
"Faults from non-shadow-stack accesses to a shadow-stack page which was
created by the previous paragraph will cause the page to revert to non-
shadow-stack usage, with or without clearing."

I see Stefan has clarified in another response. So I'll go try to
figure it out.

> 
> > > > Shadow stacks currently have automatic guard gaps to try to
> > > > prevent
> > > > one
> > > > thread from overflowing onto another thread's shadow stack.
> > > > This
> > > > would
> > > > somewhat opens that up, as the stack guard gaps are usually
> > > > maintained
> > > > by userspace for new threads. It would have to be thought
> > > > through
> > > > if
> > > > these could still be enforced with checking at additional
> > > > spots.
> > > 
> > > I would think the existing guard pages would already do that if a
> > > thread's shadow stack is contiguous with its own data stack.
> > 
> > The difference is that the kernel provides the guard gaps, where
> > this
> > would rely on userspace to do it. It is not a showstopper either.
> > 
> > I think my biggest question on this is how does it change the
> > capability for two threads to share a shadow stack. It might
> > require
> > some special rules around the syscall that writes restore tokens.
> > So
> > I'm not sure. It probably needs a POC.
> 
> Why would they be sharing a shadow stack?

The guard gap was introduced originally based on a suggestion that
overflowing a shadow stack onto an adjacent shadow stack could cause
corruption that could be used by an attacker to work around the
protection. There wasn't any concrete demonstrated attacks or
suggestion that all the protection was moot.

But when we talk about capabilities for converting memory to shadow
stack with simple memory accesses, and syscalls that can write restore
token to shadow stacks, it's not immediately clear to me that it
wouldn't open up something like that. Like if two restore tokens were
written to a shadow stack, or two shadow stacks were adjacent with
normal memory between them that later got converted to shadow stack.
Those sorts of scenarios, but I won't lean on those specific examples.
Sorry for being hand wavy. It's just where I'm at, at this point.

> 
> > > From the musl side, I have always looked at the entirely of
> > > shadow
> > > stack stuff with very heavy skepticism, and anything that breaks
> > > existing interface contracts, introduced places where apps can
> > > get
> > > auto-killed because a late resource allocation fails, or requires
> > > applications to code around the existence of something that
> > > should be
> > > an implementation detail, is a non-starter. To even consider
> > > shadow
> > > stack support, it must truely be fully non-breaking.
> > 
> > The manual assembly stack switching and JIT code in the apps needs
> > to
> > be updated. I don't think there is a way around it.
> 
> Indeed, I'm not talking about programs with JIT/manual stack-
> switching
> asm, just anything using existing APIs for control of stack --
> pthread_setstack, makecontext, sigaltstack, etc.

Then I think WRSS might fit your requirements better than what glibc
did. It was considered a reduced security mode that made libc's job
much easier and had better compatibility, but the last discussion was
to try to do it without WRSS.

> 
> > I agree though that the late allocation failures are not great.
> > Mark is
> > working on clone3 support which should allow moving the shadow
> > stack
> > allocation to happen in userspace with the normal stack. Even for
> > riscv
> > though, doesn't it need to update a new register in stack
> > switching?
> 
> If clone is called with signals masked, it's probably not necessary
> for the kernel to set the shadow stack register as part of clone3.

So you would want a mode of clone3 that basically leaves the shadow
stack bits alone? Mark was driving that effort, but it doesn't seem
horrible to me on first impression. If it would open up the possibility
of musl support.

> 
> > BTW, x86 shadow stack has a mode where the shadow stack is writable
> > with a special instruction (WRSS). It enables the SSP to be set
> > arbitrarily by writing restore tokens. We discussed this as an
> > option
> > to make the existing longjmp() and signal stuff work more
> > transparently
> > for glibc.
> > 
> > 
> > 
> > BTW, when I talk about "not supporting" I don't mean the app should
> > crash. I mean it should instead run normally, just without shadow
> > stack
> > enabled. Not sure if that was clear. Since shadow stack is not
> > essential for an application to function, it is only security
> > hardening
> > on top.
> > 
> > Although determining if an application supports shadow stack has
> > turned
> > out to be difficult in practice. Handling dlopen() is especially
> > hard.
> 
> One reasonable thing to do, that might be preferable to
> overengineered
> solutions, is to disable shadow-stack process-wide if an interface
> incompatible with it is used (sigaltstack, pthread_create with an
> attribute setup using pthread_attr_setstack, makecontext, etc.), as
> well as if an incompatible library is is dlopened.

I think it would be an interesting approach to determining
compatibility. On x86 there has been cases of binaries getting
mismarked as supporting shadow stack. So an automated way of filtering
some of those out would be very useful I think. I guess the dynamic
linker could determine this based on some list of functions?

The dlopen() bit gets complicated though. You need to disable shadow
stack for all threads, which presumably the kernel could be coaxed into
doing. But those threads might be using shadow stack instructions
(INCSSP, RSTORSSP, etc). These are a collection of instructions that
allow limited control of the SSP. When shadow stack gets disabled,
these suddenly turn into #UD generating instructions. So any other
threads executing those instructions when shadow stack got disabled
would be in for a nasty surprise.

Glibc's permissive mode (that disables shadow stack when dlopen()ing a
DSO that doesn't support shadow stack) is quite limited because of
this. There was a POC for working around it, but I'll stop there for
now, to not spam you with the details. I'm not sure of arm and risc-v
details on this specific corner, but for x86.

>  This is much more
> acceptable than continuing to run with shadow stacks managed sloppily
> by the kernel and async killing the process on OOM, and is probably
> *more compatible* with apps than changing the minimum stack size
> requirements out from under them.

Yep.

> 
> The place where it's really needed to be able to allocate the shadow
> stack synchronously under userspace control, in order to harden
> normal
> applications that aren't doing funny things, is in pthread_create
> without a caller-provided stack.

Yea most apps don't do anything too tricky. Mostly shadow stack "just
works". But it's no excuse to just crash for the others.
Mark Brown Feb. 21, 2024, 12:44 a.m. UTC | #5
On Wed, Feb 21, 2024 at 12:35:48AM +0000, Edgecombe, Rick P wrote:

> doing. But those threads might be using shadow stack instructions
> (INCSSP, RSTORSSP, etc). These are a collection of instructions that
> allow limited control of the SSP. When shadow stack gets disabled,
> these suddenly turn into #UD generating instructions. So any other
> threads executing those instructions when shadow stack got disabled
> would be in for a nasty surprise.

> Glibc's permissive mode (that disables shadow stack when dlopen()ing a
> DSO that doesn't support shadow stack) is quite limited because of
> this. There was a POC for working around it, but I'll stop there for
> now, to not spam you with the details. I'm not sure of arm and risc-v
> details on this specific corner, but for x86.

We have the same issue with disabling GCS causing GCS instructions to
become undefined.
dalias@libc.org Feb. 21, 2024, 1:27 a.m. UTC | #6
On Wed, Feb 21, 2024 at 12:35:48AM +0000, Edgecombe, Rick P wrote:
> On Tue, 2024-02-20 at 18:54 -0500, dalias@libc.org wrote:
> > On Tue, Feb 20, 2024 at 11:30:22PM +0000, Edgecombe, Rick P wrote:
> > > On Tue, 2024-02-20 at 13:57 -0500, Rich Felker wrote:
> > > > On Tue, Feb 20, 2024 at 06:41:05PM +0000, Edgecombe, Rick P
> > > > > Shadow stacks currently have automatic guard gaps to try to
> > > > > prevent
> > > > > one
> > > > > thread from overflowing onto another thread's shadow stack.
> > > > > This
> > > > > would
> > > > > somewhat opens that up, as the stack guard gaps are usually
> > > > > maintained
> > > > > by userspace for new threads. It would have to be thought
> > > > > through
> > > > > if
> > > > > these could still be enforced with checking at additional
> > > > > spots.
> > > > 
> > > > I would think the existing guard pages would already do that if a
> > > > thread's shadow stack is contiguous with its own data stack.
> > > 
> > > The difference is that the kernel provides the guard gaps, where
> > > this
> > > would rely on userspace to do it. It is not a showstopper either.
> > > 
> > > I think my biggest question on this is how does it change the
> > > capability for two threads to share a shadow stack. It might
> > > require
> > > some special rules around the syscall that writes restore tokens.
> > > So
> > > I'm not sure. It probably needs a POC.
> > 
> > Why would they be sharing a shadow stack?
> 
> The guard gap was introduced originally based on a suggestion that
> overflowing a shadow stack onto an adjacent shadow stack could cause
> corruption that could be used by an attacker to work around the
> protection. There wasn't any concrete demonstrated attacks or
> suggestion that all the protection was moot.

OK, so not sharing, just happening to be adjacent.

I was thinking from a standpoint of allocating them as part of the
same range as the main stack, just with different protections, where
that would never happen; you'd always have intervening non-shadowstack
pages. But when they're kernel-allocated, yes, they need their own
guard pages.

> But when we talk about capabilities for converting memory to shadow
> stack with simple memory accesses, and syscalls that can write restore
> token to shadow stacks, it's not immediately clear to me that it
> wouldn't open up something like that. Like if two restore tokens were
> written to a shadow stack, or two shadow stacks were adjacent with
> normal memory between them that later got converted to shadow stack.
> Those sorts of scenarios, but I won't lean on those specific examples.
> Sorry for being hand wavy. It's just where I'm at, at this point.

I don't think it's safe to have automatic conversions back and forth,
only for normal accesses to convert shadowstack to normal memory (in
which case, any subsequent attempt to operate on it as shadow stack
indicates a critical bug and should be trapped to terminate the
process).

> > > > From the musl side, I have always looked at the entirely of
> > > > shadow
> > > > stack stuff with very heavy skepticism, and anything that breaks
> > > > existing interface contracts, introduced places where apps can
> > > > get
> > > > auto-killed because a late resource allocation fails, or requires
> > > > applications to code around the existence of something that
> > > > should be
> > > > an implementation detail, is a non-starter. To even consider
> > > > shadow
> > > > stack support, it must truely be fully non-breaking.
> > > 
> > > The manual assembly stack switching and JIT code in the apps needs
> > > to
> > > be updated. I don't think there is a way around it.
> > 
> > Indeed, I'm not talking about programs with JIT/manual stack-
> > switching
> > asm, just anything using existing APIs for control of stack --
> > pthread_setstack, makecontext, sigaltstack, etc.
> 
> Then I think WRSS might fit your requirements better than what glibc
> did. It was considered a reduced security mode that made libc's job
> much easier and had better compatibility, but the last discussion was
> to try to do it without WRSS.

Where can I read more about this? Some searches I tried didn't turn up
much useful information.

> > > I agree though that the late allocation failures are not great.
> > > Mark is
> > > working on clone3 support which should allow moving the shadow
> > > stack
> > > allocation to happen in userspace with the normal stack. Even for
> > > riscv
> > > though, doesn't it need to update a new register in stack
> > > switching?
> > 
> > If clone is called with signals masked, it's probably not necessary
> > for the kernel to set the shadow stack register as part of clone3.
> 
> So you would want a mode of clone3 that basically leaves the shadow
> stack bits alone? Mark was driving that effort, but it doesn't seem
> horrible to me on first impression. If it would open up the possibility
> of musl support.

Well I'm not sure. That's what we're trying to figure out. But I don't
think modifying it is a hard requirement, since it can be modified
from userspace if needed as long as signals are masked.

> > One reasonable thing to do, that might be preferable to
> > overengineered
> > solutions, is to disable shadow-stack process-wide if an interface
> > incompatible with it is used (sigaltstack, pthread_create with an
> > attribute setup using pthread_attr_setstack, makecontext, etc.), as
> > well as if an incompatible library is is dlopened.
> 
> I think it would be an interesting approach to determining
> compatibility. On x86 there has been cases of binaries getting
> mismarked as supporting shadow stack. So an automated way of filtering
> some of those out would be very useful I think. I guess the dynamic
> linker could determine this based on some list of functions?

I didn't follow this whole mess, but from our side (musl) it does not
seem relevant. There are no legacy binaries wrongly marked because we
have never supported shadow stacks so far.

> The dlopen() bit gets complicated though. You need to disable shadow
> stack for all threads, which presumably the kernel could be coaxed into
> doing. But those threads might be using shadow stack instructions
> (INCSSP, RSTORSSP, etc). These are a collection of instructions that
> allow limited control of the SSP. When shadow stack gets disabled,
> these suddenly turn into #UD generating instructions. So any other
> threads executing those instructions when shadow stack got disabled
> would be in for a nasty surprise.

This is the kernel's problem if that's happening. It should be
trapping these and returning immediately like a NOP if shadow stack
has been disabled, not generating SIGILL.

> > The place where it's really needed to be able to allocate the shadow
> > stack synchronously under userspace control, in order to harden
> > normal
> > applications that aren't doing funny things, is in pthread_create
> > without a caller-provided stack.
> 
> Yea most apps don't do anything too tricky. Mostly shadow stack "just
> works". But it's no excuse to just crash for the others.

One thing to note here is that, to enable this, we're going to need
some way to detect "new enough kernel that shadow stack semantics are
all right". If there are kernels that have shadow stack support but
with problems that make it unsafe to use (this sounds like the case),
we can't turn it on without a way to avoid trying to use it on those.

Rich
Edgecombe, Rick P Feb. 21, 2024, 4:30 a.m. UTC | #7
On Tue, 2024-02-20 at 18:59 -0500, Stefan O'Rear wrote:
> 
> Ideally for riscv only writes would cause conversion, an incssp
> underflow
> which performs shadow stack reads would be able to fault early.

Why can't makecontext() just clobber part of the low address side of
the passed in stack with a shadow stack mapping? Like say it just
munmap()'s part of the passed stack, and map_shadow_stack() in it's
place.

Then you could still have the shadow stack->normal conversion process
triggered by normal writes. IIUC the concern there is to make sure the
caller can reuse it as normal memory when it is done with the
ucontext/sigaltstack stuff? So the normal->shadow stack part could be
explicit.

But the more I think about this, the more I think it is a hack, and a
proper fix is to use new interfaces. It also would be difficult to
sell, if the faulting conversion stuff is in any way complex.
dalias@libc.org Feb. 21, 2024, 5:57 p.m. UTC | #8
On Wed, Feb 21, 2024 at 05:36:12PM +0000, Mark Brown wrote:
> On Wed, Feb 21, 2024 at 09:58:01AM -0500, dalias@libc.org wrote:
> > On Wed, Feb 21, 2024 at 01:53:10PM +0000, Mark Brown wrote:
> > > On Tue, Feb 20, 2024 at 08:27:37PM -0500, dalias@libc.org wrote:
> > > > On Wed, Feb 21, 2024 at 12:35:48AM +0000, Edgecombe, Rick P wrote:
> 
> > > > > (INCSSP, RSTORSSP, etc). These are a collection of instructions that
> > > > > allow limited control of the SSP. When shadow stack gets disabled,
> > > > > these suddenly turn into #UD generating instructions. So any other
> > > > > threads executing those instructions when shadow stack got disabled
> > > > > would be in for a nasty surprise.
> 
> > > > This is the kernel's problem if that's happening. It should be
> > > > trapping these and returning immediately like a NOP if shadow stack
> > > > has been disabled, not generating SIGILL.
> 
> > > I'm not sure that's going to work out well, all it takes is some code
> > > that's looking at the shadow stack and expecting something to happen as
> > > a result of the instructions it's executing and we run into trouble.  A
> 
> > I said NOP but there's no reason it strictly needs to be a NOP. It
> > could instead do something reasonable to convey the state of racing
> > with shadow stack being disabled.
> 
> This feels like it's getting complicated and I fear it may be an uphill
> struggle to get such code merged, at least for arm64.  My instinct is
> that it's going to be much more robust and generally tractable to let
> things run to some suitable synchronisation point and then disable
> there, but if we're going to do that then userspace can hopefully
> arrange to do the disabling itself through the standard disable
> interface anyway.  Presumably it'll want to notice things being disabled
> at some point anyway?  TBH that's been how all the prior proposals for
> process wide disable I've seen were done.

If it's possible to disable per-thread rather than per-process, some
things are easier. Disabling on account of using alt stacks only needs
to be done on the threads using those stacks. However, for dlopen
purposes you need a way to disable shadow stack for the whole process.
Initially this is only needed for the thread that called dlopen, but
it needs to have propagated to any thread that synchronizes with
completion of the call to dlopen by the time that synchronization
occurs, and since that synchronization can happen in lots of different
ways that are purely userspace (thanks to futexes being userspace in
the uncontended case), I don't see any way to make it work without
extremely invasive, high-cost checks.

If folks on the kernel side are not going to be amenable to doing the
things that are easy for the kernel to make it work without breaking
compatibility with existing interfaces, but that are impossible or
near-impossible for userspace to do, this seems like a dead-end. And I
suspect an operation to "disable shadow stack, but without making
threads still in SS-critical sections crash" is going to be
necessary..

Rich
dalias@libc.org Feb. 21, 2024, 6:30 p.m. UTC | #9
On Wed, Feb 21, 2024 at 06:12:30PM +0000, Edgecombe, Rick P wrote:
> On Wed, 2024-02-21 at 12:57 -0500, dalias@libc.org wrote:
> > > This feels like it's getting complicated and I fear it may be an
> > > uphill
> > > struggle to get such code merged, at least for arm64.  My instinct
> > > is
> > > that it's going to be much more robust and generally tractable to
> > > let
> > > things run to some suitable synchronisation point and then disable
> > > there, but if we're going to do that then userspace can hopefully
> > > arrange to do the disabling itself through the standard disable
> > > interface anyway.  Presumably it'll want to notice things being
> > > disabled
> > > at some point anyway?  TBH that's been how all the prior proposals
> > > for
> > > process wide disable I've seen were done.
> > 
> > If it's possible to disable per-thread rather than per-process, some
> > things are easier. Disabling on account of using alt stacks only
> > needs
> > to be done on the threads using those stacks. However, for dlopen
> > purposes you need a way to disable shadow stack for the whole
> > process.
> > Initially this is only needed for the thread that called dlopen, but
> > it needs to have propagated to any thread that synchronizes with
> > completion of the call to dlopen by the time that synchronization
> > occurs, and since that synchronization can happen in lots of
> > different
> > ways that are purely userspace (thanks to futexes being userspace in
> > the uncontended case), I don't see any way to make it work without
> > extremely invasive, high-cost checks.
> 
> For glibc's use, we talked about a couple of options.
> 1. A mode to start suppressing the #UD's from the shadow stack
> instructions
> 2. A mode to start suppressing #CPs (the exception that happens when
> the shadow stack doesn't match). So the shadow stack instructions
> continue to operate normally, but if the shadow stack gets mismatched
> due to lack of support, the ret is emulated. It probably is safer (but
> still not perfect), but the performance penalty of emulating every RET
> after things get screwed up would be a significant down side. There
> also needs to be clean handling of shadow stack #PFs.
> 3. Per-thread locking that is used around all shadow stack operations
> that could be sensitive to disabling. This could be maybe exposed to
> apps in case they want to use shadow stack instructions manually. Then
> during dlopen() it waits until it can cleanly disable shadow stack for
> each thread. In each critical sections there are checks for whether
> shadow stack is still enabled.
> 
> 3 is the cleanest and safest I think, and it was thought it might not
> need kernel help, due to a scheme Florian had to direct signals to
> specific threads. It's my preference at this point.

The operations where the shadow stack has to be processed need to be
executable from async-signal context, so this imposes a requirement to
block all signals around the lock. This makes all longjmps a heavy,
multi-syscall operation rather than O(1) userspace operation. I do not
think this is an acceptable implementation, especially when there are
clearly superior alternatives without that cost or invasiveness.

> 1 and 2 are POCed here, if you are interested:
> https://github.com/rpedgeco/linux/commits/shstk_suppress_rfc/

I'm not clear why 2 (suppression of #CP) is desirable at all. If
shadow stack is being disabled, it should just be disabled, with
minimal fault handling to paper over any racing operations at the
moment it's disabled. Leaving it on with extreme slowness to make it
not actually do anything does not seem useful.

Is there some way folks have in mind to use option 2 to lazily disable
shadow stack once the first SS-incompatible code is executed, when
execution is then known not to be in the middle of a SS-critical
section, instead of doing it right away? I don't see how this could
work, since the SS-incompatible code could be running from a signal
handler that interrupted an SS-critical section.

> > If folks on the kernel side are not going to be amenable to doing the
> > things that are easy for the kernel to make it work without breaking
> > compatibility with existing interfaces, but that are impossible or
> > near-impossible for userspace to do, this seems like a dead-end. And
> > I
> > suspect an operation to "disable shadow stack, but without making
> > threads still in SS-critical sections crash" is going to be
> > necessary..
> 
> I think we have to work through all the alternative before we can
> accuse the kernel of not being amenable. Is there something that you
> would like to see out of this conversation that is not happening?

No, I was just interpreting "uphill battle". I really do not want to
engage in an uphill battle for the sake of making it practical to
support something that was never my goal to begin with. If I'm
misreading this, or if others are willing to put the effort into that
"battle", I'd be happy to be mistaken about "not amenable".

Rich
Mark Brown Feb. 21, 2024, 7:20 p.m. UTC | #10
On Mon, Feb 19, 2024 at 11:15:57PM -0300, Thiago Jung Bauermann wrote:

> The only issue as can be seen above is that the can_call_function test
> is failing. The child is getting a GCS Segmentation fault when returning
> from fork().

> I tried debugging it with GDB, but I don't see what's wrong since the
> address in LR matches the first entry in GCSPR. Here is the
> debug session:

I'm simply not seeing this in my testing.  There's *something* going on
somewhere, I had another report of a similarish thing elsewhere, but not
in any way that I've ever been able to reproduce.  It smells like there
might be something missing with the page tables...
Mark Brown Feb. 22, 2024, 1:57 p.m. UTC | #11
On Wed, Feb 21, 2024 at 07:22:21PM +0000, Edgecombe, Rick P wrote:
> On Wed, 2024-02-21 at 14:06 -0500, dalias@libc.org wrote:

> > It's fine to turn RDSSP into an actual emulated read of the SSP, or
> > at
> > least an emulated load of zero so that uninitialized data is not left
> > in the target register.

> We can't intercept RDSSP, but it becomes a NOP by default. (disclaimer
> x86-only knowledge).

For arm64 we have a separate control GCSCRE0_EL1.nTR for access to
GCSPR_EL0 (our SSP equivalent) we can use.

> > I have not looked at all the instructions that become #UD but I
> > suspect they all have reasonable trivial ways to implement a
> > "disabled" version of them that userspace can act upon reasonably.

> This would have to be thought through functionally and performance
> wise. I'm not opposed if can come up with a fully fleshed out plan. How
> serious are you in pursuing musl support, if we had something like
> this?

Same here, we have to be careful since it's defining ABI in a way that
we don't normally provide ABI but if there's a clear case for doing it
then...
Mark Brown Feb. 22, 2024, 7:11 p.m. UTC | #12
On Mon, Feb 19, 2024 at 11:15:57PM -0300, Thiago Jung Bauermann wrote:

> The only issue as can be seen above is that the can_call_function test
> is failing. The child is getting a GCS Segmentation fault when returning
> from fork().

> I tried debugging it with GDB, but I don't see what's wrong since the
> address in LR matches the first entry in GCSPR. Here is the
> debug session:

I believe based on prior discussions that you're running this using
shrinkwrap - can you confirm exactly how please, including things like
which firmware configuration you're using?  I'm using current git with

  shrinkwrap run \
        --rtvar KERNEL=arch/arm64/boot/Image \
        --rtvar ROOTFS=${ROOTFS} \
        --rtvar CMDLINE="${CMDLINE}" \
        --overlay=arch/v9.4.yaml ns-edk2.yaml

and a locally built yocto and everything seems perfectly happy.
Thiago Jung Bauermann Feb. 23, 2024, 2:24 a.m. UTC | #13
Mark Brown <broonie@kernel.org> writes:

> On Mon, Feb 19, 2024 at 11:15:57PM -0300, Thiago Jung Bauermann wrote:
>
>> The only issue as can be seen above is that the can_call_function test
>> is failing. The child is getting a GCS Segmentation fault when returning
>> from fork().
>
>> I tried debugging it with GDB, but I don't see what's wrong since the
>> address in LR matches the first entry in GCSPR. Here is the
>> debug session:
>
> I believe based on prior discussions that you're running this using
> shrinkwrap - can you confirm exactly how please, including things like
> which firmware configuration you're using?  I'm using current git with
>
>   shrinkwrap run \
>         --rtvar KERNEL=arch/arm64/boot/Image \
>         --rtvar ROOTFS=${ROOTFS} \
>         --rtvar CMDLINE="${CMDLINE}" \
>         --overlay=arch/v9.4.yaml ns-edk2.yaml
>
> and a locally built yocto and everything seems perfectly happy.

Yes, this is how I'm running it:

  CMDLINE="Image dtb=fdt.dtb console=ttyAMA0 earlycon=pl011,0x1c090000 root=/dev/vda2 ip=dhcp maxcpus=1"

  shrinkwrap run \
      --rtvar=KERNEL=Image-gcs-v8-v6.7-rc4-14743-ga551a7d7af93 \
      --rtvar=ROOTFS=$HOME/VMs/ubuntu-aarch64.img \
      --rtvar=CMDLINE="$CMDLINE" \
      ns-edk2.yaml

I ran the following to set up the FVP VM:

$ shrinkwrap build --overlay=arch/v9.4.yaml ns-edk2.yaml

My rootfs is Ubuntu 22.04.3. In case it's useful, my kernel config is
here:

https://people.linaro.org/~thiago.bauermann/gcs/config-v6.8.0-rc2

I tried removing "maxcpus=1" from the kernel command line, but it made
no difference.

I also tried resetting my Shrinkwrap setup and starting from scratch,
but it also made no difference: I just pulled from the current main
branch and removed Shrinkwrap's build and package directories, and also
removed all Docker images and the one container I had.

Here are some firmware versions from early boot:

  NOTICE:  Booting Trusted Firmware
  NOTICE:  BL1: v2.10.0   (release):v2.10.0
  NOTICE:  BL1: Built : 00:07:29, Feb 23 2024
     ⋮
  NOTICE:  BL2: v2.10.0   (release):v2.10.0
  NOTICE:  BL2: Built : 00:07:29, Feb 23 2024
     ⋮
  NOTICE:  BL31: v2.10.0  (release):v2.10.0
  NOTICE:  BL31: Built : 00:07:29, Feb 23 2024
     ⋮
  [  edk2 ] UEFI firmware (version  built at 00:06:55 on Feb 23 2024)
  Press ESCAPE for boot options ...........UEFI Interactive Shell v2.2
  EDK II
  UEFI v2.70 (EDK II, 0x00010000)

It looks like our main differences are the kernel config and the distro.
Mark Brown Feb. 27, 2024, 4:14 p.m. UTC | #14
On Thu, Feb 22, 2024 at 11:24:59PM -0300, Thiago Jung Bauermann wrote:
> Mark Brown <broonie@kernel.org> writes:

> My rootfs is Ubuntu 22.04.3. In case it's useful, my kernel config is
> here:

> https://people.linaro.org/~thiago.bauermann/gcs/config-v6.8.0-rc2

Does using defconfig make a difference for you?

> Here are some firmware versions from early boot:

These are (as you'd expect) the same.

> It looks like our main differences are the kernel config and the distro.

Indeed.
Thiago Jung Bauermann Feb. 27, 2024, 7:08 p.m. UTC | #15
Mark Brown <broonie@kernel.org> writes:

> [[PGP Signed Part:Undecided]]
> On Thu, Feb 22, 2024 at 11:24:59PM -0300, Thiago Jung Bauermann wrote:
>> Mark Brown <broonie@kernel.org> writes:
>
>> My rootfs is Ubuntu 22.04.3. In case it's useful, my kernel config is
>> here:
>
>> https://people.linaro.org/~thiago.bauermann/gcs/config-v6.8.0-rc2
>
> Does using defconfig make a difference for you?

No, I still get the same result with the defconfig.
Mark Brown Feb. 29, 2024, 9:45 p.m. UTC | #16
On Thu, Feb 22, 2024 at 11:24:59PM -0300, Thiago Jung Bauermann wrote:
> Mark Brown <broonie@kernel.org> writes:

> > I believe based on prior discussions that you're running this using
> > shrinkwrap - can you confirm exactly how please, including things like
> > which firmware configuration you're using?  I'm using current git with
> >
> >   shrinkwrap run \
> >         --rtvar KERNEL=arch/arm64/boot/Image \
> >         --rtvar ROOTFS=${ROOTFS} \
> >         --rtvar CMDLINE="${CMDLINE}" \
> >         --overlay=arch/v9.4.yaml ns-edk2.yaml
> >
> > and a locally built yocto and everything seems perfectly happy.
> 
> Yes, this is how I'm running it:
> 
>   CMDLINE="Image dtb=fdt.dtb console=ttyAMA0 earlycon=pl011,0x1c090000 root=/dev/vda2 ip=dhcp maxcpus=1"
> 
>   shrinkwrap run \
>       --rtvar=KERNEL=Image-gcs-v8-v6.7-rc4-14743-ga551a7d7af93 \

I guess that's bitrotted?

> My rootfs is Ubuntu 22.04.3. In case it's useful, my kernel config is
> here:

...

> https://people.linaro.org/~thiago.bauermann/gcs/config-v6.8.0-rc2

Thanks, it seems to be something in your config that's making a
difference - I can see issues with that.  Hopefully that'll help me get
to the bottom of this quickly.  I spent a bunch of time fighting with
Ubuntu images to get them running but once I did they didn't seem to
make much difference.
Thiago Jung Bauermann Feb. 29, 2024, 10:13 p.m. UTC | #17
Mark Brown <broonie@kernel.org> writes:

> [[PGP Signed Part:Undecided]]
> On Thu, Feb 22, 2024 at 11:24:59PM -0300, Thiago Jung Bauermann wrote:
>> Mark Brown <broonie@kernel.org> writes:
>
>> > I believe based on prior discussions that you're running this using
>> > shrinkwrap - can you confirm exactly how please, including things like
>> > which firmware configuration you're using?  I'm using current git with
>> >
>> >   shrinkwrap run \
>> >         --rtvar KERNEL=arch/arm64/boot/Image \
>> >         --rtvar ROOTFS=${ROOTFS} \
>> >         --rtvar CMDLINE="${CMDLINE}" \
>> >         --overlay=arch/v9.4.yaml ns-edk2.yaml
>> >
>> > and a locally built yocto and everything seems perfectly happy.
>> 
>> Yes, this is how I'm running it:
>> 
>>   CMDLINE="Image dtb=fdt.dtb console=ttyAMA0 earlycon=pl011,0x1c090000 root=/dev/vda2 ip=dhcp maxcpus=1"
>> 
>>   shrinkwrap run \
>>       --rtvar=KERNEL=Image-gcs-v8-v6.7-rc4-14743-ga551a7d7af93 \
>
> I guess that's bitrotted?

Ah, sorry. When I renamed the Image I messed up the kernel version in the
filename, but I did confirm via "uname -r" that I was running the
correct version: 6.8.0-rc2-ga551a7d7af93.

>> My rootfs is Ubuntu 22.04.3. In case it's useful, my kernel config is
>> here:
>
> ...
>
>> https://people.linaro.org/~thiago.bauermann/gcs/config-v6.8.0-rc2
>
> Thanks, it seems to be something in your config that's making a
> difference - I can see issues with that.  Hopefully that'll help me get
> to the bottom of this quickly.  I spent a bunch of time fighting with
> Ubuntu images to get them running but once I did they didn't seem to
> make much difference.

In that case, it's interesting that I still run into the problem with
the defconfig. One thing I failed to mention and perhaps is relevant
considering your result, is that I didn't copy the modules into the disk
image, so the FVP was running just with was built into the kernel.

That was actually the main reason for me to use a custom config:
I didn't want to have to deal with kernel modules, so I created a config
that didn't have any.
Szabolcs Nagy March 2, 2024, 2:57 p.m. UTC | #18
* Mark Brown <broonie@kernel.org> [2024-02-21 17:36:12 +0000]:

> On Wed, Feb 21, 2024 at 09:58:01AM -0500, dalias@libc.org wrote:
> > On Wed, Feb 21, 2024 at 01:53:10PM +0000, Mark Brown wrote:
> > > On Tue, Feb 20, 2024 at 08:27:37PM -0500, dalias@libc.org wrote:
> > > > On Wed, Feb 21, 2024 at 12:35:48AM +0000, Edgecombe, Rick P wrote:
> 
> > > > > (INCSSP, RSTORSSP, etc). These are a collection of instructions that
> > > > > allow limited control of the SSP. When shadow stack gets disabled,
> > > > > these suddenly turn into #UD generating instructions. So any other
> > > > > threads executing those instructions when shadow stack got disabled
> > > > > would be in for a nasty surprise.
> 
> > > > This is the kernel's problem if that's happening. It should be
> > > > trapping these and returning immediately like a NOP if shadow stack
> > > > has been disabled, not generating SIGILL.
> 
> > > I'm not sure that's going to work out well, all it takes is some code
> > > that's looking at the shadow stack and expecting something to happen as
> > > a result of the instructions it's executing and we run into trouble.  A
> 
> > I said NOP but there's no reason it strictly needs to be a NOP. It
> > could instead do something reasonable to convey the state of racing
> > with shadow stack being disabled.
> 
> This feels like it's getting complicated and I fear it may be an uphill
> struggle to get such code merged, at least for arm64.  My instinct is

the aarch64 behaviour is already nop
for gcs instructions when gcs is disabled.
the isa was designed so async disable is
possible.

only x86 linux would have to emulate this.

> that it's going to be much more robust and generally tractable to let
> things run to some suitable synchronisation point and then disable
> there, but if we're going to do that then userspace can hopefully
> arrange to do the disabling itself through the standard disable
> interface anyway.  Presumably it'll want to notice things being disabled
> at some point anyway?  TBH that's been how all the prior proposals for
> process wide disable I've seen were done.
Mark Brown March 14, 2024, 2:03 p.m. UTC | #19
On Sat, Mar 02, 2024 at 03:57:02PM +0100, Szabolcs Nagy wrote:
> * Mark Brown <broonie@kernel.org> [2024-02-21 17:36:12 +0000]:

> > > I said NOP but there's no reason it strictly needs to be a NOP. It
> > > could instead do something reasonable to convey the state of racing
> > > with shadow stack being disabled.

> > This feels like it's getting complicated and I fear it may be an uphill
> > struggle to get such code merged, at least for arm64.  My instinct is

> the aarch64 behaviour is already nop
> for gcs instructions when gcs is disabled.
> the isa was designed so async disable is
> possible.

Yeah, we'd need to handle GCSPR_EL0 somehow (currently it's inaccessible
when GCS is disabled) and userspace would need to take care it's not
doing something that could get stuck if for example a pop didn't
actually *do* anything.