diff mbox series

[v4,03/36] arm64/gcs: Document the ABI for Guarded Control Stacks

Message ID 20230807-arm64-gcs-v4-3-68cfa37f9069@kernel.org
State Superseded
Headers show
Series arm64/gcs: Provide support for GCS in userspace | expand

Commit Message

Mark Brown Aug. 7, 2023, 10 p.m. UTC
Add some documentation of the userspace ABI for Guarded Control Stacks.

Signed-off-by: Mark Brown <broonie@kernel.org>
---
 Documentation/arch/arm64/gcs.rst   | 228 +++++++++++++++++++++++++++++++++++++
 Documentation/arch/arm64/index.rst |   1 +
 2 files changed, 229 insertions(+)

Comments

Catalin Marinas Aug. 22, 2023, 4:49 p.m. UTC | #1
On Fri, Aug 18, 2023 at 08:38:02PM +0100, Mark Brown wrote:
> On Fri, Aug 18, 2023 at 06:29:54PM +0100, Catalin Marinas wrote:
> > A related question - it may have been discussed intensively on the x86
> > thread (I may read it sometime) - why not have the libc map the shadow
> 
> Your assumption that this is a single thread feels optimistic there.

Yeah and I unfortunately ignored all of them.

> > stack and pass the pointer/size to clone3()? It saves us from having to
> > guess what the right size we'd need. struct clone_args is extensible.
> 
> I can't recall or locate the specific reasoning there right now, perhaps
> Rick or someone else can?  I'd guess there would be compat concerns for
> things that don't go via libc which would complicate the story with
> identifying and marking things as GCS/SS safe, it's going to be more
> robust to just supply a GCS if the process is using it.  That said
> having a default doesn't preclude us using the extensibility to allow
> userspace directly to control the GCS size, I would certainly be in
> favour of adding support for that.

It would be good if someone provided a summary of the x86 decision (I'll
get to those thread but most likely in September). I think we concluded
that we can't deploy GCS entirely transparently, so we need a libc
change (apart from the ELF annotations). Since libc is opting in to GCS,
we could also update the pthread_create() etc. to allocate the shadow
together with the standard stack.

Anyway, that's my preference but maybe there were good reasons not to do
this.
Mark Brown Aug. 22, 2023, 5:53 p.m. UTC | #2
On Tue, Aug 22, 2023 at 05:49:51PM +0100, Catalin Marinas wrote:
> On Fri, Aug 18, 2023 at 08:38:02PM +0100, Mark Brown wrote:

> > > stack and pass the pointer/size to clone3()? It saves us from having to
> > > guess what the right size we'd need. struct clone_args is extensible.

> > I can't recall or locate the specific reasoning there right now, perhaps
> > Rick or someone else can?  I'd guess there would be compat concerns for
> > things that don't go via libc which would complicate the story with
> > identifying and marking things as GCS/SS safe, it's going to be more
> > robust to just supply a GCS if the process is using it.  That said
> > having a default doesn't preclude us using the extensibility to allow
> > userspace directly to control the GCS size, I would certainly be in
> > favour of adding support for that.

> It would be good if someone provided a summary of the x86 decision (I'll
> get to those thread but most likely in September). I think we concluded
> that we can't deploy GCS entirely transparently, so we need a libc
> change (apart from the ELF annotations). Since libc is opting in to GCS,

Right, we need changes for setjmp()/longjmp() for example.

> we could also update the pthread_create() etc. to allocate the shadow
> together with the standard stack.

> Anyway, that's my preference but maybe there were good reasons not to do
> this.

Yeah, it'd be good to understand.  I've been through quite a lot of old
versions of the x86 series (I've not found them all, there's 30 versions
or something of the old series plus the current one is on v9) and the
code always appears to have been this way with changelogs that explain
the what but not the why.  For example roughly the current behaviour was
already in place in v10 of the original series:

   https://lore.kernel.org/lkml/20200429220732.31602-26-yu-cheng.yu@intel.com/

I do worry about the story for users calling the underlying clone3() API
(or legacy clone() for that matter) directly, and we would also need to
handle the initial GCS enable via prctl() - that's not insurmountable,
we could add a size argument there that only gets interpreted during the
initial enable for example.

My sense is that they deployment story is going to be smoother with
defaults being provided since it avoids dealing with the issue of what
to do if userspace creates a thread without a GCS in a GCS enabled
process but like I say I'd be totally happy to extend clone3().  I will
put some patches together for that (probably once the x86 stuff lands).
Given the size of this series it might be better split out for
manageability if nothing else.
Szabolcs Nagy Aug. 23, 2023, 10:09 a.m. UTC | #3
The 08/22/2023 18:53, Mark Brown wrote:
> On Tue, Aug 22, 2023 at 05:49:51PM +0100, Catalin Marinas wrote:
> > On Fri, Aug 18, 2023 at 08:38:02PM +0100, Mark Brown wrote:
> 
> > > > stack and pass the pointer/size to clone3()? It saves us from having to
> > > > guess what the right size we'd need. struct clone_args is extensible.
> 
> > > I can't recall or locate the specific reasoning there right now, perhaps
> > > Rick or someone else can?  I'd guess there would be compat concerns for
> > > things that don't go via libc which would complicate the story with
> > > identifying and marking things as GCS/SS safe, it's going to be more
> > > robust to just supply a GCS if the process is using it.  That said
> > > having a default doesn't preclude us using the extensibility to allow
> > > userspace directly to control the GCS size, I would certainly be in
> > > favour of adding support for that.
> 
> > It would be good if someone provided a summary of the x86 decision (I'll
> > get to those thread but most likely in September). I think we concluded
> > that we can't deploy GCS entirely transparently, so we need a libc
> > change (apart from the ELF annotations). Since libc is opting in to GCS,
> 
> Right, we need changes for setjmp()/longjmp() for example.
> 
> > we could also update the pthread_create() etc. to allocate the shadow
> > together with the standard stack.
> 
> > Anyway, that's my preference but maybe there were good reasons not to do
> > this.
> 
> Yeah, it'd be good to understand.  I've been through quite a lot of old
> versions of the x86 series (I've not found them all, there's 30 versions
> or something of the old series plus the current one is on v9) and the
> code always appears to have been this way with changelogs that explain
> the what but not the why.  For example roughly the current behaviour was
> already in place in v10 of the original series:
> 
>    https://lore.kernel.org/lkml/20200429220732.31602-26-yu-cheng.yu@intel.com/

well the original shstk patches predate clone3 so no surprise there.
e.g. v6 is from 2018 and clone3 is 2019 linux 5.3
https://lore.kernel.org/lkml/20181119214809.6086-1-yu-cheng.yu@intel.com/

> 
> I do worry about the story for users calling the underlying clone3() API
> (or legacy clone() for that matter) directly, and we would also need to
> handle the initial GCS enable via prctl() - that's not insurmountable,
> we could add a size argument there that only gets interpreted during the
> initial enable for example.

musl and bionic currently use plain clone for threads.

and there is user code doing raw clone threads (such threads are
technically not allowed to call into libc) it's not immediately
clear to me if having gcs in those threads is better or worse.

glibc can use clone3 args for gcs, i'd expect the unmap to be more
annoying than the allocation, but possible (it is certainly more
work than leaving everything to the kernel).

one difference is that userspace can then set gcspr of a new thread
and e.g. two threads can have overlapping gcs, however i don't think
this impacts security much since if clone3 is attacker controlled
then likely all bets are off.

and yes the main thread gcs can also be libc allocated given we
have to deal with the prctl anyway.

if gcs size logic is in libc it can depend on env vars and can be
changed more easily (and adapted to android vs musl vs glibc
requirements).

sigaltstack with alt gcs was a case where i thought the kernel
doing it transparently is better (the libc cannot do the same
as it cannot wrap signal handlers currently so does not know
when a handler returns or the current alt stack state), but
others seems to want an explicit sigaltgcs syscall and expose
it to users. in any case we have no unwinder solution for alt
gcs nor longjmp solution when the thread gcs is overflowed so
this is not an issue for now.

> My sense is that they deployment story is going to be smoother with
> defaults being provided since it avoids dealing with the issue of what
> to do if userspace creates a thread without a GCS in a GCS enabled
> process but like I say I'd be totally happy to extend clone3().  I will
> put some patches together for that (probably once the x86 stuff lands).
> Given the size of this series it might be better split out for
> manageability if nothing else.

i would make thread without gcs to implicitly disable gcs, since
that's what's bw compat with clones outside of libc (the libc can
guarantee gcs allocation when gcs is enabled).
Mark Brown Aug. 23, 2023, 12:51 p.m. UTC | #4
On Wed, Aug 23, 2023 at 11:09:59AM +0100, Szabolcs Nagy wrote:
> The 08/22/2023 18:53, Mark Brown wrote:
> > On Tue, Aug 22, 2023 at 05:49:51PM +0100, Catalin Marinas wrote:

> > the what but not the why.  For example roughly the current behaviour was
> > already in place in v10 of the original series:

> well the original shstk patches predate clone3 so no surprise there.
> e.g. v6 is from 2018 and clone3 is 2019 linux 5.3
> https://lore.kernel.org/lkml/20181119214809.6086-1-yu-cheng.yu@intel.com/

Ah, that'd do it.  People weren't excited enough on about clone3() when
reviewing the shadow stack patches, I hadn't realised clone3() was so
new.

> > I do worry about the story for users calling the underlying clone3() API
> > (or legacy clone() for that matter) directly, and we would also need to
> > handle the initial GCS enable via prctl() - that's not insurmountable,
> > we could add a size argument there that only gets interpreted during the
> > initial enable for example.

> musl and bionic currently use plain clone for threads.

> and there is user code doing raw clone threads (such threads are
> technically not allowed to call into libc) it's not immediately
> clear to me if having gcs in those threads is better or worse.

Right, that's my big worry - I hadn't realised it was extending as far
as musl/bionic.

> one difference is that userspace can then set gcspr of a new thread
> and e.g. two threads can have overlapping gcs, however i don't think
> this impacts security much since if clone3 is attacker controlled
> then likely all bets are off.

Yeah, I think that's a "you broke it, you get all the pieces" thing.  

> > My sense is that they deployment story is going to be smoother with
> > defaults being provided since it avoids dealing with the issue of what
> > to do if userspace creates a thread without a GCS in a GCS enabled
> > process but like I say I'd be totally happy to extend clone3().  I will
> > put some patches together for that (probably once the x86 stuff lands).
> > Given the size of this series it might be better split out for
> > manageability if nothing else.

> i would make thread without gcs to implicitly disable gcs, since
> that's what's bw compat with clones outside of libc (the libc can
> guarantee gcs allocation when gcs is enabled).

That'd create a pretty substantial divergence with the x86 patches if
they land this time around, there's not enough time to rework them now -
I suppose it'd mainly bite people implementing libc type stuff but
still, doesn't feel great.
Catalin Marinas Aug. 23, 2023, 1:11 p.m. UTC | #5
On Wed, Aug 23, 2023 at 11:09:59AM +0100, Szabolcs Nagy wrote:
> The 08/22/2023 18:53, Mark Brown wrote:
> > On Tue, Aug 22, 2023 at 05:49:51PM +0100, Catalin Marinas wrote:
> > > It would be good if someone provided a summary of the x86 decision (I'll
> > > get to those thread but most likely in September). I think we concluded
> > > that we can't deploy GCS entirely transparently, so we need a libc
> > > change (apart from the ELF annotations). Since libc is opting in to GCS,
> > 
> > Right, we need changes for setjmp()/longjmp() for example.
> > 
> > > we could also update the pthread_create() etc. to allocate the shadow
> > > together with the standard stack.
> > > 
> > > Anyway, that's my preference but maybe there were good reasons not to do
> > > this.
> > 
> > Yeah, it'd be good to understand.  I've been through quite a lot of old
> > versions of the x86 series (I've not found them all, there's 30 versions
> > or something of the old series plus the current one is on v9) and the
> > code always appears to have been this way with changelogs that explain
> > the what but not the why.  For example roughly the current behaviour was
> > already in place in v10 of the original series:
> > 
> >    https://lore.kernel.org/lkml/20200429220732.31602-26-yu-cheng.yu@intel.com/
> 
> well the original shstk patches predate clone3 so no surprise there.
> e.g. v6 is from 2018 and clone3 is 2019 linux 5.3
> https://lore.kernel.org/lkml/20181119214809.6086-1-yu-cheng.yu@intel.com/

Good point, I had not realised that.

> > I do worry about the story for users calling the underlying clone3() API
> > (or legacy clone() for that matter) directly, and we would also need to
> > handle the initial GCS enable via prctl() - that's not insurmountable,
> > we could add a size argument there that only gets interpreted during the
> > initial enable for example.
> 
> musl and bionic currently use plain clone for threads.
> 
> and there is user code doing raw clone threads (such threads are
> technically not allowed to call into libc) it's not immediately
> clear to me if having gcs in those threads is better or worse.
> 
> glibc can use clone3 args for gcs, i'd expect the unmap to be more
> annoying than the allocation, but possible (it is certainly more
> work than leaving everything to the kernel).

Unmapping is indeed more complex but I guess something similar needs to
happen for the thread stack to be reclaimed.

The thing I dislike about the kernel automatically mapping it is the
arbitrary fraction of RLIMIT_STACK size. glibc may use RLIMIT_STACK as a
hint for the thread stack size but is this the case for other libraries?
Some quick search (which I may have misinterpreted) shows that musl uses
128KB, bionic 1MB. So at this point the shadow stack size has no
relevance for the actual thread stack.

An alternative would be for the clone3() to provide an address _hint_
and size for GCS and it would still be the kernel doing the mmap (and
munmap on clearing). But at least the user has some control over the
placement of the GCS and its size (and maybe providing the address has
MAP_FIXED semantics).

> > My sense is that they deployment story is going to be smoother with
> > defaults being provided since it avoids dealing with the issue of what
> > to do if userspace creates a thread without a GCS in a GCS enabled
> > process but like I say I'd be totally happy to extend clone3().  I will
> > put some patches together for that (probably once the x86 stuff lands).
> > Given the size of this series it might be better split out for
> > manageability if nothing else.
> 
> i would make thread without gcs to implicitly disable gcs, since
> that's what's bw compat with clones outside of libc (the libc can
> guarantee gcs allocation when gcs is enabled).

Yes, this should work. Any invocation of clone() or clone3() without a
shadow stack would disable GCS. What about the reverse, should GCS be
enabled for a thread even if the clone3() caller has GCS disabled? I
guess we shouldn't since GCS enabling depends on the prctl() state set
previously.
Mark Brown Aug. 23, 2023, 3:50 p.m. UTC | #6
On Wed, Aug 23, 2023 at 02:11:07PM +0100, Catalin Marinas wrote:

> Yes, this should work. Any invocation of clone() or clone3() without a
> shadow stack would disable GCS. What about the reverse, should GCS be
> enabled for a thread even if the clone3() caller has GCS disabled? I
> guess we shouldn't since GCS enabling depends on the prctl() state set
> previously.

It has a fairly obvious intended meaning so we could do it easily enough
but OTOH allowing it opens up the idea of people wanting to specify GCS
flags which starts to seem like more trouble than it's worth compared to
just having them do the prctl() in the new thread.
Catalin Marinas Aug. 23, 2023, 4:45 p.m. UTC | #7
On Wed, Aug 23, 2023 at 01:51:35PM +0100, Mark Brown wrote:
> On Wed, Aug 23, 2023 at 11:09:59AM +0100, Szabolcs Nagy wrote:
> > The 08/22/2023 18:53, Mark Brown wrote:
> > > My sense is that they deployment story is going to be smoother with
> > > defaults being provided since it avoids dealing with the issue of what
> > > to do if userspace creates a thread without a GCS in a GCS enabled
> > > process but like I say I'd be totally happy to extend clone3().  I will
> > > put some patches together for that (probably once the x86 stuff lands).
> > > Given the size of this series it might be better split out for
> > > manageability if nothing else.
> 
> > i would make thread without gcs to implicitly disable gcs, since
> > that's what's bw compat with clones outside of libc (the libc can
> > guarantee gcs allocation when gcs is enabled).
> 
> That'd create a pretty substantial divergence with the x86 patches if
> they land this time around, there's not enough time to rework them now -
> I suppose it'd mainly bite people implementing libc type stuff but
> still, doesn't feel great.

I don't mind the divergence in this area if the libc folks are ok with
it. x86 can eventually use the clone3() interface if they want more
flexibility, they'll just have to continue supporting the old one. I
think we already diverge around the prctl().
Mark Brown Aug. 23, 2023, 5:18 p.m. UTC | #8
On Wed, Aug 23, 2023 at 05:45:11PM +0100, Catalin Marinas wrote:

> I don't mind the divergence in this area if the libc folks are ok with
> it. x86 can eventually use the clone3() interface if they want more
> flexibility, they'll just have to continue supporting the old one. I
> think we already diverge around the prctl().

Yes, though that's basically the same thing - the difference is
basically just that x86 uses their custom arch_prctl() thing and we use
a regular prctl(), there's nothing conceptual going on like there is
here.  I don't really mind either, it's just something where I'd
anticipate pushback.
Szabolcs Nagy Aug. 23, 2023, 5:40 p.m. UTC | #9
The 08/23/2023 17:45, Catalin Marinas wrote:
> On Wed, Aug 23, 2023 at 01:51:35PM +0100, Mark Brown wrote:
> > On Wed, Aug 23, 2023 at 11:09:59AM +0100, Szabolcs Nagy wrote:
> > > The 08/22/2023 18:53, Mark Brown wrote:
> > > > My sense is that they deployment story is going to be smoother with
> > > > defaults being provided since it avoids dealing with the issue of what
> > > > to do if userspace creates a thread without a GCS in a GCS enabled
> > > > process but like I say I'd be totally happy to extend clone3().  I will
> > > > put some patches together for that (probably once the x86 stuff lands).
> > > > Given the size of this series it might be better split out for
> > > > manageability if nothing else.
> > 
> > > i would make thread without gcs to implicitly disable gcs, since
> > > that's what's bw compat with clones outside of libc (the libc can
> > > guarantee gcs allocation when gcs is enabled).
> > 
> > That'd create a pretty substantial divergence with the x86 patches if
> > they land this time around, there's not enough time to rework them now -
> > I suppose it'd mainly bite people implementing libc type stuff but
> > still, doesn't feel great.
> 
> I don't mind the divergence in this area if the libc folks are ok with
> it. x86 can eventually use the clone3() interface if they want more
> flexibility, they'll just have to continue supporting the old one. I
> think we already diverge around the prctl().

i will have to prototype it, but in principle i'm ok with moving gcs
allocation to userspace and passing it as argument to clone3. i will
have to think if x86 divergence could cause issues.

to maximize compat with existing raw clone users gcs either has to
be disabled implicitly or allocated by the kernel. if we move gcs
management to userspace then disable sounds better to me.
(except vfork/fork does not have to disable etc.)

to support gcs, a libc would have to use clone3 or enable gcs in the
clone start code.

i don't know if we can allow disabled gcs thread creation with locked
gcs state. (i can see arguments both ways, so further prctl flag may
be needed which may be another divergence from x86)

i wonder if we can allow MAP_FIXED as well as MAP_FIXED_NOREPLACE
semantics for map_shadow_stack (MAP_FIXED makes sense if userspace
allocates thread stack + tls + gcs + guard pages with one PROT_NONE
mapping and then mprotects / map_shadow_stack on top of that) i.e.
if userspace manages the gcs it may need more flexibility here.
(for now i think separate gcs mapping works for me.)
Mark Brown Aug. 23, 2023, 6:16 p.m. UTC | #10
On Wed, Aug 23, 2023 at 06:40:40PM +0100, Szabolcs Nagy wrote:

> i don't know if we can allow disabled gcs thread creation with locked
> gcs state. (i can see arguments both ways, so further prctl flag may
> be needed which may be another divergence from x86)

I think that if we do add a new flag that'd just be new functionality,
the divergence would be in allowing configuration via clone3() rather
than the flag.  TBH I'm not sure I see a use case for locking but
providing a mechanism for getting out of the lock, that seems very
questionable.
Catalin Marinas Aug. 24, 2023, 3:43 p.m. UTC | #11
On Wed, Aug 23, 2023 at 07:16:52PM +0100, Mark Brown wrote:
> On Wed, Aug 23, 2023 at 06:40:40PM +0100, Szabolcs Nagy wrote:
> > i don't know if we can allow disabled gcs thread creation with locked
> > gcs state. (i can see arguments both ways, so further prctl flag may
> > be needed which may be another divergence from x86)
> 
> I think that if we do add a new flag that'd just be new functionality,
> the divergence would be in allowing configuration via clone3() rather
> than the flag.  TBH I'm not sure I see a use case for locking but
> providing a mechanism for getting out of the lock, that seems very
> questionable.

You are right, once the configuration is locked a plain clone() or
clone3() without a GCS pointer should be rejected.

Is there a use-case for the unlocked configuration to allow disabling
the GCS implicitly via a clone syscall?

If we go for extending clone3, I wonder whether we should also introduce
a sigaltstack2/3 ;). I haven't checked what the current patches do and
won't have time until early September (on holiday from the end of today).
Mark Brown Aug. 24, 2023, 5:38 p.m. UTC | #12
On Thu, Aug 24, 2023 at 04:43:19PM +0100, Catalin Marinas wrote:

> If we go for extending clone3, I wonder whether we should also introduce
> a sigaltstack2/3 ;). I haven't checked what the current patches do and
> won't have time until early September (on holiday from the end of today).

The current patches (and the x86 stuff that's in -next) punt on
sigaltstack for now, the discussions around that are unresolved.  My
hope is that whatever we come up with there can be cross platform.
Szabolcs Nagy Sept. 28, 2023, 4:59 p.m. UTC | #13
The 08/23/2023 14:11, Catalin Marinas wrote:
> On Wed, Aug 23, 2023 at 11:09:59AM +0100, Szabolcs Nagy wrote:
> > The 08/22/2023 18:53, Mark Brown wrote:
> > > I do worry about the story for users calling the underlying clone3() API
> > > (or legacy clone() for that matter) directly, and we would also need to
> > > handle the initial GCS enable via prctl() - that's not insurmountable,
> > > we could add a size argument there that only gets interpreted during the
> > > initial enable for example.
...
> > and there is user code doing raw clone threads (such threads are
> > technically not allowed to call into libc) it's not immediately
> > clear to me if having gcs in those threads is better or worse.

i think raw clone / clone3 users may be relevant so we need a
solution such that they don't fail when gcs args are missing.

userspace allocated gcs works for me, but maybe the alternative
with size only is more consistent (thread gcs is kernel mapped
with fallback size logic if gcs size is missing):

> An alternative would be for the clone3() to provide an address _hint_
> and size for GCS and it would still be the kernel doing the mmap (and
> munmap on clearing). But at least the user has some control over the
> placement of the GCS and its size (and maybe providing the address has
> MAP_FIXED semantics).

the main thread gcs is still special: the size is provided
via prctl (if at all).
Edgecombe, Rick P Oct. 2, 2023, 9:43 p.m. UTC | #14
On Mon, 2023-10-02 at 20:49 +0100, Mark Brown wrote:
> On Thu, Sep 28, 2023 at 05:59:25PM +0100, Szabolcs Nagy wrote:
> > The 08/23/2023 14:11, Catalin Marinas wrote:
> 
> > > > and there is user code doing raw clone threads (such threads
> > > > are
> > > > technically not allowed to call into libc) it's not immediately
> > > > clear to me if having gcs in those threads is better or worse.
> 
> > i think raw clone / clone3 users may be relevant so we need a
> > solution such that they don't fail when gcs args are missing.
> 
> Are we sure about that?  Old binaries shouldn't be affected since
> they
> won't turn GCS so we're just talking about new binaries here - are
> there
> really so many of them that we won't be able to get them all
> converted
> over to clone3() and GCS in the timescales we're talking about for
> GCS
> deployment?  I obviously don't particularly mind having the default
> size
> logic but if we allow clone() then that's keeping the existing
> behaviour
> and layering allocation via clone3() on top of it which Catalin
> didn't
> want. 

On the x86 side, there have been a lot of binaries generated that have
been blindly marked as supporting shadow stack. So even if they are not
old, there may still be mismarked ones using clone().

In general there are a lot of tradeoffs with shadow stack between
security, compatibility and performance. Szabolcs had previously
discussed some ideas around all three:
 - compatibility (automatic sigaltstack())
 - performance/utility (creating a new libc API for makecontext())
 - security (token schemes to guarantee only one user of a shadow 
             stack at a time)

On the x86 kernel side, we had our hands tied a bit by the existing
userspace, and kind of ended up with a mix. I imagined that we might
see demand along one of those axes after some real world use. At which
point we could add some opt-in ABI tweaks to help.

If ARM is thinking of doing things differently than x86, you might
think about how you weight those tradeoffs. Like, it might be silly to
worry about clone() support if something else ends up breaking
compatibility majorly. But, it might be worthwhile it you end up going
to the proposed extremes around signal alt stacks, to maximize
compatibility

Also then maybe x86 could copy the ARM ABI some day, if it ends up
chasing the tradeoff people prefer. It probably goes without saying
that the closer these features behave from the app developer
perspective, the better. So a different ABI than x86 that also targets
a mix would be a bit unfortunate. (not the end of the world though)

Anyway, just wanted to share some perspective there. Sorry for joining
this thread so late.
Mark Brown Oct. 3, 2023, 1:38 p.m. UTC | #15
On Mon, Oct 02, 2023 at 09:43:25PM +0000, Edgecombe, Rick P wrote:

> If ARM is thinking of doing things differently than x86, you might
> think about how you weight those tradeoffs. Like, it might be silly to
> worry about clone() support if something else ends up breaking
> compatibility majorly. But, it might be worthwhile it you end up going
> to the proposed extremes around signal alt stacks, to maximize
> compatibility

Yeah, I think Catalin's thinking here was that we're quite a way out
from actual hardware so it's much more tractable to fix up callers than
it is for x86 where the hardware is widely available.

> Also then maybe x86 could copy the ARM ABI some day, if it ends up
> chasing the tradeoff people prefer. It probably goes without saying
> that the closer these features behave from the app developer
> perspective, the better. So a different ABI than x86 that also targets
> a mix would be a bit unfortunate. (not the end of the world though)

If nothing else even if we end up being stricter about things it would
be extremely disappointing if we ended up with something where code for
arm64 won't run when built for x86.
Mark Brown Oct. 6, 2023, 1:23 p.m. UTC | #16
On Fri, Oct 06, 2023 at 07:29:45AM -0500, Eric W. Biederman wrote:
> Mark Brown <broonie@kernel.org> writes:

> >> It's not just the default size that I dislike (I think the x86
> >> RLIMIT_STACK or clone3() stack_size is probably good enough) but the
> >> kernel allocating the shadow stack and inserting it into the user
> >> address space. The actual thread stack is managed by the user but the
> >> shadow stack is not (and we don't do this very often). Anyway, I don't
> >> have a better solution for direct uses of clone() or clone3(), other
> >> than running those threads with the shadow stack disabled. Not sure
> >> that's desirable.

> > Running threads with the shadow stack disabled if they don't explicitly
> > request it feels like it's asking for trouble - as well as the escape
> > route from the protection it'd provide I'd expect there to be trouble
> > for things that do stack pivots, potentially random issues if there's a
> > mix of ways threads are started.  It's going to be a tradeoff whatever
> > we do.

> Something I haven't seen in the discussion is that one of the ways I
> have seen a non-libc clone used is to implement a fork with flags.
> That is a new mm is created, and effectively a new process.  Which
> makes the characterization different.

> In general creating a thread with clone and bypassing libc is
> incompatible with pthreads, and the caller gets to keep both pieces.

> As long as there is enough information code can detect that
> shadow stacks are in use, and the code is able to create their own
> I don't see why it shouldn't be the callers responsibility.

> On the other hand I don't see the maintainer of clone Christian Brauner
> or the libc folks especially Florian cc'd on this thread.  So I really
> don't think you have the right folks in on this conversation.

Well, copying them in now.  The discussion here is about allocation of
shadow stacks for the arm64 implementation of the feature (the arm64
feature is called Guarded Control Stack in the architecture).  These
maintain a second copy of the stack with only the return targets in
memory allocated with special protections so userspace can't write to it
directly and use this when doing returns to ensure that the returns
haven't been redirected.  These shadow stacks can be allocated directly
by userspace using a new system call map_shadow_stack(), doing this via
mmap() was extensively discussed but it was concluded that this was very
likely to lead to security problems so we've got this new syscall that
ensures that shadow stack memory is never accessible to userspace via
other means.

The x86 implementation that has already been merged into mainline will
allocate a new shadow stack for newly created threads when the creating
thread has one.  There was a suggestion to have arm64 diverge and
require that threads be created with clone3() and manualy provide a
shadow stack but then concerns were raised that as well as the issues
with divergence this would be too disruptive for adoption due to
non-libc thread creation.  It's not controversial that it'd be good to
have clone3() by able to explicitly specify a shadow stack, just if it
should be required.
Mark Brown Oct. 19, 2023, 5:08 p.m. UTC | #17
On Thu, Oct 05, 2023 at 06:23:10PM +0100, Catalin Marinas wrote:

> I haven't checked how many clone() or clone3() uses outside the libc are
> (I tried some quick search in Debian but did not dig into the specifics
> to see how generic that code is). I agree that having to change valid
> cases outside of libc is not ideal. Even if we have the same clone3()
> interface for x86 and arm64, we'd have other architectures that need
> #ifdef'ing.

FTR the set of Debian source packages that have references to the string
__NR_clone (which picks up clone3 too) is below.  At least some (eg,
kore) just have things that look like a copy of the syscall table rather
than things that look like calls, though equally it's likely we're
missing some.

aflplusplus
android-platform-tools
binutils-avr
box64
brltty
bubblewrap
chromium
chrony
crash
criu
crun
dietlibc
elogind
emscripten
fakeroot-ng
falcosecurity-libs
firefox
firefox-esr
flatpak
gcc-9
gcc-10
gcc-11
gcc-12
gcc-13
gcc-arm-none-eabi
gcc-snapshot
gdb-msp430
glibc
gnumach
hurd
klibc
kore
libpod
libseccomp
linux
llvm-toolchain-14
llvm-toolchain-15
llvm-toolchain-16
lxc
lxcfs
lxd
musl
newlib
notcurses
purelibc
pwntools
qemu
qt6-base
qt6-webengine
qtbase-opensource-src
qtbase-opensource-src-gles
qtwebengine-opensource-src
radare2
rumur
rustc
rust-linux-raw-sys
rust-rustix
strace
stress-ng
swtpm
systemd
systemtap
termpaint
thunderbird
tor
uclibc
umview
valgrind
vsftpd
wasi-libc
webkit2gtk
wpewebkit
diff mbox series

Patch

diff --git a/Documentation/arch/arm64/gcs.rst b/Documentation/arch/arm64/gcs.rst
new file mode 100644
index 000000000000..c0f43961fd4b
--- /dev/null
+++ b/Documentation/arch/arm64/gcs.rst
@@ -0,0 +1,228 @@ 
+===============================================
+Guarded Control Stack support for AArch64 Linux
+===============================================
+
+This document outlines briefly the interface provided to userspace by Linux in
+order to support use of the ARM Guarded Control Stack (GCS) feature.
+
+This is an outline of the most important features and issues only and not
+intended to be exhaustive.
+
+
+
+1.  General
+-----------
+
+* GCS is an architecture feature intended to provide greater protection
+  against return oriented programming (ROP) attacks and to simplify the
+  implementation of features that need to collect stack traces such as
+  profiling.
+
+* When GCS is enabled a separate guarded control stack is maintained by the
+  PE which is writeable only through specific GCS operations.  This
+  stores the call stack only, when a procedure call instruction is
+  performed the current PC is pushed onto the GCS and on RET the
+  address in the LR is verified against that on the top of the GCS.
+
+* When active current GCS pointer is stored in the system register
+  GCSPR_EL0.  This is readable by userspace but can only be updated
+  via specific GCS instructions.
+
+* The architecture provides instructions for switching between guarded
+  control stacks with checks to ensure that the new stack is a valid
+  target for switching.
+
+* The functionality of GCS is similar to that provided by the x86 Shadow
+  Stack feature, due to sharing of userspace interfaces the ABI refers to
+  shadow stacks rather than GCS.
+
+* Support for GCS is reported to userspace via HWCAP2_GCS in the aux vector
+  AT_HWCAP2 entry.
+
+* GCS is enabled per thread.  While there is support for disabling GCS
+  at runtime this should be done with great care.
+
+* GCS memory access faults are reported as normal memory access faults.
+
+* GCS specific errors (those reported with EC 0x2d) will be reported as
+  SIGSEGV with a si_code of SEGV_CPERR (control protection error).
+
+* GCS is supported only for AArch64.
+
+* On systems where GCS is supported GCSPR_EL0 is always readable by EL0
+  regardless of the GCS configuration for the thread.
+
+* The architecture supports enabling GCS without verifying that return values
+  in LR match those in the GCS, the LR will be ignored.  This is not supported
+  by Linux.
+
+* EL0 GCS entries with bit 63 set are reserved for use, one such use is defined
+  below for signals and should be ignored when parsing the stack if not
+  understood.
+
+
+2.  Enabling and disabling Guarded Control Stacks
+-------------------------------------------------
+
+* GCS is enabled and disabled for a thread via the PR_SET_SHADOW_STACK_STATUS
+  prctl(), this takes a single flags argument specifying which GCS features
+  should be used.
+
+* When set PR_SHADOW_STACK_ENABLE flag allocates a Guarded Control Stack for
+  and enables GCS for the thread, enabling the functionality controlled by
+  GCSPRE0_EL1.{nTR, RVCHKEN, PCRSEL}.
+
+* When set the PR_SHADOW_STACK_PUSH flag enables the functionality controlled
+  by GCSCRE0_EL1.PUSHMEn, allowing explicit GCS pushes.
+
+* When set the PR_SHADOW_STACK_WRITE flag enables the functionality controlled
+  by GCSCRE0_EL1.STREn, allowing explicit stores to the Guarded Control Stack.
+
+* Any unknown flags will cause PR_SET_SHADOW_STACK_STATUS to return -EINVAL.
+
+* PR_LOCK_SHADOW_STACK_STATUS is passed a bitmask of features with the same
+  values as used for PR_SET_SHADOW_STACK_STATUS.  Any future changes to the
+  status of the specified GCS mode bits will be rejected.
+
+* PR_LOCK_SHADOW_STACK_STATUS allows any bit to be locked, this allows
+  userspace to prevent changes to any future features.
+
+* PR_SET_SHADOW_STACK_STATUS and PR_LOCK_SHADOW_STACK_STATUS affect only the
+  thread the called them, any other running threads will be unaffected.
+
+* New threads inherit the GCS configuration of the thread that created them.
+
+* GCS is disabled on exec().
+
+* The current GCS configuration for a thread may be read with the
+  PR_GET_SHADOW_STACK_STATUS prctl(), this returns the same flags that
+  are passed to PR_SET_SHADOW_STACK_STATUS.
+
+* If GCS is disabled for a thread after having previously been enabled then
+  the stack will remain allocated for the lifetime of the thread.  At present
+  any attempt to reenable GCS for the thread will be rejected, this may be
+  revisited in future.
+
+* It should be noted that since enabling GCS will result in GCS becoming
+  active immediately it is not normally possible to return from the function
+  that invoked the prctl() that enabled GCS.  It is expected that the normal
+  usage will be that GCS is enabled very early in execution of a program.
+
+
+
+3.  Allocation of Guarded Control Stacks
+----------------------------------------
+
+* When GCS is enabled for a thread a new Guarded Control Stack will be
+  allocated for it of size RLIMIT_STACK / 2 or 2 gigabytes, whichever is
+  smaller.
+
+* When a new thread is created by a thread which has GCS enabled then a
+  new Guarded Control Stack will be allocated for the new thread with
+  half the size of the standard stack.
+
+* When a stack is allocated by enabling GCS or during thread creation then
+  the top 8 bytes of the stack will be initialised to 0 and GCSPR_EL0 will
+  be set to point to the address of this 0 value, this can be used to
+  detect the top of the stack.
+
+* Additional Guarded Control Stacks can be allocated using the
+  map_shadow_stack() system call.
+
+* Stacks allocated using map_shadow_stack() can optionally have an end of
+  stack marker and cap placed at the top of the stack.  If the flag
+  SHADOW_STACK_SET_TOKEN is specified a cap will be placed on the stack,
+  if SHADOW_STACK_SET_MARKER is not specified the cap will be the top 8
+  bytes of the stack and if it is specified then the cap will be the next
+  8 bytes.  While specifying just SHADOW_STACK_SET_MARKER by itself is
+  valid since the marker is all bits 0 it has no observable effect.
+
+* Stacks allocated using map_shadow_stack() must be larger than 16 bytes and
+  must be 16 bytes aligned.
+
+* When GCS is disabled for a thread the Guarded Control Stack initially
+  allocated for that thread will be freed.  Note carefully that if the
+  stack has been switched this may not be the stack currently in use by
+  the thread.
+
+
+4.  Signal handling
+--------------------
+
+* A new signal frame record gcs_context encodes the current GCS mode and
+  pointer for the interrupted context on signal delivery.  This will always
+  be present on systems that support GCS.
+
+* The record contains a flag field which reports the current GCS configuration
+  for the interrupted context as PR_GET_SHADOW_STACK_STATUS would.
+
+* The signal handler is run with the same GCS configuration as the interrupted
+  context.
+
+* When GCS is enabled for the interrupted thread a signal handling specific
+  GCS cap token will be written to the GCS, this is an architectural GCS cap
+  token with bit 63 set.  The GCSPR_EL0 reported in the signal frame will
+  point to this cap token.
+
+* The signal handler will use the same GCS as the interrupted context.
+
+* When GCS is enabled on signal entry a frame with the address of the signal
+  return handler will be pushed onto the GCS, allowing return from the signal
+  handler via RET as normal.  This will not be reported in the gcs_context in
+  the signal frame.
+
+
+5.  Signal return
+-----------------
+
+When returning from a signal handler:
+
+* If there is a gcs_context record in the signal frame then the GCS flags
+  and GCSPR_EL0 will be restored from that context prior to further
+  validation.
+
+* If there is no gcs_context record in the signal frame then the GCS
+  configuration will be unchanged.
+
+* If GCS is enabled on return from a signal handler then GCSPR_EL0 must
+  point to a valid GCS signal cap record, this will be popped from the
+  GCS prior to signal return.
+
+* If the GCS configuration is locked when returning from a signal then any
+  attempt to change the GCS configuration will be treated as an error.  This
+  is true even if GCS was not enabled prior to signal entry.
+
+* GCS may be disabled via signal return but any attempt to enable GCS via
+  signal return will be rejected.
+
+
+7.  ptrace extensions
+---------------------
+
+* A new regset NT_ARM_GCS is defined for use with PTRACE_GETREGSET and
+  PTRACE_SETREGSET.
+
+* Due to the complexity surrounding allocation and deallocation of stacks and
+  lack of practical application it is not possible to enable GCS via ptrace.
+  GCS may be disabled via the ptrace interface.
+
+* Other GCS modes may be configured via ptrace.
+
+* Configuration via ptrace ignores locking of GCS mode bits.
+
+
+8.  ELF coredump extensions
+---------------------------
+
+* NT_ARM_GCS notes will be added to each coredump for each thread of the
+  dumped process.  The contents will be equivalent to the data that would
+  have been read if a PTRACE_GETREGSET of the corresponding type were
+  executed for each thread when the coredump was generated.
+
+
+
+9.  /proc extensions
+--------------------
+
+* Guarded Control Stack pages will include "ss" in their VmFlags in
+  /proc/<pid>/smaps.
diff --git a/Documentation/arch/arm64/index.rst b/Documentation/arch/arm64/index.rst
index d08e924204bf..dcf3ee3eb8c0 100644
--- a/Documentation/arch/arm64/index.rst
+++ b/Documentation/arch/arm64/index.rst
@@ -14,6 +14,7 @@  ARM64 Architecture
     booting
     cpu-feature-registers
     elf_hwcaps
+    gcs
     hugetlbpage
     kdump
     legacy_instructions