diff mbox series

[RFC,please,help] membarrier: Rewrite sync_core_before_usermode()

Message ID bf59ecb5487171a852bcc8cdd553ec797aedc485.1609093476.git.luto@kernel.org
State New
Headers show
Series [RFC,please,help] membarrier: Rewrite sync_core_before_usermode() | expand

Commit Message

Andy Lutomirski Dec. 27, 2020, 6:28 p.m. UTC
The old sync_core_before_usermode() comments said that a non-icache-syncing
return-to-usermode instruction is x86-specific and that all other
architectures automatically notice cross-modified code on return to
userspace.  Based on my general understanding of how CPUs work and based on
my attempt to read the ARM manual, this is not true at all.  In fact, x86
seems to be a bit of an anomaly in the other direction: x86's IRET is
unusually heavyweight for a return-to-usermode instruction.

So let's drop any pretense that we can have a generic implementation
behind membarrier's SYNC_CORE flush and require all architectures that opt
in to supply their own.  This means x86, arm64, and powerpc for now.  Let's
also rename the function from sync_core_before_usermode() to
membarrier_sync_core_before_usermode() because the precise flushing details
may very well be specific to membarrier, and even the concept of
"sync_core" in the kernel is mostly an x86-ism.

I admit that I'm rather surprised that the code worked at all on arm64,
and I'm suspicious that it has never been very well tested.  My apologies
for not reviewing this more carefully in the first place.

Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: x86@kernel.org
Cc: stable@vger.kernel.org
Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---

Hi arm64 and powerpc people-

This is part of a series here:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/fixes

Before I send out the whole series, I'm hoping that some arm64 and powerpc
people can help me verify that I did this patch right.  Once I get
some feedback on this patch, I'll send out the whole pile.  And once
*that's* done, I'll start giving the mm lazy stuff some serious thought.

The x86 part is already fixed in Linus' tree.

Thanks,
Andy

 arch/arm64/include/asm/sync_core.h   | 21 +++++++++++++++++++++
 arch/powerpc/include/asm/sync_core.h | 20 ++++++++++++++++++++
 arch/x86/Kconfig                     |  1 -
 arch/x86/include/asm/sync_core.h     |  7 +++----
 include/linux/sched/mm.h             |  1 -
 include/linux/sync_core.h            | 21 ---------------------
 init/Kconfig                         |  3 ---
 kernel/sched/membarrier.c            | 15 +++++++++++----
 8 files changed, 55 insertions(+), 34 deletions(-)
 create mode 100644 arch/arm64/include/asm/sync_core.h
 create mode 100644 arch/powerpc/include/asm/sync_core.h
 delete mode 100644 include/linux/sync_core.h

Comments

Mathieu Desnoyers Dec. 27, 2020, 8:18 p.m. UTC | #1
----- On Dec 27, 2020, at 1:28 PM, Andy Lutomirski luto@kernel.org wrote:

> The old sync_core_before_usermode() comments said that a non-icache-syncing
> return-to-usermode instruction is x86-specific and that all other
> architectures automatically notice cross-modified code on return to
> userspace.  Based on my general understanding of how CPUs work and based on
> my attempt to read the ARM manual, this is not true at all.  In fact, x86
> seems to be a bit of an anomaly in the other direction: x86's IRET is
> unusually heavyweight for a return-to-usermode instruction.
> 
> So let's drop any pretense that we can have a generic implementation
> behind membarrier's SYNC_CORE flush and require all architectures that opt
> in to supply their own.

Removing the generic implementation is OK with me, as this will really require
architecture maintainers to think hard about it when porting this feature.

> This means x86, arm64, and powerpc for now.  Let's
> also rename the function from sync_core_before_usermode() to
> membarrier_sync_core_before_usermode() because the precise flushing details
> may very well be specific to membarrier, and even the concept of
> "sync_core" in the kernel is mostly an x86-ism.

Works for me too.

> 
> I admit that I'm rather surprised that the code worked at all on arm64,
> and I'm suspicious that it has never been very well tested.  My apologies
> for not reviewing this more carefully in the first place.

Please refer to Documentation/features/sched/membarrier-sync-core/arch-support.txt

It clearly states that only arm, arm64, powerpc and x86 support the membarrier
sync core feature as of now:


# Architecture requirements
#
# * arm/arm64/powerpc
#
# Rely on implicit context synchronization as a result of exception return
# when returning from IPI handler, and when returning to user-space.
#
# * x86
#
# x86-32 uses IRET as return from interrupt, which takes care of the IPI.
# However, it uses both IRET and SYSEXIT to go back to user-space. The IRET
# instruction is core serializing, but not SYSEXIT.
#
# x86-64 uses IRET as return from interrupt, which takes care of the IPI.
# However, it can return to user-space through either SYSRETL (compat code),
# SYSRETQ, or IRET.
#
# Given that neither SYSRET{L,Q}, nor SYSEXIT, are core serializing, we rely
# instead on write_cr3() performed by switch_mm() to provide core serialization
# after changing the current mm, and deal with the special case of kthread ->
# uthread (temporarily keeping current mm into active_mm) by issuing a
# sync_core_before_usermode() in that specific case.

This is based on direct feedback from the architecture maintainers.

You seem to have noticed odd cases on arm64 where this guarantee does not
match reality. Where exactly can we find this in the code, and which part
of the architecture manual can you point us to which supports your concern ?

Based on the notes I have, use of `eret` on aarch64 guarantees a context
synchronizing event when returning to user-space.

Thanks,

Mathieu

> 
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: x86@kernel.org
> Cc: stable@vger.kernel.org
> Fixes: 70216e18e519 ("membarrier: Provide core serializing command,
> *_SYNC_CORE")
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> 
> Hi arm64 and powerpc people-
> 
> This is part of a series here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/fixes
> 
> Before I send out the whole series, I'm hoping that some arm64 and powerpc
> people can help me verify that I did this patch right.  Once I get
> some feedback on this patch, I'll send out the whole pile.  And once
> *that's* done, I'll start giving the mm lazy stuff some serious thought.
> 
> The x86 part is already fixed in Linus' tree.
> 
> Thanks,
> Andy
> 
> arch/arm64/include/asm/sync_core.h   | 21 +++++++++++++++++++++
> arch/powerpc/include/asm/sync_core.h | 20 ++++++++++++++++++++
> arch/x86/Kconfig                     |  1 -
> arch/x86/include/asm/sync_core.h     |  7 +++----
> include/linux/sched/mm.h             |  1 -
> include/linux/sync_core.h            | 21 ---------------------
> init/Kconfig                         |  3 ---
> kernel/sched/membarrier.c            | 15 +++++++++++----
> 8 files changed, 55 insertions(+), 34 deletions(-)
> create mode 100644 arch/arm64/include/asm/sync_core.h
> create mode 100644 arch/powerpc/include/asm/sync_core.h
> delete mode 100644 include/linux/sync_core.h
> 
> diff --git a/arch/arm64/include/asm/sync_core.h
> b/arch/arm64/include/asm/sync_core.h
> new file mode 100644
> index 000000000000..5be4531caabd
> --- /dev/null
> +++ b/arch/arm64/include/asm/sync_core.h
> @@ -0,0 +1,21 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_ARM64_SYNC_CORE_H
> +#define _ASM_ARM64_SYNC_CORE_H
> +
> +#include <asm/barrier.h>
> +
> +/*
> + * Ensure that the CPU notices any instruction changes before the next time
> + * it returns to usermode.
> + */
> +static inline void membarrier_sync_core_before_usermode(void)
> +{
> +	/*
> +	 * XXX: is this enough or do we need a DMB first to make sure that
> +	 * writes from other CPUs become visible to this CPU?  We have an
> +	 * smp_mb() already, but that's not quite the same thing.
> +	 */
> +	isb();
> +}
> +
> +#endif /* _ASM_ARM64_SYNC_CORE_H */
> diff --git a/arch/powerpc/include/asm/sync_core.h
> b/arch/powerpc/include/asm/sync_core.h
> new file mode 100644
> index 000000000000..71dfbe7794e5
> --- /dev/null
> +++ b/arch/powerpc/include/asm/sync_core.h
> @@ -0,0 +1,20 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_POWERPC_SYNC_CORE_H
> +#define _ASM_POWERPC_SYNC_CORE_H
> +
> +#include <asm/barrier.h>
> +
> +/*
> + * Ensure that the CPU notices any instruction changes before the next time
> + * it returns to usermode.
> + */
> +static inline void membarrier_sync_core_before_usermode(void)
> +{
> +	/*
> +	 * XXX: I know basically nothing about powerpc cache management.
> +	 * Is this correct?
> +	 */
> +	isync();
> +}
> +
> +#endif /* _ASM_POWERPC_SYNC_CORE_H */
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index b5137cc5b7b4..895f70fd4a61 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -81,7 +81,6 @@ config X86
> 	select ARCH_HAS_SET_DIRECT_MAP
> 	select ARCH_HAS_STRICT_KERNEL_RWX
> 	select ARCH_HAS_STRICT_MODULE_RWX
> -	select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> 	select ARCH_HAS_SYSCALL_WRAPPER
> 	select ARCH_HAS_UBSAN_SANITIZE_ALL
> 	select ARCH_HAS_DEBUG_WX
> diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
> index ab7382f92aff..c665b453969a 100644
> --- a/arch/x86/include/asm/sync_core.h
> +++ b/arch/x86/include/asm/sync_core.h
> @@ -89,11 +89,10 @@ static inline void sync_core(void)
> }
> 
> /*
> - * Ensure that a core serializing instruction is issued before returning
> - * to user-mode. x86 implements return to user-space through sysexit,
> - * sysrel, and sysretq, which are not core serializing.
> + * Ensure that the CPU notices any instruction changes before the next time
> + * it returns to usermode.
>  */
> -static inline void sync_core_before_usermode(void)
> +static inline void membarrier_sync_core_before_usermode(void)
> {
> 	/* With PTI, we unconditionally serialize before running user code. */
> 	if (static_cpu_has(X86_FEATURE_PTI))
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 48640db6ca86..81ba47910a73 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -7,7 +7,6 @@
> #include <linux/sched.h>
> #include <linux/mm_types.h>
> #include <linux/gfp.h>
> -#include <linux/sync_core.h>
> 
> /*
>  * Routines for handling mm_structs
> diff --git a/include/linux/sync_core.h b/include/linux/sync_core.h
> deleted file mode 100644
> index 013da4b8b327..000000000000
> --- a/include/linux/sync_core.h
> +++ /dev/null
> @@ -1,21 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0 */
> -#ifndef _LINUX_SYNC_CORE_H
> -#define _LINUX_SYNC_CORE_H
> -
> -#ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> -#include <asm/sync_core.h>
> -#else
> -/*
> - * This is a dummy sync_core_before_usermode() implementation that can be used
> - * on all architectures which return to user-space through core serializing
> - * instructions.
> - * If your architecture returns to user-space through non-core-serializing
> - * instructions, you need to write your own functions.
> - */
> -static inline void sync_core_before_usermode(void)
> -{
> -}
> -#endif
> -
> -#endif /* _LINUX_SYNC_CORE_H */
> -
> diff --git a/init/Kconfig b/init/Kconfig
> index c9446911cf41..eb9772078cd4 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -2334,9 +2334,6 @@ source "kernel/Kconfig.locks"
> config ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
> 	bool
> 
> -config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> -	bool
> -
> # It may be useful for an architecture to override the definitions of the
> # SYSCALL_DEFINE() and __SYSCALL_DEFINEx() macros in <linux/syscalls.h>
> # and the COMPAT_ variants in <linux/compat.h>, in particular to use a
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index b3a82d7635da..db4945e1ec94 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -5,6 +5,9 @@
>  * membarrier system call
>  */
> #include "sched.h"
> +#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
> +#include <asm/sync_core.h>
> +#endif
> 
> /*
>  * The basic principle behind the regular memory barrier mode of membarrier()
> @@ -221,6 +224,7 @@ static void ipi_mb(void *info)
> 	smp_mb();	/* IPIs should be serializing but paranoid. */
> }
> 
> +#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
> static void ipi_sync_core(void *info)
> {
> 	/*
> @@ -230,13 +234,14 @@ static void ipi_sync_core(void *info)
> 	 * the big comment at the top of this file.
> 	 *
> 	 * A sync_core() would provide this guarantee, but
> -	 * sync_core_before_usermode() might end up being deferred until
> -	 * after membarrier()'s smp_mb().
> +	 * membarrier_sync_core_before_usermode() might end up being deferred
> +	 * until after membarrier()'s smp_mb().
> 	 */
> 	smp_mb();	/* IPIs should be serializing but paranoid. */
> 
> -	sync_core_before_usermode();
> +	membarrier_sync_core_before_usermode();
> }
> +#endif
> 
> static void ipi_rseq(void *info)
> {
> @@ -368,12 +373,14 @@ static int membarrier_private_expedited(int flags, int
> cpu_id)
> 	smp_call_func_t ipi_func = ipi_mb;
> 
> 	if (flags == MEMBARRIER_FLAG_SYNC_CORE) {
> -		if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
> +#ifndef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
> 			return -EINVAL;
> +#else
> 		if (!(atomic_read(&mm->membarrier_state) &
> 		      MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
> 			return -EPERM;
> 		ipi_func = ipi_sync_core;
> +#endif
> 	} else if (flags == MEMBARRIER_FLAG_RSEQ) {
> 		if (!IS_ENABLED(CONFIG_RSEQ))
> 			return -EINVAL;
> --
> 2.29.2
Russell King (Oracle) Dec. 28, 2020, 10:25 a.m. UTC | #2
On Sun, Dec 27, 2020 at 01:36:13PM -0800, Andy Lutomirski wrote:
> On Sun, Dec 27, 2020 at 12:18 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
> >
> > ----- On Dec 27, 2020, at 1:28 PM, Andy Lutomirski luto@kernel.org wrote:
> >
> > > I admit that I'm rather surprised that the code worked at all on arm64,
> > > and I'm suspicious that it has never been very well tested.  My apologies
> > > for not reviewing this more carefully in the first place.
> >
> > Please refer to Documentation/features/sched/membarrier-sync-core/arch-support.txt
> >
> > It clearly states that only arm, arm64, powerpc and x86 support the membarrier
> > sync core feature as of now:
>
> Sigh, I missed arm (32).  Russell or ARM folks, what's the right
> incantation to make the CPU notice instruction changes initiated by
> other cores on 32-bit ARM?

You need to call flush_icache_range(), since the changes need to be
flushed from the data cache to the point of unification (of the Harvard
I and D), and the instruction cache needs to be invalidated so it can
then see those updated instructions. This will also take care of the
necessary barriers that the CPU requires for you.

... as documented in Documentation/core-api/cachetlb.rst and so
should be available on every kernel supported CPU.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
Andy Lutomirski Dec. 28, 2020, 5:14 p.m. UTC | #3
On Mon, Dec 28, 2020 at 2:25 AM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> On Sun, Dec 27, 2020 at 01:36:13PM -0800, Andy Lutomirski wrote:
> > On Sun, Dec 27, 2020 at 12:18 PM Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> > >
> > > ----- On Dec 27, 2020, at 1:28 PM, Andy Lutomirski luto@kernel.org wrote:
> > >
> > > > I admit that I'm rather surprised that the code worked at all on arm64,
> > > > and I'm suspicious that it has never been very well tested.  My apologies
> > > > for not reviewing this more carefully in the first place.
> > >
> > > Please refer to Documentation/features/sched/membarrier-sync-core/arch-support.txt
> > >
> > > It clearly states that only arm, arm64, powerpc and x86 support the membarrier
> > > sync core feature as of now:
> >
> > Sigh, I missed arm (32).  Russell or ARM folks, what's the right
> > incantation to make the CPU notice instruction changes initiated by
> > other cores on 32-bit ARM?
>
> You need to call flush_icache_range(), since the changes need to be
> flushed from the data cache to the point of unification (of the Harvard
> I and D), and the instruction cache needs to be invalidated so it can
> then see those updated instructions. This will also take care of the
> necessary barriers that the CPU requires for you.

With what parameters?   From looking at the header, this is for the
case in which the kernel writes some memory and then intends to
execute it.  That's not what membarrier() does at all.  membarrier()
works like this:

User thread 1:

write to RWX memory *or* write to an RW alias of an X region.
membarrier(...);
somehow tell thread 2 that we're ready (with a store release, perhaps,
or even just a relaxed store.)

User thread 2:

wait for the indication from thread 1.
barrier();
jump to the code.

membarrier() is, for better or for worse, not given a range of addresses.

On x86, the documentation is a bit weak, but a strict reading
indicates that thread 2 must execute a serializing instruction at some
point after thread 1 writes the code and before thread 2 runs it.
membarrier() does this by sending an IPI and ensuring that a
"serializing" instruction (thanks for great terminology, Intel) is
executed.  Note that flush_icache_range() is a no-op on x86, and I've
asked Intel's architects to please clarify their precise rules.  No
response yet.

On arm64, flush_icache_range() seems to send an IPI, and that's not
what I want.  membarrier() already does an IPI.

I suppose one option is to revert support for this membarrier()
operation on arm, but it would be nice to just implement it instead.

--Andy
Russell King (Oracle) Dec. 28, 2020, 5:23 p.m. UTC | #4
On Mon, Dec 28, 2020 at 09:14:23AM -0800, Andy Lutomirski wrote:
> On Mon, Dec 28, 2020 at 2:25 AM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> >
> > On Sun, Dec 27, 2020 at 01:36:13PM -0800, Andy Lutomirski wrote:
> > > On Sun, Dec 27, 2020 at 12:18 PM Mathieu Desnoyers
> > > <mathieu.desnoyers@efficios.com> wrote:
> > > >
> > > > ----- On Dec 27, 2020, at 1:28 PM, Andy Lutomirski luto@kernel.org wrote:
> > > >
> > > > > I admit that I'm rather surprised that the code worked at all on arm64,
> > > > > and I'm suspicious that it has never been very well tested.  My apologies
> > > > > for not reviewing this more carefully in the first place.
> > > >
> > > > Please refer to Documentation/features/sched/membarrier-sync-core/arch-support.txt
> > > >
> > > > It clearly states that only arm, arm64, powerpc and x86 support the membarrier
> > > > sync core feature as of now:
> > >
> > > Sigh, I missed arm (32).  Russell or ARM folks, what's the right
> > > incantation to make the CPU notice instruction changes initiated by
> > > other cores on 32-bit ARM?
> >
> > You need to call flush_icache_range(), since the changes need to be
> > flushed from the data cache to the point of unification (of the Harvard
> > I and D), and the instruction cache needs to be invalidated so it can
> > then see those updated instructions. This will also take care of the
> > necessary barriers that the CPU requires for you.
>
> With what parameters?   From looking at the header, this is for the
> case in which the kernel writes some memory and then intends to
> execute it.  That's not what membarrier() does at all.  membarrier()
> works like this:

You didn't specify that you weren't looking at kernel memory.

If you're talking about userspace, then the interface you require
is flush_icache_user_range(), which does the same as
flush_icache_range() but takes userspace addresses. Note that this
requires that the memory is currently mapped at those userspace
addresses.

If that doesn't fit your needs, there isn't an interface to do what
you require, and it basically means creating something brand new
on every architecture.

What you are asking for is not "just a matter of a few instructions".
I have stated the required steps to achieve what you require above;
that is the minimum when you have non-snooping harvard caches, which
the majority of 32-bit ARMs have.

> User thread 1:
>
> write to RWX memory *or* write to an RW alias of an X region.
> membarrier(...);
> somehow tell thread 2 that we're ready (with a store release, perhaps,
> or even just a relaxed store.)
>
> User thread 2:
>
> wait for the indication from thread 1.
> barrier();
> jump to the code.
>
> membarrier() is, for better or for worse, not given a range of addresses.

Then, I'm sorry, it can't work on 32-bit ARM.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
Andy Lutomirski Dec. 28, 2020, 6:10 p.m. UTC | #5
On Mon, Dec 28, 2020 at 9:23 AM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> On Mon, Dec 28, 2020 at 09:14:23AM -0800, Andy Lutomirski wrote:
> > On Mon, Dec 28, 2020 at 2:25 AM Russell King - ARM Linux admin
> > <linux@armlinux.org.uk> wrote:
> > >
> > > On Sun, Dec 27, 2020 at 01:36:13PM -0800, Andy Lutomirski wrote:
> > > > On Sun, Dec 27, 2020 at 12:18 PM Mathieu Desnoyers
> > > > <mathieu.desnoyers@efficios.com> wrote:
> > > > >
> > > > > ----- On Dec 27, 2020, at 1:28 PM, Andy Lutomirski luto@kernel.org wrote:
> > > > >
> > > > > > I admit that I'm rather surprised that the code worked at all on arm64,
> > > > > > and I'm suspicious that it has never been very well tested.  My apologies
> > > > > > for not reviewing this more carefully in the first place.
> > > > >
> > > > > Please refer to Documentation/features/sched/membarrier-sync-core/arch-support.txt
> > > > >
> > > > > It clearly states that only arm, arm64, powerpc and x86 support the membarrier
> > > > > sync core feature as of now:
> > > >
> > > > Sigh, I missed arm (32).  Russell or ARM folks, what's the right
> > > > incantation to make the CPU notice instruction changes initiated by
> > > > other cores on 32-bit ARM?
> > >
> > > You need to call flush_icache_range(), since the changes need to be
> > > flushed from the data cache to the point of unification (of the Harvard
> > > I and D), and the instruction cache needs to be invalidated so it can
> > > then see those updated instructions. This will also take care of the
> > > necessary barriers that the CPU requires for you.
> >
> > With what parameters?   From looking at the header, this is for the
> > case in which the kernel writes some memory and then intends to
> > execute it.  That's not what membarrier() does at all.  membarrier()
> > works like this:
>
> You didn't specify that you weren't looking at kernel memory.
>
> If you're talking about userspace, then the interface you require
> is flush_icache_user_range(), which does the same as
> flush_icache_range() but takes userspace addresses. Note that this
> requires that the memory is currently mapped at those userspace
> addresses.
>
> If that doesn't fit your needs, there isn't an interface to do what
> you require, and it basically means creating something brand new
> on every architecture.
>
> What you are asking for is not "just a matter of a few instructions".
> I have stated the required steps to achieve what you require above;
> that is the minimum when you have non-snooping harvard caches, which
> the majority of 32-bit ARMs have.
>
> > User thread 1:
> >
> > write to RWX memory *or* write to an RW alias of an X region.
> > membarrier(...);
> > somehow tell thread 2 that we're ready (with a store release, perhaps,
> > or even just a relaxed store.)
> >
> > User thread 2:
> >
> > wait for the indication from thread 1.
> > barrier();
> > jump to the code.
> >
> > membarrier() is, for better or for worse, not given a range of addresses.
>
> Then, I'm sorry, it can't work on 32-bit ARM.

Is there a way to flush the *entire* user icache?  If so, and if it
has reasonable performance, then it could probably be used here.
Otherwise I'll just send a revert for this whole mechanism on 32-bit
ARM.

--Andy
Jann Horn Dec. 28, 2020, 6:29 p.m. UTC | #6
On Mon, Dec 28, 2020 at 6:14 PM Andy Lutomirski <luto@kernel.org> wrote:
> On Mon, Dec 28, 2020 at 2:25 AM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> >
> > On Sun, Dec 27, 2020 at 01:36:13PM -0800, Andy Lutomirski wrote:
> > > On Sun, Dec 27, 2020 at 12:18 PM Mathieu Desnoyers
> > > <mathieu.desnoyers@efficios.com> wrote:
> > > >
> > > > ----- On Dec 27, 2020, at 1:28 PM, Andy Lutomirski luto@kernel.org wrote:
> > > >
> > > > > I admit that I'm rather surprised that the code worked at all on arm64,
> > > > > and I'm suspicious that it has never been very well tested.  My apologies
> > > > > for not reviewing this more carefully in the first place.
> > > >
> > > > Please refer to Documentation/features/sched/membarrier-sync-core/arch-support.txt
> > > >
> > > > It clearly states that only arm, arm64, powerpc and x86 support the membarrier
> > > > sync core feature as of now:
> > >
> > > Sigh, I missed arm (32).  Russell or ARM folks, what's the right
> > > incantation to make the CPU notice instruction changes initiated by
> > > other cores on 32-bit ARM?
> >
> > You need to call flush_icache_range(), since the changes need to be
> > flushed from the data cache to the point of unification (of the Harvard
> > I and D), and the instruction cache needs to be invalidated so it can
> > then see those updated instructions. This will also take care of the
> > necessary barriers that the CPU requires for you.
>
> With what parameters?   From looking at the header, this is for the
> case in which the kernel writes some memory and then intends to
> execute it.  That's not what membarrier() does at all.  membarrier()
> works like this:
>
> User thread 1:
>
> write to RWX memory *or* write to an RW alias of an X region.
> membarrier(...);
> somehow tell thread 2 that we're ready (with a store release, perhaps,
> or even just a relaxed store.)
>
> User thread 2:
>
> wait for the indication from thread 1.
> barrier();
> jump to the code.
>
> membarrier() is, for better or for worse, not given a range of addresses.
>
> On x86, the documentation is a bit weak, but a strict reading
> indicates that thread 2 must execute a serializing instruction at some
> point after thread 1 writes the code and before thread 2 runs it.
> membarrier() does this by sending an IPI and ensuring that a
> "serializing" instruction (thanks for great terminology, Intel) is
> executed.  Note that flush_icache_range() is a no-op on x86, and I've
> asked Intel's architects to please clarify their precise rules.  No
> response yet.
>
> On arm64, flush_icache_range() seems to send an IPI, and that's not
> what I want.  membarrier() already does an IPI.

After chatting with rmk about this (but without claiming that any of
this is his opinion), based on the manpage, I think membarrier()
currently doesn't really claim to be synchronizing caches? It just
serializes cores. So arguably if userspace wants to use membarrier()
to synchronize code changes, userspace should first do the code
change, then flush icache as appropriate for the architecture, and
then do the membarrier() to ensure that the old code is unused?

For 32-bit arm, rmk pointed out that that would be the cacheflush()
syscall. That might cause you to end up with two IPIs instead of one
in total, but we probably don't care _that_ much about extra IPIs on
32-bit arm?

For arm64, I believe userspace can flush icache across the entire
system with some instructions from userspace - "DC CVAU" followed by
"DSB ISH", or something like that, I think? (See e.g.
compat_arm_syscall(), the arm64 compat code that implements the 32-bit
arm cacheflush() syscall.)
Andy Lutomirski Dec. 28, 2020, 6:50 p.m. UTC | #7
On Mon, Dec 28, 2020 at 10:30 AM Jann Horn <jannh@google.com> wrote:
>
> On Mon, Dec 28, 2020 at 6:14 PM Andy Lutomirski <luto@kernel.org> wrote:
> > On Mon, Dec 28, 2020 at 2:25 AM Russell King - ARM Linux admin
> > <linux@armlinux.org.uk> wrote:
> > >
> > > On Sun, Dec 27, 2020 at 01:36:13PM -0800, Andy Lutomirski wrote:
> > > > On Sun, Dec 27, 2020 at 12:18 PM Mathieu Desnoyers
> > > > <mathieu.desnoyers@efficios.com> wrote:
> > > > >
> > > > > ----- On Dec 27, 2020, at 1:28 PM, Andy Lutomirski luto@kernel.org wrote:
> > > > >
> > > > > > I admit that I'm rather surprised that the code worked at all on arm64,
> > > > > > and I'm suspicious that it has never been very well tested.  My apologies
> > > > > > for not reviewing this more carefully in the first place.
> > > > >
> > > > > Please refer to Documentation/features/sched/membarrier-sync-core/arch-support.txt
> > > > >
> > > > > It clearly states that only arm, arm64, powerpc and x86 support the membarrier
> > > > > sync core feature as of now:
> > > >
> > > > Sigh, I missed arm (32).  Russell or ARM folks, what's the right
> > > > incantation to make the CPU notice instruction changes initiated by
> > > > other cores on 32-bit ARM?
> > >
> > > You need to call flush_icache_range(), since the changes need to be
> > > flushed from the data cache to the point of unification (of the Harvard
> > > I and D), and the instruction cache needs to be invalidated so it can
> > > then see those updated instructions. This will also take care of the
> > > necessary barriers that the CPU requires for you.
> >
> > With what parameters?   From looking at the header, this is for the
> > case in which the kernel writes some memory and then intends to
> > execute it.  That's not what membarrier() does at all.  membarrier()
> > works like this:
> >
> > User thread 1:
> >
> > write to RWX memory *or* write to an RW alias of an X region.
> > membarrier(...);
> > somehow tell thread 2 that we're ready (with a store release, perhaps,
> > or even just a relaxed store.)
> >
> > User thread 2:
> >
> > wait for the indication from thread 1.
> > barrier();
> > jump to the code.
> >
> > membarrier() is, for better or for worse, not given a range of addresses.
> >
> > On x86, the documentation is a bit weak, but a strict reading
> > indicates that thread 2 must execute a serializing instruction at some
> > point after thread 1 writes the code and before thread 2 runs it.
> > membarrier() does this by sending an IPI and ensuring that a
> > "serializing" instruction (thanks for great terminology, Intel) is
> > executed.  Note that flush_icache_range() is a no-op on x86, and I've
> > asked Intel's architects to please clarify their precise rules.  No
> > response yet.
> >
> > On arm64, flush_icache_range() seems to send an IPI, and that's not
> > what I want.  membarrier() already does an IPI.
>
> After chatting with rmk about this (but without claiming that any of
> this is his opinion), based on the manpage, I think membarrier()
> currently doesn't really claim to be synchronizing caches? It just
> serializes cores. So arguably if userspace wants to use membarrier()
> to synchronize code changes, userspace should first do the code
> change, then flush icache as appropriate for the architecture, and
> then do the membarrier() to ensure that the old code is unused?

I haven't the faintest clue what "serializes cores" means.  It seems
to be a bit of a mishmash of x86 SDM terminology and Linux x86
"sync_core" terminology.  The latter means very little to me, even as
an x86 person.

I'm moderately confident that the *intent* is that a multithreaded
program can write some JIT code to memory, do this membarrier()
operation, and then execute the code, and it will work.  Maybe it's
even intended to work cross-architecture without any additional help
from the program.  But maybe the existing API is simply incorrect for
this.

>
> For 32-bit arm, rmk pointed out that that would be the cacheflush()
> syscall. That might cause you to end up with two IPIs instead of one
> in total, but we probably don't care _that_ much about extra IPIs on
> 32-bit arm?
>
> For arm64, I believe userspace can flush icache across the entire
> system with some instructions from userspace - "DC CVAU" followed by
> "DSB ISH", or something like that, I think? (See e.g.
> compat_arm_syscall(), the arm64 compat code that implements the 32-bit
> arm cacheflush() syscall.)


I have no idea what DC anything does.  Based on my very cursory
reading of the manual, ISB is the right approach, but I don't pretend
I understand all the details.
Russell King (Oracle) Dec. 28, 2020, 7:08 p.m. UTC | #8
On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:
> After chatting with rmk about this (but without claiming that any of
> this is his opinion), based on the manpage, I think membarrier()
> currently doesn't really claim to be synchronizing caches? It just
> serializes cores. So arguably if userspace wants to use membarrier()
> to synchronize code changes, userspace should first do the code
> change, then flush icache as appropriate for the architecture, and
> then do the membarrier() to ensure that the old code is unused?
>
> For 32-bit arm, rmk pointed out that that would be the cacheflush()
> syscall. That might cause you to end up with two IPIs instead of one
> in total, but we probably don't care _that_ much about extra IPIs on
> 32-bit arm?
>
> For arm64, I believe userspace can flush icache across the entire
> system with some instructions from userspace - "DC CVAU" followed by
> "DSB ISH", or something like that, I think? (See e.g.
> compat_arm_syscall(), the arm64 compat code that implements the 32-bit
> arm cacheflush() syscall.)


Note that the ARM cacheflush syscall calls flush_icache_user_range()
over the range of addresses that userspace has passed - its intention
since day one has been to support cases where userspace wants to change
executable code.

It will issue the appropriate write-backs to the data cache (DCCMVAU),
the invalidates to the instruction cache (ICIMVAU), invalidate the
branch target buffer (BPIALLIS or BPIALL as appropriate), and issue
the appropriate barriers (DSB ISHST, ISB).

Note that neither flush_icache_user_range() nor flush_icache_range()
result in IPIs; cache operations are broadcast across all CPUs (which
is one of the minimums we require for SMP systems.)

Now, that all said, I think the question that has to be asked is...

	What is the basic purpose of membarrier?

Is the purpose of it to provide memory barriers, or is it to provide
memory coherence?

If it's the former and not the latter, then cache flushes are out of
scope, and expecting memory written to be visible to the instruction
stream is totally out of scope of the membarrier interface, whether
or not the writes happen on the same or a different CPU to the one
executing the rewritten code.

The documentation in the kernel does not seem to describe what it's
supposed to be doing - the only thing I could find is this:
Documentation/features/sched/membarrier-sync-core/arch-support.txt
which describes it as "arch supports core serializing membarrier"
whatever that means.

Seems to be the standard and usual case of utterly poor to non-existent
documentation within the kernel tree, or even a pointer to where any
useful documentation can be found.

Reading the membarrier(2) man page, I find nothing in there that talks
about any kind of cache coherency for self-modifying code - it only
seems to be about _barriers_ and nothing more, and barriers alone do
precisely nothing to save you from non-coherent Harvard caches.

So, either Andy has a misunderstanding, or the man page is wrong, or
my rudimentary understanding of what membarrier is supposed to be
doing is wrong...

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
Andy Lutomirski Dec. 28, 2020, 7:44 p.m. UTC | #9
On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:
[...]
> So, either Andy has a misunderstanding, or the man page is wrong, or
> my rudimentary understanding of what membarrier is supposed to be
> doing is wrong...


Look at the latest man page:

https://man7.org/linux/man-pages/man2/membarrier.2.html

for MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE.  The result may not be
all that enlightening.

--Andy
Russell King (Oracle) Dec. 28, 2020, 8:24 p.m. UTC | #10
On Mon, Dec 28, 2020 at 11:44:33AM -0800, Andy Lutomirski wrote:
> On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin

> <linux@armlinux.org.uk> wrote:

> >

> > On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:

> > > After chatting with rmk about this (but without claiming that any of

> > > this is his opinion), based on the manpage, I think membarrier()

> > > currently doesn't really claim to be synchronizing caches? It just

> > > serializes cores. So arguably if userspace wants to use membarrier()

> > > to synchronize code changes, userspace should first do the code

> > > change, then flush icache as appropriate for the architecture, and

> > > then do the membarrier() to ensure that the old code is unused?

> > >

> > > For 32-bit arm, rmk pointed out that that would be the cacheflush()

> > > syscall. That might cause you to end up with two IPIs instead of one

> > > in total, but we probably don't care _that_ much about extra IPIs on

> > > 32-bit arm?

> > >

> > > For arm64, I believe userspace can flush icache across the entire

> > > system with some instructions from userspace - "DC CVAU" followed by

> > > "DSB ISH", or something like that, I think? (See e.g.

> > > compat_arm_syscall(), the arm64 compat code that implements the 32-bit

> > > arm cacheflush() syscall.)

> >

> > Note that the ARM cacheflush syscall calls flush_icache_user_range()

> > over the range of addresses that userspace has passed - it's intention

> > since day one is to support cases where userspace wants to change

> > executable code.

> >

> > It will issue the appropriate write-backs to the data cache (DCCMVAU),

> > the invalidates to the instruction cache (ICIMVAU), invalidate the

> > branch target buffer (BPIALLIS or BPIALL as appropriate), and issue

> > the appropriate barriers (DSB ISHST, ISB).

> >

> > Note that neither flush_icache_user_range() nor flush_icache_range()

> > result in IPIs; cache operations are broadcast across all CPUs (which

> > is one of the minimums we require for SMP systems.)

> >

> > Now, that all said, I think the question that has to be asked is...

> >

> >         What is the basic purpose of membarrier?

> >

> > Is the purpose of it to provide memory barriers, or is it to provide

> > memory coherence?

> >

> > If it's the former and not the latter, then cache flushes are out of

> > scope, and expecting memory written to be visible to the instruction

> > stream is totally out of scope of the membarrier interface, whether

> > or not the writes happen on the same or a different CPU to the one

> > executing the rewritten code.

> >

> > The documentation in the kernel does not seem to describe what it's

> > supposed to be doing - the only thing I could find is this:

> > Documentation/features/sched/membarrier-sync-core/arch-support.txt

> > which describes it as "arch supports core serializing membarrier"

> > whatever that means.

> >

> > Seems to be the standard and usual case of utterly poor to non-existent

> > documentation within the kernel tree, or even a pointer to where any

> > useful documentation can be found.

> >

> > Reading the membarrier(2) man page, I find nothing in there that talks

> > about any kind of cache coherency for self-modifying code - it only

> > seems to be about _barriers_ and nothing more, and barriers alone do

> > precisely nothing to save you from non-coherent Harvard caches.

> >

> > So, either Andy has a misunderstanding, or the man page is wrong, or

> > my rudimentary understanding of what membarrier is supposed to be

> > doing is wrong...

> 

> Look at the latest man page:

> 

> https://man7.org/linux/man-pages/man2/membarrier.2.html

> 

> for MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE.  The result may not be

> all that enlightening.


       MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE (since Linux 4.16)
              In addition to providing the memory ordering guarantees
              described in MEMBARRIER_CMD_PRIVATE_EXPEDITED, upon return
              from system call the calling thread has a guarantee that
              all its running thread siblings have executed a core
              serializing instruction.  This guarantee is provided only
              for threads in the same process as the calling thread.

              The "expedited" commands complete faster than the
              non-expedited ones, they never block, but have the
              downside of causing extra overhead.

              A process must register its intent to use the private
              expedited sync core command prior to using it.

This just says that the siblings have executed a serialising
instruction, in other words a barrier. It makes no claims concerning
cache coherency - and without some form of cache maintenance, there
can be no expectation that the I and D streams will be coherent with
each other.

This description is also weird in another respect. "guarantee that
all its running thread siblings have executed a core serializing
instruction" ... "The expedited commands ... never block".

So, the core executing this call is not allowed to block, but the
other part indicates that the other CPUs _have_ executed a serialising
instruction before this call returns... one wonders how that happens
without blocking. Maybe the CPU spins waiting for completion instead?

Mathieu Desnoyers Dec. 28, 2020, 8:32 p.m. UTC | #11
----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@kernel.org wrote:

> On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
>>
>> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:
>> > After chatting with rmk about this (but without claiming that any of
>> > this is his opinion), based on the manpage, I think membarrier()
>> > currently doesn't really claim to be synchronizing caches? It just
>> > serializes cores. So arguably if userspace wants to use membarrier()
>> > to synchronize code changes, userspace should first do the code
>> > change, then flush icache as appropriate for the architecture, and
>> > then do the membarrier() to ensure that the old code is unused?

^ exactly, yes.

>> >
>> > For 32-bit arm, rmk pointed out that that would be the cacheflush()
>> > syscall. That might cause you to end up with two IPIs instead of one
>> > in total, but we probably don't care _that_ much about extra IPIs on
>> > 32-bit arm?

This was the original thinking, yes. The cacheflush IPI will flush specific
regions of code, and the membarrier IPI issues context synchronizing
instructions.

Architectures with coherent i/d caches don't need the cacheflush step.

>> >
>> > For arm64, I believe userspace can flush icache across the entire
>> > system with some instructions from userspace - "DC CVAU" followed by
>> > "DSB ISH", or something like that, I think? (See e.g.
>> > compat_arm_syscall(), the arm64 compat code that implements the 32-bit
>> > arm cacheflush() syscall.)
>>
>> Note that the ARM cacheflush syscall calls flush_icache_user_range()
>> over the range of addresses that userspace has passed - its intention
>> since day one has been to support cases where userspace wants to change
>> executable code.
>>
>> It will issue the appropriate write-backs to the data cache (DCCMVAU),
>> the invalidates to the instruction cache (ICIMVAU), invalidate the
>> branch target buffer (BPIALLIS or BPIALL as appropriate), and issue
>> the appropriate barriers (DSB ISHST, ISB).
>>
>> Note that neither flush_icache_user_range() nor flush_icache_range()
>> result in IPIs; cache operations are broadcast across all CPUs (which
>> is one of the minimums we require for SMP systems.)

But the ISB AFAIU is not a broadcast. Based on
https://developer.arm.com/documentation/ddi0406/c/Appendices/Barrier-Litmus-Tests/Cache-and-TLB-maintenance-operations-and-barriers/Instruction-cache-maintenance-operations

"Ensuring the visibility of updates to instructions for a multiprocessor"

the ISB needs to be issued on each processor in a multiprocessor system.
This is where membarrier sync-core comes handy.

>>
>> Now, that all said, I think the question that has to be asked is...
>>
>>         What is the basic purpose of membarrier?

For basic membarrier (not sync-core), the purpose is to provide memory
barriers across a given set of threads.

For sync-core membarrier, the purpose is to guarantee context synchronizing
instructions on all threads belonging to the same mm.

>>
>> Is the purpose of it to provide memory barriers, or is it to provide
>> memory coherence?


The purpose has never been to provide any kind of memory coherence, as
this would duplicate what is already achieved by other architecture-specific
system calls.

>>
>> If it's the former and not the latter, then cache flushes are out of
>> scope, and expecting memory written to be visible to the instruction
>> stream is totally out of scope of the membarrier interface, whether
>> or not the writes happen on the same or a different CPU to the one
>> executing the rewritten code.
>>
>> The documentation in the kernel does not seem to describe what it's
>> supposed to be doing - the only thing I could find is this:
>> Documentation/features/sched/membarrier-sync-core/arch-support.txt
>> which describes it as "arch supports core serializing membarrier"
>> whatever that means.


It's described in include/uapi/linux/membarrier.h.

>>
>> Seems to be the standard and usual case of utterly poor to non-existent
>> documentation within the kernel tree, or even a pointer to where any
>> useful documentation can be found.
>>
>> Reading the membarrier(2) man page, I find nothing in there that talks
>> about any kind of cache coherency for self-modifying code - it only
>> seems to be about _barriers_ and nothing more, and barriers alone do
>> precisely nothing to save you from non-coherent Harvard caches.


There is no mention of cache coherency because membarrier does not
do anything wrt cache coherency. We should perhaps explicitly point
it out though.

>>
>> So, either Andy has a misunderstanding, or the man page is wrong, or
>> my rudimentary understanding of what membarrier is supposed to be
>> doing is wrong...
>
> Look at the latest man page:
>
> https://man7.org/linux/man-pages/man2/membarrier.2.html
>
> for MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE.  The result may not be
> all that enlightening.


As described in the membarrier(2) man page and in include/uapi/linux/membarrier.h,
membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE guarantees core serializing
instructions in addition to the memory ordering guaranteed by MEMBARRIER_CMD_PRIVATE_EXPEDITED.

It does not guarantee anything wrt i/d cache coherency. I initially considered
adding such guarantees when we introduced membarrier sync-core, but decided
against it because it would basically replicate what architectures already
expose to user-space, e.g. flush_icache_user_range on arm32.

So between code modification and allowing other threads to jump to that code,
it should be expected that architectures without coherent i/d cache will need
to flush caches to ensure coherency *and* to issue membarrier to make sure
core serializing instructions will be issued by every thread acting on the
same mm either immediately by means of the IPI, or before they return to
user-space if they do not happen to be currently running when the membarrier
system call is invoked.

Hoping this clarifies things. I suspect we will need to clarify documentation
about what membarrier *does not* guarantee, given that you mistakenly expected
membarrier to take care of cache flushing.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Mathieu Desnoyers Dec. 28, 2020, 8:40 p.m. UTC | #12
----- On Dec 28, 2020, at 3:24 PM, Russell King, ARM Linux linux@armlinux.org.uk wrote:

> On Mon, Dec 28, 2020 at 11:44:33AM -0800, Andy Lutomirski wrote:
[...]
>
>       MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE (since Linux 4.16)
>              In addition to providing the memory ordering guarantees
>              described in MEMBARRIER_CMD_PRIVATE_EXPEDITED, upon return
>              from system call the calling thread has a guarantee that
>              all its running thread siblings have executed a core
>              serializing instruction.  This guarantee is provided only
>              for threads in the same process as the calling thread.
>
>              The "expedited" commands complete faster than the
>              non-expedited ones, they never block, but have the
>              downside of causing extra overhead.
>
>              A process must register its intent to use the private
>              expedited sync core command prior to using it.

>
> This just says that the siblings have executed a serialising
> instruction, in other words a barrier. It makes no claims concerning
> cache coherency - and without some form of cache maintenance, there
> can be no expectation that the I and D streams will be coherent with
> each other.


Right, membarrier is not doing anything wrt I/D caches. On architectures
without coherent caches, users should use other system calls or instructions
provided by the architecture to synchronize the appropriate address ranges. 

> This description is also weird in another respect. "guarantee that
> all its running thread siblings have executed a core serializing
> instruction" ... "The expedited commands ... never block".
>
> So, the core executing this call is not allowed to block, but the
> other part indicates that the other CPUs _have_ executed a serialising
> instruction before this call returns... one wonders how that happens
> without blocking. Maybe the CPU spins waiting for completion instead?


Membarrier expedited sync-core issues IPIs to all CPUs running sibling
threads. AFAIR the IPI mechanism uses the "csd lock" which is basically
busy waiting. So it does not "block", it busy-waits.

For completeness of the explanation, other (non-running) threads acting
on the same mm will eventually issue the context synchronizing instruction
before returning to user-space whenever they are scheduled back.

Thanks,

Mathieu


Andy Lutomirski Dec. 28, 2020, 9:06 p.m. UTC | #13
On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> ----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@kernel.org wrote:
[...]
> This was the original thinking, yes. The cacheflush IPI will flush specific
> regions of code, and the membarrier IPI issues context synchronizing
> instructions.
>
> Architectures with coherent i/d caches don't need the cacheflush step.


There are different levels of coherency -- VIVT architectures may have
differing requirements compared to PIPT, etc.

In any case, I feel like the approach taken by the documentation is
fundamentally confusing.  Architectures don't all speak the same
language.  How about something like:

The SYNC_CORE operation causes all threads in the caller's address
space (including the caller) to execute an architecture-defined
barrier operation.  membarrier() will ensure that this barrier is
executed at a time such that all data writes done by the calling
thread before membarrier() are made visible by the barrier.
Additional architecture-dependent cache management operations may be
required to use this for JIT code.

x86: SYNC_CORE executes a barrier that will cause subsequent
instruction fetches to observe prior writes.  Currently this will be a
"serializing" instruction, but, if future improved CPU documentation
becomes available and relaxes this requirement, the barrier may
change.  The kernel guarantees that writing new or modified
instructions to normal memory (and issuing SFENCE if the writes were
non-temporal) then doing a membarrier SYNC_CORE operation is
sufficient to cause all threads in the caller's address space to
execute the new or modified instructions.  This is true regardless of
whether or not those instructions are written at the same virtual
address from which they are subsequently executed.  No additional
cache management is required on x86.

arm: Something about the cache management syscalls.

arm64: Ditto

powerpc: I have no idea.
Mathieu Desnoyers Dec. 28, 2020, 9:09 p.m. UTC | #14
----- On Dec 27, 2020, at 4:36 PM, Andy Lutomirski luto@kernel.org wrote:

[...]

>> You seem to have noticed odd cases on arm64 where this guarantee does not
>> match reality. Where exactly can we find this in the code, and which part
>> of the architecture manual can you point us to which supports your concern?
>>
>> Based on the notes I have, use of `eret` on aarch64 guarantees a context
>> synchronizing instruction when returning to user-space.
>
> Based on my reading of the manual, ERET on ARM doesn't synchronize
> anything at all.  I can't find any evidence that it synchronizes data
> or instructions, and I've seen reports that the CPU will happily
> speculate right past it.


Reading [1] there appear to be 3 kinds of context synchronization events:

- Taking an exception,
- Returning from an exception,
- ISB.

This other source [2] adds (search for Context synchronization operation):

- Exit from Debug state
- Executing a DCPS instruction
- Executing a DRPS instruction

"ERET" falls into the second kind of events, and AFAIU should be context
synchronizing. That was confirmed to me by Will Deacon when membarrier
sync-core was implemented for aarch64. If the architecture reference manuals
are wrong, is there an errata ?

As for the algorithm to use on ARMv8 to update instructions, see [2]
B2.3.4  Implication of caches for the application programmer
"Synchronization and coherency issues between data and instruction accesses"

Membarrier only takes care of making sure the "ISB" part of the algorithm can be
done easily and efficiently on multiprocessor systems.

Thanks,

Mathieu

[1] https://developer.arm.com/documentation/den0024/a/Memory-Ordering/Barriers/ISB-in-more-detail
[2] https://montcs.bloomu.edu/Information/ARMv8/ARMv8-A_Architecture_Reference_Manual_(Issue_A.a).pdf

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Mathieu Desnoyers Dec. 28, 2020, 9:26 p.m. UTC | #15
----- On Dec 28, 2020, at 4:06 PM, Andy Lutomirski luto@kernel.org wrote:

> On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers

> <mathieu.desnoyers@efficios.com> wrote:

>>

>> ----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@kernel.org wrote:

>>

>> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin

>> > <linux@armlinux.org.uk> wrote:

>> >>

>> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:

>> >> > After chatting with rmk about this (but without claiming that any of

>> >> > this is his opinion), based on the manpage, I think membarrier()

>> >> > currently doesn't really claim to be synchronizing caches? It just

>> >> > serializes cores. So arguably if userspace wants to use membarrier()

>> >> > to synchronize code changes, userspace should first do the code

>> >> > change, then flush icache as appropriate for the architecture, and

>> >> > then do the membarrier() to ensure that the old code is unused?

>>

>> ^ exactly, yes.

>>

>> >> >

>> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush()

>> >> > syscall. That might cause you to end up with two IPIs instead of one

>> >> > in total, but we probably don't care _that_ much about extra IPIs on

>> >> > 32-bit arm?

>>

>> This was the original thinking, yes. The cacheflush IPI will flush specific

>> regions of code, and the membarrier IPI issues context synchronizing

>> instructions.

>>

>> Architectures with coherent i/d caches don't need the cacheflush step.

> 

> There are different levels of coherency -- VIVT architectures may have

> differing requirements compared to PIPT, etc.

> 

> In any case, I feel like the approach taken by the documentation is

> fundamentally confusing.  Architectures don't all speak the same

> language.


Agreed.

> How about something like:


I dislike the wording "barrier" and the association between "write" and
"instruction fetch" done in the descriptions below. It leads to think that
this behaves like a memory barrier, when in fact my understanding of
a context synchronizing instruction is that it simply flushes internal
CPU state, which would cause coherency issues if the CPU observes both the
old and then the new code without having this state flushed.

[ Sorry if I take more time to reply and if my replies are a bit more
  concise than usual. I'm currently on parental leave, so I have
  non-maskable interrupts to attend to. ;-) ]

Thanks,

Mathieu

> 

> The SYNC_CORE operation causes all threads in the caller's address

> space (including the caller) to execute an architecture-defined

> barrier operation.  membarrier() will ensure that this barrier is

> executed at a time such that all data writes done by the calling

> thread before membarrier() are made visible by the barrier.

> Additional architecture-dependent cache management operations may be

> required to use this for JIT code.

> 

> x86: SYNC_CORE executes a barrier that will cause subsequent

> instruction fetches to observe prior writes.  Currently this will be a

> "serializing" instruction, but, if future improved CPU documentation

> becomes available and relaxes this requirement, the barrier may

> change.  The kernel guarantees that writing new or modified

> instructions to normal memory (and issuing SFENCE if the writes were

> non-temporal) then doing a membarrier SYNC_CORE operation is

> sufficient to cause all threads in the caller's address space to

> execute the new or modified instructions.  This is true regardless of

> whether or not those instructions are written at the same virtual

> address from which they are subsequently executed.  No additional

> cache management is required on x86.

> 

> arm: Something about the cache management syscalls.

> 

> arm64: Ditto

> 

> powerpc: I have no idea.


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Nicholas Piggin Dec. 29, 2020, 12:11 a.m. UTC | #16
Excerpts from Andy Lutomirski's message of December 28, 2020 4:28 am:
> The old sync_core_before_usermode() comments said that a non-icache-syncing
> return-to-usermode instruction is x86-specific and that all other
> architectures automatically notice cross-modified code on return to
> userspace.  Based on my general understanding of how CPUs work and based on
> my attempt to read the ARM manual, this is not true at all.  In fact, x86
> seems to be a bit of an anomaly in the other direction: x86's IRET is
> unusually heavyweight for a return-to-usermode instruction.

"sync_core_before_usermode" as I've said says nothing to arch, or to the 
scheduler, or to membarrier. It's badly named to start with so if 
renaming it it should be something else. exit_lazy_tlb() at least says
something quite precise to scheudler and arch code that implements
the membarrier.

But I don't mind the idea of just making it x86-specific if, as you say, the
arch code can detect lazy mm switches more precisely than generic code and
you want to do that.

> So let's drop any pretense that we can have a generic way implementation
> behind membarrier's SYNC_CORE flush and require all architectures that opt
> in to supply their own.  This means x86, arm64, and powerpc for now.  Let's
> also rename the function from sync_core_before_usermode() to
> membarrier_sync_core_before_usermode() because the precise flushing details
> may very well be specific to membarrier, and even the concept of
> "sync_core" in the kernel is mostly an x86-ism.

The concept of "sync_core" (x86: serializing instruction, powerpc: context
synchronizing instruction, etc) is not an x86-ism at all. x86 just wanted
to add a serializing instruction to generic code so it grew this nasty API,
but the concept applies broadly.

> 
> I admit that I'm rather surprised that the code worked at all on arm64,
> and I'm suspicious that it has never been very well tested.  My apologies
> for not reviewing this more carefully in the first place.


> 
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: x86@kernel.org
> Cc: stable@vger.kernel.org
> Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> 
> Hi arm64 and powerpc people-
> 
> This is part of a series here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/fixes
> 
> Before I send out the whole series, I'm hoping that some arm64 and powerpc
> people can help me verify that I did this patch right.  Once I get
> some feedback on this patch, I'll send out the whole pile.  And once
> *that's* done, I'll start giving the mm lazy stuff some serious thought.
> 
> The x86 part is already fixed in Linus' tree.
> 
> Thanks,
> Andy
> 
>  arch/arm64/include/asm/sync_core.h   | 21 +++++++++++++++++++++
>  arch/powerpc/include/asm/sync_core.h | 20 ++++++++++++++++++++
>  arch/x86/Kconfig                     |  1 -
>  arch/x86/include/asm/sync_core.h     |  7 +++----
>  include/linux/sched/mm.h             |  1 -
>  include/linux/sync_core.h            | 21 ---------------------
>  init/Kconfig                         |  3 ---
>  kernel/sched/membarrier.c            | 15 +++++++++++----
>  8 files changed, 55 insertions(+), 34 deletions(-)
>  create mode 100644 arch/arm64/include/asm/sync_core.h
>  create mode 100644 arch/powerpc/include/asm/sync_core.h
>  delete mode 100644 include/linux/sync_core.h
> 
> diff --git a/arch/arm64/include/asm/sync_core.h b/arch/arm64/include/asm/sync_core.h
> new file mode 100644
> index 000000000000..5be4531caabd
> --- /dev/null
> +++ b/arch/arm64/include/asm/sync_core.h
> @@ -0,0 +1,21 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_ARM64_SYNC_CORE_H
> +#define _ASM_ARM64_SYNC_CORE_H
> +
> +#include <asm/barrier.h>
> +
> +/*
> + * Ensure that the CPU notices any instruction changes before the next time
> + * it returns to usermode.
> + */
> +static inline void membarrier_sync_core_before_usermode(void)
> +{
> +	/*
> +	 * XXX: is this enough or do we need a DMB first to make sure that
> +	 * writes from other CPUs become visible to this CPU?  We have an
> +	 * smp_mb() already, but that's not quite the same thing.
> +	 */
> +	isb();
> +}
> +
> +#endif /* _ASM_ARM64_SYNC_CORE_H */
> diff --git a/arch/powerpc/include/asm/sync_core.h b/arch/powerpc/include/asm/sync_core.h
> new file mode 100644
> index 000000000000..71dfbe7794e5
> --- /dev/null
> +++ b/arch/powerpc/include/asm/sync_core.h
> @@ -0,0 +1,20 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_POWERPC_SYNC_CORE_H
> +#define _ASM_POWERPC_SYNC_CORE_H
> +
> +#include <asm/barrier.h>
> +
> +/*
> + * Ensure that the CPU notices any instruction changes before the next time
> + * it returns to usermode.
> + */
> +static inline void membarrier_sync_core_before_usermode(void)
> +{
> +	/*
> +	 * XXX: I know basically nothing about powerpc cache management.
> +	 * Is this correct?
> +	 */
> +	isync();

This is not about memory ordering or cache management, it's about 
pipeline management. Powerpc's return to user mode serializes the
CPU (aka the hardware thread, _not_ the core; another wrongness of
the name, but AFAIKS the HW thread is what is required for
membarrier). So this is wrong, powerpc needs nothing here.

Thanks,
Nick
Andy Lutomirski Dec. 29, 2020, 12:30 a.m. UTC | #17
On Mon, Dec 28, 2020 at 1:09 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>

> ----- On Dec 27, 2020, at 4:36 PM, Andy Lutomirski luto@kernel.org wrote:

>

> [...]

>

> >> You seem to have noticed odd cases on arm64 where this guarantee does not

> >> match reality. Where exactly can we find this in the code, and which part

> >> of the architecture manual can you point us to which supports your concern ?

> >>

> >> Based on the notes I have, use of `eret` on aarch64 guarantees a context

> >> synchronizing

> >> instruction when returning to user-space.

> >

> > Based on my reading of the manual, ERET on ARM doesn't synchronize

> > anything at all.  I can't find any evidence that it synchronizes data

> > or instructions, and I've seen reports that the CPU will happily

> > speculate right past it.

>

> Reading [1] there appears to be 3 kind of context synchronization events:

>

> - Taking an exception,

> - Returning from an exception,

> - ISB.


My reading of [1] is that all three of these are "context
synchronization event[s]", but that only ISB flushes the pipeline,
etc.  The little description of context synchronization seems to
suggest that it only implies that certain register changes become
effective.

>

> This other source [2] adds (search for Context synchronization operation):

>

> - Exit from Debug state

> - Executing a DCPS instruction

> - Executing a DRPS instruction

>

> "ERET" falls into the second kind of events, and AFAIU should be context

> synchronizing. That was confirmed to me by Will Deacon when membarrier

> sync-core was implemented for aarch64. If the architecture reference manuals

> are wrong, is there an erratum?

>

> As for the algorithm to use on ARMv8 to update instructions, see [2]

> B2.3.4  Implication of caches for the application programmer

> "Synchronization and coherency issues between data and instruction accesses"


This specifically discusses ISB.

Let's wait for an actual ARM64 expert to chime in, though.
Andy Lutomirski Dec. 29, 2020, 12:36 a.m. UTC | #18
On Mon, Dec 28, 2020 at 4:11 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>

> Excerpts from Andy Lutomirski's message of December 28, 2020 4:28 am:

> > The old sync_core_before_usermode() comments said that a non-icache-syncing

> > return-to-usermode instruction is x86-specific and that all other

> > architectures automatically notice cross-modified code on return to

> > userspace.  Based on my general understanding of how CPUs work and based on

> > my attempt to read the ARM manual, this is not true at all.  In fact, x86

> > seems to be a bit of an anomaly in the other direction: x86's IRET is

> > unusually heavyweight for a return-to-usermode instruction.

>

> "sync_core_before_usermode" as I've said says nothing to arch, or to the

> scheduler, or to membarrier.


Agreed.  My patch tries to fix this.  I agree that the name is bad and
could be improved further.  We should define what
membarrier(...SYNC_CORE) actually does and have arch hooks to make it
happen.

> > So let's drop any pretense that we can have a generic way implementation

> > behind membarrier's SYNC_CORE flush and require all architectures that opt

> > in to supply their own.  This means x86, arm64, and powerpc for now.  Let's

> > also rename the function from sync_core_before_usermode() to

> > membarrier_sync_core_before_usermode() because the precise flushing details

> > may very well be specific to membarrier, and even the concept of

> > "sync_core" in the kernel is mostly an x86-ism.

>

> The concept of "sync_core" (x86: serializing instruction, powerpc: context

> synchronizing instruction, etc) is not an x86-ism at all. x86 just wanted

> to add a serializing instruction to generic code so it grew this nasty API,

> but the concept applies broadly.


I mean that the mapping from the name "sync_core" to its semantics is
x86 only.  The string "sync_core" appears in the kernel only in
arch/x86, membarrier code, membarrier docs, and a single SGI driver
that is x86-only.  Sure, the idea of serializing things is fairly
generic, but exactly what operations serialize what, when things need
serialization, etc. is quite architecture-specific.

Heck, on 486 you serialize the instruction stream with JMP.

> > +static inline void membarrier_sync_core_before_usermode(void)

> > +{

> > +     /*

> > +      * XXX: I know basically nothing about powerpc cache management.

> > +      * Is this correct?

> > +      */

> > +     isync();

>

> This is not about memory ordering or cache management, it's about

> pipeline management. Powerpc's return to user mode serializes the

> CPU (aka the hardware thread, _not_ the core; another wrongness of

> the name, but AFAIKS the HW thread is what is required for

> membarrier). So this is wrong, powerpc needs nothing here.


Fair enough.  I'm happy to defer to you on the powerpc details.  In
any case, this just illustrates that we need feedback from a person
who knows more about ARM64 than I do.
Nicholas Piggin Dec. 29, 2020, 12:36 a.m. UTC | #19
Excerpts from Andy Lutomirski's message of December 29, 2020 7:06 am:
> On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers

> <mathieu.desnoyers@efficios.com> wrote:

>>

>> ----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@kernel.org wrote:

>>

>> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin

>> > <linux@armlinux.org.uk> wrote:

>> >>

>> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:

>> >> > After chatting with rmk about this (but without claiming that any of

>> >> > this is his opinion), based on the manpage, I think membarrier()

>> >> > currently doesn't really claim to be synchronizing caches? It just

>> >> > serializes cores. So arguably if userspace wants to use membarrier()

>> >> > to synchronize code changes, userspace should first do the code

>> >> > change, then flush icache as appropriate for the architecture, and

>> >> > then do the membarrier() to ensure that the old code is unused?

>>

>> ^ exactly, yes.

>>

>> >> >

>> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush()

>> >> > syscall. That might cause you to end up with two IPIs instead of one

>> >> > in total, but we probably don't care _that_ much about extra IPIs on

>> >> > 32-bit arm?

>>

>> This was the original thinking, yes. The cacheflush IPI will flush specific

>> regions of code, and the membarrier IPI issues context synchronizing

>> instructions.


APIs should be written in terms of the service they provide to
userspace, and in as high-level terms as possible, rather than directing
hardware to do some low level operation. Unfortunately we're stuck with
this for now. We could deprecate it and replace it though.

If userspace wants to modify code and ensure that after the system call
returns then no other thread will be executing the previous code, then
there should be an API for that. It could actually combine the two IPIs
into one for architectures that require both too. 

>>

>> Architectures with coherent i/d caches don't need the cacheflush step.

> 

> There are different levels of coherency -- VIVT architectures may have

> differing requirements compared to PIPT, etc.

> 

> In any case, I feel like the approach taken by the documentation is

> fundamentally confusing.  Architectures don't all speak the same

> language.  How about something like:

> 

> The SYNC_CORE operation causes all threads in the caller's address

> space (including the caller) to execute an architecture-defined

> barrier operation.  membarrier() will ensure that this barrier is

> executed at a time such that all data writes done by the calling

> thread before membarrier() are made visible by the barrier.

> Additional architecture-dependent cache management operations may be

> required to use this for JIT code.


As said this isn't what SYNC_CORE does, and it's not what powerpc
context synchronizing instructions do either, it will very much
re-order visibility of stores around such an instruction.

A thread completes store instructions into a store queue, which is
as far as a context synchronizing event goes. Visibility comes at
some indeterminite time later.

Also note that the regular membarrier syscall which does order
memory also does not make writes visible, it just ensures an
ordering.

> 

> x86: SYNC_CORE executes a barrier that will cause subsequent

> instruction fetches to observe prior writes.  Currently this will be a


I would be surprised if x86's serializing instructions were different 
than powerpc. rdtsc ordering or flushing stores to cache would be
surprising.

Thanks,
Nick

> "serializing" instruction, but, if future improved CPU documentation

> becomes available and relaxes this requirement, the barrier may

> change.  The kernel guarantees that writing new or modified

> instructions to normal memory (and issuing SFENCE if the writes were

> non-temporal) then doing a membarrier SYNC_CORE operation is

> sufficient to cause all threads in the caller's address space to

> execute the new or modified instructions.  This is true regardless of

> whether or not those instructions are written at the same virtual

> address from which they are subsequently executed.  No additional

> cache management is required on x86.

> 

> arm: Something about the cache management syscalls.

> 

> arm64: Ditto

> 

> powerpc: I have no idea.

>
Andy Lutomirski Dec. 29, 2020, 12:56 a.m. UTC | #20
On Mon, Dec 28, 2020 at 4:36 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>

> Excerpts from Andy Lutomirski's message of December 29, 2020 7:06 am:

> > On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers

> > <mathieu.desnoyers@efficios.com> wrote:

> >>

> >> ----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@kernel.org wrote:

> >>

> >> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin

> >> > <linux@armlinux.org.uk> wrote:

> >> >>

> >> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:

> >> >> > After chatting with rmk about this (but without claiming that any of

> >> >> > this is his opinion), based on the manpage, I think membarrier()

> >> >> > currently doesn't really claim to be synchronizing caches? It just

> >> >> > serializes cores. So arguably if userspace wants to use membarrier()

> >> >> > to synchronize code changes, userspace should first do the code

> >> >> > change, then flush icache as appropriate for the architecture, and

> >> >> > then do the membarrier() to ensure that the old code is unused?

> >>

> >> ^ exactly, yes.

> >>

> >> >> >

> >> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush()

> >> >> > syscall. That might cause you to end up with two IPIs instead of one

> >> >> > in total, but we probably don't care _that_ much about extra IPIs on

> >> >> > 32-bit arm?

> >>

> >> This was the original thinking, yes. The cacheflush IPI will flush specific

> >> regions of code, and the membarrier IPI issues context synchronizing

> >> instructions.

>

> APIs should be written in terms of the service they provide to

> userspace, and in highest level terms as possible, rather than directing

> hardware to do some low level operation. Unfortunately we're stuck with

> this for now. We could deprecate it and replace it though.

>

> If userspace wants to modify code and ensure that after the system call

> returns then no other thread will be executing the previous code, then

> there should be an API for that. It could actually combine the two IPIs

> into one for architectures that require both too.


I agree.  The membarrier API for SYNC_CORE is pretty nasty.  I would
much prefer a real API for JITs to use.

>

> >>

> >> Architectures with coherent i/d caches don't need the cacheflush step.

> >

> > There are different levels of coherency -- VIVT architectures may have

> > differing requirements compared to PIPT, etc.

> >

> > In any case, I feel like the approach taken by the documentation is

> > fundamentally confusing.  Architectures don't all speak the same

> > language.  How about something like:

> >

> > The SYNC_CORE operation causes all threads in the caller's address

> > space (including the caller) to execute an architecture-defined

> > barrier operation.  membarrier() will ensure that this barrier is

> > executed at a time such that all data writes done by the calling

> > thread before membarrier() are made visible by the barrier.

> > Additional architecture-dependent cache management operations may be

> > required to use this for JIT code.

>

> As said this isn't what SYNC_CORE does, and it's not what powerpc

> context synchronizing instructions do either, it will very much

> re-order visibility of stores around such an instruction.


Perhaps the docs should be entirely arch-specific.  It may well be
impossible to state what it does in an arch-neutral way.

>

> A thread completes store instructions into a store queue, which is

> as far as a context synchronizing event goes. Visibility comes at

> some indeterminite time later.


As currently implemented, it has the same visibility semantics as
regular membarrier, too.  So if I do:

a = 1;
membarrier(SYNC_CORE);
b = 1;

and another thread does:

while (READ_ONCE(b) != 1)
  ;
barrier();
assert(a == 1);

then the assertion will pass.  Similarly, one can do this, I hope:

memcpy(codeptr, [some new instructions], len);
arch_dependent_cache_flush(codeptr, len);
membarrier(SYNC_CORE);
ready = 1;

and another thread does:

while (READ_ONCE(ready) != 1)
  ;
barrier();
(*codeptr)();

arch_dependent_cache_flush is a nop on x86.  On arm and arm64, it
appears to be a syscall, although maybe arm64 can do it from
userspace.  I still don't know what it is on powerpc.

Even using the term "cache" here is misleading.  x86 chips have all
kinds of barely-documented instruction caches, and they have varying
degrees of coherency.  The architecture actually promises that, if you
do a certain incantation, then you get the desired result.
membarrier() allows a user to do this incantation.  But trying to
replicate the incantation verbatim on an architecture like ARM is
insufficient, and trying to flush the things that are documented as
being caches on x86 is expensive and a complete waste of time on x86.
When you write some JIT code, you do *not* want to flush it all the
way to main memory, especially on CPUs that don't have the ability to write
back invalidating.  (That's most CPUs.)

Even on x86, I suspect that the various decoded insn caches are rather
more coherent than documented, and I have questions in to Intel about
this.  No answers yet.

So perhaps the right approach is to say that membarrier() helps you
perform the architecture-specific sequence of steps needed to safely
modify code.  On x86, you use it like this.  On arm64, you do this
other thing.  On powerpc, you do something else.

>

> I would be surprised if x86's serializing instructions were different

> than powerpc. rdtsc ordering or flushing stores to cache would be

> surprising.

>


At the very least, x86 has several levels of what ARM might call
"context synchronization" AFAICT.  STAC, CLAC, and POPF do a form of
context synchronization in that the changes they cause to the MMU take
effect immediately, but they are not documented as synchronizing the
instruction stream.  "Serializing" instructions do all kinds of
things, not all of which may be architecturally visible at all.
MFENCE and LFENCE do various complicated things, and LFENCE has magic
retroactive capabilities on old CPUs that were not documented when
those CPUs were released.  SFENCE does a different form of
synchronization entirely.  LOCK does something else.  (The
relationship between LOCK and MFENCE is confusing at best.)  RDTSC
doesn't serialize anything at all, but RDTSCP does provide a form of
serialization that's kind of like LFENCE.  It's a mess.  Even the
manuals are inconsistent about what "serialize" means.  JMP has its
own magic on x86, but only on very very old CPUs.

The specific instruction that flushes everything into the coherency
domain is SFENCE, and SFENCE is not, for normal purposes, needed for
self- or cross-modifying code.
Nicholas Piggin Dec. 29, 2020, 3:09 a.m. UTC | #21
Excerpts from Andy Lutomirski's message of December 29, 2020 10:56 am:
> On Mon, Dec 28, 2020 at 4:36 PM Nicholas Piggin <npiggin@gmail.com> wrote:

>>

>> Excerpts from Andy Lutomirski's message of December 29, 2020 7:06 am:

>> > On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers

>> > <mathieu.desnoyers@efficios.com> wrote:

>> >>

>> >> ----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@kernel.org wrote:

>> >>

>> >> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin

>> >> > <linux@armlinux.org.uk> wrote:

>> >> >>

>> >> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:

>> >> >> > After chatting with rmk about this (but without claiming that any of

>> >> >> > this is his opinion), based on the manpage, I think membarrier()

>> >> >> > currently doesn't really claim to be synchronizing caches? It just

>> >> >> > serializes cores. So arguably if userspace wants to use membarrier()

>> >> >> > to synchronize code changes, userspace should first do the code

>> >> >> > change, then flush icache as appropriate for the architecture, and

>> >> >> > then do the membarrier() to ensure that the old code is unused?

>> >>

>> >> ^ exactly, yes.

>> >>

>> >> >> >

>> >> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush()

>> >> >> > syscall. That might cause you to end up with two IPIs instead of one

>> >> >> > in total, but we probably don't care _that_ much about extra IPIs on

>> >> >> > 32-bit arm?

>> >>

>> >> This was the original thinking, yes. The cacheflush IPI will flush specific

>> >> regions of code, and the membarrier IPI issues context synchronizing

>> >> instructions.

>>

>> APIs should be written in terms of the service they provide to

>> userspace, and in highest level terms as possible, rather than directing

>> hardware to do some low level operation. Unfortunately we're stuck with

>> this for now. We could deprecate it and replace it though.

>>

>> If userspace wants to modify code and ensure that after the system call

>> returns then no other thread will be executing the previous code, then

>> there should be an API for that. It could actually combine the two IPIs

>> into one for architectures that require both too.

> 

> I agree.  The membarrier API for SYNC_CORE is pretty nasty.  I would

> much prefer a real API for JITs to use.

> 

>>

>> >>

>> >> Architectures with coherent i/d caches don't need the cacheflush step.

>> >

>> > There are different levels of coherency -- VIVT architectures may have

>> > differing requirements compared to PIPT, etc.

>> >

>> > In any case, I feel like the approach taken by the documentation is

>> > fundamentally confusing.  Architectures don't all speak the same

>> > language.  How about something like:

>> >

>> > The SYNC_CORE operation causes all threads in the caller's address

>> > space (including the caller) to execute an architecture-defined

>> > barrier operation.  membarrier() will ensure that this barrier is

>> > executed at a time such that all data writes done by the calling

>> > thread before membarrier() are made visible by the barrier.

>> > Additional architecture-dependent cache management operations may be

>> > required to use this for JIT code.

>>

>> As said this isn't what SYNC_CORE does, and it's not what powerpc

>> context synchronizing instructions do either, it will very much

>> re-order visibility of stores around such an instruction.

> 

> Perhaps the docs should be entirely arch-specific.  It may well be

> impossible to state what it does in an arch-neutral way.


I think what I wrote above -- after the call returns, none of the
threads in the process will be executing instructions overwritten
previously by the caller (provided their i-caches are made coherent
with the caller's modifications).

>> A thread completes store instructions into a store queue, which is

>> as far as a context synchronizing event goes. Visibility comes at

>> some indeterminite time later.

> 

> As currently implemented, it has the same visibility semantics as

> regular membarrier, too.


Ah, I actually missed that SYNC_CORE is in _addition_ to the memory
barriers, and that's documented API too, not just implementation,
sorry.

> So if I do:

> 

> a = 1;

> membarrier(SYNC_CORE);

> b = 1;

> 

> and another thread does:

> 

> while (READ_ONCE(b) != 1)

>   ;

> barrier();

> assert(a == 1);


Right, that's true, due to MEMBARRIER_CMD_PRIVATE_EXPEDITED. Neither
that barrier nor the added SYNC_CORE behaviour implies visibility though.

> 

> then the assertion will pass.  Similarly, one can do this, I hope:

> 

> memcpy(codeptr, [some new instructions], len);

> arch_dependent_cache_flush(codeptr, len);

> membarrier(SYNC_CORE);

> ready = 1;

> 

> and another thread does:

> 

> while (READ_ONCE(ready) != 1)

>   ;

> barrier();

> (*codeptr)();

> 

> arch_dependent_cache_flush is a nop on x86.  On arm and arm64, it

> appears to be a syscall, although maybe arm64 can do it from

> userspace.  I still don't know what it is on powerpc.

> 

> Even using the term "cache" here is misleading.  x86 chips have all

> kinds of barely-documented instruction caches, and they have varying

> degrees of coherency.  The architecture actually promises that, if you

> do a certain incantation, then you get the desired result.

> membarrier() allows a user to do this incantation.  But trying to

> replicate the incantation verbatim on an architecture like ARM is

> insufficient, and trying to flush the things that are documented as

> being caches on x86 is expensive and a complete waste of time on x86.

> When you write some JIT code, you do *not* want to flush it all the

> way to main memory, especially on CPUs don't have the ability to write

> back invalidating.  (That's most CPUs.)

> 

> Even on x86, I suspect that the various decoded insn caches are rather

> more coherent than documented, and I have questions in to Intel about

> this.  No answers yet.

> 

> So perhaps the right approach is to say that membarrier() helps you

> perform the architecture-specific sequence of steps needed to safely

> modify code.  On x86, you use it like this.  On arm64, you do this

> other thing.  On powerpc, you do something else.


I think it should certainly be documented in terms of what guarantees
it provides to the application, _not_ the kinds of instructions it may
or may not induce the core to execute. And if the existing API can't be
re-documented sanely, then it should be deprecated and new ones added
that DTRT. Possibly under a new system call, if arches like ARM want a
range flush and we don't want to expand the multiplexing behaviour of
membarrier even more (sigh).

> 

>>

>> I would be surprised if x86's serializing instructions were different

>> than powerpc. rdtsc ordering or flushing stores to cache would be

>> surprising.

>>

> 

> At the very least, x86 has several levels of what ARM might call

> "context synchronization" AFAICT.  STAC, CLAC, and POPF do a form of

> context synchronization in that the changes they cause to the MMU take

> effect immediately, but they are not documented as synchronizing the

> instruction stream.  "Serializing" instructions do all kinds of

> things, not all of which may be architecturally visible at all.

> MFENCE and LFENCE do various complicated things, and LFENCE has magic

> retroactive capabilities on old CPUs that were not documented when

> those CPUs were released.  SFENCE does a different form of

> synchronization entirely.  LOCK does something else.  (The

> relationship between LOCK and MFENCE is confusing at best.)  RDTSC

> doesn't serialize anything at all, but RDTSCP does provide a form of

> serialization that's kind of like LFENCE.  It's a mess.  Even the

> manuals are inconsistent about what "serialize" means.  JMP has its

> own magic on x86, but only on very very old CPUs.

> 

> The specific instruction that flushes everything into the coherency

> domain is SFENCE, and SFENCE is not, for normal purposes, needed for

> self- or cross-modifying code.

> 


Good reason to avoid such language in the system call interface!

Thanks,
Nick
Nicholas Piggin Dec. 29, 2020, 3:31 a.m. UTC | #22
Excerpts from Andy Lutomirski's message of December 29, 2020 10:36 am:
> On Mon, Dec 28, 2020 at 4:11 PM Nicholas Piggin <npiggin@gmail.com> wrote:

>>

>> Excerpts from Andy Lutomirski's message of December 28, 2020 4:28 am:

>> > The old sync_core_before_usermode() comments said that a non-icache-syncing

>> > return-to-usermode instruction is x86-specific and that all other

>> > architectures automatically notice cross-modified code on return to

>> > userspace.  Based on my general understanding of how CPUs work and based on

>> > my attempt to read the ARM manual, this is not true at all.  In fact, x86

>> > seems to be a bit of an anomaly in the other direction: x86's IRET is

>> > unusually heavyweight for a return-to-usermode instruction.

>>

>> "sync_core_before_usermode" as I've said says nothing to arch, or to the

>> scheduler, or to membarrier.

> 

> Agreed.  My patch tries to fix this.  I agree that the name is bad and

> could be improved further.  We should define what

> membarrier(...SYNC_CORE) actually does and have arch hooks to make it

> happen.

> 

>> > So let's drop any pretense that we can have a generic way implementation

>> > behind membarrier's SYNC_CORE flush and require all architectures that opt

>> > in to supply their own.  This means x86, arm64, and powerpc for now.  Let's

>> > also rename the function from sync_core_before_usermode() to

>> > membarrier_sync_core_before_usermode() because the precise flushing details

>> > may very well be specific to membarrier, and even the concept of

>> > "sync_core" in the kernel is mostly an x86-ism.

>>

>> The concept of "sync_core" (x86: serializing instruction, powerpc: context

>> synchronizing instruction, etc) is not an x86-ism at all. x86 just wanted

>> to add a serializing instruction to generic code so it grew this nasty API,

>> but the concept applies broadly.

> 

> I mean that the mapping from the name "sync_core" to its semantics is

> x86 only.  The string "sync_core" appears in the kernel only in

> arch/x86, membarrier code, membarrier docs, and a single SGI driver

> that is x86-only.  Sure, the idea of serializing things is fairly

> generic, but exactly what operations serialize what, when things need

> serialization, etc is quite architecture specific.


Okay, well, yes, it's x86-only in name; I was talking more about the
concept.

> Heck, on 486 you serialize the instruction stream with JMP.


x86-specific aside, I did think the semantics of a "serializing
instruction" were reasonably well architected in x86. Sure, it could do
other things as well, but if you executed a serializing instruction,
then you had a decent set of guarantees (e.g., what you might want
for code modification).

> 

>> > +static inline void membarrier_sync_core_before_usermode(void)

>> > +{

>> > +     /*

>> > +      * XXX: I know basically nothing about powerpc cache management.

>> > +      * Is this correct?

>> > +      */

>> > +     isync();

>>

>> This is not about memory ordering or cache management, it's about

>> pipeline management. Powerpc's return to user mode serializes the

>> CPU (aka the hardware thread, _not_ the core; another wrongness of

>> the name, but AFAIKS the HW thread is what is required for

>> membarrier). So this is wrong, powerpc needs nothing here.

> 

> Fair enough.  I'm happy to defer to you on the powerpc details.  In

> any case, this just illustrates that we need feedback from a person

> who knows more about ARM64 than I do.

> 


Thanks,
Nick
Russell King (Oracle) Dec. 29, 2020, 10:44 a.m. UTC | #23
On Tue, Dec 29, 2020 at 01:09:12PM +1000, Nicholas Piggin wrote:
> I think it should certainly be documented in terms of what guarantees

> it provides to application, _not_ the kinds of instructions it may or

> may not induce the core to execute. And if existing API can't be

> re-documented sanely, then deprecatd and new ones added that DTRT.

> Possibly under a new system call, if arch's like ARM want a range

> flush and we don't want to expand the multiplexing behaviour of

> membarrier even more (sigh).


The 32-bit ARM sys_cacheflush() is there only to support self-modifying
code, and takes whatever actions are necessary to support that.
Exactly what actions it takes are cache implementation specific, and
should be of no concern to the caller, but the underlying thing is...
it's to support self-modifying code.

Sadly, because it's existed for 20+ years, and it has historically been
sufficient for other purposes too, it has seen quite a bit of abuse
despite its design purpose not changing - it's been used by graphics
drivers for example. They quickly learnt the error of their ways with
ARMv6+, since it does not do sufficient for their purposes given the
cache architectures found there.

Let's not go around redesigning this after twenty odd years, requiring
a hell of a lot of pain to users. This interface is called by code
generated by GCC, so to change it you're looking at patching GCC as
well as the kernel, and you basically will make new programs
incompatible with older kernels - very bad news for users.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
Nicholas Piggin Dec. 30, 2020, 2:33 a.m. UTC | #24
Excerpts from Russell King - ARM Linux admin's message of December 29, 2020 8:44 pm:
> On Tue, Dec 29, 2020 at 01:09:12PM +1000, Nicholas Piggin wrote:

>> I think it should certainly be documented in terms of what guarantees

>> it provides to application, _not_ the kinds of instructions it may or

>> may not induce the core to execute. And if existing API can't be

>> re-documented sanely, then deprecatd and new ones added that DTRT.

>> Possibly under a new system call, if arch's like ARM want a range

>> flush and we don't want to expand the multiplexing behaviour of

>> membarrier even more (sigh).

> 

> The 32-bit ARM sys_cacheflush() is there only to support self-modifying

> code, and takes whatever actions are necessary to support that.

> Exactly what actions it takes are cache implementation specific, and

> should be of no concern to the caller, but the underlying thing is...

> it's to support self-modifying code.


   Caveat
       cacheflush() should not be used in programs intended to be portable.
       On Linux, this call first appeared on the MIPS architecture, but
       nowadays Linux provides a cacheflush() system call on some other
       architectures, but with different arguments.

What a disaster. Another badly designed interface; although it didn't
originate in Linux, it sounds like we weren't to be outdone, so we
messed it up even worse.

Flushing caches is neither necessary nor sufficient for code
modification on many processors. Maybe some old MIPS-specific private
thing was fine, but certainly before it grew to other architectures,
somebody should have thought for more than 2 minutes about it. Sigh.

> 

> Sadly, because it's existed for 20+ years, and it has historically been

> sufficient for other purposes too, it has seen quite a bit of abuse

> despite its design purpose not changing - it's been used by graphics

> drivers for example. They quickly learnt the error of their ways with

> ARMv6+, since it does not do sufficient for their purposes given the

> cache architectures found there.

> 

> Let's not go around redesigning this after twenty odd years, requiring

> a hell of a lot of pain to users. This interface is called by code

> generated by GCC, so to change it you're looking at patching GCC as

> well as the kernel, and you basically will make new programs

> incompatible with older kernels - very bad news for users.


For something to be redesigned it had to have been designed in the
first place, so there is no danger of that, don't worry... But no, I
never suggested making incompatible changes to any existing system
call; I said "re-documented". And yes, I said deprecated, but in Linux
that really means kept indefinitely.

If ARM, MIPS, 68k, etc. programs and toolchains keep using what they
are using, it'll keep working, no problem.

The point is we're growing new interfaces and making the same mistakes.
It's not portable (ARCH_HAS_MEMBARRIER_SYNC_CORE), it's specified in
terms of low-level processor operations rather than higher-level
intent, and it is also not sufficient for self-modifying code (without
an additional cache flush on some processors).

The application wants a call that says something like "memory modified
before the call will be visible as instructions (including illegal
instructions) by all threads in the program after the system call
returns, and no thread will be subject to any effects of executing the
previous contents of that memory."

So I think the basics are simple (although we should confirm with some
JIT and debugger etc. developers, and not just Android, mind you).
There are some complications in the details: address ranges,
virtual/physical, thread-local vs. process vs. different process or
system-wide, memory ordering and propagation of the i and d sides, etc.
But that can be worked through, erring on the side of sanity rather
than pointless micro-optimisations.

Thanks,
Nick
Russell King (Oracle) Dec. 30, 2020, 10 a.m. UTC | #25
On Wed, Dec 30, 2020 at 12:33:02PM +1000, Nicholas Piggin wrote:
> Excerpts from Russell King - ARM Linux admin's message of December 29, 2020 8:44 pm:

> > On Tue, Dec 29, 2020 at 01:09:12PM +1000, Nicholas Piggin wrote:

> >> I think it should certainly be documented in terms of what guarantees

> >> it provides to application, _not_ the kinds of instructions it may or

> >> may not induce the core to execute. And if existing API can't be

> >> re-documented sanely, then deprecatd and new ones added that DTRT.

> >> Possibly under a new system call, if arch's like ARM want a range

> >> flush and we don't want to expand the multiplexing behaviour of

> >> membarrier even more (sigh).

> > 

> > The 32-bit ARM sys_cacheflush() is there only to support self-modifying

> > code, and takes whatever actions are necessary to support that.

> > Exactly what actions it takes are cache implementation specific, and

> > should be of no concern to the caller, but the underlying thing is...

> > it's to support self-modifying code.

> 

>    Caveat

>        cacheflush()  should  not  be used in programs intended to be portable.

>        On Linux, this call first appeared on the MIPS architecture, but  nowa‐

>        days, Linux provides a cacheflush() system call on some other architec‐

>        tures, but with different arguments.

> 

> What a disaster. Another badly designed interface, although it didn't 

> originate in Linux it sounds like we weren't to be outdone so

> we messed it up even worse.

> 

> flushing caches is neither necessary nor sufficient for code modification

> on many processors. Maybe some old MIPS specific private thing was fine,

> but certainly before it grew to other architectures, somebody should 

> have thought for more than 2 minutes about it. Sigh.


WARNING: You are bordering on being objectionable and offensive with
that comment.

The ARM interface was designed by me back in the very early days of
Linux, probably while you were still in diapers, based on what was
known at the time.  Back in the early 2000s, ideas such as relaxed
memory ordering were not known.  All there was was one level of
Harvard cache.

So, just shut up with your insults.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
Russell King (Oracle) Dec. 30, 2020, 10:58 a.m. UTC | #26
On Wed, Dec 30, 2020 at 10:00:28AM +0000, Russell King - ARM Linux admin wrote:
> On Wed, Dec 30, 2020 at 12:33:02PM +1000, Nicholas Piggin wrote:

> > Excerpts from Russell King - ARM Linux admin's message of December 29, 2020 8:44 pm:

> > > On Tue, Dec 29, 2020 at 01:09:12PM +1000, Nicholas Piggin wrote:

> > >> I think it should certainly be documented in terms of what guarantees

> > >> it provides to application, _not_ the kinds of instructions it may or

> > >> may not induce the core to execute. And if existing API can't be

> > >> re-documented sanely, then deprecatd and new ones added that DTRT.

> > >> Possibly under a new system call, if arch's like ARM want a range

> > >> flush and we don't want to expand the multiplexing behaviour of

> > >> membarrier even more (sigh).

> > > 

> > > The 32-bit ARM sys_cacheflush() is there only to support self-modifying

> > > code, and takes whatever actions are necessary to support that.

> > > Exactly what actions it takes are cache implementation specific, and

> > > should be of no concern to the caller, but the underlying thing is...

> > > it's to support self-modifying code.

> > 

> >    Caveat

> >        cacheflush()  should  not  be used in programs intended to be portable.

> >        On Linux, this call first appeared on the MIPS architecture, but  nowa‐

> >        days, Linux provides a cacheflush() system call on some other architec‐

> >        tures, but with different arguments.

> > 

> > What a disaster. Another badly designed interface, although it didn't 

> > originate in Linux it sounds like we weren't to be outdone so

> > we messed it up even worse.

> > 

> > flushing caches is neither necessary nor sufficient for code modification

> > on many processors. Maybe some old MIPS specific private thing was fine,

> > but certainly before it grew to other architectures, somebody should 

> > have thought for more than 2 minutes about it. Sigh.

> 

> WARNING: You are bordering on being objectionable and offensive with

> that comment.

> 

> The ARM interface was designed by me back in the very early days of

> Linux, probably while you were still in dypers, based on what was

> known at the time.  Back in the early 2000s, ideas such as relaxed

> memory ordering were not known.  All there was was one level of

> harvard cache.


Sorry, I got that slightly wrong. Its origins on ARM were from 12 Dec
1998:

http://www.armlinux.org.uk/developer/patches/viewpatch.php?id=88/1

by Philip Blundell, and first appeared in the ARM
pre-patch-2.1.131-19981214-1.gz. It was subsequently documented in the
kernel sources by me in July 2001 in ARM patch-2.4.6-rmk2.gz. It has
a slightly different signature: the third argument on ARM is a flags
argument, whereas in the MIPS code it is some undocumented "cache"
parameter.

Whether the ARM version pre or post dates the MIPS code, I couldn't say.
Whether it was ultimately taken from the MIPS implementation, again, I
couldn't say.

However, please stop insulting your fellow developers ability to think.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
Nicholas Piggin Dec. 30, 2020, 11:57 a.m. UTC | #27
Excerpts from Russell King - ARM Linux admin's message of December 30, 2020 8:58 pm:
> On Wed, Dec 30, 2020 at 10:00:28AM +0000, Russell King - ARM Linux admin wrote:

>> On Wed, Dec 30, 2020 at 12:33:02PM +1000, Nicholas Piggin wrote:

>> > Excerpts from Russell King - ARM Linux admin's message of December 29, 2020 8:44 pm:

>> > > On Tue, Dec 29, 2020 at 01:09:12PM +1000, Nicholas Piggin wrote:

>> > >> I think it should certainly be documented in terms of what guarantees

>> > >> it provides to application, _not_ the kinds of instructions it may or

>> > >> may not induce the core to execute. And if existing API can't be

>> > >> re-documented sanely, then deprecatd and new ones added that DTRT.

>> > >> Possibly under a new system call, if arch's like ARM want a range

>> > >> flush and we don't want to expand the multiplexing behaviour of

>> > >> membarrier even more (sigh).

>> > > 

>> > > The 32-bit ARM sys_cacheflush() is there only to support self-modifying

>> > > code, and takes whatever actions are necessary to support that.

>> > > Exactly what actions it takes are cache implementation specific, and

>> > > should be of no concern to the caller, but the underlying thing is...

>> > > it's to support self-modifying code.

>> > 

>> >    Caveat

>> >        cacheflush()  should  not  be used in programs intended to be portable.

>> >        On Linux, this call first appeared on the MIPS architecture, but  nowa‐

>> >        days, Linux provides a cacheflush() system call on some other architec‐

>> >        tures, but with different arguments.

>> > 

>> > What a disaster. Another badly designed interface, although it didn't 

>> > originate in Linux it sounds like we weren't to be outdone so

>> > we messed it up even worse.

>> > 

>> > flushing caches is neither necessary nor sufficient for code modification

>> > on many processors. Maybe some old MIPS specific private thing was fine,

>> > but certainly before it grew to other architectures, somebody should 

>> > have thought for more than 2 minutes about it. Sigh.

>> 

>> WARNING: You are bordering on being objectionable and offensive with

>> that comment.

>> 

>> The ARM interface was designed by me back in the very early days of

>> Linux, probably while you were still in dypers, based on what was

>> known at the time.  Back in the early 2000s, ideas such as relaxed

>> memory ordering were not known.  All there was was one level of

>> harvard cache.


I wasn't talking about memory ordering at all, and I assumed it
came earlier and was brought to Linux for portability reasons -

CONFORMING TO
       Historically, this system call was available on all MIPS UNIX  variants
       including RISC/os, IRIX, Ultrix, NetBSD, OpenBSD, and FreeBSD (and also
       on some non-UNIX MIPS operating systems), so that the existence of this
       call in MIPS operating systems is a de-facto standard.

I don't think the call was totally unreasonable for incoherent virtual
caches or incoherent i/d caches etc. Although the early Unix system
call interface demonstrates that people understood how to define good
interfaces that deal with intent at an abstract level rather than
implementation -- munmap doesn't specify anywhere that a TLB flush
instruction must be executed, for example. So "cacheflush" was
obviously never a well-designed interface, but rather the typical
hardware-centric hack to get their stuff working (which was fine for
its purpose, I'm sure).

> 

> Sorry, I got that slightly wrong. Its origins on ARM were from 12 Dec

> 1998:

> 

> http://www.armlinux.org.uk/developer/patches/viewpatch.php?id=88/1

> 

> by Philip Blundell, and first appeared in the ARM

> pre-patch-2.1.131-19981214-1.gz. It was subsequently documented in the

> kernel sources by me in July 2001 in ARM patch-2.4.6-rmk2.gz. It has

> a slightly different signature: the third argument on ARM is a flags

> argument, whereas the MIPS code, it is some undocumented "cache"

> parameter.

> 

> Whether the ARM version pre or post dates the MIPS code, I couldn't say.

> Whether it was ultimately taken from the MIPS implementation, again, I

> couldn't say.


I can, it was in MIPS in late 1.3 kernels at least. I would guess it
came from IRIX.

> However, please stop insulting your fellow developers ability to think.


Sorry, I was being melodramatic. Everyone makes mistakes or decisions
which, with hindsight, could have been better or were made under some
constraint that isn't apparent. I shouldn't have suggested it indicated
thoughtlessness.

Thanks,
Nick
David Laight Jan. 1, 2021, 6:33 p.m. UTC | #28
From: Andy Lutomirski

> Sent: 29 December 2020 00:36

...
> I mean that the mapping from the name "sync_core" to its semantics is

> x86 only.  The string "sync_core" appears in the kernel only in

> arch/x86, membarrier code, membarrier docs, and a single SGI driver

> that is x86-only.  Sure, the idea of serializing things is fairly

> generic, but exactly what operations serialize what, when things need

> serialization, etc is quite architecture specific.

> 

> Heck, on 486 you serialize the instruction stream with JMP.


Did the 486 even have a memory cache?
Never mind separate I&D caches.
Without branch prediction or an I$, a jmp is enough.
No idea how the dual 486 box we had actually behaved.

For non-SMP the x86 cpus tend to still be compatible with
the original 8086 - so are pretty much fully coherent.
ISTR the memory writes will invalidate I$ lines.

But there was some hardware compatibility that meant a load
of Pentium-75 systems were 'scavenged' from development for
a customer - we got faster P-266 boxes as replacements.

OTOH we never did work out how to do the required 'barrier'
when switching a Via C3 to and from 16-bit mode.
Sometimes it worked, other times the cpu went AWOL.
Best guess was that it sometimes executed pre-decoded
instructions for the wrong mode when returning from the
function call that flipped modes.

Then there is the P-Pro era Intel doc that says that IOR/IOW
aren't sequenced wrt memory accesses.
Fortunately all x86 processors have sequenced them.
Which is what the current docs say.

	David

Will Deacon Jan. 5, 2021, 1:26 p.m. UTC | #29
Hi Andy,

Sorry for the slow reply, I was socially distanced from my keyboard.

On Mon, Dec 28, 2020 at 04:36:11PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 28, 2020 at 4:11 PM Nicholas Piggin <npiggin@gmail.com> wrote:

> > > +static inline void membarrier_sync_core_before_usermode(void)

> > > +{

> > > +     /*

> > > +      * XXX: I know basically nothing about powerpc cache management.

> > > +      * Is this correct?

> > > +      */

> > > +     isync();

> >

> > This is not about memory ordering or cache management, it's about

> > pipeline management. Powerpc's return to user mode serializes the

> > CPU (aka the hardware thread, _not_ the core; another wrongness of

> > the name, but AFAIKS the HW thread is what is required for

> > membarrier). So this is wrong, powerpc needs nothing here.

> 

> Fair enough.  I'm happy to defer to you on the powerpc details.  In

> any case, this just illustrates that we need feedback from a person

> who knows more about ARM64 than I do.


I think we're in a very similar boat to PowerPC, fwiw. Roughly speaking:

  1. SYNC_CORE does _not_ perform any cache management; that is the
     responsibility of userspace, either by executing the relevant
     maintenance instructions (arm64) or a system call (arm32). Crucially,
     the hardware will ensure that this cache maintenance is broadcast
     to all other CPUs.

  2. Even with all the cache maintenance in the world, a CPU could have
     speculatively fetched stale instructions into its "pipeline" ahead of
     time, and these are _not_ flushed by the broadcast maintenance instructions
     in (1). SYNC_CORE provides a means for userspace to discard these stale
     instructions.

  3. The context synchronization event on exception entry/exit is
     sufficient here. The Arm ARM isn't very good at describing what it
     does, because it's in denial about the existence of a pipeline, but
     it does have snippets such as:

	(s/PE/CPU/)
       | For all types of memory:
       | The PE might have fetched the instructions from memory at any time
       | since the last Context synchronization event on that PE.

     Interestingly, the architecture recently added a control bit to remove
     this synchronisation from exception return, so if we set that then we'd
     have a problem with SYNC_CORE and adding an ISB would be necessary (and
     we could probable then make kernel->kernel returns cheaper, but I
     suspect we're relying on this implicit synchronisation in other places
     too).

Are you seeing a problem in practice, or did this come up while trying to
decipher the semantics of SYNC_CORE?

Will
Andy Lutomirski Jan. 5, 2021, 4:20 p.m. UTC | #30
> On Jan 5, 2021, at 5:26 AM, Will Deacon <will@kernel.org> wrote:

> 

> Hi Andy,

> 

> Sorry for the slow reply, I was socially distanced from my keyboard.

> 

>> On Mon, Dec 28, 2020 at 04:36:11PM -0800, Andy Lutomirski wrote:

>> On Mon, Dec 28, 2020 at 4:11 PM Nicholas Piggin <npiggin@gmail.com> wrote:

>>>> +static inline void membarrier_sync_core_before_usermode(void)

>>>> +{

>>>> +     /*

>>>> +      * XXX: I know basically nothing about powerpc cache management.

>>>> +      * Is this correct?

>>>> +      */

>>>> +     isync();

>>> 

>>> This is not about memory ordering or cache management, it's about

>>> pipeline management. Powerpc's return to user mode serializes the

>>> CPU (aka the hardware thread, _not_ the core; another wrongness of

>>> the name, but AFAIKS the HW thread is what is required for

>>> membarrier). So this is wrong, powerpc needs nothing here.

>> 

>> Fair enough.  I'm happy to defer to you on the powerpc details.  In

>> any case, this just illustrates that we need feedback from a person

>> who knows more about ARM64 than I do.

> 

> I think we're in a very similar boat to PowerPC, fwiw. Roughly speaking:

> 

>  1. SYNC_CORE does _not_ perform any cache management; that is the

>     responsibility of userspace, either by executing the relevant

>     maintenance instructions (arm64) or a system call (arm32). Crucially,

>     the hardware will ensure that this cache maintenance is broadcast

>     to all other CPUs.


Is this guaranteed regardless of any aliases?  That is, if I flush from one CPU at one VA and then execute the same physical address from another CPU at a different VA, does this still work?

> 

>  2. Even with all the cache maintenance in the world, a CPU could have

>     speculatively fetched stale instructions into its "pipeline" ahead of

>     time, and these are _not_ flushed by the broadcast maintenance instructions

>     in (1). SYNC_CORE provides a means for userspace to discard these stale

>     instructions.

> 

>  3. The context synchronization event on exception entry/exit is

>     sufficient here. The Arm ARM isn't very good at describing what it

>     does, because it's in denial about the existence of a pipeline, but

>     it does have snippets such as:

> 

>    (s/PE/CPU/)

>       | For all types of memory:

>       | The PE might have fetched the instructions from memory at any time

>       | since the last Context synchronization event on that PE.

> 

>     Interestingly, the architecture recently added a control bit to remove

>     this synchronisation from exception return, so if we set that then we'd

>     have a problem with SYNC_CORE and adding an ISB would be necessary (and

>     we could probable then make kernel->kernel returns cheaper, but I

>     suspect we're relying on this implicit synchronisation in other places

>     too).

> 


Is ISB just a context synchronization event or does it do more?

On x86, it’s very hard to tell that MFENCE does any more than LOCK, but it’s much slower.  And we have LFENCE, which, as documented, doesn’t appear to have any semantics at all.  (Or at least it didn’t before Spectre.)

> Are you seeing a problem in practice, or did this come up while trying to

> decipher the semantics of SYNC_CORE?


It came up while trying to understand the code and work through various bugs in it.  The code was written using something approximating x86 terminology, but it was definitely wrong on x86 (at least if you believe the SDM, and I haven’t convinced any architects to say otherwise).

Thanks!

> 

> Will
Peter Zijlstra Jan. 5, 2021, 4:37 p.m. UTC | #31
On Tue, Jan 05, 2021 at 08:20:51AM -0800, Andy Lutomirski wrote:
> >     Interestingly, the architecture recently added a control bit to remove

> >     this synchronisation from exception return, so if we set that then we'd

> >     have a problem with SYNC_CORE and adding an ISB would be necessary (and

> >     we could probable then make kernel->kernel returns cheaper, but I

> >     suspect we're relying on this implicit synchronisation in other places

> >     too).

> > 

> 

> Is ISB just a context synchronization event or does it do more?


IIRC it is just the instruction sync (like the powerpc isync).

> On x86, it’s very hard to tell that MFENCE does any more than LOCK,

> but it’s much slower.  And we have LFENCE, which, as documented,

> doesn’t appear to have any semantics at all.  (Or at least it didn’t

> before Spectre.)


AFAIU MFENCE is a completion barrier, while LOCK prefix is not. A bit
like ARM's DSB vs DMB.

It is for this reason that mb() is still MFENCE, while our smp_mb() is a
LOCK prefixed NO-OP.

And yes, LFENCE used to be poorly defined and it was sometimes
understood to be a completion barrier relative to prior LOADs, while it
is now a completion barrier for any prior instruction, and really should
be renamed to IFENCE.
Will Deacon Jan. 5, 2021, 10:41 p.m. UTC | #32
On Tue, Jan 05, 2021 at 08:20:51AM -0800, Andy Lutomirski wrote:
> > On Jan 5, 2021, at 5:26 AM, Will Deacon <will@kernel.org> wrote:

> > Sorry for the slow reply, I was socially distanced from my keyboard.

> > 

> >> On Mon, Dec 28, 2020 at 04:36:11PM -0800, Andy Lutomirski wrote:

> >> On Mon, Dec 28, 2020 at 4:11 PM Nicholas Piggin <npiggin@gmail.com> wrote:

> >>>> +static inline void membarrier_sync_core_before_usermode(void)
> >>>> +{
> >>>> +     /*
> >>>> +      * XXX: I know basically nothing about powerpc cache management.
> >>>> +      * Is this correct?
> >>>> +      */
> >>>> +     isync();
> >>> 
> >>> This is not about memory ordering or cache management, it's about
> >>> pipeline management. Powerpc's return to user mode serializes the
> >>> CPU (aka the hardware thread, _not_ the core; another wrongness of
> >>> the name, but AFAICS the HW thread is what is required for
> >>> membarrier). So this is wrong, powerpc needs nothing here.
> >> 
> >> Fair enough.  I'm happy to defer to you on the powerpc details.  In
> >> any case, this just illustrates that we need feedback from a person
> >> who knows more about ARM64 than I do.

> > 
> > I think we're in a very similar boat to PowerPC, fwiw. Roughly speaking:
> > 
> >  1. SYNC_CORE does _not_ perform any cache management; that is the
> >     responsibility of userspace, either by executing the relevant
> >     maintenance instructions (arm64) or a system call (arm32). Crucially,
> >     the hardware will ensure that this cache maintenance is broadcast
> >     to all other CPUs.
> 
> Is this guaranteed regardless of any aliases?  That is, if I flush from
> one CPU at one VA and then execute the same physical address from another
> CPU at a different VA, does this still work?

The data side will be fine, but the instruction side can have virtual
aliases. We handle this in flush_ptrace_access() by blowing away the whole
I-cache if we're not physically-indexed, but userspace would be in trouble
if it wanted to handle this situation alone.

> >  2. Even with all the cache maintenance in the world, a CPU could have
> >     speculatively fetched stale instructions into its "pipeline" ahead of
> >     time, and these are _not_ flushed by the broadcast maintenance instructions
> >     in (1). SYNC_CORE provides a means for userspace to discard these stale
> >     instructions.
> > 
> >  3. The context synchronization event on exception entry/exit is
> >     sufficient here. The Arm ARM isn't very good at describing what it
> >     does, because it's in denial about the existence of a pipeline, but
> >     it does have snippets such as:
> > 
> >    (s/PE/CPU/)
> >       | For all types of memory:
> >       | The PE might have fetched the instructions from memory at any time
> >       | since the last Context synchronization event on that PE.
> > 
> >     Interestingly, the architecture recently added a control bit to remove
> >     this synchronisation from exception return, so if we set that then we'd
> >     have a problem with SYNC_CORE and adding an ISB would be necessary (and
> >     we could probably then make kernel->kernel returns cheaper, but I
> >     suspect we're relying on this implicit synchronisation in other places
> >     too).
> > 
> 
> Is ISB just a context synchronization event or does it do more?

That's a good question. Barrier instructions on ARM do tend to get
overloaded with extra behaviours over time, so it could certainly end up
doing the context synchronization event + extra stuff in future. Right now,
the only thing that springs to mind is the spectre-v1 heavy mitigation
barrier of 'DSB; ISB' which, for example, probably doesn't work for 'DSB;
ERET' because the ERET can be treated like a conditional (!) branch.

> On x86, it’s very hard to tell that MFENCE does any more than LOCK, but
> it’s much slower.  And we have LFENCE, which, as documented, doesn’t
> appear to have any semantics at all.  (Or at least it didn’t before
> Spectre.)

I tend to think of ISB as a front-end barrier relating to instruction fetch
whereas DMB, acquire/release and DSB are all back-end barriers relating to
memory accesses. You _can_ use ISB in conjunction with control dependencies
to order a pair of loads (like you can with ISYNC on Power), but it's a
really expensive way to do it.

> > Are you seeing a problem in practice, or did this come up while trying to
> > decipher the semantics of SYNC_CORE?
> 
> It came up while trying to understand the code and work through various
> bugs in it.  The code was written using something approximating x86
> terminology, but it was definitely wrong on x86 (at least if you believe
> the SDM, and I haven’t convinced any architects to say otherwise).

Ok, thanks.

Will
Patch

diff --git a/arch/arm64/include/asm/sync_core.h b/arch/arm64/include/asm/sync_core.h
new file mode 100644
index 000000000000..5be4531caabd
--- /dev/null
+++ b/arch/arm64/include/asm/sync_core.h
@@ -0,0 +1,21 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_ARM64_SYNC_CORE_H
+#define _ASM_ARM64_SYNC_CORE_H
+
+#include <asm/barrier.h>
+
+/*
+ * Ensure that the CPU notices any instruction changes before the next time
+ * it returns to usermode.
+ */
+static inline void membarrier_sync_core_before_usermode(void)
+{
+	/*
+	 * XXX: is this enough or do we need a DMB first to make sure that
+	 * writes from other CPUs become visible to this CPU?  We have an
+	 * smp_mb() already, but that's not quite the same thing.
+	 */
+	isb();
+}
+
+#endif /* _ASM_ARM64_SYNC_CORE_H */
diff --git a/arch/powerpc/include/asm/sync_core.h b/arch/powerpc/include/asm/sync_core.h
new file mode 100644
index 000000000000..71dfbe7794e5
--- /dev/null
+++ b/arch/powerpc/include/asm/sync_core.h
@@ -0,0 +1,20 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_SYNC_CORE_H
+#define _ASM_POWERPC_SYNC_CORE_H
+
+#include <asm/barrier.h>
+
+/*
+ * Ensure that the CPU notices any instruction changes before the next time
+ * it returns to usermode.
+ */
+static inline void membarrier_sync_core_before_usermode(void)
+{
+	/*
+	 * XXX: I know basically nothing about powerpc cache management.
+	 * Is this correct?
+	 */
+	isync();
+}
+
+#endif /* _ASM_POWERPC_SYNC_CORE_H */
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b5137cc5b7b4..895f70fd4a61 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -81,7 +81,6 @@  config X86
 	select ARCH_HAS_SET_DIRECT_MAP
 	select ARCH_HAS_STRICT_KERNEL_RWX
 	select ARCH_HAS_STRICT_MODULE_RWX
-	select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
 	select ARCH_HAS_SYSCALL_WRAPPER
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
 	select ARCH_HAS_DEBUG_WX
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index ab7382f92aff..c665b453969a 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -89,11 +89,10 @@  static inline void sync_core(void)
 }
 
 /*
- * Ensure that a core serializing instruction is issued before returning
- * to user-mode. x86 implements return to user-space through sysexit,
- * sysrel, and sysretq, which are not core serializing.
+ * Ensure that the CPU notices any instruction changes before the next time
+ * it returns to usermode.
  */
-static inline void sync_core_before_usermode(void)
+static inline void membarrier_sync_core_before_usermode(void)
 {
 	/* With PTI, we unconditionally serialize before running user code. */
 	if (static_cpu_has(X86_FEATURE_PTI))
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 48640db6ca86..81ba47910a73 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -7,7 +7,6 @@ 
 #include <linux/sched.h>
 #include <linux/mm_types.h>
 #include <linux/gfp.h>
-#include <linux/sync_core.h>
 
 /*
  * Routines for handling mm_structs
diff --git a/include/linux/sync_core.h b/include/linux/sync_core.h
deleted file mode 100644
index 013da4b8b327..000000000000
--- a/include/linux/sync_core.h
+++ /dev/null
@@ -1,21 +0,0 @@ 
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_SYNC_CORE_H
-#define _LINUX_SYNC_CORE_H
-
-#ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
-#include <asm/sync_core.h>
-#else
-/*
- * This is a dummy sync_core_before_usermode() implementation that can be used
- * on all architectures which return to user-space through core serializing
- * instructions.
- * If your architecture returns to user-space through non-core-serializing
- * instructions, you need to write your own functions.
- */
-static inline void sync_core_before_usermode(void)
-{
-}
-#endif
-
-#endif /* _LINUX_SYNC_CORE_H */
-
diff --git a/init/Kconfig b/init/Kconfig
index c9446911cf41..eb9772078cd4 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2334,9 +2334,6 @@  source "kernel/Kconfig.locks"
 config ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	bool
 
-config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
-	bool
-
 # It may be useful for an architecture to override the definitions of the
 # SYSCALL_DEFINE() and __SYSCALL_DEFINEx() macros in <linux/syscalls.h>
 # and the COMPAT_ variants in <linux/compat.h>, in particular to use a
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index b3a82d7635da..db4945e1ec94 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -5,6 +5,9 @@ 
  * membarrier system call
  */
 #include "sched.h"
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
+#include <asm/sync_core.h>
+#endif
 
 /*
  * The basic principle behind the regular memory barrier mode of membarrier()
@@ -221,6 +224,7 @@  static void ipi_mb(void *info)
 	smp_mb();	/* IPIs should be serializing but paranoid. */
 }
 
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
 static void ipi_sync_core(void *info)
 {
 	/*
@@ -230,13 +234,14 @@  static void ipi_sync_core(void *info)
 	 * the big comment at the top of this file.
 	 *
 	 * A sync_core() would provide this guarantee, but
-	 * sync_core_before_usermode() might end up being deferred until
-	 * after membarrier()'s smp_mb().
+	 * membarrier_sync_core_before_usermode() might end up being deferred
+	 * until after membarrier()'s smp_mb().
 	 */
 	smp_mb();	/* IPIs should be serializing but paranoid. */
 
-	sync_core_before_usermode();
+	membarrier_sync_core_before_usermode();
 }
+#endif
 
 static void ipi_rseq(void *info)
 {
@@ -368,12 +373,14 @@  static int membarrier_private_expedited(int flags, int cpu_id)
 	smp_call_func_t ipi_func = ipi_mb;
 
 	if (flags == MEMBARRIER_FLAG_SYNC_CORE) {
-		if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
+#ifndef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
 			return -EINVAL;
+#else
 		if (!(atomic_read(&mm->membarrier_state) &
 		      MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
 			return -EPERM;
 		ipi_func = ipi_sync_core;
+#endif
 	} else if (flags == MEMBARRIER_FLAG_RSEQ) {
 		if (!IS_ENABLED(CONFIG_RSEQ))
 			return -EINVAL;