mbox series

[RFC,00/12] Introduce cmpxchg128() -- aka. the demise of cmpxchg_double()

Message ID 20221219153525.632521981@infradead.org
Headers show
Series Introduce cmpxchg128() -- aka. the demise of cmpxchg_double() | expand

Message

Peter Zijlstra Dec. 19, 2022, 3:35 p.m. UTC
Hi,

Since Linus hated on cmpxchg_double(), a few patches to get rid of it, as proposed here:

  https://lkml.kernel.org/r/Y2U3WdU61FvYlpUh@hirez.programming.kicks-ass.net

based on tip/master because Linus' tree is moving a wee bit fast at the moment.

0day robot is all green for building, very limited testing on arm64/s390
for obvious raisins -- I tried to get the asm right, but please, double
check.

Comments

Boqun Feng Dec. 22, 2022, 1:21 a.m. UTC | #1
Hi Peter,

On Mon, Dec 19, 2022 at 04:35:25PM +0100, Peter Zijlstra wrote:
> Hi,
> 
> Since Linus hated on cmpxchg_double(), a few patches to get rid of it, as proposed here:
> 
>   https://lkml.kernel.org/r/Y2U3WdU61FvYlpUh@hirez.programming.kicks-ass.net
> 
> based on tip/master because Linus' tree is moving a wee bit fast at the moment.
> 
> 0day robot is all green for building, very limited testing on arm64/s390
> for obvious raisins -- I tried to get the asm right, but please, double
> check.
> 

I added some test cases for cmpxcgh128 APIs, and found two issues. I
will reply separately in the patches. The test cases themselves are at
the end, let me know if you want to me to send a proper patch.

Regards,
Boqun

------------------------------------------------------------>8
Subject: [PATCH] atomic: Add test cases for cmpxchg128 family

Besides for 32bit and 64bit cmpxchg, we only test via atomic_cmpxchg_*
APIs, add tests via cmpxchg* APIs while we are at it.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 lib/atomic64_test.c | 41 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/lib/atomic64_test.c b/lib/atomic64_test.c
index d9d170238165..7f79d0704ba8 100644
--- a/lib/atomic64_test.c
+++ b/lib/atomic64_test.c
@@ -76,12 +76,19 @@ do {								\
 	BUG_ON(atomic##bit##_read(&v) != expect);		\
 } while (0)
 
+#define TEST_ARGS_PLAIN(_, op, init, ret, expect, args...)	\
+do {								\
+	__WRITE_ONCE(n, init);					\
+	BUG_ON(op(&n, ##args) != ret);				\
+	BUG_ON(__READ_ONCE(n) != expect);			\
+} while (0)
+
 #define XCHG_FAMILY_TEST(bit, init, new)				\
 do {									\
 	FAMILY_TEST(TEST_ARGS, bit, xchg, init, init, new, new);	\
 } while (0)
 
-#define CMPXCHG_FAMILY_TEST(bit, init, new, wrong)			\
+#define ATOMIC_CMPXCHG_FAMILY_TEST(bit, init, new, wrong)		\
 do {									\
 	FAMILY_TEST(TEST_ARGS, bit, cmpxchg, 				\
 			init, init, new, init, new);			\
@@ -89,6 +96,14 @@ do {									\
 			init, init, init, wrong, new);			\
 } while (0)
 
+#define CMPXCHG_FAMILY_TEST(bit, init, new, wrong)			\
+do {									\
+	FAMILY_TEST(TEST_ARGS_PLAIN, _, cmpxchg##bit, 			\
+			init, init, new, init, new);			\
+	FAMILY_TEST(TEST_ARGS_PLAIN, _, cmpxchg##bit,			\
+			init, init, init, wrong, new);			\
+} while (0)
+
 #define INC_RETURN_FAMILY_TEST(bit, i)			\
 do {							\
 	FAMILY_TEST(TEST_ARGS, bit, inc_return,		\
@@ -109,6 +124,7 @@ static __init void test_atomic(void)
 	int one = 1;
 
 	atomic_t v;
+	int n;
 	int r;
 
 	TEST(, add, +=, onestwos);
@@ -139,6 +155,7 @@ static __init void test_atomic(void)
 	DEC_RETURN_FAMILY_TEST(, v0);
 
 	XCHG_FAMILY_TEST(, v0, v1);
+	ATOMIC_CMPXCHG_FAMILY_TEST(, v0, v1, onestwos);
 	CMPXCHG_FAMILY_TEST(, v0, v1, onestwos);
 
 }
@@ -155,6 +172,7 @@ static __init void test_atomic64(void)
 	int r_int;
 
 	atomic64_t v = ATOMIC64_INIT(v0);
+	long long n = 0;
 	long long r = v0;
 	BUG_ON(v.counter != r);
 
@@ -201,6 +219,7 @@ static __init void test_atomic64(void)
 	DEC_RETURN_FAMILY_TEST(64, v0);
 
 	XCHG_FAMILY_TEST(64, v0, v1);
+	ATOMIC_CMPXCHG_FAMILY_TEST(64, v0, v1, v2);
 	CMPXCHG_FAMILY_TEST(64, v0, v1, v2);
 
 	INIT(v0);
@@ -245,10 +264,30 @@ static __init void test_atomic64(void)
 	BUG_ON(!r_int);
 }
 
+#ifdef system_has_cmpxchg128
+static __init void test_atomic128(void)
+{
+	long long v0 = 0xaaa31337c001d00dLL;
+	long long v1 = 0xdeadbeefdeafcafeLL;
+	long long v2 = 0xfaceabadf00df001LL;
+	long long v3 = 0x8000000000000000LL;
+
+	s128 init = ((s128)v0 << 64) + v1;
+	s128 new = ((s128)v1 << 64) + v0;
+	s128 wrong = ((s128)v2 << 64) + v3;
+	s128 n = 1;
+
+	CMPXCHG_FAMILY_TEST(128, init, new, wrong);
+}
+#else
+static __init void test_atomic128(void) {}
+#endif
+
 static __init int test_atomics_init(void)
 {
 	test_atomic();
 	test_atomic64();
+	test_atomic128();
 
 #ifdef CONFIG_X86
 	pr_info("passed for %s platform %s CX8 and %s SSE\n",
Pavel Machek Dec. 29, 2022, 8:30 a.m. UTC | #2
Hi!

> Introduce [us]128 (when available). Unlike [us]64, ensure they are
> always naturally aligned.
> 
> This also enables 128bit wide atomics (which require natural
> alignment) such as cmpxchg128().
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/types.h      |    5 +++++
>  include/uapi/linux/types.h |    4 ++++
>  2 files changed, 9 insertions(+)
> 
> --- a/include/linux/types.h
> +++ b/include/linux/types.h
> @@ -10,6 +10,11 @@
>  #define DECLARE_BITMAP(name,bits) \
>  	unsigned long name[BITS_TO_LONGS(bits)]
>  
> +#ifdef __SIZEOF_INT128__
> +typedef __s128 s128;
> +typedef __u128 u128;
> +#endif

Should this come as a note here?

> Introduce [us]128 (when available). Unlike [us]64, ensure they are
> always naturally aligned.

BR,
							Pavel
Arnd Bergmann Dec. 29, 2022, 1:36 p.m. UTC | #3
On Mon, Dec 19, 2022, at 16:35, Peter Zijlstra wrote:
> In order to replace cmpxchg_double() with the newly minted
> cmpxchg128() family of functions, wire it up in this_cpu_cmpxchg().
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Does this work on x86 chips without X86_FEATURE_CX16?

As far as I can tell, the new percpu_cmpxchg128_op uses
the cmpxchg16b instruction unconditionally without checking
for the feature bit first, and is now used by this_cpu_cmpxchg()
unconditionally as well.

This is fine for the moment if the only user is mm/slub.c
and that retains the system_has_cmpxchg128() runtime check,
but I think a better interface would be to guarantee that
this_cpu_cmpxchg() always ends up either in a working
inline asm or the generic fallback but not an invalid
opcode.

For consistency, I would also suggest this_cpu_cmpxchg() to
take the same argument types as cmpxchg(): at most 'unsigned
long' sized, with additional this_cpu_cmpxchg64() and
this_cpu_cmpxchg128() macros that take fixed-size arguments.
I have an older patch set that additionally converts all
8-bit and 16-bit cmpxchg()/xchg() calls to cmpxchg_8()/
xchg_8()/cmpxchg_16()/xchg_16() and and leaves only the
fixed-32bit and variable typed 'unsigned long' sized
callers for the weakly typed variant.

       Arnd
Peter Zijlstra Jan. 9, 2023, 4:29 p.m. UTC | #4
On Wed, Jan 04, 2023 at 01:09:16PM +0100, Heiko Carstens wrote:
> On Mon, Dec 19, 2022 at 04:35:32PM +0100, Peter Zijlstra wrote:
> > In order to replace cmpxchg_double() with the newly minted
> > cmpxchg128() family of functions, wire it up in this_cpu_cmpxchg().
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ...
> > --- a/arch/s390/include/asm/percpu.h
> > +++ b/arch/s390/include/asm/percpu.h
> > +#define this_cpu_cmpxchg_16(pcp, oval, nval)				\
> > +({									\
> > +	u128 old__ = __pcpu_cast_128((nval), (nval));			\
> > +	u128 new__ = __pcpu_cast_128((oval), (oval));			\
> 
> spot the bug... please merge the below into this patch.

D'oh, luckily I got a fresh pile of brown paper bags for xmas.
Heiko Carstens Jan. 10, 2023, 7:23 a.m. UTC | #5
On Mon, Dec 19, 2022 at 04:35:33PM +0100, Peter Zijlstra wrote:
> In order to depricate cmpxchg_double(), replace all its usage with
> cmpxchg128().
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/s390/include/asm/cpu_mf.h  |   29 ++++++++++++-----
>  arch/s390/kernel/perf_cpum_sf.c |   65 +++++++++++++++++++++++++---------------
>  2 files changed, 63 insertions(+), 31 deletions(-)

So, Alexander Gordeev reported that this code was already prior to your
changes potentially broken with respect to missing READ_ONCE() within the
cmpxchg_double() loops.

In order to fix that and have a patch that can be backported I would go
with something like the patch below, which I would also plan to send for
-rc4, unless there are objections.

This can then easily be converted to the new cmpxchg128() later.
Peter Zijlstra Jan. 10, 2023, 8:32 a.m. UTC | #6
On Tue, Jan 10, 2023 at 08:23:05AM +0100, Heiko Carstens wrote:

> So, Alexander Gordeev reported that this code was already prior to your
> changes potentially broken with respect to missing READ_ONCE() within the
> cmpxchg_double() loops.

Unless there's an early exit, that shouldn't matter. If you managed to
read garbage the cmpxchg itself will simply fail and the loop retries.

> @@ -1294,12 +1306,16 @@ static void hw_perf_event_update(struct perf_event *event, int flush_all)
>  		num_sdb++;
>  
>  		/* Reset trailer (using compare-double-and-swap) */
> +		/* READ_ONCE() 16 byte header */
> +		prev.val = __cdsg(&te->header.val, 0, 0);
>  		do {
> +			old.val = prev.val;
> +			new.val = prev.val;
> +			new.f = 0;
> +			new.a = 1;
> +			new.overflow = 0;
> +			prev.val = __cdsg(&te->header.val, old.val, new.val);
> +		} while (prev.val != old.val);

So this, and

> +		/* READ_ONCE() 16 byte header */
> +		prev.val = __cdsg(&te->header.val, 0, 0);
>  		do {
> +			old.val = prev.val;
> +			new.val = prev.val;
> +			orig_overflow = old.overflow;
> +			new.f = 0;
> +			new.overflow = 0;
>  			if (idx == aux->alert_mark)
> +				new.a = 1;
>  			else
> +				new.a = 0;
> +			prev.val = __cdsg(&te->header.val, old.val, new.val);
> +		} while (prev.val != old.val);

this case are just silly and expensive. If that initial read is split
and manages to read gibberish the cmpxchg will fail and we retry anyway.

> +	/* READ_ONCE() 16 byte header */
> +	prev.val = __cdsg(&te->header.val, 0, 0);
>  	do {
> +		old.val = prev.val;
> +		new.val = prev.val;
> +		*overflow = old.overflow;
> +		if (old.f) {
>  			/*
>  			 * SDB is already set by hardware.
>  			 * Abort and try to set somewhere
> @@ -1490,10 +1509,10 @@ static bool aux_set_alert(struct aux_buffer *aux, unsigned long alert_index,
>  			 */
>  			return false;
>  		}
> +		new.a = 1;
> +		new.overflow = 0;
> +		prev.val = __cdsg(&te->header.val, old.val, new.val);
> +	} while (prev.val != old.val);


And while this case has an early exit, it only cares about a single bit
(although you made it a full word) and so also shouldn't care. If
aux_reset_buffer() returns false, @overflow isn't consumed.


So I really don't see the point of this patch.
Mark Rutland Jan. 10, 2023, 11:27 a.m. UTC | #7
On Tue, Jan 10, 2023 at 09:32:55AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 10, 2023 at 08:23:05AM +0100, Heiko Carstens wrote:
> 
> > So, Alexander Gordeev reported that this code was already prior to your
> > changes potentially broken with respect to missing READ_ONCE() within the
> > cmpxchg_double() loops.
> 
> Unless there's an early exit, that shouldn't matter. If you managed to
> read garbage the cmpxchg itself will simply fail and the loop retries.

I don't think that's true; without READ_ONCE() the compiler could (but is
very unlikely to) read multiple times, and that could cause problems.

For example:

| 	prev = *ptr;
| 
| 	do {
| 		new = some_function_of(prev);
| 		old = cmpxchg(ptr, prev, new);
| 	} while (old != prev);

Could effectively become:

| 	prev1 = *ptr;
|	prev2 = *ptr;
|
| 	do {
| 		new = some_function_of(prev1)
| 		old = cmpxchg(ptr, prev2, new);
| 	} while (old != prev2);

... which would effectively udpate from a stale value, throwing away prev2.
That and the two generated reads could be in either order.

So I do think it's warranted to use READ_ONCE() for the prev value feeding into
a cmpxchg operation, even if that's only for the "once" part rather than lack
of tearing.

Thanks,
Mark.
Heiko Carstens Jan. 10, 2023, 11:46 a.m. UTC | #8
On Tue, Jan 10, 2023 at 09:32:55AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 10, 2023 at 08:23:05AM +0100, Heiko Carstens wrote:
> > So, Alexander Gordeev reported that this code was already prior to your
> > changes potentially broken with respect to missing READ_ONCE() within the
> > cmpxchg_double() loops.
> 
> Unless there's an early exit, that shouldn't matter. If you managed to
> read garbage the cmpxchg itself will simply fail and the loop retries.
> 
> > @@ -1294,12 +1306,16 @@ static void hw_perf_event_update(struct perf_event *event, int flush_all)
> >  		num_sdb++;
> >  
> >  		/* Reset trailer (using compare-double-and-swap) */
> > +		/* READ_ONCE() 16 byte header */
> > +		prev.val = __cdsg(&te->header.val, 0, 0);
> >  		do {
> > +			old.val = prev.val;
> > +			new.val = prev.val;
> > +			new.f = 0;
> > +			new.a = 1;
> > +			new.overflow = 0;
> > +			prev.val = __cdsg(&te->header.val, old.val, new.val);
> > +		} while (prev.val != old.val);
> 
> So this, and
...
> this case are just silly and expensive. If that initial read is split
> and manages to read gibberish the cmpxchg will fail and we retry anyway.

While I do agree that there is no need to necessarily read the whole 16
bytes atomically in advance here, there is still the problem about the
missing initial READ_ONCE() in the original code.
As I tried to outline here:

    For example:
    
            /* Reset trailer (using compare-double-and-swap) */
            do {
                    te_flags = te->flags & ~SDB_TE_BUFFER_FULL_MASK;
                    te_flags |= SDB_TE_ALERT_REQ_MASK;
            } while (!cmpxchg_double(&te->flags, &te->overflow,
                     te->flags, te->overflow,
                     te_flags, 0ULL));
    
    The compiler could generate code where te->flags used within the
    cmpxchg_double() call may be refetched from memory and which is not
    necessarily identical to the previous read version which was used to
    generate te_flags. Which in turn means that an incorrect update could
    happen.

Is there anything that prevents te->flags from being read several times?

> > +	/* READ_ONCE() 16 byte header */
> > +	prev.val = __cdsg(&te->header.val, 0, 0);
> >  	do {
> > +		old.val = prev.val;
> > +		new.val = prev.val;
> > +		*overflow = old.overflow;
> > +		if (old.f) {
> >  			/*
> >  			 * SDB is already set by hardware.
> >  			 * Abort and try to set somewhere
> > @@ -1490,10 +1509,10 @@ static bool aux_set_alert(struct aux_buffer *aux, unsigned long alert_index,
> >  			 */
> >  			return false;
> >  		}
> > +		new.a = 1;
> > +		new.overflow = 0;
> > +		prev.val = __cdsg(&te->header.val, old.val, new.val);
> > +	} while (prev.val != old.val);
> 
> And while this case has an early exit, it only cares about a single bit
> (although you made it a full word) and so also shouldn't care. If
> aux_reset_buffer() returns false, @overflow isn't consumed.

Yes, except that it is anything but obvious that @overflow isn't consumed.

> So I really don't see the point of this patch.

As stated above: READ_ONCE() is missing. And while at it I wanted to have a
consistent complete previous value - also considering that cdsg is not very
expensive.
And while it also reuse the returned values from cdsg, instead of throwing
them away and reading from memory again in a splitted and potentially
inconsistent way.
Alexander Gordeev Jan. 12, 2023, 11:12 a.m. UTC | #9
On Tue, Jan 10, 2023 at 12:46:44PM +0100, Heiko Carstens wrote:
> > > +	/* READ_ONCE() 16 byte header */
> > > +	prev.val = __cdsg(&te->header.val, 0, 0);
> > >  	do {
> > > +		old.val = prev.val;
> > > +		new.val = prev.val;
> > > +		*overflow = old.overflow;

I guess, it would also make sense to place write to overflow 
after the while loop. So the output variable left intact in
case the function bailed out. Not sure if it should be part
of this patch though.

> > > +		if (old.f) {
> > >  			/*
> > >  			 * SDB is already set by hardware.
> > >  			 * Abort and try to set somewhere
> > > @@ -1490,10 +1509,10 @@ static bool aux_set_alert(struct aux_buffer *aux, unsigned long alert_index,
> > >  			 */
> > >  			return false;
> > >  		}
> > > +		new.a = 1;
> > > +		new.overflow = 0;
> > > +		prev.val = __cdsg(&te->header.val, old.val, new.val);
> > > +	} while (prev.val != old.val);
> > 
> > And while this case has an early exit, it only cares about a single bit
> > (although you made it a full word) and so also shouldn't care. If
> > aux_reset_buffer() returns false, @overflow isn't consumed.
> 
> Yes, except that it is anything but obvious that @overflow isn't consumed.